git

git is a powerful version control system created by Linus Torvalds, the creator of the Linux kernel. git has a reputation for being complicated and confusing. While there may be some truth to this, git is used extensively in industry, and is relied upon by many of the applications being developed today.

While there are certainly advantages to understanding git in detail, it is unnecessary for many projects. It is possible to get the "gist" of git by learning some basic terminology, workflows, and how git fits into a regular data science or software project. For this book, this is our goal. We want to provide you with the smallest amount of information that allows you to incorporate git into your project, or helps you feel comfortable using git in an already established project. If you want to take a deeper dive into git, check out the resources section below.

git vs. GitHub

While git is often treated as synonymous with GitHub, they are distinct things. git is a piece of software running on your computer, that is, your local system. You can use git without using GitHub (or another host like GitLab). git works like other familiar command-line programs such as grep or sed. As with grep or sed, you can read about its commands and options in the man (short for manual) pages.

man git

Microsoft’s GitHub, however, is a software development and version control platform, hosted online. git is a complicated tool, and platforms like GitHub aim to make using git as pain-free as possible. It is free and easy to create a GitHub account. Competing platforms include GitLab, sourcehut, Gitea, and Bitbucket. Each platform has its advantages and disadvantages, but all largely serve the same purpose.

Setup

  1. Download and install git.

  2. Set up your user name and email.

    git config --global user.name "John Smith"
    git config --global user.email "john@example.com"
  3. Set up your default text editor.

    git config --global core.editor vim

    At this stage, if you were to commit to a project, there would be no way to tell if you are really John Smith. In fact, you could be anybody claiming to be John Smith. In the same way that online document signing applications allow you to verify you are you, you can create a GPG key, upload the public key to GitHub, and automatically sign your commits so that others know they come from you. To do so, continue on.

  4. Install Homebrew.

  5. Install gpg2 by running: brew install gpg2.

  6. Install gpgtools from GPGTools.

  7. Open a terminal and type the following.

    gpg --full-generate-key --expert
  8. Select ECC (sign only) in the first prompt, and Curve 25519 for the second. Choose how many years you’d like your key to be valid for, and enter the information as you are prompted.

    It is recommended to not use a passphrase if you want to have your commits automatically signed when using GitHub Desktop. Otherwise, you will need to run the following in a terminal before you can commit to the project.

    export GPG_TTY=$(tty)
  9. When complete, you can print the public key by running the following.

    gpg --export -a "John Smith"

    Make sure you replace "John Smith" with the user name you provided when creating the key.

  10. Copy the public key to your clipboard, navigate to github.com, and sign in. Click on your profile in the upper right-hand corner of the screen and navigate to Settings. On the left-hand menu, click SSH and GPG keys and then New GPG key. Paste your public key in the provided text area and click Add GPG key.

  11. Lastly, in order to sign commits using the newly created key, open up a text editor and modify $HOME/.gitconfig to use your key.

    [user]
        name = John Smith
        email = john@example.com
        signingkey = ABCDEFGHIJKLMNOP
    [gpg]
        program = /usr/local/bin/gpg  # or another path to the gpg executable
    [commit]
        gpgsign = true

    To get your signing key, run the following.

    gpg --list-secret-keys --keyid-format=long

    Your signing key is the 16-character value following ed25519/ on the sec line.
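To confirm that settings like the ones above took effect, you can read them back with git config --get. The following is a minimal sketch, not part of the setup itself: it writes the same keys to the local config of a throwaway repo (no --global flag) so that it is safe to run without touching your real configuration. The name and email are the placeholder values used above.

```shell
# Run as a script (for example: sh check_config.sh).
set -e

# Throwaway repo, so your real global config is left untouched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Same keys as in the setup steps, stored in this repo's local config.
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign true

# Read the values back.
name=$(git config --get user.name)
email=$(git config --get user.email)
gpgsign=$(git config --get commit.gpgsign)
echo "name=$name email=$email gpgsign=$gpgsign"
```

To inspect your real global settings instead, run git config --global --list.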

Terminology

Repository

You can think of a repository (repo) as a version controlled directory for one or more projects. A repo contains all of the project’s files, code, documentation, etc., along with the project’s entire revision history. When a single repo contains the code and project files for many projects, it can sometimes be referred to as a monorepo. Repos are typically either public or private. Public repos are open to anyone who can access the website. Private repos are only open to those who have been explicitly given permission to access the repo.

It is obvious when looking at a project on GitHub what a repo is, but what about on your own computer? What makes a folder a repo? Where are all of the version control components located? The answer is in the hidden .git folder in your project directory. For example, my_project is a repo, with all of the commits, repo addresses, etc., placed in the .git folder. If you were to remove the .git folder, the my_project directory would no longer be a repository, but rather a normal directory.

Repository example
my_project
├── .git
│   ├── HEAD
│   ├── config
│   ├── description
│   ├── hooks
│   │   └── README.sample
│   ├── info
│   │   └── exclude
│   ├── objects
│   │   ├── info
│   │   └── pack
│   └── refs
│       ├── heads
│       └── tags
├── .gitignore
├── Cargo.toml
├── LICENSE
├── README.md
├── docs
├── scripts
├── src
│   └── main.rs
└── tests

13 directories, 10 files
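The role of the .git folder can be checked directly. The sketch below, which assumes a throwaway directory created with mktemp, asks git whether the directory is a repo before git init, after git init, and again after the .git folder is removed. The is_repo helper is a hypothetical name introduced only for this illustration.

```shell
# Run as a script (for example: sh is_repo.sh).
set -e

dir="$(mktemp -d)/my_project"
mkdir -p "$dir"
cd "$dir"

# Hypothetical helper: prints yes if git considers the current directory
# part of a repository's working tree, and no otherwise.
is_repo() {
    if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

before=$(is_repo)      # plain directory
git init -q            # creates the .git folder
during=$(is_repo)
rm -rf .git            # remove all version control data
after=$(is_repo)

echo "before init: $before"
echo "after init: $during"
echo "after removing .git: $after"
```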

To initialize a new repository from a currently existing project directory, there are a few commands to learn.

cd my_project (1)
git init (2)
git remote add origin git@github.com:exampleuser/my_project.git (3)
git branch -M main (4)
git push -u origin main (5)
1 Navigate to the root of the project directory.
2 Initialize the repository. This is the command that creates the .git directory.
3 Link the local repo (on your computer) to the remote repo (on GitHub). When we run commands like git fetch or git pull, git now knows where to fetch or pull the data from.
4 By default, git names the default branch of a repository master (on GitHub, the default branch of a new repo is named main). git branch -M main moves, or renames, the current branch so that it is named main.
5 Push the main branch to the remote and set its upstream branch. The push only succeeds once the repo contains at least one commit. Once the upstream is set, rather than running git pull origin main every time you want to pull down changes to your local repo, you can just run git pull because git now knows what the upstream branch is. Here is a Stack Overflow post that goes into more detail.
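The whole sequence can be tried without a GitHub account. The sketch below is one way to do that, under a few assumptions: a bare repo on disk stands in for the remote, the directory and file names are placeholders, and a first commit is made before pushing because a branch with no commits cannot be pushed.

```shell
# Run as a script (for example: sh init_demo.sh).
set -e

work=$(mktemp -d)

# Stand-in for the remote: a bare repo on disk instead of GitHub.
git init -q --bare "$work/remote.git"

# An existing project directory with one file in it.
mkdir "$work/my_project"
cd "$work/my_project"
echo "# my_project" > README.md

git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false   # keep the sketch independent of any GPG setup

git remote add origin "$work/remote.git"

# A branch needs at least one commit before it can be pushed.
git add README.md
git commit -q -m "Initial commit."

git branch -M main
git push -q -u origin main

# Ask git which upstream branch main now tracks.
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")
echo "upstream: $upstream"
```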

Clone

Typically heard in reference to "cloning a repo". Cloning a repo is the act of downloading and copying a repository to your local machine, usually from a hosting platform like GitHub.

To clone a GitHub repo, you will need Read access to the repository. If you’ve set up git to use SSH keys, you can clone a repository as follows.

git clone git@github.com:TheDataMine/the-examples-book.git

If you set up git using a credential helper and HTTPS, you can clone a repository as follows.

git clone https://github.com/TheDataMine/the-examples-book.git

Both commands will copy the entirety of the repository (including the .git folder) into a new directory, named after the repository, inside your current working directory.
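Cloning also works from a path on disk, which makes it easy to experiment offline. The following sketch assumes a small source repo built on the spot as a stand-in for a hosted repository. The directory names source and clones are placeholders.

```shell
# Run as a script (for example: sh clone_demo.sh).
set -e

work=$(mktemp -d)

# Build a small repo to clone from (a stand-in for a repo on GitHub).
git init -q "$work/source"
cd "$work/source"
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false
echo "hello" > README.md
git add README.md
git commit -q -m "Add README."

# Clone it from a different working directory.
mkdir "$work/clones"
cd "$work/clones"
git clone -q "$work/source"

# The clone lives in a new directory named after the source repo,
# and it carries the .git folder and the full history with it.
ls -A source
commits=$(git -C source rev-list --count HEAD)
echo "commits in clone: $commits"
```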

Add

New files added to a repo are not automatically tracked. If you modify an untracked file, those changes are not recorded in the .git folder. If you modify a tracked file, git notices the change and marks the file as modified, but the change is not added to the staging area automatically. You need to stage it with git add before it can be committed.

git add adds a file or folder to the staging area, and begins tracking. To add a new file to the staging area, run the following.

git add my_file.txt

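To add everything in the current directory (including its subdirectories) to staging, run the following. When run from the root of the repo, this stages everything in the project.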

git add .

git add respects the .gitignore file in the root of the repo. The .gitignore is a specially named file with a pattern on each line that tells git which files to ignore and not track. A common example of a file that should not be tracked is a .env file with sensitive credentials.
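A quick way to see .gitignore at work is to stage everything in a directory that contains an ignored file. The sketch below assumes a throwaway repo with placeholder files: a .env file with a made-up credential and an app.py file.

```shell
# Run as a script (for example: sh gitignore_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q

# One pattern per line; here the .env file is ignored.
echo ".env" > .gitignore
echo "API_KEY=not-a-real-key" > .env
echo "print('hi')" > app.py

# Stage everything in the current directory.
git add .

# List what was actually staged, and ask git whether .env is ignored.
staged=$(git diff --cached --name-only)
ignored=$(git check-ignore .env)

echo "staged files:"
echo "$staged"
echo "ignored: $ignored"
```

The .gitignore file itself is staged along with app.py, which is usually what you want, since the ignore rules should be shared with everyone working on the repo.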

Commit

A single unit of change, which could be to a single file, or multiple files. Commits allow users to track changes made to the project throughout time. In an ideal world, commits should be accompanied by a succinct message with a description of what changes were made and why.

To commit a change to the local repository, simply modify the file or files and save them to disk as you normally would. If the files are currently being tracked, git will "see" the changes and mark the file(s) as modified. Then, stage the changes with git add (or pass the -a flag to git commit to stage every modified tracked file automatically) and commit them.

git commit -m "My succinct commit message."
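The following sketch walks through the edit, stage, and commit cycle in a throwaway repo. It also shows git commit -a, which stages every modified tracked file before committing. The file name notes.txt and the commit messages are placeholders.

```shell
# Run as a script (for example: sh commit_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false   # keep the sketch independent of any GPG setup

# First commit: a new file must be added before it is tracked.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes."

# Modify the tracked file. git marks it as modified, but it is not staged yet.
echo "second draft" > notes.txt
git status --short

# Stage the change, then commit it.
git add notes.txt
git commit -q -m "Revise notes."

# Alternatively, -a stages all modified tracked files and commits in one step.
echo "third draft" > notes.txt
git commit -q -a -m "Revise notes again."

count=$(git rev-list --count HEAD)
remaining=$(git status --porcelain)
git log --oneline
```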

Diff

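git diff lists the changes you have made. To compare staged changes against the most recent commit, use git diff --staged. To see the changes to tracked files that you have not yet staged, simply run the following.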

git diff
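The difference between git diff and git diff --staged is easiest to see by running both before and after git add. The sketch below assumes a throwaway repo and a placeholder file named notes.txt, and uses --name-only to keep the output short.

```shell
# Run as a script (for example: sh diff_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false

echo "one" > notes.txt
git add notes.txt
git commit -q -m "Add notes."

# Change the tracked file without staging it.
echo "two" >> notes.txt
unstaged_before=$(git diff --name-only)          # working directory vs. staging area
staged_before=$(git diff --staged --name-only)   # staging area vs. last commit

# Stage the change and compare again.
git add notes.txt
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --staged --name-only)

echo "before git add: unstaged=[$unstaged_before] staged=[$staged_before]"
echo "after git add:  unstaged=[$unstaged_after] staged=[$staged_after]"
```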

Pull

git pull "pulls down" the changes made to the remote repo to your local repo. For example, let’s say we have Alice and Bob working on a project together. Alice made a change to the project and updated GitHub with all of the changes she made. Bob wants to update his local repo on his computer to be up-to-date. In order to do so, Bob runs git pull, and assuming Bob hasn’t made any conflicting changes locally, the changes Alice made will get merged into Bob’s local repo.

In order to use git pull, your current working directory should be inside of the local repo.

Push

git push is the symbolic opposite of git pull. git push takes your local commits and updates the remote repo so the rest of the team can work with the latest and greatest.

In order to use git push, your current working directory should be inside of the local repo.
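The Alice and Bob scenario can be simulated on a single machine. This is a sketch under several assumptions: a bare repo on disk stands in for GitHub, two directories named alice and bob stand in for their computers, and report.txt is a placeholder file.

```shell
# Run as a script (for example: sh push_pull_demo.sh).
set -e

work=$(mktemp -d)
remote="$work/remote.git"

# Stand-in for GitHub: a bare repo whose default branch is main.
git init -q --bare "$remote"
git --git-dir="$remote" symbolic-ref HEAD refs/heads/main

# Alice creates the project and pushes it.
mkdir "$work/alice"
cd "$work/alice"
git init -q
git config user.name "Alice"
git config user.email "alice@example.com"
git config commit.gpgsign false
git remote add origin "$remote"
echo "v1" > report.txt
git add report.txt
git commit -q -m "Add report."
git branch -M main
git push -q -u origin main

# Bob clones the project.
git clone -q "$remote" "$work/bob"

# Alice makes another change and pushes it.
cd "$work/alice"
echo "v2" > report.txt
git commit -q -a -m "Update report."
git push -q

# Bob pulls from inside his local repo and receives Alice's change.
cd "$work/bob"
git pull -q
bob_content=$(cat report.txt)
echo "Bob's report.txt now contains: $bob_content"
```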

Branch

You can think of a branch as a copy of the repository within the repository (under the hood, a branch is a lightweight pointer to a commit). Branches enable a logical separation from the live version (usually main or master), giving you the freedom to work without fear of messing something up. You can create as many branches as you want within a repository, and switch between them using git checkout. When creating a new branch, you start from a currently existing branch, which is often the main branch.

One common example of using branches would be what are sometimes referred to as "feature" branches. A feature branch is a branch created with the specific purpose of developing a feature on it, which can later be merged into the main branch.

To create a new branch called my-branch, first check out the branch you’d like to branch off of, for example, main.

git checkout main

You can confirm which branch is live by looking for the asterisk after running the following.

git branch

Next, create the branch.

git branch my-branch

Once the branch is created, you can switch to it.

git checkout my-branch

It is very common to need to create a new branch and immediately switch to it. To do so, you can run the following.

git checkout -b my-new-branch
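The branch commands above can be tried together in a throwaway repo. The sketch assumes one initial commit on main, since a branch needs a commit to point at. The branch names are the ones used in this section.

```shell
# Run as a script (for example: sh branch_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit."
git branch -M main

# Create a branch from main, then switch to it.
git branch my-branch
git checkout -q my-branch

# Create another branch and switch to it in one step.
git checkout -q -b my-new-branch

# Show the current branch and list all local branches.
current=$(git rev-parse --abbrev-ref HEAD)
branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "current branch: $current"
git branch
```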

Checkout

git checkout is the command that allows you to switch between different branches. To switch to a branch called "my-branch" simply run the following.

git checkout my-branch

Upon switching to my-branch, all of the files and folders on your local machine will change to match the code and files on that branch. If my-branch has a drastically different file/folder structure than my-other-branch, files and folders will appear and disappear on your local machine when you switch branches.
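To watch files appear and disappear, commit a file on one branch only and then switch back and forth. The sketch below assumes a throwaway repo. The file name feature.txt is a placeholder.

```shell
# Run as a script (for example: sh checkout_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit."
git branch -M main

# Commit a file that exists only on my-branch.
git checkout -q -b my-branch
echo "draft" > feature.txt
git add feature.txt
git commit -q -m "Add feature file."

# Switch to main: the file disappears from the working directory.
git checkout -q main
if [ -e feature.txt ]; then on_main=present; else on_main=absent; fi

# Switch back to my-branch: the file reappears.
git checkout -q my-branch
if [ -e feature.txt ]; then on_branch=present; else on_branch=absent; fi

echo "feature.txt on main: $on_main"
echo "feature.txt on my-branch: $on_branch"
```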

Merge

Merging is the process of combining the changes and commits from one branch or fork to another. Ultimately, all accepted modifications made on other (non-live) branches need to be merged into the live branch.

To merge a branch called my-branch into the main branch, you must first switch to the branch you want to merge into. In this case that is the main branch.

git checkout main

Then, it is as straightforward as running the merge command.

git merge my-branch
When there is a conflict, this will not be so straightforward. Please see the example of resolving a conflict in the GitHub Desktop section.
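For the conflict-free case, the merge can be sketched end to end in a throwaway repo. The example assumes main has not changed since my-branch was created, so git can simply move main forward to the tip of my-branch (a fast-forward merge). The file name feature.txt is a placeholder.

```shell
# Run as a script (for example: sh merge_demo.sh).
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Smith"
git config user.email "john@example.com"
git config commit.gpgsign false
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit."
git branch -M main

# Do some work on a feature branch.
git checkout -q -b my-branch
echo "new feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature."

# Switch to the branch to merge into, then merge.
git checkout -q main
git merge -q my-branch

# main now contains the work from my-branch.
main_commit=$(git rev-parse main)
branch_commit=$(git rev-parse my-branch)
git log --oneline
```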

Resources

A glossary with common git and GitHub-related terminology.

An interactive in-browser game to help learn about git and git branching.