The Basics of Git Internals

The goal of this document is to share basic information about git internals to help people get started with git. It’s just a quick introduction/glossary that tries to explain things the easy way. It might not be 100% accurate, for the sake of staying easily readable and not too verbose. I originally wrote this for myself a few years ago during my free time, then it quickly became the go-to documentation for anyone new working on git at Abstract.

Objects

In git, almost everything is an object:

  • Commits
  • Tags (if created with the flag -a)
  • Trees, which are directories (ex. committing src/main.go will create a tree object for src)
  • Blobs, which are basically all the files you add with git add (so git add main.go will create a blob object containing main.go’s content).

All objects are represented by a SHA (also referred to as OID for Object ID). This SHA corresponds to the SHA1 hash of the content of an object file before it has been zlib compressed (so sha1(type + ' ' + data_size + 0x00 + data)).

Objects can be found at 2 different places:

  1. Packed in a packfile located in .git/objects/pack. See the Packfile section below for more information.
  2. As loose objects in .git/objects/[xx]
    * Newly created objects are stored there first. They are moved to a packfile automatically by git once in a while (by default, unreachable objects less than 2 weeks old are left loose). You can run git gc to start this process manually.
    * This directory also contains dangling objects, which are “unreachable” objects. Unreachable objects are objects that cannot be reached from any reference (branch, tag, etc.), and that can only be accessed manually using their SHA.
    * You can find an object using its SHA. The first directory is the first 2 chars of the SHA, then the file is the remaining 38 chars: .git/objects/sha[0:2]/sha[2:]. For example, the loose commit with SHA 63a972a73a396a758178ca604e5d8acce693bcca can be found at .git/objects/63/a972a73a396a758178ca604e5d8acce693bcca

Find the SHA/OID a file has (or would have) once added to git: git hash-object <file>
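The hash formula from the Objects section can also be reproduced by hand. Here is a minimal Python sketch, assuming a blob object and SHA-1 (the function name git_blob_oid is made up for this example, it is not part of git):

```python
import hashlib


def git_blob_oid(data: bytes) -> str:
    """Return the SHA-1 OID git would give to a blob with this content."""
    # Header is "<type> <size>" followed by a NUL byte, then the raw data.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known OID.
print(git_blob_oid(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Passing the bytes of main.go to this function should give the same value as running git hash-object on that file, as long as no content filters (like line-ending conversion) are configured.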

Get the type of an object from its SHA: git cat-file -t <sha>

Get the content of an object from its SHA: git cat-file -p <sha>

Other useful commands to look at an object:

Packfile

A packfile is a single file containing the contents of many objects. It’s basically an optimized local database. To avoid using too much space, a packfile stores deltas instead of full objects when possible. Example:

  1. You git add main.go and commit it.
  2. You add a trailing new line at the end of main.go, git add the changes then commit.

You now have 2 blob objects that are 99% similar, and since each loose object holds a full copy of main.go, the one byte you added is costing you a lot of disk space. Once put in the packfile, the second object can be stored as a delta instead of its full content, reducing its size to only a few bytes. Basically, a delta contains information that can be translated as “Take object X and, at offset Y, add a newline”.
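To make the idea more concrete, here is a rough Python sketch of how a delta can be applied to a base object. It follows the copy/insert instruction encoding described in git's pack format documentation, but the names apply_delta and read_varint are made up for this example, and finding the base object inside the packfile is left out entirely.

```python
def read_varint(delta: bytes, pos: int):
    """Read a little-endian base-128 size, return (value, new_position)."""
    value, shift = 0, 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target object from a base object and a delta."""
    base_size, pos = read_varint(delta, 0)
    target_size, pos = read_varint(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was not made for this base object")
    out = bytearray()
    while pos < len(delta):
        op = delta[pos]
        pos += 1
        if op & 0x80:
            # Copy instruction: the low 4 bits say which offset bytes follow,
            # the next 3 bits say which size bytes follow.
            offset, size = 0, 0
            for i in range(4):
                if op & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if op & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            out += base[offset:offset + size]
        elif op:
            # Insert instruction: the next `op` bytes are new data.
            out += delta[pos:pos + op]
            pos += op
        else:
            raise ValueError("instruction 0 is reserved")
    if len(out) != target_size:
        raise ValueError("unexpected target size")
    return bytes(out)


# Hand-written delta: "copy 13 bytes from offset 0, then insert one newline".
base = b"package main\n"
delta = bytes([13, 14, 0x90, 13, 0x01, 0x0A])
print(apply_delta(base, delta))
```

The delta here is 6 bytes long no matter how big the base file is, which is the whole point of storing deltas.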

Packfiles come in pairs with an index file (.idx). The index contains offsets into that packfile so you can quickly seek to a specific object.
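As an illustration of how the index makes lookups fast, here is a small Python sketch that finds the offset of an object in a version 2 .idx file. The layout assumed here (magic number, version, a 256-entry fanout table, sorted object names, CRC32 values, then offsets) comes from git's pack format documentation. The function name find_offset is made up for this example, and SHA-1 (20-byte) object names are assumed.

```python
import struct
from typing import Optional

IDX_MAGIC = b"\xfftOc"


def find_offset(idx: bytes, oid_hex: str) -> Optional[int]:
    """Return the packfile offset of an object, or None if it is not listed."""
    version = struct.unpack(">I", idx[4:8])[0]
    if idx[:4] != IDX_MAGIC or version != 2:
        raise ValueError("not a version 2 pack index")

    # fanout[b] is the number of objects whose first byte is <= b.
    fanout = struct.unpack(">256I", idx[8:8 + 1024])
    total = fanout[255]

    names_start = 8 + 1024
    offsets_start = names_start + 20 * total + 4 * total  # skip names + CRC32s
    large_start = offsets_start + 4 * total

    oid = bytes.fromhex(oid_hex)
    lo = fanout[oid[0] - 1] if oid[0] > 0 else 0
    hi = fanout[oid[0]]

    # Binary search among the names sharing the same first byte.
    while lo < hi:
        mid = (lo + hi) // 2
        start = names_start + 20 * mid
        name = idx[start:start + 20]
        if name == oid:
            pos = offsets_start + 4 * mid
            (offset,) = struct.unpack(">I", idx[pos:pos + 4])
            if offset & 0x80000000:
                # Large packs: the value is an index into a table of 8-byte offsets.
                pos = large_start + 8 * (offset & 0x7FFFFFFF)
                (offset,) = struct.unpack(">Q", idx[pos:pos + 8])
            return offset
        if name < oid:
            lo = mid + 1
        else:
            hi = mid
    return None
```

The fanout table is what keeps the search cheap: the first byte of the SHA immediately narrows things down to a small slice of the sorted names.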

Git will automatically move objects to a packfile from time to time (when you push or pull for example). You can manually move all packable loose objects inside a packfile by running git gc. It’s totally safe to manually run it, no side effects should happen.

List all the objects inside a packfile: git verify-pack -v .git/objects/pack/<pack-name>.idx

The 1st column contains the SHA of the object, the 2nd the type, the 3rd the size of the object, the 4th contains the size of the object once zlib compressed, and the last column contains the object location in the packfile (offset in byte).

Unpack all objects of a packfile: git unpack-objects < <path/to/packfile>.pack

It’s required to move the packfile out of the .git directory, because git won’t let you create objects that already exist in a packfile.

Other useful commands:

References

References are basically labels. They are a way to link a user-friendly name to a SHA, so you don’t have to know and remember SHAs.

There are 2 types of references:

  • Symbolic references: they are references pointing to another reference
  • OID references: they are references pointing to an object

References can be found at 2 different places:

  • In the .git/refs directory, where each reference will be in a file.
  • In the .git/packed-refs file, where each reference will be on a line.

A reference can appear both in .git/refs AND .git/packed-refs with a different target. In this case, .git/refs will contain the most up-to-date data.

Branches are references located in .git/refs/heads. They point to a single SHA and are automatically updated when creating new branches, commits, etc.

HEAD is basically a reference to the last commit of the current branch (but not always). You can go back in history by adding ~ followed by a number (ex: HEAD~1 corresponds to the commit right before the last one), or by adding a series of ^ (ex: HEAD^^ is the third most recent commit, same as HEAD~2).

The HEAD is located in .git/HEAD, and contains either a SHA (detached head, if you check out a commit for example), or a symbolic reference to a branch (if you’re in a branch).
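Putting the pieces of this section together, here is a simplified Python sketch of how a reference such as HEAD can be resolved to a SHA by reading the files directly. The function names are made up for this example, and it ignores things real git handles (worktrees, reftable storage, etc.). In practice git rev-parse does this for you.

```python
import os
from typing import Dict, Optional


def read_packed_refs(git_dir: str) -> Dict[str, str]:
    """Parse .git/packed-refs into a {reference name: SHA} mapping."""
    refs: Dict[str, str] = {}
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(path):
        return refs
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blanks, the header comment, and peeled tag lines ("^<sha>").
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, name = line.split(" ", 1)
            refs[name] = sha
    return refs


def resolve_ref(git_dir: str, name: str = "HEAD", max_depth: int = 10) -> Optional[str]:
    """Follow symbolic references until a SHA is found."""
    for _ in range(max_depth):
        path = os.path.join(git_dir, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="utf-8") as f:
                content = f.read().strip()
            if content.startswith("ref: "):
                # Symbolic reference: keep following it.
                name = content[len("ref: "):]
                continue
            # OID reference stored as a loose file: most up-to-date value.
            return content
        # No loose file: fall back to packed-refs.
        return read_packed_refs(git_dir).get(name)
    return None
```

Calling resolve_ref(".git") from the root of a repository should return the SHA HEAD currently points to, whether HEAD is detached or points to a branch.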

Given the following history

We can use rev-parse to get the referenced commit SHA:

Working tree

The working tree is basically your file system. When you open a file in your code editor and start changing things, you’re editing that file on the working tree.

  • git status to see the files/directories that changed (sections Changes not staged for commit and Untracked files; changes appear in red if you have colors enabled)
  • git diff to see the specific changes in each file.
  • git checkout -- <file> can be used to revert the changes of a file

Index

  • This is not the same as a packfile’s index.
  • The index is also known as the staging area
  • This is where all your changes go when using git add
  • When creating a commit, git is committing what’s in the index
  • The index is a binary file located at .git/index
  • git status to see what changed in the index (section Changes to be committed, the changes will appear in green if you have colors enabled).
  • git reset HEAD <file> can be used to remove files from the index without impacting the working tree.
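Since the index is a binary file, you can’t just open it in a text editor. As a small illustration, here is a Python sketch that reads only its 12-byte header: a 4-byte signature (DIRC), a 4-byte version, and a 4-byte entry count, all big-endian. The function name read_index_header is made up for this example, and parsing the entries themselves is left out.

```python
import struct


def read_index_header(data: bytes):
    """Return (version, number_of_entries) from the start of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, entry_count
```

For example, passing it the bytes of .git/index (opened in binary mode) should return the index version and the number of tracked paths currently in the index.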

Debugging and reverse engineering

If you ever wonder what a git command does internally, you can add GIT_TRACE=1 in front of it.

Example:

Source: https://stackoverflow.com/a/52193441/382879

You can extract a raw object using python. For example, to extract the (loose) object 03455d53eeaf35edc36e983ba877b2bc5242b49a you can do:
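A minimal sketch of what that can look like, assuming the object is loose and the repository uses SHA-1 (the function names are made up for this example):

```python
import os
import zlib
from typing import Tuple


def loose_object_path(git_dir: str, oid: str) -> str:
    """Build .git/objects/sha[0:2]/sha[2:] for a given SHA."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def read_loose_object(git_dir: str, oid: str) -> Tuple[str, bytes]:
    """Return (type, content) of a loose object."""
    with open(loose_object_path(git_dir, oid), "rb") as f:
        raw = zlib.decompress(f.read())
    # The raw data is "<type> <size>\0<content>".
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("object size does not match its header")
    return obj_type.decode("ascii"), body
```

From the root of the repository, read_loose_object(".git", "03455d53eeaf35edc36e983ba877b2bc5242b49a") would return the type and the raw content of that object. If the object has already been moved to a packfile, the file won’t exist and this will fail.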

Now if the object has a binary format, you most likely want to inspect its data using hexdump:
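Trees are a good example of a binary format: once decompressed and stripped of its header, a tree is a sequence of entries shaped like "<mode> <name>", a NUL byte, then the 20 raw bytes of the SHA. Here is a rough Python sketch that decodes such a body, assuming SHA-1 (the function name parse_tree is made up for this example):

```python
from typing import List, Tuple


def parse_tree(body: bytes) -> List[Tuple[str, str, str]]:
    """Decode a tree body into a list of (mode, name, sha) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", errors="replace")
        # The SHA is stored as 20 raw bytes, not as 40 hex characters.
        sha = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries
```

The raw SHA bytes are the reason a tree looks like garbage when printed as text, and why a hex dump is easier to read.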

Staff Engineer at Abstract, Splice. I love Go, and I love Git. https://melvin.la https://linkedin.com/in/melvinlaplanche https://github.com/Nivl
