The Basics of Git Internals
The goal of this document is to share basic information about git internals to help people get started with git. It’s just a quick introduction/glossary that tries to explain things the easy way. It might not be 100% accurate, for the sake of staying easily readable and not too verbose. I originally wrote this for myself a few years ago during my free time, then it quickly became the go-to documentation for anyone new working on git at Abstract.
In git, almost everything is an object:
- Tags (if created with the flag
- Trees, which are directories (ex.
git add src/main.go will create a tree object for
- Blobs, which are basically all the files you add with
git add main.go will create a blob object containing
All objects are represented by a SHA (also referred to as OID for Object ID). This SHA corresponds to the SHA1 hash of the content of an object file before it has been zlib compressed (so
sha1(type + ' ' + data_size + 0x00 + data)).
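To make the formula above concrete, here is a minimal Python sketch that rebuilds the header by hand and hashes it. The function name `git_object_id` is made up for this example, and it assumes a repository using SHA1 object IDs.

```python
import hashlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA1 object ID git would give to `data` (hypothetical helper)."""
    # Header is "<type> <size in bytes>" followed by a single NUL byte.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    # The hash covers header + data, before any zlib compression.
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known ID.
print(git_object_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

For a blob, the result should match what `git hash-object` prints for a file with the same bytes.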
Objects can be found at 2 different places:
- Packed in a packfile located in
.git/objects/pack. See Packfile section below for more information.
- As loose objects in
* Usually, recently created objects are stored there. Git moves them to a packfile automatically once in a while; you can run git gc to start this process manually.
* This directory also contains dangling objects, which are “unreachable” objects. Unreachable objects are objects not added to any commits or branches, and that can only be reached manually using their SHA.
* You can find an object using its SHA. The first directory is the first 2 chars of the SHA, then the file is the remaining 38 chars:
.git/objects/sha[0:2]/sha[2:]. For example, the loose commit with SHA
63a972a73a396a758178ca604e5d8acce693bcca can be found at
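The SHA-to-path rule can be sketched in a few lines of Python. The helper name `loose_object_path` is hypothetical, and the sketch assumes a full 40-character SHA and the default `.git` directory.

```python
from pathlib import Path


def loose_object_path(sha: str, git_dir: str = ".git") -> Path:
    """Build the path where a loose object with this SHA would live."""
    if len(sha) != 40:
        raise ValueError("expected a full 40-character SHA")
    # First 2 chars name the directory, the remaining 38 name the file.
    return Path(git_dir) / "objects" / sha[:2] / sha[2:]


print(loose_object_path("03455d53eeaf35edc36e983ba877b2bc5242b49a").as_posix())
# .git/objects/03/455d53eeaf35edc36e983ba877b2bc5242b49a
```

The file only exists if the object is still loose; once packed, it has to be read from the packfile instead.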
Find the SHA/OID a file has (or would have) once added to git:
❯ git hash-object main.go
Get the type of an object from its SHA:
❯ git cat-file -t 03f6454b22ad871240b2505c0fb24d290d279d15
Get the content of an object from its SHA:
❯ git cat-file -p 03f6454b22ad871240b2505c0fb24d290d279d15
[content of file main.go]
Other useful commands to look at an object:
❯ git show <sha> # To see the content of an object
❯ git ls-tree <sha> # To see the content of a tree or commit's tree
❯ git log <sha> # To see the history of a commit
❯ git fsck # To verify and validate all objects
Packfile
A packfile is a single file containing the contents of many objects. It’s basically an optimized local database. To avoid using too much space, a packfile stores deltas instead of full objects when possible. Example:
- You git add main.go and commit it.
- You add a trailing newline at the end of the file, git add the changes, then commit.
You now have 2 blob objects that are 99% similar, and this one byte you added to
main.go is costing you a lot of disk space, since the whole file is stored twice. Once put in the packfile, the second object will only contain deltas instead of the full content, reducing its size to only a few bytes. Basically, a delta contains information that can be translated as “Take object X, and at the offset Y, add a newline”.
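To illustrate the idea, here is a toy sketch of a delta in Python. This is not git's real delta encoding (which is a binary stream of copy and insert instructions); `InsertDelta` and `apply_delta` are made-up names that only model the “take object X, and at the offset Y, add something” case.

```python
from typing import NamedTuple


class InsertDelta(NamedTuple):
    """Hypothetical, simplified delta: insert `data` at `offset` in the base."""
    offset: int
    data: bytes


def apply_delta(base: bytes, delta: InsertDelta) -> bytes:
    """Rebuild the new object from the base object plus the delta."""
    return base[:delta.offset] + delta.data + base[delta.offset:]


base = b"package main\n\nfunc main() {}"
delta = InsertDelta(offset=len(base), data=b"\n")  # the trailing newline
rebuilt = apply_delta(base, delta)
print(len(rebuilt), "bytes rebuilt from a", len(delta.data), "byte insertion")
```

The point is that only the small delta needs to be stored next to the base object, not a second full copy of the file.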
Packfiles come in pairs with an index file (.idx). The index contains offsets into that packfile so you can quickly seek to a specific object.
Git will automatically move objects to a packfile from time to time (when you push or pull for example). You can manually move all packable loose objects inside a packfile by running
git gc. It’s totally safe to manually run it, no side effects should happen.
List all the objects inside a Packfile:
❯ git verify-pack -v .git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx
0155eb4229851634a0f03eb265b69f5a2d56f341 tree 71 76 5400
05408d195263d853f09dca71d55116663690c27c blob 12908 3478 874
09f01cea547666f58d6a8d809583841a7c6f0130 tree 106 107 5086
1a410efbd13591db07496601ebc7a059dd55cfe9 commit 225 151 322
The 1st column contains the SHA of the object, the 2nd the type, the 3rd the size of the object, the 4th the size of the object once zlib compressed, and the last column the object's location in the packfile (offset in bytes).
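If you want to process this output in a script, the five columns map naturally to a small record. Below is a rough Python sketch; `PackEntry` and `parse_verify_pack_line` are hypothetical names, and the parser only handles object lines (not the summary lines git prints at the end).

```python
from typing import NamedTuple


class PackEntry(NamedTuple):
    sha: str
    obj_type: str
    size: int         # size of the object
    packed_size: int  # size once zlib compressed
    offset: int       # position in the packfile, in bytes


def parse_verify_pack_line(line: str) -> PackEntry:
    """Parse one object line of `git verify-pack -v` output."""
    # Deltified objects carry extra columns after the offset; ignore them here.
    sha, obj_type, size, packed_size, offset = line.split()[:5]
    return PackEntry(sha, obj_type, int(size), int(packed_size), int(offset))


entry = parse_verify_pack_line(
    "05408d195263d853f09dca71d55116663690c27c blob 12908 3478 874"
)
print(entry.obj_type, entry.size, entry.packed_size, entry.offset)
# blob 12908 3478 874
```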
Unpack all objects of a packfile:
# From within the repository
❯ mv .git/objects/pack/pack-HASH.pack . # Move the packfile away
❯ git unpack-objects < pack-HASH.pack # Unpack the packfile in the current repo
It’s required to move the packfile out of the .git directory, because git won’t let you create objects that already exist in a packfile.
Other useful commands:
❯ git gc # pack loose objects and optimize the packfiles
❯ git repack # optimize a packfile by repacking it
References are basically labels. They are a way to link a user-friendly name to a SHA, so you don't have to know and remember SHAs.
There are 2 types of references:
- Symbolic references: they are references pointing to another reference
- OID references: they are references pointing to an object
References can be found at 2 different places:
- In the .git/refs directory, where each reference is stored in its own file.
- In the .git/packed-refs file, where each reference is stored on its own line.
A reference can appear in both places with a different target in each. In this case,
.git/refs will contain the most up-to-date data.
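The lookup order described above can be sketched like this in Python. `read_ref` is a made-up helper, not how git itself is implemented; it assumes the usual packed-refs layout of one `<sha> <name>` pair per line, with `#` header lines and `^` lines (peeled tags) skipped.

```python
from pathlib import Path
from typing import Optional


def read_ref(git_dir: Path, name: str) -> Optional[str]:
    """Return the raw target of a ref such as "refs/heads/master", or None."""
    loose = git_dir / name
    if loose.is_file():
        # A loose ref file wins over a packed entry with the same name.
        return loose.read_text().strip()

    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip the header comment and "^" lines (peeled tag targets).
            if not line or line[0] in "#^":
                continue
            sha, ref_name = line.split(" ", 1)
            if ref_name == name:
                return sha
    return None
```

For example, `read_ref(Path(".git"), "refs/heads/master")` would return the SHA the branch points to, if that branch exists.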
Branches are references located in
.git/refs/heads. They point to a single SHA and are automatically updated when creating new branches, commits, etc.
HEAD is basically a reference to the last commit of the current branch (but not always). You can go further back in history by adding
~ followed by a number (ex:
HEAD~1 corresponds to the commit right before the last one), or by adding a bunch of
^ (ex: use
HEAD^^ to get the third commit (=
The HEAD is located in
.git/HEAD, and contains either a
SHA (detached head, if you check out a commit for example), or a symbolic reference to a branch (if you’re in a branch).
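As a rough illustration, the two possible contents of the HEAD file can be told apart with a few lines of Python. `parse_head` is a hypothetical helper, and it assumes a symbolic HEAD is written as `ref: <name>`.

```python
from typing import Tuple


def parse_head(content: str) -> Tuple[str, str]:
    """Classify the content of .git/HEAD as symbolic or detached."""
    content = content.strip()
    prefix = "ref: "
    if content.startswith(prefix):
        # Symbolic reference to a branch, e.g. "ref: refs/heads/master".
        return "symbolic", content[len(prefix):]
    # Otherwise the file holds a raw SHA (detached head).
    return "detached", content


print(parse_head("ref: refs/heads/master\n"))
# ('symbolic', 'refs/heads/master')
print(parse_head("5457f77c153d0f17042ee425f4985566fb21c02c\n"))
# ('detached', '5457f77c153d0f17042ee425f4985566fb21c02c')
```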
Given the following history
* 5457f77c15 fix(files): Do not multi-select deleted layers (#1316)
* 36793a8812 fix: comment form sizing (#1377)
* fa5b5732c2 fix: Specs after merge
* 7d2d198eaa fix: lastPulledAt should be set to the pushedAt timestamp (#1149)
* d42fda8c0b fix: copy links not routing to correct scroll within comment feed (#1363)
We can use rev-parse to get the referenced commit SHA:
❯ git rev-parse HEAD
5457f77c153d0f17042ee425f4985566fb21c02c
❯ git rev-parse HEAD~1
36793a88124435fd7bc328ddb7799572dc560646
❯ git rev-parse HEAD~2
fa5b5732c25f7365795d1bf06fefdf529c83f7c6
❯ git rev-parse HEAD^^^ # In zsh you have to escape each ^ with \^
The working tree is basically your file system. When you open a file in your code editor and start changing things, you’re editing that file on the working tree.
Investigating/Debugging the Working Tree
- git status to see the files/directories that changed (sections Changes not staged for commit and Untracked files; changes appear in red if you have colors enabled).
- git diff to see the specific changes in each file.
- git checkout -- <file> can be used to revert the changes of a file.
Index
- This is not the same as a packfile’s index.
- The index is also known as the staging area
- This is where all your changes go when using git add
- When creating a commit, git is committing what’s in the index
- The index is a binary file located at .git/index
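To get a feel for the binary format, here is a small Python sketch that decodes only the first 12 bytes of an index file: a 4-byte `DIRC` signature, a 4-byte version number, and a 32-bit entry count, all big-endian. `read_index_header` is a made-up name, and the sample bytes are built by hand rather than read from a real repository.

```python
import struct
from typing import Tuple


def read_index_header(raw: bytes) -> Tuple[int, int]:
    """Return (version, number of entries) from the start of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", raw[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, entry_count


# Hand-built header: version 2, 3 entries.
sample = b"DIRC" + struct.pack(">II", 2, 3)
print(read_index_header(sample))  # (2, 3)
```

On a real repository you could pass it the first 12 bytes of .git/index, opened in binary mode.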
Investigating/debugging the Index
- git status to see what changed in the index (section Changes to be committed; the changes appear in green if you have colors enabled).
- git reset HEAD <file> can be used to remove files from the index without impacting the working tree.
Debugging and reverse engineering
Trace a git CLI command
If you ever wonder what a git command does internally, you can add
GIT_TRACE=1 in front of it.
❯ GIT_TRACE=1 git branch
20:44:35.621027 git.c:444 trace: built-in: git branch
20:44:35.621827 run-command.c:664 trace: run_command: unset GIT_PAGER_IN_USE; LV=-c less
* master
Extracting a raw object
You can extract a raw object using python. For example, to extract the (loose) object
03455d53eeaf35edc36e983ba877b2bc5242b49a you can do:
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" < .git/objects/03/455d53eeaf35edc36e983ba877b2bc5242b49a
Now if the object has a binary format, you most likely want to inspect its data using hexdump:
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" < .git/objects/03/455d53eeaf35edc36e983ba877b2bc5242b49a | hexdump -C
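If you prefer a small script over a one-liner, the same idea can be pushed one step further by splitting the decompressed bytes into header and content. This is only a sketch: `parse_loose_object` is a hypothetical helper, and the sample object is built in memory instead of being read from .git/objects.

```python
import zlib
from typing import Tuple


def parse_loose_object(compressed: bytes) -> Tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(compressed)
    # Layout is "<type> <size>", a NUL byte, then the content.
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body


# Hand-built blob holding "hello\n" (6 bytes).
sample = zlib.compress(b"blob 6\x00hello\n")
print(parse_loose_object(sample))  # ('blob', b'hello\n')
```

To use it on a real object, read the file in binary mode and pass its bytes to the function.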