Git Under the Hood, Part 1: Object Storage in Git

Amir Ebrahimi Fard
Data Management for Researchers
6 min read · Jul 26, 2021

*** This article is inspired by [1] ***

Photo by Dima Pechurin on Unsplash

This section describes how Git stores and tracks files. To this end, it is essential to first talk about Git object types. Objects are immutable components in Git that are used to store information permanently. There are four types of objects in Git: (i) blob, (ii) tree, (iii) commit, and (iv) tag. A blob object keeps the contents of a file (but not its mode or name). A tree object corresponds to a directory: it lists the blobs (files) and other trees (subdirectories) that the directory contains, along with their names and modes. A commit object points to a single tree, the snapshot of the project's top-level directory at that moment, and stores the author, committer, message, and parent commit(s). The final type of object is a tag, which assigns a permanent label to another object, usually a commit. It comprises an object, type¹, tag, tagger, and message. There are also other components within the Git filesystem called references, which are less permanent in nature. Examples of references are branches and remotes.

Figures 1 through 6 show a schematic representation of Git objects with an example. The example is a hierarchical structure containing three folders and three files. Every file and folder is mapped to a Git object with a unique identifier (more precisely, files and folders are mapped to blobs and trees, respectively). As shown in Figure 2, a blob object contains only the file contents (note: if several files anywhere in a project have identical contents, Git stores the blob object only once). Figure 3 represents a tree object. It is a simple list that records the mode, type, name, and SHA of each entry. Figures 4 and 5 show commit objects. A commit object always has at least one parent² unless it is the very first commit in a project. Finally, Figure 6 shows a tag object pointing to a particular commit.
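
As a rough sketch of the two points above (identical contents share one blob, and a tree is a list of entries), the following code serializes a small tree. The file names a.py and copy_of_a.py and the helper names are hypothetical. The entry layout of mode, name, NUL byte, and raw 20-byte ID follows Git's SHA-1 tree format; the type shown in listings is derived from the mode. Sorting is simplified here, since real Git applies an extra ordering rule for directory names.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    """Hash "<type> <size>" + NUL + payload, as Git does for SHA-1 objects."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries):
    """Serialize (mode, name, hex_id) entries in the layout of a tree object."""
    body = b""
    # Simplification: plain sort by name (real Git treats directory names specially).
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)
    return body


contents = b"print('hello')\n"
id_a = object_id("blob", contents)
id_b = object_id("blob", contents)  # a second file with identical contents

# Both entries point at the same blob, so the blob only needs to be stored once.
tree = tree_payload([("100644", "a.py", id_a), ("100644", "copy_of_a.py", id_b)])
print(id_a == id_b)
print(object_id("tree", tree))
```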

Figure 1: The mapping between the working directory and Git directory.
Figure 2: A schematic representation of a blob object.
Figure 3: A schematic representation of a tree object.
Figure 4: A schematic representation of the first commit object.
Figure 5: A schematic representation of a non-initial commit object.
Figure 6: A schematic representation of a tag object.

Now the question is, how does Git track the history of a project’s development? The answer is through its data model, which is a directed acyclic graph (DAG). This data model keeps all the Git objects in a DAG structure, so from each commit we can traverse back to its ancestors and retrieve their corresponding information. Figure 7 displays the schematic view of the Git data model.

Figure 7: The Git data model. The objects and references are represented by coloured rectangles and grey boxes, respectively.

As this diagram illustrates, a branch, a tag, and a remote can each reference a specific commit. HEAD normally references the current branch and, through it, that branch's latest commit. A commit points to a tree, which is a snapshot of the repository's top-level directory, and a tree can reference one or more blobs. The circle around the commit object represents its parent commits, of which there can be zero (the very first commit), one (non-initial, non-merge commits), or more than one (merge commits). The circle around the tree object likewise represents references to other trees (subfolders) [2].
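
The parent links are what make the history a graph that can be walked backwards. The toy model below is only an illustration: the commit IDs c1 to c4 and the parents table are invented, and real Git reads parent IDs out of commit objects instead of a dictionary.

```python
from collections import deque

# Hypothetical commit graph: each commit ID maps to the list of its parent IDs.
parents = {
    "c1": [],            # initial commit: no parent
    "c2": ["c1"],        # ordinary commit: one parent
    "c3": ["c1"],        # a commit on a side branch
    "c4": ["c2", "c3"],  # merge commit: two parents
}


def ancestors(start):
    """Return every commit reachable from start, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:  # a commit can be reachable along several paths
                seen.add(parent)
                queue.append(parent)
    return order


print(ancestors("c4"))  # ['c4', 'c2', 'c3', 'c1']
```

Since links only ever point from a commit to commits that already existed, the walk cannot loop, which is the "acyclic" part of the DAG.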

Here we show the formation and development of the Git data structure using the same hypothetical example shown before in Figure 1. After the first commit, the Git structure looks like this:

Figure 8: The Git structure after the initial commit.

In this example, there are three blob objects (the files main.py, base_lib.py, and core_lib.py), three tree objects (the folders ., lib, and base), and one commit object corresponding to the initial commit. There are also two references: a branch and HEAD. The branch points to our last commit and HEAD points to the branch we are currently on³. This lets Git know which commit will be the parent of the next commit. So, every branch is in fact a pointer to a certain commit, and HEAD, by pointing to the current branch, leads to the latest commit on that branch.
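
The relationship between HEAD, a branch, and a commit can be sketched with a small lookup table. This is a simplified model, not Git's implementation: the commit IDs c1 and c2 and the function names are hypothetical, and it assumes HEAD points at a branch (the "ref: refs/heads/master" form) and not directly at a commit.

```python
# Hypothetical reference table: a name maps either to another ref or to a commit ID.
refs = {
    "HEAD": "ref: refs/heads/master",  # symbolic reference: points at a branch
    "refs/heads/master": "c1",         # branch: points at a commit
}


def resolve(name):
    """Follow symbolic references until a commit ID is reached."""
    value = refs[name]
    while value.startswith("ref: "):
        value = refs[value[len("ref: "):]]
    return value


def commit(new_id):
    """Record a new commit and return its parent (the commit HEAD resolved to)."""
    parent = resolve("HEAD")
    branch = refs["HEAD"][len("ref: "):]
    refs[branch] = new_id  # only the branch moves; HEAD still names the branch
    return parent


first_parent = commit("c2")
print(first_parent, resolve("HEAD"))  # c1 c2
```

In this model HEAD itself never changes during a commit, yet it resolves to the new commit because the branch it names has moved.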

Now, assume we make a change to the /lib/base/base_lib.py file and commit this change in Git. Because objects are immutable, changing base_lib.py creates a new blob object. The tree that pointed to the old blob (base) can no longer be used, so a new tree object is created for it, and the same happens for every tree above it (lib and the top-level folder). A new commit object is also created that points to its parent (the previous/initial commit) and to the new top-level tree object. Now, the branch reference, and through it HEAD, points to the new commit instead of the previous one. We also tag this commit, which creates a tag object.

Figure 9: The Git structure after the second commit.

Now, say we modify main.py and commit the changes in Git. This creates a new blob object corresponding to main.py. The tree object that pointed to the former main.py blob cannot be reused, and thus a new tree object is added. The subtrees are unchanged, so the new tree simply points to the existing ones. Again, a new commit object is created that points to the previous commit. Now, the branch reference moves forward and points to the latest commit.

Figure 10: The Git structure after the third commit.

As this example illustrates, Git tries to reuse existing objects in its structure as much as possible. In this system, a new object is created only when absolutely necessary. To keep all the information and history in this example, Git stores 16 immutable, compressed objects (five blobs, seven trees, three commits, and one tag), each identified by a hash of its contents. Storing the commit history in this structure allows us to easily recreate any directory state we committed before by either referring to its unique identifier or traversing the graph from another commit (e.g., if we wanted to bring up the first tree in the example, we could look for the parent of the parent of HEAD, or the parent of the tagged commit). The next section explains traversing the Git tree in more detail.
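
The object count in this example can be checked with a toy content-addressed store. The sketch below is a simplification under several assumptions: objects are kept as readable text instead of Git's real binary formats, author and timestamp fields are omitted, core_lib.py is assumed to sit in lib and main.py in the top-level folder, and the file contents and the tag name v1.0 are invented.

```python
import hashlib
from collections import Counter

store = {}  # object ID -> (type, payload); an existing ID is never overwritten


def put(obj_type, payload):
    """Store an object under the hash of its type and payload; reuse it if present."""
    oid = hashlib.sha1(f"{obj_type} {len(payload)}\0".encode() + payload).hexdigest()
    store.setdefault(oid, (obj_type, payload))
    return oid


def write_tree(node):
    """Store a nested dict (folders) of bytes (files) and return the tree's ID."""
    lines = []
    for name, child in sorted(node.items()):
        if isinstance(child, dict):
            lines.append(f"tree {write_tree(child)} {name}")
        else:
            lines.append(f"blob {put('blob', child)} {name}")
    return put("tree", "\n".join(lines).encode())


def commit(files, parent, message):
    """Store a commit pointing at the top-level tree and, if given, its parent."""
    text = f"tree {write_tree(files)}\n"
    if parent is not None:
        text += f"parent {parent}\n"
    return put("commit", (text + "\n" + message).encode())


files = {
    "main.py": b"main v1",
    "lib": {"core_lib.py": b"core v1", "base": {"base_lib.py": b"base v1"}},
}
c1 = commit(files, None, "first commit")

files["lib"]["base"]["base_lib.py"] = b"base v2"
c2 = commit(files, c1, "second commit")
tag = put("tag", f"object {c2}\ntype commit\ntag v1.0\n\ntagged version".encode())

files["main.py"] = b"main v2"
c3 = commit(files, c2, "third commit")

counts = Counter(obj_type for obj_type, _ in store.values())
print(len(store), dict(counts))
```

Unchanged files and folders hash to IDs that are already in the store, so the third commit adds only a blob, a top-level tree, and a commit.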

Footnotes

  1. Normally the type is commit and the object is the SHA-1 of the commit you’re tagging.
  2. If a commit is the result of merging multiple branches, it will have one parent for each merged branch.
  3. HEAD is a label that normally points to the branch we are on and, through it, to the last commit of that branch, so on the master branch it refers to the last commit there (shown symbolically as HEAD -> master). If we switch to another branch in the project, HEAD will refer to the last commit in that branch (HEAD -> <branch_name>).

Amir Ebrahimi Fard
Data Management for Researchers

Postdoc Researcher on AI Explainability - Interested in the intersection of data, algorithm, and society.