Using git in Research

Aly Lamuri

Version control in research

Objectives:

  • Review how a research project takes place
  • Understand why version control matters in a research project
  • Know how to implement the basic idea of a version control system

What our research looks like

How we want it to be

How it actually is
  • We always make a linear plan
  • Along the way, other great ideas pop up
  • Sometimes, it is overwhelming to keep up with the changes

We have been there, done that

  • We start with a simple draft
  • Then we go with rounds of revisions
  • We end up with multiple drafts

Version control system (VCS)

  • It does versioning on your behalf
  • You can revisit the older “version” of your research
  • A lean approach to collaborating with multiple people

“We can do that with MS Office / Google Docs / {other software}, right?”

  • Office suites offer a centralized VCS
  • Changes are recorded only while you are online
  • It tracks changes to the whole document
  • Limited scope of document version management
  • Not applicable to code or scripts

An ideal VCS for researchers

git does this

Basics of git

Objectives:

  • Be able to install and use git on a local machine
  • Know how to configure git for the first time
  • Understand the five most-used commands in git: add, commit, clone, push, and pull

What is git?

Getting started with git

Configure git for the first time

Terminal
$ git config --global user.name "Your Name"
$ git config --global user.email any.email.you.use@mail.com

You can check your configuration by issuing:

Terminal
$ git config --list

Which will give you:

Output
user.email=any.email.you.use@mail.com
user.name=Your Name
init.defaultbranch=master
merge.tool=vimdiff
credential.helper=cache
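
To check a single setting instead of the full list, `git config --get` can be used. The sketch below is only an illustration: it assumes a throwaway repository in a temporary directory and uses `--local` so that your global settings stay untouched. The name and email are the same placeholders as above.

```shell
# Work in a throwaway repository so nothing global is modified
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# --local stores the setting in this repository's .git/config only
git config --local user.name "Your Name"
git config --local user.email any.email.you.use@mail.com

# --get prints the value of a single key
configured_name="$(git config --local --get user.name)"
echo "$configured_name"
```

A repository-level setting like this takes precedence over the global one, which may be handy when one project needs a different email address.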

Cloning a repository

Terminal
$ git clone https://github.com/octocat/Spoon-Knife
  • The command above will clone a GitHub repository onto your local machine
  • You may notice that you now have a folder named Spoon-Knife in your current directory

Checking on the cloned repository

Open your terminal application, then go to the cloned directory

Terminal
$ cd /path/to/Spoon-Knife # Change directory
$ ls -lah                 # List all the contents
CMD
C:\Users\Username>cd /d "C:\path\to\Spoon-Knife" & REM Change directory
C:\path\to\Spoon-Knife>dir /s /b /o:gn & REM List all the contents

You will find the following files listed:

Output
total 24K
drwxr-xr-x 3 lam lam 4.0K Mar 20 07:33 .
drwxr-xr-x 3 lam lam 4.0K Mar 20 07:37 ..
drwxr-xr-x 7 lam lam 4.0K Mar 20 07:33 .git
-rw-r--r-- 1 lam lam  355 Mar 20 07:33 index.html
-rw-r--r-- 1 lam lam  780 Mar 20 07:33 README.md
-rw-r--r-- 1 lam lam  256 Mar 20 07:33 styles.css

We can check on the log history

Terminal
$ git log --oneline --graph --decorate --all
Output
* f439fc5 (origin/change-the-title) Update README.md
| * 5806070 (origin/test-branch) Create test.md
|/
* d0dd1f6 (HEAD -> main, origin/main, origin/HEAD) Pointing to the guide for forking
* bb4cc8d Create styles.css and updated README
* a30c19e Created index page for future collaborative edits

Create a file and make some changes

  • Let’s make a new file: myfile.txt
  • We can put some lorem ipsum in our new file
  • Now, our directory should contain:
Terminal
$ ls -lah
Output
total 28K
drwxr-xr-x 3 lam lam 4.0K Mar 20 08:13 .
drwxr-xr-x 3 lam lam 4.0K Mar 20 07:37 ..
drwxr-xr-x 7 lam lam 4.0K Mar 20 08:10 .git
-rw-r--r-- 1 lam lam  355 Mar 20 07:33 index.html
-rw-r--r-- 1 lam lam   10 Mar 20 08:13 myfile.txt
-rw-r--r-- 1 lam lam  780 Mar 20 07:33 README.md
-rw-r--r-- 1 lam lam  256 Mar 20 07:33 styles.css

Let’s check our repository status

Terminal
$ git status
Output
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        myfile.txt

nothing added to commit but untracked files present (use "git add" to track)

We can stage myfile.txt so that git starts tracking it, then recheck the status

Terminal
$ git add myfile.txt
$ git status
Output
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   myfile.txt

We need to commit for the staged changes to be recorded in the history

Terminal
$ git commit myfile.txt -m "Add a new file to describe lorem ipsum"
Output
[main 67802c8] Add a new file to describe lorem ipsum
 1 file changed, 3 insertions(+)
 create mode 100644 myfile.txt
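
Once a change is committed, earlier versions stay retrievable. The following is a minimal sketch of that idea, assuming a throwaway repository in a temporary directory; the file contents and commit messages are made up for illustration.

```shell
# Throwaway repository with a local (not global) identity
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config --local user.name "Your Name"
git config --local user.email any.email.you.use@mail.com

# First version of the file
echo "first draft" > myfile.txt
git add myfile.txt
git commit -q -m "Add first draft"

# Second version overwrites the first in the working directory
echo "second draft" > myfile.txt
git add myfile.txt
git commit -q -m "Revise draft"

# HEAD~1 refers to the commit just before the latest one;
# <commit>:<path> prints the file as it was in that commit
old_version="$(git show HEAD~1:myfile.txt)"
commit_count="$(git rev-list --count HEAD)"
echo "$old_version ($commit_count commits so far)"
```

The working copy still holds the second draft; `git show` only prints the older content without changing any file.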

Checking on the log

Terminal
$ git log --all --decorate --oneline --graph
Output
* 67802c8 (HEAD -> main) Add a new file to describe lorem ipsum
| * f439fc5 (origin/change-the-title) Update README.md
|/
| * 5806070 (origin/test-branch) Create test.md
|/
* d0dd1f6 (origin/main, origin/HEAD) Pointing to the guide for forking
* bb4cc8d Create styles.css and updated README
* a30c19e Created index page for future collaborative edits
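
The two remaining commands from the objectives, push and pull, move commits between your machine and a remote. The sketch below is an assumption-laden illustration rather than a GitHub session: a local bare repository stands in for the remote, and the collaborators "alice" and "bob" are hypothetical.

```shell
# Everything happens in a temporary directory; nothing touches GitHub
work_dir="$(mktemp -d)"
cd "$work_dir"

# A bare repository stands in for the remote that GitHub would host
git init -q --bare remote.git

# First collaborator clones the (still empty) remote, commits, and pushes
git clone -q remote.git alice 2>/dev/null
cd alice
git config --local user.name "Alice"
git config --local user.email alice@example.com
echo "lorem ipsum" > myfile.txt
git add myfile.txt
git commit -q -m "Add a new file to describe lorem ipsum"
git push -q origin HEAD   # HEAD means the currently checked-out branch
cd ..

# Second collaborator clones the remote, which now holds one commit
git clone -q remote.git bob

# Alice extends the file and pushes again
cd alice
echo "dolor sit amet" >> myfile.txt
git commit -q -a -m "Extend lorem ipsum"
git push -q origin HEAD
cd ..

# Bob pulls to receive Alice's latest commit
cd bob
git pull -q
pulled_text="$(tail -n 1 myfile.txt)"
echo "$pulled_text"
```

With a real GitHub repository the same push and pull commands apply; only the remote URL differs, and pushing requires write permission.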

Take advantage of git and GitHub

10 rules by Perez-Riverol et al. (2016)

Use GitHub to track your project

The backbone of GitHub is the distributed version control system Git. Every change, from fixing a typo to a complete redesign of the software, is tracked and uniquely identified. Although Git has a complex set of commands and can be used for rather complex operations, learning to apply the basics requires only a handful of new concepts and commands and will provide a solid ground to efficiently track code and related content for research projects.

Manage permission for repository access

Public projects on GitHub are visible to everyone, but write permission, i.e., the ability to directly modify the content of a repository, needs to be granted explicitly. As a repository owner, you can grant this right to other GitHub users. In addition to being owned by users, repositories can also be created and managed as part of teams and organizations.

Create SOP for branching and forking

Anyone with a GitHub account can fork any repository they have access to. This will create a complete copy of the content of the repository, while retaining a link to the original “upstream” version. It allows anyone to develop and test novel features with existing code and offers the possibility of contributing novel features, bug fixes, and improvements to documentation back into the original upstream repository (requested by opening a pull request) and becoming a contributor.
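
One possible convention, sketched here under assumptions, is to keep a remote named `upstream` for the original project and to make each change on its own branch. The repository below is a local throwaway, the branch name `fix-readme-typo` is hypothetical, and the upstream URL simply reuses the Spoon-Knife example; `git remote add` only records the URL and does not contact the network.

```shell
# Throwaway repository standing in for a local clone of your fork
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config --local user.name "Your Name"
git config --local user.email any.email.you.use@mail.com
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"

# On a real fork, "upstream" points at the original project
git remote add upstream https://github.com/octocat/Spoon-Knife

# One branch per change keeps unrelated work apart
git checkout -q -b fix-readme-typo
echo "Fixed wording" >> README.md
git commit -q -a -m "Fix wording in README"

current_branch="$(git rev-parse --abbrev-ref HEAD)"
upstream_url="$(git config --get remote.upstream.url)"
echo "$current_branch -> $upstream_url"
```

On GitHub, the branch would then be pushed to your fork and proposed to the upstream project through a pull request.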

Use tags and semantic versions

Tags can be used to label versions during the development process. Version numbering should follow “semantic versioning” practice, with the format X.Y.Z, where X is the major, Y the minor, and Z the patch version of the release, including possible meta information. Correct labeling allows developers and users to easily recover older versions, compare them, or simply use them to reproduce results described in publications.
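
As a rough illustration of tagging, the sketch below creates two annotated tags in a throwaway repository and then recovers a file as it was at the older tag. The file name, tag numbers, and messages are hypothetical.

```shell
# Throwaway repository with a local identity (annotated tags record a tagger)
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config --local user.name "Your Name"
git config --local user.email any.email.you.use@mail.com

echo "analysis v1" > analysis.txt
git add analysis.txt
git commit -q -m "Add analysis"
git tag -a v1.2.0 -m "Version used for the submitted manuscript"

echo "analysis v2" > analysis.txt
git commit -q -a -m "Update analysis after review"
git tag -a v1.10.0 -m "Version used for the revised manuscript"

# Sort tags as version numbers, newest first,
# so that v1.10.0 ranks above v1.2.0
latest_tag="$(git tag --list --sort=-v:refname | head -n 1)"

# Recover the file exactly as it was at the older tag
old_analysis="$(git show v1.2.0:analysis.txt)"
echo "$latest_tag / $old_analysis"
```

Tags are not pushed by default; sharing them with a remote would need an explicit `git push origin <tagname>`.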

Continuous integration and automated code testing

Code testing is necessary to detect possible bugs introduced by new features or changes in the code or dependencies, as well as to detect wrong results, often known as logic errors, in which the source code produces a different result than what was intended. Continuous integration provides a way to automatically and systematically run a series of tests to check the integrity and performance of code, a task that can be automated through GitHub.
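
A continuous integration service typically runs a test script and looks at whether it succeeded. The sketch below shows what such a script might contain, under assumptions: `count_rows` is a hypothetical helper invented for this example, and the CSV fixture is made-up data.

```shell
# Hypothetical function under test: count the data rows in a CSV file,
# excluding the header line
count_rows() {
  tail -n +2 "$1" | grep -c ''
}

# Build a small fixture with one header line and three data rows
fixture="$(mktemp)"
printf 'id,value\n1,10\n2,20\n3,30\n' > "$fixture"

# Compare the actual result with the expected one
expected=3
actual="$(count_rows "$fixture")"
if [ "$actual" -eq "$expected" ]; then
  test_status=0
  echo "PASS: count_rows"
else
  test_status=1
  echo "FAIL: count_rows returned $actual, expected $expected"
fi
```

In a real test script the last line would be `exit "$test_status"`, since a non-zero exit status is what marks the run as failed.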

Automate more tasks via webhooks and GitHub actions

More than just code compilation and testing can be integrated into your software project: GitHub hooks can be used to automate numerous tasks to help improve the overall quality of your project. You might consider generating the documentation upon code/documentation modification, e.g., by using Quarto for R scripts or Sphinx for Python code.

Openly and Collaboratively Discuss, Address, and Close Issues

GitHub issues are a great way to keep track of bugs, tasks, feature requests, and enhancements. While classical issue trackers are primarily intended to be used as bug trackers, GitHub issue trackers follow a different philosophy: each tracker has its own section in every repository and can be used to trace bugs, new ideas, and enhancements by using a powerful tagging system. The main objective of issues in GitHub is promoting collaboration and providing context by using cross-references.

Make your code easily citable

GitHub now integrates with archiving services such as Zenodo and Figshare, enabling DOIs to be assigned to code repositories. By default, Zenodo creates an archive of a repository each time a new release is created in GitHub, ensuring the cited code remains up to date. Once the DOI has been assigned, it can be added to literature information resources such as Europe PubMed Central.

Promote and discuss your project

GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blog websites, help pages, manuals, tutorials, and websites related to specific projects.

Social feature in GitHub: follow and watch

In the same way researchers are following developments in their field, scientific programmers could follow publicly available projects that might benefit their research. GitHub enables this functionality by following other GitHub users (see also Rule 2) or watching the activity of projects, which is a common feature in many social media platforms.

Reference

Braga, Pedro Henrique Pereira, Katherine Hébert, Emma J. Hudgins, Eric R. Scott, Brandon P. M. Edwards, Luna L. Sánchez Reyes, Matthew J. Grainger, et al. 2023. “Not Just for Programmers: How GitHub Can Accelerate Collaborative and Reproducible Research in Ecology and Evolution.” Methods in Ecology and Evolution 14 (6): 1364–80. https://doi.org/10.1111/2041-210x.14108.
Peikert, Aaron, Caspar J. van Lissa, and Andreas M. Brandmaier. 2021. “Reproducible Research in R: A Tutorial on How to Do the Same Thing More Than Once.” Psych 3 (4): 836–67. https://doi.org/10.3390/psych3040053.
Perez-Riverol, Yasset, Laurent Gatto, Rui Wang, Timo Sachsenberg, Julian Uszkoreit, Felipe da Veiga Leprevost, Christian Fufezan, et al. 2016. “Ten Simple Rules for Taking Advantage of Git and GitHub.” Edited by Scott Markel. PLOS Computational Biology 12 (7): e1004947. https://doi.org/10.1371/journal.pcbi.1004947.
Ram, Karthik. 2013. “Git Can Facilitate Greater Reproducibility and Increased Transparency in Science.” Source Code for Biology and Medicine 8 (1). https://doi.org/10.1186/1751-0473-8-7.