Structure of the lecture

In this lecture, we will first give an overview of the folder organization for the projects. In the second part of the lecture, we will introduce the topic of the projects.

1. Keeping files organized

Choose a proper folder structure

An important and often overlooked aspect of running a project is its organization: if you design the folder structure well, you will considerably reduce the work needed to understand and recover your results later.

A project folder tends to explode into a chaotic mess pretty quickly. This is so well known that it has become a classic joke about academia.

In some fortunate cases, years of frustration crystallize on top of successful codes in the form of dark humour mocking the user (looking at you, Gromacs). Since this is not always the case, let’s try to think of a decent structure.

Everything in the right place

Each project should follow a very similar folder structure. This has several advantages:

  1. You always know where to find things!
  2. It is easier to automate common tasks (e.g. copying files, launching jobs).
  3. The folder structure will make you think about your workflow.
  4. Your work will be easy to share and merge with others.
  5. Other people will be able to pick up your work.

The structure

This is a template folder structure designed to partially solve the issue, provided you follow it as intended.

├── bin                           -> All compiled programs and scripts. Compiled programs are ignored by git.
├── data                          -> Data folder, ignored by git
│   ├── 00_external               -> External data, if present
│   │   └──
│   ├── 01_raw                    -> Raw data produced by simulations
│   │   └──
│   ├── 02_processed              -> Intermediary data
│   │   └──
│   └── 03_analyzed               -> Results after analyses. This is used to produce figures.
│   │   └──
├── doc                           -> Extra documentation if needed (e.g. software docs).
├── notebooks                     -> Jupyter notebooks.
├── references                    -> Store literature here, ignored by git.
│   └──
├── report                        -> Should eventually lead to the final paper
│   ├── imgs
│   └── report_template.tex
├── src
│   ├── analysis                  -> Analysis scripts etc.
│   ├── external                  -> Third-party software. Ignored by git.
│   ├── production                -> simulation software and scripts used to produce raw data.
│   ├── tools                     -> Utilities (scripts)
│   └── visualization             -> Visualization scripts.
├── makefile                      -> The makefile should reproduce all figures and processed data.
├── conda_QCB_essentials.yml      -> conda environment
├──                     -> README for the project
└── test                          -> all tests on analysis scripts and production software

We are always improving this, and if you have comments they are welcome!
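For readers who want to see what the template amounts to, the sketch below creates the same directory skeleton by hand and writes a .gitignore for the folders marked as ignored in the tree. The project name my_project and the PROJECT_NAME variable are placeholders, and the .gitignore entries are an assumption derived from the comments in the tree, not the template's actual file. Files such as the makefile, the conda environment file, and the report template are not created here.

```shell
#!/bin/sh
set -eu

# Hypothetical project name; override by setting PROJECT_NAME in the environment.
project="${PROJECT_NAME:-my_project}"

# Create every directory shown in the tree above.
for dir in \
    bin \
    data/00_external data/01_raw data/02_processed data/03_analyzed \
    doc notebooks references report/imgs \
    src/analysis src/external src/production src/tools src/visualization \
    test
do
    mkdir -p "$project/$dir"
done

# Assumed .gitignore, based on the "ignored by git" notes in the tree.
cat > "$project/.gitignore" <<'EOF'
data/
references/
src/external/
EOF

echo "Created project skeleton in $project"
```

Running the script twice is harmless, since mkdir -p does not complain about directories that already exist.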

Setting up the folder structure on the cluster

The folder structure above can be installed automatically with cookiecutter. Those interested can consult the cookiecutter documentation to see how the project template is set up. Let’s install everything.

I. Install cookiecutter

On your laptop

conda activate <QCB-env>
conda install cookiecutter

On the cluster

module load python-<version> ... #use the above commands to find a version of python >= 3.6!
pip3 install --user cookiecutter
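
Before installing, it may help to confirm that the loaded interpreter meets the >= 3.6 requirement. A minimal check, assuming python3 is the command provided by the module you loaded (the python_ok variable is only illustrative):

```shell
# The python3 call exits with status 0 only for version 3.6 or newer.
if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 6) else 1)'; then
    echo "python3 is recent enough"
    python_ok=yes
else
    echo "python3 is missing or older than 3.6; load a newer module" >&2
    python_ok=no
fi
```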

II. Use cookiecutter to get the folder

cd <some_directory_I_like>
cookiecutter <template_URL>

Cookiecutter will ask you a few questions, including the project name, your name, and your email. Fill them in correctly.

You will end up with a folder like the one above.

Of course, to use the conda environment you first need to install conda. If you want, you can use wget to download Miniconda to your home directory on the cluster and then proceed as usual.
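
As a rough sketch of that download step, the function below fetches the Miniconda installer for 64-bit Linux and runs it without prompts. The install location $HOME/miniconda3 and the function name install_miniconda are assumptions made for illustration; check that the installer file name matches the cluster's architecture before using it.

```shell
# Installer for 64-bit Linux; adjust the file name for other platforms.
installer="Miniconda3-latest-Linux-x86_64.sh"
url="https://repo.anaconda.com/miniconda/$installer"

# Assumed install location; any directory in your home works.
prefix="$HOME/miniconda3"

install_miniconda() {
    # Download the installer into the home directory.
    wget -P "$HOME" "$url"
    # -b runs the installer without prompts, -p sets the install location.
    bash "$HOME/$installer" -b -p "$prefix"
}
```

Nothing is downloaded until you call install_miniconda. Afterwards, running "$prefix/bin/conda" init bash sets up your shell so that conda activate works in new sessions.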

2. Projects

For this part of the lecture follow the slides here.

Download the .pdb file of the protein you need from the RCSB PDB database. To modify the .pdb file we are going to use Chimera.
Select only one protein model and delete the ligands, the ions, and the water. Check that the residue numbering corresponds to the one indicated in the UniProt documentation.

Use the Chimera command line to change amino acids and obtain the mutated forms.

For example:

swapaa ala :125

This command converts the residue at position 125 into an alanine.
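
When several mutants are needed, typing each command by hand gets tedious. As a rough sketch, the loop below writes one small command file per mutation, each holding the corresponding swapaa line. The mutation list and the mutant_*.cmd file names are hypothetical placeholders; only the swapaa syntax comes from the example above.

```shell
# Hypothetical mutations as "residue_type:position" pairs; replace with your own.
mutations="ala:125 gly:130"

for m in $mutations; do
    res="${m%%:*}"   # text before the colon, e.g. ala
    pos="${m##*:}"   # text after the colon, e.g. 125
    # One command file per mutant, holding the swapaa line for that mutation.
    echo "swapaa $res :$pos" > "mutant_${res}${pos}.cmd"
done
```

The lines in each file can then be pasted into the Chimera command line once the cleaned structure is loaded.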