
Overview

ColabFold (https://github.com/sokrypton/ColabFold), a wrapper around AlphaFold, is now available on Gadi in the if89 project.

Our installation follows the "ColabFold on your local computer" approach described by Yoshitaka Moriwaki (https://github.com/YoshitakaMo/localcolabfold), but all the necessary software is compiled natively on Gadi instead of being installed via conda.

You will need to join project if89 to use this software.

v1.4.0 


A ColabFold run on Gadi consists of two distinct steps:

1) Database search

2) Protein folding prediction

Database search 

The database search is done using the MMseqs2 program against the databases installed locally in the if89 project. It normally takes 1-2 h on 12 CPUs with 100 GiB of memory.

The search does not use a GPU, so it is best run in the normal queue.

An example PBS script:

#!/bin/bash
 
#PBS -lncpus=12
#PBS -lmem=100GB
#PBS -lwalltime=3:00:00
#PBS -ljobfs=1GB
#PBS -l wd
#PBS -l storage=gdata/if89
 
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
 
# Load module, always specify version number.
module use /g/data/if89/apps/modulefiles
module load colabfold_batch/1.4.0
 
colabfold_search --db-load-mode 1 --threads 12 input.fasta $COLABFOLDDIR/database output_directory

The results will be written to output_directory and will contain one a3m file (or several, if input.fasta contained multiple sequences).
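Since colabfold_search writes one a3m file per input sequence, counting the FASTA headers beforehand tells you how many output files to expect. A minimal shell sketch (input.fasta is the example file name used above):

```shell
# Count FASTA headers: colabfold_search writes one a3m file per
# sequence, so this is the number of a3m files to expect.
count_seqs() {
    grep -c '^>' "$1"
}
```

For example, `count_seqs input.fasta` before submitting the search job.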

Protein folding prediction 

The prediction is best run in the gpuvolta queue and may take up to 24 h.

An example PBS script:

#!/bin/bash
 
#PBS -q gpuvolta
#PBS -lncpus=12,ngpus=1
#PBS -lmem=48GB
#PBS -lwalltime=12:00:00
#PBS -ljobfs=1GB
#PBS -l wd
#PBS -l storage=gdata/if89
 
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
 
# Load module, always specify version number.
module use /g/data/if89/apps/modulefiles
module load colabfold_batch/1.4.0
 
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax a3m_file predicted_dir

The predicted PDB files will be written to the predicted_dir directory.

Note that colabfold_batch accepts different types of input and can even work with a3m files obtained from a different database search, not necessarily from colabfold_search.
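The two steps can also be chained so the prediction job only starts after the search job finishes successfully, using a standard PBS afterok dependency. A sketch (search.sh and predict.sh are hypothetical names for the two PBS scripts above; the helper only builds the command string so it can be inspected before submitting for real):

```shell
# Build the qsub command that submits the prediction job with an
# afterok dependency on the search job, so it only starts once the
# search has completed successfully. In a real session the job ID
# comes from: search_id=$(qsub search.sh)
make_predict_cmd() {
    local search_id=$1
    printf 'qsub -W depend=afterok:%s predict.sh\n' "$search_id"
}
```

For example, `make_predict_cmd 12345.gadi-pbs` prints the submission line to run next.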

Run

$ module use /g/data/if89/apps/modulefiles
$ module load colabfold_batch/1.4.0
 
$ colabfold_batch -h

to see what options are supported.

Notes about multimer predictions

ColabFold supports two AlphaFold multimer models, which can be selected with the --model-type AlphaFold2-multimer-v1 or --model-type AlphaFold2-multimer-v2 option in colabfold_batch. These multimer prediction runs require a specially formatted a3m input file, which can easily be generated from the set of a3m files obtained from the database search of a multi-sequence FASTA file. This does not take long and can be run on a login node using the following commands:

$ cd output_directory
 
$ module use /g/data/if89/apps/modulefiles
$ module load colabfold_batch/1.4.0
 
$ $COLABFOLDDIR/a3m2multi.sh

This will combine the a3m files located in output_directory, which normally have names 0.a3m, 1.a3m, 2.a3m, etc., and produce a total.a3m file in the required format. The colabfold_batch line in the PBS file above then becomes:

colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax --model-type AlphaFold2-multimer-v2 output_directory/total.a3m predicted_dir
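Before running a3m2multi.sh it is worth confirming that the search actually produced one numbered a3m file per chain. A small check, assuming the 0.a3m, 1.a3m, ... naming described above:

```shell
# Count the numbered per-chain a3m files (0.a3m, 1.a3m, ...) in a
# search output directory; the count should match the number of
# chains in the complex.
count_chain_a3m() {
    ls "$1"/[0-9]*.a3m 2>/dev/null | wc -l
}
```

For example, `count_chain_a3m output_directory` for a three-chain complex should print 3.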

Known problems

1) During testing of the colabfold installation, several of our runs with --model-type AlphaFold2-multimer-v1 ended with an error:

 ValueError: The number of positions must match the number of atoms

This error indicates that the Amber minimisation as implemented in AlphaFold was not able to refine the predicted protein. We have not seen this error in --model-type AlphaFold2-multimer-v2 calculations, but our tests are limited.

2) When trying to predict the structure of a large protein (more than about 3000 amino acids), the job in the gpuvolta queue may fail due to memory problems. This indicates that the 32 GiB of GPU memory available on the gpuvolta GPUs is not enough to run the prediction. In this case, try the dgxa100 queue. An example PBS script is below:

#!/bin/bash
 
#PBS -q dgxa100
#PBS -lncpus=16,ngpus=1
#PBS -lmem=96GB
#PBS -lwalltime=24:00:00
#PBS -ljobfs=1GB
#PBS -l wd
#PBS -l storage=gdata/if89
 
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
 
# Load module, always specify version number.
module use /g/data/if89/apps/modulefiles
module load colabfold_batch/1.4.0
 
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax --model-type AlphaFold2-multimer-v2 output_directory/total.a3m predicted_dir
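To judge in advance whether a target is likely to hit this limit, you can sum the residue counts over all sequences in the input FASTA file. A minimal sketch (the ~3000-residue figure is the rough threshold quoted above, not a hard limit):

```shell
# Total number of residues across all sequences in a FASTA file.
# Totals well above ~3000 tend to exhaust the 32 GiB GPUs in the
# gpuvolta queue and are candidates for the dgxa100 queue instead.
total_residues() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c
}
```

For example, `total_residues input.fasta` before choosing a queue.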

v1.5.2


The general structure is the same as in v1.4.0.

The main changes are: a different PDB100 database is used during the search; OpenMM v7.7.0; Python v3.10; and new options in colabfold_batch.

Database search

The database search is done using the MMseqs2 program against the databases installed locally in the if89 project. It normally takes 1-2 h on 12 CPUs with 100 GiB of memory. The search does not use a GPU, so it is best run in the normal queue.

An example PBS script:

#!/bin/bash
 
#PBS -lncpus=12
#PBS -lmem=100GB
#PBS -lwalltime=3:00:00
#PBS -ljobfs=1GB
#PBS -l wd
#PBS -l storage=gdata/if89
 
module use /g/data/if89/apps/modulefiles
module load colabfold_batch/1.5.2
 
colabfold_search --db-load-mode 1 --threads 12 input.fasta $COLABFOLDDIR/database output_directory

The results will be written to output_directory and will contain one a3m file (or several, if input.fasta contained multiple sequences).

Note that colabfold_search has additional parameters, which can be used, for example, to select a different database.

Run

$ module use /g/data/if89/apps/modulefiles
$ module load colabfold_batch/1.5.2
 
$ colabfold_search -h

to see available options.

Protein folding prediction

The prediction is best run in the gpuvolta queue and may take up to 24 h.

An example PBS script:

#!/bin/bash
 
#PBS -q gpuvolta
#PBS -lncpus=12,ngpus=1
#PBS -lmem=48GB
#PBS -lwalltime=12:00:00
#PBS -ljobfs=1GB
#PBS -l wd
#PBS -l storage=gdata/if89
 
module use /g/data/if89/apps/modulefiles
module load colabfold_batch/1.5.2
 
colabfold_batch --amber --num-recycle 3 --use-gpu-relax a3m_file predicted_dir

The predicted PDB files will be written to the predicted_dir directory.
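colabfold_batch writes several ranked models per input, and the file names typically carry a rank tag (the exact naming varies between ColabFold versions, so the rank pattern below is an assumption, not a guarantee). A quick way to pick out the top-ranked model:

```shell
# List the predicted PDB files and take the first in sorted order;
# ColabFold embeds a rank tag in the names, so a plain sort usually
# puts the best-ranked model first (naming is version-dependent).
best_model() {
    ls "$1"/*rank*.pdb 2>/dev/null | sort | head -n 1
}
```

For example, `best_model predicted_dir` prints the path of the highest-ranked model.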

Note that colabfold_batch accepts different types of input and can even work with a3m files obtained from a different database search, not necessarily from colabfold_search.

Run

$ module use /g/data/if89/apps/modulefiles
$ module load colabfold_batch/1.5.2
 
$ colabfold_batch -h

to see what options are supported.

Authors: Javed Shaikh, Mohsin Ali, Andrey Bliznyuk, Adam Huttner-Koros