This repository contains the Python source code used to create the AVIDbase (Antibody VHH Interaction Database) as part of the associated publication (link coming soon!).
Authors: Tadej Medved, Jurij Lah, Goran Miličić & San Hadži
AVIDbase is a curated dataset of existing nanobody-antigen crystal structures, with emphasis on reporting the "true" biological assembly (with respect to the nanobody-antigen interface). It is built with the following workflow:
- Information on existing PDB structures containing nanobodies in complex with other protein molecules is initially obtained from SAbDab, then the structures are downloaded from the RCSB PDB directly and input into the AVIDbase processing pipeline.
- Putative interfaces are automatically compiled (assembly_annotation.py) into an Excel file (/example/annotation_input_table_UNMODIFIED.xlsx), then manually checked for structural or annotation errors by the authors.
- The amended table, annotation_input_table_MANUAL.xlsx, is then used as input to assembly_generation.py to generate the initial database and process the structures/metadata (/example/assembly_generation_UNMODIFIED.xlsx). In this step, .pdb files of individual interfaces are extracted from the raw CIF files and standardized with respect to chain IDs, numbering, and removal of expression tags, crystallization agents, etc. All biologically relevant non-protein species, such as posttranslational modifications and cofactors, along with missing density segments near the interface, are explicitly annotated.
- In the final step, the structures are sorted into subsets based on interface integrity (see AVIDbase-prot below), any annotation errors are corrected, and structural redundancy is computed. The final data is stored in AVIDbase.xlsx.
All data and structures for the 9th Feb 2026 version of AVIDbase are available on Zenodo: 10.5281/zenodo.20488703.
Data tables (cutoff date: 9th Feb 2026):
- annotation_input_table_MANUAL.xlsx: exhaustive manual annotation of all nanobody-antigen interfaces identified in PDB structures available in SAbDab. Columns with manual annotations are highlighted in yellow.
- AVIDbase.xlsx: full dataset of biologically accurate nanobody-antigen interfaces. Each row corresponds to a single interface.
AVIDbase.xlsx is divided into 4 subsets:
- AVIDbase-nr: nonredundant nanobody-antigen dataset (689 total structures).
- AVIDbase-prot: subset of AVIDbase-nr; contains only interfaces that:
  - do not contain missing density near the interface
  - do not possess significant contacts with non-protein moieties
  - have high or medium interface integrity (see paper and SI)
  - have resolution <= 3.5 Å
- AVIDbase-low: remaining low-integrity nonredundant interfaces
- AVIDbase-r: dataset of structures redundant to AVIDbase-nr, either:
  - copies of equivalent interfaces found in the asymmetric unit
  - interfaces from equivalent crystal structures with lower resolution or interface integrity (missing density/glycans/cofactors closer to the interface)
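The AVIDbase-prot criteria listed above can be read as a simple row filter. The following is a minimal sketch, not the code used by the pipeline. The dictionary keys `id`, `missing_density`, `nonprotein_contacts` and `resolution` are hypothetical names chosen for illustration, and the `high`/`medium`/`low` labels for `interface_integrity` are assumed.

```python
def is_prot_candidate(row, max_resolution=3.5):
    """Return True if a row meets the four AVIDbase-prot criteria (illustrative only)."""
    return (
        not row["missing_density"]          # hypothetical key: missing density near the interface
        and not row["nonprotein_contacts"]  # hypothetical key: significant non-protein contacts
        and row["interface_integrity"] in {"high", "medium"}  # assumed labels
        and row["resolution"] <= max_resolution                # resolution in angstroms
    )


# Made-up rows, used only to exercise the filter.
rows = [
    {"id": "a", "missing_density": False, "nonprotein_contacts": False,
     "interface_integrity": "high", "resolution": 2.1},
    {"id": "b", "missing_density": False, "nonprotein_contacts": True,
     "interface_integrity": "high", "resolution": 1.9},
    {"id": "c", "missing_density": False, "nonprotein_contacts": False,
     "interface_integrity": "medium", "resolution": 3.8},
]

prot_ids = [row["id"] for row in rows if is_prot_candidate(row)]
print(prot_ids)
```

Only the first row passes: the second has non-protein contacts and the third exceeds the resolution cutoff.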
Currently tested platforms:
- linux-64
- win-64
conda create -n avidbase
conda activate avidbase
conda install --file requirements.txt
Additional requirements:
Rosetta 3.x
Below is a step-by-step guide for recreating the final AVIDbase dataset.
For example output structures see /example.
Generates an Excel table based on the given input PDB codes. SAbDab summary files can also be passed, in which case they are pre-filtered exclusively for X-ray crystal structures. Any structures not already present in the structure directory are downloaded from the RCSB PDB.
python assembly_annotation.py -o output/intermediary_tables -s output/structures/CIF_files --sabdab-files input_files/sabdab/20260209_protein_summary.tsv input_files/sabdab/20260209_peptide_summary.tsv --verbose
All options:
python assembly_annotation.py --help
Output: output/intermediary_tables/annotation_input_table_UNMODIFIED.xlsx (for an example, see the /example directory).
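The X-ray pre-filtering of SAbDab summary files mentioned for this step could look roughly like the sketch below. It is not the code in assembly_annotation.py. The column names `pdb` and `method`, the method label `X-RAY DIFFRACTION`, and the example PDB codes are assumptions made for illustration.

```python
import csv
import io


def xray_pdb_codes(tsv_text, pdb_column="pdb", method_column="method"):
    """Return unique PDB codes whose method is X-ray diffraction, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    codes = []
    for record in reader:
        is_xray = record[method_column].strip().upper() == "X-RAY DIFFRACTION"
        # A summary file may list one row per chain pairing, so deduplicate codes.
        if is_xray and record[pdb_column] not in codes:
            codes.append(record[pdb_column])
    return codes


# Made-up summary content with assumed column names.
example_tsv = (
    "pdb\tHchain\tmethod\n"
    "1abc\tA\tX-RAY DIFFRACTION\n"
    "1abc\tB\tX-RAY DIFFRACTION\n"
    "2xyz\tA\tELECTRON MICROSCOPY\n"
)

xray_codes = xray_pdb_codes(example_tsv)
print(xray_codes)
```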
Before proceeding to step 2, it's recommended to manually assess the correctness of the assigned biological interfaces and modify/exclude rows where necessary.
An example is given in annotation_input_table_MANUAL.xlsx; columns expected to be modified by the user are highlighted in yellow. For an explanation of the column names, see Supplementary Table S1 of the associated paper.
To exclude a row from further consideration, change its value in the keep column from 1 to 0.
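The effect of the keep column can be pictured with a short sketch. It assumes the table rows have already been loaded as dictionaries (loading the Excel file is not shown), and every field other than `keep` is a made-up placeholder. It illustrates the convention and is not the selection logic in assembly_generation.py.

```python
def rows_to_keep(rows):
    """Return only rows whose keep value is 1; rows set to 0 are dropped."""
    return [row for row in rows if int(row["keep"]) == 1]


# Made-up rows; 'entry' is a hypothetical identifier column.
table = [
    {"entry": "row_1", "keep": 1},
    {"entry": "row_2", "keep": 0},  # excluded after manual inspection
    {"entry": "row_3", "keep": 1},
]

kept_entries = [row["entry"] for row in rows_to_keep(table)]
print(kept_entries)
```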
This step extracts relevant metadata for each structure record and writes standardized PDB structures into a single directory. It also computes interface_integrity and intra_redundancy, but does not sort or filter the dataset.
python assembly_generation.py -f annotation_input_table_MANUAL.xlsx -o output/intermediary_tables -s output/structures/CIF_files --output-structure-dir output/structures/AVIDbase --output-structures --verbose
All options:
python assembly_generation.py --help
Output: output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx (for an example, see the /example directory).
Compute inter_redundancy and epitope_cluster, then sort the full dataset into nonredundant (AVIDbase-nr), redundant (AVIDbase-r), high+medium integrity protein-only (AVIDbase-prot), and low integrity/low resolution (AVIDbase-low).
python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase --output-structures --verbose
Rerunning the computation when structures were already sorted:
python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase/AVIDbase-nr/full --verbose
All options:
python assembly_sorting.py --help
Output: AVIDbase.xlsx (individual sheets: AVIDbase-nr, AVIDbase-r, AVIDbase-prot, AVIDbase-low)
Relax the protein-only subset AVIDbase-prot with a custom Rosetta XML script. The input file avidbase_prot_list.txt should contain the path to each structure in output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot, one path per line.
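The expected layout of the list file, one structure path per line, can be illustrated with a small sketch. It is a hypothetical helper and not the implementation of pdblist_gen.py; the function name `write_pdb_list` and the `.pdb` glob pattern are assumptions. The demonstration runs in a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def write_pdb_list(structure_dir, list_file):
    """Write the path of every .pdb file in structure_dir to list_file, one per line."""
    paths = sorted(str(path) for path in Path(structure_dir).glob("*.pdb"))
    Path(list_file).write_text("".join(path + "\n" for path in paths))
    return paths


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdb", "a.pdb", "notes.txt"):
        (Path(tmp) / name).touch()
    list_path = Path(tmp) / "avidbase_prot_list.txt"
    write_pdb_list(tmp, list_path)
    demo_lines = list_path.read_text().splitlines()

print(len(demo_lines))
```

Non-structure files such as notes.txt are skipped, so the demonstration list contains two lines.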
See:
python pdblist_gen.py -f avidbase_prot_list.txt -pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot
(Linux or WSL)
ROSETTASCRIPTS=<path to RosettaScripts executable>
ROSETTA_N_CPUS=<number of CPUs for parallelization (if applicable)>
mpiexec -np $ROSETTA_N_CPUS $ROSETTASCRIPTS -l avidbase_prot_list.txt -out:path:pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -out:path:score output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -parser:protocol relax_and_score.xml @rosetta_flags.txt
For custom filtering of AVIDbase. Optionally outputs an interactive 3D plot visualizing the individual clusters (with structure IDs).
Example:
python cluster_epitopes.py -f AVIDbase.xlsx -o output/cluster_plots -s output/structures/AVIDbase/AVIDbase-nr/full --dataset-slice 'ag_name|lysozyme,ag_source_organism|gallus-gallus' --plot-clusters --verbose
All options:
python cluster_epitopes.py --help
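The --dataset-slice value in the example appears to be a comma-separated list of column|value pairs. The sketch below shows one way such a string could be split into a column-to-value mapping. It is an assumption about the format based only on the example above, not the parser used in cluster_epitopes.py, and how values are then matched against the table is not covered.

```python
def parse_dataset_slice(spec):
    """Split 'col|value,col|value' into a {column: value} dictionary."""
    filters = {}
    for pair in spec.split(","):
        column, separator, value = pair.partition("|")
        if not separator:
            raise ValueError(f"expected 'column|value', got {pair!r}")
        filters[column.strip()] = value.strip()
    return filters


slice_filters = parse_dataset_slice("ag_name|lysozyme,ag_source_organism|gallus-gallus")
print(slice_filters)
```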