msa_path <- system.file("extdata", "rh5_pfalc.fasta", package = "evo3D")
pdb_path <- system.file("extdata", "rh5_6mpv_AB.pdb", package = "evo3D")
# These are examples shipped with the package - your paths may resemble #
# msa_path <- 'myfolder/sequences.fasta'
# pdb_path <- 'myfolder/structure.pdb'Tutorial 1: Getting started
evo3D links multiple sequence alignments MSAs (nucleotide or protein) to PDB/mmCIF structures enabling calculation of statistics on three-dimensional windows (“spatial haplotypes”).
This tutorial covers two examples. The first introduces the minimum required inputs, demonstrates how to run evo3D, and explains how to interpret the core output table (results$evo3d_df). The second example adds statistics, and demonstrates how to write and visualize outputs in PyMol.
Assumptions
- Nucleotide MSAs must be in-frame and intron-free
- Examples below assume evo3D is already installed (see https://github.com/bbroyle/evo3D for installation instructions)
Example 1 (Quick start)
The package includes several tutorial datasets to help users become familiar with the evo3D workflow. This first example provides a minimal end-to-end run.
Run evo3D (defualts)
By default, evo3D constructs surface windows consisting of a central residue and all neighboring surface residues (relative solvent accessibility >= 0.1) within 15 Å.
library(evo3D)
results <- run_evo3d(msa = msa_path,
pdb = pdb_path,
verbose = 0) # turning off progress printing #using BLOSUM65
run_evo3d() is a high-level wrapper that coordinates the full evo3D workflow: reference sequence construction, spatial window generation, MSA↔︎structure alignment, and (optionally) statistic calculation.
Structure of evo3D results
str(results, max.level = 1)List of 6
$ evo3d_df :'data.frame': 368 obs. of 14 variables:
$ final_msa_subsets: NULL
$ msa_info_sets :List of 1
$ pdb_info_sets :List of 1
$ aln_info_sets :List of 1
$ call_info :List of 13
- attr(*, "class")= chr "evo3D_results"
The output of run_evo3d() is always a six-element list. The primary result is results$evo3d_df, a table that records the MSA↔︎PDB mapping, the spatial window associated with each codon, and any calculated statistics. The remaining elements store additional information about the input data, transformations, and run parameters. They will be covered in detail a later tutorial.
Structure of evo3d_df
Every codon position in the input MSA(s) is represented as a row in evo3d_df. Some codons may not map to a residue in the PDB structure (for example, due to unresolved residues in the structure). In these cases, no spatial window is assigned, and the codon is not included in any other spatial windows. These rows will contain NA values for window-related fields and statistics.
results$evo3d_df[29:31,] codon_id codon msa pdb residue_id ref_aa pdb_aa
29 29_msa1 29 msa1 pdb1 - V -
30 30_msa1 30 msa1 pdb1 175_B_ I I
31 31_msa1 31 msa1 pdb1 176_B_ I I
codon_patch
29 <NA>
30 30_msa1+31_msa1+33_msa1+34_msa1+32_msa1+35_msa1+173_msa1+36_msa1+76_msa1+37_msa1+181_msa1+177_msa1+330_msa1+73_msa1+170_msa1+174_msa1+38_msa1+329_msa1+176_msa1+179_msa1+323_msa1+185_msa1+172_msa1+182_msa1+77_msa1+169_msa1+68_msa1+178_msa1+69_msa1+39_msa1+325_msa1+333_msa1+74_msa1+71_msa1+171_msa1+175_msa1+328_msa1+278_msa1+78_msa1+186_msa1+334_msa1+40_msa1+80_msa1+188_msa1
31 31_msa1+30_msa1+32_msa1+34_msa1+33_msa1+35_msa1+36_msa1+330_msa1+323_msa1+37_msa1+38_msa1+329_msa1+325_msa1+39_msa1+328_msa1+278_msa1+181_msa1+173_msa1+319_msa1+333_msa1+76_msa1+73_msa1+177_msa1+334_msa1+170_msa1+40_msa1+185_msa1+281_msa1+332_msa1+174_msa1+41_msa1
codon_len unique_codon gap_map_count exposed max_dist msa_subset_id
29 NA NA NA NA NA <NA>
30 44 44 0 TRUE 14.91 175_B__pdb1
31 31 31 0 TRUE 14.98 176_B__pdb1
A brief explanation of columns:
| Column | Description |
|---|---|
| codon_id | Unique identifier combining codon index and msa source. |
| codon | codon position in input MSA (indexing starts at 1) |
| msa | Source MSA identifier (useful when analyzing complexes with multiple MSAs). |
| pdb | Source structure identifier (useful when multiple PDBs are provided). |
| residue_id | Unique structural residue identifier (residue index _ chain _ insertion code). |
| ref_aa | Amino acid sequence derived from a constructed MSA reference sequence. Used for alignment to the PDB sequence. |
| pdb_aa | Amino acid sequence(s) in PDB file. |
| codon_patch | spatial neighborhood recorded as codon_id(s), separated by + |
| codon_len | Number of structural residues included in the spatial window. |
| unique_codon | Number of unique codons underlying those residues (can differ from codon_len in homomultimers). |
| gap_map_count | Number of residues present in the structure-derived windows but aligned to MSA gaps; excluded from final windows. |
| exposed | Logical flag indicating whether the residue passed exposure filtering. |
| max_dist | Maximum distance (Å) between the central residue and any included neighbor. |
| msa_subset_id | Key identifying the spatial haplotype subset in final_msa_subsets (retained when detail_level = 2). |
Example 2 (Adding statistics)
Using the same input data, we now enable calculation of:
per-codon statistic (amino acid site_entropy)
averaging of per-codon statistic within spatial windows (avg_patch_entropy)
patch-level haplotype statistic (amino acid block_entropy)
We will also leave verbose as default 1 to see the internal workflow orgnization
library(evo3D)
msa_path <- system.file("extdata", "rh5_pfalc.fasta", package = "evo3D")
pdb_path <- system.file("extdata", "rh5_6mpv_AB.pdb", package = "evo3D")
results <- run_evo3d(msa = msa_path, pdb = pdb_path,
stat_controls = list(calc_site_entropy = TRUE,
calc_avg_patch_entropy = TRUE,
calc_block_entropy = TRUE))STEP 0: Setting up run information and controls...
STEP 1: Converting MSAs to reference peptide sequences...
STEP 1.5: Caching PDBs and resolving auto chains...
STEP 2: Converting PDBs to patches...
STEP 3: Aligning MSAs to PDBs...
using BLOSUM65
STEP 4: Skipping collapsing windows to codon level...
STEP 5: Calculating patch stats...
STEP 6: No output directory specified in `output_controls$output_dir`, skipping writing results.
---- RUN COMPLETE ----
Returning to evo3d_df we will see it populated with the calculated statistics.
results$evo3d_df[50:54,] codon_id codon msa pdb residue_id ref_aa pdb_aa
50 50_msa1 50 msa1 pdb1 195_B_ H H
51 51_msa1 51 msa1 pdb1 196_B_ K K
52 52_msa1 52 msa1 pdb1 197_B_ S S
53 53_msa1 53 msa1 pdb1 198_B_ S S
54 54_msa1 54 msa1 pdb1 199_B_ T T
codon_patch
50 <NA>
51 51_msa1+52_msa1+53_msa1+198_msa1+199_msa1+209_msa1+55_msa1+207_msa1+197_msa1+200_msa1+208_msa1+57_msa1+201_msa1+213_msa1+56_msa1+195_msa1+196_msa1+58_msa1+212_msa1+202_msa1+205_msa1+311_msa1+307_msa1+203_msa1+301_msa1
52 52_msa1+51_msa1+53_msa1+55_msa1+209_msa1+213_msa1+207_msa1+56_msa1+208_msa1+57_msa1+198_msa1+199_msa1+212_msa1+58_msa1+200_msa1+197_msa1+201_msa1+217_msa1+302_msa1+216_msa1+59_msa1+301_msa1
53 53_msa1+52_msa1+51_msa1+55_msa1+56_msa1+57_msa1+209_msa1+58_msa1+198_msa1+197_msa1+213_msa1+199_msa1+207_msa1+59_msa1+200_msa1+208_msa1+60_msa1+201_msa1
54 <NA>
codon_len unique_codon gap_map_count exposed max_dist msa_subset_id
50 NA NA NA FALSE NA <NA>
51 25 25 0 TRUE 14.86 196_B__pdb1
52 22 22 0 TRUE 14.84 197_B__pdb1
53 18 18 0 TRUE 14.68 198_B__pdb1
54 NA NA NA FALSE NA <NA>
site_entropy mean_site_entropy block_entropy
50 0 NA NA
51 0 0.02134576 0.5336439
52 0 0.02425654 0.5336439
53 0 0.02964688 0.5336439
54 0 NA NA
Writing results to PyMol for visualization
Finally, we export block entropy and site entropy values to a PDB file for visualization in PyMol. The stat_name argument accepts up to two columns from evo3d_df, and writes them to the B-factor and occupancy fields (or just to B-factor if only one column name is provided).
write_stat_to_pdb(results,
stat_name = c('block_entropy', 'site_entropy'),
outfile = '~/evo3D_tutorials/results/tutorial1_rh5.pdb')~/evo3D_tutorials/results/tutorial1_rh5.pdb already exists, saving to: ~/evo3D_tutorials/results/tutorial1_rh5_20260313105924.pdb
In PyMol, both the B-factor (b) and occupancy (q) fields can be used to visualize mapped statistics on the protein structure. By default, evo3D writes missing values (NA) as -99 in the output PDB file and should be excluded from visualization.
The following commands illustrate a simple visualization workflow in PyMol:
color grey
select stat1, b > -99
select stat2, q > -99
spectrum b, white blue, selection=b
spectrum q, selection=stat2
Below, block entropy is shown mapped to the B-factor column
Next tutorials:
Tuning parameters
Protein complexes
Results object in detail
Custom statistics