Tutorial 1: Getting started

Author

Brad Broyles

evo3D links multiple sequence alignments MSAs (nucleotide or protein) to PDB/mmCIF structures enabling calculation of statistics on three-dimensional windows (“spatial haplotypes”).

This tutorial covers two examples. The first introduces the minimum required inputs, demonstrates how to run evo3D, and explains how to interpret the core output table (results$evo3d_df). The second example adds statistics, and demonstrates how to write and visualize outputs in PyMol.

Assumptions

Example 1 (Quick start)

The package includes several tutorial datasets to help users become familiar with the evo3D workflow. This first example provides a minimal end-to-end run.

msa_path <- system.file("extdata", "rh5_pfalc.fasta", package = "evo3D")
pdb_path <- system.file("extdata", "rh5_6mpv_AB.pdb", package = "evo3D")

# These are examples shipped with the package - your paths may resemble #
# msa_path <- 'myfolder/sequences.fasta'
# pdb_path <- 'myfolder/structure.pdb'

Run evo3D (defualts)

By default, evo3D constructs surface windows consisting of a central residue and all neighboring surface residues (relative solvent accessibility >= 0.1) within 15 Å.

library(evo3D)

results <- run_evo3d(msa = msa_path, 
                     pdb = pdb_path,
                     verbose = 0)  # turning off progress printing #
using BLOSUM65

run_evo3d() is a high-level wrapper that coordinates the full evo3D workflow: reference sequence construction, spatial window generation, MSA↔︎structure alignment, and (optionally) statistic calculation.

Structure of evo3D results

str(results, max.level = 1)
List of 6
 $ evo3d_df         :'data.frame':  368 obs. of  14 variables:
 $ final_msa_subsets: NULL
 $ msa_info_sets    :List of 1
 $ pdb_info_sets    :List of 1
 $ aln_info_sets    :List of 1
 $ call_info        :List of 13
 - attr(*, "class")= chr "evo3D_results"

The output of run_evo3d() is always a six-element list. The primary result is results$evo3d_df, a table that records the MSA↔︎PDB mapping, the spatial window associated with each codon, and any calculated statistics. The remaining elements store additional information about the input data, transformations, and run parameters. They will be covered in detail a later tutorial.

Structure of evo3d_df

Every codon position in the input MSA(s) is represented as a row in evo3d_df. Some codons may not map to a residue in the PDB structure (for example, due to unresolved residues in the structure). In these cases, no spatial window is assigned, and the codon is not included in any other spatial windows. These rows will contain NA values for window-related fields and statistics.

results$evo3d_df[29:31,]
   codon_id codon  msa  pdb residue_id ref_aa pdb_aa
29  29_msa1    29 msa1 pdb1          -      V      -
30  30_msa1    30 msa1 pdb1     175_B_      I      I
31  31_msa1    31 msa1 pdb1     176_B_      I      I
                                                                                                                                                                                                                                                                                                                                                                               codon_patch
29                                                                                                                                                                                                                                                                                                                                                                                    <NA>
30 30_msa1+31_msa1+33_msa1+34_msa1+32_msa1+35_msa1+173_msa1+36_msa1+76_msa1+37_msa1+181_msa1+177_msa1+330_msa1+73_msa1+170_msa1+174_msa1+38_msa1+329_msa1+176_msa1+179_msa1+323_msa1+185_msa1+172_msa1+182_msa1+77_msa1+169_msa1+68_msa1+178_msa1+69_msa1+39_msa1+325_msa1+333_msa1+74_msa1+71_msa1+171_msa1+175_msa1+328_msa1+278_msa1+78_msa1+186_msa1+334_msa1+40_msa1+80_msa1+188_msa1
31                                                                                                                31_msa1+30_msa1+32_msa1+34_msa1+33_msa1+35_msa1+36_msa1+330_msa1+323_msa1+37_msa1+38_msa1+329_msa1+325_msa1+39_msa1+328_msa1+278_msa1+181_msa1+173_msa1+319_msa1+333_msa1+76_msa1+73_msa1+177_msa1+334_msa1+170_msa1+40_msa1+185_msa1+281_msa1+332_msa1+174_msa1+41_msa1
   codon_len unique_codon gap_map_count exposed max_dist msa_subset_id
29        NA           NA            NA      NA       NA          <NA>
30        44           44             0    TRUE    14.91   175_B__pdb1
31        31           31             0    TRUE    14.98   176_B__pdb1

A brief explanation of columns:

Column Description
codon_id Unique identifier combining codon index and msa source.
codon codon position in input MSA (indexing starts at 1)
msa Source MSA identifier (useful when analyzing complexes with multiple MSAs).
pdb Source structure identifier (useful when multiple PDBs are provided).
residue_id Unique structural residue identifier (residue index _ chain _ insertion code).
ref_aa Amino acid sequence derived from a constructed MSA reference sequence. Used for alignment to the PDB sequence.
pdb_aa Amino acid sequence(s) in PDB file.
codon_patch spatial neighborhood recorded as codon_id(s), separated by +
codon_len Number of structural residues included in the spatial window.
unique_codon Number of unique codons underlying those residues (can differ from codon_len in homomultimers).
gap_map_count Number of residues present in the structure-derived windows but aligned to MSA gaps; excluded from final windows.
exposed Logical flag indicating whether the residue passed exposure filtering.
max_dist Maximum distance (Å) between the central residue and any included neighbor.
msa_subset_id Key identifying the spatial haplotype subset in final_msa_subsets (retained when detail_level = 2).

Example 2 (Adding statistics)

Using the same input data, we now enable calculation of:

  • per-codon statistic (amino acid site_entropy)

  • averaging of per-codon statistic within spatial windows (avg_patch_entropy)

  • patch-level haplotype statistic (amino acid block_entropy)

We will also leave verbose as default 1 to see the internal workflow orgnization

library(evo3D)

msa_path <- system.file("extdata", "rh5_pfalc.fasta", package = "evo3D")
pdb_path <- system.file("extdata", "rh5_6mpv_AB.pdb", package = "evo3D")

results <- run_evo3d(msa = msa_path, pdb = pdb_path, 
                     stat_controls = list(calc_site_entropy = TRUE,
                                          calc_avg_patch_entropy = TRUE,
                                          calc_block_entropy = TRUE))
STEP 0: Setting up run information and controls...
STEP 1: Converting MSAs to reference peptide sequences...
STEP 1.5: Caching PDBs and resolving auto chains...
STEP 2: Converting PDBs to patches...
STEP 3: Aligning MSAs to PDBs...
using BLOSUM65
STEP 4: Skipping collapsing windows to codon level...
STEP 5: Calculating patch stats...
STEP 6: No output directory specified in `output_controls$output_dir`, skipping writing results.
---- RUN COMPLETE ----

Returning to evo3d_df we will see it populated with the calculated statistics.

results$evo3d_df[50:54,]
   codon_id codon  msa  pdb residue_id ref_aa pdb_aa
50  50_msa1    50 msa1 pdb1     195_B_      H      H
51  51_msa1    51 msa1 pdb1     196_B_      K      K
52  52_msa1    52 msa1 pdb1     197_B_      S      S
53  53_msa1    53 msa1 pdb1     198_B_      S      S
54  54_msa1    54 msa1 pdb1     199_B_      T      T
                                                                                                                                                                                                                 codon_patch
50                                                                                                                                                                                                                      <NA>
51 51_msa1+52_msa1+53_msa1+198_msa1+199_msa1+209_msa1+55_msa1+207_msa1+197_msa1+200_msa1+208_msa1+57_msa1+201_msa1+213_msa1+56_msa1+195_msa1+196_msa1+58_msa1+212_msa1+202_msa1+205_msa1+311_msa1+307_msa1+203_msa1+301_msa1
52                             52_msa1+51_msa1+53_msa1+55_msa1+209_msa1+213_msa1+207_msa1+56_msa1+208_msa1+57_msa1+198_msa1+199_msa1+212_msa1+58_msa1+200_msa1+197_msa1+201_msa1+217_msa1+302_msa1+216_msa1+59_msa1+301_msa1
53                                                                  53_msa1+52_msa1+51_msa1+55_msa1+56_msa1+57_msa1+209_msa1+58_msa1+198_msa1+197_msa1+213_msa1+199_msa1+207_msa1+59_msa1+200_msa1+208_msa1+60_msa1+201_msa1
54                                                                                                                                                                                                                      <NA>
   codon_len unique_codon gap_map_count exposed max_dist msa_subset_id
50        NA           NA            NA   FALSE       NA          <NA>
51        25           25             0    TRUE    14.86   196_B__pdb1
52        22           22             0    TRUE    14.84   197_B__pdb1
53        18           18             0    TRUE    14.68   198_B__pdb1
54        NA           NA            NA   FALSE       NA          <NA>
   site_entropy mean_site_entropy block_entropy
50            0                NA            NA
51            0        0.02134576     0.5336439
52            0        0.02425654     0.5336439
53            0        0.02964688     0.5336439
54            0                NA            NA

Writing results to PyMol for visualization

Finally, we export block entropy and site entropy values to a PDB file for visualization in PyMol. The stat_name argument accepts up to two columns from evo3d_df, and writes them to the B-factor and occupancy fields (or just to B-factor if only one column name is provided).

write_stat_to_pdb(results, 
                  stat_name = c('block_entropy', 'site_entropy'),
                  outfile = '~/evo3D_tutorials/results/tutorial1_rh5.pdb')
~/evo3D_tutorials/results/tutorial1_rh5.pdb already exists, saving to: ~/evo3D_tutorials/results/tutorial1_rh5_20260313105924.pdb

In PyMol, both the B-factor (b) and occupancy (q) fields can be used to visualize mapped statistics on the protein structure. By default, evo3D writes missing values (NA) as -99 in the output PDB file and should be excluded from visualization.

The following commands illustrate a simple visualization workflow in PyMol:

color grey
select stat1, b > -99
select stat2, q > -99

spectrum b, white blue, selection=b
spectrum q, selection=stat2

Below, block entropy is shown mapped to the B-factor column

results/tutorial1_rh5_be.png

Next tutorials:

  1. Tuning parameters

  2. Protein complexes

  3. Results object in detail

  4. Custom statistics