Ron Levy group

Exploring
molecular
landscapes

FightAIDS@Home - Phase II

Introduction

FightAIDS@Home Phase II is a milestone in the collaboration between Dr. Arthur Olson and the Molecular Graphics Laboratory at the Scripps Research Institute and Dr. Ronald M. Levy's group at Temple University. Dr. Olson initiated the largest HIV virtual screening effort ever using his computational molecular docking software, AutoDock, over a decade ago using IBM's distributed volunteer computing grid, the World Community Grid (WCG), called FightAIDS@Home. As director of the HIV Interaction and Viral Evolution (HIVE) Center, an NIH funded HIV collaborative research center, the virtual screening results play a significant role informing a portion of the Center's research. The HIVE Center comprises a multidisciplinary team of scientists whom together aim to elucidate the structural and dynamic relationships between interacting HIV macromolecules in an effort to design future therapeutics to combat HIV and its evolution of drug resistance. Dr. Ronald Levy brings to the HIVE Center over 30 years of leadership and experience in the development and application of molecular dynamics simulations to study the structure and dynamics of proteins and their complexes.

During FightAIDS@Home Phase I, millions of compounds have been screened against HIV related protein targets, and thousands of potential drug candidate compounds have been identified. In Phase II, we plan to refine the selection of these thousands of possible compounds by performing more detailed computational experiments using our molecular dynamics engine IMPACT, and the Binding Energy Distribution Analysis Method (BEDAM).

The introduction of BEDAM simulations into FightAIDS@Home Phase II presents an enormous opportunity to refine and enrich the results from Phase I, but also presents a technical challenge as the constraints on the simulations running on the WCG are much different than the constraints on the simulations when running on more conventional computational resources.

The first experiments of FightAIDS@Home Phase 2 seek to achieve two goals: first, to confirm that the new simulation schema is working as intended and gives sufficiently reliable results compared to traditionally run simulations; second, to demonstrate that using BEDAM in conjunction with AutoDock results in better predictions than using AutoDock or BEDAM alone. There exists a symbiotic relationship between docking and more computationally demanding free-energy methods like BEDAM—without docking, the computing time required to score thousands of ligands with free energy methods is intractable, and without free energy methods, the relationship between docking scores and experimental binding affinities remains much more empirical and less accurate.

Follow the following links to learn more about FightAIDS@Home Phase 1, IBM's World Community Grid, and general information regarding FightAIDS@Home Phase 2. To learn more detailed information of how FightAIDS@Home Phase 2 works, continue reading this page.

FightAIDS@Home Phase 1
World Community Grid
FightAIDS@Home Phase 2, General Info

If you haven't already, click on the following link to download the World Community Grid software to run FightAIDS@Home Phase 2 (FAH2):

Download the World Community Grid software to run FightAIDS@Home Phase 2
and make sure that FightAIDS@Home - Phase 2 is checked on your My Projects page. You can also find the minimum system requirements and installation and registration instructions by following their links.

A drug candidate binding the LEDGF allosteric binding pocket of HIV Integrase.

Overview of Molecular Dynamics and BEDAM

Physical free energy models of binding

One key aspect of drug discovery is to identify compounds which bind strongly and specifically to the target receptor, and therefore there exists considerable interest in the development of computational models to predict the strength of protein-ligand interactions. Thermodynamically, this interaction strength between ligand molecule and receptor is measured by the binding free energy, and many computer models aim to predict the protein-ligand binding free energy by simulating the interactions between protein and ligand in a bound complex. Docking methods use empirical scoring functions to estimate binding free energies in order to distinguish between between ligands that bind strongly from ligands that bind weakly or not at all.

On the other hand, physical free energy models, which use physics based effective potentials, seek to compute accurate protein-ligand binding free energies based on the principles of statistical mechanics. Unlike docking methodologies, many of these methods are exceptionally computationally intensive and are highly dependent on both accurate modeling of interaction force fields (ligand-ligand, ligand-solvent, protein-protein, protein-solvent, and protein-ligand) and efficient sampling of all rotational and translational, internal and external, degrees of freedom of the ligand and protein.

The statistical mechanics theory of binding provides a prescription to compute the binding free energy from first principles, which can be implemented in various ways. One method developed in the Levy lab is called the binding energy distribution analysis method (BEDAM), which runs using the Levy group's molecular dynamics engine IMPACT. This methodology uses Hamiltonian replica exchange molecular dynamics simulations of the ligand-receptor complex in an implicit solvent model to construct a free energy path that connects the unbound and bound states of the ligand with the receptor.

Expand the following subsections for in-depth descriptions of how BEDAM molecular dynamics simulations work!

Molecular Dynamics (click to expand)

A molecular dynamics (MD) simulation is a computer simulation of how atoms and molecules move according to Newton's laws of motion. Using Newton's second law "F = ma", the positions and velocities of thousands of atoms are continuously updated by computing the forces (and therefore the accelerations) acting on each atom. Each simulation step is typically on the order of a femtosecond (10^-15 second), and therefore a complete simulation, or trajectory, typically consists of millions of steps. To simulate a system for a millisecond takes a trillion steps with a 1fs time step!

For every time step in the simulation, there are two primary computations that occur:

  1. given the current positions of all the atoms, compute the forces acting on each atom
  2. using the recently calculated forces, update the current positions of the atoms (by integrating the equations of motion)
Step 1 is typically the most computationally intensive part of the simulation because, in principle, one must consider the interactions between all pairs of atoms, which means computing the distances between each pair and considering all types of interactions between those two atoms. Although many algorithms have been designed to reduce the number of force calculations performed at each step, most of the computer cycles used in an MD simulation are used to compute forces.

Because these simulations involve computing a massive system of coupled equations using floating point arithmetic many, many times, two trajectories starting from identical starting conditions will diverge exponentially quickly. One consequence of this is that we cannot use duplicate or quorum simulations to validate the results we draw from the computational experiment. However, the fact that two simulations with almost identical initial conditions diverge is not evidence that there is little to gain from such simulations. It's important to understand that MD simulations are used to predict average behaviors and quantities of the system being simulated, and in this sense, MD differs fundamentally from other types of numerical trajectory-computing simulations, like computing the path of a satellite or space shuttle through space where knowing the ``true'' trajectory is critical. There is substantial numerical evidence that shows MD simulations of many-body systems of sufficient length are representative of the ``true'' trajectories.

Although we simulate the motion of many individual atoms coupled together, the quantities in which we are interested, temperature, pressure, binding affinities, etc., do not depend on the specifics of every particle, but instead on the statistical, or average, properties of the molecular motion. It is the field of statistical mechanics which provides the framework for using microscopic molecular dynamics simulations to predict the macroscopic thermodynamic and kinetic behavior of these complexes, like the binding affinities of ligands to protein receptors and the kinetics of protein-ligand binding.

Sampling (click to expand)

Ultimately there are two principles that determine the overall usefulness of a MD simulation: sampling efficiently and forcefield accuracy. Generally, sampling describes the process of selecting elements from a population such that properties of those elements are representative of the entire population. In terms of MD trajectories, sampling describes the notion that the conformations, energies, etc. of the trajectory are representative of the average properties of the set of all trajectories of that system. The more efficiently a simulation can sample the conformational space (set of all 3D arrangements of atoms) or energy landscape (description of energy values given positions/velocities), the more value a simulation of a fixed length is, or put another way, better sampling decreases the amount of simulation time needed to estimate quantities of interest.

There are many types of sampling algorithms in molecular dynamics designed to accelerate the sampling of particular degrees of freedom during the simulations. In typical BEDAM simulations the sampling of ligand-receptor complexes, and thus binding energy estimates, is accelerated using a technique known as Hamiltonian Replica Exchange. This technique takes advantage of parallel computing, where several copies (called ``replicas'') of the simulation are run in parallel. The Hamiltonian is the energy function, and in BEDAM simulations has a component related to the coupling between receptor and ligand which can be tuned to on, off, or somewhere in between. Each replica in the simulation has this receptor-ligand coupling dialed into a different strength and all replicas are launched in parallel. Often during the simulation, the replicas all stop and exchange receptor-ligand coupling strengths and then the trajectories continue on; this is where the name Hamiltonian exchange comes from in Hamiltonian Replica Exchange.

An example of how the Hamiltonian parameter lambda changes over time due to (left) replica exchange and (right) lambda scheduling.

Implications for FAH2: The process of replica exchange requires inter-replica communication, something that is often possible on supercomputers and clusters, but is not well matched to the WCG. As an alternative, we have implemented a different sampling protocol called ``lambda scheduling'' where, instead of exchanging Hamiltonian parameters with other replicas, replicas will cycle through the various values of receptor-ligand coupling on a precise schedule. Although it is not our standard way of computing binding free energies, the methodology is well-founded in non-equilibrium statistical mechanics. Doing so complicates the statistical mechanics methods we use to compute binding free energies (because lambda scheduling is a non-equilibrium process where as replica exchange is an equilibrium process), but it satisfies the constraints of working on the grid. Our first order of business on the grid is testing this sampling paradigm to insure we can estimate equilibrium properties like protein-ligand binding affinities properly.

Solvation (click to expand)

The other component determining the usefulness of MD simulations is that of the accuracy of the forces being calculated. Because the force computation is the most computationally intensive portion of the simulation, many algorithms have been developed to efficiently calculate all the necessary interactions, each with their own advantages. One major decision when beginning a molecular dynamics simulation concerns how to treat interactions with water (the solvent) as water-protein interactions drive structural organization at the protein ( solute) surface.

Typically, one chooses to model water explicitly or implicitly. Explicit solvation simulations place their system of interest in a bath of water molecules, enough to fully envelope the protein or protein-ligand complex as well as fill any cavities, and the motions of the water molecules are explicitly simulated in addition to the rest of the atoms in their system. Alternatively, instead of modeling the individual water molecules, implicit solvation simulations represent the water as a bulk dielectric and estimates of the solute-solvent interactions are approximated by interactions with this dielectric. While hybrid solvation models exist, they are outside the scope of this article.

Implications for FAH2: Implicit solvent simulations have several advantages over explicit representations. Namely, computation time is lower due to many fewer interaction calculations, the need to sample and equilibrate many solvent conformations is eliminated, and noise due to solvent structure is eliminated. Due to our lab's experience with and development effort towards implicit solvation models (namely, AGBNP, AGBNP2, AGBNP3), the simulations running on the World Community Grid model solvent implicitly. One relevant drawback of using an implicit solvation model is that the force calculations are much harder to vectorize, meaning that a parallelized version of IMPACT and BEDAM suitable for running on GPUs is not currently available. However, this is an active area of funded research of a close collaborator, Emilio Gallicchio.

Running BEDAM on IBM's WCG

Differences between Phase I and Phase II

For FAH2, the concept of a batch has changed from 1,000-10,000 ligand-receptor complexes per batch to 1 complex per batch, and each workunit in that batch is running thousands of simulations of that complex which we call replicas. These replicas differ from each other in important ways, such as different energy function parameters and different starting conformations of the ligand and receptor.

Another key difference between Phase I and Phase II is that the simulation corresponding to one replica is broken up into several workunits that must be completed serially, i.e., the output of one workunit serves as input for the next workunit. Each batch generates tens of thousands of workunits that together form the simulations of hundreds or thousands of replica trajectories.

First computational experiments

Additionally, the distributed and heterogeneous nature of the volunteer grid imposes some constraints on how the simulations can be run. We're in the process of learning how best to utilize this enormous resource by running each batch with two different simulation schema and several different analysis protocols. Doing so will allow us to refine our future simulation schema in order to maximize the impact of donated computing cycles. As volunteers ourselves, we do not want to see wasted computing cycles and are working with the constraints imposed by our dynamics engine, our forcefield model, and the nature of the WCG to find the optimal computing methodology for us and the volunteers.

The two simulation protocols currently being tested employ sampling techniques that differ from the way these simulations are run on clusters and supercomputers. The first is what we call ``lambda scheduling'' or ``lambda cycling'', which is where throughout a replica's simulation, key parameters associated with the coupling between ligand and receptor are cycled through a known set of values. This greatly helps to accelerate the sampling of the conformational space. Simulating multiple replicas with the same parameter changing schedule helps accelerate this sampling further.

The second simulation scheme is independent sampling in which no two replicas will ever have the same combination of energy function parameters, initial structures, etc., and these parameters are constant throughout all workunits pertaining to each replica's trajectory. With both simulation schema, long simulation times are needed. We have developed analysis techniques which can combine the results from all of these different replica trajectories to produce estimates of the protein-ligand complex's binding free energy.

HIV Integration (left) and HIV maturation/cleavage (right) as illustrated by David S. Goodsell © 2015, All Rights Reserved. Read more at the HIV Interaction and Viral Evolution Center.

FAH2 Experiments

Experiments 000106-000158

Experiments 106-158 are composed of simulating the set of 53 protein-ligand complexes from experiments 0-105, except the starting conformations are those found in the crystal structures, not from docking. This provides us a baseline or best case scenario to evaluate how well BEDAM performs.

Receptor Structure: HIV Integrase, PDB 3NF8, LEDGF binding site
Complete: 0%

Experiments 000000-000105

Experiments 0-105 are composed of simulating a set of 53 ligand-receptor complexes in two different ways. Batches that end with `-ls' are simulated with what we call `lambda scheduling', and batches without `-ls' are using independent sampling. This set of ligands is a known set of binders with known bound crystal structures (see here).

Receptor Structure: HIV Integrase, PDB 3NF8, LEDGF binding site
Complete: 20%

Glossary

Find the FightAIDS@Home Phase 1 glossary here.

New Terms

Binding Affinity

The strength of the solvent-mediated attraction between a receptor and a ligand. It is the thermodynamic "driving force" behind binding reactions.

Binding Free Energy

is a measure of binding affinity used in statistical mechanics.

Conformation

The three-dimensional structure of a ligand or protein, specifically referring to the spatial arrangements of atoms and molecular bonds in 3D space.

Degrees of Freedom

In the context of simulations, degrees of freedom refer to all independent coordinates and velocities that uniquely determine the configuration of all atoms in the simulation.

Equilibrium

In the context of molecular dynamics simulations, equilibrium refers to the state in which no macroscopic change occurs in the system being simulated. In particular macroscopic thermodynamic quantities like temperature, pressure, and system energy remaining constant. Initially, the simulations most often start out of equilibrium and equilibrate slowly as the simulation continues. Developing methodologies to accelerate equilibration (sampling algorithms) is an active area of research.

Floating Point Numbers

On computers, there needs to be a way to represent real numbers, including irrational (sqrt(2), Pi, etc.) and non-terminating rational (0.11111...) numbers, in binary. One representation is called floating point representation, where numbers are stored as two strings of digits, one representing the significant digits and the other representing the location of the decimal point. However, some numbers cannot be represented exactly as floating point numbers, and the difference between the exact number and its floating point representation is called floating point error. Floating point errors accumulate during computations in a way that can be considered random. See this webpage for an in-depth discussion of floating point numbers.

Force

A push or pull exerted on one object by another. Forces are important as they drive changes in motion of objects.

Force Field

Not to be confused with a set of forces, a force field in computational chemistry refers to set of potential functions and parameters used to calculate the potential energy of a system of atoms, bonds, and/or molecules in a simulation. You can think of a force field as all the information one would need to describe how all types of atoms and molecules interact with each other.

Hamiltonian

The function used to compute the energy of a system. The Hamiltonian is usually a function of all position coordinates and velocities in a given system.

Lambda

A tunable parameter in our Hamiltonian which controls how strongly the ligand and receptor interact with each other, ranging from totally uncoupled (lambda = 0) to fully coupled (lambda = 1).

Molecular Dynamics Simulations

A class of computer simulation used to study the motion of a collection of atoms interacting via Newton's equations of motion. See the section above dedicated to describing how molecular dynamics simulations work.

Non-equilibrium

The opposite of equilibrium.

Potential Energy

Stored energy that can be transformed into motion or otherwise change the properties of an object or system of objects.

Potential Function

The mathematical expression for potential energy, describing how potential energy changes with the conformation of a molecular system.

Replica

One copy (usually one of many) of a molecular system being simulated.

Solute

Substance that is dissolved in a solvent that form a solution. In MD simulations, the ligand and protein are considered the solutes.

Solvent

Substance that dissolves a solute (a different substance) to form a solution. The most common solvent is water.

Statistical Mechanics

Physicists have solid theories which describe well the properties and motions of both the microscopic world (atomic collisions, molecule formation, etc.) and the macroscopic world (Newton's laws, concepts like pressure and temperature, etc.). However, in practical experience, one does not care (nor has a way to know) about all the microscopic details when checking the temperature in a room or sliding a box across the ground. Statistical mechanics provides the framework to bridge these two different perspectives by describing macroscopic quantities in terms of averages over microscopic behavior.

Trajectory

The set of conformations a molecular system takes as a function of time throughout a simulation. During a simulation, the coordinates of all atoms are recorded at regular intervals, and when completed, the motions of the system can be represented in a movie format.

Velocity

The rate of change of an object's displacement. Instantaneous velocity is speed at which an object moves in a given direction. Often an object's velocity is split into 3 components, it's velocity in the x (back and forth), y (left and right), and z (up and down) directions.