
How to Do Principal Component Analysis (PCA) on a GROMACS Trajectory: Essential Dynamics for Beginners

July 2, 2026
Posted by: Stem Skills Lab
Category: Molecular Modeling


To run PCA on a GROMACS trajectory, build the covariance matrix of your fitted coordinates with gmx covar, then analyze the eigenvectors with gmx anaeig. Projecting the trajectory onto the first two principal components with -2d gives a 2D conformational map of the dominant collective motions. The first few eigenvectors usually capture most of the large-amplitude motion. Note that gmx covar does not mass-weight the matrix unless you add -mwa.

By now you have confirmed your simulation equilibrated with RMSD, mapped its flexible regions with RMSF, and described its shape with radius of gyration. Those metrics answer “is the trajectory stable and where does it move?” Principal component analysis, known in the molecular dynamics field as essential dynamics, answers a harder question: what are the large, correlated motions the protein actually performs, and can you see them separated from thermal noise? This is the analysis that turns a stable trajectory into a mechanistic story, and it is what a thesis committee expects when a candidate claims to have observed a hinge motion, a domain opening, or a binding-site rearrangement.

This post is part of our "Learn molecular dynamics with GROMACS" series. It follows directly from "How to analyze radius of gyration, SASA and hydrogen bonds" and the earlier RMSD and RMSF walkthrough. If you are placing MD inside a research career, see the full computational biology skills roadmap.

What is PCA (essential dynamics) and why run it after RMSD?

Principal component analysis reduces the thousands of correlated atomic movements in a trajectory to a small set of collective coordinates, the principal components, ordered from the largest amplitude motion to the smallest. It works by diagonalizing the covariance matrix of atomic positions: the eigenvectors are the directions of concerted motion, and each eigenvalue is the mean-square fluctuation along its eigenvector, in nm². The idea was introduced by Amadei, Linssen and Berendsen, who showed that protein motion splits into “an essential subspace containing only a few degrees of freedom in which anharmonic motions occur that comprise most of the positional fluctuations,” while the rest is narrow, near-harmonic vibration (Amadei A, Linssen ABM, Berendsen HJC, “Essential dynamics of proteins,” Proteins 17 (1993) 412-425, DOI: 10.1002/prot.340170408).
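
To make "diagonalizing the covariance matrix" concrete, here is a minimal sketch on a toy system: a single particle moving in two dimensions, mostly along a diagonal. The function name pca_2d and the coordinates are made up for illustration. A real trajectory produces a 3N by 3N matrix that gmx covar handles for you, so treat this as a picture of the idea rather than of the GROMACS implementation.

```python
import math

def pca_2d(points):
    """Return ((lam1, lam2), (v1, v2)) for the 2x2 covariance matrix of (x, y) points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Covariance matrix elements of the fluctuations around the mean position.
    cxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    cyy = sum((p[1] - mean_y) ** 2 for p in points) / n
    cxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (cxx + cyy) / 2.0
    spread = math.sqrt(((cxx - cyy) / 2.0) ** 2 + cxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Eigenvectors: v1 is the direction of largest fluctuation, v2 is perpendicular.
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    v1 = (math.cos(angle), math.sin(angle))
    v2 = (-math.sin(angle), math.cos(angle))
    return (lam1, lam2), (v1, v2)

# Hypothetical toy "trajectory": large motion along the diagonal (t),
# small motion perpendicular to it (e). Units are arbitrary.
ts = [-1.0, -0.5, 0.0, 0.5, 1.0]
es = [0.05, -0.05, 0.0, 0.05, -0.05]
points = [(t + e, t - e) for t, e in zip(ts, es)]

(lam1, lam2), (v1, v2) = pca_2d(points)
print(f"PC1 eigenvalue {lam1:.4f}, direction ({v1[0]:.3f}, {v1[1]:.3f})")
print(f"PC2 eigenvalue {lam2:.4f}, direction ({v2[0]:.3f}, {v2[1]:.3f})")
print(f"fraction of fluctuation on PC1: {lam1 / (lam1 + lam2):.3f}")
```

The first eigenvector points roughly along the diagonal and carries almost all of the fluctuation, which is the two-dimensional version of an essential subspace.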

RMSD and RMSF give you scalar summaries per frame or per residue. PCA gives you the actual shapes of the motions and lets you watch the protein hop between conformational states in two dimensions. That is why it belongs after, not instead of, your basic checks.

Do you need to prepare the trajectory before PCA?

Yes, and this step decides whether the result is meaningful or garbage. PCA is exquisitely sensitive to overall translation and rotation of the molecule, because those rigid-body movements are large and would dominate the first eigenvectors, hiding the internal motion you care about. Two preparations are mandatory. First, remove periodic boundary jumps so the molecule is whole and centered:

gmx trjconv -s md_0_1.tpr -f md_0_1.xtc -o md_0_1_nopbc.xtc -pbc mol -center

Choose Protein to center and System to output. Second, the rotational and translational fit is handled inside gmx covar itself through the least-squares fit group you select, so you do not fit separately. Restrict the analysis to backbone or C-alpha atoms: side-chain and hydrogen motion adds noise that buries the collective backbone motion. Backbone atoms are the standard, defensible choice for essential dynamics.
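
A quick way to see why rigid-body motion has to go is to compare the variance of an atom's position before and after removing the overall translation. The sketch below uses a made-up one-dimensional, two-atom "molecule"; the helper names are hypothetical. It handles translation only, whereas the least-squares fit in gmx covar also removes rotation.

```python
def variance(values):
    """Mean-square fluctuation of a list of numbers around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def remove_translation(frames):
    """Subtract each frame's centroid so that only internal motion remains."""
    centered = []
    for frame in frames:
        centroid = sum(frame) / len(frame)
        centered.append([x - centroid for x in frame])
    return centered

# Hypothetical toy data: two atoms whose separation vibrates slightly
# while the whole molecule drifts steadily along x.
frames = []
for step in range(100):
    drift = 0.1 * step               # rigid-body translation (nm)
    stretch = 0.01 * (-1) ** step    # internal vibration (nm)
    frames.append([drift - 0.05 - stretch, drift + 0.05 + stretch])

raw_var = variance([f[0] for f in frames])
fitted_var = variance([f[0] for f in remove_translation(frames)])
print(f"variance of atom 1 before removing translation: {raw_var:.4f} nm^2")
print(f"variance of atom 1 after removing translation:  {fitted_var:.6f} nm^2")
```

The drift inflates the variance by several orders of magnitude. If it were left in, the leading eigenvector would describe the drift and not the vibration.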

How do you build the covariance matrix with gmx covar?

The tool gmx covar constructs and diagonalizes the covariance matrix in one command. By default the matrix is not mass-weighted; add -mwa if you want the mass-weighted form. Run:

gmx covar -s md_0_1.tpr -f md_0_1_nopbc.xtc -o eigenval.xvg -v eigenvec.trr -av average.pdb -l covar.log

You are prompted twice. The first prompt is the group for the least-squares fit; the second is the group for the covariance analysis. Choose Backbone for both so the fit and the analysis use the same reference. The command writes four things you will reuse:

eigenval.xvg: the eigenvalues in descending order, the raw material for deciding how many components matter.
eigenvec.trr: the eigenvectors, the input to every gmx anaeig step below.
average.pdb: the average structure, used as the reference for projections and for viewing extreme motions.
covar.log: the run log, which reports the sum of the eigenvalues (the total positional fluctuation).

For a protein of N analyzed atoms the covariance matrix is 3N by 3N and therefore has 3N eigenvectors, but only the leading handful carry real amplitude. The full flag list is in the GROMACS gmx covar documentation.


How do you read the eigenvalue plot and decide how many components matter?

Plot eigenval.xvg and you will see a steep drop: the first eigenvalue is much larger than the second, the second larger than the third, and the curve quickly flattens into a long tail of tiny values. Each eigenvalue divided by the sum of all eigenvalues is the fraction of total fluctuation that its component explains. In practice the first two or three components usually account for the bulk of the collective motion, which is exactly why a 2D projection onto the first two is such a common figure. Amadei and colleagues found that this essential subspace is small, often only a handful of the thousands of available directions, so you rarely need to look past the first five components. Report the cumulative percentage captured by PC1 and PC2 in your figure caption, computed from your own eigenvalues, rather than quoting a number from a paper.
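
The percentage calculation described above can be sketched in a few lines of Python. The helper names are made up, and the eigenvalues in the sample are invented so the example runs without a trajectory. The sketch assumes the usual XVG layout: header lines starting with # or @, then the index in the first column and the eigenvalue in the second.

```python
def read_xvg_column(lines, column=1):
    """Return one numeric column from XVG-formatted lines, skipping '#' and '@' headers."""
    values = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in "#@":
            continue
        values.append(float(stripped.split()[column]))
    return values

def explained_fractions(eigenvalues):
    """Return (fractions, cumulative fractions) of the total fluctuation."""
    total = sum(eigenvalues)
    fractions = [ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for frac in fractions:
        running += frac
        cumulative.append(running)
    return fractions, cumulative

# Hypothetical eigenvalues (nm^2), NOT taken from a real simulation.
sample = """# made-up example in XVG layout
@ title "Eigenvalues of the covariance matrix"
1 2.0
2 1.0
3 0.5
4 0.5
""".splitlines()

eigenvalues = read_xvg_column(sample)
fractions, cumulative = explained_fractions(eigenvalues)
for i, (frac, cum) in enumerate(zip(fractions, cumulative), start=1):
    print(f"PC{i}: {100 * frac:.1f}% (cumulative {100 * cum:.1f}%)")
```

For your own run, replace sample with the lines of eigenval.xvg, for example open("eigenval.xvg").read().splitlines(). The fractions are only correct if the file holds every eigenvalue, so compare the sum against the total reported in covar.log if you limited the output.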

How do you project onto PC1 and PC2 with gmx anaeig?

The tool gmx anaeig analyzes the eigenvectors from gmx covar. To produce the two-dimensional projection, ask for the first two eigenvectors and the -2d output:

gmx anaeig -v eigenvec.trr -f md_0_1_nopbc.xtc -s md_0_1.tpr -first 1 -last 2 -2d 2d_proj.xvg

Select Backbone at the prompts, matching the fit and covariance analysis groups you chose in gmx covar. The file 2d_proj.xvg has two columns, the projection of every frame onto PC1 and onto PC2, ready to plot as a scatter. To follow a single component through time instead, project onto one eigenvector:

gmx anaeig -v eigenvec.trr -f md_0_1_nopbc.xtc -s md_0_1.tpr -first 1 -last 1 -proj pc1.xvg

This pc1.xvg gives the value of PC1 for every frame against simulation time, which is the trace you inspect for jumps between states. The full option set is in the GROMACS gmx anaeig documentation.

How do you read the 2D conformational map?

Plot PC1 on the x-axis and PC2 on the y-axis as a scatter of all frames. One dense cloud means the protein sampled a single conformational state and stayed there. Two or more separate clusters mean the protein visited distinct states, and the gaps between clusters are the transitions, the events worth describing in your results. A smear that spreads steadily in one direction with no clustering can signal that the protein is still drifting and has not converged, which sends you back to check whether your trajectory is long enough. Because PC1 by construction carries the largest motion, movement along the horizontal axis is always the dominant one to interpret first.

How do you visualize the actual motion along a principal component?

Numbers on a scatter plot do not show a reader what the protein is doing. Generate the two extreme structures along the first eigenvector and you can render the motion directly:

gmx anaeig -v eigenvec.trr -f md_0_1_nopbc.xtc -s md_0_1.tpr -first 1 -last 1 -extr pc1_extreme.pdb -nframes 20

This writes a short multi-frame PDB that interpolates between the two extremes of PC1. Open it in VMD or PyMOL and play it as a loop to see the collective motion, for example a hinge closing or two domains scissoring. This animation, placed beside the 2D scatter, is the pair of figures that makes an essential-dynamics result convincing.
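
Conceptually, each interpolated frame is the average structure displaced along the eigenvector by some amount between the smallest and largest projection seen in the trajectory. The sketch below illustrates that idea on a made-up two-atom system. It assumes a non-mass-weighted analysis and is not the GROMACS implementation; the function name and all numbers are hypothetical.

```python
import math

def interpolate_extremes(average, eigenvector, proj_min, proj_max, nframes):
    """Return nframes coordinate sets spaced linearly between the two extreme projections."""
    if nframes < 2:
        raise ValueError("need at least two frames to span the extremes")
    frames = []
    for k in range(nframes):
        # Displacement along the eigenvector for this frame.
        c = proj_min + (proj_max - proj_min) * k / (nframes - 1)
        frames.append([a + c * e for a, e in zip(average, eigenvector)])
    return frames

# Hypothetical system: two atoms (x1, y1, z1, x2, y2, z2) in nm whose
# first eigenvector moves them in opposite directions along y (a scissoring motion).
inv_sqrt2 = 1.0 / math.sqrt(2.0)
average = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
eigenvector = [0.0, inv_sqrt2, 0.0, 0.0, -inv_sqrt2, 0.0]

frames = interpolate_extremes(average, eigenvector, proj_min=-0.5, proj_max=0.5, nframes=5)
for k, frame in enumerate(frames):
    print(f"frame {k}: atom 1 y = {frame[1]:+.3f} nm, atom 2 y = {frame[4]:+.3f} nm")
```

The middle frame coincides with the average structure and the end frames are the two extremes, which is why the animation loops smoothly when played back and forth.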

How do you know the PCA reflects real motion and not random diffusion?

This is the check most beginners skip and the one reviewers ask about. On a short or poorly converged trajectory, principal components can look like clean cosine waves that are an artifact of random diffusion rather than genuine conformational transitions. Berk Hess quantified this with the cosine content of a principal component: a value near 1 means the motion resembles free diffusion and the sampling has not converged, while a value near 0 means the component captures real, structured motion (Hess B, “Convergence of sampling in protein simulations,” Physical Review E 65 (2002) 031910, DOI: 10.1103/PhysRevE.65.031910). Compute it on your projected component with:

gmx analyze -f pc1.xvg -cc pc1_cosine.xvg

If the cosine content of your first component is high, treat any clustering in the 2D map with caution and, if you can, extend the simulation before drawing mechanistic conclusions.
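
If you want to see what the number means, the cosine content of component i is the squared overlap of its time trace with a cosine of i half-periods, scaled so it lies between 0 and 1. Below is a minimal sketch of that definition for an evenly sampled trace. It uses a simple midpoint sum, so small differences from the gmx analyze output are possible. The function name and the two test traces are made up.

```python
import math

def cosine_content(projection, i=1):
    """Cosine content of an evenly sampled projection for component index i (1-based)."""
    n = len(projection)
    norm = sum(p * p for p in projection)
    if n == 0 or norm == 0.0:
        raise ValueError("projection must be non-empty and non-zero")
    # Overlap with a cosine that completes i half-periods over the trajectory,
    # sampled at the midpoint of each frame interval.
    overlap = sum(math.cos(i * math.pi * (k + 0.5) / n) * p
                  for k, p in enumerate(projection))
    return 2.0 * overlap ** 2 / (n * norm)

n_frames = 200
# Hypothetical trace that looks like free diffusion: exactly half a cosine period.
diffusive = [math.cos(math.pi * (k + 0.5) / n_frames) for k in range(n_frames)]
# Hypothetical trace with many fast oscillations and no slow drift.
oscillating = [math.cos(20 * math.pi * (k + 0.5) / n_frames) for k in range(n_frames)]

print(f"cosine content, diffusive trace:   {cosine_content(diffusive):.3f}")
print(f"cosine content, oscillating trace: {cosine_content(oscillating):.3f}")
```

To apply it to real data, pass the second column of pc1.xvg as projection with i=1, and the projection on the second eigenvector with i=2.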

PCA vs RMSD vs RMSF: which answers which question?

Analysis	GROMACS tool	What it measures	One value per	Answers
RMSD	gmx rms	Deviation from a reference structure	Time frame	Has the trajectory equilibrated?
RMSF	gmx rmsf	Fluctuation amplitude per residue	Residue	Which regions are flexible?
PCA / essential dynamics	gmx covar + gmx anaeig	Direction and amplitude of collective motions	Principal component	What large correlated motions occur, and between which states?

Read together, the three build a full picture: RMSD says the run is trustworthy, RMSF says where the motion lives, and PCA says what the motion actually is.

Troubleshooting: common gmx covar and gmx anaeig errors

The first eigenvector looks like whole-molecule rotation. Your fit group was wrong or too small. Refit on Backbone or C-alpha and confirm you removed periodic boundary jumps with gmx trjconv -pbc mol -center first.
“Group X does not match” or a mismatch between covar and anaeig. The analysis group in gmx anaeig must be the same group you used for the covariance analysis in gmx covar. Select Backbone consistently in both.
The 2D map is one shapeless blur. Either the protein genuinely stays in one state, or the trajectory is too short to sample transitions. Check the cosine content with gmx analyze -cc before concluding anything.
Eigenvalues are all tiny and similar. You may have included solvent or ions in the analysis group. Restrict the covariance analysis to protein backbone atoms only.


Frequently asked questions

Should I use backbone or C-alpha atoms for PCA?

Both are accepted. The GROMACS Backbone group (N, C-alpha, C) is the common default and gives a fuller description of main-chain motion; C-alpha only is lighter and often used for large systems. What matters most is consistency: use the same group for the fit and the analysis, and state your choice in the methods section.

How long should the trajectory be before PCA is meaningful?

Long enough to sample the transitions you want to describe, which depends on the system. There is no fixed number, so verify convergence rather than assuming it: compute the cosine content of the first components, and check that the essential subspace is stable when you compare the first and second halves of the run.
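
One common way to put a number on "the essential subspace is stable" is the root-mean-square inner product (RMSIP) between the leading eigenvectors of the two halves. That choice of metric is an assumption here, not something prescribed above. The sketch below shows the calculation on made-up four-dimensional vectors; with real data you would run gmx covar on each half and compare, say, the first ten eigenvectors.

```python
import math

def rmsip(vectors_a, vectors_b):
    """Root-mean-square inner product between two sets of orthonormal vectors.

    Returns 1.0 when both sets span the same subspace and 0.0 when the
    subspaces are orthogonal.
    """
    total = 0.0
    for a in vectors_a:
        for b in vectors_b:
            dot = sum(x * y for x, y in zip(a, b))
            total += dot * dot
    return math.sqrt(total / len(vectors_a))

# Hypothetical eigenvector sets in a 4-dimensional coordinate space.
first_half = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
c, s = math.cos(0.3), math.sin(0.3)
same_plane_rotated = [(c, s, 0.0, 0.0), (-s, c, 0.0, 0.0)]    # same subspace, mixed axes
different_plane = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]

print(f"RMSIP, same subspace:       {rmsip(first_half, same_plane_rotated):.3f}")
print(f"RMSIP, orthogonal subspace: {rmsip(first_half, different_plane):.3f}")
```

The rotated set scores 1 because RMSIP compares subspaces rather than individual vectors. That is useful when PC1 and PC2 swap or mix between the two halves. gmx anaeig also offers a subspace-overlap calculation against a second eigenvector file; check the documentation of your GROMACS version for the exact options.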

What is the difference between PCA and essential dynamics?

They refer to the same method. “Essential dynamics” is the name Amadei and colleagues gave to PCA applied to protein trajectories in 1993, emphasizing that a few essential degrees of freedom describe most of the functional motion. GROMACS implements it through gmx covar and gmx anaeig.

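Does GROMACS mass-weight the covariance matrix?

Not by default. gmx covar builds the plain, non-mass-weighted covariance matrix unless you add the -mwa flag, and in that default form each eigenvalue is a mean-square fluctuation in nm². Mass weighting scales each atom's contribution by its mass, which ties the eigenvectors more closely to quasi-harmonic modes of motion. For a backbone or C-alpha essential dynamics analysis the unweighted default is a common choice. Whichever you use, state it in your methods section.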

Can I build a free energy surface from the projections?

Yes. A common next step is to histogram the PC1 and PC2 projections and convert the populations to free energies, which turns the 2D map into an energy surface with basins for each state. GROMACS provides gmx sham for this once you have the 2d_proj.xvg file.
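
The conversion itself is short: bin the frames on a grid and take G = -kT ln(P / Pmax) for each occupied bin, so the most populated bin sits at zero. The sketch below illustrates this with made-up projections. It is not how gmx sham is implemented, and the function name, bin count and temperature are assumptions for the example.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)

def free_energy_surface(pc1, pc2, bins=2, temperature=300.0):
    """Return {(ix, iy): G in kJ/mol} for occupied bins, relative to the most populated bin."""
    lo1, hi1 = min(pc1), max(pc1)
    lo2, hi2 = min(pc2), max(pc2)

    def bin_index(value, low, high):
        if high == low:
            return 0
        # Clamp so the maximum value falls in the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    counts = {}
    for x, y in zip(pc1, pc2):
        key = (bin_index(x, lo1, hi1), bin_index(y, lo2, hi2))
        counts[key] = counts.get(key, 0) + 1

    max_count = max(counts.values())
    kt = K_B * temperature
    # Empty bins are simply absent: their free energy is formally infinite.
    return {key: -kt * math.log(n / max_count) for key, n in counts.items()}

# Hypothetical projections (nm): 8 frames in one state, 2 frames in another.
pc1 = [-1.0] * 8 + [1.0] * 2
pc2 = [-0.5] * 8 + [0.5] * 2

surface = free_energy_surface(pc1, pc2, bins=2, temperature=300.0)
for key in sorted(surface):
    print(f"bin {key}: G = {surface[key]:.2f} kJ/mol")
```

Basin depths computed this way are only as reliable as the sampling, so the convergence checks above apply to the free energy surface as well.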

Bringing it together

PCA is where MD analysis stops describing and starts explaining. Prepare the trajectory, run gmx covar on the backbone, read the eigenvalue drop, project onto the first two components with gmx anaeig, animate the extremes, and verify with cosine content. Do that and you can show a committee not just that your protein moved, but exactly how. This work was prepared by the StemSkills Lab team, which brings more than ten years in sequence and structural bioinformatics, drug discovery and design, and multiscale molecular modeling to every tutorial in this series.
