Tools to analyse TopoLink results
TopoLink comes with some helper programs that are used to analyze the
results in large sets of structural models. These packages are available
upon installation of TopoLink, and are:
1. evalmodels: A package that reads sets of TopoLink output
files and model quality scores from another source, and writes tables
for plotting their correlation.
2. linkcorrelation: A package to analyze the correlation
between crosslinks in ensembles of models.
3. linkensemble: A package to read TopoLink output
data and compute the set of models required to satisfy the observed
links.
Appendix: Evaluating structural models with LovoAlign
[click here].
These tools will be deprecated in favor of the Julia implementation of
the analysis suite, which is already functional and available at:
TopoLink.jl
1. evalmodels
evalmodels reads a set of TopoLink output files and a separate file
containing a model evaluation score, and writes the list of models with
the crosslinking statistics associated with this score.
For example, the plot below was obtained from the output of
evalmodels :
The values in the y-axis, i.e. the number of observed links that are consistent with
each structure, were obtained from the TopoLink log files of each model.
The values in the x-axis, in this case the similarity of each model to
the crystallographic structure, were obtained by aligning each model
with the crystallographic model, in this case using
LovoAlign
(click here for details).
evalmodels is executed as follows:
evalmodels loglist.txt scores.dat output.dat -m1 -c2
where loglist.txt is a list of TopoLink log files, in the
following form:
./logs/model1.log
./logs/model2.log
./logs/model3.log
...
scores.dat is a table containing the name of the models (or
model files) and the third-party score that will be used, for example:
model1 0.754
model2 0.321
model3 0.135
...
and, finally, the -m1 and -c2 flags indicate
the columns in scores.dat containing the name of the model
and the value of the score, respectively (in the example, 1 and 2). The
model names may be file names, in which case only the base name is
considered (i.e. "model1"), and it must coincide with the base name of the
corresponding TopoLink log file ("model1.log").
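Before running evalmodels, it can be useful to check that every log file
has a matching entry in the score table. The sketch below is only an
illustration of the base-name rule described above: the helper names
base_name and unmatched_models are hypothetical, and the way the extension
is stripped (os.path.splitext) is an assumption about how the match is done.

```python
import os


def base_name(path):
    """Return the file name without directory and without extension."""
    return os.path.splitext(os.path.basename(path))[0]


def unmatched_models(log_paths, score_names):
    """Compare base names of log files and score entries.

    Returns two sorted lists: logs without a score, scores without a log.
    """
    logs = {base_name(p) for p in log_paths}
    scores = {base_name(n) for n in score_names}
    return sorted(logs - scores), sorted(scores - logs)


# Hypothetical inputs mimicking loglist.txt and the first column of scores.dat.
log_paths = ["./logs/model1.log", "./logs/model2.log", "./logs/model3.log"]
score_names = ["model1", "model2.pdb", "model4"]

missing_scores, missing_logs = unmatched_models(log_paths, score_names)
print("logs without score:", missing_scores)
print("scores without log:", missing_logs)
```

Both lists should be empty for a consistent pair of input files.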
The output file, output.dat, has the following structure:
# TopoLink
#
# EvalModels output file.
#
# Log file list: log.list
# Score (possibly LovoAlign log) file: ../analysis/cristal.log
# Number of models 11001
#
# Score: Model quality score, obtained from column 8 of the score file.
#
# RESULT0: Number of consistent observations.
# RESULT1: Number of topological distances consistent with all observations.
# RESULT2: Number of topological distances NOT consistent with observations.
# RESULT3: Number of missing links in observations.
# RESULT4: Number of distances with min and max bounds that are consistent.
# RESULT5: Sum of the scores of observed links in all observations.
# RESULT6: Likelihood of the structural model, based on observations.
#
# More details at: http://leandro.iqm.unicamp.br/topolink
#
# Score RESULT0 RESULT1 RESULT2 RESULT3 RESULT4 RESULT5 RESULT6 MODEL
69.59000 16 16 10 10 18 0.00000 0.10000E+01 S_00093408
65.48500 16 16 10 11 21 0.00000 0.10000E+01 S_00090481
63.06000 17 17 9 23 0 0.00000 0.99996E+00 S_00108183
...
The first column is the score read from the scores.dat
file. The other columns contain the crosslink statistics of each model,
as described in the header, which can be plotted against the score in the
first column using any plotting software.
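As an illustration of how the table can be loaded for plotting, here is a
minimal Python sketch that parses the data lines of an evalmodels output
file. It assumes only the layout shown in the example above (comment lines
starting with "#", eight numeric columns and the model name); the function
name read_evalmodels_output is hypothetical.

```python
COLUMNS = ["Score", "RESULT0", "RESULT1", "RESULT2",
           "RESULT3", "RESULT4", "RESULT5", "RESULT6"]


def read_evalmodels_output(lines):
    """Parse evalmodels output lines into a list of per-model dictionaries."""
    rows = []
    for line in lines:
        line = line.strip()
        # Skip empty lines and the commented header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Skip anything that does not have 8 numbers plus the model name.
        if len(fields) != len(COLUMNS) + 1:
            continue
        row = {name: float(value) for name, value in zip(COLUMNS, fields)}
        row["MODEL"] = fields[-1]
        rows.append(row)
    return rows


# Data lines copied from the example output above.
sample = """\
# Score RESULT0 RESULT1 RESULT2 RESULT3 RESULT4 RESULT5 RESULT6 MODEL
69.59000 16 16 10 10 18 0.00000 0.10000E+01 S_00093408
65.48500 16 16 10 11 21 0.00000 0.10000E+01 S_00090481
63.06000 17 17 9 23 0 0.00000 0.99996E+00 S_00108183
"""

rows = read_evalmodels_output(sample.splitlines())
x = [row["Score"] for row in rows]      # x-axis: model quality score
y = [row["RESULT0"] for row in rows]    # y-axis: consistent observations
print(list(zip(x, y)))
```

The x and y lists can then be passed to any plotting library. For a real
file, replace sample.splitlines() with the lines of output.dat.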
2. linkcorrelation
linkcorrelation is a package to compute the correlation between the
satisfaction of links in ensembles of structural models. It produces, as
output, a matrix of correlations, containing either:
1. The fraction of structures that satisfy both links simultaneously.
2. The fraction of structures that do not satisfy both links simultaneously.
3. The fraction of structures that satisfy one link or the other, but not both.
4. The correlation of the crosslink pair: that is, a score in the interval [-1,1], which
is -1 if the links are anti-correlated and 1 if they are correlated.
Running linkcorrelation :
The package must be run with:
linkcorrelation loglist.txt -type [type]
where loglist.txt is a file containing a list of all TopoLink log files to be
considered, and [type] is an integer number with value 1 to 4, according
to the desired type of output, as described above.
The loglist.txt file must be of the form:
./logs/model1.log
./logs/model2.log
./logs/model3.log
...
For example, these two correlation plots were produced with the output of linkcorrelation:
Plot A, on the left, was generated with the "-type 1" option, and
shows the fraction of structures of the set satisfying
both crosslinks of the matrix at the same time. In particular, the
diagonal contains the fraction of structures that satisfy each specific
crosslink.
Plot B, on the right, was generated with the "-type 3"
option, and shows the fraction of structures that satisfy one link
OR the other, exclusively (if both links are satisfied, the model
is not counted). This plot shows anti-correlations between the links.
The diagonal, in this case, is null, because each crosslink is
obviously completely positively correlated with itself.
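The two quantities shown in plots A and B can be illustrated with a short
Python sketch. This is not the linkcorrelation implementation: it assumes
that the satisfaction of each link by each model is already available as a
table of 0/1 values (the data below are made up), and the function name
pair_fractions is hypothetical.

```python
def pair_fractions(satisfied):
    """Compute pairwise fractions from a models x links table of 0/1 values.

    Returns (both, exclusive): both[i][j] is the fraction of models that
    satisfy links i and j simultaneously; exclusive[i][j] is the fraction
    that satisfy exactly one of the two links.
    """
    nmodels = len(satisfied)
    nlinks = len(satisfied[0])
    both = [[0.0] * nlinks for _ in range(nlinks)]
    exclusive = [[0.0] * nlinks for _ in range(nlinks)]
    for row in satisfied:
        for i in range(nlinks):
            for j in range(nlinks):
                if row[i] and row[j]:
                    both[i][j] += 1
                elif row[i] != row[j]:
                    exclusive[i][j] += 1
    for i in range(nlinks):
        for j in range(nlinks):
            both[i][j] /= nmodels
            exclusive[i][j] /= nmodels
    return both, exclusive


# Made-up example: 4 models (rows) and 3 links (columns).
satisfied = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
]
both, exclusive = pair_fractions(satisfied)
print(both[0][1], exclusive[0][1])
```

As in the plots, the diagonal of the first matrix holds the fraction of
models satisfying each link, and the diagonal of the second is zero.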
These plots were generated from the output of
linkcorrelation using the following Python/matplotlib
script:
plot_linkcorrelations.py
3. linkensemble
linkensemble computes the minimum and optimal set of models
required to satisfy the observed links. For example, if one has observed
26 experimental crosslinks, it is quite typical that no model accounts
for all 26 crosslinks at the same time. Therefore, one wishes to find a
set of models representing some conformational variability that takes
into account all, or at least most, of the observed links.
linkensemble depends on some quality measure for the
models, provided in a file of the following form:
# Comments
100.00 S_00093408.pdb
18.704 S_00000001.pdb
33.889 S_00000002.pdb
...
Let's call this file scores.dat. The quality score might be
some modeling score, the G-score (the output of
G-score is already provided in the correct format), or a measure of
similarity to a reference model, for example.
With the scores.dat file in hand, linkensemble
is run with:
linkensemble loglist.txt scores.dat linkensemble.dat
This will produce a file with the following form:
# TopoLink
#
# LinkEnsemble output file.
#
# Log file list: ../log.list
# G-score file: S_00093408_align.dat
# Number of models 11001
# Number of observed crosslinks: 26
#
# 1 MET A 1 LYS A 17
# 2 MET A 1 LYS A 113
# 3 LYS A 6 GLU A 9
...
# 26 LYS A 113 SER A 116
#
# Nmodel: Number of crosslinks satisfied by this model.
# RelatP: Relative probability of this model (G-score ratio to best model).
# DeltaG: RelatP converted to DeltaG (kcal/mol).
# Ntot: Total number of links satisfied by the ensemble up to this model.
# Next: link indexes according to list above.
#
# Model Nmodel RelatP DeltaG Ntot 1 2 3 4 ... 26
1 S_00093408 16 1.00000 -0.00000 16 0 1 1 1 ... 0
2 S_00037416 20 0.68889 0.22067 20 1 1 1 1 ... 1
3 S_00060737 17 0.65370 0.25172 21 1 1 1 1 ... 1
...
This file contains a list of the observed links, followed by a list of the
models with the following data: the index of the model, ordered
from highest to lowest score; the model name; the number of links
satisfied by the model; the score of each model relative to the best
model (called RelatP because the suggested G-score is a probability); the
corresponding ΔG, if the score is a probability; the total number
of links cumulatively satisfied by the set of models up to
that model; and the list of links satisfied or not (1 or 0) by the set.
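As a worked check of the RelatP and DeltaG columns, the values in the
sample output are consistent with ΔG = -RT ln(RelatP), with
R = 1.987e-3 kcal/(mol K) and T = 298 K. This relation and the constants
are inferred here from the printed numbers, not taken from the
linkensemble source, so the sketch below should be read as an assumption.
The function names are hypothetical.

```python
import math

# Assumed constants, chosen because they reproduce the sample output.
R_KCAL = 1.987e-3    # gas constant, kcal/(mol K)
TEMPERATURE = 298.0  # K


def relative_probability(score, best_score):
    """Ratio of a model score to the score of the best model."""
    return score / best_score


def delta_g(relatp):
    """Convert a relative probability into a free energy difference (kcal/mol)."""
    return -R_KCAL * TEMPERATURE * math.log(relatp)


# RelatP values copied from the sample linkensemble output.
for relatp in (1.00000, 0.68889, 0.65370):
    print(f"{relatp:.5f} {delta_g(relatp):.5f}")
```

With these constants, the RelatP values 0.68889 and 0.65370 give 0.22067
and 0.25172 kcal/mol, matching the DeltaG column of the example.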
For example, the output above indicates that the first model satisfies
16 links, the second satisfies 20 links, and the third satisfies 17
links (third column). If the three models are taken into consideration, 21
links are satisfied. The list of links follows each model.
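The cumulative count in the Ntot column can be illustrated with a short
Python sketch. It is not the linkensemble code: it assumes the models are
already ordered from best to worst score and that the set of links
satisfied by each model is known. The data and the function name
cumulative_link_counts are hypothetical.

```python
def cumulative_link_counts(models):
    """Accumulate satisfied links over models ordered from best to worst.

    models is a list of (name, set_of_satisfied_link_indexes).
    Returns a list of (name, Nmodel, Ntot) tuples.
    """
    covered = set()
    table = []
    for name, links in models:
        covered |= links  # union with links satisfied by previous models
        table.append((name, len(links), len(covered)))
    return table


# Made-up example with 5 observed links and 3 models.
models = [
    ("A", {2, 3, 4}),
    ("B", {1, 2, 3, 4}),
    ("C", {1, 2, 5}),
]
for name, nmodel, ntot in cumulative_link_counts(models):
    print(name, nmodel, ntot)
```

As in the example output, a model that satisfies fewer links than the
previous one can still increase Ntot, as long as it satisfies a link that
no earlier model accounts for.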