March 24, 2000
TRIMHAP
Version 1.2
Software for the trimmed-haplotype test
of
linkage disequilibrium
User
Manual
Charles
J. MacLean, Rory B. Martin* and Huan Wang
Virginia Institute for Psychiatric
and Behavioral Genetics
at Virginia Commonwealth University
Richmond, VA
Web site for software: http://vipbg.psi.vcu.edu/trimhap
* Current affiliation:
Millennium Pharmaceuticals, Cambridge MA
TRIMHAP
is a fortran software package that implements the trimmed-haplotype mapping
techniques discussed in MacLean et al. [2000].
An earlier version of the program, under the name HAL, was written by
Rory Martin at Virginia Commonwealth University.
The latest version of the package can be obtained from
web site http://www.vipbg.vcu.edu/trimhap. For
installation, please refer to the text file INSTALL distributed with the
TRIMHAP package.
TRIMHAP can only be understood by reference to MacLean
et al. [2000].
INPUT
TRIMHAP
requires haplotype data from general pedigrees. It is based on four user-supplied input files:
$
haplotype file
$
locus control file
$
sex file
$
TRIMHAP control file,
parameters.dat
The
file parameters.dat contains input values for parameters controlling the
TRIMHAP analysis. Since many parameters
can be input using default values, we suggest that a copy of the example
parameters.dat file be used as a template when preparing new analyses, with
users changing values only as necessary.
The order of parameters as well as the line number on which they appear
is significant, and should not be altered.
# Debug output level (-1=none, 0=minimal, 1=medium,
2=full)
0
Controls
the amount of intermediate output generated by TRIMHAP written to file
debug.out. Values higher than 0 will cause a significant increase in run-time
and may also create large debug.out files, but this output is valuable for
tracing errors in TRIMHAP and in input files.
A value of -1 suppresses results from all haplotype specific analyses
from being written to results.out. [Suggested value for ordinary running: 0].
Marker
map parameters
# Total number of marker loci
10
#
# Map (in cM) for marker loci
1 0.2 2 1.5 3 0.2
4 0.2 5 0.2 6 0.2 7 0.2 8 0.2 9 0.2 10
Defines
the map of marker loci used in the analysis.
The format is
<marker
index> <distance> <marker index> <distance> ... <marker
index>
as
in GENEHUNTER. Inter-marker distances
are in cM and must be greater than zero.
The number of marker loci must be consistent with the locus control
file.
# Region of map to be analyzed (In cM)
-20.0 20.0
Specifies
the part of the map in which TRIMHAP scans for evidence of a disease
locus. The user may specify a subset of
the map for analysis. For example, 0.0
0.2 would limit the search for the disease locus to the interval between the
first two markers in the example above.
A large range outside the limits of the marker map, such as -20.0 20.0,
simply specifies testing of the entire region.
This parameter range affects only test locations for the disease locus,
and not the marker loci tested.
# Number of test-points per marker-marker subinterval
1
Only
statistic 3, employing trimming probabilities, is affected by this parameter,
because the precise position of the disease locus is used in the trimming
probability calculation (see # Which statistic, below and
also refer to MacLean et al. [2000] ).
For statistics 1 and 2, any position between two markers produces the
same result, so this parameter should be set to the minimum, 1. Even with statistic 3, the information about
position is probably not useful until the final stages of analysis. Suggested value: 1
# Off-end parameter (in cM)
0.0
As
in GENEHUNTER the parameter controls how far testing continues past the
left-hand and right-hand limits of the map.
A value 0.0 confines positions for the putative disease susceptibility
locus to marker-marker intervals within the putative ancestral haplotype. This is usually best, but tests of single
markers require an off-end parameter greater than zero.
Ancestral
marker parameters
# Number of feasible marker loci
10
#
# List of feasible marker loci
1 2 3 4 5 6 7 8 9
10
A
marker locus is termed >feasible= if
it can be included in the putative ancestral founder haplotype during
analysis. All markers may be defined as
feasible, but the user may wish to concentrate on a particular subset of marker
loci; e.g., specifying 2
4 5 limits candidates for ancestral loci
to these markers.
This parameter can be used in conjunction with the Region analyzed parameter to determine exactly the tests to be
performed.
# Number of loci in ancestral haplotype
2
TRIMHAP
tests for linkage disequilibrium using ancestral founder haplotypes of a fixed
size. Computational effort is proportional to the number of possible
haplotypes, which in turn is proportional to the product of the number of
admissible alleles at each marker of the combination. This product rises rapidly with the number of ancestral loci.
# Maximum spread of ancestral loci (in cM)
1.0
Only
marker combinations that have a total span less than or equal to the specified
maximum are analyzed. For the marker
map above, marker pair 2-3 would give an ancestral haplotype spanning 1.5 cM,
which would be rejected. However, other
pairs, such as 1-2, or 3-4, would be accepted.
Since it is impossible to detect linkage disequilibrium using
combinations of markers that span too wide a region, suppression simply avoids
spurious, random associations, and it also spares useless computational
time. If the parameter Maximum spread is set greater than the maximum marker-marker
subinterval in the map, all combinations of consecutive feasible marker loci
are considered for analysis.
# Is this a screen? (True/false)
True
A
screen is the normal first step of the search for a disease susceptibility
locus. Normally, users would begin with
a single-locus screen and work their way through analyses involving two loci,
three loci, and so on. Since computation
time increases radically with haplotype size, screens of more than five markers
are probably not practical.
In a screen, all feasible marker loci are employed
serially. With three-marker ancestral
haplotypes, TRIMHAP would test all haplotypes of markers 1-2-3, 2-3-4,
3-4-5,...8-9-10. For one- or two-marker
haplotypes, the screen simply tests all admissible locations in the chromosomal
region, producing a map analogous to a multipoint linkage map. However, for haplotypes of three or more
markers, conflicts arise, such as 1-2-D-3 and 2-D-3-4, so that the user must
make choices.
# Ancestral loci (put value=0 to leave locus free to vary)
0 0
Ancestral
loci must correspond to the number of loci specified above. If the input values are zero, all
consecutive loci in the list
of feasible marker loci above are
considered serially. To test specific
hypotheses, the value of one or more components may be fixed. Fixing loci save computer time, but it
disturbs empirical p-values.
# Ancestral haplotype (put value=0 to leave allele free to
vary)
0 0
Specifies
the alleles comprising the ancestral founder haplotype, and must correspond to
the number of loci specified above.
Normally, the input values are zero, in which case all alleles are
considered. Values may be fixed to test
specific hypotheses; it probably makes sense to do this only if the ancestral
loci have also been fixed.
If screen is set to true, both parameters ancestral loci and ancestral haplotype are irrelevant, but they must appear anyway. Just set them to all zeros.
Haplotype
history parameters
# Number of generations since ancestral mutation event
200
Defines
the number of generations since the ancestral founder mutation was introduced
to the population [MacLean et al. 2000].
This value is only used for calculation of statistic 3, employing
trimming probabilities, and has no effect upon statistics 1 or 2.
# Mutation rate per generation
1.0e-5
The
rate is assumed the same for all markers in the study. The mutation rate is accumulated over the
number of generations since ancestral founder.
# Genotyping error rate per marker
.05
The
error rate is assumed the same for all markers in the study, and can be
estimated from the haplotype input data [MacLean et al. 2000]. Genotype errors are usually recorded zero or
blank. TRIMHAP adds the accumulated mutation rate and the error rate for the
appropriate number of markers for each trimmed-haplotype category. This value is used only for calculation of
statistic 3, employing trimming probabilities, and has no effect upon
statistics 1 or 2.
Study
size parameters
# No analysis, just list & count feasible
configurations
false
May
be set to true if the user wishes to quickly estimate the computational
complexity of a given analysis. In this
case, TRIMHAP will output a list of ancestral haplotypes to be tested, but no
analysis will take place.
# Total number of pedigrees to be analyzed
100
The
number of pedigrees may be set to a smaller number (say, 2 or 3) for debugging
purposes. Any value equal to or larger
than the actual number of pedigrees in the haplotype file will cause all
pedigrees to be analyzed.
# Number of replications for calculation of significance
levels
1000
The
number of replicate samples generated by permutation bootstrapping. If the value exceeds the array size
parameter max_replicates
declared in hal.f (currently 10,000), a
warning will be written to debug.out and the analysis will terminate. In this case, re-compilation with a larger
array size is necessary. Computational
effort is proportional to the number of replications. [Suggested value: 1000].
# Minimum frequency for allele to be in ancestral
haplotype
0.1
This
limits the alleles considered for selection in the ancestral founder
haplotype. Only alleles having a
frequency larger than the given value are considered for selection. In general, the analysis will be quicker if
relatively infrequent alleles are excluded, but potentially interesting
haplotypes may be skipped with too coarse a filter.
# Minimum number of trans. to aff. indiv's for test sample
2
Controls
the selection of the test subsample of haplotypes. In a given pedigree, only founder haplotypes transmitted to at
least this number of affected individuals are included in the test subsample.
# Maximum trans. to aff. indivs for controls (neg for all)
1
Controls
the selection of the control subsample of haplotypes. In a given pedigree, only founder haplotypes transmitted to no
more than this number of affected individuals are included in the control
subsample. If the value is negative,
all pedigree-founder haplotypes are included in the control subsample. Several assignment schemes are discussed in
MacLean et al. [2000].
# Use HBPPL (true) or multinomial class frequencies?
(false)
true
Haplotype
Based Posterior Probability of Linkage may be used to measure the
within-pedigree relationship between haplotypes and affection [MacLean et al.
2000]. The alternative is to score
every haplotype as 1.0. Numerical
simulations indicate that significantly more power is achieved when appropriate
HBPPL values are used, but they depend on a realistic segregation model. See Martin et al. [1999] for further
details.
# Maximum number of D alleles among a pedigree's founders
4
When
calculating prior and likelihood terms for HBPPL for a given pedigree,
high-risk disease alleles D are allocated to a subset of pedigree founder
haplotypes in every possible disease genotype configuration. Since it is highly unlikely that there are a
large number of independent copies of the disease mutation D segregating in a
pedigree, computational effort can be reduced with little or no significant
loss of information if calculations are truncated at some point. The value may be set to a large number (say,
20) to disable this option. [Suggested
value: 4]
# Maximum null alleles allowable in founder haplotype
4
Eliminates
degenerate pedigree-founder haplotypes from analysis. Pedigree-founder haplotypes having a large number of missing
alleles should be excluded from analysis because they interfere with the
TRIMHAP permutation scheme for constructing significance levels. [Suggested value: 4]
Heterogeneity
# alpha: proportion of linked founder haps
1.0
Specifies
the locus heterogeneity parameter in the sample being studied. Note that the frequency is per haplotype,
not per pedigree as in the linkage admixture model. Therefore, if linkage
analysis estimates that 20% of families segregate a particular disease
susceptibility locus, and there are an average of four independent haplotypes
per pedigree, then alpha = .05. On the
other hand, we may assume that the linked haplotypes will be concentrated in
the test subsample. Thus, if only one
third of sample haplotypes fall into the test subsample, alpha should be
increased to .15.
Since computational effort increases significantly for
alpha < 1, due to the fact that pedigree likelihoods must be recalculated
assuming alpha = .5, a value of alpha = 1 together with a concomitantly reduced
value of gamma is more efficient.
# gamma: proportion of alpha descended from given ancestor
0.2
Allelic
heterogeneity. The proportion of
haplotypes descended from a given ancestral founder haplotype is given by the
product alpha x gamma. Note that this
product affects only statistic 3, employing trimming probabilities. If we artificially set alpha = 1 to decrease
run-time, gamma should be reduced accordingly so that the product is held at
the appropriate value. For example, a
sample in which 10% of haplotypes are ancestrally-derived could be described
using alpha = 0.5 and gamma = 0.2 or alpha = 1 and gamma = 0.1, but
the latter would be much more efficient for TRIMHAP.
Trimmed-haplotype
statistic
# Which statistic: 1 = est, 2 = regress, 3 = trim prob
3
More
than one formulation of the trimmed-haplotype test is possible. The most familiar form for a likelihood
ratio test uses category frequencies estimated from the data themselves. We call this general purpose LR value statistic
1. Statistic 1 does not explicitly
employ the model on which trimmed-haplotype analysis is based. To exploit the model of the trimming
process, we may employ it to generate the expected value under the alternative
hypothesis in a value called statistic 3.
Using the trimming probability defined in MacLean et al. [2000], we
calculate a category similarity score that measures similarity between
haplotypes in each category and their putative ancestor. In cases where we do not have enough
information to perform the trimming probability calculations, we may perform a
regression analysis between the category position in the trimmed-haplotype
table and the proportion of test versus control subsamples. See Martin et al. [1999] for full
details.
Input
files
# Filenames: haplotypes / allele frequencies
haplo.dump
dom.loc
#
# Is there a sex file?
True
#
# sex filename, if any.
If not, just skip.
families.pre
Input
files should be listed in the order indicated, one name per line. The haplotype file is assumed to be in the
format of a Genehunter haplo.dump output file [Kruglyak et al. 1996]. The locus control file contains marker
allele frequencies and penetrances for the disease locus model. It can be created by the Preplink utility
(cf. [Terwilliger and Ott 1994]). The
sex file is in Pre-makeped format [Terwilliger and Ott 1994] and is used by
TRIMHAP to determine the sex of pedigree members when sex-specific penetrances
are used, since this information is not contained in the Genehunter haplotype
file. If HBPPL is not calculated, or if
penetrances are equal for both sexes, families.pre may be skipped.
# Seed for random number generator
2001
A
starting value between 1 and 30,000 is required for the pseudo-random number
generator. We suggest users change this
value for each separate analysis.
Array
Sizes
The
following array sizes are declared in TRIMHAP near the head of the main
routine, hal.f
$
max_indiv = 40: number of individuals in a single pedigree
$
max_loci = 25: number of marker loci
$
max_alleles = 30: number of alleles at a marker locus
$
max_ped = 1000: number of pedigrees in the sample
$
max_ad_haps = 4000: number of pedigree-founder haplotypes
$
max_replicates = 10000: number of permutation replicates
$
max_depth = 4: number of generations in a pedigree
$
max_dis_pos = 10 * max_loci: number of
test-points for the disease locus
Alternative
values may be defined by the user, as needed.
Running
TRIMHAP
To
run TRIMHAP, type in the program name without arguments, at the command prompt:
trimhap
OUTPUT
Four
output files contain the results of a TRIMHAP analysis. They are:
$
results.out: output for
each haplotype-specific analysis plus scan-wise summary
$
best.out: allele names
and locations, sorted by significance level.
$
map.out: map-wise
significance levels
$
debug.out: intermediate
output with warning and error messages
Haplotype
specific statistics: results.out file
The
results.out file contains full details of results from each specific ancestral
haplotype analyzed by TRIMHAP. There
may be many thousands of such haplotype specific analyses, so in general the
volume of output contained in the results.out file will be overwhelming. For this reason and because of the obvious
problem interpreting statistical significance given the multiple tests being
conducted, we recommend using the map-wise statistics summarized in the map.out
file. However, once a region has been
identified as being of unusual significance with respect to disease location,
the user may wish to examine the output contained in results.out for details
concerning specific ancestral loci and alleles that may be involved. The output for each haplotype specific
analysis is as follows.
Disease position Ancestral
loci Ancestral haps
Scan to date: 1
/ 1 1 / 1 1 / 1
Current disease position: 1 / 1 1 / 1
Current anc loci: 1 / 1
The
first four lines of each haplotype-specific analysis contain a summary of
information about disease location and ancestral loci and alleles, and the
number of analyses performed. These
values are cumulative over an entire test run.
1 2 3 4 5
6 7 8 9 10
11
Ancestral: ‑‑‑‑‑‑‑‑‑‑‑‑X‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑
Loci
=1= =2= =3=
Hap
=5= =4= =1=
Allele freqs
.25 .20 .15
The
current hypothesis being tested is represented pictorially as a map, with the
location of the disease gene given as ‑‑X‑‑. The first
ancestral locus refers to marker locus 2, while the second refers to
marker locus 3, and so on. The
notation =1= means that the first locus was fixed by the user, not
free to vary over the test. The same
convention is used for haplotypes.
The
next few lines of output summarize TRIMHAP control parameters.
Disease
min Span min max
position
freq observ max
test control
0.1250
.100 0.500 8.0000
2 1
duration
alpha gamma
200
1.00 0.20
mutation rate per generation = 1.000E‑05
total error+mutation rate = 2.001E‑03
Number of replicates (feas/ total) = 1000 /
1000
The
disease position is 0.125cM to the right of the first marker locus, the minimum
admissible allele frequency set by the user is 0.100. The current ancestral loci span, 0.500 cM, is less that the
maximum admissible spread of ancestral loci set by the user. The minimum number of transmissions for a
test haplotype and the maximum number of transmissions for a control haplotype
set by the user are listed. The number
of generations since introduction of the mutation, the locus heterogeneity
parameter, alpha, and the allelic heterogeneity parameter, gamma, all set by
the user are listed, as well as mutation rate and genotyping error rate. Finally, the number of permutation
replicates that produced feasible haplotypes is shown together with the total
replications specified by the user.
The
next block of output for the haplotype specific analysis summarizes the
trimmed-haplotype table.
lh rh t
c t c trim contrib
non-t excess
1 2
4 3 .0192 .0156 .1058
1.36 3.2 0.8
1 1
12 9 .0577 .0469 .0828
1.20 9.8 2.2
0 2
6 13 .0288 .0677 .0730
‑0.40 ‑9.4 ‑8.1
1 0
40 37 .1923 .1927 .1821
0.01 ‑3.9 ‑0.1
0 1
25 28 .1202 .1458 .1258
0.85 ‑8.2 ‑5.3
0 0
121 102 .5817
.5313 .4306 ‑2.24 0.0 10.5
400 208
192
Going
from left to right, we have the number of alleles in common with the founder
haplotype on the left- and right-hand sides of the current disease locus
test-location. Next are the number of
haplotypes in the test and control subsamples, and the corresponding frequencies. Theoretical haplotype trimming probabilities
provide the basis for the contribution to statistic 3, only. The excess category frequency of test over
control subsamples, adjusted to the total number of haplotypes in the test
subsample, is used to demonstrate the
familiar fit of test subsample data to the null hypothesis, although TRIMHAP
does not employ this method to calculate the significance value (see MacLean et
al. [2000] for a full description of empirical significance levels) . Finally the estimated category frequencies
for the test sub-sample against the trimming probability, to demonstrate the
fit of statistic 3 to the alternative hypothesis. At the bottom of the table, we see that 400 pedigree-founder
haplotypes are admissible for the current combination of ancestral loci. This may vary slightly from test to test,
due to missing data and ambiguities in tracing within-pedigree
inheritance. These locally-admissible
haplotypes are comprised of 208 test haplotypes and 192 controls. Note that in this case the test and control
subsamples are disjoint, but this need not be so.
statistic used: 3
observed:
raw, normed = ‑7.151, 0.147
replicates: mean, std dev = ‑7.645,
3.360
empirical p‑value = 0.3880
Output
includes estimated significance level, raw and normed values for the statistic
used, and the mean and standard deviation in the permutation replications. See MacLean et al. [2000] for a full
description.
Scan-wise
statistics: results.out file
At
the very end of the results.out output file, significance levels are given for
scan-wise results.
Summary over all scanned configurations:
Number of replicates =
1000
Maximum stat(trim prob) = 0.147 empirical p‑val
= 0.4740
The
statistic represents compound hypothesis testing evidence for a disease locus
located anywhere within the region analyzed.
See MacLean et al. [2000] for further details.
Summary
of all calculations: best.out file
# Created on Tue Nov 30 13:51:04 1999
# File: best.out
Statistic calculated 3
total haps
tested 1532
empir disease
rank config stat
p_val locat ancestral founder
1 2 3 4
5 . . .
‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑
1
1025 2.245 0.0240
2 3 1
5
2 47
1.896 0.0380 1
2 3 1
3 115
1.804 0.0490 2
2 3 1
4 1027
1.635 0.0660 3
3 1 5
5 214
1.433 0.0810 1
4 2 1
6 13
1.403 0.0870 2
1 3 1
7 820
1.289 0.1060 2
1 4 3
8 1026
1.160 0.1250 3
3 2 5
. . .
Every
haplotype-specific test of a TRIMHAP run is listed in the best.out file, sorted
by the value of the user-chosen statistic.
Note that the corresponding empirical p-values are haplotype-specific
values, not adjusted for multiple testing.
The haplotype is identified by its position in the map
( markers 2-3-4 for the first entry), its allele numbers ( alleles 3-1-5 in the
first entry), and the location of the putative disease susceptibility locus
within it. The entry under disease locat refers to the interval following the marker number
given.
The entry config indicates the
position of the detailed analysis in the results.out file, where it corresponds
to Scan to date:
Ancestral haps.
Map.out file
A
list of map-wise significance levels are listed in the map.out file, one for
each test-point for the disease locus:
# File: map.out
# Created on Tue Nov 30 13:16:55 1999
# Statistic used
3
point cM
stat
1 0.125
0.3880
2 0.375
0.5840
. . .
The
index of the test-point for the disease locus, the location in cM of the
test-point from the start of the map, and map-wise significance levels are
listed. Each map-wise significance
level is a compound hypothesis of multiple haplotype specific analyses, with
small values suggesting that the disease locus may be located nearby. Data in the map.out file may be plotted
using the interactive plotting program Gnuplot. This package is commonly installed on many unix (and non-unix)
systems, but if not, it is freely available via anonymous ftp at ftp://ftp.gnuplot.vt.edu/pub/gnuplot/gnuplot-3.7.tar.gz.
Intermediate
output, warnings, and errors: debug.out file
The
debug.out file contains intermediate output from TRIMHAP, together with any
warning or error messages which are generated.
The first section at the head of the file is an echo of the parameters
as read from parameters.dat. There is
no need to inspect debug.out unless TRIMHAP terminates prematurely due to a
fatal error. In this case, a short
explanatory message should appear at the end of debug.out. Note: in our experience, most errors are caused
by improperly-formatted input data in the parameters.dat file. In case of problems, users should carefully
check data in parameters.dat versus the echo of this data printed at the start
of debug.out.
Bugs
Although
we have made every effort to test this package, bugs in the code may still
exist. Users can reach the authors
using the contact information given below, preferably by email. In reporting bugs, please include details
concerning computer hardware and operating system version, together with input
and output files if possible. It is our
experience that most failures result from faulty inputs. Before you report a
bug, we ask that you check to ensure the failure is not due to
wrongly-formatted input data.
Contact
address for scientific questions:
Charles
MacLean
Virginia
Institute for Psychiatric and Behavioral Genetics
Virginia
Commonwealth University
Box
980126, Richmond, VA 23298
email:
cmaclean@bara.psi.vcu.edu
Contact
address for computer questions:
Huan
Wang
Virginia
Institute for Psychiatric and Behavioral Genetics
Virginia
Commonwealth University
Box
980126, Richmond, VA 23298
email:
huwang@hsc.vcu.edu
References
MacLean CJ, Martin RB, Sham PC, Wang H, Straub RE,
Kendler KS (2000) The trimmed‑haplotype test for linkage disequilibrium. American Journal of Human Genetics 66:1062-1075
Martin RB, MacLean CJ, Sham PC, Straub RE, Kendler KS
(1999) Tests for linkage disequilibrium: haplotypes, multiplex pedigrees, and
complex traits. Human Heredity
Under review
Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996)
Parametric and nonparametric linkage
analysis: a unified multipoint approach. American Journal of Human Genetics 58: 1347‑1363
Terwilliger JD, Ott J (1994) Handbook of Human Genetic Linkage. Baltimore, John Hopkins University Press