Gene drop and tstrait phenotyping¶
The standard simACE pipeline draws additive genetic values \(A\) from a Gaussian under an infinitesimal model, with target variance set in the config. The gene-drop pipeline replaces that parametric \(A\) with a realistic genetic value computed from real genotypes — generated by dropping a SimHumanity-derived ancestral tree sequence through a simACE pedigree — then rescales it back to the configured \(A\) variance so downstream phenotype, sampling, and atlas plotting work without modification.
This page is the conceptual overview. For the worked alpha / num_causal sweep see Architecture sweep. For the relevant config keys see Configuration.
Motivation¶
The Gaussian-\(A\) model treats trait genetics as a sum of infinitely many tiny effects, producing a smooth distribution that ignores how real heritability emerges from a finite causal-variant architecture.
Gene drop simulates the actual mechanism:
- Founder haplotypes carry SLiM-derived mutations from the SimHumanity ancestral simulation.
- Mutations propagate through the simACE pedigree by Mendelian inheritance with realistic recombination (HapMapII GRCh38 rate map).
- A finite subset of those mutations is declared causal; effect sizes are drawn from \(\mathcal{N}(0, \sigma_\beta^2) \cdot [2p(1-p)]^{\alpha}\) (the LDAK-thin / Speed et al. form when \(\alpha=-0.5\)).
- Per-individual genetic value is the sum of dosage × effect across causal sites.
The shape and tails of the realized \(A\) distribution reflect the genetic architecture rather than a Gaussian assumption — relevant when methods that estimate variance components from realistic pedigree data are stress-tested.
After computing the realistic \(A\), the pipeline rescales it to match
the configured A1 variance so that the simACE phenotype models
(frailty, threshold, censoring) operate on the same scale as a
Gaussian-\(A\) run. This renders the gene-drop and Gaussian-\(A\) results
directly comparable on every downstream plot.
Pipeline¶
The gene-drop branch sits between simACE's pedigree simulation and its phenotype-model rules. It has three pre-pipeline stages and four per-rep stages:
Pre-pipeline (one-shot, scenario-independent)
─────────────────────────────────────────────
SimHumanity p2 .trees ──► tskit_preprocess_canonicalize_chrom (×22)
│
▼
tskit_preprocess_concat
│
▼
tstrait_site_catalog_chrom (×22)
│
▼
tstrait_site_catalog_concat
│
▼
<preprocessed>/site_catalog.parquet
Per-rep
────────
simulate_pedigree_liability ──► pedigree.full.parquet
│
└──► simulate_genotypes_chrom (×22) ──► genotypes_chrom_{n}.trees
│
▼
tstrait_assign_effects ──► causal_effects.parquet
│
▼
tstrait_gv_chrom (×22) ──► gv_chrom_{n}.parquet
│
▼
tstrait_augment_pedigree ──► pedigree.full.tstrait.parquet
│
▼
pedigree_dropout (reads .tstrait. when use_gene_drop=true)
│
▼
phenotype → censor → sample → stats → plot → atlas
Stage 1 — preprocess SimHumanity ancestry¶
tskit_preprocess canonicalizes the source per-chromosome .trees files
(filters to one mutation per site, sorts samples canonically) and
concatenates them. This produces the reusable founder-AF site catalog
(site_catalog.parquet, ~850 MB for the 49M-site p2 dataset) used by every
gene-drop scenario.
Run once with:
Stage 2 — drop founders through the simACE pedigree¶
simulate_genotypes_chrom builds an msprime fixed-pedigree from the simACE
pedigree (the latest G_pheno generations are tagged as samples) and runs
msprime.sim_ancestry(model="fixed_pedigree") with the per-chromosome
SimHumanity recombination map. The dropped tree is then grafted onto the
preprocessed ancestral tree at the founder generation, producing a
chromosome-scale tree sequence whose present-day samples carry alleles
inherited Mendelianly from realistic founder haplotypes.
This is the slowest step (chr1 ~16 min on one core; ~5 min wall with -j 8).
Running variants of the same pedigree is cheap (see
drop_from below).
Stage 3 — assign causal effects + compute genetic value¶
tstrait_assign_effects samples num_causal (or frac_causal × eligible)
sites from the founder-AF catalog after applying the MAF filter, draws
per-site raw effects \(\beta \sim \mathcal{N}(\mu, \sigma_\beta^2)\), and
multiplies by \([2p(1-p)]^{\alpha}\) to get the final effect.
tstrait_gv_chrom computes per-individual genetic value via a
numba-jitted single-pass kernel built on
tskit.jit.numba.NumbaTreeSequence. The kernel iterates each chromosome's
trees once, maintains left_child / right_sib arrays incrementally, and
DFS's each causal site's mutation node into a per-individual GV array.
~5-6× faster than tstrait.genetic_value at high causal density (verified
byte-equivalent up to floating-point summation order).
Sample-tagged ancestors (gens covered by G_pheno) get their GV; older
generations have no GV (their nodes aren't sample-tagged in the dropped
tree).
Stage 4 — rescale and overwrite A in the pedigree¶
tstrait_augment_pedigree reads pedigree.full.parquet, sums the per-chrom
GVs to a genome-wide value per sample individual, then:
- Centers at zero.
- Rescales so that $\mathrm{Var}(A_\text{new}) = $ the configured
A1exactly (sample variance, ddof=0). - Overwrites the
A1column for sample individuals (the latestG_phenogenerations). - Recomputes
liability1= \(A_\text{new} + C + E\). - Writes
pedigree.full.tstrait.parquet(sibling to the original; the original is left untouched).
Older ancestors keep their original parametric A1. This asymmetry is
intentional: only sample-gen individuals appear in downstream model fits
that rely on the sampled phenotype.
A2 is not touched — gene drop is single-trait. Sharing the gene-drop
\(A\) between both traits requires extending the augment script.
Stage 5 — feed it into the standard simACE pipeline¶
When use_gene_drop: true is set on a scenario, the pedigree_dropout rule
reads pedigree.full.tstrait.parquet instead of pedigree.full.parquet.
Everything downstream — phenotype model, censoring, sampling, stats, plots,
atlas — runs identically to a Gaussian-\(A\) scenario, but on the
gene-drop-derived \(A\) column.
Sharing drops across variants¶
Drop+graft is expensive. To vary only the architecture
(tstrait.alpha, tstrait.num_causal, etc.) while holding the
genotypes fixed, set drop_from: <base_scenario> on the variant. The
tstrait_gv_chrom and tstrait_augment_pedigree rules then read the
base scenario's .trees and pedigree.full.parquet; only the cheap
tstrait sub-pipeline runs per variant.
A six-scenario architecture sweep over
num_causal ∈ {100, 1k, 10k, 100k, 1m} and alpha ∈ {0, -0.5} therefore
reuses a single set of drops instead of running drop+graft six times.
Heritability and the h2 parameter¶
The tstrait.h2 config key is not used. The pipeline derives
\(h^2\) from the standard simACE variance components:
This keeps the gene-drop scenarios on the same heritability scale as
Gaussian-\(A\) scenarios. The realized \(h^2\) in the meta JSON
(tstrait_phenotype_meta.json:h2_realized) targets this derived value.
Comparison points¶
| Gaussian \(A\) (standard simACE) | Gene drop | |
|---|---|---|
| \(A\) comes from | \(\mathcal{N}(0, A_1)\) per individual | sum of dosage × effect over causal sites, rescaled to \(A_1\) |
| Architecture | infinitesimal | finite causal variants under \(\alpha\)-MAF model |
| Cost per rep | ~seconds | ~5-10 min wall (drop+graft dominates) |
| Comparability | trivially same scale | exact-match scale via rescale |
| Multi-trait | yes (trait1 + trait2) | trait1 only; trait2 still parametric |
| Parameters | A1, C1, E1, assort1, ... |
+ tstrait.{num_causal, alpha, maf_threshold, ...}, drop_from, use_gene_drop |