Skip to content

Gene drop and tstrait phenotyping

The standard simACE pipeline draws additive genetic values \(A\) from a Gaussian under an infinitesimal model, with target variance set in the config. The gene-drop pipeline replaces that parametric \(A\) with a realistic genetic value computed from real genotypes — generated by dropping a SimHumanity-derived ancestral tree sequence through a simACE pedigree — then rescales it back to the configured \(A\) variance so downstream phenotype, sampling, and atlas plotting work without modification.

This page is the conceptual overview. For the worked alpha / num_causal sweep see Architecture sweep. For the relevant config keys see Configuration.

Motivation

The Gaussian-\(A\) model treats trait genetics as a sum of infinitely many tiny effects, producing a smooth distribution that ignores how real heritability emerges from a finite causal-variant architecture.

Gene drop simulates the actual mechanism:

  1. Founder haplotypes carry SLiM-derived mutations from the SimHumanity ancestral simulation.
  2. Mutations propagate through the simACE pedigree by Mendelian inheritance with realistic recombination (HapMapII GRCh38 rate map).
  3. A finite subset of those mutations is declared causal; effect sizes are drawn from \(\mathcal{N}(0, \sigma_\beta^2) \cdot [2p(1-p)]^{\alpha}\) (the LDAK-thin / Speed et al. form when \(\alpha=-0.5\)).
  4. Per-individual genetic value is the sum of dosage × effect across causal sites.

The shape and tails of the realized \(A\) distribution reflect the genetic architecture rather than a Gaussian assumption — relevant when methods that estimate variance components from realistic pedigree data are stress-tested.

After computing the realistic \(A\), the pipeline rescales it to match the configured A1 variance so that the simACE phenotype models (frailty, threshold, censoring) operate on the same scale as a Gaussian-\(A\) run. This renders the gene-drop and Gaussian-\(A\) results directly comparable on every downstream plot.

Pipeline

The gene-drop branch sits between simACE's pedigree simulation and its phenotype-model rules. It has three pre-pipeline stages and four per-rep stages:

Pre-pipeline (one-shot, scenario-independent)
─────────────────────────────────────────────
SimHumanity p2 .trees ──► tskit_preprocess_canonicalize_chrom (×22)
                          tskit_preprocess_concat
                          tstrait_site_catalog_chrom (×22)
                          tstrait_site_catalog_concat
                          <preprocessed>/site_catalog.parquet


Per-rep
────────
simulate_pedigree_liability ──► pedigree.full.parquet
              └──► simulate_genotypes_chrom (×22)  ──► genotypes_chrom_{n}.trees
                  tstrait_assign_effects  ──► causal_effects.parquet
                  tstrait_gv_chrom (×22)  ──► gv_chrom_{n}.parquet
                  tstrait_augment_pedigree  ──► pedigree.full.tstrait.parquet
                  pedigree_dropout (reads .tstrait. when use_gene_drop=true)
                  phenotype → censor → sample → stats → plot → atlas

Stage 1 — preprocess SimHumanity ancestry

tskit_preprocess canonicalizes the source per-chromosome .trees files (filters to one mutation per site, sorts samples canonically) and concatenates them. This produces the reusable founder-AF site catalog (site_catalog.parquet, ~850 MB for the 49M-site p2 dataset) used by every gene-drop scenario.

Run once with:

snakemake --use-conda --cores 4 tskit_preprocess

Stage 2 — drop founders through the simACE pedigree

simulate_genotypes_chrom builds an msprime fixed-pedigree from the simACE pedigree (the latest G_pheno generations are tagged as samples) and runs msprime.sim_ancestry(model="fixed_pedigree") with the per-chromosome SimHumanity recombination map. The dropped tree is then grafted onto the preprocessed ancestral tree at the founder generation, producing a chromosome-scale tree sequence whose present-day samples carry alleles inherited Mendelianly from realistic founder haplotypes.

This is the slowest step (chr1 ~16 min on one core; ~5 min wall with -j 8). Running variants of the same pedigree is cheap (see drop_from below).

Stage 3 — assign causal effects + compute genetic value

tstrait_assign_effects samples num_causal (or frac_causal × eligible) sites from the founder-AF catalog after applying the MAF filter, draws per-site raw effects \(\beta \sim \mathcal{N}(\mu, \sigma_\beta^2)\), and multiplies by \([2p(1-p)]^{\alpha}\) to get the final effect.

tstrait_gv_chrom computes per-individual genetic value via a numba-jitted single-pass kernel built on tskit.jit.numba.NumbaTreeSequence. The kernel iterates each chromosome's trees once, maintains left_child / right_sib arrays incrementally, and DFS's each causal site's mutation node into a per-individual GV array. ~5-6× faster than tstrait.genetic_value at high causal density (verified byte-equivalent up to floating-point summation order).

Sample-tagged ancestors (gens covered by G_pheno) get their GV; older generations have no GV (their nodes aren't sample-tagged in the dropped tree).

Stage 4 — rescale and overwrite A in the pedigree

tstrait_augment_pedigree reads pedigree.full.parquet, sums the per-chrom GVs to a genome-wide value per sample individual, then:

  1. Centers at zero.
  2. Rescales so that $\mathrm{Var}(A_\text{new}) = $ the configured A1 exactly (sample variance, ddof=0).
  3. Overwrites the A1 column for sample individuals (the latest G_pheno generations).
  4. Recomputes liability1 = \(A_\text{new} + C + E\).
  5. Writes pedigree.full.tstrait.parquet (sibling to the original; the original is left untouched).

Older ancestors keep their original parametric A1. This asymmetry is intentional: only sample-gen individuals appear in downstream model fits that rely on the sampled phenotype.

A2 is not touched — gene drop is single-trait. Sharing the gene-drop \(A\) between both traits requires extending the augment script.

Stage 5 — feed it into the standard simACE pipeline

When use_gene_drop: true is set on a scenario, the pedigree_dropout rule reads pedigree.full.tstrait.parquet instead of pedigree.full.parquet. Everything downstream — phenotype model, censoring, sampling, stats, plots, atlas — runs identically to a Gaussian-\(A\) scenario, but on the gene-drop-derived \(A\) column.

Sharing drops across variants

Drop+graft is expensive. To vary only the architecture (tstrait.alpha, tstrait.num_causal, etc.) while holding the genotypes fixed, set drop_from: <base_scenario> on the variant. The tstrait_gv_chrom and tstrait_augment_pedigree rules then read the base scenario's .trees and pedigree.full.parquet; only the cheap tstrait sub-pipeline runs per variant.

A six-scenario architecture sweep over num_causal ∈ {100, 1k, 10k, 100k, 1m} and alpha ∈ {0, -0.5} therefore reuses a single set of drops instead of running drop+graft six times.

Heritability and the h2 parameter

The tstrait.h2 config key is not used. The pipeline derives \(h^2\) from the standard simACE variance components:

\[h^2 = \frac{A_1}{A_1 + C_1 + E_1}\]

This keeps the gene-drop scenarios on the same heritability scale as Gaussian-\(A\) scenarios. The realized \(h^2\) in the meta JSON (tstrait_phenotype_meta.json:h2_realized) targets this derived value.

Comparison points

Gaussian \(A\) (standard simACE) Gene drop
\(A\) comes from \(\mathcal{N}(0, A_1)\) per individual sum of dosage × effect over causal sites, rescaled to \(A_1\)
Architecture infinitesimal finite causal variants under \(\alpha\)-MAF model
Cost per rep ~seconds ~5-10 min wall (drop+graft dominates)
Comparability trivially same scale exact-match scale via rescale
Multi-trait yes (trait1 + trait2) trait1 only; trait2 still parametric
Parameters A1, C1, E1, assort1, ... + tstrait.{num_causal, alpha, maf_threshold, ...}, drop_from, use_gene_drop