Gene drop and tstrait phenotyping¶

The standard simACE pipeline draws additive genetic values $A$ from a Gaussian under an infinitesimal model, with target variance set in the config. The gene-drop pipeline replaces that parametric $A$ with a realistic genetic value computed from real genotypes — generated by dropping a SimHumanity-derived ancestral tree sequence through a simACE pedigree — then rescales it back to the configured $A$ variance so downstream phenotype, sampling, and atlas plotting work without modification.

This page is the conceptual overview. For the worked alpha / num_causal sweep see Architecture sweep. For the relevant config keys see Configuration.

Motivation¶

The Gaussian-$A$ model treats trait genetics as a sum of infinitely many tiny effects, producing a smooth distribution that ignores how real heritability emerges from a finite causal-variant architecture.

Gene drop simulates the actual mechanism:

Founder haplotypes carry SLiM-derived mutations from the SimHumanity ancestral simulation.
Mutations propagate through the simACE pedigree by Mendelian inheritance with realistic recombination (HapMapII GRCh38 rate map).
A finite subset of those mutations is declared causal; effect sizes are drawn from $\mathcal{N}(0, \sigma_\beta^2) \cdot [2p(1-p)]^{\alpha}$ (the LDAK-thin / Speed et al. form when $\alpha=-0.5$).
Per-individual genetic value is the sum of dosage × effect across causal sites.

The shape and tails of the realized $A$ distribution reflect the genetic architecture rather than a Gaussian assumption — relevant when methods that estimate variance components from realistic pedigree data are stress-tested.

After computing the realistic $A$, the pipeline rescales it to match the configured A1 variance so that the simACE phenotype models (frailty, threshold, censoring) operate on the same scale as a Gaussian-$A$ run. This renders the gene-drop and Gaussian-$A$ results directly comparable on every downstream plot.

Pipeline¶

The gene-drop branch sits between simACE's pedigree simulation and its phenotype-model rules. It has three pre-pipeline stages and four per-rep stages:

Pre-pipeline (one-shot, scenario-independent)
─────────────────────────────────────────────
SimHumanity p2 .trees ──► tskit_preprocess_canonicalize_chrom (×22)
                                        │
                                        ▼
                          tskit_preprocess_concat
                                        │
                                        ▼
                          tstrait_site_catalog_chrom (×22)
                                        │
                                        ▼
                          tstrait_site_catalog_concat
                                        │
                                        ▼
                          <preprocessed>/site_catalog.parquet


Per-rep
────────
simulate_pedigree_liability ──► pedigree.full.parquet
              │
              └──► simulate_genotypes_chrom (×22)  ──► genotypes_chrom_{n}.trees
                            │
                            ▼
                  tstrait_assign_effects  ──► causal_effects.parquet
                            │
                            ▼
                  tstrait_gv_chrom (×22)  ──► gv_chrom_{n}.parquet
                            │
                            ▼
                  tstrait_augment_pedigree  ──► pedigree.full.tstrait.parquet
                            │
                            ▼
                  pedigree_dropout (reads .tstrait. when use_gene_drop=true)
                            │
                            ▼
                  phenotype → censor → sample → stats → plot → atlas

Stage 1 — preprocess SimHumanity ancestry¶

tskit_preprocess canonicalizes the source per-chromosome .trees files (filters to one mutation per site, sorts samples canonically) and concatenates them. This produces the reusable founder-AF site catalog (site_catalog.parquet, ~850 MB for the 49M-site p2 dataset) used by every gene-drop scenario.

Run once with:

snakemake --use-conda --cores 4 tskit_preprocess

Stage 2 — drop founders through the simACE pedigree¶

simulate_genotypes_chrom builds an msprime fixed-pedigree from the simACE pedigree (the latest G_pheno generations are tagged as samples) and runs msprime.sim_ancestry(model="fixed_pedigree") with the per-chromosome SimHumanity recombination map. The dropped tree is then grafted onto the preprocessed ancestral tree at the founder generation, producing a chromosome-scale tree sequence whose present-day samples carry alleles inherited Mendelianly from realistic founder haplotypes.

This is the slowest step (chr1 ~16 min on one core; ~5 min wall with -j 8). Running variants of the same pedigree is cheap (see drop_from below).

Stage 3 — assign causal effects + compute genetic value¶

tstrait_assign_effects samples num_causal (or frac_causal × eligible) sites from the founder-AF catalog after applying the MAF filter, draws per-site raw effects $\beta \sim \mathcal{N}(\mu, \sigma_\beta^2)$, and multiplies by $[2p(1-p)]^{\alpha}$ to get the final effect.

tstrait_gv_chrom computes per-individual genetic value via a numba-jitted single-pass kernel built on tskit.jit.numba.NumbaTreeSequence. The kernel iterates each chromosome's trees once, maintains left_child / right_sib arrays incrementally, and DFS's each causal site's mutation node into a per-individual GV array. ~5-6× faster than tstrait.genetic_value at high causal density (verified byte-equivalent up to floating-point summation order).

Sample-tagged ancestors (gens covered by G_pheno) get their GV; older generations have no GV (their nodes aren't sample-tagged in the dropped tree).

Stage 4 — rescale and overwrite A in the pedigree¶

tstrait_augment_pedigree reads pedigree.full.parquet, sums the per-chrom GVs to a genome-wide value per sample individual, then:

Centers at zero.
Rescales so that $\mathrm{Var}(A_\text{new}) = $ the configured A1 exactly (sample variance, ddof=0).
Overwrites the A1 column for sample individuals (the latest G_pheno generations).
Recomputes liability1 = $A_\text{new} + C + E$.
Writes pedigree.full.tstrait.parquet (sibling to the original; the original is left untouched).

Older ancestors keep their original parametric A1. This asymmetry is intentional: only sample-gen individuals appear in downstream model fits that rely on the sampled phenotype.

A2 is not touched — gene drop is single-trait. Sharing the gene-drop $A$ between both traits requires extending the augment script.

Stage 5 — feed it into the standard simACE pipeline¶

When use_gene_drop: true is set on a scenario, the pedigree_dropout rule reads pedigree.full.tstrait.parquet instead of pedigree.full.parquet. Everything downstream — phenotype model, censoring, sampling, stats, plots, atlas — runs identically to a Gaussian-$A$ scenario, but on the gene-drop-derived $A$ column.

Drop+graft is expensive. To vary only the architecture (tstrait.alpha, tstrait.num_causal, etc.) while holding the genotypes fixed, set drop_from: <base_scenario> on the variant. The tstrait_gv_chrom and tstrait_augment_pedigree rules then read the base scenario's .trees and pedigree.full.parquet; only the cheap tstrait sub-pipeline runs per variant.

A six-scenario architecture sweep over num_causal ∈ {100, 1k, 10k, 100k, 1m} and alpha ∈ {0, -0.5} therefore reuses a single set of drops instead of running drop+graft six times.

Heritability and the `h2` parameter¶

The tstrait.h2 config key is not used. The pipeline derives $h^2$ from the standard simACE variance components:

\[h^2 = \frac{A_1}{A_1 + C_1 + E_1}\]

This keeps the gene-drop scenarios on the same heritability scale as Gaussian-$A$ scenarios. The realized $h^2$ in the meta JSON (tstrait_phenotype_meta.json:h2_realized) targets this derived value.

Comparison points¶

	Gaussian $A$ (standard simACE)	Gene drop
$A$ comes from	$\mathcal{N}(0, A_1)$ per individual	sum of dosage × effect over causal sites, rescaled to $A_1$
Architecture	infinitesimal	finite causal variants under $\alpha$-MAF model
Cost per rep	~seconds	~5-10 min wall (drop+graft dominates)
Comparability	trivially same scale	exact-match scale via rescale
Multi-trait	yes (trait1 + trait2)	trait1 only; trait2 still parametric
Parameters	`A1`, `C1`, `E1`, `assort1`, ...	+ `tstrait.{num_causal, alpha, maf_threshold, ...}`, `drop_from`, `use_gene_drop`

	Gaussian \(A\) (standard simACE)	Gene drop
\(A\) comes from	\(\mathcal{N}(0, A_1)\) per individual	sum of dosage × effect over causal sites, rescaled to \(A_1\)
Architecture	infinitesimal	finite causal variants under \(\alpha\)-MAF model
Cost per rep	~seconds	~5-10 min wall (drop+graft dominates)
Comparability	trivially same scale	exact-match scale via rescale
Multi-trait	yes (trait1 + trait2)	trait1 only; trait2 still parametric
Parameters	`A1`, `C1`, `E1`, `assort1`, ...	+ `tstrait.{num_causal, alpha, maf_threshold, ...}`, `drop_from`, `use_gene_drop`