Gene-drop architecture sweep¶
This page examines how the genetic architecture (alpha and
num_causal) shapes the realized \(A\) distribution under
gene drop. Six scenarios vary one
architecture parameter at a time against a fixed pedigree+drop,
documenting both the raw genetic-value distributions and their form
after the tstrait_augment_pedigree rescale, which places them on the
same scale as Gaussian-\(A\) scenarios.
Scenarios¶
All six share an identical pedigree+drop via drop_from: p2_full:
- N = 35,951 founders (the full SimHumanity p2 panel)
- G_ped = 10, G_pheno = 3 (107,853 sample inds across the latest 3 generations)
- Variance composition A=0.4, C=0.1, E=0.5 (so \(h^2 = 0.4\)) for both traits
- No assortative mating, no MZ twins
- Same trait1 and trait2 phenotype model (frailty / weibull, scale=2160, rho=0.8, beta=1.0, prevalence=0.10)
The variants differ only in the tstrait architecture:
| Scenario | alpha |
num_causal |
What it explores |
|---|---|---|---|
p2_full |
-0.5 | 1,000 | anchor (defaults) |
p2_full_a0 |
0.0 | 1,000 | no MAF dependence in effect sizes |
p2_full_c100 |
-0.5 | 100 | very sparse architecture |
p2_full_c10k |
-0.5 | 10,000 | moderate polygenicity |
p2_full_c100k |
-0.5 | 100,000 | high polygenicity |
p2_full_c1m |
-0.5 | 1,000,000 | near-infinitesimal |
Defined in config/genotype_drop.yaml. The variants use YAML anchors to
inherit everything from p2_full and override only the tstrait knobs.
Run¶
The drop+graft step (simulate_genotypes_chrom) only runs once for
p2_full; the variants reuse those trees:
# Drop the founders through the simACE pedigree (one-shot, ~5 min wall at -j 8)
snakemake --use-conda --cores 8 \
results/genotype_drop/p2_full/rep1/.simulate_genotypes.done
# Run all 5 variants (each only the tstrait sub-pipeline + augment + atlas)
snakemake --use-conda --rerun-triggers mtime -j 8 \
results/genotype_drop/p2_full_a0/plots/atlas.pdf \
results/genotype_drop/p2_full_c100/plots/atlas.pdf \
results/genotype_drop/p2_full_c10k/plots/atlas.pdf \
results/genotype_drop/p2_full_c100k/plots/atlas.pdf \
results/genotype_drop/p2_full_c1m/plots/atlas.pdf
Raw GV variance scales linearly with num_causal¶
Before the rescale, the variance of the genome-wide GV is roughly the sum
of per-site contributions, \(\sigma_\beta^2 \cdot \sum_i [2 p_i (1 - p_i)]^{2\alpha + 1}\).
With \(\sigma_\beta = 1\) and identically-distributed AFs, this scales linearly
in num_causal:
| Scenario | n_causal | realized var(GV) | scale factor (to var=0.4) |
|---|---|---|---|
p2_full_c100 |
100 | 98 | 6.4e-2 |
p2_full |
1,000 | 966 | 2.0e-2 |
p2_full_c10k |
10,000 | 10,059 | 6.3e-3 |
p2_full_c100k |
100,000 | 97,629 | 2.0e-3 |
p2_full_c1m |
1,000,000 | 1,018,392 | 6.3e-4 |
Each 10× in num_causal gives ~10× in raw var(GV) and ~\(\sqrt{10}^{-1}\) in
the rescale factor.
alpha amplifies rare-allele contribution¶
At fixed num_causal=1000, switching from \(\alpha=0\) (no MAF dependence)
to \(\alpha=-0.5\) (LDAK-thin: rare alleles get larger effects) inflates raw
var(GV) about 4-fold:
| Scenario | alpha | realized var(GV) |
|---|---|---|
p2_full_a0 |
0.0 | 219 |
p2_full |
-0.5 | 966 |
The same 1000 sites and effects, weighted differently across the AF spectrum.
After rescale, all six are on the same scale¶
The augment rule centers each scenario's GV at zero and rescales sample
variance to exactly the configured A1:
| Scenario | var(A1) before (parametric) | var(A1) after (gene drop) | \(h^2\) realized |
|---|---|---|---|
p2_full_a0 |
0.4 (Gaussian draw) | 0.4000 | 0.3999 |
p2_full_c100 |
0.4 | 0.4000 | 0.3999 |
p2_full |
0.4 | 0.4000 | (anchor) |
p2_full_c10k |
0.4 | 0.4000 | 0.3999 |
p2_full_c100k |
0.4 | 0.4000 | 0.3999 |
p2_full_c1m |
0.4 | 0.4000 | 0.3999 |
The realized \(h^2\) matches \(A_1 / (A_1 + C_1 + E_1) = 0.4 / 1.0 = 0.4\)
exactly — the tstrait sim_env step targets that derived value.
Remaining comparisons in the atlas¶
After the augment, every scenario's pedigree.full.tstrait.parquet
has \(A_1\) with variance \(0.4\). What still differs between scenarios is
the distribution shape of \(A_1\) across the sample population: a
100-causal architecture has fatter tails (a few large-effect carriers
dominate) than a 1M-causal architecture (which converges to Gaussian
by the CLT).
The per-scenario atlas
(results/genotype_drop/{scenario}/plots/atlas.pdf) runs the full
simACE phenotype → censor → sample → stats chain on the augmented
pedigree; kinship-based heritability estimates, relative-pair
correlations, and prevalence plots therefore reflect the gene-drop
\(A\).
Pairwise comparison of the atlases reveals how the architecture affects:
- Phenotype prevalence per generation — should be identical to within sampling noise, because the rescale fixes every \(A_1\) at the same variance and the threshold is calibrated against unit-variance liability.
- Heritability estimates (Falconer, regression-on-relatives) — should also be approximately \(0.4\) across scenarios, but estimator variance may differ as a function of the underlying architecture (sparse architectures with rare large-effect variants are harder to estimate).
- Relative-pair correlations vs theoretical kinship — the line should pass through the expected slope \(h^2 = 0.4\) in all cases.
A method whose heritability estimate degrades under sparse architectures will exhibit that degradation in this sweep.
Cost¶
On a single workstation:
- Pre-pipeline (one-shot): ~10 min for
tskit_preprocess, ~10 min forsite_catalog. Reusable across all gene-drop scenarios forever. simulate_genotypes_chrom× 22: ~2 hr CPU total, ~30 min wall at -j 8 (chr1 alone takes ~16 min on one core). Done once forp2_full; variants reuse viadrop_from.- Per variant: tstrait sub-pipeline + augment + phenotype + atlas — a few
minutes each, dominated by
tstrait_gv_chromforc1m(~50 s/chrom × 22 / 8 cores ≈ 2-3 min wall).