Gene-drop architecture sweep¶

This page examines how the genetic architecture (alpha and num_causal) shapes the realized \(A\) distribution under gene drop. Six scenarios vary one architecture parameter at a time against a fixed pedigree+drop, documenting both the raw genetic-value distributions and their form after the tstrait_augment_pedigree rescale, which places them on the same scale as Gaussian-\(A\) scenarios.

Scenarios¶

All six share an identical pedigree+drop via drop_from: p2_full:

N = 35,951 founders (the full SimHumanity p2 panel)
G_ped = 10, G_pheno = 3 (107,853 sample inds across the latest 3 generations)
Variance composition A=0.4, C=0.1, E=0.5 (so \(h^2 = 0.4\)) for both traits
No assortative mating, no MZ twins
Same trait1 and trait2 phenotype model (frailty / weibull, scale=2160, rho=0.8, beta=1.0, prevalence=0.10)

The variants differ only in the tstrait architecture:

Scenario	`alpha`	`num_causal`	What it explores
`p2_full`	-0.5	1,000	anchor (defaults)
`p2_full_a0`	0.0	1,000	no MAF dependence in effect sizes
`p2_full_c100`	-0.5	100	very sparse architecture
`p2_full_c10k`	-0.5	10,000	moderate polygenicity
`p2_full_c100k`	-0.5	100,000	high polygenicity
`p2_full_c1m`	-0.5	1,000,000	near-infinitesimal

Defined in config/genotype_drop.yaml. The variants use YAML anchors to inherit everything from p2_full and override only the tstrait knobs.

Run¶

The drop+graft step (simulate_genotypes_chrom) only runs once for p2_full; the variants reuse those trees:

# Drop the founders through the simACE pedigree (one-shot, ~5 min wall at -j 8)
snakemake --use-conda --cores 8 \
  results/genotype_drop/p2_full/rep1/.simulate_genotypes.done

# Run all 5 variants (each only the tstrait sub-pipeline + augment + atlas)
snakemake --use-conda --rerun-triggers mtime -j 8 \
  results/genotype_drop/p2_full_a0/plots/atlas.pdf \
  results/genotype_drop/p2_full_c100/plots/atlas.pdf \
  results/genotype_drop/p2_full_c10k/plots/atlas.pdf \
  results/genotype_drop/p2_full_c100k/plots/atlas.pdf \
  results/genotype_drop/p2_full_c1m/plots/atlas.pdf

Raw GV variance scales linearly with `num_causal`¶

Before the rescale, the variance of the genome-wide GV is roughly the sum of per-site contributions, \(\sigma_\beta^2 \cdot \sum_i [2 p_i (1 - p_i)]^{2\alpha + 1}\). With \(\sigma_\beta = 1\) and identically-distributed AFs, this scales linearly in num_causal:

Scenario	n_causal	realized var(GV)	scale factor (to var=0.4)
`p2_full_c100`	100	98	6.4e-2
`p2_full`	1,000	966	2.0e-2
`p2_full_c10k`	10,000	10,059	6.3e-3
`p2_full_c100k`	100,000	97,629	2.0e-3
`p2_full_c1m`	1,000,000	1,018,392	6.3e-4

Each 10× in num_causal gives ~10× in raw var(GV) and ~\(\sqrt{10}^{-1}\) in the rescale factor.

`alpha` amplifies rare-allele contribution¶

At fixed num_causal=1000, switching from \(\alpha=0\) (no MAF dependence) to \(\alpha=-0.5\) (LDAK-thin: rare alleles get larger effects) inflates raw var(GV) about 4-fold:

Scenario	alpha	realized var(GV)
`p2_full_a0`	0.0	219
`p2_full`	-0.5	966

The same 1000 sites and effects, weighted differently across the AF spectrum.

After rescale, all six are on the same scale¶

The augment rule centers each scenario's GV at zero and rescales sample variance to exactly the configured A1:

Scenario	var(A1) before (parametric)	var(A1) after (gene drop)	\(h^2\) realized
`p2_full_a0`	0.4 (Gaussian draw)	0.4000	0.3999
`p2_full_c100`	0.4	0.4000	0.3999
`p2_full`	0.4	0.4000	(anchor)
`p2_full_c10k`	0.4	0.4000	0.3999
`p2_full_c100k`	0.4	0.4000	0.3999
`p2_full_c1m`	0.4	0.4000	0.3999

The realized \(h^2\) matches \(A_1 / (A_1 + C_1 + E_1) = 0.4 / 1.0 = 0.4\) exactly — the tstrait sim_env step targets that derived value.

Remaining comparisons in the atlas¶

After the augment, every scenario's pedigree.full.tstrait.parquet has \(A_1\) with variance \(0.4\). What still differs between scenarios is the distribution shape of \(A_1\) across the sample population: a 100-causal architecture has fatter tails (a few large-effect carriers dominate) than a 1M-causal architecture (which converges to Gaussian by the CLT).

The per-scenario atlas (results/genotype_drop/{scenario}/plots/atlas.pdf) runs the full simACE phenotype → censor → sample → stats chain on the augmented pedigree; kinship-based heritability estimates, relative-pair correlations, and prevalence plots therefore reflect the gene-drop \(A\).

Pairwise comparison of the atlases reveals how the architecture affects:

Phenotype prevalence per generation — should be identical to within sampling noise, because the rescale fixes every \(A_1\) at the same variance and the threshold is calibrated against unit-variance liability.
Heritability estimates (Falconer, regression-on-relatives) — should also be approximately \(0.4\) across scenarios, but estimator variance may differ as a function of the underlying architecture (sparse architectures with rare large-effect variants are harder to estimate).
Relative-pair correlations vs theoretical kinship — the line should pass through the expected slope \(h^2 = 0.4\) in all cases.

A method whose heritability estimate degrades under sparse architectures will exhibit that degradation in this sweep.

Cost¶

On a single workstation:

Pre-pipeline (one-shot): ~10 min for tskit_preprocess, ~10 min for site_catalog. Reusable across all gene-drop scenarios forever.
simulate_genotypes_chrom × 22: ~2 hr CPU total, ~30 min wall at -j 8 (chr1 alone takes ~16 min on one core). Done once for p2_full; variants reuse via drop_from.
Per variant: tstrait sub-pipeline + augment + phenotype + atlas — a few minutes each, dominated by tstrait_gv_chrom for c1m (~50 s/chrom × 22 / 8 cores ≈ 2-3 min wall).