Pipeline Schema

The pipeline is a chain of stages (simulate → dropout → phenotype → censor → sample), each handing off a pandas.DataFrame to the next. The columns expected at each handoff form an implicit contract: every stage relies on its predecessor's column names by convention, so when a column is renamed upstream, a downstream stage can fail far from the change that broke it.

simace.core.schema makes that contract explicit. It defines three cumulative specs, each extending the previous one and each shared by the stages that emit the same set of columns, plus a runtime check that fires at each handoff.

Three stage outputs

Each schema is a mapping from required column name to a string of allowed numpy.dtype.kind characters. Extra columns are always permitted — stages may carry through additional fields without breaking the contract.
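As a minimal sketch of what such a mapping looks like and how a kind string is consulted, the snippet below uses a made-up two-column spec (`EXAMPLE_SCHEMA` is illustrative, not one of the module's specs) and an extra column that the spec does not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column spec: column name -> allowed dtype.kind characters.
EXAMPLE_SCHEMA = {"id": "iu", "liability1": "f"}

df = pd.DataFrame(
    {
        "id": np.array([1, 2, 3], dtype=np.int32),
        "liability1": np.array([0.1, -0.4, 1.2], dtype=np.float64),
        "note": ["a", "b", "c"],  # extra column, ignored by the spec
    }
)

# Look up the kind character of every required column.
kinds = {col: df[col].dtype.kind for col in EXAMPLE_SCHEMA}

# A frame conforms when each required column's kind is among the allowed characters.
conforms = all(kinds[col] in allowed for col, allowed in EXAMPLE_SCHEMA.items())
```

Only the columns named in the spec are inspected, which is why the `note` column has no effect on the result.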

`PEDIGREE`: output of `run_simulation` and `run_dropout`

| Column | Kind |
| --- | --- |
| `id`, `generation`, `sex`, `mother`, `father`, `twin`, `household_id` | `iu` (integer) |
| `A1`, `C1`, `E1`, `liability1` | `f` (float) |
| `A2`, `C2`, `E2`, `liability2` | `f` (float) |

`PHENOTYPE`: output of `run_phenotype`

PEDIGREE plus the raw event-time columns:

| Column | Kind |
| --- | --- |
| `t1`, `t2` | `f` (float) |

`CENSORED`: output of `run_censor` and `run_sample`

PHENOTYPE plus the columns added by censoring:

| Column | Kind |
| --- | --- |
| `death_age`, `t_observed1`, `t_observed2` | `f` (float) |
| `age_censored1`, `death_censored1`, `affected1` | `b` (bool) |
| `age_censored2`, `death_censored2`, `affected2` | `b` (bool) |
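One way to express the cumulative relationship between the three specs is plain dict unpacking. This is a sketch built from the tables on this page; how simace.core.schema actually constructs its specs is not shown here, and the `INT`, `FLOAT`, and `BOOL` shorthands are assumptions of the example.

```python
# Kind strings as listed in the tables on this page (shorthand names are hypothetical).
INT, FLOAT, BOOL = "iu", "f", "b"

PEDIGREE = {
    **{c: INT for c in ("id", "generation", "sex", "mother", "father", "twin", "household_id")},
    **{c: FLOAT for c in ("A1", "C1", "E1", "liability1", "A2", "C2", "E2", "liability2")},
}

# PHENOTYPE is PEDIGREE plus the raw event-time columns.
PHENOTYPE = {**PEDIGREE, "t1": FLOAT, "t2": FLOAT}

# CENSORED is PHENOTYPE plus the columns added by censoring.
CENSORED = {
    **PHENOTYPE,
    **{c: FLOAT for c in ("death_age", "t_observed1", "t_observed2")},
    **{
        c: BOOL
        for c in (
            "age_censored1", "death_censored1", "affected1",
            "age_censored2", "death_censored2", "affected2",
        )
    },
}
```

Because each spec contains every entry of the one before it, a frame that satisfies `CENSORED` also satisfies `PHENOTYPE` and `PEDIGREE`.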

Why coarse dtype kinds

Dtypes are checked at the kind level (i/u integer, f float, b bool) rather than as exact dtypes. This tolerates the int8/int32/float32 narrowing that save_parquet applies on write: a column may be int32 in memory and round-trip through disk as int8 without violating the contract. The check still catches genuine regressions, such as a boolean column written as int8, a string slipping into an integer ID, or a float landing in generation.
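The distinction can be seen directly on `dtype.kind`. The snippet below is a small illustration with made-up values; `kind_ok` is a hypothetical helper, not part of the module.

```python
import pandas as pd


def kind_ok(series, allowed):
    """True when the series' dtype kind is one of the allowed characters."""
    return series.dtype.kind in allowed


generation = pd.Series([0, 1, 2], dtype="int32")
narrowed = generation.astype("int8")                # narrowing: kind stays 'i'
generation_as_float = generation.astype("float64")  # regression: kind becomes 'f'

affected = pd.Series([True, False, True])           # kind 'b'
affected_as_int = affected.astype("int8")           # regression: kind becomes 'i'

results = {
    "int32 generation": kind_ok(generation, "iu"),
    "int8 generation": kind_ok(narrowed, "iu"),
    "float64 generation": kind_ok(generation_as_float, "iu"),
    "bool affected": kind_ok(affected, "b"),
    "int8 affected": kind_ok(affected_as_int, "b"),
}
```

Width changes within a kind pass, while a change of kind does not, even when the stored values look identical.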

Where it's enforced

flowchart LR
    sim[run_simulation] -- PEDIGREE --> drop[run_dropout]
    drop -- PEDIGREE --> phen[run_phenotype]
    phen -- PHENOTYPE --> cen[run_censor]
    cen -- CENSORED --> samp[run_sample]
    samp -- CENSORED --> stats[validate / stats]

| Stage | Input asserted | Output asserted |
| --- | --- | --- |
| `run_dropout` | `PEDIGREE` | none (row subset, structurally identical) |
| `run_phenotype` | `PEDIGREE` | `PHENOTYPE` |
| `run_censor` | `PHENOTYPE` | `CENSORED` |
| `run_sample` | `CENSORED` | none (row subset, structurally identical) |

A failure raises ValueError with the boundary label and the offending column, e.g.:

censor input: missing required columns ['t1']
censor output: dtype mismatch — affected1=int8 (expected kind in 'b')

so the rename or dtype regression is pinned to the stage that broke it, not the analysis 200 lines downstream.
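A boundary check of this kind can be sketched in a few lines. The function name `assert_schema`, its signature, and the exact message wording below are assumptions for illustration, loosely modeled on the examples above; see the API reference for the real entry point.

```python
import numpy as np
import pandas as pd


def assert_schema(df, schema, label):
    """Raise ValueError if a required column is missing or has a disallowed kind.

    Hypothetical helper: `label` names the boundary, e.g. "censor input".
    """
    missing = [col for col in schema if col not in df.columns]
    if missing:
        raise ValueError(f"{label}: missing required columns {missing}")
    bad = [
        f"{col}={df[col].dtype} (expected kind in {allowed!r})"
        for col, allowed in schema.items()
        if df[col].dtype.kind not in allowed
    ]
    if bad:
        raise ValueError(f"{label}: dtype mismatch: " + ", ".join(bad))


def message_for(df, schema, label):
    """Return the error message the check produces, or None if the frame passes."""
    try:
        assert_schema(df, schema, label)
    except ValueError as exc:
        return str(exc)
    return None


SCHEMA = {"t1": "f", "affected1": "b"}  # hypothetical subset of a full spec

no_t1 = pd.DataFrame({"affected1": [True, False]})
int_flag = pd.DataFrame({"t1": [1.0, 2.0], "affected1": np.array([1, 0], dtype=np.int8)})
good = pd.DataFrame({"t1": [1.0, 2.0], "affected1": [True, False]})

missing_message = message_for(no_t1, SCHEMA, "censor input")
dtype_message = message_for(int_flag, SCHEMA, "censor output")
good_message = message_for(good, SCHEMA, "censor output")
```

Missing columns are reported before dtype mismatches so that a rename surfaces as a single clear error instead of a lookup failure partway through the dtype pass.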

Using it from tests

When writing a unit test that constructs a DataFrame directly (rather than running the full pipeline), tests/conftest.py exposes a schema_pad(df, schema) helper that fills in zero/false defaults for any schema-required columns the fixture didn't provide. This lets fixtures stay focused on the columns under test while still satisfying the contract.
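The helper's implementation is not reproduced on this page. As a rough sketch of the behavior described, assuming the fill dtype is chosen from the first allowed kind character, a padding function could look like the following (`schema_pad_sketch` and `_DEFAULT_DTYPE` are hypothetical names, not the real helper).

```python
import numpy as np
import pandas as pd

# Fill dtype per kind character (an assumption of this sketch).
_DEFAULT_DTYPE = {"i": np.int64, "u": np.uint64, "f": np.float64, "b": np.bool_}


def schema_pad_sketch(df, schema):
    """Return a copy of df with zero/False columns added for missing schema columns."""
    out = df.copy()
    for col, allowed in schema.items():
        if col not in out.columns:
            # np.zeros yields 0, 0.0, or False depending on the chosen dtype.
            out[col] = np.zeros(len(out), dtype=_DEFAULT_DTYPE[allowed[0]])
    return out


SCHEMA = {"id": "iu", "t1": "f", "affected1": "b"}  # hypothetical subset of a full spec

# The fixture only supplies the column under test.
fixture = pd.DataFrame({"t1": [1.5, 2.5]})
padded = schema_pad_sketch(fixture, SCHEMA)
```

Columns the fixture already provides are left untouched, so a test can still set the values it cares about explicitly.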

API reference

See simace.core.schema for the full module.