
svForge, synthetic VCFs to test structural variant pipelines

A command-line tool that generates synthetic structural variant VCFs, with controlled injection of gnomAD and ENCODE artefacts to validate downstream filters.


Testing a structural variant pipeline relies on input VCFs that cover the cases expected at the caller stage: Manta or DELLY records with their format-specific quirks, SVs that overlap ENCODE blacklisted regions, SVs that match known germline variants in gnomAD, and realistic ranges for SVLEN, HOMLEN and VAF. These VCFs are used to validate filtering rules, to reproduce edge cases and to run CI without rerunning a full caller on BAM files.

The issue is that such reference VCFs are scarce and ill-suited to the task. Public datasets are few, do not cover all useful combinations and carry no internal labelling that would let a filter be checked against an oracle. The fallback is hand-crafted VCFs, which in practice means accumulating approximate files that drift from the real format and do not lend themselves to reproducible regression testing. Synthetic generators exist for SNVs, but the SV ecosystem has no command-line equivalent that produces caller-specific VCFs with controlled artefact injection.

svForge covers this gap with a CLI that produces Manta or DELLY VCFs ready to feed into a pipeline.

Method

The tool draws variants from a weighted YAML bank that covers DEL, DUP, INV, INS and BND types on hg38. Each draw returns a caller-specific record compliant with VCFv4.1 for Manta and VCFv4.2 for DELLY, with the INFO and FORMAT fields expected by the target caller. The variability parameters (--svlen-range, --homlen-range, --vaf-range) restrict the sampling ranges and --svtypes filters the types retained.
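
The weighted draw with restricted ranges can be pictured with a short Python sketch. It is not svForge's code: the in-memory BANK list stands in for the YAML bank, whose real schema is not described here, and draw_variants with its parameters is a hypothetical name chosen to mirror --svtypes, --svlen-range and --vaf-range.

```python
import random

# Hypothetical stand-in for the weighted YAML bank (the real schema is not shown in this post).
BANK = [
    {"svtype": "DEL", "weight": 5.0},
    {"svtype": "DUP", "weight": 2.0},
    {"svtype": "INV", "weight": 1.0},
    {"svtype": "INS", "weight": 1.0},
    {"svtype": "BND", "weight": 1.0},
]


def draw_variants(n, svtypes=None, svlen_range=(50, 10_000),
                  vaf_range=(0.05, 1.0), seed=None):
    """Draw n variants by weight, keeping only the requested SV types."""
    rng = random.Random(seed)  # a local generator keeps the draw replayable
    entries = [e for e in BANK if svtypes is None or e["svtype"] in svtypes]
    if not entries:
        raise ValueError("no bank entry matches the requested SV types")
    weights = [e["weight"] for e in entries]
    picks = rng.choices(entries, weights=weights, k=n)
    return [
        {
            "svtype": e["svtype"],
            "svlen": rng.randint(*svlen_range),  # both bounds inclusive
            "vaf": rng.uniform(*vaf_range),
        }
        for e in picks
    ]
```

Calling draw_variants(50, svtypes={"DEL", "DUP"}, seed=42) twice returns the same list, which is the property the seed handling described further down relies on.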

Two injection modes complement the background generation. The --gnomad-fraction flag replaces a share of the records with SVs drawn from a bundled gnomAD mini-catalog, and --blacklist-fraction does the same with a subset of the ENCODE blacklist. Every record, injected or not, carries an INFO/SVFORGE_SOURCE tag that takes the value bank, gnomad or blacklist, which makes downstream verification direct: a properly tuned gnomAD filter must remove every SVFORGE_SOURCE=gnomad line from the output VCF.
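
The oracle check on a filtered VCF fits in a few lines of Python. The sketch below is not part of svForge; it assumes only the INFO/SVFORGE_SOURCE tag described above, and the function names source_counts and gnomad_filter_leaks are made up for the illustration.

```python
from collections import Counter


def source_counts(vcf_lines):
    """Count records per SVFORGE_SOURCE value in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        fields = dict(
            item.split("=", 1) if "=" in item else (item, "")  # flags have no value
            for item in info.split(";")
        )
        counts[fields.get("SVFORGE_SOURCE", "untagged")] += 1
    return counts


def gnomad_filter_leaks(filtered_vcf_lines):
    """Number of gnomAD-injected records that survived the filter (0 is a pass)."""
    return source_counts(filtered_vcf_lines)["gnomad"]
```

For a compressed file, the lines can come from gzip.open(path, "rt"). A regression test then asserts that gnomad_filter_leaks(...) equals 0 on the pipeline output.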

Reproducibility and traceability

Sampling takes an explicit seed via --seed, and the seed actually used is always written to the header of the generated VCF, including when it was drawn at random. A run without an explicit seed therefore stays exactly replayable from the produced file alone. The command line logged in the header goes through sanitize_command(), which strips absolute paths, so user home or cluster paths do not leak into shared VCFs.
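
The two mechanisms can be approximated as follows. This is a sketch under assumptions, not the actual sanitize_command(): the helper names, the ##svforgeSeed and ##svforgeCommand header keys, and the choice to reduce an absolute path to its basename are all hypothetical.

```python
import secrets
from pathlib import PurePosixPath


def resolve_seed(seed=None):
    """Return the user seed, or draw one so it can still be recorded."""
    return seed if seed is not None else secrets.randbits(32)


def _basename_if_absolute(token):
    # Assumption: POSIX-style paths; an absolute path is reduced to its file name.
    return PurePosixPath(token).name if token.startswith("/") else token


def strip_absolute_paths(argv):
    """Hypothetical stand-in for sanitize_command(): hide directories, keep file names."""
    cleaned = []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg.split("=", 1)  # handles the --flag=/abs/path form
            cleaned.append(f"{key}={_basename_if_absolute(value)}")
        else:
            cleaned.append(_basename_if_absolute(arg))
    return cleaned


def header_lines(argv, seed=None):
    """Build the seed and command header lines (key names are assumptions)."""
    used = resolve_seed(seed)
    return used, [
        f"##svforgeSeed={used}",
        f"##svforgeCommand={' '.join(strip_absolute_paths(argv))}",
    ]
```

Because the recorded value is the seed actually used, passing it back through --seed is enough to reproduce a run that was started without one.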

A ##svforgeWarning=SYNTHETIC_DATA_DO_NOT_USE_FOR_CLINICAL_DIAGNOSIS header is force-injected into every file and is not configurable, so that no generated VCF can be mistaken for a real caller output.

Validation

The svforge validate subcommand takes a generated VCF and checks that records tagged gnomad and blacklist match the bundled catalog entries exactly. The exit code is 0 on an exact match and non-zero otherwise, and a detailed TSV report is available via --report-tsv. This step acts as a safeguard for pipelines that rely on the SVFORGE_SOURCE tag as an oracle.
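
In a CI job written in Python, the exit code can gate the rest of the test suite. The wrapper below is a minimal sketch: it uses only the --vcf and --report-tsv flags shown in the Usage section, while the function name validate_or_fail and the injectable runner argument are illustrative choices.

```python
import subprocess


def validate_or_fail(vcf_path, report_path, runner=subprocess.run):
    """Run `svforge validate` and raise if the exit code is non-zero."""
    cmd = ["svforge", "validate", "--vcf", vcf_path, "--report-tsv", report_path]
    result = runner(cmd)  # runner is injectable so the wrapper can be tested offline
    if result.returncode != 0:
        raise RuntimeError(
            f"svforge validate failed with exit code {result.returncode}; "
            f"see {report_path} for details"
        )
    return cmd
```

Raising an exception instead of returning a boolean means a test runner stops before any filter assertion runs against a VCF whose tags cannot be trusted.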

Modularity

The Manta and DELLY writers are registered through the svforge.writers entry point group in pyproject.toml. A third-party caller is added by publishing a Python package that exposes a new entry under that group, without touching the svForge core. The default bank is also replaceable via --bank, which accepts an arbitrary YAML file, and --header-template accepts a custom VCF template for deployments that require specific INFO or FILTER fields.

Usage

A typical run produces a Manta VCF of 50 records for a named sample:

svforge gen --caller manta --out synthetic.vcf.gz --n 50 --sample-name TUMOR01 --seed 42

For somatic pipelines, gen-pair produces two consistent tumor and normal VCFs in a single invocation, with a configurable number of somatic and germline events:

svforge gen-pair \
  --caller manta \
  --out-tumor tumor.vcf.gz \
  --out-normal normal.vcf.gz \
  --n-somatic 30 \
  --n-germline 10 \
  --tumor-sample-name TUMOR_01 \
  --normal-sample-name NORMAL_01 \
  --seed 99

Artefact injection is enabled by fraction and combines with the other sampling options:

svforge gen --caller manta --out out.vcf.gz --n 100 --sample-name S1 \
  --gnomad-fraction 0.15 --blacklist-fraction 0.1 --seed 7

Validation takes the produced VCF and confirms that injected records match the catalogs:

svforge validate --vcf out.vcf.gz --report-tsv validate_report.tsv

The tool is published on PyPI and installable with pip install svforge. Version 1 targets hg38 only and requires Python 3.10 or higher. The source code is available on GitHub under the MIT license.

svForge on PyPI (pypi.org): Python package installable via pip
Source code on GitHub (github.com): pieetie/svforge
