Using statgen in R

statgen stores each analysis object on a shared reference SNP axis. The R package loads reference panels, summary statistics, annotations, genotypes, and LD panels into those coordinates.

library(statgen)
set_verbosity("quiet")

Reference, Sumstats, And Annotations

The bundled package fixtures are intentionally tiny so that examples and CRAN checks do not require external tools or large data.

This vignette reads those installed fixtures with system.file(..., package = "statgen"). They are demonstration inputs only. In real analyses, replace these paths with post-harmonization reference, summary statistics, annotation, genotype, and LD files prepared for one analysis input set. For an end-to-end repository workflow that prepares source inputs and then runs the same R analysis pattern on those files, see docs/TUTORIAL_1_PREPARE_DATA.md and docs/TUTORIAL_4_R.md.

reference_template <- file.path(
  dirname(system.file("extdata", "reference_chr1.bim", package = "statgen")),
  "reference_chr@.bim"
)
reference <- load_reference(reference_template)

sumstats_path <- system.file("extdata", "traits_complete.tsv.gz", package = "statgen")
sumstats <- load_sumstats(sumstats_path, reference)

annotation_paths <- system.file(
  "extdata",
  c("anno1.bed", "anno2.bed"),
  package = "statgen"
)
annotations <- load_annotations(annotation_paths, reference)

num_snp(reference)
#> [1] 8
head(logpvec(sumstats))
#> [1] 1.920819 0.301030 1.301030 0.698970 0.397940 2.522879
colnames(annomat(annotations))
#> [1] "anno1" "anno2"

R-native object caches are RDS files. They are useful for repeated analyses in the same R runtime.

sumstats_cache <- tempfile(fileext = ".rds")
save_sumstats_cache(sumstats, sumstats_cache)
sumstats_cached <- load_sumstats_cache(sumstats_cache)
identical(is_present(sumstats_cached), is_present(sumstats))
#> [1] TRUE

LD Panels

Python builds the canonical LD .npz distribution. For package examples, ld_root points to a bundled toy distribution that already includes the R reference-cache sidecar. For real analyses, run prepare_ld_npz_for_r() once on the Python-built ld_npz/ directory for the same harmonized input set, then load that directory with load_ld().

ld_root <- system.file("extdata", "ld", package = "statgen")
ld <- load_ld(ld_root)

num_snp(ld)
#> [1] 8
a1freq(ld)
#> [1] 0.30 0.40 0.20 0.35 0.15 0.28 0.32 0.22

multiply_r2() multiplies aligned vectors or matrices by LD r^2. fast_prune() applies greedy LD pruning to aligned scores.

scores <- seq_len(num_snp(ld))
multiply_r2(ld, scores)
#> [1]  2.8900  3.8100  3.0900  6.9500  6.9600  8.9575 10.5150  8.8575

logp <- rev(seq_len(num_snp(ld)))
fast_prune(logp, ld, r2_threshold = 0.35)
#> [1]   8 NaN   6   5 NaN   3 NaN   1

Genotype Metadata And Fetching

Genotype panels keep PLINK metadata and fetch hardcalls on demand. The original BED files must remain available unless a replacement bed_path is supplied.

ref_chr1 <- load_reference(system.file("extdata", "reference_chr1.bim", package = "statgen"))
bed_chr1 <- system.file("extdata", "genotype_1.bed", package = "statgen")
genotype <- load_genotype(sub("\\.bed$", "", bed_chr1), ref_chr1)

geno <- fetch_genotypes_int8(genotype, c(1, 3))
dim(geno)
#> [1] 4 2
head(geno)
#>      [,1] [,2]
#> [1,]    2    2
#> [2,]    2    2
#> [3,]    2    2
#> [4,]    2    2