MendelIHT.jl

Iterative hard thresholding for l0 penalized regression
Author OpenMendel
Popularity
21 Stars
Updated Last
7 Months Ago
Started In
July 2018

MendelIHT

Iterative hard thresholding - a multiple regression approach to analyze data from a Genome Wide Association Studies (GWAS)

Documentation Build Status Code Coverage

Installation

Download and install Julia. Within Julia, copy and paste the following:

``````using Pkg
``````

This package supports Julia `v1.6`+ for Mac, Linux, and window machines.

Quick start

Sparse linear regression:

```using MendelIHT, Random

# simulate data
n = 200    # sample size
p = 1000   # number of covariates
k = 10     # number of causal variables
x = randn(n, p) # simulate x
β = zeros(p) # simulate beta
β[1:k] .= randn(k)
shuffle!(β)
true_position = findall(!iszero, β)
y = x * β + randn(n) # simulate y

# run IHT
possible_k = collect(0:20)
mses = cv_iht(y, x, path=possible_k) # cross validate k = 0, 1, 2, ..., 20
result = fit_iht(y, x, k=possible_k[argmin(mses)]) # run IHT on best k
[result.beta[true_position] β[true_position]] # compare true vs estimated beta

10×2 Matrix{Float64}:
0.41449    0.343562
-0.248449  -0.222586
0.0       -0.12781
-0.89703   -0.927769
-1.18703   -1.15052
0.0       -0.0746511
2.48838    2.4621
0.0       -0.0712048
-1.5504    -1.59528
-1.01247   -1.05913```

Sparse logistic regression:

```using MendelIHT, Random, GLM

# simulate data
n = 200    # sample size
p = 1000   # number of covariates
k = 10     # number of causal variables
x = randn(n, p) # simulate x
β = zeros(p) # simulate beta
β[1:k] .= randn(k)
shuffle!(β)
true_position = findall(!iszero, β)
y = [rand(Bernoulli(μi)) for μi in μ] |> Vector{Float64}

# run IHT
possible_k = collect(0:20)
mses = cv_iht(y, x, d=Bernoulli(), l=LogitLink(), path=possible_k)
result = fit_iht(y, x, k=possible_k[argmin(mses)], d=Bernoulli(), l=LogitLink())
[result.beta[true_position] β[true_position]]

10×2 Matrix{Float64}:
0.0       0.315486
1.27218   1.06696
0.0       0.0819433
0.0       0.381772
-1.16612  -1.1422
0.0      -0.260436
0.0      -0.540831
-2.23738  -2.37168
0.0       0.43792
1.50502   1.60719```

GWAS Quick Start

The following uses data under the `data` directory. PLINK files are stored in `normal.bed`, `normal.bim`, `normal.fam`.

```# load package
using MendelIHT
dir = normpath(MendelIHT.datadir()) * "/"

# select k SNPs in PLINK file, Gaussian phenotypes
result = iht(dir * "normal", 9, Normal) # run IHT with k = 9
result = iht(dir * "normal", 10, Normal, covariates=dir*"covariates.txt") # separately include covariates, k = 10
result = iht(dir * "normal", 10, Normal, covariates=dir*"covariates.txt", phenotypes=dir*"phenotypes.txt") # phenotypes are stored separately

# run cross validation to determine best k
mses = cross_validate(dir * "normal", Normal, path=1:20) # test k = 1, 2, ..., 20
mses = cross_validate(dir * "normal", Normal, path=[1, 5, 10, 15, 20]) # test k = 1, 5, 10, 15, 20
mses = cross_validate(dir * "normal", Normal, path=1:20, covariates=dir*"covariates.txt") # separately include covariates
mses = cross_validate(dir * "normal", Normal, path=1:20, covariates=dir*"covariates.txt", phenotypes=dir*"phenotypes.txt") # if phenotypes are in separate file

# Multivariate IHT for multiple quantitative phenotypes
result = iht(dir * "multivariate", 10, MvNormal, phenotypes=[6, 7]) # phenotypes stored in 6th and 7th column of .fam file
result = iht(dir * "multivariate", 10, MvNormal, phenotypes=dir*"multivariate.phen") # phenotypes stored separate file

# other distributions for single trait analysis (no test data available)
result = iht("datafile", 10, Bernoulli) # logistic regression with k = 10
result = iht("datafile", 10, Poisson) # Poisson regression with k = 10
result = iht("datafile", 10, NegativeBinomial, est_r=:Newton) # Negative Binomial regression + nuisnace parameter estimation```

Please see our latest documentation for more detail.

Citation and Reproducibility:

For univariate analysis, please cite:

Chu BB, Keys KL, German CA, Zhou H, Zhou JJ, Sobel EM, Sinsheimer JS, Lange K. Iterative hard thresholding in genome-wide association studies: Generalized linear models, prior weights, and double sparsity. Gigascience. 2020 Jun 1;9(6):giaa044. doi: 10.1093/gigascience/giaa044. PMID: 32491161; PMCID: PMC7268817.

In the `figures` subfolder, one can find all the code to reproduce the figures and tables in our paper.

For multivariate analysis, please cite:

Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, Lange K. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics. 2023 Apr;39(4):btad193

In the `manuscript` subfolder, one can find all the code to reproduce the figures and tables in our paper.

Bug fixes and user support

If you encounter a bug or need user support, please open a new issue on Github. Please provide as much detail as possible for bug reports, ideally a sequence of reproducible code that lead to the error.

PRs and feature requests are welcomed!

Acknowledgments

This project has been supported by the National Institutes of Health under awards R01GM053275, R01HG006139, R25GM103774, and 1R25HG011845.

Required Packages

View all packages

Used By Packages

No packages found.