BulkLMM.jl
BulkLMM is a Julia package for performing genome scans for multiple traits (in "Bulk" sizes) using linear mixed models (LMMs). It is suitable for eQTL mapping with thousands of traits and markers. BulkLMM also performs permutation testing for LMMs taking into account the relatedness of individuals. We use multi-threading and matrix operations to speed up computations.
The current implementation is for genome scans with one-degree of freedom tests with choices of adding additional covariates. Future releases will cover the scenario of more-than-one degrees of freedom tests.
Linear Mixed Model (LMM)
We consider the case when a univariate trait of interest is measured
in a population of related individuals with kinship matrix
where
where
Single trait scan
For a single trait and candidate marker, we use a likelihood ratio test to compare a model with and without the candidate genetic marker (and including the intercept and all background covariates). This process is repeated for each marker to generate the genome scan. The result is reported in LOD (log 10 of likelihood ratio) units.
Users can specify if the variance components should be estimated using ML (maximum likelihood) or REML (restricted maximum likelihood). The scans can be performed with the variance components estimated once under the null, or separately for each marker. The latter approach is slower, but more accurate.
Permutation tests for single trait
Under the null hypothesis that no individual genetic marker is associated with the trait, traits are correlated according if the kinship matrix is not identity, and the genetic variance component is non-zero. Thus, a standard permutation test where we shuffle the trait data randomly, is not appropriate. Instead, we rotate the data using the eigen decomposition of the kinship matrix, which de-correlates the data, and then shuffle the data after rescaling them by their standard deviations.
Scans for multiple traits
Scans for multiple traits are performed by running univariate LMMs for each combination of trait and marker. We are exploring algorithms for optimizing this process by judicious use of approximations.
Multi-threading
This package uses multi-threading to speed up some operations. You will have to start Julia with mutliple threads to take advantage of this. You should use as many threads as your computer is capable of. Further speedups may be obtained by spreading (distributing) the computation across mutliple computers.
Example: application on BXD spleen expression data
We demonstrate basic usage of BulkLMM.jl
through an example applying
the package on the BXD mouse strains data.
First, after successfully installed the package, load it to the current Julia session by
using BulkLMM
using CSV, DelimitedFiles, DataFrames, Statistics
The BXD data are accessible through our published github
repo of the BulkLMM.jl
package as .csv files under the data/bxdData
directory.
The raw BXD traits BXDtraits_with_missing.csv
contains missing
values. After removing the missings, load the BXD traits data
bulklmmdir = dirname(pathof(BulkLMM));
pheno_file = joinpath(bulklmmdir,"..","data/bxdData/spleen-pheno-nomissing.csv");
pheno = readdlm(pheno_file, ',', header = false);
pheno_processed = pheno[2:end, 2:(end-1)].*1.0; # exclude the header, the first (transcript ID)and the last columns (sex)
Required data format for traits should be .csv or .txt files with
values separated by ','
, with each column being the observations of
Also load the BXD genotypes data. The raw BXD genotypes file
BXDgeno_prob.csv
contains even columns that each contains the
complement genotype probabilities of the column immediately preceded
(odd columns). Calling the function readBXDgeno
will read the BXD
genotype file excluding the even columns.
geno_file = joinpath(bulklmmdir,"..","data/bxdData/spleen-bxd-genoprob.csv");
geno = readdlm(geno_file, ',', header = false);
geno_processed = geno[2:end, 1:2:end] .* 1.0;
Required data format for genotypes should be .csv or .txt files with
values separated by ','
, with each column being the observations of
genotype probabilities of
For the BXD data,
size(pheno_processed) # (number of strains, number of traits)
(79, 35554)
size(geno_processed) # (number of strains, number of markers)
(79, 7321)
Compute the kinship matrix calcKinship
kinship = calcKinship(geno_processed); # calculate K
Single trait scanning:
For example, to conduct genome-wide association mappings on the
1112-th trait, ran the function scan()
with inputs of the trait (as
a 2D-array of one column), geno matrix, and the kinship matrix.
traitID = 1112;
pheno_y = reshape(pheno_processed[:, traitID], :, 1);
@time single_results = scan(pheno_y, geno_processed, kinship);
0.059480 seconds (80.86 k allocations: 47.266 MiB)
The output structure single_results
stores the model estimates about the variance components (VC, environmental variance, heritability estimated under the null intercept model) and the lod scores. They are obtainable by
# VCs: environmental variance, heritability, genetic_variance/total_variance
(single_results.sigma2_e, single_results.h2_null)
(0.0942525841453798, 0.850587848871709)
# LOD scores calculated for a single trait under VCs estimated under the null (intercept model)
single_results.lod;
BulkLMM.jl
supports permutation testing for a single trait GWAS. Simply run the function scan()
and input the optional keyword argument permutation_test = true
with the number of permutations passed to the keyword argument nperms = # of permutations
. For example, to ask the package to do a permutation testing of 1000 permutations, do
@time single_results_perms = scan(pheno_y, geno_processed, kinship; permutation_test = true, nperms = 1000, original = false);
0.079464 seconds (94.02 k allocations: 207.022 MiB)
(use the input original = false
to suppress the default of performing genome scans on the original trait)
The output single_results_perms
is a matrix of LOD scores of dimension p * nperms
, with each column being the LOD scores of the
size(single_results_perms)
(7321, 1000)
max_lods = vec(mapslices(x -> maximum(x), single_results_perms; dims = 1));
thrs = map(x -> quantile(max_lods, x), [0.05, 0.95]);
Plot the BulkLMM LOD scores of the 1112-th trait and compare with the results from running GEMMA:
Note: to get results from GEMMA, one would need to run GEMMA on a Linux machine with input files of the same trait (here the 1112-th trait, X10339113), genetic markers and the kinship matrix, and finally convert the LRT p-values into corresponding LOD scores. Alternatively, you may simply load the results we obtained by following the procedures mentioned above. The resulting LOD scores from GEMMA are a .txt file in data/bxdData/GEMMA_BXDTrait1112/gemma_lod_1112.txt
.
For reproducing this figure, we need to do the following steps:
First, read in the gmap.csv
and the phenocovar.csv
under data/bxdData/
directory as
gmap_file = joinpath(bulklmmdir,"..","data/bxdData/gmap.csv");
gInfo = CSV.read(gmap_file, DataFrame);
phenocovar_file = joinpath(bulklmmdir,"..","data/bxdData/phenocovar.csv");
pInfo = CSV.read(phenocovar_file, DataFrame);
We would need to use some utility functions for plotting in the package named BigRiverPlots.jl
, which will be released soon.
After downloading the package, run
using BigRiverPlots
# Get information for the genetic markers about which chromosome each was measured at
Chr_bxd = string.(gInfo[:, :Chr]);
Chr_bxd = reshape(Chr_bxd, :, 1);
# Get information for the genetic markers about where (in Mb length) on the chromosome each was measured at
Pos_bxd = gInfo[:, :Mb];
Pos_bxd = reshape(Pos_bxd, :, 1);
Lod_bxd = single_results.lod[1:end, :]; # load the BulkLMM LOD scores results
gemma_results_path = joinpath(bulklmmdir,"..","data/bxdData/GEMMA_BXDTrait1112/gemma_lod_1112.txt")
Lod_gemma = readdlm(gemma_results_path, '\t'); # load gemma LOD scores results available in the package
traitName = pInfo[traitID, 1] # get the trait name of the 1112-th trait
Then, to use the functions in the package BigRiverPlots.jl
, run
vecSteps_bxd = BigRiverPlots.get_chromosome_steps(Pos_bxd, Chr_bxd)
# get unique chr id
v_chr_names_bxd = unique(Chr_bxd)
# generate new distances coordinates
x_bxd, y_bxd = BigRiverPlots.get_qtl_coord(Pos_bxd, Chr_bxd, Lod_bxd);
x_bxd_gemma, y_bxd_gemma = BigRiverPlots.get_qtl_coord(Pos_bxd, Chr_bxd, Lod_gemma);
qtlplot(x_bxd, y_bxd, vecSteps_bxd, v_chr_names_bxd;
label = "BulkLMM.jl",
xlabel = "Locus (Chromosome)", ylims = (0.0, 6.5),
title = "Single trait $traitName LOD scores") # plot BulkLMM LODs
plot!(x_bxd, y_bxd_gemma, color = :purple, label = "GEMMA", legend = true) # plot GEMMA LODs
hline!([thrs], color = "red", linestyle=:dash, label = "") # plot thresholds...
Multiple traits scanning:
To get LODs for multiple traits, for better runtime performance, first start julia with multiple threads following Instructions for starting Julia REPL with multi-threads or switch to a multi-threaded julia kernel if using Jupyter notebooks.
Then, run the function bulkscan_null()
with the matrices of
traits, genome markers, kinship. The fourth required input is the
number of parallelized tasks and we recommend it to be the number of
julia threads.
Here, we started a 16-threaded julia and executed the program on a Linux server with the Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz to get the LOD scores for all ~35k BXD traits:
@time multiple_results_allTraits = bulkscan_null(pheno_processed, geno_processed, kinship; nb = Threads.nthreads()).L;
82.421037 seconds (2.86 G allocations: 710.821 GiB, 41.76% gc time)
The output multiple_results_allTraits
is a matrix of LOD scores of dimension
size(multiple_results_allTraits)
(7321, 35554)
To visualize the multiple-trait scan results, we can use the plotting utility function plot_eQTL
to generate the eQTL plot.
The functions for plotting utilities will be available in the package BigRiverPlots.jl
in the future. For now, we can easily have access to the plotting function in the script plot_utils/visuals_utils.jl
, by running the following commands:
using RecipesBase, Plots, Plots.PlotMeasures, ColorSchemes
include(joinpath(bulklmmdir, "..", "plot_utils", "visuals_utils.jl"));
For the following example, we only plot the LOD scores that are above 5.0 by calling the function and specifying in the optional argument thr = 5.0
:
plot_eQTL(multiple_results_allTraits, pheno, gInfo, pInfo; thr = 5.0)
Installation:
The package BulkLMM.jl
can be installed by running
using Pkg
Pkg.add("BulkLMM")
To install from the Julia REPL, first press ]
to enter the Pkg
mode and then use:
add BulkLMM
The most recent release of the package can be obtained by running
using Pkg
Pkg.add(url = "https://github.com/senresearch/BulkLMM.jl", rev="main")
Contact, contribution and feedback
If you find any bugs, please post an issue on GitHub or contact the maintainer (Zifan Yu) directly. You may also fork the repository and send us a pull request with any contributions you wish to make.
Check out NEWS.md to see what's new in each BulkLMM.jl
release.