# GeneticScreens

Pre- and post-processing for the analysis of high-throughput genetic screens using matrix linear models. See the associated paper, "Matrix linear models for high-throughput chemical genetic screens", and its reproducible code for more details. S-scores are implemented based on Collins et al. (2006)^{1}.

`GeneticScreens`

is an extension of the `MatrixLM`

package, which provides core functions for closed-form least squares estimates for matrix linear models.

## Installation

The `GeneticScreens`

package can be installed by running:

```
using Pkg
Pkg.add("GeneticScreens")
```

For the most recent version, use:

```
using Pkg
Pkg.add(url = "https://github.com/senresearch/GeneticScreens.jl", rev="master")
```

## Usage

```
using GeneticScreens
```

The `GeneticScreens`

package extends `MatrixLM`

, so all functionality of the `MatrixLM`

functions is preserved. The README for `MatrixLM`

provides examples of usage for the core least squares estimation functions that may be of interest to the user.

In this illustrative example, consider simulated data from a tiny genetic screening experiment run on a single 4x6 plate. There were 5 experimental conditions (A-E), each with 4 monotonically ordered dosage concentration levels (1-4) replicated 2 times. There were 8 mutants, each replicated 3 times per plate. Note that the simulated data will be exported as CSV files saved in the working directory.

```
using DataFrames
using CSV
using Random
# Generate 5 conditions (A-E), each with 4 ordered dosage concentration
# levels (1-4)
Xdos_df = repeat(DataFrame(cond=repeat(["A", "B", "C", "D", "E"], inner=4),
conc=repeat(1:4, 5)), 2)
# Write Xdos_df to CSV
CSV.write("Xdos_df.csv", Xdos_df)
# Create another DataFrame that concatenates the conditions and concentrations
X_df = repeat(DataFrame(cond_conc=[string(Xdos_df[!,:cond][i],
Xdos_df[!,:conc][i])
for i in 1:20]), 2)
# Write X_df to CSV
CSV.write("X_df.csv", X_df)
# Generate 8 mutants, each replicated 3 times
Z_df = DataFrame(mut=repeat(1:8, 3))
# Write Z_df to CSV
CSV.write("Z_df.csv", Z_df)
# Generate the same 8 mutants with 3 replicates, and also include the spatial
# row and column positions on the 4x6 plate
Zspat_df = DataFrame(mut=repeat(1:8, 3),
row=repeat(1:4, inner=6), col=repeat(1:6, 4))
# Write Zspat_df to CSV
CSV.write("Zspat_df.csv", Zspat_df)
# Use `contr` to get sum contrasts for the concatenated
# condition-concentrations in X_df and the mutants in Z_df, and convert them
# into 2d arrays
X = convert(Array{Float64,2}, contr(X_df, [:cond_conc], ["sum"]))
Z = convert(Array{Float64,2}, contr(Z_df, [:mut], ["sum"]))
# Total number of condition replicates (rows of Y)
n = size(X)[1]
# Total number of mutant replicates per plate (columns of Y)
m = size(Z)[1]
# Number of conditions
p = size(X)[2]
# Number of mutants
q = size(Z)[2]
# Randomly generate response variable Y
Random.seed!(1)
B = rand(-5:5,p,q)
E = randn(n,m)
Y = X*B*transpose(Z)+E
# Write Y to CSV
CSV.write("Y.csv", DataFrame(Y))
```

The `read_plate`

function is designed to simplify the construction of a `RawData`

object for genetic screening data. (The `RawData`

is needed to obtain least squares estimates for matrix linear models.) Users can specify the paths to flat files where their X (experimental conditions and row covariates), Y (colony size/response), and Z (mutants and column covariates) matrices are stored. `read_plate`

will read in the data using `CSV.read`

; additional keyword arguments to be passed into `CSV.read`

for all three files can be specified. By default, `CSV.read`

assumes that the first row is a header and that the delimiter is ','.

If one of the columns in the X matrix is a categorical variable for the experimental conditions, `read_plate`

can convert it to sum or treatment contrasts using `contr`

(see the `matrixLM`

package for more information on `contr`

). The condition variable name should be specified using the `XCVar`

argument and the type of contrast should be specified using `XCType`

. If one of the columns in the Z matrix is a categorical variable for the mutants, the analogous argument names are `ZCVar`

and `ZCType`

. Additional variables in X and Z are retained in the resulting `RawData`

object. In the example below, we use an X matrix that concatenates the conditions with their concentrations, thereby creating 24 distinct "condition-concentrations" (i.e. condition A at concentration 1 is treated as a different "condition" from condition A at concentration 2). By default, both the condition-concentrations in X and the mutants in Z are coded as treatment contrasts, but they are coded as sum contrasts in this example.

```
dat1 = read_plate("X_df.csv", "Y.csv", "Z_df.csv",
XCVar=:cond_conc, XCType="sum",
ZCVar=:mut, ZCType="sum")
```

If monotonicity is a reasonable assumption, it may be of interest to consider the concentrations of the different conditions as dosage slopes. If `isXDos=true`

, `read_plate`

will code the concentrations in `XConcentrationVar`

as dosage slopes for each condition in `XConditionVar`

. This is done through a call to the `get_dose_slopes`

function, which assumes that concentrations are ordered within dosage slopes.

```
dat2 = read_plate("Xdos_df.csv", "Y.csv", "Z_df.csv",
isXDos=true, XConditionVar=:cond, XConcentrationVar=:conc,
ZCVar=:mut, ZCType="sum")
```

Colonies grown on the edge of the plate tend to be larger because they do not compete for as many resources. One way to try to account for these spatial differences is to include power and cross-product terms for the (centered) row and column indices. In this example, the rows are indexed from 1 to 4 and the columns are indexed from 1-6 for the 4x6 plate. The desired degree of the polynomial can be specified as `spatDegree`

and the names of the variables in Z that store the row and column indices can be passed in as `rowVar`

and `colVar`

, respectively. By default, `spatDegree=0`

(indicating that no spatial effects should be included), `rowVar=:row`

, and `colVar=:column`

. In the example below, we have instructed `read_plate`

to include spatial effects up to a second-degree polynomial.

```
dat3 = read_plate("X_df.csv", "Y.csv", "Zspat_df.csv",
XCVar=:cond_conc, XCType="sum",
ZCVar=:mut, ZCType="sum",
spatDegree=2, rowVar=:row, colVar=:col)
```

Instead of reading in X, Y, and Z directly from saved files, it is possible to run `read_plate`

for DataFrames that have already been loaded into the current session. The arguments to convert condition and mutant variables to contrasts, encode dosage slopes, and incorporate spatial effects remain the same. This approach may be useful if the user wants to manually pre-process the data or run `contr`

on additional categorical covariates in `X`

and `Z`

.

```
dat4 = read_plate(X_df, DataFrame(Y), Z_df,
XCVar=:cond_conc, XCType="sum",
ZCVar=:mut, ZCType="sum")
```

Finally, `read_plate`

also provides the option to standardize Y by subtracting the row medians and dividing by the column IQRs if the user sets `isYstd=true`

, as seen below. By default, `isYstd=false`

and Y will not be standardized.

```
dat5 = read_plate(X_df, DataFrame(Y), Z_df,
XCVar=:cond_conc, XCType="sum",
ZCVar=:mut, ZCType="sum", isYstd=true)
```

The `mlm`

function from `matrixLM`

computes least-squares coefficient estimates for matrix linear models and returns an object of type `Mlm`

. `Mlm`

objects include variables for the coefficient estimates (`B`

), the coefficient variance estimates (`varB`

), and the estimated variance of the errors (`sigma`

).

Coding the conditions in X and the mutants in Z as sum contrasts is convenient because interpretation of the main effects and interactions will be with respect to average colony sizes. However, to avoid over-parameterization, the last condition and last mutant will be left out of the contrasts fed into the model. `mlm_backest_sum`

extends `mlm`

by additionally back-estimating one or both of the "left-out" sum contrasts for the conditions and mutants. By default, both of the "left-out" sum contrasts will be estimated, but this behavior can be modified by setting `isXSum`

and/or `isZSum`

to `false`

. `mlm_backest_sum`

currently does not support back-estimation of contrasts when additional covariates are included in X and/or Z.

As with `mlm`

, the `mlm_backest_sum`

function estimates both row and column main effects (X and Z intercepts), but this behavior can be suppressed by setting `isXIntercept=false`

and/or `isZntercept=false`

. Column weights for `Y`

and the target type for variance shrinkage^{2} can be optionally supplied to `weights`

and `targetType`

, respectively.

```
est = mlm_backest_sum(dat5)
```

Note that the resulting matrix of coefficient estimates has 21 rows (20 experimental conditions plus the row intercept/main effect) and 9 columns (8 mutants plus the column intercept/main effect).

```
size(coef(est))
```

The `coef`

access function has been extended to suppress returning the "left-out" estimates if `isXSum`

and/or `isZSum`

is set to `true`

. The `matrixLM`

functions that compute fitted values (`fitted`

), predicted values (`predict`

) and residuals (`resid`

) have likewise been extended to include options for suppressing the "left-out" estimates. In this example, all three of these functions will raise an error if the user fails to set`isXSum=true`

and `isZSum=true`

(e.g. `fitted(est)`

will raise a dimension error, but `fitted(est, isXSum=true, isZSum=true)`

will not). This is because the estimates were obtained from a call to `mlm_backest_sum`

with `isXSum`

and`isZSum`

both set to `true`

(the default), such that the output includes the "left-out" sum contrasts for both the conditions and mutants.

```
size(coef(est, isXSum=true, isZSum=true))
```

The matrix of t-statistics (defined as `est.B ./ sqrt.(est.varB)`

) corresponding to this `Mlm`

object includes all 20x8 interactions. By default, `t_stat`

does not return the corresponding t-statistics for any main effects that were estimated by `mlm_backest_sum`

, but they will be returned if `isMainEff=true`

.

```
size(t_stat(est))
```

Analogous to `matrixLM`

's `mlm_perms`

function, `mlm_backest_sum_perms`

computes permutation p-values that include the "left-out" estimates and will run the permutations in parallel when possible. The illustrative example below only runs 5 permutations, but a different number can be specified as the second argument. By default, the function used to permute `Y`

is `shuffle_rows`

, which shuffles the rows for `Y`

. Alternative functions for permuting `Y`

, such as `shuffle_cols`

, can be passed into the argument `permFun`

. Keyword arguments to be passed into `mlm_backest_sum`

or `matrixLM`

's `tstat`

function can be specified by the user.

```
nPerms = 5
tStats, pVals = mlm_backest_sum_perms(dat5, nPerms)
```

The `GeneticScreen`

package also provides an implementation of Collins et al. (2006)^{1}'s S scores. To run the `S_score`

function, one must construct a RawData object where X and Z encode the experimental conditions and mutants as treatment contrasts with no intercepts. This can be done by running `read_plate`

with the arguments `XCType="noint"`

and `ZCType="noint"`

. No other covariates should be included in X and Z.

```
dat6 = read_plate(X_df, DataFrame(Y), Z_df,
XCVar=:cond_conc, XCType="noint",
ZCVar=:mut, ZCType="noint", isYstd=true)
```

By default, the `S_score`

function performs the variance flooring and adjustments described by Collins et al. (2006)^{1} (`isVarFloor=true`

).

```
S = S_score(dat6)
```

Analogous to `matrixLM`

's `mlm_perms`

function, `S_score_perms`

computes permutation p-values for the S scores.

```
S, SPvals = S_score_perms(dat6, nPerms)
```

Additional details can be found in the documentation for specific functions.

1. Collins, S. R., Schuldiner, M., Krogan, N. J., & Weissman, J. S. (2006). A strategy for extracting and analyzing large-scale quantitative epistatic interaction data. Genome biology, 7(7), R63.

2. Ledoit, O., & Wolf, M. (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of empirical finance, 10(5), 603-621.