mpa.LearnModel

Overview

LearnModel is a program within the mpathic package which generates linear energy matrix models for sections of a sorted library.

Usage:

>>> import mpathic
>>> loader = mpathic.io
>>> filename = "./mpathic/data/sortseq/full-0/data.txt"
>>> df = loader.load_dataset(filename)
>>> mpathic.LearnModel(df=df,verbose=True,lm='ER')

Example Input and Output

LearnModel can accept two types of input dataframes: matrix models and neighbour models. The input table to this program must contain a sequence column and a counts column for each bin. For a Sort-Seq experiment, there can be any number of bins. For MPRA and selection experiments, the counts columns must be ct_0 and ct_1.

Matrix models Input Dataframe:

seq       ct_0       ct_1       ct_2       ct_3       ct_4

AAAAAAGGTGAGTTA   0.000000   0.000000   1.000000   0.000000   0.000000
AAAAAATATAAGTTA   0.000000   0.000000   0.000000   0.000000   1.000000
AAAAAATATGATTTA   0.000000   0.000000   0.000000   1.000000   0.000000
...
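
As a rough illustration of the layout described above, the following sketch builds a small table by hand with pandas and runs two sanity checks on it. The check_input helper is hypothetical and not part of mpathic, and the equal-length check is an assumption based on the models being indexed by position.

```python
import pandas as pd

# Toy table mirroring the rows shown above:
# one seq column plus one ct_<bin> column per bin.
df = pd.DataFrame({
    "seq": ["AAAAAAGGTGAGTTA", "AAAAAATATAAGTTA", "AAAAAATATGATTTA"],
    "ct_0": [0.0, 0.0, 0.0],
    "ct_1": [0.0, 0.0, 0.0],
    "ct_2": [1.0, 0.0, 0.0],
    "ct_3": [0.0, 0.0, 1.0],
    "ct_4": [0.0, 1.0, 0.0],
})


def check_input(table):
    """Hypothetical helper: return the count columns after basic layout checks."""
    if "seq" not in table.columns:
        raise ValueError("missing 'seq' column")
    count_cols = [c for c in table.columns if c.startswith("ct_")]
    if not count_cols:
        raise ValueError("no ct_<bin> count columns found")
    # Assumption: all sequences share one length, since models are indexed by position.
    if table["seq"].str.len().nunique() != 1:
        raise ValueError("sequences differ in length")
    return count_cols


count_cols = check_input(df)
```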

Neighbour Model:

pos     val_AA     val_AC     val_AG     val_AT     val_CA     val_CC     val_CG     val_CT     val_GA     val_GC     val_GG     val_GT     val_TA     val_TC     val_TG     val_TT
  0   0.081588  -0.019021   0.007188   0.042818  -0.048443  -0.015712  -0.053949  -0.024360  -0.025149  -0.030791  -0.022920  -0.026910   0.052324   0.002189  -0.014354   0.095505
  1   0.033288  -0.005410   0.014198   0.018246  -0.033583  -0.001761  -0.020431  -0.007561  -0.018550  -0.025738  -0.028961  -0.010787   0.007764   0.024888  -0.000199   0.054599
  2  -0.026142   0.008002  -0.029641   0.036698  -0.001028  -0.008025  -0.022645   0.023678   0.006907  -0.016295  -0.054918   0.028913  -0.005400   0.003121   0.000996   0.055780
  3  -0.046159  -0.006071  -0.001542   0.028109  -0.020442  -0.024574   0.056595  -0.024776  -0.005172  -0.055010  -0.029327  -0.016699   0.001295  -0.016304   0.128112   0.031967
 ...

Example Output Table:

   pos     val_A     val_C     val_G     val_T
0    0  0.000831 -0.014006  0.144818 -0.131643
1    1 -0.033734  0.087419 -0.029997 -0.023688
2    2  0.009189  0.018999  0.026719 -0.054908
3    3 -0.003516  0.073503  0.001759 -0.071745
4    4  0.062168 -0.028879 -0.057249  0.023961
...

Class Details

class learn_model.LearnModel(**kwargs)

Constructor for the LearnModel class. Models can be learned as either a matrix model or a neighbor model. Matrix models assume that the character at each position contributes independently to activity, whereas neighbor models assume contributions to activity from all possible pairs of adjacent characters.
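
The difference between the two model types can be sketched by scoring one sequence under each. This is a minimal illustration and not mpathic's own evaluation code: the function names are hypothetical, the parameter values are placeholders, and the column order is assumed to follow the val_A ... val_T and val_AA ... val_TT headers shown in the tables above.

```python
import numpy as np

BASES = "ACGT"


def matrix_score(seq, mat):
    """Sum independent per-position contributions; mat has shape (L, 4)."""
    return sum(mat[i, BASES.index(c)] for i, c in enumerate(seq))


def neighbor_score(seq, nbr):
    """Sum contributions of adjacent pairs; nbr has shape (L - 1, 16)."""
    total = 0
    for i in range(len(seq) - 1):
        # Column order AA, AC, AG, AT, CA, ...: index = 4 * first + second.
        col = 4 * BASES.index(seq[i]) + BASES.index(seq[i + 1])
        total += nbr[i, col]
    return total


seq = "ACG"
mat = np.arange(12).reshape(3, 4)   # placeholder matrix model parameters
nbr = np.arange(32).reshape(2, 16)  # placeholder neighbor model parameters

mat_total = matrix_score(seq, mat)
nbr_total = neighbor_score(seq, nbr)
```

A matrix model for a sequence of length L has 4 * L parameters, while a neighbor model has 16 * (L - 1), one per adjacent pair at each position.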

Parameters:
df: (pandas dataframe)

Dataframe containing a sequence column and several columns representing bins. The integer values in the bin columns represent the number of occurrences of the sequence in that bin.

lm: (str)

Learning model. Possible values include {‘ER’, ‘LS’, ‘IM’, ‘PR’}. ‘ER’: enrichment ratio inference. ‘LS’: least squares optimization. ‘IM’: mutual information maximization (similar to maximum likelihood inference in the large data limit). ‘PR’: Poisson regression.

modeltype: (str)

Type of model to be learned. Valid choices include “MAT” and “NBR”, which stand for matrix model and neighbour model, respectively. The matrix model assumes that mutations at different locations are independent, and the neighbour model assumes epistatic effects between mutations.

LS_means_std: (pandas dataframe)

For the least-squares method, this contains the user-supplied mean and standard deviation. The order of the columns is [‘bin’, ‘mean’, ‘std’].

db: (str)

File name for a SQL script; it can be passed to the function MaximizeMI_memsaver.

iteration: (int)

Total number of MCMC iterations to run. Passed to the sample method from MCMC.py, which may be part of pymc.

burnin: (int)

Variables will not be tallied until this many iterations are complete (thermalization).

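thin: (int)

Thinning interval for the MCMC sampler: variables are tallied only once every this many iterations. Like burnin, it is passed to the sampler, but it has a smaller default value.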
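
How iteration, burnin, and thin interact can be sketched without running a sampler. The helper below is hypothetical and only lists which iterations would be tallied, assuming the common convention that tallying starts once burnin iterations are complete and then happens every thin iterations. The exact bookkeeping inside the sampler may differ. The values used are the defaults shown in the MaximizeMI_memsaver signature.

```python
def tallied_iterations(iteration, burnin, thin):
    """Hypothetical helper: indices of the iterations that would be recorded."""
    return list(range(burnin, iteration, thin))


kept = tallied_iterations(iteration=30000, burnin=1000, thin=10)
n_kept = len(kept)  # (30000 - 1000) / 10 samples remain
```

Raising burnin discards more of the early chain, while raising thin reduces the correlation between consecutive recorded samples at the cost of keeping fewer of them.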

runnum: (int)

Run number, used to determine the correct SQL script extension in MaximizeMI_memsaver.

initialize: (str)

Variable for initializing the LearnModel class constructor. Valid values include “rand”, “LS”, and “PR”: “rand” is MCMC, “LS” is least squares, and “PR” is Poisson regression.

start: (int)

Starting position of the sequence.

end: (int)

End position of the sequence.

foreground: (int)

Indicates the column number representing the foreground (e.g. it can be passed to the Berg_von_Hippel method).

background: (int)

Indicates column number representing background.

alpha: (float)

Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC. (This snippet is taken from ridge.py, written by Mathieu Blondel.)

pseudocounts: (int)

An artificial count added to bin counts where counts are very low. Must be non-negative.

verbose: (bool)

A value of False for this parameter suppresses output to the screen.

tm: (int)

Number of bins (to be double-checked).

Berg_von_Hippel(df, dicttype, foreground=1, background=0, pseudocounts=1)

Learn models using the Berg-von Hippel model. The foreground sequences are usually in bin_1 and the background sequences in bin_0; this can be changed via the foreground and background arguments.
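
The general idea of comparing foreground and background sequences can be sketched as a per-position log ratio of character frequencies. This is an illustration under stated assumptions and not the mpathic implementation: the function name is hypothetical, each sequence is counted once rather than weighted by its bin counts, and the sign and normalization conventions of the real method may differ.

```python
import math


def log_enrichment_matrix(fg_seqs, bg_seqs, pseudocounts=1, alphabet="ACGT"):
    """Hypothetical helper: log(foreground frequency / background frequency)
    for every character at every position, with pseudocounts added."""
    length = len(fg_seqs[0])
    fg_total = len(fg_seqs) + pseudocounts * len(alphabet)
    bg_total = len(bg_seqs) + pseudocounts * len(alphabet)
    rows = []
    for pos in range(length):
        row = {"pos": pos}
        for ch in alphabet:
            # Pseudocounts keep unseen characters from producing log(0).
            fg = sum(s[pos] == ch for s in fg_seqs) + pseudocounts
            bg = sum(s[pos] == ch for s in bg_seqs) + pseudocounts
            row["val_" + ch] = math.log((fg / fg_total) / (bg / bg_total))
        rows.append(row)
    return rows


foreground = ["ACG", "ACT", "ACG"]  # toy sequences, e.g. from bin_1
background = ["ACG", "TGT", "GTA"]  # toy sequences, e.g. from bin_0
matrix = log_enrichment_matrix(foreground, background)
```

Characters enriched in the foreground get positive values and depleted ones get negative values. With pseudocounts set to 0, any character absent from either set at some position would make the ratio undefined.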

Compute_Least_Squares(raveledmat, batch, sw, alpha=0)

Computes the least-squares fit with ridge regression, chosen because it is an sklearn regressor that supports sample weights, which makes this much faster.
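
What a sample-weighted ridge fit computes can be written out in closed form with numpy. This sketch is an assumption-laden stand-in and not the body of Compute_Least_Squares: the function name is hypothetical, no intercept is fitted, and the inputs are toy values. It minimizes the weighted squared error plus alpha times the squared norm of the coefficients.

```python
import numpy as np


def weighted_ridge(X, y, sample_weight, alpha=0.0):
    """Hypothetical helper: solve (X^T W X + alpha * I) w = X^T W y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    XtW = X.T * w  # scales each sample (column of X^T) by its weight
    A = XtW @ X + alpha * np.eye(X.shape[1])
    b = XtW @ y
    return np.linalg.solve(A, b)


X = [[1, 0], [0, 1], [1, 1]]  # toy feature matrix
y = [2, -1, 1]                # generated from coefficients [2, -1]
sw = [1, 2, 3]                # toy sample weights

coef = weighted_ridge(X, y, sw, alpha=0.0)
coef_reg = weighted_ridge(X, y, sw, alpha=10.0)
```

With alpha=0 the fit recovers the generating coefficients exactly, and a larger alpha shrinks the coefficients toward zero. Sample weights let each distinct sequence appear once with its count as a weight instead of being repeated row by row.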

Markov(df, dicttype, foreground=1, background=0, pseudocounts=1)

Learn models using the Berg-von Hippel model. The foreground sequences are usually in bin_1 and the background sequences in bin_0; this can be changed via the foreground and background arguments.

MaximizeMI_memsaver(seq_mat, df, emat_0, wtrow, db=None, burnin=1000, iteration=30000, thin=10, runnum=0, verbose=False)

Performs MCMC mutual information maximization in the case where lm = memsaver.

find_second_NBR_matrix_entry(s)

This is a function for use with numpy.apply_along_axis. It takes in a sequence matrix and returns the second nonzero entry.
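
The pattern described here can be sketched with numpy.apply_along_axis applied row by row. The helper name and the toy matrix are hypothetical, and returning the column index (rather than the value) of the second nonzero entry is an assumption about what the mpathic function reports.

```python
import numpy as np


def second_nonzero_index(row):
    """Hypothetical helper: column index of the second nonzero entry in a row."""
    return np.nonzero(row)[0][1]


# Toy matrix with at least two nonzero entries per row.
seq_mat = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
])

# Apply the helper along axis 1, i.e. once per row.
second = np.apply_along_axis(second_nonzero_index, 1, seq_mat)
```

A row with fewer than two nonzero entries would raise an IndexError in this sketch.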

weighted_std(values, weights)

Takes in a dataframe with seqs and cts and calculates the weighted standard deviation.
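
A weighted standard deviation over (values, weights) can be sketched with numpy as follows. The function name is hypothetical and the normalization (dividing by the sum of the weights, with no bias correction) is an assumption; the mpathic implementation may differ.

```python
import numpy as np


def weighted_std_sketch(values, weights):
    """Hypothetical helper: square root of the weighted mean squared deviation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(values, weights=weights)
    variance = np.average((values - mean) ** 2, weights=weights)
    return np.sqrt(variance)


# Toy example: bin indices as values, counts in each bin as weights.
spread = weighted_std_sketch([1, 2, 3], [1, 0, 1])
```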