mpa.LearnModel
Overview
LearnModel is a program within the mpathic package that generates linear energy matrix models for sections of a sorted library.
Usage:
>>> import mpathic
>>> loader = mpathic.io
>>> filename = "./mpathic/data/sortseq/full-0/data.txt"
>>> df = loader.load_dataset(filename)
>>> mpathic.LearnModel(df=df,verbose=True,lm='ER')
Example Input and Output
LearnModel accepts two types of input dataframes: matrix models and neighbour models. The input table must contain a sequence column and one counts column for each bin. For a Sort-Seq experiment, any number of bins is allowed. For MPRA and selection experiments, the counts columns must be ct_0 and ct_1.
Matrix model input dataframe:
seq ct_0 ct_1 ct_2 ct_3 ct_4
AAAAAAGGTGAGTTA 0.000000 0.000000 1.000000 0.000000 0.000000
AAAAAATATAAGTTA 0.000000 0.000000 0.000000 0.000000 1.000000
AAAAAATATGATTTA 0.000000 0.000000 0.000000 1.000000 0.000000
...
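The column requirements described above can be checked before a table is handed to LearnModel. The following is a minimal sketch in plain Python; the helper name `check_count_columns` and the `experiment` labels are hypothetical and are not part of mpathic.

```python
import re


def check_count_columns(columns, experiment="sortseq"):
    """Return the sorted bin numbers found in a header, or raise ValueError.

    Hypothetical helper for illustration only; not part of mpathic.
    """
    if "seq" not in columns:
        raise ValueError("missing sequence column 'seq'")

    # Collect the bin number of every column named ct_<number>.
    bins = []
    for name in columns:
        match = re.fullmatch(r"ct_(\d+)", name)
        if match:
            bins.append(int(match.group(1)))
    bins.sort()

    if not bins:
        raise ValueError("no counts columns of the form ct_<bin>")
    # MPRA and selection experiments are described as needing ct_0 and ct_1.
    if experiment in ("mpra", "selection") and bins != [0, 1]:
        raise ValueError("MPRA and selection data need exactly ct_0 and ct_1")
    return bins


header = "seq ct_0 ct_1 ct_2 ct_3 ct_4".split()
print(check_count_columns(header))  # [0, 1, 2, 3, 4]
```

A five-bin header like the one shown passes as Sort-Seq data but would be rejected if `experiment` were set to "mpra".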
Neighbour Model:
pos val_AA val_AC val_AG val_AT val_CA val_CC val_CG val_CT val_GA val_GC val_GG val_GT val_TA val_TC val_TG val_TT
0 0.081588 -0.019021 0.007188 0.042818 -0.048443 -0.015712 -0.053949 -0.024360 -0.025149 -0.030791 -0.022920 -0.026910 0.052324 0.002189 -0.014354 0.095505
1 0.033288 -0.005410 0.014198 0.018246 -0.033583 -0.001761 -0.020431 -0.007561 -0.018550 -0.025738 -0.028961 -0.010787 0.007764 0.024888 -0.000199 0.054599
2 -0.026142 0.008002 -0.029641 0.036698 -0.001028 -0.008025 -0.022645 0.023678 0.006907 -0.016295 -0.054918 0.028913 -0.005400 0.003121 0.000996 0.055780
3 -0.046159 -0.006071 -0.001542 0.028109 -0.020442 -0.024574 0.056595 -0.024776 -0.005172 -0.055010 -0.029327 -0.016699 0.001295 -0.016304 0.128112 0.031967
...
Example Output Table:
pos val_A val_C val_G val_T
0 0 0.000831 -0.014006 0.144818 -0.131643
1 1 -0.033734 0.087419 -0.029997 -0.023688
2 2 0.009189 0.018999 0.026719 -0.054908
3 3 -0.003516 0.073503 0.001759 -0.071745
4 4 0.062168 -0.028879 -0.057249 0.023961
...
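To make the output table concrete, the sketch below scores a short sequence against the five rows shown above. It assumes that the model value of a sequence is the sum of the entries selected at each position, which is the usual reading of a linear matrix model; the function name `score_sequence` is hypothetical, and only positions 0 to 4 are used because the rest of the table is elided.

```python
# First five rows of the example output table (positions 0-4 only).
MATRIX = [
    {"A": 0.000831, "C": -0.014006, "G": 0.144818, "T": -0.131643},
    {"A": -0.033734, "C": 0.087419, "G": -0.029997, "T": -0.023688},
    {"A": 0.009189, "C": 0.018999, "G": 0.026719, "T": -0.054908},
    {"A": -0.003516, "C": 0.073503, "G": 0.001759, "T": -0.071745},
    {"A": 0.062168, "C": -0.028879, "G": -0.057249, "T": 0.023961},
]


def score_sequence(seq, matrix):
    """Sum the matrix entries selected by each character of seq.

    Hypothetical helper for illustration only; not part of mpathic.
    """
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the number of matrix rows")
    return sum(row[base] for row, base in zip(matrix, seq))


score = score_sequence("GCGCA", MATRIX)
print(round(score, 6))  # 0.394627
```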
Class Details
class learn_model.LearnModel(**kwargs)
Constructor for the LearnModel class. Models can be learned as either a matrix model or a neighbour model. A matrix model assumes that the character at each position contributes independently to activity, whereas a neighbour model assumes contributions to activity from every possible pair of adjacent characters.
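The difference between the two model types can be seen in which table entries a given sequence selects. The sketch below is illustrative only: the function names are hypothetical, and the column naming (val_A for matrix models, val_AA for neighbour models) is taken from the example tables on this page.

```python
def mat_entries(seq):
    """Entries a matrix model uses: one single-character column per position."""
    return [(pos, "val_" + base) for pos, base in enumerate(seq)]


def nbr_entries(seq):
    """Entries a neighbour model uses: one adjacent-pair column per position."""
    return [(pos, "val_" + seq[pos:pos + 2]) for pos in range(len(seq) - 1)]


seq = "AAAAAAGGTGAGTTA"  # first sequence of the example input table
print(len(mat_entries(seq)), len(nbr_entries(seq)))  # 15 14
print(mat_entries(seq)[5], nbr_entries(seq)[5])  # (5, 'val_A') (5, 'val_AG')

# Number of parameters in each table for a sequence of this length.
print(4 * len(seq), 16 * (len(seq) - 1))  # 60 224
```

A sequence of length 15 therefore touches 15 entries of a 60-parameter matrix model but 14 entries of a 224-parameter neighbour model, which is why neighbour models generally need more data to fit.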
Parameters: - df: (pandas dataframe)
Dataframe containing a sequence column and several columns
representing bins. The integer values in the bin columns
represent the number of occurrences of the sequence in that bin.
- lm: (str)
Learning model. Possible values include {‘ER’, ‘LS’, ‘IM’, ‘PR’}.
‘ER’: enrichment ratio inference. ‘LS’: least squares
optimization. ‘IM’: mutual information maximization
(similar to maximum likelihood inference in the large data limit).
‘PR’: Poisson regression.
- modeltype: (string)
Type of model to be learned. Valid choices include “MAT”
and “NBR”, which stand for matrix model and neighbour model,
respectively. A matrix model assumes mutations at different
positions are independent, whereas a neighbour model allows
epistatic effects between mutations.
- LS_means_std: (pandas dataframe)
For the least-squares method, this contains
the user supplied mean and standard deviation.
The order of the columns is [‘bin’, ‘mean’, ‘std’].
- db: (string)
File name for a SQL script; it can be passed
to the function MaximizeMI_memsaver.
- iteration: (int)
Total number of MCMC iterations to do. Passed
to the sample method from MCMC.py, which may be
part of pymc.
- burnin: (int)
Variables will not be tallied until this many
iterations are complete (thermalization).
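- thin: (int)
Thinning interval for MCMC sampling (typically, variables
are tallied only every thin iterations). Like burnin, it is
passed to the sampler, but it has a smaller default value.
- runnum: (int)
Run number, used to determine the correct SQL
script extension in MaximizeMI_memsaver.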
- initialize: (string)
Variable for initializing the LearnModel class
constructor. Valid values include “rand”,
“LS”, “PR”. “rand” is MCMC, “LS” is least squares
and “PR” is Poisson regression.
- start: (int)
Starting position of the sequence.
- end: (int)
End position of the sequence.
- foreground: (int)
Indicates the column number representing the foreground
(e.g. it can be passed to the Berg_von_Hippel method).
- background: (int)
Indicates column number representing background.
- alpha: (float)
Regularization strength; must be a positive float. Regularization
improves the conditioning of the problem and reduces the variance of
the estimates. Larger values specify stronger regularization.
Alpha corresponds to C^-1 in other linear models such as
LogisticRegression or LinearSVC. (This description is taken from
ridge.py, written by Mathieu Blondel.)
- pseudocounts: (int)
An artificial count added to bin counts where counts are
very low. Must be non-negative.
- verbose: (bool)
A value of False for this parameter suppresses
output to the screen.
- tm: (int)
Number of bins (to be double-checked).
Berg_von_Hippel(df, dicttype, foreground=1, background=0, pseudocounts=1)
Learn models using the Berg-von Hippel model. The foreground sequences are usually in bin_1 and the background sequences in bin_0; this can be changed via the foreground and background arguments.
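The roles of foreground, background and pseudocounts can be illustrated with a generic per-position log-enrichment calculation. This is a sketch under stated assumptions, not mpathic's implementation of Berg_von_Hippel: the function name `log_enrichment`, the row format and the toy counts are all made up for illustration.

```python
import math

BASES = "ACGT"


def log_enrichment(rows, foreground=1, background=0, pseudocounts=1):
    """Per-position log ratio of base frequencies, foreground vs background.

    rows is a list of (seq, counts) pairs, where counts[i] is the count of
    seq in bin i. Hypothetical helper; not part of mpathic.
    """
    length = len(rows[0][0])
    # Start every tally at the pseudocount so that no frequency is zero.
    fg = [dict.fromkeys(BASES, pseudocounts) for _ in range(length)]
    bg = [dict.fromkeys(BASES, pseudocounts) for _ in range(length)]
    for seq, counts in rows:
        for pos, base in enumerate(seq):
            fg[pos][base] += counts[foreground]
            bg[pos][base] += counts[background]

    matrix = []
    for pos in range(length):
        fg_total = sum(fg[pos].values())
        bg_total = sum(bg[pos].values())
        matrix.append({
            base: math.log((fg[pos][base] / fg_total) / (bg[pos][base] / bg_total))
            for base in BASES
        })
    return matrix


# Toy data: (sequence, [count in bin 0, count in bin 1]).
toy_rows = [("AC", [4, 1]), ("GC", [1, 4]), ("AT", [2, 2])]
matrix = log_enrichment(toy_rows)
print(matrix[0]["G"] > 0 > matrix[0]["A"])  # True
```

With pseudocounts set to 0, any base that never appears in one of the two bins would produce a division by zero or a logarithm of zero, which is the situation the pseudocounts parameter is meant to avoid.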
Compute_Least_Squares(raveledmat, batch, sw, alpha=0)
Ridge regression is the only sklearn regressor that supports sample weights, which will make this much faster.
Markov(df, dicttype, foreground=1, background=0, pseudocounts=1)
Learn models using the Berg-von Hippel model. The foreground sequences are usually in bin_1 and the background sequences in bin_0; this can be changed via the foreground and background arguments.
MaximizeMI_memsaver(seq_mat, df, emat_0, wtrow, db=None, burnin=1000, iteration=30000, thin=10, runnum=0, verbose=False)
Performs MCMC mutual information (MI) maximization in the case where lm = memsaver.
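The interaction of iteration, burnin and thin can be sketched as follows, using the default values from the signature above. The sketch assumes the common MCMC convention that samples are kept every thin iterations once burnin iterations have passed; the sampler's exact bookkeeping may differ, and the function name `kept_iterations` is hypothetical.

```python
def kept_iterations(iteration=30000, burnin=1000, thin=10):
    """Indices of the MCMC iterations that would be tallied.

    Assumes samples are kept every `thin` iterations after `burnin`.
    Hypothetical helper; not part of mpathic or pymc.
    """
    return range(burnin, iteration, thin)


kept = kept_iterations()
print(len(kept))  # 2900
print(kept[0], kept[-1])  # 1000 29990
```

Under this convention, the defaults yield 2900 retained samples out of 30000 iterations.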
find_second_NBR_matrix_entry(s)
This is a function for use with numpy's apply_along_axis. It takes in a sequence matrix and returns the second nonzero entry.
weighted_std(values, weights)
Takes in a dataframe with seqs and cts and calculates the std.
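A weighted standard deviation of this kind can be sketched in plain Python as shown below. This version uses the population-style formula with weights normalized by their sum; whether mpathic's weighted_std uses exactly this normalization is an assumption, and the name `weighted_std_sketch` is hypothetical.

```python
import math


def weighted_std_sketch(values, weights):
    """Weighted standard deviation, with weights normalized by their sum.

    Illustrative only; mpathic's weighted_std may normalize differently.
    """
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


# Bin numbers as values, counts per bin as weights.
print(weighted_std_sketch([0, 1], [1, 1]))  # 0.5
print(weighted_std_sketch([0, 1, 2, 3, 4], [0, 0, 1, 0, 0]))  # 0.0
```

A sequence found in a single bin has zero spread, while one split evenly between bins 0 and 1 has a standard deviation of 0.5.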