Tutorial¶
Import the MPAthic package as follows:
import mpathic as mpa
Simulating Data¶
We begin by simulating a library of variant CRP binding sites. We can use the mpa.SimulateLibrary class to create a library of random mutants from an initial wildtype sequence and mutation rate:
sim_library = mpa.SimulateLibrary(wtseq="TAATGTGAGTTAGCTCACTCAT", mutrate=0.24)
sim_library.output_df.head()
The output_df attribute of the sim_library
class looks like the dataframe below
ct | seq |
---|---|
21 | TAATGTGAGTTAGCTCACTCAT |
7 | TAATGTGAGTTAGCTAACTCAT |
6 | TAATGTGAGTTAGCTCACTCAA |
⋮
1 | TAATGTGTGTTCGCTCATCCAT |
In general, MPAthic datasets are pandas dataframes, comprising of columns of counts and sequence values. To simulate
a Sort-Seq experiment ([1]), we use the mpa.SimulateSort class. This class requires a dataset input
and a model dataframe input. We first import these inputs using io
module provided with the MPAthic package:
# Load dataset and model dataframes
dataset_df = mpa.io.load_dataset('sort_seq_data.txt')
model_df = mpa.io.load_model('crp_model.txt')
Next, we call the SimulateSort
class as follows:
# Simulate a Sort-Seq experiment
sim_sort = mpa.SimulateSort(df=dataset_df,mp=model_df)
sim_sort.output_df.head()
The head of the output dataframe looks like
ct | ct_0 | ct_1 | ct_2 | ct_3 | ct_4 | seq |
---|---|---|---|---|---|---|
4 | 0 | 0 | 1 | 0 | 0 | AAAAAAGGTGAGTTAGCTAACT |
3 | 0 | 0 | 0 | 0 | 1 | AAAAAATATAAGTTAGCTCGCT |
4 | 0 | 0 | 0 | 1 | 0 | AAAAAATATGATTTAGCTGACT |
3 | 0 | 0 | 0 | 0 | 1 | AAAAAATGTCAGTTAGCTCACT |
4 | 0 | 0 | 1 | 0 | 0 | AAAAAATGTGAATTATCGCACT |
Computing Profiles¶
It is often useful to compute the mutation rate within a set of sequences, e.g., in order to validate the composition of a library. This can be accomplished using the mpa.ProfileMut class as follows:
profile_mut = mpa.ProfileMut(dataset_df = dataset_df)
profile_mut.mut_df.head()
The mutation rate at each position within the sequences looks like
pos | wt | mut |
---|---|---|
0 | A | 0 |
1 | A | 0 |
2 | A | 0.33871 |
3 | T | 0.127566 |
4 | A | 0.082111 |
To view the frequency of occurrence for every base at each position, use the mpa.ProfileFreq class:
profile_freq = mpa.ProfileFreq(dataset_df = dataset_df)
profile_freq.freq_df.head()
pos | freq_A | freq_C | freq_G | freq_T |
---|---|---|---|---|
0 | 1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
2 | 0.66129 | 0.33871 | 0 | 0 |
3 | 0.043988 | 0.042522 | 0.041056 | 0.872434 |
4 | 0.917889 | 0.019062 | 0.02566 | 0.03739 |
Information pro les (also called “information footprints”) provide a particularly useful way to identify functional positions within a sequence. These pro les list, for each position in a sequence, the mutual information between the character at that position and the bin in which a sequence is found. Unlike mutation and frequency profiles, which require sequence counts for a single bin only, information profiles are computed from full datasets, and can be accomplished using the mpa.ProfileInfo class as follows:
profile_info = mpa.ProfileInfo(dataset_df = dataset_df)
profile_info.info_df.head()
pos | info |
---|---|
0 | 0.000077 |
1 | 0.000077 |
2 | 0.008357 |
3 | 0.008743 |
4 | 0.013745 |
Quantitative Modeling¶
The mpa.LearnModel class can be used to fit quantitative models to data:
learned_model = mpa.LearnModel(df=dataset_df)
learned_model.output_df.head()
pos | val_A | val_C | val_G | val_T |
---|---|---|---|---|
0 | -0.201587 | 0.067196 | 0.067196 | 0.067196 |
1 | -0.201587 | 0.067196 | 0.067196 | 0.067196 |
2 | -0.10637 | -0.167351 | 0.13686 | 0.13686 |
3 | -0.287282 | 0.041222 | -0.2039 | 0.44996 |
4 | -0.056109 | -0.871858 | 0.344537 | 0.583429 |
The purpose of having a quantitative model is to be able to predict the activity of arbitrary sequences. This basic operation is accomplished using the mpa.EvaluateModel class:
eval_model = mpa.EvaluateModel(dataset_df = dataset_df, model_df = model_df)
eval_model.out_df.head()
ct | ct_0 | ct_1 | ct_2 | ct_3 | ct_4 | seq | val |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 0 | 0 | AAAGGTGAGTTAGCTAACTCAT | 0.348108 |
1 | 0 | 0 | 0 | 0 | 1 | AAATATAAGTTAGCTCGCTCAT | -0.248134 |
1 | 0 | 0 | 0 | 1 | 0 | AAATATGATTTAGCTGACTCAT | 0.009507 |
1 | 0 | 0 | 0 | 0 | 1 | AAATGTCAGTTAGCTCACTCAT | 0.238852 |
1 | 0 | 0 | 1 | 0 | 0 | AAATGTGAATTATCGCACTCAT | -0.112121 |
Often, it is useful to scan a model over all sequences embedded within larger contigs. To do this, MPAthic provides the class mpa.ScanModel, which is called as follows:
# get contigs, provided with mpathic
fastafile = "./mpathic/examples/genome_ecoli_1000lines.fa"
contig = mpa.io.load_contigs_from_fasta(fastafile, model_df)
scanned_model = mpa.ScanModel(model_df = model_df, contigs_list = contigs_list)
scanned_model.sitelist_df.head()
val | seq | left | right | ori | contig | |
---|---|---|---|---|---|---|
0 | 2.040628 | GGTCGTTTGCCTGCGCCGTGCA | 11710 | 11731 | MG1655.fa | |
1 | 2.00608 | GGAAGTCGCCGCCCGCACCGCT | 74727 | 74748 | MG1655.fa | |
2 | 1.996992 | TGGGTGTGGCGCGTGACCTGTT | 45329 | 45350 | MG1655.fa | |
3 | 1.920821 | GGTATGTGTCGCCAGCCAGGCA | 38203 | 38224 | MG1655.fa | |
4 | 1.879852 | GGTGATTTTGGCGTGGTGGCGT | 73077 | 73098 | MG1655.fa |
A good way to assess the quality of a model is to compute its predictive information on a massively parallel data set. This can be done using the predictive_info (need to write this) class:
predictive_info = mpa.PredictiveInfo(data_df = dataset_df, model_df = model_df,start=52)
References
[1] | Kinney JB, Anand Murugan, Curtis G. Callan Jr., and Edward C. Cox (2010) Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. PNAS May 18, 2010. 107 (20) 9158-9163;
PDF . |