Tutorial¶

Tutorial¶

Simulating Data¶

Computing Profiles¶

Quantitative Modeling¶

Simulating Data¶

Computing Profiles¶

Quantitative Modeling¶

Table Of Contents

This Page

Previous topic

Next topic

Import the MPAthic package as follows:

import mpathic as mpa

We begin by simulating a library of variant CRP binding sites. We can use the mpa.SimulateLibrary class to create a library of random mutants from an initial wildtype sequence and mutation rate:

sim_library = mpa.SimulateLibrary(wtseq="TAATGTGAGTTAGCTCACTCAT", mutrate=0.24)
sim_library.output_df.head()

The output_df attribute of the sim_library class looks like the dataframe below

ct	seq
21	TAATGTGAGTTAGCTCACTCAT
7	TAATGTGAGTTAGCTAACTCAT
6	TAATGTGAGTTAGCTCACTCAA

⋮

1	TAATGTGTGTTCGCTCATCCAT

In general, MPAthic datasets are pandas dataframes, comprising of columns of counts and sequence values. To simulate a Sort-Seq experiment ([1]), we use the mpa.SimulateSort class. This class requires a dataset input and a model dataframe input. We first import these inputs using io module provided with the MPAthic package:

# Load dataset and model dataframes
dataset_df = mpa.io.load_dataset('sort_seq_data.txt')
model_df = mpa.io.load_model('crp_model.txt')

Next, we call the SimulateSort class as follows:

# Simulate a Sort-Seq experiment
sim_sort = mpa.SimulateSort(df=dataset_df,mp=model_df)
sim_sort.output_df.head()

The head of the output dataframe looks like

It is often useful to compute the mutation rate within a set of sequences, e.g., in order to validate the composition of a library. This can be accomplished using the mpa.ProfileMut class as follows:

profile_mut = mpa.ProfileMut(dataset_df = dataset_df)
profile_mut.mut_df.head()

The mutation rate at each position within the sequences looks like

pos	wt	mut
0	A	0
1	A	0
2	A	0.33871
3	T	0.127566
4	A	0.082111

To view the frequency of occurrence for every base at each position, use the mpa.ProfileFreq class:

profile_freq = mpa.ProfileFreq(dataset_df = dataset_df)
profile_freq.freq_df.head()

pos	freq_A	freq_C	freq_G	freq_T
0	1	0	0	0
1	1	0	0	0
2	0.66129	0.33871	0	0
3	0.043988	0.042522	0.041056	0.872434
4	0.917889	0.019062	0.02566	0.03739

Information pro les (also called “information footprints”) provide a particularly useful way to identify functional positions within a sequence. These pro les list, for each position in a sequence, the mutual information between the character at that position and the bin in which a sequence is found. Unlike mutation and frequency profiles, which require sequence counts for a single bin only, information profiles are computed from full datasets, and can be accomplished using the mpa.ProfileInfo class as follows:

profile_info = mpa.ProfileInfo(dataset_df = dataset_df)
profile_info.info_df.head()

pos	info
0	0.000077
1	0.000077
2	0.008357
3	0.008743
4	0.013745

The mpa.LearnModel class can be used to fit quantitative models to data:

learned_model = mpa.LearnModel(df=dataset_df)
learned_model.output_df.head()

pos	val_A	val_C	val_G	val_T
0	-0.201587	0.067196	0.067196	0.067196
1	-0.201587	0.067196	0.067196	0.067196
2	-0.10637	-0.167351	0.13686	0.13686
3	-0.287282	0.041222	-0.2039	0.44996
4	-0.056109	-0.871858	0.344537	0.583429

The purpose of having a quantitative model is to be able to predict the activity of arbitrary sequences. This basic operation is accomplished using the mpa.EvaluateModel class:

eval_model = mpa.EvaluateModel(dataset_df = dataset_df, model_df = model_df)
eval_model.out_df.head()

ct	ct_2	ct_3	ct_4	seq	val
1	1	0	0	AAAGGTGAGTTAGCTAACTCAT	0.348108
1	0	0	1	AAATATAAGTTAGCTCGCTCAT	-0.248134
1	0	1	0	AAATATGATTTAGCTGACTCAT	0.009507
1	0	0	1	AAATGTCAGTTAGCTCACTCAT	0.238852
1	1	0	0	AAATGTGAATTATCGCACTCAT	-0.112121

Often, it is useful to scan a model over all sequences embedded within larger contigs. To do this, MPAthic provides the class mpa.ScanModel, which is called as follows:

# get contigs, provided with mpathic
fastafile = "./mpathic/examples/genome_ecoli_1000lines.fa"
contig = mpa.io.load_contigs_from_fasta(fastafile, model_df)

scanned_model = mpa.ScanModel(model_df = model_df, contigs_list = contigs_list)
scanned_model.sitelist_df.head()

	val	seq	left	right	contig
0	2.040628	GGTCGTTTGCCTGCGCCGTGCA	11710	11731	MG1655.fa
1	2.00608	GGAAGTCGCCGCCCGCACCGCT	74727	74748	MG1655.fa
2	1.996992	TGGGTGTGGCGCGTGACCTGTT	45329	45350	MG1655.fa
3	1.920821	GGTATGTGTCGCCAGCCAGGCA	38203	38224	MG1655.fa
4	1.879852	GGTGATTTTGGCGTGGTGGCGT	73077	73098	MG1655.fa

A good way to assess the quality of a model is to compute its predictive information on a massively parallel data set. This can be done using the predictive_info (need to write this) class:

predictive_info = mpa.PredictiveInfo(data_df = dataset_df, model_df = model_df,start=52)

References

[1]	Kinney JB, Anand Murugan, Curtis G. Callan Jr., and Edward C. Cox (2010) Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. PNAS May 18, 2010. 107 (20) 9158-9163; `PDF`.

AAAAAAGGTGAGTTAGCTAACT

AAAAAATATAAGTTAGCTCGCT

AAAAAATATGATTTAGCTGACT

AAAAAATGTCAGTTAGCTCACT

AAAAAATGTGAATTATCGCACT