Expectation Maximization (EM) for MEME Motif Discovery in Bioinformatics (Part 2 of 3)
Author: Saniya Khullar
Uploaded: 2021-02-19
Views: 1391
Please note: MEME stands for Multiple Expectation Maximizations for Motif Elicitation. In bioinformatics, motifs are typically sequence patterns that occur many times in a group of related protein or DNA sequences. Motifs are usually associated with some biological function (e.g. Transcription Factor Binding Sites, where Transcription Factors bind to regulatory elements like promoters/enhancers). Saniya goes through a detailed toy example of applying the MEME algorithm to learn a Position Weight Matrix (PWM) and the associated motif occurrences.
Please note this is the 2nd detailed video walking through an example of using MEME to discover motifs for TF binding.
Part 1 of 3 (previous video): • Expectation Maximization (EM) for MEME Mot...
Part 2 of 3 (current video): • Expectation Maximization (EM) for MEME Mot...
Part 3 of 3 (next video): • Expectation Maximization (EM) for MEME Mot...
Please note PWM stands for Position Weight Matrix and not Probability Weighted Matrix. Sorry!
********* Please note this toy example: **********
L = 6 bases (length of each DNA sequence)
W = 3 bases (motif width); please note this is a parameter we selected.
N = 4 sequences
4 DNA sequences:
1. GTCAGG
2. GAGAGT
3. ACGGAG
4. CCAGTC
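Using the MEME algorithm, please find the Position Weight Matrix (PWM), or P-matrix, including background (non-motif) probabilities. Please also find the occurrences of motifs in these 4 sequences. :)
Assumptions: when initializing a PWM from a subsequence, please set the probability of each matching letter to some value pi (= 0.7); the other three bases at that motif position then share the remaining 0.3 (0.1 each).
11 unique candidate motifs (3-base subsequences) are found across all 4 sequences :)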
GTC, TCA, CAG, AGG, GAG, AGA, AGT, ACG, CGG, GGA, CCA.
Here, m = # of possible start positions for a motif in a DNA sequence: m = L - W + 1 = 6 - 3 + 1 = 4 (as Saniya shows).
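As a quick check of the setup above, here is a minimal Python sketch (not from the video) that lists the distinct 3-base subsequences and computes m. The helper name `unique_kmers` is hypothetical and chosen only for illustration.

```python
# Toy data from the description: N = 4 sequences, L = 6, W = 3.
seqs = ["GTCAGG", "GAGAGT", "ACGGAG", "CCAGTC"]
W = 3

def unique_kmers(sequences, width):
    """Return the distinct width-mers in order of first appearance."""
    seen = []
    for s in sequences:
        for j in range(len(s) - width + 1):
            kmer = s[j:j + width]
            if kmer not in seen:
                seen.append(kmer)
    return seen

# Number of possible motif start positions per sequence: m = L - W + 1.
m = len(seqs[0]) - W + 1
candidates = unique_kmers(seqs, W)

print(m)                # 4
print(len(candidates))  # 11
print(candidates)
```

Each of these 11 subsequences can serve as a seed for one initial PWM.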
Typos found:
*47m 55s: 0.0766% is the correct probability of getting Sequence 1 given that the motif for sequence 1 starts at position 3.
*1h 27m 11s: numerator should be 0.23.
**********************************************************
Please reach out with any and all questions and please subscribe to Saniya's YouTube channel for more updates. :)
TIME STAMPS:
00:00 Expectation Maximization (EM) for MEME Motif Discovery in Bioinformatics (Part 2 of 3)
00:21 Z matrix (probability of motif starting in given position of sequence)
05:54 Initial Z matrix in our example (based on initializing each value to 1/m).
07:51 How to initialize the Position Weight Matrix (PWM) for a given motif based on our default values.
10:01 What is a background (non-motif) position? What is a motif position? Interpret a PWM
11:48 Initial assumption for background (non-motif) positions: 25% prob. for each base
12:05 Rule we use to initialize our PWM for a given motif
13:59 11 unique motifs: GTC, TCA, CAG, AGG, GAG, AGA, AGT, ACG, CGG, GGA, CCA.
14:07 Example of initializing PWM for motif: GTC
15:28 Initial Position Weight Matrices for 11 unique motifs
18:48 Checklist of Info we gathered so far for MEME algorithm
22:26 Count # of each type of DNA base across all of the sequences
23:00 Overview of the basic EM approach (Expectation Maximization) for Motif Discovery
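The initialization steps listed above (Z matrix at 1/m, background at 25% per base, PWM seeded from a subsequence) could be sketched in Python roughly as follows. This is an illustrative sketch, not the code used in the video: the function names and the dictionary layout (column 0 = background, columns 1..W = motif positions) are assumptions, and the non-matching bases are assumed to split the remaining 1 - pi equally (0.1 each), which is consistent with the 0.0766% figure quoted in the typo note.

```python
BASES = "ACGT"

def init_pwm(seed, pi=0.7, background=0.25):
    """Build an initial PWM from a seed subsequence.

    Returns {base: [p_background, p_pos1, ..., p_posW]}.
    Assumption: non-matching bases share (1 - pi) equally.
    """
    other = (1.0 - pi) / (len(BASES) - 1)
    return {
        b: [background] + [pi if b == c else other for c in seed]
        for b in BASES
    }

def init_z(n_seqs, m):
    """Initial Z matrix: every start position equally likely (1/m)."""
    return [[1.0 / m] * m for _ in range(n_seqs)]

pwm_gtc = init_pwm("GTC")
z0 = init_z(n_seqs=4, m=4)

for base in BASES:
    print(base, [round(p, 2) for p in pwm_gtc[base]])
print(z0[0])  # [0.25, 0.25, 0.25, 0.25]
```

Every column of the resulting PWM sums to 1, so each column is a valid probability distribution over the four bases.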
===== E-step: =====
26:57 Probability of a Sequence given a motif starting position
29:00 Little Break in between :) (around 30 secs)
29:36 Interpreting the formula for the probability of a sequence given a motif starting position
32:06 Focus on GAG motif (out of the 11 motifs) for example going forward. (Please apply similar concepts for other 10 motifs. Saniya randomly chose GAG for illustration)
32:43 Focus on GAG motif and Sequence 1: if we use the initial PWM for motif GAG, what's the probability of observing sequence 1 given that the motif for sequence 1 starts at position 3 (corresponds to CAG) instead of positions 1, 2, or 4?
47:55 Correct Probability should be 0.0766% (typo was made!)
48:15 Calculate probability of motif starting in positions 1, 2, or 4 for DNA Sequence 1 if we use PWM for motif GAG.
53:01 Normalize each row of the Z matrix: each column in a row is the probability that the motif for that sequence starts at that column's position, so the values across a row should sum to 1.
54:54 Repeating this same step for the other sequences (2 to 4) to fully update the Z matrix based on our PWM initialized for GAG. Note: the motifs shown are off for seq 3 and seq 4. Also, GAG for seq 2 should be 0.7 * 0.7 * 0.7 (70% * 70% * 70%) for j = 1.
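The E-step outlined above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the video's own code: the PWM for GAG uses 0.7 for matching bases and 0.1 for the others, background positions use 0.25, start positions are 0-based in the code (so the video's position 3 is `j = 2`), and the function names are hypothetical. Because every start position has the same prior (1/m), normalizing the per-position probabilities directly gives the Z row.

```python
seqs = ["GTCAGG", "GAGAGT", "ACGGAG", "CCAGTC"]
W = 3

# Initial PWM seeded from GAG: column 0 = background, columns 1..3 = motif.
pwm = {
    "A": [0.25, 0.1, 0.7, 0.1],
    "C": [0.25, 0.1, 0.1, 0.1],
    "G": [0.25, 0.7, 0.1, 0.7],
    "T": [0.25, 0.1, 0.1, 0.1],
}

def seq_prob_given_start(seq, j, pwm, width):
    """P(sequence | motif starts at 0-based position j)."""
    p = 1.0
    for k, base in enumerate(seq):
        if j <= k < j + width:
            p *= pwm[base][k - j + 1]  # inside the motif window
        else:
            p *= pwm[base][0]          # background position
    return p

def e_step(sequences, pwm, width):
    """Return the row-normalized Z matrix (uniform prior over starts)."""
    z = []
    for s in sequences:
        row = [seq_prob_given_start(s, j, pwm, width)
               for j in range(len(s) - width + 1)]
        total = sum(row)
        z.append([v / total for v in row])
    return z

p3 = seq_prob_given_start(seqs[0], 2, pwm, W)
print(f"{100 * p3:.4f}%")  # 0.0766%

z = e_step(seqs, pwm, W)
for row in z:
    print([round(v, 4) for v in row])
```

For sequence 1 the largest Z value lands on position 3 (CAG), the window that best matches GAG.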
===== M-step: =====
01:00:17 M-step: re-estimate P-matrix (our PWM) using updated Z-matrix values: 1st find expected # of each DNA base in motif position
01:18:01 Update our motif counts
01:19:21 n_T,2: expected # of T bases in position 2 of the motif
01:21:04 Find background (non-motif) counts for bases
01:26:00 Probability of DNA base in particular position (based on our counts)
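The M-step bookkeeping above (expected motif counts, background counts, then probabilities) might look like this in Python. This is a hedged sketch rather than the video's exact calculation: the function name `m_step` is hypothetical, and the pseudocount of 1 added to every count is an assumption (a common choice to avoid zero probabilities), so the numbers may differ from those shown in the video. For brevity the example feeds in the uniform initial Z matrix; in a real iteration the Z matrix from the E-step would be passed instead.

```python
from collections import Counter

BASES = "ACGT"
seqs = ["GTCAGG", "GAGAGT", "ACGGAG", "CCAGTC"]
W = 3

def m_step(sequences, z, width, pseudo=1.0):
    """Re-estimate the PWM from Z.

    n[b][k] is the expected count of base b at motif position k
    (k = 1..width); n[b][0] is the background (non-motif) count.
    """
    n = {b: [0.0] * (width + 1) for b in BASES}
    for s, z_row in zip(sequences, z):
        for j, z_ij in enumerate(z_row):
            for k in range(width):
                n[s[j + k]][k + 1] += z_ij
    # Background count = total occurrences minus expected motif occurrences.
    totals = Counter("".join(sequences))
    for b in BASES:
        n[b][0] = totals[b] - sum(n[b][1:])
    # Convert counts to probabilities, column by column.
    pwm = {b: [0.0] * (width + 1) for b in BASES}
    for k in range(width + 1):
        col_total = sum(n[b][k] + pseudo for b in BASES)
        for b in BASES:
            pwm[b][k] = (n[b][k] + pseudo) / col_total
    return n, pwm

z_uniform = [[0.25] * 4 for _ in seqs]
counts, new_pwm = m_step(seqs, z_uniform, W)

print(Counter("".join(seqs)))  # total base counts: G=10, A=6, C=5, T=3
print("n_T,2 =", counts["T"][2])
for b in BASES:
    print(b, [round(p, 3) for p in new_pwm[b]])
```

The expected counts in each motif column sum to N = 4 (one motif occurrence per sequence), and the background counts sum to the remaining 24 - 12 = 12 bases.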
===== Summary =====
01:28:04 Summary of approach for GAG (E & M steps): 1 iteration
01:29:43 Probability of each X given updated Z and P matrices
01:30:48 Calculate Log-Likelihood (Video 3)