Dotplot: A protein built from modules that resemble one another, compared with itself.
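A self-comparison dotplot can be sketched in a few lines. This is a minimal illustration, not the program used for the slide: the window size, the exact-match criterion and the example sequence (with a repeated module "MKTAY") are all assumptions made for the demonstration.

```python
def dotplot(seq_a, seq_b, window=4):
    """Return the set of (i, j) where the windows starting at i and j are identical."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dotplot as text: '*' for a matching window start, '.' otherwise."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len(seq_b))))
    return "\n".join(rows)


# Hypothetical protein with the module "MKTAY" repeated three times.
protein = "MKTAYGGSMKTAYPPLMKTAY"
dots = dotplot(protein, protein, window=4)
print(render(protein, protein, dots))
```

The main diagonal is the sequence matching itself; the repeated modules show up as shorter lines parallel to it.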


The choice of scoring values (substitution matrix) is important
Scoring matrices appear in all analyses involving sequence comparison. The choice of matrix can strongly influence the outcome of the analysis. Scoring matrices implicitly represent a particular theory of evolution. Understanding the theory underlying a given scoring matrix can aid in making the proper choice.

Different principles for substitution matrices
– Identity matrix.
– Genetic code matrix: score based on the minimum number of base changes required to convert one amino acid into another.
– Physical/chemical characteristics: an attempt to quantify some physical or chemical attribute of the residues and arbitrarily assign weights based on similarities of the residues.
– Log odds matrices (e.g., PAM250, BLOSUM62): S_ij = log(q_ij / (p_i p_j)) is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance. q_ij is the frequency with which residues i and j are observed to align in sequences known to be related; these frequencies are derived from a "transition probability matrix." p_i and p_j are the frequencies of occurrence of residues i and j in the set of sequences.
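The log odds score can be illustrated with a short sketch. The frequencies below are made-up numbers chosen only to show the sign of the score, and the base-2 logarithm is an assumption (published matrices use their own scaling).

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log odds score: log of observed pair frequency over the chance expectation."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies, not taken from any real matrix.
# Pair seen twice as often as chance predicts: positive score (about +1).
favoured = log_odds(q_ij=0.02, p_i=0.1, p_j=0.1)
# Pair seen half as often as chance predicts: negative score (about -1).
disfavoured = log_odds(q_ij=0.005, p_i=0.1, p_j=0.1)
print(favoured, disfavoured)
```

A positive score means the pair is aligned more often in related sequences than chance alone would give; a negative score means less often.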

PAM matrices: how were they constructed by Margaret Dayhoff?
1. Align sequences that are at least 85% identical (to minimize ambiguity in alignments and the number of coincident mutations).
2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.
3. Count replacements "accepted" by natural selection in all pairwise comparisons (each A_ij is the number of times amino acid j was replaced by amino acid i in all comparisons).
4. Compute the amino acid mutability m_j, i.e., the propensity of a given amino acid, j, to be replaced.

PAM construction, continued
5. Combine the data from steps 3 and 4 to produce a Mutation Probability Matrix, M, for one PAM of evolutionary distance (1 PAM = one accepted point mutation per 100 residues), using Dayhoff's formulae.
6. Calculate the Log Odds Matrix for similarity scoring: divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of each residue. R is the Relatedness Odds Matrix and f_i is the frequency of residue i. The Log Odds Matrix, S_ij, is calculated from the relatedness odds matrix, R_ij, simply by taking the base-10 logarithm of each R_ij and multiplying by 10.
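For reference, the form in which Dayhoff's construction is usually stated is sketched below. It uses the symbols A_ij, m_j, f_i, R_ij and S_ij from the text; the proportionality constant λ is an assumption not named in the text, chosen so that the matrix corresponds to one accepted change per 100 residues.

```latex
M_{ij} = \frac{\lambda\, m_j\, A_{ij}}{\sum_{i \neq j} A_{ij}} \quad (i \neq j),
\qquad
M_{jj} = 1 - \lambda\, m_j,
\qquad
R_{ij} = \frac{M_{ij}}{f_i},
\qquad
S_{ij} = 10 \log_{10} R_{ij}
```

Here M_ij is read as the probability that amino acid j is replaced by amino acid i over one PAM.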

PAM 250 substitution matrix

Limitations of the PAM model
Assumptions in the PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.
Sources of error in the PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.

BLOSUM (Blocks Substitution Matrix) substitution matrices
1. The starting data are conserved blocks from the Blocks database: aligned, ungapped sequences of widely varying similarity. Measures are taken to avoid biasing the sample with frequently occurring, highly related sequences.
2. Replacements are counted straightforwardly over all pairs of aligned residues, giving f_ij. The observed frequency of each pair is q_ij = f_ij / (total number of residue pairs). This includes cases of i = j (i.e. no replacement observed). The expected frequency of each pair is essentially the product of the frequencies of each residue in the data set.

BLOSUM (Blocks Substitution Matrix) substitution matrices, continued
3. Similar sequences in a block above a threshold percent similarity are clustered, and members of the cluster count fractionally toward the final tally.
– Reduces the number of identical pairs (AA, SS, TT, etc., matches) in the final tallies.
– Somewhat analogous to increasing the PAM distance.
– If the clustering threshold is 80%, the final matrix is BLOSUM 80.
– Clustering at 62% reduces the number of blocks contributing to the table by 25%; still, 1.25 × 10^6 pairs contributed!
– The least frequent amino acid pair replacement was observed 2369 times!
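The pair counting in step 2 can be sketched as follows. This is a simplified illustration: the three-sequence block is invented, the clustering of step 3 is left out, and the factor 2 in the expected frequency of unlike pairs (because the pair i,j is unordered) is the usual convention rather than something stated in the text above.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Count all residue pairs f_ij column by column and normalise them to q_ij."""
    f = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            f[tuple(sorted((a, b)))] += 1  # unordered pair, so (A, S) == (S, A)
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}
    return f, q


def residue_frequencies(q):
    """Frequency p_i of each residue among all counted pairs."""
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    return p


def expected_frequency(p, a, b):
    """Chance expectation for the unordered pair (a, b)."""
    return p[a] * p[b] if a == b else 2 * p[a] * p[b]


# Hypothetical ungapped block of three aligned sequences.
block = ["AAS", "ASS", "AAT"]
f, q = pair_frequencies(block)
p = residue_frequencies(q)
print(f[("A", "A")], q[("A", "A")], expected_frequency(p, "A", "A"))
```

Comparing each q_ij with its expected frequency gives the odds ratio from which the log odds score is taken.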

BLOSUM 62

BLOSUM and PAM: a comparison

FASTA and BLAST: searching for related sequences in the databases
Searching the databases with a rigorous Smith-Waterman algorithm is resource-intensive (but possible). FASTA and BLAST give faster searches with lower resource use by taking shortcuts. Both pre-screen the sequences in the database so that only sequences that look interesting (appear to resemble the query sequence) are processed further.

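How FASTA works (ktup = 1)
Query sequence: s = HARFYAAQIVL
Database sequence: t = VDMAAQIA
Hash table (residue: positions in s):
A: 2, 6, 7
F: 4
H: 1
I: 9
L: 11
Q: 8
R: 3
V: 10
Y: 5
others: none
Offset vector: one entry per offset between s and t (–6, –5, –4, –3, –2, …)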
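The hash table and offset vector from the example above can be reproduced with a short sketch. It assumes that the offset of a match is the position in s minus the position in t (1-based) and that the offset vector simply counts matches per offset; the function names are illustrative and not taken from FASTA itself.

```python
from collections import Counter, defaultdict


def build_hash_table(s, ktup=1):
    """Map each k-tuple of the query to the 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        table[s[i:i + ktup]].append(i + 1)
    return table


def offset_vector(s, t, ktup=1):
    """Count k-tuple matches per offset (position in s minus position in t)."""
    table = build_hash_table(s, ktup)
    offsets = Counter()
    for j in range(len(t) - ktup + 1):
        for i in table.get(t[j:j + ktup], []):
            offsets[i - (j + 1)] += 1
    return offsets


s = "HARFYAAQIVL"
t = "VDMAAQIA"
table = build_hash_table(s)
offsets = offset_vector(s, t)
print(table["A"])              # [2, 6, 7]
print(offsets.most_common(1))  # [(2, 4)]: four matches share offset +2
```

The offset with the highest count marks the diagonal along which the two sequences line up best, here AAQI in both sequences.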

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996.

FASTA, continued
FASTA then joins two or more k-tuples if they do not lie too far apart; together these form a region, which can be viewed as a local alignment without gaps. The 5 best regions from the previous phase are then rescored using PAM120 or PAM250. This is the first measure of similarity between the query and the database sequence and is called the initial score in the results file. One such score is computed for every sequence in the database. The optimized score is then computed in the manner of Smith-Waterman, but restricted to cells in a band around the initial alignment.

FASTA: choice of k-tuple value
For DNA searches ktup is 4-6; for protein searches it is 1 or 2. The choice of ktup affects the result:
– A low ktup increases sensitivity, i.e. the ability to find distant relatives.
– A high ktup increases selectivity, i.e. the ability to reject false positives.

Variants of FASTA
PROGRAM      FUNCTION
fasta3       scans a protein or DNA sequence library for similar sequences
fastx/y3     compares a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames
tfastx/y3    compares a protein to a translated DNA data bank
fasts3       compares linked peptides to a protein databank
fastf3       compares mixed peptides to a protein databank

FASTA results

Parameters that indicate how good our database hits are
init1: score of the highest-scoring initial region
initn: sum of the initial scores of joined regions minus a joining penalty for each gap
opt: score of the optimal alignment of the region
Z: measure of how unusual the original match is. If score = S, Z = (S - mean)/sd
P: probability that the alignment is no better than random
E(n): expected number of sequences giving the same Z-score or better if the database is probed with a random sequence. E = P × (database size n)
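The two formulas above are simple enough to show as a sketch. The background scores, the P value and the database size below are invented numbers, and using the sample standard deviation is an assumption; FASTA's own statistics are more elaborate.

```python
import statistics


def z_score(score, background_scores):
    """Z = (S - mean) / sd, with mean and sd taken from the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (score - mean) / sd


def expect_value(p, database_size):
    """E = P * n: expected number of equally good hits from a random query."""
    return p * database_size


# Hypothetical values for illustration only.
z = z_score(50, [10, 20, 30])
e = expect_value(1e-6, 500_000)
print(z, e)
```

A high Z together with a small E indicates a hit that is unlikely to have arisen by chance.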

Assessing the results
Z-score > 5: significant
P : probably not a significant hit
E 1: random?

How BLAST (Basic Local Alignment Search Tool) works
– BLAST makes a list of all three-letter words (subsequences) in the query protein (for the sequence MEFGALLY.. these are MEF, EFG, FGA, GAL, etc.).
– Using BLOSUM62, it identifies for each of these words the words that give a score above a certain threshold (the neighborhood word score threshold), about 50 new words for each original word.
– Each sequence in the database is then scanned for exact matches to each of the 50 words for each position in the query sequence.
– The hits are then extended until the score begins to drop. The result is a longer aligned stretch of sequence called an HSP (high-scoring segment pair).
– HSPs with suitable placement are joined.
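The word list, the seeding and the extension can be sketched as below. This is a toy version under stated assumptions: it seeds on exact query words only (the BLOSUM62 neighborhood words are omitted), scores +1 for a match and -1 for a mismatch instead of using BLOSUM62, extends only to the right, and uses a hypothetical database sequence and drop-off parameter.

```python
def words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def seed_hits(query, subject, w=3):
    """(query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    hits = []
    for qi, word in enumerate(words(query, w)):
        for si in range(len(subject) - w + 1):
            if subject[si:si + w] == word:
                hits.append((qi, si))
    return hits


def extend_right(query, subject, q_end, s_end, score, match=1, mismatch=-1, drop=2):
    """Extend without gaps; stop once the score falls more than `drop` below the best."""
    best, best_len, running, k = score, 0, score, 0
    while q_end + k < len(query) and s_end + k < len(subject):
        running += match if query[q_end + k] == subject[s_end + k] else mismatch
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running > drop:
            break
    return best, best_len


query = "MEFGALLY"
subject = "KKEFGALQY"  # hypothetical database sequence
print(words(query))    # ['MEF', 'EFG', 'FGA', 'GAL', 'ALL', 'LLY']
hits = seed_hits(query, subject)
# Extend the first seed (EFG) to the right, starting just past the word.
qi, si = hits[0]
best, length = extend_right(query, subject, qi + 3, si + 3, score=3)
print(hits, best, length)
```

The extended seed corresponds to one HSP; a real implementation would also extend to the left and score with the substitution matrix.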

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996.

BLAST results

BLAST results, continued

Variants of BLAST
blastp: compares an amino acid query sequence against a protein sequence database.
blastn: compares a nucleotide query sequence against a nucleotide sequence database.
blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and CPU-intensive.
PSI-BLAST (Position Specific Iterated BLAST): uses an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search, and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.

The human genome

Horizontal gene transfer?

Probable vertebrate-specific acquisition of bacterial genes

But no…

But no, continued

Phylogenetic analysis

What went wrong?
"A different methodological reason for several of the genes in the human genome report being considered as bacteria-vertebrate HGTs, was that phylogenetics was not the analytical approach, and that the conclusions were instead derived largely from top BLAST hit results. In several instances the top BLAST hit was indeed a bacterial species, whereas further down the list of significant BLAST hits one finds a non-vertebrate eukaryote. When such sequences were properly aligned, the resulting phylogenetic trees often supported the monophyly of eukaryotes with the nonvertebrate eukaryote at the base."

ClustalW alignment

The conclusion
"Most of our analyses and phylogenetic topologies are highly consistent with the view that vertebrates and bacteria share these loci through common ancestry, involving a succession of non-vertebrate eukaryote intermediates. A further point arising from our analysis is that the evolutionary relationships among proteins cannot be concluded solely from the ranking of database hits in homology searches (for example, BLAST reports). This is not a new conceptual point (see refs 7, 12, 13), but one that seems to have been overlooked in this instance. Phylogenetic analysis must be a central component of any protein family or genome annotation effort. Importantly, phylogenetic reconstruction is critical to synthesizing, from the growing wealth of sequence data, a more comprehensive view of genome evolution."