Multiple sequence alignments



Scoring multiple sequence alignments
    MQPILLL
    MLR-LL-
    MK-ILLL
    MPPVLIL
One should perhaps use some form of log-odds score (formula not preserved), but the sum of pairs (SP score) is often used instead. For the fourth column of the alignment:
    SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
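The sum-of-pairs idea can be sketched in a few lines of Python. This is only an illustration: the pair scores used here (match +1, mismatch -1, residue against gap -2, gap against gap 0) are assumed values chosen for readability, not the substitution matrix or gap costs used on the slide.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy p(a, b): assumed illustrative values, not a real substitution matrix."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column_score(column, score=pair_score):
    """Sum p(a, b) over every unordered pair of symbols in one column."""
    return sum(score(a, b) for a, b in combinations(column, 2))

def sp_alignment_score(rows, score=pair_score):
    """SP score of a whole alignment: the sum of its column scores."""
    return sum(sp_column_score(column, score) for column in zip(*rows))

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
fourth_column = [row[3] for row in alignment]  # I, -, I, V
print(fourth_column, sp_column_score(fourth_column))
```

A column of four sequences always contributes six pair terms, which matches the six p(., .) terms written out for the column I, -, I, V.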

Multiple Sequence Alignments
Conceptually, there is no reason why the Needleman-Wunsch algorithm cannot be performed with more than two sequences. The matrix simply becomes multi-dimensional and the algorithm works successively through each dimension. There are, however, significant practical problems with this approach. Instead of growing as an N^2 problem, the computational time grows as N^m, where N is the sequence length and m is the number of sequences. Hence, even for just 100 nucleotides from 5 species, this is 100^5 = 10,000,000,000 operations, the equivalent of aligning two sequences each 100,000 nucleotides long. Obviously, different methods need to be employed. In general these require more assumptions and are neither as precise nor as "all-encompassing" as the Needleman-Wunsch or Smith-Waterman algorithms.
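The arithmetic behind this comparison is easy to check. The helper name below is hypothetical, and the count is simply the number of cells in the dynamic programming matrix, taken as a rough proxy for the number of operations.

```python
def dp_cells(length, n_sequences):
    """Cells in a full dynamic programming matrix for n_sequences of equal length."""
    return length ** n_sequences

five_species = dp_cells(100, 5)       # 100 nucleotides, 5 sequences
two_long = dp_cells(100_000, 2)       # two sequences of 100,000 nucleotides
print(five_species, two_long)
```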

The thinking behind ClustalW, the short version
Algorithm: CLUSTALW progressive alignment
(i) Construct a distance matrix of all N(N - 1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura.
(ii) Construct a guide tree by the neighbour-joining clustering algorithm of Saitou & Nei.
(iii) Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
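Step (i) can be sketched roughly as follows. This is a simplification under stated assumptions: the sequences are taken as already aligned (ClustalW itself aligns each pair by dynamic programming first), gapped positions are ignored, and the distance is Kimura's protein correction D = -ln(1 - p - 0.2 p^2), where p is the fraction of differing positions. The function names are hypothetical.

```python
import math
from itertools import combinations

def fractional_difference(a, b):
    """Fraction of differing residues, skipping positions where either row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def kimura_distance(p):
    """Kimura's correction for protein distances; only defined for p below about 0.85."""
    return -math.log(1.0 - p - 0.2 * p * p)

def distance_matrix(aligned):
    """Distances for all N(N - 1)/2 pairs, keyed by (name_i, name_j)."""
    return {
        (i, j): kimura_distance(fractional_difference(aligned[i], aligned[j]))
        for i, j in combinations(sorted(aligned), 2)
    }

toy = {"s1": "MQPILLL", "s2": "MLR-LL-", "s3": "MK-ILLL", "s4": "MPPVLIL"}
for pair, distance in distance_matrix(toy).items():
    print(pair, round(distance, 3))
```

The resulting matrix is what a neighbour-joining implementation would take as input in step (ii).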

ClustalW details
CLUSTALW is unabashedly ad hoc in its alignment construction and scoring stage. In addition to the usual methods of profile construction and alignment, various additional heuristics of CLUSTALW contribute to its accuracy:
- Sequences are weighted to compensate for biased representation in large subfamilies. The profile scoring function in CLUSTALW is fundamentally sum-of-pairs. As with Carrillo-Lipman, sequence weighting is important to compensate for the defects of the sum-of-pairs.
- The substitution matrix used to score an alignment is chosen on the basis of the similarity expected of the alignment; closely related sequences are aligned with 'hard' matrices (e.g. BLOSUM80), and distant sequences are aligned with 'soft' matrices (e.g. BLOSUM50).

ClustalW details, continued
- Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position. These penalties were obtained from gap frequencies observed in a large number of structurally based alignments. In general, hydrophobic residues (which are more likely to be buried) give higher gap penalties than hydrophilic or flexible residues (which are more likely to be surface-accessible).
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment.
- In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

Improvements in ClustalW
1. Individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and upweight the most divergent ones.
2. Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
3. Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
4. Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.

How ClustalW works
The alignments are done in four steps.
1. All pairwise similarity scores are calculated. This is done using rapid alignment methods.
2. A similarity matrix is created, and the sequences are clustered on the basis of this similarity using a clustering algorithm.
3. An alignment of clusters is created via a consensus method.
4. A progressive multiple alignment is created. This is performed by sequentially aligning groups of sequences, according to their branching order in the clustering.

Gap penalties that depend on the sequences
Before any two sequences or prealigned groups of sequences are aligned, we calculate initial values for the gap opening penalty (GOP) and gap extension penalty (GEP) as functions of the amino acid weight matrix to be used, the sequence (or alignment) lengths, and the divergence between the sequences. The values for GOP and GEP are set from a user-controlled menu (defaults are offered) and then modified as follows:
    GOP → A * B * {GOP + log[min(N, M)]}    (1)
where N and M are the lengths of the sequences to be aligned, A is the average value for a mismatch in the amino acid weight matrix, and B is the percent identity of the two sequences. The GEP is then modified using the following formula:
    GEP → GEP * [1.0 + |log(N/M)|]    (2)
where N and M are, again, the lengths of the two sequences.
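The two formulas translate directly into code. The sketch below assumes the natural logarithm (the slide does not state the base) and leaves the scaling of A and B to the caller; the function and parameter names are hypothetical.

```python
import math

def initial_gop(gop, avg_mismatch, identity, n, m):
    """GOP -> A * B * (GOP + log(min(N, M))); log base assumed to be e."""
    return avg_mismatch * identity * (gop + math.log(min(n, m)))

def initial_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N / M)|); unchanged when both lengths are equal."""
    return gep * (1.0 + abs(math.log(n / m)))

print(initial_gop(10.0, 1.0, 1.0, 120, 300))
print(initial_gep(0.2, 120, 300))
```

Because of the absolute value, the extension penalty grows with the length mismatch regardless of which sequence is the longer one.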

Modifying the gap penalty
Put simply, the rules are:
- Use a lower gap penalty at positions where gaps already occur
- Use a higher gap penalty near positions that already contain gaps
- Reduce the gap penalty where there are stretches of hydrophilic amino acids
- Adjust the gap penalty using tables of the observed gap frequency adjacent to each of the 20 amino acids

Gap penalties that depend on existing gaps
If there are gaps at a position in a group of prealigned sequences (this rule and the following one do not apply to single sequences), then the GOP is reduced in proportion to the number of sequences with a gap at that position and the GEP is lowered by one-half. The new GOP is calculated as
    GOP → GOP * 0.3 * (W/N)    (3)
where W is the number of sequences without a gap at the position and N is the number of sequences.
If a position contains no gaps but is within eight residues of an existing gap (this value of 8 can be changed from a menu), the GOP is increased as follows:
    GOP → GOP * {2 + [(8 - D) * 2]/8}    (4)
where D is the distance from the gap.
A run of five (this number can be changed from a menu) consecutive hydrophilic residues is considered to be a hydrophilic stretch. The residues that are considered to be hydrophilic are conservatively set to D, E, G, K, N, Q, P, R, and S by default but can be changed by the user. Any positions with no gaps that are spanned by such a stretch of residues get the GOP reduced by one-third.
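A minimal sketch of these three position-specific rules, each as its own function. How ClustalW combines them when more than one applies is not stated here, so they are deliberately kept separate. Formula (4) is coded as 2 + (8 - D) * 2 / 8, and treating the 8 as a `window` parameter is an assumption based on the remark that the value can be changed; all names are hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed above

def gop_at_gapped_position(gop, without_gap, n_sequences):
    """Rule (3): the position already holds gaps in some sequences."""
    return gop * 0.3 * (without_gap / n_sequences)

def gop_near_gap(gop, distance, window=8):
    """Rule (4): no gap here, but one lies within `window` residues."""
    return gop * (2 + ((window - distance) * 2) / window)

def gop_in_hydrophilic_stretch(gop):
    """Gap-free position inside a hydrophilic stretch: reduce GOP by one-third."""
    return gop * (2.0 / 3.0)

def hydrophilic_stretch_positions(sequence, run=5):
    """Indices covered by runs of at least `run` consecutive hydrophilic residues."""
    positions, start = set(), None
    for i, residue in enumerate(sequence + "#"):  # '#' is a sentinel that closes a final run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                positions.update(range(start, i))
            start = None
    return positions

print(gop_at_gapped_position(10.0, 2, 4), gop_near_gap(10.0, 1), gop_near_gap(10.0, 8))
print(sorted(hydrophilic_stretch_positions("AAADEGKNAAA")))
```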

ClustalW: gap penalties that depend on the neighbouring amino acid
These values are derived from the observed frequencies of gaps adjacent to each residue in alignments of sequences of known tertiary structure. The values were transformed from the published values such that the bigger the number, the less likely a gap is to occur adjacent to that residue. The numbers are then used as simple multiplication factors to modify gap opening penalties, normalized around a value of 1.0 for histidine.

Result of the different gap penalty adjustments over a stretch of sequence

Sequence weighting to keep very similar sequences from having too much influence

Correction for multiple mutations at the same site

Common errors in progressive alignments (figure panels: "Not so good" and "Better")

Another error that occurs

What can we get out of multiple sequence alignments? A) Consensus sequences
    F K L L S H C L L V
    F K A F G Q T M F Q
    Y P I V G Q E L L G
    F P V V K E A I L K
    F K V L A A V I A D
    L E F I S E C I I Q
    F K L L G N V L V C
    F K L L G Q V I L Q   (consensus)
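One simple way to derive a consensus is to take the most frequent residue in each column. The sketch below reports every residue tied for first place rather than picking one, since several columns of this alignment have ties and the slide does not say how they were resolved. The function name is hypothetical.

```python
from collections import Counter

ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]

def most_common_residues(rows):
    """For each column, the sorted list of residues sharing the highest count."""
    result = []
    for column in zip(*rows):
        counts = Counter(column)
        best = max(counts.values())
        result.append(sorted(r for r, c in counts.items() if c == best))
    return result

for position, residues in enumerate(most_common_residues(ALIGNMENT), start=1):
    print(position, "/".join(residues))
```

Every residue of the consensus F K L L G Q V I L Q is among the top candidates for its column; in the tied columns a different but equally frequent residue could have been chosen.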

What can we get out of multiple sequence alignments? B) A pattern
    F K L L S H C L L V
    F K A F G Q T M F Q
    Y P I V G Q E L L G
    F P V V K E A I L K
    F K V L A A V I A D
    L E F I S E C I I Q
    F K L L G N V L V C
Pattern: F-[KP]-[VL]-[VL]-[GS]-Q-V-[LI]-L-Q

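What can we get out of multiple sequence alignments? C) A profile
    F K L L S H C L L V
    F K A F G Q T M F Q
    Y P I V G Q E L L G
    F P V V K E A I L K
    F K V L A A V I A D
    L E F I S E C I I Q
    F K L L G N V L V C
Profile table with one row per amino acid (A C D E F G H I K L M N P Q R S T V W Y) and one column per alignment position; the table values are not preserved.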
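As a rough illustration of what such a table holds, the sketch below builds a profile of raw residue counts, with one row per amino acid and one column per alignment position. This is an assumed simplification: real profiles such as those in PROSITE store position-specific weights rather than plain counts. The names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ROWS = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]

def count_profile(rows):
    """Map each amino acid to its per-column count in the alignment."""
    length = len(rows[0])
    profile = {aa: [0] * length for aa in AMINO_ACIDS}
    for row in rows:
        for i, residue in enumerate(row):
            if residue in profile:  # gap characters, if any, are skipped
                profile[residue][i] += 1
    return profile

profile = count_profile(ROWS)
for aa in AMINO_ACIDS:
    print(aa, " ".join(str(n) for n in profile[aa]))
```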

What can we get out of multiple sequence alignments? D) A hidden Markov model (HMM)
For gene finding, the states will be exons, introns and possibly other desired sequence classes (5' and 3' UTRs, promoter regions, intergenic regions, repetitive DNA, etc.). The transition probabilities vary with the state (an intron can only be followed by an internal or terminal exon, etc.). The probability of a transition from exon to intron depends on the local sequence and is high only at plausible splice sites.
A simple example of an HMM: the (hidden) state influences the GC content of the sequence that is emitted.
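The GC-content example can be sketched as a two-state HMM decoded with the Viterbi algorithm. All probabilities below are made-up illustrative values, and the state names are hypothetical; the point is only to show how a hidden state that biases the emitted bases can be recovered from the sequence.

```python
import math

STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {  # assumed values: states tend to persist
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {  # assumed values: each state favours its own base pair
    "AT_rich": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "GC_rich": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}

def viterbi(seq):
    """Most probable state path for a non-empty DNA string, computed in log space."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + math.log(TRANS[r][s]))
            new_scores[s] = scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            pointers[s] = prev
        backpointers.append(pointers)
        scores = new_scores
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("AAAATTTTGGGGCCCCGGGG"))
```

A gene-finding HMM follows the same principle, with states for exons, introns and other sequence classes instead of the two composition states used here.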

HMM for multiple sequence alignment
For multiple sequence alignments, each column in the alignment corresponds to a state in the hidden Markov model.

Profile-HMM

Motif databases: PROSITE
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Example of a pattern entry
    ID   PPASE; PATTERN.
    AC   PS00387;
    DT   NOV-1990 (CREATED); NOV-1995 (DATA UPDATE); NOV-1995 (INFO UPDATE).
    DE   Inorganic pyrophosphatase signature.
    PA   D-[SGN]-D-[PE]-[LIVM]-D-[LIVMGC].
    NR   /RELEASE=32,49340;
    NR   /TOTAL=16(16); /POSITIVE=11(11); /UNKNOWN=0(0); /FALSE_POS=5(5);
    NR   /FALSE_NEG=0; /PARTIAL=2;
    CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=1;
    CC   /SITE=1,magnesium; /SITE=3,magnesium; /SITE=6,magnesium;
    DR   P21216, IPYR_ARATH, T; P37980, IPYR_BOVIN, T; P17288, IPYR_ECOLI, T;
    DR   P44529, IPYR_HAEIN, T; P13998, IPYR_KLULA, T; P19117, IPYR_SCHPO, T;
    DR   P37981, IPYR_THEAC, T; P19514, IPYR_THEP3, T; P38576, IPYR_THETH, T;
    DR   P00817, IPYR_YEAST, T; P28239, IPY2_YEAST, T;
    DR   P19371, IPYR_DESVH, P; P21616, IPYR_PHAAU, P;
    DR   P09167, AERA_AERHY, F; P12351, CYP1_YEAST, F; P24653, Y101_NPVOP, F;
    DR   P37904, YCEI_ECOLI, F; P39303, YJFU_ECOLI, F;
    3D   1PYP;
    DO   PDOC00325;
    //

The PA line that defines the pattern
2.3.5) The PA line
The PA (PAttern) lines contain the definition of a PROSITE pattern. The patterns are described using the following conventions:
- The standard IUPAC one-letter codes for the amino acids are used.
- The symbol `x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
- Each element in a pattern is separated from its neighbor by a `-'.
- Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
- When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol.
- A period ends the pattern.

Examples of PROSITE patterns
    PA   [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
    PA   <A-x-[ST](2)-x(0,1)-V.
This pattern, which must be in the N-terminal of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
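Because the pattern conventions map closely onto regular expressions, a PA line can be converted mechanically. The sketch below is an illustrative converter, not an official PROSITE tool: it covers only the conventions listed above (x, [ ], { }, repeat counts, `<', `>', and the final period), expects the pattern without its leading "PA" line code, and uses a hypothetical function name.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    body = pattern.strip().rstrip(".")
    parts = []
    for element in body.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminal
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminal
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                    # any amino acid
            core = "."
        elif core.startswith("{"):         # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

for pa in ("[AC]-x-V-x(4)-{ED}.", "<A-x-[ST](2)-x(0,1)-V.", "D-[SGN]-D-[PE]-[LIVM]-D-[LIVMGC]."):
    print(pa, "->", prosite_to_regex(pa))
```

The resulting string can be passed to `re.search` to scan a protein sequence; the test sequences one might try with it are, of course, up to the user.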

A small sequence alignment and the corresponding PROSITE profile
    F K L L S H C L L V
    F K A F G Q T M F Q
    Y P I V G Q E L L G
    F P V V K E A I L K
    F K V L A A V I A D
    L E F I S E C I I Q
    F K L L G N V L V C
Profile table with one row per amino acid (A C D E F G H I K L M N P Q R S T V W Y); the table values are not preserved.

Motif databases
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a composite of SWISS-PROT + SP-TrEMBL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours.
ProDom consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. Large families are much better processed with this new procedure than with the former DOMAINER program.

InterPro: an integrated motif database