
1 Multiple sequence alignments

2 Scoring multiple sequence alignments
MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL
Perhaps some form of log-odds score should have been used: Log [...]
But the sum of pairs (SP score) is often used instead:
SP score (I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
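
The sum-of-pairs calculation for one column can be sketched in a few lines of Python. The pair scoring function `pair_score` and its values (match +1, mismatch -1, gap -2) are assumptions made for illustration only; a real program would look up p(a, b) in a substitution matrix such as BLOSUM.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for p(a, b); the values are illustrative assumptions."""
    if a == "-" and b == "-":
        return 0                      # two gaps contribute nothing
    if a == "-" or b == "-":
        return gap                    # residue aligned against a gap
    return match if a == b else mismatch

def sp_score(column):
    """Sum the pair scores over all unordered pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]

# zip(*alignment) walks through the alignment column by column.
column_scores = [sp_score(column) for column in zip(*alignment)]
print(column_scores[3])   # fourth column: (I, -, I, V)
```

With four sequences, each column contributes six pair terms, which are the six terms written out for the column (I, -, I, V).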

3 Multiple Sequence Alignments
Conceptually, there is no reason why the Needleman-Wunsch algorithm cannot be applied to more than two sequences. The matrix simply becomes multi-dimensional, and the algorithm works successively through each dimension. There are, however, significant practical problems with this approach. Instead of growing as $N^2$, the computational time grows as $N^m$, where N is the sequence length and m is the number of sequences. Hence, even for just 100 nucleotides from 5 species, this is $100^5 = 10^{10}$ (10,000,000,000) operations, the equivalent of aligning two sequences each 100,000 nucleotides long. Clearly, different methods need to be employed. In general these require more assumptions and are neither as precise nor as "all-encompassing" as the Needleman-Wunsch or Smith-Waterman algorithms.
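
The arithmetic behind this comparison is easy to check. The helper name `dp_cells` is hypothetical; it just counts the cells of an m-dimensional dynamic programming matrix for sequences of equal length.

```python
def dp_cells(length, num_sequences):
    """Number of cells in a full dynamic programming matrix (N^m)."""
    return length ** num_sequences

five_species = dp_cells(100, 5)        # 100 nucleotides, 5 sequences
two_long = dp_cells(100_000, 2)        # two sequences of 100,000 nucleotides

print(five_species, two_long, five_species == two_long)
```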

4 The thinking behind ClustalW, the short version
Algorithm: CLUSTALW progressive alignment
(i) Construct a distance matrix of all N(N - 1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura.
(ii) Construct a guide tree with the neighbour-joining clustering algorithm of Saitou & Nei.
(iii) Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
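
Step (i) can be illustrated with a small sketch. It assumes the commonly cited approximation from Kimura's model, d = -ln(1 - D - D^2/5), where D is the fraction of differing residues in a pairwise alignment; whether a given ClustalW version applies exactly this correction when building the guide tree is not stated on the slide. The function names are hypothetical.

```python
import math

def number_of_pairs(n):
    """Number of pairwise alignments needed for n sequences: N(N - 1)/2."""
    return n * (n - 1) // 2

def kimura_distance(fraction_different):
    """Approximate evolutionary distance from the observed fraction D of
    differing residues, correcting for multiple substitutions per site."""
    d = fraction_different
    argument = 1.0 - d - 0.2 * d * d
    if argument <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(argument)

print(number_of_pairs(7))          # pairs for seven sequences
print(kimura_distance(0.5))        # corrected distance exceeds 0.5
```

The corrected distance grows faster than the observed difference, reflecting that some sites have mutated more than once.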

5 ClustalW details
CLUSTALW is unabashedly ad hoc in its alignment construction and scoring stage. In addition to the usual methods of profile construction and alignment, various additional heuristics of CLUSTALW contribute to its accuracy:
- Sequences are weighted to compensate for biased representation in large subfamilies. The profile scoring function in CLUSTALW is fundamentally sum-of-pairs. As with Carrillo-Lipman, sequence weighting is important to compensate for the defects of the sum-of-pairs.
- The substitution matrix used to score an alignment is chosen on the basis of the similarity expected of the alignment; closely related sequences are aligned with 'hard' matrices (e.g. BLOSUM80), and distant sequences are aligned with 'soft' matrices (e.g. BLOSUM50).

6 ClustalW details, continued
- Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position. These penalties were obtained from gap frequencies observed in a large number of structurally based alignments. In general, hydrophobic residues (which are more likely to be buried) give higher gap penalties than hydrophilic or flexible residues (which are more likely to be surface-accessible).
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment.
- In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

7 Improvements in ClustalW
1. Individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and upweight the most divergent ones.
2. Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
3. Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
4. Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.

8 How ClustalW works
The alignments are done in four steps.
1. All pairwise similarity scores are calculated. This is done using rapid alignment methods.
2. A similarity matrix is created, and the sequences are then clustered on the basis of this similarity using a clustering algorithm.
3. An alignment of clusters is created via a consensus method.
4. A progressive multiple alignment is created. This is performed by sequentially aligning groups of sequences, according to their branching order in the clustering.

9 Gap penalties that depend on the sequences
Before any two sequences or prealigned groups of sequences are aligned, we calculate initial values for the GOP (gap opening penalty) and GEP (gap extension penalty) as functions of the amino acid weight matrix to be used, the sequence (or alignment) lengths, and the divergence between the sequences. The values for GOP and GEP are set from a user-controlled menu (defaults are offered) and then modified as follows:
GOP → A * B * {GOP + log[min(N, M)]} (1)
where N and M are the lengths of the sequences to be aligned, A is the average value for a mismatch in the amino acid weight matrix, and B is the percent identity of the two sequences. The GEP is then modified using the following formula:
GEP → GEP * [1.0 + |log(N/M)|] (2)
where N and M are, again, the lengths of the two sequences.
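
The two formulas can be written out as a short function. This is only a sketch: the slide does not state the base of the logarithm or how the percent identity B is scaled, so the natural logarithm and a caller-supplied scaling factor are assumptions here, and the function name and the numbers in the example are hypothetical.

```python
import math

def initial_gap_penalties(gop, gep, n, m, avg_mismatch, identity_factor):
    """Apply the length- and divergence-dependent adjustments.

    gop, gep         user-chosen starting penalties
    n, m             lengths of the two sequences (or alignments)
    avg_mismatch     A, average mismatch value of the weight matrix
    identity_factor  B, percent identity (scaling is an assumption)
    """
    new_gop = avg_mismatch * identity_factor * (gop + math.log(min(n, m)))
    new_gep = gep * (1.0 + abs(math.log(n / m)))
    return new_gop, new_gep

# Equal lengths leave the GEP unchanged; unequal lengths raise it.
print(initial_gap_penalties(10.0, 0.2, 100, 100, 1.0, 1.0))
print(initial_gap_penalties(10.0, 0.2, 100, 300, 1.0, 1.0))
```

The absolute value in the GEP formula makes the adjustment symmetric: it does not matter which of the two sequences is the longer one.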

10 Modifying the gap penalty
Put simply, the rules are:
- Use a lower gap penalty at positions where gaps already occur
- Use a higher gap penalty near positions that already contain gaps
- Reduce the gap penalty where there are stretches of hydrophilic amino acids
- Adjust the gap penalty using tables of the observed gap frequency adjacent to each of the 20 amino acids

11 Gap penalties that depend on existing gaps
If there are gaps at a position in a group of prealigned sequences (this rule and the following one do not apply to single sequences), then the GOP is reduced in proportion to the number of sequences with a gap at that position and the GEP is lowered by one-half. The new GOP is calculated as
GOP → GOP * 0.3 * (W/N) (3)
where W is the number of sequences without a gap at the position and N is the number of sequences.
If a position contains no gaps but is within eight residues of an existing gap (this value of 8 can be changed from a menu), the GOP is increased as follows:
GOP → GOP * {2 + [(8 - D) * 2]/8} (4)
where D is the distance from the gap.
A run of five (this number can be changed from a menu) consecutive hydrophilic residues is considered to be a hydrophilic stretch. The residues that are considered to be hydrophilic are conservatively set to D, E, G, K, N, Q, P, R, and S by default but can be changed by the user. Any positions with no gaps that are spanned by such a stretch of residues get the GOP reduced by one-third.
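
A minimal sketch of rule (3), the halving of the GEP, and the hydrophilic-stretch rule follows. It works on a single sequence for simplicity, whereas ClustalW applies the rules to columns of a prealigned group; the function names are hypothetical, and the distance-dependent rule (4) is left out.

```python
HYDROPHILIC = set("DEGKNQPRS")   # default hydrophilic residues from the slide

def gop_at_gapped_position(gop, seqs_without_gap, num_seqs):
    """Rule (3): lower the GOP where some sequences already have a gap."""
    return gop * 0.3 * (seqs_without_gap / num_seqs)

def gep_at_gapped_position(gep):
    """The GEP is lowered by one-half at positions that contain gaps."""
    return gep / 2

def hydrophilic_positions(sequence, run_length=5):
    """Indices covered by a run of at least run_length hydrophilic residues."""
    covered = set()
    start = None
    for i, residue in enumerate(sequence + "#"):   # sentinel closes a final run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                covered.update(range(start, i))
            start = None
    return covered

def gop_in_hydrophilic_stretch(gop):
    """Reduce the GOP by one-third inside a hydrophilic stretch."""
    return gop * 2 / 3

print(gop_at_gapped_position(10.0, 3, 4))
print(sorted(hydrophilic_positions("AAADEGKNAAA")))
```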

12 ClustalW: gap penalties that depend on the neighbouring amino acid
These values are derived from the observed frequencies of gaps adjacent to each residue in alignments of sequences of known tertiary structure. The values were transformed from the published values such that the bigger the number, the less likely a gap is to occur adjacent to that residue. The numbers are then used as simple multiplication factors to modify gap opening penalties, normalized around a value of 1.0 for histidine.

13 Result of different gap penalty adjustments over a stretch of sequence

14 Sequence weighting to prevent very similar sequences from having too much influence

15 Correction for multiple mutations at the same site

16 Common errors in progressive alignments (figure panels: "Not so good", "Better")

17 Another error that occurs

18 What can we get out of multiple sequence alignments? A) Consensus sequences
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
F K L L G Q V I L Q   Consensus
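
A simple majority-rule consensus can be sketched as follows. The function name is hypothetical, and the slide does not say how ties are broken, so the sketch reports every residue that shares the highest count in a column.

```python
from collections import Counter

ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]

def column_majorities(alignment):
    """For each column, return the most frequent residue(s), sorted."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        top = max(counts.values())
        tied = sorted(res for res, count in counts.items() if count == top)
        result.append("".join(tied))
    return result

print(column_majorities(ALIGNMENT))
# ['F', 'K', 'LV', 'L', 'G', 'EQ', 'CV', 'IL', 'L', 'Q']
```

Four columns are tied; the consensus F K L L G Q V I L Q picks one of the tied residues in each of them.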

19 What can we get out of multiple sequence alignments? B) A pattern
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
Pattern: F-[KP]-[VL]-[VL]-[GS]-Q-V-[LI]-L-Q

20 What can we get out of multiple sequence alignments? C) A profile
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
A C D E F G H I K L M N P Q R S T V W Y
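
The simplest kind of profile is a table of residue frequencies per alignment column, with one entry for each of the 20 amino acids. The sketch below builds such a table; it is an illustration under that assumption and not the scoring scheme of any particular profile database, which would normally also use log-odds scores and pseudocounts.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]

def frequency_profile(alignment):
    """Return one dict per column mapping each amino acid to its frequency."""
    num_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: counts[aa] / num_seqs for aa in AMINO_ACIDS})
    return profile

profile = frequency_profile(ALIGNMENT)
print(profile[0]["F"])   # frequency of F in the first column
```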

21 What can we get out of multiple sequence alignments? D) A hidden Markov model (HMM)
For gene finding, the states will be exons, introns and possibly other desired sequence classes (5' and 3' UTRs, promoter regions, intergenic regions, repetitive DNA, etc.). The transition probabilities will vary with the state (an intron can only be followed by an internal or terminal exon, etc.). The probability of a transition from exon to intron depends on the local sequence and is high only at plausible splice sites.
A simple example of an HMM: the (hidden) state influences the GC content of the emitted sequence.
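
The GC-content example can be made concrete with a two-state HMM and the Viterbi algorithm, which recovers the most probable hidden state path for an observed sequence. All probabilities below are invented for illustration, and the state names are hypothetical; the sequence must be non-empty and consist of A, C, G and T.

```python
import math

STATES = ("GC_rich", "AT_rich")
START = {"GC_rich": 0.5, "AT_rich": 0.5}
TRANS = {
    "GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
    "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9},
}
EMIT = {
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(sequence):
    """Most probable hidden state path, computed in log space."""
    best = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]])
            for s in STATES}
    backpointers = []
    for symbol in sequence[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][symbol]))
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCATATAT"))
```

A gene finder works on the same principle, with exon, intron and other sequence classes as states and with transition probabilities that depend on the local sequence.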

22 HMMs for multiple sequence alignment
For multiple sequence alignments, each column in the alignment corresponds to a state in the hidden Markov model.

23 Profile-HMM

24 Motif databases: PROSITE
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

25 Example of a pattern entry
ID   PPASE; PATTERN.
AC   PS00387;
DT   NOV-1990 (CREATED); NOV-1995 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Inorganic pyrophosphatase signature.
PA   D-[SGN]-D-[PE]-[LIVM]-D-[LIVMGC].
NR   /RELEASE=32,49340;
NR   /TOTAL=16(16); /POSITIVE=11(11); /UNKNOWN=0(0); /FALSE_POS=5(5);
NR   /FALSE_NEG=0; /PARTIAL=2;
CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=1;
CC   /SITE=1,magnesium; /SITE=3,magnesium; /SITE=6,magnesium;
DR   P21216, IPYR_ARATH, T; P37980, IPYR_BOVIN, T; P17288, IPYR_ECOLI, T;
DR   P44529, IPYR_HAEIN, T; P13998, IPYR_KLULA, T; P19117, IPYR_SCHPO, T;
DR   P37981, IPYR_THEAC, T; P19514, IPYR_THEP3, T; P38576, IPYR_THETH, T;
DR   P00817, IPYR_YEAST, T; P28239, IPY2_YEAST, T;
DR   P19371, IPYR_DESVH, P; P21616, IPYR_PHAAU, P;
DR   P09167, AERA_AERHY, F; P12351, CYP1_YEAST, F; P24653, Y101_NPVOP, F;
DR   P37904, YCEI_ECOLI, F; P39303, YJFU_ECOLI, F;
3D   1PYP;
DO   PDOC00325;
//

26 The PA line, which defines the pattern
2.3.5) The PA line
The PA (PAttern) lines contain the definition of a PROSITE pattern. The patterns are described using the following conventions:
- The standard IUPAC one-letter codes for the amino acids are used.
- The symbol `x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position, between square parentheses `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
- Each element in a pattern is separated from its neighbor by a `-'.
- Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
- When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol.
- A period ends the pattern.

27 Examples of PROSITE patterns
Examples:
PA [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
PA
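
Because the PA conventions map closely onto regular expressions, a pattern can be translated mechanically. The converter below is a minimal sketch covering only the conventions listed for the PA line; the function name `prosite_to_regex` and the test sequences are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE PA pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")        # a period ends the pattern
    parts = []
    for element in pattern.split("-"):           # elements are separated by '-'
        at_start = element.startswith("<")       # N-terminal anchor
        at_end = element.endswith(">")           # C-terminal anchor
        element = element.strip("<>")
        # Split off an optional repetition such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                           # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # residues that are not accepted
        # [ALT]-style classes and single letters are already valid regex.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)                                     # [AC].V.{4}[^ED]
print(bool(re.search(regex, "MAKVLLLLGT")))      # True
```

The same function turns the inorganic pyrophosphatase signature D-[SGN]-D-[PE]-[LIVM]-D-[LIVMGC]. into `D[SGN]D[PE][LIVM]D[LIVMGC]`.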

28 A small sequence alignment and the corresponding PROSITE profile
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
A C D E F G H I K L M N P Q R S T V W Y

29 Motif databases
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a composite of SWISS-PROT + SP-TrEMBL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours.
ProDom consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. Large families are much better processed with this new procedure than with the former DOMAINER program.

30 InterPro: An integrated motif database

