The Oslo-Bergen Tagger OBT+stat: a short presentation. André Lynum, Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad.


Morphosyntactic tagger and lemmatizer for Bokmål and Nynorsk. Based on a lexicon and linguistic rules. Statistical disambiguation for completely unambiguous output (currently Bokmål only).

Purpose: annotation for linguistic research (e.g. The Oslo Corpus); large-scale corpus annotation (e.g. NoWaC, in progress).

Applications: grammar checker in Microsoft Word and others; open source and commercial translation systems (Apertium, NyNo, Kaldera); commercial content management systems (TextUrgy).

Resources Lexicon based on Norsk ordbank Bokmål: entries Nynorsk: entries

Resources: hand-made Constraint Grammar rules. Bokmål: 2214 morphological rules. Nynorsk: 3849 morphological rules.

Resources: development and test corpora. Training/development corpus: approx. 120,000 words each for Bokmål and Nynorsk. Test/evaluation corpus: approx. 30,000 words each for Bokmål and Nynorsk.

Resources: dependency syntax for both Bokmål and Nynorsk.

Technology: multitagger (Common Lisp); CG disambiguator (VislCG3, C++); statistical disambiguator (Ruby, HunPos).

Pipeline

Results: competitive results on varied domains.

Multitagger: sophisticated tokenizer, morphological analyzer and compound word analyzer (guesser). Enumerates all possible tags and lemmas. Tags are composed of detailed morphosyntactic information.

Multitagger output
Dette
    "dette" verb inf i2 pa4
    "dette" pron nøyt ent pers 3
    "dette" det dem nøyt ent
er
    "være" verb pres a5 pr1 pr2
en
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
testsetning
    "testsetning" subst appell fem ub ent samset
    "testsetning" subst appell mask ub ent samset
.
    "$." clb

Multitagger output
en
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
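
To make the idea of a cohort concrete (one token plus every candidate lemma and tag sequence), here is a minimal Python sketch that reads such output into a small data structure. It assumes a plain layout in which each token sits on its own line and its readings follow on indented lines; the names `Reading`, `Cohort` and `parse_cohorts` are hypothetical and not part of OBT.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    lemma: str
    tags: tuple


@dataclass
class Cohort:
    token: str
    readings: list = field(default_factory=list)


def parse_cohorts(text):
    """Parse token lines and indented reading lines into Cohort objects."""
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Reading line: a quoted lemma followed by space-separated tags.
            lemma, _, rest = line.strip()[1:].partition('"')
            cohorts[-1].readings.append(Reading(lemma, tuple(rest.split())))
        else:
            # Unindented line: a new token starts a new cohort.
            cohorts.append(Cohort(line.strip()))
    return cohorts


SAMPLE = '''en
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
'''

cohorts = parse_cohorts(SAMPLE)
for reading in cohorts[0].readings:
    print(cohorts[0].token, reading.lemma, reading.tags)
```

The token "en" ends up with four readings spread over two lemmas ("en" and "ene"), which is the ambiguity the later stages have to reduce.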

CG disambiguator: based on detailed Constraint Grammar rulesets for Bokmål and Nynorsk. Rules are compatible with the state-of-the-art VislCG3 disambiguator. Efficiently disambiguates multitagger cohorts with high precision. Leaves some ambiguity by design.

CG Rules
#:2553
SELECT:2553 (subst mask ent) IF
    (NOT 0 farlige-mask-subst)
    (NOT 0 fv)
    (NOT 0 adj)
    (NOT -1 komma/konj)
    (**-1C mask-det LINK NOT 0 nr2-det LINK NOT *1 ikke-adv-adj)
;
# "en vidunderlig vakker sommerfugl"
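
As a rough illustration of what SELECT and REMOVE rules do to a cohort, the following Python sketch applies a heavily simplified stand-in for rule 2553. It is not how VislCG3 works internally: the real rule has five contextual tests and relies on sets such as `farlige-mask-subst` defined elsewhere in the grammar, whereas this toy version uses a single hypothetical condition (the preceding cohort is unambiguously a masculine determiner). The functions `select` and `remove` are made up for the example.

```python
def select(readings, target, condition=True):
    """Keep only readings that carry every tag in `target`.

    If the condition fails or nothing matches, the cohort is left unchanged.
    """
    if not condition:
        return readings
    matching = [r for r in readings if target <= r["tags"]]
    return matching or readings


def remove(readings, target, condition=True):
    """Drop readings that carry every tag in `target`.

    The last remaining reading is never removed.
    """
    if not condition:
        return readings
    remaining = [r for r in readings if not target <= r["tags"]]
    return remaining or readings


# Cohort for "en", assumed here to be already reduced to one reading.
previous = [{"lemma": "en", "tags": {"det", "mask", "ent", "kvant"}}]

# Cohort for "testsetning": masculine and feminine readings compete.
testsetning = [
    {"lemma": "testsetning", "tags": {"subst", "appell", "mask", "ub", "ent", "samset"}},
    {"lemma": "testsetning", "tags": {"subst", "appell", "fem", "ub", "ent", "samset"}},
]

# Toy context test: every reading of the previous cohort is a masculine determiner.
masc_det_before = all({"det", "mask"} <= r["tags"] for r in previous)

result = select(testsetning, {"subst", "mask", "ent"}, masc_det_before)
print(result)
```

With the masculine determiner in place, only the masculine noun reading of "testsetning" survives, which is the effect of SELECT:2553 in the example sentence.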

Example output
Dette
    "dette" pron nøyt ent pers 3 SELECT:2607
    ; "dette" verb inf i2 pa4 SELECT:2607
    ; "dette" det dem nøyt ent SELECT:2607
er
    "være" verb pres a5 pr1 pr2
en
    "en" det mask ent kvant SELECT:2762
    ; "en" adv REMOVE:3689
    ; "en" pron ent pers hum SELECT:2762
    ; "ene" verb imp tr1 SELECT:2762
testsetning
    "testsetning" subst appell mask ub ent samset SELECT:2553
    ; "testsetning" subst appell fem ub ent samset SELECT:2553
.
    "$." clb

Example of ambiguity left unresolved
Setninger
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl
kan
    "kunne" verb pres tr1 tr3
være
    "være" verb inf tr5
    "være" verb inf a5 pr1 pr2
    ; "være" subst appell nøyt ubøy REMOVE:3123
vanskelige
    "vanskelig" adj fl pos
    ; "vanskelig" adj be ent pos REMOVE:2318
.
    "$." clb

Example of ambiguity left unresolved
Setninger
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl

Example of unresolved ambiguity
Det
    "det" pron nøyt ent pers 3 SELECT:2607
    ; "det" det dem nøyt ent SELECT:2607
dreier
    "dreie" verb pres tr1 i2 tr11 SELECT:2467
    ; "drei" subst appell mask ub fl SELECT:2467
    ; "dreier" subst appell mask ub ent SELECT:2467
seg
    "seg" pron akk refl SELECT:3333
    ; "sige" verb pret i2 a3 pa4 SELECT:3333
om
    "om" prep SELECT:2653
    ; "om" sbu SELECT:2653
åndsverk
    "åndsverk" subst appell nøyt ub fl
    "åndsverk" subst appell nøyt ub ent
.
    "$." clb

Example of unresolved ambiguity
åndsverk
    "åndsverk" subst appell nøyt ub fl
    "åndsverk" subst appell nøyt ub ent

Example of lemma ambiguity
Det
    "Det" subst prop
gamle
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064
    ; "gammel" adj fl pos SELECT:3064
    ; "gammal" adj fl pos SELECT:3064
testamentet
    "testament" subst appell nøyt be ent
    "testamente" subst appell nøyt be ent
.

Example of lemma ambiguity
gamle
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064

Example of lemma ambiguity
Oslo
    "Oslo" subst prop
er
    "være" verb pres a5 pr1 pr2
byen
    "bye" subst appell mask be ent
    "by" subst appell mask be ent
vår
    "vår" det mask ent poss SELECT:2689
    ; "vår" det fem ent poss SELECT:2689
    ; "vår" subst appell mask ub ent SELECT:2689
.
    "$." clb

Example of lemma ambiguity
byen
    "bye" subst appell mask be ent
    "by" subst appell mask be ent

Example of unwanted ambiguity
Livet på jorden har tilpasset seg og tildels utnyttet de skiftende forhold. ("Life on Earth has adapted to, and in part exploited, the changing conditions.")

Example of unwanted ambiguity
og
    "og" konj
    "og" konj clb
    ; "og" adv REMOVE:2227
til dels
    "til dels" adv
utnyttet
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1
    ; "utnytte" adj nøyt ub ent tr1 REMOVE:2274
    ; "utnytte" adj ub m/f ent tr1 REMOVE:2274
de
    "de" det dem fl SELECT:2780
    ; "de" pron fl pers 3 nom SELECT:2780
skiftende
    "skifte" adj tr1 i1 i2 tr11 pa1 pa2 pa5 tr13
forhold

Example of unwanted ambiguity
utnyttet
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1

Statistical disambiguator: uses a statistical model to fully disambiguate. Simple model based on existing resources. Must discriminate between the ambiguities left by the CG disambiguator.

Earlier ambiguities - now resolved
Setninger
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl

Earlier ambiguities - now resolved
om
    "om" prep
    "om" sbu
åndsverk
    "åndsverk" subst appell nøyt ub fl
    "åndsverk" subst appell nøyt ub ent

Earlier ambiguities - now resolved
gamle
    "gammel" adj be ent pos
    "gammal" adj be ent pos
    "gammel" adj fl pos
    "gammal" adj fl pos

Earlier ambiguities - now resolved
byen
    "bye" subst appell mask be ent
    "by" subst appell mask be ent

Statistical disambiguation process: the statistical tagger is run independently of the CG disambiguator. The two outputs are then aligned. The statistical tagger's result is used to select among the readings the CG disambiguator left ambiguous. Simple lemma disambiguation.
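
The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two outputs are taken to be aligned token by token, each remaining CG reading is a `(lemma, tag string)` pair, and the statistical tagger proposes one tag string per token. The function names and the proposed tags are hypothetical, not taken from OBT+stat.

```python
def narrow_by_tag(readings, hmm_tag):
    """Return (candidates, covered).

    `covered` is True when the statistical tag matches at least one reading
    left by the CG disambiguator; otherwise all readings are kept.
    """
    matches = [r for r in readings if r[1] == hmm_tag]
    if matches:
        return matches, True
    return list(readings), False


def disambiguate(cohorts, hmm_tags):
    """Apply the statistical tags to token-aligned cohorts."""
    if len(cohorts) != len(hmm_tags):
        raise ValueError("outputs are not aligned token by token")
    return [narrow_by_tag(readings, tag)
            for (_token, readings), tag in zip(cohorts, hmm_tags)]


COHORTS = [
    ("Setninger", [("setning", "subst appell fem ub fl"),
                   ("setning", "subst appell mask ub fl")]),
    ("kan", [("kunne", "verb pres tr1 tr3")]),
]
# Hypothetical tagger output, one tag string per token.
HMM_TAGS = ["subst appell fem ub fl", "verb pres tr1 tr3"]

result = disambiguate(COHORTS, HMM_TAGS)
print(result)
```

When the proposed tag matches none of the remaining readings, the cohort stays ambiguous and has to be settled by a later step; several matching readings that differ only in lemma are likewise left for lemma disambiguation.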

HMM modelling: robust performance on smaller amounts of training data; good unknown word handling; cheap and mature.

Our HMM model: trained on words in 8178 sentences; variety of domains; more than 350 distinct tags; not very good accuracy really.

HMM model integration: ambiguities in ca. 4.5% of tokens; coverage ca. 80%.

Lemma disambiguation: mainly resolved by tag disambiguation, but some lemmas are still ambiguous.

Using word form frequencies. Idea: lemmas occur as word forms in large corpora. Use word frequencies from NoWaC to disambiguate among lemmas.
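
A minimal Python sketch of this heuristic: among competing lemmas, prefer the one that is most frequent as a word form in a large corpus. The counts below are invented for illustration and are not NoWaC figures; `pick_lemma` is a hypothetical helper.

```python
def pick_lemma(lemmas, wordform_freq):
    """Choose the candidate lemma that is most frequent as a word form.

    Lemmas missing from the frequency table count as zero; ties go to
    the first candidate.
    """
    return max(lemmas, key=lambda lemma: wordform_freq.get(lemma, 0))


# Invented counts, for illustration only.
WORDFORM_FREQ = {"by": 1000, "bye": 3, "gammel": 500, "gammal": 40}

print(pick_lemma(["bye", "by"], WORDFORM_FREQ))
print(pick_lemma(["gammel", "gammal"], WORDFORM_FREQ))
```

For "byen", the two candidate lemmas "bye" and "by" carry identical tags, so tag disambiguation cannot separate them; a frequency lookup of this kind can.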

Remaining ambiguities: randomly selected.

Expectations: cheap and cheerful modeling, facing a variety of hard disambiguation decisions, on a large morphosyntactic tagset, evaluated on a slightly eclectic corpus.

Results, CG disambiguation: precision 96.03%, recall 99.02%, F-score 97.2%.
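
For reference, the balanced F-score is the harmonic mean of precision and recall. The short Python check below assumes that this balanced (F1) definition is the one intended; the slide does not say which variant was used.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f = f_score(96.03, 99.02)
print(round(f, 2))
```

Under that assumption the reported precision and recall give roughly 97.5, slightly above the 97.2% on the slide, so the slide's figure may come from a different weighting or from rounding at an earlier step.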

Results, full disambiguation: accuracy 96.56%.

Results, full disambiguation: overall accuracy 96.56%, tagging accuracy 96.74%, lemma accuracy 98.33%.

Details: tagger coverage 79.39%, tagger accuracy 81.70%, lemma coverage 54.23%, lemma accuracy 86.71%.

Forthcoming (technical): optimizing for very large corpora (> 1 billion words); more sophisticated modeling (discriminative or MBT modeling, constrained decoding); better lemma disambiguation.

Forthcoming (theoretical): finding the best division of labor between data-driven and rule-driven approaches; pivoting on specific errors and ambiguities; working more with syntax (CG3 dependency trees).
