regex - Named Entity Recognition with Regular Expression: NLTK -




I have been playing with the NLTK toolkit. I come across this problem a lot and have searched for a solution online, but I have not found a satisfying answer, so I am putting my query here.

Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve the NER.

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

Whereas:

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

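Here, out of Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP), the name (Dick/NNP, Cheney/NNP) is correctly extracted as one entity.

So I think that if nltk.ne_chunk is used first, and two consecutive trees then both contain NNPs, there is a high chance that both refer to one entity.

Any suggestions would be appreciated. I am looking for flaws in my approach.

Thanks.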
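As a rough illustration of the idea above, the sketch below joins every run of adjacent NNP-tagged tokens into one candidate entity. It works on an already tagged token list, so it needs no NLTK models. The function name `merge_consecutive_nnp` and the hand-written `tagged` list are assumptions made for this example, not part of NLTK.

```python
from itertools import groupby


def merge_consecutive_nnp(tagged_tokens):
    """Join each run of adjacent NNP-tagged tokens into one candidate entity."""
    entities = []
    # groupby splits the sequence into runs of NNP and non-NNP tokens.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            entities.append(" ".join(token for token, _tag in run))
    return entities


# Hand-written tags copied from the second example in the question.
tagged = [
    ("Former", "JJ"), ("Vice", "NNP"), ("President", "NNP"),
    ("Dick", "NNP"), ("Cheney", "NNP"), ("told", "VBD"),
    ("conservative", "JJ"), ("radio", "NN"), ("host", "NN"),
    ("Laura", "NNP"), ("Ingraham", "NNP"),
]
print(merge_consecutive_nnp(tagged))
# ['Vice President Dick Cheney', 'Laura Ingraham']
```

Note that the title and the name end up in one entity here, which is one possible flaw of merging purely on consecutive NNP tags.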

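from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, and NE-chunk the text.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if isinstance(i, Tree):
            # Collect adjacent NE subtrees so they are joined into one entity.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-NE token ends the current run of NE subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))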

[out]:

['Barack Obama']

But do note that if a continuous chunk is not supposed to be a single NE, you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If the NEs are not continuous, the script above works fine:

>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
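One way to keep an eye on the over-merging risk noted above is to record the label of every chunk that goes into a merged entity, so that mixed labels can be reviewed afterwards. The sketch below is only an illustration: it uses a simplified stand-in for the chunker output (a list whose items are either a plain token string or a `(label, tokens)` tuple), and the name `merge_adjacent_chunks` is hypothetical, not an NLTK function.

```python
def merge_adjacent_chunks(items):
    """Merge adjacent (label, tokens) chunks and keep the labels that were merged."""
    merged = []
    tokens, labels = [], []
    for item in items:
        if isinstance(item, tuple):
            # A chunk: extend the current run and remember its label.
            label, chunk_tokens = item
            tokens.extend(chunk_tokens)
            labels.append(label)
        elif tokens:
            # A plain token ends the current run of chunks.
            merged.append((" ".join(tokens), labels))
            tokens, labels = [], []
    if tokens:
        # Flush a run that reaches the end of the sentence.
        merged.append((" ".join(tokens), labels))
    return merged


# Simplified version of the first output in the question.
items = [("PERSON", ["Barack"]), ("ORGANIZATION", ["Obama"]),
         "is", "a", "great", "person", "."]
print(merge_adjacent_chunks(items))
# [('Barack Obama', ['PERSON', 'ORGANIZATION'])]
```

A merged entity whose labels disagree, as in this case, may be a single name that was split by the chunker or two genuinely separate entities, so it is a candidate for manual checking.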

regex nlp nltk named-entity-recognition
