regex - Named Entity Recognition with Regular Expression: NLTK
I have been playing with the NLTK toolkit. I come across this problem a lot and have searched for a solution online, but I never found a satisfying answer, so I am putting my query here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use a RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
Whereas:
Input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here, in Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP), the name is correctly extracted.
So I think that if nltk.ne_chunk is used first, and two consecutive trees then contain NNPs, there is a high chance that both refer to one entity.
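A minimal sketch of that idea, assuming the tokens are already POS-tagged. The tags below are hard-coded for illustration rather than produced by nltk.pos_tag, and `merge_consecutive_nnps` is a hypothetical helper name, not part of NLTK:

```python
from itertools import groupby

PROPER_NOUN_TAGS = ("NNP", "NNPS")

def merge_consecutive_nnps(tagged):
    """Join each run of adjacent proper-noun tokens into one candidate entity."""
    entities = []
    # groupby splits the sentence into alternating runs of proper / non-proper tokens.
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] in PROPER_NOUN_TAGS):
        if is_proper:
            entities.append(" ".join(token for token, _ in group))
    return entities

# Hard-coded (token, tag) pairs standing in for a POS tagger's output (assumption).
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("person", "NN"), (".", ".")]
print(merge_consecutive_nnps(tagged))  # ['Barack Obama']
```

Run on the tags from the second example, this would also pull Vice and President into the same entity as Dick Cheney, which is one flaw of relying on consecutive NNPs alone.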
Any suggestions will be appreciated. I am looking for flaws in my approach.
Thanks.
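    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    def get_continuous_chunks(text):
        chunked = ne_chunk(pos_tag(word_tokenize(text)))
        continuous_chunk = []
        current_chunk = []

        for i in chunked:
            if isinstance(i, Tree):
                # An NE subtree: collect its tokens and keep extending the current run.
                current_chunk.append(" ".join(token for token, pos in i.leaves()))
            elif current_chunk:
                # A plain token ends the run: join the adjacent subtrees into one entity.
                named_entity = " ".join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                current_chunk = []

        # Flush a run that reaches the end of the sentence.
        if current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)

        return continuous_chunk

    txt = "Barack Obama is a great person."
    print(get_continuous_chunks(txt))

[out]:
['Barack Obama']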
But do note that if the continuous chunks are not supposed to be a single NE, you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If they are not continuous, the script above works fine:
>>> txt = "Barack Obama is a husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
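One way to keep an eye on the over-merging risk is to record which labels went into each merged entity, so suspicious combinations can be inspected afterwards. This is only a sketch under assumptions: `items` is a hypothetical plain-Python stand-in for ne_chunk output (a chunk is a `(label, [tokens])` tuple, any other token is a plain string), and `merge_adjacent_chunks` is an illustrative name, not an NLTK function:

```python
def merge_adjacent_chunks(items):
    """Merge adjacent chunks and remember the label of each merged part."""
    merged = []
    current_tokens, current_labels = [], []
    for item in items:
        if isinstance(item, tuple):
            # A labelled chunk: extend the current run.
            label, tokens = item
            current_tokens.extend(tokens)
            current_labels.append(label)
        elif current_tokens:
            # A plain token ends the run.
            merged.append((" ".join(current_tokens), current_labels))
            current_tokens, current_labels = [], []
    if current_tokens:
        # Flush a run that reaches the end of the sentence.
        merged.append((" ".join(current_tokens), current_labels))
    return merged

# Stand-in for the chunker output from the question's first example (assumed shape).
items = [("PERSON", ["Barack"]), ("ORGANIZATION", ["Obama"]),
         "is", "a", "great", "person", "."]
print(merge_adjacent_chunks(items))
# [('Barack Obama', ['PERSON', 'ORGANIZATION'])]
```

A merged entity whose parts carry different labels, as here, is a hint that the chunker disagreed with itself and the merge may deserve a second look.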
regex nlp nltk named-entity-recognition