regex - Named Entity Recognition with Regular Expression: NLTK -
I have been playing with the NLTK toolkit. I come across this problem a lot and have searched for a solution online, but I never got a satisfying answer, so I am putting my query here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use a RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
Whereas
Input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.
So I think that if nltk.ne_chunk is used first, and two consecutive trees are then both NNP, there is a high chance that both refer to one entity.
Any suggestion will be appreciated. I am looking for flaws in my approach.
Thanks.
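The idea in the question (treat a run of adjacent NNP tokens as one candidate entity) can be tried without the chunker at all. The following is only a minimal sketch under stated assumptions: `merge_consecutive_nnp` is a hypothetical helper, and the tagged tokens are copied by hand from the second example rather than produced by a tagger. NLTK's `RegexpParser` can express a similar pattern with a chunk grammar such as `NE: {<NNP>+}`; the standard library is used here so the behaviour is easy to inspect.

```python
from itertools import groupby

def merge_consecutive_nnp(tagged_tokens):
    """Join every run of adjacent NNP-tagged tokens into one candidate entity."""
    entities = []
    # groupby splits the sentence into alternating runs of NNP / non-NNP tokens.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            entities.append(" ".join(token for token, _ in run))
    return entities

# Hand-tagged tokens copied from the question's second example (assumed input).
tagged = [
    ("Former", "JJ"), ("Vice", "NNP"), ("President", "NNP"),
    ("Dick", "NNP"), ("Cheney", "NNP"), ("told", "VBD"),
    ("conservative", "JJ"), ("radio", "NN"), ("host", "NN"),
    ("Laura", "NNP"), ("Ingraham", "NNP"), ("that", "IN"), ("he", "PRP"),
]

print(merge_consecutive_nnp(tagged))
# ['Vice President Dick Cheney', 'Laura Ingraham']
```

This also shows one possible flaw in the approach: the title "Vice President" is tagged NNP, so it is absorbed into the same candidate as "Dick Cheney". Whether that is acceptable depends on what should count as a single entity.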
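    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    def get_continuous_chunks(text):
        # Tokenize, POS-tag, then run NLTK's named-entity chunker.
        chunked = ne_chunk(pos_tag(word_tokenize(text)))
        continuous_chunk = []
        current_chunk = []

        for i in chunked:
            if isinstance(i, Tree):
                # Adjacent NE subtrees accumulate into one candidate entity.
                current_chunk.append(" ".join(token for token, pos in i.leaves()))
            elif current_chunk:
                # A non-NE token ends the run: store the merged entity once.
                named_entity = " ".join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                current_chunk = []

        # Flush a run that reaches the end of the sentence.
        if current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)

        return continuous_chunk

    txt = "Barack Obama is a great person."
    print(get_continuous_chunks(txt))
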
[out]:
    ['Barack Obama']

But do note that if the continuous chunks are not supposed to be a single NE, then you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If they are not continuous, the script above works fine:

    >>> txt = "Barack Obama is the husband of Michelle Obama."
    >>> get_continuous_chunks(txt)
    ['Barack Obama', 'Michelle Obama']
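
One way to limit the risk of combining unrelated NEs, sketched here only as an assumption and not as part of the script above, is to merge adjacent chunks only when they carry the same label. `Chunk` is a hypothetical stand-in for an NLTK `Tree` so that the example runs without NLTK data; with real `ne_chunk` output one would test `isinstance(node, Tree)` and compare `node.label()` instead. The input below is hand-built for illustration and is not actual chunker output.

```python
from typing import List, NamedTuple, Tuple, Union

class Chunk(NamedTuple):
    """Hypothetical stand-in for an NLTK Tree: an entity label plus its tokens."""
    label: str
    tokens: List[str]

Node = Union[Chunk, Tuple[str, str]]  # a chunk, or a plain (token, tag) pair

def merge_same_label(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Merge adjacent chunks only when they share the same entity label."""
    merged = []             # (label, [tokens]) pairs built so far
    prev_was_chunk = False  # True if the previous node was a chunk
    for node in nodes:
        if isinstance(node, Chunk):
            if prev_was_chunk and merged[-1][0] == node.label:
                merged[-1][1].extend(node.tokens)   # same label: extend the run
            else:
                merged.append((node.label, list(node.tokens)))
            prev_was_chunk = True
        else:
            prev_was_chunk = False                  # plain token ends the run
    return [(label, " ".join(tokens)) for label, tokens in merged]

# Hand-built, hypothetical chunker output for "In Paris Barack Obama spoke".
nodes = [
    ("In", "IN"),
    Chunk("GPE", ["Paris"]),
    Chunk("PERSON", ["Barack"]),
    Chunk("PERSON", ["Obama"]),
    ("spoke", "VBD"),
]

print(merge_same_label(nodes))
# [('GPE', 'Paris'), ('PERSON', 'Barack Obama')]
```

The trade-off is that this rule would not merge the first example in the question, where "Barack" is labelled PERSON and "Obama" is labelled ORGANIZATION, so it suits cases where the labels are reliable and only the chunk boundaries are wrong.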
 