beautifulsoup - Find Most Common Words from a Website in Python 3
I need to find and collect the words that appear more than 5 times on a given website using Python 3 code, and I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far:

    from urllib.request import urlopen

    website = urlopen("http://en.wikipedia.org/wiki/wolfgang_amadeus_mozart")

Does anyone have advice on what to do? I have NLTK installed, and I've looked into Beautiful Soup, but for the life of me I have no idea how to install it correctly (I'm Python-green)! As I'm learning, any explanation would be much appreciated. Thank you :)
This is not perfect, but it is an idea of how to get started using requests, BeautifulSoup and collections.Counter:

    import requests
    from bs4 import BeautifulSoup
    from collections import Counter
    from string import punctuation

    r = requests.get("http://en.wikipedia.org/wiki/wolfgang_amadeus_mozart")
    soup = BeautifulSoup(r.content, "html.parser")

    # Join all the text nodes inside each <p> tag into one string per paragraph.
    text = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))

    # Split each paragraph on whitespace, strip trailing punctuation, lowercase, and count.
    c = Counter(x.rstrip(punctuation).lower() for y in text for x in y.split())

    print(c.most_common())  # prints the words and their counts, starting with the most common

    [('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

    print([x for x in c if c.get(x) > 5])  # words appearing more than 5 times

    ['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
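The counting and filtering step can be tried on its own, without any network access or third-party packages. The following is only a minimal sketch: the helper name `words_over_threshold` and the `sample` string are made up for illustration, and it strips punctuation from both ends of each word with `str.strip`, which is a small change from the `rstrip` call in the answer above.

```python
from collections import Counter
from string import punctuation


def words_over_threshold(text, min_count=5):
    """Return {word: count} for words that appear more than min_count times.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Split on whitespace, strip punctuation from both ends, and lowercase.
    words = (w.strip(punctuation).lower() for w in text.split())
    # Skip tokens that consisted only of punctuation (now empty strings).
    counts = Counter(w for w in words if w)
    # most_common() orders from the highest count down; keep those above the threshold.
    return {w: n for w, n in counts.most_common() if n > min_count}


# Made-up sample text standing in for the page content.
sample = "the cat and the dog. The bird, the fish; the end - the the"
print(words_over_threshold(sample, min_count=5))
```

With real page text, the string passed in would be the joined paragraph text extracted by Beautiful Soup; returning a dict keeps the counts alongside the words rather than only the word list.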
python beautifulsoup web-crawler nltk