python - Scrape page with generator -



I am scraping a site with Beautiful Soup. The problem I have is that parts of the site are paginated with JS, with an unknown (varying) number of pages to scrape. I'm trying to get around this with a generator, but it's my first time writing one, and I'm having a hard time wrapping my head around it and figuring out whether what I'm doing makes sense.

Code:

    from bs4 import BeautifulSoup
    import urllib
    import urllib2
    import jabba_webkit as jw
    import csv
    import string
    import re
    import time

    tlds = csv.reader(open("top_level_domains.csv", 'r'), delimiter=';')
    sites = csv.writer(open("websites_to_scrape.csv", "w"), delimiter=',')

    tld = "uz"
    has_next = True
    page = 0

    def create_link(tld, page):
        if page == 0:
            link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain"
        else:
            link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain/page/" + repr(page)
        return link

    def check_for_next(soup):
        disabled_nav = soup.find(class_="pagingdivdisabled")
        if disabled_nav:
            if "next" in disabled_nav:
                return False
            else:
                return True
        else:
            return True

    def make_soup(link):
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")
        return soup

    def all_the_pages(counter):
        while True:
            link = create_link(tld, counter)
            soup = make_soup(link)
            if check_for_next(soup) == True:
                yield counter
            else:
                break
            counter += 1

    def scrape_page(soup):
        table = soup.find('table', {'class': 'ranktable'})
        th = table.find('tbody')
        test = th.find_all("td")
        correct_cells = range(1,len(test),3)
        for cell in correct_cells:
            #print test[cell]
            url = repr(test[cell])
            content = re.sub("<[^>]*>", "", url)
            sites.writerow([tld]+[content])

    def main():
        for page in all_the_pages(0):
            print page
            link = create_link(tld, page)
            print link
            soup = make_soup(link)
            scrape_page(soup)

    main()

My thinking behind the code: the scraper should get a page, determine whether there is another page that follows, scrape the current page, and move on to the next one, repeating the process. If there is no next page, it should stop. Does the way I'm going about this make sense?

As I told you, you could use Selenium to click the Next button programmatically, but since that is not an option for you, I can think of the following method to get the number of pages using pure bs4:

    import requests
    from bs4 import BeautifulSoup

    def page_count():
        pages = 1
        url = "https://domaintyper.com/top-websites/most-popular-websites-with-uz-domain/page/{}"
        while True:
            html = requests.get(url.format(pages)).content
            soup = BeautifulSoup(html)
            table = soup.find('table', {'class': 'ranktable'})
            # a table with at most one row (the header) has no entries
            if len(table.find_all('tr')) <= 1:
                return pages
            pages += 1
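Since the question asks about generators, the same "stop at the first empty page" idea can also be sketched as one, so that each page is downloaded only once and scraped as it arrives. This is only a rough sketch under some assumptions: `paginate` and `load_rows` are names made up for illustration, the `/page/{}` URL pattern and the `ranktable` class are taken from the code above, the "site name is the second of three cells per row" layout is taken from the question's `scrape_page`, and `html.parser` is an arbitrary parser choice. It has not been run against the live site.

```python
def paginate(load_rows, start=1):
    """Yield (page_number, rows) until load_rows returns an empty list."""
    page = start
    while True:
        rows = load_rows(page)
        if not rows:  # empty page: we have run past the last one
            return
        yield page, rows
        page += 1


def load_rows(page, tld="uz"):
    """Fetch one listing page and return the site names found in its table.

    Hypothetical helper: assumes a table with class "ranktable" and
    three cells per row, with the site name in the second cell.
    """
    # Third-party imports live here so paginate() has no dependencies.
    import requests
    from bs4 import BeautifulSoup

    url = ("https://domaintyper.com/top-websites/"
           "most-popular-websites-with-{}-domain/page/{}").format(tld, page)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    table = soup.find("table", {"class": "ranktable"})
    if table is None:
        return []
    cells = table.find_all("td")
    return [cells[i].get_text(strip=True) for i in range(1, len(cells), 3)]
```

Looping with `for page, rows in paginate(load_rows):` would then hand you one page of site names at a time, and the loop ends by itself when a page comes back empty. Because `paginate` only depends on the callable it is given, you could pass in a function built on `jw.get_page` instead if the content really needs JS rendering.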

python web-scraping beautifulsoup generator