python - Scrape page with generator

I am scraping a site with Beautiful Soup. The problem is that parts of the site are paginated with JS, with an unknown (varying) number of pages to scrape. I'm trying to get around this with a generator, but it's my first time writing one, and I'm having a hard time wrapping my head around it and figuring out whether what I'm doing makes sense.

Code:

    from bs4 import BeautifulSoup
    import urllib
    import urllib2
    import jabba_webkit as jw
    import csv
    import string
    import re
    import time

    tlds = csv.reader(open("top_level_domains.csv", 'r'), delimiter=';')
    sites = csv.writer(open("websites_to_scrape.csv", "w"), delimiter=',')

    tld = "uz"
    has_next = True
    page = 0

    def create_link(tld, page):
        if page == 0:
            link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain"
        else:
            link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain/page/" + repr(page)

        return link

    def check_for_next(soup):
        disabled_nav = soup.find(class_="pagingdivdisabled")

        if disabled_nav:
            if "next" in disabled_nav:
                return False
            else:
                return True
        else:
            return True

    def make_soup(link):
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")

        return soup

    def all_the_pages(counter):
        while True:
            link = create_link(tld, counter)
            soup = make_soup(link)
            if check_for_next(soup) == True:
                yield counter
            else:
                break
            counter += 1

    def scrape_page(soup):
        table = soup.find('table', {'class': 'ranktable'})
        th = table.find('tbody')
        test = th.find_all("td")

        correct_cells = range(1, len(test), 3)
        for cell in correct_cells:
            #print test[cell]
            url = repr(test[cell])
            content = re.sub("<[^>]*>", "", url)
            sites.writerow([tld] + [content])

    def main():
        for page in all_the_pages(0):
            print page
            link = create_link(tld, page)
            print link
            soup = make_soup(link)
            scrape_page(soup)

    main()

My thinking behind the code: the scraper should get the page, determine whether another page follows, scrape the current page, and move on to the next one, repeating the process. If there is no next page, it should stop. Does the way I'm going about it here make sense?
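The flow described above could be sketched as a generator that yields each parsed page itself rather than only its number. This is a minimal sketch, not the asker's code: `iter_pages`, `fetch`, `has_next`, and `FAKE_SITE` are hypothetical names, and the fake data stands in for the real site so the example runs offline. In the real script, `fetch` would presumably wrap `create_link` and `make_soup`, and `has_next` would be `check_for_next`.

```python
def iter_pages(fetch, has_next, start=0):
    """Yield (page_number, page) pairs until a page reports no successor.

    fetch and has_next are hypothetical callables supplied by the caller:
    fetch(n) returns the parsed page n, has_next(page) says whether a
    further page exists.
    """
    number = start
    while True:
        page = fetch(number)
        yield number, page       # hand the page over before testing for a successor
        if not has_next(page):
            break
        number += 1


# Stand-in data so the sketch runs without network access.
FAKE_SITE = {
    0: {"rows": ["a.uz", "b.uz"], "next": True},
    1: {"rows": ["c.uz"], "next": True},
    2: {"rows": ["d.uz"], "next": False},
}

scraped = []
for number, page in iter_pages(FAKE_SITE.get, lambda p: p["next"]):
    scraped.extend(page["rows"])
```

Yielding before the `has_next` test means the last page is still scraped, whereas `all_the_pages` as posted breaks out of the loop without yielding once `check_for_next` returns false. Passing the page along with its number also avoids downloading every page twice.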

As I told you, you could use Selenium to click the Next button programmatically, but since that is not an option for you, I can think of the following method to get the number of pages using pure BS4:

    import requests
    from bs4 import BeautifulSoup

    def page_count():
        pages = 1
        url = "https://domaintyper.com/top-websites/most-popular-websites-with-uz-domain/page/{}"

        while True:
            html = requests.get(url.format(pages)).content
            soup = BeautifulSoup(html)

            # A page past the end has a table with no data rows.
            table = soup.find('table', {'class': 'ranktable'})
            if len(table.find_all('tr')) <= 1:
                return pages
            pages += 1
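One thing to watch with the return value: the loop in `page_count` returns the number of the first page whose table has at most one row, which is one past the last page that holds data. Below is a minimal sketch of the same stop condition, arranged so the result is the number of pages with data. `count_pages`, `row_count`, and `FAKE_ROWS` are hypothetical names; in real use `row_count` would make the `requests.get` and `find_all('tr')` calls shown above. It assumes, as the `<= 1` test suggests, that an empty page has at most a header row.

```python
def count_pages(row_count, first_page=1):
    """Return how many consecutive pages, starting at first_page, hold data.

    row_count is a hypothetical callable: row_count(n) returns the number
    of <tr> rows in the table on page n (header row included).
    """
    page = first_page
    while row_count(page) > 1:   # more than just the header row
        page += 1
    return page - first_page


# Stand-in row counts: pages 1-3 have data, page 4 has only a header row.
FAKE_ROWS = {1: 101, 2: 101, 3: 42, 4: 1}

total = count_pages(lambda n: FAKE_ROWS.get(n, 0))
pages_to_scrape = list(range(1, total + 1))
```

With the count in hand, the scraping loop can iterate over `pages_to_scrape` directly instead of probing for a Next button on every page.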

python web-scraping beautifulsoup generator
