python - Crawl multiple domains with Scrapy without criss-cross
I have set up a CrawlSpider that aggregates outbound links (crawling from start_urls to a certain depth, e.g. via depth_limit = 2).
```python
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["examplea.com"]
    start_urls = ["http://www.examplea.com"]
    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()
        outgoing_links = []
        for link in links:
            if "http://" in link:
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)
        if outgoing_links:
            # LinkNetworkItem is defined in the project's items module
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None
```

Now I want to extend this to multiple domains (examplea.com, exampleb.com, examplec.com, ...). At first I thought I could simply add the lists to start_urls and allowed_domains, but in my opinion this causes the following problems:
- Is depth_limit applied per start_url/allowed_domain?
- More importantly: if the sites link to each other, will the spider jump from examplea.com to exampleb.com because both are in allowed_domains?

I need to avoid this criss-crossing because later on I want to count the outbound links of each site to gain information about the relationships between the websites. So how can I scale this to more sites without running into the criss-crossing problem, while still keeping per-website settings?
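One way to keep the crawls separate is to run one spider per site, each with a single entry in allowed_domains, and decide internal vs. outbound by comparing a link's registered domain against that spider's own domain. A rough sketch of that comparison in plain Python (the helper names `base_domain` and `is_outbound` are made up for illustration; this mirrors the hostname-trimming logic from the spider above):

```python
from urllib.parse import urlparse

def base_domain(url):
    """Reduce a URL to its registered domain, e.g.
    'http://sub.examplea.com/page' -> 'examplea.com'."""
    host = urlparse(url).hostname or ''
    host = host.split(':')[0]              # drop any port remnants
    return '.'.join(host.split('.')[-2:])  # strip subdomains

def is_outbound(url, own_domain):
    """A link is outbound if its registered domain differs from
    the single domain this spider is responsible for."""
    return base_domain(url) != own_domain
```

With this check, a spider whose own_domain is examplea.com records links to exampleb.com as outbound instead of following them, so the link counts per site stay clean even when the sites reference each other.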
Additional image showing what I want to realize: [image]
You need to maintain a data structure (e.g. a hashmap) of the URLs the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them, and never visiting a URL that is already in the hashmap (since that means you have visited it). There are more complicated ways of doing this that give greater performance, but they are also harder to implement.
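The idea above can be sketched as a small breadth-first crawler in plain Python. This is a minimal illustration, not Scrapy: `get_links(url)` is a hypothetical stand-in for fetching a page and extracting its links, and the visited set prevents any URL from being processed twice even when pages link back to each other:

```python
from collections import deque

def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl that records every seen URL in a set,
    so no page is processed twice even in cyclic link graphs."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []                      # URLs in the order they are processed
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:      # honor the depth limit
            continue
        for link in get_links(url):
            if link not in visited: # skip URLs already seen
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

For example, with a toy link graph `{'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a']}`, `crawl('a', graph.get)` processes each page exactly once despite the cycles. Scrapy itself already deduplicates requests per crawl, so the criss-cross problem in the question is less about revisiting pages and more about partitioning which spider owns which domain.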
python scrapy