python - Crawl multiple domains with Scrapy without criss-cross
I have set up a CrawlSpider aggregating outbound links (crawling from start_urls to a certain depth via e.g. DEPTH_LIMIT = 2).
    from urlparse import urlparse

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import LinkNetworkItem  # assuming the item lives in your project's items module

    class LinkNetworkSpider(CrawlSpider):
        name = "network"
        allowed_domains = ["examplea.com"]
        start_urls = ["http://www.examplea.com"]

        rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

        def parse_start_url(self, response):
            return self.parse_item(response)

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//a/@href').extract()
            outgoing_links = []
            for link in links:
                if "http://" in link:
                    base_url = urlparse(link).hostname
                    base_url = base_url.split(':')[0]  # drop ports
                    base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                    # count allowed domains that this link does NOT belong to
                    url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                    if url_hit != 0:
                        outgoing_links.append(link)
            if outgoing_links:
                item = LinkNetworkItem()
                item['internal_site'] = response.url
                item['out_links'] = outgoing_links
                return [item]
            else:
                return None
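For reference, here is a minimal sketch of the item used above and of the depth setting (the field names come from the code; the module layout and the value 2 are assumptions):

    # items.py -- item holding one page and its outbound links
    from scrapy.item import Item, Field

    class LinkNetworkItem(Item):
        internal_site = Field()  # URL of the crawled page
        out_links = Field()      # outbound links found on that page

    # settings.py -- stop following links beyond this depth
    DEPTH_LIMIT = 2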
I want to extend this to multiple domains (examplea.com, exampleb.com, examplec.com, ...). At first I thought I could just add the list to start_urls as well as allowed_domains, but in my opinion this causes the following problems:

Will the setting DEPTH_LIMIT be applied to each start_urls/allowed_domain separately?

More importantly: if the sites are connected, will the spider jump from examplea.com to exampleb.com because both are in allowed_domains? I need to avoid this criss-cross, as I later want to count the outbound links of each site to gain information about the relationships between the websites!

So how can I scale the spider to more websites without running into the problem of criss-crossing, while using the settings per website?
Additional image showing what I would like to realize: (image not preserved)
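One way to get per-website settings without criss-crossing, sketched under my own assumptions (SingleSiteSpider and the -a domain argument are illustrative, not from the post), is to run one spider instance per site, so that allowed_domains always holds exactly one domain:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class SingleSiteSpider(CrawlSpider):
        # hypothetical per-site variant of the spider above
        name = "single_site"
        rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

        def __init__(self, domain=None, *args, **kwargs):
            super(SingleSiteSpider, self).__init__(*args, **kwargs)
            # exactly one domain per instance, so the crawl can never
            # wander from examplea.com onto exampleb.com
            self.allowed_domains = [domain]
            self.start_urls = ["http://www.%s" % domain]

        def parse_item(self, response):
            # collect outgoing links exactly as in parse_item above
            pass

Each site then gets its own run, e.g. scrapy crawl single_site -a domain=examplea.com, so DEPTH_LIMIT applies per site and links between the sites are still recorded but never followed.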
You would need to keep a data structure (e.g. a hashmap) of URLs the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them, and not visiting URLs that are already in the hashmap (as that means you have already visited them). There are more complicated ways of doing this that give greater performance, but they are also harder to implement.
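A minimal sketch of that idea (the names are mine; note that within a single run Scrapy's built-in duplicates filter already skips requests for URLs it has seen, so a structure like this mainly matters if you track visits across spiders or runs):

    visited = set()  # the "hashmap" of URLs already crawled

    def should_visit(url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in visited:
            return False   # already visited -- skip it
        visited.add(url)   # remember it for next time
        return True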
python scrapy