python - Crawl multiple domains with Scrapy without criss-cross




I have set up a CrawlSpider that aggregates all outbound links (crawling from start_urls down to a certain depth via e.g. DEPTH_LIMIT = 2).

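    # Imports for the older Scrapy (0.x) / Python 2 API used here; the post itself omits them.
    from urlparse import urlparse
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    # LinkNetworkItem is the project's own Item class (its definition is not shown in the post).

    class LinkNetworkSpider(CrawlSpider):
        name = "network"
        allowed_domains = ["examplea.com"]
        start_urls = ["http://www.examplea.com"]

        # Follow every link on the allowed domain and pass each page to parse_item.
        rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

        def parse_start_url(self, response):
            return self.parse_item(response)

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//a/@href').extract()

            outgoing_links = []
            for link in links:
                if ("http://" in link):
                    base_url = urlparse(link).hostname
                    base_url = base_url.split(':')[0]  # drop ports
                    base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                    # Count the allowed domains that do not match this link's domain.
                    url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                    if url_hit != 0:
                        outgoing_links.append(link)

            if outgoing_links:
                item = LinkNetworkItem()
                item['internal_site'] = response.url
                item['out_links'] = outgoing_links
                return [item]
            else:
                return None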

I want to extend this to multiple domains (examplea.com, exampleb.com, examplec.com ...). At first, I thought I could just add my list to start_urls as well as allowed_domains, but in my opinion this causes the following problems:

Will the DEPTH_LIMIT setting be applied to each start_urls/allowed_domains entry? More important: if the sites are connected, will the spider jump from examplea.com to exampleb.com because both are in allowed_domains? I need to avoid this criss-crossing, because later on I want to count the outbound links of each site to gain information about the relationship between the websites!

So how can I scale this to more sites without running into the problem of criss-crossing, while still using the settings per website?
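To make the criss-cross concern concrete, here is a minimal sketch in plain Python (not Scrapy code). It assumes a link should count as outbound relative to the site of the page it was found on, rather than relative to the whole allowed_domains list. The helper names site_of, outbound_links and should_follow are hypothetical, and the "last two labels" rule is the same naive heuristic as in the spider above, so it is wrong for suffixes such as co.uk.

```python
from urllib.parse import urlparse


def site_of(url):
    """Return the last two labels of the hostname, e.g. 'examplea.com'."""
    host = urlparse(url).hostname or ""  # hostname already excludes the port
    return ".".join(host.lower().split(".")[-2:])


def outbound_links(page_url, links):
    """Keep only absolute links that leave the site page_url belongs to."""
    origin = site_of(page_url)
    return [
        link for link in links
        if link.startswith(("http://", "https://")) and site_of(link) != origin
    ]


def should_follow(page_url, link):
    """Follow a link only if it stays on the site it was found on."""
    return site_of(link) == site_of(page_url)
```

With this scoping, a link from examplea.com to exampleb.com is recorded as outbound for examplea.com but never followed, even when both domains are being crawled in the same run.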

Additional image showing what I would like to achieve: [image not available]

You need to maintain a data structure (e.g. a hashmap) of the URLs the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them, and not visiting a URL if it's already in the hashmap (as that means you have already visited it). There are more complicated ways of doing this that give greater performance, but they are also harder to implement.
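The visited-URL idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Scrapy code: crawl_site, fake_get_links and the FAKE_WEB link graph are hypothetical stand-ins for real fetching, and each start URL gets its own visited set and depth counter so that one site's crawl never continues into another. Scrapy itself already filters duplicate requests by default, so the part that addresses criss-crossing here is the per-site scoping.

```python
from collections import deque
from urllib.parse import urlparse


def site_of(url):
    """Naive site key: last two hostname labels (wrong for e.g. 'co.uk')."""
    host = urlparse(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def crawl_site(start_url, get_links, depth_limit=2):
    """Breadth-first crawl of one site; returns {page_url: [outbound links]}."""
    origin = site_of(start_url)
    visited = {start_url}            # URLs already queued or fetched
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    out_links = {}

    while queue:
        url, depth = queue.popleft()
        for link in get_links(url):
            if site_of(link) != origin:
                # Off-site link: record it, but never follow it.
                out_links.setdefault(url, []).append(link)
            elif depth < depth_limit and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return out_links


# Hypothetical in-memory link graph standing in for real HTTP fetches.
FAKE_WEB = {
    "http://examplea.com/": ["http://examplea.com/a", "http://exampleb.com/"],
    "http://examplea.com/a": ["http://examplea.com/", "http://examplec.com/"],
    "http://exampleb.com/": ["http://examplea.com/", "http://exampleb.com/b"],
    "http://exampleb.com/b": [],
}


def fake_get_links(url):
    return FAKE_WEB.get(url, [])


# One independent crawl per start URL, each with its own visited set.
results = {
    start: crawl_site(start, fake_get_links)
    for start in ("http://examplea.com/", "http://exampleb.com/")
}
```

Because off-site links are only recorded, results holds a separate outbound-link table per start site, which is the shape needed to count links between websites afterwards.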

python scrapy
