python - Crawl multiple domains with Scrapy without criss-cross
I have set up a CrawlSpider that aggregates outbound links (crawling from start_urls to a certain depth, e.g. via depth_limit = 2).
```python
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["examplea.com"]
    start_urls = ["http://www.examplea.com"]
    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()
        outgoing_links = []
        for link in links:
            if "http://" in link:
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)
        if outgoing_links:
            # LinkNetworkItem is defined in the project's items module
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None
```

Now I want to extend this to multiple domains (examplea.com, exampleb.com, examplec.com, ...). At first I thought I could simply add the lists to start_urls and allowed_domains, but in my opinion this causes the following problems:
- Is depth_limit applied per start_url/allowed_domain?
- More importantly: if the sites link to each other, will the spider jump from examplea.com to exampleb.com because both are in allowed_domains?

I need to avoid this criss-crossing because later on I want to count the outbound links of each site to gain information about the relationships between the websites. So how can I scale this to more sites without running into the criss-crossing problem, while still keeping per-website settings?
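One way to keep the crawls separate is to run one spider per site, each with a single entry in allowed_domains, and decide internal vs. outbound by comparing a link's registered domain against that spider's own domain. A rough sketch of that comparison in plain Python (the helper names `base_domain` and `is_outbound` are made up for illustration; this mirrors the hostname-trimming logic from the spider above):

```python
from urllib.parse import urlparse

def base_domain(url):
    """Reduce a URL to its registered domain, e.g.
    'http://sub.examplea.com/page' -> 'examplea.com'."""
    host = urlparse(url).hostname or ''
    host = host.split(':')[0]              # drop any port remnants
    return '.'.join(host.split('.')[-2:])  # strip subdomains

def is_outbound(url, own_domain):
    """A link is outbound if its registered domain differs from
    the single domain this spider is responsible for."""
    return base_domain(url) != own_domain
```

With this check, a spider whose own_domain is examplea.com records links to exampleb.com as outbound instead of following them, so the link counts per site stay clean even when the sites reference each other.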
Additional image showing what I want to realize: [image]
You need to maintain a data structure (e.g. a hashmap) of the URLs the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them, and never visiting a URL that is already in the hashmap (since that means you have visited it). There are more complicated ways of doing this that give greater performance, but they are also harder to implement.
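The idea above can be sketched as a small breadth-first crawler in plain Python. This is a minimal illustration, not Scrapy: `get_links(url)` is a hypothetical stand-in for fetching a page and extracting its links, and the visited set prevents any URL from being processed twice even when pages link back to each other:

```python
from collections import deque

def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl that records every seen URL in a set,
    so no page is processed twice even in cyclic link graphs."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []                      # URLs in the order they are processed
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:      # honor the depth limit
            continue
        for link in get_links(url):
            if link not in visited: # skip URLs already seen
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

For example, with a toy link graph `{'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a']}`, `crawl('a', graph.get)` processes each page exactly once despite the cycles. Scrapy itself already deduplicates requests per crawl, so the criss-cross problem in the question is less about revisiting pages and more about partitioning which spider owns which domain.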
python scrapy