python - Finding all links matching specific URL template in an HTML page
So let's say we have the following base URL: http://example.com/stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all links matching this template in the HTML page.
I can use XPath to match part of the template, //a[contains(@href, 'preview/v')], or use regexes, but I'm wondering whether anyone knows a more elegant way to match the entire template with XPath or regexes that is both fast and matches correctly.
Thanks.
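For reference, a minimal sketch of the partial-template idea mentioned above (assuming lxml and an already-fetched HTML string; the variable names are illustrative only): contains() narrows the candidates in XPath, and a regex on the Python side confirms the full template.

    import re
    import lxml.html

    data = "your html"  # placeholder for the fetched page HTML
    tree = lxml.html.fromstring(data)

    pattern = re.compile(r"/stuff/preview/v/\d+/fl/1/t/")
    # contains() narrows the candidate set; the regex checks the full template
    candidates = tree.xpath("//a[contains(@href, 'preview/v')]/@href")
    links = [href for href in candidates if pattern.search(href)]
    print(links)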
Edit: I timed this on a sample page. With my internet connection and 100 trials, the iteration approach takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use Selectors:
    from requests import get
    from scrapy.selector import Selector

    data = get(url).text  # url is the page being parsed
    sel = Selector(text=data, type="html")
    a = sel.xpath(r'//a[re:test(@href, "/stuff/preview/v/\d+/fl/1/t/")]//@href').extract()

Average time: 0.467 seconds.
You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0, and XPath 1.0 doesn't support regular expression search.
Instead, you can find all links on the page using iterlinks(), iterate over them, and check the href attribute value:
    import re
    import lxml.html

    # data holds the HTML of the page being parsed
    tree = lxml.html.fromstring(data)

    pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
    for element, attribute, link, pos in tree.iterlinks():
        if not pattern.match(link):
            continue
        print(link)

An alternative option is to use the BeautifulSoup HTML parser:
    import re
    from bs4 import BeautifulSoup

    data = "your html"
    soup = BeautifulSoup(data)

    pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
    print(soup.find_all('a', {'href': pattern}))

To make BeautifulSoup parsing faster, you can let it use lxml:
    soup = BeautifulSoup(data, "lxml")

Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
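As an illustration, here is a minimal sketch of the SoupStrainer idea (assuming the same data variable and URL template as in the snippets above): the strainer tells BeautifulSoup to build the tree only from a tags whose href matches the pattern.

    import re
    from bs4 import BeautifulSoup, SoupStrainer

    pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
    # only parse <a> tags whose href matches the template
    only_matching_links = SoupStrainer('a', href=pattern)

    soup = BeautifulSoup(data, "lxml", parse_only=only_matching_links)
    print(soup.find_all('a'))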
Hope that helps.
 python regex xpath html-parsing lxml 
 