Finding all links matching a specific URL template in an HTML page
So let's say we have the following base URL: http://example.com/stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed, and I want to find all links matching this template in the HTML page.
I can use XPath to match just part of the template, //a[contains(@href, 'preview/v')], or use regexes, but I'm wondering if anyone knows a more elegant way to match the entire template using XPath or regexes, so that it's fast and the matches are correct.
Thanks.
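For what it's worth, a minimal sketch of the partial-match approach mentioned above, assuming lxml is used and that data holds the page's HTML (the variable names are illustrative):
import lxml.html

tree = lxml.html.fromstring(data)  # data is assumed to hold the fetched HTML
# partial match: only checks that the href contains 'preview/v', not the full template
hrefs = tree.xpath("//a[contains(@href, 'preview/v')]/@href")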
Edit: I timed it on a sample page. With my internet connection, over 100 trials, iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use Selectors:
from scrapy.selector import Selector

data = get(url).text  # get() is assumed to come from e.g. requests
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/stuff/preview/v/\d+/fl/1/t/")]//@href').extract()
Average time with this: 0.467 seconds.
You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0 and XPath 1.0 doesn't support regular expression search.
Instead, you can find all of the links on the page using iterlinks(), iterate over them and check the href attribute value:
import re
import lxml.html

tree = lxml.html.fromstring(data)  # data is assumed to hold the page HTML

pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    # skip links that don't match the full URL template
    if not pattern.match(link):
        continue
    print(link)
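If you would rather collect the matching hrefs than print them, a small variant under the same assumptions could be:
matching_links = [link for element, attribute, link, pos in tree.iterlinks()
                  if pattern.match(link)]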
An alternative option is to use the BeautifulSoup HTML parser:
import re
from bs4 import BeautifulSoup

data = "your html here"
soup = BeautifulSoup(data)
pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
print(soup.find_all('a', {'href': pattern}))
To make BeautifulSoup parse faster, you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
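For instance, a rough sketch of SoupStrainer usage, reusing the pattern compiled above and restricting parsing to matching a tags (treat this as illustrative rather than a drop-in solution):
from bs4 import BeautifulSoup, SoupStrainer

only_matching_links = SoupStrainer('a', href=pattern)  # pattern as compiled above
soup = BeautifulSoup(data, "lxml", parse_only=only_matching_links)
print(soup.find_all('a'))
SoupStrainer accepts the same filters as find_all(), so a compiled regex works for the href filter here.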
Hope that helps.
python regex xpath html-parsing lxml