Finding all links matching a specific URL template in an HTML page
So let's say we have the following base URL: http://example.com/stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed, and I want to find all links matching this template in the HTML page.
I can use XPath to match just part of the template, //a[contains(@href, 'preview/v')], or use regexes, but I'm wondering if anyone knows a more elegant way to match the entire template using XPath or regexes, so that it's fast and the matches are correct.
Thanks.
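For what it's worth, a minimal sketch of the partial-match approach mentioned above, assuming lxml is used and that data holds the page's HTML (the variable names are illustrative):
import lxml.html

tree = lxml.html.fromstring(data)  # data is assumed to hold the fetched HTML
# partial match: only checks that the href contains 'preview/v', not the full template
hrefs = tree.xpath("//a[contains(@href, 'preview/v')]/@href")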
Edit: I timed it on a sample page. With my internet connection, over 100 trials, iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use Selectors:
from scrapy.selector import Selector

data = get(url).text  # get() is assumed to come from e.g. requests
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/stuff/preview/v/\d+/fl/1/t/")]//@href').extract()
Average time with this: 0.467 seconds.
You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0 and XPath 1.0 doesn't support regular expression search.
Instead, you can find all of the links on the page using iterlinks(), iterate over them and check the href attribute value:
import re
import lxml.html

tree = lxml.html.fromstring(data)  # data is assumed to hold the page HTML

pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    # skip links that don't match the full URL template
    if not pattern.match(link):
        continue
    print(link)
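If you would rather collect the matching hrefs than print them, a small variant under the same assumptions could be:
matching_links = [link for element, attribute, link, pos in tree.iterlinks()
                  if pattern.match(link)]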
An alternative option is to use the BeautifulSoup HTML parser:
import re
from bs4 import BeautifulSoup

data = "your html here"
soup = BeautifulSoup(data)
pattern = re.compile(r"http://example.com/stuff/preview/v/\d+/fl/1/t/")
print(soup.find_all('a', {'href': pattern}))
To make BeautifulSoup parse faster, you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
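For instance, a rough sketch of SoupStrainer usage, reusing the pattern compiled above and restricting parsing to matching a tags (treat this as illustrative rather than a drop-in solution):
from bs4 import BeautifulSoup, SoupStrainer

only_matching_links = SoupStrainer('a', href=pattern)  # pattern as compiled above
soup = BeautifulSoup(data, "lxml", parse_only=only_matching_links)
print(soup.find_all('a'))
SoupStrainer accepts the same filters as find_all(), so a compiled regex works for the href filter here.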
Hope that helps.
python regex xpath html-parsing lxml