python - Finding all links matching specific URL template in an HTML page




Let's say I have the following base URL: http://example.com/stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.

I can use XPath to match just a part of the template, //a[contains(@href, "preview/v")], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so that it's fast and the matches are definitely correct.

Thanks.

Edit: I timed it on a sample page. With my internet connection and 100 trials, the iteration approach takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.

Also, if you have Scrapy, one can use Selectors:

    from requests import get  # assuming get() is requests.get
    from scrapy.selector import Selector

    data = get(url).text
    sel = Selector(text=data, type="html")
    a = sel.xpath(r'//a[re:test(@href, "/stuff/preview/v/\d+/fl/1/t/")]/@href').extract()

The average time for this is also 0.467 seconds.

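Plain XPath 1.0, which is what lxml implements, doesn't support regular expression search, so the pattern can't be written in standard XPath alone. (lxml does offer the EXSLT re:test() extension function, the one the Scrapy selector in the question calls, but only once its namespace is registered.)

Instead, you can find all links on the page using iterlinks(), iterate over them and check the href attribute value: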

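    import re
    import lxml.html

    # data: HTML source of the page being parsed
    tree = lxml.html.fromstring(data)

    # Raw string so \d reaches the regex engine; dots escaped to match literally.
    pattern = re.compile(r"http://example\.com/stuff/preview/v/\d+/fl/1/t/")
    for element, attribute, link, pos in tree.iterlinks():
        if not pattern.match(link):
            continue
        print(link)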
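If a single XPath expression is preferred over a Python loop, lxml's EXSLT regular-expression extension can be used once the `re` prefix is bound to its namespace. This is a minimal sketch, assuming made-up sample markup and purely numeric ids:

```python
import lxml.html

# Made-up sample markup, only for illustration.
html = """
<html><body>
  <a href="http://example.com/stuff/preview/v/42/fl/1/t/">match</a>
  <a href="http://example.com/stuff/preview/v/abc/fl/1/t/">non-numeric id</a>
  <a href="http://example.com/other/">unrelated</a>
</body></html>
"""

tree = lxml.html.fromstring(html)

# Bind the "re" prefix to the EXSLT regular-expressions namespace.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# ^ and $ anchor the pattern so the whole href has to fit the template.
hrefs = tree.xpath(
    r'//a[re:test(@href, "^http://example\.com/stuff/preview/v/\d+/fl/1/t/$")]/@href',
    namespaces=EXSLT_RE,
)
print(hrefs)
```

The filtering then happens inside the XPath query, so no explicit loop over the links is needed.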

Another option is to use the BeautifulSoup HTML parser:

    import re
    from bs4 import BeautifulSoup

    data = "your html"
    soup = BeautifulSoup(data, "html.parser")

    pattern = re.compile(r"http://example\.com/stuff/preview/v/\d+/fl/1/t/")
    print(soup.find_all('a', {'href': pattern}))
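The pattern used with either parser can also be derived from the URL template itself, so that every literal character is escaped and the whole link has to match. The sketch below is only an illustration: `template_to_regex` is a hypothetical helper, not part of any library, and it assumes the ids are purely numeric.

```python
import re

# Hypothetical helper: turns a URL template with an {id} placeholder into a
# compiled regex. Everything except the placeholder is matched literally.
def template_to_regex(template, placeholder="{id}", id_pattern=r"\d+"):
    literal_parts = template.split(placeholder)
    escaped = [re.escape(part) for part in literal_parts]
    return re.compile(id_pattern.join(escaped))

TEMPLATE = "http://example.com/stuff/preview/v/{id}/fl/1/t/"
pattern = template_to_regex(TEMPLATE)

# Made-up candidate links, only for illustration.
candidates = [
    "http://example.com/stuff/preview/v/123/fl/1/t/",
    "http://example.com/stuff/preview/v/abc/fl/1/t/",       # non-numeric id
    "http://exampleXcom/stuff/preview/v/123/fl/1/t/",       # "." must be literal
    "http://example.com/stuff/preview/v/123/fl/1/t/extra",  # trailing text
]

# fullmatch() requires the entire string to fit the template.
matching = [url for url in candidates if pattern.fullmatch(url)]
print(matching)
```

Using `fullmatch()` rather than `match()` is a deliberate choice here: `match()` only anchors at the start of the string, so a link with extra text after the template would still be accepted.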

To make BeautifulSoup parsing faster, you can let it use lxml under the hood:

    soup = BeautifulSoup(data, "lxml")

Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
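A minimal sketch of the SoupStrainer approach, assuming made-up sample markup, numeric ids, and the built-in "html.parser" backend:

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample markup, only for illustration.
html = """
<html><body>
  <p>Some text that never needs to be parsed into the tree.</p>
  <a href="http://example.com/stuff/preview/v/7/fl/1/t/">match</a>
  <a href="http://example.com/stuff/preview/v/x/fl/1/t/">non-numeric id</a>
  <a href="http://example.com/other/">unrelated</a>
</body></html>
"""

pattern = re.compile(r"^http://example\.com/stuff/preview/v/\d+/fl/1/t/$")

# Only <a> tags whose href matches the pattern are built into the tree.
only_matching_links = SoupStrainer("a", href=pattern)
soup = BeautifulSoup(html, "html.parser", parse_only=only_matching_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

Because the rest of the document is skipped while the tree is built, the resulting soup contains just the matching links.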

Hope that helps.

python regex xpath html-parsing lxml
