python - Fetching Image from URL using BeautifulSoup -
I am trying to fetch all of the important images (not the thumbnails or other GIFs) from a Wikipedia page using the following code. However, "imgs" comes back with a length of 0. Any suggestions on how to fix this?
Code:
import urllib
import urllib2
from bs4 import BeautifulSoup
import os

html = urllib2.urlopen("http://en.wikipedia.org/wiki/main_page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class": "image"})
Also, if someone can explain in detail how to use findAll by looking at the "source element" of a webpage, that would be awesome.
The a tags on the page have the image class, not the div tags:
>>> img_links = soup.findAll("a", {"class": "image"})
>>> for img_link in img_links:
...     print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/stora_kronan.jpeg/100px-stora_kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/christuss%c3%a4ule_8.jpg/77px-christuss%c3%a4ule_8.jpg
...
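To connect findAll with what the page source shows: the first argument is the tag name, and the dictionary filters on that tag's attributes, so the call has to name the element that actually carries the class. Here is a minimal sketch, assuming Python 3 and a made-up HTML snippet (not the real Wikipedia markup). It uses find_all, the current bs4 spelling, for which findAll is the older alias.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on an image link; not copied from Wikipedia.
SAMPLE_HTML = """
<div id="content">
  <a href="/wiki/File:Example.jpg" class="image">
    <img src="//upload.example.org/thumb/Example.jpg" alt="Example">
  </a>
  <a href="/wiki/Other">plain link</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The class sits on the <a> tag, so searching <div> tags finds nothing.
div_matches = soup.find_all("div", {"class": "image"})

# Searching <a> tags with the same attribute filter finds the image link.
a_matches = soup.find_all("a", {"class": "image"})

# Each matched <a> exposes its first nested <img> as .img
srcs = [a.img["src"] for a in a_matches]

print(len(div_matches), len(a_matches), srcs)
```

So when inspecting an element in the browser, read off the tag name and the attribute from the same element, then pass both to the search.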
Or, better, use the a.image > img CSS selector:
>>> for img in soup.select('a.image > img'):
...     print img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/stora_kronan.jpeg/100px-stora_kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/christuss%c3%a4ule_8.jpg/77px-christuss%c3%a4ule_8.jpg
...
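The > in that selector is the CSS child combinator: it matches only img elements that are direct children of an a tag with the image class. A small sketch, assuming Python 3 and made-up HTML, showing which images are picked up and how a plain space (the descendant combinator) differs:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = """
<a class="image" href="/wiki/File:A.jpg"><img src="//img.example.org/a.jpg"></a>
<a class="image" href="/wiki/File:B.jpg"><span><img src="//img.example.org/b.jpg"></span></a>
<a class="external" href="/x"><img src="//img.example.org/c.jpg"></a>
<img src="//img.example.org/d.jpg">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Direct children only: b.jpg is wrapped in a <span>, so it is skipped.
direct = [img["src"] for img in soup.select("a.image > img")]

# Descendant combinator (a space) also matches the wrapped image.
nested = [img["src"] for img in soup.select("a.image img")]

print(direct)
print(nested)
```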
UPD (downloading the images using urllib.urlretrieve):
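# Python 2 (urllib2 and urlparse are Python 2 modules)
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2

url = "http://en.wikipedia.org/wiki/main_page"
soup = BeautifulSoup(urllib2.urlopen(url))
for img in soup.select('a.image > img'):
    # src is protocol-relative ("//upload..."), so resolve it against the page URL
    img_url = urlparse.urljoin(url, img['src'])
    # use the last path segment as the local file name
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)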
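The src values printed earlier are protocol-relative (they start with //), which is why they are passed through urljoin before downloading. A minimal sketch of just that step, assuming Python 3, where urljoin lives in urllib.parse; the helper name resolve_image is hypothetical and not part of the answer's code:

```python
from urllib.parse import urljoin

PAGE_URL = "http://en.wikipedia.org/wiki/main_page"


def resolve_image(page_url, src):
    """Return (absolute_url, file_name) for an img src found on page_url."""
    # A "//host/path" reference inherits the scheme of the page URL.
    absolute_url = urljoin(page_url, src)
    # The last path segment serves as the local file name.
    file_name = src.split("/")[-1]
    return absolute_url, file_name


src = "//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/stora_kronan.jpeg/100px-stora_kronan.jpeg"
img_url, file_name = resolve_image(PAGE_URL, src)
print(img_url)
print(file_name)
```

In Python 3 the download call itself would be urllib.request.urlretrieve(img_url, file_name).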
python url web-scraping beautifulsoup urllib