How can we extract the 'src' attribute from an img tag in Python?
One way to do it is with BeautifulSoup, a Python library for web scraping.
From Webpage URLs
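- from urllib.request import urlopen
- from bs4 import BeautifulSoup as BSHTML
- # download the page and parse it (Python 3, BeautifulSoup 4)
- page = urlopen('http://www.youtube.com/')
- soup = BSHTML(page, 'html.parser')
- images = soup.find_all('img')
- for image in images:
-     # print the image source; .get() returns None if the attribute is missing
-     print(image.get('src'))
-     # print the alternate text
-     print(image.get('alt'))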
From Text
- from bs4 import BeautifulSoup as BSHTML
- htmlText = """ """
- # parse the HTML string (Python 3, BeautifulSoup 4)
- soup = BSHTML(htmlText, 'html.parser')
- images = soup.find_all('img')
- for image in images:
-     print(image.get('src'))
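To see the text-based approach run end to end, here is a minimal sketch with a made-up HTML snippet. It assumes the `bs4` package (BeautifulSoup 4) on Python 3; the sample markup, the file names in it, and the `extract_image_sources` helper are hypothetical and only there for illustration. It uses `.get()` so that an `img` tag without a `src` attribute is skipped instead of raising a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second tag deliberately has no src.
SAMPLE_HTML = """
<html><body>
  <img src="/images/logo.png" alt="Site logo">
  <img alt="Placeholder without a source">
  <img src="https://example.com/photo.jpg">
</body></html>
"""


def extract_image_sources(html_text):
    """Return the src value of every img tag that has one."""
    soup = BeautifulSoup(html_text, "html.parser")
    sources = []
    for image in soup.find_all("img"):
        src = image.get("src")  # None when the attribute is missing
        if src is not None:
            sources.append(src)
    return sources


if __name__ == "__main__":
    for src in extract_image_sources(SAMPLE_HTML):
        print(src)
```

Note that the values come back exactly as written in the markup, so a relative path such as `/images/logo.png` stays relative.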
There are other HTML/XML parsing libraries in Python that could help as well. BeautifulSoup is widely used, has a good number of tutorials, and has a community of users supporting it, which makes it a good choice for a scraper/parser.
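As one illustration of an alternative, the standard library's `html.parser.HTMLParser` can collect `src` values without any third-party install. This is a rough sketch for Python 3, not a replacement for BeautifulSoup; the `ImgSrcParser` class and the `image_sources` helper are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of each img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing tags also end up here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def image_sources(html_text):
    """Parse an HTML string and return the src of each img tag that has one."""
    parser = ImgSrcParser()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

Calling `image_sources(html_text)` on an HTML string returns a plain list of `src` strings in document order.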