Parsing an HTML table with BeautifulSoup into a Python dictionary




This is the HTML code that I'm trying to parse with BeautifulSoup:

<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data1</li>
        <li>foo1<a href="/link/to/bar1">bar1</a></li>
        ... (amount of tags isn't fixed)
      </ul>
    </td>
  </tr>
  <tr>
    <th width="100">menu2</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data2</li>
        <li>foo2<a href="/link/to/bar2">bar2</a></li>
        <li>foo3<a href="/link/to/bar3">bar3</a></li>
        <li>some data3</li>
        ... (amount of tags isn't fixed too)
      </ul>
    </td>
  </tr>
</table>

The output dictionary I would like to get looks like this:

dict = {
    'menu1': ['some data1', 'foo1 bar1'],
    'menu2': ['some data2', 'foo2 bar2', 'foo3 bar3', 'some data3'],
}

As mentioned in the code, the number of <li> tags is not fixed. Additionally, the table could contain: both menu1 and menu2; only menu1; only menu2; or neither menu1 nor menu2 (just <table></table>).

So, for example, it may look like this:

<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data1</li>
        <li>foo1<a href="/link/to/bar1">bar1</a></li>
        ... (amount of tags isn't fixed)
      </ul>
    </td>
  </tr>
</table>

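I tried to adapt an existing example, with no success. I think it's because of the <ul> tags that I can't read the proper data from the table. The problem for me is the variable number of menus and <li> tags. So my question is: how do I parse this particular table into a Python dictionary? I should mention that I have already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could keep that as it is.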

request = c.get('http://example.com/somepage.html')
soup = bs(request.text)

The table is always the first one on the page, so I can get it with:

table = soup.find_all('table')[0]

Thank you in advance for any help.

# Note: this uses the BeautifulSoup 3 API on Python 2
# (import BeautifulSoup, findAll/findChildren, print statement).
html = """<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data1</li>
        <li>foo1<a href="/link/to/bar1">bar1</a></li>
      </ul>
    </td>
  </tr>
  <tr>
    <th width="100">menu2</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data2</li>
        <li>foo2<a href="/link/to/bar2">bar2</a></li>
        <li>foo3<a href="/link/to/bar3">bar3</a></li>
        <li>some data3</li>
      </ul>
    </td>
  </tr>
</table>"""

import BeautifulSoup as bs

soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}

# Every <th> is a menu label; its row's <td> holds the <li> items.
th = table.findChildren('th')  # ,text=['menu1','menu2'])
for x in th:
    # print x
    results_li = []
    # Skip the whitespace text node after </th> to reach the <td>.
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # print y.next
        # y.next is the first text node of the <li>, so link text is not included.
        results_li.append(y.next)
    results[x.next] = results_li

print results

Output:

{ u'menu2': [u'some data2', u'foo2', u'foo3', u'some data3'], u'menu1': [u'some data1', u'foo1'] }
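The output above keeps only the first text node of each <li>, so the link text (bar1, bar2, ...) is missing compared with the dictionary requested in the question. A minimal sketch of one way to include it is shown below, assuming the bs4 package (BeautifulSoup 4) on Python 3 rather than the BeautifulSoup 3 module used above. The function name table_to_dict and the sample html string are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the table from the question.
html = """<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data1</li>
        <li>foo1<a href="/link/to/bar1">bar1</a></li>
      </ul>
    </td>
  </tr>
  <tr>
    <th width="100">menu2</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>some data2</li>
        <li>foo2<a href="/link/to/bar2">bar2</a></li>
        <li>foo3<a href="/link/to/bar3">bar3</a></li>
        <li>some data3</li>
      </ul>
    </td>
  </tr>
</table>"""


def table_to_dict(markup):
    """Map each <th> label of the first table to the texts of the <li> tags in its row.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    result = {}
    if table is None:
        # No table on the page at all.
        return result
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is None:
            # Row without a menu label: nothing to key the items on.
            continue
        # get_text(" ", strip=True) joins the <li> text and the nested
        # <a> text with a single space, e.g. "foo1" + "bar1" -> "foo1 bar1".
        result[header.get_text(strip=True)] = [
            li.get_text(" ", strip=True) for li in row.find_all("li")
        ]
    return result


result = table_to_dict(html)
print(result)
```

Because the loop walks over whatever <tr> rows exist, the same function should cope with a table holding only menu1, only menu2, or no rows at all (an empty <table></table> gives an empty dictionary).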

python html dictionary html-parsing beautifulsoup
