python - Parsing HTML that is horribly structured?

So, I want to create a Python function that allows me to pass in the year, month, and day of the podcast I'd like to download. It should parse through the HTML and return the links for that day's podcast. Example:

>>> get_download_links(year, month, day)
['https://www.tytnetwork.com/?tytpm=44279&type=audio',  # hr 1 (audio)
 'https://www.tytnetwork.com/?tytpm=44277&type=audio']  # hr 2 (audio)

The page I'm trying to parse is http://www.tytnetwork.com/annual-archives/2014-main-show-archives/

Here is an example of the first week of a month (including the weekday labels):

<tr>
 <th class="tytca-mosname" colspan="5">
  <h3>june 2014</h3>
 </th>
</tr>
<tr>
 <th class="tytca-dayname"><h3>mon</h3></th>
 <th class="tytca-dayname"><h3>tue</h3></th>
 <th class="tytca-dayname"><h3>wed</h3></th>
 <th class="tytca-dayname"><h3>thu</h3></th>
 <th class="tytca-dayname"><h3>fri</h3></th>
</tr>
<tr>
 <td class="tytca-td">
  <div class="tytca-daynum">2</div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=audio" title="click download  sound file">hr 1</a>
   <br/>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=audio" title="click download  sound file">hr 2</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=video" title="click download video file">hr 1</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=video" title="click download video file">hr 2</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-1/" title="click watch video">hr 1</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-2/" title="click watch video">hr 2</a>
  </p>
 </td>
 <td class="tytca-td">
  <div class="tytca-daynum">3</div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=audio" title="click download  sound file">hr 1</a>
   <br/>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=audio" title="click download  sound file">hr 2</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=video" title="click download video file">hr 1</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=video" title="click download video file">hr 2</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-1/" title="click watch video">hr 1</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-2/" title="click watch video">hr 2</a>
  </p>
 </td>
 <td class="tytca-td">
  <div class="tytca-daynum">4</div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=audio" title="click download  sound file">hr 1</a>
   <br/>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=audio" title="click download  sound file">hr 2</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=video" title="click download video file">hr 1</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=video" title="click download video file">hr 2</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-1/" title="click watch video">hr 1</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-2/" title="click watch video">hr 2</a>
  </p>
 </td>
 <td class="tytca-td">
  <div class="tytca-daynum">5</div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=audio" title="click download  sound file">hr 1</a>
   <br/>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=audio" title="click download  sound file">hr 2</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=video" title="click download video file">hr 1</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=video" title="click download video file">hr 2</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-1/" title="click watch video">hr 1</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-2/" title="click watch video">hr 2</a>
  </p>
 </td>
 <td class="tytca-td">
  <div class="tytca-daynum">6</div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=audio" title="click download  sound file">hr 1</a>
   <br/>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=audio" title="click download  sound file">hr 2</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=video" title="click download video file">hr 1</a>
   <br/>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=video" title="click download video file">hr 2</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-1/" title="click watch video">hr 1</a>
   <br/>
   <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-2/" title="click watch video">hr 2</a>
  </p>
 </td>
</tr>

I've tried using Beautiful Soup, but the problem is that the page is so poorly structured that there doesn't seem to be a way to get what I want.

At this point, I'm turning to the Python gurus on here to help me.

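import re

import bs4
import requests

url = "http://www.tytnetwork.com/annual-archives/{year}-main-show-archives/"


def getpodcasts(m, d, y):
    my_url = url.format(year=y)
    print(my_url)
    soup = bs4.BeautifulSoup(requests.get(my_url, headers={'User-Agent': 'Mozilla/5.0'}).content)
    # Find the "<month> ... <year>" heading text (case-insensitive, since the
    # heading's capitalization may differ), then climb h3 -> th -> tr.
    heading = re.compile("^%s.*%s" % (m, y), re.IGNORECASE)
    calendar_row_for_month = soup.findAll(text=heading)[0].parent.parent.parent
    # Walk the following rows until one contains the zero-padded day number.
    for sib in calendar_row_for_month.findNextSiblings():
        if ">%02d<" % d in str(sib):
            break
    assert ">%02d<" % d in str(sib), "error date %s/%s/%s not found" % (m, d, y)
    # Step two nodes forward from the day-number text to reach that day's links.
    audios = sib.find(text="%02d" % d).next.next
    return re.findall('https?:[^" ]*', str(audios))


print(getpodcasts("june", 12, 2014))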
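Another way to look at it: the markup above does carry usable class names (tytca-mosname for the month heading, tytca-daynum for the day number, tytca-audio for the audio links), so the lookup can key on those instead of on string matching. Below is a minimal sketch of that idea, assuming Python 3 and only the standard library's html.parser so it runs without extra installs. The ArchiveParser class, the extra html parameter, and the trimmed SAMPLE string are hypothetical names introduced for illustration, and it assumes the month heading always reads "<month name> <year>" as in the excerpt. The same walk could be written with Beautiful Soup.

```python
import calendar
from html.parser import HTMLParser


class ArchiveParser(HTMLParser):
    """Collect audio links keyed by (month label, day number)."""

    def __init__(self):
        super().__init__()
        self.links = {}       # e.g. {("june 2014", 6): [url, ...]}
        self._month = None    # label of the month currently being read
        self._day = None      # day number of the current calendar cell
        self._capture = None  # "month" or "day" while inside those elements
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "th" and "tytca-mosname" in classes:
            self._capture, self._buffer = "month", []
        elif tag == "div" and "tytca-daynum" in classes:
            self._capture, self._buffer = "day", []
        elif tag == "a" and "tytca-audio" in classes and self._day is not None:
            href = attrs.get("href")
            if href:
                # html.parser has already turned &amp; back into & here.
                self.links.setdefault((self._month, self._day), []).append(href)

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture == "month" and tag == "th":
            # Normalize whitespace and case: "  June 2014 " -> "june 2014".
            self._month = " ".join("".join(self._buffer).split()).lower()
            self._day = None
            self._capture = None
        elif self._capture == "day" and tag == "div":
            text = "".join(self._buffer).strip()
            self._day = int(text) if text.isdigit() else None
            self._capture = None


def get_download_links(html, year, month, day):
    """Return the audio links for one day; month is an integer 1-12."""
    parser = ArchiveParser()
    parser.feed(html)
    parser.close()
    label = "{} {}".format(calendar.month_name[month], year).lower()
    return parser.links.get((label, day), [])


# Trimmed copy of the excerpt above, used only to exercise the sketch.
SAMPLE = """
<tr><th class="tytca-mosname" colspan="5"><h3>june 2014</h3></th></tr>
<tr>
 <td class="tytca-td"><div class="tytca-daynum">5</div>
  <p><a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=audio">hr 1</a></p></td>
 <td class="tytca-td"><div class="tytca-daynum">6</div>
  <p><a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=audio">hr 1</a><br/>
     <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=audio">hr 2</a><br/>
     <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=video">hr 1</a></p></td>
</tr>
"""

print(get_download_links(SAMPLE, 2014, 6, 6))
```

For the live page, the html argument would be the body fetched with requests as in the code above. Whether the real archive uses the same class names and heading text throughout is something I haven't verified beyond the excerpt.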

python python-2.7
