forms - In Mechanize (Ruby), how to login then scrape?



This question already has an answer here:

How to fill out login form with mechanize in ruby? 1 answer

My aim: in a RoR 3 app, get a PDF file from a site that requires login before you can download it.

My method, using Mechanize:

Step 1: log in.
Step 2: since I'm logged in, get the PDF link.

The thing is, when I debug and click on the link I scraped, I'm redirected to the login page instead of getting the file.

There are 2 checks I did on step 1:

    (...)
    search_results = form.submit
    puts search_results.body

=> {"succes":true,"url":"/sso/inscription/"} so apparently the login succeeded

    puts agent.cookie_jar.jar

=> I find the session info, so I guess the cookies are saved
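The first check can also be done in code rather than by reading the printed body. The following is only a minimal sketch using Ruby's standard `json` library; the helper name `login_succeeded?` is hypothetical, and it assumes the site keeps answering with the JSON shape shown above (including the `succes` spelling).

```ruby
require 'json'

# Hypothetical helper: returns true only when the body parses as JSON
# and its "succes" flag is true. The key is spelled "succes" because
# that is how the response shown above spells it.
def login_succeeded?(body)
  data = JSON.parse(body)
  data.is_a?(Hash) && data['succes'] == true
rescue JSON::ParserError
  # An HTML page (for example the login form again) is not valid JSON.
  false
end

body = '{"succes":true,"url":"/sso/inscription/"}'
puts login_succeeded?(body)             # true
puts login_succeeded?('<html></html>')  # false
```

With Mechanize, the string to pass in would be `search_results.body`, as printed in the first check.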

Any hint about what I did wrong? (This could be important: on the site, when you log in at "http://elwatan.com/sso/inscription/inscription_payant.php", you are redirected to the home page (elwatan.com).)

Below is my code:

    # step 1, login:
    agent = Mechanize.new
    page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
    form = page.form_with(:id => 'form-login-page')
    form.login = "my_mail"
    form.password = "my_pasword"
    search_results = form.submit

    # step 2, pdf:
    @watan = {}
    page.parser.xpath('//th/a').each do |link|
      puts @watan[link.text.strip] = link['href']
    end

The agent variable retains the session and cookies.

So first do the login, as you did, and then write agent.get(---your-pdf-link-here--).

In your example code there is a little error: the result of the submit is in search_results, but then you go on to use page to search for the links?

So in your case, I guess it should look like this (untested of course):

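    require 'mechanize'

    # step 1, login:
    agent = Mechanize.new
    # save PDF responses to disk instead of parsing them
    agent.pluggable_parser.pdf = Mechanize::FileSaver
    page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
    form = page.form_with(:id => 'form-login-page')
    form.login = "my_mail"
    form.password = "my_pasword"
    # keep the result of the submit in page, since step 2 searches it
    page = form.submit

    # step 2, pdf:
    page.parser.xpath('//th/a').each do |link|
      # the same agent is reused, so the login cookies are sent along
      agent.get link['href']
    end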
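If the scraped href values turn out to be relative paths, it can help to look at the absolute URLs before fetching them. This is a small sketch with Ruby's standard `uri` library only; the helper name `absolute_links` and the sample hrefs are made up for illustration and are not taken from the site.

```ruby
require 'uri'

# Hypothetical helper: turn scraped href values into absolute URLs,
# skipping links that have no href and dropping duplicates.
def absolute_links(base_url, hrefs)
  hrefs.compact.map { |href| URI.join(base_url, href).to_s }.uniq
end

# Made-up hrefs standing in for the link['href'] values from the XPath loop.
hrefs = ['/pdf/edition.pdf', 'archives/old.pdf', nil, '/pdf/edition.pdf']
puts absolute_links('http://elwatan.com/sso/inscription/', hrefs)
```

Printing the list this way makes it easy to confirm that the links point where you expect before passing each one to agent.get.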

ruby forms screen-scraping mechanize
