pandaproject/mozfest2012
WHAT SKILL LEVELS DO WE HAVE
SPLIT INTO PAIRS
INSTALL STUFF
Why write a screen scraper?
To get data that is available but not in a structured format.
What can I scrape?
With patience, almost anything. But the more tabular the data, the more straightforward it will be.
When doesn't this work?
When you can't be certain you've found all the data, for example when a site is search-only or has no predictable URLs.
What is PANDA?
http://pandaproject.net/
Why put data in PANDA?
To share it with your colleagues and to search it.
Tools and technologies:
Python, Node, Ruby, ScraperWiki, Mechanize
What are we going to produce today?
A script you can run to extract structured data from an unstructured website.
What we aren't going to cover:
Sessions/cookies, regular expressions, POST URLs/search params, broken HTML.
Question:
Does the percentage of runners who finish the race vary with wind speed?
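To make the question concrete, here is a minimal Python sketch of the number we are after. The function name `finish_percentage` and the two rows of race data are made up for illustration; they are not results from any real race.

```python
def finish_percentage(registered, finished):
    """Return the share of registered runners who finished, in percent."""
    if registered == 0:
        return None  # avoid dividing by zero when nobody registered
    return 100.0 * finished / registered


# Made-up example rows, shaped like what the scraper is meant to produce.
races = [
    {"year": 2010, "registered": 1200, "finished": 1080, "wind_speed": 12.5},
    {"year": 2011, "registered": 1000, "finished": 950, "wind_speed": 5.0},
]

# List the races from calmest to windiest next to their finish percentage.
for race in sorted(races, key=lambda r: r["wind_speed"]):
    pct = finish_percentage(race["registered"], race["finished"])
    print(race["year"], race["wind_speed"], pct)
```

With enough years of data, comparing the two columns (or plotting them) would show whether windier races have fewer finishers.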
Step 1:
Explain boilerplate
How to fetch a webpage
Scraping the year
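As a rough sketch of Step 1, the code below fetches a page and pulls the year out of its main heading. It uses only the Python standard library, which is an assumption on our part; the session may use a different parsing library. The sample HTML, the idea that the year sits in the `<h1>`, and the names `fetch`, `HeadingParser` and `scrape_year` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class HeadingParser(HTMLParser):
    """Collect the text of the first <h1> element on the page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True  # ignore any later <h1> elements

    def handle_data(self, data):
        if self.in_h1:
            self.heading += data


def scrape_year(html):
    """Return the first four-digit word in the <h1> as an int, or None."""
    parser = HeadingParser()
    parser.feed(html)
    for word in parser.heading.split():
        if len(word) == 4 and word.isdigit():
            return int(word)
    return None


# Hypothetical sample page; a real run would call fetch(url) instead.
SAMPLE_HTML = "<html><body><h1>Harbour Marathon 2011 Results</h1></body></html>"
print(scrape_year(SAMPLE_HTML))
```

Working against a saved sample string first makes it easy to check the parsing logic before pointing `fetch` at the live site.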
Step 2:
Scraping the registered and finished runners
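Step 2 could look something like the sketch below. It assumes the runner counts sit in a simple two-column table with the labels "Registered" and "Finished"; that layout, the sample HTML, and the names `CellParser` and `scrape_runners` are assumptions for illustration, so adjust them to match the real page.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")  # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def scrape_runners(html):
    """Return (registered, finished) as ints from a label/value table."""
    parser = CellParser()
    parser.feed(html)
    counts = {}
    for row in parser.rows:
        if len(row) != 2:
            continue  # skip rows that are not a label/value pair
        label, value = row
        digits = value.strip().replace(",", "")  # "1,200" -> "1200"
        if digits.isdigit():
            counts[label.strip().lower()] = int(digits)
    return counts["registered"], counts["finished"]


# Hypothetical sample table.
SAMPLE_HTML = """
<table>
  <tr><th>Registered</th><td>1,200</td></tr>
  <tr><th>Finished</th><td>1,080</td></tr>
</table>
"""
print(scrape_runners(SAMPLE_HTML))
```

Note that `scrape_runners` raises a `KeyError` if either label is missing, which is a deliberate choice here: a loud failure is easier to notice than a silently empty column.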
Step 3:
Scraping the wind speed
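For Step 3, one possible approach is to look up the element that holds the wind speed by its `id`. The sketch below assumes an element such as `<span id="wind">12.5 mph</span>` that contains only text; the id `wind`, the sample HTML, and the names `IdTextParser` and `scrape_wind_speed` are hypothetical.

```python
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Collect the text inside the first element with a given id.

    Assumes the element contains only text (no nested tags).
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_target = False
        self.text = None  # stays None until the element is found

    def handle_starttag(self, tag, attrs):
        if self.text is None and dict(attrs).get("id") == self.target_id:
            self.in_target = True
            self.text = ""

    def handle_endtag(self, tag):
        self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.text += data


def scrape_wind_speed(html, element_id="wind"):
    """Return the leading number in the element's text as a float, or None."""
    parser = IdTextParser(element_id)
    parser.feed(html)
    if parser.text is None:
        return None  # no element with that id on the page
    parts = parser.text.split()
    if not parts:
        return None
    return float(parts[0])  # raises ValueError if the text is not numeric


# Hypothetical sample snippet.
SAMPLE_HTML = '<p>Wind: <span id="wind">12.5 mph</span></p>'
print(scrape_wind_speed(SAMPLE_HTML))
```

If the real page has no convenient `id`, the same idea works with a class name or with the table-cell approach from Step 2.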
Step 4:
Scraping all the URLs
Writing to a CSV
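Step 4 depends on the URLs being predictable. The sketch below assumes one results page per year at a made-up address (`example.com` is a placeholder, not the real site) and writes the scraped rows with Python's `csv` module. The column names and the helpers `build_urls` and `write_csv` are illustrative choices.

```python
import csv
import io

# Hypothetical URL pattern: one results page per year.
BASE_URL = "http://example.com/results/{year}.html"
FIELDNAMES = ["year", "registered", "finished", "wind_speed"]


def build_urls(first_year, last_year):
    """Return one URL per year, inclusive of both end points."""
    return [BASE_URL.format(year=year) for year in range(first_year, last_year + 1)]


def write_csv(rows, out_file):
    """Write a list of dicts to an open file as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


urls = build_urls(2010, 2012)
print(urls)

# Made-up row standing in for the values scraped in Steps 1 to 3.
rows = [{"year": 2010, "registered": 1200, "finished": 1080, "wind_speed": 12.5}]

# Write to an in-memory buffer here; a real script would instead use
# open("races.csv", "w", newline="") and pass that file to write_csv.
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

In the full script, the loop over `urls` would fetch each page, run the three scraping functions, and append one dict per year to `rows` before writing.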
Step 5:
Finished script that scrapes everything
Mozilla Festival 2012 PANDA Project Session