Difference between revisions of "Web scraping"
Author of both revisions: VKranendonk
Revision as of 09:41, 2 September 2022
Web scraping is the automated extraction of data, such as text and images, from websites. In this example we will scrape data from the Project Gutenberg website.
The purpose of web scraping is to transform web content into usable data for other programs or for analysis. In this case we transform the following website into CSV data, which can be opened in Microsoft Excel or Numbers.
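To give a rough idea of the target format, the sketch below writes a few scraped text fragments to CSV using Python's standard `csv` module. The sample strings and the column names `index` and `content` are assumptions made for illustration; the extension chooses its own column layout when it exports.

```python
import csv
import io

# Hypothetical scraped elements; real values would come from the web page.
scraped = [
    "CHAPTER I. Down the Rabbit-Hole",
    'Alice said, "Oh dear!"',
]


def to_csv(items):
    """Return the items as CSV text, one row per scraped element."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "content"])  # assumed column names
    for i, text in enumerate(items, start=1):
        # The csv module quotes fields containing commas or quotes.
        writer.writerow([i, text])
    return buffer.getvalue()


csv_text = to_csv(scraped)
print(csv_text)
```

Because fields containing commas or quotation marks are quoted automatically, a spreadsheet program can read each scraped element back as a single cell.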
Installing and opening the extension
Step 1:
We will use a browser extension called WebScraper.io. You can install the extension for Firefox or for Chrome.
To learn about all of the functionality in the WebScraper.io extension, you can watch the intro video.
Step 2:
Navigate to Alice’s Adventures in Wonderland on the Gutenberg website.
Step 3:
Right-click anywhere on the page and click "Inspect". This opens the inspector, a tool commonly used for debugging websites.
Step 4:
You should now have an extra tab called "Web Scraper Dev". Open this tab.
Setting up the scraper
Step 5
Create a new sitemap and give it a name, for example "alice". The start URL is the page you are currently on: https://www.gutenberg.org/files/11/11-h/11-h.htm
Step 6
Our goal will be to scrape each title and paragraph.
- Click on "Add new selector".
- Enter a meaningful "Id", for example "content".
- Change "Type" from "Text" to "HTML". We do this because a paragraph can itself contain HTML markup.
- Click "Select". You can now select the elements you would like to scrape. Start with the title, then hold "shift" while selecting the paragraphs.
- Click on "Done selecting".
- Check the "multiple" checkbox. Otherwise only the first element will be scraped.
- Click on "Save selector"
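The effect of these selector settings can be approximated in a few lines of Python. The following is a minimal sketch using only the standard library; it is not how WebScraper.io works internally. The tag names `h2` and `p`, the sample markup, and the helper names `ElementCollector` and `scrape` are assumptions for illustration, and the sketch expects well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # elements that never have a closing tag


class ElementCollector(HTMLParser):
    """Collect the inner HTML of every element whose tag is in `tags`."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0     # greater than 0 while inside a matching element
        self.parts = []    # pieces of the current element's inner HTML
        self.results = []  # inner HTML of each finished element

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Nested markup is kept verbatim (the "HTML" type).
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag in self.tags:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.parts).strip())
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def scrape(html, tags, multiple=True):
    """Return matching elements; only the first one unless `multiple` is set."""
    collector = ElementCollector(tags)
    collector.feed(html)
    collector.close()
    return collector.results if multiple else collector.results[:1]


# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<h2>CHAPTER I.<br>Down the Rabbit-Hole</h2>
<p>Alice was <i>very</i> tired.</p>
<p>She peeped into the book.</p>
"""

for item in scrape(SAMPLE, ["h2", "p"]):
    print(item)
```

Calling `scrape(SAMPLE, ["h2", "p"], multiple=False)` returns only the title, which mirrors what happens when the "multiple" checkbox is left unchecked.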
Scraping and exporting data
Step 7












