Today we’re going to learn how to crawl the web. The goal of today’s lab
is for you to learn which elements websites contain and how to extract
this structured information. You are free to choose any programming
language for this lab, but we suggest Java or Python.

Before you start, you may want to read up on the [basics of
HTML](http://www.w3schools.com/html/html_basic.asp). Additionally, a
useful resource on crawling structured content from a website is [this
Beautiful Soup web scraping
guide](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html).
It was written for Python, but similar tools exist for other programming
languages as well. In Java you can use jsoup to fetch and analyze web
pages. The [jsoup
documentation](https://jsoup.org/cookbook/extracting-data/dom-navigation)
explains how to navigate a document.

We suggest you use Eclipse to program in Java. It is already installed
on the lab computers. You can start it by typing `eclipse4 &` in the
terminal.

We’ve prepared a [website](http://10.0.0.1/academyawardnominees/) that
shows a table with all actors and actresses who have been nominated for
an Academy Award. Familiarize yourself with the page’s source code using
your browser’s source inspector, then solve the following exercises by
writing a program or script. If you need help and cannot find a solution
online, feel free to ask the assistants.

1. Extract all relevant entries from the Academy Award nominees table
   on the linked website.

2. Generate a text file with the information from the website, each
   entry on a new line.

Hints:

- Press `Ctrl` + `Shift` + `C` in Firefox to open the inspector.

- Have a look at the available methods in jsoup to select elements:
  `getElementById`, `getElementsByTag`, `children`, `select`, etc.

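One possible way to approach both exercises in Python, using only the
standard library's `html.parser` rather than jsoup or Beautiful Soup, is
sketched below. It assumes the nominees sit in ordinary `<tr>` rows with
`<td>`/`<th>` cells, which you should verify in the inspector. The names
`TableRowParser`, `extract_rows`, and `write_rows`, as well as the
tab-separated output format, are illustrative choices and not part of the
lab material. Note that the sketch collects rows from every table on the
page, so you may still need to filter out irrelevant ones.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _finish_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _finish_row(self):
        self._finish_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag in ("tr", "table"):
            self._finish_row()


def extract_rows(html):
    """Return all table rows found in the HTML text as lists of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write_rows(rows, path):
    """Write one row per line, with cells separated by tabs."""
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")


# Example usage on the lab network (UTF-8 encoding is an assumption):
# from urllib.request import urlopen
# html = urlopen("http://10.0.0.1/academyawardnominees/").read().decode("utf-8")
# write_rows(extract_rows(html), "nominees.txt")
```

Keeping parsing (`extract_rows`) separate from fetching and writing makes
it easy to try the parser on a small HTML snippet before running it
against the real page.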
To keep web traffic low and reduce the risk of being blacklisted, we
have cloned some Rotten Tomatoes pages and are hosting them locally.
Each movie’s detail page has a unique URL built from the year and the
movie title: <http://10.0.0.1/m/year/title>. To form the title part,
convert the movie title to lower case, remove any apostrophes (’), and
replace spaces and slashes (/) with underscores (\_).

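The URL rule above can be sketched as a small Python helper. The function
name `build_movie_url` and the `base` parameter are hypothetical, and
stripping both the straight (') and the typographic (’) apostrophe is an
assumption, since the table may contain either form. How titles with
other special characters are mapped is not specified here, so the sketch
leaves them untouched.

```python
def build_movie_url(year, title, base="http://10.0.0.1/m"):
    """Build the URL of the local movie detail page from year and title."""
    slug = title.lower()
    # Assumption: remove both straight and typographic apostrophes.
    for apostrophe in ("'", "\u2019"):
        slug = slug.replace(apostrophe, "")
    # Spaces and slashes both become underscores.
    slug = slug.replace(" ", "_").replace("/", "_")
    return f"{base}/{year}/{slug}"
```

For a made-up entry, `build_movie_url(2001, "Tom's Diner/Cafe")` returns
`http://10.0.0.1/m/2001/toms_diner_cafe`.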
1. Visit any of the local movie pages. Which element contains the
   [tomatometer](https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score)
   score of the movie? Which element contains the audience score?

2. Access each of the cloned pages and extract the tomatometer and
   the audience score. Some movies are missing on our local server.
   Also, occasionally, you’ll see movies that don’t have a tomatometer
   score. Think about how you want to handle a missing movie page
   or a missing tomatometer score.

3. Additionally, extract the genre and the runtime for each movie from
   the cloned pages.

4. Write the information about the movies into a text file, each movie
   on a new line.

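As one possible answer to the question of missing pages and scores, the
sketch below returns `None` when a page cannot be loaded and writes a
placeholder for any value that was not found. The names `fetch_page`,
`format_record`, and `MISSING`, the `"NA"` placeholder, the UTF-8
decoding, and the tab-separated column order are all assumptions made
for illustration. Locating the score, genre, and runtime elements is
deliberately left out, since finding them is part of the exercise.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

MISSING = "NA"  # placeholder written for values that could not be found


def fetch_page(url, opener=urlopen):
    """Return the page body as text, or None if the page cannot be loaded."""
    try:
        with opener(url, timeout=10) as response:
            # Assumption: the cloned pages are UTF-8 encoded.
            return response.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        # Covers pages missing on the server as well as connection problems.
        return None


def format_record(title, year, tomatometer=None, audience=None,
                  genre=None, runtime=None):
    """Format one movie as a tab-separated line, marking missing values."""
    fields = [title, year, tomatometer, audience, genre, runtime]
    return "\t".join(MISSING if f is None else str(f) for f in fields)
```

Writing an explicit placeholder instead of skipping the movie keeps one
line per movie in the output file, which makes it easier to see later
which pages or scores were unavailable. Skipping such movies entirely is
an equally valid choice.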
Scrape any information from a website of your choosing.