Roland Schmid · d691cf4b
--- a/4-Web-Crawling.md
+++ b/4-Web-Crawling.md
+Today we’re going to learn how to crawl the web. The goal of today’s lab
+is to learn which elements web pages are made of and how to extract
+structured information from them. You are free to choose the programming
+language for this lab, but we suggest Java (or Python).
+Before you start, you may want to read up on the [basics of
+HTML](http://www.w3schools.com/html/html_basic.asp). Additionally, this
+[guide to web scraping with Beautiful
+Soup](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)
+is a useful resource on crawling structured content from a website. It
+was written for Python, but similar tools exist for other programming
+languages as well. In Java you can use jsoup to fetch and analyze web
+pages. The [jsoup
+documentation](https://jsoup.org/cookbook/extracting-data/dom-navigation)
+explains how you can navigate a document.
+We suggest you use Eclipse to program in Java. It is already installed
+on the computers. You can start it by typing `eclipse4 &` in the
+terminal.
+We’ve prepared a [website](http://10.0.0.1/academyawardnominees/) that
+shows a table with all actors and actresses who have been nominated for
+an Academy Award. Familiarize yourself with the page’s source code by
+using the source inspector of your browser and solve the following
+exercises by writing a program/script. If you get stuck and cannot find
+a solution online, feel free to ask the assistants.
+1.  Extract all relevant entries from the Academy Award nominees table
+    on the linked website.
+2.  Generate a text file with the information from the website, each
+    entry on a new line.
+Hints:
+-   Press `Ctrl` + `Shift` + `C` in Firefox to open the inspector.
+-   Have a look at the available methods in jsoup to select elements:
+    `getElementById`, `getElementsByTag`, `children`, `select`, etc.
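+For those working in Python, the two exercises above can be sketched
+with the standard library alone. This is a minimal sketch, not a
+reference solution: it uses `html.parser` instead of jsoup or Beautiful
+Soup, and it assumes the nominees sit in plain `<tr>`/`<td>` rows. The
+names `TableRowParser`, `extract_rows`, `write_rows` and the `SAMPLE`
+markup are made up for illustration; check the real page in the
+inspector first.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, one list of cell texts each
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_rows(html):
    """Return all table rows found in the HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write_rows(rows, path):
    """Write one tab-separated entry per line."""
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")


# Hypothetical sample markup; the columns on the real page may differ.
SAMPLE = """
<table>
  <tr><th>Year</th><th>Name</th><th>Movie</th></tr>
  <tr><td>1994</td><td>Tom Hanks</td><td>Forrest Gump</td></tr>
</table>
"""
```

+Calling `write_rows(extract_rows(SAMPLE), "nominees.txt")` writes one
+tab-separated line per table row, header row included. Whether to keep
+the header, and which columns count as relevant, is left to you.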
+To keep web traffic low and reduce the risk of being blacklisted, we
+have cloned some Rotten Tomatoes pages and are hosting them locally.
+Each movie’s detail page has a unique URL that combines the year and
+the movie title like this: <http://10.0.0.1/m/year/title>. To build the
+title part, transform the movie title to lower case, remove any
+apostrophe characters (’), and replace spaces and slashes (/) with
+underscore characters (\_).
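+The URL rule above could be written in Python roughly as follows. This
+is a sketch under assumptions: the function name `movie_url` is made
+up, both the straight (') and the typographic (’) apostrophe are
+removed, and all other characters are left untouched.

```python
BASE_URL = "http://10.0.0.1/m"


def movie_url(year, title):
    """Build the URL of the locally hosted detail page for a movie."""
    slug = title.lower()
    # Remove apostrophes (straight and typographic).
    for apostrophe in ("'", "\u2019"):
        slug = slug.replace(apostrophe, "")
    # Replace spaces and slashes with underscores.
    for separator in (" ", "/"):
        slug = slug.replace(separator, "_")
    return f"{BASE_URL}/{year}/{slug}"


print(movie_url(1993, "Schindler's List"))
# prints http://10.0.0.1/m/1993/schindlers_list
```

+The function only builds the URL. Whether a page actually exists at
+that address is a separate question.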
+1.  Visit any of the local movie pages. Which element contains the
+    [tomatometer](https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score)
+    score of the movie? Which element contains the audience score?
+2.  Access each of the cloned movie pages and extract the tomatometer
+    and the audience score. Some movies are missing on our local
+    server, and occasionally a movie page has no tomatometer score.
+    Decide how your program should handle a missing movie page or a
+    missing tomatometer score.
+3.  Additionally, extract the genre and the runtime for each movie from
+    the cloned websites.
+4.  Write the information about the movies into a text file, each movie
+    on a new line.
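+One possible way to handle missing pages and missing scores in Python
+is sketched below. It is only one design choice among several: a page
+that cannot be retrieved becomes `None`, and any missing value is
+written as the placeholder `N/A`. The names `fetch_page`,
+`format_record` and `MISSING` are made up for this sketch, and the
+score extraction itself is deliberately left out.

```python
import urllib.error
import urllib.request

MISSING = "N/A"  # placeholder for absent values (an arbitrary choice)


def fetch_page(url, opener=urllib.request.urlopen):
    """Return the page's HTML, or None if the page cannot be retrieved."""
    try:
        with opener(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        # URLError also covers HTTPError, e.g. a 404 for a missing movie.
        return None


def format_record(*fields):
    """Join the fields with tabs, writing MISSING for every None."""
    return "\t".join(MISSING if f is None else str(f) for f in fields)
```

+A crawler loop can then skip the movie or write a line full of
+placeholders when `fetch_page` returns `None`, and pass `None` to
+`format_record` for any score it could not find on the page. The
+`opener` parameter exists only so the function can be tried without a
+network connection.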
+Scrape any information from a website of your choosing.