Lab Assignment 6 (CS229 only): Web Scraping Wikipedia

In this assignment you will practice web-scraping, by extracting information about programming languages from their wikipedia page. In particular we will be interested in three questions:

You will be working with this script. Download it and save it in whatever location you want to use.

Your assignment will be roughly split in 3 parts:

  1. Read the list of programming languages with Wikipedia pages
  2. Read individual information for each programming language
  3. Ask various questions about the collection of programming languages

Collecting the list of programming languages

The listPage variable in the file, defined around line 16, contains the content of the Wikipedia page that lists all programming languages. You will need to process this list.

Start by viewing the contents of the page. You should do this in two ways. From a Python shell, you can do print(listPage.prettify()). And from a browser, you can open the page with the languages list, then open the developer tools of your browser, and in particular the "element inspector". Your goal is to find a way to identify the elements that form the various languages.

The script is not ready to be run. You will need to start a Python interpreter, and copy-paste parts of the script as you make progress. You should start by copy-pasting the first 19 lines of the script into your Python shell.

Look closely at the web-page contents. You should find that the lists that we are after are broken down by letter of the alphabet. Find the elements that define those sections. You should find them to be divs with some class attributes that you can target.

This concludes the first set of problems. At the end of the day you have a list of all languages for which Wikipedia has a page. In the next section we will reach into each of these pages and read out some of its information.

Read individual information for a language

We will now look into reading information for a particular language. As an example we will use the page for the C programming language. Make sure to open that page so you can see it.

Work with the language structure

We will now perform various tasks on this list of languages.

We will start by looking at completeness of the master-list of programming languages, as follows: For each language in our list, we will look at its "influenced" list and its "influencedBy" list, and see if all the entries there are languages in our dictionary. If they are not, that means that Wikipedia does have an entry for the language but it does not have it listed in the "list of programming languages" page.

We will now look for consistency between the influenced/influenced-by lists. For each language we will look at the languages it influenced, and make sure that each such language mentioned the original language in its "influencedBy" list. We will print any inconsistencies. Remember that the "influencedBy" list holds links.

We will now turn our attention to the programming paradigms of the various languages, and the programming disciplines. You should study the results of these last two parts.

Optional Questions

These are some optional extra questions. You should NOT submit these. But they are here in case you want to expand on this previous work. Instructions for submitting your homework are after this list.

  1. (easy) missingLanguages contains a number of languages that we have not loaded into allLanguages. Do so. These new languages might then show you more new languages we had missed before. Add those and repeat until no new languages are produced.
  2. (easy) Some language links appear with different capitalization when in the "influenced" or "influencedBy" arrays, as opposed to the "link" fields. Uniformize all cases by turning all three to lowercase.
  3. (medium) Enhance the "influencedBy" and "influenced" fields in the languages objects based on the languages missing, as you found in exercise 8.
  4. (easy) Order the languages based on how "influential" they've been, i.e. how many languages they have influenced. Similarly, order them by how many other languages influenced them.
  5. (medium) For all pairs of distinct paradigms, compute the number of languages that have both paradigms, then sort the pairs based on that number, starting with the highest. This will show you which paradigms are most often combined.
  6. (medium) Do the same for pairs of type disciplines.
  7. (medium) This and the remaining problems require some knowledge of graph theory. We can consider the set of languages as a directed graph, where the vertices are the languages and an edge from language A to B means that language A influenced language B. We can represent this graph in two ways: As a list of (ordered) pairs (A, B), or as an object whose keys are the languages and the value for the key corresponding to language A is the list of all languages B that are influenced by B. The latter form already exists within the allLangDict dictionary. Use it to construct the former form.
  8. (medium) Write a function that given a language A constructs the list of all languages that are (possibly indirectly) influenced by it. In terms of our graph picture this is the set of all edges that can be reached from A. You can use a recursive function, but beware of cycles.
  9. (medium) Compute for each language/vertex the number of vertices that can be reached (as described in the previous problem). Then order the languages according to which reaches the most languages/vertices. You may either run the function you created in the previous part on all vertices, or you can try an algorithm that tries to compute them all at once as follows:
  10. (hard) Write a function that determines for any two languages/vertices whether they have a "common ancestor". This should include the case where one of the languages is an ancestor of the other.
  11. (hard) Study a "topological sorting" algorithm and implement it on this graph. Topological sorting arranges the vertices in a linear order so that no edges "go backwards". Most standard topological sorting algorithms use depth-first-search techniques.
  12. (hard) Compute the "strongly connected components" of this graph. The topological sort algorithm and some google search can help with that.
  13. (hard) Find the longest path in the graph.
  14. (hard) Implement some form of the PageRank algorithm on this graph, and order the languages by their rank starting from the highest. For the purposes of this problem, use the "reverse" graph: If A influences B, then you should think of that as an edge from B to A. A language should gain value from the languages it influenced.

Submit

When you are ready to submit, make sure the assignment file you have been working on is saved and contains all the answers clearly marked, then email it to me.