In this assignment you will practice more working with MongoDb and its shell. We will be using the text from the Game of Thrones (Song of Ice and Fire) series of books, that can be found here.
I have prepared a Python script that you will need to run first. This script downloads the text from those webpages, and organizes it in a big JSON file. You will then write that JSON file to your database. You will need to do the following:
curl https://raw.githubusercontent.com/skiadas/DataWranglingCourse/gh-pages/labs/got.py > got.py
python3 got.py
mongoimport -h ds011168.mlab.com:11168 -d wranging -c got -u haris -p haris --file got.json --jsonArrayYou will need to change the name of the server, database and username and password on the third line, to match your database. The first line downloads the python script, the second executes it and creates a got.json file, and the thirs pushes that to the database. The second and third step will take some time.
Now your database has a new collection called got. Log into your database, then ask it to show you one entry:
db.got.findOne()
You should see a long document, which has three main properties:
title.chapter number.text, which is a list/array. Each of the entries corresponds to a paragraph, and it is itself a list/array of the sentences that comprise the paragraph.For some queries this data structure might be useful. But for other queries, we will bring it in a more familiar form, by "unwinding" those lists. $unwind takes a field of the document whose contents is an array, and creates clones of the document, one for each value in the array. It offers you two forms, where on the second form it also creates a new field to keep track of the array index. We use this in the following code, to unwind the nested arrays of paragraphs and sentences into separate documents. This is what this next code does, and it writes the result in a new collection, called got2:
db.got.aggregate([
{ $unwind: {
path: "$text",
includeArrayIndex: "paragraph"
}},
{ $unwind: {
path: "$text",
includeArrayIndex: "sentence"
}},
{ $project: {
_id: false,
chapter: true,
title: true,
paragraph: true,
sentence: true,
text: true
}},
{ $out: "got2" }
])Now run the following to see what these entries look like:
db.got2.findOne()You will see a document that corresponds to a sentence. It contains 5 main entries:
title and chapter number as before.paragraph number and a sentence number that count starting at 0.text for that sentence.For future use, we will also "index" the text in those documents, with the following:
db.got2.createIndex({ text: "text" })Finally we will create one more collection. This collection contains individual word entries. This will also take some time to run:
var requests = [];
db.got2.find().forEach(function(doc) {
requests.push({
title: doc.title,
chapter: doc.chapter,
paragraph: doc.paragraph,
sentence: doc.sentence,
words: doc.text.split(/\W+/)
});
})
db.got3.insertMany(requests)After you have done this, try db.got3.findOne() to see how one entry looks like. You will see that each document corresponds to a sentence with the words in it split into an array.
Here are the questions to answer. Add your answers to this assignment6.js file. The file is not meant to run as is, it's just a place where you put your answers.
got2 collection, with a $group stage.got2 collection, with a $group stage with a suitable _id entry to group the appropriate entries, followed by a $sort stage.db.got2.count({ $where: "this.text.length <= 3" }) Change it to instead remove these entries from the got2 collection altogether.got2 collection, consisting of two $group stages, one to compute the number of sentences within each paragraph, and one to average those numbers over all paragraphs of each chapter.got3 with $unwind as a first step. Order the words based on their frequency, starting from the most frequent. By manual inspection, go through the first 30-40 entries and identify at least 5 results that you consider "interesting" in some way, and why._id, and also a count field that counts the occurences of the word within the titled book. Order the results by descending frequency.$group stage with a suitable $push operator in it to create a document for each book, with a field that contains an array of objects, each containing a word and its frequency. See here for an example.$project with a $slice operator to only keep the first 20 entries. See this doc for an example.$unwind to expand this array of entries into separate documents. You should end up with 20 entries for each document. See here for use of $unwind.got2collection find the number of sentences where each person is mentioned. This would be six different queries, or you can do a for-loop. You will want to use the $text operator. Note especially the "exact phrase" syntax, if you want to match a person by both their full name (first and last name). This is not an aggregate query, you will probably want to use a count query, or a find query when you want to look at the answers.When you are ready to submit, make sure the assignment6.js file you have been working on is saved and contains all the answers clearly marked, then email it to me.