In this assignment you will practice more working with MongoDb and its shell. We will be using the text from the Game of Thrones (Song of Ice and Fire) series of books, that can be found here.
I have prepared a Python script that you will need to run first. This script downloads the text from those webpages, and organizes it in a big JSON file. You will then write that JSON file to your database. You will need to do the following:
curl https://raw.githubusercontent.com/skiadas/DataWranglingCourse/gh-pages/labs/got.py > got.py
python3 got.py
mongoimport -h ds011168.mlab.com:11168 -d wranging -c got -u haris -p haris --file got.json --jsonArray
You will need to change the name of the server, database and username and password on the third line, to match your database. The first line downloads the python script, the second executes it and creates a got.json
file, and the thirs pushes that to the database. The second and third step will take some time.
Now your database has a new collection called got
. Log into your database, then ask it to show you one entry:
db.got.findOne()
You should see a long document, which has three main properties:
title
.chapter
number.text
, which is a list/array. Each of the entries corresponds to a paragraph, and it is itself a list/array of the sentences that comprise the paragraph.For some queries this data structure might be useful. But for other queries, we will bring it in a more familiar form, by "unwinding" those lists. $unwind
takes a field of the document whose contents is an array, and creates clones of the document, one for each value in the array. It offers you two forms, where on the second form it also creates a new field to keep track of the array index. We use this in the following code, to unwind the nested arrays of paragraphs and sentences into separate documents. This is what this next code does, and it writes the result in a new collection, called got2
:
db.got.aggregate([
{ $unwind: {
path: "$text",
includeArrayIndex: "paragraph"
}},
{ $unwind: {
path: "$text",
includeArrayIndex: "sentence"
}},
{ $project: {
_id: false,
chapter: true,
title: true,
paragraph: true,
sentence: true,
text: true
}},
{ $out: "got2" }
])
Now run the following to see what these entries look like:
db.got2.findOne()
You will see a document that corresponds to a sentence. It contains 5 main entries:
title
and chapter
number as before.paragraph
number and a sentence
number that count starting at 0.text
for that sentence.For future use, we will also "index" the text in those documents, with the following:
db.got2.createIndex({ text: "text" })
Finally we will create one more collection. This collection contains individual word entries. This will also take some time to run:
var requests = [];
db.got2.find().forEach(function(doc) {
requests.push({
title: doc.title,
chapter: doc.chapter,
paragraph: doc.paragraph,
sentence: doc.sentence,
words: doc.text.split(/\W+/)
});
})
db.got3.insertMany(requests)
After you have done this, try db.got3.findOne()
to see how one entry looks like. You will see that each document corresponds to a sentence with the words in it split into an array.
Here are the questions to answer. Add your answers to this assignment6.js file. The file is not meant to run as is, it's just a place where you put your answers.
got2
collection, with a $group
stage.got2
collection, with a $group
stage with a suitable _id
entry to group the appropriate entries, followed by a $sort
stage.db.got2.count({ $where: "this.text.length <= 3" })
Change it to instead remove these entries from the got2
collection altogether.got2
collection, consisting of two $group
stages, one to compute the number of sentences within each paragraph, and one to average those numbers over all paragraphs of each chapter.got3
with $unwind
as a first step. Order the words based on their frequency, starting from the most frequent. By manual inspection, go through the first 30-40 entries and identify at least 5 results that you consider "interesting" in some way, and why._id
, and also a count
field that counts the occurences of the word within the titled book. Order the results by descending frequency.$group
stage with a suitable $push
operator in it to create a document for each book, with a field that contains an array of objects, each containing a word and its frequency. See here for an example.$project
with a $slice
operator to only keep the first 20 entries. See this doc for an example.$unwind
to expand this array of entries into separate documents. You should end up with 20 entries for each document. See here for use of $unwind
.got2
collection find the number of sentences where each person is mentioned. This would be six different queries, or you can do a for-loop. You will want to use the $text
operator. Note especially the "exact phrase" syntax, if you want to match a person by both their full name (first and last name). This is not an aggregate
query, you will probably want to use a count
query, or a find
query when you want to look at the answers.When you are ready to submit, make sure the assignment6.js
file you have been working on is saved and contains all the answers clearly marked, then email it to me.