Data Formats

The resulting textual data might be in any number of different formats. Being able to read and process all (most) of these formats is paramount, and most programming languages offer ready access via numerous libraries. Here is just a small collection of the different formats that your data might come in.

Binary: While we will discuss mostly text formats, it is worth mentioning that, especially in the past, data was often communicated in an ad-hoc, often proprietary, format that is rarely human-readable. This was done partly to save space, and the practice is rarer nowadays. Most image data is still communicated in this form, however. Try opening an image file in your favorite text editor one day. Excel files used to follow such a binary format until an XML-based format (the xlsx extension) came into use.
Text: Oftentimes our "data" is purely textual and calls for further analysis. For example, it might be a book or poem, a contract, or the transcript of a conversation. Textual data can appear in any number of formats, from a Word document (docx format) to a plain text file (txt) to the Markdown and reStructuredText formats that allow some semantic markup of the plain text. We will probably not look much into these.
CSV: A popular tabular data format is the so-called "comma-separated values" format. Information is written in rows, with individual entries in a row separated by a comma. Double quotes are typically used to identify strings and to allow a comma to appear in the middle of an entry. Tabular data in Excel can be exported in this form.
HTML: Quite often our data is actually the information in a web page document. We'll learn ways to delve into such a document and extract key pieces of information.
XML: A fairly popular structured data format where tags are used, similar to HTML, to convey information.
JSON: A popular loosely-structured data transmission format, inspired by JavaScript's syntax for objects. It is extensively used these days, and we will now delve more into it.
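
To make the CSV quoting rule in the list above concrete, here is a minimal sketch using Python's standard csv module. The two-line sample is made up for illustration; it only shows how double quotes let a comma appear inside a single entry.

```python
import csv
import io

# Hypothetical sample: a header row and one data row.
# The first entry of the data row contains a comma, so it is quoted.
sample = 'author,title\n"Ralls, Kim","Midnight Rain"\n'

# csv.reader accepts any iterable of lines; StringIO wraps the string as a file.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```

Without the quotes, the data row would be split into three entries instead of two.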

JSON

We will focus on JSON and XML, as these are two of the most popular transmission formats. They are both used to represent hierarchical structures. JSON stands for "JavaScript Object Notation", and it is fairly simple. Here is an example JSON document returned from a search query to Twitter:

{
   "search_metadata": {
      "completed_in": 0.032,
      "count": 15,
      "max_id": 773149726313099268,
      "max_id_str": "773149726313099268",
      "next_results": "?max_id=771520407623139328&q=%40HanoverCollege&include_entities=1",
      "query": "%40HanoverCollege",
      "refresh_url": "?since_id=773149726313099268&q=%40HanoverCollege&include_entities=1",
      "since_id": 0,
      "since_id_str": "0"
   },
   "statuses": [
      {
         "contributors": null,
         "coordinates": null,
         "created_at": "Tue Sep 06 13:23:53 +0000 2016",
         "entities": {
            "hashtags": [],
            "symbols": [],
            ...
         },
         ...
      },
      ...
   ]
}

Briefly:

A JSON document consists of an "object", indicated by an open-close pair of curly braces.
The object consists of a series of key-value pairs, where the key is some string in quotes, and the value can be:
one of the keywords null, true or false
a number
a string
an array indicated by square brackets, whose elements are in turn any of the types listed here
another object

In the above example, we see that the top level object contains two entries, with keys "search_metadata" and "statuses". The former's value is an object, and the latter's value is an array.
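
As a rough illustration of this structure, the sketch below loads a trimmed copy of the example with Python's standard json module. The parts elided with "..." in the original are simply left out, and the variable names are our own choice.

```python
import json

# Trimmed copy of the Twitter example; elided parts are omitted, not guessed.
text = '''
{
   "search_metadata": {
      "completed_in": 0.032,
      "count": 15,
      "max_id_str": "773149726313099268",
      "since_id": 0
   },
   "statuses": [
      {
         "contributors": null,
         "created_at": "Tue Sep 06 13:23:53 +0000 2016",
         "entities": {"hashtags": [], "symbols": []}
      }
   ]
}
'''

doc = json.loads(text)

# JSON objects become dicts, arrays become lists, null becomes None.
print(sorted(doc.keys()))
print(type(doc["search_metadata"]).__name__)
print(type(doc["statuses"]).__name__)

first_status = doc["statuses"][0]
print(first_status["contributors"])
print(first_status["created_at"])
```

Nested values are reached by chaining keys and indices, following the same nesting as the braces and brackets in the document.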

And that's it! The format is intentionally kept simple. For an application to make use of such information, it would need to know what the possible keys are and what the corresponding values mean. This is often described as the API. Twitter's API describes all these entries in detail. It is worth a look. We will examine it closely during our first lab assignment.

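Those familiar with JavaScript will recognize the above syntax, since object literals in JavaScript are written the same way. JSON is, however, a bit stricter about the format: for example, keys must be strings in double quotes, and a trailing comma after the last entry is not allowed.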
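
One way to see where JSON is stricter than JavaScript is to feed a few variations to a JSON parser. The sketch below uses Python's json module; the helper name is_valid_json is our own, not part of any library.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"year": 2}'))    # double-quoted key: accepted
print(is_valid_json("{'year': 2}"))    # single quotes: rejected
print(is_valid_json('{year: 2}'))      # unquoted key: rejected
print(is_valid_json('{"year": 2,}'))   # trailing comma: rejected
```

The last three forms are all acceptable in a JavaScript object literal, but none of them is valid JSON.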

Practice: Write a JSON object that contains the following:

Your email login.
Your first and last name.
Your year in college.
A list of the courses you are enrolled in, each course containing the information of its department code, its number, its name, and your grade.

XML

XML, standing for eXtensible Markup Language, was a major effort to standardize data description and transmission with a format similar to HTML.

Opening and closing tags are used to denote a nesting hierarchy.
Tag names are chosen, typically following an agreed-upon standard, to indicate specific meaning.
Attributes on each tag can enrich the provided information.

Here is a sample XML file, taken from here:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   ...
</catalog>

There are various documents describing what tags to use depending on the situation, and there are many standards that focus on particular usage. This is a topic we may revisit later. For now, note that XML is somewhat more "verbose" than JSON, with the opening and closing of all tags. It also does not readily provide arrays, though the contents of a tag are in effect a sort of list. At the same time, it offers more diverse ways of associating tags/names with bits of information. For example, the id="..." attributes used above are at a different level than the author tags below them: they are literally a part of the opening tag. Under different conventions we could have written those instead as child tags like <id>bk104</id>.
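
The difference between attributes and child tags also shows up when reading the file in code. Here is a minimal sketch using Python's standard xml.etree.ElementTree module on a shortened copy of the catalog (two books, with some child tags dropped to keep it short).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample catalog above.
text = '''<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <price>5.95</price>
   </book>
</catalog>'''

root = ET.fromstring(text)

# There is no array type: we get the "list" of books by asking
# for all child elements of the root that are named "book".
books = root.findall("book")

for book in books:
    book_id = book.get("id")                 # attribute on the opening tag
    title = book.find("title").text          # text inside a child tag
    price = float(book.find("price").text)   # element text is always a string
    print(book_id, title, price)
```

Note that every value comes back as text; unlike JSON, plain XML does not distinguish numbers from strings, so the conversion to a number is up to the reader of the data.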

Practice: Write an XML document that represents the same information you wrote earlier for JSON.
