Assignment 3

In this assignment we will do a bit of “lexical analysis” on a text. We will read a text and record information as we encounter it. We will concentrate on the following:

  1. Words
  2. Punctuation marks
  3. End-of-line marks

Our goal is to refine such a text and to provide some information about it, as follows:

  1. Convert three consecutive periods into an “ellipsis” and two consecutive dashes into an “em-dash”, a special longer-than-normal dash (you can read all about hyphens, en-dashes and em-dashes and their uses in any good writing-style document).
  2. Convert “programmer’s quotes” (") into fancier open/close quotation marks.
  3. Count the number of words, sentences, and paragraphs.
  4. Reformat the text to a given width limit, with two variants: one preserves any existing newline characters (so shorter lines are allowed); the other tries to make every line as full as possible (without adding extra spaces to pad lines out to the full width).
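
To make the second variant of goal 4 concrete, here is a minimal sketch of greedy line filling written with pattern matching and guards. It works on a plain list of words rather than on the Text type used in the assignment. The names fillLines and go are hypothetical and not part of the starter file, and the sketch assumes that a word longer than the width simply gets a line of its own.

```haskell
-- Hypothetical helper, not part of the starter file: greedily pack words
-- into lines of at most `width` characters, with one space between words.
fillLines :: Int -> [String] -> [String]
fillLines _     []       = []
fillLines width (w : ws) = go w ws
  where
    -- `line` is the line built so far; it always holds at least one word.
    go line []       = [line]
    go line (x : xs)
      | length line + 1 + length x <= width = go (line ++ " " ++ x) xs
      | otherwise                           = line : go x xs
```

For instance, fillLines 10 ["the", "quick", "brown", "fox"] produces the two lines "the quick" and "brown fox". No padding spaces are added, so lines may end up shorter than the width.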

We start by reading the entire text as one large string, then breaking it into “terms”. Here is the definition of the term type:

data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

So a term is either a word (containing a string), a punctuation mark (containing a character), a space (marked by the space character), a newline (marked by the newline character), or a paragraph mark (a paragraph break consists of two or more consecutive newline characters). As a quick example, the string "Hello there!\nNew" will become:

[Word "Hello", Space, Word "there", Punc '!', NewL, Word "New"]
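
As a sanity check on this representation, the following sketch turns a Text back into a string by pattern matching on each constructor. The names termToString and textToString are hypothetical. Rendering Para as a single extra newline is an assumption, since the handout does not specify how a paragraph mark should be printed.

```haskell
data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

-- Hypothetical helper: render a single term as the characters it stands for.
termToString :: Term -> String
termToString (Word w) = w
termToString (Punc c) = [c]
termToString Space    = " "
termToString NewL     = "\n"
termToString Para     = "\n"   -- assumption: one extra newline per paragraph mark

-- Hypothetical helper: render a whole Text, one term at a time.
textToString :: Text -> String
textToString []       = ""
textToString (t : ts) = termToString t ++ textToString ts
```

Applied to the example list above, textToString gives back the original string "Hello there!\nNew".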

Your assignment is to provide code that performs a number of processing steps on a string. You can run the automated tests with:

# Run in shell
# Compile first
ghc assignment3
./assignment3 tests

You should ADD YOUR OWN TESTS; the ones provided are by no means exhaustive.

Remember that you will need to create STUB implementations for all your functions in advance, in order for the tests to compile.
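
A minimal sketch of such stubs follows. The type signatures are inferred from how the functions are used in the starter file and its tests. The intermediate type [Text] between toParagraphs and printParagraphs is an assumption, since the handout does not state it. Each stub type-checks but raises an error if it is ever evaluated, so you can replace them one at a time.

```haskell
data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

-- Placeholder bodies: each one type-checks but raises an error when evaluated.
readText :: String -> Text
readText = error "readText: not implemented"

combineChar :: Char -> Text -> Text
combineChar = error "combineChar: not implemented"

commonSubstitutions :: Text -> Text
commonSubstitutions = error "commonSubstitutions: not implemented"

smartQuotes :: Text -> Text
smartQuotes = error "smartQuotes: not implemented"

isNumeral :: String -> Bool
isNumeral = error "isNumeral: not implemented"

countLines :: Text -> Int
countLines = error "countLines: not implemented"

printStats :: Text -> IO ()
printStats = error "printStats: not implemented"

-- Assumption: paragraphs are passed along as a list of Text values.
toParagraphs :: Text -> [Text]
toParagraphs = error "toParagraphs: not implemented"

printParagraphs :: [Text] -> IO ()
printParagraphs = error "printParagraphs: not implemented"
```

Any further counting functions you introduce will need stubs of the same shape before tests that mention them can compile.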

You can test how your program performs on a particular text file (once you have implemented the needed functions) with the following shell commands:

# Run in shell
# Compile first
ghc assignment3
./assignment3 stats < inputfile
./assignment3 long < inputfile   > outputfile

This assignment is meant to exercise your skills in writing pattern-matching functions. Please avoid other techniques for writing these functions.
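
As an illustration of the intended style, the sketch below handles only the ellipsis and em-dash substitutions by matching on the front of the list. The function name dashesAndDots is hypothetical; in your solution these cases would presumably live inside commonSubstitutions. It assumes the reading step has already produced one Punc term per punctuation character.

```haskell
data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

ellipsis, emdash :: Term
ellipsis = Punc '\8230'
emdash   = Punc '\8212'

-- Hypothetical name: only the ellipsis and em-dash cases are shown here.
dashesAndDots :: Text -> Text
dashesAndDots [] = []
-- Three periods in a row become one ellipsis term.
dashesAndDots (Punc '.' : Punc '.' : Punc '.' : rest) = ellipsis : dashesAndDots rest
-- Two dashes in a row become one em-dash term.
dashesAndDots (Punc '-' : Punc '-' : rest) = emdash : dashesAndDots rest
-- Any other term is kept, and the rest of the list is processed.
dashesAndDots (t : rest) = t : dashesAndDots rest
```

The order of the cases matters: more specific patterns come first, and the catch-all case comes last. With this ordering, four periods in a row yield an ellipsis followed by a single period.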

First, a brief overview of the functions you need to implement and/or are given:

  1. processText is a function that is provided to you, but which uses functions you will need to write. It takes as input a string and returns a Text value for that string, which is the result of the initial processing of the string followed by a number of post-read steps.

  2. readText is a function you will need to write. It takes in a string and reads it through, turning it into a Text value by converting each character to an appropriate term. The rules are as follows:

  3. combineChar is a helper function to be used with readText (item 2). It takes as input a character and a Text value that represents the already-read rest of the string. For example, if we were reading the string "cat?", then combineChar may be called on the character 'c' and the Text value [Word "at", Punc '?']. The desired behavior in this case is to combine the 'c' character with the "at" string, resulting in the Text value [Word "cat", Punc '?']. A little deeper into the recursive calls, it may be called on the character 't' and the Text value [Punc '?'], in which case it would produce [Word "t", Punc '?']. Make sure you understand these two examples.

    Here is how combineChar is meant to behave on a character ch and a text value txt:

  4. The big workhorse is the commonSubstitutions function, which recursively goes through a Text value and returns a Text value, attempting to perform various transformations along the way. You will spend a lot of time on this function, incrementally adding behavior. It will be a long list of pattern-match cases, each handling a different kind of transformation and then recursively continuing on the rest.

  5. Next up, you need to implement a function called smartQuotes, which takes a Text and replaces the normal quotes with nicer open/close quotation marks, returning a Text value. Use the provided openQuote and closeQuote values. You will need a helper function that uses an extra Boolean parameter to remember whether you are in the middle of a quotation (in which case the next quotation mark you see should be a closing one) or outside one (in which case the next quotation mark you see should be an opening one).

  6. Next up, you are asked to implement a series of “counting” functions. Given a text, these functions count various things and return the integer count:

  7. Next up, you should write a printStats function. It takes as input a Text value and produces an IO () action that prints statistics about the text. Your output should look like this:

    Words: 234
    Sentences: 23
    Lines: 12
    Paragraphs: 5

    This will be a simple do sequence of actions, calling the functions you wrote in the previous step.

  8. We will now put together a set of functions whose goal is to turn the text back into a string and print it.
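
The behavior described for combineChar in item 3 can be reproduced with the sketch below. The detailed rules are not listed in this handout, so the classification used here is an assumption: letters (as judged by isAlpha) extend or start a word, ' ' becomes Space, '\n' becomes NewL, and every other character becomes a Punc. The readText shown is likewise only one possible shape, not necessarily the intended one.

```haskell
import Data.Char (isAlpha)

data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

-- Sketch under assumed rules; see the note above.
combineChar :: Char -> Text -> Text
combineChar ' '  txt = Space : txt
combineChar '\n' txt = NewL : txt
-- A letter directly in front of a word is prepended to that word.
combineChar ch (Word w : rest)
  | isAlpha ch = Word (ch : w) : rest
-- Otherwise a letter starts a new word, and anything else is punctuation.
combineChar ch txt
  | isAlpha ch = Word [ch] : txt
  | otherwise  = Punc ch : txt

-- Read the rest of the string first, then combine the current character.
readText :: String -> Text
readText []       = []
readText (c : cs) = combineChar c (readText cs)
```

Note that when the guard in the third case fails, Haskell falls through to the next case, so a non-letter in front of a word still becomes a Punc term.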

Here are the initial file contents:

module Main where

import Test.HUnit
import Data.Char (isAlpha, toUpper, isPunctuation)
import System.Environment (getArgs)

data Term = Word String | Punc Char
          | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]


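-- Special punctuation terms produced by the substitution steps.
openQuote, closeQuote, ellipsis, emdash :: Term
openQuote = Punc '\8220'   -- left double quotation mark
closeQuote = Punc '\8221'  -- right double quotation mark
ellipsis = Punc '\8230'    -- horizontal ellipsis
emdash = Punc '\8212'      -- em dash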


processText :: String -> Text
processText = smartQuotes . commonSubstitutions . readText

tests :: Test
tests = TestList [
   TestCase $ assertEqual "combineChar"
      [Word "cat", Punc '!'] (combineChar 'c' [Word "at", Punc '!']),
   TestCase $ assertEqual "combineChar"
      [Word "t", Punc '!'] (combineChar 't' [Punc '!']),
   TestCase $ assertEqual "combineChar"
      [Word "t", Space] (combineChar 't' [Space]),
   TestCase $ assertEqual "commonSubstitutionsEmdash"
      [Word "say", emdash, Word "hello"] (processText "say--hello"),
   TestCase $ assertEqual "commonSubstitutionsEllipsis"
      [Word "some", Space, Word "ellipsises", ellipsis, Punc '.']
      (processText "some ellipsises...."),
   TestCase $ assertEqual "commonSubstitutionsMrAndApostrophe"
      [Word "Mr.", Space, Word "Smith's", Space, Word "work"]
      (processText "Mr. Smith's work"),
   TestCase $ assertEqual "commonSubstitutionsDashed"
      [Word "seventy-five"]
      (processText "seventy-five"),
   TestCase $ assertEqual "commonSubstitutionsDashedTwice"
      [Word "five-and-twentieth"]
      (processText "five-and-twentieth"),
   TestCase $ assertEqual "isNumeral" False (isNumeral ""),
   TestCase $ assertEqual "isNumeral" False (isNumeral "IGF"),
   TestCase $ assertEqual "isNumeral" True (isNumeral "ILX"),
   TestCase $ assertEqual "commonSubstitutionsNumerals"
      [Word "I.", Space, Word "II.", Space, Word "III.", Space,
       Word "IV.", Space, Word "V.", Space, Word "VI.", Space,
       Word "vii.", Space, Word "IX.", Space, Word "X.", Space,
       Word "XI.", Space, Word "Normal", Punc '.']
      (processText "I. II. III. IV. V. VI. vii. IX. X. XI. Normal."),
   TestCase $ assertEqual "commonSubstitutionsParagraphs"
      [Word "word", NewL, Para]
      (processText "word\n\n\n\n"),
   TestCase $ assertEqual "commonSubstitutionsParagraphs"
      [Word "word", NewL, Para, Word "More"]
      (processText "word\n\n\n\n\n\n\nMore"),
   TestCase $ assertEqual "smartQuotes"
      [Word "here", Space, Word "be", Space, openQuote,
       Word "double", Space, Word "quotes", closeQuote]
      (processText "here be \"double quotes\""),
   TestCase $ assertEqual "countLines1"
      3
      (countLines $ processText "one\n\ntwo\nthree"),
   TestCase $ assertEqual "countLines2"
      3
      (countLines $ processText "one\n\ntwo\nthree\n"),
   TestCase $ assertEqual "countLines3"
      3
      (countLines $ processText "one\n\ntwo\nthree\n\n")
   ]

main :: IO ()
main = do
   args <- getArgs
   s <- getContents
   let txt = processText s
   case args of
      ("tests" : _) -> do runTestTT tests
                          return ()
      ("stats" : _) -> printStats txt
      ("long" : _)  -> (printParagraphs . toParagraphs) txt
      _             -> (printParagraphs . toParagraphs) txt
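
For comparison with the smartQuotes test in the listing above, here is a minimal sketch of the Boolean-helper approach described in item 5. The helper name quoteHelper and the convention that True means "currently inside a quotation" are assumptions, and only the straight double quote character is handled.

```haskell
data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]

openQuote, closeQuote :: Term
openQuote  = Punc '\8220'
closeQuote = Punc '\8221'

-- Start outside any quotation.
smartQuotes :: Text -> Text
smartQuotes txt = quoteHelper False txt

-- Hypothetical helper: the Bool is True while we are inside a quotation.
quoteHelper :: Bool -> Text -> Text
quoteHelper _      []                = []
quoteHelper False  (Punc '"' : rest) = openQuote  : quoteHelper True  rest
quoteHelper True   (Punc '"' : rest) = closeQuote : quoteHelper False rest
quoteHelper inside (t : rest)        = t : quoteHelper inside rest
```

Each quote flips the flag, so quotes alternate between opening and closing; an unmatched quote at the end of the text simply stays an opening one.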