In this assignment we will do a bit of “lexical analysis” on a text. We will read a text and record information as we encounter it. We will concentrate on the following:
Our goal would be to refine such a text, as well as provide some information about it, as follows:
"” into more fancy open-close quotation marks.We start by reading a large string of the entire text, then breaking it into “terms”. Here is the definition of the term type:
data Term = Word String | Punc Char | Space | NewL | Para deriving (Eq, Show)
type Text = [Term]
So a term is either a word (containg a string information) or a punctuation (containing a character information), a space (marked by the space character), a newline (marked by the newline character) or a paragraph mark (a paragraph consists of 2 or more consecutive newline character). As a quick example, the string: "Hello there!\nNew" will become:
[Word "Hello", Space, Word "there", Punc '!', NewL, Word "New"]
Your assignment is to provide code that does a number of processing steps to a string. You can run the automated tests with:
# Run in shell
# Compile first
ghc assignment3
./assignment3 tests
You should ADD YOUR OWN TESTS; The ones provided are by no means exhaustive.
Remember that you will need to create STUB implementations for all your functions in advance, in order for the tests to compile.
You can test how your script performs on a particular text file (once you have implemented the needed functions) with the following shell commands:
# Run in shell
# Compile first
ghc assignment3
./assignment3 stats < inputfile
./assignment3 long < inputfile > outputfile
This assignment is meant to exercise your skills in writing pattern-matching functions. Please avoid other techniques for writing these functions.
First, a brief overview of the functions you need to implement and/or are given:
processText is a function that is provided to you, but which uses functions you will need to write. It takes as input a string and returns a Text value for that string, which is the result of the initial processing of the string followed by a number of post-read steps.
readText is a function you will need to write. It takes in a string and reads it through, turning it into a Text value by converting each character to an appropriate term. The rules are as follows:
Punc value. You can use the function isPunctuation to determine if a given character is a punctuation character or not.Space value. (we will not consider tabs as characters that occur in our texts, but if you want to consider them then treat them as 4 spaces.)NewL value.Word value. You can use the next function to help you in handling this case.combineChar is a helper function to be used with question 2. It takes as input a character and a Text value which represents the read rest of the string. For example if we were reading the string "cat?" then the combineChar function may be called to operate on the character 'c' and the Text value [Word "at", Punc '?']. The desired behavior in this case would be to combine the 'c' character with the "at" string and result in the text value [Word "cat", Punc '?']. A little bit deeper in to the recursive calls it may be called to operate on the character 's' and the text value [Punc '?'], in which case it would produce [Word "s", Punc '?']. Make sure you understand these two examples.
Here is how combineChar is meant to behave on a character ch and a text value txt:
txt is a Word s value, then this means we are in the middle of reading a word, and we need to replace that value with one where the new character has been prepended to the string s in the Word value.Word value with that character and place it at the front of the Text list. You can handle the empty list case together with this case in one clause.The big workhorse is the commonSubstitutions method, which recursively goes through a Text value and returns a Text value back, attempting to perform various transformation along the way. You will spend a lot of time on this function incrementally adding behavior. It will be a big list of pattern-match cases each handling a different kind of transformation then recursively continuing on the rest.
ellipsis that you can use for that, which uses the special Unicode character for an ellipsis.Term value defined for you for that called emdash which uses the special Unicode character for an emdash."twenty-five" would have been parsed as the three terms Word "twenty", Punc '-' and Word "five" in that order. Your pattern should match such combinations and replace them with a single Word "twenty-five" combination. Make sure the recursive call allows you to catch the case of two dashes like in "five-and-twentieth" by combining the first dash into a compound word but allowing that word to be part of the recursive match."isn't", when it is at the end of the word ("parents'") or at the beginning indicating omitted text ("'re"). You should handle all three cases, and be careful about the order. Note: You can represent the single quote/apostrophe character as '\''.VII or ix. You should write the helper method isNumeral which takes a string and returns a boolean as to whether that string is a numeral. Then use this helper to handle the cases of a word followed by a period: If that word is a numeral then combine it with the period into a new word, otherwise keep the word as is and continue recursively with the rest (make sure that you still recursively examine the period, in case it is part of an ellipsis for example).NewL needs to be replaced by a single NewL followed by a single Para. You can do this with one case turning a NewL pair into a NewL and Para combination, and another case that reads any NewL which follows a Para and simply skips it.Next up, you need to implement a method called smartQuotes, which takes a Text and replaces the normal quotes with nicer open/close quotation marks, returning a Text value. Use the provided openQuote and closeQuote values. You will need a helper method that uses an extra boolean parameter to remember whether you are in the middle of a quotation (in which case the next quotation mark you see should be a closing one) or outside (in which case the next quotation mark you see should be an opening on).
Next up you are asked to implement a series of “counting” functions. Given a text, these functions count various things and return the integer count:
countWords simply counts how many word terms there are.countSentences counts how many sentences there are. For us, a sentence is any sequence that ends in a period, question mark or exclamation point.countLines counts how many lines there are. A line happens when the NewL term is encountered (we do not count the empty lines formed by paragraphs. It also occurs at the very last term, unless that term is a Para term.countParagraphs counts how many paragraphs there are. A paragraph happens when the Para term is encountered or at the very last term (even if that term is not a Para term).Next up you should write a printStats method. It takes as input a Text value and produces an IO () action which prints stat information. You should be producing output that looks like this:
Words: 234
Sentences: 23
Lines: 12
Paragraphs: 5
This will be a simple do sequence of actions, calling on the functions you wrote on the previous step.
We will now put together a set of functions whose goal is to print out the text to produce a string.
termToString should take as input a Term and convert it to a String. Word terms result in the corresponding word, Punc terms produce a string containing just that punctuation, Space terms become a single space string " ", and NewL and Para terms both become a string containing a single newline character, "\n".toString takes a whole Text and converts it to a string, by simply using the termToString method to turn each term into a string, then concatenating those.eliminateNewlines eliminates the newline terms NewL as follows: A NewL term that is followed by a paragraph term is simply eliminated, while a NewL term that is not followed by a paragraph term is replaced by a Space term. Don’t forget to recursively traverse the entire term list.splitOnParagraph takes a Text value and returns a list of Text values by splitting on the Para terms. The resulting Text values should not contain the Para terms. Make sure to NOT create an extra empty Text value if the last term is a Para term.toParagraphs combines these as follows: It takes a Text value and must result in a list of strings. It does this by first using eliminateNewlines followed by splitOnParagraph, and then it applies the toString function to each of the result Text elements to produce corresponding String elements. You can use a list comprehension for part of this function if you find it helpful.printParagraphs takes as input a list of strings and produces an IO () action which prints those strings as paragraphs as follows: It prints the string/paragraph followed by a newline character; then if we are not at the end of the list it prints an extra newline character to create an empty line, then recursively prints the rest of the list. A simple do statement should work for this.Here are initial file contents:
module Main where
import Test.HUnit
import Data.Char (isAlpha, toUpper, isPunctuation)
import System.Environment (getArgs)
data Term = Word String | Punc Char
| Space | NewL | Para deriving (Eq, Show)
type Text = [Term]
openQuote = Punc '\8220'
closeQuote = Punc '\8221'
ellipsis = Punc '\8230'
emdash = Punc '\8212'
processText :: String -> Text
processText = smartQuotes . commonSubstitutions . readText
tests = TestList [
TestCase $ assertEqual "combineChar"
[Word "cat", Punc '!'] (combineChar 'c' [Word "at", Punc '!']),
TestCase $ assertEqual "combineChar"
[Word "t", Punc '!'] (combineChar 't' [Punc '!']),
TestCase $ assertEqual "combineChar"
[Word "t", Space] (combineChar 't' [Space]),
TestCase $ assertEqual "commonSubstitutionsEmdash"
[Word "say", emdash, Word "hello"] (processText "say--hello"),
TestCase $ assertEqual "commonSubstitutionsEllipsis"
[Word "some", Space, Word "ellipsises", ellipsis, Punc '.']
(processText "some ellipsises...."),
TestCase $ assertEqual "commonSubstitutionsMrAndApostrophe"
[Word "Mr.", Space, Word "Smith's", Space, Word "work"]
(processText "Mr. Smith's work"),
TestCase $ assertEqual "commonSubstitutionsDashed"
[Word "seventy-five"]
(processText "seventy-five"),
TestCase $ assertEqual "commonSubstitutionsDashedTwice"
[Word "five-and-twentieth"]
(processText "five-and-twentieth"),
TestCase $ assertEqual "isNumeral" False (isNumeral ""),
TestCase $ assertEqual "isNumeral" False (isNumeral "IGF"),
TestCase $ assertEqual "isNumeral" True (isNumeral "ILX"),
TestCase $ assertEqual "commonSubstitutionsNumerals"
[Word "I.", Space, Word "II.", Space, Word "III.", Space,
Word "IV.", Space, Word "V.", Space, Word "VI.", Space,
Word "vii.", Space, Word "IX.", Space, Word "X.", Space,
Word "XI.", Space, Word "Normal", Punc '.']
(processText "I. II. III. IV. V. VI. vii. IX. X. XI. Normal."),
TestCase $ assertEqual "commonSubstitutionsParagraphs"
[Word "word", NewL, Para]
(processText "word\n\n\n\n"),
TestCase $ assertEqual "commonSubstitutionsParagraphs"
[Word "word", NewL, Para, Word "More"]
(processText "word\n\n\n\n\n\n\nMore"),
TestCase $ assertEqual "smartQuotes"
[Word "here", Space, Word "be", Space, openQuote,
Word "double", Space, Word "quotes", closeQuote]
(processText "here be \"double quotes\""),
TestCase $ assertEqual "countLines1"
3
(countLines $ processText "one\n\ntwo\nthree"),
TestCase $ assertEqual "countLines2"
3
(countLines $ processText "one\n\ntwo\nthree\n"),
TestCase $ assertEqual "countLines3"
3
(countLines $ processText "one\n\ntwo\nthree\n\n")
]
main :: IO ()
main = do
args <- getArgs
s <- getContents
let txt = processText s
case args of
("tests" : _) -> do runTestTT tests
return ()
("stats" : _) -> printStats txt
("long" : _) -> (printParagraphs . toParagraphs) txt
_ -> (printParagraphs . toParagraphs) txt