text {* \ExerciseSheet{14}{28.~7.~2017} *} theory tmpl14 imports "~~/src/HOL/Library/Multiset" "~~/src/HOL/Library/Code_Target_Nat" "~~/src/HOL/Library/RBT" "~~/src/HOL/Library/Char_ord" "~~/src/HOL/Library/Code_Char" begin text \ \Exercise{Word Frequency --- Down to ML code} Your task is to develop a program that reads a corpus and prints the words in the corpus together with their frequencies, sorted by descending frequencies. Except input and output, your program shall be formalized in Isabelle/HOL. A corpus is described as @{typ \'a list\}, and the result is described by @{typ \('a \ nat) list\} \ text \The frequency of a word in a corpus can be specified as:\ definition freq :: "'a list \ 'a \ nat" where "freq ws = count (mset ws)" text \ Specify a predicate that characterizes a correct result. Note: If words have the same frequency, any order is allowed. \ consts is_freq_list :: "'a list \ ('a \ nat) list \ bool" text \Tests:\ lemma \is_freq_list [''a'',''b'',''b'',''c'',''c''] [(''c'',2),(''b'',2),(''a'',1)]\ oops lemma \is_freq_list [''a'',''b'',''b'',''c'',''c''] [(''b'',2),(''c'',2),(''a'',1)]\ oops lemma \\is_freq_list [''a'',''b'',''b'',''c'',''c''] [(''b'',2),(''c'',3),(''a'',1)]\ oops lemma \\is_freq_list [''a'',''b'',''b'',''c'',''c''] [(''a'',1),(''c'',2),(''b'',2)]\ oops lemma \\is_freq_list [''a'',''b'',''b'',''c'',''c''] [(''b'',2),(''c'',2),(''b'',2),(''a'',1)]\ oops section \Refinement \#1\ (* Refinement #1: Fold over word list and maintain current frequency *) text \Compute a function from words to their frequency by folding over the corpus, starting with @{term \\_. 0::nat\}. \ consts incr1 :: "'a \ ('a \ nat) \ 'a \ nat" definition "freq1 ws \ fold incr1 ws (\_. 0)" lemma freq1_correct[simp]: "freq1 ws = freq ws" oops section \Refinement \#2\ (* Refinement #2: Frequency map by RBT *) text \ Use red black trees to implement the mapping from words to frequencies. Words that do not occur in the corpus must not be mapped! Use the RBT implementation from \HOL/Library/RBT\! It provides, e.g., @{const RBT.empty}, @{const RBT.lookup}, @{const RBT.insert}. \ definition abs_fm :: "('a::linorder,nat) rbt \ 'a \ nat" where "abs_fm t k \ case RBT.lookup t k of None \ 0 | Some v \ v" definition inv_fm :: "('a::linorder,nat) rbt \ bool" where "inv_fm t \ (0 \ ran (RBT.lookup t))" lemma empty2_correct[simp]: "abs_fm RBT.empty = (\_. 0)" "inv_fm RBT.empty" oops definition incr2 :: "'a::linorder \ ('a, nat) rbt \ ('a, nat) rbt" where "incr2 k t = t" lemma incr2_correct[simp]: "inv_fm t \ abs_fm (incr2 k t) = incr1 k (abs_fm t)" "inv_fm t \ inv_fm (incr2 k t)" oops text \ Now we have refined the operations, we can refine the algorithm that uses the operations:\ definition "freq2 ws \ fold incr2 ws RBT.empty" lemma freq2_correct[simp]: "abs_fm (freq2 ws) = freq1 ws" "inv_fm (freq2 ws)" oops subsection \Extracting Result from RBT\ text \Extract the desired result --- list of pairs of words and their frequencies, sorted by frequency --- from the red black tree. Use @{const RBT.entries}. \ definition fsort :: "'a::linorder list \ ('a \ nat) list" where "fsort ws \ []" text \Prove that your function is correct. Hint: You will need to prove some auxiliary lemmas on standard list functions. Use \find_theorems\ to search for existing lemmas. Hint: A lemma of the form @{text \RBT.lookup (freq2 ws) w = Some f \ \\}, derived from @{text \freq2_correct freq1_correct\} may be useful! \ lemma fsort_correct: "is_freq_list ws (fsort ws)" oops section \Code Generation\ text \Now we can use Isabelle/HOL's code generator to actually extract functional code from our Isabelle formalization. First, we derive a specialized versions with strings: \ definition fsort_string :: "String.literal list \ (String.literal \ nat) list" where "fsort_string \ fsort" text \Then we can use the code generator in different ways.\ text \By the value command\ value [code] "fsort_string [STR ''foo'', STR ''bar'', STR ''foo'', STR ''bara'']" text \Export code to file\ export_code fsort_string in SML module_name Fsort file "export.sml" text \We can load the file into JEdit's ML IDE: \ SML_file "export.sml" text \And use it from within some wrapper code for parsing a corpus and printing the result: \ SML_file "fsort.sml" text \Use code directly with Isabelle's builtin ML interpreter \begin{verbatim} ML_val {* see template file *} \end{verbatim} \ (* Directly as Isabelle/ML command *) ML_val \ (* Read file to string *) fun file_to_string name = let val f = TextIO.openIn name val s = TextIO.inputAll f val _ = TextIO.closeIn f in s end fun fs fname = @{code fsort_string} (String.tokens (not o Char.isAlpha) (String.map (Char.toLower) (file_to_string fname))) val r1 = fs "/home/lammich/MASC-3.0.0/data/written/non-fiction/CUP2.txt" val r2 = fs "/home/lammich/MASC-3.0.0/data/written/twitter/tweets2.txt" \ text \The code generator also supports other target languages\ export_code fsort_string in Haskell export_code fsort_string in Scala export_code fsort_string in OCaml (*<*) end (*>*)