Functional Programming Competition 2018

part of the Formal Methods and Functional Programming (FMFP) course at ETH Zürich

Every week, a special Haskell programming exercise is published here and on Code Expert. Each exercise is carefully selected by the Master of Competition (MC)—Dmitriy Traytel—to separate the wheat from the chaff. After 7 weeks, the intersection of the wheat with the FMFP students will receive delicious prizes in an in-lecture ceremony.

Participants submit their solutions via Code Expert. Submissions by mail are accepted only in two exceptional cases: (1) the participant is not enrolled at ETH and therefore cannot use Code Expert, or (2) the solution uses a Haskell library/pragma that is not available on Code Expert. In case (2), precise instructions on how to run the submission (including version numbers) are required for the solution to be evaluated.

Every week, the MC's favorite submissions are discussed in a blog post. The submissions are evaluated with respect to different criteria (but the function's correctness is always a must-have), which are announced along with the exercise. The best 10 solutions receive from 1 to 10 points (more is better). The points and the names of the participants are made public in a top 10 of the week and aggregated in a semester ranking.

Important: If you don't want your name to appear on a publicly visible webpage (but still want to participate), write to the MC and suggest a pseudonym.

Finally, the MC employs a zero tolerance policy against plagiarism. Please compete fairly and individually. Submissions that have been developed jointly by two contestants have to be clearly marked as such. Violations of those rules may result in arbitrary point deductions.

Final ranking
  # name   points
1. Matthias Brun 💥 53.0
2. Moritz Knüsel 🎩 50.0
3. Gabriel Giamarchi 🖥 49.0
4. Simon Yuan 📐 36.0
5. Börge Scheel 35.0
6. Loris Reiff 🏆 34.0
7. Miro Haller 🧟🔝 26.0
8. Roger Baumgartner 24.0
9. Mike Boss 23.0
10. Chris Buob 14.0
11. Silvan Niederer 13.0
12. Patrick Ziegler 11.5
13. Kevin de Keyser 10.0
Tynan Richards 10.0
15. Miklós Horváth 9.0
16. Antonio Russo 6.0
17. Noah Berner 5.0
Zur Shmaria 5.0
Jan Wiegner 5.0
20. Mathias Blarer 3.5
21. Giuseppe Arcuti 3.0
Lukas Baege 3.0
Andrin Bertschi 3.0
Lukas Heimes 3.0
Felix Huber 3.0
Timon Jenne 3.0
Nikola Kovacevic 3.0
Andreas Kuster 3.0
Sandar Felicity Lim 3.0
Manuel Nowack 3.0
Felix Sarnthein-Lotichius 3.0
Roberto Starc 3.0
Elwin Stephan 3.0
Valentin Stoppiello 3.0
35. Jorel Elmiger 1.0
Stefan Zemljic 1.0


Task of week 1

Write a function oddSum that takes five Int arguments as input and computes the sum of the maximum, the minimum, and the median of the arguments.

For example:

oddSum 1 2 3 4 5 = 1 + 3 + 5 = 9
oddSum 5 4 1 3 2 = 1 + 3 + 5 = 9
oddSum 1 1 5 3 3 = 1 + 3 + 5 = 9

The shortest (in terms of number of tokens [1]) solution wins! Library imports and type signatures are excluded from the token count. As a point of reference: the MC has an optimized (but not too crazily optimized) 27-token solution.

[1] This competition continues a long-standing tradition of in-lecture functional programming competitions established at TU München. The definition of tokens is borrowed from there.

Results of week 1

Here are the best 26 student performers (the top 10 stretched to 26 names by ties) in the first competition task.

Top 10 of week 1
  # name   #tokens   points
1. Matthias Brun 17 10
2. Loris Reiff 19 9
3. Simon Yuan 26 8
4. Roger Baumgartner 27 7
Moritz Knüsel 27 7
Börge Scheel 27 7
7. Gabriel Giamarchi 29 4
8. Giuseppe Arcuti 31 3
Lukas Baege 31 3
Noah Berner 31 3
Andrin Bertschi 31 3
Mike Boss 31 3
Chris Buob 31 3
Miro Haller 31 3
Lukas Heimes 31 3
Timon Jenne 31 3
Nikola Kovacevic 31 3
Andreas Kuster 31 3
Sandar Felicity Lim 31 3
Manuel Nowack 31 3
Tynan Richards 31 3
Felix Sarnthein-Lotichius 31 3
Roberto Starc 31 3
Elwin Stephan 31 3
Valentin Stoppiello 31 3
Jan Wiegner 31 3

The MC is happy about 104 student participants in this first week of the competition (+ 1 tutor + 1 professor + the MC). Several participants submitted more than one solution without any comments (such as -- grade me), leaving it to the MC to pick the worst one in terms of tokens. However, the MC was so happy about the fact that every participant submitted at least one correct solution (determined by the MC by a mixture of testing and manual inspection) that he decided to take the best submission in terms of tokens into account.

Before we start looking at the best solutions, let's see what one can do with what we've seen in the lecture so far. The following solution is the MC's attempt to use only the material from the lecture; it is also one of the submissions by our number 3, Simon Yuan (and a few others, who wasted tokens by adding some needless parentheses or the like). With 58 tokens it is far from Simon's or the MC's best solution, and even further from the top 10.

oddSum a b c d e
 | a > b        = oddSum b a c d e
 | b > c        = oddSum a c b d e
 | c > d        = oddSum a b d c e
 | d > e        = oddSum a b c e d
 | otherwise    = a + c + e

To understand how this solution works, one has to realize that the guards are tried sequentially: the program will only check c > d after having established a ≤ b and b ≤ c. With this insight it is pretty obvious what this code does: the trivial otherwise guard is only reached once a, b, c, d, e are sorted.
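
For instance, for an input with a single inversion only one recursive step is needed:

oddSum 2 1 3 4 5
  = oddSum 1 2 3 4 5   -- first guard: 2 > 1
  = 1 + 3 + 5          -- all guards fail, otherwise branch
  = 9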

Now we have a baseline to compete against. It is fairly obvious that sorting should be part of any successful solution. Can we sort the five elements in Haskell without implementing our own sorting algorithm? Of course: we can just use the right library function. To use this function we must import the library Data.List, which contains many useful functions for the Haskell list type (more on this next week in the lecture; what you see below should be mostly intuitive usage of finite sequences of Int elements). But, hey, imports cost 0 tokens!

Once the list is sorted, we have to think about how to extract the minimal, maximal, and median elements. Many solutions, including the following one by an anonymous professor, opted for index-based addressing, written list !! index in Haskell.

oddSum a b c d e = let xs = sort [a,b,c,d,e] in xs !! 0 + xs !! 2 + xs !! 4

This gives 34 tokens (or 33 when using where) and brings us closer to the top 10.

To get into the top 10, this time it was enough to access the first and last elements of the sorted list using library functions (head and last). Note that head xs counts as two tokens, while xs !! 0 counts as three. Alternatively, but less efficiently, one can extract the minimum or the maximum of the (already sorted!) list. Alas, all 31-token solutions, including one submitted by a tutor who likes applicative functors (more on this much later in the lecture), are variations of the following two.

import Data.List

oddSum a b c d e = maximum xs + minimum xs + xs !! 2
  where xs = sort [a,b,c,d,e]

oddSum' a b c d e = head xs + last xs + xs !! 2
  where xs = sort [a,b,c,d,e]

Side remark 1: Some participants mistook tokens for characters, and tried to minimize the number of characters used in their program. Readability suffered from this quite a bit, e.g., oddSum a b c d e=xs!!0+xs!!2+xs!!4 where xs=sort[a,b,c,d,e]. Useful advice: leave the reader of your program some air to breathe. It is better to kill needless parentheses than spaces (also in terms of tokens).

All solutions further up in the ranking use techniques other than indexing to extract the three elements from the sorted list. Gabriel Giamarchi uses the following trick (omitting the Data.List import):

oddSum a b c d e = sum ((subsequences (sort [a,b,c,d,e]))!!21)

Before we talk about the semantics, let's see whether, syntactically, this solution does its best in terms of tokens. No! Prefix function application always binds more tightly than any infix operator. That is, f a b c d !! g x y z is always parsed as (f a b c d) !! (g x y z). Have I already mentioned what to do with needless parentheses? Don't get me wrong: Gabriel did a great job by using subsequences in the first place. But those two parentheses around subsequences ... cost him three competition points, which in the end might decide between Lindt and Migros Budget.

So how does this solution work? The library function subsequences computes the list of subsequences (i.e., a list whose elements are lists themselves) of the input list, where a list xs is a subsequence of another list ys if one can obtain xs by removing zero or more elements from ys (but without reordering the elements). For example: subsequences [1,2,3] = [[],[1],[2],[1,2],[3],[1,3],[2,3],[1,2,3]]. Now, when applied to a list with five elements, subsequences yields at the 22nd position (i.e., at the 0-based index 21) the subsequence consisting of the first, middle, and last elements. This holds regardless of the list elements' types or values and is exploited here. Finally, the library function sum sums up the elements of this subsequence, which is exactly what is asked for. BTW, you might ask: is computing subsequences the most efficient way to do this? The MC answers: certainly not, but who cares? It is all about tokens.
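
A quick sanity check of the index-21 claim in GHCi (result shown as a comment):

import Data.List

subsequences (sort [5,1,4,2,3]) !! 21   -- [1,3,5]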

The best solutions of Moritz Knüsel and Simon Yuan are variants of this idea, which are more economic about tokens. Here is, for example, Simon's solution, which came in third with 26 tokens:

oddSum a b c d e = sum $ subsequences (sort [a,b,c,d,e]) !! 21

Using the infix operator $ is a useful trick to save tokens (and sometimes also to improve readability). It is nothing other than function application (f $ x = f x); however, the low precedence of this infix operator helps to avoid parentheses: instead of f (g x) we can write f $ g x (or f $ g !! i instead of f (g !! i)).

Let's look at two other 27-token solutions. The first one was found by the two students Börge Scheel and Roger Baumgartner. The second is the MC's best solution.

oddSum a b c d e = sum $ map head $ chunksOf 2 $ sort [a, b, c, d, e] -- chunksOf comes from Data.List.Split
oddSum a b c d e = sum $ scanr (-) 0 $ sort [a, b, c, d, e]

The first solution works by sorting and then using the library function chunksOf, which splits a list into shorter ones (of length at most 2 here as indicated by the passed argument). We have chunksOf 2 [1,2,3,4,5] = [[1,2],[3,4],[5]]. The next applied function is the important map function, which we will see again in the course. It applies its first function argument (yes, in Haskell functions can be passed around as arguments—map is called a higher-order function because it takes a function as an argument; more on higher-order functions in two weeks in the lecture) to every element of the list passed as its second argument. For example: map head [[1,2],[3,4],[5]] = [head [1,2], head [3,4], head [5]] = [1,3,5].

The second solution uses an even more complicated higher-order function than map: scanr. In our example, it works as follows: scanr (-) 0 [1,2,3,4,5] becomes

[1 - (2 - (3 - (4 - (5 - 0))))
,     2 - (3 - (4 - (5 - 0)))
,          3 - (4 - (5 - 0))
,               4 - (5 - 0)
,                    5 - 0
,                        0]

or written differently

[1 - 2 + 3 - 4 + 5 - 0
,    2 - 3 + 4 - 5 + 0
,        3 - 4 + 5 - 0
,            4 - 5 + 0
,                5 - 0
,                    0]

Reading this column-wise, we observe that after summing this list the elements in the even positions (2 and 4) cancel out, whereas the elements in the odd positions (1, 3, and 5) survive this telescoping.

Side remark 2: When writing this explanation, the MC noticed that it is possible to use a very similar function scanr1 (-) instead of scanr (-) 0 to save one token. But unfortunately, we are past the deadline now.
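
For the record, the variant the side remark alludes to would look as follows (a sketch, written after the fact):

oddSum a b c d e = sum $ scanr1 (-) $ sort [a, b, c, d, e]

-- scanr1 uses the last list element as the starting value:
-- scanr1 (-) [1,2,3,4,5] = [3,-2,4,-1,5], whose sum is again 9.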

All these tricks are nice, but they are still about 10 tokens away from the two best solutions, whose authors came up with a way to avoid writing the list [a,b,c,d,e] or even the five arguments to oddSum. Here is Matthias Brun's winning entry with all the imports and a nice comment describing how the solution works:

import Data.List (sort, subsequences)
import Text.Printf (printf)
import Data.Composition ((.::.))

{-
 - This solution works by computing the power set (subsequences) of the sorted
 - list containing the arguments. The 22nd element (i.e. index 21) of the power
 - set (as generated by 'subsequences') is always the one containing the
 - minimum, median and maximum (for a sorted list of 5 values), so we take its
 - sum. The 'read .::. printf "[%d,%d,%d,%d,%d]"' part simply reads in the
 - arguments and transforms them into a list in a way that uses few tokens.
-}

oddSum = sum . (!! 21) . subsequences . sort . read .::. printf "[%d,%d,%d,%d,%d]"

The MC is amazed. This is pointfree functional programming at its best. (The pointfree style is often criticized for obfuscating the program. One particular problem is that it might be hard to see how many arguments a function takes. But it is the best way to save tokens. And in this case, the solution is fairly easy to read.) To the above explanation, the MC only wants to add some comments on the operators . and the funny .::., the latter of which for some reason makes the MC think of Space Invaders. Both denote function composition. We have:

(f . g) x1 = f (g x1)
(f .::. g) x1 x2 x3 x4 x5 = f (g x1 x2 x3 x4 x5)
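
If the Data.Composition package is unavailable, the Space Invaders operator can be defined by hand; the following sketch matches the equation above:

(.::.) :: (y -> z) -> (a -> b -> c -> d -> e -> y) -> a -> b -> c -> d -> e -> z
(f .::. g) a b c d e = f (g a b c d e)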

Hence, the above solution is equivalent to:

oddSum a b c d e = sum ((!! 21) (subsequences (sort (read (printf "[%d,%d,%d,%d,%d]" a b c d e)))))

The sum ((!! 21) (subsequences (sort ...))) part looks familiar. We've seen the function read, which converts Strings to values of different types (or complains with a parse error), in the first exercise class. The type of printf is too complex to explain at this point; however, when used, the function behaves just like its homonymous C ancestor (i.e., the number of arguments it takes depends on the formatting string).
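
A possible trace of the argument-collecting part, on the task's second example (intermediate values as comments):

printf "[%d,%d,%d,%d,%d]" 5 4 1 3 2   -- "[5,4,1,3,2]" (printf used at result type String)
read "[5,4,1,3,2]" :: [Int]           -- [5,4,1,3,2]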

The second best solution by Loris Reiff is quite similar (also using printf and .::.), except that it uses a slightly less token-efficient way to extract the first, middle, and last argument from a five-element list.

The MC could not resist the temptation to look at some incorrect submissions: a glimpse into the abyss in a sense. There, one meets participants who tried to sacrifice everything for tokens: even correctness. Some omitted the sorting or did it half-heartedly:

oddSum a b c d e = head xs + last xs + xs!!2 where xs = [a, b, c, d, e]
oddSum a b c d e = c + min a (min b (min c (min d e))) + max a (max b (max c (max d e)))

Others tried out mysterious ways:

oddSum a b c d e = l!!0 + l!!2 + l!!4 where l = [ x | x <- [1..], elem x [a,b,c,d,e]]
oddSum a b c d e = l!!0 + l!!2 + l!!4 where l = [ x | x <- [-1000..], elem x [a,b,c,d,e]]

Fortunately, all of them found the path of the righteous (wo)man in the end.

And in case you wonder: the longest correct solution has 302 tokens and is, well, a bit repetitive.

min2 a b c d e
 | minimum [a,b,c,d,e] == a  = minimum [b,c,d,e]
 | minimum [a,b,c,d,e] == b  = minimum [a,c,d,e]
 | minimum [a,b,c,d,e] == c  = minimum [a,b,d,e]
 | minimum [a,b,c,d,e] == d  = minimum [a,b,c,e]
 | minimum [a,b,c,d,e] == e  = minimum [a,b,c,d]

max2 a b c d e
 | maximum [a,b,c,d,e] == a  = maximum [b,c,d,e]
 | maximum [a,b,c,d,e] == b  = maximum [a,c,d,e]
 | maximum [a,b,c,d,e] == c  = maximum [a,b,d,e]
 | maximum [a,b,c,d,e] == d  = maximum [a,b,c,e]
 | maximum [a,b,c,d,e] == e  = maximum [a,b,c,d]

oddSum a b c d e = a + b + c + d + e - min2 a b c d e - max2 a b c d e

Wrapping up: what have we learned this week?

Task of week 2

Lists still have not been spotted in the lecture. So let's do some number theory instead.

The digital product of a number n is the product of its (base 10) digits. The computation of the digital product can be iterated until a single digit is obtained. The single digit is called the (multiplicative) digital root, whereas the number of iterations needed is called the (multiplicative) persistence.

For example:

999 has digital product 9 * 9 * 9 = 729
729 has digital product 7 * 2 * 9 = 126
126 has digital product 1 * 2 * 6 = 12
 12 has digital product 1 * 2     = 2

Thus the digital root of 999 is 2 and the persistence is 4 (as we had to compute four digital products).

Your task is to implement the function persistenceRoot :: Integer -> (Integer, Integer) that computes both the persistence (first component) and the digital root (second component). E.g., persistenceRoot 999 = (4,2). (For a change we want to use Haskell's Integer type of unbounded integer numbers instead of the machine integers Int.)

As last week, the MC is looking for the shortest solution token-wise. (Next week we will move to some other criterion.) His best attempt needs 30 tokens. Just beat it!

Results of week 2

Here is the overdue top 10 for week 2. This week, 66 students submitted a solution.

Top 10 of week 2
  # name   #tokens   points
1. Mike Boss 19 10
Matthias Brun 19 10
Chris Buob 19 10
Gabriel Giamarchi 19 10
Moritz Knüsel 🎩 19 10
Simon Yuan 📐 19 10
7. Tynan Richards 21 4
8. Roger Baumgartner 24 3
9. Loris Reiff 🏆 27 2
Antonio Russo 27 2

You may wonder about the emojis used above. The MC will explain. But first let us look at a solution that reflects the state of the lecture at the submission date. It was submitted by the still anonymous professor with the comment "I didn’t try to be clever." and counts 57 tokens.

persistenceRoot x = aux 0 x

aux n x
    | x < 10    = (n,x)
    | otherwise = aux (n+1) (dp x)

dp x
    | x < 10     = x
    | otherwise  = (dp (x `div` 10)) * (x `mod` 10)

Without trying to be clever, there are some obvious further token optimizations possible in the above solution. (E.g., what have we learned last week about needless parentheses?) Altogether this might save around 10–15 tokens, which would still be miles away from the top 10.

To get into the top 10, we should rather try to be clever. Being clever about the whole problem at once may cause some headaches, so let's look at a subproblem first: computing the digital product (i.e., what the above dp function computes). Using div and mod is a natural way to approach this subproblem. Another natural approach to get the digits of a number x is to print the number: show x. Well, almost. This only gives one a list of Chars, which we would want to convert to a list of Integers. Unfortunately, it seems that there is no function to do this conversion directly, i.e., we need to take the detour over Ints: map toInteger (map digitToInt (show x)). Finally, to compute the product of the digits we can conveniently use the product function. Making all of this pointfree, we obtain the digital product function with 9 tokens:

product . map toInteger . map digitToInt . show

This expression or a slight variation of it occurs in solutions submitted by Roger Baumgartner and Antonio Russo as well as in the MC's own 30-token solution. It is good, but not exactly the stuff dreams are made of. The stuff dreams are made of for this exercise could be found on Hackage in the Data.Digits module: the digits function gives one the digits of a number with respect to a given base. Hence, one digital product costs only 4 tokens, as noticed by everyone else in the top 10:

product . digits 10
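
For example (evaluations as comments):

digits 10 999             -- [9,9,9]
product (digits 10 999)   -- 729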

But one digital product is not enough. We need to iterate it and count the number of iterations until we reach a single digit.

Until, hmm until... Here is the approach of the MC:

persistenceRoot =
  flip curry 0 $ until ((<10) . snd) $
    (+1) *** product . map read . map return . show

Or equivalently written in a bit saner and a bit less pointfree fashion:

persistenceRoot x =
  until ((<10) . snd) ((+1) *** product . map read . map return . show) (0, x)

The function until is another perfect match for this exercise: until test f x computes the first y = f (... (f (f x)) ...) for which test y is satisfied. The MC is a bit embarrassed to have wasted 5 tokens on the digital product and 3 tokens on the (+1) function, which exists in the Prelude under the name succ.
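
For reference, until can be defined recursively (equivalent to the Prelude's definition):

until :: (a -> Bool) -> (a -> a) -> a -> a
until p f x
  | p x       = x
  | otherwise = until p f (f x)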

Tynan Richards was more careful and used a bunch of funky functions from somewhat exotic modules. And he wrote a nice explanation that speaks for itself. Interestingly, Tynan's solution is actually recursive, i.e., he did not use a combinator (a la until) to iterate digital products.

-- The online IDE fails to import these packages
import Data.Digits (digits)
import Data.Tuple.Update (upd1)
import Data.Tuple.Extra (dupe, first)
import Data.Bool.Extras (bool)
import GHC.Integer (leInteger)

{-
- The pointfree style makes it hard to see exactly what this function does, so
- I'll briefly try to explain...
- 'bool' is a function that works like an if-then-else statement. It checks the
- value of 'leInteger 10', which is equivalent to '(<=) 10'. If that condition
- is false, then the input is smaller than 10 and we have the digital root, as
- well as persistence 0, so we return the value calculated by 'upd1 0 . dupe'.
- 'dupe x' gives the tuple (x, x), 'upd1 0 .' sets the first value of that
- tuple to 0, giving (0, x).
- If the input is greater than 10, we instead return the value calculated by
- 'first succ . persistenceRoot . product . digits 10', which recursively
- calculates the digital product ('persistenceRoot . product . digits 10'),
- then adds one to the first element of the tuple ('first succ') when it
- returns.
- The <*> operator applies the function's argument to each of the expressions
- of bool, which enables the complete lack of parentheses.
-}

persistenceRoot :: Integer -> (Integer, Integer)
persistenceRoot =
  bool . upd1 0 . dupe <*>
    first succ . persistenceRoot . product . digits 10 <*> leInteger 10

Let's finally look at the winning solutions. Five of the six were slight variations of the following one by Simon Yuan. Again, it is beautifully documented.

import Data.List
import Data.Bifunctor
import Data.Digits

--19 Token
persistenceRoot :: Integer -> (Integer, Integer)

{-
  The code explained:
  "product . digits 10" calculates the digital product
  (lets call this function dp).
        digits 10 n is evaluated to a Integral list, with the digits
        (in base 10) of n as its elements.
        product then takes to product of all the elements in that list.

  "iterate dp n" is evaluated to [n, dp n, dp (dp n), ...]
        This list contains the input n as well as all the intermediate steps.

  "break (<10)" splits the list into 2, it is a tuple of 2 lists.
      The breakpoint is at the first element for which (<10) is true.
      As soon as an element x is less than 10 (x is the first 1 digit number),
      x will be equal to dp x. Hence x is the digital root.

  "bimap genericLength head (a,b)" is evaluated to (length a, head b)
        The length of a is how many intermediate steps it took to reach a 1
        digit number (= persistence). The head of b (or any element of b) is
        the digital root.

  To save 2 tokens, make the function pointfree by omitting the input twice.
-}

persistenceRoot =
   bimap genericLength head . break (<10) . iterate (product . digits 10)

Above, genericLength is used instead of length to obtain an Integer rather than an Int. The iterate function deserves special attention: yes, iterate f x just iterates the function f as in [x, f x, f (f x), ...]. Remarkably, the iteration never stops, i.e., the output of iterate f x is unconditionally an infinite list (more on this in a few weeks in the course). The lazy evaluation strategy of Haskell saves us from waiting infinitely long for it to finish: in the subsequent computation we only need to observe a finite prefix of the infinite list.
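
On the running example 999, the relevant finite prefix and its split look as follows:

take 5 $ iterate (product . digits 10) 999   -- [999,729,126,12,2]

-- break (<10) splits the infinite iteration into
-- ([999,729,126,12], 2 : ...), so bimap genericLength head
-- yields (4, 2), as required.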

The remaining 19-token solution by Gabriel Giamarchi is somewhat different:

import Data.Digits
import Data.Tuple

{-
- The idea is to avoid costly (in tokens) termination criteria in
  recursive-functions.
- We use an easily seen property, namely that the digital persistence
  of x is smaller or equal to x (Except for zero !!).
- This allows to write the following function, where we zip from the right
  so that the minimum (lexicographical probably) selects the correct tuple.
- Luckily this also works for zero :-)
-}
persistenceRoot :: Integer -> (Integer, Integer)
persistenceRoot x =
  swap $ minimum $ iterate (product . digits 10) x `zip` enumFromTo 0 x
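
Before dissecting it, here is a trace on the running example x = 999:

iterate (product . digits 10) 999   -- [999,729,126,12,2,2,2,...
-- zipping with enumFromTo 0 999 truncates this to 1000 pairs:
--   [(999,0),(729,1),(126,2),(12,3),(2,4),(2,5),...,(2,999)]
-- minimum picks (2,4): the digital root 2 is the smallest first
-- component, and 4 is the smallest second component among the (2,_) pairs.
-- swap turns this into the expected result (4,2).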

This one beautifully relies on a mathematical property of the persistence (to prove the above "easily seen" property, prove first that the digital product of a number is smaller than the number itself for numbers > 10) and the default lexicographic ordering (i.e., (a,b) ≤ (c,d) = a < c || a = c && b ≤ d) on pairs in Haskell. Moreover, it is not even pointfree, which usually hints that it is possible to beat it in terms of tokens. And indeed, here is the MC's rewrite of Gabriel's solution that saves 1 token:

import Data.Digits
import Data.Tuple
import Control.Arrow
import Control.Monad

persistenceRoot =
  swap <<< minimum <<<
    zip `liftM2` iterate (product . digits 10) `arr` enumFromTo 0

Here, <<< is another operator for composition . (with precedence 1, whereas . has 9) and arr is another operator for application $ (with precedence 9 as any function written in backquotes, whereas $ has 0). Finally, the general function liftM2, whose type cannot be explained at this point, behaves in this particular instance as liftM2 z f g x = z (f x) (g x).

Great, now to the emojis. They can be considered special awards for a distinguished achievement in a particular exercise. The precise meaning and value (e.g., in terms of Schoggi) of these awards is as yet unclear (even to the MC), but hey, now you have a nice emoji stuck to your name.

First of all, Simon Yuan receives the mathematician award 📐. One of his early submissions read ... . take 100000 . iterate ... and he wrote lengthy justifications for why this is safe:

If you are interested to read, I have also done some research
( you really dont have to read this ;) ):
It is conjectured that 11 is the max persistence for base 10.
277777788888899 is the first number with persistence 11,
and they have tested it up to 10^233.

I actually tried to proof that 100000 is way more than enough by proving an
upper limit directly (i.e. I tried to prove the conjecture (before I knew it
was an conjecture) but couldnt really get anywhere. Then I tried to come up
with a lower bound of the number of digits needed for it to have persistence
at least p.

...

[something about the number of atoms in the universe]

...

The MC had advised Simon not to use this program and not to rely on the size of the universe. Clearly, we want to write programs (especially in this competition) that will work forever and under all circumstances. And as we know, the universe is expanding. Incidentally, taking the first x elements of that list when computing the persistence of x (instead of a fixed 100000 first elements) will always suffice—this was also exploited in Gabriel's solution.

The second award goes to Moritz Knüsel. He is a true black hat 🎩 as his 13-token solution demonstrates (comments: his, censorship: the MC's).

import System.Process
import System.IO.Unsafe
import Text.Printf
{- If this isn't allowed, please disregard this submission -}
persistenceRoot :: Integer -> (Integer, Integer)

{- very slow, gets killed after completing the last test,
   also using ghci incurs an extra token,
   because it prints extra stuff unless told not to
persistenceRoot =
  read . unsafePerformIO . readCreateProcess (shell "ghci -v0") . printf "\
\import Data.List\nimport Control.Arrow\n\
\genericLength *** head $ span (>=10) $ iterate (product <<< fmap read <<< fmap return <<< show) %d\n"
-}
-- this goes faster, but kinda defeats the purpose of a
-- haskell programming exercise
persistenceRoot = read . unsafePerformIO . readProcess  "bc" [] . printf "\
\define digprod (x) {\n\
  \if (x==0) return 1\n\
  \return (x %% 10)*digprod(x/10)\n\
\}\n\
\define droot (x) {\n\
  \if (x >= 10) return droot(digprod(x))\n\
  \return x\n\
\}\n\
\define drootcnt (x) {\n\
  \if (x >= 10) return 1+drootcnt(digprod(x))\n\
  \return 0\n\
\}\n\
\define void printit(x) {\n\
  \print \"(\", drootcnt(x), \",\" ,droot(x),\")\n\"\
\}\n\
\printit(%d)\n"

In principle, Moritz would have won this exercise, had it not been for the censored "back door" he used. In Moritz' defense: this was not explicitly prohibited in the rules (though of course in the unspoken rules of the competition), and Moritz was already regretting his usage of such an evil and unportable function, as his first comment indicates. Challenge for everybody: can you write a solution in such a style (i.e., using an external interpreter) without using the back door? For this to work, the interpreter would need to be pure (i.e., a side-effect-free calculator) and expressive enough to compute the digital root and persistence. If something like that is possible, the MC will have to redefine the notion of tokens to take the length of string literals into account.

The last award goes to Loris Reiff. He receives the fairness award 🏆 not for a solution he submitted (his was rather unspectacular), but for a solution that he did not submit:

Btw, today I was looking for solutions online for some inspiration for
next time. I found one on Rosetta Code which only uses 21 tokens. This
is obviously not a submission, since copying is no fun ;) [and I've
already submitted anyway] but it might be something for the blog (if no
one has submitted this solution or a better one).


-- slightly adapted from
-- https://rosettacode.org/wiki/Digital_root/Multiplicative_digital_root#Haskell
import Data.List
import Data.Digits
import Control.Arrow
mpmdr :: Integer -> (Integer, Integer)
mpmdr = (genericLength *** head) . span (> 9) . iterate (product . digits 10)

The MC feels guilty. His task is to come up with exercises that are original enough not to be on Rosetta Code. It is surprising that even the type signature matches. Fortunately, the solution there is not token-optimal, yet it is uncomfortably close to the optimum. The MC promises to research such "related work" better next time.

Lessons of the week, apart from the fact that the MC should do his related-work research properly, are:

Task of week 3

Yay, finally some lists in the lecture!

Here is the task of week 3: Given a finite directed graph G and a subset of its vertices S, define the function that computes the minimal elements in S with respect to G. An element n is minimal in S with respect to G if there is no other element m in S, such that there exists a path from m to n in G (note that only the start point m and end point n of such a path are required to be in S).

We represent directed graphs by adjacency lists using the type [(Int,[Int])]. For example, the entry (1,[2,3,4]) in such an adjacency list means that vertex 1 has outgoing edges to the vertices 2, 3, and 4. And the entry (2,[]) means that 2 has no outgoing edge (to the same effect we could have just omitted that entry). You may assume that for each vertex there is at most one entry in the adjacency list. Moreover, we are using lists [Int] to represent sets of vertices.

So we are looking for a function of type graphMin :: [(Int,[Int])] -> [Int] -> [Int] that performs the computation described above. The function may return the minimal elements in any order (and even with duplicates). For example:

import Data.List

sort $ nub $ graphMin [(0,[1,2]),(1,[3]),(2,[4])] [0,1,2,3,4]     = [0]
sort $ nub $ graphMin [(0,[1,2]),(1,[3]),(2,[4])] [1,2,3,4]       = [1,2]
sort $ nub $ graphMin [(0,[1,2]),(1,[3]),(2,[4])] [1,3,4]         = [1,4]
sort $ nub $ graphMin [(0,[1,2]),(1,[3]),(2,[4])] [0,3,4]         = [0]
sort $ nub $ graphMin [(0,[1,2]),(1,[3]),(2,[4])] [0,1,2,3,4,5,6] = [0,5,6]
sort $ nub $ graphMin []                          [0,1,2,3,4]     = [0,1,2,3,4]
sort $ nub $ graphMin [(0,[1,2]),(1,[0])]         [1,2,3]         = [1,3]
sort $ nub $ graphMin [(0,[1,2]),(1,[0])]         [1,0,3]         = [3]

The evaluation criterion is time efficiency. Your function will be tested on large graphs, small graphs, dense graphs, sparse graphs, short lists, and longer lists (but probably not too long lists). The solution of the MC is fairly unoptimized. It can handle lists of up to a few hundred elements for larger graphs (100000 edges) in a few seconds. There is some room for improvement.

Hint: :set +s in GHCi is useful to obtain some timing and memory measurements.

Results of week 3

This week there were only 26 student submissions (plus one by the MC). The MC is slightly disappointed. A decrease of 30% compared to the previous week is ok; a decrease of 60% compared to the previous week indicates that the exercise was simply too hard or not perceived as interesting. And this even though the idea behind this competition problem originated from a small "real-world" subproblem that the MC's alter ego has encountered in his recent (not yet published) research: compute the minimal elements of a set with respect to a preorder.

Additionally, the upcoming quiz and the associated preparations can presumably be blamed as well. (Although not for the 100% decrease in professorial participation. The professors forfeited in view of the thus far absolutely dominant student performance.)

The MC hopes for at least a non-decrease in participation next week. Even if you have close to 0 points at this point, it is still possible to get into the top 10 of the semester, although it will probably be hard to get into the top 5.

Anyway, enough rambling. Let's get down to business.

Top ≈10 of week 3
  # name      points
1. Moritz Knüsel 10
1+ε. Miro Haller ➖🧟
2. Mike Boss 9
Simon Yuan 9
4. Roger Baumgartner 7
5. Loris Reiff 6
6. Matthias Brun 5
Antonio Russo ➖5
Zur Shmaria 5
9. Noah Berner 2
Jorel Elmiger ➖2
Tynan Richards 2

How did these results come together? Through a sequence of increasingly harder tests. We start with a soundness test. For that purpose the MC uses his favorite random testing library: QuickCheck. Here is a ghci script that the MC ran on each of the submissions, after briefly ensuring that the submissions won't attempt to erase his file system (please don't do this: it would cost you a zillion competition points, but also earn you a 🎩).

:m + Test.QuickCheck Data.List
let getAll xs x = nub $ concat [zs | (y,zs) <- xs, x == y]
let dom = map (abs . fst)
let union xs = map (\x -> (x, filter (/= x) $ map abs $ getAll xs x)) $ dom xs
let test sol = quickCheckWith stdArgs { maxSize = 10, maxSuccess = 3000 }
  (\xs ys0 ->
    let ys = nub $ map abs ys0 in
    let g = union xs in
    (sort $ nub $ graphMinSol g ys) == (sort $ nub $ sol g ys))
:set +s
test GraphMin.graphMin

The MC should explain the quickCheckWith invocation. QuickCheck will execute its function argument in this invocation on 3000 randomly generated integer lists ys0 and graphs xs of length at most 10 (also all randomly chosen integers in those lists are between -10 and 10). To ensure that the randomly generated graphs satisfy the promised property of having at most one adjacency-list entry for each vertex, the MC used the union function defined above. For each of the 3000 tests, the output of the MC's unoptimized solution graphMinSol (shown below) is compared (modulo sorting and duplicates) to the output of the contestant's solution GraphMin.graphMin.

import Data.List

path g = go []
  where
    go vis x y
      | x == y = True
      | otherwise = case lookup x g of
        Nothing -> False
        Just zs -> let vis' = x : vis in or [go vis' z y | z <- zs \\ vis']

graphMinSol :: [(Int,[Int])] -> [Int] -> [Int]
graphMinSol g qs = let ps = nub qs in
  filter (\p -> all (\q -> p == q || not (path g q p)) ps) ps

Only 17 participants (including the MC) passed this test. If your name does not occur somewhere below, try running QuickCheck on your solution to see the actual bugs. Three of the remaining participants would have failed the test, had the MC not removed the negative numbers from the random inputs (via abs). Since nothing in the exercise stated that all numbers would be non-negative, these solutions could, in principle, have been disqualified outright. In an unexpected bout of mercy, the MC decided to see where these contestants would end up and potentially penalize them with a point deduction later. As a reminder, let's put a "heavy minus" emoji next to their names: Antonio Russo ➖, Jorel Elmiger ➖, Miro Haller ➖.

The whole truth about Miro Haller's submission is even more shocking. He spent great effort on making his solution very efficient. Yet a minor bug (even on positive numbers) compromised his whole work. When the MC reported the bug to Miro after the submission deadline, he quickly came up with a fix that still failed on negative numbers, but seemed to be correct otherwise. It is clear that the MC cannot give points to such a late submission, but he decided to at least evaluate this submission's efficiency. In other words: here is Miro's well deserved zombie emoji 🧟.

Something similar happened also to Matthias Brun's efficient solution. However, he was suspecting a bug and took precautions. He submitted an alternative "naive" solution and asked the MC to grade that one. A wise decision to keep some points.

OK, next test. First, here are three particular (families of) graphs:

k i = map (\k -> (k, [1..k-1] ++ [k+1..i])) [1..i] -- komplete [*]
c i = map (\k -> (k, [(k + 1) `mod` i])) [0..i-1]  -- cycle
l i = map (\k -> (k, [k + 1])) [1..i]              -- line

[*] Wikipedia is not sure whether complete graphs are typically abbreviated with a "k", because of the German word komplett (which seems to be a false friend) or the mathematician Kuratowski.

The MC then ran the following three tests.

let i = 20 in (sort $ nub $ GraphMin.graphMin (k i) [1..i]) == []
let i = 300 in (sort $ nub $ GraphMin.graphMin (c i) [0..i-1]) == []
let i = 300 in (sort $ nub $ GraphMin.graphMin (l i) [1..i]) == [1]

And here are the results (in no particular order).

TIMEOUT        Dan Kluser
0.01 0.01 0.21 Loris Reiff
0.05 0.10 0.05 Roger Baumgartner
0.01 1.48 4.05 Sandro Panighetti
0.01 0.01 0.01 Moritz Knüsel
TIMEOUT        Patrick Ziegler
TIMEOUT        Silvan Niederer
0.01 0.01 0.01 Mike Boss
0.01 0.01 0.02 Zur Shmaria
0.02 4.35 4.32 Noah Berner
0.02 5.32 2.61 Tynan Richards
0.01 0.04 0.07 Antonio Russo ➖
0.02 5.32 1.92 Jorel Elmiger ➖
0.01 0.28 0.56 MC
0.02 0.34 0.14 Matthias Brun
0.01 0.01 0.01 Miro Haller 🧟
0.06 0.01 0.01 Simon Yuan

Three timeouts and a bit of fluctuation otherwise. Not really conclusive. But you can already see three solutions that are equally fast on all graphs: Moritz Knüsel's, Mike Boss's, and our dead man walking's. The timeouts are, of course, eliminated before the next round.

The MC also played around with some permutations and subsequences of the above arguments. The test let i = 20 in (sort $ nub $ GraphMin.graphMin (k i) [i,i-2..0]) == [] revealed a bug in Sandro Panighetti's solution, which had managed to survive the QuickCheck tests. A good reminder that testing is good, but proving is better. The MC decided at this point that the remaining 11 contestants will all end up in the top 10. So the next tests are about the final placing.

Here is the next challenge: lines with many complete subgraphs along them.

:m + Data.List
let getAll xs x = nub $ concat [zs | (y,zs) <- xs, x == y]
let dom = map (abs . fst)
let union xs = map (\x -> (x, filter (/= x) $ map abs $ getAll xs x)) $ dom xs
let k i j = map (\k -> (k, [i..k-1] ++ [k+1..j])) [i..j]
let l i = map (\k -> (k, [k + 1])) [1..i]
:set +s
let i = 100 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0..i]) == []
let i = 100 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,10..i]) == [0]
let i = 100 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,20..i]) == [0]
let i = 300 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0..i]) == []
let i = 300 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,10..i]) == [0]
let i = 300 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,20..i]) == [0]
let i = 600 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0..i]) == []
let i = 600 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,10..i]) == [0]
let i = 600 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,20..i]) == [0]
let i = 900 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0..i]) == []
let i = 900 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,10..i]) == [0]
let i = 900 in (sort $ nub $ GraphMin.graphMin
  (union (l i ++ concatMap (\i -> k i (i + 9)) [0,10..i-10])) [0,20..i]) == [0]

And the corresponding results:

0.02 0.02 0.01 0.07 0.14 0.09 0.19 0.73 0.50 0.57 2.35 1.46  Loris Reiff
0.04 0.01 0.02 0.20 0.12 0.07 0.87 0.27 0.30 1.98 0.71 0.66  Roger Baumgartner
0.02 0.01 0.01 0.04 0.07 0.06 0.22 0.11 0.14 0.32 0.26 0.38  Moritz Knüsel
0.06 0.02 0.02 0.11 0.12 0.15 0.42 0.43 0.21 0.50 0.75 0.52  Mike Boss
0.03 0.02 0.02 0.13 0.09 0.10 0.61 0.25 0.47 1.52 0.68 0.99  Zur Shmaria
0.27 0.04 0.03 6.14 0.67 0.37 49.61 TIMEOUT                  Noah Berner
0.26 0.05 0.03 5.79 0.64 0.37 42.66 TIMEOUT                  Tynan Richards
0.03 0.02 0.02 0.12 0.11 0.12 0.48 0.36 0.43 0.90 0.58 0.50  Antonio Russo ➖
0.37 0.02 0.02 9.44 0.14 0.09 TIMEOUT                        Jorel Elmiger ➖
TIMEOUT                                                      MC
0.07 0.02 0.01 0.79 0.15 0.11 5.47 0.78 0.52 17.09 2.27 1.35 Matthias Brun
0.06 0.01 0.02 0.06 0.06 0.05 0.23 0.23 0.22 0.48 0.48 0.33  Miro Haller 🧟
0.03 0.01 0.01 0.07 0.06 0.05 0.21 0.20 0.21 0.46 0.38 0.34  Simon Yuan

The MC is defeated. And with him three other contestants.

Next, we'll try something more intuitive and realistic. How about the prime numbers?

:m + Data.List Data.Hashable
let multiples k b = [k * i | i <- [1..b`div`k + 1]]
let g i = map (\k -> (k, multiples k i)) [1..i]
let g1000 = g 1000
let g5000 = g 5000
let g10000 = g 10000
:set +s
hash $ GraphMin.graphMin g1000 [2..length g1000]
hash $ GraphMin.graphMin g5000 [2..length g5000]
hash $ GraphMin.graphMin g10000 [2..length g10000]

Note how the hash function is used to avoid measuring the time needed to print the results. The minimal elements of the above graphs are exactly the prime numbers (up to the bound b). Perhaps this is the best algorithm for computing primes?
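
As a small sanity check on a toy instance (using graphMinSol and the divisibility graph g from above; this was not part of the timed evaluation):

sort $ nub $ graphMinSol (g 30) [2..30]   -- [2,3,5,7,11,13,17,19,23,29]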

0.23 5.25 22.53 Loris Reiff
0.09 0.32 1.09  Roger Baumgartner
0.04 0.11 0.24  Moritz Knüsel
0.04 0.15 0.48  Mike Boss
TIMEOUT         Zur Shmaria
TIMEOUT         Antonio Russo ➖
TIMEOUT         Matthias Brun
0.04 0.18 0.45  Miro Haller 🧟
0.08 0.42 1.87  Simon Yuan

Three students gone. Making the divisibility graph even bigger (e.g., let g100000 = g 100000), while keeping the list size (hash $ GraphMin.graphMin g100000 [2..10000]), also makes Loris Reiff's and Roger Baumgartner's solutions time out.

Time for a final showdown. Let's go Big Data. In an intermediate submission, Moritz Knüsel claims to have tested his solutions on graphs with 20 million edges. The divisibility graph with 2 million vertices has roughly 31 million edges.

:m + Data.List Data.Hashable
let multiples k b = [k * i | i <- [1..b`div`k + 1]]
let g i = map (\k -> (k, multiples k i)) [1..i]
let gx = g 2000000
:set +s
hash $ GraphMin.graphMin gx [30000,29998..10000]

And here the final results:

13.84    Moritz Knüsel
TIMEOUT  Mike Boss
21.79    Miro Haller 🧟
TIMEOUT  Simon Yuan

Pretty amazing, huh? Note that those huge graphs are processed without any parallelization on an ordinary laptop.

Finally, let's briefly sample a few of the best solutions. They are a bit long, and the MC will (as usual) rely on the comments provided by the contestants.

Mike Boss uses Haskell's Data.Graph library:

module GraphMin where

    {- Grade this (for real now)
        - Longer first part is just the conversion to type Graph.
          Problems with non-existing vertices and renaming are fixed.
            - notInG: all endpoints of edges which are not in the graph
            - allOfG: all vertices from the graph
              (even ones with no outgoing edges)
            - sButNotG: vertices which are in the set but not in the graph =<
              are minimum
            - sAndG: vertices in the set which are also in the graph
            - convertSet: fixes renaming
        - Second part is the actual algorithm:
          Deletes all vertices reachable from the vertices in the set
          which means they aren't minimums.
    -}

    import Data.Graph
    import Data.Maybe
    import qualified Data.IntSet as Set

    graphMin :: [(Int,[Int])] -> [Int] -> [Int]
    minimumInGraph :: Graph -> Set.IntSet -> Set.IntSet -> [Int]


    graphMin graph set = [x | (x, _, _) <- result] ++ (Set.toList sButNotG)
        where
            realSet = Set.fromList set
            edges = Set.fromList (concat $ map snd graph)
            vertices = Set.fromList (map fst graph)

            notInG = Set.difference edges vertices
            allOfG = Set.union vertices notInG
            sButNotG = realSet Set.\\ allOfG
            sAndG = realSet Set.\\ sButNotG
            correctedGraph =  [(s, []) | s <- (Set.toList notInG)] ++ graph

            (builtGraph, v, k) = graphFromEdges
              (map (\(a, b) -> (a, a, b)) correctedGraph)
            convertSet = Set.map (fromJust . k) sAndG

            result = map v (minimumInGraph builtGraph convertSet convertSet)


    minimumInGraph graph set result
            | Set.null set = Set.toList result
            | Set.null result = []
            | Set.size set == 1 = Set.toList (result Set.\\ reachFromS)
            | otherwise = minimumInGraph graph (Set.delete head set)
               (result Set.\\ reachFromS)
              where
                reachFromS = Set.delete head
                  (Set.fromList (reachable graph (head :: Int)))
                head = Set.findMin set

Miro Haller's corrected solution has a nice complexity analysis, which the MC didn't attempt to verify. First Miro has to fix the negative number problems, anyway.

-- I use containers-0.5.10.2 (https://hackage.haskell.org/package/containers-0.5.10.2)

module GraphMin where

import Prelude hiding (fromList)
import Data.IntMap as IntMap
import Data.Sequence as Seq

-- I use IntMap because they have (almost) constant access time I think.
-- It depends on the number of bits the integers use.
-- Also I rely on the fact that the nodes are numbered in a more or less
-- ascending sequence of natural numbers
graphMin :: [(Int,[Int])] -> [Int] -> [Int]
graphMin graph subset = findUnreachable subset []
    where
        -- O(n)
        graphNodes = [node | (node, xs) <- graph]

        -- O(n)
        maxNodeNr = maximum (graphNodes ++ subset)

        -- Visited array O(n) (I assume n = O(maxNodeNr))
        visited = IntMap.fromList [(node, False) | node <- [0..maxNodeNr]]

        -- Convert out nodes lists to sequences O(n^2)
        graphSeq = [(node, Seq.fromList outNodes) | (node, outNodes) <- graph]
        -- Out edges O(n)
        outEdges = IntMap.fromList graphSeq

        -- True if node is in subset
        -- O(n)
        isInSubset = IntMap.fromList [(node, True) | node <- subset]


        -- Array with for each node nodes from which it is reachable
        -- O(n)
        reachableFromNodesInit = IntMap.fromList [(node, []) | node <- [0..maxNodeNr]]

        -- DFS like algorith, write from node to reachable nodes
        -- O(n^2 * log(n) )
        dfsIter [] reachableFromNodes = reachableFromNodes
        dfsIter (subsetNode:subsetRest) reachableFromNodes = dfsIter subsetRest newReachableFromNodes
            where
                -- O(1)
                connectedNodes = findWithDefault Seq.Empty subsetNode outEdges

                -- O(n)
                newReachableFromNodes = dfsFindReachable connectedNodes reachableFromNodes visited
                    where
                        dfsFindReachable Seq.Empty reachableFromNodes visitedLoc = reachableFromNodes
                        dfsFindReachable todoNodesSeq reachableFromNodes visitedLoc
                            -- Cases checking O(1)
                            | findWithDefault False nextNode isInSubset = dfsFindReachable nodesRest extendedReachableFromNodes visitedNew
                            | findWithDefault False nextNode visitedLoc = dfsFindReachable nodesRest reachableFromNodes visitedNew
                            | otherwise = dfsFindReachable extendedListRest reachableFromNodes visitedNew
                                where
                                    updateFn key x = True
                                    visitedNew = adjustWithKey updateFn nextNode visitedLoc

                                    -- Both O(1) because it is the first entry
                                    nextNode = Seq.index todoNodesSeq 0
                                    nodesRest = Seq.deleteAt 0 todoNodesSeq

                                    -- O(1)
                                    extendedReachableFromNodes = IntMap.adjust (subsetNode :) nextNode reachableFromNodes

                                    -- O( log( min{n1, n2} ) ) where n1,n2 are the lengths of the sequences
                                    extendedListRest = (findWithDefault Seq.Empty nextNode outEdges) >< nodesRest


        -- Comp. stands above
        reachableFromNodes = dfsIter subset reachableFromNodesInit

        -- Find all nodes that are at most reachable from itself (through a path through nodes
        -- which are not in the subset S)
        -- O(n^2)
        findUnreachable [] unreachableNodes = unreachableNodes
        findUnreachable (subsetNode:subsetRest) unreachableNodes
            | hasOtherNode (findWithDefault [] subsetNode reachableFromNodes) = findUnreachable subsetRest unreachableNodes
            | otherwise = findUnreachable subsetRest (subsetNode : unreachableNodes)
                where
                    -- Check if a node is reachable from another node
                    -- O(n)
                    hasOtherNode [] = False
                    hasOtherNode (otherNode:listRest) =
                        if subsetNode == otherNode
                            then
                                hasOtherNode listRest
                            else
                                True

Finally, Moritz Knüsel has apparently found the right balance between lazy and eager data structures. His solution also nicely handles the edge cases separately.

{-# LANGUAGE BangPatterns #-}
module GraphMin where
import Data.List
import qualified Data.IntSet
import qualified Data.IntMap.Lazy

-- final submission

type LG = [(Int,[Int])]

graphMin :: LG -> [Int] -> [Int]
graphMin g s = graphMin11 g s



-- Basic Idea:
-- from every node in S, we walk the graph and find all elements in S that are
-- reachable from that particular Node, except the node itself,
-- since a minimal element may still have paths coming back to itself
-- the union of all nodes discovered like this is then subtracted from S,
-- yielding only the nodes that are not reachable from any other
--
-- This would obviously be way too slow, so, on a walk starting at a node v,
-- we save all nodes we've seen, except the ones that are on a path from v to v
-- these nodes are saved in a map, together with v,
-- so we can avoid walking that path again later
-- This collection of nodes and the map with shortcuts is
-- then passed along for the search starting at the next node in S

-- because of stack overflow problems some strictness annotations
-- have been added, which seemed to solve them

-- shortcuts for all the datastructures and functions for accessing them
type Set2 = Data.IntSet.IntSet
ins2 = Data.IntSet.insert
empty2 = Data.IntSet.empty
toLst2 = Data.IntSet.toList
fromLst2 = Data.IntSet.fromList
del2 = Data.IntSet.delete
isIn2 = Data.IntSet.member
union2 = Data.IntSet.union
size2 = Data.IntSet.size
diff2 = Data.IntSet.difference

type G2 = Data.IntMap.Lazy.IntMap [Int]
lgToG2 = Data.IntMap.Lazy.fromList
getEdges2 g v = Data.IntMap.Lazy.findWithDefault [] v g
---
type GoesToMap2 = Data.IntMap.Lazy.IntMap Int
emptyGTM2 = Data.IntMap.Lazy.empty
memberGTM2 = Data.IntMap.Lazy.member
getGTM2 = Data.IntMap.Lazy.findWithDefault 0
unionGTM2 = Data.IntMap.Lazy.union
fromSetGTM2 = Data.IntMap.Lazy.fromSet
mergeSetGTM2 gtm set v = unionGTM2 gtm (fromSetGTM2 (const v) set)


-- recursively look at all nodes reachable from a given node, recording the path taken
fun11 :: GoesToMap2 -> -- contains nodes that, in earlier searches,
                       -- were found to loop back to where the search started
  Set2 -> -- contains all nodes that go back to the value this search started at,
          -- needed in case of loop back paths with cycles in them
  G2 ->   -- the graph we're looking at
  Set2 -> -- the set S
  Set2 -> -- the set of points we've already visited. This set may
          -- contain nodes that are in lgtm, since these will only be
          -- removed later, so it's important to check lgtm first
  Set2 -> -- the nodes on the path we've come down
  Int ->  -- the current node
  Int ->  -- the node we started at
  (Set2,Set2) -- returns the set of nodes seen and a set of nodes that lead back to sval.
              -- If the current node is not on such a path, this will be empty
fun11 gtm lgtm g s !seen pset v sval
-- we've come back around, so we need to return the nodes we took to get here
  | sval == v && not (vInSeen) = (seen,vpset)
-- we've come back around, but sval is already known not to be minimal
  | sval == v && vInSeen       = (seen,empty2)
-- the current node is on a path that leads to sval, so we need to save how we got here
  | mlGTMv                     = (seen, vpset)
-- we've already visited this node at some point
  | vInSeen                    = (seen,empty2)
-- the current node is in S, we can't go further
  | isIn2 v s                  = (vseen,empty2)
-- the current node leads to an element in S, but we can skip going down the entire path
-- since all other nodes on that path are not in S, we don't need to add all of them to seen
  | mGTMv                      = (ins2 gGTMv vseen,empty2)
  | otherwise                  =
-- recurse on v's neighbors
      descend11 gtm lgtm g (getEdges2 g v) s vseen vpset sval
  where vInSeen = isIn2 v seen
        vpset = ins2 v pset
        vseen = ins2 v seen
        mGTMv = memberGTM2 v gtm
        mlGTMv = isIn2 v lgtm
        gGTMv = getGTM2 v gtm

-- helper function for the recursion
-- since, in the function above, we check lgtm before seen,
-- we can avoid removing elements from seen here
descend11 :: GoesToMap2 -> Set2 -> G2 -> [Int] -> Set2 -> Set2 -> Set2 -> Int -> (Set2,Set2)
descend11 _ lgtm _ [] _ seen  _ _ = (seen,lgtm)
descend11 gtm lgtm g (e:es) s !seen pset sval =
  let (nseen,npset) = fun11 gtm lgtm g s (seen) pset e sval in
    descend11 gtm (union2 lgtm npset) g es s nseen pset sval

-- searches on all elements in S
-- if a search discovered paths that go back to its starting node, all nodes on such paths
-- are taken out of seen (if we changed some guards around in fun11, this would not be necessary
-- but some quick tests indicate it helps if s is large and g is dense) and added
-- to the map
gmhelp11 :: G2 -> [Int] -> Set2 -> Set2 -> GoesToMap2 -> Set2
gmhelp11 g [] s seen gtm = seen
gmhelp11 g (e:es) s !seen gtm =
  let (nseen,npset) = descend11 gtm empty2 g (getEdges2 g e) s seen empty2 e in
    gmhelp11 g es s (diff2 nseen npset) (mergeSetGTM2 gtm npset e)

-- transforms the graph and S into the datastructures used above, calls gmhelp11 and
-- subtracts the result from S
graphMin11 :: LG -> [Int] -> [Int]
-- get trivial cases out of the way right away
graphMin11 _ [] = []
graphMin11 _ [x] = [x]
graphMin11 [] s = s
graphMin11 g s = toLst2 $ diff2 s' (gmhelp11 (lgToG2 g) s s' empty2 emptyGTM2)
  where s' = fromLst2 s

Oh, and the heavy minuses translate into a one point deduction (truncated at 0).

What have we learned?

Task of week 4

A string s is a subsequence of another string t, if one can obtain s by removing characters from t. For example: The string "foo" is a subsequence of "fofo", "frohlocken", and "funktionale Programmierung is einfach toll". It is not a subsequence of "ofo", "FoO", "oooo", and "tofu".

A list of strings xs is subsumed by another list of strings ys if, for each string s from xs, there exists a string t in ys such that s is a subsequence of t. Moreover, you may not use strings from ys more than once (as witnesses for the above "exists").

Write a function subsub :: [String] -> [String] -> Bool that checks whether its first argument is subsumed by its second argument.

All following examples should evaluate to True:

subsub ["a", "ab", "abc"] ["false", "bab", "cdb", "cabAdAbc", "true"]
subsub ["FM", "FP"] ["Ferrovia Mesolcinese", "Microsoft FrontPage"]
subsub ["FP", "FM"] ["Ferrovia Mesolcinese", "Microsoft FrontPage"]
subsub ["1+2=2"] ["12+12=24"]
not $ subsub ["FM", "FP"] ["FMFP"]
subsub ["FM", "FP"] ["FMFP", "FMFP"]
subsub ["FM", "FM"] ["FM", "Ferrovia Mesolcinese"]
not $ subsub ["FM", "FM"] ["FM"]

Evaluation criterion: efficiency. Your solution will be tested with short and long lists and short and long strings.

Results of week 4

Evaluation time! The MC apologizes for taking so much time.

As in the previous week, there were 26 student submissions. A 0% decrease is exactly what the MC hoped (and asked) for. Maybe for next week, the MC should ask for a slight (say 10%) increase in participation.

Here is the top 16:

Top 16 of week 4
  # name   points
1. Moritz Knüsel 10
2. Gabriel Giamarchi 🖥 9
3. Matthias Brun 8
Miro Haller 🔝 8
Loris Reiff 8
Börge Scheel 8
7. Simon Yuan 4
8. Felix Huber 3
Patrick Ziegler 3
10. Roger Baumgartner 1
Mathias Blarer 1
Mike Boss 1
Chris Buob 1
Kevin De Keyser 1
Tynan Richards 1
Stefan Zemljic 1

As usual, the MC first consulted his good old friend QuickCheck to rule out incorrect submissions. Here are the tests that each submission had to pass:

:m + Test.QuickCheck Data.List Safe
let alt xs ys = case xs of [] -> ys ; (x:xs) -> x : alt ys xs
let qc = quickCheckWith stdArgs { maxSize = 10, maxSuccess = 3000 }
let t0 sol = qc (\xs ys -> subsub xs ys == sol xs ys)
let t1 sol = qc (\xs ys -> sol xs (map (alt ys) xs))
let t2 sol = qc (\xs -> sol xs (reverse xs))
let t3 sol = qc (\xs -> sol xs (sort xs))
let t4 sol = qc (\xs -> sol xs (concat xs : tailSafe xs))
let t5 sol = qc (\xs -> sol xs (replicate (length xs) $ concat xs))
:set +s
t0 SubSub.subsub
t1 SubSub.subsub
t2 SubSub.subsub
t3 SubSub.subsub
t4 SubSub.subsub
t5 SubSub.subsub

Only t0 compares the outcome of a student submission SubSub.subsub with the following naive solution subsub by the MC.

import Data.List

subseq [] _ = True
subseq _ [] = False
subseq (x : xs) (y : ys)
  | x == y = subseq xs ys
  | otherwise = subseq (x : xs) ys

subsub :: [String] -> [String] -> Bool
subsub [] ks = True
subsub (l : ls) ks =
  any (\k -> if subseq l k then subsub ls (delete k ks) else False) ks

All other tests do not require a "master solution", i.e., they could have been easily run by the participants. The function tailSafe from the Safe module (cabal install safe) returns [] for input [].

Only 17 student submissions passed these tests. All but one of them received some points this week. The unlucky one—Valentin S.—passed the QuickCheck tests but was caught producing a wrong result on one of the subsequent (non-randomized) tests. A good reminder that testing can only prove the presence of bugs, but not their absence!

The naive solution by the MC is exponential in the worst case (more precisely, something like length ks! where ! is a factorial and not an exclamation!): each call to subsub may yield up to length ks recursive calls. The 1-point solutions are all variations of this approach (with more or fewer optimizations for special cases, e.g., sorting the lists of strings by length and filtering out too-short strings from ks, or using Sequences instead of lists). Regardless of whether the optimizations were there, all exponential solutions time out on one of the following simple performance tests:

:m + Data.List
let s = "FMFP"
let xs = subsequences s
let n = length xs
:set +s
SubSub.subsub xs (replicate n s)
not $ SubSub.subsub xs (map (\_ -> 'x') s : replicate (n - 2) s)
not $ SubSub.subsub xs (map (\_ -> 'x') s : map (\_ -> 'x') s : replicate (n - 2) s)
:q

For transparency, here are the timing results (not sorted in any interesting order).

0.02 0.01 0.01 Felix Huber
0.01 0.00 0.01 Loris Reiff
TO Roger Baumgartner
0.01 0.00 0.01 Gabriel Giamarchi
TO Mathias Blarer
0.01 0.00 0.00 Börge Scheel
0.02 0.00 0.00 Miro Haller
0.01 0.00 0.01 Moritz Knüsel
0.04 0.02 0.02 Patrick Ziegler
0.03 0.01 0.03 Simon Yuan
BUG V. Stoppiello
TO Kevin De Keyser
TO Mike Boss
TO Chris Buob
TO Tynan Richards
TO Stefan Zemljic
0.01 0.00 0.01 Matthias Brun
0.01 0.01 0.01 MC-memo
TO MC-naive
TO MC-sorted

Actually, the above statement about all exponential solutions timing out is not entirely true. MC's solution MC-memo is a refinement of his MC-sorted solution, which in turn is a refinement of MC-naive. All of them are exponential in the worst case. MC-memo uses memoization to avoid recomputing recursive calls with the same parameters. This allows it to handle these test cases efficiently. Memoization has a cost though, which becomes especially visible when considering memory consumption (not shown here). But let's leave the optimization of memory consumption for another competition task.

import Data.List
import Data.Function.Memoize

subseq [] _ = True
subseq _ [] = False
subseq (x : xs) (y : ys)
  | x == y = subseq xs ys
  | otherwise = subseq (x : xs) ys

subsub :: [String] -> [String] -> Bool
subsub [] _ = True
subsub _ [] = False
subsub ls ks
  | length ks < length ls = False
  | otherwise = subsub' (prepare ls) (prepare ks)
  where
    prepare = sortOn minimum . filter (/= [])
    subsub' = memoize2 subsub''
    subsub'' [] ks = True
    subsub'' (l : ls) ks =
      let ms = dropWhile (\k -> minimum k > minimum l) ks in
      any (\k -> if subseq l k then subsub' ls (delete k ms) else False) ms

As the above comments suggest, everybody else used a more tractable algorithm to solve the problem. To quote Felix Huber: "the problem statement screamed for a flow graph approach", and indeed all the remaining solutions have used one or another algorithm for the maximum-cardinality bipartite matching problem. Namely: if one views the strings in the two lists as vertices in the two partitions of a bipartite graph, where two vertices are connected if one string is a subsequence of the other, then the problem at hand immediately reduces to finding a matching of a certain size.
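
To make the reduction concrete, here is a minimal sketch of this approach (not any contestant's code): the naive subseq test from above combined with a textbook augmenting-path search (Kuhn's algorithm). It grows a matching one string of xs at a time and accepts exactly if every string gets its own witness.

import qualified Data.Map as M
import qualified Data.Set as S

subseq [] _ = True
subseq _ [] = False
subseq (x : xs) (y : ys)
  | x == y = subseq xs ys
  | otherwise = subseq (x : xs) ys

-- matching: index of a witness in ys -> index of the string in xs it witnesses
type Matching = M.Map Int Int

-- try to find an augmenting path starting at left vertex u, threading the
-- set of visited right vertices through failed branches
augment :: [[Int]] -> Matching -> S.Set Int -> Int -> (S.Set Int, Maybe Matching)
augment adj mat vis u = go vis (adj !! u)
  where
    go vis [] = (vis, Nothing)
    go vis (v : vs)
      | v `S.member` vis = go vis vs
      | otherwise = case M.lookup v mat of
          -- v is free: match it to u
          Nothing -> (vis', Just (M.insert v u mat))
          -- v is taken: try to re-match its current partner elsewhere
          Just u' -> case augment adj mat vis' u' of
            (vis'', Just mat') -> (vis'', Just (M.insert v u mat'))
            (vis'', Nothing)   -> go vis'' vs
      where vis' = S.insert v vis

subsub :: [String] -> [String] -> Bool
subsub xs ys = M.size (foldl grow M.empty [0 .. length xs - 1]) == length xs
  where
    adj = [[j | (j, y) <- zip [0 ..] ys, subseq x y] | x <- xs]
    grow mat u = maybe mat id (snd (augment adj mat S.empty u))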

Side remark: When he was coming up with this exercise, the MC had indeed first started with the NP-complete subsumption problem, in which the subsequence test is replaced by a matching test (on predicate logic terms). The MC decided to ask the simpler problem, since algebraic datatypes (useful to represent terms) had not been covered in the lecture by the due date of this exercise. However, the MC didn't bother to change his exponential solution, which was originally crafted for an NP-complete problem. After all, the task of the MC is to leave the stage for the competing participants and their creativity. (This is a cheap way of saying that the MC didn't realize how much simpler the problem became complexity-wise.)

Certainly, not all polynomial solutions are equally fast. We scale the previous test up by starting from let s = "FMFPPFMF" instead of let s = "FMFP". This eliminates two competitors as well as the MC's memoizing solution:

TO Felix Huber
7.04 6.86 5.74 Loris Reiff
6.13 0.25 5.23 Gabriel Giamarchi
6.23 0.00 5.47 Börge Scheel
0.67 0.00 0.78 Miro Haller
2.45 1.90 1.80 Moritz Knüsel
TO Patrick Ziegler
6.85 6.41 6.47 Simon Yuan
0.79 0.82 0.78 Matthias Brun
TO MC-memo

Interestingly, the two eliminated solutions both use the Data.Graph library. Apparently, not a very optimized library for this task.

A better library seems to be available in a separate, not-easy-to-install package. It implements an augmenting path algorithm for computing the maximum matching. Five of the remaining seven submissions have used this package, which in a comment says: More efficient than using MaxFlow from FGL with constant weight and additional source and sink nodes. But not as good as Hopcroft–Karp would be.

Miro Haller's submission uses another library, which implements the Hopcroft-Karp algorithm. At 365 lines of code, Miro earns the award for going hopelessly over the top with a competition exercise 🔝. (Or in other words: that's the spirit!) And even if you deduct the 200 lines that come from the Hopcroft-Karp library (as acknowledged in a comment by Miro), this is still the longest solution. But is it also the best?

To be honest, the MC had a hard time drawing a fine line between those 7 solutions, which were all pretty efficient. On the other hand, he didn't want to give the full points to all 7. The solution is very pragmatic. The MC decided to select a few tests (completely at random, unbiased, and very objective), measure the time for all the submissions, take the sum, and see where we end up. For this ingenious approach, the MC awards himself the dice of true randomness 🎲.

Here are the selected tests: First something pseudo-random:

:m + Data.List Data.Char Data.Hashable Data.List.Split
let next seed xs = chr ((hash xs * seed) `mod` 26 + 97)
let mk seed xs = let x = next seed xs in x : mk ((seed * ord x) `mod` 500 + seed) (x : xs)
let len = 100
let num = 400
let seed = 42
let ys = take num $ chunksOf len $ mk seed ""
let xs = take (num `div` 2) (map (take (len `div` 10)) ys)
:set +s
SubSub.subsub xs (reverse ys)

This yields a bipartite graph with 200+400 vertices and 19009 edges (roughly evenly distributed among the 200 nodes in the smaller part of the graph).

Second, something with similar characteristics, but where the expected result is False.

:m + Data.List Data.Char Data.Hashable Data.List.Split
let next seed xs = chr ((hash xs * seed) `mod` 26 + 97)
let mk seed xs = let x = next seed xs in x : mk ((seed * ord x) `mod` 500 + seed) (x : xs)
let len = 100
let num = 400
let seed = 42
let ys = take num $ chunksOf len $ mk seed ""
let xs = take (num `div` 2) (map (take (len `div` 10)) ys)
:set +s
not $ SubSub.subsub xs (map (filter (/='m')) $ reverse ys)

Third, something with a lot of duplicated junk.

:m + Data.List
let alt xs ys = case xs of [] -> ys ; (x:xs) -> x : alt ys xs
let xs = map (flip replicate 'a') [5 .. 300]
let ys = concatMap (flip replicate "abc") [5 .. 300]
let zs = alt xs ys
:set +s
SubSub.subsub xs zs

Fourth, something with quite some junk.

:m + Data.List Data.Char Data.Hashable Data.List.Split
let next seed xs = chr ((hash xs * seed) `mod` 26 + 97)
let mk seed xs = let x = next seed xs in x : mk ((seed * ord x) `mod` 500 + seed) (x : xs)
let alt xs ys = case xs of [] -> ys ; (x:xs) -> x : alt ys xs
let xs = map (flip replicate 'a') [5 .. 300]
let ys = take 4514 $ chunksOf 4 $ mk 42 ""
let zs = alt xs ys
:set +s
SubSub.subsub xs zs

Fifth, something that is close to a full bipartite graph with 300+300 vertices.

:m + Data.List Data.Char
let max = 299
let len = 10
let alph = map chr [0 .. max]
let xs = map (\i -> replicate len (chr i) ++ [chr ((i + 1) `mod` (max + 1))]) [0 .. max]
let ys = map (\i -> concatMap (replicate len) (drop i alph ++ take i alph)) [0 .. max]
:set +s
SubSub.subsub xs ys

And here are the results that translate to points as shown above. To create the ranking, the MC had to make several tough decisions. In particular, the fairness award from exercise 2 helped Loris Reiff to be grouped together with Miro Haller's solution.

1 2 3 4 5 Sum
Moritz Knüsel 23.98 22.92 5.54 4.55 4.59 61.58
Gabriel Giamarchi 15.59 17.33 30.43 11.85 11.02 86.22
Miro Haller 26.52 19.77 19.37 7.99 48.20 121.85
Matthias Brun 19.97 18.03 21.97 8.17 57.71 125.85
Börge Scheel 20.61 16.61 23.20 11.85 59.40 131.67
Loris Reiff 19.91 17.90 31.74 17.74 58.60 145.89
Simon Yuan 20.10 19.75 49.10 31.43 59.92 180.30

We have not talked about the winner yet. Moritz Knüsel has implemented something halfway homebrewed. From his comments, he did not seem to believe that his solution would beat the commonly used package. Well, it seems that the dice of true randomness was friendly to Moritz. Anyway, here is his solution.

module SubSub (subsub) where
import Data.List
import Control.Arrow
import Data.Char
import qualified Data.IntMap.Lazy as M
import qualified Data.IntSet as S

{-
  based on an algorithm for bipartite matching found at
  https://olympiad.cs.uct.ac.za/presentations/camp2_2017/bipartitematching-robin.pdf

  unfortunately I couldn't come up with anything faster than Data.Graph.MaxBipartiteMatching
  (https://hackage.haskell.org/package/maxBipartiteMatching-0.1.0.0/candidate/docs/Data-Graph-MaxBipartiteMatching.html)
  but here goes

-}

-- In order to speed up the subseq check, we build a map from characters to the
-- indices of these characters for every string in Y
-- Later, when we check a string from X, we can look up the position
-- of the next character we need
subOf :: String -> M.IntMap S.IntSet -> Bool
subOf xs y = aux xs y (Just (-1))
-- aux takes as its last argument the found position of the previous character
  where aux _ _ Nothing = False -- prev char wasn't found
        aux [] _ _ = True       -- all characters found
        -- get all indices of x in y, and then get the smallest one that's greater than
        -- the index of the previous character. If x doesn't occur in y, we stop
        -- immediately
        aux (x:xs) y (Just yidx) = maybe False (aux xs y . S.lookupGT yidx) (M.lookup (ord x) y)

-- turn a string into a char->indices map
strToIdxMap :: String -> M.IntMap S.IntSet
strToIdxMap y = M.map S.fromList $ M.fromListWith (++) $ map (ord *** (:[])) $ zip y [0..]


-- define a new type to pass all the information and state around
type Work = (
  Int,
  Int,
  M.IntMap [Int],
  S.IntSet,
  M.IntMap Int)
-- accessors for work
xn (n,_,_,_,_) = n
yn (_,n,_,_,_) = n
em (_,_,m,_,_) = m
sn (_,_,_,s,_) = s
ma (_,_,_,_,m) = m
-- do we already have a match for i, and if yes what is it?
match :: Work -> Int -> Maybe Int
match w i = M.lookup i (ma w)
-- have we already seen i?
seen :: Work -> Int -> Bool
seen w i = S.member i (sn w)
-- mark i as seen
doSeen :: Work -> Int -> Work
doSeen (a,b,c,sn,e) i = (a,b,c,(S.insert i sn),e)
-- match up i and j
doMatch :: Work -> Int -> Int -> Work
doMatch (a,b,c,d,ma) i j = (a,b,c,d,(M.insert i j ma))
-- get all neighbors of i, if it has any
edges :: Work -> Int -> Maybe [Int]
edges w i = M.lookup i (em w)
-- empty the seen set
seendel :: Work -> Work
seendel (a,b,c,_,e) = (a,b,c,S.empty,e)

-- checks if there is a match for i by looking for an unmatched neighbor, and if
-- the neighbor is already matched, we try to find another match for
-- said neighbors match
-- Tried to make the state management a bit neater, so instead of returning
-- a pair of the changed state and the actual result, we pass functions that
-- are to be applied to the changed state, depending on the result
-- Doesn't seem to matter performance wise, but doesn't seem to hurt either
findMatchCont :: Int -> Work -> (Work -> Bool) -> (Work -> Bool) -> Bool
-- I'm pretty sure we can just return False here if a string has no superseqs in y
findMatchCont i w tcont fcont = maybe False (\es -> fmchelp es i w) $ edges w i
  where fmchelp (j:js) i w
        -- we've seen j, check remaining neighbors
          | seen w j = fmchelp js i w
        -- if j already has a match, try to find a match for j's match
        --- if that search succeeds, this search also succeeds and matches j to i
        --- if that search doesn't succeed, try the remaining neighbors of i
          | otherwise = let w' = doSeen w j in
              maybe (tcont (doMatch w' j i))
                    (\n -> findMatchCont n w' (\z -> tcont (doMatch z j i)) (fmchelp js i))
                    (match w' j)
        -- no neighbors remaining, search has failed
        fmchelp _ i w = fcont w

-- find a match for all elements in x
aux :: Int  -> Work  -> Bool
aux n w | n == (xn w) = True -- all matched, we're good
      -- find a match for element n, if the search succeeds, find a match for (n+1), if not, return false
        | otherwise = findMatchCont n (seendel w) (aux (succ n)) (const False)


-- the hope here is that lazy evaluation helps us by not checking every
-- combination of x and y unless it's needed
subsub :: [String] -> [String] -> Bool
subsub xs ys = aux 0 (length xs, length ys, em, S.empty, M.empty)
  where em = M.fromList $ zip [0..] $ [[yidx | (yidx,y) <- zip [0..] yl, subOf x y] | x<-xs]
             where yl = map strToIdxMap ys

For comparison, here is Gabriel Giamarchi's solution. Sometimes a bit too wide (Gabriel must have a wide screen 🖥), but nicely commented. The MC specifically appreciates the commented Test.QuickCheck import, which indicates that QuickCheck has been used to test this solution.

module SubSub where

--import Test.QuickCheck
import Data.List
import Data.Map.Strict
import Data.IntSet

--The problem has 2 main phases
--[1]. Identify which strings are subsequences of what strings and build a (bipartite) graph representing this relation.
--[2]. Decide if a matching with the cardinality of the first set of strings exists in this graph.
--The solutions I use are:
--[1]. We preprocess the second list of strings first to speedup the checks (see subsequencep and process)
--[2]. During the building of the graph we use Hall's marriage theorem partially to be able to say "false" quickly if we can (see buildGraph and
-- buildGraphHelper)
--     Then, when the graph is built, I use a slightly modified version of https://hackage.haskell.org/package/maxBipartiteMatching-0.1.0.0 to check if
--     the matching exists. Since the package is on hackage I deemed it was okay to use it, but I hand included the code to prevent the imports and also
--     because I modified it very slightly for our purpose.
--     Using Hopcroft-Karp would be better and apparently there is an implementation: https://github.com/ivant/graph-matching
--     However, I didn't understand the code so I do not use it.

--Preprocessing we do on strings, used by our subsequence predicate
--It transforms a string into a map of char -> IntSet where each char maps to the set of indices it appears at (starting at 1)
--This allows us to check if a string is a subsequence of another with complexity based mainly (not entirely) on the length of the first string.
--By using (++) and not (flip (++)) we are appending at the end of the list of length 1 (better !) but then we cannot use fromDistinctAscList...
process str = Data.Map.Strict.map Data.IntSet.fromList (Data.Map.Strict.fromListWith (++) [(a,[b]) | (a,b) <- zip str [1..]])

--Predicate that checks whether the string (c:cs) is a subsequence of x, where x is a preprocessed string.
--Complexity should mainly depend on the length of the first string.
--  t is a helper variable used during the recursion, the function should begin with t=0
--  t indicates the index of the last character we used in the second string -> we can only take characters after t
subsequencep _ [] _ = True
subsequencep t (c:cs) x --t is the value we must be bigger than
  | Data.Map.Strict.null x = False --we checked the case (c:cs) null before
  | otherwise = let indices = (Data.Map.Strict.lookup c x) in
     maybe False (\ind -> maybe False (\b -> subsequencep b cs x) (Data.IntSet.lookupGT t ind)) indices

--A recursive graph-building procedure with preprocessing.
--This build the graph representing the relation on strings "is included in".
--We will then search for a matching in this graph.
--The function first applies preprocessing to the entire second list of strings to speedup the subsequencep checks
--buildGraph :: ([Char]->p) -> (Int->[Char]->[Char]->Bool) -> [[Char]] -> [[Char]] -> g
buildGraph proc pred strsLeft strsRight = let idxedProcStrsR = (Data.List.map proc strsRight) `zip` [1..]
                                              indexedStrsL = strsLeft `zip` [1..] in
  buildGraphHelper pred indexedStrsL idxedProcStrsR 0 Data.IntSet.empty

--We check during execution that the number of neighbors of all the elements computed so far (unioned into the set rset) are bigger or equal in number
--to the number of elements checked (lnum) -> If not, hall's marriage theorem tells us that the matching we want is not possible.
buildGraphHelper pred [] strsRight lnum rset = Just Data.Map.Strict.empty
buildGraphHelper pred (sl:sls) strsRightProcessedIdxed lnum rset = let neighbors = (\(str,lidx) -> (Data.List.map snd (Data.List.filter ((pred 0 str) . fst) strsRightProcessedIdxed))) sl
                                                                       newrset = (Data.IntSet.union rset (Data.IntSet.fromList neighbors))
                                                                       newlnum = lnum + 1 in
  if (Data.List.null neighbors) || (newlnum > Data.IntSet.size newrset) then --Partial application of HALL-MARRIAGE theorem
  Nothing else
  maybe Nothing (\oldgraph -> Just (Data.Map.Strict.insert (snd sl) neighbors oldgraph)) (buildGraphHelper pred sls strsRightProcessedIdxed newlnum newrset)

-- Main function
subsub :: [String] -> [String] -> Bool
subsub [] _ = True
subsub strsL strsR = maybe False alphaMatchedPred (buildGraph process subsequencep strsL strsR)

-----------------------------------------
-----------Matching functions------------
-----------------------------------------
-- > {-|
-- > Description : Maximum cardinality bipartite matching
-- > Copyright   : © 2012 Stefan Klinger <http://stefan-klinger.de/>
-- > License     : GNU AGPL 3 <http://www.gnu.org/licenses/agpl-3.0.html>
-- > Maintainer  : http://stefan-klinger.de
-- > Stability   : experimental
-- >
-- > Find a maximum cardinality matching on a bipartite graph, using an
-- > augmenting path algorithm.  More efficient than using MaxFlow from FGL
-- > with constant weight and additional source and sink nodes.  But not as
-- > good as Hopcroft–Karp would be.
-- > -}

alphaMatchedPred semilist
  | Data.Map.Strict.null semilist = False
  | otherwise = (Data.Map.Strict.size $ matchingfromSemiAdjList semilist) == (Data.Map.Strict.size semilist)

--We do not need to convert to a 'fwd' mapping since we can directly build one in the buildGraph function !
matchingfromSemiAdjList semilist = opt (Data.Map.Strict.keys semilist, []) semilist Data.Map.Strict.empty

--Optimization func
opt :: (Ord a, Ord b) => ([a],[a]) -> Data.Map.Strict.Map a [b] -> Data.Map.Strict.Map b a -> Data.Map.Strict.Map b a
opt (x:free,failed) fwd mat
  = either (flip (opt (free,x:failed)) mat) (opt (free++failed,[]) fwd)
    $ right fwd [] x
  where
    right rem path x
      = maybe (Left rem) (left $ Data.Map.Strict.delete x rem) $ Data.Map.Strict.lookup x rem
      where
        left rem [] = Left rem
        left rem (y:ys)
          = maybe
            (Right $ Data.List.foldr (uncurry $ flip Data.Map.Strict.insert) mat path')
            (either (flip left ys) Right . right rem path')
            (Data.Map.Strict.lookup y mat)
          where
            path' = (x,y):path
opt ([],failed) fwd mat = mat

Normally, Miro Haller's 🔝 solution would be shown here as well, for being quite unique. However, this blog has a strict line limit per week, which we are approaching. And we definitely want to fit the lessons of the week:

Task of week 5

The MC loves regular expressions. But sometimes he manages to insert sputid tysop in his text that cause a later regex search to fail. In this task, your goal is to develop a regex matcher that is immune against scrambled letters.

We consider the following algebraic datatype (wait for the lecture on Tuesday, where those will be introduced) of regular expressions that should be familiar from a Theoretical Computer Science course:

data RE = Zero          -- Zero matches nothing
        | One           -- One matches only the word []
        | Atom Char     -- Atom c matches only the word [c]
        | Plus RE RE    -- Plus r s matches all words matched by r or s
        | Concat RE RE  -- Concat r s matches all words w = u ++ v
                        -- such that u is matched by r and v is matched by s
        | Star RE       -- Star r matches [] and all words w = u ++ v
                        -- such that u is matched by r and v by Star r

Hence, for a standard definition of match :: RE -> String -> Bool we would have:

match (Star (Atom 'a' `Concat` Atom 'b')) ""
match (Star (Atom 'a' `Concat` Atom 'b')) "ab"
match (Star (Atom 'a' `Concat` Atom 'b')) "abab"
match (Star (Atom 'a' `Concat` Atom 'b')) "ababababab"
not $ match (Star (Atom 'a' `Concat` Atom 'b')) "a"
not $ match (Star (Atom 'a' `Concat` Atom 'b')) "abba"
not $ match (Star (Atom 'a' `Concat` Atom 'b')) "baba"
not $ match (Star (Atom 'a' `Concat` Atom 'b')) "ababa"

However, here we are not interested in the standard match, but rather in a dyslexicMatch :: RE -> String -> Bool function that accepts a string s exactly if there is a permutation t of s (of the same length, with the same multiplicities for duplicated characters) that would be accepted by the usual match function. For example, all the following should yield True:

dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) ""
dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "ab"
dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "abab"
dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "ababababab"
not $ dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "a"
dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "abba"     -- consider "abab"
dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "baba"     -- consider "abab"
not $ dyslexicMatch (Star (Atom 'a' `Concat` Atom 'b')) "ababa"
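
For reference, the specification can be implemented literally, though hopelessly slowly: try all permutations of the input against an ordinary matcher. A sketch, assuming nothing beyond the RE type above (the naive match below is an illustration, not a prescribed helper):

import Data.List (inits, tails, permutations)

match :: RE -> String -> Bool
match Zero _ = False
match One w = null w
match (Atom c) w = w == [c]
match (Plus r s) w = match r w || match s w
match (Concat r s) w =
  or [match r u && match s v | (u, v) <- zip (inits w) (tails w)]
match (Star _) [] = True
match (Star r) w =  -- drop the empty prefix to ensure termination
  or [match r u && match (Star r) v | (u, v) <- tail (zip (inits w) (tails w))]

-- accept iff some reordering of the input matches
dyslexicMatchSpec :: RE -> String -> Bool
dyslexicMatchSpec r = any (match r) . permutations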

Evaluation criterion: feficiency. Your solution will be tested with short and long strings and simple and deeply nested regular expressions.

Results of week 5

What a strange week! At first there were 15 submissions. In addition, an anonymous professor attempted a comeback, which was severely constrained by the fact that the professor could literally spend only 10 minutes on the task.

In the first round of QuickCheck testing something shocking happened: Only three solutions passed the tests. So for a change we will not start with a Hall of Fame.

Casualties of QuickCheck
name                 regex                                                        string  wrong output
anonymous professor  Star (Star One)                                              ""      ⊥
Roger Baumgartner    Concat (Star (Atom 'a')) (Star (Atom 'b'))                   ""      False
Mathias Blarer       Star (Star (Atom 'a'))                                       "a"     ⊥
Mike Boss            Concat One One                                               ""      False
Matthias Brun        Plus (Star Zero) One                                         "a"     💥
Gabriel Giamarchi    Star Zero                                                    ""      False
Moritz Knüsel        Concat (Atom 'a' `Plus` Atom 'b') (Atom 'a')                 ""      True
Silvan Niederer      Concat (Star Zero) (Star (Atom 'a'))                         "a"     False
Loris Reiff          Concat One One                                               ""      False
Valentin Stoppiello  Star (Plus (Atom 'a') One)                                   "b"     ⊥
Jan Wiegner          Concat One One                                               ""      False
Patrick Ziegler      Star Zero                                                    ""      False
Simon Yuan           Star (Star (Atom 'a' `Concat` Atom 'b') `Concat` Atom 'b')   "ab"    True

Above, ⊥ denotes non-termination (or at least "exceeding MC's patience") and 💥 denotes a "Non-exhaustive patterns in function glushkovVars" runtime failure.


Addendum: The MC was asked about the details of his QuickCheck tests. Indeed this is useful information that the MC should share.

When using QuickCheck there are two main questions to answer. How to generate arbitrary values of a type? And what properties to test? The latter is easy to answer if you have a master solution (or an unoptimized solution for which you are certain that it is correct) at hand. In this case, the MC was testing the property \r w -> sol r w == submission r w where sol was his naive solution using derivatives (see below) and submission is the solution under test. To test this property, QuickCheck must be able to generate random regular expressions and strings. For strings there is some standard setup; for our own data type RE a manual setup is necessary.

Generation of random values in QuickCheck is based on the type class Arbitrary. I.e., we need to make RE an instance of Arbitrary. Unfortunately, there is no built-in way in GHC to do this automatically via deriving (as we can do for Eq or Ord). But of course there is a Hackage package to do this automatically. Here are the three magic lines to make the above test work.

{-# LANGUAGE TemplateHaskell #-}

...

import Data.DeriveTH

...

derive makeArbitrary ''RE

This will generate random regular expressions, presumably using a uniform distribution to sample the next constructor (and recursion for the recursive components). The MC used this Arbitrary instance for the first test. The generation of the regular expression and the word are thereby completely independent of each other. As a consequence, most match problems generated that way should evaluate to False. This is not ideal (and very incomplete), but was sufficient to discover all the above bugs with one exception. The bug in Simon Yuan's program was found much later during the actual performance test (still with randomly generated regular expressions and words, but restricting the occurring letters to a small alphabet).

To restrict the alphabet and avoid the Zero constructor, the MC used the following custom Arbitrary instance in his later tests. The definition uses some combinators for generating random values from Test.QuickCheck—the MC recommends looking at the Hackage documentation for sized, frequency, and elements.

import Control.Monad
import Test.QuickCheck

...

instance Arbitrary RE where
  arbitrary = sized regexp

regexp :: Int -> Gen RE
regexp 0 = frequency [ (1, return One)
                     , (4, Atom `fmap` elements "abcde") ]
regexp n = frequency [ (2, Plus  `fmap` subexp `ap` subexp)
                     , (2, Concat `fmap` subexp `ap` subexp)
                     , (1, Star  `fmap` regexp (n-1)) ]
  where subexp = regexp ((n - 1) `div` 2)


So it seems that only Kevin De Keyser, Miro Haller, and Börge Scheel are going to get any points in this exercise. However, the MC was in a good mood and decided to take a closer look at the wrong submissions. It turned out (or rather, again, QuickCheck with a custom RE generator helped find out) that three of the failing ones were correct as long as the input regular expression did not contain the Zero constructor. For both Gabriel Giamarchi and Patrick Ziegler, the problems with Zero came from the fact that they would rewrite Star Zero to Zero (whereas One would be correct)—a quite common mistake that the MC's alter ego has encountered while being a TA for introductory automata theory courses. Silvan Niederer's solution seems to suffer from similar problems, although the MC could not pinpoint the source of the problem as precisely. Matthias Brun performed this rewrite correctly. However, his function that was supposed to eliminate all Zeros from an expression (or rewrite it to Zero) turned out to be highly incomplete: he forgot some recursive calls. And the rest of his code relied on the fact that there are no Zeros, which explains the 💥.
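
For reference, here is what a complete Zero-elimination pass can look like (a sketch, not taken from any submission). Every composite constructor needs its own recursive case, and Star Zero becomes One, since the star of the empty language still matches the empty word:

-- after purging, the result either contains no Zero at all
-- or is literally Zero (the empty language)
purge :: RE -> RE
purge (Plus r s) = case (purge r, purge s) of
  (Zero, s') -> s'
  (r', Zero) -> r'
  (r', s')   -> Plus r' s'
purge (Concat r s) = case (purge r, purge s) of
  (Zero, _)  -> Zero
  (_, Zero)  -> Zero
  (r', s')   -> Concat r' s'
purge (Star r) = case purge r of
  Zero -> One
  r'   -> Star r'
purge r = r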

The MC decided to leave these incorrect submissions in the competition (without using Zeros from now on). However, in the end there is a severe penalty: no buggy solution gets more points than a correct solution, no matter how well it performs in terms of efficiency.

How to rank the different submissions? At first, the MC tried what he usually does: picking hand-selected examples to see which solutions time out and which don't. After doing this for several rounds, the MC noticed that some solutions perform better on long words and simple expressions, while others can handle complex expressions for shorter words. It was hard to differentiate between those two categories.

This is why the MC decided to do something more automatic, but also more random. He used QuickCheck's random generator machinery to generate 10 random regular expressions of sizes 5, 10, and 30 (over the alphabet "abcde") and random words of length 10 and 1000 (over all different nonempty subsets of the alphabet "abcdf"). For each generated regex-word pair, the MC ran all 8 (including his own) submissions, with a tight timeout of 1 second. (I.e., long-running executions were not penalized too much, which may be not ideal. But the MC didn't have a cluster at his disposal to perform the evaluation, so it had to work within 1h on the MC's laptop.)

And here are the results with numbers indicating how many seconds each submission needed to process (or timeout on) all the examples:

  29.909823999999986 Gabriel Giamarchi
  57.285384999999906 Matthias Brun
 134.8980010000001   Patrick Ziegler
 157.62982699999995  MC
 310.188045          Silvan Niederer
 340.03123100000005  Börge Scheel
 779.5406349999998   Kevin De Keyser
1962.9649770000012   Miro Haller

Since this is based on random expressions, the MC reran this evaluation several times. The individual times would vary from run to run, but the trends remained the same. The above results translate to the following ranking of the week, taking the Zero bugs into account. (We put the 💥 next to Matthias Brun to remind us that bugs can happen even to the competition's top performers.) Funnily, the solutions that handled Zeros correctly are the slower ones in terms of performance (yet they get the most points). Of course, this is just a coincidence: the speedup was certainly not achieved by treating Zero incorrectly.

Top 10 (naja) of week 5
  # name      points
1. Börge Scheel 10
2. Kevin De Keyser 9
3. Miro Haller 8
4. Gabriel Giamarchi 7
5. Matthias Brun 💥 6
6. Patrick Ziegler 5
7. Silvan Niederer 4

Let's look at some solutions, starting with the one by the MC.

The MC loves derivatives of regular expressions. If you sneak around his (alter ego's) webpage, you will find many of his papers that use these derivatives in matching and other algorithms. A derivative of a regular expression with respect to an alphabet letter c computes the regular expression describing what remains to be matched after reading c. To go dyslexic, we can use the following version of the derivative, which reads (or consumes) a letter c occurring at any position in the expression.

der :: RE -> Char -> RE
der Zero _ = Zero
der One _ = Zero
der (Atom a) c = if a == c then One else Zero
der (Plus r s) c = der r c `Plus` der s c
der (Concat r s) c = (der r c `Concat` s) `Plus` (r `Concat` der s c)
der (Star r) c = der r c `Concat` Star r
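
For example, this derivative can consume a letter from the middle of a concatenation:

der (Atom 'a' `Concat` Atom 'b') 'b'
  = (Zero `Concat` Atom 'b') `Plus` (Atom 'a' `Concat` One)

which simplifies to Atom 'a': the 'b' has been read even though an ordinary matcher would have demanded the 'a' first.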

Moreover, we can easily decide whether a regular expression accepts the empty string:

eps :: RE -> Bool
eps Zero = False
eps One = True
eps (Atom a) = False
eps (Plus r s) = eps r || eps s
eps (Concat r s) = eps r && eps s
eps (Star r) = True

These two functions are enough to obtain a nice but inefficient version of dyslexicMatch:

dyslexicMatch :: RE -> String -> Bool
dyslexicMatch r = eps . foldl der r

The actual solution by the MC is a bit smarter. It eagerly simplifies the expressions (in particular exploiting that for our task Concat can be considered commutative) and memoizes the der function (using some TemplateHaskell black magic to make the code concise). Moreover, it splits the result of the derivative across the Plus constructor into a set of expressions. If you think of der :: RE -> Char -> RE as symbolically computing the next state of a deterministic automaton whose states are labeled with regular expressions, you can think of the version that splits the Plus constructor as computing the next state of a non-deterministic automaton.

{-# LANGUAGE TemplateHaskell #-}

module DyslexicRegex where

import Prelude hiding (concat)
import qualified Data.Set as S
import Data.Function.Memoize
import Data.DeriveTH

data RE = Zero
        | One
        | Atom Char
        | Plus RE RE
        | Concat RE RE
        | Star RE
     deriving (Show, Read, Eq, Ord)

deriveMemoizable ''RE

eps Zero = False
eps One = True
eps (Atom a) = False
eps (Plus r s) = eps r || eps s
eps (Concat r s) = eps r && eps s
eps (Star r) = True

plus r Zero = r
plus Zero r = r
plus (Plus r s) t = plus r (plus s t)
plus r (Plus s t) = if r == s then Plus s t else if s < r then plus s (plus r t) else Plus r (Plus s t)
plus r s = if r == s then r else if s < r then Plus s r else Plus r s

concat r Zero = Zero
concat Zero r = Zero
concat r One = r
concat One r = r
concat (Concat r s) t = concat r (concat s t)
concat r (Concat s t) = if s < r then concat s (concat r t) else Concat r (Concat s t)
concat r s = if s < r then Concat s r else Concat r s

star Zero = One
star One = One
star (Star r) = Star r
star r = Star r

norm (Plus r s) = norm r `plus` norm s
norm (Concat r s) = norm r `concat` norm s
norm (Star r) = star $ norm r
norm r = r

pder = memoFix2 (\d r a -> case r of
  Zero -> S.empty
  One -> S.empty
  Atom b -> if a == b then S.singleton One else S.empty
  Plus r s -> d r a `S.union` d s a
  Concat r s -> S.map (`concat` s) (d r a) `S.union` S.map (r `concat`) (d s a)
  Star r -> S.map (`concat` Star r) (d r a))

epss = not . S.null . S.filter eps
pders rs a = S.foldl S.union S.empty (S.map (\r -> pder r a) rs)

dyslexicMatch :: RE -> String -> Bool
dyslexicMatch r = epss . Prelude.foldl pders (S.singleton $ norm r)

Now, let us look at the best performing solution by Gabriel Giamarchi, who doesn't sound too optimistic (or, to put it positively, sees much room for improvement).

module DyslexicRegex where

{--
This program is probably horribly inefficient, but it *should* be correct.
I didn't have time to finish it, but the idea was as follows:
Considering permutations of strings means only the number of occurrences of each
char has importance
-> This creates surprising syntactical properties of regexes, for instance:
(notation (Atom 'a') := a, Concat written with . Plus with + and Star with *)
(a+b)* = a*.b*
a.(a+b) = a.a+a.b       and so on
The idea was to use these rules to expand (syntactically) the regex into
something easier to manage in the form:
r1+r2+...+r_n where r_i has the form (abcd)*.(abd)*.(xag)*...(zaggq)*.(abcd)
Each r_i would then be a knapsack problem but hopefully easier to solve than
the whole thing, especially if we could memoize/remove duplicate r_is
This form for regexes is referred to in the following code as normalform_a (I had
originally intended normalforms_b,c and so on :-)  )
There are a LOT of problems in the following code; Nothing is optimized and
the program is too slow on big regexes. Also the solution of the knapsack
problem is the trivial solution.
--}

import Data.List
import Data.Set
import Data.Maybe
import Data.Map.Strict as M
-- import Control.Monad
-- import Test.QuickCheck

-- You may add 'deriving' clauses but should not modify the RE type otherwise

data RE = Zero          -- Zero matches nothing
        | One           -- One matches only the word []
        | Atom Char     -- Atom c matches only the word [c]
        | Plus RE RE    -- Plus r s matches all words matched by r or s
        | Concat RE RE  -- Concat r s matches all words w = u ++ v
                        -- such that u is matched by r and v is matched by s
        | Star RE       -- Star r matches [] and all words w = u ++ v
                        -- such that u is matched by r and v by Star r
        deriving (Show,Eq)

--An improved type for our normal form
data NFARE = NFAZero          -- Zero matches nothing
           | NFAAtom (Set [Char],[Char])     -- Atom c matches only the word [c]
           | NFAPlus NFARE NFARE    -- Plus r s matches all words matched by r or s
           | NFAConcat NFARE NFARE  -- Concat r s matches all words w = u ++ v
           | NFAStar NFARE
           deriving (Eq, Ord)

instance Show NFARE where
  show = pprint2

--Converts a list to a map of elems -> num occurences
listToMap s = M.fromListWith (+) (zip s [1,1..1])

--Checks that all the keys in the list are in the map and remove them from the map
--If this is not possible, then return Nothing
--TODO: Make this more efficient
mapdifference map [] = Just map
mapdifference map (x:xs) = if (M.member x map) then let v = fromJust (M.lookup x map) in
           if v == 0 then Nothing else mapdifference (M.insert x (v-1) map) xs
                           else Nothing

--Given a tuple derived from an atom in nfa (via toCondList) and an occurence
--map, returns True or False if it matches
--We basically need to solve a knapsack problem in the first argument
matchAtomicCond _ Nothing = False
--In this case the string represented by occmap must be empty
matchAtomicCond ([],[]) (Just occmap) = Data.List.sum (M.elems occmap) == 0
matchAtomicCond ([], necessaryLetters) (Just occmap) =
  matchAtomicCond ([],[]) (mapdifference occmap necessaryLetters)
matchAtomicCond (scond:sconds,necessaryLetters) (Just occmap)
              | Data.List.null necessaryLetters =
                (matchAtomicCond (scond:sconds,[]) (mapdifference occmap scond))
                || (matchAtomicCond (sconds,[]) (Just occmap)) --Knapsack problem
              | otherwise =
                --If there are both necessary letters and star conditions, we
                --need to remove the necessary ones first
                matchAtomicCond (scond:sconds,[]) (mapdifference occmap necessaryLetters)

toNfare :: RE -> NFARE
toNfare Zero = NFAZero
toNfare One = NFAAtom (Data.Set.empty,[])
toNfare (Atom c) = NFAAtom (Data.Set.empty,[c])
toNfare (Plus r s) = (NFAPlus (toNfare r) (toNfare s))
toNfare (Concat r s) = (NFAConcat (toNfare r) (toNfare s))
toNfare (Star r) = (NFAStar (toNfare r))

pprint :: RE -> String
pprint (Zero) = "0"
pprint (Plus x y) = "(" ++ (pprint x) ++ "+" ++ (pprint y) ++ ")"
pprint (Concat x y) = (pprint x) ++ "" ++ (pprint y)
pprint (Star x) = "(" ++ (pprint x) ++ ")"  ++ "*"
pprint (Atom c) = show c

pprint2 :: NFARE -> String
pprint2 (NFAZero) = "0"
pprint2 (NFAPlus x y) = "(" ++ (pprint2 x) ++ "+" ++ (pprint2 y) ++ ")"
pprint2 (NFAConcat x y) = (pprint2 x) ++ "" ++ (pprint2 y)
pprint2 (NFAStar x) = "(" ++ (pprint2 x) ++ ")"  ++ "*"
pprint2 (NFAAtom c) = show c

--A ``normal'' regex matching algorithm.
match :: RE -> String -> Bool
match Zero _ = False
match One w = Data.List.null w
match (Atom c) w = w == show c
match (Plus r s) w = (match r w) || (match s w)
match (Concat r s) w = any (\(x,y) -> (match r x) && (match s y)) $ zip (inits w) (tails w)
match (Star r) [] = True
--match (Star r) [c] = (match r [c]) --The next case bugs for one letter words
match (Star r) w = any (\(x,y) -> (match r x) && (match (Star r) y)) $ tail $ zip (inits w) (tails w)
--match (Star r) w = True

----A bruteforce dyslexic matching algorithm, used to test correctness.
--bruteDyslexicMatch :: RE -> SafeString -> Bool
--bruteDyslexicMatch r ws = let w = show ws in any (match r) $ permutations w

--NormalformA is r1+r2+...+r_n where r_i has the form (abcd)*(xyz) [NO + in r_i]
--If re is of the form (Op re1 re2) where re1 and re2 are already in normalform -> Outputs the normalform of re
normalforma2 :: NFARE -> NFARE
normalforma2 (NFAPlus NFAZero y) = y --Zero neutral for plus
normalforma2 (NFAPlus y NFAZero) = y --Zero neutral for plus
normalforma2 (NFAPlus x y) = (removeNFADuplicates (NFAPlus x y)) --We want all plusses at the top level
normalforma2 (NFAStar NFAZero) = NFAZero --star is identity on zero
normalforma2 (NFAStar (NFAPlus x y)) =
  --Distributivity of star for plus
  normalforma2 (NFAConcat (normalforma2 (NFAStar x)) (normalforma2 (NFAStar y)))
normalforma2 (NFAStar (NFAAtom (as,an))) =
 if (Data.List.null an) then (NFAAtom (as,[])) else
  if (Data.Set.null as) then (NFAAtom (Data.Set.insert an as,[])) else
  (NFAPlus (NFAAtom (Data.Set.insert an as, an)) (NFAAtom (Data.Set.empty,[])))
normalforma2 (NFAConcat NFAZero x) = NFAZero --Zero in ``multiplication``
normalforma2 (NFAConcat x NFAZero) = NFAZero --Zero in ``multiplication``
normalforma2 (NFAConcat (NFAAtom (as,an)) (NFAAtom (bs,bn))) =
  (NFAAtom (Data.Set.union as bs,an++bn)) --Property of concat on nfa-atoms
normalforma2 (NFAConcat x (NFAPlus y z)) =
 --NFAConcat distributivity left (and commutativity already used)
 (NFAPlus (normalforma2 (NFAConcat x y)) (normalforma2 (NFAConcat x z)))
normalforma2 (NFAConcat (NFAPlus y z) x) =
  --NFAConcat distributivity right
  (NFAPlus (normalforma2 (NFAConcat y x)) (normalforma2 (NFAConcat z x)))
normalforma2 (NFAConcat x (NFAStar y)) =
  (NFAConcat (NFAStar y) x) --Commutativity of concat
normalforma2 x = x --Otherwise

--Two horribly slow functions to remove duplicates :/
removeNFADuplicates (NFAZero) = (NFAZero)
removeNFADuplicates (NFAAtom a) = (NFAAtom a)
removeNFADuplicates plustree =
  Data.Set.foldr (\x re -> if re == (NFAZero) then x
                 else (NFAPlus x re)) (NFAZero) (plustreeToSet plustree)

plustreeToSet (NFAPlus (NFAAtom a) (NFAAtom b)) =
  Data.Set.fromList [(NFAAtom a), (NFAAtom b)]
plustreeToSet (NFAPlus (NFAAtom a) (NFAPlus x y)) =
  Data.Set.insert (NFAAtom a) (plustreeToSet (NFAPlus x y))
plustreeToSet (NFAPlus (NFAPlus x y) (NFAAtom a)) =
  Data.Set.insert (NFAAtom a) (plustreeToSet (NFAPlus x y))
plustreeToSet (NFAPlus (NFAPlus x y) (NFAPlus w z)) =
  Data.Set.union (plustreeToSet (NFAPlus x y)) (plustreeToSet (NFAPlus w z))

--Computes the normalform_a of the regex
--Now we need to basically compute a tree fold using normalforma2
tonfa2 :: NFARE -> NFARE
tonfa2 (NFAPlus x y) = normalforma2 (NFAPlus (tonfa2 x) (tonfa2 y))
tonfa2 (NFAConcat x y) = normalforma2 (NFAConcat (tonfa2 x) (tonfa2 y))
tonfa2 (NFAStar x) = normalforma2 (NFAStar (tonfa2 x))
tonfa2 x = normalforma2 x

--Once the normalform_a of the regex has been computed, we convert it to a list
--of tuples of the form ([[starcond],[...]],[necessary])
--Then, we will know that the regex matches iff one of these matches,
--and each sub-problem is basically a knapsack problem which we solve by lin-programming
--toCondList :: NFARE -> [...]
toCondList (NFAPlus x y) = (toCondList x) ++ (toCondList y)
toCondList (NFAAtom (starset, necessaryset)) = [(Data.Set.toList starset, necessaryset)]
toCondList (NFAZero) = []
toCondList x = undefined --We should NEVER need to go there

--Main function
dyslexicMatch re str = let occmap = (listToMap $ str) in
  Data.List.any (\x -> matchAtomicCond x (Just occmap)) (toCondList (removeNFADuplicates (tonfa2 $ toNfare re)))

Besides the buggy part normalforma2 (NFAStar NFAZero) = NFAZero, it rewrites expressions in an interesting way, exploiting our particular setting in this exercise. The MC believes that this was the key to victory, as most "random" expressions would be rewritten to something trivial (or at least very simple).

Finally, Matthias Brun's solution is quite nice, too. It uses Glushkov automata, a.k.a. position automata, which are related to the approach in the MC's favorite paper on regex matching. If you don't know anything about these automata, skip the code below at first, but read the linked paper! The MC assumes that the automaton gets large for some of the random expressions. This solution would also benefit from the kind of sophisticated rewriting that Gabriel Giamarchi performed.

module DyslexicRegex where

import           Data.List                (sort, group, groupBy, sortBy)
import           Data.List.Extra          (nubOrd)
import           Data.Maybe               (fromJust)
import           Data.Char                (ord, chr)
import           Data.Sequence            (Seq, (|>))
import qualified Data.Sequence as S
import           Data.Vector              (Vector, (!))
import qualified Data.Vector as V
import           Data.IntMap.Strict       (IntMap)
import qualified Data.IntMap.Strict as IM
import           Control.Arrow            ((&&&), first)

-- The idea behind my solution is to build a Glushkov automaton, as described on pages 105ff of
-- http://nlp.cs.swarthmore.edu/~richardw/cs93-f07/flexiblePatternMatchingInStrings.pdf.
-- (My code is largely based on that chapter but has some significant differences.)
-- With this automaton I can efficiently [O(n^3)] match regular expressions.

-- For matching all permutations of a string one notices that this is equivalent to matching all the
-- input characters but in any arbitrary order. So rather than computing the permutations and
-- attempting to match them I just keep track of what characters are left and how many of them. During
-- the state transitions I then check all transitions (which read an available input character) rather
-- than just the one that would come next in the original input.

-- I haven't commented my code too extensively but I think it should be fairly understandable, if one
-- knows how Glushkov automata work (which, presumably, the MC does) or has glanced over the
-- book I mentioned.

-- There are still some optimizations I didn't get around to implementing but it's Saturday 23:24 now,
-- so that's that I guess.

type State = Int
type Input = IntMap Int
type First = [State]
type Last  = [State]
type Empty = Bool
type Delta = Seq [State]

-- StatesF provides a mapping from state to what atom it represents
-- States are first accumulated in a list (using cons) and then converted to a vector for O(1) access

type States = [Int]
type StatesF = IntMap Int

data RE = Zero           -- Matches nothing
        | One            -- Matches only the word []
        | Atom Char      -- Atom c matches only [c]
        | Plus RE RE     -- Plus r s matches all words matched by r or s
        | Concat RE RE   -- Concat r s matches all words w = u ++ v such that u is matched by r and v is matched by s
        | Star RE        -- Star r matches [] and all words w = u ++ v such that u is matched by r and v by Star r
  deriving (Eq, Show)

-- Any regular expression containing zeroes can either be rewritten as one without zeroes or as just Zero.
-- That's what `purgeZero` does.

purgeZero :: RE -> RE
-- Plus Zero Zero matches nothing
purgeZero (Plus Zero Zero) = Zero
-- Plus left Zero matches only left
purgeZero (Plus left Zero) = left
-- Symmetric case
purgeZero (Plus Zero right) = right
-- Concat with a Zero matches nothing
purgeZero (Concat Zero _) = Zero
purgeZero (Concat _ Zero) = Zero
-- Star Zero only matches []
purgeZero (Star Zero) = One
purgeZero (Plus left right)
  | left'  == Zero && right' == Zero = Zero
  | left'  == Zero                   = right'
  | right' == Zero                   = left'
  where
    left' = purgeZero left
    right' = purgeZero right
purgeZero (Concat left right)
  | left'  == Zero = Zero
  | right' == Zero = Zero
  | otherwise      = Concat left' right'
  where
    left'  = purgeZero left
    right' = purgeZero right
purgeZero (Star r)
  | r' == Zero = One
  | otherwise  = Star r'
  where
    r' = purgeZero r
purgeZero x = x

-- The getInput function transforms the input into an IntMap of characters with an associated count of
-- how many times they can still be matched. This essentially provides an input string but in a format
-- that is more useful for my algorithm.

getInput :: [Char] -> Input
getInput = IM.fromList . map (ord . head &&& length) . group . sort

-- A helper function to decrement the count of a character in the input and delete it if there was only
-- one left.

decAndDelIfLast :: Int -> Input -> Maybe Input
decAndDelIfLast key input
  | val == Just 1 = Just $ IM.delete key input
  | val == Nothing = Nothing
  | otherwise = Just $ IM.adjust pred key input
  where
    val = IM.lookup key input

-- merge is a small helper function that merges (wow!) two lists and removes duplicates.
-- (i.e. basically set union but I think sets would have been overkill here)

merge :: Ord a => [a] -> [a] -> [a]
merge xs ys = nubOrd $ xs ++ ys

-- glushkovVars is a translation of the function with the same name in the book that I mentioned. Since
-- the book provides an imperative implementation, this is somewhat reflected in the way this is
-- structured. I'm sure it could be done more elegantly.
-- The function provides the states, the delta function and the final states. It also tells us whether
-- the regex accepts the empty word.

glushkovVars :: RE -> (Delta, Int, States) -> (First, Last, Empty, Delta, Int, States)
glushkovVars (Plus l r)   (delta, pos, states) = (first' `merge` first'', last' `merge` last'', empty' || empty'', delta'', pos'', states'')
  where
    (first', last', empty', delta', pos', states') = glushkovVars l (delta, pos, states)
    (first'', last'', empty'', delta'', pos'', states'') = glushkovVars r (delta', pos', states')
glushkovVars (Concat l r) (delta, pos, states) = (firstF, lastF, empty' && empty'', deltaF, pos'', states'')
  where
    (first', last', empty', delta', pos', states') = glushkovVars l (delta, pos, states)
    (first'', last'', empty'', delta'', pos'', states'') = glushkovVars r (delta', pos', states')
    firstF = if empty' then first' `merge` first'' else first'
    lastF = if empty'' then last' `merge` last'' else last''
    deltaF = foldr (S.adjust' (merge first'')) delta'' last'
glushkovVars (Star r) (delta, pos, states) = (first', last', True, deltaF, pos', states')
  where
    (first', last', empty', delta', pos', states') = glushkovVars r (delta, pos, states)
    deltaF = foldr (S.adjust' (merge first')) delta' last'
glushkovVars One (delta, pos, states) = ([], [], True, delta, pos, states)
glushkovVars (Atom c) (delta, pos, states) = ([pos'], [pos'], False, delta |> [], pos', (ord c):states)
  where
    pos' = pos + 1

-- A small wrapper to make glushkovVars' easier to call.

glushkovVars' :: RE -> (Last, Delta, Empty, States)
glushkovVars' r = (last, S.update 0 first delta, empty, states)
  where
    (first, last, empty, delta, pos, states) = glushkovVars r (S.singleton [], 0, [0])

-- The actual match function uses glushkovVars' to generate the automaton and then goes through all the
-- state transitions to check whether or not the input is accepted.

dyslexicMatch :: RE -> String -> Bool
dyslexicMatch r s = if r' == Zero then False else match' (getInput s) [0]
  where
    r' = purgeZero r
    (last, delta, empty, states) = glushkovVars' r'
    states' = V.fromList . reverse $ states
    match' :: Input -> [State] -> Bool
    match' i []  = (not . IM.null) i
    match' i ys
      | IM.null i && ys == [0] = empty
      | IM.null i              = any (`elem` last) ys
      | otherwise              = any (uncurry match') $ newstates
      where
        newstates = map (first fromJust) . filter ((/= Nothing) . fst) . map ((\(x:_) -> decAndDelIfLast (states' ! x) i) &&& id) . groupBy stateeq . sortBy statecmp $ reachable
        reachable = (nubOrd . concatMap (delta `S.index`)) ys
        stateeq = \a b -> (states' ! a) == (states' ! b)
        statecmp = \a b -> (states' ! a) `compare` (states' ! b)

Lessons of the week:

Task of week 6

This task is rather different from the ones before: implement an AI for paper soccer. There is no Code Expert project this time. Instead, please submit your solutions (with an extended Easter deadline on Monday, 2nd April, 23:59) by email to the MC. Please use "FP Competition Paper Soccer" as the title of your email.

Paper soccer is a two-player strategy game in which, unsurprisingly, the goal is to score a goal. A description follows below. The MC recommends also reading the Wikipedia article and actually trying out the game before jumping on the AI. Our game will take place on a field of the below shape and size. Player 1 tries to score in the right goal, whereas player 2 aims for the left one.

A turn of a player consists of moving the ball (black dot) to one of the 8 neighboring grid points (including diagonal moves). The (possibly diagonal) edge along which a move took place is marked persistently. It is prohibited to move the ball along marked edges a second time. The players do not strictly alternate in their turns. Instead, the turn changes only if the ball moves to a previously unvisited grid point. In other words, a player continues if (s)he moves the ball to an already visited grid point (i.e., one with an adjacent marked edge). Such "bounces" also apply to the border (i.e., we count the border grid points as being visited by default).

There are several winning conditions for the players. To win, you have to either score a goal, force the opponent to score an own goal, or force the opponent into a dead end. Moreover, each player has only 5 minutes for the game in total.

We model the game in the module Soccer. You might need to install a few Hackage packages to load the module. The instructions are in the source. The state (State) of the game is the position (Pos=(Int,Int)) of the ball together with an assignment of marked adjacent edges for each grid point. We represent the assignment using a matrix indexed by positions ((0,0) is the top left corner of the field; (5,4) is the initial position of the ball; the actual goals are not part of the matrix) that stores an 8-bit word for each grid point. Each bit corresponds to a direction (Dir)—the conversion between directions and bit indices can be done using fromEnum and toEnum. If a bit is set for a grid point, then the edge going in the corresponding direction is marked. Note how the borders are modeled as preset words. The size parameters (xMax and yMax) set in this module are fixed and will not change for the competition.

An AI for paper soccer is a pure function step :: Double -> State -> m -> (m, Dir) that receives the remaining time (in seconds), a state, and some polymorphic memory slot, which you may instantiate with any type you want and use to store reusable information. The function should return the direction in which the ball should move next, together with an update of the memory. Be careful to produce only valid moves (producing an invalid one is an additional losing condition).
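For instance, a minimal (and strategically hopeless) AI could ignore the clock and the memory entirely and pick the first legal direction. The following sketch builds on the hypothetical Edges representation from above; the real State type differs:

-- Hypothetical: a stateless AI that plays the first unmarked direction.
step :: Double -> (Pos, Edges) -> () -> ((), Dir)
step _timeLeft (p, edges) mem = (mem, firstLegal)
  where
    -- head is unsafe: a real AI must recognize dead ends (a lost position).
    firstLegal = head [d | d <- [minBound .. maxBound], not (marked edges p d)]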

A player (Player) is then represented by such a function extended with a name (please use your name, as in "<FirstnameLastname>") and an initial memory content. The idea is that evaluating the initial memory is instant. (This is not yet enforced by the module, but please do not pack a big table or the like in there right from the start, in order to survive potential updates of the Soccer module.)

As examples, the MC has prepared a few dumb AIs that don't even try to think strategically. You can certainly do better than that. Please submit a single file called Player<FirstnameLastname>.hs which implements two Players: player1, which targets the right goal (and has the first move), and player2, which targets the left goal (and goes second). In principle, you may reuse player1's strategy for player2 via the inversePlayer function (which rotates the field, computes a direction, and inverts it again), as sketched below. However, you may also use different strategies for the two players.
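Putting the pieces together, a submission skeleton might look as follows; the exact shape of a Player value is an assumption (consult Soccer.hs for the real constructor), and step is the toy function sketched above:

-- Hypothetical skeleton; the actual Player type lives in Soccer.hs.
player1, player2 :: Player ()
player1 = Player "FirstnameLastname" step ()  -- targets the right goal, moves first
player2 = inversePlayer player1               -- mirrored strategy for the left goal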

The submissions will be evaluated in a tournament, similarly to the one shown in the Main module.

Oh, and graphical visualizations of the state that are nicer than the MC's box-drawing ASCII art are very welcome contributions, too. The MC is not sure if they would yield any competition points, though (that depends on the beauty).

Results of week 6

Let the games begin! We have 11 student contestants. The MC did not participate this time: he decided to spend his energy on carrying out this evaluation, which includes some time-consuming-to-generate drawings.

The MC has decided on a league-like tournament: everybody plays 21 matches (11 as the first player and 11 as the second player, where the match against oneself counts for both). The final states of all 121 games are visualized in the table below. The rows are headlined by the first players, the columns by the second ones. A black border around a picture means that the second player won that game; otherwise, the first player is the winner. The best players are thus those with many borderless entries in their row and many bordered entries in their column.

Each picture is clickable and opens a gif (produced by a modified Soccer.hs) that replays the match. The replays are not in real time: if the turn changes, there is a 1-second pause; otherwise, the ball moves every 0.2 seconds, independently of how long the contestant's AI was "thinking" to produce that move. The green bar highlights the player to move. The contestants' initials are also clickable and bring up the (as usual, well-documented) implementations.

As an alternative to clicking through 121 gifs, you can go to this webpage and watch all the games simultaneously. For this you will need a big screen (we know that Gabriel 🖥 has one) and a good internet connection (it will load >200 MB of gifs).

[Result matrix: an 11 × 11 grid of final-state pictures, one per match, with rows (first player) and columns (second player) indexed by the contestants' initials BS, GG, LR, MBl, MBr, MHa, MHo, MK, PZ, SN, SY; each cell links to the corresponding replay gif.]

We have a clear winner! In total, we obtain the following scores (= number of games won) and ranking. The MC had a nice Easter break and decided to award points to all 11 contestants.

Top 11 of week 6
  # name   wins (1st/2nd/total)   points
1. Gabriel Giamarchi 11/10/21 11
2. Matthias Brun 9/10/19 10/2=5
3. Miklós Horváth 9/8/17 9
4. Silvan Niederer 7/7/14 8/2=4
5. Miro Haller 6/6/12 7
6. Börge Scheel 5/6/11 6
7. Mathias Blarer 5/4/9 5/2=2.5
8. Simon Yuan 2/4/6 4
9. Moritz Knüsel 1/4/5 3
10. Loris Reiff 2/2/4 2
11. Patrick Ziegler 1/2/3 1/2=0.5

We see four "/2" punishments. In three instances, the AI sometimes (roughly: in half of the games) failed with an index-out-of-bounds runtime error. Two of those solutions invoked the MC's move function without respecting its (unwritten) preconditions: the ball may not leave the field (and scoring a goal counts as leaving the field, even though the direction of the move is valid). The third one contained an even more blatant array indexing error. The above sources are free of those errors, because the Easter-MC was kind enough to fix them (and keep the contestants in the competition).

The fourth punishment is for Mathias Blarer's solution, which at first would time out in every game. Mathias's solution adjusts the search depth based on the remaining time, and apparently he messed up the function that computes the adjustment. The above source is his second version (which arrived after the deadline, and after the MC asked about the time-outs); it now times out in only half of the games.

To play the tournament, the MC at first used a high degree of parallelization. All games were then evaluated in about 5 minutes (many games didn't take much time to play, due to simplistic AIs). However, the MC's simplistic time measurement using the timeit package did not work accurately under that load. In the end, the MC discarded the results of the parallel runs in favor of a single-threaded evaluation (which took a few hours, but did not suffer from spurious time-outs and other glitches).

Interestingly, most solutions did not take the time parameter of the step function into account. These solutions behaved deterministically (and independently of the number of threads). A few others (e.g., Gabriel's or Mathias's), however, based their search depth on the time parameter, which resulted in (mildly) pseudo-random behavior. The whole situation reminded the MC of the following anecdote about the first-order theorem prover Vampire. Vampire takes part in a competition in which provers have a fixed amount of time (say, 5 minutes) to solve a particular problem. The organizer of the competition was in the process of determining a suitable timeout. He ran Vampire on a particular problem with a timeout of 5 minutes, and Vampire solved it in 4:59. Next, he tried a timeout of 3 minutes, and Vampire solved it in 2:59. This went on and on, until Vampire was able to solve the problem within seconds when given a timeout of a few seconds. How could this happen? Vampire was aware of the timeout and adjusted its strategies (or heuristics) for solving the problem accordingly. Only the very last strategy tried was successful for the problem at hand, and it was typically tried only in the last few seconds.

Finally, a few words about the techniques used: minimax, alpha-beta pruning, and game-tree algorithms were recurring themes in the better-performing solutions. All these generic, advanced (but not too difficult to implement) techniques crucially rely on a good cost function for the game's states. Moritz Knüsel's solution didn't have a good cost function and performed badly despite using such techniques (implemented in the Hackage package game-tree). Oh, and Moritz used TABs in his solution; probably that's another source of problems. In turn, Gabriel seems to have found a very good cost function, and the MC sees no TABs.
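For reference, here is a generic fail-hard negamax with alpha-beta pruning. This is not any contestant's code, and it simplifies by assuming strictly alternating turns, which paper soccer's bounce rule violates, so a real implementation must additionally track whose turn it is. The parameters moves and value are the successor generator and the all-important cost function:

-- Negamax with alpha-beta pruning over an abstract game state s.
-- value scores a state from the point of view of the player to move.
alphaBeta :: Int -> Double -> Double -> (s -> [s]) -> (s -> Double) -> s -> Double
alphaBeta depth alpha beta moves value st
  | depth == 0 || null succs = value st   -- leaf or dead end: use the cost function
  | otherwise                = go alpha succs
  where
    succs = moves st
    go a []     = a
    go a (s:ss)
      | v >= beta = beta                  -- cutoff: the opponent avoids this line anyway
      | otherwise = go (max a v) ss
      where v = negate (alphaBeta (depth - 1) (negate beta) (negate a) moves value s)

A root call would look like alphaBeta 5 (-1/0) (1/0) moves value st; to choose an actual move, one picks the successor that maximizes the negated score rather than the score itself.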

Any lessons? Sure!

Task of week 7

The following is the final task of the competition. The submission deadline is Sunday, 8th April, 23:59. The results of this task will be announced (and followed by the award ceremony) in the last lecture of the FP part of the course on Tuesday, 10th April.

The actual task is very simple: Be original!

You are required to submit by email to the MC a picture with any content you want (in one of the following formats: jpg, png, svg, pdf, gif) and a Haskell program (a single file: Art.hs) that was used to generate the picture. Please use "FP Competition Art" as the title of your email. Here are some examples of graphics libraries that you are encouraged to use (others are welcome, too).

The pictures will be evaluated with respect to the aesthetics of the artwork (by an independent but highly subjective jury) and the aesthetics of the code (by the independent but highly subjective MC).

Hint: such an artistic task is a traditional ending of the competition. You can look at previous years' submissions for inspiration. Various fractals have been fairly popular among past submissions. If done properly (i.e., beautifully), they will guarantee you a secure place somewhere in the middle. But typically, to win this task one has to create something more original and surprising.
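To illustrate the fractal genre (and only the genre; this sketch is nobody's submission and would at best land in the aforementioned middle), here is a minimal, library-free program that renders the Mandelbrot set as a plain PPM file; all parameters are made up:

import Data.Complex

main :: IO ()
main = writeFile "art.ppm" (header ++ unlines [pixel x y | y <- [0 .. h - 1], x <- [0 .. w - 1]])
  where
    w, h :: Int
    w = 600
    h = 400
    header = "P3\n" ++ show w ++ " " ++ show h ++ "\n255\n"
    -- Map the pixel grid onto the rectangle [-2,1] x [-1,1] of the complex plane.
    pixel x y = shade (escape (re :+ im))
      where
        re = 3 * fromIntegral x / fromIntegral w - 2
        im = 2 * fromIntegral y / fromIntegral h - 1
    -- Count iterations of z := z^2 + c until |z| leaves the disc of radius 2 (capped at 50).
    escape c = length (take 50 (takeWhile ((< 2) . magnitude) (iterate (\z -> z * z + c) 0)))
    -- Points that survive all 50 iterations come out pure blue.
    shade n = let g = 255 - (n * 255) `div` 50 in unwords [show g, show g, "255"]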

Results of week 7

The results are summarized in the slides (12 MB) presented in the last lecture. The animated version (23 MB) is much prettier (provided that you have a device that supports Keynote).