Wettbewerb

Each exercise sheet contains a competition problem carefully selected by the Master of Competition, Herr Dr. MC "Hammer" Blanchette, or by the Co-Master of Competition, Herr CoMC Traytel. It counts like the other homework exercises, but it is additionally graded as part of a semester-long competition. The grading criteria vary each time. The best 20 solutions receive 20, 19, ..., 1 points, respectively.

Each week, the week's Top 20 is published and the semester's Top 20 is updated. The MC's favorite solutions are put on the Internet. At the end of the semester, the five best students are honored with tasteful trophies. So there are no bonus points here, but plenty of fun and honor (and trophies)!

Important: If you do not want your name to appear on the Internet, you can either opt out of the competition (simply by leaving out the {-WETT-}...{-TTEW-} tags) or write to the MC.

Top 20 of the Semester (after thirteen weeks)
  Place Competitor   Points
1. Lorenz Panny 231.0
2. Simon Wimmer 171.0
3. Julius Michaelis 161.0
4. Elias Marquart 158.0
5. Andreas Bergmaier 150.5
6. Albert Steckermeier 134.0
7. Raphael Schaller 124.0
8. Timm Beckmann 122.0
9. Jan Böttcher 104.0
10. Thomas Breier 102.0
11. Nils Kunze 96.0
12. Patrick Werneck 82.0
13. Felix Sonntag 73.0
14. Stefan Reitmayer 60.0
Markus Westerlind 60.0
16. Manuel Bergler 56.0
17. Dominik Durner 52.0
18. Florian Haffke 51.0
19. Andreas Huber 40.0
20. Martin Sigl 36.0


The Results of the First Week

Top 20 of the Week
  Place Competitor   Tokens   Points
1. Markus Engel 18 20
Patrick Werneck 18 20
Stefan Reitmayer 18 20
4. Nikolaos Tzioras 20 17
Philipp Riedmann 20 17
Lukas Fürmetz 20 17
Martin Sigl 20 17
Timo Geissler 20 17
9. Benjamin Schnoy 21 12
10. Felix Sonntag 22 11
Adrian Endlicher 22 11
12. Nils Kunze 23 9
Patrick Bürgin 23 9
Urs Hanselmann 23 9
Dominic Giebert 23 9
Elias Marquart 23 9
Raphael Schaller 23 9
Cristian-Ioan Plop 23 9
Julius Michaelis 23 9
Jan Böttcher 23 9
Jonas Keinholz 23 9
Robert Lerzer 23 9
Andreas Loibl 23 9
Lorenz Panny 23 9
Russell Hofmann 23 9
Simon Wimmer 23 9
Timm Beckmann 23 9
Matthias Kneidel 23 9

The Master of Competition is proud to report that 458 competitors took part in the first week's challenge! He was particularly impressed by this solution, which requires only 18 tokens (not counting the type signature):

        sum_max_sq x y z = max x y ^ 2 + min x y `max` z ^ 2

Here, the competitor exploited the built-in power function ^ to avoid rolling his own. He also relied on Haskell's idiosyncratic backquotes syntax to dodge a few parentheses (in general, x `f` y == f x y). But what's more important than any amount of syntactic massaging, this competitor found a simple and elegant mathematical formula that solves the problem.
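The parse deserves a second look. A minimal sketch (the primed name is ours): a backquoted function defaults to precedence 9 in Haskell, which binds tighter than ^ (precedence 8), so the winning formula groups as shown in the fully parenthesized version below.

```haskell
-- The winning 18-token formula.
sum_max_sq :: Int -> Int -> Int -> Int
sum_max_sq x y z = max x y ^ 2 + min x y `max` z ^ 2

-- The same formula with explicit parentheses: the backquoted
-- `max` groups before ^, yielding the squares of the two
-- largest arguments.
sum_max_sq' :: Int -> Int -> Int -> Int
sum_max_sq' x y z = (max x y) ^ 2 + (max (min x y) z) ^ 2
```

In general, x `f` y == f x y, so the backquotes save the parentheses around max (min x y) z.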

Here's another remarkable solution. It relies on important material that will be covered later in the course, namely higher-order functions:

        sum_max_sq x y z = sum $ tail $ (^2) `map` sort [x, y, z]

The $ sign is merely a device to avoid parentheses. It denotes function application (i.e., f $ x == f x) but binds less tightly than ordinary function application. The above is morally the same as

        sum_max_sq x y z = sum (tail (map (^2) (sort [x, y, z])))

What's happening here is that x, y, and z are put in a list, then the list is sorted, then the squaring function ^2 is applied to each element in the list. (Since ^2 is a function passed as argument to another function, map, it is called a “higher-order argument.”) The tail function removes the first element from the list, and finally sum computes the sum of the elements of the remaining two-element list. This solution scales nicely to the case where we have to take, say, the squares of the 9 largest elements out of 10.
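The scaling claim is easy to make concrete. Here is a hedged sketch (the name sumMaxSqOf is ours, not from any submission) that sums the squares of the k largest elements of an arbitrary list:

```haskell
import Data.List (sort)

-- Sum of the squares of the k largest elements of a list,
-- generalizing the three-argument version from the text.
sumMaxSqOf :: (Num a, Ord a) => Int -> [a] -> a
sumMaxSqOf k = sum . map (^ 2) . take k . reverse . sort

-- The exercise is the special case k = 2 on a three-element list.
sum_max_sq :: (Num a, Ord a) => a -> a -> a -> a
sum_max_sq x y z = sumMaxSqOf 2 [x, y, z]
```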

Here's a slightly more efficient variant, provided by another competitor:

        sum_max_sq x y z = sum $ map (^2) $ tail $ sort [x, y, z]

And here's a nice implementation of the same idea that avoids the higher-order function by using a list comprehension:

        sum_max_sq a b c = sum [x * x | x <- tail (sort [a, b, c])]

Some competitors found the solutions the MC had in mind when he devised the exercise:

        sum_max_sq x y z = x^2 + y^2 + z^2 - min (min x y) z ^2

        sum_max_sq x y z = x * x + y * y + z * z - a * a
                where a = min x (min y z)

Here's a neat trick with if then else and recursion:

        sum_max_sq x y z = if z > x || z > y then sum_max_sq z x y else x * x + y * y

The median solution (the solution such that half of the submitted entries are shorter than it and half are longer) is as follows:

        sum_max_sq a b c | a <= b && a <= c = b * b + c * c
                         | b <= a && b <= c = a * a + c * c
                         | otherwise = a * a + b * b

One competitor seems to have misunderstood both the letter and the spirit of the MC's perfectly clear instructions and produced the (by far) longest correct solution, complete with a 10-way case distinction and 64 parentheses (all of them needless), totaling 253 tokens or 1406% the size of the shortest solution:

        sum_max_sq x y z
            | (x + y) > (x + z) && (x + y) > (y + z) = (x * x) + (y * y)
            | (x + z) > (x + y) && (x + z) > (y + z) = (x * x) + (z * z)
            | (y + z) > (y + x) && (y + z) > (x + z) = (y * y) + (z * z)
            | x == z && x > y = (x * x) + (z * z)
            | x == y && x > z = (x * x) + (y * y)
            | z == y && z > x = (y * y) + (z * z)
            | x == z && x < y = (x * x) + (y * y)
            | x == y && x < z = (x * x) + (z * z)
            | z == y && z < x = (x * x) + (z * z)
            | x == y && y == z = (x * x) + (y * y)

The MC is reminded of this timeless observation by Prof. Donald Knuth:

The ability to handle lots of cases is Computer Science's strength and weakness. We are good at dealing with such complexity, but we sometimes don't try for unity when there is unity.

Finally, here's the shortest incorrect solution, to remind us of the need to QuickCheck our programs:

        sum_max_sq x y z = (max x y)^2 + (max y z)^2

This program violates the simple property

        sum_max_sq x y z == sum_max_sq y z x

among many others.
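This is precisely the kind of property QuickCheck excels at. A hedged sketch (the name sum_max_sq_bad is ours, and the property is phrased as a plain predicate):

```haskell
-- A correct reference implementation (from the text).
sum_max_sq :: Int -> Int -> Int -> Int
sum_max_sq x y z = x * x + y * y + z * z - a * a
  where a = min x (min y z)

-- The shortest incorrect solution from the text, renamed.
sum_max_sq_bad :: Int -> Int -> Int -> Int
sum_max_sq_bad x y z = (max x y) ^ 2 + (max y z) ^ 2

-- The violated property, as a QuickCheck-style predicate:
-- rotating the arguments must not change the result.
prop_sym :: (Int -> Int -> Int -> Int) -> Int -> Int -> Int -> Bool
prop_sym f x y z = f x y z == f y z x
```

With the QuickCheck library loaded, quickCheck (prop_sym sum_max_sq) should pass, while quickCheck (prop_sym sum_max_sq_bad) should quickly report a counterexample such as x = 1, y = 5, z = 1.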

Exhortations for next week:

    1. Try for unity when there is unity!
    2. QuickCheck!

The Results of the Second Week

There have been 416 submissions to the competition this week; the MC is fairly happy. He has to confess, though, that the exercise was more difficult than intended, since Prof. Nipkow had not yet covered recursion over lists in the lecture (even though it's really the same idea as recursion over natural numbers). This oversight will be taken into consideration when correcting the homework. Sua culpa!

Anyway, the results for the second week are in! Here they are!

Top 20 of the Week
  Place Competitor   Tokens   Efficiency   Points
1. Andreas Bergmaier 28 RS[N] 20
2. Timm Beckmann 28 N[RS] 19
3. Lorenz Panny 28 NRS[] 18
4. Andreas Loibl 29 [RSN] 17
5. Julius Michaelis 30 RS[N] 16
6. Dominik Durner 31 RS[N] 15
Sven Liedtke 31 RS[N] 15
Thomas Breier 31 RS[N] 15
Jonas Keinholz 31 RS[N] 15
Raphael Schaller 31 RS[N] 15
Jan Böttcher 31 RS[N] 15
Matthias Franze 31 RS[N] 15
Nils Kunze 31 RS[N] 15
14. Manuel Bergler 31 N[RS] 7
Stefan Reitmayer 31 N[RS] 7
Andreas Amler 31 N[RS] 7
Michael Legenc 31 N[RS] 7
18. Maximilian Fichtl 32 [NRS] 3
19. Fabian Hellauer 32 RS[N] 2
Benedikt Geßele 32 RS[N] 2
Elias Marquart 32 RS[N] 2
Markus Engel 32 RS[N] 2
Claus Strasburger 32 RS[N] 2
Patrick Bürgin 32 RS[N] 2

This time, an efficiency criterion was used in addition to the number of tokens. The cryptic “RS[N]” etc. notations are explained below. There were many more 32-token solutions than indicated in the table, but these were cut off due to inefficiencies. Sorry!

Let us start with the median solution (37 tokens), since it is arguably the most readable one:

        perms [] = [""]
        perms xs = reverse (sort (nub [[x] ++ p | x <- xs, p <- perms (delete x xs)]))

This solution works, but it does not scale very well. The exercise sheet gave the following example to test scalability, but most competitors seem to have missed the hint (or did not understand what to do with it):

        perms "xxxxxxxxxxxxxxxx" == ["xxxxxxxxxxxxxxxx"]

A solution that is very similar to the median solution but that solves the efficiency issue is

        perms [] = [""]
        perms xs = [[x] ++ p | x <- reverse (sort (nub xs)), p <- perms (delete x xs)]

All we've done is move the reverse, sort, and nub calls inside the list comprehension, so that they operate directly on the alphabet (xs), rather than on the resulting permutations. This works like magic! And pay attention to the order of the calls: First nub, because there's less work left for sort if nub is called first; then sort and reverse. In the Top 20 der Woche, the notation “RSN[]” denotes solutions that have the reverse, sort, and nub calls outside the comprehension, and “[RSN]” those where the calls are inside.
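The difference is easy to observe directly. A sketch with our own names (and with the text's [""] base case generalized to [[]]): on "xxxxxxxxxxxxxxxx", the first variant builds 16! duplicate permutations before collapsing them, while the second returns instantly.

```haskell
import Data.List (delete, nub, sort)

-- "RSN[]": reverse, sort, and nub applied to the finished list
-- of permutations; duplicates are generated before being removed.
permsSlow :: Ord a => [a] -> [[a]]
permsSlow [] = [[]]
permsSlow xs =
  reverse (sort (nub [x : p | x <- xs, p <- permsSlow (delete x xs)]))

-- "[RSN]": the same calls moved onto the alphabet, so duplicate
-- letters are removed before the recursion ever sees them.
permsFast :: Ord a => [a] -> [[a]]
permsFast [] = [[]]
permsFast xs =
  [x : p | x <- reverse (sort (nub xs)), p <- permsFast (delete x xs)]
```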

Now we're ready to attack (so to speak) the winner and his 28 tokens:

        perms chars = reverse $ sort [c : str | c <- nub chars, str <- perms $ delete c chars] `max` [""]

Notice that he put the nub call inside, which is enough to handle the "xxxxxxxxxxxxxxxx" example gracefully. Using `max` on lists, he found a way to avoid the case distinction ([] `max` [""] == max [] [""] == [""]).
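The list comparison at work here is ordinary lexicographic ordering, under which [] is the least list. A small sketch (our names):

```haskell
-- Lists compare lexicographically, and [] is smaller than any
-- non-empty list, so `max` with [""] turns an empty result into
-- [""] and leaves a non-empty result alone.
emptyCase :: [String]
emptyCase = [] `max` [""]

nonEmptyCase :: [String]
nonEmptyCase = ["ba", "ab"] `max` [""]
```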

In the efficiency column of the table, this is represented as “RS[N]”. The optimal solution, with respect to the second criterion, would have been “[RSN]”, such as the MC's 29-token solution:

        perms xs = max [""] [y : ys | y <- reverse $ sort $ nub xs, ys <- perms $ delete y xs]

The max trick may have been above and beyond the call of duty, but that didn't stop one competitor from finding a way to rid himself of the case distinction while staying within everybody's comfort zone (apart from the $ signs, perhaps):

        perms xs = [y : ys | y <- nub $ reverse $ sort xs, ys <- perms $ delete y xs] ++ [xs | null xs]

One of the tutors, Manuel Eberl, found an even shorter solution (27 tokens), using monads. These will be covered much later in the course (if at all):

        perms "" = return ""
        perms s  = reverse $ sort $ nub s >>= liftM2 map (:) (perms . flip delete s)

The instructions were, in the MC's humble opinion, unmistakably clear about the need to provide a “vollständige Lösung” (complete solution), but if we relax this requirement a little bit to reuse the wordsOfLength function developed for Aufgabe H2.3, and allow ourselves to call the higher-order function filter (covered in a few weeks), we can get the following extremely concise and extremely original (but not very efficient) solution, also due to Herr Eberl:

        perms s = reverse $ filter (equals sort s) $ nub s `wordsOfLength` genericLength s

The most complicated working solution of the week is

        perms [] = [""]
        perms xs  | genericLength[x | x <- xs]==2 =
                    if genericLength[z | z<-nub(xs)]==1 then [xs]
                    else reverse(sort(nub[[x]++[y] |  x <- xs, y <- xs, x/=y]))
                  | otherwise = reverse(sort(nub[[x]++y | x <- xs, y<-perms(delete x xs)]))

Perhaps the most intriguing part is why the competitor wrote [x | x <- xs] instead of xs and [z | z<-nub(xs)] instead of nub xs. But at least the solution works, which cannot be said of all solutions!

Strictly speaking, one competitor implemented his own version of sort and reached 108 tokens, which beats the above; but his solution is otherwise simple and clean, so it doesn't deserve the “most complicated working solution of the week” distinction.

A surprising number of competitors submitted solutions that invoked List.permutations, even though the sentence

Basisfunktionen wie ++ sind erlaubt, List.permutations aber nicht! (Basic functions like ++ are allowed, but List.permutations is not!)

explicitly forbids this in the clearest possible terms.

Every week's competition problem has a lesson. Last week's lesson was that we should “try for unity when there is unity.” This week's lesson might be that we should not push this too far. The max [""] trick is perfectly legitimate (even encouraged) in a competition, but we should all resist the temptation to use such tricks in production code, unless we have very good reasons; and in such cases, we should write a comment explaining the trick and provide adequate QuickCheck tests.

Until next week!

The Results of the Third Week

Here are the results for the third week's challenge:

Top 20 of the Week
  Place Competitor   Tokens   Points
1. Andreas Bergmaier 16 20
Raphael Schaller 16 20
3. Thomas Breier 17 18
Dominik Durner 17 18
Lorenz Panny 17 18
Simon Wimmer 17 18
Patrick Werneck 17 18
Alexander Zeilmann 17 18
Elias Marquart 17 18
Felix Sonntag 17 18
Johann Alban Schöpfer 17 18
Stefan Reitmayer 17 18
Albert Steckermeier 17 18
Julius Michaelis 17 18
Jan Böttcher 17 18
Manuel Bergler 17 18
Michael Legenc 17 18
Nils Kunze 17 18
Timm Beckmann 17 18
20. Andreas Loibl 18 1

This week's task was particularly easy for those competitors who noticed the presence of a few functions in the standard Data.List library that the MC had neglected to forbid. Here is one of the two 16-token solutions:

        sublist :: Eq a => [a] -> [a] -> Bool
        sublist ns hs = isPrefixOf ns `any` tails hs

        subseq :: Eq a => [a] -> [a] -> Bool
        subseq ns = elem ns . subsequences

There are many things to explain here. First, the naming is subtle but insightful: ns as in “needle”, hs as in “haystack”. Sweet!

The tails function returns all suffixes of hs, and isPrefixOf ns `any` ... checks whether any of these has ns as a prefix. Another way of writing this without using higher-order functions is

        sublist ns hs = or [isPrefixOf ns ts | ts <- tails hs]

Next, the . (dot) operator denotes function composition. Just like we can write h = gf instead of h(x) = g(f(x)) in mathematics, in Haskell we can write h = g . f instead of h x = g (f x). By the same token, we can write subseq ns = elem ns . subsequences instead of subseq ns hs = elem ns (subsequences hs). This style of programming is called pointfree and tends to encourage abstract thinking (for better or worse, mostly better).
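The two styles can be placed side by side. A sketch using the standard subsequences function (the primed name is ours); the two definitions agree on every input:

```haskell
import Data.List (subsequences)

-- Pointfree, as in the winning entry: the haystack is never named.
subseq :: Eq a => [a] -> [a] -> Bool
subseq ns = elem ns . subsequences

-- The same function, pointful, with both arguments spelled out.
subseq' :: Eq a => [a] -> [a] -> Bool
subseq' ns hs = elem ns (subsequences hs)
```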

The 17-token solutions were similar but either had parentheses (in conjunction with any) or explicitly specified the second argument to subseq:

        sublist xs = any (isPrefixOf xs) . tails

        subseq xs ys = xs `elem` subsequences ys

For sublist, there's also a dual solution that features isSuffixOf and inits instead of isPrefixOf and tails.
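For the record, here is a sketch of the two duals side by side; they agree on every input, since a needle is a prefix of some suffix of the haystack exactly when it is a suffix of some prefix.

```haskell
import Data.List (inits, isPrefixOf, isSuffixOf, tails)

-- The needle is a prefix of some suffix of the haystack ...
sublist :: Eq a => [a] -> [a] -> Bool
sublist ns hs = any (isPrefixOf ns) (tails hs)

-- ... or, dually, a suffix of some prefix.
sublist' :: Eq a => [a] -> [a] -> Bool
sublist' ns hs = any (isSuffixOf ns) (inits hs)
```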

The only working 18-token submission cleverly reused the sublist function to implement subseq:

        subseq xs ys = sublist xs `any` subsequences ys

The naming scheme used in a number of solutions suggests that competitors have been sniffing around on the MC's web page:

        sublist jean claude = isPrefixOf jean `any` tails claude
        sublist qt4Dance boyBand = isSuffixOf qt4Dance `any` inits boyBand

Ha ha, très drôle. ;-) The similarity of the solutions to this and other Hausaufgaben also suggests Gruppenarbeit, a euphemism for Plagiat. Careful there!

The 48-token solution the MC had in mind when he wrote the exercise is as follows:

        sublist xs ys = xs == length xs `take` ys || ys /= [] && xs `sublist` tail ys

        subseq :: Eq a => [a] -> [a] -> Bool
        subseq xs ys =
          null xs ||
          ys /= [] && (if head xs == head ys then tail xs else xs) `subseq` tail ys

Being more familiar with Standard ML than Haskell, he grossly underestimated the functionality available in the Haskell standard library. Thankfully, this is likely to be the last time the winning entries are shorter than 20 tokens. Starting next week, a different wind will blow!

The official, readable solution is as follows:

        sublist xs [] = null xs
        sublist xs (y : ys) = xs == take (length xs) (y : ys) || sublist xs ys

        subseq [] _ = True
        subseq _ [] = False
        subseq (x : xs) (y : ys) = subseq (if x == y then xs else x : xs) ys

Two participants were caught cheating with the following highly original, wrong solution:

        subseq xs ys = xs == intersect ys xs

By submitting their entries to the competition, these students have agreed to have their names put on the Internet (cf. the boxed text in the exercise sheet). The plagiarism was obvious throughout the homework, with one student randomly adding and removing spaces and artificially changing <= 40 to < 41. The MC is extremely tempted to put the guilty parties' names on this page but will resist the temptation for now. He is no lawyer, but if you want to play it safe in the future, he would recommend you heed the following warning.

Warning: If you want to plagiarize and do not want your name to appear on this web page, please leave out the {-WETT-}...{-TTEW-} tags.

Anyway, these two competitors have been penalized: The MC subtracted 1 000 000 000 points from their total, as promised.

Back to our serious business. The median working solution is pretty neat:

        sublist xs ys =
            elem xs [drop a $ take b ys | a <- [0..(length ys)], b <- [0..(length ys)]]

        subseq [] _ = True
        subseq _ [] = False
        subseq xs ys
            | head xs == head ys = subseq (tail xs) (tail ys)
            | otherwise = subseq xs (tail ys)

The longest working solution is 246 tokens long. Here's the sublist half of it:

        sublist [] [] = True
        sublist [] _ = True
        sublist _ [] = False
        sublist [a] [b] = a == b
        sublist [a] (b : bs) = if a == b then True else sublist [a] bs
        sublist (a : as) [b] = False
        sublist (a : as) (b : bs) = if a == b && sublist_1 as bs then True else sublist ([a]++as) bs

        sublist_1 [a] (b : bs) = a == b
        sublist_1 (a : as) [b] = False
        sublist_1 [a] [b] = a == b
        sublist_1 (a : as) (b : bs) = (a == b) && (sublist_1 as bs)

The main distinguishing feature of this implementation is the needless case distinctions; for example, the first, fourth, and fifth sublist equations are unnecessary (since they are subsumed by the second, sixth, and seventh equations). Furthermore, if a == b then True else sublist [a] bs can be simplified to a == b || sublist [a] bs. And sublist_1 is superfluous. But at least, the solution passes all our QuickCheck tests and seems to work.
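Applying those simplifications yields the following reconstruction (ours, not a submission); it keeps the algorithm but drops the subsumed equations and inlines the if-then-else:

```haskell
-- sublist: does the needle occur contiguously in the haystack?
sublist :: Eq a => [a] -> [a] -> Bool
sublist [] _ = True
sublist _ [] = False
sublist (a : as) (b : bs) =
  (a == b && prefixOf as bs) || sublist (a : as) bs

-- prefixOf replaces the superfluous sublist_1 (our name): is the
-- first list a prefix of the second?
prefixOf :: Eq a => [a] -> [a] -> Bool
prefixOf [] _ = True
prefixOf _ [] = False
prefixOf (a : as) (b : bs) = a == b && prefixOf as bs
```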

Incidentally, it's always harder to make a strong correctness statement about large, complicated code than about short, simple code. Our friend Tony Hoare put it nicely:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

Here's more quotable material from the same source:

There is nothing a mere scientist can say that will stand against the flood of a hundred million dollars. But there is one quality that cannot be purchased in this way–and that is reliability. The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.

All of us sometimes fall prey to doing things in an overcomplicated way. Even Donald Knuth, who urges us to “try for unity when there is unity,” found himself writing an eight-page incorrect Pascal program where a six-line shell script would have fit the bill. There's also a neat Haskell solution online, which many of you can probably read and understand already. What does this tell us? By developing your functional programming skills, you can compete with the crème de la crème—and win!

For the record, here are a few more complicated (and poorly formatted) answers:

        sublist [] ys = True
        sublist xs [] = False
        sublist xs ys            | length ys < length xs = False
                    | head ys == head xs = sublistHelp (tail xs) (tail ys) xs (counter-1)
                    | head ys /= head xs = sublist xs (tail ys)
                        where counter = length xs

        sublistHelp xs ys original counter
                                    | counter == 0 = True
                                    | length ys < length xs = False
                                    | head ys == head xs = sublistHelp (tail xs) (tail ys) original (counter-1)
                                    | head ys /= head xs = sublist original ys


        sublist xs ys = if xs == [] then True else (if ys == [] then False else (if head xs == head ys then (if genericLength xs > 1 then (if genericLength ys > 1 then sublist' (delete (head xs) xs) (delete (head ys) ys) else False) else True) else (if genericLength ys > 1 then sublist xs (delete (head ys) ys) else False)))

        sublist' xs ys = if head xs == head ys then (if genericLength xs > 1 then (if genericLength ys > 1 then sublist' (delete (head xs) xs) (delete (head ys) ys) else False) else True) else False

For next week, one of the criteria is that the solutions are properly formatted. Haskell books and Prof. Nipkow's slides are full of carefully formatted code snippets, elegantly wrapped at a reasonable column (e.g., 80 characters).

This week's lessons:

    1. Know thy libraries!
    2. Format thy code nicely!
    3. Thou shalt not sniff around on the MC's web site!

Lesson 1 is one that applies especially to the MC, who would have ended up at place 60 or so with his solution.

The Results of the Fourth Week

Here are the results for the fourth week:

Top 20 of the Week
  Place Competitor   Tokens   Points
1. Andreas Bergmaier (23 or) 31 20
2. Lorenz Panny 41 19
3. Elias Marquart 44 18
4. Tobias Betz 50 17
5. Timm Beckmann 52 16
6. Clemens Elflein 55 15
7. Julius Michaelis 59 14
8. Raphael Schaller 61 13
9. Nils Kunze 68 12
10. Albert Steckermeier 71 11
11. Stefan Reitmayer 74 10
12. Simon Wimmer 74 9
13. Patrick Werneck 75 8
Jan Böttcher 75 8
15. Claus Strasburger 76 6
16. Johann Alban Schöpfer 77 5
17. Markus Westerlind 80 4
Robert Lerzer 80 4
19. Andreas Huber 86 2
20. Nikolas von Randow 87 1

This week's challenge required more code, and as a result the top 20 solutions were all different, which is really nice. And if the MC's “swedar” (Swedish radar) is working correctly, this is the first time we have a Swede in the “Topp tjugå.” Skitbra!

There were a few ex aequo solutions; these were judged on formatting and naming conventions. Calling a helper function foobar is a pretty safe way to lose some points there! (Yes, that's a real example from a Top-20 competitor.)

One aspect of the exercise has been interpreted differently by different competitors. Everybody seems to have understood “steigend” as meaning non-strictly ascending (<=), but a few competitors took “fallend” to mean strictly descending (>). The MC has decided to be magnanimous here, since the text wasn't completely explicit about this aspect and the example given didn't disambiguate things. Furthermore, the alternative interpretation often passed the QuickCheck tests with 100 iterations.

One competitor—let's call him Caesar—changed his interpretation of “fallend” from > to >= 10 hours and 1 minute past the deadline. Caesar probably thought the MC wouldn't notice; but the MC sees everything. Consider yourselves warned! ;-)

The shortest solution, with 31 tokens, features this alternative worldview. It is a true masterpiece of succinctness:

        upsAndDowns xs = map fst `map` groupBy ((==) `on` snd) (zip xs $ True : (zipWith (<=) xs $ tail xs))

What's going on here? Let's see with the example xs = [1, 3, 4, 5, 4, 3, 4].

    1. The list xs is compared pointwise with its own tail w.r.t. <=, which gives
      [1 <= 3, 3 <= 4, 4 <= 5, 5 <= 4, 4 <= 3, 3 <= 4], i.e.,
      [True, True, True, False, False, True].
    2. True is prepended to the result:
      [True, True, True, True, False, False, True].
    3. The result is zipped with the original list:
      [(1, True), (3, True), (4, True), (5, True), (4, False), (3, False), (4, True)].
    4. Identical consecutive Boolean values are “grouped” into sublists:
      [[(1, True), (3, True), (4, True), (5, True)], [(4, False), (3, False)], [(4, True)]].
    5. The second pair components are dropped:
      [[1, 3, 4, 5], [4, 3], [4]].
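The five steps can be replayed as definitions (the stepN names are ours), which may help when experimenting in GHCi:

```haskell
import Data.Function (on)
import Data.List (groupBy)

xs :: [Int]
xs = [1, 3, 4, 5, 4, 3, 4]

-- Step 1: compare the list pointwise with its own tail.
step1 :: [Bool]
step1 = zipWith (<=) xs (tail xs)

-- Step 2: prepend True.
step2 :: [Bool]
step2 = True : step1

-- Step 3: zip with the original list.
step3 :: [(Int, Bool)]
step3 = zip xs step2

-- Step 4: group identical consecutive Boolean values.
step4 :: [[(Int, Bool)]]
step4 = groupBy ((==) `on` snd) step3

-- Step 5: drop the second pair components.
step5 :: [[Int]]
step5 = map (map fst) step4
```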

The same competitor also found a 23-token solution that reuses the irregularChunks function developed in H4.2:

        upsAndDowns xs = map length (group $ True : (zipWith (<=) xs $ tail xs)) `irregularChunks` xs

Unfortunately, it bends the rules, but the MC and his acolytes are nonetheless extremely pleased to see so much creativity at work.

The shortest solution that did not adhere to the alternative worldview needed 41 tokens:

        upsAndDowns' cmp xs = take n xs : flip cmp `upsAndDowns'` drop n xs
          where n = succ . length . takeWhile id . zipWith cmp xs $ tail xs
        upsAndDowns = fst . break null . upsAndDowns' (<=)

There are many noteworthy aspects of this solution.

    1. The most interesting is probably the use of (<=) together with flip. The flip function takes a two-argument function and returns a function that takes its two arguments in reverse order. Hence, flip (<=) == (>=) and flip (>=) == (<=). By flipping the comparison operator (cmp) in each recursive call, the solution reuses the same code in both directions (up and down).
    2. But the fst . break null trick is also something. The upsAndDowns' function never terminates: It computes the ups-and-downs and then appends an infinite stream of []s at the end. Then upsAndDowns uses fst . break null to take the prefix consisting of non-[] lists. This only works because Haskell uses lazy evaluation; in a programming language like Standard ML or OCaml, the function upsAndDowns' would never return. (More on this in a future Vorlesung.)
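The fst . break null behavior can be sketched in isolation (our names): thanks to laziness, only the finite prefix before the first [] is ever forced.

```haskell
-- An infinite list of chunks, as upsAndDowns' would produce:
-- the real segments followed by an endless stream of []s.
chunks :: [[Int]]
chunks = [[1, 2], [3]] ++ repeat []

-- break null splits at the first empty chunk; fst keeps the
-- finite prefix. Lazy evaluation makes this terminate even
-- though chunks is infinite.
finitePart :: [[Int]]
finitePart = fst (break null chunks)
```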

Here's a 44-token variant, also based on flip:

        upsAndDowns xs = spalteSortierteListen xs compare

        spalteSortierteListen :: Ord a => [a] -> (a->a->Ordering) -> [[a]] {-oder where-}
        spalteSortierteListen [] _ = []
        spalteSortierteListen xs operator = maxTeilliste:length maxTeilliste `drop` xs `spalteSortierteListen` flip operator
            where maxTeilliste=last [n|n<-inits xs,n==sortBy operator n]

The formatting leaves much to be desired, and the mixture of German and English isn't pretty, but otherwise it's a nice solution.

The Master of PuppTutors Lars Noschinski found this 47-token solution, based on uncurry and swap instead of flip. Even though this may seem like a roundabout way to proceed, it saved Master Noschinski a handful of tokens:

        upsAndDowns2 :: Ord a => ((a,a) -> Bool) -> [(a,a)] -> [[a]]
        upsAndDowns2 _ [] = mempty
        upsAndDowns2 comp xs =
            map fst (takeWhile comp xs) : upsAndDowns2 (comp . swap) (dropWhile comp xs)

        upsAndDowns xs = upsAndDowns2 (uncurry (>=)) $ zip xs $ head xs : xs

Here's yet another variant, which recursively finds the longest prefix of the current list that is either ascending or descending. To alternate between the two modes, a function ord, which is either id (the identity) or reverse, is applied to the list. In the recursive call, a reverse is composed in front of that function. Observe that reverse . reverse == id.

        monoton :: Ord a => ([a] -> [a]) -> [a] -> [[a]]
        monoton _ [] = []
        monoton ord xs = firstMonoton xs : monoton (reverse . ord) (xs \\ firstMonoton xs)
            where firstMonoton = last . takeWhile (isAscending . ord) . inits

        isAscending :: Ord a => [a] -> Bool
        isAscending xs = xs == sort xs

        upsAndDowns = monoton id

(A strange language mixture again; monotonic or monotone would have been English.)

Another variation on the same theme (52 tokens):

        upsAndDowns xs = f id xs
        rev = (map reverse .)
        f _ [] = []
        f cp xs = l : rev cp `f` drop (length l) xs
            where l = maximum $ filter (`isPrefixOf` xs) $ cp $ sort `map` inits xs

And here's a version where the helper function takes two helper functions (asc and desc) as arguments and swaps them in each recursive call (55 tokens):

        upsAndDowns xs = upsAndDowns' xs asc desc

        upsAndDowns' [] _ _ = []
        upsAndDowns' xs f g = y : upsAndDowns' (xs\\y) g f
             where y = last $ takeWhile f $ inits xs

        asc xs = xs == sort xs
        desc xs = reverse xs == sort xs

The obvious solution hinted at in the exercise text, with upsAndDowns2 and downsAndUps2, has 145 tokens and has been submitted independently (or so we hope) by about 50 competitors. However, even without higher-order functions, it's possible to do much better (96 tokens) using only the material seen so far in class:

        upsAndDowns2 :: Ord a => Bool -> [a] -> [a] -> [[a]]
        upsAndDowns2 _ [] ys = [reverse ys]
        upsAndDowns2 up (x : xs) [] = upsAndDowns2 up xs [x]
        upsAndDowns2 up (x : xs) (y : ys)
          | if up then x >= y else x <= y = upsAndDowns2 up xs (x : y : ys)
          | otherwise = reverse (y : ys) : upsAndDowns2 (not up) (x : xs) []

        upsAndDowns :: Ord a => [a] -> [[a]]
        upsAndDowns [] = []
        upsAndDowns xs = upsAndDowns2 True xs []

Here comes the Mother of All Funky Solutions. The competitor built the list [(>=), (<=)], used the Haskell function cycle to create the infinite list [(>=), (<=), (>=), (<=), (>=), (<=), ...], then used the head of that list as the comparison operator and passed the tail (e.g. [(<=), (>=), (<=), (>=), (<=), ...]) in the recursive call. This may sound like it's off by one (starting with >= instead of <=) but there are a couple of (possibly needless) reverses in there that make it right:

        upsAndDowns :: Ord a => [a] -> [[a]]
        upsAndDowns (x:xs) = upsAndDowns2 (cycle[(>=),(<=)]) xs [x]
        upsAndDowns _ = []

        upsAndDowns2 :: [(a->a->Bool)] -> [a] -> [a] -> [[a]]
        upsAndDowns2 f (x:xs) (y:ys)
            | head f x y = upsAndDowns2 f xs (x:y:ys)
            | otherwise = reverse (y:ys):upsAndDowns2 (tail f) xs [x]
        upsAndDowns2 f [] ys = [reverse ys]

The longest working solution had 251 tokens. This is a bit surprising considering that the exercise clearly pointed to an obvious 145-token solution, but there's nothing wrong with working out solutions for oneself.

As in the previous weeks, the exercise gave a definite advantage to those competitors who could think in higher-order terms, e.g., with flip. But even by staying first-order, there were ways to avoid evil code duplication.

Also, like last week, it was advantageous for the token count to use functions like inits, even though these usually lead to inefficient code. Next week, things will be different, because for the first time we will completely ignore token counting and focus exclusively on efficiency. More specifically, we'll be interested in good algorithms that scale well in the size of the inputs.

This week's lessons:

    1. We can often avoid code duplication by having our helper functions take functions as arguments.
    2. The standard library is full of useful higher-order functions (zipWith, groupBy, etc.) that can be assembled together to solve problems quickly.
    3. The MC sees everything.

The Results of the Fifth Week

Here are the results for the fifth week:

Top 20 of the Week
  Place Competitor   Points
1. Andreas Bergmaier 20
2. Thomas Breier 19
Martin Sigl 19
4. Robin Schlichtmann 17
Philip Becker-Ehmck 17
Patrick Nieves 17
Lorenz Panny 17
Andreas Amler 17
Anne O. Nymous 17
Dominik Scholz 17
11. Elisabeth Eichholz 10
Felix Bogischef 10
Julius Michaelis 10
Manuel Bergler 10
Timm Beckmann 10
16. Andreas Kaufhold 5
Stefan Reitmayer 5
Simon Wimmer 5
19. Kilian Batzner 2
Philipp Krueger 2

Since the criterion was not as objective as token count, it was an art to put together the above table. The Master of Competition started with 143 solutions that compiled and that had passed all the official QuickCheck tests. This is too many—he wants only 20 winners—so the first phase consisted in filtering out the less serious competitors.

He ran a few tests of his own design; solutions that returned False for any of them or took more than 5 s were kicked out. Here are some of the tests:

        not (quasiSubseq [1, 1, 1] [1])
        not (quasiSubseq [1, 2, 1] [1])
        not (quasiSubseq [1, 2] [])
        not (quasiSubseq (replicate 50 0) (replicate 5 0))
        quasiSubseq [1, 2, 3] [2, 1, 2]

And then there were 110.
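These tests pin down the intended semantics: xs is a quasi-subsequence of ys iff deleting at most one element of xs yields an ordinary subsequence of ys. Here is a deliberately naive reference implementation of that reading (a sketch for checking small cases, not any competition entry, and far too slow for the efficiency tests that follow):

```haskell
-- Plain subsequence test: greedy matching is optimal here.
isSubseq :: Eq a => [a] -> [a] -> Bool
isSubseq [] _ = True
isSubseq _ [] = False
isSubseq (x:xs) (y:ys)
  | x == y    = isSubseq xs ys
  | otherwise = isSubseq (x:xs) ys

-- Naive reference: try the needle itself and every one-element deletion.
quasiSubseq :: Eq a => [a] -> [a] -> Bool
quasiSubseq xs ys = any (`isSubseq` ys) (xs : deleteOne xs)
  where
    deleteOne []     = []
    deleteOne (z:zs) = zs : map (z:) (deleteOne zs)
```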

The first interesting example that was tested is the standard example from the exercise sheet:

        not (quasiSubseq [1 .. N] ([N] ++ [2 .. N - 1] ++ [1]))

The MC first ran all the solutions against the standard example at N = 5000 with a 5 s timeout. 53 solutions timed out; the median time for the successful solutions was 0.03 s, but results ranged from 0.01 s to 3.70 s (all according to GHCi's +s mechanism).

And then there were 57.

The second test is the same, but this time with N = 50000 and a generous 30 s timeout. Recall that this is the example for which the MC boasted 0.4 s for his own solution. 8 solutions timed out. The median time was 0.16 s. Interestingly enough, there's a huge gap between the 45th solution, which took 0.32 s, and the 46th, which took 16.47 s. That's where the line was drawn.

And then there were 45.

The next test tries to catch solutions that process the haystack repeatedly (i.e., that are not linear in the length of the haystack):

        not (quasiSubseq (take 50000 (cycle [0, 1])) (replicate 100000 0))

The inefficient implementations typically have a || somewhere, with potentially expensive recursive calls on both sides. Now, that's the test that really separated the women (and men) from the girls (and boys): 23 solutions took between 0.02 s and 0.47 s; the others timed out at 5 s.

And then there were 23.
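The problematic shape can be reconstructed roughly like this (an invented illustration, not any competitor's actual code): on every mismatch, the function tries both skipping a haystack element and spending the single allowed needle deletion, joined by ||, so a haystack element may be reprocessed many times.

```haskell
-- Correct but slow: the || joins two recursive calls, and the right-hand
-- branch rescans the remaining haystack, giving roughly quadratic work.
quasiSubseqSlow :: Eq a => [a] -> [a] -> Bool
quasiSubseqSlow = go False
  where
    go _       []     _  = True
    go skipped (_:xs) [] = not skipped && null xs   -- may delete the last element
    go skipped (x:xs) (y:ys)
      | x == y    = go skipped xs ys                -- greedy match is safe
      | otherwise = go skipped (x:xs) ys ||         -- skip a haystack element
                    (not skipped && go True xs (y:ys))  -- or spend the deletion
```

On the cycle/replicate test above, the deletion branch fires at every mismatch and then scans the rest of the haystack, which is exactly the non-linear behavior the test was designed to expose.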

The next three tests try huge needles on small haystacks:

        not (quasiSubseq (replicate 50000000 0) (replicate 5000 1))
        not (quasiSubseq (replicate 50000000 0) (replicate 5000 0))
        not (quasiSubseq (replicate 50000000 0) [])

13 solutions could cope with the first example in less than 0.10 s; the other 10 took at least 2.96 s. So we already know who's going to occupy the top-13 range of the top 20. The second and third tests helped refine this.

The next example is full of 0s and 1s, but there are not enough 0s in the haystack:

        not (quasiSubseq (take 50000 (cycle [0, 1])) (take 24999 (cycle [0, 1, 1, 0, 1])))

Finally, at this point the MC started looking at individual solutions and found bugs in three of them, exposed by the tests

        quasiSubseq [2, 1, 1] [1, 1, 2]
        quasiSubseq [2, 1, 1, 1] [1, 1, 2, 1]

And then there were 20.

This is where the second criterion—the general appearance of the code—should come into play. But finding all these bugs, and coming up with appropriate tests, has left the MC with no energy to embark on such an adventure, so this week he's considering all programs to be perfectly formatted, even though that does not entirely correspond to the truth. ;-)

Here's the solution that scored best overall and that is, hopefully, bug-free:

        quasiSubseq :: Eq a => [a] -> [a] -> Bool
        quasiSubseq = genericSubseq 1

        -- general quasi-subsequences matcher for x jumps in the needle list
        genericSubseq x = reduce (replicate x 0)
         where
         reduce :: Eq a => [Int] -> [a] -> [a] -> Bool
         reduce _    []     _  = True
         reduce poss (n:ns) hs = case findNext 0 poss n hs of
                                  [] -> False
                                  (f:found) -> reduce found ns (drop f hs)

         -- takes a list of relative, accumulating positions and returns the next ones (whose matchings would include n)
         findNext :: Eq a => Int -> [Int] -> a -> [a] -> [Int]
         findNext x poss@[] n [] = poss
         findNext x poss@[] n (h:hs) = if n == h
                                       then [x+1]
                                       else findNext (x+1) poss n hs
         findNext x (p:poss) n hs = if null hs || p == 0
                                      then x+p : findNext 0 poss n hs
                                      else if head hs == n
                                       then x+1 : findNext (p-1) poss n (drop p hs)
                                       else findNext (x+1) (p-1:poss) n (tail hs)

There won't be any attempt at explaining the solution here, because that would require understanding it.

[Update: In a private email to the MC, Herr Bergmaier described his solution thusly:

findNext returns the positions at which the currently sought element (n) occurs in the haystack, each lying before the next list element (head poss) or before the end of the haystack. x is a pure counter variable. The positions (poss) are not absolute; they are relative to the preceding element, or to the beginning of the part of the haystack that still remains to be searched.]

Here's the MC's own solution:

        sums [] [] = []
        sums (x : xs) (y : ys) = x + y : sums xs ys

        subseqIndices _ [] _ = []
        subseqIndices _ [_] _ = []
        subseqIndices n (x : xs) [] = n + 1 : subseqIndices n xs []
        subseqIndices n (x : xs) (y : ys) =
          if x == y then n : subseqIndices (n + 1) xs ys
          else subseqIndices (n + 1) (x : xs) ys

        quasiSubseqNeedleTooLong (_ : xs) (_ : ys) = quasiSubseqNeedleTooLong xs ys
        quasiSubseqNeedleTooLong (_ : _ : _) [] = True
        quasiSubseqNeedleTooLong _ _ = False

        quasiSubseq xs ys =
          not (quasiSubseqNeedleTooLong xs ys) &&
          any (<= length ys)
              (sums (0 : subseqIndices 1 xs ys)
                    (reverse (0 : subseqIndices 1 (reverse xs) (reverse ys))))

The quasiSubseqNeedleTooLong is merely an optimization, so that if the needle is two or more characters longer than the haystack, this is discovered without inspecting the entire needle.

The key idea is that when trying to match a needle needle with a haystack _n_e___e$l__e_, we simultaneously try to match the beginning of the needle with the beginning of the haystack and the end of the needle with the end of the haystack. In either direction, we don't know when to stop, so we just go all the way, assuming the worst and recording the information in two lists.

           n    e    e    d    l    e
Forward:   2    4    8   16   16
Backward:      16   16   16    5    2

Thus, the entry 5 means that one needs to read 5 characters backwards in _n_e___e$l__e_ to match le. 16 stands for infinity. Two entries are left empty, because once we've matched needl forward or eedle backward, we've won already.

Then the table rows are shifted by a couple of positions, with some 0 padding, to simulate ignoring one character of the needle. The columns are then summed:

           n    e    e    d    l    e
Forward:   0    2    4    8   16   16
Backward: 16   16   16    5    2    0
Sum:      16   18   20   13   18   16

Then we check whether there's a column whose sum is less than or equal to the length of the haystack. There is: 8 + 5 = 13. This means that if we match nee from the left and le from the right, we end up with a hole of size 1 (= 14 − 13, where 14 is the length of _n_e___e$l__e_), namely the $ between _n_e___e and l__e_.

One of the tutors, Manuel Eberl (as usual), also found a very nice and nicely documented solution:

        safeTail [] = []
        safeTail (x:xs) = xs

        -- if we're in state l with the word x:xs left to read and we get a
        -- character c from the haystack, what is the maximum state we can
        -- reach?
        step' :: Eq a => ([a], Integer) -> a -> ([a], Integer)
        step' (x:xs, l) c
            | x == c = (xs, l + 1)
            | otherwise = (x:xs, l)

        -- if our step brought us to l1 in the upper automaton and to l2 in the lower
        -- one, this function makes sure the maximum state in the lower one really is
        -- the maximum state.
        -- (if the lower automaton "gets stuck", we might get farther by taking a step
        -- into it from the upper one, this function checks whether this is the case)
        combine :: Eq a => ([a], Integer) -> ([a], Integer) ->
            (([a], Integer), ([a], Integer))
        combine (xs, l1) (ys, l2)
            | l1 < l2 = ((xs, l1), (ys, l2))
            | otherwise = ((xs, l1), (safeTail xs, l1 + 1))

        -- applies the effect of a single read character to the state pair of upper and
        -- lower automaton.
        step :: Eq a => (([a], Integer), ([a], Integer)) -> a ->
            (([a], Integer), ([a], Integer))
        step (s1, s2) c = combine (step' s1 c) (step' s2 c)


        -- iterate step until haystack is empty or final state reached
        -- note: final state reached iff maximum state in lower automaton has the empty
        -- word left to find of the needle
        steps :: Eq a => (([a], Integer), ([a], Integer)) -> [a] ->
            (([a], Integer), ([a], Integer))
        steps s [] = s
        steps (s1, ([], l2)) _ = (s1, ([], l2))
        steps (s1, s2) (h:hs) = steps (step (s1, s2) h) hs

        -- we start in state 0 in the "normal" automaton and state 1
        -- in the "already skipped one character" automaton.
        -- the remaining needle parts are "ns" resp. "safeTail ns"
        quasiSubseq :: Eq a => [a] -> [a] -> Bool
        quasiSubseq ns hs = null ns'_final
            where (_, (ns'_final, _)) = steps ((ns, 0), (safeTail ns, 1)) hs

        main = print (quasiSubseq [1..1000000] ([1000000] ++ [2..999999] ++ [1]))

He was gracious enough to write up an explanation in HTML:

Say we want to find out whether the word “needle” is a subsequence of some input text (the haystack). We can solve this by feeding the haystack to the following non-deterministic finite state automaton (NFA):

[Figure: NFA for subsequences]

From this, we can build an NFA that checks whether “needle” is a quasi-subsequence of the haystack by copying the entire automaton, putting it below the original automaton and simplifying it a bit, like this:

[Figure: NFA for quasi-subsequences]

The transitions labelled with * can be taken with any character, i.e. we can always jump from the upper automaton to the lower one and skip a character of “needle” in the process – but not the other way round, therefore we ensure we can only do this once.

Now we can feed our haystack into this automaton, and “needle” is a quasi-subsequence of it if (and only if) we reach the final state 6' at some point. We can do this by remembering the set of all states we can be in after reading the part of the haystack we've already processed. In the beginning, this set is, of course, {0}. If we then read an “n”, the set becomes {0,1,1'}; if we read something else, it is {0,1'}.

However, this set of states can become as large as our entire automaton, i.e., if the needle is very long, the set of states can be very large. It turns out, though, that we don't have to track the entire set; it suffices to remember the maximum state in the upper automaton and the maximum state in the lower automaton: if state 2 gets us to the final state at some point in the computation, so does any of its successors (because of the loops at every state), and the same holds for, e.g., 2'.

The rest is fairly simple: we start with (0, 1'). Then we read a character from the haystack, find out where it gets us from 0 and from 1' and remember the maximum position in the upper automaton and the maximum position in the lower one. For an “e” in states (0, 1'), this would be (0, 2'). We're done if the haystack is fully read or if we've reached the final state, and once we are done, we simply return “True” if we are in the final state and “False” if we're not.

Comments on the Haskell code:

  • for efficiency, I “tag” every state with the remaining part of the needle, i.e. 0 with “needle”, 1 with “eedle” and so on.
  • step' takes a state (i.e. in this case the state number and the remaining needle part) and the character read from the haystack and computes the maximum next state (and its remaining needle part).
  • combine takes the result of step' in the upper and lower automata and combines them, ensuring that we store the maximum state in the lower automaton. The reason we need this is that if the upper automaton advances farther than the lower one, the maximum state in the lower automaton results from the upper one taking one of the diagonal transitions into the lower one, not from a step of the lower automaton into itself.
  • step takes the pair of the maximum states in the two automata and the character read from the haystack and returns the pair of the maximum states after the transition.
  • steps iterates step over the haystack until it is empty or we have reached a final state.

Uff, what a Woche! What did we learn?

    1. Subtle bugs can still hide in our code even after extensive QuickCheck testing. We need to be extremely careful (or prove our code correct).
    2. Having several recursive calls joined by || (or &&) is usually a sign that the algorithm is not going to scale so well. (More on this in your Datenstrukturen course; the MC doesn't want to steal their thunder.)
    3. Nothing beats an ingenious efficient algorithm.
    4. Ingenious algorithms require not only extensive testing but also extensive documentation, including diagrams.

Update: An Important Message from the Master of Masters!

The Meta-Master (MM), Tobias Glaub-ich-nich Nipkow, is a notorious skeptic. So he and his exercise masters took another close look at the solutions of the 20 winners of the 5th competition round. However, they did not use Haskell's QuickCheck but their own tool (let's call it BetterCheck). The result came as no surprise to the MM: of the top 20, 6 solutions were wrong. Here is an example from the category “there are no obvious deficiencies”:

        quasiSubseq :: Eq a => [a] -> [a] -> Bool
        quasiSubseq xs ys = quasiSubseq' xs ys False

        quasiSubseq' :: Eq a => [a] -> [a] -> Bool -> Bool
        quasiSubseq' [] _ _ = True
        quasiSubseq' xs [] dropped = not dropped && length xs == 1
        quasiSubseq' [a] [b] dropped = not dropped || a == b
        quasiSubseq' (x:xs) (y:ys) dropped =
        	length (y:ys) +1 >= length (x:xs)  && (
        	x_found && quasiSubseq' new_xs new_ys dropped ||
        	not dropped && quasiSubseq' xs (y:ys) True)
        	where
        		nonempty = xs /= [] && ys /= []
        		find_x_in_ys =
        			until (\as -> null as || head as == x) (\as -> if not $ null as then tail as else as) (y:ys);
        		x_found = not $ null find_x_in_ys;
        		new_xs_tmp = if nonempty && last xs == last ys && dropped then init xs else xs;
        		new_ys_tmp = if not dropped then (if x_found then tail find_x_in_ys else find_x_in_ys) else
        			if length find_x_in_ys > 1 then tail $ init find_x_in_ys else ((if x_found then tail find_x_in_ys else find_x_in_ys));
        		zs = (new_xs_tmp, new_ys_tmp);
        		new_zs = until (\(as, bs) -> null as || null bs || head as /= head bs) (\(as, bs) -> (tail as, tail bs)) zs;
        		new_xs = fst new_zs;
        		new_ys = snd new_zs;

Who can find a counterexample? An interesting statistic: of the six wrong solutions, four had a counterexample of length 3, one had one of length 4, and one had one of length 5.

The all-clear for the 6 authors of wrong solutions: the MC has awarded his points, and not a single one of them will be reclaimed.

What do we learn from this?

  1. Writing efficient and correct code is not easy.
  2. The MC is absolutely right: “Subtle bugs can still hide in our code even after extensive QuickCheck testing. We need to be extremely careful (or prove our code correct).”
  3. Counterexamples are usually small.
  4. Haskell's QuickCheck is quick; BetterCheck (see above) is better.
  5. In the future, we will put more effort into checking.

Die Ergebnisse der sechsten Woche

Here are the updated results for the sixth week.

Top 20 der Woche
  Platz Wettbewerber(in)   Punkte
1. Andreas Huber 20
Felix Sonntag 20
2. Peter Bludau 19
Philipp Zetterer 19
Andreas Eichner 19
Elisabeth Eichholz 19
Maximilian Walther 19
Matthias Franze 19
Nina Tanakova 19
4. Lorenz Panny 17
Nils Kunze 17
Maximilian Schüle 17
Timm Beckmann 17
5. Clemens Elflein 16
Lucas Jacobson 16
Philipp Gerling 16
Richard Littmann 16
Florian Strauß 16
7. Paul Bergmann 14
Andreas Bergmaier 14
Andreas Greimel 14
Florian Hauer 14
Elisabeth Dankerl 14
Marcel Schumacher 14
Raphael Schaller 14
Alexander Sehr 14
Benedikt Brandner 14
Benedikt Geßele 14
Yves Hrvoje Matkovic 14
Julius Michaelis 14
Jan Böttcher 14
Michael Schneider 14
Maximilian Sölch 14
Manuel Ehler 14
Markus Heidegger 14
Matthias Brugger 14
Maximilian Fichtl 14
Maximilian Geißler 14
Michael Remmler 14
Philipp Krueger 14
Robin Schlichtmann 14
Simon Wimmer 14
Sven Hertle 14
Nurettin Güner 14
Benedict Gruber 14

The ranking process is similar to last week's. We start with 156 participants who provided solutions within the {-WETT-} tags, who made some changes to the template implementation (undefined), and whose solutions compile. The first test,

        anonymize "" == ""

kicks out 14 solutions, sometimes with some scary errors:

        *** Exception: Prelude.head: empty list

The next tests consist of the first three examples from the exercise sheet (pelle@foretag.se & Co.). 28 solutions have some bugs there, which leaves us with 114. Next:

        ---@example.com

Here both -__@e______.c__ and ---@e______.c__ are allowed as solutions. In the second case, - (hyphen) is effectively treated the same way as . (dot), as a separator, which seems legitimate. The MC tends to be liberal in such matters of interpretation, both because the instructions were (deliberately) unclear and because he wants to avoid Ärger with unhappy competitors. Anyway, this leaves us with 75 solutions. Next:

        aa@bb cc@dd
        aa@bb, cc@dd
        aa@bb,cc@dd
        aa@bb;cc@dd

should give

        a_@b_ c_@d_
        a_@b_, c_@d_
        a_@b_,c_@d_
        a_@b_;c_@d_
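The masking convention exercised by all of these tests keeps '@' and '.', keeps the first character of each chunk between them, and replaces every remaining character with an underscore. A minimal sketch of just this rule (maskAddr is an invented name; a real solution must of course first decide which substrings are valid addresses before masking them):

```haskell
-- Sketch of the masking rule only; finding the addresses is the hard part.
maskAddr :: String -> String
maskAddr = go True
  where
    go _ [] = []
    go _ (c:cs)
      | c `elem` "@." = c : go True cs     -- separators survive unchanged
    go keep (c:cs)
      | keep      = c   : go False cs      -- first character of a chunk
      | otherwise = '_' : go False cs      -- everything else is masked
```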

This drastically reduces the number of competitors to 31. How many of the remaining solutions can actually handle quotes?

        "hello"@hello.com

Here, both "______@h____.c__ and "h____"@h____.c__ were allowed. Only 6 solutions survived this test. Next,

        user@[IPv6:2001:db8:1ff::a0b:dbd0]

is anonymized to u___@[____________________________ or u___@[___________________________] or u___@[I___:2___:d__:1__::a__:d___] by 4 solutions.

        aa(bb)@cc.dd

with a_____@c_.d_ and a_(b_)@c_.d_ kicks out 1 solution, leaving 3. Finally,

        "much.more unusual"@example.com

kicks out 2; "much.more u______"@e______.c__ is definitely wrong. This gives us our top 20—actually, top 31.

A competitor, let's call him Napoleon, wrote

        -- Is the mc nuts? Making a ranking from this is ridiculous...
        -- It's not my problem though. Anyway, please give us a
        -- reasonable task next time. :/

As the MC sees it, the pros of the exercise are as follows:

In fact, the specification is often not just unclear but also contradictory. The examples from Wikipedia given in the Aufgabenblatt suggested one would have to parse IP addresses, but as Napoleon discovered (and remarked in his usual deferential prose) this isn't mandated by RFC 5322:

        {- thought I would parse IPs now? No. I strictly abide RFC5322
        parse_ip it = if "IPv6:" == splice it (incn 5 it) then parse_ip6 $ incn 5 it else parse_ip4 it
        parse_ip4 = pip 4
                where -- pip i it | trace ("pip: " ++ show i ++ " " ++ show it) False = undefined
                      pip 0 it = (it,it -: dec -: not.isend && ']' == it -: dec -: ast)
                      pip i it = if tup != it && tup -: not.isend && (ast tup == '.' || i == 1) then pip (i-1) (inc tup) else (tup,False)
                                where tup = snd $ sloop (3,it) (liftM3 (and3) ((>0).fst) (not.isend.snd) (is digit.ast.snd)) (\(n,i)-> (n-1,inc i))
        parse_ip6 = undefined
        -}

(And before we learn bad English from Napoleon: “abide” should have been “abide by.”)

But the MC is ready to admit that there are some cons too. The exercise was boring enough that for the first time, he didn't feel compelled to write a competitive Musterlösung, which is surely a sign of something.

The good news for Napoleon and his Glorieux Empire is that the exercise in Blatt 7 is more exciting again; and the MC and his junior associate the CoMC have something really, really, REALLY neat in store for Blätter 8 and 9.

What do the solutions look like? The winner's is 78 lines long, so it won't be reprinted in full here; here's an extract:

        anonymize :: String -> String
        anonymize [] = []
        anonymize cs = parseAno cs [] 0 0
        type State = Int
        delta:: State -> Char -> State
        {-local-part-}
        delta 400 c
         | (c `elem` seperatorChars || isSpace c) = 0
         | otherwise = 400
        delta n c
         | n `elem` [0,1,7] && (c `elem` seperatorChars || isSpace c) = 0
        delta 0 '(' = 100
        delta n c
         | n `elem` [0,1,3] && c `elem` defaultChars = 1
         | n `elem` [0,1,3] && c == '(' = 101
         | n `elem` [0,1,3] && c == '"' = 4
        delta 1 '@' = 7
        delta 1 '.' = 3
        ...
        delta n c
         |n `elem` [8,9] && (c `elem` seperatorChars || isSpace c )=10
        {-Comments-}
        delta n ')' | n>= 100 && n < 200 = n-100
        delta n c
         | n >= 100 && n < 200 && c `elem` ('.':defaultChars++specialChars++veryspecialChars) = n
        delta _ _ = -1
        seperatorChars =";:,!?"
        hexadecimalChars = ['0'..'9']++['a'..'f']
        defaultDomainChars = '-':['a'..'z']++['0'..'9']
        defaultChars = ['A'..'Z']++['a'..'z']++['0'..'9']++"!#$%&'*+-/=?^_`{|}~"
        specialChars = " (),:;<>@[]"
        veryspecialChars = "\"\\"

        parseAno:: String -> String ->State-> Integer -> String
        parseAno [] cs s n | s `elem`[8,9,240,316] && n <= 254= anonymize' cs []
                           | otherwise = reverse cs
        parseAno (c:cs) ds s n
         | newState == -1  || newState==10 && n > 254 = (reverse (c:ds)) ++ parseAno cs [] 400 0
         | newState == 0 = reverse (c:ds) ++ parseAno cs [] 0 0
         | newState == 10 = anonymize' (ds) [] ++ c:(parseAno cs [] 0 0)
         | newState == 7 && n > 64 = parseAno cs (c:ds) newState 254
         |otherwise = parseAno cs (c:ds) newState (n+1)
           where newState = delta s c

        anonymize':: String -> String -> String
        anonymize' [] cs = cs
        anonymize' (c:cs) ds | c `elem` "@." = anonymize' cs (c:ds)
        anonymize' (c:c2:cs) ds
         | c2 `elem` ".@" = anonymize' (c2:cs) (c:ds)
         | otherwise = anonymize' (c2:cs) ('_':ds)
        anonymize' (c:cs) ds = anonymize' cs (c:ds)

Its main characteristic is surely the use of an automaton, with states and a state transition function delta, to analyze the input. With all the hard-coded state numbers, the code is a bit difficult to maintain, but we hadn't yet seen the data syntax in the course. (With data, it is possible to give meaningful names to states, instead of numbers.)
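For illustration, here is a hypothetical fragment (invented here, not the competitor's code) of what named states could look like with data; the state set and character classes are heavily simplified:

```haskell
-- Hypothetical sketch: named states instead of magic numbers.
data State
  = Start | LocalPart | LocalDot | AfterAt | DomainPart | Rejected
  deriving (Eq, Show)

delta :: State -> Char -> State
delta Start      c | c `elem` localChars  = LocalPart
delta LocalPart  '.'                      = LocalDot
delta LocalPart  '@'                      = AfterAt
delta LocalPart  c | c `elem` localChars  = LocalPart
delta LocalDot   c | c `elem` localChars  = LocalPart
delta AfterAt    c | c `elem` domainChars = DomainPart
delta DomainPart c | c `elem` domainChars || c == '.' = DomainPart
delta _          _                        = Rejected

localChars, domainChars :: [Char]
localChars  = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ "!#$%&'*+-/=?^_`{|}~"
domainChars = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ "-"
```

Running the automaton is then just a fold over the input, e.g. foldl delta Start, and the pattern matches document the transitions far better than numeric state codes.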

The neatest solution, in terms of documentation, is probably the next one. It is a pity that it ended up only in fourth place, but the MC is ready to admit that his approach to ranking solutions is crude. What is especially commendable about this entry is the extremely thorough documentation, the careful formatting, and the honesty of the author (not to mention his readiness to censor f***-letter words). Let's reproduce the program in its full glory:

        -------------------------------------------------------------------------------
        --	Remarks:
        --	- For the competition, I concentrated on isEmail and isEmailSMTP,
        --	  so the anonymize function sucks.
        --	  (I do realize this is inefficient as f***, but I don't have the
        --	  time or patience to speed it up. The assignment asked for
        --	  completeness, not performance.)
        --	- It's hard to define an "invalid address" like in the exercise
        --	  sheet because the set of allowed characters is not as restricted,
        --	  so I used a (greedy, i. e. if a substring is part of two different
        --	  valid addresses, the first is favored) brute-force approach to find
        --	  addresses (Yep, I know it's slow).
        --	- I completely implemented the rules specified in RFC5322 and RFC5321.
        --	  isEmail only tests if the input is valid by RFC5322, isEmailSMTP (as
        --	  one might have guessed) also checks if the input is a valid SMTP
        --	  address (by RFC5321).
        --	- The solution was checked against the tests.xml in this directory.
        --	- BTW: "Wer findet ein Gegenbeispiel?": [1,1,1] [2,1]
        -------------------------------------------------------------------------------


        -- RFC5322 only
        isEmail :: String -> Bool
        isEmail xs = not . null $ readP_to_S addrSpec xs

        -- RFC5321, stricter
        isEmailSMTP :: String -> Bool
        isEmailSMTP xs = not (null $ readP_to_S smtpMailbox xs) &&
        	-- length limits by RFC5321, erratum 1690, RFC5321, 4.5.3.1
        	-- and RFC1035, 2.3.4.
        	length xs <= 254 &&
        	-- note that this is correct because it's always the rightmost '@'
        	-- that separates the local part from the host.
        	length local <= 64 && length host <= 255 &&
        	-- this is ugly
        	all (<=63) (map (length . takeWhile (/='.')) . takeWhile (not . null) $
        		iterate (dropWhile (=='.') . dropWhile (/='.')) host)
        	where (host, (_:local)) = break (=='@') $ reverse xs


        --
        anonymize :: String -> String
        anonymize "" = ""
        anonymize xs
        	| l == 0	= head xs : anonymize (tail xs)
        	| otherwise	= mask (take l xs) ++ anonymize (drop l xs)
        	where l = maximum $ 0:[n | n<-[1..length xs], isEmailSMTP $ take n xs]

        mask :: String -> String
        mask "" = ""
        mask xs
        	| null next	= head rest : mask (tail rest)
        	| null rest	= head next : const '_' `map` tail next
        	| otherwise	= head next : const '_' `map` tail next ++
        				[head rest] ++ mask (tail rest)
        	where (next, rest) = break (`elem` ".@") xs




        ----------------------------- GRAMMAR STARTS HERE -----------------------------


        oneOf xs = satisfy (`elem` xs)
        oneOfAny = oneOf . concat

        -- RFC 5234, Appendix B.1
        	-- added lower case digits as IPs are not case sensitive
        hexdig = oneOfAny [['0'..'9'],['A'..'F'],['a'..'f']]
        vchar = oneOf ['!'..'~']
        wsp = oneOf " \t"

        -- RFC5322, 3.2.1
        quotedPair = char '\\' >> (vchar <++ wsp)

        -- RFC5322, 3.2.2
        fws = optional (many wsp >> char '\r' >> char '\n') >> many1 wsp
        ctext = oneOfAny [['!'..'\''],['*'..'['],[']'..'~']]
        ccontent = ctext +++ quotedPair +++ comment
        comment = char '(' >> many (optional fws >> ccontent) >>
        	optional fws >> char ')'
        cfws = (many1 (optional fws >> comment) >> optional fws) +++ (fws >> return ())

        -- RFC5322, 3.2.3
        atext = oneOfAny [['A'..'Z'],['a'..'z'],['0'..'9'],"!#$%&'*+-/=?^_`{|}~"]
        dotAtomText = many1 atext >> many (char '.' >> many1 atext)
        dotAtom = optional cfws >> dotAtomText >> optional cfws

        -- RFC5322, 3.2.4
        qtext = oneOfAny [['!'],['#'..'['],[']'..'~']]
        qcontent = qtext <++ quotedPair
        quotedString = optional cfws >> char '"' >> many (optional fws >> qcontent) >>
        	optional fws >> char '"' >> optional cfws

        -- RFC5322, 3.4.1
        addrSpec = localPart >> char '@' >> domain >> eof
        localPart = (dotAtom >> return ()) <++ (quotedString >> return ())
        domain = dotAtom <++ domainLiteral
        domainLiteral = optional cfws >> char '[' >> many (optional fws >> dtext) >>
        	optional fws >> char ']' >> optional cfws
        dtext = oneOfAny [['!'..'Z'],['^'..'~']]


        -- RFC5321, 4.1.2
        smtpDomain = smtpSubDomain >> many (char '.' >> smtpSubDomain)
        smtpSubDomain = smtpLetDig >> optional smtpLdhStr
        smtpLetDig = oneOfAny [['A'..'Z'],['a'..'z'],['0'..'9']]
        smtpLdhStr =
        	many (oneOfAny [['A'..'Z'],['a'..'z'],['0'..'9'],['-']]) >> smtpLetDig
        smtpAddressLiteral = char '[' >> ((smtpIPv4AddressLiteral >> return ()) +++
        	(smtpIPv6AddressLiteral >> return ()) +++
        	(smtpGeneralAddressLiteral >> return ())) >> char ']'
        smtpMailbox =
        	smtpLocalPart >> char '@' >> ((smtpDomain >> return ()) +++
        	(smtpAddressLiteral >> return ())) >> eof
        smtpLocalPart =
        	(smtpDotString >> return ()) <++
        	(smtpQuotedString >> return ())
        smtpDotString = smtpAtom >> many (char '.' >> smtpAtom)
        smtpAtom = many1 atext
        smtpQuotedString = char '"' >> many smtpQcontentSMTP >> char '"'
        smtpQcontentSMTP = smtpQtextSMTP +++ smtpQuotedPairSMTP
        smtpQuotedPairSMTP = char '\\' >> oneOf [' '..'~']
        smtpQtextSMTP = oneOfAny [" !",['#'..'['],[']'..'~']]

        -- RFC5321, 4.1.3
        smtpIPv4AddressLiteral =
        	smtpSnum >> count 3 (char '.' >> smtpSnum)
        smtpIPv6AddressLiteral = string "IPv6:" >> smtpIPv6Addr
        smtpGeneralAddressLiteral =
        	smtpStandardizedTag >> char ':' >> many1 smtpDcontent
        	-- smtpStandardizedTag might be some subset of smtpLdhStr in future.
        smtpStandardizedTag = pfail
        smtpDcontent = oneOfAny [['!'..'Z'],['^'..'~']]
        	-- RFC: Snum should be in [0..255]
        	-- a less strict implementation might allow leading zeroes.
        smtpSnum =
        	(oneOf ['0'..'9']) +++
        	(oneOf ['1'..'9'] >> oneOf ['0'..'9']) +++
        	(char '1' >> oneOf ['0'..'9'] >> oneOf ['0'..'9']) +++
        	(char '2' >> oneOf ['0'..'4'] >> oneOf ['0'..'9']) +++
        	(char '2' >> char '5' >> oneOf ['0'..'5'])
        smtpIPv6Addr = (smtpIPv6Full >> return ()) +++
        	(smtpIPv6Comp >> return ()) +++
        	(smtpIPv6v4Full >> return ()) +++
        	(smtpIPv6v4Comp >> return ())
        smtpIPv6Hex = hexdig >> count 3 (optional hexdig)
        smtpIPv6Full = smtpIPv6Hex >> count 7 (char ':' >> smtpIPv6Hex)
        smtpIPv6Comp = choice [
        	dd >> o (s >> c 5 ods),			-- I still think this is
        	s >> dd >> o (s >> c 4 ods),		-- more readable than
        	s >> ds >> dd >> o (s >> c 3 ods),	-- writing it out.
        	s >> c 2 ds >> dd >> o (s >> c 2 ods),
        	s >> c 3 ds >> dd >> o (s >> ods),
        	s >> c 4 ds >> dd >> o s
        	]
        	where o = optional; d = char ':'; s = smtpIPv6Hex; dd = string "::";
        		c = count; ds = d >> s; ods = o ds
        smtpIPv6v4Full = smtpIPv6Hex >> count 5 (char ':' >> smtpIPv6Hex) >>
        	char ':' >> smtpIPv4AddressLiteral
        smtpIPv6v4Comp = choice [
        	dd >> v, dd >> s >> c 3 ods >> d >> v,
        	s >> dd >> v, s >> dd >> s >> c 2 ods >> d >> v,
        	s >> ds >> dd >> v, s >> ds >> dd >> s >> ods >> d >> v,
        	s >> c 2 ds >> dd >> v, s >> c 2 ds >> dd >> s >> d >> v,
        	s >> c 3 ds >> dd >> v
        	]
        	where o = optional; d = char ':'; s = smtpIPv6Hex; dd = string "::";
        		c = count; ds = d >> s; ods = o ds; v = smtpIPv4AddressLiteral





        {-
        -------------------------------------------------------------------------------
        -- This is the grammar the whole thing is based on.
        -- I put a notice where the implementation differs from the following.
        -------------------------------------------------------------------------------


        -- Terminal symbols, defined in RFC5234, Appendix B.1
           ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
           CR              =   %x0D               ; carriage return
           CRLF            =   CR LF              ; Internet standard newline
           DIGIT           =   %x30-39            ; 0-9
           DQUOTE          =   %x22               ; " (Double Quote)
           HEXDIG          =   DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
           HTAB            =   %x09               ; horizontal tab
           LF              =   %x0A               ; linefeed
           SP              =   %x20
           VCHAR           =   %x21-7E            ; visible (printing) characters
           WSP             =   SP / HTAB          ; white space


        -- Universal "Internet Address" as defined by RFC5322:

           -- 3.2.1
           quoted-pair     =   ("\" (VCHAR / WSP)) / obs-qp

           -- 3.2.2
           FWS             =   ([*WSP CRLF] 1*WSP) /  obs-FWS
                                                  ; Folding white space
           ctext           =   %d33-39 /          ; Printable US-ASCII
                               %d42-91 /          ;  characters not including
                               %d93-126 /         ;  "(", ")", or "\"
                               obs-ctext
           ccontent        =   ctext / quoted-pair / comment
           comment         =   "(" *([FWS] ccontent) [FWS] ")"
           CFWS            =   (1*([FWS] comment) [FWS]) / FWS

           -- 3.2.3
           atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                               "!" / "#" /        ;  characters not including
                               "$" / "%" /        ;  specials.  Used for atoms.
                               "&" / "'" /
                               "*" / "+" /
                               "-" / "/" /
                               "=" / "?" /
                               "^" / "_" /
                               "`" / "{" /
                               "|" / "}" /
                               "~"
           atom            =   [CFWS] 1*atext [CFWS]
           dot-atom-text   =   1*atext *("." 1*atext)
           dot-atom        =   [CFWS] dot-atom-text [CFWS]

           -- 3.2.4
           qtext           =   %d33 /             ; Printable US-ASCII
                               %d35-91 /          ;  characters not including
                               %d93-126 /         ;  "\" or the quote character
                               obs-qtext
           qcontent        =   qtext / quoted-pair

           quoted-string   =   [CFWS]
                               DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                               [CFWS]

           -- 3.4.1
           addr-spec       =   local-part "@" domain
           local-part      =   dot-atom / quoted-string / obs-local-part
           domain          =   dot-atom / domain-literal / obs-domain
           domain-literal  =   [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
           dtext           =   %d33-90 /          ; Printable US-ASCII
                               %d94-126 /         ;  characters not including
                               obs-dtext          ;  "[", "]", or "\"



        -- RFC5321-restrictions applying to SMTP

           -- 4.1.2
           Domain         = sub-domain *("." sub-domain)
           sub-domain     = Let-dig [Ldh-str]
           Let-dig        = ALPHA / DIGIT
           Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig
           address-literal  = "[" ( IPv4-address-literal /
                            IPv6-address-literal /
                            General-address-literal ) "]"
           Mailbox        = Local-part "@" ( Domain / address-literal )
           Local-part     = Dot-string / Quoted-string
                          ;MAY be case-sensitive
           Dot-string     = Atom *("."  Atom)
           Quoted-string  = DQUOTE *QcontentSMTP DQUOTE
           QcontentSMTP   = qtextSMTP / quoted-pairSMTP
           quoted-pairSMTP  = %d92 %d32-126
                          ; i.e., backslash followed by any ASCII
                          ; graphic (including itself) or SPace
           qtextSMTP      = %d32-33 / %d35-91 / %d93-126
                          ; i.e., within a quoted string, any
                          ; ASCII graphic or space is permitted
                           ; without backslash-quoting except
                          ; double-quote and the backslash itself.

           -- 4.1.3
           IPv4-address-literal  = Snum 3("."  Snum)
           IPv6-address-literal  = "IPv6:" IPv6-addr
           Snum           = 1*3DIGIT
                          ; representing a decimal integer
                          ; value in the range 0 through 255
           IPv6-addr      = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp
           IPv6-hex       = 1*4HEXDIG
           IPv6-full      = IPv6-hex 7(":" IPv6-hex)
           IPv6-comp      = [IPv6-hex *5(":" IPv6-hex)] "::"
                          [IPv6-hex *5(":" IPv6-hex)]
                          ; The "::" represents at least 2 16-bit groups of
                          ; zeros.  No more than 6 groups in addition to the
                          ; "::" may be present.
           IPv6v4-full    = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal
           IPv6v4-comp    = [IPv6-hex *3(":" IPv6-hex)] "::"
                          [IPv6-hex *3(":" IPv6-hex) ":"]
                          IPv4-address-literal
                          ; The "::" represents at least 2 16-bit groups of
                          ; zeros.  No more than 4 groups in addition to the
                          ; "::" and IPv4-address-literal may be present.

           -- 4.5.3.1.1
           The maximum total length of a user name or other local-part is 64
           octets.

        	-- 4.5.3.1.2
           The maximum total length of a domain name or number is 255 octets.



        -- RFC1035 limits the length of the atoms separated by dots in a domain name:
        	-- 2.3.4
           labels          63 octets or less


        -}

The solution is full of little gems, including an answer to the Meta-Master's rhetorical question from last week:

        --	- BTW: "Wer findet ein Gegenbeispiel?": [1,1,1] [2,1]

This week's lessons:

    1. Remember to test boundary cases (e.g., "").
    2. When dealing with imprecise specifications, look for sensible solutions and document your choices.

And do trust the MC when he claims that your future life will be full of imprecise specifications. Your future boss is unlikely to touch an RFC even with an 11-foot pole, but he might still want you to implement an email anonymizer “that just works.”

In closing, some of you might have wondered about the Swedish flavor of the examples. The original exercise came from Prof. Koen Claessen, a notorious Swedophile.

Update: A few competitors complained about the test

        aa@bb;cc@dd

They argued that such a construction should not occur in a normal text, or if it does it probably would stand for something else than an email address. What made matters worse is that solutions that otherwise had nice handling of quoting and other advanced features were penalized. The MC has thought through the matter thoroughly and decided to take out that specific test from the suite. He also changed aa@bb to aa@bb.bb and similarly for cc@dd to allow a stricter interpretation of the domain name.

The Top 20 der Woche and des Semesters have been updated accordingly. To avoid punishing competitors who successfully handled aa@bb;cc@dd, the point categories and rankings are kept as before; thus, Herr Bludau, who occupied place 2 when the table was first computed, is now effectively at place 3 but keeps his 19 points.

The general philosophy behind this change is explained in the following extracts from emails from the MC to the competitors who complained in the first place:

The goal of the competition is to have fun (and to learn something along the way).
A certain amount of luck is required in the competition, but especially for extremely time-consuming exercises the grading should be considerate.
Starting next week, you can expect clearer exercise texts from us. If the criteria are still not clear enough, please ask.

More lessons:

    1. If you ask, the MC is likely to provide some guidance.
    2. If you disagree with the ranking, complaining to the MC cannot harm.
    3. The MC and his crew got the message: No more boring exercises with unclear criteria!

Die Ergebnisse der siebten Woche

This week the CoMC (a.k.a. MC Junior) is substituting for the Great MC, who is visiting the oldest city in the Netherlands (OK, he does some work there, too). Not being as experienced, the CoMC was a little slower in performing the evaluation. Also, please excuse him for not being as eloquent as the MC. However, here are the results for the seventh week:

Top 20 der Woche
  Platz Wettbewerber(in)   Punkte
1. Andreas Bergmaier 20
Elias Marquart 20
Julius Michaelis 20
Lorenz Panny 20
Markus Westerlind 20
Robert Lerzer 20
Simon Wimmer 20
8. Timo Geissler 13
Thomas Breier 13
10. Nils Kunze 11
11. Felix Sonntag 10
Marcel Schumacher 10
Jan Böttcher 10
14. Albert Steckermeier 7
Manuel Bergler 7
16. Raphael Schaller 5
17. Andreas Huber 4
Patrick Werneck 4
19. Philipp Riedmann 2
Philipp Schneider 2

How are the results computed? There were only 279 homework submissions this time, 219 of which did not pass the QuickCheck tests. A further 7 submissions were missing the {-WETT-} tags, so the CoMC had to look more closely at only 53 submissions.

The very first performance tests (which were even printed on the exercise sheet!) already determined the top 20:

        unscrambleWords (take 10000 (permutations "abcdefghi")) (take 100000 (permutations "dichbagef")) --10 sec timeout
        unscrambleWords (take 20000 (permutations "abcdefghi")) (take 200000 (permutations "dichbagef")) --20 sec timeout

To avoid measuring the time needed to print a huge output list, the CoMC applied a hash function (Data.Hashable.hash) to the result in all of his tests.
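The same forcing effect can be had without an external hashing library. A minimal sketch (not part of any submission) reduces the whole result list to a single Int with a strict fold, so that laziness cannot silently skip parts of the output:

```haskell
import Data.List (foldl')

-- Collapse a large result list to one Int; evaluating this number
-- forces the spine of the list and the length of every word in it.
forceSum :: [String] -> Int
forceSum = foldl' (\acc w -> acc + length w) 0
```

Timing `forceSum (unscrambleWords dict text)` then measures the computation rather than the printing.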

The next test considered a one-word dictionary and a text which consisted of permutations of this one word:

        unscrambleWords ["abcdefghi"] (take 200000 (permutations "dichbagef")) --10 sec timeout

This ruled out two further submissions and was also quite a good soundness test, since the expected output list is quite easy to describe. All solutions at this point share the property that they transform the input dictionary into an efficient data structure (with cheap lookups modulo permutation).
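The idea common to all these data structures: two words are permutations of each other exactly when their sorted letter sequences agree, so the sorted string can serve as the lookup key. A minimal sketch (not any particular submission):

```haskell
import Data.List (sort)

-- Canonical key modulo permutation: the word's letters in sorted order.
permKey :: String -> String
permKey = sort

-- Two words unscramble to each other iff their keys coincide.
samePerm :: String -> String -> Bool
samePerm a b = permKey a == permKey b
```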

The following test aimed at the opposite kind of input: a huge dictionary (500000 words) and a short text.

        unscrambleWords (take 500000 (permutations "abcdefghij")) ["dichbajgef"] --10 sec timeout

The solution of the MC and two further submissions did not survive this test. The construction of the lookup-efficient data structure was just too slow compared with the top 16.

Suddenly, the CoMC had the idea of testing even more trivial corner cases: an empty dictionary or an empty text. Originally, he assumed that Haskell's laziness would handle the empty-text case without further ado. However, one submission managed to circumvent the laziness and spent quite some time inspecting the dictionary. The empty dictionary was also a problem for some solutions, for unforeseen reasons.

So far the tests considered only permutations of distinct strings. The next one covers two further dimensions: strings of different lengths and repetitions of letters within strings. It kicked out three further contestants.

        unscrambleWords (Data.List.map (\n -> replicate n 'a') [1..1000])
                (concatMap (\n -> replicate (2*n) 'a' : Data.List.map (\c ->
                        Data.List.intersperse c (replicate n 'a')) ['a' .. 'd'] ) [1..500])

We have arrived at the top 10! It's high time for the reality check.

Our dictionary of choice was generated by SCOWL 7.1 and contains 652749 English words. To be precise, only almost English, since upper-case letters were converted to lower case and all other symbols outside a-z were dropped. However, the same transformation was of course also applied to the texts under test.

The first text under consideration was the first book of the good old King James Bible—Genesis. It consists of 38271 words. The second text was again biblical (at least some people might claim so): the GPL, with only 5585 words. Interestingly, at first the run on Genesis was faster than the one on the GPL, because the CoMC had underestimated the power of lazy evaluation (the output of his tests used to be the first unscrambled word—for Genesis this is word number 1079, while the GPL contains no such word, so all 5585 words are processed). After forcing the tests to process the complete texts in both cases, three further submissions were left behind.

Here, the CoMC decided to stop. All seven remaining solutions were similarly competitive on realistic examples (processing Genesis in around 10 seconds) and on the previously tested corner cases. Furthermore, the CoMC is not a friend of punishing badly formatted code too much, although line widths of over 200 characters are not tolerable even in comments. The contestant to whom the previous remark applies—let's call him The Teen Queen—used some unsafePerformIO magic, which did not improve the trustworthiness of his code. Furthermore, the Teen Queen claims:

        -- what did I actually learn from this solution? using some external magic is way faster than normal haskell. Sad but true

The CoMC strongly disagrees. The solution of the Teen Queen performed below average (of the top 7) in almost all tests. In contrast, the following solution (also from the top 7), which uses only standard data structures such as Data.Map and Data.Set, performed above average in almost all tests. Furthermore, this solution is nicely documented and therefore quite understandable.

        import qualified Data.Map as Map
        import qualified Data.Set as Set

        -- If a word is wrong, we need to check whether some permutation of it is a word in the vocabs list.
        -- However, since we only need to check if it is possible to permute the word into a word in that list,
        -- we don't need to check every permutation; instead we can check if the word with its letters sorted is in a map
        -- where the keys are the words (with their letters sorted) in the vocabs list

        unscrambleWords :: [String] -> [String] -> [String]
        unscrambleWords vocabs = map fixWord
          where
            vocabulary = Map.fromList $ zip (map sort vocabs) vocabs
            validWordKeys = Set.fromList vocabs
            fixWord word = if Set.member word validWordKeys then --Check if the word is already correct
                word
              else --If it is not in the vocabulary we check if a permutation of the word is in the vocabulary
                case Map.lookup (sort word) vocabulary of
                  Just x -> x
                  Nothing -> word --Didn't find a permutation so just leave the word unchanged

Similar things can be achieved by using the more sophisticated data structure of tries.

        import qualified Data.ListTrie.Patricia.Map as TM
        import qualified Data.ListTrie.Patricia.Set as TS

        -- also have a look at ./Exercise_7_trie.hs, where I've implemented the Trie structures myself
        unscrambleWords :: [String] -> [String] -> [String]
        unscrambleWords vocabs = map correct
         where
         correct x = maybe x (\set -> if TS.member x set then x else fromJust $ TS.findMin set) $ TM.lookup (sort x) trie
         trie :: TM.TrieMap M.Map Char (TS.TrieSet M.Map Char)
         trie = TM.fromListWith' (flip TS.union) $ map (\v->(sort v, TS.singleton v)) vocabs

In summary, all worthwhile solutions follow the same recipe: index the dictionary by a key that is invariant under permutation (the sorted letters), whether in a standard map or in a trie.

The lessons of this week:

    1. “Compiling” inefficient data structures into efficient ones once is often a good investment.
    2. Knowing fancy data structures helps.
    3. Not knowing fancy data structures does no harm if the language's standard data structures are implemented efficiently.
    4. Don't forget trivial corner cases.
    5. When measuring execution time, be aware of what you are measuring (especially in languages with lazy evaluation).

Die Ergebnisse der (achten und) neunten Woche(n)

Update: In their never-ending grace, the (Co)MCs decided to include one more solution, which was submitted 4 minutes past the deadline (penalizing it with a one-point malus). Unfortunately, the new solution did not pass the soundness test.

The two-week challenge was optional, and this showed in the number of submissions: 15! (Yes, that's 15 (fifteen) followed by an exclamation mark, not the factorial of 15!) This implies that any student who had submitted a simple, working solution, such as the following program, would have gotten at least 10 points:

        import qualified Data.List as List
        import Data.Maybe (mapMaybe)

        setOneInClause :: Int -> [Int] -> Maybe [Int]
        setOneInClause l ls =
          if elem l ls then Nothing else Just (List.filter (/= - l) ls)

        setOneInProblem :: Int -> [[Int]] -> [[Int]]
        setOneInProblem = mapMaybe . setOneInClause

        satisfyCnf :: [[Int]] -> Maybe [Int]
        satisfyCnf [] = Just []
        satisfyCnf ([] : _) = Nothing
        satisfyCnf ((l : ls) : cs) =
          case satisfyCnf (setOneInProblem l cs) of
            Just sol -> Just (l : sol)
            Nothing ->
              case satisfyCnf (setOneInProblem (- l) (ls : cs)) of
                Just sol -> Just (- l : sol)
                Nothing -> Nothing

The (Co)MCs suspect that the very long Aufgabentext made the problem look more difficult than it really was.

The SAT (Boolean satisfiability) problem has been studied since (at least) the 60s. The first good algorithm that was developed is called DPLL. It is similar to the above but optimizes unit clauses (i.e., clauses of length 1) and pure literals.
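For readers who want the flavor of it, here is a hedged DPLL sketch in roughly the style of the naive solver above. It adds unit propagation and the pure literal rule; it is an illustration, not the (Co)MCs' actual solution:

```haskell
import Data.List (nub)

-- Literals are non-zero Ints; a negative Int denotes a negated variable.
type Clause = [Int]

-- Assert literal l: drop clauses satisfied by l, strip -l from the rest.
assign :: Int -> [Clause] -> [Clause]
assign l = map (filter (/= -l)) . filter (notElem l)

dpll :: [Clause] -> Maybe [Int]
dpll [] = Just []
dpll cs
  | any null cs = Nothing                              -- empty clause: conflict
  | (u : _) <- units = fmap (u :) (dpll (assign u cs)) -- unit propagation
  | (p : _) <- pures = fmap (p :) (dpll (assign p cs)) -- pure literal rule
  | otherwise =                                        -- branch on some literal
      let l = head (head cs)
      in case dpll (assign l cs) of
           Just sol -> Just (l : sol)
           Nothing  -> fmap ((-l) :) (dpll (assign (-l) cs))
  where
    units = [l | [l] <- cs]
    lits  = nub (concat cs)
    pures = [l | l <- lits, notElem (-l) lits]
```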

The first question is: Out of the 15 solutions, how many are correct? It turns out the (Co)MCs were able to find bugs in 9 of the 15. One solution was simply undefined; the counterexamples for the eight others are given below for the record:

Problem: [[33,33,-2],[1,-130,4],[-2,9],[2,9],[2,1,9],[-9,-33,-1],[4,-9,-2],[-500000000,-33,130]]
Actual: Just [-1,-2,4,-9,-33,-1,2,-9,2,-2,9,1,-33,4,-130,-500000000] (NB: -1 and 1, -2 and 2, -9 and 9!)
Expected: Just [4,-500000000,-2,9,-33]

Problem: [[-2,-500000000],[2,500000000],[-33],[33]]
Actual: Exception: out1/g_______@m____.d_/exercise_9/Exercise_9.hs:(176,9)-(177,78): Non-exhaustive patterns in function findFirstOf
Expected: Nothing

Problem: [[-4,-33,500000000],[-4,-9,4],[-500000000,-500000000,-2],[-4,-130,4],[-9,-500000000,-4],[4],[9,33,-500000000],[4,1,-33],[130,2,130],[-33,2,-2]]
Actual: Nothing
Expected: Just [1,4,-33,-9,-500000000,130]

Problem: [[2],[2,9],[-2,-500000000],[-33]]
Actual: Nothing
Expected: Just [9,-500000000,-33,2]

Problem: [[2,-33,4],[-2,-130,33],[2,1,9],[130,1,1],[-9,9,-2],[2,33,-1],[4,33,2]]
Actual: Just [-1,4,9,-2,33,1] (NB: -1 and 1!)
Expected: Just [4,9,-2,33,1]

Problem: [[-130,9]]
Actual: Just [-1,-2,-3,-4,-5,-6,-7,-8,-9]
Expected: Just [9]

Problem: [[9,4,-9],[130,-1,1],[9],[-2],[1,500000000,-500000000],[-33,4]]
Actual: Nothing
Expected: Just [130,4,-33,-2,9,1]

Problem: [[500000000,-9,9]]
Actual: Nothing
Expected: Just [500000000]
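Expected answers like these can be validated mechanically. A small hedged checker (the helper names are invented here) verifies that an assignment contains no complementary literals and hits every clause:

```haskell
-- An assignment is a list of literals, as in the expected answers above;
-- it must not contain both l and -l.
consistent :: [Int] -> Bool
consistent asn = all (\l -> notElem (-l) asn) asn

-- A CNF is satisfied if every clause contains at least one assigned literal.
satisfies :: [Int] -> [[Int]] -> Bool
satisfies asn = all (any (`elem` asn))

checkSolution :: [Int] -> [[Int]] -> Bool
checkSolution asn cs = consistent asn && satisfies asn cs
```

Running it on the first Actual above immediately flags the complementary pairs the NB points out.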

Although the Aufgabenblatt stated that incorrect solutions would receive 0 points, the low number of submissions for this week has prompted a reconsideration of this issue, with the following resolution: The undefined solution gets 1 point; the other flawed solutions get 5.

This leaves us with six contestants. One of the six had the wrong signature, with Integer instead of Int. The CoMC corrected this solution, hoping that the Christkind would take notice; the MC subtracted 1 point, hoping that somebody else would take notice.

In addition to the six contestants, we added the naive solution from above, the (Co)MCs' solution (which implemented optimizations 1, 2, and 4), as well as two (2!!!!!—here the first three exclamation marks really denote the factorial, whereas the last two express exclamation) solutions contributed by Junior Tutor, and recipient of the Bachelorpreis 2012, Manuel Eberl.

Round 1: Pure Positive Literals

The first round attempts to detect whether solutions can cope with clauses that contain lots of variables occurring only positively in the problem. The generated problems contained clauses of the form

        [+x1, ..., +xm, ±y, +z1, ..., +zn]

where the x's and z's occur only positively in the problem. With the pure-literal optimization, implementations should rapidly detect this and simplify the clauses. The results are as follows (9999 indicates timeout):

        2.87 (Co)MCs
        3.17 Eberl 2.0
        3.69 Eberl 1.0
        8.31 Albert Steckermeier
        13.6 Elias Marquart
        14.1 Lorenz Panny
        25.9 Simon Wimmer
        43.5 Raphael Schaller
        -----------------------------------------
        9999 Jan Böttcher
        9999 Naive
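The rule being exercised here, pure literal elimination, fits in a few lines (a hedged sketch, not any contestant's code):

```haskell
import Data.List (nub)

-- A literal is pure if its negation occurs nowhere in the problem;
-- setting it to true deletes every clause that contains it.
pureLiterals :: [[Int]] -> [Int]
pureLiterals cs = [l | l <- lits, notElem (-l) lits]
  where lits = nub (concat cs)

eliminatePure :: [[Int]] -> [[Int]]
eliminatePure cs = filter (not . any (`elem` ps)) cs
  where ps = pureLiterals cs
```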

Round 2: Prioritization of Short Clauses

The goal of a SAT solver is to derive the false clause, []. The shorter a clause, the closer it is syntactically to the false clause. Hence, it makes sense to prioritize shorter clauses. As a special case, unit clauses are useful, because they can be used to globally simplify the problem.

To exercise the clause selection heuristic, variations of the following problem were used:

        [1]
        [-1, 2]
        [-1, -2, 3]
          .
          :
        [-1, -2, ..., -99, 100]
        [-1, -2, ..., -99, -100]

The first clause forces 1 to be true. Then the second clause forces 2 to be true. And so on. The 100th clause forces 100 to be true, but the 101st clause forces it to be false. Contradiction.

The problem was obfuscated in two ways: the clauses were randomly shuffled instead of being pre-sorted by length, and, to avoid falling into the Horn fragment, one layer of negation was added to the even-numbered variables, so that the last clause above actually becomes

        [-1, +2, -3, +4, ..., +98, -99, +100]
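A generator for the obfuscated problem can be sketched as follows (hedged; the function names are invented and the shuffling step is omitted). Substituting x := not x for the even variables preserves satisfiability while leaving the Horn fragment:

```haskell
-- Flip the polarity of every literal over an even-numbered variable.
flipEven :: Int -> Int
flipEven l = if even (abs l) then -l else l

-- A chain of clauses forcing 1, 2, ..., n true, plus one clause forcing
-- n false: unsatisfiable by construction.
chainProblem :: Int -> [[Int]]
chainProblem n =
  [ map flipEven ([-v | v <- [1 .. i - 1]] ++ [i]) | i <- [1 .. n] ]
  ++ [ map flipEven [-v | v <- [1 .. n]] ]
```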

The results:

        1.45 Eberl 2.0
        1.52 Eberl 1.0
        4.72 (Co)MCs
        11.4 Albert Steckermeier
        25.7 Simon Wimmer
        36.0 Elias Marquart
        50.0 Raphael Schaller
        289. Lorenz Panny
        -----------------------------------------
        9999 Naive

Round 3: Prioritization of Frequent Variables

Variables that occur frequently in a problem should be handled first, because by assigning them a truth value earlier we get better opportunities for simplifications. The following problem attempted to test the prioritization:

        [[a, b, c] | a <- as, b <- bs]
        ++ [[- a, - b, - c] | a <- as, b <- bs]
        ++ [[c, c, c, c], [- c, - c, - c, - c]]
        where as = [1 .. 500 :: Int]
              bs = [2001 .. 2050]
              c = 1500

The variable 1500 (i.e., c) occurs in every single clause, and setting it to either true or false yields a contradiction (because of the last two clauses).
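The heuristic itself amounts to branching on the variable with the most occurrences, counting both polarities; a hedged one-liner:

```haskell
import Data.List (group, maximumBy, sort)
import Data.Ord (comparing)

-- The variable occurring most often in the problem (here: 1500, i.e., c).
mostFrequentVar :: [[Int]] -> Int
mostFrequentVar =
  head . maximumBy (comparing length) . group . sort . map abs . concat
```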

The test is not entirely foolproof: an implementation that simplified [c, c, c, c] and [- c, - c, - c, - c] to [c] and [- c], then applied optimization 2, could win big here. But it is practically impossible to design a test for one specific optimization that is entirely foolproof, because of potential interactions with the other optimizations.
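The loophole mentioned above is itself a one-liner: deduplicating the literals within each clause would turn the two decisive clauses into units (sketch, not a contestant's code):

```haskell
import Data.List (nub)

-- Remove duplicate literals inside each clause; [c, c, c, c] becomes [c].
dedupClauses :: [[Int]] -> [[Int]]
dedupClauses = map nub
```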

        1.04 Eberl 1.0
        1.07 Eberl 2.0
        1.33 Simon Wimmer
        1.60 Elias Marquart
        4.58 Lorenz Panny
        -----------------------------------------
        36.9 Raphael Schaller
        75.4 Albert Steckermeier
        89.6 (Co)MCs
        9999 Naive

Why didn't the (Co)MC implement this optimization? This was a classic case of miscommunication. The Senior MC had implemented 1 and 2 before leaving for Nijmegen; the Junior MC thought that the Senior MC had implemented 3 as well and went on doing 4. Then the Senior MC came back and trusted the Junior MC's solution.

Round 4: Horn Clauses

Are Horn clauses optimized? The test consists of 3000 clauses of the form

        [- a, - b, c]

and one clause of the form

        [- a, - b, - c]

It was generated so that an assignment of all variables to false would give a solution.

        0.16 Simon Wimmer
        0.20 (Co)MCs
        3.72 Lorenz Panny
        -----------------------------------------
        9.00 Eberl 2.0
        10.4 Eberl 1.0
        39.3 Elias Marquart
        9999 Naive

Interestingly, Herr Eberl had an optimization for what we will call strict Horn clauses—problems where every clause has exactly one positive literal. For such clauses, we can obtain a solution by setting all variables to true. But there was exactly one clause—[- a, - b, - c]—that made the example non-strict, which was enough to tip the balance against both of his solutions.
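Both fragments can be recognized by a per-clause count of positive literals; a hedged sketch (the document does not show Herr Eberl's actual check):

```haskell
-- Horn: at most one positive literal per clause. For problems like round 4,
-- where every clause also has a negative literal, all-false is a model.
isHorn :: [[Int]] -> Bool
isHorn = all (\c -> length (filter (> 0) c) <= 1)

-- Strict Horn: exactly one positive literal per clause;
-- setting all variables to true is then a model.
isStrictHorn :: [[Int]] -> Bool
isStrictHorn = all (\c -> length (filter (> 0) c) == 1)
```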

Round 5: 2CNF

The (Co)MCs then generated random 2CNF problems involving 100 variables and having 1000 clauses. There are only losers this time:

        -----------------------------------------
        13.1 Eberl 1.0
        14.4 Eberl 2.0
        37.0 Simon Wimmer
        46.1 (Co)MCs
        9999 Naive
        9999 Lorenz Panny

Strangely enough, Herr Panny's submission appears to have some code to handle 2CNF specifically. The code is full of good will, but apparently that didn't help here.

Let's try to see if we can find a winner among the two remaining competitors.

Round 6: Pure Negative Literals

Do implementations also optimize literals that occur only negatively? This is the dual of round 1. It turns out that both Herr Panny's and Herr Wimmer's solutions feature this optimization. Next round.

Round 7: Dual Horn Clauses

The Horn clause optimization also admits a dual, but neither solution had implemented it.

Fazit

At this point, the (Co)MCs threw in the towel and unanimously decided to crown two winners.

Top 20 der Woche(n)
  Platz Wettbewerber(in)   Punkte
1. Simon Wimmer 40
Lorenz Panny 40
3. Elias Marquart 35
4. Raphael Schaller 34
Albert Steckermeier 34
6. Jan Böttcher 30
7. Andreas Huber 5
Felix Sonntag 5
Florian Haffke 5
Julius Michaelis 5
Nils Kunze 5
Thomas Breier 5
Timm Beckmann 5
14. Markus Westerlind 4
15. Andreas Bergmaier 1

Notice in particular that because of nine disqualifications due to bugs, any correct solution would have harvested 30 points!

This week was full of lessons for all of us:

    1. As the French say, “L'important, c'est de participer.” What is important is not to win; it is to participate.
    2. Do not fear complicated Aufgabentexts, for they can conceal a simple, elegant problem.
    3. Do not fear difficult problems, for they always have a simple, elegant, efficient, and wrong solution.
    4. Test, test, test! In particular, when optimizing code, it pays off to keep an unoptimized version around and use it as a test oracle (i.e., use QuickCheck to reveal cases where the two implementations disagree).
    5. Some optimizations (e.g., 2CNF) look good on paper, but when implementing them we should check that they actually improve performance.
    6. For juniors: Don't trust seniors to do the job.
    7. For seniors: Don't trust juniors to do the job.

Die Ergebnisse der zehnten Woche

Better late than never! Here are the results for week 10:

Top 13 der Woche
  Platz Wettbewerber(in)   Wertung   Einzeltests   Punkte
1. Lorenz Panny 4.9921 [1.0000, 1.0000, 1.0002, 0.9948, 0.9971] 20
2. Florian Haffke 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Julian Müller 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Matthias Brugger 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Simon Wimmer 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Thomas Breier 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Timm Beckmann 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19
Albert Steckermeier 5.0000 [1.0000, 1.0000, 1.0000, 1.0000, 1.0000] 19−5=14
9. Markus Westerlind 5.0358 [1.0000, 1.0050, 1.0209, 1.0083, 1.0016] 12
10. Florian Hisch 5.0402 [1.0000, 1.0000, 1.0399, 1.0000, 1.0003] 11
11. Julius Michaelis 7.0000 [1.0000, 1.0000, 1.0000, 1.0000, XXXXX] 10−1=9
12. Nils Kunze 11.0000 [1.0000, 1.0000, XXXXX, XXXXX, XXXXX] 9
13. Andreas Bergmaier 13.0787 [2.3361, 1.7426, XXXXX, XXXXX, XXXXX] 8−0.5=7.5

There were 145 submissions containing the {-WETT-} tags. Unfortunately, only 13 brave contestants really implemented something. Kudos to them for finding some time for it over Christmas! And boo to the others for forcing us to throw all these precious Wettbewerb points into the trash can.

The CoMC is very happy with the quality of the submissions: (almost) no soundness bugs were detected. The submission that came closest to a soundness bug implements suffix codes instead of prefix codes and was penalized by 5 points. Another solution did not compile without an additional import declaration: 1 point penalty. Funnily enough, the definition requiring the missing import was completely irrelevant for Morsi. Finally, a third contestant returned literal strings of the form "[A,A,i]" instead of "AAi". Not necessarily wrong, but very ugly, and it complicated our test suite: 0.5 points penalty. A few other submissions (including the great MC's!) do not handle the degenerate case of a one-letter alphabet correctly: they produce the code [('a',"")]. The CoMC was very gracious there and did not punish the competitors for this, since even the MC fell prey to it.
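The prefix property matters because only a prefix-free code can be decoded greedily from left to right; whether a given code has it can be checked with a small sketch (assuming distinct codewords):

```haskell
import Data.List (isPrefixOf)

-- A code is prefix-free if no codeword is a proper prefix of another.
prefixFree :: [String] -> Bool
prefixFree ws = and [not (a `isPrefixOf` b) | a <- ws, b <- ws, a /= b]
```

A suffix code generally fails this test read left to right, though it passes on the reversed codewords.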

The numbers in the Einzeltests column are the ratios of the submission's code cost to that of the MC's solution; the total score (Wertung) is the sum of these ratios. Note that our winner got very close to the jackpot of 30 points but unfortunately failed, in the third test, to beat the rest. Interestingly, the third test is, from a practical point of view, the most important one, since it generated the Morsi code for the English alphabet with real letter frequencies. XXXXX means that the test did not terminate within 10 minutes (most solutions passed the tests within a few milliseconds).

So what do the most successful solutions of the week look like? The MC and the students on the plateau (Platz 2) basically implemented Huffman's algorithm (as seen in the lecture), identifying A with 0 and i with 1. The code of the MC is as follows:

        data HuffTree a =
            Leaf Integer a |
            Inner Integer (HuffTree a) (HuffTree a)
          deriving (Show)

        weight :: HuffTree a -> Integer
        weight (Leaf w _) = w
        weight (Inner w _ _) = w

        insortTree :: HuffTree a -> [HuffTree a] -> [HuffTree a]
        insortTree t [] = [t]
        insortTree t (t' : ts) =
          if weight t <= weight t' then t : t' : ts else t' : insortTree t ts

        mergeTrees :: HuffTree a -> HuffTree a -> HuffTree a
        mergeTrees t t' = Inner (weight t + weight t') t t'

        huffman :: [HuffTree a] -> HuffTree a
        huffman [t] = t
        huffman (t : t' : ts) = huffman (insortTree (mergeTrees t t') ts)

        huffmanTreeFromFreqs :: [(a, Integer)] -> HuffTree a
        huffmanTreeFromFreqs =
          huffman . map (uncurry (flip Leaf)) . Data.List.sortBy (compare `Data.Function.on` snd)

        huffmanCodeFromTree :: Char -> Char -> HuffTree a -> [(a, String)]
        huffmanCodeFromTree l r = code []
          where code pref (Leaf _ a) = [(a, pref)]
                code pref (Inner _ t t') = code (pref ++ [l]) t ++ code (pref ++ [r]) t'

        morsiCodeFromTree = huffmanCodeFromTree 'A' 'i'

        morsiCodeFromFreqs :: [(a, Integer)] -> [(a, String)]
        morsiCodeFromFreqs = morsiCodeFromTree . huffmanTreeFromFreqs

The winner does a very similar thing, except that he uses an (empirically) smarter weighting function for the subtrees:

        import Control.Applicative ((<*>))
        import qualified Data.List as L
        import Data.Ord (comparing)


        data Tree a = Leaf Integer a | Node (Tree a) (Tree a)
	        deriving (Eq, Show)
        type Forest a = [Tree a]
        type Freqs a = [(a, Integer)]
        type Code a = [(a, String)]


        forestFromFreqs :: Freqs a -> Forest a
        forestFromFreqs = map . uncurry $ flip Leaf


        treeCost :: Tree a -> Integer
        treeCost = treeCost' 0
	        where	treeCost' 0 (Leaf v _) = v
		        treeCost' d (Leaf v _) = d * v
		        treeCost' d (Node l r) = treeCost' (d+1) l + treeCost' (d+2) r

        minTree :: Eq a => Forest a -> Tree a
        minTree = L.minimumBy $ comparing treeCost

        takeMinTree :: Eq a => Forest a -> (Tree a, Forest a)
        takeMinTree xs = (t, L.delete t xs)
	        where t = minTree xs

        mergeMinTrees :: Eq a => Forest a -> Forest a
        mergeMinTrees f = minTree [Node m m', Node m' m] : f''
	        where	(m, f') = takeMinTree f
		        (m', f'') = takeMinTree f'

        treeFromForest :: Eq a => Forest a -> Tree a
        treeFromForest f = head $ times (pred $ length f) mergeMinTrees f
	        where	times 0 = const id
		        times n = (.) <*> times (pred n)


        codeFromTree :: Tree a -> Code a
        codeFromTree (Leaf _ x) = [(x, "i")]
        codeFromTree t = codeFromTree' "" t
	        where	codeFromTree' s (Leaf _ x) = [(x, reverse s)]
		        codeFromTree' s (Node l r) =
			        codeFromTree' ('i':s) l ++ codeFromTree' ('A':s) r


        morsiCodeFromFreqs :: Eq a => Freqs a -> Code a
        morsiCodeFromFreqs [] = []
        morsiCodeFromFreqs xs = codeFromTree . treeFromForest $ forestFromFreqs xs

Another promising but very inefficient tactic was to enumerate all codes and pick the best one. This is where the few XXXXX come into play. [The MC says: An efficient way to rein in the power of enumeration would be to do it whenever the forest is small enough. He expected the two competitors who contacted him regarding enumeration to implement something along those lines.]
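The enumeration tactic can be sketched as follows. This is a self-contained toy, not any competitor's actual code; the Tree type and cost function are simplified stand-ins for the ones shown above:

```haskell
import Data.List (minimumBy, permutations)
import Data.Ord (comparing)

-- A minimal code tree, mirroring the winner's Tree type.
data Tree a = Leaf Integer a | Node (Tree a) (Tree a)

-- Cost of a code: sum over all leaves of frequency * depth.
cost :: Tree a -> Integer
cost = go 0
  where go d (Leaf w _) = d * w
        go d (Node l r) = go (d + 1) l + go (d + 1) r

-- All binary trees whose frontier is exactly the given list of leaves.
trees :: [Tree a] -> [Tree a]
trees [t] = [t]
trees ts  = [Node l r | (ls, rs) <- splits ts, l <- trees ls, r <- trees rs]
  where splits xs = [splitAt n xs | n <- [1 .. length xs - 1]]

-- Enumerate every leaf ordering and every shape; keep the cheapest.
bestTree :: [(a, Integer)] -> Tree a
bestTree = minimumBy (comparing cost)
         . concatMap trees
         . permutations
         . map (\(x, w) -> Leaf w x)
```

The search space blows up catastrophically (factorially many leaf orderings times Catalan-many shapes), which is exactly why this tactic needs to be reined in as described above.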

Another pearl is the following part of a solution:

        -- I've written this in a few minutes but spent quite a few hours to find out how it works…
        -- [Edit:] Okay, now I'm being mean...
        huffapp :: Eq a => Huffcode a -> [a] -> [Bool]
        huffapp = liftM (liftM concat) (liftM liftM (flip $ flip liftM lookup $ liftM fromJust))

The competitor seems to like lifting (let's call him Vladimir): the extract is quite representative of his whole development, which contains forty occurrences of liftM or liftM2 spread over only 140 lines of code.
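For the lift-averse: assuming Huffcode a is an association list from symbols to bit strings (an assumption; its definition is not shown here), the tower of liftMs appears to collapse into a single concatMap:

```haskell
import Data.Maybe (fromJust)

-- Assumed shape of the code table (not shown in the extract above).
type Huffcode a = [(a, [Bool])]

-- Pointful rendering of Vladimir's huffapp: look each symbol up in the
-- table and concatenate the resulting bit strings.
huffapp :: Eq a => Huffcode a -> [a] -> [Bool]
huffapp code = concatMap (fromJust . flip lookup code)
```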

Enough! The lessons are:

    1. Don't enumerate huge search spaces.
    2. If you pay attention in the lecture, you might benefit in the homework/Wettbewerb.
    3. Don't write huge libraries that you do not use—they might contain unnecessary bugs.
    4. Don't trust seniors to do the job correctly.

Die Ergebnisse der elften Woche

The MC is back from an exhausting, taxpayer-financed sojourn at a castle, where he's been chillaxing all week! A suitable location for a self-important character, one might say.

The eleventh week's competition was a piece of cake to evaluate: Only four solutions passed all the tests. Furthermore, six solutions passed at least one test; although technically disqualified, these were given one point per passed test to help populate the so-called top 20.

Top 20 (naja) der Woche
  Platz Wettbewerber(in) Pseudo-Tokens  Punkte
1. Lorenz Panny 173 20
2. Simon Wimmer 243 19
3. Albert Steckermeier 265 18
4. Elias Marquart 280 17
5. Andreas Bergmaier 9
6. Andreas Huber 9
7. Julius Michaelis 8
8. Markus Westerlind 6
9. Dominik Durner 4
10. Thomas Breier 1

Interestingly enough, Herr Panny's solution, at 173 (pseudo)tokens, has the exact same length as the MC's! Looking more closely, it would appear that the MC unconsciously cheated by reusing the nat parser from a previous exercise (something another competitor—let's call him Charlemagne—also did); hence Herr Panny is the one true champion. Chapeau! Here's his solution in full:

        num = list1 (one isDigit) >>> read

        ign p c = p *** item c >>> fst

        repeatCount = num `ign` ',' *** (num >>> return |||
        		success Nothing) >>> flip Repeat |||
        	num >>> (curry (flip Repeat) . id <*> return)

        buildRepeatOperator c r = item c >>> const (`Repeat` r)

        repeatOperator = enclose '{' '}' repeatCount |||
        	(|||) `foldl1` zipWith buildRepeatOperator
        		"*?+" (read "[(0,Nothing),(0,Just 1),(1,Nothing)]")

        repeatOperators x = (list1 repeatOperator >>> foldr1 (^>>) |||
        	success id) >>> ($ fst x) $ snd x

        repeatedAtom = atom >=> repeatOperators


        singlePair x = (x,x)

        escaped xs = item '\\' *** one (const True) >>> snd ||| one (`notElem` xs)

        subrange = escaped "\\]" `ign` '-' *** escaped "\\" |||
        		escaped "\\]" >>> singlePair

        atom = item '.' >>> const Any |||
        	escaped ".\\()[]{}*?+|" >>> One . return . singlePair |||
        	enclose '[' ']'  (list subrange) >>> One |||
        	enclose '(' ')' regEx


        sepBy s p = p *** (list $ s *** p >>> snd) >>> uncurry (:)
        regEx = item '|' `sepBy` (list1 repeatedAtom >>> foldl1 Concat) >>> foldl1 Alt

For comparison, the MC's solution is reproduced below. Note that it uses only functions seen in the course, with perhaps a few minor exceptions, e.g., const True rather than \_ -> True. There are two reasons for this restraint:

    1. The MC didn't want to have to code two reference solutions (one readable, one optimized).
    2. He doesn't know Haskell very well and is learning it in the same way as Haskell evaluates its expressions.

Here it is, your moment of zen:

        anyOne :: Parser a a
        anyOne = one (const True)

        prefix :: Eq a => a -> Parser a b -> Parser a b
        prefix a p = item a *** p >>> snd

        escape :: Parser Char Char
        escape = prefix '\\' anyOne

        subRange :: Parser Char (Char, Char)
        subRange =
          (escape ||| one (`notElem` "\\]"))
            *** optional (prefix '-' (escape ||| one (/= '\\')))
          >>> \(from, to) -> (from, fromMaybe from to)

        range :: Parser Char [(Char, Char)]
        range = enclose '[' ']' (list subRange)

        atom :: Parser Char RegEx
        atom =
          item '.' >>> const Any
          ||| (escape ||| one (`notElem` ".\\()[]{}*?+|")) >>> (\c -> One [(c, c)])
          ||| range >>> One
          ||| enclose '(' ')' regEx

        numRep :: Parser Char (Int, Maybe Int)
        numRep =
          nat *** optional (prefix ',' (optional nat))
          >>> \(from, to) -> (from, fromMaybe (Just from) to)

        rep :: Parser Char (Int, Maybe Int)
        rep =
          item '*' >>> const (0, Nothing)
          ||| item '?' >>> const (0, Just 1)
          ||| item '+' >>> const (1, Nothing)
          ||| enclose '{' '}' numRep

        factor :: Parser Char RegEx
        factor =
          atom *** list rep
          >>> uncurry (foldl Repeat)

        term :: Parser Char RegEx
        term =
          factor *** list factor
          >>> uncurry (foldl Concat)

        regEx :: Parser Char RegEx
        regEx =
          term *** list (prefix '|' term)
          >>> uncurry (foldl Alt)

Considering that they were developed independently, the two solutions show striking similarities.

Interesting extracts from the second- and third-place solutions follow:

        badChars3 = '.':'(':')':'[':'{':'}':'*':'?':'+':'|':badChars2

        allowed x = not (elem x ['.','\\','(',')','[',']','{','}','*','?','+','|'])

'H':'m':'p':'f':[], those are fancy ways, token-wise, to write ".()[{}*?+|"! It's like the genie grants you three wishes, and you answer: “First, I would like to have French fries. Second, give me some cheese. And finally, give me some of that groovin' Bratensoße-ish gravy.” I.e., poutine (except you'd have to assemble it yourself, like a piece of Swedish furniture—or a Swedish piece of furniture?). But somehow, it's easy to forget that a string is nothing other than a list of characters, and hence that fixed sets of characters can be naturally encoded as string literals. Just like it's easy to forget that Swedish pieces of furniture are also pieces of Swedish furniture.
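A quick sanity check that three such spellings denote the very same value; the point is that the string literal is a single token, whereas every ':' and every character literal in the other variants costs extra:

```haskell
-- The same set of characters, spelled three ways.
viaLiteral, viaCons, viaList :: String
viaLiteral = ".()[{}*?+|"
viaCons    = '.' : '(' : ')' : '[' : '{' : '}' : '*' : '?' : '+' : '|' : []
viaList    = ['.', '(', ')', '[', '{', '}', '*', '?', '+', '|']
```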

The fourth-place solution is so totally out there, there's no choice but to display it in full:

        regEx :: Parser Char RegEx
        regEx = ((((==========) '(' ')' regEx ||| charP) ======= (repeatP >>> \a->(\x->Repeat x a))) ====== (success Concat >>> (============) Concat)) ====== ((=====) '|' >>> (============) Alt)

        repeatP =
        			(=====) '*' >>> (============) (0,Nothing) |||
        			(=====) '?' >>> (============) (0,Just 1) |||
        			(=====) '+' >>> (============) (1,Nothing) |||
        			(==========) '{' '}' (((=============) *** (=====) ',' >>> fst) *** (================) (=============))

        charP =
        			(==========) '[' ']' groupP >>> One |||
        			(=====) '.' >>> (============) Any |||
        			(===========) >>> \c->One [(c,c)]


        groupP = list ((=========) "\\]" *** ((=====) '-' *** (=========) "\\" >>> (=================)) ||| (=========) "\\]" >>> \c->(c,c))

        infixr 5 ====

        p ==== f = \xs ->
          case p xs of
            Nothing -> Nothing
            Just(b,ys) -> (f b) ys

        (=====) = item

        p ====== op = p ==== chainl'
        	where chainl' a = op ==== (\o-> p ==== (\b->chainl' (o a b))) ||| success a

        p ======= op = p ==== uchainl'
        	where uchainl' a = op ==== (\o->uchainl' (o a )) ||| success a

        (========) = one

        --exkl
        (=========) cs = (========) (\c->not (elem c cs))

        (==========) = enclose

        (===========) =
        			(=====) '\\' *** (========) ((============) True) >>> (=================) |||
        			(=========) ".\\()[]{}*?+|"

        (============) = const

        (=============) = ((========) (===============) >>> (==============))

        (==============) = digitToInt

        (===============) = isDigit

        (================) = optional

        (=================) = snd

Unfortunately for its author, it is the token =, not the character =, that counts as −1. Sorry mate! The idea is that if you define intermediate symbols, e.g., f = some big term, you don't pay for the definition even if you use f only once (which might be a good thing to do to increase readability), and you start gaining each time you actually reuse f. (To quote the National Institute of Environmental Health Sciences and the United States Environmental Protection Agency: “Reduce, reuse, recycle!”)

And the rule about ignoring ) was simply there to ensure that f (g x) costs the same as f $ g x, so that participants can freely choose the style they prefer rather than being encouraged to artificially dollarize their submissions.

What are this week's lessons?

    1. The Aufgabe seems to have scared away many competitors (to put it mildly), but a look at the solutions shows that they are not that complicated. Each function does a fairly simple job on its own—range parses a range, subrange (or subRange) parses a subrange, etc. And they all pretty much reuse the same basic combinators (parse functions) and idioms over and over. By proceeding step by step, one obtains a parser for a somewhat complicated language. Continuing like this, we could write a parser for a real programming language.
    2. Whenever possible, define reusable functions capturing general patterns of usage. These increase readability and allow us to work at a higher level of abstraction, thereby helping us be more productive.

Die Ergebnisse der zwölften Woche

While the MC is enjoying the sun somewhere else, the CoMC is doing the hard work in foggy Garching. Here are the results for the penultimate week of the Wettbewerb:

Top 20 der Woche
  Platz Wettbewerber(in)   Wertung   Einzeltests   Punkte
1. Elias Marquart 24 [1.429,1.500,0.040,6.601,12.441,1.500,1.428,0.189] 20
2. Albert Steckermeier 38 [1.000,1.000,1.063,2.566,0.856,1.166,XXX,0.527] 19
3. Julius Michaelis 65 [1.429,1.667,0.034,0.984,0.729,XXX,XXX,0.155] 18
Simon Wimmer 65 [0.857,1.000,0.236,1.013,1.314,XXX,XXX,0.081] 18
5. Florian Haffke 66 [0.857,1.000,0.026,1.540,1.746,XXX,XXX,0.608] 16
Manuel Ehler 66 [0.857,1.000,0.032,1.640,1.746,XXX,XXX,0.797] 16
7. Raphael Schaller 67 [0.857,1.167,0.023,1.278,2.958,XXX,XXX,0.378] 14
Lorenz Panny 67 [0.857,1.000,0.017,0.937,4.534,XXX,XXX,0.068] 14
Manuel Bergler 67 [0.857,1.000,0.023,1.466,3.017,XXX,XXX,0.446] 14
Patrick Werneck 67 [1.000,1.000,0.023,1.310,2.881,XXX,XXX,0.365] 14−1=13
11. Florian Hauer 68 [0.857,1.000,0.029,2.333,2.627,XXX,XXX,0.662] 10
12. Christian Lübben 69 [0.857,1.000,0.034,1.979,4.763,XXX,XXX,0.588] 9
Felix Sonntag 69 [0.857,1.000,0.032,2.164,4.720,XXX,XXX,0.689] 9
Timm Beckmann 69 [1.000,1.000,0.029,1.995,4.559,XXX,XXX,0.534] 9
15. Philip Becker-Ehmck 71 [1.000,1.000,0.029,2.058,6.051,XXX,XXX,0.541] 6
Thomas Hutzelmann 71 [0.857,1.000,0.057,3.722,3.237,XXX,XXX,1.642] 6
17. Daniel Veihelmann 72 [1.143,1.000,0.029,2.246,6.551,XXX,XXX,0.581] 4
18. Benedikt Löbbecke 73 [1.000,1.167,0.032,2.661,7.678,XXX,XXX,0.696] 3
19. David Koller 77 [1.000,1.000,0.046,4.042,9.008,XXX,XXX,1.905] 2
20. Maximilian Walther 78 [1.000,1.000,0.063,8.360,3.602,XXX,XXX,4.223] 1
Philipp Härtinger 78 [0.857,1.000,0.029,3.138,5.424,XXX,XXX,7.514] 1

This week there were 83 submissions. Apparently, there is a strong correlation between the number of homework points for a Wettbewerbsaufgabe and the number of submissions. 58 solutions passed our soundness tests. One further solution did not compile because it contained the comment --=> ... (in Haskell, a line starting with --=> is lexed as an operator symbol, not as a comment). Since it was otherwise a correct and competitive program, the CoMC corrected the sloppiness at a price of 1 point (see Top 20).

The ranking is constructed similarly to the one from week 10. This time the running-time ratios (with respect to the MC's solution) on several tests were compared for all 58 sound solutions (XXX represents a timeout). The "Wertung" is a weighted sum of the ratios (tests 6 and 7 have weight 0.5, as they were very similar; the other tests have weight 1).
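Concretely, the Wertung might be computed along these lines. This is a sketch based on the description above; the rounding mode and the handling of timeouts are guesses:

```haskell
-- Tests 6 and 7 get weight 0.5, all others weight 1.
weights :: [Double]
weights = [1, 1, 1, 1, 1, 0.5, 0.5, 1]

-- Weighted sum of the running-time ratios, rounded to an integer score.
-- Timeouts (XXX) would need some penalty value; only complete ratio
-- lists are handled here.
wertung :: [Double] -> Integer
wertung = round . sum . zipWith (*) weights
```

On the first-place row of the table, this sketch reproduces the published Wertung of 24.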

The tests tried to cover different aspects of the inputs that were mentioned in the description: short alphabets ("", test 2), long alphabets ([1..1000000], test 5), short censored words lists ([■■■■,■■■■,■■■■,■■■■,■■■■■■■■■■,■■■■■■■■■■■■,■■■■], test 4), long censored words lists (wordsOfLength "ab" 10, test 8), redundant censored words lists (map (\n -> replicate n 'x') [1..1000], test 3).

Tests 6 and 7 are particularly interesting: only a few competitors passed them; two of those now occupy places 1 and 2.

        test6 = censoredWordsOf "abc" ["a", "b", "c"] == [""]
        test7 = censoredWordsOf "abc" ["c","ab","ba","aaa","bbb"] == ["","a","b","aa","bb"]

The naive solutions do not terminate here. The rescue comes from the lexicographic ordering: if all words of a particular length have been censored, then all longer words are censored as well (i.e., the program can safely terminate). Admittedly, the MC's solution was looping too, before the Master of Tutors spotted this situation while fighting nonterminating QuickCheck tests.

Apart from that trick, the following solution of the MC is almost straightforward filtering (of course, it pays off to remove redundancy from the censored words list).

        import Data.List (isInfixOf, isSuffixOf, nub)

        -- snoc appends an element at the end of a list
        -- (presumably provided by the course library)
        snoc :: [a] -> a -> [a]
        snoc xs x = xs ++ [x]

        censoredWordsOf :: Eq a => [a] -> [[a]] -> [[a]]
        censoredWordsOf _ [[]] = []
        censoredWordsOf alphabet fwords = censoredWordsOf_fast alphabet fwords

        censoredWordsOf_fast_helper :: Eq a => [a] -> [[a]] -> [[a]]
        censoredWordsOf_fast_helper alphabet fwords =
          [] : concatMapStop (\w -> filter (\w -> not (any (`isSuffixOf` w) fwords))
                                           (map (snoc w) alphabet))
                             (censoredWordsOf_fast_helper alphabet fwords)

        concatMapStop :: (a -> [b]) -> [a] -> [b]
        concatMapStop f [] = []
        concatMapStop f (x : xs) =
          case f x of
            [] -> []
            ys -> ys ++ concatMapStop f xs

        censoredWordsOf_fast :: Eq a => [a] -> [[a]] -> [[a]]
        censoredWordsOf_fast [] _ = [[]]
        censoredWordsOf_fast alphabet fwords =
          censoredWordsOf_fast_helper alphabet fwords''
          where
            fwords' = nub $ filter (\f -> all (`elem` alphabet) f) $ fwords
            fwords'' = filter (\f -> not (any (`isStrictInfixOf` f) fwords')) fwords'

        isStrictInfixOf :: Eq a => [a] -> [a] -> Bool
        isStrictInfixOf xs ys = isInfixOf xs ys && xs /= ys

Students obviously like to shoot with heavier weapons. The winner, for example, compiles the censored words list into an efficient set implementation based on Patricia tries, removing redundancy at the same time. This beats the MC on long inputs.

        import Data.List (find)
        import qualified Data.ListTrie.Patricia.Set as Trie
        import qualified Data.Map as M

        -- with an Ord constraint there would be an easy logarithmic-time solution instead of this dirty linear(?) one

        censoredWordsOf :: Eq a => [a] -> [[a]] -> [[a]]
        censoredWordsOf alphabet censored = [] : map (map ordToValue) (censoredWordsOf' ordAlphabet [[]] ordCensored)
                              where
                                ordAlphabet = buildAlphabet alphabet
                                ordCensored = buildWordList ordAlphabet censored

        -- structure holding value and absolute ordering number
        data OrdValue a = OrdData a Int
            deriving Show

        instance Eq a => Eq (OrdValue a) where
          (OrdData v1 _) == (OrdData v2 _) = v1 == v2

        instance Eq a => Ord (OrdValue a) where
          (OrdData v1 o1) <= (OrdData v2 o2) = v1 == v2 || o1 <= o2


        censoredWordsOf' :: Eq a => [OrdValue a] -> [[OrdValue a]] -> Trie.TrieSet M.Map (OrdValue a) -> [[OrdValue a]]
        censoredWordsOf' alphabet words censoredData = if null $ concat step then
                                            []
                                          else
                                            step ++ censoredWordsOf' alphabet step censoredData
                              where
                                step = [c:w|c<-alphabet,w<-words,validWord c w censoredData []]

        buildAlphabet :: [a] -> [OrdValue a]
        buildAlphabet alphabet = map tupleToOrd (zip alphabet [1..])

        buildWordList :: Eq a => [OrdValue a] -> [[a]] -> Trie.TrieSet M.Map (OrdValue a)
        buildWordList ordAlphabet censored = foldl (\trie word->insertWord ordAlphabet word [] trie) Trie.empty censored

        insertWord :: Eq a => [OrdValue a] -> [a] -> [OrdValue a] -> Trie.TrieSet M.Map (OrdValue a) -> Trie.TrieSet M.Map (OrdValue a)
        insertWord alphabet (w:word) res trie = if Trie.member res trie then
                                          trie
                                        else
                                          case toOrd alphabet w of
                                            Nothing -> trie
                                            Just r -> insertWord alphabet word (res++[r]) trie
        insertWord _ _ res trie = Trie.insert res trie

        -- most expensive operation, unfortunately there is no efficient datastructure available relying only on Eq (which is obvious, somehow)
        toOrd :: Eq a => [OrdValue a] -> a -> Maybe (OrdValue a)
        toOrd alphabet c = find (\(OrdData v _)->c==v) alphabet


        validWord :: Eq a => OrdValue a -> [OrdValue a] -> Trie.TrieSet M.Map (OrdValue a) -> [OrdValue a]  -> Bool
        validWord c (w:word) trie res = if Trie.member res trie then
                                        False
                                      else
                                        validWord w word trie (res++[c])
        validWord c [] trie res = not $ Trie.member res trie || Trie.member (res++[c]) trie

        ordToValue :: OrdValue a -> a
        ordToValue (OrdData v _) = v

        tupleToOrd :: (a,Int) -> OrdValue a
        tupleToOrd (c,ord) = OrdData c ord

The lessons of the week:

    1. When dealing with infinite data, think twice before descending recursively!
    2. Remove redundancy from data to gain speed.
    3. Compiling static data into efficient data structures sometimes pays off for larger inputs.

Die Ergebnisse der dreizehnten Woche

As announced during the last lecture:

Top 20 der Woche
  Platz Wettbewerber(in)   Ästhetik   Technik   Punkte
1. Julius Michaelis 9 10 20
2. Lorenz Panny 8 10 19
Elias Marquart 8 10 19
Patrick Werneck 8 10 19
Andreas Bergmaier 8 10 19
6. Dominik Durner 7 10 15
7. Markus Westerlind 6 9 14
8. Albert Steckermeier 5 8 13
9. Thomas Breier 10 2 12
10. Florian Haffke 1 1 11

Was ist denn ein Token?

The MC is a big fan of token counting as a rough measure of software quality. As the first week's results have clearly shown, shorter answers tend to be more elegant and (usually) less likely to have subtle bugs.

By popular request, the MC is making his token-counting program available to the rest of the world. The program builds on Niklas Broberg's Language.Haskell.Exts.Lexer module, which can be installed using Cabal:

        cabal update
        cabal install haskell-src-exts

There appears to be disagreement regarding the status of backquotes in different versions of Haskell (98 vs. 2010). The MC generously adopted the Haskell 98 definition, with `max` treated as one token rather than three.

Here's how to compile and run the program on Linux or Mac:

        ghc Tokenize.hs -o tokenize
        ./tokenize

At this point, type in some Haskell code, e.g.,

        sum_max_sq x y z = x ^ 2 + y ^ 2 + z ^ 2 - x `min` y `min` z ^ 2

Press Ctrl+D to terminate. The program then outputs

        24 sxyz=x^2+y^2+z^2-xmymz^2

The first field, “24”, gives the number of tokens. The second field is a short version of the program, where each token is abbreviated as one letter. This can be useful for debugging. The MC aggregates this information in his top-secret Master View, which looks something like this:

        18 sxyz=mxy^2+mxymz^2 xxx@tum.de
        18 sxyz=mxy^2+xmymz^2 xxx@tum.de
        18 sxyz=mxy^2+zmmxy^2 xxx@mytum.de
        20 sxyz=(mxy)^2+(myz)^2 xxx@mytum.de
        20 sxyz=mxy^2+m(mxy)z^2 xxx@tum.de
        20 sxyz=mxy^2+m(mxy)z^2 xxx@tum.de
        ...
        52 sxyz|x=z&y=z=x*x+y*y|x=y&z=y=x*x+z*z|y=x&z=x=y*y+z*z xxx@mytum.de
        52 sxyz|x=z&y=z=x*x+y*y|x=y&z=y=x*x+z*z|y=x&z=x=y*y+z*z xxx@mytum.de
        52 sxyz|x=z&y=z=x*x+y*y|x=y&z=y=x*x+z*z|y=x&z=x=y*y+z*z xxx@mytum.de
        52 sxyz|x=z&y=z=x^2+y^2|x=y&z=y=x^2+z^2|y=x&z=x=y^2+z^2 xxx@mytum.de
        ...
        252 sxyz|(x+y)>(x+z)&(x+y)>(y+z)=(x*x)+(y*y)|(x+z)>(x+y)&(x+z)>(y+z)=(x*x)+(z*z)|(y+z)>(y+x)&(y+z)>(x+z)=(y*y)+(z*z)|x=z&x>y=(x*x)+(z*z)|x=y&x>z=(x*x)+(y*y)|z=y&z>x=(y*y)+(z*z)|x=z&x<y=(x*x)+(y*y)|x=y&x<z=(x*x)+(z*z)|z=y&z<x=(x*x)+(z*z)|x=y&y=z=(x*x)+(y*y) xxx@mytum.de

The Master View allows the MC to compile the Top 20 der Woche, but also to quickly spot interesting solutions and potential plagiarism. Plagiarism isn't cool, and occurrences of it will be penalized by subtracting 1 000 000 000 (ONE ZILLION) points from the Semester total.

For the competition, the above program is invoked only on the code snippet included between {-WETT-} and {-TTEW-}, with all type signatures stripped away.

Some competitors have tried to confuse tokenize by using German letters (ä etc.) in their source code. Setting

        export LANG=de_DE

before running the program solves this problem.