Tuesday, October 26, 2010

Tour of a real toy Haskell program, part 2

This is part 2 of my commentary on detrospector. You can find part 1 here.

Performance claims

I make a lot of claims about performance below, not all of which are supported by evidence. I did perform a fair amount of profiling, but it's impossible to test every combination of data structures. How can I structure my code to make more of my performance hypotheses easily testable?

Representing the source document

When we process a source document, we'll need a data structure to represent large amounts of text.

Haskell's standard String type is simple, maximally lazy, and Unicode-correct. However, as a singly-linked list of individually boxed heap-allocated characters, it's woefully inefficient in space and time.
ByteString is a common alternative, but solves the wrong problem. We need a sequence of Unicode codepoints, not a sequence of bytes.
ByteString.Char8 is a hack and not remotely Unicode-correct.
The text library stores Unicode characters in a ByteString-like packed format. That's exactly what we need.

For more information on picking a string type, see ezyang's writeup.

We choose the lazy flavor of text so that we can stream the input file in chunks without storing it all in memory. Lazy IO is semantically dubious, but I've decided that this project is enough of a toy that I don't care. Note that robustness to IO errors is not in the requirements list. ;)

The enumerator package provides iteratees as a composable, well-founded alternative to the lazy IO hack. The enumerator API was complex enough to put me off using it for the first version of detrospector. After reading Michael Snoyman's enumerators tutorial, I have a slight idea how the library works, and I might use it in a future version.

Representing substrings

We also need to represent the k-character substrings, both when we analyze the source and when we generate text. The requirements here are different.

We expect that k will be small, certainly under 10 — otherwise, you'll just generate the source document! With many small strings, the penalty of boxing each individual Char is less severe.

And we need to support queue-like operations; we build a new substring by pushing a character onto the end, and dropping one from the beginning. For both String and Text, appending onto the end requires a full copy. So we'll use Data.Sequence, which provides sequences with bidirectional append, implemented as finger trees:

-- module Detrospector.Types
type Queue a = S.Seq a

...Actually, I just ran a quick test with Text as the queue type, and it seems not to matter much. Since k is small, the O(k) copy is insignificant. Profiling makes fools of us all, and asymptotic analysis is mostly misused. Anyway, I'm keeping the code the way it is, pending any insight from my clever readers. Perhaps one of the queues from Purely Functional Data Structures would be most appropriate.

Representing character counts

We need a data structure for tabulating character counts. In the old days of C and Eurocentrism, we could use a flat, mutable int[256] indexed by ASCII values. But the range of Unicode characters is far too large for a flat array, and we need efficient functional updates without a full copy.

We could import Data.Map and use Map Char Int. This will build a balanced binary search tree using pairwise Ord comparisons.

But we can do better. We can use the bits of an integer key as a path in a tree, following a left or right child for a 0 or 1 bit, respectively. This sort of search tree (a trie) will typically outperform repeated pairwise comparisons. Data.IntMap implements this idea, with an API very close to Map Int. Our keys are Chars, but we can easily convert using fromIntegral.

-- module Detrospector.Types
import qualified Data.IntMap as IM
...
type FreqTable = IM.IntMap Int

Representing distributions

So we have some frequency table like

IM.fromList [('e', 267), ('t', 253), ('a', 219), ...]

How can we efficiently pick a character from this distribution? We're mapping characters to individual counts, but we really want a map from cumulative counts to characters:

-- module Detrospector.Types
data PickTable = PickTable Int (IM.IntMap Char)

To sample a character from PickTable t im, we first pick a random k such that 0 ≤ k < t, using a uniform distribution. We then find the first key in im which is greater than k, and take its associated Char value. In code:

-- module Detrospector.Types
import qualified System.Random.MWC as RNG
...
sample :: PickTable -> RNG.GenIO -> IO Char
sample (PickTable t im) g = do
  k <- (`mod` t) <$> RNG.uniform g
  case IM.split k im of
    (_, IM.toList -> ((_,x):_)) -> return x
    _ -> error "impossible"

The largest cumulative sum is the total count t, so the largest key in im is t. We know k < t, so IM.split k im will never return an empty map on the right side.

Note the view pattern for pattern-matching an IntMap as if it were an ascending-order list.

The standard System.Random library in GHC Haskell is quite slow, a problem shared by most language implementations. We use the much faster mwc-random package. The only operation we need is picking a uniform Int as an IO action.

We still need a way to build our PickTable:

-- module Detrospector.Types
import Data.List ( mapAccumR )
...
cumulate :: FreqTable -> PickTable
cumulate t = PickTable r $ IM.fromList ps where
  (r,ps) = mapAccumR f 0 $ IM.assocs t
  f ra (x,n) = let rb = ra+n in (rb, (rb, toEnum x))

This code is short, but kind of tricky. For reference:

mapAccumR :: (acc -> x -> (acc, y)) -> acc -> [x] -> (acc, [y])
f :: Int -> (Int, Int) -> (Int, (Int, Char))

f takes an assoc pair from the FreqTable, adds its count to the running sum, and produces an assoc pair for the PickTable. We start the traversal with a sum of 0, and get the final sum r along with our assoc pairs ps.

Representing the Markov chain

So we can represent probability distributions for characters. Now we need a map from k-character substrings to distributions.

Data.Map is again an option, but pairwise, character-wise comparison of our Queue Char values will be slow. What we really want is another trie, with character-based fanout at each node. Hackage has bytestring-trie, which unfortunately works on bytes, not characters. And maybe I should have used TrieMap or list-tries. Instead I used the hashmap package:

-- module Detrospector.Types
import qualified Data.HashMap as H
...
data Chain = Chain Int (H.HashMap (Queue Char) PickTable)

A value Chain k hm maps from subsequences of up to k Chars to PickTables. A lookup of some Queue Char key will require one traversal to calculate an Int hash value, then uses an IntMap to find a (hopefully small) Map of keys with that same hash value.

There is a wrinkle: we need to specify how to hash a Queue, which is just a synonym for S.Seq. This is an orphan instance, which we could avoid by newtype-wrapping S.Seq.

-- module Detrospector.Types
import qualified Data.Hashable as H
import qualified Data.Foldable as F
...
instance (H.Hashable a) => H.Hashable (S.Seq a) where
  {-# SPECIALIZE instance H.Hashable (S.Seq Char) #-}
  hash = F.foldl' (\acc h -> acc `H.combine` H.hash h) 0

This code is lifted almost verbatim from the [a] instance in Data.Hashable.

Serialization

After training, we need to write the Chain to disk, for use in subsequent generation. I started out with derived Show and Read, which was simple but incredibly slow. We'll use binary with ByteString.Lazy — the dreaded lazy IO again!

We start by specifying how to serialize a few types. Here the tuple instances for Binary come in handy:

-- module Detrospector.Types
import qualified Data.Binary as Bin
...
-- another orphan instance
instance (Bin.Binary k, Bin.Binary v, H.Hashable k, Ord k)
       => Bin.Binary (H.HashMap k v) where
  put = Bin.put . H.assocs
  get = H.fromList <$> Bin.get

instance Bin.Binary PickTable where
  put (PickTable n t) = Bin.put (n,t)
  get = uncurry PickTable <$> Bin.get

instance Bin.Binary Chain where
  put (Chain n h) = Bin.put (n,h)
  get = uncurry Chain <$> Bin.get

The actual IO is easy. We use gzip compression, which fits right into the IO pipeline:

-- module Detrospector.Types
import qualified Data.ByteString.Lazy   as BSL
import qualified Codec.Compression.GZip as Z
...
withChain :: FilePath -> (Chain -> RNG -> IO a) -> IO a
withChain p f = do
  ch <- (Bin.decode . Z.decompress) <$> BSL.readFile p
  RNG.withSystemRandom $ f ch

writeChain :: FilePath -> Chain -> IO ()
writeChain out = BSL.writeFile out . Z.compress . Bin.encode

Training the chain

I won't present the whole implementation of the train subcommand, but here's a simplification:

-- module Detrospector.Modes.Train
import qualified Data.Text     as Txt
import qualified Data.Text.IO  as Txt
import qualified Data.HashMap  as H
import qualified Data.IntMap   as IM
import qualified Data.Sequence as S
import qualified Data.Foldable as F
...
train Train{num,out} = do
  (_,h) <- Txt.foldl' roll (emptyQ, H.empty) ys <$> Txt.getContents
  writeChain out . Chain num $ H.map cumulate h where

  roll (!s,!h) x
    = (shift num x s, F.foldr (H.alter $ ins x) h $ S.tails s)
 
  ins x Nothing  = Just $! sing x
  ins x (Just v) = Just $! incr x v
 
  sing x = IM.singleton (fromEnum x) 1
 
  incr x = IM.alter f $ fromEnum x where
    f Nothing  = Just 1
    f (Just v) = Just $! (v+1)

Before generating PickTables, we build a HashMap of FreqTables. We fold over the input text, accumulating a pair of (last characters seen, map so far). Since foldl' is only strict to weak head-normal form (WHNF), we use bang patterns on the fold function roll to force further evaluation. RWH discusses the same issue.

shift (from Detrospector.Types) pushes a new character into the queue, and drops the oldest character if the size exceeds num. We add one count for the new character x, both for the whole history s and each of its suffixes.

We're using alter where perhaps a combination of lookup and insert would be more natural. This is a workaround for a subtle laziness-related space leak, which I found after much profiling and random mucking about. When you insert into a map like so:

let mm = insert k (v+1) m

there is nothing to force v+1 to WHNF, even if you force mm to WHNF. The leaves of our tree end up holding large thunks of the form ((((1+1)+1)+1)+...).

The workaround is to call alter with Just $! (v+1). We know that the implementation of alter will pattern-match on the Maybe constructor, which then triggers WHNF evaluation of v+1 because of ($!). This was tricky to figure out. Is there a better solution, or some different way I should approach this problem? It seems to me that Data.Map and friends are generally lacking in strictness building blocks.

The end!

Thanks for reading! Here's an example of how not to write the same program:

module Main where{import Random;(0:y)%(R p _)=y%p;(1:f)%(R _ n)=f%n;[]%(J x)=x;b
[p,v,k]=(l k%).(l v%).(l p%);main=getContents>>=("eek"#).flip(.:"eek")(y.y.y.y$0
);h k=toEnum k;;(r:j).:[k,v,b]=(j.:[v,b,r]).((k?).(v?).(b?)$(r?)m);[].:_=id;(!)=
iterate;data R a=J a|R(R a)(R a);(&)i=fmap i;k b y v j=let{t=l b%v+y;d f|t>j=b;d
f=k(m b)t v j}in d q;y l=(!!8)(q R!J l);q(+)b=b+b;p(0:v)z(R f x)=R(p v z f)x;p[]
z(J x)=J(z x);p(1:v)z(R n k)=R n$p v z k;m = succ;l p=tail.(snd&).take 9$((<<2).
fst)!(fromEnum p,0);(?)=p.l;d@[q,p,r]#v=let{u=b d v;j=i u;(s,o)|j<1=((97,122),id
)|h 1=((0,j-1),(k 0 0 u))}in do{q<-(h.o)&randomRIO s;putChar q;[p,r,q]#v};i(J q)
=q;i(R n k)=i n+i k;(<<)=divMod} -- finite text on stdin  © keegan oct 2010 BSD3

Tour of a real toy Haskell program, part 1

Haskell suffers from a problem I'll call the Fibonacci Gap. Many beginners start out with a bunch of small mathematical exercises, develop that skillset, and then are at a loss for what to study next. Often they'll ask in #haskell for an example of a "real Haskell program" to study. Typical responses include the examples in Real World Haskell, or the Design and Implementation of XMonad talk.

This is my attempt to provide another data point: a commentary on detrospector, my text-generating toy. It's perhaps between the RWH examples and xmonad in terms of size and complexity. It's not a large program, and certainly not very useful, but it does involve a variety of real-world concerns such as Unicode processing, choice of data structures, strictness for performance, command-line arguments, serialization, etc. It also illustrates my particular coding style, which I don't claim is the One True Way, but which has served me well in a variety of Haskell projects over the past five years.

Of course, I imagine that experts may disagree with some of my decisions, and I welcome any and all feedback.

I haven't put hyperlinked source online yet, but you can grab the tarball and follow along.

This is part 1, covering style and high-level design. Part 2 addresses more details of algorithms, data structures, performance, etc.

The algorithm

detrospector generates random text conforming to the general style and diction of a given source document, using a quite common algorithm.

First, pick a "window size" k. We consider all k-character contiguous substrings of the source document. For each substring w, we want to know the probability distribution of the next character. In other words, we want to compute values for

P(next char is x | last k chars were w).

To compute these, we scan through the source document, remembering the last k characters as w. We build a table of next-character occurrence counts for each substring w.

These tables form a Markov chain, and we can generate random text by a random walk in this chain. If the last k characters we generated were w, we choose the next character randomly according to the observed distribution for w.

So we can train it on Accelerando and get output like:

addressed to tell using back holes everyone third of the glances waves and diverging him into the habitat. Far beyond Neptune, I be?" asks, grimy. Who's whether is headquarters I need a Frenchwoman's boyfriend go place surrected in the whole political-looking on the room, leaving it, beam, the wonderstood. The Chromosome of upload, does enough. If this one of the Vile catches agree."

Design requirements

We can only understand a program's design in light of the problem it was meant to solve. Here's my informal list of requirements:

detrospector should generate random text according to the above algorithm.
We should be able to invoke the text generator many times without re-analyzing the source text each time.
detrospector should handle all Unicode characters, using the character encoding specified by the system's locale.
detrospector should be fast without sacrificing clarity, whatever that means.

Wearing my customer hat, these are axioms without justification. Wearing my implementor hat, I will have to justify design decisions in these terms.

Style

In general I import modules qualified, using as to provide a short local name. I make an exception for other modules in the same project, and for "standard modules". I don't have a precise definition of "standard module" but it includes things like Data.Maybe, Control.Applicative, etc.

The longest line in detrospector is 70 characters. There is no hard limit, but more than about 90 is suspect.

Indentation is two spaces. Tabs are absolutely forbidden. I don't let the indentation of a block construct depend on the length of a name, thus:

foo x = do
  y <- bar x
  baz y

and

bar x =
  let y = baz x
      z = quux x y
  in  y z

This avoids absurd left margins, looks more uniform, and is easier to edit.

I usually write delimited syntax with one item per line, with the delimiter prefixed:

{-# LANGUAGE
    ViewPatterns
  , PatternGuards #-}

and

data Mode
  = Train { num   :: Int
          , out   :: FilePath }
  | Run   { chain :: FilePath }

Overriding layout is sometimes useful, e.g.:

look = do x <- H.lookup t h; return (x, S.length t)

(With -XTupleSections we could write

look = (,S.length t) <$> H.lookup t h

but that's just gratuitous.)

I always write type signatures on top-level bindings, but rarely elsewhere.

Module structure

I started out with a single file, which quickly became unmanageable. The current module layout is:

Detrospector.
  Types      types and functions used throughout
  Modes      a type with one constructor per mode
  Modes.
    Train    train the Markov chain
    Run      generate random text
    Neolog   generate neologisms
  Main       command-line parsing

There is also a Main module in detrospector.hs, which simply invokes Detrospector.Main.main.

Modules I write tend to fall into two categories: those which export nearly everything, and those which export only one or two things. The former includes "utility" modules with lots of small types and function definitions. The latter includes modules providing a specific piece of functionality. A parsing module might define three dozen parsers internally, but will only export the root of the grammar.

An abstract data type might fall into a third category, since they can export a large API yet have lots of internal helpers. But I don't write such modules very often.

Detrospector.Types is in the first category. Most Haskell projects will have a Types module, although I'm somewhat disappointed that I let this one become a general grab-bag of types and utility functions.

The rest fall into the second category. Each module in Detrospector.Modes.* exports one function to handle that mode. Detrospector.Main exports only main.

Build setup

This was actually my first Cabal project, and the first thing I uploaded to Hackage. I think Cabal is great, and features like package-visibility management are useful even for small local projects.

In my cabal file I set ghc-options: -Wall, which enables many helpful warnings. The project should build with no warnings, but I use the OPTIONS_GHC pragma to disable specific warnings in specific files, where necessary.

I also run HLint on my code periodically, but I don't have it integrated with Cabal.

I was originally passing -O2 to ghc. Cabal complained that it's probably not necessary, which was correct. The Cabal default of -O performs just as well.

I'm using Git for source control, which is neither here nor there.

Command-line parsing

detrospector currently has three modes, as listed above. I wanted to use the "subcommand" model of git, cabal, etc. So we have detrospector train, detrospector run, etc. The cmdargs package handles this argument style with a low level of boilerplate.

The "impure" interface to cmdargs uses some dark magic in the operator &= in order to attach annotations to arbitrary record fields. The various caveats made me uneasy, so I opted for the slightly more verbose "pure" interface, which looks like this:

-- module Detrospector.Main
import qualified System.Console.CmdArgs as Arg
import           System.Console.CmdArgs((+=),Annotate((:=)))
...
modes  = Arg.modes_  [train,run,neolog]
      += Arg.program "detrospector"
      += Arg.summary "detrospector: Markov chain text generator"
      += Arg.help    "Build and run Markov chains for text generation"
  where

  train = Arg.record Train{}
    [ num := 4
          += Arg.help "Number of characters lookback"
    , out := error "Must specify output chain"
          += Arg.typFile
          += Arg.help "Write chain to this file" ]
    += Arg.help "Train a Markov chain from standard input"

  run = Arg.record Run{}
    [ chain := error "Must specify input chain"
            += Arg.argPos 0         
            += Arg.typ "CHAIN_FILE" ]
    += Arg.help "Generate random text"
  ...

This tells cmdargs how to construct values of my record type Detrospector.Modes.Mode. We get help output for free:

$ ./dist/build/detrospector/detrospector -?
detrospector: Markov chain text generator

detrospector [COMMAND] ... [OPTIONS]
  Build and run Markov chains for text generation

Common flags:
  -? --help        Display help message
  -V --version     Print version information

detrospector train [OPTIONS]
  Train a Markov chain from standard input

  -n --num=INT     Number of characters lookback
  -o --out=FILE    Write chain to this file

detrospector run [OPTIONS] CHAIN_FILE
  Generate random text

...

My use of error here is hacky and leads to a bug that I recently discovered. When the -o argument to train is invalid or missing, the error is not printed until the (potentially time-consuming) analysis is completed. Only then is the record's field forced.

To be continued...

...right this way.

Saturday, October 23, 2010

Typing mathematical characters in X

Unicode provides many useful characters for mathematics. If you've studied the traditional notation, an expression like « Γ ⊢ Λt.λ(x:t).x : ∀t. t → t » is much more readable than an ASCII equivalent. However, most systems don't provide an easy way to enter these characters.

The compose key feature of X Windows provides a nice solution on Linux and other UNIX systems. Compose combinations are easy-to-remember mnemonics, like -> for →, and an enormous number of characters are available with just a few keystrokes.

Setting it up

I cooked up a config file with my most-used mathematical symbols. With recent Xorg, you can drop this file in ~/.XCompose, restart X, and you should be good to go.

The include line will pull in your system-wide configuration, e.g. /usr/share/X11/locale/en_US.UTF-8/Compose. This already contains many useful characters. I was going to add <3 for ♥ and CCCP for ☭, but I found that Debian already provides these.

GTK has its own input handling. To make it defer to X, I had to add an environment variable in ~/.xsession:

export GTK_IM_MODULE="xim"

The "Fn" key on recent ThinkPads makes a good compose key. It normally acts as a modifier key in hardware, but will send a keycode to X when pressed and released by itself. I used this xmodmap setting:

keycode 151=Multi_key

Tweaking the codes

Being boilerplate-averse, I specified the key combinations in a compact format which is processed by this Haskell script.

Obviously, not everyone will like my choice of key combinations. If you tweak the file and come up with something particularly nice, I'd like to see it. If you can't run Haskell code for whatever reason, it's not too hard to edit the generated XCompose file.

Though my use of Haskell here may seem gratuitous, I actually started writing this script in Python, but ran into trouble with Python 2's inconsistent treatment of Unicode text. Using Haskell's String type with GHC ≥ 6.12 will Just Work, at least until you care about performance.

Alternatives

If you don't like this solution, SCIM provides an input mode which uses L^aT_eX codes.

Tuesday, October 19, 2010

Quantification in Haskell

I wrote an answer over at Stack Overflow that somehow grew to article length. Here it is, recorded for posterity.

I'll paraphrase the original question as:

What's the difference between the types forall a. [a] and [forall a. a], and
How does this relate to existential types?

Well, the short answer to the second question is "It doesn't relate". But there's still a natural flow in discussing both topics.

Notation

Polymorphic values are essentially functions on types, but Haskell's syntax leaves both type abstraction and type application implicit. To better understand what's going on, we'll use a pidgin syntax of Haskell combined with a typed lambda calculus like System F.

In System F, polymorphism involves explicit type abstractions and type application. For example, the familiar map function would be written as:

map :: ∀a. ∀b. (a → b) → [a] → [b]
map = Λa. Λb. λ(f :: a → b). λ(xs :: [a]). case xs of
  [] → []
  (y:ys) → f y : map @a @b f ys

map is now a function of four arguments: types a and b, a function, and a list.

At the term level, we have one new syntactic form: Λ (upper-case lambda). A polymorphic value is a function from types to values, written Λx. e, much as a function from values to values is written λx. e.

At the type level, we have the symbol ∀ ("for all"), corresponding to GHC's forall keyword. A term containing Λ gives rise to a type containing ∀, just as a term containing λ gives rise to a type containing →.

Since we have functions of types, we also need application of types. I use the notation @a (as in GHC Core) to denote application of a type argument. So we might use map like so:

map @Char @Int ord "xzy"

Note also that I've provided an explicit type on each λ abstraction. Type inference for System F is, in general, undecidable (Wells, 1999). The restricted form of polymorphism available in vanilla Haskell 98 has decidable inference, but you lose this if you enable certain GHC extensions like RankNTypes.

We don't need annotations on Λ abstractions, because we have only one "kind" of type. In actual Haskell, or a calculus like System Fω, we also have type constructors, and we need a system of kinds to describe how they combine. We'll ignore this issue here.

So a value of polymorphic type is like a function from types to values. The caller of a polymorphic function gets to choose a type argument, and the function must comply.

One last bit of notation: I'll use the syntax

⊥@t

to mean an undefined value of type t, similar to Haskell's undefined.

∀a. [a]

How, then, would we write a term of type ∀a. [a]? We know that types containing ∀ come from terms containing Λ:

term1 :: ∀a. [a]
term1 = Λa. ?

Within the body marked ? we must provide a term of type [a]. However, we know nothing concrete about a, since it's an argument passed in from the outside. So we can return an empty list

term1 = Λa. []

or an undefined value

term1 = Λa. ⊥@[a]

or a list containing undefined values only

term1 = Λa. [⊥@a, ⊥@a]

but not much else.

To use this value, we apply a type, removing the outer ∀. Let's arbitrarily instantiate ∀a. [a] to [Bool]:

main = print @Int (length @Bool (term1 @Bool))

[∀a. a]

What about [∀a. a], then? If ∀ signifies a function on types, then [∀a. a] is a list of functions. We can provide as few as we like:

term2 :: [∀a. a]
term2 = []

or as many:

term2 = [f, g, h]

But what are our choices for f, g, and h?

f :: ∀a. a
f = Λa. ?

Now we're well and truly stuck. We have to provide a value of type a, but we know nothing whatsoever about the type a. So our only choice is

f = Λa. ⊥@a

So our options for term2 look like

term2 :: [∀a. a]
term2 = []
term2 = [Λa. ⊥@a]
term2 = [Λa. ⊥@a, Λa. ⊥@a]

etc. And let's not forget

term2 = ⊥@(∀a. [a])

Unlike the previous example, our choices for term2 are already lists, and we can pass them to length directly. As before, we have to pass the element type to length:

main = print @Int (length @(∀a. a) term2)

Existential types

So a value of universal (∀) type is a function from types to values. A value of existential (∃) type is a pair of a type and a value.

More specifically: A value of type

∃x. T

is a pair

(S, v)

where S is a type, and where v :: T, assuming you bind the type variable x to S within T.

Here's an existential type signature and a few terms with that type:

term3 :: ∃a. a
term3 = (Int,         3)
term3 = (Char,        'x')
term3 = (∀a. a → Int, Λa. λ(x::a). 4)

In other words, we can put any value we like into ∃a. a, as long as we pair that value with its type.

The user of a value of type ∀a. a is in a great position; they can choose any specific type they like, using the type application @T, to obtain a term of type T. The producer of a value of type ∀a. a is in a terrible position: they must be prepared to produce any type asked for, so (as in the previous section) the only choice is Λa. ⊥@a.

The user of a value of type ∃a. a is in a terrible position; the value inside is of some unknown specific type, not a flexible polymorphic value. The producer of a value of type ∃a. a is in a great position; they can stick any value they like into the pair, as we saw above.

So what's a less useless existential? How about values paired with a binary operation:

type Something = ∃a. (a, a → a → a, a → String)

term4_a, term4_b :: Something
term4_a = (Int,    (1,     (+)  @Int , show @Int))
term4_b = (String, ("foo", (++) @Char, λ(x::String). x))

Using it:

triple :: Something → String
triple = λ(a, (x :: a, f :: a→a→a, out :: a→String)).
  out (f (f x x) x)

The result:

triple term4_a  ⇒  "3"
triple term4_b  ⇒  "foofoofoo"

We've packaged up a type and some operations on that type. The user can apply our operations but cannot inspect the concrete value — we can't pattern-match on x within triple, since its type (hence set of constructors) is unknown. This is more than a bit like object-oriented programming.

Using existentials for real

The direct syntax for existentials using ∃ and type-value pairs would be quite convenient. UHC partially supports this direct syntax. But GHC does not. To introduce existentials in GHC we need to define new "wrapper" types.

Translating the above example:

{-# LANGUAGE ExistentialQuantification #-}

data Something = forall a. MkThing a (a -> a -> a) (a -> String)

term_a, term_b :: Something
term_a = MkThing 1 (+) show
term_b = MkThing "foo" (++) id

triple :: Something -> String
triple (MkThing x f out) = out (f (f x x) x)

There's a couple differences from our theoretical treatment. Type application, type abstraction, and type pairs are again implicit. Also, the wrapper is confusingly written with forall rather than exists. This references the fact that we're declaring an existential type, but the data constructor has universal type:

MkThing :: forall a. a -> (a -> a -> a) -> (a -> String) -> Something

Often, we use existential quantification to "capture" a type class constraint. We could do something similar here:

data SomeMonoid = forall a. (Monoid a, Show a) => MkMonoid a

Hardware

The removable back panel of the unit holds three circuit boards: a PC motherboard, a power supply, and a custom board I'll call the LED controller.

The motherboard has a 386-compatible (?) processor, some RAM sticks, and a number of ISA slots. One slot provides a connector to the floppy drive, as well as an unused hard-drive connector. Another slot connects directly to the LED controller by a ribbon cable. The motherboard also holds a DIN-5 connector which I presume to be for an AT keyboard. I could not find any serial or parallel ports even as pin headers, though they could be added on an ISA card.

The LED controller is a large custom board of DIP integrated circuits. Other than two 16kbit SRAM chips, all of the logic is provided by 74-series discrete logic ICs. There is not a processor or PLD to be found. This board connects to other boards attached to the front of the unit, which I did not investigate further.

Lacking a logic analyzer, I decided to investigate the software side.

Reverse-engineering the software

I acquired a copy of the official software, JVFF.EXE. The software runs fine in DOSBox. It even displays output: there's a fallback which draws CGA graphics if no LED board is found.

I enabled IO port logging in DOSBox, by setting #define ENABLE_PORTLOG in src/hardware/iohandler.cpp. I noticed a read from port 0x2F0, which is not listed as a standard port in RBIL. DOSBox helpfully logs the instruction pointer with each port access, and soon I was reading the executable's assembly code in the freeware version of IDA. I found the following interesting functions:

0x4050: Reads a configuration byte from the controller at port 0x2F0. This byte specifies which DMA channel to use, as well as the dimensions of the display.
0x3F02: The entry point for updating the display. Called from many places. Takes a range of lines to update (?) as arguments. Depending on some global variables, it calls 0x3FA9 and/or the CGA drawing routine.
0x3FA9: The "driver" for the LED controller. Calls all of the below.
0x410D: Sets up a DMA transfer to the controller, then writes the three bytes 0x8F, 0x0F, 0x07 to the controller at port 0x2F0.
0x417A: Checks whether the DMA transfer has completed.
0x4166: Writes the bytes 0x05, 0x07 to the controller at port 0x2F0, effectively strobing bit 1 low. This resets the controller's state, such that the next DMA transfer will set the first line of the display. Perhaps data line 1 is wired to the reset line of a counter IC.
0x403C: Writes the bytes 0x06, 0x07 to the controller at port 0x2F0, effectively strobing bit 0 low. This advances the controller's state to the next line. Perhaps data line 0 is wired to the clock line of a counter IC.

There is more detail to the protocol, but it's best explained by the C code linked below. It's well-commented, I promise.

Writing custom software

After staring at these functions for a bit, I understood the protocol well enough to mimic it in my own code. As a demo I coded a cellular automaton: specifically, HighLife on a projective plane. You can get the code here as a public-domain C program.

I've compiled this code with Open Watcom 1.9 inside DOSBox. With the provided makefile, wmake will build JVFLIFE.EXE. You can then copy this file to a DOS boot floppy, and set it to run automatically using AUTOEXEC.BAT.

Wednesday, October 6, 2010

Verbose show and the missing preprocessor

Verbose show

Say we want a function vshow which describes an expression before showing its value. For example:

vshow (2 + 3)          = "2 + 3 = 5"
vshow (product [1..5]) = "product [1..5] = 120"

Clearly vshow can't be an ordinary Haskell function. We just don't have that kind of runtime access to program source.

All sorts of workarounds come to mind: simple-reflect, feeding strings to hint, Template Haskell… But it's hard to beat the simplicity of the humble C preprocessor:

{-# LANGUAGE CPP #-}
{-# OPTIONS_GHC -pgmP cpp #-}

#define vshow(e) (#e ++ " = " ++ show (e))

main :: IO ()
main = do
  putStrLn vshow(2 + 3)
  putStrLn vshow(product [1..5])

GHC's built-in CPP can't handle the stringify operator, so we invoke the "real" C preprocessor with -pgmP cpp. Testing it:

$ runhaskell vshow.hs
2 + 3 = 5
product [1..5] = 120

Like the other solutions I mentioned, this has its drawbacks. vshow isn't a first-class function, and we have to use C rather than Haskell function-call syntax. It's not terribly flexible, but it is one line of reasonably clear code, and I can think of a few small use cases.

The missing preprocessor

Besides sharing a cute little trick, I wanted to start a conversation about text-level preprocessors for Haskell. CPP is used throughout the standard library for conditional compilation, boilerplate class instances, etc. At the same time, it's quite limited, and not a good syntactic fit for Haskell.

Template Haskell is at the opposite end of the power/complexity spectrum. It might be what you want for programmatically generating complicated abstract syntax trees. It's not a good fit for simply "copy-pasting" some text into a few locations. For those tasks we shy away from the enormous complexity of TH. We use CPP or, well, copy-paste. The latter is not acceptable for a language that's advertising such a high level of abstraction. We may write less boilerplate than in Java or C++, but it still gets on the nerves.

I think there's an unfilled niche for lightweight text-level preprocessing, with just a bit more flexibility than CPP. GHC makes it easy to plug in any preprocessor you like, using -F -pgmF. We can start experimenting today.

m4 is a powerful and venerable preprocessor. It comes with most UNIX systems and has an open-source implementation. Yet I've not seen any Haskell project use it for Haskell source. Has anyone used m4 in this role?

I found an old paper (.ps.gz) with some proposals. Anyone have comments?

There's a number of preprocessors for implementing specific language features, including arrows, indexed monads, C bindings, and Conor McBride's intriguing Strathclyde Haskell Enhancement. The package preprocessor-tools seems to be a framework for writing this sort of preprocessor.

So what have you used for preprocessing? What features would you like to see? What design philosophy?