As of r49698 the use-diet branch finishes the testsuite (about 2000 tests) without regressions. Actually the User* member of the structure is not eliminated yet, but instead deliberately set to NULL in various places, and if present, I assert that it equals the computed value.
It was quite a fight to get thus far, but it is done! Time to open a good bottle of wine...
The next days will be devoted to cleaning up the sometimes messy details, doing some statistics and cautiously trying to merge back my work to trunk.
Wish me luck.
Monday, April 14, 2008
Sunday, April 13, 2008
LLVM Data Structures, and Putting Use on Diet
There is a good deal of refactoring going on in the LLVM universe, to wit Dominic's renaming of LLVM*Builder to IRBuilder with assorted simplifications...
The train is moving fast, getting on board is harder, but rewarding. LLVM does not commit to API (or less even, binary) compatibility, so meaningful changes get in swiftly and without much bureaucracy. A simple mail, warning people of the change in advance, and when accomplished, instructions how to do the conversion, are enough to make folks happy.
There are reforms going more than skin deep, too. Clang is is getting a datastructure rewrite with nice gains, and myself is about to reduce the size of the Use struct by 4 bytes, from 16 to 12 (on a 32bit system). The 25% savings is nothing to sneeze at, considering that this struct is the most frequently allocated one in LLVM. And now that I have the functionality basically implemented on a branch, I can say, the idea is working! Consequently, I have the courage to blog about it :-)
So what is this Use-diet all about? The Use struct is the home of the pointers to all Values that an Instruction references. But each Value has to track all of its Users (i.e. the Instructions), and Use provides forward and backward pointers to chain those up. These are the essential 3 pointers. But why is Use 16 bytes? Because in some situations it is important to get back to the User of the referred Value. So it is 4 pointers in total.
Seemingly this is how it works™ and there is nothing that can be shaved off. But wait! Don't all Uses belonging to an Instruction come lined up in a contiguous array? What if we could mark the last Use specially and put a pointer to the Instruction behind it? Or even allocate the Uses immediately in front of Instruction (memory-layout wise)?
These were the first ideas how my brain-storming with Sabre began. After several exchanged emails, we settled on a concept that is IMHO really beautiful. We use 2 bits (the least significant ones, which are always zero, normally) of one of the pointers in Use to implement a serial line-like protocol of waymarks that guides us to the end of the array in some reasonably few steps. I will not detail the algorithm, since it is documented elsewhere, I will only say that there are four kinds of waymarks:‹fullstop›, ‹stop›, ‹0› and ‹1›.
For reference I shall present it here:
So where are the uses of this algorithm outside of LLVM?
There is not much of thinking needed to generalize the array to other data structures, which may permit mutating of the node contents, but disallow insertions. Linked lists are an example since one can only cons up stuff to the head. Doubly-ended arrays with fast size() operation (given a pointer to one element) can be implemented if there is another pointer in each node that we can use for storing waymarks to the start of the array. Deques also could work like this, but they too, need to be fully built up before putting in the waymarks.
This all reminds me of cons-hashing but is really a more powerful concept. Let's call it waymarking! And then let the garbage collector put in the waymarks for us...
The train is moving fast, getting on board is harder, but rewarding. LLVM does not commit to API (or less even, binary) compatibility, so meaningful changes get in swiftly and without much bureaucracy. A simple mail, warning people of the change in advance, and when accomplished, instructions how to do the conversion, are enough to make folks happy.
There are reforms going more than skin deep, too. Clang is is getting a datastructure rewrite with nice gains, and myself is about to reduce the size of the Use struct by 4 bytes, from 16 to 12 (on a 32bit system). The 25% savings is nothing to sneeze at, considering that this struct is the most frequently allocated one in LLVM. And now that I have the functionality basically implemented on a branch, I can say, the idea is working! Consequently, I have the courage to blog about it :-)
So what is this Use-diet all about? The Use struct is the home of the pointers to all Values that an Instruction references. But each Value has to track all of its Users (i.e. the Instructions), and Use provides forward and backward pointers to chain those up. These are the essential 3 pointers. But why is Use 16 bytes? Because in some situations it is important to get back to the User of the referred Value. So it is 4 pointers in total.
Seemingly this is how it works™ and there is nothing that can be shaved off. But wait! Don't all Uses belonging to an Instruction come lined up in a contiguous array? What if we could mark the last Use specially and put a pointer to the Instruction behind it? Or even allocate the Uses immediately in front of Instruction (memory-layout wise)?
These were the first ideas how my brain-storming with Sabre began. After several exchanged emails, we settled on a concept that is IMHO really beautiful. We use 2 bits (the least significant ones, which are always zero, normally) of one of the pointers in Use to implement a serial line-like protocol of waymarks that guides us to the end of the array in some reasonably few steps. I will not detail the algorithm, since it is documented elsewhere, I will only say that there are four kinds of waymarks:
- Fullstop means we are at the end already,
- Stop means begin gathering digits, or if already done so, convert them to an offset, that brings us to the end,
- 0 and 1 are the binary digits to be picked up.
For reference I shall present it here:
> import Test.QuickCheckBtw., QuickCheck is awesome!
>
> digits :: Int -> [Char] -> [Char]
> digits 0 acc = '0' : acc
> digits 1 acc = '1' : acc
> digits n acc = digits (n `div` 2) $ digits (n `mod` 2) acc
>
> dist :: Int -> [Char] -> [Char]
> dist 0 [] = ['S']
> dist 0 acc = acc
> dist 1 acc = let r = dist 0 acc in 's' : digits (length r) r
> dist n acc = dist (n - 1) $ dist 1 acc
>
> takeLast n ss = reverse $ take n $ reverse ss
>
> test = takeLast 40 $ dist 20 []
>
Printinggives: "1s100000s11010s10100s1111s1010s110s11s1S"
The reverse algorithm computes the
length of the string just by examining
a certain prefix:
> pref :: [Char] -> Int
> pref "S" = 1
> pref ('s':'1':rest) = decode 2 1 rest
> pref (_:rest) = 1 + pref rest
>
> decode walk acc ('0':rest) = decode (walk + 1) (acc * 2) rest
> decode walk acc ('1':rest) = decode (walk + 1) (acc * 2 + 1) rest
> decode walk acc _ = walk + acc
>
Now, as expected, printinggives 40.
We can quickCheck this with following property:
> testcase = dist 2000 []
> testcaseLength = length testcase
>
> identityProp n = n > 0 && n <= testcaseLength ==> length arr == pref arr
> where arr = takeLast n testcase
As expectedgives:
*Main> quickCheck identityProp
OK, passed 100 tests.
So where are the uses of this algorithm outside of LLVM?
There is not much of thinking needed to generalize the array to other data structures, which may permit mutating of the node contents, but disallow insertions. Linked lists are an example since one can only cons up stuff to the head. Doubly-ended arrays with fast size() operation (given a pointer to one element) can be implemented if there is another pointer in each node that we can use for storing waymarks to the start of the array. Deques also could work like this, but they too, need to be fully built up before putting in the waymarks.
This all reminds me of cons-hashing but is really a more powerful concept. Let's call it waymarking! And then let the garbage collector put in the waymarks for us...
Down with O(n) complexity on linked-list's length operation!
Monday, March 31, 2008
ETAPS Budapest
I am in Budapest since Saturday, and I love it. The weather is excellent, I am in the middle of some very nice people and I can do what I like most: brainstorming cool concepts.
At a satellite event yesterday (DCC - Designing Correct Circuits) I met Tim and after the workshop we actually came to chat a plenty on Omega. I very much felt like a satellite, because I did not participate at the conference itself, just hanging around the lobby and the workshop rooms, using the "free" WLAN and thinking about what to do next.
Today then was our first Omega hacking day. Building on the googlecode repository's trunk version Tim checked in everything he had on his laptop, and now we can do concurrent development. Both the documentation and the release management toolchain have been overhauled and checked in, along with a testsuite.
This is already much more than what I dreamed of...
Tomorrow we'll go thru the issue list, squashing as many bugs as possible and I might get back to the LLVM generation branch. Another interested student will join us, an Hungarian guy called Tamás.
This is finally the place to thank Dr. Horváth Zoltán, who has very kindly offered us rooms in the university to have our hacking sessions in complete peace.
At a satellite event yesterday (DCC - Designing Correct Circuits) I met Tim and after the workshop we actually came to chat a plenty on Omega. I very much felt like a satellite, because I did not participate at the conference itself, just hanging around the lobby and the workshop rooms, using the "free" WLAN and thinking about what to do next.
Today then was our first Omega hacking day. Building on the googlecode repository's trunk version Tim checked in everything he had on his laptop, and now we can do concurrent development. Both the documentation and the release management toolchain have been overhauled and checked in, along with a testsuite.
This is already much more than what I dreamed of...
Tomorrow we'll go thru the issue list, squashing as many bugs as possible and I might get back to the LLVM generation branch. Another interested student will join us, an Hungarian guy called Tamás.
This is finally the place to thank Dr. Horváth Zoltán, who has very kindly offered us rooms in the university to have our hacking sessions in complete peace.
Tuesday, February 5, 2008
Embeddings, Part Two: Monad --> Thrist
Here comes the second part of my promised embeddings, this time I'll embed any monadic chain (corresponding to the do-syntax in Haskell) into a thrist and provide a semantics. Later I'll try to discuss some convenience features and try my luck with mfix (and the mdo-notation).
As in my post dealing with arrows, we have to define an adapter GADT:
Feed sounds like an introduction of a new monadic value, we would like to interpret it as the (>>) connective. Digest on the other hand transforms, and thus corresponds to (>>=). Both expect an already formed monadic value, so we can state:
We have converted >> getChar >>= (return . ord) into a thrist and got it back by means of recoverM! And this works for any monad.
Two shortcuts come to my mind, they can be modelled by adding two simpler variants to the GADT:
Now what about the value fixpoint mfix? Can we dress it in thristy clothes? Frankly, I have no idea, but I would be happy to hear from you. It is late, I need sleep now. Bye.
As in my post dealing with arrows, we have to define an adapter GADT:
data Monad' :: (* -> *) -> * -> * -> * whereThe semantics is a bit trickier as in the arrow case, but let's find out the type signature first...
Feed :: Monad m => m b -> Monad' m a b
Digest :: Monad m => (a -> m b) -> Monad' m a b
Feed sounds like an introduction of a new monadic value, we would like to interpret it as the (>>) connective. Digest on the other hand transforms, and thus corresponds to (>>=). Both expect an already formed monadic value, so we can state:
recoverM :: Monad m => m a -> Thrist (Monad' m) a b -> m bThe effect of an empty thrist is trivial:
recoverM mon Nil = monAnd the other two connectives turn out to be the only sensible way to do it:
recoverM mon (Cons (Feed m) rest) = recoverM (mon >> m) restNow, we can build up chains:
recoverM mon (Cons (Digest f) rest) = recoverM (mon >>= f) rest
*Embeddings> :t Cons (Feed getChar) $ Cons (Digest (return . ord)) Niland execute them too:
Cons (Feed getChar) $ Cons (Digest $ (return . ord)) Nil :: forall a. Thrist (Monad' IO) a Int
*Embeddings> recoverM getChar (Cons (Feed getChar) $ Cons (Digest (return . ord)) Nil)The first 'G' is consumed by the getChar invocation that is given to recoverM, but its result is ignored. The second is having an effect too, and its result is used to obtain the ASCII value. The overall monadic action is then run by ghci, printing the 71.
Loading package haskell98 ... linking ... done.
GG71
We have converted >> getChar >>= (return . ord) into a thrist and got it back by means of recoverM! And this works for any monad.
Two shortcuts come to my mind, they can be modelled by adding two simpler variants to the GADT:
Feed' :: Monad m => b -> Monad' m a bThe semantics should be clear, I leave the extension of recoverM to the reader.
Digest' :: Monad m => (a -> b) -> Monad' m a b
Now what about the value fixpoint mfix? Can we dress it in thristy clothes? Frankly, I have no idea, but I would be happy to hear from you. It is late, I need sleep now. Bye.
Thursday, January 31, 2008
Embeddings, Part One: Arrow --> Thrist
Dan Weston asked me how the embeddings of arrows and monads into thrists would look like.
Took me some effort, but I think I have found something that appears to be satisfactory.
This time I shall prove my claim that in Haskell any Control.Arrow instance can be rewritten as a Thrist data structure. I shall also provide a semantics for the resulting data that shall recover the original arrow. Let me recapitulate the Thrist definition before diving in:
First is responsible for defining a transformation on a pair's first component. Clearly this corresponds to the Arrow type class's first member. I refrain from doing the same to the other functions in the Arrow type class. You may scratch your head now, asking whether I have forgotten to handle the >>> member... The answer is no, I'll come back to this later.
Let's see how this works out in the practice...
chr >>> (\a -> (a, 42)) >>> first ord.
The Cons data constructor assumes the rôle of >>>.
To obtain a meaning at all, we have to define a semantics for it! Here it comes, and allows us to recover the embedded arrow:
Finally, a little test:
PS: To embed monads, you can try embedding them via Kleisli to obtain an arrow and then via Arrow' to construct a thrist. Alternatively wait for a future post here :-)
PPS: For running the above you will need these preliminaries:
Took me some effort, but I think I have found something that appears to be satisfactory.
This time I shall prove my claim that in Haskell any Control.Arrow instance can be rewritten as a Thrist data structure. I shall also provide a semantics for the resulting data that shall recover the original arrow. Let me recapitulate the Thrist definition before diving in:
data Thrist :: (* -> * -> *) -> * -> * -> * whereLet's begin with the embedding part. Since the members of thrists are constructors of a two-parameter GADT, we have to construct the Arrow' GADT first. Here is the adapter:
Nil :: Thrist p a a
Cons :: p a b -> Thrist p b c -> Thrist p a c
data Arrow' :: (* -> * -> *) -> * -> * -> * whereArr takes any arrow and wraps it. This constructor is responsible for the actual embedding and it is easy to see that it is completely general.
Arr :: Arrow a => a b c -> Arrow' a b c
First :: Arrow a => Arrow' a b c -> Arrow' a (b, d) (c, d)
First is responsible for defining a transformation on a pair's first component. Clearly this corresponds to the Arrow type class's first member. I refrain from doing the same to the other functions in the Arrow type class. You may scratch your head now, asking whether I have forgotten to handle the >>> member... The answer is no, I'll come back to this later.
Let's see how this works out in the practice...
t1 :: Thrist (->) Char CharThis is just the plain chaining of functions as a thrist. Here is the Arrow' embedding:
t1 = Cons ord (Cons chr Nil)
t2 :: Arrow' (->) Char IntWe make a thrist of length 1:
t2 = Arr ord
t3 :: Thrist (Arrow' (->)) Char IntThis is bit more involved, adding a constant second pair component:
t3 = Cons t2 Nil
t4 :: Arrow' (->) a (a, Int)Now it is time to form a longer thrist:
t4 = Arr (\a -> (a, 42))
t5 :: Thrist (Arrow' (->)) Int (Int, Int)So, this corresponds to what? Intuitively, it could mean
t5 = Cons (Arr chr) (Cons t4 (Cons (First t2) Nil))
chr >>> (\a -> (a, 42)) >>> first ord.
The Cons data constructor assumes the rôle of >>>.
To obtain a meaning at all, we have to define a semantics for it! Here it comes, and allows us to recover the embedded arrow:
recover :: Arrow a => Thrist (Arrow' a) b c -> a b cSome people call this an (operational) interpreter.
recover Nil = arr id
recover (Cons (Arr f) r) = f >>> recover r
recover (Cons (First a) r) = first (recover $ Cons a Nil) >>> recover r
Finally, a little test:
*Embeddings> (recover t5) 55Cute, isn't it?
Loading package haskell98 ... linking ... done.
(55,42)
PS: To embed monads, you can try embedding them via Kleisli to obtain an arrow and then via Arrow' to construct a thrist. Alternatively wait for a future post here :-)
PPS: For running the above you will need these preliminaries:
module Embeddings where
import Prelude
import Control.Arrow
import Char
Tuesday, December 11, 2007
What Ωmega has joined, no one can take apart
Some days ago I became enthusiastic about creating data structures with some pretty strong invariants. What I did not expect were the difficulties actually providing the evidence that the preconditions hold, and that taking apart the the data into pieces without losing the constraints can become a challenge. In fact, when embarking on programming with SingleLabel, a bunch of proof obligations arise.
So the title is actually misleading (sounds cool nevertheless!) and should read "... no one can take apart losslessly". Of course a non-empty SingleLabel can be taken apart by pattern matching! This article is about propositional facts, and their propagation.
The More data constructor is only accessible if the constraint Equal rest {drop l rest} is given. Last time no evidence was needed, because rest was explicitly known, so the drop type function could be evaluated and the equality checked. But when we just want to join a new head with an unknown tail?
This should still be possible in case that a witness is provided, that testifies that the tail does not contain the label occurring in the new head. Simple enough,
When this problem is fixed, I'll come back and tell you the story of the lossless decomposition (so that calling join with the results gives us the original back).
So the title is actually misleading (sounds cool nevertheless!) and should read "... no one can take apart losslessly". Of course a non-empty SingleLabel can be taken apart by pattern matching! This article is about propositional facts, and their propagation.
The More data constructor is only accessible if the constraint Equal rest {drop l rest} is given. Last time no evidence was needed, because rest was explicitly known, so the drop type function could be evaluated and the equality checked. But when we just want to join a new head with an unknown tail?
This should still be possible in case that a witness is provided, that testifies that the tail does not contain the label occurring in the new head. Simple enough,
join :: Label a -> v -> SingleLabel tail -> Equal {drop a tail} tail -> SingleLabel {a=v; tail}rshould do it. Unfortunately I am running against the wall again. The Eq witness is not accepted by Ωmega, on the grounds that no refinement can be established. The debugging session tomorrow promises to become an adventure!
join a v t Eq = {a=v; t}s
When this problem is fixed, I'll come back and tell you the story of the lossless decomposition (so that calling join with the results gives us the original back).
Friday, December 7, 2007
Only one of each, please!
Must be all those sleepless nights (no, I do not want to blame my 6 month old daughter) that light the fire of creativity sometimes...
Two nights ago I had this marvelous idea: how to say that a list is indeed a set. Dear reader, if you do not like maths, please go home now. I am in the happy state of a half bottle of Shiraz and only want people around who share my interests!
Okay let's come to the point. As you probably know, Ωmega provides a Row kind, it is a classifier for associations at the type level.
The shorthand notation {`a=Int, `b=Char}r can express a record where a component labelled with `a contains values of type Int while under the key `b we can store Chars. This is probably boring stuff if you already understand extensible records.
Now let's descend to the value plane. A value can have a type SingleLabel {`a=Int, `b=Char}r when we define following datatype:
data SingleLabel :: Row Tag * ~> * whereNone :: SingleLabel RNilMore :: Label l -> t -> SingleLabel rest -> SingleLabel {l=t;rest}rderiving Record(s)
As usual we have two data constructors, one for the empty record and one for augmenting an existing one. The "deriving" clause just tells Ωmega to enable the following syntactic sugar for None and More: More `lab 42 None === {`lab=42}s, nothing exciting.
But no one can hinder us to define a record value with repeated identical keys: {`a=25, `a="42"}s is a perfectly legal value of type {`a=Int, `a=[Char]}s but it really is incompatible with our aim of defining a one-to-one correspondence between labels and values.
Now enters my cunning idea: using a constraint, we can restrict the tail of the list to not contain a key we are about to add. Constraints are much like Haskell's type classes, but they are less ad-hoc and mirror propositional facts (properties). The new definition could look like this:
data SingleLabel :: Row Tag * ~> * whereNone :: SingleLabel RNilMore :: Equal rest {drop l rest} => Label l -> t -> SingleLabel rest ->
SingleLabel {l=t;rest}rderiving Record(s)
We have defined following semantics: Under the restriction that dropping the tag l from the tail does not change it at all (a fixpoint!), we can construct an augmented list. We have almost reached our self imposed objective: it only remains to provide a suitable definition for drop!
Now, Ωmega helps us in this respect too (but not entirely so for Tags), as we can define functions on types (or kinds for that matter). We can do it this way:
drop :: Tag ~> Row Tag * ~> Row Tag *{drop o RNil} = RNil{drop `a (RCons `a v r)} = {drop `a r}{drop `b (RCons `b v r)} = {drop `b r}{drop `b (RCons `a v r)} = RCons `a v {drop `b r}{drop `a (RCons `b v r)} = RCons `b v {drop `a r}
This should be enough for demonstration purposes. (We get a quadratic blowup with increasing number of desired labels, since we have to write our definition in an inductively sequential fashion).
With all of this in place we can give a whirl to our SingleLabel datatype:
prompt> {`a=42}s
--> {`a=42}s :: SingleLabel {`a=Int}r
prompt> {`a=42,`b='K'}s
--> {`a=42,`b='K'}s :: SingleLabel {`a=Int,`b=Char}r
prompt> {`a=42,`a='K'}s
--- Type error: cannot find a solution for equation {`a=Char}r == {}r
We can now go to bed with the feeling of having accomplished a thing...
PS: Ωmega in its last released form cannot compile the code presented here. You will need some patches to get various bugs ironed out. With some luck this all will work out of the box in the next release!
Subscribe to:
Posts (Atom)