composable io
Continuing again from the last post. What if I have a stream of bytes from a socket and I want to decode it to a string as utf8, and then I want to lex those strings into tokens?
What if I want a composition stack from the basic io up to something complex, something like this:
socket map> utf8decoder map> mytokenizer map> stdout.
This seems like a reasonable request. But what we learnt last post was that this requires putting method pointers into structures, and that can (and likely would) get in the way of optimisation. If I were to write this code without the nice composition it might look something like this:
[
bytes := RingBuffer[Byte, 1024].
chars := RingBuffer[Character, 256].
tokens := RingBuffer[Token, 16].
status := [socket read-into: bytes] while: OK.
[utf8decoder write-from: bytes] while: OK.
[utf8decoder read-into: chars] while: OK.
[mytokenizer write-from: chars] while: OK.
[mytokenizer read-into: tokens] while: OK.
[stdout write-from: tokens] while: OK.
status == Closed then: [return].
] repeat
That code is fine but it does require me to define my own buffers. The optimal buffer sizes aren't clear either - mere guesses in the dark. And I'm not handling errors. So writing this code properly will get noisy fast.
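For a rough comparison, here is the same hand-rolled shape in Rust; the whitespace split stands in for mytokenizer, and partial utf8 sequences at buffer boundaries are glossed over, so treat it as a sketch rather than a faithful port:

use std::io::{Read, Write};
use std::net::TcpStream;

// Hand-rolled pipeline: socket -> bytes -> text -> tokens -> stdout.
fn pump(mut socket: TcpStream) -> std::io::Result<()> {
    let mut bytes = [0u8; 1024];               // buffer size is still a guess
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    loop {
        let n = socket.read(&mut bytes)?;
        if n == 0 { return Ok(()); }           // socket closed
        let text = String::from_utf8_lossy(&bytes[..n]);
        for token in text.split_whitespace() { // stand-in tokenizer
            writeln!(out, "{}", token)?;
        }
    }
}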
We might consider using the composition approach from two posts ago and accept the lack of optimisation for clarity in our code. But one of my goals with STZ is to allow clarity without sacrificing performance.
So on that note - how do we create custom stacks? Well, sort of with sub-typing. We are already reading from a SocketStream into buffers. But we can define a UTF8Stream and a MyTokenizerStream, etc.
composed-stream := {MyTokenizerStream |
source: {UTF8Stream |
source: socket}}.
composed-stream copy-into: stdout.
Now we have a self-contained object we can pass about that won't be inefficient with buffers or method pointers. It can also be inlined completely as we're not using any closures or blocks.
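As a point of comparison, the same shape sketched in Rust: each stage is a struct that owns its source, so the whole stack collapses into one concrete type the compiler can inline through (the stage names are illustrative and their read methods are omitted):

use std::net::TcpStream;

// Each stage wraps the one below it; no trait objects, no function pointers.
struct Utf8Stream<R> { source: R }
struct MyTokenizerStream<R> { source: R }

fn composed_stream(socket: TcpStream) -> MyTokenizerStream<Utf8Stream<TcpStream>> {
    MyTokenizerStream { source: Utf8Stream { source: socket } }
}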
We can also add in a composition method to create the structure for us. Let's give that a go:
[readable · writeable]
[&Readable, WriteableClass -> Writeable |
{WriteableClass | source: readable}].
socket · UTF8Stream · MyTokenizerStream copy-into: stdout.
Or we can go back to the previous syntax:
[readable map> writeable]
[&Readable, WriteableClass -> Writeable |
{WriteableClass | source: readable}].
socket map> UTF8Stream map> MyTokenizerStream copy-into: stdout.
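That composer has a close analogue as a generic helper in Rust - a trait for "construct this stage from a source" plus a function that forwards to it; both names here are made up for the sketch:

// The 'WriteableClass' side of the composer: anything constructible from a source.
trait FromSource<R> {
    fn from_source(source: R) -> Self;
}

// The composer itself: wrap `readable` in whatever stage the caller asks for.
fn compose<R, W: FromSource<R>>(readable: R) -> W {
    W::from_source(readable)
}

With a FromSource impl for each stage, compose(socket) builds the same Utf8Stream wrapper as before, still fully monomorphised.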
Except a filter or a map or any other processing operation need not have a uniquely named method anymore. The · elegantly describes the composition - though not the direction of flow. I'd happily use → except we already use that to build block/method types.
We could use the old C++ iostreams >> but that feels like it's drawing in mental baggage. We could have some fun and use ~> instead because it's pretty:
socket ~> UTF8Stream ~> MyTokenizerStream copy-into: stdout.
There's one other operation here I want to explore because the copy-into: looks clunky next to our ~> - and that is iteration in general.
// the Smalltalk approach
people do: [person | person jump]
// the C-like approach
for: people do: [person | person jump]
// the current approach
people into: [person | person jump]
// the current approach with implicit receiver
people into: [jump]
Clearly into: is a bad verb. do: works better but is only familiar to Smalltalk programmers. The idea of iterating over things is so fundamental that we should try and find a way to symbolise it.
It would be fun to do away with a verb completely and make it part of the syntax:
people[jump]
But that's also sub-indexing, so we can't do it like that. Higher-order functions might be an approach we could explore. Maybe we could tag the variable people in such a way that we know we want to send messages to all its children. But that wouldn't let us pass a block to the iteration method.
We use / to mean 'or' for types, ie: 'Person / Failure'. That means using '/' to indicate a subfolder, or to look at the things inside something, is also out.
We can't use > because the object we're talking to might actually be an integer. Though we could make sure all arrays aren't comparable – except vectors are comparable, so we can't do that either.
We could have a new kind of block syntax, like:
person [[jump]]
But that starts overloading the visuals too much. Less syntax is better which is why a verb tends to be the best answer.
We could use the type of the right-hand-side to determine what to do:
// composition
socket ~> MyClass
// action
people ~> [jump]
But we're overloading the concept, though not the intent - which is "left side flows into right side." Part of this works if all composition is done with a class on the right-hand side and all actions are done with a variable on the right-hand side. stdout, after all, is a file descriptor, not a class.
socket ~> UTF8Decoder ~> MyTokenizer ~> stdout.
people ~> [person | person jump]
people ~> [jump]
people ~> [person -> :Boolean | person age ≥ 18] ~> stdout.
All of that looks good except for the filter operation. If we can get rid of the typing there we're on easy street.
The problem is how to differentiate between a block that returns nothing and a block that returns a boolean. jump could return a boolean just like ≥ does, but we might only care about the return type in one circumstance.
And yes, I recognise the irony of going through all of this work to make optimisable stacking of components only to throw the filter back in at this late stage, which would require a method pointer - but let's break that decision down for a moment:
Operations like converting bytes to a string need to be efficient. Your tokenizer might not need to be, but you might choose it to be. We want tight loops where it matters most. If filtering people by age really mattered you could make an OfAge filter class.
The important point is to know how to do it right and also know how to do it fast. It is worrying how simple it would be to write a slow version though.
What are we doing with a filter anyway? It's a view over an array that we're creating. Should it be random access? No, that's very inefficient; so we do it as a stream over the original data. If we needed random access we'd be better off using the stream to make a new array with just the things we want in it:
of-age: Array[Person].
people ~> [person -> :Boolean | person age ≥ 18] ~> of-age.
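Rust's iterators make the same distinction, which is a useful sanity check on the design: filter is a lazy stream over the original data, and building the random-access array is a separate, explicit step. A small sketch with an assumed Person type:

struct Person { age: u32 }

fn of_age(people: &[Person]) -> Vec<&Person> {
    people.iter()
        .filter(|person| person.age >= 18) // lazy view, no allocation yet
        .collect()                         // materialise the new array here
}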
The simplest way to solve this problem is to have another verb: as map> has become ~>, so we should convert filter> into something else; perhaps /> to indicate a subset. It's a shame there isn't an easy-to-type Unicode subset symbol from maths we could use - the horseshoe with a line under it (⊆). We could create a diamond instead with <> to indicate 'less stuff into the next part of the flow', or even a spaceship-like method:
people <~> [age ≥ 18] ~> stdout.
But that looks very bidirectional and not indicative of 'narrow this dataset down'.
people /> [age ≥ 18] ~> stdout.
This looks like we left a bit of html in our code by accident.
people ?> [age ≥ 18] ~> stdout.
The ?> makes it clear we're asking a question. We're also still flowing it on. We could have a longer method name of ?~> but I'm not sure that's actually too useful.
It'd be fun to put the question mark at the end of the block:
people ~> [age ≥ 18]? ~> stdout.
But I don't think that fits the idioms of STZ too well so far. Let's see how things look if we use our words:
people select> [age ≥ 18] ~> stdout.
people reject> [age < 18] ~> stdout.
This leaves open other kinds of reduce operations. The generalised "move data from left to right" ~> works well in all the other cases.
// how many of-age people are there?
people select> [age ≥ 18] reduce> [length] ~> stdout.
// how many years for all of us?
people sum> [age] ~> stdout.
Here reduce> is given the whole buffer to do with as it pleases and it will output a single answer. A map>, on the other hand, is given the buffer and outputs another buffer. I also threw in sum> as a simplified reduce that adds up all the values. This is typically done with inject:into: in Smalltalk.
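Those two pipelines map directly onto fold-style operations over a stream; a quick sketch in Rust iterator terms, reusing the Person type from the earlier sketch:

// how many of-age people are there?  (select> ... reduce> [length])
fn adult_count(people: &[Person]) -> usize {
    people.iter().filter(|p| p.age >= 18).count()
}

// how many years for all of us?  (sum> [age])
fn total_years(people: &[Person]) -> u32 {
    people.iter().map(|p| p.age).sum()
}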
socket
~> UTF8Decoder
~> CSVDecoder[Person]
select> [age ≥ 18]
reduce> [length]
~> stdout.
I can't really argue with that. It's probably worth having map> be an alias for ~>, as well as copy-into> and print>, just so explicit programming can be stylistically chosen when it helps:
socket
map> UTF8Decoder
map> CSVDecoder[Person]
select> [age ≥ 18]
reduce> [length]
print> stdout.
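For comparison, here is the whole stack written against Rust's standard library - with a hand-rolled CSV parse and an assumed column layout where age is the second field, so very much a sketch:

use std::io::{BufRead, BufReader};
use std::net::TcpStream;

fn count_adults(socket: TcpStream) -> usize {
    BufReader::new(socket)                  // socket, buffered and decoded line by line
        .lines()
        .filter_map(|line| line.ok())       // drop io errors for brevity
        .filter_map(|line| line.split(',').nth(1)
            .and_then(|field| field.trim().parse::<u32>().ok()))
        .filter(|&age| age >= 18)           // select> [age ≥ 18]
        .count()                            // reduce> [length]
}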
But of course we're not done exploring this yet. Extending types with new methods is core to STZ. Why not add a new composable verb for each kind of stage we have?
socket
decode-utf8
decode-csv
select> [age ≥ 18]
reduce> [length]
print> stdout.
The problem we have here is decode-csv - now we can't give it a type to do compile-time type reflection for mapping CSV columns into an object. That's a big loss. We don't want to lose that. We also lose the ability to see that we're "flowing" information from socket to a select filter. We're just sending messages to an object, and they could be configuration or who knows what. In this case it's not clear we're wrapping the socket and then the utf8decoder.
Sometimes having less API is beneficial. Let's rewind:
stream := {UTF8Decoder | source: socket}.
people := {CSVDecoder[Person] | source: stream}.
of-age := {Filter | source: people, op: [age ≥ 18]}.
of-age count print> stdout.
In many ways this is a much simpler program to reason about. You don't have to go hunting for explanations of APIs. You don't need to know what map> means. We should always prefer a solution that has less boilerplate but remains clear and produces an efficient result. This approach ticks those boxes.
The {Filter|} seems a bit heavyweight. But it's also clear that it's the same as all the other steps. For now I will err on the side of caution and clarity and say that this approach is the superior one.
Maybe not for brevity, but definitely for clarity.
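For what it's worth, the closure-carrying {Filter|} shape doesn't have to cost a method pointer. In Rust the equivalent stores the block as a generic parameter, so the predicate stays statically dispatched and inlinable; a sketch (not the standard library's own Filter):

// A filter stage that owns both its source and its predicate block.
struct Filter<I, F> { source: I, op: F }

impl<I, F> Iterator for Filter<I, F>
where
    I: Iterator,
    F: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        loop {
            let item = self.source.next()?; // pull from the wrapped stream
            if (self.op)(&item) {           // ask the block: keep it?
                return Some(item);
            }
        }
    }
}

Wrapping people's iterator in this Filter with [age ≥ 18] as the op gives roughly the of-age stream from earlier, with the block inlined much like a hand-written loop.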
The question to ask then is "What do we do often?" The things we do often need brevity, while the rest need clarity. Configuring a stack for the io is something you don't do often. But working with arrays of objects is something we do a lot.
people := {CSV[Person] | source: {UTF8 | source: socket}}.
people select> [age ≥ 18] print> stdout.
And that's where we end for now: a combination of clarity and brevity. Where we get people from can be anywhere - an array in memory, a database, a socket like in this example. It doesn't matter, and that's where we draw the line between convenience methods and infrastructure.