A tale of two reductions in gas consumption in&nbsp;Tezos

Introduction
Tale 1: a new tail recursive monadic interpreter
Tale 2: hunt for a serialization performance bug
Conclusion

Introduction

We subscribe to Kent Beck’s motto:

Make it work, make it right, make it fast!

— in that order. This implies an incremental software development process:

When you make it work, you may elide corner cases or design considerations required to make it right; and
when you make it right, you may introduce abstractions which are good for provability but are too naive to make it fast in the implementation.

This is normal: incremental development separates concerns so we can properly deal with each of them.

We’d like to share with you two tales of making Tezos faster in Granada by following the motto above:¹

Tale 1 led to an optimization of the Michelson interpreter; and
Tale 2 led to an optimization of the serialization layer.

Both tales are set after the “make it right” stage and during the “make it fast” stage. The code was clean (and we wanted to keep it that way); we just wanted to increase performance.

With these two optimizations, Granada — the latest, and recently-accepted, amendment proposal of Tezos economic protocol — executes smart contracts with much better gas consumption.² For example, a typical call to Dexter entrypoint XtzToToken consumes roughly

75600 units of gas in Florence, and
9320 units of gas in Granada.

That’s an eightfold improvement.³

Tale 1: a new tail recursive monadic interpreter

Tezos executes smart contracts in Michelson, a strongly-typed low-level language.⁴ Roughly speaking, when provided with

an input argument, and
a storage state, and
a stack of pending operations,

a Michelson script computes

a new storage state, and
a new stack of pending operations, and
it may also carry out some concrete action to change the state of the blockchain — the technical term is a side-effect.

An aside: side-effects via monads

Monads are how functional programmers deal with effectful computation, meaning programs that may compute a return value and also have side-effects on the ambient world as they do so (input/output, token transfers, etc). The impatient reader is welcome to skip forward to our toy MichelBaby interpreter and return to this subsection for reference if required, especially when the reader sees us mention bind later.

Still here? Great!

A monad is a type-constructor, meaning that given a type t and a monad monad, we have a type t monad. There are many different monads — e.g. the error monad, the state monad, and the Lwt (lightweight threads) monad — but for the purposes of this blog post, these all just represent actions on some state machine.⁵ Thus t monad is a type for computations that compute a t while exerting side-effects on a state monad monad.

A monad comes with the following key operations:

return : 'a -> 'a monad
This inputs a value of type 'a and returns that same value, in an empty (or initial) state.
bind : 'a monad -> ('a -> 'b monad) -> 'b monad
This binds a stateful computation from 'a to 'b, to an input 'a-in-state, thus obtaining an output 'b-in-state.
run : 'a monad -> 'a
This discards the state and just returns the value.⁶

C, Java, or Python have monads, or more precisely one monad: a single “global state monad” which represents how the actual state of the computer changes as it executes instructions. Since in these languages the global state monad permeates all computation, it need not be mentioned explicitly — but it is still there, as you can check: just run a virtual machine, take a snapshot, et voilà your global state is saved to disk as a datum.

OCaml (in common with may other functional programming languages) allows us to be far more fine-grained, by bundling state-fragments up into monads.⁷ This has at least two benefits:

fragments of machine state are now encapsulated as typed data and can be passed as arguments to other functions (as our crude example above of snapshotting a VM illustrates); and also,
using monads encourages the programmer to be discriminating about precisely what effects a given computation really needs — if any.

So if we think about it, execution of a Michelson script requires a few distinct kinds of state:

The state of the blockchain (e.g. to transfer tokens).
The gas level consumed in the current block (especially in case we are about to run out of gas).
A script may interact with the environment via I/O.
A script may also fail because it tried to perform some invalid operation, and this too is interaction with state; namely the failure monad whose effect (if triggered) is to say “I have no return value because I have failed”.

In practice the Michelson interpreter uses a monad in OCaml which is the composition of three different monads: the error monad, the state monad, and the Lwt (lightweight threads) monad.

If the reader sees return, bind, or run below — they are just the monadic glue coming from this or a closely related monadic combination.

An interpreter for Michelson scripts

An interpreter is a program that runs scripts. Let’s sketch what an OCaml interpreter might look like for a simple object-level language⁸ MichelBaby. MichelBaby is derived from a subset of Michelson’s instructions as used in the currently-live economic protocol Florence; so what follows can be viewed as a simplified but indicative reflection of the smart contracts system as it is currently deployed.

Here’s the OCaml type declaration for the abstract datatype instr of MichelBaby instructions (the impatient reader can skip forward to compare with the second version):

type (_, _) instr =
  |  Push : 'b -> ('s, 'b * 's) instr
  |   Mul :       (z * (z * 's), z * 's) instr
  |   Dec :       (z * 's, z * 's) instr
  | CmpNZ :       (z * 's, bool * 's) instr
  |  Loop :       ('s, bool * 's) instr
               -> (bool * 's, 's) instr
  |   Dup :       ('a * 's, 'a * ('a * 's)) instr
  |  Swap :       ('a * ('b * 's), 'b * ('a * 's)) instr
  |  Drop :       ('a * 's, 's) instr
  |   Dip :       ('s, 't) instr
               -> ('a * 's, 'a * 't) instr
  |    IO :       ( z * 's,  z * 's) instr
  |   Seq :       ('s, 't) instr * ('t, 'u) instr
               -> ('s, 'u) instr

(The type z above is the integer type from the standard OCaml library Zarith for infinite precision arithmetic.)

We notice that:

instr is polymorphic over two type parameters: ('s, 't) instr.
- The first parameter 's represents the input stack type and
- the second parameter 't represents the output stack type.
Each line in the datatype declaration corresponds to an individual instruction. The type parameters give useful information on the intended meaning. For example:
- Mul : (z * (z * 's), z * 's) instr is an instruction that inputs a stack headed by two integers (z * (z * 's)) and outputs a stack headed by one integer (z * 's). Intuitively, Mul pops the two integers off the stack, multiplies them, and pushes the result.
- Push : 'b -> ('s, 'b * 's) instr is an instruction that inputs a value of type 'b and returns an instruction that inputs a stack of type 's and outputs a stack of type 'b * 's. Intuitively, Push b pushes b (of course).
There is a special sequence constructor Seq that
- inputs an instruction that inputs stack 's and outputs a stack 't, and
- inputs an instruction that inputs a stack 't and outputs a stack 'u, and
- returns a composite instruction that inputs a stack 's and outputs a stack 'u.
Note how the types make it obvious what the interpreter should do with I Seq J: first I, then J.
IO is a basic effectful instruction — for concreteness, the reader can assume that it writes the topmost integer of the stack to a file, and could be used for example as a hook for logging and profiling execution.

MichelBaby typechecking for free, courtesy of the OCaml typechecker

This style of polymorphic datatype declaration makes the OCaml typechecker act as a MichelBaby typechecker too: only (representations of) well-typed scripts can inhabit the instr type. For example, consider the following MichelBaby implemention of fact the factorial:

(* Inline syntactic sugar for sequencing *)
let ( @ ) s1 s2 = Seq (s1, s2)

(* MichelBaby instruction to calculate n! *)
let fact n =
  assert (Z.(n > zero));
  Push n
  @ Push (Z.of_int 1)
  @ Dup @ CmpNZ @ Dip Swap
  @ Loop (Dup @ Dip Swap @ Mul @ IO @ Swap @ Dec @ Dup @ CmpNZ)
  @ Drop

Hurrah, this is well-typed! OCaml’s typechecker does not know that fact is a MichelBaby sequence of instructions to compute the factorial, but it does certify that fact has the following type:

fact : z -> ('s, z * 's) instr

In English, this is the OCaml typechecker telling us:

fact will input an integer and return a MichelBaby instruction that inputs a stack and returns a stack of the same shape except it has an integer pushed onto it.

Now suppose we to forget the CmpNZ instruction above. Then the OCaml typechecker will reject the definition with the following error message:

This expression has type (bool * 'a, 'b) instr
but an expression was expected of type (Z.t * 'c, 'd) instr
Type bool is not compatible with type Z.t

This is the OCaml typechecker, giving us MichelBaby typechecking for free.

Tracing factorial: a standard toy example

Our implementation fact of factorial is a tracing factorial, meaning that it computes factorial and contains an IO tracing (or breakpoint) hook so we can observe intermediate results of the computation using an input/output primitive of our choice — a similar hook is in our efficient OCaml factorial function below.

Tracing factorial is interesting because it is an easy-to-understand toy example of a mostly pure computation with some side-effects — which is what most smart contracts look like. This is a convenient toy example: no implication is intended that tracing factorial is the only thing that could be done in MichelBaby or Michelson!

The `step` function, Version 1

The instr datatype allows us to represent well-typed programs. Now we need to design an interpreter. First, suppose we are given a function step

step : ('i, 'o) instr -> 'i -> 'o monad

Then we can write our interpreter as follows:

type ('storage, 'argument) well_typed_script =
  (('storage * 'argument) * unit, (('storage * operations) * unit)) instr

type ('storage, 'argument) interpretation =
  'storage -> 'argument -> ('storage * operations) monad

let interpreter : type storage argument.
  (storage, argument) well_typed_script -> (storage, argument) interpretation  =
  fun instr storage argument ->
    bind (step instr ((storage, argument), ()))
    @@ fun ((storage, operations), ()) -> return (storage, operations)

So now we just have to write code for step. Standard practice is to just follow the inductive structure of instr.

We present Version 1 of the step function (cf. Versions 1.1 and 2 below):

let rec step : type a b. (a, b) instr -> a -> b monad =
  fun instr stack ->
  match (instr, stack) with
  | Push x, stack
    -> return (x, stack)
  | Mul, (x, (y, stack))
    -> return (Z.mul x y, stack)
  | Dec, (x, stack)
    -> return (Z.(sub x (of_int 1)), stack)
  | CmpNZ, (x, stack)
    -> return (Z.(compare x zero <> 0), stack)
  | Loop _, (false, stack)
    -> return stack
  | Loop body, (true, stack)
    -> bind (step body stack) @@ fun stack -> step (Loop body) stack
  | Dup, (x, stack)
    -> return (x, (x, stack))
  | Swap, (x, (y, stack))
    -> return (y, (x, stack))
  | Drop, (_, stack)
    -> return stack
  | Dip i, (x, stack)
    -> bind (step i stack) @@ fun stack -> return (x, stack)
  | IO, (z, _)
    -> bind (io z) @@ fun () -> return stack
  | Seq (i1, i2), stack
    -> bind (step i1 stack) @@ fun stack -> step i2 stack

The recursive function step observes the instruction and the stack and produces a monadic computation using return and bind. Let’s read through some cases of this definition:

Case for Push. As prescribed by the definition of Push, the input stack type is 's and we must return a stack of the form 'b * 's. That is the type of (x, stack). Besides, since we need the returned value to be of type ('b * 's) monad, we use return to turn (x, stack) into a monad computation.
Case for Seq. As prescribed by the definition of Seq, the instruction i1 must be of type ('s, 't) instr and i2 of type ('t, 'u) instr. The recursive call (step i1 stack) is well-typed and returns a value of type 't monad. Using bind, we can retrieve the stack of type 't needed to interpret i2. The final computation has type 'u monad as expected.

Case for IO. Here, we use an effectful operation io : z -> unit monad and compose it with a computation that returns that stack unchanged.

We check that the interpreter correctly computes the factorial on an input (other inputs work too!):

# run (step (fact (Z.of_int 100)) ()) |> fst = ocaml_fact (Z.of_int 100);;
- : bool = true

This interpreter has a good property: it can’t fail. Because we get MichelBaby type-checking for free, every application of step to a well-typed argument in instr is guaranteed to successfully execute (provided the ambient execution environment doesn’t suffer an overflow; see next paragraph). We can’t fail on an ill-formed operation like multiplying two strings or popping an empty stack, because our free MichelBaby typechecker will detect and reject the instruction-sequence as ill-typed.

Unfortunately, our interpreter above is not yet “right”, because step is not tail recursive. Note that:

Recursive calls to step are not the final step in the computation in the clauses for Seq (in step i1 stack) and Dip (in step i stack).
For comparison, the calls in Loop to step (Loop body) stack and in Seq to step i2 stack, are tail recursive.

This is not “right” because each call to the interpretation loop that passes through a non-tail-recursive call to step, may consume a bit of the OCaml calling stack, and we can use this to provoke incorrect behaviour, namely a stack overflow (when the call stack fills up and the computer has to terminate execution because it runs out of memory):

# let rec long_seq accu n =
  if n = 0 then accu
           else long_seq (Seq (accu, Seq (Push Z.zero, Drop))) (n - 1);;
val long_seq : ('a, 'b) instr -> int -> ('a, 'b) instr = <fun>
# step (long_seq (Push Z.zero) 10000000) ();;
Stack overflow during evaluation (looping recursion?).

When the stack overflow error occurs, is architecture-dependent and depends on how much stack is available to consume. A cleaner way to handle this is to count the nesting depth of non-tail-recursive calls, and abort if we go too far.⁹

The `step` function, Version 1.1

We add a depth counter to Version 1 of step (cf. Version 2 below):

let rec step : type a b. int -> (a, b) instr -> a -> b monad =
  fun depth instr stack ->
  if depth > 100000 then fail "Too many recursion"
  else match (instr, stack) with
  | Push x, stack
      -> return (x, stack)
  | Mul, (x, (y, stack))
      -> return (Z.mul x y, stack)
  | Dec, (x, stack)
      -> return (Z.(sub x (of_int 1)), stack)
  | CmpNZ, (x, stack)
      -> return (Z.(compare x zero <> 0), stack)
  | Loop _, (false, stack)
      -> return stack
  | Loop body, (true, stack)
      -> bind (step (depth + 1) body stack) @@ fun stack -> step depth (Loop body) stack
  | Seq (i1, i2), stack
      -> bind (step (depth + 1) i1 stack) @@ fun stack -> step depth i2 stack
  | Dup, (x, stack)
      -> return (x, (x, stack))
  | Swap, (x, (y, stack))
      -> return (y, (x, stack))
  | Drop, (_, stack)
      -> return stack
  | Dip i, (x, stack)
      -> bind (step (depth + 1) i stack) @@ fun stack -> return (x, stack)
  | IO, (z, _)
      -> bind (io z) @@ fun () -> return stack

The program is still non-tail-recursive but at least we have replaced a stack overflow which we cannot control, with an explicit branch within our own program which we can control.

Some tests are required to determine whether 100000 is a good limit.¹⁰ Once this limit value is appropriately chosen, we can claim to have made the interpreter “right”:

it follows a standard OCaml programming style (hence natural to reason about), and
it is robust to stack overflows.

Making the interpreter fast

How fast is our step function? To get an idea, we can compare how long it takes to compute the factorial of a hundred — 100! = 100*99*98*...*1 — with an equivalent native OCaml implementation of the factorial:

let lwt_fact n =
  let rec aux k accu =
    if Z.(compare k zero = 0) then return accu
    else
      let accu = Z.mul accu k in
    bind (io accu) (fun () -> aux (Z.sub k (Z.of_int 1)) accu)
  in
  aux n (Z.of_int 1)

Here are the benchmarks, for the OCaml factorial function and V1.1 of our step function:

┌─────────────┬──────────┬────────────┐
│ Name        │ Time/Run │ Percentage │
├─────────────┼──────────┼────────────┤
│ OCaml       │   3.71us │      23.4% │
│ Step (V1.1) │  15.83us │     100.0% │
└─────────────┴──────────┴────────────┘

On this specific example, the interpreter is five times slower than the reference implementation in OCaml. Why? Profiling execution with the linux perf command, we learn that 50% of the time is spent in the monadic combinators (mentioned above). Mostly this is due to the code for handling Seq in the code above, in which bind is called to glue together the interpretation of the two instructions of a sequence: this has a negative impact on the performance because bind happens to be a rather complex operation when the monad includes the Lwt (lightweight threads) monad.

We need to get rid of those binds.

Reducing the number of `bind`s

Let’s reexamine the MichelBaby code for fact above:

(* Inline syntactic sugar for sequencing *)
let ( @ ) s1 s2 = Seq (s1, s2)

(* MichelBaby instruction to calculate n! *)
let fact n =
  assert (Z.(n > zero));
  Push n
  @ Push (Z.of_int 1)
  @ Dup @ CmpNZ @ Dip Swap
  @ Loop (Dup @ Dip Swap @ Mul @ IO @ Swap @ Dec @ Dup @ CmpNZ)
  @ Drop

The MichelBaby script contains only one inherently effectful instruction: IO. Yet each Seq (written @ above) calls a monadic bind operation, which is expensive. This means that we pay the cost of the monadic abstraction repeatedly, when in fact a pure computation could perform (all but one) of the calculations more efficiently.

Can we separate the pure instructions from the impure ones and then use bind just when impure instructions enter the scene? We can start by syntactically separating the pure computations from the impure ones, taking inspiration from our OCaml version of the factorial:

let lwt_fact n =
  let rec aux k accu =
    if Z.(compare k zero = 0) then return accu
    else
      let accu = Z.mul accu k in
      bind (io accu) (fun () -> aux (Z.sub k (Z.of_int 1)) accu)
  in
  aux n (Z.of_int 1)

Notice how

the accumulator accu accumulates the result of a sequence of multiplications — that’s the pure computation, whereas
the impure computation — namely the call to the effectful operation io — is performed in the body of the aux function.

We will now rephrase step so that:

The pure computations rewrite the input stack passed as an argument to a tail recursive call to the next instruction.
The impure computations return monadic computations.

First, we must rephrase the datatype that defines the instructions, so that every instruction points to the next instruction (compare the code below with the first version, and z explained here):

type (_, _) instr =
| KHalt  : ('s, 's) instr
| KPush  : 'b * ('b * 's, 'f) instr ->
                     ('s, 'f) instr
| KMul   :       (z * 's, 'f) instr ->
           (z * (z * 's), 'f) instr
| KDec   : (z * 's, 'f) instr ->
           (z * 's, 'f) instr
| KCmpNZ : (bool * 's, 'f) instr ->
              (z * 's, 'f) instr
| KLoop  : ('s, bool * 's) instr * ('s, 'f) instr ->
                            (bool * 's, 'f) instr
| KDup   : ('a * ('a * 's), 'f) instr ->
                 ('a * 's , 'f) instr
| KSwap  : ('a * ('b * 's), 'f) instr ->
           ('b * ('a * 's), 'f) instr
| KDrop  :      ('s, 'f) instr ->
           ('a * 's, 'f) instr
| KDip   : ('t, 's) instr * ('a * 's, 'f) instr ->
                            ('a * 't, 'f) instr
| KIO    : (z * 's, 'f) instr ->
           (z * 's, 'f) instr`

Let’s compare Mul from the first version with KMul here:

Mul : (z * (z * 's), z * 's) instr is an instruction that

*inputs a stack headed by two integers (z * (z * 's)) and * outputs a stack headed by one integer (z * 's).

Intuitively, Mul pulls the two integers off the stack, multiplies them, and pushes the result.
KMul : (z * 's, 'f) instr -> (z * (z * 's), 'f) instr is an instruction that
- inputs a sequence of instructions that inputs a stack headed by an integer z * 's and outputs a stack 'f, and
- outputs a sequence of instructions that inputs a stack headed by two integers z * (z * 's) and outputs a stack 'f.

Intuitively, KMul just adds Mul onto the head of an existing instruction sequence.

Our new datatype does not represent instructions; it represents instruction-sequences.
The datatype-constructors represent instructions to push instructions onto sequences.

Ever written a shopping list? The family sits around the table calling out household essentials, and then Dad (or Mum — whoever does the shopping) adds items to a written list. KMul corresponds to somebody saying “Don’t forget to buy multiplication” and then Dad (or Mum) writes “Mul-ti-pli-cation” on the end of the list.

Except that, because this blog post is about Very Serious Programming, our lists start on the right-hand-side and expand leftwards.

The issue with the cost of bind is also relevant in this analogy. Most of the time it doesn’t matter what order items are put into the shopping basket. So it makes sense to split the shopping list into a small number of big heavy things that need put in the trolley last, and a large number of smaller things that can be efficiently executed by rapidly traversing the supermarket in whatever order is most efficient for that supermarket’s layout. Isn’t computer science wonderful?

KHalt is a new instruction, to represent the end of our to-do list (think: EOF or EOS marker).

There is one catch to our to-do list / shopping list analogy (to be fair, the reader can expect programming smart contracts to be a little harder than shopping): it may not always be possible to determine the next instruction entirely in advance. Execution may dynamically depend on input parameters: for example, a control-flow operator like KLoop dynamically chooses the next instruction by observing a Boolean.

This means that we need another stack — a control stack — that will allow control-flow operators to dynamically define what the next instruction should be:

type (_, _) instrs =
  | KNil : ('s, 's) instrs
  | KCons : ('s, 't) instr * ('t, 'u) instrs -> ('s, 'u) instrs

Intuitively, if instr represents a sequence of instructions then instrs represents a list of sequences of instructions, with the constraint that the final stack type ('t above) of each sequence of instructions on the list, must be compatible with the input stack type ('t above) of the next sequence in the list (if any).

The `step` function, Version 2

We present Version 2 of the step function (cf Versions 1 and 1.1 above):

let step : type a b p. (a, b) instr -> a -> b monad =
  fun i stack ->
  let rec exec : type a i p. (a, i) instr -> (i, b) instrs -> a -> b monad =
    fun k ks s ->
      match (k, ks) with
      | KHalt, KNil -> return s
      | KHalt, KCons (k, ks) -> exec k ks s
      | KIO k, ks ->
          let z, _ = s in
          bind (io z) (fun () -> exec k ks s)
      | KPush (z, k), ks -> exec k ks (z, s)
      | KLoop (ki, k), ks -> (
          match s with
          | true, s -> exec ki (KCons (KLoop (ki, k), ks)) s
          | false, s -> exec k ks s )
      | KMul k, ks ->
          let x, (y, s) = s in
          exec k ks (Z.mul x y, s)
      | KDec k, ks ->
          let x, s = s in
          exec k ks (Z.sub x (Z.of_int 1), s)
      | KCmpNZ k, ks ->
          let x, s = s in
          exec k ks (Z.(compare x zero) <> 0, s)
      | KDup k, ks ->
          let x, s = s in
          exec k ks (x, (x, s))
      | KSwap k, ks ->
          let x, (y, s) = s in
          exec k ks (y, (x, s))
      | KDrop k, ks ->
          let _, s = s in
          exec k ks s
      | KDip (ki, k), ks ->
          let x, s = s in
          exec ki (KCons (KPush (x, k), ks)) s
    in
    exec i KNil stack

Compared with Version 1.1 and Version 1, Version 2 uses far fewer monadic combinators:

Version 2 has one use of return (for KHalt, of course) and one use of bind (for KIO).
Versions 1 and 1.1 have four uses of bind and more uses of return than we care to count.

How does this work?

Pure computations (i.e. stack-related and arithmetic operations) just get pushed onto the input stack (the to-do list).
The impure KIO computation calls bind to glue its interpretation to the rest of the evaluation.
Control operators, like KLoop, update the control stack to dynamically determine the next instruction-sequence.

Question: does this improve performance? Answer: Yes, significantly:

┌───────────────┬──────────┬────────────┐
│ Name          │ Time/Run │ Percentage │
├───────────────┼──────────┼────────────┤
│ OCaml         │   3.62us │      22.9% │
│ Step (V1.1)   │  15.83us │     100.0% │
│ Step (V2)     │   6.39us │      40.4% │
└───────────────┴──────────┴────────────┘

On this specific micro-benchmark, Version 2 is almost three times faster than Version 1.1 and its speed is comparable (to within a factor of two) with the efficient native OCaml implementation.

Note that protection against stack overflows (as discussed above) is not required because the new interpreter is tail recursive. Note also that step Version 2 still gives us MichelBaby typechecking for free, so it is as safe (as as much static guarantee of correctness) as Version 1.

Is that everything?

Granada’s new Michelson interpreter gains most of its efficiency from the program transformation described above, but three other optimizations are also substantive:

We use a local gas counter in the interpretation loop instead of the gas level found in the context.¹¹ The OCaml compiler can represent this local gas counter in a machine register instead of a heap-allocated object; reading from and writing to a machine register is several orders of magnitude faster than for a heap-allocated object.
Similarly, we cache the top of the stack by passing it as a dedicated argument in the interpretation loop, separately from the tail of the stack. Semantically this makes no difference — it’s still a stack! — but by this device we encourage the compiler to store the top of the stack in a machine register, potentially saving many trivial RAM read/writes as the top of the stack would otherwise get pushed to and then popped from memory (see a paper on stack caching for interpreters).
The Michelson interpreter includes an integrated logging tool, so that users can test smart contract code off-chain (e.g. profile their smart contracts before they go live). This logging tool is inactive for live on-chain code, but in Florence the logger has nonzero cost even when inactive — thus, live smart contracts run slightly slower in Florence than they would if the profiler did not exist at all. For Granada, we found a way to exploit the control stack to implement genuinely zero-cost logging.

Tale 2: hunt for a serialization performance bug

Doing serialization right thanks to well-typed encodings

Serialization converts data values into a form that can be stored or transmitted on the network (e.g. when you save a text file to disk, it gets serialized to 0s and 1s).
Deserialization is the dual operation of reconstructing the data from its serialized representation.

(De)serialization is ubiquitous in Tezos. For example:

Deserialization occurs whenever data is read from the chain (e.g. loading a script source code to execute a smart contract call).
Serialization occurs whenever the protocol stores information on the chain.
(De)serialization occurs whenever nodes communicate over the Tezos network.

Nearly anything useful you might do on a blockchain involves either reading from the chain, writing to it, or communicating with another node — so the short list above covers pretty much everything, other than the occasional few microseconds of pure computation.

The node and the economic protocol use a library named data-encoding — version 0.2 in Florence; version 0.3 in Granada — that provides convenient operations to define encodings and to automatically generate functions to convert values into sequences of bytes or JSON objects, and of course functions to decode serialized data back to values.

Serialization functions should not generally be written manually: it’s hard and error-prone work. Better to generate serialization functions automatically. To this end, data-encoding provides combinators to define a value of type t encoding from a value in some arbitrary serializable type t. For example suppose t is a basic binary tree:

type tree = Leaf of int | Binary of tree * tree

Then we can produce an encoding just by following the structure of t as follows:

let tree_encoding : tree encoding =
 mu "tree" @@ (* `fix` is this function -> *) fun self ->
    union [
      case
        (Tag 0)
        ~title:"Leaf"
        int31
        (function Leaf x -> Some x | _ -> None)
        (fun x -> Leaf x);
      case
      (Tag 1)
      ~title:"Binary"
      (obj2 (req "left" self) (req "right" self))
      (function Binary (left, right) -> Some (left, right) | _ -> None)
      (fun (left, right) -> Binary (left, right));
    ]

With that definition, we describe the encoding declaratively through the equation that defines the type of the value it works over:

$$ tree = \mu T. (int + T \times T) $$

Combinators’ types are informative and so limit the chances of error:

val mu    : string ->
            ('a encoding -> 'a encoding) ->
               'a encoding
val case  : title:string -> case_tag ->
            'a encoding -> ('t -> 'a option) -> ('a -> 't) ->
               't case
val union : ?tag_size:[`Uint8 | `Uint16] ->
            't case list ->
               't encoding

So data-encoding makes serialization right: we avoid writing error-prone boilerplate and thanks to combinators

the input required from the programmer is minimal (thanks to their expressivity),
the results are fairly likely to be correct (thanks to the combinators’ precise types), and
the results are safe (again, thanks to static typing).

How costly is this abstraction?

Serialization with data-encoding version 0.2 (as in Florence, and thus before Granada) may have been “right”, but it was not particularly fast. Typically, if we compare the time required to encode a binary tree with $2^{20}$ nodes using data-encoding with the time taken to encode the same tree using Marshal (the unsafe serialization module of OCaml standard library)¹², we get the following numbers:

┌───────────────────┬──────────┬────────────┐
│ Name              │ Time/Run │ Percentage │
├───────────────────┼──────────┼────────────┤
│ Marshal           │  67.46ms │      30.4% │
│ data-encoding 0.2 │ 221.82ms │     100.0% │
└───────────────────┴──────────┴────────────┘

data-encoding version 0.2 is roughly three times slower that Marshal in that example.¹³

One might assume this is just the price of having a nice abstraction for and a well-typed implementation of serialization. After all, Marshal is untyped and implemented in C, and low-level programming in C can be fast.

Surprisingly, this is not the case. Using the linux perf command we can observe that a specific function in data-encoding takes half of the execution time during writing: check_cases.¹⁴

check_cases checks that the data-encoding‘s combinator for union types is properly applied to a nonempty list of disjoint cases.

It’s surprising to see check_cases called during serialization of concrete values, because one would have expected it to be called just once when the definition of the encoding is processed.

In fact, things are even worse than they appear because check_cases has quadratic complexity with respect to the number of cases. Thus execution time degrades further if we add a new data constructors to our type for binary trees:

┌───────────────────┬──────────┬───────────┐
│ Name              │ Time/Run │Percentage │
├───────────────────┼──────────┼───────────┤
│ Marshal           │  74.25ms │     25.4% │
│ data-encoding 0.2 │ 292.16ms │    100.0% │
└───────────────────┴──────────┴───────────┘

For datatypes with five or more constructors, the performance of data-encoding seems to bottom out at roughly ten times slower than Marshal.

So, why is check_cases called during serialization of concrete values? The investigation is simple since the definition of tree encoding uses only two combinators: union and mu.

check_cases is called during serialization of concrete values because of a bug in mu which we will explain now.

Fortunately, the case for mu-based encodings in the function write_rec is a mere one-liner:

| Mu {fix; _} ->
    write_rec (fix e) state value

This line says that fix (the function passed to mu) is called to get the encoding of the value to be encoded. Here Mu is the internal data constructor used in the function mu.

Let us see what this fix function is when write_rec is executed on our example for tree encoding:

 1. let tree_encoding : tree encoding =
 2.  mu "tree" @@ (* `fix` is this function -> *) fun self ->
 3.     union [
 4.       case
 5.         (Tag 0)
 6.         ~title:"Leaf"
 7.         int31
 8.         (function Leaf x -> Some x | _ -> None)
 9.         (fun x -> Leaf x);
10.       case
11.       (Tag 1)
12.       ~title:"Binary"
13.       (obj2 (req "left" self) (req "right" self))
14.       (function Binary (left, right) -> Some (left, right) | _ -> None)
15.       (fun (left, right) -> Binary (left, right));
16.           ]

This is a correct encoding function! In fact the return type of the fix function is tree encoding so it makes sense to recursively call write_rec to encode value in the interpretation of mu since it is of type tree. Besides, self will be equal to e in the interpretation of mu, that is Mu {fix; ...} which is also consistent because tree_encoding is used to serialize the sub-trees.

However, even though the definition of tree_encoding is functionally correct, it has a hidden performance bug. Each time we write a tree, we execute union (see line 3 in the code above), and union carries out some sanity checks, including the aforementioned check_cases. The encoding for trees is an immutable value, so its definition need only be checked once — not each time a tree is serialized!

How to fix `fix`?

Now that we understand the problem with fix, how can we address it?

The idea is simple: the construction of the encoding for an arbitrary recursive type (of which the binary trees considered above are an example) should execute fix only once and then produce an encoding that we can trust to serialize an arbitrary number of data elements of that type, with no further checks.

This works because an encoding is an immutable value so executing fix on e always returns the same value. On the implementation side, we just need to call fix e once, then remember the result in a local reference and return the content of this reference in subsequent evaluations of fix e. Programmers may recognize this as the common technique called memoization (caching the results of function calls).

Here is an excerpt of the corrected mu that illustrates how we proceed:

let self = ref None in
let fix_f = fix in
let fix e =
  match !self with
  | Some (e0, e') when e == e0 ->
      e'
  | _ ->
      let e' = fix_f e in
      self := Some (e, e') ;
      e'
in
...

self remembers that the fix e = e'. This allows us to return e' when fix is called with e again instead of recomputing fix e.

We re-run our first benchmark on data-encoding version 0.2 but with a correct mu and see that this patch significantly improvemes the performance of data-encoding:

┌────────────────────────────────┬──────────┬────────────┐
│ Name                           │ Time/Run │ Percentage │
├────────────────────────────────┼──────────┼────────────┤
│ Marshal                        │  69.19ms │      61.5% │
│ data-encoding with correct mu  │ 112.42ms │     100.0% │
└────────────────────────────────┴──────────┴────────────┘

We reduced the execution time, and also our encoding is now less sensitive to the number of cases in the encoding. Indeed, the second benchmark (in which our type for binary trees is extended with additional data constructors) performs almost as well (112.42ms vs 128.18ms):

┌───────────────────────────────┬──────────┬────────────┐
│ Name                          │ Time/Run │ Percentage │
├───────────────────────────────┼──────────┼────────────┤
│ Marshal                       │  74.14ms │      57.8% │
│ data-encoding with correct mu │ 128.18ms |     100.0% │
└───────────────────────────────┴──────────┴────────────┘

Is that everything?

The optimizations described above account for the bulk of the performance improvements moving from data-encoding 0.2 in Florence to data-encoding 0.3 in Granada.

data-encoding 0.3 does include other incremental optimizations, which yield real but less significant speedups, and we can sum it up in a little equation as follows:

data-encoding 0.3 in Granada = data-encoding 0.2 from Florence + 
                               correct mu +
                               some other optimizations

Yet marginal gains are still gains and can still accumulate, and putting all of them together we were able to reach nearly the same level of efficiency as Marshal for important encodings — in particular the one dedicated to the Micheline format, a central data representation in the Tezos protocol.

Conclusion

We said at the start of this blog post that we followed a process of “Make it work, make it right, make it fast”.

These three qualities, far from being in opposition, complemented one another to make our optimizations possible and safe. We needed the codebase to be of high quality — having a sound architecture and using appropriate abstractions — to find, perform, and check our optimizations. Clean, well-engineered code meant that we could understand the code, effectively profile it, locate sources of inefficiency, and then apply local rewrites to improve efficiency.

Code that is right, is easier to optimize. Even in the context of a critical system like Tezos, where correctness and security have the highest priority — especially the context of a critical system — making it right also helps to make it fast.

Some background: the Tezos blockchain has what amounts to an operating system layer called the economic protocol, which is unusual amongst blockchains in that it is upgradable by stake-weighted community vote.

The currently-live economic protocol is Florence, and an upgrade to Granada is now scheduled for August 2021. So this blog post can be thought of as an inside peek into how a significant performance optimization was attained for Tezos’ next big OS upgrade. Developing such updates is a large part of Nomadic Labs’ day-to-day activities. ↩
A blockchain is “just” a distributed state machine and smart contracts are “just” programs that run on that machine to modify the state. “Gas” is used to bound computation and so prevent denial of service attacks: a smart contract receives a gas budget when started, and if it exhausts its gas, it gets terminated. Lower gas consumption is a corollary of optimizing on-chain computation to make it run more efficiently. Or to put it another way: optimizing Tezos to run faster on a given piece of actual hardware, means that more useful on-chain computation can be fit into a given gas budget. ↩
Test suite here: bench.sh, dexter.tz, fa1.2.tz, ↩
You can use another language to write your smart contract, but to execute it on the Tezos blockchain you’ll need to compile to Michelson. So does that make Michelson a bytecode language? Yes, but just calling it that somewhat belies its power: we describe it as a “strongly-typed low-level language” instead, which arguably captures the spirit of things somewhat better. ↩
The subtle simplification here is that monad does not quite represent a single state; it represents a state-change (i.e. a side-effect). However, if we assume that all computations start from an initial empty state, then a state change can be identified with the result of applying the state change to an initial empty state. ↩
That’s like running a calculation on a calculator, reading off the result … then switching off the calculator. The calculator had a state but we don’t care about it any more because we have the result we wanted. ↩
We’d like to call them bytes of state … but that would be confusing. Perhaps bits of state? ↩
Object-level language here just means that MichelBaby is the language we are implementing. Contrast this with the meta-level language which is the language in which we are writing the implementation; OCaml, in this case. ↩
Python does this for you whether you want it to or not. There’s a system-wide (mutable) limit of 1000 on recursive calls. The implication is that if your function has recursed that many times, it’s actually doing a loop (i.e. an iteration). Tail recursion is a functional programmer’s version of iteration and in particular, OCaml automatically recognizes and optimizes tail recursive calls to avoid stack allocation, so that a tail recursive OCaml program is operationally just a (generalisation of an) iterative loop. ↩
“Good” here means something specific: In practice, smart contracts are subject to gas limits — specifically, the current Operation Gas Limit for the currently-live Florence protocol is 1,040,000; see also this limit directly in the relevant line in the Florence source code — so a smart contract that gobbles our interpreter’s resources is likely to be terminated for exhausting its gas allocation anyway by the ambient Tezos abstract machine, and this would normally happen before the local hardware that is running the Tezos node on which our abstract machine happens to be executing, runs out of stack space. Thus a “good” limit is
- large enough to allow complete gas exhaustion for practical smart contracts that run on-chain (so in practice it would never actually be reached), yet
- small enough to prevent stack overflows on the underlying execution environment (so our code has predictable, controllable behaviour that we can reason about).
Does the fact that we do not expect this branch to be reached in practice make it irrelevant? Not really: for impractical smart contracts, stack could still be exhausted before gas. For instance, a deep recursion of sufficiently cheap operations (e.g. PUSH 1; DROP) might still overflow the call stack before gas is exhausted. So we do still have to detect and fail deep recursions — just in case. ↩
The context is the state of the blockchain used to validate a block. ↩
The Marshal module of OCaml is untyped. For this reason, the module must trust the programmer that some serialized bytes can be turned into a value of a given type. If the programmer is wrong, the program may crash. ↩
Test file here: comparing-marshal-and-data-encoding-0.2.ml.html. ↩
The files linked are at https://gitlab.com/nomadic-labs/data-encoding/-/tree/0.2/src at time of writing. However, we will give permalinks to an archived snapshot of the repository at https://archive.softwareheritage.org/browse/origin/content/?branch=refs/tags/0.2&origin_url=https://gitlab.com/nomadic-labs/data-encoding/&path=src/. ↩

A tale of two reductions in gas consumption in Tezos