Catching sneaky regressions with&nbsp;pytest-regtest

Testing is an important complement to formal methods that we use through out Tezos to ensure software quality. In this blog post we discuss how we use regression testing, through the pytest-regtest tool.

Regression testing

Regression testing is a coarse-grained testing method for detecting unintended changes in the behavior of the system under test. We have applied regression testing to the tezos-client, notably to smart-contract type checking and execution. Globally, regression testing consists of two components. First, a way of running and storing the behavior of a system in a test scenario. Second, a way of running the same test again and comparing the behavior of the first and second run.

This is why we call regression testing coarse-grained: a regression test passes only if the output is exactly the same as in the first, stored, run. Consequently, a regression test will detect non-functional changes, such as changes in white space output by the tested system. Such nitpicking can be overly restrictive, and we will show later in this blog post how to relax overly strict verification.

In our case, we have enabled regression testing on an existing test suite that exercises the type checker and interpreter of Michelson smart contracts. In the first run of the test suite, the output of each command passed to tezos-client by the test case is stored in a log file by the test runner. After verifying that validity of the output, we commit these log files to source control. In subsequent runs, the output of each command sent to the client is compared to the expected output stored in the corresponding log file.

Regression testing using `pytest-regtest`

To implement regression tests we plug in pytest-regtest to the pytest-based Python Testing and Execution framework used for integration testing of Tezos. This plugin provides the regtest fixture that is then used as a file object: it can be written to, or used as a context manager. If the --regtest-reset flag is passed to pytest, then all output written to the regtest fixture during a given test case is stored to a log file. On subsequent runs (unless the --regtest-reset flag is passed again), the output written to the fixture will be compared to what was recorded during the first run.

Let’s write a test case and then extend it with regression testing. We start out with a classic unit test that exercises the Michelson ADD instruction. We first define a Michelson contract that takes a pair of natural numbers as parameter and, using the ADD instruction, calculates their sum and puts it in storage:

parameter (pair nat nat);
storage (option nat);
code { CAR; UNPAIR; ADD; SOME; NIL operation; PAIR; }

We store the contract in add_example.tz and write the following pytest:

class TestRegressionExample:
    def test_example_add(self, client):
        add_contract = 'add_example.tz'
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None
        )
        assert run_script_res.storage == '(Some 25)'

The tests runs the add_example.tz contract with parameters 10 and 15 using the run_script method of the client fixture of the Tezos Python testing suite, and then verifies that the result is indeed 25.

As expected, this test passes:

$ pytest -rf tests_python/tests/test_contract_opcodes.py -k test_example_add
============================== test session starts ===========================================
platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9
collected 302 items / 301 deselected / 1 selected

tests_python/tests/test_contract_opcodes.py::TestContractOpcodes::test_example_add PASSED

This is all very well, but say that we want to test some other aspects of the ADD instruction, such as its gas consumption. We can either write a new test case, but this quickly gets tiresome as Michelson has some 80 instructions that should all be tested. Furthermore, gas constants are likely to evolve, so hard coding them in the tests makes for brittle tests. Our test suite already contains tests for all instructions — it would be more convenient if we could piggy-back existing tests, but without asserting explicitly the gas costs of each instruction in those tests.

First, how do we inspect the gas consumption of a contract? We can track gas consumption of a program (and by extension, its instructions) using the --trace-stack option to the run script command of the client. When this option is given, the interpreter reports a stack trace that contains the contents of the stack and the remaining gas at each program point during execution. Let’s try (the output has been redacted for brevity):

$ tezos-client run script add_example.tz on storage None and input 'Pair 10 15' --trace-stack
storage
  (Some 25)
emitted operations

big_map diff

trace
  - location: 9 (remaining gas: 799557 units remaining)
    [ (Pair (Pair 10 15) None)       ]
  - location: 10 (remaining gas: 799557 units remaining)
    [ (Pair 10 15)      @parameter ]
  - location: 13 (remaining gas: 799556 units remaining)
    [ (Pair 10 15)      @parameter
      (Pair 10 15)      @parameter ]
  - location: 14 (remaining gas: 799556 units remaining)

[...]

  - location: 11 (remaining gas: 799554 units remaining)
    [ 10
      15     ]
  - location: 18 (remaining gas: 799553 units remaining)
    [ 25     ]
  - location: 19 (remaining gas: 799553 units remaining)
    [ (Some 25)      ]
  - location: 20 (remaining gas: 799552 units remaining)
    [ {}
      (Some 25)      ]

[...]

We can deduce that the gas consumption for the ADD instruction in this execution is less than one gas unit (location 18), and that the gas consumption for the whole script is no more than five units. This is correct, and we would like future tests to signal if this changes unexpectedly. A solution would be to write the whole stack trace to the regtest fixture. To implement this idea, we modify our test:

    def test_example_add_reg(self, client, regtest):
        add_contract = "add_example.tz"
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None, trace_stack=True
        )
        regtest.write(run_script_res.client_output)
        assert run_script_res.storage == '(Some 25)'

The modified test requires the regtest fixture by including it as a formal parameter. Then we instruct the client fixture to pass the --trace-stack flag to tezos-client through the trace_stack=True. Finally, we write the complete output from the client, as obtained from run_script_res.client_output, to regtest. Now if we run the test_example_add_reg, it will fail due to the lack of a pre-existing log file.

$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg FAILED [100%]

================== FAILURES ======================
____________________________ TestRegressionExample.test_example_add_reg ____________________________

regression test output differences for tests/test_regression_example.py::TestRegressionExample::test_example_add_reg:

>   --- current
>   +++ tobe
>   @@ -1,43 +1 @@
>   -storage
>   -  (Some 25)
>   -emitted operations
>
>   -big_map diff
[...]

We try running again with the --regtest-reset to create the log file:

$ pytest --regtest-reset tests_python/tests/test_regression_example.py -k example_add_reg

The generated log file is found in tests_python/tests/_regtest_outputs/test_regression_example.TestRegressionExample\:\:test_example_add_reg.out, and we can verify that it contains the expected stack trace:

$ cat tests_python/tests/_regtest_outputs/test_regression_example.TestRegressionExample\:\:test_example_add_reg.out
storage
  (Some 25)
emitted operations

big_map diff

trace
  - location: 9 (remaining gas: 799557 units remaining)
    [ (Pair (Pair 10 15) None)       ]
  - location: 10 (remaining gas: 799557 units remaining)
    [ (Pair 10 15)      @parameter ]
  - location: 13 (remaining gas: 799556 units remaining)
    [ (Pair 10 15)      @parameter
      (Pair 10 15)      @parameter ]
[...]
  - location: 11 (remaining gas: 799554 units remaining)
    [ 10
      15     ]
  - location: 18 (remaining gas: 799538 units remaining)
    [ 25     ]
  - location: 19 (remaining gas: 799537 units remaining)

We can now run the regression test and verify that it passes:

$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
=================================================================== test session starts ===========================

platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9, regtest-1.4.2
collected 2 items / 1 deselected / 1 selected

tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg PASSED                                                                                                                                                                                  [100%]

========================== 1 passed, 1 deselected in 2.29 seconds ================================================

Now if a developer inadvertently introduces a change that modifies the gas consumption of the ADD instruction or its functioning, then this regression test should break. Likewise if the developer intentionally changes the semantics of the instruction. However, this should not be taken for granted. A good practice when writing tests is to test the test itself by intentionally breaking the functionality tested and verifying that the corresponding test breaks. We follow this advice and introduce a change in the gas cost of additions.

Gas costs are (mainly) defined in src/proto_alpha/lib_protocol/michelson_v1_gas.ml. Assume a cat, well-meaning but imprudent, prances over the developer’s keyboard, introducing the following change:

diff --git a/src/proto_alpha/lib_protocol/michelson_v1_gas.ml b/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
index f61e519fe..228af62c6 100644
--- a/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
+++ b/src/proto_alpha/lib_protocol/michelson_v1_gas.ml
@@ -182,7 +182,7 @@ module Cost_of = struct
     let abs int = atomic_step_cost (61 + ((int_bytes int) / 70))
     let int _int = free
     let neg = abs
-    let add i1 i2 = atomic_step_cost (51 + (Compare.Int.max (int_bytes i1) (int_bytes i2) / 62))
+    let add i1 i2 = atomic_step_cost (9000 + 51 + (Compare.Int.max (int_bytes i1) (int_bytes i2) / 62))

Oops! Suddenly additions have a fixed overhead of over 9000! It would be unfortunate if this change passes our test suite. Let us verify that it is not the case. We recompile the node and run the tests again:

$ pytest tests_python/tests/test_regression_example.py -k example_add_reg
=================================================================== test session starts ===========================

platform linux -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: .../tezos/tests_python, inifile: pytest.ini
plugins: timeout-1.3.3, parallel-0.0.9, regtest-1.4.2
collected 2 items / 1 deselected / 1 selected

tests_python/tests/test_regression_example.py::TestRegressionExample::test_example_add_reg FAILED [100%]

================== FAILURES ======================
____________________________ TestRegressionExample.test_example_add_reg ____________________________

regression test output differences for tests/test_regression_example.py::TestRegressionExample::test_example_add_reg:

>   --- current
>   +++ tobe
>   @@ -28,16 +28,16 @@
>      - location: 11 (remaining gas: 799554 units remaining)
>        [ 10
>          15    ]
>   -  - location: 18 (remaining gas: 799413 units remaining)
>   +  - location: 18 (remaining gas: 799538 units remaining)

All is well: the change is reflected in the output and the tests do indeed break. Curiously, the difference between the previous gas consumption and the new is not 9000, but 799538 - 799413 = 125. The reason is that the atomic_step_cost works with an internal gas unit, which is later scaled up to obtain the user-facing gas unit.

An eavesdropping client

In order to avoid modifying all existing test-cases and to specify manually the output that should be registered in each test, we wrote a new pytest fixture client_regtest. This fixture mixes the existing client fixture with the regtest fixture of pytest-regtest. In addition, it automatically stores the output of all commands passed to tezos-client. For example, we now express the example test case:

    def test_example_add_reg_client_regtest(self, client_regtest):
        client = client_regtest
        add_contract = "src/bin_client/test/contracts/opcodes/add_example.tz"
        run_script_res = client.run_script(
            add_contract, 'None', 'Pair 10 15', amount=None, trace_stack=True
        )
        assert run_script_res.storage == '(Some 25)'

Using this method, we add regression testing to the existing test cases simply by changing the fixture.

Scaling down the pedantry: adding output conversions

One difficulty with regression testing is dealing with output that may change from one test run to another. For instance, a contract that uses the NOW instruction will have a different stack trace on each run, as the timestamp that is pushed by NOW will differ on each execution. Like a merciless river, time flows ever forwards.

There is little we can do to change the laws of physics, but we can apply “output conversions” to the logs written by pytest to replace such data in the output with fixed markers.

We have written a client_regtest_scrubbed fixture that extends client_regtest with a series of such conversions. For example, a timestamp such as 2019-09-23T10:59:00Z is replaced by [TIMESTAMP] and operation hashes with [OPERATION_HASH].

An issue with long paths problem in `pytest-regtest`

During the implementation of regression testing, we discovered an issue in pytest-regtest generation of log files. Tests with long parameterizations would create log files with names whose length exceeded the operating system path length limit. We have patched pytest-regtest to replace explicit parameters in log file names with a hash, if the name would exceeds the limits imposed by the operating system.

Discussion

We have followed the method outlined above to add regression testing to Tezos integration tests. Notably in the existing test suite test_contracts_opcode.py that tests Michelson instruction semantics. We have also applied it to test macro expansion.

As a result, Michelson instructions are now regression test enabled. This decreases the risk that any future change inadvertently changes macro expansion, the semantics or the gas cost of any instruction, as doing so will alert the developer by breaking the test suite in many cases. However, due to the dynamic nature of gas costs, it is impossible for any test suite to cover all behaviors. Think of the addition of numbers, whose cost depends on the size of the operands. Tests can only cover a finite number of operands, but since integers and naturals are unbounded in Michelson, they can never cover all. Consequently, a change that changes the gas cost for an untested pair of operands will not be detected. Nonetheless, regression testing provides an important security net that helps to nip bugs early in the bud. Moving forward, we would like to extend regression testing to other parts of the test suite.

Our current approach to output conversion is quite ad hoc. We use the same set of output conversions for all tests. For examples, this results in all timestamps being removed from the output, even those that are constant between two runs. In the future, we would like to use test-specific conversions targeting directly the parts of the output that is expected to change.

Finally, regression testing is coarse-grained and so also fragile. In addition to picking up unintended changes, regression testing also detects intended, harmless changes. For instance, in merge request metastatedev/tezos!54, the UNPAIR macro is replaced by an instruction. UNPAIR being widely used, this change breaks all very large number of regression tests. If such a change also modifies instruction semantics or gas usage in an unintended way, then bugs may be introduced. In such cases, it is important that the developer is very careful when inspecting the diffs of the regression logs and that they keep the change to the code minimal to ease review!

Indeed, as with any method for verification and validation, regression testing should be approached as a complementary activity to careful code review and formal methods.

Catching sneaky regressions with pytest-regtest