BatchParallelApply: no errors found
Workers own disjoint output columns. TLC proves no output cell is written twice and that the final state equals the sequential result, for every interleaving.
Concurrency
The delta-apply is the hot path: given a fixed operator and a sparse change, HST recomputes only the rows affected by that change. When you have many state vectors to update at once, that work multiplies across the batch, and the batch is the axis that parallelizes cleanly.
Two axes
Parallelize by batch. Each of the B state vectors is an independent problem against the same operator. Give each thread a separate block of the batch and it writes a separate block of the output: no locks, no shared accumulation, no reduction. Speedup tracks core count until memory bandwidth saturates.
Parallelize by column. Split the changed columns across threads and several threads contribute to the same output rows. Each needs a private partial buffer, and the partials must be summed at the end. That reduction is real work the batch path never pays, so column scaling is sub-linear, and at large batch the reduction can cost more than it saves.
For small deltas and small batches, threading buys little because the apply is already cheap. The win shows up in the workload shape that matters most for production evaluation: small deltas, large batches.
Measured
| Batch B | CSC · batch | CSC · column | HST · batch |
|---|---|---|---|
| 16 | 5.8× | 1.9× | 6.5× |
| 64 | 6.6× | 2.0× | 7.0× |
| 256 | 5.8× | 0.9× | 6.5× |
Measured wall-clock at T=8. Operator N=M=16,000, tile 32, 256 changed columns reaching 7,776 output rows (1.33M nonzeros). Every parallel result is verified against the single-threaded result: bit-identical for batch-parallel, <1e-10 relative for column-parallel. Column-parallel at B=256 falls below 1× because the O(T·B·R) reduction outgrows the apply.
Live
This runs a small HST-shaped delta-apply across Web Workers: single-threaded baseline vs. one batch-slice per worker. It is illustrative; browser JavaScript is not the production runtime. But the shape is the same: more workers, more speedup, until you run out of cores.
Ready.
Runs entirely client-side with Web Workers. Hardware concurrency reported by your browser:
— logical cores. Timings vary with your machine and current load.
Verified
A fast result that is silently wrong is worse than a slow one. The threading model is specified in TLA+ and exhaustively model-checked with TLC over every possible interleaving of the workers. That formal check complements the equality assertions in the benchmark.
Workers own disjoint output columns. TLC proves no output cell is written twice and that the final state equals the sequential result, for every interleaving.
Private partials plus a reduction barrier. TLC proves the reduced output equals the sequential result regardless of how the workers interleave.
Drop the reduction barrier and write straight into shared state, and TLC returns a lost-update counterexample: two workers read the same value and one clobbers the other. This is why column-parallel needs the reduction.
Formal model-checking artifacts are available during technical evaluation.