Concurrency

Multi-threading HST: parallelize by batch, not by column.

The delta-apply is the hot path: given a fixed operator and a sparse change, HST recomputes only the rows affected by that change. When you have many state vectors to update at once, that work multiplies across the batch, and the batch is the axis that parallelizes cleanly.

Two axes

Batch updates can run independently. Column splits need a reduction.

Parallelize by batch. Each of the B state vectors is an independent problem against the same operator. Give each thread a separate block of the batch and it writes a separate block of the output: no locks, no shared accumulation, no reduction. Speedup tracks core count until memory bandwidth saturates.

Parallelize by column. Split the changed columns across threads and several threads contribute to the same output rows. Each needs a private partial buffer, and the partials must be summed at the end. That reduction is real work the batch path never pays, so column scaling is sub-linear, and at large batch the reduction can cost more than it saves.

For small deltas and small batches, threading buys little because the apply is already cheap. The win shows up in the workload shape that matters most for production evaluation: small deltas, large batches.

Measured

Speedup vs. single-threaded, 8 threads on a 10-core machine.

Batch B CSC · batch CSC · column HST · batch
165.8×1.9×6.5×
646.6×2.0×7.0×
2565.8×0.9×6.5×

Measured wall-clock at T=8. Operator N=M=16,000, tile 32, 256 changed columns reaching 7,776 output rows (1.33M nonzeros). Every parallel result is verified against the single-threaded result: bit-identical for batch-parallel, <1e-10 relative for column-parallel. Column-parallel at B=256 falls below 1× because the O(T·B·R) reduction outgrows the apply.

Live

Batch-parallel, running in your browser.

This runs a small HST-shaped delta-apply across Web Workers: single-threaded baseline vs. one batch-slice per worker. It is illustrative; browser JavaScript is not the production runtime. But the shape is the same: more workers, more speedup, until you run out of cores.

serial (1 thread)
batch-parallel
speedup
workers
batch

Ready.
Runs entirely client-side with Web Workers. Hardware concurrency reported by your browser: logical cores. Timings vary with your machine and current load.

Verified

The concurrency is model-checked, not just benchmarked.

A fast result that is silently wrong is worse than a slow one. The threading model is specified in TLA+ and exhaustively model-checked with TLC over every possible interleaving of the workers. That formal check complements the equality assertions in the benchmark.

BatchParallelApply: no errors found

Workers own disjoint output columns. TLC proves no output cell is written twice and that the final state equals the sequential result, for every interleaving.

ColumnParallelApply (safe): no errors found

Private partials plus a reduction barrier. TLC proves the reduced output equals the sequential result regardless of how the workers interleave.

ColumnParallelApply (naive): race caught

Drop the reduction barrier and write straight into shared state, and TLC returns a lost-update counterexample: two workers read the same value and one clobbers the other. This is why column-parallel needs the reduction.

Formal model-checking artifacts are available during technical evaluation.