A previous post on View Synchronization in BFT Consensus surfaces the following conundrum in state-of-the-art solutions:
Two recent breakthrough BFT solutions in the partial synchrony model, Lewis-Pye/RareSync, achieve communication optimality.
However, they suffer expected linear latency each time a Byzantine leader is encountered.
In the post below, we introduce Lumiere, a BFT solution in the partial synchrony model that retains communication optimality while simultaneously having constant latency when a bad leader is encountered.
Read also our preliminary manuscript with Andrew Lewis-Pye and Oded Naor: Optimal Latency and Communication SMR View-Synchronization.
See the HotStuff-2 writeup on eprint, https://eprint.iacr.org/2023/397.
Also, read our post What is the difference between PBFT, Tendermint, HotStuff, and HotStuff-2?, illustrating HotStuff-2 and explaining the differences from PBFT, Tendermint, and HotStuff.
Steady progress in the practicality of leader-based Byzantine consensus protocols, including the introduction of HotStuff, whose leader protocol incurs only a linear communication cost, shifts the challenge to the "Pacemaker" part, which is responsible for View Synchronization.
More specifically, before HotStuff, BFT solutions for the partial synchrony setting required quadratic communication complexity per view (with a new leader); hence, no one cared whether coordinating entering/leaving a view also incurred quadratic communication. HotStuff demonstrated a protocol whose per-view complexity is (always) linear and also defined the Pacemaker as a separate component of BFT consensus. Thus, the challenge has shifted to developing a Pacemaker with low communication complexity.
Our post The Latest View on View Synchronization provides a foundational perspective on the evolution of Pacemaker solutions to the Byzantine View Synchronization problem. Spoiler alert: The theoretical worst-case communication lower bound for reaching a single consensus decision is quadratic, but only this year has this complexity (finally) been achieved; these quadratic Pacemakers are described in the post.
An exciting milestone reached: We improved our Byzantine Consensus solution to achieve the optimal ½ corruption threshold with unknown/fluctuating participation. See our latest post, minority corruption under the Dynamic/Unknown participation model.
In our post on blog.chain.link, we present a remarkably simple solution for the Byzantine Generals problem with deterministic and unconditional Safety guarantees in a setting that has Unknown and Dynamic Participation, but without the energy consumption of proof-of-work or the long latency of longest-chain protocols. With this, we remove the biggest obstacles of Nakamoto Consensus (probabilistic finality, high latency, and the energy consumption of proof-of-work) while retaining key tenets of the permissionless model.
Read more in our whitepaper https://eprint.iacr.org/2022/1448.
Announcing our second post under the Dynamic/Unknown participation model, where we expand our previous one-shot, binary solution to full Byzantine atomic broadcast. Key properties of the solution remain, including small constant (3-round) expected latency to finality and allowing an adversary to fluctuate Byzantine parties from round to round provided that two-thirds of the active participants are honest.
Read more in our whitepaper https://eprint.iacr.org/2022/1448.
September 24th, 2022: see an updated revision.
In DAG-based BFT Consensus (the subject of two previous blog posts, "On BFT Consensus Evolution: From Monolithic to DAG" and "Another Advantage of DAG-based BFT: BEV Resistance"), multiple block proposals are sent in parallel by all Validators and it may not be clear how and when to process the transactions they contain and certify the outcome.
It's time to talk about transaction execution for DAG-based Consensus protocols.
In a DAG-based BFT Consensus protocol, every consensus message has direct utility towards forming a total ordering of committed transactions. More specifically, each Validator participating in DAG Consensus independently packs yet-unconfirmed transactions or their digests into a block, adds references to previously delivered (defined below) messages, and broadcasts a message carrying the block and the causal references to all Validators. Blocks are reliably delivered and the references they carry become the backbone of a causally ordered directed acyclic graph (DAG) structure. Then, Validators interpret their DAG locally, without exchanging more messages, and determine a view-by-view total ordering of accumulated transactions.
In each view, a common set of blocks across Validators is committed or potentially committed. Each block in the set contains a set of transactions. A sequence can be extracted from this set of blocks by filtering out any already sequenced transactions (duplicates and re-submissions), ordering the transactions causally (using protocol-specific sequence numbers or accounts) via a topological sort, and then finalising the sequence using some tie-breaking rule for non-causally constrained transactions, such as giving priority to transactions with a higher gas fee, or other considerations like MEV Protection on a DAG. While many variants of the above can be devised, a commit or potential commit results in a sequence of transactions. To distinguish the set of transactions in a view from those in blocks, we refer to them as the View-set (of transactions).
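To make the sequence-extraction step concrete, here is a minimal Python sketch. It assumes the causal constraints are already given as a per-transaction set of predecessors and that a gas fee is available for tie-breaking; both are illustrative stand-ins for whatever protocol-specific inputs a real implementation would use.

```python
from collections import defaultdict

def extract_view_sequence(committed_blocks, already_sequenced, dependency_edges, gas_fee):
    """Turn the blocks committed in a view into one transaction sequence.

    committed_blocks: list of blocks, each an iterable of transaction ids.
    already_sequenced: set of tx ids committed in earlier views (duplicates to drop).
    dependency_edges: dict tx -> set of txs that must precede it (causal constraints,
                      e.g. per-account sequence numbers); assumed given.
    gas_fee: dict tx -> fee, used as the tie-breaking rule among unconstrained txs.
    """
    # 1. Collect the View-set, filtering out anything already sequenced.
    view_set = {tx for block in committed_blocks for tx in block} - already_sequenced

    # 2. Topological sort over the causal constraints, breaking ties by higher fee.
    indegree = {tx: 0 for tx in view_set}
    dependents = defaultdict(list)
    for tx in view_set:
        for dep in dependency_edges.get(tx, ()):
            if dep in view_set:
                indegree[tx] += 1
                dependents[dep].append(tx)

    ready = sorted((tx for tx, d in indegree.items() if d == 0),
                   key=lambda t: -gas_fee.get(t, 0))
    sequence = []
    while ready:
        tx = ready.pop(0)
        sequence.append(tx)
        for nxt in dependents[tx]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
        ready.sort(key=lambda t: -gas_fee.get(t, 0))
    return sequence
```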
The question is when and by whom should View-set transactions be processed, and where and how should the result be published?
In this option, the DAG only orders transactions as suggested above. All observers â including Validators and clients â process them outside the DAG protocol. Every observer needs to process all the transactions in the current committed prefix in order to arrive at a state resulting from processing the entire committed prefix. Observers can process the chain incrementally, as new view-sets become committed.
This approach is simple and in line with the currently popular ideas around building blockchains in a modular fashion, combining different subsystems for availability, sequencing, and execution. In that context, the DAG Consensus just does sequencing (and maybe some availability) but relies on other layers for execution. The downside of this approach is that it does not, by default, provide a certificate on the executed state, co-signed by the DAG Consensus Validators, to support light clients. As a result, additional security assumptions need to be made on the execution layer to ensure the security of the whole system.
In this option, when Validators observe commits, they "lazily" and asynchronously (i.e., with no concrete protocol-step or time bound) construct and execute the sequence of transactions, and then post a signed commitment of the result in a DAG block or as a transaction sequenced in the Consensus. Clients (in particular light ones) may wait for F+1 state commitments (where F is the maximum number of Byzantine Validators) to be posted to the DAG and rely on these to authenticate reads, instead of processing transactions themselves.
Performing the execution asynchronously is simple and keeps execution outside the critical path of ordering. Compared with the previous approach, it results in a collectively signed state checkpoint, containing sufficient information to support light clients. On the downside, asynchrony requires light clients to wait, potentially for a long and unknown amount of time, until they may authenticate reads. This delay may keep growing if execution is the bottleneck of the system, and could block light clients from being able to construct new transactions to process.
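For illustration, a light-client read in this approach might look like the following Python sketch. The function name and inputs are hypothetical, but the rule it encodes (accept a state root once F+1 distinct Validators have committed to it) is the one described above.

```python
from collections import Counter

def authenticated_state(commitments, f):
    """Return a state root a light client may trust, or None if not yet available.

    commitments: iterable of (validator_id, state_root) pairs observed on the DAG
                 (or sequenced by Consensus) for the same committed prefix.
    f: maximum number of Byzantine Validators.

    A root reported by at least F+1 distinct Validators is vouched for by at least
    one honest Validator, so the light client can accept it without re-executing.
    """
    roots = Counter()
    seen = set()
    for validator, root in commitments:
        if validator in seen:
            continue            # count each Validator once
        seen.add(validator)
        roots[root] += 1
        if roots[root] >= f + 1:
            return root
    return None
```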
This option can work with DAG-based BFT Consensus protocols for the partial-synchrony model like (the first part of) Bullshark and Fin. It may also work for Tusk and (full) Bullshark but is potentially inefficient. It works as follows.
When a leader Consensus proposal is embedded in a DAG, the leader already knows what is included in a proposal (namely, the causal history of the proposal). Therefore, despite all the block proposals going on in parallel, the leader can preprocess the outcome of executing the sequence on the basis of the proposed block and bring it to a vote. The leader includes the proposed outcome as a state commitment, and when Validators vote, they check and implicitly vote for the outcome as well.
A leader-proposed execution approach works well in Fin, but in Tusk and in (full) Bullshark, the leader is determined after the fact: so to make this work, all block proposers would have to propose a state commitment, and all voters would have to pre-execute the proposals, despite the fact that only one will be selected. This is correct but computationally inefficient, at least in this naive form.
The benefit of the approach is allowing a state commitment to be observed at exactly the same time as the block is certified and as the DAG is being constructed. Light clients can use it immediately. It also allows the DAG construction to feel backpressure from delays in execution, in case execution is a bottleneck, keeping the two subsystems in tune with each other. The latter point is also its downside: taking turns devoting resources between ordering (a network-bound activity) and execution (a CPU/storage-bound activity) may lead to resource underutilization for both, and a less efficient system overall.
It is worth pointing to another variant of this approach: in an x-delayed leader-proposed execution, for some value x, the leader of view k+x posts the output of view k, included in its proposal for k+x. A Validator's vote on a leader k+x proposal is a fortiori a vote for the outcome of view k.
No matter who executes blocks and when (per the discussion above), there are ways to accelerate the execution by parallelizing work. This is crucial for high throughput: unless execution can meet ordering throughput, a backlog of committed transactions whose resulting state is unknown might form, which may cause clients to wait to observe the committed state and prevent them from producing new transactions.
At a high level, there are two key approaches for accelerating execution through parallel work:
Exploit the inherent concurrency among transactions to speed up processing through parallelism. Parallel execution can utilise available multi-core compute resources and may result in significant performance improvement over sequential processing.
Prepare transactions for faster validation through various preprocessing strategies, harnessing the collective compute power of Validators or external helpers for preparation tasks.
We focus the discussion on accelerating execution of a single view. Recall that in a view, a common set of blocks across Validators is committed, each block containing a set of transactions. A sequence ordering is extracted over the View-set of transactions, consisting of the transactions in all the committed blocks.
In this option, the goal is to process an ordered View-set of transactions as if it were executed sequentially, and arrive at the sequential result. However, rather than executing transactions one after another, the key idea is that some transactions may not be conflicting, namely, they do not read or write any common data items. Therefore, they can be processed in parallel, enabling execution acceleration that arrives at the correct sequential result. Another performance boost may be derived by combining the outputs of transactions into a batched write.
Post-ordering acceleration of transaction processing is the topic of a previous post, "Block-STM: Accelerating Smart-Contract Processing". Rather than applying Block-STM on blocks, we can employ it on ordered View-sets. Each Validator uses Block-STM independently to parallel-process a leader proposal.
Borrowing from the pioneering work on "Adding Concurrency to Smart Contracts", we can add to Parallel Option #1 various ways in which Validators help each other by sharing hints about concurrency.
Hints may be produced in a preprocessing phase, where each Validator provisionally processes and/or statically analyses the transactions in its own block. For each transaction, it generates a provisional read-set and write-set, records the sets to guide parallel execution, and embeds them inside block proposals. The information about transaction dependencies can seed parallel execution engines like Block-STM (and others) and help reduce abort rates.
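A hint-generation pass of this kind could look roughly like the sketch below. Here provisional_execute is a hypothetical hook standing in for whatever speculative-execution or static-analysis facility a Validator has; the important point is that the recorded sets are only advisory.

```python
def annotate_block_with_hints(block_txs, provisional_execute):
    """Preprocessing pass a Validator may run on its own block before proposing it.

    provisional_execute(tx) is assumed to run tx speculatively against the current
    local state and return (read_keys, write_keys). The recorded sets are hints:
    authoritative conflict detection still happens at execution time (e.g. during
    Block-STM validation), so stale hints cost performance, not correctness.
    """
    hints = {}
    for tx in block_txs:
        reads, writes = provisional_execute(tx)
        hints[tx] = {"reads": sorted(reads), "writes": sorted(writes)}
    return hints  # embedded in the block proposal alongside the transactions
```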
An important aspect of this regime is that Validators can preprocess blocks in parallel, in some cases simultaneously with collecting transactions into blocks and posting them to the DAG. The time spent on preprocessing in this case overlaps the (networking intensive) ordering phase and hence, it may result in very good utilisation of available compute resources.
A different regime is for one (or several) Validators to go through trial-and-error speculative transaction processing, and then share a transcript of the concurrent schedule they "discovered" with others. We can appoint Validators to discover concurrency on a rotating basis, where in each view some Validators shift resources to execution and other Validators re-execute the parallel schedule deterministically but concurrently, saving them both work and time. Alternatively, we can simply let fast Validators help stragglers catch up by sharing the concurrency they discovered.
Recent advances in ZK rollups allow offloading compute and storage resources from Validators (for an excellent overview, see "An Incomplete Guide to Rollups", Vitalik, 2021). These methods allow a powerful entity, referred to as a "Prover", to execute transactions and generate a succinct proof of committed state whose verification is very fast. Importantly, only one Prover needs to execute transactions because the Prover need not be trusted; the proof itself attests that correct processing was applied to produce the committed state. Therefore, the Prover can run on dedicated, beefy hardware. The work needed by Validators to verify such proofs is significantly reduced relative to fully processing transactions.
Generally, ZK proof generation is slow, currently slower than processing transactions, and the main use of ZK is not accelerating but rather compressing state and offloading computation from Validators. However, specific rollups could potentially be used for acceleration. For example, a recent method for "Aggregating and thresholdizing hash-based signatures using STARKs" could be applied to reduce the compute load incurred by signature verification on sets of transactions.
Another possible acceleration can come from splitting execution so that not every Validator re-executes every transaction. This approach presents a shift in the trust model, because it requires trusting subsets of Validators with execution results, or delegating execution to dedicated trusted executors (as in, e.g., [Hyperledger Fabric] and [ParBlockchain]).
Splitting execution across subsets may be based on account ownership, where actions on behalf of each account are processed on its dedicated executors. In networks like Sui and Solana, transactions are annotated with the objects or accounts they access, respectively, so that splitting execution in this manner is easier. The advantage of object- or account-centric computation is that parallel processing does not generate conflicts, but programming in this model has to be adapted and constrained to take full advantage of this opportunity for parallelism.
Another approach is to create a dependency graph among transactions based on preprocessing or static analysis. We can then use the graph to split transactions into "buckets" that can be processed in parallel (see the sketch after the list below). This strategy is used in many academic works; a small sample that specifically focuses on pre-ordered batching includes:
"Rethinking serializable multiversion concurrency control", by Faleiro et al., 2015
"Adding Concurrency to Smart Contracts", by Dickerson et al., 2017
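As a concrete illustration of the dependency-graph idea, here is a small Python sketch. It assumes read and write sets are known per transaction (e.g., from the preprocessing hints discussed above) and splits a pre-ordered batch into buckets by the length of each transaction's longest dependency chain; buckets run in order, each bucket fully in parallel.

```python
from collections import defaultdict

def dependency_levels(sequence, read_set, write_set):
    """Build a dependency graph over a pre-ordered batch and split it into parallel buckets.

    A dependency exists when a later transaction conflicts with an earlier one
    (one writes a key the other reads or writes). Bucket i holds every transaction
    whose longest dependency chain has length i, so executing buckets in order,
    each in parallel, preserves the sequential result.
    """
    def conflicts(a, b):
        return bool(write_set[a] & (read_set[b] | write_set[b]) or
                    write_set[b] & read_set[a])

    level = {}
    for i, tx in enumerate(sequence):
        deps = [sequence[j] for j in range(i) if conflicts(sequence[j], tx)]
        level[tx] = 1 + max((level[d] for d in deps), default=-1)

    buckets = defaultdict(list)
    for tx, lvl in level.items():
        buckets[lvl].append(tx)
    return [buckets[lvl] for lvl in sorted(buckets)]
```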
A long-lasting quest for scaling the core consensus engine in blockchains has brought significant advances in Consensus ordering algorithms, culminating in a recent wave of DAG-based BFT ordering protocols. However, to support high throughput, we need transaction processing that keeps up with ordering speeds. This post presents several paradigms for transaction execution over DAG-based Consensus and lays out approaches for accelerating transaction processing through parallelism.
October 19th, 2022: see an updated revision.
Another advantage of a DAG-based approach to BFT Consensus is enabling simple and smooth prevention of blockchain extractable value (BEV) exploits. This post describes how to integrate "Order-Fairness" into DAG-based BFT Consensus protocols to prevent BEV exploits. For more details, refer to our recent manuscript "Maximal Extractable Value (MEV) Protection on a DAG".
The first line of BEV defense is "Blind Order-Fairness". The idea is for users to send their transactions to Consensus parties encrypted, such that honest parties must be involved in opening the decryption key. Consensus parties commit to a blind order of transactions first, and only later open the transactions. We discuss how to leverage the DAG structure to achieve blind-ordering BEV protection in a simple and efficient manner. Notably, the scheme operates without materially modifying the DAG's underlying transport, without secret-share verification, and, in the happy path, it works with microsecond latency and avoids threshold encryption.
A deeper line of defense, "Time-Based Order-Fairness", incorporates additional input on relative ordering among batches of committed transactions. We discuss building such considerations into our scheme.
The third line of defense is "Participation Fairness" (aka "Chain Quality"), which guarantees that a chain of blocks includes a certain portion of honest contributions. As demonstrated in several previous systems, a "layered DAG" achieves certain participation equity for free, since at least a fraction of honest Consensus parties contributes in each layer.
Over the last few years, we have seen exploding interest in cryptocurrency platforms and applications built upon them, like decentralized finance protocols offering censorship-resistant and open access to financial instruments, or non-fungible tokens. Many of these systems are vulnerable to BEV attacks, where a malicious consensus leader can inject transactions or change the order of user transactions to maximize its profit. Thus, it is not surprising that at the same time we have witnessed the rising phenomenon of BEV professionalization, where an entire ecosystem of BEV exploitation, comprising BEV-opportunity "searchers" and collaborating miners, has arisen.
Daian et al. introduced in [Flash Boys 2.0, S&P 2020] a measure of the "profit that can be made through including, excluding, or re-ordering transactions within blocks". The original work called the measure miner extractable value (MEV); it was later extended to blockchain extractable value (BEV) [Qin et al., S&P 2022] to include other forms of attacks, not necessarily performed by miners. [MEV-explore] estimates the amount of BEV extracted on Ethereum since the 1st of Jan 2020 to be close to $700M. However, it is safe to assume that the total BEV extracted is much higher, since MEV-explore limits its estimates to only one blockchain, a few protocols, and a limited number of detectable BEV techniques. Although it is difficult to argue that all BEV is "bad" (e.g., market arbitrage can remove market inefficiencies), it usually introduces negative externalities.
In this blog post, we focus on consensus-level BEV mitigation techniques. There are fundamentally two types of BEV-resistant Order-Fairness properties:
Blind Order-Fairness. A principal line of defense against BEV stems from committing to transaction ordering without seeing transaction contents. This notion of BEV resistance, referred to here as Blind Order-Fairness, is used by Heimbach and Wattenhofer in [SoK on Preventing Transaction Reordering, 2022] and is defined as
"when it is not possible for any party to include or exclude transactions after seeing their contents. Further, it should not be possible for any party to insert their own transaction before any transaction whose contents it has already observed."
Time-Based Order-Fairness. Another strong measure for BEV protection comes from sending transactions to all Consensus parties simultaneously and using the relative arrival order at a majority of the parties to determine the final ordering. In particular, this notion of order fairness ensures that
"if sufficiently many parties receive a transaction tx before another tx', then in the final commit order tx' is not sequenced before tx."
This prevents powerful adversaries that can analyze network traffic and transaction contents from reordering, censoring, and front-/back-running transactions received by Consensus parties. Moreover, Time-Based Order-Fairness has stronger protection against a potential collusion between users and Consensus leader/parties because parties explicitly input relative ordering into the protocol.
Time-Based Order-Fairness is used in various flavors in several recent works, including [Aequitas, CRYPTO 2020], [Pompē, OSDI 2020], [Themis, 2021], [Wendy Grows Up, FC 2021], and [Quick Order Fairness, FC 2022]. We briefly discuss some of those protocols later in the post.
Another notion of fairness found in the literature, which does not provide Order-Fairness, revolves around participation fairness:
Participation Fairness. A different notion of fairness aims to ensure censorship resistance or stronger notions of participation equity. Participation equity guarantees that a chain of blocks includes a certain portion of honest contribution (aka "Chain Quality"). Several BFT protocols address Participation Fairness, including [Prime, IEEE TDSC 2010], [Fairledger, 2019], [HoneyBadger, CCS 2016], and many others. In layered DAG-based BFT protocols like [Aleph, AFT 2019], [DAG-Rider, PODC 2021], [Tusk, 2021], and [Bullshark, 2022], Participation Fairness comes essentially for free because every DAG layer must include messages from 2F+1 participants. It is worth noting that participation equity does not prevent a corrupt party from injecting transactions after it has already observed other transactions, nor a corrupt leader from reordering transactions after reading them, violating both Blind and Time-Based Order-Fairness.
We proceed to describe how to build Order-Fairness into DAG-based BFT Consensus for the partial synchrony model, focusing on preventing BEV exploits. The rest of this post is organized as follows:
The next section provides a quick refresher on DAG-based BFT Consensus and Fin; we refer the reader to a previous post on DAG-based BFT Consensus and Fin for further details, and to the list at its bottom for further reading on DAG-based BFT Consensus.
Following is a section that discusses Order-then-Reveal: On Achieving Blind Order-Fairness. It contains two strawman "order-commit first, reveal later" implementations: Order-then-Reveal with Threshold Cryptography and Order-then-Reveal with Verifiable Secret-Sharing (VSS).
It is followed by the introduction of Fino: Optimistically-Fast Blind Order-Fairness without VSS. Fino achieves the best of both worlds: the seamless blind ordering of threshold cryptography with the low latency of secret sharing.
The next section reports Threshold Encryption vs Secret Sharing micro-benchmarks.
The last section adds a discussion on Achieving Time-Based Order-Fairness.
A recent arXiv manuscript provides more detail and rigor; see "Maximal Extractable Value (MEV) Protection on a DAG".
DAG-based BFT Consensus revolves around a broadcast transport that guarantees message Reliability, Availability, Integrity, and Causality.
In a DAG-based BFT Consensus protocol, a party packs transaction digests into a block, adds references to previously delivered (defined below) messages, and broadcasts a message carrying the block and the causal references to all parties. When another party receives the message, it checks whether it needs to retrieve a copy of any of the transactions in the block and in causally preceding blocks. Once it receives a copy of all transactions in the causal past of a block it can acknowledge it. When 2F+1 acknowledgments are gathered, the message is "delivered" into the DAG.
One implementation of reliable broadcast is due to [Bracha, 1987]. Each party sends an echo message upon receiving a new transaction. Each party, after receiving 2F+1 echoes (or F+1 ready votes), can issue and broadcast a ready vote. A party delivers the transaction when 2F+1 ready votes are received. Another implementation used in several earlier systems, e.g., [Reiter and Birman, 1994] and [Cachin et al., 2001], employs threshold cryptography.
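The thresholds in Bracha-style reliable broadcast can be captured in a few lines. The Python sketch below tracks only the echo/ready counting for a single (sender, index) slot; signatures, retransmission, and payload retrieval are omitted, and the callback names are placeholders.

```python
class BrachaInstance:
    """Per-(sender, index) state for Bracha-style reliable broadcast with N = 3F+1."""

    def __init__(self, f, send_ready, deliver):
        self.f = f
        self.echoes, self.readies = set(), set()
        self.ready_sent = False
        self.delivered = False
        self.send_ready = send_ready      # callback: broadcast our ready vote
        self.deliver = deliver            # callback: deliver the message into the DAG

    def on_echo(self, party):
        self.echoes.add(party)
        self._maybe_send_ready()

    def on_ready(self, party):
        self.readies.add(party)
        self._maybe_send_ready()
        # Deliver once 2F+1 ready votes are received.
        if not self.delivered and len(self.readies) >= 2 * self.f + 1:
            self.delivered = True
            self.deliver()

    def _maybe_send_ready(self):
        # Issue our own ready vote after 2F+1 echoes, or after F+1 ready votes.
        if self.ready_sent:
            return
        if len(self.echoes) >= 2 * self.f + 1 or len(self.readies) >= self.f + 1:
            self.ready_sent = True
            self.send_ready()
```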
Several recent systems, e.g., Aleph, Narwhal-HS, DAG-Rider, Tusk, and Bullshark, construct DAG transports in a layered manner, demonstrating excellent network utilization and throughput. In a layered DAG, each party can broadcast one message in each layer, carrying 2F+1 references to messages in the preceding layer. We note that layering is not required for any of the methods described below but has important benefits for performance and Participation Fairness.
Given the strong guarantees of a DAG transport, Consensus can be embedded in the DAG quite simply, for example, using [Fin, 2022]. Briefly, Fin works as follows:
Views. The protocol operates in a view-by-view manner. Each view is numbered, as in "view(r)", and has a designated leader known to everyone.
Proposing. When a leader enters a new view(r), it sets a "meta-information" field in its coming broadcast to r. A leader's message carrying r (for the first time) is interpreted as proposal(r). Implicitly, proposal(r) suggests committing to the global ordering of transactions in all the blocks in the causal history of the proposal.
Voting. When a party sees a valid leader proposal, it sets a "meta-information" field in its coming broadcast to r. A party's message carrying r (for the first time) following proposal(r) is interpreted as vote(r).
Committing. Whenever a leader proposal gains F+1 valid votes, the proposal and its causal history become committed.
Complaining. If a party gives up waiting for a commit to happen in view(r), it sets a "meta-information" field in its coming broadcast to -r. A party's message carrying -r (for the first time) is interpreted as complaint(r). Note, a message by a party carrying r that causally follows a complaint(r) by the party, if one exists, is not interpreted as a vote.
View change. Whenever a commit occurs in view(r) or 2F+1 complaint(r) are obtained, the next view(r+1) is enabled.
We refer the reader to a previous post on DAG-based BFT Consensus and Fin for further details, and to a list at its bottom for further reading on DAG-based BFT Consensus.
The first line of defense against BEV is to keep transaction information hidden until after an ordering commit; the ordering is committed "blindly". This prevents any party from observing the contents of transactions until the ordering has been committed, hence satisfying Blind Order-Fairness.
To order transactions blindly, users choose a symmetric key to encrypt each transaction and broadcast the transaction encrypted to Consensus parties.
Order-then-Reveal is implemented using three abstract functionalities to hide and open the transaction key ("tx-key") itself: Disperse(tx-key), Reveal(tx-key), and Reconstruct(tx-key). In Disperse(), the user shares tx-key with parties such that a threshold greater than F of the parties is required to participate in reconstruction. In Reveal(), parties retrieve F+1 shares. Parties must withhold revealing decryption shares until after they observe the transaction's ordering committed. In Reconstruct(), parties produce a unique reconstruction of tx or reject it.
The challenge with Blind Order-Fairness is enforcing a unique outcome once an ordering has been committed.
It is straightforward to support blind ordering using threshold encryption, such that a public encryption "E()" is known to users and the private decryption "D()" is shared (at setup time) among parties.
For each tx, a user chooses a transaction-specific symmetric key tx-key and sends tx encrypted with it. To Disperse(tx-key), the user attaches E(tx-key), the transaction key encrypted with the global threshold key. Once a transaction tx's ordering is committed, Reveal(tx-key) is implemented by every party generating its decryption share for D(tx-key), piggybacked on the DAG broadcast that causally follows the commit. Some threshold cryptography schemes allow verifying that a party is contributing a correct decryption share. A threshold of honest parties can always succeed in Reconstruct(tx-key) and try applying it to decrypt tx; hence, by retrieving F+1 threshold shares, a unique outcome is guaranteed.
The main drawback of threshold cryptography is that share verification and decryption are computationally heavy. They take on the order of milliseconds per transaction with today's computing technology, as we show later in the article (see Performance notes).
Another way for users to Disperse(tx-key) is [Shamir's secret sharing scheme, CACM 1979]. A sharing function "SS-share(tx-key)" is employed by users to send individual shares to each Consensus party, such that F+1 parties can combine shares via "SS-combine()" to reconstruct tx-key.
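For concreteness, here is a self-contained Python sketch of SS-share/SS-combine using textbook Shamir sharing over a prime field. A production scheme would use a field matched to the key size and a vetted library, but the mechanics are the same.

```python
import secrets

# A large prime defining the field; chosen here only for the sketch.
PRIME = 2**127 - 1  # a Mersenne prime

def ss_share(secret, n, k):
    """Split an integer secret (< PRIME) into n shares, any k of which reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    def poly(x):
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc
    return [(i, poly(i)) for i in range(1, n + 1)]

def ss_combine(shares):
    """Reconstruct the secret from k shares via Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Example matching the benchmark setting below: 6 of 16 shares reconstruct tx-key.
tx_key = secrets.randbelow(PRIME)
shares = ss_share(tx_key, n=16, k=6)
assert ss_combine(shares[:6]) == tx_key
```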
Combining shares is three orders of magnitude faster than threshold crypto and takes a few microseconds in today's computing environment (see Performance notes). However, it is far less straightforward to build Reveal() and Reconstruct() with secret-sharing on a DAG:
One solution is to employ a VSS scheme, allowing F+1 parties to construct missing shares on behalf of other parties, as well as to verify that shares are consistent. This enables the integration of secret sharing into the DAG broadcast protocol, but it requires modifying the DAG transport.
A DAG transport with Verifiable Secret Sharing. To ensure that a transaction that has been delivered into the DAG can be deciphered, a party should not acknowledge a message unless it has retrieved its own individual shares for all the transactions referenced in the message and its causal past.
For example, when integrating VSS inside Bracha's reliable broadcast, a party that observes it has missed a share for a transaction referenced in a message (or its causal past) initiates VSS share recovery before sending a ready vote for the message. This guarantees that, a fortiori, after blindly committing to an ordering for the transaction, messages from every honest party reveal shares for it.
The overall communication complexity incurred in VSS, both on the user sharing a secret and on a party recovering a share, has dramatically improved in recent years.
Despite the vast progress in VSS schemes, there remain a number of challenges. The main hurdle is that when a message is delivered with 2F+1 acknowledgments, parties may need to recover their individual shares when they reference it in order to satisfy the Reliability property of DAG broadcast. As noted above, the most efficient VSS scheme requires linear communication and each share carries linear size information for recovery. Implementing VSS (e.g., due to Kate et al.) requires non-trivial cryptography. Last, as noted above, in the DAG setting this requires integrating a share-recovery protocol in the underlying transport.
Can we have the best of both worlds: the seamless Reveal/Reconstruct of threshold cryptography with the low latency of secret sharing?
Enter Fino, an embedding of Blind Order-Fairness in DAG-based BFT Consensus that leverages the DAG structure to completely forgo the need for share verification during dispersal, works with (fast) secret-sharing during steady state, and falls back to threshold crypto during a period of network instability. That is, Fino works without VSS and is optimistically fast. Importantly, in Fino the DAG transport does not need to be materially modified. During periods of synchrony, transactions that were committed blindly to the total order will be opened within three network latencies following the commit.
To avoid verifying shares during Disperse (costly), we borrow a key insight from AVID-M (see DispersedLedger), namely, verifying that the (entire) sharing was correct post-reconstruction (cheap). Importantly, even when reconstruction using revealed shares happens to be successful, if either the slow or the fast sharing was incorrect and fails post-reconstruction verification, the transaction is simply rejected. More specifically, this works as follows. After the ordering of a batch of transactions is committed and F+1 shares are revealed, each party re-encodes the opened transactions using both secret-sharing and threshold encryption and compares against two hashes: one is a Merkle-tree root hash of the SS shares, and the second is the threshold-encryption hash of the symmetric key. If there is a mismatch, the transaction is rejected.
To build Fino, we wanted a simple baseline DAG-BFT algorithmic foundation, so we chose [Fin, 2022], hence the name Fino: Fin plus BEV-resistant Order-Fairness. The advantages are Fin's simplicity of exposition, its lack of a rigid layer structure, and the fact that in Fin messages can be inserted into the DAG independently of Consensus steps or timers; that said, Fino can possibly be built on other DAG-BFT systems.
An important feature of the DAG-based BFT approach, which Fino preserves while adding blind ordering, is zero message overhead. Additionally, Fino never requires the DAG to wait for input that depends on Consensus steps/timers.
To order transactions tx blindly in Fino, a user first chooses (as before) a transaction-specific symmetric key tx-key and encrypts tx with it.
Disperse(tx-key) is implemented in two parts. First, a user applies SS-share(tx-key) to send parties individual shares of tx-key. Second, as a fallback, it sends parties E(tx-key).
Once a transaction tx's ordering is committed, Reveal(tx-key) has a fast track and a slow track. In the fast track, every party that holds an SS-share for tx-key reveals it, piggybacked on the DAG broadcast that causally follows the decision. A party that doesn't hold a share for tx-key reveals a threshold key decryption share, similarly piggybacked on a normal DAG broadcast that causally follows the commit. In the slow track, parties give up on waiting for SS-shares and reveal threshold key decryption shares even if they already revealed SS-shares.
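The party-side reveal decision is simple enough to state as Python; the argument names below are illustrative, and my_threshold_sharegen stands in for whatever derives a threshold decryption share from E(tx-key).

```python
def reveal_share(tx_id, my_ss_shares, my_threshold_sharegen, commit_observed, timed_out):
    """What a party piggybacks on its next DAG broadcast after observing tx committed.

    Fast track: reveal the Shamir share received directly from the user, if any.
    Slow track: if no share was ever received, or after giving up waiting, fall back
    to a threshold decryption share derived from E(tx-key).
    """
    if not commit_observed:
        return None
    if tx_id in my_ss_shares and not timed_out:
        return ("ss-share", my_ss_shares[tx_id])                 # fast track
    return ("threshold-share", my_threshold_sharegen(tx_id))     # slow track
```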
What remains is to form agreement on a unique decryption. Luckily, we can embed deterministic agreement about how to decipher transactions into the DAG with zero extra messages and no waiting for Consensus steps/timers. Parties simply interpret their local DAGs to arrive at a unanimous deterministic decryption.
More specifically, views, proposals, votes, and complaints in Fino are the same as Fin. The differences between Fino and Fin are as follows:
Share revealing. When parties observe that a transaction tx becomes committed, their next DAG broadcast causally following the commit reveals shares of the decryption key tx-key in one of two forms: a party that holds a share for SS-combine(tx-key) reveals it, while a party that doesn't hold a share for SS-combine(tx-key), or gives up waiting, reveals a threshold share for D(tx-key).
TX Opening. When a new leader enters view(r+1), it emits a proposal proposal(r+1) as usual. However, when the proposal becomes committed, it determines a unique decryption for every transaction tx in the causal past of proposal(r+1) that satisfies: tx is committed but not yet opened, and tx has F+1 certified shares revealed in the causal past of proposal(r+1). Note that tx could be from views earlier than view(r) if it hasn't been opened already. The unique opening or rejection of tx based on F+1 revealed shares is explained above.
A Note on Happy-path Latency. During periods of stability, there are no complaints about honest leaders by any honest party. If tx is proposed by an honest leader in view(r), it will receive F+1 votes and become committed within one network latency. Within one more network latency, every honest party will post a message containing a share for tx. Out of the 2F+1 shares, either F+1 are SS-combine() shares or F+1 are D() shares. Thereafter, whenever a leader proposal that references those 2F+1 shares commits (in a layered DAG, the leader after next will do so), everyone will be able to apply Reconstruct(tx-key) and either open tx or reject it. In the happy path, opening happens three network latencies after a commit: one for revealing shares, one for a leader proposal following the shares, and one to commit the proposal.
A Note on Share Certification. In lieu of share verification information in SS-share(), a sender needs to certify shares so that parties cannot tamper with them.
The sender combines all shares in a Merkle tree, certifies the root, and sends with each share a proof of membership (i.e., a Merkle tree path to the root); parties need to check only one signature when they collect shares for reconstruction. After reconstruction, parties can re-encode the Merkle tree and verify it was generated correctly.
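A minimal version of this Merkle-based share certification might look like the following Python sketch (using SHA-256 and duplicating the last node on odd layers; a real implementation would pick its own conventions and add the sender's signature over the root).

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root_and_proofs(shares):
    """Build a Merkle tree over serialized shares; return the root and one proof per share."""
    leaves = [h(s) for s in shares]
    layers = [leaves]
    while len(layers[-1]) > 1:
        prev = list(layers[-1])
        if len(prev) % 2:
            prev.append(prev[-1])            # duplicate last node on odd layers
        layers.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    root = layers[-1][0]

    proofs = []
    for index in range(len(shares)):
        proof, i = [], index
        for layer in layers[:-1]:
            padded = list(layer)
            if len(padded) % 2:
                padded.append(padded[-1])
            proof.append((padded[i ^ 1], i % 2 == 0))  # (sibling hash, we are left child)
            i //= 2
        proofs.append(proof)
    return root, proofs

def verify_membership(share, proof, root):
    """Check one share against the signed root using its membership proof."""
    node = h(share)
    for sibling, we_are_left in proof:
        node = h(node + sibling) if we_are_left else h(sibling + node)
    return node == root

# After reconstruction, parties can call merkle_root_and_proofs on the re-encoded
# shares and compare the resulting root to the one certified by the sender.
```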
To verify share generation in the slow path, the sender additionally certifies E(tx-key), the threshold encryption of the symmetric key used for encrypting the transaction.
Happy-path scenario. Figure 1 below depicts a scenario with three views, r, r+1, and r+2. Each view enables the next view via F+1 votes. The scenario depicts proposal(r+1) becoming committed. The commit uniquely determines how to open transactions from proposal(r), whose shares have been revealed in the causal past of proposal(r+1). Differently, proposal(r+2) is sent before F+1 shares of transactions from proposal(r+1) are delivered. Therefore, proposal(r+2) becomes committed without opening pending committed transactions from view(r+1).
Figure 1: Commits of proposal(r) and proposal(r+1), followed by share revealing that opens proposal(r)'s txs.
Scenario with a slow leader. A slightly more complex scenario occurs when a view expires because parties do not observe a leader's proposal becoming committed and they broadcast complaints. In this case, the next view is not enabled by F+1 votes but rather by 2F+1 complaints.
When a leader enters the next view due to 2F+1 complaints, the proposal of the preceding view is not considered committed yet. Only when the proposal of the new view becomes committed does it indirectly cause the preceding proposal (if one exists) to become committed as well.
Figure 2 below depicts three views, r, r+1, and r+2. Entering view(r+1) is enabled by 2F+1 complaints about view(r). When proposal(r+1) itself becomes committed, it indirectly commits proposal(r) as well. Thereafter, parties reveal shares for all pending committed transactions, namely, those in both proposal(r) and proposal(r+1). Those shares are in the causal past of proposal(r+2). Hence, when proposal(r+2) commits, a unique opening will be induced.
Figure 2: A commit of proposal(r+1) causing an indirect commit of proposal(r), followed by share revealing of both.
Threshold encryption and secret sharing provide slightly different properties when combined with a protocol like Fino. We have implemented both schemes and compared their performance. For secret sharing, we implemented [Shamir's scheme, CACM 1979], while for threshold encryption we implemented a scheme by Shoup and Gennaro [TDH2, EUROCRYPT 1998].
First, we investigated the computational overhead that these schemes introduce. For the presentation, we selected the schemes with the most efficient cryptographic primitives we had access to, i.e., the secret sharing scheme was implemented using the [Ed25519 curve, J Cryptogr Eng 2012], while TDH2 uses [Ristretto255, 2020] as the underlying prime-order group. Performance for both schemes is presented in the setting where 6 shares out of 16 are required to recover the plaintext.
| Scheme | Encrypt | ShareGen | Verify | Decrypt |
|---|---|---|---|---|
| Threshold, TDH2-based | 311.6μs | 434.8μs | 492.5μs | 763.9μs |
| Secret-sharing | 52.7μs | - | 2.7μs | 3.5μs |
The results presented in the table were obtained on an Apple M1 Pro. Encrypt refers to the overhead on the client side, while ShareGen is the operation of deriving a decryption share from the TDH2 ciphertext by each party (this operation is absent in the SSS-based scheme). In TDH2, Verify checks whether a decryption share matches the ciphertext, while in the SSS-based scheme it only checks whether a share belongs to the tree aggregated by the Merkle root attached by the client. Decrypt recovers the plaintext from the ciphertext and the required number of shares. As demonstrated by these measurements, the SSS-based scheme is much more efficient. In our Consensus scenario, each party processing a TDH2 ciphertext would call ShareGen once to derive its decryption share, Verify k-1 times to verify the threshold number of received shares, and Decrypt once to obtain the plaintext. Assuming k=6, the total computational overhead for a single transaction would be around 3.7ms of CPU time. With the secret sharing scheme, the party would also call Verify k-1 times and Decrypt once, which requires only 17μs of CPU time.
Besides the higher performance overhead, TDH2 requires a trusted setup, but the scheme also provides some advantages over secret sharing. For instance, a TDH2 ciphertext can be sent to only a single party, and the network will still be able to recover the plaintext. An SSS-based scheme requires the client to send shares directly to multiple parties. Moreover, while waiting for the network to receive shares for a transaction, the transaction occupies buffer space on parties' machines and would need to be either expired by the parties (possibly violating liveness) or kept in the state forever (possibly introducing a denial-of-service vector). Additionally, SSS requires a trusted channel between clients and parties, which is not required by TDH2 itself. Finally, as described in Fino, the subset of shares used for decrypting a transaction requires a consensus decision.
Fino achieves Blind Order-Fairness by deterministically ordering encrypted transactions and then, after the order is final, decrypting them. The deterministic order can be enhanced by the sophisticated ordering logic present in other protocols. In particular, Fino can be extended to provide Time-Based Fairness, additionally ensuring that received transactions are not only unreadable by parties, but also that their relative order cannot be influenced by malicious parties.
For instance, [Pompē, OSDI 2020] proposes a property called "Ordering Linearizability":
if all correct parties timestamp transactions tx, tx' such that tx' has a lower timestamp than tx by everyone, then tx' is ordered before tx.
It implements the property based on an observation that if parties exchange transactions associated with their receiving timestamps, then for each transaction its median timestamp, computed out of 2F+1 timestamps collected, is between the minimum and maximum timestamps of honest parties.
Fino can be easily extended with the Ordering Linearizability property offered by Pompē, and the final protocol is similar to Fino with Blind Order-Fairness (see above) with only one modification. Namely, every time a new batch of transactions becomes committed, parties independently sort the transactions by their aggregate (median) timestamps.
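That modification amounts to a sort keyed on median timestamps. The Python sketch below assumes each committed transaction comes with the 2F+1 receive timestamps collected for it; the field names are illustrative.

```python
import statistics

def sort_by_median_timestamp(committed_batch, reported_timestamps):
    """Order a newly committed batch by aggregate (median) receive timestamps.

    reported_timestamps[tx] is assumed to hold the 2F+1 timestamps collected for tx
    (one per reporting party); the median of these lies between the minimum and
    maximum timestamps of honest parties, which is what Ordering Linearizability needs.
    """
    return sorted(committed_batch,
                  key=lambda tx: statistics.median(reported_timestamps[tx]))
```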
More generally, Fino can easily incorporate other Time-Based Fairness ordering logic. Note that in Fino, the ordering of transactions is determined on encrypted transactions, but the time-ordering information should be open. The share revealing, share collection, and unique decryption following a committed ordering are the same as presented previously. The final protocol offers much stronger properties since it not only hides the payloads of unordered transactions from parties, but also prevents parties from reordering received transactions.
Other protocols providing Time-Based Order-Fairness include [Aequitas, CRYPTO 2020], which defines "Approximate-Order Fairness":
if sufficiently many parties receive a transaction tx more than a pre-determined gap before another transaction tx', then no honest party can deliver tx' before tx.
The authors prove that it is impossible to achieve this property under Condorcet scenarios, although, in practice, it might hold in most fairness-sequencing protocol executions. Then, they propose a relaxed definition of "Batch-Order Fairness":
if sufficiently many (at least ½ of the) parties receive a transaction tx before another transaction tx', then no honest party can deliver tx in a block after tx',
and a protocol achieving it.
[Themis, 2021] is a more efficient protocol realizing "Batch-Order Fairness", where parties do not rely on timestamps (as in Pompē) but only on the relative transaction orders they report. Themis can also be integrated with Fino; however, making it compatible requires some modifications to Fin's underlying DAG protocol. More concretely, Themis assumes that the fraction of bad parties is less than one quarter, i.e., at most F out of 4F+1. A leader makes a proposal based on 3F+1 out of 4F+1 transaction orderings (each reported by a distinct party). Therefore, we would need to modify the DAG transport so that parties reference preceding messages from a quorum greater than three-quarters of the system (rather than two-thirds).
Other forms of Time-Based Order-Fairness appear in other recent works, including [Wendy Grows Up, FC 2021], which introduced "Timed Relative Fairness":
if there is a time t such that all honest parties saw (according to their local clock) tx before t and tx' after t, then tx must be scheduled before tx'.
and [Quick Order Fairness, FC 2022], which defined "Differential-Order Fairness":
when the number of correct parties that broadcast tx before tx' exceeds the number that broadcast tx' before tx by more than 2F + κ, for some κ ≥ 0, then the protocol must not deliver tx' before tx (but they may be delivered together).
Many thanks to Soumya Basu, Christian Cachin, Mahimna Kelkar and Oded Naor for the comments that helped improve this post.
I found a really simple way to explain how to embed Consensus inside a DAG (Directed Acyclic Graph), which, at the same time, is highly efficient.
Emerging Proof-of-Stake blockchains achieve high transaction throughput by spreading transactions reliably as fast as the network can carry them and accumulating them in a DAG. Then, participants interpret their DAG locally without exchanging more messages and determine a total ordering of accumulated transactions.
Given a DAG transport that provides reliable and causally-ordered transaction dissemination, it seems that reaching consensus on total ordering should be really simple. Still, systems built using a DAG, such as Swirlds Hashgraph, Aleph, Narwhal-HS, DAG-Rider, Tusk, and Bullshark, are quite complex. Moreover, protocols in the partial synchrony model like Bullshark actually wait for Consensus steps/timers to add transactions to the DAG.
Here, a simple and efficient DAG-based BFT (Byzantine Fault Tolerant) Consensus embedding, quite possibly the simplest way to build BFT Consensus in a DAG, is described. It operates in a view-by-view manner that guarantees that when the network is stable, only two broadcast latencies are required to reach consensus on all the transactions that have accumulated in the DAG. Importantly, the DAG never has to wait for Consensus steps/timers to add transactions.
This post is meant for pedagogical purposes, not as a full-fledged BFT Consensus system design. The embedding described here stands on the shoulders of previous works, but does not adhere to any one in full, hence it is referred to in this post using a new name, Fin. The main takeaway on the evolution of the BFT Consensus field is that by separating reliable transaction dissemination from Consensus, BFT Consensus based on a DAG can be made simple and highly performant at the same time.
To scale the BFT (Byzantine Fault Tolerant) Consensus core of blockchains, prevailing wisdom is to separate two responsibilities.
The first is a transport for reliably spreading yet-unconfirmed transactions. It regulates communication and optimizes throughput, but it tracks only causal ordering in the form of a DAG (Directed Acyclic Graph).
The second is forming a sequential commit ordering of transactions. It solves BFT Consensus utilizing the DAG.
The advantage of building Consensus over a DAG transport is that each message in the DAG spreads useful payloads (transactions). Each time a party sends a message with transactions, it also contributes at no cost to forming a Consensus total ordering of committed transactions. In principle, parties can continue sending messages and the DAG keeps growing even when Consensus is stalled, e.g., when a Consensus leader is faulty, and the messages accumulated in the DAG can be committed later.
It is funny how the community has come full circle, from early distributed consensus systems to where we are today. I earned my PhD more than two decades ago for contributions to scaling reliable distributed systems, guided by and collaborating with pioneers of the field, including Ken Birman, Danny Dolev, Rick Schlichting, Michael Melliar-Smith, Louise Moser, Robbert van Renesse, Yair Amir, and Idit Keidar. Distributed middleware systems of that time, e.g., Isis, Psync, Trans, Total, and Transis, were designed for high throughput by building consensus over causal message ordering (!). An intense debate ensued over the usefulness of CATOCS (Causal and Totally Ordered Communication), leading Cheriton and Skeen to publish a position paper about it, [CATOCS, 1993], followed by Birman's [response 1 to CATOCS, 1994] and Van Renesse's [response 2 to CATOCS, 1994].
Recent interest in scaling blockchains appears to settle this dispute in favor of the DAG-based approach: a myriad of leading blockchain projects are being built using high-throughput DAG-based BFT protocols, including Swirlds Hashgraph, Blockmania, Aleph, Narwhal & Tusk, DAG-Rider, and Bullshark. Still, if you are like me, you might feel that these solutions are a bit complex: there are different layers in the DAG serving different steps in Consensus, and the DAG actually has to wait for Consensus steps/timers to fill layers.
Since the DAG already solves ninety percent of the BFT Consensus problem by supporting reliable, causally ordered broadcast, it seems that we should be able to do simpler/better.
Here, a simple and efficient DAG-based BFT (Byzantine Fault Tolerant) Consensus embedding, referred to as Fin, is described. Fin is quite possibly the simplest way to embed BFT Consensus in a DAG and, at the same time, it is highly efficient. It operates in a view-by-view manner that guarantees Consensus progress when the network is stable. In each view, a leader marks a position in the DAG as a "proposal", F+1 out of 3F+1 participants "vote" to confirm the proposal, and everything preceding the proposal becomes committed. Thus, only two broadcast latencies are required to reach consensus on all the transactions that have accumulated in the DAG. Importantly, both proposals and votes completely ignore the DAG structure; they are cast by injecting a single value (a view number) anywhere within the DAG. The DAG transport never waits for view numbers; it embeds in transmissions whatever latest value it was given.
The post is organized as follows:
The first section, DAG-T, explains the notion of a reliable, causal broadcast transport that shares a DAG among parties.
The second section, Fin, demonstrates the utility of DAG-T through Fin, a BFT Consensus embedded in a DAG which is designed for the partial synchrony model, operates in one phase, and is completely out of the critical path of DAG transaction spreading. The name Fin, a small part of aquatic creatures like the bullshark that controls steering, stands for the protocol's succinctness and its central role in blockchains (and also because the scenarios depicted below look like swarms of fish, and DAG in Hebrew means fish).
The third section, DAG-based Solutions, contains comparison notes on DAG-based BFT Consensus solutions.
Further reading materials are listed in DAG-based BFT Consensus: Reading list.
(If you are already familiar with DAG constructions, you don't need to read the rest of this section, except to note that Consensus is allowed to occasionally invoke setInfo() in order to set a meta-information field piggybacked on future messages.)
Figure 1: The construction of a reliable, causal DAG. Messages carry causal references to preceding messages and a local info value. Each message is guaranteed to be unequivocal and available through 2F+1 acknowledgements.
DAG-T is a transport substrate for disseminating transactions reliably and in causal order. In a DAG-based broadcast protocol, a party packs meta-information on transactions into a block, adds references to previously delivered messages, and broadcasts a message carrying the block and the causal references to all parties. When another party receives the message, it checks whether it needs to retrieve a copy of any of the transactions in the block and in causally preceding blocks. Once it receives a copy of all transactions in the causal past of a block it can acknowledge it. When 2F+1 acknowledgments are gathered, the message is inserted into the DAG, guaranteeing that DAG messages maintain Reliability, Availability, Integrity, and Causality (defined below). In a nutshell, DAG-T guarantees that all parties deliver the same ordered sequence of messages by each sender and exposes a causal-ordering relationship among them.
The life of transaction dissemination in DAG-T is captured in Figure 1 above; note the local info field carried by each message, set in setInfo(), explained below.
, explained below.More specifically, the DAG-T transport substrate exposes two basic APIâs, broadcast()
and deliver()
.
broadcast(payload)
is an API for a party p
to send a message to other parties.
A partyâs upcall deliver(m)
is triggered when a message m
can be delivered.
Each message may refer to preceding messages including the senderâs own preceding message.
Messages are delivered carrying a senderâs payload and additional meta information that can be inspected upon reception.
Every delivered message m
carries the following fields:
- m.sender, the sender identity - m.index, a delivery index from the sender - m.payload, contents such as transaction(s) - m.predecessors, references to messages sender has delivered from other parties, including itself. - m.info, a local meta-information field, set in setInfo(), explained below.
DAG-T satisfies the following requirements:
Reliability. If a deliver(m) event happens at an honest party, then eventually deliver(m) happens at every other honest party.
Agreement. If deliver(m) happens at an honest party and deliver(m') happens at another honest party, such that m.sender = m'.sender and m.index = m'.index, then m = m'.
Validity. If an honest party invokes broadcast(payload), then a deliver(m) event with m.payload = payload eventually happens at all honest parties.
Integrity. If a deliver(m) event happens at an honest party, then the sender m.sender indeed invoked broadcast(payload) where m.payload = payload.
Causality. If deliver(m) happens at an honest party, then deliver(d) events have already happened at the party for all messages d referenced in m.predecessors. Note that, by transitivity, this ensures the entire causal history of m has been delivered.
setInfo()
To prepare for Consensus decisions, DAG transports usually expose APIs allowing the Consensus protocol to inject input into the DAG.
There is no commonly accepted standard for doing this in the literature. Protocols in the partial synchrony model like Bullshark actually wait for Consensus steps/timers to add transactions to the DAG. Asynchronous protocols like Aleph, Narwhal, DAG-Rider, and Bullshark need coin-toss shares from Consensus to fill the DAG.
Here we introduce a minimally invasive, non-blocking API, setInfo(x), that works as follows. When a party invokes setInfo(x), the DAG-T transport records the value x for its internal use. Whenever broadcast() is invoked, DAG-T injects the then-current value x, the one last recorded in setInfo(x). Importantly, DAG-T never waits for setInfo; it embeds in transmissions whatever value it already has.
There are various ways to implement DAG-T among N=3F+1 parties, at most F of which are presumed Byzantine faulty and the rest are honest.
Echoing. The key mechanism for reliability and non-equivocation is for parties to echo a digest of the first message they receive from a sender with a particular index. When 2F+1 echoes are collected, the message can be delivered. There are two ways to echo: one is all-to-all broadcast over authenticated point-to-point channels a la Bracha Broadcast; the other is converge-cast with cryptographic signatures a la Rampart. In either case, echoing can be streamlined so the amortized per-message communication is linear, which is anyway the minimum necessary to spread the message.
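As a rough illustration, here is a sketch of the all-to-all echo path (in the spirit of Bracha Broadcast), assuming N = 3F+1 and hypothetical network helpers: a message is delivered only once 2F+1 parties echoed the same (sender, index, digest), which is what rules out equivocation.

```python
from collections import defaultdict

F = 1                      # assumed number of Byzantine parties
QUORUM = 2 * F + 1         # echoes required before a message may be delivered

first_seen = {}            # (sender, index) -> digest of the first message seen
echoes = defaultdict(set)  # (sender, index, digest) -> parties that echoed it

def on_first_copy(sender, index, digest, my_id):
    """Echo only the first message observed from `sender` at `index`."""
    if (sender, index) not in first_seen:
        first_seen[(sender, index)] = digest
        send_echo(sender, index, digest, my_id)   # hypothetical network call

def on_echo(sender, index, digest, echoer):
    """Deliver once 2F+1 distinct parties echoed the same (sender, index, digest)."""
    echoes[(sender, index, digest)].add(echoer)
    if len(echoes[(sender, index, digest)]) >= QUORUM:
        try_deliver(sender, index, digest)        # also waits for the causal past

def send_echo(sender, index, digest, my_id): ...
def try_deliver(sender, index, digest): ...
```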
Layering. Transports are often constructed in a layer-by-layer regime, as depicted in Figure 2 below. In this regime, each sender is allowed one message per layer, and a message may refer only to messages in the layer preceding it. Layering is done so as to regulate transmissions and saturate network capacity, and has been demonstrated to be highly effective by various projects, including Blockmania, Aleph, and Narwhal.

Layering and Temporary Disconnections. In a layer-by-layer construction, a message includes references to messages in the preceding layer. Sometimes, a party may become temporarily disconnected. When it reconnects, the DAG might have grown many layers without it. It is undesirable for a reconnecting party to be required to backfill every layer it missed with messages that everyone then has to catch up on. Therefore, parties are allowed to refer to their own preceding message across (skipped) layers, as depicted in Figure 2 below.
Figure 2: A temporary disconnect of party 4 and a later reconnect.
Layering and other implementation considerations are orthogonal to the BFT Consensus protocol. As we shall see below, Fin ignores the layer structure of DAG-T, if one exists. Here, we only care about the abstract API and properties that DAG-T provides. For further information on DAG implementations, see the further reading section below.

Given the strong guarantees of a DAG transport, Consensus can be embedded in the DAG quite simply; Fin is quite possibly the simplest DAG-based BFT Consensus solution for the partial synchrony model.
Briefly, Fin works as follows:
Views. The protocol operates in a view-by-view manner. Each view is numbered, as in "view(r)", and has a designated leader known to everyone.

Proposing. When a leader enters a new view(r), it sets a "meta-information" field in its coming broadcast to r. A leader's message carrying r (for the first time) is interpreted as proposal(r). Implicitly, proposal(r) suggests committing all the blocks in the causal history of the proposal to the global ordering of transactions.
Voting. When a party sees a valid leader proposal, it sets a "meta-information" field in its coming broadcast to r. A party's message carrying r (for the first time) following proposal(r) is interpreted as vote(r).

Committing. Whenever a leader proposal gains F+1 valid votes, the proposal and its causal history become committed.
Complaining. If a party gives up waiting for a commit to happen in view(r), it sets a "meta-information" field in its coming broadcast to -r. A party's message carrying -r (for the first time) is interpreted as complaint(r). Note that a message by a party carrying r that causally follows a complaint(r) by the same party, if one exists, is not interpreted as a vote.

View change. Whenever a commit occurs in view(r) or 2F+1 complaint(r) messages are obtained, the next view(r+1) is enabled.
Importantly, proposals, votes, and complaints are injected into the DAG at any time, independent of layers. Likewise, protocol views do NOT correspond to DAG layers; rather, view numbers are explicitly embedded in the meta-information field of messages. View numbers are in fact the only meta-information Consensus injects into the DAG (through the asynchronous setInfo() API). This property is a key tenet of the DAG/Consensus separation, allowing the DAG to continue spreading transactions with Consensus completely out of the critical path.

In a nutshell, the simple Fin commit logic is safe because there is no need to worry about a leader equivocating (DAG-T prevents equivocation), and there is no need for a leader to justify its proposal, because it is inherently justified through the proposal's causal history within the DAG. Advancing to the next view is enabled either by F+1 votes or by 2F+1 complaints. This guarantees that if a proposal becomes committed, the next (justified) leader proposal contains a reference to it.
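To make the message interpretation concrete, here is a minimal sketch (Python, with a hypothetical state-tracking object) of how a delivered message's meta-information field might be classified into a proposal, vote, or complaint; the full pseudo-code appears further below.

```python
def interpret(m, state):
    """Classify a delivered DAG message by its injected view number.
    `state` is a hypothetical object tracking, per party, whether this is the
    party's first view(r) message and whether a complaint(r) precedes it."""
    r = m.info
    if r is None:
        return None                              # no consensus meta-information
    if r < 0:
        return ("complaint", -r)                 # first message carrying -r is complaint(r)
    if not state.is_first_view_message(m.sender, r):
        return None                              # only the first view(r) message counts
    if m.sender == state.leader(r):
        return ("proposal", r)                   # leader's first view(r) message (also its vote)
    if state.complained_before(m.sender, r, m):
        return None                              # causally follows own complaint(r): not a vote
    return ("vote", r)
```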
The full protocol is described in pseudo-code below under Fin Pseudo-Code. A step-by-step walkthrough of scenarios is provided next.
Happy path scenario.

Each view(r) has a pre-designated leader, denoted leader(r), which is known to everyone. leader(r) proposes in view(r) by setting its meta-information value to r via setInfo(r). Thereafter, transmissions by the leader will carry the new view number. The first view(r) message by the leader is interpreted as proposal(r). The proposal implicitly extends the sequence of transactions with the transitive causal predecessors of proposal(r).
In Figure 3 below, leader(r) is party 1 and its first message in view(r) is denoted with a full yellow oval, indicating it is proposal(r).

When a party receives proposal(r), it advances its meta-information value to r via setInfo(r). Thereafter, transmissions by the party will carry the new view number, and the first of them will be interpreted as vote(r) for proposal(r). A proposal that has a quorum of F+1 votes is considered committed.
In Figure 3 below, party 3 votes for proposal(r) by advancing its view to r, denoted with a striped yellow oval. proposal(r) now has the required quorum of F+1 votes (including the leader's implicit vote), and it becomes committed. When a party sees F+1 votes in view(r), it enters view(r+1).
An important feature of Fin is that votes may arrive late without slowing down progress. The DAG meanwhile fills with useful messages that may become committed at the next view. This feature is demonstrated in the scenario below at view(r+1). The view has party 2 as leader(r+1) proposing proposal(r+1), but vote(r+1) messages do not arrive at the layer immediately following the proposal, only later. No worries! Until proposal(r+1) has the necessary quorum of F+1 votes and becomes committed, the DAG keeps filling with messages that may become committed at the next view, e.g., view(r+2).
Figure 3: Proposals, votes, and commits in view(r), view(r+1), and view(r+2).
Scenario with a faulty leader.

If the leader of a view is faulty or disconnected, parties will eventually time out and set their meta-information to minus the view number, e.g., -(r+1) for a failure of view(r+1). Their next broadcasts are interpreted as complaints that there is no progress in view(r+1). When a party sees 2F+1 complaints about view(r+1), it enters view(r+2).
In Figure 4 below, the first view, view(r), proceeds normally. However, no message marked view(r+1) arrives from party 2, who is leader(r+1), showing as missing ovals. No worries! As depicted, DAG transmissions continue filling layers, unaffected by the leader failure. Hence, faulty views still have utility in spreading transactions. Eventually, parties 1, 3, and 4 complain about view(r+1) by setting their meta-information to -(r+1), showing as striped red ovals. After 2F+1 complaints are collected, the leader of view(r+2) posts a message with its meta-information set to r+2, taken as proposal(r+2).
Figure 4: A faulty view(r+1) and recovery in view(r+2).
Scenario with a slow leader.

A slightly more complex scenario is depicted in Figure 5 below. Here, leader(r+1) emits proposal(r+1), but it is too slow to arrive, and parties 1, 3, and 4 complain about a view failure. This enables view(r+2) to start and progress to commit proposal(r+2). When proposal(r+2) commits, in this scenario it causally follows proposal(r+1), hence it indirectly commits it.
Figure 5: A belated proposal in view(r+1) being indirectly committed in view(r+2).
Each message in the DAG is interpreted as follows:

- A message m that carries m.info = r is referred to as a view(r) message.
- The first view(r) message from the leader of view(r) is referred to as proposal(r).
- The first view(r) message from a party is referred to as vote(r) (note, proposal(r) by the leader is also its vote(r)).
- A message m that carries m.info = -r is referred to as complaint(r).
- A proposal(r) is "justified" if proposal(r).predecessors refers to either F+1 justified vote(r-1) messages or 2F+1 complaint(r-1) messages (or r=1).
- A vote(r) is "justified" if vote(r).predecessors refers to a justified proposal(r) and does not refer to a complaint(r) by the same sender.

Party p performs the following operations for view(r):

1. Entering a view. Upon entering view(r), party p starts a view timer set to expire after a pre-determined view delay.
2. Proposing. The leader leader(r) of view(r) waits to deliver F+1 vote(r-1) messages or 2F+1 complaint(r-1) messages, and then invokes setInfo(r). Thereafter, the next transmission by the leader will carry the new view number, hence becoming proposal(r) (as well as its vote(r)).
3. Voting. Each party p other than the leader waits to deliver proposal(r) from leader(r) and then invokes setInfo(r). Thereafter, the next transmission by p will carry the new view number, hence becoming vote(r) for the leader's proposal.
4. Committing. A justified proposal(r) becomes committed if F+1 justified vote(r) messages are delivered. Upon a commit of proposal(r), a party disarms the view(r) timer.
   4.1. Ordering commits. If proposal(r) commits, messages are appended to the committed sequence as follows. First, among proposal(r)'s causal predecessors, the highest justified proposal(r') is (recursively) ordered. After it, the remaining causal predecessors of proposal(r) which have not yet been ordered are appended to the committed sequence (within this batch, ordering can be done using any deterministic rule to linearize the partial ordering into a total ordering).
5. Expiring the view timer. If the view(r) timer expires, a party invokes setInfo(-r). Thereafter, the next transmission by p will carry the negative view number, hence becoming complaint(r), an indication of the expiration of view(r).
6. Advancing to the next view. A party enters view(r+1) if the DAG satisfies one of the following two conditions: (i) a commit of proposal(r) happens, or (ii) 2F+1 complaint(r) messages are delivered.
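The ordering rule in step 4.1 is the only subtle part; the sketch below (Python, with hypothetical helpers causal_past and highest_justified_proposal) illustrates one way the recursion could be realized, appending the remaining causal past with a deterministic (sender, index) tie-break.

```python
committed_sequence = []    # the global total order of DAG messages
ordered = set()            # messages already appended to the committed sequence

def order_commit(proposal):
    """Append proposal(r) and its causal past to the committed sequence (step 4.1)."""
    # First, recursively order the highest justified proposal among the causal predecessors.
    past = causal_past(proposal)                              # hypothetical: proposal + transitive predecessors
    prior = highest_justified_proposal(past, below=proposal)  # hypothetical helper
    if prior is not None and prior not in ordered:
        order_commit(prior)
    # Then append the remaining, not-yet-ordered messages, linearizing the partial
    # order with any deterministic tie-break (here: by (sender, index)).
    for m in sorted((x for x in past if x not in ordered),
                    key=lambda x: (x.sender, x.index)):
        ordered.add(m)
        committed_sequence.append(m)

def causal_past(m): ...
def highest_justified_proposal(msgs, below): ...
```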
Fin is minimally integrated into DAG-T, simply setting the meta-information field occasionally. Importantly, at no time is transaction broadcast slowed down by the Fin protocol. Consensus logic is embedded into the DAG structure simply by injecting view numbers into it; the DAG transport never waits for view numbers, embedding in transmissions whatever value it already has.
The reliability and causality properties of DAG-T make arguing about correctness very easy, though a formal proof of correctness is beyond the scope of this post.

Safety. If proposal(r) becomes committed, then it is in the causal past of F+1 parties that voted for it. A justified proposal of any higher view must refer (directly or indirectly) to F+1 vote(r) messages, or to 2F+1 justified complaint(r) messages of which at least one follows proposal(r). In either case, a commit in such a future view causally follows a vote for proposal(r), hence it (re-)commits it. Conversely, when proposal(r) commits, it may cause a proposal in a lower view, proposal(r') with r' < r, to become committed for the first time. Safety holds because future commits will order proposal(r) and its causal past recursively.
Liveness. The protocol's liveness during periods of synchrony stems from two key mechanisms.

First, after GST (Global Stabilization Time), i.e., after communication has become synchronous, views are inherently synchronized through DAG-T. Let \(\Delta\) be an upper bound on communication delay after GST. Once a view(r) with an honest leader is entered by the first honest party p, within \(2\Delta\) all the messages seen by p are delivered by both the leader and all other honest parties. Hence, within \(2\Delta\), all honest parties enter view(r) as well. Within \(4\Delta\), the view(r) proposal and votes from all honest parties are spread to everyone.

Second, so long as view timers are set to at least \(4\Delta\), a future view does not preempt the current view's commit: in order to start a future view, its leader must collect either F+1 vote(r) messages, hence commit proposal(r), or 2F+1 complaint(r) expiration messages, which is impossible as argued above.
We now remark on Fin's communication complexity. Protocols for the partial synchrony model have an unbounded worst case by nature; hence, we concentrate on the costs incurred during steady state, when a leader is honest and communication with it is synchronous:

DAG message cost: In order for DAG messages to be delivered reliably, the DAG transport must implement reliable broadcast. This incurs either a quadratic number of messages carried over authenticated channels, or a quadratic number of signature verifications, per broadcast. In either case, the quadratic cost may be amortized by pipelining, driving it in practice to (almost) linear per message.

Commit message cost: Fin sends 3F+1 broadcast messages, a proposal and votes, per decision. A decision commits the causal history of the proposal, consisting of (at least) a linear number of messages. Moreover, each message may carry multiple transactions in its payload. As a result, in practice the commit cost is amortized over many transactions.
Commit latency: The commit latency in terms of DAG messages is 2, one proposal followed by votes.
Narwhal is a DAG transport after which DAG-T is modeled. It has a layer-by-layer structure, each layer having at most one message per sender and referring to 2F+1 messages in the preceding layer. A similarly layered DAG construction appears earlier in Aleph.
Narwhal-HS is a BFT Consensus protocol based on HotStuff for the partial synchrony model, in which Narwhal is used as a "mempool". In order to drive Consensus decisions, Narwhal-HS adds messages outside Narwhal, using the DAG only for spreading transactions.
DAG-Rider and Tusk build randomized BFT Consensus for the asynchronous model "riding" on Narwhal. These protocols are "zero message overhead" over the DAG, not exchanging any messages outside the Narwhal protocol. DAG-Rider (Tusk) is structured with purpose-built DAG layers grouped into "waves" of 4 (2) layers each. Narwhal waits for the Consensus protocol to inject an input value every wave, though in practice this does not delay the DAG materially.
Bullshark builds BFT Consensus riding on Narwhal for the partial synchrony model. It is designed with 8-layer waves driving commit, each layer purpose-built to serve a different step in the protocol. Bullshark is a "zero message overhead" protocol over the DAG; however, due to a rigid wave-by-wave structure, the DAG is modified to wait for Bullshark timers/steps to insert transactions into the DAG. In particular, if leader(s) of a wave are faulty or slow, some DAG layers wait to fill until consensus timers expire:
"We, in contrast, optimize for the common case conditions and thus have to make sure that parties do not advance rounds too fast.", "To make sure all honest parties get a chance to vote for steady state leaders, an up-to-date honest party p will try to advance (via try_advance_round) to the second and forth rounds of a wave only if (1) the timeout for this round expired or (2) p delivered a vertex from the wave predefined first and second steady-state leader, respectively." (Bullshark, 2022)
Fin builds BFT Consensus riding on DAG-T for the partial synchrony model with "zero message overhead". Uniquely, it incurs no delay to transmissions whatsoever. To achieve Consensus over DAG-T, Fin requires only injecting values into transmissions in a non-blocking manner via setInfo(v). Once a setInfo(v) invocation completes, future emissions by DAG-T carry the value v from the latest setInfo(v) invocation. The value v is opaque to DAG-T and is of interest only to the Consensus protocol.
In terms of protocol design, all of the above solutions are relatively succinct, but arguably, Fin is the simplest. DAG-Rider, Tusk and Bullshark are multi-stage protocols embedded into DAG multi-layer "waves" (4 layers in DAG-Rider, 2-3 in Tusk, 8 in Bullshark). Each layer is purpose-built for a different step in the Consensus protocol, with a potential commit happening at the last layer. Fin is single-phase, and view numbers can be injected into the DAG at any time, independent of the layer structure.
Protocol | Model | External messages used | DAG must be layered | DAG broadcasts wait for Consensus | Min commit latency* |
---|---|---|---|---|---|
Total, 1991 | asynchronous | none | no | no | eventual |
Swirlds Hashgraph, 2016 | asynchronous | none | no | no | eventual |
Aleph, 2019 | asynchronous | none | yes | yes (coin-tosses) | expected constant |
Narwhal-HS, 2021 | partial-synchrony | yes | yes | no | 6 |
DAG-Rider, 2021 | asynchronous | none | yes | yes (coin-tosses) | 4 |
Tusk, 2021 | asynchronous | none | yes | yes (coin-tosses) | 3 |
Bullshark, 2022 | partial-synchrony (with asynch fallback) | none | yes | yes (timers) | 2 |
Fin, 2022 | partial-synchrony | none | no | no | 2 |
* asynchronous commit latency is measured as length of causal message chain
There is no question that software modularity is advantageous, since it removes the Consensus protocol from the critical path of communication. That said, most solutions do not rely on a DAG transport in a pure black-box manner. As discussed above, randomized Consensus protocols, e.g., DAG-rider and Tusk, inject into the DAG randomized coin-tosses from the Consensus protocol. Protocols for the partial synchrony model, e.g., Bullshark, modify the DAG to wait for Consensus protocol round timers, in order to ensure progress during periods of synchrony.
In other words, rarely is it the case that all you need is DAG.
In a pure-DAG solution, no extra messages are exchanged by the Consensus protocol, nor is it given an opportunity to inject information into the DAG or control message emission. Parties passively analyze the DAG structure and autonomously arrive at commit ordering decisions, even though the DAG is delivered to parties incrementally and in a potentially different order.
Total and ToTo are pre-blockchain-era, pure-DAG total ordering protocols for the asynchronous model. Swirlds Hashgraph is, to our knowledge, the only blockchain-era pure-DAG solution. It makes use of bits within messages as pseudo-random coin-tosses in order to drive randomized Consensus. All of the above pure-DAG protocols are designed without regulating DAG layers and without injecting external common coin-tosses to cope with asynchrony. As a result, they are both quite complex and their convergence is slow.
Fin finds a sweet spot: albeit not a pure-DAG protocol, it is a simple and fast DAG-based protocol that injects values into the DAG in a non-intrusive manner.
Acknowledgement: Many thanks to Lefteris Kokoris-Kogias for pointing out practical details about Narwhal and Bullshark that helped improve this post.
Pre-blockchain era:

"Exploiting Virtual Synchrony in Distributed Systems", Birman and Joseph, 1987. [Isis]

"Asynchronous Byzantine Agreement Protocols", Bracha, 1987. [Bracha Broadcast]

"Preserving and Using Context Information in Interprocess Communication", Peterson, Buchholz and Schlichting, 1989. [Psync]

"Broadcast Protocols for Distributed Systems", Melliar-Smith, Moser and Agrawala, 1990. [Trans and Total]

"Total Ordering Algorithms", Moser, Melliar-Smith and Agrawala, 1991. [Total (short version)]

"Byzantine-resilient Total Ordering Algorithms", Moser and Melliar-Smith, 1999. [Total]

"Transis: A Communication System for High Availability", Amir, Dolev, Kramer, Malkhi, 1992. [Transis]

"Early Delivery Totally Ordered Multicast in Asynchronous Environments", Dolev, Kramer and Malki, 1993. [ToTo]

"Understanding the Limitations of Causally and Totally Ordered Communication", Cheriton and Skeen, 1993. [CATOCS]

"Why Bother with CATOCS?", Van Renesse, 1994. [Response 1 to CATOCS]

"A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication", Birman, 1994. [Response 2 to CATOCS]

"Secure Agreement Protocols: Reliable and Atomic Group Multicast in Rampart", Reiter, 1994. [Rampart]

Blockchain era:

"The Swirlds Hashgraph Consensus Algorithm: Fair, Fast, Byzantine Fault Tolerance", Baird, 2016. [Swirlds Hashgraph]

"Blockmania: from Block DAGs to Consensus", Danezis and Hrycyszyn, 2018. [Blockmania]

"Aleph: Efficient Atomic Broadcast in Asynchronous Networks with Byzantine Nodes", Gągol, Leśniak, Straszak, and Świętek, 2019. [Aleph]

"Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus", Danezis, Kokoris-Kogias, Sonnino, and Spiegelman, 2021. [Narwhal and Tusk]

"All You Need is DAG", Keidar, Kokoris-Kogias, Naor, and Spiegelman, 2021. [DAG-rider]

"Bullshark: DAG BFT Protocols Made Practical", Spiegelman, Giridharan, Sonnino, and Kokoris-Kogias, 2022. [Bullshark]
Block-STM is a recently announced technology for accelerating smart-contract execution, emanating from the Diem project and matured by Aptos Labs. The acceleration approach interoperates with existing blockchains without requiring modification or adoption by miner/validator nodes, and can benefit any node independently when it validates transactions.
This post explains Block-STM in simple English, accompanied by a running scenario.
An approach pioneered in the Calvin 2012 and Bohm 2014 projects, in the context of distributed databases, is the foundation of much of what follows. The insightful idea in those projects is to simplify concurrency management by disseminating pre-ordered batches (akin to blocks) of transactions along with pre-estimates of their read- and write-sets. Every database partition can then autonomously execute transactions according to the block pre-order, each transaction waiting only for read dependencies on earlier transactions in the block. The first DiemVM parallel executor implements this approach, but it relies on a static transaction analyzer to pre-estimate read/write-sets, which is time-consuming and can be inexact.
Another work, by Dickerson et al. 2017, provides a link from traditional database concurrency to smart-contract parallelism. In that work, a consensus leader (or miner) pre-computes a parallel execution serialization by harnessing optimistic software transactional memory (STM) and disseminates the pre-execution scheduling guidelines to all validator nodes. A later work, OptSmart 2021, adds read/write-set dependency tracking during pre-execution and disseminates this information to increase parallelism. These approaches remove the reliance on static transaction analysis but require a leader to pre-execute blocks.
The Block-STM parallel executor combines the pre-ordered block idea with optimistic STM to enforce the block pre-order of transactions on-the-fly, completely removing the need to pre-disseminate an execution schedule or pre-compute transaction dependencies, while guaranteeing repeatability.
Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, each block containing a pre-ordered sequence of transactions TX1, TX2, ..., TXn. Transactions consist of smart-contract code that reads and writes shared memory; executing a transaction produces a read-set and a write-set: the read-set consists of pairs of a memory location and the transaction that wrote to it, and the write-set consists of pairs of a memory location and a value that the transaction would record if it became committed.
A parallel execution of the block must yield the same deterministic outcome as a sequential execution that follows the block pre-order, namely, it must result in exactly the same read/write-sets. If, in a sequential execution, TXk reads a value that TXj wrote, i.e., TXj is the highest transaction preceding TXk that writes to this particular memory location, we denote this by:
TXj → TXk
The following scenario will be used throughout this post to illustrate parallel execution strategies and their effects.
A block B consisting of ten transactions, TX1, TX2, ..., TX10, with the following read/write dependencies:

- TX1 → TX2 → TX3 → TX4
- TX3 → TX6
- TX3 → TX9
To illustrate execution timelines, we will assume a system with four parallel threads, and for simplicity, that each transaction takes exactly one time-unit to process.
If we knew the above block dependencies in advance, we could schedule an ideal execution of block B along the following time-steps:
1. parallel execution of TX1, TX5, TX7, TX8
2. parallel execution of TX2, TX10
3. parallel execution of TX3
4. parallel execution of TX4, TX6, TX9
What if dependencies are not known in advance? A correct parallel execution must guarantee that all transactions indeed read the values adhering to the block dependencies. That is, when TXk reads from memory, it must obtain the value(s) written by TXj, TXj → TXk, if a dependency exists; or the initial value at that memory location when the block execution started, if none.
Block-STM ensures this while providing effective parallelism. It employs an optimistic approach, executing transactions greedily and optimistically in parallel and then validating. Validation of TXk re-reads the read-set of TXk and compares against the original read-set that TXk obtained in its latest execution. If the comparison fails, TXk aborts and re-executes.
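As an illustration, here is a minimal sketch of that validate-then-re-execute step, assuming hypothetical txn objects (with index and read_set fields) and a read_last helper that returns the index of the transaction whose write is currently visible at a location:

```python
def validate(txn, read_last):
    """Re-read txn's read-set and compare against what its latest execution observed."""
    for location, observed_writer in txn.read_set:
        # read_last(location, k) is a hypothetical READLAST helper (see below).
        if read_last(location, txn.index) != observed_writer:
            return False          # an earlier transaction changed what txn read
    return True

def validate_or_reexecute(txn, read_last, execute):
    if not validate(txn, read_last):
        execute(txn)              # hypothetical: abort and re-run, refreshing read/write-sets
```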
Correct optimism revolves around maintaining two principles:

- READLAST(k): a read by TXk obtains the value recorded by the latest invocation of the transaction TXj with the highest index j < k (or the initial value if no such write exists).
- VALIDAFTER(j, k): for every j < k, a read-set validation of TXk is performed after TXj completes (re-)executing.

Jointly, these two principles suffice to guarantee both safety and liveness, no matter how transactions are scheduled. Safety follows because TXk gets validated after all TXj, j < k, are finalized. Liveness follows by induction: initially, transaction 1 is guaranteed to pass read-set validation successfully and not require re-execution; after all transactions TX1 through TXj have successfully validated, a (re-)execution of transaction j+1 will pass read-set validation successfully and not require re-execution.
READLAST(k) is implemented in Block-STM via a simple multi-version in-memory data structure that keeps versioned write-sets. A write by TXj is recorded with version j. A read by TXk obtains the value recorded by the latest invocation of TXj with the highest j < k. A special value ABORTED may be stored at version j when the latest invocation of TXj aborts. If TXk reads this value, it suspends and resumes when the value becomes set.
VALIDAFTER(j, k) is implemented by scheduling for each TXj a read-set validation of TXk, for every k > j, after TXj completes (re-)executing.
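Here is a minimal sketch of such a multi-version store (Python, hypothetical names): each location maps writer indices to values, read_last returns the entry with the highest index below the reader's, and the ABORTED sentinel tells later readers to suspend until the re-execution overwrites it.

```python
ABORTED = object()   # sentinel stored when the latest invocation of a transaction aborted

class MultiVersionMemory:
    """Hypothetical sketch of the versioned write-set store backing READLAST(k)."""
    def __init__(self, initial_state):
        self.initial = dict(initial_state)   # location -> value before the block executes
        self.versions = {}                   # location -> {writer index j: value or ABORTED}

    def write(self, location, j, value):
        self.versions.setdefault(location, {})[j] = value

    def mark_aborted(self, location, j):
        self.versions.setdefault(location, {})[j] = ABORTED

    def read_last(self, location, k):
        """READLAST(k): the write by TXj with the highest j < k, else the initial value."""
        writes = self.versions.get(location, {})
        earlier = [j for j in writes if j < k]
        if not earlier:
            return None, self.initial.get(location)
        j = max(earlier)
        return j, writes[j]   # value may be ABORTED: the caller suspends until TXj re-executes
```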
It remains to devise an efficient schedule for parallelizing executions and read-set validations. We present the Block-STM scheduler gradually, starting with a correct but inefficient strawman and improving it in three steps. The full scheduling strategy is described in under 20 lines of pseudo-code.
As a first cut, consider a strawman scheduler, S-1, which uses a centralized dispatcher to coordinate work by parallel threads.
    // Phase 1:
    dispatch all TX's for execution in parallel; wait for completion

    // Phase 2:
    repeat {
        dispatch all TX's for read-set validation in parallel; wait for completion
    } until all read-set validations pass

    read-set validation of TXj {
        re-read TXj read-set
        if read-set differs from original read-set of the latest TXj execution
            re-execute TXj
    }

    execution of TXj {
        (re-)process TXj, generating a read-set and write-set
    }
S-1 operates in two master-coordinated phases. Phase 1 executes all transactions optimistically in parallel. Phase 2 repeatedly validates all transactions optimistically in parallel, re-executing those that fail, until there are no more read-set validation failures.
Recall our running example, a Block B with dependencies TX1 → TX2 → TX3 → {TX4, TX6, TX9}, running with four threads, each transaction taking one time unit (and neglecting all other computation times). A possible execution of S-1 over Block B would proceed along the following time-steps:
1. phase 1 starts. parallel execution of TX1, TX2, TX3, TX4
2. parallel execution of TX5, TX6, TX7, TX8
3. parallel execution of TX9, TX10
4. phase 2 starts. parallel read-set validation of all transactions, in which TX2, TX3, TX4, TX6 fail and re-execute
5. continued parallel read-set validation of all transactions, in which TX9 fails and re-executes
6. parallel read-set validation of all transactions, in which TX3, TX4, TX6 fail and re-execute
7. parallel read-set validation of all transactions, in which TX4, TX6, TX9 fail and re-execute
8. parallel read-set validation of all transactions, in which all succeed
It is quite easy to see that the S-1 validation loop satisfies VALIDAFTER(j,k) because every transaction is validated after previous executions complete. However, it is quite wasteful of resources, with each loop iteration fully executing/validating all transactions.
The first improvement step, S-2, replaces both phases with parallel task-stealing by threads. Using insight from S-1, we distinguish between a preliminary execution (corresponding to phase 1) and a re-execution (following a validation abort). Stealing is coordinated via two synchronization counters, one per task type: nextPrelimExecution (initially 1) and nextValidation (initially n+1). A (re-)execution of TXj guarantees that read-set validation of all higher transactions will be dispatched, by decreasing nextValidation back to j+1.
The following strawman scheduler, S-2, utilizes a task stealing regime.
    // per thread main loop:
    repeat {
        task := "NA"
        // if available, steal next read-set validation task
        atomic {
            if nextValidation < nextPrelimExecution
                (task, j) := ("validate", nextValidation)
                nextValidation.increment();
        }
        if task = "validate"
            validate TXj
        // if available, steal next execution task
        else
            atomic {
                if nextPrelimExecution <= n
                    (task, j) := ("execute", nextPrelimExecution)
                    nextPrelimExecution.increment();
            }
            if task = "execute"
                execute TXj
                validate TXj
    } until nextPrelimExecution > n, nextValidation > n, and no task is still running

    read-set validation of TXj {
        re-read TXj read-set
        if read-set differs from original read-set of the latest TXj execution
            re-execute TXj
    }

    execution of TXj {
        (re-)process TXj, generating a read-set and write-set
        atomic { nextValidation := min(nextValidation, j+1) }
    }
Interleaving preliminary executions with validations avoids unnecessary work executing transactions that might follow aborted transactions. To illustrate this, we will once again utilize our running example. The timing of task stealing over our running example is hard to predict because it depends on real-time latency and interleaving of validation and execution tasks. Notwithstanding, below is a possible execution of S-2 over B with four threads that exhibits fewer (re-)executions and lower overall latency than S-1.
With four threads, a possible execution of S-2 over Block B (recall, TX1 → TX2 → TX3 → {TX4, TX6, TX9}) would proceed along the following time-steps:
1. parallel execution of TX1, TX2, TX3, TX4; read-set validation of TX2, TX3, TX4 fail
2. parallel execution of TX2, TX3, TX4, TX5; read-set validation of TX3, TX4 fail
3. parallel execution of TX3, TX4, TX6, TX7; read-set validation of TX4, TX6 fail
4. parallel execution of TX4, TX6, TX8, TX9; all read-set validations succeed
5. parallel execution of TX10; all read-set validations succeed
Importantly, VALIDAFTER(j, k) is preserved because upon a (re-)execution of TXj, the scheduler decreases nextValidation to j+1. This guarantees that every TXk, k > j, will be validated after that execution of TXj completes.
Preserving READLAST(k) requires care due to concurrent task stealing, since multiple incarnations of the same transaction's validation or execution tasks may occur simultaneously. Recall that READLAST(k) requires that a read by TXk obtains the value recorded by the latest invocation of TXj with the highest j < k. This requires synchronizing transaction invocations so that READLAST(k) returns the value recorded by the highest incarnation of a transaction. A simple solution is a per-transaction atomic incarnation synchronizer that prevents stale incarnations from recording values.
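One way such a synchronizer could look, sketched here with hypothetical names: a per-transaction incarnation counter is bumped whenever a (re-)execution task starts, and only the latest incarnation is allowed to record values.

```python
import threading

class IncarnationGuard:
    """Hypothetical per-transaction synchronizer: only the latest incarnation records values."""
    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0

    def start_incarnation(self):
        """Called when a (re-)execution task for this transaction begins."""
        with self._lock:
            self._current += 1
            return self._current      # incarnation id handed to the execution task

    def may_record(self, incarnation):
        """Stale incarnations (superseded by a newer re-execution) must not record values."""
        with self._lock:
            return incarnation == self._current
```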
The last step, S-3, adds two important improvements.
The first is an extremely simple dependency-tracking mechanism (no graphs or partial orders) that considerably reduces aborts. When TXj aborts, the write-set of its latest invocation is marked ABORTED. READLAST(k) supports the ABORTED mark, guaranteeing that a higher-index TXk reading from a location in this write-set will be delayed until a re-execution of TXj overwrites it.
The second increases re-validation parallelism. When a transaction aborts, rather than waiting for it to complete re-execution, nextValidation is decreased immediately; then, if the re-execution writes to a (new) location which is not marked ABORTED, nextValidation is decreased again when the re-execution completes.
The final scheduling algorithm, S-3, has the same main loop body as S-2, with executions supporting dependency management via ABORTED tagging, and with early re-validation enabled by decreasing nextValidation upon abort:
    // per thread main loop, same as S-2
    ...

    read-set validation of TXj: {
        re-read TXj read-set
        if read-set differs from original read-set of the latest TXj execution
            mark the TXj write-set ABORTED
            atomic { nextValidation := min(nextValidation, j+1) }
            re-execute TXj
    }

    execution of TXj: {
        (re-)process TXj, generating a read-set and write-set
        resume any TX waiting for TXj's ABORTED write-set
        if the new TXj write-set contains locations not marked ABORTED
            atomic { nextValidation := min(nextValidation, j+1) }
    }
S-3 enhances efficiency through simple, on-the-fly dependency management using the ABORTED tag. For our running example of block B, an execution driven by S-3 with four threads may avoid several of the re-executions incurred in S-2 by waiting on an ABORTED mark. Despite the high contention in the B scenario, a possible execution of S-3 may come very close to the optimal schedule, as shown below.
A possible execution of S-3 over Block B (recall, TX1 → TX2 → TX3 → {TX4, TX6, TX9}) would proceed along the following time-steps (depicted in the figure):
1. parallel execution of TX1, TX2, TX3, TX4; read-set validation of TX2, TX3, TX4 fail; nextValidation set to 3
2. parallel execution of TX2, TX5, TX7, TX8; executions of TX3, TX4, TX6 are suspended on ABORTED; nextValidation set to 6
3. parallel execution of TX3, TX10; executions of TX4, TX6, TX9 are suspended on ABORTED; all read-set validations succeed (for now)
4. parallel execution of TX4, TX6, TX9; all read-set validations succeed
The reason S-3 preserves VALIDAFTER(j, k) is slightly subtle. Suppose that TXj → TXk. Recall, when TXj fails, S-3 lets (re-)validations of TXk, k > j, proceed before TXj completes re-execution. There are two possible cases. If a TXk validation reads an ABORTED value of TXj, it will wait for TXj to complete; and if it reads a value which is not marked ABORTED and the TXj re-execution overwrites it, then TXk will be forced to revalidate again.
Through a careful combination of simple, known techniques, applied to a pre-ordered block of transactions that commit in bulk, Block-STM enables effective speedup of parallel smart-contract processing. Simplicity is a virtue of Block-STM, not a failing, enabling a robust and stable implementation. Block-STM has been integrated within the Diem blockchain core (https://github.com/diem/) and evaluated on synthetic transaction workloads, yielding over 17x speedup on 32 cores under low/modest contention.
Disclaimer: The description above reflects the Block-STM approach more or less faithfully; for details, see the paper. Also, note that the description here uses different names from the paper, e.g., ABORTED replaces "ESTIMATE", nextPrelimExecution replaces "execution_idx", and nextValidation replaces "validation_idx".
]]>Acknowledgement: Many thanks to Zhuolun (Daniel) Xiang for pointing out subtle details that helped improve this post.