Mir-BFT: High-Throughput BFT for Blockchains

Posted by tinywell on 2019-07-31
Chrysoula Stathakopoulou IBM Research - Zurich
Tudor David IBM Research - Zurich
Marko Vukolic IBM Research - Zurich

Abstract

This paper presents Mir-BFT (or, simply, Mir), a robust asynchronous Byzantine fault-tolerant (BFT) total order broadcast protocol aimed at maximizing throughput on wide-area networks (WANs) with up to 100 nodes. This deployment setting is highly relevant to many permissioned and Proof-of-Stake permissionless blockchains.

Mir achieves unprecedented throughput on WANs without sacrificing latency, robustness to malicious behavior, or even performance in clusters. To achieve this, Mir is the first BFT protocol that allows a set of leaders to propose request batches independently, in parallel, while preventing request duplication performance attacks through a rotating assignment of a partitioned request hash space to leaders. We also propose several optimizations to Mir that boost the throughput even further, including partial replication through a novel abstraction we call light total order (LTO) broadcast.

Perhaps most importantly, Mir relies on proven BFT protocol constructs, which is fundamental to Mir provability and correctness. Specifically, Mir is a generalization of the celebrated and scrutinized PBFT protocol. While Mir follows PBFT “safety-wise”, it introduces substantial changes with respect to PBFT liveness (i.e., leader election), which help it become the highest throughput BFT protocol on WANs with up to 100 nodes to date, reaching more than 23000 Bitcoin-sized tps.

1. Introduction

Blockchains are decentralized, globally-distributed, strongly consistent replicated systems that run across networks of mutually untrusting nodes. Since the inception of Bitcoin’s decentralized cash application [43], modern blockchain systems have evolved the ability to run arbitrary distributed applications (e.g., [4, 13]), with the promise of supporting entire decentralized economies [1] and business ecosystems across industries [6].

Byzantine fault-tolerant (BFT) protocols, which tolerate arbitrary (Byzantine [37]) behavior of a subset of nodes, have evolved to be the key technology to power blockchains and ensure their consistency [48, 25]. BFT protocols relevant to blockchain are consensus and state machine replication (SMR) protocols (e.g., [22]) or, even more specifically, total order (TO) broadcast protocols that establish the basis for SMR [44]. Such BFT protocols have found their use in replacing (or, less often, complementing) energy-wasting and slow Proof-of-Work (PoW) consensus protocols used to power early blockchains including Bitcoin, which process between 7 and 60 transactions per second [48, 29].

In general, current BFT protocols do not scale well with the number of nodes (replicas) and hence do not perform to the needs of blockchain use cases. State-of-the-art BFT protocols are either very efficient on small scales in clusters (e.g., [15, 34]) or exhibit modest performance on large scales (thousands or more nodes) across wide-area networks (WANs) (e.g., [30]).

However, the sweet spot of design goals, which combines medium-size networks (e.g., up to 100 nodes) and excellent performance across WANs, remains largely unexplored. This design spot is highly relevant to different types of blockchain networks. On the one hand, permissioned blockchains, such as Hyperledger Fabric [13], are rarely deployed on scales above 100 nodes, yet use cases gathering dozens of organizations (e.g., banks) are very prominent [3]. In such use cases, every organization represents a separate administrative domain, which defines boundaries of trust, and the requirement that each organization runs (or administers) at least one node is very common. On the other hand, this design spot is also highly relevant in the context of large-scale permissionless blockchains, in which anyone can participate, that use weighted voting (based e.g., on Proof-of-Stake (PoS) [20, 32], delegated PoS (DPoS) [7]), or committee voting [30, 33], to limit the number of nodes involved in the critical path of the consensus protocol. With such weighted voting, the number of (relevant) nodes for PoS/DPoS consensus is typically in the order of a hundred ([7]) or sometimes even less (e.g., a few dozen nodes [10]).

This paper fills this void and presents Mir-BFT (or, simply, Mir), a novel total order (TO) BFT protocol that achieves the best throughput to date on public WAN networks with up to 100 nodes. Mir achieves this without compromising robustness to failures and malicious attacks, latency, performance on small scale and in clusters, or correctness/provability. The following summarizes the main features of Mir, as well as the contributions of this paper:

  • Mir builds on the seminal leader-based PBFT protocol [22] by generalizing its “liveness” part. In short, Mir allows multiple concurrent leaders to propose batches of requests in parallel. In doing so, Mir leverages multiple secure connections (gRPC) across each pair of nodes, as opposed to state-of-the-art designs that use a single TCP/TLS connection between a pair of nodes. This proves to be critical in overcoming apparent cross-site bandwidth limitations often reported in geo-replicated systems research (e.g., [45, 38]). Here, Mir relies natively on PBFT’s original UDP-oriented logic to deal with potential re-ordering that stems from using multiple connections.
  • On the protocol level, the seemingly simple idea of using multiple leaders in parallel raises the issue of request duplication performance attacks which plague related approaches (e.g., [41, 24, 35]). With up to n leaders, such attacks may induce an n-fold duplication of every single request and bring the effective throughput to its knees, voiding the benefits of using multiple leaders. To prevent request duplication attacks, Mir partitions the request hash space across replicas. In addition, due to potential request censorship by malicious leaders, Mir rotates the partition assignment across protocol configurations (epochs). This approach is novel in the context of BFT protocols.
  • While the base version of Mir implements classical TO broadcast and disseminates every request to every correct node, this guarantee is unnecessarily strong for some blockchains. To this end, we introduce the concept of a light total order (LTO) broadcast, which is identical to TO, except that it provides partial data availability, guaranteeing the delivery of every request to at least one correct node. Other correct nodes get and agree on the order of cryptographic hashes of requests, which is the basis for maintaining the other TO properties. Such guarantees are highly relevant to blockchain systems that separate the execution of applications (smart contracts) from the agreement on the order of transactions [50, 13]. Mir further uses a client signature verification sharding optimization to offload CPU, which often becomes a bottleneck in Mir.
  • Perhaps most importantly, and as intuited above, Mir relies on proven BFT protocol constructs which is fundamental to Mir provability and correctness, avoiding “design-from-scratch”, which is known to be error-prone [15]. Specifically, Mir as described in this paper is a generalization of the celebrated and scrutinized PBFT protocol (Mir variants based on other BFT protocols can be derived as well). While Mir follows PBFT “safety-wise”, it introduces substantial changes with respect to PBFT liveness (e.g., leader election), which we sketched above and describe in more detail in the rest of the paper. Restricting changes to PBFT liveness dramatically simplifies the reasoning about the correctness of Mir.
  • Finally, we implement Mir in Go and run it with up to 100 nodes in a WAN, as well as in clusters and under faults, comparing it to state-of-the-art BFT protocols. We also evaluate the impact of the multiple optimizations we propose. Our results show that Mir delivers more than 23k Bitcoin-sized tps on a scale of 100 nodes on a WAN, dramatically outperforming the state of the art. On smaller scales with 4 to 49 nodes, Mir delivers 42k-53k tps, with typical latencies of 1-2 seconds.

To put this into perspective, Mir’s 23000 tps on 100 nodes on a WAN are enough to multiplex the advertised peak throughputs of the top 20 blockchain networks per market cap (less than 20k tps in total [8] for more than $220B USD total market capitalization). It is roughly equal to the alleged peak capacity of VISA (24k tps [8]) and more than 11x faster than the actual average VISA transaction rate (about 2k tps [48]). We expect that such performance will open the door for new blockchain use cases.

The rest of the paper is organized as follows. In Section 2, we give preliminary definitions and briefly present PBFT (for completeness). In Section 3, we give an overview of Mir and changes it introduces to PBFT. We then explain Mir implementation details in Section 4. We present Mir optimizations, including LTO, in Section 5. This is followed by Mir correctness arguments in Section 6. Section 7 gives evaluation details. Finally, Section 8 discusses related work.

Figure 1: PBFT communication pattern and messages. Bottleneck messages are shown in bold.

2. Preliminaries

2.1. System Model

We assume an asynchronous (eventually synchronous) network [26] of n nodes (the set of all nodes is denoted by Nodes), in which the communication among correct nodes is synchronous after some time GST, unknown to the nodes. We assume a public key infrastructure in which nodes and clients are identified by their public keys; we further assume node identities are lexicographically ordered and mapped by a bijection to the set [0 … n − 1] which we use to reason about node identities. We assume that, at any point in time, at most f nodes can simultaneously be Byzantine faulty (i.e., crash or deviate from the protocol in an arbitrary way), such that n ≥ 3f + 1. Any number of clients can be Byzantine faulty.

We assume an adversary that can control Byzantine faulty nodes but cannot break the cryptographic primitives we use, such as PKI and cryptographic hashes. H(data) denotes a cryptographic hash of data, while ⟨data⟩σp denotes data signed by process p (client or node).

Nodes implement a BFT total order (atomic) broadcast service to a set of Clients (some clients can be collocated with nodes). To broadcast request r, a client invokes BCAST(r), with nodes eventually outputting DELIVER(sn,r), such that the following properties hold:

  • P1 Validity: If a correct node delivers r, then some client broadcasted r.
  • P2 Agreement: If two correct nodes deliver requests r and r′ with sequence number sn, then r = r′.
  • P3 No duplication: If a correct node delivers request r with sequence numbers sn and sn′, then sn = sn′.
  • P4 Liveness: If a correct client broadcasts request r, then every correct node eventually delivers r.
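
To make the interface concrete, the sketch below is a hypothetical Go rendering of the broadcast service that properties P1-P4 constrain; the type and method names are ours, not the Mir codebase API.

```go
package tob

// Request is an opaque client request (payload plus client identity and timestamp).
type Request struct {
	ClientID  string
	Timestamp uint64
	Payload   []byte
}

// TotalOrderBroadcast is an illustrative view of the service defined by P1-P4.
type TotalOrderBroadcast interface {
	// BCast is invoked by a client to broadcast request r.
	BCast(r Request) error
	// Deliver registers an output callback: the protocol delivers request r
	// with sequence number sn, subject to Validity, Agreement,
	// No duplication and Liveness.
	Deliver(handler func(sn uint64, r Request))
}
```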

Processes communicate through authenticated point-to-point channels. Our implementation uses gRPC [5]; between each pair of nodes there can be multiple gRPC connections/streams. If two processes cannot communicate in a timely manner, we talk about asynchrony. Asynchrony can divide nodes into network partitions.

2.2. Crash Course on PBFT

We depict the PBFT communication pattern in Figure 1.

PBFT proceeds in rounds called views which are led by the primary. The primary sequences and replicates clients’ requests in a PRE-PREPARE message — on WANs this step is a (network) bottleneck. Upon reception of the PRE-PREPARE, other nodes validate the request, which involves, at least, verifying the authenticity of a client’s request (we say a node pre-prepares the request). This is followed by two rounds of all-to-all communication (PREPARE and COMMIT messages), which are not bottlenecks as they leverage n links in parallel and contain metadata (request hash) only. A node prepares a request and sends a COMMIT message if it gets a PREPARE message from a quorum (2f + 1 nodes) that matches a PRE-PREPARE. Finally, nodes commit the request in total order, if they get a quorum of matching COMMIT messages.
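
The quorum rule above can be illustrated with a minimal Go sketch; the function names are ours and the counting assumes the 2f + 1 matching PREPAREs described in the text.

```go
package pbft

// quorum returns the PBFT quorum size for n >= 3f+1 nodes: 2f+1.
func quorum(n int) int {
	f := (n - 1) / 3
	return 2*f + 1
}

// readyToCommit sketches the rule described above: a node sends COMMIT once it
// has a PRE-PREPARE plus matching PREPAREs from a quorum of nodes.
// matchingPrepares counts distinct nodes whose PREPARE matches the PRE-PREPARE digest.
func readyToCommit(n int, hasPrePrepare bool, matchingPrepares int) bool {
	return hasPrePrepare && matchingPrepares >= quorum(n)
}
```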

In PBFT, the primary is changed only if it is faulty or there are network partitions breaking the availability of a quorum. In this case, nodes timeout and initiate a view-change. View-change involves a communication among nodes in which the information about the latest pre-prepared and prepared requests is exchanged, such that the new primary, which is selected in round-robin fashion, must re-propose a potentially committed request under the same sequence number within a NEW-VIEW message (please see [22] for details). The PBFT view-change pattern can be simplified using signatures [21].

After the primary is changed, the system enters the new view and common-case operation resumes. PBFT complements this main common-case/view-change protocols with checkpointing (log and state compaction) and state transfer subprotocols [22].

Protocol                        | PBFT [22]                                    | Mir
Round structure/naming          | views                                        | epochs
Round-change responsibility     | view primary (round-robin across all nodes)  | epoch primary (round-robin across all nodes)
No. of per-round leaders        | 1 (view primary)                             | many (from 1 to n epoch leaders)
Round leader selection          | primary is the only leader                   | primary decides on epoch leaders (subject to constraints)
Request duplication prevention  | enforced by the primary                      | hash space partitioning across epoch leaders (rotating)
Batching                        | no (or, 1 request per “batch”)               | yes
Multiple batches in parallel    | yes (watermarks)                             | yes (watermarks)
No. of batches per round        | unbounded                                    | bounded (unless epoch is stable)
Use of signatures               | no                                           | client requests, epoch-change and checkpoint messages
Internode transport             | UDP                                          | multiple gRPC connections between every pair of nodes

Table 1: High-level overview of the original PBFT [22] vs. the Mir protocol structure.

3 Mir Overview

Mir is based on PBFT [22] (Sec. 2.2) — major differences are summarized in Table 1. In this section we elaborate on these differences, giving a high-level overview of Mir.

Authentication. While PBFT is signature-free, Mir uses signatures for authentication in three places. First, Mir authenticates clients’ REQUEST messages with signatures, as most blockchains do for clients’ requests, and to avoid concerns associated with “big-MAC” attacks related to the MAC authenticators PBFT uses [23]. Second, Mir uses signatures to authenticate EPOCH-CHANGE messages which are sent only to the new primary, to simplify the implementation (avoiding PBFT VIEW-CHANGE-ACK messages, all-to-all VIEW-CHANGE messages and associated bookkeeping). Finally, Mir signs CHECKPOINT messages to simplify recovery.

Batching. Mir processes requests in batches, a standard throughput improvement of PBFT (see e.g., [34, 15]).

Protocol round structure. Mir proceeds in epochs which are the counterpart to views in PBFT. Just like PBFT views, each epoch has a primary which is deterministically defined by the epoch number, by round-robin rotation across all the participating nodes of the protocol.

Each epoch has a set of leaders (in contrast, in PBFT only the primary is the leader). EpLeaders(e) denotes the set of epoch leaders, i.e., the nodes responsible for assigning batches to unique sequence numbers sn within an epoch e. In failure-free operation, Mir proceeds analogously to the normal-case operation of PBFT, except that the sequence numbers are partitioned uniformly among the epoch leaders and all leaders can propose their batches simultaneously.
In epoch e, at most maxSn(e) batches can be ordered. In a recovery epoch e, maxSn(e) is bounded (an integer). If a recovery epoch ends graciously, i.e., nodes deliver this maximum number of batches, Mir transitions to the next epoch via a lightweight gracious epoch-change protocol. In a stable epoch e, maxSn(e) = ∞ and Mir moves to the next epoch only in case of failures or network partitions (ungracious epoch-change). In contrast, every PBFT view can be seen as “stable” and every view-change as ungracious.

Epoch e is stable if and only if the number of epoch leaders is greater or equal to the configuration parameter StableLeaders. In this paper, we set StableLeaders = n (i.e., a stable epoch has all nodes as leaders), and configure Mir to have a stable first epoch 0.

Selecting epoch leaders. In this paper, we use a simple approach to selecting epoch leaders. Namely, the epoch e primary chooses the set of epoch leaders subject to the following constraints, which depend on the outcome of the preceding epoch e′: 1) if e′ ends graciously, the leader set cannot reduce in size, and it grows if the primary believes that more than |EpLeaders(e′)| nodes are correct, 2) if e′ ends ungraciously, the leader set cannot grow, and if |EpLeaders(e′)| > 1 it must reduce in size, and 3) the primary is always in the leader set.

In general, more elaborate leader set choices, which are outside the scope of this paper, can consider a function EpLeaders(e,ctx), where ctx is a context based on heuristics such as execution history, weighted voting, distributed randomness, or stake. The context can be determined locally at each node in a non-interactive way, can be announced by the primary, or can be established among all nodes after a round of interaction. The only property we require is that any two correct nodes that evaluate EpLeaders(e,ctx) get the same set of nodes, deterministically, for the same epoch number e.
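
As an illustration of this determinism requirement, the following Go sketch selects the same leader set at every correct node given the same epoch number and context. The names and the round-robin fill policy are assumptions of the sketch, not details of the Mir implementation.

```go
package mir

// EpochContext carries whatever deterministic inputs the policy uses; here only
// the desired number of leaders, which all correct nodes must agree on.
type EpochContext struct {
	NumLeaders int
}

// EpPrimary returns the primary of epoch e, chosen round-robin over n nodes.
func EpPrimary(e uint64, n int) int {
	return int(e % uint64(n))
}

// EpLeaders deterministically selects ctx.NumLeaders leaders for epoch e,
// always including the epoch primary, then filling the set in round-robin
// order starting from the primary.
func EpLeaders(e uint64, n int, ctx EpochContext) []int {
	k := ctx.NumLeaders
	if k < 1 {
		k = 1
	}
	if k > n {
		k = n
	}
	leaders := make([]int, 0, k)
	start := EpPrimary(e, n)
	for i := 0; i < k; i++ {
		leaders = append(leaders, (start+i)%n)
	}
	return leaders
}
```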

The challenge of request duplication attacks. Moving from the single-leader PBFT to the multi-leader Mir poses the challenge of partitioning requests among epoch leaders. A simplistic approach would be to allow any leader to add any request into a batch, either in the common case, or in the case of client request retransmission (as in e.g., [41, 24, 35]). This approach, combined with a client sending a request to exactly one node, avoids duplication and maintains good throughput only in the best case, i.e., with no malicious clients/leaders and with no asynchrony.

However, this approach is not robust [23] outside the best case, in particular with malicious clients sending requests to multiple nodes, performing the request duplication performance attack. Moreover, a client cannot be naively declared as malicious and blacklisted if it sends a request to multiple nodes. Indeed, as malicious leaders can drop requests selectively (we talk about censoring requests), a client needs to send the request to at least f + 1 nodes (i.e., to O(n) nodes) in the worst case.1 Therefore, a simplistic approach to parallel request processing with multiple leaders [41, 24, 35] faces attacks that can reduce throughput by a factor of O(n), nullifying the effects of using multiple leaders.

Buckets and request hash space partitioning. Mir partitions the hash space into m · n non-intersecting buckets of (approximately) equal size, where m is a configuration parameter. Each leader of epoch e is assigned at least ⌊m · n / |EpLeaders(e)|⌋ buckets; in case of remaining buckets, the primary and the subsequent epoch leaders, in lexicographic order, are assigned 1 additional bucket each.
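
A possible rendering of this assignment in Go is sketched below; the bucket numbering and the leader ordering (primary first, then the rest in lexicographic order) are assumptions of the sketch rather than details taken from the Mir code.

```go
package mir

// AssignBuckets distributes the m*n buckets among the ordered epoch leaders
// (primary first). Every leader gets at least floor(m*n/len(leaders)) buckets;
// any remainder goes to the primary and the subsequent leaders, one extra each.
func AssignBuckets(m, n int, leaders []int) map[int][]int {
	numBuckets := m * n
	base := numBuckets / len(leaders)
	rem := numBuckets % len(leaders)

	assignment := make(map[int][]int, len(leaders))
	next := 0
	for idx, leader := range leaders {
		count := base
		if idx < rem { // primary and next leaders absorb the remainder
			count++
		}
		for b := 0; b < count; b++ {
			assignment[leader] = append(assignment[leader], next)
			next++
		}
	}
	return assignment
}
```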

Rotating bucket assignment. In case of a recovery epoch e, the bucket assignment to leaders is fixed throughout e. In case of a stable epoch e, the bucket assignment rotates. We denote by ActiveBucket(i,sn,e) the set of buckets that leader i within epoch e is assigned for sequence number sn (the parameter sn is relevant here only if e is stable). Figure 2 illustrates the mapping of requests to buckets in a stable epoch with n = 4 (m = 1). A leader proposes only the requests that are mapped to its active buckets.

1 Incentives, e.g., transaction fees [43, 49], could help counter request censoring in the case of a rational adversary [11], potentially simplifying Mir. Here, we focus on the more challenging (“irrational”) adversary.

Figure 2: Request mapping in a stable epoch with n = 4. Incoming arrows represent request arrivals; outgoing arrows represent requests being added to batches. Solid lines represent the active buckets. Hash(Req. 1) is mapped to the first bucket, active at node 0. Hash(Req. 2) is mapped to the third bucket, active at node 2.

As we describe in detail later, the assignment of buckets to epoch leaders rotates periodically within a stable epoch (Sec. 4.3) and across recovery epochs (Sec. 4.4.1), guaranteeing liveness.

Parallelism. The Mir implementation (detailed in Sec. 4.8) is highly parallelized, with every worker thread responsible for one batch (consensus instance). In addition, Mir uses multiple gRPC connections between each pair of nodes, which proves to be critical in boosting throughput in a WAN, especially with a small number of nodes.

Generalization of PBFT and emulation of other protocols. Mir reduces to PBFT by setting StableLeaders = 1. This makes every epoch stable, hides bucket rotation (the primary is the single leader) and makes every epoch change ungracious. Mir can approximate protocols such as Tendermint [19] and Spinning [47] by setting StableLeaders > 1 and fixing maxSn(e) = 1 and |EpLeaders(e)| = 1 in every epoch e, making every epoch a recovery epoch and rotating the leader/primary with every batch.
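
The parameterizations mentioned above can be summarized in a small, hypothetical configuration sketch; the field names, types and constructor names are illustrative, not taken from the Mir codebase.

```go
package mir

// Config gathers the Mir parameters discussed above (illustrative names/types).
type Config struct {
	StableLeaders int // an epoch is stable iff |EpLeaders(e)| >= StableLeaders
	MaxSnRecovery int // maxSn(e) for recovery epochs; stable epochs are unbounded
	EpochLeaders  int // target |EpLeaders(e)|
}

// PBFTLike reduces Mir to PBFT: every epoch is stable with a single leader
// (the primary), and every epoch change is ungracious.
func PBFTLike() Config {
	return Config{StableLeaders: 1, EpochLeaders: 1}
}

// SpinningLike approximates rotating-leader protocols: with StableLeaders > 1
// and a single leader per epoch, no epoch is stable, and maxSn(e) = 1 rotates
// the leader/primary with every batch.
func SpinningLike() Config {
	return Config{StableLeaders: 2, EpochLeaders: 1, MaxSnRecovery: 1}
}
```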

4 Mir Implementation Details

4.1 The Client

Upon BCAST(o), broadcasting operation o, client c creates a message ⟨REQUEST, o, t, c⟩σc. The message includes the client’s timestamp t, a monotonically increasing sequence number that must lie in a sliding window between the low and high client watermarks: tcL < t ≤ tcH. Client watermarks in Mir allow multiple requests originating from the same client to be “in-flight”, while allowing them to be processed by different leaders in parallel. The low and high watermarks of the client’s timestamp sliding window are periodically advanced by the checkpoint mechanism described in Section 4.5.

In this section, we assume that the client sends ⟨REQUEST, o, t, c⟩σc to all nodes. We however implemented a lightweight discovery mechanism (see Section 5.3) that allows clients to submit requests to a single node.
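
A minimal client-side sketch, assuming an Ed25519 signature scheme and an ad-hoc encoding (both assumptions of ours, not Mir’s wire format), shows how a request timestamp is drawn from the sliding window before signing.

```go
package mir

import (
	"crypto/ed25519"
	"encoding/binary"
)

// Request is the client's signed REQUEST message (illustrative encoding).
type Request struct {
	Operation []byte
	Timestamp uint64 // client sequence number, must satisfy tL < Timestamp <= tH
	ClientID  []byte
	Signature []byte
}

// NewRequest builds and signs a REQUEST whose timestamp lies in the client's
// current sliding window (tL, tH]; it returns false if the window is exhausted.
func NewRequest(op []byte, next, tL, tH uint64, id []byte, sk ed25519.PrivateKey) (Request, bool) {
	if !(next > tL && next <= tH) {
		return Request{}, false // wait for a checkpoint to advance the window
	}
	msg := append(append([]byte{}, op...), id...)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], next)
	msg = append(msg, ts[:]...)
	return Request{Operation: op, Timestamp: next, ClientID: id, Signature: ed25519.Sign(sk, msg)}, true
}
```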

4.2 Common-case operation

Within an epoch e, the leadership in proposing batches is partitioned across epoch leaders. The epoch primary proposes the first batch in the epoch; after that, the leaders take turns in leading batches in a deterministic, lexicographic order. We say that a leader leads batch Bsn when the leader is assigned to broadcast the PRE-PREPARE for the batch with sequence number sn. Batches are proposed in parallel by all epoch leaders and are processed as in PBFT. Recall that batch watermarking allows the PBFT primary to propose multiple batches in parallel; in Mir, we simply extend this to multiple leaders (see Fig. 3).

Figure 3: PRE-PREPARE sending in a Mir stable epoch: all nodes are epoch leaders, balancing the proposal load.

In epoch 0, we assign buckets to leaders sequentially, starting from the buckets with the lowest hash values, which we assign to primary 0, and so on. When e > 0, the primary picks its buckets by selecting a consecutive sequence of buckets starting from the bucket that contains the oldest request it received; this is key to ensuring Liveness (P4, Sec. 2.1). The other leaders are then deterministically and sequentially assigned the following buckets.

With such an assignment, the protocol proceeds as follows. Upon receiving ⟨REQUEST, o, t, c⟩σc from a client, an epoch leader first verifies that the request timestamp is in the client’s current window tcL < t ≤ tcH and maps the request to the respective bucket by hashing the request payload along with the client timestamp and identifier: hr = H(o||t||c). If the request falls into the leader’s active bucket, the leader also verifies the client’s signature on REQUEST. Malformed signatures result in a node locally blacklisting the client for a predefined period of time.
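
The hash-to-bucket mapping can be sketched as follows; SHA-256 and the prefix-modulo bucketing rule are illustrative assumptions, not necessarily what the Mir implementation uses.

```go
package mir

import (
	"crypto/sha256"
	"encoding/binary"
)

// RequestHash computes hr = H(o || t || c) for operation o, client timestamp t
// and client identifier c, matching the mapping described above.
func RequestHash(o []byte, t uint64, c []byte) [32]byte {
	h := sha256.New()
	h.Write(o)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], t)
	h.Write(ts[:])
	h.Write(c)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// BucketOf maps a request hash to one of numBuckets buckets by interpreting a
// prefix of the hash as an integer; this approximates the equal-size partition
// of the hash space.
func BucketOf(hr [32]byte, numBuckets int) int {
	prefix := binary.BigEndian.Uint64(hr[:8])
	return int(prefix % uint64(numBuckets))
}
```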

The request is discarded if hr is already in the logs of the node, either because it has already been pre-prepared or because it is already pending in a bucket.

Each bucket is implemented as a FIFO queue. Once enough requests are gathered in the current active bucket of the leader, or if timer Tbatch expires (since the last batch was proposed by i), leader i adds the requests from the current active bucket to a batch, assigns to the batch its next available sequence number sn (provided sn is within the batch watermarks) and sends a PRE-PREPARE message. If Tbatch has elapsed and no requests are available, leader i broadcasts a special PRE-PREPARE message with an empty batch. This guarantees the progress of the protocol under low load.
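
The batch-cutting rule (size threshold or Tbatch timeout, possibly yielding an empty batch) might look like the following sketch; the struct and its fields are hypothetical.

```go
package mir

import "time"

// Batcher cuts a batch either when the active bucket holds enough request
// payload or when Tbatch elapses since the last cut (possibly an empty batch,
// to keep the protocol making progress under low load).
type Batcher struct {
	MaxBatchBytes int
	Tbatch        time.Duration
	lastCut       time.Time
	pendingBytes  int
}

// ShouldCut reports whether the leader should propose a batch now.
func (b *Batcher) ShouldCut(now time.Time) bool {
	return b.pendingBytes >= b.MaxBatchBytes || now.Sub(b.lastCut) >= b.Tbatch
}

// Cut resets the batcher state after a PRE-PREPARE is sent.
func (b *Batcher) Cut(now time.Time) {
	b.lastCut = now
	b.pendingBytes = 0
}
```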

Each node j accepts a PRE-PREPARE (we say j pre-prepares the batch and the requests it contains) with sequence number sn for epoch e from leader i provided that: (1) the epoch number matches the local epoch number and j did not pre-prepare another batch with the same e and sn, (2) node i is in the EpLeaders(e) set, (3) leader i leads sn, (4) the sequence number sn of the batch in the PRE-PREPARE is between a low watermark and a high watermark: w < sn ≤ W, (5) every request in the batch has a timestamp within the current client’s watermarks, (6) none of the requests in the batch have already been pre-prepared, (7) each request in the batch is assigned to ActiveBucket(i,sn,e), and (8) every request in the batch was submitted by a client authorized to write and the request signature is valid.

Condition (8) effectively enforces access control, which helps protect against flooding denial-of-service (DoS) attacks and helps ensure Validity (Property P1, Sec. 2.1). As this step may prove to be a bottleneck in Mir if performed by all nodes (e.g., when all nodes need to perform a relatively expensive cryptographic task such as signature verification per transaction), we use signature sharding as an optimization (see Sec. 5.2).

If validation succeeds, node j then sends a PREPARE and the protocol proceeds exactly as PBFT. Upon committing a batch with sequence number k from leader i, node j removes from its buckets any request that is already in batch k.
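
The acceptance conditions (1)-(8) above can be gathered into a single check, sketched below in Go; the closures stand in for the node’s local state and are assumptions of this sketch, not Mir APIs. Request refers to the client request type from the earlier client sketch.

```go
package mir

// prePrepare is the metadata a node checks before pre-preparing a batch
// (payload and signatures omitted; names are illustrative).
type prePrepare struct {
	Epoch, Sn uint64
	Leader    int
	Requests  []Request
}

// acceptPrePrepare mirrors conditions (1)-(8) described in the text.
func acceptPrePrepare(
	p prePrepare,
	localEpoch uint64,
	alreadyPrePrepared func(epoch, sn uint64) bool, // (1) same (e, sn) already used?
	isEpochLeader func(id int) bool, //                (2)
	leadsSn func(id int, sn uint64) bool, //           (3)
	lowW, highW uint64, //                             (4) batch watermarks
	inClientWindow func(r Request) bool, //            (5)
	seenRequest func(r Request) bool, //               (6) duplication check
	inActiveBucket func(leader int, sn uint64, r Request) bool, // (7)
	authorizedAndValidSig func(r Request) bool, //     (8)
) bool {
	if p.Epoch != localEpoch || alreadyPrePrepared(p.Epoch, p.Sn) {
		return false
	}
	if !isEpochLeader(p.Leader) || !leadsSn(p.Leader, p.Sn) {
		return false
	}
	if !(p.Sn > lowW && p.Sn <= highW) {
		return false
	}
	for _, r := range p.Requests {
		if !inClientWindow(r) || seenRequest(r) ||
			!inActiveBucket(p.Leader, p.Sn, r) || !authorizedAndValidSig(r) {
			return false
		}
	}
	return true
}
```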

4.3 Active bucket rotation in a stable epoch

Mir introduces a bucket rotation mechanism to prevent request censoring, as we motivated in Section 3.

Bucket rotation in a stable epoch relies on leader-to-leader bucket handover, which works as follows. Every BR(e) batches (a configuration parameter), leaders rotate the assignment of buckets, such that leader i gets assigned the buckets previously led by leader i + 1 (in modulo-n arithmetic). To prevent request duplication, leader i waits to commit locally all batches pertaining to the buckets i gets assigned (in particular those led by i + 1) before it starts proposing its own batches. Other nodes do the same before they pre-prepare batches in these buckets that are proposed by i.

Referring to the example shown in Figure 2, with n = 4 and 4 buckets in total, after BR(e) batches, node 0 gets assigned the red bucket (which was assigned to node 1), yet node 0 starts proposing batches only after it locally commits all previously proposed batches pertaining to the red bucket — informally, node 1 hands over the red bucket to node 0.
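
Assuming buckets rotate every BR batches as described (ownership moving from leader i + 1 to leader i), the current owner of a bucket can be computed as in the following sketch; the exact rotation schedule in the Mir implementation may differ.

```go
package mir

// ActiveBucketOwner returns, for a stable epoch, the leader that currently owns
// a bucket after the rotation described above: every br batches, ownership
// shifts so that leader i takes over the buckets previously led by leader i+1
// (mod n). initialOwner is the bucket's owner at the start of the epoch.
func ActiveBucketOwner(initialOwner int, sn uint64, n int, br uint64) int {
	rotations := int(sn / br)
	// shifting buckets from leader i+1 to leader i means the owner index
	// decreases by one (mod n) per rotation period
	return (initialOwner - rotations%n + n) % n
}
```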

4.4 Epoch-change

Mir distinguishes two variants of epoch-change, gracious and ungracious epoch change.

4.4.1 Gracious epoch-change

Gracious epoch change occurs at the gracious end of a recovery epoch. Its goal is to implement a lightweight mechanism for potentially growing the set of leaders towards a stable epoch, and to implement a variant of the bucket rotation to ensure Liveness across recovery epochs.

After the primary of epoch e + 1 (EpPrimary(e + 1)) delivers maxSn(e) batches of recovery epoch e (or, as an optimization, shortly before), EpPrimary(e + 1) reliably broadcasts the configuration of epoch e + 1. To this end, we use the classical 3-phase Bracha reliable broadcast [18].

The epoch configuration information, which the primary reliably broadcasts, contains: 1) the set of epoch leaders for the new epoch, 2) identifiers of buckets that the primary picked for itself, derived from the oldest requests pending at the primary. Recall that, if e ends graciously, the leader set cannot reduce in size and it grows if the primary of epoch e + 1 believes that more than |EpLeaders(e)| nodes are correct. In this case, the primary proposes min(n, |EpLeaders(e)| + 1) nodes, chosen by the primary.2 In case the primary of epoch e + 1 estimates that no more than |EpLeaders(e)| nodes are correct, it is allowed to maintain the same set of leaders as in the previous epoch — this avoids frequent oscillations between gracious and ungracious epoch changes, e.g., in case a few nodes are crash-faulty.
Finally, similar to the bucket handover (Sec. 4.3), leader i in epoch e + 1 starts proposing batches as soon as it delivers all batches from e from the nodes that were assigned the buckets now assigned to i.

4.4.2 Ungracious epoch-change

Ungracious epoch-changes in Mir are caused by epoch timeouts due to asynchrony or failures and generalize PBFT view-changes. Similar to PBFT, after delivering batch sn in epoch e, a node resets and triggers an epoch-change timer ecTimer. If an ecTimer expires, a node enters the epoch-change subprotocol to move from epoch e to epoch e + 1. In this case, a node sends an EPOCH-CHANGE message to EpPrimary(e + 1). The EPOCH-CHANGE message follows the structure of the PBFT VIEW-CHANGE message (page 411, [22]), with the difference that it is signed and that there are no VIEW-CHANGE-ACK messages exchanged (for simplicity of implementation). The construction of the NEW-EPOCH message then proceeds in the same way as the PBFT construction of the NEW-VIEW message.

Before triggering the PBFT-inherited processing of the NEW-EPOCH message, nodes wait to reliably deliver the configuration information pertaining to the new epoch, which the primary reliably broadcasts, just like in a gracious epoch change (Sec. 4.4.1). The difference is that in an ungracious epoch change the epoch primary must select a smaller number of epoch leaders than in the previous epoch. Concretely, in the configuration for new epoch e, the epoch primary picks the number of leaders in the last epoch e′ for which it has the configuration, and proposes at most max(1, |EpLeaders(e′)| − (e − e′)) leaders. Note that the epoch primary must always be in the epoch leader set.

Finally, to counter the possibility of losing requests due to an epoch change, a node resurrects potentially pre-prepared but uncommitted requests from previous epochs that are not reflected in the NEW-EPOCH message. Indeed, when an epoch change occurs, not all batches that were created and potentially pre-prepared before this event are delivered when installing the new epoch. To prevent the requests in these batches from being lost (due to condition (6) in pre-preparing a batch — Sec. 4.2), before resuming normal operation after an ungracious epoch change, each correct node ensures that (1) the requests in such batches are returned to the node’s pending buckets, and (2) these requests are removed from the logs of the node where they were marked as pre-prepared. Thus, these requests are ready to be proposed again. Together with the requirement that clients ensure that a correct replica eventually receives their request, this guarantees Liveness (P4), i.e., that client requests are eventually delivered, even in the face of epoch changes.

2 Although in this paper we use sequential increments and decrements of the size of the leader set, the leader set can grow and shrink according to different policies.

4.5 Checkpointing (Garbage Collection)

Similarly to PBFT, Mir uses a checkpoint mechanism to prune the message logs. After a node i has delivered a batch with sequence number snC divisible by the configuration parameter C (which means that all batches with sequence numbers lower than snC have been locally committed at i), node i broadcasts ⟨CHECKPOINT, snC, H(snC)⟩σi, where snC is the sequence number of the last checkpoint and H(snC) is the hash digest of the batches with sequence numbers sn in the range snC − C < sn ≤ snC. Each node collects checkpoint messages until it has 2f + 1 of them, including its own, and persists a checkpoint certificate. At this point, the checkpoint is stable and the node can discard the common-case messages from its log for sequence numbers lower than snC.

Mir advances batch watermarks at checkpoints like PBFT does. Clients’ watermarks are also possibly advanced at checkpoints, as the state related to previously delivered requests is discarded. For each client c, the low watermark tcL advances to the highest timestamp t in a request submitted by c that has been delivered, such that all requests with timestamps t′ < t have also been delivered. The high watermark advances to tcH = wc + tcL, where wc is the length of the sliding window.
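
The per-client watermark advance at a checkpoint can be sketched as follows; the delivered-timestamp set is a simplification of the node’s real bookkeeping.

```go
package mir

// AdvanceClientWindow advances a client's low watermark tL to the highest
// delivered timestamp t such that every timestamp up to t has been delivered,
// and sets tH = tL + w for sliding-window length w (sketch of Sec. 4.5).
// delivered holds the set of delivered timestamps for this client.
func AdvanceClientWindow(tL, w uint64, delivered map[uint64]bool) (newTL, newTH uint64) {
	newTL = tL
	for delivered[newTL+1] {
		newTL++
	}
	return newTL, newTL + w
}
```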

Note that node i does not discard the validated requests that are pending in the bucket queues. These are removed from the pending queues either when the node proposes the request in a PRE-PREPARE message or when the request is committed, as explained in Section 4.2.

4.6 State transfer

Replicas can temporarily become unavailable, either due to network partitioning, or due to transient failures. Upon recovery/reconnection, replicas must obtain several pieces of information before being able to actively participate in the protocol again. To achieve this, replicas need to obtain the current epoch configuration information, the latest stable checkpoint (which occurred at the batch with sequence number h), as well as information concerning proposals with sequence numbers between h + 1 and the current sequence number.

The state must, in particular, contain two pieces of information: (1) the current epoch configuration, which is necessary to determine the leaders from which the replica should accept proposals, and (2) client timestamps at the latest checkpoint, which are necessary to prevent including client requests that have already been proposed in future blocks.

A reconnecting replica i obtains this information by broadcasting a ⟨HELLO, nei, ci, bi⟩ message, where nei is the latest NEW-EPOCH message received by the replica, ci is the replica’s last stable checkpoint, and bi is the last batch i delivered. Upon receipt of a HELLO message, another replica j replies with its own HELLO message, as well as with any missing state from the last stable checkpoint up to the current sequence number.

We perform further optimizations in order to reduce the amount of data that needs to be exchanged in case of a reconnection. First, upon reconnecting, replicas announce their presence but wait for the next stable checkpoint after reconnection before actively participating in the protocol again. This enables us to avoid transferring the entire state related to requests following the preceding stable checkpoint. Second, the amount of data related to client timestamps that needs to be transmitted can be reduced through only exchanging the root of the Merkle tree containing the client timestamps, with the precise timestamps being fetched on a per-need basis.

4.7 Membership reconfiguration

While details of membership reconfiguration are outside of the scope of this paper, we briefly describe how Mir deals with adding/removing clients and nodes. Such requests, called configuration requests, are totally ordered like other requests, but are tagged to be interpreted/executed by nodes (hence they are not subject to the LTO optimization, Sec. 5.1). As Mir processes requests out of order (just like PBFT), configuration requests cannot be executed right after committing a request, as the timing of commitment might diverge across nodes, resulting in non-determinism. Instead, configuration requests are taken into account only at checkpoints; more specifically, all configuration requests ordered between checkpoints k − 1 and k take effect only after checkpoint k + 1.

4.8 Implementation Architecture

We implemented Mir in Go. Our implementation is multi-threaded and inspired by the consensus-oriented parallelism (COP) architecture previously applied to PBFT to maximize its throughput on multicore machines [16]. Specifically, in our implementation, a separate thread is dedicated to managing each batch during the common-case operation, which simplifies the Mir code structure and helps maximize performance. We further parallelize computation-intensive tasks whenever possible (e.g., signature verifications, hash computations). The only communication in the common case between Mir threads pertains to request duplication prevention – the shared data structures for duplication prevention are hash tables with per-bucket locks; instances that handle requests corresponding to different leaders do not access the same buckets. The only exception to the multi-threaded operation of Mir is during an ungracious epoch-change, where a designated thread (Mir Manager) is responsible for stopping worker common-case threads and taking the protocol from one epoch to the next. This manager thread is also responsible for sequential batch delivery and for checkpointing, which however does not block the common-case threads managing batches.

Our implementation also parallelizes network access. We use a configurable number of independent network connections between each pair of servers, which results in several gRPC connections between each pair of servers (the number of gRPC connections between a pair of servers is, however, considerably smaller than the number of Mir threads). This proves to be critical in boosting Mir performance beyond the apparent bandwidth limitations in a WAN that stem from using a single TCP/TLS connection. In addition to multiple inter-node connections, we use an independent connection for handling client requests. As a result, the receipt of requests is independent of the rest of the protocol – we can safely continue to receive client requests even if the protocol is undergoing an epoch change. Our implementation can hence seamlessly use, where possible, separate NICs for clients’ requests and inter-node communication to address DoS attacks [23].

Finally, cleaning up the duplication-prevention data structures at a checkpoint is a relatively expensive operation; yet, because the watermark distance is larger than the checkpoint period, BFT instances can still proceed even while a checkpoint is being handled — therefore, this does not significantly affect throughput.

Persisting state (durability). By default, the Mir implementation does not persist state or message logs to stable storage. Hence, a node that crashes might recover in a compromised state — however, as described above, such a node does not participate in the protocol until the next stable checkpoint, which effectively restores a correct state. While we opted for this approach assuming that, for a few dozen nodes, simultaneous faults of up to a third of them will be rare, for a small number of nodes the probability of such faults grows and might exceed the threshold f. Therefore, we optionally persist state pertaining to sent messages in Mir, which is sufficient for a node to recover to a correct state after a crash. Section 7 shows the minimal performance impact of persisting state in Mir.

5 Optimizations

5.1 Lightweight Total Order (LTO)

When the system is network-bound (e.g., with large requests and/or on a WAN), the maximum throughput is driven by the amount of data each leader can broadcast in a PRE-PREPARE message. However, the data, i.e., the request payload, is not critical for total order, as the nodes can establish total order on digests. While in many blockchains all nodes need the data [2, 4], in some others [13] ordering is separated from request execution, and full replication across ordering nodes is overkill.

For such blockchains, Mir optionally boosts throughput using what we call Light Total Order (LTO) broadcast. LTO is defined in the same way as TO broadcast (Sec. 2.1) except that LTO requires property P4 to hold for the hash of the request, H(r), instead of for the request r, and adds the following property:

  • P5 Partial Replication: If a correct client broadcasts request r, then at least one correct node eventually delivers r.

The LTO optimization for Mir modifies the protocol as follows. Each leader broadcasts a full PRE-PREPARE message only to a set of 2f + 1 Replicas³ (a leader is always among the Replicas of its own batch). To the rest of the nodes, which we call Observers, the leader broadcasts a lightweight PRE-PREPARE message which contains only metadata about the requests. This metadata contains: (a) the hash of the request, (b) the identifier of the client who submitted the request, and (c) the request timestamp. The request hash is necessary so that each node can remove committed requests from its pending queues. The client identifier and request timestamp are necessary to guarantee that all nodes advance the per-client watermarks consistently.
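
A lightweight PRE-PREPARE thus only needs to carry per-request metadata; a possible, purely illustrative layout:

```go
package mir

// lightPrePrepare is the metadata-only PRE-PREPARE an LTO leader sends to
// Observers: request hashes, client identifiers and client timestamps, enough
// to de-duplicate pending requests and advance per-client watermarks
// (field names are illustrative).
type lightPrePrepare struct {
	Epoch, Sn uint64
	Leader    int
	Entries   []lightEntry
}

type lightEntry struct {
	RequestHash [32]byte
	ClientID    []byte
	Timestamp   uint64
}
```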

Upon receiving a PRE-PREPARE (Sec. 4.2), Observers must only verify: (a) condition (1), i.e., that the epoch number of the batch is correct and that no other batch has been pre-prepared in the same epoch with the same sequence number, and (b) condition (6), to guarantee no duplication. Conditions (2)-(5) and (7)-(8) ensure that the batch is valid, and it is sufficient that one correct node has verified them. Such a correct verifier will always exist among the set of 2f + 1 senders of the PREPARE messages that each node expects before sending a COMMIT message.

3 It is possible to reduce LTO partial replication to f + 1 nodes, yet we keep 2f + 1 for better data availability in practice.

5.2 Signature Verification Sharding

As the Mir multi-leader approach addresses network bottlenecks, it often exposes a CPU bottleneck due to relatively costly client signature verification. To offload the CPU, we enable the signature verification sharding optimization. In short, in a stable epoch we require that the signatures in each batch are verified by only f + 1 nodes, instead of requiring each node to perform signature verification, while in a recovery epoch the number of verifiers is 2f + 1.

In detail, let Verifiers(sn,e) be the set of nodes that are responsible for verifying the transaction signatures of the batch with sequence number sn in epoch e. The leader that proposes the batch is always in Verifiers(sn,e). For the other nodes in Verifiers(sn,e) we use a partitioning mechanism similar to the one we introduced for partitioning requests into buckets. Each batch is hashed to a value and the value is mapped to a verification bucket. However, unlike with request sharding, where each bucket is assigned to exactly one leader, each verification bucket is assigned to f + 1 (resp., 2f + 1) nodes in a stable (resp., recovery) epoch.
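
A sketch of the verifier-set selection follows, assuming a hash-prefix rule for picking the non-leader verifiers; the actual mapping to verification buckets in Mir may differ.

```go
package mir

import "encoding/binary"

// Verifiers sketches verifier-set selection for the batch with hash batchHash:
// the proposing leader is always a verifier; the remaining verifiers are taken
// from a deterministic mapping of the batch hash to node identities, with
// f+1 verifiers in a stable epoch and 2f+1 in a recovery epoch.
func Verifiers(batchHash [32]byte, leader, n, f int, stable bool) map[int]bool {
	size := f + 1
	if !stable {
		size = 2*f + 1
	}
	verifiers := map[int]bool{leader: true}
	// deterministic start derived from the batch hash (illustrative rule)
	start := int(binary.BigEndian.Uint64(batchHash[:8]) % uint64(n))
	for i := 0; len(verifiers) < size; i++ {
		verifiers[(start+i)%n] = true
	}
	return verifiers
}
```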

A node i, upon receiving ⟨PRE-PREPARE, sn, e⟩, verifies the clients’ signatures in the batch if i ∈ Verifiers(sn,e) before broadcasting ⟨PREPARE, sn, e⟩. Otherwise, if i ∉ Verifiers(sn,e), node i checks only conditions (1)-(7) (see Section 4.2). Each node j broadcasts ⟨COMMIT, sn, e⟩ upon receiving the required PREPARE messages: in a stable epoch, a node waits for ⟨PREPARE, sn, e⟩ from all f + 1 nodes in Verifiers(sn,e) and for f more PREPARE messages.

5.3 Lightweight Leader Discovery

We designed a best-effort lightweight discovery mechanism to help the client send a request to a single node, the one actually responsible for the request, while maintaining low latency (which might grow if the client “misses” the responsible leader).

Within a recovery epoch e, discovering the leader responsible for a request hash h is equivalent to discovering the view leader in PBFT. Every node participating in epoch e has the configuration of e, including the bucket assignment, and can point the client to the right node. If a client contacts a correct node, it gets a correct bucket assignment, which the client needs to obtain only once per epoch.

In the case of a stable epoch e, a node replies to the client with the current bucket assignment, from which the client can deduce the next f (or more) nodes that will be responsible for the client’s request if no epoch change occurs. The client then submits the request to the apparently responsible node, and can additionally send the request to one or more of the next responsible leaders. In the worst case, the client sends the request to all nodes, as described in Section 4.1.

6 Correctness Arguments

In this section we sketch Mir correctness arguments, focusing on TO properties, as defined in Section 2, discussing also the impact of optimizations (Sec. 5).

Validity (P1) relies on clients’ signatures which Mir uses to authenticate the requests. Without signature sharding, every signature is verified by at least 2f + 1 nodes, out of which f + 1 are correct. With signature sharding, clients’ signatures are verified by at least f +1 nodes, out of which at least one is correct — guaranteeing Validity.

Agreement (P2) is best shown by contradiction and reduction to PBFT Agreement, which we outline here. Suppose that Agreement does not hold in Mir; in this case, because of the Mir structure which generalizes PBFT, there exists an execution of PBFT similar to that of Mir, in which: 1) all requests proposed in a Mir epoch are proposed in the respective PBFT view by the primary, 2) every gracious epoch change in Mir is replaced by a view-change in PBFT due to timeouts, and 3) there is an Agreement violation in PBFT. A contradiction.

No duplication (P3) stems from the way Mir prevents duplicate pre-prepares (condition (6) in accepting a PRE-PREPARE, Sec. 4.2). The exception to this rule, in the form of batch/request resurrection during an ungracious epoch change (Sec. 4.4.2), does not impact P3, as only requests from uncommitted batches are resurrected.

Liveness (P4) can be shown by contradiction as follows. Assume a correct client sends a request to all nodes, and the request is received by at least one correct node i.⁴ Fix req to be the oldest request received by i for which liveness is broken. Consider the time after GST. It is easy to show that in Mir, either (1) i becomes an epoch primary infinitely often, or (2) there is a last epoch e, a stable epoch that runs infinitely long. In case (1), let e be an epoch in which req is the oldest request pending at node i and i is the primary (such an epoch exists due to the choice of req and the resurrection of uncommitted but pre-prepared requests (Sec. 4.4.2)). In case (2), i is a leader infinitely often in e, including being the leader of a bucket req belongs to. In both cases, req gets proposed by i and is committed (the system is synchronous after GST), a contradiction.

The signature sharding optimization (Sec. 5.2) does not compromise Validity/Agreement. In the case of a stable epoch, we expect all nodes to be alive, since all nodes are in the EpLeaders set. Therefore, we expect that PREPARE messages from all f + 1 nodes in Verifiers(sn,e) will arrive. If some node does not forward a PREPARE, or a leader is malicious and does not forward ⟨PRE-PREPARE, sn, e⟩ to Verifiers(sn,e), the batch timer expires and Mir enters a recovery epoch. In the case of a recovery epoch, |Verifiers(sn,e)| = 2f + 1. As the set of 2f + 1 nodes that sent PRE-PREPARE and PREPARE messages intersects with the set of Verifiers in at least f + 1 nodes in a recovery epoch, at least one of these will be a correct node.

Finally, it is easy to see that the LTO optimization (Sec. 5.1) yields Liveness (P4) on hashes and ensures Partial Replication (P5, Sec. 5.1) on request payloads.

Parameter                  | Value
Batch size                 | 2 MB
Cut batch timeout          | 250 ms
Epoch-change timeout       | 20 s
Max batches per epoch      | ∞ (stable), 16 ∗ n (recovery)
Bucket rotation period     | 16 ∗ n
Buckets per leader (m)     | 2
Checkpoint period          | 16 (n ≤ 16), 64 (16 < n ≤ 49), 128 (n > 49)
Watermark window size      | 64 (n ≤ 16), 128 (16 < n ≤ 49), 256 (n > 49)
Parallel gRPC connections  | 5 (n = 4), 3 (n = 10), 2 (n > 10)

Table 2: Mir configuration parameters used in the evaluation.

4 The argument holds even if the client sends a request to f +1 nodes.

7 Evaluation

In this section, we report on experiments we conducted in the scope of the Mir performance evaluation. Our evaluation aims at answering the following questions: (1) how does Mir scale on a WAN? (2) how does Mir perform in clusters? (3) what is the impact of the bucket rotation duplication prevention mechanism relative to a non-robust variant of Mir? (4) what is the impact of Mir optimizations? (5) how does Mir perform under faults? and (6) what is the impact of persisting state (durability) in Mir?

7.1 Experimental setup

Our evaluation consists of microbenchmarks with 2 request payload sizes: (1) small, 500-byte requests, which correspond to the average Bitcoin transaction size [9], and (2) large, 3500-byte requests, typical in Hyperledger Fabric [13].

We compare Mir to a state-of-the-art PBFT implementation optimized for multi-cores [16]. For a fair comparison, we use the Mir codebase tuned to closely follow the PBFT implementation of [16], hardened to implement Aardvark [23]. As another baseline, we compare the common-case performance of Chain, an optimistic subprotocol of the Aliph BFT protocol [15], which is known to be throughput-optimal in clusters. PBFT and Chain are always given the best possible setups, i.e., the PBFT leader is always placed in a node that has the most effective bandwidth and Chain spans the path with the smallest latency. We also compare with Honeybadger [40] using the open-source implementation⁵ which was also used in the performance evaluation in [40]. We only compare Honeybadger with Mir for small requests, since the default payload in the open-source implementation is fixed to 250-byte requests. We do not compare to other protocols because they are either unavailable (e.g., Hashgraph [35], Red Belly [24]), unmaintained (BFT-Mencius [41]), faithfully approximated by PBFT (e.g., BFT-SMaRt [17], Spinning [47], Tendermint [7]), or report considerably worse performance than Mir (e.g., Algorand [30]).

We use VMs on a leading cloud provider, with 32 x 2.0 GHz VCPUs and 32 GB RAM, equipped with 1 Gbps networking and limited to that value for experiment repeatability, due to the non-uniform bandwidth overprovisioning we sometimes experienced. The number of parallel connections Mir uses varies and is higher for smaller numbers of nodes. Table 2 lists all the Mir configuration parameters we use in our performance evaluation.

Figure 4: WAN scalability experiment.

5 https://github.com/initc3/HoneyBadgerBFT-Python

7.2 Scalability on WANs

To evaluate Mir scalability, we ran it with up to n = 100 nodes on a WAN setup which spans 16 distinct datacenters across the world (beyond n = 16, we collocate nodes across already used datacenters). Figure 4 depicts the common-case stable epoch performance of Mir, compared to that of PBFT and Chain (for both small and large requests) and Honeybadger (for small requests).

Client requests are generated by increasing the number of client instances and the request rate per client instance until the throughput saturates, and we report the throughput just below saturation. Client machines are also uniformly distributed across the 16 datacenters. The client instances estimate which node i has an active bucket for each of their requests and broadcast each request to nodes i − 1, ..., i + k, where k ≤ f − 1, i.e., to at most f + 1 nodes.

We observe that PBFT throughput decays rapidly, following an O(1/n) trend, and scales very poorly. Chain scales better and even improves with up to n = 16 nodes, sustaining 20k (resp., 3k) tps for small (resp., large) requests. It is worth emphasizing that: (1) our evaluation of Chain is best-case and involves the manual assignment of the best path to Chain, and (2) Chain is not robust and needs to be abandoned in case of faults [15].

Compared to Honeybadger, Mir retains higher throughput, even though: (i) the Honeybadger request size is smaller (250 bytes vs. 500 bytes), and (ii) Honeybadger batches are significantly larger (up to 500K requests in our evaluation). This is due to the fact that Honeybadger is computationally bound by O(N²) threshold signature verifications and, on top of that, the verification of the signatures is done sequentially. Honeybadger’s throughput also suffers from request duplication (on average 1/3 duplicate requests per batch), since the nodes choose the requests they add to their batches at random. Moreover, Honeybadger latency, which is in the order of minutes (partly due to the large number of requests per batch and partly due to heavy computation), is significantly higher than that of Mir. In our evaluation we could not increase the batch size as much as in the evaluation in [40], especially when increasing the number of nodes beyond 16, due to memory exhaustion issues. Finally, in our evaluation PBFT outperforms Honeybadger (unlike in [40]), as our implementation of PBFT leverages the parallelism of the Mir codebase.

In summary, Mir delivers 53.7k (resp., 20.5k) tps with small (resp., large) requests with n = 4 nodes, which drops to 42k (resp., 15.1k) tps at n = 49 nodes, a penalty of 22-27%. With n = 100, these numbers reduce to 23.2k (resp., 8.3k) tps (more than a 2.5x improvement over Chain), a further 45% penalty. We attribute this drop in part to the heterogeneity of VMs across datacenters.