# Verifiable ASICs: trustworthy hardware with untrusted components

Riad S. Wahby<sup>°\*</sup>, Max Howald<sup>†\*</sup>, Siddharth Garg\*, abhi shelat<sup>‡</sup>, and Michael Walfish\*

> °Stanford University \*New York University †The Cooper Union ‡The University of Virginia

> > June 10th, 2016

#### Setting: ASICs with mutually distrusting designer, manufacturer






Here we are thinking about ASICs, not CPUs:








e.g., a network firewall appliance, with a custom chip for packet processing





What if our packet processing chip has a back door?




Threat: incorrect execution of the packet filter (Other concerns, e.g., secret state, are important but orthogonal)




[Screenshot of a news article: "Fake tech gear has infiltrated the U.S. government," by David Goldman (@DavidGoldmanCNN), November 8, 2012, under the banner "The Cybercrime Economy"]





US DoD controls supply chain with trusted foundries.

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]


#### But trusted fabrication is not a panacea:

- Only 5 countries have cutting-edge fabs on-shore
- ✗ Building a new fab takes \$\$\$\$\$\$, years of R&D
- ✗ Semiconductor scaling: chip area and energy scale with the square and cube, respectively, of transistor length ("critical dimension")
- ✗ So using an old fab means an enormous performance hit, e.g., India's best on-shore fab is $10^8 \times$ behind state of the art

### Can we get trust more cheaply?

[Diagram: the principal turns F into designs for $\mathcal{P}$ and $\mathcal{V}$]











Makes sense if  $\mathcal{V}+\mathcal{P}$  are cheaper than trusted F



#### Reasons for hope:

- running time of $\mathcal{V} < \mathsf{F}$ (asymptotically) [ALMSS92, AS92, Micali94, BG02, GOS06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR13, BCCT13, KRR14]
- Implementations exist [SBW11, CMT12, SMBW12, TRMP12, SVPBBW12, SBVBPW13, VSBW13, PGHR13, Thaler13, BCGTV13, BFRSBW13, BFR13, DFKP13, BCTV14a, BCTV14b, BCGGMTV14, FL14, KPPSST14, FTP14, WSRHBW15, BBFR15, CFHKNPZ15, CTV15, KZMQCPPsS15]

$\mathcal{P}$ overheads are massive, but using an advanced fab might offset these costs

#### Reasons for caution:

- Theory is silent about feasibility [Kilian92, ALMSS92, AS92, Micali94, BG02, GOS06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR13, BCCT13, KRR14]
- Onus is heavier than in prior work
- Hardware issues: energy, chip area
- Need physically realizable circuit design
- Need $\mathcal{V}$ to save work for plausible computation sizes

# Zebra: a hardware design that saves costs

#### A qualified success


... sometimes.



F must be expressed as an arithmetic circuit (AC)

AC satisfiable  $\iff$  F was executed correctly

 ${\mathcal P}$  convinces  ${\mathcal V}$  that the AC is satisfiable



#### Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]

e.g., Zaatar, Pinocchio, libsnark

- + nondeterministic ACs, arbitrary connectivity
- + Few rounds ($\leq 3$)
- ✗ Unsuited to hardware implementation

#### IPs [GKR08, CMT12, VSBW13]

e.g., Muggles, CMT, Allspice

- deterministic ACs; layered, low depth
- Many rounds
- Suited to hardware implementation

What about other schemes? e.g., FHE [GGP10], MIP+FHE [BC12], MIP [BTWV14], PCIP [RRR16], IOP [BCS16], PIR [BHK16], ... These all seem a bit further from practicality.

F must be expressed as a *layered* arithmetic circuit.
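To make "layered arithmetic circuit" concrete, here is a minimal sketch in Python. The gate encoding, the function name `eval_layered_ac`, and the modulus $2^{61} - 1$ are assumptions chosen for illustration; they are not taken from Zebra.

```python
# Illustrative field modulus (a Mersenne prime); an assumption, not Zebra's choice.
P = 2**61 - 1

def eval_layered_ac(layers, inputs, p=P):
    """Evaluate a layered arithmetic circuit over F_p.

    Each layer is a list of gates (op, left, right). The indices left/right
    refer only to the outputs of the layer directly below (or to the circuit
    inputs for the first layer), which is what makes the circuit "layered".
    Returns the output values and the per-layer wire values.
    """
    values = [v % p for v in inputs]
    trace = [values]
    for layer in layers:
        nxt = []
        for op, left, right in layer:
            a, b = values[left], values[right]
            nxt.append((a + b) % p if op == "add" else (a * b) % p)
        values = nxt
        trace.append(values)
    return values, trace

# F(x0, x1, x2, x3) = x0*x1 + x2*x3, expressed in two layers.
layers = [
    [("mul", 0, 1), ("mul", 2, 3)],
    [("add", 0, 1)],
]
out, trace = eval_layered_ac(layers, [3, 4, 5, 6])
```

All gates within one layer depend only on the previous layer, so they can be evaluated independently of one another.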





1. $\mathcal{V}$ sends inputs
2. $\mathcal{P}$ evaluates, returns output y
3. $\mathcal{V}$ constructs polynomial relating y to last layer's input wires
4. $\mathcal{V}$ engages $\mathcal{P}$ in a sum-check [LFKN90], gets claim about second-last layer
5. $\mathcal{V}$ iterates, gets claim about inputs, which it can check




Soundness error $\propto p^{-1}$

Cost to execute F directly: $O(\text{depth} \cdot \text{width})$

$\mathcal{V}$'s sequential running time: $O(\text{depth} \cdot \log \text{width} + |x| + |y|)$ (assuming precomputed queries)

$\mathcal{P}$'s sequential running time: $O(\text{depth} \cdot \text{width} \cdot \log \text{width})$
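To get a feel for these bounds, here is a worked example with hypothetical parameters: depth $d = 20$ and width $w = 2^{20}$. These numbers are assumptions chosen for illustration, logarithms are taken base 2, and constant factors are ignored.

$$
\begin{aligned}
\text{direct execution:}\quad & d \cdot w = 20 \cdot 2^{20} \approx 2.1 \times 10^{7} \\
\mathcal{V}:\quad & d \cdot \log_2 w + |x| + |y| = 400 + |x| + |y| \\
\mathcal{P}:\quad & d \cdot w \cdot \log_2 w = 20 \cdot 2^{20} \cdot 20 \approx 4.2 \times 10^{8}
\end{aligned}
$$

Under these assumptions, $\mathcal{V}$'s work is dominated by reading inputs and outputs, while $\mathcal{P}$ pays a factor of about $\log_2 w$ over direct execution before constants are counted.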





$\mathcal{P}$ executing AC: layers are sequential, but all gates at a layer can be executed in parallel

Proving step: Can $\mathcal{V}$ and $\mathcal{P}$ interact about all of F's layers at once?

No. $\mathcal{V}$ must ask questions in order or soundness is lost.

But: there is still parallelism to be extracted...



$\mathcal{V}$ questions $\mathcal{P}$ about $F(x_1)$'s output layer. Simultaneously, $\mathcal{P}$ returns $F(x_2)$.

$\mathcal{V}$ questions $\mathcal{P}$ about $F(x_1)$'s next layer, and $F(x_2)$'s output layer. Meanwhile, $\mathcal{P}$ returns $F(x_3)$.







This process continues until  $\mathcal V$  and  $\mathcal P$  interact about every layer simultaneously—but for different computations.

 ${\cal V}$  and  ${\cal P}$  can complete one proof in each time step.
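The staggered schedule described above can be written down as a small simulation. This is a sketch of the scheduling idea only; the function name `pipeline_schedule` and the convention that layer 0 is the output layer are assumptions.

```python
def pipeline_schedule(num_layers, num_inputs):
    """Return, for each time step, the (instance, layer) pairs under proof.

    Instance i enters the pipeline at step i, starting at its output layer
    (layer 0), and moves one layer toward the inputs per step.
    """
    steps = []
    for t in range(num_inputs + num_layers - 1):
        active = [(i, t - i) for i in range(num_inputs)
                  if 0 <= t - i < num_layers]
        steps.append(active)
    return steps

sched = pipeline_schedule(num_layers=3, num_inputs=5)
# Number of proofs that reach their last layer in each step.
finished_per_step = [sum(1 for _, layer in s if layer == 2) for s in sched]
```

Once the pipeline is full, every step works on all layers at once, each for a different instance, and exactly one proof finishes per step.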



### Extracting parallelism in Zebra's $\mathcal{P}$ with pipelining



This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged.


There are other opportunities to leverage the protocol's structure.

For each sum-check round, $\mathcal{P}$ sums over each gate in a layer, evaluating H[k], $k \in \{0, 1, 2\}$
#### In software:

```
// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
  H[k] ← 0
  for g ∈ layer:
    // δ uses state[g]
    H[k] ← H[k] + δ(g, k)

// update lookup table with V's random coin r_j
for g ∈ layer:
  state[g] ← δ(g, r_j)
```
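The loop above leaves $\delta$ abstract. As a runnable illustration, the sketch below picks one concrete stand-in: each "gate" contributes the product of two values that vary linearly in the round's variable, so the round polynomial has degree 2 and three evaluations H[0], H[1], H[2] determine it. This choice of $\delta$, the names `line`, `prover_round`, `update_state`, `interpolate_deg2`, and the modulus are all assumptions, not Zebra's definitions.

```python
P = 2**61 - 1  # illustrative prime modulus (assumption)

def line(lo, hi, k, p=P):
    """Value at k of the line through (0, lo) and (1, hi)."""
    return (lo + k * (hi - lo)) % p

def prover_round(a, b, p=P):
    """Compute H[0], H[1], H[2] by summing each gate's contribution."""
    H = [0, 0, 0]
    for k in (0, 1, 2):
        for g in range(len(a) // 2):
            # Stand-in for delta(g, k): product of two per-gate lines.
            d = line(a[2 * g], a[2 * g + 1], k, p) * line(b[2 * g], b[2 * g + 1], k, p)
            H[k] = (H[k] + d) % p
    return H

def update_state(a, b, r, p=P):
    """Fold both lookup tables with V's random coin r."""
    half = len(a) // 2
    a2 = [line(a[2 * g], a[2 * g + 1], r, p) for g in range(half)]
    b2 = [line(b[2 * g], b[2 * g + 1], r, p) for g in range(half)]
    return a2, b2

def interpolate_deg2(H, r, p=P):
    """Evaluate at r the degree-2 polynomial through (0,H0), (1,H1), (2,H2)."""
    inv2 = pow(2, p - 2, p)
    l0 = (r - 1) * (r - 2) % p * inv2 % p
    l1 = (-r * (r - 2)) % p
    l2 = r * (r - 1) % p * inv2 % p
    return (H[0] * l0 + H[1] * l1 + H[2] * l2) % p

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
H = prover_round(a, b)
a2, b2 = update_state(a, b, 10)
next_sum = sum(x * y for x, y in zip(a2, b2)) % P
```

The check `H[0] + H[1] == sum of a[i]*b[i]` is what the verifier tests in each round, and the interpolated value at the coin equals the sum over the folded tables, which becomes the claim for the next round.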

$$\mathsf{H}[k] = \sum_{g \in \mathsf{layer}} \delta(g, k)$$

#### In hardware:

| gate prover | gate prover | gate prover | gate prover | ... |
|------------------|------------------|------------------|------------------|-----|
| $\delta(0, 0)$   | $\delta(1, 0)$   | $\delta(2, 0)$   | $\delta(3, 0)$   | ... |
| $\delta(0, 1)$   | $\delta(1, 1)$   | $\delta(2, 1)$   | $\delta(3, 1)$   | ... |
| $\delta(0, 2)$   | $\delta(1, 2)$   | $\delta(2, 2)$   | $\delta(3, 2)$   | ... |
| $\delta(0, r_j)$ | $\delta(1, r_j)$ | $\delta(2, r_j)$ | $\delta(3, r_j)$ | ... |

## Zebra's design approach

- ✓ Extract parallelism
  - e.g., pipelined proving
  - e.g., parallel evaluation of  $\delta$  by gate provers
- ✓ Exploit locality: distribute data and control
  - e.g., no RAM: data is kept close to places it is needed
  - e.g., latency-insensitive design: localized control
- ✓ Reduce, reuse, recycle
  - e.g., computation: save energy by adding memoization to $\mathcal{P}$
  - e.g., hardware: save chip area by reusing the same circuits

#### Interaction between $\mathcal{V}$ and $\mathcal{P}$ requires a lot of bandwidth

- ✗ $\mathcal{V}$ and $\mathcal{P}$ on circuit board? Too much energy, circuit area
- ✓ Zebra uses 3D integration

#### Protocol requires input-independent precomputation [VSBW13]

- ✓ Zebra amortizes precomputations over many $\mathcal{V}$-$\mathcal{P}$ pairs

#### Precomputations need secrecy, integrity

- ✗ Give $\mathcal{V}$ trusted storage? Cost would be prohibitive
- ✓ Zebra uses untrusted storage + authenticated encryption
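The idea of keeping precomputations in untrusted storage can be sketched with a generic encrypt-then-MAC construction. This is a toy illustration using only the Python standard library: the SHA-256 counter-mode keystream is a stand-in assumption, a real design would use a vetted authenticated-encryption scheme, and nothing here reflects the construction Zebra actually uses. The names `seal` and `open_sealed` are hypothetical.

```python
import hashlib
import hmac
import os

def _keystream(key, nonce, n):
    """Toy keystream: SHA-256 over key, nonce, and a counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def seal(enc_key, mac_key, plaintext):
    """Encrypt, then authenticate nonce and ciphertext with HMAC-SHA256."""
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ct = bytes(x ^ y for x, y in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def open_sealed(enc_key, mac_key, blob):
    """Verify the tag before decrypting; reject anything that was modified."""
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("precomputation failed integrity check")
    stream = _keystream(enc_key, nonce, len(ct))
    return bytes(x ^ y for x, y in zip(ct, stream))

enc_key, mac_key = os.urandom(32), os.urandom(32)
blob = seal(enc_key, mac_key, b"precomputed queries")
recovered = open_sealed(enc_key, mac_key, blob)
```

Only the keys need to live with $\mathcal{V}$; the sealed blob can sit in storage that the adversary controls, since reading it reveals nothing useful and modifying it is detected.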



#### Implementation

Zebra's implementation includes

- a compiler that produces synthesizable Verilog for $\mathcal{P}$
- two $\mathcal{V}$ implementations
  - hardware (Verilog)
  - software (C++)
- library to generate $\mathcal{V}$'s precomputations
- Verilog simulator extensions to model software or hardware $\mathcal{V}$'s interactions with $\mathcal{P}$

...and it seemed to work really well!

Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof!


But that's not a serious evaluation...



Baseline: direct implementation of F in same technology as $\mathcal{V}$

Metrics: energy, chip size per throughput (discussed in paper)

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models

Charge for $\mathcal{V}$, $\mathcal{P}$, communication; retrieving and decrypting precomputations; PRNG; Operator communicating with $\mathcal{V}$

Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; 200 mm<sup>2</sup> max chip area; 150 W max total power

- 350 nm: 1997 (Pentium II)
- 7 nm: $\approx$ 2017 [TSMC]
- $\approx$ 20 year gap between trusted and untrusted fab
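As a rough worked example, apply the idealized scaling rule stated earlier (chip area with the square and energy with the cube of critical dimension) to these two technology nodes. Real processes deviate from this rule, so the figures are an order-of-magnitude illustration only.

$$
\frac{350\ \text{nm}}{7\ \text{nm}} = 50, \qquad 50^{2} = 2500\ \text{(area)}, \qquad 50^{3} = 125{,}000\ \text{(energy)}
$$

Under that assumption, this is the size of the advantage the untrusted fab gives $\mathcal{P}$, which has to offset the overhead of proving.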

# Application #1: number theoretic transform

NTT: a Fourier transform over  $\mathbb{F}_p$ 

Widely used, e.g., in computer algebra
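For readers unfamiliar with the transform, here is a minimal recursive radix-2 NTT over a tiny field. The parameters ($p = 17$, 8th root of unity $\omega = 9$) and the function names are assumptions chosen so the example is easy to check by hand; they are not the parameters used in the evaluation.

```python
P = 17       # small prime with 8 | (P - 1); illustration only
OMEGA = 9    # primitive 8th root of unity mod 17 (9^4 = -1, 9^8 = 1)

def ntt(a, omega, p=P):
    """Recursive radix-2 number theoretic transform; len(a) is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % p
    even = ntt(a[0::2], omega_sq, p)
    odd = ntt(a[1::2], omega_sq, p)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % p
        # Butterfly: one multiplication by a constant, one add, one subtract.
        out[i] = (even[i] + t) % p
        out[i + n // 2] = (even[i] - t) % p
        w = w * omega % p
    return out

def intt(values, omega, p=P):
    """Inverse transform: run the NTT with omega^-1 and scale by n^-1."""
    n = len(values)
    inv_n = pow(n, p - 2, p)
    inv_omega = pow(omega, p - 2, p)
    return [x * inv_n % p for x in ntt(values, inv_omega, p)]

def naive_ntt(a, omega, p=P):
    """Direct O(n^2) definition, used here only as a cross-check."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
A = ntt(a, OMEGA)
```

Each recursion level consists only of field additions and multiplications by constants, and the number of levels is logarithmic in the input size, which fits the layered, shallow circuit shape the protocol needs.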




# Application #2: Curve25519 point multiplication

Curve25519: a commonly-used elliptic curve

Point multiplication: primitive, e.g., for ECDH
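Point multiplication on Curve25519 reduces to a fixed sequence of additions and multiplications in $\mathbb{F}_p$ with $p = 2^{255} - 19$. The sketch below follows the Montgomery ladder as specified in RFC 7748. It is plain Python, not constant-time, and not a description of the circuit Zebra evaluates; the function names are assumptions.

```python
P = 2**255 - 19
A24 = 121665  # (486662 - 2) / 4, the ladder constant from RFC 7748

def clamp(k):
    """RFC 7748 scalar clamping: clear the low 3 bits and bit 255, set bit 254."""
    return (k & ((1 << 254) - 8)) | (1 << 254)

def ladder(k, u):
    """Return the u-coordinate of [k]Q, where Q has u-coordinate u."""
    x1 = u % P
    x2, z2 = 1, 0
    x3, z3 = x1, 1
    swap = 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        swap ^= kt
        if swap:  # conditional swap (data-dependent here; a sketch only)
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = kt
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    return x2 * pow(z2, P - 2, P) % P

# ECDH-style usage with the standard base point u = 9.
alice = clamp(int.from_bytes(bytes(range(32)), "little"))
bob = clamp(int.from_bytes(bytes(range(32, 64)), "little"))
shared_a = ladder(alice, ladder(bob, 9))
shared_b = ladder(bob, ladder(alice, 9))
```

Every ladder step is the same handful of field operations, repeated once per scalar bit, which is the kind of regular structure that maps onto a layered arithmetic circuit.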




### A qualified success

Zebra: a hardware design that saves costs...

... sometimes.

1. Computation F must have a layered, shallow, deterministic AC
2. Must have a wide gap between cutting-edge fab (for $\mathcal{P}$) and trusted fab (for $\mathcal{V}$)
3. Amortizes precomputations over many instances
4. Computation F must be very large for $\mathcal{V}$ to save work
5. Computation F must be efficient as an arithmetic circuit

#### Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

| Design principle       | IPs [GKR08, CMT12, VSBW13] | Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14] |
|------------------------|----------------------------|----------------------------------------------|
| Extract parallelism    | ✓                          | ✓                                            |
| Exploit locality       | ✓                          | ✗                                            |
| Reduce, reuse, recycle | ✓                          | ✗                                            |

Argument protocols seem unfriendly to hardware:

- $\mathcal{P}$ computes over entire AC at once $\implies$ need RAM
- $\mathcal{P}$ does crypto for every gate in AC $\implies$ special crypto circuits

... but we hope these issues are surmountable!

#### Common to essentially all built proof systems

3. Amortizes precomputations over many instances
4. Computation F must be very large for $\mathcal{V}$ to save work
5. Computation F must be efficient as an arithmetic circuit

#### Amortizes precomputations over many instances

| System                               | Amortization regime                       | Advice |
|--------------------------------------|-------------------------------------------|--------|
| Zebra                                | many $\mathcal{V}$-$\mathcal{P}$ pairs    | short  |
| Allspice [VSBW13]                    | batch of instances of a particular F      | short  |
| Bootstrapped SNARKs [BCTV14a, CTV15] | all computations                          | long   |
| BCTV [BCTV14b]                       | all computations of the same length       | long   |
| Pinocchio [PGHR13]                   | all future instances of a particular F    | long   |
| Zaatar [SBVBPW13]                    | batch of instances of a particular F      | long   |

Exception: [CMT12] with logspace-uniform ACs

4. Computation F must be very large for $\mathcal{V}$ to save work
5. Computation F must be efficient as an arithmetic circuit

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:

- $\mathcal{V}$'s work: 6 ms + $(|x| + |y|) \cdot 3\ \mu$s on a 2.7 GHz CPU $\Rightarrow$ break-even point $\geq 16 \times 10^6$ CPU ops
- With 32 GB RAM, libsnark handles ACs with $\leq 16 \times 10^6$ gates $\Rightarrow$ breaking even requires > 1 CPU op per AC gate, e.g., computations over $\mathbb{F}_p$ rather than machine integers
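The break-even figure follows from the fixed part of $\mathcal{V}$'s cost alone. As a back-of-the-envelope check, assuming roughly one CPU operation per clock cycle (an assumption made here for illustration):

$$
6\ \text{ms} \times 2.7 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} = 1.62 \times 10^{7}\ \text{cycles} \approx 16 \times 10^{6}\ \text{CPU ops}
$$

$$
\frac{16 \times 10^{6}\ \text{CPU ops}}{16 \times 10^{6}\ \text{gates}} = 1\ \text{CPU op per gate}
$$

So a computation that fits in the largest supported circuit only breaks even if each gate stands in for more than one native CPU operation.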



## Recap

- + Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
- + First hardware design for a probabilistic proof protocol
- + Improves performance compared to trusted baseline
- − Improvement compared to the baseline is modest
- − Applicability is limited:
  - precomputations must be amortized
  - computation needs to be "big enough"
  - large gap between trusted and untrusted technology
  - does not apply to all computations

Bottom line: Zebra is plausible, when it applies

https://www.pepper-project.org/