a16z: How to implement a secure and efficient zkVM in stages (must read for developers)

This will be a long construction period of no less than four years.

Original article from a16z crypto

Compiled by Odaily Planet Daily's Golem (@web3_golem)


zkVMs (zero-knowledge virtual machines) promise to “democratize SNARKs”, allowing anyone (even those without specialized SNARK expertise) to prove that they have correctly run any program on a given input (or witness). Their core strength is developer experience, but they currently face significant challenges in both security and performance. For the zkVM vision to deliver on its promise, designers must overcome these challenges. In this post, I lay out the likely stages of zkVM development, which will take several years to complete.

Challenges

In terms of security, zkVMs are highly complex software projects that are still riddled with vulnerabilities. In terms of performance, proving that a program was executed correctly can be hundreds of thousands of times slower than running it natively, which makes most applications impossible to deploy in the real world today.

Despite these real-world challenges, much of the blockchain industry portrays zkVMs as ready for immediate deployment. In fact, some projects already pay significant computational costs to generate proofs of on-chain activity. But because zkVMs are still imperfect, this is just an expensive way to pretend that a system is protected by SNARKs, when in reality it is either protected by permissioning or, worse, exposed to attack.

We are still years away from a secure and performant zkVM. This post proposes a series of concrete, phased goals to track zkVM progress — goals that cut through the hype and help the community focus on real progress.

Security Phases

SNARK-based zkVMs typically consist of two main components:

  • Polynomial Interactive Oracle Proofs (PIOPs): an interactive proof framework for proving statements about polynomials (or about constraints derived from them).

  • Polynomial Commitment Scheme (PCS): ensures that the prover cannot lie about polynomial evaluations without being caught.

A zkVM essentially encodes a valid execution trace as a system of constraints (broadly, constraints that enforce correct use of the virtual machine's registers and memory) and then applies a SNARK to prove that those constraints are satisfied.
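
To make this concrete, below is a toy sketch in Rust (with purely illustrative names) of what it means to encode an execution trace as a system of constraints: the trace is a table of VM states, and the constraints check that every step follows the VM's transition rules. A real zkVM would arithmetize such checks over a finite field and prove them with a PIOP plus a PCS rather than executing them directly.

```rust
// Toy sketch: an execution trace as a table of rows, and "constraints" as
// checks that adjacent rows are consistent with the VM's transition rules.
// All names here are illustrative; real zkVMs arithmetize these checks over
// a finite field and prove them with a PIOP + polynomial commitment scheme.

#[derive(Clone, Copy)]
enum Op {
    Add { dst: usize, a: usize, b: usize },
    Halt,
}

struct Row {
    pc: usize,      // program counter
    regs: [u64; 4], // register file
}

/// Transition constraint: row i+1 must follow from row i under the program.
fn transition_ok(program: &[Op], cur: &Row, next: &Row) -> bool {
    match program[cur.pc] {
        Op::Add { dst, a, b } => {
            let mut expected = cur.regs;
            expected[dst] = cur.regs[a].wrapping_add(cur.regs[b]);
            next.pc == cur.pc + 1 && next.regs == expected
        }
        Op::Halt => next.pc == cur.pc && next.regs == cur.regs,
    }
}

/// The whole "constraint system" for this toy VM: every step must be valid.
fn trace_satisfies_constraints(program: &[Op], trace: &[Row]) -> bool {
    trace.windows(2).all(|w| transition_ok(program, &w[0], &w[1]))
}

fn main() {
    let program = [Op::Add { dst: 2, a: 0, b: 1 }, Op::Halt];
    let trace = [
        Row { pc: 0, regs: [3, 4, 0, 0] },
        Row { pc: 1, regs: [3, 4, 7, 0] },
        Row { pc: 1, regs: [3, 4, 7, 0] },
    ];
    assert!(trace_satisfies_constraints(&program, &trace));
    println!("toy execution trace satisfies the transition constraints");
}
```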

The only way to ensure that a system as complex as a zkVM is bug-free is through formal verification. Here is a breakdown of the security phases: Phase 1 focuses on getting the protocol right, while Phases 2 and 3 focus on getting the implementation right.

Security Phase 1: Correct Protocol

  1. A formally verified proof of soundness for the PIOP;

  2. A formally verified proof that the PCS is binding, under appropriate cryptographic assumptions or idealized models;

  3. If Fiat-Shamir is used, a formally verified proof that the succinct argument obtained by combining the PIOP and the PCS is secure in the random oracle model (augmented with other cryptographic assumptions as needed); a minimal sketch of the Fiat-Shamir transformation appears after this list;

  4. A formally verified proof that the constraint system to which the PIOP is applied is equivalent to the semantics of the VM;

  5. A full "gluing together" of all of these pieces into a single, formally verified proof that the SNARK is secure when used to run any program specified by the VM bytecode. If the protocol is intended to achieve zero knowledge, this property must also be formally verified, to ensure that no sensitive information about the witness is leaked.
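
As a companion to item 3, here is a minimal, hypothetical sketch of the Fiat-Shamir transformation: the verifier's random challenges are replaced by hashes of the transcript of prover messages so far, turning the interactive protocol into a non-interactive one. The standard-library hasher below stands in for the collision-resistant hash that a real system would model as a random oracle; it is not cryptographically secure.

```rust
// Minimal Fiat-Shamir sketch: challenges are derived by hashing the
// transcript of prover messages instead of coming from a live verifier.
// DefaultHasher is a stand-in for a cryptographic hash, for illustration only.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Transcript {
    /// Everything the prover has sent so far, in order.
    data: Vec<u8>,
}

impl Transcript {
    fn new(label: &str) -> Self {
        Transcript { data: label.as_bytes().to_vec() }
    }

    /// Absorb a prover message (e.g., a polynomial commitment).
    fn absorb(&mut self, message: &[u8]) {
        self.data.extend_from_slice(message);
    }

    /// Derive the next "verifier" challenge by hashing the whole transcript.
    /// In the interactive protocol this would be fresh verifier randomness.
    fn challenge(&mut self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.data.hash(&mut hasher);
        let c = hasher.finish();
        // Absorb the challenge too, so later challenges depend on it.
        self.absorb(&c.to_le_bytes());
        c
    }
}

fn main() {
    let mut t = Transcript::new("toy-fiat-shamir");
    t.absorb(b"commitment to the execution-trace polynomial");
    let alpha = t.challenge(); // replaces the verifier's first random message
    t.absorb(b"prover response at point alpha");
    let beta = t.challenge();
    println!("derived challenges: {alpha} {beta}");
}
```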

Recursion Warning: If the zkVM uses recursion, then every PIOP, commitment scheme, and constraint system involved anywhere in that recursion must be verified for this phase to be considered complete.

Security Phase 2: Correct Verifier Implementation

Formal verification that the actual implementation of the zkVM verifier (in Rust, Solidity, etc.) matches the protocol verified in Phase 1. Achieving this ensures that the implemented protocol is sound (and not just a design on paper, or an inefficient specification written in, say, Lean).

The reason Phase 2 focuses only on the verifier implementation (and not the prover) is twofold. First, a correct verifier is sufficient to guarantee soundness (i.e., to ensure that the verifier cannot be convinced that a false statement is true). Second, the zkVM verifier implementation is more than an order of magnitude simpler than the prover implementation.

Security Phase 3: Correct Prover Implementation

Formal verification that the actual implementation of the zkVM prover correctly generates proofs for the proof system verified in Phases 1 and 2. This ensures completeness: a system using the zkVM can never get stuck on a true statement that it is unable to prove. If the prover is intended to achieve zero knowledge, this property must also be formally verified.

Estimated timetable

  • Phase 1 progress: We can expect incremental achievements over the next year (e.g., ZKLib). But no zkVM will fully meet the Phase 1 requirements for at least two years.

  • Phases 2 and 3: These phases can advance in parallel with some aspects of Phase 1. For example, some teams have shown that a Plonk verifier implementation matches the protocol in the paper (even though the paper's protocol itself may not be fully verified). Nonetheless, I don't expect any zkVM to reach Phase 3 in less than four years, and likely longer.

Key Notes: Fiat-Shamir Security and Verified Bytecode

A major complication is that there are unresolved research questions surrounding the security of the Fiat-Shamir transformation. All three phases treat Fiat-Shamir and random oracles as if they were infallible, but in reality the whole paradigm may harbor vulnerabilities, owing to the gap between the idealized random oracle and the actual hash function that instantiates it. In the worst case, a system that has reached Phase 2 could later be found completely insecure because of Fiat-Shamir problems. This is a cause for serious concern and ongoing research. We may need to modify the transformation itself to better protect against such vulnerabilities.

Systems without recursion are theoretically more robust because some known attacks involve circuits similar to those used in recursive proofs.

It is also worth noting that proving that a computer program (specified via the bytecode) has run correctly is of limited value if the bytecode itself is flawed. Therefore, the usefulness of zkVM depends heavily on methods for generating formally verified bytecode — a significant challenge that is beyond the scope of this article.

On security in the post-quantum era

Quantum computers will not pose a serious threat for at least the next five years (and probably longer), whereas vulnerabilities are an existential risk today. Therefore, the main focus now should be on meeting the security and performance phases discussed in this article. If non-quantum-secure SNARKs let us meet those security requirements sooner, we should use them until post-quantum SNARKs catch up, or until there is serious concern that cryptographically relevant quantum computers are imminent.

Current performance of zkVM

Currently, zkVM provers incur an overhead factor of close to one million relative to native execution: if a program takes X cycles to run, proving that it was executed correctly costs on the order of X × 1,000,000 CPU cycles. This was the case a year ago, and it is still the case today.

Popular narratives often describe this expense in ways that make it sound acceptable. For example:

  • “Generating proofs for all of Ethereum mainnet costs less than a million dollars per year.”

  • “We can generate Ethereum block proofs in almost real time using a cluster of dozens of GPUs.”

  • “Our latest zkVM is 1,000 times faster than its predecessor.”

While technically accurate, these statements can be misleading without the proper context. For example:

  • Being 1,000x faster than a previous zkVM can still be very slow in absolute terms; it says more about how bad things were than about how good they are now.

  • There are proposals to increase the amount of computation that Ethereum mainnet handles by a factor of 10. That would leave current zkVM performance even further from what is needed.

  • What people call “near real-time” proving of Ethereum blocks is still much slower than many blockchain applications require (for example, Optimism’s 2-second block time is much faster than Ethereum’s 12-second block time).

  • Requiring dozens of GPUs to run continuously, without failure, does not meet acceptable liveness guarantees.

  • The fact that it costs less than a million dollars per year to prove all activity on Ethereum mainnet should be weighed against the fact that an Ethereum full node spends only about $25 per year on that same computation.

For applications outside the blockchain, this overhead is simply too high. No amount of parallelization or engineering can offset such a huge overhead. As a baseline, we should aim for a zkVM that is no more than 100,000x slower than native execution, and even that is only a first step. True mainstream adoption will likely require overhead closer to 10,000x or less.
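
The following back-of-the-envelope sketch (assuming a 3 GHz single core, the same figure used in Speed Stage 1 below) translates these overhead targets into wall-clock terms. The numbers are illustrative only, not benchmarks of any particular zkVM.

```rust
// Back-of-the-envelope arithmetic for the overhead factors discussed above,
// assuming a 3 GHz single core. Illustrative only, not zkVM benchmarks.

fn main() {
    let native_hz: f64 = 3.0e9; // native cycles executed per second
    for (label, overhead) in [
        ("today (~1,000,000x)", 1.0e6),
        ("speed stage 1 (100,000x)", 1.0e5),
        ("speed stage 2 (10,000x)", 1.0e4),
        ("speed stage 3 (<1,000x)", 1.0e3),
    ] {
        // Cycles that can be proven per second on one core.
        let proven_cycles_per_sec = native_hz / overhead;
        // Seconds of single-threaded proving per second of native execution
        // (equal to the overhead factor, by definition).
        let secs_per_native_sec = overhead;
        println!("{label}: ~{proven_cycles_per_sec:.0} cycles proven/sec; proving 1 s of native work takes ~{secs_per_native_sec:.0} s single-threaded");
    }
}
```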

How to measure performance

There are three main components to SNARK performance:

  • The inherent efficiency of the underlying proof system.

  • Application-specific optimizations (such as precompilation).

  • Engineering and hardware acceleration (such as GPU, FPGA or multi-core CPU).

While the latter two are critical for real-world deployments, they apply to any proof system, so they don't necessarily reflect its base overhead. For example, adding GPU acceleration and precompiles to a zkVM can easily yield a 50x speedup over a pure CPU-based approach without precompiles, which is enough to make an inherently less efficient system look superior to one that hasn't been similarly polished.

Therefore, the focus below is on the performance of SNARKs without specialized hardware and without precompiles. This differs from current benchmarking approaches, which often collapse all three factors into a single “headline number”. That is like judging a diamond by how long it took to polish rather than by its inherent clarity. Our goal is to isolate the inherent overhead of the general-purpose proof system, helping the community eliminate confounding factors and focus on real progress in proof-system design.

Performance Phases

Here are 5 performance milestones to achieve. First, we need to cut the prover overhead on the CPU by multiple orders of magnitude. Only then should the focus turn to further reductions through hardware. Memory usage must also be improved.

In all of the following stages, developers must not have to tailor their code to the zkVM in order to achieve the necessary performance. Developer experience is the main advantage of zkVMs; sacrificing DevEx to meet performance benchmarks would defeat the purpose of zkVMs in the first place.

These metrics focus on prover cost. However, if unbounded verifier cost is allowed (i.e., there is no upper bound on proof size or verification time), then any prover metric can be easily met. Therefore, for a system to comply with the described phase, maximum values must be specified for proof size and verification time.

Performance requirements

Phase 1 requirement: Reasonable and non-trivial verification cost:

  • Proof Size: Proof size must be smaller than witness size.

  • Verification time: Verifying the proof must be no slower than running the program natively (i.e., performing the computation without the proof of correctness).

These are minimal requirements to ensure that the proof is non-trivially succinct: they guarantee that proof size and verification time are no worse than simply sending the witness to the verifier and having it check correctness directly.

Phase 2 and beyond requirements:

  • Maximum proof size: 256 KB.

  • Maximum verification time: 16 ms.

These cutoffs are intentionally large to accommodate new fast proof technologies that may incur higher verification costs. At the same time, they exclude proofs that are so expensive that few projects would be willing to include them in their blockchains.
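
As a sketch, the two sets of requirements above can be written as simple predicates. The function and parameter names are hypothetical; a real benchmark harness would measure these quantities rather than take them as inputs.

```rust
// Hypothetical predicates capturing the verifier-cost requirements above.

/// Phase 1: proof smaller than the witness, and verification no slower than
/// running the program natively.
fn meets_phase1(proof_bytes: u64, witness_bytes: u64, verify_secs: f64, native_secs: f64) -> bool {
    proof_bytes < witness_bytes && verify_secs <= native_secs
}

/// Phase 2 and beyond: fixed absolute bounds on proof size and verification time.
fn meets_phase2(proof_bytes: u64, verify_millis: f64) -> bool {
    proof_bytes <= 256 * 1024 && verify_millis <= 16.0
}

fn main() {
    // Illustrative numbers: a 180 KB proof verified in 12 ms, for a program
    // with a 4 MB witness that runs natively in 50 ms.
    assert!(meets_phase1(180 * 1024, 4 * 1024 * 1024, 0.012, 0.050));
    assert!(meets_phase2(180 * 1024, 12.0));
    println!("verifier costs satisfy the phase 1 and phase 2 bounds");
}
```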

Speed Stage 1

Single-threaded proving must be at most 100,000 times slower than native execution, measured across a range of applications (not just Ethereum block proving), and without relying on precompiles.

To put this into context, imagine a simple RISC-V processor running at about 3 billion cycles per second on a modern laptop. Achieving Speed Stage 1 means being able to prove about 30,000 RISC-V cycles per second (single-threaded) on the same laptop, while keeping the verification cost “reasonable and non-trivial” as defined above.

Speed Stage 2

Single-threaded proving must be at most 10,000 times slower than native execution.

Alternatively, since some promising SNARK approaches (especially those based on binary fields) are hampered by current CPUs and GPUs, you can qualify for this stage using FPGAs (or even ASICs) by comparing:

  • The number of RISC-V cores that the FPGA can emulate at native speed;

  • The number of FPGAs required to simulate and prove RISC-V execution in (near) real time.

If the latter is at most 10,000 times the former, you qualify for Stage 2. On a standard CPU, the proof size must still be at most 256 KB and the verification time at most 16 milliseconds.
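
As a small illustration of this ratio check (with made-up numbers, not measurements of any real hardware):

```rust
// Hypothetical calculation of the FPGA-based overhead metric for Speed Stage 2.

/// Effective overhead: FPGAs needed to simulate-and-prove RISC-V in (near)
/// real time, divided by the RISC-V cores one FPGA can emulate at native speed.
fn fpga_overhead(fpgas_to_prove_realtime: f64, cores_emulated_per_fpga: f64) -> f64 {
    fpgas_to_prove_realtime / cores_emulated_per_fpga
}

fn main() {
    // Illustrative numbers only: one FPGA emulates 2 RISC-V cores at native
    // speed, and 5,000 FPGAs are needed to prove that execution in real time.
    let overhead = fpga_overhead(5_000.0, 2.0);
    println!("effective overhead: {overhead}x");
    assert!(overhead <= 10_000.0, "would not yet qualify for Speed Stage 2");
}
```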

Speed Stage 3

In addition to reaching Speed Stage 2, achieve proving overhead of less than 1,000x (for a wide range of applications) using automatically synthesized and formally verified precompiles. Essentially, the instruction set is dynamically customized for each program to speed up proving, but in a way that remains easy to use and formally verified.

Memory Stage 1

Achieve Speed Stage 1 while requiring less than 2 GB of memory for the prover (and while also achieving zero knowledge).

This is critical for many mobile devices and browsers, and so opens up countless client-side zkVM use cases. Client-side proving matters because our phones are our constant connection to the real world: they track our location, our credentials, and so on. If generating a proof required more than 1-2 GB of memory, it would be too much for most mobile devices today. Two points of clarification:

  • The 2 GB space bound applies to large statements (those that require trillions of CPU cycles to run natively). Proof systems that achieve space bounds only for small statements lack broad applicability.

  • If the prover is very slow, it is easy to keep the prover's memory footprint below 2 GB. So, to make Memory Stage 1 non-trivial, I require that Speed Stage 1 also be met within the 2 GB space bound.

Memory Stage 2

Achieve Speed Stage 1 while using less than 200 MB of prover memory (10 times better than Memory Stage 1).

Why go below 2 GB? Consider a non-blockchain example: every time you visit a website over HTTPS, you download certificates for authentication and encryption. Instead, websites could send zk proofs that they hold such certificates. Large websites might need to issue millions of these proofs per second; if each proof requires 2 GB of memory to generate, that adds up to petabytes of RAM. Further reducing memory usage is therefore critical for non-blockchain deployments.

Precompiles: The last mile or a crutch?

In zkVM design, a precompile refers to a specialized SNARK (or constraint system) tailored for a specific functionality, such as Keccak/SHA hashing or elliptic curve group operations for digital signatures. In Ethereum (where most of the heavy lifting involves Merkle hashing and signature checking), some hand-crafted precompiles can reduce prover overhead. But relying on them as a crutch doesn’t get SNARKs where they need to be. Here’s why:

  • Still too slow for most applications: Even with precompiles for hashing and signatures, current zkVMs remain too slow (both inside and outside blockchain environments) because of inefficiencies in the core proof system.

  • Security failures: Hand-written precompiles that have not been formally verified are almost certainly riddled with bugs that can lead to catastrophic security failures.

  • Poor developer experience: In most zkVMs today, adding a new precompile means hand-writing a constraint system for each functionality, essentially a return to a 1960s-style workflow. Even with existing precompiles, developers must refactor their code to call each one. We should be optimizing for security and developer experience, not sacrificing both in pursuit of incremental performance gains; doing so only demonstrates that the underlying performance is not where it needs to be.

  • I/O overhead and no RAM: While precompiles improve performance for crypto-heavy tasks, they may not provide meaningful speedups for more diverse workloads, because they incur significant overhead in passing inputs and outputs and they cannot use RAM. Even in the blockchain context, as soon as you go beyond a monolithic L1 like Ethereum (say, you want to build a series of cross-chain bridges), you face different hash functions and signature schemes. Hand-rolling new precompiles for each new problem does not scale and poses a huge security risk.

For all of these reasons, our first priority should be improving the efficiency of the underlying zkVM. The techniques that produce the best zkVMs will also produce the best precompiles. I do believe precompiles will remain critical in the long run, but only if they are automatically synthesized and formally verified. That way, the developer-experience advantages of zkVMs can be preserved while avoiding catastrophic security risks. This view is reflected in Speed Stage 3.

Estimated timetable

I expect a handful of zkVMs to achieve Speed Stage 1 and Memory Stage 1 later this year. I also think we can achieve Speed Stage 2 within the next two years, though it is not clear whether we can get there without new ideas that have yet to emerge. I expect the remaining stages (Speed Stage 3 and Memory Stage 2) to take several years to achieve.

Summary

While this post lays out the stages for zkVM security and performance separately, the two are not entirely independent. As more vulnerabilities are discovered in zkVMs, I expect that some of them can only be fixed at a significant cost in performance. Performance results should therefore be treated as provisional until a zkVM reaches Security Phase 2.

zkVMs hold the promise of making zero-knowledge proofs truly ubiquitous, but they are still in their infancy, fraught with security challenges and heavy performance overheads. Hype and marketing make it difficult to assess real progress. By articulating clear security and performance milestones, we hope to provide a roadmap that cuts through the noise. We will get there, but it will take time and sustained effort.

This article is translated from https://a16zcrypto.com/posts/article/secure-efficient-zkvms-progress/ (original link). If reprinted, please indicate the source.

