Geth v1.13 arrives shortly after the 1.12 release family, which is quite intriguing given that its primary feature has been under development for an impressive 6 years! 🤯
In this post, we will explore various technical and historical aspects, but if you're just after the main takeaway, Geth v1.13.0 introduces a new database model for storing the Ethereum state. This model is not only quicker than the previous one, but it also includes effective pruning mechanisms. Say goodbye to unnecessary data piling up on your disk and bid farewell to guerrilla (offline) pruning!
- ¹Excluding ~589GB of ancient data, consistent across all setups.
- ²Full sync with hash scheme surpassed our 1.8TB SSD at block ~15.43M.
- ³Size variance compared to snap sync due to compaction overhead.
Before we proceed, we must acknowledge Gary Rong, who has dedicated the last 2 years refining the core of this major update! Incredible work and resilience to complete such a vast undertaking!
Detailed Technical Insights
So, what exactly is this new data model, and why was it necessary?
To put it simply, our previous method of storing the Ethereum state did not allow for efficient pruning. Although we employed various hacks to slow down junk accumulation in the database, it still kept growing indefinitely. Users could either stop their node and prune it offline or resync the state to remove the junk, but these were far from ideal solutions.
To implement and deliver true pruning, which ensures no debris is left behind, we had to make significant changes within Geth's codebase. The effort involved could be likened to the Merge, but limited to Geth's internals:
- Storing state trie nodes by hash introduces implicit deduplication (i.e., if two branches of the trie share identical content, which is more likely for contract storages, they are stored only once). This approach means that we can never ascertain how many parents (i.e., different trie paths, different contracts) reference any node; consequently, identifying what's safe to delete from disk becomes impossible.
- Before we could implement pruning, we had to eliminate deduplication across different paths in the trie. Our new data model keys state trie nodes by their path rather than their hash (see the key-scheme sketch after this list). This seemingly small change means that where two branches with identical content were previously stored only once under their shared hash, they are now reached via different paths and stored separately, even though their content is the same.
- In our former data model, because all state tries were keyed by hash, most trie nodes were shared between consecutive blocks. This poses the same issue: we still cannot tell how many blocks reference the same node, which makes effective pruning impossible. Switching the data model to path-based keys means we can no longer store multiple tries at the same time: the same path-key (e.g., the empty path for the root node) would need to hold different content for each block.
- Additionally, to achieve effective pruning with trie nodes keyed by path, we had to give up the ability to keep an arbitrary number of states on disk. The only way forward was to limit the database to exactly one state trie at any given time. This trie starts out as the genesis state and must follow the chain state as the head progresses.
- The simplest way to store a single state trie on disk would be to use that of the head block. However, this over-simplifies things and introduces two issues. Mutating the trie on disk block-by-block leads to a huge volume of writes. This may not be noticeable during sync, but it becomes cumbersome when importing many blocks (e.g., full sync or catchup). The second issue is that prior to finality, the chain head might waver due to minor reorganizations. While rare, this is a possibility Geth must manage effectively. Fixing the persistent state to the head complicates switching to a different side-chain.
- The solution mirrors the way Geth's snapshots function. Instead of tracking the chain head, the persistent state lags behind it by a number of blocks. Geth always retains the trie changes for the last 128 blocks in memory. If there are several competing branches, they are each stored in memory in a tree structure. As the chain advances, the oldest (HEAD-128) diff layer gets flattened (a structural sketch of these layers follows the list). This enables Geth to perform exceptionally rapid reorganizations within the top 128 blocks, effectively making side-chain switches nearly instantaneous.
- However, the diff layers do not resolve the issue of needing the persistent state to update with every block (which would just delay the process). To prevent disk writes with each block, Geth incorporates a dirty cache between the persistent state and the diff layers, which accumulates writes. The benefit arises from the fact that consecutive blocks often change the same storage slots, frequently overwriting the top of the trie. The dirty buffer effectively short-circuits these writes, avoiding hitting the disk. When the buffer fills, everything is flushed to disk.
- With the inclusion of diff layers, Geth can perform 128 block-deep reorganizations instantly. However, there are instances when a deeper reorg may be required. It could be that the beacon chain is not finalizing, or that a consensus bug in Geth necessitates an update to "undo" more of the chain. Previously, Geth could simply roll back to an old state it kept on disk and reprocess blocks from there. But with the new model of retaining only one state on disk, there is no prior state to revert to.
- Our workaround for this situation involves a new concept called reverse diffs. Each time a new block is imported, a diff is created that can revert the post-state of the block back to its pre-state. We store the last 90K of these reverse diffs on disk. Whenever a deep reorg is requested, Geth can take the persistent state on disk and begin applying diffs until the state is returned to an older version. It can then switch to a different side-chain and handle the blocks on top of that.
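To make the keying difference concrete, here is a minimal sketch of the two schemes. The helper names (hashKey, pathKey) and the exact key layout are illustrative assumptions rather than Geth's actual encoding; the point is only that hash keys collapse identical content into a single entry, while path keys give every trie position its own overwritable slot.

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/sha3"
)

// hashKey mimics the old scheme: a node is stored under the Keccak-256 hash
// of its content, so identical nodes anywhere in any trie share one entry.
func hashKey(node []byte) []byte {
	h := sha3.NewLegacyKeccak256()
	h.Write(node)
	return h.Sum(nil)
}

// pathKey mimics the new scheme: a node is stored under the path leading to
// it (prefixed by the owning account for storage tries), so each position in
// the trie maps to exactly one, overwritable database slot.
func pathKey(owner, path []byte) []byte {
	return append(append([]byte{}, owner...), path...)
}

func main() {
	node := []byte("identical node content")

	// Old scheme: the same content at two positions collapses into one key.
	fmt.Printf("hash key:   %x\n", hashKey(node))

	// New scheme: the same content at two positions gets two distinct keys,
	// which is what makes in-place updates (and thus pruning) possible.
	fmt.Printf("path key A: %x\n", pathKey(nil, []byte{0x01, 0x02}))
	fmt.Printf("path key B: %x\n", pathKey(nil, []byte{0x07, 0x08}))
}
```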
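The layering described in the last few points can be pictured roughly as follows. This is a simplified sketch with invented type names: the real implementation keeps competing diff layers in a tree and organizes disk writes very differently, but the flow of data (diff layers into a dirty buffer, the buffer into the single persistent trie, plus reverse diffs for deep reorgs) is the idea being illustrated.

```go
package main

// reverseDiff can undo one block: for every slot the block modified, it
// records the value from before execution (nil meaning "did not exist").
type reverseDiff struct {
	block uint64
	prev  map[string][]byte
}

// pathDB sketches the layered state database: one persistent, path-keyed
// trie on disk, a dirty buffer aggregating recent writes, the diff layers of
// roughly the last 128 blocks, and reverse diffs kept for deep reorgs.
type pathDB struct {
	disk     map[string][]byte   // the single persistent state trie
	dirty    map[string][]byte   // accumulated writes not yet flushed
	diffs    []map[string][]byte // per-block diff layers, oldest first
	reverses []reverseDiff       // history for rolling the disk state back
}

// flatten folds the oldest diff layer (HEAD-128) into the dirty buffer; the
// buffer only hits disk once it grows past a threshold, so slots rewritten
// by consecutive blocks are short-circuited in memory.
func (db *pathDB) flatten(threshold int) {
	if len(db.diffs) == 0 {
		return
	}
	oldest := db.diffs[0]
	db.diffs = db.diffs[1:]
	for k, v := range oldest {
		db.dirty[k] = v
	}
	if len(db.dirty) >= threshold {
		for k, v := range db.dirty {
			db.disk[k] = v
		}
		db.dirty = make(map[string][]byte)
	}
}

// rollback applies reverse diffs, newest first, until the persistent state
// is back at the target block, enabling reorgs deeper than the diff layers.
func (db *pathDB) rollback(target uint64) {
	for i := len(db.reverses) - 1; i >= 0 && db.reverses[i].block > target; i-- {
		for k, old := range db.reverses[i].prev {
			if old == nil {
				delete(db.disk, k)
			} else {
				db.disk[k] = old
			}
		}
	}
}

func main() {}
```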
The above is a brief overview of the modifications made to Geth's internals to introduce our new pruning mechanism. As you can see, numerous invariants have changed, to the extent that Geth operates quite differently from the earlier versions. Transitioning from one model to another is not straightforward.
We acknowledge that we cannot simply "stop functioning" because Geth has a new data model, which is why Geth v1.13.0 features two modes of operation (a significant addition to the OSS maintenance workload). Geth will continue to support the previous data model (which will remain the default for the time being), so updating Geth won't result in any unexpected behavior for your node. You can even force Geth to adhere to the old mode of operation via --state.scheme=hash.
If you wish to transition to the new mode of operation, you will need to resync the state (you can retain the ancient data, just for reference). This can be done manually or via geth removedb (when prompted, remove the state database, but keep the ancient database). Following this, start Geth with --state.scheme=path. Currently, the path model isn't the default, but if an old database exists and no specific state scheme is indicated on the CLI, Geth will utilize what's contained in the database. As a precaution, we suggest always specifying --state.scheme=path. If no major issues arise with our path scheme implementation, it is likely that Geth v1.14.x will adopt it as the default format.
Please keep in mind a few notes:
- If you're operating private Geth networks with geth init, be sure to specify --state.scheme during initialization; failing to do so will result in an outdated database format.
- For archive node operators, the new data model will be compatible with archive nodes (and will provide the same excellent database sizes as Erigon or Reth), but it requires additional work before activation.
Also, a quick caution: While Geth's new path-based storage is deemed stable and production-ready, it has yet to be extensively tested outside of our team. Everyone is welcome to utilize it, but if your node faces significant risks in case of a crash or consensus disruption, you might prefer to wait until we observe reports from users with less risk exposure.
Now, let's discuss some unexpected side effects…
Quick Shutdowns
Head state missing, repairing chain… 😱
…that startup log message we all dread, knowing our node will be offline for hours, is about to become a thing of the past!!! Before we bid it farewell, let's quickly revisit what it was, why it occurred, and why it's no longer an issue.
Prior to Geth v1.13.0, the Merkle Patricia trie of the Ethereum state was saved as a hash-to-node mapping. Every node in the trie was hashed, and the node value was inserted into a key-value store keyed by the computed hash. This was both mathematically elegant and it included a nifty optimization: if different state segments shared a common subtrie, they would be deduplicated on disk. Pleasant… yet detrimental.
When Ethereum initially launched, only archive mode existed. Every state trie for each block was persisted to disk. This was simple and neat, but it soon became apparent that the necessity to store all historical states indefinitely was impractical. Fast sync was a partial remedy; by periodically resyncing, users could get a node with just the latest state retained and pile succeeding tries on top. Nonetheless, the rate of growth necessitated more frequent resyncs than were manageable in production.
What we required was a method to prune historical states that were no longer necessary for operating a full node. Several proposals emerged, even 3-5 implementations in Geth, but each had massive overheads, leading us to dismiss them.
Ultimately, Geth implemented a complicated ref-counting in-memory pruner. Rather than immediately writing new states to disk, we maintained them in memory. As block processing continued, we amassed new trie nodes and deleted old ones that weren't referenced in the last 128 blocks. When this memory area filled up, we would selectively drip the oldest, still-referenced nodes to disk. Despite being imperfect, this solution significantly reduced disk growth: the more memory allocated, the better the pruning results.
However, the in-memory pruner had a drawback: it only ever persisted longstanding nodes, while retaining everything recent in RAM. When a user wished to shut down Geth, the recent tries, all stored in memory, needed to be pushed to disk. Due to the data layout of the state (hash-to-node mapping), inserting hundreds of thousands of trie nodes into the database could take considerable time (resulting in random insertion order due to hash keying). If Geth was forcibly shut down by a user or a service monitor (like systemd or docker), the state in memory would be lost.
Upon the next startup, Geth would identify that the state tied to the most recent block never got saved. The only remedy was to rewind the chain until a block was found with the complete state available. Since the pruner only ever dripped nodes to disk, this rewind typically reverted to the last successful shutdown. Though Geth would occasionally flush an entire dirty trie to disk to minimize this rewind, it often resulted in lengthy reprocessing after a crash.
We found ourselves in a very challenging situation:
- The pruner required as much memory as possible to function well; however, the more memory it utilized, the higher the chances of shutdown timeouts, resulting in data loss and chain reverts. Reducing memory allocation led to increased junk on disk.
- State was stored on disk keyed by hash, resulting in implicit deduplication of trie nodes. However, this deduplication made it impossible to prune from disk due to the exorbitant costs of ensuring no references to a node remained across all tries.
- Reduplicating trie nodes could have been achieved through a different database layout. Yet, altering the database layout would have rendered fast sync inoperable, as the protocol was explicitly designed around this data model.
- A new sync algorithm could be developed that did not rely on hash mapping, which could substitute for fast sync. However, discontinuing fast sync in favor of a new algorithm would necessitate all clients implementing it first to prevent segmentation of the network.
- An effective sync algorithm based on state snapshots could be beneficial but would require someone to manage and serve those snapshots, essentially creating a second consensus-critical version of the state.
It took considerable time to escape from the above predicament (yes, these were the planned steps all along):
- 2018: Initial designs for snap sync are drafted, necessary supporting data structures are developed.
- 2019: Geth begins generating and maintaining the snapshot acceleration structures.
- 2020: Geth prototypes snap sync and finalizes the protocol specification.
- 2021: Geth releases snap sync and transitions from fast sync.
- 2022: Other clients begin implementing consumption of snap sync.
- 2023: Geth switches from hash to path keying.
  - Geth becomes unable to support the old fast sync.
  - Geth reduplicates persisted trie nodes to enable disk pruning.
  - Geth replaces in-memory pruning with an efficient persistent disk pruning solution.
A request to other clients at this stage would be to implement the serving of snap sync, not merely the consumption. At present, Geth is the sole participant in the network maintaining the snapshot acceleration structure leveraged by all other clients to sync.
What does this lengthy journey lead us to? With Geth's fundamental data representation switched from hash-keys to path-keys, we have successfully replaced the in-memory pruning method with a sleek new on-disk pruning strategy, which keeps the state on disk fresh and current. Though our new pruner does utilize an in-memory component for optimization, it fundamentally operates on disk, and its effectiveness is 100%, independent of how much memory is available.
With the newly designed disk data model and reworked pruning mechanism, the data held in memory is small enough to be flushed to disk in a matter of seconds during shutdown. Even in the event of a crash or sudden termination, Geth will only need to rewind and re-execute a few hundred blocks to catch back up to its previous state.
Bid farewell to prolonged startup times, as Geth v1.13.0 ushers in a bold new era (with --state.scheme=path, of course).
Eliminate the --cache Flag
No, we haven't discarded the --cache flag, but you might want to think about doing so!
Geth's --cache flag has a convoluted history, evolving from a straightforward (and ineffective) parameter to a rather intricate feature whose behavior is challenging to convey and accurately assess.
Back in the Frontier days, Geth lacked many parameters to adjust for optimizing performance. The main optimization available was granting memory to LevelDB to retain more of the recently accessed data in RAM. Interestingly, allocating RAM to LevelDB is similar to letting the OS cache disk pages in RAM; explicitly allocating memory only starts being advantageous when multiple OS processes compete for the same data and risk thrashing each other's OS caches.
During that time, enabling users to allocate memory for the database appeared like a shot in the dark aimed at optimizing performance. Unfortunately, it also turned into a classic self-sabotage mechanism, because Go's garbage collector tends to struggle with large idle memory segments: the GC runs when the amount of newly allocated memory matches the useful data retained after the previous run (i.e., it essentially doubles the RAM requirement). Thus started the saga of Killed and OOM crashes…
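As an aside, this doubling behavior is easy to observe in the Go runtime itself. The snippet below is not Geth code, just a tiny illustration: with the default GOGC=100 the pacer targets the next collection at roughly twice the live heap, which is why a multi-gigabyte cache held on the Go heap used to translate into roughly double that in actual memory usage.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Simulate a large in-heap cache, similar in spirit to Geth's old
	// Go-managed caches: roughly 1 GiB of live data on the heap.
	cache := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		cache = append(cache, make([]byte, 1<<20))
	}

	runtime.GC() // force a cycle so the pacer recomputes its target

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// With the default GOGC=100, NextGC sits near 2x the live heap: the
	// runtime lets allocations roughly double the heap before collecting
	// again, so a big in-heap cache roughly doubles the RAM requirement.
	fmt.Printf("live heap: %d MiB, next GC target: %d MiB\n",
		m.HeapAlloc>>20, m.NextGC>>20)

	runtime.KeepAlive(cache)
}
```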
Fast forward five years, and the --cache flag has evolved significantly:
- Depending on whether you're on mainnet or testnet, --cache defaults to 4GB or 512MB.
- 50% of the cache allowance is granted to the database as a passive disk cache.
- 25% of the cache is allocated to in-memory pruning, with 0% for archive nodes.
- 10% of the cache is designated for snapshot caching, and 20% for archive nodes.
- 15% of the cache is allocated for trie node caching, 30% for archive nodes.
The total size and each percentage can be adjusted through flags, but let's be honest: no one truly understands how to do that or what the ramifications are. Many users increased the --cache setting because it seemed to lead to less junk over time (that 25% portion), but it also risked potential OOM problems.
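To make the split concrete, here is a small sketch of how those percentages translate into megabytes for a given --cache value. The splitCache helper and its struct are invented for illustration; only the ratios come from the list above.

```go
package main

import "fmt"

// cacheSplit mirrors the percentages described above, in MB.
type cacheSplit struct {
	Database, Pruning, Snapshot, Trie int
}

// splitCache divides the total --cache allowance according to the listed
// ratios, for a full node or an archive node.
func splitCache(totalMB int, archive bool) cacheSplit {
	if archive {
		return cacheSplit{
			Database: totalMB * 50 / 100,
			Pruning:  0, // archive nodes never prune
			Snapshot: totalMB * 20 / 100,
			Trie:     totalMB * 30 / 100,
		}
	}
	return cacheSplit{
		Database: totalMB * 50 / 100,
		Pruning:  totalMB * 25 / 100,
		Snapshot: totalMB * 10 / 100,
		Trie:     totalMB * 15 / 100,
	}
}

func main() {
	// Mainnet default allowance (--cache=4096) on a full node.
	fmt.Printf("full node: %+v\n", splitCache(4096, false))
	// The same allowance on an archive node.
	fmt.Printf("archive:   %+v\n", splitCache(4096, true))
}
```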
Over the last two years, we have tackled a variety of changes to alleviate the complexities:
- Geth's default database was transitioned to Pebble, which uses caching layers outside the Go runtime.
- Geth's snapshot and trie node caches now utilize fastcache, also allocating memory outside of the Go runtime.
- The new path schema prunes state in real-time, which allowed the previous pruning allowance to be redirected to the trie cache.
The net effect of all these alterations is that Geth's new path database schema should result in 100% of the cache being allocated outside of the Go GC arena. Consequently, users can adjust the cache upward or downward without negatively impacting the GC's operations or overall memory usage in Geth.
However, the --cache flag no longer affects pruning or database size, so users who previously adjusted it for those purposes can drop the flag. Users who simply set it high because they had spare RAM should also consider removing it and observing how Geth performs without it. The OS will continue to use any free memory for disk caching, so not setting it (i.e., letting it default lower) may contribute to a more reliable system.
Epilogue
As in all our previous releases, you can find the: