The Network Handbook

Note: This book is a work in progress. Some chapters are still missing, some are incomplete. You are reading a draft version.

Welcome to Chorus One’s Network Handbook. Chorus One is one of the leading node operators for decentralized networks. We operate validators for proof-of-stake blockchains and infrastructure for adjacent projects. We have many years of experience operating more than 70 networks, and through that experience we have learned a lot about how to operate such networks securely and reliably.

In this book we explain how we operate nodes and what we like to see from node software to enable us to do that reliably. The goal of this book is to help new and existing networks improve by sharing insights, best practices, and patterns that emerged from operating nodes at scale, which may not be obvious to the developers when launching a new network.

Aside from helping blockchain networks to improve, this book offers our delegators a glimpse into how we operate our nodes. Transparency is a core value at Chorus One, and we believe that by being transparent, we can help you make a more informed decision about which validator to delegate to.

Finally, we are always looking for talented engineers to join our team, and this book gives a good overview of the work we do on a daily basis. If the problems we solve sound interesting to you, check out our open positions!

Source code

This book is open source, licensed under the Creative Commons BY-NC-SA 4.0 license. The source files from which this book is generated are available on GitHub. The official version of this book is hosted at https://handbook.chorus.one.

How Chorus One Operates Nodes

At Chorus One we operate nodes reliably for more than 70 networks. Over time we’ve noticed patterns across those networks, and we learned what approaches work well. Years of incident response have forced us to build our infrastructure in a way that is resilient. In this section of the book, we describe the infrastructure that we have converged on.

In short, we run our workloads primarily on bare metal machines in data centers operated by multiple different providers, in many different countries. This gives us maximum performance and resiliency at a cost-effective price. This approach is not without challenges: engineering is about making trade-offs. In each of the following chapters we highlight one aspect of our setup, and why it works the way it does.

The hardware layer

At Chorus One we specialize in everything above the hardware layer, and we outsource the physical infrastructure. Our economies of scale lie in automation: automation to administer servers across many providers, and automation to operate many different blockchain networks. In terms of hardware footprint, we are not so large that it makes sense for us to own the hardware layer of the stack.

Physical infrastructure

We work with multiple providers who offer servers for rent, together with the rackspace, installation services, Internet connectivity, etc. In some cases these providers are vertically integrated and have their own data centers; in other cases they lease the base infrastructure but own the servers and networking equipment. Often providers work with a mix of both depending on the location. We do not own any server hardware; sourcing and assembling the parts is done by the provider. For us this is the sweet spot: more flexible and hands-off than building and servicing our own servers, while several times cheaper than renting virtual resources in the cloud.

Hardware configuration

The servers we work with are most suitable for applications that require a balanced mix of CPU cycles, memory, storage space, and network bandwidth. This is the case for most blockchains, but not for some more specialized peer-to-peer networks. For example, we don’t try to offer competitive pricing per gigabyte for storage networks: a few terabytes per blockchain is fine, but matching the price per gigabyte of parties who build and own purpose-built storage servers would require very different server types.

CPU Architecture

All our servers use x86-64 CPUs, as this is still the standard architecture that all software can run on, and in the server market, the top-performing CPUs (in terms of instructions per second, not performance per watt) are still x86-64 CPUs.

Operating system

Our production systems run long-term support versions of Ubuntu Linux, because this is the common denominator that most software tries to be compatible with.

The cloud computing vs. bare metal trade-off

At Chorus One we run the majority of our workloads on bare metal machines across many different providers. In an era where public clouds dominate public mindshare this might be surprising, but it turns out that blockchain workloads have special requirements that make bare metal the more suitable option for us. Let’s dive in.

Two undeniable strengths of cloud computing are reliability and flexibility. You can spin up a virtual machine in seconds, and the virtual machine essentially never fails. Clouds have the freedom to live-migrate VMs on scheduled host maintenance, and even detect unhealthy hosts early and migrate VMs away before customers notice any impact. Virtual disks can be backed by redundant network storage. The cloud provider is doing the hard work of presenting virtually infallible disks to the customers. Behind the scenes there are still fallible disks, but to the guest operating system, IO failures are a relic of the past. Virtual disks can also be resized on demand to sizes well beyond the capacity of a single physical server.

These advantages of cloud computing come at a cost. There are literal costs:

  • Resources are more expensive. Cloud compute, memory, and storage can be 2 to 10 times as expensive as the bare-metal equivalent.

  • Cloud bandwidth is metered. A month of 10 Gbps egress is in “contact sales” territory, and at public list prices it costs about 200× as much as an unmetered 10 Gbps connection at a bare metal provider. For a web application that processes a few small requests per second the egress cost is manageable, but for bandwidth-hungry applications such as video streaming or chatty peer-to-peer blockchain networks, cloud egress is prohibitively expensive.

Aside from financial costs, running in the cloud also means sacrificing performance and control:

  • Performance can vary wildly between different CPU models. A 5 GHz latest generation AMD CPU can finish some single-core workloads in 1/5th of the time it takes an 8-year old 2.4 GHz Intel CPU. In the cloud, compute is measured in virtual CPU cores, and you get only very indirect control over what CPU family that is (“performance-optimized” vs. “best price-performance”). For many web applications that are IO-bound anyway, single-core performance is not a decisive factor and cloud vCPUs are more than adequate, but for compute-intensive workloads, having access to the fastest CPU on the market is a clear advantage.

  • Networked storage is reliable, but at a latency and throughput cost. The fastest durable storage technology available today is SSDs that connect directly to the CPU’s PCIe bus: NVMe drives. Virtualized network storage, although more reliable, will never be able to match this in performance. Read throughput in clouds tops out around a GB/s, while an array of local NVMe drives can reach 20× that. Again, for most web applications this is hardly relevant, but for storage-intensive applications such as databases and blockchains, this can mean an order of magnitude performance difference.

  • There is overhead to CPU virtualization. Fortunately this overhead is small nowadays, but for performance-oriented networks, every little bit helps.

  • Public clouds do not support custom hardware. For some blockchains, we work with hardware security modules. These are small USB devices that we prepare, and then plug into the physical server at the data center. This means we need a dedicated server, and even then, not all vendors are willing to do this (for understandable reasons).

While cloud computing is a great default choice for many applications, the blockchain networks we run have specialized requirements for which cloud is either not cost-effective or unable to deliver adequate performance. At Chorus One we use cloud where it makes sense, but the vast majority of our workloads run on bare metal.

The unique challenges of bare metal

Given that we run on bare metal, we face a class of challenges that does not exist in the cloud.

  • Hardware lead time. For popular server configurations, our vendors tend to have these pre-assembled, and we can get a machine in hours. For more specialized configurations, our vendors themselves are dealing with lead times on the components, and it can take weeks before a machine is delivered.

  • Flash memory wears out. SSDs are consumables: they are rated for a limited number of writes. Many workloads never exhaust the writes, and the hardware is obsolete before it ever reaches its rated lifespan. But if there is one class of applications that is write-heavy, it’s blockchains, which are continuously writing new blocks and indexes. We do observe disks wearing out, and we routinely need to get them replaced.

  • Other hardware can fail too. Although disk failures are the most common hardware issue we observe (not only due to wear), other hardware components can and do fail.

  • Maintenance downtime. In the cloud, a virtual machine can migrate to another host so the original machine can be serviced with minimal impact to the user. On bare metal, we have to turn off the machine, and a technician has to walk to the rack and take the server out to service it. This takes minutes at best, sometimes hours.

  • Limited storage capacity per machine. The amount of storage we can put in a single machine is limited. While there exist dedicated storage servers that can hold petabytes worth of data, that is not the kind of fast NVMe storage that blockchains demand nowadays. 8 TB of NVMe storage per server is pretty standard. More than this is possible, but might not be available in every location or with every CPU model or network card, so this may limit our ability to provide geographic redundancy or best-in-class hardware.

  • Commitment periods. A high-end server is a serious investment for our vendors, and some of these machines are specialized enough that they may not be able to easily repurpose them after we no longer need the hardware. This means that when we order a machine, we often have to commit to renting it for a period of months to years.

The general theme is: we have to deal with unreliable components, and when they fail, we may not be able to order a replacement quickly.

Building reliable systems from unreliable parts

When you operate enough machines for a long enough time, unlikely events become routine, and the unthinkable becomes a serious risk. At Chorus One we’ve dealt with data centers catching fire, data centers flooding, blockchain networks unintentionally causing DDoS attacks, and vendors blocking traffic without notice.

Although we work with enterprise-grade hardware that is more reliable than consumer hardware, all hardware fails at some point. At a certain scale, hardware failure becomes inevitable. If a disk has a 1.2% probability of failing in a given year, then across a fleet of 500 machines with two disks each, there’s about a ⅔ probability that at least one disk fails in a given month. The solution to this is redundancy.
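
To spell out the back-of-the-envelope estimate (assuming independent failures and a failure rate that is uniform over the year):

```latex
p_{\text{month}} \approx \frac{0.012}{12} = 0.001,
\qquad
P(\text{at least one of } 500 \times 2 \text{ disks fails in a month})
  = 1 - (1 - 0.001)^{1000}
  \approx 1 - e^{-1}
  \approx 0.63 \approx \tfrac{2}{3}.
```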

There are a few levels at which we can implement redundancy, but we choose to do it primarily at the node software level, by running multiple instances of the same software on different machines. This is the most general approach, and a good fit for blockchain software. Let’s compare some alternatives.

  • RAID-style redundant storage. While redundancy at the block device level ensures that we still have a copy of the data if a disk fails, the faulty disk needs to be replaced, which causes downtime. Furthermore, while RAID protects against disk failures, it does not protect against other types of hardware failure, natural disaster, or software bugs that cause corruption. When there is a single source of truth for the data, RAID may be the only option. But for blockchains, we have better options.

  • Other forms of block device and file system-level replication. While redundancy at the block device or filesystem layer is a possible way to add redundancy, blockchains already have a built-in redundancy mechanism. By their nature, every node in a blockchain network stores the same data.1 Furthermore, the node software already has a replication protocol built in: the regular p2p data distribution code. This means we do not need to have replication at the block device or file system layer: we can simply run multiple instances of the node software.

  • Redundant network storage. We could run our own multi-node storage cluster that can tolerate single nodes going offline, such that we can service one machine without impacting storage. This incurs a performance overhead and is operationally complex. To keep performance overhead manageable, the nodes must also be geographically close (preferably at the same site, probably not in different cities). This means it doesn’t protect against natural disasters.

We do use redundant storage for data that is inconvenient to lose. For non-public data that we cannot afford to lose (in particular, cryptographic keys), we work with redundant storage clusters such as HashiCorp Vault. But this data is tiny in comparison to the blockchain data we handle, and IO performance is not a concern there.

To summarize, our approach to redundancy is to run multiple instances of the node software, on different machines, in different geographic locations, with different vendors. This protects us against a wide variety of disasters:

  • Hardware failure
  • Failure on the data center level, including natural disaster
  • A vendor or internet service provider going offline completely, for whatever reason
  • Software bugs that trigger nondeterministically
  • Operator error (through staged rollouts)

Our approach to reliability is not to try to prevent individual components from failing. We take it as a given that some components are going to fail, and through redundancy, we build systems that are reliable regardless of the reliability of the underlying components.

1

This is changing with the advent of erasure-coded data sharding and so-called blobspace. Fortunately, if the choice about what data to store is deterministic, we can still configure multiple machines to store the same data.

Workload placement

The blockchain workloads that we run place constraints on where and how we can run a piece of software:

  • For chains with special hardware or bandwidth requirements, only the machines that we ordered specifically for this network are suitable for running it.
  • For especially demanding node software, we use a dedicated machine. For less demanding node software, we may run multiple different blockchains on the same machine to improve utilization. To avoid correlated failures, we don’t want to co-locate the same pair of blockchains on multiple servers.
  • We want to be close enough to peers to get low latency and minimize packet loss, though not so close that we risk creating a centralization vector for the network. We prefer to avoid data centers that already host too many other nodes, though there’s a stability/decentralization trade-off here.
  • For a redundant pair (two instances of the node software running on different machines), we want the two nodes to not be in the same data center, and preferably in different countries, in data centers owned by different parties.
  • The use of hardware security modules further limits the data centers we can use.

In other words, both our workloads and our machines are very heterogeneous, with many specific constraints. Generic workload schedulers such as Kubernetes or HashiCorp Nomad work well with a fleet of homogeneous machines, but are less suitable for our blockchain workloads. On top of that, blockchain node software is typically not designed to be terminated and restarted elsewhere at short notice by a generic workload scheduler. (For example, it has to re-discover peers in the peer-to-peer network, which impacts stability.) In addition, for some high-performance blockchains we tune kernel parameters for that particular blockchain.

For these reasons, we allocate our blockchain workloads to machines manually. We do use Kubernetes for some internal applications and stateless workloads that just need to run somewhere, but for blockchain workloads, we want to have full control over workload placement.

Monitoring and alerting

We employ multiple forms of monitoring to confirm that our nodes are operating as expected, on both short and long timescales.

Prometheus

For monitoring we use the industry-standard Prometheus, combined with its companion Alertmanager for alerting. For example, most chains expose a block height metric that should go up as new blocks get added to the chain. When this metric stops going up, either there is a problem with our node, or the network itself is halted. Both are reasons for concern, so this is something we alert on. Alerts that need immediate attention get routed to our 24/7 oncall rotation.

Local and global views

Prometheus metrics exported by a node tell us what that node’s view of the chain is. They give us a local view, which might not match what the chain is doing globally. For example, when node software is running on an under-powered machine, it might not be able to process blocks as fast as the network can add them. Its view of the chain gets increasingly stale. Yet, it is still progressing, so the block height metric continues to go up, and our “block height stopped going up” alert would not catch that. We use several techniques to still catch this:

  • Rely on built-in gossip-based metrics. Often the node software differentiates between “syncing” and “in sync”. It can do that by comparing its own block height against the highest known block height in the p2p gossip network. Often this highest block is also exposed as a metric.

  • Compare across nodes. We generally run multiple nodes for redundancy. One additional benefit of this is that we can compare the block height across all our nodes, and identify when one falls out of sync.

  • Use external sources. In some cases we can directly or indirectly obtain others’ view of the chain, and leverage this to make our own monitoring more robust. For example, through our Wormhole node we can learn other guardians’ block height on many networks.

Log-based metrics

We generally do not work with log-based metrics. In cases where Prometheus metrics are not natively available, we write our own patches or exporters, or we convert logs into something that can be scraped by Prometheus.

Long-term metrics and optimization

Prometheus metrics are good for short-term monitoring. This data is most useful when a change in a metric needs immediate attention (on a timescale of minutes) from our oncall rotation. On a longer timescale, there is a different set of metrics that is important to optimize. For example, we can measure how many blocks our validator was supposed to produce, and how many it actually produced. If we produced only 98% of assigned blocks, that is something to investigate, but not something to wake up an oncall engineer for.1

For long-term metrics we combine data from various sources: we have internal tools that index data from our RPC nodes, and we work with external sources to verify our own data and fill in the gaps. To optimize our performance long-term, we have a dedicated research team that is continuously looking for ways to improve our performance.

1

This is because metrics such as skip rate (the percentage of blocks that we failed to produce out of the blocks we were assigned to produce) are only meaningful when the denominator is large enough. This only happens at long enough timescales. For example, if we measure over a small time window in which we were assigned to produce two blocks, then the only skip rates we can observe are 0%, 50%, and 100%, which is very coarse. To detect small effects on skip rate reliably, we need a time window with thousands of blocks, which is typically hours to days, not minutes.

Oncall rotation

In the previous chapter we looked at how we monitor our nodes and generate alerts when anything is amiss. Our 24/7 oncall rotation is standing by to handle those alerts within minutes.

Team

The majority of our engineering team consists of what we call platform engineers. Some companies call this role site reliability engineering instead. Platform engineers are responsible for our infrastructure, for operating our chains, and, periodically, for handling alerts.

Chorus One is a remote company and our team spans a wide range of time zones, from Asia to the Americas. The majority of our people are located in Europe.

Coverage

We ensure that at least one engineer is able to handle alerts at all times. We employ enough people to guarantee this; in case of unforeseen personal events we can find somebody else to step in. A shift is at least a full day; we do not use a follow-the-sun schedule. If an alert fires outside of regular working hours, an oncall engineer gets paged. That might be at night.

Preventive measures

Our oncall rotation enables us to respond to emergencies 24/7, but of course engineers still prefer a quiet night of sleep. The best incident response is not having an incident in the first place. Internally we achieve that through redundancy. For network-wide incidents, the second half of this book contains our recommendations to node software authors to minimize the risk of outages. The timing section of the release engineering chapter is particularly relevant.

Node Software Guide

With more than half a decade of experience operating more than 70 blockchain networks, we’ve noticed some patterns that make a network easy to operate reliably, and also some pitfalls that less mature networks might not be aware of. In this section we describe what from our point of view is the ideal way to build a blockchain network that can be operated reliably. By sharing our perspective, we aim to help networks build better software and more stable mainnets.

Classification

Throughout this guide we classify our best practices from P0 to P3, ranging from essential to nice to have. We use this classification internally as a guiding principle to decide whether a network is suitable to onboard. We take context into account when we work with upcoming networks, and we understand that no network can follow all recommendations from day one. However, when a network fails too many of our recommendations, the operational overhead will likely outweigh the financial benefit of becoming a node operator on that network. Conversely, following these recommendations is a signal of a mature, professional network, and a high-priority target for us to onboard.

Essential: Early development maturity

An essential practice is something we expect from a network regardless of maturity level. These should be addressed as soon as possible by any network that aims to one day launch a mainnet.

Important: Testnet maturity

An important practice is something we expect from a network that is launching a public testnet. Implementing these recommendations signals readiness to engage with a wider group of node operators, and will ensure a smooth testnet experience for developers as well as node operators (as far as testnets can be smooth).

Recommended: Mainnet maturity

A recommended practice is something that we expect a network to implement before or shortly after launching a public mainnet. These practices become important at a maturity level where mistakes start to have serious financial consequences.

Desirable: Industry-leading maturity

A desirable practice is something that makes our life as a node operator easier, but which is not otherwise essential. We understand that node software authors sometimes have more important issues to tackle than making node operators happy. Projects that implement our P3 recommendations are exemplary, and set the benchmark for other projects to strive for.

Open source software

At Chorus One we believe that decentralized networks have the potential to create freedom, innovation, efficiency, and individual ownership. Users of those networks need to be able to trust that:

  • The network does what it promises to do, without back doors, special cases, security vulnerabilities, or artificial limits.
  • The network can continue to operate even when its original authors are no longer around.

A prerequisite for both is that the source code for the network is publicly available. A prerequisite for the second point is that the source code is available under a license that allows users to make changes if needed. At Chorus One we therefore strive to only validate networks whose node software is publicly released under an OSI-approved license.

Release the project under an open source license.

Ensure that source code for the project is publicly available, released under an OSI-approved license. See below for how to handle stealth launches.

Transparent history

While access to the source code in theory allows anybody to review it for back doors and other issues, almost any successful software project quickly grows so large that it is no longer feasible for a single person to review all of it.1 How, then, could anybody trust a large project? Large projects don’t appear out of nowhere: they are built over time through many small changes, and those changes can and should have been reviewed.

To establish trust in a large project, it is not enough for the source code to be available; it is important that users can verify its history, and check how it was built and by whom. That doesn’t mean that authors need to disclose their identity: it is possible for a pseudonymous author to establish a track record over time. However, when there is a code dump that adds half a million lines of code in a single commit, it’s impossible to establish the provenance of that code, which makes the project difficult to trust.

In addition to trust reasons, having good source control history is simply good practice for any software project. A good history is a valuable tool for developers, both for debugging (e.g. with git bisect) and understanding the context of a piece of code (e.g. with git log and git blame). We as node operators occasionally have to dive into the source code of a network as well, and access to the history is very helpful for us to understand why a piece of code works in a certain way.

1

One notable exception to this is smart contracts, which for many reasons have to be kept deliberately small.

Be transparent about the provenance of your source code.

Even when a project is developed in stealth at first, when the time comes to go public, do not merely publish a source code dump, which destroys valuable metadata. Publish the full revision control history.

Build in the open.

Building behind closed doors and periodically publishing new versions is not technically incompatible with open source. However, in the true spirit of open source and crypto ethos, developing in the open builds trust and helps to foster a community.

Stealth launches

We understand that some teams prefer to build privately, even if they have the intention to release all software publicly at a later stage. When there is a clear path towards making the source code public, we are happy to join a network at an early stage if it is possible for us to get access to the source code. Of course, we treat your privacy with the utmost care, and we can sign an NDA if needed.

Dealing with zero-day vulnerabilities

Handling vulnerabilities in a project that is developed in the open is tricky, because publishing the fix might draw attention to the vulnerability before users of the software have had a chance to update. There are two ways of handling this:

  • Pre-announce the existence of the vulnerability. In the announcement, include date and time at which a new version will be published. This ensures that we can have an engineer standing by to act quickly at the time of the release.

  • Privately distribute a patch to node operators. While it is not feasible to have contact details for all node operators in an open-membership network, reaching a superminority of stake is often feasible. We are happy to work with you to establish a private communication channel, and if needed we can provide you with a way to reach our 24/7 oncall team who are able to get back to you within minutes (for severe emergencies only).

These two options can be combined for maximum impact.

While distributing patched binaries is a tempting way of dealing with vulnerabilities, that approach puts node operators between a rock and a hard place:

  • We have uniform build and deployment automation that is optimized and battle-tested. Going through our regular process eliminates room for human error. If we have to deploy a binary from a different source in an ad-hoc way, we have to bypass protocols that are established for good reasons, at the risk of introducing misconfigurations.
  • We build all software from source for reasons described in the build process chapter. When we are asked to run an untrusted binary blob on our infrastructure, we have to weigh the risk of continuing to run the vulnerable version against the risk of the untrusted binary being built in a way that is incompatible with our infrastructure, and the risk of the binary blob unintentionally introducing new vulnerabilities through e.g. a supply chain attack.
  • Although it is certainly more difficult for bad actors to identify the vulnerability from a binary diff than from a source code diff, this is only a small roadblock for somebody versed in reverse-engineering. Releasing patched binaries still starts a race against the clock.

Given these downsides, we strongly urge authors to make source code available for security updates. Patched binaries can of course still be helpful for node operators who have less stringent requirements around what they run on their infrastructure. This solution can be complementary, but it’s not appropriate as the only solution.

Ensure that node operators can build security fixes from source.

As described in the build process chapter, we build all software that we operate from source. Making an exception for security fixes is a difficult trade-off that we do not make lightly. We prefer to not have to make that trade-off.

Software development best practices

There are many ways to build software, and we don’t want to force a workflow onto anybody. However, there are some practices that are good to respect regardless of workflow or structure. We understand that especially for early-stage projects, it doesn’t always make sense to follow all the best practices, but when a mainnet launch is approaching we expect all of these to be in place.

Basics

We mention these basics for completeness. They should be self-evident for any project regardless of maturity level.

Respect licenses of upstream software.

When including code that is not owned by you in your repository, respect the license and clarify the origins of this code, even when not strictly required by the license. Even in cases where third-party code is not directly part of your repository (e.g. dependencies pulled in through a package manager), its license may place restrictions on derived works that you need to respect.

Break down changes into logical parts and write a clear commit message for each change.

As we described before, the history of a project becomes an important asset later on, and the history is one of the few things that you cannot fix after the fact.

Use comments to clarify non-obvious code.

Any non-trivial project will contain parts that are not obvious. Use comments to explain why a piece of code is there. Furthermore, while temporarily commenting out pieces of code can be helpful during development, code that is commented out should not end up merged into the main branch.

Testing

Two types of software that are notoriously among the hardest to get right are cryptography code and distributed systems. Blockchain node software combines those two. An automated testing strategy (unit tests, integration tests) is the bare minimum, and given that blockchains are under more scrutiny from malicious parties than most software, actively hunting for bugs with e.g. fuzzers or even model checkers is more than a nice-to-have for ensuring mainnet stability.

Write automated tests that are included in the repository.

There should be a way to have basic confidence in the correctness of the code. Furthermore, when bugs are discovered through other means, regression tests can prevent future developers from re-introducing a similar bug.

Write fuzz tests for code that deals with user input (network or user data).

If you don’t write (and run) a fuzzer, a security researcher will write one, and you’d better hope it’s a white hat when they do. In practice, few projects have this level of testing from the start — security is rarely a priority, until it’s suddenly top priority because somebody is attacking your system.

Quality assurance

Have a code review process.

For personal projects it is normal that developers write code and push it without review. For node software written in a professional setting with the intent of being adopted by commercial node operators, the bar is higher: there should be a process for reviewing changes. Ideally that process includes a real review, and not just a rubber stamp acknowledging that some change was made.

Write clear pull request, merge request, or changelist descriptions.

Descriptions are useful not only for the reviewer, but also for people following along (like us as node operators), and especially for future readers who want to understand why a change was introduced.

Set up continuous integration.

Any checks that are not mechanically enforced will be violated sooner or later.

Security

These are nice-to-haves early on in the project, but start to become important when mainnet attracts significant value.

Set up a bug bounty program.

You want security researchers to have a viable honest alternative to selling or exploiting a vulnerability.

Set up a responsible disclosure policy.

Clarify to security researchers how they can report discoveries to you, and publish this in places where security researchers tend to look, like your website and Git repository.

Build process

At Chorus One we strongly prefer to build all node software that we operate from source. We generally do not run prebuilt binaries or upstream container images. We do this for multiple reasons:

  • Transparency. As described in the open source chapter, access to the source code is a prerequisite for users and node operators to be able to trust the network. However, access to the source code alone is meaningless when everybody runs pre-built binaries. How do we know that the source code is really the source code for the software that’s running in practice? The easiest way to be sure is to build it from that source code.

  • Security. Most node software we operate is written by reputable parties, and the risk that they are actively trying to hide malware in binary releases is low. However, as organizations grow, insider risk grows with it. Furthermore, when we don’t have full control over the build environment and build process, we cannot rule out supply chain attacks that might be trying to mess with the build process. The recent liblzma backdoor (CVE-2024-3094) illustrates that supply chain attacks are a real concern, and with the upwards trend in number of dependencies (thousands of dependencies is now commonplace for Rust projects), we cannot just dismiss this as a hypothetical risk.

  • Performance. For performance-oriented chains, we compile software with the compiler optimization flags tuned for the specific CPU microarchitecture that we deploy the software on.

Aside from access to the source code and a working build process, we don’t have strict requirements on how to set up your build. The more standard a build process is (e.g. cargo build after a clone just works), the easier it is for us to integrate, but if the build process is well-documented, we can usually find a way to make it work. Still, there are some trends that we can use to give general recommendations.

In general, software written in Go or Rust is easy for us to build. C/C++ are usually acceptable too. JavaScript is generally impossible to package except as a container image, and impossible to secure due to an ecosystem where depending on tens of thousands of microlibraries is commonplace.

General recommendations

Ensure your software can be built on a stock Ubuntu LTS installation.

Ubuntu Linux is the common denominator that is supported by almost any software project. We run Ubuntu LTS on our servers to minimize surprises specific to our setup, and for consistency, we also prefer to use it as the base image for applications deployed in containers.

Don’t require Docker as part of your build process.

While Docker is convenient for less experienced users, depending on external images has the same security implications as downloading untrusted binary blobs, and therefore we cannot allow this. When your official build process involves Docker, this forces us to reverse-engineer your Dockerfile, and if our build process deviates too much from yours, it is more likely to break.

It is of course great if you offer official pre-built container images to enable less experienced people to join as node operators. You can achieve that by running your regular build process inside a Dockerfile. The Dockerfile should invoke your build process, but your build process should not invoke Docker.

Don’t fetch untrusted binaries from the Internet as part of your build scripts.

Aside from the security implications, flaky third-party webservers are a common source of failing builds. These types of flakes are rare enough that it’s difficult to get the time-outs and retries right, yet at scale they are common enough to be a nuisance. Language package managers and system package managers that download from official registries are of course fine.

Golang recommendations

This section is a work in progress.

Rust recommendations

Include a rust-toolchain.toml file in your repository.

The official standard way to encode which Rust toolchain to use, in a machine-readable form that is automatically picked up by rustup, is to specify the version in a rust-toolchain.toml file.
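
For reference, a minimal rust-toolchain.toml looks something like this (the pinned channel and components here are only an example):

```toml
# rust-toolchain.toml in the repository root; rustup picks it up automatically.
[toolchain]
channel = "1.78.0"
components = ["clippy", "rustfmt"]
```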

Rust is evolving rapidly, and code that was tested with one version of the Rust toolchain often does not compile with an older toolchain. Furthermore, we have seen cases where the code compiled fine, but the binary behaved differently depending on the compiler version, leading to segfaults.

Projects that include a rust-toolchain.toml are easy for us to integrate with our build automation. When you specify the version in a non-standard location (for instance, as part of the configuration of some CI workflow), we have to write custom scripts to extract it from there, which is more fragile and duplicates a feature that rustup already provides. When you don’t specify a version as part of the repository at all, we have to guess, and it will be harder for people in the future to build older releases of your software, because they will not know which toolchain to use.

Release engineering

As described in the open source chapter, we only run software for which the source code is available. Communicating clearly about where to get the source code, and when to run which version, is part of release engineering. Solid release engineering can mean the difference between a mainnet halt and a smooth uneventful upgrade. In this chapter we share our experience with what makes for a smooth release process.

Git

There exists a plethora of version control systems, but nowadays, the entire blockchain industry is using Git. So far we’ve never encountered a project that we considered operating that was not using Git, so this guide is focused solely on Git.

Publish the source code in a public Git repository.

See also the open source chapter. As for Git specifically, our build automation has good support for Git; a publicly hosted repository (e.g. on GitHub or Codeberg) is easy for us to integrate.

Mark releases with a Git tag.

Every Git commit points to a tree, a particular revision of your source code that we could build and deploy. To know which revision we are expected to run, ultimately we need to know this commit. You could announce a commit hash out of band, for example in an announcement post. However, this is hard to discover, it’s hard to locate historical versions, and there is no standard tooling for it.

Fortunately there is a standard solution. To mark some commits as special, Git has the concept of tags: human-friendly names that point to commits. This data is first-class, embedded in the repository, machine-friendly, and has wide support from e.g. GitHub and git tag itself.

We have build automation that will automatically discover new tags in your repository. When you tag your releases, they are easy for us to integrate. Therefore, please tag all commits that you expect us to run, even if they are only for a testnet.

Use annotated Git tags.

There are two kinds of tags in Git: lightweight tags and annotated tags. Lightweight tags are a bit of a historical mistake in Git; they do not carry metadata like the creation time and author of the tag. Knowing the creation time of a tag is very valuable, so always use annotated tags.

Do not — never ever — re-tag.

Re-tagging — deleting a tag, and then creating a new, different tag with the same name — creates confusion about which revision of the source code truly corresponds to that version number. When you re-tag, two different parties can both think they are running v1.3.7, but they will be running different software. This situation is unexpected, and therefore very difficult to debug. Sidestepping such confusion is easy: do not re-tag, ever. See also the section on re-tagging in the Git manual.

What if you accidentally tagged the wrong commit, and already pushed the tag?

  1. Do not delete and re-tag it. We (and probably other node operators as well) have automation watching your repository for new tags. Automation can fetch bad tags faster than humans can realize that the tag is bad. Once published, there is no going back.
  2. Create a new tag, with a different version number, pointing to the correct commit.
  3. Announce through your regular channels that the bad version should not be used, and which version to use instead.
  4. Do not delete the bad tag. Automation will have discovered it anyway. What is more confusing than encountering a bad tag is finding that bad tag in your local checkout but not being able to find any trace of it upstream. Instead, clarify externally that the tag should not be used, for example on its release page when using GitHub releases.

If this sounds like a big hassle, well, it is. The best way to avoid this hassle is to not publish bad tags in the first place. What helps with that is having a standardized release process, testing it thoroughly, and not deviating from it. Especially under pressure, such as when releasing a hotfix for a critical bug, it may be tempting to skip checks built into the process. This is risky. Those checks exist for a reason, and under pressure is when humans make the most mistakes. Sticking to an established process is often better than trying to save a few minutes.

When using submodules, use https transport urls.

Git supports two transfer protocols: https and ssh. On GitHub, https requires no authentication for public repositories, but ssh by design always requires authentication, even for public repositories.

When you as a developer clone a public repository using an ssh://-url, you likely have your SSH key loaded and authentication to GitHub is transparent to you. However, when automation such as a CI server tries to clone from an ssh://-url, it typically does not have the appropriate SSH keys loaded, and so it will fail to clone, even if the repository is public and can be cloned from an https://-url.

This matters especially for submodules, because with git clone --recurse-submodules, we do not get to choose which transport to use. The urls are determined by the .gitmodules file in the top-level repository.

Release metadata

When we learn about a new release, for example because our automation picked up a new Git tag, we triage it:

  • Does this release apply to us at all? Is there a change in the software we run?
  • Is it a stable release intended for mainnet, or a pre-release intended for testing?
  • Do we need to update at all? For example, if there is a bugfix in a feature we don’t use, it makes no sense for us to restart our nodes and incur downtime, when our nodes will not be doing anything new.
  • What is the priority? Does this fix a critical bug that impacts the network or our operations? Are assets at risk if we don’t update soon?
  • Is there an associated deadline (for example, for a hard fork)?

To be able to do this triage, it is helpful to publish this metadata together with the release.

Publish metadata about the release in an easily discoverable location.

Examples of easily discoverable locations are the Git tag itself, an associated release page on GitHub, or a dedicated releases page on a website. An example of a location that is not easily discoverable is an invite-only Discord channel where many kinds of announcements are being shared in addition to just release announcements.

Clearly mark breaking changes.

When we update to a new version, we need to know if any additional action is required from us. For example, when command-line flags are renamed or removed, or when the schema of a configuration file changes, the new node software would be unable to start if we don’t update our configuration. To minimize downtime, we would rather learn about such changes before we perform the update. Even when the node is able to start, changes in e.g. metric names or RPC API affect us.

Ideally breaking changes are part of a changelog, clearly highlighted to stand out from ordinary changes. If you don’t keep a changelog, you can include breaking changes in e.g. a GitHub releases page, or in the release announcement itself.

Clearly announce deadlines.

When an update has a deadline, for example for a hard fork, clearly mark the deadline. When possible, include both a date/time and block height, and a URL for where the update is coordinated. Make sure to publish the release far enough ahead of the deadline.

Keep a changelog.

For us node operators, the first thing we wonder when we see a new release is: what changed, how does this affect us? Ideally, we can find that in a changelog.

Do not mistake Git’s commit log for a changelog. The target audience of commit messages are the software engineers working on the project. The target audience of a changelog are the users of the software (us, node operators). Commit messages are typically more detailed and fine-grained than the summary of the changes in a changelog. While we do read through the Git log when needed, we appreciate having a handwritten summary of the changes.

Versioning scheme

We don’t have strong opinions on how you version your software, but please pick one versioning scheme and stick with it.

Use the same number of parts in every version number.

For example, have v1.0.0 and v1.0.1 in the same repository, but do not put v1.0 and v1.0.1 in the same repository. Definitely do not put v1.0 and v1.0.0 in the same repository, as it is confusing which one is supposed to be used. Adding a suffix for release candidates is fine, e.g. v1.5.7-rc.3 and v1.5.7 can happily coexist.

Use consistent suffixes to mark pre-release versions.

For example, publish v1.2.0-beta.1 and later v1.7.0-beta.1, but do not switch to v1.7.0b1 later on.

We have build automation that watches new tags. In most cases we do not run pre-release versions, so we exclude tags that match certain patterns from our update notifications. If you keep changing the naming scheme, then we have to keep adjusting our patterns.
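
For illustration, here is the kind of filter our automation might apply; the exact pattern is hypothetical, but it only works if suffixes stay consistent:

```go
package main

import (
	"fmt"
	"regexp"
)

// preRelease matches the pre-release suffixes we expect, e.g. v1.7.0-rc.3 or
// v1.2.0-beta.1. A tag like v1.7.0b1 would not match, and would needlessly
// page a human to triage it.
var preRelease = regexp.MustCompile(`-(alpha|beta|rc)\.\d+$`)

func main() {
	for _, tag := range []string{"v1.7.0", "v1.7.0-rc.3", "v1.7.0b1"} {
		fmt.Printf("%-12s pre-release: %v\n", tag, preRelease.MatchString(tag))
	}
}
```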

Timing

A big part of solid release engineering is deciding when to release a new version. As a professional node operator we employ people, and most of those people don’t work on weekends or holidays. The majority of network-wide outages happen because of an update, so you want the update to land at a time when as many people as possible can act quickly. While we do have a 24/7 oncall rotation, their job is to deal with emergencies, not routine updates.

Publish a release at least one week before an update deadline.

When an update is mandatory and has a hard deadline (for example, for a hard fork), ensure that the release is ready with ample time before the deadline. We plan most of our work on a weekly schedule. When changes are known ahead of time it’s easy to fit them in and everything runs smoothly. When changes come up last-minute, it ends up being disruptive, and deviating from standard procedures is what causes outages.

Do not release on Fridays.

At least, do not ask people to update on Fridays. Most outages happen because of a change, and while we trust that you extensively test releases before recommending them for mainnet, bugs do slip in. Our 24/7 oncall team is ready to respond in case a release contains a bug, but they still prefer a relaxing uninterrupted weekend over dealing with an outage.

Do not release just before a holiday.

While we have an oncall team to deal with emergencies, we are not operating at full capacity during holidays. In case of a network-wide outage, it will be much harder for you to reach people to coordinate an update or restart, especially when a fraction of node operators are not professionals with a 24/7 oncall team.

Most of our engineers are based in Europe, and we observe more holidays than is common in the US. It is common for people in Europe to be off in the weeks of Christmas and New Year.

Monitoring

As we described previously, we use Prometheus for monitoring and alerting. It is the industry-standard monitoring system, and its metrics format is supported by most software we run.

Exposing metrics is essential for any blockchain project. Without it, the node software is a black box to us, and the only thing we could observe is whether the process is still running, which is not the same as being healthy. We need to know what’s going on inside that process, and the standard way of doing that is through logs and Prometheus metrics.

Prometheus

Expose Prometheus metrics.

To be able to monitor the node software, Prometheus needs a target to scrape. See the Prometheus documentation for how to instrument your application. If your daemon already includes an RPC server, adding a /metrics endpoint there is usually the easiest way to go about it. Alternatively, a dedicated metrics port works fine too.

While the set of metrics is of course application-specific, blockchain networks generally have a concept of the block height. Unless it tracks the finalized part of the chain, the block height can decrease (for example during a reorg), so it is generally a gauge and not a counter.

Expose metrics privately.

We need to scrape metrics internally, but we don’t want to expose confidential information to third parties. It should be possible for the http server that serves the /metrics endpoint to listen on a network interface that is not Internet-exposed.
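
As a minimal sketch of what this looks like in practice (assuming Go and the client_golang library; the metric and application names are illustrative):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Block height can move backwards on a reorg, so we expose it as a gauge, not
// a counter. The namespace prefixes the metric with the application name,
// yielding mynode_block_height.
var blockHeight = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "mynode",
	Name:      "block_height",
	Help:      "Height of the latest block this node has processed.",
})

func main() {
	prometheus.MustRegister(blockHeight)
	blockHeight.Set(1_234_567) // in reality, updated from the node's sync loop

	// Listen on localhost only; in production this would be an internal
	// interface that is not exposed to the Internet.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9184", nil))
}
```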

Ensure that metrics are relevant and named appropriately.

For new projects, of course you only add metrics that measure something relevant. But among projects that fork existing node software, we have encountered software that kept exposing metrics that were no longer meaningful, or that exposed them under the name of the original software. Similar to how clear but incorrect error messages are worse than vague error messages, misleading metrics are more harmful than not having metrics at all. “Maybe the metrics are lying to us” is far down the list of possible causes when troubleshooting.

Respect Prometheus metric and label naming standards.

Prometheus has an official standard for naming metrics and labels. Following the standard ensures that metrics are self-explanatory and easy to use, and enables us to write alerting configuration that is consistent and uniform. In particular:

  • Prefix the metric with the name of your application.
  • Metrics should use base units (bytes and seconds, not kilobytes or milliseconds).
  • Metric names should have a suffix explaining the unit, in plural (_seconds, _bytes).
  • Accumulating counters should end in _total.

If you expose system metrics, provide a way to disable them.

We already run the Prometheus node exporter on our hosts. Exposing that same information from the node software unnecessarily bloats /metrics responses, which puts strain on our bandwidth and storage, and collecting the information can make the /metrics endpoint slow.

Expose the node software version as a metric.

For automating rollouts, but also for monitoring manual rollouts, and observability and troubleshooting in general, it is useful for us to have a way of identifying what version is running at runtime. When you run one instance this is easy to track externally, but when you run a dozen nodes, it’s easy to lose track of which versions run where. Exposing a version metric (with value 1 and the version as a label) is one of the most convenient ways to expose version information.
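
A sketch of how this can look with the same client library as in the earlier example (the metric name and version string are placeholders):

```go
// Package metrics sketches a version info metric: the value is always 1, and
// the interesting data lives in the labels.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var buildInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "mynode",
	Name:      "build_info",
	Help:      "Build information about the running node; the value is always 1.",
}, []string{"version"})

func init() {
	prometheus.MustRegister(buildInfo)
	buildInfo.WithLabelValues("v1.4.2").Set(1)
}
```

The validator identity discussed below can be exposed in the same way, either as an additional label on this metric or as a separate info-style metric.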

Expose the validator identity as a metric.

Similar to having runtime information about the version, when managing multiple nodes, it is useful to know which identity (address or pubkey) runs where. Like the version, a convenient place to expose this is in Prometheus metrics.

Health

Expose an endpoint for health checks.

For automating restarts and failover, and for loadbalancing across RPC nodes, it is useful to have an endpoint where the node software reports its own view on whether it is healthy and in sync with the network. A convenient place to do this is with a /health or /status http endpoint on the RPC interface.

Ideally the application should respond on that endpoint even during the startup phase and report startup progress there.
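
A sketch of what such an endpoint could look like (assuming Go; the sync-status check is a placeholder for the node’s real sync logic):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

// inSync is a placeholder; a real node would derive this from its sync
// machinery, e.g. by comparing its own height against the highest height
// seen in gossip.
var inSync atomic.Bool

func healthHandler(w http.ResponseWriter, r *http.Request) {
	status := http.StatusOK
	if !inSync.Load() {
		// A non-200 status lets load balancers and failover automation take
		// this node out of rotation while it is catching up.
		status = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]bool{"in_sync": inSync.Load()})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```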

On-chain metrics

It is essential to have metrics exposed by the node software, but this can only give us a local view. We need to have a global view as well. For example, a validator may be performing its duties (such as producing blocks, voting, or attesting), but end up in a minority network partition that causes the majority of the network to view the validator as delinquent.

When information about a validator is stored on-chain, there is a single source of truth about whether the validator performed its duties, and that fact becomes finalized through consensus. For example, for networks that have a known leader assigned to every slot, whether the block was produced or not is a property of the chain that all honest nodes agree on. Some networks additionally store heartbeats or consensus votes on-chain.

We need a way to monitor those on-chain events to measure our own performance. This can be built into the node software (so we can run multiple nodes that monitor each other), or it can be an external tool that connects to an RPC node and exposes Prometheus metrics about on-chain events.

Provide a way to monitor on-chain metrics.

Ideally, we would have Prometheus metrics about whether a validator identity has been performing its duties, exposed from an independent place that is not that validator itself. For most networks these exporters are standalone applications, but integrating this into the node software can also work.
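
As a rough sketch of the external-exporter approach (the RPC endpoint and response shape are hypothetical; every network has its own API for duty assignments and produced blocks):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Fraction of recently assigned blocks that our validator actually produced,
// as observed through a third-party RPC node rather than the validator itself.
var producedRatio = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "chainwatch",
	Name:      "blocks_produced_ratio",
	Help:      "Produced blocks divided by assigned blocks over the last window.",
}, []string{"validator"})

func poll(rpcURL, validator string) {
	for {
		// Hypothetical endpoint returning {"assigned": N, "produced": M}.
		resp, err := http.Get(rpcURL + "/validator_stats?id=" + validator)
		if err == nil {
			var stats struct {
				Assigned float64 `json:"assigned"`
				Produced float64 `json:"produced"`
			}
			if json.NewDecoder(resp.Body).Decode(&stats) == nil && stats.Assigned > 0 {
				producedRatio.WithLabelValues(validator).Set(stats.Produced / stats.Assigned)
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}

func main() {
	prometheus.MustRegister(producedRatio)
	go poll("https://rpc.example.com", "validator-pubkey")
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9185", nil))
}
```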

Good monitoring and observability tools are a public good that benefits all validators. Observability is a core requirement for us, but we realize that it may not be top priority for node software authors. We are happy to contribute here, and work with you upstream to improve or develop open source monitoring solutions that benefit the wider ecosystem.

Telemetry

We understand that node software authors need visibility into how their software runs to inform development — that is the reason we are publishing this network handbook in the first place. However, we are subject to legal and compliance requirements, which mean that we cannot always allow software to phone home. In particular, in some cases we are under non-disclosure agreements.

On incentivized testnets we are happy to share telemetry data. In these cases we only operate our own identity, and the risk of telemetry exposing confidential information is low. For mainnets we do not allow telemetry data to be shared.

Ensure telemetry can be disabled.

As described above, there is confidential information that we cannot share for legal and compliance reasons. The easiest way to prevent inadvertently exposing confidential information is to expose as little information as possible.

Troubleshooting

In case of bugs that are difficult to reproduce, we are happy to work with you: we can share relevant information and logs, try patches, etc. Under no circumstance does Chorus One grant access to our infrastructure to third parties. We definitely do not grant SSH access or other forms of remote access. If we did, we would not be able to guarantee the integrity of our infrastructure.