This describes the effort that eventually led to libzt.
TL;DR: If you’re going to put the network in user space, then put the network in user space.
For the past six months we’ve been heads-down at ZeroTier, completely buried in code. We’ve been working on several things: Android and iOS versions of the ZeroTier One network endpoint service (Android is out, iOS coming soon), a new web UI that is now live for ZeroTier hosted networks and will soon be available for on-site enterprise use as well, and a piece of somewhat more radical technology we call Network Containers.
We’ve been at HashiConf in Portland this week. Network Containers isn’t quite ready for a true release yet, but all the talk of multi-everything agile deployment around here motivated us to put together an announcement and a preview so users can get a taste of what’s in store.
We’ve watched the Docker networking ecosystem evolve for the past two or more years. There are many ways to connect containers, but as near as we can tell all of them can be divided into two groups: user-space overlays that use tun/tap or pcap to create or emulate a virtual network port, and kernel-mode solutions like VXLAN and Open vSwitch that must be configured on the Docker host itself. The former are flexible and can live inside the container, but they still often require elevated privileges and suffer from performance problems. The latter are faster but far less convenient to deploy, requiring special configuration of the container host and root access.
It’s been possible to use ZeroTier One in a Docker container since it was released, but only by launching with options like “--device=/dev/net/tun --cap-add=NET_ADMIN”. That gives it many of the same downsides as other user-mode network overlays. We wanted to do something new, something specifically designed not only for how containers are used today but for how they’ll probably be used in the future.
Cattle Should Live in Pens
A popular phrase among container-happy devops folks today is “cattle, not pets.” If containers are the “cattle” approach to infrastructure then container hosts should be like generic cattle pens, not doggie beds with names embroidered on them. They should be pieces of metal that host “stuff” with no special application-specific configuration at all.
All kernel-mode networking solutions require kernel-level configuration. This must be performed on the host as ‘root’, and can’t (easily) be shipped out with containers. It also means if a host is connected to networks X and Y it can’t host containers that need networks A and Z, introducing additional constraints for resource allocation that promote fragmentation and bin-packing problems.
We wanted our container networking solution to be contained in the container. That means no kernel, no drivers, no root, and no host configuration requirements.
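The placement constraint above can be illustrated with a small Python sketch. The host names, network labels, and the `eligible_hosts` helper are all hypothetical, invented here for illustration; none of this is part of ZeroTier or of any real scheduler.

```python
def eligible_hosts(required_networks, host_networks, in_container_networking=False):
    """Return the sorted names of hosts that could run a container.

    With kernel-mode networking, a host qualifies only if it has already been
    configured (as root) for every network the container needs. When the
    network lives inside the container, host configuration is irrelevant.
    """
    if in_container_networking:
        return sorted(host_networks)
    return sorted(
        host
        for host, networks in host_networks.items()
        if required_networks <= networks  # subset test: all needed networks present
    )


# Hypothetical cluster: each host has been set up for a fixed set of networks.
hosts = {
    "host-1": {"X", "Y"},
    "host-2": {"A", "Y"},
    "host-3": {"A", "Z"},
}

needs = {"A", "Z"}
print(eligible_hosts(needs, hosts))                                # ['host-3']
print(eligible_hosts(needs, hosts, in_container_networking=True))  # ['host-1', 'host-2', 'host-3']
```

With host-level configuration, only one of the three hosts can take the container. Once the network ships inside the container, the scheduler is free to pick any of them.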
The Double-Trip Problem
User-space network virtualization and VPN software usually presents itself to the system through a virtual network port (tun/tap), or by using libpcap to effectively emulate one by capturing and injecting packets on an existing real or dummy network device. The former is the approach used by ZeroTier One and by most VPN software, while the latter is used (last we checked) by Weave and perhaps a few others. The pcap “hack” has the advantage of eliminating the need for special container launch arguments and elevated permissions, but otherwise suffers from the same drawbacks as tun/tap.
User-mode network overlays that still rely on the kernel to perform TCP/IP encapsulation and other core network functions require your data to make an epic journey, passing through the kernel’s rather large and complex network stack twice. We call this the double-trip problem.
First, data exits the application by way of the socket API and enters the kernel’s TCP/IP stack. Then after being encapsulated there it’s sent to the tun/tap port or captured via pcap. Next, it enters the network virtualization service where it is further processed, encapsulated, encrypted, etc. Then the overlay-encapsulated or VPN traffic (usually UDP) must enter the kernel again, where it once again must traverse iptables, possible NAT mapping, and other filters and queues. Finally it exits the kernel by way of the network card driver and goes over the wire. This imposes two additional kernel/user mode context switches as well as several memory copy, handoff, and queueing operations.
The double-trip problem makes user-mode network overlays inherently slower than solutions that live in the kernel. But kernel-mode solutions are inflexible. They require access to the metal and root privileges, two things that aren’t convenient in any world and aren’t practical at all in the coming world of multi-tenant container hosting.
We think user-mode overlays that use tun/tap or pcap occupy a kind of “uncanny valley” between kernel and user mode: by relying on a kernel-mode virtual port they inherit some of the kernel’s inflexibility and limitation, but lose its performance. That’s okay for VPNs and end-user access to virtual networks, but for high performance enterprise container use we wanted something better. Network Containers is an attempt to escape this uncanny valley not by going back to the kernel but by moving the other direction and going all-in on user-mode. We’ve taken our core ZeroTier virtual network endpoint and coupled it directly to a lightweight user-mode TCP/IP stack.
This alternative network path is presented to applications via a special dynamic library that intercepts calls to the Linux socket API. This is the same strategy used by proxy wrappers like socksify and tsocks and requires no changes to applications or recompilation. It’s also used by high-performance kernel-bypassing bare metal network stacks that are deployed in areas with minimum latency requirements like high frequency trading and industrial process control. It’s difficult to get right but so far we’ve tested Apache, NodeJS, Java, Go binaries, sshd, proftpd, nginx, and numerous other applications with considerable success.
You might be thinking about edge cases, and so are we. Socket APIs are crufty and in some cases poorly specified. It’s likely that even a well-tested intercept library will clash with someone’s network I/O code somewhere. The good news is that containers come to the rescue here by making it possible to test a specific configuration and then ship with confidence. Edge case issues are much less likely in a well-tested single-purpose microservice container running a fixed snapshot of software than in a heterogeneous, constantly-shifting environment.
We believe this approach could combine the convenience of in-container user-mode networking with the performance of kernel-based solutions. In addition to eliminating quite a bit of context switch, system call, and memory copy overhead, a private TCP/IP stack per container has the potential to offer throughput advantages on many-core host servers. Since each container has its own stack, a host running sixteen containers effectively has sixteen completely independent TCP threads. Other advantages include the potential to handle huge numbers of TCP connections per container by liberating running applications from kernel-related TCP scaling constraints. With shared memory IPC we believe many millions of TCP connections per service are feasible. Indeed, bare metal user-mode network stacks have demonstrated this in other use cases.
Here’s a comparison of the path data takes in the Network Containers world versus conventional tun/tap or pcap based network overlays. The application sees the virtual network, while the kernel sees only encapsulated packets.
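The two paths can also be written down as data. The Python sketch below is only a summary of the stages named in this post, not a measurement or a model of real overhead; the stage descriptions, the list names, and the `kernel_stack_trips` helper are assumptions made for illustration.

```python
# Each stage is (description, is_trip_through_kernel_network_stack).
# True marks a full pass through the kernel's network stack.
TUN_TAP_OVERLAY_PATH = [
    ("application calls the socket API", False),
    ("kernel TCP/IP stack handles the application's traffic", True),
    ("tun/tap port or pcap capture hands packets to user space", False),
    ("overlay service encapsulates and encrypts", False),
    ("kernel network stack again: iptables, NAT, filters, queues", True),
    ("network card driver puts encapsulated packets on the wire", False),
]

NETWORK_CONTAINERS_PATH = [
    ("application calls the socket API", False),
    ("intercept library redirects the call", False),
    ("user-mode TCP/IP stack coupled to the ZeroTier endpoint", False),
    ("kernel network stack carries the encapsulated packets", True),
    ("network card driver puts encapsulated packets on the wire", False),
]


def kernel_stack_trips(path):
    """Count how many stages pass through the kernel's network stack."""
    return sum(1 for _description, is_trip in path if is_trip)


print(kernel_stack_trips(TUN_TAP_OVERLAY_PATH))     # 2
print(kernel_stack_trips(NETWORK_CONTAINERS_PATH))  # 1
```

The count of two versus one is the double-trip problem in miniature: the overlay path visits the kernel stack once for the application’s TCP/IP traffic and again for the encapsulated packets, while the user-mode path visits it only for the latter.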
Running the Preview Demo
Network Containers is still under heavy development. We have a lot of polish, stability testing, and performance tuning to do before posting an alpha release for people to actually try with their own deployments. But to give you a taste, we’ve created a Docker container image that contains a pre-built and pre-configured instance. You can spin it up on any Docker host that allows containers to access the Internet and test it from any device in the world with ZeroTier One installed.
Don’t expect it to work perfectly, and don’t expect high performance. While we believe Network Containers could approach or even equal the performance of kernel-mode solutions like VXLAN+IPSec (but without the hassle), so far development has focused on stability and supporting a wide range of application software and we haven’t done much of any performance tuning. This build is also a debug build with a lot of expensive tracing enabled.
Here are the steps if you want to give it a try:
Step 1: If you don’t have it, download ZeroTier One and install it on whatever device you want to use to access the test container. This could be your laptop, a scratch VM, etc.
Step 2: Join 8056c2e21c000001 (Earth), an open public network that we often use for testing. (If you don’t want to stay there don’t worry. Leaving a network is as easy as joining one. Just leave Earth when you’re done.) The Network Containers demo is pre-configured to join Earth at container start.
Step 3: Run the demo!
The container will output something like this:
While you’re waiting for the container to start and to print out its Earth IP address, try pinging earth.zerotier.net (188.8.131.52) from the host running ZeroTier One to test your connectivity. Joining a network usually takes less than 30 seconds, but might take longer if you’re behind a highly restrictive firewall or on a slow Internet connection. If you can ping 184.108.40.206, you’re online.
Once it’s up and running try pinging it and fetching the web page it hosts. In most cases it’ll be online in under 30 seconds, but may take a bit longer.
Next Steps, and Beyond
We’re planning to ship an alpha version of Network Containers that you can package and deploy yourself in the next few months. We’re also planning an integration with Docker’s libnetwork API, which will allow it to be launched without modifying the container image. In the end it will be possible to use Network Containers in two different ways: by embedding it into the container image itself so that no special launch options are needed, or by using it as a libnetwork plugin to network-containerize unmodified Docker images.
Docker’s security model isn’t quite ready for multi-tenancy but it’s coming, and when it does we’ll see large-scale bare metal multi-tenant container hosts that will offer compute as a pure commodity. You’ll be able to run containers anywhere on any provider with a single command and manage them at scale using solutions like HashiCorp’s Terraform, Atlas, and Nomad. The world will become one data center, and we’re working to provide a simple plug-and-play VLAN solution at global scale.
Hat tip to Joseph Henry, who has been lead developer on this particular project. A huge number of commits from him will be merged shortly!