Data center networks are the foundation of modern warehouse-scale and cloud computing. The underlying guarantee that tens of thousands of servers can communicate uniformly and arbitrarily with one another at 100s of Gb/s of bandwidth and sub-100us latency has transformed computing and storage.
This model has a simple but profound benefit: adding an incremental storage device or server to a higher-level service results in a proportional increase of service capacity and capabilities. Google’s Jupiter data center network technology allows for this type of scale-out capability to support foundational services such as Search, YouTube and Gmail.
Over the last eight years, we have been integrating wave division multiplexing (WDM) and optical circuit switching (OCS) into Jupiter. Despite decades of conventional wisdom to the contrary, OCS and our Software-Defined Networking (SDN) architecture have enabled new capabilities: incremental network builds using heterogeneous technologies; higher performance, lower latency, and lower power consumption; adaptation to real-time communication patterns; and zero-downtime upgrades.
Jupiter achieves all of this while reducing flow completion time by 10%, improving throughput, incurring 30% less cost, and delivering 50x less downtime than the best alternatives we know of. Our paper, “Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking,” explains how we achieved this feat.
This is a brief overview of the project.
Evolving Jupiter data center networks
In 2015, we demonstrated how Jupiter data center networks scaled to more than 35,000 servers with uniform 40Gb/s server connectivity, supporting more than 1Pb/sec of aggregate bandwidth. Today, Jupiter supports more than 6Pb/sec of datacenter bandwidth. Three ideas enabled this unprecedented performance and scale:
- Software-Defined Networking (SDN) – a logically centralized control plane to program and manage the thousands of switches in the data center network.
- Clos topology is a non-blocking, multistage switching topology made up of smaller-radix switch chips that can scale to arbitrarily large networks (see the sketch after this list).
- Merchant switch silicon is a cost-effective, general-purpose Ethernet switching component for a converged data and storage network.
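To make the Clos scaling property concrete, here is a minimal sketch using the classic three-stage fat-tree formulation rather than Jupiter’s actual parameters: with k-port switch chips, a non-blocking fabric supports k^3/4 hosts.

```python
# Illustrative sketch only: the classic three-stage folded-Clos ("fat-tree")
# formulation, not Jupiter's actual parameters. With k-port switch chips there
# are k pods, each holding k/2 edge and k/2 aggregation switches, plus (k/2)^2
# spine switches, for a non-blocking fabric supporting k^3/4 hosts.

def fat_tree_capacity(k: int) -> dict:
    """Switch and host counts for a non-blocking fat-tree built from k-port chips."""
    assert k % 2 == 0, "switch radix must be even"
    edge = agg = k * (k // 2)        # k pods, k/2 switches per layer per pod
    spine = (k // 2) ** 2            # core/spine switches
    hosts = (k ** 3) // 4            # k/2 hosts per edge switch
    return {"edge": edge, "aggregation": agg, "spine": spine, "hosts": hosts}

if __name__ == "__main__":
    for radix in (32, 64, 128):
        print(radix, fat_tree_capacity(radix))
```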
Jupiter’s architectural approach, based on these three pillars, supported a major shift in distributed systems architecture and set the standard for how the industry builds and manages data center networks.
Two main challenges remained for hyperscale data centers. First, data center networks must be deployed at the scale of an entire building, with 40MW or more of infrastructure. Moreover, the servers and storage devices in the building are constantly evolving, for example moving from 40Gb/s to 100Gb/s to 200Gb/s, and now 400Gb/s native interconnects. The data center network must adapt dynamically to keep pace with the new elements connecting to it.
As illustrated below, Clos topologies require a spine layer that uniformly supports all of the devices beneath it. Deploying a building-scale network therefore meant deploying a large spine layer up front, running at the speed of the current device generation. Because Clos topologies require all-to-all fanout from the aggregation blocks to the spine, adding to the spine incrementally would mean rewiring the entire data center, and supporting faster devices would require replacing the entire spine layer outright. Neither is feasible given the hundreds of racks housing spine switches and the tens of thousands of fiber pairs running across the building.
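The following toy calculation, with hypothetical uplink and spine-block counts rather than Jupiter’s, illustrates why growing the spine forces a rewiring: each aggregation block stripes its uplinks evenly across every spine block, so changing the spine count changes the striping of every existing fiber bundle.

```python
# Hypothetical numbers, for illustration: in a Clos fabric, each aggregation
# block stripes its uplinks evenly across every spine block, so changing the
# number of spine blocks changes the striping of every existing fiber bundle.

def striping(uplinks_per_agg_block: int, num_spine_blocks: int) -> list:
    """Number of links from one aggregation block to each spine block."""
    base, extra = divmod(uplinks_per_agg_block, num_spine_blocks)
    return [base + 1 if i < extra else base for i in range(num_spine_blocks)]

before = striping(uplinks_per_agg_block=512, num_spine_blocks=64)  # 8 links to each spine block
after = striping(uplinks_per_agg_block=512, num_spine_blocks=80)   # 6 or 7 links to each spine block

# Every aggregation block's bundle to every pre-existing spine block changes,
# which at building scale means re-pulling tens of thousands of fiber pairs.
print(before[:4], after[:4])
```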
The ideal data center network would instead support heterogeneous network elements in a “pay-as-you-grow” model: network elements are added as needed, and the latest technology is supported incrementally. It would extend the same scale-out model the network already provides for servers and storage, allowing incremental additions of network capacity with native interoperability and increased capacity for all devices.
Second, uniform building-scale bandwidth is a strength, but it becomes limiting once you consider that data center networks are inherently multi-tenant and constantly subject to maintenance and localized faults. A single data center network hosts hundreds of services, each with its own priority and its own sensitivity to bandwidth and latency variation. Serving web search results in real time might require strict bandwidth allocation and real-time latency guarantees, while a batch analytics job may have more flexible short-term bandwidth requirements. The data center network should therefore assign bandwidth and pathing to services based on real-time communication patterns and application-aware optimization. And if 10% of network capacity must be temporarily removed for an upgrade, that 10% should not necessarily be spread evenly across all tenants; it should be apportioned according to individual application requirements and priority, as in the sketch below.
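As a purely illustrative example of priority-aware apportioning (the tenant names, weights, and policy below are invented, not Google’s actual scheme), a drained slice of capacity could be split in inverse proportion to priority weights:

```python
# Invented tenants, weights, and policy, purely to illustrate the idea of
# apportioning a temporary capacity reduction by priority instead of uniformly.

def apportion_drain(tenants, weights, drained_gbps):
    """Split a capacity reduction across tenants in inverse proportion to priority weight."""
    inverse = {t: 1.0 / weights[t] for t in tenants}
    total = sum(inverse.values())
    return {t: drained_gbps * inverse[t] / total for t in tenants}

# Latency-sensitive search traffic absorbs far less of the drain than batch analytics.
cuts = apportion_drain(
    tenants=["web_search", "batch_analytics"],
    weights={"web_search": 10.0, "batch_analytics": 1.0},
    drained_gbps=100.0,  # the 10% of capacity temporarily removed for the upgrade
)
print(cuts)  # {'web_search': ~9.1, 'batch_analytics': ~90.9} in Gb/s
```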
These remaining challenges were hard to address because data center networks had long been built around hierarchical topologies deployed at massive physical scale, which cannot support incremental heterogeneity or dynamic adaptation. We broke this pattern by introducing Optical Circuit Switching into the Jupiter architecture. An optical circuit switch (depicted below) dynamically maps an optical fiber input port to an output port through two sets of micro-electromechanical systems (MEMS) mirrors that rotate in two dimensions to create arbitrary port-to-port mappings.
The insight was that arbitrary logical topologies for data center networks could be created by inserting an OCS intermediation layer between the data center packet switches, as shown below.
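To make the idea concrete, the toy model below (names and structure are illustrative, not the Apollo OCS interface) treats an OCS as a programmable one-to-one port mapping; reprogramming the cross-connects changes the logical topology seen by the packet switches without touching any physical fiber.

```python
# Toy model, not the Apollo OCS interface: an OCS is a programmable one-to-one
# mapping of fiber ports. Re-pointing the MEMS mirrors changes the logical
# topology seen by the packet switches without touching any physical fiber.

from typing import Dict, Tuple

Port = Tuple[str, int]  # (aggregation block name, port index)

class OpticalCircuitSwitch:
    def __init__(self) -> None:
        self.cross_connects: Dict[Port, Port] = {}

    def connect(self, a: Port, b: Port) -> None:
        """Rotate the mirrors so that ports a and b are optically joined."""
        self.cross_connects[a] = b
        self.cross_connects[b] = a

ocs = OpticalCircuitSwitch()
# A small logical mesh among three aggregation blocks:
ocs.connect(("agg_block_A", 0), ("agg_block_B", 0))
ocs.connect(("agg_block_B", 1), ("agg_block_C", 0))
ocs.connect(("agg_block_C", 1), ("agg_block_A", 1))
# Later, the same fibers can be re-mapped (e.g., to add A<->C capacity)
# simply by reprogramming the cross-connects.
```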
Doing so required us to develop OCS and native WDM transceivers with a level of scale, manufacturability, and reliability that was previously unimaginable. While academic research had explored the benefits of optical switching, conventional wisdom held that OCS technology was not commercially viable. Over multiple years we designed and built Apollo OCS, which is now the foundation for most of our data center networks.
OCS has one major advantage: it involves no packet routing or header parsing. An OCS simply reflects light from one port to another with great precision and very little loss. The light is generated through electro-optical conversion in WDM transceivers, which are already required to transmit data reliably and efficiently across data center buildings. The OCS layer thus becomes part of the building infrastructure: it works at any data rate or wavelength and requires no upgrades even as the electrical infrastructure moves from 40Gb/s to 100Gb/s to 200Gb/s transmission and encoding speeds, and beyond.
We used the OCS layer to eliminate the spine layer from our data center networks entirely, connecting heterogeneous aggregation blocks in a direct mesh and moving beyond Clos topologies in the data center for the first time. We developed dynamic logical topologies that reflect both physical capacity and application communication patterns. Reconfiguring the logical connectivity of switches in our network is now standard procedure, letting us change the topology dynamically with no application-visible impact. We achieve this by coordinating link drains and routing software reconfiguration, relying on our Orion Software-Defined Networking control plane to seamlessly orchestrate thousands of dependent and independent operations.
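A simplified sketch of that sequence follows; the controller interface shown is hypothetical and only mirrors the drain, reprogram, and undrain steps described above, not Orion’s actual API.

```python
# Simplified sketch with a hypothetical controller interface; it only mirrors
# the drain -> reprogram -> undrain sequence described above, not Orion's API.

def reconfigure_topology(controller, ocs, affected_links, new_cross_connects):
    """Change logical topology with no application-visible impact."""
    for link in affected_links:
        controller.drain(link)                  # steer traffic off the link
        controller.wait_until_traffic_zero(link)

    for port_a, port_b in new_cross_connects:
        ocs.connect(port_a, port_b)             # re-point the MEMS mirrors

    controller.recompute_routes()               # routing learns the new topology
    for link in affected_links:
        controller.undrain(link)                # return the links to service
```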
A particular challenge was moving beyond shortest-path routing over mesh topologies while retaining the robustness and performance our data centers require. Clos topologies have the convenient property that there are many paths through the network, all of the same length and link capacity, so oblivious packet distribution, or Valiant load balancing, delivers sufficient performance. In Jupiter, we instead implement dynamic traffic engineering in our SDN control plane, using techniques pioneered in Google’s WAN: we split traffic among multiple paths while monitoring link capacity, real-time communication patterns, and individual application priority, as sketched below.
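The sketch below shows the basic contrast with invented numbers: a capacity-weighted split across unequal mesh paths versus the even split that Valiant load balancing would apply.

```python
# Illustrative numbers only: capacity-aware splitting over unequal mesh paths,
# versus the even split that Valiant (oblivious) load balancing would apply.

def weighted_split(path_capacity_gbps):
    """Fraction of traffic to place on each path, proportional to capacity."""
    total = sum(path_capacity_gbps.values())
    return {path: cap / total for path, cap in path_capacity_gbps.items()}

# One direct path and two transit paths with different residual capacity:
paths = {"A->B": 800.0, "A->C->B": 400.0, "A->D->B": 200.0}
print(weighted_split(paths))  # {'A->B': ~0.57, 'A->C->B': ~0.29, 'A->D->B': ~0.14}

# Valiant load balancing would place 1/3 of traffic on each path, which only
# works well when every path has identical length and capacity, as in a Clos.
```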
Taken together, our work to re-architect the Jupiter data center networks that power Google’s warehouse-scale computers has introduced a number of industry firsts, including:
- Optical Circuit Switches as the interoperability point for building large networks, seamlessly supporting heterogeneous technologies, upgrades, and evolving service requirements.
- Direct mesh-based topologies for better performance, lower latency, and lower power consumption.
- Real-time topology and traffic engineering that adapts network connectivity and pathing to match communication patterns and application priority, while observing maintenance and failures.
- Hitless network upgrades with localized addition or removal of capacity, eliminating the need for costly and tedious “all services out” upgrades.
While the technology itself is impressive, our real goal is the performance, efficiency, and reliability that together enable the most demanding distributed services, from Google’s own products to Google Cloud. Our Jupiter network uses 40% less power, incurs 30% less cost, and delivers 50x less downtime than the best alternatives we are aware of, all while reducing flow completion time by 10% and improving throughput. We are proud to present the details of this technological feat today at SIGCOMM and look forward to discussing our findings with the community.