Google Jupiter Data Center Network Fabric – New Way of Building Data Center Network Underlay

Google’s Datacenter Optical Circuit Switches and Jupiter network fabric

Google’s data centers are unlike any other. It seems they have windows like normal houses because as from the last SIGCOMM’22 presentation, they took their SPINE switches and threw them right out of that window.

Google worked on the Micro Electro Mechanical Systems (MEMS) for years in order to build an Optical Circuit Switch (OCS) that would enable dynamic reconfiguration of optical connections between switches in the data center. Optical Circuit Switch enables on-the-fly data center fabric aggregation block switch connections reconfiguration without the need for physical rewiring. And most interestingly, the usage of Software-Defined Networking (SDN) traffic engineering, enables aggregation block switches to be directly connected and completely removes the need for those bulky Spine switches that were connecting aggregation block switches in CLOS topologies.

OCS MEMS mirror array can redirect light from any input port to any output port by slightly moving one of its mirrors and changing dynamically optical connections between aggregation blocks

Spine switch roles are basically replaced with OCS devices for smart, dynamic and direct interconnections of Leafs. SDN is used for BUM traffic handling (Broadcast, unknown-unicast and multicast traffic) that was the other important spine role.

Google paper (Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking) that was presented at the SIGCOMM’22 conference describes how getting rid of Spines, and smart traffic handling with SDN, enabled the Jupiter fabric to get up to 5x higher speed and capacity and 40% reduction in power consumption in relation to similar CLOS topology datacenter fabrics.

Cisco DNA Upgrade Issues – Application Update Stuck

After initiating Cisco DNA Appliance version 2.1.2.4 and starting an upgrade towards 2.2.2.8 in order to get to 2.2.3.5 I got a strange issue where the appliance system update went fine but the switch to 2.2.2.8 was disabled until Application Updates did not finish.

The real issue here was that Application Updates of Cloud Connectivity – Data Hub got stuck on 12% for 4 days without timing out or finishing. Tried several appliance reboots from CIMC which didn’t help.

Below are the steps that helped sort out Application Updates issues with container pods being stuck at the point of pooling images from the cloud and creation on the Cisco DNA appliance. Of course, this was a long troubleshoot and here I listed only maglev CLI commands that were useful.

I SSH into the appliance and started digging the maglev CLI tool.

Cisco Catalyst Stack Upgrade

Well… It will reboot your whole switch stack at once, In case you were wondering. But it has a neat feature of automatic rollback to the previous IOS XE version if something goes south with the newly upgraded switches.

The same goes for non-stacked Cisco Catalyst C9200 and C9300 switches, but the question was, and the answer is hard to find if the stack would reload members sequentially or it would just reload all members at once. The answer is of course the least good option which makes the upgrade impossible without network outage even if other devices are connected to the stack redundantly (to two or more stack members).

The whole procedure is fairly simple:

Switch Upgrade install mode

New Cisco switches are now usually shipped with install mode configuration for software installation. The other (older) bundle mode was simply a boot variable on the switch config that stated where the .bin files is saved on the switch flash and if multiple .bin files which one to load upon switch reboot.

TOP 25 in Cisco IT Blog Awards

It was a year of big changes in every way. I was fortunate enough to be surrounded by great professionals working on huge projects and then even to get the chance to switch to some completely new technologies that I never really worked with before. It was great, it is still very intense and from my perspective, all changes were for the better.

But as with all periods with a lot of action, all those draft articles on this blog’s queue didn’t yield as much new material as I wanted. It was a year of almost no writing but a lot of learning, adapting and real-life collaboration with other fellow engineers in different fields and different places across the world.

Consequently, all that work and collaboration on projects with great tech people made this year’s nomination for Cisco Champion possible again. In the end, even this BLOGs awards nomination and TOP25 ranking was also mostly because of the support from colleagues and friends that voted for this material of mine.

Thanks!

Switch vSphere Enterprise Plus license to vSphere Standard on a NSX-T enabled cluster

This article describes the strange workaround of switching VMware NSX-T enabled cluster from using vSphere Enterprise Plus license to vSphere Standard license with vDS licensed through NSX-T. I really hope that you will not need to go through this as it is quite like bringing the whole environment up from scratch. But if you have two clusters with enough resources it will enable you to do it without downtime.

Environment on which this was tested is vSphere 7.0.2 and NSX-T 3.1.2

NSX-T as a network and security platform enables network functions to be virtualised on your vSphere cluster. The way it does this is to implement additional features of network traffic steering and packaging inside its vSphere Distributed Switch (vDS).

Before NSX-T 3.1.1 the only way to get your cluster equipped with vDS was to have a vSphere Enterprise Plus license. From NSX-T 3.1.1 version onwards, VMware gives you the possibility to use vDS without vSphere Enterprise Plus license and license it using NSX-T license. This enabled users with a standard vSphere license to be able to deploy NSX-T on all editions of vCenter Server and vSphere.

After this started to be possible there were some customers who realised that in some cases the only reason they have vSphere Enterprise Plus license for a specific cluster is to be able to use NSX-T since that was needed in the past. So they decided that they will transfer those Enterprise Plus licenses to some other (new) cluster that needs those licenses for more different features.

Missing good old ‘wr’ command on N9K? let’s bring it back!

Doing a lot on Nexus 9000 series datacenter boxes (N9K) lately? Sure you’re missing the good old ‘wr’ command to save your last startup-config into running-config.

NXOS architecture guys decided that you should be really well concentrated when deciding to save your nice new configuration to survive device reboot and type: N9K_1(config)# copy running-config startup-config. Just typing ‘wr’ into the console would be too nice right? Let’s use the alias configuration and bring that command back to the box.

NSX-T Edge Transport Node Packet Capture

NSX-T v3.0.1 and v3.1.3 were used to try the stuff described below

As always with network engineers, even when working with SDN/SSDC solutions, sooner or later you will be asked to troubleshoot connectivity across your hops. And if working with VMware NSX-T platform, your next-hop for the North-South Datacenter traffic will almost always be NSX-T EDGE Transport Node VM. It will be really useful then to be able to get some packet traces out of that box in order to troubleshoot the traffic issues in detail.

One of the examples would be simple routing or some sort of Loadbalancing traffic that seems not to reach the backend hosts behind NSX-T edge.

On the NSX-T EDGE VM it’s fairly simple to capture traffic directly. It’s possible to get the output out on the console or to save it to the file on the EDGE and then pull it out with SCP.

If you have an EDGE Cluster, normally build out of 2 VMs, first, you need to see on which node the T0 or T1 router you want the traffic to be captured is active.

Let’s say we want to capture traffic on “T0-router” shown in the image below. You can go to that T0 router from the UI and check the High Availability Mode output:

NSX-T EDGE VM Active T0