What is MultiPod?
ACI MultiPod was first designed to enable spreading an ACI Fabric inside a building (into two or more Pods), say across two rooms on different floors, without the need to connect every Leaf in one room to every Spine in the other room. It was a way of simplifying the cabling and everything else that comes with building a fabric spread across a CLOS topology.
MultiPod also saves some Leaf ports, given that the Pod-to-Pod connection through the Multicast-enabled IPN network attaches directly to the Spines.
People soon realized that MultiPod would be a great solution for a dual-site (or more than dual-site) Datacenter, with single management of a single ACI Fabric stretched across two or more locations. The locations need an IP connection short enough to keep RTT latency below 50 msec, and that connection must support Multicast. Not trivial, but not too demanding for most cases.
Enabling all of the above, MultiPod began to be the preferred way of creating these modern Software Defined multi-Datacenter solutions based on Cisco N9K switches.
A Datacenter built this way has a central point of management for both sides and lets you create stretched L2 domains available everywhere. It does so without stretching VLANs the old-fashioned way, by actually sending L2 traffic between sites and risking an L2 broadcast storm meltdown. Instead, all L2 traffic flows between sites as VxLAN-encapsulated unicast or multicast inside the overlay.
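To make the encapsulation idea concrete, here is a minimal Python sketch of the 8-byte VXLAN header (per RFC 7348) that an overlay prepends to the original L2 frame before sending it across the IPN as a UDP packet. The VNI value is an arbitrary example, and ACI in fact uses its own iVXLAN variant, so treat this as an illustration of the concept, not the exact wire format ACI uses.

```python
import struct

# Sketch of the standard VXLAN header (RFC 7348): 8 bytes prepended to
# the original Ethernet frame, carried inside UDP (destination port 4789).
def vxlan_header(vni: int) -> bytes:
    """Flags byte 0x08 marks a valid VNI; the VNI is 24 bits, low byte reserved."""
    flags_reserved = 0x08 << 24            # I-flag set, rest reserved as zero
    vni_field = (vni & 0xFFFFFF) << 8      # 24-bit VNI, trailing reserved byte
    return struct.pack("!II", flags_reserved, vni_field)

# Example with a made-up VNI:
hdr = vxlan_header(0x123456)
assert len(hdr) == 8
assert hdr.hex() == "0800000012345600"
```

Because the original frame travels as the payload of this unicast (or multicast) IP packet, the sites never exchange raw L2 frames, which is what removes the classic stretched-VLAN broadcast storm risk.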
IPN Network Connecting the PODs
The IPN network should be a dedicated link between the two datacenters, preferably built as two redundant dark fiber links taking different physical paths. The configuration can easily be made redundant with redundant Nexus switches on both sides. The Multicast RP that is needed can also be configured redundantly using phantom RPs, so that bidirectional Multicast keeps working if one of the IPN switches fails; while all of them are up, each switch acts as RP for a quarter of the multicast groups.
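As a sketch of the phantom RP idea: every IPN switch points at an RP address that is not configured on any device, but falls inside loopback subnets advertised with different prefix lengths, so the longest prefix wins and a backup switch takes over automatically if the primary's loopback disappears from the routing table. All addresses, interface numbers and the OSPF process name below are made-up examples, not a verified configuration.

```
! IPN switch 1 - primary RP for this group range (longer prefix wins):
interface loopback1
  ip address 192.168.100.1/30
  ip ospf network point-to-point
  ip router ospf IPN area 0.0.0.0
  ip pim sparse-mode

! IPN switch 2 - backup (shorter prefix, less specific route):
interface loopback1
  ip address 192.168.100.1/29
  ip ospf network point-to-point
  ip router ospf IPN area 0.0.0.0
  ip pim sparse-mode

! On all IPN switches - the RP address .2 is the "phantom": it lives in
! both loopback subnets but is assigned to no device:
ip pim rp-address 192.168.100.2 group-list 225.0.0.0/15 bidir
```

Repeating the pattern with different group ranges and rotating which switch holds the longest prefix is what splits RP duty across the IPN switches a quarter each.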
The Multicast-enabled IPN network directly connects the Spines on both sides and carries the EBGP sessions between Spines, which exchange all connected endpoint routes learned on the Spines of one side with their neighbors on the other side. Each POD (site) then knows its locally discovered endpoints from the local Leafs plus all endpoints from the other side, so it can send traffic to them when needed. Multicast is needed so that all BUM traffic (Broadcast, Unknown Unicast and Multicast) can be sent to the other side as Multicast and reach it reliably.
How the IPN is configured is the subject of the next articles, currently in preparation, which will contain detailed configuration examples for bringing the IPN devices up and running without having to rely on the fairly superficial Cisco ACI MultiPod configuration guides.
If you are running ACI MultiPod as a solution for multiple Datacenter locations, which implies that you have one centralized Management point, it is recommended to spread the APIC controller cluster to both sites so that you get more resiliency in case one of the sites or the connection in-between fails.
- Take care that the Multicast-enabled interfaces between the two local IPN switches are directly connected routed ports, not shared (if one exists) with the vPC peer link (that will not work).
- Interfaces from the IPN switches towards the Spines are routed subinterfaces with dot1q tag 4, which is hardcoded in the Spine configuration for ACI MultiPod and MultiSite.
- Making the port a trunk and using VLAN interface 4 will not work, because it would make it possible for two Spines to see each other through the IPN switches, which is prohibited.
- It will also fail because, by default, all Spines share the same MAC address on their IPN-facing interfaces.
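Putting the bullets above together, a spine-facing IPN port ends up looking something like the following sketch. The interface number, addressing, MTU and OSPF process name are illustrative assumptions; only the dot1q tag 4 and the routed-subinterface requirement are fixed by ACI.

```
! Hypothetical IPN-switch port towards a Spine: routed parent interface,
! subinterface with the hardcoded dot1q tag 4.
interface Ethernet1/1
  no switchport
  mtu 9150
  no shutdown

interface Ethernet1/1.4
  description Link-to-Spine
  encapsulation dot1q 4
  mtu 9150
  ip address 172.16.1.1/30
  ip router ospf IPN area 0.0.0.0
  ip pim sparse-mode
  no shutdown
```

The generous MTU is there because the VxLAN-encapsulated traffic between Pods carries extra header overhead on top of the original frames.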
APIC Controller Cluster
Long story short, you need three controllers for working Fabric management, because of the clustering quorum requirements that must be met for the cluster to decide whether its member majority is sufficient to prevent data corruption and configuration loss.
You can't have a two-controller APIC cluster.
In ACI MultiPod, the recommendation is therefore to build a cluster of 3 APIC controllers with one of the members connected to the fabric at the secondary site. You decide which site is the primary one (the preferred site, or the one closer to you, since it is a good idea to manage the Fabric from there) and put two of the three controllers there. The third one is connected to the second POD in the other Datacenter location.
There is also a design with 5 APICs in the cluster, but it is not really a good idea unless your Fabric is huge (>300 Leafs). The reason is the algorithm used to replicate parts of the configuration database, called shards (the idea of sharding is that the configuration database is split into several database units, or shards). With 5 APICs, the shards end up spread unevenly across the controllers, with the real possibility of losing some configuration parts if the wrong site's APICs fail together with the whole site.
The point of the 5 APIC cluster is not robustness: increasing the number of APICs does not improve the resiliency of the cluster, it only adds support for more Leaf nodes in the Fabric. Since we are talking about more than 300 Leafs before a 5 APIC cluster is needed, feel free to deploy only 3 of them, be sure that all shards exist on all three APICs, and you're safe.
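The shard-loss risk can be sketched in a few lines of Python. This is not the real APIC replication algorithm, just a toy model under the one assumption the text gives us: each shard is replicated on exactly 3 controllers. With 3 APICs every shard necessarily lives on all of them; with 5 APICs the replicas spread out, so losing the wrong subset of controllers (for example a whole 3-APIC site) can take every copy of some shards with it.

```python
# Toy model of APIC config-database sharding (illustrative, not Cisco's
# actual placement algorithm): each shard gets 3 replicas, round-robin.
def shard_replicas(num_apics, num_shards=6):
    """Map shard id -> set of APIC ids holding a replica of that shard."""
    return {s: {(s + i) % num_apics for i in range(3)} for s in range(num_shards)}

def lost_shards(replicas, failed):
    """Shards whose every replica sits on a failed APIC (data unrecoverable)."""
    return [s for s, apics in replicas.items() if apics <= failed]

# With 3 APICs, every shard is replicated on all three controllers:
assert all(r == {0, 1, 2} for r in shard_replicas(3).values())

# With 5 APICs, losing a whole 3-APIC site can wipe out some shards:
five = shard_replicas(5)
print(lost_shards(five, failed={0, 1, 2}))  # some shards lose all replicas
```

In the 3-APIC case any single surviving controller still holds every shard, which is exactly the "all shards exist on all three APICs" safety the article recommends.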
Some other APIC sizing data
Sizing data from John Weston's great Cisco Live presentation BRKACI-2003:
- 80 Leafs supported with a 3-node APIC cluster
- 200 Leafs with a 4-node APIC cluster (from ACI release 4.1)
- 300 Leafs with a 5-node APIC cluster
- 400 Leafs with a 7-node APIC cluster (from ACI release 2.2(2e))