BFD - Sub-second Failure Detection

If there’s no BFD

If you have two routers directly connected, like here:

In this case, if one router goes down completely, the other one removes the routes it learned from it. The link goes to the down state and the routing protocol adjacency disappears.

If two routers are connected through an L2 device (a switch), like down here:

In this case, when one of them goes down, it will not take down the interface of the L3 neighbour (the other router), because the switch still works fine and keeps the other half of the link up:

If that’s the case, you depend on routing protocol timers, which are the failure detection mechanism implemented in the routing protocol itself. Those timers need to expire before the router adjacency goes down and convergence to some other path towards the destinations can start.

Routing protocol timers are not a bad mechanism, and they can be tuned to detect the failure faster.

EIGRP hello and hold timers can be tuned to get you to around 1 second for failure detection and the start of convergence. With IS-IS and OSPF you can enable the fast hello option, which can also get you to 1 second for failure detection.

You can probably guess by now that, to speed things up further, the BFD from the title is the best solution.

What is BFD?

To make failure detection fast, like really fast, like sub-second fast, you should use BFD. BFD (Bidirectional Forwarding Detection) is a separate protocol for detecting communication failures. It uses small, low-overhead probe packets (like smallish hello packets) that are sent many times per second to get you to sub-second detection of communication failure.

Those probes can be configured differently, but they are usually configured so that a packet is sent every 50 milliseconds, which is 20 packets per second. The failure is then detected if 5 of those are missed in a row. That brings us to failure detection in 250 ms, or 1/4 of a second. Cool.
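The arithmetic above can be checked with a couple of lines of Python. This is only a sketch of the calculation from the paragraph, and the function names are made up for this example:

```python
def packets_per_second(interval_ms: int) -> float:
    """How many BFD packets are sent per second for a given send interval."""
    return 1000 / interval_ms


def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Time until a failure is declared: interval times missed-packet count."""
    return interval_ms * multiplier


print(packets_per_second(50))    # 20.0 packets per second
print(detection_time_ms(50, 5))  # 250 ms
```

Raising the interval or the multiplier trades detection speed for tolerance of short packet loss bursts.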

If there’s BFD configured

Speed is not the only advantage of BFD. It gives you the chance to detect failures at the sub-second level, but also:

Messing with routing protocol timers can get you into trouble with neighbour adjacencies if you forget to change them on all routers, and similar mistakes. Changing the default routing protocol timer values is generally considered a bad idea.

BFD is a separate protocol; it is not part of any one routing protocol's code. This lets it work independently on the link, and multiple other protocols can use that one BFD service to get notified of a link failure. The protocols that can be configured to use BFD notifications are BGP, EIGRP, OSPF, HSRP, MPLS, LDP and probably some more.

BFD works mostly in the data plane, so it does not add significant overhead to the router CPU and does not affect the control plane as much as aggressive routing protocol timers do.

Modifying routing protocol timers to send hellos very often uses much more CPU, because every one of those hellos is generated and sent from the control plane.

Because some parts of BFD can be distributed to the data plane, it can be less CPU-intensive than the reduced EIGRP, IS-IS, and OSPF timers, which exist wholly at the control plane.

Because most of the work is done in the data plane, it makes sense that CEF must be enabled to push things off the CPU and into the silicon…

CEF must be enabled on the router if you want to use BFD.

Some other details on how BFD works

BFD is the fastest way to detect a failure in the forwarding path between routers that have a protocol adjacency.

Timers can be tuned, and the minimum and preferred value for both the interval and min_rx timers is 50 milliseconds.

You need to configure BFD on both sides (on both routers), because the only supported option (at least on Cisco equipment) is asynchronous mode. In this mode both sides send control packets, and each side maintains the BFD neighbour session for as long as it keeps receiving them.
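Because both sides send packets and both sides have their own timers, the two routers do not have to be configured with identical values. As far as I understand the asynchronous mode rules in RFC 5880, each side sends at the slower of its own desired rate and the rate the peer is willing to receive, and it declares the peer down based on the peer's multiplier. A minimal sketch of that calculation, with the class and function names invented for this example, values in milliseconds for readability (the protocol itself carries microseconds), and jitter ignored:

```python
from dataclasses import dataclass


@dataclass
class BfdTimers:
    desired_min_tx_ms: int   # "interval" in the IOS bfd command
    required_min_rx_ms: int  # "min_rx" in the IOS bfd command
    detect_mult: int         # "multiplier" in the IOS bfd command


def tx_interval_ms(local: BfdTimers, remote: BfdTimers) -> int:
    """Interval at which the local router actually sends to the remote one."""
    return max(local.desired_min_tx_ms, remote.required_min_rx_ms)


def detection_time_ms(local: BfdTimers, remote: BfdTimers) -> int:
    """How long the local router waits before declaring the remote one down."""
    remote_tx = max(local.required_min_rx_ms, remote.desired_min_tx_ms)
    return remote.detect_mult * remote_tx


r1 = BfdTimers(desired_min_tx_ms=50, required_min_rx_ms=50, detect_mult=5)
r2 = BfdTimers(desired_min_tx_ms=100, required_min_rx_ms=100, detect_mult=3)

print(tx_interval_ms(r1, r2))     # 100: R1 slows down to what R2 accepts
print(detection_time_ms(r1, r2))  # 300: R2's multiplier 3 * 100 ms
print(detection_time_ms(r2, r1))  # 500: R1's multiplier 5 * 100 ms
```

With the same 50/50/5 values on both routers, as in the configuration in this post, both directions come out at 250 ms.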

When a communication error occurs on the link between two BFD enabled routers:

The BFD session is torn down after 250 ms if the timers described above are used (5 × 50 ms)
BFD notifies the local OSPF process (or other protocol process) that the BFD neighbor is no longer there
The local OSPF process then deletes the OSPF neighbor relationship
The router starts to converge if there is another existing path.
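The sequence above, together with the idea that several protocols share one BFD service, can be modelled in a few lines. This is a toy simulation and not how any router implements it; the `BfdSession` class, its methods and the neighbor address 10.0.0.2 are all hypothetical:

```python
from typing import Callable, Dict, List, Optional


class BfdSession:
    """Toy model of one BFD session that several protocols subscribe to."""

    def __init__(self, neighbor: str, interval_ms: int, multiplier: int) -> None:
        self.neighbor = neighbor
        self.detect_ms = interval_ms * multiplier  # 50 ms * 5 = 250 ms
        self.last_rx_ms = 0
        self.up = True
        self.down_at_ms: Optional[int] = None
        self._clients: Dict[str, Callable[[str], None]] = {}

    def register(self, protocol: str, on_down: Callable[[str], None]) -> None:
        """A protocol (OSPF, BGP, ...) asks to be told when the neighbor is lost."""
        self._clients[protocol] = on_down

    def packet_received(self, now_ms: int) -> None:
        self.last_rx_ms = now_ms

    def tick(self, now_ms: int) -> None:
        """Check the detection timer; notify every client once on expiry."""
        if self.up and now_ms - self.last_rx_ms >= self.detect_ms:
            self.up = False
            self.down_at_ms = now_ms
            for on_down in self._clients.values():
                on_down(self.neighbor)


events: List[str] = []
session = BfdSession("10.0.0.2", interval_ms=50, multiplier=5)
session.register("OSPF", lambda n: events.append(f"OSPF: neighbor {n} deleted"))
session.register("BGP", lambda n: events.append(f"BGP: neighbor {n} deleted"))

for t in (0, 50, 100):           # packets arrive normally, then the peer dies
    session.packet_received(t)
for t in range(150, 400, 50):    # timer checks at 150, 200, 250, 300, 350 ms
    session.tick(t)

print(session.down_at_ms)  # 350, which is 250 ms after the last packet
print(events)
```

Both registered protocols are notified at the same moment, which is the point of having one shared BFD service instead of separate fast timers in each protocol.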

Configuration

You configure the BFD service and its timers at the interface level. Then, for every protocol that will use that BFD service for notification and failure detection, you enable the BFD cooperation inside the routing protocol configuration.

BFD works at the interface level, so it works only for directly connected L3 neighbours. They can be interconnected through a switch, but BFD neighbours need to be one IP hop away from each other.

First you configure your BFD process on the interface level:

R1(config)#interface Ten 1/1
R1(config-if)# bfd interval 50 min_rx 50 multiplier 5

R2(config)#interface Ten 1/1
R2(config-if)# bfd interval 50 min_rx 50 multiplier 5

After that, you can use it to speed up OSPF or some other protocol by enabling BFD inside that protocol:

R1(config)#router ospf 1
R1(config-router)# bfd all-interfaces

R2(config)#router ospf 1 
R2(config-router)# bfd all-interfaces

How to check?

R1# show bfd neighbors details

...will show you how your BFD session is doing.
