I often talk about flaws in routing protocols. Today I want to list a few of the inherent weaknesses in OSPF that frustrate me. So herewith, “Five things I thought of in 10 minutes over coffee (while I have jet lag) on why OSPF is broken”:
- Ineffective load balancing
- Lack of visibility – cannot monitor algorithm status
- Poor path weighting process based on quasi-random vendor preferences
- Limited scaling as neighbor and device counts grow
- Unreliable self-configuration
1. Ineffective Load Balancing
The SPF algorithm calculates a single ‘best’ path cost between two points in a network. It cannot make use of multiple paths with different costs. The outcome is bandwidth on the alternate links that sits unused; it’s literally money spent for nothing.
ECMP load balancing is a hack that only partly addresses this. If the OSPF calculation finds multiple next hops with the same path cost for a given subnet, it can install all of them, and the device forwarding plane may then be able to load balance packets (or sometimes flows, confusingly) over those paths. How the traffic is split is a device- and vendor-specific feature, not something OSPF controls, and any path with a higher cost still goes unused.
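To make the equal-cost case concrete, here is a minimal Python sketch, not any vendor's implementation. It assumes a made-up five-router topology (`TOPOLOGY`) and two hypothetical helpers: `spf_with_ecmp`, a Dijkstra run that keeps every first hop that ties on cost, and `pick_next_hop`, a simple per-flow hash of the kind a forwarding plane might use.

```python
import heapq
import zlib

# Hypothetical topology: link costs between routers (symmetric for simplicity).
TOPOLOGY = {
    "A": {"B": 10, "C": 10, "E": 5},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 10},
    "E": {"A": 5, "D": 30},
    "D": {"B": 10, "C": 10, "E": 30},
}

def spf_with_ecmp(graph, root):
    """Dijkstra from root, keeping every equal-cost first hop per destination.

    Assumes all link costs are positive integers, as OSPF costs are.
    """
    dist = {root: 0}
    next_hops = {root: set()}
    heap = [(0, root)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for neigh, cost in graph[node].items():
            nd = d + cost
            # The first hop is the neighbor itself when leaving the root,
            # otherwise it is inherited from the node we came through.
            hops = {neigh} if node == root else next_hops[node]
            if neigh not in dist or nd < dist[neigh]:
                dist[neigh] = nd
                next_hops[neigh] = set(hops)
                heapq.heappush(heap, (nd, neigh))
            elif nd == dist[neigh]:
                next_hops[neigh] |= hops  # a tie: remember both first hops
    return dist, next_hops

def pick_next_hop(flow, hops):
    """Per-flow hashing (an assumption): the same flow always maps to the same hop."""
    ordered = sorted(hops)
    return ordered[zlib.crc32(repr(flow).encode()) % len(ordered)]

dist, next_hops = spf_with_ecmp(TOPOLOGY, "A")
# D is reachable at cost 20 via both B and C, so both are installed.
# The A-E-D path costs 35 and therefore carries no traffic at all.
flow = ("10.1.1.5", "10.9.9.9", 6, 51512, 443)  # made-up 5-tuple
print(dist["D"], sorted(next_hops["D"]), pick_next_hop(flow, next_hops["D"]))
```

Note that the unequal-cost link through E is never selected, which is the wasted bandwidth described above.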
2. Lack Of Visibility
The operation of the OSPF algorithm is opaque, and it’s difficult to monitor the status of the SPF calculation. For example, here are some operational questions that cannot easily be answered or monitored:
- Is the database complete?
- Is the database too large for the device?
- Is the hardware capable of computing the algorithm in acceptable time?
- Can the device create a neighbor update?
These are some of the factors behind a decision to implement OSPF areas, which improve scaling at the cost of data integrity. Most link-state data is literally discarded at the area border and reduced to a summary, so that the smaller database fits within the CPU and memory limits of the device.
3. Poor Path Weighting
The SPF algorithm requires a cost for every transit between nodes in the graph. The OSPF standard doesn’t specify a common definition for costs, so the process is based on quasi-random vendor preferences. In practice, engineers end up wasting time setting costs by hand.
Oh, yeah, the path costs are static and do not reflect the performance of the path. Factors such as path latency or jitter are not accounted for. Thus the ‘SPF Best Path’ is mathematically correct, but it’s probably not what network users experience in real-world conditions.
4. Limited Scaling
The SPF algorithm isn’t particularly inefficient, although there are better algorithms available using directed graphs. A more substantial problem is neighbor discovery and signalling. For every neighbor, a router must generate OSPF packets from its CPU for hello/status and for database description, update and acknowledgement (e.g. Hello, DBD, LSU, LSAck).
A secondary problem is database synchronization as device counts increase. OSPF must reach a stable state to be usable, so the time taken to propagate a link change record between the two furthest points in the graph can limit scale.
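A rough way to reason about that propagation time is to count the hops between the two furthest routers and multiply by a per-hop delay. The sketch below does this with a breadth-first search. It is a crude model under stated assumptions: the function names are hypothetical, the topologies are invented, and `per_hop_ms` is a placeholder you would have to measure, since real flooding delay depends on pacing timers and CPU load.

```python
from collections import deque

def hops_from(adjacency, start):
    """Breadth-first search: hop count from start to every reachable router."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neigh in adjacency[node]:
            if neigh not in seen:
                seen[neigh] = seen[node] + 1
                queue.append(neigh)
    return seen

def flooding_diameter(adjacency):
    """Largest hop count between any two routers in a connected graph."""
    return max(max(hops_from(adjacency, n).values()) for n in adjacency)

def worst_case_flood_ms(adjacency, per_hop_ms):
    """Crude estimate: diameter times an assumed fixed per-hop delay."""
    return flooding_diameter(adjacency) * per_hop_ms

# Six routers in a chain versus the same six in a ring (both hypothetical).
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(flooding_diameter(chain), flooding_diameter(ring))
print(worst_case_flood_ms(chain, per_hop_ms=50))
```

The same number of devices can have very different worst-case propagation depending on how they are wired, which is one reason device count alone is a poor predictor of where OSPF stops scaling.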
5. Unreliable Self-Configuration
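Designated Routers are not selected on capability or capacity but by an election that compares router priority and then, when priorities tie (as they typically do when left at defaults), the highest Router ID, which is usually derived from an IP address. I believe this was intended to reduce the processing load on devices, because full-mesh adjacencies on a shared segment would have been an issue in 1998. Most people manually configure the DR/BDR on a given segment for predictable operation.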
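The selection rule can be sketched in a few lines of Python. This is a deliberate simplification of the RFC 2328 election: it compares Router Priority first and Router ID second, and it ignores the Backup Designated Router and the fact that a running DR is not pre-empted by a later arrival. The `Router` type, the router names, and the default priority of 1 are assumptions for illustration.

```python
from ipaddress import IPv4Address
from typing import NamedTuple, Optional

class Router(NamedTuple):
    name: str
    router_id: str      # dotted quad, compared as a 32-bit number
    priority: int = 1   # assumed default; 0 means "never become DR"

def elect_dr(routers) -> Optional[Router]:
    """Pick the highest (priority, Router ID) among eligible routers."""
    eligible = [r for r in routers if r.priority > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.priority, int(IPv4Address(r.router_id))))

# With default priorities, the box with the numerically highest ID wins,
# regardless of how capable it is.
segment = [
    Router("big-core", "10.0.0.1"),
    Router("old-branch-box", "192.168.255.1"),
]
print(elect_dr(segment).name)

# Manually raising the priority makes the outcome predictable.
tuned = [
    Router("big-core", "10.0.0.1", priority=255),
    Router("old-branch-box", "192.168.255.1"),
]
print(elect_dr(tuned).name)
```

Nothing in the comparison looks at CPU, memory, or link capacity, which is why the priority usually ends up being set by hand.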
RFC 2328 describes path costs this way: “the cost of a route is described by a single dimensionless metric”. It goes on:
“A cost is associated with the output side of each router interface. This cost is configurable by the system administrator. The lower the cost, the more likely the interface is to be used to forward data traffic. Costs are also associated with the externally derived routing data (e.g., the BGP-learned routes).”
It’s up to each vendor to decide what path costs they assign to each interface. Generally the convention has been to use bandwidth as the basis for selecting a cost. As available bandwidth increased over time, this turned out to be a dumb idea.
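A quick calculation shows why the bandwidth convention aged badly. The sketch below assumes the common vendor formula of a reference bandwidth divided by the interface bandwidth, with a default reference of 100 Mbps. Neither the formula nor the default comes from RFC 2328, and `bandwidth_cost` is a hypothetical helper. The only thing taken from the standard is that an interface cost must fit in 16 bits.

```python
MAX_COST = 65535  # interface cost is a 16-bit field

def bandwidth_cost(interface_bps, reference_bps=100_000_000):
    """Reference bandwidth / interface bandwidth, clamped to 1..65535."""
    return min(MAX_COST, max(1, reference_bps // interface_bps))

links = [
    ("10 Mbps", 10_000_000),
    ("100 Mbps", 100_000_000),
    ("1 Gbps", 1_000_000_000),
    ("10 Gbps", 10_000_000_000),
    ("100 Gbps", 100_000_000_000),
]

for name, bps in links:
    # With the 100 Mbps reference, everything from 100 Mbps up costs 1,
    # so SPF cannot tell a 100 Mbps link from a 100 Gbps one.
    print(name, bandwidth_cost(bps), bandwidth_cost(bps, reference_bps=100_000_000_000))
```

Raising the reference bandwidth restores the distinction between fast links, but it has to be done consistently on every router, and very slow links then hit the 65535 ceiling instead.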
The EtherealMind View
I’ve listed five reasons here and there are many more. OSPF works OK, but it’s not excellent. There were reasons to choose OSPF in the 1990s, but why haven’t we upgraded since then? Why are we stuck with good enough instead of moving forward?
Ultimately, I see SD-WAN as the answer to that question.
RFC 2328: OSPF Version 2 (April 1998) – IETF