Last week I was looking into reliability and realized how big the field of Reliability Engineering is. So I thought, why not share what I have learned so far, in the hope that experts in this field will read this and correct any mistakes in my understanding.

In this blog, we will start with some basic concepts, touch on Reliability Mathematics, and follow up with some examples in part II. A basic background in probability and calculus will come in handy, but if you don’t have one, skipping the math is always an option.

A general caution about Metrics

Metrics are quantitative measures of system or network behavior that let us characterize that behavior and make decisions based on it. But before we start using a metric, it’s important to understand how it was determined and what assumptions were made in calculating it. Those assumptions may not apply to your situation, which would make the metric useless. Knowing them lets you judge how much **bullshite** a given metric is.

#### Intro to Reliability Engineering

**Design Life:** Design life is the intended period of use of a component/system which is expected to be failure free.

**Failure:** We consider a component/system to have failed if it is no longer able to perform its required function as specified. For instance, a light bulb no longer glows or a router stops routing.

**Failure Rate:**

A failure rate defines failure frequency in terms of failures per unit time, for example the percentage of failures per 1000 hours. We can estimate the failure rate from the number of units that failed and the total operating hours of all the units, including the failed ones.

Failure rate is often represented by the symbol “ƛ” (the Greek letter lambda, usually typeset as λ), and this is what we will follow as well. Let’s look at a few examples.

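Example 1: Assume that 50 units operated for 1000 hours on test and 2 of them failed. Counting every unit, including the failed ones, for the full 1000 hours, the failure rate (ƛ) is

$$\lambda = \frac{2}{50 \times 1000\ \text{hours}} = 4 \times 10^{-5}\ \text{failures per hour}$$

which is the same as 4% per 1000 hours.

Example 2: Let’s say that 10 components failed from a sample of 2000 that were used for 120 days. The failure rate can be estimated as

$$\lambda = \frac{10}{2000 \times 120\ \text{days}} \approx 4.17 \times 10^{-5}\ \text{failures per day} \approx 1.74 \times 10^{-6}\ \text{failures per hour}$$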
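The two estimates above can be reproduced with a few lines of Python. This is only a sketch: the function name `failure_rate` is mine, and it assumes every unit (failed or not) is credited with the full operating time, since the examples do not say when the failures happened.

```python
def failure_rate(failures: int, units: int, hours_per_unit: float) -> float:
    """Estimate failures per unit-hour.

    Assumption: every unit, including the failed ones, is counted as
    having run for the full test duration.
    """
    total_unit_hours = units * hours_per_unit
    return failures / total_unit_hours


# Example 1: 50 units, 1000 hours each, 2 failures
rate_1 = failure_rate(failures=2, units=50, hours_per_unit=1000)

# Example 2: 2000 units, 120 days each (converted to hours), 10 failures
rate_2 = failure_rate(failures=10, units=2000, hours_per_unit=120 * 24)

print(f"Example 1: {rate_1:.2e} failures/hour")  # 4.00e-05
print(f"Example 2: {rate_2:.2e} failures/hour")  # 1.74e-06
```

If the actual failure times were known, the failed units would contribute fewer hours and the estimated rate would come out slightly higher.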

Vendors often use statistical sampling methods to estimate the average failure rate over large populations of components.

##### Categories of Items:

- Repairable items: items that can be repaired once they fail and, once fixed, resume their required function.
- Non-Repairable items: items that cannot be fixed once they fail and are generally replaced. This is more common in the semiconductor/telecommunication industry. For instance, when a router line card goes bad because of a hardware failure, we typically replace it.

##### Patterns of Failures with Time

There are three basic ways in which the pattern of failures can change with time. Hardware failures are typically characterized by a “bath tub curve”.

**Infant Mortality Stage**: A decreasing failure rate is observed in items that become less likely to fail as their survival time increases. This is common in electronic equipment and is often caused by manufacturing defects. These kinds of defects are usually caught during the “burn-in” period of the item.

**Wear out Stage**: Wear-out failure modes follow an increasing failure rate. This is when the component is getting close to the end of its life.

**Constant Failure Rate (Random Failures)**: A constant failure rate is characteristic of failures that happen randomly. This stage is the useful life span of the equipment, and it will be our focus.

As you may have noticed, the failure rate is a function of time, i.e. how likely things are to fail changes over their lifetime. This is represented by the failure rate ƛ(t). To model the reliability of electronic systems, we assume that the failure rate is constant during the useful life of an item, since such systems typically do not experience wear-out failures during that period. There are studies and papers out there that argue this may not be true, but for now we will conveniently ignore that.

#### Reliability:

Reliability is defined as the probability that an item will perform a required function without failure for a stated period of time. Another way to put it is that reliability is a measure of how long a network (or a system) is likely to run before it fails.

So why is reliability defined in terms of probability? Because in the universe of reliability we are dealing with uncertainty. For instance, let’s say the data shows that a certain type of line card fails at a constant average rate of one failure per 1,000,000 hours. Now if we build 1000 units and operate them for 100 hours, we cannot say with certainty whether any will fail during that time. However, we can make a statement about the likelihood (probability) of a failure.
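To put a number on that likelihood, here is a minimal Python sketch. It assumes the quoted rate means one failure per 1,000,000 operating hours and that units fail independently, and it borrows the constant-failure-rate survival formula e^(-ƛt) that is derived later in this post. The variable names are mine.

```python
import math

rate_per_hour = 1 / 1_000_000  # assumed: one failure per 1,000,000 operating hours
units = 1000
hours = 100

# Average number of failures we would expect across the whole population
expected_failures = rate_per_hour * units * hours

# Probability that a single unit survives the 100 hours (constant failure rate)
p_unit_survives = math.exp(-rate_per_hour * hours)

# Probability that all units survive, assuming they fail independently
p_no_failures = p_unit_survives ** units
p_at_least_one = 1 - p_no_failures

print(f"Expected failures:         {expected_failures:.2f}")
print(f"P(at least one failure):   {p_at_least_one:.3f}")
```

So we still cannot say whether a failure will happen, but under these assumptions there is roughly a one in ten chance of seeing at least one.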

**Reliability Mathematics:**

Now we will briefly look at the reliability math to see how the equations are derived before we start using them. I will try to introduce the necessary concepts of probability before using them, but please keep in mind that there is no way I can do justice to every fundamental if the reader has no background in probability/calculus.

Also, I am going to take some liberties while explaining to keep things simple, so it’s possible that some statements may not be mathematically rigorous.

**Quick 25,000 foot view of Probability basics:**

- As you already know, in the universe of probability the likelihood of an event is expressed as a number between 0 and 1. The probability of an event A happening is written P(A).
- The sum of the probabilities of all possible outcomes is equal to one. For instance, if an experiment can have three possible outcomes A, B and C, then *P(A) + P(B) + P(C) = 1*.
- The probability that an event A will occur is equal to 1 minus the probability that event A will not occur. If P(A′) is the probability that event A will not occur, then *P(A) = 1 – P(A′)*. Intuitively, the probability of event A happening plus the probability of event A not happening must equal 1: we are 100% sure that either it will happen or it will not.

**Random Variables:** A random variable, usually written as X, is a variable whose possible values are numerical outcomes of a random experiment. There are two types of random variables, **discrete and continuous** (https://www.mathsisfun.com/data/random-variables.html).

Discrete Random Variables:

A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, … In particular, if a random variable can take only a finite number of distinct values, then it is a discrete random variable. Examples of discrete random variables include the number of children in a family and the number of defective light bulbs in a box of ten.

Continuous Random Variables:

A continuous random variable is one that can take any value in an interval, so it has infinitely (uncountably) many possible values. Continuous random variables are usually measurements. Examples include height, weight and time. Probabilities for a continuous random variable are not assigned to specific values. Instead, they are defined over intervals of values and are represented by the area under a curve (calculated with the help of an integral). The probability of observing any single exact value is 0, since the number of values the random variable can take is infinite.

**Probability Distribution Function:** In simple words, a probability distribution function tells us the probability of each outcome happening. Let’s say we conduct an experiment whose possible outcomes are 1, 2, 3 and 4; below are the probabilities of each outcome. Since the outcomes are discrete numbers, we are looking at a discrete random variable.

Potential Outcomes | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Probability | 0.20 | 0.25 | 0.30 | 0.25 |

This is an example of a probability distribution function. In the case of a discrete random variable it is called a **Probability Mass Function** (P.M.F.). It allows us to answer questions like:

- What’s the probability of 2 happening? P(X=2) = 0.25, or 25%.
- What’s the probability of 1 or 2 happening? P(X=1 or X=2) = P(X=1) + P(X=2) = 0.20 + 0.25 = 0.45.
- What’s the probability of getting 1 and then 2 in two independent runs of the experiment? P(X=1) × P(X=2) = 0.20 × 0.25 = 0.05. (In a single run, X cannot be both 1 and 2, so that probability is 0.)

You can also see that the sum of all probabilities (0.20 + 0.25 + 0.30 + 0.25) = 1, meaning that we are covering all the potential outcomes of this experiment. In general, with a P.M.F. we calculate the probability of a set of outcomes by summing their individual probabilities.
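The table above maps naturally onto a Python dictionary. The sketch below is just an illustration of the summation idea; the variable names are my own.

```python
# P.M.F. from the table: outcome -> probability
pmf = {1: 0.20, 2: 0.25, 3: 0.30, 4: 0.25}

# All probabilities must add up to 1
total = sum(pmf.values())

# P(X = 2): read it straight from the table
p_two = pmf[2]

# P(X = 1 or X = 2): add the probabilities of the individual outcomes
p_one_or_two = pmf[1] + pmf[2]

# P(X <= 3): sum over every outcome that satisfies the condition
p_at_most_three = sum(p for x, p in pmf.items() if x <= 3)

print(f"total = {total:.2f}")
print(f"P(X=2) = {p_two:.2f}")
print(f"P(X=1 or X=2) = {p_one_or_two:.2f}")
print(f"P(X<=3) = {p_at_most_three:.2f}")
```

The last line is a small preview of the cumulative distribution function discussed further down.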

When it comes to continuous random variables, the same concept applies as for discrete random variables. The difference is in how we calculate the probabilities.

A probability distribution function for a continuous random variable describes the probabilities of its possible values. Probabilities of a continuous random variable X are given as the area under the curve of its P.D.F. And if you recall, the way to find the area under a curve is by using integrals. The probability distribution function of a continuous random variable is known as a **Probability Density Function** (P.D.F.).

Here is an example of the probability distribution of the weight of adult males. We can calculate the probability that a man weighs between 160 and 170 pounds by calculating the area of the shaded range, which in this case comes out to be 0.135905, or 13.59%.

The area under the curve is always equal to 1, since it describes the total probability of all possible values of *x*. Therefore

$$\int_{-\infty}^{\infty} f(x)\, dx = 1$$

The probability of a value falling between any two values *x*₁ and *x*₂ is the area bounded by this interval, that is,

$$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\, dx$$
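Here is a rough numerical check of the "area under the curve" idea for the weight example. The post does not state the distribution's parameters, so the normal distribution with mean 180 lb and standard deviation 10 lb below is my assumption, chosen because it reproduces the quoted 0.135905. The helper names are mine too.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f from a to b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


MEAN, SD = 180.0, 10.0  # assumed parameters, not given in the post


def weight_pdf(x: float) -> float:
    return normal_pdf(x, MEAN, SD)


# Area between 160 and 170 pounds (the shaded range)
p_between = integrate(weight_pdf, 160.0, 170.0)

# Total area; +/- 10 standard deviations stands in for -inf..inf
p_total = integrate(weight_pdf, MEAN - 10 * SD, MEAN + 10 * SD)

print(f"P(160 <= X <= 170) = {p_between:.4f}")  # 0.1359
print(f"Total area         = {p_total:.4f}")    # 1.0000
```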

There are certain well known Probability distribution functions for Discrete and Continuous Random variables.

**Cumulative Distribution function (CDF)**

The cumulative distribution function (cdf) is the probability that a random variable takes a value less than or equal to x. It helps us answer questions like: what is the probability that an outcome will have a value less than or equal to x? For example, what is the probability that an adult male weighs less than or equal to 170 pounds? CDFs can be derived from PDFs.

The cumulative distribution function (cdf), F(*x*), gives the probability that a measured value will fall between −∞ and *x*:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\, dy$$

**Exponential Probability Distribution**

The exponential distribution is often concerned with the amount of time until some specific event occurs. For example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution.

One of the key properties of the exponential distribution is **memorylessness**, meaning that the probability of an event occurring in the future doesn’t depend on how much time has already elapsed. Put another way, if we have already waited s hours without the event, the probability of waiting at least t more hours is the same as it was at the start: P(T > s + t | T > s) = P(T > t). Typical questions such as how long we need to wait before a customer enters our shop, or how much time will elapse before an earthquake occurs in a given region, can be answered in probabilistic terms using the exponential distribution.
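Memorylessness is easy to check numerically. The sketch below uses the standard exponential survival function P(T > t) = e^(-ƛt), which follows from the C.D.F. derived later in this post. The rate and the waiting times are made-up numbers for illustration.

```python
import math


def survival(t: float, rate: float) -> float:
    """P(T > t) for an exponentially distributed waiting time T."""
    return math.exp(-rate * t)


rate = 0.5  # hypothetical: 0.5 events per hour
s = 3.0     # hours already waited without an event
t = 2.0     # additional hours we are asking about

# P(T > s + t | T > s) = P(T > s + t) / P(T > s)
p_given_already_waited = survival(s + t, rate) / survival(s, rate)

# P(T > t) for a wait that starts from scratch
p_fresh_start = survival(t, rate)

print(f"After waiting {s} h: {p_given_already_waited:.4f}")  # 0.3679
print(f"Fresh start:        {p_fresh_start:.4f}")            # 0.3679
```

Both probabilities match, whatever values of `s` and `t` you pick. That is exactly what "the past does not matter" means here.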

In Reliability engineering, we can use this distribution because we assume that the failure rate is constant during the **“useful lifetime”** of an item. Since reliability deals with failure times, and times are non-negative, the lower bound of our functions starts at 0 rather than −∞.

So the probability density function of the exponential distribution in the reliability universe is given by

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0$$

where ƛ is the constant failure rate.

[Figure: P.D.F. plot of an exponential function]

Worth a read: An Intuitive Guide To Exponential Functions & e:

##### C.D.F of Exponential Function

The C.D.F. of the exponential distribution is given below, where y is a dummy variable of integration since we are already using “t” as the upper limit.

$$F(t) = \int_{0}^{t} f(y)\, dy$$

where f(y) is the P.D.F. of the exponential distribution, f(y) = ƛe^(-ƛy). Solving the integral gives us the C.D.F.:

$$F(t) = \int_{0}^{t} \lambda e^{-\lambda y}\, dy = \left[-e^{-\lambda y}\right]_{0}^{t} = 1 - e^{-\lambda t}$$

So the C.D.F. of the exponential distribution is F(t) = 1 - e^(-ƛt).

[Figure: C.D.F. plot of an exponential function]

Cool, so what the heck does the C.D.F. do for me, that I went through all this pain of explaining it? It tells me the probability that a random unit drawn from the population fails by time “t”.
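As a sanity check, we can integrate the P.D.F. numerically and compare it with the closed form F(t) = 1 - e^(-ƛt). This is a sketch with my own helper names. The failure rate is the one estimated from Example 1 (2 failures over 50 × 1000 unit-hours).

```python
import math


def exp_pdf(t: float, rate: float) -> float:
    """P.D.F. of the exponential distribution: rate * e^(-rate * t)."""
    return rate * math.exp(-rate * t)


def exp_cdf(t: float, rate: float) -> float:
    """C.D.F. of the exponential distribution: 1 - e^(-rate * t)."""
    return 1.0 - math.exp(-rate * t)


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f from a to b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


rate = 2 / (50 * 1000)  # failures per hour, from Example 1
t = 1000.0              # hours

area_under_pdf = integrate(lambda y: exp_pdf(y, rate), 0.0, t)
closed_form = exp_cdf(t, rate)

print(f"Numerical integral of the P.D.F.: {area_under_pdf:.6f}")
print(f"Closed-form C.D.F.:               {closed_form:.6f}")
```

Both values come out at about 0.0392, i.e. roughly a 3.9% chance that a randomly picked unit has failed by 1000 hours.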

**Back to Reliability**

Now before I throw a formula at you for calculating reliability, let’s take a look at an example first which is going to build the intuition

Ex:

Let’s say 50 components were operated for 1000 hrs in a test and two of them failed. Then the probability of failure of a component in 1000 hrs is:

$$P(f) = \frac{2}{50} = 0.04$$

And the probability of success for a component in 1000 hrs is:

$$P(s) = 1 - P(f) = 1 - 0.04 = 0.96$$

Thus, reliability is the probability of no failure within a given operating period. If we generalize what we just did in the above example,

$$P(s) = 1 - P(f)$$

where P(s) is the reliability. Now, we know that the C.D.F. **F(t) = 1 - e^(-ƛt)** gives us the probability of a failure by time “t”. If we subtract that from 1, it gives us the probability that a component is still working at time “t”, which is the reliability:

$$R(t) = 1 - F(t) = e^{-\lambda t}$$

In simple words, we know the probability of a unit failing by time “t” from F(t), so the remaining probability (1 - F(t)) is the probability of it not failing by time “t”.
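To tie the formula back to the 50-component example, here is a small sketch that compares the simple counting estimate with the exponential model. The function name `reliability` is mine, and the failure rate is estimated by crediting all 50 units with the full 1000 hours, which is an approximation.

```python
import math


def reliability(t: float, rate: float) -> float:
    """R(t) = e^(-rate * t): probability of no failure up to time t."""
    return math.exp(-rate * t)


# Counting estimate from the test: 2 failures out of 50 units
p_success_counted = 1 - 2 / 50

# Exponential model, using the failure rate estimated from the same test
rate = 2 / (50 * 1000)  # failures per hour
r_model = reliability(1000, rate)

print(f"Counting estimate: {p_success_counted:.4f}")  # 0.9600
print(f"R(1000 h):         {r_model:.4f}")            # 0.9608

# The same model answers the question for any other operating period
for hours in (100, 1000, 10_000, 25_000):
    print(f"R({hours:>6} h) = {reliability(hours, rate):.4f}")
```

The two numbers for 1000 hours are close but not identical, because the rate estimate and the exponential model each involve a small approximation.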

Another widely used probability distribution in reliability is the **Weibull distribution**, but we will not cover that.

#### Mean Time Between Failures (MTBF)

MTBF tells us the mean or average life of a system based on the frequency of system outages or failures. It is the **average** of all the failure times in the population. MTBF is given by the expected value (mean) of the exponential distribution. The expected value of a continuous random variable T with density f(t) = ƛe^(-ƛt) is

$$\text{MTBF} = E[T] = \int_{0}^{\infty} t\, f(t)\, dt = \int_{0}^{\infty} t\, \lambda e^{-\lambda t}\, dt = \frac{1}{\lambda}$$

where ƛ is the failure rate. This tells us that MTBF = 1/failure rate. We know that reliability is 1 - F(t) = e^(-ƛt), so in terms of MTBF

$$R(t) = e^{-\lambda t} = e^{-t/\text{MTBF}}$$
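If you would rather trust a computer than the calculus, the sketch below integrates t·f(t) numerically and compares the result with 1/ƛ. The helper names and the choice to cut the integral off at 50 mean lifetimes are mine. The rate is again the one from Example 1.

```python
import math


def exp_pdf(t: float, rate: float) -> float:
    """P.D.F. of the exponential distribution: rate * e^(-rate * t)."""
    return rate * math.exp(-rate * t)


def integrate(f, a: float, b: float, steps: int) -> float:
    """Approximate the integral of f from a to b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


rate = 2 / (50 * 1000)  # failures per hour, from Example 1

# The integral runs to infinity; 50 mean lifetimes is far enough out
# that the remaining tail is negligible.
upper_limit = 50 / rate

mtbf_numeric = integrate(lambda t: t * exp_pdf(t, rate), 0.0, upper_limit, steps=200_000)
mtbf_closed_form = 1 / rate

print(f"Numerical mean of failure times: {mtbf_numeric:,.1f} hours")
print(f"1 / failure rate:                {mtbf_closed_form:,.1f} hours")
```

Both come out at about 25,000 hours. Note that R(MTBF) = e^(-1) ≈ 0.37, so under this model only about 37% of units are still running when the MTBF is reached.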

MTBF is commonly used for repairable items, whereas MTTF (Mean Time To Failure) is commonly used for non-repairable items in the hardware world. MTTF is the average amount of time from placing a system or component in service until it permanently fails. It’s typically used for components that aren’t repairable and are simply thrown away once they fail. Passive components (like network cables) are often not included in MTTF estimation as they tend to have longer lifetimes than active components.

MTTF and MTBF are often used interchangeably in books and other texts.

#### MTTR

Mean Time To Recovery or Repair (MTTR) is the average time required to restore the operation of a repairable component or system. The most common way to estimate it is to sum all observed restoration times and divide by the number of reported outages:

$$\text{MTTR} = \frac{\text{total restoration time}}{\text{number of outages}}$$
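In code, the estimate is just an average. The outage durations below are invented for illustration.

```python
# Hypothetical restoration times, in minutes, one entry per reported outage
restoration_minutes = [30, 45, 20, 65]

# MTTR = total restoration time / number of outages
mttr_minutes = sum(restoration_minutes) / len(restoration_minutes)

print(f"MTTR = {mttr_minutes} minutes")  # 40.0
```

Keep the general caution about metrics in mind here: a single very long outage can dominate this average, so it is worth looking at the individual values too.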

#### Availability

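Availability is the probability that a network (or component) is available to users at any given point in time. It is measured as the average fraction of time over an interval for which the system is up:

$$A = \frac{\text{uptime}}{\text{uptime} + \text{downtime}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$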

The unavailability (U) of a unit is simply given by U =1 – A.
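Here is a short sketch of how availability and unavailability translate into expected downtime, using the steady-state formula A = MTBF / (MTBF + MTTR) that also appears in the example further down. The MTBF and MTTR values are hypothetical and the function name is mine.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; both arguments must use the same time unit."""
    return mtbf / (mtbf + mttr)


mtbf_hours = 25_000.0  # hypothetical
mttr_hours = 4.0       # hypothetical

a = availability(mtbf_hours, mttr_hours)
u = 1 - a  # unavailability

# Expected downtime over one year (365 days)
downtime_hours_per_year = u * 365 * 24

print(f"Availability:   {a:.6f}")
print(f"Unavailability: {u:.6f}")
print(f"Expected downtime: about {downtime_hours_per_year:.1f} hours per year")
```

With these numbers the expected downtime works out to roughly 1.4 hours per year.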

Reliability is the likelihood that a system will continue to provide service without failure. Availability is the likelihood that a system will provide service over the course of its lifetime.

**Difference between Reliability and Availability:** Let’s say a car breaks down and requires maintenance 5% of the time. It’s therefore 95% reliable. Now if the same car is shared equally between two family members, then the car is available to each of them only 47.5% (95%/2) of the time on average. Even if the car were 100% reliable, it would still be only 50% available to each person, and an additional car would be required to increase availability.

Another example: consider a system that fails every minute on average but comes back after only 500 ms. Such a system has a mean time between failures (MTBF) of 1 min, so its reliability over 60 s is R(60) = e^(-t/MTBF) = e^(-60/60) ≈ 0.368 (36.8%), which is low, whereas its availability = MTBF/(MTBF+MTTR) = 60/(60+0.5) ≈ 0.992 (99.2%) is relatively high.
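The numbers in this last example can be checked quickly, taking the MTBF as 60 s and the MTTR as 0.5 s as stated. The variable names are mine.

```python
import math

mtbf_s = 60.0  # the system fails once a minute on average
mttr_s = 0.5   # and recovers in 500 ms

# Reliability: probability of running a full 60 s without any failure
r_60 = math.exp(-60.0 / mtbf_s)

# Availability: long-run fraction of time the system is up
a = mtbf_s / (mtbf_s + mttr_s)

print(f"R(60 s)      = {r_60:.4f}")  # 0.3679
print(f"Availability = {a:.4f}")     # 0.9917
```

Same system, two very different numbers, which is why it matters which of the two metrics you are being quoted.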

**Conclusion:** So far we have covered some basics and the math behind some basic reliability-related metrics. In the next post we will try to do something useful with that.

Richard says

December 4, 2017 at 2:25 pm

Ah pseudo applied Gauss :):)

fred p baker says

December 16, 2017 at 12:47 pm

The cisco press book High Availability Network Fundamentals an excellent reference. It was published in 2001 so may be hard to get.

Shawn Moore says

December 17, 2017 at 2:11 pm

Good intro, I’m not great at following the math but very interesting on what you do with this info in your next blog.

leila says

February 8, 2018 at 2:12 pm

thanks alot

very good

Manuel Garza says

August 9, 2018 at 6:43 am

Do you know what MTBF-DC (Design Controlled) is and how it is calculated?

Shweta Lakhe says

October 25, 2018 at 6:27 am

Thanks! It was very useful for me. Would request you to share more such material on Reliability