# Quantitatively Analyzing the Performance of Integrated Circuits and Their Reliability

Edward J. Wyrwas and Joseph B. Bernstein

Testing, instrumentation, and measurement electronics require high reliability and high quality complex integrated circuits (ICs) to ensure the accuracy of the analytical data they process. Microprocessors and other complex ICs such as FPGAs are considered the most important components within instrumentation. They are susceptible to electrical, mechanical and thermal modes of failure like other components on a printed circuit board, and due to their complexity and roles within a circuit, performance-based failure can be considered an even larger concern. Stability of device parameters is key to guaranteeing that a system will function according to its design. We discuss the importance of microprocessor and IC device reliability and how modifying the operational parameters of these devices through over- and under-clocking can either reduce or improve overall reliability, respectively, and directly affect the lifetime of the system in which these devices are installed.

Development of these critical components has conformed to Moore's Law where the number of transistors on a die doubles approximately every two years. Over the last three decades, this trend has continued, and the reduction in transistor size has allowed the creation of faster and smaller ICs with greatly reduced power dissipation. Although this is great news for developers of high performance equipment, a crucial reliability risk has emerged. Semiconductor failure mechanisms which are far worse at these minute feature sizes (tens of nanometers) result in shorter device lifetimes and unanticipated early wear out.

The ability to analyze and understand the impact that specific operating parameters have on device reliability is necessary to mitigate the risk of system degradation which will affect measurements being taken by the system and even cause early failure of that system. Industry accepted failure mechanism models, physics-of-failure (PoF) knowledge, and an accurate mathematical approach which utilizes semiconductor formulae and device functionality can assess reliability of those integrated circuits vital to the system. There are currently four semiconductor failure mechanisms in silicon-based ICs that are analyzed: electromigration (EM), time dependent dielectric breakdown (TDDB), hot carrier injection (HCI), and negative bias temperature instability (NBTI). Mitigation of these inherent failure mechanisms, which include those considered wear-out, is only possible when reliability can be quantitatively calculated.

Algorithms folded into a software application have been designed to calculate a failure rate, give confidence intervals, and produce a lifetime curve using both steady state and wearout failure rates for the IC under analysis. The algorithms have been statistically verified through testing and employ data and formulae from semiconductor materials (to include technology node parameters), circuit fundamentals, transistor behavior, circuit design and fabrication processes. Initial development has yielded a user-friendly software module with the ability to address silicon-based integrated circuits of the 0.35  $\mu$ m, 0.25  $\mu$ m, 0.18  $\mu$ m, 0.13  $\mu$ m and 90 nm technology nodes. DfR Solutions, LLC is currently working to extend the capability of the tool into smaller technology nodes, including 65 nm, 45 nm, and 32 nm.

## **Operating Outside Recommended Specifications**

Engineers who entertain the idea that electronic devices with no moving parts should last forever are living in a fantasy. Those who work in the electronics industry know otherwise. The primary questions system designers should ask themselves are, "Should devices be operated outside of their specifications?" and "Do we even know if doing this will cause damage to the device – enough damage to reduce the lifespan of our systems to inside our projected useful life?" If we had the answers to these questions, then we would already know how operating devices outside of their "typical" or "recommended" settings will affect reliability. Most components are derated by at least 25% to increase reliability. Some components are closer to 50% derated. Integrated circuits are less tolerant to derating and uprating but still have good margins of 5-10%.

Although it would seem that derating all components would solve our reliability needs, it actually does not. HCI in particular is driven by relatively low temperatures (room temperature) compared to those of other circuitry.

HCI occurs in conducting nMOS and pMOS devices stressed with drain bias voltage. High electric fields energize the carriers (electrons or holes) which are injected into the gate oxide region. The degraded gate dielectric can then more readily trap electrons or holes, causing a change in threshold voltage, which in turn results in a shift in the subthreshold leakage current. HCI is accelerated by an increase in bias voltage and is predominately worse during lower stress temperatures, e.g., room temperature. Therefore, HCI damage is unlike the other failure mechanisms as its damage will not be replicated in a high temperature operating life (HTOL) test which is commonly used for accelerated testing. Typical HTOL conditions are operating the device under test (DUT) at a maximum ambient temperature calculated from the power dissipation and junction temperature of the device while at maximum operating voltage ratings.

Failure analysis data from a telecommunications company shows that HCI can devastate the primary computing components of a platform within a few years – even if these components were custom designed to mitigate known failure behaviors. (The anticipated field life of this particular product was 15 to 20 years.) Failures of this type are never anticipated, nor have they been able to be predicted in the past. This brings us to the present where it is now necessary to have methods to predict the failure rate of such components, preferably easy-to-use software applications.

Overclocking (uprating) an IC by using a clock that is faster than the one recommended by the manufacturer seems to be an obvious way of reducing an IC's lifetime. Devices running above their specifications should not be expected to survive as long as non-overclocked devices. System analysis is necessary to determine how overclocking may cause system instability, e.g., timing control issues. Making a device work harder creates more heat. System designers should verify that their thermal solutions are optimal for an increased system load.

Underclocking a device too much creates issues as well. Therefore, a higher degree of system analysis should be performed to see how the change in device parameters will affect the system. Issues commonly created are noise in the signals, unforeseen reduction in threshold voltages, offset propagation delay timings, latch-up and mismatched bus speeds.

Reliability of semiconductor devices may depend on assembly, use, and environmental conditions. Stress factors affecting device reliability include gas, dust, contamination, voltage, current density, temperature, humidity, mechanical stress, vibration, shock, radiation, pressure, and intensity of magnetic and electric fields. As engineers we can design, redesign, and adjust our system to include printed circuit board type, vibration dampening, and even conformal coatings or potting to reduce, or even eliminate, the effects of extrinsic failure mechanisms and stressors. However, those failure mechanisms and stressors intrinsic to the device (from the material composition and design) are eventually going to take place. The overwhelming question is "When?"

In the electronics industry, there is an interest in assessing the long term reliability of electronics whose anticipated lifetime extends farther than consumer "throw away" electronics. Because complex integrated circuits within their designs may face wear out or even failure within the period of useful life, it is necessary to investigate the effects of use and environmental conditions on these components. The main concern is that submicron process technologies drive device wear-out earlier than was anticipated and reduce their useful life.

## **Accelerated Testing**

An accelerated test is designed to speed up a behavior or mechanism that will cause the device to fail over time. Each failure mechanism has a corresponding activation energy. This activation energy is the energy necessary for a reaction or change to occur in a material. The ratios of field and test voltages and temperatures along with the activation energy can be used to calculate an acceleration factor that will correlate the lifetime under test to the expected life of the device operating in the field environment.

An integrated circuit device's life is limited by thermal, mechanical and electrical stresses and process defects. The device also experiences degradation inherent to its material composition and design which can also cause failure. Some examples of each type are:

- Mechanical defects/process defects: over-bonded/underbonded wire bonds, misplaced wire bonds, scratches, voids, cracks, thin oxide, thermal-mechanical mismatch between materials, incompletely cured polymers, interface impurities, poor die attach, etc.; and
- Degradation mechanisms: corrosion, mechanical overstress, crack initiation/propagation, intermetallic formation, mechanical fatigue, stress-corrosion cracking, atomic transport, decomposition of polymer materials, moisture diffusion/migration, and dislocations (linear defects in the lattice that move easily along slip planes).

A brief summary of mechanisms and their activation energies covers the factors that affect semiconductor lifetimes. An explanation of mechanisms and their activation energies determines which one(s) should be used in Arrhenius-relationship-based accelerated life testing, i.e., testing that relies on temperature-dependent reactions. In real world environments, multiple failure mechanisms will be activated which will have various levels of impact on the device lifetime. Therefore, it is not representative to say that testing to one failure mechanism's activation energy will induce all types of failures.

Industry uses a standard set of tests for semiconductor devices. However, these tests do not take into account the materials, technology, complexity or function of the device. JEDEC Standard number 47D, "Stress-Test-Driven Qualification of Integrated Circuits," defines a typical acceleration test style which is a high temperature operating life test [1] based on three lots of 77 parts per lot. The test duration is 1000 hours with a test temperature of +125°C. The total duration of the test is 231,000 device hours and is derived by:

$$T_{\text{Test}} = 3 \text{ lots} \times 77 \text{ samples} \times 1000 \text{ hours} = 231{,}000 \text{ device hours}$$

When we talk about uprating or derating a device, we are considering applying an acceleration factor to its anticipated life. Acceleration factors are the extrapolation between test and field conditions at one set of given points. Traditionally, only voltages and temperatures are accelerated.

The acceleration factor due to change in temperature is the acceleration factor most often referenced in JEDEC Publication No. 122B. The mathematical relationship follows the format of the Arrhenius equation:

$$AF = \frac{\lambda_{Test}}{\lambda_{Field}} = \exp\left[\left(\frac{-Ea}{K}\right)\left(\frac{1}{T_{Test}} - \frac{1}{T_{Field}}\right)\right]$$

Where:

- *Ea* is the activation energy in electron volts (eV);
- *K* is Boltzmann's constant ($8.62 \times 10^{-5} \text{ eV/K}$);
- $T_{Test}$ is the absolute temperature of the test (K);
- $T_{Field}$ is the absolute temperature of the system (K);
- $\lambda_{Test}$ is the failure rate at the test temperature; and
- $\lambda_{Field}$ is the failure rate at the actual field temperature.

Thus, when predicting a failure rate from HTOL test results, *Ea* and  $T_{Field}$  are necessary.
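As a minimal sketch of the Arrhenius relationship above, the following Python function evaluates the acceleration factor for a given activation energy. The +125°C test temperature comes from the HTOL conditions described earlier; the 55°C field temperature and the function name `arrhenius_af` are illustrative assumptions, not values taken from this article.

```python
import math

K_BOLTZMANN_EV = 8.62e-5  # Boltzmann's constant in eV/K, as quoted above


def arrhenius_af(ea_ev, t_test_c, t_field_c):
    """Acceleration factor AF = lambda_Test / lambda_Field (Arrhenius form).

    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_test_k = t_test_c + 273.15
    t_field_k = t_field_c + 273.15
    exponent = (-ea_ev / K_BOLTZMANN_EV) * (1.0 / t_test_k - 1.0 / t_field_k)
    return math.exp(exponent)


# Assumed field temperature of 55 C (hypothetical), test at 125 C.
af_07 = arrhenius_af(0.7, 125.0, 55.0)
af_10 = arrhenius_af(1.0, 125.0, 55.0)
print(f"AF at Ea = 0.7 eV: {af_07:.1f}")
print(f"AF at Ea = 1.0 eV: {af_10:.1f}")
```

Under these assumptions the acceleration factor is roughly 78 at 0.7eV and roughly 500 at 1.0eV, which illustrates how strongly the result depends on the assumed activation energy.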

Activation energy is the parameter used to express the degree of acceleration related to temperature. Single failure mechanisms are accompanied by unique activation energy values (JEDEC Publication No. 122B). However, the traditional method uses an activation energy of 0.7eV (assumed to be the average activation energy) during the useful life of a device. This useful life lies beyond the early stages of "infant mortality" failures, i.e., manufacturing defect-driven failures. The value 0.7eV is widely used in industry in the following two cases:

- when estimating an overall failure rate without focusing on just one failure mechanism. It is assumed to be a conservative value with regard to the mixture of single mechanisms activation energies, and
- when the failure mechanisms degrading a device are unknown.

The industry goal in using HTOL is to gain the maximum possible acceleration to accumulate maximum equivalent field time with zero failures. (Yes, zero failures. Some device manufacturers will retest until they achieve zero failures. The user will never see the testing results, nor will they know that failures occurred during another test.) Assuming a higher activation energy serves this goal because it reduces the failure rate upper limit. For example, assuming an activation energy of 1.0eV instead of 0.7eV will raise the acceleration factor to 504 instead of 78 (6.5 times more), and the failure rate will drop from 51 FIT (failures in time, i.e., failures per 10<sup>9</sup> hours) to only 8 FIT, which is even more optimistic.
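The FIT figures quoted above can be approximately reproduced with the standard zero-failure upper bound on a constant failure rate. The sketch below is an illustration under stated assumptions: the confidence level behind the 51 FIT and 8 FIT values is not given in the article, and a 60% one-sided confidence level is assumed here because it matches those numbers. The function name `fit_upper_bound` is hypothetical.

```python
import math


def fit_upper_bound(device_hours, accel_factor, confidence=0.60):
    """Upper-bound failure rate in FIT for a test with zero failures.

    For zero failures the chi-square bound (2 degrees of freedom) reduces to
    lambda <= -ln(1 - CL) / (equivalent field hours).
    """
    equivalent_field_hours = device_hours * accel_factor
    rate_per_hour = -math.log(1.0 - confidence) / equivalent_field_hours
    return rate_per_hour * 1e9  # failures per 10^9 hours


DEVICE_HOURS = 3 * 77 * 1000  # 231,000 device hours from the HTOL test

fit_07 = fit_upper_bound(DEVICE_HOURS, 78)   # Ea = 0.7 eV case
fit_10 = fit_upper_bound(DEVICE_HOURS, 504)  # Ea = 1.0 eV case
print(f"FIT with AF = 78:  {fit_07:.1f}")
print(f"FIT with AF = 504: {fit_10:.1f}")
```

The same zero-failure test therefore yields very different claimed failure rates depending only on the activation energy chosen, which is the point made in the text.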

If we designed a test around activation energy, we could make that test longer or shorter than 1000 hours. Either way, we would be tweaking the test for our gain, in the eyes of management, not for an understanding of a device's actual reliability. A new approach to calculating failure rates for semiconductor devices would be beneficial to industry.

## Integrated Circuit Lifetime

Let's get down to the nitty-gritty. We want to determine how uprating or derating will affect an integrated circuit. This is typically called trade-off analysis. If we have our inputs set up with an equation and can plot out our results in a spreadsheet, then we should be able to tweak the inputs and see how the outputs are affected. This sounds like a simple concept and it is.

Let us re-emphasize this point: The temptation will be to blame device failure on how it was operated. A better approach, however, is to consider manufacturing methods and technology inside the device since it has no moving parts.

DfR Solutions, LLC has developed an IC reliability calculator using a multiple failure mechanism approach. This approach successfully models the simultaneous degradation behaviors of multiple failure mechanisms on integrated circuit devices. The multiple mechanism model extrapolates independent acceleration factors for each semiconductor mechanism of concern (TDDB, EM, HCI, and NBTI) based on the transistor stress states within each distinct functional group. IC lifetime is calculated from semiconductor materials and technology node, IC complexity, and operating conditions.

Smaller and faster circuits cause higher current densities, lower voltage tolerances and higher electric fields, which make integrated circuits more vulnerable. New generations of electronic devices and circuits demand new means of investigation to check the possibility of introducing either new problems or new versions of old issues. The arrival of new devices with new designs and materials requires failure analysis to find new models for both the individual failure mechanisms and the possible interactions between them. The interaction of multiple failure mechanisms is one of the issues that requires serious investigation.

In the sub-micrometer region, reliability has been overlooked in favor of performance. Proper tradeoffs in the early component design stage are a dominating challenge. After performing a quick and effective reliability analysis, both lifetime estimation for the device and a failure mechanism dominance hierarchy can be achieved. Using reliability knowledge and improvement techniques, higher reliability integrated circuits can be developed using two methods: suppress die-level failure mechanisms and adjust circuit structures. Although this has been realized for EM (through Black's equation [1]) using design techniques, it is counterproductive across the industry to adjust transistor sizes. Redesign of transistor architecture and circuit schematics is too resource intensive both in time and cost to be the corrective action for reliability concerns. The end user must decide what reliability goals need to be achieved; more so, it has become his/her responsibility to determine how to achieve those goals without any influence on component design, manufacturing, or quality. This type of reliability assessment is crucial for the end user as adjustments to electrical conditions and thermal management seem to be the only way to improve reliability of modern technology nodes. The tradeoff in performance can be significantly reduced by using devices from larger technology nodes as they provide larger operating tolerances and the architectures necessary to reduce the effects of multiple mechanism degradation behaviors.

As technology shifts to the smaller nodes, the operating voltage of the device is not reducing proportionally with the gate oxide thickness, which results in a higher electric field. Moreover, the increasing density of transistors on a chip causes more power dissipation and in turn an increased operating temperature through self-heating. Conversely, introducing nitrogen into the dielectric to aid in gate leakage reduction together with boron penetration control has its own effect – linearly worsening NBTI and other modes of degradation. Because the threshold voltage of new devices is not being reduced in a way that is proportional to the operating voltage, there will be more degradation for the same threshold voltage.

## Practical Implementation of a Prediction Method

Preliminary analysis of a device uses a process that categorizes an integrated circuit into smaller functional blocks to apply acceleration factors at the most minute level. Equivalent function sub-circuits are used as part of the calculator to organize the complexity of the integrated circuit being analyzed into functional group cells, e.g., one bit of DRAM. Fig. 1 shows the functional group block diagram for National Semiconductor's 12-bit ADC component, ADC124S021. It contains a multiplexer group, track and hold function, control logic, and 12-bit analog-to-digital converter.

Fig. 1. The functional group block diagram for ADC124S021.

The software has the ability to take two approaches to analyze the failure mechanism contributions to the failure rate of each device: independent of transistor behavior (ITB) and dependent on transistor behavior (DTB). Fig. 2 depicts the prediction method. It makes two assumptions that are shown in research documents from NASA/JPL and the University of Maryland where the multiple mechanism approach was initially researched:

- In an integrated circuit, each failure mechanism has an equal opportunity to initiate a failure, and
- each can take place at a random interval during the time of operation.

In ITB, the weight factors of the failure mechanisms are spread out evenly over transistor types within each functional group. Only three mechanisms affect nMOS transistors, EM, HCI, and TDDB; therefore each has a 33% contribution. All four mechanisms affect pMOS transistors; therefore each has a 25% contribution. These weight factors are the same for each functional group type as well.
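The equal weighting used in ITB can be sketched in a few lines of Python. This is only an illustration of the rule stated above (three mechanisms for nMOS, four for pMOS, each weighted equally); the names `MECHANISMS_BY_TYPE` and `itb_weights` are hypothetical and do not describe the calculator's actual implementation.

```python
# Mechanisms that affect each transistor type, as listed in the text.
MECHANISMS_BY_TYPE = {
    "nMOS": ("EM", "HCI", "TDDB"),
    "pMOS": ("EM", "HCI", "TDDB", "NBTI"),
}


def itb_weights(transistor_type):
    """Return equal weight factors for every mechanism affecting the type."""
    mechanisms = MECHANISMS_BY_TYPE[transistor_type]
    weight = 1.0 / len(mechanisms)
    return {name: weight for name in mechanisms}


for kind in ("nMOS", "pMOS"):
    print(kind, itb_weights(kind))
```

In DTB these equal weights would instead be replaced by values derived from simulation of the transistor behavior.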

The DTB process utilizes back-end SPICE simulation to determine the failure mechanism weighting contributions based on transistor behavior and circuit function. Using these mechanism weighting factors, sub-circuit cell counts, and transistor quantities, an overall component failure rate is calculated.

The software assumes that all the parameters for these models are technology node dependent. It is assumed that the technology qualification (process qualification) has been performed and at least one screening has occurred before a device is packaged. This reliability prediction covers the steady-state random failures and wear-out portions of the bathtub curve.



Fig. 2. Depiction of prediction methodology.

The standard procedure for integrated circuit analysis extrapolates from high temperature operating life (HTOL) test conditions, which are:

- ambient temperature,
- supply voltage, and
- core voltages.

The HTOL ambient temperature was calculated for each component (except when supplied by the manufacturer). Thermal information was gathered from the datasheet and/or thermal characteristic documentation and each manufacturer's website. Using (1), the ambient temperature $T_A$ is calculated from the junction temperature $T_J$, power dissipation $P_D$, and junction-to-air thermal resistance $\Theta_{J-A}$.

$$T_A = T_J - P_D \cdot \Theta_{J-A} \tag{1}$$

Junction-to-air thermal resistance was found either on a component's datasheet or in thermal characteristic databases for package type and size; i.e. Texas Instruments or NXP Semiconductors websites. (2) shows the ambient temperature calculation for a sample component:

$$\begin{aligned} T_A &= 85\,^{\circ}\mathrm{C} - \left(0.446\,\mathrm{W} \cdot 51\,\frac{^{\circ}\mathrm{C}}{\mathrm{W}}\right) \\ &\approx 62.25\,^{\circ}\mathrm{C} \end{aligned} \tag{2}$$
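The ambient temperature calculation in (1) is simple enough to script when many components have to be processed. The sketch below assumes the rounded inputs printed above (85°C, 0.446 W, 51°C/W); the function name `ambient_temperature_c` is hypothetical.

```python
def ambient_temperature_c(t_junction_c, power_w, theta_ja_c_per_w):
    """Eq. (1): T_A = T_J - P_D * Theta_JA, all temperatures in Celsius."""
    return t_junction_c - power_w * theta_ja_c_per_w


# Sample component from the text: T_J = 85 C, P_D = 0.446 W, Theta_JA = 51 C/W
t_a = ambient_temperature_c(85.0, 0.446, 51.0)
print(f"HTOL ambient temperature: {t_a:.2f} C")
```

With these rounded inputs the expression evaluates to about 62.25°C.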

Inputs on the calculator are the test parameters and results from the standard JEDEC accelerated test and information pertaining to the integrated circuit as follows:

- JEDEC Standard No. 47D [2]:
  - 25 devices under test,
  - 1000 hour test duration,
  - zero (0) failures,
  - 50% confidence level,
- Pre- or user-defined process node parameters,
- Device complexity as broken down by functional groups and quantity of cells within each functional group, where applicable,
- Accelerated test information (qty. of failures, qty. of devices, test duration),
- Duty cycle of device (e.g., diurnal cycling or 50%),
- Confidence level of calculation/test,
- Field and test conditions (field conditions allow for multiple operating modes):
  - ambient temperature,
  - operating frequency,
  - core voltage,
  - supply voltage, and
- Failure mechanism parameters and corresponding equations.

## System Reliability Modeling

Our approach is to model useful life failure rate (FIT) for components in electronic assemblies by assuming each component is composed of multiple sub-components, for example: a certain percentage is effectively ring-oscillator, static SRAM, DRAM, etc. Each type of circuit, based on its operation, can be seen to affect the potential steady-state (defect related) failure mechanisms differently based on the accelerated environment, i.e., by EM, HCI, NBTI, etc. Each mechanism is known to have its own acceleration factors with voltage, temperature, frequency, cycles, etc. Each sub-component will be modeled to approximate the relative likelihood of each mechanism per sub-component. Then, each component can be seen as a matrix of subcomponents, each with its own relative weight for each possible mechanism. Hence, the standard system reliability FIT can be modeled using traditional MIL-handbook-217 type algorithms and adapted to known system reliability tools. However, instead of treating each component as an individual, we propose treating each complex component as a series system of sub-components, each with its own reliability matrix, as we describe next.

#### Percentage in the Circuit

The percentages of oscillator, DRAM, SRAM and deformable package are related to their cell numbers. For example, assume that oscillators have $N_O$ cells, DRAM has $N_D$ cells, SRAM has $N_S$ cells, and encapsulation includes $N_J$ cells. Then the total cell number of a chip is $N = N_O + N_D + N_S + N_J$, and the failure rate of a chip is:

$$\begin{aligned} \lambda_{c} &= N_{O}\lambda_{O} + N_{D}\lambda_{D} + N_{S}\lambda_{S} + N_{J}\lambda_{J} \\ &= N\left(\frac{N_{O}}{N}\lambda_{O} + \frac{N_{D}}{N}\lambda_{D} + \frac{N_{S}}{N}\lambda_{S} + \frac{N_{J}}{N}\lambda_{J}\right) \\ &= N\left(A_{O}\lambda_{O} + A_{D}\lambda_{D} + A_{S}\lambda_{S} + A_{J}\lambda_{J}\right) \end{aligned} \tag{3}$$

where  $A_{O}$ ,  $A_{D}$ ,  $A_{S}$ ,  $A_{J}$  are the percentages of oscillators, DRAM, SRAM and encapsulation sub-components, respectively.

#### Normalization of the Failure Rate

Every mechanism is assumed to cause a possible failure based on its function in every device in the circuit. To accomplish this, we normalize the failure rate for each mechanism, which can be defined as the *benchmark* of the failure rate due to this mechanism. Then, the failure rate of this mechanism in these functionalities can be expressed relative to the benchmark of other mechanisms. For example, if the benchmark failure rate of the HCI mechanism is  $\lambda_{HCI}$ , the HCI failure rate of oscillator may be 1.5  $\lambda_{HCI}$ , that of DRAM may be 0.7  $\lambda_{HCI}$  and so on. Similarly, the benchmarks of TDDB, EM and NBTI can be defined as  $\lambda_{TDDB}$ ,  $\lambda_{EM}$  and  $\lambda_{NBTI}$ , respectively.

So the failure rate of an oscillator is:

$$\lambda_{O}' = B_{1-O}\lambda_{HCI} + B_{2-O}\lambda_{TDDB} + B_{3-O}\lambda_{EM} + B_{4-O}\lambda_{NBTI} \tag{4}$$

where  $B_{1-O}$ ,  $B_{2-O}$ ,  $B_{3-O}$  and  $B_{4-O}$  are relative constants to the benchmarks. Similarly, the failure rate of DRAM, SRAM and encapsulation are:

$$\lambda_{D}' = B_{1-D}\lambda_{HCI} + B_{2-D}\lambda_{TDDB} + B_{3-D}\lambda_{EM} + B_{4-D}\lambda_{NBTI} \tag{5}$$

$$\lambda_{S}' = B_{1-S}\lambda_{HCI} + B_{2-S}\lambda_{TDDB} + B_{3-S}\lambda_{EM} + B_{4-S}\lambda_{NBTI} \tag{6}$$

$$\lambda_{J}' = B_{1-J}\lambda_{HCI} + B_{2-J}\lambda_{TDDB} + B_{3-J}\lambda_{EM} + B_{4-J}\lambda_{NBTI} \tag{7}$$

#### At-Use Probability

Considering the at-use probability, the actual failure rates are:

$$\lambda_{O} = \lambda_{O}' \cdot P_{O} = (B_{1-O}\lambda_{HCI} + B_{2-O}\lambda_{TDDB} + B_{3-O}\lambda_{EM} + B_{4-O}\lambda_{NBTI}) \cdot P_{O} \tag{8}$$

$$\lambda_{D} = \lambda_{D}' \cdot P_{D} = (B_{1-D}\lambda_{HCI} + B_{2-D}\lambda_{TDDB} + B_{3-D}\lambda_{EM} + B_{4-D}\lambda_{NBTI}) \cdot P_{D} \tag{9}$$

$$\lambda_{S} = \lambda_{S}' \cdot P_{S} = (B_{1-S}\lambda_{HCI} + B_{2-S}\lambda_{TDDB} + B_{3-S}\lambda_{EM} + B_{4-S}\lambda_{NBTI}) \cdot P_{S} \tag{10}$$

$$\lambda_{J} = \lambda_{J}' \cdot P_{J} = (B_{1-J}\lambda_{HCI} + B_{2-J}\lambda_{TDDB} + B_{3-J}\lambda_{EM} + B_{4-J}\lambda_{NBTI}) \cdot P_{J} \tag{11}$$

Substituting the sub-component failure rates (8)-(11) into (3), $\lambda_c = N(A_O\lambda_O + A_D\lambda_D + A_S\lambda_S + A_J\lambda_J)$, the failure rate of a chip can be expressed as follows:

$$\begin{aligned} \lambda_{c} &= N\left(A_{O}\lambda_{O} + A_{D}\lambda_{D} + A_{S}\lambda_{S} + A_{J}\lambda_{J}\right) \\ &= N \cdot \begin{pmatrix} A_{O} & A_{D} & A_{S} & A_{J} \end{pmatrix} \cdot \begin{pmatrix} \lambda_{O} & \lambda_{D} & \lambda_{S} & \lambda_{J} \end{pmatrix}^{T} \\ &= N \cdot \begin{pmatrix} A_{O} & A_{D} & A_{S} & A_{J} \end{pmatrix} \cdot \begin{bmatrix} B_{1-O} \cdot P_{O} & B_{2-O} \cdot P_{O} & B_{3-O} \cdot P_{O} & B_{4-O} \cdot P_{O} \\ B_{1-D} \cdot P_{D} & B_{2-D} \cdot P_{D} & B_{3-D} \cdot P_{D} & B_{4-D} \cdot P_{D} \\ B_{1-S} \cdot P_{S} & B_{2-S} \cdot P_{S} & B_{3-S} \cdot P_{S} & B_{4-S} \cdot P_{S} \\ B_{1-J} \cdot P_{J} & B_{2-J} \cdot P_{J} & B_{3-J} \cdot P_{J} & B_{4-J} \cdot P_{J} \end{bmatrix} \cdot \begin{pmatrix} \lambda_{HCI} & \lambda_{TDDB} & \lambda_{EM} & \lambda_{NBTI} \end{pmatrix}^{T} \end{aligned} \tag{12}$$

If it is known that $C_1 = \frac{\lambda_{HCI}}{\lambda_{TDDB}}$, $C_2 = \frac{\lambda_{EM}}{\lambda_{TDDB}}$, and $C_3 = \frac{\lambda_{NBTI}}{\lambda_{TDDB}}$, then (12) can be rewritten as:

$$\lambda_{c} = N \cdot \begin{pmatrix} A_{O} & A_{D} & A_{S} & A_{J} \end{pmatrix} \cdot \begin{bmatrix} B_{1-O} \cdot P_O & B_{2-O} \cdot P_O & B_{3-O} \cdot P_O & B_{4-O} \cdot P_O \\ B_{1-D} \cdot P_D & B_{2-D} \cdot P_D & B_{3-D} \cdot P_D & B_{4-D} \cdot P_D \\ B_{1-S} \cdot P_S & B_{2-S} \cdot P_S & B_{3-S} \cdot P_S & B_{4-S} \cdot P_S \\ B_{1-J} \cdot P_J & B_{2-J} \cdot P_J & B_{3-J} \cdot P_J & B_{4-J} \cdot P_J \end{bmatrix} \cdot \begin{pmatrix} C_1 & 1 & C_2 & C_3 \end{pmatrix}^T \cdot \lambda_{TDDB} \tag{13}$$

Assuming the TDDB defect generation follows the E model, which is accepted to provide a good fit to data from long-term low field TDDB stresses, then

$$t_{TDDB} = B_1 \exp(-\gamma E_{ox}) \exp(E_a / kT) \tag{14}$$

where $E_{ox}$ is the externally applied electric field across the dielectric in units of MV/cm, $\gamma$ is the field acceleration factor, $E_a$ is the thermal activation energy ($E_a = 0.6 \sim 0.9$ eV), and $B_1$ is technology dependent. Because the failure rate can be regarded as a constant for TDDB, based on (14) and with $B = 1/B_1$ we can obtain:

$$\begin{aligned} \lambda_{TDDB} = \frac{1}{t_{TDDB}} &= B \exp(\gamma E_{ox}) \exp(-E_a / kT) \\ &= B \exp(\gamma E_{ox} - E_a / kT) \\ &= B \cdot \exp\left[\begin{pmatrix} \gamma & -1/kT \end{pmatrix} \cdot \begin{pmatrix} E_{ox} \\ E_a \end{pmatrix}\right] \end{aligned} \tag{15}$$

Then (13) can be rewritten as:

$$\lambda_{c} = N \cdot B \cdot \begin{pmatrix} A_{O} & A_{D} & A_{S} & A_{J} \end{pmatrix} \cdot \begin{bmatrix} B_{1-O} \cdot P_{O} & B_{2-O} \cdot P_{O} & B_{3-O} \cdot P_{O} & B_{4-O} \cdot P_{O} \\ B_{1-D} \cdot P_{D} & B_{2-D} \cdot P_{D} & B_{3-D} \cdot P_{D} & B_{4-D} \cdot P_{D} \\ B_{1-S} \cdot P_{S} & B_{2-S} \cdot P_{S} & B_{3-S} \cdot P_{S} & B_{4-S} \cdot P_{S} \\ B_{1-J} \cdot P_{J} & B_{2-J} \cdot P_{J} & B_{3-J} \cdot P_{J} & B_{4-J} \cdot P_{J} \end{bmatrix} \cdot \begin{pmatrix} C_{1} & 1 & C_{2} & C_{3} \end{pmatrix}^{T} \cdot \exp\left[\begin{pmatrix} \gamma & -1/kT \end{pmatrix} \cdot \begin{pmatrix} E_{ox} \\ E_{a} \end{pmatrix}\right] \tag{16}$$

where $E_{ox}$ is the externally applied electric field across the dielectric in units of MV/cm, $\gamma$ is the field acceleration factor, $E_a$ is the thermal activation energy ($E_a = 0.6 \sim 0.9$ eV), and $B$ is a technology constant.

This approach allows accelerated testing to be performed at increased voltages, temperatures and power levels to accelerate separate mechanisms in order to calibrate this matrix to actual components. Once the matrix is solved, an appropriate qualification procedure can be developed that will determine the failure rate during actual operating conditions. Furthermore, the system can be derated for a more robust design and prolonged failure-free operation.
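To make the matrix formulation concrete, the following sketch evaluates (13) with the TDDB rate from (15) in plain Python. Every numeric value in the usage example is hypothetical and chosen only to exercise the arithmetic; the function names `tddb_failure_rate` and `chip_failure_rate` are likewise assumptions, and calibrated coefficients would have to come from accelerated test data as described above.

```python
import math

K_BOLTZMANN_EV = 8.62e-5  # eV/K, value quoted earlier in the article


def tddb_failure_rate(b_const, gamma, e_ox, e_a, temp_k):
    """Eq. (15): lambda_TDDB = B * exp(gamma * E_ox - E_a / (k * T))."""
    return b_const * math.exp(gamma * e_ox - e_a / (K_BOLTZMANN_EV * temp_k))


def chip_failure_rate(n_cells, fractions, b_matrix, at_use, c_ratios, lam_tddb):
    """Eq. (13): N * A . [B_ij * P_i] . (C1, 1, C2, C3)^T * lambda_TDDB.

    Rows of b_matrix follow the sub-component order (O, D, S, J); columns
    follow the mechanism order (HCI, TDDB, EM, NBTI).
    """
    c1, c2, c3 = c_ratios
    mechanism_vector = (c1, 1.0, c2, c3)
    total = 0.0
    for a_i, row, p_i in zip(fractions, b_matrix, at_use):
        row_sum = sum(b * m for b, m in zip(row, mechanism_vector))
        total += a_i * p_i * row_sum
    return n_cells * total * lam_tddb


# Hypothetical inputs, for illustration only.
lam_tddb = tddb_failure_rate(b_const=1e-3, gamma=2.0, e_ox=5.0, e_a=0.7, temp_k=328.15)
lam_chip = chip_failure_rate(
    n_cells=1000,
    fractions=[0.1, 0.5, 0.3, 0.1],   # A_O, A_D, A_S, A_J (sum to 1)
    b_matrix=[[1.5, 1.0, 1.0, 1.0],   # oscillator
              [0.7, 1.0, 1.0, 1.0],   # DRAM
              [1.0, 1.0, 1.0, 1.0],   # SRAM
              [1.0, 1.0, 1.0, 1.0]],  # encapsulation
    at_use=[1.0, 0.5, 0.5, 1.0],      # P_O, P_D, P_S, P_J
    c_ratios=(0.8, 1.2, 0.6),         # C_1, C_2, C_3
    lam_tddb=lam_tddb,
)
print(f"lambda_TDDB = {lam_tddb:.3e}, lambda_chip = {lam_chip:.3e}")
```

Keeping the cell fractions, the weighting matrix, and the mechanism ratios as separate inputs mirrors the structure of (13), so each can be recalibrated independently.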

## **But Does It Actually Work?**

DfR Solutions, LLC performed an extensive validation study in cooperation with a telecommunications company.

Table 1. Validation study results: failure rates (FIT)

| Part Number         | Field | Predicted | HTOL |
|---------------------|-------|-----------|------|
| MT16LSDF3264HG      | 689   | 730       | 51   |
| M470L6524DU0        | 415   | 418       | 51   |
| HYMD512M646BF8      | 821   | 1012      | 51   |
| MC68HC908SR12CFA    | 220   | 249       | 51   |
| RH80536GC0332MSL7EN | 144   | 291       | 51   |

Actual field data were extracted from a database, which encompasses shipments and customers' claims. Unique identifiers of each product and failure enabled the detailed statistical field analysis. The ICs were assembled on boards belonging to a family of communication products shipped during 2002-2009. Component complexity and electrical characteristics were extracted from corresponding component documentation for use in the calculator. The reliability calculations were based on the time domain of the host computer. Except for the microcontroller, which was stressed 24 hours a day, we assumed that memory parts and the processor were partly stressed depending on the user profile. A conservative assumption was that a regular user will stress the parts two shifts/day, i.e. 16 hours/day.
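The effect of the usage assumption on a field failure rate can be illustrated with a minimal sketch. It assumes a constant failure rate and uses the standard definition of one FIT as one failure per 10^9 device-hours. The population size, time in field, and failure count are hypothetical placeholders, not values from the study, and the function names are illustrative only.

```python
def stressed_device_hours(units, years_in_field, hours_per_day):
    """Accumulated stressed hours for a population of identical parts."""
    return units * years_in_field * 365.0 * hours_per_day


def fit_rate(failures, device_hours):
    """Point estimate of a constant failure rate in FIT
    (failures per 1e9 device-hours)."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures / device_hours * 1e9


# Hypothetical population, for illustration only.
UNITS = 10_000
YEARS = 3.0
FAILURES = 12

# Part stressed around the clock (as assumed for the microcontroller).
fit_always_on = fit_rate(FAILURES, stressed_device_hours(UNITS, YEARS, 24))

# Part stressed two shifts per day (as assumed for memory and processor).
fit_two_shift = fit_rate(FAILURES, stressed_device_hours(UNITS, YEARS, 16))

print(f"24 h/day assumption: {fit_always_on:.1f} FIT")
print(f"16 h/day assumption: {fit_two_shift:.1f} FIT")
```

For the same number of failures, counting only 16 stressed hours per day instead of 24 raises the estimated rate per stressed hour by a factor of 1.5, so the usage profile has a direct effect on the field figure.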

A statistical analysis comprising Weibull and exponential analyses was used to calculate the field failure rates of five integrated circuits that were identified as the root cause of failure in their respective assemblies. In a parallel activity, each of these components was "soft" analyzed by simply reading datasheet and thermal/package documentation. Using the methodology described in this article to acquire the inputs to the IC lifetime calculator, failure rates for each integrated circuit were calculated. Table 1 shows the field, predicted and the traditional HTOL failure rates for each component.

It should be noted that the DRAM failure rates presented in Table 1 and Fig. 3 refer to critical faults which forced the user to replace the part. They do not reflect the specific rates of different kinds of errors (correctable or non-correctable, such as single event upsets caused by radiation) but rather a complete part failure rate.



*Fig. 3.* A comparison of the field failure rates and the simulation results, along with the confidence interval obtained by the Weibull analysis. Vertical axis: failure rate [FIT]. Legend: simulation FR; field FR (rough estimation); upper/lower limits of FR according to Weibull analysis; point estimation of FR according to Weibull analysis.

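As the results in Table 1 show, the failure rate of an integrated circuit can be predicted with useful accuracy: the predicted rates fall within about 25% of the field rates for four of the five parts and within roughly a factor of two for the fifth, whereas the traditional HTOL figure of 51 FIT understates every field rate by a factor of roughly 3 to 16. This answers our "when might failure occur" question. With a realistic data point to start from, trade-off analysis can follow, showing how uprating or derating devices will affect their reliability and performance.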
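The agreement can be checked directly from the Table 1 values. The short sketch below computes, for each part, the ratio of predicted to field failure rate and the ratio of field to HTOL failure rate. The data are copied from Table 1; the function and variable names are illustrative only.

```python
# Failure rates in FIT from Table 1: (field, predicted, HTOL).
TABLE_1 = {
    "MT16LSDF3264HG": (689, 730, 51),
    "M470L6524DU0": (415, 418, 51),
    "HYMD512M646BF8": (821, 1012, 51),
    "MC68HC908SR12CFA": (220, 249, 51),
    "RH80536GC0332MSL7EN": (144, 291, 51),
}


def compare(table):
    """Return {part: (predicted/field, field/HTOL)} for every part."""
    return {
        part: (predicted / field, field / htol)
        for part, (field, predicted, htol) in table.items()
    }


ratios = compare(TABLE_1)
for part, (pred_over_field, field_over_htol) in ratios.items():
    print(f"{part:<20s} predicted/field = {pred_over_field:5.2f}   "
          f"field/HTOL = {field_over_htol:5.1f}")
```

A predicted-to-field ratio close to one indicates good agreement, while a large field-to-HTOL ratio shows how far the HTOL figure sits below the observed field rate.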

Several commercial organizations have indicated a willingness to assist with the development and validation of 45 nm technology through IC test components and acquisition of field failure data. Continued development would incorporate this information, add material sets such as silicon on insulator, and then expand into functional groups relevant to analog and processor-based (e.g., DSP and FPGA) integrated circuits. Our capabilities currently include the major technology nodes of CMOS processes on the International Technology Roadmap for Semiconductors (ITRS) from 0.35 micron through 90 nm. If you or your company is interested in the methodologies discussed in this article or would like to participate in the development of the integrated circuit lifetime prediction calculator, feel free to contact the authors.

## Acknowledgements

The initial work for 0.13 micron and 90 nm technologies was funded by Aero Engine Controls, Boeing, General Electric (GE), the National Aeronautics and Space Administration (NASA), the Department of Defense (DoD), and the Federal Aviation Administration (FAA) in cooperation with the Aerospace Vehicle Systems Institute (AVSI). DfR Solutions is now working to extend the capability of the tool into smaller technology nodes, including 65 nm, 45 nm, and 32 nm.

## References

- [1] B. Yan, J. Qin, J. Dai, Q. Fan, and J.B. Bernstein, "Reliability simulation and circuit-failure analysis in analog and mixed-signal applications," *IEEE Trans. on Device and Mat. Rel.*, vol. 9, no. 3, pp. 339-347, 2009.
- [2] Joint Electronic Devices Engineering Council, "JESD 47: Stress-Test Driven Qualification of Integrated Circuits," ed. D, 2004. [Online]. Available: http://www.jedec.org/.

# **For Further Reading**

L. Yang and J.B. Bernstein, "Failure rate estimation of known failure mechanisms of electronic packages," *Microelectronics Reliability*, vol. 49, pp. 1563-1572, 2009.

J.B. Bernstein, M. Gurfinkel, X. Li, J. Walters, Y. Shapira, and M. Talmor, "Electronic circuit reliability modeling," *Microelectronics Reliability*, vol. 46, pp. 1957-1979, 2006.

X. Li, J. Qin, B. Huang, X. Zhang, and J.B. Bernstein, "SRAM circuit failure modeling and reliability simulation with SPICE," *IEEE Trans. on Device and Mat. Rel.*, vol. 6, no. 2, pp. 235-246, 2006.

X. Li, J. Qin, B. Huang, X. Zhang, and J.B. Bernstein, "A new SPICE reliability simulation method for deep submicron VLSI circuits," *IEEE Trans. on Device and Mat. Rel.*, vol. 6, no. 2, pp. 247-257, 2006.

X. Li, J. Qin, and J.B. Bernstein, "Compact modeling of MOSFET wearout mechanisms for circuit-reliability simulation," *IEEE Trans. on Device and Mat. Rel.*, vol. 8, no. 1, pp. 98-121, 2008.

Edward J. Wyrwas (ewyrwas@dfrsolutions.com) is an electrical engineer on DfR Solutions' technical staff. He has presented on silicon-based integrated circuit reliability to numerous companies and organizations including Defense Microelectronics Activity (DMEA), Boeing, Motorola, and Ericsson. His current research includes characterizing transistor failure behavior over a range of technology nodes and functional groups. He holds a copyright for his work on "Fault Location in Series Compensated Transmission Lines." He is a member of IEEE Electron Devices Society, IEEE Instrumentation and Measurement Society, and the Association for Computing Machinery (ACM). His specialties include stress testing, failure analysis, materials characterization, software development and information security. Ed received his B.S. in electrical engineering and computer engineering from Widener University in 2007.

Joseph B. Bernstein (joeybern@gmail.com) is a professor at Bar Ilan University, Israel. Professor Bernstein's expertise lies in several areas of nano-scale micro-electronic device reliability and physics of failure research including system reliability modeling, gate oxide integrity, radiation effects, MEMS and laser programmable metal interconnect. He directs the Bar Ilan Laboratory for Failure Analysis and Reliability of Electronic Systems.

