This paper focuses on the ability to locate, isolate, and diagnose faults in the firmware and software of networking equipment. It provides specific examples of failures found at Allied Telesyn during HALT in the development phase. Written by D. Johnson & K. Franks.
The establishment of Allied Telesyn’s Highly Accelerated Life Testing (HALT) and Highly Accelerated Stress Screening (HASS) processes has followed a traditional path, from the original proposal and initial rejection to the implementation and subsequent acceptance of the principles applied and the value gained from conducting Accelerated Stress Testing.
Throughout this journey, one topic has always appeared underreported: HALT discovers an appreciable number of faults that are attributed to software. This white paper focuses on the discovery of these failures and the isolation techniques involved, and provides examples of software faults found at Allied Telesyn.
HALT testing is conducted throughout the design stage of product development to highlight any major problems with products so they can be isolated, analyzed and corrected in a timely manner. HASS is conducted on a sample basis at the company’s three factories. The screens created during HASS development are used for No Trouble Found (NTF) debugging, component qualification and software patch verification.
Until 2005, all of Allied Telesyn’s HALT and HASS testing had been conducted offsite using an independent lab in San Jose, California. This testing process has been extremely useful for Allied Telesyn staff to gain experience and knowledge, but it has also been expensive because of the company’s remote location in the South Island of New Zealand and the cost of travel to California.
In 2003, Dr. Gregg Hobbs was invited to Allied Telesyn’s New Zealand design centre to teach his Mastering HALT and HASS seminar. This led to an experimental HALT conducted locally under the guidance of Dr. Hobbs and ultimately to the implementation of a comprehensive HALT and HASS program in Christchurch. Allied Telesyn encountered multiple obstacles and challenges that are often inherent to setting up a HALT program. Initially, many engineers were sceptical about the relevance of failures found at the high stress levels applied during the HALT process.
This response is common when first adjusting to overstress techniques, especially when testing is conducted remotely, because it is difficult to directly isolate and diagnose software faults, let alone to debug them from the other side of the world. The cooperation of Allied Telesyn’s Engineering team, coupled with extensive feedback through comprehensive debug reports, has overcome these initial frustrations. Top-level management support is imperative for the implementation of any HALT and HASS program.
What is a software fault?
The term “software fault” is defined at Allied Telesyn as a fault found in:
- The firmware of a product, such as code in a Programmable Logic Device (PLD).
- The boot code of a product, such as EPROM boot code.
- The operating system of a product.
These software faults may occur because of changes in the performance of the associated hardware.
Testing & Monitoring
The sequence of tests applied to a product during HALT and HASS plays a critical role in the ability to uncover software faults. The monitoring process provides a snapshot of each product’s status shortly before a failure occurs.
The test sequence enables us to identify a myriad of software faults related to the fundamental operation of products, including clock signals, voltage rail monitoring, and environmental factors, which can have an effect on product stability. Other factors that may also be monitored during testing include read/write timing, chip selects, reset pulses, and signal integrity.
A considerable amount of debug information must be extracted from the testing process in order to highlight failures; without it, fault isolation would be an extremely laborious process. The debug information may include memory dumps, voltage measurements, clock observations, and other product-specific measurements.
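The idea of keeping a snapshot of product status from shortly before a failure can be sketched as a small ring buffer of monitoring samples. This is a minimal illustration only: the `Sample` fields, the class name `PreFailureLog`, and the buffer depth are assumptions, not a description of Allied Telesyn's actual tooling.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One monitoring reading; all field names are illustrative."""
    t_s: float          # seconds since the start of the run
    chamber_c: float    # chamber temperature in deg C
    rail_v: float       # one monitored voltage rail, in volts
    self_test_ok: bool  # result of the unit's self diagnostics


class PreFailureLog:
    """Keeps only the most recent samples so that the state just
    before a failure can be dumped for debugging."""

    def __init__(self, depth: int) -> None:
        self._buf = deque(maxlen=depth)

    def record(self, sample: Sample) -> None:
        self._buf.append(sample)

    def snapshot(self) -> list:
        """Return the retained samples, oldest first."""
        return list(self._buf)


def run_until_failure(samples, log):
    """Feed samples into the log and return the first failing sample,
    or None if every sample passed. The caller pairs the result with
    log.snapshot() to see the lead-up to the failure."""
    for s in samples:
        log.record(s)
        if not s.self_test_ok:
            return s
    return None
```

A bounded buffer keeps memory use fixed over a long run while still preserving the readings that matter most for fault isolation: the ones taken immediately before the failure.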
Testing & Monitoring for HALT
HALT testing and monitoring need to be as comprehensive as possible. Experience shows that greater test coverage, coupled with exhaustive product monitoring, yields a wealth of diagnostic information that can then be used while debugging each failure mode. Before starting the HALT, test developers need to put considerable thought into developing a broad range of tests that will reveal the information required, together with an all-encompassing monitoring program.
Testing & Monitoring for HASS
Highly Accelerated Stress Screening (HASS) occurs on multiple sample units of each product during mass manufacturing. Appropriate design measures are important for HASS to ensure that external monitoring equipment can be used; they should include provision for both hardware monitoring and built-in software testing in conjunction with data logging.
Typical Test & Monitoring Process
Allied Telesyn runs a fifteen-minute dwell at each step during the HALT process. During this time the unit is functionally tested and monitored for failure. Some examples of the testing conducted include:
- External traffic test, using industry standard equipment
- Voltage and frequency margining
- RAM test
- NVS test
- CPU test
- Encryption engine test
In addition to this test sequence the product is monitored in real time to ensure the slightest change in operating specification is observed and recorded. The product monitoring includes:
- Voltage rails
- Critical system signals
- Self diagnostics
All of these exert some influence over whether a fault will be discovered and corrected or remain unnoticed and hamper a product for the duration of its life in the field.
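The dwell-and-test loop described above can be sketched as follows. The fifteen-minute dwell comes from the text; the 10°C step size used in the usage note, the function names, and the pass/fail callables are assumptions made only for illustration. A real run would drive a chamber and real test equipment rather than simple functions.

```python
DWELL_MINUTES = 15  # dwell per step, as stated in the text


def step_schedule(start_c, stop_c, step_c):
    """Return chamber setpoints from start_c to stop_c inclusive.
    step_c is a positive magnitude; the direction is inferred."""
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    direction = 1 if stop_c >= start_c else -1
    points = []
    t = start_c
    while (t - stop_c) * direction <= 0:
        points.append(t)
        t += direction * step_c
    return points


def run_steps(setpoints, tests):
    """Run every test at every setpoint, stopping at the first step
    where any test fails.

    tests maps a test name to a callable that takes the setpoint in
    deg C and returns True on pass.
    Returns (last_passing_setpoint, failing_setpoint, failed_names).
    """
    last_pass = None
    for t in setpoints:
        failed = [name for name, check in tests.items() if not check(t)]
        if failed:
            return last_pass, t, failed
        last_pass = t
    return last_pass, None, []


def total_dwell_minutes(setpoints):
    """Chamber time spent dwelling, ignoring ramp time between steps."""
    return len(setpoints) * DWELL_MINUTES
```

For example, `step_schedule(20, -10, 10)` gives the cold-step setpoints `[20, 10, 0, -10]`, and `run_steps` reports the last setpoint at which every test passed, which is the kind of figure recorded as an operating limit.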
Examples of Fault Isolation Techniques
HALT is a creative process with many paths to a reliable product and many innovative approaches to fault isolation. Using hardware and software tools, fault isolation can be as rudimentary or as complex as the engineer’s creativity allows. The following examples demonstrate how the specific cause of a failure can be located in a simple and efficient manner.
Freeze Spray Example
The controlled application of freeze spray to a suspect component identifies the area where a failure is generated. This keeps the temperature of the suspected cause of failure at room temperature while the rest of the product is heated, providing a straightforward and cost-effective way to isolate heat-related faults.
Heat Application Example
The controlled application of heat to a suspect component via a power resistor or Peltier device identifies components that are failing due to cold temperature. Keeping this component isolated from the surrounding environmental conditions can provide a simple way of identifying the root cause of a failure.
Allied Telesyn Fault Examples
This section focuses on software failures found by Allied Telesyn during HALT and HASS testing. The figure below shows that almost one third of all failures found during Allied Telesyn’s HALT testing are attributed to software. Two percent of failures are yet to be determined.
The following chart compares the failure type to the stress applied, illustrating that all software related failures occur within the first three steps of the HALT process. In most cases, the problem is identified and eliminated prior to the product undergoing combined environment testing.
Common Software Faults Identified with HALT & HASS
The following examples demonstrate the variety of software failures discovered by Allied Telesyn during HALT and HASS.
Abnormal LED Activity
Light Emitting Diodes (LEDs) are used on network products such as routers and switches to indicate their current status. LEDs are often driven directly from the physical interface for basic functionality or, in more complex designs, driven through one or more Programmable Logic Devices (PLDs). This logic includes specific timing signals relative to status display, reset signals, and chip selection.
One fault occurred after a hard power cycle but would not occur after a soft reset. The failure caused random LED activity when powering up, which remained until either the temperature was raised or a soft reset was activated. This anomaly was found during cold step testing at minus 10°C and attributed to the reset pulse timing inside the PLD code.
The fault was isolated by applying heat to the suspect component, and this led to a fix that was applied within thirty minutes. The mechanics of this failure lay in the code of the PLD. Under a hard power cycle the PLD reset pulse duration was insufficient, which led to the board powering up in an unknown state. After the PLD code was updated there were no further LED indication errors at temperatures as low as minus 50°C.
A similar Allied Telesyn product that had been in mass production for three years had also exhibited this fault during standard production testing. The fault had slipped through the traditional design qualification testing when Allied Telesyn did not have a HALT program in place.
Network switches are built around silicon switching components, which divert packets via hardware switching. Some components require tuning over an extended temperature range at various input voltages; one such device is the Marvell Prestera 98EX115D.
While developing a new switch product, it was necessary to tune the silicon switch to ensure correct operation over a broad range of temperatures. At the start of the HALT test, the unit had an Upper Operating Limit (UOL) of 70°C and a Lower Operating Limit (LOL) of minus 20°C.
Step stressing revealed that the switch was not tuned correctly, and over a few days new software versions were created and tested, ultimately fixing the errors. Corrective action through software enhancements culminated in a UOL of greater than 100°C with a LOL of less than minus 60°C.
Power up sequencing is a common requirement in today’s complex electronics, and while there are many dedicated power sequencing components available, it is occasionally more efficient to use onboard logic to control the voltage rail sequencing.
A product that underwent HALT testing exhibited a failure where various components were not functional after a power cycle at minus 20°C. Applying thermal isolation techniques identified the PLD device as the faulty component. The code that controlled the power up sequencing proved unreliable at low temperatures.
This problem had previously been identified and fixed during prototype development at Allied Telesyn’s design centre. However, HALT revealed that the applied fix had simply shifted the failure and made it harder to replicate, which allowed Allied Telesyn to apply a more robust corrective action. The code inside the PLD was modified, allowing the product to reach temperatures lower than minus 50°C without failure.
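One way to monitor for this class of fault is to capture the time at which each voltage rail reaches its power-good level after a power cycle and check the captured order and spacing. The sketch below is hypothetical throughout: the rail names, the minimum gap, and the function `check_sequence` are invented for illustration and do not describe the product or the PLD code discussed above.

```python
def check_sequence(good_times_ms, required_order, min_gap_ms=0.0):
    """Check captured power-good times against a required rail order.

    good_times_ms maps a rail name to the time (ms after power-on) at
    which it reached power-good; rails that never came up are absent.
    Returns a list of human-readable violations (empty if none).
    """
    problems = []
    for name in required_order:
        if name not in good_times_ms:
            problems.append(f"{name} never reached power-good")

    # Compare each rail that did come up with its predecessor.
    present = [n for n in required_order if n in good_times_ms]
    for earlier, later in zip(present, present[1:]):
        gap = good_times_ms[later] - good_times_ms[earlier]
        if gap < min_gap_ms:
            problems.append(
                f"{later} reached power-good {gap:.1f} ms after "
                f"{earlier}; need at least {min_gap_ms:.1f} ms"
            )
    return problems
```

Running such a check after every power cycle at each temperature step would flag a marginal sequence as soon as it appears, rather than waiting for a downstream component to stop functioning.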
System crashes during the design phase are inevitable when developing prototypes that contain onboard CPUs. The key is to stimulate these crashes during development so that a fix can be implemented at minimal resource and time cost.
A product that had been in the field for six months was taken to a HALT lab to investigate its operating margins. At the beginning of the HALT the product attained a UOL of 70°C, and a system crash was observed at this temperature. By changing a register setting for the initialisation of a particular memory interface inside the boot code, the unit was able to function at temperatures above 100°C.
In addition to the software fault, a flaw within the CPU silicon was revealed, which amplified the effects of the software fault. The complete solution came in the release of a new revision of the CPU coupled with the original boot code change. The same problem eventually appeared on three separate products, two of which exhibited it in the field. No further failures occurred after the boot code was modified and the CPU flaw corrected.
System Silent Reboot
A silent reboot occurs when the product reboots without displaying any error or debug messages. These silent reboots are notoriously frustrating to debug and often require exhaustive troubleshooting.
A major Allied Telesyn customer with 22,000 units of one type of product in a major network was experiencing eight silent reboots each day, which represents a daily failure rate of about 0.036%. Allied Telesyn faced the problem of how to replicate a failure that occurred on only about 0.036% of units on any given day.
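The arithmetic behind this figure, and why bench replication is so slow, can be worked through in a few lines. The unit count and reboot count come from the text; the assumption that the field rate carries over unchanged to bench units, and that reboots are independent across units and days, is a simplification introduced here.

```python
UNITS_IN_FIELD = 22_000  # from the text
REBOOTS_PER_DAY = 8      # from the text

# Fraction of units that reboot on a given day.
daily_rate = REBOOTS_PER_DAY / UNITS_IN_FIELD

# Unit-days of operation per observed reboot.
unit_days_per_reboot = UNITS_IN_FIELD / REBOOTS_PER_DAY


def expected_reboots(units_on_bench, days):
    """Expected number of reboots seen on the bench, assuming the
    field rate applies unchanged (an assumption, not a measurement)."""
    return units_on_bench * days * daily_rate


print(f"daily rate: {daily_rate:.3%}")                       # 0.036%
print(f"unit-days per reboot: {unit_days_per_reboot:.0f}")   # 2750
print(f"10 units, 30 days: {expected_reboots(10, 30):.2f}")  # 0.11
```

Under these assumptions, ten units run for a month at field conditions would be expected to show roughly a tenth of a reboot, which is consistent with the weeks of intermittent results from traditional methods described next.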
The same fault took weeks to replicate intermittently using traditional methods, which led to the use of a simplified HALT in an attempt to identify the problem. The same failure mode was repeatedly replicated in less than one day of testing, enabling software engineers to isolate and remedy the failure cause in a short space of time.
Rapid Thermal Transitions exposed a flaw in software during temperature ramps, even though the initial failure occurred in a moderate climate inside a server room. The failure mode was only apparent when running one particular test. A software patch was released to fix this problem.
Software is an integral part of today’s electronics, and ensuring that the software on our products is reliable is a vital part of delivering quality products to our customers. Reliability failures in hardware are usually caused by wear. However, software does not wear out and may continue to function after an initial failure, making the fault harder to replicate, isolate and analyse. HALT and HASS are excellent tools for uncovering dormant defects in both hardware and software, but without an exhaustive test and monitoring plan, many software faults will continue to go undetected.
It is essential for a comprehensive HALT program to include relevant monitoring and test suites that uncover more than just hardware faults. Having a comprehensive test plan leads to substantial information that can then be used to judge the relevance of each failure and the corrective action required. Experience at Allied Telesyn has shown that accurate fault isolation is a fundamental aspect of HALT and HASS testing, without which the process of analysing faults and implementing corrective actions may be an exercise in trial and error.
Thorough planning and consultation prior to conducting HALT should provide an overall picture of the product’s dependencies and highlight the steps necessary for testing and monitoring these critical aspects. The HALT chamber is a very useful tool for increasing the reliability of a product, but it is not a magic box and will only provide the outlined results if used to its full capabilities.
Glossary of Abbreviations
|FLT|Fundamental Limit of Technology|
|HALT|Highly Accelerated Life Testing|
|HASS|Highly Accelerated Stress Screening|
|LED|Light Emitting Diode|
|LOL|Lower Operating Limit|
|NTF|No Trouble Found|
|PLD|Programmable Logic Device|
|POS|Proof Of Screen|
|RMA|Returned Materials Authorization|
|UOL|Upper Operating Limit|