Importance of Failover Testing during Test Planning of Safety Critical Systems
Failover software testing is viewed as an essential component in our reliability test planning when the risks associated with the failure of an application or system are assessed as unacceptably high.
Ensuring that failover mechanisms are implemented to address these risks is primarily the concern of system architects. An important element of our reliability testing approach should therefore be technical reviews of the architectural documents that describe the proposed failover measures. The technical reviews should focus on how the hardware and software architecture ensures that alternative system components are used if a particular component fails.
Testing experts such as ISTQB-certified “Technical Test Analysts” need to understand these measures so that architectural faults can be detected early and the impact of the failover measures on testing can be assessed.
Testing experts such as “Test Analysts” understand the stages in an application’s life cycle where reliability tests may be applied.
“Technical Test Analysts”, by contrast, understand the different types of failover mechanisms very well and are ideally suited for reliability testing of safety-critical systems because of expertise such as the following:
1) Understanding of reliability risks and their potential impact on a system.
2) Understanding and comparing the different types of reliability tests.
3) Understanding and explaining the stages in an application’s life cycle where reliability tests may be applied and where synergies with other types of testing can be achieved.
4) Proposing a risk-based approach for testing reliability.
5) Ability to specify different types of reliability tests.
6) Understanding the different types of tools that can support reliability testing.
The following failover measures are possible:
1) Use of redundant hardware devices (e.g., servers, processors, disks), which are arranged such that one component immediately takes over from another should it fail. Disks, for example, can be included in the architecture as a RAID element (Redundant Array of Inexpensive Disks).
2) Redundant software implementation, in which more than one independent instance of a software system is implemented (perhaps by independent teams), using the same set of requirements. These so-called redundant dissimilar systems are expensive to implement but provide a level of risk coverage against external events (e.g., defective inputs) that are less likely to be handled in the same way and therefore less likely to cause software failures.
3) Use of multiple levels of redundancy, which can be applied to both software and hardware to effectively add additional “safety nets” should a component fail. These systems are called duplex, triplex, or quadruplex systems, depending on how many independent instances (2, 3, or 4 respectively) of the software or hardware are implemented.
4) Use of detection and switching mechanisms for determining whether a failure in the software or hardware has occurred and whether to switch (failover) to an alternative. Sometimes these decisions are relatively simple; software has crashed or hardware has failed and a failover needs to be enacted. In other circumstances, the decision may not be that simple. A hardware component may be physically available but supplying incorrect data due to some malfunction. Mechanisms need to be implemented that enable these untrustworthy data sources to be identified and trustworthy ones used instead. In software, these mechanisms are often referred to as voting systems because they are constantly monitoring and conducting a vote on which of the redundant data sources to trust. Ultimately these systems may shut down hardware components deemed to be no longer trustworthy (i.e., failed).
Depending on the type of redundancy implemented (duplex, triplex, etc.), voting systems can be highly complex and are often among the most critical components in the software. For these reasons, it is advisable to include thorough structural and specification-based testing of this software in the testing approach. Since voting software is highly rule- and state-based, the adoption of decision table testing or state transition testing techniques may be appropriate.
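As an illustration of how decision table testing might be applied to voting software, the following minimal sketch models a triplex voter. The function majority_vote, the use of None to represent a failed channel, and the strict-majority rule are all assumptions made for this example; a real voter would follow the rules laid down in its own specification.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(readings: Sequence[Optional[int]]) -> Optional[int]:
    """Return the value reported by a strict majority of all channels, else None.

    A reading of None stands for a channel that has failed or been shut down
    (an assumption of this sketch).
    """
    live = [r for r in readings if r is not None]
    if not live:
        return None
    value, count = Counter(live).most_common(1)[0]
    # The majority is measured against all channels, not only the live ones.
    return value if count * 2 > len(readings) else None


# Decision table: (channel A, channel B, channel C, expected voter output).
DECISION_TABLE = [
    (10, 10, 10, 10),          # all channels agree
    (10, 10, 99, 10),          # one channel supplies incorrect data
    (10, None, 10, 10),        # one channel has failed outright
    (10, 20, 30, None),        # no agreement, so no value can be trusted
    (10, None, None, None),    # a single live channel is not a majority
    (None, None, None, None),  # all channels lost
]


def run_decision_table() -> List[str]:
    """Run every row of the table and return a description of each mismatch."""
    failures = []
    for a, b, c, expected in DECISION_TABLE:
        actual = majority_vote([a, b, c])
        if actual != expected:
            failures.append(f"inputs {(a, b, c)}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    problems = run_decision_table()
    print("all rules passed" if not problems else "\n".join(problems))
```

Each row of the table corresponds to one rule: a combination of channel conditions and the action expected from the voter. Listing the rules explicitly makes it easier to see which combinations of failed and disagreeing channels have not yet been covered.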
Dynamic testing of the failover mechanisms of complete applications or systems of systems is an essential element of a reliability testing approach. The value of these tests arises from our ability to realistically describe the failure modes to be handled and simulate them in a controlled and fully representative environment.
Test Specification for Failover Software Testing
In contrast to fault tolerance testing, the test design for failover testing is primarily concerned with identifying different hardware and software conditions that could cause the system to actually fail.
If a Failure Mode and Effect Analysis (FMEA) or a Software Common Cause Failure Analysis (SCCFA) has been performed for the system under test, a valuable source of failover conditions may be available. Otherwise, failure conditions must be identified in the same manner as conditions for fault tolerance tests.
Test cases designed for system tests typically consist of the following elements:
1) Simulation of the failure conditions (or specific combinations of conditions)
2) Evaluation of whether the failure condition was correctly detected and the required failover mechanisms activated (possibly also within a maximum time period)
3) Verification that functionality and data are consistent with the pre-failover state
Test cases may also be developed explicitly to test the software responsible for identifying failures and initiating the failover mechanisms. These tests are generally designed to the highest level of rigor where safety-critical systems are involved.
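The three test case elements listed above can be sketched in code as follows. The classes Node and FailoverPair are hypothetical in-memory stand-ins invented for this example, and synchronous replication between the two nodes is a simplifying assumption; a real failover test would inject failures into a representative environment rather than flip a flag.

```python
import time


class Node:
    """Hypothetical stand-in for a server holding replicated data."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}


class FailoverPair:
    """Hypothetical duplex system: writes are replicated, reads use the active node."""

    def __init__(self):
        self.primary = Node("primary")
        self.standby = Node("standby")
        self.active = self.primary

    def write(self, key, value):
        # Synchronous replication to every healthy node (simplifying assumption).
        for node in (self.primary, self.standby):
            if node.healthy:
                node.data[key] = value

    def read(self, key):
        return self.active.data[key]

    def check_health(self):
        """Detection and switching: move to the other node if the active one failed."""
        if not self.active.healthy:
            other = self.standby if self.active is self.primary else self.primary
            if other.healthy:
                self.active = other


def run_failover_test(max_failover_seconds=1.0):
    system = FailoverPair()
    system.write("order-1", "confirmed")
    snapshot = dict(system.active.data)  # record the pre-failover state

    # 1) Simulate the failure condition.
    system.primary.healthy = False
    started = time.monotonic()

    # 2) Check that the failure was detected and the failover was activated
    #    within the maximum time period.
    system.check_health()
    elapsed = time.monotonic() - started
    switched_in_time = (
        system.active is system.standby and elapsed <= max_failover_seconds
    )

    # 3) Verify that functionality and data match the pre-failover state.
    consistent = (
        system.active.data == snapshot and system.read("order-1") == "confirmed"
    )
    return switched_in_time, consistent


if __name__ == "__main__":
    print(run_failover_test())
```

The time limit of one second is an arbitrary placeholder; in practice the maximum failover time would be taken from the system's reliability requirements.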