1) Use of redundant hardware devices (e.g., servers, processors, disks), which are arranged such that one component immediately takes over from another should it fail. Disks, for example, can be included in the architecture as a RAID element (Redundant Array of Inexpensive Disks).
2) Redundant software implementation, in which more than one independent instance of a software system is implemented (perhaps by independent teams), using the same set of requirements. These so-called redundant dissimilar systems are expensive to implement but provide a level of risk coverage against external events (e.g., defective inputs): independently developed instances are less likely to handle such events in the same way and are therefore less likely to fail at the same time.
3) Use of multiple levels of redundancy, which can be applied to both software and hardware to add further "safety nets" should a component fail. These systems are called duplex, triplex, or quadruplex systems, depending on how many independent instances (2, 3, or 4 respectively) of the software or hardware are implemented.
4) Use of detection and switching mechanisms for determining whether a failure in the software or hardware has occurred and whether to switch (failover) to an alternative. Sometimes these decisions are relatively simple: software has crashed or hardware has failed, and a failover needs to be enacted. In other circumstances, the decision may not be that simple. A hardware component may be physically available but supplying incorrect data due to some malfunction. Mechanisms need to be implemented that enable these untrustworthy data sources to be identified and trustworthy ones to be used instead. In software, these mechanisms are often referred to as voting systems because they constantly monitor the redundant data sources and conduct a vote on which of them to trust. Ultimately these systems may shut down hardware components deemed to be no longer trustworthy (i.e., failed).
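The voting idea in item 4 can be illustrated with a minimal sketch. It assumes numeric readings from redundant channels and treats a channel as untrustworthy when it deviates from the median by more than a tolerance. The function name vote, the channel names, and the TOLERANCE value are hypothetical choices made for illustration, not part of any particular system.

```python
from statistics import median

TOLERANCE = 0.5  # hypothetical agreement threshold, in the sensor's units


def vote(readings, tolerance=TOLERANCE):
    """Return (trusted_value, suspect_channels) for redundant readings.

    readings maps a channel name to its latest value. A channel is
    suspect when it deviates from the median by more than tolerance.
    """
    if not readings:
        raise ValueError("no channels available")
    mid = median(readings.values())
    suspects = sorted(
        ch for ch, value in readings.items() if abs(value - mid) > tolerance
    )
    trusted = [value for ch, value in readings.items() if ch not in suspects]
    if not trusted:
        # Typical for a duplex system: two channels disagree and
        # there is no third opinion to break the tie.
        raise RuntimeError("channels disagree and no majority exists")
    # Average the channels that agree with the median.
    return sum(trusted) / len(trusted), suspects


if __name__ == "__main__":
    # Channel "c" is physically available but supplies incorrect data.
    print(vote({"a": 10.0, "b": 10.2, "c": 55.0}))
```

With three channels the outlier can be identified and excluded. With only two disagreeing channels the sketch raises an error, which reflects why a duplex arrangement alone cannot decide which source to trust.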
Depending on the type of redundancy implemented (duplex, triplex, etc.), voting systems can be highly complex and are often among the most critical components in the software. For these reasons, it is advisable to include thorough structural and specification-based testing of this software in the testing approach. Since voting software is highly rule- and state-based, the adoption of decision table testing or state transition testing techniques may be appropriate.
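As an illustration of decision table testing applied to voting logic, the following sketch tests a hypothetical 2-out-of-3 voter over boolean channel outputs. The function majority_vote and the table layout are assumptions made for this example. Each row of the table is one rule: a combination of conditions and the expected outcome (the voted output and the indices of the disagreeing channels).

```python
from itertools import product


def majority_vote(a, b, c):
    """2-out-of-3 vote; returns (output, indices of disagreeing channels)."""
    output = (a and b) or (a and c) or (b and c)
    outliers = [i for i, value in enumerate((a, b, c)) if value != output]
    return output, outliers


# Decision table: (conditions a, b, c) -> (expected output, expected outliers)
DECISION_TABLE = [
    ((False, False, False), (False, [])),
    ((False, False, True),  (False, [2])),
    ((False, True,  False), (False, [1])),
    ((False, True,  True),  (True,  [0])),
    ((True,  False, False), (False, [0])),
    ((True,  False, True),  (True,  [1])),
    ((True,  True,  False), (True,  [2])),
    ((True,  True,  True),  (True,  [])),
]


def run_decision_table():
    """Return the condition combinations whose actual outcome differs."""
    # The table must cover every combination of conditions exactly once.
    covered = sorted(conds for conds, _ in DECISION_TABLE)
    assert covered == sorted(product([False, True], repeat=3))
    return [
        conds
        for conds, expected in DECISION_TABLE
        if majority_vote(*conds) != expected
    ]


if __name__ == "__main__":
    print("failing rules:", run_decision_table())
```

A full table is feasible here because three boolean conditions yield only eight rules. For voters with more channels or non-boolean inputs, the table would normally be collapsed or the combinations sampled.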
Dynamic testing of the failover mechanisms of complete applications or systems of systems is an essential element of a reliability testing approach. The value of these tests arises from our ability to realistically describe the failure modes to be handled and simulate them in a controlled and fully representative environment.
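A much simplified sketch of simulating failure modes is shown below. It assumes two failure modes, a crash and the supply of implausible data, and injects them into hypothetical Node objects managed by a hypothetical FailoverController. All class names, the fault labels, and the valid range are illustrative assumptions. A real failover test would run against a representative environment rather than in-process stubs.

```python
class Node:
    """Hypothetical redundant component into which a fault can be injected."""

    def __init__(self, name):
        self.name = name
        self.fault = None  # None, "crash", or "bad_data"

    def read(self):
        if self.fault == "crash":
            raise ConnectionError(f"{self.name} is down")
        return -1.0 if self.fault == "bad_data" else 42.0


class FailoverController:
    """Switches to the next node when the active one fails or is implausible."""

    def __init__(self, nodes, valid_range=(0.0, 100.0)):
        self.nodes = list(nodes)
        self.valid_range = valid_range
        self.active = 0  # index of the node currently in use
        self.log = []    # (node name, reason) for every failover

    def read(self):
        low, high = self.valid_range
        while self.active < len(self.nodes):
            node = self.nodes[self.active]
            try:
                value = node.read()
                if low <= value <= high:
                    return value
                reason = "implausible data"
            except ConnectionError:
                reason = "no response"
            # Record the failure and fail over to the next node.
            self.log.append((node.name, reason))
            self.active += 1
        raise RuntimeError("all redundant nodes have failed")


if __name__ == "__main__":
    primary, standby = Node("primary"), Node("standby")
    controller = FailoverController([primary, standby])
    primary.fault = "crash"  # inject the failure mode under test
    print(controller.read(), controller.log)
```

Each test injects one described failure mode and then checks both the value delivered after failover and the recorded reason for the switch.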