CS 449 Fault-Tolerant Systems
Total Credits: 3 cr
Course Coordinator: Axel Krings
URL: http://www2.cs.uidaho.edu/~krings/CS449
Current Catalog Description: Design, modeling, analysis and integration
of hardware and software to achieve dependable computing systems employing on-line
fault tolerance; theory and fundamental concepts of designing reliable systems;
analytical evaluation techniques, faults and advances in ultra-reliable distributed
systems, fault-tolerant software systems; case studies include the space Shuttle,
Airbus, and Boeing fly-by-wire primary flight computers as well as systems in
reliable data bases and financial markets.
Textbook: None.
References: Course packet includes research papers and presentation
slides, all downloadable from the website. Additional references include:
- Martin L. Shooman, Reliability of computer systems and networks, John
Wiley & Sons Inc., 2002, ISBN 0-471-29342-3.
- Anderson, T., and P.A. Lee, Fault-Tolerant Principles and Practices,
Prentice-Hall Int'l., London, 1981.
- Hwang, K., and F.A. Briggs, Computer Architecture and Parallel Processing,
McGraw-Hill, 1984.
- Jalote, P., Fault-Tolerance in Distributed Systems, Prentice-Hall, 1994.
- Johnson, B.W., Design and Analysis of Fault-Tolerant Systems, Addison
Wesley, 1989.
- Leveson, Nancy G., Safeware, system safety and computers, Addison Wesley,
1995.
- Pradhan, D.K., Fault-Tolerant Computing -- Theory and Techniques, (2
Volumes), Prentice-Hall, 1986.
- Pradhan, Dhiraj K., Fault-Tolerant Computer System Design, Prentice-Hall
PTR , 1996.
- Sahner, R.A., K.S. Trivedi and A. Puliafito, Performance and Reliability
Analysis of Computer Systems, Kluwer Academic Publishers, 1996.
- Sieworek, and R.S. Schwarz, The Theory and Practice of Reliable System
Design, Digital Press, 1982.
- Storey, Neil, Safety-Critical Computer Systems, Addison Wesley, 1995.
- Martin L. Shooman, Reliability of computer systems and networks, John
Wiley & Sons Inc., 2002.
- Trivedi, K.S., Probability and Statistics with Reliability, Queuing,
and Computer Science Applications, Prentice-Hall, 1982.
Course Goals: Provide student with a broad knowledge of fault-tolerant
concepts and mechanisms in order to understand how to assess dependability and
make intelligent design choices. Further the understanding of the space and
time complexity, i.e. the overhead, of the mechanisms that can provide fault-tolerance
and achieve desired attributes such as reliability, availability or maintainability.
Expose students to case studies.
Prerequisites by Topic:
- Basic knowledge of computer architecture and computer organization,
e.g., memory system organization and architecture, Interfacing and communication,
functional organization, multiprocessing and alternative architectures.
(CS 150)
- Basic topics of operating systems, e.g., OS system principles, memory
management, file system, concurrency, input/output, OS security. (CS 240)
Major Topics Covered in the Course:
- Introduction to fault-tolerance, safety-critical systems, and top challenges
(2 hours)
- Standard definitions: e.g. reliability, fault-error-failure (1 hour)
- Redundancy concepts (spatial, information, time), error detection /
correction (2 hours)
- Reliability analysis: math background, bathtub curve, MTTF (2 hours)
- Reliability block diagrams and fault tree analysis (2 hours)
- Reliability analysis using Markov analysis (3 hours)
- Reliability analysis using Petri nets (2 hours)
- Distributed systems: ordering / synchronizing, reliable / atomic / causal
broadcast (5 hours)
- Fault-tolerant agreement, consensus, fault models (5 hours)
- Clock synchronization (3 hours)
- Recovery strategies, checkpointing, message logging (2 hours)
- RAID systems, fail-stop processes (3 hours)
- Diagnosability (2 hours)
- Case studies: Space Shuttle, Boeing 777, SIFT, Tandem, NonStop System
Cyclone, Himalaya, MAFT (6 hours)
Laboratory projects (specify number of weeks on each): None.
Estimated Curriculum Category Content:
| Area |
Core |
Advanced |
Area |
Core |
Advanced |
| Algorithms |
|
1 cr |
Data Structures |
|
|
| Software Design |
|
0.5 cr |
Prog. Languages |
|
0.5 cr. |
| Computer Arch |
|
1 cr. |
Other |
|
|
Oral and Written Communications: None
Social and Ethical Issues: None.
Theoretical Content: Twenty two class periods cover material with
theoretical content. The dominating topics are:
- Reliability Analysis: Random Variables, cdf, pdf, reliability block
diagrams, fault tree analysis, Markov chains, (steady state and transient
solutions), Petri nets
- Agreement and consensus algorithms
- Clock synchronization algorithms
- System diagnosability (PMC model)
Problem Analysis: The common theme in the course is mathematical modeling.
This is also reflected in the homework and exam questions, i.e., most homework
and some exam questions addressed problems requiring problem analysis and mathematical
solutions.
Solution Design: Most solutions to achieve fault-tolerance are measured
by their overall reliability impact. A typical example is a homework questionwhere
different redundancy schemes are compared or the exam questions which required
using solid reliability arguments to justify the answers.
Course Outcomes: The following list documents the course outcomes
and crossreferences them to the BSCS program
outcomes. The letter at the beginning of each reference identifies the program
outcome supported. The numbers sequentially identify the course outcome for
this course. After completing CS 449 a student should be able to:
|