Skip navigation.

Contact Us

Department of Computer Science

Janssen Engineering
Room 236
PO Box 441010
Moscow, Idaho
83844-1010

phone: 208-885-6592
fax: 208-885-9052

e-mail:
dept chair
graduate info
undergrad info
sys admin
webmaster

CS Department Banner Graphic

CS 449 Fault-Tolerant Systems

Total Credits: 3 cr

Course Coordinator: Axel Krings

URL: http://www2.cs.uidaho.edu/~krings/CS449

Current Catalog Description: Design, modeling, analysis and integration of hardware and software to achieve dependable computing systems employing on-line fault tolerance; theory and fundamental concepts of designing reliable systems; analytical evaluation techniques, faults and advances in ultra-reliable distributed systems, fault-tolerant software systems; case studies include the space Shuttle, Airbus, and Boeing fly-by-wire primary flight computers as well as systems in reliable data bases and financial markets.

Textbook: None.

References: Course packet includes research papers and presentation slides, all downloadable from the website. Additional references include:

  • Martin L. Shooman, Reliability of computer systems and networks, John Wiley & Sons Inc., 2002, ISBN 0-471-29342-3.
  • Anderson, T., and P.A. Lee, Fault-Tolerant Principles and Practices, Prentice-Hall Int'l., London, 1981.
  • Hwang, K., and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
  • Jalote, P., Fault-Tolerance in Distributed Systems, Prentice-Hall, 1994.
  • Johnson, B.W., Design and Analysis of Fault-Tolerant Systems, Addison Wesley, 1989.
  • Leveson, Nancy G., Safeware, system safety and computers, Addison Wesley, 1995.
  • Pradhan, D.K., Fault-Tolerant Computing -- Theory and Techniques, (2 Volumes), Prentice-Hall, 1986.
  • Pradhan, Dhiraj K., Fault-Tolerant Computer System Design, Prentice-Hall PTR , 1996.
  • Sahner, R.A., K.S. Trivedi and A. Puliafito, Performance and Reliability Analysis of Computer Systems, Kluwer Academic Publishers, 1996.
  • Sieworek, and R.S. Schwarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.
  • Storey, Neil, Safety-Critical Computer Systems, Addison Wesley, 1995.
  • Martin L. Shooman, Reliability of computer systems and networks, John Wiley & Sons Inc., 2002.
  • Trivedi, K.S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.

Course Goals: Provide student with a broad knowledge of fault-tolerant concepts and mechanisms in order to understand how to assess dependability and make intelligent design choices. Further the understanding of the space and time complexity, i.e. the overhead, of the mechanisms that can provide fault-tolerance and achieve desired attributes such as reliability, availability or maintainability. Expose students to case studies.

Prerequisites by Topic:

  • Basic knowledge of computer architecture and computer organization, e.g., memory system organization and architecture, Interfacing and communication, functional organization, multiprocessing and alternative architectures. (CS 150)
  • Basic topics of operating systems, e.g., OS system principles, memory management, file system, concurrency, input/output, OS security. (CS 240)

Major Topics Covered in the Course:

  • Introduction to fault-tolerance, safety-critical systems, and top challenges (2 hours)
  • Standard definitions: e.g. reliability, fault-error-failure (1 hour)
  • Redundancy concepts (spatial, information, time), error detection / correction (2 hours)
  • Reliability analysis: math background, bathtub curve, MTTF (2 hours)
  • Reliability block diagrams and fault tree analysis (2 hours)
  • Reliability analysis using Markov analysis (3 hours)
  • Reliability analysis using Petri nets (2 hours)
  • Distributed systems: ordering / synchronizing, reliable / atomic / causal broadcast (5 hours)
  • Fault-tolerant agreement, consensus, fault models (5 hours)
  • Clock synchronization (3 hours)
  • Recovery strategies, checkpointing, message logging (2 hours)
  • RAID systems, fail-stop processes (3 hours)
  • Diagnosability (2 hours)
  • Case studies: Space Shuttle, Boeing 777, SIFT, Tandem, NonStop System Cyclone, Himalaya, MAFT (6 hours)

Laboratory projects (specify number of weeks on each): None.

Estimated Curriculum Category Content:

Area Core Advanced Area Core Advanced
Algorithms   1 cr Data Structures    
Software Design   0.5 cr Prog. Languages   0.5 cr.
Computer Arch   1 cr. Other    

Oral and Written Communications: None

Social and Ethical Issues: None.

Theoretical Content: Twenty two class periods cover material with theoretical content. The dominating topics are:

  • Reliability Analysis: Random Variables, cdf, pdf, reliability block diagrams, fault tree analysis, Markov chains, (steady state and transient solutions), Petri nets
  • Agreement and consensus algorithms
  • Clock synchronization algorithms
  • System diagnosability (PMC model)

Problem Analysis: The common theme in the course is mathematical modeling. This is also reflected in the homework and exam questions, i.e., most homework and some exam questions addressed problems requiring problem analysis and mathematical solutions.

Solution Design: Most solutions to achieve fault-tolerance are measured by their overall reliability impact. A typical example is a homework questionwhere different redundancy schemes are compared or the exam questions which required using solid reliability arguments to justify the answers.

Course Outcomes: The following list documents the course outcomes and crossreferences them to the BSCS program outcomes. The letter at the beginning of each reference identifies the program outcome supported. The numbers sequentially identify the course outcome for this course. After completing CS 449 a student should be able to: