Published by MIT Press. Copyright 1998 Massachusetts Institute of Technology.
A reset of a distributed system is safe if it does not complete "prematurely," i.e., without having reset some process in the system. Safe resets are possible in the presence of certain faults, such as process fail-stops and repairs, but are not always possible in the presence of more general faults, such as arbitrary transients. In this paper, we design a bounded-memory distributed-reset program that possesses two tolerances: (1) in the presence of fail-stops and repairs, it always executes resets safely, and (2) in the presence of a finite number of transient faults, it eventually executes resets safely. Designing this multitolerance in the reset program introduces the novel concern of designing a safety detector that is itself multitolerant. A broad application of our multitolerant safety detector is to make any total program likewise multitolerant.
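The safety condition stated above can be illustrated with a minimal sketch. It is not the paper's reset program or its safety detector; it only encodes the definition that a reset is unsafe when it completes without having reset some process. The class `ResetWave` and its methods are hypothetical names introduced for illustration, and the sketch assumes a fixed, known set of processes and ignores fail-stops, repairs, and transient faults.

```python
from dataclasses import dataclass, field


@dataclass
class ResetWave:
    """Hypothetical record of a single reset (illustration only)."""
    processes: frozenset                 # assumed: fixed set of process ids
    reset_done: set = field(default_factory=set)
    completed: bool = False

    def reset(self, pid):
        """Record that process `pid` has been reset in this wave."""
        if pid not in self.processes:
            raise ValueError(f"unknown process: {pid!r}")
        self.reset_done.add(pid)

    def complete(self):
        """Declare the reset complete (whether or not that is safe)."""
        self.completed = True

    def is_safe(self):
        """A reset is unsafe only if it has completed prematurely,
        i.e., completed while some process has not been reset."""
        return (not self.completed) or self.reset_done >= self.processes


# Usage: completing after resetting only two of three processes is premature.
wave = ResetWave(frozenset({"p", "q", "r"}))
wave.reset("p")
wave.reset("q")
wave.complete()
print(wave.is_safe())
```

In this sketch the safety check has global knowledge of every process, which a real distributed detector would not have; the sketch only makes the predicate being detected concrete.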