Techniques for process recovery in message passing and distributed shared memory systems

Bibliographic details
Published in: ProQuest Dissertations and Theses (1995)
Main author: Richard, Golden George, III
Published: ProQuest Dissertations & Theses
Online access: Citation/Abstract; Full Text - PDF
Description
Abstract: Distributed applications typically consist of a group of processes executing on processors which share neither a global clock nor a physically shared memory. Applications may either explicitly exchange messages to communicate or rely on a mechanism called Distributed Shared Memory, which implements a globally shared address space using a software layer. In either case, as the number of processors increases, so does the probability that a processor will fail at some point during application execution. At the very least, such failures destroy all the work accomplished by the failed processor, and in the worst case, it may be necessary to completely restart the application to restore consistency. In order to reduce lost work and eliminate application restarts, processes which make up a distributed application should be recoverable. This dissertation examines the issues which arise in distributed recovery for both message-passing and distributed shared memory (DSM) systems and presents techniques for process recovery in these systems. In the message-passing arena, the recovery techniques address complete process recovery (CPR). By complete recovery we mean the restoration of a consistent system state as well as proper handling of lost, duplicate, and delayed messages resulting from the failure and restoration of the system to a consistent state. Message handling has often been treated casually in previous recovery techniques. In contrast, the proposed recovery techniques comprehensively address message handling issues. We classify the message types which a recovery mechanism must deal with to motivate our message-handling techniques. The use of vector time allows us to reduce the overhead associated with recovery and helps expedite recovery after a failure. Processor failures also plague distributed shared memory systems, but different solutions are appropriate.
This dissertation addresses the problem of recovery in DSM systems and presents a solution which allows the checkpointing frequency to be adjusted to reduce overhead in environments where checkpointing is expensive. The results of a simulation study are used to compare our technique to another technique for recovery in distributed shared memory environments.
ISBN:9798209471561
Source: ProQuest Dissertations & Theses Global