Checkpointing a multithreaded distributed shared memory computer system

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2001)
Main author: Dieter, William Robert
Published: ProQuest Dissertations & Theses
Description
Abstract: Distributing a program over a cluster of commodity processors connected by a commodity network can help speed up a computation at relatively low cost. Distributed cluster computing is especially useful for long-running scientific applications. As the number of processors and the running time of a program increase, however, the probability that one of the system's components will fail before the program ends also increases. A program can prepare for failures by periodically saving its state in a checkpoint from which it can be recovered later. Checkpointing distributed programs requires making sure that the checkpoints saved by individual processes can be used together to restore a consistent state. Programs using a coordinated checkpointing algorithm communicate to save a consistent state. Programs using a communication-induced checkpointing algorithm build a consistent state without explicit communication. Although communication-induced checkpointing algorithms have less communication overhead, they do not add significantly less overhead to programs, because synchronization overhead is small compared to the time required to save a checkpoint to disk. A checkpointing system builds consistent global checkpoints from checkpoints of individual processes. Each Unify process has multiple threads, but no checkpointing library existed that could checkpoint multithreaded programs at the start of this research. This research includes the development of a checkpointing library to checkpoint multithreaded processes on Solaris 2.5 and Linux. The checkpointing library can be used as a standalone checkpointing library for multithreaded processes in addition to being used by Unify.
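The idea of coordinated checkpointing described in the abstract can be illustrated with a toy sketch. The Python code below is not the dissertation's library and does not reflect Unify's design; the function name `run_workers`, its parameters, and the in-memory `checkpoints` dictionary are all hypothetical. It assumes workers that exchange no messages, so stopping every worker at a barrier before saving is enough to make each global checkpoint consistent.

```python
import threading


def run_workers(num_workers=3, steps=6, checkpoint_every=2):
    """Run workers and return {step: {worker_id: local_state}} checkpoints.

    Hypothetical illustration only: a real system would write state to
    stable storage instead of an in-memory dictionary.
    """
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    checkpoints = {}

    def worker(worker_id):
        local_state = 0
        for step in range(1, steps + 1):
            local_state += worker_id + 1  # stand-in for real computation
            if step % checkpoint_every == 0:
                # Coordinate: no worker saves until all have reached this step.
                barrier.wait()
                with lock:
                    checkpoints.setdefault(step, {})[worker_id] = local_state
                # No worker resumes until every worker has saved its state.
                barrier.wait()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints


if __name__ == "__main__":
    for step, states in sorted(run_workers().items()):
        print(step, dict(sorted(states.items())))
```

The two barrier waits are the synchronization cost the abstract refers to: every worker pauses until the slowest one has arrived and saved. In this sketch that cost is a pair of in-process waits; the abstract's point is that in practice such synchronization is small next to the time spent writing a checkpoint to disk.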
ISBN: 9780493289120
Source: ProQuest Dissertations & Theses Global