Checkpoint-Restart: Proprietary Hardware and the "Spiderweb API"

Metadata

Formal metadata

Title Checkpoint-Restart: Proprietary Hardware and the "Spiderweb API"
Series title REcon 2011
Author Kerr, Gregory
Contributors Brick, Alex
Cooperman, Gene
Bratus, Sergey
License CC Attribution - NonCommercial - NoDerivatives 3.0 Unported:
You may use, copy, distribute, and make publicly available the work or its content in unchanged form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify.
DOI 10.5446/32917
Publisher REcon
Release year 2011
Language English

Content metadata

Subject area Computer science
Abstract This summary describes a package that transparently checkpoints and restarts applications running over InfiniBand. InfiniBand is growing rapidly as a high-speed interconnect and now appears even on departmental clusters. The current work grew out of the needs of high performance computing: as of November 2010, 43% of the TOP500 supercomputers ran InfiniBand. The ability to checkpoint also immediately provides a poor man's reversible debugger. Using our DMTCP (Distributed MultiThreaded CheckPointing), we can already checkpoint a GDB session today: if we have executed 100 commands since the last checkpoint, we can undo the last command by restarting from the checkpoint and replaying the first 99 commands. Since many applications access InfiniBand through MPI (Message Passing Interface) rather than communicating with InfiniBand directly, we also integrated DMTCP with the Open MPI implementation so that an MPI-based application can be debugged transparently. InfiniBand's primary mechanism for achieving low latency is Remote Direct Memory Access (RDMA): one host can directly read or write the RAM of another host, without intervention by the remote host's CPU or software. The debugger mentioned above logs commands and lets you go back in history by restarting and re-executing. This lets us treat time as a spatial dimension rather than a temporal one, so we can write a binary search that acts over the process's lifetime. This is illustrated in a later section. In a complex InfiniBand computation, memory is written to and read from with latencies of less than 1 microsecond. Assert statements or breakpoints would change the course of execution because the program would no longer run at native speed.
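
The undo-by-replay idea and the binary search over a process's lifetime described in the abstract can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DMTCP's interface: the names `ReplaySession` and `first_bad_step` are hypothetical, an in-memory dictionary stands in for the process image that a real checkpoint would capture, and replay is assumed to be deterministic.

```python
import copy
from typing import Callable, List

Command = Callable[[dict], None]


class ReplaySession:
    """Toy stand-in for a checkpointed debugger session (hypothetical, not DMTCP)."""

    def __init__(self, initial_state: dict):
        self._checkpoint = copy.deepcopy(initial_state)  # saved "process image"
        self.state = copy.deepcopy(initial_state)        # live "process"
        self.log: List[Command] = []                     # commands since the checkpoint

    def execute(self, command: Command) -> None:
        """Run a command on the live state and record it for later replay."""
        command(self.state)
        self.log.append(command)

    def checkpoint(self) -> None:
        """Save the live state and start a fresh command log."""
        self._checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def state_after(self, n: int) -> dict:
        """Restart from the checkpoint and replay the first n logged commands."""
        state = copy.deepcopy(self._checkpoint)
        for command in self.log[:n]:
            command(state)
        return state

    def undo(self) -> None:
        """Undo the last command: restart, then replay all but the last one."""
        if not self.log:
            raise RuntimeError("nothing to undo since the last checkpoint")
        self.state = self.state_after(len(self.log) - 1)
        self.log.pop()


def first_bad_step(session: ReplaySession, is_bad: Callable[[dict], bool]) -> int:
    """Return the smallest n such that the state after n commands is bad.

    Assumes the checkpointed state is good, the state after all logged
    commands is bad, and badness persists once it appears.
    """
    lo, hi = 0, len(session.log)  # invariant: state_after(lo) good, state_after(hi) bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(session.state_after(mid)):
            hi = mid
        else:
            lo = mid
    return hi


def increment(state: dict) -> None:
    state["counter"] += 1


def corrupt(state: dict) -> None:
    state["ok"] = False


if __name__ == "__main__":
    session = ReplaySession({"counter": 0, "ok": True})
    for i in range(100):
        # The 42nd command (index 41) introduces the fault.
        session.execute(corrupt if i == 41 else increment)

    print(first_bad_step(session, lambda s: not s["ok"]))
    session.undo()
    print(len(session.log), session.state["counter"])
```

Each probe in the search costs one restart plus a partial replay, so locating the faulty command among 100 takes about seven restarts rather than a linear scan. With a real checkpointing package, the restart step would restore a process image from disk instead of copying a dictionary.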
