(For USM Staff/Student Only)


Implementing checkpointing and recovery algorithm for fault tolerant computation

Implementing checkpointing and recovery algorithm for fault tolerant computation / Liew Siew Wan
Checkpoint-recovery is commonly used to implement fault tolerance in multi-computer systems. Checkpoints can be used in conjunction with exception-handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation.
During failure-free operation, process states are saved regularly; after a fault is detected, the system is restored to a previously saved state. Checkpointing is usually used to minimize the execution time of long-running programs in the presence of failures, and an optimal checkpointing approach may be determined to reduce the expected execution time. This project studies several techniques, including coordinated, uncoordinated, and communication-induced checkpointing, and compares their overall performance. Apart from a static checkpointing scheme, a few random checkpointing approaches have been implemented in this thesis as a dynamic checkpointing scheme. Qualitative and experimental analyses are applied to test the effectiveness and reliability of the proposed methods. The proposed random checkpointing appears to be sufficiently effective compared with the existing checkpointing schemes produced by traditional methods. These studies suggest that the proposed checkpointing method has great potential for fault tolerance in large software applications.
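The save-and-roll-back cycle described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function names (run_with_checkpoints, static_schedule, random_schedule), the toy accumulator workload, and the uniform random gap between checkpoints are all assumptions chosen for illustration, and the thesis's actual random checkpointing schemes may differ.

```python
import copy
import random


def static_schedule(total_steps, interval):
    """Hypothetical static scheme: checkpoint after every `interval` steps."""
    return set(range(interval, total_steps + 1, interval))


def random_schedule(total_steps, mean_interval, seed=0):
    """Hypothetical random scheme: gaps drawn uniformly from
    [1, 2 * mean_interval - 1], so the average gap is mean_interval."""
    rng = random.Random(seed)
    steps, pos = set(), 0
    while True:
        pos += rng.randint(1, 2 * mean_interval - 1)
        if pos > total_steps:
            return steps
        steps.add(pos)


def run_with_checkpoints(total_steps, checkpoint_steps, failure_steps):
    """Run a toy computation (summing 1..total_steps) with injected faults.

    A fault at step k destroys the attempt at step k; the process then
    rolls back to the most recently saved state. Each fault fires once.
    Returns (final_state, executed), where `executed` counts every step
    attempt, including lost and repeated work.
    """
    state = {"step": 0, "acc": 0}
    saved = copy.deepcopy(state)          # initial checkpoint
    pending = set(failure_steps)
    executed = 0
    while state["step"] < total_steps:
        nxt = state["step"] + 1
        executed += 1
        if nxt in pending:
            pending.discard(nxt)          # fault detected during this step
            state = copy.deepcopy(saved)  # recovery: restore saved state
            continue
        state["step"] = nxt
        state["acc"] += nxt
        if nxt in checkpoint_steps:
            saved = copy.deepcopy(state)  # failure-free operation: save state
    return state, executed


if __name__ == "__main__":
    for name, schedule in [
        ("none", set()),
        ("static", static_schedule(10, 5)),
        ("random", random_schedule(10, 3, seed=1)),
    ]:
        final, cost = run_with_checkpoints(10, schedule, [8])
        print(name, final["acc"], cost)
```

In this toy model, a single fault at step 8 of a 10-step run costs 18 step attempts without checkpoints and 13 with a checkpoint after step 5, while the final result is the same in both cases. The sketch ignores the cost of taking a checkpoint, which a real comparison of schemes would have to include.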
Contributor(s):
Liew Siew Wan - Author
Primary Item Type:
Final Year Project
Identifiers:
Barcode : 00003093628
Accession Number : 875004660
Language:
English
Subject Keywords:
Checkpoint; abstractions; debugging
First presented to the public:
6/1/2012
Original Publication Date:
3/20/2018
Previously Published By:
Universiti Sains Malaysia
Place Of Publication:
School of Electrical & Electronic Engineering
Citation:
Extents:
Number of Pages - 107
Date Deposited:
2018-03-20 15:50:20.975
Date Last Updated:
2019-01-07 11:24:32.9118
Submitter:
Mohd Jasnizam Mohd Salleh

All Versions

Name:
Implementing checkpointing and recovery algorithm for fault tolerant computation
Version:
1
Created Date:
2018-03-20 15:50:20.975