Reliability and Availability of Ship’s Computer Systems Based on Manufacturer’s Data and Worksheets

Computer-controlled systems play an important role aboard ships. Failure of such systems due to a component malfunction can have fatal consequences. It is therefore important to assess the reliability and availability of such systems and to determine the minimum redundancy that ensures maintenance planning, the ordering of spare components and the safety of the voyage with as few redundant components as possible. This paper deals with the development of a reliability and availability model for a computer system consisting of three components, each with a hot standby. A Markov chain model is used to analyse the probability of failure. After the model is developed, the transition matrix is set up and used to derive the differential equations for model simulation. The system's reliability is higher if the system is under constant maintenance and service, but then it would not be available. Hence, an optimum between reliability and availability should be found. Maintenance is limited during the voyage, so hot standby is necessary to ensure the success of the voyage. This paper presents a framework for assessing the reliability and availability of computer systems based on component redundancy and practical MTBF data. Many variants of shipboard computer systems can be evaluated using the presented framework.


INTRODUCTION / Uvod
Reliability and availability of all systems in transport and industry play an important role when a customer chooses a product [1]. This directly influences planned maintenance [2, 3]. Reliability and availability are of such importance that companies build databases to calculate, e.g., MTTR, as presented in [2]. Such collected data are used to detect faults [3] in ships' systems. Computer systems are incorporated into larger-scale systems as control and/or monitoring units. Computer systems differ in construction according to application needs; industrial computers, for instance, are quite different from PCs. In this paper, a special-purpose computer consisting of a limited number of components is considered.
An example of reliability and availability research is presented in [1], where the authors model a computer-based quality control system; the influencing parameters are hardware, software, and environment. The core hardware part of any computer system is the microprocessor unit. All processor manufacturers invest in ensuring a long processor lifetime by limiting failures [4]. Failure mechanisms are researched world-wide. In [4], wear-out-related hard errors are considered. Such failures arise from several phenomena, i.e. stress migration, electromigration or time-dependent dielectric breakdown (TDDB). Long-term processor reliability is usually represented by the bathtub curve [5], which consists of three parts: early life, useful life and wear-out. Each part of the curve is characterized by a different failure mechanism. Since long-term processor reliability depends almost completely on intrinsic failures and wear-out, the so-called reliability-aware microarchitectural design (RAMP) model is introduced in [6]. RAMP is interesting due to aggressive transistor scaling and increasing processor power, which lead to higher temperatures and demand more effort in the thermal design of microprocessors. In order to incorporate other parts of the computer into the analysis, a framework for architecture-level lifetime reliability modelling is introduced in [7]. That research includes Monte Carlo simulations and an effective combination of low-level and architectural-level effects. The failure mechanisms analysed are electromigration, negative bias temperature instability (NBTI) and TDDB. The work also covers other components of the system, such as SRAM (Static Random-Access Memory) and redundant systems.
Reliability of memory was addressed in [8-11]. Reliability of ferroelectric RAM was analysed in [8], while the design of fault-tolerant RAM was the scope of [9]. Optimization criteria were considered in [11]: minimization of costs, maximization of equipment availability, and the achievement of a desired stock reliability. A normal distribution and a Poisson process approach were used for non-repairable components.
Reliability and availability of an industrial computer system were presented in [1]; such a system is similar in scope to a ship's system. The difference is that a ship's system must not fail between two harbours, so situations with a double failure (of both the main and the redundant component) should be avoided.
This paper is organized as follows. In Section 2, we describe the process of model development. In Section 3, the simulation results are presented and discussed. Section 4 concludes the paper.

DEVELOPING SIMULATION MODEL FROM THEORY / Razvoj simulacijskog modela iz teorije
In order to determine the system's availability, it is necessary to develop a simplified model and exploit it. Availability is defined by [12]:

A = MTBF / (MTBF + MTTR) (1)

where MTTR is Mean Time To Repair and MTBF Mean Time Before Failure. Availability is often expressed as:

A = μ / (λ + μ) (2)

where λ is the intensity of failures and μ the intensity of repairs. The intensity of failures can be determined by [12, 13]:

λ = 1 / MTBF (3)

The intensity of repairs is defined by [12]:

μ = 1 / MTTR (4)

The simulation model considered deals only with the hardware part of the computer system. In order to simplify the model, only three crucial components were taken into account; the research can be extended to more components if necessary for some specific purpose. Every component has a hot standby in a parallel branch. The considered components of the computer system are the microprocessor (MP), random-access memory (RM) and hard disk (TD). The considered computer system is shown as a block diagram in Figure 1, with the redundant components in parallel branches. Since the computer fails if any of the parallel branches fails, the three parallel branches are connected in series (see reliability in parallel [13]). Several more components could have been taken into account, but the necessary data were not available at the time of research; hence, this is a simplified model. Table 1 shows all possible states of the system. However, it is necessary to consider only the situations which do not lead to system failure. An operational component is in state "0" and a failed component is in state "1". For example, state 1 can be expressed as TD1 & TD2; if some component is in a state of failure, it is written with negation (i.e. ¬TD2).
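The definitions in equations (1)-(4) can be sketched in a few lines of Python. This is only an illustration of the relations above; `mtbf` and `mttr` are assumed to be given in the same time unit (e.g. hours), and the example values are arbitrary:

```python
def failure_intensity(mtbf):
    """Equation (3): lambda = 1 / MTBF."""
    return 1.0 / mtbf

def repair_intensity(mttr):
    """Equation (4): mu = 1 / MTTR."""
    return 1.0 / mttr

def availability(mtbf, mttr):
    """Equation (1): A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

# Equations (1) and (2) give the same value: A = mu / (lam + mu).
lam = failure_intensity(1000.0)   # hypothetical MTBF of 1000 h
mu = repair_intensity(10.0)       # hypothetical MTTR of 10 h
print(availability(1000.0, 10.0))
print(mu / (lam + mu))            # identical by substitution of (3) and (4)
```

Substituting (3) and (4) into (2) recovers (1), which is why the two printed values coincide.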

Figure 1 Considered system / Slika 1. Razmatrani sustav
All allowed cases (27 in total) can be reduced by a different formulation. Cases when one component in a parallel branch is operational and the other has failed can be expressed as one new case in which the parallel branch as a whole is operational; for example, cases no. 2 and 3 from Table 1 can be written as one such case. Therefore, the 27 states can be reduced to just 9 states with 7 transitions. State S0 is defined as the state with all components operational. If one component in a parallel branch fails, a transition from S0 to S1 (for MP failure), S2 (for RM failure) or S3 (for TD failure) occurs. The system cannot return from states S1, S2 and S3 to S0, because there are no repair intensities; if the system is not maintained (repaired), its performance can only worsen. From S1, the system can degrade to S4 or S5, or to total failure SK (state of failure). From S2, the system can change its condition to states S4, S6 or SK. From S3, the system can deteriorate to S5, S6 or SK. From the new states S4, S5 and S6, the system can degrade to S7 or SK, and, finally, from S7 only to SK. Table 2 shows the reduced states.
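The reduced state space above can be turned into a transition-rate matrix and the state-probability differential equations integrated numerically. The following is a minimal sketch, not the authors' implementation: the failure intensities `lam_mp`, `lam_rm` and `lam_td` are hypothetical per-year values, the branch interpretation of S1-S7 is inferred from the transitions listed in the text, and simple forward-Euler integration stands in for whatever solver was actually used.

```python
# Reduced states: S0 all healthy, S1-S3 one branch degraded,
# S4-S6 two branches degraded, S7 all three degraded, SK failed.
STATES = ["S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "SK"]

# Hypothetical failure intensities per year (illustrative values only).
lam_mp, lam_rm, lam_td = 0.30, 0.0033, 0.016

# In a hot-standby pair the first failure occurs with rate 2*lambda;
# the surviving unit then fails with rate lambda (branch lost -> SK).
RATES = {
    ("S0", "S1"): 2*lam_mp, ("S0", "S2"): 2*lam_rm, ("S0", "S3"): 2*lam_td,
    ("S1", "S4"): 2*lam_rm, ("S1", "S5"): 2*lam_td, ("S1", "SK"): lam_mp,
    ("S2", "S4"): 2*lam_mp, ("S2", "S6"): 2*lam_td, ("S2", "SK"): lam_rm,
    ("S3", "S5"): 2*lam_mp, ("S3", "S6"): 2*lam_rm, ("S3", "SK"): lam_td,
    ("S4", "S7"): 2*lam_td, ("S4", "SK"): lam_mp + lam_rm,
    ("S5", "S7"): 2*lam_rm, ("S5", "SK"): lam_mp + lam_td,
    ("S6", "S7"): 2*lam_mp, ("S6", "SK"): lam_rm + lam_td,
    ("S7", "SK"): lam_mp + lam_rm + lam_td,
}

n = len(STATES)
idx = {s: i for i, s in enumerate(STATES)}
Q = [[0.0] * n for _ in range(n)]          # transition-rate matrix
for (src, dst), rate in RATES.items():
    Q[idx[src]][idx[dst]] = rate
for i in range(n):
    Q[i][i] = -sum(Q[i])                   # each row of Q sums to zero

def simulate(t_end, dt=0.001):
    """Forward-Euler integration of dp/dt = p*Q; returns p(t_end)."""
    p = [1.0] + [0.0] * (n - 1)            # system starts in S0
    for _ in range(int(t_end / dt)):
        p = [p[j] + dt * sum(p[i] * Q[i][j] for i in range(n))
             for j in range(n)]
    return p

p10 = simulate(10.0)                       # state probabilities after 10 years
print(dict(zip(STATES, (round(x, 4) for x in p10))))
```

Because the branches fail independently, p(SK) from this chain matches the closed-form result for three hot-standby pairs in series, 1 − Π(2e^(−λt) − e^(−2λt)), which provides a convenient sanity check on the matrix construction.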

Setup / Plan rada
First, the reliability of the considered system is simulated. In the first step, the simulation model is used to analyse the probability of staying in the initial non-failure state, S0. The simulation is performed under the assumption that the parameters λ are constant for the analysed system. Several service intervals were simulated, from likely to unlikely, and the value of the parameter μ is set according to the service interval.
The configuration of the modelled computer system consists of: an Intel Core 2 Quad microprocessor, Kingston 4 GB DDR3 1600 MHz RAM, and a Western Digital Velociraptor 1 TB hard disk. Values of the MTBF are taken from web resources [15-17]. The microprocessor's MTBF is 73803 hours (3.37 years), the RAM's MTBF is 6618133.7 hours (302.2 years) and the HDD's MTBF is 1400000 hours (63.93 years). Figure 4 shows that the probability of staying in the initial state is 50% after one year; furthermore, the probability of changing the state reaches 90% over a three-year period. Figure 5 shows the probability of changing to the failure state, Pk, obtained in the reliability simulation. The availability simulation shows that the probability of staying in the initial state falls to 50% after 1.7 years for a service intensity of 0.5, after 3 years for a 1-year service interval, and after 7.5 years for two services per year. The probability that the system will not stay in the initial state reaches 90% after a time that depends on the repair intensity: after 12 years for a repair intensity of 0.5, after 19 years for an intensity of 1, and after 33 years for an intensity of 2.
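The reported probability of staying in the initial state can be cross-checked from the quoted MTBF values alone: while all six components are healthy, the exit rate from S0 is the sum of the six failure intensities, so P(S0) decays exponentially. The sketch below is a rough check, not the paper's simulation; it assumes constant intensities and uses the per-year MTBF figures quoted above (3.37, 302.2 and 63.93 years):

```python
import math

# MTBF per branch in years, as quoted in the text.
mtbf_mp, mtbf_rm, mtbf_td = 3.37, 302.2, 63.93

# Equation (3): lambda = 1/MTBF; two hot-standby units per branch,
# so six components in total can trigger the first transition out of S0.
exit_rate = 2 * (1/mtbf_mp + 1/mtbf_rm + 1/mtbf_td)   # per year

def p_stay(t_years):
    """Probability of remaining in S0 (no failure at all) up to time t."""
    return math.exp(-exit_rate * t_years)

print(round(p_stay(1.0), 3))   # -> 0.532
```

The ≈53% at one year roughly agrees with the ~50% read off Figure 4; a small gap is expected since the figure is read graphically.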

Simulation results / Rezultati simulacije
In the second step, the developed simulation model is used to analyse the probability of the failure state. Figure 6 shows the probability of staying in the initial state in the availability simulation; the legend data (years) denote the service intervals. The time dependence of the transition to the state of failure is shown in Figure 7, where the legend data (years) again denote the service intervals. It can be seen that the probability of the failure state (in the reliability case) is about 50% after 4 years and 90% after 10 years. In the availability case, the probability of failure reaches 50% after 6 years with a repair intensity of 0.5, after 8 years with a repair intensity of 1, and after 11.5 years with a repair intensity of 2.
The probability of the failure state reaches 90% after 17 years with a repair intensity of 0.5, after 24 years with a repair intensity of 1, and after 37.4 years with a repair intensity of 2 (service two times per year). Table 5 shows the total reliabilities/availabilities over all operational states for service intervals of 0.5, 1, and 2 years, after 1, 3, and 5 years of operation.
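The qualitative effect of the repair intensity can be illustrated with the elementary two-state (up/down) model behind equation (2): the time-dependent availability is A(t) = μ/(λ+μ) + λ/(λ+μ)·e^(−(λ+μ)t), which decays from 1 towards the steady-state value μ/(λ+μ). This is a deliberately minimal sketch with a hypothetical λ, not the nine-state availability model used in the paper:

```python
import math

lam = 0.4   # hypothetical failure intensity, per year

def availability_t(t, mu):
    """Two-state model: A(t) = mu/(lam+mu) + lam/(lam+mu)*exp(-(lam+mu)*t)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

# A higher repair intensity mu raises both the transient availability
# and the steady-state limit mu/(lam+mu).
for mu in (0.5, 1.0, 2.0):   # repairs per year, as in the simulated intervals
    print(mu, round(availability_t(5.0, mu), 3), round(mu / (lam + mu), 3))
```

The same monotone trend appears in the full simulation: doubling the repair intensity pushes the 50% and 90% failure-probability thresholds years further out.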

CONCLUSION / Zaključak
Many published research works are concerned with the physical phenomena behind system failure. We used a different approach: known MTBF data for known components are used to predict the reliability and availability of the computer system without introducing the physical layer of failure mechanisms. The motivation comes from shipping companies, which are not interested in research into physical phenomena but in knowing how safe their fleet is.
It is shown that a computer with one redundant unit per component can remain operational without repairs for 1 year with 50% probability.
If the system is maintained and repaired, it will have a longer life expectancy: the more repairs, the longer the system's life. However, if the system is under repair all the time, it cannot be used, so there is no economic value in all-time-repaired systems. Therefore, an optimum between the system's availability and the repair rate should be found.
An interesting implication concerns the choice of components used for the simulation. We used "normal" PC components; it would be interesting to compare these data with those of the corresponding components in on-board computer systems.
The current trend is to use as many PC-class computers aboard ships as possible, but this trend is still limited: most computer-controlled systems use PLCs (Programmable Logic Controllers).
Measures to increase the reliability of ships' computer systems have economic costs. For example, increased redundancy means better reliability, but also new computers in standby. However, distributed networks can reduce this cost and increase reliability. Hence, old ships should be modernized, where possible, to increase redundancy through integrated ship networks.
A similar approach to the one in this paper can be used if more components are taken into account; in that case, we simply add more parallel branches to the series connection.