Bayesian analysis and decisions in nuclear power plant maintenance
Abstract and Keywords
This article discusses the results of a study in Bayesian analysis and decision making in the maintenance and reliability of nuclear power plants. It demonstrates the use of Bayesian parametric and semiparametric methodology to analyse the failure times of components that belong to an auxiliary feedwater system in a nuclear power plant at the South Texas Project (STP) Electric Generation Station. The parametric models produce estimates of the hazard functions that are compared to the output from a mixture of Polya trees model. The statistical output is used as the most critical input in a stochastic optimization model which finds the optimal replacement time for a system that randomly fails over a finite horizon. The article first introduces the model for maintenance and reliability analysis before presenting the optimization results. It also examines the nuclear power plant data to be used in the Bayesian models.
Keywords: Bayesian analysis, decision making, maintenance analysis, reliability analysis, nuclear power plants, South Texas Project (STP) Electric Generation Station, parametric model, hazard functions, optimization, Bayesian models
9.1 Introduction
In most mainstream statistical literature, the transition from inference to a formal decision model is seldom explicitly considered. Since the Markov chain Monte Carlo (MCMC) revolution in Bayesian statistics, the focus has generally been on the development of novel algorithmic methods to enable comprehensive inference in a variety of applications, or to tackle realistic problems which naturally fit into the Bayesian paradigm. In this chapter, the primary focus is on the formal decision or optimization model. The statistical input needed to solve the decision problem is tackled via Bayesian parametric and semiparametric models. The optimization/Bayesian models are then applied to solving an important problem in a nuclear power plant system at the South Texas Project (STP) Electric Generation Station.
STP is one of the newest and largest nuclear power plants in the US, and is an industry leader in safety, reliability and efficiency. STP has two nuclear reactors that together can produce 2500 megawatts of electric power. The reactors went online in August 1988 and June 1989, and are the sixth and fourth youngest, respectively, of more than 100 such reactors operating nationwide. STP consistently leads all US nuclear plants in the amount of electricity its reactors produce.
The STP Nuclear Operating Company (STPNOC) manages the plant for its owners, who share its energy in proportion to their ownership interests, which as of July 2004 are: Austin Energy (the City of Austin), 16%; City Public Service of San Antonio, 40%; and Texas Genco LP, 44%. All decisions that the board of directors makes have a finite time horizon, since every nuclear reactor is given a license to operate for a fixed term. In the case of STP, 25 and 27 years remain on the licenses for the two reactors, respectively.
Equipment used to support production in long-lived (more than a few years) installations such as those at STP requires maintenance. While maintenance is being carried out, the associated equipment is typically out of service and the system operator may be required to reduce or completely curtail production. In general, maintenance costs include labor, parts and overhead. In some cases, there are safety concerns associated with certain types of plant disruptions. Overhead costs include hazardous environment monitoring and mitigation, disposal fees, license fees, indirect costs associated with production loss such as wasted feed stock, and so forth.
STPNOC works actively with the Electric Power Research Institute (EPRI) to develop robust methods to plan maintenance activities and prioritize maintenance options. Existing nuclear industry guidelines (Bridges and Worledge, 2002; Gann and Wells, 2004; Hudson and Richards, 2004; INPO, 2002) recommend estimating reliability and safety performance based on evaluating changes taken one at a time, using risk importance measures supplemented by heuristics to prioritize maintenance. In our view, the nuclear industry can do better. For example, Liming et al. (2003) propose instead investigating 'packages' of changes, and in their study of a typical set of changes at STPNOC, the projected plant reliability and nuclear safety estimates were found to be significantly different compared to changes evaluated one at a time. STPNOC is working with EPRI to improve its preventive maintenance reliability database with more accurate probability models to aid in better quantification of preventive maintenance.
Here, we consider a single-item maintenance problem. That said, this chapter’s model forms the basis for higher fidelity multi-item maintenance models that we are currently investigating. Furthermore, as we describe below, even though the model is mathematically a single-item model, in practice it could easily be applied to multiple items within an equipment class.
Equipment can undergo either preventive maintenance (PM) or corrective maintenance (CM). PM can include condition-based, age-based, or calendar-based equipment replacement or major overhaul. Also included in some strategies is equipment redesign. In all these categories of PM, the equipment is assumed to be replaced or brought back to the as-purchased condition. CM is performed when equipment has failed unexpectedly in service at a more-or-less random time (i.e. the out-of-service time is not the operator’s choice as in PM).
Because such systems are generally required to meet production-run or calendar-based production goals, as well as safety goals, installation operators want assurance that the plant equipment will support these goals. On the other hand, the operator is usually constrained by a limited maintenance budget. As a consequence, operators are faced with the important problem of balancing maintenance costs against production and safety goals.
Thus, the primary decision problem with which we are confronted interlinks cost and reliability, and can be stated in the form of a question: 'How should one minimize the total cost of maintaining a nuclear power plant while ensuring that its reliability meets the standards defined by the United States Department of Energy?'
There is an enormous literature on optimal maintenance policies for a single item that dates back to the early 1950s. The majority of the work covers maintenance optimization over an infinite horizon; see Valdez-Flores and Feldman (1989) for an extensive review. The problem that we address in this chapter is over a finite planning horizon, which comes from the fact that every nuclear power plant has a license to operate that expires in a finite predefined time. In addition, the form of the policy is effectively predefined by the industry as a combination of preventive and corrective maintenance, as we describe below. Marquez and Heguedas (2002) present an excellent review of the more recent research on maintenance policies and solve the problem of periodic replacement in the context of a semi-Markov decision process methodology. Su and Chang (2000) find the periodic maintenance policies that minimize the life cycle cost over a predefined finite horizon.
A review of the Bayesian approaches to maintenance intervention is presented in Wilson and Popova (1998). Chen and Popova (2000) propose two types of Bayesian policies that learn from the failure history and adapt the next maintenance point accordingly. They find that the optimal time to observe the system depends on the underlying failure distribution. A combination of Monte Carlo simulation and optimization methodologies is used to obtain the problem’s solution. In Popova (2004), the optimal structure of Bayesian group-replacement policies for a parallel system of n items with exponential failure times and random failure parameter is presented. The paper shows that it is optimal to observe the system only at failure times. For the case of two items operating in parallel the exact form of the optimal policy is derived.
With respect to the form of the maintenance policy that we consider, commercial nuclear power industry practice establishes a set PM schedule for many major equipment classes. In an effort to avoid mistakenly removing from service sets of equipment that would cause production loss or safety concerns, effectively all equipment that can be safely maintained together are grouped and assigned a base calendar frequency. The grouping policy ensures PM (as well as most CM) will occur at the same time for equipment within a group. The base frequency is typically either set by the refueling schedule (equipment for which at-power maintenance is either impossible or undesirable) or calendar-based. The current thinking for PM is typified by, for example, the INPO guidance, INPO (2002), whereby a balance is sought between maintenance cost and production loss. By properly taking into account the probability of production loss, the cost of lost production, the CM cost and PM cost, the simple model described in this chapter captures industry practice. Not accounted for in the model is the probability and cost of production loss due to PM during at-power operation as well as potential critical path extension for PM during scheduled outages (such as refueling outage). This is partly justified by the fact that PM is only performed during the equipment's group assigned outage time (unlike CM when the equipment outage time is not predetermined) and because outage planning will (generally) assure PM is done off critical path. In practice, there are normally one or two major equipment activities (main turbine and generator, reactor vessel inspection, steam generator inspection, main condenser inspection) that, along with refueling, govern critical path during planned outages.
9.2 Maintenance model
The model we develop has the following constructs. The problem has a finite horizon of length L (e.g. 25 or 27 years). We consider the following (positive) costs: C_{pm} – preventive maintenance cost, C_{cm} – corrective maintenance cost, and C_{d} – downtime cost, which includes all lost production costs due to a disruption of power generation. Let N(t) be the counting process for the number of failures in the interval (0, t).
Let the random time to failure of the item from its as-new state be governed by distribution F with density f. Further assume that each failure of the item causes a loss of production (i.e. a plant trip) with probability p and in that case a downtime cost, C_{d} > C_{cm}, is instead incurred (this cost can include C_{cm}, if appropriate).
We consider the following form of a maintenance policy, which we denote (P):
(P): Bring the item to the ‘as-good-as-new’ state every T units of time (preventive maintenance) at a cost of C_{pm}. If it fails meanwhile then repair it to the ‘as-good-as-old’ state (corrective maintenance) for a cost of C_{cm} or C_{d}, depending on whether the failure induces a production loss.
The optimization model has a single decision variable T, which is the time between PMs, i.e. we assume constant interval lengths between PMs. The goal is to find T ∈ A ⊂ [0, L] that minimizes the total expected cost, i.e.
(9.1)
$$\underset{T\in A}{\mathrm{min}}\ z\left(T\right),\qquad z\left(T\right)=\left\lceil \frac{L}{T}\right\rceil {C}_{pm}+\left[\left(1-p\right){C}_{cm}+p\,{C}_{d}\right]\left(\left\lfloor \frac{L}{T}\right\rfloor E\left\{N\left(T\right)\right\}+E\left\{N\left(L-\left\lfloor \frac{L}{T}\right\rfloor T\right)\right\}\right)$$
where $A=\left\{i:i\ \text{integer},\ i\in \left[0,L\right]\right\}$, ⌊·⌋ is the 'floor' (round-down to nearest integer) operator, ⌈·⌉ is the 'ceiling' (round-up to nearest integer) operator, and E{N(T)} is the expected number of failures, taken with respect to the failure distribution, in the interval (0, T). Barlow and Hunter (1960) showed that, for the above-defined policy, the number of CMs in an interval of length t follows a non-homogeneous Poisson process with expected number of failures in the interval (0, t),
(9.2)
$$E\left\{N\left(t\right)\right\}={\int}_{0}^{t}q\left(u\right)\,du,$$
where $q\left(u\right)=\frac{f\left(u\right)}{1-F\left(u\right)}$ is the associated failure rate function. First we will describe an algorithm to find the optimal T when the failure distribution is IFR (increasing failure rate). Then we model the failure rate nonparametrically.
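To make the objective and the expected-failure count concrete, the sketch below evaluates E{N(T)} for a Weibull hazard, whose cumulative hazard has the closed form (t/λ)^β, and then totals the policy cost. The bookkeeping (PMs at multiples of T plus a horizon-end renewal, failures accumulated over full and partial intervals) is our reading of the policy description; all numerical parameters are illustrative.

```python
import math

def cum_hazard(t, lam, beta):
    # E{N(t)} for a Weibull hazard q(u) = (beta/lam)*(u/lam)**(beta-1):
    # the integral of q over (0, t) is (t/lam)**beta in closed form
    return (t / lam) ** beta

def z(T, L, C_pm, C_cm, C_d, p, lam, beta):
    """Total expected cost of policy (P) with PM every T time units over
    horizon L. PMs occur at T, 2T, ... and at the horizon end; failures
    accumulate over the full and partial intervals (our reading of the
    cost bookkeeping; all numerical parameters are illustrative)."""
    n_pm = math.ceil(L / T)        # PM count, including horizon-end renewal
    n_full = math.floor(L / T)     # number of full intervals of length T
    c_fail = (1 - p) * C_cm + p * C_d   # expected cost per failure
    failures = n_full * cum_hazard(T, lam, beta) \
        + cum_hazard(L - n_full * T, lam, beta)
    return n_pm * C_pm + c_fail * failures
```

For instance, with T dividing L evenly the partial-interval term vanishes and the cost is simply the PM count times C_pm plus the per-failure cost times the accumulated cumulative hazard.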
9.3 Optimization results
The development of successful optimization algorithms in the IFR context detailed in the previous section requires a closer examination of the objective function z(T). This examination will culminate in certain key conditions that will guarantee a solution to the cost minimization problem posed in equation (9.1). To this end, consider the following three propositions in which the form of the IFR function is left unspecified, that is, the theory below will hold for any arbitrary choice of an IFR function; for example, one could choose a Weibull failure rate.
The proofs of the following propositions are given in Appendix B. Also the optimization algorithm corresponding to these propositions, and which was used in the decision analysis aspect of this chapter appears in Appendix B.
Proposition 9.1 Assume that we follow maintenance policy (P), and that the time between failures is a random variable with an IFR distribution, i.e. the failure rate function q(t) is increasing. Then, the objective function z(T) is:
(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, …, and
(ii) increasing and convex on each interval $\left(\frac{L}{n+1},\frac{L}{n}\right)$, $n=1,2,\dots$
Let z^{c}(T) be the continuous relaxation of z(T); see Appendix B.
Proposition 9.2 Let D = {d : d = L/n, n = 1, 2, …} denote the set of discontinuities of z(T) (cf. Proposition 9.1). Then,
(i) z^{c}(T) is quasiconvex on [0, L].
Furthermore if d ∈ D then
(ii) z^{c}(d) = z(d), and
(iii) ${\mathrm{lim}}_{T\to {d}^{-}}\left\{z\left(T\right)-{z}^{c}\left(T\right)\right\}={C}_{pm}$.
Proposition 9.3 Let ${T}_{c}^{*}\in \mathrm{arg}\,{\mathrm{min}}_{T\in \left[0,L\right]}{z}^{c}\left(T\right)$ and let n^{∗} be the positive integer satisfying $\frac{L}{{n}^{*}+1}\le {T}_{c}^{*}\le \frac{L}{{n}^{*}}$. Then,
$${T}^{*}\in \underset{T\in \left\{\frac{L}{{n}^{*}+1},\frac{L}{{n}^{*}}\right\}}{\mathrm{arg}\,\mathrm{min}}\ {z}^{c}\left(T\right)$$
solves ${\mathrm{min}}_{T\in \left[0,L\right]}z\left(T\right)$.
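Proposition 9.3 suggests a simple two-stage procedure: minimize the quasiconvex relaxation z^c over (0, L], locate n^*, and compare the two candidate PM times L/(n^*+1) and L/n^*. The actual algorithm used in the chapter appears in Appendix B; the sketch below is a hedged reconstruction under an illustrative Weibull cumulative hazard, with z^c taken as the natural relaxation in which the floor and ceiling counts are replaced by L/T.

```python
import math

def expected_failures(t, lam=1000.0, beta=2.0):
    # illustrative Weibull cumulative hazard: E{N(t)} = (t/lam)**beta
    return (t / lam) ** beta if t > 0 else 0.0

def z_cont(T, L, C_pm, c_fail):
    # continuous relaxation z^c: floor/ceiling counts replaced by L/T,
    # so each of the L/T intervals contributes C_pm + c_fail * E{N(T)}
    return (L / T) * (C_pm + c_fail * expected_failures(T))

def optimal_pm_time(L, C_pm, c_fail):
    # Step 1: minimize the quasiconvex z^c on (0, L]; a fine grid keeps
    # the sketch dependency-free (any scalar minimizer works here)
    Ts = [L * (k + 1) / 10000 for k in range(10000)]
    T_c = min(Ts, key=lambda T: z_cont(T, L, C_pm, c_fail))
    # Step 2: find n* with L/(n*+1) <= T_c <= L/n*
    n_star = math.floor(L / T_c)
    # Step 3 (Proposition 9.3): the optimum is one of the two candidates
    candidates = [L / (n_star + 1), L / n_star]
    return min(candidates, key=lambda T: z_cont(T, L, C_pm, c_fail))
```

The quasiconvexity of z^c (Proposition 9.2) is what justifies accepting the grid minimizer in Step 1 before snapping to the two admissible candidates.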
9.4 Data and Bayesian models
The assumption of constant failure rates, with the associated exponential failure distribution and homogeneous Poisson process, pervades analysis in today's nuclear power industry. Oftentimes, the data gathered in industry are the number of failures in a given time period, which is sufficient for exponential (constant failure rate) failure times. In general, we cannot expect to have a 'complete data set' but rather a collection of failure times and 'success' times (also known as right-censored times). Recognizing the limitations of the exponential failure rate model, the Risk Management Department at STP has implemented the procedure now recommended by the US Department of Energy; see Blanchard (1993). The constant failure rate assumption is dropped in favour of a Bayesian scheme in which the failure rate is assumed to be a gamma random variable, which is the conjugate prior for the exponential distribution, and an updating procedure is defined. This methodology requires gathering, at a local level, only the number of failed items in a given month. The mean of the posterior distribution of the failure rate, combining the local and industry-wide history, is then used as a forecast of the frequency of failures for the upcoming month.
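The conjugate updating scheme just described reduces to closed-form arithmetic: a gamma prior on the exponential failure rate, combined with the month's failure count and exposure, yields a gamma posterior whose mean is the forecast. A minimal sketch (the prior parameters and data here are illustrative, not the DOE database values):

```python
def update_failure_rate(alpha0, beta0, n_failures, exposure):
    """Conjugate gamma update for an exponential failure rate: a
    Gamma(alpha0, beta0) prior (shape, rate) combined with n_failures
    observed over `exposure` item-time yields a Gamma posterior whose
    mean serves as the next period's failure-frequency forecast."""
    alpha1 = alpha0 + n_failures
    beta1 = beta0 + exposure
    return alpha1, beta1, alpha1 / beta1
```

Industry-wide history enters through the prior (alpha0, beta0); the local monthly counts enter through the update.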
The hazard function associated with a gamma random variable monotonically increases or decreases to a constant as time tends to infinity. A Weibull family, however, allows hazards to decay to zero or increase monotonically, and is therefore a common choice in reliability studies when one is uncertain a priori whether the instantaneous risk of component failure becomes essentially constant after some time, as it does for exponential and gamma random variables. Yu et al. (2004, 2005) apply Bayesian estimation procedures when the failure time distribution is assumed Weibull. Thus a Weibull failure rate model is an improvement on the constant failure rate model. But parametric models such as the exponential, gamma, and Weibull all imply unimodal, right-skewed failure time densities and monotone hazard rate functions. This has serious consequences for the estimation of the expected value in equation (9.1) if the true underlying failure rate is multi-modal with varying levels of skewness. A semiparametric approach that relaxes such stringent requirements on the data while retaining the flavour of these parametric families is therefore warranted.
We first describe the nuclear power plant data to be used in the Bayesian analysis.
9.4.1 Data
The failure data analysed in this chapter are for two different systems: the auxiliary feedwater system (part of the safety system) and the electro-hydraulic control system (part of the nuclear power generation reactor).
The primary function of the auxiliary feedwater system (AF) is to supply cooling water during emergency operation. The cooling water supplied is boiled off in steam generators to remove decay heat created by products of nuclear fission. There are four pumps provided in the system, three electric motor driven and one turbine driven. Each of these pumps supplies its own steam generator although, if required, cross-connection capability is provided such that any steam generator could be cooled by any pump. In addition, isolation valves, flow control valves, and back flow prevention valves called check valves are installed to control flow through the system and to provide capability to isolate the pumping system, if needed, to control an accident. A simplified schematic diagram is shown in Figure 9.1.
We are investigating the performance of the following five groups:
• the electric motor driven pumps (six in total, three in each of two STP Units), labelled GS1;
• the flow control valves for all the pumps (eight in total, four in each of two STP Units), labelled GS2;
• the cross-connect valves (eight in total, four in each of two STP Units), labelled GS3;
• the accident isolation valves’ motor driver (eight in total, four in each of two STP Units), labelled GS4;
• the check valves (eight in total, four in each of two STP Units), labelled GS5.
We have failure time observations for the five different groups described above starting from 1988 until May 2008. There are 56 exact observations for GS1, 2144 for GS2, 202 for GS3, 232 for GS4, and 72 for GS5.
The electro-hydraulic control system (EHC) consists of the high pressure electro-hydraulic fluid system and the electronic controllers, which are used for the control of the main electrical generating steam turbine as well as the steam turbines that drive the normal steam generator feed water supply pump. The high pressure steam used to turn the turbines described above is throttled for speed control and shut off with steam valves using hydraulic operators (pistons). The control functions include limiting the top speed of the turbines, up to and including shutdown if speeds become excessive. EHC provides high pressure hydraulic oil which is used as the motive force, and the electrical signals that adjust the speed control and steam shut-off valves for the turbines. Because only one main electrical generating steam turbine is provided for each plant, production of electricity from the plant is stopped or reduced for some failures in the EHC system.
We would like to study the performance of:
• particular pressure switches that control starting and stopping of the hydraulic fluid supply pumps; the turbine shutoff warning; and temperature control of steam to the main turbine from a heat exchanger, labelled GN1;
• oil cleaning filters located at the outlet of the hydraulic fluid supply pumps, labelled GN2;
• solenoid valves (referred to as servo valves) which control the operation of the steam valves to the main electrical generating steam turbine, labelled GN3;
• the two hydraulic supply pumps (one is required for the system to operate), labelled GN4;
• solenoid valves which, when operating properly, respond to certain failures and stop the main electrical generating steam turbine as necessary to prevent damage, labelled GN5.
We have failure time observations for the five different groups described above, starting from 1988 until May 2008. There are 221 exact and one censored observation for GN1, 243 exact and one censored for GN2, 133 exact for GN3, 223 exact for GN4, and 91 exact for GN5.
9.4.2 Bayesian parametric model
Here we present a Bayesian model for the failure times assuming that the lifetime distribution is Weibull with random parameters λ and β.
The sampling plan is as follows: If we have n items under observation, s of which have failed at ordered times T_{1}, T_{2}, … , T_{s}, then n − s have operated without failing. If there are no withdrawals then the total time on test is: $\omega =n{T}_{s}^{\beta}$ which is also the sufficient statistic for estimating λ (also known as the rescaled total time on test).
Assume that λ has an inverted gamma prior distribution with hyper-parameters ν_{0}, μ_{0} and β has uniform prior distribution with hyperparameters α_{0}, β_{0}.
We use the generic information (lognormal prior for λ) to assess the values of α_{0} and β_{0}. First we calculate the mean and variance for the lognormal distribution with parameters:
(9.3)
$$\sigma =\frac{\mathrm{ln}\left(EF\right)}{1.645}$$
(9.4)
$$\mu =\mathrm{ln}\left({M}_{n}\right)-\frac{{\sigma}^{2}}{2}$$
where M_{n} is the mean and EF is the error factor as derived in Blanchard (1993). Then we compute the parameters of the Gamma prior distribution, α_{0}, β_{0}:
(9.5)
$${\alpha}_{0}=\frac{{m}^{2}}{v}$$
(9.6)
$${\beta}_{0}=\frac{m}{v}$$
where m and v denote the lognormal mean and variance computed above.
We will follow the development from Martz and Waller (1982). The posterior expectations of λ and β are:
(9.7)
$$E\left(\lambda |z\right)=\frac{{J}_{2}}{\left(s+{\nu}_{0}-1\right){J}_{1}}$$
(9.8)
$$E\left(\beta |z\right)=\frac{{J}_{3}}{{J}_{1}}$$
where z is a summary of the sample evidence, $v={\prod}_{i=1}^{s}{T}_{i}$, ${\omega}_{1}=n{T}_{s}^{\beta}+{\mu}_{0}$, ${J}_{1}={\int}_{{\alpha}_{0}}^{{\beta}_{0}}\frac{{\beta}^{s}{v}^{\beta}}{{\omega}_{1}^{s+{\nu}_{0}}}\,d\beta$, ${J}_{2}={\int}_{{\alpha}_{0}}^{{\beta}_{0}}\frac{{\beta}^{s}{v}^{\beta}}{{\omega}_{1}^{s+{\nu}_{0}-1}}\,d\beta$, and ${J}_{3}={\int}_{{\alpha}_{0}}^{{\beta}_{0}}\frac{{\beta}^{s+1}{v}^{\beta}}{{\omega}_{1}^{s+{\nu}_{0}}}\,d\beta$.

We use numerical integration to solve the above integrals.

Table 9.1 Bayes estimates of the parametric Bayes model.

| Group | λ | β |
|---|---|---|
| GN1 | 0.000152 | 4.763118 |
| GN2 | 0.000448 | 1.994206 |
| GN3 | 0.000294 | 9.980280 |
| GN4 | 0.001594 | 1.006348 |
| GN5 | 0.000298 | 5.931251 |
| GS1 | 0.000313 | 9.910241 |
| GS2 | 0.001499 | 0.348347 |
| GS3 | 0.000932 | 0.655784 |
| GS4 | 0.001667 | 0.723488 |
| GS5 | 0.000308 | 11.109310 |
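The numerical integration of J₁, J₂, and J₃ can be sketched with a simple trapezoidal rule; the integrands below follow the definitions given in the text, with ω₁ taken to depend on β through n T_s^β + μ₀ (our reading of the rescaled total time on test), and are evaluated on the log scale to avoid overflow when s is large. All inputs are illustrative.

```python
import math

def J_integrals(times, n, nu0, mu0, a0, b0, grid=2000):
    """Approximate J1, J2, J3 by the trapezoidal rule on [a0, b0].

    `times` are the s exact (ordered) failure times out of n items.
    Following our reading of the text, omega_1 = n*T_s**beta + mu0
    and v = prod(T_i); each integrand is evaluated on the log scale."""
    s = len(times)
    Ts = max(times)
    log_v = sum(math.log(t) for t in times)

    def integrands(beta):
        log_omega1 = math.log(n * Ts ** beta + mu0)
        base = s * math.log(beta) + beta * log_v
        f1 = math.exp(base - (s + nu0) * log_omega1)                  # J1
        f2 = math.exp(base - (s + nu0 - 1) * log_omega1)              # J2
        f3 = math.exp(base + math.log(beta) - (s + nu0) * log_omega1)  # J3
        return f1, f2, f3

    h = (b0 - a0) / grid
    J1 = J2 = J3 = 0.0
    for k in range(grid + 1):
        w = 0.5 if k in (0, grid) else 1.0   # trapezoid endpoint weights
        f1, f2, f3 = integrands(a0 + k * h)
        J1 += w * h * f1
        J2 += w * h * f2
        J3 += w * h * f3
    return J1, J2, J3
```

The posterior expectations of the Weibull parameters then follow as ratios of these integrals.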
The choice of the hyperparameter values is not arbitrary. We used the values given in the DOE database (see Blanchard, 1993) for the inverted gamma parameters, and empirically assessed the parameters of the uniform prior distribution; for details see Yu et al. (2004). Table 9.1 shows the Bayes estimates of the model parameters for each group of data.
9.4.3 Bayesian semiparametric model
The failure data are modelled using an accelerated failure time (AFT) specification, an alternative to the Cox proportional hazards regression model:
$${S}_{x}\left(t\right)={S}_{0}\left({e}^{{x}^{\prime}\beta}t\right),$$
where x is a p-dimensional vector of covariates, S_x(·) the corresponding reliability function, and S_0 is the 'baseline' reliability function.
The mixture of Polya trees models (MPTs) of Hanson (2006) is used to model the reliability or survival data. These models are 'centred' at the corresponding parametric models but allow significant data-driven deviations from parametric assumptions while retaining the predictive power associated with parametric modeling. This point will be elaborated in Appendix A. This more flexible approach in essence assumes a nonparametric model for S_0; i.e. an MPT prior is placed on S_0. For the power plant data considered in this chapter, without loss of generality, we take the first group from each system to be represented by S_0. Then, the covariates are indicator variables corresponding to each of the remaining five groups in the data. Note that, if needed, one could easily incorporate other continuous, time-dependent covariates as well in the above formulation. Details of the MPT model are provided in Appendix A.
The MPT prior provides an intermediate choice between a strictly parametric analysis and allowing S_{0} to be completely arbitrary. In some ways it provides the best of both worlds. In areas where data are sparse, such as the tails, the MPT prior places relatively more posterior mass on the underlying parametric family {G_{θ} : θ ∈ Θ}. In areas where data are plentiful the posterior is more data driven; and features not allowed in the strictly parametric model, such as left-skew and multimodality, become apparent. The user-specified weight w controls how closely the posterior follows {G_{θ} : θ ∈ Θ} with larger values of w yielding inference closer to that obtained from the underlying parametric model. Hanson (2006) describes priors for w; we simply fix w to be some small value, typically w = 1.
Assume standard, right-censored reliability data $\mathcal{D}={\left\{\left({x}_{i},{t}_{i},{\delta}_{i}\right)\right\}}_{i=1}^{n}$, and let ${\mathcal{D}}_{i}$ denote the ith triple $\left({x}_{i},{t}_{i},{\delta}_{i}\right)$. Let ${T}_{i}\sim {S}_{{x}_{i}}(\cdot )$. As usual, δ_{i} = 0 indicates that t_{i} is a censoring time, T_{i} > t_{i}, and δ_{i} = 1 denotes that t_{i} is a survival time, T_{i} = t_{i}.
Given S_{0} and β, the survival function for covariates x is
(9.9)
$${S}_{x}\left(t|\mathcal{Y},\theta \right)={S}_{0}\left({e}^{{x}^{\prime}\beta}t\,|\,\mathcal{Y},\theta \right)$$
and the pdf is
(9.10)
$${f}_{x}\left(t|\mathcal{Y},\theta \right)={e}^{{x}^{\prime}\beta}{f}_{0}\left({e}^{{x}^{\prime}\beta}t\,|\,\mathcal{Y},\theta \right)$$
where ${S}_{0}\left(t|\mathcal{Y},\theta \right)$ and ${f}_{0}\left(t|\mathcal{Y},\theta \right)$ are given by (9.16) and (9.17) in Appendix A.
Assuming uninformative censoring, the likelihood for right censored data is then given by
(9.11)
$$L\left(\beta ,\theta ,\mathcal{Y}\right)={\prod}_{i=1}^{n}{f}_{{x}_{i}}{\left({t}_{i}|\mathcal{Y},\theta \right)}^{{\delta}_{i}}\,{S}_{{x}_{i}}{\left({t}_{i}|\mathcal{Y},\theta \right)}^{1-{\delta}_{i}}$$
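In code, the right-censored likelihood is a product of density contributions for exact failure times and survival contributions for censored times. The sketch below assumes the AFT form S_x(t) = S0(e^{xβ} t) with a scalar covariate for brevity; the baseline density and survival function are passed in as callables rather than tied to any particular prior.

```python
import math

def loglik(data, beta, S0, f0):
    """Right-censored log likelihood under the AFT form assumed here,
    S_x(t) = S0(exp(x*beta) * t) with a scalar covariate x for brevity.
    `data` holds (x, t, delta) triples; delta = 1 marks an exact failure
    time (density contribution), delta = 0 a censored time (survival)."""
    ll = 0.0
    for x, t, delta in data:
        r = math.exp(x * beta)    # acceleration factor
        if delta == 1:
            ll += math.log(r * f0(r * t))
        else:
            ll += math.log(S0(r * t))
    return ll
```

With a unit-exponential baseline, an exact time t contributes −t and a censored time contributes −t as well, which gives a quick sanity check on the bookkeeping.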
The MCMC algorithm alternately samples [β, θ|Y, D] and [Y|β, θ, D]. The vector [β, θ|Y, D] is efficiently sampled with a random walk Metropolis-Hastings step (Tierney, 1994). A proposal that has worked very well in practice is obtained by using the large-sample estimated covariance matrix from fitting the parametric log-logistic model via maximum likelihood. A simple Metropolis-Hastings step for updating the components (Y_{ϵ0}, Y_{ϵ1}) one at a time first samples a candidate $\left({Y}_{\epsilon 0}^{*},{Y}_{\epsilon 1}^{*}\right)$ from a Dirichlet(mY_{ϵ0}, mY_{ϵ1}) distribution, where m > 0, typically m = 20. This candidate is accepted as the 'new' (Y_{ϵ0}, Y_{ϵ1}) with probability
where j is the number of digits in the binary number ϵ0 and Y^{∗} is the set Y with $\left({Y}_{\epsilon 0}^{*},{Y}_{\epsilon 1}^{*}\right)$ replacing (Y_{ϵ0}, Y_{ϵ1}). The resulting Markov chain has mixed reasonably well for many data sets unless w is set to be very close to zero.
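The single-pair update can be sketched as follows. The target log posterior is abstracted as a user-supplied callable, since its exact form (Polya tree prior times likelihood) is not reproduced here; the Dirichlet proposal with concentration m follows the text, and the proposal densities enter the acceptance ratio because the proposal is not symmetric.

```python
import math
import random

def dirichlet2(a, b):
    # draw (y0, y1) ~ Dirichlet(a, b) from two gamma variates
    g0 = random.gammavariate(a, 1.0)
    g1 = random.gammavariate(b, 1.0)
    return g0 / (g0 + g1), g1 / (g0 + g1)

def log_dirichlet2(y, a, b):
    # log density of (y0, y1) under Dirichlet(a, b)
    return ((a - 1) * math.log(y[0]) + (b - 1) * math.log(y[1])
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def mh_step(y, log_post, m=20.0):
    """One Metropolis-Hastings update of a conditional-probability pair
    (Y_e0, Y_e1), proposing from Dirichlet(m*Y_e0, m*Y_e1) as in the text.
    `log_post` is a user-supplied log posterior of the pair given all other
    quantities (prior times likelihood); its form is not reproduced here."""
    y_new = dirichlet2(m * y[0], m * y[1])
    log_ratio = (log_post(y_new) - log_post(y)
                 + log_dirichlet2(y, m * y_new[0], m * y_new[1])  # q(y | y*)
                 - log_dirichlet2(y_new, m * y[0], m * y[1]))     # q(y* | y)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        return y_new
    return y
```

Larger m concentrates the proposal near the current pair, trading acceptance rate against step size.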
Based on Figures 9.2, 9.3, and 9.4, it is clear that the assumption of IFR is inappropriate. The 10 groups depict different hazard patterns that are nicely captured by the MPT model. Also, the survival probabilities are widely different. For example, for the EHC system, the group GN3 is by far the most reliable, whereas the group GS1 performs best for the AF system.
In contrast, the parametric Weibull model produces different results. Figures 9.5 and 9.6 show the corresponding hazard curves using the Bayes point estimators of the model parameters.
Table 9.2 shows the optimal time to perform preventive maintenance for each of the groups, using the Bayesian nonparametric and the Bayesian parametric models to estimate the expected number of failures. The optimal preventive maintenance time is the same for groups GN1, GN2, GN3, GN4, GN5, GS1, and GS5, but the associated total costs differ. This is due to the fact that the Bayesian parametric model underestimates the expected number of failures in a given interval, which results in a lower total cost. Groups GS2, GS3, and GS4 have different preventive maintenance intervals. For all of them the Bayesian parametric model produced decreasing hazard curves; hence, no preventive maintenance is optimal to be performed before the end of the time horizon (replacement at the end of the time horizon is one of our initial assumptions).

Table 9.2 Optimal PM time (in days) and the associated total cost using the Bayesian Nonparametric Model (BNP) and Bayesian Parametric Model (BP).

| Group | Optimal PM time (BNP) | Total cost (BNP) | Optimal PM time (BP) | Total cost (BP) |
|---|---|---|---|---|
| GN1 | 501 | 226,996 | 501 | 104,070 |
| GN2 | 501 | 157,019 | 501 | 64,145 |
| GN3 | 501 | 195,147 | 501 | 120,485 |
| GN4 | 501 | 400,072 | 501 | 263,641 |
| GN5 | 501 | 71,669 | 501 | 40,027 |
| GS1 | 501 | 172,614 | 501 | 104,069 |
| GS2 | 501 | 921,880 | 1000 | 529,848 |
| GS3 | 501 | 850,355 | 1000 | 923,097 |
| GS4 | 501 | 460,057 | 1000 | 1,117,478 |
| GS5 | 501 | 82,777 | 501 | 40,026 |
Appendix
A. Broader context and background
In this chapter we have introduced Bayesian decision analysis to enable engineers to reach an optimal decision in the context of maintaining nuclear power plants. Bayesian analysis typically stops at inference. However, as noted by DeGroot (2004), Smith (1988), and Berger (1980), for Bayesian methods to gain acceptance, one needs to extend posterior inference to actual decisions.
In this chapter we have used Polya trees and mixtures of Polya trees to carry out inference. We provide some details of this methodology.
Extending the work of Lavine (1992, 1994), mixtures of Polya trees have been considered by Berger and Guglielmi (2001), Walker and Mallick (1999), Hanson and Johnson (2002), Hanson (2006), Hanson et al. (2006), and Hanson (2007).
The last four references deal with survival or reliability data. This prior is attractive in that it is easily centred at a parametric family such as the class of Weibull distributions. When sample sizes are small, posterior inference has much more of a parametric flavour due to the centring family. When sample sizes increase, the data take over the prior and features such as multimodality and left skew in the reliability distributions are better modelled.
Consider a mixture of Polya trees prior on S_{0},
(9.12)
$${S}_{0}\,|\,\theta \sim PT\left({\Pi}_{\theta},\mathcal{A}\right)$$
(9.13)
$$\theta \sim p\left(\theta \right),$$
where (9.12) is shorthand for a particular Polya tree prior; see Hanson and Johnson (2002). We briefly describe the prior but leave details to the references above.
Let J be a fixed, positive integer and let G_{θ} denote the family of Weibull cumulative distribution functions, ${G}_{\theta}\left(t\right)=1-\mathrm{exp}\left\{-{\left(t/\lambda \right)}^{\alpha}\right\}$ for t ≥ 0, where $\theta ={\left(\alpha ,\lambda \right)}^{\prime}$. The distribution G_{θ} serves to centre the distribution of the Polya tree prior. A Polya tree prior is constructed from a set of partitions Π_{θ} and a family A of positive real numbers. Consider the family of partitions ${\Pi}_{\theta}=\left\{{B}_{\theta}\left(\epsilon \right):\epsilon \in {\cup}_{l=1}^{J}{\left\{0,1\right\}}^{l}\right\}$. If j is the base-10 representation of the binary number $\epsilon ={\epsilon}_{1}\cdots {\epsilon}_{k}$ at level k, then ${B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{k}\right)$ is defined to be the interval $\left({G}_{\theta}^{-1}\left(j/{2}^{k}\right),{G}_{\theta}^{-1}\left(\left(j+1\right)/{2}^{k}\right)\right)$. For example, with k = 3 and ϵ = 000, then j = 0 and ${B}_{\theta}\left(000\right)=\left(0,{G}_{\theta}^{-1}\left(1/8\right)\right)$; with ϵ = 010, then j = 2 and ${B}_{\theta}\left(010\right)=\left({G}_{\theta}^{-1}\left(2/8\right),{G}_{\theta}^{-1}\left(3/8\right)\right)$; etc.
Note then that at each level k, the class {B_{θ}(ϵ) : ϵ ∈ {0, 1}^{k}} forms a partition of the positive reals and furthermore ${B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{k}\right)={B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{k}0\right)\cup {B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{k}1\right)$ for k = 1, 2, …, J − 1. We take the family $\mathcal{A}=\left\{{\alpha}_{\epsilon}:\epsilon \in {\cup}_{j=1}^{J}{\left\{0,1\right\}}^{j}\right\}$ to be defined by ${\alpha}_{{\epsilon}_{1}\cdots {\epsilon}_{k}}=w{k}^{2}$ for some w > 0; see Walker and Mallick (1999) and Hanson and Johnson (2002). As w tends to zero the posterior baseline is almost entirely data-driven. As w tends to infinity we obtain a fully parametric analysis.
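The partition construction is straightforward to compute directly from the quantile function of the centring Weibull family. A minimal sketch (parameter values are illustrative):

```python
import math

def weibull_quantile(u, lam, alpha):
    # inverse of G(t) = 1 - exp(-(t/lam)**alpha)
    return lam * (-math.log(1.0 - u)) ** (1.0 / alpha)

def partition_interval(eps, lam, alpha):
    """Partition set B_theta(eps) for a binary string eps: with j the
    base-10 value of eps at level k = len(eps), the set is the interval
    (G^{-1}(j / 2**k), G^{-1}((j + 1) / 2**k))."""
    k = len(eps)
    j = int(eps, 2)
    lo = weibull_quantile(j / 2 ** k, lam, alpha) if j > 0 else 0.0
    hi = weibull_quantile((j + 1) / 2 ** k, lam, alpha) \
        if j + 1 < 2 ** k else math.inf
    return lo, hi
```

For the unit-exponential case (λ = α = 1), `partition_interval("000", 1.0, 1.0)` reproduces the example set (0, G^{-1}(1/8)) from the text.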
Given Π_{θ} and A, the Polya tree prior is defined up to level J by the random vectors $\mathcal{Y}=\left\{\left({Y}_{\epsilon 0},{Y}_{\epsilon 1}\right):\epsilon \in {\displaystyle {\cup}_{j=0}^{J-1}{\left\{0,1\right\}}^{j}}\right\}$ through the product
(9.14)
$${S}_{0}\left\{{B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{k}\right)|\mathcal{Y},\theta \right\}={\displaystyle {\prod}_{j=1}^{k}{Y}_{{\epsilon}_{1}\cdots {\epsilon}_{j}}}$$
for k = 1, 2, … , J, where we define S_{0}(A) to be the baseline measure of any set A. The vectors (Y_{ϵ0}, Y_{ϵ1}) are independent Dirichlet:
(9.15)
$$\left({Y}_{\epsilon 0},{Y}_{\epsilon 1}\right)\sim \mathrm{Dirichlet}\left({\alpha}_{\epsilon 0},{\alpha}_{\epsilon 1}\right).$$
The Polya tree parameters ‘adjust’ conditional probabilities, and hence the shape of the survival density f_{0}, relative to a parametric centring family of distributions. If the data are truly distributed G_{θ}, observations should on average be evenly distributed among partition sets at any level j. Under the Polya tree posterior, if more observations fall into the interval ${B}_{\theta}\left(\epsilon 0\right)\subset {\mathbb{R}}^{+}$ than its companion set B_{θ}(ϵ1), the conditional probability Y_{ϵ0} of B_{θ}(ϵ0) is accordingly stochastically ‘increased’ relative to Y_{ϵ1}. This adaptability makes the Polya tree attractive in its flexibility, but also anchors the random S_{0} firmly about the family {S_{θ} : θ ∈ Θ}.
Beyond sets at the level J in Π_{θ} we assume S_{0}|Y, θ follows the baseline G_{θ}. Hanson and Johnson (2002) show that this assumption yields predictive distributions that are the same as from a fully specified (infinite) Polya tree for large enough J; this assumption also avoids a complication involving infinite probability in the tail of the density f_{0}|Y, θ that arises from taking f_{0}|Y, θ to be flat on these sets.
Define the vector of probabilities $p=p\left(\mathcal{Y}\right)={\left({p}_{1},{p}_{2},\dots ,{p}_{{2}^{J}}\right)}^{\prime}$ by ${p}_{j+1}={S}_{0}\left\{{B}_{\theta}\left({\epsilon}_{1}\cdots {\epsilon}_{J}\right)|\mathcal{Y},\theta \right\}={\displaystyle {\prod}_{i=1}^{J}{Y}_{{\epsilon}_{1}\cdots {\epsilon}_{i}}}$, where ϵ_{1} · · · ϵ_{J} is the base-2 representation of j, for j = 0, … , 2^{J} − 1. After simplification, the baseline survival function is
(9.16)
$${S}_{0}\left(t|\mathcal{Y},\theta \right)={p}_{N}\left\{N-{2}^{J}{G}_{\theta}\left(t\right)\right\}+{\displaystyle {\sum}_{j=N+1}^{{2}^{J}}{p}_{j}},$$
where $N=\lfloor {2}^{J}{G}_{\theta}\left(t\right)\rfloor +1$ denotes the index of the level-J set containing t and g_{θ}(·) is the density corresponding to G_{θ}. The density associated with S_{0}(t|Y, θ) is given by
(9.17)
$${f}_{0}\left(t|\mathcal{Y},\theta \right)={2}^{J}{g}_{\theta}\left(t\right){\displaystyle {\sum}_{i=0}^{{2}^{J}-1}{p}_{i+1}\mathbb{1}\left\{t\in {B}_{\theta}\left({\epsilon}_{J}\left(i\right)\right)\right\}}={2}^{J}{p}_{N}\,{g}_{\theta}\left(t\right),$$
where ϵ_{J}(i) is the binary representation ϵ_{1} · · · ϵ_{J} of the integer i and N is as in (9.16). Note that the number of elements of Y is moderate: ${\sum}_{j=1}^{J}{2}^{j}={2}^{J+1}-2$. For a typical choice of J = 5, this is 62, of which half need to be sampled in an MCMC scheme.
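Once Y has been drawn, the baseline survival function can be evaluated by summing the p_{j} of the level-J sets beyond t and interpolating within the set containing t, with mass spread there according to G_{θ}. The Python sketch below uses hypothetical names and a dictionary `Y` keyed by binary strings; it illustrates the construction described above, not the study's code.

```python
import math

def weibull_cdf(t, alpha, lam):
    """G_theta(t) = 1 - exp(-(t/lam)**alpha)."""
    return 1.0 - math.exp(-((t / lam) ** alpha))

def polya_tree_survival(t, Y, alpha, lam):
    """Baseline survival S_0(t | Y, theta) for a finite Polya tree of depth J
    centred at a Weibull(alpha, lam).  Y maps each binary string eps of
    length 1..J to the conditional probability Y_eps of the set B_theta(eps);
    p_{j+1} is the product of Y along the path to the j-th level-J set."""
    J = max(len(eps) for eps in Y)
    G = weibull_cdf(t, alpha, lam)
    N = min(int(2**J * G) + 1, 2**J)           # index of the set containing t
    def p(j):                                  # p_{j+1}, for j = 0..2^J - 1
        eps = format(j, "0%db" % J)
        prob = 1.0
        for i in range(1, J + 1):
            prob *= Y[eps[:i]]
        return prob
    tail = sum(p(j) for j in range(N, 2**J))   # mass in sets beyond the N-th
    within = p(N - 1) * (N - 2**J * G)         # remaining mass inside set N
    return tail + within
```

When every Y_{ϵ} = 1/2, the tree puts mass 2^{−J} on each level-J set and the survival function reduces to the Weibull baseline 1 − G_{θ}(t), consistent with the centring property above.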
B. Proof of Propositions
Proposition 9.1 Assume that we follow maintenance policy (P), and that the time between failures is a random variable with an IFR distribution, i.e. the failure rate function q(t) is increasing. Then, the objective function z(T) is:
(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, … , and
(ii) increasing and convex on each interval $\left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$, n = 1, 2, … .
Proof We first show (ii). Assume that $T\in \left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$ for some n. Within this interval
$${z}^{\prime}\left(T\right)\equiv \frac{dz\left(T\right)}{dT}=n{\overline{C}}_{cm}\left\{q\left(T\right)-q\left(L-nT\right)\right\},$$where ${\overline{C}}_{cm}=p{C}_{d}+\left(1-p\right){C}_{cm}$. Here, we have used $q\left(t\right)={\scriptscriptstyle \frac{d}{dt}}E\left[N\left(t\right)\right]$ and the fact that ⌊L/T⌋ = n is constant within this n-th interval. That z increases on this interval follows from the fact that z′(T) is positive, since q is increasing and $L-nT<T$ for $T\in \left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$. To show convexity, we show that z′(T) is increasing. Let $T+\Delta T\in \left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$. Then q(T + ΔT) ≥ q(T) and q{L − n(T + ΔT)} ≤ q(L − nT), and hence z′(T + ΔT) ≥ z′(T). This completes the proof of (ii).
The convexity result for z(T) from (ii) shows that T = L, L/2, L/3, … are the only possible points of discontinuity in z(T). The first term in z(T) (i.e. the PM term) is lower semicontinuous because of the ceiling operator. The second term (i.e. z(T) when C _{pm} = 0) is continuous in T. To show this it suffices to verify that
$$\lfloor L/T\rfloor E\left\{N\left(T\right)\right\}+E\left\{N\left(L-T\lfloor \frac{L}{T}\rfloor \right)\right\}$$has limit n E{N(d)} for both T → d^{−} and T → d^{+}, where d = L/n for any n = 1, 2, … . This follows in a straightforward manner using the bounded convergence theorem, e.g. Cinlar (1975, p. 33). Hence, we have (i). □
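The behaviour established in Proposition 9.1 is easy to check numerically. The sketch below assumes a minimal-repair (NHPP) failure model with Weibull mean function E{N(t)} = (t/λ)^{α}, so that q(t) is increasing for α > 1, and reads the objective off the proof: cost C_pm per preventive replacement with ⌈L/T⌉ replacements, plus expected corrective cost over the ⌊L/T⌋ full intervals and the remainder. All constants are illustrative, not values from the study.

```python
import math

L, C_pm, C_cm_bar, alpha, lam = 10.0, 1.0, 5.0, 2.0, 4.0
EN = lambda t: (t / lam) ** alpha      # assumed NHPP mean function E{N(t)}

def z(T):
    """Expected horizon cost: PM every T time units, minimal repairs
    (expected cost C_cm_bar each) between PMs, over a horizon of length L."""
    n = math.floor(L / T)
    return C_pm * math.ceil(L / T) + C_cm_bar * (n * EN(T) + EN(L - n * T))

# (ii): z increases on the interval (L/3, L/2) = (3.33..., 5)
Ts = [3.4 + 0.1 * i for i in range(16)]
assert all(z(Ts[i]) <= z(Ts[i + 1]) for i in range(len(Ts) - 1))

# (i): z is lower semicontinuous at d = L/2 = 5; approaching from the
# left it exceeds z(d) by (approximately) C_pm, cf. Proposition 9.2(iii)
assert abs((z(5.0 - 1e-7) - z(5.0)) - C_pm) < 1e-3
```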
Let
$${z}^{c}\left(T\right)=\frac{L}{T}\left[{C}_{pm}+{\overline{C}}_{cm}E\left\{N\left(T\right)\right\}\right],$$i.e. z^{c}(T) is z(T) with ⌈L/T⌉ and ⌊L/T⌋ both replaced by L/T. The following proposition characterizes z^{c}(T) and its relationship with z(T).
Proposition 9.2 Let D = {d : d = L/n, n = 1, 2, …} denote the set of discontinuities of z(T) (cf. Proposition 9.1). Then,
(i) z^{c}(T) is quasiconvex on [0, L].
Furthermore if d ∈ D then
(ii) z^{c}(d) = z(d), and
(iii) $\underset{T\to {d}^{-}}{\mathrm{lim}}\left\{z\left(T\right)-{z}^{c}\left(T\right)\right\}={C}_{pm}$.
Proof
(i) We have
(9.18)
$${z}^{c}\left(T\right)=\frac{{C}_{pm}+{\overline{C}}_{cm}E\left\{N\left(T\right)\right\}}{T/L}.$$The numerator of (9.18) is convex in T because E{N(T)} has increasing derivative q(T). As a result, z^{c}(T) has convex level sets and is hence quasiconvex.
(ii) The result is immediate by evaluating z and z^{c} at points of D.
(iii) Let d ∈ D. Then, there exists a positive integer n with d = L/n. Note that due to the ceiling operator, we have the following result
(9.19)
$$\underset{T\to {d}^{-}}{\mathrm{lim}}\left\{\lceil \frac{L}{T}\rceil -\frac{L}{T}\right\}=1.$$Based on (9.19) and the definitions of z and z^{c} it suffices to show
$$\lfloor L/T\rfloor E\left\{N\left(T\right)\right\}+E\left\{N\left(L-T\lfloor \frac{L}{T}\rfloor \right)\right\}$$has limit nE{N(d)} as T → d^{−}. This was established in the proof of part (i) of Proposition 9.1. □
Proposition 9.2 shows that at its points of discontinuity, our objective function z drops by magnitude C _{pm} to agree with z^{c}. Before developing the associated algorithm, the following proposition shows how to solve a simpler variant of our model in which the set A of feasible replacement intervals is replaced by [0, L].
Proposition 9.3 Let
$${T}_{c}^{*}\in \underset{T\in \left[0,L\right]}{\mathrm{arg}\mathrm{min}}{z}^{c}\left(T\right)$$and let n^{∗} be the positive integer satisfying ${\scriptscriptstyle \frac{L}{{n}^{*}+1}}\le {T}_{c}^{*}\le {\scriptscriptstyle \frac{L}{{n}^{*}}}$. Then,
$${T}^{*}\in \underset{T\in \left\{{\scriptscriptstyle \frac{L}{{n}^{*}+1}},{\scriptscriptstyle \frac{L}{{n}^{*}}}\right\}}{\mathrm{arg}\mathrm{min}}{z}^{c}\left(T\right)$$solves min_{T}_{∈[0,L]}z(T).
Proof Proposition 9.1 says that z(T) increases on each interval $\left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$ and is lower semicontinuous. As a result, min_{T}_{∈[0,L]}z(T) is equivalent to min_{T}_{∈D}z(T), where D = {d : d = L/n, n = 1, 2, …} is the set of discontinuities of z. This optimization model is, in turn, equivalent to min_{T}_{∈D}z^{c}(T), since part (ii) of Proposition 9.2 establishes that z(T) = z^{c}(T) for T ∈ D. Finally, since z^{c}(T) is quasiconvex (see Proposition 9.2, part (i)), we know z^{c}(T) is non-decreasing on $\left[{T}_{c}^{*},L\right]$ and non-increasing on $\left[0,{T}_{c}^{*}\right]$. Proposition 9.3 follows. □
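Proposition 9.3 reduces the continuous problem to a one-dimensional quasiconvex minimization followed by a two-point comparison. The sketch below uses the same illustrative Weibull/NHPP assumption E{N(t)} = (t/λ)^{α} with hypothetical constants, and a golden-section search for the quasiconvex minimization:

```python
import math

L, C_pm, C_cm_bar, alpha, lam = 10.0, 1.0, 5.0, 2.0, 4.0
EN = lambda t: (t / lam) ** alpha                    # assumed NHPP mean function
zc = lambda T: (L / T) * (C_pm + C_cm_bar * EN(T))   # eq. (9.18)

# minimise the quasiconvex z^c on (0, L] by golden-section search
lo, hi = 1e-6, L
phi = (math.sqrt(5.0) - 1.0) / 2.0
a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
for _ in range(200):
    if zc(a) < zc(b):
        hi, b = b, a
        a = hi - phi * (hi - lo)
    else:
        lo, a = a, b
        b = lo + phi * (hi - lo)
T_c = 0.5 * (lo + hi)                                # approximates T_c*

# Proposition 9.3: compare the two discontinuity points bracketing T_c*
n_star = math.floor(L / T_c)
T_star = min(L / (n_star + 1), L / n_star, key=zc)   # z = z^c on D
```

For α = 2 the minimizer of z^{c} has the closed form $T_c^* = \lambda\sqrt{C_{pm}/\overline{C}_{cm}} = 4/\sqrt{5} \approx 1.789$, so n* = 5 and the candidates are L/6 and L/5; here z^{c}(L/6) ≈ 11.21 < z^{c}(L/5) = 11.25, giving T* = L/6.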
B.1 The optimization algorithm
Proposition 9.3 shows how to solve min_{T}_{∈[0,L]}z(T). The key idea is to first solve the simpler problem min_{T}_{∈[0,L]}z^{c}(T), which can be accomplished efficiently, e.g. via a Fibonacci search. Let ${T}_{c}^{*}$ denote the optimal solution to the latter problem. Then, the optimal solution to the former problem is either the first point in D to the right of ${T}_{c}^{*}$ or the first point in D to the left of ${T}_{c}^{*}$, whichever yields a smaller value of z(T) (or z^{c}(T), since they are equal on D).
As described above, our model restricts the optimization to a finite grid of points A ⊂ [0, L]. Unfortunately, it is not true that the optimal solution is given by either the first point in A to the right of ${T}_{c}^{*}$ or the first point in A to the left of ${T}_{c}^{*}$. However, Proposition 9.1, part (ii) establishes that within the set $A\cap \left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$ we can restrict attention to the left-most point (or ignore the segment if the intersection is empty). To simplify the discussion that follows, we assume that A has been redefined using this restriction. Furthermore, given a feasible point T_{A} ∈ A with cost z(T_{A}), we can eliminate from consideration all points in A to the right of $\mathrm{min}\left\{T\in D:T\ge {T}_{c}^{*},{z}^{c}\left(T\right)\ge z\left({T}_{A}\right)\right\}$ and all points in A to the left of $\mathrm{max}\left\{T\in D:T\le {T}_{c}^{*},{z}^{c}\left(T\right)\ge z\left({T}_{A}\right)\right\}$. This simplification follows from the fact that z increases on each of its intervals and equals the quasiconvex z^{c} on D.
Our algorithm therefore consists of:
• Step 0: Let ${T}_{c}^{*}\in \mathrm{arg}{\mathrm{min}}_{T\in \left[0,L\right]}{z}^{c}\left(T\right)$.
• Step 1: Let T^{∗} ∈ A be the first point in A to the right of ${T}_{c}^{*}$ and let z^{∗} = z(T^{∗}). Let T_{A} = T^{∗}.
• Step 2: Increment T_{A} to be the next point in A to the right. If z(T_{A}) < z^{∗} then z^{∗} = z(T_{A}) and T^{∗} = T_{A}. Let d be the next point to the right of T_{A} in D. If z^{c}(d) ≥ z^{∗} then let T_{A} be the first point in A to the left of ${T}_{c}^{*}$ and proceed to the next step. Otherwise, repeat this step.
• Step 3: If z(T_{A}) < z^{∗} then z^{∗} = z(T_{A}) and T^{∗} = T_{A}. Let d be the next point to the left of T_{A} in D. If z^{c}(d) ≥ z^{∗} then output T^{∗} and z^{∗} and stop. Otherwise, increment T_{A} to be the next point in A to the left and repeat this step.
Step 0 solves the simpler continuous problem for ${T}_{c}^{*}$ as described above, and Step 1 finds an initial candidate solution. Steps 2 and 3, respectively, increment to the right and left of ${T}_{c}^{*}$, moving from the point of A in, say, $A\cap \left({\scriptscriptstyle \frac{L}{n+1}},{\scriptscriptstyle \frac{L}{n}}\right)$ to the point (if any) in the next interval. Increments to the right and to the left stop when further points in that direction are provably suboptimal. The algorithm is guaranteed to terminate with an optimal solution.
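Putting the pieces together, the steps above can be sketched in Python under the same illustrative Weibull/NHPP assumption (the grid A, the cost constants, and all function names are hypothetical). The sketch first thins A to the left-most point per interval (L/(n+1), L/n), as Proposition 9.1(ii) permits, then sweeps right and left of ${T}_{c}^{*}$, stopping each sweep once the next discontinuity d satisfies z^{c}(d) ≥ incumbent cost:

```python
import math

L, C_pm, C_cm_bar, alpha, lam = 10.0, 1.0, 5.0, 2.0, 4.0
EN = lambda t: (t / lam) ** alpha                    # assumed NHPP mean function
zc = lambda T: (L / T) * (C_pm + C_cm_bar * EN(T))

def z(T):
    n = math.floor(L / T)
    return C_pm * math.ceil(L / T) + C_cm_bar * (n * EN(T) + EN(L - n * T))

def thin(A):
    """Keep only the left-most grid point in each interval (L/(n+1), L/n);
    assumes no grid point falls exactly on a discontinuity."""
    keep = {}
    for T in A:
        n = math.floor(L / T)
        if n not in keep or T < keep[n]:
            keep[n] = T
    return sorted(keep.values())

def next_d(T, direction):
    """First point of D = {L/n} strictly beyond T in the given direction."""
    m = math.ceil(L / T) - 1 if direction > 0 else math.floor(L / T) + 1
    return L / m if m >= 1 else math.inf

def optimise(A, T_c):
    best_T, best_z = None, math.inf
    A = thin(A)
    sweeps = (sorted(T for T in A if T >= T_c),                  # Step 2
              sorted((T for T in A if T < T_c), reverse=True))   # Step 3
    for sweep, direction in zip(sweeps, (+1, -1)):
        for T in sweep:
            if z(T) < best_z:
                best_T, best_z = T, z(T)
            if zc(next_d(T, direction)) >= best_z:
                break                    # all further points provably suboptimal
    return best_T, best_z

A = [0.3 * k for k in range(1, 34)]                  # hypothetical feasible grid
T_star, z_star = optimise(A, 4.0 / math.sqrt(5.0))   # T_c* from Step 0
```

The break condition is valid because every grid point beyond d lies in an interval whose left endpoint d′ satisfies z(T) ≥ z^{c}(d′) ≥ z^{c}(d), by monotonicity of z on each interval and quasiconvexity of z^{c}.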
Acknowledgements
The authors would like to thank the OR/IE graduate students Alexander Galenko and Dmitriy Belyi for their help in proving the propositions and for the computational work. We also thank Ernie Kee for providing the data and for helping with the nuclear engineering aspects of the chapter. This research has been partially supported by the Electric Power Research Institute under contract EP-P22134/C10834, STPNOC under grant B02857, and the National Science Foundation under grants CMMI-0457558 and CMMI-0653916.
References
Barlow, R. and Hunter, L. (1960). Optimum preventive maintenance policies. Operations Research, 8, 90–100.
Berger, J. O. (1980). Statistical Decision Theory and Bayesian Analysis (2nd edn). Springer Series in Statistics, Springer-Verlag, Berlin.
Berger, J. O. and Guglielmi, A. (2001). Bayesian testing of a parametric model versus nonparametric alternatives. Journal of the American Statistical Association, 96, 174–184.
Blanchard, A. (1993). Savannah river site generic data base development. Technical Report WSRC-TR 93-262, National Technical Information Service, U.S. Department of Commerce, 5285 Port Royal Road, Springfield, VA 22161, Savannah River Site, Aiken, South Carolina 29808.
Bridges, M. and Worledge, D. (2002). Reliability and risk significance for maintenance and reliability professionals at nuclear power plants. EPRI Technical Report 1007079.
Chen, T.-M. and Popova, E. (2000). Bayesian maintenance policies during a warranty period. Communications in Statistics – Stochastic Models, 16, 121–142.
Cinlar, E. (1975). Introduction to Stochastic Processes. Prentice Hall, Englewood Cliffs.
DeGroot, M. (2004). Optimal Statistical Decisions. Wiley Classics Library (originally published 1970). John Wiley, New York.
Gann, C. and Wells, J. (2004). Work Process Program. STPNOC plant procedure 0PGP03-ZA-0090, Revision 28.
Hanson, T. (2006). Inferences on mixtures of finite Polya tree models. Journal of the American Statistical Association, 101, 1548–1565.
Hanson, T. (2007). Polya trees and their use in reliability and survival analysis. In Encyclopedia of Statistics in Quality and Reliability (ed. F. Ruggeri, R. Kenett and F. W. Faltin). John Wiley, Chichester, pp. 1385–1390.
Hanson, T., Johnson, W. and Laud, P. (2006). A semiparametric accelerated failure time model for survival data with time dependent covariates. In Bayesian Statistics and its Applications (ed. S. Upadhyay, U. Singh and D. Dey), Anamaya Publishers, New Delhi, pp. 254–269.
Hanson, T. and Johnson, W. O. (2002). Modeling regression error with a mixture of Polya trees. Journal of the American Statistical Association, 97, 1020–1033.
Hudson, E. and Richards, D. (2004). Configuration Risk Management Program. STPNOC plant procedure 0PGP03-ZA-0091, Revision 6.
INPO (2002). Equipment Reliability Process Description. Institute of Nuclear Power Operations process description, AP 913, Revision 1.
Lavine, M. (1992). Some aspects of Polya tree distributions for statistical modeling. Annals of Statistics, 20, 1222–1235.
Lavine, M. (1994). More aspects of Polya trees for statistical modeling. Annals of Statistics, 22, 1161–1176.
Liming, J., Kee, E., Johnson, D. and Sun, A. (2003). Practical application of decision support metrics for nuclear power plant risk-informed asset management. In Proceedings of the 11th International Conference on Nuclear Engineering, Tokyo, Japan.
Marquez, A. and Heguedas, A. (2002). Models for maintenance optimization: a study for repairable systems and finite time periods. Reliability Engineering and System Safety, 75, 367–377.
Martz, H. F. and Waller, R. A. (1982). Bayesian Reliability Analysis. John Wiley, New York.
Popova, E. (2004). Basic optimality results for Bayesian group replacement policies. Operations Research Letters, 32, 283–287.
Smith, J. (1988). Decision Analysis: A Bayesian Approach. Chapman and Hall, Boca Raton.
Su, C.-T. and Chang, C.-C. (2000). Minimization of the life cycle cost for a multistate system under periodic maintenance. International Journal of Systems Science, 31, 217–227.
Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22, 1701–1762.
Valdez-Florez, C. and Feldman, R. (1989). A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics, 36, 419–446.
Walker, S. G. and Mallick, B. K. (1999). Semiparametric accelerated life time model. Biometrics, 55, 477–483.
Wilson, J. G. and Popova, E. (1998). Bayesian approaches to maintenance intervention. In Proceedings of the Section on Bayesian Science of the American Statistical Association, pp. 278–284.
Yu, W., Popova, E., Kee, E., Sun, A. and Grantom, R. (2005). Basic factors to forecast maintenance cost for nuclear power plants. In Proceedings of the 13th International Conference on Nuclear Engineering, Beijing, China.
Yu, W., Popova, E., Kee, E., Sun, A., Richards, D. and Grantom, R. (2004). Equipment data development case study – Bayesian Weibull analysis. Technical Report, STPNOC.