
Authors’ response to Unjournal evaluations of “Water treatment and child mortality: a meta-analysis and cost-effectiveness analysis”

Published on Sep 10, 2024

We would like to thank the editors for commissioning this review and the evaluators for not only conducting such a thorough and thoughtful evaluation of the paper, but also providing an extensive write-up and some precise recommendations on how to improve it. We note that not all co-authors have reviewed this response.

We have revised the working version of the paper. As part of this latest revision, we tried to address the evaluators’ comments that we believed were within the scope of this paper.

In this document we attempt to summarize the changes and also provide some replies to points raised by the evaluators. We use shorthand E1 for Evaluation 1 (Sharma Waddington and Masset) and E2 for Evaluation 2 (unknown reviewer).

The first section of this document covers topics raised in the evaluations for which we either implemented changes to the paper or which we thought simply required a clarification. We grouped comments under 13 headings, ordered […] to match the paper.

Towards the end of this document, we include a section of changes that we are still working on, mainly to do with inclusion/exclusion of studies and evaluation (risk of bias, PRISMA). We were not able to complete these in a short timeframe, since in some cases they require additional paper reviews, data extraction, and consequently re-running the analyses. We will make a new version with these updates available in the near future. The current replies should therefore be treated as provisional, as we are still working on revising our paper.

Areas where we have updated the working paper or believe a clarification is sufficient

1. Relevance and study framing

E2: Study framing

My first comment has to do with positioning this study. Right now, it is positioned as serving the need for evidence on the cost effectiveness of water chlorination on reducing mortality. While I am personally very partial to this and think it is worthwhile, I suggest that the authors argue the premise more convincingly.

There is no doubt the quality of evidence in this paper is excellent. It is a rigorous, high-quality update since the last such meta-analysis. But it is one among many, and I find it a little hard to believe that policymakers—both globally, and at the national and local levels in the health sector—are unconvinced that clean water is a top choice when allocating their funds. Are policymakers still genuinely skeptical about the cost-effectiveness of chlorination to reduce mortality, especially child mortality, in a low-cost manner?

That many policymakers do not actually allocate as much as they “ought to” is more the subject of a political economy analysis and not down to irrational choices on their part. Just because they are not implementing chlorination programs in a concerted, robust and enduring way does not mean they do not accept the premise – they are constrained in many other ways which may prevent this. Clean water has been part of global health policy focus for decades, everyone recognizes it as a high priority investment area. So rationalizing it—yet again—is likely not the issue holding back deployment of water treatment interventions.

Let’s accept that there is little to no need to convince anyone that water chlorination is absolutely the right thing to spend their money on to prevent illness and death. Given this, I would urge the authors to think a bit more about positioning this study.

While E1 was broadly in favor of conducting [a] review of evidence on child mortality and water treatment, E2 suggested that the paper would benefit from an alternative framing: in short, because “everyone recognizes [water treatment] as a high priority investment area”, we should consider which type of intervention is appropriate for which context.

We have adjusted the framing in response to E2's comments. However, we would argue that this study is decision relevant, even for policymakers who already recognize that clean water is a high priority investment area.

First, there are many high priority investment areas which policymakers need to allocate funding between. Even if policymakers already consider water treatment to be a priority area, the degree of cost-effectiveness is still decision relevant. Second, there is heterogeneity among policymakers. Some have very high cost-effectiveness thresholds, either explicitly or implicitly due to limited resources. As we demonstrate in the discussion, an alternative model without empirical results on mortality (extrapolating mortality reductions from reductions in diarrhea plus some additional assumptions on the share of deaths due to diarrhea) estimates a much lower magnitude of reductions, which might not meet some cost-effectiveness thresholds.

Third, some child survival funders may require statistically significant experimental evidence on mortality, which this meta-analysis provides.

Fourth, this meta-analysis provides evidence that even in the absence of complementary sanitation interventions, water treatment can have a large impact on child mortality, a question about which there has been substantial uncertainty.

We therefore think an estimate of reductions in mortality based directly on mortality data is policy relevant. Indeed, at least one donor has made substantial investments in water treatment and one national government has changed its policy on water treatment after seeing this evidence.

See also below for further thoughts on the framing of the cost-effectiveness analysis (CEA).

2. Preregistration

E1: In some respects, the review is transparent about what was done. Although a systematic review protocol was not, to our knowledge, registered with any of the usual repositories for such studies (e.g., Campbell, Cochrane, Prospero), a pre-analysis plan was submitted to the AEA registry in June 2020. Fig 1 provides information about the search process and Table S2 provides information about which studies were excluded from the analysis, together with the reason why, although not in the usual form that a systematic review would provide. Reputable journals require systematic reviews to present a Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) study search flow diagram, discussion of excluded studies that users might reasonably expect to be included, and a PRISMA checklist that indicates, for example, deviations from protocol.

E1 discusses pre-registration, stating that “a pre-analysis plan was submitted to the AEA registry in June 2020”. However, this was only a study registration, not a study plan registration. This study was originally conducted as an exploratory analysis based on existing reviews and our own systematic review component was added later on: as per the study registration at AEA, “We first review all papers identified by Clasen et al. (2015) and Wolf et al. (2018) for their meta-analyses of the impact of interventions to improve water quality on diarrhea which covers studies from 1970 up to February 2016. Next, we replicate their selection criteria and search procedure for the period from February 2016 to May 2020 to add more recent studies.”

We provided more context for this at the start of the Methods section in the main text.

We also note that previous versions of the paper have been assessed using PRISMA and we are working on updating that assessment, which we will include in a future journal submission. This will clearly state that the review was not pre-registered.

3. Systematic review period and included studies

E2: The publication dates of the studies used span a 23-year period, from 1998 to 2021; one can assume that actual implementation lagged publication. Page 17 “Search strategy and selection criteria” says you used search criteria from past meta-analyses and updated to also include studies “…from February 2016 to May 2020”. Looking at the search strings in Table S1, there is a search set titled “Limits” in which the search window seems to be 2012 to 2016, except for Ovid, which is 2012 to “current”. This is all a bit unclear.

E2 points out that the period of systematic review search was unclear. We have now fixed that; the systematic search finished in May 2020.

We discuss inclusion/exclusion of studies (and the criteria for it) elsewhere in this document.

E2: The second of these questions [about year cut-offs for included studies] also relates to the age of some of the studies included i.e., the oldest being over 23 years old. Is it alright to allow studies that are over two decades old in the analysis? Supplementary Materials>4. Cost-effectiveness analysis>Drivers of cost-effectiveness states that baseline mortality is a major driver of cost effectiveness. One would assume that over time, health outcomes are improving in LMICs and LICs i.e., bumping the baseline levels of mortality downward.

There was a question from E2 on whether the effects are not diluted due to under-5 (U5) mortality decreasing over time. Since the meta-analysis model calculates reductions in odds ratios (ORs) on a relative scale, impacts on mortality are assumed to scale with baseline mortality in the model. The CEA accounts for this by using realistic, up-to-date values of baseline mortality in each implementation case study. It is important for any potential implementers to calculate benefits using mortality from their specific context. We caution readers against extrapolating results beyond the support of the data.
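To make the relative-scale logic concrete, here is a minimal sketch. It uses the 0.75 pooled odds ratio that E1's comments quote for the frequentist model; the baseline mortality levels are hypothetical, chosen only to show how the same relative effect implies different absolute reductions.

```python
# Minimal sketch: the same odds ratio implies different absolute reductions
# at different baseline mortality levels. The OR of 0.75 is the frequentist
# pooled estimate quoted in E1's comments; baselines are hypothetical.
odds_ratio = 0.75

for baseline_per_1000 in (100, 50, 20):  # hypothetical under-5 deaths per 1,000
    p0 = baseline_per_1000 / 1000
    odds_treated = (p0 / (1 - p0)) * odds_ratio
    p1 = odds_treated / (1 + odds_treated)
    print(f"baseline {baseline_per_1000}/1000 -> "
          f"{(p0 - p1) * 1000:.1f} deaths averted per 1,000 children")
```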

4. Questions about study timeframes, interpretation of effects, and sustained treatment

E2: Timeframes of studies: The studies have large variation in terms of follow up i.e., when outcomes of interest were measured (see unnumbered table at the end of Supplementary Materials>1. Details of included studies and comparison with other RCTs). This is partly addressed by the authors in the sensitivity analysis by showing that studies with shorter timeframes are not driving results.

However, this does raise the conceptual concern about what is meant by a water chlorination program in terms of timeframe, especially as it relates to the cost effectiveness. Currently, a five-year timeframe for implementation of the three representative interventions is used. That sounds reasonable but is there any reason (a) that we have the same horizon for all three (I can imagine something like in-line chlorination programming lasting longer than MCH or dispensers but I can imagine it the other way too) and (b) that we limit to thinking about a five-year horizon i.e., it may be that there is variation in cost-effectiveness based on differing time horizons (two, three, five, 10 and 15)?

I think some language to explain why a five-year horizon was chosen would be good to have.

We focused on a 5-year timeline because we were interested in the cost-effectiveness of sustained rather than short-run programs. If it were known, for example, that the program would only operate for a short period of time, then programs that involve a substantial capital component (such as dispensers or in-line chlorination) would be less cost-effective.

Evaluators also pointed out that interpretation of meta-analytic results may not be straightforward, because hardly any study includes children from birth up to age five. The model treats the relative reduction in mortality risk as constant with age. We summarize the age-composition of the sample in Supplementary Information, section “Age characteristics of included children”. We find that all age groups are well-represented, although with a slight skew toward under-2s (since some larger studies did not include older children).

Separately, we are looking into conducting a future survival-model analysis of individual-level data to understand potential variation in treatment effects with age. In the exploratory analyses we have conducted so far, we have not found any evidence of effects varying with age group, although statistical power for such analyses is limited.

5. More complete information on weights, pooling, heterogeneity, subgroups

E2: Study weights: For the weights provided in Table S4, we are not given a description for the weighting scheme used for the frequentist model but are for the Bayesian model (2. Meta-analysis models > Study weights in Bayesian model). Possibly a non-issue, but just pointing this out. Any particular reason the weighting scheme for the frequentist model was not described?

E1: We have an additional point about the meta-analysis as it was reported. It is standard practice to report the random effects weight of each study in the meta-analysis, as well as relative and absolute between-study heterogeneity (I-squared and Tau-squared) for all analyses conducted including sub-groups. Having a low value of heterogeneity helps the reader understand if the pooled effect is likely to be valid across the sample of studies included in the meta-analysis. Values of Tau-squared are reported at the overall review level, but weights and sub-group heterogeneity statistics can be reported transparently in forest plots.

E2 pointed out that the calculation of the weights was not explained for the frequentist model, and E1 also inquired about the weights. All of the weights (frequentist and Bayesian) are available in Table S4. We will attempt to revise Figure 2 to include these if there is enough space to fit them on the plots.

We currently report on the extent of heterogeneity for both frequentist and Bayesian models in the main text. They are not shown in the plots for the same reason (readability). We also believe that the typical way in which heterogeneity is reported (e.g. point estimates of the I² statistic) can lead to false confidence in models, since that estimate is often uncertain: therefore, we focus on reporting both the point estimate of the heterogeneity parameter (i.e. the between-study standard deviation) and its 95% intervals, which turn out to be fairly wide.

There is also more commentary on pooling in the supplement (see next reply), which should give readers additional context.

E1: Regarding the meta-analysis that was conducted, the review reports an overall pooled effect together with a sub-group effect for chlorination. However, the review could also have reported pooled effects for filtration, where there were three estimates. Perhaps the reviewers felt that the Peletz et al. (2012) study, conducted among immunocompromised groups, was not representative of general contexts; but we note that, even if that study was excluded, meta-analytical pooling can be undertaken provided there is more than one independent effect size.

We did not include the subgroup effect for filtration: with only three studies, we believe it is not worth characterizing the mean effect, and filtration is not the focus of our paper.

6. Large “differences” in Bayesian estimates

E1: However, in the Bayesian meta-analysis, the posterior estimates for individual studies differed from the frequentist model, sometimes considerably; for example, the estimate for Luby et al. (2006) shifts from a whopping OR=23.88 (95%CI=0.08, 7240) to OR=0.74 (95%CI=0.37, 1.49). It would be useful for readers, who may be less familiar with Bayesian meta-analysis, if the review can explain why these differences are so large.

E2: Figure 2 (A) the OR for Luby et al. 2006 is very large (a previous version of the paper showed a smaller OR for this). Why is this? Is this something to be concerned about? In table S3, I see that Luby et al 2006 report 2 deaths in treatment compared to zero deaths in control – this must be driving the large OR but surely there has to be some sensible way to constrain this rather than let it run away to a large number. Again, this is outside my technical expertise, so I defer to the authors. Just noting that anyone who reads this will see the strangely high OR for the Luby et al 2006 study in Fig. 2 (A).

E2 and E1 pointed to extreme OR values for some studies. E1 commented on large differences between Bayesian and frequentist results.

This was mainly an issue with presentation: the models actually agree with each other very well. The main reason for this confusion, we believe, is that Figure 2A showed OR inputs while Figure 2B showed study-level posteriors. While this is somewhat unconventional, we wanted to highlight the amount of pooling to a common mean by contrasting the two plots. We provided better descriptions in the revised paper and have revised Figure 2 to avoid confusion. We are also preparing a separate figure that contrasts inputs with posteriors.

The issue of different versions of the paper using different ORs is due to our updating the calculation of ORs to be consistent across all studies, by fitting Bayesian logistic models that allow for clustered randomisation in clustered RCTs (c-RCTs). See below for more details.

We tried to explain all of this better by adding several paragraphs in the supplement and a few sentences of commentary in the main text.

For full transparency, we thought it best to summarize these issues for readers of this document step by step. We hope the following summary is helpful. To start with, let us make a few clarifying comments about pooling of study effects towards the common mean:

  • By pooling, we mean the weight on the global mean vs the individual study. Pooling of 1 means the study effect is the same as the global mean (“full” borrowing of information), and pooling of 0 means each study is not affected by other studies (no information is borrowed from other studies).

  • Random-effects is a model which allows pooling between 0 and 1 by introducing τ², a heterogeneity parameter. The true treatment effect in each study is normally distributed around the common mean μ, with variance τ². (In contrast to the fully pooled fixed-effects model with τ set to zero, where each study has the same true effect, μ.) The true effect is always observed with noise, as a point estimate paired with its standard error (SE). Pooling of a study with a given SE is equal to 1 − τ²/(τ² + SE²); see the sketch after this list.

  • In our paper, the Bayesian model calculates pooling ranging from 0.57 (Null et al, the biggest study we have) to about 1 (Quick et al, a study with zero deaths). Ten studies have pooling in excess of 0.9. These studies carry little information and are thus pooled strongly towards the common mean.

  • The pooling concept is the same in frequentist models, but is typically not calculated or reported, because doing so is more difficult and less of a convention. However, there is little difference between frequentist vs Bayesian models in this regard.

  • In fact, in our frequentist model (using the REML estimator), estimated pooling is 1 for all studies (the estimate of heterogeneity is zero). So the differences between inputs and outputs are more extreme in the frequentist context. As we comment in the main text, this snapping to zero is a known behavior of many frequentist estimators of heterogeneity.

  • Pooling in our Bayesian model is not driven by priors, since we use very mild Bayesian priors for the global effect and heterogeneity. We would need (unrealistically) strong priors for μ\mu and τ\tau to meaningfully affect the estimation of global parameters.
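As a minimal sketch of the pooling factor defined above (assuming the normal-normal random-effects model; the τ and standard errors below are hypothetical, not values from the paper):

```python
# Pooling = weight on the common mean vs the individual study estimate,
# under a normal-normal random-effects model. All values are hypothetical.
def pooling(tau: float, se: float) -> float:
    return 1 - tau**2 / (tau**2 + se**2)

tau = 0.3  # hypothetical between-study SD on the log-OR scale
for label, se in [("large, precise study", 0.1), ("small, noisy study", 1.5)]:
    print(f"{label}: pooling = {pooling(tau, se):.2f}")
# -> a precise study mostly keeps its own estimate (pooling ≈ 0.10),
#    while a noisy study is pulled to the common mean (pooling ≈ 0.96).
```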

The second, related issue is how input odds ratios are obtained for the model in the first place, regardless of how the estimation of the common mean is implemented.

  • We use a Bayesian logistic model with cluster random-effects, with mild priors on effects in individual studies to obtain valid input ORs which avoid zeros (important when there are no deaths in control or treatment arms). This leads to extreme input values like OR = 24 in the case of Luby et al 2006 (2 deaths in T, 0 in C).

  • Frequentists may use continuity corrections in such cases, i.e. add an arbitrary number (like 0.25) to each count, or they may use another estimator like Peto odds ratios. These approaches will also give high odds ratios, of about 4.5-5.0 (see the sketch after this list). Note that the frequentist procedure provides more shrinkage towards no effect (OR=1) than the Bayesian procedure that we use.

  • Whether that study effect value is 4.5 or 24 has practically no influence on the estimated global effect, or even on the posterior estimate of the effect in that particular study, because studies like these have very high SEs and convey almost zero information. In that sense, while the Bayesian prior brings the estimated context-specific input down from infinity to 24, it has little effect on the global estimate and, since pooling for this study is greater than 0.9, little effect on the posterior for the context-specific effect either.

  • In section 3 of the supplement we conduct a sensitivity analysis using continuity corrections to obtain input ORs (for individually randomized studies), and we can see that this indeed does not substantially affect the results.
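To illustrate the continuity-correction point, here is a hedged sketch: the death counts (2 in treatment, 0 in control) are those quoted above for Luby et al 2006, but the arm sizes are hypothetical and we use the conventional half-count correction; the exact value depends on both choices.

```python
# Continuity-corrected odds ratio for a 2x2 table with a zero cell.
# Death counts are quoted above; the arm sizes and the half-count
# correction are illustrative assumptions, not the paper's values.
def corrected_or(deaths_t, n_t, deaths_c, n_c, cc=0.5):
    a, b = deaths_t + cc, (n_t - deaths_t) + cc
    c, d = deaths_c + cc, (n_c - deaths_c) + cc
    return (a * d) / (b * c)

print(f"OR ≈ {corrected_or(2, 500, 0, 500):.1f}")  # ≈ 5.0 under these assumptions
```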

7. Differences between drafts

E1: We understand the working paper we have been sent to review is the second draft of a paper that has been online since 2023. We observed that the odds ratio estimates and 95 percent confidence intervals (95%CIs) differed, in some cases considerably, between the two working paper drafts. For example, we observed absolute differences in odds ratios of 0.04 or more for half (9) of the estimates (Boisson et al., 2013; Chiller et al., 2006; Dupas et al., 2021; Haushofer et al., 2020; Luby et al., 2006; Reller et al., 2003; Semenza et al., 1998; Peletz et al., 2012; Kremer et al., 2011), of which six (in bold) had differences of 0.08 or more. As a benchmark, the pooled effect in frequentist random-effects analysis was 0.75, hence these differences represent around one-third or more of the pooled effect magnitude. It is not clear to what extent the differences mattered for the findings, since in some cases the odds ratios were smaller, while others were bigger. However, we note that in the first draft of the review, the frequentist meta-analysis pooled effect for the chlorination sub-group was not statistically significant, whereas in the version evaluated by us, the review was able to find a significant effect, albeit over a slightly larger sample size. Hence, we believe it would be useful for the review to report inter-rater assessments on effect size data extraction and/or to indicate how discrepancies in the calculations were resolved, particularly regarding differences in estimated odds ratios and 95% confidence intervals.

On a related note, E1 points out that OR estimates changed across subsequent drafts, by more than 0.08 for six studies and 0.04-0.08 for another three. In a couple of cases this is likely due to corrected counts in non-events (as far as we recall, numbers of deaths have not changed during any of our reviews).

Most of these differences are due to our replacing the Peto odds ratio estimator, which does not allow for clustering, with the Bayesian logistic model (see the summary above and details given in Supporting Information). In a couple of cases these differences may arise due to clustering corrections, but in most cases it may be because relatively little information is contained in each study.

Note that for the study sizes included in the meta-analysis, even small perturbations in counts (e.g. a continuity correction of 0.25) would have large effects on the mean OR estimate (in excess of 0.05 or even 0.10).

8. Characteristics of included studies, representativeness of study settings, and differences across contexts

E1: However, it is standard practice in systematic reviews and meta-analyses on WASH topics to report transparently on the populations, interventions and the counterfactual water supply and sanitation conditions too (e.g., Fewtrell and Colford, 2004; Arnold and Colford, 2007; Waddington et al., 2009; Wolf et al., 2022; Sharma Waddington et al., 2023). For example, the reader wants to know the interventions evaluated, the circumstances in which the evaluations were conducted, the types of populations covered, such as whether any were from immunocompromised groups, and the degree of movement up the drinking water and hygiene ladders afforded by the intervention. This information should be readily accessible, very preferably in the main text.

E1: The review states that “the studies included in the meta-analysis are broadly representative of the settings in which policymakers might implement water treatment programs” (p.15). It is hard to believe that 18 studies could represent the contextual variability one would find within and across countries and contexts within countries, especially when one considers that 14 of the 18 included studies were conducted in middle-income countries. It would be useful to understand who are the policymakers that would find this sample representative. Site-selection biases operate, whereby research sites selected for trials are those where there is the greatest contamination of drinking water and diarrhoea disease burden (Sharma Waddington et al., 2023). Perhaps it should be accepted that the sample is not representative of contextual variability. But if it is representative, we suggest adding some supporting evidence.

E1: the review did not discuss the interaction of the interventions with baseline environmental characteristics. The sensitivity analysis considered the baseline prevalence of diarrhea, and the review observed that the meta-analysis was not sufficiently powered to conduct a disaggregated analysis. However, the review could have examined or discussed how the results might differ in different contexts in greater detail, since this has been a major concern in the literature.

We grouped three important points made by E1, as they are all related.

In the new version we clarified that a large majority of the studies were in countries that were low-income at the time that studies were conducted. Details are in Supporting Information, section 1. For study characteristics, we have included as much information as we thought possible in Table 1, without risking making it unreadable. We provide a more complete summary of study characteristics in the first section of the supplement; while we also include a short summary in the main text, we opted for a more complete discussion in the supplement due to word limits for the future journal submission.

We agree that there is potential for site selection biases. While we cannot claim that the study is representative along all dimensions, we believe the crucial dimension here (and one for which we have data to assess representativeness) is diarrhea prevalence. We assess it not at a country level, but at the level of sub-national units across a representative group of countries (over 43,000 observations from 94 countries, based on 2020 data from IHME), and find it representative: “Household surveys across 94 low- and middle-income countries found diarrhea prevalence in 2017 ranging from 0.2% to 56.7% across sub national units, with a median of 13.7% (3). Diarrhea prevalence rates (at baseline or, if baseline not available, in the control group) in our sample of studies range from 5% to 58%, with a weighted mean (using meta-analytic weights from the frequentist model) of 15.6%; this corresponds to the 61st percentile of the distribution of sub-national diarrhea estimates, see Figure S6.”

We hope that based on the current phrasing in the main text, it will be clear to the readers what the studies are representative of (e.g. “We found studies to be representative of diarrhea prevalence in LMICs” in the main text, with more detail given in the supplement).

9. Measuring adherence; heterogeneity in effects with adherence; persistence

E1: In order to understand generalizability of the findings from a review of behaviour change interventions, one also needs to understand if desired behaviours are practised and sustained, such as whether sufficient protective agents are applied to treat drinking water or adequate personal hygiene practised at the point of use so that contaminated hands or utensils are not placed in drinking water storage containers. One aspect of this is to assess rates of adherence and sustainability, as done in the review. The review did not find a significant association between adherence and mortality, which is likely due to the different measures of adherence used in the literature and the problems in measuring adherence to drinking water technology more generally, as discussed above. The only consistent relationship that was observed appeared to be the limited effectiveness of HWT after 6-12 months of follow-up. Factors associated with dis-adoption include users disliking the odour and taste of chlorinated water.

Much of the evidence on water treatment has come from RCTs conducted at zero or negligible financial costs to participants, with frequent follow-up by outsiders and disruption of normal domestic routines (the ‘mzungu effect’) (Waddington et al., 2009). There is therefore a high potentiality in these studies for Hawthorne effects, where being observed leads to greater efforts to adhere to treatment protocols, favouring the treatment group in unblinded trials. This bias is especially likely to occur when follow-up and measurement occurs frequently, as it does in many evaluations of HWT interventions. For example, in analysis that includes many of the studies used in the review, Pickering et al. (2019: e1143) reported that “virtually all the evidence that promotion of… point-of-use water treatment with chlorine or flocculant disinfectant reduce diarrhea come from studies that had daily to fortnightly contact between the behaviour change promoter and study participant”. Hence, one useful analysis that the review could perform would be to examine the association between odds ratios and frequencies of follow-up visits by investigators. When there are lots of visits, the findings of the studies are unreliable guides to the effectiveness of real-world programmes that do not have frequent follow-ups, yet require participants to undertake behavioural modifications where children’s carers must always treat household drinking water while also ensuring that children never consume water from unsafe sources.

E1 discusses these topics in detail. We agree, of course, that adherence/compliance over a sufficient time horizon is a crucial driver of the benefits of these interventions and are grateful for such a detailed discussion.

As a broader comment, it may be helpful to consider two parts of the paper separately. In the first part, the meta-analysis, our objective is to find the impact of water treatment on mortality, while characterizing take-up in existing studies. In the second part, the CEA, we use the estimates to consider impacts in specific settings. We address the fact that average take-up in the studies is likely higher than what may be expected in many large-scale implementations (of point-of-use or point-of-collection treatment) by adjusting the CEA for relative compliance, assuming effects scale in proportion to take-up. For example, if effective take-up of dispensers is 36%, we assume that mortality reductions scale down linearly by 0.36/0.54, etc. (see Table S7 for details).
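A minimal sketch of this linear adjustment (the 54% trial take-up and the 36% dispenser take-up are the figures mentioned in this reply; the meta-analytic effect size is hypothetical):

```python
# Linear compliance adjustment: effects are assumed to scale in
# proportion to take-up. The take-up figures are those mentioned in
# this reply; the effect size is a hypothetical placeholder.
meta_effect = 0.25     # hypothetical relative mortality reduction at trial take-up
trial_takeup = 0.54    # weighted average effective take-up in included studies
program_takeup = 0.36  # assumed effective take-up of dispensers at scale

adjusted = meta_effect * (program_takeup / trial_takeup)
print(f"adjusted mortality reduction: {adjusted:.1%}")  # 16.7%
```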

One possible conclusion from this analysis is that water treatment is cost-effective enough that policymakers may wish to make it available free to households with young children, and that in many contexts it is likely to be worthwhile to provide water treatment in forms that require minimal behavior change, and are thus likely to achieve high usage rates. This would indeed entail minimizing the burden that water treatment places on caregivers, for example by providing clean piped water to households. In that sense, our analysis is based on the premise that one can learn about the underlying effect of water treatment from controlled experiments; the results should not be interpreted as supporting any particular path to water treatment. The appropriate path is likely to depend at least in part on context, but our analysis suggests that some type of water treatment is likely to be appropriate in a wide range of contexts.

We have a few more clarifying comments:

  1. We try to align definitions of adherence across studies, especially for chlorination, where we use objective measurements. This is detailed in Supporting Information, section 1. Given different lengths of studies, many of which are shorter than a year, these measurements typically do not address the question of persistence, but we hope that our approach will be transparent to the readers.

  2. Observer effects discussed by the evaluators are indeed very likely to increase take-up in experiments. Our meta-analytic estimate is specific to the take-up in the included studies. As we point out in the main text, the weighted average of effective take-up in our sample was 54% (about 56% in treatment vs about 3% in controls).

  3. Perhaps most importantly, in the new version of the paper, we also construct alternative CE calculations that assume lower compliance for each of the three methods, which we hope demonstrates the crucial role that ensuring compliance plays in successful implementations of water treatment. These are presented in the Discussion and in the expanded Table S7. We hope this is done transparently, in a way that allows readers to apply their own judgements about the fraction of the population that will treat water and calculate cost-effectiveness accordingly.

10. Use of GDP thresholds in cost effectiveness

E1: The review also made frequent use of the WHO GDP threshold. We note that many commentators within and outside the World Health Organization (WHO) have expressed their skepticism about this threshold and its use in decision-making (see for example this document for a review of debates on the GDP threshold within and without WHO). The GDP threshold is still widely used today, and the review is not exceptional in this. However, since the threshold has been criticized in many ways, we suggest that the review reports the limitations of using the threshold for decision-making, and explain how the threshold should be interpreted for decision-making purposes in this particular context.

Different decision makers prefer different cost-effectiveness thresholds, and it is beyond the scope of our paper to review the debate on the appropriateness of various thresholds. As such, we aimed to choose two different thresholds (one high and one low) to illustrate that water treatment is cost-effective across a wide range of thresholds, and to indicate where decision makers might make different decisions about implementation.

11. Deeper cost modeling and addressing more CEA questions

E2: Deepen the cost modeling

My second major comment has to do with the “cost” part of cost-effectiveness: specifically, to deepen the cost analysis. I think it is important to add some useful sophistication to the cost model, while keeping it comprehensible. Being more thoughtful about cost modeling and analysis has begun to take hold among policymakers and donors. Deepening the cost part of this analysis offers one way for the authors to better position this paper.

The first layer of sophistication that can be added is to build out a cost model for each intervention which allows some variation. A full, high effort version is to:

  • Build a cost model that explicates the key cost elements e.g., materials, management, communications, transport and training, of a given intervention.

  • Then vary those elements – much like a sensitivity analysis – to get a variety of possible costs for the intervention.

  • Then, calculate a variety of cost-effectiveness estimates.

This enables a policymaker to “locate” their specific context in the cost distribution for a given water chlorination intervention and look up the relevant cost-effectiveness for it. A simpler way to generate a range of costs might be to develop a low, middle and high cost for dispensers, in-line and MCH then use those to derive a range of CEAs. Policymakers, readers and anyone else interested in this, will be able to locate their costs on the spectrum – maybe a policymaker has good information that—for dispensers let’s say—their context is close to the medium cost, therefore they will then use CEA estimates for the medium cost scenario. Sohn et al (2020) show how you can think a bit more deeply about modeling costs.

The second layer of sophistication is bounding/confidence intervals which might allow a decision maker to see beyond just the “mean case” and understand the extremes. For instance, in-line treatment might have lower mean cost effectiveness but the “worst case” (upper bound on cost per unit effect) for dispensers might be lower than that for in-line which may drive a risk averse decision maker to pick dispensers for their context. I do not have any great examples for you but you could think along the lines of what Wakker and Klaasen (1995) suggest.

E1: Firstly, the analysis used the Bayesian meta-analysis estimate across the whole sample of studies, including filtration, spring protection and solar disinfection. However, since two of the cost-effectiveness estimates directly concerned chlorination, it would seem more appropriate to use the pooled meta-analytic effects for chlorination alone in those cost-effectiveness analyses. Secondly, the review does not provide uncertainty estimates for the cost-effectiveness estimates with respect to either the confidence intervals on intervention effectiveness or sensitivity analyses to different cost scenarios or other assumptions (e.g., adherence rates).

E2 suggested constructing and presenting cost estimates in ways that allow policymakers to construct CE analyses that are specific to their circumstances (as well as generally making CEA more applicable to specific contexts).

We agree that this would be valuable. We provide a detailed breakdown of the cost calculation in footnotes to Table S7 (the expanded CEA table) and hope to create a setting-specific CE calculator. However, this would require data on locally specific factors, such as the cost of materials, which policymakers will have for their own areas but which we do not have. We therefore believe this is out of scope for this paper and would be most useful as a stand-alone tool.

E1 also noted a lack of uncertainty estimates in the CEA results. In response: standard economic theory suggests that policymakers should be modeled as risk-neutral in this context. The standard justification for risk aversion is concavity in the utility function over consumption, and the per-person cost of water treatment is low enough relative to per capita income that risk neutrality seems appropriate as a first-order approximation to the utility function.

We also argue that our analysis suggests that water treatment is sufficiently cost-effective to be robust to more conservative assumptions. In the supplement we conduct further cost-effectiveness analysis for a decision maker with an informative prior centered around a 3.9% reduction in the odds of mortality. This may be helpful for a reader interested in the cost-effectiveness of water treatment assuming smaller effect sizes.
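The supplement contains the actual analysis; purely as a hedged illustration of the mechanics, a conjugate normal-normal update on the log-OR scale could look like the sketch below. The 3.9% prior center is the figure mentioned above; all other values are hypothetical, not the paper's.

```python
import math

# Hedged sketch (not the paper's computation): combining an informative
# prior centered on a 3.9% reduction in the odds of mortality with a
# meta-analytic estimate, via a normal-normal update on the log-OR scale.
prior_mean = math.log(1 - 0.039)  # prior center quoted above
prior_sd = 0.05                   # hypothetical: a fairly confident prior
meta_mean = math.log(0.75)        # hypothetical meta-analytic log-OR
meta_sd = 0.15                    # hypothetical standard error

w = meta_sd**2 / (meta_sd**2 + prior_sd**2)  # precision weight on the prior
posterior_or = math.exp(w * prior_mean + (1 - w) * meta_mean)
print(f"posterior OR ≈ {posterior_or:.2f}")  # ≈ 0.94 under these assumptions
```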

E1 also noted that the global result was used instead of the chlorination sub-group result.

We believe that using the results across all studies is appropriate, given the lack of precision in estimates and the need to fully utilize data. However, in this case choosing either option would not have meaningful implications for the result, since the mean in chlorination studies is similar to the mean across all studies.

12. Implications for future studies, CONSORT diagrams, lack of mortality reporting

E1: In our opinion, what the review clearly highlights is that current standards for reporting of RCTs, especially in development economics, are not fit for purpose. Reputable journals publishing field trials in health require that CONSORT participant flow diagrams are reported, which show numbers of individual participants by study arm from recruitment of clusters and individuals within clusters, through follow-ups, together with important reasons for attrition like death (Moher et al., 1998)[33]. Without this information it is difficult to assess important threats to validity in these studies, which might occur due to problems in design and conduct. It is not sufficient to publish data openly, as many economics journals require, in order to assess them. For example, a key aspect of the internal validity in cluster-RCTs is knowledge about when and how individual participants were recruited, so that total and differential selection bias into the study from joiners can be assessed. The same follows for selection bias out of the study (attrition), although this is more commonly evaluated. It can also be useful to know who dropped out of the study between enrolment and randomization stages to evaluate external validity.

A recent survey by Chirgwin et al. (2021)[34] of WASH impact evaluations in L&MICs found that only half of trials in health had reported a study participant level CONSORT diagram, whereas no RCTs of WASH in economics had done so. 3ie has published CONSORT standards for RCTs in economics (Bose, 2010)[35]. What the review demonstrates is that this lack of participant flow reporting is extremely costly. Had the participant flows been reported transparently, there would have been less need for the reviewers to contact RCT authors to obtain the attrition data on all-cause mortality in childhood. The reviewers themselves noted this process was “time-consuming… and led to the loss of some data that was once available but is no longer available” (p.16), since there were 29 studies whose authors responded that the mortality data had not been collected, or had been collected but were no longer available, or who did not respond at all.

The review suggests that the reason why there has been hitherto limited analysis of mortality is because multiple testing of hypotheses prevents researchers from analysing the impact of the interventions on mortality. We are not convinced about this since the lack of reporting of mortality data is more likely due to the use of small samples, the difficulty of collecting mortality data, and apparently the lack of familiarity with reporting mortality data. Hence requiring these data be analysed as part of pre-analysis plans is unlikely to address the problem sufficiently. We believe a more effective solution would provide incentives for authors of RCTs to report participant flow diagrams, as are done in other fields, including RCTs measuring the impacts of HWT on diarrhea published in health journals. RCTs are costly to undertake financially and often require substantial time engagement by participants, so there are strong ethical and, as shown in the review, practical reasons for authors to report participant flows, and for reputable journals and commissioners to require them to do so.

There are similar standards for reporting systematic reviews and meta-analyses, which we discussed above, relating to the publication of protocols, reporting of deviations from protocol and adherence to PRISMA conduct and reporting standards. A key purpose of a systematic review protocol is to help reviewers avoid making results-based choices (consciously or otherwise). This does not mean that deviations from protocol are not allowed, just that they are explained.

We are very grateful for this detailed discussion and agree on the importance of including participant flow diagrams in studies. It is clear that in this instance having these diagrams available in all studies would enable analysis of several studies without relying on direct contact with researchers or analysis of individual-level data. (Although it is important to note that the latter has an important advantage when studies use clustered randomisation.) We added this as a recommendation.

We re-reviewed included studies and modified the paper to discuss more how mortality was reported and whether studies included participant flow diagrams; we now note: “Out of the 13 studies that reported mortality in publications: five report it as an outcome, nine include number of deaths in a flow diagram, and five discuss or report death in the text. Only one (23) includes mortality as an outcome in the pre-registration.”

More broadly, we believe there is an important implication for reporting of outcomes, one that goes beyond reporting of mortality and applies to important outcomes that studies may collect but are underpowered to analyze. Multiple hypothesis testing requirements could have the unintended consequence of disincentivizing the reporting of such outcomes (of which death is one). We take the point that some study authors may not have reported mortality for other reasons, rather than for fear of being subjected to multiple hypothesis testing requirements; the motivations of the authors of decades-old papers are difficult to assess. But going forward, a policy under which authors are encouraged to report CONSORT diagrams and to submit pre-analysis plans specifying that data on important outcomes for which an individual study is underpowered will be collected for purposes of meta-analysis, with individual papers not expected to include multiple hypothesis testing corrections, would generate incentives for researchers to collect and report such outcomes. We are working on follow-up research that would address this topic and think it is important to highlight these issues further.

13. COI statements

E1: Finally, we believe the positionality of the reviewers is not reported satisfactorily. It would be useful to know, for example, if the included RCTs conducted by the reviewers were appraised by different authors. Furthermore, one of the reviewers is a Board member of Evidence Action, the campaigning NGO that provided data on which two of the cost-effectiveness scenario estimates are based, and another is a principal investigator of two studies that led to the organisation’s earliest campaigns (Drinking Water Chlorination and Deworm the World). We might expect these associations to be mentioned in reviewer declarations due to the potential for conflicts of interest. For example, UKRI states: “the existence of an actual, perceived or potential conflict of interest does not necessarily imply wrongdoing on anyone’s part. However, any private, personal or commercial interests which give rise to such a conflict of interest must be recognised, disclosed appropriately and either eliminated or properly managed. Reporting, recording and managing potential conflicts effectively… can help to generate public trust and confidence.”

Thanks for this comment. We would like to clarify that

1.) Steve Luby is not a member of Evidence Action’s Board, but rather its separate Board of Advisors, which does not have any fiduciary responsibilities to the organization.

2.) Michael Kremer helped develop the water treatment dispenser approach discussed in the paper which Evidence Action is implementing in East Africa. He was a board member of a deworming NGO which later transferred its operations to Evidence Action.

We are unaware of any guidelines under which these constitute a conflict of interest for Kremer or Luby, but will of course follow relevant guidelines including those of the relevant journal when submitting the paper for publication.

We also note that Evidence Action is primarily an implementing organization rather than a campaigning organization.

Changes not yet included in the current version of the working paper

14. Choice of studies

E1: A previous version of the review omitted several trials that were included in the systematic review and meta-analysis of water, sanitation and hygiene (WASH) and mortality by us (Sharma Waddington et al., 2023), several of which have since been included in the meta-analysis as indicated by the reviewers.

However, several studies of apparently eligible interventions, which reported all-cause mortality in participant flow diagrams, remain excluded from the analysis. These include Ercumen et al. (2015) which reports all-cause mortality from two trial arms (chlorine plus safe storage and safe storage alone), and Bowen et al. (2012), a long-term follow-up of another household water treatment (HWT) study that was included (Luby et al., 2006); both studies, while underpowered, reported higher mortality rates in the household water treatment group than in the control.

E1 suggested two studies that were not considered. We reviewed them and included one of the two.

First, we did not include Bowen et al, which was a long-term follow-up on developmental outcomes of a study we already include (Luby et al 2006). Bowen et al has more deaths, but it follows children up to the age of eight and most of the follow-up is after the active water treatment period. We believe that Luby et al 2006 (for which we have individual-level data) already includes U5 deaths from the relevant period. For full transparency, we checked the impact of replacing the Luby et al study with the Bowen et al study: it would reduce the mean reduction in OR and its 95% interval by about 2-3 percentage points, and the results would remain significant.

Second, we will include the Ercumen et al study in the updated analysis and thank the evaluators for bringing this study to our attention. Since only four deaths occurred in that study (2 in treatment arm, 2 in control), this should have no meaningful impact on the results, but it’s of course important to include all available evidence.

E1: There appeared to be deviations from the AEA registry record, such as the original exclusion of “cases where the study population is considered to be non-representative (e.g. interventions targeting HIV+ populations)” (Tan and Kremer, 2020). The review included a study of water filters and safe storage by Rachel Peletz and colleagues that was conducted among immunocompromised households (Peletz et al., 2012).

The evaluators were right to point out that including this study contradicts what was proposed in the AEA search strategy. We are looking into this further and may exclude this study from the main analysis, or will add a clarification in the text to point out the inconsistency with the protocol.

Regardless of what we decide, we will conduct a sensitivity analysis specifically including/excluding that study and clearly state the results with/without it. For the time being, readers can refer to the results in Table S4, which looks at dropping individual studies from the main analysis.

E1: We also wondered why a RCT on household water chlorination in Kenya (Kremer et al., 2008) was not included in the analysis or in Table S2; this study aimed to evaluate the final pathway in water-borne diarrhea disease transmission by addressing contamination between source and point of use, and is therefore potentially highly policy relevant.

We believe no mortality data were collected during the intervention described in this study, although we are still in the process of confirming this. Whatever mortality data exist from that program are most likely already available in Kremer et al 2011, a study we include. We are in the process of verifying that as well.

Many thanks to the evaluators for bringing our attention to these four studies. We look forward to finalizing the results as soon as possible.

15. Addressing combined interventions

E1: It is common for WASH interventions to be implemented in conjunction with other treatments, which also may have independent or interactive effects on morbidity and mortality. For example, as noted in the supplementary materials, several included study arms were of multicomponent interventions where HWT was provided alongside cookstoves (Kirby et al., 2019) or sanitation and hygiene (Humphrey et al., 2019). One chlorine trial incorporated food hygiene education (Semenza et al., 1998). Other studies had hygiene and sanitation co-intervention arms that were excluded (e.g. Luby et al., 2018; Null et al., 2018), although the review states they were incorporated in sensitivity analysis. This point changes the interpretation of the results, in some cases in important ways. For example, since the focus of the review is all-cause mortality, it should be clearer which of the studies combined HWT with software and hardware that can affect children’s exposure to enteric or respiratory infection such as washing with soap and water, food hygiene or indoor cook-stoves. The conclusions may need to be qualified by observing that the results should be interpreted as approximations of the effects of ‘water treatment and protection’, if ‘water treatment and protection’ interventions were implemented as multicomponent packages including other activities affecting morbidity and mortality in childhood.

E1 raises the important issue of several analyzed papers having included other interventions, such as education, cookstoves etc. We thank the evaluators for the detailed discussion of this topic. While this was already covered in sensitivity analyses, we made sure there is a stronger emphasis on this: in the main text we explicitly list what was combined, and we added more details in the supplement too:

“In five out of the eighteen cases, water treatment was combined with another intervention: cookstoves in Kirby et al 2019 (6), other sanitation and hygiene interventions in Humphrey et al 2019 (7), or safe storage in Chiller et al 2006, Peletz et al. 2012 (8), and Semenza et al 1998 (9). Quick et al. 1999 also combined safe storage and water treatment. However, the control group also received safe storage containers. With the exception of one treatment arm in Dupas et al 2023, all other interventions involved some sort of educational or promotion intervention instructing study participants on how to treat water with the relevant method, so education is not considered a combined intervention.”

We are working on updating our analysis in two ways: (1) expanded sensitivity analyses that include/exclude studies with combined interventions; (2) new meta-regressions that also include treatment combinations as covariates. While power for this will be low, several studies include arms with e.g. hygiene interventions, education on handwashing, and safe storage, so accounting for them in a single model may lead to modest increases in power. In the current version of the paper the results of sensitivity analyses have not yet been updated.

16. Risk of bias assessment

E1: The risk-of-bias ratings reported in the supplementary materials range between 4 and 7 out of a total possible score of 11. We note that evidence suggests it is not appropriate to determine overall bias using quality scales (Jüni et al., 1999). Authors of critical appraisal tools have instead shown that it is possible to assess overall bias based on transparent decision criteria (e.g., Eldridge et al., 2016; Sterne et al., 2016). The review should comment on the implications of the risk-of-bias assessment for the confidence in the findings of the evidence base.

We thank the evaluators for the discussion of mortality vs diarrhea outcomes and their inherent risk of bias. We did not include a longer discussion of this due to space limitations in the main text, but we agree with the points raised. We will reconsider the overall approach to assessing study-level risk of bias for the next version of the paper but have not made decisions yet.

17. Power calculations for publication bias

E1: Regarding the publication bias analysis, which provides a rare example where small study effects were not measured, we believe this is because most of the studies were not designed to measure mortality as a primary or secondary outcome. As noted by the reviewers, there may still be publication bias present (for example, mortality data from 29 studies were not obtainable, as discussed below). However, we are less convinced by the approach used to assess the statistical power of the meta-analysis. The review added null results (the post-hoc simulation on page 4) to the observed results and checked whether the meta-analysis still found a statistically significant effect. We wondered if post-hoc power calculations would be a simpler approach to address the same question. Perhaps the review could calculate the minimum detectable effect size or power of the meta-analysis as a function of the number of studies and see whether it is sufficiently powered to detect an effect size in the presence of publication bias.

E1 suggested adding a simple post-hoc power calculation to understand [the] potential impacts of small study publication bias. We will consider adding this but need to compare the two approaches before deciding on whether to include it. We will provide additional commentary in the supplement to better motivate the choice of this approach.

18. Revised CEA

E2: One thought in this regard is that this study could be made more about which water chlorination intervention is best suited for a given context. In other words, going a bit beyond what this study is currently doing. To explicate this, right now we have three interventions in three unique contexts – dispensers in Western Kenya, in-line chlorination in India and an MCH based delivery for a general global context. What if I was a Kenyan health policymaker and saw this study:

I see that the MCH based chlorination is the most cost-effective. I think to myself maybe that is the first step I should take. But then I note that the estimate is for a generic global program, so now I am unsure. My MCH program is pretty robust and I know my costs probably fall on the lower end of the global distribution but I can’t be sure. Maybe I should just take the Kenya specific result for dispensers and be done with it?

I have just played out a hypothetical here. The thing that happened “well” is that a policymaker compared two possible ways to achieve a child mortality outcome through water chlorination – the primary purpose of a CEA like this – but what was weak in the above was that MCH does not have a truly local, Kenya-specific cost-effectiveness estimate that the policymaker could use, nor any information that helps them get close to it. And this is true for an Indian policymaker, and other policymakers in lower-middle income and low-income countries. This relates to my second major comment below, i.e., a way to better position this work.

As discussed earlier, the main implication of our cost-effectiveness analysis is that water treatment is highly cost-effective in a range of contexts. We hope to construct a flexible tool which could produce estimates for different contexts.

Focus on CEA: I appreciate that you want to acknowledge the broader benefits of these interventions i.e., your calculation of net benefits. I think this is kind of distracting to be honest. Focus on the CEA and make it about choosing the best water chlorination intervention variant that yields the lowest cost per reduction in child mortality.

We argue that for some decision makers it is appropriate to focus on net benefits rather than the cost-benefit ratio. We explain the rationale for this in the supplement. We will explain this further by adding some formal analysis in a new draft which we hope to post soon.

Relatedly, in your main manuscript, section “Cost-effectiveness”, you do not state the cost per death averted (this was available in a previous version of the paper). I understand the switch to cost per DALY averted but the major thrust of your quantitative analysis is mortality. Are there strong reasons not to have that? Are comparisons not possible with cost per death averted? Is it possible to re-introduce this?

We have added cost per death averted back to the cost-effectiveness table.

We are working on an update to CEA which will address some of the suggestions for expanded scope of CE in E2. In particular, we will (1) discuss the issue of allocating a limited health budget, (2) provide more detail on choosing between alternative delivery methods, which is also used to frame the discussion of cost-effectiveness analysis, and (3) provide metrics beyond cost per DALY.

Conclusion

Again, we are very grateful to the evaluators for their very useful comments. We were in the midst of revising the working paper for journal submission when we received their comments, have partially revised the working paper/preprint in light of some of these comments, and look forward to working our way through the remainder of their comments, finalizing the revisions we have already drafted, releasing a new working paper/preprint and then submitting the revised paper to a journal.

Manager’s note on response and paper updates

Witold Więcek was the corresponding author on this response. The authors shared a link to an updated draft of their paper, linked here (which they referred to as a “Google Drive permalink”).

Witold noted “some co-authors are yet to review this new version [and] we are working on further updates (as outlined in the reply doc), so we hope we can share another version soon.”
