Cryptosporidium spp. and Giardia duodenalis are important parasitic protozoa that cause gastrointestinal disease in both humans and animals, and both are considered zoonotic (Thompson, 2004; Xiao, 2010). The most common route of transmission is through contaminated water and/or food. They are therefore responsible for many documented waterborne (and sometimes foodborne) outbreaks of disease worldwide and pose a significant, often neglected, public health risk (Certad et al., 2017), with an increasing number of outbreaks reported since the early 2000s (Efstratiou et al., 2017b; McClung et al., 2018).
Water can be contaminated by either of these protozoa via many different routes, including runoff from agricultural areas, drainage from manure storage, wastewater overflows or inadequate sewage systems (Daniels et al., 2016; Kistemann et al., 2012; Robertson et al., 2006; Swaffer et al., 2018). In addition, climate change and, specifically, extreme rainfall events have been associated with elevated levels of water contamination with Cryptosporidium and/or Giardia (Daniels et al., 2016; Dias et al., 2018; Ligda et al., 2020; Mellor et al., 2016; Semenza et al., 2012). However, investigating these parasites in aquatic environments is challenging, as it requires costly, time-consuming techniques performed by experienced personnel (Efstratiou et al., 2017a). As a result, alternative markers that are easier and cheaper to measure in water have been investigated for their correlation with the presence of these parasites. These include faecal indicator bacteria (FIB) (i.e., Escherichia coli, Clostridium perfringens, bacteriophages, Enterococci, total and faecal coliforms) and pathogenic bacteria (e.g., Salmonella spp. and Campylobacter spp.) (Cizek et al., 2008; Duris et al., 2013; Levantesi et al., 2010; Tolouei et al., 2019; Wilkes et al., 2011, 2009), weather-related parameters (e.g., temperature, turbidity, flow rate, rainfall, and tributary discharge) (Cizek et al., 2008; Tolouei et al., 2019; Wilkes et al., 2009; Young et al., 2015) and wastewater micropollutants (e.g., caffeine, carbamazepine) (Tolouei et al., 2019). However, the heterogeneity of the relationships between these markers and the presence of Cryptosporidium and/or Giardia in water matrices (e.g., different source, size, fate, density, frequency and levels of excretion) (Brookes et al., 2005, 2004; Davies et al., 2003; Levantesi et al., 2010; Wilkes et al., 2009) has led to inconclusive results. This was also observed by the authors of this study when a more traditional statistical modelling method (i.e., zero-inflated negative binomial models) was used to investigate the associations between such markers (i.e., FIB and abiotic factors) and the presence of the parasites’ (oo)cysts (Ligda et al., 2020).
In this regard, Machine Learning (ML), a subdomain of Artificial Intelligence (AI), has gained considerable attention from the research community, with a strong emphasis on various applications related to infectious diseases caused by protozoan pathogens (Hu et al., 2022). More specifically, ML models have been applied to detect pathogens, explore host-pathogen interactions and predict outcomes related to human health, food safety and environmental quality, achieving high predictive performance (Ghannam and Techtmann, 2021; Goodswen et al., 2021; Hu et al., 2022; Huang et al., 2021; Uddin et al., 2019; Wang et al., 2022; Yakimovich, 2021). In recent years, ML has also been applied to a wide variety of problems in the water science field, including the prediction of water quality in rivers, lakes, and groundwater (Huang et al., 2021).
Despite the great promise and the many practical implications of adopting recent ML advances in the scientific domain of infectious diseases, there is an ongoing discussion among scientists of different disciplines about the interpretability and explainability hurdles that stem from the deployment of complex ML solutions. On the one hand, a variety of ML algorithms have been shown to outperform traditional statistical modelling techniques in terms of predictive performance. On the other hand, the black-box nature of these sophisticated approaches raises significant concerns about their practical utility, since researchers are skeptical about trusting prediction mechanisms that lack transparency and an explicit justification of how they arrive at a specific decision. These concerns have made it necessary to develop a suite of techniques within the broader AI field, known as eXplainable Artificial Intelligence (XAI), that efficiently mitigate interpretability and explainability barriers without sacrificing the performance of opaque ML models (Arrieta et al., 2020).
In this direction, the current study aims to develop a holistic approach that exploits ML advancements to provide empirical evidence on the presence and contamination intensity of Cryptosporidium oocysts and Giardia cysts in water samples. More specifically, the goals of this study are: (g1) to introduce a generic meta-learner ML approach that decomposes the model building phase into a binary classification task and a regression task, in order to uncover the complicated relationships between a set of microbiological, physicochemical and meteorological parameters and the two response variables (Cryptosporidium and Giardia (oo)cysts) in the presence of zero-inflated distributions; (g2) to conduct a benchmark experiment that empirically investigates and compares the prediction capabilities of state-of-the-art ML algorithms, which can in turn serve as future candidates for the meta-learner approach in similar experimental studies; and (g3) to leverage XAI techniques, with a strong focus on explaining the most critical markers and parameters that facilitate the decision-making process, and thus to increase the interpretability, trustworthiness and transparency of data-driven ML solutions and their practical implications for various stakeholders and policy-makers.
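To make the decomposition in (g1) concrete, the following is a minimal sketch of a two-stage (hurdle-style) scheme for zero-inflated counts: a classifier first decides whether (oo)cysts are present, and a regressor fitted only on the positive samples estimates the intensity. It is an illustration under stated assumptions, not the implementation used in this study. The class names (TwoStageModel, ThresholdClassifier, LineRegressor), the single-feature stand-in learners, the log1p transform of the counts and the toy values are all hypothetical; in practice, any classifier and regressor from the benchmark in (g2) could take the place of the stand-ins.

```python
from math import expm1, log1p


class ThresholdClassifier:
    """Stand-in classifier (assumption): presence if feature 0 >= learned cut-off."""

    def fit(self, X, labels):
        def errors(t):
            # Number of training samples misclassified by cut-off t.
            return sum((x[0] >= t) != bool(lab) for x, lab in zip(X, labels))

        self.threshold = min(sorted({x[0] for x in X}), key=errors)
        return self

    def predict(self, X):
        return [1 if x[0] >= self.threshold else 0 for x in X]


class LineRegressor:
    """Stand-in regressor (assumption): ordinary least squares on feature 0."""

    def fit(self, X, y):
        xs = [x[0] for x in X]
        x_bar, y_bar = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, y))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, X):
        return [self.intercept + self.slope * x[0] for x in X]


class TwoStageModel:
    """Binary classification (presence) followed by regression (intensity)."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        # Stage 1: learn presence/absence from all samples.
        self.classifier.fit(X, [1 if v > 0 else 0 for v in y])
        # Stage 2: learn intensity from positive samples only (log scale).
        X_pos = [x for x, v in zip(X, y) if v > 0]
        if not X_pos:
            raise ValueError("at least one positive sample is required")
        self.regressor.fit(X_pos, [log1p(v) for v in y if v > 0])
        return self

    def predict(self, X):
        present = self.classifier.predict(X)
        log_intensity = self.regressor.predict(X)
        # Predicted absence maps to zero; otherwise back-transform the count.
        return [max(0.0, expm1(i)) if p else 0.0
                for p, i in zip(present, log_intensity)]


# Toy, made-up values for illustration only: one hypothetical marker
# (e.g., a log-scaled FIB measurement) and (oo)cyst counts with many zeros.
X_train = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y_train = [0, 0, 0, 0, 2, 4, 7, 12]

model = TwoStageModel(ThresholdClassifier(), LineRegressor()).fit(X_train, y_train)
predictions = model.predict([[1.0], [4.0]])
print(predictions)
```

Fitting the regressor only on positive samples keeps the excess zeros from dominating the intensity estimate, which is the usual motivation for splitting the task in this way.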
