Summary
AI in Water Risk: The Model Provides Capability. The Framework Provides Auditability.
Why non-specialized models are a liability for capital decisions in corporate sustainability.
There's a question we get asked a lot by technical teams: "How do you overcome hallucinations from models?"
The answer isn't just a better model. It's what you build around one.
Even top-tier models show measurable hallucination rates. For brainstorming? Fine. For site selection decisions involving billions in capital? Not acceptable. Decision-makers need traceability, auditability, reproducibility - they need to know where every claim comes from and why it's credible.
At Waterplan, we built a production-grade water risk intelligence system that synthesizes five distinct evidence sources - from satellite data and government pipelines to AI-powered research in local languages - into auditable, locally-grounded risk scores across six dimensions, with global coverage.
The full post walks through a real example showing exactly how it works: location resolution that maps coordinates to actual hydrological features (not just a state name on a map), multi-language search queries that surface Dutch PDFs English-only searches miss, multi-stage filtering that rejects irrelevant sources before synthesis, and a concrete scored example from a real site.
The model provides capability. The framework provides auditability.
Want to read the full technical breakdown?
Introduction
There's a question we get asked a lot when we present our water risk intelligence system to technical teams, often from someone who's been building with LLMs themselves.
"Do you encounter issues with the consistency of responses from models?"
Yes. Most production teams do.
"How do you overcome that?"
That's what we built for. At Waterplan, we built a production-grade water risk intelligence system - one that combines satellite data, government sources, ground station measurements, and AI-powered research into auditable, decision-grade risk scores with full source traceability.
The Problem Everyone Knows But Few Solve
Foundation models have made remarkable progress. The latest releases are genuinely impressive. But even top-tier models with thinking and browsing enabled show measurable hallucination rates.
For brainstorming? Acceptable.
For risk management decisions (e.g., site selection) involving billions in capital? Not acceptable.
Decision-makers need to trust and govern their data. They need traceability, auditability, reproducibility. They need to know where every claim comes from and why it's credible.
The gap between "impressive model capability" and "production-grade decision support" is where most AI implementations fail. We built our entire stack to close that gap.
Why This Matters
If you're evaluating water risk for risk management, site selection, regulatory compliance, or any capital allocation decision, you need more than model capability. You need:
Global Scale Coverage: Same pipeline works in Texas, Netherlands, Chile, Singapore - with explicit gaps noted where data is sparse
Speed: Minutes per site, parallelized for portfolio assessment
Auditability: Every score has confidence level and source traceability
Honesty: When we don't know, we say so
The question isn't whether LLMs are good enough. They are.
The question is whether you have the infrastructure to make them decision-grade.
If you're making capital decisions that depend on water risk management, let's talk.
What We Actually Built
Waterplan’s Risk Framework - the orchestration engine that synthesizes evidence into scores - pulls from five distinct evidence sources to produce defensible risk assessments across six risk dimensions: scarcity, quality, flood, regulatory, infrastructure, and reputational.
No single source is sufficient.
Site Data Collection (SDC): Our AI-powered web search pipeline. Dynamic queries in local languages with multi-stage content filtering. Near-global coverage, with explicit gaps noted when falling back to other sources in low-information regions.
Government Sources: The reality is that most government water data doesn't exist as APIs. It's PDFs, scattered databases, agency websites. We built 40+ pipelines across 15+ focus countries (and expanding) to structure this raw data. FEMA flood zones. CONAGUA basin reports. JRC flood maps. EPA water quality monitoring.
Ground Stations: Streamflow gauges, groundwater wells, reservoir monitoring, and more. Sensor data with quality filters.
Expert-Curated Evidence: Handcrafted by our team of water experts. Location-specific knowledge, regulatory nuances, and contextual insights that automated pipelines may not capture. Continuously and incrementally improved as we expand coverage.
Satellite & Climate Models: ERA5 for precipitation (and supply assessments). GRACE for groundwater dynamics. GloFAS for river discharge. Future climate projections based on the latest ISIMIP outputs under RCP and SSP scenarios for forward-looking risk assessment. Peer-reviewed sources.
The framework then selects the best evidence for each location and indicator - prioritizing by granularity, recency, relevance and trust - and synthesizes it into calibrated scores. Same indicator definitions and rubric structure everywhere, but with region-specific thresholds and interpretation versioned per location.

Water Expertise at Every Framework Step
SDC is one input to the Risk Framework - specifically, the component that searches and structures web-based evidence. It runs four steps to produce structured evidence objects, which then feed into the Risk Framework for selection and scoring alongside the other sources.
Each stage requires domain knowledge that can't be prompted into existence.
Step 1: Location Resolution
Coordinates in, hydrological reality out.
A site in Phoenix isn't just "Arizona." We resolve it to:
Political Hierarchy: Level 0, Level 1, …, Level N (e.g., country, state, county, city)
Local Basin: Name, drainage area, allocation context
Local Aquifer: Name, extent, regulatory status
HydroBASINS mapping: Standardized ID for global comparison
WRI Aqueduct Basin: Cross-referenced for clients using that framework
That's what determines actual water exposure - not simply the state name on a map. More importantly, it identifies where the water supply is actually sourced from.
Hydrological boundaries matter more than administrative ones. Two sites 10km away might drain to entirely different basins with different allocation rules, different aquifer conditions, and different regulatory jurisdictions.
And we determine local languages. Brussels has Dutch and French zones. We search in both.
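A minimal sketch of what this resolution step might look like, assuming a simple bounding-box lookup. All names, IDs, and the toy index below are illustrative, not Waterplan's actual API; a real system would test the point against HydroBASINS polygon geometries rather than rectangles.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical sketch: names, IDs, and the toy bounding-box index are
# illustrative, not Waterplan's actual API or data.

@dataclass
class ResolvedLocation:
    political_hierarchy: list[str]  # country -> ... -> city
    basin: str                      # local drainage basin
    aquifer: str                    # local aquifer / management area
    hydrobasins_id: str             # standardized ID for global comparison
    languages: list[str]            # local languages to search in

# (lat_min, lat_max, lon_min, lon_max) -> resolved hydrological context
_INDEX = [
    ((33.0, 34.0, -112.6, -111.5), ResolvedLocation(
        ["United States", "Arizona", "Maricopa County", "Phoenix"],
        "Middle Gila", "Phoenix Active Management Area",
        "toy-7020021430", ["en"])),
    ((52.2, 52.5, 4.7, 5.1), ResolvedLocation(
        ["Netherlands", "Noord-Holland", "Amsterdam"],
        "Amstel / Amsterdam-Rijnkanaal corridor",
        "Dune aquifer (Amsterdamse Waterleidingduinen)",
        "toy-2030021790", ["nl", "en"])),
]

def resolve_location(lat: float, lon: float) -> ResolvedLocation | None:
    """Coordinates in, hydrological reality out - or an explicit gap."""
    for (lat0, lat1, lon0, lon1), meta in _INDEX:
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return meta
    return None  # explicit gap: no coverage for this point
```

The explicit `None` return mirrors the honesty principle above: a coverage gap is reported, never papered over.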
Step 2: AI-Powered Search
We generate water-specific queries in local language. Geo-targeted search across government sources, academic papers, water authority publications, and reputable news media sources.
Knowing what to search for is the expertise. Here's what that looks like in practice - search queries generated for the "Water supply/scarcity" indicator (NS1) at a site in Amsterdam, Netherlands:
Query 1: Historical Occurrence
("Amsterdam" OR "Gemeente Amsterdam") AND ("waterbalans" OR "wateraanbod" OR "watervraag" OR "waterbeschikbaarheid" OR "watertekort" OR "waterschaarste" OR "leveringszekerheid" OR "droogte")
AND ("water" OR "grondwater" OR "oppervlaktewater") AND ("systeem" OR "bronnen" OR "gebruikers")
AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht" OR "AGV" OR "Amsterdamse Waterleidingduinen"
OR "Bethunepolder" OR "Lekkanaal" OR "Amsterdam-Rijnkanaal") <modifiers>
Targets water availability and supply-demand dynamics specifically mentioning Waternet (Amsterdam's executing water utility), Waterschap Amstel, Gooi en Vecht (AGV - the governing regional water authority), and actual source-to-tap hydrological intakes like the Amsterdamse Waterleidingduinen and Bethunepolder, rather than general infrastructure topics.
Query 2: Monitoring & Data
("Amsterdam" OR "Gemeente Amsterdam") AND ("monitoring" OR "metingen" OR "meetgegevens"
OR "prognoses" OR "scenario's") AND ("water" OR "grondwater" OR "oppervlaktewater")
AND ("Klimaatbestendige Wateraanvoer" OR "KWA" OR "beschikbaarheid")
AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht" OR "AGV" OR "Amsterdam-Rijnkanaal") <modifiers>
Targets technical measurement data, forecasts, and scenario analyses - including operational drought-monitoring frameworks like the KWA (Klimaatbestendige Wateraanvoer) and Waterakkoord (water agreements) governing the Amsterdam-Rijnkanaal/Noordzeekanaal corridor.
Query 3: Risk & Vulnerability
("Amsterdam" OR "Gemeente Amsterdam") AND ("water stress" OR "watertekort" OR "waterschaarste" OR "verzilting"
OR "zoutindringing" OR "verdringingsreeks") AND ("water" OR "grondwater" OR "oppervlaktewater")
AND ("waterbalans" OR "waterbeschikbaarheid" OR "watervraag")
AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht" OR "AGV"
OR "Amsterdam-Rijnkanaal") <modifiers>
Targets scarcity risk assessments, drought impact analyses, and sufficiency evaluations for drinking water sources, avoiding unrelated maritime or flood-control data. (Note: Dutch water scarcity for low-lying delta cities manifests heavily as "verzilting" (salinization) when river discharge drops; shortages are governed legally by the "verdringingsreeks" priority sequence. Policy uses "waterschaarste" and "watertekort" interchangeably in Rijkswaterstaat and Deltaprogramma planning documents.)
English-only searches often under-surface Dutch-language PDFs (e.g., Waternet research reports, Rijkswaterstaat policy documents) and provincial water programme materials; targeted Dutch queries increase recall. For example, the Regionaal Waterprogramma Noord-Holland 2022-2027, Waternet's innovation reports, and the Waterakkoord ARK-NZK are Dutch-first sources that targeted queries will surface.
Modifiers
On top of that, each base query above is enhanced with modifiers:
Location modifiers use the granularity hierarchy Amsterdam, Gemeente Amsterdam, Noord-Holland, Netherlands to scope results from city-level up through province (Noord-Holland) and national level, ensuring coverage across:
Municipal sources: amsterdam.nl, openresearch.amsterdam
Water utility & Regional Authority: waternet.nl, agv.nl
Provincial: noord-holland.nl
National: rijkswaterstaat.nl, iplo.nl, open.overheid.nl, deltaprogramma.nl
Water body anchors pin queries to the regional hydrological network:
Amsterdamse Waterleidingduinen (AWD) -- primary dune filtration area for Amsterdam's drinking water
Bethunepolder / Lekkanaal -- critical deep groundwater seepage and river intake sources
Amsterdam-Rijnkanaal -- primary canal corridor connecting Amsterdam to the Rhine
Noordzeekanaal -- ship canal to the North Sea, saline intrusion boundary
Amstel -- river through central Amsterdam, connected to the polder system
File type and site filters (e.g., filetype:pdf, site:waternet.nl OR site:agv.nl OR site:rijkswaterstaat.nl OR site:*.overheid.nl), language parameters (hl=nl), and time-range filters round out the modifiers, balancing recency against historical records.
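Wrapping a base boolean query with these modifiers can be sketched as follows. The function name and parameters are assumptions for illustration, not the actual query-generation code:

```python
from __future__ import annotations

# Illustrative sketch of wrapping a base boolean query with location,
# site, and file-type modifiers. Names are assumptions, not the real API.

def apply_modifiers(base_query: str,
                    locations: list[str],
                    sites: list[str] | None = None,
                    filetype: str | None = None) -> str:
    """Scope a base query by the location hierarchy, then append
    site: and filetype: filters."""
    query = f"({base_query})"
    if locations:
        scoped = " OR ".join(f'"{loc}"' for loc in locations)
        query += f" AND ({scoped})"
    if sites:
        query += " (" + " OR ".join(f"site:{s}" for s in sites) + ")"
    if filetype:
        query += f" filetype:{filetype}"
    return query

q = apply_modifiers(
    '"waterschaarste" OR "watertekort"',
    locations=["Amsterdam", "Noord-Holland", "Netherlands"],
    sites=["waternet.nl", "rijkswaterstaat.nl"],
    filetype="pdf",
)
```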
Step 3: Content Filtering
Every piece of content goes through multi-stage relevance scoring, where location match, question relevance, and source credibility are assessed independently to ensure the content meets standards across all three dimensions.
The challenge: free-form text sources - PDFs, publications, news articles - aren't geolocated. They describe regions through names, images, and references. We have to geolocate them.
Does this "Springfield" match the Springfield we're looking for? There are dozens. Is this scarcity report actually about scarcity in that Springfield, or a different one 500 miles away?
We filter by hydrological context, location match, and indicator type. Failures are explicit, with LLM reasoning traces feeding feedback loops for continuous improvement. If content drops, we know where and why.
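The three-dimension gate with explicit failure modes might be sketched like this. The scores are stubs; in production each would come from an independent LLM or classifier pass with its own reasoning trace:

```python
from __future__ import annotations
from dataclasses import dataclass

# Sketch of the three-dimension relevance gate. Scores here are stub
# inputs; in production each comes from an independent assessment.

@dataclass
class FilterResult:
    passed: bool
    stage_failed: str | None  # explicit failure mode feeding feedback loops

def filter_content(location_match: float,
                   question_relevance: float,
                   source_credibility: float,
                   threshold: float = 0.7) -> FilterResult:
    """Content must clear all three independently assessed dimensions."""
    for name, score in (("location_match", location_match),
                        ("question_relevance", question_relevance),
                        ("source_credibility", source_credibility)):
        if score < threshold:
            return FilterResult(False, name)  # we know where and why it dropped
    return FilterResult(True, None)
```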
Step 4: Evidence Synthesis
Content that survives filtering becomes typed evidence. Each piece includes: source link, page reference, confidence score, and verification status (automated validation flags).
This is what SDC produces - structured evidence objects ready for downstream processing by the Risk Framework. Because every step is deterministic - from location resolution to query generation to content filtering - evidence is fully reproducible: the same source content processed through the same pipeline always produces an equivalent output.
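A hypothetical shape for such a typed evidence object, with a deterministic fingerprint illustrating the reproducibility property. Field names are assumptions based on this post, not the actual schema:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

# Hypothetical evidence object; field names are assumptions based on the
# post, not the real schema. The fingerprint illustrates reproducibility:
# the same content through the same pipeline yields the same output.

@dataclass(frozen=True)
class Evidence:
    source_url: str
    page_ref: str
    claim: str
    confidence: float
    verified: bool  # automated validation flag

    def fingerprint(self) -> str:
        """Deterministic hash over the canonicalized evidence payload."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```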
Downstream: Risk Framework Integration
SDC feeds evidence into the Risk Framework alongside government pipelines, ground stations, expert-curated evidence, and satellite data. The framework handles what comes next.
Evidence Selection: For each location and indicator, we select a group of evidence items from all available evidence across all four sources. Priority goes to the most granular, most recent, highest-trust source that's representative for the specific location under analysis. A hyperlocal ground station measurement beats a regional government report when both exist - provided the measurement fits the indicator under analysis and the station is representative of the actual water supply. Not just any nearby ground station will do.
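The prioritization described above can be sketched as a representativeness gate followed by a ranked sort. The granularity scale, ranking order, and field names are illustrative assumptions:

```python
from __future__ import annotations
from dataclasses import dataclass

# Sketch of prioritized evidence selection: representativeness gates
# first, then granularity beats recency beats trust. Scale and field
# names are illustrative assumptions.

GRANULARITY_RANK = {"site": 4, "basin": 3, "region": 2, "country": 1}

@dataclass
class Candidate:
    source: str
    granularity: str      # key into GRANULARITY_RANK
    year: int
    trust: float          # 0..1 score from a source catalog
    representative: bool  # does it reflect the actual water supply?

def select_evidence(candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Filter out unrepresentative items, then rank the rest."""
    eligible = [c for c in candidates if c.representative]
    eligible.sort(key=lambda c: (GRANULARITY_RANK[c.granularity], c.year, c.trust),
                  reverse=True)
    return eligible[:top_k]
```

Note that a site-level ground station from 2023 outranks a regional report from 2024 here: granularity dominates recency in this sketch.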
Scoring: Scoring translates selected evidence items into calibrated risk answers. "What's the scarcity risk for this site?" isn't answered by dumping sources - it's answered by synthesizing the selected evidence into a score with explicit reasoning.
Scoring is calibrated to context. "Low reservoir" means something different in Norway versus Arizona. The implications for industry, community, and supply-demand depend on the specifics of the region and local water system.
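A toy illustration of region-calibrated thresholds under a shared rubric. The threshold values below are invented purely for illustration; the point is that the same observation maps to different calibrated scores by region:

```python
# Toy illustration of region-specific thresholds under a shared rubric.
# Threshold values are invented: the point is that the same observation
# ("reservoir at 50% of capacity") maps to different calibrated scores.

REGION_THRESHOLDS = {
    # fraction of capacity below which each 1-5 score applies
    "norway":  [(0.30, 5), (0.45, 4), (0.60, 3), (0.75, 2)],
    "arizona": [(0.45, 5), (0.60, 4), (0.75, 3), (0.90, 2)],
}

def reservoir_risk(region: str, fill_fraction: float) -> int:
    """Same rubric structure everywhere; thresholds versioned per region."""
    for threshold, score in REGION_THRESHOLDS[region]:
        if fill_fraction < threshold:
            return score
    return 1  # comfortably full
```

The same 50% fill reads as moderate risk in one region and elevated risk in the other - "low reservoir" is not a universal constant.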
Human Expert Review: Not everything stays automated. Based on internal accuracy metrics and specific threshold patterns - edge cases, conflicting sources, anomalous scores - we escalate to our team of water experts. They review the final score and selected evidence items to ensure they accurately represent the location's reality. The pipeline handles volume; humans handle judgment calls.

A Concrete Example
Here's what a score looks like for a single indicator at an anonymized site:
Site: Industrial facility, Mediterranean Europe
Indicator: Flood occurrence
Question: What is the potential for flooding throughout the catchment, considering precipitation patterns, high flow trends, and past flood events?
Score: 5/5 (High)
Reasoning: More than 5 significant floods have seriously impacted lives and property in the last 2 years. Persistent extreme rainfall and flows regularly create flood conditions. Recent events resulted in hundreds of fatalities, widespread infrastructure disruption, and extensive economic losses. Historical records indicate a long-standing and persistent flood risk in the region.
Selected Evidences (in order of relevance):
| Granularity | Source Type | Title | Key Facts |
| --- | --- | --- | --- |
| Local Basin | Government | Ministerial hydrological report (2025) | Extreme rainfall amounts recorded indicate high potential for flash and riverine flooding (p. 31) |
| City | News | Regional flood severity analysis (2024) | Contributing factors included urban planning issues and intensifying climate change impacts (p. 1) |
| City | News | Economic impact assessment (2025) | Extensive inundation of agricultural land; significant damage to roads and rail infrastructure (p. 1) |
Evidence Selection Logic: Local Basin government source selected as primary (most granular, highest trust). City-level news sources corroborate severity and provide recent impact data.
Every score we produce carries an even more robust version of this structure. Auditors can trace any material claim back to its source, page reference, and implication.
How This Reduces Hallucination Risk
The pipeline addresses hallucination risk at every stage:
Maximize information availability. Location resolution plus multi-language dynamic search ensures we find relevant sources that generic queries miss. You can't fabricate sources when forced to link to existing ones. Critically, the system searches and collects real sources first, then processes them to generate evidence - not the other way around, as naive LLM workflows often do - so every citation traces back to an actual, verified link rather than an LLM-fabricated URL.
Reduce location, question, and source errors. Multi-stage filtering validates that content actually matches the location, actually answers the question, and comes from a credible source. Content gets rejected before it enters synthesis.
Evidence selection with explicit prioritization. When multiple sources exist, we select based on granularity, recency, and trust - not random sampling. Each indicator has its own set of auditable preferences.
Scoring with traceability. The final score for any indicator comes from explicit reasoning over selected evidence. Every material claim traces back to a source, page reference, and confidence level. An auditor can verify any cited statement.
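The "collect real sources first, then cite" ordering from the first stage above reduces to a simple invariant: a citation is accepted only if its URL was actually returned by the search stage. A minimal sketch, with a hypothetical function name:

```python
# Minimal sketch (hypothetical function name): a citation survives only
# if its URL was actually collected by the search stage, so a fabricated
# link can never enter synthesis.

def validate_citations(cited_urls: list[str],
                       collected_urls: set[str]) -> list[str]:
    """Keep only citations that trace back to a real, collected source."""
    return [u for u in cited_urls if u in collected_urls]
```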
The model provides capability. The pipeline provides auditability.
That's the distinction that matters for capital decisions, such as site selection (where to build the next operational site) or risk management (do I invest in reducing Risk X in Plant A or Risk Y in Plant B?).
What It Takes to Build This
Decision-grade reliability comes from pipeline & framework design, not model selection. That's why we built model-agnostic infrastructure.
This is years of work and where we constantly focus and invest:
40+ government data pipelines across 15+ countries, built from scratch because the APIs don't exist, with each source following a completely different data structure and schema.
Location resolution that maps coordinates to both administrative and hydrological boundaries globally, plus water terminology understanding across 150+ languages
Multi-stage filtering with explicit failure modes, feeding into evidence selection logic that prioritizes by granularity, recency, trust, and representativeness
Scoring framework with calibrated rubrics - same structure everywhere, region-specific thresholds versioned per location
Water domain experts who understand hydrological boundaries and local regulatory structures.
Source catalogs with quality scoring by region. Location resolution beyond political boundaries.
That's the infrastructure layer most teams underestimate.
If You Want to Work on These Problems
We're a focused team working on problems most companies don't tackle.
We run production AI systems that synthesize satellite data, government documents in multiple languages, and real-time sensor feeds into structured risk assessments. We operate at the intersection of climate data, hydrology, and enterprise decision-making. We believe meaningful impact is built with our own hands - one pipeline, one data source, one decision at a time.
Our stack: Python, TypeScript, AWS Lambda with strict memory and latency constraints on orchestration components, Step Functions for workflow management. We use lite/nano-class models for high-volume tasks and strategic deployment of larger models where reasoning depth matters. Model-agnostic architecture throughout.
Some interesting problems: location resolution at global scale, evidence selection across heterogeneous sources, scoring calibration that accounts for regional context. If you want to work on production AI systems where impact, accuracy, and auditability actually matter, we're building that.
The Bottom Line
Everyone has access to the same LLMs. The differentiation is what you build around them.
We're building AI pipelines that transform raw model capability into production-grade decision support. Water expertise encoded at every step. Traceability for every claim. Auditability by design, not by luck.
The model provides capability. The framework provides auditability.
That's Waterplan’s work.
We're hiring engineers who want to build AI systems that matter. Reach out if this sounds like your kind of problem.
Key Resources
GPT-5.2 Thinking with browsing enabled shows 5.8% of responses containing at least one major factual error - per OpenAI's own system card, Section 3.5.
Satellite and climate model evidence, due to the nature of the information it represents, is incorporated into the Risk Framework deterministically, through a customizable weighting model with defaults provided by Waterplan.
HydroBASINS - A global dataset of nested sub-basin boundaries derived from HydroSHEDS, providing standardized watershed delineations at multiple scales.
WRI Aqueduct - World Resources Institute's water risk atlas, widely used for corporate water risk assessment and disclosure.
Four, not five: satellite and climate model evidence (see the note above) is incorporated into the Risk Framework deterministically, so evidence selection applies to the remaining four sources.
Any deviation detected by a human expert becomes feedback used to improve pipeline accuracy, and thus to reduce the occurrence of further anomalies. This is Iterative Excellence in practice.