Published on:

AI Water Risk: The Model Provides Capability, The Framework Provides Auditability

Matias Comercio

Co-Founder, Senior AI Engineer & Mentor

Data and AI

Water Risk

Summary

AI in Water Risk: The Model Provides Capability. The Framework Provides Auditability.

Why non-specialized models are a liability for capital decisions in corporate sustainability.

There's a question we get asked a lot by technical teams: "How do you overcome hallucinations from models?"

The answer isn't just a better model. It's what you build around one.

Even top-tier models show measurable hallucination rates. For brainstorming? Fine. For site selection decisions involving billions in capital? Not acceptable. Decision-makers need traceability, auditability, reproducibility - they need to know where every claim comes from and why it's credible.

At Waterplan, we built a production-grade water risk intelligence system that synthesizes five distinct evidence sources - from satellite data and government pipelines to AI-powered research in local languages - into auditable, locally-grounded risk scores across six dimensions, with global coverage.

The full post walks through a real example showing exactly how it works: location resolution that maps coordinates to actual hydrological features (not just a state name on a map), multi-language search queries that surface Dutch PDFs English-only searches miss, multi-stage filtering that rejects irrelevant sources before synthesis, and a concrete scored example from a real site.

The model provides capability. The framework provides auditability.

Introduction

There's a question we get asked a lot when we present our water risk intelligence system to technical teams. It may also come from someone who's been building with LLMs themselves.

"Do you encounter issues with the consistency of responses from models?"

Yes. Most production teams do.

"How do you overcome that?"

That's what we built for. At Waterplan, we built a production-grade water risk intelligence system, one that combines satellite data, government sources, ground station measurements, and AI-powered research into auditable, decision-grade risk scores with full source traceability.

The Problem Everyone Knows But Few Solve

Foundation models have made remarkable progress. The latest releases are genuinely impressive. But even top-tier models with thinking and browsing enabled show measurable hallucination rates.

For brainstorming? Acceptable.

For risk management decisions (e.g., site selection) involving billions in capital? Not acceptable.

Decision-makers need to trust and govern their data. They need traceability, auditability, reproducibility. They need to know where every claim comes from and why it's credible.

The gap between "impressive model capability" and "production-grade decision support" is where most AI implementations fail. We built our entire stack to close that gap.

Why This Matters

If you're evaluating water risk for risk management, site selection, regulatory compliance, or any capital allocation decision, you need more than model capability. You need:

  • Global Scale Coverage: The same pipeline works in Texas, the Netherlands, Chile, Singapore - with explicit gaps noted where data is sparse

  • Speed: Minutes per site, parallelized for portfolio assessment

  • Auditability: Every score has a confidence level and source traceability

  • Honesty: When we don't know, we say so

The question isn't whether LLMs are good enough. They are.
The question is whether you have the infrastructure to make them decision-grade.

If you're making capital decisions that depend on water risk management, let's talk.

What We Actually Built

Waterplan’s Risk Framework - the orchestration engine that synthesizes evidence into scores - pulls from five distinct evidence sources to produce defensible risk assessments across six risk dimensions: scarcity, quality, flood, regulatory, infrastructure, and reputational.

No single source is sufficient.

  • Site Data Collection (SDC): Our AI-powered web search pipeline. Dynamic queries in local languages with multi-stage content filtering. Near-global coverage, with explicit gaps noted when falling back to other sources in low-information regions.

  • Government Sources: The reality is that most government water data doesn't exist as APIs. It's PDFs, scattered databases, agency websites. We built 40+ pipelines across 15+ focus countries (and expanding) to structure this raw data. FEMA flood zones. CONAGUA basin reports. JRC flood maps. EPA water quality monitoring.

  • Ground Stations: Streamflow gauges, groundwater wells, reservoir monitoring, etc. Sensor data with quality filters.

  • Expert-Curated Evidence: Handcrafted by our team of water experts. Location-specific knowledge, regulatory nuances, and contextual insights that automated pipelines may not capture. Continuously and incrementally improved as we expand coverage.

  • Satellite & Climate Models: ERA5 for precipitation (and supply assessments). GRACE for groundwater dynamics. GloFAS for river discharge. Future climate projections based on the latest ISIMIP outputs under RCP and SSP scenarios for forward-looking risk assessment. Peer-reviewed sources.

The framework then selects the best evidence for each location and indicator - prioritizing by granularity, recency, relevance and trust - and synthesizes it into calibrated scores. Same indicator definitions and rubric structure everywhere, but with region-specific thresholds and interpretation versioned per location.


Water Expertise at Every Framework Step

SDC is one input to the Risk Framework - specifically, the component that searches and structures web-based evidence. Four steps to produce structured evidence objects, which then feed into the Risk Framework for selection and scoring alongside the other sources.

Each stage requires domain knowledge that can't be prompted into existence.

Step 1: Location Resolution

Coordinates in, hydrological reality out.

A site in Phoenix isn't just "Arizona." We resolve it to:

  • Political Hierarchy: Level 0, Level 1, …, Level N (e.g., country, state, county, city)

  • Local Basin: Name, drainage area, allocation context

  • Local Aquifer: Name, extent, regulatory status

  • HydroBASINS mapping: Standardized ID for global comparison

  • WRI Aqueduct Basin: Cross-referenced for clients using that framework

That's what determines actual water exposure - not simply the state name on a map. More importantly, it identifies where the water supply is actually sourced from.

Hydrological boundaries matter more than administrative ones. Two sites 10 km apart might drain to entirely different basins with different allocation rules, different aquifer conditions, and different regulatory jurisdictions.

And we determine local languages. Brussels has Dutch and French zones. We search in both.
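As a rough illustration of what a resolved location could look like as a data structure, here is a minimal Python sketch. The class name, field names, and all example values (including the basin and aquifer identifiers) are hypothetical placeholders, not Waterplan's actual schema or real IDs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedLocation:
    """Hypothetical container for the output of location resolution."""
    latitude: float
    longitude: float
    political_hierarchy: tuple[str, ...]  # Level 0 .. Level N, broadest first
    local_basin: str
    local_aquifer: str
    hydrobasins_id: str                   # standardized ID for global comparison
    aqueduct_basin: str                   # cross-reference for WRI Aqueduct users
    languages: tuple[str, ...] = ()       # languages to search in


def share_basin(a: ResolvedLocation, b: ResolvedLocation) -> bool:
    """Two sites share hydrological exposure only if their basin IDs match."""
    return a.hydrobasins_id == b.hydrobasins_id


# Placeholder values: two nearby sites with the same administrative hierarchy
# that drain to different (made-up) basins.
site_a = ResolvedLocation(
    0.00, 0.00, ("Country X", "State Y", "County Z"),
    "Basin A", "Aquifer A", "HB-PLACEHOLDER-1", "AQ-PLACEHOLDER-1", ("nl", "fr"),
)
site_b = ResolvedLocation(
    0.05, 0.05, ("Country X", "State Y", "County Z"),
    "Basin B", "Aquifer B", "HB-PLACEHOLDER-2", "AQ-PLACEHOLDER-2", ("nl", "fr"),
)

print(share_basin(site_a, site_b))
```

The comparison is made on the hydrological ID rather than on the political hierarchy, which is the point of the paragraph above: identical administrative labels do not imply identical water exposure.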

Step 2: AI-Powered Search

We generate water-specific queries in the local language, then run geo-targeted searches across government sources, academic papers, water authority publications, and reputable news media.

Knowing what to search for is the expertise. Here's what that looks like in practice, with search queries generated for the "Water supply/scarcity" indicator (NS1) at a site in Amsterdam, Netherlands:

Query 1: Historical Occurrence

("Amsterdam" OR "Gemeente Amsterdam") AND ("waterbalans" OR "wateraanbod" OR "watervraag" OR "waterbeschikbaarheid" OR "watertekort" OR "waterschaarste" OR "leveringszekerheid" OR "droogte")

  AND ("water" OR "grondwater" OR "oppervlaktewater") AND ("systeem" OR "bronnen" OR "gebruikers")

  AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht" OR "AGV" OR "Amsterdamse Waterleidingduinen" 

  OR "Bethunepolder" OR "Lekkanaal" OR "Amsterdam-Rijnkanaal") <modifiers>

Targets water availability and supply-demand dynamics specifically mentioning Waternet (Amsterdam's executing water utility), Waterschap Amstel, Gooi en Vecht (AGV - the governing regional water authority), and actual source-to-tap hydrological intakes like the Amsterdamse Waterleidingduinen and Bethunepolder, rather than general infrastructure topics.

Query 2: Monitoring & Data

("Amsterdam" OR "Gemeente Amsterdam") AND ("monitoring" OR "metingen" OR "meetgegevens"

  OR "prognoses" OR "scenario's") AND ("water" OR "grondwater" OR "oppervlaktewater")

  AND ("Klimaatbestendige Wateraanvoer" OR "KWA" OR "beschikbaarheid")

  AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht"  OR "AGV"  OR "Amsterdam-Rijnkanaal") <modifiers>

Targets technical measurement data, forecasts, and scenario analyses, including operational drought-monitoring frameworks like the KWA (Klimaatbestendige Wateraanvoer) and Waterakkoord (water agreements) governing the Amsterdam-Rijnkanaal/Noordzeekanaal corridor.

Query 3: Risk & Vulnerability

("Amsterdam" OR "Gemeente Amsterdam") AND ("water stress" OR "watertekort" OR "waterschaarste" OR "verzilting"

  OR "zoutindringing" OR "verdringingsreeks") AND ("water" OR "grondwater" OR "oppervlaktewater")

  AND ("waterbalans" OR "waterbeschikbaarheid" OR "watervraag")

  AND ("Noord-Holland" OR "Waternet" OR "Waterschap Amstel, Gooi en Vecht" OR "AGV"

  OR "Amsterdam-Rijnkanaal") <modifiers>

Targets scarcity risk assessments, drought impact analyses, and sufficiency evaluations for drinking water sources, avoiding unrelated maritime or flood-control data. (Note: Dutch water scarcity for low-lying delta cities manifests heavily as "verzilting" (salinization) when river discharge drops; shortages are governed legally by the "verdringingsreeks" priority sequence. Policy uses "waterschaarste" and "watertekort" interchangeably in Rijkswaterstaat and Deltaprogramma planning documents.)

English-only searches often under-surface Dutch-language PDFs (e.g., Waternet research reports, Rijkswaterstaat policy documents) and provincial water programme materials; targeted Dutch queries increase recall. For example, the Regionaal Waterprogramma Noord-Holland 2022-2027, Waternet's innovation reports, and the Waterakkoord ARK-NZK are Dutch-first sources that targeted queries will surface.
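The queries above share one shape: several OR-groups of quoted terms joined by AND. A minimal Python sketch of that composition follows. The helper names `or_group` and `build_query` are illustrative assumptions, and the term lists are a shortened subset of the Dutch terms shown above.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(groups):
    """Join several OR-groups with AND, so every group must match."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query([
    ["Amsterdam", "Gemeente Amsterdam"],            # place names
    ["watertekort", "waterschaarste", "droogte"],   # indicator vocabulary
    ["water", "grondwater", "oppervlaktewater"],    # water type
    ["Waternet", "AGV"],                            # local authorities
])
print(query)
```

Keeping each concern (place, indicator vocabulary, water type, authority) in its own group makes it possible to swap vocabulary per language or per indicator without touching the rest of the query.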

Modifiers

Each of the base queries above is then extended with modifiers:

Location modifiers use the granularity hierarchy Amsterdam, Gemeente Amsterdam, Noord-Holland, Netherlands to scope results from city-level up through province (Noord-Holland) and national level, ensuring coverage across:

  • Municipal sources: amsterdam.nl, openresearch.amsterdam

  • Water utility & Regional Authority: waternet.nl, agv.nl

  • Provincial: noord-holland.nl

  • National: rijkswaterstaat.nl, iplo.nl, open.overheid.nl, deltaprogramma.nl

Water body anchors pin queries to the regional hydrological network:

  • Amsterdamse Waterleidingduinen (AWD) -- primary dune filtration area for Amsterdam's drinking water

  • Bethunepolder / Lekkanaal -- critical deep groundwater seepage and river intake sources

  • Amsterdam-Rijnkanaal -- primary canal corridor connecting Amsterdam to the Rhine

  • Noordzeekanaal -- ship canal to the North Sea, saline intrusion boundary

  • Amstel -- river through central Amsterdam, connected to the polder system

Additional modifiers include file type and site filters (e.g., filetype:pdf, site:waternet.nl OR site:agv.nl OR site:rijkswaterstaat.nl OR site:*.overheid.nl), language parameters (hl=nl), and time-range filters that balance recency against historical records.
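To make the modifier step concrete, here is a small sketch that appends site and file type filters to a base query and carries the language as a separate request parameter. The function name and the returned dictionary shape are assumptions for illustration; the operators (`site:`, `filetype:`) and the `hl` parameter are the ones mentioned above, and wrapping the site filters in parentheses is a choice made here, not something the source specifies.

```python
def build_search_request(base_query, sites=(), filetype=None, language=None):
    """Attach modifiers to a base query; returns hypothetical request params."""
    parts = [base_query]
    if sites:
        # Restrict results to a set of trusted domains.
        parts.append("(" + " OR ".join(f"site:{s}" for s in sites) + ")")
    if filetype:
        parts.append(f"filetype:{filetype}")
    params = {"q": " ".join(parts)}
    if language:
        params["hl"] = language  # interface/result language hint
    return params


request = build_search_request(
    '("Amsterdam" OR "Gemeente Amsterdam") AND ("droogte")',
    sites=("waternet.nl", "agv.nl"),
    filetype="pdf",
    language="nl",
)
print(request)
```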

Step 3: Content Filtering

Every piece of content goes through multi-stage relevance scoring, where location match, question relevance, and source credibility are assessed independently to ensure the content meets standards across all three dimensions.

The challenge: free-form text sources - PDFs, publications, news articles - aren't geolocated. They describe regions through names, images, and references. We have to geolocate them.

Does this "Springfield" match the Springfield we're looking for? There are dozens. Is this scarcity report actually about scarcity in that Springfield, or a different one 500 miles away?

We filter by hydrological context, location match, and indicator type. Failures are explicit, and the LLM reasoning traces behind them feed feedback loops for continuous improvement. If content drops, we know where and why.
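One way to picture independent, explicit filtering is a gate that checks each dimension separately and records which stage rejected the content. The sketch below assumes the three scores have already been produced upstream (for example by hypothetical LLM or rule-based scorers); the 0.7 threshold and all names are illustrative, not production values.

```python
from dataclasses import dataclass
from typing import Optional

# The three dimensions named above, each assessed independently.
STAGES = ("location_match", "question_relevance", "source_credibility")


@dataclass(frozen=True)
class FilterResult:
    accepted: bool
    failed_stage: Optional[str] = None
    reason: str = ""


def filter_content(scores, threshold=0.7):
    """Reject at the first stage whose score falls below the threshold."""
    for stage in STAGES:
        score = scores.get(stage, 0.0)  # a missing score counts as a failure
        if score < threshold:
            reason = f"{stage} score {score:.2f} below threshold {threshold:.2f}"
            return FilterResult(False, stage, reason)
    return FilterResult(True)


# A credible, on-topic report about the wrong Springfield is still rejected.
wrong_city = filter_content(
    {"location_match": 0.2, "question_relevance": 0.9, "source_credibility": 0.95}
)
print(wrong_city)
```

Because the result names the failing stage, a dropped document can be traced to the exact reason it was dropped, which is what makes the failure "explicit".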

Step 4: Evidence Synthesis

Content that survives filtering becomes typed evidence. Each piece includes: source link, page reference, confidence score, and verification status (automated validation flags).

This is what SDC produces - structured evidence objects ready for downstream processing by the Risk Framework. Because every step is deterministic - from location resolution to query generation to content filtering - evidence is fully reproducible: the same source content processed through the same pipeline always produces an equivalent output.
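A typed evidence object with the fields listed above might look like the following sketch. The class, the field names, and the `fingerprint` helper are assumptions made for illustration; the fingerprint simply shows one way to check that two pipeline runs produced equivalent output. The URL is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical typed evidence object."""
    source_url: str
    page: int
    key_fact: str
    confidence: float
    verified: bool  # outcome of automated validation flags


def fingerprint(evidence):
    """Stable hash of an evidence object, for comparing pipeline runs."""
    payload = json.dumps(asdict(evidence), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_1 = Evidence("https://example.org/report.pdf", 31, "Extreme rainfall recorded", 0.9, True)
run_2 = Evidence("https://example.org/report.pdf", 31, "Extreme rainfall recorded", 0.9, True)
print(fingerprint(run_1) == fingerprint(run_2))
```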

Downstream: Risk Framework Integration

SDC feeds evidence into the Risk Framework alongside government pipelines, ground stations, expert-curated evidence, and satellite data. The framework handles what comes next.

Evidence Selection: For each location and indicator, we select a group of evidence items from all available evidence across four sources (satellite and climate model evidence is incorporated separately through a deterministic weighting model; see Key Resources). Priority goes to the most granular, most recent, highest-trust source that's representative for the specific location under analysis. A hyperlocal ground station measurement beats a regional government report when three conditions hold: both exist, a measurement is the preferred evidence type for the indicator under analysis, and the station is representative of the actual water supply. Not just any nearby ground station will do.
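The prioritization described above can be sketched as a filter followed by a sort. Everything in this example is hypothetical: the granularity ranking, the trust scale, and the candidate records are made up to show the ordering logic, not how the Risk Framework is actually implemented.

```python
from dataclasses import dataclass

# Hypothetical ranking: lower number means more granular.
GRANULARITY_RANK = {"site": 0, "local_basin": 1, "city": 2, "region": 3, "country": 4}


@dataclass(frozen=True)
class Candidate:
    title: str
    granularity: str
    year: int
    trust: int            # higher means more trusted (illustrative scale)
    representative: bool  # does it reflect the site's actual water supply?


def select_evidence(candidates, limit=3):
    """Drop non-representative items, then rank by granularity, recency, trust."""
    eligible = [c for c in candidates if c.representative]
    ranked = sorted(
        eligible,
        key=lambda c: (GRANULARITY_RANK[c.granularity], -c.year, -c.trust),
    )
    return ranked[:limit]


candidates = [
    Candidate("Regional government report", "region", 2024, 3, True),
    Candidate("Ground station at supply intake", "site", 2025, 3, True),
    Candidate("Nearby station on another river", "site", 2025, 3, False),
]
selected = select_evidence(candidates)
print([c.title for c in selected])
```

Representativeness is applied as a hard filter before ranking, so a nearby but unrepresentative station can never outrank a coarser source that actually describes the site's supply.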

Scoring: Scoring translates selected evidence items into calibrated risk answers. "What's the scarcity risk for this site?" isn't answered by dumping sources - it's answered by synthesizing the selected evidence into a score with explicit reasoning.

Scoring is calibrated to context. "Low reservoir" means something different in Norway versus Arizona. The implications for industry, community, and supply-demand depend on the specifics of the region and local water system.
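A minimal sketch of "same rubric, region-specific thresholds": the scoring function is identical everywhere, and only the cut points change per region. The region names and threshold values below are invented purely for illustration and do not reflect real calibration.

```python
from bisect import bisect_right

# Hypothetical cut points on reservoir storage (% of capacity), ascending.
# Lower storage means higher risk; each region has its own calibration.
THRESHOLDS = {
    "region_a": [20, 40, 60, 80],
    "region_b": [10, 25, 40, 55],
}


def score_storage(region, storage_pct):
    """Map reservoir storage to a 1-5 risk score using regional cut points."""
    cuts = THRESHOLDS[region]
    # bisect_right counts how many cut points are <= storage_pct (0 to 4).
    return 5 - bisect_right(cuts, storage_pct)


# The same 30% storage reading yields different scores in different regions.
print(score_storage("region_a", 30), score_storage("region_b", 30))
```

Storing the cut points as data rather than code is one way thresholds could be versioned per location while the rubric structure stays fixed.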

Human Expert Review: Not everything stays automated. Based on internal accuracy metrics and specific threshold patterns - edge cases, conflicting sources, anomalous scores - we escalate to our team of water experts. They review the final score and selected evidence items to ensure they accurately represent the location's reality. The pipeline handles volume; humans handle judgment calls.
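An escalation rule of this kind could be sketched as a function that returns the reasons a score should go to an expert. The two checks and their cut-offs below are assumptions for illustration; the actual accuracy metrics and threshold patterns are not described in this post.

```python
def review_reasons(evidence_scores, confidence, min_confidence=0.6, max_spread=2):
    """Return the reasons (if any) a score should be escalated to an expert."""
    reasons = []
    if confidence < min_confidence:
        reasons.append("low confidence")
    # Evidence items implying very different scores suggest conflicting sources.
    if evidence_scores and max(evidence_scores) - min(evidence_scores) >= max_spread:
        reasons.append("conflicting sources")
    return reasons


print(review_reasons([5, 2], confidence=0.9))
```

An empty list means the score stays automated; a non-empty list doubles as an audit note explaining why a human was involved.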


<SDC-PIPELINE-DIAGRAM: Location Resolution → AI-Powered Search → Content Filtering → Evidence Synthesis → Risk Framework>

A Concrete Example

Here's what a score looks like for a single indicator at an anonymized site:

Site: Industrial facility, Mediterranean Europe
Indicator: Flood occurrence
Question: What is the potential for flooding throughout the catchment, considering precipitation patterns, high flow trends, and past flood events?

Score: 5/5 (High)
Reasoning: More than 5 significant floods have seriously impacted lives and property in the last 2 years. Persistent extreme rainfall and flows regularly create flood conditions. Recent events resulted in hundreds of fatalities, widespread infrastructure disruption, and extensive economic losses. Historical records indicate a long-standing and persistent flood risk in the region.

Selected Evidence (in order of relevance):

| Granularity | Source Type | Title | Key Facts |
| --- | --- | --- | --- |
| Local Basin | Government | Ministerial hydrological report (2025) | Extreme rainfall amounts recorded indicate high potential for flash and riverine flooding (p. 31) |
| City | News | Regional flood severity analysis (2024) | Contributing factors included urban planning issues and intensifying climate change impacts (p. 1) |
| City | News | Economic impact assessment (2025) | Extensive inundation of agricultural land; significant damage to roads and rail infrastructure (p. 1) |

Evidence Selection Logic: Local Basin government source selected as primary (most granular, highest trust). City-level news sources corroborate severity and provide recent impact data.

Every score we produce carries an even more detailed version of this structure. Auditors can trace any material claim back to its source, page reference, and implication.

How This Reduces Hallucination Risk

The pipeline addresses hallucination risk at every stage:

Maximize information availability. Location resolution plus multi-language dynamic search ensures we find relevant sources that generic queries miss. You can't fabricate sources when forced to link to existing ones. Critically, the system searches and collects real sources first, then processes them to generate evidence (not the other way around, as is commonly done) - so every citation traces back to an actual, verified link rather than an LLM-fabricated URL.

Reduce location, question, and source errors. Multi-stage filtering validates that content actually matches the location, actually answers the question, and comes from a credible source. Content gets rejected before it enters synthesis.

Evidence selection with explicit prioritization. When multiple sources exist, we select based on granularity, recency, and trust - not random sampling. Each indicator has its own set of auditable preferences.

Scoring with traceability. The final score for any indicator comes from explicit reasoning over selected evidence. Every material claim traces back to a source, page reference, and confidence level. An auditor can verify any cited statement.
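The "collect sources first" idea from the first point above lends itself to a simple mechanical check: any URL cited in generated output must already be in the set of sources the pipeline actually collected. This is a sketch under that assumption, with a hypothetical function name and placeholder URLs.

```python
def ungrounded_citations(cited_urls, collected_urls):
    """Return cited URLs that were never collected by the search stage."""
    collected = set(collected_urls)
    return [url for url in cited_urls if url not in collected]


collected = ["https://example.org/report.pdf", "https://example.org/news-article"]
cited = ["https://example.org/report.pdf", "https://example.org/made-up-source"]

# Anything returned here would be blocked before it reaches a score.
print(ungrounded_citations(cited, collected))
```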

The model provides capability. The pipeline provides auditability.

That's the distinction that matters for capital decisions, such as site selection (i.e., where to build the next operational site) or risk management (i.e., whether to invest in reducing Risk X in Plant A or Risk Y in Plant B).

What It Takes to Build This

Decision-grade reliability comes from pipeline & framework design, not model selection. That's why we built model-agnostic infrastructure.

This is years of work, and it is where we constantly focus and invest:

  • 40+ government data pipelines across 15+ countries, built from scratch because the APIs don't exist, with each source following a completely different data structure and schema.

  • Location resolution that maps coordinates to both administrative and hydrological boundaries globally, plus water terminology understanding across 150+ languages

  • Multi-stage filtering with explicit failure modes, feeding into evidence selection logic that prioritizes by granularity, recency, trust, and representativeness

  • Scoring framework with calibrated rubrics - same structure everywhere, region-specific thresholds versioned per location

  • Water domain experts who understand hydrological boundaries and local regulatory structures.

  • Source catalogs with quality scoring by region. Location resolution beyond political boundaries.

That's the infrastructure layer most teams underestimate.

If You Want to Work on These Problems

We're a focused team working on problems most companies don't tackle.

We run production AI systems that synthesize satellite data, government documents in multiple languages, and real-time sensor feeds into structured risk assessments. We operate at the intersection of climate data, hydrology, and enterprise decision-making. We believe meaningful impact is built with our own hands - one pipeline, one data source, one decision at a time.

Our stack: Python, TypeScript, AWS Lambda with strict memory and latency constraints on orchestration components, Step Functions for workflow management. We use lite/nano-class models for high-volume tasks and strategic deployment of larger models where reasoning depth matters. Model-agnostic architecture throughout.

Some interesting problems: location resolution at global scale, evidence selection across heterogeneous sources, scoring calibration that accounts for regional context. If you want to work on production AI systems where impact, accuracy, and auditability actually matter, we're building that.

The Bottom Line

Everyone has access to the same LLMs. The differentiation is what you build around them.

We're building AI pipelines that transform raw model capability into production-grade decision support. Water expertise encoded at every step. Traceability for every claim. Auditability by design, not by luck.

The model provides capability. The framework provides auditability.

That's Waterplan’s work.

We're hiring engineers who want to build AI systems that matter. Reach out if this sounds like your kind of problem.

Key Resources

  1. GPT-5.2 Thinking with browsing enabled shows 5.8% of responses containing at least one major factual error - per OpenAI's own system card, Section 3.5.

  2. Satellite and Climate Models evidence, due to the nature of the information it represents, is incorporated into the Risk Framework deterministically, through a customizable weighting model with defaults offered by Waterplan.

  3. HydroBASINS - A global dataset of nested sub-basin boundaries derived from HydroSHEDS, providing standardized watershed delineations at multiple scales.

  4. WRI Aqueduct - World Resources Institute's water risk atlas, widely used for corporate water risk assessment and disclosure.

  5. Four, not five. See the Satellite & Climate Models evidence note above (TL;DR: this evidence is incorporated into the Risk Framework deterministically).

  6. Any deviation detected by a human expert is feedback used to improve pipeline accuracy and thus reduce the occurrence of further anomalies. This is Iterative Excellence in practice.

Reading time:

5 to 7 minutes


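Founded in 2021, we are a SaaS company that helps corporate sustainability teams accelerate their journey toward water security. Waterplan is the leading water platform for measuring, responding to, and reporting water risk. It saves time from water data collection through reporting, provides access to the highest-quality water risk data and to water experts, and enables stakeholders to align on action against water risk.

2193 Fillmore Street.

San Francisco, CA 94115

© 2024 Climateplan Inc. All rights reserved.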
