Fluctuat nec mergitur.
Paris was born, as we know, on that old island of the Cité shaped like a cradle. The strand of that island was its first rampart, the Seine its first moat. (Victor Hugo, Notre-Dame de Paris, 1831).
This is a follow-up project inspired by the Hackaviz 2025 dataset. This repository contains a Python script that builds the Paris flood dataset available on Kaggle. The Joy Division-inspired data visualization, which was my original hackathon submission, is available in the following notebooks: 1 and 2.
The resulting dataset is useful for studying flood-risk indicators and their seasonal patterns.
- La Seine à Paris, a daily maximum water-level data collector
This project demonstrates production-style data ingestion patterns and concurrent API handling using Python.
The algorithm fetches historical daily maximum water height observations for multiple hydrometric stations from this API endpoint:
https://hubeau.eaufrance.fr/api/v2/hydrometrie/obs_elab

It then:
- Merges all stations' data
- Detects duplicate dates
- Detects missing dates across the global time series
- Builds continuous missing intervals
- Exports a chronologically sorted CSV file
Output file:
paris_flood_dataset.csv
- Modular architecture
- Connection reuse via `requests.Session`
- Concurrent multi-station data fetching
- Automatic date-based pagination
- Global duplicate detection
- Continuous missing-date interval detection
- Linear-time gap detection algorithm
- Chronological CSV export
The pipeline runs in three major steps:
```
+-----------------------------+
|        Pipeline Flow        |
+-----------------------------+
| 1) Concurrent Collection    |
|    - ThreadPoolExecutor     |
|    - Fetch stations in      |
|      parallel               |
+-----------------------------+
| 2) Data Consolidation       |
|    - Merge into one         |
|      DataFrame              |
+-----------------------------+
| 3) Integrity Analysis       |
|    - Reconstruct global     |
|      time range             |
|    - Detect: missing days,  |
|      continuous missing     |
|      ranges, duplicate days |
+-----------------------------+
```

Ingestion logic is kept separate from validation logic.
```
Main
├── concurrent_fetch()
│   ├── ThreadPoolExecutor
│   └── fetch_station()
│       ├── API calls
│       ├── Pagination loop
│       └── DataFrame assembly
│
├── detect_global_gaps()
│   ├── Duplicate detection
│   ├── Full date range reconstruction
│   ├── Missing date detection
│   └── Interval grouping
└── CSV sort & export
```
Requires Python 3.10+.

```bash
git clone https://github.com/hyperphantasia/paris-flood-dataset.git
cd paris-flood-dataset
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```

`requirements.txt`:

```
requests
pandas
```
- Default parameters:

```python
API_BASE = "https://hubeau.eaufrance.fr/api/v2/hydrometrie/obs_elab"
METRIC = "HIXnJ"
OUTPUT = "paris_flood_dataset.csv"
MAX_PER_PAGE = 20000
MAX_WORKERS = 5
```

| Parameter | Type | Example | Description |
|---|---|---|---|
| `API_BASE` | string | `"https://hubeau.eaufrance.fr/api/v2/hydrometrie/obs_elab"` | Base URL of the API endpoint to query hydrometry observations. Max URL size: 2083 characters. |
| `METRIC` | string | `"HIXnJ"` | Identifier of the metric to request from the API. `HIXnJ` = observed daily max in mm. Other available metrics: see Metrics. |
| `OUTPUT` | string (filename) | `"paris_flood_dataset.csv"` | Local filename where the downloaded results are saved (CSV format). |
| `MAX_PER_PAGE` | integer | 20000 | Maximum number of records to request per API page (page size / limit parameter). For the Hub'Eau API, the default is 5000 and the max is 20000. |
| `MAX_WORKERS` | integer | 5 | Maximum number of concurrent worker threads to use for parallel requests. |
Tip

For I/O-bound work (API requests), a useful heuristic is:

optimal MAX_WORKERS ≈ min(4 × cores, floor(R × avg_req_time_sec), memory_limit)

where R is the target request rate in requests per second. By Little's law, R × avg_req_time_sec is the number of requests in flight at any moment, so more workers than that would sit idle.
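The heuristic above can be computed directly. A minimal sketch, where the function name and the `memory_limit` cap are illustrative rather than part of the repository:

```python
import math
import os

def suggest_max_workers(target_rps: float, avg_req_time_sec: float,
                        memory_limit: int = 32) -> int:
    """Heuristic worker count for I/O-bound fetching (illustrative helper).

    target_rps is the request rate R you want to sustain (requests/second);
    by Little's law, R * avg_req_time_sec requests are in flight at once.
    """
    cores = os.cpu_count() or 1
    in_flight = math.floor(target_rps * avg_req_time_sec)
    return max(1, min(4 * cores, in_flight, memory_limit))
```

For example, sustaining 100 req/s at 50 ms per request keeps about 5 requests in flight, so roughly 5 workers suffice unless the core-count cap is lower.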
- Hydrometric stations:

```python
STATIONS = [
    "F700000109",
    "F700000110",
    ...
]
```

References for various hydrometric stations can be found on the French Hydroportail. Below is the reference for the Seine à Paris stations:
https://www.hydro.eaufrance.fr/sitehydro/F7000001/fiche
| obs_elab Code | French | English |
|---|---|---|
| QmnJ | débit moyen journalier | daily mean flow |
| QmM | débit moyen mensuel | monthly mean flow |
| HIXnJ | hauteur instantanée maximale journalière en mm | daily maximum instantaneous water level in mm |
| HIXM | hauteur instantanée maximale mensuelle | monthly maximum instantaneous water level |
| QIXnJ | débit instantané maximal journalier | daily maximum instantaneous flow |
| QIXM | débit instantané maximal mensuel | monthly maximum instantaneous flow |
| QINnJ | débit instantané minimal journalier | daily minimum instantaneous flow |
| QINM | débit instantané minimal mensuel | monthly minimum instantaneous flow |
More details.
Run:
```bash
python fluctuat_nec_mergitur.py
```

- Sample console output:
Processing station F700000109
Processing station F700000110
Processing station F700000111
Processing station F700000102
Processing station F700000103
From F700000103 → 7316 rows
From F700000111 → 5842 rows
From F700000110 → 2920 rows
From F700000109 → 20000 rows
From F700000109 → 4104 rows
From F700000102 → 13147 rows
Sorted and saved data to paris_flood_dataset.csv
start_date: 1900-01-02 00:00:00
end_date: 2026-02-27 00:00:00
expected_days: 46078
observed_days: 46059
missing_days_count: 19
duplicate_records_count: 0
missing_ranges: [(Timestamp('1965-12-31 00:00:00'), Timestamp('1966-01-01 00:00:00')), (Timestamp('1973-12-31 00:00:00'), Timestamp('1974-01-01 00:00:00')), (Timestamp('1989-12-31 00:00:00'), Timestamp('1990-01-01 00:00:00')), (Timestamp('1992-06-29 00:00:00'), Timestamp('1992-07-04 00:00:00')), (Timestamp('1994-09-06 00:00:00'), Timestamp('1994-09-06 00:00:00')), (Timestamp('1994-09-23 00:00:00'), Timestamp('1994-09-23 00:00:00')), (Timestamp('1994-09-30 00:00:00'), Timestamp('1994-09-30 00:00:00')), (Timestamp('1995-10-20 00:00:00'), Timestamp('1995-10-20 00:00:00')), (Timestamp('1998-12-31 00:00:00'), Timestamp('1998-12-31 00:00:00')), (Timestamp('1999-05-20 00:00:00'), Timestamp('1999-05-20 00:00:00')), (Timestamp('2000-04-03 00:00:00'), Timestamp('2000-04-03 00:00:00'))]
Final CSV columns include:
| Column Name | Description |
|---|---|
| code_site | Location ID |
| code_station | Station ID |
| date_obs_elab | Observation date |
| resultat_obs_elab | Observed value: daily max water level (in mm) |
| grandeur_hydro_elab | Metric code (HIXnJ) |
| date_prod | Data production date (processing date) |
| code_statut | Validation status code |
| libelle_statut | Validation status label |
| code_methode | Production method code |
| libelle_methode | Production method label |
| code_qualification | Data quality assessment code |
| libelle_qualification | Data quality assessment label |
| longitude | Station longitude |
| latitude | Station latitude |
Code values are explained in this document (in French).
```
code_site,code_station,date_obs_elab,resultat_obs_elab,date_prod,code_statut,libelle_statut,code_methode,libelle_methode,code_qualification,libelle_qualification,longitude,latitude,grandeur_hydro_elab
F7000001,F700000109,1900-01-02,1300.0,2025-06-17T09:27:10Z,16,Donnée validée,0,Mesurée,20,Bonne,2.365515502,48.845409133,HIXnJ
```

The `fetch_station()` function:
- Starts at a fixed historical date (`1900-01-01`)
- Fetches up to `MAX_PER_PAGE` rows
- Extracts the last observation date
- Resumes from `last_date` + 1 day
- Stops when fewer than `MAX_PER_PAGE` rows are returned
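The steps above can be sketched independently of the network layer. In this sketch, `fetch_page` is a hypothetical injected callable standing in for the actual HTTP request, which keeps the pagination logic testable:

```python
from datetime import date, timedelta

def paginate(fetch_page, start=date(1900, 1, 1), page_size=20000):
    """Date-based pagination: request a page, resume from the last
    returned date + 1 day, stop when a short page comes back."""
    all_rows = []
    cursor = start
    while True:
        rows = fetch_page(cursor, page_size)   # list of (obs_date, value) tuples
        all_rows.extend(rows)
        if len(rows) < page_size:              # short page => no more data
            break
        last_date = rows[-1][0]                # rows come back date-sorted
        cursor = last_date + timedelta(days=1) # resume after the last date
    return all_rows
```

Because each new request starts strictly after the last date already received, no record is skipped or fetched twice, regardless of how many pages the full history spans.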
Uses:

```python
ThreadPoolExecutor(max_workers=5)
```

Why threads?
- API calls are I/O-bound
- Threads hide per-request latency by overlapping network waits
- Controlled worker count prevents overload
Time complexity: O(N / workers) for network wait.
This avoids inefficient offset-based pagination (using page numbers), and allows full coverage.
Strategy:
- Fetch maximum allowed rows
- Use last returned date
- Continue from
last_date+ 1 day
Why?
- Avoids offset inefficiency
- Prevents skipped records
- Robust to changing API data
Time complexity: O(N)
- `raise_for_status()` validates HTTP responses
- Exceptions are handled per station
- Failed station does not crash the pipeline
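A minimal sketch of this pattern. Here `fetch_station` is a local stand-in for the real network fetch, with one station made to fail so the per-station error isolation is visible:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_station(code):
    # Stand-in for the real network fetch; one station simulates an HTTP error
    if code == "F700000110":
        raise RuntimeError("simulated HTTP error")
    return [(code, n) for n in range(3)]

stations = ["F700000109", "F700000110", "F700000102"]
results, failed = {}, []
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(fetch_station, s): s for s in stations}
    for fut in as_completed(futures):
        station = futures[fut]
        try:
            results[station] = fut.result()
        except Exception:          # a failed station never crashes the run
            failed.append(station)
```

Catching the exception per future means the two healthy stations still deliver their data while the failing one is simply recorded as failed.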
All station DataFrames are concatenated:

```python
pd.concat(all_data, ignore_index=True)
```

Duplicates across stations are detected using:

```python
df.duplicated(date_col)
```

Duplicates are counted before being dropped.
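On toy data (assuming `date_obs_elab` as the date key, as in the exported CSV), the count-then-drop step looks like:

```python
import pandas as pd

# Two stations whose series overlap on 2024-01-02
a = pd.DataFrame({"date_obs_elab": ["2024-01-01", "2024-01-02"],
                  "resultat_obs_elab": [1300.0, 1350.0]})
b = pd.DataFrame({"date_obs_elab": ["2024-01-02", "2024-01-03"],
                  "resultat_obs_elab": [1350.0, 1400.0]})

df = pd.concat([a, b], ignore_index=True)
dup_count = int(df.duplicated("date_obs_elab").sum())   # counted first...
df = df.drop_duplicates("date_obs_elab", keep="first")  # ...then dropped
```

`duplicated()` flags every repeat after the first occurrence, so the count matches exactly the rows that `drop_duplicates(keep="first")` removes.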
```python
pd.date_range(start, end, freq="D")
```

This reconstructs all expected calendar days.

```python
missing_days = full_range.difference(observed)
```

Example:
Input missing days:
Jan 1, Jan 2, Jan 3, Jan 10, Jan 11
Output intervals:
(Jan 1 - Jan 3)
(Jan 10 - Jan 11)
This is done via linear scanning.
Pseudo-code:

```
gap_start = previous_day = first_missing_day
for each current_day in remaining missing_days:
    if current_day - previous_day > 1 day:
        close interval (gap_start, previous_day)
        gap_start = current_day
    previous_day = current_day
close final interval (gap_start, previous_day)
```

This is:
- Single-pass
- Linear time
- Memory efficient
Time complexity: O(M), where M is the number of missing days.
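Putting the three steps together on the January example above (the dates are illustrative):

```python
import pandas as pd

# Observed days: Jan 4-9 and Jan 12 => missing Jan 1-3 and Jan 10-11
observed = pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-06",
                           "2024-01-07", "2024-01-08", "2024-01-09",
                           "2024-01-12"])
full_range = pd.date_range("2024-01-01", "2024-01-12", freq="D")
missing = full_range.difference(observed)      # sorted DatetimeIndex

# Single-pass, linear-time grouping into continuous intervals
intervals = []
if len(missing):
    gap_start = prev = missing[0]
    for day in missing[1:]:
        if (day - prev).days > 1:              # run broken: close the interval
            intervals.append((gap_start, prev))
            gap_start = day
        prev = day
    intervals.append((gap_start, prev))        # close the final interval
```

`intervals` then holds `(Jan 1, Jan 3)` and `(Jan 10, Jan 11)`, matching the example output.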
Warning
Memory usage grows linearly with dataset size.
- Consider chunked writing for >10M rows
- Consider Parquet for larger datasets
- Validate API schema before processing
- Implement a retry/backoff mechanism
- Timeout configuration per request
- Asynchronous I/O (aiohttp)
- Control concurrency with a Semaphore or a rate-limiter (token bucket)
- Multiprocessing vs threading
- Write CSV once at end
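For the retry/backoff item, a minimal sketch. The helper name is illustrative, and `do_request` is an injected callable so it works with any HTTP client, e.g. `lambda: session.get(url, params=params, timeout=30)`:

```python
import random
import time

def with_retry(do_request, retries=4, base_delay=0.5):
    """Retry a callable with exponential backoff plus jitter (illustrative)."""
    for attempt in range(retries):
        try:
            return do_request()
        except Exception:
            if attempt == retries - 1:
                raise                      # out of attempts: propagate
            # waits 0.5s, 1s, 2s, ... plus up to 100 ms of random jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The jitter spreads out retries from concurrent workers so they do not hammer the API in lockstep after a shared failure.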
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add …"`).
- Push and open a Pull Request.
This project is released into the public domain under the Unlicense. See the LICENSE file for details. Regarding the original data, be aware of the Hub'Eau platform usage rights.
Note
L’ensemble des données proposées dans le cadre des API sont des données publiques environnementales, déjà diffusées par ailleurs : elles sont donc librement utilisables et réutilisables, dans le cadre de la licence ouverte interministérielle.
English translation:
All data provided through the APIs are public environmental data, already published elsewhere: they are therefore freely usable and reusable under the Interministerial Open License (ETALAB).
♟