Usage
The idea of this package is simple: Just provide your data, select a model (or provide your own), and get the predictions.
File Structure
The data_root directory should have the following structure:

- gauge_info/
  - gauge_info.csv
- discharge/
  - *.nc
- forcings/
  - lumped/
    - *_<basinId>.{rvt,txt,csv}
gauge_info.csv is a csv-file with at least a column ‘Gauge_ID’ and possibly further static basin attributes. If the file further contains a column ‘Cal_Val’ indicating calibration (‘C’) and validation (‘V’) basins, the method datautils.get_basin_list() can be used to select calibration or validation basins.
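The selection logic behind datautils.get_basin_list() can be mirrored with the standard library. The file contents below are hypothetical (the ‘area’ column is purely illustrative); only ‘Gauge_ID’ and ‘Cal_Val’ come from the description above:

```python
import csv
import io

# Hypothetical gauge_info.csv contents.  Only Gauge_ID (and optionally
# Cal_Val) are required; other columns are static basin attributes.
gauge_info = """Gauge_ID,area,Cal_Val
00AAA123,512.3,C
00AAB234,87.9,V
00AAC567,1204.0,C
"""

# Select calibration basins by hand, mirroring what
# datautils.get_basin_list() does with the Cal_Val column.
rows = csv.DictReader(io.StringIO(gauge_info))
cal_basins = [r['Gauge_ID'] for r in rows if r['Cal_Val'] == 'C']
print(cal_basins)  # ['00AAA123', '00AAC567']
```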
Currently, only NetCDF files for discharge and csv, txt, or rvt files for daily lumped forcings are supported.
The NetCDF discharge files have the variables station_id(nstations), time(time), and Q(nstations, time). If multiple NetCDF files exist, their contents are merged and duplicate stations are ignored.
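The merge behavior can be sketched with plain dictionaries keyed by station id; which file wins on a duplicate station is an assumption here, not documented behavior:

```python
# Sketch of the merge semantics: discharge series keyed by station id,
# with duplicate stations from later files ignored (which file takes
# precedence is an assumption of this sketch).
file_a = {'00AAA123': [1.2, 1.3], '00AAB234': [0.4, 0.5]}
file_b = {'00AAB234': [9.9, 9.9], '00AAC567': [2.0, 2.1]}

merged = dict(file_a)
for station, series in file_b.items():
    merged.setdefault(station, series)  # keep the first occurrence

print(sorted(merged))  # ['00AAA123', '00AAB234', '00AAC567']
```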
The csv or txt forcing files have a header row and are expected to be comma-separated. The first column needs to contain the dates as YYYY-MM-DD.
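A hypothetical forcing file in this layout, parsed with the standard library (column names other than the date column are illustrative):

```python
import csv
import io
from datetime import date

# Hypothetical forcing file: a header row, comma-separated values,
# and dates as YYYY-MM-DD in the first column.
forcing_csv = """date,precip,tmax,tmin
2000-01-01,0.47,3.09,-6.95
2000-01-02,1.20,2.10,-5.40
"""

reader = csv.reader(io.StringIO(forcing_csv))
header = next(reader)
# Each row becomes (date, [forcing values]).
rows = [(date.fromisoformat(r[0]), [float(v) for v in r[1:]])
        for r in reader]
print(rows[0])
```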
The rvt-files are simple comma-space-separated csv-files as generated by Raven.
An example rvt file looks like this:
:MultiData
2000-01-01 00:00:00 1.0 6210
:Parameters, PRECIP, TEMP_DAILY_AVE, TEMP_MIN, TEMP_MAX
:Units, mm/d, C, C, C
0.471473290000, -2.02099736340, -6.94564452600, 3.092910019300,
...
:EndMultiData
There are four header lines and one footer line; the second header line starts with the start date formatted as YYYY-MM-DD. The third header line contains the column names as :Parameters, <col1>, <col2>, ..., <colN>.
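Putting the format description together, the :MultiData block above can be parsed with a few lines of standard-library Python. Reading the trailing ‘1.0 6210’ tokens of the second header line as time step and record count is an assumption, since only the leading start date is documented above:

```python
from datetime import datetime

# The example :MultiData block from above.
rvt_text = """:MultiData
2000-01-01 00:00:00 1.0 6210
:Parameters, PRECIP, TEMP_DAILY_AVE, TEMP_MIN, TEMP_MAX
:Units, mm/d, C, C, C
0.471473290000, -2.02099736340, -6.94564452600, 3.092910019300,
:EndMultiData
"""

lines = rvt_text.strip().splitlines()
# Second header line: starts with the start date; the remaining tokens
# are assumed to be the time step and the number of records.
tokens = lines[1].split()
start = datetime.strptime(' '.join(tokens[:2]), '%Y-%m-%d %H:%M:%S')
# Third header line: column names after the ':Parameters' keyword.
columns = [c.strip() for c in lines[2].split(',')[1:]]
# Data rows sit between the four header lines and the ':EndMultiData'
# footer; trailing commas leave empty fields, which we skip.
data = [[float(v) for v in row.split(',') if v.strip()]
        for row in lines[4:-1]]
print(start, columns, data[0])
```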
Training
For training, create an Experiment with the path to the data folder, specify a run directory to store results, start and end date, the training basins, forcing and static basin attributes, and input sequence length. Further, you can specify whether the target to predict can take negative values (allow_negative_target=True). E.g., when you predict discharge, this should be False, but when you predict the error of another model’s discharge prediction, it should be True (since the other model can be off in both directions). Then, set the model to use for training and start training with exp.train():
from pathlib import Path

data_path = Path('./data')
run_dir = Path('./experiments')
exp = Experiment(data_path, is_train=True, run_dir=run_dir,
start_date='01012000', end_date='31122015',
basins=['00AAA123', '00AAB234', '00AAC567'],
seq_length=100, concat_static=True,
static_attributes=['area', 'regulation'],
forcing_attributes=['precip', 'tmax', 'tmin'],
allow_negative_target=False)
exp.set_model(model)
exp.train()
Any model class that implements the methods train(self, ds: LumpedH5) and predict(self, ds: LumpedBasin) can be used. Pre-implemented models can be found in mlstream.models.
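A minimal sketch of a compatible model, assuming only the duck-typed interface named above. How the LumpedH5/LumpedBasin datasets are actually iterated is not shown here, so the (inputs, target) iteration below is a placeholder assumption:

```python
class MeanModel:
    """Trivial baseline: predicts the mean training target for every sample.

    Assumes (hypothetically) that the training dataset yields
    (inputs, target) pairs when iterated.
    """

    def __init__(self):
        self.mean = None

    def train(self, ds):
        targets = [y for _, y in ds]  # placeholder iteration, see lead-in
        self.mean = sum(targets) / len(targets)

    def predict(self, ds):
        return [self.mean for _ in ds]
```

Such a model would be passed to the experiment like any other: exp.set_model(MeanModel()) followed by exp.train().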
Inference
To run inference after training, create a new Experiment with is_train=False, provide the data path, the path to the run directory from training, the test basins, and start and end date. There is no need to specify sequence length, forcing and static attributes, or allow_negative_target again; instead, these values are loaded from the configuration file in the run directory. Load and set the trained model (which was saved in the run directory during training), and run predictions with exp.predict(), which will return a DataFrame of predictions.
exp = Experiment(data_path, is_train=False, run_dir=run_dir,
basins=['01ABC123', '02DEF123'],
start_date='01012016', end_date='31122018')
model.load(run_dir / 'model.pkl')
exp.set_model(model)
results = exp.predict()
To obtain NSE scores for each test basin, run exp.get_nses().
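get_nses() presumably computes, per basin, the Nash–Sutcliffe efficiency, which compares the squared prediction error to the variance of the observations. A self-contained sketch of that formula:

```python
def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1.0 is a perfect fit; 0.0 means the
    simulation is no better than predicting the mean of the observations."""
    mean_obs = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

print(nse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 for a perfect prediction
```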