Usage

The idea of this package is simple: provide your data, select a model (or provide your own), and get predictions.

File Structure

The data_root directory should have the following structure:

- gauge_info/
  - gauge_info.csv
- discharge/
  - *.nc
- forcings/
  - lumped/
    - *_<basinId>.{rvt,txt,csv}

gauge_info.csv is a CSV file with at least a column 'Gauge_ID' and, optionally, further static basin attributes. If the file also contains a column 'Cal_Val' that marks calibration ('C') and validation ('V') basins, the method datautils.get_basin_list() can be used to select calibration or validation basins.
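For illustration, a minimal gauge_info.csv might look like this (the Gauge_ID and Cal_Val columns are the ones described above; the area column is a hypothetical static attribute):

```csv
Gauge_ID,area,Cal_Val
00AAA123,1250.3,C
00AAB234,87.5,V
00AAC567,412.0,C
```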

Currently, only NetCDF files for discharge and csv, txt, or rvt files for daily lumped forcings are supported. Each NetCDF discharge file contains the variables station_id(nstations), time(time), and Q(nstations, time). If multiple NetCDF files exist, their contents are merged and duplicate stations are ignored.
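The merging of multiple discharge files can be pictured with a small sketch. This is plain Python, independent of the package's actual reader, and the choice that the first file's copy of a duplicate station wins is an assumption made here for illustration:

```python
def merge_stations(files):
    """Merge per-file {station_id: discharge_series} dicts.

    Stations appearing in several files are kept only once
    (here: first occurrence wins -- an assumption for illustration).
    """
    merged = {}
    for stations in files:
        for station_id, q in stations.items():
            if station_id not in merged:  # ignore duplicate stations
                merged[station_id] = q
    return merged

# two hypothetical files with one overlapping station
file_a = {'00AAA123': [1.2, 1.3], '00AAB234': [0.4, 0.5]}
file_b = {'00AAB234': [9.9, 9.9], '00AAC567': [2.0, 2.1]}
merged = merge_stations([file_a, file_b])
```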

The csv and txt forcing files have a header row and are expected to be comma-separated; the first column must contain the dates in YYYY-MM-DD format. The rvt files are simple comma-and-space-separated CSV files as generated by Raven. An example rvt file looks like this:

:MultiData
2000-01-01 00:00:00 1.0 6210
:Parameters,                    PRECIP,                 TEMP_DAILY_AVE,                 TEMP_MIN,                       TEMP_MAX
:Units,                         mm/d,                           C,                              C,                              C
                        0.471473290000,                 -2.02099736340,                 -6.94564452600,                 3.092910019300,
...
:EndMultiData

There are four header lines and one footer line. The second header line starts with the start date, formatted as YYYY-MM-DD, and the third header line lists the column names as :Parameters, <col1>, <col2>, ..., <colN>.
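As a sketch of how such a file could be read, the following minimal parser follows the layout described above (four header lines, :EndMultiData footer). It is written for this example and is not the package's actual reader:

```python
def parse_rvt(lines):
    """Parse a Raven :MultiData rvt block into (start_date, columns, rows)."""
    # second header line: 'YYYY-MM-DD hh:mm:ss <interval> <n_values>'
    start_date = lines[1].split()[0]
    # third header line: ':Parameters, <col1>, <col2>, ...'
    columns = [c.strip() for c in lines[2].split(',')[1:] if c.strip()]
    rows = []
    for line in lines[4:]:  # skip the four header lines
        if line.strip().startswith(':EndMultiData'):  # footer
            break
        rows.append([float(v) for v in line.split(',') if v.strip()])
    return start_date, columns, rows

example = [
    ':MultiData',
    '2000-01-01 00:00:00 1.0 6210',
    ':Parameters, PRECIP, TEMP_DAILY_AVE, TEMP_MIN, TEMP_MAX',
    ':Units, mm/d, C, C, C',
    '0.47147329, -2.0209973634, -6.945644526, 3.0929100193,',
    ':EndMultiData',
]
start, cols, data = parse_rvt(example)
```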

Training

For training, create an Experiment with the path to the data folder, a run directory to store results, start and end dates, the training basins, the forcing and static basin attributes to use, and the input sequence length. You can also specify whether the target to predict can take negative values (allow_negative_target=True). For example, when you predict discharge, this should be False; but when you predict the error of another model's discharge prediction, it should be True, since the other model can be off in both directions. Then, set the model to use for training, and start training with exp.train():

from pathlib import Path

data_path = Path('./data')
run_dir = Path('./experiments')
# dates are formatted as DDMMYYYY
exp = Experiment(data_path, is_train=True, run_dir=run_dir,
                 start_date='01012000', end_date='31122015',
                 basins=['00AAA123', '00AAB234', '00AAC567'],
                 seq_length=100, concat_static=True,
                 static_attributes=['area', 'regulation'],
                 forcing_attributes=['precip', 'tmax', 'tmin'],
                 allow_negative_target=False)
exp.set_model(model)  # e.g., a pre-implemented model from mlstream.models
exp.train()

Any model class that implements the methods train(self, ds: LumpedH5) and predict(self, ds: LumpedBasin) can be used. Pre-implemented models can be found in mlstream.models.
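A custom model only needs these two methods. Below is a toy sketch of that interface; the ds.y and ds.x attributes are hypothetical placeholders for however the datasets expose targets and inputs, not the actual LumpedH5 / LumpedBasin API:

```python
class MeanModel:
    """Toy model implementing the train/predict interface mlstream expects.

    The ds.y / ds.x attributes below are hypothetical placeholders,
    not the real LumpedH5 / LumpedBasin API.
    """

    def train(self, ds):
        targets = list(ds.y)                 # hypothetical: training targets
        self.mean_ = sum(targets) / len(targets)

    def predict(self, ds):
        return [self.mean_] * len(ds.x)      # hypothetical: one value per sample


# tiny stand-in dataset for demonstration
class _DummyDS:
    y = [1.0, 3.0]
    x = [0, 1, 2]

model = MeanModel()
model.train(_DummyDS())
preds = model.predict(_DummyDS())
```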

Inference

To run inference after training, create a new Experiment with is_train=False, and provide the data path, the path to the run directory from training, the test basins, and start and end dates. There is no need to specify the sequence length, forcing and static attributes, or allow_negative_target again; these values are loaded from the configuration file in the run directory. Load and set the trained model (which was saved in the run directory during training), and run predictions with exp.predict(), which returns a DataFrame of predictions.

exp = Experiment(data_path, is_train=False, run_dir=run_dir,
                 basins=['01ABC123', '02DEF123'],
                 start_date='01012016', end_date='31122018')
model.load(run_dir / 'model.pkl')  # load the model saved during training
exp.set_model(model)
results = exp.predict()

To obtain NSE scores for each test basin, run exp.get_nses().
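The NSE (Nash-Sutcliffe efficiency) compares simulated to observed discharge: a perfect model scores 1, and a model that always predicts the observed mean scores 0. A reference sketch of the metric (not the package's implementation):

```python
def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

perfect = nse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # perfect prediction
baseline = nse([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])  # mean-only prediction
```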