Are the same train/test splits used when resuming training via CLI? #1869

jmarkow · 2024-07-16T15:26:20Z

We're training models via the CLI on our local HPC. I'm working through the codebase, but I'm wondering: if you do not explicitly set the training and validation indices in the config JSON, is the same train/test split used when you train, stop, and then resume by passing --base_checkpoint? We're running up against our HPC time limit for GPU nodes (72 hours), so we need to resume training a couple of times on our largest datasets (50k+ frames). On previous runs we did not explicitly set the train/validation indices, and we're wondering whether we need to retrain with them set.

Also, to clarify: I'm worried that frames from the training set of an earlier "run" could leak into the validation set after resuming, contaminating our validation metrics.

talmo · 2024-07-20T00:37:33Z · Maintainer

Hey @jmarkow,

Yep, good catch. The logic for selecting the training/validation splits is a bit of a mess because it has to handle all the cases that we support, and it gets more complicated when we load a base checkpoint, since we inherit a lot (but not all?) of the settings.

What I'd recommend to keep it totally clean would be to split up the labels into training and validation (and test?) as separate .pkg.slp files and manually write them into the training config.

I'd recommend doing this with our new standalone sleap-io package, which has explicit functionality for this. For example:

```python
import sleap_io as sio

# Load source labels.
labels = sio.load_file("labels.v001.slp")

# Make splits and export with embedded images.
labels.make_training_splits(n_train=0.8, n_val=0.1, n_test=0.1, save_dir="split1", seed=42)

# Splits will be saved as self-contained SLP package files with images and labels.
labels_train = sio.load_file("split1/train.pkg.slp")
labels_val = sio.load_file("split1/val.pkg.slp")
labels_test = sio.load_file("split1/test.pkg.slp")
```
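
For the remaining step, writing the split files into the training config, a minimal sketch might look like the following. It assumes the config JSON keeps label paths under `data.labels` with the keys `training_labels`, `validation_labels`, and `test_labels`; check those names against your own config file before relying on this. The helper names `set_split_paths` and `update_config_file` are hypothetical and not part of SLEAP or sleap-io.

```python
import copy
import json
from pathlib import Path


def set_split_paths(config: dict, split_dir: str) -> dict:
    """Return a copy of `config` that points at the split files in `split_dir`.

    Assumption: label paths live under config["data"]["labels"]. Verify this
    layout against your own training config.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    labels_cfg = updated.setdefault("data", {}).setdefault("labels", {})
    base = Path(split_dir)
    labels_cfg["training_labels"] = (base / "train.pkg.slp").as_posix()
    labels_cfg["validation_labels"] = (base / "val.pkg.slp").as_posix()
    labels_cfg["test_labels"] = (base / "test.pkg.slp").as_posix()
    return updated


def update_config_file(config_path: str, split_dir: str) -> None:
    """Read a config JSON, point it at the split files, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(set_split_paths(config, split_dir), indent=4))
```

Calling `update_config_file` once with your config path and `"split1"`, then reusing that same config for every resumed run, keeps the three sets fixed regardless of how the checkpoint settings are inherited.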

Let us know if that works for you!

Cheers,

Talmo

PS: If you observe that the training/validation indices are not being used appropriately (or are being used when they shouldn't be) in the former case, let us know so we can open a bug to track it.

0 replies

jmarkow · 2024-07-20T00:56:06Z · Author

@talmo Ah nice! I assumed this was potentially the case, so we just explicitly define the indices in the json file now. I'm guessing that's good enough? In the training logs I see indications that explicit indices are being used for training. We set training_inds and validation_inds to lists of integers and then set split_by_inds to True.

INFO:sleap.nn.training:Creating validation split from explicit indices (n = 10000).
INFO:sleap.nn.training:Creating training split from explicit indices (n = 90000).

I'll load the validation metrics file and ensure that the indices match what we specify in the json. Let me know if I'm still potentially missing something here and should still split out into separate files.
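
The explicit-index setup described above can be sketched roughly as follows: draw one seeded, non-overlapping split, write it into the config once, and reuse that file for every resumed run. The key names `training_inds`, `validation_inds`, and `split_by_inds` come from this thread; placing them under `data.labels` is an assumption about the config layout, and the helper names are hypothetical.

```python
import random


def make_disjoint_splits(n_frames: int, n_val: int, seed: int = 42):
    """Shuffle frame indices with a fixed seed and cut them into (train, val)."""
    if not 0 < n_val < n_frames:
        raise ValueError("n_val must be between 1 and n_frames - 1")
    inds = list(range(n_frames))
    random.Random(seed).shuffle(inds)  # same seed gives the same split
    return sorted(inds[n_val:]), sorted(inds[:n_val])


def apply_split_inds(config: dict, train_inds, val_inds) -> dict:
    """Write explicit indices into `config` after checking for overlap.

    Assumption: the index settings live under config["data"]["labels"].
    """
    overlap = set(train_inds) & set(val_inds)
    if overlap:
        raise ValueError(f"{len(overlap)} indices appear in both splits")
    labels_cfg = config.setdefault("data", {}).setdefault("labels", {})
    labels_cfg["training_inds"] = list(train_inds)
    labels_cfg["validation_inds"] = list(val_inds)
    labels_cfg["split_by_inds"] = True
    return config


# Sizes chosen to match the log lines above (90000 training, 10000 validation).
train_inds, val_inds = make_disjoint_splits(n_frames=100_000, n_val=10_000)
config = apply_split_inds({}, train_inds, val_inds)
```

The overlap check is the important part for the leakage concern: if any index ever ends up in both lists, the helper fails before the config is written.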

1 reply

talmo Jul 20, 2024
Maintainer

Perfect! You're golden then.

The only other advantage, besides having clean splits, is that loading the data from the .pkg.slp files might be faster than loading from videos, depending on your system. This is because reading from HDF5 and decoding the images should be faster than seeking and decoding video packets.

If you're using preloading, this difference is amortized in the long run though.

Keep us posted on what your 100k frame model does 😁 -- this might be the biggest SLEAP model training set we've seen!
