OOM issue when training centered_instance model #1889

ssfrz · 2024-07-31T14:14:01Z

I am training a top-down multi-animal model on a computing cluster. The centroid model trained without issues, but I run out of memory when training the centered_instance model. I've trained a number of top-down multi-animal models before, and this is the first time I've had this issue.
I have tried requesting additional resources, reducing the batch size, and reducing the input size, but none of these worked. I have not found a solution in existing issues or discussions on this repo.

I assume it's related to the reported shape of [15,9999680,9999680,1], but I don't understand why it would be that large when the centroid model's input shape was just (768, 960, 1). Data that large would obviously require a lot of memory, but I don't understand where the size comes from. Even if I reduce the input scaling to 0.5, the shape is still very large.
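
For scale, here is a rough back-of-the-envelope sketch of what a dense tensor of that shape would need. It assumes float32 elements (4 bytes each), which is an assumption about the dtype, and `tensor_bytes` is only an illustrative helper, not part of SLEAP or TensorFlow:

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape.

    bytes_per_element=4 assumes float32; this is an assumption, not
    something confirmed by the error message.
    """
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


# Shape reported in the centered_instance OOM error.
reported = tensor_bytes((15, 9999680, 9999680, 1))
# Input shape logged for the centroid model.
centroid = tensor_bytes((768, 960, 1))

print(f"reported shape: {reported / 2**50:.2f} PiB")
print(f"centroid input: {centroid / 2**20:.2f} MiB")
```

Under that assumption the reported shape works out to roughly 5.3 PiB, versus about 2.8 MiB for one centroid input, so no GPU or batch size change could make it fit; the shape itself looks wrong.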

Any advice would be appreciated, thanks!

I have tried the following GPUs:
NVIDIA GeForce RTX 2080 Ti
NVIDIA A100-SXM4-80G

Here's the output:

INFO:sleap.nn.training:Versions:
SLEAP: 1.3.3
TensorFlow: 2.8.4
Numpy: 1.21.6
Python: 3.7.16
OS: Linux-5.14.0-362.8.1.el9_3.x86_64-x86_64-with-redhat-9.3-Blue_Onyx
INFO:sleap.nn.training:Training labels file: 240730_combo_01.pkg.slp
INFO:sleap.nn.training:Training profile: centroid.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "centroid.json",
    "labels_path": "240730_combo_01.pkg.slp",
    "video_paths": [
        ""
    ],
    "val_labels": null,
    "test_labels": null,
    "base_checkpoint": null,
    "tensorboard": false,
    "save_viz": false,
    "zmq": false,
    "run_name": "240730_sca006",
    "prefix": "",
    "suffix": "",
    "cpu": false,
    "first_gpu": false,
    "last_gpu": false,
    "gpu": "auto"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.1,
            "test_labels": null,
            "split_by_inds": false,
            "training_inds": null,
            "validation_inds": null,
            "test_inds": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": true,
            "imagenet_mode": null,
            "input_scaling": 0.75,
            "pad_to_stride": null,
            "resize_and_pad_to_target": true,
            "target_height": null,
            "target_width": null
        },
        "instance_cropping": {
            "center_on_part": "SB_Ant",
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 16,
                "output_stride": 2,
                "filters": 16,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null,
            "pretrained_encoder": null
        },
        "heads": {
            "single_instance": null,
            "centroid": {
                "anchor_part": "SB_Ant",
                "sigma": 2.5,
                "output_stride": 2,
                "loss_weight": 1.0,
                "offset_refinement": false
            },
            "centered_instance": null,
            "multi_instance": null,
            "multi_class_bottomup": null,
            "multi_class_topdown": null
        },
        "base_checkpoint": null
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -180.0,
            "rotation_max_angle": 180.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0,
            "random_crop": false,
            "random_crop_height": 256,
            "random_crop_width": 256,
            "random_flip": true,
            "flip_horizontal": false
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 8,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 200,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-08,
            "plateau_patience": 20
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "240730_sca006",
        "run_name_prefix": "",
        "run_name_suffix": ".centroid",
        "runs_folder": "models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "delete_viz_images": true,
        "zip_outputs": false,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": false,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": false,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": false,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    },
    "name": "",
    "description": "",
    "sleap_version": "1.3.3",
    "filename": "centroid.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Auto-selected GPU 0 with 10824 MiB of free memory.
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initalized: False
     Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: 240730_combo_01.pkg.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training:  Splits: Training = 90 / Validation = 10.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2024-07-30 18:00:43.550298: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-30 18:00:45.477763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9469 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b1:00.0, compute capability: 7.5
INFO:sleap.nn.training:Loaded test example. [3.908s]
INFO:sleap.nn.training:  Input shape: (768, 960, 1)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 1,953,105
INFO:sleap.nn.training:  Heads: 
INFO:sleap.nn.training:    [0] = CentroidConfmapsHead(anchor_part='SB_Ant', sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training:  Outputs: 
INFO:sleap.nn.training:    [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 384, 480, 1), dtype=tf.float32, name=None), name='CentroidConfmapsHead/BiasAdd:0', description="created by layer 'CentroidConfmapsHead'")
INFO:sleap.nn.training:Training from scratch
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 90
INFO:sleap.nn.training:Validation set: n = 10
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=20)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.training:Created run path: models/240730_sca006.centroid
INFO:sleap.nn.training:Setting up visualization...
2024-07-30 18:00:49.571281: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -34 } dim { size: -35 } dim { size: -36 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } }
2024-07-30 18:00:50.628807: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -34 } dim { size: -35 } dim { size: -36 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } }
Unable to use Qt backend for matplotlib. This probably means Qt is running headless.
Unable to use Qt backend for matplotlib. This probably means Qt is running headless.
INFO:sleap.nn.training:Finished trainer set up. [7.2s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [4.2s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2024-07-30 18:00:59.099209: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2024-07-30 18:01:02.068413: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-07-30 18:01:07.392356: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.18GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2024-07-30 18:01:07.392418: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.18GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
200/200 - 135s - loss: 0.0016 - val_loss: 0.0015 - lr: 1.0000e-04 - 135s/epoch - 673ms/step
Epoch 2/200
200/200 - 122s - loss: 0.0015 - val_loss: 0.0015 - lr: 1.0000e-04 - 122s/epoch - 608ms/step
Epoch 3/200
200/200 - 131s - loss: 0.0011 - val_loss: 6.7788e-04 - lr: 1.0000e-04 - 131s/epoch - 657ms/step
Epoch 4/200
200/200 - 117s - loss: 6.0862e-04 - val_loss: 4.6321e-04 - lr: 1.0000e-04 - 117s/epoch - 586ms/step
Epoch 5/200
200/200 - 118s - loss: 4.0584e-04 - val_loss: 2.8125e-04 - lr: 1.0000e-04 - 118s/epoch - 592ms/step
Epoch 6/200
200/200 - 127s - loss: 2.5527e-04 - val_loss: 2.0094e-04 - lr: 1.0000e-04 - 127s/epoch - 637ms/step
Epoch 7/200
200/200 - 124s - loss: 1.9830e-04 - val_loss: 1.6575e-04 - lr: 1.0000e-04 - 124s/epoch - 618ms/step
Epoch 8/200
200/200 - 120s - loss: 1.7257e-04 - val_loss: 1.5084e-04 - lr: 1.0000e-04 - 120s/epoch - 598ms/step
Epoch 9/200
200/200 - 110s - loss: 1.5131e-04 - val_loss: 1.3636e-04 - lr: 1.0000e-04 - 110s/epoch - 548ms/step
Epoch 10/200
200/200 - 128s - loss: 1.3319e-04 - val_loss: 1.0477e-04 - lr: 1.0000e-04 - 128s/epoch - 640ms/step
Epoch 11/200
200/200 - 119s - loss: 1.2208e-04 - val_loss: 1.2188e-04 - lr: 1.0000e-04 - 119s/epoch - 593ms/step
Epoch 12/200
200/200 - 119s - loss: 1.1151e-04 - val_loss: 1.1364e-04 - lr: 1.0000e-04 - 119s/epoch - 597ms/step
Epoch 13/200
200/200 - 110s - loss: 9.9992e-05 - val_loss: 9.2142e-05 - lr: 1.0000e-04 - 110s/epoch - 548ms/step
Epoch 14/200
200/200 - 128s - loss: 9.5941e-05 - val_loss: 1.0030e-04 - lr: 1.0000e-04 - 128s/epoch - 638ms/step
Epoch 15/200
200/200 - 110s - loss: 9.2658e-05 - val_loss: 8.0204e-05 - lr: 1.0000e-04 - 110s/epoch - 548ms/step
Epoch 16/200
200/200 - 118s - loss: 8.6367e-05 - val_loss: 7.7824e-05 - lr: 1.0000e-04 - 118s/epoch - 588ms/step
Epoch 17/200
200/200 - 118s - loss: 8.3389e-05 - val_loss: 7.3165e-05 - lr: 1.0000e-04 - 118s/epoch - 589ms/step
Epoch 18/200
200/200 - 118s - loss: 8.2116e-05 - val_loss: 6.9677e-05 - lr: 1.0000e-04 - 118s/epoch - 591ms/step
Epoch 19/200
200/200 - 128s - loss: 7.6944e-05 - val_loss: 7.3438e-05 - lr: 1.0000e-04 - 128s/epoch - 641ms/step
Epoch 20/200
200/200 - 109s - loss: 7.6430e-05 - val_loss: 6.5531e-05 - lr: 1.0000e-04 - 109s/epoch - 544ms/step
Epoch 21/200
200/200 - 118s - loss: 7.2628e-05 - val_loss: 6.3437e-05 - lr: 1.0000e-04 - 118s/epoch - 591ms/step
Epoch 22/200
200/200 - 128s - loss: 7.1498e-05 - val_loss: 6.3361e-05 - lr: 1.0000e-04 - 128s/epoch - 642ms/step
Epoch 23/200
200/200 - 120s - loss: 6.9983e-05 - val_loss: 6.1618e-05 - lr: 1.0000e-04 - 120s/epoch - 598ms/step
Epoch 24/200
200/200 - 120s - loss: 6.6453e-05 - val_loss: 5.4330e-05 - lr: 1.0000e-04 - 120s/epoch - 598ms/step
Epoch 25/200
200/200 - 109s - loss: 6.5471e-05 - val_loss: 5.3826e-05 - lr: 1.0000e-04 - 109s/epoch - 547ms/step
Epoch 26/200
200/200 - 118s - loss: 6.5850e-05 - val_loss: 5.9029e-05 - lr: 1.0000e-04 - 118s/epoch - 590ms/step
Epoch 27/200
200/200 - 129s - loss: 6.2315e-05 - val_loss: 5.6383e-05 - lr: 1.0000e-04 - 129s/epoch - 645ms/step
Epoch 28/200
200/200 - 109s - loss: 6.2462e-05 - val_loss: 5.4796e-05 - lr: 1.0000e-04 - 109s/epoch - 546ms/step
Epoch 29/200

Epoch 29: ReduceLROnPlateau reducing learning rate to 4.999999873689376e-05.
200/200 - 118s - loss: 5.9896e-05 - val_loss: 6.5831e-05 - lr: 1.0000e-04 - 118s/epoch - 590ms/step
Epoch 30/200
200/200 - 129s - loss: 5.6717e-05 - val_loss: 4.8033e-05 - lr: 5.0000e-05 - 129s/epoch - 646ms/step
Epoch 31/200
200/200 - 110s - loss: 5.6833e-05 - val_loss: 4.7654e-05 - lr: 5.0000e-05 - 110s/epoch - 548ms/step
Epoch 32/200
200/200 - 126s - loss: 5.6081e-05 - val_loss: 4.9296e-05 - lr: 5.0000e-05 - 126s/epoch - 631ms/step
Epoch 33/200
200/200 - 119s - loss: 5.6601e-05 - val_loss: 5.0415e-05 - lr: 5.0000e-05 - 119s/epoch - 593ms/step
Epoch 34/200
200/200 - 106s - loss: 5.4624e-05 - val_loss: 4.7272e-05 - lr: 5.0000e-05 - 106s/epoch - 530ms/step
Epoch 35/200
200/200 - 113s - loss: 5.4388e-05 - val_loss: 4.8796e-05 - lr: 5.0000e-05 - 113s/epoch - 567ms/step
Epoch 36/200

Epoch 36: ReduceLROnPlateau reducing learning rate to 2.499999936844688e-05.
200/200 - 113s - loss: 5.3033e-05 - val_loss: 4.8231e-05 - lr: 5.0000e-05 - 113s/epoch - 566ms/step
Epoch 37/200
200/200 - 114s - loss: 5.2355e-05 - val_loss: 4.8545e-05 - lr: 2.5000e-05 - 114s/epoch - 572ms/step
Epoch 38/200
200/200 - 114s - loss: 5.1971e-05 - val_loss: 4.8139e-05 - lr: 2.5000e-05 - 114s/epoch - 570ms/step
Epoch 39/200
200/200 - 125s - loss: 5.2411e-05 - val_loss: 4.9553e-05 - lr: 2.5000e-05 - 125s/epoch - 627ms/step
Epoch 40/200
200/200 - 116s - loss: 5.2114e-05 - val_loss: 4.6942e-05 - lr: 2.5000e-05 - 116s/epoch - 578ms/step
Epoch 41/200
200/200 - 116s - loss: 5.2490e-05 - val_loss: 4.5093e-05 - lr: 2.5000e-05 - 116s/epoch - 578ms/step
Epoch 42/200
200/200 - 105s - loss: 5.0768e-05 - val_loss: 5.1244e-05 - lr: 2.5000e-05 - 105s/epoch - 523ms/step
Epoch 43/200
200/200 - 114s - loss: 5.0272e-05 - val_loss: 4.9921e-05 - lr: 2.5000e-05 - 114s/epoch - 572ms/step
Epoch 44/200
200/200 - 114s - loss: 4.9402e-05 - val_loss: 4.4952e-05 - lr: 2.5000e-05 - 114s/epoch - 569ms/step
Epoch 45/200
200/200 - 115s - loss: 4.9057e-05 - val_loss: 4.6840e-05 - lr: 2.5000e-05 - 115s/epoch - 574ms/step
Epoch 46/200

Epoch 46: ReduceLROnPlateau reducing learning rate to 1.249999968422344e-05.
200/200 - 125s - loss: 5.0362e-05 - val_loss: 4.4250e-05 - lr: 2.5000e-05 - 125s/epoch - 624ms/step
Epoch 47/200
200/200 - 115s - loss: 4.9411e-05 - val_loss: 4.7635e-05 - lr: 1.2500e-05 - 115s/epoch - 575ms/step
Epoch 48/200
200/200 - 105s - loss: 4.7379e-05 - val_loss: 4.4865e-05 - lr: 1.2500e-05 - 105s/epoch - 524ms/step
Epoch 49/200
200/200 - 115s - loss: 4.9230e-05 - val_loss: 4.2179e-05 - lr: 1.2500e-05 - 115s/epoch - 574ms/step
Epoch 50/200
200/200 - 124s - loss: 4.8932e-05 - val_loss: 4.8290e-05 - lr: 1.2500e-05 - 124s/epoch - 622ms/step
Epoch 51/200
200/200 - 115s - loss: 4.8875e-05 - val_loss: 4.8321e-05 - lr: 1.2500e-05 - 115s/epoch - 575ms/step
Epoch 52/200
200/200 - 106s - loss: 4.9661e-05 - val_loss: 4.9710e-05 - lr: 1.2500e-05 - 106s/epoch - 529ms/step
Epoch 53/200
200/200 - 114s - loss: 4.8525e-05 - val_loss: 4.6438e-05 - lr: 1.2500e-05 - 114s/epoch - 570ms/step
Epoch 54/200

Epoch 54: ReduceLROnPlateau reducing learning rate to 6.24999984211172e-06.
200/200 - 114s - loss: 4.6240e-05 - val_loss: 4.7178e-05 - lr: 1.2500e-05 - 114s/epoch - 569ms/step
Epoch 55/200
200/200 - 115s - loss: 4.8187e-05 - val_loss: 4.3873e-05 - lr: 6.2500e-06 - 115s/epoch - 574ms/step
Epoch 56/200
200/200 - 115s - loss: 4.7370e-05 - val_loss: 4.2925e-05 - lr: 6.2500e-06 - 115s/epoch - 573ms/step
Epoch 57/200
200/200 - 115s - loss: 4.7761e-05 - val_loss: 4.8665e-05 - lr: 6.2500e-06 - 115s/epoch - 575ms/step
Epoch 58/200
200/200 - 114s - loss: 4.7622e-05 - val_loss: 4.5136e-05 - lr: 6.2500e-06 - 114s/epoch - 568ms/step
Epoch 59/200
200/200 - 113s - loss: 4.8361e-05 - val_loss: 4.7865e-05 - lr: 6.2500e-06 - 113s/epoch - 567ms/step
Epoch 60/200
200/200 - 113s - loss: 4.7528e-05 - val_loss: 4.6166e-05 - lr: 6.2500e-06 - 113s/epoch - 565ms/step
Epoch 61/200

Epoch 61: ReduceLROnPlateau reducing learning rate to 3.12499992105586e-06.
200/200 - 114s - loss: 4.7565e-05 - val_loss: 4.8376e-05 - lr: 6.2500e-06 - 114s/epoch - 568ms/step
Epoch 62/200
200/200 - 113s - loss: 4.8727e-05 - val_loss: 4.7061e-05 - lr: 3.1250e-06 - 113s/epoch - 564ms/step
Epoch 63/200
200/200 - 124s - loss: 4.4835e-05 - val_loss: 4.4468e-05 - lr: 3.1250e-06 - 124s/epoch - 621ms/step
Epoch 64/200
200/200 - 105s - loss: 4.6978e-05 - val_loss: 4.7338e-05 - lr: 3.1250e-06 - 105s/epoch - 526ms/step
Epoch 65/200
200/200 - 114s - loss: 4.7214e-05 - val_loss: 4.5294e-05 - lr: 3.1250e-06 - 114s/epoch - 572ms/step
Epoch 66/200
200/200 - 113s - loss: 4.7359e-05 - val_loss: 4.6547e-05 - lr: 3.1250e-06 - 113s/epoch - 567ms/step
Epoch 67/200
200/200 - 115s - loss: 4.7074e-05 - val_loss: 4.8392e-05 - lr: 3.1250e-06 - 115s/epoch - 574ms/step
Epoch 68/200

Epoch 68: ReduceLROnPlateau reducing learning rate to 1.56249996052793e-06.
200/200 - 114s - loss: 4.6401e-05 - val_loss: 4.8502e-05 - lr: 3.1250e-06 - 114s/epoch - 569ms/step
Epoch 69/200
200/200 - 114s - loss: 4.7373e-05 - val_loss: 4.4641e-05 - lr: 1.5625e-06 - 114s/epoch - 572ms/step
Epoch 69: early stopping
INFO:sleap.nn.training:Finished training loop. [134.5 min]
INFO:sleap.nn.training:Deleting visualization directory: models/240730_sca006.centroid/viz
INFO:sleap.nn.training:Saving evaluation metrics to model folder...
2024-07-30 20:15:27.969793: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } }
2024-07-30 20:15:27.970164: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 4 } dim { size: 1024 } dim { size: 1280 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 2 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -81 } dim { size: -82 } dim { size: 1 } } }
2024-07-30 20:15:32.305885: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } }
2024-07-30 20:15:32.306235: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 2 } dim { size: 1024 } dim { size: 1280 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 2 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -81 } dim { size: -82 } dim { size: 1 } } }
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 21.2 FPS
INFO:sleap.nn.evals:Saved predictions: models/240730_sca006.centroid/labels_pr.train.slp
INFO:sleap.nn.evals:Saved metrics: models/240730_sca006.centroid/metrics.train.npz
INFO:sleap.nn.evals:OKS mAP: 0.980130
2024-07-30 20:15:36.493087: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } }
2024-07-30 20:15:36.493436: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 4 } dim { size: 1024 } dim { size: 1280 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 2 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -81 } dim { size: -82 } dim { size: 1 } } }
2024-07-30 20:15:37.940149: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 2080 Ti" frequency: 1545 num_cores: 68 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 5767168 shared_memory_size_per_multiprocessor: 65536 memory_size: 9929097216 bandwidth: 616000000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } }
2024-07-30 20:15:37.940513: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 2 } dim { size: 1024 } dim { size: 1280 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 2 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -81 } dim { size: -82 } dim { size: 1 } } }
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 4.2 FPS
INFO:sleap.nn.evals:Saved predictions: models/240730_sca006.centroid/labels_pr.val.slp
INFO:sleap.nn.evals:Saved metrics: models/240730_sca006.centroid/metrics.val.npz
INFO:sleap.nn.evals:OKS mAP: 0.980198
INFO:sleap.nn.training:Versions:
SLEAP: 1.3.3
TensorFlow: 2.8.4
Numpy: 1.21.6
Python: 3.7.16
OS: Linux-5.14.0-362.8.1.el9_3.x86_64-x86_64-with-redhat-9.3-Blue_Onyx
INFO:sleap.nn.training:Training labels file: 240730_combo_01.pkg.slp
INFO:sleap.nn.training:Training profile: centered_instance.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "centered_instance.json",
    "labels_path": "240730_combo_01.pkg.slp",
    "video_paths": [
        ""
    ],
    "val_labels": null,
    "test_labels": null,
    "base_checkpoint": null,
    "tensorboard": false,
    "save_viz": false,
    "zmq": false,
    "run_name": "240730_sca006",
    "prefix": "",
    "suffix": "",
    "cpu": false,
    "first_gpu": false,
    "last_gpu": false,
    "gpu": "auto"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.1,
            "test_labels": null,
            "split_by_inds": false,
            "training_inds": null,
            "validation_inds": null,
            "test_inds": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": true,
            "imagenet_mode": null,
            "input_scaling": 1.0,
            "pad_to_stride": null,
            "resize_and_pad_to_target": true,
            "target_height": null,
            "target_width": null
        },
        "instance_cropping": {
            "center_on_part": "SB_Ant",
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 16,
                "output_stride": 4,
                "filters": 24,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null,
            "pretrained_encoder": null
        },
        "heads": {
            "single_instance": null,
            "centroid": null,
            "centered_instance": {
                "anchor_part": "SB_Ant",
                "part_names": null,
                "sigma": 2.5,
                "output_stride": 4,
                "loss_weight": 1.0,
                "offset_refinement": false
            },
            "multi_instance": null,
            "multi_class_bottomup": null,
            "multi_class_topdown": null
        },
        "base_checkpoint": null
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -180.0,
            "rotation_max_angle": 180.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0,
            "random_crop": false,
            "random_crop_height": 256,
            "random_crop_width": 256,
            "random_flip": true,
            "flip_horizontal": false
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 8,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 200,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-08,
            "plateau_patience": 10
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "240730_sca006",
        "run_name_prefix": "",
        "run_name_suffix": ".centered_instance",
        "runs_folder": "models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "delete_viz_images": true,
        "zip_outputs": false,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": false,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": false,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": false,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    },
    "name": "",
    "description": "",
    "sleap_version": "1.3.3",
    "filename": "centered_instance.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Auto-selected GPU 0 with 10824 MiB of free memory.
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initalized: False
     Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: 240730_combo_01.pkg.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training:  Splits: Training = 90 / Validation = 10.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2024-07-30 20:16:43.564187: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-30 20:16:44.240409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9469 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b1:00.0, compute capability: 7.5
2024-07-30 20:16:46.318988: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 1024 } dim { size: 1280 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 2 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 9999680 } dim { size: 9999680 } dim { size: 1 } } }
2024-07-30 20:16:46.386553: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 5999616006144000 exceeds 10% of free system memory.
2024-07-30 20:16:46.387184: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at crop_and_resize_op.cc:181 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[15,9999680,9999680,1] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2024-07-30 20:16:46.393067: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 5999616006144000 exceeds 10% of free system memory.
2024-07-30 20:16:46.393129: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at crop_and_resize_op.cc:181 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[15,9999680,9999680,1] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Traceback (most recent call last):
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/bin/sleap-train", line 8, in <module>
    sys.exit(main())
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/sleap/nn/training.py", line 2014, in main
    trainer.train()
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/sleap/nn/training.py", line 924, in train
    self.setup()
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/sleap/nn/training.py", line 910, in setup
    self._setup_model()
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/sleap/nn/training.py", line 727, in _setup_model
    base_example = next(iter(base_pipeline.make_dataset()))
  File "/home/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 836, in __next__
    return self._next_internal()
  File "/home/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 822, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2923, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/groups/s/home/ssfrz/miniconda3/envs/sleap_v1/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[15,9999680,9999680,1] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[{{node CropAndResize}}]] [Op:IteratorGetNext]

roomrys · 2024-08-02T17:51:24Z

roomrys
Aug 2, 2024
Maintainer

Hi @ssfrz,

I have been staring at your Traceback for a bit too long now and am a bit stumped - it might be past time to debug with an example. Would you be able to share your data with me (or a minimal example) through this user upload form? If you are able to share the data, please reply here to notify me (the form will not tell me when you have uploaded, but I will be keeping an eye on it).

Apologies for the wait,
Liezl

1 reply

ssfrz Aug 2, 2024
Author

Hi @roomrys,

I uploaded the data - thanks!

roomrys · 2024-08-07T23:20:32Z

roomrys
Aug 7, 2024
Maintainer

Hi @ssfrz,

I have an update!

So, your default crop size was set to Auto - which is great - SLEAP will automatically find the maximum instance width for you and use that for the crop size.

I was debugging your data and found the code where the crop size was changed from None (an auto crop size) to 9999195:

sleap/sleap/nn/data/instance_cropping.py

Lines 44 to 53 in 076f3dd

    
    max_length = 0.0
    for inst in labels.user_instances:
        pts = inst.points_array
        pts *= input_scaling
        max_length = np.maximum(max_length, np.nanmax(pts[:, 0]) - np.nanmin(pts[:, 0]))
        max_length = np.maximum(max_length, np.nanmax(pts[:, 1]) - np.nanmin(pts[:, 1]))
        max_length = np.maximum(max_length, min_crop_size_no_pad)

    max_length += float(padding)
    crop_size = np.math.ceil(max_length / float(maximum_stride)) * maximum_stride
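
As a rough check on why a crop size of this magnitude exhausts memory, the arithmetic below reproduces the allocation reported in the log. It takes the tensor shape [15, 9999680, 9999680, 1] from the error message and assumes 4 bytes per float32 element; the helper name `tensor_bytes` and the 48-pixel comparison crop are only for illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


# Shape from the RESOURCE_EXHAUSTED message: 15 crops of 9999680 x 9999680 x 1.
oom_bytes = tensor_bytes([15, 9999680, 9999680, 1])
print(oom_bytes)          # 5999616006144000, the allocation size in the log
print(oom_bytes / 2**50)  # the same size expressed in PiB

# For comparison, 15 crops at a hypothetical 48-pixel crop size.
print(tensor_bytes([15, 48, 48, 1]))  # 138240
```

Because the allocation depends on the crop size squared, requesting more memory or reducing the batch size cannot help while the crop size stays in the millions of pixels.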

Looking at the Instance data, I found:

inst.points_array
rec.array([[ 8.359160e+02,  2.046660e+02],
          [ 8.410480e+02,  2.021470e+02],
          [ 8.413280e+02,  2.083980e+02],
          [ 8.451540e+02,  2.128770e+02],
          [ 8.490730e+02,  2.184750e+02],
          [ 1.000000e+07, -9.998976e+06]],
         dtype=float64)
inst.frame
LabeledFrame(video=HDF5Video('C:\Users\TalmoLab\Downloads\ssfrz\240730_combo_01.pkg.slp'), frame_idx=2, instances=15)
inst.video
Video(backend=HDF5Video(filename='C:\\Users\\TalmoLab\\Downloads\\ssfrz\\240730_combo_01.pkg.slp', dataset='video2/video', input_format='channels_last', convert_range=False))
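
To make the effect of that last row concrete, here is a small pure-Python sketch of the same calculation applied to the points above. It is not SLEAP's code: `auto_crop_size` is a hypothetical stand-in, the padding of 16 and maximum stride of 16 are assumed to correspond to `crop_size_detection_padding` and `max_stride` in the training config from the log, and `min_crop_size_no_pad` is assumed to be 0.

```python
import math


def auto_crop_size(instances, padding=16, maximum_stride=16,
                   input_scaling=1.0, min_crop_size_no_pad=0.0):
    """Mimic the auto crop size logic on lists of (x, y) points, ignoring NaNs."""
    max_length = 0.0
    for pts in instances:
        xs = [x * input_scaling for x, _ in pts if not math.isnan(x)]
        ys = [y * input_scaling for _, y in pts if not math.isnan(y)]
        if xs:
            max_length = max(max_length, max(xs) - min(xs))
        if ys:
            max_length = max(max_length, max(ys) - min(ys))
        max_length = max(max_length, min_crop_size_no_pad)
    max_length += float(padding)
    # Round up to the next multiple of the maximum stride.
    return math.ceil(max_length / float(maximum_stride)) * maximum_stride


# Points copied from the instance above; the last row is the outlier.
bad = [(835.916, 204.666), (841.048, 202.147), (841.328, 208.398),
       (845.154, 212.877), (849.073, 218.475), (1.0e7, -9998976.0)]
# Same instance with the outlier marked as missing (NaN) instead.
clean = bad[:-1] + [(float("nan"), float("nan"))]

print(auto_crop_size([bad]))    # 9999216
print(auto_crop_size([clean]))  # 48
```

The value for this one instance is slightly smaller than the 9999680 in the log, so presumably another of the affected instances spans a little more. With the outlier treated as missing, the same instance needs a crop of only 48 pixels.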

Then, when pulling up the frame in the GUI, at first I didn't see anything, but when I added edges to the data, I saw this (background intentionally set to black):

The fix here would be to delete and relabel these three Instances or (less favorably) to set the crop size to a guesstimated pixel amount (maybe 50?).
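
One way to locate such instances before relabeling is to scan for points that fall far outside the image. This is a minimal sketch, assuming each instance's points are available as (x, y) pairs (for example from `inst.points_array`) and that frames are 1280 x 1024, as the CropAndResize input shape in the log suggests. The function `find_out_of_frame`, its `margin` parameter, and the second example instance are hypothetical.

```python
import math


def find_out_of_frame(instances, width, height, margin=0.0):
    """Return (instance_index, point_index) pairs for points outside the frame."""
    hits = []
    for i, pts in enumerate(instances):
        for j, (x, y) in enumerate(pts):
            # NaN marks an unlabeled node and is not an error.
            if math.isnan(x) or math.isnan(y):
                continue
            inside_x = -margin <= x <= width + margin
            inside_y = -margin <= y <= height + margin
            if not (inside_x and inside_y):
                hits.append((i, j))
    return hits


instances = [
    # A shortened copy of the affected instance, ending with the outlier.
    [(835.916, 204.666), (849.073, 218.475), (1.0e7, -9998976.0)],
    # A made-up healthy instance with one unlabeled (NaN) node.
    [(100.0, 100.0), (float("nan"), float("nan")), (120.0, 95.0)],
]

print(find_out_of_frame(instances, width=1280, height=1024))  # [(0, 2)]
```

A nonzero `margin` leaves room for nodes that were legitimately labeled just past the frame edge.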

It is always SB_Post that is given an extremely large value. I am sure that those points were not user-labeled (that far out of frame) and it was likely something SLEAP initialized for you. To find the root cause, I am wondering if you did anything special for the SB_Post node?

Also, heads-up that the first frame in video 7 is a bogus label (labeling nothing/the background).

Thanks,
Liezl

2 replies

ssfrz Aug 8, 2024
Author

Hi @roomrys

Yes, this fixed it. These labels were imported from another format which had unlabeled nodes as 1E7. I adjusted accordingly to prevent this from happening again. If you could just delete the data that I uploaded in the form, that would be great.

Thanks for your help!

roomrys Aug 8, 2024
Maintainer

@ssfrz,

No problem, your data has been deleted from our drive - thanks for letting us troubleshoot on it.
