Real-Time Segmentation on Video #1973

rahulghosh2 · 2024-02-23T19:09:25Z

rahulghosh2
Feb 23, 2024

Hello nnUNet Community.

I have trained an 2D nnUNet on my dataset and now I want to implement it into an OpenCV pipeline that does real-time segmentation on video frame-by-frame and outputs the result. I know how to use the "nnUNetV2_predict" command; however, it seems that saving the frame in an input folder, specifying an output folder, and then sourcing the segmentation mask from the output folder and then overlaying seems very convoluted. Is there a better way to run this kind of inference?

I don't have much coding background and this is my first time trying to implement something like this, so I apologize if the question is very basic. Any advice on how to best achieve this with nnUNet inference or where to look to get started? Thank you in advance!

Answered by ancestor-mithril

Mar 2, 2024

You can directly run and time the run_case_npy and see if using it gives you the same result as predict_from_files.
Among other things, run_case_npy executes cropping, normalization and resampling. As far as I know, resampling takes a lot of time, but you need to measure for your case.
One way to optimize the code is to implement the numpy operations inplace for normalization, and cropping could be reimplemented like this: https://github.com/ancestor-mithril/nnUNet/blob/7b53480b2a16dd4dd05a6e02b1797e15f456dcc7/nnunetv2/preprocessing/cropping/cropping.py#L23.

But once again, you should measure which steps take the most time in your case and try to optimize those.

View full answer

rahulghosh2 · 2024-02-29T17:21:58Z

rahulghosh2
Feb 29, 2024
Author

Hello, here is an update on what I have tried. Based on the inference documentation, the command-line command has to re-initialize the network and weights with every prediction request, so I thought it would be better to instantiate the predictor as an object which can be called on the fly to avoid this step. I am using a Windows machine so I could not directly run the "predict_from_files" without encapsuating in `"if name == 'main': ", due to multiprocessing on Windows.

However, the predict_from_files is extremely slow, when I run this code, which takes about 15 seconds. My CUDA_VISIBLE_DEVICES=1 is an NVIDIA RTX A4500.

I am using the following version (this is the output of "git describe --tags" on my nnUNet directory)
v2.1-46-g45a47ec

print("Instantiating predictor...")
start_time = time.time() 
predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=True,
    perform_everything_on_gpu=True,
    device=torch.device('cuda', 1),  # Ensure this is the correct device
    verbose=False,
    verbose_preprocessing=False,
    allow_tqdm=True
)
print("Created instance of nnUNetPredictor")
end_time = time.time()
print(f"Time to create nnUNetPredictor instance: {end_time - start_time:.2f} seconds")

start_time_weights = time.time() 
nnUNet_results = r'C:\Users\msturxg7\project\nnUNet_results'
weights_dir = r'Dataset018_VOR\nnUNetTrainer__nnUNetPlans__2d'
full_path = os.path.join(nnUNet_results, weights_dir)
print("Loading weights...")
predictor.initialize_from_trained_model_folder(
    full_path,
    use_folds=(0,),
    checkpoint_name='checkpoint_final.pth',
)
end_time_weights = time.time() 
print("Loaded weight into instantiated predictor.")
print(f"Time to load weights into instantiated predictor: {end_time_weights - start_time_weights:.2f} seconds")

def run_nnunet_predict(input_dir, output_dir, predictor):
    predictor.predict_from_files(
        input_dir,
        output_dir,
        save_probabilities=False,
        overwrite=True,
        num_processes_preprocessing=2,
        num_processes_segmentation_export=2,
        folder_with_segs_from_prev_stage=None,
        num_parts=1,
        part_id=0
    )

start_time_prediction = time.time()
if __name__ == '__main__':
    input_dir = r'C:\Users\msturxg7\project\inference\imagesVessel'
    output_dir = r'C:\Users\msturxg7\project\inference\predictionsVessel'
    run_nnunet_predict(input_dir, output_dir, predictor)
print("Completed prediction.")
end_time_prediction = time.time()  # Capture end time after prediction
print(f"Time to complete prediction: {end_time_prediction - start_time_prediction:.2f} seconds")

However, the predict_from_files is unacceptably slow. When I run this code, which takes ~15 seconds to do a single inference in total. My CUDA_VISIBLE_DEVICES=1 is an NVIDIA RTX A4500. While the initialization of the network prior to invoking "predict_from_files" seems to be ~1 second, which I can accept, what is contributing to the slowness is what appears to be the entire re-initialization being called multiple times after "predict_from_files",. which accumulates, as evidenced by the output below of the above code.

Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.85 seconds
There are 1 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 1 cases that I would like to predict
Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.87 seconds
Completed prediction.
Time to complete prediction: 0.00 seconds
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.87 seconds
Completed prediction.
Time to complete prediction: 0.00 seconds
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.87 seconds
Completed prediction.
Time to complete prediction: 0.00 seconds
Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.83 seconds
Completed prediction.
Time to complete prediction: 0.00 seconds
Instantiating predictor...
Created instance of nnUNetPredictor
Time to create nnUNetPredictor instance: 0.00 seconds
Loading weights...
Loaded weight into instantiated predictor.
Time to load weights into instantiated predictor: 0.83 seconds
Completed prediction.
Time to complete prediction: 0.00 seconds

Predicting Dataset018_VOR_001:
perform_everything_on_gpu: True
100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.07it/s]
Prediction done, transferring to CPU if needed
sending off prediction to background worker for resampling and export
done with Dataset018_VOR_001
Completed prediction.
Time to complete prediction: 15.25 seconds

Any suggestions on what I can do to address this problem and improve the overall speed of inference? I am willing to sacrifice the quality of segmentation a bit, so are there processing areas I could possibly cut out to make this run faster? I am really stuck on this and any advice is appreciated. Thank you!

6 replies

ancestor-mithril Mar 1, 2024

Hello. The first thing you could do is to guard all instructions by if __name__ == '__main__'.
nnUNet by default uses multiple processes for preprocessing and exporting the segmentation, therefore each new process spawned will run all the code that is not guarded by if __name__ == '__main__'.
The second most important optimization is to use less processes. They are useless, they occupy RAM and CPU resources and do not help at all in your case (and in many other cases). You can reduce the number of processes by using num_processes_preprocessing=1 and num_processes_segmentation_export=1.

ancestor-mithril Mar 1, 2024

You can also improve the speed by wrapping the prediction inside a docker container with linux.
If you want to speed up your inference time even more, you can sacrifice the quality of the segmentation by disabling Test Time Augmentation, by using the following parameter: use_mirroring=False.
There are other ways of optimizing the prediction speed, even without sacrificing the quality of the segmentation, however all of them require modifying the nnUNet source code:

removing the multiprocessing part from the prediction, and implement sequential prediction.
improve the preprocessing (see default_preprocessor.py, run_case). Includes cropping, normalization and resampling the input data.
improve the postprocessing (see export_prediction.py, export_prediction_from_logits).

rahulghosh2 Mar 1, 2024
Author

Hello @ancestor-mithril, thank you very much for your thoughtful feedback. I guarded all the instructions with if __name__ == 'main', which successfully prevented the additional spawned processes.

To reduce steps, I used num_processes_preprocessing=1 and num_processes_segmentation_export=1 and use_mirroring=False. These steps resulted in a segmentation that was still of great quality.

However, the net effect of these changes was to reduce the inference time from 15 seconds to 12 seconds, which is still unacceptably slow for my indication.

I am working on the sequential approach now, where I am stripping down the inference to the bare minimum steps from the NumPy array in a separate function run_nnunet_predict (see below). I think this may be the definitive solution for my case since I am working with frames from video capture, so it may be impractical to repeatedly convert and save and then extract images from directory.

In the method "_internal_maybe_mirror_and_predict" the actual pass through the network appears to be prediction = self.network(x). From that line, I traced the post-processing steps to export_prediction.py, as you suggested, and label_handling.py, finding that for my scenario the two steps are applying sigmoid non-linearity (I use region-based training) and direct binarization of the probability maps (I use the individual region channels from the probabilities).

#image is loaded from file as a (512, 512, 2) NumPy array (input_np)
def run_nnunet_predict(predictor, input_np):    
    #there are probably some pre-processing steps missing 
    
    #converts a (512, 512, 2) NumPy array to (1, 2, 512, 512)
    convert_np_to_tensor(input_np)
    input_tensor = input_tensor.float()

    #converts from logits to a binary segmentation map
    prediction_logits = predictor.network(input_tensor)
    prediction_probabilities = torch.sigmoid(prediction_logits)
    prediction_binarized = (prediction_probabilities > 0.5).float()
    
    #converts the prediction to a (512, 512, 3) NumPy array
    if prediction_binarized.is_cuda:
        prediction_numpy = prediction_binarized.squeeze(0).permute(1, 2, 0).cpu().detach().numpy()
    else:
        prediction_numpy = prediction_binarized.squeeze(0).permute(1, 2, 0).detach().numpy()
    return prediction_numpy

While this successfully passes through the network in 0.4 seconds, the output appears be just a garbled version of the input image. and not the segmentation masks I want. Since the predicted segmentation masks are fine when I run predictor.predict_from_files. I suspect that the error is due to missing the pre-processing.

I know the mirroring is not needed in my case based on above. What is the best approach to figuring out the minimal amount of pre-processing I need for my case?

Based on looking at run_case_npy which is called by run_case in default_preprocessing, it seems that the plans file contains some directive. This is a snipped from my plans file:

 "data_identifier": "nnUNetPlans_2d",
            "preprocessor_name": "DefaultPreprocessor",
            "batch_size": 8,
            "patch_size": [
                512,
                512
            ],
            "median_image_size_in_voxels": [
                512.0,
                510.5
            ],
            "spacing": [
                1.0,
                1.0
            ],
            "normalization_schemes": [
                "ZScoreNormalization",
                "ZScoreNormalization"

What is the best way to translate this into the steps I need for pre-processing and is there anything else I am missing to make the direct sequential segmentation work? I appreciate any advice.

ancestor-mithril Mar 2, 2024

You can directly run and time the run_case_npy and see if using it gives you the same result as predict_from_files.
Among other things, run_case_npy executes cropping, normalization and resampling. As far as I know, resampling takes a lot of time, but you need to measure for your case.
One way to optimize the code is to implement the numpy operations inplace for normalization, and cropping could be reimplemented like this: https://github.com/ancestor-mithril/nnUNet/blob/7b53480b2a16dd4dd05a6e02b1797e15f456dcc7/nnunetv2/preprocessing/cropping/cropping.py#L23.

But once again, you should measure which steps take the most time in your case and try to optimize those.

Answer selected by rahulghosh2

ish2002 May 16, 2024

Hi @rahulghosh2, could you please share what pre-processing steps you performed to get the segmentation masks? My output currently looks like a garbled version of the input image that you mentioned you were receiving previously. I have called run_case_npy() before performing the prediction as well as implemented a custom function which performs the pre-processing steps mentioned in my plans.json file, but neither of these helped.

rahulghosh2 May 30, 2024
Author

Hello @ish2002, to generate the final segmentation mask, I used run_case_npy as descibed above by @ancestor-mithril. Below is how I called the function from the preprocessor. input_np is a NumPy array of my input data. You have to initialize the preprocessor, plans manager and configuration files. beforehand. Afterwards, I converted it to a tensor prior to inputting into the network.

    preprocessed_data,_ = preprocessor.run_case_npy(data=input_np,
                                                           seg=None,  # Pass your segmentation here if available, otherwise None
                                                           properties=properties,
                                                           plans_manager=plans_manager,
                                                           configuration_manager=configuration_manager,
                                                           dataset_json=dataset_json)

    input_tensor = torch.from_numpy(preprocessed_data).float()

MuHe0826 · 2024-03-19T08:23:42Z

MuHe0826
Mar 19, 2024

@rahulghosh2 Hello, I am currently experiencing the same problem as you. I would like to ask, how long does it take you to process one frame?

1 reply

rahulghosh2 Mar 22, 2024
Author

The whole frame process takes me about 150 milliseconds using an NVIDIA RTX A4500 GPU with Windows OS.

bruniss · 2024-11-23T06:11:28Z

bruniss
Nov 23, 2024

Maybe just rip the trained model out of nnunet and run a regular PyTorch 2d unet with your own inference script

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Real-Time Segmentation on Video #1973

{{title}}

Replies: 3 comments 7 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Real-Time Segmentation on Video #1973

rahulghosh2 Feb 23, 2024

Replies: 3 comments · 7 replies

rahulghosh2 Feb 29, 2024 Author

ancestor-mithril Mar 1, 2024

ancestor-mithril Mar 1, 2024

rahulghosh2 Mar 1, 2024 Author

ancestor-mithril Mar 2, 2024

ish2002 May 16, 2024

rahulghosh2 May 30, 2024 Author

MuHe0826 Mar 19, 2024

rahulghosh2 Mar 22, 2024 Author

bruniss Nov 23, 2024

rahulghosh2
Feb 23, 2024

Replies: 3 comments 7 replies

rahulghosh2
Feb 29, 2024
Author

rahulghosh2 Mar 1, 2024
Author

rahulghosh2 May 30, 2024
Author

MuHe0826
Mar 19, 2024

rahulghosh2 Mar 22, 2024
Author

bruniss
Nov 23, 2024