
Refactor selection workflow #17

Draft · sophiamaedler wants to merge 20 commits into main from refactor_selection_workflow
Conversation

@sophiamaedler sophiamaedler commented Dec 10, 2024

The current workflow for generating XMLs creates a lookup list where, for each cell id, the coordinate positions are listed.
For very large datasets (especially when cell ids are not assigned sequentially but can, for example, jump from 100 to 15000), this can cause the process to run out of memory.

To fix this I implemented two main changes:

  1. We no longer use a list to store the coordinates but a dictionary, so no empty list elements are created for non-existent cell_ids.
  2. We calculate the coordinates from a sparse scipy array instead of the dense array (the assumption being that most segmentation masks contain many 0 elements).
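The two changes can be sketched roughly as follows (a minimal illustration, not the actual py-lmd code; `coord_index` is a hypothetical name):

```python
import numpy as np
from scipy.sparse import coo_array

def coord_index(mask: np.ndarray) -> dict:
    # COO form stores one (row, col, value) triple per non-zero pixel,
    # so only occupied pixels are ever touched.
    sm = coo_array(mask)
    index: dict = {}
    # A dict only holds keys for cell ids that actually occur, so sparse,
    # non-sequential ids (e.g. 100 followed by 15000) cost no extra memory.
    for cid, r, c in zip(sm.data, sm.row, sm.col):
        index.setdefault(int(cid), []).append((int(r), int(c)))
    return {k: np.asarray(v) for k, v in index.items()}
```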

In addition, I streamlined the generated log output to make it more concise.

Happy to get some additional feedback for how we can improve the code structure here.

- Reduces memory requirements when cell ids are not ascending but random, very large numbers.
- processes = how many cell sets are processed at one time.
- threads (specified in the config!) = how the different steps in processing one cell set are parallelized.
- Uses sparse matrices for more efficient calculation even on very large datasets.
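As a rough illustration of the processes/threads distinction (hypothetical helper names; the real py-lmd internals differ):

```python
from concurrent.futures import ProcessPoolExecutor

def process_cell_set(cell_set):
    # Placeholder for the per-cell-set work (coordinate lookup, shape
    # generation, compression); internally this step could use `threads`
    # worker threads as described above.
    return sorted(cell_set)

def process_all(cell_sets, processes=2):
    # `processes` cell sets are handled concurrently, each in its own process.
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(process_cell_set, cell_sets))
```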
sophiamaedler commented Dec 11, 2024

I had originally tested this PR on a large slide (16,270,685,232 total pixels) with the nuclear segmentation channel, where it performed OK. On a second test on the same slide with the cytosol segmentation channel, I am unfortunately again running into issues.

My guess would be that the benefit of storing the masks as a sparse matrix is reduced in the cytosol segmentations (due to the larger masks).
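One quick way to check this hypothesis is to compare the fill fraction of the two masks (illustrative snippet; `mask_density` is a made-up name):

```python
import numpy as np

def mask_density(mask: np.ndarray) -> float:
    # A COO representation needs one (row, col, value) triple per non-zero
    # pixel, so its memory use grows linearly with this fraction.
    return np.count_nonzero(mask) / mask.size
```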

Nucleus Segmentation Mask: [screenshot]

Cytosol Segmentation Mask: [screenshot]

On a small test dataset this reduced computation time from approx. 2 min 20 s to 50 s.
@sophiamaedler sophiamaedler marked this pull request as draft December 11, 2024 18:22
sophiamaedler commented Dec 11, 2024

I've added a numba-accelerated implementation to calculate the coords based on the sparse matrix. I'll run a test with my dataset to see if we can get it to complete and/or identify the bottlenecks. Happy for additional insights in the meantime on steps we could take to improve performance.

One idea I've been playing around with is whether we can use sparse Dask arrays to prevent all data from being loaded into memory. Up till now we've been using the memory-mapped temp arrays in scportrait to keep the segmentation masks out of memory, but they aren't compatible with sparse arrays.
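Dask would handle the chunking automatically; as a rough stand-in, here is a manual tiling sketch over a (possibly memory-mapped) array, so only one tile is dense in memory at a time (hypothetical helper, not the py-lmd implementation):

```python
import numpy as np
from scipy.sparse import coo_array

def coord_index_tiled(mask, tile=1024):
    index = {}
    h, w = mask.shape
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            # Only this tile is materialized as a dense in-memory array.
            block = np.asarray(mask[r0:r0 + tile, c0:c0 + tile])
            sm = coo_array(block)
            for cid, r, c in zip(sm.data, sm.row, sm.col):
                # Shift tile-local coordinates back to global coordinates.
                index.setdefault(int(cid), []).append((int(r) + r0, int(c) + c0))
    return {k: np.asarray(v) for k, v in index.items()}
```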

@sophiamaedler commented:
I switched how we perform line compression to use the Ramer–Douglas–Peucker algorithm. In my experience this gives better results over a wide variety of input resolutions without needing to adjust any parameters. As far as I can tell it only results in marginal overhead per shape which is outweighed by the benefit.
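For reference, the algorithm can be sketched in a few lines (a generic textbook version, not the code used in this branch; `epsilon` is the only tolerance parameter):

```python
import numpy as np

def rdp(points, epsilon: float) -> np.ndarray:
    """Ramer-Douglas-Peucker: drop points closer than epsilon to the chord."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    d = end - start
    norm = np.hypot(d[0], d[1])
    if norm == 0:
        # Degenerate chord: fall back to distance from the start point.
        dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
    else:
        # Perpendicular distance of every point to the start-end chord.
        dists = np.abs(d[0] * (points[:, 1] - start[1])
                       - d[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])
```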

@sophiamaedler force-pushed the refactor_selection_workflow branch from ca44c55 to ace87a7 on December 14, 2024 18:44
@sophiamaedler force-pushed the refactor_selection_workflow branch from ace87a7 to f6d64d8 on December 14, 2024 18:47
@sophiamaedler commented:
Within scportrait I have implemented an alternative selection workflow. Using pre-calculated center coordinates for each cell_id, we only load the region of the segmentation mask directly adjacent to the center of the cells to be extracted. From these small cropped regions we can then generate a coordinate lookup table containing the coordinates of all cells to be selected. This avoids iterating through the whole large image, which significantly improves computation time, especially when selecting few cells from large images.
For my test dataset I can now generate an XML to excise about 2000 cells from a slide with over 4 million cells in a few minutes. This workflow is compatible with the py-lmd implementation in this branch, which also allows you to pass None as the segmentation mask and provide the coordinate lookup dictionary directly to the function.
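A minimal sketch of that idea (hypothetical helper; the scportrait code differs in details such as window sizing):

```python
import numpy as np

def coords_from_centers(mask, centers, window=50):
    """Build a cell-id -> coordinates lookup from pre-calculated centers.

    `centers` maps cell_id -> (row, col). Only a window around each center
    is read, so the full mask is never scanned.
    """
    index = {}
    h, w = mask.shape
    for cid, (r, c) in centers.items():
        r0, r1 = max(r - window, 0), min(r + window, h)
        c0, c1 = max(c - window, 0), min(c + window, w)
        crop = mask[r0:r1, c0:c1]
        rr, cc = np.nonzero(crop == cid)
        # Shift crop-local coordinates back to global coordinates.
        index[cid] = np.stack([rr + r0, cc + c0], axis=1)
    return index
```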

While this speedup is great for scportrait, I don't think this type of implementation makes sense directly in py-lmd because it relies on precomputed results to be saved and available which will not be the case for users coming from different processing pipelines, so I would suggest we leave the implementation as it currently is in this branch (which should still provide some improvements over the previous one).

What are your thoughts? @GeorgWa @lucas-diedrich

We could also consider implementing a new class, something like SegmentationLoader, that starts from pre-calculated coords.

@sophiamaedler commented:
Also, the reason the checks on 3.8 are failing is that type annotations use | instead of Union. If we want to keep supporting 3.8, I'll go back and edit them; alternatively, we could drop support for 3.8.
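For example, a 3.8-compatible spelling of a `|` annotation (illustrative signature, not from the actual diff):

```python
from typing import Optional, Union

import numpy as np

# Python >= 3.10 allows:  mask: np.ndarray | None = None
# The Python 3.8-compatible equivalent:
def get_coords(mask: Optional[np.ndarray] = None,
               cell_ids: Union[list, np.ndarray, None] = None):
    return mask, cell_ids
```

Adding `from __future__ import annotations` also makes the `|` syntax parse on 3.8, but only as long as the annotations are never evaluated at runtime.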

lucas-diedrich commented Dec 31, 2024

> Within scportrait I have implemented an alternative selection workflow. Using pre-calculated center coordinates for each cell_id we only load the region of the segmentation mask directly adjacent to the centre of the cells that are to be extracted. Using these small cropped regions we can then generate a coordinate lookup table which contains the coordinates for all of the cells that are to be selected. This prevents us needing to iterate through the whole large image which significantly improves computation time especially when selecting few cells from large images.

This sounds really cool! But I agree that this kind of information will probably not always be available for all workflows.

> For my test dataset I can now generate an XML to excise about 2000 cells from a slide with over 4 million cells in a few minutes. This workflow is compatible with the py-lmd implementation in this branch which allows you to also pass None as the segmentation mask and provide the coordinate lookup dictionary directly to the function.

What would you consider as standard and maximum performance specs regarding the shape generation (required image size, expected number of cells, run time)?

> While this speedup is great for scportrait, I don't think this type of implementation makes sense directly in py-lmd because it relies on precomputed results to be saved and available which will not be the case for users coming from different processing pipelines, so I would suggest we leave the implementation as it currently is in this branch (which should still provide some improvements over the previous one).

I agree. I feel like the generation of the cell lookup table from the whole image is optimized at this point.

> We could also consider implementing a new class like segmentationLoader that starts with pre-calculated coords.

I feel like passing the lookup table as an optional argument is a very good solution.

Alternatively, one could separate the logic to load segmentation masks (SegmentationLoader; final output: cell lookup table) from the logic to create and write the LMD-compatible .xml files (a new class SegmentationWriter; final output: file).
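A rough skeleton of that split (class and method names are guesses based on the comment above, not an actual py-lmd API):

```python
import numpy as np

class SegmentationLoader:
    """Turns a segmentation mask into a cell-id -> coordinate lookup table."""

    def __init__(self, mask: np.ndarray):
        self.mask = mask

    def load(self) -> dict:
        ids = np.unique(self.mask)
        # 0 is treated as background by convention and gets no entry.
        return {int(i): np.argwhere(self.mask == i) for i in ids if i != 0}

class SegmentationWriter:
    """Creates and writes LMD-compatible .xml files from a lookup table."""

    def __init__(self, coord_index: dict):
        self.coord_index = coord_index

    def write(self, path: str) -> None:
        # Shape generation and XML serialization would go here.
        raise NotImplementedError
```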


Review comment on _create_coord_index_sparse:

One could potentially reduce code complexity by using a dictionary of typed lists instead of numpy arrays, as lists can be quickly appended to on the fly instead of extending (and resizing) the numpy arrays. To do so, _numba_accelerator_coord_calculation could be modified to use typed lists instead of arrays as the data representation:

import numba as nb
import numpy as np
from numba.typed import List

@nb.njit
def _numba_accelerator_coord_calculation(
        _ids: np.ndarray, inverse_indices: np.ndarray, sparse_coords_0: np.ndarray, sparse_coords_1: np.ndarray
    ):
    """Accelerate the transformation of a sparse array to a coordinate list using numba.

    Args:
        _ids: Unique cell identifiers
        inverse_indices: Order of cell identifiers in the array
        sparse_coords_0/1: Coordinates of unique points of cells

    Returns:
        Dictionary of cell ids and corresponding coordinates as a list of x/y coordinate pairs
    """
    # Initialize an empty dict that stores cells and the coordinates of their pixels
    indices = {}
    for _id in _ids:
        indices[_id] = List.empty_list(List.empty_list(nb.int64))

    # Iterate through all elements and get their index
    for idx, _ix in enumerate(inverse_indices):
        # Get the current cell id
        _id = _ids[_ix]
        # Get coordinates
        coords = List([sparse_coords_0[idx], sparse_coords_1[idx]])
        # Append
        indices[_id].append(coords)

    return indices

The _create_coord_index_sparse function stays the same except for the added dictionary comprehension at the end to obtain numpy arrays.

import gc

import numpy as np
from scipy.sparse import coo_array

def _create_coord_index_sparse(mask: np.ndarray) -> dict:
    """Create a coordinate index from a segmentation mask.

    In the coordinate index, each key is a unique class id and the value is a list of coordinates for this class.

    Args:
        mask (np.ndarray): A 2D segmentation mask

    Returns:
        dict: A dictionary containing the class ids as keys and a list of coordinates as values
    """
    # Convert to a sparse array (faster than iterating over the dense array because segmentation masks are usually sparse)
    sparse_mask = coo_array(mask)

    cell_ids = sparse_mask.data
    sparse_coords_0 = sparse_mask.coords[0].astype(np.int64)
    sparse_coords_1 = sparse_mask.coords[1].astype(np.int64)

    del sparse_mask  # no longer needed for calculations; explicitly deleting it frees memory sooner
    gc.collect()

    _ids, inverse_indices = np.unique(cell_ids, return_inverse=True)
    indices = _numba_accelerator_coord_calculation(_ids, inverse_indices, sparse_coords_0, sparse_coords_1)

    # Convert lists to arrays
    # (slight overhead due to the additional dictionary comprehension)
    return {k: np.array(v) for k, v in indices.items()}


Compared to the current implementation, this leads to a slight overhead in scenarios with many cells in small and medium-sized images (due to the additional dictionary comprehension at the end), but the difference is not noticeable for large images.

[Figure: time comparison across image sizes and cell counts]


Code for benchmarks:

import time

import numpy as np
import pandas as pd

from lmd.segmentation import _create_coord_index_sparse as _create_coord_index_sparse_pylmd
# The modified _create_coord_index_sparse from the comment above is assumed
# to be defined in the current session.


def generate_data(shape: int, ncells: int, seed: int = 42):
    """Generate a shape x shape image with ncells cells drawn as 2x2 rectangles."""
    np.random.seed(seed)
    array = np.zeros(shape=(shape, shape), dtype=np.uint64)

    for cell_id in range(1, ncells + 1):
        x, y = np.random.randint(0, shape), np.random.randint(0, shape)
        array[y:y + 2, x:x + 2] = cell_id

    return array


shapes = [1000, 10000, 100000]
cells = [100, 1000, 10000]
replicates = 5

results = []
for shape in shapes:
    for n in cells:
        if n > 0.5 * shape**2:
            continue

        for replicate in range(replicates):
            print(shape, n, replicate)
            array = generate_data(shape, ncells=n)

            start_time = time.time()
            cell_coords_pylmd = _create_coord_index_sparse_pylmd(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "pylmd",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape**2,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            start_time = time.time()
            cell_coords = _create_coord_index_sparse(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "custom",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape * shape,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            # Verify that both implementations produce identical coordinates
            for k in cell_coords_pylmd:
                assert (cell_coords_pylmd[k] == cell_coords[k]).all()

df = (pd.DataFrame(results)
      .groupby(["method", "n_row", "n_col", "pixel", "n_cells"])
      .agg({"time": "mean"})
      .reset_index()
      .pivot_table(index=["pixel", "n_cells"], columns="method", values=["time"])
      .droplevel(0, axis=1)
      .reset_index()
)

# ~ 20 min run time
