Refactor selection workflow #17
base: main
Conversation
- This reduces the memory requirements if you e.g. have cell ids that are not ascending but random, very large numbers.
- `processes` = how many cell sets are processed at one time; `threads` (specified in the config!) = how the different steps in processing of one cell set are parallelized.
- Uses sparse matrices for more efficient calculation even on very large datasets.
I had originally tested this PR on a large slide (16270685232 total pixels) with the nuclear segmentation channel, where it performed OK. On a second test on the same slide with the cytosol segmentation channel I am unfortunately again running into issues. My guess would be that the benefit of storing the masks as a sparse matrix is reduced for the cytosol segmentations (due to the larger masks).
On a small test dataset this reduced computation time from approx. 2 min 20 s to 50 s.
I've added a numba-accelerated implementation to calculate the coords based on the sparse matrix. I'll run a test with my dataset to see if we can maybe get it to complete and/or identify the bottlenecks. Happy for additional insights in the meantime on steps we could take to improve performance. One idea I've been playing around with in my head is whether we can use sparse Dask arrays to prevent all data from being loaded into memory. Until now we've been using the memory-mapped temp arrays in scportrait to keep the segmentation masks out of memory, but they aren't compatible with sparse arrays.
I switched how we perform line compression to use the Ramer–Douglas–Peucker algorithm. In my experience this gives better results over a wide variety of input resolutions without needing to adjust any parameters. As far as I can tell it only results in marginal overhead per shape, which is outweighed by the benefit.
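For illustration, a minimal sketch of the Ramer–Douglas–Peucker idea (not the py-lmd implementation): keep the endpoints, find the point farthest from the chord between them, and recurse only where that distance exceeds a tolerance `epsilon`.

```python
import numpy as np

def rdp(points: np.ndarray, epsilon: float) -> np.ndarray:
    """Simplify a polyline with the Ramer-Douglas-Peucker algorithm.

    Minimal sketch: keep the endpoints, find the point with the largest
    perpendicular distance to the start-end chord, and recurse on both
    halves only if that distance exceeds ``epsilon``.
    """
    start, end = points[0], points[-1]
    line = end - start
    norm = np.hypot(line[0], line[1])
    if norm == 0.0:
        # Degenerate chord: fall back to plain distance from the start point
        dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
    else:
        # Perpendicular distance via the 2D cross product |line x (p - start)| / |line|
        dists = np.abs(
            line[0] * (points[:, 1] - start[1]) - line[1] * (points[:, 0] - start[0])
        ) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])  # drop the duplicated split point
    return np.vstack([start, end])

# A noisy, nearly straight line collapses to its two endpoints
pts = np.array([[0, 0], [1, 0.05], [2, -0.04], [3, 0.02], [4, 0]], dtype=float)
simplified = rdp(pts, epsilon=0.5)
```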
Within scportrait I have implemented an alternative selection workflow. Using pre-calculated center coordinates for each cell_id, we only load the region of the segmentation mask directly adjacent to the centers of the cells that are to be extracted. From these small cropped regions we can then generate a coordinate lookup table which contains the coordinates for all of the cells that are to be selected. This avoids iterating through the whole large image, which significantly improves computation time, especially when selecting few cells from large images. While this speedup is great for scportrait, I don't think this type of implementation makes sense directly in py-lmd, because it relies on precomputed results being saved and available, which will not be the case for users coming from different processing pipelines. So I would suggest we leave the implementation as it currently is in this branch (which should still provide some improvements over the previous one). What are your thoughts? @GeorgWa @lucas-diedrich We could also consider implementing a new class like `segmentationLoader` that starts with pre-calculated coords.
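The crop-based lookup described above could look roughly like this (a sketch; `coords_from_centers`, `centers`, and `window` are hypothetical names, not the scportrait API):

```python
import numpy as np

def coords_from_centers(mask, cell_ids, centers, window=50):
    """Sketch of a crop-based coordinate lookup (assumed API): for each
    requested cell id, read only a small window around its precomputed
    center instead of scanning the full segmentation mask."""
    index = {}
    for cell_id in cell_ids:
        cy, cx = centers[cell_id]
        y0, x0 = max(cy - window, 0), max(cx - window, 0)
        crop = mask[y0 : cy + window, x0 : cx + window]
        ys, xs = np.nonzero(crop == cell_id)
        # Shift crop-local coordinates back into the full-image frame
        index[cell_id] = np.column_stack([ys + y0, xs + x0])
    return index

# Toy example with a hypothetical precomputed center table
mask = np.zeros((200, 200), dtype=np.uint32)
mask[100:103, 120:123] = 42
centers = {42: (101, 121)}
coords = coords_from_centers(mask, [42], centers)
```

Note that the window has to be large enough to contain the whole cell; otherwise pixels outside the crop are silently dropped.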
Also, the reason checks on 3.8 are failing is that the type annotations use `|` instead of `Union`. If we want to keep supporting 3.8, I would go back and edit them; or do we want to drop support for 3.8?
This sounds really cool! But I agree that this kind of information will probably not always be available for all workflows.
What would you consider as standard and maximum performance specs regarding the shape generation (required image size, expected number of cells, run time)?
I agree. I feel like the generation of the cell look-up table from the whole image is optimized at this point.
I feel like passing the look-up table as an optional argument is a very good solution. Alternatively, one could separate the logic to load segmentation masks.
One could potentially reduce code complexity by using a dictionary of typed lists instead of numpy arrays, as lists can be quickly appended on the fly instead of extending (and resizing) the numpy arrays. To do so, the `_numba_accelerator_coord_calculation` function could be modified so that it uses typed lists instead of arrays as its data representation:
```python
import numba as nb
import numpy as np
from numba.typed import List


@nb.njit
def _numba_accelerator_coord_calculation(
    _ids: np.ndarray, inverse_indices: np.ndarray, sparse_coords_0: np.ndarray, sparse_coords_1: np.ndarray
) -> dict[nb.int64, list[list[nb.int64]]]:
    """Accelerate the transformation of a sparse array to a coordinate list using numba.

    Args:
        _ids: Unique cell identifiers
        inverse_indices: Order of cell identifiers in array
        sparse_coords_0/1: Coordinates of unique points of cells

    Returns:
        dict[int64, list[list[int64, int64]]]
            Dictionary of cell ids and corresponding coordinates as list of x/y coordinate pairs
    """
    # Initialize empty dict[int64, list[[int64, int64]]] that stores cells and the coordinates of their pixels
    indices = {}
    for _id in _ids:
        indices[_id] = List.empty_list(List.empty_list(nb.int64))
    # Iterate through all elements and get their index
    for idx, _ix in enumerate(inverse_indices):
        # Get current cell id
        _id = _ids[_ix]
        # Get coordinates
        coords = List([sparse_coords_0[idx], sparse_coords_1[idx]])
        # Append
        indices[_id].append(coords)
    return indices
```
The `_create_coord_index_sparse` function stays the same, except for the added dictionary comprehension at the end to obtain numpy arrays:
```python
import gc

import numpy as np
from scipy.sparse import coo_array


def _create_coord_index_sparse(mask: np.ndarray) -> dict:
    """Create a coordinate index from a segmentation mask.

    In the coordinate index each key is a unique class id and the value is a list of coordinates for this class.

    Args:
        mask (np.ndarray): A 2D segmentation mask

    Returns:
        dict: A dictionary containing the class ids as keys and a list of coordinates as values
    """
    # Convert to a sparse array (this will be faster than iterating over the dense array
    # because segmentation masks are usually sparse)
    sparse_mask = coo_array(mask)

    cell_ids = sparse_mask.data
    sparse_coords_0 = sparse_mask.coords[0].astype(np.int64)
    sparse_coords_1 = sparse_mask.coords[1].astype(np.int64)

    del sparse_mask  # no longer needed for calculations; explicit deletion frees the memory sooner
    gc.collect()

    _ids, inverse_indices = np.unique(cell_ids, return_inverse=True)
    indices = _numba_accelerator_coord_calculation(_ids, inverse_indices, sparse_coords_0, sparse_coords_1)

    # Convert lists to arrays
    # Slight overhead due to additional dictionary comprehension
    return {k: np.array(v) for k, v in indices.items()}
```
Code for benchmarks:
```python
import time

import numpy as np
import pandas as pd

from lmd.segmentation import _create_coord_index_sparse as _create_coord_index_sparse_pylmd


def generate_data(shape: int, ncells: int, seed: int = 42):
    """Generate a shape x shape image with ncells as 2x2 rectangles"""
    np.random.seed(seed)
    array = np.zeros(shape=(shape, shape), dtype=np.uint64)
    for cell_id in range(1, ncells + 1):
        x, y = np.random.randint(0, shape), np.random.randint(0, shape)
        array[y:y + 2, x:x + 2] = cell_id
    return array


shapes = [1000, 10000, 100000]
cells = [100, 1000, 10000]
replicates = 5

results = []
for shape in shapes:
    for n in cells:
        if n > 0.5 * shape**2:
            continue
        for replicate in range(replicates):
            print(shape, n, replicate)
            array = generate_data(shape, ncells=n)

            start_time = time.time()
            cell_coords_pylmd = _create_coord_index_sparse_pylmd(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "pylmd",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape**2,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            start_time = time.time()
            cell_coords = _create_coord_index_sparse(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "custom",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape * shape,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            for k in cell_coords_pylmd:
                assert (cell_coords_pylmd[k] == cell_coords[k]).all()

df = (pd.DataFrame(results)
      .groupby(["method", "n_row", "n_col", "pixel", "n_cells"])
      .agg({"time": np.mean})
      .reset_index()
      .pivot_table(index=["pixel", "n_cells"], columns="method", values=["time"])
      .droplevel(0, axis=1)
      .reset_index()
      )
# ~ 20 min run time
```
The current workflow for generating XMLs creates a lookup list where, for each cell id, the coordinate positions are listed.
For very large datasets (especially when cell ids are not assigned sequentially but can, for example, jump from 100 to 15000), this can lead to the process running out of memory.
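To illustrate the issue with non-sequential ids (toy numbers, not from a real dataset): a lookup *list* indexed by cell id has to allocate up to `max(id) + 1` slots, most of them unused, while a dict keyed by the ids that actually occur stays proportional to the number of cells.

```python
import numpy as np

# Hypothetical cell ids that jump from small values to very large ones
cell_ids = np.array([100, 15_000, 2_000_000])

# An id-indexed list/array needs max(id) + 1 entries for just 3 cells
array_slots = int(cell_ids.max()) + 1

# A dict stores only the ids that are actually present
dict_lookup = {int(cid): [] for cid in cell_ids}

print(array_slots, len(dict_lookup))
```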
To fix this I implemented two main changes:
In addition, I streamlined the generated log outputs to make them more concise.
Happy to get additional feedback on how we can improve the code structure here.