
Refactor selection workflow #17

Draft · sophiamaedler wants to merge 20 commits into main from refactor_selection_workflow
Conversation

@sophiamaedler sophiamaedler commented Dec 10, 2024

The current workflow for generating XMLs creates a lookup list where, for each cell id, the coordinate positions are listed.
For very large datasets (especially when cell ids are not assigned sequentially but can, for example, jump from 100 to 15000), this can cause the process to run out of memory.

To fix this I implemented two main changes:

  1. We no longer use a list to store the coordinates but a dictionary, so no empty list elements are created for non-existent cell_ids.
  2. We calculate the coordinates from a sparse scipy array instead of the dense array (the assumption being that most segmentation masks contain many 0 elements).
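The two changes can be sketched roughly as follows (a minimal illustration, not the actual py-lmd code; `coord_index` is a hypothetical name):

```python
import numpy as np
from scipy.sparse import coo_array

def coord_index(mask: np.ndarray) -> dict:
    # COO form stores one (row, col, value) triple per non-zero pixel,
    # so only occupied pixels are ever touched.
    sm = coo_array(mask)
    index: dict = {}
    # A dict only holds keys for cell ids that actually occur, so sparse,
    # non-sequential ids (e.g. 100 followed by 15000) cost no extra memory.
    for cid, r, c in zip(sm.data, sm.row, sm.col):
        index.setdefault(int(cid), []).append((int(r), int(c)))
    return {k: np.asarray(v) for k, v in index.items()}
```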

In addition, I streamlined the generated log output to make it more concise.

Happy to get some additional feedback for how we can improve the code structure here.

- Reduces memory requirements when cell ids are not ascending but random, very large numbers.
- processes = how many cell sets are processed at one time.
- threads (specified in the config!) = how the different steps in processing one cell set are parallelized.
- Uses sparse matrices for more efficient calculation even on very large datasets.
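As a rough illustration of the processes/threads distinction (hypothetical helper names; the real py-lmd internals differ):

```python
from concurrent.futures import ProcessPoolExecutor

def process_cell_set(cell_set):
    # Placeholder for the per-cell-set work (coordinate lookup, shape
    # generation, compression); internally this step could use `threads`
    # worker threads as described above.
    return sorted(cell_set)

def process_all(cell_sets, processes=2):
    # `processes` cell sets are handled concurrently, each in its own process.
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(process_cell_set, cell_sets))
```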
sophiamaedler commented Dec 11, 2024

I had originally tested this PR on a large slide (16,270,685,232 total pixels) with the nuclear segmentation channel, where it performed OK. On a second test on the same slide with the cytosol segmentation channel, I am unfortunately again running into issues.

My guess would be that the benefit of storing the masks as a sparse matrix is reduced in the cytosol segmentations (due to the larger masks).
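One quick way to check this hypothesis is to compare the fill fraction of the two masks (illustrative snippet; `mask_density` is a made-up name):

```python
import numpy as np

def mask_density(mask: np.ndarray) -> float:
    # A COO representation needs one (row, col, value) triple per non-zero
    # pixel, so its memory use grows linearly with this fraction.
    return np.count_nonzero(mask) / mask.size
```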

Nucleus Segmentation Mask: [screenshot]

Cytosol Segmentation Mask: [screenshot]

On a small test dataset this reduced computation time from approx. 2 min 20 s to 50 s.
@sophiamaedler sophiamaedler marked this pull request as draft December 11, 2024 18:22
sophiamaedler commented Dec 11, 2024

I've added a numba-accelerated implementation to calculate the coords based on the sparse matrix. I'll run a test with my dataset to see if we can get it to complete and/or identify the bottlenecks. Happy for additional insights in the meantime on steps we could take to improve performance.

One idea I've been playing around with is whether we can use sparse Dask arrays to prevent all data from being loaded into memory. Up till now we've been using the memory-mapped temp arrays in scportrait to keep the segmentation masks out of memory, but they aren't compatible with sparse arrays.
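Dask would handle the chunking automatically; as a rough stand-in, here is a manual tiling sketch over a (possibly memory-mapped) array, so only one tile is dense in memory at a time (hypothetical helper, not the py-lmd implementation):

```python
import numpy as np
from scipy.sparse import coo_array

def coord_index_tiled(mask, tile=1024):
    index = {}
    h, w = mask.shape
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            # Only this tile is materialized as a dense in-memory array.
            block = np.asarray(mask[r0:r0 + tile, c0:c0 + tile])
            sm = coo_array(block)
            for cid, r, c in zip(sm.data, sm.row, sm.col):
                # Shift tile-local coordinates back to global coordinates.
                index.setdefault(int(cid), []).append((int(r) + r0, int(c) + c0))
    return {k: np.asarray(v) for k, v in index.items()}
```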

@sophiamaedler commented:
I switched how we perform line compression to use the Ramer–Douglas–Peucker algorithm. In my experience this gives better results over a wide variety of input resolutions without needing to adjust any parameters. As far as I can tell it only results in marginal overhead per shape which is outweighed by the benefit.
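For reference, the algorithm can be sketched in a few lines (a generic textbook version, not the code used in this branch; `epsilon` is the only tolerance parameter):

```python
import numpy as np

def rdp(points, epsilon: float) -> np.ndarray:
    """Ramer-Douglas-Peucker: drop points closer than epsilon to the chord."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    d = end - start
    norm = np.hypot(d[0], d[1])
    if norm == 0:
        # Degenerate chord: fall back to distance from the start point.
        dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
    else:
        # Perpendicular distance of every point to the start-end chord.
        dists = np.abs(d[0] * (points[:, 1] - start[1])
                       - d[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])
```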

@sophiamaedler force-pushed the refactor_selection_workflow branch from ca44c55 to ace87a7 on December 14, 2024 18:44
@sophiamaedler force-pushed the refactor_selection_workflow branch from ace87a7 to f6d64d8 on December 14, 2024 18:47
@sophiamaedler commented:
Within scportrait I have implemented an alternative selection workflow. Using pre-calculated center coordinates for each cell_id, we only load the region of the segmentation mask directly adjacent to the center of the cells to be extracted. From these small cropped regions we can then generate a coordinate lookup table containing the coordinates of all cells to be selected. This avoids iterating through the whole large image, which significantly improves computation time, especially when selecting few cells from large images.
For my test dataset I can now generate an XML to excise about 2000 cells from a slide with over 4 million cells in a few minutes. This workflow is compatible with the py-lmd implementation in this branch, which also allows you to pass None as the segmentation mask and provide the coordinate lookup dictionary directly to the function.
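A minimal sketch of that idea (hypothetical helper; the scportrait code differs in details such as window sizing):

```python
import numpy as np

def coords_from_centers(mask, centers, window=50):
    """Build a cell-id -> coordinates lookup from pre-calculated centers.

    `centers` maps cell_id -> (row, col). Only a window around each center
    is read, so the full mask is never scanned.
    """
    index = {}
    h, w = mask.shape
    for cid, (r, c) in centers.items():
        r0, r1 = max(r - window, 0), min(r + window, h)
        c0, c1 = max(c - window, 0), min(c + window, w)
        crop = mask[r0:r1, c0:c1]
        rr, cc = np.nonzero(crop == cid)
        # Shift crop-local coordinates back to global coordinates.
        index[cid] = np.stack([rr + r0, cc + c0], axis=1)
    return index
```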

While this speedup is great for scportrait, I don't think this type of implementation makes sense directly in py-lmd because it relies on precomputed results to be saved and available which will not be the case for users coming from different processing pipelines, so I would suggest we leave the implementation as it currently is in this branch (which should still provide some improvements over the previous one).

What are your thoughts? @GeorgWa @lucas-diedrich

We could also consider implementing a new class, something like SegmentationLoader, that starts from pre-calculated coords.

@sophiamaedler commented:
Also, the reason the checks on 3.8 are failing is that type annotations use | instead of Union. If we want to keep supporting 3.8, I'll go back and edit them; alternatively, we could drop support for 3.8.
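For example, a 3.8-compatible spelling of a `|` annotation (illustrative signature, not from the actual diff):

```python
from typing import Optional, Union

import numpy as np

# Python >= 3.10 allows:  mask: np.ndarray | None = None
# The Python 3.8-compatible equivalent:
def get_coords(mask: Optional[np.ndarray] = None,
               cell_ids: Union[list, np.ndarray, None] = None):
    return mask, cell_ids
```

Adding `from __future__ import annotations` also makes the `|` syntax parse on 3.8, but only as long as the annotations are never evaluated at runtime.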

lucas-diedrich commented Dec 31, 2024

> Within scportrait I have implemented an alternative selection workflow. Using pre-calculated center coordinates for each cell_id we only load the region of the segmentation mask directly adjacent to the centre of the cells that are to be extracted. Using these small cropped regions we can then generate a coordinate lookup table which contains the coordinates for all of the cells that are to be selected. This prevents us needing to iterate through the whole large image which significantly improves computation time especially when selecting few cells from large images.

This sounds really cool! But I agree that this kind of information will probably not always be available for all workflows.

> For my test dataset I can now generate an XML to excise about 2000 cells from a slide with over 4 million cells in a few minutes. This workflow is compatible with the py-lmd implementation in this branch which allows you to also pass None as the segmentation mask and provide the coordinate lookup dictionary directly to the function.

What would you consider as standard and maximum performance specs regarding the shape generation (required image size, expected number of cells, run time)?

> While this speedup is great for scportrait, I don't think this type of implementation makes sense directly in py-lmd because it relies on precomputed results to be saved and available which will not be the case for users coming from different processing pipelines, so I would suggest we leave the implementation as it currently is in this branch (which should still provide some improvements over the previous one).

I agree. I feel like the generation of the cell lookup table from the whole image is optimized at this point.

> We could also consider implementing a new class like segmentationLoader that starts with pre-calculated coords.

I feel like passing the lookup table as an optional argument is a very good solution.

Alternatively, one could separate the logic to load segmentation masks (SegmentationLoader; final output: cell lookup table) from the logic to create and write the LMD-compatible .xml files (a new class SegmentationWriter; final output: file).
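A rough skeleton of that split (class and method names are guesses based on the comment above, not an actual py-lmd API):

```python
import numpy as np

class SegmentationLoader:
    """Turns a segmentation mask into a cell-id -> coordinate lookup table."""

    def __init__(self, mask: np.ndarray):
        self.mask = mask

    def load(self) -> dict:
        ids = np.unique(self.mask)
        # 0 is treated as background by convention and gets no entry.
        return {int(i): np.argwhere(self.mask == i) for i in ids if i != 0}

class SegmentationWriter:
    """Creates and writes LMD-compatible .xml files from a lookup table."""

    def __init__(self, coord_index: dict):
        self.coord_index = coord_index

    def write(self, path: str) -> None:
        # Shape generation and XML serialization would go here.
        raise NotImplementedError
```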


Review comment on _create_coord_index_sparse:

One could potentially reduce code complexity by using a dictionary of typed lists instead of numpy arrays, as lists can be quickly appended to on the fly instead of extending (and resizing) the numpy arrays. To do so, _numba_accelerator_coord_calculation could be modified to use typed lists instead of arrays as the data representation:

import numba as nb
import numpy as np
from numba.typed import List

@nb.njit
def _numba_accelerator_coord_calculation(
        _ids: np.ndarray, inverse_indices: np.ndarray, sparse_coords_0: np.ndarray, sparse_coords_1: np.ndarray
    ):
    """Accelerate the transformation of a sparse array to a coordinate list using numba.

    Args:
        _ids: Unique cell identifiers
        inverse_indices: Order of cell identifiers in the array
        sparse_coords_0/1: Coordinates of unique points of cells

    Returns:
        Dictionary of cell ids and corresponding coordinates as a list of x/y coordinate pairs
    """
    # Initialize an empty dict that stores cells and the coordinates of their pixels
    indices = {}
    for _id in _ids:
        indices[_id] = List.empty_list(List.empty_list(nb.int64))

    # Iterate through all elements and get their index
    for idx, _ix in enumerate(inverse_indices):
        # Get the current cell id
        _id = _ids[_ix]
        # Get coordinates
        coords = List([sparse_coords_0[idx], sparse_coords_1[idx]])
        # Append
        indices[_id].append(coords)

    return indices

The _create_coord_index_sparse function stays the same except for the added dictionary comprehension at the end to obtain numpy arrays.

import gc

import numpy as np
from scipy.sparse import coo_array

def _create_coord_index_sparse(mask: np.ndarray) -> dict:
    """Create a coordinate index from a segmentation mask.

    In the coordinate index, each key is a unique class id and the value is a list of coordinates for this class.

    Args:
        mask (np.ndarray): A 2D segmentation mask

    Returns:
        dict: A dictionary containing the class ids as keys and a list of coordinates as values
    """
    # Convert to a sparse array (faster than iterating over the dense array because segmentation masks are usually sparse)
    sparse_mask = coo_array(mask)

    cell_ids = sparse_mask.data
    sparse_coords_0 = sparse_mask.coords[0].astype(np.int64)
    sparse_coords_1 = sparse_mask.coords[1].astype(np.int64)

    del sparse_mask  # no longer needed for calculations; explicitly deleting it frees memory sooner
    gc.collect()

    _ids, inverse_indices = np.unique(cell_ids, return_inverse=True)
    indices = _numba_accelerator_coord_calculation(_ids, inverse_indices, sparse_coords_0, sparse_coords_1)

    # Convert lists to arrays
    # (slight overhead due to the additional dictionary comprehension)
    return {k: np.array(v) for k, v in indices.items()}


Compared to the current implementation, this leads to a slight overhead in scenarios with many cells in small and medium-sized images (due to the additional dictionary comprehension at the end), but the difference is not noticeable for large images.

[Figure: time comparison across image sizes and cell counts]


Code for benchmarks:

import time

import numpy as np
import pandas as pd

from lmd.segmentation import _create_coord_index_sparse as _create_coord_index_sparse_pylmd
# The modified _create_coord_index_sparse from the comment above is assumed
# to be defined in the current session.


def generate_data(shape: int, ncells: int, seed: int = 42):
    """Generate a shape x shape image with ncells cells drawn as 2x2 rectangles."""
    np.random.seed(seed)
    array = np.zeros(shape=(shape, shape), dtype=np.uint64)

    for cell_id in range(1, ncells + 1):
        x, y = np.random.randint(0, shape), np.random.randint(0, shape)
        array[y:y + 2, x:x + 2] = cell_id

    return array


shapes = [1000, 10000, 100000]
cells = [100, 1000, 10000]
replicates = 5

results = []
for shape in shapes:
    for n in cells:
        if n > 0.5 * shape**2:
            continue

        for replicate in range(replicates):
            print(shape, n, replicate)
            array = generate_data(shape, ncells=n)

            start_time = time.time()
            cell_coords_pylmd = _create_coord_index_sparse_pylmd(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "pylmd",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape**2,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            start_time = time.time()
            cell_coords = _create_coord_index_sparse(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "custom",
                 "n_row": shape,
                 "n_col": shape,
                 "pixel": shape * shape,
                 "n_cells": n,
                 "replicate": replicate,
                 "time": total_time,
                 }
            )

            # Verify that both implementations produce identical coordinates
            for k in cell_coords_pylmd:
                assert (cell_coords_pylmd[k] == cell_coords[k]).all()

df = (pd.DataFrame(results)
      .groupby(["method", "n_row", "n_col", "pixel", "n_cells"])
      .agg({"time": "mean"})
      .reset_index()
      .pivot_table(index=["pixel", "n_cells"], columns="method", values=["time"])
      .droplevel(0, axis=1)
      .reset_index()
)

# ~ 20 min run time
