Refactor selection workflow #17

Merged · 21 commits · Jan 23, 2025

Commits:
439ffe6: use dictionaries + introduce classes list (sophiamaedler, Nov 7, 2024)
943cc32: ensure consistent typing, fix bug (sophiamaedler, Nov 8, 2024)
50839f3: update docstring (sophiamaedler, Nov 8, 2024)
b38bdd1: make processes and threads consistent (sophiamaedler, Dec 9, 2024)
670e6f1: Merge branch 'improve_selection_big_datasets' into refactor_selection… (sophiamaedler, Dec 10, 2024)
c8752d8: make typing consistent (sophiamaedler, Dec 10, 2024)
0f907af: implement additional coordinate lookup calculation method (sophiamaedler, Dec 10, 2024)
c740a96: cleanup print statements (sophiamaedler, Dec 10, 2024)
e6ee255: improve log output (sophiamaedler, Dec 10, 2024)
e3adbef: add numba accelerated sparse coordinate calculation (sophiamaedler, Dec 11, 2024)
bc88147: fix incorrect assumption (sophiamaedler, Dec 12, 2024)
4b954b3: add safety check to ensure that no cell_id with empty coordinates is … (sophiamaedler, Dec 12, 2024)
639e8d0: fix typo (sophiamaedler, Dec 12, 2024)
500fc19: cleanup log output (sophiamaedler, Dec 12, 2024)
249f942: improve debug plot (sophiamaedler, Dec 12, 2024)
91697f1: limit number of processes to number of cell sets that need to be proc… (sophiamaedler, Dec 12, 2024)
8881a13: further improvements to look of debugging plot (sophiamaedler, Dec 12, 2024)
66a6316: add possibility to return plot and improve plot look (sophiamaedler, Dec 14, 2024)
f6d64d8: perform line compression with the Ramer–Douglas–Peucker algorithm (sophiamaedler, Dec 14, 2024; see the illustrative sketch after this list)
45d5ca8: fix default parameter values for rdp_epsilon (sophiamaedler, Dec 17, 2024)
5053a8f: update python version dependency (sophiamaedler, Jan 23, 2025)
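
Commit f6d64d8 adds line compression with the Ramer–Douglas–Peucker algorithm, which repeatedly drops outline vertices whose distance to the simplified line falls below a tolerance epsilon. The PR's actual implementation is not shown on this page; the following is an illustrative sketch only, using shapely's simplify method (a Douglas–Peucker implementation) on a made-up outline, where tolerance plays the role of the rdp_epsilon parameter mentioned in commit 45d5ca8:

import numpy as np
from shapely.geometry import LineString

# illustrative sketch only, not the PR's implementation:
# a noisy cell outline with more vertices than needed
outline = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 5.0],
                    [4.0, 6.0], [5.0, 7.0], [6.0, 8.1], [7.0, 9.0]])

# shapely's simplify() implements Douglas-Peucker; tolerance ~ rdp_epsilon
simplified = LineString(outline).simplify(tolerance=1.0)
print(np.asarray(simplified.coords))  # fewer vertices, same overall shape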
src/lmd/segmentation.py (71 changes: 59 additions & 12 deletions)

@@ -8,24 +8,71 @@

 import numba as nb
 from scipy.sparse import coo_array
+import gc

-def _create_coord_index_sparse(mask):
+@njit
Comment from Collaborator:
One could potentially reduce code complexity by using a dictionary of typed Lists instead of numpy arrays, as lists can be appended to on the fly instead of extending (and resizing) the numpy arrays. To do so, _numba_accelerator_coord_calculation could be modified so that it uses typed lists instead of arrays as its data representation:

from numba.typed import List  # numba's typed list

# numba type of one [x, y] coordinate pair; hoisted to module level
# so it is a compile-time constant inside the jitted function
coord_pair_type = nb.types.ListType(nb.types.int64)

@nb.njit
def _numba_accelerator_coord_calculation(
        _ids: np.ndarray, inverse_indices: np.ndarray, sparse_coords_0: np.ndarray, sparse_coords_1: np.ndarray
    ) -> dict:
    """Accelerate the transformation of a sparse array to a coordinate list using numba

    Args:
        _ids: Unique cell identifiers
        inverse_indices: Order of cell identifiers in array
        sparse_coords_0/1: Coordinates of unique points of cells

    Returns:
        dict[int64, list[list[int64, int64]]]
            Dictionary of cell ids and corresponding coordinates as a list of x/y coordinate pairs
    """
    # Initialize an empty dict that maps each cell id to a typed list of coordinate pairs
    indices = {}
    for _id in _ids:
        indices[_id] = List.empty_list(coord_pair_type)

    # Iterate through all elements and get their index
    for idx, _ix in enumerate(inverse_indices):
        # Get current cell id
        _id = _ids[_ix]
        # Build the [x, y] pair as a typed list and append it to the cell's list
        coords = List.empty_list(nb.types.int64)
        coords.append(sparse_coords_0[idx])
        coords.append(sparse_coords_1[idx])
        indices[_id].append(coords)

    return indices

The _create_coord_index_sparse function stays the same except for the added dictionary comprehension at the end to obtain numpy arrays.

def _create_coord_index_sparse(mask: np.ndarray) -> dict:
    """Create a coordinate index from a segmentation mask. 

    In the coordinate index each key is a unique class id and the value is a list of coordinates for this class.
    
    Args:
        mask (np.ndarray): A 2D segmentation mask
    
    Returns:
        dict: A dictionary containing the class ids as keys and a list of coordinates as values
    """
    #convert to a sparse array (this will be faster than iterating over the dense array because segmentation masks are usually sparse)
    sparse_mask = coo_array(mask)

    cell_ids = sparse_mask.data
    sparse_coords_0 = sparse_mask.coords[0].astype(np.int64)
    sparse_coords_1 = sparse_mask.coords[1].astype(np.int64)

    del sparse_mask #this is no longer needed for calculations and should free up memory when explicitly deleted
    gc.collect()

    _ids, inverse_indices = np.unique(cell_ids, return_inverse=True)
    indices = _numba_accelerator_coord_calculation(_ids, inverse_indices, sparse_coords_0, sparse_coords_1)
    
    # Convert lists to arrays 
    # Slight overhead due to additional dictionary comprehension
    return {k: np.array(v) for k, v in indices.items()}
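
For orientation, a minimal sketch of how this variant could be exercised; the 3x3 mask and the expected output are illustrative and not taken from the PR:

import numpy as np

# tiny 3x3 segmentation mask with two cells (ids 1 and 2); 0 is background
mask = np.array([[1, 1, 0],
                 [0, 2, 2],
                 [0, 0, 2]])

coords = _create_coord_index_sparse(mask)
# expected structure: {1: [[0, 0], [0, 1]], 2: [[1, 1], [1, 2], [2, 2]]} as numpy arrays
for cell_id, xy in coords.items():
    print(cell_id, xy.tolist())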

Comment from Collaborator:

Compared to the current implementation, this leads to a slight overhead for small and medium-sized images with many cells (due to the additional dictionary comprehension at the end), but the difference is not noticeable for large images.

[Figure: time comparison of the current and suggested implementations across image sizes and cell counts]

Comment from Collaborator:

Benchmarking code:

import numpy as np 
import pandas as pd 
import time 

from lmd.segmentation import _create_coord_index_sparse as _create_coord_index_sparse_pylmd
# note: the bare _create_coord_index_sparse used below (the "custom" method) is the
# typed-list variant suggested in the comments above, assumed to be defined in this session


def generate_data(shape: int, ncells: int, seed: int = 42):
    """Generate a shape x shape image with ncells as 2x2 rectangles"""
    np.random.seed(seed)
    array = np.zeros(shape=(shape, shape), dtype=np.uint64)

    for cell_id in range(1, ncells+1):
        x, y = np.random.randint(0, shape), np.random.randint(0, shape)
        array[y:y+2, x:x+2] = cell_id

    return array 


shapes = [1000, 10000, 100000]
cells = [100, 1000, 10000]
replicates = 5


results = []
for shape in shapes:
    for n in cells:
        if n > 0.5*shape**2:
            continue

        for replicate in range(replicates):
            print(shape, n, replicate)
            array = generate_data(shape, ncells=n)

            start_time = time.time()
            cell_coords_pylmd = _create_coord_index_sparse_pylmd(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "pylmd", 
                 "n_row": shape, 
                 "n_col": shape, 
                 "pixel": shape**2, 
                 "n_cells": n, 
                 "replicate": replicate, 
                 "time": total_time
                 }
            )

            start_time = time.time()
            cell_coords = _create_coord_index_sparse(array)
            total_time = time.time() - start_time
            results.append(
                {"method": "custom", 
                 "n_row": shape, 
                 "n_col": shape, 
                 "pixel": shape*shape, 
                 "n_cells": n, 
                 "replicate": replicate,
                 "time": total_time
                 }
            )

            for k in cell_coords_pylmd:
                assert (cell_coords_pylmd[k] == cell_coords[k]).all()

df = (pd.DataFrame(results)
      .groupby(["method", "n_row", "n_col", "pixel", "n_cells"])
      .agg({"time": "mean"})  # the string avoids pandas' deprecation of np.mean in agg
      .reset_index()
      .pivot_table(index=["pixel", "n_cells"], columns="method", values=["time"])
      .droplevel(0, axis=1)
      .reset_index()
)

# ~ 20 min run time 

+def _numba_accelerator_coord_calculation(_ids: np.ndarray,
+                                         inverse_indices: np.ndarray,
+                                         sparse_coords_0: np.ndarray,
+                                         sparse_coords_1: np.ndarray) -> dict:
+    """Accelerate the transformation of a sparse array to a coordinate list using numba"""
+
+    # initialize data structure for storing results: dict[cell_id] = np.array([[0, 0], ...])
+    # the final structure will be a dictionary with class_id as key and an array of coordinates as value
+    # the array for the coordinates is initialized with zero values and its size is dynamically increased during processing
+    # a second dictionary stores the next unfilled coordinate position for each cell
+    index_list = {}
+    stop_list = {}
+    initial_size = 32  # initial size of the placeholder array for storing coordinate information
+
+    for i in _ids:
+        index_list[i] = np.zeros((initial_size, 2), dtype=np.uint64)
+        stop_list[i] = 0
+
+    for idx, _ix in enumerate(inverse_indices):
+        _id = _ids[_ix]  # get current cell id
+        current_size = index_list[_id].shape[0]
+        # ensure the array for storing coords is large enough to hold the new coordinates
+        if stop_list[_id] >= current_size:
+            new_size = current_size * 2
+            new_array = np.zeros((new_size, 2), dtype=np.uint64)
+            new_array[:stop_list[_id]] = index_list[_id]
+            index_list[_id] = new_array
+
+        index_list[_id][stop_list[_id]][0] = sparse_coords_0[idx]
+        index_list[_id][stop_list[_id]][1] = sparse_coords_1[idx]
+        stop_list[_id] += 1
+
+    # trim the placeholder arrays down to the coordinates actually written
+    for _id in _ids:
+        index_list[_id] = index_list[_id][:stop_list[_id]]
+
+    return index_list
+
+def _create_coord_index_sparse(mask: np.ndarray) -> dict:
+    """Create a coordinate index from a segmentation mask.
+
+    In the coordinate index each key is a unique class id and the value is a list of coordinates for this class.
+
+    Args:
+        mask (np.ndarray): A 2D segmentation mask
+
+    Returns:
+        dict: A dictionary containing the class ids as keys and an array of coordinates as values
+    """
-    sparse_mask = coo_array(mask)
-    _ids, inverse_indices = np.unique(sparse_mask.data, return_inverse=True)
-    #initialize dictionary for saving coordinates of each class
-    #final structure will be a dictionary with class_id as key and a list of coordinates as value
-    coords = {key: [] for key in _ids}
-    # Iterate over sparse data once
-    for idx, _ix in enumerate(inverse_indices):
-        coords[_ids[_ix]].append((sparse_mask.coords[0][idx], sparse_mask.coords[1][idx]))
-    # Convert lists to numpy arrays
-    coords = {key: np.array(value) for key, value in coords.items()}
-    return(coords)
+    #convert to a sparse array (this will be faster than iterating over the dense array because segmentation masks are usually sparse)
+    sparse_mask = coo_array(mask)
+
+    cell_ids = sparse_mask.data
+    sparse_coords_0 = sparse_mask.coords[0]
+    sparse_coords_1 = sparse_mask.coords[1]
+
+    del sparse_mask  # no longer needed for calculations; deleting it explicitly frees the memory
+    gc.collect()
+
+    _ids, inverse_indices = np.unique(cell_ids, return_inverse=True)
+
+    return _numba_accelerator_coord_calculation(_ids, inverse_indices, sparse_coords_0, sparse_coords_1)

 @njit
 def _create_coord_index(mask,
 (diff truncated here)
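
The merged _numba_accelerator_coord_calculation grows each per-cell coordinate buffer by doubling its size whenever it fills up, so appending n coordinates costs O(n) amortized instead of the O(n^2) of resizing by one element at a time. A standalone sketch of that growth pattern outside numba follows; the function and variable names are illustrative, not part of the PR:

import numpy as np

def append_with_doubling(buf, count, row):
    # grow the buffer geometrically so that appending n rows costs O(n) amortized
    if count >= buf.shape[0]:
        new_buf = np.zeros((buf.shape[0] * 2, 2), dtype=buf.dtype)
        new_buf[:count] = buf
        buf = new_buf
    buf[count] = row
    return buf, count + 1

buf = np.zeros((2, 2), dtype=np.uint64)  # placeholder array, like initial_size above
count = 0  # next unfilled position, like stop_list above
for point in [(0, 0), (0, 1), (1, 1)]:
    buf, count = append_with_doubling(buf, count, point)
print(buf[:count])  # [[0 0] [0 1] [1 1]]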