[DRAFT] FLAIR#2 Dataset and Datamodule Integration #2394

Open · wants to merge 50 commits into main
Conversation

@MathiasBaumgartinger (Contributor) commented Nov 5, 2024

FLAIR#2 dataset

The FLAIR #2 <https://github.com/IGNF/FLAIR-2> dataset is an extensive dataset from the French National Institute of Geographical and Forest Information (IGN) that provides a unique and rich resource for large-scale geospatial analysis.
The dataset is sampled countrywide and is composed of over 20 billion annotated pixels of very high resolution aerial imagery at 0.2 m spatial resolution, acquired over three years and different months (spatio-temporal domains).

FLAIR #2 is a dataset for semantic segmentation of aerial images. It contains aerial images, Sentinel-2 images, and masks for 13 classes, and is split into a training and a test set.

Dataset features:

* over 20 billion annotated pixels
* aerial imagery
    * 5x512x512
    * 0.2m spatial resolution
    * 5 channels (RGB-NIR-Elevation)
* Sentinel-2 imagery
    * 10-20m spatial resolution
    * 10 spectral bands
    * snow/cloud masks (with 0-100 probability)
    * multiple time steps (T)
    * Tx10xWxH, where T, W, and H vary per sample
* label (masks)
    * 512x512
    * 13 classes

Dataset classes:

0: "building",
1: "pervious surface",
2: "impervious surface",
3: "bare soil",
4: "water",
5: "coniferous",
6: "deciduous",
7: "brushwood",
8: "vineyard",
9: "herbaceous vegetation",
10: "agricultural land",
11: "plowed land",
12: "other"  

If you use this dataset in your research, please cite the following paper:

* https://doi.org/10.48550/arXiv.2310.13336

[Example plot: Sentinel-2 super-patch with the corresponding aerial patch marked by a small red rectangle]

Implementation Details

NonGeoDataset, __init__()

After discussions following #2303, we decided that, at least until the faulty mask data are fixed, the FLAIR2 dataset will be of type NonGeoDataset. Unlike common NonGeoDatasets, FLAIR2 exposes use_toy and use_sentinel arguments. The use_toy flag switches to the toy dataset, a small subset of the full data. The use_sentinel argument, on the other hand, decides whether a sample includes the augmented Sentinel-2 data provided by the maintainers of FLAIR2.
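For illustration, a minimal usage sketch (constructor arguments other than use_toy and use_sentinel are assumptions based on common NonGeoDataset conventions, not the final API):

from torchgeo.datasets import FLAIR2  # assuming the class is exported under this name

# Toy subset without the Sentinel-2 time series
ds = FLAIR2(root='data/flair2', split='train', use_toy=True, use_sentinel=False, download=True)
sample = ds[0]
# sample['image']: 5 x 512 x 512 aerial patch (RGB-NIR-Elevation)
# sample['mask']:  512 x 512 label map with values in [0, 12]

# Full dataset including the Sentinel-2 super-patches
ds_sen = FLAIR2(root='data/flair2', split='train', use_sentinel=True)
# ds_sen[0]['sentinel']: T x 10 x W x H time series (T, W, H vary per super-patch)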

_verify, _download, _extract

As each combination of split and sample type (i.e. [train, test] x [aerial, sentinel, labels]) comes as an individual zip download, download and extraction have to happen multiple times. The toy dataset, on the other hand, is contained in a single zip. Furthermore, to map the Sentinel-2 super-patches to the actual input image, a flair-2_centroids_sp_to_patch.json is required, which also has to be downloaded as an individual zip.
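A rough sketch of the resulting download logic (archive names, URL, and dictionary layout are placeholders, not the PR's exact structure):

import os
from torchvision.datasets.utils import download_and_extract_archive

# One archive per (split, sample type) combination, plus the centroids mapping
archives = {
    ('train', 'aerial'): 'flair_aerial_train.zip',
    ('train', 'sentinel'): 'flair_sen_train.zip',
    ('train', 'labels'): 'flair_labels_train.zip',
    # ... test archives follow the same pattern ...
    'centroids': 'flair_2_centroids_sp_to_patch.zip',
}

def download_all(root: str, base_url: str) -> None:
    for name in archives.values():
        if not os.path.exists(os.path.join(root, name)):
            download_and_extract_archive(base_url + name, root)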

_load_image, _load_sentinel, _load_target

For storage reasons, the elevation (5th band) of the image is stored as an unsigned integer, with the original height multiplied by 5 before encoding. We decided to divide the band by 5 to recover the original height and make a trained model more usable on other data. See Questions below.

As mentioned previously, additional metadata is needed to get from the sentinel .npy file to the actual area. Initially, for debugging reasons, we chose to return not the cropped image but the original data together with the cropping slices (i.e. indices). This way the images can be plotted in a more meaningful manner; otherwise, the resolution is so low that one can hardly recognize any features. This was crucial for finding the correct cropping logic (a classic y, x instead of x, y ordering mistake). We do not know whether this is smart for production code. See Questions below.
Moreover, the dimensions of the Sentinel-2 data, $T \times C \times W \times H$ with $C = 10$, vary in both $T$ and $W, H$. This is problematic for the datamodule: we have not done extensive research, but the varying dimensions seem to break it. Disabling the use_sentinel flag makes the datamodule work.
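To illustrate the current debugging-oriented return values, a sketch of how the crop indices could be applied and visualized (the key names and the slice layout are assumptions):

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# 'sample' is a dict as returned by the dataset with use_sentinel=True
sentinel = sample['sentinel']                 # T x 10 x H x W super-patch
(y0, y1), (x0, x1) = sample['crop_indices']   # indices locating the aerial footprint
cropped = sentinel[..., y0:y1, x0:x1]         # region actually covered by the aerial patch

# Show one time step/band of the full super-patch and outline the aerial footprint
fig, ax = plt.subplots()
ax.imshow(sentinel[0, 0], cmap='gray')
ax.add_patch(Rectangle((x0, y0), x1 - x0, y1 - y0, edgecolor='red', facecolor='none'))
plt.show()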

The labels include values from 1 to 19. The datapaper clearly suggests grouping classes > 13 into one class, other, due to underrepresentation. We followed this suggestion. Furthermore, the values were rescaled to the range 0 to 12. See Questions below.

Questions

  • Do you consider the elevation rescaling a distortion of the dataset? Shall I exclude it? The argument for it would be easier re-usability on new datasets.

For storage optimization reasons, this elevation information is multiplied by a factor of 5 and encoded as a 8bit unsigned integer datatype.

  • How shall we load/provide the Sentinel data? As cropped data, or in some other way? I do not see the current implementation as fit for production.

    • Also, how do we want to plot it? The small red rectangle in the example plot above is the actual region. The low resolution is quite observable there.
  • Shall we rescale the Classes to start from 0? Shall we group the classes as suggested in the datapaper?

  • The integrity check in download_url does not seem to work (in the unit tests). Why?

    • I have to make my own check_integrity call; otherwise verification passes even if the MD5s do not match.
  • The GitHub Actions on the forked repo produce a magic ruff error (https://github.com/MathiasBaumgartinger/torchgeo/actions/runs/11687694109/job/32556175383#step:7:1265). Can you help me resolve this mystery?

TODOs/FIXMEs

  • Extend tests for toy datasets and apply md5 check
  • Find correct band for plotting sentinel
  • Datamodule cannot handle sentinel data yet

Mathias Baumgartinger and others added 30 commits September 13, 2024 12:14
…mg and msk)

Updates in the custom raster dataset tutorial and the actual file documentation. The previous recommended approach (overriding `__getitem__`) is outdated.

Refs: microsoft#2292 (reply in thread)
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: Adam J. Stewart <[email protected]>
Not fully functioning yet, contains copy paste from other datasets
Additionally, some small code refactors are done
Using the entire sentinel-2 image and a matplotlib patch to debug, otherwise it is really hard to find correct spot due to low resolution
…y()` for sentinel

With the nested dict, it was not possible to download dynamically
md5s might change due to timestamps; this eases the process of changing the md5
@nilsleh (Collaborator) commented Nov 7, 2024

Thanks a lot for the PR! Here are some answers off the top of my head; others might have additional opinions.

Questions

* Do you consider the Elevation rescaling as distortion of the dataset? Shall I exclude it? The argument for it would be easier re-usability on new datasets.

Not sure about this one, but your proposition seems reasonable and as long as this is clearly documented, I would think it is fine.

* How shall we load/provide sentinel data? As cropped data or any other way. I do not see the current implementation as fit for production.

I think that depends on how the large Sentinel tile is expected to be used by, and useful for, a model. The very low resolution area does not seem useful, intuitively. I think having the flag you implemented is nice.

If you provide the entire tile, there will have to be a collate or cropping function in the datamodule; otherwise the dataloader cannot return consistent batches, as by default it tries to batch individual samples into a tensor by dict key.
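For example, a minimal padding collate along these lines (sample key names are assumed) would make the variable-size Sentinel tensors batchable:

import torch
from torch.utils.data import default_collate

def flair2_collate(batch):
    # Pull out the variable-size Sentinel tensors, each of shape (T_i, 10, H_i, W_i)
    sentinels = [s.pop('sentinel') for s in batch]
    out = default_collate(batch)  # image and mask have fixed shapes
    t = max(s.shape[0] for s in sentinels)
    h = max(s.shape[2] for s in sentinels)
    w = max(s.shape[3] for s in sentinels)
    padded = torch.zeros(len(sentinels), t, sentinels[0].shape[1], h, w)
    for i, s in enumerate(sentinels):
        padded[i, : s.shape[0], :, : s.shape[2], : s.shape[3]] = s
    out['sentinel'] = padded
    return out

# Passed to the dataloader via DataLoader(dataset, batch_size=..., collate_fn=flair2_collate)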

  * Also, how do we want to plot it? The small red rectangle in the example plot above is the actual region. The low resolution is quite observable there.

I actually like that approach.

* Shall we rescale the Classes to start from 0? Shall we group the classes as suggested in the datapaper?

Classes should definitely be ordinally mapped to start from 0; otherwise there will be issues with CrossEntropyLoss etc. Grouping is also okay since it follows the paper.

@@ -1,62 +1,64 @@
Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
Collaborator:

something seems off here, since the entire csv file changed, and not just the FLAIR2 addition.

Collaborator:

Probably line endings

Contributor Author:

Should I try to change to the original line endings, or does it not really matter?

Collaborator:

Change to the original line endings please, that will make git diff and git blame much more useful

from pyproj import CRS

# General hyperparams
IMG_SIZE = 512
@nilsleh (Collaborator), Nov 7, 2024:

For the toy data we usually use smaller image sizes like 32 just to speed up tests.

"""
super().__init__(FLAIR2, batch_size, num_workers, **kwargs)

self.patch_size = _to_tuple(patch_size)
Collaborator:

The definition of patch_size is useful if there is an additional kornia Resize augmentation under self.aug; however, we would also have to consider how this plays with the larger Sentinel tile. I am still not sure what benefit the additional Sentinel-2 data provides in terms of model performance if the high-resolution image only covers such a tiny area.
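As an illustration, one way patch_size could feed a resize that touches only the aerial image and mask (the exact datamodule plumbing here is an assumption, not the PR's code):

import kornia.augmentation as K

patch_size = (512, 512)
# Only the entries listed in data_keys are transformed; the Sentinel tensor would
# simply not be passed through this augmentation and stay at its native size.
aug = K.AugmentationSequential(
    K.Resize(patch_size),
    data_keys=['input', 'mask'],
)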

Contributor Author:

I am by no means an expert, but I have been reading some papers in the domain over the last year. I suppose the additional feature dimensions provided by the multispectral bands of Sentinel-2 have been shown to increase model performance.

Furthermore, in the datapaper [p. 6] of FLAIR2 the authors argue:

The interest of the extended spatial information provided by the Sentinel-2 super-patches is particularly visible in the last two rows of Figure 4. Indeed, the location on a beach or on a lake is difficult to determine from the aerial image alone, and could easily be confused with the sea for example in the last row.

The models leveraging the additional information outperform the standard baseline model:

| Model | mIoU | Building | Pervious Surface | Impervious Surface | Bare Soil | Water | Coniferous | Deciduous | Brushwood | Vineyard | Herbaceous Vegetation | Agricultural Land | Plowed Land |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | 0.547 | 0.8009 | 0.4727 | 0.6988 | 0.3076 | 0.7985 | 0.5758 | 0.7014 | 0.2392 | 0.6012 | 0.4653 | 0.5449 | 0.3583 |
| U-T&T | 0.5594 | 0.8285 | 0.4980 | 0.7204 | 0.2982 | 0.8009 | 0.6041 | 0.7189 | 0.2541 | 0.6580 | 0.4684 | 0.5478 | 0.3157 |
| U-T&T best | 0.5758 | 0.8368 | 0.4995 | 0.7446 | 0.3959 | 0.7952 | 0.6339 | 0.7239 | 0.2485 | 0.6678 | 0.4750 | 0.5513 | 0.3381 |

Collaborator:

Thanks for pointing that out, I should have looked at the paper myself... In that case, supporting a resize would be more involved if we want to resize the Sentinel image as well while keeping the resolution difference. Or we could still just provide the Sentinel-2 scene as context.

Contributor Author:

My suggestion would be to introduce a sentinel scaling parameter between 0 and 1, which could in turn be injected into the loading function:

  • 0 would crop the Sentinel data to the exact extent of the aerial image,
  • 1 would take the entire super-patch,
  • anything in between would scale linearly from the aerial to the original extent.

This way, both a straightforward, streamlined model (i.e. one that uses the Sentinel data at the aerial extent as just another source of information) and more sophisticated models that fuse the two sources at a later stage could be supported. Any suggestions for improvement?
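A minimal sketch of that idea (function and argument names are illustrative, not the PR's code):

import numpy as np

def crop_sentinel(sp: np.ndarray, cy: int, cx: int, aerial_px: float, scale: float) -> np.ndarray:
    """Crop a Sentinel super-patch of shape (T, 10, H, W).

    (cy, cx) is the centroid of the aerial patch in super-patch coordinates,
    aerial_px the aerial footprint in Sentinel pixels (512 * 0.2 m / 10 m ~ 10).
    scale=0 crops to the aerial extent, scale=1 keeps the full super-patch.
    """
    _, _, h, w = sp.shape
    # Linearly interpolate the half-window between the aerial footprint and the full extent
    half = 0.5 * ((1 - scale) * aerial_px + scale * min(h, w))
    y0, y1 = max(0, round(cy - half)), min(h, round(cy + half))
    x0, x1 = max(0, round(cx - half)), min(w, round(cx + half))
    return sp[:, :, y0:y1, x0:x1]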

num_workers: Number of workers for parallel data loading.
augs: Optional augmentations to apply to the dataset.
**kwargs: Additional keyword arguments passed to
:class:`~torchgeo.datasets.Potsdam2D`.
Collaborator:

change to FLAIR2

augs: Optional augmentations to apply to the dataset.
**kwargs: Additional keyword arguments passed to
:class:`~torchgeo.datasets.Potsdam2D`.
"""
Collaborator:

add versionadded for 0.7

torchgeo/datamodules/flair2.py (resolved)
torchgeo/datasets/flair2.py (resolved)
return (tensor - tensor.min()) / (tensor.max() - tensor.min())

# Define a colormap for the classes
cmap = ListedColormap([
Collaborator:

this can go above the init in class attributes

@MathiasBaumgartinger (Contributor Author):

Hi! Sorry for the late response.

First of all, thanks for the review. I will get to work on the proposed changes ;-)

@MathiasBaumgartinger (Contributor Author):

Apart from the sentinel data, I think everything is on track now. Let me know whether you have a preferred way of me handling this.

@JacobJeppesen:

I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think the reason stems from inconsistent naming: the zip file is named flair_2_centroids_sp_to_patch.zip, with an underscore after flair, while the json file inside is named flair-2_centroids_sp_to_patch.json, with a hyphen after flair. I.e., the correct download url is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip.

Great work though! Awesome to see FLAIR being implemented 😃

to_extract.append(dir_name)
continue

files_glob = os.path.join(downloaded_path, "**", self.globs[train_or_test])


When instantiating the datamodule, it will extract the flair_sen_train zip file on every run. I traced it here, and I believe it's because, when going through dir_name="flair_sen_train", I get files_glob="/home/jhj/dataset_test/FLAIR2/flair_sen_train/**/SEN2_*{0}.npy", with the curly braces at the end. This makes the glob.glob() fail in the next line. I tried manually setting it to files_glob="/home/jhj/dataset_test/FLAIR2/flair_sen_train/**/SEN2_*.npy", which fixed the issue.


Seems like it's the same with flair_2_sen_test, which also gets the {0} in files_glob, and extracts the zip-file on every run.

Contributor Author:

> When instantiating the datamodule, it will extract the flair_sen_train zip file on every run. I traced it here, and I believe it's because, when going through dir_name="flair_sen_train", I get files_glob="/home/jhj/dataset_test/FLAIR2/flair_sen_train/**/SEN2_*{0}.npy", with the curly braces at the end. This makes the glob.glob() fail in the next line. I tried manually setting it to files_glob="/home/jhj/dataset_test/FLAIR2/flair_sen_train/**/SEN2_*.npy", which fixed the issue.

I see why the unexpected behavior appears. For clarification: the format string is necessary, as there are two kinds of files inside the directory: mask files and data files. To select only the corresponding mask or data file, I had to format the glob.

Would you be willing to share with me your code snippet so I can debug this real quick?
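A small illustration of the formatting issue (the path and suffixes are examples, not the exact values from the PR):

import glob
import os

root = '/home/jhj/dataset_test/FLAIR2'
pattern = os.path.join(root, 'flair_sen_train', '**', 'SEN2_*{0}.npy')

# Unformatted, the literal '{0}' can never match a file, so the existence check
# fails and the archive is re-extracted on every run.
missing = glob.glob(pattern, recursive=True)  # always []

# Formatting with the intended suffix (e.g. 'data' or 'masks') restores the check
data_files = glob.glob(pattern.format('data'), recursive=True)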


I just manually removed the {0} in the files_glob string before running glob.glob():

files_glob = os.path.join(downloaded_path, "**", self.globs[train_or_test])
if "flair_sen_train" in files_glob:
    files_glob="/home/jhj/dataset_test/FLAIR2/flair_sen_train/**/SEN2_*.npy"
if not glob.glob(files_glob, recursive=True):
    to_extract.append(dir_name)

I.e., hard-coded files_glob so I could check if it stopped extracting the zip-file on every run. Not really a solution, but just wanted to see if I was the right place in the code.


Just tested it again, and all download and data extraction works now 👍

@JacobJeppesen:

Apologies if the comments were a bit scattered and not really a proper review. I was just trying to run the code, and commented on the go about any issues I encountered 🙂

flair_2_centroids_sp_to_patch.zip vs. flair-2_centroids_sp_to_patch.json

Refs: microsoft#2394 (comment)
@MathiasBaumgartinger (Contributor Author):

> I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think the reason stems from inconsistent naming: the zip file is named flair_2_centroids_sp_to_patch.zip, with an underscore after flair, while the json file inside is named flair-2_centroids_sp_to_patch.json, with a hyphen after flair. I.e., the correct download url is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip.
>
> Great work though! Awesome to see FLAIR being implemented 😃

Weird; I was sure I tried this. Thanks for letting me know.

Mathias Baumgartinger added 3 commits November 25, 2024 14:58
Instead of loading both the Sentinel data and the cloud/snow masks into one sample, store them separately. Same with the crop_indices.
@JacobJeppesen commented Nov 25, 2024

> > I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think the reason stems from inconsistent naming: the zip file is named flair_2_centroids_sp_to_patch.zip, with an underscore after flair, while the json file inside is named flair-2_centroids_sp_to_patch.json, with a hyphen after flair. I.e., the correct download url is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip.
> > Great work though! Awesome to see FLAIR being implemented 😃
>
> Weird; I was sure I tried this. Thanks for letting me know.

Yeah, it seems like a weird error. Perhaps they changed it on their side, so that the zip-file naming is now consistent, although the json file inside still uses a hyphen instead of an underscore 🙂

Comment on lines 359 to 362
files = [
dict(image=image, sentinel=sentinel, mask=mask)
for image, sentinel, mask in zip(images, sentinels, masks)
]


It seems like this part filters out most of the aerial images. I think it's because each Sentinel-2 image covers multiple aerial images. E.g., flair_aerial_train/D004_2021/Z1_NN contains 100 aerial images, and the corresponding flair_sen_train/D004_2021/Z1_NN contains one .npy file. So the association between Sentinel-2 and the aerial images should probably be done on a per-folder basis instead of a per-file basis. E.g., something like:

        # Key each Sentinel super-patch by its "<domain>/<zone>" path components
        sentinel_lookup = {"/".join(s["data"].split("/")[-4:-2]): s for s in sentinels}
        files = [
            dict(image=image,
                 # Match each aerial image to the super-patch covering its folder
                 sentinel=sentinel_lookup["/".join(image.split("/")[-4:-2])],
                 mask=mask)
            for image, mask in zip(images, masks)
        ]

Contributor Author:

Oh wow! What a severe mistake. Thanks for clarifying. Your approach seems very valid.


That happens sometimes 🙂

I just tried training a model on the aerial data now with the code posted above, and all seems to be working. I haven't looked closely at the Sentinel-2 data, although they looked correct when I sampled a couple of elements in the files list in the debugger.

This bug caused a good 90% of all images to be omitted.

Refs: microsoft#2394 (comment)
@JacobJeppesen:

It gets all the correct aerial imagery files now, and the download and extraction are working. However, I'm getting a much more annoying bug now: a random IReadBlock failed at X offset ... error during training.

I've had this issue before with Rasterio, and it was due to concurrency issues. This shouldn't really happen here, as the dataloader just iterates through a list of files, and each sample is a separate tif file. I can see there's also been a discussion here (#594), so perhaps @adamjstewart or @calebrob6 knows what's going on. My own guess is that it could be multiple workers having GDAL do simultaneous scans of the same directory when opening individual files in said directory. I.e., there are often many image files in each sub-folder of the dataset, and GDAL will scan the sub-folder for metadata when opening a file inside it, which could be the culprit. For now at least, I've added the code below to datasets/flair2.py and am testing whether it solves the issue.

This has been added below the imports at the top:

# Set GDAL to avoid scanning read directories to avoid IReadBlock errors in Rasterio when using num_workers>0 in the 
# dataloader. We should not have any concurrency issues for this dataset, as each worker should read individual tif files,
# but it seems like it might occasionally happen when multiple workers scan the same directory simultaneously. 
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

I've also minimized the time the data reader is open in Rasterio by closing it before doing the tensor operations. This shouldn't really solve the issue here, but I added it anyway as good practice. In the _load_image() function:

with rasterio.open(path) as f:
    array: np.typing.NDArray[np.int_] = f.read()
tensor = torch.from_numpy(array).float()
if "B05" in self.bands:
    # Height channel will always be the last dimension
    tensor[-1] = torch.div(tensor[-1], 5)

and in the _load_target() function:

with rasterio.open(path) as f:
    array: np.typing.NDArray[np.int_] = f.read(1)
tensor = torch.from_numpy(array).long()
# According to datapaper, the dataset contains classes beyond 13
# however, those are grouped into a single "other" class
# Rescale the classes to be in the range [0, 12] by subtracting 1
torch.clamp(tensor - 1, 0, len(self.classes) - 1, out=tensor)

Not sure if it fixes it, but I'll train a handful of models and try it out. The issue coming from simultaneous directory scans is a bit of a guess.

@adamjstewart (Collaborator):

Never had this multiprocessing issue before, and I don't see anything in the code that looks sus. Can you reproduce this issue with the unit tests? If not I can also try downloading the dataset and reproducing locally.

@JacobJeppesen:

Unfortunately not. It pops up at random after training for quite a while. I encountered it the first time after having trained for almost 2 full epochs, i.e. it had already read the entire dataset once and then got the IReadBlock error during the second epoch. There have been some reports of similar errors here: rasterio/rasterio#2053. That's generally when reading the same file, though, and I agree, nothing looks suspicious in the code. I actually thought my disk was broken, but then I remembered that I had the exact same thought the last time I got this error with Rasterio.

I'll try and let it run a few more times and see if it keeps happening. I just saw in rasterio/rasterio#2053 (comment) that upgrading to a newer version of GDAL might help, so I'll also give that a try.
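For reference, a quick way to check which GDAL version the installed rasterio is built against (standard rasterio wheels expose this attribute directly):

import rasterio

print(rasterio.__version__)        # rasterio version
print(rasterio.__gdal_version__)   # GDAL version linked into the wheel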

Labels: datamodules (PyTorch Lightning datamodules), datasets (Geospatial or benchmark datasets), documentation (Improvements or additions to documentation), testing (Continuous integration testing)