[BUG] The file direct_url.json is not found #1

Open
yy6768 opened this issue Apr 16, 2024 · 7 comments


@yy6768

yy6768 commented Apr 16, 2024

Dear author,

I am currently trying to use the noisebase library for my work, but I encountered an issue with the resolve_data_path() function in the data.py file.

dist_info = json.loads(metadata.distribution('noisebase').read_text('direct_url.json'))

In the function, there is a reference to a direct_url.json file, which is used to determine the location of the dataset. However, I was unable to find this file in the repository.

Could you please provide some additional information about this direct_url.json file? What is its purpose, and where can I find it? I would greatly appreciate if you could clarify this for me.

Additionally, if there is an alternative way to specify the dataset path, could you please share the details? I want to ensure that I can properly configure the library to work with my dataset.

Thank you in advance for your assistance. I look forward to your response.

Best regards.

@balintio
Owner

Pip automatically creates the direct_url.json file when you install the package. It's not part of the repository, as it saves data about your specific install configuration.

This part of resolve_data_path checks whether you installed the package as editable, i.e. pip install -e ./noisebase (probably after cloning the repo) or from PyPI, i.e. pip install noisebase. In the former case, the default data location is in the cloned repo folder; in the latter, it's the current working directory.
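For reference, here is a minimal sketch of that mechanism (not noisebase's actual code — direct_url.json is specified by PEP 610, and pip writes it into the package's dist-info directory):

```python
# Sketch only: how direct_url.json (PEP 610) distinguishes an editable
# install from a regular one. Not noisebase's actual implementation.
import json
import os
from importlib import metadata

def default_data_path(package: str) -> str:
    """Source tree for editable installs; current directory otherwise."""
    try:
        raw = metadata.distribution(package).read_text('direct_url.json')
    except metadata.PackageNotFoundError:
        raw = None
    if raw is not None:
        info = json.loads(raw)
        # Editable installs look like:
        # {"url": "file:///path/to/noisebase", "dir_info": {"editable": true}}
        if info.get('dir_info', {}).get('editable'):
            return info['url'].removeprefix('file://')
    # No direct_url.json (e.g. a plain PyPI install): fall back to cwd
    return os.getcwd()
```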

It's strange that you're missing the direct_url.json file. Can you provide the commands you used to install the package? Can you also provide the error you get from the Python interpreter?

In the meantime, you can give a data_path parameter to datasets to work around the issue, like so:

from noisebase import Noisebase

data_loader = Noisebase(
   'sampleset_v1', # Dataset name
   {
      'data_path': '/mnt/data/noisebase' # Wherever you store our datasets
   }
)

Scripts like nb-download also take a --data_path argument, which functions similarly.

nb-download --data_path="/mnt/data/noisebase" sampleset_v1

@yy6768
Author

yy6768 commented Apr 16, 2024

Thank you so much for your helpful response.

  1. I used pip install noisebase to install the package.
  2. The error messages from the Python interpreter:
  File "**********\Desktop\code\graduation_design\nppd\src\train.py", line 60, in main
    trainer.fit(
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 987, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\loops\utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 110, in run
    self.setup_data()
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 166, in setup_data
    dataloaders = _request_dataloader(source)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py", line 342, in _request_dataloader
    return data_source.dataloader()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py", line 309, in dataloader
    return call._call_lightning_datamodule_hook(self.instance.trainer, self.name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\pytorch\trainer\call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\noisebase\loaders\lightning\training_sample_v1.py", line 19, in val_dataloader
    return PytorchDataloader(get_epoch=lambda: self.trainer.current_epoch, **self.loader_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\lightning\fabric\utilities\data.py", line 324, in wrapper
    init(obj, *args, **kwargs)
  File "**********\anaconda3\envs\nppd\Lib\site-packages\noisebase\loaders\torch\training_sample_v1.py", line 200, in __init__
    ds = TrainingSampleDataset(
         ^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\noisebase\loaders\torch\training_sample_v1.py", line 26, in __init__
    data_path = resolve_data_path(data_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\site-packages\noisebase\data.py", line 13, in resolve_data_path
    dist_info = json.loads(metadata.distribution('noisebase').read_text('direct_url.json'))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "**********\anaconda3\envs\nppd\Lib\json\__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

@yy6768
Author

yy6768 commented Apr 16, 2024

I would like to try the train.py script in your nppd project, but I’m not sure how to configure

from noisebase import Noisebase


data_loader = Noisebase(
   'sampleset_v1', # Dataset name
   {
      'data_path': '/mnt/data/noisebase' # Wherever you store our datasets
   }
)

within the noisebase.lightning.Trainer class. It seems that this class does not provide an interface for the data path.

@balintio
Owner

NPPD uses Hydra to manage configurations. No need to do this now, but you could put this option in nppd/conf/base.yaml:

...
training_data:
  data_path: /mnt/data/noisebase
  samples: 8
  batch_size: 8
  ...
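Under the hood, Hydra composes the final config by deep-merging overrides like this into the base YAML. A stdlib-only sketch of that idea, with keys mirroring the snippet above (the path is a placeholder):

```python
# Stdlib-only sketch of the deep merge Hydra performs when composing configs.
# The keys mirror the base.yaml snippet above; the path is a placeholder.
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {'training_data': {'samples': 8, 'batch_size': 8}}
override = {'training_data': {'data_path': '/mnt/data/noisebase'}}
cfg = deep_merge(base, override)
# cfg['training_data'] now contains samples, batch_size, and data_path
```

With a standard Hydra entry point, the same override can also be given on the command line, e.g. `python train.py training_data.data_path=/mnt/data/noisebase` (assuming nppd's train.py follows the usual Hydra pattern).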

I think I fixed the issue in e57974c. Could you try updating the package (pip install noisebase --upgrade) and see if that works?

Pip actually doesn't create the direct_url.json file when installing the package from PyPI. I only tested installing a locally built wheel and missed this. 😅

@yy6768
Author

yy6768 commented Apr 19, 2024

Dear balintio,

I appreciate you providing a solution to the previous issue, even though I haven’t been able to test it yet as I’m still waiting for the training dataset to be downloaded.

Additionally, I had a follow-up inquiry. I’ve noticed that the dataset you are using appears to be in a custom file format, such as .0 and .zarray. I was wondering if there is a possibility of using more common data formats, such as HDR, PNG, or other publicly available data storage formats? This would allow me to potentially test your solution with my own dataset.

I’m curious to know if supporting more standard data storage formats is something you’ve considered or if there are any technical limitations that prevent that. I’m interested in exploring the potential of your solution, and having the flexibility to use my own dataset would be very helpful.

Please let me know your thoughts on this. I’m grateful for your assistance and look forward to your response.

Best regards.

@balintio
Owner

No worries, I'm happy to help. The point of publicly releasing a project is to make sure people can use it😉

What you're seeing are Zarr files. It's a very fast hierarchical data format, like HDF5, if you know that one. Initially, we tried storing everything in EXR files, but the speed of our data loaders quickly bottlenecked the training process. Loading hundreds of files for every batch just wasn't a scalable approach.

If you already have everything in individual files, packing them into Zarr files following our format should be easy. We don't have such example scripts in the repository yet, but I can add some and help write the script for your specific case. Just make sure you have all the data listed here in some form.

A large part of this difficulty comes down to using per-sample data. We plan to share per-pixel versions of our datasets in June. They will also use the Zarr format, but they should be even easier to convert to and from individual files.
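Until such example scripts land in the repository, here is a dependency-light sketch of the consolidation idea — many per-frame files packed into one chunked store so the loader opens one file per sequence instead of hundreds per batch. It uses numpy's .npz as a stand-in for Zarr, and the array name and shapes are illustrative, not Noisebase's actual format:

```python
# Sketch: consolidating per-frame arrays into one packed store, the idea
# behind Noisebase's Zarr layout. numpy's .npz is used as a stand-in so
# the example stays dependency-light; the real datasets use Zarr, and the
# array name/shape here are illustrative, not the format specification.
import os
import tempfile
import numpy as np

def pack_frames(frames: list, out_path: str) -> None:
    """Stack per-frame HxWxC arrays into one (N, H, W, C) array and save
    them as a single file, avoiding hundreds of file opens per batch."""
    sequence = np.stack(frames, axis=0)
    np.savez_compressed(out_path, color=sequence)

# Stand-ins for decoded EXR/PNG frames: 4 frames of 8x8 RGB
frames = [np.random.rand(8, 8, 3).astype(np.float32) for _ in range(4)]
out = os.path.join(tempfile.mkdtemp(), 'sequence.npz')
pack_frames(frames, out)
packed = np.load(out)['color']  # one read recovers the whole sequence
```

With Zarr the store would additionally be chunked along the frame axis, so a loader can read individual frames without decompressing the whole sequence.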

@yy6768
Author

yy6768 commented Apr 24, 2024

Dear balintio,

I am truly grateful for your prompt and thorough response. Your kindness and eagerness to help are much appreciated.

We are currently on a tight schedule as our project group is aiming to submit a related paper in May. It would be immensely helpful if you could provide a more detailed demonstration or guidance on how to pack our data into Zarr format datasets in the near future. This would greatly facilitate our preparation for the submission.

Additionally, I have noticed that the zeroday data mentioned in your paper seems to differ from the assets available at https://developer.nvidia.com/orca/beeple-zero-day. I am curious whether you rendered the zeroday dataset using Falcor. If so, could you please share the configuration parameters you used, including scene luminance, among others?

Thank you once again for your support and for considering our request. We look forward to potentially incorporating this valuable knowledge into our work.

Warm regards,
YY.
