Add subcommand for searching nf-core test-datasets #3487

Open
jfy133 opened this issue Mar 11, 2025 · 11 comments
Labels: command line tools (Anything to do with the cli interfaces), good-first-issue

@jfy133
Member

jfy133 commented Mar 11, 2025

Description of feature

When writing modules or pipelines, I often have problems remembering the exact URLs/locations of test data files in the nf-core/test-datasets repository. I normally have to resort to going to GitHub in my browser and navigating to the right directory, which takes a lot of time.

It would be cool to have an nf-core subcommand that lets you 'search' for the exact path with clever autocomplete prompts, and ideally prints the line as you would want to copy/paste it into your test.

e.g.

nf-core test-dataset search <start typing keywords for autocomplete>

nf-core test-dataset search sarscov2/genome

And spits out:

For use in modules:

    file(params.modules_testdata_base_path + 'genomics/sarscov2/genome/genome.fasta', checkIfExists: true)


For use in pipelines:

    file(params.pipeline_testdata_base_path + 'genomics/sarscov2/genome/genome.fasta', checkIfExists: true)
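
A rough sketch of how both lines could be produced from one path. The helper name `format_snippets` is hypothetical (it is not part of nf-core/tools), and the two parameter names are copied as written in this proposal rather than checked against any pipeline template:

```python
def format_snippets(rel_path: str) -> dict[str, str]:
    """Build copy/paste-ready Nextflow lines for one test data file.

    `rel_path` is the path relative to the base path parameter,
    e.g. 'genomics/sarscov2/genome/genome.fasta'.
    The parameter names below are taken from the proposal (assumption).
    """
    template = "file(params.{param} + '{path}', checkIfExists: true)"
    return {
        "modules": template.format(param="modules_testdata_base_path", path=rel_path),
        "pipelines": template.format(param="pipeline_testdata_base_path", path=rel_path),
    }


if __name__ == "__main__":
    snippets = format_snippets("genomics/sarscov2/genome/genome.fasta")
    for context, line in snippets.items():
        print(f"For use in {context}:")
        print(f"    {line}")
```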

@mashehu mashehu modified the milestones: 3.4.0, 3.3.0 Mar 11, 2025
@mashehu mashehu added command line tools Anything to do with the cli interfaces and removed enhancement labels Mar 11, 2025
@JulianFlesch JulianFlesch self-assigned this Mar 13, 2025
@JulianFlesch

JulianFlesch commented Mar 13, 2025

Hi there, neat idea. I'll get on it. A few things to clarify up front.

@jfy133 Is there a curated, text-format list of datasets other than the list of GitHub branches that is referred to over on the test-datasets repo?

@mirpedrol Is it OK to fetch a list from somewhere here, or do we want to maintain a static list in the code?

@jfy133
Member Author

jfy133 commented Mar 13, 2025

I think there are two levels that we would need to work on:

  • Branches: modules, then each pipeline has a dedicated branch (you could cross ref with https://nf-co.re/pipelines for example, but tools might already have a function to pull this list)
  • The filetree of each branch

In terms of curated-lists, not really.

For modules there are the data descriptions: https://github.com/nf-core/test-datasets/tree/modules?tab=readme-ov-file#data-description but they are likely missing a lot.

For each pipeline branch, it depends on the pipeline how well documented it is.

The idea in my head in the original comment was simply to search/display the file path and name of each file in the modules branch, with no additional information about each file. I also didn't consider the branch (but I think it does make sense).

In any case, the modules branch is quite well structured/sorted, so the directories in the file path alone should be sufficient for identifying relevant files.

Does that sort of help?

@mirpedrol
Member

It is ok to fetch a list from the repo 🙂
You can get the names of the pipelines from https://github.com/nf-core/website/blob/main/public/pipeline_names.json; we use this file in other places in the code too.

As @jfy133 says, I would just list the file paths for now; we can think about extending this later.

@JulianFlesch

JulianFlesch commented Mar 14, 2025

I don't think the list of pipelines alone is helpful for what we are trying to achieve here @mirpedrol, or am I missing something?

If I understand @jfy133 correctly, he wants a list of all files in all branches and then to be able to search through that list.
However, this will be tricky (or at least very slow) to do as part of shell autocompletion, as a lot of requests to the GitHub API have to be made:

  1. One to fetch the list of branches (github api call)
  2. One request for each branch to fetch the respective file tree (github api call for phaseimpute branch)
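
The two kinds of request above can be sketched roughly as follows, using the GitHub REST API endpoints for listing branches and for fetching a recursive git tree. All function names here are hypothetical, and this is not the code from any nf-core/tools branch. Only the first page of branches (up to 100) is covered, and no authentication is sent:

```python
import json
import urllib.request

API = "https://api.github.com/repos/nf-core/test-datasets"


def branches_url(per_page: int = 100) -> str:
    """URL for step 1: list the branches of the repository (first page only)."""
    return f"{API}/branches?per_page={per_page}"


def tree_url(branch: str) -> str:
    """URL for step 2: the recursive file tree of one branch."""
    return f"{API}/git/trees/{branch}?recursive=1"


def file_paths(tree_response: dict) -> list[str]:
    """Keep only files ('blob' entries) from a git tree response."""
    return [
        entry["path"]
        for entry in tree_response.get("tree", [])
        if entry.get("type") == "blob"
    ]


def fetch_json(url: str):
    """Perform one unauthenticated GET request (subject to rate limits)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_all_files() -> dict[str, list[str]]:
    """One request for the branch list, then one more request per branch."""
    branch_names = [branch["name"] for branch in fetch_json(branches_url())]
    return {name: file_paths(fetch_json(tree_url(name))) for name in branch_names}
```

Calling `fetch_all_files()` makes 1 + N requests for N branches, which is the cost described above. The tree response also carries a `truncated` flag for very large trees, which a real implementation would probably need to check.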

I see two solutions that are feasible:

  1. Either automatically create a list similar to the list of pipelines suggested by @mirpedrol and put it in a similar place (e.g. https://nf-co.re/test-datasets.json)
  2. or: skip the autocompletion, just hit enter with a rough search term, and then make all the required requests to the GitHub API

Please let me know if I am missing something obvious ;)

@mirpedrol
Member

My idea was to use the list of pipelines that we have in the JSON I sent to know the branch names + modules (we would need a check to make sure that the branch exists). And I would ignore any other branch that's not a pipeline or modules.
With this, we can fetch the files from these branches.

Regarding autocompletion, I wouldn't allow this for now.
We could have an option to provide the branch name, and a keyword to filter the file paths.
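
A minimal sketch of that branch check, assuming the pipeline names and the remote branch names have already been loaded into plain lists (the JSON layout of pipeline_names.json is not assumed here). The function name `searchable_branches` is hypothetical:

```python
def searchable_branches(pipeline_names, remote_branches):
    """Return 'modules' plus every pipeline name that exists as a remote branch.

    Branches that are neither a pipeline nor 'modules' are ignored,
    and pipelines without a matching branch are dropped.
    """
    existing = set(remote_branches)
    candidates = ["modules", *pipeline_names]
    return [name for name in candidates if name in existing]


if __name__ == "__main__":
    pipelines = ["taxprofiler", "phaseimpute", "not-yet-a-branch"]
    remote = ["modules", "taxprofiler", "phaseimpute", "some-junk-branch"]
    print(searchable_branches(pipelines, remote))
```

With this, only the allowed branches would need their file trees fetched.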

@mirpedrol
Member

Thinking about this again, with the modules repo, we can clone the repo to our cache. We could do something similar here. But I see this is turning into a bigger project now 😄

@jfy133
Member Author

jfy133 commented Mar 14, 2025

As Júlia says, the list of pipeline names is indeed useful: we aren't so strict about branches on the modules repo, so there is a lot of potential 'junk'.

@mirpedrol is there a reason why you don't like the autocompletion? I would find this extremely helpful: if I can't remember the exact path, having to re-run the command every time to search for a new term would be quite annoying and would put people off. Pulling the file tree once and being able to rapidly explore it would be helpful. Is it a technical reason?

One way around this could be to allow --all to print the entire tree of that branch, and then scroll to find the thing you are looking for. But while printing the entire tree might be OK for most pipeline branches, it absolutely would not be for the modules branch.

So I think we are going towards something like:

nf-core test-datasets list --branch taxprofiler <keyword/pattern to search>
nf-core test-datasets list --branch modules <keyword/pattern to search>
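
A rough sketch of that command shape. It uses the standard library's argparse purely to stay self-contained and is not meant to reflect how nf-core/tools defines its commands. The names `build_parser` and `filter_paths` and the `ignore_case` parameter are hypothetical, and the "keyword/pattern" is treated as a plain substring:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring `list --branch <name> <keyword>` (illustrative only)."""
    parser = argparse.ArgumentParser(prog="test-datasets-list")
    parser.add_argument("--branch", required=True, help="branch to list, e.g. 'modules'")
    parser.add_argument("keyword", nargs="?", default="", help="substring to filter file paths")
    return parser


def filter_paths(paths, keyword, ignore_case=True):
    """Keep the paths that contain `keyword`; an empty keyword keeps everything."""
    if ignore_case:
        keyword = keyword.lower()
        return [path for path in paths if keyword in path.lower()]
    return [path for path in paths if keyword in path]


if __name__ == "__main__":
    args = build_parser().parse_args(["--branch", "modules", "sarscov2/genome"])
    # In a real command these paths would come from the branch's file tree.
    example_paths = [
        "data/genomics/sarscov2/genome/genome.fasta",
        "data/genomics/homo_sapiens/genome/genome.fasta",
    ]
    for path in filter_paths(example_paths, args.keyword):
        print(f"(Branch: {args.branch}) {path}")
```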

@mirpedrol
Member

It was more of a practical concern. Having to parse all branches + all files to get the complete list for autocompletion sounded like too much. But if we use the list of possible branches and fetch files from that branch only it reduces things considerably 👍

@jfy133
Member Author

jfy133 commented Mar 14, 2025

Oh yes, definitely! My original proposal only had the modules branch in mind 😅

@JulianFlesch

Alright! Feel free to check it out on my fork and let me know if you have any suggestions: https://github.com/JulianFlesch/nf-core-tools/tree/feature/test-datasets. I'll wrap this up (hopefully by the end of the week) with tests for the new functions and classes and then open a pull request.

Due to the concerns mentioned above, I dropped the autocompletion completely in favor of the list subcommand.

Important: To work properly, GitHub authentication via the gh command line tool or the GITHUB_TOKEN environment variable must be available! Otherwise the API throttles or blocks requests.
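
For illustration, a minimal sketch of picking up the token from the environment when building request headers. The helper name `github_headers` is hypothetical and this is not the code on the fork. The gh command line tool route is not covered here:

```python
import os


def github_headers(env=None) -> dict:
    """Build request headers, adding a token from GITHUB_TOKEN when it is set.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN")
    if token:
        # Without this header, requests count against the much lower
        # unauthenticated rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


if __name__ == "__main__":
    if "Authorization" in github_headers():
        print("GITHUB_TOKEN found, requests will be authenticated")
    else:
        print("No GITHUB_TOKEN set, requests will be unauthenticated")
```

At the time of writing, GitHub documents a limit of 60 requests per hour for unauthenticated REST API requests versus 5,000 per hour with a personal access token, which is why the token matters once one request per branch is involved.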

New subcommands:

$ python nf_core/__main__.py test-datasets --help


                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\ 
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 3.2.0 - https://nf-co.re


                                                                                                    
 Usage: __main__.py test-datasets [OPTIONS] COMMAND [ARGS]...                                       
                                                                                                    
 Commands to manage nf-core test datasets.                                                          
                                                                                                    
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
│ list            List all data files available in the nf-core/test-datasets repository on github  │
│ list-branches   List all remote branches in the nf-core/test-dataset repository on github        │
│ search          Search for files in the nf-core/test-datasets repository on github               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --help  -h    Show this message and exit.                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Example Search:

$ python nf_core/__main__.py test-datasets search -ib modules sarscov2/genome/bed


                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\ 
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 3.2.0 - https://nf-co.re


(Branch: modules) data/genomics/sarscov2/genome/bed/baits.bed
(Branch: modules) data/genomics/sarscov2/genome/bed/bed6alt.as
(Branch: modules) data/genomics/sarscov2/genome/bed/test.bed
(Branch: modules) data/genomics/sarscov2/genome/bed/test.bed.gz
(Branch: modules) data/genomics/sarscov2/genome/bed/test.bed12
(Branch: modules) data/genomics/sarscov2/genome/bed/test.bedpe
(Branch: modules) data/genomics/sarscov2/genome/bed/test2.bed

@mashehu
Contributor

mashehu commented Mar 18, 2025

Please open the PR, so it is easier to check what has been done 🙂 It can be set as a draft and labeled as WIP to make sure it is not merged prematurely.
