Training DATA AVAILABILITY #17

Nianzhen-GU · 2022-10-17T08:04:17Z

NCBI databases (RefSeq (35) and GenBank (36), release July 2019) were queried for ‘prokaryotic virus’ and genomes >10 kb in length were retained. In addition, the IMG/VR database (release July 2018) (37) was downloaded, and sequences were limited to a minimum length of 10 kb. For the IMG/VR dataset, VIBRANT (38) (v1.2.1, -virome) and CheckV (39) (v0.6.0) were used to obtain circular and/or complete sequences. The resulting NCBI and IMG/VR datasets were dereplicated by 95% identity using the method described here (--derep_only --derep_id 0.95 --frac 0.70 --method longest) and combined, resulting in a total of 11,881 putatively complete genomes.
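For reference, the first step in the passage above (keeping only sequences of at least 10 kb) could be approximated with a few lines of plain Python. This is only a rough sketch and not the authors' pipeline: the function names `read_fasta` and `filter_by_length` are made up for illustration, the cutoff is treated as inclusive (the quoted text says ">10 kb" for NCBI and "a minimum length of 10 kb" for IMG/VR), and the VIBRANT, CheckV, and dereplication steps are not reproduced here.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = 10_000
) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence has at least min_length bases (assumed inclusive)."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


if __name__ == "__main__":
    # Tiny in-memory example; a real run would pass an open FASTA file instead.
    demo = [">short", "ACGT" * 10, ">long", "ACGT" * 2500, "ACGT" * 500]
    kept = list(filter_by_length(read_fasta(demo)))
    print([header for header, _ in kept])
```

On a real file, `read_fasta(open(path))` would stream the records, where `path` is a placeholder for wherever the downloaded sequences live.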

I wonder if there is any chance you could provide this processed dataset? I would appreciate it very much. Thank you!
