Mashpit is comprised of three major parts: A MinHash database, its associated metadata, and the MinHash querying.
74
+
Mashpit consists of three major parts: a MinHash database, its associated metadata, and MinHash querying.
75
75
76
76
The database is created with an interface to Mash, called Sourmash [@Brown2016].
77
77
Each genome is imported by sketching it and adding it to a Sourmash signature database.
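The sketching step can be illustrated with a minimal, self-contained bottom-k MinHash in plain Python. This is only a sketch of the general technique, not Mashpit's or Sourmash's implementation: the function names, the BLAKE2b hash, and the default `k` and `size` values are assumptions chosen for illustration, and real tools additionally canonicalize k-mers against their reverse complements.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-length substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, size=1000):
    """Bottom-k sketch: keep the `size` smallest 64-bit k-mer hashes."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        for kmer in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Takes the `size` smallest hashes of the merged sketches and counts
    how many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    # Toy sequences and a small k, purely for demonstration.
    a = sketch("ACGTACGTTGCA", k=4, size=50)
    b = sketch("ACGTACGTTGCC", k=4, size=50)
    print(jaccard_estimate(a, b, size=50))
```

Because each genome is reduced to a fixed number of hashes, comparing two genomes costs the same regardless of genome length, which is what makes a signature database cheap to query.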
78
78
Each genome can also have an entry in the associated metadata.
79
79
These data include date of isolation, geography, host age range, and other information that could be useful in an epidemiological investigation.
80
-
Mashpit can build a species database from NCBI Pathogen Detection, termed a Mashpit taxon database or a custom database from user-provided genomes. The Mashpit taxon database is based on the available pathogen species on Pathogen Detection. For each SNP cluster of one species on Pathogen Detection, the set of all genomes in an SNP cluster is defined as:
80
+
Mashpit can build either a species database from NCBI Pathogen Detection, termed a Mashpit taxon database, or a custom database from a list of BioSample accessions. The Mashpit taxon database is based on the pathogen species available on Pathogen Detection. For each SNP cluster of a species on Pathogen Detection, the set of all genomes in the cluster is defined as:
81
81
$$G=\{g_1,g_2,\ldots,g_n\}$$
82
82
where $n$ is the number of genomes in the cluster.
83
83
The centroid genome $g_c$ is calculated as:
@@ -95,15 +95,14 @@ The webserver is built using Flask and can be run locally or deployed on a serve
95
95
The webserver provides a user-friendly interface for users to upload their query genomes and view the results.
96
96
97
97
# Performance
98
-
To evaluate the performance of Mashpit, we tested Mashpit on a server that runs Ubuntu 20.04.2 with an Intel Xeon CPU E5-2697 v4 2.30GHz and 256GB RAM.
99
-
The elapsed time of running a query was calculated for four of the major foodborne pathogens: _Salmonella_, _Listeria_, _E. coli_, and _Campylobacter_.
98
+
To evaluate performance, we tested Mashpit on a server running Ubuntu 20.04.2 with an Intel Xeon E5-2697 v4 CPU (2.30 GHz) and 256 GB of RAM.
100
99
We used NCBI Pathogen Detection SNP clusters that were versioned before January 2024. For each species, we then randomly selected 1000 genomes that were added to NCBI Pathogen Detection after January 2024.
101
-
Subsequently, we queried these genomes against Mashpit taxon databases and recorded the time taken for each step (\autoref{fig:figure}).
100
+
We measured the elapsed time for querying four major foodborne pathogens: _Salmonella_, _Listeria_, _E. coli_, and _Campylobacter_ (\autoref{fig:time_query}).
102
101
We also compared the query results with the true SNP cluster of the query genomes.
103
-
We calculated the proportion of true SNP clusters appearing among the top hits at various thresholds (\autoref{fig:figure}).
102
+
We calculated the proportion of true SNP clusters appearing among the top hits at various thresholds (\autoref{fig:accuracy}).
104
103
The 'threshold' is the number of top-ranked query hits within which the correct SNP cluster must appear to count as a success.
105
104
For instance, a threshold of 25 indicates that the correct cluster is among the top 25 hits.
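The threshold-based accuracy described above can be computed as a top-k hit rate. The following is a minimal sketch, assuming the ranked hits and true cluster labels are already available as plain dictionaries; the function name, data layout, and cluster identifiers are hypothetical and do not reflect Mashpit's output format.

```python
def top_k_hit_rate(ranked_hits, true_clusters, k):
    """Fraction of queries whose true SNP cluster is among the top k hits.

    ranked_hits:   dict mapping query ID -> list of cluster IDs, best hit first
    true_clusters: dict mapping query ID -> the query's true cluster ID
    k:             the threshold (number of top hits to inspect)
    """
    if not true_clusters:
        raise ValueError("true_clusters must not be empty")
    successes = 0
    for query, truth in true_clusters.items():
        # A query with no recorded hits counts as a miss.
        if truth in ranked_hits.get(query, [])[:k]:
            successes += 1
    return successes / len(true_clusters)


if __name__ == "__main__":
    # Hypothetical query and cluster identifiers, for illustration only.
    hits = {
        "query1": ["clusterA", "clusterB", "clusterC"],
        "query2": ["clusterD", "clusterE", "clusterF"],
    }
    truth = {"query1": "clusterB", "query2": "clusterF"}
    for k in (1, 2, 3):
        print(k, top_k_hit_rate(hits, truth, k))
```

Evaluating the same function over a range of `k` values yields the kind of curve that an accuracy-versus-threshold figure summarizes.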
106
-
Our findings indicate that _Salmonella_ demonstrated a 70% success rate in having the true cluster within the top 25 hits while _Campylobacter_ showed a success rate of approximately 90%. This variability reflects differences in how species are represented in the database and the limitations of MinHash-based methods for resolving closely related clusters.
105
+
Our findings indicate that _Salmonella_ achieved a 70% success rate for true clusters appearing within the top 25 hits, compared with approximately 90% for _Campylobacter_. This variability reflects differences in how species are represented in the database and the limitations of MinHash-based methods for resolving closely related clusters.
107
106
108
107
For _Salmonella_, which is the most frequently sequenced organism in NCBI Pathogen Detection, many closely related SNP clusters exist due to its extensive representation. Mash, being a MinHash-based method, operates at a resolution that is not always sufficient to distinguish fine-scale differences between these closely related clusters.
109
108
As a result, users analyzing _Salmonella_ should interpret Mashpit results as preliminary and consider following up with higher-resolution methods for definitive SNP cluster assignments.
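The resolution limit mentioned above can be made concrete with the published Mash distance formula, $D = -\frac{1}{k}\ln\frac{2j}{1+j}$, where $j$ is the estimated Jaccard index and $k$ is the k-mer size. The snippet below is a rough illustration only: the k-mer size of 21 and sketch size of 1000 are Mash's defaults and are assumed here, not taken from Mashpit's configuration.

```python
import math


def mash_distance(jaccard, k=21):
    """Mash distance from a Jaccard estimate: D = -(1/k) * ln(2j / (1 + j))."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if jaccard == 0.0:
        return 1.0  # Mash reports the maximum distance when no hashes are shared
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


if __name__ == "__main__":
    sketch_size = 1000  # assumed value (Mash default), not Mashpit's setting
    # A fixed-size sketch yields Jaccard estimates in steps of 1/sketch_size,
    # so the smallest non-zero distance corresponds to one differing hash.
    smallest = mash_distance((sketch_size - 1) / sketch_size)
    print(f"smallest non-zero distance: {smallest:.2e}")
```

Any two genomes whose true distance falls below this step are likely to be reported as identical or nearly so, which is consistent with the difficulty of separating closely related SNP clusters.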
@@ -113,14 +112,16 @@ As a result, users analyzing _Salmonella_ should interpret Mashpit results as pr
113
112
Mashpit provides a fast and lightweight platform for genomic epidemiology.
114
113
Its MinHash-based approach enables rapid querying of large datasets on standard scientific workstations, addressing key challenges for laboratories with limited computational resources or privacy concerns.
115
114
116
-
However, we note that the Mash distance does not correlate well to well-established distances such as MLST. And it has resolution limits when differentiating closely related clusters, particularly for species like _Salmonella_ that are highly represented in databases such as NCBI Pathogen Detection.
115
+
However, we note that the Mash distance does not correlate well with established distances such as Average Nucleotide Identity (ANI) for closely related genomes [@jain2018high]. Therefore, it has resolution limits when differentiating closely related clusters, particularly for species like _Salmonella_ that are highly represented in databases such as NCBI Pathogen Detection.
117
116
118
117
Therefore, we recommend that this platform be used as a first pass to filter out unrelated samples before applying a more established protocol such as MLST.
119
118
In conclusion, we believe that Mashpit is an essential genomic epidemiology tool.
120
119
121
120
# Figures
122
121
123
-

122
+
{ width=100% }
123
+
124
+
{ width=60% }