stats: YAML output by OPTION, new OPTION -m,--multiqc for curated MultiQC output
subwaystation committed Aug 23, 2021
1 parent 413dd08 commit 64df78a
Showing 3 changed files with 48 additions and 39 deletions.
5 changes: 3 additions & 2 deletions docs/rst/commands/odgi.rst
@@ -71,7 +71,7 @@ graph.sorted.og -p bSnSnS
:ref:`odgi squeeze` -f
input_graphs.txt -o graphs.og

:ref:`odgi stats` -i graph.og -S
:ref:`odgi stats` -i graph.og -y

:ref:`odgi test`

@@ -320,7 +320,8 @@ multiple graphs into the same file.
| **odgi stats** [**-i, --idx**\ =\ *FILE*] [*OPTION*]…
| The odgi stats command produces statistics of a variation graph.
Among other metrics, it can calculate the #nodes, #edges, #paths and
the total nucleotide length of the graph.
the total nucleotide length of the graph. It can also produce a YAML file that is perfectly curated for the input of
`MultiQC's ODGI module <https://multiqc.info/docs/#odgi>`__.
| **odgi test** [<TEST NAME|PATTERN|TAGS> …] [*OPTION*]…
| The odgi test command starts all unit tests that are implemented in
11 changes: 9 additions & 2 deletions docs/rst/commands/odgi_stats.rst
@@ -16,7 +16,8 @@ DESCRIPTION

The odgi stats command produces statistics of a variation graph.
Among other metrics, it can calculate the #nodes, #edges, #paths and the
total nucleotide length of the graph.
total nucleotide length of the graph. It can also produce a YAML file that is perfectly curated for the input of
`MultiQC's ODGI module <https://multiqc.info/docs/#odgi>`__.

OPTIONS
=======
@@ -50,6 +51,9 @@ Summary Options
| **-D, --delim**\ =\ *STRING*
| The part of each path name before this delimiter is a group identifier, which when specified will ensure that odgi stats collects the summary information per group and not per path.
| **-f, --file-size**
| Show the file size in bytes.
Sorting Goodness Eval Options
---------------------------

@@ -84,7 +88,10 @@ IO Format Options
-----------------

| **-y, --yaml**
| Setting this option prints all statistics in YAML format instead of pseudo TSV to stdout. This includes *-S,--summarize*, *-W,--weak-connected-components*, *-L,--self-loops*, *-b,--base-content*, *-l,--mean-links-length*, *-g,--no-gap-links*, *-s,--sum-path-nodes-distances*, and *-d,--penelize-different-orientation*. *-p,path-statistics* is still optional. Not applicable to *-N,--nondeterministic-edges*!
| Setting this option prints all selected statistics in YAML format instead of pseudo TSV to stdout.
| **-m, --multiqc**
| Setting this option prints all! statistics in YAML format instead of pseudo TSV to stdout. This includes *-S,--summarize*, *-W,--weak-connected-components*, *-L,--self-loops*, *-b,--base-content*, *-l,--mean-links-length*, *-g,--no-gap-links*, *-s,--sum-path-nodes-distances*, *-f,--file-size*, and *-d,--penalize-different-orientation*. *-p,path-statistics* is still optional. Not applicable to *-N,--nondeterministic-edges*. Overwrites all other given OPTIONs! The output is perfectly curated for the `ODGI MultiQC module <https://multiqc.info/docs/#odgi>`__.
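The division of labour between the two options can be sketched in a few lines of C++. This is a minimal illustration, assuming only what the option descriptions above state: `-m, --multiqc` forces every statistic to be computed and switches the output to YAML, while `-y, --yaml` only changes the output format of whatever was selected. The `StatsFlags` struct and both helper functions are hypothetical names introduced for this sketch; odgi itself queries its `args::Flag` objects directly.

```cpp
// Hypothetical container for a few of the flags discussed above.
// These names are illustrative and do not appear in odgi.
struct StatsFlags {
    bool summarize = false;  // -S, --summarize
    bool file_size = false;  // -f, --file-size
    bool yaml = false;       // -y, --yaml
    bool multiqc = false;    // -m, --multiqc
};

// A statistic is computed when it was selected explicitly
// or when --multiqc forces all statistics.
inline bool should_compute(bool selected, const StatsFlags& flags) {
    return selected || flags.multiqc;
}

// Output is YAML when either --yaml or --multiqc was given;
// otherwise the pseudo TSV format is used.
inline bool emit_yaml(const StatsFlags& flags) {
    return flags.yaml || flags.multiqc;
}
```

With only `yaml` set, `should_compute(flags.summarize, flags)` stays false unless `-S` was also given, which is the behavioural difference from the previous `-y` semantics.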
Threading
---------
71 changes: 36 additions & 35 deletions src/subcommand/stats_main.cpp
@@ -45,25 +45,26 @@ int main_stats(int argc, char** argv) {
//args::Flag path_multicov_count(parser, "multicountcov", "provide a histogram of coverage over counts of paths", {'L', "multi-count-coverage"});
//args::ValueFlag<std::string> path_bedmulticov(parser, "BED", "for each BED entry, provide a table of path coverage over unique multisets of paths in the graph. Each unique multiset of paths overlapping a given BED interval is described in terms of its length relative to the total interval, the number of path traversals, and unique paths involved in these traversals.", {'B', "bed-multicov"});
args::ValueFlag<std::string> path_delim(summary_opts, "STRING", "The part of each path name before this delimiter is a group identifier, which when specified will ensure that odgi stats collects the summary information per group and not per path.", {'D', "delim"});
args::Group sorting_goodness_evaluation_opts(parser, "[ Sorting Goodness Eval Options ]");
args::ValueFlag<std::string> layout_in_file(sorting_goodness_evaluation_opts, "FILE", "Load the 2D layout coordinates in binary layout format from this *FILE*. The file name usually ends with *.lay*. The sorting goodness evaluation will then be performed for this *FILE*. When the layout coordinates are provided, the mean links length and the sum path nodes distances statistics are evaluated in 2D, else in 1D. Such a file can be generated with *odgi layout*.", {'c', "coords-in"});
args::Flag mean_links_length(sorting_goodness_evaluation_opts, "mean_links_length", "Calculate the mean links length. This metric is path-guided and"
args::Flag _file_size(summary_opts, "file-size", "Show the file size in bytes.", {'f', "file-size"});
args::Group sorting_goodness_evaluation_opts(parser, "[ Sorting Goodness Eval Options ]");
args::ValueFlag<std::string> layout_in_file(sorting_goodness_evaluation_opts, "FILE", "Load the 2D layout coordinates in binary layout format from this *FILE*. The file name usually ends with *.lay*. The sorting goodness evaluation will then be performed for this *FILE*. When the layout coordinates are provided, the mean links length and the sum path nodes distances statistics are evaluated in 2D, else in 1D. Such a file can be generated with *odgi layout*.", {'c', "coords-in"});
args::Flag mean_links_length(sorting_goodness_evaluation_opts, "mean_links_length", "Calculate the mean links length. This metric is path-guided and"
" computable in 1D and 2D.", {'l', "mean-links-length"});
args::Flag dont_penalize_gap_links(sorting_goodness_evaluation_opts, "dont-penalize-gap-links", "Don’t penalize gap links in the mean links length. A gap link is a"
args::Flag dont_penalize_gap_links(sorting_goodness_evaluation_opts, "dont-penalize-gap-links", "Don’t penalize gap links in the mean links length. A gap link is a"
" link which connects two nodes that are consecutive in the linear"
" pangenomic order. This option is specifiable only to compute the mean"
" links length in 1D.", {'g', "no-gap-links"});
args::Flag sum_of_path_node_distances(sorting_goodness_evaluation_opts, "sum_of_path_node_distances", "Calculate the sum of path nodes distances. This metric is path-guided"
args::Flag sum_of_path_node_distances(sorting_goodness_evaluation_opts, "sum_of_path_node_distances", "Calculate the sum of path nodes distances. This metric is path-guided"
" and computable in 1D and 2D. For each path, it iterates from node to"
" node, summing their distances, and normalizing by the path length. In"
" 1D, if a link goes back in the linearized viewpoint of the graph, this"
" is penalized (adding 3 times its length in the sum).", {'s', "sum-path-nodes-distances"});
args::Flag penalize_diff_orientation(sorting_goodness_evaluation_opts, "penalize_diff_orientation", "If a link connects two nodes which have different orientations, this"
args::Flag penalize_diff_orientation(sorting_goodness_evaluation_opts, "penalize_diff_orientation", "If a link connects two nodes which have different orientations, this"
" is penalized (adding 2 times its length in the sum).", {'d', "penalize-different-orientation"});
args::Flag path_statistics(sorting_goodness_evaluation_opts, "path_statistics", "Display the statistics (mean links length or sum path nodes distances) for each path.", {'p', "path-statistics"});
args::Flag _file_size(summary_opts, "file-size", "Show the file size in bytes.", {'f', "file-size"});
args::Flag path_statistics(sorting_goodness_evaluation_opts, "path_statistics", "Display the statistics (mean links length or sum path nodes distances) for each path.", {'p', "path-statistics"});
args::Group io_format_opts(parser, "[ IO Format Options ]");
args::Flag yaml(io_format_opts, "yaml", "Setting this option prints all statistics in YAML format instead of pseudo TSV to stdout. This includes *-S,--summarize*, *-W,--weak-connected-components*, *-L,--self-loops*, *-b,--base-content*, *-l,--mean-links-length*, *-g,--no-gap-links*, *-s,--sum-path-nodes-distances*, and *-d,--penelize-different-orientation*. *-p,path-statistics* is still optional. Not applicable to *-N,--nondeterministic-edges*!", {'y', "yaml"});
args::Flag _multiqc(io_format_opts, "multiqc", "Setting this option prints all! statistics in YAML format instead of pseudo TSV to stdout. This includes *-S,--summarize*, *-W,--weak-connected-components*, *-L,--self-loops*, *-b,--base-content*, *-l,--mean-links-length*, *-g,--no-gap-links*, *-s,--sum-path-nodes-distances*, *-f,--file-size*, and *-d,--penalize-different-orientation*. *-p,path-statistics* is still optional. Not applicable to *-N,--nondeterministic-edges*. Overwrites all other given OPTIONs! The output is perfectly curated for the ODGI MultiQC module.", {'m', "multiqc"});
args::Flag _yaml(io_format_opts, "yaml", "Setting this option prints all selected statistics in YAML format instead of pseudo TSV to stdout.", {'y', "yaml"});
args::Group processing_information(parser, "[ Processing Information ]");
args::ValueFlag<uint64_t> threads(processing_information, "N", "Number of threads to use for parallel operations.", {'t', "threads"});
args::Group processing_info_opts(parser, "[ Processing Information ]");
@@ -144,17 +145,17 @@ int main_stats(int argc, char** argv) {

const uint64_t shift = number_bool_packing::unpack_number(graph.get_handle(graph.min_node_id()));

if (args::get(mean_links_length) || args::get(sum_of_path_node_distances) || yaml) {
if (args::get(mean_links_length) || args::get(sum_of_path_node_distances) || _multiqc) {
if (number_bool_packing::unpack_number(graph.get_handle(graph.max_node_id())) - shift >= graph.get_node_count()){
std::cerr << "[odgi::stats] error: the node IDs are not compacted. Please run 'odgi sort' using -O, --optimize to optimize the graph." << std::endl;
exit(1);
}
}
if (yaml) {
if (_multiqc || _yaml) {
std::cout << "---" << std::endl;
}

if (args::get(_summarize) || yaml) {
if (args::get(_summarize) || _multiqc) {
uint64_t length_in_bp = 0, node_count = 0, edge_count = 0, path_count = 0;
graph.for_each_handle([&](const handle_t& h) {
length_in_bp += graph.get_length(h);
@@ -167,7 +168,7 @@ int main_stats(int argc, char** argv) {
graph.for_each_path_handle([&](const path_handle_t& p) {
++path_count;
});
if (yaml) {
if (_multiqc || _yaml) {
std::cout << "length: " << length_in_bp << std::endl;
std::cout << "nodes: " << node_count << std::endl;
std::cout << "edges: " << edge_count << std::endl;
@@ -178,9 +179,9 @@ int main_stats(int argc, char** argv) {
}
}

if (args::get(_weakly_connected_components) || yaml) {
if (args::get(_weakly_connected_components) || _multiqc) {
std::vector<ska::flat_hash_set<handlegraph::nid_t>> weak_components = algorithms::weakly_connected_components(&graph);
if (yaml) {
if (_multiqc || _yaml) {
std::cout << "num_weakly_connected_components: " << weak_components.size() << std::endl;
std::cout << "weakly_connected_components: " << std::endl;
} else {
@@ -192,7 +193,7 @@ int main_stats(int argc, char** argv) {

ska::flat_hash_set<handlegraph::nid_t> head_nodes = algorithms::is_nice_and_acyclic(graph, weak_components[i]);
bool acyclic = !(head_nodes.empty());
if (yaml) {
if (_multiqc || _yaml) {
std::cout << " - component:" << std::endl;
std::cout << " id: " << i << std::endl;
std::cout << " nodes: " << weak_components[i].size() << std::endl;
@@ -203,7 +204,7 @@ int main_stats(int argc, char** argv) {
}
}

if (_num_self_loops || yaml) {
if (_num_self_loops || _multiqc) {
uint64_t total_self_loops = 0;
std::unordered_set<nid_t> loops;
graph.for_each_edge([&](const edge_t& e) {
@@ -214,7 +215,7 @@ int main_stats(int argc, char** argv) {
});

// Should be these always equal?
if (yaml) {
if (_multiqc || _yaml) {
std::cout << "num_nodes_self_loops:" << std::endl;
std::cout << " total: " << total_self_loops << std::endl;
std::cout << " unique: " << loops.size() << std::endl;
@@ -224,7 +225,7 @@ int main_stats(int argc, char** argv) {
cout << "unique" << "\t" << loops.size() << endl;
}
}
/// we don't do this when `-y, --yaml` was specified
/// we don't do this when `-y, --_multiqc` was specified
if (_show_nondeterministic_edges) {
// This edges could be compressed in principle

@@ -247,7 +248,7 @@ int main_stats(int argc, char** argv) {
});
}

if (args::get(base_content) || yaml) {
if (args::get(base_content) || _multiqc) {
std::vector<uint64_t> chars(256);
graph.for_each_handle([&](const handle_t& h) {
std::string seq = graph.get_sequence(h);
@@ -257,7 +258,7 @@ int main_stats(int argc, char** argv) {
});
for (uint64_t i = 0; i < 256; ++i) {
if (chars[i]) {
if (yaml) {
if (_multiqc || _yaml) {
std::cout << (char)i << ": " << chars[i] << std::endl;
} else {
std::cout << (char)i << "\t" << chars[i] << std::endl;
@@ -266,7 +267,7 @@ int main_stats(int argc, char** argv) {
}
}

if (args::get(mean_links_length) || args::get(sum_of_path_node_distances) || yaml) {
if (args::get(mean_links_length) || args::get(sum_of_path_node_distances) || _multiqc) {
// This vector is needed for computing the metrics in 1D and for detecting gap-links
std::vector<uint64_t> position_map(graph.get_node_count() + 1);

@@ -304,7 +305,7 @@ int main_stats(int argc, char** argv) {
position_map[position_map.size() - 1] = len;
}

if (args::get(mean_links_length) || yaml){
if (args::get(mean_links_length) || _multiqc){
bool _dont_penalize_gap_links = args::get(dont_penalize_gap_links);

uint64_t sum_all_node_space = 0;
@@ -313,7 +314,7 @@ int main_stats(int argc, char** argv) {
uint64_t num_all_links = 0;
uint64_t num_all_gap_links = 0;

if (yaml) {
if (_multiqc || _yaml) {
std::cout << "mean_links_length:" << std::endl;
} else {
std::cout << "#mean_links_length" << std::endl;
@@ -402,7 +403,7 @@ int main_stats(int argc, char** argv) {
ratio_nt_space = (double)sum_nt_space / (double)num_links;
}
}
if (yaml) {
if (_multiqc || _yaml) {
std::cout << " - length:" << std::endl;
std::cout << " path: " << graph.get_path_name(path) << std::endl;
if (layout_in_file) {
@@ -448,7 +449,7 @@ int main_stats(int argc, char** argv) {
ratio_nt_space = (double)sum_all_nt_space / (double)num_all_links;
}
}
if (yaml) {
if (_multiqc || _yaml) {
std::cout << " - length:" << std::endl;
std::cout << " path: " << "all_paths" << std::endl;
if (layout_in_file) {
@@ -458,7 +459,7 @@ int main_stats(int argc, char** argv) {
std::cout << " in_nucleotide_space: " << ratio_nt_space << std::endl;
}
std::cout << " num_links_considered: " << num_all_links << std::endl;
if (dont_penalize_gap_links || yaml) {
if (dont_penalize_gap_links || _multiqc) {
std::cout << " num_gap_links_not_penalized: " << num_all_gap_links << std::endl;
}
} else {
@@ -476,7 +477,7 @@ int main_stats(int argc, char** argv) {
}
}

if (args::get(sum_of_path_node_distances) || yaml){
if (args::get(sum_of_path_node_distances) || _multiqc){
bool _penalize_diff_orientation = args::get(penalize_diff_orientation);

uint64_t sum_all_path_node_dist_node_space = 0;
@@ -487,7 +488,7 @@ int main_stats(int argc, char** argv) {
uint64_t num_all_penalties = 0;
uint64_t num_all_penalties_diff_orientation = 0;

if (yaml) {
if (_multiqc || _yaml) {
std::cout << "sum_of_path_node_distances:" << std::endl;
} else {
std::cout << "#sum_of_path_node_distances" << std::endl;
@@ -572,7 +573,7 @@ int main_stats(int argc, char** argv) {

/// this could land in the YAML, but we don't force it, because we don't need it for the MultiQC module
if (args::get(path_statistics)) {
if (yaml) {
if (_multiqc || _yaml) {
std::cout << " - distance:" << std::endl;
std::cout << " path: " << graph.get_path_name(path) << std::endl;
if (layout_in_file) {
@@ -587,7 +588,7 @@ int main_stats(int argc, char** argv) {
std::cout << " nucleotides: " << len_path_nt_space << std::endl;
std::cout << " num_penalties: " << num_penalties << std::endl;
}
if (_penalize_diff_orientation || yaml) {
if (_penalize_diff_orientation || _multiqc) {
std::cout << " num_penalties_different_orientation: " << num_penalties_diff_orientation << std::endl;
}
} else {
@@ -614,7 +615,7 @@ int main_stats(int argc, char** argv) {
num_all_penalties_diff_orientation += num_penalties_diff_orientation;
});

if (yaml) {
if (_multiqc || _yaml) {
std::cout << " - distance:" << std::endl;
std::cout << " path: " << "all_paths" << std::endl;
if (layout_in_file) {
@@ -629,7 +630,7 @@ int main_stats(int argc, char** argv) {
std::cout << " nucleotides: " << len_all_path_nt_space << std::endl;
std::cout << " num_penalties: " << num_all_penalties << std::endl;
}
if (_penalize_diff_orientation || yaml) {
if (_penalize_diff_orientation || _multiqc) {
std::cout << " num_penalties_different_orientation: " << num_all_penalties_diff_orientation << std::endl;
}
} else {
@@ -650,7 +651,7 @@ int main_stats(int argc, char** argv) {
}

// TODO
if (_file_size || yaml) {
if (_file_size || _multiqc) {
// 1. get the file size with error handling
const filesystem::path path_infile = infile;
std::error_code err_code;
@@ -659,7 +660,7 @@ int main_stats(int argc, char** argv) {
std::cerr << "[odgi::stats] error: " << infile << " : " << err_code.message() << std::endl;
exit(1);
} else {
if (yaml) {
if (_multiqc || _yaml) {
std::cout << "file_size_in_bytes: " << file_size << std::endl;
} else {
std::cout << file_size << std::endl;

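The file-size step in the last hunk relies on a non-throwing filesystem query that reports failure through an error code. A minimal standalone sketch of that pattern follows, assuming C++17 `std::filesystem` (the diff writes `filesystem::path`, so the namespace used in odgi may differ). The helper name `file_size_in_bytes` is hypothetical and not part of odgi.

```cpp
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper: returns the size of `infile` in bytes, or std::nullopt
// if the size cannot be determined. On failure, `err` holds the reason.
inline std::optional<std::uintmax_t> file_size_in_bytes(const std::string& infile,
                                                        std::error_code& err) {
    const std::filesystem::path path_infile = infile;
    // The overload taking std::error_code does not throw; it sets `err` instead.
    const std::uintmax_t size = std::filesystem::file_size(path_infile, err);
    if (err) {
        return std::nullopt;
    }
    return size;
}
```

A caller would print `err.message()` to stderr and exit when no value is returned, mirroring the error branch in the hunk above, and otherwise print the value either as `file_size_in_bytes: <n>` (YAML) or as a bare number.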