dbsub.out multiple lines for the same gene_id #125

ScaonE · 2023-08-03T10:21:40Z

Dear all,

run_dbcan version:
Based on commit f3dd111, we cloned the repo and built the Docker image from https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile

we ran the following run_dbcan command:

docker run \
        --rm \
        -v /home/Toto_project:/data \
        run_dbcan:4.0.0 \
        /data/$sample \
        meta \
        --out_dir /data/dbcan4/"$name"_dbcan \
        --out_pre "$name" \
        --dia_cpu 34 \
        --hmm_cpu 34 \
        --tf_cpu 34 \
        --stp_cpu 34 \
        2>&1 >> log.txt

When trying to figure out the result for the first sample, looking at the overwiew (sample_a_overview.txt):
In the "dbCAN_sub" column, some results (e.g. "GH42_e21") are eCAMI subfamilies.
We wanted to "convert" them to CAZyme families, using information available in sample_a_dbsub.out.

Ps: We noticed that:
Gene_id are unique in the sample_a_overview.txt file.
This is not always the case in the sample_a_dbsub.out file.

Trying to map eCAMI subfamilies to CAZyme families, we stumbled upon various cases of duplicated Gene_id in the sample_a_dbsub.out file:

n lines for the same Db_can_subfam (CAZyme family)

sample_a_dbsub.out:

Gene_id	Db_can_subfam	Subfam_composition
k141_96749_21	GT2	GT2:53\|GT4:2
k141_96749_21	GT2	GT2:121\|GT4:4

sample_a_overview.txt:

Gene ID	EC#	HMMER	dbCAN_sub	DIAMOND	#ofTools
k141_96749_21	-\|-	GT2(63-192)+GT2(324-453)	GT2+GT2	GT2	3

Notice that the Subfam_composition is not the same for both GT2 entries.

n lines for the same Db_can_subfam (eCAMI subfamiliy)

sample_a_dbsub.out:

Gene_id	Db_can_subfam	Subfam_composition
k141_88523_2	GT39_e29	GT39:80\|CBM0:21\|CBM16:6
k141_88523_2	GT39_e29	GT39:80\|CBM0:21\|CBM16:6

sample_a_overview.txt:

Gene ID	EC#	HMMER	dbCAN_sub	DIAMOND	#ofTools
k141_88523_2	-\|-	GT39(607-822)	GT39_e29+GT39_e29	GT39	3

Notice that the Subfam_composition is the same for both GT39_e29 entries.

n lines for different Db_can_subfam (eCAMI subfamilies)

sample_a_dbsub.out:

Gene_id	Db_can_subfam	Subfam_composition
k141_110494_2	CBM32_e28	CBM32:62\|GH2:59
k141_110494_2	GH2_e53	GH2:132\|CBM32:58\|CBM2:1

sample_a_overview.txt:

Gene ID	EC#	HMMER	dbCAN_sub	DIAMOND	#ofTools
k141_110494_2	-\|-	GH2(5-566)	GH2_e53+CBM32_e28	CBM32+GH2	3

Questions:

Are all those cases expected?
How do dbCAN users usually map dbCAN-sub eCAMI subfamilies to CAZyme families?

The text was updated successfully, but these errors were encountered:

yinlabniu · 2023-08-03T14:56:47Z

Hello, Thanks for the all the questions. Briefly, the situation is because of multi-domain proteins. Please see my detailed responses below.

…

________________________________ From: Erwann SCAON ***@***.***> Sent: Thursday, August 3, 2023 5:21 AM To: linnabrown/run_dbcan ***@***.***> Cc: Subscribed ***@***.***> Subject: [linnabrown/run_dbcan] dbsub.out multiple lines for the same gene_id (Issue #125) Non-NU Email

________________________________ Dear all, run_dbcan version: Based on commit f3dd111<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/commit/f3dd111a9eaf552cdca14f4c8db06baabaa1f2f8__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvijLQvIhg$>, we cloned the repo and built the Docker image from https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvgBct1pOA$> we ran the following run_dbcan command: docker run \ --rm \ -v /home/Toto_project:/data \ run_dbcan:4.0.0 \ /data/$sample \ meta \ --out_dir /data/dbcan4/"$name"_dbcan \ --out_pre "$name" \ --dia_cpu 34 \ --hmm_cpu 34 \ --tf_cpu 34 \ --stp_cpu 34 \ 2>&1 >> log.txt When trying to figure out the result for the first sample, looking at the overwiew: sample_a_overview.txt In the "dbCAN_sub" column, some results (e.g. "GH42_e21") are eCAMI subfamilies. We wanted to "convert" them to CAZyme families, using information available in sample_a_dbsub.out. Ps: We noticed that: Gene_id are unique in the sample_a_overview.txt file. This is not always the case in the sample_a_dbsub.out file. Trying to map eCAMI subfamilies to CAZyme families, we stumbled upon various cases of duplicated Gene_id in the sample_a_dbsub.out file: One Gene_id having many lines for the same Db_can_subfam = CAZyme family sample_a_dbsub.out: Gene_id Db_can_subfam Subfam_composition k141_96749_21 GT2 GT2:53|GT4:2 k141_96749_21 GT2 GT2:121|GT4:4 this is because k141_96749_21 matches both subfam HMMs and the matched regions overlap with each other. If you look at the Gene Start Gene End columns in the dbsub.out file, you will see they overlap. sample_a_overview.txt: Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools k141_96749_21 -|- GT2(63-192)+GT2(324-453) GT2+GT2 GT2 3 Notice that the Subfam_composition is not the same for both GT2 entries. the overview file is parsed, so only the best subfam with the lowest e-value is kept. The Subfam_composition is different and this is expected because the two subfams are derived from different eCAMI clusters. One Gene_id having many lines for the same Db_can_subfam = eCAMI subfamilies sample_a_dbsub.out: Gene_id Db_can_subfam Subfam_composition k141_88523_2 GT39_e29 GT39:80|CBM0:21|CBM16:6 k141_88523_2 GT39_e29 GT39:80|CBM0:21|CBM16:6 this one is possible if you have two duplicated seqs in your input file or the k141_88523_2 has two GT39 domains matching the same subfam HMM but at different locations. You can tell by looking at the Gene Start Gene End columns in the dbsub.out file. sample_a_overview.txt: Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools k141_88523_2 -|- GT39(607-822) GT39_e29+GT39_e29 GT39 3 Notice that the Subfam_composition is the same for both GT39_e29 entries. One Gene_id having multiple lines for different Db_can_subfam = eCAMI subfamilies sample_a_dbsub.out: Gene_id Db_can_subfam Subfam_composition k141_110494_2 CBM32_e28 CBM32:62|GH2:59 k141_110494_2 GH2_e53 GH2:132|CBM32:58|CBM2:1 this one is very possible if k141_110494_2 has two different domains, one matching CBM32 and the other GH2. Again, you should look at the Gene Start Gene End columns in the dbsub.out file sample_a_overview.txt: Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools k141_110494_2 -|- GH2(5-566) GH2_e53+CBM32_e28 CBM32+GH2 3 Questions: * Are all those cases expected? yes * How do dbCAN users usually map dbCAN-sub eCAMI subfamilies to CAZyme families? Good question! we have done this but didn't publish it. The results showed that eCAMI/dbCAN-sub subfamilies tend to be smaller than CAZy's subfamilies (eCAMI tends to group sequences in a larger number of smaller clusters). Others have also looked into this by comparing ProfileView, eCAMI, CUPP, CAZy: https://doi.org/10.1093/molbev/msac070. But, remember CAZy only has subfamilies for 34 of over 470 CAZyme families. — Reply to this email directly, view it on GitHub<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/issues/125__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvhGiYxcEg$>, or unsubscribe<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEXNKZVPM654NQVKEMGLJBTXTN3T7ANCNFSM6AAAAAA3CRN5IM__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvhxfp9IRw$>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbsub.out multiple lines for the same gene_id #125

dbsub.out multiple lines for the same gene_id #125

ScaonE commented Aug 3, 2023 •

edited

Loading

yinlabniu commented Aug 3, 2023 via email

dbsub.out multiple lines for the same gene_id #125

dbsub.out multiple lines for the same gene_id #125

Comments

ScaonE commented Aug 3, 2023 • edited Loading

n lines for the same Db_can_subfam (CAZyme family)

n lines for the same Db_can_subfam (eCAMI subfamiliy)

n lines for different Db_can_subfam (eCAMI subfamilies)

Questions:

yinlabniu commented Aug 3, 2023 via email

ScaonE commented Aug 3, 2023 •

edited

Loading