-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dbsub.out multiple lines for the same gene_id #125
Comments
Hello,
Thanks for the all the questions. Briefly, the situation is because of multi-domain proteins. Please see my detailed responses below.
…________________________________
From: Erwann SCAON ***@***.***>
Sent: Thursday, August 3, 2023 5:21 AM
To: linnabrown/run_dbcan ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [linnabrown/run_dbcan] dbsub.out multiple lines for the same gene_id (Issue #125)
Non-NU Email
________________________________
Dear all,
run_dbcan version:
Based on commit f3dd111<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/commit/f3dd111a9eaf552cdca14f4c8db06baabaa1f2f8__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvijLQvIhg$>, we cloned the repo and built the Docker image from https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvgBct1pOA$>
we ran the following run_dbcan command:
docker run \
--rm \
-v /home/Toto_project:/data \
run_dbcan:4.0.0 \
/data/$sample \
meta \
--out_dir /data/dbcan4/"$name"_dbcan \
--out_pre "$name" \
--dia_cpu 34 \
--hmm_cpu 34 \
--tf_cpu 34 \
--stp_cpu 34 \
2>&1 >> log.txt
When trying to figure out the result for the first sample, looking at the overwiew:
sample_a_overview.txt
In the "dbCAN_sub" column, some results (e.g. "GH42_e21") are eCAMI subfamilies.
We wanted to "convert" them to CAZyme families, using information available in sample_a_dbsub.out.
Ps: We noticed that:
Gene_id are unique in the sample_a_overview.txt file.
This is not always the case in the sample_a_dbsub.out file.
Trying to map eCAMI subfamilies to CAZyme families, we stumbled upon various cases of duplicated Gene_id in the sample_a_dbsub.out file:
One Gene_id having many lines for the same Db_can_subfam = CAZyme family
sample_a_dbsub.out:
Gene_id Db_can_subfam Subfam_composition
k141_96749_21 GT2 GT2:53|GT4:2
k141_96749_21 GT2 GT2:121|GT4:4
this is because k141_96749_21 matches both subfam HMMs and the matched regions overlap with each other. If you look at the Gene Start Gene End columns in the dbsub.out file, you will see they overlap.
sample_a_overview.txt:
Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools
k141_96749_21 -|- GT2(63-192)+GT2(324-453) GT2+GT2 GT2 3
Notice that the Subfam_composition is not the same for both GT2 entries.
the overview file is parsed, so only the best subfam with the lowest e-value is kept. The Subfam_composition is different and this is expected because the two subfams are derived from different eCAMI clusters.
One Gene_id having many lines for the same Db_can_subfam = eCAMI subfamilies
sample_a_dbsub.out:
Gene_id Db_can_subfam Subfam_composition
k141_88523_2 GT39_e29 GT39:80|CBM0:21|CBM16:6
k141_88523_2 GT39_e29 GT39:80|CBM0:21|CBM16:6
this one is possible if you have two duplicated seqs in your input file or the k141_88523_2 has two GT39 domains matching the same subfam HMM but at different locations. You can tell by looking at the Gene Start Gene End columns in the dbsub.out file.
sample_a_overview.txt:
Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools
k141_88523_2 -|- GT39(607-822) GT39_e29+GT39_e29 GT39 3
Notice that the Subfam_composition is the same for both GT39_e29 entries.
One Gene_id having multiple lines for different Db_can_subfam = eCAMI subfamilies
sample_a_dbsub.out:
Gene_id Db_can_subfam Subfam_composition
k141_110494_2 CBM32_e28 CBM32:62|GH2:59
k141_110494_2 GH2_e53 GH2:132|CBM32:58|CBM2:1
this one is very possible if k141_110494_2 has two different domains, one matching CBM32 and the other GH2. Again, you should look at the Gene Start Gene End columns in the dbsub.out file
sample_a_overview.txt:
Gene ID EC# HMMER dbCAN_sub DIAMOND #ofTools
k141_110494_2 -|- GH2(5-566) GH2_e53+CBM32_e28 CBM32+GH2 3
Questions:
* Are all those cases expected?
yes
* How do dbCAN users usually map dbCAN-sub eCAMI subfamilies to CAZyme families?
Good question! we have done this but didn't publish it. The results showed that eCAMI/dbCAN-sub subfamilies tend to be smaller than CAZy's subfamilies (eCAMI tends to group sequences in a larger number of smaller clusters). Others have also looked into this by comparing ProfileView, eCAMI, CUPP, CAZy: https://doi.org/10.1093/molbev/msac070. But, remember CAZy only has subfamilies for 34 of over 470 CAZyme families.
—
Reply to this email directly, view it on GitHub<https://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/issues/125__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvhGiYxcEg$>, or unsubscribe<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEXNKZVPM654NQVKEMGLJBTXTN3T7ANCNFSM6AAAAAA3CRN5IM__;!!PvXuogZ4sRB2p-tU!DkwacM2wNU_AMsDhHVhBN2Bi-yzGLaipJRrGutlUkErCAu0Uyfglf-YEnxPQMLK0t1T5oYJqnvWyBvhxfp9IRw$>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Dear all,
run_dbcan version:
Based on commit f3dd111, we cloned the repo and built the Docker image from https://github.com/linnabrown/run_dbcan/blob/master/docker/Dockerfile
we ran the following run_dbcan command:
When trying to figure out the result for the first sample, looking at the overwiew (sample_a_overview.txt):
In the "dbCAN_sub" column, some results (e.g. "GH42_e21") are eCAMI subfamilies.
We wanted to "convert" them to CAZyme families, using information available in sample_a_dbsub.out.
Ps: We noticed that:
Gene_id are unique in the sample_a_overview.txt file.
This is not always the case in the sample_a_dbsub.out file.
Trying to map eCAMI subfamilies to CAZyme families, we stumbled upon various cases of duplicated Gene_id in the sample_a_dbsub.out file:
n lines for the same Db_can_subfam (CAZyme family)
sample_a_dbsub.out:
sample_a_overview.txt:
Notice that the Subfam_composition is not the same for both GT2 entries.
n lines for the same Db_can_subfam (eCAMI subfamiliy)
sample_a_dbsub.out:
sample_a_overview.txt:
Notice that the Subfam_composition is the same for both GT39_e29 entries.
n lines for different Db_can_subfam (eCAMI subfamilies)
sample_a_dbsub.out:
sample_a_overview.txt:
Questions:
The text was updated successfully, but these errors were encountered: