I am examining the "dbsub.out" file and have about 500 Gene IDs with multiple dbCAN.subfam and Substrate entries. Do you recommend keeping all the hits or selecting only the best hit?
For example, in the screenshot below, I am interested in examining all the chitin-degrading Gene IDs, so I am worried that I might lose that information if I select only the best hit. Keeping all the hits would suggest that this protein can target cellulose, chitin, and xylan.
Additionally, both GH and CBM annotations are important for a Gene ID, because together they show whether the GH has an accessory domain. Do you recommend keeping Gene IDs that have only a CBM annotation for substrate selection? (For example, since a CBM only has an accessory role in chitin degradation, should I examine only GHs with a CBM for this substrate?)
I would appreciate suggestions on whether my understanding of the output is correct.
You should keep all of them. This file has already been parsed with the presence of multiple domains in the same query protein taken into account. In the case you show, the protein has four domains and each one gives a substrate prediction, so you should keep all of them.
Note that in our new run_dbcan release, the file name dbsub.out has been changed to dbcan-sub.hmm.out. To give you another example, in the following file: https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Dry2014.dbCAN/dbcan-sub.hmm.out, there are 12947 rows but only 11827 proteins. That's because 894 proteins have multiple domains (each domain match has one row in the file). For instance, the protein Dry2014_81126 has three domains: GH43_e159, GH43_e22, GH43_e159, and their positions in the full-length protein (cols 11 and 12) are different.
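If it helps, here is a minimal Python sketch of how one might keep every domain row and still pull out the Gene IDs of interest. It is only an illustration, not part of run_dbcan. It assumes the file is tab-separated with a header row and that the columns are named "Gene ID" and "Substrate"; please check these names against your own file. The sample rows, gene names, and subfamily names below are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout (tab-separated, with header).
# Column names, gene IDs, and subfamily names are illustrative assumptions.
SAMPLE = (
    "dbCAN subfam\tSubstrate\tGene ID\tGene Start\tGene End\n"
    "GH5_e1\tcellulose, xylan\tgeneA\t10\t300\n"
    "GH18_e2\tchitin\tgeneA\t320\t600\n"
    "GH43_e3\txylan\tgeneB\t5\t280\n"
    "CBM5_e4\tchitin\tgeneC\t1\t50\n"
)


def load_rows(handle):
    """Read every row of the table; each domain match is one row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_gene(rows, gene_col="Gene ID"):
    """Collect all domain rows for each gene, without dropping any hit."""
    by_gene = defaultdict(list)
    for row in rows:
        by_gene[row[gene_col]].append(row)
    return by_gene


def genes_with_substrate(by_gene, substrate, substrate_col="Substrate"):
    """Return genes for which at least one domain row lists the substrate."""
    wanted = substrate.lower()
    hits = []
    for gene, rows in by_gene.items():
        for row in rows:
            listed = [s.strip().lower() for s in row[substrate_col].split(",")]
            if wanted in listed:
                hits.append(gene)
                break
    return hits


rows = load_rows(io.StringIO(SAMPLE))
by_gene = group_by_gene(rows)

# Genes with more than one domain row (multi-domain proteins).
multi_domain = [gene for gene, hits in by_gene.items() if len(hits) > 1]

# Genes with any chitin-associated domain, including CBM-only genes.
chitin_genes = genes_with_substrate(by_gene, "chitin")
```

To run this on a real file, replace `io.StringIO(SAMPLE)` with `open("dbcan-sub.hmm.out", newline="")`. In the sample, geneA is kept as a chitin candidate even though its first row points to cellulose and xylan, which is exactly the information a best-hit-only filter would lose.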