Cluster based on e-value/tmscore ? #364

Open
Wangchentong opened this issue Oct 11, 2024 · 3 comments

Comments

@Wangchentong

Expected Behavior

When I run easy-cluster on a set of RFdiffusion-generated structures, I observe that Foldseek's default clustering, which is based on e-value, gives the opposite trend to clustering with a TM-score threshold (TM-score cutoff 0.6).

Current Behavior

[Image: scaffold counts and cluster counts by length]
Light blue is the total number of scaffolds at each length; dark blue is the number of clusters. Why does the number of clusters decrease as length increases when clustering by e-value, while the trend is the opposite when clustering by TM-score?
What clustering criterion do you recommend for measuring the structural diversity of a structure generation model?

@Wangchentong
Author

Cluster by e-value: foldseek easy-cluster pdb/ merge tmp/ -c 0.8
Cluster by TM-score: foldseek easy-cluster pdb/ merge tmp/ -c 0.8 --tmscore-threshold 0.6
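
For anyone wanting to reproduce the per-length counts from the plot, here is a minimal Python sketch. It assumes the clustering result is a two-column, tab-separated file of (representative, member) pairs, and that with the prefix `merge` it is written as `merge_cluster.tsv`; both are assumptions about the output layout, so check your own run. The `lengths` mapping (structure name to residue count) is hypothetical and has to be built separately. A cluster is counted at a given length if it has at least one member of that length.

```python
import csv
from collections import defaultdict


def read_cluster_tsv(path):
    """Read (representative, member) pairs from a two-column TSV file.

    The two-column layout is an assumption about the clustering output.
    """
    with open(path, newline="") as handle:
        return [(row[0], row[1])
                for row in csv.reader(handle, delimiter="\t")
                if len(row) >= 2]


def clusters_per_length(cluster_rows, lengths):
    """Return {length: (n_clusters, n_structures)}.

    cluster_rows: iterable of (representative, member) pairs.
    lengths: hypothetical dict mapping each member name to its length.
    """
    members = defaultdict(int)   # structures seen at each length
    reps = defaultdict(set)      # distinct cluster representatives at each length
    for rep, member in cluster_rows:
        n = lengths[member]
        members[n] += 1
        reps[n].add(rep)
    return {n: (len(reps[n]), members[n]) for n in sorted(members)}
```

Dividing the cluster count by the structure count at each length gives a simple diversity ratio that can be compared between the e-value run and the TM-score run.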

@Huilin-Li

I think it might be based on TM-score, because the clusters I get with the default settings (foldseek easy-cluster pdb/ result tmp) are the same as the clusters I get when setting --alignment-type 1.

 --alignment-type INT             How to compute the alignment:
                                  0: 3di alignment
                                  1: TM alignment
                                  2: 3Di+AA [2]

However, the default alignment type is 2: 3Di+AA. I'm not sure whether a TM-score is also calculated with this alignment type.
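
One way to check whether two runs really produce the same groups is to compare them as partitions, ignoring which member was picked as the representative. The sketch below assumes each run has already been loaded as a list of (representative, member) pairs, for example from a two-column cluster TSV; the function names are made up for illustration.

```python
from collections import defaultdict


def partition(cluster_rows):
    """Turn (representative, member) pairs into a set of member groups."""
    groups = defaultdict(set)
    for rep, member in cluster_rows:
        groups[rep].add(member)
    # frozenset makes each group hashable, so the groups can be compared as a set
    return {frozenset(group) for group in groups.values()}


def same_clustering(rows_a, rows_b):
    """True if both runs group the members identically, whatever the representatives."""
    return partition(rows_a) == partition(rows_b)
```

Identical groups on one dataset do not show that both settings use the same criterion; they only show that the two criteria agree on that dataset.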

@martin-steinegger
Collaborator

The threshold you set is crucial. Here are some parameters to consider:

  • -c controls the alignment coverage (default = 0.8); I recommend increasing it to 0.9.
  • --cluster-reassign 1 addresses issues caused by transitive clustering, where coverage violations can occur.
  • -e sets the e-value threshold; a more stringent value should improve accuracy.
  • --tmscore-threshold sets the TM-score threshold for the alignment (compatible with all alignment types). However, this only works well for superposable structures, which many multi-domain proteins are not.
  • --lddt-threshold sets the LDDT score threshold for the alignment (compatible with all alignment types). It also works for multi-domain proteins.
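
To compare these settings side by side, a small helper can assemble the command line for each variant. This is only a sketch: it uses the flags named in this thread, the helper name and its keyword arguments are hypothetical, and the values in the usage line (0.9, 0.6) are illustrative rather than tuned recommendations. It builds the argument list and does not run Foldseek.

```python
import shlex


def easy_cluster_command(input_dir, prefix, tmp_dir, coverage=0.9,
                         evalue=None, tmscore=None, lddt=None, reassign=True):
    """Build a `foldseek easy-cluster` argument list (hypothetical helper)."""
    cmd = ["foldseek", "easy-cluster", input_dir, prefix, tmp_dir,
           "-c", str(coverage)]
    if reassign:
        cmd += ["--cluster-reassign", "1"]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if tmscore is not None:
        cmd += ["--tmscore-threshold", str(tmscore)]
    if lddt is not None:
        cmd += ["--lddt-threshold", str(lddt)]
    return cmd


# Illustrative values only: print the command for an LDDT-thresholded run.
print(shlex.join(easy_cluster_command("pdb/", "merge", "tmp/", lddt=0.6)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the flags have been checked against the installed version's help output.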
