Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation issue when loading multi-conformer mmCIF files #132

Closed
hfswetton opened this issue Nov 29, 2024 · 2 comments
Closed

Validation issue when loading multi-conformer mmCIF files #132

hfswetton opened this issue Nov 29, 2024 · 2 comments

Comments

@hfswetton
Copy link
Contributor

Hi,

I believe I've found a bug in pdbtbx's structure validation, specifically in the call to Atom::corresponds() in pdbtbx::validate::validate_models(): this function assumes that "corresponding" atoms have identical serial numbers over all models, but this appears not to be the case in PDBx/mmCIF format. This means that multi-conformer (in my case, NMR) structures in this format cannot currently be parsed.

For example, the following script using PDB entry 3PDZ panics:

use pdbtbx::{Format, ReadOptions, StrictnessLevel};

fn main() {
    let (pdb, _errors) = ReadOptions::default()
        .set_level(StrictnessLevel::Loose)
        .set_format(Format::Mmcif)
        .read("7QCX.cif")
        .expect("errors in structure");
    println!("{:#?}", pdb);
}

// > errors in structure: [
// > StrictWarning: Atoms in Models not corresponding
// > Atom 1427 in Model 2 does not correspond to the respective Atom in the first model.
// > , StrictWarning: Atoms in Models not corresponding
// > Atom 1428 in Model 2 does not correspond to the respective Atom in the first model.
// > , ...]

As a preliminary fix, I believe it would be easiest to just remove the line self.serial_number == other.serial_number from pdbtbx::structs::atom::Atom::corresponds() entirely - this allows the snippet above to run successfully.

For full compatibility with the format, it may even be better to change the handling of serial numbers entirely though, as the _atom_site.id field it's read from in PDBx/mmCIF format can apparently be any unique identifier (i.e. not necessarily numeric). Maybe behaviour like this would make sense?

  • Atom.serial_number: int is unique within each model, but identical for corresponding atoms in different models (= current behaviour)
  • Atom.global_id: String is unique for every atom in a structure
  • pdbtbx::read::mmcif::parser extracts Atom.global_id from the file and calculates Atom.serial_number itself (ensuring atom correspondence etc.)
  • pdbtbx::read::pdb::parser extracts Atom.serial_number from the file and calculates Atom.global_id itself (e.g. as a String representation of an incrementing integer)
  • pdbtbx::save::mmcif uses Atom.global_id only
  • pdbtbx::save::pdb uses Atom.serial_number only

This would keep most of the library's behaviour as it is - the only "outward-facing" change would be the way PDBx/mmCIF files are read and written.

In any case, it might also be good to add a multi-conformer structure to the test suite (like in #131), if that doesn't end up being too much bloat.

Very sorry for the long message! This is absolutely not urgent for me as I'm not currently using the library, so there's absolutely no rush to get it fixed. If it would help, I'd be glad to try making a PR of how I'd imagine the changes when I get the chance. Many thanks for your work on this!

@douweschulte
Copy link
Owner

Thanks so much for the detailed issue. I might have indeed fared mostly on what I saw in mmCIF files and not the specification itself. The proposed behaviour seems very sensible. Adding a test case is a good thing and if you have the time to work on a PR that would be splendid. Personally my work has moved on from structural proteins so I do not have a lot of time on my hands to work on pdbtbx, but I would be happy to discuss the issue further or review a PR.

@hfswetton
Copy link
Contributor Author

Thanks as well for the quick response! In that case, I'd be glad to work on a PR in this direction (although I'm afraid it may take a while for me as well - anybody please feel free to give me a poke if nothing's happened here yet and if the changes would be useful for you). I'll see how much of the issue I can deal with by myself so you don't need to spend extra time on this, but would also be glad of the opportunity to discuss a bit in case I find something I'm not quite clear on.

I also noticed that #64/#95 are somewhat related to this, so I'll try not to introduce anything that interferes with previous changes from those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants