Validation issue when loading multi-conformer mmCIF files #132

hfswetton · 2024-11-29T17:40:42Z

Hi,

I believe I've found a bug in pdbtbx's structure validation, specifically in the call to Atom::corresponds() in pdbtbx::validate::validate_models(): this function assumes that "corresponding" atoms have identical serial numbers over all models, but this appears not to be the case in PDBx/mmCIF format. This means that multi-conformer (in my case, NMR) structures in this format cannot currently be parsed.

For example, the following script using PDB entry 3PDZ panics:

use pdbtbx::{Format, ReadOptions, StrictnessLevel};

fn main() {
    let (pdb, _errors) = ReadOptions::default()
        .set_level(StrictnessLevel::Loose)
        .set_format(Format::Mmcif)
        .read("7QCX.cif")
        .expect("errors in structure");
    println!("{:#?}", pdb);
}

// > errors in structure: [
// > StrictWarning: Atoms in Models not corresponding
// > Atom 1427 in Model 2 does not correspond to the respective Atom in the first model.
// > , StrictWarning: Atoms in Models not corresponding
// > Atom 1428 in Model 2 does not correspond to the respective Atom in the first model.
// > , ...]

As a preliminary fix, I believe it would be easiest to just remove the line self.serial_number == other.serial_number from pdbtbx::structs::atom::Atom::corresponds() entirely - this allows the snippet above to run successfully.

For full compatibility with the format, it may even be better to change the handling of serial numbers entirely though, as the _atom_site.id field it's read from in PDBx/mmCIF format can apparently be any unique identifier (i.e. not necessarily numeric). Maybe behaviour like this would make sense?

Atom.serial_number: int is unique within each model, but identical for corresponding atoms in different models (= current behaviour)
Atom.global_id: String is unique for every atom in a structure
pdbtbx::read::mmcif::parser extracts Atom.global_id from the file and calculates Atom.serial_number itself (ensuring atom correspondence etc.)
pdbtbx::read::pdb::parser extracts Atom.serial_number from the file and calculates Atom.global_id itself (e.g. as a String representation of an incrementing integer)
pdbtbx::save::mmcif uses Atom.global_id only
pdbtbx::save::pdb uses Atom.serial_number only

This would keep most of the library's behaviour as it is - the only "outward-facing" change would be the way PDBx/mmCIF files are read and written.

In any case, it might also be good to add a multi-conformer structure to the test suite (like in #131), if that doesn't end up being too much bloat.

Very sorry for the long message! This is absolutely not urgent for me as I'm not currently using the library, so there's absolutely no rush to get it fixed. If it would help, I'd be glad to try making a PR of how I'd imagine the changes when I get the chance. Many thanks for your work on this!

The text was updated successfully, but these errors were encountered:

douweschulte · 2024-12-02T12:54:30Z

Thanks so much for the detailed issue. I might have indeed fared mostly on what I saw in mmCIF files and not the specification itself. The proposed behaviour seems very sensible. Adding a test case is a good thing and if you have the time to work on a PR that would be splendid. Personally my work has moved on from structural proteins so I do not have a lot of time on my hands to work on pdbtbx, but I would be happy to discuss the issue further or review a PR.

hfswetton · 2024-12-10T09:10:19Z

Thanks as well for the quick response! In that case, I'd be glad to work on a PR in this direction (although I'm afraid it may take a while for me as well - anybody please feel free to give me a poke if nothing's happened here yet and if the changes would be useful for you). I'll see how much of the issue I can deal with by myself so you don't need to spend extra time on this, but would also be glad of the opportunity to discuss a bit in case I find something I'm not quite clear on.

I also noticed that #64/#95 are somewhat related to this, so I'll try not to introduce anything that interferes with previous changes from those.

hfswetton mentioned this issue Jan 17, 2025

Add new field Atom::id #136

Merged

hfswetton closed this as completed Feb 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validation issue when loading multi-conformer mmCIF files #132

Validation issue when loading multi-conformer mmCIF files #132

hfswetton commented Nov 29, 2024

douweschulte commented Dec 2, 2024

hfswetton commented Dec 10, 2024

Validation issue when loading multi-conformer mmCIF files #132

Validation issue when loading multi-conformer mmCIF files #132

Comments

hfswetton commented Nov 29, 2024

douweschulte commented Dec 2, 2024

hfswetton commented Dec 10, 2024