Problem when loading pdbs with no bonds #658

brenerrr · 2024-11-14T18:02:48Z

The problem

I encoutered some problems when trying to load PDB files that do not have bonds information and more than one chain identifier, such as

REMARK original generated coordinate pdb file
ATOM      0   C1   BGX   1      -3.563 -12.592-256.215  1.00  0.00           C
ATOM      1   H1   BGX   1      -2.591 -13.101-256.040  1.00  0.00           H
ATOM      2   C5   BGX   1      -2.763 -11.186-258.118  1.00  0.00           C
ATOM      3   H5   BGX   1      -1.754 -11.645-258.040  1.00  0.00           H
ATOM      4   O5   BGX   1      -3.299 -11.258-256.729  1.00  0.00           O
ATOM      5   C2   BGX   2      -4.485 -13.321-257.148  1.00  0.00           C
ATOM      6   H2   BGX   2      -5.452 -12.773-257.146  1.00  0.00           H
END

When trying to load it, There will be an error from biotite on line 29 of molecularnodes/entities/molecule/pdb.py

array = pdb.get_structure(
            pdb_file=self.file,
            extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
            include_bonds=True,
        )

because of the "include_bonds=True" flag.

My solution

I forked the main repository and made the following modifications

array = pdb.get_structure(
    pdb_file=self.file,
    extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
    include_bonds=False,
)
try:
    array.bonds = connect_via_residue_names(array)
except AttributeError:
    print("Not able to find bonds via residue.")

Now the array is loaded without bonds, which are built afterwards. In case biotite raises an AtributeError, then the atoms are loaded without any bond information.

If you think this is a good solution I can submit a pull request later. I am also open to suggestions about this solution.

The text was updated successfully, but these errors were encountered:

BradyAJohnston · 2024-11-15T00:28:22Z

Thanks for catching this! A PR would be great, but I think we should instead also do try / except for the pdb_get_structure.

try:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=True,
    )
except (ErrorThatIsRaised):
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=False,
    )
    try:
        array.bonds = connect_via_residue_names(array)
    except AttributeError:
        print("Not able to find bonds via residue.")

This is also related to #634 where if the structure is missing a b_factor column the import fails. I haven't tested but I assume any of the extra_fields if they are missing will cause it to fail.

A more general way to handle all of these potential missing bits of data, but I am unsure what the best approach would be.

BradyAJohnston · 2024-11-15T00:29:11Z

biotite is very ready to throw an error if something isn't exactly as it expects it, which is a lot of the time in the realm of PDB files

brenerrr · 2024-11-15T02:15:09Z

It would be nice if biotite raised different types of errors but it seems it just raises an AttributeError in case anything goes wrong. So in practice your suggestion would endup being something like

try:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=True,
    )
except AttributeError:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=False,
    )
    try:
        array.bonds = connect_via_residue_names(array)
    except AttributeError:
        print("Not able to find bonds via residue.")

which seems a bit odd. I guess the most general way that also avoids duplicated code would be to manually check if the extra fields are present in the pdb, raise an error if they aren't, get the structure without bonds, then build the bonds. At least with this pipeline if something goes wrong it would be easier to know if the problem is related with the bonds or the extra fields. However I am not sure how this checking should be done and probably would do that in another PR.

BradyAJohnston · 2024-11-15T02:17:21Z

It might be something that we re-implemented the underlying pdb.get_structure() in our own method that does better checking. Could potentially also raise it with the biotite devs, they have been quite responsive in the past about making changes / improvements

brenerrr · 2024-11-22T00:38:36Z

I can try to raise this issue with the biotite devs. Meanwhile do you think I can do a PR as a hot fix here?

BradyAJohnston · 2024-11-22T01:08:27Z

I'm happy to accept a PR fix here, worth also raising it with the biotite devs as well

brenerrr mentioned this issue Nov 23, 2024

Pdb Loading Fix #666

Merged

BradyAJohnston closed this as completed Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem when loading pdbs with no bonds #658

Problem when loading pdbs with no bonds #658

brenerrr commented Nov 14, 2024

BradyAJohnston commented Nov 15, 2024

BradyAJohnston commented Nov 15, 2024

brenerrr commented Nov 15, 2024

BradyAJohnston commented Nov 15, 2024

brenerrr commented Nov 22, 2024

BradyAJohnston commented Nov 22, 2024

Problem when loading pdbs with no bonds #658

Problem when loading pdbs with no bonds #658

Comments

brenerrr commented Nov 14, 2024

The problem

My solution

BradyAJohnston commented Nov 15, 2024

BradyAJohnston commented Nov 15, 2024

brenerrr commented Nov 15, 2024

BradyAJohnston commented Nov 15, 2024

brenerrr commented Nov 22, 2024

BradyAJohnston commented Nov 22, 2024