Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem when loading pdbs with no bonds #658

Closed
brenerrr opened this issue Nov 14, 2024 · 6 comments
Closed

Problem when loading pdbs with no bonds #658

brenerrr opened this issue Nov 14, 2024 · 6 comments

Comments

@brenerrr
Copy link
Contributor

The problem

I encoutered some problems when trying to load PDB files that do not have bonds information and more than one chain identifier, such as

REMARK original generated coordinate pdb file
ATOM      0   C1   BGX   1      -3.563 -12.592-256.215  1.00  0.00           C
ATOM      1   H1   BGX   1      -2.591 -13.101-256.040  1.00  0.00           H
ATOM      2   C5   BGX   1      -2.763 -11.186-258.118  1.00  0.00           C
ATOM      3   H5   BGX   1      -1.754 -11.645-258.040  1.00  0.00           H
ATOM      4   O5   BGX   1      -3.299 -11.258-256.729  1.00  0.00           O
ATOM      5   C2   BGX   2      -4.485 -13.321-257.148  1.00  0.00           C
ATOM      6   H2   BGX   2      -5.452 -12.773-257.146  1.00  0.00           H
END

When trying to load it, There will be an error from biotite on line 29 of molecularnodes/entities/molecule/pdb.py

array = pdb.get_structure(
            pdb_file=self.file,
            extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
            include_bonds=True,
        )

because of the "include_bonds=True" flag.

My solution

I forked the main repository and made the following modifications

array = pdb.get_structure(
    pdb_file=self.file,
    extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
    include_bonds=False,
)
try:
    array.bonds = connect_via_residue_names(array)
except AttributeError:
    print("Not able to find bonds via residue.")

Now the array is loaded without bonds, which are built afterwards. In case biotite raises an AtributeError, then the atoms are loaded without any bond information.

If you think this is a good solution I can submit a pull request later. I am also open to suggestions about this solution.

@BradyAJohnston
Copy link
Owner

Thanks for catching this! A PR would be great, but I think we should instead also do try / except for the pdb_get_structure.

try:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=True,
    )
except (ErrorThatIsRaised):
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=False,
    )
    try:
        array.bonds = connect_via_residue_names(array)
    except AttributeError:
        print("Not able to find bonds via residue.")

This is also related to #634 where if the structure is missing a b_factor column the import fails. I haven't tested but I assume any of the extra_fields if they are missing will cause it to fail.

A more general way to handle all of these potential missing bits of data, but I am unsure what the best approach would be.

@BradyAJohnston
Copy link
Owner

biotite is very ready to throw an error if something isn't exactly as it expects it, which is a lot of the time in the realm of PDB files

@brenerrr
Copy link
Contributor Author

It would be nice if biotite raised different types of errors but it seems it just raises an AttributeError in case anything goes wrong. So in practice your suggestion would endup being something like

try:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=True,
    )
except AttributeError:
    array = pdb.get_structure(
        pdb_file=self.file,
        extra_fields=["b_factor", "occupancy", "charge", "atom_id"],
        include_bonds=False,
    )
    try:
        array.bonds = connect_via_residue_names(array)
    except AttributeError:
        print("Not able to find bonds via residue.")

which seems a bit odd. I guess the most general way that also avoids duplicated code would be to manually check if the extra fields are present in the pdb, raise an error if they aren't, get the structure without bonds, then build the bonds. At least with this pipeline if something goes wrong it would be easier to know if the problem is related with the bonds or the extra fields. However I am not sure how this checking should be done and probably would do that in another PR.

@BradyAJohnston
Copy link
Owner

It might be something that we re-implemented the underlying pdb.get_structure() in our own method that does better checking. Could potentially also raise it with the biotite devs, they have been quite responsive in the past about making changes / improvements

@brenerrr
Copy link
Contributor Author

I can try to raise this issue with the biotite devs. Meanwhile do you think I can do a PR as a hot fix here?

@BradyAJohnston
Copy link
Owner

I'm happy to accept a PR fix here, worth also raising it with the biotite devs as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants