Residue numbering start specification #74

Open
lucajovine opened this issue Nov 19, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@lucajovine

This is not a bug but rather a request that I am sure everybody will support: please add a way to specify, in the input JSON file, the number of the starting residue of each chain.

As users, we almost always have to do this "by hand" a posteriori, since AlphaFold numbering start defaults to 1 and - in the majority of cases - does not match the one we actually work with. This is a small change that could make things much easier for everybody, so I hope you'll take the suggestion into consideration. Thank you!

Augustin-Zidek added the enhancement label Nov 19, 2024
@Augustin-Zidek
Collaborator

I am wondering whether this should be something that AlphaFold does -- I feel this goes against the UNIX philosophy of doing one thing and doing it well. This is clearly something that is a post-processing step that should be done on the produced mmCIF file. If we were to introduce it in AlphaFold, it would require modifying the input format (to specify the starting residue ID for each chain), which seems too invasive.

That being said, I think I will add a utility method in the Structure module to make it easy to write a post-processing script to do this.

I will leave the issue open until I implement it, then I will comment here with a Python snippet to achieve what you are asking for.

@jkosinski

@lucajovine residue numbering is handled in our AlphaPulldown interface to AlphaFold2 (https://github.com/KosinskiLab/AlphaPulldown). You calculate input features for full-length proteins and then you run predictions for any subsets of residues preserving original full-length residue numbers. When/if we add AlphaFold3 backend, the same functionality will be supported.

@lucajovine
Author

lucajovine commented Nov 25, 2024

@jkosinski thank you for mentioning this, but frankly this was more of a general comment than something specifically aimed at my lab (where we already have our own post-processing scripts for doing this).

@Augustin-Zidek may I respectfully disagree? I get the UNIX philosophy standpoint, but I do not see why using the correct numbering would go against that — in fact, rather the opposite. Just consider all the secreted proteins that have an N-terminal signal peptide: for any biologically meaningful prediction, we normally do not include those residues (which, in real life, are essentially never "seen" by the rest of the protein), so all resulting mature protein predictions end up being misnumbered (compared to the numbering that one finds in UniProt, for example). And the same of course happens if one wants to predict the structure of an engineered construct, which may have tags or the like. In all these cases, enforcing numbering from 1 is biologically meaningless. Moreover, I do not quite see how the introduction of the option that I was envisaging would be so intrusive, considering that it would just add one (optional) line per sequence in the input JSON.

One may of course argue that if one is able to install and run AlphaFold on their own machine, surely they can easily work out a way to renumber residues if needed. And this is indeed the case (as I mentioned above). But for a lot of the biologists that make predictions using the server this is simply not trivial (as I often get asked about this issue), so my main reason for bringing this up was just to try making everyone's life easier...

@Augustin-Zidek
Collaborator

Augustin-Zidek commented Nov 29, 2024

@lucajovine Here is the script that does the renumbering (note that I pushed new Structure methods to enable this in 38d599b) - it should be easy to integrate in your AlphaFold 3 pipeline:

from alphafold3 import structure
import numpy as np  # Needed for np.char.mod below.

# Inputs - the mmCIF file and the desired ID of the first residue in each chain.
mmcif = ... # Read the mmCIF file as a string.
starts = ... # E.g. {'A': 15, 'B': -5}

# Create Structure from the mmCIF input.
struc = structure.from_mmcif(mmcif, include_bonds=True)
# Remove all unresolved residues - they offset the numbering if they are at the beginning.
# This is not an issue for AlphaFold, since all residues are present, but it is for mmCIF
# files from the Protein Data Bank.
struc = struc.copy_and_update(residues=struc.present_residues)

# Create the residue ID remapping.
res_id_map = {}
for res in struc.iter_residues(include_unresolved=True):
  chain_id = res['chain_id']
  chain_res_id_map = res_id_map.setdefault(chain_id, {})
  # New ID = desired start for this chain + number of residues already mapped in it.
  chain_res_id_map[res['res_id']] = starts.get(chain_id, 1) + len(chain_res_id_map)
# Remap label residue IDs.
struc = struc.remap_res_id(res_id_map)
# Remap also the author residue ID to match the label residue ID.
struc = struc.copy_and_update_residues(
    res_auth_seq_id=np.char.mod('%d', struc.residues_table.id).astype(object)
)

print(struc.to_mmcif())
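
For readers without AlphaFold 3 installed, the remapping loop above can be tried in isolation on plain tuples. This is only a sketch of the same counting logic: `build_res_id_map` is a hypothetical helper name, not part of `alphafold3.structure`, and it assumes each residue is listed exactly once, in chain order.

```python
def build_res_id_map(residues, starts):
    """Map each chain's old residue IDs to new IDs counted from starts[chain_id].

    residues: iterable of (chain_id, res_id) pairs, each residue listed once.
    starts: dict of chain_id -> desired ID of the first residue (default 1).
    """
    res_id_map = {}
    for chain_id, res_id in residues:
        chain_map = res_id_map.setdefault(chain_id, {})
        # len(chain_map) is the number of residues already mapped in this chain.
        chain_map[res_id] = starts.get(chain_id, 1) + len(chain_map)
    return res_id_map


residues = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]
res_id_map = build_res_id_map(residues, {"A": 15, "B": -5})
print(res_id_map)
# {'A': {1: 15, 2: 16, 3: 17}, 'B': {1: -5, 2: -4}}
```

Chains missing from `starts` keep numbering from 1, matching the `starts.get(chain_id, 1)` default in the script.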

Another option is to use the RENUMBER program from CCP4 PDBSET.

> Moreover, I do not quite see how the introduction of the option that I was envisaging would be so intrusive, considering that it would just add one (optional) line per sequence in the input JSON.

The snippet above illustrates that this renumbering involves choices that users might want exposed in the input format. Things like:

  • What do we do about the author residue IDs? Do we remap just those? Or do we set them to match the internal residue IDs? Do we need another flag to control that behavior?
  • Do we allow negative residue IDs?
  • Do we allow more general residue ID remappings? (What if there is a tag in the middle?)
  • Not a problem in case of AlphaFold since it doesn't have any unresolved residues, but more generally - how do we handle unresolved residues that are present in the residue tables (like _entity_poly_seq, _pdbx_poly_seq_scheme), but not in the _atom_site table? Do we renumber those as well?
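
The last point can be made concrete with a toy example. The sketch below assumes a chain described only by per-residue "resolved" flags; `assign_ids` and `count_unresolved` are hypothetical names used for illustration. It shows how the same start value yields different numbers depending on whether unresolved residues consume an ID.

```python
def assign_ids(resolved_flags, start, count_unresolved):
    """Return {sequence_index: new_id} under one of two conventions.

    count_unresolved=False: only resolved residues are numbered, so the first
    resolved residue gets `start` (what the script above does after filtering).
    count_unresolved=True: every residue in the sequence consumes a number.
    """
    ids = {}
    next_id = start
    for index, is_resolved in enumerate(resolved_flags):
        if not is_resolved and not count_unresolved:
            continue  # Unresolved residue gets no ID and does not use up a number.
        ids[index] = next_id
        next_id += 1
    return ids


# Five-residue chain whose first two residues are unresolved.
flags = [False, False, True, True, True]
present_only = assign_ids(flags, start=15, count_unresolved=False)
whole_sequence = assign_ids(flags, start=15, count_unresolved=True)
print(present_only)    # {2: 15, 3: 16, 4: 17}
print(whole_sequence)  # {0: 15, 1: 16, 2: 17, 3: 18, 4: 19}
```

For AlphaFold output the two conventions coincide, since every residue is present; they diverge only for files such as Protein Data Bank entries with unresolved residues.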

Some of this complexity would be pushed into AlphaFold's input format, where it doesn't belong. I also feel this is somewhat similar to the relaxation step in AlphaFold 2, which -- in hindsight -- should have been a completely separate program run on AlphaFold 2's outputs instead of being baked into the AlphaFold 2 binary.

@lucajovine
Author

@Augustin-Zidek Thanks for the script and sorry for the delay in answering, it was a busy time in Stockholm recently — as you know very well!
I can see the issues you brought up, but I wasn't really suggesting anything very fancy, just the ability to add a single offset for residue numbering... which would take care of 99% of incorrect numbering cases. Specifically:

  • What do we do about the author residue IDs? Do we remap just those? Or do we set them to match the internal residue IDs? Do we need another flag to control that behavior?
    I am not quite sure what you are referring to here by "author residue IDs"; can you please explain?

  • Do we allow negative residue IDs?
    Not crucial.

  • Do we allow more general residue ID remappings? (What if there is a tag in the middle?)
    No, also not crucial.

Kind regards, Luca
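
The single-offset idea described in this comment amounts to very little arithmetic. A minimal sketch, assuming predictions are numbered from 1; `apply_offset` is a hypothetical helper and the signal-peptide length is made up for the example:

```python
def apply_offset(res_ids, first_residue_number):
    """Shift 1-based residue IDs so that residue 1 becomes first_residue_number."""
    offset = first_residue_number - 1
    return [res_id + offset for res_id in res_ids]


# Hypothetical secreted protein predicted without its 22-residue signal peptide:
# the first modelled residue is residue 23 of the full-length sequence.
predicted_ids = [1, 2, 3, 4]
mature_ids = apply_offset(predicted_ids, 23)
print(mature_ids)  # [23, 24, 25, 26]
```

A single offset keeps the numbering contiguous, so it does not cover the "tag in the middle" case raised earlier in the thread.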

@Augustin-Zidek
Collaborator

> What do we do about the author residue IDs? Do we remap just those? Or do we set them to match the internal residue IDs? Do we need another flag to control that behavior?

> I am not quite sure what you are referring to here by "author residue IDs"; can you please explain?

I mean _atom_site.label_seq_id (mmCIF internal residue ID) vs _atom_site.auth_seq_id (author residue ID -- corresponds to the PDB-format concept of residue ID).

@lucajovine
Author

OK, maybe I am missing something, but since the predictions are created by AF3 and not supplied externally, aren't these always the same here? So I suppose that, if one were to remap residue numbers, I would keep the two IDs the same for consistency (i.e. both should be remapped).
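
To illustrate "both should be remapped", here is a toy sketch that shifts `_atom_site.label_seq_id` and `_atom_site.auth_seq_id` together in a hand-written loop. It is not a real mmCIF parser: `shift_seq_ids` is a hypothetical helper that assumes whitespace-separated, unquoted fields and integer IDs in every row, and it leaves other tables that carry residue numbers (such as `_pdbx_poly_seq_scheme`) untouched.

```python
ATOM_SITE = """\
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.auth_seq_id
ATOM CA MET A 1 1
ATOM CA LYS A 2 2
ATOM CA GLY B 1 1
"""


def shift_seq_ids(loop_text, starts):
    """Shift label_seq_id and auth_seq_id by the same per-chain offset."""
    lines = loop_text.splitlines()
    columns = [line.strip() for line in lines if line.startswith("_atom_site.")]
    chain_col = columns.index("_atom_site.label_asym_id")
    seq_cols = [
        columns.index("_atom_site.label_seq_id"),
        columns.index("_atom_site.auth_seq_id"),
    ]
    out = []
    for line in lines:
        if line.startswith(("loop_", "_atom_site.")) or not line.strip():
            out.append(line)  # Header and blank lines pass through unchanged.
            continue
        fields = line.split()
        # Chains absent from `starts` keep their numbering (offset 0).
        offset = starts.get(fields[chain_col], 1) - 1
        for col in seq_cols:
            fields[col] = str(int(fields[col]) + offset)
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


shifted = shift_seq_ids(ATOM_SITE, {"A": 23})
print(shifted)
```

With `{"A": 23}`, chain A rows end in `23 23` and `24 24`, while chain B is left as `1 1`.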


3 participants