Residue numbering start specification #74
Comments
I am wondering whether this should be something that AlphaFold does -- I feel this goes against the UNIX philosophy of doing one thing and doing it well. This is clearly a post-processing step that should be done on the produced mmCIF file. If we were to introduce it in AlphaFold, it would require modifying the input format (to specify the starting residue ID for each chain), which seems too invasive. That being said, I think I will add a utility method in the I will leave the issue open until I implement it, then I will comment here with a Python snippet to achieve what you are asking for.
@lucajovine residue numbering is handled in our AlphaPulldown interface to AlphaFold2 (https://github.com/KosinskiLab/AlphaPulldown). You calculate input features for full-length proteins and then you run predictions for any subsets of residues preserving original full-length residue numbers. When/if we add AlphaFold3 backend, the same functionality will be supported. |
@jkosinski thank you for mentioning this, but frankly this was more of a general comment than something specifically aimed at my lab (where we already have our own post-processing scripts for doing this). @Augustin-Zidek may I respectfully disagree? I get the UNIX philosophy standpoint, but I do not see why using the correct numbering would go against that — in fact, rather the opposite. Just consider all the secreted proteins that have an N-terminal signal peptide: for any biologically meaningful prediction, we normally do not include those residues (which, in real life, are essentially never "seen" by the rest of the protein), so all resulting mature protein predictions end up being misnumbered (compared to the numbering that one finds in UniProt, for example). And the same of course happens if one wants to predict the structure of an engineered construct, which may have tags or the like. In all these cases, enforcing numbering from 1 is biologically meaningless. Moreover, I do not quite see how the introduction of the option that I was envisaging would be so intrusive, considering that it would just add one (optional) line per sequence in the input JSON. One may of course argue that if one is able to install and run AlphaFold on their own machine, surely they can easily work out a way to renumber residues if needed. And this is indeed the case (as I mentioned above). But for a lot of the biologists that make predictions using the server this is simply not trivial (as I often get asked about this issue), so my main reason for bringing this up was just to try making everyone's life easier... |
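To make the numbering argument above concrete, here is a minimal sketch of how the desired first residue ID could be derived for a construct. The function name and its parameters are hypothetical (they are not part of AlphaFold), and it assumes that residue 0 is an acceptable ID, so a 6-residue N-terminal tag in front of native residue 1 is numbered -5 to 0.

```python
def start_residue_id(signal_peptide_len: int = 0, n_terminal_tag_len: int = 0) -> int:
    """Return the ID of the first modelled residue (hypothetical helper).

    Assumes the modelled construct is [N-terminal tag][mature protein], that
    the signal peptide was removed before prediction, and that residue 0 is
    a valid ID, so native residues keep their full-length numbering.
    """
    first_native_id = signal_peptide_len + 1
    return first_native_id - n_terminal_tag_len


# Mature protein after removal of a 22-residue signal peptide.
print(start_residue_id(signal_peptide_len=22))
# Engineered construct with a 6-residue tag before native residue 1.
print(start_residue_id(n_terminal_tag_len=6))
```

The resulting values are the per-chain start IDs that a renumbering step would need, whether it is supplied in an input file or applied as post-processing.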
@lucajovine Here is the script that does the renumbering (note that I pushed new Structure methods to enable this in 38d599b) - it should be easy to integrate in your AlphaFold 3 pipeline:
import numpy as np
from alphafold3 import structure
# Inputs - the mmCIF file and the desired ID of the first residue in each chain.
mmcif = ... # Read the mmCIF file as a string.
starts = ... # E.g. {'A': 15, 'B': -5}
# Create Structure from the mmCIF input.
struc = structure.from_mmcif(mmcif, include_bonds=True)
# Remove all unresolved residues - they offset the numbering if they are at the beginning.
# This is not an issue for AlphaFold, since all residues are present, but it is for mmCIF
# files from the Protein Data Bank.
struc = struc.copy_and_update(residues=struc.present_residues)
# Create the residue ID remapping.
res_id_map = {}
for res in struc.iter_residues(include_unresolved=True):
    chain_res_id_map = res_id_map.setdefault(res['chain_id'], {})
    # Residues are numbered consecutively from the chain's start ID (default 1).
    chain_res_id_map[res['res_id']] = starts.get(res['chain_id'], 1) + len(chain_res_id_map)
# Remap label residue IDs.
struc = struc.remap_res_id(res_id_map)
# Remap also the author residue ID to match the label residue ID.
struc = struc.copy_and_update_residues(
    res_auth_seq_id=np.char.mod('%d', struc.residues_table.id).astype(object)
)
print(struc.to_mmcif())
Another option is to use the
The snippet above illustrates that this renumbering involves choices that a user might want to have exposed in the input format. Things like:
Some of this complexity would be pushed into AlphaFold's input format, where it doesn't belong. I also feel this is somewhat similar to the relaxation step in AlphaFold 2 which -- in hindsight -- should have been a completely separate program that is run on AlphaFold 2's outputs instead of being baked into the AlphaFold 2 binary.
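The remapping step of the script above can be tried out in isolation. The following is a minimal sketch that uses plain tuples and dicts instead of the alphafold3 `Structure` class; `build_res_id_map` and the example residue list are hypothetical names introduced here for illustration only.

```python
def build_res_id_map(residues, starts):
    """Map old residue IDs to new ones, chain by chain (illustrative helper).

    residues: iterable of (chain_id, res_id) pairs in chain order.
    starts:   dict of chain_id -> desired ID of the first residue;
              chains not listed start at 1.
    """
    res_id_map = {}
    for chain_id, res_id in residues:
        chain_map = res_id_map.setdefault(chain_id, {})
        # The n-th residue seen in a chain gets ID start + (n - 1).
        chain_map[res_id] = starts.get(chain_id, 1) + len(chain_map)
    return res_id_map


residues = [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2), ('C', 1)]
res_id_map = build_res_id_map(residues, {'A': 15, 'B': -5})
print(res_id_map)
# {'A': {1: 15, 2: 16, 3: 17}, 'B': {1: -5, 2: -4}, 'C': {1: 1}}
```

Because new IDs are assigned consecutively, any gaps in the original numbering are closed; whether that is desirable is one of the choices a renumbering step has to make.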
@Augustin-Zidek Thanks for the script and sorry for the delay in answering, it was a busy time in Stockholm recently — as you know very well!
Kind regards, Luca |
I mean |
OK, maybe I am missing something but, since the predictions are created by AF3 and not supplied externally, aren't these always the same here? So I suppose that, if one were to remap residue numbers, I would keep the two IDs the same for consistency (i.e. both should be remapped).
This is not a bug but rather a request that, however, I am sure everybody will stand by: please add a way to specify, in the input JSON file, the number of the starting residue of each chain.
As users, we almost always have to do this "by hand" a posteriori, since AlphaFold numbering start defaults to 1 and - in the majority of cases - does not match the one we actually work with. This is a small change that could make things much easier for everybody, so I hope you'll take the suggestion into consideration. Thank you!