Autobuild Python/Pytorch module for derivative code #166
base: second_derivative
Conversation
…ivative GAPC code (still need to add pytorch_interface.cc generation)
…odule with make -f out.mf install
…core matrices to pytorch Tensors; add generic_pytorch_interface.cc; can now calculate score matrices in Python for NW
…low errors when adding C++ double vals to tensor (probably need to change our code to use float so tensor can be float32 for GPU compatibility); use make_blob function to make tensors (way faster)
… matrices where returned instead of forward matrices
…in object initialization
…rything if forward gets re-called with different inputs; fix comment
… torch::Tensor type based on nt table type
…r and Seq so input tensors can be processed
…pe as char analogous
…ensors; some refactors/comments
…parisons; module compiles and can be built now
…omments; add some more constructors; refactor indexing
…ave differently-sized batch dimensions
…so be converted to Tensors
…ts with the help of template args
Update (04.07.2023): I didn't post an update for a while, but here are all changes since the last update:
…when batched input is supposed to be processed; allow users to limit max batch size with environment variable at compilation
…Batch object allocation
Update (09.07.2023): I added some minor bugfixes and overall improvements:
I am not yet through; here are my first comments.
bool batched = input.find("batched") != input.npos;

// check the number of dims of the input Tensor (default: 2)
std::regex regex("[a-zA-Z0-9]*([0-9]{1,2})D[a-zA-Z0-9]*");
Why limit this to 0-99? Is this a limit of PyTorch?
I don't think PyTorch tensors have a dimension limit. But how realistic is it that someone has a tensor with e.g. 90 dimensions? Since I wanted to at least support 10D tensors, this regex needs to parse a 2-digit number, so 99 is the natural hard limit. Would you prefer this to be unlimited instead?
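As a side note, the greedy leading character class in this pattern can interact badly with two-digit dimension counts. The sketch below uses Python's re module to illustrate (Python and the default ECMAScript grammar of std::regex should behave the same for this pattern, since both use greedy backtracking); the input names are hypothetical examples, not taken from the PR:

```python
import re

# The pattern from the PR, unchanged.
pr_pattern = re.compile(r"[a-zA-Z0-9]*([0-9]{1,2})D[a-zA-Z0-9]*")

# The leading [a-zA-Z0-9]* greedily consumes digits too, so for a
# two-digit dimension count only the last digit ends up in the group:
print(pr_pattern.match("tensor2D").group(1))    # -> "2"
print(pr_pattern.match("tensor10D").group(1))   # -> "0", not "10"

# Excluding digits from the prefix class avoids this:
fixed_pattern = re.compile(r"[a-zA-Z]*([0-9]{1,2})D[a-zA-Z0-9]*")
print(fixed_pattern.match("tensor10D").group(1))  # -> "10"
```

So if 10D support is the goal, the prefix class may need to exclude digits (or the pattern be anchored differently), independent of the 99 ceiling.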
src/tensor.hh
try {
  n_dims = std::stoi(match[1].str());
} catch (const std::exception &e) {
  std::cerr << "Couldn't convert " << match[1].str() << " to a number.\n";
Could you please use the gapc log mechanism to report this error? It would also help if you pointed the user to the input declaration line (the Loc(action) mechanism).
Sure, I will adjust this.
    break;
  }
}
We should also warn or raise an error if the user provides a "tensor" type that is not supported at all.
If the user misspells "tensor" in the input declaration, this will already be caught in str_to_mode in input.cc. I added another warning for the specified tensor type if the parser can't determine it.
Makefile
test-pytorch:
	cd testdata/regresstest &&\
	$(SHELL) test_pytorch_module.sh $(TRUTH_DIR)

test-pytorch-batched:
	cd testdata/regresstest &&\
	$(SHELL) test_pytorch_module.sh $(TRUTH_DIR) ""
Could we merge both tests into just one target, to reduce the risk of a test not being executed?
Sure, I will merge them.
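A minimal sketch of what the merged target could look like, assuming test_pytorch_module.sh distinguishes the batched run by its extra "" argument as in the current targets (hypothetical, not the actual merged rule from the PR):

```make
# run both the plain and the batched regression test from a single rule
test-pytorch:
	cd testdata/regresstest &&\
	$(SHELL) test_pytorch_module.sh $(TRUTH_DIR) &&\
	$(SHELL) test_pytorch_module.sh $(TRUTH_DIR) ""
```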
GAPC=../../../gapc
MAKE="make"
PYTHON="python3"
Why is it necessary to specify the 3 at the end? Shouldn't python be sufficient if we install the correct dependency?
It should work with just python as well. I think the official interpreter binary name is still python3, so that should always work; plain python probably works in most shells as well, though.
equal: bool = True
for matrix, reference in zip(bw_matrices, reference_matrices):
    # make sure both matrices have the same dtype (float32)
    # and don't contain any inf vals (bug if so?)
You should check what happens for triangular matrices (i.e. single track) for cells i > j. These cells might be flagged as +/-inf.
I will eventually, but as of right now I don't know of a test program that makes use of derivatives and uses just one track / has triangular matrices. But any inf values will be set to 0 in one of the next couple of lines in this loop, so they wouldn't cause any trouble here.
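The masking step described above can be sketched in plain Python. Note this is an illustration of the idea only: the actual test operates on PyTorch tensors, while `mask_inf` and `matrices_equal` here are hypothetical helpers working on nested lists of floats:

```python
import math

def mask_inf(matrix):
    """Replace +/-inf cells (e.g. unreachable i > j cells of a
    triangular single-track matrix) with 0 before comparison."""
    return [[0.0 if math.isinf(v) else v for v in row] for row in matrix]

def matrices_equal(bw_matrices, reference_matrices, tol=1e-6):
    """Compare each backward matrix against its reference after masking."""
    for matrix, reference in zip(bw_matrices, reference_matrices):
        m, r = mask_inf(matrix), mask_inf(reference)
        if any(abs(a - b) > tol
               for row_m, row_r in zip(m, r)
               for a, b in zip(row_m, row_r)):
            return False
    return True
```

With this masking in place, +/-inf cells on both sides compare as equal, which is why the triangular case should not cause trouble in the loop above.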
# tests for regular (single) input and batched inputs
# for 1st and 2nd derivative
# (Truth files are in gapc-test-suite/Truth/Regress)
TESTS = {
Could we extend this great framework to variable source input? Sometime in the future, I'd like to also process RNA folding algorithms this way. How much effort would it be to also subject testdata/grammars_outside/nodangle.gap to your mechanism?
I'm not sure what you mean. Are you asking if this feature can also convert programs like nodangle to Python packages? In general, this feature should be able to convert any GAP-L program that also works with the --derivative option into a Python package. If nodangle can be rewritten in a differentiable way that makes use of the --derivative option, then this feature should also be able to handle e.g. nodangle.
    return x * exp(-2.0);
}
float Ers(<alphabet a, alphabet b>, <tensorslice locA, tensorslice locB>, float x) {
    if (equal(a, b)) {
Is it possible to overload the == operator appropriately, as this would be more natural for programmers?
Done.
algebra alg_count auto count;

algebra alg_hessian implements sig_alignments(alphabet=tensorchar, answer=F64batch) {
    F64batch Ins(<alphabet a, void>, <tensorslice locA, tensorslice locB>, F64batch x) {
Note to myself (@sjanssen2): shouldn't it be possible to type answer and alphabet here in the source code and later replace all answer with what is provided in the first line of the algebra definition in the AST, here F64batch?!
rtlib/traces.hh
#include <utility>
#include <cassert>

#define MAX_INDEX_COMPONENTS 3  // max indices for an index_components object
In general, we need to consider multi track programs for which some DP matrices can be quadratic, linear or even constant. I feel that restricting the number of index components to 3 might be too restrictive.
This is already fixed in the latest traces.hh adjustment.
…e PyTorch/Tensor-related GAP-L test programs to testdata/grammar_pytorch; bundle batched/non-batched test into 1; improve tensor input declaration error messages
…tes with differently-sized index components; substitute some copies for moves in the trace mechanism
…ault traces template args
Differential Dynamic Programming
Motivation
(Some of this might not be 100% correct; ignore it if that's the case.)
Recently, certain DP algorithms such as Needleman-Wunsch (NW) have been shown to be differentiable. This discovery has opened new doors for such DP algorithms, which can now be used in deep learning models like DeepBLAST, which aims to learn protein structural similarity from sequence alone. This particular model is written in Pytorch and uses Pytorch's autograd functionality (automatic differentiation) for the backward pass through the NW algorithm. Due to the way this functionality is implemented, it is extremely slow, because it takes many kernel calls to complete the backward pass. It would be far more efficient to calculate the entire backward pass in one function/kernel call. This is where GAPC can help: GAPC can generate C++ code that calculates the score matrices for e.g. differentiable NW in one kernel call. If we were able to call this generated code directly from Python/Pytorch, the entire backward pass could be executed in a single kernel call, which has the potential to significantly speed up the training process of the model.

Implementation
In order to make our GAPC-generated derivative code Python-/Pytorch-compatible and usable in e.g. DeepBLAST, I implemented a method which uses the Pytorch C++ Extension API to automatically create a Python module from our generated code. The module provides a forward and a backward function for the first and second derivative (if requested). After installation, this module can be imported and used like any other Python module. Currently, these functions take two strings/sequences as arguments (just like the GAP-L NW/Gotoh implementations) and output a list of Pytorch Tensors containing all alignment score matrices (1 for NW, 3 for Gotoh).

These functions are provided through an interface (pytorch_interface.cc), which contains generic definitions for the forward/backward functions as well as binding instructions for the module build step. (Side note: since our GAPC-generated code for NW/Gotoh uses double vectors to store all values, I had to use float64 Tensors to store these values. This might cause problems if the code is supposed to run on a GPU later, so we should check whether our GAPC algorithm also works with 32-bit floats.) The module creation is handled via setuptools. An adjusted Makefile will generate a setup.py file (as well as the pytorch_interface.cc file) with all necessary instructions for the module creation.

The GAPC-generated code is only minimally modified. The out object(s) get the additional methods get_forward_score_matrices and get_backward_score_matrices, which convert the internal score matrices (std::vector) to torch::Tensor objects and return them. This means that, as of right now, the core DP algorithm is unmodified and calculates everything regularly; I merely convert the score matrices to tensors after the algorithm has finished its calculations.

Prerequisites, Usage, Example
In order to create a Python/Pytorch module with GAPC instead of the regular binary, you can use the --as-torchmod MODULE_NAME option in combination with the --derivative N option (if the latter is not set, the module creation cannot work, so the user is instructed to set it).

With this option enabled, an adjusted Makefile will be produced, which contains rules that generate the setup.py file as well as the pytorch_interface.cc file, through which all functions of the module are provided. One make command will generate the setup.py and pytorch_interface.cc files; a second command will generate the Python module and install it locally.

For the installation to work, the user must have the Python modules torch and pip installed (as well as Python itself, obviously). The Python.h header file is also required. This file does not usually come with the default Python installation, as it is part of the python-dev/python-devel package. It is available, though, if the user installs Python/Pytorch in a conda environment (which is recommended for Pytorch anyway). As an example, I will provide a short installation guide on how I set everything up on my system. Since I don't have a GPU on my machine, I installed the CPU-only version of torch; this should also work for CUDA versions of Pytorch, though.

After that, you can use the nw_gapc module in Python:
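The Python example itself appears to have been lost in this view of the description. Below is a hypothetical sketch of how the installed module might be called, based on the forward/backward API described above; the sequences and the exact function signature are assumptions, and the import is guarded since nw_gapc only exists after building and installing it (per the commit log, via make -f out.mf install):

```python
try:
    import nw_gapc  # only available after building/installing the module
except ImportError:
    nw_gapc = None

def nw_score_matrices(seq_a: str, seq_b: str):
    """Run the generated forward pass and return the alignment score
    matrices as a list of torch Tensors (1 matrix for NW, 3 for Gotoh)."""
    if nw_gapc is None:
        raise RuntimeError("nw_gapc module is not installed")
    return nw_gapc.forward(seq_a, seq_b)
```

A corresponding nw_gapc.backward call would then compute the backward-pass matrices in a single kernel call, which is the whole point of the approach.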