
Conversion fails on model loaded via torch.load or torch.jit.load #221

Open
saseptim opened this issue Sep 13, 2024 · 12 comments

@saseptim

Description of the bug:

I have a PyTorch model which was saved with torch.jit.save(). I tried both a traced model and a scripted model. The error is:

```
File /orcam/ear/scratch/usr/avis/VENV_AI_EDGE/lib/python3.10/site-packages/torch/export/_trace.py:1449, in _export(mod, args, kwargs, dynamic_shapes, strict, preserve_module_call_signature, pre_dispatch, _allow_complex_guards_as_runtime_asserts, _disable_forced_specializations, _is_torch_jit_trace)
   1447 original_state_dict = mod.state_dict(keep_vars=True)
   1448 if not _is_torch_jit_trace:
-> 1449     forward_arg_names = _get_forward_arg_names(mod, args, kwargs)
   1450 else:
   1451     forward_arg_names = None

File /orcam/ear/scratch/usr/avis/VENV_AI_EDGE/lib/python3.10/site-packages/torch/export/_trace.py:753, in _get_forward_arg_names(mod, args, kwargs)
    739 def _get_forward_arg_names(
    740     mod: torch.nn.Module,
    741     args: Tuple[Any, ...],
    742     kwargs: Optional[Dict[str, Any]] = None,
    743 ) -> List[str]:
    744     """
    745     Gets the argument names to forward that are used, for restoring the
    746     original signature when unlifting the exported program module.
    (...)
    751     export lifted modules.
    752     """
--> 753     sig = inspect.signature(mod.forward)
    754     _args = sig.bind_partial(*args).arguments
    756     names: List[str] = []

File /usr/lib/python3.10/inspect.py:3254, in signature(obj, follow_wrapped, globals, locals, eval_str)
   3252 def signature(obj, *, follow_wrapped=True, globals=None, locals=None, eval_str=False):
   3253     """Get a signature object for the passed callable."""
-> 3254     return Signature.from_callable(obj, follow_wrapped=follow_wrapped,
   3255                                    globals=globals, locals=locals, eval_str=eval_str)

File /usr/lib/python3.10/inspect.py:3002, in Signature.from_callable(cls, obj, follow_wrapped, globals, locals, eval_str)
   2998 @classmethod
   2999 def from_callable(cls, obj, *,
   3000                   follow_wrapped=True, globals=None, locals=None, eval_str=False):
   3001     """Constructs Signature for the given callable object."""
-> 3002     return _signature_from_callable(obj, sigcls=cls,
   3003                                     follow_wrapper_chains=follow_wrapped,
   3004                                     globals=globals, locals=locals, eval_str=eval_str)

File /usr/lib/python3.10/inspect.py:2550, in _signature_from_callable(obj, follow_wrapper_chains, skip_bound_arg, globals, locals, eval_str, sigcls)
   2548 except ValueError as ex:
   2549     msg = 'no signature found for {!r}'.format(obj)
-> 2550     raise ValueError(msg) from ex
   2552 if sig is not None:
   2553     # For classes and objects we skip the first parameter of their
   2554     # call, new, or init methods
   2555     if skip_bound_arg:

ValueError: no signature found for <torch.ScriptMethod object at 0x7f942662ffb0>
```
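
For reference, a minimal sketch of the flow that triggers this (the module and file name are illustrative, not my actual model):

```python
import torch
import ai_edge_torch

# A trivial module standing in for my actual model.
class MyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

# Save with torch.jit (scripted here; traced behaves the same for me),
# then reload without the Python class definition.
torch.jit.save(torch.jit.script(MyModel()), "my_model.pt")
loaded = torch.jit.load("my_model.pt")

sample_inputs = (torch.randn(1, 8),)

# Fails with the ValueError above: the loaded ScriptModule's forward
# has no inspectable Python signature.
edge_model = ai_edge_torch.convert(loaded.eval(), sample_inputs)
```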

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response

@saseptim saseptim added the type:bug Bug label Sep 13, 2024
@pkgoogle (Contributor)

Hi @saseptim, I don't believe ai-edge-torch can handle that file format; for this repo, the PyTorch model needs to be torch.export-compliant. You can find more details here: https://github.com/google-ai-edge/ai-edge-torch/blob/main/docs/pytorch_converter/README.md#conversion
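
For reference, the conversion flow from that README looks roughly like this (torchvision's resnet18 is just a stand-in for any torch.export-compliant nn.Module):

```python
import torch
import torchvision
import ai_edge_torch

# Any torch.export-compliant nn.Module works; resnet18 is an example.
model = torchvision.models.resnet18().eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert and serialize to a TFLite flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("resnet18.tflite")
```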

Do you have an example script showing what you are doing? Thanks.

@pkgoogle pkgoogle self-assigned this Sep 16, 2024
@pkgoogle pkgoogle added type:support For use-related issues status:awaiting user response When awaiting user response and removed type:bug Bug labels Sep 16, 2024
github-actions bot

Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.

@jchwenger

Hi @pkgoogle, I just came across this as I was trying to convert a pix2pix model (here is the training code). I save using torch.jit.script, which lets me reload the model easily without having to redefine it in Python. I see that you recommend reading about torch.export, but it's still unclear to me whether it's a matter of saving the model using this new approach, or something else...? The current example is nice, but it uses an off-the-shelf network rather than something users would have defined themselves. Any ideas? It would be nice to have a workable pipeline from PyTorch to mediapipe. Thanks in advance!

@pkgoogle (Contributor)

Hi @jchwenger, we don't support models that are not torch-exportable (plenty of custom models are torch-exportable, although not every one is). Fundamentally, the root issue is with torch.export, so we cannot fix that; once/if it is fixed, this library should be able to convert the model, and if it still can't, the root cause may be a bug on our end. To test for torch exportability, follow the steps here: https://pytorch.org/docs/stable/export.html — i.e. load the model and see if you can export it with the PyTorch APIs. If you don't run into an exception, it is probably torch-exportable.

torch.export exports it to StableHLO, an MLIR dialect which is more interoperable with the ecosystem of libraries that support MLIR, including this one. You can think of it as a different saving format that is more interoperable with other libraries. This is important for getting models working on heterogeneous hardware such as edge devices, mobile, TPUs, etc.
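
A quick way to run that check (a sketch, assuming you still have the original nn.Module class around; `MyModel` and the input shape are placeholders):

```python
import torch

# Use the original nn.Module, not a torch.jit.load() result.
model = MyModel().eval()
sample_args = (torch.randn(1, 3, 256, 256),)

# If this raises, the model is not torch-exportable yet; the exception
# usually points at the offending op.
exported_program = torch.export.export(model, sample_args)
print(exported_program)
```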

@jchwenger

Hi @pkgoogle, thanks for this! I was confused by the phrasing in the docs; it's as simple as that: when you say "must be compliant with torch.export", it just means the model must be saved using that format/API. I got this to work, yay! However, out of three tests only a simple dense net works, and I'm not entirely sure why.

I have a very simple and runnable Colab here, maybe you will see something super obvious I missed?

Strangely, I get an error around frozen tensors in the Pix2Pix for the in-place nn.ReLU(True), but not in the DCGAN generator; I'll report this on PyTorch...

@jchwenger commented Oct 1, 2024

Side note: the in-place nn.ReLU(True) mystery is now solved in the issue above: it was caused by the presence of dropout before it (as of now, no mutation, and therefore no in-place operation, is allowed after a dropout).
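
For anyone landing here, the pattern in question looks like this (an illustrative sketch, not the actual pix2pix code; switching to the out-of-place version should presumably avoid the error):

```python
import torch.nn as nn

# Export chokes on the in-place ReLU when a dropout precedes it:
failing = nn.Sequential(
    nn.Dropout(0.5),
    nn.ReLU(inplace=True),  # in-place mutation after dropout
)

# Out-of-place version of the same block:
working = nn.Sequential(
    nn.Dropout(0.5),
    nn.ReLU(inplace=False),
)
```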


github-actions bot commented Oct 9, 2024

Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.

@jchwenger

Hi again, bumping this up, @pkgoogle. Just wondering: do you believe there is reasonable hope of fixing this discrepancy (off-the-shelf ResNet and a small dense net convert OK, but DCGAN and pix2pix do not)? Or should I try to post this issue on the PyTorch repo? Any thoughts welcome, thanks!

@pkgoogle (Contributor)

Hi @jchwenger, I'm having trouble figuring out which discrepancies you are referring to. Are you saying that we can convert ResNet and a small dense net, but not DCGAN and pix2pix? The answer will depend on what is causing the issue. If it's due to PyTorch export, then the root cause is with PT Export (in which case you should create an issue there); if it's something else... well, I will have to investigate. For DCGAN and pix2pix, if we haven't confirmed it's PT Export, can you provide me with a reproducible script that shows the error? (Sometimes people make small changes/adjustments in their code that actually affect the investigation.)

@jchwenger

Thanks @pkgoogle for the answer!

It's quite simple: with the custom dense net and ResNet, the test described in the original docs passes with "Inference result with Pytorch and TfLite was within tolerance", whereas with the DCGAN and pix2pix models it fails with "Something wrong with Pytorch --> TfLite".
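
(That test is essentially the validation comparison from the converter README, along these lines — `model`, `edge_model`, and `sample_inputs` as in the conversion step:)

```python
import numpy as np

# Run the same inputs through the original and the converted model.
torch_output = model(*sample_inputs)
edge_output = edge_model(*sample_inputs)

# Compare the outputs within a small numerical tolerance.
if np.allclose(
    torch_output.detach().numpy(),
    edge_output,
    atol=1e-5,
    rtol=1e-5,
):
    print("Inference result with Pytorch and TfLite was within tolerance")
else:
    print("Something wrong with Pytorch --> TfLite")
```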

As you say, I don't know if it's the PT export or the conversion...

I have all four examples in this Colab, which should be runnable out of the box, with only the session restart needed after installing the dependencies. Thanks in advance!

@pkgoogle (Contributor)

Hi @jchwenger, I'm looking into it, but are you associated with the OP? The reason I ask is that it feels like we are hijacking this thread, as the original problem seems different. If you are not, we would much prefer you create a new issue to track progress on your issues. In this case, it looks like an accuracy issue post-conversion for DCGAN & pix2pix.

@jchwenger

Fair point, all done, see here!
