Add example for splitting model into two models and running them in two tiles #910

Open
andresovela opened this issue Jul 18, 2024 · 4 comments


@andresovela
Contributor

First of all, sorry for spamming you with issues 😅 I'm just trying to optimize inference performance for a model larger than 512 kB, and I'm exploring all possible options at the moment.

I think one alternative would be to split the large model into two smaller models, run each one on a separate tile, and have the output of the model on tile[0] be the input to the model on tile[1]. Is that technically feasible using channels, or is that a no-go?

@panickal-xmos
Collaborator

No worries, more information helps us improve the tools. Yes, that is feasible. We have been primarily focused on flash-based workflows, but there are various options to achieve what you are looking for. What is the size of the model? Would it fit within two tiles? We could also communicate directly via email, as that would make it easier to share more information about the models. My email is [email protected] .

@andresovela
Contributor Author

Nice, I'll contact you via email then :)

@panickal-xmos
Collaborator

This can be done by splitting the model yourself and compiling the two halves separately with xmos-ai-tools. You would then have to wire them up in the application source code.
It's not recommended though, as you would lose quite a bit of space to code and the tensor arena on both tiles. When splitting a model across multiple tiles, it's better to keep the code and tensor arena on one tile and the weights on one or more other tiles.
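For illustration, here is a rough sketch of what the wiring in the application code could look like. It is only a sketch: `model_a_invoke`/`model_b_invoke` and `MODEL_A_OUTPUT_BYTES` are placeholders for whatever wrappers you build around the two separately compiled models; the channel calls are the buffered transfers from lib_xcore.

```c
// Sketch only: the model_a_* / model_b_* functions and the intermediate
// tensor size are placeholders, not part of xmos-ai-tools.
#include <stdint.h>
#include <xcore/chanend.h>
#include <xcore/channel.h>

#define MODEL_A_OUTPUT_BYTES 1024  /* size of the intermediate tensor (placeholder) */

/* Placeholder wrapper around the first half of the model (runs on tile[0]). */
extern void model_a_invoke(const int8_t *input, int8_t *output);
/* Placeholder wrapper around the second half of the model (runs on tile[1]). */
extern void model_b_invoke(const int8_t *input, int8_t *output);

/* tile[0]: run the first half and stream its output tensor to tile[1]. */
void tile0_task(chanend_t c_to_tile1, const int8_t *model_input)
{
    int8_t intermediate[MODEL_A_OUTPUT_BYTES];

    model_a_invoke(model_input, intermediate);
    chan_out_buf_byte(c_to_tile1, (uint8_t *)intermediate, MODEL_A_OUTPUT_BYTES);
}

/* tile[1]: receive the intermediate tensor and run the second half. */
void tile1_task(chanend_t c_from_tile0, int8_t *model_output)
{
    int8_t intermediate[MODEL_A_OUTPUT_BYTES];

    chan_in_buf_byte(c_from_tile0, (uint8_t *)intermediate, MODEL_A_OUTPUT_BYTES);
    model_b_invoke(intermediate, model_output);
}
```

The two tasks would still need to be placed on tile[0] and tile[1] and connected by a channel in the top-level main, which is not shown here.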

@andresovela
Contributor Author

I agree it's a waste of resources, but it may allow a model to run faster than if you had to read the weights from flash. Realistically, I think we won't use this option, but it'd be good to have an example I guess :)
