Add support for splitting tensor arena into persistent/non-persistent arenas #908

andresovela opened this issue Jul 17, 2024 · 4 comments · May be fixed by xmos/lib_tflite_micro#164

andresovela commented Jul 17, 2024

When a model's tensor arena requirements exceed the available SRAM, the entire tensor arena has to be placed in external RAM, which leaves performance on the table because SRAM cannot be used as scratch memory at all.

I created tensorflow/tflite-micro#2627 to ask TFLM to support this use case, but it doesn't seem to be happening any time soon. However, according to a collaborator, it's already possible to split the tensor arena into persistent/non-persistent arenas.

It seems that in order to support this use case, we would need to add this functionality to xformer. I can do it myself if I get some guidance.

This would allow applications to place the non-persistent arena in SRAM and the persistent arena in external RAM, or vice versa, letting models perform better on the xcore.ai platform.
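For reference, the split that is reportedly already supported upstream would look roughly like this on the application side. This is only a minimal sketch, not xformer output: it assumes the MicroAllocator::Create overload that takes separate persistent/non-persistent buffers exists in the TFLM version in use (the feature discussed in tensorflow/tflite-micro#2627), and model_data, run_inference, the op resolver and the arena size macros are placeholders.

// Sketch only: not the actual xformer/lib_tflite_micro code.
#include <cstdint>
#include "tensorflow/lite/micro/micro_allocator.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Arena sizes taken from the test below.
#define NON_PERSISTENT_ARENA_SIZE 149760
#define PERSISTENT_ARENA_SIZE 9040

// Scratch (non-persistent) allocations stay in SRAM.
alignas(8) static uint8_t non_persistent_arena[NON_PERSISTENT_ARENA_SIZE];

// Long-lived (persistent) allocations are pushed to external RAM.
__attribute__((section(".ExtMem.bss")))
alignas(8) static uint8_t persistent_arena[PERSISTENT_ARENA_SIZE];

extern const unsigned char model_data[];  // placeholder for the model flatbuffer

void run_inference() {
  const tflite::Model* model = tflite::GetModel(model_data);
  static tflite::MicroMutableOpResolver<8> op_resolver;  // ops registered elsewhere

  // Verify the argument order against the MicroAllocator header of the
  // TFLM version in use.
  tflite::MicroAllocator* allocator = tflite::MicroAllocator::Create(
      persistent_arena, PERSISTENT_ARENA_SIZE,
      non_persistent_arena, NON_PERSISTENT_ARENA_SIZE);

  tflite::MicroInterpreter interpreter(model, op_resolver, allocator);
  interpreter.AllocateTensors();
  interpreter.Invoke();
}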

andresovela commented Jul 18, 2024

I was able to test the persistent/non-persistent split feature and I think it can be very useful in some cases. I tested it with a model with the following arena sizes:

  • Tensor arena size: 158800
  • Non-persistent arena size: 149760
  • Persistent arena size: 9040

In this particular case, the entire arena actually fits in SRAM, so the test itself is kind of pointless, but I wanted to see what kind of performance we would be able to achieve for a model where a split would actually make sense.

Entire tensor arena in SRAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      240436 .  OKAY
    (Stack: 4164, Code: 49020, Data: 187252)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2659902      26.60ms
14    OP_XC_conv2d_v2                  80202        0.80ms
4     OP_UNPACK                        3227         0.03ms
4     OP_SPLIT                         9783         0.10ms
20    OP_XC_add                        12066        0.12ms
21    OP_XC_lookup                     12327        0.12ms
12    OP_XC_mul                        6970         0.07ms
3     OP_RESHAPE                       508          0.01ms

Total time for invoke() - 2784985    27.85ms

Entire tensor arena in external RAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      81636 .  OKAY
    (Stack: 4164, Code: 49020, Data: 28452)
  ExtMem available:    134217728,   used:     158800 .  OKAY
    (Stack: 0, Code: 0, Data: 158800)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2662671      26.63ms
14    OP_XC_conv2d_v2                  1372456      13.72ms
4     OP_UNPACK                        3335         0.03ms
4     OP_SPLIT                         12141        0.12ms
20    OP_XC_add                        47413        0.47ms
21    OP_XC_lookup                     26059        0.26ms
12    OP_XC_mul                        11189        0.11ms
3     OP_RESHAPE                       603          0.01ms

Total time for invoke() - 4135867    41.36ms

Non-persistent arena in SRAM, persistent arena in external RAM

Constraint check for tile[0]:
  Memory available:       524288,   used:      14848 .  OKAY
    (Stack: 1116, Code: 11036, Data: 2696)
Constraints checks PASSED WITH CAVEATS.
Constraint check for tile[1]:
  Memory available:       524288,   used:      231396 .  OKAY
    (Stack: 4164, Code: 49020, Data: 178212)
  ExtMem available:    134217728,   used:       9040 .  OKAY
    (Stack: 0, Code: 0, Data: 9040)
Constraints checks PASSED WITH CAVEATS.

Cumulative times for invoke()...
53    OP_XC_ld_flash                   2661568      26.62ms
14    OP_XC_conv2d_v2                  81024        0.81ms
4     OP_UNPACK                        3228         0.03ms
4     OP_SPLIT                         9784         0.10ms
20    OP_XC_add                        14514        0.15ms
21    OP_XC_lookup                     13199        0.13ms
12    OP_XC_mul                        7021         0.07ms
3     OP_RESHAPE                       507          0.01ms

Total time for invoke() - 2790845    27.91ms

As you can see, performance is virtually the same when only the persistent arena is moved to external RAM, and we shave off about 9 kB of SRAM for this particular model.

I wanted to test one of our larger models, but it turns out that the persistent arena size is actually not that big for that particular model, so the savings are unfortunately not enough to place the non-persistent arena in SRAM.

  • Tensor arena size: 454200
  • Non-persistent arena size: 446976
  • Persistent arena size: 7224

However, if we managed to shave just 4 kB off the tensor arena, we could place the non-persistent arena in SRAM and would see pretty significant wins for this particular model using the split-arena feature.

@panickal-xmos (Collaborator)

Thank you @andresovela. One question: for this example, where you mention the arena being in external RAM, are you using DDR, or have you simulated that case by running the model with just one thread?

@andresovela (Contributor Author)

What do you mean by DDR? Is that a feature that needs to be enabled or something?

This is all I did:

#if defined(EXTERN_TENSOR_ARENA)

#if defined(SPLIT_PERSISTENT_TENSOR_ARENA)
// Persistent arena, placed in external RAM via the .ExtMem.bss section.
__attribute__((section(".ExtMem.bss")))
uint8_t persistent_tensor_arena[LARGEST_PERSISTENT_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));
#endif

// Main tensor arena (holds only non-persistent allocations when the split is enabled).
uint8_t tensor_arena[LARGEST_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));

#endif
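For the "Entire tensor arena in external RAM" measurement above, the main arena presumably carried the same section attribute; that variant is not shown in the snippet, so the following is just a guess at what it would look like:

// Hypothetical non-split variant: the same section attribute on the main
// arena places the whole tensor arena in external RAM.
__attribute__((section(".ExtMem.bss")))
uint8_t tensor_arena[LARGEST_TENSOR_ARENA_SIZE] __attribute__((aligned(8)));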

andresovela commented Jul 18, 2024

I am running xcore-opt without the --xcore-thread-count option, so I am using only 1 thread AFAIK.
