
Support interleaved complex numbers #22

Open
calebzulawski opened this issue Apr 12, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@calebzulawski
Contributor

A naive first shot at this could simply reorder the inputs and outputs, but a better implementation could embed this in the read and write loops.

@smu160 smu160 added the enhancement New feature or request label Apr 14, 2024
@smu160
Member

smu160 commented Apr 14, 2024

Hi @calebzulawski,

Thank you for bringing up this issue. I'm assuming what you mean is that we should be able to provide the input to fft_64 and fft_32 as a vector of structs, such that each struct is of type num::complex::Complex.

Namely, we should have a version of fft_64 (and fft_32) with the following input parameters:

pub fn fft_64(input: &mut [Complex64], direction: Direction) {
    // <-- snipped -->
}

The original reason this was never tried in phastft is the significant difference in code gen and performance that we saw in the quantum state simulator. Perhaps something such as SOA-derive could be of use here?

Let me know if I'm way off base or any other thoughts you may have. Thanks!

Best,
Saveliy

@calebzulawski
Contributor Author

Yep, that's exactly what I mean. I agree that it would definitely be a bit slower, so it would be best to offer it both ways, mostly as a convenience, since sometimes you can't choose your data format. Offering combinations (interleaving only the input or only the output, for an out-of-place operation) might be helpful too.

@smu160
Member

smu160 commented Apr 14, 2024

@calebzulawski

Thank you for getting back to me. I wonder if it could be more performant to just "pre-process" the array-of-structs into the struct-of-arrays and then back again (also taking into account how the output should be formatted). I know that the interleaved format needed a lot of permute instructions when I used it in the past. You may have more insight to offer when it comes to that though.

Let me know what you think. Thanks!

Best,
Saveliy

@smu160
Member

smu160 commented Apr 26, 2024

@calebzulawski

What do you think of starting out by offering a version of fft_64_interleaved and fft_32_interleaved that call the following function:

use num_complex::Complex;

pub fn separate_re_im(signal: &[Complex<f64>], chunk_size: usize) -> (Vec<f64>, Vec<f64>) {
    let mut reals = Vec::with_capacity(signal.len());
    let mut imaginaries = Vec::with_capacity(signal.len());

    // We don't assume a power-of-2 size, so we don't use `chunks_exact`.
    // Process each chunk, including the last one, which may be smaller.
    for chunk in signal.chunks(chunk_size) {
        for num in chunk {
            reals.push(num.re);
            imaginaries.push(num.im);
        }
    }

    (reals, imaginaries)
}

Note that this function isn't generalized for f64/f32 as it's only for demonstrative purposes.

before calling the fft_64 or fft_32 that we already have.

Given how simple this function is, it should be possible to vectorize the heck out of it, just as @Shnatsel did for twiddle factor generation. Of course, we'd need the inverse function as well (i.e., to combine the reals/imags back into Complex values).
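A minimal sketch of what that inverse helper could look like. `combine_re_im` is a hypothetical name (not an existing phastft function), and `Complex` here is a self-contained stand-in for `num::complex::Complex` so the snippet compiles on its own:

```rust
// Stand-in for num::complex::Complex, so this sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Complex {
    pub re: f64,
    pub im: f64,
}

// Hypothetical inverse of `separate_re_im`: zip the two planar buffers
// back into a single interleaved vector of complex values.
pub fn combine_re_im(reals: &[f64], imaginaries: &[f64]) -> Vec<Complex> {
    assert_eq!(reals.len(), imaginaries.len());
    reals
        .iter()
        .zip(imaginaries)
        .map(|(&re, &im)| Complex { re, im })
        .collect()
}
```

Like `separate_re_im`, this is a single linear pass, so it should auto-vectorize well too.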

I wonder if the overhead of this would be okay, considering that it would allow us to avoid permute instructions when we utilize SIMD. In the past, I wasn't able to get around those when complex numbers were stored together.

Looking forward to hearing your thoughts. Thanks!

Best,
Saveliy

@Shnatsel
Collaborator

Performance tip: always use chunks_exact + remainder instead of chunks.

And yes, it should be possible to vectorize the heck out of this.
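For illustration, the `chunks_exact` + `remainder` pattern applied to a simple reduction (the function and the chunk width are made up for this example, not phastft code):

```rust
// Sum a slice in fixed-width chunks. `chunks_exact` yields only
// full-width chunks, which the compiler can unroll and vectorize;
// the sub-width tail is handled separately via `remainder()`.
pub fn sum_chunked(data: &[f64]) -> f64 {
    const LANES: usize = 4; // assumed vector width for the sketch
    let mut iter = data.chunks_exact(LANES);
    let mut acc = 0.0;
    for chunk in &mut iter {
        // Every `chunk` here has exactly LANES elements.
        for &x in chunk {
            acc += x;
        }
    }
    // Leftover tail of fewer than LANES elements.
    for &x in iter.remainder() {
        acc += x;
    }
    acc
}
```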

@calebzulawski
Contributor Author

Yes, I think something like that is probably sufficient. With some trickery, it would also be possible to do it in place, since &mut [Complex<f64>] has the same memory layout as &mut [f64] with twice the elements. More broadly, I think you could generalize input and output transformations for various input and output shapes (plus things like output normalization by 1/N or 1/sqrt(N)).
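A sketch of that in-place reinterpretation. `Complex` here is a stand-in for `num::complex::Complex` (which is `#[repr(C)]`), and `as_interleaved_mut` is a hypothetical helper, not an existing API:

```rust
// repr(C) stand-in for num::complex::Complex: two f64 fields, no padding.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Complex {
    pub re: f64,
    pub im: f64,
}

// View a slice of N complex values as 2N interleaved f64s, in place.
pub fn as_interleaved_mut(data: &mut [Complex]) -> &mut [f64] {
    let len = data.len() * 2;
    // Safety: Complex is repr(C) with two f64 fields and no padding,
    // so the layouts and alignments are compatible.
    unsafe { std::slice::from_raw_parts_mut(data.as_mut_ptr() as *mut f64, len) }
}
```

In practice a crate like bytemuck can do this cast without hand-written unsafe code.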

I think there's a little bit more performance to pick up by combining the input/output transformations with the first or last loops of the FFT, but is that really important? Probably not to start.

@calebzulawski
Contributor Author

Performance tip: always use chunks_exact + remainder instead of chunks.

And yes, it should be possible to vectorize the heck out of this.

I have found it can actually be slightly more efficient to instead segment like this: imagine you have 11 elements in chunks of 4: [0, 1, 2, 3], [4, 5, 6, 7], [7, 8, 9, 10]. The 7th element gets computed twice (assuming you have working memory that is not invalidated), but this allows you to do only full-width vector loads, rather than a slow sequential load at the end.
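A sketch of that segmentation, computing the chunk start offsets so the final chunk is anchored at the end of the slice and overlaps the previous one (the function name is hypothetical, for illustration only):

```rust
// Start offsets for full-width chunks covering `len` elements.
// If `len` isn't a multiple of `width`, the last chunk starts at
// `len - width`, overlapping the previous chunk instead of leaving
// a short tail that would need scalar loads.
pub fn chunk_starts(len: usize, width: usize) -> Vec<usize> {
    assert!(width > 0 && len >= width);
    let mut starts: Vec<usize> = (0..=len - width).step_by(width).collect();
    // If the last full chunk doesn't end exactly at `len`, add one
    // more chunk anchored at the end (it overlaps its predecessor).
    if starts.last().map(|&s| s + width) != Some(len) {
        starts.push(len - width);
    }
    starts
}
```

For 11 elements in chunks of 4 this yields starts 0, 4, and 7, matching the example above: element 7 is loaded twice, but every load is a full vector load.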

3 participants