Add NVMe SGL support for FEMU #129

Open
wants to merge 6 commits into master
Conversation

zhiwayzhang

I noticed that QEMU's NVMe module did not officially support NVMe SGL when FEMU was first introduced; the latest QEMU has since added this feature.
Currently FEMU uses NVMe PRP to split a large I/O (128KB, 512KB, 1024KB) into many 4KB PRP entries (aligned with the OS physical memory page size), and the DRAM backend repeats a 4KB DMA (a memcpy() in practice) for each entry. In my tests, a single 1024KB memcpy() is more efficient than repeating a 4KB DMA 256 times, and SGL performs larger memcpy() operations with fewer calls, so sticking with PRP may cost some performance.
The code changes follow hw/nvme/ctrl.c and do not alter the current code structure. The current FEMU code has many incompatibilities with the latest QEMU NVMe module, which I have adjusted where appropriate.
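
To make the copy-granularity point concrete, here is a minimal sketch of the two paths (not the actual FEMU or QEMU code; dram_base, host_buf, and the struct sgl_seg layout are placeholders): with PRP the backend issues one 4KB memcpy() per PRP entry, whereas a single SGL data block descriptor can describe a much larger contiguous range, so the same 1024KB transfer needs far fewer copies.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

/* PRP path: one memcpy() per 4KB PRP entry (256 calls for a 1024KB I/O). */
static void copy_via_prp(uint8_t *dram_base, uint64_t *prp_list,
                         size_t nents, uint8_t *host_buf)
{
    for (size_t i = 0; i < nents; i++) {
        /* each PRP entry points to one 4KB page in backend DRAM */
        memcpy(host_buf + i * PAGE_SZ, dram_base + prp_list[i], PAGE_SZ);
    }
}

/* Hypothetical SGL data-block descriptor: address plus length. */
struct sgl_seg { uint64_t addr; uint32_t len; };

/* SGL path: one memcpy() per descriptor, which may cover hundreds of KB. */
static void copy_via_sgl(uint8_t *dram_base, struct sgl_seg *segs,
                         size_t nsegs, uint8_t *host_buf)
{
    size_t off = 0;
    for (size_t i = 0; i < nsegs; i++) {
        memcpy(host_buf + off, dram_base + segs[i].addr, segs[i].len);
        off += segs[i].len;
    }
}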

@huaicheng (Contributor)

> In my tests, a single 1024KB memcpy() is more efficient than repeating a 4KB DMA 256 times, and SGL performs larger memcpy() operations with fewer calls.

Could you share your setup and results?

Under multi-poller mode, multiple small memcpy() operations can actually saturate the memory bandwidth.
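
To be concrete about what I mean, a rough pthread sketch like the one below (not FEMU code; the poller count, chunk size, and pool size are arbitrary) runs several threads that each issue 4KB memcpy() calls over a private pool, and the aggregate throughput they report is what I would compare against a single large copy.

// Build with: gcc -O2 -pthread poller_bw.c -o poller_bw   (file name arbitrary)
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NPOLLERS 4                /* e.g. to mirror queue=4 */
#define CHUNK    4096             /* 4KB, the PRP granularity */
#define POOL     (256UL << 20)    /* 256MB private source pool per thread */
#define PASSES   4                /* each thread copies its pool 4 times */

static void *poller(void *arg)
{
    (void)arg;
    char *src = malloc(POOL);
    char *dst = malloc(CHUNK);
    if (!src || !dst)
        return NULL;
    memset(src, 1, POOL);         /* fault the pool in (counted in this rough timing) */
    for (int p = 0; p < PASSES; p++)
        for (size_t off = 0; off + CHUNK <= POOL; off += CHUNK)
            memcpy(dst, src + off, CHUNK);   /* many small, DMA-sized copies */
    free(src);
    free(dst);
    return NULL;
}

int main(void)
{
    pthread_t tid[NPOLLERS];
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < NPOLLERS; i++)
        pthread_create(&tid[i], NULL, poller, NULL);
    for (int i = 0; i < NPOLLERS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1e6;
    double gb = (double)NPOLLERS * PASSES * POOL / (1UL << 30);
    printf("%.2f GB copied in %.2f s -> ~%.2f GB/s aggregate\n", gb, secs, gb / secs);
    return 0;
}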

@zhiwayzhang (Author) commented Nov 16, 2023

I first conducted performance tests on memcpy by allocating a memory pool and randomly performing memcpy operations with sizes of 4KB or 1024KB.

// memcpy() performance testing
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (1024 * 1024)  // 1024KB, 4 * 1024 for 4KB
#define COUNT (100000)
#define mb()    asm volatile("mfence":::"memory")

int main() {
    int Pool_Size = 1024*1024*1024;
    char *src = malloc(Pool_Size);
    char *dst = malloc(SIZE);
    if (src == NULL || dst == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < COUNT; ++i) {
        // pick a random SIZE-aligned block offset inside the src memory pool
        long long offset = (long long)(random() % (Pool_Size / SIZE)) * SIZE;
        memcpy(dst, src + offset, SIZE);
    }
    gettimeofday(&end, NULL);

    long long total_time = end.tv_sec * 1000000 + end.tv_usec  - start.tv_sec * 1000000 - start.tv_usec;
    double average_time = (double)total_time / COUNT;

    free(src);
    free(dst);

    printf("Total time: %.2lf seconds\n", (double)total_time/1000000);
    printf("Average time per memcpy: %lf us\n", average_time);

    return 0;
}
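
(For anyone reproducing this: the benchmark builds with a plain C compiler, e.g. gcc -O2 memcpy_bench.c -o memcpy_bench, where the file name is arbitrary, and it is worth pinning it to a single NUMA node with numactl so the pool and the copies stay local.)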

The purpose of this program is to approximate the copy pattern of FEMU's DRAM backend. The results, shown in the table below, indicate that executing 256 memcpy() calls at 4KB granularity is less efficient than performing a single 1024KB memcpy().

memcpy size | cost
4KB x 256   | 117us total (0.46us per call)
1024KB x 1  | 33us

I also conducted tests in FEMU, and the system configuration is as follows:

  • Kernel Version: 5.15.0 Guest/ 5.4.258 Host
  • NUMA enabled; run with numactl bound to a single NUMA node
  • BBSSD configuration script based on the latest commit 68032f3, with multipollerenabled=1 and queue=4 appended; 16GB SSD with 4GB OP

I separately measured the execution time of the backend_rw() function with and without SGL enabled. The testing results are as below:

I/O size | PRP  | SGL
128KB    | 19us | 19us
508KB    | 76us | 67us

In these results there is a negligible difference for 128KB I/O, and only a slight gap between PRP and SGL at the larger I/O size. This is unlike the standalone memcpy() test above, which showed a significant disparity.
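
The measurement itself is nothing special; a wrapper of roughly this shape around the backend call is enough (a sketch only, using a placeholder prototype rather than FEMU's actual backend_rw() signature):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Placeholder prototype: FEMU's real backend_rw() takes backend/scatter-gather
 * state, so treat this as a stand-in rather than the actual signature. */
extern int backend_rw(void *backend, void *sg, uint64_t *lbal, int is_write);

static int timed_backend_rw(void *backend, void *sg, uint64_t *lbal, int is_write)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int ret = backend_rw(backend, sg, lbal, is_write);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* report per-call latency in microseconds */
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
              (t1.tv_nsec - t0.tv_nsec) / 1000L;
    fprintf(stderr, "backend_rw took %ld us\n", us);
    return ret;
}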

During my investigation of performance bottlenecks I identified this issue, although it was not prominent. After modifying and testing it, I eventually confirmed that the bottleneck was caused by other issues (PRP was not the direct cause). The experiments indicate that in FEMU, with multiple pollers enabled, the performance difference between SGL and PRP is not very pronounced: in a multi-threaded scenario, the latency timing model and the queuing of I/O requests in the FTL can overshadow the minor performance gap in DMA (memcpy). However, adding SGL and other NVMe features may still be worthwhile. If you're interested in this pull request, I'd be happy to validate it further and make modifications :)
