Add NVMe SGL support for FEMU #129

Open
wants to merge 6 commits into master
Conversation

zhiwayzhang

I noticed that QEMU's NVMe module did not officially support NVMe SGL when FEMU was first introduced; the latest QEMU has since added this feature.
Currently FEMU uses NVMe PRP to split a large I/O (128KB, 512KB, 1024KB) into many 4KB PRP entries (aligned with the OS physical memory page size), and the DRAM backend repeats a 4KB DMA (a memcpy() in practice) for each entry. In my tests, a single 1024KB memcpy() is more efficient than repeating a 4KB DMA 256 times, and SGL performs larger memcpy() operations with fewer calls, so sticking with PRP may cost some performance.
The code changes follow hw/nvme/ctrl.c and do not alter the current code structure. The current FEMU code has many incompatibilities with the latest QEMU NVMe module, which I have adjusted where appropriate.
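
To make the copy-granularity point concrete, here is a minimal sketch of the two paths (not the actual FEMU or QEMU code; dram_base, host_buf, and the struct sgl_seg layout are placeholders): with PRP the backend issues one 4KB memcpy() per PRP entry, whereas a single SGL data block descriptor can describe a much larger contiguous range, so the same 1024KB transfer needs far fewer copies.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

/* PRP path: one memcpy() per 4KB PRP entry (256 calls for a 1024KB I/O). */
static void copy_via_prp(uint8_t *dram_base, uint64_t *prp_list,
                         size_t nents, uint8_t *host_buf)
{
    for (size_t i = 0; i < nents; i++) {
        /* each PRP entry points to one 4KB page in backend DRAM */
        memcpy(host_buf + i * PAGE_SZ, dram_base + prp_list[i], PAGE_SZ);
    }
}

/* Hypothetical SGL data-block descriptor: address plus length. */
struct sgl_seg { uint64_t addr; uint32_t len; };

/* SGL path: one memcpy() per descriptor, which may cover hundreds of KB. */
static void copy_via_sgl(uint8_t *dram_base, struct sgl_seg *segs,
                         size_t nsegs, uint8_t *host_buf)
{
    size_t off = 0;
    for (size_t i = 0; i < nsegs; i++) {
        memcpy(host_buf + off, dram_base + segs[i].addr, segs[i].len);
        off += segs[i].len;
    }
}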

@huaicheng (Contributor)

> In my tests, a single 1024KB memcpy() is more efficient than repeating a 4KB DMA 256 times, and SGL performs larger memcpy() operations with fewer calls.

Could you share your setup and results?

Under multi-poller mode, multiple small memcpy() operations can actually saturate the memory bandwidth.
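
To be concrete about what I mean, a rough pthread sketch like the one below (not FEMU code; the poller count, chunk size, and pool size are arbitrary) runs several threads that each issue 4KB memcpy() calls over a private pool, and the aggregate throughput they report is what I would compare against a single large copy.

// Build with: gcc -O2 -pthread poller_bw.c -o poller_bw   (file name arbitrary)
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NPOLLERS 4                /* e.g. to mirror queue=4 */
#define CHUNK    4096             /* 4KB, the PRP granularity */
#define POOL     (256UL << 20)    /* 256MB private source pool per thread */
#define PASSES   4                /* each thread copies its pool 4 times */

static void *poller(void *arg)
{
    (void)arg;
    char *src = malloc(POOL);
    char *dst = malloc(CHUNK);
    if (!src || !dst)
        return NULL;
    memset(src, 1, POOL);         /* fault the pool in (counted in this rough timing) */
    for (int p = 0; p < PASSES; p++)
        for (size_t off = 0; off + CHUNK <= POOL; off += CHUNK)
            memcpy(dst, src + off, CHUNK);   /* many small, DMA-sized copies */
    free(src);
    free(dst);
    return NULL;
}

int main(void)
{
    pthread_t tid[NPOLLERS];
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < NPOLLERS; i++)
        pthread_create(&tid[i], NULL, poller, NULL);
    for (int i = 0; i < NPOLLERS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1e6;
    double gb = (double)NPOLLERS * PASSES * POOL / (1UL << 30);
    printf("%.2f GB copied in %.2f s -> ~%.2f GB/s aggregate\n", gb, secs, gb / secs);
    return 0;
}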

@zhiwayzhang (Author) commented Nov 16, 2023

I first conducted performance tests on memcpy by allocating a memory pool and randomly performing memcpy operations with sizes of 4KB or 1024KB.

// memcpy() performance testing
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (1024 * 1024)  // 1024KB, 4 * 1024 for 4KB
#define COUNT (100000)
#define mb()    asm volatile("mfence":::"memory")

int main() {
    int Pool_Size = 1024*1024*1024;
    char *src = malloc(Pool_Size);
    char *dst = malloc(SIZE);
    if (src == NULL || dst == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < COUNT; ++i) {
        // pick a random SIZE-aligned block offset inside the src memory pool
        long long offset = (long long)(random() % (Pool_Size / SIZE)) * SIZE;
        memcpy(dst, src + offset, SIZE);
    }
    gettimeofday(&end, NULL);

    long long total_time = end.tv_sec * 1000000 + end.tv_usec  - start.tv_sec * 1000000 - start.tv_usec;
    double average_time = (double)total_time / COUNT;

    free(src);
    free(dst);

    printf("Total time: %.2lf seconds\n", (double)total_time/1000000);
    printf("Average time per memcpy: %lf us\n", average_time);

    return 0;
}
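
(For anyone reproducing this: the benchmark builds with a plain C compiler, e.g. gcc -O2 memcpy_bench.c -o memcpy_bench, where the file name is arbitrary, and it is worth pinning it to a single NUMA node with numactl so the pool and the copies stay local.)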

The purpose of this program is to approximate the copy pattern of FEMU's DRAM backend. The results, shown in the table below, indicate that executing 256 memcpy() calls at 4KB granularity is less efficient than performing a single 1024KB memcpy().

memcpy size | cost
4KB x 256   | 117us total (0.46us per call)
1024KB x 1  | 33us

I also conducted tests in FEMU, and the system configuration is as follows:

  • Kernel Version: 5.15.0 Guest/ 5.4.258 Host
  • NUMA enabled; run with numactl bound to a single NUMA node
  • BBSSD configuration script based on the latest commit 68032f3, with multipollerenabled=1 and queue=4 appended; 16GB SSD with 4GB OP

I separately measured the execution time of the backend_rw() function with and without SGL enabled. The testing results are as below:

I/O size | PRP  | SGL
128KB    | 19us | 19us
508KB    | 76us | 67us

In these results there is a negligible difference for 128KB I/O, and only a slight gap between PRP and SGL at the larger I/O size. This is unlike the standalone memcpy() test above, which showed a significant disparity.
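
The measurement itself is nothing special; a wrapper of roughly this shape around the backend call is enough (a sketch only, using a placeholder prototype rather than FEMU's actual backend_rw() signature):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Placeholder prototype: FEMU's real backend_rw() takes backend/scatter-gather
 * state, so treat this as a stand-in rather than the actual signature. */
extern int backend_rw(void *backend, void *sg, uint64_t *lbal, int is_write);

static int timed_backend_rw(void *backend, void *sg, uint64_t *lbal, int is_write)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int ret = backend_rw(backend, sg, lbal, is_write);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* report per-call latency in microseconds */
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
              (t1.tv_nsec - t0.tv_nsec) / 1000L;
    fprintf(stderr, "backend_rw took %ld us\n", us);
    return ret;
}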

During my investigation of performance bottlenecks I identified this issue, although it was not prominent. After modifying and testing it, I eventually confirmed that the bottleneck was caused by other issues (PRP was not the direct cause). The experiments indicate that in FEMU, with multiple pollers enabled, the performance difference between SGL and PRP is not very pronounced: in a multi-threaded scenario, the latency timing model and the queuing of I/O requests in the FTL can overshadow the minor performance gap in DMA (memcpy). However, adding SGL and other NVMe features may still be worthwhile. If you're interested in this pull request, I'd be happy to validate it further and make modifications :)
