
Rz buffer optimize #4779

Open · wants to merge 2 commits into base: dev
Conversation

tushar3q34 (Contributor)

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

As described in the issue, RzBuffer is inefficient because methods like read, write, get_whole_buf etc. fail to be inlined behind the function-pointer dispatch. The changes I made include:

  • Preventing use of RzBufferMethods for RZ_BUFFER_BYTES and RZ_BUFFER_MMAP
  • Replacing the function-pointer calls with direct calls to support inlining. For example, instead of calling b->methods->init for a byte-type buffer, buf_bytes_init() is called directly, which should get inlined.

Test plan

...

Closing issues

closes #4769
...

@Rot127 (Member) left a comment

Have you tested the changes already with perf or similar?
I am not sure this is enough, since the compiler does not necessarily know the b->type value; it depends on how the user of RzBuffer calls the buffer API.

It might require an early if (b->type != RZ_BUF_....) return in the using function, to signal to the compiler which type the RzBuffer object is.
But we should try this out.

@tushar3q34 (Contributor, Author)

How should I test it against test cases which contain hotpaths?

If this doesn't work, then one of the solutions could be using this (although I'm not sure):

static bool buf_init(RzBuffer *b, const void *user) {
	rz_return_val_if_fail(b, false);
	if (b->type != RZ_BUFFER_BYTES && b->type != RZ_BUFFER_MMAP) {
		rz_return_val_if_fail(b->methods, false);
		return b->methods->init ? b->methods->init(b, user) : true;
	}
	// Otherwise handle the bytes/mmap case directly.
}

We can also create a wrapper function which handles RZ_BUFFER_BYTES and others separately.
We could also possibly use a switch statement, since compilers find it easier to inline functions called from switch cases.
I will test it once and try to implement these and the other changes suggested.

@Rot127 (Member) commented Dec 18, 2024

I will test it once and try to implement these and the other changes suggested.

I think the easiest is to write some small C program which you link against librz_util.
In this program fill the buffer with a lot (50MB+) of random data, iterate over it and compare it against some constant.

You have to reinstall Rizin every time you make a change to RzBuffer, of course.

@XVilka XVilka requested a review from wargio December 19, 2024 04:45
Comment on lines +1416 to +1417
ut64 sz;
const ut8 *buf;
(Member)

define these at the beginning of the function and remove the other definitions within the if/else.

@tushar3q34 (Contributor, Author)

@Rot127 I ran tests for an input of 100.8 MB, writing it to buffer once and reading through it 10,000 times and here is what I found:

  • The CPU time used is comparable whether we call rz_buf_read or buf_bytes_read directly on the RzBuffer, so I do not think the issue is inlining: calling buf_bytes_read directly should have given slightly better performance, but it didn't.
  • The only case where the two reads showed a big difference was when trying to read more bytes than the buffer's size, or more bytes than remain past the cursor. The heavy extra CPU time of rz_buf_read in that case comes from the memset that fills the rest of the destination memory with a constant:
if (len > result) {
	memset(buf + result, b->Oxff_priv, len - result);
}

Can you please check whether the case in which you were using rz_buf_read hits the second point?

@Rot127 (Member) commented Dec 20, 2024

@tushar3q34 Thank you for the preliminary results! Can you please share your test code, so we have the same basis to work with?

@tushar3q34 (Contributor, Author) commented Dec 20, 2024

Yeah sure: Test code for RzBuffer

Cases I tested:

  • Multiple reads with original rz_buf_read
  • Multiple reads with original buf_bytes_read
  • Multiple reads with modified rz_buf_read (buf_read inside rz_buf_read) according to this PR
  • Multiple reads with modified rz_buf_read in the following way:
switch (b->type) {
case RZ_BUFFER_BYTES:
case RZ_BUFFER_MMAP:
	return buf_bytes_read(b, buf, len);
default:
	rz_return_val_if_fail(b->methods, -1);
	return b->methods->read ? b->methods->read(b, buf, len) : -1;
}
  • Multiple reads with modified rz_buf_read in the following way (which is incorrect, but just for testing):
return buf_bytes_read(b, buf, len);
  • Multiple reads with modified rz_buf_read in the following way (inlining the whole body of buf_bytes_read, again incorrect but just for testing):
struct buf_bytes_priv *priv = get_priv_bytes(b);
if (!priv->buf) {
	// This can happen when called from buf_mmap.c, when you open a 0-length
	// file.
	return -1;
}
ut64 real_len = priv->length < priv->offset ? 0 : RZ_MIN(priv->length - priv->offset, len);
memmove(buf, priv->buf + priv->offset, real_len);
priv->offset += real_len;
return real_len;
  • For my system, all of the above gave similar results.
  • The difference comes when we remove rz_buf_seek(buf, 0, 0). After removing it, the cursor isn't reset to the beginning, which makes buf_bytes_read significantly faster than rz_buf_read (due to the memset).

@Rot127 (Member) commented Dec 20, 2024

Ok, this is pretty funny. I rewrote your test a little to be more similar to my slow use case from before, but I couldn't replicate it.
The thing is, I know it was RzBuffer and its callbacks before, because I replaced it with a plain ut8 * and everything got ~3-4 times faster. perf also attributed the runtime increase to the callback functions.

Thanks so far for checking it. I will implement my working minimal example tomorrow. And send it here.

@wargio (Member) commented Dec 21, 2024

Just a comment: please be careful when using memmove, since the buffer needs to be immutable (unless it is a copy).

@Rot127 (Member) commented Dec 21, 2024

@tushar3q34 Ok, so here is the branch with the problem:
https://github.com/Rot127/rizin/tree/rz_buf_runtime
The branch refactors the hexadecimal byte search.

You can build it and run the following command on some larger file:

cat /dev/urandom > input.data
# Press Ctrl + C after a few seconds
rizin -qc "/x a158f101" ./input.data

The functions which are relevant are:

  • search.c::rz_search_on_io() - It searches on the IO layer. It prepares a list of chunks/windows into memory. Each of those chunks/windows is searched for the bytes in a separate thread (via rz_th_iterate_list()).
  • search.c::search_iterator_io_map_cb() - This function runs concurrently in separate threads. It reads the bytes of a chunk from the file and passes them on to find().
  • find() - a callback to bytes_search.c::bytes_pattern_compare(), which does a simple memcmp against the byte pattern searched for.

If you simply time the command you get:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m1.618s
user	0m5.596s
sys	0m0.104s

Now, this is a branch which uses ut8 * instead of RzBuffer. The runtime for it is roughly twice as fast:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m0.835s
user	0m2.376s
sys	0m0.105s

I don't know how people do performance testing on Windows or macOS, but on Linux you usually use perf, which shows you how much CPU time (in percent) each function took.

You use it like this:

perf record rizin -qc "/x a158f101" ./input.data
perf report # Opens details which functions took how much CPU time

The RzBuffer (slow) version gets these results:

  35.68%  rizin    librz_util.so.0.8.0    [.] rz_buf_fwd_scan
  17.83%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  17.32%  rizin    librz_search.so.0.8.0  [.] bytes_pattern_compare
  11.32%  rizin    librz_search.so.0.8.0  [.] bytes_find
  10.96%  rizin    librz_util.so.0.8.0    [.] buf_bytes_get_whole_buf
   2.78%  rizin    librz_search.so.0.8.0  [.] memcmp@plt
   1.43%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.36%  rizin    libc.so.6              [.] _int_free
   0.19%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.18%  rizin    libc.so.6              [.] malloc_consolidate

The (faster) ut8 * version has these statistics:

  64.64%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  26.04%  rizin    librz_search.so.0.8.0  [.] bytes_find
   3.64%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.60%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.40%  rizin    libc.so.6              [.] __memset_avx2_unaligned_erms
   0.32%  rizin    libc.so.6              [.] _int_free
...

In general it is ok if RzBuffer methods have some overhead. But they should never consume 30%+, no matter what. The largest part of the runtime should be spent in __memcmp.

I hope you can work from there on. If you have questions, please let me know. Also, feel free to question my assumption that it is indeed the fetching of function pointers which leads to the performance hit.

Also please note that the "fixed" faster branch has some other edits, but it still does logically more or less the same.

@XVilka XVilka added the performance A performance problem/enhancement label Dec 22, 2024
Successfully merging this pull request may close these issues.

RzBuffer is inefficient (only on hot paths)
4 participants