
Rz buffer optimize #4779

Open · wants to merge 2 commits into base: dev
Conversation

tushar3q34 (Contributor)

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

As described in the issue, RzBuffer is inefficient because methods like read, write, get_whole_buf etc. fail to be inlined behind the function-pointer dispatch. The changes I made include:

  • Preventing use of RzBufferMethods for RZ_BUFFER_BYTES and RZ_BUFFER_MMAP
  • Replacing the function-pointer calls with direct calls to support inlining. For example, instead of calling b->methods->init for a byte-type buffer, buf_bytes_init() is called directly, which should get inlined.

Test plan

...

Closing issues

closes #4769
...

@Rot127 (Member) left a comment

Have you tested the changes already with perf or similar?
I am not sure this is enough, since the compiler does not necessarily know the b->type value; it depends on how the user of RzBuffer calls the buffer API.

It might require an early if (b->type != RZ_BUF_....) return in the using function, to signal to the compiler which type the RzBuffer object is.
But we should try this out.

@tushar3q34 (Contributor, Author)

How should I test it against test cases which contain hotpaths?

If this doesn't work, then one of the solutions could be using this (although I'm not sure):

static bool buf_init(RzBuffer *b, const void *user) {
	rz_return_val_if_fail(b, false);
	if (b->type != RZ_BUFFER_BYTES && b->type != RZ_BUFFER_MMAP) {
		rz_return_val_if_fail(b->methods, false);
		return b->methods->init ? b->methods->init(b, user) : true;
	}
	// Otherwise handle the bytes/mmap case directly.
}

We can also create a wrapper function which handles RZ_BUFFER_BYTES and others separately.
We could also possibly use a switch statement, since compilers find it easier to inline functions called from switch cases.
I will test it once and try to implement these and the other changes suggested.

@Rot127 (Member) commented Dec 18, 2024

I will test it once and try to implement these and the other changes suggested.

I think the easiest is to write some small C program which you link against librz_util.
In this program fill the buffer with a lot (50MB+) of random data, iterate over it and compare it against some constant.

You have to reinstall Rizin every time you make a change to RzBuffer, of course.

@XVilka XVilka requested a review from wargio December 19, 2024 04:45
Comment on lines +1416 to +1417
ut64 sz;
const ut8 *buf;
(Member)

define these at the beginning of the function and remove the other definitions within the if/else.

@tushar3q34 (Contributor, Author)

@Rot127 I ran tests for an input of 100.8 MB, writing it to buffer once and reading through it 10,000 times and here is what I found:

  • The CPU time used is comparable whether we call rz_buf_read or buf_bytes_read directly on the RzBuffer, so I do not think the issue is inlining: calling buf_bytes_read directly should have given slightly better performance, but it didn't.
  • The only case where the two reads showed a big difference was when trying to read more bytes than the buffer's size, or more bytes than remain past the cursor. The heavy extra CPU time of rz_buf_read in that case comes from the memset that fills the rest of the destination memory with a constant:
if (len > result) {
	memset(buf + result, b->Oxff_priv, len - result);
}

Can you please check whether the case in which you were using rz_buf_read hits the second point?

@Rot127 (Member) commented Dec 20, 2024

@tushar3q34 Thank you for the preliminary results! Can you please share your test code, so we have the same basis to work with?

@tushar3q34 (Contributor, Author) commented Dec 20, 2024

Yeah sure: Test code for RzBuffer

Cases I tested:

  • Multiple reads with original rz_buf_read
  • Multiple reads with original buf_bytes_read
  • Multiple reads with modified rz_buf_read (buf_read inside rz_buf_read) according to this PR
  • Multiple reads with modified rz_buf_read in the following way:
switch (b->type) {
case RZ_BUFFER_BYTES:
case RZ_BUFFER_MMAP:
	return buf_bytes_read(b, buf, len);
default:
	rz_return_val_if_fail(b->methods, -1);
	return b->methods->read ? b->methods->read(b, buf, len) : -1;
}
  • Multiple reads with modified rz_buf_read in the following way (which is incorrect, but just for testing):
return buf_bytes_read(b, buf, len);
  • Multiple reads with modified rz_buf_read in the following way (inlining the whole body of buf_bytes_read, again incorrect but just for testing):
struct buf_bytes_priv *priv = get_priv_bytes(b);
if (!priv->buf) {
	// This can happen when called from buf_mmap.c, when you open a 0-length
	// file.
	return -1;
}
ut64 real_len = priv->length < priv->offset ? 0 : RZ_MIN(priv->length - priv->offset, len);
memmove(buf, priv->buf + priv->offset, real_len);
priv->offset += real_len;
return real_len;
  • For my system, all of the above gave similar results.
  • The difference comes when we remove rz_buf_seek(buf, 0, 0). After removing it, the cursor isn't reset to the beginning, which makes buf_bytes_read significantly faster than rz_buf_read (due to the memset).

@Rot127 (Member) commented Dec 20, 2024

Ok, this is pretty funny. I rewrote your test a little to be more similar to my slow use case from before, but I couldn't replicate it.
The thing is, I know it was RzBuffer and its callbacks before, because I replaced it with a plain ut8 * and everything got ~3-4 times faster. perf also attributed the runtime increase to the callback functions.

Thanks so far for checking it. I will implement my working minimal example tomorrow. And send it here.

@wargio (Member) commented Dec 21, 2024

Just a comment: please be careful when using memmove, since the buffer needs to be immutable (unless it is a copy).

@Rot127 (Member) commented Dec 21, 2024

@tushar3q34 Ok, so here is the branch with the problem:
https://github.com/Rot127/rizin/tree/rz_buf_runtime
The branch refactors the hexadecimal byte search.

You can build it and run the following command on some larger file:

cat /dev/urandom > input.data
# Press Ctrl + C after a few seconds
rizin -qc "/x a158f101" ./input.data

The functions which are relevant are:

  • search.c::rz_search_on_io() - It searches on the IO layer. It prepares a list of chunks/windows into memory. Each of those chunks/windows is searched for the bytes in a separate thread (via rz_th_iterate_list()).
  • search.c::search_iterator_io_map_cb() - This function runs concurrently in separate threads. It reads the bytes of a chunk from the file and passes them on to find().
  • find() - a callback to bytes_search.c::bytes_pattern_compare(), which does a simple memcmp against the byte pattern searched for.

If you simply time the command you get:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m1.618s
user	0m5.596s
sys	0m0.104s

Now, this is a branch which uses ut8 * instead of RzBuffer. The runtime for it is roughly twice as fast:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m0.835s
user	0m2.376s
sys	0m0.105s

I don't know how people do performance testing on Windows or macOS, but on Linux you usually use perf, which shows you how much CPU time (in percent) each function took.

You use it like this:

perf record rizin -qc "/x a158f101" ./input.data
perf report # Opens details which functions took how much CPU time

The RzBuffer (slow) version gets these results:

  35.68%  rizin    librz_util.so.0.8.0    [.] rz_buf_fwd_scan
  17.83%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  17.32%  rizin    librz_search.so.0.8.0  [.] bytes_pattern_compare
  11.32%  rizin    librz_search.so.0.8.0  [.] bytes_find
  10.96%  rizin    librz_util.so.0.8.0    [.] buf_bytes_get_whole_buf
   2.78%  rizin    librz_search.so.0.8.0  [.] memcmp@plt
   1.43%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.36%  rizin    libc.so.6              [.] _int_free
   0.19%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.18%  rizin    libc.so.6              [.] malloc_consolidate

The (faster) ut8 * version has these statistics:

  64.64%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  26.04%  rizin    librz_search.so.0.8.0  [.] bytes_find
   3.64%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.60%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.40%  rizin    libc.so.6              [.] __memset_avx2_unaligned_erms
   0.32%  rizin    libc.so.6              [.] _int_free
...

In general it is ok if RzBuffer methods have some overhead. But they should never consume 30%+, no matter what. The largest part of the runtime should be spent in __memcmp.

I hope you can work from there on. If you have questions, please let me know. Also, feel free to question my assumption that it is indeed the fetching of function pointers which leads to the performance hit.

Also please note that the "fixed" faster branch has some other edits, but it still does logically more or less the same.

@XVilka XVilka added the performance A performance problem/enhancement label Dec 22, 2024
Successfully merging this pull request may close these issues.

RzBuffer is inefficient (only on hot paths)
4 participants