Improve internal function argument parsing performance by reducing code bloat #18436
The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice)
{
	zend_long foo;

	ZEND_PARSE_PARAMETERS_START(1, 1)
		Z_PARAM_LONG(foo)
	ZEND_PARSE_PARAMETERS_END();

	RETURN_LONG(foo * 2);
}
```

we obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:   push   %r12
<+2>:   push   %rbp
<+3>:   push   %rbx
<+4>:   sub    $0x10,%rsp
<+8>:   mov    %fs:0x28,%r12
<+17>:  mov    %r12,0x8(%rsp)
<+22>:  mov    0x2c(%rdi),%r12d
<+26>:  cmp    $0x1,%r12d
<+30>:  jne    0x22beaf <zif_twice-3212257>
<+36>:  cmpb   $0x4,0x58(%rdi)
<+40>:  mov    %rsi,%rbp
<+43>:  jne    0x53c2f0 <zif_twice+96>
<+45>:  mov    0x50(%rdi),%rax
<+49>:  add    %rax,%rax
<+52>:  movl   $0x4,0x8(%rbp)
<+59>:  mov    %rax,0x0(%rbp)
<+63>:  mov    0x8(%rsp),%rax
<+68>:  sub    %fs:0x28,%rax
<+77>:  jne    0x53c312 <zif_twice+130>
<+79>:  add    $0x10,%rsp
<+83>:  pop    %rbx
<+84>:  pop    %rbp
<+85>:  pop    %r12
<+87>:  ret
<+88>:  nopl   0x0(%rax,%rax,1)
<+96>:  lea    0x50(%rdi),%rbx
<+100>: mov    %rsp,%rsi
<+103>: mov    $0x1,%edx
<+108>: mov    %rbx,%rdi
<+111>: call   0x620240 <zend_parse_arg_long_slow>
<+116>: test   %al,%al
<+118>: je     0x22be96 <zif_twice.cold>
<+124>: mov    (%rsp),%rax
<+128>: jmp    0x53c2c1 <zif_twice+49>
<+130>: call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we were to return a struct with a value pair (similar to `optional` in C++ / `Option` in Rust), then the values are returned via registers. This means that we no longer have stack protector overhead and we also don't need to reload a value, resulting in better register usage.

This is the resulting assembly for the sample function after this patch:

```s
<+0>:   push   %r12
<+2>:   push   %rbp
<+3>:   push   %rbx
<+4>:   mov    0x2c(%rdi),%r12d
<+8>:   cmp    $0x1,%r12d
<+12>:  jne    0x22d482 <zif_twice-3205454>
<+18>:  cmpb   $0x4,0x58(%rdi)
<+22>:  mov    %rsi,%rbp
<+25>:  jne    0x53be08 <zif_twice+56>
<+27>:  mov    0x50(%rdi),%rax
<+31>:  add    %rax,%rax
<+34>:  movl   $0x4,0x8(%rbp)
<+41>:  mov    %rax,0x0(%rbp)
<+45>:  pop    %rbx
<+46>:  pop    %rbp
<+47>:  pop    %r12
<+49>:  ret
<+50>:  nopw   0x0(%rax,%rax,1)
<+56>:  lea    0x50(%rdi),%rbx
<+60>:  mov    $0x1,%esi
<+65>:  mov    %rbx,%rdi
<+68>:  call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:  test   %dl,%dl
<+75>:  je     0x22d46a <zif_twice.cold>
<+81>:  jmp    0x53bdef <zif_twice+31>
```

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, then repeated 15 times. It shows a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT. For WordPress, the difference is small as my CPU is bottlenecked on some other stuff as well.

| Test                            | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50)       | 0.5324   | 0.0050     | 0.5272   | 0.0042     |
| Symfony, tracing JIT (-T10,50)  | 0.5301   | 0.0029     | 0.5264   | 0.0036     |
| WordPress, no JIT (-T5,25)      | 0.7408   | 0.0049     | 0.7404   | 0.0048     |
| WordPress, tracing JIT (-T5,25) | 0.6814   | 0.0052     | 0.6770   | 0.0055     |

I was not able to measure any meaningful difference for our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`. The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).
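To make the struct-return idea concrete, here is a minimal standalone sketch of the two calling styles the PR contrasts. The names (`parse_long_result`, `parse_long_ptr`, `parse_long_val`) are illustrative, not the actual php-src API: the point is that on x86-64 SysV a 16-byte `{long, bool}` struct comes back in registers, so the caller needs no stack slot and no reload, whereas the out-pointer version forces one.

```c
#include <stdbool.h>
#include <stdlib.h>

/* A (value, success) pair, analogous to C++ optional / Rust Option.
   At 16 bytes it is returned in RAX:RDX on x86-64 SysV. */
typedef struct {
    long value;
    bool ok;
} parse_long_result;

/* Out-pointer style: `out` must live in memory, so the caller
   spills it to the stack and reloads it after the call. */
static bool parse_long_ptr(const char *s, long *out) {
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s) {
        return false;
    }
    *out = v;
    return true;
}

/* By-value style: both the value and the success flag travel back
   in registers; no stack slot, no reload, no stack protector cost. */
static parse_long_result parse_long_val(const char *s) {
    char *end;
    long v = strtol(s, &end, 10);
    return (parse_long_result){ .value = v, .ok = end != s };
}
```

A caller can then branch on `.ok` and use `.value` directly from a register, which mirrors the tighter post-patch assembly shown below.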
One potential optimization idea is to store a […]
When building on Arm, an uninitialized-variable warning was triggered at the following location: line 5514 in c919ab4.
I also dumped the non-cold-block assembly on Arm.

Before:

After:
@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped; now there is no longer a pointer escape, and the compiler's warning analysis is not strong enough to avoid the warning.
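A hedged sketch of the pattern behind that false positive (the names are illustrative stand-ins, not the real ZPP macros or `zend_parse_arg_long_slow`): once the slow path returns a pair by value instead of writing through an escaped pointer, the target variable is assigned only on the success branch, and a conservative flow analysis may flag it as possibly uninitialized even though the failure branch bails out before any read.

```c
#include <stdbool.h>

typedef struct {
    long value;
    bool ok;
} zlong_result;

/* Stand-in for the slow-path parser: returns the pair by value. */
static zlong_result parse_long_slow(int raw) {
    if (raw < 0) {
        return (zlong_result){ 0, false };  /* value is a dummy */
    }
    return (zlong_result){ (long)raw, true };
}

/* Shape of the generated ZPP code: `foo` is written only when
   parsing succeeded.  Nothing escapes, so GCC's "pointer escaped,
   assume it was written" heuristic no longer silences the warning,
   and -Wmaybe-uninitialized can fire on `foo` despite the bail-out
   on the !ok path happening before any read of it. */
static long twice(int raw) {
    long foo;
    zlong_result r = parse_long_slow(raw);
    if (!r.ok) {
        return 0;  /* foo is never read on this path */
    }
    foo = r.value;
    return foo * 2;
}
```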
overall mstm but lgtm for pgsql at least
Returning structures is a not-well-defined, system-dependent C feature. On one system it may be implemented through a pair of registers, on another through a stack-allocated area. GCC has special flags that control this behavior (`-fpcc-struct-return`/`-freg-struct-return`), but they break the ABI.

I thought about this feature a long time ago (for general API optimization), but decided not to use it.

Please check the effect of the patch on x86 Linux, and on x86/x86-64 Windows.
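The ABI dependence can be illustrated with a small sketch (type names hypothetical). On the x86-64 System V ABI, an aggregate of up to 16 bytes is classified into up to two registers (here RAX:RDX), while anything larger is returned through a hidden caller-allocated buffer (sret); on 32-bit x86 cdecl, even the small struct goes through memory unless `-freg-struct-return` is in effect, which is exactly why the benefit is platform-dependent.

```c
#include <stdint.h>

/* 16 bytes: fits the two-register return class on x86-64 SysV,
   but is returned via a hidden pointer on default 32-bit x86. */
typedef struct {
    int64_t value;
    int64_t ok;
} small_result;

/* 32 bytes: returned through a caller-allocated buffer (sret) on
   both ABIs, so the register-return benefit disappears entirely. */
typedef struct {
    int64_t a, b, c, d;
} big_result;

static small_result make_small(void) { return (small_result){ 42, 1 }; }
static big_result   make_big(void)   { return (big_result){ 1, 2, 3, 4 }; }
```

Comparing the generated code for `make_small` and `make_big` across targets (e.g. with `gcc -S` on x86-64 vs. `-m32`) shows the difference Dmitry describes.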
This is an initial proposal; it may be possible to optimize this further. The benchmark results above were obtained on an i7-4790 with 32 GiB RAM.