Improve internal function argument parsing performance by reducing code bloat #18436

Open
wants to merge 2 commits into master

Conversation

nielsdos
Member

@nielsdos nielsdos commented Apr 26, 2025

The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed.
This is an initial proposal; it may be possible to optimize this further.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice) {
    zend_long foo;
    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_LONG(foo)
    ZEND_PARSE_PARAMETERS_END();
    RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	sub    $0x10,%rsp
<+8>:	mov    %fs:0x28,%r12
<+17>:	mov    %r12,0x8(%rsp)
<+22>:	mov    0x2c(%rdi),%r12d
<+26>:	cmp    $0x1,%r12d
<+30>:	jne    0x22beaf <zif_twice-3212257>
<+36>:	cmpb   $0x4,0x58(%rdi)
<+40>:	mov    %rsi,%rbp
<+43>:	jne    0x53c2f0 <zif_twice+96>
<+45>:	mov    0x50(%rdi),%rax
<+49>:	add    %rax,%rax
<+52>:	movl   $0x4,0x8(%rbp)
<+59>:	mov    %rax,0x0(%rbp)
<+63>:	mov    0x8(%rsp),%rax
<+68>:	sub    %fs:0x28,%rax
<+77>:	jne    0x53c312 <zif_twice+130>
<+79>:	add    $0x10,%rsp
<+83>:	pop    %rbx
<+84>:	pop    %rbp
<+85>:	pop    %r12
<+87>:	ret
<+88>:	nopl   0x0(%rax,%rax,1)
<+96>:	lea    0x50(%rdi),%rbx
<+100>:	mov    %rsp,%rsi
<+103>:	mov    $0x1,%edx
<+108>:	mov    %rbx,%rdi
<+111>:	call   0x620240 <zend_parse_arg_long_slow>
<+116>:	test   %al,%al
<+118>:	je     0x22be96 <zif_twice.cold>
<+124>:	mov    (%rsp),%rax
<+128>:	jmp    0x53c2c1 <zif_twice+49>
<+130>:	call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we return a struct containing a value/flag pair (similar to `std::optional` in C++ or `Option` in Rust), the values are returned via registers. This means we no longer have the stack protector overhead and we don't need to reload the value, resulting in better register usage. This is the resulting assembly for the sample function after this patch:

```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	mov    0x2c(%rdi),%r12d
<+8>:	cmp    $0x1,%r12d
<+12>:	jne    0x22d482 <zif_twice-3205454>
<+18>:	cmpb   $0x4,0x58(%rdi)
<+22>:	mov    %rsi,%rbp
<+25>:	jne    0x53be08 <zif_twice+56>
<+27>:	mov    0x50(%rdi),%rax
<+31>:	add    %rax,%rax
<+34>:	movl   $0x4,0x8(%rbp)
<+41>:	mov    %rax,0x0(%rbp)
<+45>:	pop    %rbx
<+46>:	pop    %rbp
<+47>:	pop    %r12
<+49>:	ret
<+50>:	nopw   0x0(%rax,%rax,1)
<+56>:	lea    0x50(%rdi),%rbx
<+60>:	mov    $0x1,%esi
<+65>:	mov    %rbx,%rdi
<+68>:	call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:	test   %dl,%dl
<+75>:	je     0x22d46a <zif_twice.cold>
<+81>:	jmp    0x53bdef <zif_twice+31>
```
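For illustration, the difference between the two calling conventions looks roughly like this. This is a minimal sketch with stand-in types and hypothetical function names, not the exact declarations from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;   /* stand-in for the real zend_long */
typedef struct _zval zval;   /* opaque here; really defined in zend_types.h */

/* Before: the result is written through a pointer, so the caller needs a
 * stack slot for it (which drags in the stack protector) and has to
 * reload the value after the call. */
bool parse_arg_long_via_ptr(zval *arg, zend_long *dest, uint32_t arg_num);

/* After: the value and the success flag are packed into a small struct.
 * Under the SysV x86-64 ABI a struct like this is returned in rax:rdx,
 * which matches the `test %dl,%dl` in the assembly above. */
typedef struct {
    zend_long value;   /* comes back in rax */
    bool has_value;    /* comes back in dl  */
} zend_opt_long;

zend_opt_long parse_arg_long_via_struct(zval *arg, uint32_t arg_num);
```

A caller can then branch on the flag and use the value straight from a register, e.g. `zend_opt_long r = parse_arg_long_via_struct(arg, 1); if (!r.has_value) { /* error */ } zend_long foo = r.value;`: no stack slot to protect and no reload.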

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, repeated 15 times. It shows a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT.
For WordPress the difference is small, as my CPU is bottlenecked elsewhere as well.
Results were obtained on an i7-4790 with 32 GiB RAM.

| Test                            | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50)       | 0.5324   | 0.0050     | 0.5272   | 0.0042     |
| Symfony, tracing JIT (-T10,50)  | 0.5301   | 0.0029     | 0.5264   | 0.0036     |
| WordPress, no JIT (-T5,25)      | 0.7408   | 0.0049     | 0.7404   | 0.0048     |
| WordPress, tracing JIT (-T5,25) | 0.6814   | 0.0052     | 0.6770   | 0.0055     |

I was not able to measure any meaningful difference for our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`.

The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).

@nielsdos
Member Author

One potential optimization idea is to store a bool `is_empty` instead of a bool `has_value`, because then we store a 0 instead of a 1 on the "success path". This may be a bit faster on x86, where setting the lower part of some registers also clears the upper part.
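As a rough sketch of that idea (hypothetical names, not code from this PR), the two flag polarities differ only in which constant the success path has to materialize:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;  /* stand-in for the real zend_long */

typedef struct { zend_long value; bool has_value; } opt_has_value;
typedef struct { zend_long value; bool is_empty;  } opt_is_empty;

/* Success path must materialize a 1, e.g. `mov $0x1,%edx`. */
opt_has_value ok_has_value(zend_long v) {
    return (opt_has_value){ .value = v, .has_value = true };
}

/* Success path must materialize a 0; on x86 this can be a self-xor
 * (`xor %edx,%edx`), which also clears the upper part of the register. */
opt_is_empty ok_is_empty(zend_long v) {
    return (opt_is_empty){ .value = v, .is_empty = false };
}
```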

@Girgias Girgias removed their request for review April 26, 2025 20:22
@SakiTakamachi
Member

When building on Arm, an uninitialized-variable warning was triggered at the following location.
Although there doesn't seem to be any uninitialized code path, it might be a good idea to set an initial value just to silence the warning.

```c
N = (calc_sunset ? h_set : h_rise) + gmt_offset;
```


I also dumped the non-cold-block assembly on Arm.
I already knew that, in theory, it should have an effect on Arm as well, and it seems that is indeed the case.

before:

```s
   0:   a9bc7bfd        stp     x29, x30, [sp, #-64]!
   4:   90000002        adrp    x2, 0 <__stack_chk_guard>
   8:   910003fd        mov     x29, sp
   c:   f9400042        ldr     x2, [x2]
  10:   a90153f3        stp     x19, x20, [sp, #16]
  14:   aa0103f4        mov     x20, x1
  18:   f90013f5        str     x21, [sp, #32]
  1c:   b9402c15        ldr     w21, [x0, #44]
  20:   f9400041        ldr     x1, [x2]
  24:   f9001fe1        str     x1, [sp, #56]
  28:   d2800001        mov     x1, #0x0                        // #0
  2c:   710006bf        cmp     w21, #0x1
  30:   54000401        b.ne    b0 <zif_twice+0xb0>  // b.any
  34:   39416001        ldrb    w1, [x0, #88]
  38:   7100103f        cmp     w1, #0x4
  3c:   54000221        b.ne    80 <zif_twice+0x80>  // b.any
  40:   f9402800        ldr     x0, [x0, #80]
  44:   d37ff800        lsl     x0, x0, #1
  48:   52800081        mov     w1, #0x4                        // #4
  4c:   f9000280        str     x0, [x20]
  50:   b9000a81        str     w1, [x20, #8]
  54:   90000000        adrp    x0, 0 <__stack_chk_guard>
  58:   f9400000        ldr     x0, [x0]
  5c:   f9401fe2        ldr     x2, [sp, #56]
  60:   f9400001        ldr     x1, [x0]
  64:   eb010042        subs    x2, x2, x1
  68:   d2800001        mov     x1, #0x0                        // #0
  6c:   540001c1        b.ne    a4 <zif_twice+0xa4>  // b.any
  70:   a94153f3        ldp     x19, x20, [sp, #16]
  74:   f94013f5        ldr     x21, [sp, #32]
  78:   a8c47bfd        ldp     x29, x30, [sp], #64
  7c:   d65f03c0        ret
  80:   91014013        add     x19, x0, #0x50
  84:   2a1503e2        mov     w2, w21
  88:   aa1303e0        mov     x0, x19
  8c:   9100c3e1        add     x1, sp, #0x30
  90:   94000000        bl      0 <zend_parse_arg_long_slow>
  94:   52800122        mov     w2, #0x9                        // #9
  98:   72001c1f        tst     w0, #0xff
  9c:   54000061        b.ne    a8 <zif_twice+0xa8>  // b.any
  a0:   14000000        b       0 <zif_twice>
  a4:   94000000        bl      0 <__stack_chk_fail>
  a8:   f9401be0        ldr     x0, [sp, #48]
  ac:   17ffffe6        b       44 <zif_twice+0x44>
  b0:   14000000        b       0 <zif_twice>
```

after:

```s
   0:   a9bd7bfd        stp     x29, x30, [sp, #-48]!
   4:   910003fd        mov     x29, sp
   8:   f90013f5        str     x21, [sp, #32]
   c:   b9402c15        ldr     w21, [x0, #44]
  10:   a90153f3        stp     x19, x20, [sp, #16]
  14:   710006bf        cmp     w21, #0x1
  18:   540003a1        b.ne    8c <zif_twice+0x8c>  // b.any
  1c:   aa0103f4        mov     x20, x1
  20:   39416001        ldrb    w1, [x0, #88]
  24:   7100103f        cmp     w1, #0x4
  28:   54000141        b.ne    50 <zif_twice+0x50>  // b.any
  2c:   f9402800        ldr     x0, [x0, #80]
  30:   d37ff800        lsl     x0, x0, #1
  34:   52800081        mov     w1, #0x4                        // #4
  38:   f94013f5        ldr     x21, [sp, #32]
  3c:   f9000280        str     x0, [x20]
  40:   b9000a81        str     w1, [x20, #8]
  44:   a94153f3        ldp     x19, x20, [sp, #16]
  48:   a8c37bfd        ldp     x29, x30, [sp], #48
  4c:   d65f03c0        ret
  50:   91014013        add     x19, x0, #0x50
  54:   2a1503e1        mov     w1, w21
  58:   aa1303e0        mov     x0, x19
  5c:   94000000        bl      0 <zend_parse_arg_long_slow>
  60:   72001c3f        tst     w1, #0xff
  64:   54fffe61        b.ne    30 <zif_twice+0x30>  // b.any
  68:   52800120        mov     w0, #0x9                        // #9
  6c:   aa1303e4        mov     x4, x19
  70:   2a1503e1        mov     w1, w21
  74:   a94153f3        ldp     x19, x20, [sp, #16]
  78:   52800003        mov     w3, #0x0                        // #0
  7c:   f94013f5        ldr     x21, [sp, #32]
  80:   d2800002        mov     x2, #0x0                        // #0
  84:   a8c37bfd        ldp     x29, x30, [sp], #48
  88:   14000000        b       0 <zend_wrong_parameter_error>
  8c:   14000000        b       0 <zif_twice>
```

Regarding `is_empty`, the performance should theoretically be the same on Arm, so I think it's fine to decide based on what's best for x86.

@nielsdos nielsdos requested a review from derickr as a code owner April 27, 2025 09:38
@nielsdos
Member Author

@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped, but now there is no pointer escape anymore, and the compiler's uninitialized-variable analysis is not strong enough to avoid the warning.
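A reduced illustration of this kind of false positive (hypothetical code, not the actual ext/date source): every path that reaches the use initializes the variables, but once the values arrive via a struct return rather than escaping through a pointer, the compiler's analysis can lose track of that.

```c
#include <stdbool.h>

typedef struct { double value; bool has_value; } opt_double;

/* hypothetical helper; always fills .value when .has_value is true */
opt_double compute_hour(bool sunset);

double sun_time(bool calc_sunset, double gmt_offset) {
    double h_rise, h_set;  /* assigned only on the success path */
    bool ok = false;

    opt_double r = compute_hour(false);
    opt_double s = compute_hour(true);
    if (r.has_value && s.has_value) {
        h_rise = r.value;
        h_set = s.value;
        ok = true;
    }
    if (!ok) {
        return 0.0;
    }
    /* h_rise/h_set are initialized on every path that gets here, but a
     * compiler that cannot connect `ok` to the assignments may still
     * emit -Wmaybe-uninitialized for the next line. */
    return (calc_sunset ? h_set : h_rise) + gmt_offset;
}
```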

Member

@devnexen devnexen left a comment

overall mstm but lgtm for pgsql at least

Member

@dstogov dstogov left a comment

Returning structures is a system-dependent C feature that is not well defined. On one system it may be implemented through a pair of registers, on others through a stack-allocated area. GCC has special flags that control this behavior (`-fpcc-struct-return`/`-freg-struct-return`), but they break the ABI.

I thought about this feature a long time ago (for general API optimization), but decided not to use it.

Please check the effect of the patch on x86 Linux, and x86/x86_64 Windows.
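For reference, a small example of the platform dependence described above, assuming the SysV x86-64 rules (hypothetical types, for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Small integer struct: the SysV x86-64 ABI returns this in rax:rdx.
 * On Windows x64, by contrast, only 1-, 2-, 4- and 8-byte PODs are
 * returned in rax; a 16-byte struct like this goes through a hidden
 * pointer, so the register benefit would not apply there. */
typedef struct { int64_t value; bool has_value; } small_ret;

/* Larger than 16 bytes: even on SysV x86-64 this is returned via a
 * hidden pointer argument -- the caller supplies the memory, as if the
 * function took a `large_ret *` as an extra first parameter. */
typedef struct { int64_t a, b, c; } large_ret;

small_ret make_small(void) { return (small_ret){ 42, true }; }
large_ret make_large(void) { return (large_ret){ 1, 2, 3 }; }
```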
