Improve internal function argument parsing performance by reducing code bloat #18436

Open
wants to merge 2 commits into master

Conversation

nielsdos
Member

@nielsdos nielsdos commented Apr 26, 2025

The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed.
This is an initial proposal; it may be possible to optimize this further.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice) {
    zend_long foo;
    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_LONG(foo)
    ZEND_PARSE_PARAMETERS_END();
    RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	sub    $0x10,%rsp
<+8>:	mov    %fs:0x28,%r12
<+17>:	mov    %r12,0x8(%rsp)
<+22>:	mov    0x2c(%rdi),%r12d
<+26>:	cmp    $0x1,%r12d
<+30>:	jne    0x22beaf <zif_twice-3212257>
<+36>:	cmpb   $0x4,0x58(%rdi)
<+40>:	mov    %rsi,%rbp
<+43>:	jne    0x53c2f0 <zif_twice+96>
<+45>:	mov    0x50(%rdi),%rax
<+49>:	add    %rax,%rax
<+52>:	movl   $0x4,0x8(%rbp)
<+59>:	mov    %rax,0x0(%rbp)
<+63>:	mov    0x8(%rsp),%rax
<+68>:	sub    %fs:0x28,%rax
<+77>:	jne    0x53c312 <zif_twice+130>
<+79>:	add    $0x10,%rsp
<+83>:	pop    %rbx
<+84>:	pop    %rbp
<+85>:	pop    %r12
<+87>:	ret
<+88>:	nopl   0x0(%rax,%rax,1)
<+96>:	lea    0x50(%rdi),%rbx
<+100>:	mov    %rsp,%rsi
<+103>:	mov    $0x1,%edx
<+108>:	mov    %rbx,%rdi
<+111>:	call   0x620240 <zend_parse_arg_long_slow>
<+116>:	test   %al,%al
<+118>:	je     0x22be96 <zif_twice.cold>
<+124>:	mov    (%rsp),%rax
<+128>:	jmp    0x53c2c1 <zif_twice+49>
<+130>:	call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we return a struct containing a value/flag pair (similar to `std::optional` in C++ or `Option` in Rust), the values are returned via registers. This means we no longer have the stack protector overhead and we don't need to reload the value, resulting in better register usage. This is the resulting assembly for the sample function after this patch:

```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	mov    0x2c(%rdi),%r12d
<+8>:	cmp    $0x1,%r12d
<+12>:	jne    0x22d482 <zif_twice-3205454>
<+18>:	cmpb   $0x4,0x58(%rdi)
<+22>:	mov    %rsi,%rbp
<+25>:	jne    0x53be08 <zif_twice+56>
<+27>:	mov    0x50(%rdi),%rax
<+31>:	add    %rax,%rax
<+34>:	movl   $0x4,0x8(%rbp)
<+41>:	mov    %rax,0x0(%rbp)
<+45>:	pop    %rbx
<+46>:	pop    %rbp
<+47>:	pop    %r12
<+49>:	ret
<+50>:	nopw   0x0(%rax,%rax,1)
<+56>:	lea    0x50(%rdi),%rbx
<+60>:	mov    $0x1,%esi
<+65>:	mov    %rbx,%rdi
<+68>:	call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:	test   %dl,%dl
<+75>:	je     0x22d46a <zif_twice.cold>
<+81>:	jmp    0x53bdef <zif_twice+31>
```
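For illustration, the difference between the two calling conventions looks roughly like this. This is a minimal sketch with stand-in types and hypothetical function names, not the exact declarations from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;   /* stand-in for the real zend_long */
typedef struct _zval zval;   /* opaque here; really defined in zend_types.h */

/* Before: the result is written through a pointer, so the caller needs a
 * stack slot for it (which drags in the stack protector) and has to
 * reload the value after the call. */
bool parse_arg_long_via_ptr(zval *arg, zend_long *dest, uint32_t arg_num);

/* After: the value and the success flag are packed into a small struct.
 * Under the SysV x86-64 ABI a struct like this is returned in rax:rdx,
 * which matches the `test %dl,%dl` in the assembly above. */
typedef struct {
    zend_long value;   /* comes back in rax */
    bool has_value;    /* comes back in dl  */
} zend_opt_long;

zend_opt_long parse_arg_long_via_struct(zval *arg, uint32_t arg_num);
```

A caller can then branch on the flag and use the value straight from a register, e.g. `zend_opt_long r = parse_arg_long_via_struct(arg, 1); if (!r.has_value) { /* error */ } zend_long foo = r.value;`: no stack slot to protect and no reload.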

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, repeated 15 times. It shows a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT.
For WordPress the difference is small, as my CPU is bottlenecked elsewhere as well.
Results were obtained on an i7-4790 with 32 GiB RAM.

| Test                            | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50)       | 0.5324   | 0.0050     | 0.5272   | 0.0042     |
| Symfony, tracing JIT (-T10,50)  | 0.5301   | 0.0029     | 0.5264   | 0.0036     |
| WordPress, no JIT (-T5,25)      | 0.7408   | 0.0049     | 0.7404   | 0.0048     |
| WordPress, tracing JIT (-T5,25) | 0.6814   | 0.0052     | 0.6770   | 0.0055     |

I was not able to measure any meaningful difference for our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`.

The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).

@nielsdos
Member Author

One potential optimization idea is to store a bool `is_empty` instead of a bool `has_value`, because then we store a 0 instead of a 1 on the "success path". This may be a bit faster on x86, where setting the lower part of some registers also clears the upper part.
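As a rough sketch of that idea (hypothetical names, not code from this PR), the two flag polarities differ only in which constant the success path has to materialize:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;  /* stand-in for the real zend_long */

typedef struct { zend_long value; bool has_value; } opt_has_value;
typedef struct { zend_long value; bool is_empty;  } opt_is_empty;

/* Success path must materialize a 1, e.g. `mov $0x1,%edx`. */
opt_has_value ok_has_value(zend_long v) {
    return (opt_has_value){ .value = v, .has_value = true };
}

/* Success path must materialize a 0; on x86 this can be a self-xor
 * (`xor %edx,%edx`), which also clears the upper part of the register. */
opt_is_empty ok_is_empty(zend_long v) {
    return (opt_is_empty){ .value = v, .is_empty = false };
}
```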

@Girgias Girgias removed their request for review April 26, 2025 20:22
@SakiTakamachi
Member

When building on Arm, an uninitialized-variable warning was triggered at the following location.
Although there doesn't seem to be any uninitialized code path, it might be a good idea to set an initial value just to silence the warning.

```c
N = (calc_sunset ? h_set : h_rise) + gmt_offset;
```


I also dumped the non-cold-block assembly on Arm.
I already knew that, in theory, it should have an effect on Arm as well, and it seems that is indeed the case.

before:

```s
   0:   a9bc7bfd        stp     x29, x30, [sp, #-64]!
   4:   90000002        adrp    x2, 0 <__stack_chk_guard>
   8:   910003fd        mov     x29, sp
   c:   f9400042        ldr     x2, [x2]
  10:   a90153f3        stp     x19, x20, [sp, #16]
  14:   aa0103f4        mov     x20, x1
  18:   f90013f5        str     x21, [sp, #32]
  1c:   b9402c15        ldr     w21, [x0, #44]
  20:   f9400041        ldr     x1, [x2]
  24:   f9001fe1        str     x1, [sp, #56]
  28:   d2800001        mov     x1, #0x0                        // #0
  2c:   710006bf        cmp     w21, #0x1
  30:   54000401        b.ne    b0 <zif_twice+0xb0>  // b.any
  34:   39416001        ldrb    w1, [x0, #88]
  38:   7100103f        cmp     w1, #0x4
  3c:   54000221        b.ne    80 <zif_twice+0x80>  // b.any
  40:   f9402800        ldr     x0, [x0, #80]
  44:   d37ff800        lsl     x0, x0, #1
  48:   52800081        mov     w1, #0x4                        // #4
  4c:   f9000280        str     x0, [x20]
  50:   b9000a81        str     w1, [x20, #8]
  54:   90000000        adrp    x0, 0 <__stack_chk_guard>
  58:   f9400000        ldr     x0, [x0]
  5c:   f9401fe2        ldr     x2, [sp, #56]
  60:   f9400001        ldr     x1, [x0]
  64:   eb010042        subs    x2, x2, x1
  68:   d2800001        mov     x1, #0x0                        // #0
  6c:   540001c1        b.ne    a4 <zif_twice+0xa4>  // b.any
  70:   a94153f3        ldp     x19, x20, [sp, #16]
  74:   f94013f5        ldr     x21, [sp, #32]
  78:   a8c47bfd        ldp     x29, x30, [sp], #64
  7c:   d65f03c0        ret
  80:   91014013        add     x19, x0, #0x50
  84:   2a1503e2        mov     w2, w21
  88:   aa1303e0        mov     x0, x19
  8c:   9100c3e1        add     x1, sp, #0x30
  90:   94000000        bl      0 <zend_parse_arg_long_slow>
  94:   52800122        mov     w2, #0x9                        // #9
  98:   72001c1f        tst     w0, #0xff
  9c:   54000061        b.ne    a8 <zif_twice+0xa8>  // b.any
  a0:   14000000        b       0 <zif_twice>
  a4:   94000000        bl      0 <__stack_chk_fail>
  a8:   f9401be0        ldr     x0, [sp, #48]
  ac:   17ffffe6        b       44 <zif_twice+0x44>
  b0:   14000000        b       0 <zif_twice>
```

after:

```s
   0:   a9bd7bfd        stp     x29, x30, [sp, #-48]!
   4:   910003fd        mov     x29, sp
   8:   f90013f5        str     x21, [sp, #32]
   c:   b9402c15        ldr     w21, [x0, #44]
  10:   a90153f3        stp     x19, x20, [sp, #16]
  14:   710006bf        cmp     w21, #0x1
  18:   540003a1        b.ne    8c <zif_twice+0x8c>  // b.any
  1c:   aa0103f4        mov     x20, x1
  20:   39416001        ldrb    w1, [x0, #88]
  24:   7100103f        cmp     w1, #0x4
  28:   54000141        b.ne    50 <zif_twice+0x50>  // b.any
  2c:   f9402800        ldr     x0, [x0, #80]
  30:   d37ff800        lsl     x0, x0, #1
  34:   52800081        mov     w1, #0x4                        // #4
  38:   f94013f5        ldr     x21, [sp, #32]
  3c:   f9000280        str     x0, [x20]
  40:   b9000a81        str     w1, [x20, #8]
  44:   a94153f3        ldp     x19, x20, [sp, #16]
  48:   a8c37bfd        ldp     x29, x30, [sp], #48
  4c:   d65f03c0        ret
  50:   91014013        add     x19, x0, #0x50
  54:   2a1503e1        mov     w1, w21
  58:   aa1303e0        mov     x0, x19
  5c:   94000000        bl      0 <zend_parse_arg_long_slow>
  60:   72001c3f        tst     w1, #0xff
  64:   54fffe61        b.ne    30 <zif_twice+0x30>  // b.any
  68:   52800120        mov     w0, #0x9                        // #9
  6c:   aa1303e4        mov     x4, x19
  70:   2a1503e1        mov     w1, w21
  74:   a94153f3        ldp     x19, x20, [sp, #16]
  78:   52800003        mov     w3, #0x0                        // #0
  7c:   f94013f5        ldr     x21, [sp, #32]
  80:   d2800002        mov     x2, #0x0                        // #0
  84:   a8c37bfd        ldp     x29, x30, [sp], #48
  88:   14000000        b       0 <zend_wrong_parameter_error>
  8c:   14000000        b       0 <zif_twice>
```

Regarding `is_empty`, the performance should theoretically be the same on Arm, so I think it's fine to decide based on what's best for x86.

@nielsdos nielsdos requested a review from derickr as a code owner April 27, 2025 09:38
@nielsdos
Member Author

@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped, but now there is no pointer escape anymore, and the compiler's uninitialized-variable analysis is not strong enough to avoid the warning.
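A reduced illustration of this kind of false positive (hypothetical code, not the actual ext/date source): every path that reaches the use initializes the variables, but once the values arrive via a struct return rather than escaping through a pointer, the compiler's analysis can lose track of that.

```c
#include <stdbool.h>

typedef struct { double value; bool has_value; } opt_double;

/* hypothetical helper; always fills .value when .has_value is true */
opt_double compute_hour(bool sunset);

double sun_time(bool calc_sunset, double gmt_offset) {
    double h_rise, h_set;  /* assigned only on the success path */
    bool ok = false;

    opt_double r = compute_hour(false);
    opt_double s = compute_hour(true);
    if (r.has_value && s.has_value) {
        h_rise = r.value;
        h_set = s.value;
        ok = true;
    }
    if (!ok) {
        return 0.0;
    }
    /* h_rise/h_set are initialized on every path that gets here, but a
     * compiler that cannot connect `ok` to the assignments may still
     * emit -Wmaybe-uninitialized for the next line. */
    return (calc_sunset ? h_set : h_rise) + gmt_offset;
}
```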

Member

@devnexen devnexen left a comment

overall mstm but lgtm for pgsql at least

Member

@dstogov dstogov left a comment

Returning structures is a system-dependent C feature that is not well defined. On one system it may be implemented through a pair of registers, on others through a stack-allocated area. GCC has special flags that control this behavior (`-fpcc-struct-return`/`-freg-struct-return`), but they break the ABI.

I thought about this feature a long time ago (for general API optimization), but decided not to use it.

Please check the effect of the patch on x86 Linux, and x86/x86_64 Windows.
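For reference, a small example of the platform dependence described above, assuming the SysV x86-64 rules (hypothetical types, for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Small integer struct: the SysV x86-64 ABI returns this in rax:rdx.
 * On Windows x64, by contrast, only 1-, 2-, 4- and 8-byte PODs are
 * returned in rax; a 16-byte struct like this goes through a hidden
 * pointer, so the register benefit would not apply there. */
typedef struct { int64_t value; bool has_value; } small_ret;

/* Larger than 16 bytes: even on SysV x86-64 this is returned via a
 * hidden pointer argument -- the caller supplies the memory, as if the
 * function took a `large_ret *` as an extra first parameter. */
typedef struct { int64_t a, b, c; } large_ret;

small_ret make_small(void) { return (small_ret){ 42, true }; }
large_ret make_large(void) { return (large_ret){ 1, 2, 3 }; }
```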
