Tail call VM #17849

Draft
wants to merge 1 commit into
base: master

arnaud-lb
Member

@arnaud-lb arnaud-lb commented Feb 18, 2025

This implements the technique described in https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, which addresses the issues described in http://lua-users.org/lists/lua-l/2011-02/msg00742.html. Python recently implemented this, which resulted in a 9-15% performance improvement: https://blog.reverberate.org/2025/02/10/tail-call-updates.html.

It turns out that @dstogov already addressed these issues with a different technique, enabled when compiling with GCC, so this will not improve performance with that compiler, but it makes PHP on Clang as fast as on GCC.

Benchmarks

Zend/bench.php:

Benchmark 1: /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.006 s ±  0.002 s    [User: 0.984 s, System: 0.020 s]
  Range (min … max):    1.003 s …  1.008 s    10 runs
 
Benchmark 2: /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.783 s ±  0.009 s    [User: 1.761 s, System: 0.019 s]
  Range (min … max):    1.771 s …  1.801 s    10 runs
 
Benchmark 3: /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.017 s ±  0.003 s    [User: 0.998 s, System: 0.018 s]
  Range (min … max):    1.014 s …  1.023 s    10 runs
 
Summary
  /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php ran
    1.01 ± 0.00 times faster than /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
    1.77 ± 0.01 times faster than /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php

PHP/Clang was 77% slower than PHP/GCC in this benchmark; with this change it is only 1% slower.
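For reference, the percentages follow from the mean times reported by the benchmark tool above. A minimal sketch of the arithmetic (the function name `slowdown_percent` is made up for this illustration and is not part of the PR):

```c
#include <assert.h>

/* Relative slowdown of mean time `t` against the baseline mean `base`,
 * in percent. Both arguments are wall-clock times in seconds. */
double slowdown_percent(double t, double base) {
    assert(base > 0.0);
    return (t / base - 1.0) * 100.0;
}
```

With the means above, `slowdown_percent(1.783, 1.006)` is a little over 77 (clang-base vs gcc-base) and `slowdown_percent(1.017, 1.006)` is a little over 1 (clang-tail vs gcc-base).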

Symfony Demo:

gcc-base:    mean:  0.5064;  stddev:  0.0008;  diff:  +0.00%
clang-base:  mean:  0.5344;  stddev:  0.0006;  diff:  +5.53%
clang-tail:  mean:  0.5017;  stddev:  0.0008;  diff:  -0.94%

PHP/Clang was 5% slower than PHP/GCC in this benchmark; with this change it is about 1% faster.

Current interpreter

The interpreter is generated by Zend/zend_vm_gen.php. Multiple modes exist, but the default (and only supported) mode is the hybrid one, which generates both a call-based interpreter and a GCC-specific interpreter. Which one is actually compiled depends on the compiler being used.

In the call-based interpreter, opcode handlers are separate functions, the next opline to execute is stored in execute_data, and execute_data is passed as an argument to the handlers:

void execute_ex() {
    while (1) {
        int ret = execute_data->opline->handler(execute_data);
        if (ret != 0) {
            // leave interpreter
        }
    }
}

// example op handler
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
    // load opline
    const zend_op *opline = execute_data->opline;

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    execute_data->opline++;
    return 0; // ZEND_VM_CONTINUE()
}

Handlers typically load execute_data->opline, execute the operation, update execute_data->opline, and return.
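To make the shape of this loop concrete, here is a toy, self-contained version of a call-based interpreter. It is only a sketch: `toy_op`, `toy_frame` and the two handlers are hypothetical stand-ins for `zend_op`, `zend_execute_data` and the generated handlers, not the real Zend structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_frame toy_frame;
typedef int (*toy_handler)(toy_frame *frame);

/* Hypothetical stand-in for zend_op: a handler pointer plus one operand. */
typedef struct toy_op {
    toy_handler handler;
    long operand;
} toy_op;

/* Hypothetical stand-in for zend_execute_data. */
struct toy_frame {
    const toy_op *opline; /* next instruction, kept in memory */
    long acc;             /* toy VM state */
};

/* Load opline from the frame, execute, store the updated opline back. */
int toy_add(toy_frame *frame) {
    const toy_op *opline = frame->opline;
    frame->acc += opline->operand;
    frame->opline = opline + 1;
    return 0; /* continue */
}

int toy_halt(toy_frame *frame) {
    (void)frame;
    return 1; /* leave interpreter */
}

/* Central loop: one indirect call and one return per instruction. */
long toy_execute(const toy_op *program) {
    toy_frame frame = { program, 0 };
    assert(program != NULL);
    while (1) {
        if (frame.opline->handler(&frame) != 0) {
            return frame.acc;
        }
    }
}
```

Running `toy_execute()` on the program `{toy_add, 2}, {toy_add, 40}, {toy_halt, 0}` returns 42. Every instruction pays for a call, a return, and a load/store of `frame->opline`.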

There is quite a lot of overhead: The call instruction pushes a return address on the stack, the function saves/spills registers, etc. E.g. the code of ZEND_INIT_FCALL_SPEC_CONST_HANDLER() starts with

push   %rbp
push   %r15
push   %r14
push   %rbx
push   %rax

Also, opline needs to be loaded/stored from/to memory.

The GCC interpreter manages to eliminate the overhead. opline->handler is a computed-goto target, which calls the actual handler. Hot handlers are inlined, FP/IP (execute_data/opline) are register variables, handlers take no arguments and have no return value:

void execute_ex() {
    goto opline->handler;
    ZEND_INIT_FCALL_SPEC_CONST_LABEL:
        ZEND_INIT_FCALL_SPEC_CONST_HANDLER(); // inlined
        goto opline->handler;
    ... (other handlers)
    ZEND_RETURN:
        // leave interpreter
}

void always_inline ZEND_INIT_FCALL_SPEC_CONST_HANDLER(void) {
    // opline is already in a register
    
    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    return;
}
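A toy computed-goto loop, for comparison. This is a sketch under assumptions: it relies on the GNU C labels-as-values extension (available in GCC and Clang), it dispatches through an opcode-indexed label table rather than storing the label address in the op as the hybrid VM does, and it uses ordinary locals where the GCC interpreter uses register variables. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { GOTO_OP_ADD, GOTO_OP_HALT };

typedef struct goto_op {
    int opcode;
    long operand;
} goto_op;

long goto_execute(const goto_op *program) {
    /* GNU C extension: && takes the address of a label. */
    static void *const labels[] = { &&do_add, &&do_halt };
    const goto_op *opline = program; /* local, so it can stay in a register */
    long acc = 0;

    assert(program != NULL);
    goto *labels[opline->opcode];

do_add:
    acc += opline->operand;
    opline++;
    goto *labels[opline->opcode]; /* dispatch without call/return */

do_halt:
    return acc;
}
```

With the program `{GOTO_OP_ADD, 2}, {GOTO_OP_ADD, 40}, {GOTO_OP_HALT, 0}` this returns 42, and no call instruction is executed between opcodes.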

Changes

Here I add a variation of the call-based interpreter, enabled when using clang-19:

  • execute_data and opline are passed as op handler arguments, so they are always in registers unless they are spilled on the stack
  • handlers tail call the next opline handler: function call overhead is eliminated
  • handlers use the preserve_none calling convention: reduces register save/spills.
void execute_ex() {
    execute_data->opline->handler(execute_data, execute_data->opline);
    // leave interpreter
}

__attribute__((preserve_none))
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
    // opline is already loaded

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    __attribute__((musttail)) return opline->handler(execute_data, opline);
}

The musttail attribute is used to force tail calling.
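The same toy VM written in tail-call style might look like the sketch below. Assumptions: the names are hypothetical, `preserve_none` is left out to keep the example portable, and `musttail` is applied only where the compiler reports support for it. Without it the code still runs, but the tail call is then an optimization the compiler may or may not perform.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__has_attribute)
# if __has_attribute(musttail)
#  define TAIL_MUSTTAIL __attribute__((musttail))
# endif
#endif
#ifndef TAIL_MUSTTAIL
# define TAIL_MUSTTAIL /* fallback: plain call, tail call not guaranteed */
#endif

typedef struct tail_frame { long acc; } tail_frame;
typedef struct tail_op tail_op;
typedef long (*tail_handler)(tail_frame *frame, const tail_op *opline);

struct tail_op {
    tail_handler handler;
    long operand;
};

/* frame and opline arrive as arguments, so they are already in registers. */
long tail_add(tail_frame *frame, const tail_op *opline) {
    frame->acc += opline->operand;
    opline++;
    /* Same signature as the caller, as musttail requires. */
    TAIL_MUSTTAIL return opline->handler(frame, opline);
}

long tail_halt(tail_frame *frame, const tail_op *opline) {
    (void)opline;
    return frame->acc; /* the only real return: leaves the interpreter */
}

long tail_execute(const tail_op *program) {
    tail_frame frame = { 0 };
    assert(program != NULL);
    return program->handler(&frame, program);
}
```

For the program `{tail_add, 2}, {tail_add, 40}, {tail_halt, 0}`, `tail_execute()` returns 42. There is no central loop: each handler jumps straight to the next one.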

Unfortunately musttail rejects calls to functions whose signature is not compatible with the caller, so it's not possible to tail call VM helpers that have extra parameters. Instead, we use a trampoline when calling these: The helper returns a struct{opline,handler} (in two registers) which is then tail called by the caller. Since helpers always return (unless they call other helpers), the stack doesn't grow indefinitely:

    // ZEND_VM_DISPATCH_TO_HELPER(zend_cannot_pass_by_ref_helper, _arg_num, arg_num, _arg, arg)
    zend_vm_trampoline t = zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline);
    __attribute__((musttail)) return t.handler(execute_data, t.opline);
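A self-contained sketch of this trampoline pattern on a toy VM follows. The names (`tramp_target`, `tramp_scale_helper`, and so on) are hypothetical, and `musttail` is again applied only when the compiler supports it.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__has_attribute)
# if __has_attribute(musttail)
#  define TRAMP_MUSTTAIL __attribute__((musttail))
# endif
#endif
#ifndef TRAMP_MUSTTAIL
# define TRAMP_MUSTTAIL /* fallback: plain call */
#endif

typedef struct tramp_frame { long acc; } tramp_frame;
typedef struct tramp_op tramp_op;
typedef long (*tramp_handler)(tramp_frame *frame, const tramp_op *opline);

struct tramp_op {
    tramp_handler handler;
    long operand;
};

/* {opline, handler} pair returned by helpers instead of dispatching. */
typedef struct tramp_target {
    const tramp_op *opline;
    tramp_handler handler;
} tramp_target;

/* Helper with an extra parameter: its signature differs from the
 * handlers', so it is called normally and returns the next target. */
tramp_target tramp_scale_helper(long factor, tramp_frame *frame,
                                const tramp_op *opline) {
    tramp_target t;
    frame->acc *= factor;
    opline++;
    t.opline = opline;
    t.handler = opline->handler;
    return t;
}

long tramp_add(tramp_frame *frame, const tramp_op *opline) {
    frame->acc += opline->operand;
    opline++;
    TRAMP_MUSTTAIL return opline->handler(frame, opline);
}

/* The handler, not the helper, performs the tail call. */
long tramp_double(tramp_frame *frame, const tramp_op *opline) {
    tramp_target t = tramp_scale_helper(2, frame, opline);
    TRAMP_MUSTTAIL return t.handler(frame, t.opline);
}

long tramp_halt(tramp_frame *frame, const tramp_op *opline) {
    (void)opline;
    return frame->acc;
}

long tramp_execute(const tramp_op *program) {
    tramp_frame frame = { 0 };
    assert(program != NULL);
    return program->handler(&frame, program);
}
```

For the program `{tramp_add, 21}, {tramp_double, 0}, {tramp_halt, 0}`, `tramp_execute()` returns 42. The helper's stack frame is gone before the next handler starts, so the stack depth stays bounded.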

I introduce a ZEND_VM_DISPATCH() macro that is used by ZEND_VM_NEXT_OPCODE() and related macros. This macro tail calls the next opline by default. In VM helpers with extra parameters, ZEND_VM_DISPATCH() is redefined to return the trampoline value instead:

#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_NOTAIL
zend_vm_trampoline zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline) {
   ...
}
#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_DEFAULT
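The redefinition trick can be reproduced in miniature as below. This is a sketch, not the generated code: the `MD_*` macros and types are invented for the example, and `musttail` is omitted for brevity. The point is that a shared macro such as `MD_NEXT_OPCODE()` expands according to whichever `MD_DISPATCH` definition is active at the point of use.

```c
#include <assert.h>
#include <stddef.h>

typedef struct md_frame { long acc; } md_frame;
typedef struct md_op md_op;
typedef long (*md_handler)(md_frame *frame, const md_op *opline);
struct md_op { md_handler handler; long operand; };
typedef struct md_target { const md_op *opline; md_handler handler; } md_target;

/* Two dispatch strategies behind one name. */
#define MD_DISPATCH_DEFAULT() return opline->handler(frame, opline)
#define MD_DISPATCH_NOTAIL()  return (md_target){ opline, opline->handler }
#define MD_DISPATCH MD_DISPATCH_DEFAULT

/* Shared by handlers and helpers; follows the current MD_DISPATCH. */
#define MD_NEXT_OPCODE() do { opline++; MD_DISPATCH(); } while (0)

#undef  MD_DISPATCH
#define MD_DISPATCH MD_DISPATCH_NOTAIL
md_target md_add_helper(long amount, md_frame *frame, const md_op *opline) {
    frame->acc += amount;
    MD_NEXT_OPCODE(); /* here: returns {opline, handler} */
}
#undef  MD_DISPATCH
#define MD_DISPATCH MD_DISPATCH_DEFAULT

long md_add(md_frame *frame, const md_op *opline) {
    md_target t = md_add_helper(opline->operand, frame, opline);
    return t.handler(frame, t.opline);
}

long md_inc(md_frame *frame, const md_op *opline) {
    frame->acc += 1;
    MD_NEXT_OPCODE(); /* here: calls the next handler */
}

long md_halt(md_frame *frame, const md_op *opline) {
    (void)opline;
    return frame->acc;
}

long md_execute(const md_op *program) {
    md_frame frame = { 0 };
    assert(program != NULL);
    return program->handler(&frame, program);
}
```

For the program `{md_add, 40}, {md_inc, 0}, {md_inc, 0}, {md_halt, 0}`, `md_execute()` returns 42.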

Caveats

  • The ABI of __attribute__((preserve_none)) is not stable, so we may not be able to use it in exported functions. This has implications for JIT and user opcode handlers. We might need to generate wrappers with a stable convention.
  • There are now 3 interpreters to test. It may be possible to enable some of the changes by default (e.g. passing opline as argument and __attribute__((preserve_none))) to reduce the differences between the call-based interpreter and the clang one.

TODO

  • JIT support
  • Measure the impact of passing opline as argument, without other changes. Maybe do that by default?
  • Measure the impact of __attribute__((preserve_none)), without other changes
  • Measure/test on aarch64, x86 (not sure it's supported)

Future scope:

  • Tweak preserve_none / preserve_most / slow paths

@@ -313,6 +313,18 @@ char *alloca();
# define ZEND_FASTCALL
#endif

#if __has_attribute(preserve_none) && !defined(__SANITIZE_ADDRESS__)
Member Author

There is an incompatibility between preserve_none and ASAN, which crashes Clang. I will report the issue.

@@ -8212,9 +8212,9 @@ ZEND_VM_HANDLER(150, ZEND_USER_OPCODE, ANY, ANY)
case ZEND_USER_OPCODE_LEAVE:
ZEND_VM_LEAVE();
case ZEND_USER_OPCODE_DISPATCH:
-ZEND_VM_DISPATCH(opline->opcode, opline);
+ZEND_VM_DISPATCH_OPCODE(opline->opcode, opline);
Member Author

Renamed this rarely used macro so I could re-use its name.

@@ -1,3 +1,5 @@
#include "Zend/zend_vm_opcodes.h"
Member Author

This makes language servers / IDEs happy when viewing zend_vm_execute.h

Comment on lines +2426 to +2427
$str .= "#include <main/php_config.h>\n";
$str .= "#include \"Zend/zend_portability.h\"\n";
Member Author

This makes language servers / IDEs happy when viewing zend_vm_opcodes.h

@dstogov
Member

dstogov commented Feb 18, 2025

Interesting work! I suppose this will require special support for JIT.

@arnaud-lb
Member Author

Yes, this does require some changes to the JIT to accommodate the new opcode handler signature and how FP/IP are passed around. I plan to implement them unless there are major issues with the current approach.

The fact that preserve_none is an unstable ABI will complicate things a bit. Some possible solutions I have in mind are:

  • Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit
  • Enforce that opcache must be compiled with the same clang version as the php binary
  • Move opcache to Zend/ and embed it in the php binary

The second one seems reasonable to me.
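As an illustration of the first option, a stable-ABI wrapper could look roughly like the sketch below. Everything here is hypothetical (`wrap_frame`, `wrap_add_handler`, `wrap_add_stable`, the `WRAP_PRESERVE_NONE` macro); the guard mirrors the condition used in the diff, and on compilers without `preserve_none` the macro expands to nothing, so both functions use the default convention.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__has_attribute)
# if __has_attribute(preserve_none) && !defined(__SANITIZE_ADDRESS__)
#  define WRAP_PRESERVE_NONE __attribute__((preserve_none))
# endif
#endif
#ifndef WRAP_PRESERVE_NONE
# define WRAP_PRESERVE_NONE /* attribute unavailable: default convention */
#endif

typedef struct wrap_frame { long acc; } wrap_frame;

/* Internal handler: may use the unstable convention, so it is kept
 * static and never called from another binary. */
static WRAP_PRESERVE_NONE long wrap_add_handler(wrap_frame *frame, long operand) {
    frame->acc += operand;
    return frame->acc;
}

/* Exported wrapper: uses the platform's default, stable convention.
 * This is what code compiled separately would call. */
long wrap_add_stable(wrap_frame *frame, long operand) {
    assert(frame != NULL);
    return wrap_add_handler(frame, operand);
}
```

The cost of this option is one extra call per wrapped handler on the path that goes through the wrapper, which is why the other two options avoid wrappers altogether.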

@@ -21,6 +21,9 @@
#ifndef ZEND_VM_OPCODES_H
#define ZEND_VM_OPCODES_H

#include <main/php_config.h>
Member

Is it possible to avoid this dependency on main?

@cmb69
Member

cmb69 commented Feb 18, 2025

FWIW: feature request to support guaranteed tail calls for MSVC.

@dstogov
Member

dstogov commented Feb 19, 2025

> Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit

The HYBRID VM generates two handlers for each opcode (a C function with the standard ABI + a non-standard GOTO). The JIT uses one or the other when suitable. Technically, a tail call does the same thing as the GOTO, so the same approach might work.

CLANG doesn't support global register variables. LLVM may achieve a similar thing using a custom calling convention that pins arguments to registers (this technique is used for Haskell, Erlang, HHVM ...). Unfortunately, I didn't find a way to introduce a new calling convention without patching LLVM (cool OOP style). Using them in CLANG was also problematic. That was a long time ago, and maybe something has changed.

@dstogov
Member

dstogov commented Feb 19, 2025

BTW LLVM/CLANG should support local register variables. So maybe the GOTO and HYBRID VMs could be adapted.

3 participants