深入了解解释器

2020年10月27日 · 作者：John Högberg

在我的前一篇文章中，我们了解了 BEAM，现在我们对它更熟悉了，是时候看看参考实现：解释器了。

解释器可以被认为是一个无限循环，它查看当前指令，执行它，然后移动到下一条指令。

在正常的构建中，我们的代码布局为直接线程代码，其中每条指令由其处理程序的机器代码地址组成，后跟其参数，参数后面又是下一条指令

Instruction address
    First argument
    Second argument
    ... and so on.
Instruction address
    First argument
    Second argument
    ... and so on.

当一条指令完成时，它会读取下一条指令地址并跳转到它，永远遵循指令的“线程”。

除了极少数例外，这些指令都是“纯粹”的，因为它们总是对相同的输入执行相同的操作，并且它们只会通过改变控制流或将结果写入寄存器来影响 BEAM。这使得指令非常容易阅读和推理，所以为了简洁起见，我们只看其中一条指令。

is_nonempty_list(Label, Src) {

    /* Check if our $Src is not a list. */
    if (is_not_list($Src)) {

        /* Invoke the $FAIL macro, jumping to our
         * $Label. */
        $FAIL($Label);
    }

    /* Execute the next instruction. */
}

上面是用特定领域语言 (DSL) 编写的，它是 C 的超集，其中 $ 前缀的宏会被展开，其余部分保持原样。

beam_makeops 脚本会获取此定义，并为那些手写很麻烦的部分生成代码，例如跳转到下一条指令。它还隐藏了参数处理，这使我们能够通过在幕后将小参数打包在一起从而减小代码大小，我们在之前的关于该主题的文章中探讨过这一点。

出于性能原因，它还基于 ops.tab 中概述的参数类型生成不同的变体，以减少我们在运行时需要做的工作量。

让我们看一下此指令的生成代码，特别是 X 寄存器的变体

/* (This has been modified slightly for readability) */
case is_nonempty_list_fx:
{
    Eterm arg_word, term;

    /* Read the argument word from the instruction
     * stream. */
    arg_word = I[1];

    /* Unpack the offset of our source register (upper
     * 32 bits) and then read its contents.
     *
     * Note that we address the X registers directly;
     * had this instruction not been specialized, we
     * would first need to determine whether the
     * argument was an X or a Y register. */
    term = x_registers[arg_word >> 32];

    /* is_not_list(term) */
    if (term & (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)) {

        /* Unpack our fail label (lower 32 bits) and add
         * it to our current instruction pointer. This
         * is an offset and may be negative for backward
         * jumps. */
        I += (Sint32)arg_word;

        /* Jump to the instruction at the fail label using
         * the "labels as values" GCC extension. */
        goto *I;
    }

    /* Skip the current label address and argument word,
     * then jump to the next instruction. This was
     * automatically generated by beam_makeops. */
    I += 2;
    goto *I;
}

还有很多内容，那些想了解更多的人可以阅读beam_makeops 脚本的文档，但以上内容涵盖了要点。

虽然以上效率很高，但调度和参数处理会产生很大的开销。以下是上面稍微修改过的汇编列表

    ; Read the argument word from the instruction
    ; stream.
    mov    rdx, [rbx + 8]

    ; Unpack the offset of our source register (upper 32
    ; 32 bits).
    mov    rcx, rdx
    shr    rcx, 32

    ; X registers live in machine register r15, and we
    ; placed our offset in rcx, so we can find our term at
    ; [r15 + rcx].
    ;
    ; Perform a non-destructive bitwise AND on the term
    ; using the `test` instruction, and jump to the fail
    ; label if the result is non-zero.
    test   byte [r15 + rcx], (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)
    jne    jump_to_fail_label

    ; Skip the current label address and argument word,
    ; then jump to the next instruction.
    add    rbx, 16
    jmp    [rbx]

jump_to_fail_label:
    ; Unpack our fail label (lower 32 bits) and add
    ; it to our current instruction pointer.
    movsxd rdx, edx
    lea    rbx, [rbx + rdx * 8]

    ; Jump to the instruction at the fail label.
    jmp    [rbx]

粗体部分是指令的核心，其余部分是参数解包和指令调度。虽然这对大型指令来说不是什么大问题，但它对像这样短小的指令的影响非常大，并且在查看分析器（例如 perf）时，最终的 jmp 通常会占主导地位。

为了减少这种影响，加载程序将常用的指令序列组合成一条指令。例如，两个独立的移动可能会融合到 move2_par 中，而 is_nonempty_list 后跟 get_list 可能会融合到 is_nonempty_list_get_list 中。

这降低了短指令的成本，但仅在我们可以识别常见模式的范围内有效，并且由于每个组合都必须手动实现，因此维护成本很高。即便如此，效果往往是适度的，并且调度开销仍然很大。

解释器的另一个缺点（尽管较小）是，现代处理器针对在“普通”本机代码中常见的模式进行了高度优化。例如，几乎所有处理器都有一个专门的预测器，专门用于调用和返回。假设每次调用都有一个对应的返回，除非抛出异常，否则它可以完美地预测返回，但由于解释器不使用本机调用和返回，因此它无法利用这种优化。

不幸的是，关于这一点没有太多的改进空间，经过二十多年的改进，以有意义的方式对其进行优化变得越来越困难。

因此，我们提高性能的努力反而集中在两个方面：改进编译器和实现 JIT。最近，我们在这两个方面都取得了很大进展，并且我们为最终合并后者而感到非常自豪。

请继续关注我们的下一篇文章，我们将了解 JIT，看看它是如何避免这些问题的。