> resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly
it's always surprising for me how absurdly efficient "highly specialized VM/instruction interpreters" are
like e.g. two independent research projects into how to have better (fast, more compact) serialization in rust ended up with something like a VM/interpreter for serialization instructions leading to both higher performance and more compact code size while still being cable of supporting similar feature sets as serde(1)
(in general monomorphisation and double dispatch (e.g. serde) can bring you very far, but the best approach is like always not the extrem. Neither allays monomorphisation nor dynamic dispatch but a balance between taking advantage of the strength of both. And specialized mini VMs are in a certain way an extra flexible form of dynamic dispatch.)
---
(1): More compact code size on normal to large project, not necessary on micro projects as the "fixed overhead" is often slightly larger while the per serialization type/protocol overhead can be smaller.
(1b): They have been experimental research project, not sure if any of them got published to GitHub, non are suited for usage in production or similar.
gavinray 53 minutes ago [-]
It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
it's always surprising for me how absurdly efficient "highly specialized VM/instruction interpreters" are
like e.g. two independent research projects into how to have better (fast, more compact) serialization in rust ended up with something like a VM/interpreter for serialization instructions leading to both higher performance and more compact code size while still being cable of supporting similar feature sets as serde(1)
(in general monomorphisation and double dispatch (e.g. serde) can bring you very far, but the best approach is like always not the extrem. Neither allays monomorphisation nor dynamic dispatch but a balance between taking advantage of the strength of both. And specialized mini VMs are in a certain way an extra flexible form of dynamic dispatch.)
---
(1): More compact code size on normal to large project, not necessary on micro projects as the "fixed overhead" is often slightly larger while the per serialization type/protocol overhead can be smaller.
(1b): They have been experimental research project, not sure if any of them got published to GitHub, non are suited for usage in production or similar.
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
https://sillycross.github.io/2022/11/22/2022-11-22/
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
Tail recursion opens up for people to write really really neat looping facilities using macros.
https://doc.rust-lang.org/std/keyword.become.html
> Last week, I wrote a tail-call interpreter using the become keyword, which was recently added to nightly Rust (seven months ago is recent, right?).