Cool paper! The authors use the fact that the M1 chip supports both ARM's weaker memory consistency model and x86's total store order (TSO) to investigate the performance hit from using the latter, ceteris paribus.
They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.
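To make the difference concrete, here is a minimal message-passing litmus test (my sketch, not code from the paper): under x86-style TSO the reader below must see data == 1 once it sees flag == 1, because TSO keeps the two plain stores, and the two plain loads, in program order. ARM's weaker model does not, so r == 0 is a legal outcome unless you ask for release/acquire ordering.

    // Message-passing litmus test (illustrative sketch). Relaxed atomics
    // compile to plain loads/stores, so the hardware memory model decides
    // the ordering. Caveat: the compiler may also reorder relaxed ops, so
    // portable code should use release/acquire regardless.
    #include <atomic>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<int> flag{0};

    void writer() {
        data.store(1, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);
    }

    void reader() {
        while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
        int r = data.load(std::memory_order_relaxed);
        (void)r;  // TSO hardware: always 1. WO hardware: may be 0.
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }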
jandrewrogers 17 minutes ago
This raises questions.
For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don’t think that is controversial. There are a lot of ways to explain it, e.g., x86 is significantly more efficient in some other, unrelated areas; x86 code is tacitly designed to minimize the performance impact of TSO; or the Apple Silicon implementations nerf TSO because it isn’t worth the cost to optimize a compatibility shim. TSO must have some value in some contexts; it wasn’t chosen arbitrarily.
Apple Silicon is also an unconventional implementation of ARM64, so I wonder to what extent this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.
I think it is great that this is being studied; I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.
loeg 3 hours ago
This comment is a two-sentence summary of the six-sentence abstract at the very top of the linked article. (Though the paper claims 9%, not 10% -- and quotes it to three sig figs, so rounding up to 10% is inappropriate.)
Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible ARM's TSO mode isn't optimal, yielding weaker relative performance than a TSO-native platform like x86?
> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.
So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
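To make "cache-line bouncing" concrete, a minimal sketch (mine, not the paper's; 64-byte cache lines assumed): two threads do identical, logically independent work, but in the first layout both counters share one cache line, so every write yanks the line away from the other core. Padding each counter onto its own line removes the contention without changing the logic.

    // False-sharing demo: same work, two layouts.
    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <thread>

    struct SharedLine {                  // both counters on one cache line
        std::atomic<long> a{0}, b{0};
    };

    struct PaddedLines {                 // one counter per 64-byte line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename T>
    void run(T& c, const char* label) {
        auto bump = [](std::atomic<long>& x) {
            for (long i = 0; i < 50000000; ++i)
                x.fetch_add(1, std::memory_order_relaxed);
        };
        std::thread t1(bump, std::ref(c.a)), t2(bump, std::ref(c.b));
        t1.join(); t2.join();
        std::printf("%s: a=%ld b=%ld\n", label, c.a.load(), c.b.load());
    }

    int main() {
        SharedLine s;
        PaddedLines p;
        run(s, "shared line");    // line ping-pongs between cores
        run(p, "padded lines");   // no false sharing
    }

A TSO machine has to stall while it claims the bouncing line for each write; a WO machine can hide more of that latency by reordering around it, which is the gap the paper is measuring.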
> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.
Yeah, ok, so the huge magnitude observed is due to some really poor program design.
> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
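Concretely (again my sketch, not the paper's): a pointer chase serializes every load behind the one before it, so no amount of hardware reordering, under any memory model, can overlap the misses. Independent streaming loads are the opposite case, and that is where WO's extra freedom can pay off.

    #include <cstdio>

    struct Node { Node* next; long payload; };

    // Dependent loads: each address comes from the previous load, so the
    // core cannot overlap the misses regardless of memory model.
    long chase(const Node* n) {
        long sum = 0;
        for (; n != nullptr; n = n->next) sum += n->payload;
        return sum;
    }

    // Independent loads: many can be in flight at once; this is where a
    // weaker model's reordering freedom has room to help.
    long stream(const long* a, long len) {
        long sum = 0;
        for (long i = 0; i < len; ++i) sum += a[i];
        return sum;
    }

    int main() {
        Node c{nullptr, 3}, b{&c, 2}, a{&b, 1};
        long arr[] = {1, 2, 3};
        std::printf("chase=%ld stream=%ld\n", chase(&a), stream(arr, 3));
    }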
Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
ip26 39 minutes ago
I’m not an expert… but it seems like it could be even simpler than program design. They note false sharing occurs because data isn’t cache-line aligned. Yet when compiling for ARM, that’s not a big deal due to WO. When targeting x86, you would hope the compiler would work hard to align them! So out-of-the-box compiler behavior could be crucial. Are there extra flags that should be used when targeting ARM-TSO?
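As far as I know there is no general compiler switch that will auto-align shared data; the usual fix lives in the source. A minimal sketch, assuming C++17 and a standard library that ships std::hardware_destructive_interference_size (e.g. libstdc++ from GCC 12), which lets you ask for the target's cache-line size instead of hard-coding 64:

    // Each counter gets its own cache line; the interference-size constant
    // reflects the target's cache-line size as known to the compiler.
    #include <new>

    struct Counters {
        alignas(std::hardware_destructive_interference_size) long a = 0;
        alignas(std::hardware_destructive_interference_size) long b = 0;
    };

    static_assert(alignof(Counters) >= std::hardware_destructive_interference_size);

    int main() { Counters c; (void)c; }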
MBCook 3 hours ago
I’ve seen it argued before that the stronger x86 memory model is one of the things that hurts its performance.
It’s neat to see real numbers on it. The effect didn’t seem to be very big in most circumstances, which I guess would have been my guess.
Of course, Apple only just implemented TSO on the M1, while AMD/Intel have been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?
jchw 3 hours ago
I'm really curious how exactly they'll wind up phasing out Rosetta 2. They seem to be a bit coy about it:
> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.
However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
https://developer.apple.com/documentation/virtualization/run...
I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged the gap most of the way by then?
toast0 2 hours ago
> However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. It might include support for x86 games running through Wine, the Apple Game Porting Toolkit, etc.