I think this is super cool, but the combative tone doesn't bode well for relying on it in case I find issues or want to submit a PR.
ndgold 3 days ago [-]
Hey this is dope but I’m too stupid to understand how this is happening in the first place pls send help
dooglius 3 hours ago [-]
Bit-exact matches are nice, but they do not remove floating-point drift (i.e. the discrepancy between the computed value and the true mathematical value). That drift is unavoidable, because the mathematical result of most operations is not exactly representable as a floating-point value.
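For example (a generic Python sketch, nothing to do with the linked project): the operands are already rounded the moment they are parsed, so even a single addition cannot land on the exact mathematical value.

    # Both 0.1 and 0.2 are rounded on input, and the sum is rounded again,
    # so neither side of the comparison is the "true" 0.3.
    from decimal import Decimal

    x = 0.1 + 0.2
    print(x == 0.3)       # False
    print(Decimal(x))     # 0.3000000000000000444089209850062616169452667236328125
    print(Decimal(0.3))   # 0.299999999999999988897769753748434595763683319091796875

Determinism only buys you the same wrong-by-a-few-ULPs answer on every platform, which is still very useful for reproducibility.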
avadodin 3 hours ago [-]
What is going on here?
Is one of the implementations wrong (as in, not following the IEEE-specified result)? Both?
I mean, ideally, floating-point algorithms would have tolerance ranges they accept, but if they are going to expect bit-perfect results, I think the spec should cover those results, assuming the implementation is compliant.
I see the GitHub mentions sin(x). I have never dug so deep. Are trigonometric functions left as an exercise to the reader?
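(For what it's worth, here is a rough sketch of how one could measure how far a given platform's sin is from correctly rounded, using mpmath as a high-precision reference; this is just an illustration I'm assuming, not anything from the repo.)

    # IEEE 754 requires correct rounding for +, -, *, /, sqrt, but only
    # recommends it for transcendentals like sin, so two compliant libms
    # may legitimately differ in the last bit(s). Measure the gap in ULPs.
    import math
    import mpmath

    mpmath.mp.prec = 200  # plenty of guard bits relative to double precision

    def sin_ulp_error(x: float) -> float:
        approx = math.sin(x)                 # whatever this platform's libm returns
        exact = mpmath.sin(mpmath.mpf(x))    # ~200-bit reference value
        return float((mpmath.mpf(approx) - exact) / math.ulp(approx))

    for x in (0.5, 1e6, 1e15):
        print(x, sin_ulp_error(x))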
varispeed 3 hours ago [-]
Interesting. I was renting an H100, and it always produced different results than running on an M1 or a 5080.
In the end I was not able to train anything on the H100, as my models were collapsing quickly, whereas on the M1 or the 5080 the exact same models with the same params trained just fine. For instance, on the H100 a model would be dead at epoch 100, whereas the same model would train fine for 50,000 epochs and beyond on the M1 or 5080.
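In case it helps anyone debugging something similar, this is the sort of standard PyTorch determinism boilerplate I'd reach for (an assumed setup, not from the linked project). It pins a single device's run, though it still won't make an H100 match an M1 bit-for-bit, since the kernels themselves differ.

    # Standard reproducibility knobs for PyTorch on CUDA (assumed setup).
    # Makes one device's run repeatable; cross-hardware drift remains.
    import os, random
    import numpy as np
    import torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS

    def seed_everything(seed: int = 0) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.use_deterministic_algorithms(True)  # raise on nondeterministic kernels
        torch.backends.cudnn.benchmark = False    # disable run-dependent autotuning

    seed_everything(42)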