▲Step 3.5 Flash – Open-source foundation model with exceptional efficiency (static.stepfun.com)

78 points by kristianp 8 hours ago | 26 comments

anentropic 5 minutes ago [-]

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks
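A rough way to put numbers on that doubt, as a sketch only: assume (a simplification the benchmark itself does not make) that each task is an independent trial with a 51% success rate. The chance of clearing several tasks in a row then falls off quickly. The function name is illustrative.

```python
def streak_probability(per_task_success: float, n_tasks: int) -> float:
    """Probability of n_tasks consecutive successes, assuming independent tasks."""
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be in [0, 1]")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return per_task_success ** n_tasks


if __name__ == "__main__":
    # 0.51 is the Terminal-Bench 2.0 score quoted above; independence is assumed.
    for n in (1, 2, 5, 10):
        print(f"{n:>2} tasks in a row: {streak_probability(0.51, n):.4f}")
```

Real tasks are not independent and vary in difficulty, so this only illustrates why a 51% pass rate and "unwavering stability" sit awkwardly together.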

janalsncm 39 minutes ago [-]

Number of params isn’t really the relevant metric imo. Top models don’t support local inference. More relevant is tokens per dollar or per second.

dakolli 13 minutes ago [-]

It's an open-source model, so why wouldn't it be relevant for people who want to self-host?

culi 2 hours ago [-]

It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps in certain benchmarks, but when you dig into it, those jumps are almost always paired with a massive increase in token usage (ahem Google Deep Think ahem). For AI to be truly transformational, it needs to solve the electricity problem.

tankenmate 1 hour ago [-]

And not just token usage, expensive token usage; when it comes to tokens/joule not all tokens are equal. Efficient use of MoE architectures does have an impact on tokens/joule and tokens/sec.
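For anyone who wants to compare setups on this metric, here is a minimal sketch of the unit conversion. It relies only on 1 W = 1 J/s; the throughput and power figures are made-up placeholders, not measurements of any model, and the function names are illustrative.

```python
def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """1 watt is 1 joule per second, so (tokens/s) / (J/s) gives tokens/J."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def tokens_per_kwh(tokens_per_joule_value: float) -> float:
    """1 kWh is 3.6 million joules."""
    return tokens_per_joule_value * 3.6e6


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    setups = {
        "hypothetical setup A": (40.0, 400.0),  # (tokens/s, watts)
        "hypothetical setup B": (80.0, 400.0),
    }
    for name, (tps, watts) in setups.items():
        tpj = tokens_per_joule(tps, watts)
        print(f"{name}: {tpj:.3f} tokens/J, {tokens_per_kwh(tpj):,.0f} tokens/kWh")
```

At the same power draw, doubling tokens/sec doubles tokens/joule, which is why tokens/sec and tokens/joule tend to move together on fixed hardware.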

mohsen1 48 minutes ago [-]

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and running it continuously takes a lot of money.

Most "live" benchmarks are not run often enough with recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.

cbracketdash 21 minutes ago [-]

Help us out with Terminal Bench 3.0!

https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...

kristianp 8 hours ago [-]

This model was released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Edit: there are 4-bit quants that can be run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.

[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
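A back-of-the-envelope check on why a 4-bit quant fits in 128GB, as a sketch: it counts weights only and assumes exactly 4 bits per weight, ignoring quantization scales, KV cache, activations, and OS overhead, so real usage will be higher. The function name is illustrative; the parameter counts come from the quote above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 196e9   # total parameters, per the quoted description
ACTIVE_PARAMS = 11e9   # parameters activated per token, per the same quote

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

At 4 bits per weight the weights come to roughly 98 GB, which leaves some headroom on a 128GB machine; at 16 bits they would need roughly 392 GB.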

Alifatisk 24 minutes ago [-]

> Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Does this really mean anything? I, for example, tend to ignore certain benchmarks that focus on agentic tasks because that is not my use case. Instruction following, long-context reasoning and avoiding hallucinations carry more weight for me.

danieltanfh95 5 hours ago [-]

Hallucinates like crazy. Use with caution. I tested it with simple prompts like "Find me championship decks for X pokemon" and "How does Y deck work". Opus 4.6, DeepSeek and Kimi all performed well, as expected.

mickeyp 55 minutes ago [-]

I mean, is it possible the latter models used search? Not saying StepFun's perfect (it is not). Gemini especially, and unsurprisingly, uses search a lot, and it is ridiculously fast, too.

prmph 33 minutes ago [-]

Interesting.

Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?

jstummbillig 28 minutes ago [-]

Have you heard of Mistral? I would consider Mistral major, albeit not frontier.

sinenomine 43 minutes ago [-]

Works impressively well with pi.dev minimal agent.

wmf 7 hours ago [-]

That reverse x axis sure is confusing.

esafak 4 hours ago [-]

I imagine they thought they'd look better this way. I don't think they do.

SilverElfin 5 hours ago [-]

So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.

kristopolous 1 hour ago [-]

They've been around a couple years. This is the first model that has really broken into the anglosphere.

Keep tabs on aihubmix, the Chinese OpenRouter, if you want to stay on top of the latest models. They keep track of things like Baichuan, Doubao, BAAI (Beijing Academy), Meituan, 01.AI (Yi), Xiaomi, etc.

Much larger Chinese coverage than OpenRouter.

Havoc 50 minutes ago [-]

>first model that has really broken into the anglosphere.

Do you know of a couple of interesting ones that haven't yet?

kristopolous 45 minutes ago [-]

Doubao (ByteDance) Seed models are interesting

Keep your eye on Baidu's Ernie https://ernie.baidu.com/

Artificial Analysis is generally on top of everything

https://artificialanalysis.ai/leaderboards/models

Those two are really the new players

Nanbeige, which they haven't benchmarked, just put out a shockingly good 3B model https://huggingface.co/Nanbeige - specifically https://huggingface.co/Nanbeige/Nanbeige4.1-3B

You have to tweak the hyperparameters like they say, but I'm getting quality output, commensurate with maybe a 32B model, in exchange for a huge thinking lag

It's the new LFM 2.5

0x1997 5 hours ago [-]

https://en.wikipedia.org/wiki/StepFun

SilverElfin 5 hours ago [-]

Thanks. Do they sell any of these products today or is it more like research? I am not able to find anything relating to pricing on their website. Just a chatbot.

0x1997 4 hours ago [-]

Pricing can be found on their docs website: https://platform.stepfun.ai/docs/en/pricing/details

deaux 5 hours ago [-]

Might want to give it a search.

agentifysh 4 hours ago [-]

What country is behind this one?

personalcompute 3 hours ago [-]

Step 3.5 Flash was made by Chinese company StepFun - https://en.wikipedia.org/wiki/StepFun
