As someone who works in the area, this provides a decent summary of the most popular research items. The most useful and impressive part is the set of open problems at the end, which just about covers all of the main research directions in the field.
The skepticism I'm seeing in the comments really highlights how little of this work is trickling down to the public, which is very sad to see. While the theory offers few mathematical mechanisms for inferring optimal network design so far (mostly because just trying stuff empirically is often faster than going through the theory, so properties are more commonly inferred retroactively), the question "why do neural networks work better than other models?" is getting pretty close to a solid answer. Problem is, that was never the question people were really interested in, so the field now has to figure out what questions to ask next.
chadcmulligan 10 hours ago [-]
"why do neural networks work better than other models?" That sounds really interesting - any references (for a non specialist)?
the better question is why does gradient descent work for them
jmalicki 8 hours ago [-]
The properties that the universal approximation theorem proves are not unique to neural networks.
Any model using an infinite-dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient-boosted decision trees, etc., has the same property (though proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
hodgehog11 7 hours ago [-]
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
jimmypk 2 hours ago [-]
@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
hackinthebochs 2 hours ago [-]
>Do you think grokking is consistent with implicit regularization as compression
Pretty sure it's been shown that grokking depends on regularization, such as weight decay, that pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
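To make the "compression" framing concrete, here's a tiny numpy sketch (purely illustrative, not tied to any grokking setup): a weight-decay-style (L2) step shrinks every parameter but never zeroes any of them, while the L1 proximal step sets small parameters to exactly zero, i.e. it literally spends fewer bits.

```python
import numpy as np

def l2_shrink(w, lam):
    # One ridge-style decay step: scales every weight toward zero,
    # but never makes any of them exactly zero.
    return w / (1.0 + lam)

def l1_prox(w, lam):
    # Soft-thresholding, the proximal operator of the L1 penalty:
    # weights with |w| <= lam are set to exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.2, 0.05, -1.5, 0.01])
print(l2_shrink(w, 0.5))  # all entries shrink, none are zero
print(l1_prox(w, 0.5))    # small entries become exactly zero
```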
soVeryTired 60 minutes ago [-]
Whenever people bring this up I like to remind them that linear interpolation is a universal function approximator.
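Easy to check numerically, too; a quick sketch with np.interp (the knot counts are arbitrary). The error of a piecewise-linear fit falls off like the square of the knot spacing, so you can approximate any continuous function as closely as you like:

```python
import numpy as np

f = np.sin
xs = np.linspace(0, 2 * np.pi, 2000)   # dense evaluation grid

def max_interp_error(n_knots):
    # Piecewise-linear interpolation through n_knots samples of f.
    knots = np.linspace(0, 2 * np.pi, n_knots)
    return np.max(np.abs(np.interp(xs, knots, f(knots)) - f(xs)))

# Error is O(h^2) in the knot spacing h: refine the grid, error vanishes.
print(max_interp_error(10))
print(max_interp_error(100))
print(max_interp_error(1000))
```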
NooneAtAll3 5 hours ago [-]
Universal approximation is like saying that a problem is computable
sure, that gives some relief - but it says nothing in practice, unlike e.g. which side of the P/NP divide the problem falls on
ngruhn 4 hours ago [-]
> unlike e.g. which side of the P/NP divide the problem falls on
Actually the P/NP divide is a similar case, in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow, while an NP-complete problem can often be solved just fine. E.g. SAT problems are routinely solved at scale.
imtringued 2 hours ago [-]
An NP-hard problem can contain subproblems that are not worst-case instances.
It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate; you can't do that for Turing machines in general, but that doesn't stop you from running a pushdown-automaton algorithm, with its decidable termination, on the Turing machine.
fc417fc802 7 hours ago [-]
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
hodgehog11 7 hours ago [-]
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there's a lot of things one can do to help ensure it works.
anvuong 6 hours ago [-]
A funny thing is, in very high-dimensional space, like millions and billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this: to be stuck in a local minima in 2D, you only need 2 gradient components to be zero; in higher dimension, you'd need every single one of them, millions upon millions of them, to be all zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random, but rather noisy; the chance that you somehow walk into a local minima is very very low, unless that is a "really good" local minima, in a sense that it dominates all other local minima in its neighborhood.
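One hedged way to see the intuition numerically: model the Hessian at a critical point as a random symmetric matrix. A minimum requires every eigenvalue to be positive, and that becomes vanishingly rare as the dimension grows, so most critical points are saddles. (This is only a caricature of a real loss landscape.)

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(dim, trials=2000):
    # Fraction of random symmetric "Hessians" whose eigenvalues are
    # all positive, i.e. that would make a critical point a minimum.
    hits = 0
    for _ in range(trials):
        m = rng.standard_normal((dim, dim))
        h = (m + m.T) / 2
        if np.all(np.linalg.eigvalsh(h) > 0):
            hits += 1
    return hits / trials

print(frac_positive_definite(2))   # a nontrivial fraction in 2D
print(frac_positive_definite(10))  # essentially zero already in 10D
```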
hodgehog11 4 hours ago [-]
You are essentially correct, which is why stochastic gradient optimizers induce a low-sharpness bias. However, there is an awful lot more that complicates things. There are plenty of wide minima that it can get stuck in far away from where people typically initialise, so the initialisation scheme proves extremely important (but is mostly done for you).
Perhaps more important, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.
This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.
leoc 4 hours ago [-]
(‘Minimum’ is the singular of ‘minima’.)
charcircuit 4 hours ago [-]
>you'd need every single one of them, millions upon millions of them, to be all zero
If they were all correlated with each other that does not seem far fetched.
imtringued 2 hours ago [-]
Ok but it's already known that you shouldn't initialize your network parameters to a single constant, and should instead initialize the parameters with random numbers.
charcircuit 1 hours ago [-]
The model can converge towards such a state even if randomly initialized.
appplication 7 hours ago [-]
Not a mathematician so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive number of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need to have a “well” at least as large as your step size to reasonably capture your descent.
E.g. you could land perfectly on a local minima but you won’t stay there unless your step size is minute or the minima is quite substantial.
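A toy sketch of that intuition (the loss here is invented purely for illustration: a broad bowl with shallow high-frequency ripples). A tiny step size gets captured by the first ripple it meets, far from the global minimum at 0, while a step larger than the ripples' basins can't settle in the sharp ones (where it ends up instead is sensitive to the details):

```python
import numpy as np

# Gradient of the toy loss f(x) = x**2 + 0.05*cos(40*x)
def grad(x):
    return 2 * x - 2 * np.sin(40 * x)

def descend(lr, steps=20000, x0=2.0):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_tiny = descend(lr=0.001)  # captured by the first ripple, near x ~ 0.99
x_big = descend(lr=0.1)     # too coarse to be captured by the sharp ripples
print(x_tiny, x_big)
```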
fc417fc802 6 hours ago [-]
I believe what was meant was that assuming local minima of a sufficient size to capture your probe, given a sufficiently high density of those, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.
niksmather 5 hours ago [-]
Do neural networks work better than other models? They can definitely model a wider class of problems than traditional ML models (images being the canonical example). However, I thought that where a like-for-like comparison is possible they tend to do worse than gradient boosting.
hodgehog11 3 hours ago [-]
Gradient boosting handles tabular data better than neural networks, often because the structure is simpler and dealing with the noise becomes more of an issue. You can do like-for-like comparisons between them on unstructured data like images, audio, video, and text, and a well-designed NN will mop the floor with gradient boosting. This is because to handle that sort of data you need to encode some form of bias around the expected patterns in the data (e.g. local, convolution-like structure), or you won't get anywhere. Both CNNs and transformers do this.
energy123 3 hours ago [-]
Would you agree/disagree with the following:
- It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.
- Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.
r0ze-at-hn 3 hours ago [-]
We’re in a strange era where the Information-Theoretic foundations of deep learning are solidifying. The 'Why' is largely solved: it’s the efficient minimization of irreversible information loss relative to the noise floor. There is so much waste scaling models bigger and bigger when the math points to how to do it much more efficiently. One can take a great 70B model and have it run in only ~16GB with no loss in capability and the ability to keep training, but the last few years funding only went for "bigger".
As you noted, the industry has moved the goalposts to agency and long-horizon persistence. The transition from building 'calculators that predict' to 'systems that endure' is a non-equilibrium thermodynamics problem. There are mathematical formulas and basic laws at play here that apply to AI just as much as they apply to other systems. Ironically, it is the same math. The same thing that lets a signal persist in a model will let agents persist.
This is my specific niche. I study how things persist. It's honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned. I have a doc I use to teach folks how the math works and how to apply it to their domain, and it's fun giving it to folks who then stop guessing and know exactly how to improve the persistence of what they are working on. The idea of "how many hours we can have a model work" is so cute compared to the right questions.
pdhborges 3 hours ago [-]
Can you share that document?
cookiengineer 7 hours ago [-]
In my opinion current research should focus on revisiting older concepts to figure out if they can be applied to transformers.
Transformers are superior "database" encodings, as the hype about LLMs points out, but there have been promising ML models focusing on memory components for their niche use cases, which could be promising concepts if we could make them work with attention matrices and/or use the frequency projection idea on their neuron weights.
The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims in the DNC memory-related parts. Back at the time the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now.
> why do neural networks work better than other models
The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters.
hodgehog11 8 hours ago [-]
No it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If this was true, then the models with "infinitely many" parameters would be amazing. What about just training a gigantic two-layer network? There is a huge amount of work trying to engineer training procedures that work well.
The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
skydhash 7 hours ago [-]
> The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and works just as well for neural networks. The key thing here is approximation.
hodgehog11 4 hours ago [-]
No, it is relatively few words to quickly touch on several different concepts that go well beyond basic approximation theory.
I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish for determining tumor presence compared to if I trained a convolutional neural network on the same data and problem _and_ perfectly fit the data.
I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.
bubblyworld 7 hours ago [-]
That isn't what they are saying at all, lol.
tacet 8 hours ago [-]
Also the massive amount of human work done on them, which wasn't done before.
Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.
geokon 3 hours ago [-]
Normally more parameters lead to overfitting (like fitting a high-degree polynomial to a handful of points), but neural nets are for some reason not as susceptible to that and can scale well with more parameters.
That's been my understanding of the crux of the mystery.
Would love to be corrected by someone more knowledgeable though
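The classic polynomial version of that intuition is easy to reproduce; a quick numpy sketch (degrees, noise level, and grid chosen arbitrarily): degree 9 threads all 10 noisy points exactly, yet does worse between them than a modest degree-3 fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 5, 10)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(0.25, 4.75, 50)   # held-out points between the samples
y_test = np.sin(x_test)

def errs(degree):
    # Least-squares polynomial fit; returns (train MSE, test MSE).
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

# Degree 9 interpolates all 10 noisy points (near-zero train error)
# but oscillates between them; degree 3 fits train worse and test better.
print("degree 3:", errs(3))
print("degree 9:", errs(9))
```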
hodgehog11 3 hours ago [-]
This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and now more recently, in-context learning.
RyanShook 14 hours ago [-]
Here's where my understanding is missing: for decades the idea of neural networks existed with minimal attention. Then in 2017 "Attention Is All You Need" was released, and since then there has been an exponential explosion in deep learning. I understand that deep learning is accelerated by GPUs, but the concept of a transformer could have been used on much slower hardware much earlier.
pash 14 hours ago [-]
The inflection point was 2012, when AlexNet [0], a deep convolutional neural net, achieved a step-change improvement in the ImageNet classification competition.
After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.
The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beats better models that don’t scale.
Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs tens of millions of neurons to do what a mouse does.
Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)
On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tends to imply a complex environment, and complexity by its nature requires a large number of parameters to model.
Thanks for posting a thorough and accurate summary of the historical picture. I think it is important to know the past trajectory to extrapolate to the future correctly.
For a bit more context: before 2012 most approaches were based on hand-crafted features + SVMs that achieved state-of-the-art performance on academic competitions such as Pascal VOC, and neural nets were not competitive on the surface. Around 2010 Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate by half in 2012, leading major labs to switch to deeper neural nets. The success seems to be a combination of a large enough dataset + GPUs to make training time reasonable. The architecture is a scaled-up version of the ConvNets of Yann LeCun, tying into the bitter lesson that scaling is more important than complexity.
coppsilgold 13 hours ago [-]
Comparing Deep Learning with neuroscience may turn out to be erroneous. They may be orthogonal.
The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.
Deep learning relies on end-to-end loss optimization, something much more powerful than anything the brain could be doing. But the end-to-end setup is also restricting: credit assignment is a big problem.
Consider how crazy the generative diffusion models are, we generate the output in its entirety with a fixed number of steps - the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.
Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>
ACCount37 2 minutes ago [-]
Modern systems like Nano Banana 2 and ChatGPT Images 2.0 are very close to "just use Photoshop directly" in concept, if not in execution.
They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.
This allows for "variable depth" in practice. Composition uses previous images, which may have been generated from scratch, or from previous images.
jvanderbot 13 hours ago [-]
If you think a 2 year old is doing deep learning, you're probably wrong.
But if you think natural selection was providing end to end loss optimization, you might be closer to right. An _awful lot_ of our brain structure and connectivity is born, vs learned, and that goes for Mice and Men.
ACCount37 2 minutes ago [-]
Why not both? A pre-trained LLM has an awful lot of structure, and during SFT, we're still doing deep learning to teach it further.
imtringued 2 hours ago [-]
I don't know what you mean by end to end loss optimization in particular, but if you mean something that involves global propagation of errors e.g. backpropagation you are dead wrong.
Predictive coding is more biologically plausible because it uses local information from neighbouring neurons only.
roenxi 9 hours ago [-]
> If only we could train a model to just use Photoshop directly, but we can't.
It is probably coming, I get the impression - just from following the trend of the progress - that internal world models are the hardest part. I was playing with Gemma 4 and it seemed to have a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning; starting part-way through where it was already at house #2. It figured it out but it seemed to be working very hard with the concept to a degree that was really a bit comical.
It looks like that issue is solving itself as text & image models start to unify and they get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that it seems pulling an image fully formed from the aether is statistically easier than putting it together piece by piece.
antonvs 5 hours ago [-]
> If only we could train a model to just use Photoshop directly, but we can't.
What kind of sadist would wish this on an intelligent entity?
cdavid 8 hours ago [-]
Indeed. I would add a third factor to compute and datasets: the lego-like aspect of NNs that enabled scalable OSS DL frameworks.
I did some ML in the mid 2000s, and it was a PITA to reuse other people's code (when available at all). You had some well-known libraries for SVMs; for HMMs you had to use HTK, which had a weird license; and otherwise reproducing experiments required you to reimplement stuff yourself.
The late 2000s had a lot of practical innovation that democratized ML: Theano and then tf/keras/pytorch for DL, scikit-learn for ML, etc. That ended up being important because you need a lot of tricks to make this work on top of the "textbook" implementation. E.g. if you implement the EM algorithm for GMMs, you need to do it in log space to avoid underflow; DL as well (Glorot and co. initialization, etc.).
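The log-space trick mentioned here is the log-sum-exp identity; a minimal sketch of why it's needed (the numbers are arbitrary, but log-likelihoods this small are routine in EM for mixture models):

```python
import numpy as np

log_p = np.array([-1000.0, -1001.0, -1002.0])

# Naive: exp underflows to 0, and log(0) = -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_p)))

# Stable: factor out the max before exponentiating.
def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

print(naive)             # -inf
print(logsumexp(log_p))  # ~ -999.59
```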
alasdair_ 7 hours ago [-]
I think your post may have more acronyms than any other post I have ever read on hn. Do you have a guide to which specific things you are talking about with each acronym? Deep Learning and Machine Learning are obvious but some of the others I can’t follow at all - they could be so many different things.
AgentMatt 4 hours ago [-]
NN - neural networks
OSS DL frameworks - open source deep learning frameworks
PITA - pain in the ass
SVM - support vector machines
HMM - hidden Markov model
EM - expectation maximization
GMM - Gaussian mixture model
HTK - Hidden Markov Model Toolkit
ButlerianJihad 7 hours ago [-]
I think he maintains pinball machines and jukeboxes for a chain of Greek restaurants
jesseab 8 hours ago [-]
Remember watching Alec Radford's Theano tutorial and feeling like I had found literal gold.
Sohakes 12 hours ago [-]
> but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.
I feel like you are downplaying the importance of architecture. I never read the bitter lesson, but I have always heard it more as a comment on embedding knowledge into models instead of making them scale with data. We know algorithmic improvement is very important to scale NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith...). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs; some are really worse in all aspects. What I agree with is just that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing with the bitter lesson makes less sense.
hodgehog11 12 hours ago [-]
> Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment.
Real intelligence deals with information over a ludicrous number of size scales. Simple models effectively blur over these scales and fail to pull them apart. However, extra compute is not enough to do this effectively, as nonparametric models have demonstrated.
The key is injecting a sensible inductive bias into the model. Nonparametric models require this to be done explicitly, but this is almost impossible unless you're God. A better way is to express the bias as a "post-hoc query" in terms of the trained model and its interaction with the data. The only way to train such a model is iteratively, as it needs to update its bias retroactively. This can only be accomplished by a nonlinear (in parameters) parametric model that is dense in function space and possesses parameter counts proportional to the data size. Every model we know of that does this is called "a neural network".
getnormality 13 hours ago [-]
> I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.
Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?
musebox35 12 hours ago [-]
I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically, the current architecture is the one that best lets training converge under gradient descent dynamics. Potentially, a different form might be possible, and even beneficial, once the core learning task is completed. Also, the requirements of iterated and continuous learning might lead to a completely different approach.
> The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
I'd thought it was some issue with training where older math didn't play nice with having too many layers.
etiam 12 hours ago [-]
Sigmoid-type activation functions were popular, probably for their bounded activity and some measure of analogy to biological neuron responses. They work, but the gradient feedback scales poorly outside their most dynamic span.
My understanding of the development is that greedy layer-wise pretraining with RBMs or autoencoders created an initialization state from which the optimization could cope even with more layers; and then, once it was proven that this could work, analysis of why led to changes such as new initialization heuristics, rectified linear activations, and eventually normalizations, so that the pretraining was usually not needed any more.
One finding was that supervised training with the old arrangement often does work on its own, if you let it run much longer than anyone could reasonably afford to wait on mere speculation with the CPUs of the 80s-00s. It has to work its way to a reasonably optimizable state through a chain of poorly scaled gradients first, though.
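The poor gradient scaling is easy to quantify: the sigmoid's derivative never exceeds 0.25, so backprop through many sigmoid layers multiplies many factors of at most 0.25, while ReLU contributes a factor of 1 on its active side. A best-case sketch (deliberately ignoring the weight matrices, which change the picture further):

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)   # maximum value is 0.25, at z = 0

depth = 30
# Backprop multiplies one activation derivative per layer. Even in the
# sigmoid's best case (all pre-activations at 0), thirty factors of 0.25
# annihilate the gradient signal; ReLU's active side passes factors of 1.
sigmoid_chain = sigmoid_deriv(np.zeros(depth)).prod()
relu_chain = np.ones(depth).prod()
print(sigmoid_chain)  # ~8.7e-19
print(relu_chain)     # 1.0
```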
mystraline 12 hours ago [-]
I've yet to see a model that trains AND applies the trained data in real time. That's basically every living being, from bacteria to plants to mammals.
Even PID loops have a tuning phase separate from the control phase.
seanhunter 4 hours ago [-]
That’s not a meaningful technical obstacle. If you wanted to, you could just take the output of the model at each iteration of the training phase and use it to perform (badly) whatever task the model is intended to do.
The reason no one does this is that you don’t have to, and you’ll get much better results if you fully train first and then apply the best model you have to the problem. Biological systems don’t have that luxury.
cgearhart 14 hours ago [-]
A much earlier major win for deep learning was AlexNet for image recognition in 2012. It dominated the competition and within a couple years it was effectively the only way to do image tasks. I think it was Jeremy Howard who wrote a paper around 2017 wondering when we’d get a transfer learning approach that worked as well for NLP as convnets did for images. The attention paper that year didn’t immediately dominate. The hardware wasn’t good enough and there wasn’t consensus on belief that scale would solve everything. It took like five more years before GPT3 took off and started this current wave.
I also think you might be discounting exactly how much compute is used to train these monsters. A single 1 GHz processor would take on the order of 100,000,000 years to train something in this class. Even with on the order of 25k GPUs, training GPT-3-size models takes a couple of months. The anemic RAM on GPUs a decade ago (I think we had K80 GPUs with 12GB, vs hundreds of GBs on H100/H200 today) meant it was actually completely impossible to train a large transformer model prior to the early 2020s.
I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
porcoda 14 hours ago [-]
As others pointed out, the explosion of interest started with the deep convolutional networks that were applied to image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as uninteresting. When I took a course on them around the year 2000, that was the attitude most people took. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with the subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple of years earlier; I began to see more talks at workshops about NNs that weren't being dismissed anymore, probably starting around 2008/09.
bobbruno 2 hours ago [-]
I played with NNs in the late 80s/early 90s, with little more than a copy of Hinton's paper, a PC and a C compiler. Obviously, I got no practical results. But I got an intuition for how they worked and what they could potentially do.
Cut to 2008-9, and I started to see smartphones, grid (then cloud) computing and social networks emerging. My MBA dissertation, finished in 2011, was about how that would change the world, because the requirements for meaningful AI were coming along: data and compute. The theory was already there: Hinton, LeCun, Schmidhuber, etc.
That got me back into the data science field, after years working in data engineering. Too bad I lived in Brazil back then and couldn't find a way to join the emerging scene in California and other top places. I'd be rich now...
srean 13 hours ago [-]
> NNs were largely dismissed
I agree with your larger point but dismissed is rather too strong. They were considered fiddly to train, prone to local minima, long training time, no clear guidelines about what the number of hidden layers and number of nodes ought to be. But for homework (toy) exercises they were still ok.
In comparison, kernel methods gave a better experience over all for large but not super large data sets. Most models had easily obtainable global minimum. Fewer moving parts and very good performance.
It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple -- (i) they cannot take advantage of more data after a point and start twiddling the 10th decimal place of some parameters, and (ii) they are expensive to train on very large data sets. So a bit of a double whammy. Well, there was a third: no hardware acceleration that could compare with GPUs.
Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user friendly way to increase their modeling capacity. We had a few ways of doing just that but they weren't great. We need a breakthrough to scale them to GPT sized data sets.
In a way DNNs are "design your own kernels using data", whereas kernels came in any color you liked provided it was black (yes, there were many types, but it was still a fairly limited catalogue; the killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines).
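To make "composing kernels" concrete, here's a minimal numpy sketch (my own toy, not from any particular paper): sums and elementwise products of valid kernels are again valid kernels, which is the basic building block any composition scheme has to preserve.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def poly_kernel(X, Y, degree=2):
    # Polynomial kernel: k(x, y) = (1 + <x, y>)^degree
    return (1.0 + X @ Y.T) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Sums and elementwise (Hadamard) products of valid kernels are valid
# kernels (Schur product theorem), so the composed Gram matrix stays
# positive semi-definite.
K = rbf_kernel(X, X) + rbf_kernel(X, X, gamma=0.1) * poly_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())  # >= 0 up to numerical noise
```

The pain srean describes is that while these closure rules are easy, choosing *which* composition fits the data is manual, whereas a DNN effectively learns the composition from the data.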
energy123 3 hours ago [-]
Deepmind solving Atari games was another big milestone around that time.
whateverboat 14 hours ago [-]
The same thing happened with matrices. We had matrices for 400 years, but the field of linear algebra, and especially numerical linear algebra, exploded only with the advent of computers.
In olden days, the correct way to solve a linear system of equations was the theory of minors (Cramer's rule). With the advent of computers, you suddenly had a huge theory of Gaussian elimination, Krylov subspaces, and what not.
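The contrast is easy to see in code. A rough sketch (the 3x3 system is just an illustrative example): Cramer's rule via determinants/minors next to textbook Gaussian elimination with partial pivoting.

```python
import numpy as np

def solve_cramer(A, b):
    # "Theory of minors": x_i = det(A_i) / det(A), where A_i is A with
    # column i replaced by b. Fine by hand for n = 2 or 3; with naive
    # cofactor expansion it costs O(n * n!), hopeless for large n.
    det_A = np.linalg.det(A)
    n = len(b)
    x = np.empty(n)
    for i in range(n):
        Ai = A.copy()
        Ai[:, i] = b
        x[i] = np.linalg.det(Ai) / det_A
    return x

def solve_gauss(A, b):
    # Gaussian elimination with partial pivoting: the O(n^3) post-computer way.
    A = A.astype(float)
    b = b.astype(float)
    n = len(b)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))              # pivot row
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]      # swap rows
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.empty(n)
    for i in range(n - 1, -1, -1):                       # back substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# 2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3  ->  x = (2, 3, -1)
A = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])
print(solve_cramer(A, b))
print(solve_gauss(A, b))
```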
embedding-shape 14 hours ago [-]
> I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier
But they don't give the same results at those smaller scales. People imagined it, but no one could have put it into practice because the hardware wasn't there yet. Simplified, LLMs are basically Transformers plus the additional idea of "and a shitton of data to learn from", and to make training feasible with that amount of data, you do need some capable hardware.
j_bum 8 hours ago [-]
This video gives a great overview of the history of the acceleration:
Although it’s focused on Ilya, some great history is covered.
HarHarVeryFunny 11 hours ago [-]
Without fast parallel hardware there would have been neither the incentive to design the Transformer, nor much benefit even if someone had come up with the design all the same!
The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPPT), and wanted to come up with a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was now being used to good effect for other types of model.
As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uszkoreit, who realized that language, while superficially sequential (hence a good match for RNNs), is really parallel + hierarchical. You can see this in linguists' sentence parse trees, where different branches reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchical parse tree. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, with the parallel processing being the whole point, since that could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.
Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!
The other point is that while the Transformer is a very powerful, general-purpose and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power existed to scale it up to massive size, then it would likely not have appeared so promising/interesting.
The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".
The time was right for the Transformer to appear when it did, designed to take advantage of recent GPU advances, building on top of this new attention architecture, and now with the compute power and dataset size available that it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
BigTTYGothGF 14 hours ago [-]
The modern neural net revival got kicked off long before 2017.
noosphr 14 hours ago [-]
AlexNet in 2012 is only 5 years earlier.
quicklywilliam 13 hours ago [-]
Agreed, there is probably a theoretical world where we got enough money/compute together and had this explosion happen earlier.
Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.
slashdave 13 hours ago [-]
Deep-learning hinges on highly redundant solution space (highly redundant weights), along with normalized weights (optimization methodology is commoditized). The original neural network work had no such concepts.
teekert 14 hours ago [-]
If you are in the radiology field it started “exploding” much earlier, with CNNs.
CamperBob2 14 hours ago [-]
> the concept of a transformer could have been used on much slower hardware much earlier.
It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.
LPisGood 12 hours ago [-]
> Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting
I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.
antonvs 5 hours ago [-]
I suspect all Minsky did was reinforce what many people were already thinking. I experimented with neural nets in the late 80s and they seemed super interesting, but also very limited. My sense at the time was that the general thinking was, they might be useful if you could approach the number of neurons and connections in the human brain, but that seemed like a very far off, effectively impossible goal at the time.
wslh 14 hours ago [-]
Don't underestimate the massive amounts of data you need to make those networks tick. Training would also have been impracticably slow, whether on GPUs or CPUs.
js8 1 hours ago [-]
Honestly, I found these two attempts at universal theory more interesting:
I am also interested in the connection with fuzzy logic - it seems that NNs can reason in a fuzzy way, but what are they doing, formally? For years, people tried to formalize fuzzy reasoning, but it looks like we don't care anymore.
I feel like NNs (and transformers) are the OOP (object-oriented programming) of ML. Really popular, works pretty well in practice, but nobody understands the fundamentals; there is a feeling it is a made up new language to express things expressible before, but hard to pinpoint where exactly it helps.
Animats 12 hours ago [-]
This is encouraging. The title is a bit much. "Potential points of attack for understanding what deep learning is really doing" would be more accurate but less attention-grabbing.
It might lead to understanding how to measure when a deep learning system is making stuff up or hallucinating. That would have a huge payoff. Until we get that, deep learning systems are limited to tasks where the consequences of outputting bullshit are low.
hodgehog11 12 hours ago [-]
> measure when a deep learning system is making stuff up or hallucinating
That's a great problem to solve! (Maybe biased, because this is my primary research direction). One popular approach is OOD detection, but this always seemed ill-posed to me. My colleagues and I have been approaching this from a more fundamental direction using measures of model misspecification, but this is admittedly niche because it is very computationally expensive. Could still be a while before a breakthrough comes from any direction.
Animats 8 hours ago [-]
> Could still be a while before a breakthrough comes from any direction.
It would be valuable enough that getting significant funding to work on it is probably possible. Especially with all the money being thrown at AI.
18al 4 hours ago [-]
Could you elaborate on what you mean by OOD detection seeming ill-posed?
sakesun 2 hours ago [-]
Hopefully the days of some vendors marketing AI as a divine entity will soon be over.
adzm 16 hours ago [-]
I'm only partially through this paper, but it's written in a very engaging and thoughtful manner.
There is so much to digest here but it's fascinating seeing it all put together!
AkiraHsieh 6 hours ago [-]
Theory becomes critical when you need to predict failure modes. A decision support system that 'just works' most of the time but fails silently on edge cases is worse than a simpler system with known limitations.
Understanding the bias mechanisms would help us know when a model is confident vs when it's just pattern matching. That distinction matters when the stakes are high.
sweezyjeezy 14 hours ago [-]
Deep learning works at a very high level because 'it can keep learning from more data' better than any other approaches. But without the 'stupid amount of data' that is available now, the architecture would be kind of irrelevant. Unless you are going some way to explain both sides of the model-data equation I don't feel you have a solid basis to build a scientific theory, e.g. 'why reasoning models can reason'. The model is the product of both the architecture and training data.
My fear is that this is as hopeless right now as explaining why humans or other animals can learn certain things from their huge amount of input data. We'll gain better empirical understanding, but it won't ever be fundamental computer science again, because the giga-datasets are the fundamental complexity not the architecture.
le-mark 13 hours ago [-]
> We argue complexity conceals underlying regularity, and that deep learning will indeed admit a scientific theory
That would be amazing, but personally I’m skeptical.
slashdave 11 hours ago [-]
There is an analogy with statistical mechanics. It's not crazy.
wwarner 7 hours ago [-]
Sane & interesting enough to have been disproven, by Boaz Barak iirc. Maybe not surprising since simulated annealing never achieved the results of gradient descent + backprop.
qsera 9 hours ago [-]
Wait a min. Does this paper say we don't know how back-propagation works?
spicyusername 8 hours ago [-]
No
avdelazeri 10 hours ago [-]
We must know, we will know.
CamperBob2 7 hours ago [-]
"Yeah, about that" - Gödel
4b11b4 15 hours ago [-]
wow.. this would be cool. Instead of just.. guessing "shapes"
NitpickLawyer 14 hours ago [-]
tbf, we've learned (ha!) more from smashing teeny tiny particles and "looking" at what comes out than from say 40 years of string theory. Sometimes doing stuff works, and the theory (hopefully) follows.
olalonde 55 minutes ago [-]
Same with electricity. We had Ohm's Law and were building electrical devices (e.g. telegraph, lightbulb) long before we discovered the electron.
metalman 4 hours ago [-]
I have a "theory" that will be wrong, but for a reasonable consideration I can "theorise" in the the other direction.
UltraSane 14 hours ago [-]
I think we need the equivalent of general relativity for latent spaces.
Is there not some Rice's Theorem equivalent for deep nets? After all they are machines that are randomly generated, so from classical computer science I would not presume a theory of "what do all deep nets do" to be prima facie logically possible. Nor do I see this explained in the objections section.
sanxiyn 3 hours ago [-]
As I understand, Rice's theorem does not apply because neural networks are not Turing-complete.
calf 1 hours ago [-]
I'm not sure I agree with that. Even technically, my PC is not Turing-complete because its hard drive is finite. Yet there is an informal sense in which Rice's Theorem is still relevant, in a kind of PC-abstraction sense, as we are all taught "virus checkers are strictly speaking impossible". This is a subtle point that needs further clarification from CS theorists, of which I am not one.
Neural networks in general are Turing-style models. Human brains are, in the abstract, Turing complete as well, as a simple example. LLMs run iteratively in an unbounded loop may be "effectively Turing complete" for the same simple reason.
Regardless, any theory purporting to be foundational ought to explicitly address this demarcation. Unless practitioners think computability and formal complexity are not scientific foundations for CS.
lou1306 7 minutes ago [-]
But most "normal" neural networks are feed-forward, so they are guaranteed to terminate in a bounded amount of time. This rules Turing completeness right out.
And even recurrent NNs can be "unfolded" into feed-forward equivalents, so they are not TC either.
You need a memory element the network can interact with, just like an ALU by itself is not TC, but a barebones stateful CPU (ALU + registers) is.
stared 13 hours ago [-]
Well, "There Will Be a Scientific Theory of Deep Learning" looks like flag planting - an academic variant of "I told you so!", but one that is a citation magnet.
A_D_E_P_T 12 hours ago [-]
It's actually really fascinating that there isn't a scientific theory of deep learning, especially as it's a product of human engineering as opposed to e.g. biology or particle physics.
hodgehog11 12 hours ago [-]
There are very good reasons why it took this long, but can be summed up as: everyone was looking in the wrong place. Deep learning breaks a hundred years of statistical intuition, and you don't move a ship that large quickly.
slashdave 11 hours ago [-]
There is, but it is fractured. I would equate this effort as more of a standardization of terms and language.
antonvs 4 hours ago [-]
Calling it “a product of human engineering” is misleading. Deep learning exploits principles we don’t fully understand. We didn’t engineer those principles. It’s not fundamentally any different than particle physics or biology, which are both similarly consequences of rules that we didn’t invent and can’t control.
I’m in the skeptical camp. Whatever theory eventually emerges will not be as solid as:
1. Theory of pattern recognition (as developed in 80s and 90s)
2. Theory of thermodynamics
3. Theory of gravity
4. Theory of electromagnetism
5. Theory of relativity
Etc. because of two reasons:
1. While half of deep learning is how humans construct the architecture of networks, the more important half relies on data. This data is a hodgepodge of scraped internet data (text and videos), books, user interactions etc., which really has no coherent structure
2. To extract meaningful insights from this much data, it takes models of enormous size, like 10B+ parameters. The thing about random systems (in the mathematical sense) is that it takes “something” of an order of magnitude bigger size to “understand” them, unless there are concentration-of-measure type mathematical niceties (as in thermodynamics), which I don’t think are there in these models and data. This is the same reason I don’t think humans will ever be able to “understand” human consciousness. It would take something an order of magnitude bigger than our own brains to do that.
Here is Terence Tao explaining this concentration stuff in another context: https://mathstodon.xyz/@tao/113873092369347147
I would love to be proven wrong though.
antonvs 4 hours ago [-]
The whole point about theory, though, is that simple rules can define complex phenomena. I don’t think anything you wrote fundamentally rules out the idea that we could find a theory of deep learning.
Pretty sure it's been shown that grokking depends on regularization that pushes model parameters towards zero (weight decay in the original experiments). This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
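The "push weights toward zero, get compression, get generalization" intuition can be sketched in a few lines of numpy (a toy of my own, not from the grokking papers): an overparameterized model with many zero-loss solutions drifts under weight decay toward the minimum-norm one.

```python
import numpy as np

# Toy: two weights, one training example w1 + w2 = 2, so infinitely many
# zero-loss solutions. Start at a high-norm interpolant; gradient descent
# with weight decay drifts toward the minimum-norm solution w = (1, 1).
w = np.array([2.0, 0.0])        # already fits the data exactly, norm = 2
lr, decay = 0.1, 0.01
for _ in range(10000):
    residual = w.sum() - 2.0    # data-fit gradient direction is (1, 1)
    grad = residual * np.ones(2) + decay * w
    w -= lr * grad

print(w)                        # close to (1, 1)
print(np.linalg.norm(w))        # smaller than the starting norm of 2
```

Same training loss before and after; the only thing the decay term changed is which of the equally-fitting solutions the optimizer ends up near.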
sure, that gives some relief - but it says nothing in practice unlike f.e. which side of P/NP divide the problem is on
Actually the P/NP divide is a similar case, in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow, while an NP-hard problem can often be solved for practical purposes. E.g. large SAT instances are routinely solved at scale.
It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate or not. You can't do that for Turing machines, but this doesn't stop you from running a pushdown-automaton algorithm, with its decidable termination, on a Turing machine.
Perhaps more important, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.
This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.
If they were all correlated with each other, that does not seem far-fetched.
E.g. you could land perfectly on a local minimum, but you won't stay there unless your step size was minute or the minimum was quite substantial.
- It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.
- Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.
As you noted, the industry has moved the goalposts to agency and long-horizon persistence. The transition from building 'calculators that predict' to 'systems that endure' is a non-equilibrium thermodynamics problem. There are math, formulas, and basic laws at play here that apply to AI just as much as they apply to other systems. Ironically it is the same math. The same thing that makes a signal persist in a model will make agents persist.
This is my specific niche. I study how things persist. It's honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned. I have a doc I use to teach folks how the math works and how to apply it to their domain, and it is fun giving it to folks who then stop guessing and know exactly how to improve the persistence of what they are working on. The idea of "how many hours we can have a model work" is so cute compared to the right questions.
Transformers are superior "database" encodings, as the hype about LLMs points out, but there have been promising ML models focusing on the memory part for their niche use cases, which could be promising concepts if we could make them work with attention matrices and/or use the frequency-projection idea on their neuron weights.
The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims in the memory-related parts of the DNC. Back at the time the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now.
[1] a fairly good implementation I found: https://github.com/joergfranke/ADNC
The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters.
The actual reason is the complex biases that arise from the interaction of network architectures and optimizers, and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
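As a toy illustration of what a "neural scaling law" looks like operationally (the constants here are made up, not measured from any real model): a power law L(N) = a * N^(-b) is a straight line on a log-log plot, so the exponent falls out of a linear fit.

```python
import numpy as np

# Hypothetical model sizes and synthetic losses following L(N) = a * N^(-b)
# with invented constants a = 4.2, b = 0.076.
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = 4.2 * N ** -0.076

# log L = log a - b * log N, so a degree-1 fit in log-log space
# recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(-slope)            # recovers b = 0.076
print(np.exp(intercept)) # recovers a = 4.2
```

This is the mechanical sense in which scaling laws are "laws": once the exponent is estimated from small runs, the loss of a much larger run is an extrapolation along that line.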
That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and works just as well for neural networks. The key thing here is approximation.
I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish for determining tumor presence compared to if I trained a convolutional neural network on the same data and problem _and_ perfectly fit the data.
I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.
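A small numpy sketch of the parent's point (my own toy, using kernel interpolation rather than a full GP): a mis-specified kernel model can fit every training point exactly and still be rubbish between them.

```python
import numpy as np

def rbf(X, Y, gamma=3000.0):
    # A deliberately too-narrow Gaussian kernel (the mis-specification).
    return np.exp(-gamma * (X[:, None] - Y[None, :]) ** 2)

X_train = np.linspace(0.0, 1.0, 12)
y_train = np.sin(2 * np.pi * X_train)

# Solving K @ alpha = y gives an interpolant that is EXACT on every
# training point: zero training error, a "perfect" fit.
alpha = np.linalg.solve(rbf(X_train, X_train), y_train)
train_err = np.abs(rbf(X_train, X_train) @ alpha - y_train).max()
print(train_err)   # ~0: perfect fit on the training data

# But between training points the narrow kernel collapses to ~0, so the
# perfectly-fitting model is useless away from the data it memorized.
X_test = np.linspace(0.0, 1.0, 200)
test_err = np.abs(rbf(X_test, X_train) @ alpha - np.sin(2 * np.pi * X_test)).max()
print(test_err)    # large
```

Which is the parent's claim in miniature: exact fit says nothing about generalization; the inductive bias (here the kernel shape, for images the convolutional structure) is what decides it.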
Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.
That's been my understanding of the crux of the mystery.
Would love to be corrected by someone more knowledgeable though.
After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.
The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beats better models that don’t scale.
Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs hundreds of millions of neurons to do what a mouse does.
Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)
On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tends to imply a complex environment, and complexity by its nature requires a large number of parameters to model.
0. https://en.wikipedia.org/wiki/AlexNet
1. https://en.wikipedia.org/wiki/Bitter_lesson
For a bit more context: before 2012 most approaches were based on hand-crafted features + SVMs that achieved state-of-the-art performance on academic competitions such as Pascal VOC, and neural nets were not competitive on the surface. Around 2010 Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate in half in 2012, leading major labs to switch to deeper neural nets. The success seems to be a combination of a large enough dataset + GPUs to make training time reasonable. The architecture is a scaled-up version of Yann LeCun's ConvNets, tying into the bitter lesson that scaling is more important than complexity.
The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.
Deep learning relies on end-to-end loss optimization, something much more powerful than anything the brain could be doing. But the end-to-end limitation is restricting; credit assignment is a big problem.
Consider how crazy the generative diffusion models are, we generate the output in its entirety with a fixed number of steps - the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.
Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>
They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.
This allows for "variable depth" in practice. Composition uses previous images, which may have been generated from scratch, or from previous images.
Predictive coding is more biologically plausible because it uses local information from neighbouring neurons only.
It is probably coming, I get the impression - just from following the trend of the progress - that internal world models are the hardest part. I was playing with Gemma 4 and it seemed to have a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning; starting part-way through where it was already at house #2. It figured it out but it seemed to be working very hard with the concept to a degree that was really a bit comical.
It looks like that issue is solving itself as text & image models start to unify and they get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that it seems pulling an image fully formed from the aether is statistically easier than putting it together piece by piece.
What kind of sadist would wish this on an intelligent entity?
I did some ML in the mid 2000s, and it was a PITA to reuse other people's code (when available at all). You had some well-known libraries for SVMs; for HMMs you had to use HTK, which had a weird license; and otherwise looking at experiments required you to reimplement stuff yourself.
The late 2000s had a lot of practical innovation that democratized ML: Theano and then tf/keras/pytorch for DL, scikit-learn for ML, etc. That ended up being important because you need a lot of tricks to make this work on top of a "textbook" implementation. E.g. if you implement the EM algorithm for a GMM, you need to do it in log space to avoid underflow; DL as well (Glorot and co.'s initialization, etc.).
PITA - pain in the ass
SVM - support vector machine
HMM - hidden Markov model
EM - expectation maximization
GMM - Gaussian mixture model
HTK - hidden Markov model toolkit
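The log-space trick mentioned above is worth seeing once. A minimal sketch with a hand-rolled logsumexp (in practice you'd use scipy.special.logsumexp):

```python
import numpy as np

def logsumexp(a, axis=None):
    # Stable log(sum(exp(a))): shift by the max so exp never overflows.
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze()

# A naive product of many small likelihoods underflows to exactly 0...
p = np.full(1000, 1e-5)
print(np.prod(p))       # 0.0 -- underflow

# ...but the same quantity is perfectly representable in log space.
log_p = np.sum(np.log(p))
print(log_p)            # -11512.9..., i.e. log of 1e-5000

# In an EM step for a GMM you normalize per-component responsibilities the
# same way: subtract logsumexp instead of dividing by a tiny sum.
log_joint = np.array([-1000.0, -1001.0, -1002.0])  # log p(x, z=k)
log_resp = log_joint - logsumexp(log_joint)
print(np.exp(log_resp)) # valid probabilities summing to 1
```

Exactly the kind of non-textbook detail that the shared libraries baked in, so nobody had to rediscover it per project.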
I feel like you are downplaying the importance of architecture. I never read the bitter lesson, but I have always heard it more as a comment on embedding knowledge into models instead of making them scale with data. We know algorithmic improvement is very important for scaling NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith...). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs; some architectures are really worse in all respects. What I agree with is that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing in terms of the bitter lesson makes less sense.
Real intelligence deals with information over a ludicrous number of size scales. Simple models effectively blur over these scales and fail to pull them apart. However, extra compute is not enough to do this effectively, as nonparametric models have demonstrated.
The key is injecting a sensible inductive bias into the model. Nonparametric models require this to be done explicitly, but this is almost impossible unless you're God. A better way is to express the bias as a "post-hoc query" in terms of the trained model and its interaction with the data. The only way to train such a model is iteratively, as it needs to update its bias retroactively. This can only be accomplished by a nonlinear (in parameters) parametric model that is dense in function space and possesses parameter counts proportional to the data size. Every model we know of that does this is called "a neural network".
Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?
https://news.ycombinator.com/item?id=41732853
I'd thought it was some issue with training where older math didn't play nice with having too many layers.
My understanding of the development is that greedy layer-wise pretraining with RBMs or autoencoders created an initialization state that the optimization could cope with even for more layers, and once it was proven that this could work, analysis of why led to changes such as new initialization heuristics, rectified linear activations, and eventually normalizations... so that the pretraining was usually not needed any more.
One finding was that supervised training with the old arrangement often does work on its own, if you let it run much longer than people could reasonably afford to wait around for on mere speculation, given the CPU speeds of the 80s--00s. It has to work its way to a reasonably optimizable state through a chain of poorly scaled gradients first, though.
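A sketch of what those initialization heuristics buy you (toy numbers, my own example): Glorot/Xavier-scaled weights keep signal magnitude roughly constant through many layers, whereas naively over-scaled weights blow up geometrically with depth.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot/Xavier init: Var[w] = 2 / (fan_in + fan_out), which keeps
    # activation and gradient magnitudes roughly constant across layers.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)

# Push a signal through 20 linear layers with Glorot-scaled weights:
x = rng.normal(size=(64, 256))
for _ in range(20):
    x = x @ glorot_uniform(256, 256, rng)
print(x.std())  # stays O(1)

# Same depth with 10x larger weights multiplies variance ~100x per layer:
y = rng.normal(size=(64, 256))
for _ in range(20):
    y = y @ (10.0 * glorot_uniform(256, 256, rng))
print(y.std())  # astronomically large
```

The mirror-image failure (weights too small) vanishes the signal instead, which is the "chain of poorly scaled gradients" the old nets had to slowly climb out of.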
Even PID loops have a tuning phase separate from the operating phase.
The reason no one does this is that you don't have to, and you get much better results if you first fully train and then apply the best model you have to whatever problem. Biological systems don't have that luxury.
I also think you might be discounting exactly how much compute is used to train these monsters. A single 1 GHz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs, training GPT-3-size models takes a couple of months. The anemic RAM on GPUs a decade ago (I think we had K80 GPUs with 12GB vs. hundreds of GBs on H100/H200 today) meant it was actually completely impossible to train a large transformer model prior to the early 2020s.
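A back-of-envelope check of that figure, assuming the commonly quoted ~3.14e23 FLOPs to train GPT-3 (that number is an outside assumption, not something from this thread):

```python
# Rough sanity check of the "~100 million years on one core" claim.
total_flops = 3.14e23          # assumed GPT-3 training compute estimate
flops_per_sec = 1e9            # one 1 GHz core at an optimistic 1 FLOP/cycle
seconds_per_year = 3.156e7

years = total_flops / flops_per_sec / seconds_per_year
print(round(years / 1e6))      # ~10 million years at a full FLOP per cycle
```

At one FLOP per cycle this lands around 10 million years; a scalar core doing dense matmuls sustains well under a FLOP per cycle in practice, which pushes the estimate toward the 100-million-year order of magnitude.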
I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
Cut to 2008-09, and I started to see smartphones, grid (then cloud) computing, and social networks emerging. My MBA dissertation, finished in 2011, was about how that would change the world, because the requirements for meaningful AI were coming along: data and compute. The theory was already there: Hinton, LeCun, Schmidhuber, etc.
That got me back into the Data Science field, after years working in Data Engineering. Too bad I lived in Brazil back then and couldn't find a way to join the emerging scene in California and other top places. I'd be rich now...
I agree with your larger point, but "dismissed" is rather too strong. They were considered fiddly to train: prone to local minima, with long training times and no clear guidelines about how many hidden layers and nodes there ought to be. But for homework (toy) exercises they were still OK.
In comparison, kernel methods gave a better experience overall for large but not super large data sets. Most models had an easily obtainable global minimum. Fewer moving parts and very good performance.
It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple: (i) they cannot take advantage of more data after a point and start twiddling the 10th decimal place of some parameters, and (ii) they are expensive to train on very large data sets. So a bit of a double whammy. Well, there was a third: no hardware acceleration that could compare with GPUs.
Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user friendly way to increase their modeling capacity. We had a few ways of doing just that but they weren't great. We need a breakthrough to scale them to GPT sized data sets.
In a way, DNNs are "design your own kernels using data," whereas kernels came in any color you liked provided it was black. (Yes, there were many types, but it was still a fairly limited catalogue. The killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines.)
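For concreteness, the classical composition rules that do exist are sums and products of kernels, both of which preserve positive semidefiniteness. A minimal numpy sketch (kernel choices and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def poly(X, Y, degree=2, c=1.0):
    """Polynomial kernel matrix."""
    return (X @ Y.T + c) ** degree

X = rng.normal(size=(50, 3))

# Sums (and products) of valid kernels are valid kernels, so the composite
# Gram matrix stays positive semidefinite -- one of the few "user friendly"
# composition rules available.
K = rbf(X, X) + 0.5 * poly(X, X)
eigs = np.linalg.eigvalsh(K)
print(eigs.min() > -1e-8)   # PSD up to numerical tolerance
```

The catch the comment alludes to: even composed this way, the Gram matrix is n-by-n, so the cost grows quadratically (or worse) in the number of training points, which is exactly what falls over at GPT-scale data.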
In the olden days, the correct way to solve a linear system of equations was to use the theory of minors. With the advent of computers, you suddenly had a huge theory of Gaussian elimination, Krylov subspaces, and what not.
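The contrast in a tiny, illustrative sketch: Cramer's rule via minor expansion costs O(n·n!) work, while the elimination-based solver behind `np.linalg.solve` is O(n³). All names here are made up for the example:

```python
import numpy as np

def det_by_minors(A):
    """Determinant via cofactor (minor) expansion -- O(n!) work."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] *
               det_by_minors([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(n))

def cramer_solve(A, b):
    """Cramer's rule: x_i = det(A_i) / det(A), where A_i swaps column i for b."""
    d = det_by_minors(A)
    return [det_by_minors([row[:i] + [b[k]] + row[i + 1:]
                           for k, row in enumerate(A)]) / d
            for i in range(len(b))]

A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
print(cramer_solve(A, b))                          # [1.0, 3.0]
print(np.linalg.solve(np.array(A), np.array(b)))   # same answer, O(n^3)
```

Same mathematics, but only one of the two scales once machines enter the picture, which is the analogy being drawn.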
But they don't give the same results at those smaller scales. People imagined it, but no one could have put it into practice because the hardware wasn't there yet. Simplified, LLMs are basically Transformers plus the additional idea of "a shitton of data to learn from," and to make training feasible with that amount of data, you do need capable hardware.
https://youtu.be/glWvwvhZkQ8?si=-HGtfd_KHYfatEQ
Although it’s focused on Ilya, some great history is covered.
The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPTT), and who wanted to come up with a new seq2seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was being used to good effect for other types of model.
As I understand it, the inspiration for what would become the Transformer came from Attention paper co-author Jakob Uszkoreit, who realized that language, while superficially sequential (hence a good match for RNNs), is really parallel + hierarchical. You can see this in linguists' sentence parse trees: different branches of the tree reflect parallel analyses of different parts of the sentence, which are then combined at higher levels of the hierarchy. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, the parallelism being the whole point since it could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture: the Transformer.
Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!
The other point is that while the Transformer is a very powerful, general-purpose, and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power existed to scale it up to massive size, it would likely not have appeared so promising or interesting.
The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".
The time was right for the Transformer to appear when it did, designed to take advantage of recent GPU advances, building on top of this new attention architecture, and now with the compute power and dataset size available that it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
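The GPU-friendliness being described comes down to one batched matmul replacing sequential recurrence. A minimal single-head self-attention sketch in numpy (shapes and the identity Q/K/V projections are simplifications for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention, as in 'Attention is all you need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # all position pairs at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Every position attends to every other position in one shot -- no
# step-by-step recurrence to backpropagate through, which is exactly
# what made this a better match for GPUs than an LSTM.
seq_len, d_model = 8, 16
X = rng.normal(size=(seq_len, d_model))
out = attention(X, X, X)   # self-attention with identity projections
print(out.shape)           # (8, 16)
```

A real Transformer adds learned Q/K/V projections, multiple heads, and the feed-forward and normalization layers, but the parallel-over-positions core is this.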
Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.
It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.
What result are you referring to?
I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.
https://arxiv.org/abs/2510.12269
https://www.mdpi.com/1099-4300/28/3/332
I am also interested in the connection with fuzzy logic: it seems that NNs can reason in a fuzzy way, but what are they doing, formally? For years people tried to formalize fuzzy reasoning, but it looks like we don't care anymore.
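For anyone unfamiliar with what "formalized fuzzy reasoning" looks like: truth values live in [0, 1] and the connectives are t-norms/t-conorms. A tiny sketch, with the sigmoid "soft AND" being my own illustrative guess at the kind of NN connection the comment gestures at:

```python
import numpy as np

# Classical fuzzy connectives over truth degrees in [0, 1].
def fuzzy_and(a, b):   # Goedel t-norm (Zadeh's min)
    return min(a, b)

def fuzzy_or(a, b):    # Goedel t-conorm (max)
    return max(a, b)

def prod_and(a, b):    # product t-norm, another standard choice
    return a * b

# A sigmoid unit's output can be read as a soft truth degree; with a
# suitable bias it approximates a smoothed AND. This mapping is an
# illustrative assumption, not an established NN/fuzzy equivalence.
def soft_and(a, b, sharpness=10.0):
    return 1.0 / (1.0 + np.exp(-sharpness * (a + b - 1.5)))

print(fuzzy_and(0.9, 0.7))           # 0.7
print(round(prod_and(0.9, 0.7), 2))  # 0.63
```

The open question in the comment is whether trained networks implement anything like a fixed t-norm, or something messier that no fixed algebra captures.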
I feel like NNs (and Transformers) are the OOP (object-oriented programming) of ML. Really popular, works pretty well in practice, but nobody understands the fundamentals; there is a feeling that it is a made-up new language to express things expressible before, but it is hard to pinpoint where exactly it helps.
It might lead to understanding how to measure when a deep learning system is making stuff up or hallucinating. That would have a huge payoff. Until we get that, deep learning systems are limited to tasks where the consequences of outputting bullshit are low.
That's a great problem to solve! (Maybe I'm biased, because this is my primary research direction.) One popular approach is OOD detection, but this always seemed ill-posed to me. My colleagues and I have been approaching this from a more fundamental direction using measures of model misspecification, but this is admittedly niche because it is very computationally expensive. It could still be a while before a breakthrough comes from any direction.
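For readers wondering what baseline OOD detection even looks like: the simplest standard approach thresholds the maximum softmax probability (Hendrycks & Gimpel's baseline). The toy logits below are made up for illustration; this is the crude end of the spectrum, not the misspecification-based approach described above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def max_softmax_score(logits):
    """Maximum softmax probability: a standard (if crude) OOD baseline.
    Low top-class confidence is flagged as possibly out-of-distribution."""
    return softmax(logits).max(-1)

# Toy logits: one confident in-distribution prediction, one nearly flat.
in_dist  = np.array([[6.0, 0.5, 0.2]])
ood_like = np.array([[1.1, 1.0, 0.9]])

threshold = 0.5
for name, logits in [("in-dist", in_dist), ("ood-like", ood_like)]:
    s = max_softmax_score(logits)[0]
    print(name, round(s, 2), "flagged" if s < threshold else "ok")
```

The ill-posedness complaint is easy to see from here: a network can be confidently wrong far from its training data, so high softmax confidence is not evidence of being in-distribution.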
It would be valuable enough that getting significant funding to work on it is probably possible. Especially with all the money being thrown at AI.
There is so much to digest here but it's fascinating seeing it all put together!
My fear is that this is as hopeless right now as explaining why humans or other animals can learn certain things from their huge amount of input data. We'll gain better empirical understanding, but it won't ever be fundamental computer science again, because the giga-datasets are the fundamental complexity not the architecture.
That would be amazing, but personally I’m skeptical.
Neural networks in general are Turing-complete models. Human brains are, in the abstract, Turing complete as well, to take a simple example. LLMs run iteratively in an unbounded loop may be "effectively Turing complete" for the same simple reason.
Regardless, any theory purporting to be foundational ought to explicitly address this demarcation. Unless practitioners think computability and formal complexity are not scientific foundations for CS.
You need a memory element the network can interact with, just like an ALU by itself is not TC, but a barebones stateful CPU (ALU + registers) is.
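The ALU/registers analogy can be made concrete: a fixed finite controller (stand-in for a frozen network's input-output map) plus an unbounded, writable tape is a Turing machine, while the controller alone is just a finite transducer. A toy sketch with a made-up bit-flipping machine:

```python
from collections import defaultdict

# Transition table: (state, symbol) -> (new_state, symbol_to_write, head_move).
# The table is the fixed "controller"; the tape is the external memory the
# controller interacts with. This tiny machine flips bits until it hits a
# blank cell, then halts.
DELTA = {
    ("scan", "0"): ("scan", "1", +1),
    ("scan", "1"): ("scan", "0", +1),
    ("scan", "_"): ("halt", "_", 0),
}

def run(tape_str, state="scan", max_steps=1000):
    tape = defaultdict(lambda: "_", enumerate(tape_str))  # unbounded tape
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        state, tape[head], move = DELTA[(state, tape[head])]
        head += move
    return "".join(tape[i] for i in range(len(tape_str)))

print(run("1011"))   # 0100
```

Swap the lookup table for a network computing the same transition function and nothing changes: the Turing completeness comes from the interaction loop with memory, not from the function in the middle.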