It's a log/log plot so radius vs volume will just change the slope of the lines. If you rescale the axes it would look exactly the same
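Sanity check on the algebra (assuming the plotted quantity is a sphere's volume, so V = (4/3)πr³): taking logs gives

log V = log(4π/3) + 3 log r

so a power law that plots as a line of slope m against log r plots as a line of slope 3m against log V (or m/3, if radius is the horizontal axis). The additive constant only shifts the line; nothing else about the figure changes.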
Social media has been a great source of random papers I'd disregarded/underestimated when they came out. It's not the main source of anything for me, but it fills a gap nicely
Oh man, it's basically viral advertising too. Like, imagine the indignity of being blown up by a home-made bomb that first tells you to "Revolutionize your B2B impact with Boom.ly". Man-made horrors beyond our comprehension.
And you can run Qwen 3.5 yourself! (With the right hardware). They just give the weights away! You can do what you want. No billionaires need be party to it.
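If you want the hands-on version, here's a minimal local-inference sketch with Hugging Face transformers. The model id is a placeholder assumption; swap in whatever released checkpoint your hardware actually fits:

```python
# Minimal local-inference sketch with Hugging Face transformers.
# The model id below is a placeholder: substitute the actual released
# checkpoint you have the hardware (and disk) for.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder; pick a size your GPU fits

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available devices
)

inputs = tokenizer("The bitter lesson is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```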
It may be generally true that "the methods that end up working will be among the more general of the ones you tried", but "general, scalable methods are always better than more specific ones" is not.
Fin.
In the domain where the NFLT (No Free Lunch Theorem) is applicable, for instance, performance on a held-out test set is *completely uncorrelated* with overall performance. These are not the sort of problems anyone ever actually encounters. They are the equivalent of trying to compress noise.
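Toy version of that regime, if you want to see it (my own sketch, numpy only): memorize pure-noise labels and the held-out score collapses to chance.

```python
# Sketch of the "compressing noise" regime: when labels are pure noise,
# train performance says nothing about test performance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))      # arbitrary features
y = rng.integers(0, 2, size=2000)    # labels are coin flips

X_tr, y_tr = X[:1000], y[:1000]
X_te, y_te = X[1000:], y[1000:]

# 1-nearest-neighbor: memorizes the training set perfectly.
def predict(x):
    return y_tr[np.argmin(np.sum((X_tr - x) ** 2, axis=1))]

train_acc = np.mean([predict(x) == t for x, t in zip(X_tr, y_tr)])
test_acc = np.mean([predict(x) == t for x, t in zip(X_te, y_te)])
print(train_acc, test_acc)  # ~1.0 train, ~0.5 test: held-out score is chance
```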
At the extreme end, Bitter Lesson-ing can veer into NFLT-like "all search is the same" appeals which should be rejected out of hand. If this were true, progress would be impossible.
Much of the essential structure of "scalable generalist methods" was developed *by imitating human thought processes*. We've just decided it is good, helpful structure instead of the dumb, wasteful kind you shouldn't use. "Avoid *bad* structure" is not a useful statement.
An objection I've heard to this is that there's a difference between "essential structure for a method to work" and "unnecessary extra structure that imitates our thought processes". The distinction between these, though, is only knowable in retrospect.
In all three cases, the maximalist reading of The Bitter Lesson would tell us that with enough scaling, these methods would have beaten what replaced them. That never happened. I guess it still could in the long run, but to paraphrase Keynes, in the long run we're all dead.
Finally, a highly specific example is positional embeddings. Initial LLMs learned these from scratch with no inductive bias. Again, this worked at small scales, but was utterly dominated by theoretically motivated, domain specialized methods like RoPE. No one tries to learn shift covariance anymore.
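For anyone who hasn't seen it, a rough numpy sketch of the rotation trick RoPE uses (the idea, not any particular model's exact implementation):

```python
# Rotary position embeddings (RoPE): instead of learning position
# vectors from scratch, rotate each (even, odd) feature pair by an
# angle proportional to the token's position. Relative offsets then
# show up as pure rotations in the query/key dot product.
import numpy as np

def rope(x, base=10000.0):
    """x: (seq_len, dim) queries or keys, dim even."""
    seq_len, dim = x.shape
    # One frequency per feature pair, geometrically spaced.
    freqs = base ** (-np.arange(0, dim, 2) / dim)     # (dim/2,)
    angles = np.outer(np.arange(seq_len), freqs)      # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # 2-D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```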
Our next example is GANs. GANs were the genetic algorithms of the 2010s. They promised to solve virtually any problem: you didn't even need a domain-specific loss. Unfortunately (like GAs) they only performed well on the simplest problems. ImageGen only took off when we left them behind.
To be unfair, if the maximalist reading of The Bitter Lesson were true, genetic algorithms would have found the Transformer in 2003.
I still have people tell me these will make a reappearance any day now, but far less often than a decade ago.
Genetic Algorithms are *the* generalist, scalable method. The NFLT bros once loved to talk about how soon these would replace all optimization. We even had a working example in biology. They were right: if you have time for a trillion or so iterations you *can* make them work. You don't, though.
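If you've never met one: a bare-bones GA, assuming nothing but a bit-string encoding and a fitness function. My own toy sketch, not anyone's production optimizer:

```python
# Bare-bones genetic algorithm: the whole "generalist" pitch in ~20 lines.
import numpy as np

rng = np.random.default_rng(0)

def fitness(pop):                    # toy objective: count of 1-bits
    return pop.sum(axis=1)

pop = rng.integers(0, 2, size=(64, 32))   # 64 random bit-strings
for generation in range(200):
    f = fitness(pop)
    # Selection: sample parents in proportion to fitness.
    parents = pop[rng.choice(len(pop), size=len(pop), p=f / f.sum())]
    # Crossover: splice random pairs at a random cut point.
    cut = rng.integers(1, pop.shape[1])
    children = np.concatenate(
        [parents[: len(pop) // 2, :cut], parents[len(pop) // 2 :, cut:]], axis=1
    )
    pop = np.concatenate([parents[: len(pop) // 2], children])
    # Mutation: flip each bit with small probability.
    pop ^= (rng.random(pop.shape) < 0.01).astype(pop.dtype)

print(fitness(pop).max())  # creeps toward 32, given enough generations
```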
First, while there are many cases of general, scalable methods beating domain specialized ones, there are plenty of cases of *too-general methods that didn't work because they were too general*. Let's look at a few in decreasing level of ambition:
The "our blessed homeland" meme, where the same things people often argue against on "bitter lesson" grounds are contrasted with the positive framing we give those same actions for methods that *work*:
Our Essential Structure vs. Their Bitter Lesson
Our visionary equivariance vs. their constricting inductive bias
Our cunning generality vs. Their infeasibly vast solution space
Our clever synthetic data vs. Their Desperate Data Augmentation
Our exponential speedup vs. Their losing battle with Moore's Law
"The Bitter Lesson" is a fine little essay. Anyone in software/ML/AI should read it. But I see a lot of people run with a maximalist overstatement of it, essentially: "structure and domain knowledge are always a mistake". This is historically dubious and theoretically ungrounded. A thread:
Part of the reason I never open-source (or push to release commercially) anything for agents is that I do not want the misery of owning it
i'm talking about this to be clear
this has all been an obvious idea for literally years imho
bsky.app/profile/davi...
Spent months building something like this out early last year because it was obviously a good idea, but zero support in the open ecosystem. I hope this finally takes off so I don't have to keep supporting it internally.
Meanwhile optimal transport just naturally handles all three regimes (discrete / mutual support / disjoint support) in the same way.
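Concretely (my own sketch, using scipy's 1-D W1 implementation; the qualitative story is the same for W2):

```python
# The three regimes, side by side: Wasserstein gives a finite, metric-
# aware answer in all of them; KL blows up whenever support escapes.
import numpy as np
from scipy.stats import wasserstein_distance

# 1) Discrete point masses at 1.0 and 1.0 + eps.
eps = 1e-3
print(wasserstein_distance([1.0], [1.0 + eps]))   # -> eps
# KL(delta_1 || delta_{1+eps}) is infinite: the supports are disjoint.

# 2) Mutual support: two histograms on the same grid.
grid = np.arange(5.0)
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.4, 0.2, 0.1, 0.2, 0.1])
print(wasserstein_distance(grid, grid, p, q))     # finite
print(np.sum(p * np.log(p / q)))                  # KL also works here

# 3) Disjoint support: the same histogram shifted by 10.
print(wasserstein_distance(grid, grid + 10.0, p, p))  # -> 10.0, the shift
# KL is again infinite; and it's infinite whether the shift is 10 or 1e-9.
```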
For me the thing that settles it is that the distances for point masses are the distance in the underlying metric. I have seen so many people try to hack KL into working on distributions with disjoint support (doing crazy things like Gaussian smoothing or adding a difference-of-CoM term to losses).
For me (which is probably just different intuition), it is deeply weird to have any two distributions with non-overlapping support always be at the same (infinite) distance from each other. Practically it also means distances are dominated by low-mass regions even if the support is the same.
W2 for normal distributions is just the Euclidean norm on (mean, std), so these should be different
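Quick numerical check of that closed form; in 1-D the optimal coupling matches quantiles, so W2 is just an L2 distance between quantile functions:

```python
# For 1-D Gaussians: W2(N(m1, s1), N(m2, s2))^2 = (m1 - m2)^2 + (s1 - s2)^2.
# Verify by integrating the squared difference of quantile functions.
import numpy as np
from scipy.stats import norm

m1, s1 = 0.0, 1.0
m2, s2 = 3.0, 2.0

q = np.linspace(1e-6, 1 - 1e-6, 200001)           # uniform grid on (0, 1)
w2_numeric = np.sqrt(np.mean((norm.ppf(q, m1, s1) - norm.ppf(q, m2, s2)) ** 2))
w2_closed = np.hypot(m1 - m2, s1 - s2)            # Euclidean norm on (mean, std)
print(w2_numeric, w2_closed)                      # agree to ~3 decimal places
```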
Talk to your ML students about Information Geometry---before someone else does.
Friends don't let friends use the KL Divergence for distributions over spaces with a natural metric.
And here is the Wasserstein metric
So this is a geodesic under the Fisher Metric (the Riemannian metric version of the KL divergence)
The KL geodesic between distributions with widely separated means and the same shape first blows up the variance, then slowly shifts the mean. The Wasserstein geodesic keeps the same shape but moves the mean in a straight path.
I'll try to dig up an animation of this I made. It's striking.
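Until I find it, a small numerical stand-in, assuming the standard closed forms: the Fisher metric on 1-D Gaussians is a hyperbolic half-plane in (mu/sqrt(2), sigma), so its geodesics are semicircles, while the W2 geodesic interpolates mean and std linearly.

```python
# The two geodesics between N(-5, 1) and N(5, 1).
# Fisher-Rao: semicircle in the half-plane, so sigma balloons mid-path.
# W2: mean and std both interpolate linearly, so the shape is preserved.
import numpy as np

mu0, s0 = -5.0, 1.0
mu1, s1 = 5.0, 1.0

# Half-plane coordinates for ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2.
x0, y0 = mu0 / np.sqrt(2), s0
x1, y1 = mu1 / np.sqrt(2), s1

# Semicircle through both points, centered on the sigma = 0 axis.
c = (x1**2 + y1**2 - x0**2 - y0**2) / (2 * (x1 - x0))
r = np.hypot(x0 - c, y0)
theta = np.linspace(np.arctan2(y0, x0 - c), np.arctan2(y1, x1 - c), 5)

fisher_path = [(np.sqrt(2) * (c + r * np.cos(t)), r * np.sin(t)) for t in theta]
w2_path = [((1 - t) * mu0 + t * mu1, (1 - t) * s0 + t * s1)
           for t in np.linspace(0, 1, 5)]

for (fm, fs), (wm, ws) in zip(fisher_path, w2_path):
    print(f"Fisher: mu={fm:+6.2f} sigma={fs:5.2f}   W2: mu={wm:+6.2f} sigma={ws:5.2f}")
# Fisher sigma peaks at ~3.7 at the midpoint; W2 sigma stays at 1 throughout.
```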
One big reason to hate it is that the ultra weak topology makes far more intuitive sense on distributions over metric spaces than the weak topology.
A probability mass at x=1 should be a neighbor of a mass at x=1 + epsilon. That the FIM / KL divergence doesn't have this is a crime.
If the answer is that we expect leaders to care about their own lives more than those of their people, and thus react more aggressively to deter decapitation, then I understand why *leaders* want that norm. We don't have to agree, though.