Meta's Llama 4 Drama...
Genius AI or Benchmark Cheater?
Meta just dropped a new family of AI models, Llama 4.
And the big buzz? They might be cheating their way up the leaderboards.
Meet the Llama 4 "herd":
Scout: The small one, but with a mind-blowing 10 MILLION token context window (Gemini 2.5 Pro has 1M).
Maverick: The medium, general-purpose model.
Behemoth: The massive one (coming soon).
Meta's making some bold claims:
Scout perfectly retrieves info across its entire 10M token context in Meta's tests (that's roughly 20 hours of video!). A sketch of that kind of "needle in a haystack" test follows this list.
They suggest Maverick beats competitors like Claude 3.7 Sonnet.
Maverick shot up to #2 on the popular Chatbot Arena leaderboard.
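For context on what that retrieval claim means: a "needle in a haystack" test buries one specific fact inside a huge wall of filler text and asks the model to quote it back. Here's a minimal sketch of how you could run one yourself against a locally served Scout; the endpoint, model id, and prompt are my assumptions, not Meta's actual harness.

```python
# Minimal "needle in a haystack" retrieval check, assuming an
# OpenAI-compatible endpoint serving Llama 4 Scout (e.g. via vLLM).
# The base_url and model id are placeholders, not Meta's own test setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

NEEDLE = "The secret launch code is 7-2-9-4."
FILLER = "The llama grazed quietly on the hillside. " * 5000  # long distractor text

def needle_found(depth: float) -> bool:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) and
    check whether the model can quote it back."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    response = client.chat.completions.create(
        model="llama-4-scout",  # placeholder model id
        messages=[{
            "role": "user",
            "content": haystack + "\n\nWhat is the secret launch code?",
        }],
    )
    return "7-2-9-4" in response.choices[0].message.content

# A "perfect retrieval across the whole context" claim means this should
# pass at every depth, right up to the 10M-token limit.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"depth {depth:.2f}: {'found' if needle_found(depth) else 'missed'}")
```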
But is that #2 spot earned fairly?
Here’s the catch:
While Llama 4 looks good in some comparisons Meta shows, it lags significantly behind models like Gemini 2.5 Pro on standard academic benchmarks (MMLU, coding, etc.).
The Chatbot Arena isn't a typical benchmark – it relies on human preference. Users blindly compare outputs from two models and pick the "better" one.
Meta admits they specifically "optimized [Maverick] for conversationality" just for this arena. People simply liked its answers more, boosting its score.
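For context, the Arena turns those blind votes into a rating with an Elo / Bradley-Terry style update: win a comparison and your score goes up, lose and it goes down. Here's a toy sketch of that mechanic (not LMSYS's actual pipeline), and notice that nothing in it measures capability, only which output voters preferred.

```python
# Toy Elo-style update from blind pairwise votes, to illustrate how a
# preference leaderboard works. This is NOT LMSYS's real methodology
# (they fit a Bradley-Terry model), just the same basic idea.
from collections import defaultdict

K = 32  # update step size
ratings = defaultdict(lambda: 1000.0)

def expected_score(a: float, b: float) -> float:
    """Probability that the model rated `a` beats the model rated `b`."""
    return 1.0 / (1.0 + 10 ** ((b - a) / 400.0))

def record_vote(winner: str, loser: str) -> None:
    """A user saw two anonymous outputs and preferred `winner`."""
    ea = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - ea)
    ratings[loser]  -= K * (1.0 - ea)

# Simulated votes: if users simply *like* one model's tone more often,
# its rating climbs, regardless of benchmark-style capability.
votes = [("maverick", "model-x")] * 7 + [("model-x", "maverick")] * 3
for winner, loser in votes:
    record_vote(winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

That's exactly why "optimized for conversationality" matters: likeable answers win votes even when they aren't the most capable ones.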
So, is it cheating?
My take: Probably not cheating in the traditional sense (like training on test data). But optimizing purely for likability on a subjective leaderboard feels like gaming the system, not necessarily proving superior capability.
What about real-world use?
Llama 4 is incredibly FAST. Seriously rapid generation.
It's "open weight," meaning you can download and run Scout locally (Maverick needs serious hardware).
But in my tests? The output quality was… basic. Gemini 2.5 produced much better results, even if it took longer.
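If you want to try Scout yourself, the rough shape of a local run with Hugging Face transformers looks like this. The repo id is my assumption based on Meta's naming, you'll need to accept the license on the actual model card, and even the "small" model wants a lot of GPU memory; treat it as a sketch, not a verified recipe.

```python
# Rough sketch of running Llama 4 Scout locally with Hugging Face transformers.
# The repo id below is an assumption; check the real model card and license.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across whatever GPUs are available
)

messages = [
    {"role": "user",
     "content": "Summarise the trade-offs of a 10M token context window."}
]

out = pipe(messages, max_new_tokens=300)
print(out[0]["generated_text"][-1]["content"])
```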
Is Llama 4 a game-changer with its context length and speed, or just cleverly marketed?
In my latest video I break down the benchmarks and the "cheating" claims, and run live coding tests. See for yourself whether Llama 4 lives up to the hype.
Watch it here → https://www.youtube.com/watch/xwK3jPcsbIw
What do you think?
Is optimizing for conversationality fair game, or misleading benchmark manipulation? Let me know!
Luke