The New Scientist used freedom of information laws to get the ChatGPT records of the UK’s technology secretary.
The headline hints at a damning exposé, but the piece ends up being a story about a politician making pretty reasonable and sensible use of language models to be more informed and make better policy decisions.
He asked it why small business owners are slow to adopt AI, which popular podcasts he should appear on, and to define terms like antimatter and digital inclusion.
This all seems extremely fine to me. Perhaps my standards for politicians are too low, but I assume they don’t actually know much and rely heavily on advisors to define terms for them and decide on policy improvements. And I think ChatGPT connected to some grounded sources would be a decent policy advisor. Better than most human policy advisors. At least when it comes to consistency, rapidly searching and synthesising lots of documents, and avoiding personal bias. Models still carry the bias of their creators, but it all becomes a trade-off between human flaws and model flaws.
Claiming language models should have anything to do with national governance feels slightly insane. But we’re also sitting in a moment where Trump and Musk are implementing policies that trigger trade wars and crash the U.S. economy. And I have to think “What if we just put Claude in charge?”
We have a new(ish) benchmark, cutely named “Humanity’s Last Exam.” Okay, it’s not that new – it was created in September 2024 – but we’ve only recently seen companies using it when they announce new models.
If you’re not familiar with benchmarks, they’re how we measure the capabilities of particular AI models like o1 or Claude Sonnet 3.5. Each one is a standardised test designed to check a specific skill set.
For example, GPQA (the Graduate-Level Google-Proof Q&A Benchmark) measures correctness on a set of questions written by PhD students and domain experts in biology, physics, and chemistry.
When you run a model on a benchmark it gets a score, which allows us to create leaderboards
showing which model is currently the best for that test. To make scoring easy, the answers are usually formatted as multiple choice, true/false, or unit tests for programming tasks.
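To make that concrete, here’s a minimal sketch (in Python) of how multiple-choice benchmark scoring works. The `ask_model` function is a hypothetical stand-in for whatever API call you’d make to the model being evaluated – it isn’t from any real eval library.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# ask_model is a hypothetical stand-in for a real model API call.

def ask_model(question: str, choices: list[str]) -> str:
    """Return the letter ("A", "B", "C", ...) the model picks. Hypothetical."""
    raise NotImplementedError("swap in a real model call here")

def score_benchmark(items: list[dict]) -> float:
    """Each item looks like {"question": str, "choices": [...], "answer": "B"}.
    Returns accuracy as a fraction between 0 and 1."""
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

# A leaderboard is just these accuracy scores, one per model, sorted descending.
```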
Among the many problems with using benchmarks as a stand-in for “intelligence” (other than the fact that they’re multiple-choice standardised tests – do you think that’s a reasonable measure of human capabilities in the real world?) is that our current benchmarks aren’t hard enough.
New models routinely achieve 90%+ on the best ones we have. So there’s a clear need for harder benchmarks to measure model performance against.
Made by Scale AI and the Center for AI Safety, the benchmark crowdsources “the hardest and broadest set of questions ever” from experts across domains. There are 2,700 questions at the moment, some of which are being kept private to prevent future models training on the dataset and memorising answers ahead of time. Questions like this:
So far, it’s doing its job well – the highest scoring model is OpenAI’s Deep Research
at 26.6%, with other common models like GPT-4o, Grok, and Claude only getting 3-4% correct. Maybe it’ll last a year before we have to design the next “last exam.”
A quick note on benchmarks and sweeping generalisations
When people make sweeping statements like “language models are bullshit machines” or “ChatGPT lies,” it usually tells me they’re not seriously engaged in any kind of AI/ML work or productive discourse in this space.
First, because saying a machine “lies” or “bullshits” implies motivated intent in a social context, which language models don’t have. Models doing statistical pattern matching aren’t purposefully trying to deceive or manipulate their users.
And second, broad generalisations about “AI”’s correctness, truthfulness, or usefulness are meaningless outside of a specific context. Or rather, outside of a specific model measured on a specific benchmark or reproducible test.
So, next time you hear someone making grand statements about AI capabilities (both critical and overhyped), ask: which model are they talking about? On what benchmark? With what prompting techniques? With what supporting infrastructure around the model? Everything is in the details, and the only way to be a sensible thinker in this space is to learn about the details.
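As a rough illustration of how much a single headline score hides, here’s a sketch of the kind of evaluation spec you’d need to pin down before a capability claim becomes reproducible. The field names are my own invention, not from any particular eval framework.

```python
from dataclasses import dataclass

# Hypothetical spec showing the details hidden behind any single benchmark score.
# These field names are illustrative, not from a real eval framework.

@dataclass
class EvalSpec:
    model: str            # e.g. a dated model snapshot, not just "ChatGPT"
    benchmark: str        # e.g. "GPQA" or "Humanity's Last Exam"
    prompt_template: str  # zero-shot? few-shot? chain-of-thought?
    n_shots: int          # how many worked examples sit in the prompt
    tools: list[str]      # web search, code execution, retrieval, ...
    temperature: float    # sampling settings shift scores too

claim = EvalSpec(
    model="example-model-v1",
    benchmark="Humanity's Last Exam",
    prompt_template="zero-shot chain-of-thought",
    n_shots=0,
    tools=["web_search"],
    temperature=0.0,
)

# Two scores that differ in any of these fields are not directly comparable.
```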
DeepSeek released R1, an open-source “reasoning model.” Given a prompt, it first thinks through a chain of reasoning steps before answering, and this “chain of thought” technique dramatically improves the quality of its answers. These models are also fine-tuned to perform well on complex reasoning tasks.
R1 reaches equal or better performance on a number of major benchmarks compared to OpenAI’s o1 (our current state-of-the-art reasoning model) and Anthropic’s Claude Sonnet 3.5 but is significantly cheaper to use.
You can access it through their API at a much lower cost than OpenAI or Anthropic. But given this is a Chinese model, and the current political climate is “complicated,” and they’re almost certainly training on input data, don’t put any sensitive or personal data through it.
You can use R1 online through the DeepSeek chat interface. You can turn on both reasoning and web search to inform its answers. Reasoning mode shows you the model “thinking out loud” before returning the final answer.
It’s also possible to run R1 on your own machine, but standard personal laptops won’t be able to handle the larger, more capable versions of the model (32B+). You’ll have to run the smaller 8B or 14B versions, which will be slightly less capable. I have the 14B version running just fine on a MacBook Pro with an Apple M1 chip. Here’s a Reddit guide with more detail.
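For what it’s worth, here’s roughly what chatting with a local distilled version looks like using the ollama Python package. I’m assuming you’ve already pulled a deepseek-r1 variant; treat the exact model tag as a placeholder.

```python
# Rough sketch of chatting with a locally-running distilled R1 model.
# Assumes Ollama is installed and a deepseek-r1 variant has been pulled,
# e.g. via `ollama pull deepseek-r1:14b` (the exact tag may differ).
import ollama

response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

# R1-style models emit their reasoning before the final answer, typically
# wrapped in <think>...</think> tags within the response text.
print(response["message"]["content"])
```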
DeepSeek also claims to have trained R1 at a small fraction of the cost reported for GPT-4. If true, building state-of-the-art models is no longer just a billionaires’ game.
The thoughtbois of Twixxer are winding themselves into knots trying to theorise what this means for the U.S.-China AI arms race. A few people have referred to this as a “sputnik moment.”
From my initial, unscientific, unsystematic explorations with it, it’s really good. I’m using it as my default LM going forward (for tasks that don’t involve sensitive data). Quirks include being way too verbose in its reasoning explanations and using lots of Chinese-language sources when it searches the web, which makes it challenging to validate whether claims match the source texts.
Here’s the announcement Tweet:
🚀 DeepSeek-R1 is here!
⚡ Performance on par with OpenAI-o1
📖 Fully open-source model & technical report
🏆 MIT licensed: Distill & commercialize freely!
TLDR: high-quality reasoning models are getting significantly cheaper and more open. This means companies like Google, OpenAI, and Anthropic won’t be able to maintain a monopoly on access to fast, cheap, good-quality reasoning. This is a net good for everyone.
A roboticist breaks down common misconceptions about what’s hard and easy in robotics. A response to
everyone asking “can’t we just stick a large language model into its brain to make it more capable?”
Contrary to the assumptions of many people, making robots perceive and move through the world the way humans can turns out to be an extraordinarily hard problem to solve, while seemingly “hard” problems like scoring well on intelligence tests, winning at chess, and acing the GMAT turn out to be much easier.
Everyone thought it would be extremely hard and computationally expensive to teach computers
language, and easy to teach them to identify objects visually. The opposite turned out to be true.
This is known as Moravec’s Paradox.
Especially liked the ending, where Dan explores why people are so resistant to the idea that picking up a cup is more complex than solving logic puzzles. Partly anthropocentrism: humans are special because we can do higher-order thinking; any lowly animal can sense the world and move through it. Partly social class bias: people who work manual labour jobs using their bodies are less valued than people who sit still using their intellect to solve problems.
Researchers submitted entirely AI-generated exam answers to the undergraduate psychology department
of a “reputable” UK university. The vast majority went undetected and the AI answers achieved higher
scores than real students.
“We report a rigorous, blind study in which we injected 100% AI written submissions into the
examinations system in five undergraduate modules, across all years of study, for a BSc degree in
Psychology at a reputable UK university. We found that 94% of our AI submissions were
undetected. The grades awarded to our AI submissions were on average half a grade boundary
higher than that achieved by real students. Across modules there was an 83.4% chance that the AI
submissions on a module would outperform a random selection of the same number of real student
submissions.”
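That 83.4% figure describes a resampling-style comparison you can reproduce in a few lines. Here’s a sketch of the general approach, with invented placeholder grades rather than the study’s actual data.

```python
# Sketch of the resampling comparison behind a figure like "83.4% chance the
# AI submissions outperform real students". The grade lists are invented
# placeholders, not the study's data.
import random

ai_grades = [68, 65, 70, 66, 72]                            # hypothetical AI marks
student_grades = [62, 58, 71, 64, 55, 67, 60, 63, 69, 57]   # hypothetical student marks

def probability_ai_outperforms(trials: int = 10_000) -> float:
    """Estimate how often the AI submissions beat a random sample of the same
    number of real student submissions, averaged over many resamples."""
    ai_mean = sum(ai_grades) / len(ai_grades)
    wins = 0
    for _ in range(trials):
        sample = random.sample(student_grades, len(ai_grades))
        if ai_mean > sum(sample) / len(sample):
            wins += 1
    return wins / trials

print(probability_ai_outperforms())
```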
I have to assume educators are swiftly moving to hand-written exams under supervised conditions and
oral exams. Anything else seems to negate the point of exams.
A browser extension that filters out engagement bait from your feed on Twixxer. Uses Llama 3.3 under
the hood to analyse Tweets in real time and then blurs out sensationalist political content. Or
whatever else you prompt it to blur – the system prompt is editable:
System settings and a customisable prompt for the Unbaited app
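Mechanically, this is a small classify-then-blur loop. Here’s a hedged sketch of that pattern using a local Llama model via the ollama Python package – the system prompt and classification logic are mine, not Unbaited’s actual implementation (which runs in the browser).

```python
# Sketch of the classify-then-blur pattern behind tools like Unbaited.
# The system prompt and model choice are illustrative, not the extension's
# actual code (which runs in the browser, not Python).
import ollama

SYSTEM_PROMPT = (
    "You are a content filter. Reply with exactly 'BAIT' if the tweet is "
    "engagement bait or sensationalist political content, otherwise reply 'OK'."
)

def should_blur(tweet_text: str) -> bool:
    response = ollama.chat(
        model="llama3.3",  # assumes this model tag is pulled locally
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": tweet_text},
        ],
    )
    return "BAIT" in response["message"]["content"].upper()

# The editable system prompt is the interesting part: change the string above
# and the same loop filters for whatever you care about.
```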
This is certainly a way to try and manage Twixxer’s slow demise into right-wing extremist content.
Though I’m taking this more as a thought experiment and interesting prototype than a sincere
suggestion we should spend precious energy burning GPUs on clickbait filtering. Integrating LLMs
into the browsing experience and using them to selectively curate content for you is the more
interesting move here.
I’ve added a new type of post to the site. One that was
overdue. They’re called smidgeons. Teeny, tiny entries. The kinds of things I used to put in
Tweets, before Twitter died a terrible death.
Most are only a few sentences long. They’re mainly links to notable things – good articles, papers,
and ideas. I’ve been meaning to do this for a while, but a recent migration to
Astro