Title: Claude 3.7 Sonnet Just Shocked Everyone! (Claude 3.7 Sonnet and Claude Code)
Author: TheAIGRID
Transcript:
So today Anthropic finally released their new AI model, Claude 3.7 Sonnet, and there's a lot to digest, because this one isn't your standard LLM but rather a hybrid reasoning model that has a lot to offer in terms of the benchmarks it breaks. If you take a look at the announcement, it says: "Today, we're announcing Claude 3.7 Sonnet, our most intelligent model to date and the first hybrid reasoning model on the market." It says right here that Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user, and that API users also have fine-grained control over how long the model can think for. This hybrid reasoning they talk about is essentially system-one and system-two thinking, which I'll explain in a second; it means the model can offer responses suited both to difficult queries and to quick queries where you need an instant answer. They talk about how they developed Claude 3.7 Sonnet with a different philosophy from other reasoning models on the market: "Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely. This unified approach also creates a more seamless experience for users." They're essentially referring to system one and system two: system one is your intuition and instinct, so, much like humans, LLMs can respond instantly; system two is much slower and logical, where you think through problems and come up with more complex solutions. That's what they've embedded into the model.
Now, with Claude 3.7 Sonnet you can actually control the budget for the thinking. If you're a developer thinking about customizability, you can tell Claude to think for no more than a certain number of tokens, and it's up to you how long you want it to think. I think this is really valuable, because it lets us control how long Claude thinks about its problems. One thing I've noticed with other thinking models is that sometimes the model will think about a problem for maybe 10 seconds when we wanted it to think for 100 or even 200 seconds, so this is going to be really useful. I'll get to the now-infamous benchmarks in a moment, but when we look at Claude 3.7 versus Claude 3.5 in standard mode, Claude 3.7 is essentially just a lot smarter than the previous Claude 3.5 Sonnet. And with this version you can enable the extended thinking mode, where it self-reflects before answering, which improves its performance on math, physics, instruction following, coding, and many other tasks. They generally find that prompting the model works similarly in both modes, so for those of you with prompts that used to work on Claude 3.5 Sonnet, it's quite likely they will work the same way on Claude 3.7; there shouldn't be any changes to prompting.
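As a rough sketch of what that developer-side control looks like, here is how a Messages API request with a thinking budget might be put together. The field names (`thinking`, `budget_tokens`) and the model id follow Anthropic's published API at the time of writing, but treat them as assumptions and check the current docs before relying on them.

```python
# Minimal sketch of an Anthropic Messages API request body that enables
# extended thinking with a token budget. Field names and the model id are
# taken from Anthropic's docs and may change; verify before use.
def build_request(prompt: str, thinking_budget: int, max_tokens: int = 16000) -> dict:
    # The thinking budget must fit inside max_tokens, since thinking tokens
    # count toward the overall output limit.
    assert thinking_budget < max_tokens
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

# Ask for up to 8,000 tokens of visible step-by-step thinking.
request = build_request("Prove that sqrt(2) is irrational.", thinking_budget=8000)
```

Omitting the `thinking` field (or a hypothetical `{"type": "disabled"}`) would correspond to the near-instant standard mode the announcement describes.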
now one thing that I found really
interesting and I am finally glad an AI
company has done this is now the fact
that they are optimizing for real world
Focus they state that in developing our
reasoning models we've optimized
somewhat less for math and computer
science competition problems and instead
shifted Focus towards a real world tasks
that better reflect how businesses
actually use LMS the reason I think this
is going to be gamechanging is because
often times we see companies focus and
obsess over benchmarks that are in areas
that aren't in everyday use for example
when we about to take a look at the
benchmarks for claw 3.7 you'll see that
the benchmarks are still very impressive
but a lot of those areas don't actually
directly translate to World business use
that the average everyday person is
going to get value out of the model for
and I think this is why claw 3.7 and
Claw 3.6 have traditionally been better
than chai GPT and their rival
counterparts because they train the
models so that they're actually good for
real world use and maybe not so much
competition problems so these are the
These are the benchmarks for Claude 3.7 Sonnet, and right here we can see where it excels. One of the first things we notice is that Claude 3.7 Sonnet isn't crushing the other companies across the board. If you remember, Grok 3 Beta was released very recently, and we can already see how Claude 3.7 stacks up against it. I honestly don't know how they did it, but on several benchmarks it does come out on top, for example in agentic coding and agentic tool use, which I'll dive into a bit more; those two are the areas with real-world use cases, and they're really important. In other areas, for example visual reasoning and high-school competition math, a lot of these top models seem to be converging around the same level, roughly 86%. On GPQA, though, we can see that Claude 3.7 Sonnet manages to edge out Grok 3 Beta. I will say this one is a bit more interesting, because Claude 3.7 Sonnet really is a model you can't judge from the benchmarks alone; it's a model you truly have to use, and some of the tweets I'm seeing in the AI community definitely suggest this is going to be the model a lot of people immediately switch to. I wouldn't be surprised if Anthropic ran short of inference capacity for this model, considering how many people already used the previous one. Now, like I said before, I don't want to focus too much on these benchmarks, but I'll show you the one we do need to focus on.
This benchmark right here is agentic tool use: the τ-bench, a framework that tests AI agents on real-world tasks involving users and tool interactions. It's basically a benchmark that reflects real-world usage, which, like I said before, is what's really needed. It evaluates how consistently an AI agent can perform the same task across multiple trials, using a metric called pass^k, which looks at how reliably the agent succeeds over repeated attempts. I think this matters because we need benchmarks that work for real-world use cases: competition math and GPQA are great for assessing how smart an AI is, but you're going to need real-world tasks if you want AI to actually be used in the real world. By focusing on tool use, this benchmark looks for consistent behavior and helps ensure these AI agents are prepared for deployment in sensitive domains like customer service or healthcare.
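To make the pass^k idea concrete, here is a small estimator. The formula, the chance that k independently sampled trials all succeed, estimated from c successes out of n observed trials as C(c,k)/C(n,k), is my reading of the τ-bench metric by analogy with the standard pass@k estimator, so treat it as a sketch rather than the benchmark's exact code.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k i.i.d. trials of a task all succeed,
    given c observed successes in n trials. Assumed form of the tau-bench
    pass^k metric, analogous to the usual pass@k estimator."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    if c < k:
        # Fewer successes than the sample size: some sampled trial must fail.
        return 0.0
    return comb(c, k) / comb(n, k)
```

For example, an agent that succeeds on 6 of 8 trials looks strong at k=1 (0.75), but its estimated chance of succeeding on two consecutive attempts drops to about 0.54, which is exactly the reliability gap the speaker says matters for deployment.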
Of course, it wouldn't be a benchmark roundup if we didn't get to SWE-bench Verified. This benchmark covers the software-development niche, and Claude 3.7 achieves state-of-the-art performance on it, which really goes to show how much better this model is than OpenAI's o3-mini. We can literally see that, while there's a lot of DeepSeek hype and a lot of o3 hype too, Claude 3.7 actually surpasses those models by a pretty significant amount. And this is something I don't just take from the benchmarks; I've seen it firsthand, looking at people who are currently using Claude Code with Claude 3.7 Sonnet, and what they're saying is basically that this model is outstanding. We can see that OpenAI's o3-mini is at about 49%, o1 at about 48%, and Claude 3.5 Sonnet around 49% as well; then we get this massive jump to 62.3%, and all the way up to 70.3% with custom scaffolding. That is a huge jump. The Claude 3.5 Sonnet score, I believe, was from October 2024, so that's four months in which we've seen roughly a 13-percentage-point increase, which may not sound like a lot, but in terms of actual day-to-day use this is going to be a remarkably more helpful model for software-engineering-related tasks.
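A quick arithmetic check on the quoted numbers, using 49.3% as the approximate October 2024 Claude 3.5 Sonnet score from the chart (the exact baseline figure is an assumption):

```python
# SWE-bench Verified scores quoted in the discussion above (percent).
baseline = 49.3      # approx. Claude 3.5 Sonnet, October 2024
claude_37 = 62.3     # Claude 3.7 Sonnet
scaffolded = 70.3    # Claude 3.7 Sonnet with custom scaffolding

gain_four_months = claude_37 - baseline      # about 13 percentage points
gain_scaffolding = scaffolded - claude_37    # scaffolding adds about 8 more
```

So the "nearly 12%" figure mentioned in passing is really about 13 percentage points in roughly four months, with scaffolding adding another 8 on top.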
For those of you who may use Devin, you'll also see that on the agentic coding evaluation this model once again jumps up to 67%. So once again, I think the trend is clear to see: GPT-4o started at 49%, and Claude 3.7 Sonnet is already at 67%. Can you imagine where we're going to be in just a few years? The future is truly exciting.
And if you've been wondering about Claude and coding, you may want to take a look at this. It's a video where they introduce a new agentic coding tool that lets users work with Claude directly in the terminal, launched as a research preview to enhance coding capabilities. As for features: it can understand your codebase, analyze a repository, and provide insights into its structure; users can request changes; it can display its thought process; it can generate and execute tests, resolving errors automatically; it can detect and iteratively fix build issues; and it can push changes to GitHub with clear summaries. This is a video you're definitely going to want to watch if you're someone who uses Claude for coding.

"Should we be doing, like, a big smile? Or, uh... no, big smile is creepy." "I'm Boris, I'm an engineer." "I'm Cat, I'm a product manager." "We love seeing what people build with Claude, especially with coding, and we want to make Claude better at coding for everyone. We built some tools, one of which we're sharing today: we're launching Claude Code as a research preview. Claude Code is an agentic coding tool that lets you work with Claude directly in your terminal. We're going to show you an example of it in action. We have a project here; it's a Next.js app. Let's open it up in an instance of Claude Code. Now that we've done this, Claude Code has access to all of the files in this repository. We don't know much about this codebase; it looks like an app for chatting with a customer support agent, so let's get Claude to help explain the codebase to us. Claude starts by reading the higher-level files and then dives in deeper; now it's going through all the components in the project. Cool, here's its final analysis. Say I was asked to replace this left sidebar with a chat history, and I'm also going to add a new-chat button; I'm going to ask Claude to help me out here. We haven't specified any files or paths, and Claude is already finding the right files to update by itself. Claude can also show its thinking, and we can see how it's decided to tackle this problem. Claude is asking me if I want to accept these changes; I'll say yes. Now Claude is updating the navbar, adding a button and icons as well; next it's updating the logic to ensure the saving state works correctly. After a bit, Claude completes the task, and here's a summary of what it's done. Let's take a look at that: we're seeing a new-chat button and a new chat-history section on the left. Let's check whether I can start a new chat while keeping the previous one saved; I'll try out the new-chat button too. Great, it's all working. Now let's ask Claude to add some tests to make sure the features we just added work. Claude is asking for permission to run commands; we'll say yes. Claude makes some changes to run these tests; after getting the results, it continues with its plan until all tests pass. After a few minutes, it looks like we're good to go. Now I'm going to ask Claude to compile the app and see if we get any build errors; let's see what it finds. Claude identifies the build errors and is now fixing them; then it tries to build again, and it'll keep going until it works. Now let's finish everything up by asking Claude to commit its changes and push them to GitHub. Claude creates a summary and a description of our changes, and it pushes the changes to GitHub. That's it; that's an example of what Claude Code can do. We can't wait for people to start building."
There was also a benchmark I forgot to include: they've introduced a benchmark of the model playing Pokémon. It says that Claude 3.7 Sonnet demonstrates it is the very best of all the Sonnet models so far at playing Pokémon Red. I don't actually play Pokémon myself, but they explain that Pokémon is a fun way to appreciate Claude 3.7 Sonnet's capabilities, and that they expect these capabilities to have real-world impact beyond playing games, because the model's ability to maintain focus and accomplish open-ended goals will be helpful across the wide range of state-of-the-art AI agents being developed. That's why they did this, and I think these kinds of new benchmarks are going to be super entertaining and super interesting.
Of course, we also have this, which is something for the future. They state, and I quote: "Claude 3.7 Sonnet and Claude Code mark an important step towards AI systems that can truly augment human capabilities. With their ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches and expands what humans can achieve." They're essentially laying out a roadmap: first we had the assistants in 2024; then come the collaborators in 2025, where Claude does hours of independent work for you, on par with experts, expanding what every person or team is capable of; and then, by 2027, they predict pioneers, where Claude will be able to find breakthrough solutions to challenging problems that would have taken teams years to achieve. So the future for Claude is certainly bright, but let me know if you've used it already. If you go right here, you can see that Claude 3.7 Sonnet is available, and you can also see the thinking modes: you've got normal and you've got extended, so it's completely up to you which one you use. Hopefully you guys enjoyed the video, and I'll see you in the next one.