OpenAI’s Deep Research Team on Why Reinforcement Learning is the Future for AI Agents
Author: Sequoia Capital
Transcript:
A lesson that I've seen people learn over and over again in this field is that we think we can do things that are smarter than what the models do by writing it ourselves. But as the field progresses, the models come up with better solutions to things than humans do. Probably the number one lesson of machine learning is that you get what you optimize for. So if you're able to set up the system such that you can optimize directly for the outcome that you're looking for, the results are going to be much, much better than if you try to glue together models that are not optimized end to end for the tasks that you're trying to have them do. So my long-term guidance is that reinforcement learning tuning on top of models is probably going to be a critical part of how the most powerful agents get built.
[Music]
We're excited to welcome Isa Fulford and Josh Tobin, who lead the deep research product at OpenAI. Deep research launched three weeks ago and has quickly become a hit product, used by many tech luminaries like the Collisons for everything from industry analysis to medical research to birthday party planning. Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks, and it is the second product in a series of agent launches from OpenAI, the first being Operator. We talked to Isa and Josh about everything from deep research's use cases, to how the technology works under the hood, to what we should expect in future agent launches from OpenAI.

Isa and Josh, welcome to the show.

Thank you.

Thank you so much for joining us.

Excited to be here. Thank you for having us.

So maybe let's start with: what is deep research? Tell us about the origin story and what this product is doing.

So deep research is an agent that is able to search many online websites, and it can create very comprehensive reports. It can do tasks that would take humans many hours to complete. It's in ChatGPT, and it takes about 5 to 30 minutes to answer you, so it's able to do much more in-depth research and answer your questions with much more detail and more specific sources than a regular ChatGPT response would. It's one of the first agents that we've released. We released Operator pretty recently as well, so deep research is the second agent, and we'll release many more in the future.

What's the origin story behind deep research? When did you choose to do this, what was the inspiration, and how many people work on it? What did it take to bring this to fruition?

Good question. This was before my time.

Yeah, so I think maybe around a year ago we were seeing a lot of success internally with this new reasoning paradigm of training models to think before responding. We were focusing a lot on math and science domains, but the other thing that this new reasoning model regime unlocks is the ability to do longer-horizon tasks that involve agentic abilities. A lot of people do tasks that require a lot of online research or a lot of external context, and that involves a lot of reasoning and discriminating between sources; you have to be quite creative to do those kinds of things. I think we finally had models, or a way of training models, that would allow us to tackle some of those tasks. So we decided to start training models on browsing tasks first, using the same methods that we used to train reasoning models, but on more real-world tasks.

Was it your idea? And Josh, how did you get involved?

At first it was me and Yash Patil, who is at OpenAI working on a similar project that will be released at some point, which we're very excited about, and we built an original version. And then also Thomas Dimson, who's one of those people who is just an amazing engineer: he'll dive into anything and get loads of things done. So it was very fun.

Yeah, and I joined more recently. I rejoined OpenAI about six months ago from my startup. I was at OpenAI in the early days, and I was looking around at the projects when I rejoined, got very interested in some of our agentic efforts, including this one, and got involved through that.
Amazing. Well, tell us a little about who you built it for.

Yeah, I mean, it's really for anyone who does knowledge work as part of their day-to-day job, or really as part of their life. We're seeing a lot of the usage come from people using it for work, doing research as part of their jobs: understanding markets, companies, real estate, a lot of scientific research. I think we've seen a lot of medical examples as well. And one of the things we're really excited about is that this style of task, where I need to go out and spend many hours doing a bunch of web searches and collating a bunch of information, is not just a work thing. It's also useful for shopping and travel. So we're excited for the Plus launch, so that more people will be able to try deep research, and maybe we'll see some new use cases as well.

It's definitely one of the products I've used the most over the last couple of weeks. It's been amazing.

Using it for work?

For work, definitely, but also for fun.

What are you using it for?

Oh, for me? Oh my goodness. So I was thinking about buying a new car, and I was trying to figure out when the next model of the car was going to be released. There are all these speculative blog posts, and there are patterns from the manufacturer, so I asked deep research: can you break down all the gossip about this car, and then all of the facts about what this automaker has done before? It put together an amazing report that told me to maybe wait a couple of months: the new model should come out this year, in the next few months.

Yeah, one of the things that's really cool about it is that it's not just for going broad and gathering all of the information about a topic; it's also really good at finding very obscure, weird facts on the internet. If you have something very specific you want to know that might not turn up on the first page of search results, it's good at that kind of thing too. So that's cool.

What are some of the surprising use cases that you've seen?

Oh, I think the thing I've been most surprised by is how many people are using it for coding, which wasn't really a use case I'd considered. But I've seen a lot of people on Twitter, and in various places where we get feedback, using it for coding and code search, and also for finding the latest documentation on a certain package and helping them write a script. So I'm kind of embarrassed that we didn't think of that as a use case, because for ChatGPT users it seems so obvious. But it's impressive how well it works.

How do you think the balance of business versus individual use cases will evolve over time? You mentioned the Plus launch that's happening. In a year's time, or two years' time, would you guess this is mostly a business tool or mostly a consumer tool?

I would say hopefully both. I think it's a pretty general capability, and it's something that we do both at work and in personal life, so I'm excited about both.

I think the magic of it is that it just saves people a lot of time. If there's something that might have taken you hours, or in some cases, we've heard, days, people can just put it in here and get 90% of what they would have come up with on their own. I tend to think there are more tasks like that in business than in personal life, but for sure it's going to be part of people's lives in both.

Yeah, it's really become the majority of my ChatGPT usage. I just always pick deep research rather than regular ChatGPT. So what are you seeing in terms of consumer use cases, and what are you excited about?

I think a lot of shopping and travel recommendations. I've personally used the model a lot; I've been using it for months to do these kinds of things. We were in Japan for the launch of deep research, so it was very helpful in finding restaurants with very specific requirements, and finding things that I wouldn't necessarily have found.
Yeah, and I've found it useful for the kind of thing where you're shopping for something expensive, or planning a trip that is special, or anything else you want to spend a lot of time thinking about. I might go and spend hours and hours trying to read everything on the internet about this one product that I'm interested in buying, scouring all the reviews and the forums and stuff like that. Deep research can put together something like that very quickly, so it's really useful for that kind of thing.

The model is also very good at instruction following. So if you have a query with many different parts or many different questions, say you want information about the product, but you also want comparisons to all other products, and you also want information about reviews from Reddit or something like that, you can give loads of different requirements and it will do all of them for you.

Yeah, another tip is to just ask it to format the results in a table. It will usually do that anyway, but it's really helpful to have a table with a bunch of citations and things like that for all the categories of things that you want to research.

Yeah, there are also some features that hopefully we get into the product at some point. The underlying model is able to embed images, so it can find images of the products. And, though this is not a consumer use case, it's able to create graphs as well and then embed those in its response. So hopefully that will come to ChatGPT soon as well.

Nerdy consumer use case.

Yeah, yeah. Well, and speaking of nerdy consumer use cases, personalized education is also a really interesting use case. If there's a topic that you've been meaning to learn about, if you need to brush up on your biology, or you want to learn about some world event, it's really good. You put in all the information about what you feel like you don't understand and which aspects you want it to go research, and it'll put together a nice report for you.

One of my friends is considering starting a CPG company, and he's been using it so much: to find similar products, to see if specific names or domains are already taken, for market sizing, all of these different things. He'll share the reports with me and I'll read them, so it's been pretty fun to see.

Another fun use case is that it's really good at finding a single obscure fact on the internet. If there's an obscure TV show or something, and you want to find one particular episode of it, it'll go deep and find the one reference to it on the web.

Oh yeah, my brother's friend's dad had this very specific fact. It was about some Austrian general who was in power during the death of someone in a certain battle, a very niche question. Apparently ChatGPT had previously answered it wrong, and he was very sure that it was wrong, so he went to the public library, found a record, and confirmed that it was wrong. Then deep research was able to get it right, so we sent it to him and he was excited.

What is the rough mental model for what deep research is excellent at today? Where should people be using the o-series of models, and where should they be using deep research?

What deep research really excels at is when you have a detailed description of what you want, and getting the best possible answer requires reading a lot of the internet. If you have more of a vague question, it'll help you clarify what you want, but it's really at its best when there's a specific set of information that you're looking for.

I think it's very good at synthesizing the information it encounters, and it's very good at finding specific, hard-to-find information. It can make some new insights from what it encounters, but I don't think it's making new scientific discoveries yet. And then, as for the o-series models: for me, if I'm asking for something to do with coding, it usually doesn't require knowledge outside of what the model already knows from pre-training, so you would usually use o1 pro or o1 for coding, or o3-mini-high.

And so deep research is a great example of where some of the new product directions for OpenAI are going. I'm curious, to the extent you can share, how does it work?
The model that powers deep research is a fine-tuned version of o3, which is our most advanced reasoning model, and we specifically trained it on hard browsing tasks that we collected, as well as other reasoning tasks. It also has access to a browsing tool and a Python tool. Through training end to end on those tasks, it learned strategies to solve them, and the resulting model is good at online search and analysis.

Yeah, and intuitively, the way you can think about it is this: you make a request, ideally a detailed request about what you want. The model thinks hard about that. It searches for information, pulls that information, and reads it. It understands how it relates to your request, and then decides what to search for next in order to get closer to the final answer that you want. And it's trained to do a good job of pulling together all of that information into a nice, tidy report, with citations that point back to the original information that it found.

Yeah, I think what's new about deep research as an agentic capability is that we have the ability to train end to end, and there are a lot of things that you have to do in the process of doing research that you couldn't really predict beforehand. So I don't think it's possible to write some kind of language model program or script that would be as flexible as what the model is able to learn through training, where it's actually reacting to live web information and, based on something it sees, has to change its strategy, and things like that. So we actually see it doing pretty creative searches. You can read the chain-of-thought summary, and you can see that sometimes it's very smart about how it comes up with the next thing to look for.
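The loop described above (search, read, decide what to look for next, then write a cited report) can be sketched in a few lines. This is a hypothetical toy, not OpenAI's implementation. `CORPUS` stands in for live web pages and `search` is a simple word-overlap lookup. `next_query` just pops a fixed queue, whereas in deep research the trained model itself chooses the next search based on what it has read.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy corpus standing in for live web pages.
CORPUS = {
    "https://example.com/a": "The next model was released in March.",
    "https://example.com/b": "Owners praise the battery life.",
}

def search(query: str) -> list[str]:
    """Toy search tool: return URLs whose text shares a word with the query."""
    words = set(query.lower().split())
    return [url for url, text in CORPUS.items()
            if words & set(text.lower().rstrip(".").split())]

@dataclass
class ResearchState:
    request: str
    queue: list[str]
    notes: list[tuple[str, str]] = field(default_factory=list)  # (url, text) pairs
    seen: set[str] = field(default_factory=set)

def next_query(state: ResearchState) -> Optional[str]:
    """Stand-in policy. In deep research the trained model reads the notes and
    decides what to search for next; this toy just pops a fixed queue."""
    return state.queue.pop(0) if state.queue else None

def research(request: str, seed_queries: list[str], max_steps: int = 10) -> str:
    state = ResearchState(request, list(seed_queries))
    for _ in range(max_steps):
        query = next_query(state)
        if query is None:
            break
        for url in search(query):
            if url not in state.seen:  # read each page only once
                state.seen.add(url)
                state.notes.append((url, CORPUS[url]))
    # Final report: every line keeps a citation pointing back to its source.
    return "\n".join(f"- {text} [{url}]" for url, text in state.notes)

print(research("When is the next model out?", ["released", "battery"]))
```

In this sketch, the step that chooses what to do next sits inside the loop. End-to-end training is what lets a model make that choice well, instead of a script deciding it in advance.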
John Collison had a tweet that went somewhat viral: how much of the magic of deep research is real-time access to web content, and how much of the magic is in chain of thought? Can you maybe shed some light on that?

I think it's definitely a combination. You can see that because there are other search products that weren't trained end to end, so they won't be as flexible in responding to the information they encounter. They also won't be as creative about how to solve specific problems, because they weren't specifically trained for that purpose. So it's definitely a combination. I mean, it's a fine-tuned version of o3, and o3 is a very smart and powerful model; a lot of the analysis capability also comes from the underlying o3 model training. So I think it's definitely a combination.
Before OpenAI, I was working at a startup, and we were dabbling in building agents, kind of the way that I see most people describe building agents on the internet. Essentially, you construct a graph of operations, and some of the nodes in that graph are language models. So the language model can decide what to do next, but the overarching logic of the sequence of steps that happen is defined by a human.
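The hand-built approach described here can be sketched roughly as follows. This is a hypothetical illustration, not code from Josh's startup. `keyword_decider` stands in for a language-model call, and the node names (`search_news`, `write_table`, and so on) are made up. The point is structural: a human fixes the order of steps and the set of branches, and the model only picks among the options offered at each node.

```python
from typing import Callable

# A "decider" stands in for a language-model call at a graph node:
# given the request and the allowed options, it returns one option.
Decider = Callable[[str, list[str]], str]

def keyword_decider(request: str, options: list[str]) -> str:
    """Hypothetical toy 'model': pick the first option mentioned in the request,
    falling back to the last option."""
    text = request.lower()
    for option in options:
        if option in text:
            return option
    return options[-1]

def run_graph(request: str, decide: Decider) -> list[str]:
    """Human-defined graph: the steps and their order are fixed in code,
    and the model only chooses among the branches offered at each node."""
    trace = ["parse_request"]
    source = decide(request, ["news", "papers", "web"])  # node 1: where to search
    trace.append(f"search_{source}")
    fmt = decide(request, ["table", "summary"])          # node 2: output format
    trace.append(f"write_{fmt}")
    # A request needing, say, a second round of searching has no path here.
    return trace

print(run_graph("compare laptops in a table using news", keyword_decider))
```

Anything the graph's author did not anticipate has no path through `run_graph`. That is the failure mode described next.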
What we found is that it's a really powerful way of building things to get quickly to a prototype, but it falls down pretty quickly in the real world. It's very hard to anticipate all the scenarios that the model might face and to think through all the different branches of the path that you might want to take. In addition to that, the models are often not the best decision makers at the nodes in that graph, because they weren't trained to make those decisions; they were trained to do things that look similar. And so I think the thing that's really powerful about this model is that it's trained directly end to end to solve the kinds of tasks that users are using it to solve. You don't have to set up a graph or make those node-level decisions in the architecture on the back end; it's all driven by the model itself.

Yeah, can you say more about this? It seems like that's one of the very opinionated decisions that you've made, and clearly it's worked. There are so many companies building on your API, kind of prompting it to solve specific tasks for specific users. Do you think a lot of those applications would be better served by having models trained end to end for their specific workflows?
I think if you have a very specific workflow that is quite predictable, it makes a lot of sense to do something like what was just described. But if you have something that has a lot of edge cases, or that needs to be quite flexible, then I think something similar to deep research is probably a better approach.

Yeah, I think the guidance I give people is that the one thing you don't want to bake into the model is hard-and-fast rules. If you have a database that you don't want the model to touch, or something like that, it's better to encode that in human-written logic. But a lesson that I've seen people learn over and over again in this field is that we think we can do things that are smarter than what the models do by writing it ourselves. In reality, as the field progresses, the models come up with better solutions to things than humans do. And probably the number one lesson of machine learning is that you get what you optimize for. So if you're able to set up the system such that you can optimize directly for the outcome that you're looking for, the results are going to be much, much better than if you try to glue together models that are not optimized end to end for the tasks that you're trying to have them do.
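The one exception named here, hard-and-fast rules like a database the model must not touch, is the kind of thing that can be enforced in plain code outside the model. Below is a minimal sketch, assuming a hypothetical tool-call format. `ToolCall`, `PROTECTED_DATABASES`, and `execute` are illustrative names, not part of any real agent framework.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; not part of any real agent framework.
PROTECTED_DATABASES = {"billing_prod"}

@dataclass
class ToolCall:
    tool: str       # e.g. "sql"
    database: str   # database the model wants to use
    statement: str  # what the model wants to run

class PolicyViolation(Exception):
    """Raised when a model-proposed call breaks a human-written rule."""

def enforce_rules(call: ToolCall) -> None:
    # Hard rule, checked in ordinary code: the model never touches these databases.
    if call.database in PROTECTED_DATABASES:
        raise PolicyViolation(f"access to {call.database!r} is not allowed")

def execute(call: ToolCall) -> str:
    enforce_rules(call)  # runs before any tool executes, whatever the model decided
    return f"ran {call.tool} on {call.database}"  # stand-in for a real tool call

print(execute(ToolCall("sql", "analytics", "SELECT 1")))
```

The check runs in ordinary code before any tool executes, so it holds regardless of what the model has learned. Every other decision is left to the model.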
So my long-term guidance is that reinforcement learning tuning on top of models is probably going to be a critical part of how the most powerful agents get built.

What were the biggest technical challenges along the way to making this work?

Well, maybe I can speak as an observer, rather than someone who was involved in this from the beginning. It seems like one of the things that Isa and the rest of the team worked really, really hard on, and one of the hidden keys to success, was making really high-quality datasets. It's another one of those age-old lessons in machine learning that people keep relearning: the quality of the data that you put into the model is probably the biggest determining factor in the quality of the model that you get out the other side.

And then having someone like Edward Sun, another person who works on the project, who will just optimize any dataset. So that's the secret to success.

Find your Edward.

Yes.

Great, great. Machine learning model training... How do you make sure that it's right?
Yeah, so that's obviously a core part of this model and product: we want users to be able to trust the outputs. Part of that is that we have citations, so users are able to see where the model is citing information from, and during training that's something we actually try to make sure is correct. But it's still possible for the model to make mistakes, hallucinate, or trust a source that maybe isn't the most trustworthy source of information. So that's definitely an active area where we want to continue improving the model.

How should we think about this together with o3 and Operator and other releases? Does this use Operator? Do these all build on top of each other, or are they kind of a series of different applications of o3?

Today these are pretty disconnected, but you can imagine where we're going with this, right? The ultimate agent that people will have access to at some point in the future should be able to do not just web search, or using a computer, or any of the other types of actions that you'd want a human assistant to do. It should be able to fuse all these things in a more natural way.

Any other design decisions that you've made that are maybe not obvious at first glance?

I think one of them is the clarification flow. If you've used deep research, you'll know that the model asks you questions before starting its research. ChatGPT will usually ask you a question at the end of its response, if at all, but it usually doesn't have that kind of behavior up front.
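The clarification flow can be sketched as a gate in front of the long-running research job. In the real product the model itself decides what to ask. The fixed checklist in `REQUIRED_DETAILS` below is a hypothetical stand-in, used only to show the control flow of asking before starting.

```python
# Hypothetical checklist: each required detail and some words that suggest the
# user already supplied it. The real product asks via the model, not a list.
REQUIRED_DETAILS = {
    "budget": ["budget", "$", "price"],
    "timeframe": ["by ", "within", "month", "year"],
    "output format": ["table", "report", "list"],
}

def clarifying_questions(prompt: str) -> list[str]:
    """Return one question per required detail the prompt does not mention."""
    text = prompt.lower()
    missing = [name for name, cues in REQUIRED_DETAILS.items()
               if not any(cue in text for cue in cues)]
    return [f"Could you tell me the {name} you have in mind?" for name in missing]

def start_research(prompt: str) -> str:
    questions = clarifying_questions(prompt)
    if questions:
        # Ask up front rather than spend 5 to 30 minutes on an underspecified task.
        return "Before I start:\n" + "\n".join(questions)
    return "Starting research..."

print(start_research("Find me a laptop"))
```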
That was intentional, because you will get the best response from the research model if the prompt is very well specified and detailed, and I think it's not natural user behavior to give all of the information in the first prompt. We wanted to make sure that if you're going to wait 5 or 30 minutes, your response is as detailed and satisfactory as possible, so we added this additional step to make sure that the user provides all the detail that we need. I've actually seen a bunch of people on Twitter saying that they have their own version of this flow: they talk to o1 or o1 pro to help make their prompt more detailed, and once they're happy with the prompt, they send it to deep research. That's interesting. People are finding their own workflows for how to use this.

So there have been three different deep research products launched in the last few months. Tell us a little about what makes yours special and how we should think about it.

And they're all called deep research, right?

They're all called deep research. Yeah, not a lot of naming creativity in this field.

I think people should try them all for themselves and get a feel. They all have pros and cons, but I think the difference in quality will be clear. What that comes down to is just the way that this model was built: the effort that went into constructing the datasets, and then the engine that we have with the o-series models, which allows us to optimize models to make things that are really smart and really high quality.

We had the o1 team on the podcast last year, and we were joking that OpenAI is not that good at naming things. I will say this is your best-named product.

Deep research.

At least it describes what it does, I guess.
Yeah. So I'm curious to hear a little about where you want to go from here. You have deep research today. What do you think it looks like a year from now, and what complementary things do you want to build along the way?

We're excited to expand the data sources that the model has access to. We've trained a model that's generally very good at browsing public information, but it should also be able to search private data. And then just pushing the capabilities further: it could be better at browsing, and it could be better at analysis.

And then there's how this fits into our agent roadmap more broadly. I think the recipe here is going to scale to a pretty wide range of use cases, including some that will surprise people with how well they work. You take a state-of-the-art reasoning model, you give it access to the same tools that humans use to do their jobs or go about their daily lives, and then you optimize directly for the kinds of outcomes that you want the agent to be able to achieve. There's really nothing stopping that recipe from scaling to more and more complex tasks. So I feel like AGI is an operational problem now, and there are a lot of things to come from that general formula.

So, Sam had a pretty striking quote: that deep research will take over a single-digit percentage of all economically valuable tasks in the world. How should we think about that?

I think of it this way: deep research is not capable of doing all of what you do, but it is capable of saving you hours, or in some cases days, at a time. So what we're hopefully relatively close to is deep research, the agents that we build next, and the agents that we build on top of it giving you 1, 5, 10, or 25% of your time back, depending on the type of work that you do.

I mean, I think you've already automated 80% of what I do, so it's definitely on the higher end for me.

We just need to start writing checks, I guess.
Yeah. Are there entire job categories that you think are more at risk? "At risk" is the wrong word, but more in the strike zone for what deep research is exceptional at? For example, I'm thinking of consulting. Are there specific categories that you think are more in the strike zone?

Yeah, I used to be a consultant. I don't think any jobs are at risk. I don't really think of this as a labor-replacement kind of thing at all. But for knowledge-work jobs where you spend a lot of your time looking through information and drawing conclusions, I think it's going to give people superpowers.

Yeah, I'm very excited about a lot of the medical use cases: just the ability to find all of the literature, or all of the recent cases, for a certain condition. I've already seen a lot of doctors posting about this, or reaching out to us to say, "We used it for this thing; we used it to help find a clinical trial for this patient," or something like that. So for people who are already so busy, it's either saving some time, or it's something they wouldn't have had time to do at all, and now they are able to have that information.

Yeah, and I think the impact of that is maybe a little more profound than it sounds on the surface, right? It's not just getting 5% of your time back. It's that the type of task that might have taken you four or eight hours, you can now do for a ChatGPT subscription and 5 minutes. So what would you do if you had infinite time, that now maybe you can do many, many copies of? Should you research every single possible startup that you could invest in, instead of just the ones that you have time to meet with? Things like that.

Or on the consumer side, one thing that I'm thinking of is the working mom who's too busy to plan a birthday party: now it's doable. So I agree with you, it's way more important than 5% of your time. It's all the things you couldn't do before.
Exactly.

What does this change about education and the way we should learn? What will you be teaching your kids, now that we're in a world of agents and deep research?

Yeah, education has been one of the top few things that people use it for. I think this is true for ChatGPT generally. Learning things by talking to an AI system that is able to personalize the information it gives you, based on what you tell it, or maybe in the future what it knows about you, feels like a much more efficient and much more engaging way to learn than reading textbooks.

We have some lightning round questions.

All right.

Okay. Your favorite deep research use case?

I'll say personalized education: just learning about anything I want to learn about.

I've already mentioned this, but I think a lot of the personal stories that people have shared about finding information about a diagnosis that they received, or that someone in their family received, have been really great to see.

Okay. We saw a few application categories break out last year, with coding being an obvious one. What application categories do you think will break out this year?

I mean, clearly agents.

Agents. I was going to say that too.

Okay, 2025 is the year of the agent.

I think so, yeah.

And then, what piece of content would you recommend people read to learn more about agents, or about where the state of AI is going? It could be an author too.

Training Data, this podcast.

I think it's so hard to keep up with the state of the art in AI. The general advice I have for people is to pick one or two subtopics that you're really interested in and curate a list of people who you think are saying interesting things about them. And as for how to find those one or two things you're interested in, maybe that's actually a good deep research use case: use it to go deep on the things you want to learn more about.

This is a bit old now, but a few years ago I watched a course called, I think, Foundations of RL, or something like this, from Pieter Abbeel. It's a few years old, but I think it was a good introduction to reinforcement learning.

Yeah, I would definitely second any content by Pieter Abbeel, my grad school advisor.

Yeah. Oh, yeah.
Okay, reinforcement learning: it kind of went through a peak, then felt like it was in a bit of a doldrum, and now it's peaking again. Is that the right read on what's happening with RL?

It's so back.

So back, yeah.

Why now?

Because everything else is working. People who have been following the field for a while will remember the Yann LeCun cake analogy. If you're building a cake, most of the cake is the cake, then there's a little bit of frosting, and then there are a few cherries on top. The analogy was that unsupervised learning is the cake, supervised learning is the frosting, and reinforcement learning is the cherries on top. When we in the field were working on reinforcement learning back in 2015 or 2016, Yann LeCun's point, which in retrospect is probably correct, was that we were trying to add the cherries before we had the cake. But now we have language models that are pre-trained on massive amounts of data and are incredibly capable. We also know how to do supervised fine-tuning on those language models to make them good at instruction following and at generally doing the things that people want them to do. Now that that works really well, the time is very ripe to tune those models for any kind of use case that you can define a reward function for.
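Tuning against a definable reward can be illustrated at toy scale. The sketch below is hypothetical and is not how deep research was trained; the conversation does not describe the training algorithm. It uses REINFORCE, the basic policy-gradient method, on a softmax policy over three made-up "strategies". The only learning signal is the scalar returned by `reward_fn`, and the scores in `OUTCOME` are invented for illustration.

```python
import math
import random
from typing import Callable

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(reward_fn: Callable[[int], float], n_actions: int,
              steps: int = 3000, lr: float = 0.5, seed: int = 0) -> list[float]:
    """REINFORCE on a softmax policy with a running-average baseline.
    The only learning signal is the scalar returned by reward_fn."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(n_actions):
            indicator = 1.0 if k == action else 0.0
            logits[k] += lr * advantage * (indicator - probs[k])
        # Baseline tracks the recent average reward to reduce variance.
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)

# Hypothetical outcome scores for three made-up research strategies.
OUTCOME = [0.1, 0.4, 0.9]
probs = reinforce(lambda action: OUTCOME[action], n_actions=3)
print([round(p, 3) for p in probs])
```

The update sees only the reward, so changing `reward_fn` changes what the policy learns. That is "you get what you optimize for" in miniature.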
Okay, so from this lightning round we got: agents will be the breakout category in 2025, and reinforcement learning is so back. I love it. Thank you guys so much for joining us. We loved this conversation. Congratulations on an incredible product, and we can't wait to see what comes of it.

Thank you.

Thank you.