OpenAI Just Released ChatGPT Agent, Its Most Powerful Agent Yet
Isa Fulford, Casey Chu, and Edward Sun from OpenAI's ChatGPT agent team reveal how they combined Deep Research and Operator into a single, powerful AI agent that can perform complex, multi-step tasks lasting up to an hour. By giving the model access to a virtual computer with text browsing, visual browsing, terminal access, and API integrations—all with shared state—they've created what may be the first truly embodied AI assistant. The team discusses their reinforcement learning approach, safety mitigations for real-world actions, and how small teams can build transformative AI products through close research-applied collaboration. Hosted by Sonya Huang and Lauren Reeder, Sequoia Capital
- Published Jul 22, 2025
- Uploaded Jun 11, 2026
- File type: Podcast
Full transcript
AI-generated transcript with timestamped sections.
[00:00] I think this model... [00:01] is actually very good at multi-turn conversations. [00:03] And it's very nice to continue working on a task with. I think that's one of the deficiencies of Deep Research. [00:10] A lot of people will do multiple Deep Research requests in a single conversation, but it doesn't always work so well. So I think we're really happy with [00:17] this model's multi-turn ability, and we just want to, [00:20] you know, improve even further. And then I also think, like, personalization and memory for agents will also be very important. And right now, every agent task is initiated by the user, but in future, it should also be doing things for you without you having to even ask in the first place. [00:36] Bye. [00:51] *Bell rings* [00:54] Today we're exploring the evolution of AI agents with Isa Fulford, Casey Chu, and Edward Sun, the OpenAI team behind the new ChatGPT agent. [01:02] You'll learn how they got to a huge leap forward in capability by unifying the architecture across Deep Research and Operator, allowing for multiple tools to share state, giving users fluid transitions between visual browsing, [01:14] text analysis, and code execution, all within a single environment. [01:19] We'll discuss their training approach. Rather than programming specific tool usage patterns, they let the models discover the optimal strategies through reinforcement learning across thousands of virtual machines. [01:29] They've created an agent that can work alongside you for hours,
[01:32] asking clarifying questions and accepting mid-task corrections, [01:36] expanding the ways that we can interact with AI agents. [01:39] The team shares fascinating challenges around safety, [01:42] guardrails around agent activities, and why things like date picking still remain mysteriously difficult for AI systems. [01:49] They've revealed how small, focused teams are achieving breakthrough capabilities through careful data curation, suggesting that we're now entering a new phase of AI development where product insights matter just as much as compute power. [02:01] Enjoy the show. [02:03] Isa, Casey, Edward, thank you for joining us today. Thank you so much for having us. So you're the team behind the ChatGPT agent, or agent mode. [02:12] What is it? Yeah, so this has been a collaboration between the Deep Research, the former Deep Research and Operator teams, and we've created a new agent in ChatGPT [02:21] that's able to carry out tasks that would take humans a long time. And we gave the agent access to a virtual computer. And through that, it has a few different, two different ways to access the internet, or actually [02:35] more ways, but we'll get to that. Um, it has a text browser, um, which is similar to the Deep Research tool, so it's able to efficiently, um, access information online and search through things with this very, um, [02:47] fast text browsing tool. And then it also has a virtual browser, which is similar to the Operator tool. So it actually has full access to the graphical user interface. And it's able to click and type things into forms and scroll and drag and all this kind of stuff.
[03:06] you know, all these kinds of things. So together it's much more powerful than either of those two tools, because one's more efficient and one's, like, much more flexible. And then we also gave it access to a terminal. So it's able to run code and analyze files and, you know, create [03:22] artifacts for you like spreadsheets or slides. We also, [03:26] through the terminal, it's able to call APIs. So either public APIs or private APIs; if you sign in, it could access your GitHub or Google Drive, SharePoint, [03:36] many other things. The cool thing about this tool is all of the tools have shared state, so it's similar to if you're using a computer: all of your different applications have access to the same file system and things like that. It's the same for the tool, so the model can do quite flexible things. We'll talk more about this later, but I think it's just a very flexible way for the model to do very complex tasks on behalf of [04:01] users. [04:03] Tell us a little about the origin story. How did this get started? [04:05] Well, our team worked on Operator. And our team worked on Deep Research. And so back in January, we released our first agent, Operator. [04:15] This is a product that can do internet tasks for you, like buy things on the internet, shop for you, this kind of thing. [04:22] And then two weeks later, we released Deep Research, which is a different model that's, or a different product that's able to extensively browse the internet, synthesize information, and it creates a long research report with citations for you.
[04:36] And we were thinking through our roadmap, [04:38] and we were kind of like, hey, this is kind of a match made in heaven here. Like, so, you know, Operator is really good at visual, you know, interacting with a web page. But it's less good at kind of the text browser, like reading long articles. [04:52] Whereas Deep Research is really good at reading long articles, but it has a tougher time with interactive elements or highly visual things. [04:59] Yeah, because the tools are different. So Deep Research has a text browser, so it's able to really efficiently read information and search and [05:07] synthesize information, but, um, it's not able to, like, [05:10] uh, scroll and click in the same way or fill out forms in the same way that Operator is, because it has actually full access to the GUI browser. Um, and as Casey was mentioning, like, all [05:21] uh, [05:22] Deep Research has some things that Operator doesn't have. And then similarly, one of the biggest requests for Deep Research is [05:27] for the model to be able to access, like, paywalled sources or things that you have to [05:32] pay a subscription for, and Operator is able to do that. And also, one of the members of our team, Eric, he was running an analysis on the types of prompts that people were trying on Operator. And we realized that it was a lot of deep research type tasks, like research this trip for me, [05:47] then book it. [05:48] So it really is a natural combination. [05:51] In what way is 1 plus 1 equals 3? [05:53] So in Deep Research, we always wanted to figure out how to let Deep Research have access to a real browser that can load in all the real contents that the previous Deep Research cannot have access to.
[06:06] It's funny that you bring up the 1 plus 1 equals 3, because not only did we combine Deep Research and Operator, but we also threw in a bunch of other tools that are basically everything we could think of. So the terminal tool is there, so it can run commands to do calculations. [06:25] The image gen tool is a fun one. If it wants to spruce up its slides by making an image, it can do that. [06:32] You can call APIs. [06:34] Can produce PowerPoints. Yes. Yeah, it can do a lot of different things. [06:38] Yeah, tell us a little bit how people are using it, knowing it's still early days. So... [06:42] I think the cool thing about it is we have some ideas... [06:45] of how we think people are going to use it. But I think we intentionally kept it quite open-ended. I mean, it's called Agent, that's so [06:51] vague, [06:52] partially because we are excited to see how people end up using it. So I think some of the things that we specifically trained it for, of course, [06:59] deep research type tasks, so things where you want a long report on a topic, um, Operator type tasks where you want it to do something for you, like book something or, um, [07:08] book a flight, buy something for you, and then also tasks to make slide decks. [07:15] We also... [07:17] you know, spent a lot of effort on making spreadsheets and doing data analysis. But I think there are also just so many other things a model can do. So we're just excited to see [07:25] how people use it. Kind of similarly to how when we launched Deep Research, [07:29] we saw a lot of people using it for code search, which was really surprising to us. [07:32] We're hoping to see a lot of [07:34] new use cases that we didn't even think of ourselves.
[07:36] Would you guess it'd be more consumer or kind of B2B type use cases? [07:41] Or is that a false dichotomy? Hopefully both. Okay. I think we're kind of aiming for the prosumer, someone who's willing to wait 30 minutes for a detailed report, but that can be in the consumer case or at your job. I think it could be good for both. Do any of you have favorite things you've used it for? For me, it's more like pulling data from our spreadsheets or Google Docs, like documenting our experiment log, [08:11] data is pretty useful. [08:12] I've been doing a deep dive into ancient DNA. It's just one of my interests. And there's actually a lot of exciting work going on these past five years. They're sequencing all this DNA and discovering all these facts about, oh, where did this group of people come from? And historical stuff. [08:31] The problem is that everything is so new that... [08:36] there isn't a reference source material to summarize, like a survey of these materials. [08:43] But, you know, [08:45] agent can go out and pull together all these sources and synthesize it into a report that I can read or slides that I can read, and I think it's kind of [08:54] made for this topic. Yeah, I like it for [08:56] consumer use cases. Like, I've used it for online shopping, I think especially because a lot of, um, websites require using the visual browser because it will have a search filter or something that it needs to go through, or the model, like,
[09:08] actually needs to be able to see [09:10] what the item looks like. [09:12] And then also for planning events, it's been pretty useful. [09:16] What's your favorite shopping query? I think I was using it [09:19] For... [09:20] for clothes shopping. Love it. And you guys also showed us a really cool use case right before we filmed this episode. Do you want to share that one? Yeah, sure. So that was actually something that one of our co-workers, Tejal, shared with us. She asked the agent to estimate OpenAI's valuation and create, based on things that it found online, create a financial model with projections to create a spreadsheet. And [09:48] also create a summary analysis, and then also create a slide deck presenting the results. [09:52] And so... [09:53] Hopefully the model is correct because it had quite an ambitious projection for us. It was an impressive slide deck. [10:01] One thing I want to point out about this [10:04] trajectory was that it reasoned for I think 28 minutes and [10:09] Yeah, I think this is kind of opening up a new paradigm where you ask the agent for a task, and then you step away, and it comes back with a report. [10:16] and [10:19] Yeah, I think as agents become more agentic, it'll be longer and longer tasks, and this is a good example of one. [10:26] Are these the longest running tasks you guys have launched so far? [10:29] I would say so. I just did one that was an hour long, and I don't think I've ever seen that. I didn't know how long Codex can run for. That's true, yeah. [10:37] Is there anything special that goes into making an agent run for so long without flying off the rails?
[10:43] We have some tools to enable the model to be able to further extend this context length beyond what's the hard limit so that the model is able to [10:58] perform tasks by documenting what it is [11:01] doing and step by step, like, kind of like, increase the time, like, it can do the task, the horizon of the task it can do without the human's interruption. Yeah. It's also the flow to go back and forth between the model and the human also is very nice. So I can correct it as it's going, right? Yeah. So this model is very flexible and collaborative, and that was very important to us. [11:25] So it's modeled after how you would interact with someone if you ask them to do a task for you. [11:30] So, [11:31] imagine you're asking someone on Slack to do something for you. You'd probably give them instructions. [11:36] And then they'd ask you some questions, [11:38] and then maybe start doing the task, and then maybe in the middle of the task [11:42] they'll say, [11:43] oh, actually, can you... [11:45] clarify this for me, or can you sign into this thing for me, or am I allowed to do this for you? And similarly, you might remember something that you forgot to say when you first gave them the task and you might want to interrupt them and just say, oh, hey, please also do this. [12:00] Or you might want a status update if they're taking a long time to do it. Or you might want to redirect them if they're going on the wrong path. So... [12:07] That's what we modeled it after. And I think it's very important that the user and agent are both able to initiate communication with each other. So I think what we have now is probably the most
[12:18] basic version of what this could be. But it's, um, [12:21] better than anything [12:23] we've released before in this area, because at first the model can, or the agent can ask [12:30] you clarifying questions similar to Deep Research, but it's more flexible, so it doesn't always ask you clarifying questions. Um, and then [12:38] you can interrupt the model, so you can say, oh, can you summarize what you've done so far? Or, oh, I forgot to say I actually only want, like, blue sneakers. Um, [12:48] And then if the model is going to take some kind of destructive action, or if it needs you to log into something, it will also ask the user if it's allowed to do that before doing anything. On this topic, I think it's a really good question. [13:00] We... [13:02] We kind of built this computer interface, you guys saw it, where you can kind of watch along with what the agent is doing. And that actually persists beyond the conversation. So once it's done with the task, you can actually go back and ask it follow-up questions and ask it to fix something or do another task. [13:23] You can also take over that computer, so you can click in and then now you have access to its environment and you can, like, click for it or, like, log in for it or insert your credit card information or things like that. [13:38] Yeah, I like to think of it as, like, [13:39] looking over your coworker's shoulder and, like, [13:42] being able to take over if necessary. [13:45] Thank you for enabling the micromanager in me.
[13:50] Just kidding. [13:51] So we'd love to talk a little bit about how this works to the extent that you can share. [13:54] Yes, so this agent is trained with the same technique as o1, reinforcement learning. So we give this agent model all the tools we have implemented [14:05] in the same virtual machine, like a text browser, a GUI browser, a terminal, and the image gen tool. And then the model will try to solve the tasks we created, [14:17] which are pretty hard tasks that the model has to complete using these tools. And then, kind of like, we reward the model if the model completed the task efficiently and correctly. And for example, like, after this training, the model is able to [14:35] learn to switch between these tools fluently. For example, if you ask the model to research some restaurants and maybe book a spot for you, it will first do a Deep Research style text-based browsing. And then it will probably also use the GUI browser to view the images of the food and also view the availability, which is [15:00] usually written in JavaScript, so you have to use a real GUI browser. Then, for example, if you ask it to create an artifact, it usually can pull sources from a website and then use them in the terminal. [15:16] I think the cool thing about this tool compared to tool use implementation in the past is that
[15:21] all of the tools have shared state, so [15:24] it's like when you're using your computer and you have many different applications, you know if you download something, it's going to be accessible to other applications. It's very similar. So the model can open a page in the text browser, which is more efficient, but then maybe it realizes it needs the visual browser, so it can just seamlessly switch, or it could download something using the... [15:44] using the browser and then in terminal it manipulates it or something like that. It can run something in terminal and then open it in the browser. It's very flexible. [15:52] And so it's just giving the model a more powerful [15:54] way of [15:56] interacting with the internet and files in its file system and code and things like that. [16:01] One interesting thing to emphasize is that... [16:04] we essentially give the model all these tools and then lock it in the room and then it experiments. We don't really tell it when to use what tool. It kind of figures that out by itself. It's kind of almost magic. [16:15] Is the technique, it sounds very similar to, you know, Deep Research. We had you on the podcast before. [16:21] Should we think about this as the standard technique of how OpenAI thinks that agents will be trained going forward? [16:27] I think [16:28] we can take this... [16:30] really far... [16:31] Um... [16:33] You know, this was, we haven't, our teams haven't been collaborating for that long. [16:38] We even framed this model run as, [16:41] um... [16:42] kind of a minimum shippable de-risk. That was mostly for PR reasons internally, but, um, this is really like the most basic version we could make together, and I think we have so, um,
[16:53] so much further we could push this with these methods. For example, the slides capability is a new capability. It's, like, very, [17:03] you know, already impressive. It's great work from Aidan, Paloma, Martin, a bunch of other people. [17:09] But, you know, there's so much further we can... [17:11] we can push that and improve, [17:13] um, using the same techniques. But I think we can take it further, but we probably need other things too. [17:19] Yeah, I feel so far it's [17:20] pretty magical, like the same RL algorithm just works on, like, o1 reasoning, like Deep Research with tool calls, and then now, like, a more advanced, you know, computer use, browser use agent. Where does it run into the limits with this strategy and with this model specifically as well? [17:37] I think the interesting thing with... [17:40] this model is that because it's taking, it's able to take actions with external side effects, [17:46] um, [17:47] there's a lot more risk. [17:49] So for Deep Research, it was read-only. [17:51] So there's kind of a limit to what the model could do, [17:55] um, [17:56] in terms of, like, data exfiltration and other things. But with this, um, [18:01] in theory, the model could successfully complete a task, but take a lot of harmful actions along the way. Like, you could ask it to [18:09] buy you something and it decides to buy just, like, a hundred different options to make sure that you're satisfied. Exactly. Or, you know, you can think of many examples like that. So I think that safety and safety training and mitigations was kind of one of the really core parts of our process with this model. And, um,
[18:26] Yeah. [18:27] Maybe Casey can talk more about it. I was going to mention that kind of along the same lines, it's like this contact with the real world [18:33] that makes things difficult. Um, you know, we have to train this on, um, like, a bunch of VMs, like, it's like thousands of VMs maybe. Um, and, uh, you know, things break and, you know, as soon as you're hitting a real website, like, the website's down or, like, you're hitting, like, um, all these capacity limits and, like, load testing and this kind of thing. Um, [18:56] Yeah, it's really the very beginning and, you know, we're gonna iron out all these details and continue, but [19:01] that's a major limitation. [19:02] How do you think about, from the safety perspective, building in the right [19:06] guardrails, and how do I make sure the model's not logging into my bank account and sending it all off to a Nigerian prince? Yeah, that's a very good question. [19:17] Yeah, this is definitely an emerging risk, [19:21] where, um, [19:22] the internet's a scary place. There are a lot of attackers and scammers and this kind of thing, phishing attacks, this goes on and on. [19:30] And yeah, our model is a bit like, [19:33] It can-- [19:34] It can reason about these things. If you tell it to be careful, we've done some safety training to make this [19:38] more robust, but sometimes it can get fooled. And sometimes it is a bit too over-eager to complete your task. [19:48] We... [19:48] Um, [19:50] have a long list of mitigations, and the team has worked really hard to, like, stack together a bunch of, um, [19:55] techniques to
[19:57] really try to make the model as safe as possible. [20:00] So, [20:01] um, [20:02] one example that I'll call out is that we have a monitor that looks over its shoulder and just sees if anything looks funny, whether it's going on a weird website or anything like this. Kind of like antivirus for your computer. It's just kind of persistently watching. And then if it looks like there's anything suspicious, then it'll stop the trajectory and stop there. [20:26] Of course, we can't catch everything, and this is a major area that we'll continue to iterate on. We do have, [20:35] um, [20:36] like, a protocol for if there are new attacks in the wild that we discover [20:41] or we encounter, then we can rapidly respond and update these monitors, kind of like you would update your antivirus software, like it would pick up on these new attacks and hopefully keep you safe. [20:52] Yeah, I think the... [20:53] cool [20:54] thing about the safety training is that it's been a really, [20:57] uh, [20:58] cross... [20:58] org effort from [21:01] the safety team, governance team, legal team, [21:04] research team, engineering team, like, so many others. And we have so many mitigations at every single level. [21:10] Um, we did a lot of external red teaming, internal red teaming, but yeah, as Casey mentioned, there's more. [21:17] Surely when we release the model there will be new things we uncover. So we just need to make sure we also have [21:22] robust ways of detecting those and then mitigating those. [21:25] For some of these models, there's a risk of what you can do with the models, whether it's creating biohazards or otherwise. How do you guys manage some of that?
[21:34] Yeah, it's actually-- [21:35] Bio has been heavily on our mind. [21:39] Yeah, the team has been really thoughtful about, you know, yeah, this agent, [21:43] we think it's very powerful. It can do research. It can really speed up your work. [21:49] But that also means that it could speed up [21:51] harm. [21:52] And [21:55] kind of one of the top things that our team has been looking into is the risk of bio risk. So, like, creating bioweapons, this kind of thing. [22:04] And yeah, the team has been really thoughtful about [22:07] how to mitigate against this. And generally being very cautious, we did many weeks of red teaming to make sure that this model cannot be used for those harms. [22:17] A bunch of other mitigations in place. Shout out Karen, [22:20] who spearheaded this effort. [22:23] And yeah, in general, I think we're very aware of this and [22:29] just trying to be very cautious. [22:31] Yeah, makes sense. Tell us a little about the team that came together to build this. So as Casey mentioned earlier, we had... [22:38] Deep Research research team, and then Deep Research applied team, and Operator research team, computer-using agent research team, and Operator [22:46] applied team, and [22:49] we effectively merged [22:51] everybody. We all work really closely together, both the research team and the applied team. [22:56] And the vibes have been great. It's been so fun. Isa and I have been friends for a long time. Yeah. So, like, it was a natural. Yeah, it was really fun. How many of you are there?
[23:05] On Deep Research? [23:07] For the majority of the time, [23:09] three or four. Now we have some new people, which is very exciting. [23:12] And then on CUA? On CUA, I think around six to eight, somewhere around there, [23:18] on the research side, and then [23:20] we have an amazing... [23:22] applied team, like engineering, product design, [23:25] led by Yash Kumar, and then he has just a... [23:28] really cracked engineering team. So it's been very fun to... [23:31] to work really closely. I think that's one thing that's made this collaboration really special is that, [23:36] um, [23:37] the research and applied teams work so closely, and even from the beginning when we're defining what the product should be able to do, it's very much a collaboration between research and product and design. So [23:48] we go backwards from the use cases we want to be able to solve to training the model and building the product. And obviously it's able to do, it's not able to do all of those things fully yet, and it can do some things that we didn't plan, but I think it's a good [24:03] framework for us when we're starting a, [24:06] starting a project. It's, like, very grounded in [24:08] how we want people to use it in the real world. [24:11] It's a way smaller team than I was expecting. [24:13] - Small teams can do amazing things. - Yes, you've built a lot. Yeah, and we haven't been working together for very long. It's been a few months. - Yeah. [24:20] And actually the boundary between the research team and the applied team is not very decisive, because during the model training, lots of applied engineers, they are helping us train the model. And also after we train the model, some research team members are also working on the new set up as model in the deployed model to the real users.
[24:43] What was the hardest part about training this agent? [24:45] Yeah, I think one of the biggest challenges we have is how to make training stable, especially given when we train Deep Research, it's only using browsing and Python. It's pretty mature tools there. We've been using it for a while, but when training the agent model, it has some new tools like a computer and also the terminal bundled in the same container, in the same virtual machine [25:15] is a computer. So it's actually quite hard to train it, because we are literally setting up hundreds of thousands of virtual machines at the same time. And then they all visit the internet. [25:29] And we [25:30] So it's, [25:31] it's one of the biggest challenges. We see that actually the training sometimes will fail, but finally we are very happy that we get this model. [25:39] So VMs. [25:41] Yes, all back to the engineering. Tell us about what's next. More sources, more tools, better model? How do you think about it? [25:51] Well, I think one thing I like about our agent framing is that [25:57] um... [25:59] you can ask it to do whatever you want. And [26:01] Thank you. [26:02] you can ask it to do every possible task you can imagine. It just might not do it well. [26:07] Could you tell it, like, go make me money on the internet? You can tell it that. We'll try. It'll try. Should we try that right after this? Yes. Let's do it.
[26:16] But yeah, I think it's really a matter of, like, [26:19] improving the accuracy, like the performance of tasks, of, like, the whole distribution of tasks. That anyone does on a computer. [26:27] Right. [26:28] Which is a lot of tasks. And through this iterative deployment, we are very excited to see, [26:34] like, what are the new capabilities [26:36] that our users will find in our agent, like the coding ability in Deep Research or [26:42] deep research ability in Operator. [26:44] You were using the agent [26:46] mode [26:47] for coding. Yeah, I use it for coding a lot because I feel it's, you know, it's actually not, you know, like a very, like a, [26:55] always try to rewrite my whole code base. It just actually makes some small edits. And also it actually reads the original docs of different functions pretty well. [27:07] I feel it hallucinates less on the [27:10] function coding. Oh, interesting. How do you choose when to go to Codex versus when to go to agent for that? [27:14] For the agent, it's more similar to how I use 4.0.3, so it's more like an interactive experience. For Codex, it's more like you have some weird [27:27] design the problem that you want a co-worker to solve and then it will make a PR for you. But for the agent, it's more like just give you a function or give you a suggestion. [27:37] Cool. And it can do code search because it can access GitHub through the API connector. [27:42] So code search kind of things. [27:44] It almost feels like the agent roadmap up until now, you've built the different appendages of what it would take to have an agent. And by combining them all, this really is the first fully embodied agent on a computer. I think it's very exciting.
[27:59] Yeah, I think another area that we're excited to push on is... [28:04] the experience of collaborating with the agent. I think this model [28:07] is actually very good at multi-turn conversations. And it's very nice to, [28:12] um, [28:13] to continue working on a task with. I think that's one of the deficiencies of Deep Research. Um, [28:18] A lot of people will do multiple Deep Research requests in a single conversation, but it doesn't always work so well. So I think we're really happy with [28:26] this model's multi-turn ability, and we just want to, [28:28] you know, improve even further. And then I also think, like, personalization and memory [28:33] for agents will also be [28:35] very important, and also, [28:38] right now, every agent task is initiated by the user, but in future, it should also be doing things for you without you having to even ask in the first place. [28:47] Yeah, I'm also pretty excited about the UI and UX surrounding the agent. Because right now, I think, you know, obviously we're working in a ChatGPT world. Like, it's like you start a conversation and it goes. But you can imagine a lot of different modes of interaction with an agent. And I'm really excited to explore different ways of, yeah, different ways of interacting with the agent. [29:08] Do you see this as always being a kind of single, [29:12] omniscient [29:13] super agent, or will there be the, you know, financial analyst sub-agent and the, you know, personal party planner sub-agents? Like, what's your vision for how that kind of plays out? [29:25] I think people have different opinions on this. [29:29] I think in the limit, if you could just ask one thing and it can figure out
[29:34] what it needs to do to finish [29:36] the thing that you want it to do for you, that seems like it would be easiest. Like if you just had a really amazing chief of staff who [29:43] knows how to route things correctly and basically [29:46] can do anything you need, that seems like [29:48] it would be... [29:50] um, [29:51] pretty easy. [29:52] I think I agree with that take. And even in some of our trajectories where-- I don't know, you're asking about-- [30:00] I don't know, maybe like a shopping task. Like, sometimes it'll go into terminal and do some calculations, budget. And I think the model should be free to use all the tools it wants. It doesn't need to be a financial analyst to, like, [30:12] have the financial analyst tool set. Yeah, I feel like when you launch the product, it sometimes makes sense to have some GPTs, like a customized model or customized instruction to put the model into a specific role. But in general, when training the model, there is a lot of positive transfer between deep research, co-operations, also slide generation, like all of these skills are [30:42] you may not just have a single agent. [30:45] like as an underlying base model. [30:47] Totally. I guess even though, you know, people do different types of work, we're all fundamentally, we're sending emails, we're making slide decks, we're doing a lot of the same [30:55] work in front of a computer. [30:56] I'd love to understand some of the learnings from the reinforcement learning perspective. [31:00] It seems like that's the method that seems to really be working for you guys with agents.
[31:05] Was it very data intensive to get to this point of having an agent that's so good at such a wide variety of tasks? What were some of the learnings from an RL perspective? [31:16] Yes, so we actually created a very diverse set of tasks, [31:22] like some tasks to find a very niche topic or a very niche answer on the internet, or some tasks very similar to Deep Research, where you need to write a whole full-length article, and also lots of other tasks, just all the tasks that we want the model to be good at. So far, we think that as long as [31:52] you can judge whether the model's performance is good or not, you can reliably train the model to be [32:01] even better on this task. [32:03] Was there anything special you needed to do to make sure it had good [32:06] turn-by-turn interaction with users when doing that training? Or was it just about the type of trajectories you collected? [32:12] Mmm. [32:14] Yes. So [32:16] most of the time, we focus on end-to-end performance, like [32:23] from a well-specified prompt, how to complete the task. And somehow it's very good at working with users. [32:29] To [32:30] your question, reinforcement learning is very data efficient. So [32:34] that means that we're able to curate a much smaller set of very high-quality data. The scale of the data is just
[32:41] so minuscule compared to the scale of pre-training data. [32:44] We're able to teach the model new capabilities [32:48] by just curating these much smaller, high-quality data sets. I will say, to get the Operator piece to work well, [32:56] before we do RL, the model has to be good enough to have a basic completion of tasks. And our team has spent a lot of time over the past two, [33:10] maybe three years, getting the model to that point where it's able to actually reason about a page and understand the visual elements really well. [33:20] This model is built on all that as well. Actually, could you say a little bit more about that? Because I remember in the early days of OpenAI, this was always part of the World of Bits stuff. And you were trying to RL the mouse paths, and it was just way too unbounded of a problem. What's changed now for that to be kind of solvable? [33:38] Yeah, it's great that you point out World of Bits. [33:42] This [33:43] project does have a very long lineage dating back to 2017 or so. [33:49] Actually, [33:50] our code name is World of Bits 2 for the computer use part. That's awesome. [33:58] Yeah, what's changed? I think [34:00] essentially the scale of the training has changed. Like we have, [34:05] I don't know the multiplier, but it must be like 100,000x or something in terms of compute,
[34:11] the amount of training data we've done both in pre-training and RL. [34:17] Yeah, I really think it's just scale, and [34:21] the scale catching up to [34:24] our ambition, I guess. [34:26] Scale is all you need. [34:27] I believe it. And some good data. Are there particular capabilities or functionality that you're especially excited about in agent mode? [34:35] Yeah, so this model is [34:37] actually pretty good at doing some real research, like data science, and also summarizing the [34:43] reports or the findings in a spreadsheet. So we have some evaluations, like, you know, data science bench, where we evaluate the model and it [34:51] actually outperforms the human baseline. So in some sense, it's actually superhuman in some research tasks. We can rely on the model to perform some basic analysis for us. [35:04] And this is an area that John Blackman on our team was really pushing on, spreadsheets and data science. So shout out, John. [35:10] Spreadsheets and data science. You are automating us out of a job over here. Elevating us. Enhancing. [35:18] Another thing I'm excited about is, [35:22] you know, we released Operator in January, and [35:26] it was decent at clicking around, but I think we've substantially improved that capability, where it's much more accurate. And just getting the basic things right [35:35] is what I'm actually excited about, where it can reliably fill out a form and, you know, [35:40] do those kinds of things. Date picking? Date picking. Date picking still needs a bit of work, but... For some reason, date picking is just
[35:47] the hardest task. It's hard for humans too. Like picking a date in the [35:52] calendar drop-down? Yes. [35:54] Okay, last question. It seems like you guys have the overall framework and structure in place for something really interesting here. [36:02] What's ahead? Where do you go from here? [36:04] I think the thing that we're really excited about is that this [36:08] tool that we've given the model access to is very general. It's basically [36:13] most of what you could do on a computer. And if you think about [36:17] all of the tasks that a human can do on a computer, it's very extensive. [36:22] And so now we kind of feel like it's a matter of us [36:26] making the model good at all of those tasks too, and figuring out a way of training on as diverse a set of tasks [36:33] as possible with this very general tool. So I think there's a lot of hard work ahead of us, but we're very excited about it. I think we're also excited about [36:42] pushing different [36:43] forms, [36:45] ways of interacting with the agent. I think there'll be a lot of new [36:49] interaction paradigms between users and [36:52] virtual assistants or agents. [36:56] So a lot of exciting times ahead. I can't wait to see it. Thank you. [37:01] Thanks for joining us. Congratulations on the launch. Thank you so much. Thank you for having us. [37:05] *music*