Build a Deep Research Agent Locally on the CodeAct Framework

Deep Research is one of the most demanding things you can build with AI — not because the individual pieces are hard, but because multi-step reasoning, source tracking, report generation, and quality evaluation all have to work together.

Building it locally adds another layer: it takes architectural thinking, not just vibe coding. Framework selection, execution model, and observability all shape what the agent can and can't do. This tutorial covers the full local implementation: a CodeAct agent built on LangGraph, with LangSmith connected for tracing, debugging, and evaluation.

1. Open Source Technical Framework Analysis

Building this locally doesn't mean starting from scratch — there are open-source frameworks that already handle the core agent infrastructure. The first step is picking the right one. That means comparing what's available, then building on top of the framework that fits best.

1.1 Framework Comparison

Awesome Deep Research Agent: A curated survey of deep research agent architectures and papers, not a runnable framework. Its value is taxonomic: it organizes deep research systems along a spectrum from static workflow → dynamic single-agent → dynamic multi-agent, and provides the vocabulary for evaluating the other two options.
Open Deep Research: A purpose-built pipeline on LangGraph with dedicated LLM roles for each stage (summarization, research, compression, and report generation). The pipeline is configurable, but the research logic is predetermined, so the agent cannot restructure its search strategy mid-task. Best for reproducible, structured reports.
LangGraph CodeAct: Implements the CodeAct architecture, which is also used by Manus. Instead of calling tools one at a time through JSON outputs, the agent writes executable Python code. This shifts tool use from a discrete selection problem to a programming problem, enabling composable tool calls, runtime branching, persistent state across turns, and a tool-agnostic architecture.

Criterion	Awesome Deep Research Agent	Open Deep Research	LangGraph CodeAct
Type	Reference survey	Semi-static pipeline	Dynamic agent
Execution	Not runnable	Fixed pipeline stages	Python code generation
Tool use	—	One at a time (JSON)	Composable (code blocks)
Adaptability	—	Fixed stages	Runtime loops & branching
State persistence	—	Context window only	Python execution context
Best for	Architecture research	Reproducible reports	Complex, adaptive research

LangGraph CodeAct fits best. A Deep Research agent has to chain searches, change course based on what it finds, and carry intermediate results from one step to the next, and CodeAct's composable tools, runtime branching, and persistent state cover exactly those needs.
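To make the "code as action" idea concrete, here is a minimal standard-library sketch, not the framework's actual implementation. It assumes a hypothetical `fake_search` tool and a hypothetical `run_code` helper: the model's output is treated as Python source, executed in a shared namespace, and the captured stdout is what would be fed back to the model. Note that `exec` is not a security sandbox; a real deployment should isolate this step.

```python
import contextlib
import io
from typing import Any


def fake_search(query: str) -> list[str]:
    """Hypothetical stand-in for a real search tool; returns canned results."""
    return [f"{query} :: result {i}" for i in range(1, 4)]


def run_code(code: str, namespace: dict[str, Any]) -> str:
    """Execute model-written code in a shared namespace and capture its stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOT a sandbox; isolate this in real use
    except Exception as exc:
        # Return the error as text so the model can read it and try again.
        return f"Error during execution: {exc!r}"
    return buffer.getvalue() or "<no output>"


# Tools are plain Python callables placed in the namespace the code runs in.
namespace: dict[str, Any] = {"search": fake_search}

# Turn 1: one action composes several tool calls inside a loop.
turn_1 = """
hits = []
for topic in ["codeact", "langgraph"]:
    hits.extend(search(topic))
print(len(hits))
"""
print(run_code(turn_1, namespace))

# Turn 2: `hits` from turn 1 is still there (persistent state across turns).
turn_2 = "print(hits[0])"
print(run_code(turn_2, namespace))
```

With JSON tool calling, the loop in turn 1 would take one model round-trip per search. Here it is a single action, and the second turn reuses the variable the first turn created.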

GitHub repository page for Open Deep Research framework comparison

GitHub Awesome Deep Research Agent page used for framework analysis

GitHub LangGraph CodeAct repository selected for deep research agent build

The build approach is straightforward: start with the LangGraph CodeAct framework as the foundation, then use Claude Code to develop the features on top of it. Two external dependencies are required — an OpenAI API key for the language model, and a Tavily API key for search. The sections below walk through each setup step.

2. Deploy Framework Locally

Open Terminal, clone the LangGraph CodeAct repository from GitHub, and install the dependencies.

Terminal cloning LangGraph CodeAct repository for local research agent setup

3. Develop with Claude Code in Cursor

Inside Cursor, open Claude Code from the Terminal. Describe what you need in natural language and let Claude Code build the features on top of the framework.

Cursor editor configuring local deep research agent files after repository setup

4. Get Your API Keys

The agent needs two external APIs: a language model for reasoning and report generation, and a search tool for gathering sources.

For the language model, generate an API key from the OpenAI platform and select a mainstream model. Any current flagship model has the language processing capability needed for research tasks — text-heavy reasoning, summarization, and structured report generation are well within range.

OpenAI platform API key page for deep research agent model access

OpenAI platform model page showing GPT options for research agent setup

For search, this build uses Tavily. It performs well for web research and offers free credits to get started.

DeepResearch Bench leaderboard comparing deep research agent performance

Source: https://huggingface.co/spaces/muset-ai/DeepResearch-Bench-Leaderboard

Tavily API key page used for deep research agent search integration

Once both keys are ready, store them in a .env.local file, and make sure that file is listed in .gitignore so the keys stay out of version control.
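As a rough illustration of reading that file at startup, the sketch below parses simple KEY=VALUE lines with the standard library only. The helper names `load_env_file` and `require_keys` are hypothetical, and the variable names `OPENAI_API_KEY` and `TAVILY_API_KEY` are assumptions; use whatever names your code actually reads.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env.local") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Skips blank lines and # comments, strips optional surrounding quotes,
    and never overwrites a variable that is already set.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_keys(names: list[str]) -> list[str]:
    """Return the names that are missing or empty so the caller can fail fast."""
    return [name for name in names if not os.environ.get(name)]


load_env_file()  # does nothing if .env.local is absent
# Assumed variable names for the two services used in this build.
missing = require_keys(["OPENAI_API_KEY", "TAVILY_API_KEY"])
if missing:
    print("Missing keys:", ", ".join(missing))
```

Checking for missing keys before the agent starts turns a confusing mid-run authentication failure into an immediate, readable message.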

Cursor editor showing Python code for deep research agent source handling

5. Trace with LangSmith

Deep Research agents run multi-step workflows — search, reasoning, synthesis, report generation. Without visibility into each step, a failure during testing becomes a black box: something went wrong, but you can't tell where.

LangSmith trace page for observing deep research agent workflow execution

LangSmith solves this by making the full execution path observable. Every tool call, reasoning step, and search query is logged, so when something breaks, you can pinpoint exactly which step caused it — and fix that, not guess at it. That precision is what builds robustness: the agent's behavior becomes something you can actually understand, debug, and improve over time.

LangSmith run list with deep research agent trace and report output
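To show what step-level tracing captures, here is a standard-library stand-in, not LangSmith itself. The `traced` decorator, the in-memory `TRACE` list, and the stub `search` and `synthesize` steps are all hypothetical; in the real build, LangSmith's own tracing records this kind of information and displays it in its UI.

```python
import functools
import time
from typing import Any, Callable

TRACE: list[dict[str, Any]] = []  # in-memory run log, for illustration only


def traced(step_name: str) -> Callable:
    """Record the inputs, outcome, and duration of each call under a step name."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record: dict[str, Any] = {"step": step_name, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call is logged.
                record["seconds"] = time.perf_counter() - start
                TRACE.append(record)
        return wrapper
    return decorator


@traced("search")
def search(query: str) -> list[str]:
    return [f"source about {query}"]  # hypothetical stub


@traced("synthesize")
def synthesize(sources: list[str]) -> str:
    if not sources:
        raise ValueError("no sources to synthesize")
    return f"Report based on {len(sources)} source(s)"


try:
    synthesize([])  # fails on purpose to show how the failing step is located
except ValueError:
    pass

failed_steps = [r["step"] for r in TRACE if r["status"] == "error"]
print(failed_steps)
```

The point is the shape of the record: once every step logs its name, inputs, status, and duration, "something went wrong" becomes "the synthesize step received an empty source list".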

6. Optimize the Agent

With the APIs configured and LangSmith connected, start the server and review the frontend — check whether the interface needs any adjustments before optimizing the agent behavior itself.

The main optimization levers are model choice, API configuration, and search behavior. Adjust these based on what the LangSmith traces show, then retest.

Cursor editor and terminal running the local deep research agent implementation

After all features have been debugged, run one final end-to-end test. Even though the key capabilities have already been tested throughout the process, this final run is for reviewing the overall product experience and output quality as a whole.

The frontend exposes a search round selector: users can keep the default 5 rounds or go up to 10 for broader coverage. More rounds collect more sources, which generally increases the information density and depth of the final report. This flexibility is built on the CodeAct framework's ability to dynamically adjust how the agent searches based on the configured parameters.

AI research assistant frontend with search round selector control
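One way the round setting could drive the search loop is sketched below. This is an assumption about the mechanics, not the project's actual code: `run_research`, `clamp_rounds`, and both stub functions are hypothetical, and the lower bound of 1 round is assumed (the article only gives the default of 5 and the maximum of 10). Each round derives its next query from what has been found so far.

```python
from typing import Callable

DEFAULT_ROUNDS = 5   # default from the frontend selector
MAX_ROUNDS = 10      # upper bound from the frontend selector


def clamp_rounds(requested: int) -> int:
    """Keep the requested value inside 1..MAX_ROUNDS (lower bound is assumed)."""
    return max(1, min(requested, MAX_ROUNDS))


def run_research(
    topic: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], str],
    rounds: int = DEFAULT_ROUNDS,
) -> list[str]:
    """Run the configured number of search rounds and return unique sources."""
    sources: list[str] = []
    query = topic
    for _ in range(clamp_rounds(rounds)):
        for hit in search(query):
            if hit not in sources:  # de-duplicate while preserving order
                sources.append(hit)
        # Let earlier findings shape the next round's query.
        query = next_query(topic, sources)
    return sources


# Hypothetical stubs so the sketch runs without network access.
def stub_search(query: str) -> list[str]:
    return [f"{query} #1", f"{query} #2"]


def stub_next_query(topic: str, sources: list[str]) -> str:
    return f"{topic} follow-up {len(sources)}"


print(len(run_research("codeact", stub_search, stub_next_query, rounds=5)))
print(len(run_research("codeact", stub_search, stub_next_query, rounds=10)))
```

In a real agent, `next_query` would be a model call that reads the collected sources and proposes the next gap to fill.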

The frontend also lets users switch between different mainstream LLM APIs — Claude, Groq, Gemini, and others. Running the same research topic across multiple models and comparing the output quality is a practical way to evaluate which model best fits the Deep Research use case and consistently meets the output standard you're aiming for.

AI research assistant report output with abstract and research sections
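A small harness can make the model comparison repeatable. The sketch below is a hypothetical illustration: `compare_models` and the two stub models are made up, and the metrics (word count, number of lines starting with `#`) are crude proxies that assume Markdown-style reports. They are a starting point for side-by-side review, not a measure of quality.

```python
from typing import Any, Callable

ModelFn = Callable[[str], str]  # takes a research topic, returns a report


def compare_models(topic: str, models: dict[str, ModelFn]) -> list[dict[str, Any]]:
    """Run one topic through every model and collect simple, comparable stats."""
    rows: list[dict[str, Any]] = []
    for name, generate in models.items():
        try:
            report = generate(topic)
        except Exception as exc:
            # A failing provider should not abort the whole comparison.
            rows.append({"model": name, "ok": False, "error": repr(exc)})
            continue
        rows.append({
            "model": name,
            "ok": True,
            "words": len(report.split()),
            # Assumes Markdown-style headings mark report sections.
            "sections": sum(1 for line in report.splitlines() if line.startswith("#")),
        })
    return rows


# Hypothetical stubs standing in for real provider calls.
def model_a(topic: str) -> str:
    return f"# Abstract\n{topic} summary\n# Findings\nDetail one. Detail two."


def model_b(topic: str) -> str:
    raise TimeoutError("endpoint did not respond")


for row in compare_models("codeact", {"model_a": model_a, "model_b": model_b}):
    print(row)
```

Keeping the topic and prompt fixed while only the model changes is what makes the outputs comparable.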

7. Retrospective

CodeAct is more extensible than it looks. The framework supports multiple specialized tools — each acting like a domain expert. Configure the right tools and it can handle different research domains, multi-step workflows, or even multi-agent coordination.
The default search depth isn't enough — you have to push it. The first outputs weren't deep enough. Each step needed more specific instructions: which search tool to call, how to structure the report, how many rounds to run. For complex topics, prompting the agent to run a second round based on what the first round found produces significantly better output.
Start evaluating as soon as the frontend is up. Don't wait until everything is "done" to test. After each round, identify what broke, find the root cause, and fix it before moving on. When something fails, the browser's Console and Network panels are the fastest way to see what actually happened — copy the error into an AI coding tool and let it help diagnose.
API failures aren't always your bug. Rate limits and unstable endpoints happen. If the API stops responding after a burst of calls, wait about an hour before retrying. If results come back incomplete on one run but fine on the next, the endpoint may just be flaky — run it again before assuming the product is broken.
Evaluation costs tokens, but it can't be skipped. Testing consumes tokens and adds cost. The minimum viable bar: every key product feature gets tested at least once.
System prompt quality directly affects report quality. Define the report format clearly in the system prompt — structure, sections, depth expectations. Vague prompts produce vague reports. This is the highest-leverage place to improve output quality without changing the agent architecture.
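The advice on flaky endpoints above can be partly automated. The following is a minimal sketch of retrying transient failures with exponential backoff and jitter; `call_with_retry` and `flaky_search` are hypothetical, and which exception types count as transient is an assumption that depends on the client library you use. It helps with short-lived instability only. A hard rate limit still needs the longer wait described above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise AssertionError("loop always returns or raises")


# Hypothetical flaky call: fails twice, then succeeds.
state = {"calls": 0}


def flaky_search() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("temporary failure")
    return "results"


print(call_with_retry(flaky_search, base_delay=0.01))
print(state["calls"])
```

Errors outside `retry_on` propagate immediately, so genuine bugs are not hidden behind retries.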