Recently, I participated in the zero-to-one planning of an AI video product at work. While building the product demo, I explored three different ways to bring the product to life. Behind these three approaches were three very different technical architectures.
That experience made me start thinking more deeply about their differences. Their underlying philosophies are completely different, and the scenarios they are suited for are also different. In this article, I will draw on this hands-on experience to compare three technical frameworks for AI product implementation: Workflow, Agent, and Skill.
1. A Brief Background of the AI Product
In the AI era, building a product does not necessarily start with writing a PRD. Instead, the specific product plan and functional details can be directly demonstrated inside a working demo.
During this process, I also looked deeply into the full implementation path: from the user entering a requirement, to the system calling the large model API. My goal was to first make sure the core technical capability of the product could run end to end, and then build the frontend interaction layer on top of that capability.
Overall, the product execution process included the following steps:
- The user enters the requirement text, uploads the first-frame and last-frame images, and selects video settings such as duration, size, and style.
- The large language model first understands the user’s requirement, then combines it with the video settings to generate a storyboard script according to predefined rules.
- The definitions and descriptions of the different video styles are stored in corresponding Skill files and loaded only when needed.
- The backend sends the user-uploaded images and the storyboard script together to the AI video model API to generate the video.
- After the video is generated, the API result is returned and displayed on the frontend.
Of course, this execution process is somewhat idealized. Quality control mechanisms such as multiple generation attempts or result selection are not the focus of this article, so I will leave them aside for now.
After running through the whole process, the logic became relatively clear. The input and output of each step were well defined, and the points where the user needed to intervene were also clear. Based on these steps, I tried three different implementation approaches.
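The steps above can be sketched as plain functions with explicit inputs and outputs. This is a minimal sketch, not the product's actual code: `VideoRequest`, `STYLE_SKILLS`, `build_storyboard_prompt`, and `run_pipeline` are hypothetical names, the style text is made up, and the LLM and video model are passed in as ordinary callables so that no real API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoRequest:
    """Everything the user provides before generation starts."""
    copy: str          # the requirement text
    first_frame: str   # path to the first-frame image
    last_frame: str    # path to the last-frame image
    duration_s: int
    size: str
    style: str


# Hypothetical stand-in for the style definitions kept in Skill files.
STYLE_SKILLS: Dict[str, str] = {
    "cinematic": "Slow camera moves, shallow depth of field.",
}


def build_storyboard_prompt(req: VideoRequest, styles: Dict[str, str]) -> str:
    """Combine the requirement, the video settings, and the style rules."""
    if req.style not in styles:
        raise KeyError(f"unknown style: {req.style}")
    return (
        f"Requirement: {req.copy}\n"
        f"Duration: {req.duration_s}s, size: {req.size}\n"
        f"Style rules: {styles[req.style]}\n"
        "Write a shot-by-shot storyboard script."
    )


def run_pipeline(
    req: VideoRequest,
    llm: Callable[[str], str],
    video_model: Callable[[str, List[str]], str],
    styles: Dict[str, str] = STYLE_SKILLS,
) -> str:
    """Requirement -> storyboard script -> video result, in a fixed order."""
    storyboard = llm(build_storyboard_prompt(req, styles))
    return video_model(storyboard, [req.first_frame, req.last_frame])
```

Because both models are injected, the same pipeline can be exercised with fake callables first and wired to real APIs later.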
2. Breaking Down the Three Technical Architectures
I will first briefly describe the three technical architecture approaches, and then analyze them one by one.
1. Implementation Through a Dify / Coze Workflow
The first approach is to use a workflow. Whether in Dify or Coze, the idea is to implement multi-model orchestration through a visual workflow.
The rough workflow is:
User enters copy and images → LLM node understands the user requirement → Skill management for video styles + LLM node with system prompts → LLM node generates the storyboard script → AI video model generates the video based on the storyboard script and the user’s uploaded images → User reviews the AI video result.
When building this workflow, the whole process can be completed by dragging and connecting visual nodes, without writing any code. During setup, AI can also help configure each node and write the prompts. In most cases, only simple operations are needed to complete the workflow.
This approach allows a development team to quickly launch a product MVP and put it in front of users for validation. If the product needs to be adjusted, the team can find the corresponding node and modify it directly. Because each change is confined to a single node, adjustments are precise and quick to make.
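Conceptually, a visual workflow is a fixed chain of named nodes, each taking the state produced by the previous one. The sketch below illustrates that idea only; it is not the format Dify or Coze actually uses, and every node function here is a hypothetical placeholder for an LLM or video model node.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Node = Tuple[str, Callable[[State], State]]


def understand(state: State) -> State:
    # Placeholder for the LLM node that interprets the requirement.
    return {**state, "intent": state["copy"].strip().lower()}


def add_style(state: State) -> State:
    # Placeholder for looking up the style definition.
    return {**state, "style_rules": "rules for " + state["style"]}


def storyboard(state: State) -> State:
    # Placeholder for the LLM node that writes the storyboard script.
    return {**state, "storyboard": f"{state['intent']} | {state['style_rules']}"}


def video(state: State) -> State:
    # Placeholder for the AI video model node.
    return {**state, "video": "video(" + state["storyboard"] + ")"}


WORKFLOW: List[Node] = [
    ("understand", understand),
    ("add_style", add_style),
    ("storyboard", storyboard),
    ("video", video),
]


def run_workflow(nodes: List[Node], state: State) -> Tuple[State, List[str]]:
    """Run every node in its fixed order and record which nodes ran."""
    trace: List[str] = []
    for name, fn in nodes:
        state = fn(state)
        trace.append(name)
    return state, trace
```

Adjusting the product then means replacing one entry in `WORKFLOW`, and the trace shows exactly which node produced which state.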
2. Implementation Through an Agent
The second approach is to use an Agent. The Agent understands the user’s requirement, makes decisions on its own, and completes the content generation process.
The implementation logic is:
User enters copy and images → The Agent independently calls tools and Skills to generate the storyboard script → The storyboard script and images are sent to the AI video model → The AI video is generated.
In this process, the Agent decides the order and method of tool calls by itself. It can handle unexpected situations flexibly and deal with more complex scenarios. As long as the required tools, Skills, and working rules are clearly defined and provided, it can work out how to complete the task.
After the user finishes the input, there is no need for further intervention. The user can simply wait for the result.
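The difference from a workflow is that the order of calls is chosen at run time. The following is a minimal sketch of such a loop, assuming a hypothetical `policy` callable in place of the LLM's decision: here it is a hard-coded stub (`stub_policy`), and the three tools are placeholders that only fill in state fields. The URL is a dummy value.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]
Tool = Callable[[State], State]


def load_style_skill(state: State) -> State:
    return {**state, "style_rules": "rules for " + state["style"]}


def write_storyboard(state: State) -> State:
    return {**state, "storyboard": f"{state['copy']} | {state['style_rules']}"}


def generate_video(state: State) -> State:
    # Placeholder result; a real tool would call the video model API.
    return {**state, "video_url": "https://example.invalid/video.mp4"}


TOOLS: Dict[str, Tool] = {
    "load_style_skill": load_style_skill,
    "write_storyboard": write_storyboard,
    "generate_video": generate_video,
}


def stub_policy(state: State) -> Optional[str]:
    """Stand-in for the LLM: pick the next tool, or None when finished."""
    if "video_url" in state:
        return None
    if "style_rules" not in state:
        return "load_style_skill"
    if "storyboard" not in state:
        return "write_storyboard"
    return "generate_video"


def run_agent(
    state: State,
    policy: Callable[[State], Optional[str]],
    tools: Dict[str, Tool],
    max_steps: int = 10,
) -> Tuple[State, List[str]]:
    """Let the policy choose tools until it stops or the step budget runs out."""
    calls: List[str] = []
    for _ in range(max_steps):
        name = policy(state)
        if name is None:
            return state, calls
        if name not in tools:
            raise ValueError(f"policy chose an unknown tool: {name}")
        state = tools[name](state)
        calls.append(name)
    raise RuntimeError("agent did not finish within max_steps")
```

The step budget and the unknown-tool check are there because a real model-driven policy, unlike this stub, may loop or pick a tool that does not exist.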
3. Implementation Through a Single Skill
The third approach is to implement the entire process through one Skill.
While researching, I found a Skill on Clawhub that encapsulated an entire pipeline. So I also tried this idea: packaging all the complete execution steps into one Skill and letting that Skill handle the whole process.
The implementation logic is:
Create a new Skill → Write the execution steps into the Skill → Put the video style references into a reference folder for the Skill to use → During execution, the Skill follows the steps in order or calls different modules → The video generation is completed.
After creating this Skill, I opened it in Cursor and created a .env.local file to store the API keys for the LLM and the AI video model. When executed, it could use those keys to complete the full workflow.
One thing worth noting: when I installed this Skill in Claude and asked Claude to call it directly to generate a video, the process started normally but failed at the API call stage, because the Skill could no longer read the API key.
In my testing, the cause appeared to be that the Skill runs in a sandboxed environment. Claude’s own model can handle the copywriting part, but the sandbox did not give the Skill access to the API key. As a result, the Skill could not call the AI video model to complete the final generation step.
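One way to make this failure easier to diagnose is to check for the key before the pipeline starts, so that a missing key fails with a clear message instead of surfacing at the API call stage. The sketch below is an assumption about how such a check could look, not the Skill's actual code: the variable names `LLM_API_KEY` and `VIDEO_API_KEY` are hypothetical, and the file is parsed by hand as simple `KEY=VALUE` lines so that no third-party library is assumed.

```python
from pathlib import Path
from typing import Dict, Mapping


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env.local") -> Dict[str, str]:
    """Return the file's values, or an empty dict if the file is not readable here."""
    p = Path(path)
    if not p.is_file():
        return {}
    return parse_env_text(p.read_text(encoding="utf-8"))


def require_key(name: str, env: Mapping[str, str]) -> str:
    """Fail early, with a readable message, when a key is missing or empty."""
    value = env.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not available in this environment; "
            "check whether the runtime can read the key file."
        )
    return value
```

A typical call would be `require_key("VIDEO_API_KEY", load_env_file())` at the very start of the run, before any copywriting step consumes tokens.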
3. Comparing the Technical Architectures
After completing these experiments, I started thinking about the differences between the three technical architectures: whether they fit the current product scenario, and what kinds of scenarios each one is better suited for.
Dify / Coze Workflow
This approach works well for simple and clear task logic, such as the AI video generation process in this case.
However, once the process becomes slightly more complex, the limitations become obvious. If the execution process requires calling tools, searching online, organizing information, or extracting key points, the workflow approach becomes less flexible.
Even if the workflow is used to quickly validate the product MVP, switching to another technical architecture during the development phase still brings migration and adjustment costs.
Agent Mode
Agents have autonomous decision-making capabilities. But for this product, that autonomy makes the process less controllable and harder to debug.
When something goes wrong, it is difficult to clearly determine where the problem happened, or whether it was caused by the AI model’s hallucination.
In this product, every execution step has already been clearly defined, so there is little to gain from letting an Agent make autonomous decisions throughout the process.
Single-Skill Implementation
At the implementation level, this approach essentially writes the multi-model orchestration logic into one Skill. The Skill then executes and calls each module in sequence.
The problem is that this Skill becomes very “heavy.” The cost of debugging and making partial modifications becomes high. It also creates similar problems during execution: if something goes wrong, it is hard to locate the issue, and troubleshooting becomes frustrating.
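API access is another problem: as described earlier, when the Skill ran inside Claude it could not read the API key needed to call the AI video model. For these reasons, this method is not suitable for the current product architecture.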
Other Possibilities
There is actually another possible approach: using an Agent SDK to build an Agent Team and implement the product through a multi-agent architecture.
Each Agent would be responsible for a specific role and collaborate through messages or shared state. A main Agent would handle the orchestration, while each sub-Agent would focus on its own responsibility.
For example:
- Requirement Understanding Agent: collects and understands the user’s requirement.
- Structuring Agent: converts the user requirement into JSON.
- Script Agent: generates the storyboard script.
- Video Generation Agent: calls the video model API with the images, requirement, and storyboard script to generate the video.
With this approach, each Agent has a clear responsibility and the process is traceable. However, it is critical to ensure that state is passed effectively between Agents.
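State passing between the roles listed above can be sketched with one shared state object that each sub-Agent reads from and writes to. This is a minimal illustration that does not use any particular Agent SDK: `SharedState`, `orchestrate`, and the four agent functions are hypothetical, and each agent body is a placeholder for a model or API call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedState:
    raw_request: str
    requirement: Optional[str] = None
    structured: Optional[dict] = None
    storyboard: Optional[str] = None
    video_url: Optional[str] = None
    log: List[str] = field(default_factory=list)


def _need(state: SharedState, name: str) -> None:
    """Refuse to run when an upstream agent has not produced its output."""
    if getattr(state, name) is None:
        raise ValueError(f"missing state field: {name}")


def requirement_agent(state: SharedState) -> None:
    state.requirement = state.raw_request.strip()


def structuring_agent(state: SharedState) -> None:
    _need(state, "requirement")
    state.structured = {"goal": state.requirement}


def script_agent(state: SharedState) -> None:
    _need(state, "structured")
    state.storyboard = "Shot 1: " + state.structured["goal"]


def video_agent(state: SharedState) -> None:
    _need(state, "storyboard")
    # Placeholder result; a real agent would call the video model API.
    state.video_url = "https://example.invalid/video.mp4"


AGENTS: List[Callable[[SharedState], None]] = [
    requirement_agent, structuring_agent, script_agent, video_agent,
]


def orchestrate(state: SharedState, agents: List[Callable[[SharedState], None]]) -> SharedState:
    """Main agent: run each sub-agent in turn and log who ran."""
    for agent in agents:
        agent(state)
        state.log.append(agent.__name__)
    return state
```

The `_need` check is the design point: each sub-Agent states which field it depends on, so a broken hand-off fails at the agent that noticed it rather than further down the chain.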
The drawback is that this method is built for complex tasks. When the product requirement and execution process are as simple as they are here, it adds more structure than the problem needs.
It also consumes a large number of tokens, which increases cost. So although this approach could theoretically work, it is not suitable for this product.
Final Thoughts
In the end, it is important to remember one thing: without real traffic, most architecture optimization is essentially over-engineering.
There is no perfect architecture. There is only the architecture that best fits the product.
If a simple workflow can solve the problem, there is no need to use a complicated approach. At the MVP stage, the simplest architecture can be used to get the product running. Later, when the product enters the development stage, the overall architecture can still be adjusted.
Different stages require different architectures. The goal is to ensure that the product can launch smoothly while also providing users with an efficient experience.