
Building AI Agents for Financial Services: Lessons from Fintool

Detailed Summary (translated from the original Traditional Chinese)

Nicolas Bustamante draws on two years of hands-on experience to lay out the core challenges of building AI agents for financial services: the domain tolerates no mistakes, professional investors make million-dollar decisions based on your output, and a single slip destroys trust permanently. He argues that "the model is not the product; the experience around the model is the product," and shares 11 key architectural decisions covering isolated execution, data normalization, a skills system, workflow management, real-time streaming, and rigorous evaluation.

Core Insight: Fear Becomes the Best Feature

The unforgiving demands of finance:
This domain does not forgive mistakes. A wrong revenue figure, a misread guidance statement, or an incorrect DCF assumption can lead a professional investor to make the wrong call on a $100M position. Users are extremely demanding and pressed for time; they spot bullshit instantly and need precision, speed, and depth.

As the author puts it: "The fear of being wrong becomes our best feature." Every number gets double-checked, every assumption validated, every model stress-tested. You start questioning everything the LLM outputs, because you know your users will. One miscalculation in a DCF model and your credibility is damaged forever.

A bold architectural bet:
When Claude Code launched with its filesystem-first agentic approach, Fintool adopted it immediately. At the time, the entire industry (Fintool included) was building elaborate RAG pipelines with vector databases and embeddings. After in-depth discussions with Thariq at Anthropic, the author wrote "The RAG Obituary," and Fintool moved fully to agentic search, even retiring its precious embedding pipeline.

People thought this was crazy at the time. The article drew plenty of praise and plenty of negative comments. Today most startups are adopting these same best practices.

11 Key Architectural Decisions

1. The Sandbox Is Not Optional

The problem: At first the team assumed that "just running Python scripts" didn't need a sandbox. Then an LLM decided to run rm -rf / on the server (it was trying to "clean up temporary files"), and they became true believers.

Why it's needed: Agents have to perform multi-step operations. When a professional investor asks for a DCF valuation, that isn't a single API call. The agent needs to research the company, gather financial data, build the model in Excel, run sensitivity analysis, generate complex charts, and iterate on assumptions. That's dozens of steps, each of which may modify files, install packages, or run scripts.

Architecture:
Every user gets an isolated environment with three mount points:
- Private: read/write, for personal data
- Shared: read-only, shared within the organization
- Public: read-only, shared with everyone

Security mechanism:
AWS ABAC (Attribute-Based Access Control) generates short-lived credentials scoped to specific S3 prefixes. User A physically cannot access User B's data. The IAM policy uses ${aws:PrincipalTag/S3Prefix} to restrict access.
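The article doesn't show the credential-minting code; as an illustration only, a minimal sketch of how prefix-scoped, short-lived credentials can be issued with boto3 session tags. The role ARN, tag key, and duration below are hypothetical; the point is that the tag value feeds the ${aws:PrincipalTag/S3Prefix} condition in the role's IAM policy.

import boto3

def sandbox_credentials(user_id: str):
    """Assume a role with a session tag that pins S3 access to this user's prefix."""
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/sandbox-user",   # hypothetical role
        RoleSessionName=f"sandbox-{user_id}",
        DurationSeconds=900,                                      # short-lived: 15 minutes
        Tags=[{"Key": "S3Prefix", "Value": f"private/{user_id}"}],  # read by ${aws:PrincipalTag/S3Prefix}
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration

The role's policy would then allow object access only under a resource like arn:aws:s3:::bucket/${aws:PrincipalTag/S3Prefix}/*, so the scoping is enforced by IAM rather than application code.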

Performance optimization:
Sandbox pre-warming: when the user starts typing, a sandbox starts spinning up in the background. By the time they hit enter, the sandbox is ready. 600-second timeout, extended by 10 minutes on every tool use. Sandboxes stay warm across conversation turns.
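A minimal sketch of that lifecycle, assuming a hypothetical start_sandbox provisioning call (stubbed below): spin up on the first keystroke, reuse the box if it is still warm, and push the expiry out on every tool use.

import time

WARM_TIMEOUT = 600        # initial idle timeout in seconds
TOOL_USE_EXTENSION = 600  # each tool call buys another 10 minutes

_sandboxes: dict[str, dict] = {}  # user_id -> {"handle": ..., "expires_at": ...}

def start_sandbox(user_id: str):
    """Stand-in for the real provisioning call (e.g. launching a container for this user)."""
    return object()

def prewarm(user_id: str):
    """Called when the user starts typing; by the time they hit enter, the box is ready."""
    box = _sandboxes.get(user_id)
    if box and box["expires_at"] > time.time():
        return box["handle"]                      # still warm, reuse it
    handle = start_sandbox(user_id)
    _sandboxes[user_id] = {"handle": handle, "expires_at": time.time() + WARM_TIMEOUT}
    return handle

def on_tool_use(user_id: str):
    """Every tool invocation extends the sandbox lifetime."""
    _sandboxes[user_id]["expires_at"] = time.time() + TOOL_USE_EXTENSION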

2. Context Is the Product

Core claim: An agent is only as good as the context it can access. The real work isn't prompt engineering; it's turning messy financial data from dozens of sources into clean, structured context the model can actually use. That takes a great deal of domain expertise from the engineering team.

The heterogeneity problem: Financial data comes in every format:
- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings calls: speaker-segmented text with Q&A sections
- Press releases: semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: articles with varying quality and structure
- Alternative data: satellite imagery, web traffic, credit card panels
- Broker research: proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Every source has different schemas, different update frequencies, and different quality levels.

The normalization layer: Everything becomes one of three formats:
- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters: Different documents are chunked differently (see the sketch after this list):
- 10-K filings: by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings calls: by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: usually small enough to be a single chunk
- News articles: paragraph-level chunks
- 13F filings: by holder and quarter-over-quarter position changes

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.
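The article doesn't show the chunker itself; as a sketch of the 10-K case, assuming the filing has already been converted to plain text, splitting on the standard "Item N." headings is one simple way to get chunks that follow the regulatory structure rather than arbitrary token windows.

import re

ITEM_HEADING = re.compile(r"(?im)^\s*(item\s+\d+[a-z]?)\.")

def chunk_10k(text: str) -> list[dict]:
    """Split a 10-K into one chunk per Item section, keeping the section label as metadata."""
    matches = list(ITEM_HEADING.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "section": m.group(1).upper(),          # e.g. "ITEM 1A"
            "text": text[m.start():end].strip(),
        })
    return chunks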

Tables are special: Financial data is full of tables. LLMs are surprisingly good at reasoning over markdown tables but terrible with HTML <table> tags or raw CSV dumps. The normalization layer converts everything into clean markdown tables.

Metadata enables retrieval: When a user asks "What did Apple say about services revenue on their last earnings call?", Fintool needs:
- Ticker resolution (AAPL → the right company)
- Document type filtering (an earnings transcript, not a 10-K)
- Temporal filtering (the most recent one, not 2019)
- Section targeting (CFO remarks or the revenue discussion, not legal disclaimers)

That's why every document has a meta.json. Without structured metadata, you're doing keyword search over a haystack (see the sketch below).
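The article doesn't publish the meta.json schema, so for illustration only, here is a hypothetical shape and the kind of retrieval-time filter the example question implies; all field names and values below are assumptions.

# Hypothetical meta.json for one document (field names and values are illustrative)
meta = {
    "ticker": "AAPL",
    "doc_type": "earnings_transcript",   # vs "10-K", "press_release", ...
    "fiscal_period": "Q4 FY2024",
    "period_start": "2024-06-30",
    "period_end": "2024-09-28",
    "filed_at": "2024-10-31",
    "sections": ["ceo_remarks", "cfo_remarks", "qa"],
}

def matches(meta: dict, ticker: str, doc_type: str, min_filed_at: str) -> bool:
    """Retrieval-time filter: right company, right document type, recent enough."""
    return (meta["ticker"] == ticker
            and meta["doc_type"] == doc_type
            and meta["filed_at"] >= min_filed_at)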

Key realization: Anyone can call an LLM API. Not everyone has decades of financial data normalized into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

3. The Parsing Problem

SEC filings are adversarial: They are designed for regulatory compliance, not machine reading:
- Tables span multiple pages with repeated headers
- Footnotes reference exhibits, which reference other footnotes
- Numbers appear in text, tables, and exhibits, sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has its own template)

Where off-the-shelf parsers fail:
- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em dashes, non-breaking spaces)

The Fintool parsing pipeline:

Raw filing (HTML/PDF)
Document structure detection (headers, sections, exhibits)
Table extraction with cell relationships preserved
Entity extraction (companies, people, dates, dollar amounts)
Cross-reference resolution (Ex. 10.1 → actual exhibit content)
Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
Quality scoring (confidence per extracted field)

Table extraction is complex: Financial tables are dense with meaning. A revenue breakdown table might have:
- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) referencing explanations below
- Negative numbers in parentheses: $(1,234) means -1,234
- Mixed units in the same table (revenue in millions, margins in percentages)
- Prior-period restatements flagged with italics or asterisks

Quality control: Every extracted table is scored on (see the sketch after this list):
- Cell boundary accuracy (were cells split/merged correctly?)
- Header detection (is the first row really headers, or is there a title row above it?)
- Numeric parsing (does "$1,234" parse to 1234, or stay as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence are flagged for review. Low-confidence extractions never enter the agent's context: garbage in, garbage out.
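A minimal sketch of the numeric-parsing piece, assuming the cell arrives as a raw string; it handles the two conventions called out above (parenthesized negatives and unit scaling) and returns None rather than guessing when it can't parse. This is illustrative, not Fintool's extractor.

SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def parse_amount(cell: str, unit_hint: str | None = None):
    """parse_amount('$(1,234)') -> -1234.0; parse_amount('4.2', 'billion') -> 4200000000.0."""
    text = cell.strip().replace(",", "").replace("$", "").rstrip("%")
    negative = text.startswith("(") and text.endswith(")")   # $(1,234) means -1,234
    text = text.strip("()").strip()
    try:
        value = float(text)
    except ValueError:
        return None                                          # leave it as text; don't guess
    if negative:
        value = -value
    return value * SCALE.get(unit_hint, 1)                   # apply the table's unit, if known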

Fiscal period normalization is critical: "Q1 2024" is ambiguous:
- Calendar Q1 (January-March 2024)
- Apple's fiscal Q1 (October-December 2023)
- Microsoft's fiscal Q1 (July-September 2023)
- "Reported in Q1" (filed in Q1, but covering a prior period)

Fintool maintains a fiscal calendar database for 10,000+ companies. Every date reference is normalized to an absolute date range. When the agent retrieves "Apple Q1 2024 revenue," it knows to look at October-December 2023 data.

This is invisible to users but essential for correctness. Without it, you end up comparing Apple's October revenue to Microsoft's January revenue and calling it the "same quarter."
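A minimal sketch of that normalization, assuming fiscal years are labeled by the calendar year in which they end and quarters are clean three-month blocks (real calendars have 52/53-week years and odd quarter ends, which is exactly why a per-company database is needed). The month table is an illustrative subset.

from datetime import date
import calendar

FISCAL_YEAR_END_MONTH = {"AAPL": 9, "MSFT": 6, "GOOGL": 12}   # illustrative subset

def fiscal_quarter_range(ticker: str, fy_year: int, quarter: int) -> tuple[date, date]:
    """Resolve e.g. ('AAPL', 2024, 1) to the absolute date range Oct 1, 2023 - Dec 31, 2023."""
    end_month = FISCAL_YEAR_END_MONTH[ticker]
    start_index = (fy_year - 1) * 12 + end_month      # 0-based month index where the FY starts
    q_start = start_index + 3 * (quarter - 1)
    q_end = q_start + 2
    sy, sm = divmod(q_start, 12)
    ey, em = divmod(q_end, 12)
    return (date(sy, sm + 1, 1),
            date(ey, em + 1, calendar.monthrange(ey, em + 1)[1]))

With this table, fiscal_quarter_range("AAPL", 2024, 1) gives 2023-10-01 through 2023-12-31, while the calendar-year company resolves Q1 2024 to January-March 2024.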

4. Skills Are Everything

Key realization: The model is not the product. Skills are now the product.

The author learned this the hard way. He used to try to make the base model "smarter" through prompt engineering: tweaking the system prompt, adding examples, writing elaborate instructions. It helped a little. But skills were the missing piece.

Model weaknesses without skills: Ask a frontier model to do a DCF valuation. It knows what a DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use the wrong discount rate for the industry, forget to add back stock-based compensation, and skip the sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter.

The breakthrough: Start treating skills as first-class citizens, as part of the product itself.

What is a skill? A markdown file that tells the agent how to do something specific. A simplified DCF skill:

# dcf

## When to Use
Use this skill for discounted cash flow valuations.

## Instructions
1. Deep dive on the company using Task tool (understand all segments)
2. Identify the company's industry and load industry-specific guidelines
3. Gather financial data: revenue, margins, CapEx, working capital
4. Build the DCF model in Excel using xlsx skill
5. Calculate WACC using industry benchmarks
6. Run sensitivity analysis on WACC and terminal growth
7. Validate: reconcile base year to actuals, compare to market price
8. Document your view vs market pricing

## Industry Guidelines
- Technology/SaaS: `/public/skills/dcf/guidelines/technology-saas.md`
- Healthcare/Pharma: `/public/skills/dcf/guidelines/healthcare-pharma-biotech.md`
- Financial Services: `/public/skills/dcf/guidelines/financial-services.md`
[... 10+ industries with specific methodologies]

That's it. One markdown file. No code changes. No production deployment. Just a file that tells the agent what to do.

Why skills beat code:

  1. Non-engineers can create skills - Analysts write skills. Customers write skills. A portfolio manager who has done 500 DCF valuations can encode their methodology into a skill without writing a single line of Python.

  2. No deployment needed - Changing a skill file takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.

  3. Readable and auditable - When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

Copy-on-write shadowing system:
Priority: private > shared > public

Don't like how Fintool does DCF valuations? Write your own. Drop it in /private/skills/dcf/SKILL.md. Your version wins.

Why not mount every skill into the filesystem:
The naive approach is to mount every skill file directly into the sandbox. The agent can just cat whatever skill it needs. Simple, right?

Wrong. Here is why SQL discovery is used instead:

SELECT user_id, path, metadata
FROM fs_files
WHERE user_id = ANY(:user_ids)
AND path LIKE 'skills/%/SKILL.md'

  1. Lazy loading - There are dozens of skills with extensive documentation; the DCF skill alone has 10+ industry guideline files. Loading everything into every conversation's context would burn tokens and confuse the model. Instead, skill metadata (name, description) is discovered up front, and the full documentation is loaded only when the agent actually uses that skill.

  2. Access control at query time - The SQL query implements the three-tier access model: public skills for everyone, org skills for users in that organization, private skills for the individual user. The database enforces it. No accidentally exposing one customer's proprietary skill to another.

  3. Shadowing logic - When a user customizes a skill, their version needs to override the default. SQL makes this easy: query all three tiers, apply the priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering.

  4. Metadata-driven filtering - The fs_files.metadata column stores the parsed YAML frontmatter. You can filter by skill type, check whether a skill is main-agent-only, or query any other structured attribute, all without reading the files themselves.

The pattern: S3 is the source of truth, a Lambda function syncs changes into PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it.
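A minimal sketch of the shadowing rule applied to rows returned by a query like the one above; the row shape is an assumption, but the logic is just "group by skill path, keep the highest-priority tier."

PRIORITY = {"private": 0, "shared": 1, "public": 2}   # lower number wins

def resolve_skills(rows: list[dict]) -> dict[str, dict]:
    """rows like [{"path": "skills/dcf/SKILL.md", "tier": "public", ...}] -> winning row per path."""
    winners: dict[str, dict] = {}
    for row in rows:
        current = winners.get(row["path"])
        if current is None or PRIORITY[row["tier"]] < PRIORITY[current["tier"]]:
            winners[row["path"]] = row            # a private copy shadows shared, which shadows public
    return winners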

Skills matter enormously: It is hard to overstate this. If you're building an AI agent without a skills system, you're going to have a bad time. The strongest argument: the top models (Claude or GPT) are post-trained on using skills. The model wants to fetch skills.

Models just want to learn, and what they want to learn is our skills... until they eat them.

5. The Model Will Eat Your Scaffolding

The uncomfortable truth: Everything just said about skills? In the author's view, it's temporary.

Models are improving fast. Every few months a new model makes half your code obsolete. That elaborate scaffolding you built to handle edge cases? The model just... handles them now.

Early on, even simple tasks needed detailed skills with step-by-step instructions: "First do X, then Y, then check Z." Now? For simple tasks you can often just say "do an earnings preview" and the model figures it out (kinda!).

This creates a weird tension: You need skills today because current models aren't smart enough, but you should design your skills knowing that future models will need less hand-holding. This is why the author is bullish on markdown files versus code for model instructions: they are easier to update and delete.

Send detailed feedback to AI labs: Whenever complex scaffolding gets built to work around a model limitation, document exactly where the model struggled and share it with the lab's research team. This helps inform the next generation of models. The goal is to make your own scaffolding obsolete.

Prediction: Within two years, most basic skills will be one-liners. "Generate a DCF with 20 tabs." That's it. The model will know what that means.

But the flip side: As basic tasks become commoditized, the work pushes into more complex territory: multi-step valuations with segment-by-segment analysis, automated backtesting of investment strategies, real-time portfolio monitoring with complex triggers. The frontier keeps moving.

So write skills. Delete them when they become unnecessary. Build new ones for the harder problems that emerge. And all of these are files... in our filesystem.

6. The S3-First Architecture

The surprising discovery: For files, S3 is a better database than a database.

Architecture: User data (watchlists, portfolio, preferences, memories, skills) is stored in S3 as YAML files. S3 is the source of truth. A Lambda function syncs changes into PostgreSQL for fast queries.

Writes → S3 (source of truth)
Lambda trigger
PostgreSQL (fs_files table)
Reads ← Fast queries

Why?
- Durability - S3 has 11 nines. A database doesn't.
- Versioning - S3 versioning gives you an audit trail for free.
- Simplicity - YAML files are human-readable. You can debug with cat.
- Cost - S3 is cheap. Database storage is not.

The pattern:
- Writes go directly to S3
- List queries hit the database (fast)
- Single-item reads go to S3 (freshest data)

Sync architecture: Two Lambda functions keep S3 and PostgreSQL in sync:

S3 (file upload/delete)
SNS Topic
fs-sync Lambda → Upsert/delete in fs_files table (real-time)

EventBridge (every 3 hours)
fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies

Both use timestamp-guarded upserts: newer data always wins. The reconcile job catches any missed events (S3 eventual consistency, Lambda cold starts, network issues).
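The timestamp guard can be expressed as a single upsert. A minimal sketch, assuming an fs_files table keyed on (user_id, path) with an updated_at column and a psycopg2-style connection supplied by the caller; the table and column names are inferred from the description above, not taken from Fintool's schema.

import json

def upsert_fs_file(conn, user_id: str, path: str, metadata: dict, updated_at):
    """Insert or update a row, but only let strictly newer data overwrite what's already there."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO fs_files (user_id, path, metadata, updated_at)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (user_id, path) DO UPDATE
              SET metadata = EXCLUDED.metadata,
                  updated_at = EXCLUDED.updated_at
              WHERE fs_files.updated_at < EXCLUDED.updated_at   -- newer data always wins
            """,
            (user_id, path, json.dumps(metadata), updated_at),
        )
    conn.commit()

Because the WHERE clause guards the update, a late-arriving fs-sync event or a reconcile pass can never clobber fresher data with stale data.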

User memories live here too: Every user has a /private/memories/UserMemories.md file in S3. It's just markdown; users can edit it directly in the UI. On every conversation, it's loaded and injected as context:

org_memories, user_memories = await fetch_memories(safe_user_id, org_id)

conversation_manager.add_backend_message(
    UserMessage(content=f"<user-memories>\n{user_memories}\n</user-memories>")
)

This is surprisingly powerful. Users write things like "I focus on small-cap value stocks," "Always compare to the industry median, not the mean," or "My portfolio is concentrated in tech, so flag concentration risk." The agent sees this on every conversation and adapts accordingly.

No migrations. No schema changes. Just a markdown file the user controls.

Watchlists work the same way: YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about "my watchlist," the relevant tickers are loaded and injected as context. The agent knows which companies matter to this user.

The filesystem becomes the user's personal knowledge base: Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files.

7. The File System Tools

Agents in financial services need to read and write files. Lots of files. PDFs, spreadsheets, images, code.

ReadFile handles the complexity (path normalization, MIME type detection, size limits).

WriteFile creates artifacts that link back into the UI:

# Files in /private/artifacts/ become clickable links
# computer://user_id/artifacts/chart.png → opens in viewer

Bash provides persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization is applied to everything (LLMs love attempting path traversal attacks; it's hilarious).
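A minimal sketch of the traversal guard, assuming every tool-supplied path must stay under the user's sandbox root: resolve first, then check containment, so "../../etc/passwd" and symlink tricks both get rejected. Illustrative only.

from pathlib import Path

def normalize_path(sandbox_root: str, user_path: str) -> Path:
    """Resolve a tool-supplied path and refuse anything that escapes the sandbox root."""
    root = Path(sandbox_root).resolve()
    candidate = (root / user_path.lstrip("/")).resolve()
    if not candidate.is_relative_to(root):          # Python 3.9+
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return candidate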

Bash is more important than you think: There is a growing conviction in the AI community that filesystems and bash are the best abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data.

The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify and SQL for structured queries.

This matches Fintool's experience. Financial data is messy. You need bash to grep through filings, find patterns, and explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both, plus the judgment to know when to use each.

Fintool leaned hard into giving agents full shell access inside the sandbox. Not just for running Python scripts, but for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require.

But complex tasks mean long-running agents. And long-running agents break everything.

8. Temporal Changed Everything

Before Temporal: Long-running tasks were a disaster. A user asks for a comprehensive company analysis; it takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything?

There was a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare.

After switching to Temporal: The author wanted to cry tears of joy!

@activity.defn
async def run_conversation(params: ConversationParams):
    # That's it. Temporal handles worker crashes, retries, everything.

That's it. Temporal handles worker crashes, retries, everything. If a Heroku dyno restarts mid-conversation (happens all the time, lol), Temporal automatically retries on another worker. The user never knows.

Cancellation handling is the tricky part: The user clicks "stop"; what happens? The activity is already running on a different server. Heartbeats are used, sent every few seconds.

Two worker types run in production:
- Chat workers - user-facing, 25 concurrent activities
- Background workers - async tasks, 10 concurrent activities

They scale independently. Chat traffic spikes? Scale the chat workers.

Next up: speed.

9. Real-Time Streaming

In finance, people are impatient. They won't wait 30 seconds staring at a loading spinner. They need to see something happening.

So real-time streaming was built. The agent works, and you watch its progress.

Agent → SSE Events → Redis Stream → API → Frontend

Key insight: Delta updates, not full state. Instead of sending "here's the complete response so far" (expensive), send "append these 50 characters" (cheap). A sketch of applying these operations follows the enum below.

enum DeltaOperation {
  ADD = "add",       // Insert object at index
  APPEND = "append", // Append to string/array
  REPLACE = "replace",
  PATCH = "patch",
  TRUNCATE = "truncate"
}
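For concreteness, a small Python sketch (mirroring the enum above) of what applying these operations to a client-side copy of the response might look like; the message shape and field names are assumptions.

def apply_delta(state: dict, delta: dict) -> dict:
    """Apply one delta event to the client-side copy of the response."""
    path, op, value = delta["path"], delta["op"], delta.get("value")
    target = state
    for key in path[:-1]:              # walk to the parent container
        target = target[key]
    leaf = path[-1]
    if op == "add":
        target[leaf].insert(delta["index"], value)
    elif op == "append":
        target[leaf] += value          # works for both strings and lists
    elif op == "replace":
        target[leaf] = value
    elif op == "patch":
        target[leaf].update(value)     # shallow merge for dict leaves
    elif op == "truncate":
        target[leaf] = target[leaf][: delta["length"]]
    return state

# e.g. apply_delta({"text": "Revenue grew"}, {"path": ["text"], "op": "append", "value": " 12% YoY"})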

Streaming rich content with Streamdown: Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. Streamdown renders markdown as it arrives, with custom plugins for domain-specific components.

Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. Users see a complete, interactive response building in real time.

AskUserQuestion: interactive agent workflows: Sometimes the agent needs user input mid-workflow:
- "Which valuation method do you prefer?"
- "Should I use consensus estimates or management guidance?"
- "Do you want me to include pipeline assets in the valuation?"

An AskUserQuestion tool was built that lets the agent pause, present options, and wait for user input.

When the agent calls this tool, the agentic loop intercepts it, saves state, and presents a UI to the user. The user picks an option (or types a custom answer), and the conversation resumes with their choice.

This transforms agents from autonomous black boxes into collaborative tools. The agent does the heavy lifting, but the user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions.
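A minimal sketch of that pause/resume flow, with a hypothetical state store and tool dispatcher (both stand-ins): the loop returns a "waiting" marker instead of a tool result, and a second entry point feeds the user's choice back in as the tool result.

PENDING: dict[str, dict] = {}   # conversation_id -> saved loop state (stand-in for a real store)

def run_tool(tool_call: dict):
    """Stand-in for the normal tool dispatch path."""
    return None

def handle_tool_call(conversation_id: str, tool_call: dict, loop_state: dict) -> dict:
    """Intercept AskUserQuestion: save state and surface the options to the UI instead of executing."""
    if tool_call["name"] == "AskUserQuestion":
        PENDING[conversation_id] = {"state": loop_state, "tool_call": tool_call}
        return {"status": "waiting_for_user", "options": tool_call["arguments"]["options"]}
    return {"status": "ran", "result": run_tool(tool_call)}

def resume_with_answer(conversation_id: str, answer: str) -> dict:
    """Called when the user picks an option; the answer becomes the tool result and the loop continues."""
    saved = PENDING.pop(conversation_id)
    saved["state"]["messages"].append({"role": "tool", "content": answer})
    return saved["state"]   # hand back to the agent loop to keep going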

10. Evaluation Is Not Optional

"Ship fast, fix later" works for most startups. It does not work for financial services.

A wrong earnings number can cost someone money. A misinterpreted guidance statement can lead to a bad investment decision. You can't "fix it later" when users are making million-dollar decisions based on your output.

Braintrust is used for experiment tracking: Every model change, every prompt change, every skill change is evaluated against the test set.

Generic NLP metrics (BLEU, ROUGE) don't work for finance: A response can be semantically similar yet have completely wrong numbers. Building eval datasets is harder than building the agent. Roughly 2,000 test cases are maintained across categories:

Ticker disambiguation: Deceptively hard:
- "Apple" → AAPL, not APLE (Appel Petroleum)
- "Meta" → META, not MSTR (which some people call "meta")
- "Delta" → DAL (the airline), or is the user talking about delta hedging (the options term)?

The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks "What happened to Facebook stock in 2023?", you need to know FB → META, and that historical data before October 2021 lives under the old ticker.

A ticker history table is maintained, with test cases for every major rename of the past decade.

Fiscal period hell: This is where most financial agents silently fail:
- Apple's Q1 is October-December (fiscal year ends in September)
- Microsoft's Q2 is October-December (fiscal year ends in June)
- Most companies' Q1 is January-March (calendar year)

"Last quarter" on January 15th means:
- Q4 2024 for calendar-year companies
- Q1 2025 for Apple (they just reported)
- Q2 2025 for Microsoft (they're mid-quarter)

Fiscal calendars are maintained for 10,000+ companies. Every period reference is normalized to an absolute date range. There are 200+ test cases for period extraction alone.

Numeric precision: $4.2B vs $4,200M vs $4.2 billion vs "four point two billion." All equivalent. But "4.2" alone is wrong: the units are missing. Millions? Billions? Per share?

Unit inference, magnitude normalization, and currency handling are all tested. A response that says "revenue was 4.2" without units fails the eval, even if 4.2B is correct.

Adversarial grounding: Fake numbers are injected into the context to verify the model cites the real source, not the planted one.

Example: include a fake analyst report saying "Apple revenue was $50B" alongside the real 10-K showing $94B. If the agent cites $50B, it fails. If it cites $94B with proper source attribution, it passes. There are 50 test cases specifically for hallucination resistance.
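For illustration, a sketch of what one such test case and scorer could look like; the figures come from the example above, but the case structure and scoring rule are assumptions rather than Fintool's actual harness.

ADVERSARIAL_CASE = {
    "question": "What was Apple's revenue last quarter?",
    "context": [
        {"source": "fake_analyst_note.pdf", "text": "Apple revenue was $50B last quarter."},  # planted
        {"source": "AAPL_10-K_2024.htm",    "text": "Total net sales were $94 billion."},     # real filing
    ],
    "trusted_sources": {"AAPL_10-K_2024.htm"},
    "poison_figures": {"$50B", "50 billion"},
}

def grade(case: dict, answer: str, cited_sources: set[str]) -> bool:
    """Fail if the answer repeats the planted number or cites only untrusted sources."""
    if any(fig in answer for fig in case["poison_figures"]):
        return False
    return bool(cited_sources & case["trusted_sources"])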

Eval-driven development: Every skill has a companion eval. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, and stock-based compensation add-backs (models forget this constantly).

PRs are blocked if the eval score drops by more than 5%. No exceptions.

11. Production Monitoring

Production setup:
- Web and worker dynos on Heroku
- Temporal Cloud for workflows
- PostgreSQL (Neon) for metadata
- S3 for user files
- Braintrust tracing every conversation
- Sentry for error tracking
- Datadog for metrics

GitHub issues are auto-filed for production errors: An error happens, and an issue is created with full context: conversation ID, user info, traceback, links to the Braintrust traces and Temporal workflows. Paying customers get a priority:high label.

Model routing by complexity: Simple queries use Haiku (cheap); complex analysis uses Sonnet (expensive). Enterprise users always get the best model.
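A minimal sketch of complexity-based routing; the heuristics and model identifiers are placeholders (the article only states the policy: Haiku for simple queries, Sonnet for complex analysis, the best model for enterprise users).

CHEAP_MODEL = "claude-haiku"      # placeholder identifiers, not exact model names
STRONG_MODEL = "claude-sonnet"

COMPLEX_HINTS = ("dcf", "valuation", "sensitivity", "backtest", "compare", "model")

def pick_model(query: str, is_enterprise: bool) -> str:
    """Route simple lookups to the cheap model and analysis-style requests to the strong one."""
    if is_enterprise:
        return STRONG_MODEL                          # enterprise users always get the best model
    q = query.lower()
    if len(q) > 300 or any(hint in q for hint in COMPLEX_HINTS):
        return STRONG_MODEL
    return CHEAP_MODEL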

The Ultimate Lesson: What's Your Moat?

The biggest lesson isn't about sandboxes or skills or streaming. It's this:

The model is not your product. The experience around the model is your product.

Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you can access, the skills you've built, the UX you've designed, the reliability you've engineered, and frankly how well you know the industry (which is a function of how much time you spend with your customers).

Models will keep getting better. That's great! It means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity.

Your moat is not the model. Your moat is everything you build around it.

For Fintool, that's financial data, domain-specific skills, real-time streaming, and the trust built with professional investors.


English Summary

Nicolas Bustamante shares two years of battle scars from building AI agents for financial services at Fintool, where mistakes can cost millions and professional investors make high-stakes decisions based on agent output. He argues "the model is not your product—the experience around the model is your product" and reveals 11 key architectural decisions born from the unforgiving demands of finance.

Core Insight: Fear as a Feature

The Unforgiving Nature of Finance:
This domain doesn't forgive mistakes. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption—professional investors make million-dollar decisions based on agent output. One mistake on a $100M position and trust is destroyed forever.

Users are also demanding: they spot bullshit instantly and require precision, speed, and depth. You can't hand-wave through a valuation model or gloss over nuances in an earnings call.

This forces "almost paranoid attention to detail." Every number gets double-checked, every assumption validated, every model stress-tested. You question everything the LLM outputs because users will.

"I sometimes feel that the fear of being wrong becomes our best feature."

Bold Architectural Bets:
When Claude Code launched with its filesystem-first agentic approach, Fintool immediately adopted it. At the time, the entire industry (including Fintool) was building elaborate RAG pipelines with vector databases and embeddings.

After reflecting on the future of information retrieval with agents, Bustamante wrote "The RAG Obituary" and Fintool moved fully to agentic search, even retiring their precious embedding pipeline.

People thought they were crazy. The article drew both praise and plenty of negative comments. Now most startups are adopting these best practices.

11 Key Architectural Decisions

1. The Sandbox Is Not Optional

The Wake-Up Call: Thought sandboxing was overkill for "just running Python scripts" until an LLM decided to rm -rf / on the server (trying to "clean up temporary files"). Became a true believer instantly.

Why It's Essential: Agents need multi-step operations. A DCF valuation request isn't a single API call—it's researching the company, gathering financial data, building Excel models, running sensitivity analysis, generating charts, iterating on assumptions. Dozens of steps, each potentially modifying files, installing packages, running scripts.

Can't do this without code execution. Executing arbitrary code on production servers is insane.

Architecture Design:
Every user gets an isolated environment with three mount points:
- Private: Read/write for personal data
- Shared: Read-only for organization
- Public: Read-only for everyone

Security Mechanism:
AWS ABAC (Attribute-Based Access Control) generates short-lived credentials scoped to specific S3 prefixes. User A physically cannot access User B's data. IAM policy uses ${aws:PrincipalTag/S3Prefix} for restriction.

Performance Optimization:
Sandbox pre-warming: when user starts typing, sandbox spins up in background. By the time they hit enter, sandbox is ready. 600-second timeout, extended 10 minutes on each tool usage. Sandbox stays warm across conversation turns.

2. Context Is the Product

Core Claim: Your agent is only as good as the context it can access. The real work isn't prompt engineering—it's turning messy financial data from dozens of sources into clean, structured context the model can actually use. Requires massive domain expertise from the engineering team.

The Heterogeneity Problem: Financial data comes in every format imaginable—SEC filings (HTML with nested tables), earnings transcripts (speaker-segmented text), press releases (semi-structured HTML), research reports (PDFs with charts), market data (structured databases), news, alternative data, broker research, fund filings.

Each source has different schemas, update frequencies, quality levels.

The Normalization Layer: Everything becomes three formats:
- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking Strategy Matters: Different documents chunk differently:
- 10-K filings: By regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: By speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: Usually small enough to be one chunk
- News articles: Paragraph-level chunks
- 13F filings: By holder and position changes quarter-over-quarter

Bad chunks = bad answers.

Tables Are Special: Financial data is full of tables. LLMs are surprisingly good at reasoning over markdown tables but terrible at HTML <table> tags or raw CSV dumps. Normalization layer converts everything to clean markdown tables.

Metadata Enables Retrieval: When user asks "What did Apple say about services revenue in their last earnings call?", Fintool needs:
- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

That's why meta.json exists for every document. Without structured metadata, you're doing keyword search over a haystack.

Key Recognition: Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

3. The Parsing Problem

SEC Filings Are Adversarial: Designed for legal compliance, not machine reading:
- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, exhibits, sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

Off-the-Shelf Parsers Failed On:
- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

Fintool Parsing Pipeline:

Raw Filing (HTML/PDF)
Document structure detection (headers, sections, exhibits)
Table extraction with cell relationship preservation
Entity extraction (companies, people, dates, dollar amounts)
Cross-reference resolution (Ex. 10.1 → actual exhibit content)
Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
Quality scoring (confidence per extracted field)

Table Extraction Complexity: Financial tables are dense with meaning. A revenue breakdown table might have merged header cells spanning multiple columns, footnote markers referencing explanations below, parentheses for negative numbers, mixed units in the same table, prior period restatements in italics.

Quality Control: Score every extracted table on cell boundary accuracy, header detection, numeric parsing, unit inference. Tables below 90% confidence get flagged for review. Low-confidence extractions don't enter agent's context—garbage in, garbage out.

Fiscal Period Normalization Is Critical: "Q1 2024" is ambiguous:
- Calendar Q1 (January-March 2024)
- Apple's fiscal Q1 (October-December 2023)
- Microsoft's fiscal Q1 (July-September 2023)
- "Reported in Q1" (filed in Q1, but covers prior period)

Fintool maintains a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When agent retrieves "Apple Q1 2024 revenue," it knows to look for October-December 2023 data.

Invisible to users but essential for correctness. Without it, you're comparing Apple's October revenue to Microsoft's January revenue and calling it "same quarter."

4. Skills Are Everything

Key Recognition: The model is not the product. Skills are now the product.

Learned this the hard way. Used to try making base model "smarter" through prompt engineering—tweaking system prompts, adding examples, writing elaborate instructions. Helped a little. But skills were the missing part.

Model Weakness Without Skills: Ask a frontier model to do a DCF valuation. It knows what DCF is, can explain the theory. But actually executing one? Will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. Output looks plausible but is subtly wrong in ways that matter.

The Breakthrough: Started thinking about skills as first-class citizens. Like part of the product itself.

What Is a Skill? A markdown file that tells the agent how to do something specific. A simplified DCF skill is shown in the detailed summary above.

Why Skills Beat Code:

  1. Non-engineers can create skills - Analysts write skills. Customers write skills. A portfolio manager who's done 500 DCF valuations can encode their methodology without writing a single line of Python.

  2. No deployment needed - Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.

  3. Readable and auditable - When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

Copy-on-write Shadowing System:
Priority: private > shared > public

Don't like how Fintool does DCF valuations? Write your own. Drop it in /private/skills/dcf/SKILL.md. Your version wins.

Why Not Mount All Skills to Filesystem:
Naive approach would mount every skill file directly into sandbox. Agent can just cat any skill it needs. Simple, right?

Wrong. Use SQL discovery instead for lazy loading, access control at query time, shadowing logic, and metadata-driven filtering without reading files themselves.

Pattern: S3 is source of truth, Lambda syncs changes to PostgreSQL for fast queries, agent gets exactly what it needs when it needs it.

Skills Are Essential: Cannot emphasize enough. If building an AI agent without a skills system, you're going to have a bad time. Biggest argument: top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills.

Models just want to learn, and what they want to learn is our skills... Until they eat them.

5. The Model Will Eat Your Scaffolding

Uncomfortable Truth: Everything just explained about skills? It's temporary.

Models are getting better fast. Every few months, new model makes half your code obsolete. The elaborate scaffolding built to handle edge cases? The model just... handles them now.

Early on, simple tasks needed detailed skills with step-by-step instructions: "First do X, then do Y, then check Z." Now? Can often just say "do an earnings preview" for simple tasks and the model figures it out (kinda!)

Creates Weird Tension: Need skills today because current models aren't smart enough. But should design skills knowing future models will need less hand-holding. Why bullish on markdown files versus code for model instructions—easier to update and delete.

Send Detailed Feedback to AI Labs: Whenever building complex scaffolding to work around model limitations, document exactly what model struggles with and share with lab research team. Helps inform next generation of models. Goal is to make own scaffolding obsolete.

Prediction: In two years, most basic skills will be one-liners. "Generate a 20 tabs DCF." That's it. Model will know what that means.

But Flip Side: As basic tasks get commoditized, will push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving.

So write skills. Delete them when unnecessary. Build new ones for harder problems that emerge. And all of those are files... in the filesystem.

6. The S3-First Architecture

Surprising Discovery: S3 for files is a better database than a database.

Store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files. S3 is source of truth. Lambda function syncs changes to PostgreSQL for fast queries.

Why?
- Durability: S3 has 11 9's. A database doesn't.
- Versioning: S3 versioning gives audit trails for free
- Simplicity: YAML files are human-readable. Can debug with cat.
- Cost: S3 is cheap. Database storage is not.

The Pattern:
- Writes go to S3 directly
- List queries hit database (fast)
- Single-item reads go to S3 (freshest data)

Sync Architecture: Run two Lambda functions to keep S3 and PostgreSQL in sync—real-time fs-sync and periodic fs-reconcile (every 3 hours).

User Memories Live Here: Every user has /private/memories/UserMemories.md file in S3. Just markdown—users can edit directly in UI. On every conversation, load it and inject as context. Surprisingly powerful. Users write things like "I focus on small-cap value stocks" or "Always compare to industry median, not mean." Agent sees this on every conversation and adapts.

No migrations. No schema changes. Just markdown file user controls.

Filesystem Becomes User's Personal Knowledge Base: Skills tell agent how to do things. Memories tell it what user cares about. Both are just files.

7. The File System Tools

Agents in financial services need to read and write files. A lot of files. PDFs, spreadsheets, images, code.

ReadFile handles complexity (path normalization, MIME type detection, size limits)

WriteFile creates artifacts that link back to UI

Bash gives persistent shell access with 180-second timeout and 100K character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it's hilarious).

Bash Is More Important Than You Think: Growing conviction in AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data.

Results: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. Winner? Hybrid approach where agent uses bash to explore and verify, SQL for structured queries.

Matches Fintool's experience. Financial data is messy. Need bash to grep through filing documents, find patterns, explore directory structures. But also need structured tools for heavy lifting. Agent needs both—and judgment to know when to use each.

Leaned hard into giving agents full shell access in sandbox. Not just for running Python scripts—for exploration, verification, and the kind of ad-hoc data manipulation complex tasks require.

But complex tasks mean long-running agents. And long-running agents break everything.

8. Temporal Changed Everything

Before Temporal: Long-running tasks were a disaster. User asks for comprehensive company analysis—takes 5 minutes. What if server restarts? What if user closes tab and comes back? What if... anything?

Had homegrown job queue. It was bad. Retries inconsistent. State management nightmare.

After Switching to Temporal: Wanted to cry tears of joy!

@activity.defn
async def run_conversation(params: ConversationParams):
    # That's it. Temporal handles worker crashes, retries, everything.

Temporal handles worker crashes, retries, everything. If Heroku dyno restarts mid-conversation (happens all the time), Temporal automatically retries on another worker. User never knows.

Cancellation Handling Is Tricky: User clicks "stop," what happens? Activity already running on different server. Use heartbeats sent every few seconds.

Run Two Worker Types:
- Chat workers: User-facing, 25 concurrent activities
- Background workers: Async tasks, 10 concurrent activities

Scale independently. Chat traffic spikes? Scale chat workers.

Next is speed.

9. Real-Time Streaming

In finance, people are impatient. Won't wait 30 seconds staring at loading spinner. Need to see something happening.

Built real-time streaming. Agent works, you see progress.

Agent → SSE Events → Redis Stream → API → Frontend

Key Insight: Delta updates, not full state. Instead of sending "here's complete response so far" (expensive), send "append these 50 characters" (cheap).

Streaming Rich Content with Streamdown: Text streaming is table stakes. Harder problem is streaming rich content: markdown with tables, charts, citations, math equations. Use Streamdown to render markdown as it arrives, with custom plugins for domain-specific components.

Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. User sees complete, interactive response building in real-time.

AskUserQuestion: Interactive Agent Workflows: Sometimes agent needs user input mid-workflow:
- "Which valuation method do you prefer?"
- "Should I use consensus estimates or management guidance?"
- "Do you want me to include pipeline assets in valuation?"

Built AskUserQuestion tool that lets agent pause, present options, wait for user input.

When agent calls this tool, agentic loop intercepts it, saves state, presents UI to user. User picks option (or types custom answer), conversation resumes with their choice.

Transforms agents from autonomous black boxes into collaborative tools. Agent does heavy lifting, but user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions.

10. Evaluation Is Not Optional

"Ship fast, fix later" works for most startups. Does not work for financial services.

Wrong earnings number can cost someone money. Misinterpreted guidance statement can lead to bad investment decisions. Can't "fix it later" when users making million-dollar decisions based on output.

Use Braintrust for Experiment Tracking: Every model change, prompt change, skill change gets evaluated against test set.

Generic NLP Metrics Don't Work: BLEU, ROUGE don't work for finance. Response can be semantically similar but have completely wrong numbers. Building eval datasets is harder than building the agent. Maintain ~2,000 test cases across categories:

Ticker Disambiguation: Deceptively hard:
- "Apple" → AAPL, not APLE (Appel Petroleum)
- "Meta" → META, not MSTR (which some call "meta")
- "Delta" → DAL (airline) or user talking about delta hedging (options term)?

The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks "What happened to Facebook stock in 2023?", you need to know FB → META, and that historical data before Oct 2021 lives under the old ticker.

Maintain ticker history table and test cases for every major rename in last decade.

Fiscal Period Hell: Where most financial agents silently fail:
- Apple's Q1 is October-December (fiscal year ends September)
- Microsoft's Q2 is October-December (fiscal year ends June)
- Most companies' Q1 is January-March (calendar year)

"Last quarter" on January 15th means Q4 2024 for calendar-year companies, Q1 2025 for Apple (just reported), Q2 2025 for Microsoft (mid-quarter).

Maintain fiscal calendars for 10,000+ companies. Every period reference normalized to absolute date ranges. 200+ test cases just for period extraction.

Numeric Precision: Revenue of $4.2B vs $4,200M vs $4.2 billion vs "four point two billion"—all equivalent. But "4.2" alone is wrong—missing units.

Test unit inference, magnitude normalization, currency handling. Response saying "revenue was 4.2" without units fails eval, even if 4.2B is correct.

Adversarial Grounding: Inject fake numbers into context and verify model cites real source, not planted one. Example: Include fake analyst report stating "Apple revenue was $50B" alongside real 10-K showing $94B. If agent cites $50B, fails. If cites $94B with proper source attribution, passes. 50 test cases specifically for hallucination resistance.

Eval-Driven Development: Every skill has companion eval. DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, stock-based compensation add-backs (models forget this constantly).

PR blocked if eval score drops >5%. No exceptions.

11. Production Monitoring

Production setup: Heroku web/worker dynos, Temporal Cloud for workflows, PostgreSQL (Neon) for metadata, S3 for user files, Braintrust tracking every conversation, Sentry for errors, Datadog for metrics.

Auto-file GitHub Issues for Production Errors: Error happens, issue created with full context—conversation ID, user info, traceback, links to Braintrust traces and Temporal workflows. Paying customers get priority:high label.

Model Routing by Complexity: Simple queries use Haiku (cheap), complex analysis uses Sonnet (expensive). Enterprise users always get best model.

The Meta Lesson: What's Your Moat?

The biggest lesson isn't about sandboxes or skills or streaming. It's this:

The model is not your product. The experience around the model is your product.

Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you have access to, the skills you've built, the UX you've designed, the reliability you've engineered, and frankly how well you know the industry (which is a function of how much time you spend with your customers).

Models will keep getting better. That's great! Means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity.

Your moat is not the model. Your moat is everything you build around it.

For Fintool, that's financial data, domain-specific skills, real-time streaming, and the trust built with professional investors.

What's yours?


Key Takeaways

  1. Sandboxes are mandatory for any agent executing multi-step operations—AWS ABAC + S3 prefix isolation
  2. Context normalization is 80% of the work—transforming heterogeneous financial data into clean Markdown/CSV/JSON
  3. Skills are the product, not the model—Markdown-based, non-engineer editable, copy-on-write shadowing
  4. S3-First architecture beats traditional databases for user data—11 9's durability + versioning + human-readable YAML
  5. Filesystem + Bash are optimal abstractions for agents—exploration, verification, ad-hoc data manipulation
  6. Temporal solves long-running task hell—automatic retries, state management, cancellation handling
  7. Real-time streaming via delta updates—SSE + Redis Stream, AskUserQuestion for interactive workflows
  8. Domain-specific evals are non-negotiable—2,000+ test cases for ticker disambiguation, fiscal period normalization, numeric precision
  9. Models will eat your scaffolding—design for obsolescence, markdown beats code for future-proofing
  10. Fear of being wrong becomes your best feature—in finance, one mistake destroys trust forever
  11. Your moat is everything around the model—data access, skills, UX, reliability, domain expertise