content/03-rag/chromadb-inspect.md

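ChromaDB Data Inspection Guide

Graphical Tools

1. DB Browser for SQLite (simplest)

A persistent ChromaDB store keeps its records in a SQLite file, so no extra dependencies are needed. Open the file directly: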

open interview_kb/chroma.sqlite3

Download: https://sqlitebrowser.org
Key tables: embeddings / embedding_metadata / collections
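
For a quick look without a GUI, the same file can be read with Python's built-in sqlite3 module. The following is a minimal sketch, not an official Chroma tool: the helper name table_row_counts is made up for illustration, and the path interview_kb/chroma.sqlite3 is assumed from the command above. Chroma's internal schema is not a stable API, so table names may differ between versions.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only so inspection can never modify the store.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        counts = {}
        for name in names:
            try:
                counts[name] = conn.execute(
                    f'SELECT COUNT(*) FROM "{name}"'
                ).fetchone()[0]
            except sqlite3.Error:
                # Some virtual tables may not be readable by this SQLite build.
                counts[name] = None
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    db = Path("interview_kb/chroma.sqlite3")  # assumed location of the store
    if db.exists():
        for name, count in table_row_counts(db).items():
            print(f"{name:35s} {count}")
```

Opening the file with mode=ro keeps the inspection safe to run while the application is using the store.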


2. chromadb-admin (Web UI)

pip install chromadb-admin
python -m chromadb_admin --path interview_kb

Open http://localhost:8000 in a browser to browse and search collections.


3. UMAP Cluster Visualization

Check whether chunks from different topics fall into clearly separated semantic regions.

pip install umap-learn matplotlib

import chromadb
import numpy as np
import matplotlib.pyplot as plt
from umap import UMAP

col = chromadb.PersistentClient(path="interview_kb").get_collection("ai_handbook")
# Embeddings are not returned by default, so request them explicitly.
res = col.get(include=["embeddings", "metadatas"])

embs   = np.array(res["embeddings"])
topics = [m["topic"] for m in res["metadatas"]]
# Project the embeddings to 2D; random_state makes the layout reproducible.
coords = UMAP(n_components=2, random_state=42).fit_transform(embs)

# One scatter series per topic so each gets its own color and legend entry.
colors = {"rag": "blue", "mcp": "green", "agent": "orange", "interview": "red"}
for topic, color in colors.items():
    idx = [i for i, t in enumerate(topics) if t == topic]
    plt.scatter(coords[idx, 0], coords[idx, 1], c=color, label=topic, alpha=0.6, s=10)

plt.legend()
plt.title("ChromaDB chunks by topic")
plt.savefig("kb_viz.png", dpi=150)
print("saved kb_viz.png")

Ideal result: points from different topics form clearly separated clusters.
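
Judging separation by eye can be backed up with a rough number. The sketch below is one possible check, not part of the original workflow: it computes a centroid per topic in the original embedding space and reports the fraction of chunks whose most similar centroid (by cosine similarity) belongs to their own topic. The function name nearest_centroid_agreement is hypothetical, and the metric is a simple heuristic rather than a standard clustering score.

```python
import math
from collections import defaultdict


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_centroid_agreement(embeddings, topics):
    """Fraction of chunks whose most similar topic centroid is their own topic."""
    if len(topics) == 0:
        return 0.0
    groups = defaultdict(list)
    for vec, topic in zip(embeddings, topics):
        groups[topic].append(vec)
    # Centroid = per-dimension mean of all vectors in the topic.
    centroids = {
        topic: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for topic, vecs in groups.items()
    }
    hits = 0
    for vec, topic in zip(embeddings, topics):
        best = max(centroids, key=lambda t: _cosine(vec, centroids[t]))
        hits += best == topic
    return hits / len(topics)


# Usage with the variables from the script above:
#   print(nearest_centroid_agreement(res["embeddings"], topics))
```

A value close to 1.0 suggests the topics occupy distinct regions; a value near 1 divided by the number of topics suggests they overlap heavily.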


Command-Line Inspection

Basic Info + Chunk Sampling

import chromadb

col = chromadb.PersistentClient(path="interview_kb").get_collection("ai_handbook")

# Total chunk count and collection metadata (version info, if stored there)
print(col.count())
print(col.metadata)

# First 5 chunks (to check how the chunking turned out)
res = col.get(limit=5, include=["documents", "metadatas"])
for doc, meta in zip(res["documents"], res["metadatas"]):
    print(f"[{meta['topic']}] {meta['source']}")
    print(doc[:200])
    print("---")

# Chunk count per topic
for topic in ["rag", "mcp", "agent", "interview"]:
    r = col.get(where={"topic": topic}, include=[])
    print(f"{topic:12s}: {len(r['ids'])} chunks")
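
The loop above issues one query per hard-coded topic, so chunks with an unexpected or missing topic value go unnoticed. As an alternative sketch, the metadata can be fetched once and tallied locally. The helper count_by_topic and the "<missing>" label are illustrative names, not part of Chroma.

```python
from collections import Counter


def count_by_topic(metadatas, key="topic"):
    """Tally chunks by a metadata field, including chunks that lack it."""
    return Counter((m or {}).get(key, "<missing>") for m in metadatas)


# Usage with the collection opened above:
#   metas = col.get(include=["metadatas"])["metadatas"]
#   for topic, n in count_by_topic(metas).most_common():
#       print(f"{topic:12s}: {n} chunks")
```

This loads all metadata into memory, which is fine for a small knowledge base but may be slow for very large collections.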

Retrieval Quality Test

Run from the rag/code/ directory:

from importlib.util import module_from_spec, spec_from_file_location
import chromadb

# Load the embed function from the provider config script
spec = spec_from_file_location("p", "00_配置提供商_先改这个.py")
mod  = module_from_spec(spec)
spec.loader.exec_module(mod)
embed = mod.embed

col = chromadb.PersistentClient(path="interview_kb").get_collection("ai_handbook")

queries = [
    ("RRF 互惠排名融合原理", "rag"),
    ("MCP 三类能力",         "mcp"),
    ("什么是 Agentic RAG",  "rag"),
]

for query, topic in queries:
    print(f"\n{'─'*50}")
    print(f"Query: {query}  [topic={topic}]")
    res = col.query(
        query_embeddings=[embed(query).tolist()],
        n_results=3,
        where={"topic": topic},
        include=["documents", "metadatas", "distances"],
    )
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
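        # 1 - dist is a similarity only if the collection uses cosine distance
        # (hnsw:space = "cosine"); Chroma's default space is squared L2.
        score = round(1 - dist, 3)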
        print(f"\n  [{score}] {meta['source']}")
        print(f"  {doc[:300]}")
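
The expression 1 - dist in the loop above yields a cosine similarity only when the collection was created with the cosine space. The sketch below shows how the conversion could be made explicit for each of Chroma's three distance spaces. The function name distance_to_similarity is hypothetical, the l2 branch assumes unit-length embeddings, and the space must be passed in by hand because where it is recorded differs across Chroma versions.

```python
def distance_to_similarity(dist, space="cosine"):
    """Convert a Chroma distance into a similarity score.

    space is the collection's distance function: "cosine", "ip", or "l2".
    """
    if space in ("cosine", "ip"):
        # cosine: dist = 1 - cos(a, b); ip: dist = 1 - dot(a, b)
        return 1.0 - dist
    if space == "l2":
        # Chroma reports squared L2. For unit-length vectors,
        # |a - b|^2 = 2 - 2 * cos(a, b), so cos(a, b) = 1 - dist / 2.
        return 1.0 - dist / 2.0
    raise ValueError(f"unknown distance space: {space!r}")
```

If the embeddings are not normalized, the l2 result is not a cosine similarity and the score thresholds should not be applied to it.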

Retrieval Quality Criteria

Score        Rating
> 0.8        Very good
0.6 ~ 0.8    Usable
< 0.6        Needs work: revise the chunking strategy or switch embedding models
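
To apply these thresholds automatically when scanning many queries, a small helper along these lines could label each score. The name rate_score and the label strings are illustrative; scores of exactly 0.6 and 0.8 are treated as falling in the middle band.

```python
def rate_score(score):
    """Map a similarity score to the rating bands listed above."""
    if score > 0.8:
        return "very good"
    if score >= 0.6:
        return "usable"
    return "needs work"


# Usage inside the query loop:
#   print(f"  [{score}] {rate_score(score)}")
```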

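Chunk quality checklist:

  • Each chunk should start with a breadcrumb title, e.g. [Hybrid Retrieval > RRF Algorithm]
  • Results should not match UI text such as navigation bars or buttons
  • A single chunk should be roughly 100 to 500 characters long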
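
The three checks can be automated roughly as follows. This is a sketch under assumptions: the breadcrumb is taken to be a leading bracketed label, the UI_NOISE word list is a made-up example that should be replaced with strings seen in your own sources, and length is counted in characters.

```python
import re

# A leading "[... > ...]" style label, e.g. "[Hybrid Retrieval > RRF Algorithm]".
BREADCRUMB = re.compile(r"^\s*\[[^\[\]]+\]")

# Hypothetical examples of UI residue; adjust to the pages you crawled.
UI_NOISE = ("Sign in", "Next page", "Previous page", "登录", "下一页")


def check_chunk(text, min_len=100, max_len=500):
    """Return a list of checklist violations for one chunk (empty if none)."""
    issues = []
    if not BREADCRUMB.match(text):
        issues.append("missing breadcrumb title")
    hits = [w for w in UI_NOISE if w in text]
    if hits:
        issues.append("possible UI text: " + ", ".join(hits))
    if not min_len <= len(text) <= max_len:
        issues.append(f"length {len(text)} outside {min_len}-{max_len}")
    return issues


# Usage with documents fetched from the collection:
#   for doc in col.get(include=["documents"])["documents"]:
#       problems = check_chunk(doc)
#       if problems:
#           print(problems, doc[:60])
```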
