wildchat

WildChat-MCP: Hierarchical Analysis Framework via Model Context Protocol

WildChat-MCP Framework

Advisor: Prof. Jimmy Lin, University of Waterloo
Code Repository: https://github.com/minsilu/WildChat-MCP

1. Abstract

Large Language Models (LLMs) have demonstrated profound capabilities in reasoning and tool-use, but their application to large-scale, unstructured exploratory semantic datasets is bottlenecked by context windows and rigid interaction paradigms. In this project, we architected WildChat-MCP, a comprehensive data-engineering and hierarchical autonomous agent framework via the Model Context Protocol (MCP). It enables an LLM (Claude Desktop) to act as a fully autonomous data analyst, exploring, indexing, and interpreting a massive dataset of over 1.4 million real-world user-assistant interaction logs contextually.

2. Core Methodologies and Contributions

2.1 Full-Stack ETL and Data Engineering Pipeline

We built a robust Extract, Transform, Load (ETL) pipeline using modern data stack principles:

  • HuggingFace Streaming to DuckDB: Implemented a streaming ingester directly from HuggingFace datasets into partitioned DuckDB architectures, bypassing excessive memory overheads.
  • Hybrid Topic Modeling & Clustering: Employed embedding-based k-means clustering strategies alongside a localized LLM (Qwen 7B) to autonomously generate high-fidelity cluster labels characterizing user intents.
  • High-Performance Database Indexing: Optimized large-scale analytical updates utilizing DuckDB’s CTAS (Create Table As Select) mechanisms. The database was subsequently indexed with a hybrid B-Tree and BM25 Full-Text Search (FTS) schema to ensure low-latency analytical retrievals during agent execution.

2.2 Hierarchical Agent Tool Stack Architecture

To overcome the contextual limitations of LLMs when interacting with massive databases, we designed a tiered, hierarchical tool access pattern—dubbed the “Meta → Macro → Discovery → Micro” architecture using the Model Context Protocol:

  • Meta Level: Tools enabling the agent to understand the database schema, available metrics, and data dictionaries.
  • Macro Level: Tools executing pre-computed aggregate queries (e.g., temporal trends, cluster distributions).
  • Discovery Level: Tools utilizing BM25 FTS and semantic search to iteratively filter and sample relevant dialogue cohorts based on hypothesized patterns.
  • Micro Level: Safe, read-only tools allowing the agent to inspect the granular turns of specific conversational sessions.

2.3 Safety-Sandboxed Execution Environment

Autonomous agents generating and executing SQL queries present significant integrity risks. To mitigate this, the MCP server enforces a strict safety-sandboxed read-only SQL execution environment, rigorously validating and sanitizing dynamically generated analysis queries before execution against the core DuckDB instance.

3. Conclusion

WildChat-MCP represents a forward-looking implementation of AI-driven data analytics. By standardizing the interfaces via MCP and backing the LLM with a robust, highly-indexed columnar data store (DuckDB), we transformed a static, million-row conversational dataset into an interactive, dynamically explorable knowledge graph driven purely by natural language hypothesis generation and autonomous tool orchestration.