Building a Production MCP Server for AI Assistant Integration

Hugging Face 2025

Hugging Face developed an official Model Context Protocol (MCP) server that lets AI assistants reach the company's model hub and thousands of AI applications through a single URL. The team navigated complex architectural decisions around transport protocols, choosing the Streamable HTTP transport over the deprecated SSE transport and adopting a stateless, direct-response configuration for production deployment. The server offers customizable tools for different user types and integrates with existing Hugging Face infrastructure, including authentication and resource quotas.

Industry

Tech

Hugging Face’s development of their official Model Context Protocol (MCP) server represents a significant case study in building production-ready infrastructure for AI assistant integration. The project demonstrates the complexities involved in deploying LLM-adjacent services that need to bridge the gap between AI assistants and external resources at scale.

The core challenge Hugging Face faced was enabling AI assistants to access their extensive ecosystem of AI models and applications through a standardized protocol. The Model Context Protocol, launched in November 2024, emerged as the standard for connecting AI assistants to external resources, making it a natural choice for Hugging Face to implement. However, the protocol’s rapid evolution, with three revisions in nine months, created significant technical challenges around compatibility and feature support across different client applications.

The team’s approach to solving this integration challenge involved several key architectural decisions that highlight important LLMOps considerations. First, they prioritized dynamic customization, allowing users to adjust their available tools on the fly rather than providing a static set of capabilities. This decision reflects a sophisticated understanding of how AI assistants are used in practice, where different users have varying needs and access requirements. Anonymous users receive a standard set of tools for using the Hub along with basic image generation capabilities, while authenticated users can access personalized tool sets and selected Gradio applications.

The technical architecture decisions reveal deep insights into production LLM infrastructure design. The team had to choose among three transport options: STDIO (for local connections), HTTP with Server-Sent Events (SSE), and the newer Streamable HTTP transport. Each option presents different trade-offs in deployment complexity, scalability, and feature support. The SSE transport, while still widely used, was deprecated in favor of the more flexible Streamable HTTP approach, forcing the team to balance current client compatibility against future-proofing their infrastructure.

The Streamable HTTP transport itself offers three communication patterns, each with distinct implications for production deployment. Direct Response provides simple request-response interactions suitable for stateless operations like searches. Request Scoped Streams enable temporary SSE connections for operations requiring progress updates or user interactions during execution. Server Push Streams support long-lived connections with server-initiated messages, enabling real-time notifications but requiring sophisticated connection management including keep-alive mechanisms and resumption handling.

For their production deployment, Hugging Face chose a stateless, direct response configuration, which demonstrates pragmatic engineering decision-making. The stateless approach enables simple horizontal scaling where any server instance can handle any request, avoiding the complexity of session affinity or shared state mechanisms that would be required for stateful configurations. This choice was validated by their use case analysis, where user state primarily consists of authentication credentials and tool selections that can be efficiently looked up per request rather than maintained in server memory.

The authentication integration showcases how LLMOps systems must seamlessly integrate with existing infrastructure. The server handles both anonymous and authenticated users, with authenticated access managed through HF_TOKEN or OAuth credentials. For authenticated users, the system correctly applies ZeroGPU quotas, demonstrating how resource management and billing considerations must be embedded into AI service architectures from the ground up.

The team’s approach to handling protocol evolution reveals important lessons about building resilient AI infrastructure. Rather than betting on a single transport method, they built their open-source implementation to support all transport variants (STDIO, SSE, and Streamable HTTP) in both direct response and server push modes. This flexibility allowed them to adapt to changing client capabilities while maintaining backward compatibility during the transition period.

The production deployment strategy reflects mature DevOps practices applied to AI infrastructure. The team implemented comprehensive observability features, including a connection dashboard that provides insights into how different clients manage connections and handle tool list change notifications. This observability is crucial for understanding system behavior and optimizing performance in production environments where AI assistants may have varying usage patterns and connection management strategies.

The case study also highlights the importance of resource management in AI service deployments. The server’s integration with Hugging Face’s quota system ensures that computational resources are properly allocated and billed, which is essential for sustainable AI service operations. The team’s decision to avoid features requiring sampling or elicitation during tool execution reflects a focus on minimizing deployment complexity and resource overhead, at least in the initial production release.

One notable aspect of this implementation is the balance between standardization and customization. While MCP provides a standard protocol, Hugging Face’s implementation allows for significant customization of available tools and applications. This flexibility is achieved through their integration with the broader Hugging Face ecosystem, including thousands of AI applications available through Spaces. This approach demonstrates how LLMOps systems can provide both standardized interfaces and rich, customizable functionality.

The open-source nature of the implementation adds another dimension to the case study. By making their MCP server code available, Hugging Face enables the broader community to learn from their architectural decisions and potentially contribute improvements. This approach aligns with the collaborative nature of the AI development community while also providing transparency into their technical choices.

The rapid adoption and evolution of the MCP protocol, as evidenced by major clients like VS Code and Cursor quickly adopting the new Streamable HTTP transport, validates the team's decision to prioritize the newer protocol version despite initial compatibility challenges. This demonstrates the importance of staying ahead of protocol evolution in rapidly developing AI infrastructure ecosystems.

Overall, this case study illustrates the complexity of building production-ready AI infrastructure that serves as a bridge between AI assistants and external resources. The technical decisions around transport protocols, state management, authentication, and resource allocation reflect the multifaceted challenges involved in deploying LLM-adjacent services at scale. The success of this implementation provides valuable insights for organizations looking to integrate AI assistants with their existing systems and services while meeting scalability, reliability, and security requirements.
