Posts tagged "infrastructure"

Checkpoint Replay, Worker Shape, and Where Durable Execution Is Going

Armin Ronacher's Absurd and Kitaru arrived at the same answers on replay semantics, ephemeral compute, and an agent-legible runtime. Here's why that matters.

Hamza Tahir

May 11, 2026

Kitaru

The runtime layer underneath your agent stack

What people call the agent stack is really four layers: model, harness, runtime, platform. Conflating them costs durability. The runtime layer, and one split inside it, gets the least attention.

Hamza Tahir

Apr 22, 2026

MLOps

KAI Scheduler vs Run:ai: Which GPU Scheduling Tool Fits Your AI Infrastructure?

We break down GPU scheduling, fractional GPU allocation, gang scheduling, integrations, and pricing to help you pick the right tool for your AI infrastructure.

Hamza Tahir

Apr 9, 2026 · 14 mins

MLOps

Run:ai vs ClearML: Which AI Infrastructure Platform Fits Your MLOps Stack?

In this Run:ai vs ClearML comparison, we break down GPU orchestration, workload scheduling, resource policies, RBAC, integrations, and pricing to help you pick the right platform for your AI infrastructure.

Hamza Tahir

Apr 6, 2026 · 14 mins

Kitaru

Introducing Kitaru: Open Source Infrastructure For Asynchronous Agents (Built by the ZenML Team)

Meet Kitaru — open source durable execution for Python agents, built by the ZenML team. Crash recovery, human-in-the-loop, and replay from any checkpoint.

ZenML Team

Apr 1, 2026 · 8 mins

Kitaru

Kitaru is open source and ready to use

Kitaru is live: an open-source infrastructure platform for running Python agents in production.

Hamza Tahir

Mar 21, 2026

Kitaru

From ZenML to Kitaru: Why We Built a New Product

We spent five years building ML pipeline infrastructure. Then agents showed up and we realized the next problem needed a new tool — not an extension of the old one.

Hamza Tahir

Mar 10, 2026

Kitaru

Your Agents Need More Than Just Traces

Tracing shows you what went wrong. But what if you could go back, fix the input, and resume from where it failed — without re-running everything?

Hamza Tahir

Mar 8, 2026

Kitaru

Why Kitaru Doesn't Use Journal Replay?

Every durable execution engine today forces your code to be deterministic. Kitaru takes a different approach — and it matters more than you think.

Hamza Tahir

Mar 5, 2026

Kitaru

Why Your AI Agents Need Durable Execution

AI agents fail: they time out, hit rate limits, and crash on bad API responses. Without durable execution, every failure means starting over from scratch.

Hamza Tahir

Mar 1, 2026

Kitaru

Your Agents Are Not Microservices

Durable execution engines were built for payment flows and order processing. AI agents need something different. Here's why.

Hamza Tahir

Feb 25, 2026

MLOps

Why Retail MLOps Is Harder Than You Think

An in-depth analysis of retail MLOps challenges, covering data complexity, edge computing, seasonality, and multi-cloud deployment, with real-world examples from major retailers like Wayfair and Starbucks, and practical solutions including ZenML's impact in reducing deployment time from 8.5 to 2 weeks at Adeo Leroy Merlin.

Hamza Tahir

May 16, 2025 · 5 mins

MLOps

NVIDIA KAI Scheduler: Optimize GPU Usage in ZenML Pipelines

Discover how to optimize GPU utilization in Kubernetes environments by integrating NVIDIA's KAI Scheduler with ZenML pipelines, enabling fractional GPU allocation for improved resource efficiency and cost savings in machine learning workflows.

Alex Strick van Linschoten

May 15, 2025 · 5 mins

LLMOps

Building Advanced Search, Retrieval, and Recommendation Systems with LLMs

Discover how embeddings power modern search and recommendation systems with LLMs, using case studies from the LLMOps Database. From RAG systems to personalized recommendations, learn key strategies and best practices for building intelligent applications that truly understand user intent and deliver relevant results.

Alex Strick van Linschoten

Dec 6, 2024 · 8 mins

MLOps

Cognitive Load in MLOps: Why Your Data Scientists Need Infrastructure Abstraction

Discover why cognitive load is the hidden barrier to ML success and how infrastructure abstraction can revolutionize your data science team's productivity. This comprehensive guide explores the real costs of infrastructure complexity in MLOps, from security challenges to the pitfalls of home-grown solutions. Learn practical strategies for creating effective abstractions that let data scientists focus on what they do best – building better models – while maintaining robust security and control. Perfect for ML leaders and architects looking to scale their machine learning initiatives efficiently.

Jayesh Sharma

Nov 18, 2024 · 2 mins

Case Studies

Empowering ZenML Pro Infrastructure Management: Our Journey from Spacelift to ArgoCD

The combination of ZenML and Neptune can streamline machine learning workflows and provide unprecedented visibility into experiments. ZenML is an extensible framework for creating production-ready pipelines, while Neptune is a metadata store for MLOps. When combined, these tools offer a robust solution for managing the entire ML lifecycle, from experimentation to production. The combination of these tools can significantly accelerate the development process, especially when working with complex tasks like language model fine-tuning. This integration offers the ability to focus more on innovating and less on managing the intricacies of your ML pipelines.