At Emelia, we embed artificial intelligence at the core of our B2B prospecting platform: AI-assisted cold email writing, data enrichment, lead scoring. Every LLM call passes through critical infrastructure that must be fast, reliable, and optimizable. That is why we keep a close eye on the LLMOps ecosystem, and TensorZero immediately caught our attention. We have started evaluating it internally on parallel projects, and here is our complete analysis.
TensorZero is an open-source LLMOps stack built primarily in Rust that unifies five essential building blocks of the LLM application lifecycle: the gateway, observability, optimization, evaluation, and experimentation. The stated ambition is clear: transform LLM applications from simple API wrappers into "defensible AI products."
The project was founded in January 2024 by Gabriel Bianconi (CEO, former CPO at Ondo Finance, Stanford BS & MS in Computer Science) and Viraj Mehta (CTO, CMU PhD in reinforcement learning, specialized in nuclear fusion and LLMs). The open-source release came in September 2024, and the repository immediately became the #1 trending repository of the week on GitHub, jumping from 3,000 to over 9,700 stars in a matter of months.
Today, TensorZero has accumulated 11,100 stars on GitHub, 769 forks, 124 contributors, and has raised $7.3 million in seed funding from FirstMark Capital, Bessemer Venture Partners, Bedrock, and DRW.
TensorZero's origin story is unusual. CTO Viraj Mehta spent years applying reinforcement learning (RL) to nuclear fusion reactors. In that domain, every data point costs roughly $30,000 for 5 seconds of collection. This extreme constraint forged an obsession: extracting maximum value from every available data point.
When Mehta and Bianconi started working together, they reconceptualized LLM applications as Partially Observable Markov Decision Processes (POMDPs). This theoretical framework, borrowed from RL research, models an LLM application as an agent making decisions in an uncertain environment, collecting feedback, and improving continuously.
| POMDP Element | LLM Equivalent |
|---|---|
| Agent (policy π) | The LLM functions themselves |
| State Space (S) | Unobserved factors (user intent, external systems) |
| Action Space (A) | LLM function outputs (text, tool calls) |
| Observation Space (O) | LLM function inputs (history, variables) |
| Transition (T) | Non-LLM application code + real-world changes |
| Reward (R) | Business KPIs (conversion, satisfaction, accuracy) |
This framing is not just academic. It structures TensorZero's entire architecture and enables what the team calls the data flywheel: a self-reinforcing learning loop.
The data flywheel is the central concept that sets TensorZero apart from competitors. It is a continuous cycle in four stages:
1. Collect. Every production inference is recorded in a structured format in ClickHouse. TensorZero does not store raw prompts: it records input variables, outputs, and feedback. This makes the stored data provider-agnostic, so the same dataset can be reused to fine-tune an OpenAI, Anthropic, or open-source model.
2. Optimize. Collected data is used to generate policy variants: prompt updates, supervised fine-tuning (SFT), preference fine-tuning (DPO), reinforcement learning from human feedback (RLHF), or inference-time optimization (Dynamic In-Context Learning, Best-of-N, Mixture-of-N).
3. Evaluate. Offline evaluations (backtests) on historical data validate each variant before deployment. TensorZero offers static evaluations (heuristics + LLM judges) and dynamic evaluations (end-to-end workflows).
4. Loop. Production traffic automatically generates new variants, evaluates them against KPIs, and closes the loop. Engineers focus on high-level decisions: what data to feed in, what feedback signals to use, what behaviors to incentivize.
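To make the collect-and-loop stages concrete, here is a minimal sketch of the payloads an application might send to a self-hosted gateway. The gateway address, function name, and metric name are placeholder assumptions, and the request shapes are simplified; check the field names against the current TensorZero HTTP API documentation before relying on them.

```python
# Illustrative payload builders for a self-hosted TensorZero gateway.
# All names below (address, function, metric) are assumptions for this sketch.
GATEWAY_URL = "http://localhost:3000"  # assumed gateway address

def build_inference_payload(function_name: str, user_message: str) -> dict:
    # Stage 1 (Collect): POST this to f"{GATEWAY_URL}/inference". The gateway
    # stores structured inputs and outputs in ClickHouse, not provider-specific
    # raw prompts, so the data stays reusable across providers.
    return {
        "function_name": function_name,
        "input": {"messages": [{"role": "user", "content": user_message}]},
    }

def build_feedback_payload(inference_id: str, metric_name: str, value) -> dict:
    # Stage 4 (Loop): POST this to f"{GATEWAY_URL}/feedback" once the business
    # outcome (e.g. the prospect replied to the email) is known.
    return {
        "inference_id": inference_id,
        "metric_name": metric_name,
        "value": value,
    }

payload = build_inference_payload("draft_cold_email", "Write a follow-up for ACME.")
```

Tying each inference to a later feedback event via its `inference_id` is what turns ordinary production traffic into labeled training data.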
The gateway is the component through which all LLM calls pass. Built in Rust, it delivers spectacular performance: under 1 ms P99 latency, even at 10,000 requests per second. For comparison, LiteLLM (written in Python) fails entirely at 1,000 QPS and already adds 25 to 100 times more latency than TensorZero at just 100 QPS.
| Metric | LiteLLM (100 QPS) | LiteLLM (500 QPS) | LiteLLM (1,000 QPS) | TensorZero (10,000 QPS) |
|---|---|---|---|---|
| Mean latency | 4.91 ms | 7.45 ms | Failure | 0.37 ms |
| P50 | 4.83 ms | 5.81 ms | Failure | 0.35 ms |
| P90 | 5.26 ms | 10.02 ms | Failure | 0.50 ms |
| P99 | 5.87 ms | 39.69 ms | Failure | 0.94 ms |
These benchmarks were run on the same AWS c7i.xlarge instance (4 vCPUs, 8 GB RAM), according to the official TensorZero documentation.
The gateway supports streaming, tool use, structured generation (JSON mode), batch inference, multimodal inputs, caching, automatic retries, fallbacks, load balancing, and granular timeouts. It is compatible with the OpenAI SDK, TensorZero's Python client, or a raw HTTP API.
TensorZero natively supports nearly twenty major providers: OpenAI, Anthropic, AWS Bedrock, AWS SageMaker, Azure OpenAI, DeepSeek, Fireworks, GCP Vertex AI (Anthropic and Gemini), Google AI Studio, Groq, Hyperbolic, Mistral, OpenRouter, SGLang, TGI, Together AI, vLLM, xAI (Grok), and any OpenAI-compatible API (including Ollama for local models).
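To illustrate how multi-provider routing and fallbacks are expressed, here is a hypothetical `tensorzero.toml` fragment. Every name in it (the model alias, deployment ID, endpoint) is an invented placeholder, and the exact provider fields should be verified against the configuration reference:

```toml
# Hypothetical fragment: one logical model served by two providers,
# tried in the order given by the routing list (fallback on failure).
[models.fast_model]
routing = ["openai", "azure"]

[models.fast_model.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"

[models.fast_model.providers.azure]
type = "azure"
deployment_id = "gpt-4o-mini"
endpoint = "https://example.openai.azure.com"
```

Because the application references the logical `fast_model`, swapping or reordering providers is a configuration change, not a code change.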
All inferences and feedback are stored in your own ClickHouse database, a columnar database optimized for analytics at scale. TensorZero provides an open-source UI for monitoring, with the ability to drill down to individual inferences or observe aggregate patterns. OpenTelemetry (OTLP) export to third-party tools is also supported.
A crucial capability: you can replay historical inferences with new prompts or models for counterfactual analysis. This means you can test a new prompt against six months of past data without sending a single request to an LLM.
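The idea behind replay can be sketched without any TensorZero-specific API: because inputs are stored as variables rather than as rendered prompts, a candidate template can be re-rendered over historical rows entirely offline. The snippet below is a simplified illustration of the principle, not TensorZero's actual replay interface.

```python
def replay_prompt(historical_rows: list[dict], new_template: str) -> list[str]:
    # Re-render a candidate prompt template against stored input variables.
    # No LLM call is needed to inspect what *would* have been sent.
    return [new_template.format(**row["input_variables"]) for row in historical_rows]

# Hypothetical rows, shaped like the structured records described above.
rows = [
    {"input_variables": {"name": "Ada", "product": "Emelia"}},
    {"input_variables": {"name": "Alan", "product": "Emelia"}},
]
rendered = replay_prompt(rows, "Write a short intro email to {name} about {product}.")
```

In practice the re-rendered prompts would then be scored offline (by heuristics or an LLM judge) before the new template is promoted to production.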
TensorZero distinguishes itself through the depth of its optimization capabilities:
Model optimization: supervised fine-tuning (SFT), preference fine-tuning (DPO), reinforcement learning from human feedback (RLHF).
Prompt optimization: MIPROv2, DSPy integration, GEPA. These methods automatically generate optimized prompt variants from your production data.
Inference-time optimization: Dynamic In-Context Learning (DICL), which dynamically selects the best examples to inject into context; Best-of-N sampling, which generates N responses and selects the best; Mixture-of-N, which combines outputs from multiple models; and Chain-of-Thought (CoT).
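Best-of-N is simple enough to sketch generically. Here `generate` and `score` are stand-ins for an LLM call and a judge (in TensorZero, judges can themselves be LLM functions); this is an illustration of the technique, not the library's implementation.

```python
import random

def best_of_n(generate, score, prompt: str, n: int = 4, seed: int = 0) -> str:
    # Draw n candidate completions, keep the one the judge scores highest.
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Stand-ins for demonstration: a "model" that emits drafts of varying length,
# and a judge that rewards brevity.
fake_generate = lambda prompt, rng: prompt + " " + "very " * rng.randint(0, 5) + "short reply"
brevity_judge = lambda text: -len(text)

best = best_of_n(fake_generate, brevity_judge, "Subject:", n=8)
```

The trade-off is explicit: N times the inference cost buys a higher chance that at least one candidate satisfies the judge.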
All optimizations rely on structured production data tied to real KPIs. This is not lab optimization: it is continuous improvement based on real-world usage.
TensorZero offers two types of evaluations:
Static evaluations: heuristics and LLM judges. These are the equivalent of unit tests for your LLM functions. LLM judges are themselves TensorZero functions, meaning they can be optimized using the same mechanisms.
Dynamic evaluations: end-to-end workflows, the equivalent of integration tests. They verify that the entire chain works correctly.
Experimentation is native in TensorZero. You can run rigorous A/B tests (randomized controlled trials) across models, prompts, providers, and hyperparameters. The system automatically manages randomization in complex multi-turn workflows.
Going further, TensorZero supports adaptive experimentation with multi-armed bandits, which progressively allocate more traffic to the best-performing variants.
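A multi-armed bandit can be sketched in a few lines. This epsilon-greedy version is a generic illustration of the idea (TensorZero's own allocation strategy may differ): a small fraction of traffic keeps exploring, while the rest goes to the variant with the best observed mean reward.

```python
import random

class EpsilonGreedyRouter:
    """Adaptive traffic allocation sketch: explore random variants with
    probability epsilon, otherwise exploit the best observed mean reward."""

    def __init__(self, variants, epsilon: float = 0.1, seed: int = 0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.pulls = {v: 0 for v in variants}   # times each variant was served
        self.total = {v: 0.0 for v in variants} # cumulative reward per variant

    def choose(self) -> str:
        untried = [v for v in self.pulls if self.pulls[v] == 0]
        if untried:  # serve every variant at least once before exploiting
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=lambda v: self.total[v] / self.pulls[v])

    def record(self, variant: str, reward: float) -> None:
        self.pulls[variant] += 1
        self.total[variant] += reward
```

Over time, the better-performing variant absorbs most of the traffic while the exploration budget keeps the comparison honest.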
Launched in preview in version 2026.1.7 (February 2026), Autopilot is described as "Claude Code for LLM engineering." It is an automated AI engineer that operates on top of the TensorZero stack.
Autopilot can:
- Analyze millions of inferences to surface error patterns and optimization opportunities
- Recommend models and inference strategies to improve quality, cost, and latency
- Generate and refine prompts based on human feedback, metrics, and evaluations
- Drive optimization workflows including fine-tuning, RL, and knowledge distillation
- Set up evaluations, prevent regressions, and align LLM judges to real-world scenarios
- Run A/B tests to validate changes and identify winners
The TensorZero team claims that Autopilot has already produced "substantial performance improvements in use cases ranging from data extraction to customer support agents."
Autopilot is currently invite-only (waitlist). It represents the future monetization layer on top of the fully free open-source stack.
The choice of Rust is deliberate. The team includes Aaron Hill, a Rust compiler maintainer, which speaks to the depth of technical expertise.
Here is why Rust is a decisive advantage for an LLM gateway:
The gateway is a hot path. Every LLM call flows through this layer. Even small latency additions multiply across millions of daily requests. At 10,000 QPS, a Python gateway adds 50 to 100 ms of pure overhead per request.
Memory safety without garbage collection pauses. Rust's ownership model eliminates both GC pauses (which cause latency spikes) and memory bugs (which cause production crashes).
Fearless concurrency. Rust's type system catches data races at compile time, not in production. For a concurrent gateway routing requests across dozens of LLM providers, this eliminates an entire category of bugs.
Operational predictability. Rust's performance is deterministic. Python's GIL (Global Interpreter Lock) creates throughput ceilings that only surface under production load.
How does TensorZero stack up against the most popular alternatives?
| Feature | TensorZero | LangSmith | Langfuse | LiteLLM |
|---|---|---|---|---|
| Open source | 100% (Apache 2.0) | Partial (LangSmith is commercial) | Partial (paid features) | Partial (enterprise tier) |
| Self-hosted | Yes | Partially | Yes | Yes |
| LLM Gateway | Yes | Via LangChain | No | Yes |
| Observability | Full OSS UI | Paid | Full UI (partial OSS) | Third-party integrations |
| Evaluations | Built-in | Paid | Built-in | No |
| A/B Testing | Native | No | No | No |
| Fine-tuning | Built-in (SFT, DPO, RLHF) | No | No | No |
| Inference-time opt | Yes (DICL, BoN, MoN) | No | No | No |
| Performance | < 1 ms P99 at 10K QPS | Slow (Python) | N/A | Fails at 1K QPS |
| Pricing | 100% free | Paid | Freemium | Freemium |
TensorZero cleanly separates application engineering from LLM optimization, where LangChain blends the two. LangSmith requires a separate paid subscription; TensorZero's observability is entirely free and open source. TensorZero is built for production; LangChain excels at rapid prototyping. Critically, TensorZero is language-agnostic (HTTP API), while LangChain only supports Python and JavaScript, according to the TensorZero comparison documentation.
Langfuse does not offer an LLM gateway, meaning you need to combine Langfuse with another tool for that function. Langfuse has a more mature observability UI and a more advanced playground, but TensorZero is significantly stronger on optimization. The two tools can actually be combined, per the official comparison page.
The performance gap is the most striking point. TensorZero handles 10,000 QPS with under 1 ms at P99; LiteLLM fails at 1,000 QPS. Beyond the gateway, TensorZero adds evaluations, A/B testing, and optimization that none of the alternatives offer, according to the benchmark documentation. LiteLLM does support over 100 models compared to about twenty for TensorZero (extensible via any OpenAI-compatible API).
A case study published by TensorZero describes deployment at a major European bank. Engineers were required to write detailed changelogs for every GitLab merge request, a task most skipped. TensorZero was integrated into GitLab CI/CD pipelines using Dynamic In-Context Learning (DICL): every human correction to an AI-generated changelog automatically fed future requests. The entire setup was deployed fully on-premise with TensorZero + Ollama, with no data leaving the bank's infrastructure.
On a named entity recognition task, a GPT-4o Mini model optimized with TensorZero outperformed unoptimized GPT-4o, at a fraction of the cost and latency, using only a small training dataset. This is evidence that systematic optimization can compensate for raw model size.
TensorZero demonstrated the ability to reverse-engineer the LLM client of Cursor (valued at $9.9 billion) to observe the prompts being used, A/B test different models, and optimize its own prompts and models. This demonstration went viral with over 11,700 views.
For a tool like Emelia, TensorZero's potential use cases are numerous:
- Optimizing AI-generated writing: every AI-generated email can be evaluated via user feedback (open rate, reply rate), feeding the data flywheel to continuously improve quality
- A/B testing models: testing GPT-4o vs Claude 3.7 vs Mistral on the same writing task, with real production metrics
- Cost reduction: using fine-tuning to achieve GPT-4o performance with a lighter, cheaper model
- Minimal latency: the Rust gateway ensures that LLM infrastructure is never the bottleneck
TensorZero is distributed under the Apache 2.0 license. No features are paywalled. Self-hosting is free: you bring your own LLM API keys, and TensorZero adds zero cost on top. Even enterprise support is free: a simple email to hello@tensorzero.com gets you a dedicated Slack or Teams channel.
Gabriel Bianconi explained to VentureBeat: "We realized very early on that we needed to make this open source, to give enterprises the confidence to do this." The open-source strategy directly addresses the vendor lock-in fear that enterprises feel about their sensitive AI data.
Future monetization will come from a managed service (Autopilot) that will include GPU infrastructure for fine-tuning, automated experiment management, and proactive optimization suggestions. FirstMark's Matt Turck summarized the situation in a tweet: "Been thinking about feedback loops in AI forever and those guys are the real deal."
The team, based in Brooklyn, New York, packs a remarkable depth of expertise:
| Name | Role | Background |
|---|---|---|
| Gabriel Bianconi | CEO | CPO at Ondo Finance (DeFi, $1B+ AUM); Stanford BS & MS |
| Viraj Mehta | CTO | CMU PhD (RL for nuclear fusion + LLMs); Stanford BS & MS |
| Aaron Hill | Engineer | Rust compiler maintainer; AWS, Svix |
| Alan Mishler | Researcher | VP at J.P. Morgan AI Research; CMU PhD; 1,300+ citations |
| Andrew Jesson | Researcher | Columbia postdoc, Oxford PhD (LLMs); 4,000+ citations |
| Antoine Toussaint | Engineer | Staff SWE (Shopify, Illumio); ex-quant; Princeton PhD |
| Michelle Hui | ML/Product | Wing/Alphabet, UN; Cornell BS & MS |
| Shuyang Li | Engineer | Staff SWE at Google (LLM infra, search); Palantir |
| Simeon Lee | Design | Head of Design at Merge; design engineer AI/devtools |
Let us be honest about the limitations:
Learning curve. TensorZero uses a Configuration-as-Code (GitOps) approach for prompt management. If your team is not comfortable with TOML configuration files and Git workflows, onboarding will be slower than with a GUI-based tool like Langfuse or LangSmith.
Less mature observability UI. TensorZero's user interface is functional but less polished than Langfuse or LangSmith. If your priority is an elegant dashboard for non-technical stakeholders, other tools are better suited.
No dynamic routing by latency or cost. LiteLLM offers dynamic routing based on provider latency or cost. TensorZero only supports static routing for now.
Limited native provider count. With around 20 native providers versus over 100 for LiteLLM, TensorZero covers the major players but not the long tail. Any OpenAI-compatible API can be added, however.
No native SSO/access control. For large organizations, the lack of built-in SSO requires adding Nginx or OAuth2 Proxy, which complicates deployment.
Autopilot still in preview. The most promising feature is invite-only and not yet stabilized.
TensorZero is probably not the right fit for:

- Teams just getting started with LLMs who want a quick prototype (LangChain is better suited)
- Organizations that need a full GUI for non-technical users
- Low-volume projects that do not justify the ClickHouse infrastructure
- Teams without a DevOps/GitOps culture

Conversely, TensorZero is a strong fit for:

- Teams operating LLM applications in production at medium to large scale
- Organizations that want to continuously optimize models with production data
- Companies with strict performance requirements (latency, throughput)
- Teams with data sovereignty constraints (self-hosted, on-premise)
- AI startups looking to build a competitive moat through the learning loop
TensorZero is the most ambitious project in the open-source LLMOps ecosystem. Where most tools focus on a single aspect (observability for Langfuse, gateway for LiteLLM, prototyping for LangChain), TensorZero aims for complete integration of the LLM lifecycle in a single coherent stack.
The technical bet is bold: building in Rust for unmatched performance, modeling LLM applications as POMDPs to maximize learning, and making everything 100% free and open source. With 11,100 stars, $7.3 million in funding, and a team that includes a Rust compiler maintainer, a J.P. Morgan AI Research VP, and Staff engineers from Google and Shopify, the project has the means to match its ambitions.
As the LLM gateway guide from getmaxim.ai noted: "TensorZero targets teams with strong DevOps cultures that treat AI infrastructure with the same rigor as traditional backend systems."
For teams building serious LLM applications, the question is no longer whether you need an LLMOps stack, but which one to choose. TensorZero makes a compelling case for being that choice.