WEKA OCI Benchmark Shows 10x Long-Context AI Gains

WEKA says its OCI benchmark showed NeuralMesh serving 10x more users and 7x more tokens per GPU on an OCI H100 cluster by moving KV cache to NVMe, without adding GPUs.

Published

June 9, 2026

Logan Pierce

WEKA announced its OCI benchmark results on June 9, 2026. According to the announcement, NeuralMesh with Augmented Memory Grid served 10x more concurrent users, delivered 10x higher token throughput, and produced 7x more tokens per GPU than DRAM-only configurations without adding GPUs. WEKA says the results were validated on a 9-node OCI bare-metal H100 cluster with 100,000-token context windows.

Oracle published the benchmark methodology, system configuration, and results on May 13, 2026. That write-up places the work in a narrow class of long-context and agentic AI workloads where KV cache eviction causes unnecessary recomputation.

Oracle Published the Test Bed

Oracle’s benchmark page says the work moved from early validation to production-relevant workload testing on OCI bare-metal H100s. The test used 72 GPUs total across a 9-node cluster, with 8 GPUs per node. It also used multiple TP4 instances of MiniMax-M2.5-NVFP4.

The workload definition was specific. Each simulated user sent a 100K-token input and received a 100-token response per turn, and the tests were configured to maximize potential cache hit rate while isolating offloading versus recompute.

16x Gen4 NVMe drives per node pooled into a converged Augmented Memory Grid layer.
287 TiB usable NVMe in Augmented Memory Grid.
~8.64 TiB available DRAM in the baseline.
2x 200Gb RDMA NICs per node in the cluster.
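The figures above can be combined into a rough capacity comparison. The sketch below uses only the numbers Oracle listed; the ratio and per-GPU values are derived arithmetic, not figures Oracle or WEKA published.

```python
# Figures as reported on Oracle's benchmark page.
NODES = 9
GPUS_PER_NODE = 8
NVME_DRIVES_PER_NODE = 16
NVME_USABLE_TIB = 287.0    # usable NVMe in Augmented Memory Grid
DRAM_BASELINE_TIB = 8.64   # approximate DRAM available in the baseline

total_gpus = NODES * GPUS_PER_NODE
total_drives = NODES * NVME_DRIVES_PER_NODE

# Derived values (not stated in the source): how much larger the NVMe tier is
# than the DRAM tier, and how much of each tier sits behind each GPU.
capacity_ratio = NVME_USABLE_TIB / DRAM_BASELINE_TIB
nvme_tib_per_gpu = NVME_USABLE_TIB / total_gpus
dram_tib_per_gpu = DRAM_BASELINE_TIB / total_gpus

print(f"GPUs: {total_gpus}, NVMe drives: {total_drives}")
print(f"NVMe tier is about {capacity_ratio:.1f}x the DRAM tier")
print(f"Per GPU: {nvme_tib_per_gpu:.2f} TiB NVMe vs {dram_tib_per_gpu:.2f} TiB DRAM")
```

By this arithmetic the NVMe tier holds roughly 33 times as much as the DRAM baseline. Capacity alone does not determine the benchmark result, since the cache also has to reach the GPUs quickly enough.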

Oracle named three configurations for comparison. The 10x and 7x figures in WEKA’s announcement are measured against the DRAM-only baseline.

Configuration	Cache Path	Oracle Label
Baseline	HBM + DRAM only	standard vLLM serving
Augmented Memory Grid	HBM + NVMe only	Augmented Memory Grid
Augmented Memory Grid full stack	HBM + DRAM + NVMe	Augmented Memory Grid full stack

Figure: WEKA OCI long-context AI inference benchmark

The Result Table Starts with DRAM Saturation

Oracle’s result section says DRAM-only hit a hard ceiling at approximately 600 concurrent users, while Augmented Memory Grid scaled past 5,000 in unbounded testing. It says the performance gap appears at the point where DRAM cache saturates. Below that point, Oracle wrote, the systems can look similar.

Metric	DRAM Baseline	Augmented Memory Grid
Max concurrent users	~600	5,000+
Requests completed, 2,400-user test	~6,700	47,000+
Tokens served	700M	5B
Token throughput	<200K tokens/sec	~2M tokens/sec
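The table's rows can be checked against each other and against the stated workload shape. The sketch below is a sanity check, not part of Oracle's methodology. It assumes "tokens served" counts both input and output tokens of completed requests and that the token totals belong to the same 2,400-user test as the request counts; neither assumption is confirmed by the source.

```python
# Workload shape from Oracle's benchmark description.
INPUT_TOKENS_PER_TURN = 100_000
OUTPUT_TOKENS_PER_TURN = 100
TOKENS_PER_REQUEST = INPUT_TOKENS_PER_TURN + OUTPUT_TOKENS_PER_TURN

# Rounded figures from the result table. The "~", "<" and "+" qualifiers are
# dropped here, so every ratio below is approximate.
dram = {"users": 600, "requests": 6_700, "tokens": 700e6, "tok_per_s": 200e3}
amg = {"users": 5_000, "requests": 47_000, "tokens": 5e9, "tok_per_s": 2e6}

ratios = {key: amg[key] / dram[key] for key in dram}

# Assumption: tokens served = completed requests x (input + output tokens).
estimated_tokens = {
    "dram": dram["requests"] * TOKENS_PER_REQUEST,
    "amg": amg["requests"] * TOKENS_PER_REQUEST,
}

for key, value in ratios.items():
    print(f"{key:>10}: {value:.1f}x")
for name, value in estimated_tokens.items():
    print(f"{name} estimated tokens: {value / 1e6:,.0f}M")
```

Under those assumptions the request counts imply about 671M and 4.7B tokens, in line with the rounded 700M and 5B rows. The user ratio computed from the listed values is about 8.3x, but "5,000+" is a floor rather than a measured maximum, so the table does not pin that ratio down.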

Oracle ties the same threshold to service level objectives. When a user’s session misses cache, the system has to rebuild context; for a 100K-token input, a multi-turn coding session, or an agent workflow with a long project history, Oracle says the user experiences that miss as a pause.

Availability Puts NeuralMesh on OCI

WEKA’s product page for Augmented Memory Grid on OCI describes it as a persistent, petabyte-scale token warehouse that extends GPU high-bandwidth memory. It says the system streams key-value cache data between GPU memory and flash storage using RDMA and GPUDirect Storage, with OCI bare-metal GPU infrastructure in the path.

“Enterprise AI workloads are pushing context windows and GPU utilization to new limits. These benchmarks show how WEKA’s NeuralMesh platform with Augmented Memory Grid on OCI helps remove memory bottlenecks so customers can support larger, more demanding inference workloads without simply adding more GPUs.”

Pablo Selem, senior director, software development, Oracle Cloud Infrastructure, gave that statement in WEKA’s June 9 release. WEKA CEO Liran Zvibel used the same release to say inference is bottlenecked by how much effective memory is available to GPUs.

WEKA says NeuralMesh with Augmented Memory Grid is generally available to WEKA customers and on the Oracle Marketplace, with OCI as WEKA’s exclusive cloud launch partner. The June announcement also cites a prior phase of validation that demonstrated 1000x more KV cache capacity and up to 20x faster time to first token at 128,000 tokens.

KV Cache Reuse Is the Older Fight

The software context starts with vLLM. Its automatic prefix caching documentation says APC caches the KV cache of existing queries so a new query can reuse the cache when it shares the same prefix and skip the shared computation.

vLLM lists two example workloads where APC can provide a huge performance benefit, and one limit.

Long document query: repeated queries against the same long document can avoid processing that document again and again.
Multi-round conversation: vLLM can reuse chat history across future rounds of a session.
APC limit: APC reduces the prefilling phase and does not reduce the time spent generating new tokens.
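The reuse behavior described above can be illustrated with a toy model. This is a sketch for intuition only and is not vLLM's implementation: the block size, the `ToyPrefixCache` class, and the hashing scheme are all hypothetical choices made for the example. It treats a prompt as fixed-size blocks and reuses only the leading blocks that match a prompt seen earlier.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; an illustrative value


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so its key depends on all tokens before it."""
    hashes = []
    parent = b""
    full_length = len(tokens) - len(tokens) % block_size
    for start in range(0, full_length, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Hypothetical cache that remembers which prefix blocks it has seen."""

    def __init__(self):
        self._seen = set()

    def serve(self, tokens):
        """Return (tokens reused from cache, tokens that still need prefill)."""
        hashes = block_hashes(tokens)
        reused_blocks = 0
        for digest in hashes:
            if digest not in self._seen:
                break  # reuse stops at the first block that differs
            reused_blocks += 1
        self._seen.update(hashes)
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
document = list(range(1000))          # stand-in token ids for a long document
first = cache.serve(document + [9001, 9002, 9003])  # first question
second = cache.serve(document + [9101, 9102])       # second question, same document
unrelated = cache.serve(list(range(5000, 5100)))    # no shared prefix

print(first, second, unrelated)
```

In this toy run the second query reuses 992 of its 1,002 tokens, while the unrelated query reuses none. Generation time is not modeled at all, which mirrors the limit vLLM states for APC.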

Oracle’s separate LMCache reuse benchmark on OCI discusses cache reuse in OCI Data Science AI Quick Actions. That page says LMCache delivered near 2× throughput improvements and 50%+ reductions in time-to-first-token in conversational workloads.

That LMCache page also reduces memory planning to a formula. It says teams should allocate sufficient CPU memory for LMCache and gives LMCACHE_MAX_LOCAL_CPU_SIZE = (number of concurrent conversations) × (max context length) × (KV cache size per token).
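A worked example of that formula follows. The conversation count, context length, and model shape are hypothetical values chosen to make the arithmetic easy to follow; they are not taken from Oracle's page. The per-token estimate uses the standard sizing for a transformer with conventional key and value tensors (two tensors per layer), which may not match every architecture.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Standard estimate: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def lmcache_cpu_bytes(concurrent_conversations, max_context_tokens, bytes_per_token):
    """The formula quoted from Oracle's LMCache page, expressed in bytes."""
    return concurrent_conversations * max_context_tokens * bytes_per_token


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2
)

# Hypothetical serving target: 32 concurrent conversations at 32,768 tokens each.
total_bytes = lmcache_cpu_bytes(
    concurrent_conversations=32,
    max_context_tokens=32_768,
    bytes_per_token=per_token,
)
total_gib = total_bytes / 2**30

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"CPU memory to plan for: {total_gib:.0f} GiB")
```

With these placeholder inputs the estimate is 128 KiB per token and 128 GiB in total. Before setting LMCACHE_MAX_LOCAL_CPU_SIZE, check the LMCache documentation for the unit the setting expects.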

The Caveat Sits Inside the Workload

Oracle’s WEKA test begins with a workload definition. The benchmark was configured to maximize potential cache hit rate, and Oracle says the goal was to measure serving density and throughput when DRAM is no longer sufficient for long-context, cache-sensitive workloads on OCI infrastructure.

vLLM’s limits section says APC does not bring performance gain when vLLM spends most of the time generating answers, or when new queries do not share the same prefix with existing queries.
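Both limits can be seen in a simple timing model. The sketch below is an illustration under stated assumptions, not a measurement: the prefill and decode rates are hypothetical, the function names are invented for the example, and the time needed to fetch cached KV data is ignored entirely.

```python
def request_seconds(prompt_tokens, output_tokens, hit_rate,
                    prefill_tok_per_s, decode_tok_per_s):
    """Toy latency model: only uncached prompt tokens need prefill."""
    prefill = prompt_tokens * (1.0 - hit_rate) / prefill_tok_per_s
    decode = output_tokens / decode_tok_per_s
    return prefill + decode


def cache_speedup(prompt_tokens, output_tokens, hit_rate,
                  prefill_tok_per_s=10_000, decode_tok_per_s=50):
    """Ratio of no-cache latency to cached latency (rates are hypothetical)."""
    cold = request_seconds(prompt_tokens, output_tokens, 0.0,
                           prefill_tok_per_s, decode_tok_per_s)
    warm = request_seconds(prompt_tokens, output_tokens, hit_rate,
                           prefill_tok_per_s, decode_tok_per_s)
    return cold / warm


# Long shared prompt, short answer: the shape used in the benchmark workload.
long_context = cache_speedup(prompt_tokens=100_000, output_tokens=100, hit_rate=0.95)

# Short prompt, long answer: generation dominates, so caching barely matters.
decode_heavy = cache_speedup(prompt_tokens=200, output_tokens=2_000, hit_rate=0.95)

# Long prompt but no shared prefix: nothing can be reused.
no_match = cache_speedup(prompt_tokens=100_000, output_tokens=100, hit_rate=0.0)

print(f"long context: {long_context:.2f}x")
print(f"decode heavy: {decode_heavy:.4f}x")
print(f"no shared prefix: {no_match:.2f}x")
```

With these made-up rates the long-context case gains about 4.8x, while the decode-heavy and no-match cases gain essentially nothing. Real gains depend on how quickly cached KV data can be loaded, which this model leaves out.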

Oracle also wrote that production inference is a full-stack problem across Kubernetes, GPUDirect Storage, RDMA, vLLM, OCI bare-metal GPUs, and Augmented Memory Grid. It says cache capacity and cache movement both matter because a system has to retrieve KV cache fast enough to avoid stalling GPUs.

The tested GPU footprint stayed fixed while the cache path changed from DRAM-only service to Augmented Memory Grid variants. For workloads with no shared prefix, vLLM says the computation cannot be reused.

Frequently Asked Questions

What did the WEKA OCI benchmark show?

The WEKA OCI benchmark showed 10x more concurrent users, 10x higher token throughput, and 7x more tokens per GPU than DRAM-only configurations, according to WEKA’s June 9, 2026 announcement.

What hardware was used in the test?

Oracle’s benchmark page says the system used a 9-node OCI bare-metal H100 cluster with 8 GPUs per node, 72 GPUs total, and 16x Gen4 NVMe drives per node.

How did Augmented Memory Grid change the cache path?

WEKA says Augmented Memory Grid streams key-value cache data between GPU memory and flash storage using RDMA and GPUDirect Storage.

What is the main workload caveat?

Oracle says the tests were configured to maximize potential cache hit rate, and vLLM says prefix caching does not help when vLLM spends most of the time generating answers or when prefixes do not match.

Is NeuralMesh with Augmented Memory Grid available on OCI?

WEKA says NeuralMesh with Augmented Memory Grid is generally available to WEKA customers and on the Oracle Marketplace, with OCI as WEKA’s exclusive cloud launch partner.