AWS X DevOps: The 2026 Stack

📅 JUNE 11, 2026 · 📍 AWS Experience Floor28, Tel Aviv

FOYER

Doors Open & Registration

🕐 8:45 AM – 9:45 AM

Main Stage

Opening Words

Dudy Cohen, AWS · Sharone Zitzman, RTFM Please

🕐 9:45 AM – 10:00 AM 📍 Main Stage

Main Stage

Your Agent Built It. Who the 🤬 Runs It?

Ran Tavory, Keynote Speaker

🕐 10:00 AM – 10:30 AM 📍 Main Stage

Coding agents are shipping production code faster than you can read it. Somebody still has to run it at 2 AM. This talk is for the people on that pager: it covers three forces making production harder, and the concrete shifts (observability, SLOs, policy-as-code, AI investigation reports) that define the next decade of the SRE role.

Main Stage

Sponsor Ignites

Anton Weiss, Developer Advocate, DoIT · Innocom

🕐 10:30 AM – 10:40 AM 📍 Main Stage

DoIT: Leaner K8s with PSI metrics

PSI (Pressure Stall Information) metrics graduated to beta in K8s v1.34. It may be a small feature, but it can matter a lot for your application performance and efficiency. Let's see how they can make your clusters and autoscaling leaner.

Main Stage

Open Source in the Age of AI

Itay Shakury, CNCF Ambassador, Trivy Core Maintainer, Engineering Leader

🕐 10:40 AM – 11:00 AM 📍 Main Stage

FOYER

Break – Split to Tracks

🕐 11:00 AM – 11:10 AM

Track 1 - AWS Experience

Building a Multi-Agent Research System on AWS

Din Shor, Director of DevOps, Tavily

🕐 11:10 AM – 11:35 AM 📍 Track 1 – AWS Experience

An architecture deep dive into our LangGraph-based research agent on AWS: parallel sub-agents, async SQS processing, triple-layer model fallbacks, S3 storage, and LangSmith observability.

Track 2 - TLVCommunity

Agentic Chaos - What 86K+ Agent Codebases Reveal About 700K+ Exposed AI Systems

Bar Kaduri, Principal Researcher, Capsule Security

🕐 11:10 AM – 11:35 AM 📍 Track 2 – TLVCommunity

Over the past two years, AI agents have been stealthily becoming the new backbone of the global internet infrastructure. Autonomous systems capable of invoking tools, executing code, orchestrating workflows, and interacting with external services are now being built and deployed across production environments, developer workflows and tools, automation platforms, data pipelines, and enterprise systems as a whole.

What’s become glaringly obvious is that despite the speed of this adoption, almost nothing is known about how these systems are actually built or secured in the wild.
To close that gap, we conducted a first-of-its-kind empirical study of the agent ecosystem, one of the largest to date. We analyzed more than 86,000 public repositories implementing agent logic across frameworks including LangChain, LangGraph, CrewAI, AutoGen, and Model Context Protocol (MCP). We examined prompt construction patterns, tool implementations, authentication models, execution capabilities, and permission boundaries to understand how developers are building agents in practice.

We paired this code-level analysis with internet-wide infrastructure measurement using Shodan, Censys, and ShadowServer to map where agent platforms are actually running in the wild. The research surfaced more than 700,000 exposed agent-related systems on the public internet, including Ollama inference servers, Ray clusters, n8n automation platforms, and MCP tool servers. Many of these systems were directly exposed to the internet with little or no authentication and, in numerous cases, were vulnerable to known high-impact CVEs.
Together, these two datasets reveal a striking pattern. The same architectural assumptions and security shortcuts visible in agent codebases appear repeatedly in real deployments at internet scale.

This talk presents the data, visualizations, and insights produced from this research.

Using examples drawn directly from the dataset, we reconstruct several representative attack paths created by common agent design patterns. We show how seemingly harmless implementation choices, such as tool exposure, prompt construction shortcuts, and weak capability boundaries, can cascade into exploitable conditions once agents are deployed in real environments. We’ll wrap up with practical architectural changes that agent frameworks and platform teams can adopt to prevent these patterns from becoming the next generation of supply-chain vulnerabilities.

Track 1 - AWS Experience

Your AI Agent Is a Monolith - Here's How We Made Ours Modular

Haim Raitsev, Software Engineer, Harmony

🕐 11:35 AM – 12:00 PM 📍 Track 1 – AWS Experience

Most AI agents start the same way: a single prompt, a growing list of tools, and increasingly fragile behavior as complexity creeps in. Ours did too - until it became unmaintainable. In this talk, I'll show how we refactored our production IT helpdesk agent into a composable toolkit architecture using AWS Bedrock, where each capability - password resets, app provisioning, device recovery, knowledge base search, ticket management - is an independent, self-contained module with its own tools, prompts, dependencies, and runtime activation gates. Skills assemble dynamically per-request based on tenant configuration, user roles, and identity provider capabilities. I'll cover the engineering patterns that made it work: runtime gating, lazy dependency injection, streaming responses, and how we keep a multi-tenant agent from doing things it shouldn't. Real production code, real lessons learned.

Track 2 - TLVCommunity

We Tried OSS Model Routing Tools So You Don’t Have To

Niv Yungelson, Engineering Team, Accomplish · Hai Rozencwajg, Founding Team, Member of Technical Staff, Accomplish

🕐 11:35 AM – 12:00 PM 📍 Track 2 – TLVCommunity

Modern AI applications are no longer single prompts. They are pipelines of decisions: retrieval, reasoning, tool use, and execution. However, most systems still default to a single model across all steps, leading to unnecessary cost, latency, and failure modes.

This talk presents a hands-on evaluation of open-source model routing tools in the context of multi-step AI workflows. Instead of comparing models in isolation, we examine how routing decisions change when facing a multi-step pipeline as part of computer-use task execution.

We’ll cover real experiments, architectural patterns, and common pitfalls, and provide a practical approach to step-aware routing that improves both performance and cost efficiency.

Track 1 - AWS Experience

From Black Box to Glass Box: Observability for LLMs

Aviv Zohari, Field CTO, groundcover

🕐 12:00 PM – 12:25 PM 📍 Track 1 – AWS Experience

In just a few years, LLMs have gone from research curiosities to the backbone of new software experiences. Organizations are rapidly productionizing LLM workflows because of their immense value, but often without observability guardrails. This introduces new layers of fragility and complexity: performance volatility, quality drift, and security risks. In this talk, we explore how to monitor and troubleshoot LLM applications with zero instrumentation. Whether on commercial LLM stacks or AWS Bedrock, we’ll break open the LLM black box and learn how to track token usage, response latency, data exposure risks, and model execution failures.

Track 2 - TLVCommunity

The Agentic Era: Microservices, Reimagined?

Issac Goldstand, Field CTO, CloudEx

🕐 12:00 PM – 12:25 PM 📍 Track 2 – TLVCommunity

As we enter the 'Agentic Era' of software engineering, we are replacing deterministic programming languages with natural language and replacing traditional flow control with non-deterministic LLMs. But are we throwing away a decade of architectural best practices? While the talk will focus on high-level architecture principles, we will briefly dip into code written using the AWS Strands SDK to understand the basics of writing AI Agents that can run on AWS Bedrock. We'll also understand how this relates to writing code with Agentic AI IDEs, such as AWS Kiro or VS Code.

Track 1 - AWS Experience

Running 100M prompts on EC2 under GPU capacity constraints

Boaz Touito, CTO, Impala AI

🕐 12:25 PM – 12:50 PM 📍 Track 1 – AWS Experience

This session is a technical customer story about running a 100M-image multimodal inference workload on Amazon EKS with vLLM under real GPU capacity constraints. It starts with a straightforward capacity plan based on B200 machines and a GPU-hour calculation, then shows why that plan breaks down in practice when ideal capacity is not consistently available. From there, the talk explains how the hardware strategy expands across multiple AWS GPU families, why NVFP4 changes the economics, and why vLLM serving strategy has to be adapted to the topology of each machine type. The session closes with the final architecture and an additional slide on how Impala productizes this approach for production workloads.

Track 2 - TLVCommunity

RAG on Your Data Lake

Haggai Zohar, Data Tech Lead, TeraSky

🕐 12:25 PM – 12:50 PM 📍 Track 2 – TLVCommunity

Most data lakes aren’t LLM-ready. Without structure, metadata, and governance, connecting generative AI to S3 data creates risk, cost, and unreliable results. This session walks through an end-to-end Retrieval-Augmented Generation (RAG) architecture on AWS. We’ll cover ingestion and preprocessing of structured and unstructured data from Amazon S3, building embeddings with Amazon Bedrock or SageMaker, and choosing vector storage with OpenSearch or Aurora pgvector. You’ll see how retrieval orchestration and inference run securely with Bedrock, alongside IAM, PII, and governance controls. We’ll also examine cost, latency, and freshness trade-offs, and when RAG beats fine-tuning or search-only approaches.

FOYER

Lunch

🕐 12:50 PM – 2:00 PM

Track 1 - AWS Experience

Raising the Signal: Improving SNR with Automated Root Cause Analysis on AWS

Assaf Peleg, Cato Networks · Tomer Doitshman, Cato Networks

🕐 2:00 PM – 2:25 PM 📍 Track 1 – AWS Experience

SOC & NOC teams aren't short on data - they're short on signal. With overwhelming alerts & logs, the challenge is improving Signal-to-Noise Ratio (SNR): isolating what truly matters.

This session presents an automated RCA framework running on Amazon EKS that amplifies signal from network and security incidents. By combining trend detection with GenAI on Amazon Bedrock, the framework transforms complex network and security data into clear, actionable explanations of why incidents occur and what is likely to happen next.

Our customizable, tool-driven architecture builds tailored RCA workflows per story type, increasing SNR precision. Amazon Bedrock AgentCore provides a secure runtime, contextual memory, and observability, operationalizing high-SNR insights into RCA-driven actions.
Detection creates data. RCA creates signal.

Track 2 - TLVCommunity

What Production Knows: Closing the Loop Between AI Agents and the Systems They Build

May Walter, CTO & Co-Founder, Hud

🕐 2:00 PM – 2:25 PM 📍 Track 2 – TLVCommunity

Most of the AI-for-engineering conversation runs in one direction: humans tell agents what to do, and agents produce code. But for high-velocity teams, the more interesting flow runs the other way: production tells the agents what is actually happening, so they can fix, refactor, and keep shipping without drifting away from reality.

This talk is about that reverse loop. It looks at what changes when the systems we operate start moving faster than the people who own them, the failure modes that quietly become normal when agents work without ground truth, and the durable plumbing that keeps them honest: traces, error budgets, customer signal, SLO breaches, incident timelines, all fed back as first-class inputs to the tools doing the writing.

If the broader AI shift is changing what it means to author code, this is the operational half of that story: the feedback infrastructure that makes the new way of working actually safe to run.

Track 1 - AWS Experience

AI-Driven DevOps: From Operational Overload to Continuous Optimization (Cost, Security, Reliability)

Asia Salner, Director of DevOps, Coro

🕐 2:25 PM – 2:50 PM 📍 Track 1 – AWS Experience

We operate a multi-account AWS environment (12 accounts, 4 regions) on EKS, Terraform, GitLab CI, and ArgoCD. The primary constraint was not infrastructure scale, but operational overload: growing backlog, missed cost optimizations, delayed risk detection, and constant reactive work.

This session presents how AI is embedded into both DevOps workflows and the platform layer to shift from reactive operations to continuous optimization:
- Cost & hygiene: continuous scanning across AWS accounts to detect unused resources, redundant services, orphaned assets, and cost inefficiencies, with fixes generated as code
- Security & compliance: identifying over-permissioned access, configuration drift, and policy violations, with controlled remediation via pipelines
- Production signal analysis: logs, metrics, and code analyzed to surface hidden issues such as memory leaks and abnormal patterns missed during development
- Shift-left review: AI-assisted code review for Terraform, Kubernetes, and CI/CD to prevent misconfigurations before deployment
- Autonomous task execution: DevOps tasks from Jira translated into reviewed, GitOps-driven changes
- Impact-based prioritization: operational work ranked by cost, reliability, and security impact, reducing backlog and eliminating low-value manual triage

All actions are enforced through GitOps and review. No direct production mutations.

Focus: architecture, integration patterns, control mechanisms, and measurable impact on cost, MTTR, and operational load.

Track 2 - TLVCommunity

Why Your Kubernetes Cluster Will Fail: Lessons from 1 Million Real-World Incidents

Asaf Savich, AI Engineering Group Manager, Komodor

🕐 2:25 PM – 2:50 PM 📍 Track 2 – TLVCommunity

We've analyzed over 1 million production K8s failures across thousands of clusters. The data reveals something striking - the vast majority of incidents fall into predictable, preventable categories. Because so few categories account for so many incidents, addressing these recurring issues can drastically improve production reliability.

This talk presents the most common K8s failure patterns backed by real data at scale. We'll cover 6 major categories - Resource Exhaustion (OOMKilled pods, memory leaks, GPU thermal throttling), Image & Deployment Issues (ImagePullBackOff, stuck rollouts), Config & Secret Management (rotation breaking apps, ConfigMap drift), Cascading Failures (missing ConfigMaps triggering multi-pod failures, dependency chains), Storage & Persistence (PVC conflicts, CSI driver issues), & App vs Infra Debugging (CrashLoopBackOff mysteries, GPU XID errors).

For each category, we'll show real K8s events, explain why these failures are so common, and provide actionable prevention strategies.

FOYER

10 Min Break

🕐 2:50 PM – 3:00 PM

Main Stage

'The Last Human On-Call' - When AI fixes everything, who still knows anything?

Miki Manor, Director of Infrastructure Engineering, Skai.ai

🕐 3:00 PM – 3:10 PM 📍 Main Stage

Eliminating toil is one of the SRE community's most sacred goals. AI agents are finally delivering on that promise: handling first-line incident response, running runbooks, auto-remediating known failure modes. Page volume is down. Mean time to resolution is down. On-call burden is down. This is unambiguously good. And it's also hiding a slow-moving crisis. SREs have always learned production systems by living with them: by being paged at 2 AM, clicking through runbook steps that felt tedious, noticing that this particular service always spikes after a deploy to that downstream dependency. That pattern recognition didn't live in documentation. It lived in the repetition. It lived in the toil.

When AI agents absorb all of that, the humans who remain are faster, less burned out, and increasingly de-calibrated from their own systems. The first time something genuinely novel breaks (outside the agent's training distribution, no runbook applicable), you need a human who deeply knows the system. That human may no longer exist because they haven't been paged in eighteen months. This talk is about the expertise debt accumulating silently in every org embracing agentic operations, and the engineering practices that can help you not wake up to it the hard way.

Main Stage

Thousands of Clusters Maintained: Lessons Learned and Tales to Tell

Itamar Syn Hershko, CTO & Founder, BigData Boutique

🕐 3:10 PM – 3:20 PM 📍 Main Stage

In this talk, we'll dive deep into the heart of OpenSearch and highlight common mistakes, as well as not-so-common gotchas, so you can optimize and stabilize your own deployment. This talk is based on 15 years of cluster maintenance in production: across every vertical, at any scale, for every use case you can think of. From Elasticsearch 0.10 to OpenSearch 3.0, I've seen it all and survived to tell the story.

FOYER

SPECIAL HAPPY HOUR SESSION - Promethouse: Transforming Your metrics into house music + 🍻 snacks & drinks

Edan Shahmoon, Data Scientist and DevOps Engineer

🕐 3:20 PM – 4:00 PM 📍 Main Stage

Closing talk followed by mingling and networking.

Have you ever wondered what your infrastructure sounds like? In this talk, I’ll introduce 'Promethouse,' a lighthearted side project that turns system observability into house music. We'll start by diving under the hood of the Prometheus pull-based scraping protocol to understand how it ingests massive amounts of time-series data. But instead of building another Grafana dashboard, we'll route that data to the dancefloor. I will walk through the core engineering challenge: applying dimensionality reduction algorithms to distill thousands of chaotic, high-dimensional service metrics into a unified 'tempo' for each microservice. By the end of this session, you'll see (and hear!) how mathematical transformations can turn the noise of CPU spikes and memory leaks into a surprisingly groovy, cohesive musical track.


Huge thanks to our event sponsors and organizers!

And Many More Who Make Our Amazing Community Possible