JUNE 11, 2026 · AWS Experience Floor28, Tel Aviv
Dudy Cohen, AWS · Sharone Zitzman, RTFM Please
Ran Tavory, Keynote Speaker
Coding agents are shipping production code faster than you can read it. Somebody still has to run it at 2 AM. This talk is for the people on that pager - three forces making production harder, and the concrete shifts (observability, SLOs, policy-as-code, AI investigation reports) that define the next decade of the SRE role.
Anton Weiss, Developer Advocate, DoIT · Or Plavnik & Michael Isayev, Innocom · Grafana Team, Grafana
Innocom: MTTR Vertigo: The Power of Consolidation.
2 AM. Service down. You're flipping between seven tools trying to figure out where to start - that's MTTR vertigo. This talk shows how consolidating your entire stack onto one platform removes the silos that turn 5-minute investigations into hour-long ones. You'll see what root cause analysis looks like when metrics, logs, traces, and ownership all live in one database, queryable in one language. Walk in skeptical. Leave with a clear answer to why silos cost more than every observability license combined.
DoIT: Leaner K8s with PSI metrics
PSI metrics have graduated to beta in K8s v1.34. It may be a small feature but it can matter a lot for your application performance and efficiency. Let's understand how they can make your clusters and autoscaling leaner.
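As a rough illustration of the raw signal behind PSI, the sketch below parses the Linux kernel's pressure stall text format (the layout used by /proc/pressure/* and the cgroup v2 *.pressure files). It does not show how Kubernetes exposes these values; the SAMPLE text, the is_under_pressure helper, and its threshold are hypothetical and chosen only for illustration.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=1.50"; "total" is stall time in microseconds.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Hypothetical sample in the kernel's format (values invented for the example).
SAMPLE = """some avg10=1.50 avg60=0.75 avg300=0.10 total=123456
full avg10=0.25 avg60=0.05 avg300=0.00 total=7890"""


def is_under_pressure(psi: dict, threshold: float = 1.0) -> bool:
    """Hypothetical rule: flag when the 10-second 'some' average reaches the threshold (percent)."""
    return psi["some"]["avg10"] >= threshold
```

The "some" line reports the share of time at least one task was stalled on the resource, and "full" the share of time all non-idle tasks were stalled, which is why the two can differ widely on a busy node.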
Itay Shakury, OSS Maintainer, Engineering Leader
Open Source has always been a delicate ecosystem, a rare coexistence of interests, passion and opportunity. That delicate balance is being tested now, and the first signs are starting to show. You might have felt it, I know I have, a gradual change in my own experience as user and maintainer. Small things that only when combined reveal a larger pattern. I can now see four clear trends that I believe are happening right now, and are fundamentally changing the open source landscape. In this talk, I will share my observations and insights about what is changing in the open source experience and how, and also why it's happening now (spoiler alert: AI). You will learn to navigate the new open source landscape, the new practices and culture that are emerging, the tools and policies that are in use, and the way it affects you, as an open source user, contributor, or maintainer.
Haim Raitsev, Software Engineer, Harmony
Most AI agents start the same way: a single prompt, a growing list of tools, and increasingly fragile behavior as complexity creeps in. Ours did too - until it became unmaintainable. In this talk, I'll show how we refactored our production IT helpdesk agent into a composable toolkit architecture using AWS Bedrock, where each capability - password resets, app provisioning, device recovery, knowledge base search, ticket management - is an independent, self-contained module with its own tools, prompts, dependencies, and runtime activation gates. Skills assemble dynamically per-request based on tenant configuration, user roles, and identity provider capabilities. I'll cover the engineering patterns that made it work: runtime gating, lazy dependency injection, streaming responses, and how we keep a multi-tenant agent from doing things it shouldn't. Real production code, real lessons learned.
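To make the idea of runtime activation gates concrete, here is a minimal sketch of how self-contained skills might be assembled per request. It is not the speaker's code and makes no AWS Bedrock calls; RequestContext, Skill, assemble, and all the feature, role, and capability strings are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical per-request facts a gate can inspect."""
    tenant_features: frozenset
    user_roles: frozenset
    idp_capabilities: frozenset


@dataclass
class Skill:
    """A self-contained capability: its own tools, prompt, and activation gate."""
    name: str
    tools: list
    prompt: str
    gate: Callable[[RequestContext], bool]


SKILLS = [
    Skill(
        name="password_reset",
        tools=["reset_password"],
        prompt="Help the user reset their password.",
        # Active only if the tenant enabled it AND the identity provider supports it.
        gate=lambda ctx: "password_reset" in ctx.tenant_features
        and "reset_password" in ctx.idp_capabilities,
    ),
    Skill(
        name="device_recovery",
        tools=["lock_device", "wipe_device"],
        prompt="Help recover a lost device.",
        gate=lambda ctx: "it_admin" in ctx.user_roles,
    ),
    Skill(
        name="ticket_management",
        tools=["create_ticket", "update_ticket"],
        prompt="Manage helpdesk tickets.",
        gate=lambda ctx: True,  # always available
    ),
]


def assemble(skills: list, ctx: RequestContext) -> list:
    """Return only the skills whose gate passes for this request."""
    return [skill for skill in skills if skill.gate(ctx)]
```

Because a skill that fails its gate never contributes tools or prompt text, the agent cannot call a tool the tenant or user is not entitled to, which is one plausible way to keep a multi-tenant agent within bounds.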
Bar Kaduri, Principal Researcher, Capsule Security
Over the past two years, AI agents have been stealthily becoming the new backbone of the global internet infrastructure. Autonomous systems capable of invoking tools, executing code, orchestrating workflows, and interacting with external services are now being built and deployed across production environments, developer workflows and tools, automation platforms, data pipelines, and enterprise systems as a whole.
What's become glaringly obvious is that despite the speed of this adoption, almost nothing is known about how these systems are actually built or secured in the wild.
To answer that question, we conducted one of the largest first-of-its-kind empirical studies of the agent ecosystem. We analyzed more than 86,000 public repositories implementing agent logic across frameworks including LangChain, LangGraph, CrewAI, AutoGen, and Model Context Protocol (MCP). We examined prompt construction patterns, tool implementations, authentication models, execution capabilities, and permission boundaries to understand how developers are building agents in practice.
We paired this code-level analysis with internet-wide infrastructure measurement using Shodan, Censys, and ShadowServer to map where agent platforms are actually running in the wild. The research surfaced more than 700,000 exposed agent-related systems on the public internet, including Ollama inference servers, Ray clusters, n8n automation platforms, and MCP tool servers. Many of these systems were directly exposed to the internet with little or no authentication and, in numerous cases, were vulnerable to known high-impact CVEs.
Together, these two datasets reveal a striking pattern. The same architectural assumptions and security shortcuts visible in agent codebases appear repeatedly in real deployments at internet scale.
This talk presents the data, visualizations, and insights produced from this research.
Using examples drawn directly from the dataset, we reconstruct several representative attack paths created by common agent design patterns. We show how seemingly harmless implementation choices - such as tool exposure, prompt construction shortcuts, and weak capability boundaries - can cascade into exploitable conditions once agents are deployed in real environments. We'll wrap up with practical architectural changes that agent frameworks and platform teams can adopt to prevent these patterns from becoming the next generation of supply-chain vulnerabilities.
Aviv Zohari, Field CTO, groundcover
In just a few years, LLMs have gone from research curiosities to the backbone of new software experiences. Organizations are rapidly productionizing LLM workflows because of their immense value, but often without observability guardrails. This introduces new layers of fragility and complexity: performance volatility, quality drift, and security risks. In this talk, we explore how to monitor and troubleshoot LLM applications with zero instrumentation. Whether on commercial LLM stacks or AWS Bedrock, we'll break open the LLM black box and learn how to track token usage, response latency, data exposure risks, and model execution failures.
Niv Yungelson, Founding Team, Member of Technical Staff, Accomplish · Hai Rozencwajg, Founding Team, Member of Technical Staff, Accomplish
Modern AI applications are no longer single prompts. They are pipelines of decisions: retrieval, reasoning, tool use, and execution. However, most systems still default to a single model across all steps, leading to unnecessary cost, latency, and failure modes.
This talk presents a hands-on evaluation of open-source model routing tools in the context of multi-step AI workflows. Instead of comparing models in isolation, we examine how routing decisions change when facing a multi-step pipeline as part of computer-use task execution.
We'll cover real experiments, architectural patterns, and common pitfalls, and provide a practical approach to step-aware routing that improves both performance and cost efficiency.
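As a toy illustration of what step-aware routing can mean, the sketch below picks a model tier per pipeline step instead of using one model everywhere, then compares relative cost. It does not represent any of the routing tools evaluated in the talk; the tier names, the ROUTES table, and the cost units are all hypothetical.

```python
# Hypothetical relative cost per call for each model tier (arbitrary units).
COST = {"small": 1, "medium": 3, "large": 10}

# Hypothetical routing table: which tier handles which kind of step.
ROUTES = {
    "retrieval": "small",
    "reasoning": "large",
    "tool_use": "medium",
    "execution": "medium",
}


def route(step: str, default: str = "large") -> str:
    """Pick a tier for a step; unknown steps fall back to the default tier."""
    return ROUTES.get(step, default)


def pipeline_cost(steps: list, single_model=None) -> int:
    """Total cost of a pipeline, either routed per step or pinned to one tier."""
    return sum(COST[single_model or route(step)] for step in steps)


PIPELINE = ["retrieval", "reasoning", "tool_use", "execution"]
```

With these made-up numbers, pinning every step to the largest tier costs 40 units while the routed pipeline costs 17; real savings depend entirely on measured quality and price per step.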
Sharon Dahan, Chief Architect, Impala AI
This session is a technical customer story about running a 100M-image multimodal inference workload on Amazon EKS with vLLM under real GPU capacity constraints. It starts with a straightforward capacity plan based on B200 machines and a GPU-hour calculation, then shows why that plan breaks down in practice when ideal capacity is not consistently available. From there, the talk explains how the hardware strategy expands across multiple AWS GPU families, why NVFP4 changes the economics, and why vLLM serving strategy has to be adapted to the topology of each machine type. The session closes with the final architecture and an additional slide on how Impala productizes this approach for production workloads.
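The kind of GPU-hour calculation such a capacity plan starts from can be sketched in a few lines. Only the 100M image count comes from the abstract; the per-GPU throughput and the GPU count are hypothetical placeholders, not figures from the talk.

```python
def gpu_hours(total_images: int, images_per_sec_per_gpu: float) -> float:
    """GPU-hours needed to process the workload at a steady per-GPU throughput."""
    return total_images / images_per_sec_per_gpu / 3600


def wall_clock_hours(total_gpu_hours: float, gpus: int) -> float:
    """Elapsed hours if the work is spread evenly across `gpus` GPUs."""
    return total_gpu_hours / gpus


TOTAL_IMAGES = 100_000_000      # from the abstract
IMAGES_PER_SEC_PER_GPU = 50.0   # hypothetical throughput, for illustration only
GPUS = 8                        # hypothetical: one 8-GPU machine

needed = gpu_hours(TOTAL_IMAGES, IMAGES_PER_SEC_PER_GPU)
elapsed = wall_clock_hours(needed, GPUS)
```

The plan is only as good as its assumption that the chosen machines are continuously available; once capacity is spread across GPU families with different throughput, the calculation has to be repeated per machine type.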
Issac Goldstand, Field CTO, CloudEx
As we enter the 'Agentic Era' of software engineering, we are replacing deterministic programming languages with natural language and replacing traditional flow control with non-deterministic LLMs. But are we throwing away a decade of architectural best practices? While the talk will focus on high-level architecture principles, we will briefly dip into code written using the AWS Strands SDK to understand the basics of writing AI Agents that can run on AWS Bedrock. We'll also understand how this relates to writing code with Agentic AI IDEs, such as AWS Kiro or VS Code.
Daniel Avital, Co-Founder & CEO, Pandorian
Nobody ever cared about writing rules well. You'd drop some standards in a Confluence doc, maybe an onboarding wiki, and move on. It didn't really matter how precise they were because nobody was reading them anyway. As long as you had strong engineers and a decent review culture, the codebase held together well enough.
AI changed the math. When code volume explodes and the author is increasingly a model, every other propagation mechanism breaks down. What's left is the rule. How precisely you can define it, scope it, and enforce it is now the difference between a codebase that reflects your judgment and one that slowly stops being yours.
This talk is about that shift.
Haggai Zohar, Data Tech Lead, TeraSky
Most data lakes aren't LLM-ready. Without structure, metadata, and governance, connecting generative AI to S3 data creates risk, cost, and unreliable results. This session walks through an end-to-end Retrieval-Augmented Generation (RAG) architecture on AWS. We'll cover ingestion and preprocessing of structured and unstructured data from Amazon S3, building embeddings with Amazon Bedrock or SageMaker, and choosing vector storage with OpenSearch or Aurora pgvector. You'll see how retrieval orchestration and inference run securely with Bedrock, alongside IAM, PII, and governance controls. We'll also examine cost, latency, and freshness trade-offs, and when RAG beats fine-tuning or search-only approaches.
Assaf Peleg, Cato Networks
SOC & NOC teams aren't short on data - they're short on signal. With overwhelming alerts & logs, the challenge is improving Signal-to-Noise Ratio (SNR): isolating what truly matters.
This session presents an automated RCA framework running on Amazon EKS that amplifies signal from network and security incidents. By combining trend detection with GenAI on Amazon Bedrock, the framework transforms complex network and security data into clear, actionable explanations of why incidents occur and what is likely to happen next.
Our customizable, tool-driven architecture builds tailored RCA workflows per story type, increasing SNR precision. Amazon Bedrock AgentCore provides a secure runtime, contextual memory, and observability, operationalizing high-SNR insights into RCA-driven actions.
Detection creates data. RCA creates signal.
May Walter, CTO & Co-Founder, Hud
Most of the AI-for-engineering conversation runs in one direction: humans tell agents what to do, and agents produce code. But for high-velocity teams, the more interesting flow runs the other way: production telling the agents what is actually happening, so they can fix, refactor, and keep shipping without drifting away from reality.
This talk is about that reverse loop. It looks at what changes when the systems we operate start moving faster than the people who own them, the failure modes that quietly become normal when agents work without ground truth, and the durable plumbing that keeps them honest: traces, error budgets, customer signal, SLO breaches, incident timelines, all fed back as first-class inputs to the tools doing the writing.
If the broader AI shift is changing what it means to author code, this is the operational half of that story: the feedback infrastructure that makes the new way of working actually safe to run.
Asia Salner, Director of DevOps, Coro
We operate a multi-account AWS environment (12 accounts, 4 regions) on EKS, Terraform, GitLab CI, and ArgoCD. The primary constraint was not infrastructure scale, but operational overload: growing backlog, missed cost optimizations, delayed risk detection, and constant reactive work.
This session presents how AI is embedded into both DevOps workflows and the platform layer to shift from reactive operations to continuous optimization:
--Cost & hygiene - continuous scanning across AWS accounts to detect unused resources, redundant services, orphaned assets, and cost inefficiencies, with fixes generated as code
--Security & compliance - identifying over-permissioned access, configuration drift, and policy violations, with controlled remediation via pipelines
--Production signal analysis - logs, metrics, and code analyzed to surface hidden issues such as memory leaks and abnormal patterns missed during development
--Shift-left review - AI-assisted code review for Terraform, Kubernetes, and CI/CD to prevent misconfigurations before deployment
--Autonomous task execution - DevOps tasks from Jira translated into reviewed, GitOps-driven changes
--Impact-based prioritization - operational work ranked by cost, reliability, and security impact, reducing backlog and eliminating low-value manual triage
All actions are enforced through GitOps and review. No direct production mutations.
Focus: architecture, integration patterns, control mechanisms, and measurable impact on cost, MTTR, and operational load.
Asaf Savich, AI Engineering Group Manager, Komodor
We've analyzed over 1 million production K8s failures across thousands of clusters. The data reveals something striking - the vast majority of incidents fall into predictable, preventable categories. By the law of large numbers, if we address these recurring issues, we can drastically improve production reliability.
This talk presents the most common K8s failure patterns backed by real data at scale. We'll cover 6 major categories - Resource Exhaustion (OOMKilled pods, memory leaks, GPU thermal throttling), Image & Deployment Issues (ImagePullBackOff, stuck rollouts), Config & Secret Management (rotation breaking apps, ConfigMap drift), Cascading Failures (missing ConfigMaps triggering multi-pod failures, dependency chains), Storage & Persistence (PVC conflicts, CSI driver issues), & App vs Infra Debugging (CrashLoopBackOff mysteries, GPU XID errors).
For each category, we'll show real K8s events, explain why these failures are so common, and provide actionable prevention strategies.
Miki Manor, Director of Infrastructure Engineering, Skai
Eliminating toil is one of the SRE community's most sacred goals. AI agents are finally delivering on that promise - handling first-line incident response, running runbooks, auto-remediating known failure modes. Page volume is down. Mean time to resolution is down. On-call burden is down. This is unambiguously good. And it's also hiding a slow-moving crisis. SREs have always learned production systems by living with them - by being paged at 2am, clicking through runbook steps that felt tedious, noticing that this particular service always spikes after a deploy to that downstream dependency. That pattern recognition didn't live in documentation. It lived in the repetition. It lived in the toil.
When AI agents absorb all of that, the humans who remain are faster, less burned out, and increasingly de-calibrated from their own systems. The first time something genuinely novel breaks - outside the agent's training distribution, no runbook applicable - you need a human who deeply knows the system. That human may no longer exist because they haven't been paged in eighteen months. This talk is about the expertise debt accumulating silently in every org embracing agentic operations - and the engineering practices that can help you not wake up to it the hard way.
Itamar Syn Hershko, CTO & Founder, BigData Boutique
In this talk, we'll dive deep into the heart of OpenSearch and highlight common mistakes, as well as some not-so-common gotchas, to help you optimize and stabilize your own deployment. This talk is based on 15 years of cluster maintenance in production - in every vertical, at any scale, for every use case you can think of. From Elasticsearch 0.10 to OpenSearch 3.0, I've seen it all and survived to tell the story.
Edan Shahmoon, Data Scientist and DevOps Engineer
Closing talk followed by mingling and networking.
Have you ever wondered what your infrastructure sounds like? In this talk, I'll introduce 'Promethouse,' a lighthearted side project that turns system observability into house music. We'll start by diving under the hood of the Prometheus pull-based scraping protocol to understand how it ingests massive amounts of time-series data. But instead of building another Grafana dashboard, we'll route that data to the dancefloor. I will walk through the core engineering challenge: applying dimensionality reduction algorithms to distill thousands of chaotic, high-dimensional service metrics into a unified 'tempo' for each microservice. By the end of this session, you'll see (and hear!) how mathematical transformations can turn the noise of CPU spikes and memory leaks into a surprisingly groovy, cohesive musical track.
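One way to picture the metrics-to-tempo step is the sketch below. The abstract does not say which dimensionality reduction algorithm Promethouse uses, so this swaps in a plain first-principal-component projection computed by power iteration; the function names, the sample metric rows, and the 110-130 BPM range are all assumptions made for illustration.

```python
import math


def first_principal_component(rows: list, iterations: int = 100):
    """Return (direction, scores): the leading PCA direction and each row's projection on it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the metrics (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration. Caveat: a start vector orthogonal to the leading
    # direction would not converge to it; fine for a sketch.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    if sum(v) < 0:  # PCA sign is arbitrary; pick a consistent orientation
        v = [-x for x in v]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


def tempo_bpm(scores: list, low: float = 110.0, high: float = 130.0) -> float:
    """Map the latest score onto a hypothetical BPM range by min-max scaling."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return (low + high) / 2
    return low + (scores[-1] - lo) / (hi - lo) * (high - low)


# Hypothetical samples for one microservice: [cpu_percent, requests_per_sec] over time.
SAMPLES = [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0], [40.0, 400.0]]
```

In practice the metrics would usually be standardized first, since otherwise the one with the largest raw scale (here the request rate) dominates the principal direction.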