JUNE 11, 2026 · AWS Experience Floor28, Tel Aviv
Dudy Cohen, AWS · Sharone Zitzman, RTFM Please
Ran Tavory, Keynote Speaker
Coding agents are shipping production code faster than you can read it. Somebody still has to run it at 2 AM. This talk is for the people on that pager - three forces making production harder, and the concrete shifts (observability, SLOs, policy-as-code, AI investigation reports) that define the next decade of the SRE role.
Anton Weiss, Developer Advocate, DoIT · , Innocom
DoIT: Leaner K8s with PSI metrics
PSI metrics have graduated to beta in K8s v1.34. It may be a small feature but it can matter a lot for your application performance and efficiency. Let's understand how they can make your clusters and autoscaling leaner.
Itay Shakury, CNCF Ambassador, Trivy Core Maintainer, Engineering Leader
Open Source in the Age of AI
Din Shor, Director of DevOps, Tavily
Building a multi-agent research system on AWS - our LangGraph-based research agent with parallel sub-agents, async SQS processing, triple-layer model fallbacks, S3 storage, LangSmith observability. Architecture deep dive.
Bar Kaduri, Principal Researcher, Capsule Security
Over the past two years, AI agents have been stealthily becoming the new backbone of the global internet infrastructure. Autonomous systems capable of invoking tools, executing code, orchestrating workflows, and interacting with external services are now being built and deployed across production environments, developer workflows and tools, automation platforms, data pipelines, and enterprise systems as a whole.
What's become glaringly obvious is that despite the speed of this adoption, almost nothing is known about how these systems are actually built or secured in the wild.
To answer that question, we conducted one of the largest first-of-its-kind empirical studies of the agent ecosystem. We analyzed more than 86,000 public repositories implementing agent logic across frameworks including LangChain, LangGraph, CrewAI, AutoGen, and Model Context Protocol (MCP). We examined prompt construction patterns, tool implementations, authentication models, execution capabilities, and permission boundaries to understand how developers are building agents in practice.
We paired this code-level analysis with internet-wide infrastructure measurement using Shodan, Censys, and ShadowServer to map where agent platforms are actually running in the wild. The research surfaced more than 700,000 exposed agent-related systems on the public internet, including Ollama inference servers, Ray clusters, n8n automation platforms, and MCP tool servers. Many of these systems were directly exposed to the internet with little or no authentication and, in numerous cases, were vulnerable to known high-impact CVEs.
Together, these two datasets reveal a striking pattern. The same architectural assumptions and security shortcuts visible in agent codebases appear repeatedly in real deployments at internet scale.
This talk presents the data, visualizations, and insights produced from this research.
Using examples drawn directly from the dataset, we reconstruct several representative attack paths created by common agent design patterns. We show how seemingly harmless implementation choices, such as tool exposure, prompt construction shortcuts, and weak capability boundaries, can cascade into exploitable conditions once agents are deployed in real environments. We'll wrap up with practical architectural changes that agent frameworks and platform teams can adopt to prevent these patterns from becoming the next generation of supply-chain vulnerabilities.
Haim Raitsev, Software Engineer, Harmony
Most AI agents start the same way: a single prompt, a growing list of tools, and increasingly fragile behavior as complexity creeps in. Ours did too - until it became unmaintainable. In this talk, I'll show how we refactored our production IT helpdesk agent into a composable toolkit architecture using AWS Bedrock, where each capability - password resets, app provisioning, device recovery, knowledge base search, ticket management - is an independent, self-contained module with its own tools, prompts, dependencies, and runtime activation gates. Skills assemble dynamically per-request based on tenant configuration, user roles, and identity provider capabilities. I'll cover the engineering patterns that made it work: runtime gating, lazy dependency injection, streaming responses, and how we keep a multi-tenant agent from doing things it shouldn't. Real production code, real lessons learned.
Niv Yungelson, Engineering Team, Accomplish · Hai Rozencwajg, Founding Team, Member of Technical Staff, Accomplish
Modern AI applications are no longer single prompts. They are pipelines of decisions: retrieval, reasoning, tool use, and execution. However, most systems still default to a single model across all steps, leading to unnecessary cost, latency, and failure modes.
This talk presents a hands-on evaluation of open-source model routing tools in the context of multi-step AI workflows. Instead of comparing models in isolation, we examine how routing decisions change when facing a multi-step pipeline as part of computer-use task execution.
We'll cover real experiments, architectural patterns, and common pitfalls, and provide a practical approach to step-aware routing that improves both performance and cost efficiency.
Aviv Zohari, Field CTO, groundcover
In just a few years, LLMs have gone from research curiosities to the backbone of new software experiences. Organizations are rapidly productionizing LLM workflows because of their immense value, but often without observability guardrails. This introduces new layers of fragility and complexity: performance volatility, quality drift, and security risks. In this talk, we explore how to monitor and troubleshoot LLM applications with zero instrumentation. Whether on commercial LLM stacks or AWS Bedrock, we'll break open the LLM black box and learn how to track token usage, response latency, data exposure risks, and model execution failures.
Issac Goldstand, Field CTO, CloudEx
As we enter the 'Agentic Era' of software engineering, we are replacing deterministic programming languages with natural language and replacing traditional flow control with non-deterministic LLMs. But are we throwing away a decade of architectural best practices? While the talk will focus on high-level architecture principles, we will briefly dip into code written using the AWS Strands SDK to understand the basics of writing AI Agents that can run on AWS Bedrock. We'll also understand how this relates to writing code with Agentic AI IDEs, such as AWS Kiro or VS Code.
Boaz Touito, CTO, Impala AI
This session is a technical customer story about running a 100M-image multimodal inference workload on Amazon EKS with vLLM under real GPU capacity constraints. It starts with a straightforward capacity plan based on B200 machines and a GPU-hour calculation, then shows why that plan breaks down in practice when ideal capacity is not consistently available. From there, the talk explains how the hardware strategy expands across multiple AWS GPU families, why NVFP4 changes the economics, and why vLLM serving strategy has to be adapted to the topology of each machine type. The session closes with the final architecture and an additional slide on how Impala productizes this approach for production workloads.
Haggai Zohar, Data Tech Lead, TeraSky
Most data lakes aren't LLM-ready. Without structure, metadata, and governance, connecting generative AI to S3 data creates risk, cost, and unreliable results. This session walks through an end-to-end Retrieval-Augmented Generation (RAG) architecture on AWS. We'll cover ingestion and preprocessing of structured and unstructured data from Amazon S3, building embeddings with Amazon Bedrock or SageMaker, and choosing vector storage with OpenSearch or Aurora pgvector. You'll see how retrieval orchestration and inference run securely with Bedrock, alongside IAM, PII, and governance controls. We'll also examine cost, latency, and freshness trade-offs, and when RAG beats fine-tuning or search-only approaches.
Assaf Peleg, Cato Networks · Tomer Doitshman, Cato Networks
SOC & NOC teams aren't short on data - they're short on signal. With overwhelming alerts & logs, the challenge is improving Signal-to-Noise Ratio (SNR): isolating what truly matters.
This session presents an automated RCA framework running on Amazon EKS that amplifies signal from network and security incidents. By combining trend detection with GenAI on Amazon Bedrock, the framework transforms complex network and security data into clear, actionable explanations of why incidents occur and what is likely to happen next.
Our customizable, tool-driven architecture builds tailored RCA workflows per story type, increasing SNR precision. Amazon Bedrock AgentCore provides a secure runtime, contextual memory, and observability, operationalizing high-SNR insights into RCA-driven actions.
Detection creates data. RCA creates signal.
May Walter, CTO & Co-Founder, Hud
Most of the AI-for-engineering conversation runs in one direction: humans tell agents what to do, and agents produce code. But for high-velocity teams, the more interesting flow runs the other way: production tells the agents what is actually happening, so they can fix, refactor, and keep shipping without drifting away from reality.
This talk is about that reverse loop. It looks at what changes when the systems we operate start moving faster than the people who own them, the failure modes that quietly become normal when agents work without ground truth, and the durable plumbing that keeps them honest: traces, error budgets, customer signal, SLO breaches, incident timelines, all fed back as first-class inputs to the tools doing the writing.
If the broader AI shift is changing what it means to author code, this is the operational half of that story: the feedback infrastructure that makes the new way of working actually safe to run.
Asia Salner, Director of DevOps, Coro
We operate a multi-account AWS environment (12 accounts, 4 regions) on EKS, Terraform, GitLab CI, and ArgoCD. The primary constraint was not infrastructure scale, but operational overload: growing backlog, missed cost optimizations, delayed risk detection, and constant reactive work.
This session presents how AI is embedded into both DevOps workflows and the platform layer to shift from reactive operations to continuous optimization:
--Cost & hygiene - continuous scanning across AWS accounts to detect unused resources, redundant services, orphaned assets, and cost inefficiencies, with fixes generated as code
--Security & compliance - identifying over-permissioned access, configuration drift, and policy violations, with controlled remediation via pipelines
--Production signal analysis - logs, metrics, and code analyzed to surface hidden issues such as memory leaks and abnormal patterns missed during development
--Shift-left review - AI-assisted code review for Terraform, Kubernetes, and CI/CD to prevent misconfigurations before deployment
--Autonomous task execution - DevOps tasks from Jira translated into reviewed, GitOps-driven changes
--Impact-based prioritization - operational work ranked by cost, reliability, and security impact, reducing backlog and eliminating low-value manual triage
All actions are enforced through GitOps and review. No direct production mutations.
Focus: architecture, integration patterns, control mechanisms, and measurable impact on cost, MTTR, and operational load.
Asaf Savich, AI Engineering Group Manager, Komodor
We've analyzed over 1 million production K8s failures across thousands of clusters. The data reveals something striking - the vast majority of incidents fall into predictable, preventable categories. By the law of large numbers, if we address these recurring issues, we can drastically improve production reliability.
This talk presents the most common K8s failure patterns backed by real data at scale. We'll cover 6 major categories - Resource Exhaustion (OOMKilled pods, memory leaks, GPU thermal throttling), Image & Deployment Issues (ImagePullBackOff, stuck rollouts), Config & Secret Management (rotation breaking apps, ConfigMap drift), Cascading Failures (missing ConfigMaps triggering multi-pod failures, dependency chains), Storage & Persistence (PVC conflicts, CSI driver issues), & App vs Infra Debugging (CrashLoopBackOff mysteries, GPU XID errors).
For each category, we'll show real K8s events, explain why these failures are so common, and provide actionable prevention strategies.
Miki Manor, Director of Infrastructure Engineering, Skai.ai
Eliminating toil is one of the SRE community's most sacred goals. AI agents are finally delivering on that promise - handling first-line incident response, running runbooks, auto-remediating known failure modes. Page volume is down. Mean time to resolution is down. On-call burden is down. This is unambiguously good. And it's also hiding a slow-moving crisis. SREs have always learned production systems by living with them - by being paged at 2am, clicking through runbook steps that felt tedious, noticing that this particular service always spikes after a deploy to that downstream dependency. That pattern recognition didn't live in documentation. It lived in the repetition. It lived in the toil.
When AI agents absorb all of that, the humans who remain are faster, less burned out, and increasingly de-calibrated from their own systems. The first time something genuinely novel breaks - outside the agent's training distribution, no runbook applicable - you need a human who deeply knows the system. That human may no longer exist because they haven't been paged in eighteen months. This talk is about the expertise debt accumulating silently in every org embracing agentic operations - and the engineering practices that can help you not wake up to it the hard way.
Itamar Syn Hershko, CTO & Founder, BigData Boutique
In this talk, we'll dive deep into the heart of OpenSearch and highlight common mistakes, as well as some not-so-common gotchas, to help you optimize and stabilize your deployment. This talk is based on 15 years of cluster maintenance in production - in every vertical, at any scale, for every use case you can think of. From Elasticsearch 0.10 to OpenSearch 3.0, I've seen it all and survived to tell the story.
Edan Shahmoon, Data Scientist and DevOps Engineer
Closing talk followed by mingling and networking.
Have you ever wondered what your infrastructure sounds like? In this talk, I'll introduce 'Promethouse,' a lighthearted side project that turns system observability into house music. We'll start by diving under the hood of the Prometheus pull-based scraping protocol to understand how it ingests massive amounts of time-series data. But instead of building another Grafana dashboard, we'll route that data to the dancefloor. I will walk through the core engineering challenge: applying dimensionality reduction algorithms to distill thousands of chaotic, high-dimensional service metrics into a unified 'tempo' for each microservice. By the end of this session, you'll see (and hear!) how mathematical transformations can turn the noise of CPU spikes and memory leaks into a surprisingly groovy, cohesive musical track.