DNS for Platform Engineering: The Silent Killer
Author: platform-engineering-playbook
Uploaded: 2025-11-13
Views: 5
Why does a forty-year-old protocol keep taking down billion-dollar infrastructure? The October 2025 AWS outage lasted more than fifteen hours because of a DNS race condition. Kubernetes defaults create 5x query amplification. We investigate how DNS really works in modern platforms (CoreDNS plugin chains, the ndots:5 trap, GSLB failover) and deliver a five-layer defensive playbook to keep your platform from becoming the next postmortem.
🔗 Full episode page: https://platformengineeringplaybook.c...
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
Summary:
• CoreDNS plugin-based architecture: middleware → backend chain, Kubernetes plugin watches API server and generates responses on-the-fly for cluster.local, forward plugin handles external queries
• ndots:5 trap creates 5x DNS query amplification: api.stripe.com has fewer than five dots, so the resolver tries 4 search domains before the absolute query; fix by lowering to ndots:1, using FQDNs with a trailing dot, and implementing app-level caching
• AWS October 19-20, 2025 outage: two DNS Enactors raced in DynamoDB's DNS automation, a cleanup step deleted all IPs for the regional endpoint, and 15+ hours of cascading failures followed (DynamoDB → dependent services → Slack/Atlassian/Snapchat)
• Five-layer defensive playbook: (1) optimize: fix ndots, tune CoreDNS cache to 10K records/30s, keep latency under 100 ms (warn above that); (2) failover: GSLB with health checks, TTL 60-300s for backends; (3) security: DNSSEC + DoH with internal resolvers; (4) monitoring: track p95 latency, error rates by type, top requesters; (5) testing: DNS failure game days, kill CoreDNS pods, inject latency, model failover scenarios
• TTL balancing trade-off: low TTL (60-300s) enables fast failover but increases query load; high TTL (3600-86400s) improves performance but delays failover; no perfect answer, depends on SLO
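To make the ndots:5 bullet concrete, here is a minimal Python sketch of how a glibc-style stub resolver might order its lookups. It is a simplified model, not the resolver's actual code: the function names are hypothetical, the fourth search domain (example.internal) is an illustrative stand-in for whatever the node contributes, and real resolvers differ in details (musl, for example, handles names with enough dots differently).

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the names a simplified glibc-style resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    # Fewer dots than ndots: walk the search list first, then try the name as-is.
    if name.count(".") < ndots:
        return expanded + [absolute]
    # Enough dots: try the name as-is first, falling back to the search list.
    return [absolute] + expanded


def lookups_until_absolute(name, search_domains, ndots=5):
    """Count lookups issued before (and including) the absolute query."""
    absolute = name if name.endswith(".") else name + "."
    return candidate_queries(name, search_domains, ndots).index(absolute) + 1


# Assumed search list for a pod in the "default" namespace. The last entry is
# a made-up node-level domain used only for illustration.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "example.internal",
]

if __name__ == "__main__":
    for query in candidate_queries("api.stripe.com", SEARCH, ndots=5):
        print(query)
```

Under these assumptions, lookups_until_absolute("api.stripe.com", SEARCH, ndots=5) gives 5, while ndots=1 or a trailing dot ("api.stripe.com.") gives 1. Each lookup may also be issued for both A and AAAA records, which this sketch does not count.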