Kubernetes is the go-to platform for running containers, and it makes adding or removing resources really easy. For a real 12-factor app this is straightforward and a good way to go. The challenge comes with long-running (stateful) apps or big data apps such as Hadoop, Spark, Flink... This talk shows strategies to mitigate these challenges and to take advantage of them. For example, instead of running a small cluster 24 hours a day, run a cluster 12 times bigger for short intervals only (~2h). If your jobs scale, you get your results up to 12 times faster, and with them more business value, for the same resource consumption. Other workloads cannot simply be cut short: a live recording has to keep running until the event ends, even if that takes 24 hours, and AI training usually does not recover well from snapshots.

- use the cloud in a dynamic way (scaling to 100x capacity within minutes is possible)
- per-job clusters with fitting sizing
- leverage multiple node pools/groups
- help from k8s operators for deployment
- how this can work together with workflow engines like Airflow
- the k8s cluster autoscaler: how to leverage it, and its pitfalls
- the k8s scheduler: alternatives and options to consider
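
The 24h versus ~2h comparison in the abstract comes down to simple node-hour arithmetic. The following is a minimal sketch of that calculation in Python; the baseline of 10 nodes and the helper name `node_hours` are assumptions made up for illustration, and the speedup figure assumes ideal linear scaling, which real jobs will only approach.

```python
def node_hours(nodes, hours):
    """Resource consumption expressed as nodes multiplied by running time."""
    return nodes * hours


# Hypothetical baseline size, not taken from the talk.
BASE_NODES = 10

# Option A: a small cluster that runs around the clock.
always_on = node_hours(BASE_NODES, 24)

# Option B: a cluster 12 times bigger that runs for about 2 hours.
burst_nodes = BASE_NODES * 12
burst = node_hours(burst_nodes, 2)

# Upper bound on the speedup, assuming the job scales linearly with node count.
ideal_speedup = burst_nodes / BASE_NODES

print(f"always-on: {always_on} node-hours")
print(f"burst:     {burst} node-hours")
print(f"ideal speedup: up to {ideal_speedup:.0f}x")
```

Both options consume the same number of node-hours, so the gain is in how quickly results arrive, not in the resources used. Startup time for the bigger cluster and jobs that scale less than linearly will reduce the real-world benefit.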