Public Cloud Watcher
Formal Metadata
Number of Parts | 33
License | CC Attribution 3.0 Unported: You may use, change, and reproduce the work or its contents for any legal purpose, and distribute and make it publicly available in unmodified or modified form, as long as you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/54665 (DOI)
Language | English
Transcript: English (automatically generated)
00:09
Hello, my name is Anton Smirotsky, and today I will give a short introduction to a framework for monitoring and cleaning up public cloud service providers: Public Cloud Watcher.
00:22
Let's start. First of all, a few words about myself. I have been working in IT since 2005, and have used Linux as my main tool for work and fun since 2007. My main areas of interest before joining SUSE were Java and automated testing, mainly in
00:42
the area of online retail stores. After joining SUSE, my focus shifted to Perl, Python, and testing of SLE and openSUSE. One of those areas is testing SLE in the public cloud. Currently my favorite distro is openSUSE Leap, so I install it everywhere I can.
01:05
We will start by stating the problem that Public Cloud Watcher is trying to solve, then switch to the tool itself, going through its main features. We will also speak about the internals of Public Cloud Watcher.
01:21
Next, we will speak about HashiCorp Vault, what this tool is generally for, and about our very specific use case for it. And we will discuss how we currently maintain a running instance of Public Cloud Watcher, and what setup we have done to keep it running without worrying about any potential
01:42
environment breakage. The last topic will be future plans for this project. In my daily work, one of the main things I do is test how SLE behaves as a virtual machine in different cloud service providers:
02:01
Azure, AWS, and GCE at the moment. All testing related to public cloud providers happens in openQA in an automated way. I assume the majority of people attending openSUSE conferences know at least something about openQA. For those who don't, or who want to know more, I recommend visiting
02:25
open.qa or finding some nice talks about openQA from previous conferences; I am sure there was more than one talk like this. But let's get back to our main topic.
02:42
So, openQA uploads an SLE image into the dedicated cloud, creates a VM using this image, and then runs some tests against it. After that, it tries to clean up after itself by deleting all created entities.
03:01
Keep in mind that I said creating a VM, but actually it's much more than that. Together with the VM, every public cloud provider creates a lot of different entities: subnets, resource groups, disks, disk images, et cetera. After successful test execution, we have logic which cleans up all created entities.
03:25
But of course there are plenty of ways things might go wrong. Even though we keep improving our code so that cleanup always runs, there is still room for unexpected behavior from the openQA side or from the provider side which
03:43
prevents cleanup from finishing. Everything I just described needs to be multiplied by the diversity of public cloud providers. Like I said, VM creation usually means the creation of additional resources for that VM. And every cloud service provider has its own understanding of what kind of
04:03
resources will be created and how they will behave when you try to delete the VM: some of them will be automatically cleaned up, but others not, and there are a lot of different cases where this behavior differs.
04:22
So basically, whenever something goes wrong in openQA, you don't simply check that the VM was deleted; you also check the other entities, and you need to keep in mind all the differences of the current provider. Currently we start around 300 VMs daily in openQA across all three
04:41
providers together, and this number is supposed to grow in the near future. So obviously there is no chance to keep checking all of this manually. We need some logic which will double-check that our tests don't burn extra money.
05:02
Also, it would be nice to have it outside openQA, to avoid the trap where the same bug which invalidates cleanup at the test level does the same to our last-chance cleanup logic. And it would be nice to have everything in one place to ease maintenance.
05:22
Another problem, not directly related to what I just described, comes from the fact that the creation of a VM requires credentials, which you need to pass to the test in a secure way while making sure that an unauthorized person will not be
05:40
able to use them. To address the first problem, we created Public Cloud Watcher. The project is published on GitHub, so feel free to use it, learn from it, or contribute. It's written in Python using the Django framework.
06:01
Django was selected because it is a cheap way to get a web UI. Public Cloud Watcher monitors, cleans up, and notifies about leftovers in the supported providers. Currently we support three cloud service providers: Microsoft Azure, Amazon AWS, and Google Compute Engine.
06:28
For each provider, it uses the provider's native Python API bindings to communicate with the cloud. Public Cloud Watcher is currently used by several teams inside our company.
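A minimal sketch of such a per-provider class layer might look like the following. The class and method names are my own illustration, not the actual Public Cloud Watcher API; only the boto3 calls inside the AWS example are real bindings.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Base class holding the shape of provider-specific logic
    (illustrative names, not the actual project API)."""

    def __init__(self, namespace: str):
        self.namespace = namespace  # each team gets its own namespace

    @abstractmethod
    def list_instances(self) -> list:
        """Query the cloud for VMs tagged by openQA."""

    @abstractmethod
    def delete_instance(self, instance_id: str) -> None:
        """Ask the cloud to delete one VM."""

# A concrete provider wraps the native Python bindings,
# e.g. boto3 for AWS (EC2, heavily simplified):
class EC2Provider(Provider):
    def list_instances(self) -> list:
        import boto3  # native AWS bindings
        ec2 = boto3.client("ec2")
        resp = ec2.describe_instances(
            Filters=[{"Name": "tag-key", "Values": ["openqa_created_by"]}]
        )
        return [i["InstanceId"]
                for r in resp["Reservations"] for i in r["Instances"]]

    def delete_instance(self, instance_id: str) -> None:
        import boto3
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])
```

The base class is what lets the cleanup flow loop over Azure, AWS, and GCE uniformly while each subclass keeps the provider-specific authentication and query logic to itself.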
06:43
Each team has its own credentials to access the clouds, its own image naming conventions, and its own rules for which images need to be deleted and when. This is why Public Cloud Watcher has the notion of namespaces, which allows us to
07:01
define these differences per team. Also, in some cases public cloud accounts may be used not only by our openQA automation: other people and other automated workflows may create resources in the cloud as well.
07:22
Obviously we should not touch those, so we need to implement something smarter than "delete everything older than N days" in order not to interfere with work happening outside openQA. For this we decided to use a feature available in all three providers:
07:40
the ability to set tags on a resource. So when openQA creates a VM, it sets two tags on it: openqa_created_by, which gives Public Cloud Watcher the hint that this VM should be monitored, and openqa_ttl, which stores the amount of time after which Public Cloud Watcher
08:06
may delete the VM if openQA failed to do so after the test run. The internals of Public Cloud Watcher may be logically divided into several groups.
08:21
One group is represented by the classes responsible for the actual interaction with providers. It contains a dedicated class for each provider which holds all provider-specific logic. Each class knows how to authenticate with a certain provider, how to query for resources,
08:41
and which resources actually need to be queried. Another group is responsible for the actual VM cleanup. It has a process which periodically loops over all providers in all defined namespaces and collects VMs according to the tags I mentioned before.
09:02
These VMs are then serialized into a local SQLite database. Then it tries to delete the VMs living longer than the TTL defined for them. And at the end it sends an email notification about the VMs which were not deleted,
09:22
so that potential human involvement can happen. There is also a group responsible for the cleanup of everything else. This group holds the knowledge about the specialties of every provider and what kind of additional entities are created together with VMs.
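The periodic cleanup pass just described could be sketched roughly like this. The function names and VM record shape are my own stand-ins; only the tag names follow the ones mentioned in the talk.

```python
import time

def vm_age_seconds(vm: dict, now: float) -> float:
    """Seconds since the VM was created."""
    return now - vm["created"]

def cleanup_pass(providers: list, now: float = None) -> list:
    """One pass of the periodic loop: collect tagged VMs, delete those
    whose TTL expired, and return the ones needing human involvement."""
    now = now if now is not None else time.time()
    leftovers = []
    for provider in providers:
        # list_instances() only returns VMs tagged openqa_created_by
        for vm in provider.list_instances():
            ttl = int(vm["tags"].get("openqa_ttl", 0))
            if vm_age_seconds(vm, now) > ttl:
                try:
                    provider.delete_instance(vm["id"])
                except Exception:
                    leftovers.append(vm)  # ends up in the email notification
    return leftovers
```

In the real tool the collected VMs are additionally serialized into the local SQLite database before the deletion attempt, so the web UI can browse them later.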
09:40
It cleans them up separately, because there are a lot of cases where, after VM deletion, the provider does not delete these entities. All these cleanup and notification flows can be turned on and off at the global level or per namespace. This configuration is stored in a pcw.ini file which Public Cloud Watcher reads on startup.
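The global and per-namespace switches could be expressed in a pcw.ini along these lines. The section and key names here are illustrative guesses, not the project's real schema.

```ini
[default]
namespaces = qac, sle-micro      ; namespaces to monitor

[cleanup]
enabled = true                   ; global switch for the cleanup flow

[notify]
enabled = true
to = team@example.com            ; where leftover reports are mailed

[cleanup.namespace.qac]
enabled = false                  ; per-namespace override
```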
10:09
The local SQLite database with the cached list of VMs can also be accessed via the web interface, which lets you browse through the list of VMs,
10:23
do some basic searching, and manually trigger deletion of VMs which were not yet cleaned up automatically. Now let's discuss the second problem I raised: credentials.
10:41
After some discussions around the problem, we realized that to provide a really secure flow for handling credentials within openQA we would need to build a pretty complex flow which could still be broken easily. So we chose another path. We decided not to hide the credentials at all,
11:02
but to define a short TTL and a very limited permission set for them. To achieve this, we picked a project from the company which created Terraform, which we also use quite a lot in the internals of our testing approach. The name of the project is Vault.
11:23
The project page describes it as follows: "Secure, store and tightly control access to tokens, passwords, certificates, encryption keys for protecting secrets and other sensitive data using a UI, CLI, or HTTP API."
11:42
Vault also has a notion of namespaces, so it fits our model perfectly. Another useful feature of Vault is the ability to request temporary credentials from a provider with a certain set of permissions. Vault keeps track of these requested tokens
12:02
and may request token deletion from the provider. Vault is used both by the openQA tests, so they can create and delete VMs, and by Public Cloud Watcher, so it can query our accounts and delete leftovers in case there are any.
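Requesting such short-lived credentials boils down to one call against Vault's HTTP API. The sketch below uses Vault's documented AWS secrets engine response shape; the per-namespace mount path, role name, and token are assumptions for illustration.

```python
import json
from urllib import request

VAULT_ADDR = "http://127.0.0.1:8200"  # in our setup Vault sits behind nginx

def creds_path(namespace: str, role: str) -> str:
    """Build the Vault API path for temporary AWS credentials.
    Mounting one AWS engine per namespace is an assumption here."""
    return f"/v1/{namespace}/aws/creds/{role}"

def parse_lease(resp: dict) -> tuple:
    """Extract the short-lived key pair and its TTL from a Vault response."""
    data = resp["data"]
    return data["access_key"], data["secret_key"], resp["lease_duration"]

if __name__ == "__main__":
    # Live call sketch: needs a running Vault with the AWS engine enabled.
    req = request.Request(VAULT_ADDR + creds_path("qac", "openqa-role"),
                          headers={"X-Vault-Token": "s.xxxx"})
    with request.urlopen(req) as r:
        print(parse_lease(json.load(r)))
```

Once the lease expires, Vault revokes the key pair at the provider on its own, which is exactly the "short TTL, limited permissions" trade-off described above.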
12:22
We're using a very basic Vault setup, which the documentation actually describes as being for testing purposes only, not for production environments. We're not using any of the fancy Vault storage backends which would give plenty of options for persistently storing credentials. Instead we're using in-memory storage, which means 100% data loss
12:44
in case of a reboot or shutdown of Vault. But that's totally fine for us, because we're generating temporary credentials anyway. Another not-recommended way in which we use Vault is that we allow unencrypted connections to it.
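The two not-recommended settings just mentioned correspond to a tiny Vault server config roughly like this; it is a sketch built from Vault's documented options, not our exact file.

```hcl
# In-memory storage: everything is lost on restart, which is fine here
# because only short-lived credentials ever live in this Vault.
storage "inmem" {}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = true   # plain HTTP; nginx terminates TLS in front
}

ui = false             # the Vault web interface stays disabled
```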
13:01
But I will tell you more about that part when I describe our production instance setup. So, last but not least, let's speak about our production instance where Public Cloud Watcher and Vault are running. In the beginning it was just a random VM running somewhere.
13:24
But over time, with more people starting to use it and more test runs relying on it, we started realizing that we needed more stable ground. If Vault is down, no test will be able to finish, because tests need Vault to get the temporary credentials
13:44
which allow them to create and delete entities in the cloud. And if Public Cloud Watcher is down, tests are able to run, but there is a risk of getting big bills from Microsoft, Amazon, and Google. So at some point we decided to move to a state
14:04
where we can recreate the whole setup on a clean machine within 10 minutes. There are plenty of options for keeping infrastructure backed up as code. Each of these options has some pros and some cons.
14:23
Sometimes it's simply a matter of taste. After going through several different options, we ended up choosing containers as the building material for our setup. The idea ended up as a separate repository which contains Dockerfiles for Public Cloud Watcher, Vault, and nginx,
14:45
a Docker Compose file which ties the containers together, and some bash scripts which automate everything that needs to be done outside the Dockerfiles: installing Docker on the host, getting the SUSE SSL certificate onto the host
15:04
(which is needed for communication with openQA), checking out the latest version of Public Cloud Watcher from GitHub, et cetera. There is also a set of Vault CLI commands which do the Vault configuration from scratch:
15:24
they set up all needed namespaces and upload root credentials for each cloud service provider into each namespace, which Vault will later use to generate and delete temporary credentials on request. The Public Cloud Watcher container is based on the Python 3 container.
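The three-container arrangement could be tied together with a Compose file along these lines; service names, ports, and volume names are illustrative, not the actual repository contents.

```yaml
version: "3"
services:
  pcw:
    build: ./pcw              # Dockerfile based on the python:3 image
    volumes:
      - pcw-db:/data          # SQLite cache survives container recreation
  vault:
    build: ./vault            # official hashicorp/vault image plus tiny config
  nginx:
    build: ./nginx            # reverse proxy, the single HTTPS entry point
    ports:
      - "443:443"
    depends_on: [pcw, vault]
volumes:
  pcw-db:
```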
15:45
Whenever a new image is built, it takes the latest version of Public Cloud Watcher from GitHub together with the pcw.ini holding the details about which namespaces need to be monitored and how.
16:02
The container also has attached storage, to let the SQLite database with the instance cache survive container recreation. The Vault container is based on the official container from HashiCorp and contains a small config which just defines the in-memory storage
16:23
and disables the web interface of Vault. Most of the business logic is in bash scripts with Vault CLI commands, which do the initial init of the fresh Vault instance and set up all needed namespaces and secrets. So basically, our approach is that we do a fully automated install
16:46
of both entities, Public Cloud Watcher and Vault, every time from scratch on a clean container. But because it's fully automated we can do it fast, and because it's based on stable container versions
17:03
we are in a safe position where we don't get anything unexpected. So basically, we have reproducible builds, to some extent.
17:22
To simplify the Vault setup but keep it secure, we introduced a third container with nginx. nginx plays the reverse-proxy role here, encapsulating HTTP into HTTPS and providing a single entry point which brokers requests to the appropriate container,
17:43
Public Cloud Watcher or Vault. This trick allows us to keep Vault insecure from a setup point of view but secure from a usability point of view. This is how we work around what Vault recommends against,
18:04
namely allowing plain HTTP connections. Now let's speak about future plans. First, and in my opinion most important, we need to improve our code coverage with unit tests.
18:23
Not that we don't have any at all, but let's say coverage is far away from 100%. Another problem which bothers me, as the person who maintains Public Cloud Watcher, is that whenever there is some issue in the cloud,
18:40
for example Azure changing an API in a non-backward-compatible way, which has already happened several times, I start getting an email notification about the exception every X minutes. On one hand, I do want to be notified about the problem so I can start acting as soon as possible,
19:01
but on the other hand, after I have read the first email I can skip the next dozen, which would not say anything new. So the notification flow needs to become smarter and be aware of whether this is some new problem I need to be notified about, or something I have already seen.
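One simple way to make the flow aware of "already seen" problems would be to fingerprint each exception and only mail the first occurrence. This is a sketch of the idea, not the project's actual code.

```python
import hashlib

class Notifier:
    """Mail only exceptions whose fingerprint has not been seen yet."""

    def __init__(self, send_mail):
        self.send_mail = send_mail  # injected so it can be mocked in tests
        self.seen = set()

    def notify(self, exc: Exception) -> bool:
        # Fingerprint by exception type and message; a real version might
        # also hash the traceback, or expire old fingerprints over time.
        fp = hashlib.sha256(
            f"{type(exc).__name__}:{exc}".encode()).hexdigest()
        if fp in self.seen:
            return False            # duplicate, stay silent
        self.seen.add(fp)
        self.send_mail(str(exc))
        return True
```

With something like this, an Azure API breakage would produce one email instead of one every X minutes, while a genuinely new exception would still get through immediately.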
19:24
Also, the web interface currently only manages VMs, but in fact Public Cloud Watcher cleans up a lot of other entities. It would be nice to see them in the web UI too and have a better understanding of what was actually cleaned and when.
19:41
For now, to solve this problem, I usually grep the log file of Public Cloud Watcher. That's all I wanted to say. I will be happy to answer any additional questions during the conference or any time after. I would be even happier if someone would consider using Public Cloud Watcher together with us.