
Advanced Database Programming with Python

OK, we will proceed. Marc-André Lemburg will talk about advanced database programming. He's a seasoned Python developer, been around since 1993, founder and CEO of eGenix.com, one of the founding members of the Python Software Foundation, and a board member of the EuroPython Society, which brought this lovely conference to you. Please welcome him.
Thank you very much for coming. I'm going to give a little talk about advanced database programming, because in the past we've seen a lot about the easy stuff, so I thought it might be a good idea to talk a bit about the more advanced things. A bit about myself: I'm Marc Lemburg, I've been using Python for a very long time, I've studied mathematics, I have a company doing Python projects (eGenix.com Software), and I'm a Python core developer, a member of the Python Software Foundation and a member of the EuroPython Society. This is the agenda for the talk; I don't know whether I can actually do everything that I have on this agenda because of the time constraints, but I'll try.
First, I will start with a short introduction to the Python Database API. Do many of you know the Python Database API? Interesting, not that many. The Python Database API was designed starting in the mid-nineties. There was a 1.0 version, which is now deprecated, and the current 2.0 version. So it's a fairly old standard, and development on it is ongoing; if you want to join the discussion, you just have to subscribe to the db-sig mailing list and you can contribute your thoughts to the standard. It's supposed to be a very simple standard that is easy to implement, so that we get as many database modules as possible, and I think that has worked really well. There are two main concepts in the Python Database API: one is the connection object and the other is the cursor object. You use connection objects to actually connect to the database and also to manage transactions, and if you want to run queries, you open cursor objects and run your queries on them. The cursor works like a cursor in a text processing system, or like a spreadsheet: you scroll down through your result set and fetch the data into the application.

This is how a typical application looks: you first import the database module, you get the connect API from that module, you open a connection, passing in the database name, the username and the password, you create a cursor object on the connection object, you run your queries on the cursor object, and finally you free the resources by closing everything again.

That was the very short introduction to the API. The next part is going to be about transactions. Transactions are a very useful thing in databases: you can do work on the database, and if you find that you've made a mistake, you can just roll back your changes. This is a feature you need in production systems, to work around bugs or input errors from users, so that your data doesn't become corrupt. However, there are a few mistakes that people often make, and they sometimes cause people to dislike transactions. One common one is forgetting to commit the changes: people apply a lot of changes through their cursors and connections, then close the application and find that nothing has actually changed, because the database API defaults to transactional behavior and doesn't store the data if you don't do an explicit commit. One workaround is to just disable transactions with auto-commit, which is possible in most databases, but it's not really a good workaround: instead of losing your changes, you get data corruption for free.
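Putting the basic pieces together, here is a minimal, runnable DB-API 2.0 session. sqlite3 is used as a stand-in module because it ships with Python; a server database module would take host, user and password arguments in connect() instead, but the rest of the pattern is the same, including the explicit commit discussed above:

```python
import sqlite3  # stand-in for any DB-API 2.0 module (e.g. psycopg2, MySQLdb)

# Open a connection. A server database would take e.g.
# connect(database=..., user=..., password=...) instead of a file name.
connection = sqlite3.connect(":memory:")

# Create a cursor on the connection and run queries on it.
cursor = connection.cursor()
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO users VALUES (?, ?)", (1, "alice"))
cursor.execute("SELECT name FROM users WHERE id = ?", (1,))
row = cursor.fetchone()   # scroll through the result set via the cursor

# Commit explicitly -- without this, closing the connection discards
# the changes under the DB-API's default transactional behavior.
connection.commit()

# Finally, free the resources by closing everything again.
cursor.close()
connection.close()
```

Note that the parameter placeholder (`?` here) is the qmark paramstyle; other modules use `%s` or named parameters, as advertised in their `paramstyle` attribute.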
Another common mistake is keeping transactions running for too long; I'm coming back to that later in this talk. Transactions are basically your units of locking in the database, so you want to keep them short in order not to lock other processes out of the database. The best practices are: always use transactions, even if they are sometimes annoying, and don't use auto-commit; keep transactions short, and if they get too long, run them in batches. For example, if you're loading data into a database, it's much more convenient to do that in batches of, say, a thousand rows at a time and commit each batch. That also keeps the transaction log of the database short, and the performance will stay just fine, so you won't really see the overhead caused by the transaction mechanism. And if you know that you're not actually writing to the database, it's very good practice to set the read-only flag on the connection; you can usually do that in the connection options. The database then knows it has a read-only connection, it basically doesn't have to use a transaction log, and the whole query mechanism runs much faster. That, again, was the simple level of transactions.
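The batched loading pattern suggested above can be sketched like this, again with sqlite3 as a stand-in and a batch size of 1000 rows as an arbitrary choice (the read-only flag is database-specific and not shown):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE measurements (value INTEGER)")

BATCH_SIZE = 1000                      # ~1000 rows per transaction
rows = [(i,) for i in range(5500)]     # stand-in for a large data load

for start in range(0, len(rows), BATCH_SIZE):
    cursor.executemany("INSERT INTO measurements VALUES (?)",
                       rows[start:start + BATCH_SIZE])
    connection.commit()                # short transactions, short log

cursor.execute("SELECT COUNT(*) FROM measurements")
total = cursor.fetchone()[0]
```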
Then we have a more advanced level of transactions. If you connect to multiple databases and you want transactions to span the different databases, you have to think about what to do: when you read data from one database and put it into another, you only want the whole operation to succeed if all the databases have actually received the change. That's what's called a distributed transaction. Typical use cases are in accounting, for example: you debit an amount from one account on one database and credit it to another account on a second database, and you only want that to succeed if both databases have actually made the change. You have similar things in queue processing, or when integrating different applications. The typical buzzword you hear when talking about distributed transactions is two-phase commit, which is the standard method of approaching this problem. It works like this: in the first phase, the commit is prepared; all the databases are asked whether the commit would succeed with high probability. If all databases say yes, you go to the second phase and actually do the commit. There is still a tiny probability that some database will fail in that second phase, and then your data is corrupted; you have to work around that in some way, because there is no easy way of undoing the second phase, but most systems handle this fine. To make it easier to deal with these transactions across multiple databases, there are transaction managers. This is not done in Python; it's usually done in some other system, for example MQSeries from IBM or Microsoft's DTC mechanism, and some database systems, such as Oracle, offer these transactions natively. You can sometimes hook into them from Python; there are Python APIs for some of these, or you can use a database-specific mechanism like the one integrated into PostgreSQL. In the DB-API we address this with a new set of APIs, the TPC APIs; those are modeled after the XA standard used by these transaction managers. Unfortunately, not too many databases support this, and not many database modules actually provide these APIs, so you have to check whether your database module supports them.
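PEP 249 defines the TPC methods mentioned above as an optional extension (`.xid()`, `.tpc_begin()`, `.tpc_prepare()`, `.tpc_commit()`, `.tpc_rollback()`); psycopg2 is one module that implements them. The sketch below assumes two already-open connections whose modules support the extension, a hypothetical `accounts` table, and the qmark paramstyle; it is illustrative only, not tested against real servers:

```python
def transfer(source_conn, target_conn, amount, from_account, to_account):
    """Debit one database and credit another atomically via two-phase commit."""
    # Create transaction IDs and begin a TPC transaction on each connection.
    xid_src = source_conn.xid(1, "transfer-tx", "source-branch")
    xid_dst = target_conn.xid(1, "transfer-tx", "target-branch")
    source_conn.tpc_begin(xid_src)
    target_conn.tpc_begin(xid_dst)
    try:
        source_conn.cursor().execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_account))
        target_conn.cursor().execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_account))
        # Phase 1: ask both databases whether the commit would succeed.
        source_conn.tpc_prepare()
        target_conn.tpc_prepare()
        # Phase 2: the actual commit; a failure here needs manual
        # recovery, since there is no easy way of undoing it.
        source_conn.tpc_commit()
        target_conn.tpc_commit()
    except Exception:
        source_conn.tpc_rollback()
        target_conn.tpc_rollback()
        raise
```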
Next: concurrent database access. We don't have much time, so I'll go fast. Database access behavior is very important if you have multiple processes accessing a database, for example in a web application: because of the GIL you would normally want multiple processes talking to your database, so it's important to think about how the database deals with the problem of concurrent access. There are a few typical setups; the most typical one is, of course, many writers and many readers, so there's no special case that fits, and in the common case you have to make compromises in the setup.

When writing an application and thinking about this, you have to ask yourself some questions. For example: should readers immediately see changes made by transactions in other processes, so that they see things that went into the database even though the transactions in those processes have not yet been committed? Or should a reader see everything as it was when it started its own transaction, so that it sees nothing that entered the database after its transaction started? Databases can handle all these different situations: they provide what are called transaction isolation levels. But they have to implement this using lots and lots of locks, something you try to avoid in an application, because locks keep your application from running in parallel and from using most of the resources you have. So here is the typical set of transaction isolation levels. The first one is the easiest to implement: read uncommitted, which basically means you don't have any locks. With the read uncommitted isolation level, all processes sharing the data will immediately see every change in the database, even uncommitted changes, so strange things can happen. You can have dirty reads, which means you read data from another process that hasn't committed yet, and that isn't necessarily going to commit later on. You can have phantom reads: another process adds something to the database and then removes it again, and your process may have read that row in between, so you're basically working with a phantom. There are more of these things to watch out for; if you want to read up on them, there's a URL on the slide (it will be in the published talk slides, so you can click on it) with a very good explanation of these things.
The next level is read committed. This is the default level in most database systems, so when you open your connection you will usually get this isolation level. It basically says you're only going to see changes that were committed to the database. You can still see changes that were made while your current transaction is running: if some other process commits while you're inside your transaction, you will still see those changes, but you will not see any uncommitted data from other processes. The way it works is shown in the diagram with the yellow bar: reading the current row puts a read lock on it, and if there is a write lock on the row, the database will wait for that write lock to be removed. So if another transaction has written to that row, that transaction will have put a write lock on it, and the write lock is only removed once the transaction commits; after the other transaction has committed its change, you can go ahead and read the row. So this is basically locking at the level of one row.
The next level is repeatable read. It basically says that data you've read won't change within the transaction: everything the database has returned to the application is guaranteed to stay the same for the whole transaction. This of course requires more locks: you put locks on everything you actually passed back to the application. The highest level is serializable, which means that whatever you do, the database will look exactly as it did when you started the transaction, and nothing will change. This requires lots and lots of locks, not only on the things you've read from or written to the database, but on everything you've ever touched, so possibly whole tables if necessary. All of these levels are necessary for some applications. For example, if you want to run a report, you may want to avoid inconsistencies in the report, so you may want the serializable isolation level; the other levels can be used in situations that are not as strict about data consistency. How do you do this in Python? There are two ways. Usually you have an option in the connection settings that you can set to configure the isolation level, but you can also do it dynamically in your application, on a per-connection basis, so you can have multiple connections using different isolation levels. You can run a statement like SET TRANSACTION ISOLATION LEVEL, or sometimes database modules have special ways of setting the option directly on the connection. What's important to know is that if you want to change the setting while a connection is open, you need to make sure no transactions are currently running on that connection; the easiest way is to just commit the connection or do a rollback.
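The generic pattern, commit first, then change the level, can be sketched as follows. The standard SQL statement shown in the comment is what PostgreSQL or MySQL would accept; SQLite has no such statement, so its `read_uncommitted` pragma (effective in shared-cache mode) is used here only so the sketch is runnable:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.commit()   # make sure no transaction is running on the connection
cursor = connection.cursor()

# On a server database you would run the standard statement, e.g.:
#   cursor.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
# SQLite's rough equivalent is the read_uncommitted pragma:
cursor.execute("PRAGMA read_uncommitted = 1")
cursor.execute("PRAGMA read_uncommitted")
level = cursor.fetchone()[0]
```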
Right, optimization. If you have a database application, of course you want it to run as fast as possible. The first thing you should do is ask yourself what kind of system you're running. An OLTP system means online transaction processing: you're interested in putting lots and lots of data in, and not so much in running complex queries on the database. Or you may want the other thing, data analysis: you already have all the data, usually huge amounts of it, in the database, and you're interested in complex queries, multidimensional or faceted search, and so on; for that you'd use an OLAP system. Just like with transactions, the situation is often that you actually want a mix of both, and you want both to run fast. One way to do this, depending on the size of the databases we're talking about, is to put an OLTP system in front, the system that's actually taking in the data, and then every now and then copy the data over into your OLAP system for analysis.
On a lower level, there are a number of typical problems you can run into. One is, for example, that your queries run too slowly even though the queries are simple: you're just doing, say, a SELECT on a few tables and a few columns. The usual way to address that is to add more indexes. Adding indexes is very easy, and some people add indexes on everything they have in the database, but that slows things down, because every time you write into the database, the database has to update all these indexes. So you should really only put indexes on the columns, or combinations of columns, that actually need them. The best way to find out which tables and which columns to index is to use the query analyzer: databases usually offer a way to get information about how queries run, and you have a look at that, analyze the data, and then check which indexes you should put on the database. It will increase the performance enormously.
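SQLite's query analyzer is `EXPLAIN QUERY PLAN`, and it makes the effect of a targeted index easy to see (the exact wording of the plan text varies between SQLite versions, but the before/after difference is always visible):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")

# Without an index, the plan is a full table scan:
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'x'")
plan_before = cursor.fetchone()[-1]   # last column holds the plan text

# Index only the column the query actually filters on:
cursor.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'x'")
plan_after = cursor.fetchone()[-1]    # now a search using the index
```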
If you're using Python, you can in some situations also add caching at the Python level: you read data from the database and store it in memory for subsequent use. You can even use SQLite for that if you have a small dataset and want in-memory processing. The next problem is complex queries running too slowly. For example, you have a report running over millions and millions of rows; that will usually take a few minutes, depending on how complex the queries are, and of course you don't want to wait a few minutes. A common strategy is to preprocess parts of those queries: every, say, 15 minutes you run the queries, put the results into separate tables, and then run the reports on those precomputed tables.
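The in-memory SQLite caching idea mentioned above can be sketched like this; the `product_cache` table and the sample rows are hypothetical stand-ins for data fetched once from a slower backend:

```python
import sqlite3

# In-memory SQLite database used purely as a local cache.
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE product_cache (id INTEGER PRIMARY KEY, name TEXT)")

# Pretend these rows were fetched once from the real (slow) backend:
rows_from_backend = [(1, "widget"), (2, "gadget")]
cache.executemany("INSERT INTO product_cache VALUES (?, ?)", rows_from_backend)

# Subsequent lookups hit memory instead of the backend:
name = cache.execute(
    "SELECT name FROM product_cache WHERE id = ?", (2,)).fetchone()[0]
```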
Then again, if your queries themselves are too complex, you can do something similar: you simply split up your queries to make them easier for the database to handle, and then you combine the results from those queries in Python. A typical example is a report that has aggregates right in the result set: doing that in SQL is really hard; you can do it, but it's really complex, and it's much easier in Python. In this example you just run two queries, one for the details and one for the aggregates, and then you combine everything into a single result in Python.
Next, database application tips and tricks; this is a collection of random stuff that I thought might be interesting for you. A typical problem is record ID creation: you want to put a new row into a database, you need an ID for that record, and you have a kind of chicken-and-egg problem. The typical ways of dealing with it are to use an auto-increment column in your database and just have the database deal with it, or to have a sequence and get your IDs from that sequence in the database.

The problem with auto-increment, for example, is race conditions: you do the insert with the auto-increment column, the database takes care of adding the incremented value, and then of course you have to fetch that value again, because you want to continue working with the row. The usual way is to ask the database for the last used ID, and depending on how you do it, you run into race conditions or context problems, because it's not really clear what the last ID is: some other transaction, for example, may also just have inserted something, so it's not clear where that last ID came from. Another way is to let the auto-increment field create the ID for you and then query the row back simply by knowing what's in it, but that introduces a performance overhead, so it's not really ideal either. Something we always use in our applications is a very simple, randomized approach: we simply have a big integer field for the ID, and we use a good random number generator to generate the ID for us, relying on not having collisions. We put that ID into the row and insert it into the database, and it usually succeeds; in the very, very rare case where it doesn't, you just create a new random number and try again. How does this work in Python? First you set some constants, so you have a row ID range; what we often do is cap the highest bit so the IDs look nice. Then you have an API, get_random_row_id. This needs to be thread-local, so that every thread generates its own IDs and you don't get any overlaps. Then you have to set up the random number generator: the best way is to use a good entropy source (such as os.urandom) to gather a seed, and then you feed that seed into the thread-local random number generator.
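A minimal sketch of this randomized, thread-local ID scheme might look as follows; the names and the BIGINT-sized range are assumptions, not the speaker's exact code:

```python
import os
import random
import threading

MAX_ROW_ID = 2 ** 63 - 1      # fits a signed BIGINT column (an assumption)
_local = threading.local()

def get_random_row_id():
    """Return a random row ID; each thread seeds its own generator."""
    rng = getattr(_local, "rng", None)
    if rng is None:
        # Seed from the OS entropy pool so parallel threads don't overlap.
        rng = random.Random(os.urandom(16))
        _local.rng = rng
    return rng.randint(1, MAX_ROW_ID)

# On the very rare duplicate-key error from the database, the caller
# simply draws a new random ID and retries the INSERT.
row_id = get_random_row_id()
```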
Right, the next point is referential constraints. People are usually very happy about using them, but does everyone know what a referential constraint is? Only a few.
So, it basically means this: say you have a table and you want to reference a product. Instead of putting the product name directly into your table, you reference the ID in the table that has all the product names, and you just put that ID into your table instead of the name; then in your reports you join all those things together into a nice-looking output. This process of referencing from one table to another is called a referential constraint; you usually use foreign keys for that, and you can implement 1:N mappings as well as N:M mappings. The constraints are enforced by the database, and that can sometimes lead to problems. For example, if you have lots of references in your database schema and you want to load data into the database, the load will often fail, because the order in which the data is loaded does not satisfy those referential constraints. With some databases you can switch off the checking of those constraints during the load phase, but it's not really ideal. Another thing that can happen is a cascaded delete, which is not necessarily what you want. So what we do for these things is: we completely avoid referential constraints in the database and put everything into the Python database abstraction layer, which has much more knowledge about these things; it's also a good idea to put them into the business logic. Then you avoid all these problems: if you ever run into an emergency situation where you have to quickly load data from a backup again, you don't have to think about how to turn off the referential constraints; it just works.
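A foreign key enforced by the database, the approach the talk argues against for some workloads, can be demonstrated with sqlite3; the `products`/`orders` schema is a hypothetical example, and note that SQLite only enforces foreign keys once the pragma is switched on:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")   # off by default in SQLite
connection.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("""CREATE TABLE orders (
                          id INTEGER PRIMARY KEY,
                          product_id INTEGER REFERENCES products(id))""")
connection.execute("INSERT INTO products VALUES (1, 'widget')")
connection.execute("INSERT INTO orders VALUES (100, 1)")   # product 1 exists

# Referencing a missing product violates the constraint:
try:
    connection.execute("INSERT INTO orders VALUES (101, 999)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```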
High availability is another one of those things you have to think about. You have multiple database servers, and when one database server breaks, you must switch over to the other one. There are various systems for doing that, but they tend to not always work perfectly. One thing that can happen is automatic failback: in a failure situation the system switches to a different database server and then automatically switches back, but it's not necessarily assured that all the database servers are actually synchronized by then. You can also get what's called a split-brain situation, where the data is spread across different servers, but you cannot be sure that all the servers hold the same data. And of course some clients may miss the failover event, so they may continue to write into a different database server than the one that is actually currently in use. So again, the best thing you can do is move this back to Python and manage everything there: then you can also handle the communication between the clients, ensure that all the clients know about the failover, do the failback in an optimal way, and you can use this for load optimization as well. For example, you can have the application write into one database, have the synchronization between the databases take care of moving the data to the other server, and at the same time read from the other database, which also avoids some of these locking issues. So, I actually made it through 42
slides in half an hour. That's it; thank you for listening.
Are there any questions?

Q: Thanks for the talk. You said transactions shouldn't be too long; what's a good measure for the size of a transaction if you're writing into the database? Maybe something like 10 to 100 rows per transaction?

A: Something in that range; we've measured it, and batches of around a thousand rows worked out fine for us.

Q: A comment: you may not want to use random numbers for your row IDs, since they make for a wide spread of page allocations in the internal balanced trees, so it may be a better idea to use auto-increment for ID allocation; otherwise you may see performance issues.

A: We've measured this in real-world use and it worked out fine for us, but we also sometimes use sequences from the database to generate those IDs. I think it depends on the implementation in the database.

Q: I have a question about complicated queries. You suggested splitting a complex query into multiple simple queries and moving the weight of the computation over to Python. My question is: why would that be faster? As I understand it, the database is constructed in such a way that such queries run with maximum performance, so why would you be able to do the same logical thing faster?

A: It's not so much about speeding up the operation; it's about making it possible in the first place. There are things you want to do in reports that are simply not possible in SQL, because SQL is too limited for them. And even when you can do these things in SQL, you end up with really huge SQL statements, or you have to resort to stored procedures and the like, which makes everything a lot more complicated. So we found it's usually better, instead of wasting time on how to structure the SQL properly and making it very complex, to just run a few simple SQL queries and then do the combination in Python.

Q: Thanks for the talk, it was very interesting. Just a suggestion based on experience: if people plan to write applications that talk to a database, talk with your database people beforehand. They're usually nice, they don't bite, and they can make your life much easier, because there are so many implementation details to be considered.

A: That's a good suggestion. By the way, many of these things are database-specific, so you really have to know the database if you want to make proper decisions; they don't necessarily apply to all databases. So yes, that's a good recommendation. Thank you.


Formal Metadata

Title: Advanced Database Programming with Python
Series Title: EuroPython 2014
Part: 39
Number of Parts: 120
Author: Lemburg, Marc-Andre
License: CC Attribution 3.0 Unported: You may use, adapt and copy the work or its content for any legal purpose, and distribute and make it publicly accessible in unchanged or changed form, provided that you credit the author/rights holder in the manner specified by them.
DOI: 10.5446/19999
Publisher: EuroPython
Publication Year: 2014
Language: English
Production Place: Berlin

Content Metadata

Subject Area: Computer Science
Abstract: Marc-Andre Lemburg - Advanced Database Programming with Python. The Python DB-API 2.0 provides a direct interface to many popular database backends. It makes interaction with relational databases very straightforward and allows tapping into the full set of features these databases provide. The talk covers advanced database topics relevant in production environments, such as locks, distributed transactions and transaction isolation, and gives an in-depth discussion of advanced database programming topics based on the Python DB-API 2.0: locks and dead-locks, two-phase commits, transaction isolation, result set scrolling, schema introspection and handling multiple result sets.
Keywords: EuroPython Conference, EP 2014, EuroPython 2014
