
An Adventure in Data Modeling

Video in TIB AV-Portal: An Adventure in Data Modeling

Formal Metadata

An Adventure in Data Modeling
The Entity-Attribute-Value Data Model
Title of Series
Number of Parts
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
A case study on Emma's trials implementing the Entity-Attribute-Value data model on their PostgreSQL database systems. Emma, Inc. is an email marketing company that provides a web-based application for managing email lists and marketing campaigns. This is a tale about Emma's experience with a database schema design refactor that now uses the Entity-Attribute-Value (EAV) data model for the members in email lists. We briefly describe the previous data model and why we had to move to a new one, followed by all the places where Emma stumbled and recovered while implementing the EAV data model and rebuilding the application around it. Finally, we touch on where Emma might go next with respect to the data.
Good morning. My name is Mark, and I work at a company called Emma. What I do is try to take care of the PostgreSQL database servers that we have.

For anyone who doesn't remember from before: Emma is an email marketing company. We provide software as a service, a web application that you can use to manage email lists, create marketing campaigns, and track some basic statistics on the people who open, read, click, forward, and so forth.

I'm here to tell a little story about some fun we had changing the data model around our member information, and how we're getting through it all. What we call member information is really just a list of emails, plus whatever attributes each client wants to define on top of that. Some simple examples are first name, last name, and what your favorite database is; you can add things like birth date or favorite color and so forth.
Emma has been around for about 10 years. Once upon a time, when they got started, they decided to partition the database per account, with each account getting about 14 tables of its own. This talk is mostly going to be about getting member information into and out of the database. Back then, exporting the data was really simple, because all the member information for an account was always in a single table.

Here's an example: if I were emailing all the core members of the PostgreSQL project, I would have just their first name, last name, and, for fun, their favorite database, all in one table.

So what was wrong with that?
We had over 40 thousand accounts, and sustaining that in one database made it a little bit hard to mine the data. We ended up with over a million objects in the system catalog, and whenever someone wanted to change, add, or remove an attribute on a table that was heavily accessed across the application, that ALTER TABLE statement could end up freezing the application for everyone.

Some of the simple data mining things we couldn't do with the schema looking like this were as basic as: how many marketing campaigns did people send out yesterday? A real basic problem: with 40 thousand-odd tables, if we tried to query something through a parent table we would end up grabbing far too many locks, and if we wanted to do anything more interesting and start joining tables together, we would end up with multiples of 40 thousand locks.

It wasn't so bad if no one else was doing anything on the system, but that was rarely the case. Usually someone would try to run a query, realize what it was doing to everyone else, and stop; but it wasn't always caught in time, and sometimes clients would call asking why the application wasn't responding. The analytics people didn't like going after each child table separately and having to do a lot of the aggregation by hand. And what the system administrators did not like was trying to back up this data: because of how many objects were in the system catalog, it would take over a day to dump about a terabyte of data for the whole system. They weren't comfortable doing that other than over the weekend, so we only had backups on a weekly basis, which people were unhappy with.
So we did something dramatic a few years ago.

There are a couple of the changes we made that I want to highlight around how we dealt with the member information. We reduced the number of partitions by having a fixed set of tables, 1024 of them, because we like those round powers of two, and we hashed all the accounts into them based on the account ID. This reduced our data set to roughly 1 gigabyte of data per partition. We also developed a homegrown Python middleware layer between the web server and the database, for a couple of reasons: to provide a public API to clients, and hopefully to make development easier for the front end. And then we applied the Entity-Attribute-Value data model to the member information tables.
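As an aside, routing accounts into a fixed, power-of-two set of partition tables can be sketched in a few lines. The hash scheme and table-naming convention here are illustrative guesses, not Emma's actual code:

```python
# Hypothetical sketch of mapping an account to one of 1024 partition tables.
# The naming scheme and use of the raw account ID are assumptions.
NUM_PARTITIONS = 1024  # a round power of two

def partition_for(account_id: int) -> str:
    # With a power-of-two partition count, modulo reduces to a bit mask.
    return "members_%04d" % (account_id & (NUM_PARTITIONS - 1))

print(partition_for(12345))  # -> members_0057
```

The payoff of a fixed count is that every account deterministically maps to one table, and the system catalog stays the same size no matter how many accounts sign up.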
Any data modelers out there? How many people are familiar with this data model? How many fans? This data model basically lets us avoid doing multi-table ALTER statements when clients add and remove attributes, which results in simpler insert, update, and delete statements. Going into this data model, that was the primary reason we chose it: to avoid those ALTER TABLE statements. We did know up front that we would need to query the data differently, and that doing strict type checking would be a little bit harder, like making sure strings are really strings, dates are really dates, and numbers are really numbers.
This data model consists of basically three tables. One has the core member information, all the information that a member really has to have, which in this case is just the email address. Another table defines the attributes we want to track; that's where first name, last name, and favorite database would be defined. And the third table has the actual values of those attributes.

In the diagram, which mainly just mirrors the example we had up there, you can see that in the member table we have surrogate keys that identify which account the member belongs to and the member itself. In the field table we have the type of the attribute, its name, which is actually the unique identifier, and the order in which the columns should appear when you're looking at the member information. In the middle we have the member-field table, with a column for each type of data we're tracking, so a text column for the text fields, one for numerics, and so forth.
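The three-table layout just described can be sketched with an in-memory SQLite database; the table and column names below are illustrative guesses at the schema, not Emma's actual DDL:

```python
import sqlite3

# Minimal sketch of the three-table EAV layout: member, field, member_field.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE member (
    member_id  INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    email      TEXT NOT NULL            -- the one attribute every member must have
);
CREATE TABLE field (
    field_id   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,    -- the attribute's unique identifier
    data_type  TEXT NOT NULL,
    ordinal    INTEGER NOT NULL         -- display order of the column
);
CREATE TABLE member_field (
    member_id  INTEGER REFERENCES member,
    field_id   INTEGER REFERENCES field,
    text_value TEXT,                    -- one value column per tracked type
    num_value  NUMERIC
);
""")
con.execute("INSERT INTO member VALUES (1, 42, 'anna@example.com')")
con.execute("INSERT INTO field VALUES (1, 'first_name', 'text', 1)")
con.execute("INSERT INTO member_field VALUES (1, 1, 'Anna', NULL)")
row = con.execute("""
    SELECT m.email, f.name, mf.text_value
    FROM member m JOIN member_field mf USING (member_id)
                  JOIN field f USING (field_id)
""").fetchone()
print(row)  # -> ('anna@example.com', 'first_name', 'Anna')
```

Note how reading a single member's attributes already takes a three-way join, which is exactly the querying cost the talk runs into later.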
There are a couple of things to know about the middleware. We use SQLAlchemy to model the database; how many Python people out there? Any SQLAlchemy fans? In modeling the database with the ORM, we decided to also paginate the data in the middleware: we didn't want people dumping out lists with millions of members in a single public API call, so we restricted each call to only 500 members at a time. So a 2,000-member list would be queried four times to get all your data out.
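The 500-member cap turns a full-list export into simple arithmetic on the client side. The page size is from the talk; the function name is made up:

```python
import math

PAGE_SIZE = 500  # the per-call cap described above

def calls_needed(member_count: int) -> int:
    # Number of API calls a client must make to export a whole list.
    return math.ceil(member_count / PAGE_SIZE)

print(calls_needed(2000))    # -> 4
print(calls_needed(650000))  # -> 1300
```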
Does everyone know what a pivot is? A simple example: with this EAV model, if we pulled the data out raw, it would look like the top there, basically a row for every attribute you're tracking per member, so first name, last name, and favorite database each come back as separate rows. In the middleware we would rotate that data over so it looks more like a typical spreadsheet. Pretty straightforward so far.
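That middleware-side rotation can be sketched like this; the function and sample data are illustrative, not Emma's code:

```python
from collections import defaultdict

def pivot(rows):
    # rows: (email, attribute, value) triples straight out of the EAV tables.
    # Result: one spreadsheet-like dict per member.
    members = defaultdict(dict)
    for email, attribute, value in rows:
        members[email][attribute] = value
    return dict(members)

raw = [
    ("anna@example.com", "first_name", "Anna"),
    ("anna@example.com", "last_name", "Smith"),
    ("anna@example.com", "favorite_database", "PostgreSQL"),
]
print(pivot(raw))
```

Doing this rotation one Python dict entry at a time is exactly the work that becomes painful at millions of rows, which is where the story goes next.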
To give you an idea of how much data some of our clients have right now, this is a short survey of our largest accounts. You can see that the number of members they have ranges anywhere from about 400 thousand to almost 4.5 million, along with the number of attributes they're tracking per list. In the far right columns I multiplied that out into two numbers: the actual number of field values they've entered, versus how many values they could have. I didn't do anything really concrete with those figures, but the column on the far right seemed to have the bigger influence on how the query performance turned out.

So, who wants to guess how long it took for these accounts to get their data out? Any guesses? Six hours? Ten minutes? No one? So we tried to figure out what was going on.
Well, we knew we had statement timeouts set, and that was easy to disable. Then we found out we were hitting the 2 or 3 hour timeout that Apache had, so to get around that we pulled the code out of the API, just to run it outside of the web server. Then we found that the network switches had a 3 or 4 hour timeout on TCP/IP connections. So we yanked the code over to run directly on the database server, just to see how long it really took to get the data out of the database. For our two biggest accounts, we gave up after waiting half a day. Finally we got down to an account with 650 thousand email addresses, and the data eventually came out in 4 hours.

So we decided that maybe the middleware, maybe Python, shouldn't be pivoting the data; maybe we should have the database try to do a little more work.
Who's familiar with the tablefunc extension? Any fans of the crosstab function? Postgres provides this tablefunc extension as part of the contrib distribution, and it contains several functions, but the one in particular we cared about is crosstab, which will pivot the data. Let me explain it a little bit for this portion: Postgres can pivot data faster than Python, if you use the right function.
When we rewrote the query around the crosstab function and started getting customers their data, we realized we weren't doing it right. Has anyone here used crosstab? Anyone ever fall into this trap? Basically, say a member's favorite database is MongoDB, but she doesn't have a last name entered. The crosstab function at the bottom, the one which takes a single argument, will shift all the data over: it's not capable of padding field values that are null, so everything gets shifted to the left. The customers were calling and asking why their data got scrambled, and we had to assure them we hadn't actually broken anything. That was easy to fix: use the right crosstab, the form taking two arguments, which will properly pad the columns that don't have any values in them.
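The trap can be simulated outside the database. This Python sketch (not PostgreSQL itself) mimics how the one-argument crosstab fills columns positionally, while the two-argument form pads each value into its own category column:

```python
# Simulation of the crosstab trap: positional fill vs. category-keyed fill.
def crosstab_one_arg(rows, ncols):
    # rows: (member, attribute, value) in order; values fill left to right,
    # so a missing attribute shifts everything after it one column left.
    out = {}
    for member, _attr, value in rows:
        out.setdefault(member, []).append(value)
    return {m: (vals + [None] * ncols)[:ncols] for m, vals in out.items()}

def crosstab_two_arg(rows, categories):
    # A fixed category list keeps each value in its own column,
    # padding missing attributes with None.
    out = {}
    for member, attr, value in rows:
        out.setdefault(member, {c: None for c in categories})
        out[member][attr] = value
    return out

rows = [
    ("anna@example.com", "first_name", "Anna"),
    ("anna@example.com", "favorite_database", "MongoDB"),  # no last_name!
]
print(crosstab_one_arg(rows, 3))   # MongoDB lands in the last_name slot
print(crosstab_two_arg(rows, ["first_name", "last_name", "favorite_database"]))
```

In PostgreSQL terms, the second function plays the role of `crosstab(text, text)`, where the second query enumerates the expected categories.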
Next quiz: how much faster do you think that made things? Any guesses? Fifty percent faster? Six times? Anyone else?
So it's pretty good, right? The accounts we had given up on after half a day could get their data out in about 20 minutes, and the account that took 4 hours came in at about 10 minutes. Still slower than the minutes people were used to originally, but at least we could get them their data now, and much faster.

But how many application developers are out there? It was obvious by now that we couldn't use the ORM to model this data, because you'd basically need to model an infinite number of possibilities depending on what attributes clients define, so we had to make the developers more comfortable with writing SQL. And maybe not too surprisingly, the accounts with only hundreds of thousands of members actually took a few minutes longer to get their data this way, but it seemed to be a fair trade-off. We did see problems when experimenting with our growth, seeing how large our customers might become: we found that even using the crosstab function, once a member list gets somewhere between 5 and 10 million members, we couldn't get the data out within 12 hours. Something was still taking too long. So at this point we were looking at our data model choice and deciding that maybe it's a little too inefficient, and it was a little disappointing that the performance started to really degrade with only millions of rows of data.
So we're striving to do better, wondering what to do, trying to figure out what all our options are. The most common question we get asked internally is: why not just pull that data out of PostgreSQL and try it somewhere else? We'll entertain that question, but first we'll model it a little differently within Postgres. Any hstore fans out there? We thought about a key-value store, and Postgres has hstore, a special data type for exactly that; it's another extension that comes with contrib. One of the reasons we decided to try hstore first was that a lot of people were wondering whether something like this would be the best way to do it, and I thought it would be a nice way to model the data, so we tried it.
Some things we knew going into it: if we were to move forward with hstore, we don't get strict typing anymore, everything is a string, and we can't get strong referential integrity constraints between the keys in the hstore column and the fields we have in our field table. To give you an idea of how long ago we thought about this: at the time, hstore wasn't yet supported in psycopg2 or SQLAlchemy. That was a couple of years ago, and I'm sure you're all aware it's in there now.

For those of you unfamiliar with hstore, this is what it looks like if you just pull it out and see what's in there. Pretty straightforward: you've got the email address as the unique identifier, and then in the hstore column we have the key-value pairs for first name, last name, and favorite database.
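For illustration, hstore's text representation is easy to pick apart. This toy parser handles only simple quoted pairs; real code would rely on psycopg2's built-in hstore support, and this ignores escaping and NULL values:

```python
import re

def parse_hstore(text):
    # Extract "key"=>"value" pairs; simplistic, ignores escapes and NULLs.
    return dict(re.findall(r'"([^"]*)"=>"([^"]*)"', text))

row = '"first_name"=>"Anna", "last_name"=>"Smith", "favorite_database"=>"MongoDB"'
print(parse_hstore(row))
```

The point of the format is that all of a member's attributes travel together in one column of one row, which is what makes the export numbers later in the talk so different.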
Then we took a little exercise to see how long it would take to convert all that data. Anyone want to guess how long it took? Three days? Fifteen minutes? Two hours? It took about 2 minutes to take one of our 1-gigabyte partitions and convert all the data. I put the conversion query up mostly for show, but if it's really only 2 minutes, it hardly matters unless someone wants to rewrite it.
The hstore query for a point update kind of looks like it did originally, back when we had a single table: you hit the row by email address and update all the fields individually.
Next: how fast was it to pull big chunks of data out of this? What originally took a couple of minutes for this particular account, then took 4 hours under the EAV model and came down to about 10 minutes with crosstab, came all the way back down to 15 seconds using something like hstore, effectively a single table again.
A proof of concept is nice, and we've only looked at hstore so far; we might want to pull the data out of Postgres eventually, but that's where we are today, still trying to decide what to try next with this data.

A couple of ideas have been thrown around in various places. We have suggestions for using crosstab to pivot chunks of data at a time. After seeing that it took only a couple of minutes to convert the data for an entire partition, maybe we can convert data on the fly from the EAV tables, basically appending the data into an hstore, or converting to hstore as a way of getting the data out. Some people are concerned about whether it's important to still be able to do strict type checking on dates and numbers, so maybe we'll use hstore, or maybe we'll still find something else; maybe we'll try another data model if one looks reasonable. We even got a suggestion to use XML, so we could use DTDs to check our types that way.

So that was my story. Thank you for listening.

[Audience question, largely inaudible in the recording] I imagine it's probably not dramatically different, and probably much better with what's coming up in 9.4. [Further discussion inaudible] Any other questions, comments, jokes? OK, thank you.