Scaling a Cloud Based Analytics Engine
Formal Metadata
Title: Scaling a Cloud Based Analytics Engine
Series: PGCon 2014, talk 16 of 31
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/19087 (DOI)
Production Place: Ottawa, Canada
Transcript: English (auto-generated)
00:03
Hello, this presentation is about scaling a cloud-based analytics engine. It was originally titled Mission Impossible; a marketing person gave it that name. It's not my favorite, so I went back to basics.
00:20
You probably don't know me. I've been around the community for quite a while. I've been a Postgres DBA for about 15 years now. But this is my first talk at a conference. My name is Samantha Billington. And currently, I'm the database manager for a company called iParadise. This project that I'm speaking about here was actually for
00:40
a company called Message Bus, which was a startup that sadly didn't quite make it. But we did some really cool things while we were around. The kind of things that startups do when you can start from nothing, build things the right way, and spend lots of cash. Not all companies have that benefit.
01:05
So a little bit about the platform. It was 100% cloud native. So it was Joyent and Rackspace, not Amazon. And the entirety of the company was cloud.
01:21
It was not just the database architecture, not just the logs; everything that we did was cloud-based. So there was no actual physical hardware. What we found out, well, what I found out after leaving that type of environment, was that it didn't really matter. So what the cloud meant for us was that we needed small to mid-size instances.
01:45
Because we needed many of them. You don't want large instances, for many different reasons. But with many instances, you get a lack of control over your physical resources. So you don't know what else is running on your hardware.
02:00
You don't know what else you're contending with for resources. And even though that's typically not supposed to be a problem, everybody knows it actually is. Yeah, there were many times at four o'clock when the entire cloud became slow, because everybody just got home from school and they're playing a game on the other VM. It wasn't pleasant. There's also often no delay in expanding.
02:21
So when you find you're under load, you spin up new VMs, which is fantastic. That's the whole benefit of the cloud, besides not having to pay for your own hardware: you just get more. It's there. You expand. You contract. But at the same time, you don't have options over your storage.
02:41
So it's the same lack of control over physical resources. You don't necessarily know how your storage is attached to your database. So these are the benefit trade-offs of clouds. Redundancy in the cloud is extremely easy, but it's absolutely necessary. Because it's not just a prediction that you might lose your VM. It happens.
03:00
And it happens a lot. And it's not just a possibility that you can lose your entire cloud; if you're on Amazon, you have probably done that. So these are the benefits and trade-offs. These are the things you have to architect around. So unlike having physical hardware, where you know your limitations, it adds an extra layer of vulnerability.
03:25
And just to reiterate that one more time: those same principles that you build around for cloud, even though necessity drives them in the cloud, still work for hardware too. You don't want to overcommit your resources. You want small backups.
03:42
You don't want large backups. You don't want things taking over 24 hours and then skipping. These are resource limitations that, even though you're imposing them because you're in the cloud, are still good rules of thumb when you're on physical hardware.
04:05
What does your data look like? In our case, we were using classic transactional log data. So we were an email service provider of sorts, and the analytics that came with sending email were what we were dealing with.
04:22
So how many messages per hour, what channel were they, were they marketing, were they direct transactional type? And how do you store that volume of data? Not necessarily in flat files, because we wanted the ability to comb over it. How do you get it into a database where you have
04:41
full SQL to go over it? So by the nature of that type of platform, logs from distributed sources, they're already distributed. So we're not dealing with, say, a web application that has a single entry point for writes. We have the ability to take every single entry point where these logs are created
05:04
and get them into a local warehouse. That's fantastic. That's absolutely fantastic. Because one of the major issues that comes with an application is dealing with a single master. But in this case, when you're dealing with distributed data, that problem simply goes away.
05:21
We needed both long and mid-term storage. So short-term, mid-term storage was the ability to go through everything and analyze it quickly, and then archive that off. For legal reasons, for whatever reason, this is a common practice with data: the ability to use the data that's hot, and to store that data long-term.
05:43
And by long-term here, we meant three months, not necessarily forever. But that's another rabbit hole. Eventual versus immediate consistency: the approach that we took here worked because we didn't need immediate consistency. Again, this is log data, so logs can get held up.
06:00
They can be absent for a while. They take time to write. So nothing about this type of system necessitated immediate consistency. If we had needed immediate consistency, this wouldn't necessarily have worked, but again, it was a nice property that solved a lot of problems.
06:23
The important thing that we needed over all of this was speed of retrieval. Because the end point of this data is an API: customer statistics. That's a very common, high-demand thing for most companies, the ability to tell customers what you've done, how fast you've done it, and
06:41
give them proof that what they're paying for is valid. Marketing will just love that. That's the one thing consistently throughout my career that people have wanted: marketing wants data. And the eventual output of this is being able to show your customer the value in what you've done.
07:04
So I put this slide in here because I realized that not everyone is going to be at the same level here, so this is for developers and for long-term DBAs. The two hot things that we're dealing with here are OLTP,
07:21
online transaction processing. This is the log data that I was just going over: things that are one-time actions that you need to get to disk, heavy on writes. Versus online analytical processing, OLAP, which is combing over all that data and deriving your statistics from it.
07:47
The requirements in this case were absolutely no data loss. That goes back to redundancy. At no point in this could we have a single point of failure. That also meant that we needed the ability to change masters,
08:02
to pause replication, to pause writes, to promote, to any type of critical maintenance. This is a point where historically, if you're not designing for this from the start, it will be a major issue in the future.
08:21
That safety net, that building redundancy into your system where if the database goes away, your system doesn't crash, is a huge point to build around. How many of you have come across the point where you have a web application, and your database goes down, and your web application completely stops? Yeah, don't you wish you had a queue?
08:42
Or a write cache? Yeah, this was, from the start, an immediate need. Secondly, we could not be the bottleneck. Again, this was the output of the service. So in no way can your analytics stop the core of your actual service from running.
09:02
So if at any point the writes slowed down, or the queue wouldn't take entries, there was no way that we could let that stop service. Again, the idea is that the data itself is a service to the rest of the company. The other thing that comes with this type of transactional data
09:20
is out of sequence events. So if a VM becomes unresponsive and comes back later, there's valuable data there. You don't want to just throw that away. It's also possible things just got paused, or you had network hiccups. So the ability to take that data and fold it back in at the proper point in time was a requirement.
09:45
Along with that, you're going to have bugs. They happen. I've never seen a perfect piece of software. Has anyone seen an absolutely perfect piece of software, with any degree of complexity, that was free of bugs? Yeah, I haven't. So to that degree, what we needed was the ability to go back in history and
10:04
restate analytics. Because even over the course of this, there were queries, there were issues, and alphas always have that. So that was a fundamentally awesome thing to have.
10:20
So this really sounds like Splunk. I mean, really: the ability to go over logs, to perform ad hoc queries, to show data. So why not? Well, Splunk is really expensive. And the more data you have, the more expensive it gets.
10:40
That was not an issue here, though. However, this was a customer-facing API. That's not exactly something you want to throw in front of Splunk; you don't want a REST interface in front of Splunk that you hand back to your customers. It was also defined analytics. So these were preset things that we wanted to specifically search for,
11:01
analyze, condense, and have out immediately to a customer. So the difference being your internal debugging and log combing versus what you're going to show as a service and as a business class deliverable.
11:25
Tools that we chose to use, not all of these were the best choices. But some of them were absolutely fantastic. Chef for server automation, which in cloud, even not in cloud,
11:40
it's such a wonderful thing to have. Is everybody here familiar with Chef? Not as many people as I would assume. Chef allowed us to bring up servers in minutes, and they were identical, and they were fantastic. You could say the same for Puppet or for CFEngine, to lesser degrees, but
12:04
some type of automation tool, especially when you're working on several different clusters, especially across different locations, is incredibly useful. Liquibase. Liquibase is my schema tool of choice for schema management.
12:24
I assume that's fairly known here. Yes, Liquibase? No? What do you guys use for schema management? SVN?
12:41
Liquibase is similar, I suppose, to Git, although we actually use it inside of Git; it's specifically for rolling out merges, changes, updates, rollbacks, and keeping consistency in your schema. Chef set up the original database and
13:00
the users and permissions; Liquibase would run on top of that and set up your full schema, so automated rollout with tracking. And it made the ability to do single-line schema updates, over however many merges you might have, incredibly easy.
13:24
I'm actually surprised that it's not a better-known tool. The queue that sat in front of this was HornetQ. It could have been anything; that's simply because we were using a Scala implementation. It could have been a Redis queue.
13:40
It could have been RabbitMQ, could have been Resque. It was simply a queue. RepManager, looking back, caused more problems than it solved. RepManager was in place for promotion and for redundancy with master-slave failovers.
14:02
But I don't actually think that's necessary. Has anyone used RepManager? What did you think of it? Mm-hm. And that's exactly what we did.
14:21
Right. I no longer use it in the stack that I'm using, just because the return value wasn't high enough. And finally, Postgres, the star of the show, which I assume you all use. This is where you all nod: yes, you do.
14:42
You don't actually need RepManager. I've had several conversations about that, and everything that RepManager did can either be done in Bash or with just a straight pg_ctl promote. So unless you really need automated promotion of a master in the case of failover, you really don't need it. And even if you do, Bash-script it.
15:03
But honestly, I would rather wake up at 2 AM or 4 AM and do a manual promote than have it be automated for me. I'm sure some of you feel differently about that, but if my database goes down in the middle of the night, I want to know it. I want to see what's going on.
15:20
It's more critical to me to be able to get hands-on, get up and get to a computer, than to fail over automatically and possibly have more severe side effects. So, writing the raw data. Distributed OLTP: I got into this a little bit earlier. By nature, all of that log-file transactional data
15:45
created a distributed architecture that could be per cluster and per web host. So very elegant, very easy to create redundancy within your cluster and cross-cluster.
16:06
Because it's log data, it's not relational. So actually nothing about this project was relational. That was a rather big shock for me, being a relational database person. This is the first project of any scale I had worked on that wasn't
16:22
specifically tied into foreign keys and that type of data. But it was rather elegant. Your disparate event types can be written to separate instances. So if you have a log of send events, and a log of errors, and a log of updates, all these things can go to different databases
16:45
in different instances, because there's no need to consolidate them back; that's done at a later level. You don't need synchronous multi-masters. Again, because you're not dealing with data that has to be relational, you're not querying a master expecting the same result from
17:00
every other place. You're pulling results from your single instance, or for your single type. And the next piece: batch-process writes with COPY, because it was absolutely essential. That was one of those things where you think, oh, inserts are slow.
17:23
We'll add COPY, and it'll be an optimization for when we need it. That was absolutely not a premature optimization; it was needed from the start, at a very low volume. And when we encountered issues with COPY, well, who has not encountered an issue with COPY?
17:41
Yeah, that's the second point. Without it, we found that the system absolutely slowed to a crawl. Because the safety net for this was individual inserts. Individual inserts, even at a very, very small scale, were drastically bad.
18:03
So this is a must-have from the start when dealing with any type of volume, but you must have a safety net. So when COPY breaks, because COPY does, we reverted to individual inserts. But that had bugs too.
18:22
So developers going through this would go through the list of copies and find the line that broke, because COPY is all or nothing. So the logic was: find the line that broke, throw it out, go back to COPY. Yeah, you get immediately why that's bad, right?
18:44
Everything below that line got thrown out. So beat it with a hammer. Test and test and test again, because developers don't always make the right choices. That goes without saying. But in this case, it's one of those logic things that as a DBA or
19:03
as someone familiar with how this works, it makes perfect sense. But if you're a developer, finding the error and moving on makes better sense. In this case, it's completely illogical.
19:22
This is the thing I love most about Postgres, and that's a big statement after over 15 years in the same software. And inheritance is what I love most. It really is. So, back to the disparate event types.
19:41
As you're writing your types, it's one parent per type. So my sends have a single parent. And if those sends are such a volume that you need to shard, and I'm assuming you do, shard-by-time was the approach that we took. I'm sure it can be done a multitude of different ways, depending on the data.
20:02
But this was write-dependent. So the volume that we chose to start with was one child per day. And that could easily have been up to a child per hour, or a child per half hour, or whatever your volume demanded. But by keeping those children small, you create the ability to do individual backups.
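To make that shape concrete, here is a minimal sketch of one parent per type with one child per day using Postgres inheritance. The table and column names (sends, event_id, created_at) are my own illustrations, not the original schema:

```sql
-- Parent table per event type; children inherit from it.
CREATE TABLE sends (
    event_id   bigserial,
    created_at timestamptz NOT NULL,
    channel    text,
    payload    text
);

-- One child per day, constrained by time. Indexes and primary keys
-- are NOT inherited, so each child gets its own.
CREATE TABLE sends_2014_05_21 (
    CHECK (created_at >= '2014-05-21 00:00:00+00'
       AND created_at <  '2014-05-22 00:00:00+00')
) INHERITS (sends);

CREATE INDEX sends_2014_05_21_event_id_idx
    ON sends_2014_05_21 (event_id);
```

Keeping children small is what makes per-table archiving practical: pg_dump --table=sends_2014_05_21 dumps exactly one day.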
20:23
Your queries are more efficient, and it creates sanity. So those individual children could be archived off one at a time, with pg_dump --table. The caveat to that is that, say, in three months you go back to a single child that was dumped.
20:42
And you try to restore it, to go over that data again because you found a bug. But your schema has changed, and the restore fails. So with this, one of the ways you can fix for that is to restore into its own database, run Liquibase,
21:02
get the table itself up to the current version, and then you're back in business. But you have to make sure that you plan for these things. Because the last thing you want, when you have a critical bug and again you're up at 4 AM, is for your restore to fail. So you know that, you plan for that. You know that you're not restoring the entire database, and
21:22
it's not going to restore perfectly unless you're in the same state. So the other thing about inheritance is your constraints. Like indexes, your constraints are type-specific.
21:42
So if you hand a timestamp WHERE clause to a constraint on a date column, constraint exclusion will ignore it and you will end up going through your entire data set. So, is everyone familiar with constraints on parent-child and
22:01
why they work so well? Right. Make sure you use the same type; otherwise, it's going to take a very, very long time. Our event ID was the other constraint, the serial primary key, which allowed us to close a shard. And always EXPLAIN ANALYZE.
22:22
The queries that are going to be going over this data, compressing this data, need to work, and they need to be fast. Because this is your primary write master. And even if you're doing it on slaves, you don't want that bogged down by something going over your entire data set. So this might seem redundant, it might seem elementary, but run EXPLAIN ANALYZE and
22:42
make absolutely sure you're not going through that full data set. And this is a thing that doesn't show up in tests. It won't show up in black-box testing unless you're under scale. Because with small-scale testing, where you build up your test database and run through a small data set, you're never going to find that.
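A hedged sketch of that type-matching pitfall, with illustrative names: assuming a child CHECK constraint declared on a date column, the planner only prunes children when the WHERE clause compares with the same type.

```sql
-- Child constrained on a date column:
--   CHECK (log_date >= '2014-05-21'::date AND log_date < '2014-05-22'::date)

-- Prunes to one child: the literal is compared as a date.
EXPLAIN ANALYZE
SELECT count(*) FROM sends WHERE log_date = '2014-05-21'::date;

-- Can scan every child: the timestamptz comparison forces a cast the
-- planner cannot line up against the date CHECK constraint.
EXPLAIN ANALYZE
SELECT count(*) FROM sends WHERE log_date = '2014-05-21 00:00'::timestamptz;
```

The EXPLAIN ANALYZE output is the only place this difference is visible before you are under real load.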
23:04
So specifically look with EXPLAIN ANALYZE, or do load testing; those are the only two places to find this. So, when using inheritance like this, you need to make sure that you have a place to write. Which means that in advance of the time when you're going to be using that shard,
23:23
it has to exist. We used Quartz for cron, just because we were in a Scala environment. However, you can use basic cron; it really doesn't matter. Code to go through and create the actual shard, including all of your indexes and your check constraints.
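The slide code itself isn't captured in the transcript, but a create-shard function like the one described, run ahead of time from Quartz or plain cron, might look roughly like this. All names and the daily granularity are assumptions:

```sql
-- A sketch; meant to run once per shard, ahead of the day it covers.
CREATE OR REPLACE FUNCTION create_send_shard(day date)
RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    child text := 'sends_' || to_char(day, 'YYYY_MM_DD');
BEGIN
    -- Create the child before writes arrive, with its time constraint.
    EXECUTE format(
        'CREATE TABLE %I (
             CHECK (created_at >= %L AND created_at < %L)
         ) INHERITS (sends)',
        child, day, day + 1);
    -- Indexes and keys are not inherited; add them per child.
    EXECUTE format('CREATE INDEX %I ON %I (event_id)',
                   child || '_event_id_idx', child);
END $$;
```

From basic cron, something like psql -c "SELECT create_send_shard(current_date + 1)" once a day would keep the next shard ready.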
23:48
I did not optimize any of these, and they're actually kind of simplistic for the sake of just getting them on a slide. So there's no pre-declared variables or other extreme optimizations.
24:07
The other important part is closing the shard. And by that I mean: while you're actively using a child, you're running from a base event ID, the primary key, up until the time that you stop. And if your query is using an ID to constrain itself,
24:25
and you never determine your final ID for that time period, again, you're going to go through your entire data set. So, when you're done with this shard, nightly, hourly, when you have passed the time that it is a viable table to write information to, you must close the shard.
24:45
So, this was the function created to go through. Select all of the tables, parse out the names, and then loop through everything that is not closed yet.
25:04
And determine your minimum and your maximum event ID, so that you can replace the current constraint, which just has a start, with a start date and an end date. Is that clear, the reasoning, the logic behind that?
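The function itself isn't in the transcript, but the pass just described, list the children, skip the ones already closed, find min/max event IDs, and add the closing constraint, could be sketched like this under assumed names:

```sql
-- A sketch of the close-shards pass over children of "sends".
-- In practice you would also skip the child still receiving writes.
CREATE OR REPLACE FUNCTION close_send_shards()
RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    child text;
    lo bigint;
    hi bigint;
BEGIN
    FOR child IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        WHERE i.inhparent = 'sends'::regclass
          AND NOT EXISTS (SELECT 1 FROM pg_constraint
                          WHERE conrelid = c.oid
                            AND conname = c.relname || '_id_range')
    LOOP
        EXECUTE format('SELECT min(event_id), max(event_id) FROM %I', child)
            INTO lo, hi;
        CONTINUE WHEN lo IS NULL;   -- empty shard, nothing to close
        EXECUTE format(
            'ALTER TABLE %I ADD CONSTRAINT %I
                 CHECK (event_id >= %s AND event_id <= %s)',
            child, child || '_id_range', lo, hi);
    END LOOP;
END $$;
```

With both ends of the ID range constrained, ID-bounded queries can prune closed shards the same way the time constraints do.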
25:21
This is actually a more advanced crowd than I thought. So: Star and Snowflake analytics. This was, again, because this was a client-facing analytics engine. The API had to be fast, sub-second fast. So Star and Snowflake was basically the only choice for that.
25:45
Which added also the bonuses of being able to partition it, use table spaces if you'd like. Very basic Star Snowflake data warehousing techniques.
26:00
So, again, we were using Quartz. And it was a three-part way to get data from our transaction warehouse into our analytics warehouse: we pull all the aggregates in, reaching out to all of our transactional data; update and
26:22
add into the base fact, just the very base level that you want; and then cascade. So, in our analytics database, our children, as we discussed, were being pulled by sequential ID. So: position, pull, update, look up, pull and update, in a single transaction.
26:45
And saving the state so that you can go through and do analytics on how much you've done and how your performance is. That was a pretty awesome thing for us, so that we could determine what our resources were and what we needed to throw at a cluster.
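The code slides aren't captured in the transcript, so here is a hedged sketch of one pull cycle as described: position, pull, aggregate, and save state in a single transaction. All names (pull_state, agg_sends_staging, a dblink connection called oltp_conn) are assumptions; on 9.1 this used dblink, and today a foreign data wrapper would replace it:

```sql
CREATE OR REPLACE FUNCTION pull_sends()
RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    last_id bigint;
    new_id  bigint;
BEGIN
    -- Lock the saved position so concurrent workers don't double-pull.
    SELECT last_event_id INTO last_id
    FROM pull_state WHERE source = 'sends' FOR UPDATE;

    -- Aggregate on the pull, landing in a real (not TEMPORARY) table
    -- so an interrupted run never has to re-pull over the wire.
    EXECUTE format(
        $q$INSERT INTO agg_sends_staging (hour, channel, n, max_id)
           SELECT date_trunc('hour', created_at), channel,
                  count(*), max(event_id)
           FROM dblink('oltp_conn',
                       'SELECT created_at, channel, event_id
                        FROM sends WHERE event_id > %s')
                AS t(created_at timestamptz, channel text, event_id bigint)
           GROUP BY 1, 2$q$, last_id);

    -- Save the new position in the same transaction as the pull.
    SELECT max(max_id) INTO new_id FROM agg_sends_staging;
    IF new_id IS NOT NULL THEN
        UPDATE pull_state SET last_event_id = new_id WHERE source = 'sends';
    END IF;
END $$;
```

Because the position update commits atomically with the staged aggregate, a crash leaves either both or neither, which is what makes the retry safe.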
27:03
So this is the code for an aggregate on pull. This was done in Postgres 9.1. Now we would use foreign data wrappers. This is also the slide where I should add a caveat: currently, foreign data wrappers with 9.3, and
27:22
the ability to write back into your primary source, would have been far superior. And the reasoning for that is to keep track of your ID. That was a point of contention: where to keep track of your pull. So whatever ID your last pull stopped on is the point where your next one
27:43
will start. Keeping track of that, or keeping a work log, in your transactional database matters if your analytics database goes down and another one takes over for it. Say in a multi-cloud situation,
28:04
where you have one cloud picking up for another. If you cannot access your analytics engine, keeping that on the transactional database is absolutely essential. But if you keep it in your analytics, it's easier to get to and it's just easier to process.
28:21
So that was a back and forth, where we ended up actually just keeping it in both places. So, this I touched on before: when you aggregate data, via dblink or FDW,
28:40
from your transactional data, only go into your finest facts. So if your star snowflake is five dimensions deep, you don't want to aggregate into every single one. You want to go to your finest one, and then you want to continually condense down.
29:00
So this is where the query, or the code, gets a little yucky: updating into the base fact. So, your event source... I should have cleaned this up, I really should have.
29:22
However, the idea is that when you aggregate on that pull, you create not a temporary table; you actually create a table in your analytics database that holds all the aggregate data. And at that point, you select and condense from that table into your fact table.
29:41
It's not a temporary table, because temporary tables aren't idempotent. This was a case where we wanted several workers to be able to come along and aggregate a table back into the main fact. We wanted things to survive if this was an interrupted transaction. We don't want to have to re-aggregate,
30:02
because that's the expensive part of this: going to your transaction processing, grabbing all that data, sending it back over the wire, and then pulling it into your fact. So that's the point where everything has to be set on disk and solid. That's the idempotency of this.
30:23
So, and it gets worse. This is where you take the base facts and, in two phases, update everything down the line and insert all the new data down the line. And that includes all of your data that might have come in delayed, which
30:43
needs to get aggregated back into the proper time period it came from. Or the proper whatever it came from; in our case, we had multiple different ways of aggregating, based on channels and sessions and date. But the idea is that historical data needs to get filed properly, historically.
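That two-phase fold might be sketched like this, under assumed names: update first so late-arriving data lands back in its original time period, then insert the genuinely new rows. Modern Postgres would collapse this into INSERT ... ON CONFLICT:

```sql
-- Phase 1: update facts that already exist, folding delayed data
-- back into the period it came from.
UPDATE fact_sends_hourly f
SET send_count = f.send_count + s.n
FROM agg_sends_staging s
WHERE f.hour = s.hour AND f.channel = s.channel;

-- Phase 2: insert aggregates for periods not seen before.
INSERT INTO fact_sends_hourly (hour, channel, send_count)
SELECT s.hour, s.channel, s.n
FROM agg_sends_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM fact_sends_hourly f
    WHERE f.hour = s.hour AND f.channel = s.channel);
```

The same update-then-insert pair cascades down each coarser dimension level of the star in turn.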
31:07
So, what we found was that aside from being a service that we provided to customers and an analytics and marketing information piece, everything showed up here.
31:20
Every little problem that we found higher up, you see the end result here. Because it's not just marketing data, this is performance data. Something went wrong up there, and I could see it here, because we accidentally sent 55,000 emails to that one guy in five minutes. No, we didn't, but we did send text messages.
31:42
That's what happens when you have your text routed through email. So don't always trust your data. If something looks funny, don't assume it's your problem. Don't assume that you have a bug. You might, but don't immediately think the worst.
32:02
Maybe it's somebody else's problem up the chain, because this is the funnel where you will find everything. And it's also really kind of nice when you can say, nope, not my problem. So putting it all together, what it looks like. Inputs from our transaction processing.
32:23
In this case, it was the queue to our analytics, sorry, to our transactional processing databases. Different types, different copies of types, into your first warehouse.
32:42
And out through your query head. So this was a base system that can have one or more OLTP databases. And just very simple, very basic cluster configuration. The next scalability point is when your log files or
33:01
when your transactions get to be more than you originally wanted. Each of them can scale independently and still go back into the same analytics warehouse. And the further scaling of this is to go into separate warehouses. So multiple transaction and
33:24
multiple analytics and all out through one or multiple query heads. And in our case, everything was based on Dune. So we started with Harvester.
33:41
Harvester gathered all of our data, and the data had to flow, so it was our spice. And because our analytics knew everything, we had Alia, and out through the Mentat. I can't take credit for that; I just named a couple of them.
34:01
So questions? Yes. That was one of the requirements. Yep. We were using streaming replication.
34:22
And at the time, some of the bugs weren't known. But we were reading off of the master as well as writing to the master. The slaves were for hot-swap redundancy, not necessarily for reads and writes, and because they were small instances, we were able to scale that way, async.
34:42
This was before sync. Again, the slaves were there for hot-swap redundancy. What we were doing with no data loss was primarily having the queue layer. So everything wrote to the queue, and the queues were redundant. So Hornet writes through to disk.
35:01
It's not an in-memory only cache. So we had both Hornet writing to disk across, and it had slaves as well. And then from that, the copy batch into Postgres. If there were any issues with there, the Hornet queue was not cleaned, so it would retry, and then the retry at the copy level as well.
35:21
So if the entire copy failed, it would revert back to inserts. And then from the Postgres level, once you have it there, we were reading writing off of the master and keeping the master small enough and light enough load that we didn't necessarily have to go to the slave. But if we did, we had instantaneous, or not instantaneous, but very, very quick promote, and we could go back over the master if needed.
35:56
This was a year ago. And at the time, one of the major things I disliked about RepManager was
36:03
that it has a 5,000-file limit to keep on disk. It assumes that you have a very long time between when you start your rsync to build a slave and when you stop. You can set that lower, but it was just a pain.
36:23
Yes. It was on a manual basis. So this system was tested up to 200 gigs a day. And it was put under load for several days,
36:42
during load testing and then in production. Sadly, the company went under before it got to see that level of load. So I don't know what the final numbers were, but they were not bad. The scale-out, though, was by hand.
37:03
And usually you can see it coming far enough in advance, at least in our case. But that would be an easy thing to automate, because once you have the Chef tools in place to spin up a new cluster, it's really easy to set limits on write numbers and whatnot and kick off a new cluster.
37:21
Yes. Being a start up, this was a company of 30 people.
37:40
And I wrote all the SQL for it. There were two other developers, like I said, that helped with the REST API and writing the end points, which is how we ran into the copy bug. But all the SQL itself I was lucky enough to touch directly.
38:04
This is a perfect case for a DAL, and this was in itself a data access layer. Had we grown to the point where developers wanted to add more statistics, it would have remained sequestered off into its own access layer; it had the REST end points.
38:22
Everything that would have needed to be pulled outside of it would have come off of that REST end point. Yes. No, well, it was an invalid row. The logic on that was: the COPY failed, and COPY is all or nothing.
38:45
That was one of the requirements I set out in the writing of the inserts. So the fallback to that was to do individual row inserts. The developer who went through that went insert by insert, starting from the top, until he found the error.
39:01
Once he found it... yeah, it was not an actual bug with COPY. It was a logical bug in not continuing on from the point of error. If you went to the tutorial yesterday, the secondary tutorial on INET, did anyone go to that? There was a cute way of handling COPY failures where everything was
39:23
parsed into a sequential ID column and a secondary column that had all the data, as just a straight COPY INTO of one nice large blob. And the secondary piece of that was to split it to an array based on some item in your data. So from that, it's a two-part COPY INTO that alleviates a lot of the problems.
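That trick might look roughly like this; the table, delimiter, and target columns are illustrative:

```sql
-- Land every raw line as one blob next to a sequential ID, so a
-- single malformed line can't sink the whole COPY.
CREATE TABLE raw_lines (
    id   bigserial PRIMARY KEY,
    line text
);
-- COPY raw_lines (line) FROM STDIN;

-- Second pass: split server-side and keep only well-formed rows.
INSERT INTO sends (created_at, channel, payload)
SELECT parts[1]::timestamptz, parts[2], parts[3]
FROM (SELECT string_to_array(line, ',') AS parts
      FROM raw_lines) s
WHERE array_length(parts, 1) = 3;
```

Malformed lines simply stay behind in raw_lines, where they can be inspected instead of aborting the batch.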
39:43
I thought that was a nifty little trick. It would have been a lot slower in this case, so I wouldn't have coded around it, but I might have used it for my safety net. So my safety net being: my COPY INTO broke, COPY it as one entire column, split that column, and then try and copy those into something else. Anyone else?
40:05
Colin?