
RelStorage for mere mortals


Formal Metadata

Title: RelStorage for mere mortals
Number of Parts: 45
License: CC Attribution - NonCommercial 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Bristol, UK

Transcript: English (auto-generated)
My name is Matt, and this is Adam. A quick show of hands first: how many people have used RelStorage in production? I'll just start off the talk by saying that you are slightly more advanced than we are. When we decided to do a talk for the Plone Conference, we were about to embark on using RelStorage in production, and we still are about to, but it's not there yet. One of the things we found when looking into it was that there wasn't a great deal of chatter or documentation about it, so we thought it would be helpful to do a talk anyway and give a bit of a case study on how we're planning to use it. So, do you want to say who we are?
Yep, sure. We both work for Netsight, we've both got double-barrelled surnames, we both use Emacs, we love Star Wars. We've both worked on plone.api quite a bit, and we kind of kicked off the Plone Intranet packages that are going to be coming out soon, and we also wrote a pop song called Christmas in July with Arno Blumer, so there's a picture of us in the video.
That's worth a watch. It's not about RelStorage, but it's just as good. So, I suppose we're not using RelStorage in the out-of-the-box format. The main reason we looked into RelStorage was for redundancy, and because of the way the Plone site, and I use that term loosely, is built: it is a Plone site, but it's kind of Plone sites within Plone sites, and they share data, which currently isn't very easy to do with file storage. The project we're using this on currently has about 90,000 user accounts, all membrane objects, and about 44 million objects in the ZODB, so as you can imagine, the catalogs are pretty slow. All of this eventually points back to one file in the file system. There's just one Data.fs, which is pretty bonkers, and obviously that file has to live somewhere, and replicating a file is not easy. Although we've got more than one server, if the server holding that file goes down, there's no data and everything goes down.
So, the limitations of file storage and ZEO: like I said, we've got three front-end servers, each with four instances serving requests, but the ZEO server is still a single point of failure, because they all have to talk back to ZEO to get the data. ZEO requires an in-memory index, and there is a way to add replication for file storage on a file system, but it's not that easy, plus our Data.fs at this point is about 150 GB, so it's not small.
Who here has used Zope Replication Services (ZRS)? Cool. We briefly looked into that, but because it was only recently open-sourced, we couldn't find anything about it other than its own documentation, so it may well be a solution. For our use case it has many of the same benefits, but we decided to look at RelStorage because it seemed to have been around a lot longer and we could find out more about it.

So, what were the other possible solutions? Well, the site isn't really using blob storage either. It was migrated from a Plone 3.0 site quite quickly, and we didn't move everything to blob storage, so moving to blobs was one option. The other option was maybe to start again and choose something else, but that wasn't really feasible. There's a lot of relational data in parts of it; the same project has featured in previous Plone conferences, where we've talked about architecture decisions we've had to remake later on and about moving other things to a relational database for other reasons, but starting again wasn't possible because it's so big. So that's why we decided to look at RelStorage.
So brief intro to rail storage. Most of these points I've stolen from other presentations on rail storage, so thank you to the people who wrote those. It's a drop-in replacement for file storage if you don't already know, so it's still the ZDDB, but it's not stored in a big file. It stores pickles in a relational database of your choosing, has support for MySQL and
Postgres. There isn't a huge data FS anymore, and you can actually have two... There are two options for blobs. If all your instances are on the same server, then the blobs can all be...
It can act kind of like Zio and share a directory where the blobs are stored. If they're not on the same server, then they can actually live in the relational database, and then each of the front-end clients maintain a cache of the blobs,
which is what we'll be doing because they're not all on the same server. It's actually very, very simple to set up. One of the things, and part of the reason we wanted to do this talk was because it seemed like a bit of... When we initially approached
rail storage, it seemed like some kind of black magic, and then the more we looked into it, it was like this is actually really simple and seems to give good benefit. So it's supported by clone recipes up to instance out of the box. You just provide the rail storage option
instead of the file storage option and give it information about your database. Then there's a tool that comes with it where you can convert from an existing data FS. So you just set up a little config file that points to your data FS and the rail storage, and you
run this set of to be convert tool which comes with the rail storage egg, and it will convert it for you. So that's very simple. Something which hit us, which was kind of hidden in this blog
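Neither the buildout snippet nor the conversion config appears in the transcript, so here is a rough sketch of the two pieces being described, assuming PostgreSQL; the database names, paths and credentials are made up:

    [instance]
    recipe = plone.recipe.zope2instance
    # ... the usual instance options ...
    # rel-storage replaces the normal file-storage settings; blobs stay in the
    # database and each client keeps a local blob cache
    rel-storage =
        type postgresql
        dsn dbname='plone' user='plone' host='localhost' password='secret'
        blob-dir ${buildout:directory}/var/blobcache
        shared-blob-dir false

And a zodbconvert config file, with a "source" and a "destination" storage section, run with the zodbconvert script from the RelStorage egg:

    %import relstorage

    <filestorage source>
        path /path/to/var/filestorage/Data.fs
    </filestorage>

    <relstorage destination>
        <postgresql>
            dsn dbname='plone' user='plone' host='localhost' password='secret'
        </postgresql>
    </relstorage>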
post somewhere, was that by default for MySQL at least, the maximum allowed packet for MySQL is smaller than most ZDB objects. So you have to increase that variable. Otherwise, you'll have some issues attempting to store objects in MySQL. It initially really frightened me because we got
down to the point of, oh, rail storage is working. It's great. Let's use rail storage. Okay, let's convert some data, and it just kept dropping out with an error that didn't even hint that the packet might be too large. And I was going, oh my goodness,
rail storage is, no, no, back out, back out. But it fixed everything. Does that mean it's the biggest file you can store even in MySQL? No, it's just the size of the data that the ZDB convert is sending to MySQL is bigger than
just the default is really all it is. So the packets are bigger, basically, than it's expecting. I think the default is maybe 16 meg, and so they're a bit bigger. So you may, I mean, I suppose you may, if you have objects larger than that, you may. Has anyone who's used this had to make it larger than that, or is nobody using MySQL?
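The fix they describe is just raising MySQL's max_allowed_packet. A minimal sketch; the 64M value and the file path are illustrative, not from the talk:

    # /etc/mysql/my.cnf (path varies by distribution)
    [mysqld]
    max_allowed_packet = 64M

The same variable can also be raised at runtime with SET GLOBAL, but changing the config file and restarting the server is the usual way to make it stick.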
Has anyone who's used this had to make it larger than that, or is nobody using MySQL? Nobody. Okay, that value worked for us. Do you want to talk about this one?

So, we were surprised, when we started looking into this, at how easy it was to set up, and the conversion is relatively simple. We wondered why we'd not seen this before, given we've been using Plone for many years, and why we'd never thought of it. We looked through previous talks on RelStorage, and there were slides on the downsides of RelStorage, but none of them are actually really downsides. Everyone loves object databases, and you've still got the ZODB; the ZODB is still there. You don't have to start writing SQL or anything. It's still the ZODB on the front end. And, yeah, we haven't yet come across an actual downside to using RelStorage.

There's not really that much documentation on it, at least in docs.plone.org. I think it might be referenced once, and the docs don't tell you how to use RelStorage, it's just a reference to it. I think on that page it says you could use RelStorage and then points to the PyPI page, but that's it. It doesn't even do that? Cool. It is mentioned in the Substance D documentation, and in Pyramid and Substance D it's kind of the normal advice: you get the ZODB with file storage out of the box, and if you're going to be deploying with over a certain number of objects, use RelStorage. I think it was Matt R. who told me there was a magic number: if you've gone over this many objects, switch to RelStorage. I haven't actually been able to find that number, but when the objects grow into the tens of thousands, I would say it's time to start looking at RelStorage, really.
So, for our use case: the setup originally was one Plone site which, although it was one Plone site, is kind of managing many sites, so each domain points back to an Archetypes object within that Plone site, and there are around 50 of these. So there are roughly 50 sites using this one Plone site. I know it's crazy, but that's where we were, and there was no real redundancy, no real replication of anything. The solution we came up with, which involves RelStorage, is that you have a master site, which deals with all the legacy sites that are still in this one monolithic Plone site, and also with all the shared data and all the authentication. Then you have the new satellite sites, which are smaller Plone instances that just have one domain pointing at them and one set of user traffic going to them. They have a replicated version of the RelStorage data, which handles all the shared data, but they also have their own ZODB using file storage, and that's how local data is saved.
So this is kind of what it looks like. You can see the master there: it has a read-write connection to the shared database. The administrators visit that site and edit the shared data, which is written via the ZODB to RelStorage, and then, using MySQL or Postgres replication, that data is replicated to the satellite servers. The satellite Plone instances have read-only access to that shared data.
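RelStorage itself doesn't do the replication; that's left to the database. The talk doesn't show the database-side setup, but here is a hedged sketch of the idea using PostgreSQL streaming replication (9.x-era settings; the hostnames, user and addresses are made up):

    # postgresql.conf on the master (primary)
    wal_level = hot_standby        # called 'replica' on newer PostgreSQL releases
    max_wal_senders = 5

    # pg_hba.conf on the master: let the satellites connect for replication
    host  replication  replicator  10.0.0.0/24  md5

    # recovery.conf on each satellite (standby)
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replicator password=secret'

    # postgresql.conf on each satellite
    hot_standby = on               # allow read-only queries while replicating

MySQL's classic master/slave replication would fill the same role; either way the satellites only ever read from their local copy.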
And what we have now, which we didn't have before, indicated by this dotted line, is that if the master site does go down, these satellite sites can stay up. They've got the shared data, although they can't edit it. The sites are still live, you can still view the data, and they're not dark.
Yeah, so there is a delay. It's just a question of the administrators of the master site knowing that if they add something, it's not instantly on the satellite sites. How big is the lag? We don't know yet, because it hasn't been put into production, but we're hoping it won't be days; we're hoping it'll be more like minutes.
What that previous diagram alluded to is that part of the challenge of this particular setup is that we've developed this platform for a client who effectively resells it to other people. Although it was all one Plone site, some content, for instance certain training modules or help, is shared between all of the endpoints that their customers connect to, but some of it is separated off and specific to the customers they resell to, and those customers shouldn't be able to see one another's content. In the single Plone site we've managed that with permissions. But what we need in this solution, which we're trying to make scale, is a central site where the site administrators manage shared content that is available to everyone, and then other Plone sites where some of the content is specific to just that Plone site, but some of it comes from a shared source. And so the extra step we decided to introduce here is
mount points, which was fun. So, has anybody... Have people used mount points much in their sites? Has anyone used mount points and rail storage at the same time to mount
different file storages? We'll let you know. So, basically what...
Yeah. Good tip. Thank you. Don't do it with... You had it here first. Don't do it with MySQL.
So, what we decided to do is basically have this shared RelStorage connection mounted into each of the Plone sites. If I go back a few slides: the arrows are basically mount points. The master has a read-write mount point into the shared database, and the satellites have read-only mount points. If you've never set up a mount point before: if you look at your out-of-the-box setup for Plone or Zope, you will probably just see something like this, which means the storage I'm pointing to is the root of my Zope site. Instead, you can mount a storage into a certain part of your site, which will take an external ZODB and make its data available at that point. This one, as you can see, has two paths in there. With two paths it says: mount this part of this remote ZODB at this part of my local site, which is what we're doing, because of the way that paths work and because we've moved part of the site into this RelStorage. If we only had one of those arguments, it would try to mount the entire remote ZODB inside our site, which would be wrong anyway.
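The zope.conf being described isn't readable in the transcript, but a minimal sketch of that kind of configuration might look like the following; the site names, paths and DSN are made up, and the exact order of the two mount-point paths is worth double-checking against the ZODBMountPoint documentation:

    %import relstorage

    # zope.conf on a satellite: the local database, much as out of the box
    <zodb_db main>
        mount-point /
        <blobstorage>
            blob-dir var/blobstorage
            <filestorage>
                path var/filestorage/Data.fs
            </filestorage>
        </blobstorage>
    </zodb_db>

    # the shared RelStorage database, mounted read-only into the local site;
    # /Plone/shared is the path in the local site, /shared the path inside
    # the mounted (remote) ZODB
    <zodb_db shared>
        mount-point /Plone/shared:/shared
        <relstorage>
            read-only true
            <postgresql>
                dsn dbname='shared' user='plone' host='master.example.com' password='secret'
            </postgresql>
        </relstorage>
    </zodb_db>

The read-only flag here is the storage-level argument mentioned next; on the master the same stanza would omit it. The mount point object itself still typically gets added through the ZMI once the zodb_db section exists.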
There's also another argument to RelStorage, in fact to any storage, which is whether it's read-only or not. This slide should maybe have been earlier in the talk, but a big benefit of RelStorage is replication, because SQL replication in its various guises is pretty reliable and is used by some pretty huge companies who love SQL and need it to be reliable. But this is not a talk about SQL replication; you'd need to go to an SQL conference for that. MySQL has had replication added to it, but as we've already discussed, you maybe shouldn't be using MySQL. Postgres has had it from the beginning. Are you using Postgres rather than MySQL? What are you using as an alternative to MySQL? Is it Postgres? Postgres.
Okay. We've been doing measurements on our databases, and we have quite a lot of data coming in from everywhere. MySQL was a bit faster at reads than Postgres, but Postgres just runs, it's watertight, whereas we had a lot of problems with MySQL. Maybe we can blame CentOS a bit for that, because it wasn't all set up properly, but I'm not sure MySQL is really a good choice. Another problem was that MySQL grows and grows, and even with garbage collection, like packing, it doesn't shrink anymore. And another problem was...
Do you want to come up here? So, what this setup gives us is that these, as Adam alluded to earlier,
we should have had this appear every other slide, is that we can take this main server offline completely, and these databases become effectively, they just get stale,
and then when it comes back up again, they start replicating again. So these are read-only, the satellite ones.
No, so the front end, there are basically no writes to those shared mirrors.
So, in this particular setup, if we take the master offline, they can't manage the shared data anymore for that particular area,
because the master site is the only place that reads and writes to that. When we were reading, master-master replication was, every other blog post or documentation I read said,
maybe don't do master-master replication. Yeah, I think what I read about master-master replication is that RailStories can't cope with that, because various SQL things can do that, but because that happens at the SQL level,
it doesn't know anything about how the ZDB is structured, and so the data will just get corrupted, because it doesn't know how to keep... It's only RailStories that knows how to keep the data in there.
Yeah, yeah.
The master just holds the collected shared data, which is only ever intended to be read-only for users; even general users on the master are only reading that content. So they wouldn't get an error, they'd still just have their own Data.fs. Do you want to talk about that one?
So, some issues we ran into quite quickly. Because this site, which is currently just one big monolithic master site, had catalogs that got so slow, the catalogs were split up into a catalog for one type of object, a catalog for another type of object, and then the portal_catalog for everything else. That very quickly became quite difficult with read-only RelStorage mount points. Every other access also does some re-indexing, and because the storage is read-only, you get a ReadOnlyError raised. So things you wouldn't expect to be writing to the ZODB, like plone.app.imaging, become a problem: when you scale an image it creates a little cache and then tries to write it back. Every time we did something like just loading a person's profile, it tried to create an image scale, tried to write back to the ZODB, and because it was a satellite site, it raised an error. You may not use RelStorage in a read-only form at all, but if you do, it's something to be aware of: lots of things actually write when you don't expect them to. What did you do?
So, the plone.app.imaging one we haven't actually solved yet. For the catalog re-indexing, we essentially rewrote where the catalogs actually live, so the satellites have kind of local copies of the catalogs. And is that what we did? Well, no. We basically patched the code that looks up catalogs, to make sure that the satellite sites never try to put stuff in them. This is still using Archetypes, and one of the issues with Archetypes is that when you re-index something, it actually modifies the object itself as well as the catalog. So there's basically code that says: if this is the master, I'm allowed to store this in a catalog; if I'm a satellite site, then I never write to catalogs for this particular type of content. That's basically where we've had to patch in code. Just to step back slightly: because we're mounting particular parts of the ZODB in a read-only fashion, if you want to use a catalog to query that data, which we do a lot, you have to mount a shared catalog as well. We can't use the local portal_catalog on the satellite sites, because they're obviously not notified when the external ZODB changes. So the master site manages its own catalog for each section of the site, and those catalogs are then also mounted. And then there's code that queries those catalogs.
Which leads on to the second point here: in our original setup we had four mount points, a set of data, then a catalog, then another set of data, then a catalog. What we quickly realised is that if you try to write back to two mount points, it will fail; it just throws an exception. So we had to include the catalog in the same mount point. Rather than having the catalogs all at the root of the site (portal_catalog and membrane_tool, for example), we have, say, /users, which is where all the user objects are, with the membrane_tool in there too, so there's just one mount point, and that solved the issue for us.
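The patch itself isn't shown in the talk, so here is a minimal, hypothetical sketch of the idea, skipping catalogue writes for shared content on satellite instances; the IS_SATELLITE flag and SHARED_PREFIX path are illustrative, not from the talk:

    # Hypothetical monkey patch illustrating the approach described above.
    import os
    from Products.CMFPlone.CatalogTool import CatalogTool

    # Assumption: satellite instances are started with IS_SATELLITE=1 in their
    # environment, and the read-only shared mount lives under this path.
    IS_SATELLITE = os.environ.get('IS_SATELLITE') == '1'
    SHARED_PREFIX = '/Plone/shared'

    _orig_catalog_object = CatalogTool.catalog_object

    def catalog_object(self, obj, uid=None, idxs=None, update_metadata=1,
                       pghandler=None):
        if IS_SATELLITE:
            path = uid or '/'.join(obj.getPhysicalPath())
            if path.startswith(SHARED_PREFIX):
                # Never write to catalogs for shared content on a satellite;
                # the master maintains the shared catalogs, which are mounted here.
                return
        return _orig_catalog_object(self, obj, uid, idxs, update_metadata,
                                    pghandler)

    CatalogTool.catalog_object = catalog_object

In the real project the check is reportedly per content type rather than per path, and Archetypes writing to the object on reindex still has to be handled separately.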
We haven't actually tried that. It was on the list of technologies to change to, but the list was too long. But yes, moving the search out entirely to something completely different would also be an option. I think at this point the catalog isn't really the right tool, but the code base is so vast that catalogs are used all over the place, so it would be a big rewrite to actually move everything out of there. It's probably going to be the next thing we look at, because it's becoming pretty crazy.
Yeah, definitely. One of the main reasons that we went through this whole process was because the sites were just getting really slow. And one of the main reasons they're getting really slow is anytime it goes to the catalog to do a search for anything, it's just really slow.
I'm looking into hosting Plone on Heroku, because Heroku gives you one small server with a small database for free, forever, and it runs really well at production level. I'm trying to get Plone to run well enough on that for non-profits, open source communities and small organisations. One of the problems is that on the free plan you're limited to 10,000 rows in the Postgres database, and for Plone that's a problem, because each pickle is its own row. After you reach about 100 content items you go above that 10k limit, and I'm looking for ideas on how we could decrease the number of rows that Plone uses. And yeah, we're not keeping history. They actually feel that 10k should be enough for a small site forever. If you think about it, any other CMS uses far fewer rows for 100 content items, about 100 probably. It's a bigger problem. Yeah, yeah. Another question.
Yeah, I think there are two things. The main advantage of RelStorage is also that the whole caching mechanism on the client side is completely different. In a classical ZEO setup you have the caching layers, the RAM cache and the disk cache, right? With RelStorage you still have the RAM cache, and then, in addition, you have memcached, where you can cache and share the fetched data, the pickles from the database, and it's a shared cache between all Plone instances and between all threads. So the main advantage is that you don't have to fetch data over a slow network; you have it on the machine in memcached and can just access the object. What we did is reduce the RAM cache for each thread down to fewer objects than we had before with ZEO (it depends on the setup and the hardware), so we needed less RAM, and then we put in a lot of memcached, so we can buffer everything in memcached and don't need to fetch it over a slow network connection.
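The exact settings aren't given in the talk, but here is a hedged sketch of the RelStorage cache options being described (option names as in RelStorage 1.x/2.x; the cache size, memcached address and DSN are illustrative):

    %import relstorage

    <zodb_db main>
        mount-point /
        # per-connection ZODB pickle cache (the "RAM cache"), kept deliberately small
        cache-size 30000
        <relstorage>
            # memcached layer shared between all instances and threads
            cache-servers 127.0.0.1:11211
            cache-module-name memcache
            <postgresql>
                dsn dbname='plone' user='plone' host='localhost' password='secret'
            </postgresql>
        </relstorage>
    </zodb_db>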
Any network connection is slow compared to a direct one. Why can't you do that with file storage? Because file storage caching is managed by the kernel: if you put something in file storage, you don't know whether it's in memory or not, right? Okay, but you have sync, so you don't know? No, you don't know; the system decides when it's written back. You can also share it between several services, but that makes no sense at this point. It also depends on how your setup is and how fast your I/O is on the server; we had real problems with the I/O on the virtual machine. I might talk about that tomorrow, but the point is that the caching is really an advantage, because having the stuff on the machine in memcached is better than fetching it over the network, and it's faster than accessing it in the disk cache. But you are talking to memcached? Memcached sits at the ZODB client connection level. So you have a connection pool to the database, a request comes in, and Zope gets a connection from the connection pool, right? And then you have the RAM cache on the connection, and then you have the disk cache.