
Data-as-a-service with Infinispan


Formal Metadata

Title
Data-as-a-service with Infinispan
Alternative Title
Data as a service with infinispanfull
Title of Series
Number of Parts
64
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Dealing with reliable data storage on clouds is hard. Cloud infrastructure has raised people's expectations of easy-to-manage, highly scalable and highly elastic services for serving web applications as well as providing compute power; however, when it comes to data storage, things start to become very complex. In this talk, Surtani discusses "Data-as-a-Service", along the lines of infrastructure and platform service offerings that are becoming increasingly common. What is "Data-as-a-Service"? Why is it important, and why would you want one? How could you build one? The talk then goes on to discuss Infinispan - an open source data grid platform - and how Infinispan can serve as a basic building block for a Data-as-a-Service offering.
Transcript: English(auto-generated)
Thank you for coming. Welcome to my talk. I've got these, one of these really weird slots here where it's just before lunch and I know everyone's hungry and wants to go out and get some of that pizza in the pizza van, but I'm going to actually bore you for an hour first and then you can get pizza.
Right, so my name is Manik Surtani. I'm going to be talking about data as a service. I'm also going to talk about InfiniSpan, which is a new project of mine. So hopefully it'll be very, very exciting. It won't be as boring as I said. Right, so the gist of the talk really is all about cloud; this is the cloud track, as you know.
Everybody likes cloud. Everyone thinks it's a good thing. I'm sure you do too, that's why you're here. That's why you're in the cloud track in the first place. And cloud is good. There are lots of good things about it. I mean, yes, a lot of people think it's just a buzzword and it's a marketing thing.
It's true to some degree, but there are some very good things about it. So cloud is good, but at the same time everyone also likes data and data storage. Data storage is pretty much, you know, you can't really build a serious application without a form of data storage, right? Now, okay, so we like cloud, we like data storage, but unfortunately cloud and data storage don't like each other.
And that's really what this talk is about, why cloud does not like data storage, why data storage does not like cloud, and hopefully what you can do about it. Before I get into the talk, I know everyone likes Twitter and we've got some awesome Wi-Fi here. I must say this is like the best conference Wi-Fi I've ever seen. It really works and works well.
So I'm sure you're going to be tweeting about this. If you are, please use these hashtags. The InfiniSpan community, they're very, very excited about what we do. They follow us quite closely on Twitter and I'm sure they'd appreciate it. Amazon EC2, Rackspace: this is all very, very cool stuff, right? Now, I'm not going to go into too much detail as to why cloud is good and why cloud is cool and why we all like it.
I'm sure you all know that. All the good stuff about, you know, pay-as-you-go and easy access to lots of resources and stuff like that. But in general, you've now got a virtualized operating system. That's really the starting point of cloud. That's the starting point where you can say, I'm going to use virtualization,
whether it's on EC2 or whether it's on VMware or whatever, and now this stuff is elastic. I can fire up more instances as I need them. I can scale them back down when I don't need them anymore and things like that. All very cool, right? But not enough. I mean, you still need to maintain all this other stuff on top of your virtual operating system.
You still have the same pain points of maintaining your database on top of a virtual operating system. Your middleware, your front ends, and so on and so forth. So, I mean, the evolution of cloud, the next step really is platform as a service. This is where people start to take things a few steps further. In addition to just virtualizing your operating system, we also start virtualizing your front end and your middleware.
Now, this is all very, very good stuff, right? Now, before people start saying, how come that's not virtualized, I want to talk about that. Most platform vendors, most PaaS vendors tend to just virtualize the first two things, and there's a reason why. It's actually quite easy or relatively easy to do.
Of course, there are some vendors that actually virtualize the entire stack, and that's the good thing. That's the holy grail. So if you look at things like Google App Engine, the entire stack is virtualized all the way down to your data tier. But Google App Engine is actually a little bit of an anomaly. They're a bit of an exception. Not everybody does that. But there's a reason as to why they don't do that.
It's actually quite hard to do. Where is your data actually stored? Now, I mentioned cloud doesn't like data. Data doesn't like cloud. Virtualizing data is not quite as simple as cloudifying or elasticating or virtualizing any other layer. Virtualizing Apache is easy.
I mean, you can shovel your state somewhere else, and you have stateless instances. Your middleware, you can kind of do the same thing as well. It's not so easy with databases. So let's take a step back and think why. The main reason is clouds are ephemeral, or at least you need to assume that clouds are ephemeral. What does that mean? What I mean by that is clouds are stateless,
or they've been designed with the assumption that they will be stateless. I'm going to talk about EC2 specifically, but this actually does apply to many systems. So with EC2, you've got a machine, a virtual machine that you boot up based on an image, based on a Linux image or whatever that you've got,
and you're going to get a fresh copy of that machine. Now, if that machine were to go down, any state that you might have stored on it is lost. It might come back again. You might bring it back up again. That's cool. But you actually lost anything that you've stored on it. Now, there are lots of things you can do to get around that. I mean, Amazon gives you a few other services like EBS, Elastic Block Storage.
Anyone heard of EBS? People know about it? Okay, good. So you can do that. You can store stuff in EBS. It's a semi-persistent thing. They give you something called S3 again that's also kind of persistent, but they're not quite as trivial to work with. It's not just saying, here's a machine. Here's an operating system, which I can just store stuff on it, and if it goes away, I can bring it back up.
I haven't lost anything. That's not the case. So that's how clouds tend to achieve elasticity. They do so by making this assumption that everything is going to be stateless. If EC2 decides that they're going to re-provision one of your nodes because a physical machine died, they're going to give you another one very quickly.
You may not even notice, but what you will notice is you've lost all your data on it. They're just going to boot up another copy of your image and so on and so forth. It means that you now need to store all of your state somewhere else. So why is it so hard to virtualize your data tier? So I've been talking about state a lot. That really is the big deal.
Now, front-ends and middleware also have state. Well, transient state. HTTP sessions, perhaps. Your middleware might store some stuff as well. Again, HTTP sessions and things like that. But typically, most platform as a service vendors get around the problem of state in your middleware or state in your front-end
by pushing all that state into the data store. I mean, yes, that's a very, very valid solution. It means that your middleware in your front-end is now stateless, and that's awesome. It means that if you have more requests, if your site gets slash-dotted or whatever, that's fine. You can deal with that. Just bring up a few more Apache instances. That's fine. Bring up a few more middleware instances. You're running an EC2. That's cool. You can do that.
You can bring them up very quickly. Boot them up off an image and things like that and add them to your load balancer, and you're good to go. But that's putting a hell of a lot more stress on your data storage tier, and that's what you need to think about at this stage. That's the hard bit. So I keep saying that virtualizing data is hard, but there are some public services that do exist that give you virtualized data.
Now, what are these services? I mentioned Google earlier. Google's one of them. Amazon gives you a couple of things as well. I've got RDS up there, but RDS actually is not a really scalable true virtualized data tier. Who's heard of RDS? Anyone know what it is? Okay, a few hands. So for those who don't know, the usual pattern of running a database on EC2,
I'm going to use MySQL as an example. That's what the Amazon docs use anyway. And the usual pattern is that you will boot up a MySQL instance on an EC2 image. You will attach an EBS volume onto it. An EBS volume is supposedly a persistent volume in Amazon.
So you attach that volume onto this virtual image, and then you point all your data files to that volume. So if the virtual machine goes down, you can bring it up again, point it back to that EBS volume, and you've still got your data. It's not that simple. It's actually much harder than you think. A couple of reasons. One is you can't scale this outwards.
An EBS volume can only be mounted on a single virtual machine at any given time. So you can't actually have multiple virtual machines. For example, all running MySQL, all pointing to the same data store. That doesn't work. Whether it's for high availability or even scalability, you still can't do that. Also, EBS is not guaranteed. It's not guaranteed to actually be there.
Even EBS can disappear, in which case you will lose your stuff. And again, the usual pattern that Amazon talk about is you're going to take snapshots from EBS and store it in S3. Now, S3 is the super secure, yes, this data is always going to be around sort of storage service that Amazon gives you. Now, the problem with S3, even though it's all really cool and really safe
and all of that stuff, very durable, the problem is it's slow. I mean, as you'd expect. There's always a trade-off. The API that you use to connect to S3 is either a REST API or a web service, both of which are notoriously slow for various reasons,
mainly HTTP connections and things like that. And then on top of that, you build a layer. So yeah, that can get slow. So you can use it for snapshots, that's cool. But snapshots, again, I actually hate snapshots. Snapshots are terrible, a complete false sense of security. It makes you think your data is safe, but you've still got these windows where you actually might lose stuff. The drawback with snapshots is it actually makes you go back home
and say, all right, yeah, my data is safe. I don't need to think anymore about any issues that might arise if I were to lose my environment at the wrong time. No, you've got to think about that. That's still pretty nasty. So okay, that's about it. Amazon also gives you something else called SimpleDB. Now, SimpleDB is actually a scalable data service.
I'll mention SimpleDB quite briefly here. Who's heard of a paper that Amazon once put out called, I forget the name now. Anyway, it's a very interesting paper. Dynamo, yes, Amazon Dynamo. Who's heard of Dynamo? Who's read it? It's a long and pretty complex paper, but it's very interesting. And they actually talk about a truly scalable data grid system.
And SimpleDB is actually their internal implementation of Dynamo. Now, the only problem with SimpleDB again is the API. And actually the name as well, they call it SimpleDB. That's probably the worst name they could have come up with. Has anyone used it here? You can probably confirm with me that A, it's not simple
and B, it's not a database. So calling it SimpleDB is probably the worst thing you can call it. It's essentially a web service. It's an XML-based web service where you store key-value pairs. I mean, it's horrible to use. Who the hell wants to write a web service to store key-value pairs, right? But that's the only service they actually guarantee to be properly elastic and truly durable in their environment.
There are others as well. FathomDB is very similar to RDS, except it's not run by Amazon, it's run by somebody else. Cloudant is another company; the service they're working on is an elastic version of CouchDB. MongoHQ do the same thing with MongoDB, things like that. Basically, they manage the environment, they host it,
they manage the replication, the snapshots, blah, blah, blah, and they just give you a service login where you can sit and store stuff. But these are all public services. This is all stuff where you buy a login and you use it. But that's not good for everyone, and there are many reasons why. Well, what about private clouds? Private clouds are important as well. When you talk about cloud computing,
not everything lives in the public space. Very often you want to use your own hardware. Why? Well, because you have your own hardware, for example. You already bought it, maybe. You don't want to throw it away and then use EC2 and pay for stuff again. Or maybe because of security concerns. Maybe you're concerned about storing your data on a public service
where Uncle Sam might walk in with a letter and seize all the hardware and your data with it. What about data locality? You might actually want stuff local to where your users are processing things. You might not want to fetch data from across the Atlantic or things like that. So there are lots of reasons why you want to build your own data service on your own hardware.
So how would you go about doing that? Let's start by looking at the characteristics of a data service. What are the characteristics that make a scalable data service? Firstly, you need elasticity, you need scalability, high availability, and fault tolerance. These are all things that you expect out of a cloud service.
One of the reasons why people use cloud for all the other stuff, for your front end, your middleware, and things like that, is elasticity and high availability. You want to be able to scale out on demand when you're slash dotted and things like that. So your data service needs to be able to scale out as well. You also need high availability and fault tolerance.
This is all running on commodity hardware. That's the whole point of cloud. You're going to be running on cheap commodity hardware, and cheap commodity hardware will break. It will break. Power supplies will blow up and things like that. What are you going to do? You need to have a highly available service so that you don't actually lose any quality of service from the users.
There are other optional features as well that you might want out of a data service. Things like transactional capabilities, XA, or MapReduce, and things like that. But those are all optional extras, I would say. They're not the core of a data service. So how do you normally store data traditionally?
In your traditional three-tier style app that I showed. I've actually got SQL and NoSQL on the same slides. I'm actually not categorizing them together. What I'm trying to talk about here is essentially non-distributed databases. So non-distributed SQL and non-distributed NoSQL databases. This kind of stuff you can't really use for a data service.
Let's start with databases. Traditionally you'd use something like MySQL or Postgres or something like that. You don't really have all the stuff that I spoke about. You don't have elasticity. You don't really have high availability. Yes, you can cobble it together. You can use NDB as an engine and things like that. But that's not really something that's truly reliable or tried and tested.
The other option is stuff like Oracle. Oracle RAC and things like that, which are all extremely expensive. And it doesn't actually run on commodity hardware either. You need special hardware. You need a SAN, all this other stuff. So it doesn't really work for cloud.
And in the case of NoSQL databases, the ones that are not distributed, again, you've got the same problem. If it's not distributed, you're not really going to be elastic or highly available. So in general, stuff like this won't really work to build your data service. What can you use then? What's the next step? So I'm going to talk about data grids now.
I'm just going to change gears a little bit. So essentially, when I talk about data grids, I'm going to cover both in-memory and disk-based data grids. They're two fairly different beasts in terms of how they actually store things, in terms of how they optimize for latency and things like that. But in terms of distribution or in terms of the way they deal with fault tolerance and elasticity,
they're very, very similar. They follow similar principles and similar concepts. So essentially, distributed databases are, just by the very name, inherently distributed. They're distributed by design. They're distributed by the very nature of what they are. And that means you get a lot of things for free.
Since the system is distributed, you're going to get high availability for free. There's always going to be another node somewhere, provided you have adequate numbers of nodes altogether, to take over from a node that were to die, that were to fail. And you're going to get elasticity, too. You're going to be able to add more nodes to the system, and it will scale outwards, and it will scale back down again as you shut nodes down.
If you don't need them anymore, and things like that. So you get all this stuff for free. And this is why I think distributed data grids are actually perfect for a cloud data service. It's a perfect building block. You get all the characteristics that you want from a cloud data service for free out of a data grid. And what about API?
So how do your apps actually talk to your data service? I mean, a lot of NoSQL databases use things like key-value APIs and stuff like that, or document-oriented stores, which are very good for special-purpose, very application-specific, domain-specific needs. But it's a little bit too low-level in general.
So, any Java developers here heard of JPA? Who uses JPA here? So JPA essentially, for the non-Java folks, is your standard high-level, object-oriented way of storing data, of storing objects in Java. It's an ORM. It basically maps high-level objects onto relational tables and things like that.
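To make that concrete for the non-Java folks, here's a minimal JPA-style sketch; the entity and field names are made up purely for illustration, and transaction demarcation is left out:

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;

// A hypothetical entity, purely for illustration.
@Entity
class Ticket {
    @Id Long id;
    String owner;

    protected Ticket() { }                         // JPA needs a no-arg constructor
    Ticket(Long id, String owner) { this.id = id; this.owner = owner; }
}

class JpaSketch {
    // Given an EntityManager from your container or an EntityManagerFactory,
    // and with an active transaction (demarcation omitted here):
    static void storeAndLoad(EntityManager em) {
        em.persist(new Ticket(42L, "manik"));      // store an object, no SQL in sight
        Ticket t = em.find(Ticket.class, 42L);     // load it back by primary key
        System.out.println(t.owner);
    }
}
```

The point is simply that you work with objects and an EntityManager rather than hand-written SQL.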
But leaving the ORM part of it aside, essentially this is the API that you traditionally use. In Ruby, it's kind of similar. You use ActiveRecord and things like that. You use high-level APIs. You don't actually fiddle around with SQL statements, right? You shouldn't anyway, not in a high-level app. So essentially, these are the kinds of APIs you need.
You shouldn't really be talking key value. So when you're trying to build out a data service, you should think about stuff like this. What sort of APIs are you going to offer your middleware tier? So kind of shifting gears a little bit, I'm going to introduce InfiniSpan. How's everyone doing so far? Have I put anyone to sleep yet?
All right, good. Excellent. So what is InfiniSpan? I started InfiniSpan a couple of years back. It's an open-source data grid. It's in-memory primarily. We also do spool off to disk, but it's primarily in memory. It's written in Java and Scala as well,
even though it's not just for the JVM, even though we write it in Java. I'll explain why. And there are a couple of primary usage modes. So there are two ways of interacting with InfiniSpan. One is an embedded mode. Embedded mode is if your app happens to run in a JVM. Now I'm careful to say JVM here, not just Java. I mean, you could be writing a groovy app
or a JRuby app or whatever. That works just fine. You can launch an InfiniSpan instance within your JVM, and if you launch a few copies of your app in different servers, they'll auto-discover each other. They'll form a cluster. They'll start sharing state, et cetera, et cetera. So that's embedded mode. And then there's client-server mode, where you can actually start up
or launch individual InfiniSpan nodes and connect to them remotely over a socket over a number of protocols that we support. So I'm going to give you a brief tour of some of the high-level, more notable features of InfiniSpan, and then I'm going to talk about how you can actually use InfiniSpan to build a data service. So we start with embedded mode.
InfiniSpan is primarily a peer-to-peer system. We use peer-to-peer protocols for auto-discovery, for sharing state, for group membership, things like that. Who's heard of a project called JGroups? JGroups is an open-source peer-to-peer library, and that's essentially what InfiniSpan uses. We were built on top of JGroups.
We use JGroups for discovery. We use JGroups to share data and things like that. So essentially, this is your typical embedded architecture. This is how you would use InfiniSpan in embedded mode. As I said, you'd start up a bunch of JVMs. Your app sits in the JVM. Your app will launch an InfiniSpan instance. InfiniSpan instances discover each other, start sharing data.
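As a rough sketch of that embedded setup, assuming an infinispan.xml on the classpath that enables clustering (the exact configuration format varies between InfiniSpan releases), each copy of the app does something like this:

```java
import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class EmbeddedNode {
    public static void main(String[] args) throws Exception {
        // Each copy of the app starts its own cache manager; nodes running the
        // same clustered configuration discover each other over JGroups.
        EmbeddedCacheManager cm = new DefaultCacheManager("infinispan.xml");
        Cache<String, String> cache = cm.getCache();

        cache.put("greeting", "hello from " + cm.getAddress());
        System.out.println(cache.get("greeting"));

        cm.stop();  // leave the cluster cleanly on shutdown
    }
}
```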
And that's pretty much it. It's very simple. You automatically get all of the features of InfiniSpan in your app, stuff like high availability. You can now load balance requests across your app because data is shared everywhere, like session state and things like that. A lot of frameworks, a lot of web frameworks, for example, tend to use InfiniSpan to share their transient state.
A lot of web containers, like JBoss App Server, for example, actually uses InfiniSpan as well to distribute HTTP session state or EJB session state, things like that. And this is how JBoss, like I said, achieves clustering.
What does the API look like? InfiniSpan's got a number of APIs. The primary API is a very simple map-like API. So again, if you use Java, the core InfiniSpan API actually extends java.util.Map. So if you know how to use a map in Java, you know how to use InfiniSpan. It's very, very simple. It's more than that, of course. We've got a few other higher-level APIs and richer APIs,
including asynchronous APIs, non-blocking APIs, and we're also building an upcoming JPA layer as well. So like I mentioned, I mentioned JPA earlier, so you actually have this JPA style of interacting with InfiniSpan to persist your data as though it was MySQL or as though it was some other database.
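To give a feel for those APIs, here's a hedged sketch against a Cache you already have; the lifespan overload and the async variant are part of the core map-like API, while the transactional part assumes the cache has been configured with a transaction manager:

```java
import java.util.concurrent.TimeUnit;
import javax.transaction.TransactionManager;
import org.infinispan.Cache;

public class ApiSketch {
    static void useCache(Cache<String, String> cache) throws Exception {
        // Plain java.util.Map-style access.
        cache.put("user:1", "manik");
        System.out.println(cache.get("user:1"));

        // Expiration: this entry goes away on its own after ten minutes.
        cache.put("session:abc", "some transient state", 10, TimeUnit.MINUTES);

        // Non-blocking variant: returns a future instead of blocking the caller.
        cache.putAsync("user:2", "someone else");

        // If (and only if) the cache is configured as transactional, you can
        // demarcate work through the standard JTA TransactionManager.
        TransactionManager tm = cache.getAdvancedCache().getTransactionManager();
        tm.begin();
        try {
            cache.put("account:1", "100");
            cache.put("account:2", "200");
            tm.commit();
        } catch (Exception e) {
            tm.rollback();
        }
    }
}
```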
There are also other high-level APIs being discussed in the community, including ActiveRecord. Who's heard of a project called TorqueBox? Any JRuby people here? Any Ruby people here? So TorqueBox essentially is a Ruby on Rails implementation on JRuby running on top of JBoss. It's actually extremely popular in the cloud community.
So who uses Engine Yard? Anyone use Engine Yard here? Anyone heard of Engine Yard? So Engine Yard's a hosted Ruby on Rails cloud environment where you can host your Ruby on Rails applications. It actually uses JRuby underneath. It actually uses TorqueBox underneath. The benefit of using TorqueBox
as opposed to Ruby on Rails on its own is that you get all the benefits of a Java EE app server underneath Ruby on Rails, which particularly includes clustering, high availability, and so on and so forth. What else do you have in InfiniSpan? So again, like I said, a bunch of high-level features. InfiniSpan distributes data across its nodes
using a consistent hash-based distribution algorithm. The reason why we do this is because consistent hashing gives us a very fast and deterministic way of locating data in a cluster without needing to maintain lots of metadata and without having to broadcast calls around the cluster to try and locate stuff. Everything happens in a single machine.
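This isn't InfiniSpan's actual implementation, just a toy sketch of the idea behind consistent hashing: every node can compute the owner of a key locally, and adding or removing a node only remaps a small slice of the keys.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring, purely illustrative.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // Real implementations place several virtual points per node to smooth
        // out the distribution; one point per node is enough to show the idea.
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    public String ownerOf(Object key) {
        // Walk clockwise from the key's position to the first node on the ring.
        int h = hash(key);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(Object o) {
        // Spread the hash bits a little; a real grid would use a stronger function.
        return Integer.rotateLeft(o.hashCode() * 0x9E3779B9, 13) & 0x7FFFFFFF;
    }
}
```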
You can locate stuff very quickly in a deterministic fashion. And as a result, the system is self-healing. You don't have a single point of failure, so on and so forth. InfiniSpan's also highly concurrent. We use MVCC-based locking on a per-node basis, which means that you've got concurrent reads and writes and things like that. Again, very, very performant,
especially in massively multi-core environments. Persistence. I mentioned that InfiniSpan also does persist to disk. It's not just in memory. Although primarily it stores stuff in memory, it also does write through to disk as well. Now, when I say disk, we've got an interface called the cache store interface.
We've got a few implementations that we ship with. Some of the implementations of the cache store are just file system-based drivers. Some of them write to Berkeley DB. You can actually write back to a database as well using a JDBC driver. We also have pluggable drivers for Amazon S3, for example, to actually persist onto something like S3.
And of course you can write your own as well. Eviction and expiry. If our primary data store is in memory, you're gonna have a problem because if you keep putting stuff in memory at some point, your JVM's gonna crash with an out-of-memory exception. You are gonna run out of memory, so you do need some form of eviction and expiry
to be able to say, okay, I need to take stuff out of memory and put it onto disk, onto my cache store. Essentially paging that happens in most operating systems. So we've got a couple of rather interesting algorithms, very interesting adaptive algorithms. Some of this stuff has literally only just come out of university research like two years ago. We're one of the first implementations.
If you want, I can talk about it in more detail later on. But yeah, go and check it out. Very, very interesting stuff. XA transactions. Yes, we support XA transactions. And this is actually very interesting because a lot of NoSQL databases try and not support XA transactions or they don't support XA transactions
as a trade-off for performance to be able to scale. Now we actually think we can scale while still supporting XA transactions because guess what? Most business use cases need transactions. They tend to be transactional. So a lot of people want it. We do do XA transactions. If anyone's ever heard of a group called Arjuna,
this was a research group out of the University of Newcastle which then spun off into its own company. And they used to do XA transactional engines probably well before Java, 20 years ago. And they're probably one of the most mature transactional implementations around. We work very closely with them. They actually help us implement XA transactions
within InfiniSpan and so on and so forth. And in addition to that, we're also researching new ways of providing consistency or providing strong transactional-like consistency and making it perform even better. One particular idea we're working on at the moment is atomic broadcasts. If anyone's familiar with atomic broadcasts,
there's lots of very interesting stuff around that. We're working with a couple of universities, one in Portugal and one in Italy and working very closely with them on it. So that's one to watch out for. That'll be hitting a release soon at some point. What else do we do? Yep, we do do MapReduce as well. It's currently in a pre-release state. It's in an alpha.
It's alpha quality. But do download it. Try it out. We've got a very interesting API in my opinion. A lot of people talk about how complex MapReduce is in the Java world. I mean, I know it's not very complex in the non-Java world because you've got interesting things like closures and dynamically-typed languages where you can actually make MapReduce quite nice.
But in a strongly-typed and rather clunky environment like Java, MapReduce can be very complex. If you look at Hadoop's MapReduce implementation, has anyone tried to implement a MapReduce task in Hadoop? No? A few there? All right. I mean, it is complex. It's non-trivial. And we're hoping that our API is actually very, very simple. We've taken lots of cues from dynamically-typed languages
and things like that. So, yeah. And yes, we also do support querying. We do support indexing and querying stuff that you store in InfiniSpan. We use Lucene as a querying engine and a querying API as well. So that's embedded mode so far. Now, client-server mode kind of builds
on top of embedded mode. In terms of architecture, this is kind of what it looks like. If you want to start up a bunch of InfiniSpan instances in client-server mode, each one starts up in its own JVM. That's what that thing is. And each one opens up a socket and starts listening on it. Now, in this setup, your application does not share
the same JVM as InfiniSpan. It sits outside. In fact, it may not sit in a JVM at all. Your app might not be a Java-based application at all. As long as it can talk one of these protocols over the wire to the InfiniSpan cluster, you can use InfiniSpan. Now, we support a number of protocols. Firstly, we support REST. Why do we support REST?
Because it's very popular in cloud environments. It's very easy to manage. It's a very useful protocol to support. Plus, it's very easy to build and very easy to implement. So we support REST. We support Memcached. Memcached, for those of you who don't know it, is a single-VM, single-server daemon process, and it's very, very popular. It's ubiquitous.
Most Linux distributions ship with it. But what's also very interesting about Memcached is there is a client library for Memcached for pretty much any language or any platform on the planet. And this means that since we support the Memcached protocol, it means that pretty much any platform or any language can use InfiniSpan as well.
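As an illustration of how little a client needs, here's roughly what talking to the REST endpoint looks like from plain Java; any HTTP client in any language would do, and the exact context path (assumed here to be /infinispan/rest/{cache}/{key}) depends on how you deploy the REST WAR:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumed URL layout: http://host:port/<context>/rest/<cache>/<key>
        URL url = new URL("http://localhost:8080/infinispan/rest/default/user:1");

        // Store a value with an HTTP PUT.
        HttpURLConnection put = (HttpURLConnection) url.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "text/plain");
        OutputStream out = put.getOutputStream();
        out.write("manik".getBytes("UTF-8"));
        out.close();
        System.out.println("PUT -> " + put.getResponseCode());

        // Read it back with an HTTP GET.
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        try (InputStream in = get.getInputStream()) {
            System.out.println("GET -> " + new String(in.readAllBytes(), "UTF-8"));
        }
    }
}
```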
And then there's HotRod as well. So you're probably thinking, what is HotRod? HotRod's essentially a wire protocol that we started building for InfiniSpan specifically. It's an extension of Memcached. It kind of adds a few extra things on top of Memcached. But, I mean, let me talk about what the big deal is
or why did we do it in the first place. So Memcached is a very simple protocol. It's a very simple protocol for storing and retrieving data. But we found a couple of shortcomings in Memcached. Firstly, it's text-based. It's not binary-based. I know there's a binary variant of the Memcached protocol, but that's not really well supported.
It doesn't have the same rich number of clients out there and things like that. But specifically, the reasons we found the Memcached protocol falling short is because it's a one-way protocol. It's a one-way protocol where clients talk to servers. Clients talk to servers and get results. That's it. There's no way for a server to talk to a client. Now, why would a server want to talk to a client?
A couple of reasons, right? If you want to maintain a dynamic list of which servers are available, that's pretty useful if servers can send this information back to clients. Another very interesting reason is built-in high availability and failover. You can actually build this into clients as well if clients knew where your servers were and when that changed.
And the last part is, like I said, InfiniSpan uses a consistent hash-based algorithm to store data. If the clients could be made aware of what algorithm is in force and of your server topology at the time, they can actually direct a request to the actual node which has the data rather than having to have that node hop somewhere else
to find your data for you. So you can optimize a lot of things if you had a two-way protocol. So that really is what HotRod is. It's essentially a two-way protocol where you can do smart routing, where clients can be built in a very intelligent fashion. So here's a quick comparison of the different endpoints, the different protocols that InfiniSpan supports in client server mode.
So REST and Memcached, they're both text protocols, as I said. HotRod's binary. In terms of client libraries, you don't need a client library for REST. You just use an HTTP client. Memcached, there are lots and lots of client libraries out there. The big drawback with HotRod is that currently we only have a Java client.
It's essentially a reference implementation client that we have built. I know that someone is building a Python client. I have also heard of a group building a .NET client. We are hoping to see more of these come out in the community. They're all clustered, all the same, because the backend is clustered, because your InfiniSpan nodes are clustered, unlike Memcached itself where the backend is not clustered.
If you lose your Memcached node, you've lost your cache. But this isn't a cache. It's more than a cache. In terms of smart routing, I just explained, HotRod supports smart routing. Memcached and REST don't. But you still can build in failover and load balancing across any of these protocols. You just need to use different techniques.
So in terms of REST, if you were to use the REST endpoint, it's quite easy to build HA and load balancing. You just use any HTTP load balancer, like mod_jk or mod_cluster for Apache, or a hardware load balancer, which can be used as well. Memcached is a little bit limiting there.
So most Memcached libraries, client libraries, allow you to provide multiple endpoints or IP addresses of multiple Memcached servers, and it will try and load balance and failover across them. But the only problem is this predefined server list is static. So if some of these servers were to go away and some new ones were to be introduced, you've got to either restart your client or reconfigure your client or something, right?
So that's not always possible. Whereas with HotRod, that's all dynamic, because HotRod's got a dynamic view of what's happening on your server side. So just to kind of sum things up, what is InfiniSpan, right? Is it a data grid? It's got all the characteristics of a data grid.
It's in memory, it's peer-to-peer, it's distributed, low latency, because primarily stored in memory, and it's a key-value store, right? So what is it? Or is it a NoSQL database? Because we do persistence as well, and we do MapReduce as well. Or is it something else entirely?
Because you also have querying support and transactional support, stuff that even NoSQL databases don't do, or many of them don't do. So what is it? In reality, it's all of these things. That's what InfiniSpan is. It's a highly scalable data store that can be used as a database replacement, that can be used as a NoSQL database, and it will still support high-level things
like querying and transactions. You don't have to compromise your programming model or what you do on your app tier to fit in a scalable database. So why is InfiniSpan sexy? I actually don't know who that is. Just a caveat there.
I'll give you six reasons for that. Firstly, it scales horizontally. It scales outwards and back in again. Very important stuff. It's elastic in both directions. Fast and low latency access to data. Primarily, stuff is stored in memory. It makes it very, very fast to access things. It addresses a very large heap.
So if you've got a bunch of JVMs on a bunch of nodes and each one's got two gigs of heap, it gives you the aggregate view of the entire thing. And in Java, that's pretty cool, to actually have a single data structure that looks like it can store 100 gigs of stuff, even though it's just striped across multiple nodes. It's cloud-friendly. It's cloud-friendly. It runs in EC2. It runs on Rackspace and things like that.
It'll run on a private cloud. It deals with ephemeral nodes, because like I said, it is distributed by design, and that helps you deal with things like that. And it can be consumed by any platform. It's not just for Java. It's not just for the JVM. And of course, most importantly, it's free and it doesn't suck.
So as I promised, I want to talk about how you actually build a data service with InfiniSpan now. So let's revisit this little diagram over there in the corner where we had the various parts of your three-tier architecture turning cloudy. And let's try and make that data a bit cloudy as well. How would you do that? One solution is to actually replace that single node
with a bunch of InfiniSpan nodes, perhaps sitting in EC2 or whatever virtualized hardware that you have. Like I said, because they're distributed, you're going to get all the elasticity and the high availability on that tier. Your middleware tier would talk to it using one of the three client-server endpoints. You could use all three as well.
You don't have to stick to just one. You can actually have each of those nodes listening on all three of those protocols as well, which is an interesting setup if you have heterogeneous middleware doing different things written on different platforms and stuff. And there you have it. So that's how you end up achieving elasticity, high availability and scalability in your data tier.
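For a JVM-based middleware tier, for example, using the Java HotRod client might look roughly like this; the way you pass the bootstrap server list has changed between releases, so treat this as a sketch rather than the definitive API:

```java
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class MiddlewareSketch {
    public static void main(String[] args) {
        // Point the client at one or more HotRod endpoints; topology updates
        // arrive from the servers, so this list only needs to bootstrap it.
        RemoteCacheManager rcm = new RemoteCacheManager("node1:11222;node2:11222");
        RemoteCache<String, String> cache = rcm.getCache();

        cache.put("order:42", "pending");
        System.out.println(cache.get("order:42"));

        rcm.stop();
    }
}
```

The client learns the actual cluster topology from the servers afterwards, which is what makes the smart routing described earlier possible.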
And this is actually true of pretty much any data grid that would support the features that I spoke about. It's not just InfiniSpan. So how'd you actually start an InfiniSpan server? Well, step one is actually not there. So download the distribution. I'm assuming you've done that.
You've got your startup script. You provide the protocol that you want to use. There's a bunch of other tuning options as well. I mean, stuff to tune the sockets you're using and things like that. And you pass in the InfiniSpan configuration. That XML file actually configures and tunes the in-memory node that the server endpoint uses.
You get to tune things like what sort of JTA characteristics you want, what sort of transactional characteristics you want, what sort of locking characteristics you want and things like that. The REST endpoint is slightly different. The InfiniSpan distribution comes with a WAR file, which is a Java web application. And you deploy the InfiniSpan WAR file
in your favorite servlet container and that exposes an HTTP endpoint for REST. If you don't have a servlet container, I mean, there are lots of open source ones. There's some very good ones I can recommend. And essentially you use your web container to tune things like security or the number of threads in your socket pooling
and stuff like that. So where are we headed? What's the future for InfiniSpan? That's not my motorcycle either. Just another caveat. So our first release, we released it in February 2010, February last year.
It was version 4. For those of you who are wondering why this first release is called version 4: I was kind of bored of everyone releasing the first version as version 1, right? Let's do something different. There's an FAQ on the website if you're interested. Go and have a look. Version 4, codenamed Starobrno, named after the beer we were drinking
when we were actually coding it, comes with all these features: the map-like API; the async API, which I didn't talk about, that's a non-blocking API, and so it's pretty cool; consistent hash-based distribution; write-through and write-behind to disk, all the storage stuff; eviction and expiration; some management tooling with JMX to actually monitor the nodes, see what's going on, things like that.
A REST API and all of that. 4.1, by this stage, we had progressed to a different beer. Yeah, it was Radegast this time. This was released in August 2010. Other very cool things like deadlock detection and stuff for distributed transactions.
All the client-server stuff came out in this release. Lucene directory implementations and stuff. So if you use Lucene for indexing stuff, you can actually distribute your indexes using InfiniSpan, stuff like that. What are we doing right now? I'm working very hard on 5.0. 5.0 is going to have some other cool stuff, including the JPA-like API,
the distributed code execution, all the MapReduce stuff. So I mentioned that's already out there in alpha quality at the moment. The goal is to actually try and have this as a final release by the summer. And of course, 5.1 and beyond, what are we doing over there is dynamic provisioning. This stuff is all very cool. Some of the stuff's pie in the sky.
We're just playing around with a few ideas here. Dynamic provisioning is quite interesting, where you can actually plug in some rules, like SLAs. I want to make sure if I, I don't know, hit 80% capacity, fire up some new nodes automatically and start distributing stuff in sharing state, or if I drop below 30% capacity, shut down a few nodes. I don't need so many nodes and stuff like that. So that'll be all very interesting to watch.
Complex event processing as well. This is stuff that we want to do to add event notifications and things like that. So it could be quite cool to watch. So to sum things up, essentially, I was talking about clouds are mainstream. They're here, blah, blah, blah. We like them.
We like data. They don't like each other. And about how elastic data is a hard problem to solve, how it's hard to achieve, and how data grids can help as well. I talked about InfiniSpan as an open source data grid, a viable solution as a data service, both for public and private cloud as well.
And with that, I believe I've got a few minutes for questions. There's a bunch of URLs up there if you want more information, including the project page, the project blog, the project on GitHub. Do contribute code. We like that. It's all good, all good fun. We're on IRC. That's our IRC channel on Freenode. And yeah, with that, any questions?
Yes. You mentioned that because it's running on peer-to-peer, you don't have any single point of failure. Then later you mentioned that you're doing XA transactions with two-phase commit, which actually relies on a single transaction manager that has to survive the whole transaction. So what happens when the transaction manager is a single point of failure? So the question is, we say we're peer-to-peer based
and we don't have a single point of failure. But at the same time, we participate in XA transactions. And XA transactions means that if your transaction manager fails, does that mean that you've lost your transaction? Is that your question, right? Yeah. Essentially, when we deal with external systems
like a transaction manager, that's not internal to InfiniSpan. That's an external thing. If your transaction broker dies, that's a problem of the transaction broker. But we do have a solution for that as well. So I mentioned Arjuna earlier. Arjuna's kind of been rebranded and renamed. It's now JBoss Transactions, the JBoss Transaction Manager,
which in itself is distributed and highly available as well. So as a recommendation, if you want true high availability, we'd recommend using JBoss TS as your transaction manager, because that in itself has got distribution capabilities and doesn't have a single point of failure either. So that's kind of how we achieve that. Yep.
Does InfiniSpan use data locking on multiple nodes, for example, during transactions? I'm sorry, could you repeat that? Does InfiniSpan use the data locking on multiple nodes? Yes, it does. So the question was, does InfiniSpan do data locking on multiple nodes?
Yes, but we have a very optimistic scheme around that, where we actually only acquire remote locks during the prepare and commit phases, not through the entire lifecycle of the transaction. Although that's, again, something you can configure; you can use eager locking as well. Eager locking will, of course, mean you've got less overall concurrency in the overall system, but it gives you more safety.
You have fewer transactions potentially rolling back. And if not? The transaction would fail. If you can't acquire all the locks you need, your transaction will fail. Yeah. What's the overhead of running Lucene over InfiniSpan?
So the question is, what's the overhead of using Lucene over InfiniSpan? Are you referring to using Lucene to index stuff in InfiniSpan? Well, there is an overhead, yes. What's the overhead? As in, a number?
Yeah, it's something you need to try out. It depends on how complex your objects are, the effort involved in indexing those objects, and of course, most importantly, how you tune and set up Lucene as well. So, yeah. How does that scale, given you're using JGroups?
The question is, so we're using JGroups, how does that scale, essentially, right? Specifically with regards to multicast. No, we don't actually use the multicast part of JGroups. JGroups is actually very, very tunable and configurable, and you can actually tune it to the nth degree,
and we only use a certain subset of JGroups for a few very specific things, specifically, primarily for discovery and things like that, not so much for the actual communications. For the actual communications, we tend to use point-to-point TCP. Yeah. Is InfiniSpan suitable for storing gigs of files?
The question is, is InfiniSpan suitable to store gigs of files? Are you referring to a single file of that kind of size, or a total... Many of them. Yeah, I've seen people use it to store terabytes of data as well.
Yeah, so the answer is yes. Anything else? How do you cope with security features? The question is, how do we cope with security? Security is on the roadmap. Unfortunately, we don't do very much with security at the moment. We do a little bit. Since, like I said, we do use JGroups underneath
to actually talk on the wire, JGroups has got a number of security features, and we just benefit from that. JGroups can be configured to use certificates for authentication to join a cluster. JGroups can also be configured to encrypt any traffic that's sent on the network. But in terms of security on the user-level features,
so actually accessing stuff, we don't support anything yet, but it is on the roadmap. Okay, so with that, if there are no more questions, thank you all for listening, and I won't hold you back from your lunch.