Recommendation Engines with Redis and Ruby
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 50 | |
Author | ||
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/37490 (DOI) | |
Publisher | ||
Release Date | ||
Language | ||
Producer | ||
Production Place | Miami Beach, Florida |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
Ruby Conference 20135 / 50
2
3
7
13
15
17
18
19
28
29
35
39
44
48
49
50
00:00
MultilaterationProgrammer (hardware)FreewareReal-time operating systemNoise (electronics)SpacetimeEvent horizonMixed realityCodeMeeting/Interview
01:26
Context awarenessTangentComputer networkElectronic visual displayClient (computing)ApproximationFacebookReal-time operating systemMereologyMultiplication signBitSeries (mathematics)Different (Kate Ryan album)Process (computing)Near-ringCASE <Informatik>Software developerApproximationClient (computing)Set (mathematics)Context awarenessGame theoryHidden Markov modelIdeal (ethics)Observational studyWebsiteEvent horizonPlastikkartePhysical lawInternetworkingQuicksortRWE DeaRight angleVideo gameStatisticsData miningTheory of relativityMultilaterationPattern recognitionSpacetimeGoogolMeeting/Interview
04:16
StatisticsStatisticsForcing (mathematics)Client (computing)Meeting/Interview
04:42
Polymorphism (materials science)WeightVideo gameStatisticsPhysical systemMechanism designTerm (mathematics)Game theoryClient (computing)Event horizonFacebookIntermediate value theoremProgrammer (hardware)Java appletBookmark (World Wide Web)Software developerDependent and independent variablesContent (media)Order (biology)Associative propertyCASE <Informatik>Formal languageAlgebraic closureDisk read-and-write headEndliche ModelltheorieMultiplication signMereologyDiagramMatching (graph theory)Row (database)FlickrFigurate numberSeries (mathematics)Exception handlingCartesian coordinate systemState observerMultilaterationPoint (geometry)BitPolymorphism (materials science)Single-precision floating-point formatCycle (graph theory)1 (number)Process (computing)MiniDiscWave packetWordComputer-assisted translationAlgorithmScripting languageSlide ruleDifferent (Kate Ryan album)Meeting/Interview
11:17
Event horizonAlgorithmComputer scienceEvent horizonState observerPhysical systemHand fanSquare numberGame theoryInformation managementComputer animationMeeting/Interview
12:02
Event horizonRead-only memoryComputer configurationHash functionElectronic mailing listData structureData structurePhysical systemKey (cryptography)Event horizonTask (computing)Insertion lossMultiplication signMereologyWindowComputer scienceHash functionQuicksortSet (mathematics)DemonOrder (biology)BitMacro (computer science)Logical constantComputer configurationDifferent (Kate Ryan album)Run time (program lifecycle phase)AlgorithmData storage deviceElectronic mailing listCASE <Informatik>1 (number)Natural numberReading (process)Process (computing)Type theoryMiniDiscThumbnailSquare numberSemiconductor memoryLoop (music)Client (computing)String (computer science)Point (geometry)Group actionMetropolitan area networkState observerServer (computing)Meeting/Interview
15:55
Group actionEvent horizonStrategy gameAbstractionComputer iconoutputData storage deviceExecution unitFunction (mathematics)WeightScalar fieldField (computer science)Hash functionWave packetKey (cryptography)Electronic mailing listDeclarative programmingState observerDegree (graph theory)Form (programming)Associative propertyEvent horizonSheaf (mathematics)Function (mathematics)AlgorithmSoftware testingMultilaterationDifferent (Kate Ryan album)TwitterMultiplication signType theoryWeightCalculationHash functionGroup actionField (computer science)QuicksortInterior (topology)MereologyPhysical systemComputer clusterLevel (video gaming)Right angleSemiconductor memoryVideo gameRun time (program lifecycle phase)Slide ruleScalar fieldPoint (geometry)Social classDecision theoryMehrplatzsystemData storage deviceDivisorBitData structureAbstractionReading (process)Scaling (geometry)SpacetimeSystem callObject (grammar)Patch (Unix)Real numberNamespaceGame theoryInternet forumNeuroinformatikStructural loadHand fanWebsiteStrategy gameMeeting/Interview
23:48
Price indexContent (media)Field (computer science)Hash functionData storage deviceSubject indexingData structureNumberContent (media)Price indexHash functionLink (knot theory)Bit rateDatabasePoint (geometry)Multiplication signSet (mathematics)Query languagePattern languageDataflowDisk read-and-write headSeries (mathematics)Intrusion detection systemQueue (abstract data type)Meeting/Interview
25:06
BefehlsprozessorHash functionComputer configurationFunction (mathematics)BitResultantDatabaseMobile appSeries (mathematics)Sound effectPhysical systemSoftwareCalculationStatisticsKey (cryptography)MereologyNichtlineares GleichungssystemVideo game2 (number)Price indexFunction (mathematics)Functional programmingoutputQueue (abstract data type)System callHash functionElectronic mailing list1 (number)WordStapeldateiClient (computing)Computer configurationMultiplication signDependent and independent variablesBefehlsprozessorEvent horizonScripting languageThread (computing)Semiconductor memoryRight angleStreaming mediaData storage deviceData structureCodeNatural numberTimestampField (computer science)State of matterInsertion lossFunctional (mathematics)Parity (mathematics)MultilaterationWebsite40 (number)Task (computing)Existential quantificationLetterpress printingAlgorithmInternetworkingReading (process)Single-precision floating-point formatSlide ruleJava appletStatement (computer science)Bit rateWave packetSet (mathematics)InjektivitätCASE <Informatik>Goodness of fitRow (database)Medical imagingGodDivisorMeeting/Interview
Transcript: English(auto-generated)
00:16
OK, so it's 2.0.1, I guess we better get started, because I was told that, oh right,
00:20
I have to hit this little button, that once I run out of time, this little doohickey here is gonna make lots of noise, and then they're gonna bring out the gong and it won't be pretty. So yeah, that's me. And we're, oh, so right, I'm mixed up. I'm Xavier Shay, I'm here to talk about Ruby profiling, if you were looking for seven light guy, he's in that other room. Oh wait, no, that's not right, yeah, OK.
00:42
I'm Evan Light, and we're talking about recommendation engines with Ruby and Redis, and wow, there are more people in here than I expected, OK. So very briefly about me, I created and ran this event out in Northern Virginia, called Ruby Decamp. It's a three day nerd commune in the woods for Ruby programmers.
01:00
If you haven't heard about it, there are a bunch of attendees, a bunch of people, participants here, who have been before. So, but, in a nutshell, you come out in the woods, you hack on Ruby code, you hang out with awesome programmers, you're not allowed to leave until the very end, and you've got, and the attendees decide on basically everything, and they have to do all the chores,
01:20
and that sounds like a lot of work, but it's really an awful lot of fun. Oh, and free. But you have to get a code to be able to attend. Also, I work for this little company called Rackspace. Can you guys raise your hands if you've heard of us before? That's pretty good. How many of you guys use us? Or, well, I guess I'll say currently use us. Hmm. That's not too many.
01:43
We need to work on that stuff. So I'm a, what they call a developer advocate for Rackspace. That is that I'm here for you guys. Truly. And that's why I took the job. I wanted a job where I could do more for the Ruby community, and they basically said, great, that's the kind of person we want. So if there's anything I can do to make your lives, those few of you here,
02:02
we need more, who use Rackspace, make your lives better with Rackspace, great. And for those of you who don't, if there's anything you can think of that would make you want to, yeah, we'd like to hear that, too. Let's see. So moving right along, in a nutshell, here's what we're going to talk about. This is a case study of sorts,
02:21
for which a client who's probably myself with a recommendation engine will talk about that. So we'll talk about the context, the solution that I used. I need to not look at my phone. Some Redis-related tangents, because this is really all about Redis and Ruby. And some painful lessons I learned along the way.
02:40
So the context. The client of mine, who shall remain nameless, just so that way I can be a little freer with discussion, they had a soccer, or have, I should say, a soccer social network. So imagine Facebook, but for soccer. Facebook for blah blah blah. That's pretty common in California, right? But in their case, what made them really interesting is that they have
03:00
a live feed of soccer data coming in all the time. So as games are being played, every time that there's a red card, or a yellow card, or someone scores a goal, or there's a penalty, they get a notification about it. And what they wanted to be able to do is they wanted their users to be able to see popular events, popular posts on their site. So the soccer event feed,
03:24
as it's coming in, would be automatically spewed out into the website as a series of posts. And they'd be contextualized. That is, that they would have tags associated more on that later. So they wanted the users to be able to see popular posts and relevant posts, and in near real-time.
03:42
And that in near real-time part made it a little bit more exciting. So recommendation engines. I'm sure that most of you are at least familiar with the idea, because you use this thing called Google, probably. Maybe you've heard of it. So recommendation engines are an approximation. And they are based on, obviously, large sets of data, ideally.
04:04
And in this case, we want two different kinds of recommendations. Again, we want what's popular. That's pretty straightforward. But what's relevant, and that's very subjective. So they're based on statistics, but this is me in statistics.
04:22
And this, to me, is actually what makes this talk interesting, because I built the recommendation engine being that dog. So statistics, so recommendation engines are canonically based in statistical methods. And yeah, statistical methods and I, we don't get along so great. So this is basically about
04:40
how you do it with brute force, and still get away with it. So other than being ignorant to statistical methods, quite frankly, I couldn't get the client to pay me for a day or two of research. I asked them, I said, wouldn't you like to do the right thing, rather than just something probably ugly that'll work? And they said, no, basically, we trust you, so just go build it.
05:01
But I'm telling you, it'd be better if I did a little research in advance. No, no, just go build it, okay? Because I like being paid. So why Ruby? Well, kind of the same thing there. Their developers knew Ruby. They knew JavaScript. I said, maybe we should use something faster, you know, Java,
05:23
which feels really funny to say, having been a programmer for a while. If you said Java was fast 20 years ago, I would, or if I'd said it, I'd be laughed out of the room. Nowadays, Java, fast, go fast, see fast, even, you know, JVM languages. I said closure because I like closure, but nope, they wanted Ruby.
05:40
So okay, Ruby it is, no statistical methods really, fine, I'll figure something out. So let's talk about the system a little bit. Like every social network, it has the typical nouns of users, posts, comments, you're used to those. But then we have a few new ones. We have teams, players, I forget, I think they had a match as a noun,
06:02
but really a match to me was just two teams playing an event with two teams on it. And then we had a series of verbs. So submitting a post, I'm sure you're familiar with that, except that I alluded to this a little bit earlier, posts have tags. They're taggable polymorphically.
06:20
So you could put any old thing on them, but usually you would see teams and players, and that's really all they wanted out of the recommendation engine. That's important to mention, more on that later. And it's not that important, it's really not that important, it's just a fun point later. So you can comment on a post, big surprise, again, social network, you probably didn't expect to see that. But you can tag posts, you can tag, sorry, comments with users,
06:44
kind of like you could in Facebook. It was a little bit more of a nuisance, because they didn't have a tagging mechanism per se for users like Facebook does. I just had to write something to scan the text. Not entirely relevant to the rest of the discussion, so we'll just keep on going.
07:00
Other verbs that kind of mattered a bit, favoriting teams or players. This isn't something that Facebook has, more like a four square thing where you say, I love this, I love this team, I love this player. And then liking posts, pretty typical stuff. So given a model that looks something a little like this, leaving out comments and likes and favorites for now.
07:24
Let's say you have a user, in this case he's user two, he has posted three posts. The first two posts are really the important ones, and the first two posts talk about tags one, two, and three. So say we're given this. And maybe we have something like this, but we don't initially.
07:41
Where we can say, we have this other guy, and he's interested in these tags. So those might be teams or players. And they have these scalar values associated. This user has these scalar values associated with each of these, say, teams or players. So given those things, what we want ultimately is that. We want to be able to say, this user is going to be interested in
08:04
these posts and not that post. So post three, and going back two slides, had tag four. And user one only cared about tags one, two, and three, not tag four. So he should only have a score for post one, post two, and
08:21
he shouldn't have anything for post three, cuz he just doesn't care. So we want something like that. So this part here is where the interestingness came in. When the client approached me, they said, well, we have this idea for how a recommendation engine will work. We'll just have a weight associated with each one of these events as they occur.
08:41
Well, that's all well and good, but going from the first diagram with the post and the tags to, I have this in a single step, doesn't really make any sense. I needed some kind of lens in order to figure out what the user, what content the user would actually care about. So I needed an intermediate value.
09:00
I needed some intermediate values to get a sense of what does the user care about? So moving on, we start with active record. Every good application does. Not really, but it was a Rails app, so yeah, we had active record. But really, we're talking about active record observers. So that's to say that we would capture, or
09:29
we would have some data that we would feed into something, and we'll get there in just a minute. So to reiterate, we care about two different kinds of posts. Really they're, well, posts are posts, but
09:40
we care about quantifying them in two different buckets. Popular, which is a global thing, and relevant, which is subjective to the individual user. So popularity is pretty straightforward. It can be made a little more complex than this. Popularity, if I recall, is based on comments and likes, and I forget which was worth more, cuz we would,
10:03
and that's kind of irrelevant, the point is they would have different weightings. So from a trending standpoint, a comment might be worth more than a like, or a like might be worth more than a comment. One thing that we had talked about doing, we hadn't done, would have made life a lot more interesting, is to have a notion of taste makers. And that is people who are super jazzed about a topic having their likes and
10:23
their comments being more valuable in terms of popularity than other people's. If you instantly start thinking about gaming the system when I say something like that, then you're basically reading my mind, because I kept going back to the client over and over again about that, and their response was, to have such problems. And well, I had to agree with them.
10:41
If someone games your system, then well, you're doing pretty well for someone to care enough to do it. Relevance is really where it gets a little more interesting, or a lot more interesting. So we have these verbs, or I guess these statements. Like if you go in favor of DC United, or you submit a post tagged with DC United. Let's say you like DC United, or you comment on a post that is tagged with DC
11:02
United, or the really confusing one, you're mentioned in a comment on a post tagged DC United. If your head hurts on that one, I understand it took me a while, but I have my brain around it too. So obviously, if you hadn't figured it out, there was a time in my life when I liked DC United, but I'm not really much in the sports now anyway. But moving right along.
11:22
So, relevance in this system is defined by an algorithm kind of like this. So given an arbitrary event defined by an AR observer, or essentially serialized by an AR observer. For each tag on that event, for each user interested in that tag,
11:42
go score the user's interest in that tag, or go re-score, assuming that there's interest already. So if you hadn't figured it out already, that's a big O of n squared algorithm if you're in computer science. And that's a bad, bad, bad thing. Damn, I was hoping there might have been more Pacific Rim fans in the audience, but, well.
12:04
So yeah, big O n squared algorithm, I'm up against it. So I'm thinking, this is bad, what am I gonna do with this situation? Well, how can we cheat? So it occurred to me, we're talking about soccer matches, about sports games. We're talking about wanting timely recommendations.
12:24
Why do we care about stuff that's far in the past? We shouldn't. So I went to the client and said, what if we just say, have a window of three days, and then after that, we just don't care anymore? And they said thumbs up, and I thought, great, now there's a whole lot of data I don't have to worry about. So big O n squared is bad, but n just got a whole lot smaller.
12:42
By the way, sorry, computer science parlance, big O of n squared is to say, it's a nested loop, and n is some arbitrarily large constant. It's the largest, if I think we were being concrete about it, it'd be the size of the largest data structure we'd be iterating over. So the big O is worst case runtime of this algorithm would be
13:02
looping over the longest structure in a nested fashion. And that's generally very slow, you don't wanna do that. So, we only care about recent posts, as I said a moment ago. But now we narrowed down what events we care about, we need some kind of event consumer.
13:21
So how many of you are familiar with Resque? Okay, about half, I wasn't sure what to expect, interesting. So Resque is a queuing system for processing background tasks. And it's written using this thing called Redis, which was in the talks. I assume you might be vaguely interested in this thing called Redis.
13:40
How many people know about Redis or are kinda comfortable with it? Okay, that's a little more than half, pretty good, so I'll keep it short. Again, don't tie me. So Redis is a key value store, which is to say it's a little bit like a memcache, or if you just wanna speak more Ruby parlance, it's basically like a glorified hash. Except it runs as a demon process, basically.
14:04
Its storage is in memory, but it can persist to disk, and there are a couple different persistence options that give you a little bit of flexibility about how often, how reliably it persists. And the interesting thing about Redis is it's not just a straight key value store,
14:20
it's not just a hash. Or I guess you could say it is a lot like a Ruby hash in some ways, because the value doesn't have to just be a string, the value can be some kind of data structure. And Redis supports, well, the ones listed here. So lists, so lists allow repetition, and they're sorted. Well, they're sorted based on insertion order, I should say.
14:40
A hash, as you might expect, so actually, so key value is by nature, a lot like a hash, so basically you can have hashes in your hashes. You don't necessarily wanna use those, and we'll talk about that soon. Sets, so a list where the insertion order doesn't necessarily matter, but no repetition's allowed. And sorted sets, which are pretty darn interesting, because they don't allow
15:01
repetition, and they maintain a sorting order. And so you're inserting a value and some sortable value to go with it. Well, and again, more on that later. Maybe one of the most interesting parts to me about Redis is that it supports adding a time to live, an arbitrary time to live, user definable, to any given key that you put in Redis.
15:24
Now, when I say key, I need to be very specific. Key at the macro level of key value for Redis. So if you would store a hash, a hash has a single key that refers to the whole hash. If you're storing a list or a set or a sorted set, there's one key that points to the whole thing.
15:41
So you can put a TTL on that, and what that says is, I want this value to just go away after this amount of time. That can be pretty handy. So when I mentioned that three day window earlier, the TTL is very handy there. That's just a little too big font-wise. So the AR observers were pushing events out to rescue.
16:02
And the event would look something like this. We'd be pushing JSON up. So the event would have the type, the noun, essentially, the action. I think we were only concerned with creates and occasionally deletes, but we didn't really care about updates. I offered to add that, this was a 1.0 release,
16:22
it just wasn't something that mattered that much at the time. Then we would have the idea of whatever the thing was, the user ID, because that very much matters here, since we're talking about the user's interest in things. And then the names of the tags associated. But we have all this stuff queued up, but one does not simply share the load. We have to define our workers.
16:42
So the worker that I created, I called a calculator, because I figured we're calculating a score. And the calculator originally was just one giant class, and it was awful. So TDD very quickly showed me how bad of an idea this was, as my tests grew to be more and more hard.
17:02
So then I started to break it out into three different kinds of calculators that formed a sort of workflow. And also, I've learned through more TDD suffering that I shouldn't even have my individual calculators think about persistence. Because then that made their already busy life of trying to compute things
17:21
even busier, but trying to worry about, well, where do I put the stuff when I'm done? So instead, I just had the outer level calculator act as a sort of strategy, I guess, in the object-oriented sense. And so he was the rescue worker, and he handled all the persistence, and he just directed the other guys to do work. He would call one guy, get his output, pass it on to the other, and so forth.
17:43
So persistence was handled by Redis, but I created a very simple abstraction around it, just a class. So that way, the customer could decide later, well, storing everything in memory is kind of sucking, because our Redis is costing us hundreds of dollars now, a month or more, because Redis, again, is all in memory.
18:02
And memory gets a lot more expensive when you start getting bigger and bigger chunks of RAM. So I thought at some point they might want something like, dare I say, a MongoDB, not a big fan, but something like that maybe. So I put that there. It wasn't something I really had to worry about too much while I was working with him again for a 1.0 version, but
18:22
it seemed like an easy win. So getting into the individual calculators, the training list calculator, just like my discussion about popularity earlier, this guy was really straightforward. You like something, that bumps up the score on a post. You comment on something, that bumps up the score on a post. Really dumb.
18:41
And then it outputs, so it would get the event, it would output a new score for that individual post. The way that data was stored was just as a simple key value pair in Redis. So you would have, and this was actually, I guess as a brief aside, this was a little uncommon for me. I was trying to find lots of ways to use Redis data structures. And for whatever reason, this made more sense to me as a key value pair.
19:03
As it turned out, it probably would have been better as something else, but that's in the lessons learned section. So I would munge the keys, so I could namespace the values I was storing. Cuz if I just had the post ID in Redis, then I would have a key of 42. And well, if that's the post ID, if I wanted to store anything else with that
19:23
key, well, I would overwrite whatever was there and that would suck. So I would put something in front, like say, trend for trending this. That's pretty common in key value stores, just to have long, munged names sometimes, just for name spacing purposes. So let's see, right, and the trending scores had the three-day TTL,
19:42
that I talked about earlier. The one part that I regretted here was that these values were sorted in Ruby at run time when trending this was requested. Now remember, we're only talking about three days worth of posts. And this was for a fairly new social network. So again, going back to the remark I made about gaming, I ought to have such problems where sorting in Ruby would be that painful.
20:03
But I would far rather sort and say something like C or Java, which is like 100 times faster. So that sorting wouldn't be as painful as soon, but alas. So the user interest calculator, this is where we start getting into that relevance business, deciding which users care about what.
20:21
So it would get the event, but it's important to mention that for a given event, there might be multiple users that might care about the event. And the reason for that is because you have the person who posted the original post, but then you have all the commenters, you have all the likers. So you have to aggregate all of those people together.
20:40
Because if anything else happens on this event, these people have expressed some degree of interest in the tags that are involved. I don't think I have a slide for this, so I wish I had. I'll take a brief aside to mention that every single one of those verbs had a weighting factor associated with it. I'm just computing scalars here.
21:01
I did have an AI class 20 years ago back in college, so I learned a little bit. So each one of those events would have some kind of weighting associated with it. So when we had a scalar value, we would know it was based primarily on this, a little bit of that, and a little bit less of this, and a little bit less of that. When you favored something, that's a big, large declaration to say, I love this.
21:24
When you comment on something, well, maybe I kind of like it. If you are tagged on a comment belonging to a post, okay, that's a pretty weak attachment, but that connotes some degree of interest, cuz you're associated with someone who cares about something else. So that's a very weak association, but it is some form of association.
21:44
So all of those users needed to have their interest rescored. Right, and I just mentioned arbitrarily assign the weights per event type. So that was all I had, that one bullet. So I think the aside was worth it. So this is how we get this structure, that we have a user and
22:01
we have a score for each tag based on that big, nasty, big O of N squared algorithm that I defined earlier. So internally in Redis, this is how I would store it. I would have one hash per user, and the field, so that the key would be something like UI, the user ID, user interest.
22:22
It's something that we would look up an awful lot. So having a nice short key seemed important. Having it munged is kind of essential, because again, don't wanna step on user ID values with something else later. I think Redis calls the individual keys on a hash fields.
22:40
I don't remember if I have my Redis nomenclature right. But the field names were just the tag names, and then the values were the scalar interests. And intentionally, I did not wanna put any kind of time to live on that hash, because user's interests are one thing we know are gonna live on, and on, and on, and on. Downside is the user's interests are something that will live on, and
23:00
on, and on, and on. So you know that they're just gonna take up more and more space. Tags are not something that leave the system very often, because players tend to play for a while. And even if they retire, they might get mentioned again in the social network. So I don't know that players were gonna leave the system often, teams likewise. So it just made sense to just basically leave these data structures alone and
23:22
let them grow. That said, having those in Redis, yeah, bugged me a little. But again, 1-0 system, it wasn't a big concern. So post score calculator is, I think I got mixed up there. Post score calculator is where the big, big O of n squared nastiness came in. So now we've re-scored all these users' interests.
23:43
We need to go propagate this throughout the system. And so again, back to the, excuse me, I'm so sorry. But I discovered after the fact a name for this pattern that I came upon, it's called inverted indices. An inverted index, that's a link by the way,
24:02
these will go up on GitHub at some point, this is all HTML. You'll be able to click through, you don't have to take any notes if by chance you are. An inverted index is basically just an index of the content to where the content is stored. So I had a few different sets. I would, for, let's see, a post, I would have the set of all the tags.
24:23
And let's see, and I actually have to read this cuz I don't remember off the top of my head. And then I had a set of, right, the interested user IDs by tag. And that would save me from having to go out to the database all the time to perform a whole bunch of expensive queries. I could just go to Redis and say, hey, just give me, and boom, there I go.
24:41
And the user post scores were also stored in Redis as a hash, very much like the user interest scores. It's just instead of having to tag, you had a post ID. So this structure I showed you earlier, it's a workflow, but really what it also could be is a series of queues. It's not a big truck, you don't just dump stuff on it.
25:01
Thank you, I'm so glad someone appreciated that one. So some other design considerations that came up as we went along. So I've alluded a few times, I was trying to aggressively optimize the RDBMS out of the equation. The client very much did not want the recommendation engine to make
25:22
the Rails app run slower, because, well, they're on, let's say they're on Heroku and, you know, dynos cost money. And we already talked about using inverted indices to some good effect. Again, further reducing the need for database queries. And I already talked about those examples.
25:42
Now, the other thing that I mentioned, that I broke the calculator down into a trending list calculator, and a user interest score calculator, and post score calculator. And no one's made any kind of rude gestures, but those names really suck. I'm sorry. Post score calculator, it's German, take a whole bunch of words together,
26:01
mush them together. Is it good enough? Well, it ran in production, and the customer was happy. So, yay. Was I ashamed of a lot of the code I wrote? God, yes. Would it scale? Well, I was limited by Redis, so memory, RAM.
26:22
And I knew I'd be limited by CPU. But what they were concerned with was getting a 1.0 release out the door, something that people could use right away. And if they'd be successful, well, worst case, they would rewrite it. But there was potential to refactor and to scale it further. One of the little interesting things that happened along the way is,
26:42
because the post was polymorphically taggable, you could just throw anything on it. So the engine originally didn't just care about teams and players, it just took any old tag you gave it. The client later said, yeah, I really only want the teams and players. But the interesting side effect was,
27:00
well, users would get thrown in there as tags too. So I thought maybe a side business is a sports dating site or something. Because all of a sudden it would say, hey, this is how interested I am in this person versus that person. But no, I took that out and hard coded it to just teams and players. So lessons learned along the way. Statistical methods obviously would have been nice,
27:22
because any time I have to write a big O of N squared algorithm, and I know N's gonna keep getting bigger, I get really anxious. I did not like writing this. The lesson learned for me, if I were still freelancing, but even though I'm not, I work at Rackspace, is when I know something is right, I need to argue for it just a little bit more.
27:41
Prefer straight key value over hashes. I've mentioned TTLs, and I think I mentioned TTLs, and I mentioned TTLs. You can't put a TTL on a field on a hash. So you can't say, I want this, for this user, this post score to expire sometime in the future. No, you're stuck with that guy. So there is a way around that.
28:01
That's, I think, the next slide, but that's more work. What I could have done instead is just had longer munged names. Said user ID, user ID, blah, blah, blah, post ID, blah, blah, blah. And then just had a value and put a TTL on that, and then it would just disappear in three days and life would have been better.
28:21
Extracting small, okay, so one more slide. Extracting smaller workers, when I said this is a series of queues, what I was getting at was I designed the system expecting that post score calculator would inevitably eat up more CPU than the other guys. So they're all written as rescue workers, but only the outside calculator is a rescue worker.
28:40
It would have been fairly trivial to extract the other three, give them all their each, give them each their own queue. The only thing I would have had to have done is I would have had to add persistence capability to them, and that could have been something dependency injectable, for example. And that would have been kind of simple. The other thing that would have been nice is each one of those little guys basically runs on a case statement, and that's the big giant OO scream of,
29:03
please extract me, please extract me, and well, well. Less chattiness with Redis. So I just was making individual calls to Redis. If I needed to get a set, I would just make a single call if I needed to to push a key value pair, it'd be a single call. It was enough, it worked well enough.
29:22
Individual calls on AWS from Heroku to Redis, I did actually bench it. It was something like two milliseconds. They add up, but only if you're making a ton of them. This was still way faster than the Rails app, so it really wasn't a big concern, but something to be aware of. Redis supports two different features that only one at the time when I wrote
29:42
this recommendation engine that would have helped here. A pipelining, which allows you to just batch up commands. You get futures back, which is basically saying, here, this is where your result will go later, and then all the results come back. And you just access the futures to get the results. So you send one big request with all of your different commands, and
30:02
then you get a bunch of little responses back when they're ready. And that will result in less network chattiness, which means your app will run faster. The second one, and this one is very dangerous, because Redis is an evented key value store I didn't mention, which means there's one thread.
30:20
You can script inside of Redis. If you script badly inside of Redis, you might occupy that one thread for a while, and when you send commands to Redis, it might say, sorry, I'm busy right now. So that would be bad, like crossing the streams in Ghostbusters. So pruning, when I mentioned not wanting to use hashes so
30:40
much in Redis, if you are gonna use something that's just gonna grow and grow and grow, and nothing ever expires, and you want things to be removed eventually, one option is to put a timestamp in every value that you're gonna put in that hash. So instead of just putting straight values into a hash of just integers, for example, you just put JSON in, like we do with rescue.
31:02
And that might have a timestamp and then the value. And then what that means is periodically, although I've heard that the best practice is maybe every time you do an insertion into a structure, to also go through and prune that structure and look for things that you can remove that have outlived their usefulness. But I mentioned earlier, still, better to prefer a key value where you can
31:23
just say TTL than to go and have to deal with pruning. Pruning is more work. I told the client that was something I was concerned about for later. Again, for 1.0, they didn't care, so I could wash my hands of it and walk away, but still, I didn't like knowing memory would just grow. I used to code in C and Java, and you don't like leaking memory.
31:43
So one other thing I realized actually just today was the calculator, because it was stateless, could have benefited a lot from a functional programming style using what's called referential transparency. And really, that's just a fancy way of saying the output from one function is the input to the next function. And you just accumulate state by taking that output from one function,
32:04
passing that as your input to the next, along with what other stuff you need to. And just keep accumulating in your final outputs what you care about. That might have been pretty nice to do. It might have made the code a little bit readable, because the imperative style can be a bit hard to follow sometimes. I know, as I said, I wasn't thrilled with the end result of the code, and
32:22
I tried really hard to make it readable. But the imperative style didn't look too good in the calculator. And the final lesson learned is use something faster than Ruby. So, but that was really just a joke, because Ruby was actually adequate to the task, it wasn't a problem. So this is the part where I get to say I've got seven minutes and 41 seconds.
32:42
Are there any questions, heckling, or other statements, remarks, something? No? Okay, cool. Well, three minutes left, so thanks very much.