DATA DUPLICATION VILLAGE - Facts, figures, fun from managing 100,000 hard drives

Video in TIB AV-Portal: DATA DUPLICATION VILLAGE - Facts, figures, fun from managing 100,000 hard drives

Formal Metadata

DATA DUPLICATION VILLAGE - Facts, figures, fun from managing 100,000 hard drives
Alternative Title
Facts figures, fun from managing 100000 HDDs
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
For the last five years Backblaze has collected daily operational data from the hard drives in our data centers. This includes daily SMART statistics from over 100,000 hard drives totaling over 500 Petabytes of storage. We’ll start by looking at the lifetime statistics for all the hard drives we have ever used, split out by size and manufacturer. Then we’ll compare the failure rates of consumer versus enterprise drives and we’ll also compare helium-filled versus air-filled drives. We’ll finish up with looking at a handful of SMART attributes to see how temperature relates to hard drive failure and whether or not you can use SMART stats to predict hard drive failure. As a bonus, we’ll show you where to get the data so you can do your own analysis – enjoy.
Thank you all for coming in, I appreciate it. Standing room only is awesome; we're going to have to have a bigger venue next year. Everybody's having conversations, so if you could keep it down for our talk, please. Thank you. I have the pleasure of introducing the talks at the Data Duplication Village. This is the first time we've had talks, this year, and I love seeing your response. Thank you all for coming out. I also have the pleasure of introducing this individual, Andy Klein from Backblaze. We have been paying close attention to the reports that he has been generating for the past several years. About four years now? Yeah, four years. On a quarterly basis he's been generating reports on the hard drives and putting them out for the community to take advantage of, especially us, so we can see what happens in the way of failures and which drives to look for. So without further ado, Andy. Thank you very much, welcome. Thank you, Scott, much appreciated. All right, so they did let me in here with a title of marketing, and I don't know how that happened; I snuck in under the radar. A long time ago I actually coded for a living, and I played systems administrator for a while and all of that. Then I crossed over to the dark side and became a marketing person. The good part is I'm hard to bluff from a technical point of view, but I'm still a marketing person, so take everything I say with a grain of salt. All right, I have to do this: we're going to talk today a little bit about our environment, so all about drives and stuff like that, and how we measure a failure, because we do that. I'll walk you through some of the stats we have, and then we'll do some fun stuff like looking at enterprise drives versus consumer drives and helium drives versus air-filled drives, and a little bit about the idea of whether you can actually predict failure on a hard drive. If you have enough statistics and enough numbers you can predict anything. And we'll finish up with
temperature. Just so you know, for those of you who know Backblaze, this was the original Storage Pod we built. For those in the back, that's plywood. That's how we prototyped the first one. You can see it up on the rack over there, in with that lovely little piece of Dell equipment, which cost nine times as much, by the way. So we built that, and then we changed to those pretty red ones that we'll talk about in a few minutes. So, our environment. It's important that you understand our environment just a little bit, because people look at the data sometimes and they go, "That's not how my system is; I have two drives and one failed, and you guys don't know what you're talking about." Okay, this is our environment: it's a data center, with sixty drives in a chassis. There are now systems with up to a hundred drives in a 4U chassis, but that's how we do it. Then we logically group 20 of those chassis together into something called a vault. When a file comes in, it gets sharded across those 20 different chassis. We created our own erasure-coding mechanism, and we open-sourced it, by the way, so go to GitHub and look it up if you want to steal it, excuse me, borrow it, excuse me, make it better. It's out there for you guys to look at. It's a 17-and-3 encoding mechanism, so you can lose up to three drives, or whole systems, before you lose any data, and long before that ever happens we're way ahead of it. But drives fail, and this is
the kind of mechanism you have to build when you want to scale the system, because drives do fail, and you'll see a little bit later how many. Right now we're storing about 600 petabytes of data, so just slightly more than came in here in the last few days. But I am amazed; by the way, these guys deserve a hell of a hand. They do an amazing job to get all of this stuff replicated for you guys, or for anyone who's doing that. So I think the first thing we do is give them a hand, because it is hard to maintain all of this. And they have the little failed drives over there to keep track; it's kind of fun. We have more than a hundred thousand drives in operation right now, and we stick them in those wonderful red chassis. That's us. Now, you've probably seen this shot on the internet, because everybody loves to say, "Look at a data center," and it's all pretty in red. Little story: the reason they're red is that when we did the very first one, the guy who built it called us and said, "What color would you like them?", because they were just metal. We said, "I don't know," and he said, "I have some red." Literally, it's that simple, and we said okay. There was no guy down there going, "Hey, give me PMS color 127." No, it didn't happen. He had red, so we've stuck with it, and it's worked out really well. It's pretty cool.
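To make the vault arithmetic from the talk concrete, here is a minimal sketch. It assumes that "17-3" means 17 data shards plus 3 parity shards, with one shard on each of the 20 chassis; the function names are made up for illustration and this is not Backblaze's open-sourced code.

```python
# Assumption: "17-3" means 17 data shards plus 3 parity shards,
# with one shard stored on each of the 20 chassis in a vault.
DATA_SHARDS = 17
PARITY_SHARDS = 3
TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS  # 20, one per chassis


def can_recover(lost_shards: int) -> bool:
    """A file stays readable while at least DATA_SHARDS shards survive."""
    return TOTAL_SHARDS - lost_shards >= DATA_SHARDS


def storage_overhead() -> float:
    """Raw bytes written per byte of user data."""
    return TOTAL_SHARDS / DATA_SHARDS


def usable_tb(raw_tb: float) -> float:
    """Usable capacity left after parity for a given raw capacity."""
    return raw_tb * DATA_SHARDS / TOTAL_SHARDS


if __name__ == "__main__":
    for lost in range(5):
        state = "recoverable" if can_recover(lost) else "data loss"
        print(f"{lost} shards lost -> {state}")
    print(f"overhead: {storage_overhead():.3f}x")
```

Under that assumption, losing a fourth shard of the same file is the first point at which data is at risk, which is why failed drives get replaced long before then.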
Collecting hard drive data: we use the smartmontools package; many of you have heard of that. The data we collect is the SMART stats off of that, and I'll show you exactly what we get in a few minutes. We collect it
once a day. We actually scan the drives multiple times a day, and there's a reason we're scanning: we're looking for things that are going wrong. But we keep a copy of the data for each drive for each day, and we've been doing that since April of 2013. So that's the data set that's out there. The data is public; we publish it on our website, and the URL is there for you. You can download it; I think it's over a hundred gigabytes of data these days. We explain how it's laid out so you can go in and look at it, and we even give you some SQL files to play with if you want to go and do your own thing. So if you're ever curious and you've got nothing to do for a couple of days, you can download some of this stuff. What do we have here? I thought I fixed this slide. Oh well. Date, serial number,
model: this is for every drive, once a day. Capacity, you can see that. And then we carry the SMART stats. SMART stats carry a normalized and a raw value for each of the statistics that are out there. There are currently 255 stats you could get, and not all of them are used. As a matter of fact, only roughly half are used, that we're aware of, by the drives that we have. We collect all of that and we store all of that. So if you want to know what the SMART raw value was for that drive on Thursday, June 13th, 2017, we have that; it's in that data set. I don't know why you would want to know that, but yeah.
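The record layout described above (one row per drive per day, with a date, serial number, model, capacity, a failure flag, and a normalized and raw value per SMART stat) can be queried with nothing more than the standard library. This is a sketch only: the exact column names below are assumptions about the published CSV headers, and the two sample rows are synthetic, not real drive data.

```python
import csv
import io

# Synthetic sample rows. Column names are assumed, not confirmed by the talk.
SAMPLE_CSV = """date,serial_number,model,capacity_bytes,failure,smart_9_normalized,smart_9_raw
2017-06-13,EXAMPLE0001,EXAMPLE MODEL A,8001563222016,0,90,8760
2017-06-13,EXAMPLE0002,EXAMPLE MODEL B,4000787030016,1,75,21900
"""


def load_rows(text: str) -> list:
    """Parse the daily snapshot text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def raw_value(rows, serial: str, date: str, attribute: int):
    """Return the raw SMART value for one drive on one day, or None if absent."""
    column = f"smart_{attribute}_raw"
    for row in rows:
        if row["serial_number"] == serial and row["date"] == date:
            value = row.get(column, "")
            return int(value) if value else None
    return None


def failures_on(rows, date: str) -> list:
    """Serial numbers whose failure flag is 1 on the given day."""
    return [r["serial_number"] for r in rows
            if r["date"] == date and r["failure"] == "1"]


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print(raw_value(rows, "EXAMPLE0001", "2017-06-13", 9))
    print(failures_on(rows, "2017-06-13"))
```

For the full data set, which the talk puts at over a hundred gigabytes, loading everything into a list would not be practical; streaming the rows or using the SQL files mentioned above would be the more realistic route.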
An on button? Okay, I think that's a little bit better; I think I'll stay about this far away from it, too. All right. Now, the most important part, from the point of view of what we're talking about today, is the failure value. You can see there's a failure value in that record, and that means that drive failed on that day. The way we do it is we scan for drives, and when a drive comes up and it's not there, we go and try to figure out why. It either failed or, a lot of times, they took the system down for some reason, perhaps a data migration or whatever the case may be. But we do get failures as well, and so the only thing that gets marked there is a one for failure. A drive can disappear
from the data because, again, it's being migrated, or maybe we took a pod out for maintenance purposes or something like that. Okay, so what is a failure? This
is the part that's really important. The first two are easy: it didn't spin up, you can't see it, the RAID array won't do anything with it, whatever the case may be. That third one drives people nuts, because it's our educated guess from the SMART stats values that we see and other things that are happening with that drive. So if it's throwing errors and we see it happening consistently, we pull it out and we mark it as a failure. Well, also, by the way, before we mark it as a failure we run it through two levels of testing on the back side, which are basically nothing more than a quick reformat and then a long-term format, which beats the living tar out of it, that we have from one of the manufacturers, who shall go nameless. If the drive passes both of those, it gets put back into service. So we're really confident that a drive has failed by the time one of those three things has happened to it, and, like I said, we don't use it. Now, for those of you who are familiar with our drive stats, we have been recording the data, and the data you're seeing is through the end of June. I could go in and pull it through today, but I actually pulled it at the beginning of August and there was no real significant difference, so I didn't bother updating all of these slides. You can see the kinds of things: we've had 132,000 drives in play, yada yada. The thing that's important is the failure rate. Everybody looks at the failure rate; they go, "That's the number I care about, I don't care about all of these other numbers." There are two basic ways to compute a failure rate, and one of them is wrong; let's see if you can figure out which one. Some people do it this way: number of failures divided by drives, times 100. Nice, easy formula, and given the data I just showed you, the failure rate for all of those drives is five
point six two percent. Some people do it this way: failures divided by what's called drive days. Drive days are a count of each day a drive is in operation. If it is not in operation, i.e. it has failed and it's gone, it's no longer counted. You can see the same kind of numbers, failures over drive days, but now you get a number that's almost half as much, about 2.5 or 2.6 percent. Which one is right? Well, we use the second method, drive days, and here is the reason. That first method assumes that every drive has been operating for the same period of time. That may be very valid for you in a situation where you have five drives in a RAID in a NAS box or something like that and you want to do the computation. We don't have that; we have drives in and out of the system all the time, whether it's a failed drive, whether we take a system down for maintenance, or whether we put in a new system. When we put in drives now, we put in 1,200 drives at a time. One of those vaults takes 1,200 drives. It comes spinning up, the lights dim, everything happens. So that's why we use that method: it accounts for the fact that we have drives in and out of the system all the time. If you're ever in our comments and people are yelling about this, they're usually thinking of the first one and asking, "Hey, why doesn't that work?" And that's why it doesn't work. Let me make sure I stay on time. Thank you. Other people have in fact taken all of that data I talked about and applied other models to it. We have an annualized failure rate that we create, and you'll see that in a minute. For those of you who have ever been on the medical or biology side of things, there's something called Kaplan-Meier, which is basically how long will something live versus how often will it fail.
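The difference between the two failure-rate methods is easiest to see with numbers. The sketch below is illustrative only: the fleet sizes are invented, and scaling drive days to drive years by dividing by 365 is an assumption about what "annualized" means here, since the talk only says failures over drive days.

```python
def simple_failure_rate(failures: int, drive_count: int) -> float:
    """Method one: failures divided by number of drives, times 100."""
    return failures / drive_count * 100


def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Method two: failures per drive year of operation, times 100.

    Assumes a drive year is 365 drive days.
    """
    drive_years = drive_days / 365
    return failures / drive_years * 100


if __name__ == "__main__":
    # Hypothetical fleet A: 100 drives, all in service for a full year.
    print(simple_failure_rate(5, 100), annualized_failure_rate(5, 100 * 365))

    # Hypothetical fleet B: 100 drives, each in service for only half a year.
    # Method one cannot tell the two fleets apart; method two can.
    print(simple_failure_rate(5, 100), annualized_failure_rate(5, 18250))
```

When every drive has run for the same full year the two methods agree, which matches the point about a small, static RAID or NAS box. Once drives enter and leave at different times, only the drive-days version reflects how long each drive was actually exposed to failure.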
All right, and this is from Simon Erni (I'm going to get his name right: Erni). He's from Sweden. You can see down there where he publishes it; each quarter, when we publish this data, he updates it. This is a pretty boring little chart, but he's got a lot of others out there that are a whole lot more interesting. Basically, this is all of the drives we've ever had over time, and the chance that they're going to survive after so many days. So one year, two years, three years; at four years you're looking at something in the neighborhood of 88 percent. That's the failure rate over time: how long can something be expected to live? You put in a drive today, and there's an 88 percent chance it will survive four years. That's the kind of thing this one does, and he does it, by the way, for all of the different drive models that we have, so it's kind of cool. The technique isn't very hard if you know how to do it. A number of other people have done some fun stuff with the data, but this is a really good one, and he does a really nice job of explaining it. Measurements. Each quarter we publish the drive stats, and like you said, we've been doing this for about four years now. This is everything we've got; here's all the data, there's no hiding, it's everything. Each quarter we publish two, sometimes three, different looks. One is a quarterly look: tell me everything that happened in the last quarter only. The other one is a lifetime look: given all of the drives we currently have running, how have they been doing since we put them in? And the last one is every drive we ever owned: tell me how it did over the entire period. So you'll see all that if you go look at the stats, but that's how you look at the data. A lot of times we make a mistake, by the way. When we publish a report, the first thing is the quarterly numbers, and people have short attention spans, so they
look at the quarterly numbers and immediately say, "Oh my goodness, so-and-so's drive had a zero percent failure rate, they're a great drive!" Well, yeah, but there are only 20 of them, and they've only had a total of 539 drive days. So make sure you pay attention to the data, the quarterly data in particular. We use the quarterly data as just a mechanism, like a vector: is a drive moving up or down in how it's failing over time? And that's a good
vantage point for us. I will tell you, pulling these stats and doing all of the magic that we do is like a part-time job. It's like another thing we have to do, and so we do it and it's great, but it's not like that's in my job description anywhere. Now, on the other side, you get some really good-looking things, because you start to get decent numbers of drives
and decent numbers of drive days. So for example, I've heard there's a couple of six-terabyte drives over there. That particular model is a Seagate, which you cannot get anymore, by the way, sorry. That one's lifetime
failure rate is 0.87 percent, less than one percent. That's a pretty good number. There are some 4-terabyte HGST drives that are stellar: 0.26 percent. That's amazing. That's like, of 100 drives, a quarter of one failed over the year. Those are pretty stellar numbers. You can't get those anymore either. I know you can't, because we bought every single one of them that Western Digital, who owns HGST, had in warehouses everywhere. We sent guys under trucks and into corners looking through the boxes of hard drives, and we bought every single one they had, because they were great, and then we opened up a data center with them. That's what we do. If
you know one thing about Backblaze, it's that we are frugal. We started with, and we continue to use, consumer drives, which we'll talk about in a minute. Lifetime stats: remember the stats I showed before; if I go back, this is over the entire period of time. A little less information, but you can see that's the math we calculated before. Here's a fun number: 7,437 drives have failed since we started keeping stats. Anybody else have that many failures? Unless you're from Google or Facebook or one of those guys, you probably don't, and by the way, their number is going to be a whole lot bigger than that. We have 600 petabytes, yay; those guys have, I don't know, a hundred times that. It's amazing. We're still proud of what we do anyway. So what I did
there is I summarized across all of the different drive sizes, so if you're going down to the store, this is what you might be thinking about. We had a really good run with two-terabyte drives; every single one we got in the place was great. And then there was this thing in 2011 and 2012 called the Thailand drive crisis, and we had to buy a lot of drives, and they were all three-terabyte drives during that period, and let's just say that those drives were not as good. I will leave it at that; the numbers are the numbers. Now, you look at this and you go, "Wow, look at those twelve-terabyte drives, they're doing great." Remember, we've only had them for a year. If you start to think about how drives fail, they do seem to follow a bathtub curve: there's a little infant mortality at the beginning, then they settle into a really nice low rate for two to three years, then they start to bump up at about three and a half to four years, and the failure rate starts to go up from there. Drives seem to follow it. The interesting part is that the infant mortality rate for us has gone almost flat at the front end of the curve for some of the bigger drives. I don't know if it's because they're making better drives, or they're testing them better, or whatever the case may be; we're just not seeing the same level. It's almost indistinguishable from the middle of the curve now. So yay for the drive manufacturers for that. Let's see, by manufacturer. Everybody always asks which manufacturer makes the best drive. It depends. For HGST, most of the drives we have from them, with the exception of one model, are four terabytes or less, and those were generally from before they were acquired, so that number kind of fits that model. The early Seagates, not so good; the later Seagates, pretty darn good. And we don't have enough Toshiba drives yet. I do have 1,200 14-terabyte Toshiba drives
in the warehouse that are going to be deployed any day, so that'll help some of these numbers, but that's what we have. So you look at the failure rates, and everybody wants that answer: "Should I buy Seagates?" I could divide the room in half, and this half is the Western Digital half and that half is the Seagate half, and you guys can yell or throw stuff at each other, and then there's a couple of Toshiba guys down in front and nobody pays attention to them anyway. So I'm not going to try to solve that problem for you. The other thing that gets in the way is that our environment is our environment. It's a data center, and we treat these guys really nicely. They go in a nice chassis, they get tested, they're air-conditioned, we monitor the electricity going through them, it's all filtered, and everything like that. I don't know if that's the same environment you have at home. I'll just say it, but this is the way the data is. A little more stuff: lifetime. For the drives that are currently running in the data center today, there are 98,000 of them, and you can see the failure rate is a little less than two percent. Still, 4,300 of those drives have failed. That's what we consider to be the most relevant thing. I can't go back in history; I don't really care about one-terabyte drives anymore, because we don't have any. We just got rid of the last three-terabyte drives like two days ago, and the only reason we even had them was that they were in a rack. There were four pods full of them in a rack, and they were in an area where we don't go. It's just the way this particular data center is set up: there are two or three racks that are all by themselves in a corner somewhere, and they're caged and everything, and we were going to build out the
rest of that at one point, and we ran out of electricity, I guess, is the best way to put it. Who would have thought it, right? It was too expensive; it was actually cheaper to go and open another data center than it was to drag more electricity into the existing one. This is the kind of fun math you have to do. So no, I don't want to spend five hundred thousand dollars to have PG&E drag another 100 megawatts, or whatever it was, into that data center; let's just go open one in Phoenix. And it worked out. But these happened to be in a rack, one of those racks in the corner, and those drives just never failed for some reason. They were in a little set there, but they finally left. We had a ceremony for them, and we'll even do a little blog post about them, because that's the kind of goofy stuff we do. So, all of those models: lifetime stats and operational models, just so you can see the ones that are really kind of fun. If you stare all the way over at that right-hand column, which one's the best?
Well, it's the 10-terabyte Seagates all the way down at the bottom. Those have been really rock-solid drives, and Seagate doesn't make ten-terabyte drives anymore. See, they skipped right to 12. They made those for like a week, I swear. We bought 1,200 of them, and we went to go buy some more, because what we do is run a sequence of things, very typical data center kinds of things you have to think about. We put in 20; 20 is what's called a tome. It's one drive in each of those 20 different pods. We see how they perform, and if they keep up with everything else that's going on, then we'll say great, and then we'll build a whole storage pod out of them, so we add another 59, because there are 60 in there. Then we do it again, and if we're happy with that, if we like the results of that, then we'll go out and fill the vault with them, all 1,200. That's what we did with the Seagate ones, but apparently we took too long, because they decided they weren't going to make tens anymore, and now they're making twelves, and fourteens are coming. I don't believe they have any 14s yet.
Sorry, can't get those. It looks like you're going to need really big drives next year anyway, so it's amazing. The other one that does really well is the HGST, at about a half a percent. You can see those two 4-terabyte ones towards the top, which are really good and rock solid. I mentioned the Seagate sixes were pretty good too. The Seagate 8-terabyte ones we'll talk about in just a second.
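The staged qualification sequence described a little earlier (one drive in each of 20 pods, then a full pod, then a full vault) comes down to simple arithmetic. The sketch below only restates the numbers from the talk; the stage names other than "tome", and the function names, are made up for illustration.

```python
PODS_PER_VAULT = 20   # a vault groups 20 pods
DRIVES_PER_POD = 60   # each pod holds 60 drives


def rollout_stages() -> list:
    """Cumulative number of drives of a new model after each stage."""
    tome = PODS_PER_VAULT                     # one drive in each of 20 pods
    full_pod = DRIVES_PER_POD                 # a whole pod of the new model
    vault = PODS_PER_VAULT * DRIVES_PER_POD   # every slot in the vault
    return [("tome", tome), ("full pod", full_pod), ("full vault", vault)]


def drives_added_to_fill_pod() -> int:
    """A pod already holding one test drive needs this many more."""
    return DRIVES_PER_POD - 1


if __name__ == "__main__":
    for name, count in rollout_stages():
        print(f"{name}: {count} drives")
    print("added to complete the pod:", drives_added_to_fill_pod())
```

Each stage only goes ahead if the drives kept up with the rest of the system at the previous stage, so a model that cannot keep up is caught while only 20 drives are involved.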
Those 8-terabyte ones are a really nice thing, because one of them is a consumer drive and the other one is an enterprise drive. So I think it's important to understand what we care about, because it may or may not be the same as what you care about when you buy a drive. We care about cost, number one. The rest of them you can gray out; it's almost that much. But that second one, power, as I mentioned, is sometimes really interesting to us. For example, when we put in the enterprise 8-terabyte Seagate drives, they drew almost one and a half times as much power as the consumer drives. When you're running on the ragged edge of the maximum power you have in a rack, you can't do that, because then you can't put ten storage pods in there and fill the whole rack; you can only put in six, and that's not good for density. So the power is important. But Seagate has a really nice capability, called PowerChoice technology or something like that (it's on the next slide), which allows you to adjust the amount of power that you're going to give that drive, and with that we could get it in there. So, cost: right now for us it's someplace around 2.2 cents to 2.25 cents. You can actually get a better price than that every once in a while: somebody will be having a sale at Costco or wherever, and it'll actually math out to be less than that. A long time ago we used to buy drives at Costco. Now if we show up at Costco and say that we need a thousand drives, they don't let us have them. So now we buy straight from the manufacturers, or close enough. That's about what we pay, someplace in there. The other thing in there is "fits our usage." Like I mentioned earlier, sometimes we put 20 of those drives in there and they just can't keep up. There's
something that's not working right in the environment, so we don't use it. Why beat our head against the wall? Failure rates do matter. You saw really nice low numbers. We can tolerate anything in a single-digit failure rate. Once you start to get above that, you're starting to roll the dice really hard on your durability. So a single-digit kind of failure rate, and the lower the better. Right now we're running at about 1.1 or 1.2 percent, and that's a really nice number, because it keeps the durability going. Remember, the data is sharded across 20 different systems, and 17-and-3 is the mechanism, so I can lose three whole systems, like I mentioned earlier. That's part of the durability calculation. But if I have drive failure rates of twelve percent, and rebuild times that are now starting to approach two weeks on some of the large drives, all of a sudden the math starts to get funny. So we like single digits. Warranty: we don't care about warranty. It's almost not worth it for us. I know it's worth it for most of you, but when a drive fails, there's the time it takes us to fill out all of the information, put it in a box, send it off, and hopefully get it back, and they're going to send you a refurbished drive, which I really don't want. So warranty's not interesting. And the last one for us is speed. The way we build that array of 20, and there are 60 of those in a vault, I have no trouble accepting data onto those disks, none at all. The gating item is that you can't get me enough data; it's that simple. Yeah, yeah. I don't have the slide with me; it's probably on my computer somewhere. When we first started it was around 11 cents, and over the 10 or so years we've been doing this it's come down. You can literally see drives, by
size do that. They come down: one will start and go down, then the next size will be introduced and it'll be a little higher, and you've got to wait it out until it gets down to where the previous one was. They're just consistent. The only time that broke was during the drive crisis, and that broke hard. Drive prices through normal channels went up 3x.

Yeah, that's a really good point. And the other thing that factors into that is density, storage density in a given spot. One of the things we've been doing over the last three or four years is migrating from the smaller drives to larger drives. So I take out a 4 and I put in a 12: I just got three times as much storage for approximately the same cost. Now I've got a bunch of four terabyte drives, but I got four years out of them. I turn them over and they get recycled. They don't end up in a pile in China or wherever. Hopefully not.
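The swap just described (take out a 4, put in a 12) comes down to simple arithmetic. A minimal Python sketch of it follows. It assumes the 2.2 cent figure quoted earlier is a price per gigabyte, which the talk does not state outright, and the drive price and slot count are hypothetical values chosen for illustration, not Backblaze numbers.

```python
GB_PER_TB = 1000  # decimal units, the way drive capacities are marketed


def cost_per_gb(price_usd: float, capacity_tb: float) -> float:
    """Purchase price divided by capacity, in dollars per gigabyte."""
    return price_usd / (capacity_tb * GB_PER_TB)


def density_gain(old_tb: float, new_tb: float) -> float:
    """How many times more storage one slot holds after the swap."""
    return new_tb / old_tb


def chassis_capacity_tb(slots: int, capacity_tb: float) -> float:
    """Raw capacity of a chassis with every slot filled."""
    return slots * capacity_tb


SLOTS = 60  # hypothetical slot count, for illustration only

print(density_gain(4, 12))               # 3.0
print(chassis_capacity_tb(SLOTS, 4))     # 240
print(chassis_capacity_tb(SLOTS, 12))    # 720
print(round(cost_per_gb(264.0, 12), 4))  # 0.022 (hypothetical $264 drive)
```

The same slots hold three times the data, so if the per-drive price is roughly unchanged, the cost per gigabyte of that slot drops by about the same factor.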
That's a good point, and that is exactly the kind of math we do. So let's compare. These are 8 terabyte drives by the same manufacturer, two different models: one's the consumer one, one's an enterprise one. About a year ago we did this; the failure rates in that second column show where they were, and now you can see the current failure rates. That's within the margin of error, by the way. So you sit there going, hmm, I could spend 129 dollars to buy that 8 terabyte drive, maybe 159, or I could spend 429 dollars to buy the enterprise drive. I wonder which one I should do if I'm interested in failure? For us it didn't matter. Now, for us also, let's just say the price of those things is approximately the same. You can't do that, because I buy a million dollars' worth of drives at a time.

But that's what we see out of the data. So if I was looking at that and asking what I would do, I might pick a consumer drive, because really the failure rate is about the same. I will tell you a little difference that we've seen, and this is anecdotal; I'm never going to write this down anywhere. The consumer drives seem to have this tolerance for things happening to them inside, like bad sectors that have to be remapped around, and all of that kind of stuff. The enterprise drives don't seem to have that same tolerance. When they start to go, they just go. They don't give you a whole lot of notice. It's kind of like, "I don't feel good. Goodbye." It's just an observation, and I think it's because of the situation. If I'm going to put a drive in a consumer system, it's not going in a data center. It's an external drive, and you have it next to your thing, and you drag it around, you bring it over to Molly's house and you drop it on the floor. So they have a lot of tolerance built into them. But if you're
making the decision for yourself, these are the things I think you might want to consider. The warranties, of course, are different. A typical enterprise one is five years; the consumer ones are two. At one point during the drive crisis they were one year, and if they could have gotten it down to 90 days they would have done it during that period. Consumer drives are really much less expensive just off the shelf. Enterprise drives have a lot more features. PowerChoice technology, that's what I was trying to think of, for example, from Seagate. They have a lot more things you can tweak in the firmware to make that drive perform in a way that fits your environment really well. They're absolutely faster to read and write. Absolutely. Again, we don't care, because that's not where the bottleneck is. The bottleneck is just getting the data to us, and we have plenty of network for that, so it's just a matter of waiting, kind of sitting around doing nothing. But on the consumer side, they use a whole lot less power out of the box. And again, as I mentioned, they seem more tolerant before failure. So, which is right for you?

Yeah, so they seem to do quite well, and like I say, then they get sick and die, and it's that fast. Sometimes it's hours. You'll see the first little thing, it'll throw an error or something like that, and then it goes offline two hours later, and you go, "I didn't even have time to look at it." The enterprise HGST drives, by the way, seem to be the same way in their behavior. We thought it was an HGST versus Seagate thing, but it seems to be an enterprise versus consumer thing.

All right, helium. We'll have plenty of time for questions after getting all of this stuff out of the way. So, helium. Any large drive now, starting at about the eight terabyte drives, although there were some six
terabyte helium drives, but starting with about the eights and moving on up, is going to be filled with helium. They finally got that technology right. They were trying for years to figure out how to keep the helium in there, because it wants to get out. And they finally figured it out. They even created SMART stats to measure the amount of helium. For example, the HGST SMART stat is 22; 100 is the raw value, and anything less than that means it's leaking. They have a tolerance number, but they haven't told us what it is. We have a handful of them running in the 90s right now, so we're trying to figure it out. The Toshiba drives, we just got in the 14 terabyte Toshiba drives, are helium-filled, and they have two numbers, 23 and 24, and they measure helium at two different levels inside the drive. They call it high and low; I think it's above and below the platters. And it's the same kind of thing. They're actually, in many ways, still learning what that number means a little bit, because it is a fairly new thing for them.

So what can I tell you about that? We have some helium-filled drives on the top, and we have some non-helium-filled drives on the bottom. One
of the funny things about the bottom, and we'll talk about temperature in a minute: the eight terabyte air-filled drives ran hot. They ran three, four, five degrees Celsius hotter than the lower-capacity drives. There's just so much going on in there. The helium-filled drives were a little cooler; they run back at normal levels, and that makes sense. That's one of the things they talked about. And right now we don't see any difference in the annualized failure rates between helium and air, which actually bodes well. That means they picked a good technology, they moved forward, and it didn't cost them anything. You can see the different failure rates up there and compare. Now, if this were going to be a perfect test, the drive days would be roughly the same, and they're not right now, so it's not quite apples to apples, but it's pretty good. There's enough data there to start to get to that kind of conclusion: it looks like the helium drives are going to perform at least similarly to air-filled drives. They still cost a little bit more, or a lot in some cases. When we bought those HGST ones, we bought those three years ago, and those were about four hundred and fifty dollars apiece. That was a crazy number for us. We bought 45 of them, so that was the most expensive storage pod we ever built. But they're doing okay. They're in three-plus years now; that's a really good failure rate after three years, and they're hanging on. So we'll see what's going to happen. We're going to track them over time and see if the helium drives continue to maintain that kind of performance, that kind of failure rate.

Let's see, what else? Yeah, so we'll continue to do that. Those are the two 8 terabyte models we have. I was going to throw the sixes in there too, but really, like I said,
there were a couple of six terabyte helium-filled drive models, but they didn't make them in any quantity; they kind of experimented with them. They really started to do it in the eights; that's where the technology is, and if you buy anything above that now, chances are it's going to have helium in it. An interesting little thing, though: both of those helium models up there are enterprise drives, or enterprise-class drives. It'll be interesting to see if manufacturers continue to build large quantities of consumer drives in that size. The reason we buy the enterprise ones is, again, that the price is about the same in quantity, so we buy a bunch of them. I don't know if I could buy, let's say, 12,000 consumer drives right now. If I wanted to buy 12 terabyte Seagate consumer drives, I don't know if anybody would sell them to me. That's part of the way drive manufacturers manage their channel. You might be able to go to Costco and buy one or two. We did that during the Thailand drive crisis: we went to Costco and Best Buy and bought drives off the shelf because we couldn't get them anywhere else. But I don't want to do that to try to buy 12,000 of them. I was going to say there aren't enough Costcos, but there probably are.

All right, so enough about that. SMART stats: can you actually predict whether a drive is going to fail or not? We track five stats by default. We've been doing this for years. We talked to drive manufacturers and lots of folks, and they said, hey, these are five good ones. So we track these numbers, and one of the things I did a little bit ago was say: if I look at all the operational drives that are running right now, how many of them have one or more of those attributes greater than zero? It's
either zero or a number greater than zero. Zero is good; anything else is bad. And it just so happened that about 4.2 percent of them were like that. So then I looked at all of the failed drives and asked how many of the failed drives had the same thing: 76.7 percent. So if you're a stats guy, you're looking at that going, that doesn't feel like a very good predictor, those five little stats. It's obvious there's a really big gap in there, but you're not sure. You'd like to see that number on the other side be, what, 95? A couple of deviations out, you start to feel good about it.

So some really smart people, not me, over at IBM Switzerland got together and did a wonderful little paper a couple of years ago, and that's where you can find it. Even if you're not good with math, it's a fun read. What they were able to do, by drive model, and this was the amazing part, was show that you could actually predict with that kind of certainty when a drive was going to fail. That's pretty amazing when you start to think about it. Wouldn't you like to know with 97.2 percent certainty, three days ahead of time, that a drive is going to fail? That's awesome. Now, you have to calculate that for that particular drive, that's the Seagate 4 terabyte drive, and then you've got to calculate that for every single drive model. And it gets to be one of these interesting little things: somebody's got to have a lot of drives to produce the data to calculate it so that everybody else can use it. And that's exactly how they did this. They looked at all kinds of drive stats that we had; they used our data to do this, and they were able to get pretty good results. Now, you look at the HGST one, and that's not so good. For that particular model, three days out, it's 84 percent. I don't know if you really want to throw away 16 percent of your drives that are good.
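The "one or more attributes greater than zero" screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk says five stats are tracked but does not list them here, so the attribute IDs below are an assumption, and the drive records are made-up toy data rather than anything from the published data set.

```python
from typing import Dict, Iterable, Tuple

# Assumed attribute IDs; the talk does not name the five stats at this point.
WATCHED_ATTRIBUTES: Tuple[int, ...] = (5, 187, 188, 197, 198)


def is_flagged(raw_values: Dict[int, int],
               watched: Tuple[int, ...] = WATCHED_ATTRIBUTES) -> bool:
    """True if any watched SMART attribute has a raw value above zero."""
    return any(raw_values.get(attr, 0) > 0 for attr in watched)


def flagged_fraction(drives: Iterable[Dict[int, int]]) -> float:
    """Share of drives with at least one watched attribute above zero."""
    drives = list(drives)
    if not drives:
        return 0.0
    return sum(is_flagged(d) for d in drives) / len(drives)


# Toy records (hypothetical): attribute ID -> raw value.
operational = [{5: 0, 187: 0}, {5: 0, 197: 0}, {198: 0}, {5: 2}]
failed = [{5: 16, 187: 3}, {197: 8}, {188: 0}, {198: 1}]

print(flagged_fraction(operational))  # 0.25
print(flagged_fraction(failed))       # 0.75
```

Running the same function over the operational and the failed populations gives the two percentages being compared in the talk; the gap between them is what makes the screen suggestive, while the overlap is why it is not a reliable predictor on its own.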
I don't. I think the other one's pretty cool; 3 percent I could deal with. So it seems like there's some way you can calculate this. Like I said, it's not my day job to do all of this stuff; we're trying to run a backup company, a cloud storage company. So if any of you want to do it, the data is there to do those kinds of things. But it is interesting. Now, I've heard people, by the way, talk about SMART stats and say it's a bunch of garbage: the different manufacturers just spit out numbers, and who cares. I don't know, that doesn't look like a bunch of garbage. That looks pretty good. And if you look through the paper, they really spent a lot of time with it.

Last thing: temperature. I'll just go to the front of it, because everybody asks this question. A number of years ago Google did a study and said temperature doesn't matter: you can just crank up the heat, turn down the air conditioning, and you're good. So we wanted to figure out if that was true, because I don't mind saving on air conditioning, especially since we've built a data center in Phoenix. The average temperature of operational drives for us is, you can see, around 77 degrees or so. I converted it to Fahrenheit because we're in America. You can kind of see how that chart is, and there's a handful of them that run at 45 degrees, which is pretty warm, by the way. But all of that is within the range of a drive. And this is taken, by the way, right off of the SMART stats; it's the sensor inside the drive. So this isn't in the chassis, or sticking a thermometer on top of the thing or anything; this is inside the drive itself. And all of those fit within the parameters that they give you for the operational range of a hard drive. So we never had a drive fall
outside of those parameters. Now the interesting part would be failure: is there any correlation to failure? So I broke it down by drive model once again, since everybody cares about that, and you can see the three manufacturers aren't even close in how they fail. HGST, down at the bottom, doesn't look like there's any real correlation; it could be anywhere along there. The Seagate one, yeah, maybe, but it's not much. And the Western Digital one is a Batman cowl. That's what I see; I don't know what you guys see. The fun thing is towards the end, which is where Google spent their time talking about: once you start to get above 40 degrees Celsius, you actually do see that bump up in drive failure. But we just don't see enough drives fail there to actually say that's what happens. They seem to fail in other places. For example, with the Western Digital one, at 30 degrees Celsius 18 percent of their drives fail in there, and that's not much above their normal temperature. So I don't think there's any real correlation with failure during a normal range of operation. Now, once you get up there, like I say, that's what we talked about.

So I'll leave it with questions, since we've got a few minutes here. Anybody got anything? Yeah. Oh, sorry: have we done any analysis of file systems and how they might affect drives? The answer is no. We use a file system, and I can't remember what it is, but it's a standard one everybody and their brother would use. We didn't invent our own or anything like that. I do know some folks who have tried that, because they do some really funky things as it relates to writing blocks and so on. But for us, the drive either works or it doesn't. So no, no difference with file systems that I'm aware of. Anything else? Yeah. So the
question is, since we started putting our numbers out, have we noticed that the consumer drives have gotten better and fail less? Yes, we've noticed that. While I'd like to take credit for that, because sometimes transparency is a very good thing, I don't think Seagate is sitting around in boardrooms going, "Gosh, they just published their data, we'd better get better." I also think, by the way, they learned a lot. The Thailand drive crisis was really an awful event for a lot of reasons, and they really got hurt during that. So I think what you're really seeing over the last few years is just them making drives now with a reliable set of parts and so on. I'd like to believe we had some influence on them making better drives, and we have good relationships with Seagate, for example, and with all the drive manufacturers. Maybe they just give us the good ones, I don't know. But we have observed that, yeah; that's a fair observation.

Anything else? Yes: how do they keep the helium in the helium drives? That's a really good question. I know they spent a lot of time creating the case that goes around it and how they pack it. I don't know the mechanics. I was reading an article a few weeks ago about how they did it, because we've got the Toshiba drives, and about the history; I think it was actually Western Digital that got the first commercial versions out. But I don't know the mechanics, and I don't think they share a lot of that. There's this general notion of, "Hey, we did it, and we used some marketing-name thing to do it," but they don't give you the specifics of, "Hey, we coated it here, we did this here, we closed this gap here," and all of that. They can't let anything out. Helium is just going to leak out; it can actually leak out through a lot of substances.

Yeah, so that's right, there is no flow. So they've had to
reinvent the drive a little bit. The question was: with air drives, the air helps the heads a little bit, and the helium doesn't do that because there is no flow, so basically how do you get rid of heat? Those are the kinds of things they have managed to figure out how to do. But they can't lose the helium that's in there. They put it in there and it's sealed in. It's not like they come around and plug in a thing every so often and add some more helium.

So yeah, there was another question. We actually wrote our own erasure code. I'll just say it's like RAID, but it's different. We did publish that and how we did it, and we put it up on GitHub if you want to read it. It is that same kind of notion of sharding something across X number of devices and having to have so many of them to restore the entire thing; that is the notion of what RAID is. We originally used RAID 6, and a lot of the storage pods still run RAID 6, but all of the vaults run our own erasure coding.

Anything else? So, fun question: if you're looking at S3, they charge a little bit more than we do, so how much of that is profit? Okay. I know they make a lot of money, and Jeff Bezos is making more than our CEO. Now, they have scale over an enormous area, so that certainly adds cost at some level or another. They also subsidize some of their other businesses with the money they make, and all of that. I think we do a really good job. They do some things better; they have a lot of compute capabilities and all that, so I'm not going to tell you we're the same service. But for what we do, we try to make it as economical as possible, and we'll always do that, even though our CEO will be poor. So, all right, thank you very much.
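The "so many of them to restore the entire thing" idea can be put into numbers using the 17 and 3 split mentioned earlier in the talk. The Python sketch below is not the published erasure code and does no encoding at all; it only counts how many shard losses a 20-shard layout tolerates and, under the simplifying assumption that shards fail independently with the same probability, how likely it is that more than three are lost at once. The probabilities passed in are hypothetical.

```python
from math import comb

TOTAL_SHARDS = 20   # 17 + 3, per the talk
NEEDED_SHARDS = 17  # any 17 surviving shards are assumed sufficient to restore


def can_recover(surviving: int, needed: int = NEEDED_SHARDS) -> bool:
    """Data is restorable as long as enough shards survive."""
    return surviving >= needed


def prob_unrecoverable(p_shard_loss: float,
                       total: int = TOTAL_SHARDS,
                       needed: int = NEEDED_SHARDS) -> float:
    """Binomial tail: chance that more than (total - needed) shards are lost,
    assuming independent losses with probability p_shard_loss each."""
    max_tolerated = total - needed
    return sum(
        comb(total, lost)
        * p_shard_loss ** lost
        * (1 - p_shard_loss) ** (total - lost)
        for lost in range(max_tolerated + 1, total + 1)
    )


print(can_recover(17))  # True: three shards lost is still fine
print(can_recover(16))  # False: a fourth loss is one too many

# Hypothetical per-shard loss probabilities within one rebuild window.
for p in (0.001, 0.01, 0.12):
    print(p, prob_unrecoverable(p))
```

The output for the larger probability shows why the talk says the math "starts to get funny" when failure rates climb and rebuild windows stretch: the chance of a fourth simultaneous loss grows much faster than the per-drive failure rate itself.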