
Building a reasonably popular web application for the first time.

Formal Metadata

Title
Building a reasonably popular web application for the first time.
Title of Series
Part Number
57
Number of Parts
169
Author
Erik Näslund
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language
English

Content Metadata

Subject Area
Genre
Abstract
Erik Näslund - Building a reasonably popular web application for the first time.

These are the lessons learned when scaling a SaaS web application which grew much faster than any one of us could have ever expected.
- Log and monitor from day one.
- Things will fail, be sure you know when they do.
- Choose components which allow language interoperability.
- Horizontally scalable everything.
- Plan for database downtime.
- Have a way to share settings between backend and frontend.
- Have a way to enter maintenance mode.
- And more...

-----

My name is Erik Näslund - I'm the co-founder and Head of Engineering at Hotjar. I'd love to share the lessons learned when scaling a SaaS web application which grew much faster than any one of us could have ever expected. Words like "big" and "popular" carry very little meaning, so let me define how big Hotjar is right now using some numbers. We onboard about 500 new users on a daily basis. We process around 250,000 API requests every minute. Our CDN delivers about 10 TB of data per day. We have roughly 3 TB of data in our primary data store (PostgreSQL), another 1 TB in our Elasticsearch cluster, and a LOT more on Amazon S3.

These are the key things we wish we knew when we started. They would have made our life so much easier!
- Log and monitor from day one.
- Have a way to profile your API calls.
- Things will fail, be sure you know when they do.
- Have a way to keep secrets.
- Everything needs a limit (even if it's really big).
- Be wary of hitting data type limits.
- Don't get too attached to a framework.
- Choose components which allow language interoperability.
- Horizontally scalable everything.
- Plan for database downtime.
- Feature flags are a great way to test things out before launching them to the public.
- Have a way to share settings between back end and front end.
- Have a way to enter maintenance mode.
- Require different quality of code for different parts of your application.
Transcript: English (auto-generated)
Welcome all. Here we have Erik telling us about building a reasonably popular website for the first time. Give him a clap. Thanks. First of all, I couldn't get the screen configuration exactly right, so I'll do this without my notes.
Please excuse me if it goes wrong. So, I'm going to talk about building a reasonably popular web application for the first time, because I was lucky enough to get to architect, build and design something that grew quite quickly, and got to learn to deal with scale way quicker than I would have expected.
So, I learned a lot during this time and I'd like to share what we learned, so hopefully you can at least skip making the mistakes we made and make your own unique ones instead. So, who am I? Why am I speaking? I'm the co-founder and chief architect at a company called Hotjar.
Hotjar, both the name of the company and our product, is a set of web analytics and feedback tools. So, basically this means a lot of data ingestion. We are installed on almost 200,000 sites in the world right now, so a lot of data coming in. I'll give you some numbers later. So, my development career actually started a long, long time ago at the age of six.
I wrote my first game. It wasn't that awesome, probably, in retrospect, but I thought it was. So, I got hooked on programming and I've been hooked ever since. After that I transitioned between different tech stacks throughout the years, but I started with Python about seven years ago now, and it's the one I definitely like the most so far.
So, since I'm going to talk to you about something recent, popular, reasonably big, it's only fair that I give you a definition of what I think is reasonably big, right? So, Hotjar right now, we process around 400,000 API requests every minute. Our CDN delivers about 10 terabytes of data to our users every day,
and we have roughly three terabytes of data in our primary data store, which is Postgres. And another two terabytes in our Elasticsearch cluster, and somewhere between 35 and 40 terabytes on Amazon S3. So, that's our definition of reasonably popular, reasonably big for today. We still use reasonably standard solutions though. Our tech stack isn't anything out of the ordinary, as you can see here.
Nginx, Memcached, uWSGI, Python, Elasticsearch, Lua, Postgres, and Redis. It works amazingly well to just run a load of uWSGI workers, even at this scale, believe it or not. At some point we will, of course, start using all the fancy async stuff, asyncio and uvloop, and all these things.
It's probably going to be a great match for us, but for now, very plain process-based uWSGI scales really well. So, now that you have some context, let me start out with what we learned during the last two years or so. So, log and monitor from day one.
This is something we messed up a bit, because we only started logging and aggregating logs once we started having problems. At that point though, we had so much log data coming in that we had to spend quite a lot of time cleaning things up before we could actually see through the noise.
So start logging and aggregating your logs from day one, and, you know, keep your logs clean. Act on the problems you see. Otherwise, you'll have a mess cleaning it up when you need to. It's a kind of debt as well, not managing your logs. Have a way to profile your API calls. So, we ended up using SQLAlchemy as an ORM.
It's great, and I love it like 95% of the time. But every now and then you have this little innocent line of Python code that causes some really weird query. And having a way to profile both code and database queries is great. We have the concept that our super users, ourselves only, can actually append ?profile=1 to any API call in the query string.
Instead of returning the normal results, that makes the endpoint return cProfile data and SQLAlchemy profile data.
And having an easy way to get profile data from a live API call in the live environment in just a few seconds is actually great. It makes you profile a lot more, and you get a much better understanding of your system as a whole. So, highly recommended to have a way to just ad-hoc profile a query from a live environment. You know, great to do.
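A minimal sketch of that kind of ad-hoc profiling hook, assuming a Flask view; is_super_user(), the decorator name and the response format are placeholders, not Hotjar's actual implementation:

    import cProfile
    import io
    import pstats
    from functools import wraps

    from flask import Response, request

    def is_super_user():
        # Placeholder: replace with your real check, e.g. a flag on the user row.
        return False

    def profilable(view):
        """Run the wrapped view under cProfile when a super user passes ?profile=1,
        and return the profiler output instead of the normal response body."""
        @wraps(view)
        def wrapper(*args, **kwargs):
            if request.args.get("profile") != "1" or not is_super_user():
                return view(*args, **kwargs)
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                view(*args, **kwargs)  # result is discarded, we only want the timings
            finally:
                profiler.disable()
            out = io.StringIO()
            pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(30)
            return Response(out.getvalue(), mimetype="text/plain")
        return wrapper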
Sometimes it's the Python code that takes time, sometimes it's the database. But you'll be surprised how often it's actually the Python code. You know, you make a silly little mistake in SQLAlchemy that's really heavy in processing. So, it's a great thing to do. Know when things fail. So, at some point we had to add some cron jobs, I don't remember quite for what, but you know, some background processing.
And yeah, they failed at some point without us noticing, because it was a silent failure. It exited for some unknown reason, it didn't throw an exception or anything like that, because we were obviously monitoring for that, but it just failed silently.
So, it's just as important to know when things are not happening as to know when, you know, bad things are happening. So, we solved this by adding the simple concept of job expectations and job results. A job expectation is something simple like: I expect this job to run every hour.
A job result is simply a log entry from the job that it writes when it's complete. Then we basically just have a status endpoint that's called by an external third-party service, and it basically checks that all expectations are satisfied all the time. That way we know that jobs run, and they run on time, and they run successfully.
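A sketch of the expectation-versus-result check behind such a status endpoint; the job names, intervals and in-memory dictionaries are made up for illustration (in practice both sides would live in the database):

    from datetime import datetime, timedelta

    # "I expect this job to run at least once every `interval`."
    JOB_EXPECTATIONS = {
        "cleanup_old_sessions": timedelta(hours=1),
        "send_daily_reports": timedelta(days=1),
    }

    # Each job writes a result (a timestamp) when it completes successfully.
    LAST_SUCCESSFUL_RUN = {
        "cleanup_old_sessions": datetime(2016, 7, 20, 11, 30),
        "send_daily_reports": datetime(2016, 7, 19, 6, 0),
    }

    def unsatisfied_expectations(now=None):
        """Return the jobs that have not reported a successful run recently enough."""
        now = now or datetime.utcnow()
        overdue = []
        for job, interval in JOB_EXPECTATIONS.items():
            last_run = LAST_SUCCESSFUL_RUN.get(job)
            if last_run is None or now - last_run > interval:
                overdue.append(job)
        return overdue

The status endpoint polled by the external monitoring service would simply return an error status whenever unsatisfied_expectations() is non-empty.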
So, always remember to safeguard against both things that fail explicitly and things that fail silently. Just as important, and easy to miss. And also, use third-party systems to monitor your own systems as well, because, you know, your own monitoring might fail. Have a way to keep secrets.
Hotjar, as everything else, started out as an experiment of sorts. So, you know, we weren't too diligent about keeping external API keys out of source control and stuff. In hindsight, stupid of us, but, you know, there it is. Then, as the development team grew, we realized, okay, maybe it's not the best idea that
everyone has access to all third-party systems, like through APIs, you know, in live environments. So, I'd recommend using something like Ansible Vault or similar from day one. It's gonna pay off. Because we didn't. So, at the time when we had to start, you know, keeping secrets, we had to change all the API keys and that kind of stuff.
It was a mess. So, have a way to keep secrets from day one. This is an interesting one. Everything needs a limit, even if it's big. So, a good example here is, we have the concept of tags. We can basically tag a recording.
We envisioned it being used for people saying, okay, in this recording the user visited the checkout page. However, some of our users used it slightly differently. They tagged each recording with unique user IDs coming from their third-party systems, like Google Analytics.
So, that meant some users ended up with 400,000 different tags. We showed those in a nice little HTML select drop-down. 400,000 select drop-down options do not render well. Our interface broke terribly because we didn't have limits in it. Users are very creative, and if you give them a way to put in limitless amounts of information, they will.
And these limits apply everywhere: to the UI, to APIs, to lengths of fields, stuff like that. They also obviously apply to databases, lengths of fields. Never, never, ever allow unlimited. It's perfectly fine to allow really big, but unlimited is bad.
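For example, a tag limit could be enforced both in the schema and before the database is ever touched; a rough sketch with made-up numbers and field names:

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    MAX_TAG_LENGTH = 255          # really big is fine...
    MAX_TAGS_PER_RECORDING = 100  # ...unlimited is not

    class Tag(Base):
        __tablename__ = "tags"
        id = Column(Integer, primary_key=True)
        recording_id = Column(Integer, nullable=False, index=True)
        # A bounded column instead of an unbounded Text field.
        name = Column(String(MAX_TAG_LENGTH), nullable=False)

    def validate_new_tag(existing_tag_count, name):
        """Reject requests that would exceed the limits, at the API layer."""
        if len(name) > MAX_TAG_LENGTH:
            raise ValueError("tag name too long")
        if existing_tag_count >= MAX_TAGS_PER_RECORDING:
            raise ValueError("too many tags on this recording")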
If you give your users a way to put unlimited amounts of data in your system, they will eventually. It took like a year, but then it happened. And another one here is slightly more interesting, I'd say, and much more surprising.
Postgres, our data store, uses 32-bit ints as the default for the ID column. We eventually hit that limit on a table. So, we hit our two-point-something billion rows. That was kind of hectic to solve when it happened.
Because I didn't even anticipate it. I'd never worked with data at this scale before, but it happened. So, think about this when trying to design your schema. Try to think ahead a year or two. I know it's hard, but try. Is there a possibility I could end up reaching data type limits if I use this type here?
If you think you're even going to be close, choose a bigger data type. It's not expensive, it's just not the default, so you have to make a conscious choice. But think about how your data will grow. And, if possible, put monitoring in place for this as well.
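A sketch of both halves of that advice: picking a 64-bit key up front, and watching how far an existing 32-bit sequence has progressed (table, column and sequence names are illustrative):

    from sqlalchemy import BigInteger, Column, text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Recording(Base):
        __tablename__ = "recordings"
        # BigInteger (int8) instead of the default 32-bit Integer,
        # for any table that could plausibly reach billions of rows.
        id = Column(BigInteger, primary_key=True)

    INT32_MAX = 2_147_483_647

    def id_capacity_used(engine, sequence_name="recordings_id_seq"):
        """Fraction of the 32-bit range already consumed by a Postgres sequence.
        Alert on this long before inserts start failing."""
        with engine.connect() as conn:
            last_value = conn.execute(
                text("SELECT last_value FROM " + sequence_name)
            ).scalar()
        return last_value / INT32_MAX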
When you're about to reach a limit, you know, halfway there, you want to know, so you have time to plan a migration. Don't get too attached to a framework. Right now, we're using Flask and Flask-RESTful. It works really well. We're super happy with it.
But, at four or five hundred thousand requests per minute, it's starting to have like a significant overhead. Because most of our requests are like really quickly processed. So, the framework matters. This, of course, depends on the use case. But for us, it matters. So, at some point, we're probably going to have to transition to something else.
So, good advice to minimize the pain of doing that is to use framework-agnostic libraries as much as possible. SQLAlchemy is a great example, because, you know, it has adapters for basically everything, and if it doesn't, you can easily write one yourself.
I don't have anything against using what I like to call thin wrappers, like Flask-SQLAlchemy, because it basically doesn't do that much. It's just a nice helper. And if you switched away from Flask, you could easily implement what Flask-SQLAlchemy does yourself. So, thin wrappers, fine.
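The SQLAlchemy setup itself, for instance, can stay completely framework-agnostic; a generic sketch (the connection string is a placeholder):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    # Plain SQLAlchemy, no Flask-SQLAlchemy: nothing here knows about the web framework.
    engine = create_engine("postgresql://user:password@localhost/appdb")
    Session = scoped_session(sessionmaker(bind=engine))

    def remove_session(exc=None):
        """Hook this into whatever end-of-request signal your framework provides
        (Flask's teardown_appcontext, or the equivalent elsewhere)."""
        Session.remove()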
Otherwise, I try to avoid framework specific libraries. It's kind of like, you know, vendor lock-in, framework lock-in. It limits your flexibility. Choose components which allow for language interoperability. So, we're definitely mainly a Python shop. But we have about, like, half a percent, one percent of our code base in Lua, actually,
for performance reasons, running inside Nginx. We made the mistake of using a queuing system called RQ initially. Great system. But Python only. And this caused some issues when we basically just wanted our Lua code to put some simple things in the queue,
and that ended up being a much bigger thing than it should have been, because, you know, we couldn't put it there, since it was a Python-only queue. So, when possible, choose components, libraries, servers, whatever, that allow for greater language interoperability. It makes it so easy, if you have a performance-critical part, to just take it out and write it in something else.
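One language-neutral option is a plain Redis list carrying JSON payloads: the Lua code inside Nginx can LPUSH the same structure that a Python worker later pops. A rough sketch of the Python side (queue name and payload shape are made up):

    import json

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    QUEUE_KEY = "ingest:events"  # hypothetical queue name

    def enqueue_event(event):
        # JSON in a plain list: any language with a Redis client can produce
        # or consume these items, unlike a Python-only queue such as RQ.
        r.lpush(QUEUE_KEY, json.dumps(event))

    def consume_one(timeout=5):
        item = r.brpop(QUEUE_KEY, timeout=timeout)
        if item is None:
            return None
        _key, payload = item
        return json.loads(payload)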
Plan for database downtime. So, yeah. In the beginning, all our database migrations, schema migrations, were simple,
because we had basically no usage and no data. It gets harder, and at some point in time we ended up, you know, not being able to just do our basic ALTER TABLE statements anymore, because they started taking significant amounts of time. Fair enough, there are some DDL tricks you can do to alleviate some of them,
but at some point, you have to introduce some kind of downtime. However, this is a nice trick that helps a bit. Try to decouple data ingestion from data processing as much as possible. A neat way to do it is capture data from the user, put it in a queue, process it later.
That way, you become much more resilient to having database downtime. Even if it's just for, you know, a minute, you need to take it down, do a little change, but if you have, like, this queue as a buffer, it's great. It's not always possible to do this, obviously, but it's a great thing to do when you can.
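A sketch of what the processing side of that buffer can look like, reusing the Redis-list queue from the earlier sketch: if the database write fails, the item goes back on the queue and gets retried once the database is reachable again (error handling is deliberately simplified):

    import json
    import time

    import redis
    from sqlalchemy.exc import OperationalError

    r = redis.StrictRedis(host="localhost", port=6379)
    QUEUE_KEY = "ingest:events"

    def process_forever(store_event):
        """Drain the ingestion queue; tolerate short database outages."""
        while True:
            item = r.brpop(QUEUE_KEY, timeout=5)
            if item is None:
                continue
            _key, payload = item
            try:
                store_event(json.loads(payload))  # the actual database write
            except OperationalError:
                # Database is down or unreachable: put the item back and back off.
                r.rpush(QUEUE_KEY, payload)
                time.sleep(10)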
So, have a way to share settings between backend and frontend code. We introduced a couple of silly bugs a couple of times, simply because we were lazy. We copied things between backend and frontend. And then we changed one of them, but not the other.
And the frontend and backend code didn't agree on values anymore. So, this is just silly and stupid, and there's a very simple solution. We ended up having a settings.json file, which contains our shared settings. It's injected using nginx server-side includes, and that way, Python can read the JSON, and the frontend can read the JSON content as well.
So, super simple, all our shared settings go there, and no more bugs of this kind. So, shared settings are good; duplicating things like error codes and stuff, copy-pasted, is not. Now, with shared settings, it's not a problem anymore.
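The backend half of that is only a few lines; a sketch assuming a settings.json sitting next to the code (the frontend gets the very same file via the nginx server-side include):

    import json
    import os

    # Hypothetical path to the file that is also exposed to the frontend.
    SETTINGS_PATH = os.path.join(os.path.dirname(__file__), "settings.json")

    with open(SETTINGS_PATH) as f:
        SHARED_SETTINGS = json.load(f)

    # e.g. an error code defined once and used on both sides:
    # SHARED_SETTINGS["error_codes"]["RECORDING_NOT_FOUND"]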
Have a way to go into maintenance mode. What I mean by maintenance mode is basically a little page saying, we're currently down, sorry. It's not nice when you have to bring it up. But it's probably going to happen to every one of us at some point. And then it's great insurance to have one.
We basically have a very little switch to turn on and off the maintenance page. And when doing the maintenance page, be careful and let it have as few external dependencies as possible, because, you know, you probably want to turn it on like when your database server crashed or something. So, don't store the switch to turn it on in the database, because it already crashed.
That was our first version that did just that. Also, on our maintenance page, we've put in the communication tool, where people can talk with our support group. It's a really good idea, I think, to keep communications open with users, even when bad things happen.
Feature flags are a great way to test things out before releasing them to everyone. So, at this point in time, we started getting really big and wouldn't want to release things we weren't too sure about to everyone. So, we introduced feature flags.
We have both server-side and client-side feature flags. So basically, we say, this part of the UI requires this feature, and this part of the API requires this feature. That way, we can do gradual rollouts, we can do beta testing with a limited group of people, and, yeah, we can also do things like enabling things, depending on which type of plan the user is on.
So, saying, if you're on the pro plan, you get this feature. So, they're a very versatile tool to have, if you start thinking in terms of on and off feature switches. Highly recommend, very simple to implement, a great thing in your toolbox.
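A minimal server-side version of such a flag check might look like this; the flag storage and names are illustrative, not how Hotjar actually implements it:

    from functools import wraps

    # In practice flags would live in the database, keyed per user, plan or beta group.
    ENABLED_FEATURES = {
        "form_analysis": {"beta_testers", "pro_plan"},
    }

    def has_feature(user_groups, feature):
        """True if any of the user's groups (plan, beta group, ...) enables the feature."""
        return bool(ENABLED_FEATURES.get(feature, set()) & set(user_groups))

    def requires_feature(feature, get_current_user_groups):
        """Guard an endpoint behind a feature flag. `get_current_user_groups`
        is a placeholder for however you look up the calling user."""
        def decorator(view):
            @wraps(view)
            def wrapper(*args, **kwargs):
                if not has_feature(get_current_user_groups(), feature):
                    raise PermissionError("feature not enabled for this user")
                return view(*args, **kwargs)
            return wrapper
        return decorator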
Accept different quality of code for different parts of the systems. This was personally kind of a hard one for me, because, you know, as a developer, you kind of get attached to what you created, and you want it to be super awesome everywhere. But it can't, because then you run out of time.
So, for example, we require all our user-facing code to be properly tested, performing well, all these things. However, imagine you have like a back office report for internal use. It's okay if it performs so-so; if it takes five seconds to generate, it's okay. But think about these things up front before starting to build a new feature.
How good does my documentation need to be here? How well does it need to perform? How well does it need to be tested? In an ideal world, everything would be perfectly documented, tested, and perform awesome. But when you need to prioritize, think about it up front. It helps a lot.
And these are basically the most noteworthy things we've learned. Kind of, not unique things, but surprising things, I'd say, most of them. I'm sure we still have many new things to learn, but this is it for now.
Thank you for listening. Who has any questions? Why was the SQLAlchemy chosen compared to Django, for example?
What were the reasons, and how do you feel about going with Flask so far? Okay, yep. SQLAlchemy instead of Django, you said? Django ORM. Okay, yep. Well, SQLAlchemy, we actually started out with a different ORM called PeeWee.
But for some of our very performance-critical things, we needed... well, we didn't want to go write raw SQL, that's why we use an ORM. And SQLAlchemy allows you to drop down to, like, a mid-level, and still do really strong queries.
I can say it like this: for 90% of the products you'll ever do, Django ORM is awesome. But SQLAlchemy, when you really need to do these weird performance optimizations, and use very Postgres-specific features and stuff, I found it a bit better. But we could have done it with Django ORM, absolutely.
However, we had already decided on Flask, because of simple benchmarking. Flask is quite a lot faster than Django, even if you strip out all the middlewares and whatnot. So, we didn't really have a natural tie into Django, if you get what I mean. So, then SQLAlchemy was a good choice, and I still think it is. Thanks.
Any more questions? Thanks. Could you give a bit more detail on the implementation of your maintenance mode page? Yes.
We're having to do that currently. Absolutely. It's a very simple thing. Thirty-second background of how our deployments work: we basically push things to a bucket, and the servers pull it and update themselves. So, entering maintenance mode is basically, we run the deployment script through Jenkins,
and basically check the maintenance mode box instead. So, Jenkins deploys, servers pick it up, and this takes about 20-30 seconds or so. What we basically do is, during our build pipeline, executed on Jenkins, we have conditionals in the nginx configs, and basically it's as simple as: if maintenance mode, show this page, static HTML.
Any more? And if I were to add anything to these excellent guidelines (of course there are endless such guidelines), what has proven to be useful, especially for our company,
that would be writing utilities for testing servers, just small clients, because you can write unit tests, but unit tests use a prepared environment, which is not very production-like,
so if you can just quickly run your client in production and test what fails, that is also good. And I think making everything deployable with tools like Puppet,
so you can easily just boot a new server and make it build very fast, it is also linked to virtualization, so that is very useful. Cool.
About the profiling, so do you use anything else, other than having this ability to see the live profiling? We do, a lot of things. I just picked this because I haven't seen it that much before, but we're heavy users of New Relic, and we use pg_stat_statements in Postgres, which is an awesome thing,
a very small extension, adds extremely little overhead, less than 1% in most cases, and basically generalizes queries, so, independent of query parameters, it groups queries for you, and gives you things like mean execution time, average, standard deviation, stuff like that.
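For reference, a typical way to pull the worst offenders out of pg_stat_statements from Python; column names vary between PostgreSQL versions (older ones use mean_time/total_time, newer ones mean_exec_time/total_exec_time), and the connection string is a placeholder:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:password@localhost/appdb")

    SLOW_QUERIES = text("""
        SELECT query, calls, mean_time, total_time
        FROM pg_stat_statements
        ORDER BY mean_time DESC
        LIMIT 20
    """)

    with engine.connect() as conn:
        for row in conn.execute(SLOW_QUERIES):
            print("%8.1f ms  (%d calls)  %s" % (row.mean_time, row.calls, row.query[:80]))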
So if you want to really find slow queries, pg_stat_statements; for day-to-day monitoring, New Relic; and that's basically it for performance monitoring, yeah. How do you limit the profiling only to your staff users, I suppose?
That's simple. You have to log in to the system as a normal user, but then we have a little super user flag for certain users that we just put in the DB, so that's simple, and a Python decorator called requires_super_user, so it's only allowed for those users. Sorry, people. Any more?
Awesome. Enjoy your lunch, thanks for coming.