The Times they are a changin’
Formal Metadata
Title: The Times they are a changin'
Number of Parts: 133
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/48934 (DOI)
Transcript: English (auto-generated)
00:07
OK, is everyone in? So I'm going to be talking today about The Times and The Sunday Times: what they have at the moment, what we've been doing, and the new website that they're about to release.
00:21
So my name's Andy. I'm a technical lead at Softwire. I've been working with News UK for the last year. There are four engineering teams on this project, and I've been working mainly with the web team to help them build the new site. So what they have at the moment is a fairly old website. It's not very pretty. It's a bit clunky.
00:40
So it isn't responsive. It doesn't work particularly well on mobile or on tablet. And as well as the front end being a bit naff, the back end's a bit rubbish too. So the architecture doesn't scale particularly well. They have several CMS systems in a line that pump data between each other into a final system that's not very good, and editorial don't really like it. So they also can't build new systems on top of this
01:01
really easily. So they've got the content in one place, and it's very hard to shift that content to another system. So it's hard work for them to get the content from the website to the apps to whatever other products they want to build. So it's not a great system. And as businesses do, they want to redesign this. They want to support all the users' needs.
01:21
They want to make things generally better. And thus was born Project D. Nobody knows what the D stands for. It's not really important. The aims were to merge the Sunday Times and Times websites together. Currently, they have separate systems for both websites, which is very difficult to maintain. You've got two code bases for essentially the same thing.
01:41
They want to bring in a new front end to make it responsive so you can actually view the website on a mobile. They want to bring in new apps as well. So the current apps are, again, quite clunky, quite old. Bring in new apps that can use the data much more efficiently. A new CMS system, so trying to trim down all the legacy stuff, replacing all of the multiple systems that have grown organically
02:00
over the years with a single CMS that can handle print and digital at the same time, making it much easier to share all the content. If that wasn't enough, they're also doing a new ad system to serve better adverts, if there is such a thing, and a new business strategy around how they publish content. So it's both a technical and a business project. They want to make the tech side much simpler
02:22
and, as a result of that, make the business side easier to do as well. So let's think about what journalists think of. They want a really simple system for putting content in. They want to type content once. They want to type it in. And then they want it to appear in front of the users. They don't want to have to faff about with multiple systems. They want it fast. They want it easy.
02:41
And they want rich content as well. So they want to have images and videos and text and podcasts and all these different types of content, new and interactive things like parallax articles. They want to bring all of that in as quickly as they can. They also want to keep the users engaged. So this means working on the latest devices, working on newer browsers, working on all that kind of stuff.
03:01
So making sure the users can view the content the way they want it. As well as moving to a newer design, there's tackling the print legacy. So when you deal with print journalists, they're used to tweaking the headline and the content to fit exactly on a page. They will literally sit there and tweak characters and headlines until it fits exactly right. And this just doesn't work on the responsive web
03:22
because you've got any kind of screen size, any kind of device. And so there's a kind of re-education process to teach them what it means to be responsive: designing a headline might mean a long headline and a short one. You need to educate them that responsive web is new and different. So this is how the journalists think of it. Very simple, content goes in, content comes out,
03:41
users love it. As tech people, we think of it in a much more complex way. So this is what the overall architecture looks like. There are lots of different systems talking to each other. A lot of this stuff is internal, but it integrates with a lot of external systems as well. And pretty much all of this was redesigned from the ground up. I'm not talking about the legacy system here. This is what we've built.
04:01
There's a lot of new technology in place. So everything's based on Amazon Web Services. So it's designed for the cloud, it's designed to scale, designed to build new products very easily, replicate environments, do all that. The legacy system was in Java. We've gone with Node.js for the new stuff. So we're relying on much newer technologies,
04:20
designed to iterate and develop quickly on top of those and build a much better experience. Lots of automation in place. So using Chef to do proper infrastructure automation. We've got complete clones of every environment. So Chef defines an image once and you just copy it straight through. And using NPM for package management and Git for source control. So lots of new tech moving away
04:42
from all the legacy stuff they have, rebuilding everything from the ground up. Proper unit testing in place. So Chai and Mocha is a really good combination for JavaScript, but we're also relying quite heavily on Selenium testing. So we do a lot of unit testing and code coverage to make sure we've got decently well-tested services and we can rely on them, but also Selenium testing to make sure
05:00
that the users actually get what they want and we can release frequently. The old code base released pretty infrequently. They do a release every few months. Our aim was to really shorten that down and get releases out every two weeks at a minimum, and ideally several times a day. So these techs help us move towards continuous delivery. Not there yet, but the idea was that we could release
05:20
to several environments automatically and then test those changes and make sure that could happen. As a business, this is quite a big change. They're used to having these big release forms and you fill in a form and you get someone else to approve it, and then that release happens at some point later when someone schedules it. And we're pushing back on that and saying, no, we wanna release automatically to a UAT environment and then have users look at it every day and give us feedback.
05:41
So these techs help with that, but it's also a business change as well. On the front end, they're moving away from the legacy site. We're moving to HTML5, CSS3, Sass. So they made a great decision to only support newer browsers, so we don't have to support anything less than IE10, which is great.
06:00
We do a lot of work to make sure the Sass output is nice and small. We'll talk a bit more about that later. We're using newer frameworks on the front end, so Ampersand.js, quite new. It's based on Backbone. It's quite good. It means we can do a lot more interesting stuff. Using Elasticsearch as a data store, which gives us, again, a lot of flexibility to move that around. So there's the broad picture of the tech stack.
06:21
There's a lot of smaller libraries that I haven't talked about. Every JavaScript project includes things like Lodash, because everyone uses it, and the same with jQuery. It's in there somewhere. But let's step away from the tech and think about content. So in the news industry, content really is the most important thing. Journalists produce it. They want to do it really quickly. They want to get that story out there,
06:40
and they want to have that happen across multiple devices, so they don't want to have to write it in one system, copy and paste it into another one, hit publish several times, and have this really complicated workflow. They want it to be really, really simple. They also want to reuse and relate bits of content. So currently we find it very hard to take a story from one system and use it somewhere else. They want to make that nice and easy to do.
07:04
So as software developers, we think about content a bit differently to journalists. We've decided that content should be addressable. So every individual piece of content needs to be unique, and this is every single article, every single image, every single video, every single podcast, every single piece of content that exists
07:20
needs to be uniquely identified. It needs to be rich, so we need to support things like bold, italic, headings, and that lets them build these kind of experiences users want. The most important thing is that you can relate it to other pieces. So the articles don't exist by themselves. They relate to other articles. Images don't exist by themselves. They're contained in articles, or they're in galleries,
07:41
and it's really important to be able to relate all these different types. Notice that this doesn't talk about how we're rendering it. We're just trying to describe the content layer itself, and we really want to avoid lock-in with this as well. We don't want to be, oh, this CMS does everything for us, so we're gonna use that. We want it to exist outside of that, so we don't want to end up back in the situation we're in now
08:01
where we have a single CMS that does everything, but it's really hard to move away from. So we need to provide a method of getting this content without talking to a CMS. It turns out that Atom is quite good at this. So the Atom spec is an XML standard. It lets you define pieces of content. It has an identifier, and you can reference that as a URL.
08:22
So you can upload your content somewhere, store it over HTTP, access it that way. It's got all the attributes you need, so you can add things like authors and titles and categories of objects, so we can describe articles and images and all these different types, and even new types that we haven't thought of yet. And it has rich markup, so we can support HTML-like stuff. We can add paragraphs and bold tags,
08:41
everything we need. We don't include styling information in this. There's a really clean separation between the content itself and the layout. So because we're looking at targeting both a desktop app and a native app, we can't include CSS classes and all that kind of stuff. We have to think about a better way of doing that. So like I said, we don't want to embed any styling information there,
09:01
and we don't want to make it too device-specific or anything like that. So making it accessible over HTTP means we've achieved the first two, and restricting the content down means we've achieved the last bit as well. Atom also supports linking between things. So you've got these links at the bottom, link related, where you can link between different Atom documents. These allow us to describe the relations.
09:21
We've separated the relation from actually the usage of it. So you can just link between documents, and that doesn't tell you anything about how they're used or what type they are or anything like that. And this means that we don't actually use the thing directly. We don't link directly to an image. We just link to a file that describes its properties, or we link to a file that describes a podcast or anything. We don't really care about what the content is.
09:41
We just care about the relations between them. And then that separate content type can describe its properties. The image I'm linking to will describe its height, its width, its copyright restrictions, other pieces of data. And the item containing that doesn't know or care about it. This is a really powerful concept that we use to build on.
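To make that concrete, here is a minimal sketch of what one of these Atom entries might look like. The identifier, URL, and field values are invented for illustration; the real documents carry more attributes.

    <entry xmlns="http://www.w3.org/2005/Atom">
      <id>urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6</id>
      <title>Example article headline</title>
      <author><name>A. Journalist</name></author>
      <category term="article"/>
      <content type="xhtml">
        <div xmlns="http://www.w3.org/1999/xhtml">
          <p>Body text with <b>bold</b> markup, but no styling information.</p>
        </div>
      </content>
      <!-- a relation only: the linked document describes its own properties -->
      <link rel="related" href="https://example-bucket.s3.amazonaws.com/content/2e71"/>
    </entry>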
10:01
So this lets us describe arbitrary collections of pretty much whatever content we have, and even content types we haven't created at the moment. So the site initially was fairly simple. We just created articles; all we had was articles and sections, and it was really easy. Pretty soon you want to extend that to images, and you assign images to articles, then you want to extend to videos.
10:21
The underlying content model doesn't change. And that was the real power of it, that we could add these new types without fundamentally changing anything about it. So we said we don't want to be tied into a CMS, which means that we can have multiple systems making these documents. The primary system they're using is Méthode. It's a CMS that does print and digital.
10:42
And for digital, all you need to do is output these documents and store them somewhere. Journalists go through, they create their print article, they'll save it to the newspaper, but copy it across and then just output this. At that point, it's left the CMS. You don't need to go into CMS to edit again. In fact, you can just edit these by hand if you want to. Not that anyone would, but you can. And so that actually means we've completely decoupled
11:02
the storage of the content from the CMS, which is gonna be really useful later on, making sure we don't get locked into a particular system again. Because we're based on Amazon Web Services, we actually just store everything in S3. So S3 is a really, really good HTTP-accessible file store. It means that we don't have to worry about storage or scaling, because S3 will just store whatever you give it.
11:21
We can store hundreds of thousands of documents and it will cost hardly anything. It's really, really handy. So the CMSes output Atom into S3, and there we go. We've got a permanent link; as long as we don't overwrite it, it stays there forever. We've got our aim of having this uniquely addressable piece of content and the links between them. If you're being a bit cynical, you could say we now have just a fairly rubbish version
11:41
of the internet, because we've got web pages and links, but that's a little bit cynical, so, okay. So this is really good. We've got content somewhere, but now we want to actually do something with it. So News UK built a content hub. What they've got is all this content, and they need to tell people about it. So the content hub is a PubSubHubbub
12:02
(it's difficult to say) system based on a pub/sub protocol for distributing content to subscribers. So we've decoupled, again, the CMS from the people that are interested in it. The workflow is that the CMS will write the content to S3, and then publish the content over the hub. The hub has topics; you publish content on a particular topic,
12:21
and anyone interested in that will subscribe. So any downstream system can subscribe to the hub and get notified when a particular piece of content changes or is added or updated. And because we're not tied to a particular CMS, this means multiple systems can push content as well. So the primary system pushes content, that's great. If the primary system fails, the disaster recovery system pushes content, also great.
12:41
Downstream systems don't know or care. All they care about is the fact they're receiving updates. We actually have multiple content creation systems in place, we've got one for articles, a separate one for producing puzzles, a separate one for producing marketing content, they're all completely isolated systems, and our downstream systems don't care because we've got this PubSub mechanism in the middle, which is really powerful now,
13:01
but it's gonna be even more powerful in the future when we wanna change CMS again or introduce much more specialized systems. So as a consumer, we don't really care where this content comes from, we just subscribe to a topic and the hub has mechanisms that kind of retry and on failure and managing those kind of subscriptions so that downstream systems connect once and then they know they're gonna get updates from that.
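As a rough sketch of what a downstream subscriber might look like (Express is part of the stack described here, but the endpoint name, port, and ingest() helper are invented for illustration):

    // Minimal hub subscriber: the hub POSTs updated Atom documents to the
    // callback URL we registered with it for a topic.
    const express = require('express');
    const bodyParser = require('body-parser');

    const app = express();
    app.use(bodyParser.text({ type: 'application/atom+xml' }));

    app.post('/hub-callback', (req, res) => {
      res.sendStatus(200);   // acknowledge quickly so the hub doesn't retry
      ingest(req.body);      // hand the Atom payload to the ingestion pipeline
    });

    function ingest(atomXml) {
      // parse the XML, convert to JSON, walk the related links...
    }

    app.listen(3000);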
13:21
So think about our overall architecture. What we've got now, we've got the left-hand side covered. We can produce content, we can create content, we can create it from multiple systems, we can store that content, all different types in different buckets, and then we can broadcast that content out. So content over the middle, PubSub system, push the content out to whoever's listening.
13:43
So let's think about what we can actually do with that. So we know something's listening because we've got a website at the front; we want the content to be there. Content ingestion is what we call this process. We take multiple documents that are linked together, we subscribe to that, and we create this in-memory tree structure of them. So we have a service that sits there,
14:01
listens to everything that comes in and builds this in-memory tree based on a particular document. So you send it a root-level section and it will crawl all of the links to find out everything it links to and build this kind of tree in memory of all of that. So this gets quite complicated when you think about it, because you can build a tree in memory but you might have articles that relate to other articles that relate back to the first one.
14:21
So you need to be quite careful that you're handling things like cycles and broken links and making sure that you don't end up infinitely recursing all the way down. So when you're designing this, it actually sounds very simple, but it's quite tricky to walk this tree and build up this in-memory object, because it can get quite large. Some of the editions we have contain hundreds and hundreds or thousands of documents. So these things can get quite large.
14:41
We need to be quite careful to make sure we're doing the right thing with it. Atom isn't the easiest format to work with, particularly in Node. So we convert everything to JSON, which is much easier for us to work with. Again, other systems can use Atom if they like. We're gonna stick with JSON because it's easy for us. Because this is all HTTP-based, it's really easy to push new content in.
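The cycle handling described here might look something like the following sketch, assuming a hypothetical fetchDocument() helper that GETs an Atom document from S3 and returns it as JSON:

    // Cycle-safe crawl: a shared visited set stops infinite recursion when
    // articles relate back to each other; broken links are skipped.
    async function crawl(url, visited = new Set()) {
      if (visited.has(url)) return null;     // already seen: break the cycle
      visited.add(url);
      let doc;
      try {
        doc = await fetchDocument(url);
      } catch (err) {
        return null;                         // broken link: skip, don't fail the tree
      }
      const children = await Promise.all(
        (doc.links || [])
          .filter(link => link.rel === 'related')
          .map(link => crawl(link.href, visited))
      );
      return { ...doc, children: children.filter(Boolean) };
    }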
15:02
So in a lot of systems, it's very hard to take content from production and put it in your dev database and have a play around with it. For us, it's really easy. We are a hub subscriber. We accept an HTTP POST, meaning we can just post today's content into our development databases from the command line, meaning it's quite easy to develop against today's content. Bug on the homepage? It's really easy.
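Replaying production content into a local environment can then be as simple as a single request; the filename and URL here are invented for illustration and match the subscriber sketch above:

    curl -X POST -H "Content-Type: application/atom+xml" \
         --data-binary @homepage.atom http://localhost:3000/hub-callback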
15:20
Suck the data in, look at it locally, fix it, develop a fix. So it's really, really easy to move content around, because we're not calling a particular database. We're just waiting for content to be given to us, which is really flexible as a developer. So we use GUIDs to identify things rather than integers. So in theory, we can have this huge array of content. It means we can also have multiple environments
15:41
of content sat side by side, and everything is identified uniquely. So we take this in-memory tree, and that's great for the ingestion service, but then we need to do something with it. So we flatten that out, and we turn it into multiple separate documents and store all the children and ancestors of those. It's gonna make more sense in a minute.
16:00
This is what we do at the moment. So you take the in-memory tree object, and you say: this relates to this, this is how it's positioned in the tree. The example I've got is fairly simple. You've just got one related to two related to three. But you can see how this could get a lot more complicated when you have multiple inbound parents. So if you've got a really popular article that 10 other articles link to, you need to store the ancestors of those and make sure they update correctly.
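A sketch of that flattening step, with invented field names, might look like this:

    // Flatten the tree into one document per node, each recording its own
    // children and ancestry so small updates can be merged independently.
    function flatten(node, ancestors = [], out = []) {
      out.push({
        id: node.id,
        body: node.body,
        childIds: (node.children || []).map(child => child.id),
        ancestorIds: ancestors,          // e.g. [editionId, sectionId]
      });
      for (const child of node.children || []) {
        flatten(child, [...ancestors, node.id], out);
      }
      return out;
    }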
16:22
Because we've atomized everything, pardon the pun, it means that we can actually update everything individually. So the problem with publishing a big fat document that says here is today's content, it's a huge big thing. If you want to update something inside it, you have to republish the whole thing again. Whereas breaking this down means we can just publish the very small thing that's changed and update that,
16:42
and then merge it with what's there already. So if we want to just update three, we don't need to republish one and two, which is really important when you've got these huge collections of content. You just wanna publish the small thing that's changed. And the hub and atom together make that really easy for us to do. So we've got these documents now. We've flattened everything out.
17:01
We know where everything lives. And we're gonna store those. So we take those and we store them in Elasticsearch. Elasticsearch isn't an ideal database for this. It's not an ACID-compliant database. It doesn't have all the nice properties you want. But it's pretty useful in other ways. It's JSON-native and it's a document store. So it's really good for storing these documents
17:21
and doing updates. And you can do some trickery to make those updates look transactional. It's not actually transactional, so if an update goes wrong, you're in a bit of a mess. You have to try again and just keep pushing the data. But because it's not the authoritative store (that's S3), it's okay to just retry, and retry, and retry until you get the right state at the end of it.
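That retry-until-consistent approach might look something like this sketch, using the legacy 'elasticsearch' npm client of the era; the index and type names are invented:

    const elasticsearch = require('elasticsearch');
    const client = new elasticsearch.Client({ host: 'localhost:9200' });

    // Keep pushing the document until it sticks. S3 remains the
    // authoritative store, so a full replay can always rebuild this index.
    async function storeDocument(doc, attempts = 3) {
      for (let i = 1; i <= attempts; i++) {
        try {
          await client.index({
            index: 'content',
            type: 'document',
            id: doc.id,        // the GUID from the Atom document
            body: doc,
          });
          return;
        } catch (err) {
          if (i === attempts) throw err;
        }
      }
    }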
17:44
The advantage of using it is that during development and early in the project life cycle, it's schema-less. So you just keep pushing content in. It means you can just push in new structures and it's really quick and really easy. So for us to develop on, it was really quick. We didn't have to worry about table formats, or columns, or data types, or anything. We just keep pushing new formats in
18:00
and changing them as we please. Once you hit production, it gets a little bit harder because you have data in old format. And data migrations in Elasticsearch, we discovered painfully later on, are not that easy. You effectively have to take all the data, take it out, and then push it back in in a new structure and transform it all. So we had to go through that process a few times
18:20
before we go, hang on, we probably shouldn't do this. So we had a bit of pain trying to actually migrate the data around once you get into production. You can't just wipe your database every time. So there are a few pain points for Elasticsearch, but for what we needed for it, it's pretty good. And it's pretty fast as well. So we've got that, we've got Elasticsearch.
18:40
It's read-only, so even if we lost our entire database, we can recreate it just by pushing all the content back in. So for our purposes, it's pretty good. If you think about this, it's actually a graph. So the content as a whole is actually a big graph model. And if you take all of today's content, the red blob in the middle is the homepage, and all the content spanning off that
19:00
is all the articles and sections and books and things like that that go around it. So this is actually a directed graph. And we can think about the data like this. So if you take the dot in the middle and walk everything out, you'll do the same as walking a link from an Atom document. This is actually produced from OrientDB, which is a technology we looked at to see whether it would work. And it is pretty good.
19:21
It really does do some of the stuff we need. But because we're not actually working with it as a graph model, we don't need all the complexity of graph queries. So we don't need to work out, say, average distances between nodes or shortest paths or anything like that that a graph database does really well. All we're really interested in is: I want this piece of content, and just get everything else that's alongside it. So it's a very simple set of scenarios.
19:41
So we've got our content. We've got a content store for it. We need to think about getting that content out. So we've got an API for this, as everyone does. A core API sits on top of Elasticsearch, and you query all the content and just go: give me everything, give me everything. It's all REST-based, because we're in the modern world. We develop things with REST throughout.
20:02
And it's all JSON-formatted. So you've got all the standard features: documentation for endpoints, URL versioning, data versioning. It's all HTTP, so you can just retry very simply. You could point out that querying by ID is basically very similar to querying what we had before.
20:21
So getting a document by an ID is very similar to what we had at the start of Atom. The key improvement is that we've now got all the nested things in a single response. So we can retrieve all the children at the same time. And we can also filter and query those. So you can get things, rather than just a single document, you can get all the documents of particular type, all the articles that were published today,
20:41
all of the homepage things from three days ago. You can get them all in a single query. So we've got the same data, but we've got a much more flexible way of querying it.
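As an illustration of the kind of calls this enables (these exact paths and parameters are invented for the example, not the real API):

    GET /v0.2/document/9b2e...          one document plus its nested children
    GET /v0.2/documents?type=article&published=2016-03-23
                                        all of a day's articles in one query

    {
      "id": "9b2e...",
      "type": "section",
      "title": "News",
      "children": [
        { "id": "c41a...", "type": "article", "headline": "..." }
      ]
    }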
21:03
As with ingestion, you need to think about recursion. Rather than just getting a flat view of the data, what you've actually got is cycles within this big graph, particularly articles related to each other. And this makes designing the API quite tricky, because you can't just get stuck in infinite loops every time someone happens to query a related article. So you have to be quite careful to design one that lets you recurse down as far as you need and then stop, bearing in mind any filtering and things you've got. When we come to do this, you'll notice the endpoint is actually document. It's not a particular type.
21:22
So we're not querying for articles, images. We're saying we want a piece of content. We don't really care about the type of that. And when you get a piece of content back, it might have any content underneath it. So we've got a really flexible system that just lets you push new content in and then just start adding new types. And the system doesn't really care. It lets you add new fields dynamically, add new types dynamically, which is really flexible,
21:40
lets them build something for the future. So yeah, we've got mixed types in there. And so that's good. So we've got an API. That's really handy. Of course, when you build an API, you need to load test it. And we've got quite a lot of load on our system, particularly during big events: when, say, David Bowie died, you get a huge spike on the site.
22:00
People want to read articles about it. So you need to be able to perform very quickly. We discovered on the way that caching is quite hard to do well. So we had a couple of goes at building a cache, relatively successfully. The first one we had was just the very simple built-in Express static cache. It's a URL cache. You say: yep, 10-second expiry time, so we want nice, quick refreshes,
22:20
and we'll go with that. Problem is, this hits the lovely thundering herd problem. If you haven't heard of this, it's where on a cache hit you serve the content from the cache, but on a cache miss, every single request goes through and hits the database. Which means if you're looking at something like several hundred requests a second, it just falls over pretty much immediately as soon as you have a cache miss, because you're requesting the same content over and over again. It just goes blarp.
22:41
So it all falls over. That's not so good. We don't want this, particularly when we have NFRs to hit. So the first simple built-in Express cache: not so good, especially when you have fairly long response times. So we took advantage of some newer Node technologies. Streams in particular, introduced in 0.12 and much improved in version 4, which is the version we're using.
23:01
So we built a cache using those, and it's kind of a newer one. It's open-sourced, which is really good, and we built on this. What makes this different is when you have a cache hit, it's really good. You just serve the content straight back from your cache. When you have a cache miss, you let one request through, and then you hand a stream back to every other request that comes in.
23:20
And so you have multiple requests while this one request is going through, and they all just get given a stream, and then they sit there and wait. When the first request that created the stream finishes, its data comes back, the cache is populated, and you can then stream the data out to all the clients that requested it in the meantime, meaning you don't have this thundering herd problem anymore, because all the requests just queue up very nicely behind it.
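The core of that idea might look like this sketch; it keeps everything in memory and ignores expiry, and the real open-sourced middleware is more involved:

    const { PassThrough } = require('stream');

    const cache = new Map();     // url -> Buffer of a finished response
    const inflight = new Map();  // url -> PassThrough streams that are waiting

    function cachedFetch(url, fetchStream, res) {
      if (cache.has(url)) return res.end(cache.get(url)); // hit: serve directly
      if (inflight.has(url)) {
        const wait = new PassThrough(); // miss, but a request is already out:
        inflight.get(url).push(wait);   // hand back a stream and queue up
        return wait.pipe(res);
      }
      inflight.set(url, []);
      const chunks = [];
      const source = fetchStream(url);  // the single request that gets through
      source.on('data', chunk => chunks.push(chunk));
      source.on('end', () => {
        const body = Buffer.concat(chunks);
        cache.set(url, body);                           // populate the cache
        for (const w of inflight.get(url)) w.end(body); // release the queue
        inflight.delete(url);
      });
      source.pipe(res);                 // stream to the first requester
    }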
23:41
It's also a lot more performant. Node streams are very fast, and they let us stream a huge amount of data. It's always a bit misleading when you run performance tests on a dev machine compared to when you run them in the cloud. When we did it locally with Apache Bench, you get something ridiculous, like several thousand requests a second without even trying. When you run it on an actual environment where you've got load spread out,
24:01
it is a bit slower than that, sadly. So we've got a cache in there. We've got some new node stuff that's pretty cool. We're using streams, which are really nice, so we can now redesign our API to take advantage of these. We're not dealing with objects and memory anymore. We're thinking about streaming the content through, which is really efficient and helps us do things like back pressure and designing a system for that.
24:20
But we were not quite finished. Run your performance tests again and you run into more problems. So, pipeline ordering. Express middleware is pipeline-based, so it's not like the old ones; it's like the newer ASP.NET MVC, where you register middleware in an order and then it runs them in that order. And if you're not careful about the order you register them in,
24:41
you have problems. So our initial registration went, I've got some database, some logic, I cache, and then I gzip. Our content's quite large. It's around two and a half meg for the home page response because there's a lot of content in there. And repeatedly gzipping that file from two and a half meg down to about 300K chews up all your CPU for no reason. So our first version did that.
25:02
Chewed up all the CPU, bit silly. Reordering the pipeline, actually doing gzip and then caching the gzipped response, and making sure we match on headers, means we can get a very efficient pipeline where we're serving exactly what we need to to the client. So at the end of this, we've got a pretty decent cache system. We're quite happy with it. It's nice and fast. It serves the content we need to. But we've actually ignored the hard problem,
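One way to arrange that ordering in Express, using the real 'compression' middleware plus the hypothetical streaming cache from earlier; registering the cache first means that, on the way out, it sees and stores the already-gzipped bytes:

    const express = require('express');
    const compression = require('compression');

    const app = express();
    app.use(cacheMiddleware());   // hypothetical: stores the compressed body,
                                  // keyed on URL plus Accept-Encoding
    app.use(compression());       // gzip each response exactly once
    app.get('/edition/:date', (req, res) => {
      // ...query Elasticsearch and send ~2.5 MB of JSON...
    });

    // On a cache hit the middleware replies itself with
    // Content-Encoding: gzip, so nothing is re-compressed per request.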
25:21
which was cache expiry. So we'll come back to that later. We learned quite a lot when we built this. Versioning APIs and data was really important from the outset. We built a versioning system into the URL. So you have version 0.1, version 0.2, version 0.3, which allowed us to build a new version while
25:41
maintaining the old one, which means we can try ideas out. If it doesn't work, never mind. We'll not use that. If it's really good, we can migrate everyone to it and deprecate the old one. So we went through several versions of the APIs we were building. Even though we were building a product just internally, we went through API versioning before we even released it to the public. We also did separate documentation for each version to help people migrate
26:00
and understand the changes. And that was really easy and really important, because we had lots of early consumers. We had a separate Android team building a native app, trying to use the APIs as we were building them. And had we not had versioning, that would have been much harder than it was. As it was, it was pretty hard. And they yelled at us a lot for breaking it. But without versioning, we would have been even more stuck. So even when you're building an app internally,
26:21
it's really important to version things early on. Versioning APIs is different to versioning data. It's important to distinguish between the two. Versioning APIs we think of as putting a version number in the URL. That describes the structure of your API in the first place. Our first one had just article as its only endpoint. Then we had article and edition as our only endpoints.
26:41
And the later one has document and documents, an entirely different way of describing the API. So we version that structure first. But the second thing is to version the data. And that we do with headers. So you specify a particular application data version in the header, and that will give you the data back in the right format. This is really important, again, for early consumers,
27:00
because the data model changes quite a lot. And they want to have fields kept in the same way so that we don't break their live demos, which we have done. And they got very angry about that. So we learned from that and built data versioning in again from the ground up to make sure that doesn't happen in production and we can support multiple versions early on. It also encourages you to be quite strict about your process for doing that.
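A sketch of how the two kinds of versioning can sit together in Express; the header name, routes, and helpers are invented for illustration:

    const express = require('express');
    const app = express();

    const v01 = express.Router();   // old structure: /v0.1/article/:id
    const v02 = express.Router();   // new structure: /v0.2/document/:id
    app.use('/v0.1', v01);          // URL version: the shape of the API
    app.use('/v0.2', v02);

    v02.get('/document/:id', (req, res) => {
      // Data version: the shape of the payload, negotiated per request.
      const dataVersion = req.get('X-Data-Version') || '1';
      res.json(serialize(loadDocument(req.params.id), dataVersion));
    });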
27:20
Even early on when you're doing releases, you want to get those releases reliable so that you don't break other people that are using your stuff. So versioning also lets you concentrate on what's actually important. Early on we made the mistake of trying to build API endpoints for everything, and there was a huge proliferation of them. And really, it wasn't necessary when you think about it. Versioning helped; we then focused
27:41
on what was actually important and just trimmed it right down at the end and threw away all the early versions. So all the early versions we had in our code base have just been deleted. They're gone. We don't use them anymore. And that was something that was really useful for us to do. So in our overall architecture, we've got the left-hand side done still, great content. We've now got somewhere to store it for the new website.
28:01
So we've got ingestion service. We've got data storage. We've got an API that has that. And so we're getting closer. So now what we need to do is think about giving that content to consumers. So we need to take their API data, get it out there so people can start looking at it and enjoying all this content. So we're most of the way through. Atom documents have particular types.
28:21
So when we actually come to render the content, this is the kind of structure we use. We have the edition at the top, which is what we call the content for a particular day. So our test data for some reason was the 23rd of March. And that's what we're going to go with. The edition contains multiple sections. And there's what you're kind of used to on a news website. You've got news and sport and business and opinion, all that kind of stuff.
28:41
Inside the section, we don't just have articles. We break it down into these slices, which are a sort of horizontal slice of the website, so you've got these kind of vertical stacks that you're building up. And that lets you do some positioning stuff that I'm going to talk about in a sec. Inside slices, you have multiple articles, kind of one to four articles within those. And then you group those down. So I've only shown a few here,
29:01
but a full section and full newspaper is quite long. Those articles themselves will have images, or they might have videos, or relate to other articles, but we'll keep it simple for this. We take these structures and then we think about the nesting of these. So edition, section, slice, and article lets you produce a kind of structure like this. You've got the outer homepage, the sections and the slices within those,
29:21
which are these vertical containers. And then you've got the articles within those. This is exactly like the DOM. It's really, really easy to render. So you've got this nested JSON object and you've got Dust.js templates. You just kind of render a template for each section. You go down a level, you render a new template, down a level, render a new template. So it's really easy for us
29:40
to actually build this very recursively. And it's really easy to preview as well. So one of the features that the users asked for was to be able to preview any particular part of the site. So they want to preview just the news section or just these four stories. And that was really easy to build because we'd done this splitting out of the sections into templates, and that was really powerful.
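The nested shape might look roughly like this (field and template names invented for illustration); with Dust.js, each level's template just renders a partial for its children, e.g. {#slices}{>"slice"/}{/slices}, which is also why previewing any subtree is just rendering that node's template with the node as the context:

    const edition = {
      type: 'edition',
      date: '2016-03-23',
      sections: [{
        type: 'section',
        title: 'News',
        slices: [{
          type: 'slice',
          template: 'lead-1',   // which of the ~20 slice designs to use
          articles: [
            { type: 'article', headline: '...', images: [/* ... */] },
          ],
        }],
      }],
    };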
30:00
It also means we've got a very modular front end. So you can design things that just affect the section without affecting the things inside it or without affecting other content. And that was really useful to be able to do as well. So these slices, you've got a couple of choices here. Editorial have a lot of slices to choose from. So the design team went away
30:21
and designed loads and loads and loads of them. And then we implemented a few of those. So design got a bit carried away. We implemented about 20 for launch. There's a lot of big ones on the left-hand side. They're called the lead slices. They've got kind of a big news story at the top, huge image, big headline, very bold graphic. And they use those for the leading story of the day,
30:41
whatever it's gonna be, they'll use that at the top of the website to get impact. They've got the secondary ones in the middle. So these are kind of your groups of four or five or three, whatever stories they're using. And then they've got very specific ones on the side as well. So they've got things like obituaries or engagement announcements or focus or opinion slices that they use to kind of build up these different types of content.
31:01
And these are actually based on what's in the print edition as well. So we talked before about how we want to reuse the content across multiple systems. What we're doing as well is making the website content much more similar to the print content. So when you've got an opinion section in the paper, you've got a very visually distinct opinion section on the website as well. That's something they really wanted: that kind of similarity between the two.
31:23
So we're gonna have all these different types to choose from. They're all responsive. So each horizontal section responds down based on your screen size. You've got the desktop on the left and the mobile one on the right. And they just kind of respond down nicely. So everything's built to respond down to different screen sizes and it kind of squishes down and you can just build your home page. You don't have to worry about what order they go in.
31:42
You can just reorder these as you want. So this is what editors are using. Each slice has an ID; they just pick a template, and that's really nice. So pair this with a style guide. The dev team built a style guide that describes the responsive grid they have. It's based on Susy, and they've got a sort of standard column-based layout,
32:01
12 columns and six columns for mobile and content is described as going across those. So it's a fairly standard responsive pattern and they've designed all the individual components for that as well. So inside your slices and articles, you'll have things like a separate headline or a pull quote or different things like that. And we're still building this out as we go. We only support a couple of components at the moment but the idea is it'll support a lot of things later on
32:21
and they'll be usable across the site. So this is how we do styling, and theming builds on top of this as well. Particularly with the Sunday Times, they're very interested in having very visually distinct content. They wanna have their magazines look really different, so they have different fonts, different colors, different layouts, and this supports that as well. So you've got a kind of very plain basic module
32:42
and you can apply styling and theming to it. Unfortunately, I'm not allowed to show you the homepage because it's still in beta and nobody else has seen it either so there you go. But if you were to see the homepage, it would look something like this. So we've taken the slices, we've composed them together and you can imagine if you apply the style guide to these, you end up with a kind of really distinct homepage
33:00
and this is what it looks like on desktop, tablet and mobile. Like I said, it's a huge step forward from what they have now, because it's a proper responsive design. It's very visually distinctive. It's completely flexible. They can reorder these things. Every day looks different. So they've got this hugely flexible system and they really like it. On the front end, there are lots of really nice features that I can't show at the moment, things like touch-friendly controls.
33:20
It's much easier to use on tablet. The carousels are all really nice. There's a lot of polish. The nav bar is very easy to use. It's got lots of nice animations. We're doing our best to fight website bloat. There's a lot of articles about how websites are all massive. We're currently at about 850K for HTML, JavaScript, CSS and most of the static images, and that's coming down
33:41
because we're still optimizing it. So we're getting there, to a nice small core website, and there are definitely other things we're thinking about, so we're not going to release these huge, humongous pages, because nobody likes that. We've also got some experimentation with things like Polymer and web components built in. So for a lot of the interactive content we build,
34:01
we've tried to use polyfills so we can use web components, so they can bring in that content and try using it on the homepage, things like that. So there's a lot of interesting stuff going on here. So that's desktop. This is all really good. But I said at the start, we're building native apps as well. So, the mobile API. We've got our desktop site built, but we also need to distribute content to mobile.
34:22
The native apps are designed to work a bit differently. The only thing really distinguishing them is the fact that they work offline. On your phone in the morning, you want to have the news there when you wake up: you get up, you go on the tube, and the news is already there. You don't need to wait for anything to download. So the mobile API is designed to do that. It's designed to use GCM to send the data, and because our pipeline is push-based,
34:42
the way this works is really neat. We consume some content. We then say new content is available. Mobile API watches for that event, captures it and then downloads all the content, bundles it all up. It also bundles up all the images into a tar file so you get these compressed images specific for each device. They build a separate bundle for each category of device
35:01
and they then send out a notification to the device to say your bundles are ready, download them at seven o'clock in the morning before you actually wake up and go on the tube if you turn notifications on anyway. So if you do that, you wake up, all this stuff is on your phone. But of course, it's triggered by GCM and we trigger everyone at the same time. So that's hundreds of thousands of subscribers leaving their phones on overnight and triggering it.
35:23
So at seven a.m. exactly, the bandwidth usage goes from zero to around 10 gig a second for about two minutes and then back down to zero again, which is surprisingly difficult to do. So we generate these bundles. We've got those. We don't really wanna put them on our own servers, because hosting that on AWS, you get a huge cost.
35:41
We have to scale up very largely and then scale down almost right away. It's very silly. So the mobile API is actually more generic. It's more like a batch API. So we trigger it, we give it some event, and it builds these bundles as a batch and sticks them in a static file store. So we use S3 and a CDN to actually do that and we have a content auth system in front of that that actually lets us just put these bundles in
36:01
and then we don't have to worry about how we're gonna scale up for all these subscribers. We just put the bundles in the file store and then they go. If it was just bundles by themselves, that would be quite difficult. But because again, this is push-based, if we update a piece of content, so say we have to retract an article for legal reasons or make a modification to a headline or fix a typo, again, no need to do anything special.
36:20
Just push the content, the whole workflow kicks off again, new bundles get updated, notifications get sent out, phones download it without the user having to do anything. So the system is really flexible and really scalable, making sure that we get all the content at the right place at the right time. So that's the mobile API. Let's just build native apps, let's just get the content in the right place. That's really handy.
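A sketch of that push-based bundling flow; the event name, bucket, and helpers are invented, with the real aws-sdk and the node-gcm client standing in for the actual services:

    const { EventEmitter } = require('events');
    const AWS = require('aws-sdk');
    const gcm = require('node-gcm');

    const events = new EventEmitter();  // stands in for the hub subscription
    const s3 = new AWS.S3();
    const sender = new gcm.Sender(process.env.GCM_API_KEY);

    events.on('edition-published', async (edition) => {
      const tarball = await buildBundle(edition);   // images + JSON, tarred
      await s3.putObject({
        Bucket: 'mobile-bundles',                   // fronted by the CDN
        Key: 'editions/' + edition.date + '/phone.tar',
        Body: tarball,
      }).promise();
      // Tell the devices a new bundle is ready; they fetch it overnight.
      const message = new gcm.Message({ data: { edition: edition.date } });
      sender.send(message, { topic: '/topics/editions' }, (err) => {
        if (err) console.error('push failed', err);
      });
    });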
36:41
So there we go. We've got pretty much the whole site. We've got content being created on one side, flowing through to our systems, being rendered on the front end of HTML to devices and being pushed out of the mobile API and down to the apps through GCM and other triggering mechanisms. There's a lot of third-party stuff I haven't talked about,
37:00
so there's integration with ads and analytics and the video provider and commenting and all that kind of stuff. So there's all these other third parties that link in and provide additional services. It's all fairly modular; you can drop those in and use them. We have our own system called ACS, which does subscriber auth, and that integrates with the CDN, so you don't have to wait for anything to happen.
37:21
The CDN actually does the authorization for you, which is really handy. Because this is based on AWS, we've got built-in monitoring and logging through the services they provide. So New Relic does application monitoring, that's really good, but we also have CloudWatch to do actual logging-level events and infrastructure health monitoring
37:40
and all the things like that. And there's separate tools for doing that as well. So for the infrastructure as a whole, we've got a pretty good picture of when it's performing well, when it isn't, and when we've seen demand and services fall over. Because it's AWS, everything also scales, so if machines fall over, they get brought back up automatically; they scale out as we hit peaks and need to. The CDN manages the bulk of the static assets, but when we have API requests,
38:01
particularly for things like user collections, they can scale up as needed. Similarly with the data store, Elasticsearch: that's clustered and you can add new machines to it. So everything works in a very nice way and it's all quite easy to think about. So, quite a long project. We did learn quite a few things along the way.
38:21
Building a new product is a very different mindset to incrementing on an existing one. It requires a different way of thinking about things and being able to unlearn what you already know. In my case, it was really helpful because I'm coming into the company, I don't know anything about what they do. I knew nothing about publishing. And so for me to come in and go, why do you do that? What's happening here was useful.
38:41
And having the mix of domain expertise and someone who doesn't know anything but has done a lot of development builds a really useful product at the end of it. You need to think about features versus prioritization differently. It's different from just building incremental feature releases to taking an idea or a prototype and turning it into a production-ready product. And that was surprisingly difficult to go from
39:02
"it runs on my dev machine" to "it works reliably in the cloud". There's a lot of work there. And the infrastructure automation to do that is really difficult. So everyone thinks AWS is a magic bullet: you suddenly get scaling, hooray, easy. That's not the case. We spent a lot of time investing in making sure our environments are repeatable and reliable, and that when you hit deploy on the infrastructure,
39:21
it doesn't just trash everything, which it has done. And making sure that we use CloudFormation properly and Chef properly and all that kind of stuff, so that we can release things like new versions of Node without having to worry about whether it will work in just one environment. So there's a lot of work in infrastructure automation. We've spent a lot of time doing DevOps-type things, which is really interesting,
39:40
and getting that to a point where we trust to do everything fairly reliably. Similarly, re-platforming is hard, not just from a tech point of view, it's not too bad, but you also need to think about the business side of things. I was quite lucky in that I only got to focus on the tech side of things. Other people were looking at how do we retrain 400 people to go from one CMS to another one in time to line up with the tech deadlines.
40:01
And for a large company, that's a really tough thing to do. So there's a lot of difficult bits there that exist outside the project. For us, it's very easy to say we'll just create a tool for that, but the downstream effect of we need to train everyone on it, we need to make sure it's secure, we need to make sure everything else happens, it's quite tough. So we're doing quite a few things on the way. It was a really good project for everyone concerned.
40:23
What's next? So this is the initial launch of the product. We've got a website that works, it's got the core features they wanted, but there's a lot more that we can do. So initially when the project scope came out, it was several years' worth of work. They basically wanted to do everything the current website had and a whole bunch more.
40:42
And that's the way with all products: they come in and say, right, must do everything. And we convinced them to cut that down to what was actually core. So we've got this core API now, which is a new thing. They don't have this core API in any of their current products; they've got kind of very static feed-based ones. So using that to build new products: it's something the business has wanted to do for a while, to have APIs of their content and be able to use that. But now they've got this.
41:02
Similarly, dynamic content. So we talk mainly about articles and images and kind of nice plain ones. But bringing in things like interactivity, parallax articles, single page apps that you can view on the desktop and on the phone would be a really interesting feature. So there's a lot of people thinking about how we can do that in a really nice way. We talked before how to graph structure.
41:20
We're using Elasticsearch; there's a use case there for a graph database. So it'd be really interesting to see if we can actually move that content into a graph database, get the same things out as we do now, but query it in more interesting ways and do more interesting things with the content. There are also ideas like pre-rendering HTML, as Netflix does. We talked before about how you only push content several times a day. That means you could pre-render most of the site.
41:41
We don't do it at the moment. We just cache things and apply cache headers and push updates. But actually a lot of the site is pre-renderable. And so we can actually go from the point where we have everything on request to actually pre-rendering the bulk of it and then just combining things as we need to. So there's a lot of interesting stuff that's gonna come up. That is the end. I think I've got a bit of time left.
42:00
Does anyone have questions? No, cool, all right. Thank you very much.