
Have content quality, will search your Intranet


Formal Metadata

Title
Have content quality, will search your Intranet
Title of Series
Number of Parts
61
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
TriMet runs the public transportation system for the city of Portland, Oregon, and the surrounding area. Over several years, TriMet's Plone-based intranet had accumulated lots of content, and the built-in search was not working very well anymore. For this case study, I will show how we solved the problem by focusing on content quality. Faceted search was helpful both in the push for content quality, as well as in the final search functionality. These ideas can help any Plone site, large or small, and should be considered for additional default features of Plone.
Transcript: English(auto-generated)
I think some of you missed the "On My Bus" song by Plone, the band that shares its name with the CMS. I suggest you google it.
Okay, so this morning I want to talk to you about how TriMet improved their search.
They were having challenges with search and I was called in to help with that. So who is TriMet? TriMet runs the entire public transportation system for the metropolitan area of Portland, Oregon.
Just a few numbers, they have a budget of $507 million. They cover a pretty large area, lots of people and they are a freestanding government agency that is not part of the city or the state.
They run a lot of vehicles, buses with lots of buses, lots of stops, lots of bus lines.
The MAX, which is the light rail, sort of like a long distance street car with lots of vehicles, five rail lines, lots of miles of service and stations.
The Westside Express service is actually a full sized train for commuters, which has an almost 50 mile service line with five stations and six vehicles.
And then they run LIFT, which is their paratransit service.
And they run the Streetcar, which is owned by the City of Portland but is essentially maintained and operated by TriMet. I know some of you went to PyCon in Portland this year and last year, and you will have experienced how awesome the public transportation system in Portland is.
It is really one of the best in the U.S. in my opinion. In terms of trips, people take 101 million trips every year.
In terms of coverage, or the size of the transportation system, it is ninth per capita in the country. Even though Portland is a relatively small city (the 24th largest in the U.S.), its transportation system is ninth, which puts it on par with Atlanta, a very large city.
And I doubt that Atlanta has such a good system. Talking about employees, we get to the point where we start thinking about users for our systems. So TriMet has 2900 employees.
90% of these work in the field. So they are bus drivers, train conductors, operators, supervisors, maintenance workers, construction workers.
So they do need access to their intranet, but they don't access it very frequently. Unlike the 10% who are administrative and sit in an office and presumably have access to a computer all day long, the rest of them don't.
So the majority of the employees have a basic member role and authenticate via web server auth.
There are just about a dozen, or maybe two dozen, content developers, who have the Editor role. So that gives you an idea of the user base.
So if it's not in the browser, we don't really know what to do with it. So let's talk about their IT infrastructure. The public-facing site, trimet.org, is not actually Plone.
If you go there, that's not Plone. Also not Plone. You can imagine an organization of this size and complexity has a lot of IT systems, internal web apps and internal sites, and they do have Plone.
They run Plone sites, and before this project they had over five Plone sites. I say "over" because five sites were actively being used, and there were some others that were basically abandoned, so we got rid of them.
Of those five, three were blogs. One was a knowledge base, sort of a document management system, and it is still the repository for the technical documentation and manuals for all of the technical hardware that they have.
So for a train nerd, this is paradise. You can find information about anything concerning trains, buses, communication systems, anything you want.
It's just awesome. Those were the other sites; and then they have the intranet, which is what the majority of this talk is about. After this project there were two sites left, because the three blogs were merged into the intranet.
The knowledge base was upgraded from Plone 3-something to Plone 5, the latest at the time. The reason they were stuck on 3 was that they had some Plone4Artists add-ons installed,
and so they could not upgrade, as you are probably all familiar with. And I would like to thank Nathan van Gheem of Wildcard for wildcard.fixpersistentutilities;
it saved the day many times. The intranet was already on Plone 4.2-something-or-other, so we upgraded it to the latest. By the time we were done there were actually four blogs, and they all got merged using collective.lineage.
While this project was in process, the four blogs kept looking exactly the way they did before, just by having a sub-skin for their sub-site in Lineage, which is awesome.
And I'm going to talk a lot more about TriMet, so please stay. Now you know who TriMet is.
The project's main goal was to improve the searchability of their intranet, apart from upgrading the knowledge base. There were three pieces to this. One was, obviously, responsiveness: if it's not responsive, people are not going to be able to use it on their phones.
But theming can get very expensive, and this was not really a re-theming project, so we decided to just stick to basic Bootstrap with a few color changes.
They have an internal design team, and their directive was: do not get fancy. Don't make it look any different from Bootstrap, so that I wouldn't have to do a lot of work with Diazo.
Just give us a couple of templates that look like Bootstrap, and we'll get fancy later. Another piece was the use of covers (collective.cover), because the idea was that news publications, magazines, newspapers, and so on
know that they have developed an art to basically guiding you, the user, to what they want you to find.
And collective.cover is a perfect application for this. The idea was: let's see if we can surface the information that we want users to find in a dynamic way, so that people don't have to create a bunch of links on pages manually and so on.
So I developed a few custom tiles. At the time, collective.cover did not have a calendar tile (now it does, I found out), so I developed a calendar tile, and that is an example.
So you as a content developer, you just create events wherever they need to be on site, and the page with the calendar tile just surfaces them, and you don't have to do anything about it. Just one word here about this process of using covers and bootstrap.
Lineage was great for this, because I created some new landing pages using collective.cover that were themed with Bootstrap, and I tested that the whole composition process worked with Bootstrap. But the pages needed to be built by the content developers, who had to do a bunch of work to populate the tiles just the way they wanted them.
And these tiles had to contain a bunch of links to production content. So instead of doing this whole work as usual on a staging server or a dev server, this all actually happened on the production server by using sub-sites with lineage.
So the rest of the site was completely unchanged, still the old theme, nothing changed, but the sub-site had the new theme, bootstrap cover installed, running in there. And so they could create these covers, and then on the launch day,
I could just turn on the theme for the rest of the site, copy and paste the pages, the cover objects over to where they needed to be, and switch them to be the default landing pages for those folders. So really, that's one reason why lineage definitely got one of my stars
in the contest that we have out there. So let's now dig a little deeper into TriMet's intranet.
So the challenge is: oh my god, our search results are useless, what are we going to do? And IT gave us a mandate: no Elasticsearch. They did not want to have that additional dependency and stack installed
on the virtual boxes that they provided. So what are we going to do? Again, let's look a little bit closer at the intranet. It was running Plone 4.2.x. It didn't actually have a lot of add-ons:
because of the Plone4Artists episode, they were completely allergic to adding any add-ons; they wanted plain Plone as much as possible. These were basically the main add-ons they had. If you can't read them, they are Scrawl for blogs, ContentWellPortlets, web server auth, and cx_Oracle, which is used in only one place and they really could do without it.
And the theme is Plone 4's Sunburst, almost unchanged, just a little bit customized in portal_skins/custom.
In terms of content, it has about 10,000 items. 70% of those were files and images, and the rest were mostly pages, blog posts, and folders.
So this is not really a huge site, but it's where I think Plone's default search starts falling down, probably even before you reach 10,000 items. By that point it becomes useless, especially with this many files.
So they used a bunch of workarounds. One was to exclude files and images, which you can do very easily in the Search settings control panel: just uncheck the boxes for File and Image, and they will no longer appear.
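The effect of that setting can be simulated in plain Python. In a real site it is the `plone.types_not_searched` registry record behind the Search control panel checkboxes; the sample data below is invented for illustration.

```python
# Simulation of Plone's "types not searched" setting (the
# plone.types_not_searched registry record behind the Search
# control panel). The sample results are made up.

NOT_SEARCHED = {"File", "Image"}  # the types unchecked in the control panel

def filter_searchable(brains, excluded=NOT_SEARCHED):
    """Drop catalog results whose portal_type is excluded from search."""
    return [b for b in brains if b["portal_type"] not in excluded]

results = [
    {"Title": "Non-revenue vehicle policy", "portal_type": "Document"},
    {"Title": "nrv_final_v2.pdf", "portal_type": "File"},
    {"Title": "Bus diagram", "portal_type": "Image"},
]
kept = filter_searchable(results)
print([b["Title"] for b in kept])  # → ['Non-revenue vehicle policy']
```

This mirrors what the talk describes: the Files and Images simply disappear from the result set, which is why the "non-revenue vehicle" search dropped from 226 hits to 28.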
And the result of this was, for example: if I search for "non-revenue vehicle", before, I got 226 results, most of which were files. And for most of those files, the snippet, the description underneath,
does not tell you anything about where Plone's search found the keywords. In other words, the words could be anywhere in the PDF, and you wouldn't know, looking at this page, why you are getting this result.
So they were really frustrated by that, and therefore they turned off files and images, and now for the same search, we have 28 results.
And this is what it looks like now, for the same search; just a little preview of more to come later. Another workaround is keyword stuffing. Somebody at some point figured out: what is this keyword field? I wonder if I can trick the search by putting in all the keywords that I think people will
use when they search for this particular page. I put them all in here, and I can get my result to rank higher.
There's this whole SEO craze. So there were about 80 keywords, most of which were duplicates: singular and plural forms of the same word, hyphenated and non-hyphenated versions of compound words,
or capitalized and lower-case versions (which obviously doesn't make any difference), and stuff like that. So that was useless. And then, link farms. I don't know if you remember 1995, the web sites, AltaVista, or even Yahoo,
the way they looked. The keyword search that you typed wasn't really that helpful before Google came along. So what they did was have all these categories with links;
you clicked one link, and it took you to another place with lots of other sub-categories and lots of other links, and that's how it worked. Basically, they figured out how to replicate this insanity.
And when you are a user, and you're looking for something in particular, having a whole page full of links isn't really going to help you. So let's talk a little bit more about search.
I don't know if you've ever taken a bird's eye view of a Google search results page, and realized how little effort it takes for your eyes to scan it,
and to immediately discard the things that you're not interested in, and to zero in on exactly the right thing that is a good match for you. And it's done without a whole lot of fancy design.
Look at this: this is an individual search result. Look at the colors, the font sizes, the spacing, and the font weights. I think we could do worse than just copy what Google has figured out.
So the search page at a micro level, at the individual search result level, has some metadata that we want people to see. So definitely we want a title. We want a good, descriptive title.
Google puts in URLs, and I don't know if you noticed. You'd think that in this day and age a lot of people don't really know what to do with URLs, but actually it's vital. It's really, really important to know what you're clicking on.
Just looking at the domain, you get a good idea of where you're going. So that is really important. The byline: Google doesn't show it, but on an intranet I think it's very important, a last-modified date and the author. Then the snippet, a.k.a. description, a.k.a. summary;
what we can do in Plone is different from what Google can do, but anyway. And we decided that tags are useful to have in search results; we'll talk more about those later.
And then icons or thumbnails could also be very good hints. I implemented something for that, but it ended up not being used for now, maybe later. So, okay, this is a typical, well, not typical,
but one of the worst examples I could find of a file. This title only makes sense to the person who created it.
How could this ever... and I blurred out the name to protect the innocent. So: titles. The title is really important, and just using the file name for a title is less than useful. And even though the description field is not mandatory
to save your content in Plone, a lot of people think they have to put something in it, so they just copy and paste the title. That, again, is another example:
give me some information about this thing, this "type 5 odometer location"; don't just repeat the title. But to be honest, this is not the user's fault, or the content editor's fault. It is a little bit, actually a lot, Plone's fault, because Plone doesn't give editors any feedback
on the quality of the content they're creating. In the examples I've shown you, there was nothing to prevent people from creating content like that, or to give them any hint that there's anything to improve.
And even if they wanted to improve something, say me as a content editor, I know I have created a bunch of files,
and I did not give them any titles, and the file names are atrocious, but I created hundreds of them, and they're all over the place. How do I find them? I mean, do I really have to go through folder contents and search for them that way? Or, even worse, description or other metadata that we would want to fix.
So, that's where we are falling down. We're not giving editors any help at all. And, well, just as an example,
this is a screenshot of a search result as it looks now. You see there's a tag, there's a byline, there's a URL without the "http://" in front, because that's useless, and a good title and a good description.
That's good. If all the search results looked like that, we'd be off to a good start. But wait, there's more. At a macro level... don't do that.
At a macro level, Google had one job: give me a picture of a guy without a tattoo. Actually, at a macro level, search is about two jobs:
sorting and filtering. Because Google has, let's just throw a number out, a trillion possible search results. It doesn't just sort them; it doesn't just give you a trillion search results sorted by the keywords you typed. No, it also filters.
But in Plone, the scoring algorithm that we use is something called Okapi BM25, where BM stands for "best match". And, not to scare you with this formula,
but there is actually... Even if you don't have any idea what this means and what this does, there is a lot you can learn just by looking at the formula and by looking at what it depends on
and what it does not depend on. What it depends on is the frequency of each keyword in the particular document it is computing a score for: how often does this keyword appear?
The length of this particular document. The average length of all documents. The ratio of those two. And then, the total number of documents in the whole set and the number of documents containing this particular keyword. That's it. That's all this depends on.
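Those dependencies, and nothing else, fit in a few lines of code. Here is a minimal, illustrative Okapi BM25 scorer over pre-tokenized documents; it is a sketch of the formula, not Plone's actual implementation, with the usual free parameters k1 and b.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25: sum over query terms of
    IDF(t) * tf*(k1+1) / (tf + k1*(1 - b + b*len(doc)/avgdl)).
    Documents are plain lists of words."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)    # docs containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = doc.count(term)                         # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["bus", "route", "map"],
    ["bus", "bus", "schedule"],
    ["train", "schedule"],
]
# More occurrences score higher; a doc without the term scores zero.
print(bm25_score(["bus"], corpus[1], corpus) > bm25_score(["bus"], corpus[0], corpus))  # → True
print(bm25_score(["bus"], corpus[2], corpus))  # → 0.0
```

Notice that everything in the function is counts and lengths: no positions, no site structure, no synonyms.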
If you think about that for a minute, you realize that this scoring algorithm does not understand anything about the context of a keyword in terms of either the location in the site or the location of the keyword inside the document.
And it excludes any document that does not contain any of the keywords. So if you're using a synonym, for example, that's your own fault.
If you misspell something, it can't help you there. And obviously it doesn't offer any suggestions. So this is why Plone's default search is, you know, fine when you install a Plone site and start creating content: ten pages,
a few dozen pages, a couple hundred pages, and search actually works pretty great. But then you get to a thousand, or ten thousand. So, thinking about that, I made my own custom sort order.
And I made an index that depends on these four elements, in this order. First, content quality. Then,
for two items that have the same content-quality score, it will rank an item that has tags higher than one that doesn't. And for the same score again, it will rank content that is not a file higher than files.
And then last modified, which seems to be a pretty useful thing to sort on. We'll get into content quality in a moment.
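A plain-Python sketch of that composite ordering; the field names here are illustrative stand-ins for the real catalog metadata, not the project's actual index:

```python
def sort_key(item):
    """Composite sort key: best content quality first (fewest failing
    symptoms), then tagged before untagged, then non-files before files,
    then most recently modified first."""
    return (
        item["symptom_count"],            # fewer quality problems first
        -int(bool(item["tags"])),         # tagged items first
        item["portal_type"] == "File",    # files sink below other content
        -item["modified"],                # newest first (numeric timestamp)
    )

items = [
    {"id": "old.pdf", "symptom_count": 0, "tags": [], "portal_type": "File", "modified": 100},
    {"id": "page", "symptom_count": 0, "tags": ["safety"], "portal_type": "Document", "modified": 90},
    {"id": "messy", "symptom_count": 3, "tags": ["safety"], "portal_type": "Document", "modified": 200},
]
ranked = sorted(items, key=sort_key)
print([i["id"] for i in ranked])  # → ['page', 'old.pdf', 'messy']
```

Python compares the tuples element by element, which gives exactly the "tie-break by the next criterion" behavior the talk describes.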
So, that's about the sorting. And now it's about filtering, because that's the second big job that the search engine performs. It's eliminating anything that doesn't have anything to do with what you're looking for.
So, what if users could do the filtering themselves, since Plone doesn't help them? And we have a perfect solution for that: faceted navigation. But first we had to decide which facets we want to use.
And that's what our process entailed. In this project we needed to make a bunch of decisions. In terms of metadata, we said: OK, we want title, description, URL, tags, and so on.
And there's a really nice add-on, collective.jekyll, which I like a lot; I also gave it a star in our contest. In case you haven't seen it, it gives you a little viewlet that appears in the byline of the content.
You see that little red thing there? That's the viewlet that collective.jekyll puts there. It has a little drop-down arrow you can click on. It says "warning", and when you click on it, it drops down a summary of content-quality symptoms.
And in this case, this page does not have a summary. And that's the one that's red. Everything else is green. So, me, as a content editor, I can just go in, give it a summary,
And in this case the summaries are hidden, so you don't see it, but now it's green; it says OK. So this is what collective.jekyll gives you, and it's really nice. I looked at the code and I like it: it's really well written, I think, and really easily extensible.
So, we decided, OK, let's start brainstorming. What symptoms do we want to use? And prioritize. Which ones are we going to fix first? So, we want to have really good titles.
So we want them all to be title case: no all-caps titles, no all-lowercase titles; follow the AP style guide and be really clean about it. But then, what do we do about acronyms, which legitimately are all caps? That's something I handled in a custom symptom that I created.
Then, the summary, a.k.a. description. We want every content item to have a description. It has to be a complete sentence; it must not be the same as the title, or even contain the title as a substring;
and it has to be properly spelled, starting with a capital letter, and so on. Some of this already exists in collective.jekyll, but we improved it. And then the page ID: when you create a copy,
the page ID always has "copy_of" in front of it. That is, first of all, ugly; but moreover, it's a symptom of work that was probably left undone, unfinished. So it's good to fix that.
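The checks just described can be sketched as plain predicate functions. These are hypothetical stand-ins: collective.jekyll wraps equivalent logic in its own symptom adapters, and the acronym list here is invented for illustration.

```python
def title_ok(title, known_acronyms=frozenset({"TriMet", "MAX", "WES"})):
    """Reject all-caps and all-lowercase titles, but allow words that
    are registered acronyms (the tricky case mentioned in the talk)."""
    words = title.split()
    if not words:
        return False
    for w in words:
        if w in known_acronyms:
            continue
        if w.isupper() and len(w) > 1:   # an all-caps word that isn't an acronym
            return False
    return title != title.lower()        # reject fully lowercase titles

def description_ok(title, description):
    """A description must exist and must not merely repeat the title,
    or even contain the title as a substring."""
    if not description.strip():
        return False
    return title.lower() not in description.lower()

def id_ok(obj_id):
    """Flag the 'copy_of' prefix Plone adds to pasted copies,
    usually a sign of unfinished work."""
    return not obj_id.startswith("copy_of")

print(title_ok("MAX Red Line schedule"))   # → True
print(title_ok("BUS STOP LIST"))           # → False
print(id_ok("copy_of_safety-manual"))      # → False
```

Each predicate returns True for healthy content, so a symptom is simply "this check came back False".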
One thing collective.jekyll does not do, though: it gives you that little viewlet with an OK or a warning, and it also gives you a collection that you can use, but it computes all these symptoms on the fly, when you're actually requesting a page.
So I created a custom index, so that all those symptoms are persisted in the catalog and we can create reports, reports built with faceted navigation. And here you see a widget in faceted navigation that lists the symptoms.
Then the managers could give tasks to their editors: okay, let's start fixing all the titles. And there are other widgets down there that you don't see, so people could find all the things they had to fix.
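Persisting the symptoms works along these lines: compute the list of failing checks once, at index time, and store it so reports and faceted widgets can query it instead of recomputing on every page view. In Plone this would be a plone.indexer adapter feeding a KeywordIndex; the check names and logic below are illustrative.

```python
# Hypothetical quality checks; in the real project these were
# collective.jekyll symptoms. Each returns True when the item is healthy.
CHECKS = {
    "NoSummary": lambda item: bool(item.get("description", "").strip()),
    "TitleNotAllCaps": lambda item: not item.get("title", "").isupper(),
    "IdNotCopyOf": lambda item: not item.get("id", "").startswith("copy_of"),
}

def failing_symptoms(item, checks=CHECKS):
    """Names of failing checks: the value one would persist in a
    catalog KeywordIndex so editors can filter on it in reports."""
    return sorted(name for name, check in checks.items() if not check(item))

item = {"id": "copy_of_roster", "title": "DRIVER ROSTER", "description": ""}
print(failing_symptoms(item))  # → ['IdNotCopyOf', 'NoSummary', 'TitleNotAllCaps']
```

A faceted "symptoms" widget then just queries that stored list, which is why the reports are cheap.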
And that's how the process worked. Then, tags: we decided on a controlled vocabulary, none of this folksonomy stuff. With Plone Keyword Manager we eliminated all the bad keywords, the duplicates and so on, and with Diazo we removed the widget that lets editors add new keywords.
Instead, they can only pick from the controlled vocabulary. Then we had to decide which facets we want on our search page.
And for those, I needed to create some custom indexes. The division is one: it's based on the path, but it's not the same thing as the path. The category is just the default category index. And then, the type.
Okay, let's be honest: users do not care about the content types that we create. We like our Dexterity and Archetypes schemas and all that stuff, but users do not care. So let's consolidate.
All of these (Document, Folder, Link, Collection, Form Folder) are bundled into one user-facing type, a TriMet web page. But there are some types users do care about: whereas Word, Excel, PowerPoint, videos, and all that other stuff is just a File to us, to the users
those are really important distinctions to be able to filter on. And we do want to show images in the search results.
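That consolidation amounts to a small mapping from Plone's portal types, plus, for files, the MIME type, down to the handful of labels users actually filter on. The mapping below is illustrative, not the project's exact code:

```python
PAGE_TYPES = {"Document", "Folder", "Link", "Collection", "FormFolder"}

# Substrings to look for in a File's MIME type, in priority order.
MIME_LABELS = [
    ("pdf", "PDF"),
    ("word", "Word"),
    ("excel", "Excel"), ("spreadsheet", "Excel"),
    ("powerpoint", "PowerPoint"), ("presentation", "PowerPoint"),
    ("video", "Video"),
]

def facet_type(portal_type, content_type=""):
    """Collapse developer-facing types into user-facing facet values."""
    if portal_type in PAGE_TYPES:
        return "Web Page"
    if portal_type == "File":
        for fragment, label in MIME_LABELS:
            if fragment in content_type.lower():
                return label
        return "File"
    return portal_type  # Image, blog entries, Event keep their own type

print(facet_type("FormFolder"))               # → Web Page
print(facet_type("File", "application/pdf"))  # → PDF
```

A custom catalog index built on a function like this is what lets the "Type" facet show PDF, Word, and Video instead of one undifferentiated File bucket.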
Blog entries and events stay as separate types. Last modified is another faceted search widget, a relative-date widget: this week, this month, this year, over a year ago, and all.
This needs to be updated every day with a cron job. And finally, this is the result. You don't see all the widgets on the left, but we have talked about them. Oh, and obviously this is Bootstrap, so it's responsive. And, like Amazon does, you don't see the widgets when you first load the page;
they pop up when you click the button. You can set them and apply them, and then the search results update. That's almost everything; I just want to give you a few takeaways.
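The relative-date facet mentioned above goes stale as time passes, which is why a daily cron job has to reindex it. A minimal calendar-based sketch; the bucket names come from the talk, but the exact cutoffs are my assumption:

```python
from datetime import date

def modified_bucket(modified: date, today: date) -> str:
    """Bucket a last-modified date into the widget's relative ranges.
    Because 'today' moves, stored values go stale: a nightly cron job
    reindexes items so the buckets stay correct."""
    if modified.isocalendar()[:2] == today.isocalendar()[:2]:
        return "this week"
    if (modified.year, modified.month) == (today.year, today.month):
        return "this month"
    if modified.year == today.year:
        return "this year"
    return "over a year ago"

today = date(2016, 10, 20)
print(modified_bucket(date(2016, 10, 19), today))  # → this week
print(modified_bucket(date(2016, 10, 3), today))   # → this month
print(modified_bucket(date(2015, 5, 1), today))    # → over a year ago
```

Storing the bucket as a catalog value keeps the facet query trivial; the trade-off is the nightly reindex, which is exactly the cron job the talk mentions.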
So, for TriMet, this meant they needed to decide on which facets they wanted; this was a process. They needed to decide on a controlled vocabulary for tags. And they needed to decide which content-quality symptoms they cared about, prioritize them, and work on them on a schedule.
For me, I had to create a bunch of indexes, I had to create the reports, and I had to create the search page, and so on. For Plone, okay, this is for us as a community.
We need better editor feedback on content quality. That's something that Castle does, and I think we should definitely do something like that. And we need to give users a way to both filter...
Now, the folder contents view in Plone 5 is great: you can do bulk updates. But you still can't really do a lot with other metadata, like the content quality of the summary, for example, and bulk edits. So, think about content quality.
It always matters. That's it.