
Precise, cross-project code navigation at GitHub scale


Formal Metadata

Title
Precise, cross-project code navigation at GitHub scale
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
GitHub has recently added Code Navigation features (jump to definition and find all references) that let you navigate code directly on github.com. For the languages that we support, we extract and store symbol information for every named branch and tag, of every repository, public or private, with no configuration necessary. The compute and storage requirements to do this for all of the code on GitHub are quite large. In this talk, we'll discuss some of the trade-offs we've made to make this tractable at GitHub's scale, to be able to operate and monitor this service effectively, and to let us add support for new languages quickly and easily. We'll also talk about our ongoing work to extend Code Navigation to handle links that cross package and repository boundaries.
Transcript: English (auto-generated)
Thank you, Georgios, for inviting me. My name is Doug Creager. I'm here from GitHub. I'm not going to talk about left-pad or event-stream, unlike most of the previous talks in this room. Instead, I'm going to be talking about some code navigation work that we've been doing recently at GitHub.
I'm the manager of the Semantic Code team. We're a team, as Georgios says, that does static analysis at scale on basically the entire corpus of code that's hosted on GitHub, and that brings some interesting challenges that we have to solve. OK. My team, back in November at GitHub Universe,
announced the GA launch of code navigation, fuzzy code navigation, on GitHub.com. If you've used any kind of modern editor or IDE in recent years, you've probably encountered code navigation before. This is not a fundamentally new concept. It's exciting for us because we get to launch it directly
on the code view of GitHub itself. So you're looking at a particular code base on GitHub. This particular example is Kubernetes, so some Go code. You're looking at it, you're trying to figure out how it works. You see a call to a method. You can click on it. It pulls up a hover card that shows you a link to the definition of that method.
Clicking on the link takes you to the source code that defines the method. Once you're there, you can also click on its definition, and that'll pull up a list of all of the other references to that method. So again, it's jump-to-definition and find-all-references, just like you would see in an editor or an IDE, but directly in the code view on github.com, okay?
So we're very excited by it. We're very happy to have gotten this launched; it was a big milestone for my team, and we were happy with the result. There are limitations in what we've shipped, so I want to talk about a few of those and what we're doing to address them moving forward. One of the first limitations to talk about is that it's only GA launched
for three programming languages right now, Python, Go, and Ruby. We do have three others that are in beta, so JavaScript, TypeScript, and PHP. There's about a 10% chance that your repository has code navigation activated for it and a 90% chance that it doesn't. We do have other languages that we're working on as well queued up behind those, but right now it's only GA launched for these three.
It is important to call out, though, that it's GA launched for all public and private repositories on GitHub that have any code in any of those three languages. It's still a pretty substantial launch for us even though it's just a relatively small number of individual programming languages.
A more jarring limitation, I would say, is why I called it fuzzy code navigation when I first mentioned it; we also like to call this ctags-like. What this means is that when you click on a reference to a function, we're not doing anything clever to figure out which particular function you're calling.
All we're doing is we're matching up the name of the symbol, so this is why we call it CTags-like. So if you have a lot of functions or methods in your repository that have the same name, we're gonna present an ambiguous result where you click on a method call and we do show you all of the possibilities based on that symbol name match. Similarly, if you're looking at references,
if you're looking at a particular definition, we're gonna be showing you all of the references to any function with that name in your repository, not necessarily the calls to this particular function. So this is, like we said, fuzzy or CTags-like. We're not actually using CTags itself as the way of extracting this data. So in particular, we are using real parsing
of the languages in question, so we're not doing just regular-expression-based matches. That's interesting in that we do get a little bit more syntactic information about the symbols. We can see if you're talking about a function call or a local variable reference, and we are able to tweak the user interface slightly based on that, but we are still limited by the fact that we're just doing a textual match of the symbols.
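To make that ctags-like behavior concrete, here is a minimal sketch in Python of a purely name-based symbol index. The class and field names are invented for illustration; this is not GitHub's actual schema.

```python
from collections import defaultdict

# A purely name-based ("fuzzy", ctags-like) symbol index.
# Definitions are stored under their bare names; lookups return every
# candidate with that name, which is why results can be ambiguous.

class FuzzyIndex:
    def __init__(self):
        self.definitions = defaultdict(list)  # name -> [(path, line, kind)]
        self.references = defaultdict(list)   # name -> [(path, line)]

    def add_definition(self, name, path, line, kind):
        self.definitions[name].append((path, line, kind))

    def add_reference(self, name, path, line):
        self.references[name].append((path, line))

    def jump_to_definition(self, name):
        # May return several hits if many functions share the name.
        return self.definitions[name]

    def find_all_references(self, name):
        # Returns references to *any* symbol with this name,
        # not necessarily calls to one particular definition.
        return self.references[name]

index = FuzzyIndex()
index.add_definition("Parse", "pkg/a/parser.go", 10, "function")
index.add_definition("Parse", "pkg/b/config.go", 42, "method")
index.add_reference("Parse", "cmd/main.go", 7)
print(index.jump_to_definition("Parse"))  # two ambiguous candidates
```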
Okay, the last limitation that I wanna talk about today, and I think the one most relevant to this audience in the dependency management room, is that this is only code navigation within your repository. So you can see the links to the functions that you're calling that are defined in the same repository,
but not links that cross a repository boundary. Now, I am being very careful to use the word repository here. You do possibly get cross-dependency code navigation for those languages or projects where you're vendoring your dependencies, where you're actually placing a copy of the code into your repository.
And in Go, that's actually relatively prevalent, so for many Go projects, you actually are getting cross-package code navigation with what's currently launched, just because of that tendency to do vendoring. But most language communities don't consider that to be a best practice, and so we really do wanna think about what a cross-repository solution to this would look like
if we wanna do this for real for the entirety of the corpus of code on GitHub. Okay, so we've talked about these three limitations, a limited number of languages so far, fuzzy symbol matching instead of precise, and the fact that this is just within a repository instead of across repository boundaries.
And we are actively working on extending the code navigation implementation that's been launched to address all three of those limitations. And what's really interesting is that, for our purposes, the main bottleneck, the limiting factor here, is really just a question of scale. There are techniques out there that let us extract symbol information for more languages
than just the three that have been launched, and that let us extract precise symbol information, not just fuzzy, and can even do that across repository boundaries because we have the ability to parse manifest files and download all of the code and just run our implementation across all of your dependencies. The issue is one of scale, right? So we've only launched to these three languages so far
because the database that we're storing all of this data into is only large enough to hold that set of three languages, and we have to increase our database size to be able to move on to the next ones. (Audience question, asking whether we also index older commits.) So what we're doing right now is the most recent commit on any named branch.
So it's not just master, it's any named branch, but it is just the most recent commit. (Audience question, asking whether links resolve within the same commit.) It's in the same commit. You're looking at a particular commit, and because we're doing within-repo navigation right now, we can do this at the commit level. So when you follow a link, it keeps you within the current commit. We do GC old commits, but on a sort of ad hoc basis, so you might still have lingering data, but there's no guarantee that it'll stick around if it's not a named branch tip. Okay? So yeah, scale is the limiting factor here. We wanna figure out how to do this efficiently for the entire corpus of code.
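For concreteness, "the most recent commit on any named branch or tag" corresponds to enumerating ref tips. A rough sketch against a local clone (this is not GitHub's internal pipeline):

```python
import subprocess

# List the commit at the tip of every named branch and tag in a repo.
# These tips are what get indexed; old, unreferenced commits may be GC'd.
def named_tips(repo_path):
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(refname:short) %(objectname)",
         "refs/heads", "refs/tags"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split(" ", 1) for line in out.splitlines()]

for ref, commit in named_tips("."):
    print(f"index {ref} at {commit}")
```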
So there are three main takeaways I hope to convince you of, related to addressing these limitations. The first is that local development is not the same as a hosted service, and so the way that you would implement code navigation for editing locally in your IDE on your laptop is different from how we would implement code navigation for a large hosted service like GitHub. There are different constraints, and that might lead us in the direction of a different implementation. In particular, because of the scale that we're operating at, incremental processing is a must. What this means is that as a new push comes in from one of our users,
there's a very good chance that it's only touching a small number of files in the repository, and that a lot of the files are unchanged from the previous commit. We really, really, really want to be able to reuse the results from that previous commit: not have to reanalyze the contents of a file that we've already seen, and not have to store additional copies of the data for the files
that we've already seen. We really want incremental processing. And then lastly, and this is maybe a little bit interesting and counterintuitive, if we can figure out how to generate precise symbol mappings incrementally within a repository, that same solution will work
for generating incremental precise symbol data across repository boundaries. That one maybe is, like I said, a little bit counterintuitive, so once we see some examples later on, hopefully I can convince you of it. And it's nice because we can solve this once for the in-repository case, get precise symbol mappings, and then, once we start to connect in with things like the dependency graph and other ways of expanding out and figuring out your package transitive closure, we can extend this and get cross-repo code nav as well. Cool? Okay, let's jump in. So the first takeaway, local development is not the same as a hosted service, otherwise known as
why aren't we using LSP? So can I do a quick straw poll here? Who here is familiar with the Language Server Protocol? We're not using it for code nav as shipped on GitHub, and we often get questions about why not. So I'm gonna drill into some of the implementation constraints that led us in a different direction.
So what is true of local development? What is it that makes the Language Server Protocol a very, very good implementation of code navigation for local development? And it is, by the way; I'm not at all trying to disparage LSP. The first thing is that user choice is paramount, right? There's the stereotypical vi versus Emacs debate, and it's only expanded since then with things like VS Code and Atom and other kinds of editors, right? User choice is important. People are gonna be using whatever editor they want. We don't get to dictate what tools they're gonna use on their laptop for local development in order to bring about something like code navigation as a useful feature.
And on the flip side, the programming language that the user is using is something that we have no control over. We want to provide code navigation as a useful service regardless of whether you're using Go or Python or Ruby or Rust or any of those. So we have this very interesting M-by-N problem, this matrix of capabilities where every editor needs
to be able to support every single programming language in the worst case. It's also interesting that your context is sort of like a single workspace. It's almost like a tunnel vision kind of thing. There's one user or maybe a pair or something like that if you're working with somebody, but you're looking at a single screen at a single thing at a time. And so as the editor is shifting,
as the user is shifting what it is that they're looking at, the tool that is figuring out what the symbol data is and how it should be exposed needs to be shifting and changing as well. So there's this single tunnel vision focus where they're really only looking at one thing at a time. And lastly, it needs to be very, very interactive. It looks like I misspelled that.
Maybe I didn't, interactive. So for local development as you're actively editing the code we want your editor to be able to respond quickly as you type to the individual keystrokes. So that means we need a very tight turnaround time. We need to be able to react quickly to the changes that you're making and the tool that we're using to analyze the content
of that code as you're editing it needs to be executing quickly enough that we can present the results to the user with very, very low latency. So all of these constraints steer you in the direction of a solution. And LSP is the natural direction that you get to, the natural shape of an implementation
that you get to for those constraints. And the way that it works, if you're not familiar with it, is that it separates things: it provides a nice abstraction boundary between the part that knows about the programming language that you're using and the part that knows about the editor that you're using. So if any editor supports LSP, it can consume code nav data from the language tooling
for any programming language that supports the Language Server Protocol, and vice versa. So it's a really, really good solution to this M-by-N problem, because that abstraction barrier turns it into an M-plus-N problem. And that's great. We love that.
The other thing that's interesting about how LSP works, though, is that because of those single-context and interactivity constraints, your language server is like a long-running sidecar process. It's living right next to your editor, and the connection between the two is very, very chatty. The editor is sending messages
to the language server to either give it the updated contents directly or to tell it that a file has changed and it should reload. It's a very chatty, back-and-forth communication pattern. That means the language server is able to maintain state in memory about its current view of the code that's being edited, and that's what allows it to get the very tight turnaround, the low latency, that supports the interactivity.
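For a flavor of how chatty that protocol is, here is the shape of a single LSP textDocument/didChange notification, written as a Python dict standing in for the JSON-RPC payload. The file URI, version number, and edit contents are made up:

```python
# One incremental edit notification in the Language Server Protocol.
# The editor streams messages like this to the sidecar server as you edit,
# which is what lets the server keep an up-to-date in-memory view.
did_change = {
    "jsonrpc": "2.0",
    "method": "textDocument/didChange",
    "params": {
        "textDocument": {
            "uri": "file:///home/user/project/main.go",  # hypothetical file
            "version": 42,  # bumped on every change
        },
        "contentChanges": [
            {
                # Replace a small range rather than resending the whole file.
                "range": {
                    "start": {"line": 10, "character": 4},
                    "end": {"line": 10, "character": 9},
                },
                "text": "Parse",
            }
        ],
    },
}
```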
Okay, so what's different about a hosted service? On the one hand, we don't have a single context. It's not one user with a single window in their editor looking at a single thing. It's millions of users looking at millions of repositories
at any given time. And we don't really know in advance what it is that they're gonna wanna look at. So in some ways, we kind of just have to steer into the idea that what we're looking at here is something like a database. We just need to be able to store the symbol data for all of the repositories that our users are creating and browsing and interacting with and make sure that it's available for them
whenever they come and want to query it about any of those repositories or branches or whatever. So it needs to be immediately available, but it's also something where we don't get to have that single focus that allows us to do one thing at a time with a nice in-memory representation. We don't need the as-you-type interactivity
because you're not actually interacting with the hosted service as you're editing the code. We are instead responding to pushes as they come in. So they make some edits, they make a commit locally, and then they push it up to GitHub. We need to react to those pushes. So we don't have quite the same tight latency bounds,
quite the same interactivity requirements, but we do actually still care about latency because we want it to be the case, what we're sort of aiming for internally as our team goal is that once you save locally and push up, we kind of want the data to be available on github.com before you have a chance to alt-tab over to your web browser from your Git client,
whether it's command line or editor or whatever. So that's like really tight. We still have to be able to process really, really quickly. And that informs that sort of incremental processing that I was talking about earlier. Another constraint or another feature of a hosted service here is that everything is hidden behind an API.
So we don't actually have quite the same M-by-N problem that you do for local development. We have one web UI, and we have to provide data for it, but we don't have to provide that data in a way that's directly consumable by lots of different clients; or rather, the way we do that is through an API, which is again a single thing
that we just get to implement. And then the last thing is that we carry a pager. This is a hosted service. My team is responsible for making sure that it's up and running 24/7. So we really need to have a deep understanding of all of the different libraries and pieces of software that go into the implementation of our service. And that means that adopting external code like a language
server, even if it's very, very capable and accurate, is just adding operational and maintenance work on top of the work that my team is already doing, even if it is saving development work. So it's not necessarily a free ride. So what are we going to do instead? We don't have the M-by-N problem
where we have to worry about supporting lots of editors, but we do still have the part where we have to support lots of programming languages. I'm going to go through this part relatively quickly, because I am starting to get limited for time. But we want to make it as easy as possible to add support for a new programming language to our program analysis framework. And the way that we do this is through code generation.
We still want to limit the amount of development work that it takes. So for us, for Semantic, the open source project that we've created for all of this, the first thing required for adding support for a new language is to create a parser for it using the Tree-sitter parsing framework. That's another open source project that my team is responsible for.
It's used for other interesting use cases, like syntax highlighting on github.com and in an increasing number of editors. So we're hoping that creating a Tree-sitter parser is valuable enough that language communities will take that work on themselves. And it's not that hard. It's a parser, right?
It's relatively straightforward, and we think that there's decent documentation for this. So this is an example from the Tree-sitter Go parser: the rule for the syntax of a function call. What's interesting is that Tree-sitter, in addition to producing the actual code that implements the parser from that grammar description, will also create a nice machine-readable description
of the parse states that go into the implementation of the parser. Why is that important? The fact that it's machine-readable means that this is the entirety of the code in Semantic that allows us to use a Tree-sitter parser for Go and have the parse results available in the rest of our library.
So we're using Haskell. This uses a feature of Haskell called Template Haskell, which is basically just code generation running inside the compiler instead of as a separate script in your build process. This is just running some code that takes in that machine-readable description of the Go parser, and it will automatically generate all of the Haskell data types
that we will use to store the parse results. And that includes generating API documentation. So if you go look at Hackage, the Haskell package repository, you can see the linked documentation for all of this generated code, even though these data types don't exist directly in the code itself.
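Here is a rough Python analogue of that generation step. It is not Semantic's Template Haskell, just an illustrative sketch that reads the node-types.json file a Tree-sitter grammar emits and prints a dataclass stub per named node type:

```python
import json
import keyword

# Rough analogue of generating AST datatypes from Tree-sitter's
# machine-readable grammar output (src/node-types.json).
# Semantic does this with Template Haskell inside the compiler;
# this sketch just prints Python dataclass stubs to show the idea.

def generate_ast_types(node_types_path):
    with open(node_types_path) as f:
        node_types = json.load(f)

    print("from dataclasses import dataclass\n")
    for node in node_types:
        if not node.get("named"):
            continue  # skip anonymous tokens like "(" and ","
        name = "".join(part.title() for part in node["type"].split("_"))
        print(f"@dataclass\nclass {name}:")
        fields = node.get("fields", {})
        if not fields:
            print("    pass\n")
            continue
        for field_name in fields:
            if keyword.iskeyword(field_name):
                field_name += "_"
            print(f"    {field_name}: object  # child node(s)")
        print()

# e.g. generate_ast_types("tree-sitter-go/src/node-types.json")
```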
So we still get documentation for the generated types. Given those data types, you can write an instance of some pattern-matching rules. If you haven't programmed in Haskell, this might look kind of esoteric and hard to follow, but if you have, I think it's actually relatively readable. What it's doing is basically taking apart those parsed AST representations of the code.
And what this is saying is that if you have a call expression, that is a reference (a reference, not a function definition), and so it's pulling out the parts of the call expression that define the symbol name for that reference. There are a few different possibilities
depending on whether it's a qualified name or not, but it's a relatively small number of cases that you have to think about for this particular parse node. There's a little bit of boilerplate where you have to go through and say that for all of the other node types there aren't any definitions or references, but you're at least not having to spell out how to recurse down into the parse trees. So it could be worse. And once you've done that, we've extracted all of the data that we need for Go to do fuzzy code nav, okay? What's really nice about that approach is that we can produce the data incrementally; I called out earlier that that's really important to us.
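As a toy stand-in for those pattern-matching rules, using Python's built-in ast module rather than Tree-sitter and Semantic, per-file extraction might look like this:

```python
import ast

# Toy per-file tagger: walk a parse tree and pull out named
# definitions and references, with no cross-file resolution.
# This is the fuzzy, ctags-like extraction step.
def tag_file(path, source):
    tree = ast.parse(source, filename=path)
    definitions, references = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append((node.name, path, node.lineno))
        elif isinstance(node, ast.Call):
            # Only handle simple `foo(...)` and `obj.foo(...)` calls here;
            # real grammars have more cases (qualified names, etc.).
            func = node.func
            if isinstance(func, ast.Name):
                references.append((func.id, path, node.lineno))
            elif isinstance(func, ast.Attribute):
                references.append((func.attr, path, node.lineno))
    return definitions, references

defs, refs = tag_file("example.py", "def foo():\n    pass\n\nfoo()\n")
print(defs)  # [('foo', 'example.py', 1)]
print(refs)  # [('foo', 'example.py', 4)]
```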
If you use something like this to pull out the list of definitions and references in a file, that list depends only on the content of that file. It doesn't depend on any of the other files in your repository. And that's exactly the incremental property that we care about. A new commit comes in. If the file is unchanged, we get to reuse the list of definitions
and references that we had already calculated. That's in our database already. We just leave it there. We're good to go. So incremental processing is absolutely a must. The only reason we were able to ship fuzzy code nav in the way that we did for the set of repositories that we did for all of the Python, Go, and Ruby packages on GitHub
is because of that incremental processing. If we didn't have it, the CPU requirements would have been too high, the data storage would have been too high, and the user-perceived latency would have been too high. We talked about wanting to have the data ready by the time you alt-tab over to the browser; if you have to process all 5,000 or 10,000 files in your repository every time, it's not gonna happen.
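A minimal sketch of that reuse, keyed on file content and reusing tag_file from the sketch above. The dict is a stand-in for the real database, and in practice Git's own blob OIDs would serve as the key:

```python
import hashlib

tag_cache = {}  # content hash -> (definitions, references)

def analyze_push(files):
    """files: iterable of (path, source_bytes) at the new branch tip."""
    results = {}
    for path, source in files:
        key = hashlib.sha1(source).hexdigest()  # stand-in for Git's blob OID
        if key not in tag_cache:
            # Only files whose content actually changed get re-parsed.
            # (A real system would store positions separately from paths;
            # this sketch glosses over that wrinkle.)
            tag_cache[key] = tag_file(path, source.decode())
        results[path] = tag_cache[key]
    return results
```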
So the important question is: can we do this for precise symbol matches as well? If we wanna actually say that this particular function call refers to this particular method, and not any of the others that happen to have the same name, can we do that incrementally? To show why that's an issue, consider a very toy example with three Python files. If somebody clicks on a reference to foo, we would love to show them that it refers to a definition down in b.py, and if you just read the code, hopefully you can see that that is the case: there are a couple of import statements that propagate the reference through the file in the middle.
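The slide itself isn't in the transcript, but the example has roughly this shape; every file name except b.py is a guess:

```python
# b.py: the file that actually defines foo
def foo():
    return "hello"

# middle.py: the file in the middle, re-exporting the name
from b import foo

# a.py: the file holding the reference the user clicks on
from middle import foo

foo()  # resolving this precisely requires reading middle.py too
```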
But this example also highlights how precise resolution is not incremental, at least done naively, because we have to look at files that aren't the one that holds the definition or the one that holds the reference in order to figure out what it actually points to. And if a new commit comes in that changes that file in the middle, we've changed not just what foo refers to, but which file the thing that foo refers to lives in, in a way that depended on basically the entire contents of your repository. And it gets even worse if you have to trace things through function calls. The next example tries to say: I want this reference to foo to refer to a field down here in this class.
But that's being passed through a function call in a really weird way, and that function is living in a different file. So how can we do this in an incremental way as well? It's not just a syntactic question of whether things live in different files; you're tracing control flow that bounces through different files at runtime in different ways. And this is really hard.
And I don't have an answer that I can tell you about yet. We have experiments that are starting to work, and hopefully we can tell you about them working soon. But this is why it's really, really difficult. This is another reason why we can't rely on something like the Language Server Protocol: a really good LSP implementation will give us something like this, but it's gonna require a lot of CPU,
and it's gonna have to look at and analyze the contents of the entirety of your repository every time. Okay, last one. I hope that I can, really quickly in my remaining time, convince you that within-repository is the same problem as cross-repository. So if we can find an answer that works for the within-repository case, here's the shortest version of the argument.
It doesn't matter whether these two files live in the same repository or in a different package in another repository. The only thing that changes is the set of files we have to look at when we try to stitch things together. If we can figure out how to do this on a per-file basis, then the location of the files, what packages they live in, what repositories they live in, is no longer the bottleneck concern that we have to worry about.
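Here is a minimal sketch of that argument, with invented repo and file names: each file contributes an independent summary, and the loop that resolves a name by following import chains never consults which repository a summary came from.

```python
# Per-file summaries are computed independently (and incrementally).
# Stitching joins them by imported/defined names; the 'repo' field is
# just metadata, so the join works the same across repositories.
summaries = [
    {"repo": "org/app", "file": "a.py",      "imports": {"foo": "middle"},
     "defines": set()},
    {"repo": "org/app", "file": "middle.py", "imports": {"foo": "b"},
     "defines": set()},
    {"repo": "org/lib", "file": "b.py",      "imports": {},
     "defines": {"foo"}},
]

def resolve(name, module, summaries):
    """Follow import chains until we hit the file that defines `name`."""
    by_module = {s["file"].removesuffix(".py"): s for s in summaries}
    while True:
        s = by_module[module]
        if name in s["defines"]:
            return s["repo"], s["file"]
        module = s["imports"][name]  # hop one link in the chain

print(resolve("foo", "a", summaries))  # ('org/lib', 'b.py')
```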
So that's the end. I think we probably don't have a lot of time for many questions, but thanks very much for your attention.
Oh, okay, cool. Any questions? I'm sorry, can you say that again? Yes, it absolutely will.
So the question was for the precise code matching, are we gonna be doing anything with graph theory basically? And the short answer is yes, yes, very much yes. And briefly, the part that's hard is that if you tried to construct a call graph that encoded that information, the problem is that the portions of the graph that come from this file
might live in remote parts of the graph because you had to follow those links into other files. And that's sort of like a nice succinct description of why it's not easily and obviously incremental because the structure of that graph is not something that obviously depends only on the contents of your file. But there will definitely be graph structure involved in order to make it work.
So will we be analyzing code from third-party package managers? And I guess to clarify, you mean a package uploaded to npm where the source did not come from a Git repository, let alone GitHub in particular.
So there's a couple of answers to that. One is that all of this is in a couple of open source libraries. So Semantic and TreeSitter are open source and so you could use the tools to analyze that yourself. It's not something that at least for the first revision we plan on including, not because we don't want to but just because we're trying to limit scope and have something that's a little bit more tractable to begin with.
That's a really good question. So the question was if we can do this incremental precise extraction in a cross file way that works for our hosted service, will any of those benefits translate over to code nav in local development? And yes, in that the techniques would certainly work there
and in that you are still doing things like creating commits and doing edits as you type and stuff like that, we can probably still have that incremental thing work. In the local development, in the editing case, it's actually even a little bit more fine-grained. It's not just that you want to care about the edits to the file, it's that you want to care about the edits to the particular part of the file.
So a lot of the incremental things in editors recognize that I changed this line and so I don't have to re-syntax highlight the seven pages that occurred before. And that's something that wouldn't directly translate across because we really are operating on the file boundary with what we're doing right now.
Yeah, so are we doing things where we can do code generation based on extracting ASTs?
That is effectively what we're doing; it's just that, at least right now, again to limit scope, we're locking down the way that you would implement the parser that generates those ASTs. But that is what we're doing. We get the source code, we extract an AST from it, and then we run that pattern-matching code that I showed you the example of
to extract the definitions and references out of that AST. Okay, I think with that I'm done. Thanks very much. Thank you.