RECON VILLAGE - Adventures in the dark web of government data

Formal Metadata

Title
RECON VILLAGE - Adventures in the dark web of government data
Number of Parts
322
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Government bureaucracy is your friend. The US federal government alone produces tens of thousands of different forms that collect information on everything from the owner and location of every oil well in the country, to the VIN of every car that's imported, the location and height of every cell phone tower, and much more. While most of this data is locked behind clunky 1990s-era search forms, or in exports of antiquated database formats, the enterprising researcher will find a treasure trove that exists outside the indexes of Google and LexisNexis. I have written scrapers and parsers for hundreds of these databases and will share with you what I've learned about coaxing OSINT out of some of the messiest and hardest-to-find data out there. The talk will specifically feature a deep dive into the data produced by the US Federal Communications Commission. The FCC has issued over 20 million licenses for transmitting on regulated parts of the electromagnetic spectrum. The data residue of this process can be used for everything from geo-locating electronic border surveillance infrastructure to discovering the location and transmission frequency of every McDonald's drive-thru radio. In the second portion of the talk, I will discuss how various protocols for data transmission can be decoded and joined with other contextual public data. For instance, every cargo ship emits an "Automatic Identification System" signal that can be joined with shipping records to understand what the ship is carrying. By the end of the talk, I hope attendees will develop new intuitions and techniques for finding and working with government data, and specifically have the tools to run their own investigations using FCC data.
Transcript: English (auto-generated)
This is our comprehensive talk on the adventures in the dark web of government data. Your speaker's gonna be Marc DaCosta. He's the co-founder and chairman of Enigma, an open data infrastructure company. He's got a PhD in anthropology and writes about culture and technology. And without further ado, I'm gonna pass over to Marc. Thank you very much.
Thank you. Well, thanks everyone for coming. I'm certainly excited to have the opportunity to share with you all some of my adventures and real fascination and passion for public data. So yes, I'm Marc, I'm from New York City originally. Sometimes I go around the city with my laptop
and a large antenna and sort of tune in to some of the fun things that can be overheard on the radio spectrum around the city. As mentioned, I do a lot of work with kind of public and government data and the company Enigma that I started,
we have a big sort of open source search engine called Enigma Public of all of this stuff that we aggregate and bring together. But I think probably to kick things off, it would be helpful to sort of get some kind of clarity on terms and what exactly is government data and does it really have a dark web?
So it's interesting, I think one of the easiest ways to think about this more expanded idea of government data is that it's sort of the thing that's produced every time you come up against or hit regulation in some ways. Of course, we have these sprawling bureaucracies at federal, state and local levels
and every time you touch them, they have a way of kind of kicking off some data exhaust and from sort of a reconnaissance and open source intelligence perspective, this can be really good. One of the really kind of interesting maps, I think at least at the US federal level to what's going on from a data collection perspective
all came out of this thing from 1980 called the Paperwork Reduction Act. And basically what happened in the 70s is there was just a massive proliferation of forms and sort of government information collection instruments and all these things and it rose to the level where the Congress passed a law.
And what that law said was basically every time the federal government wants to make a new form, they have to themselves fill out a form and register it with the Office of Management and Budget, which is part of the executive branch. And just to kind of show you here, this is an ordinary tax return, a 1040,
and it has this OMB control number in it. And this is great. Anytime you have a US federal government form, it will definitely have an OMB number on it somewhere. And when a government agency wants to make a new form, they've got to apply to the OMB.
And one of the things they have to do is justify why they need the form and also estimate what are called sort of the burden hours of the forms. So right now there's maybe about 10,000 different unique forms that are registered with the federal government. And what I find extraordinarily remarkable is that according to the government's own estimates,
they require 11.3 billion hours each year of people's time to fill out. So we can certainly extrapolate that there's a lot of information being produced here. Just to kind of flag this, if there's something that's kind of interesting to you guys and you want to explore further, it's kind of hard to Google for, but it's called the current inventory report.
I made just a little bitly that'll drop you right into the sort of proper government site. And it is kind of fun because there is like an XML file that has all of this stuff structured in it and you can go and play with it. And so just to kind of flesh out what is this real spectrum of information
the government's producing? If anyone's come into the country, into the US from abroad, you've probably seen this form. It's one of the most filled out with over 300 million of them a year. Things like W-2s, so the sort of tax form if you're on payroll somewhere. A quarter billion of those produced a year.
I was sort of surprised to see that there are actually about 90 million of these friction ridge cards filled out every year. And I suppose it's not all strictly for people being arrested. This one that I just found on Google Images is for someone applying to be a pyrotechnic operator. So I suppose these things are produced in lots of ways.
And so those are some of the more common forms that are produced, but there also is a really long tail here. So everything from the 20 or 30 companies that actually fish off the Alaska coast near Russia and have a specific form they need to fill out to the importation of shelled peas from Kenya,
and things like the petroleum supply reporting system, which I'm not sure exactly what it is, but does sound like it could be interesting and juicy in some ways. And so once you start to know that, oh, there is a form out there, they're of course not all public, but there's a really interesting tranche of them that are, you can start to go out and collect information. So just as an example,
this is what a Federal Election Commission form looks like and I don't know if you can see it, but this is like a line item by line item sort of disbursement schedule for all of the things that the Trump campaign spent money on. So we have $140 Uber credit there.
I'll talk a little bit more about this data set later, but the FCC licenses all commercial radios in some way in the country, and you can use that data set to actually find every McDonald's drive-through in the country and also what frequencies link it to the restaurant. Certainly, I'm sure you guys have been in elevators, where you'll see these little inspection placards.
That's run at the state level, but it's the sort of data that you can get to learn all about what's going on inside of a building. Certainly, aircraft registrations are really interesting and have tail numbers and all sorts of interesting joins you can do with radios. This is an example of just the deed for the hotel that we're in.
And when you take it a step further, it's kind of cool because whenever there are building permit applications filed for changes of use in the space or renovations, there's often sort of architectural drawings and things like that. So this is also from the hotel. The Department of Labor collects a lot of information on things like, this is, I think, for OSHA,
so the Occupational Safety and Health Administration. A little bit of a sad case of someone who fell down an elevator shaft here, but there's a lot of information produced. H-1B visas, I couldn't really show it very clearly here, but these are the 14 or 15 or whatever it is, H-1B visas that Caesars has applied for in 2018.
And you can see they're mostly tech, sort of tech programming-looking people. This I just kind of found and thought was kind of funny. It's the 401k plan that DEFCON Communications has for the four people that are enrolled in it. And this is actually one of my all-time favorite pieces.
So this is a customs declaration from the 1960s that the Apollo 11 mission filed upon coming back with moon rocks. And so it just is kind of a lovely artifact of bureaucracy, I think, and does give us some sort of sense of the kinds of things that do appear
hidden away in this data. And I think the takeaway that I want to leave you guys with just from kind of having blown through all of that stuff is government bureaucracy can really be your friend. I think that there's certainly a key set of probably sources of government data that are in our toolkits, be they real estate records
or corporate registrations or whatever. But this is a really deep and sort of massive well of resources. And by kind of thinking about what are the processes and how does that potentially reflect in data, you can start to develop all sorts of new avenues for research and exploration.
So I have a big personal interest sort of in software-defined radios and the sort of EM spectrum. And I was really curious to sort of see how the public data that's available around the usage of the electromagnetic spectrum could be used to sort of ask different questions of the world.
I'm sure this won't be really a surprise to anybody in this room, but of course radio waves are all around us. They're sort of in that really cool sort of spectrum that takes us from all of the visible light that we see around us to the FM stations in our car and the wifi and all of these things are just waves of different lengths.
Of course, Marconi is often credited as being one of the sort of inventors of radio. And it's kind of amazing in its early days, it was not surprisingly like a terribly unregulated and quite sort of chaotic technology that people were just broadcasting,
creating all sorts of interference. And actually a lot of the regulatory regimes now that we have in the US are said to sort of come as a result of the sinking of the Titanic in part, because the Titanic being a new ship, did have a radio operator on it and was sending out SOS messages.
But the kind of thought was that there was so much interference on the sort of land-based stations that a lot of those messages weren't received. And so that led eventually in 1912 to the Congress passing what was called the Radio Act, which became sort of the precursor to setting up the sort of FCC regulatory regime
that we have today. And so we of course now live in a world where there is a lot more sort of attention and regulation around the radio spectrum. And that's actually really cool and exciting when it comes to trying to understand
how this spectrum is being used. So I'm just curious, show of hands, has anyone seen this map before? So it looks like about maybe 20% of people. I'll keep coming back to this sort of throughout the remainder of the talk, because it's I think a really good sort of touchstone to understand how a lot of these things
are existing next to each other. So if you see this, it's I know a little difficult to get with much detail on the screen, but it goes basically from maybe three kilohertz all the way at the top to like 300 gigahertz all the way at the bottom. And each one of those little blocks
is basically a sort of reserved set of uses that that bit of the spectrum can be used for. So you can see here the FM radio band is of course like 88 megahertz to 108 megahertz roughly, and that's sort of blocked off there. But what's interesting is you can start to see
that these things exist next to and alongside of course other uses of the spectrum. So further down in the 150s, 160 megahertz range is where this thing called AIS, which is ship positioning data, is transmitted. And then further down in the sort of next block
at 1090 megahertz or just about a gig, that's where all of the aircraft are broadcasting their positioning data. And so I just call that out to show how these different protocols and uses of the spectrum do have a kind of continuity to them.
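To make the idea of reserved blocks concrete, here is a minimal sketch of a frequency lookup built only from the three uses named in this part of the talk. The band edges are rounded for illustration and are assumptions, not the official allocation table, and `describe_frequency` is a hypothetical helper name.

```python
from typing import Optional

# (low_mhz, high_mhz, label). Illustrative edges only: the FM range is the
# "roughly 88 to 108" quoted in the talk, the other two are narrow windows
# drawn around the frequencies the speaker mentions.
BANDS = [
    (88.0, 108.0, "FM broadcast radio"),
    (161.9, 162.1, "AIS ship position reports"),
    (1089.5, 1090.5, "aircraft position broadcasts"),
]


def describe_frequency(mhz: float) -> Optional[str]:
    """Return the label of the first band containing `mhz`, or None."""
    for low, high, label in BANDS:
        if low <= mhz <= high:
            return label
    return None


if __name__ == "__main__":
    for freq in (98.5, 161.975, 1090.0, 450.0):
        print(freq, "MHz:", describe_frequency(freq))
```

A real allocation chart has hundreds of overlapping entries, so a production version would more likely load the table from a published source than hard-code it.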
And of course there's a ton of politics and money at stake here. And as we know recently as sort of analog television has all been shut down, that spectrum is getting sold off. And just last year you have $20 billion being spent
mostly by the big telco companies to get access to some of that stuff that was freed up. So needless to say, this is a very high-stakes, if somewhat obscure and invisible, sort of place that data is produced. So in the US, of course the FCC is the main regulatory body here. And they basically collect a ton
of different information and release it in two different ways. The first one, which has the most data, is this thing called the Universal Licensing System. And there's maybe 15 or 16 different kinds of licenses that they wind up giving out. And each one has a lot of sort of detailed information associated with it.
As part of like an open data initiative, the FCC has done some work to unify all of that into this database called the License View Database. So I think it's maybe like 100 columns that sort of are harmonized across all of these things. What's nice is it collects in one place all of these different licenses.
And it basically pops out in one CSV file. This is a bitly link to a GitHub repo I made which basically makes it relatively easy to, if you have a Postgres server running, you can basically run the script and it'll download the most recent version
of this database, geocode it, geo-index it, and make it searchable for you. And the cool thing is once you do that, you can actually start to use this data to ask really targeted and specific questions about your local environment in a way.
So this is, I just did a sort of search of a kilometer radius around the Caesars Hotel here and said basically like for all the licenses that have been given out within a kilometer of here, who has the most of them and what are the rank ordered counts of how the spectrum is being used.
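The radius search described here can be sketched without a database. The example below is not the speaker's Postgres script: it is a plain-Python stand-in that uses the haversine formula for distance and a counter for the rank-ordered totals. The row layout (`licensee`, `lat`, `lon`), the sample rows, and the approximate center coordinates are all made up for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def licensees_within(licenses, lat, lon, radius_km):
    """Count licenses per licensee inside the radius, most frequent first."""
    counts = Counter(
        row["licensee"]
        for row in licenses
        if haversine_km(lat, lon, row["lat"], row["lon"]) <= radius_km
    )
    return counts.most_common()


# Hypothetical rows standing in for the FCC license export.
SAMPLE_LICENSES = [
    {"licensee": "EXAMPLE TELECOM", "lat": 36.1170, "lon": -115.1750},
    {"licensee": "EXAMPLE CASINO LLC", "lat": 36.1150, "lon": -115.1740},
    {"licensee": "EXAMPLE TELECOM", "lat": 36.1200, "lon": -115.1700},
    {"licensee": "FAR AWAY CO", "lat": 36.2000, "lon": -115.0500},
]

# Approximate coordinates on the Las Vegas Strip, assumed for the example.
CENTER = (36.1162, -115.1745)

if __name__ == "__main__":
    for name, n in licensees_within(SAMPLE_LICENSES, *CENTER, radius_km=1.0):
        print(n, name)
```

With millions of rows, a spatial index in the database, as in the speaker's setup, would be the more practical choice; the linear scan here is only meant to show the logic.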
So probably not super surprisingly, the top three are all Nextel. This is your sort of cell phone stuff. But then kind of digging in, it was sort of interesting for me to start to learn where am I and what's going on around here. So Perini Building Company is a legitimate construction firm that has no ties with the mafia,
but they have done a lot of the casino construction in Las Vegas and certainly one of the biggest holders around here. And then sort of drilling down, we of course see a bunch of the casinos themselves are really big recipients of licenses. I was kind of surprised to see, because they'll come up later in the talk, but this firm Recon Robotics,
which by their own tagline is the world leader in tactical micro robot and personal sensor systems, has a good 32 licenses right in this part of Las Vegas. And that in fact puts them on par with DEFCON, who I was very impressed to see is quite fastidious about making sure that the official FCC licenses
are all sort of filled out. And one other thing that I'll call out here that I think is really important to keep in mind when you're working with these sort of government data sets is that there can often be a lot of confusion and difficulty when it comes to doing
entity recognition and resolution and stuff. And so towards the bottom here we have PHWLV LLC, which I saw and I said, what is that? And in fact it's the parent company, Planet Hollywood Holdings, that owns this casino and many others. So then now that you can kind of start to identify
what's going on around you geographically, how can you start to use and apply that? You know of course it's been quite amazing to see in the last several years how cheap software-defined radios have gotten, how much that's really opened up. So for those of you who don't know, for like literally 20 bucks you can get
a little USB dongle that will let you tune in to pretty broad spectrum. I think these will go from like maybe 50 or 60 megahertz to just over a gigahertz or something like this. And they're really cool and very easy to just sort of get started with. This is a program called GQRX,
which is just a really simple sort of tuner. So if you plug in one of these USBs and put in a frequency, you can listen to whatever might be coming across it. And so what's kind of interesting is we can start to not only just look at like what is the sort of
the clustering of radio licenses around us but actually dig into them a bit more specifically. And what's really nice about these is you do get some very high resolution information about how organizations kind of operate and function. So this one is for the Caesars Hotel. It's one of many that they have.
But what's sort of interesting is the person who actually filled out this license, his name is Eric Dominguez, who is the VP of sort of facilities and engineering here. And what's also included is his phone number and email address. And it is his direct line. I called him, so I do know that to be true. And so these things kind of become interesting
when you're trying to think about what are other ways of understanding a target or a place of interest and finding things that let you have a lot of sort of base knowledge about what's going on. So if anyone's interested, these are sort of a big tranche of the radio frequencies
that the Caesars Palace itself has licenses for. There are other ones under other entities that didn't come up in my sort of first search, but they can be ferreted out. And just to kind of remind us to keep all of this in context, we can see sort of these Caesars Palace radios are in the 450 meg zone.
But then just a little bit down the spectrum, we've got the radio frequencies being used for sort of the control infrastructure around the water system in Las Vegas. And so it's a very rich and crowded sort of space. But of course, this isn't only limited
to these sorts of things. So there's, you know, NOAA 19 is a weather satellite that's flying around overhead. It all operates in sort of the 137 megahertz range. And a friend of mine actually in New York built an antenna and a GCal reminder so that whenever this weather satellite
is actually over the eastern seaboard, he can bring this thing outside and actually download the images because of course, you know, satellites, these are kind of coming down unencrypted and are there for gathering. And that's the URL for it if anyone's interested.
But I was also kind of very curious to see in what ways different kinds of public data could start to get joined with what we know is available on the radio spectrum in order to do things like maybe look inside of a cargo ship. So of course, today, ships are, you know,
really diverse radio stations in and of themselves. You can see here, you know, you of course have GPS antennas and maybe satellite TV. You know, PAM radio antennas. But importantly, up here on the top left is an AIS antenna. And AIS stands for Automatic Identification System.
And it's basically a radio protocol that is used for navigation and safety. And whenever a ship is underway, it broadcasts some information encoded on this channel. And it all basically lives around, I guess, 161, 162 megahertz. There's two different channels that it goes on.
And what's interesting is if you're, you know, have a line of sight or have a decent antenna, you can actually, using one of these $20 dongles as an example, receive those AIS messages that the ship is sending off.
And so here you can kind of see in this text box, those are what the raw demodulated packets look like. And to decode them, there's a great Python library called libAIS, and there are many other ones where people have taken the spec
and implemented all the decoding. But basically the data you're getting when you're listening to these ships breaks down to what you're seeing here. And this tells you things like the position and heading and rate of turn and things like that. But importantly, it also has this thing called an MMSI. And MMSI
stands for Maritime Mobile Service Identity. It's basically like the cell phone number of the ship. And you can use that to then join with a second-order piece of government data. Here I wrote an API, linked in that repo that I showed earlier,
that connects to the International Telecommunication Union to take that MMSI identifier of the ship and turn it basically into the vessel name and some other information about the ship itself. And once you have those two pieces, you can then get to the place where you can actually look inside of a ship.
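As a rough illustration of what a decoding library does with one of those raw packets, here is a minimal hand-rolled decoder for an AIS position report (message types 1 to 3), followed by the MMSI-to-vessel join. It is a sketch, not the speaker's code and not the libAIS API: it skips checksum validation and multi-part messages, the sample sentence is a commonly cited test sentence, and `HYPOTHETICAL_VESSELS` with its vessel name is an invented stand-in for the ITU lookup.

```python
def payload_to_bits(payload: str) -> str:
    """Undo AIS 6-bit ASCII armoring: each character carries 6 bits."""
    bits = []
    for ch in payload:
        value = ord(ch) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    return "".join(bits)


def uint(bits: str, start: int, length: int) -> int:
    return int(bits[start:start + length], 2)


def sint(bits: str, start: int, length: int) -> int:
    """Two's-complement signed integer."""
    raw = uint(bits, start, length)
    if raw >= 1 << (length - 1):
        raw -= 1 << length
    return raw


def decode_position_report(sentence: str) -> dict:
    """Decode a single-fragment !AIVDM sentence of type 1, 2 or 3."""
    payload = sentence.split(",")[5]
    bits = payload_to_bits(payload)
    msg_type = uint(bits, 0, 6)
    if msg_type not in (1, 2, 3):
        raise ValueError(f"not a position report: type {msg_type}")
    return {
        "type": msg_type,
        "mmsi": uint(bits, 8, 30),
        "nav_status": uint(bits, 38, 4),
        # Longitude and latitude are stored in 1/10000 of a minute.
        "lon": sint(bits, 61, 28) / 600000.0,
        "lat": sint(bits, 89, 27) / 600000.0,
        "heading": uint(bits, 128, 9),
    }


SAMPLE_SENTENCE = "!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C"

# Invented lookup table standing in for a query against ITU ship records.
HYPOTHETICAL_VESSELS = {477553000: "EXAMPLE VESSEL"}

if __name__ == "__main__":
    report = decode_position_report(SAMPLE_SENTENCE)
    name = HYPOTHETICAL_VESSELS.get(report["mmsi"], "unknown")
    print(report["mmsi"], name, report["lat"], report["lon"])
```

Once the MMSI has been resolved to a vessel name, the same dictionary-style join can be repeated against any other record set keyed on that name.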
And the way that you do that, the sort of conceit here, is by taking bills of lading data that often get filed before a ship hits the port that explain basically for the purposes of customs taxation, everything that's inside of the ship. That data is kind of made available in a very crazy way.
The only way that anyone can get access to it is by going to the Customs and Border Protection office in Washington, DC, giving them a $100 certified check and getting a CD in return. But through Enigma Public, we actually gather all of that data and it's free with an API on it. And so you're able to sort of stack all of these things
together. I'm just on time. Okay, so sort of one example. Another one I'll just quickly talk about is using ADS-B sort of data, which is very similar to AIS, but it's for aircraft. And there's a really interesting piece of work
that was done by BuzzFeed, specifically around looking for the extent to which governments were using stingray devices, which often are put in aircraft and flown in circles when they're going after a target. And stingrays, of course, are ways to track
and intercept some very specific cell phones. And so basically what they did that was really smart is they were able to take all of the ADS-B flight data, and there's companies like FlightAware and others that aggregate it for the whole US.
And they applied some basic kind of analytics to it to look for all of the flight patterns over cities where planes were just kind of flying in circles a lot. And based on that, they were able to identify both airplanes that were very clearly registered to Homeland Security or to a police department, but also, in addition, all of these new companies
that were shell companies being used by the government, which they were able to back into once they knew that those companies were potentially of interest because of these unusual flight patterns. I think when we think about all of the different radio devices that surround us all the time,
there are a lot of different opportunities and examples of taking this sort of contextual public data and applying them to those devices. And just kind of in closing, since we're coming up on time, I want to tell you about sort of another investigation that I did here around trying to understand
the surveillance infrastructure along the US-Mexico border. So what you're looking at here is just kind of a slightly interpolated map of all the radio licenses that are within 10 kilometers of the US-Mexico border. And when I was looking at them, you sort of see these normal dispersion patterns around cities, of course, like the radio towers
and uses are all over the place. But what I was very interested in is seeing, out in some of these more remote desert frontier places, these very regularly spaced towers that were being put up along the border. And this one in particular was put up by a company called IMSAR.
And so I started looking and said, what does IMSAR do? Well, they make the ground radar packages that go on Predator drones and other things like that. So I thought, well, this could be interesting to try to dig in and get a sense of who and what else is sort of happening along the border.
So this is just kind of like a count of like who are all of these kind of entities that are showing up doing experimental work specifically along the border. I just called out that company Recon Robotics, which was the one I had mentioned earlier is also doing a lot of work around this hotel. But then I sort of went through and actually wanted to look at all these companies.
And basically I found, and this was not so surprising, that in fact the vast majority of them are defense contractors of different stripes. And so starting to go through and looking at who these companies are and what they're doing, you sort of stumble upon all this really kind of fascinating technology, I suppose in a way.
So TCOM makes these aerostatic blimps that are used as surveillance platforms. Leonardo DRS is an Italian defense contractor, and they purport to have the most widely used ground surveillance radar. You sort of see a lot of these interesting packages.
ELTA is an Israeli defense company that does a lot of border security work that's also sort of working there, as is Elbit Systems. And so, what's really interesting is, you can again pivot from these very specific licenses or these sort of aggregates of licenses to then go and look at like, where are the sites
and where are these sorts of things happening? So it's kind of incredible for me to just then actually be able to go over to Google Maps, punch these things up and start to see all of the sort of sites where these bits of exploration and sort of prototyping this virtual fence are starting to happen.
Just as a last piece of context there, a bunch of these were part of an older Boeing program that wound up being a massive disaster. They were supposed to be able to cover the entire border for seven billion dollars, but wound up spending a billion dollars to only do 50 miles and the thing didn't even work.
But the thing that I'll leave you with, and that hopefully came across in the talk through these examples of what's possible with data more generally, is to really think about not only where these deeper, perhaps unseen bits of data are, but really thinking about how they can be put together
to tell sort of broader stories. So anyway, thank you very much. Thank you. No time for questions.