
Reverse Engineering: Satellite Based IP Content Distribution


Formal Metadata

Title
Reverse Engineering: Satellite Based IP Content Distribution
Title of Series
Part Number
11
Number of Parts
20
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Place
Brüssel

Content Metadata

Subject Area
Genre
Abstract
The presentation will cover reverse engineering a satellite based IP content delivery system. These systems are generally used for moving digital media (such as movies, video on demand) but also can be used for digital signage and any other type of files. The presentation will touch on all aspects of reverse engineering from satellite reception, packet analysis, forward error correction reverse engineering (along with an explanation of the math), to the difficulty dealing with the extremely constant high bitrates on an off the shelf linux PC. The end result of the entire reverse engineering project was a linux based software client that has similar features as the commercial version based solely on an analysis of the protocol and incoming data.
Transcript: English (auto-generated)
All right, my name is Taylor Jacob, and I'm here to discuss a satellite-based IP protocol
that I reverse engineered a couple years ago. This protocol is used all over North America to distribute news media, video on demand, films to movie theaters, digital signage files, amongst other more mundane uses. There are many different protocols that operate on the same principles as what I
will discuss today. I can't speak of the prevalence of these same systems or similar systems being used in Europe, but based on the vendor websites, I suspect there are similar systems used across the globe and not limited to just North and South America. So I need to start with a little bit of background to make sure we're all on the
same page. Although there are many different types of satellites circling the Earth, this presentation deals specifically with the most commonly known type, geostationary satellites. These satellites are well-suited for distributing a variety of information from a single source to a large geographic area. The most commonly known use is for TV content, but the setup is also well-suited for
other types of media distribution. There's typically no return path in this sort of delivery network, and that sort of setup is okay. If you're watching TV, it's not important that you tell the station you're watching, and if you're receiving media files, they generally won't be needed within seconds of delivery. Usually, they position the data days beforehand.
So for this, my hardware setup was pretty basic, just standard. I used some standard C-band dishes, which you see there, a PCI DVB-S card, and just a Linux x86 PC. And for the software, I used all open source software: dvbsnoop, which is a transport stream analyzing tool; the regular Linux DVB tools
like szap to tune and dvbtraffic to look at the PIDs on a transport stream and do some in-depth analysis of the traffic; and then your standard Unix tools, grep, uniq, and other text-based tools. And then at some point, I had to start writing my own software to get deeper into it.
So before I dive in, I want to briefly cover some technical aspects of the satellite systems and the video formats. DVB is the most prevalent digital video broadcasting standard in the world. There's three main types. There's DVB-T for terrestrial, DVB-C for cable television, DVB-S for satellite. There are newer versions, DVB-S2, T2, and C2.
And although the V in DVB would indicate that it's for video, it can also carry any other type of digital content. In the case of this presentation specifically, IP traffic. So the main difference between DVB-T, C, and S is the transmission medium. For T, it's the air. For C, it's the copper cable. And S, it's the Earth's atmosphere.
The physical interface is generally referred to as a mux in all three DVB flavors. In the case of DVB-S, it's also referred to rather erroneously as a transponder. And once the signals are demodulated into a bit stream, they're virtually identical. And the standard way this data is moved is MPEG transport stream, also called TS for short.
And the format's relatively simple. It's 188 byte packets that have a simple four byte header. And all you really need to know for this presentation is it starts with 47 hex, and there's a 13-bit field called the packet identifier, more commonly known as a PID.
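To make that concrete, here's a minimal sketch (not code from the talk) of pulling the PID out of a single 188-byte transport stream packet; the bit positions are the standard MPEG-TS header layout:

    def parse_ts_header(pkt: bytes):
        """Extract the 13-bit PID from a single 188-byte MPEG-TS packet."""
        if len(pkt) != 188 or pkt[0] != 0x47:       # every packet starts with 47 hex
            raise ValueError("not an MPEG-TS packet")
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]       # 13-bit packet identifier
        scrambled = (pkt[3] >> 6) & 0x3             # DVB scrambling control bits
        return pid, scrambled

    pid, scrambled = parse_ts_header(b"\x47" + b"\x1f\xff" + b"\x10" + b"\x00" * 184)
    print(hex(pid))   # 0x1fff, the null PID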
So on a transport stream, there are 8,191 available PIDs, and of all of them, only a fraction will be used simultaneously. Each PID will carry a specific type of traffic: it can be a single video stream, an audio stream, program metadata, or other data traffic. The PIDs are how a set-top box or other component can filter out the content not relevant to their operation.
For example, a digital video channel will be made up of a few PIDs. Generally, you will have one PID used for the video stream, another for the audio track, and potentially others for subtitles or captions or whatever you call it in Europe. That's what we call it in North America. So the set-top box would filter those specific PIDs and ignore everything else.
And as far as encryption, DVB encryption will run on that 184-byte payload. That's just the generic 01 to B7 there in my diagram. And then the second part I need to cover is the packetized elementary stream and PSI data.
PES is the format that defines how the elementary stream data, which is generally audio or video, is carried in a transport stream. The elementary stream is packetized in sequential order, and the PES packets are sent on a PID. The format of the video or audio codec isn't defined, and it has changed over time. Historically, the video would have been MPEG-2, now it's H.264, and
it's migrating to HEVC. And the important part that you need to know about this is that a packetized elementary stream packet always starts with that 00 00 01 sequence in hex. And the other type of payload, PSI, is most commonly used to describe the layout of a transport stream, meaning how it's configured,
what TV channels are available, and what PIDs the video and audio streams can be found on. But there are many other types of PSI data that allow different standards to ride on top of the transport stream and define what is necessary for their application. Now, the DVB multi-protocol encapsulation (MPE) standard is the standard that pertains to
this presentation. It defines the means to carry IP traffic over the transport stream. The key parts are one or more PIDs can carry DVB MPE traffic, and all the PIDs will operate independently of each other. Each PID's DVB MPE stream may contain more than one destination IP. The intention is to take the IP packets from one PID and
put them on an Ethernet network somewhere once they're decoded. In North America, I encountered very little unencrypted unicast IP traffic, but I did see it here and there. The vast majority of the IP traffic I encountered is multicast UDP, since the satellite link is usually single directional. So a little bit of the background, I've always enjoyed scanning satellites for
news events, feeds, sports events, etc. And although there's a lot of programming that's available for subscription, it's always been fun to see what you aren't supposed to see. When scanning signals with a satellite set-top box, you'll look for these hidden channels. It's common to run into lots that are encrypted. It's also common to find a transponder or
transport stream that'll have no programs on it. And what you can see here is a set-top box doing a blind scan. And so anyway, these transponders that have no programs and aren't encrypted or anything, they obviously have some purpose, but it's not carrying television traffic. Many times, they'll be satellite-based Internet services, but not always.
So for years, I always wondered what they were, but I never did much investigation, assuming it was just encrypted Internet traffic. But at some point, seven or eight years ago, I saw hints here and there on some forums that there were TV channels on these unknown transponders; they were calling it IPTV. Unfortunately, all these signals were on C-band, and I didn't have a C-band dish
at the time, so I set out to find a C-band dish and start examining these signals with a Linux PC. I've always had a PC around with a DVB-S card, so examining them was just a matter of sitting down and poking around once I managed to find a suitable dish and get it installed. So once I acquired my first C-band dish, I installed it and started taking notes on these empty transponders.
So the process that I would use to examine these empty transponders, once I found them with the set-top box, was to see what PIDs were present. Once I knew what PIDs were present, I'd start identifying them. At first, I didn't really know what I was looking for, but
if it was unencrypted and not a regular PES stream, I was interested. So one of the tools I used to identify what PIDs were present was dvbtraffic, which is part of the standard Linux DVB tools. It operates by checking every packet in the transport stream over a second and counting what PIDs are present and their bandwidth. So you can see an output here.
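Roughly, what dvbtraffic computes each second, per-PID packet counts and bandwidth, boils down to something like this sketch, here run against a saved capture of the mux rather than a live frontend:

    import sys
    from collections import Counter

    counts = Counter()
    with open(sys.argv[1], "rb") as f:              # e.g. a capture of the whole mux
        while True:
            pkt = f.read(188)
            if len(pkt) < 188:
                break
            if pkt[0] != 0x47:                      # sketch: assumes byte-aligned packets
                continue
            pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
            counts[pid] += 1

    total = sum(counts.values())
    for pid, n in sorted(counts.items()):
        kbit = n * 188 * 8 / 1000                   # bytes seen on this PID, as kilobits
        print(f"PID 0x{pid:04x}: {n:8d} packets  {kbit:10.1f} kbit  {100 * n / total:5.1f}%")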
The capture on the right is from a regular TV mux, and you can see there's a lot of different PIDs and various bit rates. And the one on the left here is what I'm calling an empty mux that has one PID that has, you can see, is almost 72 megabit of data just sitting there. So once I'd identified one of these PIDs with lots of traffic,
I'd start to look at it in dvbsnoop. And dvbsnoop will allow you to examine in really fine detail whatever content is on a particular PID. So I'd use dvbsnoop once I'd identified a PID with dvbtraffic. After a quick glance, you can see what kind of traffic is being carried.
And yeah, if you look at this example here, I cut out a lot because it would have been 30 screens' worth of data. But you can see it's IP traffic, and in red you can see the IP and the destination port. So I've definitely got some sort of UDP traffic here. And a lot of this process I'm describing, I did manually at first, but
eventually I programmed it up, because it was kind of boring to just write all this stuff down on paper. So anyway, dvbsnoop will provide a tremendous amount of detail about almost anything it encounters, and here's a portion of the output.
So basically, since it outputs text, I was able to pipe this stuff through grep, look at what IPs were present, get other statistics, and figure out what I was looking for.
So yeah, at first I wasn't really sure what I was looking for, but the first thing I did was to identify the IP addresses that were present. Then I started looking at the traffic on each of the IPs. I realized very quickly that when I encountered unicast traffic, it wasn't very interesting. Multicast was a completely different story.
So looking at a dump of the UDP packets, you can see here that I've highlighted the 47s, which are spaced 188 bytes apart. And then in blue, which I guess you can't see very well, there's a bunch of the PES start sequences. So it's really obvious in this packet that
there's some sort of video and there's some sort of transport stream. But how to get it out, I didn't really know at the time. By this point I had reached the end of the grep-able part of my investigation. So I threw together some very basic code to capture the IP packets and
manipulate them. At first this little program just displayed what IPs were present on the PID. Then I started to save the payloads of certain IPs to a file, and I had my first results. The first success turned out to be linear TV channels. Linear is an industry-standard term to describe a real-time
television station. These channels were made up of a single MPEG transport stream encapsulated in UDP packets, with one channel per multicast IP. Each UDP packet would contain exactly seven transport stream packets. Two transponders on the same satellite had a large lineup of
encrypted channels, but a few more were in the clear. One of these happened to be NHL Center Ice. It stayed unencrypted for about two years. So to play these channels back on a computer, I wrote a simple program to capture the UDP packets and re-encapsulate them on a different multicast IP, which I would play back in VLC, which you can see a little screenshot here.
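The re-encapsulation itself is just UDP plumbing. A minimal sketch of the idea, receiving one multicast group and re-sending the payload to another group that VLC is listening on, might look like this; the group addresses and ports here are made up:

    import socket
    import struct

    SRC_GROUP, SRC_PORT = "239.1.1.1", 1234     # hypothetical group carried on the PID
    DST_GROUP, DST_PORT = "239.255.0.1", 5004   # local group VLC plays (udp://@239.255.0.1:5004)

    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    rx.bind(("", SRC_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(SRC_GROUP), socket.inet_aton("0.0.0.0"))
    rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

    while True:
        payload, _ = rx.recvfrom(2048)          # typically 7 x 188 = 1316 bytes of TS
        tx.sendto(payload, (DST_GROUP, DST_PORT))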
So after finding the encapsulated linear video, I still had no clue what the majority of the multicast IPs were. There was one multicast IP on the same transponder as NHL Center Ice that took up most of the bandwidth of the transponder. So I'd learned from the linear video that each multicast IP was used for a single program.
Using the same technique of saving the UDP payloads of the unknown IP to a file and trying to play them back, every now and then VLC would attempt to render a picture or start to play a short sample of audio. And you can see an example here of how it would kind of start to render something, but you wouldn't really get anything. And I found examples of this type of behavior on numerous satellites. If you examined the UDP payload, you'd see all the telltale signs of
the MPEG transport stream and PES packets, but I couldn't figure out how to extract them. Upon closer examination, it was clear the UDP payloads had some sort of header on them. So I wrote some software to strip some number of bytes off and save the rest of the UDP payload. Using the same grep technique, it wasn't difficult to determine
what was the header and what was the payload. I tried this blind header-stripping technique on numerous transponders without much luck, so I tried it on a Ku-band transponder that I had tucked away in my notes. On this transponder, I got lucky and managed to get some playable video. Although I was excited to finally get video to play without error, it turned out to be an episode of Dr. Oz. An American daytime program wasn't exactly the most rewarding prize, but it was progress nonetheless.
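The blind stripping can be automated rather than eyeballed. A small sketch of the idea, guessing the per-packet header length by looking for 47-hex sync bytes spaced 188 bytes apart in the UDP payload (my reconstruction of the technique, not the original code):

    def guess_header_len(udp_payload: bytes, max_header: int = 64):
        """Return the number of leading bytes that, once stripped, leave the
        payload looking like back-to-back 188-byte TS packets, or None."""
        for skip in range(max_header):
            syncs = udp_payload[skip::188]          # every 188th byte after the header
            if len(syncs) >= 3 and all(b == 0x47 for b in syncs):
                return skip
        return None

    # e.g. a 1332-byte payload = 16-byte header + 7 TS packets would return 16
    fake = bytes(16) + (b"\x47" + bytes(187)) * 7
    print(guess_header_len(fake))   # 16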
I still wasn't sure why my previous attempts at header stripping hadn't worked on the other transponders. By this point, I had determined the header and the packets were a fixed size, with a few exceptions. I started looking at the header more closely to see if I could
understand why I had not always been able to get some sort of video to play. Much like before, I used dvbsnoop and grep to examine just the UDP packet headers in real time. And looking at the slide, you can see almost immediately that the headers have some sort of 32-bit field, that's in red, interleaved, and a separate 32-bit counter in blue, and that number would increment in each packet.
I don't know if you guys can see the blue that well, but anyway. So I started to look at the different transponders in my notes, and I realized that I wasn't dealing with the same protocol, or at least the same version of the protocol. And I identified at least five different systems that are all almost identical in design, very similar on a technical level, but different nonetheless.
I started to focus my efforts on the high bandwidth C-band transponder that carried NHL center ice, as I thought it might yield the most interesting results. From here on out, I mainly focused my effort on this transponder and the unknown protocol it was using. So going back to the slide,
you can see almost all the packets start with the two bytes 00 01. And I started to speculate that the interleaved 32-bit field in the header was some sort of transmission ID. So I wrote some test code to start saving each transmission ID to a unique file. And with this basic change, I was able to render video from a much larger percentage of the files. And it became clear why my header stripping previously was so hit or
miss: multiple videos were being sent at the same time. So I examined the files that I couldn't play, and I looked at the packet headers for clues. Almost all the transmissions are sent in sequential order based on that counter, but in the examples I couldn't play, the counter wasn't in sequential order. So I realized that the counter was actually a block number in the transfer, and I started saving the payload using the counter as a block offset.
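In code, the change amounts to demultiplexing on the transmission ID and writing each payload at the offset implied by its block counter, rather than in arrival order. Here's a sketch of that idea; the field offsets, header length, block size, and byte order are illustrative guesses, not the real layout:

    import os
    import struct

    HDR_LEN = 16                  # assumed fixed data-packet header length
    BLOCK_SIZE = 1316             # assumed payload bytes carried per packet
    ID_OFF, CTR_OFF = 2, 6        # illustrative offsets of the two 32-bit fields

    def store_block(pkt: bytes, out_dir: str = "objects"):
        """Write one data packet's payload into the right place in its file."""
        xmit_id = struct.unpack_from(">I", pkt, ID_OFF)[0]    # transmission ID (byte order assumed)
        block_no = struct.unpack_from(">I", pkt, CTR_OFF)[0]  # block counter
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, f"{xmit_id:08x}.bin")
        mode = "r+b" if os.path.exists(path) else "wb"
        with open(path, mode) as f:
            f.seek(block_no * BLOCK_SIZE)                     # blocks may arrive out of order
            f.write(pkt[HDR_LEN:])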
This cleaned up almost all the errors I was getting, and I was able to play back almost every file. Once I was able to play the majority of the files being transferred, I also noticed that the running time of the videos was significantly longer than the time it took to download them. It was only then that I realized the files were meant for
playback in the future, i.e., a content delivery system. So being able to save the media files was a great achievement, but it left a lot to be desired. Every file was saved under a 32-bit number, and the only way to tell what the content was required further examination, which usually meant me trying to play the file in a media player.
I'd also receive the same files numerous times, with no indication that it was a retransmission. I also noticed for the first time that the end of these files always contained some seemingly random data. I had a hunch it was some sort of error correction data, but I had no way of proving it at the time. Seeing the split between the file and this extra data was pretty obvious.
The random data always started on a packet boundary. So I knew there were other packets in the stream that I was ignoring, and I assumed there had to be some sort of control packets. The receiving stations had to be aware of the files and know when a transmission started or ended.
I started logging all of the non-payload packets to a file. It didn't take very long to get a basic understanding of the other packet types. All the packets had the same 8-byte header: a 16-bit packet type, a 16-bit packet length, and a 32-bit object ID field. After this, all the packets differed.
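Splitting that common header off is straightforward. A sketch, assuming little-endian fields to match what turned up later inside the 03 packets:

    import struct

    def split_control_packet(pkt: bytes):
        """Return (packet type, length, object ID, body) from a control packet."""
        ptype, plen, obj_id = struct.unpack_from("<HHI", pkt, 0)   # the common 8-byte header
        return ptype, plen, obj_id, pkt[8:]

    # e.g. dispatch on the type: 0x03 = descriptor, 0x06 / 0xFF = object starting / removed
    ptype, plen, obj_id, body = split_control_packet(bytes.fromhex("0300200078563412") + bytes(24))
    print(hex(ptype), hex(obj_id))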
But since all the packets used the same object ID field, I started to look at how other packets correlated with the file I downloaded and see what I could determine. The 03 was the most prevalent packet, and it was the longest of all of them. At a quick glance, you could see ASCII strings containing what seemed to be quite obviously file names. I'll come back to this in a minute. The other two packets were the 06 and the FF, and
they only occurred before and after transmission. Their bodies were really short, and it was soon clear that they were used to indicate when a new object ID was starting in the stream or when it was being removed. So the 03 packet, like I said I'd come back to, this is the one I was looking for. It clearly contained the file name along with some other descriptive
information about the file. So I started logging as many of these as I could and looking for patterns. The process I used was pretty rudimentary. I used Vim and looked at packets side by side. I piped and grepped packets with various filters to see what patterns I could discern. The first problem was that the location of the ASCII file name was not at a fixed offset into the packet. This meant I had to look more into how the packet was structured to be able to
programmatically extract the file name. So my goal was to try to identify as many parts of the 03 packet as possible so that I could see what remained. I assumed the packet was made up of little-endian 32-bit integers and null-terminated strings, since there were a lot of zeros. I knew the total payload size, including the extra data.
I had a good idea of the file size to within a few hundred bytes when the extra data was removed. I also knew the transmission block size. I was able to clearly see the file name in the packet as well. And I found a date field, in Unix timestamp format, that was usually within the past week or so. So I may be showing my age here a little bit, but
I printed out hex dumps of a couple of 03 packets for files where I knew all these properties, and highlighted them. So once I'd marked off all these areas, it was obvious that the fields I was interested in were all clustered together.
You may not be able to see it clearly here in the photo, but there was always a 32-bit little-endian 02 right above the transmission block size, which is highlighted in yellow there. It was followed by a 32-bit little-endian length that correlated with 13 bytes after the file name, which was usually the end of the packet. Based on this knowledge and a little bit more human pattern recognition,
I was able to make a parser that was token-based, starting 24 bytes into the packet. Now, this had some major flaws, and it would crash fairly often, as I didn't know nearly enough about the format. But regardless, I was able to properly save the files with the correct file names. I added some bounds checking in my parser to stop it segfaulting, and I logged the errors. And whenever I would mis-parse an 03 packet,
I would compare the packet that crashed to what I was expecting. And this process continued on and off for a year or two until the parser stopped running into errors. I'm sure my parser routine isn't a facsimile of the real software, but it's been stable for four to five years now. So the two big problems at this point were dealing with the volume and the speed of the transmissions.
The big video on demand system had two transponders that combined were a constant 150 megabit stream. I could fill up a two terabyte hard drive in less than two days. Most of the content I was collecting I had little or no interest in watching, but I was able to filter out some of the content based on file names. It wasn't an ideal solution, and it required me to manually clean up the hard drives every few days.
But over the course of a month, I set up a ten terabyte array of two terabyte drives. I never fully addressed the volume of the file problem programmatically, but with some string matching I was able to reject a large enough portion I could sort through the files once a month or so.
The overall bit rate was the most difficult problem to tackle. This is by far the most IO bound application I'd ever worked with. The transmissions were constant, so I needed to be able to write data at least as fast as the bit rate because no amount of buffering would help. The ten terabyte array was more than adequate to keep up with the data rate if I totally left it alone, but when doing other operations on the disk,
I would run into problems. So the first thing I tried in order to deal with this was to increase the buffering on the DVB driver. I tried using a buffer as big as 64 megabytes, but that only covered about seven seconds. The value I'd been using before was four to eight megs. So it was an improvement, but it alone didn't eliminate the problem.
I can't quantitatively say how much it helped, but I still ran into problems very consistently. From watching the hard drive light, it was clear the data was being written to disk in quick bursts of a second, followed by a pause for a few seconds. I started to look at the disk using vmstat, and you could also see the burstiness there.
And you can see that here in the slide on the top, in the blocks-out field there in red: it's writing nothing, then writing 45,144, then nothing again, and I'll explain that in a second. So I did some research and found the kernel settings that control how much dirty I/O data is cached in memory until it's flushed to disk. There's a pair of settings in the kernel,
vm.dirty_background_bytes and vm.dirty_background_ratio. What these settings do is set either a byte limit or a ratio for the amount of data being held in the I/O cache before flushing it to disk. The box I was using had eight gigs of RAM, and the default setting for the kernel I was using was 10%. This box wasn't using that much RAM for applications, so to be conservative, say
four gigs were being used as I/O cache. If four gigs were being used for I/O caching, a write wouldn't get flushed to disk until approximately 400 megs needed to be flushed. At 150 megabits per second, a rough estimate is that it would take around 20 seconds to add 400 megs to the buffer. The problem was the buffers would still be filling while the data was being written
out, so that's why the bursts were so closely spaced. The real problem can be seen in the upper vmstat output's wa field. This is the percentage of time that I/O operations are suspended, and it's where buffer under-runs could occur if everything in my application had to stop and wait. So basically that means everything's waiting for that I/O to happen.
So in that last row right there with the yellow, for 17% of that one second, everything is just paused. To stop this behavior, I changed vm.dirty_background_ratio to zero, which meant that dirty data was written to the disk as soon as it expired. This immediately changed the disk access behavior to be much smoother, and
data was written out continuously as it was received instead of being so bursty. It also meant the potential spikes in I/O blocking were reduced significantly. You can see this improvement in the lower vmstat output: it's writing a consistent 8,000 in each sample.
So after tweaking the VM dirty settings, my client was able to write data reasonably well. But I still ran into problems doing other file system manipulations and very specifically deleting a file. Up until this point, I was just using the ext4 file system that is the default with Linux. I never had any reason to use anything else as it met my needs.
I started researching file deletion times, and the larger a file is in ext4, the longer it takes to delete, due to the inodes being fragmented. I had done some work on MythTV years before, and I remembered the hardcore users recommended something other than ext4, but I couldn't remember why. It turns out that the long deletion times were causing the I/O issues.
I did a little bit more research and started trying various file systems. I ended up settling on XFS. The two main reasons were that you can control how large a block of the disk is allocated at a time, and that the file deletion time was always fractions of a second, regardless of how large the file was. The large block allocation was also very useful since almost every file
downloaded was multiple gigs. I set that value to one gig. And there's some XFS tools to see how many non-contiguous blocks a file takes up. And with the one gig setting, even for really large files, in the order of 20 gigs or more, they'd still be split up into a few pieces. So by this point, I was able to store just about every media file from
the video on demand distribution system. I still had problems here and there due to reception issues, but overall I was able to complete 95% or more of the files, and buffer overruns due to I/O were a thing of the past.
So although I was quite pleased with being able to store and organize the files, due to the nature of satellite transmission I would quite regularly be missing a few packets from transmissions. It was quite frustrating to be missing one or two packets from a 20 gig movie. So eventually, I decided that I should take a look at the FEC.
Up until that point, I didn't understand much about FEC, aside from the fact that at a high level it provided data protection using some fancy math I didn't understand. At some point, I had found the vendor's website, and it mentioned using a proprietary FEC method. So I started to do some research into FEC, and I started looking at the files I was downloading for any clues I could use to focus my energy.
I knew the FEC data always fit exactly into the packets, and I knew there was no padding on the FEC data like there was on the actual file. So this led me to believe that it was block based. I also noticed the ratio of FEC data had a close correlation to
the size of the file being transmitted. I started examining other systems that used the same protocol, and I noticed this ratio differed a little on different systems. C-band tended to use less redundancy than Ku. This is most likely because C-band is less sensitive to interference and signal loss than Ku-band. So one of the more important observations that I happened to notice
was that a single-packet transmission's FEC was an exact copy of the payload. So this slide's an example of a two-packet file that was 2,600 total bytes. You can see the contents are the same at byte 0 and at byte 1,300. Up until this point, I had been mostly overwhelmed with attempting to
know where to start because I had focused my energy on the large files for some reason. So I started examining the smaller transmissions and started collecting samples of two, three, and four packet transfers. At some point during my research, I'd done patent searches and come across some potentially useful FEC schemes but
hadn't been able to make sense of them. I eventually went back to these patent documents and started trying to piece together how it all worked. It was this mundane patent search that actually proved to be the breakthrough I needed. I just didn't know the answer was sitting under my nose the whole time. So from this point on, I started researching all the math mentioned
in the patents and trying to remember how matrix math worked. I had to learn all about finite field math, Galois fields, and matrix math. Over the course of a few weeks and many, many sheets of paper, I managed to calculate the FEC of small file examples by hand. I don't want to turn this into a long math lesson, but
I want to walk through a high-level overview of how the math works for the FEC system. A key part of the forward error correction is Galois field arithmetic. A Galois field has two important properties. The first is that it's a set of integers where the math operations of addition, subtraction, multiplication, and division between the integers produce another integer in the set.
The second property is that the field is finite. The FEC scheme that I'm speaking of, and in fact most others, is a set of algebraic equations that are used to calculate the missing elements from the others. Performing the math operations in a Galois field is different from standard math. Addition and subtraction are just XOR.
Multiplication and division are a bit more complicated, but the total number of possibilities for a 2^8 field is small, and they can easily be performed on a computer with a lookup table. All Galois field math is done using a primitive polynomial, one not reducible by any other polynomial. In the case of the 2^8 Galois field, the polynomial is represented
by the x^8 equation there on the bottom. This is the same polynomial used for QR codes, Reed-Solomon, and lots of other error correction schemes. So the FEC method I employed, I found out later,
was called the Vandermonde FEC method. It consists of the basic equation y_k = x_n · G, where y_k is the transmitted code word, that is, all the packets including the FEC data, x_n is the original data, and G is an n-by-k generator matrix. And you can get back x_n with the following equation,
which is x_n = z_n · A^-1, where A^-1 is your repair matrix. So in the second equation, z_n is the received code words, and A^-1 is the repair matrix that's based on your generator matrix.
You have to have at least n elements of z to be able to recreate x. So the whole principle of the FEC system is that you have a system of k equations in n variables, so if you have at least n pieces, you can solve for the missing values. So what does all that mean in a real-world implementation? Well, let's roll up our sleeves and do some matrix math.
For example, I'll do a demonstration on a two by three FEC scheme. This means there are two bytes that we want to protect using three bytes. It's a small enough example, we can do it quickly, and yet the whole process will be understood. So the first step is to build a generator matrix. And for this example, it'll be two by three.
It'll be a two by three Vandermonde matrix, and this is the equation right here; basically, it's just a simple geometric progression. So the actual two by three Vandermonde matrix for the 2^8 Galois field is figure five. But this isn't the generator matrix yet.
To create the generator matrix, we need to get it to standard form by reducing it. I'm not gonna get into the details, but that's what figure six is right there. And you'll notice the two by two part on the left of the matrix has all ones on the diagonal. This is called the identity matrix, and it'll prove to have very useful properties later on in the FEC scheme.
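To make those two steps concrete, here's a sketch that builds a small Vandermonde matrix over GF(2^8) and reduces it to standard form. The element choice (a geometric progression over 0, 1, 2) and the 0x11d primitive polynomial are my assumptions, but the reduction does land on the generator used in the rest of the worked example:

    POLY = 0x11D                                      # x^8 + x^4 + x^3 + x^2 + 1 (assumed)

    def gf_mul(a: int, b: int) -> int:
        """Multiply two GF(2^8) elements (shift-and-add, reducing by POLY)."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= POLY
            b >>= 1
        return r

    def gf_inv(a: int) -> int:
        """Multiplicative inverse by brute force; fine for a 256-element field."""
        return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

    def gf_pow(x: int, e: int) -> int:
        r = 1
        for _ in range(e):
            r = gf_mul(r, x)
        return r

    def to_systematic(m, k, n):
        """Gauss-Jordan reduce a k x n matrix so its left k x k block is the identity."""
        for col in range(k):
            piv = next(r for r in range(col, k) if m[r][col])   # assumes a pivot exists
            m[col], m[piv] = m[piv], m[col]
            inv = gf_inv(m[col][col])
            m[col] = [gf_mul(inv, v) for v in m[col]]
            for r in range(k):
                if r != col and m[r][col]:
                    f = m[r][col]
                    m[r] = [a ^ gf_mul(f, b) for a, b in zip(m[r], m[col])]
        return m

    k, n = 2, 3
    vandermonde = [[gf_pow(x, i) for x in range(n)] for i in range(k)]  # rows are x^0, x^1
    print(vandermonde)                      # [[1, 1, 1], [0, 1, 2]]
    print(to_systematic(vandermonde, k, n)) # [[1, 0, 3], [0, 1, 2]] -- identity block on the left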
Okay, so for the two bytes that we want to protect, they're 1f and f7. And I got these values from actual packets, and you can see that in figure nine here. So now to do the math, I don't know how many of you remember how to do matrix multiplication, but I'd completely forgotten.
In a quick review, you flip the two by one matrix on its side and leave the two by three matrix alone. You multiply the two by one matrix on the left with each row and end up with a three by one matrix. So to do the math, you multiply 1f by one and f7 by zero. And you follow this through for each of the three fields.
So in a Galois field, anything times one is itself, and anything times zero is zero, so that makes most of these easy to see. The first byte of the code word is 1f and the second is f7. For the last field I'll give you some help: 1f times three in the Galois field is 21, and f7 times two is f3. And 21 XOR'd with f3 is d2.
So d2 is the FEC value; it would be the byte transmitted in the third packet. You'll also notice the great property that the first two bytes of the code word and the original data are the same. That's because the left section is always the identity matrix, as I mentioned earlier.
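As a sanity check, the same multiplication can be done in a few lines; assuming the common 0x11d primitive polynomial, it reproduces the d2 parity byte from the slide:

    POLY = 0x11D

    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= POLY
            b >>= 1
        return r

    G = [[0x01, 0x00, 0x03],        # reduced generator: identity on the left,
         [0x00, 0x01, 0x02]]        # parity column [3, 2] on the right
    data = [0x1F, 0xF7]

    codeword = [gf_mul(data[0], G[0][j]) ^ gf_mul(data[1], G[1][j]) for j in range(3)]
    print([f"{b:02x}" for b in codeword])   # ['1f', 'f7', 'd2']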
That identity property holds for any FEC data rate. It also means that all the bytes that are written to disk for transfer won't have to be altered later. So now that we've been able to generate the FEC, the next question is: how do we get the data back out? So let's say, in our example from before, that we get the first
packet and the third packet, and we've lost the second. Since we've got two packets' worth of code words, we'll be able to regenerate the missing byte from the second packet. To do so, we need to build the repair matrix. To do this, we take the rows of the generator matrix for the packets we received and build a two by two matrix. In this case, it'll be what you see here in figure 13.
The one and the zero correlate to the first packet, and the three and two are from the generator matrix's third field. This matrix needs to be inverted to create the repair matrix, and I'm gonna spare you that trouble and just show you; that's figure 14 there. So this gives us A inverse, which is what we use to regenerate x_n.
So now we multiply the two bytes that we have, 1f and d2, by A inverse. The math is the same as before, basic matrix arithmetic. It's a very simple example, but that explains the process. I did numerous examples on paper before I was able to understand it enough to do much in software.
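The repair step can be checked the same way. The orientation of the 2x2 matrix below (generator entries for the packets that arrived) and the 0x11d polynomial are again my assumptions, but inverting it and multiplying by the received bytes 1f and d2 gives back the lost f7:

    POLY = 0x11D

    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= POLY
            b >>= 1
        return r

    def gf_inv(a):
        return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

    A = [[0x01, 0x03],              # generator entries for packet 1 and packet 3
         [0x00, 0x02]]
    det_inv = gf_inv(gf_mul(A[0][0], A[1][1]) ^ gf_mul(A[0][1], A[1][0]))
    A_inv = [[gf_mul(det_inv, A[1][1]), gf_mul(det_inv, A[0][1])],   # 2x2 inverse: scaled adjugate
             [gf_mul(det_inv, A[1][0]), gf_mul(det_inv, A[0][0])]]   # (minus equals plus in GF(2^8))

    received = [0x1F, 0xD2]         # the bytes from packets 1 and 3
    recovered = [gf_mul(received[0], A_inv[0][j]) ^ gf_mul(received[1], A_inv[1][j])
                 for j in range(2)]
    print([f"{b:02x}" for b in recovered])   # ['1f', 'f7'] -- the missing byte is back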
So now that I understood how the FEC process worked, I started to work on a software implementation. Luckily, since Galois field math is very standard in computer science, it didn't take long to find some code online to build multiplication and division tables. Once I did that, I wrote a few functions to reduce and invert the matrices. By this point, I had all the mathematical tools and
software, and needed to weave it all together to be able to do the FEC. So for this part, I modified my software to stop stripping the FEC data out of the files when I saved them. I collected a few examples of complete files with complete FEC repair sections. And since I should be able to generate the FEC sections as well as repair sections, my first step was to write software that would verify
the FEC data fields in the saved files. It didn't take very long to implement this in software for files that consisted of fewer than 100 packets. But after that, I ran into problems with files that consisted of more than 100 packets. I made the assumption that 100 must be some arbitrary limit that
the programmers had decided on for some reason. It turns out that after 100 packets, the packets are interleaved, so that 100 by X is the largest scheme. The interleave rate was easy to calculate, based on the total number of packets divided by 100. And any packets that weren't present to get to an even multiple of 100 were just considered all zeros.
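The exact interleave layout is the part I can least verify from the talk alone, so the sketch below only illustrates one plausible reading: packet i of a long file lands in block i mod depth, and short blocks are padded with zeros:

    import math

    def fec_blocks(num_data_packets: int, k: int = 100):
        """Split a long file's packets into interleaved FEC blocks of at most k packets."""
        depth = math.ceil(num_data_packets / k)        # "total packets divided by 100"
        blocks = [[] for _ in range(depth)]
        for i in range(num_data_packets):
            blocks[i % depth].append(i)                # assumed mapping, for illustration only
        return blocks

    print([len(b) for b in fec_blocks(250)])   # [84, 83, 83] -- three blocks, each <= 100 packets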
So all of these little things necessary for a complete implementation were easy to test in software, with complete file and FEC sections to test against. By now I was elated that I was able to repair bad files, but I was still not totally sure what the FEC rate would be. I could make some assumptions based on the total file sizes, but
I took a look back at the myriad of unknown sections in the 03 packets from much earlier, and sure enough, there was a field that correlated with the FEC rate. It was a 32-bit field after the file name. The only reason it wasn't obvious was that it was stored as the FEC value times 100. So if the FEC scheme was 100 by 106, the field would be 10,600.
So once I had this last piece of information figured out, it was time to try and do it in real time on the live stream. The time it takes to build a generator matrix, which is the basis of the repair matrices, is quite quick for the two by three example I just presented, but with the larger FEC rates that are used in the field, such as
100 by 109, it takes much longer. So I pre-calculated the generator matrices and stored them in a header file that was compiled into the executable. I only stored the section of the generator matrix to the right of the identity matrix, and once I implemented the FEC,
I was able to emulate, as far as I could, the entire protocol. So if I ran into any I/O errors, the FEC was able to repair these short bursts. So in conclusion, the biggest challenge of the project for me was the FEC, as the math was something I never really dealt with in my schooling, aside from remembering it long enough to pass a test.
And that was it. So I'm sure you'll be wondering about the video on demand system, if it's still running. The system is running, but all the files now are encrypted, and using some sort of scheme based on AES-256. So hopefully I've explained all this in a way that's easy to understand, and thank you for listening to my presentation.
So once you'd applied all the error correcting stuff, what proportion of the data stream were you able to decode successfully?
100%.