Building a high throughput low-latency PCIe based SDR
Formal Metadata
Title | Building a high throughput low-latency PCIe based SDR
License | CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier | 10.5446/43911 (DOI)
Number of Parts | 147
Transcript: English (auto-generated)
00:13
Has anyone in here ever worked with libUSB or PyUSB? Hands up.
00:20
OK. Who also thinks USB is a pain? OK. Sergey and Alexander were here back at 26C3. That's a long time ago. I think it was back in Berlin. And back then, they presented their first homemade,
00:41
or not homemade, SDR, software-defined radio. This year, they are back again, and they want to show us how they implemented another one using an FPGA. And to communicate with it, they used PCI Express. So I think if you thought USB was a pain, let's see what they can tell us about PCI Express.
01:03
A warm round of applause for Alexander and Sergey for building a high throughput, low latency, PCIe-based software-defined radio. Hi, everyone.
01:21
Good morning, and welcome to the first day of the Congress. So just a little bit of background about what we've done previously and why we are doing what we are doing right now is that we started working with software-defined radios.
01:42
And by the way, who knows what software-defined radio is? OK, perfect. And who has actually used a software-defined radio, like an RTL-SDR? Fewer people, but still quite a lot?
02:01
OK, good. I wonder whether anyone here uses more expensive radios, like USRPs? Fewer people, but a few. OK, good. Cool. So before 2008, I had no idea what software-defined radio was,
02:21
wasn't working with USRPs, SDR software, et cetera, et cetera. So in 2008, I heard about OpenBTS, got introduced to software-defined radio, and I wanted to make it really work. And that's what led us to today.
02:46
In 2009, we had to develop the ClockTamer hardware, which allowed using a USRP1 to run GSM without problems. Anyone who has ever tried doing this without
03:02
a good clock source knows what I'm talking about. And we presented this (it wasn't really an SDR, it was just a clock source) at 26C3 in 2009. Then we realized that using the USRP1 is not really a good idea
03:20
because we wanted to build robust, industrial-grade base stations. So we started developing our own software-defined radio, which we call UmTRX. We started this in 2011, and our first base stations were deployed in 2013.
03:44
But I always wanted to have something really small and really inexpensive. And back then, it wasn't possible. My original idea in 2011 was to build a PCI Express card.
04:05
Sorry, not a PCI express card, but a mini PCI card. If you remember, there were like other Wi-Fi cards in mini PCI form factor. And I thought that would be really cool to have an SDR in mini PCI, so I can plug this into my laptop or in some embedded PC
04:22
and have nice, small SDR equipment. But back then, it just was not really possible, because electronics were bigger and more power hungry and it just didn't work that way. So we designed UmTRX to work over gigabit Ethernet
04:44
and it was about that size. So now we have spent this year designing something which finally brings me to what I wanted all those years ago. The XTRX is a mini PCI Express card.
05:04
Again, there was no mini PCI Express back then. Now there is mini PCI Express, which is even smaller than Mini PCI, and it's built to be embedded-friendly. So you can plug this into a single board computer,
05:21
embedded single board computer. If you have a laptop with a mini PCI express, you can plug this into your laptop and you have a really small software-defined radio equipment. And we really want to make it inexpensive. That's why I was asking how many of you have ever worked with RTLSDR,
05:40
how many of you have ever worked with USRPs, because the gap between them is pretty big. And we really want to bring software-defined radio to the masses. It definitely won't be as cheap as an RTL-SDR, but we try to make it as close as possible.
06:01
And at the same time, at the size of an RTL-SDR, the price will be higher, but hopefully it will be affordable to pretty much everyone, and we really want to bring high performance into your hands. And by high performance, I mean this is full transmit and receive
06:21
with two channels transmit, two channels receive, which is usually called two-by-two MIMO in the radio world. The goal was to bring it to 160 mega samples per second, which roughly gives you 120 megahertz of radio spectrum.
06:44
So what we were able to achieve is, again, this is the mini PCI Express form factor. It has a small Artix-7, the smallest and most inexpensive FPGA,
07:00
which has the ability to work with PCI Express. It has the LMS7002M chip as the RFIC, a very high performance, very tightly integrated chip with even some DSP blocks inside.
07:23
It even has a GPS chip. Do you see my screen? No, you can't, but on the upper right side you can see the GPS chip. So you can actually synchronize your SDR to GPS
07:40
for perfect clock stability. So you won't have any problems running any telecommunication systems like GSM, 3G, 4G, due to clock problems. And it also has interface for SIM cards, so you can actually create a software defined radio modem,
08:04
and there are already open source projects to build one for LTE, called srsUE, if you're interested. Et cetera, et cetera. So it's a really, really tightly packed device. And if you put this into perspective,
08:21
that's how it all started in 2006, and that's what you have 10 years later. It's pretty impressive. So, thanks. But I think it's actually applause to the whole industry who is working on shrinking the sizes because we just put stuff on the PCB.
08:41
You know, we are not building the silicon itself. Interesting thing is that we did the first approach. We said, let's pack everything. Let's do a very tight PCB design. We did an eight layer PCB design,
09:01
and when we sent it to a fab to estimate the cost, it turned out to be $15,000 per piece. Well, in small volumes, obviously, but still a little bit too much. So we had to redesign this, and the first thing which we did
09:23
is we still kept eight layers because in our experience, number of layers nowadays have only minimal impact on the cost of the device. So like six, eight layers, the price difference is not so big. But we did complete rerouting
09:45
and only kept two-deep microvias and never used buried vias. This makes it much easier and much faster for the fab to manufacture, and the price suddenly went down five or six times.
10:01
And in volume again, it will be significantly cheaper. And that's just for geek porn, how PCB looks inside. So now let's go into real stuff. So PCI Express, why did we choose PCI Express?
10:24
As was said, USB is a pain in the ass. You can't really use USB in industrial systems for a whole variety of reasons; it's just unstable. So we did use Ethernet for many years successfully,
10:41
but Ethernet has one problem. First of all, inexpensive Ethernet is only one gigabit, and one gigabit doesn't offer enough bandwidth to carry all the data we want. Plus it's power hungry, et cetera, et cetera. So PCI Express is really a good choice, because it's low power, it has low latency,
11:04
it has very high bandwidth and it's available almost universally. When we started looking into this, we realized that even ARM boards, some of ARM boards have mini PCI Express slots, which was a big surprise for me, for example. So the problems is that unlike USB,
11:26
you do need to write your own kernel driver for this and there is no way around. And it is really hard to write this driver universally, so we are writing it obviously for Linux
11:42
because we are working with embedded systems, but if we want to rewrite it for Windows or for macOS, we'll have to do a lot of rewriting. So we focus on Linux only right now. And now the hardest part: debugging is really non-trivial. One small error and your PC completely hangs
12:02
because you did something wrong. And you have to reboot it and restart it. And that's like debugging kernel, but sometimes even harder. To make it worse, there is no really easy to use plug and play interface. If you want to restart normally when you develop a PCI Express card,
12:23
when you want to restart it, you have to restart your development machine. Again, not a nice way to work; it's really hard. So the first thing we did is we found that we can use Thunderbolt 3, which was just recently released,
12:43
and it has the ability to work directly with the PCI Express bus. So it basically has a mode in which it turns PCI Express into a plug-and-play interface. So if you have a laptop which supports Thunderbolt 3,
13:03
then you can use this to plug or unplug your device, to make your development easier. But there are always problems. There's no easy way. There is no documentation.
13:24
Thunderbolt 3 is not compatible with Thunderbolt 2. So we had to buy a special laptop with Thunderbolt 3 with special cables, like all this hard stuff. And if you really want to get documentation, you have to sign NDA and send a business plan to them
13:46
so they can approve that your business makes sense. So we actually opted out. We decided not to go through this. What we did is we found that someone
14:02
is actually making PCI Express to Thunderbolt 3 converters and selling them as dev boards. And that was a big relief, because it saved us lots of time and lots of money. You have to order it from some Asian company.
14:21
And yeah, this is how it looks like, this converter. So you buy several pieces. You can plug in your PCI Express card there and you plug this into your laptop. And this is with XTRX already plugged into it. The only problem we found is that typically,
14:45
UEFI has a security control enabled so that any random Thunderbolt device can't hijack your PCI bus and can't get access to your kernel memory and do some bad stuff, which is a good idea.
15:00
The only problem is that it's not fully implemented in Linux. So under Windows, if you plug in a device which has no security features, which is not certified, that will politely ask you, do you really trust this device? Do you want to use it? You can say yes. Under Linux, it just does not work.
15:20
So we spent some time trying to figure out how to get around this. There are some patches from Intel which are not mainline, and we were not able to actually get them to work. So we just had to disable all this security measure in the laptop. So be aware that this is the case.
15:41
And we suspect that happy users of Apple might not be able to do this because Apple don't have BIOS, so you probably can't disable this feature. Probably a good incentive for someone to actually finish writing the driver.
16:02
So now to the goal. We want to achieve 160 mega samples per second, two-by-two MIMO, which means two transmit and two receive channels at 12 bits, which is roughly 7.5 gigabit per second (160 MS/s × 2 channels × 2 for I and Q × 12 bits is about 7.7 Gbit/s per direction).
16:22
So first result, when we got this board from the fab, it didn't work. That was expected. Yeah, it was expected. So the first interesting thing we realized is that, well, first of all, the FPGA has hardware blocks
16:41
for talking to PCI Express, called GTP transceivers, which basically implement the PCI Express serial physical layer. But the thing is, the lane numbering is reversed between PCI Express and the FPGA,
17:00
and we didn't realize this. So we had to do very, very fine soldering to actually swap the lanes. You can see this very fine work there. We also found that one of the components was dead bug, which is a well-known term for chips,
17:24
which are placed mirrored at the design stage. We accidentally mirrored the pinout, so we had to solder it upside down. And if you realize how small it is,
17:41
you can also appreciate the work done. And what's funny, when I was looking at dead bugs, I actually found a manual from NASA which describes how to properly solder dead bugs to get it approved.
18:00
So this is the link. I think you can go there and enjoy. There's lots of fun stuff there. So after fixing all of this, our next attempt kind of works. So next stage is debugging the FPGA code, which has to talk to PCI Express,
18:22
and PCI Express has to talk to the Linux kernel, the Linux kernel has to talk to the driver, and the driver has to talk to user space. The peripherals were easy: all the UARTs and SPIs we got working almost immediately, no problems with that, but DMA was a real beast.
18:43
So we spent lots of time trying to get DMA to work. And the problem is that with DMA, it's on the FPGA, so you can't just place a breakpoint like you do in C or C++ or any other language. It's a real-time system running.
19:01
Not a system exactly, it's real-time hardware running on the fabric. So Sergey, who was mainly developing this, had to write a lot of small test benches and test everything piece by piece.
19:23
So every part of the DMA code we had was wrapped into a small test bench which emulated all the tricks. And as the classics predicted, it took about five to ten times longer
19:41
than actually writing the code. So we really blew up our predicted timelines by doing this, but at the end, we've got really stable work. So some suggestions for anyone who will try to repeat this exercise
20:01
is: there is a logic analyzer built into the Xilinx tools, and you can use it; it's nice. Sometimes it's very helpful, but you can't debug transient bugs which only show up under some weird conditions. So you have to implement some read-back registers
20:23
which show important statistics, important data about how your system behaves. In our case, it's various counters on the DMA interface, so you can actually see what's happening with your data. Is it received?
20:40
Is it sent? How much is sent? How much is received? For example, we can see when we saturate the bus, or when there's an underrun because the host is not providing data fast enough, so we can at least understand whether it's a host problem or an FPGA problem and which part we should debug next,
21:01
because again, it's a very multi-layer problem: you start with the FPGA, then PCI Express, the kernel, the driver, and user space, and any part can fail. So you can't work blind like this.
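As a rough illustration of the read-back register idea: the FPGA exposes a few counters through a register window (a BAR) and the host dumps them while reproducing a problem. The offsets and names below are hypothetical, not the actual XTRX register map; only the approach comes from the talk.

```c
/* Hypothetical DMA debug counters, read from a mmap()ed BAR.
 * Offsets and names are invented for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define CNT_RX_BUFS_FILLED    0x00   /* buffers written by the FPGA      */
#define CNT_RX_BUFS_RELEASED  0x04   /* buffers given back by the host   */
#define CNT_TX_UNDERRUNS      0x08   /* host didn't provide data in time */
#define CNT_PCIE_TLPS_SENT    0x0c   /* TLPs pushed onto the bus         */

static uint32_t rd(volatile uint32_t *bar, unsigned off) { return bar[off / 4]; }

void dump_dma_counters(volatile uint32_t *bar)
{
    printf("rx filled=%u released=%u  tx underruns=%u  tlps=%u\n",
           rd(bar, CNT_RX_BUFS_FILLED), rd(bar, CNT_RX_BUFS_RELEASED),
           rd(bar, CNT_TX_UNDERRUNS), rd(bar, CNT_PCIE_TLPS_SENT));
}
```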
21:23
So again, the goal was 160 mega samples per second. With our first naive implementation, we got two mega samples per second, roughly 60 times slower. The problem was that the software just wasn't keeping up and wasn't sending data fast enough. Many things were done, but the most important ones were: use real-time priority
21:40
if you want to get very stable results, and, well, fix software bugs. One of the most important bugs we had was that DMA buffers weren't freed immediately, so they stayed busy for longer than they should be, which introduced extra cycles and basically just reduced the bandwidth.
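A minimal sketch of the "real-time priority" point: the thread that feeds and drains the DMA buffers is put into SCHED_FIFO so the scheduler doesn't delay it. This is standard Linux API usage shown for illustration, not code from the XTRX driver; it needs CAP_SYS_NICE or root.

```c
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Put the calling (streaming) thread into the SCHED_FIFO real-time class. */
static int make_me_realtime(int priority)
{
    struct sched_param sp = { .sched_priority = priority };
    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err)
        fprintf(stderr, "SCHED_FIFO failed: %s\n", strerror(err));
    return err;
}
```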
22:02
So at this point, let's talk a little bit about how to implement a high-performance driver for Linux, because if you want to get real performance, you have to start with the right design. So there are basically three approaches
22:23
or rather two approaches and the whole spectrum in between, which you can think of as three. The first approach is full kernel control, in which case the kernel driver not only handles the transfers,
22:42
it actually has all the logic for controlling your device and only exports ioctls to user space, and that's kind of the traditional way of writing drivers. Your user space is completely abstracted from all the details. Well, the problem is that this is probably
23:03
the slowest way to do it. The other way is what's called a zero-copy interface: only the control path is held in the kernel, and the raw data is provided to user space as is,
23:20
so you avoid the memory copy, which makes it faster, but still not fast enough if you really want to achieve maximum performance, because you still have context switches between the kernel and user space.
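To make the zero-copy idea concrete, here is what the user-space side could look like: the driver owns the DMA ring, and the application mmap()s it instead of read()ing copies. The device node, ring size and the buffer-release mechanism are assumptions for illustration, not the real XTRX interface.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_BYTES (32 * 65536)   /* e.g. 32 buffers of 64 KiB (assumed) */

int main(void)
{
    int fd = open("/dev/sdr0", O_RDWR);          /* hypothetical node */
    if (fd < 0)
        return 1;

    /* Map the driver's DMA buffers straight into our address space:
     * samples land here without an extra memcpy. */
    uint8_t *ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
        return 1;

    /* ... consume samples directly from `ring`, then tell the driver
     * which buffer may be reused (ioctl, write, or a control register) ... */

    munmap(ring, RING_BYTES);
    close(fd);
    return 0;
}
```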
23:42
The fastest approach possible is a full user-space implementation, where the kernel just exposes everything and says: now you do it yourself. You have almost no context switches, and you can really optimize everything. So what are the problems with this?
24:06
The pros I already mentioned: no switches between kernel and user space, very low latency because of this, and very high bandwidth. But if you're not interested in getting the very highest performance
24:22
and you just want some low-bandwidth operation, then you will have to add hacks, because you can't get a notification from the kernel that a resource is available, that more data has arrived. It also makes the system vulnerable,
24:43
because if user space can access everything, then it can do whatever it wants. One more important thing is how to actually get the best performance
25:01
out of the bus: whether you want to poll your device or not poll and get notified instead. So what is polling? I guess every programmer understands it: polling is when you ask repeatedly, are you ready, are you ready, are you ready, and when it's ready, you get the data immediately.
25:21
It's basically a busy loop: you're just constantly asking the device what's happening, and you need to dedicate a full core. Thank God we have multi-core CPUs nowadays, so you can dedicate a full core to this polling and just poll constantly.
25:42
But again, if you don't need this highest performance and you just need to get something, then you will be wasting a lot of CPU resources.
26:00
So in the end, we decided on a combined architecture: it is possible to poll, but there is also a way to get notifications from the kernel, for applications which need low bandwidth but better CPU efficiency. I think that's the best way if you're trying to target both worlds.
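A sketch of that combined approach, assuming the driver exposes both a mappable "buffer ready" flag and poll() support on its device node (both are assumptions here): a latency-critical application burns a core busy-polling, while a low-rate application simply blocks.

```c
#include <poll.h>
#include <stdint.h>

/* Option 1: busy-poll a ready flag for minimum latency (costs one core). */
static void wait_busy(volatile uint32_t *ready_flag)
{
    while (*ready_flag == 0)
        ;                         /* spin until the device marks data ready */
}

/* Option 2: let the kernel wake us up when the driver signals POLLIN. */
static int wait_blocking(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, -1);     /* -1 = no timeout */
}
```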
26:27
Now, very quickly, the architecture of the system. We tried to make it very, very portable and flexible.
26:43
There is a kernel driver which talks to a low-level library that implements all the logic we took out of the driver: controlling PCI Express, working with DMA, and all the details of the actual transfer implementation.
27:07
plus implementation. And then there is a high level library which talks to this low level library and also to libraries which implement control of actual peripherals,
27:22
and most importantly to the library which implements control over our RFIC chip. So this way, it's very modular, we can replace PCI express with something else later, we might be able to port it to other operating systems,
27:41
and that's the goal. Another interesting issue is when you start writing a Linux kernel driver, you very quickly realize that while LDD, which is a classic book for Linux driver writing, is good and it will give you a good insight,
28:01
it's not actually up to date. It's more than 10 years old, and there are lots of new interfaces which are not described there, so you have to resort to reading the manuals and the documentation in the kernel itself. Well, at least you get up-to-date information.
28:23
The decisions we made are aimed at making everything easy, so we expose the GPS over a TTY, and you can attach pretty much any application which talks to GPS; all the existing applications can just work out of the box,
28:43
and we also wanted to be able to synchronize the system clock to GPS, so we get automatic clock synchronization across multiple systems, which is very important when we are deploying many, many devices around the world.
29:00
So we plan to do two interfaces: one is kernel PPS and the other is DCD, because the DCD line of the UART is exposed over the TTY. Again, we found that there are two types of applications, ones that support one API and others that support the other API, and there is no common one, so we have to support both.
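For the DCD variant, one way this can look from user space is the TIOCMIWAIT ioctl, which sleeps until a modem-status line of the TTY changes; an application can use it to catch the 1PPS edge. This is a generic Linux TTY mechanism shown as a sketch, not code taken from the XTRX software, and the device path is only an example.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/* Block until the DCD (carrier detect) line of the GPS TTY toggles. */
int wait_for_pps_edge(const char *tty_path)
{
    int fd = open(tty_path, O_RDONLY | O_NOCTTY);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, TIOCMIWAIT, TIOCM_CD);   /* sleep until DCD changes */

    close(fd);
    return ret;
}
```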
29:22
What else is interesting: as we described, we want to support poll(), so we can get notifications from the kernel when data is available and we don't need
29:44
to do real busy-looping all the time. So after all the software optimizations, we got to about 10 mega samples per second, still very, very far from what we want to achieve. Now there should have been a lot of explanation
30:03
about PCI Express, but when we actually wrote everything we wanted to say, we realized it's just like a full two hours talk just on PCI Express, so we are not going to give it here, I'll just give some highlights which are most interesting. If there's real interest, we can set up a workshop
30:24
on one of the later days and talk in more detail about PCI Express specifically. So the thing is, there are no open source cores for PCI Express which are optimized for high-performance
30:42
real-time applications. There is Xillybus, which as I understand is not really open source, but they provide you the source if you pay them. It's very popular because it's very, very easy to use, but it doesn't give you performance. If I remember correctly, the best it can do
31:01
is maybe 50% bus saturation. There's also the Xilinx implementation, but if you're using the Xilinx implementation with Xillybus, then you're really locked in with Xillybus and with Xilinx, and it's also not very efficient in terms of resources,
31:23
and if you remember, we want to make this very, very inexpensive, so our goal is to be able to fit everything into the smallest Artix-7 FPGA, and that's quite challenging with all the stuff in there; we just can't waste resources.
31:42
So the decision was to write our own PCI Express implementation, and that's how it looks. I'm not going to discuss it right now. There were several iterations; the initial ones looked much simpler but turned out not to work well.
32:04
Some interesting stuff about PCI Express which we stumbled upon: it was working really well on Atom, which is our main development platform because we do a lot of embedded stuff. But when we tried to plug this into a Core i7, it just started hanging once in a while.
32:27
So after several days of debugging, maybe more, Sergey found that there's a very interesting statement in the standard which says that the value zero in the byte count field actually stands not for zero bytes, but for 4,096 bytes. I mean, that's a really cool optimization.
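The gotcha in code form, as a hedged sketch: the Byte Count field of a PCIe completion is 12 bits wide, and the all-zero encoding means the full 4,096 bytes rather than "nothing left", so a naive decoder thinks a transfer is finished 4 KiB too early.

```c
#include <stdint.h>

/* Decode the 12-bit Byte Count field from the completion header DWORD
 * that carries it (low 12 bits). The encoding 0 stands for 4096 bytes. */
static inline unsigned cpl_byte_count(uint32_t cpl_header_dword)
{
    unsigned bc = cpl_header_dword & 0xfff;
    return bc ? bc : 4096;
}
```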
32:47
So another thing is the completion, which is the PCI Express term for an acknowledgement,
33:02
which can also carry some data back for your request. Sometimes, if you're not sending a completion, the device just hangs, and what happens in this case, due to some historical heritage
33:21
of x86, is that reads just start returning you FFs. And if you have a register which says whether your device is okay, and this register returns one to say the device is okay, guess what will happen? You will always be reading that your device is okay,
33:40
so the suggestion is not to use one as the status for okay, and to use either zero or, better, something like a two-bit sequence, so you are definitely sure that the device is okay and you're not just reading back FFs.
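The same advice as a sketch: pick an "alive" value that cannot be confused with the all-ones pattern a dead or missing PCIe device returns, nor with zero. The register offset and magic value below are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define REG_DEVICE_STATUS   0x10          /* hypothetical offset */
#define DEVICE_ALIVE_MAGIC  0x5aa5c33cu   /* neither 0 nor ~0    */

static bool device_alive(volatile uint32_t *bar)
{
    uint32_t v = bar[REG_DEVICE_STATUS / 4];
    if (v == 0xffffffffu)                 /* reads as all-ones: device gone */
        return false;
    return v == DEVICE_ALIVE_MAGIC;
}
```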
34:04
So when you have a device which may fail at any of the layers, and you've just got this new board, it's really hard to debug, and there's a lot of memory corruption. We had a software bug which was writing DMA addresses incorrectly, and we were wondering
34:25
why we were not getting any data in our buffers while at the same time, after several starts, the operating system just crashed. Well, that's the reason why there is this UEFI protection
34:42
which prevents you from plugging in devices like this into your computer because it was basically writing data, like random data into random portions of your memory. So a lot of debugging, a lot of tests and test benches
35:00
and we were able to find this. And another thing is if you de-initialize your driver incorrectly and that's what's happening when you have plug-and-play device which you can plug and unplug, then you may end up in a situation
35:20
where you are trying to write into memory which is already freed by operating system and used for something else. A very well-known problem, but it also happens here. So why DMA is really hard
35:42
is because of this completion architecture for reading data. Writes are easy: you just send the data and forget about it; it's a fire-and-forget system. But for reads, you really need to get your data back,
36:04
and the thing is it looks like this. I really hope that there will be some pointing device here but basically on the top left, you can see requests for read and on the right, you can see completion transactions.
36:24
So basically each read request can be, and most likely will be, split into multiple completions. So first of all, you have to collect all these pieces and write them into the proper parts of memory,
36:42
but that's not all. The thing is, the latency between request and completion is really high, something like 50 cycles. So if you have only a single transaction in flight, you will get really bad performance. You do need to have multiple transactions in flight,
37:02
and the worst thing is that transactions can return data in a random order. So it's a much more complicated state machine than we expected originally.
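To illustrate what that state machine has to keep track of (written here as host-side C rather than FPGA fabric logic, purely for readability): several read requests stay in flight, each identified by a tag, and every completion fragment is steered to the right offset of the right buffer in whatever order the tags come back.

```c
#include <stdint.h>
#include <string.h>

#define MAX_TAGS 32

struct read_req {
    uint8_t  *dst;        /* where this request's data belongs  */
    uint32_t  offset;     /* how much of it has already arrived */
    uint32_t  remaining;  /* bytes still expected for this tag  */
};

static struct read_req inflight[MAX_TAGS];

/* Called for every completion fragment that comes back. */
void on_completion(uint8_t tag, const void *payload, uint32_t len)
{
    struct read_req *r = &inflight[tag % MAX_TAGS];

    memcpy(r->dst + r->offset, payload, len);
    r->offset    += len;
    r->remaining -= len;

    if (r->remaining == 0) {
        /* request fully satisfied; the tag can be reused for a new read */
    }
}
```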
37:22
When I said our architecture was much simpler originally: we did not have all of this, and we had to realize it while implementing. So again, here was supposed to be a whole description of how exactly this works, but not this time. So now, after all these optimizations,
37:41
we've got 20 mega samples per second, which is just six times lower than what we are aiming at. So now the next thing is PCI Express lane scalability. So PCI Express is a serial bus,
38:01
so it has multiple lanes and they allow you to basically horizontally scale your bandwidth. One lane is like X, then two lane is 2X, four lane is 4X. So the more lanes you have, the more performance you're getting out of your bus.
38:22
Sorry, the more bandwidth you get out of your bus, not performance. So the issue is that the mini PCI Express standard only standardizes one lane, and the second lane is left as optional. So most motherboards don't support this.
38:42
There are some but not all of them and we really wanted to get this done. So we designed a special converter board which allows you to plug your mini PCI Express into a full-size PCI Express and get two lanes working.
39:04
And we're also planning to have a similar board which will have multiple slots, so you will be able to put multiple XTRX SDRs onto the same carrier board and plug this into, let's say, a PCI Express x16 slot.
39:21
And you will get really a lot of IQ data, which will then be your problem to process. So with two lanes, it's about twice the performance, and we are getting 50 mega samples per second.
39:44
And then it's time to really cut the fat, because the real sample size of the LMS7 is 12 bits, and we are transmitting 16 because it's easier: the CPU works with 8, 16, and 32 bits.
40:05
So we originally designed the driver to support 8-bit, 12-bit, and 16-bit wire formats to be able to do this scaling. And for the test we said, okay, let's go from 16 to 8 bits.
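A trivial sketch of the 16-to-8-bit repacking idea (where exactly the real conversion happens, FPGA or host, is not stated here): drop the low bits of each sample and halve the data on the bus at the cost of dynamic range.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert 16-bit I/Q samples to 8 bits by keeping the top byte of each. */
void iq16_to_iq8(const int16_t *in, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int8_t)(in[i] >> 8);
}
```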
40:23
We'll lose some dynamic range, but who cares these days. It still stayed the same: 50 mega samples per second, no matter what we did. There was a lot of interesting debugging going on,
40:43
and we realized that we had made another, well, not really a mistake, we just didn't know this when we designed it: we should have used a higher voltage for this high-speed bus
41:01
to get it to full performance. At 1.8 volts, the signal was just degrading too fast and the bus itself was not performing well. So our next prototype will use a higher voltage specifically for this bus.
41:21
And this is kind of stuff which makes designing hardware for high-speed really hard because you have to care about coherence over the parallel buses on your system. So at the same time, we do want to keep 1.8 volt for everything else as much as possible because another problem we are facing with this device
41:43
is that, by the standard, mini PCI Express allows only 2.5 watts of power consumption and no more. And we were very lucky that the LMS7 has such good power consumption
42:03
that we actually had some headroom left for the FPGA and GPS and all this stuff, but we just can't let the power consumption go up. Our measurements on this device showed about 2.3 watts of power consumption, so we are at the limit at this point.
42:24
So when we fix the bus with a higher voltage, now it's a theoretical exercise because we haven't done this yet. It's planned to happen in a couple of months. We should be able to get to this number which is just 1.2 times slower.
42:43
And then the next thing will be to fix another mistake we made at the very beginning: we procured the wrong chip. Just one digit difference. You can see it highlighted in red and green.
43:04
And this chip supports only generation one PCI Express which is twice slower than generation two PCI Express. So again, hopefully we'll replace the chip and just get very simple doubling of the performance.
43:26
Still, it will be slower than we wanted it to be. And here is where practical versus theoretical numbers come in.
43:40
Well, like every bus, it has overheads. And one of the things which, again, we realized while implementing this is that even though the standard allows a payload size of up to four kilobytes, actual implementations are different. For example, desktop processors like Intel Core
44:03
or Intel Atom, they only have 128 byte payload so there is much more overhead going on the bus to transfer data. And even theoretically, you can only achieve 87% efficiency.
44:26
On Xeon, we tested and found that it uses a 256-byte payload size, which can give you about 92% efficiency on the bus. And this is before other overheads, so reality is even worse.
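The quoted efficiencies are roughly what you get if you assume on the order of 20 bytes of framing, sequence number, header and CRC around each TLP payload; the exact overhead depends on the header format and optional ECRC, so treat the figure below as an approximation rather than a spec-exact value.

```c
#include <stdio.h>

int main(void)
{
    const double overhead = 20.0;   /* approx. bytes of per-TLP overhead */

    for (int payload = 128; payload <= 256; payload *= 2)
        printf("max payload %3d B -> ~%.1f%% bus efficiency\n",
               payload, 100.0 * payload / (payload + overhead));
    /* ~86.5% and ~92.8%, in line with the numbers quoted above */
    return 0;
}
```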
44:41
An interesting thing which we also did not expect: we were originally developing on Intel Atom and everything was working great.
45:00
When we plug this into a laptop like Core i7, multi-core, really powerful device, we didn't expect that it wouldn't work. Obviously, like Core i7 should work better than Atom. No, not always. The thing is we were plugging into a laptop
45:24
which had a built-in video card sitting on the same PCIe bus, and the manufacturer probably hard-coded a higher, I forgot the word, priority, yes,
45:47
a higher priority for the video card than for everything else in the system, because you don't want your screen to flicker. So when you move a window, you actually see late packets arriving at your PCIe device.
46:02
We had to introduce a jitter buffer and add more FIFO into the device to smooth this out.
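A back-of-the-envelope way to size such a FIFO, using the stream rate from earlier in the talk; the stall durations and resulting depths are illustrative, not the actual values used on the board.

```c
#include <stdio.h>

int main(void)
{
    /* 160 MS/s x 2 channels x 2 (I and Q) x 12 bit  ~= 0.96 GB/s */
    const double bytes_per_sec = 160e6 * 2 * 2 * 12 / 8;

    for (int stall_us = 1; stall_us <= 16; stall_us *= 2)
        printf("%2d us stall -> ~%.1f KiB of FIFO needed\n",
               stall_us, bytes_per_sec * stall_us * 1e-6 / 1024.0);
    return 0;
}
```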
46:23
On the other hand, Xeon performs really well; it's very optimized. We tested it with a discrete graphics card, and it outperforms everything else by a whopping 5-7%. That's what you get for the price. So this is actually the end of the presentation. We still have not scheduled any workshop,
46:41
but if there's any interest in actually seeing the device working, or if you're interested in learning more about the PCI Express in details, let us know, we'll schedule something in the next few days. That's it, and I think we can proceed with questions
47:02
if there are any. Okay, thank you very much. If you are leaving now, please try to leave quietly because we might have some questions and we want to hear them.
47:20
And if you have questions, please line up right behind the microphones and I think we'll just wait because we don't have anything from the signal angel. However, if you're watching on stream, you can hop into the channels and over social media to ask questions and they will be answered hopefully.
47:40
So that microphone. What's the minimum and maximum frequency of the card? You mean RF frequency? No, the minimum frequency you can sample at. Most SDR devices can only sample above 50 megahertz.
48:05
Is there a similar limitation on your card? Yeah, so if you're talking about RF frequency, it can go from almost zero, it does work below 50 megahertz, all the way up to 3.8 gigahertz, if I remember correctly.
48:26
And in terms of the sample rate, right now it works from about two mega samples per second up to about 50. But again, we're planning to get it to the numbers we quoted.
48:41
Okay, the microphone over there. Thanks for your talk. Did you manage to get your Linux kernel driver into mainline? Oh, not yet. I mean, it's not even fully published. I did not say this in the beginning, sorry: we have only just manufactured the first prototype, which we debugged heavily.
49:01
So we are only now planning to manufacture the second prototype with all these fixes, and then we will release the kernel driver and everything else. And as for mainline, I'm not sure; maybe we'll try, maybe we won't, we haven't decided yet. Thanks. Okay, and that will be a whole other experience.
49:23
Okay, over there. Hey, it looks like you went through some incredible amounts of pain to make this work. So I was wondering, aren't there any simulators, at least for parts of the system, for the PCIe bus, for the DMA something? Any simulator so that you can actually first design the system there and debug it more easily?
49:44
Yes, there are simulators available, but the problem is they're non-free, so you have to pay for them. So yeah, we chose the hard way. Okay, thanks. Okay, we have a question from the Signal Angel.
50:01
Yeah, are the FPGA code, the Linux driver and library code, and the design project files public? And if so, have they been posted yet? They don't think they have been posted yet; they can't find them on xtrx.io. Yeah, so they're not published yet. As I said, we haven't released them.
50:20
The drivers and libraries will definitely be available. The FPGA code we are considering; it will probably also be available as open source, but we will publish everything together with the public announcement of the device.
50:41
Okay, that microphone. Yes, did you guys see any signal integrity issues on the PCIe bus, or on the bus to the LMS chip, the Lime Microsystems chip, I think, that's doing that, right? Did you try to measure signal integrity, because there were some reliability issues, right? Yeah, so with PCIe,
51:01
we never had issues, if I remember correctly; it was just working. Actually, the board is so small, and with such short traces there is no problem with signal integrity, so that actually saved us. Yeah, designing a small board is easier. With the LMS7, the problem is not signal integrity
51:22
in terms of differences in trace length, but rather the fact that the signal degrades with speed at that voltage and drops below the detection level
51:41
and all this stuff. We did some measurements. I actually wanted to add some pictures here, but decided that's not going to be super interesting. Okay, microphone over there. Yeah, thanks for the talk. How much work would it be to convert the two by two SDR into an eight input logic analyzer
52:03
in terms of hardware and software? So, to have a really fast logic analyzer where you can record unlimited traces. A logic analyzer, so basically it's also just an analog-to-digital converter,
52:21
and you would largely want fast sampling and a large amount of memory to store the traces. Well, I just think it's not the best use for it. It's probably, I don't know. Maybe Sergey has any ideas, but I think it's just maybe easier
52:42
to get a high-speed ADC and replace the lime chip with a high-speed ADC to get what you want. Because the lime chip has so many things there specifically for RF. Yeah, the main problem,
53:00
you cannot just sample the original data; it gets shifted in frequency, so you cannot sample the original signal. So using this kit for something other than spectrum analysis is hard. Thanks.
53:22
Okay, another question from the internet. Yes, have you compared the sample rate of the ADC of the LimeSDR chip to the USRP ADCs? And if so, how does the lower sample rate affect the performance? So, comparing low sample rate to high sample rate:
53:42
we have not done much testing on the RF performance yet, because we were so busy with all this other stuff. So we are yet to see how the lower sample rate compares to a higher sample rate.
54:02
Well, a high sample rate always gives you better performance, but you also get higher power consumption, so I guess it's a question of what's more important for you. Okay, over there. As I understand it, there is no mixer bypass,
54:21
so you can't directly sample the signal. Is there a way to use the same antenna for send and receive? Actually, there is an input for the ADC, but it's not a bypass; it's a dedicated pin on the LMS chip. And since we are very space-constrained,
54:40
we didn't go with it, so you cannot actually bypass it. That's in our specific hardware; in general, to rephrase, the LMS chips have a special pin which allows you to drive your signal directly to the ADC, without all the mixers, filters and all this radio stuff, just directly to the ADC.
55:02
So yes, theoretically that's possible. We even thought about this, but it doesn't fit the design. Okay, and can I share antennas? Because I have an existing laptop with existing antennas, but I would use the same antenna for send and receive. Yeah, so that depends on
55:22
what exactly you want to do. If you want a TDD system, then yes. If you want an FDD system, then you will have to put a small duplexer in there. But yeah, that's the idea: you can plug this into your laptop and use your existing antennas. That's one of the intended ways to use the XTRX.
55:40
Yeah, because there are four connectors. One thing which I actually forgot to mention, I kind of mentioned it in the slides, is that other SDRs which are based on Ethernet or on USB can't work with CSMA wireless systems. And the most famous CSMA system is Wi-Fi.
56:03
It turns out that because of the latency between your operating system and your radio over USB, you just can't react fast enough for Wi-Fi to work. You probably know that in Wi-Fi you do carrier sensing,
56:21
and if you sense that the spectrum is free, you start transmitting. That doesn't make sense when you have huge latency, because you only know that the spectrum was free back then. So with the XTRX, you actually can work with CSMA systems like Wi-Fi.
56:42
So again, it makes it possible to have a fully software implementation of Wi-Fi in your laptop. It obviously won't work like as good as your commercial Wi-Fi because you will have to do a lot of processing on your CPU, but for some purposes,
57:03
like experimentation, for example for wireless labs and R&D labs, that's really valuable. Okay, over there. Okay, what PCB design package did you use? Altium. Altium, yeah. And I'd be interested in the PCI Express workshop.
57:21
Would be really great if you do this one. Sorry, say again? Would be really great if you do the PCI Express workshop. Wow, PCI Express workshop, okay, thank you. Okay, I think we have one more question from the microphones and that's you. Okay, great talk and again, I would appreciate the PCI workshop if it ever happens.
57:44
What are the synchronization options between multiple cards? Can you synchronize the ADC clock and can you synchronize the presumably digitally created IF? So unfortunately, just IF synchronization is not possible
58:05
because the Lime chip doesn't expose the LO, but we can synchronize digitally. So we have dedicated 1PPS signal synchronization, we have lines for clock synchronization, and other things we can do in software.
58:24
So the Lime chip has a phase correction register, so when you measure the phase difference, you can compensate for it on different boards. So you tune to a station a long way away and then rotate the phase until it aligns.
58:41
Thank you. Yeah, a little tricky but possible. So that's one of our plans for future because we do want to see like 128 by 128 MIMO at home. Okay, we have another question from the internet. I actually have two questions. The first one is what is the expected price
59:02
after the prototype stage? And the second one is: can you tell us more about the setup that you had for debugging the PCIe issues? Could you repeat the second question? The setup for debugging. More about the setup you had for debugging the PCIe issues.
59:22
The second question, I think, is mostly a topic for our next workshop, because it's a more complicated setup; we removed most of it from the current presentation. But in general, in terms of hardware, that was our setup. So we bought this PCI Express to Thunderbolt 3 converter.
59:45
We bought a laptop which supports Thunderbolt 3 and that's how we were debugging it. So we don't need like a full-fledged PC. We don't have to restart it all the time. So in terms of price, we don't have the fixed price yet.
01:00:04
What I can say right now is that we are targeting no more than what BladeRF or HackRF devices cost, and probably even cheaper for some versions.
01:00:22
Okay, we are out of time. So thank you again, Sergey and Alexander.