High Volume PDF Text Extraction using Python Open-Source Tools

Formal Metadata

Title
High Volume PDF Text Extraction using Python Open-Source Tools
Title of Series
Number of Parts
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
All major companies have huge amounts of documents (mostly PDF) that contain important, even critically important, information that no longer exists anywhere else in their data stores. Reports once generated for shareholders and legal or financial authorities may still be useful for developing long-term forecasts or triggering company management decisions. By definition, documents are intended for human perception and, as such, contain unstructured data from an information technology perspective. Therefore, tools to extract PDF content (mostly, but not only, text) from millions of pages have become important vehicles for recreating structured information. This presentation talks about the extraction "need for speed" in this Big Data scenario and the need for integration with OCR capabilities, and presents an open-source toolset that combines top-of-the-class performance with maximum extraction detail.
Transcript: English (auto-generated)