What's new in Apache Tika 2.0 - we mean it this time!

Plain Schwarz

Allison, Tim

Formal Metadata

Title

Series Title

Berlin Buzzwords 2021

Number of Parts

Author

Allison, Tim

Contributors

N. N. (Moderator)

License

CC Attribution 3.0 Unported:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.

Identifiers

10.5446/67358 (DOI)

Publisher

Plain Schwarz

Publication Year

2021

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Apache Tika is used in big data document processing pipelines to extract text and metadata from numerous file formats. Text extraction is a critical component of search systems. While work on 2.0 has been ongoing for years, the Tika team released 2.0.0-ALPHA in January and will release 2.0.0 before Buzzwords 2021. In addition to dramatically increased modularization, there are new components to improve scaling, integration, and robustness. This talk will offer an overview of the changes in Tika 2.0, with a deep dive into the new tika-pipes module, which enables synchronous and asynchronous fetching from numerous data sources (JDBC, file share, S3), parsing, and then emitting to other endpoints (file share, S3, Solr, Elasticsearch, etc.).