Introduction

This library transforms common files and data sources, such as a Twitter feed, into a uniform document format that other processes can consume, for example natural language processing pipelines. Given an HTML file, it removes the HTML tags and extracts only the human-visible text. Given a PDF file, it likewise extracts only the human-visible text. The resulting documents are indistinguishable from any other text document, regardless of their source, so downstream processes can handle them all in the same way. This process is called ingestion.
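To make the idea concrete, the HTML case can be sketched with nothing but the Python standard library. This is an illustration of the concept only, not this library's API: the `Document` class, the `ingest_html` function, and the `source` field are hypothetical names chosen for the example, and "human-visible text" is approximated here as every text node outside `<script>` and `<style>` elements.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Document:
    """Hypothetical uniform document: plain text plus a label for its origin."""
    text: str
    source: str


class _VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    _HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self._HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with a space, then collapse whitespace runs.
        # Simplification: this can add a space where an inline tag sat
        # directly next to punctuation.
        return " ".join(" ".join(self._chunks).split())


def ingest_html(html, source="unknown"):
    """Assumed helper: turn an HTML string into a plain-text Document."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return Document(text=parser.text(), source=source)
```

A call such as `ingest_html(page, source="page.html")` returns a `Document` whose `text` field contains no markup. A PDF or Twitter ingester would need its own extraction step but could return the same `Document` type, which is what lets later stages ignore where the text came from.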