Example: Text Files

The text source is one of the most straightforward source types to ingest, and Sorcero provides several tools that we can use to make our ingestion process easier and more useful. We’re going to start with a sample text document and perform a number of transformations that will convert it to a collection of passage documents.

Notes:

For our sample document, we’re going to use one of the test data documents found in the library. If you’d like to follow along, it is the file tests/data/test.txt, the path used in Step 2.

See Pipeline Example: Text Documents below for a discussion of the pipeline version of this same example.


The source we use in the example is shown below:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do ...
vulputate eu scelerisque felis. Faucibus nisl tincidunt eget nullam.

Fringilla phasellus faucibus scelerisque eleifend. Volutpat commodo...
faucibus in ornare quam. Felis eget nunc lobortis mattis.

Risus nec feugiat in fermentum posuere. Odio ut enim blandit volutpat...
et egestas quis ip\n-sum suspendisse... congue mauris rhoncus aenean.

Pulvinar mattis nunc sed blandit libero volutpat sed cras. Id porta...
fringilla. Morbi enim nunc faucibus a.

Sollicitudin nibh sit amet commodo nulla facilisi nullam...
viverra orci sagittis eu.

Step 1: Import

Import three libraries from ingestum: documents, sources, and transformers.

from ingestum import documents
from ingestum import sources
from ingestum import transformers

Step 2: Create a Text source

In this example, we are using a text source, so we use sources.Text(path) to define the source type. This retrieves the source document at the path provided and identifies it as a text source document.

text_source = sources.Text(path="tests/data/test.txt")

Step 3: Create a Text document

Once we have the Text source object, we can apply transformers. The first transformer we apply is TextSourceCreateDocument. This transformer converts a Text source into a Text document.

document = transformers.TextSourceCreateDocument().transform(
    source=text_source
)

Let’s look at each part of this statement. transformers is the imported library containing all transformers. TextSourceCreateDocument is our transformer, which takes no arguments. We then call the .transform() method, which takes one argument: the source we defined in the previous step.

As a result of Step 3, the content of the Text source is now embedded in a Text document structure:

{
    "content": "Lorem ipsum dolor sit amet, consectetur adipiscing
    elit, sed do... vulputate eu scelerisque felis. Faucibus nisl
    tincidunt eget nullam.\n\nFringilla phasellus faucibus scelerisque
    eleifend. Volutpat commodo... faucibus in ornare quam. Felis eget
    nunc lobortis mattis.\n\nRisus nec feugiat in fermentum
    posuere. Odio ut enim blandit volutpat... et egestas quis ip\n-sum
    suspendisse... congue mauris rhoncus aenean.\n\nPulvinar mattis
    nunc sed blandit libero volutpat sed cras. Id
    porta... fringilla. Morbi enim nunc faucibus a.\n\nSollicitudin
    nibh sit amet commodo nulla facilisi nullam... viverra orci
    sagittis eu.\n",
    "context": {},
    "origin": null,
    "pdf_context": null,
    "title": "",
    "type": "text",
    "version": "1.0"
}

Step 4: Remove hyphenations

Now that we’ve got a text document, we can use a variety of tools that will allow us to tune the content. For example, the text contains a word that is hyphenated across a line break, “ip\n-sum”. We can use TextDocumentHyphensRemove to remove such hyphens.

document = transformers.TextDocumentHyphensRemove().transform(document=document)

As a result of Step 4, the hyphens have been removed from the text.

{
    "content": "Lorem ipsum dolor sit amet, consectetur adipiscing
    elit, sed do... vulputate eu scelerisque felis. Faucibus nisl
    tincidunt eget nullam.\n\nFringilla phasellus faucibus scelerisque
    eleifend. Volutpat commodo... faucibus in ornare quam. Felis eget
    nunc lobortis mattis.\n\nRisus nec feugiat in fermentum
    posuere. Odio ut enim blandit volutpat... et egestas quis ipsum
    suspendisse... congue mauris rhoncus aenean.\n\nPulvinar mattis
    nunc sed blandit libero volutpat sed cras. Id
    porta... fringilla. Morbi enim nunc faucibus a.\n\nSollicitudin
    nibh sit amet commodo nulla facilisi nullam... viverra orci
    sagittis eu.\n",
    "context": {},
    "origin": null,
    "pdf_context": null,
    "title": "",
    "type": "text",
    "version": "1.0"
}
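To make the effect of this step concrete, here is a minimal plain-Python sketch of the same idea. It is not Ingestum’s implementation: the function name join_hyphenated and the regular expression are assumptions chosen for illustration, and they only cover a hyphen that sits directly next to a line break inside a word.

```python
import re

# Hypothetical stand-in for the idea behind TextDocumentHyphensRemove;
# the real transformer may apply different rules.
_BROKEN_WORD = re.compile(r"(\w)(?:-\n|\n-)(\w)")


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen that is adjacent to a line break."""
    return _BROKEN_WORD.sub(r"\1\2", text)


sample = "et egestas quis ip\n-sum suspendisse"
print(join_hyphenated(sample))
```

In this sketch, a hyphen that is not next to a line break (as in “well-known”) is left alone, since the pattern requires the line break.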

Step 5: Create the collection

It can be useful to split a document up into a collection of parts. In this example, we will make a document from each paragraph by using \n\n to split the document into a collection.

# Keep the result so the collection can be inspected or transformed further.
collection = transformers.TextSplitIntoCollectionDocument(
    separator='\n\n'
).transform(document=document)

The collection of text documents is shown below.

{
    "content": [
        {
            "content": "Lorem ipsum dolor sit amet, consectetur
            adipiscing elit, sed do... vulputate eu scelerisque
            felis. Faucibus nisl tincidunt eget nullam.",
            "context": {},
            "origin": null,
            "pdf_context": null,
            "title": "",
            "type": "text",
            "version": "1.0"
        },
        {
            "content": "Fringilla phasellus faucibus scelerisque
            eleifend. Volutpat commodo... faucibus in ornare
            quam. Felis eget nunc lobortis mattis.",
            "context": {},
            "origin": null,
            "pdf_context": null,
            "title": "",
            "type": "text",
            "version": "1.0"
        },
        {
            "content": "Risus nec feugiat in fermentum posuere. Odio
            ut enim blandit volutpat... et egestas quis ipsum
            suspendisse...  congue mauris rhoncus aenean.",
            "context": {},
            "origin": null,
            "pdf_context": null,
            "title": "",
            "type": "text",
            "version": "1.0"
        },
        {
            "content": "Pulvinar mattis nunc sed blandit libero
            volutpat sed cras. Id porta... fringilla. Morbi enim nunc
            faucibus a.",
            "context": {},
            "origin": null,
            "pdf_context": null,
            "title": "",
            "type": "text",
            "version": "1.0"
        },
        {
            "content": "Sollicitudin nibh sit amet commodo nulla
            facilisi nullam... viverra orci sagittis eu.\n",
            "context": {},
            "origin": null,
            "pdf_context": null,
            "title": "",
            "type": "text",
            "version": "1.0"
        }
    ],
    "context": {},
    "origin": null,
    "title": "",
    "type": "collection",
    "version": "1.0"
}
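The shape of this result can be reproduced with a short plain-Python sketch. This is only an illustration of splitting on a separator, not how Ingestum builds its collection: split_into_collection is a hypothetical helper, the sample text is shortened, and the dictionaries keep only the content and type fields.

```python
import json

# Shortened stand-in for the document content after Step 4.
content = (
    "Lorem ipsum dolor sit amet.\n\n"
    "Fringilla phasellus faucibus.\n\n"
    "Risus nec feugiat.\n\n"
    "Pulvinar mattis nunc.\n\n"
    "Sollicitudin nibh sit amet.\n"
)


def split_into_collection(text: str, separator: str) -> dict:
    """Wrap each piece of `text` in a small text-document dict."""
    parts = [
        {"content": piece, "type": "text"}
        for piece in text.split(separator)
    ]
    return {"content": parts, "type": "collection"}


collection = split_into_collection(content, separator="\n\n")
print(len(collection["content"]))
print(json.dumps(collection["content"][-1]))
```

As in the output above, the last piece keeps the file’s trailing \n, because the separator only matches two consecutive line breaks.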

There are many other transformations that we can apply to text sources. You might want to replace strings with the TextDocumentStringReplace transformer, or try more advanced concepts such as converting your document into passage documents, where you can add metadata such as tags and anchors. There are also Conditionals that allow you to apply transformers if and only if a specific condition is true. Check out our Reference Documentation or our other Ingestion Examples for more ideas.

Pipeline Example: Text Documents

A Python script can be used to configure a pipeline. See Pipelines Reference for more details.

1. Build the framework

We’ll start by adding some Python so we can run our pipeline. We’ll be focusing on the pipeline aspect of the script, so we’ll mostly gloss over this bit.

The following block of code is a template with the basic structure needed to configure an Ingestum Pipeline. Both the pipeline and the manifest are initially empty. Add this to an empty Python file.

import json
import argparse
import tempfile

from ingestum import engine
from ingestum import manifests
from ingestum import pipelines
from ingestum import transformers
from ingestum.utils import stringify_document


def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[],
                steps=[]
            )
        ]
    )

    return pipeline


def ingest(path):
    destination = tempfile.TemporaryDirectory()

    manifest = manifests.base.Manifest(
        sources=[])

    pipeline = generate_pipeline()

    results, *_ = engine.run(
        manifest=manifest,
        pipelines=[pipeline],
        pipelines_dir=None,
        artifacts_dir=None,
        workspace_dir=None)

    destination.cleanup()

    return results[0]


def main():
    parser = argparse.ArgumentParser()
    subparser = parser.add_subparsers(dest='command', required=True)
    subparser.add_parser('export')
    ingest_parser = subparser.add_parser('ingest')
    ingest_parser.add_argument('path')
    args = parser.parse_args()

    if args.command == 'export':
        output = generate_pipeline()
    else:
        output = ingest(args.path)

    print(stringify_document(output))


if __name__ == "__main__":
    main()

2. Define the sources

The manifest lists the sources that will be ingested. In this case our only source is a text file, so we create a manifests.sources.Text source and add it to the manifest’s list of sources. We also specify the source’s standard arguments: id, pipeline, location, and destination.

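def ingest(path):
    # Temporary directory created in the template from step 1; it backs
    # the source's local destination below.
    destination = tempfile.TemporaryDirectory()

    manifest = manifests.base.Manifest(
        sources=[
            manifests.sources.Text(
                id='id',
                pipeline='default',
                location=manifests.sources.locations.Local(
                    path=path
                ),
                destination=manifests.sources.destinations.Local(
                    directory=destination.name
                )
            )
        ]
    )

    # The rest of ingest() is unchanged from the template in step 1.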

3. Apply the transformers

For each pipe, we must specify which source will be accepted as input, as well as the sequence of transformers that will be applied to the input source.

Note that, unlike the order of the sources in a manifest, the order in which transformers are listed matters: each transformer operates on the output of the previous one, so they aren’t commutative.

def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[
                    pipelines.sources.Manifest(
                        source='text'
                    )
                ],
                steps=[
                    transformers.TextSourceCreateDocument(),
                    transformers.TextDocumentHyphensRemove(),
                    transformers.TextSplitIntoCollectionDocument(
                        separator='\n\n'
                    )
                ]
            )
        ]
    )
    return pipeline

In this example we have only one pipe, which accepts a Text file as input (specified by pipelines.sources.Manifest(source='text')). The pipe sequentially applies three transformers to this source: transformers.TextSourceCreateDocument, transformers.TextDocumentHyphensRemove, and transformers.TextSplitIntoCollectionDocument.
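The remark that step order matters can be illustrated without Ingestum at all. The sketch below chains two hypothetical plain-Python steps (neither is an Ingestum transformer) and runs them in both orders; run_steps is likewise an assumed helper that simply feeds each step the output of the previous one.

```python
from functools import reduce


def remove_line_break_hyphens(text: str) -> str:
    # Hypothetical step: rejoin a word broken as "ip\n-sum".
    return text.replace("\n-", "")


def flatten_newlines(text: str) -> str:
    # Hypothetical step: turn every line break into a space.
    return text.replace("\n", " ")


def run_steps(text: str, steps) -> str:
    """Apply each step to the output of the previous one."""
    return reduce(lambda current, step: step(current), steps, text)


sample = "quis ip\n-sum suspendisse"
in_order = run_steps(sample, [remove_line_break_hyphens, flatten_newlines])
swapped = run_steps(sample, [flatten_newlines, remove_line_break_hyphens])
print(repr(in_order))  # 'quis ipsum suspendisse'
print(repr(swapped))   # 'quis ip -sum suspendisse'
```

When the line breaks are flattened first, the hyphen step no longer finds the pattern it looks for, so the broken word survives. The same reasoning applies to the steps list of a pipe.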

4. Test our pipeline

We’re done! All we have to do is test it:

$ python3 path/to/script.py ingest tests/data/test.txt

Note that this example pipeline has only one pipe, but we can add as many as we want.

5. Export your pipeline

Python for humans, JSON for computers:

$ python3 path/to/script.py export

Note

This prints the output to the command line. To write it to a file instead, redirect it:

$ python3 path/to/script.py export > filename.json