Example: Spreadsheets

In this example, we walk through a simple example of ingestion from an CSV (comma separated values) spreadsheet source using the Ingestum Python libraries.

Notes:

  • Sorcero can also ingest XLS (Microsoft Excel) spreadsheets.

  • You’ll need to follow the the Installation Guide if you haven’t used this library before.

  • To learn more about the available ingestion sources, see Sources Reference.

For our sample document, we’re going to use one of the test data documents found in the library. If you’d like to follow along, you can find it here.

See Pipeline Example: Spreadsheets below for a discussion of the pipeline version of this same example.


Spreadsheets are some of the most common and flexible content types available, and Sorcero provides a wide variety of tools with which spreadsheets can be manipulated.

The source we use in the example is shown below:

Username,Identifier,First name,Last name
booker12,9012,Rachel,Booker
grey07,2070,Laura,Grey
johnson81,4081,Craig,Johnson
jenkins46,9346,Mary,Jenkins
smith79,5079,Jamie,Smith

Step 1: Import

Import three libraries from ingestum: documents, sources, and transformers.

from ingestum import documents
from ingestum import sources
from ingestum import transformers

Step 2: Create a CSV source

Create an CSV source object from an CSV file.

csv_source = sources.CSV(path="tests/data/test.csv")

Step 3: Create a Tabular document

The next step is to convert our CSV source into a tabular document. We can use the CSVSourceCreateTabularDocument transformer to do this.

tabular_document = transformers.CSVSourceCreateTabularDocument().transform(
    source=csv_source
)

The output of Step 3 is:

{
    "columns": 4,
    "content": [
        [
            "Username",
            "Identifier",
            "First name",
            "Last name"
        ],
        [
            "booker12",
            "9012",
            "Rachel",
            "Booker"
        ],
        [
            "grey07",
            "2070",
            "Laura",
            "Grey"
        ],
        [
            "johnson81",
            "4081",
            "Craig",
            "Johnson"
        ],
        [
            "jenkins46",
            "9346",
            "Mary",
            "Jenkins"
        ],
        [
            "smith79",
            "5079",
            "Jamie",
            "Smith"
        ]
    ],
    "pdf_context": null,
    "rows": 6,
    "title": "",
    "type": "tabular",
    "version": "1.0"
}

If we have an XLS source, the process is very similar. Our source is sources.XLS. We must specify the sheet we want to work with and use the XLSSourceCreateTabularDocument transformer to extract that sheet into a tabular document.

xls_source = sources.XLS(path="tests/data/test.xlsx")
tabular_document = transformers.XLSSourceCreateTabularDocument(
    sheet="Sheet1").transform(source=xls_source)

Step 4: Customize our tables

Now’s the fun part – customization. There are a number of options that we can try to work with our table data but we’ll only use one as an example in this tutorial. TabularDocumentColumnsInsert transforms a Tabular document into another Tabular document where a new empty column is inserted at the given position.

document = transformers.TabularDocumentColumnsInsert(
    position=2,
    columns=1
).transform(document=tabular_document)

The output of Step 4 is a table with a new column added:

{
    "columns": 4,
    "content": [
        [
            "Username",
            "Identifier",
            "First name",
            "Last name"
        ],
        [
            "booker12",
            "9012",
            "",
            "Rachel",
            "Booker"
        ],
        [
            "grey07",
            "2070",
            "Laura",
            "Grey"
        ],
        [
            "johnson81",
            "4081",
            "Craig",
            "Johnson"
        ],
        [
            "jenkins46",
            "9346",
            "Mary",
            "Jenkins"
        ],
        [
            "smith79",
            "5079",
            "Jamie",
            "Smith"
        ]
    ],
    "pdf_context": null,
    "rows": 6,
    "title": "",
    "type": "tabular",
    "version": "1.0"
}

Pipeline Example: Spreadsheets

A Python script can be used to configure a pipeline. See Pipelines Reference for more details.

1. Build the framework

Just like in Example: Text Files, we’ll start by adding some Python so we can run our pipeline.

The following block of code is a template with the basic structure needed to configure an Ingestum Pipeline. Both the pipeline and the manifest are initially empty. Add this to an empty Python file.

import json
import argparse
import tempfile

from ingestum import engine
from ingestum import manifests
from ingestum import pipelines
from ingestum import transformers
from ingestum.utils import stringify_document


def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[],
                steps=[]
            )
        ]
    )

    return pipeline


def ingest(path):
    destination = tempfile.TemporaryDirectory()

    manifest = manifests.base.Manifest(
        sources=[]
    )

    pipeline = generate_pipeline()

    results, *_ = engine.run(
        manifest=manifest,
        pipelines=[pipeline],
        pipelines_dir=None,
        artifacts_dir=None,
        workspace_dir=None
    )

    destination.cleanup()

    return results[0]


def main():
    parser = argparse.ArgumentParser()
    subparser = parser.add_subparsers(dest='command', required=True)
    subparser.add_parser('export')
    ingest_parser = subparser.add_parser('ingest')
    ingest_parser.add_argument('path')
    args = parser.parse_args()

    if args.command == 'export':
        output = generate_pipeline()
    else:
        output = ingest(args.path)

    print(stringify_document(output))


if __name__ == "__main__":
    main()

2. Define the sources

The manifest lists the sources that will be ingested. In this case we only have a CSV as source, so we create a manifests.sources.CSV source and add it to the collection of sources contained in the manifest. We also specify the source’s standard arguments id, pipeline, location, and destination.

def ingest(path):
    manifest = manifests.base.Manifest(
        sources=[
            manifests.sources.CSV(
                id='id',
                pipeline='default',
                location=manifests.sources.locations.Local(
                    path=path,
                ),
                destination=manifests.sources.destinations.Local(
                    directory=destination.name,
                )
            )
        ]
    )

Note that if the source had source-specific arguments, we would also include them here. These source-specific arguments would be previously passed as parameters to the ingest function.

3. Apply the transformers

For each pipe, we must specify which source will be accepted as input, as well as the sequence of transformers that will be applied to the input source.

Note that, unlike manifest sources, the order in which transformers are listed matters (i.e. they aren’t commutative).

def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[
                    pipelines.sources.Manifest(
                        source='csv'
                    )
                ],
                steps=[
                    transformers.CSVSourceCreateTabularDocument(),
                    transformers.TabularDocumentColumnsInsert(
                        position=2,
                        columns=1
                    )
                ]
            )
        ]
    )
return pipeline

In this example we have only one pipe, which accepts a CSV file as input (specified by pipelines.sources.Manifest(source='csv')). The pipe sequentially applies two transformers to this source: transformers.CSVSourceCreateTabularDocument and transformers.TabularDocumentColumnsInsert.

4. Test our pipeline

We’re done! All we have to do is test it:

$ python3 path/to/script.py ingest tests/data/test.csv

Note that this example pipeline has only one pipe, we can add as many as we want.

This tutorial gave some examples of what we can do with a CSV source, but it’s certainly not exhaustive. Sorcero provides a variety of tools to deal with tabular documents – if you’d like to try them out, you can use them in step 4 –. Check out our Reference Documentation or our other Ingestion Examples for more ideas.

5. Export our pipeline

Python for humans, json for computers:

$ python3 path/to/script.py export