Example: Webpages

In this example, we walk through a simple example of ingestion from an HTML source using the Ingestum Python libraries.

Notes:

For our sample document, we’re going to use one of the test data documents found in the library. If you’d like to follow along, you can find it here.

See Pipeline Example: Webpages below for a discussion of the pipeline version of this same example.


The source we use in the example is shown below.

<meta property="og:title" content="test">
<body>this is a test</body>

Step 1: Import

Import three libraries from ingestum: documents, sources, and transformers.

from ingestum import documents
from ingestum import sources
from ingestum import transformers

Step 2: Create an HTML source

Create an HTML source object from an HTML file.

html_source = sources.HTML(path="tests/data/test.html")

Step 3: Create an HTML document

We first need to apply HTMLSourceCreateDocument. This transformer converts an HTML source into an HTML document.

document = transformers.HTMLSourceCreateDocument().transform(
    source=html_source
)

Let’s look at each part of this line. transformers is the imported library containing all transformers. HTMLSourceCreateDocument is our transformer, which has no arguments. We then call the .transform() method, which takes one argument, the source document we defined in the previous step.

As a result of Step 3, the content of the HTML source document has been embedded within a document structure within the object.

{
    "content": "<html><head><meta content=\"test\" property=\"og:title\"/>\n</head><body>this is a test</body>\n</html>",
    "context": {},
    "origin": null,
    "title": "test",
    "type": "html",
    "version": "1.0"
}

Step 4: Extract the images

Once we have the HTML document object, we can apply transformers. A typical transformer to apply at this stage is HTMLDocumentImagesExtract in order extract images from the HTML. Images will be placed in the directory specified by the directory argument to the transformer.

transformers.HTMLDocumentImagesExtract(
    directory='tests/output'
).transform(document)

Step 5: Create a Text document

In this step, we extract the text from the HTML file using the XMLCreateTextDocument transformer. (Since HTML is a subset of XML, we can use XML transformers on HTML documents.)

document = transformers.XMLCreateTextDocument().transform(
    document=document
)

A byproduct of applying this transformer is that all of the tags have been stripped from the text.

{
    "content": "this is a test",
    "context": {},
    "origin": null,
    "pdf_context": null,
    "title": "test",
    "type": "text",
    "version": "1.0"
}

Pipeline Example: Webpages

A Python script can be used to configure a pipeline. See Pipelines Reference for more details.

1. Build the framework

Just like in Example: Text Files, we’ll start by adding some Python so we can run our pipeline.

The following block of code is a template with the basic structure needed to configure an Ingestum Pipeline. Both the pipeline and the manifest are initially empty. Add this to an empty Python file.

import json
import argparse
import tempfile

from ingestum import engine
from ingestum import manifests
from ingestum import pipelines
from ingestum import transformers
from ingestum.utils import stringify_document


def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[],
                steps=[])])

    return pipeline


def ingest(path, target):
    destination = tempfile.TemporaryDirectory()

    manifest = manifests.base.Manifest(
        sources=[])

    pipeline = generate_pipeline()

    results, *_ = engine.run(
        manifest=manifest,
        pipelines=[pipeline],
        pipelines_dir=None,
        artifacts_dir=None,
        workspace_dir=None)

    destination.cleanup()

    return results[0]


def main():
    parser = argparse.ArgumentParser()
    subparser = parser.add_subparsers(dest='command', required=True)
    subparser.add_parser('export')
    ingest_parser = subparser.add_parser('ingest')
    ingest_parser.add_argument('path')
    ingest_parser.add_argument('target')
    args = parser.parse_args()

    if args.command == 'export':
        output = generate_pipeline()
    else:
        output = ingest(args.path, args.target)

    print(stringify_document(output))


if __name__ == "__main__":
    main()

2. Define the sources

The manifest lists the sources that will be ingested. In this case we only have a CSV as source, so we create a manifests.sources.HTML source and add it to the collection of sources contained in the manifest. We also specify the source’s standard arguments id, pipeline, location, and destination, as well as the source-specific argument target.

def ingest(path, target):
    manifest = manifests.base.Manifest(
        sources=[
            manifests.sources.HTML(
                id='id',
                pipeline='default',
                target=target,
                location=manifests.sources.locations.Local(
                    path=path
                ),
                destination=manifests.sources.destinations.Local(
                    directory=destination.name
                )
            )
        ]
    )

3. Apply the transformers

For each pipe, we must specify which source will be accepted as input, as well as the sequence of transformers that will be applied to the input source.

Note that, unlike manifest sources, the order in which transformers are listed matters (i.e. they aren’t commutative).

def generate_pipeline():
    pipeline = pipelines.base.Pipeline(
        name='default',
        pipes=[
            pipelines.base.Pipe(
                name='default',
                sources=[
                    pipelines.sources.Manifest(
                        source='html'
                    )
                ],
                steps=[
                    transformers.HTMLSourceCreateDocument(),
                    transformers.HTMLDocumentImagesExtract(
                        directory='tests/files'
                    ),
                    transformers.XMLCreateTextDocument()
                ]
            )
        ]
    )
return pipeline

In this example we have only one pipe, which accepts a HTML file as input (specified by pipelines.sources.Manifest(source='html')). The pipe sequentially applies three transformers to this source: transformers.HTMLSourceCreateDocument, transformers.HTMLDocumentImagesExtract, and transformers.XMLCreateTextDocument.

4. Test our pipeline

We’re done! All we have to do is test it:

$ python3 path/to/script.py ingest tests/data/test.html body

Note that this example pipeline has only one pipe, we can add as many as we want.

This tutorial gave some examples of what we can do with a HTML source, but it’s certainly not exhaustive. Sorcero provides a variety of tools to deal with HTML and passage documents. Check out our Reference Documentation or our other Ingestion Examples for more ideas.

5. Export our pipeline

Python for humans, json for computers:

$ python3 path/to/script.py export