Tools
There are several command-line tools to make it easier to use Ingestum. All of these tools must be run in the Ingestum environment.
ingestum-manifest
This command-line utility is used to run an Ingestum manifest.
$ ingestum-manifest
usage: ingestum-manifest [-h] [--pipelines PIPELINES] [--artifacts ARTIFACTS] [--workspace WORKSPACE] [--instrumentation [{measure-memory}]] manifest
The manifest mandatory argument is used to specify the manifest to be processed.
The --pipelines mandatory argument is used to specify a path to the pipeline used in the manifest.
The --artifacts optional argument is used to specify a path for any artifacts (images, etc.) output from the ingestion process.
The --workspace optional argument is used to specify a path for the document output from the ingestion process.
The --instrumentation optional argument is used to profile memory usage during the ingestion process.
Example:
$ ingestum-manifest manifest.json --pipelines sorcero-ingestion-scripts/pipelines
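When the tool is driven from a script rather than typed by hand, the command line can be assembled programmatically. The following is a minimal sketch, not part of Ingestum: the helper name manifest_command is hypothetical, and it uses only the flags shown in the usage line above.

```python
import shlex


def manifest_command(manifest, pipelines, artifacts=None, workspace=None):
    """Build the argument list for an ingestum-manifest invocation.

    Hypothetical helper (not part of Ingestum); optional flags are only
    added when a value is supplied.
    """
    command = ["ingestum-manifest", manifest, "--pipelines", pipelines]
    if artifacts is not None:
        command += ["--artifacts", artifacts]
    if workspace is not None:
        command += ["--workspace", workspace]
    return command


command = manifest_command("manifest.json", "sorcero-ingestion-scripts/pipelines")
# Prints: ingestum-manifest manifest.json --pipelines sorcero-ingestion-scripts/pipelines
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from within the Ingestum environment.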
ingestum-generate-manifest
This command-line utility is used to generate an Ingestum manifest.
$ ingestum-generate-manifest
usage: ingestum-generate-manifest [-h] [--id ID] [--pipeline PIPELINE] [--manifest MANIFEST] source
The source mandatory argument is used to specify the manifest source to be used.
The --id optional argument is used to specify the manifest source ID. If not specified, a random ID will be auto-generated.
The --pipeline argument is used to specify the name of a pipeline to be used in the manifest.
The --manifest optional argument is used to specify the path of an existing manifest to which the new manifest source is appended.
The source itself is parsed for additional arguments.
Example:
$ ingestum-generate-manifest pubmed
usage: ingestum-generate-manifest [-h] [--id ID] [--pipeline PIPELINE] [--manifest MANIFEST] [--terms TERMS [TERMS ...]] --articles ARTICLES [--hours HOURS] [--from_date FROM_DATE] [--to_date TO_DATE] source
$ ingestum-generate-manifest pubmed --terms eye --articles 1500 --hours 168 --pipeline pipeline_pubmed_publication --id test_eye_query
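One common way to make the available arguments depend on the chosen source is to parse the command line in two stages. The sketch below illustrates that general technique with argparse; it is an assumption about how such behavior can be built, not Ingestum's actual implementation. The SOURCE_ARGUMENTS table is hypothetical and lists only some of the pubmed arguments from the usage line above.

```python
import argparse

# Hypothetical per-source argument table; not Ingestum's real definitions.
SOURCE_ARGUMENTS = {
    "pubmed": [
        (("--terms",), {"nargs": "+", "default": []}),
        (("--articles",), {"type": int, "required": True}),
        (("--hours",), {"type": int, "default": None}),
    ],
}


def parse(argv):
    # Stage 1: parse only the arguments shared by every source and
    # ignore everything else. This assumes the source name comes first.
    base = argparse.ArgumentParser(add_help=False)
    base.add_argument("--id", default=None)
    base.add_argument("--pipeline", default=None)
    base.add_argument("--manifest", default=None)
    base.add_argument("source")
    known, _unparsed = base.parse_known_args(argv)

    # Stage 2: build a full parser that also knows the arguments of the
    # selected source, then parse the complete command line again.
    full = argparse.ArgumentParser(parents=[base])
    for flags, options in SOURCE_ARGUMENTS.get(known.source, []):
        full.add_argument(*flags, **options)
    return full.parse_args(argv)


args = parse(["pubmed", "--terms", "eye", "--articles", "1500", "--hours", "168"])
print(args.source, args.terms, args.articles, args.hours)
```

Because the second parser is built after the source is known, its usage message lists the source-specific arguments, which matches the behavior shown in the example above.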
ingestum-generate-manifest-from-xls
This command-line utility is used to generate an Ingestum manifest from a spreadsheet.
$ ingestum-generate-manifest-from-xls
usage: ingestum-generate-manifest-from-xls [-h] [--exclude_artifact] path destination
The path mandatory argument is used to specify the path to the spreadsheet source.
The destination mandatory argument is used to specify the URL where the output will be written to.
The --exclude_artifact optional argument is used to exclude the artifact ZIP file from the destination. If not specified, the artifact ZIP file will be created.
The expected spreadsheet format is:
A separate sheet for each source, e.g., one sheet called europepmc and another one called pubmed.
The first row of every sheet is reserved for the source field names, e.g., the first row would contain id, pipeline, query, and hours, while the second row would contain test, pipeline_europepmc_publication, eye, and 168.
For sources that require a location, a URL must be provided, e.g., the first row would contain location, while the second row would contain file://tests/data/test.pdf or gs://bucket/path/to/file.
A sample spreadsheet file is provided here.
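The layout described above can be pictured as a mapping from sheet names to rows, where the first row of each sheet holds the field names. The sketch below models that layout with plain Python lists instead of a real spreadsheet file. The function name rows_to_sources, the "source" key, and the sheet name pdf are hypothetical and chosen only for illustration; they do not reflect the actual manifest schema.

```python
def rows_to_sources(workbook):
    """Turn {sheet name: rows} into one dictionary per data row.

    The first row of each sheet is treated as the field names, and every
    following row as the values of one source entry.
    """
    sources = []
    for sheet_name, rows in workbook.items():
        if not rows:
            continue  # skip empty sheets
        header, *records = rows
        for record in records:
            entry = {"source": sheet_name}  # hypothetical key name
            entry.update(zip(header, record))
            sources.append(entry)
    return sources


workbook = {
    "europepmc": [
        ["id", "pipeline", "query", "hours"],
        ["test", "pipeline_europepmc_publication", "eye", 168],
    ],
    # Hypothetical sheet for a source that requires a location.
    "pdf": [
        ["location"],
        ["file://tests/data/test.pdf"],
    ],
}

for entry in rows_to_sources(workbook):
    print(entry)
```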
Example:
$ ingestum-generate-manifest-from-xls
usage: ingestum-generate-manifest-from-xls [-h] [--exclude_artifact] path destination
$ ingestum-generate-manifest-from-xls ./manifest.xlsx file://output --exclude_artifact
ingestum-pipeline
This command-line utility is used to run an Ingestum pipeline.
$ ingestum-pipeline
usage: ingestum-pipeline [-h] [--workspace WORKSPACE] [--artifacts ARTIFACTS] pipeline
The pipeline mandatory argument is used to specify the pipeline to be run.
The --workspace optional argument is used to specify a path for the document output from the ingestion process.
The --artifacts optional argument is used to specify a path for any artifacts (images, etc.) output from the ingestion process.
The pipeline itself is parsed for additional arguments.
Example:
$ ingestum-pipeline pipeline_pubmed_publication.json
usage: ingestum-pipeline [-h] [--workspace WORKSPACE] [--artifacts ARTIFACTS] [--terms TERMS [TERMS ...]] --articles ARTICLES [--hours HOURS] [--from_date FROM_DATE] [--to_date TO_DATE] [--full_text] pipeline
$ ingestum-pipeline pipeline_pubmed_publication.json --terms eye --articles 1500 --hours 168
ingestum-envelope
This command-line utility is used to run an Ingestum envelope.
$ ingestum-envelope
usage: ingestum-envelope [-h] [--pipelines PIPELINES] [--artifacts ARTIFACTS] [--workspace WORKSPACE] [--results RESULTS] envelope
The envelope mandatory argument is used to specify the envelope to be processed.
The --pipelines optional argument is used to specify a path to the pipeline used in the manifest.
The --artifacts optional argument is used to specify a path for any artifacts (images, etc.) output from the ingestion process.
The --workspace optional argument is used to specify a path for the document output from the ingestion process.
The --results optional argument is used to specify a path for the references output to be written to. Without this argument, the references output will be directed to the standard output.
Example:
$ ingestum-envelope envelope.json --results results.json
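The fallback from a results file to the standard output is a common pattern for command-line tools. The sketch below shows one way to implement it. It is an illustration only: the helper name write_results is hypothetical, and the assumption that the references output is JSON is based solely on the results.json file name in the example above.

```python
import json
import sys


def write_results(results, path=None):
    """Write results as JSON to path, or to standard output if path is None.

    Hypothetical helper; the JSON output format is an assumption.
    """
    if path is None:
        json.dump(results, sys.stdout, indent=2)
        sys.stdout.write("\n")
    else:
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(results, handle, indent=2)


# Without a path, the output goes to the standard output.
write_results({"references": []})
```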
ingestum-generate-envelope
This command-line utility is used to generate an Ingestum envelope.
$ ingestum-generate-envelope
usage: ingestum-generate-envelope [-h] [--pipelines PIPELINES] manifest
ingestum-generate-envelope: error: the following arguments are required: manifest
The manifest mandatory argument is used to specify the path to the manifest.
The --pipelines optional argument is used to specify the path to the directory containing the pipelines used in the manifest.
Example:
$ ingestum-generate-envelope manifest.json --pipelines tests/pipelines
ingestum-merge
This command-line utility is used to merge multiple documents into one document.
$ ingestum-merge
usage: ingestum-merge [-h] [--output OUTPUT] documents [documents ...]
The documents mandatory argument is used to specify the list of documents to be processed.
The --output mandatory argument is used to specify the path to the output merged document.
Example:
We could merge results from multiple PubMed searches into one document.
$ ingestum-merge document1.json document2.json document3.json --output document4.json
ingestum-migrate
This command-line utility is used to migrate multiple documents from earlier versions of Ingestum to the current document format. Migration is needed because new fields are occasionally added to the document format.
$ ingestum-migrate
usage: ingestum-migrate [-h] documents [documents ...]
The documents mandatory argument is used to specify the list of documents to be processed.
The documents are updated in place.
Example:
$ ingestum-migrate tests/output/*.json
ingestum-inspect
This command-line utility is used to extract the content from an ingested document.
$ ingestum-inspect
usage: ingestum-inspect [-h] document
The document mandatory argument is used to specify the path of the document to be processed.
Example:
$ ingestum-inspect document.json