Installation Guide

This guide will take you through the process of installing the Sorcero Ingestion library on your machine.

Simplified installation

The simplest way of getting Ingestum running is to use a Docker container bind-mounted with a folder on your host system. In this way, you can use the Docker container as an execution sandbox while using the files and apps (code editors, PDF viewers, etc.) on your host system.

To install Docker, visit Get Docker.

$ git clone https://gitlab.com/sorcero/community/ingestum
$ cd ingestum/docker
$ docker build -t ingestum:latest .
$ docker run -it --rm --name ingestum --mount type=bind,src=/absolute/path/on/host,dst=/app ingestum:latest
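The src value of the bind mount has to be an absolute path to an existing folder. As a minimal sketch, the following shell function resolves a folder to its absolute path and prints the matching --mount argument. The helper name ingestum_mount_arg and the example folder ./my-docs are hypothetical and not part of Ingestum; dst=/app is taken from the command above.

```shell
# Hypothetical helper: print a --mount argument for an existing host folder.
ingestum_mount_arg() {
    # Resolve the folder to an absolute path; fail if it cannot be entered.
    host_dir=$(cd "$1" >/dev/null 2>&1 && pwd) || {
        echo "error: '$1' is not an accessible directory" >&2
        return 1
    }
    printf 'type=bind,src=%s,dst=/app\n' "$host_dir"
}

# Usage (./my-docs is an assumed folder name):
#   docker run -it --rm --name ingestum \
#       --mount "$(ingestum_mount_arg ./my-docs)" ingestum:latest
```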

If you are running Fedora, you can also use a toolbox container. Toolbox makes it easy to use a containerized environment for everyday software development, so we provide a Dockerfile to get started in no time:

$ sudo dnf -y install git toolbox
$ git clone https://gitlab.com/sorcero/community/ingestum
$ cd ingestum/toolbox
$ podman build . -t ingestum-toolbox:latest
$ toolbox create -c ingestum-toolbox -i ingestum-toolbox
$ toolbox enter ingestum-toolbox

Warning

Ingestum won’t fully work on AARCH64/ARM64 systems because two Python modules (opencv-python and deepspeech) won’t install. See the manual installation steps for a workaround.

Manual installation

Warning

Ingestum was developed for Ubuntu 20.04 (Linux). It may still work on other operating systems (especially Unix-based ones), but be aware that some or most features may not work. If you don’t have a computer that runs Ubuntu, consider using a VirtualBox VM or a Docker container built from an Ubuntu 20.04 image.

1. Download the system dependencies

Install the following system dependencies:

$ sudo apt -y install python3-pip python3-dev python3-virtualenv libsm-dev libxrender-dev libxext-dev libxss-dev libgtk-3-dev poppler-utils sox attr ffmpeg ghostscript tesseract-ocr
$ sudo apt-get -y install libreoffice

For AARCH64/ARM64, you need one more dependency (libxslt-dev):

$ sudo apt -y install libxslt-dev
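To confirm that the command-line tools from these packages ended up on your PATH, a small check such as the following may help. This is a sketch, not part of Ingestum: the function name missing_commands is hypothetical, and the binary names in the usage comment are the ones these packages usually provide (for example, gs for ghostscript and soffice for libreoffice).

```shell
# Print every command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Usage (assumed binary names for the packages installed above):
#   missing_commands pdftotext sox ffmpeg gs tesseract soffice
```

An empty result means every listed command was found.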

The following dependencies are used for audio ingestion:

$ mkdir -p ~/.deepspeech
$ wget -O ~/.deepspeech/models.pbmm https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.pbmm
$ wget -O ~/.deepspeech/models.scorer https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.scorer

For AARCH64/ARM64, you need deepspeech 0.9.3 instead:

$ mkdir -p ~/.deepspeech
$ wget -O ~/.deepspeech/models.pbmm https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
$ wget -O ~/.deepspeech/models.scorer https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
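The two variants above differ only in the DeepSpeech version, so the choice can be scripted. The sketch below picks the version from the machine architecture and builds the same download URLs. Both function names are hypothetical, and it assumes uname -m reports aarch64 or arm64 on ARM64 systems.

```shell
# Map a machine architecture (as printed by `uname -m`) to the DeepSpeech
# release used in this guide: 0.9.3 on AARCH64/ARM64, 0.7.3 otherwise.
deepspeech_version_for_arch() {
    case "$1" in
        aarch64|arm64) echo "0.9.3" ;;
        *)             echo "0.7.3" ;;
    esac
}

# Build the download URL for one model file ("pbmm" or "scorer").
deepspeech_model_url() {
    version=$1
    kind=$2
    echo "https://github.com/mozilla/DeepSpeech/releases/download/v${version}/deepspeech-${version}-models.${kind}"
}

# Usage:
#   version=$(deepspeech_version_for_arch "$(uname -m)")
#   mkdir -p ~/.deepspeech
#   wget -O ~/.deepspeech/models.pbmm "$(deepspeech_model_url "$version" pbmm)"
#   wget -O ~/.deepspeech/models.scorer "$(deepspeech_model_url "$version" scorer)"
```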

2. Download the library

Use git clone or some other method to download the Ingestum library:

$ git clone https://gitlab.com/sorcero/community/ingestum.git

3. Install the library

You’ll need to install the virtual environment tooling if you don’t already have it, then create an environment:

$ sudo apt install python3-venv
$ virtualenv env

or simply:

$ python3 -m venv env

Activate the virtual environment and install the dependencies:

$ cd ingestum
$ source ../env/bin/activate
$ pip install .

Warning

On AARCH64/ARM64, pip install . will fail because pip won’t be able to install deepspeech 0.7.3 and opencv-python 4.2.0.34. In the requirements.txt file, replace deepspeech==0.7.3 with deepspeech==0.9.3 and opencv-python==4.2.0.34 with opencv-python, then run pip install . again.

If you still have trouble installing deepspeech 0.9.3, remove it from the requirements.txt file and run pip install . again. Since deepspeech will not be installed, any transformer that requires it (e.g. audio_source_create_text_document) will crash. The rest should work fine.
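The two replacements in requirements.txt can also be applied with sed rather than by hand. The following is a sketch under two assumptions: each pin sits alone on its own line exactly as written above, and the function name patch_requirements_for_arm64 is hypothetical. It writes to a temporary file first so that it works with both GNU and BSD sed.

```shell
# Hypothetical helper: rewrite the two pins named in the warning above.
# Takes the path to requirements.txt and replaces it via a temp file.
patch_requirements_for_arm64() {
    req_file=$1
    sed -e 's/^deepspeech==0\.7\.3$/deepspeech==0.9.3/' \
        -e 's/^opencv-python==4\.2\.0\.34$/opencv-python/' \
        "$req_file" > "$req_file.tmp" && mv "$req_file.tmp" "$req_file"
}

# Usage (run inside the ingestum directory):
#   patch_requirements_for_arm64 requirements.txt
#   pip install .
```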

Note

Don’t forget to activate the virtual environment every time you use Ingestum.
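One way to avoid running Ingestum outside the virtual environment is to check for it in your own scripts. The sketch below relies on the fact that the activate script exports VIRTUAL_ENV. The function name in_virtualenv is hypothetical.

```shell
# Succeeds only when a virtual environment appears to be active, that is,
# when VIRTUAL_ENV is set to a non-empty value.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Usage (path as used in step 3, run from the ingestum directory):
#   in_virtualenv || . ../env/bin/activate
```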

4. Set the plugins directory

The default location of the plugins directory is:

$HOME/.ingestum/plugins

(optional) This environment variable is used for specifying the location of the plugins directory:

export INGESTUM_PLUGINS_DIR=""
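To see which directory applies in your current shell, the rule described above can be sketched as a one-line function. This mirrors the text of this guide, not Ingestum's own code. The function name is hypothetical, and treating an empty value the same as an unset one is an assumption.

```shell
# Print the plugins directory implied by the text above: the value of
# INGESTUM_PLUGINS_DIR if set and non-empty, else $HOME/.ingestum/plugins.
# Assumption: an empty value is treated the same as an unset one.
ingestum_plugins_dir() {
    echo "${INGESTUM_PLUGINS_DIR:-$HOME/.ingestum/plugins}"
}

# Usage:
#   mkdir -p "$(ingestum_plugins_dir)"
```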

5. Set your authentication credentials

(optional) These environment variables are used for Twitter feed ingestion:

export INGESTUM_TWITTER_CONSUMER_KEY=""
export INGESTUM_TWITTER_CONSUMER_SECRET=""
export INGESTUM_TWITTER_ACCESS_TOKEN=""
export INGESTUM_TWITTER_ACCESS_SECRET=""

(optional) These environment variables are used for Email ingestion:

export INGESTUM_EMAIL_HOST=""
export INGESTUM_EMAIL_PORT=""
export INGESTUM_EMAIL_USER=""
export INGESTUM_EMAIL_PASSWORD=""

(optional) These environment variables are used for ProQuest ingestion:

export INGESTUM_PROQUEST_ENDPOINT=""
export INGESTUM_PROQUEST_TOKEN=""

(optional) These environment variables are used for PubMed ingestion (from https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us):

export INGESTUM_PUBMED_TOOL=""
export INGESTUM_PUBMED_EMAIL=""
export INGESTUM_PUBMED_API_KEY=""

(optional) These environment variables are used for Reddit ingestion (from https://www.reddit.com/prefs/apps):

export INGESTUM_REDDIT_CLIENT_ID=""
export INGESTUM_REDDIT_CLIENT_SECRET=""
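Since every credential group is optional, it can be useful to check that a group you intend to use is completely filled in. The sketch below prints the names of variables that are unset or empty. The function name unset_vars is hypothetical, and whether Ingestum itself needs every variable of a group is not stated in this guide.

```shell
# Print each named variable that is unset or empty.
unset_vars() {
    for name in "$@"; do
        # Indirect lookup in POSIX sh; pass fixed variable names only,
        # never untrusted input, because eval is used here.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Usage:
#   unset_vars INGESTUM_REDDIT_CLIENT_ID INGESTUM_REDDIT_CLIENT_SECRET
```

An empty result means all listed variables have a value.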