This document will show you how to get up and running with Fonduer. We’ll show you how to get everything installed and your machine set up so that you can walk through real examples by checking out our Tutorials.
Installing External Dependencies
Fonduer relies on a couple of external applications. You’ll need to install these and make sure they are on your PATH.
For OS X using homebrew:
$ brew install poppler
$ brew install postgresql
$ brew install libpng freetype pkg-config
$ brew install libomp  # https://github.com/pytorch/pytorch/issues/20030
$ brew install imagemagick@6
On Debian-based distros:
$ sudo apt update
$ sudo apt install libxml2-dev libxslt-dev python3-dev
$ sudo apt build-dep python-matplotlib
$ sudo apt install poppler-utils
$ sudo apt install postgresql
$ sudo apt install libmagickwand-dev
Fonduer recommends using PostgreSQL version 9.6 or later.
Fonduer requires poppler-utils version 0.36.0 or later; in earlier versions, the -bbox-layout option is not available for pdftotext.
It is recommended to use poppler-utils version 0.48.0 or later to avoid a known bug.
Fonduer depends on Wand (>=0.4.4, <0.5.0), which does not support ImageMagick 7.
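The version requirements above can be checked from a shell. The following is a minimal sketch, not part of Fonduer itself: the helper names version_ge and check_min are hypothetical, and it assumes a sort that supports -V (version sort, as in GNU coreutils) and that pdftotext and psql report their versions in the usual "pdftotext version X.Y.Z" and "psql (PostgreSQL) X.Y" formats.

```shell
#!/usr/bin/env bash
# Sketch: compare installed tool versions against the minimums listed above.

# version_ge A B: succeeds when version A >= version B.
# Assumes a `sort` that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED REQUIRED: print a one-line verdict.
check_min() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

if command -v pdftotext >/dev/null 2>&1; then
    # pdftotext prints its banner ("pdftotext version X.Y.Z") on stderr.
    poppler_version="$(pdftotext -v 2>&1 | head -n 1 | awk '{print $3}')"
    check_min "poppler-utils" "$poppler_version" "0.36.0"
else
    echo "pdftotext not found on PATH"
fi

if command -v psql >/dev/null 2>&1; then
    # psql prints "psql (PostgreSQL) X.Y".
    postgres_version="$(psql --version | awk '{print $3}')"
    check_min "PostgreSQL" "$postgres_version" "9.6"
else
    echo "psql not found on PATH"
fi
```

If either tool is reported as missing or too old, revisit the platform-specific install commands before installing the Fonduer package.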
Installing the Fonduer Package
Then, install Fonduer by running:
$ pip install fonduer
Fonduer only supports Python 3. Python 2 is not supported.
For the Python dependencies, we recommend using a virtualenv, which will allow you to install Fonduer and its Python dependencies in an isolated Python environment. Once you have virtualenv installed, you can create a Python 3 virtual environment as follows:
$ virtualenv -p python3.6 .venv
Once the virtual environment is created, activate it by running:
$ source .venv/bin/activate
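To confirm that activation took effect, you can inspect the VIRTUAL_ENV variable, which the activate script exports. This is a small sketch rather than an official Fonduer step; the function name in_virtualenv is hypothetical.

```shell
#!/usr/bin/env bash
# in_virtualenv: succeeds when an activated environment has exported VIRTUAL_ENV.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "active virtual environment: $VIRTUAL_ENV"
else
    echo "no virtual environment is active"
fi

# Show which interpreter `python` resolves to; Fonduer requires Python 3.
if command -v python >/dev/null 2>&1; then
    python --version 2>&1
fi
```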
Any Python libraries installed will now be contained within this virtual environment. To deactivate the environment, simply run:
$ deactivate
The Fonduer Pipeline
The Fonduer pipeline can be broken into five phases.
- In this first stage, an input corpus of richly formatted documents is parsed into Fonduer’s data model.
- Mention and Candidate Extraction
- Here, we initialize the knowledge base with the user’s target schema. Users define Mentions using Matchers, and then combine Mentions to create Candidates. Throttlers can also (optionally) be added to filter out invalid Candidates to achieve better class balance.
- Multimodal Featurization
- Fonduer then featurizes each candidate with features from multiple modalities.
- Next, users provide labeling functions (which can leverage our data model utilities) to provide weak supervision.
- Finally, Fonduer provides machine learning models which are used to classify each Candidate.
To demonstrate how to set up and use Fonduer in your applications, we walk through each of these phases in real-world examples in our Tutorials.
Check out the Fonduer paper for more details about the system.