Packaging

You can package an entire trained Fonduer pipeline (parsing, extraction, featurization, and classification) and deploy it remotely for serving. Fonduer uses the MLflow Model format for storage. A packaged Fonduer pipeline model (referred to simply as a Fonduer model) looks like this:

Directory written by fonduer.packaging.save_model

fonduer_model/
├── MLmodel
├── code
│   ├── my_subclasses.py
│   └── my_fonduer_model.py
├── conda.yaml
├── candidate_classes.pkl
├── mention_classes.pkl
└── model.pkl  # the pickled Fonduer pipeline model.

Currently, two types of classifiers are supported: EmmentalModel (aka discriminative model) and LabelModel (aka generative model). The following example shows how to package a Fonduer pipeline model that uses EmmentalModel as a classifier.

Example

First, create a class that inherits from FonduerModel and implements _classify(). Fully functional examples of such a class are available in hardware_fonduer_model.py and my_fonduer_model.py. Put this class in a Python module such as my_fonduer_model.py rather than in a Jupyter notebook or a Python script, because this module will be packaged.

my_fonduer_model.py

from pandas import DataFrame

from fonduer.packaging.fonduer_model import FonduerModel
from fonduer.parser.models import Document


class MyFonduerModel(FonduerModel):
    def _classify(self, doc: Document) -> DataFrame:
        # Assume only one candidate class is used.
        candidate_class = self.candidate_extractor.candidate_classes[0]
        # Get a list of candidates for this candidate_class.
        test_cands = getattr(doc, candidate_class.__tablename__ + "s")
        # Get a list of true predictions out of candidates.
        ...
        true_preds = [test_cands[i] for i in positive[0]]

        # Load the true predictions into a dataframe, one row per prediction.
        # Rows are collected first because DataFrame.append was removed in pandas 2.0.
        rows = [
            tuple(m.context.get_span() for m in true_pred.get_mentions())
            for true_pred in true_preds
        ]
        columns = [m.__name__ for m in candidate_class.mentions]
        return DataFrame(rows, columns=columns)

Similarly, put anything that is required for MentionExtractor and CandidateExtractor, i.e., mention_classes, mention_spaces, matchers, candidate_classes, and throttlers, into another module.

my_subclasses.py

from fonduer.candidates.models import mention_subclass
Presidentname = mention_subclass("Presidentname")
Placeofbirth = mention_subclass("Placeofbirth")

mention_classes = [Presidentname, Placeofbirth]
...
mention_spaces = [presname_ngrams, placeofbirth_ngrams]
matchers = [president_name_matcher, place_of_birth_matcher]
candidate_classes = [PresidentnamePlaceofbirth]
throttlers = [my_throttler]

Finally, in a Jupyter notebook or a Python script, build and train a pipeline, then save the trained pipeline.

>>> from fonduer.parser.preprocessors import HTMLDocPreprocessor
>>> preprocessor = HTMLDocPreprocessor(docs_path)
>>> ...
>>> # Import mention_classes, candidate_classes, etc. from my_subclasses.py
>>> # instead of defining them here.
>>> from my_subclasses import mention_classes, mention_spaces, matchers
>>> mention_extractor = MentionExtractor(session, mention_classes, mention_spaces, matchers)
>>> from my_subclasses import candidate_classes, throttlers
>>> candidate_extractor = CandidateExtractor(session, candidate_classes, throttlers)
>>> ...
>>> from my_fonduer_model import MyFonduerModel
>>> from fonduer.packaging import save_model
>>> save_model(
...     fonduer_model=MyFonduerModel(),
...     path="fonduer_model",
...     code_paths=["my_subclasses.py", "my_fonduer_model.py"],
...     preprocessor=preprocessor,
...     parser=parser,
...     mention_extractor=mention_extractor,
...     candidate_extractor=candidate_extractor,
...     featurizer=featurizer,
...     emmental_model=emmental_model,
...     word2id=emb_layer.word2id,
... )

Remember to list my_subclasses.py and my_fonduer_model.py in the code_paths argument. Other modules can also be listed if they are required during inference. Alternatively, you can manually place arbitrary modules under the /code directory or data under the /data directory. For further information about the storage format, see the MLflow Model documentation.
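Once saved, the packaged model can be loaded and served through MLflow's generic Python-function interface. The following is a minimal sketch, assuming the directory written by save_model is reachable locally and that MLflow, pandas, and Fonduer are installed in the serving environment. The helper names build_records and extract_from_html, and the file paths in the usage note, are hypothetical.

```python
def build_records(html_paths, pdf_paths=None):
    """Return one record per document, keyed by the column names predict() expects."""
    if pdf_paths is None:
        # "html_path" is required; "pdf_path" is optional.
        return [{"html_path": h} for h in html_paths]
    if len(pdf_paths) != len(html_paths):
        raise ValueError("html_paths and pdf_paths must have the same length")
    return [{"html_path": h, "pdf_path": p} for h, p in zip(html_paths, pdf_paths)]


def extract_from_html(model_dir, html_paths, pdf_paths=None):
    """Load a packaged Fonduer model and run it on the given documents."""
    # Imported lazily so that this module can be imported without
    # MLflow or pandas installed.
    import mlflow.pyfunc
    import pandas as pd

    model = mlflow.pyfunc.load_model(model_dir)
    # Rows are documents, columns are parameters.
    model_input = pd.DataFrame(build_records(html_paths, pdf_paths))
    return model.predict(model_input)
```

For example, extract_from_html("fonduer_model", ["data/doc1.html"]) would return whatever DataFrame the subclass's _classify() produces for that document.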

MLflow model for Fonduer

Customized MLflow model for Fonduer.

class fonduer.packaging.fonduer_model.FonduerModel(*args, **kwargs)

Bases: mlflow.pyfunc.

A custom MLflow model for Fonduer.

This class is intended to be subclassed.

static convert_features_to_matrix(features, keys)

Convert features (the output from FeaturizerUDF.apply) into a sparse matrix.

Parameters

features (List[Dict[str, Any]]) – a list of feature mappings (key: key, value: feature).
keys (List[str]) – a list of all keys.

Return type

csr_matrix
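As a dependency-free sketch of the conversion described above, the following builds the (row, column, value) triplets from which a sparse matrix could be constructed. It assumes each feature mapping holds parallel "keys" and "values" lists. That layout, the function name features_to_triplets, and the sample feature names are illustrative assumptions, not Fonduer's actual implementation.

```python
def features_to_triplets(features, keys):
    """Return (rows, cols, values) for a len(features) x len(keys) matrix."""
    # Map each known key to its column index once, instead of searching per entry.
    col_of = {key: j for j, key in enumerate(keys)}
    rows, cols, values = [], [], []
    for i, feature in enumerate(features):
        for key, value in zip(feature["keys"], feature["values"]):
            # Keys that are not listed in `keys` are skipped.
            if key in col_of:
                rows.append(i)
                cols.append(col_of[key])
                values.append(value)
    return rows, cols, values


# Hypothetical featurizer output for two candidates.
features = [
    {"keys": ["f_a", "f_c"], "values": [1.0, 1.0]},
    {"keys": ["f_b"], "values": [2.0]},
]
keys = ["f_a", "f_b", "f_c"]
rows, cols, values = features_to_triplets(features, keys)
```

With SciPy available, scipy.sparse.csr_matrix((values, (rows, cols)), shape=(len(features), len(keys))) would turn these triplets into a 2 x 3 CSR matrix.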

static convert_labels_to_matrix(labels, keys)

Convert labels (the output from LabelerUDF.apply) into a dense matrix.

Note that the input labels are 0-indexed ({0, 1, ..., k}), while the output labels are -1-indexed ({-1, 0, ..., k-1}).

Parameters

labels (List[Dict[str, Any]]) – a list of label mappings (key: key, value: label).
keys (List[str]) – a list of all keys.

Return type

ndarray
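The index shift noted above can be illustrated with a small pure-Python sketch. It assumes each label mapping holds parallel "keys" and "values" lists and that a labeling function with no stored value counts as 0 on input. Both are assumptions made for illustration, as are the function name labels_to_rows and the sample labeling-function names. The library method itself returns an ndarray rather than nested lists.

```python
def labels_to_rows(labels, keys):
    """Return a dense list-of-lists matrix with labels shifted from {0..k} to {-1..k-1}."""
    col_of = {key: j for j, key in enumerate(keys)}
    # Start from all zeros so that missing entries become -1 after the shift.
    dense = [[0] * len(keys) for _ in labels]
    for i, label in enumerate(labels):
        for key, value in zip(label["keys"], label["values"]):
            if key in col_of:
                dense[i][col_of[key]] = value
    # Shift every entry down by one: 0 -> -1, 1 -> 0, ..., k -> k-1.
    return [[value - 1 for value in row] for row in dense]


# Hypothetical labeler output for two candidates and two labeling functions.
labels = [
    {"keys": ["lf_1", "lf_2"], "values": [1, 2]},
    {"keys": ["lf_2"], "values": [1]},
]
shifted = labels_to_rows(labels, ["lf_1", "lf_2"])
```

Here the first candidate's labels (1, 2) become (0, 1), and the second candidate's missing lf_1 entry becomes -1.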

predict(model_input)

Take html_path (and pdf_path) as input and return extracted information.

This method is required, and its signature is defined by MLflow's convention. See the MLflow documentation for more details.

Parameters: model_input (DataFrame) – Pandas DataFrame with rows as docs and columns as params. The params should include “html_path” and can optionally include “pdf_path”.
Return type: DataFrame
Returns: Pandas DataFrame containing the output from _classify(), which depends on how it is implemented by a subclass.

fonduer.packaging.fonduer_model.log_model(fonduer_model, artifact_path, preprocessor, parser, mention_extractor, candidate_extractor, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)

Log a Fonduer model as an MLflow artifact for the current run.

Parameters

fonduer_model (FonduerModel) – Fonduer model to be saved.
artifact_path (str) – Run-relative artifact path.
preprocessor (DocPreprocessor) – the doc preprocessor.
parser (Parser) – self-explanatory
mention_extractor (MentionExtractor) – self-explanatory
candidate_extractor (CandidateExtractor) – self-explanatory
conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.
code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.
model_type (Optional[str]) – the model type, either “emmental” or “label”, defaults to “emmental”.
labeler (Optional[Labeler]) – a labeler, defaults to None.
lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions.
label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.
featurizer (Optional[Featurizer]) – a featurizer, defaults to None.
emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.
word2id (Optional[Dict]) – a word embedding map.

Return type

None
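Because log_model records the artifact against the current run, it is typically called inside an active MLflow run. The sketch below shows one way this might look for an Emmental-based pipeline. The wrapper log_emmental_pipeline and its components dict are hypothetical conveniences, and the artifact path "fonduer_model" is only an example.

```python
def log_emmental_pipeline(fonduer_model, components, code_paths):
    """Log a trained pipeline under the run-relative artifact path "fonduer_model".

    `components` is a hypothetical dict holding the keyword arguments that
    log_model expects: preprocessor, parser, mention_extractor,
    candidate_extractor, featurizer, emmental_model, and word2id.
    """
    # Imported lazily so that the sketch can be imported without
    # MLflow or Fonduer installed.
    import mlflow
    from fonduer.packaging.fonduer_model import log_model

    # Open a run so that there is a current run to log against.
    with mlflow.start_run():
        log_model(
            fonduer_model,
            "fonduer_model",
            code_paths=code_paths,
            model_type="emmental",
            **components,
        )
```

Passing the pipeline pieces as keyword arguments keeps the call aligned with the parameter names in the signature above, so a misspelled component name fails immediately with a TypeError.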

fonduer.packaging.fonduer_model.save_model(fonduer_model, path, preprocessor, parser, mention_extractor, candidate_extractor, mlflow_model=mlflow.models.Model, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)

Save a Fonduer model to a path on the local file system.

Parameters

fonduer_model (FonduerModel) – Fonduer model to be saved.
path (str) – the path on the local file system.
preprocessor (DocPreprocessor) – the doc preprocessor.
parser (Parser) – self-explanatory
mention_extractor (MentionExtractor) – self-explanatory
candidate_extractor (CandidateExtractor) – self-explanatory
mlflow_model (Model) – model configuration.
conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.
code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.
model_type (Optional[str]) – the model type, either “emmental” or “label”, defaults to “emmental”.
labeler (Optional[Labeler]) – a labeler, defaults to None.
lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions.
label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.
featurizer (Optional[Featurizer]) – a featurizer, defaults to None.
emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.
word2id (Optional[Dict]) – a word embedding map.

Return type

None
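The worked example earlier on this page covers the EmmentalModel case. For a LabelModel-based pipeline, the call might look like the sketch below, which sets model_type="label" and passes labeler, lfs, and label_models. The grouping of classifier-specific arguments in CLASSIFIER_ARGS is an assumption inferred from the parameter list and the Emmental example, and the helper names and the core dict are hypothetical.

```python
# Classifier-specific keyword arguments grouped by model_type.
# This grouping is an assumption, not something the API reference states.
CLASSIFIER_ARGS = {
    "emmental": ("featurizer", "emmental_model", "word2id"),
    "label": ("labeler", "lfs", "label_models"),
}


def missing_classifier_args(model_type, kwargs):
    """Return the classifier-specific argument names that are absent or None."""
    return [name for name in CLASSIFIER_ARGS[model_type] if kwargs.get(name) is None]


def save_label_pipeline(fonduer_model, path, core, labeler, lfs, label_models, code_paths=None):
    """Save a pipeline that classifies with label models.

    `core` is a hypothetical dict with preprocessor, parser,
    mention_extractor, and candidate_extractor.
    """
    # Imported lazily so that the sketch can be imported without Fonduer installed.
    from fonduer.packaging import save_model

    classifier = {"labeler": labeler, "lfs": lfs, "label_models": label_models}
    missing = missing_classifier_args("label", classifier)
    if missing:
        raise ValueError(f"missing arguments for model_type='label': {missing}")
    save_model(
        fonduer_model=fonduer_model,
        path=path,
        code_paths=code_paths,
        model_type="label",
        **core,
        **classifier,
    )
```

Checking the arguments before calling save_model surfaces a forgotten labeler or label model at packaging time instead of at inference time.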