Packaging

You can package an entire trained Fonduer pipeline (parsing, extraction, featurization, and classification) and deploy it to a remote location for serving. Fonduer uses the MLflow Model format for storage. A packaged Fonduer pipeline model (a "Fonduer model" for short) looks like this:

Directory written by fonduer.packaging.save_model
fonduer_model/
├── MLmodel
├── code
│   ├── my_subclasses.py
│   └── my_fonduer_model.py
├── conda.yaml
├── candidate_classes.pkl
├── mention_classes.pkl
└── model.pkl  # the pickled Fonduer pipeline model.
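
Before shipping the directory, it can help to confirm that the expected entries are present. The sketch below is not part of Fonduer: EXPECTED_ENTRIES simply mirrors the listing above, and missing_entries is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

# Entries copied from the directory listing above; adjust if your package differs.
EXPECTED_ENTRIES = [
    "MLmodel",
    "code",
    "conda.yaml",
    "candidate_classes.pkl",
    "mention_classes.pkl",
    "model.pkl",
]


def missing_entries(model_dir: str) -> List[str]:
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list means every entry from the listing was found.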

Currently, two types of classifiers are supported: EmmentalModel (the discriminative model) and LabelModel (the generative model). The following example shows how to package a Fonduer pipeline model that uses EmmentalModel as its classifier.

Example

First, create a class that inherits from FonduerModel and implements _classify(). Fully functional examples of such a class are available in hardware_fonduer_model.py and my_fonduer_model.py. Put this class in a Python module such as my_fonduer_model.py, rather than in a Jupyter notebook or a standalone script, because the module itself will be packaged.

my_fonduer_model.py
from fonduer.packaging import FonduerModel
from fonduer.parser.models import Document
from pandas import DataFrame


class MyFonduerModel(FonduerModel):
    def _classify(self, doc: Document) -> DataFrame:
        # Assume only one candidate class is used.
        candidate_class = self.candidate_extractor.candidate_classes[0]
        # Get a list of candidates for this candidate_class.
        test_cands = getattr(doc, candidate_class.__tablename__ + "s")
        # Get a list of true predictions out of candidates.
        # The elided code must define `positive`, whose first element
        # holds the indices of the positively classified candidates.
        ...
        true_preds = [test_cands[i] for i in positive[0]]

        # Collect one row per true prediction, then build the dataframe once
        # (DataFrame.append was removed in pandas 2.0).
        rows = [
            tuple(m.context.get_span() for m in true_pred.get_mentions())
            for true_pred in true_preds
        ]
        return DataFrame(rows, columns=[m.__name__ for m in candidate_class.mentions])

Similarly, put anything that is required for MentionExtractor and CandidateExtractor, i.e., mention_classes, mention_spaces, matchers, candidate_classes, and throttlers, into another module.

my_subclasses.py
from fonduer.candidates.models import mention_subclass
Presidentname = mention_subclass("Presidentname")
Placeofbirth = mention_subclass("Placeofbirth")

mention_classes = [Presidentname, Placeofbirth]
...
mention_spaces = [presname_ngrams, placeofbirth_ngrams]
matchers = [president_name_matcher, place_of_birth_matcher]
candidate_classes = [PresidentnamePlaceofbirth]
throttlers = [my_throttler]

Finally, in a Jupyter notebook or a Python script, build and train a pipeline, then save the trained pipeline.

>>> from fonduer.parser.preprocessors import HTMLDocPreprocessor
>>> preprocessor = HTMLDocPreprocessor(docs_path)
>>> ...
>>> # Import mention_classes, candidate_classes, etc. from my_subclasses.py
>>> # instead of defining them here.
>>> from my_subclasses import mention_classes, mention_spaces, matchers
>>> mention_extractor = MentionExtractor(session, mention_classes, mention_spaces, matchers)
>>> from my_subclasses import candidate_classes, throttlers
>>> candidate_extractor = CandidateExtractor(session, candidate_classes, throttlers)
>>> ...
>>> from my_fonduer_model import MyFonduerModel
>>> from fonduer.packaging import save_model
>>> save_model(
...     fonduer_model=MyFonduerModel(),
...     path="fonduer_model",
...     code_paths=["my_subclasses.py", "my_fonduer_model.py"],
...     preprocessor=preprocessor,
...     parser=parser,
...     mention_extractor=mention_extractor,
...     candidate_extractor=candidate_extractor,
...     featurizer=featurizer,
...     emmental_model=emmental_model,
...     word2id=emb_layer.word2id,
... )

Remember to list my_subclasses.py and my_fonduer_model.py in the code_paths argument. Other modules can also be listed if they are required during inference. Alternatively, you can manually place additional modules or data under the code/ or data/ directory of the packaged model, respectively. For more information about the storage format, see the MLflow Models documentation.

MLflow model for Fonduer

Customized MLflow model for Fonduer.

class fonduer.packaging.fonduer_model.FonduerModel(*args, **kwargs)[source]

Bases: mlflow.pyfunc.

A custom MLflow model for Fonduer.

This class is intended to be subclassed.

static convert_features_to_matrix(features, keys)[source]

Convert features (the output from FeaturizerUDF.apply) into a sparse matrix.

Parameters
  • features (List[Dict[str, Any]]) – a list of feature mappings (key: key, value: feature).

  • keys (List[str]) – a list of all keys.

Return type

csr_matrix
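
To make the conversion concrete, here is a minimal pure-Python sketch of how per-candidate feature mappings could be laid out in CSR form. It assumes each mapping is a plain dict from feature key to value, as the parameter description suggests; the real method returns a scipy csr_matrix and may expect a different input layout. features_to_csr is a hypothetical name, not a Fonduer function.

```python
from typing import Any, Dict, List, Tuple


def features_to_csr(
    features: List[Dict[str, Any]], keys: List[str]
) -> Tuple[List[Any], List[int], List[int]]:
    """Build the (data, indices, indptr) arrays of a CSR matrix.

    Row i corresponds to features[i]; column j corresponds to keys[j].
    Feature keys that are not listed in keys are skipped.
    """
    column_of = {key: j for j, key in enumerate(keys)}
    data: List[Any] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for mapping in features:
        # Sort by column index so each row's entries are in canonical order.
        row = sorted(
            (column_of[k], v) for k, v in mapping.items() if k in column_of
        )
        for col, value in row:
            indices.append(col)
            data.append(value)
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(indices))
    return data, indices, indptr
```

If scipy is available, the three lists can be passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(features), len(keys))) to obtain the matrix itself.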

static convert_labels_to_matrix(labels, keys)[source]

Convert labels (the output from LabelerUDF.apply) into a dense matrix.

Note that the input labels are 0-indexed ({0, 1, ..., k}), while the output labels are -1-indexed ({-1, 0, ..., k-1}).

Parameters
  • labels (List[Dict[str, Any]]) – a list of label mappings (key: key, value: label).

  • keys (List[str]) – a list of all keys.

Return type

ndarray
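
The index shift described above can be sketched in plain Python. This assumes each mapping is a dict from labeling-function key to a 0-indexed label and that a missing key counts as 0 (which becomes -1 after the shift); both are assumptions about the input layout. labels_to_dense is a hypothetical name, and the real method returns a NumPy ndarray rather than nested lists.

```python
from typing import Any, Dict, List


def labels_to_dense(labels: List[Dict[str, Any]], keys: List[str]) -> List[List[int]]:
    """Return a dense row-per-candidate matrix with every label shifted by -1."""
    matrix = []
    for mapping in labels:
        # A key with no stored label is treated as 0, i.e. -1 after the shift.
        matrix.append([int(mapping.get(key, 0)) - 1 for key in keys])
    return matrix
```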

predict(model_input)[source]

Take html_path (and pdf_path) as input and return extracted information.

This method is required, and its signature is defined by MLflow's convention. See the MLflow documentation for more details.

Parameters

model_input (DataFrame) – a pandas DataFrame with one row per document and one column per parameter. The columns should include “html_path” and can optionally include “pdf_path”.

Return type

DataFrame

Returns

A pandas DataFrame containing the output of _classify(), which depends on how the subclass implements it.
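
A packaged model can be loaded through MLflow's generic pyfunc loader and called with such a DataFrame. The sketch below builds the rows with the standard library and imports mlflow and pandas only inside the function that needs them. build_model_input and extract are hypothetical helper names, and the file paths in the usage note are placeholders.

```python
from typing import Dict, List, Optional


def build_model_input(
    html_paths: List[str], pdf_paths: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Return one record per document with html_path and, optionally, pdf_path."""
    if pdf_paths is not None and len(pdf_paths) != len(html_paths):
        raise ValueError("pdf_paths must align one-to-one with html_paths")
    records = []
    for i, html_path in enumerate(html_paths):
        record = {"html_path": html_path}
        if pdf_paths is not None:
            record["pdf_path"] = pdf_paths[i]
        records.append(record)
    return records


def extract(model_uri: str, records: List[Dict[str, str]]):
    """Load a packaged model and run it on the records (needs mlflow and pandas)."""
    import mlflow.pyfunc
    import pandas as pd

    model = mlflow.pyfunc.load_model(model_uri)
    return model.predict(pd.DataFrame(records))
```

For example, extract("fonduer_model", build_model_input(["docs/a.html"])) would return whatever the subclass's _classify() produces for that document.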

fonduer.packaging.fonduer_model.log_model(fonduer_model, artifact_path, preprocessor, parser, mention_extractor, candidate_extractor, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)[source]

Log a Fonduer model as an MLflow artifact for the current run.

Parameters
  • fonduer_model (FonduerModel) – Fonduer model to be saved.

  • artifact_path (str) – Run-relative artifact path.

  • preprocessor (DocPreprocessor) – the doc preprocessor.

  • parser (Parser) – the parser.

  • mention_extractor (MentionExtractor) – the mention extractor.

  • candidate_extractor (CandidateExtractor) – the candidate extractor.

  • conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.

  • code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.

  • model_type (Optional[str]) – the model type, either “emmental” or “label”, defaults to “emmental”.

  • labeler (Optional[Labeler]) – a labeler, defaults to None.

  • lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions, defaults to None.

  • label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.

  • featurizer (Optional[Featurizer]) – a featurizer, defaults to None.

  • emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.

  • word2id (Optional[Dict]) – a word embedding map, defaults to None.

Return type

None
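
When conda_env is given as a dictionary, it follows the layout of a Conda environment file. The following is a minimal sketch under stated assumptions: build_conda_env is a hypothetical helper, and the environment name, Python version, and package list are placeholders rather than values Fonduer prescribes.

```python
from typing import Any, Dict, List


def build_conda_env(python_version: str, pip_packages: List[str]) -> Dict[str, Any]:
    """Return a dict mirroring the structure of a Conda environment YAML file."""
    return {
        "name": "fonduer_model_env",  # hypothetical environment name
        "channels": ["defaults"],
        "dependencies": [
            f"python={python_version}",
            "pip",
            {"pip": list(pip_packages)},
        ],
    }


# Placeholder values; pin the versions that match your own training environment.
conda_env = build_conda_env("3.7", ["fonduer", "mlflow"])
```

The resulting dictionary can then be passed as the conda_env argument.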

fonduer.packaging.fonduer_model.save_model(fonduer_model, path, preprocessor, parser, mention_extractor, candidate_extractor, mlflow_model=mlflow.models.Model, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)[source]

Save a Fonduer model to a path on the local file system.

Parameters
  • fonduer_model (FonduerModel) – Fonduer model to be saved.

  • path (str) – the path on the local file system.

  • preprocessor (DocPreprocessor) – the doc preprocessor.

  • parser (Parser) – the parser.

  • mention_extractor (MentionExtractor) – the mention extractor.

  • candidate_extractor (CandidateExtractor) – the candidate extractor.

  • mlflow_model (Model) – model configuration.

  • conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.

  • code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.

  • model_type (Optional[str]) – the model type, either “emmental” or “label”, defaults to “emmental”.

  • labeler (Optional[Labeler]) – a labeler, defaults to None.

  • lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions, defaults to None.

  • label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.

  • featurizer (Optional[Featurizer]) – a featurizer, defaults to None.

  • emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.

  • word2id (Optional[Dict]) – a word embedding map, defaults to None.

Return type

None
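
Because several arguments default to None, a small pre-flight check can catch a missing classifier component before anything is written to disk. The grouping below is an assumption inferred from the parameter names (featurizer, emmental_model, and word2id for “emmental”; labeler, lfs, and label_models for “label”), and missing_classifier_args is a hypothetical helper, not part of Fonduer.

```python
from typing import Any, Dict, List

# Assumed grouping of optional arguments per model_type.
REQUIRED_BY_MODEL_TYPE = {
    "emmental": ["featurizer", "emmental_model", "word2id"],
    "label": ["labeler", "lfs", "label_models"],
}


def missing_classifier_args(model_type: str, kwargs: Dict[str, Any]) -> List[str]:
    """Return the names of assumed-required arguments that are absent or None."""
    if model_type not in REQUIRED_BY_MODEL_TYPE:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return [
        name
        for name in REQUIRED_BY_MODEL_TYPE[model_type]
        if kwargs.get(name) is None
    ]
```

Calling it with the same keyword arguments you intend to pass on, and proceeding only when the returned list is empty, keeps the check separate from the packaging call itself.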