Packaging
You can package an entire trained Fonduer pipeline model (parsing, extraction, featurization, and classification) and deploy it to a remote location for serving. To this end, we use MLflow Model as the storage format. A packaged Fonduer pipeline model (referred to simply as a Fonduer model) looks like this:
Directory written by fonduer.packaging.save_model

fonduer_model/
├── MLmodel
├── code
│   ├── my_subclasses.py
│   └── my_fonduer_model.py
├── conda.yaml
├── candidate_classes.pkl
├── mention_classes.pkl
└── model.pkl  # the pickled Fonduer pipeline model.
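As a quick sanity check after saving, the layout above can be verified with a few lines of standard-library Python. This is only a sketch: the helper name `missing_entries` is hypothetical and not part of Fonduer, and the expected names are copied from the listing above (the contents of code/ vary per project, so only the directory itself is checked).

```python
from pathlib import Path

# File names copied from the directory listing; not an official Fonduer constant.
EXPECTED_FILES = (
    "MLmodel",
    "conda.yaml",
    "candidate_classes.pkl",
    "mention_classes.pkl",
    "model.pkl",
)


def missing_entries(model_dir):
    """Return the expected entries absent from model_dir, sorted by name."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The code/ directory holds the modules passed via code_paths.
    if not (root / "code").is_dir():
        missing.append("code")
    return sorted(missing)
```

Calling `missing_entries("fonduer_model")` on a complete package would return an empty list.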
Currently, two types of classifiers are supported: EmmentalModel (aka discriminative model) and LabelModel (aka generative model).
The following example shows how to package a Fonduer pipeline model that uses EmmentalModel as a classifier.
Example
First, create a class that inherits from FonduerModel and implements _classify().
You can see fully functional examples of such a class at hardware_fonduer_model.py and my_fonduer_model.py.
Then, put this class in a Python module such as my_fonduer_model.py, rather than in a Jupyter notebook or a Python script, because this module will be packaged.
my_fonduer_model.py

from fonduer.packaging.fonduer_model import FonduerModel
from fonduer.parser.models import Document
from pandas import DataFrame


class MyFonduerModel(FonduerModel):
    def _classify(self, doc: Document) -> DataFrame:
        # Assume only one candidate class is used.
        candidate_class = self.candidate_extractor.candidate_classes[0]
        # Get a list of candidates for this candidate_class.
        test_cands = getattr(doc, candidate_class.__tablename__ + "s")
        # Get a list of true predictions out of candidates.
        ...
        true_preds = [test_cands[i] for i in positive[0]]
        # Collect one row of mention spans per true prediction.
        rows = [
            tuple(m.context.get_span() for m in true_pred.get_mentions())
            for true_pred in true_preds
        ]
        # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0).
        return DataFrame(rows, columns=[m.__name__ for m in candidate_class.mentions])
Similarly, put everything that MentionExtractor and CandidateExtractor require, i.e., mention_classes, mention_spaces, matchers, candidate_classes, and throttlers, into another module.
my_subclasses.py

from fonduer.candidates.models import mention_subclass

Presidentname = mention_subclass("Presidentname")
Placeofbirth = mention_subclass("Placeofbirth")
mention_classes = [Presidentname, Placeofbirth]
...
mention_spaces = [presname_ngrams, placeofbirth_ngrams]
matchers = [president_name_matcher, place_of_birth_matcher]
candidate_classes = [PresidentnamePlaceofbirth]
throttlers = [my_throttler]
Finally, in a Jupyter notebook or a Python script, build and train a pipeline, then save the trained pipeline.
>>> from fonduer.parser.preprocessors import HTMLDocPreprocessor
>>> preprocessor = HTMLDocPreprocessor(docs_path)
>>> ...
>>> # Import mention_classes, candidate_classes, etc. from my_subclasses.py
>>> # instead of defining them here.
>>> from my_subclasses import mention_classes, mention_spaces, matchers
>>> mention_extractor = MentionExtractor(session, mention_classes, mention_spaces, matchers)
>>> from my_subclasses import candidate_classes, throttlers
>>> candidate_extractor = CandidateExtractor(session, candidate_classes, throttlers)
>>> ...
>>> from my_fonduer_model import MyFonduerModel
>>> from fonduer.packaging import save_model
>>> save_model(
...     fonduer_model=MyFonduerModel(),
...     path="fonduer_model",
...     code_paths=["my_subclasses.py", "my_fonduer_model.py"],
...     preprocessor=preprocessor,
...     parser=parser,
...     mention_extractor=mention_extractor,
...     candidate_extractor=candidate_extractor,
...     featurizer=featurizer,
...     emmental_model=emmental_model,
...     word2id=emb_layer.word2id,
... )
Remember to list my_subclasses.py and my_fonduer_model.py in the code_paths argument.
Other modules can also be listed if they are required during inference.
Alternatively, you can manually place arbitrary modules or data under the /code or /data directory, respectively.
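The API reference for code_paths states that these files are prepended to the system path when the model is loaded. The sketch below illustrates that general mechanism with the standard library only. It is not MLflow's or Fonduer's loading code; the helper `import_from_code_dir` and the module name `demo_subclasses` are hypothetical stand-ins.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_code_dir(code_dir, module_name):
    """Prepend code_dir to sys.path, then import module_name from it."""
    code_dir = str(code_dir)
    if code_dir not in sys.path:
        sys.path.insert(0, code_dir)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo: a throwaway directory stands in for fonduer_model/code.
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    (code_dir / "demo_subclasses.py").write_text("throttlers = []\n")
    demo = import_from_code_dir(code_dir, "demo_subclasses")
    # Clean up the path entry before the temporary directory disappears.
    sys.path.remove(str(code_dir))
```

This is why a module that is importable only from the notebook's working directory has to be listed in code_paths: at inference time, only what was packaged is on the path.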
For further information about the MLflow Model format, please see the MLflow Models documentation.
MLflow model for Fonduer
Customized MLflow model for Fonduer.
class fonduer.packaging.fonduer_model.FonduerModel(*args, **kwargs)
    Bases: mlflow.pyfunc.
    A custom MLflow model for Fonduer.
    This class is intended to be subclassed.
    static convert_features_to_matrix(features, keys)
        Convert features (the output from FeaturizerUDF.apply) into a sparse matrix.
        Parameters
            features (List[Dict[str, Any]]) – a list of feature mappings (key: key, value: feature).
            keys (List[str]) – a list of all keys.
        Return type
            csr_matrix
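To make the sparse conversion concrete, here is a dependency-free sketch that builds the three arrays of the CSR format (data, indices, indptr) from per-candidate feature mappings. It is not Fonduer's implementation: the function name `features_to_csr_triplets` is hypothetical, and the input layout (each record a plain mapping from key name to value) is an assumption based on the parameter description above.

```python
def features_to_csr_triplets(features, keys):
    """Build CSR arrays (data, indices, indptr) from per-candidate feature dicts.

    Assumes each record maps a key name to a numeric feature value.
    """
    column_of = {key: col for col, key in enumerate(keys)}
    data, indices, indptr = [], [], [0]
    for row in features:
        for key, value in row.items():
            col = column_of.get(key)
            if col is not None:  # keys not listed in `keys` are dropped
                indices.append(col)
                data.append(value)
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(indices))
    return data, indices, indptr


features = [{"f_a": 1.0, "f_c": 2.0}, {"f_b": 3.0}]
keys = ["f_a", "f_b", "f_c"]
data, indices, indptr = features_to_csr_triplets(features, keys)
```

With SciPy installed, `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(features), len(keys)))` turns these arrays into a matrix with one row per candidate and one column per key.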
    static convert_labels_to_matrix(labels, keys)
        Convert labels (the output from LabelerUDF.apply) into a dense matrix.
        Note that the input labels are 0-indexed ({0, 1, ..., k}), while the output labels are -1-indexed ({-1, 0, ..., k-1}).
        Parameters
            labels (List[Dict[str, Any]]) – a list of label mappings (key: key, value: label).
            keys (List[str]) – a list of all keys.
        Return type
            ndarray
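The index shift described above can be sketched in plain Python. This is an illustration under stated assumptions, not Fonduer's code: the function name `labels_to_dense` is hypothetical, each record is assumed to map a key name to a 0-indexed label, and a key with no stored label is assumed to count as 0, so it becomes -1 after the shift.

```python
def labels_to_dense(labels, keys):
    """Return a dense row-major matrix of labels shifted from {0..k} to {-1..k-1}."""
    column_of = {key: col for col, key in enumerate(keys)}
    matrix = []
    for row in labels:
        dense_row = [0] * len(keys)  # assumption: no stored label means 0
        for key, value in row.items():
            if key in column_of:
                dense_row[column_of[key]] = value
        # Shift every entry down by one.
        matrix.append([value - 1 for value in dense_row])
    return matrix


labels = [{"lf_a": 1, "lf_b": 2}, {"lf_b": 1}]
keys = ["lf_a", "lf_b"]
dense = labels_to_dense(labels, keys)
```

The nested list can be wrapped with `numpy.array(dense)` when an ndarray is needed.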
    predict(model_input)
        Take html_path (and pdf_path) as input and return extracted information.
        This method is required and its signature is defined by MLflow's convention. See MLflow for more details.
        Parameters
            model_input (DataFrame) – Pandas DataFrame with rows as docs and columns as params. params should include "html_path" and can optionally include "pdf_path".
        Return type
            DataFrame
        Returns
            Pandas DataFrame containing the output from _classify(), which depends on how it is implemented by a subclass.
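The input expected by predict() can be prepared as one record per document. The helper below is a hypothetical sketch (`build_model_input_rows` is not part of Fonduer), and the file paths are placeholders. It only assembles the "html_path" and optional "pdf_path" columns named above.

```python
def build_model_input_rows(html_paths, pdf_paths=None):
    """Return one dict per document with "html_path" and, optionally, "pdf_path"."""
    if pdf_paths is not None and len(pdf_paths) != len(html_paths):
        raise ValueError("pdf_paths must have the same length as html_paths")
    rows = []
    for i, html_path in enumerate(html_paths):
        row = {"html_path": html_path}
        if pdf_paths is not None:
            row["pdf_path"] = pdf_paths[i]
        rows.append(row)
    return rows


# Placeholder paths, one row per document.
rows = build_model_input_rows(["data/doc1.html"], ["data/doc1.pdf"])
```

With pandas installed, `pandas.DataFrame(rows)` yields a frame with rows as docs and columns as params, which could then be passed to the predict() method of a model loaded with `mlflow.pyfunc.load_model`.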
fonduer.packaging.fonduer_model.log_model(fonduer_model, artifact_path, preprocessor, parser, mention_extractor, candidate_extractor, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)
    Log a Fonduer model as an MLflow artifact for the current run.
    Parameters
        fonduer_model (FonduerModel) – Fonduer model to be saved.
        artifact_path (str) – Run-relative artifact path.
        preprocessor (DocPreprocessor) – the doc preprocessor.
        parser (Parser) – self-explanatory
        mention_extractor (MentionExtractor) – self-explanatory
        candidate_extractor (CandidateExtractor) – self-explanatory
        conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.
        code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.
        model_type (Optional[str]) – the model type, either "emmental" or "label", defaults to "emmental".
        labeler (Optional[Labeler]) – a labeler, defaults to None.
        lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions.
        label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.
        featurizer (Optional[Featurizer]) – a featurizer, defaults to None.
        emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.
        word2id (Optional[Dict]) – a word embedding map.
    Return type
        None
fonduer.packaging.fonduer_model.save_model(fonduer_model, path, preprocessor, parser, mention_extractor, candidate_extractor, mlflow_model=mlflow.models.Model, conda_env=None, code_paths=None, model_type='emmental', labeler=None, lfs=None, label_models=None, featurizer=None, emmental_model=None, word2id=None)
    Save a Fonduer model to a path on the local file system.
    Parameters
        fonduer_model (FonduerModel) – Fonduer model to be saved.
        path (str) – the path on the local file system.
        preprocessor (DocPreprocessor) – the doc preprocessor.
        parser (Parser) – self-explanatory
        mention_extractor (MentionExtractor) – self-explanatory
        candidate_extractor (CandidateExtractor) – self-explanatory
        mlflow_model (Model) – model configuration.
        conda_env (Union[Dict, str, None]) – Either a dictionary representation of a Conda environment or the path to a Conda environment yaml file.
        code_paths (Optional[List[str]]) – A list of local filesystem paths to Python file dependencies, or directories containing file dependencies. These files are prepended to the system path when the model is loaded.
        model_type (Optional[str]) – the model type, either "emmental" or "label", defaults to "emmental".
        labeler (Optional[Labeler]) – a labeler, defaults to None.
        lfs (Optional[List[List[Callable]]]) – a list of lists of labeling functions.
        label_models (Optional[List[LabelModel]]) – a list of label models, defaults to None.
        featurizer (Optional[Featurizer]) – a featurizer, defaults to None.
        emmental_model (Optional[EmmentalModel]) – an Emmental model, defaults to None.
        word2id (Optional[Dict]) – a word embedding map.
    Return type
        None
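Because most classifier-related arguments default to None, it can help to check them against model_type before calling save_model or log_model. The sketch below is hypothetical and not part of Fonduer. In particular, the grouping of arguments per model type is an assumption inferred from the parameter names and from the example above, which passes featurizer, emmental_model, and word2id for the default "emmental" type.

```python
# Assumed grouping of optional arguments per model type; not taken from Fonduer's source.
REQUIRED_BY_MODEL_TYPE = {
    "emmental": ("featurizer", "emmental_model", "word2id"),
    "label": ("labeler", "lfs", "label_models"),
}


def missing_arguments(model_type="emmental", **kwargs):
    """Return the names of assumed-required arguments that are absent or None."""
    if model_type not in REQUIRED_BY_MODEL_TYPE:
        raise ValueError('model_type must be "emmental" or "label"')
    required = REQUIRED_BY_MODEL_TYPE[model_type]
    return [name for name in required if kwargs.get(name) is None]
```

For instance, `missing_arguments("label", labeler=labeler, lfs=lfs)` would report `["label_models"]` for any non-None labeler and lfs values.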