Learning
The final stage of Fonduer’s pipeline uses machine learning models to model the noise between supervision sources, generate probabilistic labels as training data, and then classify each Candidate.
Learning Utilities
These utilities can be used during error analysis to provide additional insights.

class fonduer.learning.utils.MultiModalDataset(X, Y=None)
A dataset containing all multimodal features in X and the corresponding labels in Y.
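To make the X/Y pairing concrete, here is a minimal sketch of a map-style dataset in plain Python. It is not Fonduer's implementation: the class name `PairedDataset`, the length check, and the behaviour when Y is omitted are all assumptions made for illustration.

```python
class PairedDataset:
    """Hypothetical stand-in that pairs each feature entry in X with a label in Y."""

    def __init__(self, X, Y=None):
        # Assumption: when labels are given, there is exactly one per feature entry.
        if Y is not None and len(X) != len(Y):
            raise ValueError("X and Y must have the same length")
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Assumption: without labels, only the features are returned.
        if self.Y is None:
            return self.X[idx]
        return self.X[idx], self.Y[idx]


# Each feature entry may bundle several modalities, e.g. text and layout features.
features = [{"text": "BC546", "page": 1}, {"text": "BC547", "page": 3}]
labeled = PairedDataset(features, [1, 0])
unlabeled = PairedDataset(features)
```

Indexing `labeled` yields a (features, label) pair, while indexing `unlabeled` yields the features alone.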

fonduer.learning.utils.confusion_matrix(pred, gold)
Return a confusion matrix.
This can be used for both entity-level and mention-level evaluation.
Parameters:
 pred (set) – a set of predicted entities/candidates
 gold (set) – a set of golden entities/candidates
Returns: a tuple of TP, FP, and FN
Return type: (set, set, set)
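The set semantics described above can be sketched with plain set operations. The helper names `confusion_sets` and `precision_recall_f1` and the sample entities are hypothetical; this is a standalone illustration under the assumption that TP, FP, and FN are the intersection and the two set differences, not a copy of the library code.

```python
def confusion_sets(pred, gold):
    """Split predictions into true positives, false positives, and false negatives."""
    pred, gold = set(pred), set(gold)
    tp = pred & gold   # predicted and in the gold set
    fp = pred - gold   # predicted but not in the gold set
    fn = gold - pred   # in the gold set but never predicted
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Derive the usual metrics from the three sets, guarding against empty sets."""
    prec = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    rec = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1


# Entity-level example: each entity is a (document, part number) tuple.
pred = {("doc1", "BC546"), ("doc1", "BC547"), ("doc2", "2N3904")}
gold = {("doc1", "BC546"), ("doc2", "2N3904"), ("doc2", "2N3906")}

tp, fp, fn = confusion_sets(pred, gold)
prec, rec, f1 = precision_recall_f1(tp, fp, fn)
```

Because the three results are sets rather than counts, the FP and FN sets can be inspected directly during error analysis.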

fonduer.learning.utils.save_marginals(session, X, marginals, training=True)
Save marginal probabilities for a set of Candidates to the database.
Parameters:
 X – a list of arbitrary objects with candidate ids accessible via a .id attribute
 marginals – a dense M x K matrix of marginal probabilities, where K is the cardinality of the candidates, or an M-dim list/array if K=2
 training – if True, these are training marginals/labels; otherwise they are saved as end model predictions
Note: The marginals for k=0 are not stored, only those for k = 1,…,K.
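The two accepted input shapes and the k=0 rule can be illustrated without a database. The sketch below is a hypothetical helper, `marginal_rows`, that only builds (candidate id, k, probability) tuples; it assumes that a flat M-dim input holds the probability of class k=1, and it does not reproduce how the library actually writes rows.

```python
from types import SimpleNamespace


def marginal_rows(X, marginals):
    """Build (candidate_id, k, probability) tuples, skipping the k=0 column."""
    rows = []
    for x, m in zip(X, marginals):
        # Assumption: a bare number is the binary case and is P(k=1).
        if isinstance(m, (int, float)):
            m = [1.0 - m, m]
        # Column 0 is skipped; every remaining column is kept.
        for k in range(1, len(m)):
            rows.append((x.id, k, m[k]))
    return rows


# Stand-ins for candidates: any object exposing an .id attribute works.
candidates = [SimpleNamespace(id=1), SimpleNamespace(id=2)]

binary_rows = marginal_rows(candidates, [0.9, 0.2])
multi_rows = marginal_rows([SimpleNamespace(id=7)], [[0.1, 0.7, 0.2]])
```

In the binary case each candidate contributes a single row, since the k=0 probability can be recovered as one minus the stored value.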
Configuration Settings
Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduerconfig.yaml.
The learning parameters of the different models are described below:
learning:
  # LSTM model
  LSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
    # Preferred host device (Options: CPU or GPU)
    host_device: "CPU"
    # Maximum sentence length of LSTM input
    max_sentence_length: 100
  # Logistic Regression model
  LogisticRegression:
    # Bias term
    bias: False
  # Sparse LSTM model
  SparseLSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
    # Preferred host device (Options: CPU or GPU)
    host_device: "CPU"
    # Maximum sentence length of LSTM input
    max_sentence_length: 100
    # Bias term
    bias: False
  # Sparse Logistic Regression model
  SparseLogisticRegression:
    # Bias term
    bias: False
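As an illustration of how a partial user configuration could be layered over the defaults listed above, the sketch below performs a recursive dictionary merge. How Fonduer itself combines .fonduerconfig.yaml with its defaults is not described on this page, so the `merge` helper, the `DEFAULTS` dictionary (a subset of the settings above), and the override values are all assumptions.

```python
import copy

# Subset of the defaults listed above, written as a nested dict.
DEFAULTS = {
    "learning": {
        "LSTM": {
            "emb_dim": 100,
            "hidden_dim": 100,
            "attention": True,
            "dropout": 0.1,
            "bidirectional": True,
            "host_device": "CPU",
            "max_sentence_length": 100,
        },
        "LogisticRegression": {"bias": False},
    }
}


def merge(defaults, overrides):
    """Return a new dict where overrides replace defaults, key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so untouched keys keep their defaults.
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical user settings, as they might look after parsing a YAML file.
user = {"learning": {"LSTM": {"hidden_dim": 200, "host_device": "GPU"}}}
settings = merge(DEFAULTS, user)
```

With this approach a user file only needs to list the values that differ, and the defaults dictionary itself is left unmodified.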