Learning

The final stage of Fonduer’s pipeline applies machine learning: it models the noise among supervision sources to generate probabilistic labels as training data, and then uses those labels to classify each Candidate.
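As an illustration only (not Fonduer’s actual API), turning several noisy supervision votes into a probabilistic label can be sketched as a simple average; Fonduer’s generative model instead learns per-source accuracies rather than weighting all sources equally:

```python
# Illustrative sketch only: a naive way to combine noisy supervision votes
# into a probabilistic training label. Fonduer's generative model learns
# source accuracies; equal weighting here is a simplifying assumption.

def probabilistic_label(votes):
    """votes: list of -1 (negative), 0 (abstain), +1 (positive), one per source.
    Returns P(candidate is positive) under an equal-weight average."""
    active = [v for v in votes if v != 0]  # ignore abstaining sources
    if not active:
        return 0.5  # no evidence: uninformative prior
    positive = sum(1 for v in active if v == 1)
    return positive / len(active)
```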

Core Learning Objects

These are Fonduer’s core objects used for learning.

Learning Utilities

These utilities can be used during error analysis to provide additional insights.

class fonduer.learning.utils.MultiModalDataset(X, Y=None)[source]


A dataset containing all multimodal features in X and the corresponding labels Y.
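A minimal sketch of what such a dataset wrapper looks like (a hypothetical stand-in, not Fonduer’s implementation): it pairs each feature vector with its label, if any, and supports length and indexing:

```python
class SimpleMultiModalDataset:
    """Hypothetical stand-in for MultiModalDataset: pairs multimodal
    features X with optional labels Y."""

    def __init__(self, X, Y=None):
        if Y is not None and len(X) != len(Y):
            raise ValueError("X and Y must have the same length")
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Return a (features, label) pair; label is None when unlabeled.
        label = self.Y[idx] if self.Y is not None else None
        return self.X[idx], label
```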

fonduer.learning.utils.confusion_matrix(pred, gold)[source]

Return a confusion matrix.

This can be used for both entity-level and mention-level evaluation.

Parameters:
  • pred (set) – a set of predicted entities/candidates
  • gold (set) – a set of gold entities/candidates
Returns:
  a tuple of (TP, FP, FN)
Return type:
  (set, set, set)
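The set arithmetic behind these three return values is straightforward; a minimal sketch (assuming the entities/candidates are hashable):

```python
def confusion_matrix(pred, gold):
    """Return (TP, FP, FN) as sets:
    TP = predicted and correct, FP = predicted but wrong,
    FN = correct but missed."""
    pred, gold = set(pred), set(gold)
    tp = pred & gold  # true positives
    fp = pred - gold  # false positives
    fn = gold - pred  # false negatives
    return tp, fp, fn
```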

fonduer.learning.utils.save_marginals(session, X, marginals, training=True)[source]

Save marginal probabilities for a set of Candidates to the database.

Parameters:
  • X – A list of arbitrary objects with candidate ids accessible via a .id attribute
  • marginals – A dense M x K matrix of marginal probabilities, where K is the cardinality of the candidates, OR an M-dim list/array if K=2.
  • training – If True, these are training marginals/labels; otherwise they are saved as end model predictions.

Note: The marginals for k=0 are not stored; only those for k = 1,…,K are.
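The shape convention can be sketched as follows (a hypothetical helper, not Fonduer’s internals): an M-dim array for K=2 is expanded to an M x 2 matrix, and only the columns for k >= 1 are kept, since each row sums to 1 and the k=0 column is therefore redundant:

```python
def rows_to_store(marginals, K=2):
    """Expand an M-dim list of P(k=1) values into (row, k, probability)
    triples, skipping k=0 as described in the note above."""
    # For K=2, a flat list of P(k=1) values implies P(k=0) = 1 - p.
    matrix = [[1.0 - p, p] for p in marginals]
    triples = []
    for i, row in enumerate(matrix):
        for k in range(1, K):  # the k=0 column is not stored
            triples.append((i, k, row[k]))
    return triples
```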

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The learning parameters of different models are described below:

learning:
  # LSTM model
  LSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
    # Preferred host device (Options: CPU or GPU)
    host_device: "CPU"
    # Maximum sentence length of LSTM input
    max_sentence_length: 100
  # Logistic Regression model
  LogisticRegression:
    # bias term
    bias: False
  # Sparse LSTM model
  SparseLSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
    # Preferred host device (Options: CPU or GPU)
    host_device: "CPU"
    # Maximum sentence length of LSTM input
    max_sentence_length: 100
    # bias term
    bias: False
  # Sparse Logistic Regression model
  SparseLogisticRegression:
    # bias term
    bias: False
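Once .fonduer-config.yaml has been parsed by a YAML loader, these settings are plain nested mappings. A sketch of selecting one model’s parameters, with the parsed result hard-coded as a dict here to keep the example self-contained (the structure mirrors the configuration above, not Fonduer’s loader):

```python
# Assumes the YAML above has already been parsed into a nested dict.
config = {
    "learning": {
        "LSTM": {
            "emb_dim": 100,
            "hidden_dim": 100,
            "attention": True,
            "dropout": 0.1,
            "bidirectional": True,
            "host_device": "CPU",
            "max_sentence_length": 100,
        },
        "LogisticRegression": {"bias": False},
    }
}

def model_settings(config, model_name):
    """Look up one model's hyperparameters from the parsed config."""
    return config["learning"][model_name]
```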