Learning

The final stage of Fonduer’s pipeline uses machine learning models to model the noise among supervision sources, generate probabilistic labels as training data, and then classify each Candidate. Rather than maintaining a separate learning engine, Fonduer switches to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications as well as multi-task learning. With Emmental, learning involves the following steps:

  1. Create a task for each relation, and an EmmentalModel to learn those tasks.
  2. Wrap the candidates into an EmmentalDataLoader for training.
  3. Train and run inference (prediction). An end-to-end sketch of these steps is shown below.
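
A minimal end-to-end sketch of these three steps, assuming that the candidates (train_cands), feature matrices (F_train), and probabilistic training labels (train_marginals) were produced by the earlier pipeline stages (all three names are placeholders, as is the relation name "part_attr"):

import emmental
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

emmental.init()  # initialize Emmental (optionally pass a log directory)

# 1. Create a task for the relation and a model to learn it.
emb_layer = EmbeddingModule(
    word_counter=collect_word_counter(train_cands), word_dim=100
)
tasks = create_task(
    "part_attr", 2, F_train[0].shape[1], 2, emb_layer, model="LSTM"
)
model = EmmentalModel(name="fonduer")
for task in tasks:
    model.add_task(task)

# 2. Wrap the candidates into an EmmentalDataLoader.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", train_cands[0], F_train[0], emb_layer.word2id, train_marginals
    ),
    split="train",
    batch_size=64,
    shuffle=True,
)

# 3. Train, then run inference.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])
preds = model.predict(train_dataloader, return_preds=True)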

Core Learning Objects

These are Fonduer’s core objects used for learning. First, we describe how to create an Emmental task for each relation.

fonduer.learning.task.create_task(task_names: Union[str, List[str]], n_arities: Union[int, List[int]], n_features: int, n_classes: Union[int, List[int]], emb_layer: Optional[EmbeddingModule], model: str = 'LSTM') → List[EmmentalTask]

Create task(s) from relation(s).

Parameters:
  • task_names (str, List[str]) – Relation name(s). If str, a single relation; if List[str], multiple relations.
  • n_arities (int, List[int]) – The arity of each relation.
  • n_features (int) – The multimodal feature set size.
  • n_classes (int, List[int]) – The number of classes for each task. (Only classification tasks are currently supported.)
  • emb_layer (EmbeddingModule) – The embedding layer for the LSTM model. Not needed for the LogisticRegression model.
  • model (str) – The model name (available models: “LSTM”, “LogisticRegression”), defaults to “LSTM”.
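
For example, a minimal sketch of creating a task for a single binary relation with the LogisticRegression model (the relation name "part_attr" is a placeholder, and F_train is assumed to be the feature matrix from the featurization stage):

from fonduer.learning.task import create_task

tasks = create_task(
    "part_attr",          # relation name (placeholder)
    2,                    # arity of the relation
    F_train[0].shape[1],  # multimodal feature set size
    2,                    # number of classes (binary)
    None,                 # emb_layer: not needed for LogisticRegression
    model="LogisticRegression",
)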

Then, we describe how to wrap candidates into an EmmentalDataLoader.

class fonduer.learning.dataset.FonduerDataset(name: str, candidates: List[fonduer.candidates.models.candidate.Candidate], features: csr_matrix, word2id: Dict, labels: Union[np.ndarray, int], index: Optional[List[int]] = None)

Bases: emmental.data.EmmentalDataset

A dataset class, inherited from EmmentalDataset, that takes a list of candidates and the corresponding feature matrix as input and wraps them.

Parameters:
  • name (str) – The name of the dataset.
  • candidates (List[Candidate]) – The list of candidates.
  • features (csr_matrix) – The corresponding feature matrix.
  • word2id (dict) – The dictionary that maps each word to its embedding index.
  • labels (np.ndarray, int) – If np.ndarray, the labels for all candidates; if int, the number of label classes, in which case placeholder labels are created (mainly used for inference).
  • index (list) – Which candidates to use. If None, use all candidates.
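
For example, a sketch that wraps candidates for training and for inference (train_cands, test_cands, F_train, F_test, train_marginals, and emb_layer are assumed from the earlier steps):

from emmental.data import EmmentalDataLoader
from fonduer.learning.dataset import FonduerDataset

# Training: pass the (probabilistic) labels for the candidates.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", train_cands[0], F_train[0], emb_layer.word2id, train_marginals
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)

# Inference: pass the number of classes instead; placeholder labels are created.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)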

Learning Utilities

These utilities can be used during error analysis to provide additional insights.

fonduer.learning.utils.collect_word_counter(candidates: Union[List[fonduer.candidates.models.candidate.Candidate], List[List[fonduer.candidates.models.candidate.Candidate]]]) → Dict[str, int]

Collect a word counter from candidates.

Parameters: candidates (list (of lists) of candidates) – The candidates used to collect the word counter.
Returns: The word counter.
Return type: collections.Counter
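
For example, a sketch of building the LSTM’s embedding layer from the word counter (train_cands is a placeholder for the training candidates):

from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.utils import collect_word_counter

word_counter = collect_word_counter(train_cands)
# The counter is typically used to build the embedding layer for the LSTM.
emb_layer = EmbeddingModule(word_counter=word_counter, word_dim=100)
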
fonduer.learning.utils.confusion_matrix(pred: Set[T], gold: Set[T]) → Tuple[Set[T], Set[T], Set[T]]

Return a confusion matrix.

This can be used for both entity-level and mention-level evaluation.

Parameters:
  • pred (set) – A set of predicted entities/candidates.
  • gold (set) – A set of gold entities/candidates.
Returns: A tuple of the TP, FP, and FN sets.
Return type: (set, set, set)
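
For example, with illustrative entity tuples:

from fonduer.learning.utils import confusion_matrix

pred = {("BC546", "125C"), ("BC547", "150C")}
gold = {("BC546", "125C"), ("BC548", "150C")}

TP, FP, FN = confusion_matrix(pred, gold)
# TP == {("BC546", "125C")}
# FP == {("BC547", "150C")}
# FN == {("BC548", "150C")}
precision = len(TP) / (len(TP) + len(FP))  # 0.5
recall = len(TP) / (len(TP) + len(FN))     # 0.5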

fonduer.learning.utils.mark(l: int, h: int, idx: int) → List[Tuple[int, str]]

Produce markers based on argument positions.

Parameters:
  • l (int) – Sentence position of the first word in the argument.
  • h (int) – Sentence position of the last word in the argument.
  • idx (int) – Argument index (1 or 2).
Returns: The markers.
Return type: list of markers
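
For example (the exact marker strings are assumed to follow the format shown in the mark_sentence example below):

from fonduer.learning.utils import mark

# Markers for an argument spanning sentence positions 1..1, argument index 1.
markers = mark(1, 1, 1)
# markers == [(1, '~~[[1'), (2, '1]]~~')], i.e. the positions at which the
# start and end markers should be inserted into the token list.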

fonduer.learning.utils.mark_sentence(s: List[str], args: List[Tuple[int, int, int]]) → List[str]

Insert markers around relation arguments in a word sequence.

Parameters:
  • s (list) – A list of tokens in the sentence.
  • args (list) – A list of triples (l, h, idx), as per mark(…), corresponding to the relation arguments.
Returns: The marked sentence.
Return type: list

Example: Then Barack married Michelle.
-> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
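
The same example as code (token positions are 0-based):

from fonduer.learning.utils import mark_sentence

tokens = ["Then", "Barack", "married", "Michelle", "."]
# One (l, h, idx) triple per argument: "Barack" is argument 1, "Michelle" is argument 2.
marked = mark_sentence(tokens, [(1, 1, 1), (3, 3, 2)])
# ['Then', '~~[[1', 'Barack', '1]]~~', 'married', '~~[[2', 'Michelle', '2]]~~', '.']
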
fonduer.learning.utils.mention_to_tokens(mention: fonduer.candidates.models.mention.Mention, token_type: str = 'words', lowercase: bool = False) → List[str]

Extract tokens from the mention.

Parameters:
  • mention – The mention object.
  • token_type (str) – The token type to extract (e.g. words, lemmas, poses).
  • lowercase (bool) – Whether to lowercase the tokens.
Returns: The token list.
Return type: list
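
For example, a sketch that extracts tokens from the first mention of a candidate (train_cands is a placeholder; Candidate.get_mentions() returns the candidate’s constituent mentions):

from fonduer.learning.utils import mention_to_tokens

mention = train_cands[0][0].get_mentions()[0]
words = mention_to_tokens(mention, token_type="words", lowercase=True)
lemmas = mention_to_tokens(mention, token_type="lemmas")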

fonduer.learning.utils.save_marginals(session: sqlalchemy.orm.session.Session, X: List[fonduer.candidates.models.candidate.Candidate], marginals: numpy.ndarray, training: bool = True) → None

Save marginal probabilities for a set of Candidates to the database.

Parameters:
  • X – A list of arbitrary objects with candidate ids accessible via a .id attribute.
  • marginals – A dense M x K matrix of marginal probabilities, where K is the cardinality of the candidates, or an M-dim list/array if K=2.
  • training – If True, these are training marginals/labels; otherwise, they are saved as end model predictions.

Note: The marginals for k=0 are not stored, only those for k = 1,…,K.
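
For example, a sketch of saving training marginals for two candidates (session and train_cands are placeholders for the Fonduer session and candidate lists):

import numpy as np
from fonduer.learning.utils import save_marginals

# An M x K matrix of marginal probabilities (here M=2 candidates, K=2 classes).
train_marginals = np.array([[0.1, 0.9], [0.8, 0.2]])
save_marginals(session, train_cands[0][:2], train_marginals, training=True)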

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The learning parameters of different models are described below:

learning:
  # LSTM model
  LSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
  # Logistic Regression model
  LogisticRegression:
    # Bias term
    bias: False