Learning
The final stage of Fonduer’s pipeline uses machine learning models to model the noise between supervision sources, generating probabilistic labels as training data, and then classifies each Candidate. Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications as well as multi-task learning. With Emmental, learning takes the following steps:
1. Create a task for each relation and an EmmentalModel to learn those tasks.
2. Wrap candidates into an EmmentalDataLoader for training.
3. Train the model and run inference (prediction).
Core Learning Objects
These are Fonduer’s core objects used for learning. First, we describe how to create an Emmental task for each relation.
Customized Emmental task for Fonduer.
fonduer.learning.task.create_task(task_names, n_arities, n_features, n_classes, emb_layer, model='LSTM', mode='MTL')
Create task from relation(s).
- Parameters
  - task_names (Union[str, List[str]]) – Relation name(s). If str, only one relation; if List[str], multiple relations.
  - n_arities (Union[int, List[int]]) – The arity of each relation.
  - n_features (int) – The multimodal feature set size.
  - n_classes (Union[int, List[int]]) – Number of classes for each task. (Only classification tasks are supported for now.)
  - emb_layer (Optional[EmbeddingModule]) – The embedding layer for the LSTM model. Not needed for the LogisticRegression model.
  - model (str) – Model name (available models: “LSTM”, “LogisticRegression”), defaults to “LSTM”.
  - mode (str) – Learning mode (available modes: “STL”, “MTL”), defaults to “MTL”.
- Return type
  List[EmmentalTask]
fonduer.learning.task.loss(module_name, intermediate_output_dict, Y, active)
Define the loss of the task.
- Parameters
  - module_name (str) – The module name to calculate the loss.
  - intermediate_output_dict (Dict[str, Any]) – The intermediate output dictionary.
  - Y (Tensor) – Ground truth labels.
  - active (Tensor) – The sample mask.
- Return type
  Tensor
- Returns
  Loss.
fonduer.learning.task.output(module_name, intermediate_output_dict)
Define the output of the task.
- Parameters
  - module_name (str) – The module name to calculate the output.
  - intermediate_output_dict (Dict[str, Any]) – The intermediate output dictionary.
- Return type
  Tensor
- Returns
  Output tensor.
Then, we describe how to wrap candidates into an EmmentalDataLoader.
Fonduer dataset.
class fonduer.learning.dataset.FonduerDataset(name, candidates, features, word2id, labels, index=None)
Bases: emmental.data.
A FonduerDataset class, which inherits from EmmentalDataset.
This class takes a list of candidates and the corresponding feature matrix as input and wraps them.
- Parameters
  - name (str) – The name of the dataset.
  - candidates (List[Candidate]) – The list of candidates.
  - features (csr_matrix) – The corresponding feature matrix.
  - word2id (Dict) – The mapping from words to ids.
  - labels (Union[array, int]) – If np.array, the labels for all candidates; if int, the number of label classes, in which case placeholder labels are created (mainly used for inference).
  - index – Which candidates to use. If None, use all candidates.
Initialize FonduerDataset.
Learning Utilities
These utilities can be used during error analysis to provide additional insights.
Fonduer learning utils.
fonduer.learning.utils.collect_word_counter(candidates)
Collect word counter from candidates.
- Parameters
  - candidates (Union[List[Candidate], List[List[Candidate]]]) – The candidates used to collect word counter.
- Return type
  Dict[str, int]
- Returns
  The word counter.
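To illustrate the two accepted input shapes, the sketch below counts tokens over either a flat list of candidates or a list of candidate lists. It is not Fonduer's implementation: `count_words` is a hypothetical name, a candidate is stood in for by a tuple of mentions, and each mention is a plain list of tokens.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

# Stand-ins (assumptions, not Fonduer classes):
# a mention is a list of tokens, a candidate is a tuple of mentions.
Mention = List[str]
Candidate = Tuple[Mention, ...]


def count_words(
    candidates: Union[List[Candidate], List[List[Candidate]]]
) -> Dict[str, int]:
    """Count token occurrences over a flat or nested list of candidates."""
    counter: Counter = Counter()
    # A list of candidate lists is flattened into one list of candidates.
    if candidates and isinstance(candidates[0], list):
        candidates = [cand for group in candidates for cand in group]
    for candidate in candidates:
        for mention in candidate:
            counter.update(mention)
    return dict(counter)


flat = [(["Barack"], ["Michelle"]), (["Barack"], ["Obama"])]
nested = [flat[:1], flat[1:]]
print(count_words(flat))
print(count_words(nested) == count_words(flat))
```

Both calls yield the same counts, which is the point of accepting either shape: the caller does not need to flatten grouped candidates first.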
fonduer.learning.utils.confusion_matrix(pred, gold)
Return a confusion matrix.
This can be used for both entity-level and mention-level evaluation.
- Parameters
  - pred (Set) – A set of predicted entities/candidates.
  - gold (Set) – A set of golden entities/candidates.
- Return type
  Tuple[Set, Set, Set]
- Returns
  A tuple of TP, FP, and FN.
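The TP, FP, and FN sets can be derived with plain set operations, and precision, recall, and F1 follow from their sizes. The sketch below shows that idea on its own, without importing Fonduer. The names `confusion_sets` and `precision_recall_f1` and the sample (document, part) pairs are hypothetical.

```python
from typing import Hashable, Set, Tuple


def confusion_sets(
    pred: Set[Hashable], gold: Set[Hashable]
) -> Tuple[Set[Hashable], Set[Hashable], Set[Hashable]]:
    """Split predictions and gold items into TP, FP, and FN sets."""
    tp = pred & gold  # predicted and correct
    fp = pred - gold  # predicted but not in gold
    fn = gold - pred  # in gold but missed
    return tp, fp, fn


def precision_recall_f1(
    pred: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from the set sizes."""
    tp, fp, fn = confusion_sets(pred, gold)
    precision = len(tp) / (len(tp) + len(fp)) if pred else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical entity-level items: (document name, extracted part number).
pred = {("doc1", "BC546"), ("doc1", "BC547"), ("doc2", "BC548")}
gold = {("doc1", "BC546"), ("doc2", "BC548"), ("doc2", "BC549")}
tp, fp, fn = confusion_sets(pred, gold)
print(sorted(fp), sorted(fn))
print(precision_recall_f1(pred, gold))
```

Keeping the three results as sets rather than counts is useful during error analysis, since the FP and FN members can be inspected directly.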
fonduer.learning.utils.mark(l, h, idx)
Produce markers based on argument positions.
- Parameters
  - l (int) – Sentence position of the first word in the argument.
  - h (int) – Sentence position of the last word in the argument.
  - idx (int) – Argument index (1 or 2).
- Return type
  List[Tuple[int, str]]
- Returns
  Markers.
fonduer.learning.utils.mark_sentence(s, args)
Insert markers around relation arguments in word sequence.
- Parameters
  - s (List[str]) – List of tokens in the sentence.
  - args (List[Tuple[int, int, int]]) – List of triples (l, h, idx), as per mark(…), corresponding to relation arguments.
- Return type
  List[str]
- Returns
  The marked sentence.
- Example:
  Then Barack married Michelle. -> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
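The marking scheme in the example can be reproduced with a short standalone sketch. It assumes the opening marker goes before position l and the closing marker after position h, which matches the example above. The function names `mark_sketch` and `mark_sentence_sketch` are hypothetical and are not Fonduer's own code.

```python
from typing import List, Tuple


def mark_sketch(l: int, h: int, idx: int) -> List[Tuple[int, str]]:
    """Return (insert position, marker) pairs for one argument."""
    # The closing marker is inserted just after the last word, hence h + 1.
    return [(l, f"~~[[{idx}"), (h + 1, f"{idx}]]~~")]


def mark_sentence_sketch(
    s: List[str], args: List[Tuple[int, int, int]]
) -> List[str]:
    """Insert markers around each (l, h, idx) argument span."""
    markers = sorted(
        (m for arg in args for m in mark_sketch(*arg)), reverse=True
    )
    marked = list(s)  # copy so the input token list is left unchanged
    # Insert from right to left so earlier positions are not shifted.
    for position, marker in markers:
        marked.insert(position, marker)
    return marked


tokens = ["Then", "Barack", "married", "Michelle", "."]
print(" ".join(mark_sentence_sketch(tokens, [(1, 1, 1), (3, 3, 2)])))
# prints: Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~ .
```

Inserting in descending position order is the design choice that keeps the original token indices valid for every marker.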
fonduer.learning.utils.mention_to_tokens(mention, token_type='words', lowercase=False)
Extract tokens from the mention.
- Parameters
  - mention (Mention) – Mention object.
  - token_type (str) – The token type to extract (e.g. words, lemmas, poses).
  - lowercase (bool) – Whether to lowercase the tokens.
- Return type
  List[str]
- Returns
  The token list.
fonduer.learning.utils.save_marginals(session, X, marginals, training=True)
Save marginal probabilities for a set of Candidates to db.
- Parameters
  - X (List[Candidate]) – A list of arbitrary objects with candidate ids accessible via a .id attribute.
  - marginals (Session) – A dense M x K matrix of marginal probabilities, where K is the cardinality of the candidates, OR an M-dim list/array if K=2.
  - training (bool) – If True, these are training marginals / labels; else they are saved as end model predictions.
Note: The marginals for k=0 are not stored, only for k = 1,…,K.
- Return type
  None
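The storage convention in the note can be pictured as flattening the marginals into (candidate id, k, probability) rows and skipping k=0. The sketch below illustrates only that convention, not the database code. The name `marginal_rows` is hypothetical, and two things are assumptions: that column 0 of each row holds k=0, and that a scalar entry in the M-dim form is the probability of k=1.

```python
from typing import List, Sequence, Tuple, Union

Row = Union[float, Sequence[float]]


def marginal_rows(
    candidate_ids: Sequence[int], marginals: Sequence[Row]
) -> List[Tuple[int, int, float]]:
    """Flatten marginals into (candidate_id, k, probability) rows, k >= 1."""
    rows: List[Tuple[int, int, float]] = []
    for cid, row in zip(candidate_ids, marginals):
        # Assumption: a scalar is P(k=1) for a binary task.
        if isinstance(row, (int, float)):
            row = [1.0 - row, row]
        # Column 0 (k=0) is skipped; only the remaining columns are kept.
        for k in range(1, len(row)):
            rows.append((cid, k, float(row[k])))
    return rows


# Binary case given as an M-dim list.
print(marginal_rows([10, 11], [0.9, 0.2]))
# Three-column case given as a dense matrix (list of rows).
print(marginal_rows([10], [[0.1, 0.7, 0.2]]))
```

Dropping k=0 loses no information when each row sums to one, since the omitted value can be recovered from the stored ones.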
Configuration Settings
Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.
The learning parameters of different models are described below:
learning:
  # LSTM model
  LSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
  # Logistic Regression model
  LogisticRegression:
    # The number of features in the LogisticRegression hidden state
    hidden_dim: 100
    # bias term
    bias: False