Learning

The final stage of Fonduer’s pipeline uses machine learning models to model the noise among supervision sources, generate probabilistic labels as training data, and then classify each Candidate. Rather than maintaining a separate learning engine, Fonduer switches to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications as well as multi-task learning. With Emmental, learning involves the following steps:

  1. Create a task for each relation, and an EmmentalModel to learn those tasks.
  2. Wrap the candidates into an EmmentalDataLoader for training.
  3. Train and run inference (prediction). An end-to-end sketch of these steps is shown below.
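
A minimal end-to-end sketch of these three steps, assuming that the candidates (train_cands), feature matrices (F_train), and probabilistic training labels (train_marginals) were produced by the earlier pipeline stages (all three names are placeholders, as is the relation name "part_attr"):

import emmental
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

emmental.init()  # initialize Emmental (optionally pass a log directory)

# 1. Create a task for the relation and a model to learn it.
emb_layer = EmbeddingModule(
    word_counter=collect_word_counter(train_cands), word_dim=100
)
tasks = create_task(
    "part_attr", 2, F_train[0].shape[1], 2, emb_layer, model="LSTM"
)
model = EmmentalModel(name="fonduer")
for task in tasks:
    model.add_task(task)

# 2. Wrap the candidates into an EmmentalDataLoader.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", train_cands[0], F_train[0], emb_layer.word2id, train_marginals
    ),
    split="train",
    batch_size=64,
    shuffle=True,
)

# 3. Train, then run inference.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])
preds = model.predict(train_dataloader, return_preds=True)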

Core Learning Objects

These are Fonduer’s core objects used for learning. First, we describe how to create an Emmental task for each relation.

fonduer.learning.task.create_task(task_names: Union[str, List[str]], n_arities: Union[int, List[int]], n_features: int, n_classes: Union[int, List[int]], emb_layer: Optional[EmbeddingModule], model: str = 'LSTM') → List[EmmentalTask]

Create task(s) from relation(s).

Parameters:
  • task_names (str, List[str]) – Relation name(s). If str, a single relation; if List[str], multiple relations.
  • n_arities (int, List[int]) – The arity of each relation.
  • n_features (int) – The multimodal feature set size.
  • n_classes (int, List[int]) – The number of classes for each task. (Only classification tasks are currently supported.)
  • emb_layer (EmbeddingModule) – The embedding layer for the LSTM model. Not needed for the LogisticRegression model.
  • model (str) – The model name (available models: “LSTM”, “LogisticRegression”), defaults to “LSTM”.
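
For example, a minimal sketch of creating a task for a single binary relation with the LogisticRegression model (the relation name "part_attr" is a placeholder, and F_train is assumed to be the feature matrix from the featurization stage):

from fonduer.learning.task import create_task

tasks = create_task(
    "part_attr",          # relation name (placeholder)
    2,                    # arity of the relation
    F_train[0].shape[1],  # multimodal feature set size
    2,                    # number of classes (binary)
    None,                 # emb_layer: not needed for LogisticRegression
    model="LogisticRegression",
)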

Then, we describe how to wrap candidates into an EmmentalDataLoader.

class fonduer.learning.dataset.FonduerDataset(name: str, candidates: List[fonduer.candidates.models.candidate.Candidate], features: csr_matrix, word2id: Dict, labels: Union[np.ndarray, int], index: Optional[List[int]] = None)

Bases: emmental.data.EmmentalDataset

A dataset class, inherited from EmmentalDataset, that takes a list of candidates and the corresponding feature matrix as input and wraps them.

Parameters:
  • name (str) – The name of the dataset.
  • candidates (List[Candidate]) – The list of candidates.
  • features (csr_matrix) – The corresponding feature matrix.
  • word2id (dict) – The dictionary that maps each word to its embedding index.
  • labels (np.ndarray, int) – If np.ndarray, the labels for all candidates; if int, the number of label classes, in which case placeholder labels are created (mainly used for inference).
  • index (list) – Which candidates to use. If None, use all candidates.
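
For example, a sketch that wraps candidates for training and for inference (train_cands, test_cands, F_train, F_test, train_marginals, and emb_layer are assumed from the earlier steps):

from emmental.data import EmmentalDataLoader
from fonduer.learning.dataset import FonduerDataset

# Training: pass the (probabilistic) labels for the candidates.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", train_cands[0], F_train[0], emb_layer.word2id, train_marginals
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)

# Inference: pass the number of classes instead; placeholder labels are created.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={"part_attr": "labels"},
    dataset=FonduerDataset(
        "part_attr", test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)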

Learning Utilities

These utilities can be used during error analysis to provide additional insights.

fonduer.learning.utils.collect_word_counter(candidates: Union[List[fonduer.candidates.models.candidate.Candidate], List[List[fonduer.candidates.models.candidate.Candidate]]]) → Dict[str, int]

Collect a word counter from candidates.

Parameters: candidates (list (of lists) of candidates) – The candidates used to collect the word counter.
Returns: The word counter.
Return type: collections.Counter
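
For example, a sketch of building the LSTM’s embedding layer from the word counter (train_cands is a placeholder for the training candidates):

from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.utils import collect_word_counter

word_counter = collect_word_counter(train_cands)
# The counter is typically used to build the embedding layer for the LSTM.
emb_layer = EmbeddingModule(word_counter=word_counter, word_dim=100)
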
fonduer.learning.utils.confusion_matrix(pred: Set[T], gold: Set[T]) → Tuple[Set[T], Set[T], Set[T]]

Return a confusion matrix.

This can be used for both entity-level and mention-level evaluation.

Parameters:
  • pred (set) – A set of predicted entities/candidates.
  • gold (set) – A set of gold entities/candidates.
Returns: A tuple of the TP, FP, and FN sets.
Return type: (set, set, set)
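
For example, with illustrative entity tuples:

from fonduer.learning.utils import confusion_matrix

pred = {("BC546", "125C"), ("BC547", "150C")}
gold = {("BC546", "125C"), ("BC548", "150C")}

TP, FP, FN = confusion_matrix(pred, gold)
# TP == {("BC546", "125C")}
# FP == {("BC547", "150C")}
# FN == {("BC548", "150C")}
precision = len(TP) / (len(TP) + len(FP))  # 0.5
recall = len(TP) / (len(TP) + len(FN))     # 0.5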

fonduer.learning.utils.mark(l: int, h: int, idx: int) → List[Tuple[int, str]]

Produce markers based on argument positions.

Parameters:
  • l (int) – Sentence position of the first word in the argument.
  • h (int) – Sentence position of the last word in the argument.
  • idx (int) – Argument index (1 or 2).
Returns: The markers.
Return type: list of markers
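
For example (the exact marker strings are assumed to follow the format shown in the mark_sentence example below):

from fonduer.learning.utils import mark

# Markers for an argument spanning sentence positions 1..1, argument index 1.
markers = mark(1, 1, 1)
# markers == [(1, '~~[[1'), (2, '1]]~~')], i.e. the positions at which the
# start and end markers should be inserted into the token list.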

fonduer.learning.utils.mark_sentence(s: List[str], args: List[Tuple[int, int, int]]) → List[str]

Insert markers around relation arguments in a word sequence.

Parameters:
  • s (list) – A list of tokens in the sentence.
  • args (list) – A list of triples (l, h, idx), as per mark(…), corresponding to the relation arguments.
Returns: The marked sentence.
Return type: list

Example: Then Barack married Michelle.
-> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
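
The same example as code (token positions are 0-based):

from fonduer.learning.utils import mark_sentence

tokens = ["Then", "Barack", "married", "Michelle", "."]
# One (l, h, idx) triple per argument: "Barack" is argument 1, "Michelle" is argument 2.
marked = mark_sentence(tokens, [(1, 1, 1), (3, 3, 2)])
# ['Then', '~~[[1', 'Barack', '1]]~~', 'married', '~~[[2', 'Michelle', '2]]~~', '.']
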
fonduer.learning.utils.mention_to_tokens(mention: fonduer.candidates.models.mention.Mention, token_type: str = 'words', lowercase: bool = False) → List[str]

Extract tokens from the mention.

Parameters:
  • mention – The mention object.
  • token_type (str) – The token type to extract (e.g. words, lemmas, poses).
  • lowercase (bool) – Whether to lowercase the tokens.
Returns: The token list.
Return type: list
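
For example, a sketch that extracts tokens from the first mention of a candidate (train_cands is a placeholder; Candidate.get_mentions() returns the candidate’s constituent mentions):

from fonduer.learning.utils import mention_to_tokens

mention = train_cands[0][0].get_mentions()[0]
words = mention_to_tokens(mention, token_type="words", lowercase=True)
lemmas = mention_to_tokens(mention, token_type="lemmas")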

fonduer.learning.utils.save_marginals(session: sqlalchemy.orm.session.Session, X: List[fonduer.candidates.models.candidate.Candidate], marginals: numpy.ndarray, training: bool = True) → None

Save marginal probabilities for a set of Candidates to the database.

Parameters:
  • X – A list of arbitrary objects with candidate ids accessible via a .id attribute.
  • marginals – A dense M x K matrix of marginal probabilities, where K is the cardinality of the candidates, or an M-dim list/array if K=2.
  • training – If True, these are training marginals/labels; otherwise, they are saved as end model predictions.

Note: The marginals for k=0 are not stored, only those for k = 1,…,K.
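
For example, a sketch of saving training marginals for two candidates (session and train_cands are placeholders for the Fonduer session and candidate lists):

import numpy as np
from fonduer.learning.utils import save_marginals

# An M x K matrix of marginal probabilities (here M=2 candidates, K=2 classes).
train_marginals = np.array([[0.1, 0.9], [0.8, 0.2]])
save_marginals(session, train_cands[0][:2], train_marginals, training=True)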

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The learning parameters of different models are described below:

learning:
  # LSTM model
  LSTM:
    # Word embedding dimension size
    emb_dim: 100
    # The number of features in the LSTM hidden state
    hidden_dim: 100
    # Use attention or not (Options: True or False)
    attention: True
    # Dropout parameter
    dropout: 0.1
    # Use bidirectional LSTM or not (Options: True or False)
    bidirectional: True
  # Logistic Regression model
  LogisticRegression:
    # Bias term
    bias: False