Supervision

The fourth stage of Fonduer’s pipeline provides weak supervision, which can be used to generate a large set of training data.

Supervision Model Classes

These are the model classes used for supervision in Fonduer.

class fonduer.supervision.models.GoldLabel(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.ext.declarative.api.Base

A separate class for labels from human annotators or other gold standards.

candidate

The Candidate.

candidate_id

The id of the Candidate being annotated.

keys

A list of strings of each Key name.

values

A list of integer values for each Key.

class fonduer.supervision.models.GoldLabelKey(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.ext.declarative.api.Base

A gold label’s key that identifies the annotator of the gold label.

candidate_classes

The names of the candidate classes associated with the Key.

name

The name of the Key.

class fonduer.supervision.models.Label(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.ext.declarative.api.Base

A discrete label associated with a Candidate, indicating a target prediction value.

Labels are used to represent the output of labeling functions. A Label’s annotation key identifies the labeling function that provided the Label.
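
As a rough illustration of the relationship between labeling functions and Label keys, the sketch below treats a labeling function as a plain Python callable that takes a candidate and returns an integer. The `FakeCandidate` type, its fields, the two labeling functions, and the label values are all hypothetical stand-ins chosen for this example; check the label convention used by your Fonduer version.

```python
from collections import namedtuple

# Hypothetical stand-in for a Fonduer Candidate (not the real class).
FakeCandidate = namedtuple("FakeCandidate", ["part", "temp"])

# Assumed label values, for illustration only.
ABSTAIN = 0
TRUE = 1
FALSE = -1


def LF_temp_in_range(c):
    """Vote TRUE when the temperature falls in an assumed plausible range."""
    return TRUE if -65 <= c.temp <= 150 else ABSTAIN


def LF_suspicious_part(c):
    """Vote FALSE when the part name starts with an assumed bad prefix."""
    return FALSE if c.part.startswith("X") else ABSTAIN


cand = FakeCandidate(part="BC546", temp=150)

# Each label is keyed by the name of the labeling function that produced it.
labels = {lf.__name__: lf(cand) for lf in (LF_temp_in_range, LF_suspicious_part)}
```

Keying each value by the function name mirrors the idea that a Label’s annotation key identifies the labeling function that provided it.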

candidate

The Candidate.

candidate_id

The id of the Candidate being annotated.

keys

A list of strings of each Key name.

values

A list of integer values for each Key.

class fonduer.supervision.models.LabelKey(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.ext.declarative.api.Base

A label’s key that identifies the labeling function.

candidate_classes

The names of the candidate classes associated with the Key.

name

The name of the Key.

class fonduer.supervision.models.StableLabel(**kwargs)

Bases: sqlalchemy.ext.declarative.api.Base

A special secondary table for preserving labels created by human annotators in a stable format that does not cascade, and is independent of the Candidate IDs.

Note

This is currently unused.

annotator_name

The annotator’s name.

context_stable_ids

Delimited list of the context stable ids.

split

Which split the label belongs to.

Core Objects

These are Fonduer’s core objects used for supervision.

class fonduer.supervision.Labeler(session: sqlalchemy.orm.session.Session, candidate_classes: List[Type[fonduer.candidates.models.candidate.Candidate]], parallelism: int = 1)

Bases: fonduer.utils.udf.UDFRunner

An operator to add Label Annotations to Candidates.

Parameters:
  • session – The database session to use.
  • candidate_classes (list) – A list of Candidate subclasses to label.
  • parallelism (int) – The number of processes to use in parallel. Default 1.

apply(docs: Collection[fonduer.parser.models.document.Document] = None, split: int = 0, train: bool = False, lfs: List[List[Callable]] = None, clear: bool = True, parallelism: int = None, progress_bar: bool = True, table: sqlalchemy.sql.schema.Table = <class 'fonduer.supervision.models.label.Label'>) → None

Apply the labels of the specified candidates based on the provided LFs.

Parameters:
  • docs – If provided, apply the LFs to all the candidates in these documents.
  • split (int) – If docs is None, apply the LFs to the candidates in this particular split.
  • train (bool) – Whether or not to update the global key set of labels and the labels of candidates.
  • lfs (list of lists) – A list of lists of labeling functions to apply. Each list should correspond with the candidate_classes used to initialize the Labeler.
  • clear (bool) – Whether or not to clear the labels table before applying these LFs.
  • parallelism (int) – How many processes to use for labeling. If provided, this overrides the parallelism value used to initialize the Labeler.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
Raises:

ValueError – If labeling functions are not provided for each candidate class.
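
The alignment between `lfs` and `candidate_classes` can be sketched as follows. `check_lfs` is a hypothetical helper written for this example, not part of Fonduer, and the length comparison is only an assumption about how the documented ValueError condition might be detected. The candidate classes and labeling functions are placeholders.

```python
from typing import Callable, List, Sequence


def check_lfs(candidate_classes: Sequence[type], lfs: List[List[Callable]]) -> None:
    """Hypothetical check: require one list of LFs per candidate class."""
    if len(lfs) != len(candidate_classes):
        raise ValueError("Labeling functions must be provided for each candidate class.")


# Placeholder candidate classes (real ones would be Candidate subclasses).
class PartTemp:
    pass


class PartVolt:
    pass


# Placeholder labeling functions.
def lf_a(c):
    return 1


def lf_b(c):
    return 0


def lf_c(c):
    return 1


candidate_classes = [PartTemp, PartVolt]

# lfs[0] applies to PartTemp candidates, lfs[1] to PartVolt candidates.
lfs = [[lf_a, lf_b], [lf_c]]
check_lfs(candidate_classes, lfs)

# Omitting the LFs for PartVolt triggers the error.
try:
    check_lfs(candidate_classes, [[lf_a]])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With a real database session, the same two lists would be passed as `Labeler(session, candidate_classes)` and `apply(split=0, lfs=lfs, train=True)`, following the signatures documented above.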

clear(train: bool, split: int, lfs: Optional[List[List[Callable]]] = None, table: sqlalchemy.sql.schema.Table = <class 'fonduer.supervision.models.label.Label'>, **kwargs) → None

Delete Labels of each class from the database.

Parameters:
  • train (bool) – Whether or not to clear the LabelKeys.
  • split (int) – Which split of candidates to clear labels from.
  • lfs – This parameter is ignored.

clear_all(table: sqlalchemy.sql.schema.Table = <class 'fonduer.supervision.models.label.Label'>) → None

Delete all Labels.

drop_keys(keys: Iterable[Union[str, Callable]], candidate_classes: Union[Type[fonduer.candidates.models.candidate.Candidate], List[Type[fonduer.candidates.models.candidate.Candidate]], None] = None) → None

Drop the specified keys from LabelKeys.

Parameters:
  • keys (list, tuple) – A list of labeling functions to delete.
  • candidate_classes (list, tuple) – A list of the Candidates to drop the key for. If None, drops the keys for all candidate classes associated with this Labeler.
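
Because `keys` is typed as an iterable of strings or callables, a key can be given either by name or as the labeling function itself. The sketch below shows one plausible way to normalize such a mix to key names. `key_names` is a hypothetical helper, and using the function’s `__name__` as its key name is an assumption made for this example.

```python
from typing import Callable, Iterable, List, Union


def key_names(keys: Iterable[Union[str, Callable]]) -> List[str]:
    """Hypothetical helper: map each key to a string name.

    Strings are kept as-is; callables are assumed to be identified
    by their __name__ attribute.
    """
    return [k if isinstance(k, str) else k.__name__ for k in keys]


def LF_example(c):
    """Placeholder labeling function."""
    return 0


names = key_names(["LF_by_name", LF_example])
```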

get_gold_labels(cand_lists: List[List[fonduer.candidates.models.candidate.Candidate]], annotator: Optional[str] = None) → List[np.ndarray]

Load a dense matrix of GoldLabels for each candidate_class.

Parameters:
  • cand_lists (list of lists of candidates) – The candidates to get gold labels for.
  • annotator (str) – A specific annotator key to get labels for. Default None.
Returns:

A list of MxN dense matrices, one per candidate class, where M is the number of candidates and N is the number of annotators. If annotator is provided, return a list of Mx1 matrices.

Return type:

list[np.ndarray]

get_keys() → List[str]

Return a list of keys for the Labels.

Returns:

List of LabelKeys.

Return type:

list

get_label_matrices(cand_lists: List[List[fonduer.candidates.models.candidate.Candidate]]) → List[np.ndarray]

Load a dense matrix of Labels for each candidate_class.

Parameters:
  • cand_lists (list of lists of candidates) – The candidates to get labels for.
Returns:

A list of MxN dense matrices, one per candidate class, where M is the number of candidates and N is the number of labeling functions.

Return type:

list[np.ndarray]
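
To make the MxN shape concrete, the sketch below expands per-candidate `keys` and `values` lists, as described for the Label model, into dense rows with one column per labeling function. It uses plain Python lists rather than `np.ndarray` so that it runs without dependencies. `to_dense`, the sample data, and filling missing entries with 0 are assumptions made for this example, not Fonduer’s implementation.

```python
from typing import Dict, List


def to_dense(rows: List[Dict[str, int]], key_order: List[str], default: int = 0) -> List[List[int]]:
    """Hypothetical helper: build an MxN matrix with one column per key.

    Entries with no stored label are filled with `default` (an assumption).
    """
    return [[row.get(key, default) for key in key_order] for row in rows]


# One (keys, values) pair per candidate; only stored labels are listed.
sparse = [
    (["lf_a", "lf_b"], [1, -1]),
    (["lf_b"], [1]),
    ([], []),
]

# Pair each key with its value to get one mapping per candidate.
rows = [dict(zip(keys, values)) for keys, values in sparse]

# Column order: one column per labeling function.
key_order = ["lf_a", "lf_b"]

# M = 3 candidates, N = 2 labeling functions.
L = to_dense(rows, key_order)
```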

update(docs: Collection[fonduer.parser.models.document.Document] = None, split: int = 0, lfs: List[List[Callable]] = None, parallelism: int = None, progress_bar: bool = True, table: sqlalchemy.sql.schema.Table = <class 'fonduer.supervision.models.label.Label'>) → None

Update the labels of the specified candidates based on the provided LFs.

Parameters:
  • docs – If provided, apply the updated LFs to all the candidates in these documents.
  • split – If docs is None, apply the updated LFs to the candidates in this particular split.
  • lfs – A list of lists of labeling functions to update. Each list should correspond with the candidate_classes used to initialize the Labeler.
  • parallelism (int) – How many processes to use for labeling. If provided, this overrides the parallelism value used to initialize the Labeler.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

upsert_keys(keys: Iterable[Union[str, Callable]], candidate_classes: Union[Type[fonduer.candidates.models.candidate.Candidate], List[Type[fonduer.candidates.models.candidate.Candidate]], None] = None) → None

Upsert the specified keys into LabelKeys.

Parameters:
  • keys (list, tuple) – A list of labeling functions to upsert.
  • candidate_classes (list, tuple) – A list of the Candidates to upsert the key for. If None, upsert the keys for all candidate classes associated with this Labeler.