Multimodal Featurization

The third stage of Fonduer’s pipeline is to featurize each Candidate with multimodal features.

Feature Model Classes

The following describes the Feature element.

class fonduer.features.models.Feature(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.ext.declarative.api.Base

An element of a representation of a Candidate in a feature space.

A Feature’s annotation key identifies the definition of the Feature, e.g., a function that implements it or the library name and feature name in an automatic featurization library.

candidate

The Candidate being annotated.

candidate_id

The id of the Candidate being annotated.

keys

A list of strings of each Key name.

values

A list of floating point values for each Key.
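Because keys and values are parallel lists, a Feature row can be read as a sparse mapping from key name to value. The sketch below illustrates that reading only. It uses a hypothetical stand-in class, FeatureRow, rather than the real SQLAlchemy model, and the key names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRow:
    """Hypothetical stand-in for a Feature row (not the Fonduer model)."""

    candidate_id: int
    keys: List[str] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def as_dict(self) -> Dict[str, float]:
        # keys[i] names the feature whose value is values[i].
        if len(self.keys) != len(self.values):
            raise ValueError("keys and values must have the same length")
        return dict(zip(self.keys, self.values))


# Made-up key names, for illustration only.
row = FeatureRow(candidate_id=1, keys=["feat_a", "feat_b"], values=[1.0, 1.0])
print(row.as_dict())
```

Keys that are absent from a row are simply not stored, which is why the feature matrix built from these rows is sparse.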

class fonduer.features.models.FeatureKey(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.ext.declarative.api.Base

A feature’s key that identifies the definition of the Feature.

candidate_classes

The names of the candidate classes that the Key applies to.

name

The name of the Key.

Core Objects

These are Fonduer’s core objects used for featurization.

class fonduer.features.Featurizer(session, candidate_classes, parallelism=1)

Bases: fonduer.utils.udf.UDFRunner

An operator to add Feature Annotations to Candidates.

Parameters:
  • session – The database session to use.
  • candidate_classes (list) – A list of Candidate subclasses to featurize.
  • parallelism (int) – The number of processes to use in parallel. Default 1.
apply(docs=None, split=0, train=False, clear=True, parallelism=None, progress_bar=True)

Apply features to the specified candidates.

Parameters:
  • docs – If provided, apply features to all the candidates in these documents.
  • split (int) – If docs is None, apply features to the candidates in this particular split.
  • train (bool) – Whether or not to update the global key set of features and the features of candidates.
  • clear (bool) – Whether or not to clear the features table before applying features.
  • parallelism (int) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
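A pattern that follows from these parameters is to call apply with train=True on the training split, so that the key set is built there, and with train=False on the remaining splits. The sketch below assumes that splits 0, 1 and 2 hold the train, dev and test candidates, which is a convention and not something this page specifies. It uses a hypothetical RecordingFeaturizer stub so that it runs without a database; with Fonduer installed, the stub would be replaced by Featurizer(session, candidate_classes).

```python
class RecordingFeaturizer:
    """Hypothetical stub that records apply() calls instead of using a database."""

    def __init__(self):
        self.calls = []

    def apply(self, docs=None, split=0, train=False, clear=True,
              parallelism=None, progress_bar=True):
        # Same keyword arguments as Featurizer.apply, but nothing is featurized.
        self.calls.append({"split": split, "train": train, "clear": clear})


def featurize_all_splits(featurizer, train_split=0, other_splits=(1, 2)):
    # Build the global feature key set from the training split only.
    featurizer.apply(split=train_split, train=True)
    # Featurize the remaining splits without updating the key set.
    for split in other_splits:
        featurizer.apply(split=split, train=False)


featurizer = RecordingFeaturizer()
featurize_all_splits(featurizer)
for call in featurizer.calls:
    print(call)
```

Keeping train=False outside the training split means the dev and test candidates do not add keys to the global key set.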
clear(train=False, split=0)

Delete Features of each class from the database.

Parameters:
  • train (bool) – Whether or not to clear the FeatureKeys.
  • split (int) – Which split of candidates to clear features from.
clear_all()

Delete all Features.

drop_keys(keys, candidate_classes=None)

Drop the specified keys from FeatureKeys.

Parameters:
  • keys (list | tuple) – A list of FeatureKey names to delete.
  • candidate_classes (list | tuple) – A list of the Candidate classes to drop the keys for. If None, drops the keys for all candidate classes associated with this Featurizer.
get_feature_matrices(cand_lists)

Load a sparse matrix of Features for each candidate class.

Parameters: cand_lists (list of lists of candidates) – The candidates to get features for.
Returns: A list of MxN sparse matrices, one per candidate class, where M is the number of candidates and N is the number of features.
Return type: list[csr_matrix]
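To make the MxN layout concrete, the sketch below builds CSR-style arrays (data, indices, indptr) from per-candidate key and value lists in plain Python. It illustrates the matrix shape only and is not Fonduer's implementation; build_csr and the sample keys are hypothetical, and the real method returns SciPy csr_matrix objects.

```python
from typing import Dict, List, Sequence, Tuple


def build_csr(
    rows: Sequence[Tuple[List[str], List[float]]], key_names: List[str]
) -> Dict[str, object]:
    """Build CSR arrays for an M x N matrix (M candidates, N feature keys)."""
    column_of = {name: j for j, name in enumerate(key_names)}
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for keys, values in rows:
        # Sort by column so each row's indices are in ascending order.
        # Keys that are not in key_names are skipped.
        entries = sorted(
            (column_of[k], v) for k, v in zip(keys, values) if k in column_of
        )
        for col, val in entries:
            indices.append(col)
            data.append(val)
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(data))
    return {
        "shape": (len(rows), len(key_names)),
        "data": data,
        "indices": indices,
        "indptr": indptr,
    }


# Two candidates, three made-up feature keys.
rows = [(["f_b", "f_a"], [2.0, 1.0]), (["f_c"], [3.0])]
matrix = build_csr(rows, ["f_a", "f_b", "f_c"])
print(matrix)
```

With SciPy available, the same arrays could be passed to csr_matrix((data, indices, indptr), shape=shape) to obtain an equivalent sparse matrix.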
get_keys()

Return a list of keys for the Features.

Returns: List of FeatureKeys.
Return type: list
update(docs=None, split=0, parallelism=None, progress_bar=True)

Update the features of the specified candidates.

Parameters:
  • docs – If provided, apply features to all the candidates in these documents.
  • split (int) – If docs is None, apply features to the candidates in this particular split.
  • parallelism (int) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
upsert_keys(keys, candidate_classes=None)

Upsert the specified keys to FeatureKey.

Parameters:
  • keys (list | tuple) – A list of FeatureKey names to upsert.
  • candidate_classes (list | tuple) – A list of the Candidate classes to upsert the keys for. If None, upserts the keys for all candidate classes associated with this Featurizer.
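One way to picture how drop_keys and upsert_keys treat the candidate_classes argument is a small in-memory model that maps each key name to the set of candidate classes it applies to. The sketch below is such a model and not Fonduer's implementation: KeyTable is hypothetical, the class names PartTemp and PartVolt are made up, and removing a key once no candidate class remains is an assumption.

```python
from typing import Dict, Iterable, Optional, Set


class KeyTable:
    """Hypothetical in-memory model of FeatureKey rows (name -> candidate classes)."""

    def __init__(self, featurizer_classes: Iterable[str]):
        self.featurizer_classes: Set[str] = set(featurizer_classes)
        self.rows: Dict[str, Set[str]] = {}

    def upsert_keys(self, keys: Iterable[str],
                    candidate_classes: Optional[Iterable[str]] = None) -> None:
        classes = self._resolve(candidate_classes)
        for name in keys:
            # Insert the key if it is new, otherwise extend its class set.
            self.rows.setdefault(name, set()).update(classes)

    def drop_keys(self, keys: Iterable[str],
                  candidate_classes: Optional[Iterable[str]] = None) -> None:
        classes = self._resolve(candidate_classes)
        for name in keys:
            if name not in self.rows:
                continue
            self.rows[name] -= classes
            # Assumption: a key with no remaining classes is removed entirely.
            if not self.rows[name]:
                del self.rows[name]

    def _resolve(self, candidate_classes: Optional[Iterable[str]]) -> Set[str]:
        # None means "all candidate classes associated with this Featurizer".
        if candidate_classes is None:
            return set(self.featurizer_classes)
        return set(candidate_classes)


# Made-up candidate class names, for illustration only.
table = KeyTable(["PartTemp", "PartVolt"])
table.upsert_keys(["my_feature"])
table.drop_keys(["my_feature"], candidate_classes=["PartVolt"])
print(table.rows)
```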

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The different featurization parameters are explained in this section:

featurization:
  # settings of content-based features
  content:
    # settings for window features
    window_feature:
      size: 3
      combinations: True
      isolated: True
    # settings for the word window used to extract features from surrounding words
    word_feature:
      window: 7
  # settings of table-based features
  table:
    # unary feature settings
    unary_features:
      # type of attributes
      attrib:
        - words
      # maximum n-gram length for features extracted from cells
      get_cell_ngrams:
        max: 2
      # maximum n-gram length for features extracted from headers
      get_head_ngrams:
        max: 2
      # maximum n-gram length for features extracted from rows
      get_row_ngrams:
        max: 2
      # maximum n-gram length for features extracted from columns
      get_col_ngrams:
        max: 2
    # binary feature settings
    binary_features:
      # minimal difference in rows to check
      min_row_diff:
        absolute: False
      # minimal difference in cols to check
      min_col_diff:
        absolute: False
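
As a rough illustration of what an n-gram limit such as get_cell_ngrams.max controls, the sketch below enumerates every n-gram of length 1 up to a maximum from a list of tokens. The helper ngrams_up_to is hypothetical and is not Fonduer's extraction code; it assumes that max bounds the n-gram length.

```python
from typing import Iterator, List


def ngrams_up_to(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every n-gram of length 1..max_n, joined with spaces."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


# Made-up cell content, for illustration only.
cell_tokens = ["storage", "temperature", "range"]
print(list(ngrams_up_to(cell_tokens, max_n=2)))
```

Raising the limit adds longer phrases as candidate features, so the number of feature keys grows with it.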