Multimodal Featurization

The third stage of Fonduer’s pipeline is to featurize each Candidate with multimodal features.

Feature Model Classes

The following describes the Feature element.

Fonduer’s feature model module.

class fonduer.features.models.Feature(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.orm.decl_api.Base

An element of a representation of a Candidate in a feature space.

A Feature’s annotation key identifies the definition of the Feature, e.g., a function that implements it or the library name and feature name in an automatic featurization library.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

candidate

The Candidate being annotated.

candidate_id

Id of the Candidate being annotated.

keys

List of strings of each Key name.

values: sqlalchemy.sql.schema.Column

A list of floating point values for each Key.
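
Because keys and values are parallel lists, the features of one candidate can be read as a name-to-value mapping. The sketch below is only an illustration: `FeatureRow` is a hypothetical stand-in for a loaded Feature row (it is not a Fonduer class) so that the code runs without a database, and the key names are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureRow:
    """Hypothetical stand-in for a Feature row; not a Fonduer class."""

    candidate_id: int
    keys: List[str]
    values: List[float]


def as_mapping(row: FeatureRow) -> Dict[str, float]:
    """Pair each key name with the value stored at the same position."""
    if len(row.keys) != len(row.values):
        raise ValueError("keys and values must have the same length")
    return dict(zip(row.keys, row.values))


# Made-up key names, for illustration only.
row = FeatureRow(candidate_id=1, keys=["LEMMA_volt", "SAME_ROW"], values=[1.0, 1.0])
feature_map = as_mapping(row)
```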

class fonduer.features.models.FeatureKey(**kwargs)

Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.orm.decl_api.Base

A feature’s key that identifies the definition of the Feature.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

candidate_classes

List of the names of the candidate classes that this Key is associated with.

name

Name of the Key.

Core Objects

These are Fonduer’s core objects used for featurization.

Fonduer’s features module.

class fonduer.features.FeatureExtractor(features=['textual', 'structural', 'tabular', 'visual'], customize_feature_funcs=[])

Bases: object

A class to extract features from candidates.

Parameters
  • features (List[str]) – a list of the Fonduer feature types to extract, defaults to ["textual", "structural", "tabular", "visual"]

  • customize_feature_funcs (Union[Callable[[List[Candidate]], Iterator[Tuple[int, str, int]]], List[Callable[[List[Candidate]], Iterator[Tuple[int, str, int]]]]]) – a customized feature extractor, or a list of them, where each extractor takes a list of candidates as input and yields tuples of (candidate_id, feature, value), defaults to []

Initialize FeatureExtractor.

extract(candidates)

Extract features from candidates.

Parameters

candidates (Union[List[Candidate], Candidate]) – A list of candidates to extract features from

Return type

Iterator[Tuple[int, str, int]]
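
A customized feature function only has to match the shape described for customize_feature_funcs: it takes a list of candidates and yields (candidate_id, feature, value) tuples. The following is a minimal sketch of that contract. `FakeCandidate`, its `text` field, `has_digit_feature`, and the feature name `CUSTOM_HAS_DIGIT` are all hypothetical and exist only so that the example runs without Fonduer or a database.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class FakeCandidate:
    """Hypothetical stand-in for a Candidate; only `id` and `text` are modeled."""

    id: int
    text: str


def has_digit_feature(
    candidates: List[FakeCandidate],
) -> Iterator[Tuple[int, str, int]]:
    """Yield (candidate_id, feature, value) for candidates whose text has a digit."""
    for cand in candidates:
        if any(ch.isdigit() for ch in cand.text):
            yield cand.id, "CUSTOM_HAS_DIGIT", 1


cands = [FakeCandidate(1, "BC546"), FakeCandidate(2, "transistor")]
rows = list(has_digit_feature(cands))
```

With Fonduer installed, a function of this shape would be passed as FeatureExtractor(customize_feature_funcs=[has_digit_feature]), reading whatever attributes the real Candidate exposes instead of the hypothetical `text` field.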

class fonduer.features.Featurizer(session, candidate_classes, feature_extractors=<fonduer.features.feature_extractors.FeatureExtractor object>, parallelism=1)

Bases: fonduer.utils.udf.UDFRunner

An operator to add Feature Annotations to Candidates.

Parameters
  • session (Session) – The database session to use.

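  • candidate_classes (List[Candidate]) – A list of Candidate subclasses to featurize.

  • feature_extractors (FeatureExtractor) – The feature extractor to use. Defaults to a FeatureExtractor with its default settings.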

  • parallelism (int) – The number of processes to use in parallel. Default 1.

Initialize the Featurizer.

apply(docs=None, split=0, train=False, clear=True, parallelism=None, progress_bar=True)

Apply features to the specified candidates.

Parameters
  • docs (Optional[Collection[Document]]) – If provided, apply features to all the candidates in these documents.

  • split (int) – If docs is None, apply features to the candidates in this particular split.

  • train (bool) – Whether or not to update the global key set of features and the features of candidates.

  • clear (bool) – Whether or not to clear the features table before applying features.

  • parallelism (Optional[int]) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.

  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

Return type

None
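
The train flag matters when several splits are featurized in turn. The sketch below is one possible driver loop, not a prescribed workflow. It assumes split 0 is the training split, so only that split updates the key set. `featurize_splits` and `RecordingFeaturizer` are hypothetical names; the stub only records calls so that the example runs without a database, and in real use a Featurizer would be passed instead.

```python
from typing import Any, Dict, List


def featurize_splits(
    featurizer: Any, cands_by_split: Dict[int, List[list]]
) -> Dict[int, Any]:
    """Apply features split by split and collect the feature matrices.

    `featurizer` is expected to behave like fonduer.features.Featurizer.
    Assumption: split 0 is the training split, so only it updates the key set.
    """
    matrices = {}
    for split in sorted(cands_by_split):
        featurizer.apply(split=split, train=(split == 0))
        matrices[split] = featurizer.get_feature_matrices(cands_by_split[split])
    return matrices


class RecordingFeaturizer:
    """Hypothetical stub that records calls; stands in for a real Featurizer."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def apply(self, split: int = 0, train: bool = False) -> None:
        self.calls.append((split, train))

    def get_feature_matrices(self, cand_lists: List[list]) -> List[str]:
        # One placeholder per candidate class instead of a sparse matrix.
        return [f"matrix for {len(cands)} candidates" for cands in cand_lists]


stub = RecordingFeaturizer()
result = featurize_splits(stub, {0: [["c1", "c2"]], 1: [["c3"]]})
```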

clear(train=False, split=0)

Delete Features of each class from the database.

Parameters
  • train (bool) – Whether or not to clear the FeatureKeys.

  • split (int) – Which split of candidates to clear features from.

Return type

None

clear_all()

Delete all Features.

Return type

None

drop_keys(keys, candidate_classes=None)

Drop the specified keys from FeatureKeys.

Parameters
  • keys (Iterable[str]) – A list of FeatureKey names to delete.

  • candidate_classes (Union[Candidate, Iterable[Candidate], None]) – The candidate class or classes to drop the keys for. If None, drops the keys for all candidate classes associated with this Featurizer.

Return type

None

get_feature_matrices(cand_lists)

Load a sparse matrix of Features for each candidate class.

Parameters

cand_lists (List[List[Candidate]]) – The candidates to get features for.

Return type

List[csr_matrix]

Returns

A list of M×N sparse matrices, one per candidate class, where M is the number of candidates and N is the number of features.
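
To make the matrix layout concrete, the sketch below arranges (candidate_id, feature, value) tuples into the three CSR arrays of an M×N matrix using only the standard library. It illustrates the data layout and is not how Fonduer builds its matrices. The helper name `to_csr` is hypothetical, and dropping features whose key is not in the known key set is an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Tuple


def to_csr(
    rows: Iterable[Tuple[int, str, int]],
    candidate_ids: List[int],
    keys: List[str],
) -> Tuple[List[int], List[int], List[int]]:
    """Build CSR arrays (data, indices, indptr) for an M x N feature matrix.

    Row i corresponds to candidate_ids[i]; column j corresponds to keys[j].
    Features whose key is not in `keys` are ignored (sketch assumption).
    """
    col_of = {key: j for j, key in enumerate(keys)}
    per_candidate: Dict[int, Dict[int, int]] = {cid: {} for cid in candidate_ids}
    for cid, key, value in rows:
        if cid in per_candidate and key in col_of:
            per_candidate[cid][col_of[key]] = value

    data: List[int] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for cid in candidate_ids:
        for col in sorted(per_candidate[cid]):
            indices.append(col)
            data.append(per_candidate[cid][col])
        # indptr[i + 1] marks where row i ends in `data` and `indices`.
        indptr.append(len(data))
    return data, indices, indptr


rows = [(10, "A", 1), (10, "C", 1), (11, "B", 1), (11, "UNSEEN", 1)]
data, indices, indptr = to_csr(rows, candidate_ids=[10, 11], keys=["A", "B", "C"])
```

With SciPy available, the same three arrays could be passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(2, 3)).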

get_keys()

Return a list of keys for the Features.

Return type

List[FeatureKey]

Returns

List of FeatureKeys.

last_docs: Set[str]

The last set of documents that apply() was called on.

update(docs=None, split=0, parallelism=None, progress_bar=True)

Update the features of the specified candidates.

Parameters
  • docs (Optional[Collection[Document]]) – If provided, apply features to all the candidates in these documents.

  • split (int) – If docs is None, apply features to the candidates in this particular split.

  • parallelism (Optional[int]) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.

  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

Return type

None

upsert_keys(keys, candidate_classes=None)

Upsert the specified keys to FeatureKey.

Parameters
  • keys (Iterable[str]) – A list of FeatureKey names to upsert.

  • candidate_classes (Union[Candidate, Iterable[Candidate], None]) – The candidate class or classes to upsert the keys for. If None, upserts the keys for all candidate classes associated with this Featurizer.

Return type

None

Multimodal Features

Fonduer includes a basic multimodal feature library based on its rich data model. In addition, users can provide their own feature extractors to use with their applications.

Fonduer’s feature library module.

fonduer.features.feature_libs.extract_structural_features(candidates)

Extract structural features.

Parameters

candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from

Return type

Iterator[Tuple[int, str, int]]

fonduer.features.feature_libs.extract_tabular_features(candidates)

Extract tabular features.

Parameters

candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from

Return type

Iterator[Tuple[int, str, int]]

fonduer.features.feature_libs.extract_textual_features(candidates)

Extract textual features.

Parameters

candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from

Return type

Iterator[Tuple[int, str, int]]

fonduer.features.feature_libs.extract_visual_features(candidates)

Extract visual features.

Parameters

candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from

Return type

Iterator[Tuple[int, str, int]]

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The different featurization parameters are explained in this section:

featurization:
  # settings of textual-based features
  textual:
    # settings for window features
    window_feature:
      size: 3
      combinations: True
      isolated: True
    # settings for the word window used to extract features from surrounding words
    word_feature:
      window: 7
  # settings of tabular-based features
  tabular:
    # unary feature settings
    unary_features:
      # type of attributes
      attrib:
        - words
      # maximum n-gram length for features extracted from cells
      get_cell_ngrams:
        max: 2
      # maximum n-gram length for features extracted from headers
      get_head_ngrams:
        max: 2
      # maximum n-gram length for features extracted from rows
      get_row_ngrams:
        max: 2
      # maximum n-gram length for features extracted from columns
      get_col_ngrams:
        max: 2
    # multinary feature settings
    multinary_features:
      # minimal difference in rows to check
      min_row_diff:
        absolute: False
      # minimal difference in cols to check
      min_col_diff:
        absolute: False
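
As a rough illustration of what an n-gram limit such as max: 2 means, the sketch below lists every contiguous n-gram of a token sequence up to a given length. It is not Fonduer's implementation. The function name `ngrams_up_to` and the sample tokens are hypothetical.

```python
from typing import Iterator, List


def ngrams_up_to(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous n-gram of `tokens` for n = 1 .. max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start : start + n])


grams = list(ngrams_up_to(["collector", "emitter", "voltage"], max_n=2))
```

With max_n=2 the result contains the three single words followed by the two adjacent word pairs; raising the limit adds longer phrases and therefore more feature keys.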