Multimodal Featurization

The third stage of Fonduer’s pipeline is to featurize each Candidate with multimodal features.

Feature Model Classes

The following describes the Feature element.

class fonduer.features.models.Feature(**kwargs)[source]

Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.ext.declarative.api.Base

An element of a representation of a Candidate in a feature space.

A Feature’s annotation key identifies the definition of the Feature, e.g., a function that implements it or the library name and feature name in an automatic featurization library.

candidate

The Candidate.

candidate_id

The id of the Candidate being annotated.

keys

A list of the Key names, as strings.

values

A list of floating point values for each Key.
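To make the relationship between keys and values concrete, here is a minimal sketch that uses a stand-in class rather than the real SQLAlchemy model. `FeatureRow`, `as_dict`, and the feature names are hypothetical and introduced only for illustration; the sketch assumes that keys and values are parallel lists of equal length, as the attribute descriptions suggest.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRow:
    """Hypothetical stand-in for a Feature row (not the real ORM class)."""

    candidate_id: int
    keys: List[str] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def as_dict(self) -> Dict[str, float]:
        """Pair each key name with the value stored at the same position."""
        if len(self.keys) != len(self.values):
            raise ValueError("keys and values must have the same length")
        return dict(zip(self.keys, self.values))


# Made-up feature names, used only to show the pairing.
row = FeatureRow(candidate_id=1, keys=["WORD_volts", "SAME_ROW"], values=[1.0, 1.0])
features = row.as_dict()
```

Storing the names and values as two aligned lists keeps one row per Candidate instead of one row per (Candidate, feature) pair.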

class fonduer.features.models.FeatureKey(**kwargs)[source]

Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.ext.declarative.api.Base

A feature’s key that identifies the definition of the Feature.

candidate_classes

The names of the Candidate classes associated with the Key.

name

The name of the Key.

Core Objects

These are Fonduer’s core objects used for featurization.

class fonduer.features.Featurizer(session: sqlalchemy.orm.session.Session, candidate_classes: List[fonduer.candidates.models.candidate.Candidate], feature_extractors: fonduer.features.feature_extractors.FeatureExtractor = <fonduer.features.feature_extractors.FeatureExtractor object>, parallelism: int = 1)[source]

Bases: fonduer.utils.udf.UDFRunner

An operator to add Feature Annotations to Candidates.

Parameters:
  • session – The database session to use.
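  • candidate_classes (list) – A list of Candidate subclasses to featurize.
  • feature_extractors (FeatureExtractor) – The FeatureExtractor used to generate features. Defaults to a FeatureExtractor instance.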
  • parallelism (int) – The number of processes to use in parallel. Default 1.
apply(docs: Optional[Collection[fonduer.parser.models.document.Document]] = None, split: int = 0, train: bool = False, clear: bool = True, parallelism: Optional[int] = None, progress_bar: bool = True) → None[source]

Apply features to the specified candidates.

Parameters:
  • docs – If provided, apply features to all the candidates in these documents.
  • split (int) – If docs is None, apply features to the candidates in this particular split.
  • train (bool) – Whether or not to update the global key set of features and the features of candidates.
  • clear (bool) – Whether or not to clear the features table before applying features.
  • parallelism (int) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
clear(train: bool = False, split: int = 0) → None[source]

Delete Features of each class from the database.

Parameters:
  • train (bool) – Whether or not to clear the FeatureKeys.
  • split (int) – Which split of candidates to clear features from.
clear_all() → None[source]

Delete all Features.

drop_keys(keys: Iterable[str], candidate_classes: Optional[Iterable[fonduer.candidates.models.candidate.Candidate]] = None) → None[source]

Drop the specified keys from FeatureKeys.

Parameters:
  • keys (list, tuple) – A list of FeatureKey names to delete.
  • candidate_classes (list, tuple) – A list of the Candidates to drop the key for. If None, drops the keys for all candidate classes associated with this Featurizer.
get_feature_matrices(cand_lists: List[List[fonduer.candidates.models.candidate.Candidate]]) → List[csr_matrix][source]

Load a sparse matrix of Features for each candidate class.

Parameters: cand_lists (list of lists of candidates) – The candidates to get features for.
Returns: A list of M×N sparse matrices, where M is the number of candidates and N is the number of features.
Return type: list[csr_matrix]
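The shape of the returned matrices can be illustrated with a small, dependency-free sketch that turns (candidate_id, key, value) triples into the three arrays of a CSR matrix. The function `to_csr` and the sample data are hypothetical; skipping triples whose key is not in the known key list is an assumption about how unseen keys are handled, not a statement about Fonduer's implementation.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[int, str, float]  # (candidate_id, feature key, value)


def to_csr(
    candidate_ids: Sequence[int],
    key_names: Sequence[str],
    triples: Sequence[Triple],
) -> Tuple[List[float], List[int], List[int]]:
    """Build CSR arrays (data, indices, indptr) for an M x N feature matrix.

    Rows follow candidate_ids and columns follow key_names. Triples whose
    candidate or key is unknown are skipped (an assumption of this sketch).
    """
    col_of = {name: j for j, name in enumerate(key_names)}
    per_row: Dict[int, Dict[int, float]] = {cid: {} for cid in candidate_ids}
    for cid, key, value in triples:
        if cid in per_row and key in col_of:
            per_row[cid][col_of[key]] = value

    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for cid in candidate_ids:
        # Emit the non-zero entries of this row in column order.
        for col in sorted(per_row[cid]):
            indices.append(col)
            data.append(per_row[cid][col])
        indptr.append(len(data))
    return data, indices, indptr


data, indices, indptr = to_csr(
    candidate_ids=[10, 11],
    key_names=["f_a", "f_b", "f_c"],
    triples=[(10, "f_a", 1.0), (10, "f_c", 1.0), (11, "f_b", 1.0), (11, "f_x", 1.0)],
)
```

If SciPy is available, the three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(2, 3))` to obtain the same kind of object that `get_feature_matrices` returns.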
get_keys() → List[fonduer.features.models.feature.FeatureKey][source]

Return a list of keys for the Features.

Returns: List of FeatureKeys.
Return type: list
update(docs: Optional[Collection[fonduer.parser.models.document.Document]] = None, split: int = 0, parallelism: Optional[int] = None, progress_bar: bool = True) → None[source]

Update the features of the specified candidates.

Parameters:
  • docs – If provided, update the features of all the candidates in these documents.
  • split (int) – If docs is None, update the features of the candidates in this particular split.
  • parallelism (int) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
upsert_keys(keys: Iterable[str], candidate_classes: Optional[Iterable[fonduer.candidates.models.candidate.Candidate]] = None) → None[source]

Upsert the specified keys to FeatureKey.

Parameters:
  • keys (list | tuple) – A list of FeatureKey names to upsert.
  • candidate_classes (list | tuple) – A list of the Candidates to upsert the key for. If None, upsert the keys for all candidate classes associated with this Featurizer.
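A typical way to combine `apply` and `get_feature_matrices` might look like the sketch below. It is written against any object that exposes those two methods with the signatures documented above, so it can be exercised without a database. The helper `featurize_splits`, the stand-in `RecordingFeaturizer`, and the convention that split 0 is the training set and split 1 a held-out set are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple


def featurize_splits(
    featurizer: Any,
    train_cands: List[list],
    dev_cands: List[list],
    parallelism: int = 1,
) -> Tuple[Any, Any]:
    """Featurize a training split and a held-out split, then load matrices."""
    # Split 0: extract features and update the global feature key set.
    featurizer.apply(split=0, train=True, parallelism=parallelism)
    F_train = featurizer.get_feature_matrices(train_cands)

    # Split 1: extract features without updating the key set.
    featurizer.apply(split=1, train=False, parallelism=parallelism)
    F_dev = featurizer.get_feature_matrices(dev_cands)
    return F_train, F_dev


class RecordingFeaturizer:
    """Hypothetical stand-in that records calls instead of using a database."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def apply(self, **kwargs: Any) -> None:
        self.calls.append(("apply", kwargs))

    def get_feature_matrices(self, cand_lists: List[list]) -> List[str]:
        self.calls.append(("get_feature_matrices", {"n_lists": len(cand_lists)}))
        return [f"matrix_{i}" for i in range(len(cand_lists))]


# Dry run: one (empty) candidate list per candidate class.
fake = RecordingFeaturizer()
F_train, F_dev = featurize_splits(fake, train_cands=[[]], dev_cands=[[]], parallelism=2)
```

With a real installation, the stand-in would be replaced by `Featurizer(session, [MyCandidate])`, where `session` is a database session and `MyCandidate` is a placeholder name for a candidate subclass defined earlier in the pipeline.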
class fonduer.features.FeatureExtractor(features: List[str] = ['textual', 'structural', 'tabular', 'visual'], customize_feature_funcs: List[Callable[List[fonduer.candidates.models.candidate.Candidate], Iterator[Tuple[int, str, int]]]] = [])[source]

Bases: object

A class to extract features from candidates.

Parameters:
  • features (list, optional) – A list of the Fonduer feature types to extract. Defaults to ["textual", "structural", "tabular", "visual"].
  • customize_feature_funcs (list, optional) – A list of custom feature extractors, where each extractor takes a list of candidates as input and yields tuples of (candidate_id, feature, value). Defaults to [].
extract(candidates: List[fonduer.candidates.models.candidate.Candidate]) → Iterator[Tuple[int, str, int]][source]

Extract features from candidates.

Parameters: candidates (list) – A list of candidates to extract features from.
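A custom feature function, as described for `customize_feature_funcs`, is a generator that takes a list of candidates and yields (candidate_id, feature, value) tuples. The sketch below follows that contract with stand-in objects. The function name, the feature names, and the `text` attribute are hypothetical; real Candidate objects expose their content differently, and only the `id` attribute and the tuple shape are taken from the description above.

```python
from types import SimpleNamespace
from typing import Any, Iterator, List, Tuple


def extract_length_features(candidates: List[Any]) -> Iterator[Tuple[int, str, int]]:
    """Yield one binary feature per candidate based on its word count."""
    for cand in candidates:
        n_words = len(cand.text.split())  # `text` is a made-up attribute
        bucket = "SHORT" if n_words <= 3 else "LONG"
        yield cand.id, f"CUSTOM_LENGTH_{bucket}", 1


# Stand-in candidates so the sketch runs without a database.
demo_candidates = [
    SimpleNamespace(id=1, text="200 mA"),
    SimpleNamespace(id=2, text="collector emitter voltage max rating"),
]
triples = list(extract_length_features(demo_candidates))
```

Going by the signatures above, such a function would be registered with `FeatureExtractor(customize_feature_funcs=[extract_length_features])`, and the resulting extractor passed to the Featurizer through its `feature_extractors` argument.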

Multimodal features

Fonduer includes a basic multimodal feature library based on its rich data model. In addition, users can provide their own feature extractors to use with their applications.

fonduer.features.feature_libs.extract_textual_features(candidates: List[fonduer.candidates.models.candidate.Candidate]) → Iterator[Tuple[int, str, int]][source]

Extract textual features.

Parameters: candidates (list) – A list of candidates to extract features from.
fonduer.features.feature_libs.extract_structural_features(candidates: List[fonduer.candidates.models.candidate.Candidate]) → Iterator[Tuple[int, str, int]][source]

Extract structural features.

Parameters: candidates (list) – A list of candidates to extract features from.
fonduer.features.feature_libs.extract_tabular_features(candidates: List[fonduer.candidates.models.candidate.Candidate]) → Iterator[Tuple[int, str, int]][source]

Extract tabular features.

Parameters: candidates (list) – A list of candidates to extract features from.
fonduer.features.feature_libs.extract_visual_features(candidates: List[fonduer.candidates.models.candidate.Candidate]) → Iterator[Tuple[int, str, int]][source]

Extract visual features.

Parameters: candidates (list) – A list of candidates to extract features from.

Configuration Settings

Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.

The different featurization parameters are explained in this section:

featurization:
  # settings of textual-based features
  textual:
    # settings for window features
    window_feature:
      size: 3
      combinations: True
      isolated: True
    # settings for the word window used to extract features from surrounding words
    word_feature:
      window: 7
  # settings of tabular-based features
  tabular:
    # unary feature settings
    unary_features:
      # type of attributes
      attrib:
        - words
      # maximum n-gram length for features extracted from cells
      get_cell_ngrams:
        max: 2
      # maximum n-gram length for features extracted from headers
      get_head_ngrams:
        max: 2
      # maximum n-gram length for features extracted from rows
      get_row_ngrams:
        max: 2
      # maximum n-gram length for features extracted from columns
      get_col_ngrams:
        max: 2
    # binary feature settings
    binary_features:
      # minimal difference in rows to check
      min_row_diff:
        absolute: False
      # minimal difference in cols to check
      min_col_diff:
        absolute: False
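As a rough illustration of what the `max: 2` settings for the n-gram options control, the sketch below enumerates every n-gram of length 1 up to a maximum from a list of tokens. It assumes that `max` is an upper bound on n-gram length; the function `ngrams_up_to` and the sample tokens are hypothetical and not part of Fonduer.

```python
from typing import Iterator, List


def ngrams_up_to(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield all n-grams of length 1..max_n, shortest first."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])


grams = list(ngrams_up_to(["max", "collector", "current"], max_n=2))
```

Under this reading, raising `max` increases the number of distinct feature keys, since each additional n-gram length adds new strings that can become features.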