Multimodal Featurization
The third stage of Fonduer’s pipeline is to featurize each Candidate with multimodal features.
Feature Model Classes
The following describes the Feature element.
Fonduer’s feature model module.
class fonduer.features.models.Feature(**kwargs)
Bases: fonduer.utils.models.annotation.AnnotationMixin, sqlalchemy.orm.decl_api.Base
An element of a representation of a Candidate in a feature space.
A Feature’s annotation key identifies the definition of the Feature, e.g., a function that implements it or the library name and feature name in an automatic featurization library.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- candidate – Candidate.
- candidate_id – Id of the Candidate being annotated.
- keys – List of strings of each Key name.
- values (sqlalchemy.sql.schema.Column) – A list of floating point values for each Key.
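The keys and values attributes above are parallel lists: the i-th value belongs to the i-th key. The sketch below illustrates that pairing. FeatureRow is a hypothetical stand-in written for this example, not Fonduer's Feature class, and the key names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRow:
    """Hypothetical stand-in for a Feature row (not Fonduer's ORM class)."""

    candidate_id: int
    keys: List[str] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def as_dict(self) -> Dict[str, float]:
        # keys[i] names the feature whose value is values[i].
        if len(self.keys) != len(self.values):
            raise ValueError("keys and values must have the same length")
        return dict(zip(self.keys, self.values))


row = FeatureRow(
    candidate_id=1,
    keys=["WORD_SEQ_[voltage]", "SAME_TABLE"],  # invented key names
    values=[1.0, 1.0],
)
print(row.as_dict())
```

Storing only the keys that are present for a candidate keeps each row short even when the global key set is large.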
class fonduer.features.models.FeatureKey(**kwargs)
Bases: fonduer.utils.models.annotation.AnnotationKeyMixin, sqlalchemy.orm.decl_api.Base
A feature’s key that identifies the definition of the Feature.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- candidate_classes – List of the names of the candidate classes this Key is associated with.
- name – Name of the Key.
Core Objects
These are Fonduer’s core objects used for featurization.
Fonduer’s features module.
class fonduer.features.FeatureExtractor(features=['textual', 'structural', 'tabular', 'visual'], customize_feature_funcs=[])
Bases: object
A class to extract features from candidates.
- Parameters
  - features (List[str]) – a list of which Fonduer feature types to extract, defaults to [“textual”, “structural”, “tabular”, “visual”]
  - customize_feature_funcs (Union[Callable[[List[Candidate]], Iterator[Tuple[int, str, int]]], List[Callable[[List[Candidate]], Iterator[Tuple[int, str, int]]]]]) – a list of customized feature extractors, where each extractor takes a list of candidates as input and yields tuples of (candidate_id, feature, value), defaults to []
Initialize FeatureExtractor.
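A custom extractor, as described by the customize_feature_funcs parameter, is a callable that takes a list of candidates and yields (candidate_id, feature, value) tuples. The following is a minimal sketch of that shape. The function name, the feature string, and the id and mentions attributes on the stand-in objects are assumptions made for illustration; real Candidate objects may expose different attributes.

```python
from types import SimpleNamespace
from typing import Iterator, List, Tuple


def mention_count_feature(candidates: List) -> Iterator[Tuple[int, str, int]]:
    """Yield one (candidate_id, feature, value) tuple per candidate."""
    for cand in candidates:
        # `id` and `mentions` are attributes of the stand-in objects below;
        # they are assumptions about what a real Candidate exposes.
        yield cand.id, f"NUM_MENTIONS_{len(cand.mentions)}", 1


# Stand-in candidates so the sketch runs without a database.
fake_candidates = [
    SimpleNamespace(id=1, mentions=["part", "temp"]),
    SimpleNamespace(id=2, mentions=["part"]),
]
features = list(mention_count_feature(fake_candidates))
print(features)

# With Fonduer installed, the function would be passed as
# FeatureExtractor(customize_feature_funcs=[mention_count_feature]).
```

Writing the extractor as a generator lets it emit any number of features per candidate without building an intermediate list.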
class fonduer.features.Featurizer(session, candidate_classes, feature_extractors=<fonduer.features.feature_extractors.FeatureExtractor object>, parallelism=1)
Bases: fonduer.utils.udf.UDFRunner
An operator to add Feature Annotations to Candidates.
- Parameters
  - session (Session) – The database session to use.
  - candidate_classes (List[Candidate]) – A list of Candidate subclasses to featurize.
  - parallelism (int) – The number of processes to use in parallel. Default 1.
Initialize the Featurizer.
- apply(docs=None, split=0, train=False, clear=True, parallelism=None, progress_bar=True)
  Apply features to the specified candidates.
  - Parameters
    - docs (Optional[Collection[Document]]) – If provided, apply features to all the candidates in these documents.
    - split (int) – If docs is None, apply features to the candidates in this particular split.
    - train (bool) – Whether or not to update the global key set of features and the features of candidates.
    - clear (bool) – Whether or not to clear the features table before applying features.
    - parallelism (Optional[int]) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
    - progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
  - Return type
    None
- clear(train=False, split=0)
  Delete Features of each class from the database.
  - Parameters
    - train (bool) – Whether or not to clear the FeatureKeys.
    - split (int) – Which split of candidates to clear features from.
  - Return type
    None
- drop_keys(keys, candidate_classes=None)
  Drop the specified keys from FeatureKeys.
  - Parameters
    - keys (Iterable[str]) – A list of FeatureKey names to delete.
    - candidate_classes (Union[Candidate, Iterable[Candidate], None]) – A list of the Candidates to drop the key for. If None, drops the keys for all candidate classes associated with this Featurizer.
  - Return type
    None
- get_feature_matrices(cand_lists)
  Load a sparse matrix of Features for each candidate class.
  - Parameters
    - cand_lists (List[List[Candidate]]) – The candidates to get features for.
  - Return type
    List[csr_matrix]
  - Returns
    A list of M×N sparse matrices, where M is the number of candidates and N is the number of features.
- get_keys()
  Return a list of keys for the Features.
  - Return type
    List[FeatureKey]
  - Returns
    List of FeatureKeys.
- last_docs: Set[str]
  The last set of documents that apply() was called on.
- update(docs=None, split=0, parallelism=None, progress_bar=True)
  Update the features of the specified candidates.
  - Parameters
    - docs (Optional[Collection[Document]]) – If provided, apply features to all the candidates in these documents.
    - split (int) – If docs is None, apply features to the candidates in this particular split.
    - parallelism (Optional[int]) – How many processes to use for extraction. If provided, this overrides the parallelism value used to initialize the Featurizer.
    - progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
  - Return type
    None
- upsert_keys(keys, candidate_classes=None)
  Upsert the specified keys to FeatureKey.
  - Parameters
    - keys (Iterable[str]) – A list of FeatureKey names to upsert.
    - candidate_classes (Union[Candidate, Iterable[Candidate], None]) – A list of the Candidates to upsert the key for. If None, upserts the keys for all candidate classes associated with this Featurizer.
  - Return type
    None
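To illustrate the M×N layout that get_feature_matrices describes, the sketch below assembles a matrix from per-candidate key and value lists, with one column per key name. It is a pure-Python, dense stand-in for the csr_matrix that Fonduer returns, not Fonduer's implementation. The helper name build_feature_matrix is hypothetical, and skipping keys that are missing from the key list is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def build_feature_matrix(
    rows: List[Tuple[List[str], List[float]]], key_names: List[str]
) -> List[List[float]]:
    """Return a dense M x N matrix: one row per candidate, one column per key."""
    column_of: Dict[str, int] = {name: j for j, name in enumerate(key_names)}
    matrix = [[0.0] * len(key_names) for _ in rows]
    for i, (keys, values) in enumerate(rows):
        for key, value in zip(keys, values):
            j = column_of.get(key)
            if j is not None:  # assumption: keys outside the key set are skipped
                matrix[i][j] = value
    return matrix


key_names = ["A", "B", "C"]  # stands in for the names returned by get_keys()
rows = [(["A", "C"], [1.0, 1.0]), (["B", "Z"], [1.0, 1.0])]
matrix = build_feature_matrix(rows, key_names)
print(matrix)
```

Because every matrix shares the same key list, column j refers to the same feature in every row, which is what makes the rows comparable as input to a model.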
Multimodal features
Fonduer includes a basic multimodal feature library based on its rich data model. In addition, users can provide their own feature extractors to use with their applications.
Fonduer’s feature library module.
fonduer.features.feature_libs.extract_structural_features(candidates)
Extract structural features.
- Parameters
  - candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from.
- Return type
  Iterator[Tuple[int, str, int]]
fonduer.features.feature_libs.extract_tabular_features(candidates)
Extract tabular features.
- Parameters
  - candidates (Union[Candidate, List[Candidate]]) – A list of candidates to extract features from.
- Return type
  Iterator[Tuple[int, str, int]]
Configuration Settings
Visit the Configuring Fonduer page to see how to provide configuration parameters to Fonduer via .fonduer-config.yaml.
The different featurization parameters are explained in this section:
featurization:
  # settings of textual-based features
  textual:
    # settings for window features
    window_feature:
      size: 3
      combinations: True
      isolated: True
    # settings for the word window used to extract features from surrounding words
    word_feature:
      window: 7
  # settings of tabular-based features
  tabular:
    # unary feature settings
    unary_features:
      # type of attributes
      attrib:
        - words
      # maximum n-gram size for features extracted from cells
      get_cell_ngrams:
        max: 2
      # maximum n-gram size for features extracted from headers
      get_head_ngrams:
        max: 2
      # maximum n-gram size for features extracted from rows
      get_row_ngrams:
        max: 2
      # maximum n-gram size for features extracted from columns
      get_col_ngrams:
        max: 2
    # multinary feature settings
    multinary_features:
      # minimum row difference to check
      min_row_diff:
        absolute: False
      # minimum column difference to check
      min_col_diff:
        absolute: False
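As a rough illustration of what the max setting of the get_*_ngrams options controls, the sketch below enumerates all n-grams up to a maximum length. It assumes that max is the largest n-gram length. The ngrams helper is hypothetical, and Fonduer's own extraction may differ in tokenization, ordering, and casing.

```python
from typing import Iterator, List


def ngrams(tokens: List[str], n_max: int = 2) -> Iterator[str]:
    """Yield every contiguous n-gram of length 1..n_max, shortest first."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start : start + n])


print(list(ngrams(["storage", "temperature", "range"], n_max=2)))
```

Under this assumption, raising max adds longer phrases as features, which grows the key set quickly.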