Candidate Extraction

The second stage of Fonduer’s pipeline is to extract Mentions and Candidates from the data model.

Candidate Model Classes

The following describes the elements used for Mention and Candidate extraction.

class fonduer.candidates.models.Candidate(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

An abstract candidate relation.

New relation types should be defined by calling candidate_subclass(), not subclassing this class directly.

get_mentions() → Tuple[fonduer.candidates.models.mention.Mention, ...][source]

Return a tuple of the constituent Mentions making up this Candidate.

Return type:tuple
id

The unique id for the Candidate.

split

Which split the Candidate belongs to. Used to organize train/dev/test.

type

The type for the Candidate, which corresponds to the names the user gives to the candidate_subclasses.

class fonduer.candidates.models.CaptionMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.caption_mention.TemporaryCaptionMention

A caption Mention.

caption

The parent Caption.

caption_id

The id of the parent Caption.

get_stable_id() → str

Return a stable id for the CaptionMention.

id

The unique id of the CaptionMention.

class fonduer.candidates.models.CellMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.cell_mention.TemporaryCellMention

A cell Mention.

cell

The parent Cell.

cell_id

The id of the parent Cell.

get_stable_id() → str

Return a stable id for the CellMention.

id

The unique id of the CellMention.

class fonduer.candidates.models.DocumentMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.document_mention.TemporaryDocumentMention

A document Mention.

document

The parent Document.

document_id

The id of the parent Document.

get_stable_id() → str

Return a stable id for the DocumentMention.

id

The unique id of the DocumentMention.

class fonduer.candidates.models.FigureMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.figure_mention.TemporaryFigureMention

A figure Mention.

figure

The parent Figure.

figure_id

The id of the parent Figure.

get_stable_id() → str

Return a stable id for the FigureMention.

id

The unique id of the FigureMention.

class fonduer.candidates.models.ImplicitSpanMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.implicit_span_mention.TemporaryImplicitSpanMention

A span of characters that may not appear verbatim in the source text.

It is identified by Context id, character-index start and end (inclusive), as well as a key representing what ‘expander’ function drew the ImplicitSpanMention from an existing SpanMention, and a position (where position=0 corresponds to the first ImplicitSpanMention produced from the expander function).

The character-index start and end point to the segment of text that was expanded to produce the ImplicitSpanMention.

char_end

The ending character-index of the ImplicitSpanMention (inclusive).

char_start

The starting character-index of the ImplicitSpanMention.

dep_labels

A list of the dependency labels for each word in the ImplicitSpanMention.

dep_parents

A list of the dependency parents for each word in the ImplicitSpanMention.

get_attrib_span(a: str, sep: str = ' ') → str

Get the span of sentence attribute a.

Intuitively, like calling:

sep.join(implicit_span.a)
Parameters:
  • a (str) – The attribute to get a span for.
  • sep (str) – The separator to use for the join.
Returns:

The joined tokens, or text if a="words".

Return type:

str
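A minimal sketch of the join semantics described above. It does not use Fonduer itself: implicit_span is a hypothetical stand-in object with hand-written attribute lists, and attrib_span is an illustrative helper, not the library's implementation. It also skips the special case where a="words" returns the raw text.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a span; a real mention is backed by the database.
implicit_span = SimpleNamespace(
    words=["BC546", "transistor"],
    pos_tags=["NNP", "NN"],
)


def attrib_span(span, a, sep=" "):
    """Join the per-word values of attribute `a` with `sep`."""
    return sep.join(getattr(span, a))


words_joined = attrib_span(implicit_span, "words")
tags_joined = attrib_span(implicit_span, "pos_tags", sep="|")
```

Here words_joined holds the words separated by single spaces and tags_joined holds the POS tags separated by "|".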

get_attrib_tokens(a: str = 'words')

Get the tokens of sentence attribute a.

Intuitively, like calling:

implicit_span.a
Parameters:a (str) – The attribute to get tokens for.
Returns:The tokens of sentence attribute defined by a for the span.
Return type:list
get_num_words() → int

Get the number of words in the span.

Returns:The number of words in the span (n of the ngrams).
Return type:int
get_span() → str

Return the text of the Span.

Returns:The text of the Span.
Return type:str
get_stable_id() → str

Return a stable id.

Return type:string
get_word_end_index() → int

Get the index of the ending word of the span.

Returns:The word-index of the last word of the span.
Return type:int
get_word_start_index() → int

Get the index of the starting word of the span.

Returns:The word-index of the start of the span.
Return type:int
id

The unique id of the ImplicitSpanMention.

lemmas

A list of the lemmas for each word in the ImplicitSpanMention.

meta

Pickled metadata about the ImplicitSpanMention.

ner_tags

A list of the NER tags for each word in the ImplicitSpanMention.

page

A list of the page numbers for each word in the ImplicitSpanMention.

pos_tags

A list of the POS tags for each word in the ImplicitSpanMention.

position

The position of the ImplicitSpanMention where position=0 is the first ImplicitSpanMention produced by the expander.

sentence

The parent Sentence.

sentence_id

The id of the parent Sentence.

text

The raw text of the ImplicitSpanMention.

words

A list of the words in the ImplicitSpanMention.

class fonduer.candidates.models.Mention(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

An abstract Mention.

New mention types should be defined by calling mention_subclass(), not subclassing this class directly.

get_contexts() → Tuple[fonduer.parser.models.context.Context, ...][source]

Get the constituent context making up this mention.

id

The unique id of the Mention.

type

The type for the Mention, which corresponds to the names the user gives to the mention_subclass.

class fonduer.candidates.models.ParagraphMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.paragraph_mention.TemporaryParagraphMention

A paragraph Mention.

get_stable_id() → str

Return a stable id for the ParagraphMention.

id

The unique id of the ParagraphMention.

paragraph

The parent Paragraph.

paragraph_id

The id of the parent Paragraph.

class fonduer.candidates.models.SectionMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.section_mention.TemporarySectionMention

A section Mention.

get_stable_id() → str

Return a stable id for the SectionMention.

id

The unique id of the SectionMention.

section

The parent Section.

section_id

The id of the parent Section.

class fonduer.candidates.models.SpanMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.span_mention.TemporarySpanMention

A span of chars, identified by Context ID and char-index start, end (inclusive).

Character offsets are relative to the start of the Context.

char_end

The ending character-index of the SpanMention (inclusive).

char_start

The starting character-index of the SpanMention.

get_attrib_span(a: str, sep: str = ' ') → str

Get the span of sentence attribute a.

Intuitively, like calling:

sep.join(span.a)
Parameters:
  • a (str) – The attribute to get a span for.
  • sep (str) – The separator to use for the join.
Returns:

The joined tokens, or text if a="words".

Return type:

str

get_attrib_tokens(a: str = 'words')

Get the tokens of sentence attribute a.

Intuitively, like calling:

span.a
Parameters:a (str) – The attribute to get tokens for.
Returns:The tokens of sentence attribute defined by a for the span.
Return type:list
get_num_words() → int

Get the number of words in the span.

Returns:The number of words in the span (n of the ngrams).
Return type:int
get_span() → str

Return the text of the Span.

Returns:The text of the Span.
Return type:str
get_stable_id() → str

Return a stable id.

Return type:string
get_word_end_index() → int

Get the index of the ending word of the span.

Returns:The word-index of the last word of the span.
Return type:int
get_word_start_index() → int

Get the index of the starting word of the span.

Returns:The word-index of the start of the span.
Return type:int
id

The unique id of the SpanMention.

meta

Pickled metadata about the SpanMention.

sentence

The parent Sentence.

sentence_id

The id of the parent Sentence.

class fonduer.candidates.models.TableMention(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.table_mention.TemporaryTableMention

A table Mention.

get_stable_id() → str

Return a stable id for the TableMention.

id

The unique id of the TableMention.

table

The parent Table.

table_id

The id of the parent Table.

fonduer.candidates.models.candidate_subclass(class_name: str, args: List[fonduer.candidates.models.mention.Mention], table_name: Optional[str] = None, cardinality: Optional[int] = None, values: Optional[List[Any]] = None) → Type[fonduer.candidates.models.candidate.Candidate][source]

Creates and returns a Candidate subclass with the provided argument names, which are of Context type. Creates the table in the database if it does not exist yet.

Import using:

from fonduer.candidates.models import candidate_subclass
Parameters:
  • class_name – The name of the class, should be “camel case” e.g. NewCandidate
  • args – A list of names of constituent arguments, which refer to the Contexts (representing mentions) that comprise the candidate
  • table_name – The name of the corresponding table in DB; if not provided, is converted from camel case by default, e.g. new_candidate
  • cardinality – The cardinality of the variable corresponding to the Candidate. By default is 2 i.e. is a binary value, e.g. is or is not a true mention.
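The default table name is described as the class name converted from camel case. A minimal sketch of such a conversion follows; camel_to_snake is a hypothetical helper written for illustration, and Fonduer's own conversion may differ in edge cases such as acronyms.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a CamelCase class name to a snake_case table name."""
    # Put an underscore before a capitalized word that follows any character.
    step = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Put an underscore between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step).lower()


table_name = camel_to_snake("NewCandidate")
```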
fonduer.candidates.models.mention_subclass(class_name: str, cardinality: Optional[int] = None, values: Optional[List[Any]] = None, table_name: Optional[str] = None) → Type[fonduer.candidates.models.mention.Mention][source]

Creates and returns a Mention subclass with the provided argument names, which are of Context type. Creates the table in the database if it does not exist yet.

Import using:

from fonduer.candidates.models import mention_subclass
Parameters:
  • class_name – The name of the class, should be “camel case” e.g. NewMention
  • table_name – The name of the corresponding table in DB; if not provided, is converted from camel case by default, e.g. new_mention
  • values – The values that the variable corresponding to the Mention can take. By default it will be [True, False].
  • cardinality – The cardinality of the variable corresponding to the Mention. By default is 2 i.e. is a binary value, e.g. is or is not a true mention.
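One plausible way the cardinality and values parameters could interact is sketched below. resolve_values is a hypothetical helper, not part of Fonduer. The defaults of 2 and [True, False] come from the parameter descriptions above; the behaviour when only one of the two is given, and the consistency check, are assumptions.

```python
from typing import Any, List, Optional, Tuple


def resolve_values(
    cardinality: Optional[int] = None,
    values: Optional[List[Any]] = None,
) -> Tuple[int, List[Any]]:
    """Return a consistent (cardinality, values) pair."""
    if cardinality is None and values is None:
        # Documented default: a binary variable.
        return 2, [True, False]
    if values is not None:
        # Assumption: an explicit cardinality must agree with the values.
        if cardinality is not None and cardinality != len(values):
            raise ValueError("cardinality must equal the number of values")
        return len(values), list(values)
    # Assumption: with only a cardinality, label the values 0..cardinality-1.
    return cardinality, list(range(cardinality))
```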

Core Objects

These are Fonduer’s core objects used for Mention and Candidate extraction.

class fonduer.candidates.MentionExtractor(session: sqlalchemy.orm.session.Session, mention_classes: List[fonduer.candidates.models.mention.Mention], mention_spaces: List[fonduer.candidates.mentions.MentionSpace], matchers: List[fonduer.candidates.matchers._Matcher], parallelism: int = 1)[source]

Bases: fonduer.utils.udf.UDFRunner

An operator to extract Mention objects from a Context.

Example:

Assuming we want to extract two types of Mentions, a Part and a Temperature, and we have already defined Matchers to use:

part_ngrams = MentionNgrams(n_max=3)
temp_ngrams = MentionNgrams(n_max=2)

Part = mention_subclass("Part")
Temp = mention_subclass("Temp")

mention_extractor = MentionExtractor(
    session,
    [Part, Temp],
    [part_ngrams, temp_ngrams],
    [part_matcher, temp_matcher]
)
Parameters:
  • session – An initialized database session.
  • mention_classes (list) – The Mention classes to extract, defined using fonduer.candidates.models.mention_subclass().
  • mention_spaces (list) – A list of MentionSpace objects, one for each mention class. Each defines the space of Contexts to consider.
  • matchers (list) – A list of matcher objects from fonduer.candidates.matchers, one for each mention class. Only Contexts accepted by the corresponding matcher will be returned as Mentions.
  • parallelism (int) – The number of processes to use in parallel for calls to apply().
Raises:

ValueError – If mention classes, spaces, and matchers are not the same length.

apply(docs: Collection[fonduer.parser.models.document.Document], clear: bool = True, parallelism: Optional[int] = None, progress_bar: bool = True) → None[source]

Run the MentionExtractor.

Example:

To extract mentions from a set of training documents using 4 cores:

mention_extractor.apply(train_docs, parallelism=4)
Parameters:
  • docs – Set of documents to extract from.
  • clear (bool) – Whether or not to clear the existing Mentions beforehand.
  • parallelism (int) – How many threads to use for extraction. This will override the parallelism value used to initialize the MentionExtractor if it is provided.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
clear() → None[source]

Delete Mentions of each class in the extractor from the given split.

clear_all() → None[source]

Delete all Mentions from the database.

get_mentions(docs: Optional[Collection[fonduer.parser.models.document.Document]] = None, sort: bool = False) → List[List[fonduer.candidates.models.mention.Mention]][source]

Return a list of lists of the mentions associated with this extractor.

Each list of the return will contain the Mentions for one of the mention classes associated with the MentionExtractor.

Parameters:
  • docs – If provided, return Mentions from these documents. Else, return all Mentions.
  • sort (bool) – If sort is True, then return all Mentions sorted by stable_id.
Returns:

Mentions for each mention_class.

Return type:

List of lists.

class fonduer.candidates.CandidateExtractor(session: sqlalchemy.orm.session.Session, candidate_classes: List[Type[fonduer.candidates.models.candidate.Candidate]], throttlers: Optional[List[Callable[Tuple[fonduer.candidates.models.mention.Mention, ...], bool]]] = None, self_relations: bool = False, nested_relations: bool = False, symmetric_relations: bool = True, parallelism: int = 1)[source]

Bases: fonduer.utils.udf.UDFRunner

An operator to extract Candidate objects from a Context.

Example:

Assuming we have already defined a Part and Temp Mention subclass, and a throttler called temp_throttler, we can create a candidate extractor as follows:

PartTemp = candidate_subclass("PartTemp", [Part, Temp])
candidate_extractor = CandidateExtractor(
    session, [PartTemp], throttlers=[temp_throttler]
)
Parameters:
  • session – An initialized database session.
  • candidate_classes (list of candidate subclasses.) – The types of relation to extract, defined using fonduer.candidates.models.candidate_subclass().
  • throttlers (list of throttlers.) – Optional functions for filtering out candidates. Each returns a Boolean expressing whether or not the candidate should be instantiated.
  • self_relations (bool) – Boolean indicating whether to extract Candidates that relate the same context. Only applies to binary relations.
  • nested_relations (bool) – Boolean indicating whether to extract Candidates that relate one Context with another that contains it. Only applies to binary relations.
  • symmetric_relations (bool) – Boolean indicating whether to extract symmetric Candidates, i.e., rel(A,B) and rel(B,A), where A and B are Contexts. Only applies to binary relations.
  • parallelism (int) – The number of processes to use in parallel for calls to apply().
Raises:

ValueError – If throttlers are provided but their number does not match the number of candidate classes.
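The sketch below illustrates what a throttler does and how the self_relations and symmetric_relations flags affect pairing. It is a simplified model that does not use Fonduer: the mentions are hypothetical stand-in objects with a made-up page attribute, and same_page_throttler and same_type_pairs are illustrative names rather than library functions.

```python
from itertools import product
from types import SimpleNamespace

# Hypothetical stand-ins for Part and Temp mentions.
parts = [SimpleNamespace(text="BC546", page=1), SimpleNamespace(text="BC547", page=2)]
temps = [SimpleNamespace(text="150", page=1), SimpleNamespace(text="-65", page=3)]


def same_page_throttler(mentions):
    """Keep a (part, temp) pair only when both mentions are on the same page."""
    part, temp = mentions
    return part.page == temp.page


# Every (part, temp) combination is considered; the throttler filters them.
kept = [pair for pair in product(parts, temps) if same_page_throttler(pair)]


def same_type_pairs(mentions, self_relations=False, symmetric_relations=True):
    """Enumerate pairs for a binary relation whose arguments share one type."""
    pairs = []
    for i, a in enumerate(mentions):
        for j, b in enumerate(mentions):
            if i == j and not self_relations:
                continue  # skip rel(A, A)
            if i > j and not symmetric_relations:
                continue  # keep rel(A, B) but drop rel(B, A)
            pairs.append((a, b))
    return pairs
```

With the data above, only the pair whose part and temperature share page 1 survives the throttler.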

apply(docs: Collection[fonduer.parser.models.document.Document], split: int = 0, clear: bool = True, parallelism: Optional[int] = None, progress_bar: bool = True) → None[source]

Run the CandidateExtractor.

Example:

To extract candidates from a set of training documents using 4 cores:

candidate_extractor.apply(train_docs, split=0, parallelism=4)
Parameters:
  • docs – Set of documents to extract from.
  • split (int) – Which split to assign the extracted Candidates to.
  • clear (bool) – Whether or not to clear the existing Candidates beforehand.
  • parallelism (int) – How many threads to use for extraction. This will override the parallelism value used to initialize the CandidateExtractor if it is provided.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
clear(split: int) → None[source]

Delete Candidates of each class initialized with the CandidateExtractor from the given split in the database.

Parameters:split (int) – Which split to clear.
clear_all(split: int) → None[source]

Delete ALL Candidates from the given split in the database.

Parameters:split (int) – Which split to clear.
get_candidates(docs: Optional[Collection[fonduer.parser.models.document.Document]] = None, split: int = 0, sort: bool = False) → List[List[fonduer.candidates.models.candidate.Candidate]][source]

Return a list of lists of the candidates associated with this extractor.

Each list of the return will contain the candidates for one of the candidate classes associated with the CandidateExtractor.

Parameters:
  • docs (list, tuple of Documents.) – If provided, return candidates from these documents from all splits.
  • split (int) – If docs is None, then return all the candidates from this split.
  • sort (bool) – If sort is True, then return all candidates sorted by stable_id.
Returns:

Candidates for each candidate_class.

Return type:

List of lists of Candidates.

MentionSpaces

A MentionSpace defines the space of mentions, i.e., the set of all possible mentions. Depending on your needs, you can use a pre-defined child class of MentionSpace or extend one.

class fonduer.candidates.mentions.MentionSpace[source]

Bases: object

Defines the space of Mention objects.

Calling apply(x) given an object x returns a generator over mentions in x.

class fonduer.candidates.mentions.Ngrams(n_min: int = 1, n_max: int = 5, split_tokens: Collection[str] = [])[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all n-grams (n_min <= n <= n_max) in a Sentence x, indexing by character offset.

Parameters:
  • n_min (int) – Lower limit for the generated n_grams.
  • n_max (int) – Upper limit for the generated n_grams.
  • split_tokens (tuple, list of str.) – Tokens, on which unigrams are split into two separate unigrams.
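To make the n-gram space concrete, here is a minimal sketch that enumerates n-grams with character offsets. It is not Fonduer's implementation: ngrams is a hypothetical function, the words are assumed to be separated by single spaces, split_tokens is not handled, and the order in which n-grams are produced is arbitrary.

```python
from typing import Iterator, List, Tuple


def ngrams(
    words: List[str], n_min: int = 1, n_max: int = 5
) -> Iterator[Tuple[int, int, str]]:
    """Yield (char_start, char_end, text) for each n-gram, n_min <= n <= n_max.

    Offsets are relative to the sentence start and char_end is inclusive.
    """
    # Character offset at which each word starts (single-space separation).
    starts = []
    offset = 0
    for word in words:
        starts.append(offset)
        offset += len(word) + 1

    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            last = i + n - 1
            char_end = starts[last] + len(words[last]) - 1
            yield starts[i], char_end, " ".join(words[i : i + n])


spans = list(ngrams(["max", "temp", "150"], n_min=1, n_max=2))
```

For the three-word sentence above, spans contains three unigrams and two bigrams.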
class fonduer.candidates.mentions.MentionNgrams(n_min: int = 1, n_max: int = 5, split_tokens: Collection[str] = [])[source]

Bases: fonduer.candidates.mentions.Ngrams

Defines the space of Mentions.

Defines the space of Mentions as all n-grams (n_min <= n <= n_max) in a Document x, divided into Sentences inside of html elements (such as table cells).

Parameters:
  • n_min (int) – Lower limit for the generated n_grams.
  • n_max (int) – Upper limit for the generated n_grams.
  • split_tokens (tuple, list of str.) – Tokens, on which unigrams are split into two separate unigrams.
class fonduer.candidates.mentions.MentionFigures(types: Optional[str] = None)[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all figures in a Document x.

Parameters:types (list, tuple of str) – If specified, only yield TemporaryFigureMentions whose url ends in one of the specified types. Example: types=["png", "jpg", "jpeg"].
class fonduer.candidates.mentions.MentionSentences[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all sentences in a Document x.

class fonduer.candidates.mentions.MentionParagraphs[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all paragraphs in a Document x.

class fonduer.candidates.mentions.MentionCaptions[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all captions in a Document x.

class fonduer.candidates.mentions.MentionCells[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all cells in a Document x.

class fonduer.candidates.mentions.MentionTables[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all tables in a Document x.

class fonduer.candidates.mentions.MentionSections[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all sections in a Document x.

class fonduer.candidates.mentions.MentionDocuments[source]

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as the Document x itself.

Matchers

This shows the matchers included with Fonduer. These matchers can be used alone, or combined together, to define what spans of text should be made into Mentions.

class fonduer.candidates.matchers.DateMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are dates, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a date (DATE).

class fonduer.candidates.matchers.DictionaryMatch(*children, **opts)[source]

Bases: fonduer.candidates.matchers._NgramMatcher

Selects mention Ngrams that match against a given list d.

Parameters:
  • d (list of str) – A list of strings representing a dictionary.
  • ignore_case (bool) – Whether to ignore the case when matching. Default True.
  • inverse (bool) – Whether to invert the results (e.g., return those which are not in the list). Default False.
  • stemmer – Optionally provide a stemmer to preprocess the dictionary. Can be any object which has a stem() method. Use stemmer="porter" to use a PorterStemmer(). Default None.
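The effect of ignore_case and inverse can be sketched with a plain membership test. dictionary_match is a hypothetical function operating on strings; the real matcher operates on mention objects, and stemming is left out here.

```python
def dictionary_match(text, d, ignore_case=True, inverse=False):
    """Return True if `text` is in dictionary `d` (or not in it, if inverse)."""
    if ignore_case:
        text = text.lower()
        entries = {entry.lower() for entry in d}
    else:
        entries = set(d)
    found = text in entries
    return not found if inverse else found


part_names = ["BC546", "BC547"]
```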
class fonduer.candidates.matchers.LambdaFunctionFigureMatcher(*children, **opts)[source]

Bases: fonduer.candidates.matchers._FigureMatcher

Selects Figures that return True when fed to a function f.

Parameters:func (function) – The function to evaluate. See LambdaFunctionMatcher for details.
class fonduer.candidates.matchers.LambdaFunctionMatcher(*children, **opts)[source]

Bases: fonduer.candidates.matchers._NgramMatcher

Selects Ngrams that return True when fed to a function f.

Parameters:
  • func (function) – The function to evaluate with a signature of f: m -> {True, False}, where m denotes a mention. More precisely, m is an instance of child class of TemporaryContext, depending on which MentionSpace is used. E.g., TemporarySpanMention when MentionNgrams is used.
  • longest_match_only (bool) – Whether to only return the longest span matched, rather than all spans. Default False.
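A function suitable for func takes a single mention and returns a Boolean. The sketch below shows such a function tested against a hypothetical stand-in class, FakeSpan, which only mimics the get_span() method; with Fonduer itself the argument would be a temporary mention object such as TemporarySpanMention.

```python
class FakeSpan:
    """Hypothetical stand-in exposing only get_span()."""

    def __init__(self, text):
        self._text = text

    def get_span(self):
        return self._text


def is_short_uppercase(m) -> bool:
    """Accept spans of at most five characters whose letters are all uppercase."""
    text = m.get_span()
    return text.isupper() and len(text) <= 5


accepted = is_short_uppercase(FakeSpan("BC546"))
rejected = is_short_uppercase(FakeSpan("transistor"))
```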
class fonduer.candidates.matchers.LocationMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are the names of locations, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a location (GPE or LOC).

class fonduer.candidates.matchers.MiscMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are miscellaneous named entities, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as miscellaneous (MISC).

class fonduer.candidates.matchers.NumberMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are numbers, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a number (NUMBER or QUANTITY).

class fonduer.candidates.matchers.OrganizationMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are the names of organizations, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as an organization (NORG or ORG).

class fonduer.candidates.matchers.PersonMatcher(*children, **kwargs)[source]

Bases: fonduer.candidates.matchers.RegexMatchEach

Matches Spans that are the names of people, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a person (PERSON).

class fonduer.candidates.matchers.RegexMatchEach(*children, **opts)[source]

Bases: fonduer.candidates.matchers._RegexMatch

Matches regex pattern on each token.

Parameters:
  • rgx (str) – The RegEx pattern to use.
  • ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
  • full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
  • longest_match_only (bool) – If True, only return the longest match. Default True.
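The per-token behaviour and the full_match wrapping can be sketched with the standard re module. regex_match_each is a hypothetical function over a list of token strings, written only to illustrate the parameters above; longest_match_only is not modelled.

```python
import re


def regex_match_each(tokens, rgx, ignore_case=True, full_match=True):
    """Return True only if the pattern matches every token."""
    # full_match wraps the pattern as (<rgx>)$ so each token must match to its end.
    pattern = f"({rgx})$" if full_match else rgx
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return all(compiled.match(token) is not None for token in tokens)
```

With full_match=True the token "3a" is rejected by the pattern \d+, while with full_match=False it is accepted because the pattern matches at the start of the token.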
class fonduer.candidates.matchers.RegexMatchSpan(*children, **opts)[source]

Bases: fonduer.candidates.matchers._RegexMatch

Matches regex pattern on full concatenated span.

Parameters:
  • rgx (str) – The RegEx pattern to use.
  • ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
  • search (bool) – If True, search for the regex pattern anywhere in the concatenated span. If False, try to match the regex pattern only at the beginning of the span. Default False.
  • full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
  • longest_match_only (bool) – If True, only return the longest match. Default True.
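The difference between search=True and search=False corresponds to re.search versus re.match in Python's standard library. The sketch below is a hypothetical function over a plain string, not the matcher class itself, and it ignores longest_match_only.

```python
import re


def regex_match_span(text, rgx, ignore_case=True, search=False, full_match=True):
    """Return True if the pattern matches the whole concatenated span text."""
    pattern = f"({rgx})$" if full_match else rgx
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    # search scans the whole string; match only tries at position 0.
    matcher = compiled.search if search else compiled.match
    return matcher(text) is not None
```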

Matcher Operators

These are the operators which can be used to compose matchers.

class fonduer.candidates.matchers.Concat(*children, **opts)[source]

Bases: fonduer.candidates.matchers._NgramMatcher

Selects mentions which are the concatenation of adjacent matches from child operators.

Example:

A concatenation of a NumberMatcher and PersonMatcher could match on a span of text like “10 Obama”.

Parameters:
  • permutations (bool) – Default False.
  • left_required (bool) – Whether or not to require the left child to match. Default True.
  • right_required (bool) – Whether or not to require the right child to match. Default True.
  • ignore_sep (bool) – Whether or not to ignore the separator. Default True.
  • sep – If not ignoring the separator, specify which separator to look for. Default sep=" ".
Raises:

ValueError – If Concat is not provided with two child matcher objects.

Note

Currently slices on word index and considers concatenation along these divisions only.
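The idea of slicing on word index can be sketched as follows: try every split point between words and accept the span if the left part satisfies one child and the right part satisfies the other. concat_match and the two predicates are hypothetical simplifications that work on strings; the permutations, left_required, right_required and separator options are not modelled.

```python
def is_number(text: str) -> bool:
    return text.isdigit()


def is_capitalized(text: str) -> bool:
    return text.istitle()


def concat_match(words, left, right):
    """Return True if some word-index split gives a left and a right match."""
    for i in range(1, len(words)):
        if left(" ".join(words[:i])) and right(" ".join(words[i:])):
            return True
    return False


matched = concat_match(["10", "Obama"], is_number, is_capitalized)
```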

class fonduer.candidates.matchers.Intersect(*children, **opts)[source]

Bases: fonduer.candidates.matchers._Matcher

Takes the intersection of mention sets returned by the provided Matchers.

class fonduer.candidates.matchers.Inverse(*children, **opts)[source]

Bases: fonduer.candidates.matchers._Matcher

Returns the opposite result of its child Matcher.

Raises:ValueError – If more than one Matcher is provided.
class fonduer.candidates.matchers.Union(*children, **opts)[source]

Bases: fonduer.candidates.matchers._NgramMatcher

Takes the union of mention sets returned by the provided Matchers.
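The set semantics of Union, Intersect and Inverse can be illustrated with plain predicates. The combinators below are hypothetical functions over strings that mirror the descriptions above; they are not the Fonduer classes, which operate on mention objects.

```python
def union(*matchers):
    """Accept a mention if any child accepts it."""
    return lambda m: any(f(m) for f in matchers)


def intersect(*matchers):
    """Accept a mention only if every child accepts it."""
    return lambda m: all(f(m) for f in matchers)


def inverse(matcher):
    """Accept a mention exactly when the child rejects it."""
    return lambda m: not matcher(m)


def starts_with_bc(text):
    return text.startswith("BC")


def ends_with_digit(text):
    return text[-1:].isdigit()


either = union(starts_with_bc, ends_with_digit)
both = intersect(starts_with_bc, ends_with_digit)
not_bc = inverse(starts_with_bc)
```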