Candidate Extraction¶

The second stage of Fonduer’s pipeline is to extract Mentions and Candidates from the data model.

Candidate Model Classes¶

The following describes the elements used for Mention and Candidate extraction.

Fonduer’s candidate model module.

class fonduer.candidates.models.Candidate(**kwargs)[source]¶

Bases: sqlalchemy.orm.decl_api.Base

An abstract candidate relation.

New relation types should be defined by calling candidate_subclass(), not subclassing this class directly.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

get_mentions()[source]¶

Get a tuple of the constituent Mentions making up this Candidate.

Return type: Tuple[Mention, …]

id¶: The unique id for the Candidate.

split¶: Which split the Candidate belongs to. Used to organize train/dev/test.

type¶: The type for the Candidate, which corresponds to the names the user gives to the candidate_subclasses.

class fonduer.candidates.models.CaptionMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.caption_mention.TemporaryCaptionMention

A caption Mention.

Initialize CaptionMention.

caption¶: The parent Caption.

caption_id¶: The id of the parent Caption.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the CaptionMention.

class fonduer.candidates.models.CellMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.cell_mention.TemporaryCellMention

A cell Mention.

Initialize CellMention.

cell¶: The parent Cell.

cell_id¶: The id of the parent Cell.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the CellMention.

class fonduer.candidates.models.DocumentMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.document_mention.TemporaryDocumentMention

A document Mention.

Initialize DocumentMention.

document¶: The parent Document.

document_id¶: The id of the parent Document.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the DocumentMention.

class fonduer.candidates.models.FigureMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.figure_mention.TemporaryFigureMention

A figure Mention.

Initialize FigureMention.

figure¶: The parent Figure.

figure_id¶: The id of the parent Figure.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the FigureMention.

class fonduer.candidates.models.ImplicitSpanMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.implicit_span_mention.TemporaryImplicitSpanMention

A span of characters that may not appear verbatim in the source text.

It is identified by Context id, character-index start and end (inclusive), as well as a key representing what ‘expander’ function drew the ImplicitSpanMention from an existing SpanMention, and a position (where position=0 corresponds to the first ImplicitSpanMention produced from the expander function).

The character-index start and end point to the segment of text that was expanded to produce the ImplicitSpanMention.

Initialize ImplicitSpanMention.

char_end¶: The ending character-index of the ImplicitSpanMention (inclusive).

char_start¶: The starting character-index of the ImplicitSpanMention.

dep_labels¶: A list of the dependency labels for each word in the ImplicitSpanMention.

dep_parents¶: A list of the dependency parents for each word in the ImplicitSpanMention.

get_attrib_span(a, sep='')¶

Get the span of sentence attribute a.

Intuitively, like calling:

sep.join(implicit_span.a)

Parameters

a (str) – The attribute to get a span for.
sep (str) – The separator to use for the join, or to be removed from text if a="words".

Return type

str

Returns

The joined tokens, or text if a="words".

get_attrib_tokens(a='words')¶

Get the tokens of sentence attribute a.

Intuitively, like calling:

implicit_span.a

Parameters: a (str) – The attribute to get tokens for.
Return type: List
Returns: The tokens of sentence attribute defined by a for the span.
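The relationship between get_attrib_tokens() and get_attrib_span() can be pictured with plain lists. The sketch below is a standalone illustration, not Fonduer code: the sentence dictionary, the word indices, and the helper names attrib_tokens and attrib_span are all hypothetical stand-ins.

```python
from typing import Dict, List

# Hypothetical per-word attributes of a parent sentence.
sentence: Dict[str, List[str]] = {
    "words": ["Storage", "temperature", "is", "150", "C"],
    "pos_tags": ["NN", "NN", "VBZ", "CD", "NN"],
}

# Inclusive word indices of the span, in the spirit of
# get_word_start_index() and get_word_end_index().
word_start, word_end = 3, 4


def attrib_tokens(a: str = "words") -> List[str]:
    """Return the per-word values of attribute `a` that fall inside the span."""
    return sentence[a][word_start : word_end + 1]


def attrib_span(a: str, sep: str = "") -> str:
    """Join the tokens of attribute `a` with `sep`."""
    return sep.join(attrib_tokens(a))


print(attrib_tokens("pos_tags"))
print(attrib_span("pos_tags", sep=" "))
```

The sketch joins tokens for every attribute, including "words"; the documented method returns the span text in that case instead.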

get_bbox()¶

Get the bounding box.

Return type: Bbox

get_num_words()¶

Get the number of words in the span.

Return type: int
Returns: The number of words in the span (n of the ngrams).

get_span()¶

Return the text of the Span.

Return type: str
Returns: The text of the Span.

get_stable_id()¶

Return a stable id.

Return type: str

get_word_end_index()¶

Get the index of the ending word of the span.

Return type: int
Returns: The word-index of the last word of the span.

get_word_start_index()¶

Get the index of the starting word of the span.

Return type: int
Returns: The word-index of the start of the span.

id¶: The unique id of the ImplicitSpanMention.

lemmas¶: A list of the lemmas for each word in the ImplicitSpanMention.

meta¶: Pickled metadata about the ImplicitSpanMention.

ner_tags¶: A list of the NER tags for each word in the ImplicitSpanMention.

page¶: A list of the page numbers for each word in the ImplicitSpanMention.

pos_tags¶: A list of the POS tags for each word in the ImplicitSpanMention.

position¶: The position of the ImplicitSpanMention where position=0 is the first ImplicitSpanMention produced by the expander.

sentence¶: The parent Sentence.

sentence_id¶: The id of the parent Sentence.

text¶: The raw text of the ImplicitSpanMention.

words¶: A list of the words in the ImplicitSpanMention.

class fonduer.candidates.models.Mention(**kwargs)[source]¶

Bases: sqlalchemy.orm.decl_api.Base

An abstract Mention.

New mention types should be defined by calling mention_subclass(), not subclassing this class directly.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

get_contexts()[source]¶

Get the constituent context making up this mention.

Return type: Tuple[Context, …]

id¶: The unique id of the Mention.

type¶: The type for the Mention, which corresponds to the names the user gives to the mention_subclass.

class fonduer.candidates.models.ParagraphMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.paragraph_mention.TemporaryParagraphMention

A paragraph Mention.

Initialize ParagraphMention.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the ParagraphMention.

paragraph¶: The parent Paragraph.

paragraph_id¶: The id of the parent Paragraph.

class fonduer.candidates.models.SectionMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.section_mention.TemporarySectionMention

A section Mention.

Initialize SectionMention.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the SectionMention.

section¶: The parent Section.

section_id¶: The id of the parent Section.

class fonduer.candidates.models.SpanMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.span_mention.TemporarySpanMention

A span of chars, identified by Context ID and char-index start, end (inclusive).

char_offsets are relative to the Context start
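Because char_end is inclusive, recovering the span text from its parent requires adding one to the end index when slicing. The following is a minimal sketch using a plain string; the sentence text and the offsets are made up for illustration and no Fonduer objects are involved.

```python
# Standalone illustration with a plain string.
sentence_text = "The BC546 transistor operates at 150 C."

# Hypothetical inclusive offsets, relative to the start of the sentence text.
char_start = 4
char_end = 8

# Python slices exclude the end index, so add 1 to honour the inclusive end.
span_text = sentence_text[char_start : char_end + 1]
span_length = char_end - char_start + 1

print(span_text, span_length)
```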

Initialize SpanMention.

char_end¶: The ending character-index of the SpanMention (inclusive).

char_start¶: The starting character-index of the SpanMention.

get_attrib_span(a, sep='')¶

Get the span of sentence attribute a.

Intuitively, like calling:

sep.join(span.a)

Parameters

a (str) – The attribute to get a span for.
sep (str) – The separator to use for the join, or to be removed from text if a="words".

Return type

str

Returns

The joined tokens, or text if a="words".

get_attrib_tokens(a='words')¶

Get the tokens of sentence attribute a.

Intuitively, like calling:

span.a

Parameters: a (str) – The attribute to get tokens for.
Return type: List
Returns: The tokens of sentence attribute defined by a for the span.

get_bbox()¶

Get the bounding box.

Return type: Bbox

get_num_words()¶

Get the number of words in the span.

Return type: int
Returns: The number of words in the span (n of the ngrams).

get_span()¶

Return the text of the Span.

Return type: str
Returns: The text of the Span.

get_stable_id()¶

Return a stable id.

Return type: str

get_word_end_index()¶

Get the index of the ending word of the span.

Return type: int
Returns: The word-index of the last word of the span.

get_word_start_index()¶

Get the index of the starting word of the span.

Return type: int
Returns: The word-index of the start of the span.

id¶: The unique id of the SpanMention.

meta¶: Pickled metadata about the SpanMention.

sentence¶: The parent Sentence.

sentence_id¶: The id of the parent Sentence.

class fonduer.candidates.models.TableMention(tc)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.candidates.models.table_mention.TemporaryTableMention

A table Mention.

Initialize TableMention.

get_stable_id()¶

Return a stable id.

Return type: str

id¶: The unique id of the TableMention.

table¶: The parent Table.

table_id¶: The id of the parent Table.

fonduer.candidates.models.candidate_subclass(class_name, args, table_name=None, cardinality=None, values=None, nullables=None)[source]¶

Create new relation.

Creates and returns a Candidate subclass with the provided argument names, which are of Context type. Creates the table in the DB if it does not exist yet.

Import using:

from fonduer.candidates.models import candidate_subclass

Parameters

class_name (str) – The name of the class, which should be in “camel case”, e.g. NewCandidate.
args (List[Mention]) – A list of names of constituent arguments, which refer to the Contexts, representing mentions, that comprise the candidate.
table_name (Optional[str]) – The name of the corresponding table in the DB; if not provided, it is converted from camel case by default, e.g. new_candidate.
cardinality (Optional[int]) – The cardinality of the variable corresponding to the Candidate. Defaults to 2, i.e. a binary value, e.g. is or is not a true mention.
values (Optional[List[Any]]) – A list of values a candidate can take as its label.
nullables (Optional[List[bool]]) – The number of nullables must match that of args. If nullables[i]==True, the mention for the ith mention subclass can be NULL. If nullables=None (the default), no mention can be NULL.

Return type

Type[Candidate]
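The default table_name is described as the class name converted from camel case. A minimal sketch of such a conversion is shown below; camel_to_snake is a hypothetical helper written for this illustration, and the conversion Fonduer actually applies may handle edge cases (such as consecutive capitals) differently.

```python
import re


def camel_to_snake(class_name: str) -> str:
    """Insert "_" before each uppercase letter except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(camel_to_snake("NewCandidate"))
print(camel_to_snake("PartTemp"))
```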

fonduer.candidates.models.mention_subclass(class_name, cardinality=None, values=None, table_name=None)[source]¶

Create new mention.

Creates and returns a Mention subclass with the provided argument names, which are of Context type. Creates the table in the DB if it does not exist yet.

Import using:

from fonduer.candidates.models import mention_subclass

Parameters

class_name (str) – The name of the class, which should be in “camel case”, e.g. NewMention.
table_name (Optional[str]) – The name of the corresponding table in the DB; if not provided, it is converted from camel case by default, e.g. new_mention.
values (Optional[List[Any]]) – The values that the variable corresponding to the Mention can take. Defaults to [True, False].
cardinality (Optional[int]) – The cardinality of the variable corresponding to the Mention. Defaults to 2, i.e. a binary value, e.g. is or is not a true mention.

Return type

Type[Mention]

Core Objects¶

These are Fonduer’s core objects used for Mention and Candidate extraction.

class fonduer.candidates.MentionExtractor(session, mention_classes, mention_spaces, matchers, parallelism=1)[source]¶

Bases: fonduer.utils.udf.UDFRunner

An operator to extract Mention objects from a Context.

Example

Assuming we want to extract two types of Mentions, a Part and a Temperature, and we have already defined Matchers to use:

part_ngrams = MentionNgrams(n_max=3)
temp_ngrams = MentionNgrams(n_max=2)

Part = mention_subclass("Part")
Temp = mention_subclass("Temp")

mention_extractor = MentionExtractor(
    session,
    [Part, Temp],
    [part_ngrams, temp_ngrams],
    [part_matcher, temp_matcher]
)

Parameters

session (Session) – An initialized database session.
mention_classes (List[Mention]) – The Mention classes to extract, defined using fonduer.candidates.models.mention_subclass().
mention_spaces (List[MentionSpace]) – One MentionSpace object or a list of them, one for each mention class. Defines the space of Contexts to consider.
matchers (List[_Matcher]) – One fonduer.matchers.Matcher object or a list of them, one for each mention class. Only tuples of Contexts for which each element is accepted by the corresponding Matcher will be returned as Mentions.
parallelism (int) – The number of processes to use in parallel for calls to apply().

Raises

ValueError – If mention classes, spaces, and matchers are not the same length.

Initialize the MentionExtractor.
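The three list arguments are positional partners: the i-th mention class is paired with the i-th mention space and the i-th matcher, which is why a length mismatch raises ValueError. The sketch below illustrates that pairing with placeholder strings; align_extractor_inputs is a hypothetical helper, not part of Fonduer.

```python
from typing import List, Sequence, Tuple


def align_extractor_inputs(
    mention_classes: Sequence[str],
    mention_spaces: Sequence[str],
    matchers: Sequence[str],
) -> List[Tuple[str, str, str]]:
    """Pair each mention class with its space and matcher by position."""
    if not (len(mention_classes) == len(mention_spaces) == len(matchers)):
        raise ValueError(
            "mention_classes, mention_spaces, and matchers must have the same length"
        )
    return list(zip(mention_classes, mention_spaces, matchers))


# Placeholder names standing in for real classes, spaces, and matchers.
triples = align_extractor_inputs(
    ["Part", "Temp"],
    ["part_ngrams", "temp_ngrams"],
    ["part_matcher", "temp_matcher"],
)
print(triples)
```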

apply(docs, clear=True, parallelism=None, progress_bar=True)[source]¶

Run the MentionExtractor.

Example

To extract mentions from a set of training documents using 4 cores:

mention_extractor.apply(train_docs, parallelism=4)

Parameters

docs (Collection[Document]) – Set of documents to extract from.
clear (bool) – Whether or not to clear the existing Mentions beforehand.
parallelism (Optional[int]) – How many threads to use for extraction. This will override the parallelism value used to initialize the MentionExtractor if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

Return type

None

clear()[source]¶

Delete Mentions of each class in the extractor.

Return type: None

clear_all()[source]¶

Delete all Mentions from the database.

Return type: None

get_mentions(docs=None, sort=False)[source]¶

Return a list of lists of the mentions associated with this extractor.

Each list of the return will contain the Mentions for one of the mention classes associated with the MentionExtractor.

Parameters

docs (Union[Document, Iterable[Document], None]) – If provided, return Mentions from these documents. Else, return all Mentions.
sort (bool) – If sort is True, then return all Mentions sorted by stable_id.

Return type

List[List[Mention]]

Returns

Mentions for each mention_class.

last_docs: Set[str]¶: The last set of documents that apply() was called on

class fonduer.candidates.CandidateExtractor(session, candidate_classes, throttlers=None, self_relations=False, nested_relations=False, symmetric_relations=True, parallelism=1)[source]¶

Bases: fonduer.utils.udf.UDFRunner

An operator to extract Candidate objects from a Context.

Example

Assuming we have already defined a Part and Temp Mention subclass, and a throttler called temp_throttler, we can create a candidate extractor as follows:

PartTemp = candidate_subclass("PartTemp", [Part, Temp])
candidate_extractor = CandidateExtractor(
    session, [PartTemp], throttlers=[temp_throttler]
)

Parameters

session (Session) – An initialized database session.
candidate_classes (List[Type[Candidate]]) – The types of relation to extract, defined using fonduer.candidates.candidate_subclass().
throttlers (list of throttlers) – Optional functions for filtering out candidates. Each returns a Boolean expressing whether or not the candidate should be instantiated.
self_relations (bool) – Boolean indicating whether to extract Candidates that relate the same context. Only applies to binary relations.
nested_relations (bool) – Boolean indicating whether to extract Candidates that relate one Context with another that contains it. Only applies to binary relations.
symmetric_relations (bool) – Boolean indicating whether to extract symmetric Candidates, i.e., rel(A,B) and rel(B,A), where A and B are Contexts. Only applies to binary relations.
parallelism (int) – The number of processes to use in parallel for calls to apply().

Raises

ValueError – If throttlers are provided, but the throttlers are not the same length as candidate classes.

Set throttlers to match candidate_classes if not provided.
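The effect of self_relations, nested_relations, symmetric_relations, and a throttler on a binary relation can be sketched with character spans standing in for Contexts. This is a simplified, standalone model written for illustration: the Span type, the contains helper, and candidate_pairs are hypothetical, and the real extractor operates on Mention objects stored in the database.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (char_start, char_end), both inclusive
Pair = Tuple[Span, Span]


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def candidate_pairs(
    left: List[Span],
    right: List[Span],
    self_relations: bool = False,
    nested_relations: bool = False,
    symmetric_relations: bool = True,
    throttler: Optional[Callable[[Pair], bool]] = None,
) -> List[Pair]:
    pairs = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        if not self_relations and a == b:
            continue  # skip pairs that relate a span to itself
        if not nested_relations and a != b and (contains(a, b) or contains(b, a)):
            continue  # skip pairs where one span contains the other
        if not symmetric_relations and i > j:
            continue  # keep only one ordering of each pair
        if throttler is not None and not throttler((a, b)):
            continue  # the throttler vetoes this pair
        pairs.append((a, b))
    return pairs


spans = [(0, 4), (2, 3), (10, 12)]
print(candidate_pairs(spans, spans))
print(candidate_pairs(spans, spans, symmetric_relations=False))
```

A throttler in this model is any function that takes the pair and returns True to keep it, for example one that rejects pairs whose start offsets are far apart.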

apply(docs, split=0, clear=True, parallelism=None, progress_bar=True)[source]¶

Run the CandidateExtractor.

Example

To extract candidates from a set of training documents using 4 cores:

candidate_extractor.apply(train_docs, split=0, parallelism=4)

Parameters

docs (Collection[Document]) – Set of documents to extract from.
split (int) – Which split to assign the extracted Candidates to.
clear (bool) – Whether or not to clear the existing Candidates beforehand.
parallelism (Optional[int]) – How many threads to use for extraction. This will override the parallelism value used to initialize the CandidateExtractor if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

Return type

None

clear(split)[source]¶

Clear Candidates of each class.

Delete Candidates of each class initialized with the CandidateExtractor from the given split in the database.

Parameters: split (int) – Which split to clear.
Return type: None

clear_all(split)[source]¶

Delete ALL Candidates from the given split in the database.

Parameters: split (int) – Which split to clear.
Return type: None

get_candidates(docs=None, split=0, sort=False)[source]¶

Return a list of lists of the candidates associated with this extractor.

Each list of the return will contain the candidates for one of the candidate classes associated with the CandidateExtractor.

Parameters

docs (Union[Document, Iterable[Document], None]) – If provided, return candidates from these documents from all splits.
split (int) – If docs is None, then return all the candidates from this split.
sort (bool) – If sort is True, then return all candidates sorted by stable_id.

Return type

List[List[Candidate]]

Returns

Candidates for each candidate_class.

last_docs: Set[str]¶: The last set of documents that apply() was called on

MentionSpaces¶

A MentionSpace defines the space of mentions, i.e., the set of all possible mentions. Depending on your needs, you can use a pre-defined child class of MentionSpace or extend one.

class fonduer.candidates.mentions.MentionSpace[source]¶

Bases: object

Define the space of Mention objects.

Calling apply(x) given an object x returns a generator over mentions in x.

Initialize mention space.

class fonduer.candidates.mentions.Ngrams(n_min=1, n_max=5, split_tokens=[])[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Define the space of Mentions as all n-grams in a Sentence.

Define the space of Mentions as all n-grams (n_min <= n <= n_max) in a Sentence x, indexing by character offset.

Parameters

n_min (int) – Lower limit for the generated n_grams.
n_max (int) – Upper limit for the generated n_grams.
split_tokens (tuple or list of str) – Tokens on which unigrams are split into two separate unigrams.

Initialize Ngrams.
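To make "all n-grams (n_min <= n <= n_max), indexed by character offset" concrete, the sketch below enumerates n-grams over a list of words and reports each as an inclusive (char_start, char_end) pair. It is a standalone illustration under simplifying assumptions: words are separated by single spaces, split_tokens is ignored, and the enumeration order is arbitrary rather than Fonduer's. The function name ngram_char_spans is hypothetical.

```python
from typing import List, Tuple


def ngram_char_spans(
    words: List[str], n_min: int = 1, n_max: int = 5
) -> List[Tuple[int, int]]:
    """Return inclusive (char_start, char_end) offsets of every n-gram."""
    # Start offset of each word, assuming single-space separators.
    offsets = []
    position = 0
    for word in words:
        offsets.append(position)
        position += len(word) + 1

    spans = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            last = i + n - 1
            start = offsets[i]
            end = offsets[last] + len(words[last]) - 1  # inclusive end
            spans.append((start, end))
    return spans


words = ["max", "storage", "temp"]
text = " ".join(words)
spans = ngram_char_spans(words, n_min=1, n_max=2)
print([text[s : e + 1] for s, e in spans])
```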

class fonduer.candidates.mentions.MentionNgrams(n_min=1, n_max=5, split_tokens=[])[source]¶

Bases: fonduer.candidates.mentions.Ngrams

Defines the space of Mentions as n-grams in a Document.

Defines the space of Mentions as all n-grams (n_min <= n <= n_max) in a Document x, divided into Sentences inside of html elements (such as table cells).

Parameters

n_min (int) – Lower limit for the generated n_grams.
n_max (int) – Upper limit for the generated n_grams.
split_tokens (tuple or list of str) – Tokens on which unigrams are split into two separate unigrams.

Initialize MentionNgrams.

class fonduer.candidates.mentions.MentionFigures(types=None)[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all figures in a Document x.

Parameters: types (list or tuple of str) – If specified, only yield TemporaryFigureMentions whose url ends in one of the specified types. Example: types=["png", "jpg", "jpeg"].

Initialize MentionFigures.
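The types filter amounts to a suffix test on each figure's url. A minimal sketch follows; keep_figure is a hypothetical helper, and the lower-casing of both the url and the types is an assumption of this sketch rather than documented behaviour.

```python
from typing import Optional, Sequence


def keep_figure(url: str, types: Optional[Sequence[str]] = None) -> bool:
    """Return True if the figure should be yielded as a mention."""
    if types is None:
        return True  # no filter: every figure is kept
    # Assumption: compare case-insensitively.
    suffixes = tuple(t.lower() for t in types)
    return url.lower().endswith(suffixes)


print(keep_figure("img/plot.PNG", ["png", "jpg", "jpeg"]))
print(keep_figure("img/diagram.svg", ["png", "jpg", "jpeg"]))
```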

class fonduer.candidates.mentions.MentionSentences[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all sentences in a Document x.

Initialize MentionSentences.

class fonduer.candidates.mentions.MentionParagraphs[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all paragraphs in a Document x.

Initialize MentionParagraphs.

class fonduer.candidates.mentions.MentionCaptions[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all captions in a Document x.

Initialize MentionCaptions.

class fonduer.candidates.mentions.MentionCells[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all cells in a Document x.

Initialize MentionCells.

class fonduer.candidates.mentions.MentionTables[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all tables in a Document x.

Initialize MentionTables.

class fonduer.candidates.mentions.MentionSections[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as all sections in a Document x.

Initialize MentionSections.

class fonduer.candidates.mentions.MentionDocuments[source]¶

Bases: fonduer.candidates.mentions.MentionSpace

Defines the space of Mentions as the Document x itself.

Initialize MentionDocuments.

Matchers¶

This shows the matchers included with Fonduer. These matchers can be used alone, or combined together, to define what spans of text should be made into Mentions.

class fonduer.candidates.matchers.DateMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are dates, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a date (DATE).

Initialize date matcher.

class fonduer.candidates.matchers.DictionaryMatch(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Select mention Ngrams that match against a given list d.

Parameters

d (list of str) – A list of strings representing a dictionary.
ignore_case (bool) – Whether to ignore the case when matching. Default True.
inverse (bool) – Whether to invert the results (e.g., return those which are not in the list). Default False.
stemmer – Optionally provide a stemmer to preprocess the dictionary. Can be any object which has a stem(str) -> str method like PorterStemmer(). Default None.
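How ignore_case, inverse, and stemmer interact can be sketched with a plain function over strings. This is not the library's implementation: dictionary_match and SuffixStemmer are hypothetical, and the sketch applies the stemmer to both the dictionary and the span text, which is an assumption.

```python
from typing import Iterable, Optional


class SuffixStemmer:
    """Toy stemmer exposing stem(str) -> str; strips one trailing "s"."""

    def stem(self, word: str) -> str:
        return word[:-1] if word.endswith("s") else word


def dictionary_match(
    span_text: str,
    d: Iterable[str],
    ignore_case: bool = True,
    inverse: bool = False,
    stemmer: Optional[SuffixStemmer] = None,
) -> bool:
    def normalize(s: str) -> str:
        if ignore_case:
            s = s.lower()
        if stemmer is not None:
            s = stemmer.stem(s)
        return s

    vocabulary = {normalize(entry) for entry in d}
    found = normalize(span_text) in vocabulary
    # inverse flips the decision: accept what is NOT in the dictionary.
    return found != inverse


parts = ["bc546", "bc547"]
print(dictionary_match("BC546", parts))
print(dictionary_match("BC546", parts, ignore_case=False))
```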

class fonduer.candidates.matchers.LambdaFunctionFigureMatcher(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Select Figures that return True when fed to a function f.

Parameters: func (function) – The function to evaluate. See LambdaFunctionMatcher for details.

class fonduer.candidates.matchers.LambdaFunctionMatcher(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Select Ngrams that return True when fed to a function f.

Parameters

func (function) – The function to evaluate with a signature of f: m -> {True, False}, where m denotes a mention. More precisely, m is an instance of child class of TemporaryContext, depending on which MentionSpace is used. E.g., TemporarySpanMention when MentionNgrams is used.
longest_match_only (bool) – Whether to only return the longest span matched, rather than all spans. Default False.
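The sketch below models a function-based matcher over candidate spans, including one possible reading of longest_match_only in which a match is dropped when another match contains it. The span tuples, lambda_match, and the example predicate are hypothetical; in Fonduer the function receives a TemporaryContext subclass instance rather than a tuple.

```python
from typing import Callable, List, Tuple

# (char_start, char_end, text) with inclusive offsets; a stand-in for a mention.
Span = Tuple[int, int, str]


def lambda_match(
    spans: List[Span],
    func: Callable[[Span], bool],
    longest_match_only: bool = False,
) -> List[Span]:
    matched = [s for s in spans if func(s)]
    if not longest_match_only:
        return matched
    # Keep a match only if no other match contains it.
    return [
        s
        for s in matched
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in matched)
    ]


# All n-grams of the text "BC546 A".
spans = [(0, 4, "BC546"), (6, 6, "A"), (0, 6, "BC546 A")]


def starts_with_bc(span: Span) -> bool:
    return span[2].startswith("BC")


print(lambda_match(spans, starts_with_bc))
print(lambda_match(spans, starts_with_bc, longest_match_only=True))
```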

class fonduer.candidates.matchers.LocationMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are the names of locations, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a location (GPE or LOC).

Initialize location matcher.

class fonduer.candidates.matchers.MiscMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are miscellaneous named entities, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as miscellaneous (MISC).

Initialize miscellaneous matcher.

class fonduer.candidates.matchers.NumberMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are numbers, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a number (CARDINAL or QUANTITY).

Initialize number matcher.

class fonduer.candidates.matchers.OrganizationMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are the names of organizations, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as an organization (NORG or ORG).

Initialize organization matcher.

class fonduer.candidates.matchers.PersonMatcher(*children, **kwargs)[source]¶

Bases: fonduer.candidates.matchers.RegexMatchEach

Match Spans that are the names of people, as identified by spaCy.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a person (PERSON).

Initialize person matcher.

class fonduer.candidates.matchers.RegexMatchEach(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._RegexMatch

Match regex pattern on each token.

Parameters

rgx (str) – The RegEx pattern to use.
ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
longest_match_only (bool) – If True, only return the longest match. Default True.
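Matching "on each token" means a span is accepted only when every token satisfies the pattern, and full_match anchors the pattern at the end of the token by wrapping it as (<rgx>)$. The following sketch shows that logic with Python's re module; regex_match_each is a hypothetical function, and longest_match_only is left out.

```python
import re
from typing import List


def regex_match_each(
    tokens: List[str],
    rgx: str,
    ignore_case: bool = True,
    full_match: bool = True,
) -> bool:
    """Accept the span only if every token matches the pattern."""
    pattern = f"({rgx})$" if full_match else rgx
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    # re.match anchors at the start of each token.
    return all(compiled.match(token) is not None for token in tokens)


# NER tags of a two-word span, as a matcher such as DateMatcher would inspect.
print(regex_match_each(["DATE", "DATE"], "DATE"))
print(regex_match_each(["DATE", "ORG"], "DATE"))
```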

class fonduer.candidates.matchers.RegexMatchSpan(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._RegexMatch

Match regex pattern on full concatenated span.

Parameters

rgx (str) – The RegEx pattern to use.
ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
search (bool) – If True, search for the regex pattern anywhere in the concatenated span. If False, try to match the regex pattern only at the beginning of the span. Default False.
full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
longest_match_only (bool) – If True, only return the longest match. Default True. Overridden by the parent matcher when this matcher is wrapped by Union, Intersect, or Inverse.
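The difference between search=False and search=True mirrors re.match versus re.search in Python. A minimal sketch over the span text follows; regex_match_span is a hypothetical function and longest_match_only is not modelled.

```python
import re


def regex_match_span(
    span_text: str,
    rgx: str,
    ignore_case: bool = True,
    search: bool = False,
    full_match: bool = True,
) -> bool:
    pattern = f"({rgx})$" if full_match else rgx
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    # match() anchors at the start of the span; search() scans the whole span.
    matcher = compiled.search if search else compiled.match
    return matcher(span_text) is not None


print(regex_match_span("BC546", r"bc\d{3}"))
print(regex_match_span("part BC546", r"bc\d{3}"))
print(regex_match_span("part BC546", r"bc\d{3}", search=True))
```

With full_match=True and search=True, the pattern may start anywhere but must still end at the end of the span.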

Matcher Operators¶

These are the operators which can be used to compose matchers.

class fonduer.candidates.matchers.Concat(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Concatenate mentions generated by Matchers.

Select mentions which are the concatenation of adjacent matches from child operators.

Example

A concatenation of a NumberMatcher and PersonMatcher could match on a span of text like “10 Obama”.

Parameters

permutations (bool) – Default False.
left_required (bool) – Whether or not to require the left child to match. Default True.
right_required (bool) – Whether or not to require the right child to match. Default True.
ignore_sep (bool) – Whether or not to ignore the separator. Default True.
sep – If not ignoring the separator, specify which separator to look for. Default sep=" ".

Raises

ValueError – If Concat is not provided with two child matcher objects.

Note

Currently slices on word index and considers concatenation along these divisions only.
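Slicing on word index can be sketched as trying each split point between words and asking the left child to accept the left part and the right child to accept the right part. The code below is a simplified standalone model: concat_match, is_number, and is_person are hypothetical, both children are required to match, and the permutations and separator options are not modelled.

```python
from typing import Callable, List


def concat_match(
    words: List[str],
    left: Callable[[str], bool],
    right: Callable[[str], bool],
    sep: str = " ",
) -> bool:
    """Accept the span if some word-level split satisfies both children."""
    # A single word offers no split point, so the loop body never runs.
    for i in range(1, len(words)):
        left_text = sep.join(words[:i])
        right_text = sep.join(words[i:])
        if left(left_text) and right(right_text):
            return True
    return False


def is_number(text: str) -> bool:
    return text.isdigit()


def is_person(text: str) -> bool:
    return text in {"Obama", "Barack Obama"}


print(concat_match(["10", "Obama"], is_number, is_person))
print(concat_match(["Obama", "10"], is_number, is_person))
```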

class fonduer.candidates.matchers.Intersect(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Take the intersection of mention sets returned by the provided Matchers.

Parameters: longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.

class fonduer.candidates.matchers.Inverse(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Return the opposite result of its child Matcher.

Raises: ValueError – If more than one Matcher is provided.
Parameters: longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.

Initialize inverse matcher.

class fonduer.candidates.matchers.Union(*children, **opts)[source]¶

Bases: fonduer.candidates.matchers._Matcher

Take the union of mention sets returned by the provided Matchers.

Parameters: longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.
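Viewed as predicates over a span, the three set operators reduce to any, all, and not. The sketch below composes plain functions in that spirit; union, intersect, and inverse here are hypothetical lowercase helpers, not the Fonduer classes, and longest_match_only is not modelled.

```python
from typing import Callable

Matcher = Callable[[str], bool]


def union(*matchers: Matcher) -> Matcher:
    """Accept a span if at least one child accepts it."""
    return lambda span: any(m(span) for m in matchers)


def intersect(*matchers: Matcher) -> Matcher:
    """Accept a span only if every child accepts it."""
    return lambda span: all(m(span) for m in matchers)


def inverse(matcher: Matcher) -> Matcher:
    """Accept a span exactly when the child rejects it."""
    return lambda span: not matcher(span)


def is_number(span: str) -> bool:
    return span.isdigit()


def is_short(span: str) -> bool:
    return len(span) <= 2


number_or_short = union(is_number, is_short)
short_number = intersect(is_number, is_short)
not_number = inverse(is_number)

print(number_or_short("12345"), short_number("12345"), not_number("abc"))
```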