Candidate Extraction
The second stage of Fonduer’s pipeline is to extract Mentions and Candidates from the data model.
Candidate Model Classes
The following describes the elements used for Mention and Candidate extraction.
Fonduer’s candidate model module.
-
class
fonduer.candidates.models.
Candidate
(**kwargs)[source]¶ Bases:
sqlalchemy.orm.decl_api.Base
An abstract candidate relation.
New relation types should be defined by calling candidate_subclass(), not subclassing this class directly.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
get_mentions
()[source]¶ Get a tuple of the constituent Mentions making up this Candidate.
- Return type
Tuple[Mention, …]
-
id
¶ The unique id for the
Candidate
.
-
split
¶ Which split the
Candidate
belongs to. Used to organize train/dev/test.
-
type
¶ The type for the
Candidate
, which corresponds to the names the user gives to the candidate_subclasses.
-
-
class
fonduer.candidates.models.
CaptionMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.caption_mention.TemporaryCaptionMention
A caption Mention.
Initialize CaptionMention.
-
caption
¶ The parent
Caption
.
-
caption_id
¶ The id of the parent
Caption
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
CaptionMention
.
-
-
class
fonduer.candidates.models.
CellMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.cell_mention.TemporaryCellMention
A cell Mention.
Initialize CellMention.
-
cell
¶ The parent
Cell
.
-
cell_id
¶ The id of the parent
Cell
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
CellMention
.
-
-
class
fonduer.candidates.models.
DocumentMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.document_mention.TemporaryDocumentMention
A document Mention.
Initialize DocumentMention.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
DocumentMention
.
-
-
class
fonduer.candidates.models.
FigureMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.figure_mention.TemporaryFigureMention
A figure Mention.
Initialize FigureMention.
-
figure
¶ The parent
Figure
.
-
figure_id
¶ The id of the parent
Figure
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
FigureMention
.
-
-
class
fonduer.candidates.models.
ImplicitSpanMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.implicit_span_mention.TemporaryImplicitSpanMention
A span of characters that may not appear verbatim in the source text.
It is identified by Context id, character-index start and end (inclusive), as well as a key representing what ‘expander’ function drew the ImplicitSpanMention from an existing SpanMention, and a position (where position=0 corresponds to the first ImplicitSpanMention produced from the expander function).
The character-index start and end point to the segment of text that was expanded to produce the ImplicitSpanMention.
Initialize ImplicitSpanMention.
-
char_end
¶ The ending character-index of the
ImplicitSpanMention
(inclusive).
-
char_start
¶ The starting character-index of the
ImplicitSpanMention
.
-
dep_labels
¶ A list of the dependency labels for each word in the
ImplicitSpanMention
.
-
dep_parents
¶ A list of the dependency parents for each word in the
ImplicitSpanMention
.
-
get_attrib_span
(a, sep='')¶ Get the span of sentence attribute a.
Intuitively, like calling:
sep.join(implicit_span.a)
- Parameters
a (str) – The attribute to get a span for.
sep (str) – The separator to use for the join, or to be removed from text if a="words".
- Return type
str
- Returns
The joined tokens, or text if a=”words”.
-
get_attrib_tokens
(a='words')¶ Get the tokens of sentence attribute a.
Intuitively, like calling:
implicit_span.a
- Parameters
a (str) – The attribute to get tokens for.
- Return type
List
- Returns
The tokens of sentence attribute defined by a for the span.
-
get_bbox
()¶ Get the bounding box.
- Return type
Bbox
-
get_num_words
()¶ Get the number of words in the span.
- Return type
int
- Returns
The number of words in the span (n of the ngrams).
-
get_span
()¶ Return the text of the Span.
- Return type
str
- Returns
The text of the
Span
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
get_word_end_index
()¶ Get the index of the ending word of the span.
- Return type
int
- Returns
The word-index of the last word of the span.
-
get_word_start_index
()¶ Get the index of the starting word of the span.
- Return type
int
- Returns
The word-index of the start of the span.
-
id
¶ The unique id of the
ImplicitSpanMention
.
-
lemmas
¶ A list of the lemmas for each word in the
ImplicitSpanMention
.
-
meta
¶ Pickled metadata about the
ImplicitSpanMention
.
-
ner_tags
¶ A list of the NER tags for each word in the
ImplicitSpanMention
.
-
page
¶ A list of the page number of each word in the
ImplicitSpanMention
.
-
pos_tags
¶ A list of the POS tags for each word in the
ImplicitSpanMention
.
-
position
¶ The position of the ImplicitSpanMention, where position=0 is the first ImplicitSpanMention produced by the expander.
-
sentence
¶ The parent
Sentence
.
-
sentence_id
¶ The id of the parent
Sentence
.
-
text
¶ The raw text of the
ImplicitSpanMention
.
-
words
¶ A list of the words in the
ImplicitSpanMention
.
-
-
class
fonduer.candidates.models.
Mention
(**kwargs)[source]¶ Bases:
sqlalchemy.orm.decl_api.Base
An abstract Mention.
New mention types should be defined by calling mention_subclass(), not subclassing this class directly.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
get_contexts
()[source]¶ Get the constituent context making up this mention.
- Return type
Tuple
[Context
, …]
-
id
¶ The unique id of the
Mention
.
-
type
¶ The type for the
Mention
, which corresponds to the names the user gives to the mention_subclass.
-
-
class
fonduer.candidates.models.
ParagraphMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.paragraph_mention.TemporaryParagraphMention
A paragraph Mention.
Initialize ParagraphMention.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
ParagraphMention
.
-
paragraph
¶ The parent
Paragraph
.
-
paragraph_id
¶ The id of the parent
Paragraph
.
-
-
class
fonduer.candidates.models.
SectionMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.section_mention.TemporarySectionMention
A section Mention.
Initialize SectionMention.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
SectionMention
.
-
section
¶ The parent
Section
.
-
section_id
¶ The id of the parent
Section
.
-
-
class
fonduer.candidates.models.
SpanMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.span_mention.TemporarySpanMention
A span of chars, identified by Context ID and char-index start, end (inclusive).
char_offsets are relative to the Context start.
Initialize SpanMention.
-
char_end
¶ The ending character-index of the
SpanMention
(inclusive).
-
char_start
¶ The starting character-index of the
SpanMention
.
-
get_attrib_span
(a, sep='')¶ Get the span of sentence attribute a.
Intuitively, like calling:
sep.join(span.a)
- Parameters
a (str) – The attribute to get a span for.
sep (str) – The separator to use for the join, or to be removed from text if a="words".
- Return type
str
- Returns
The joined tokens, or text if a=”words”.
-
get_attrib_tokens
(a='words')¶ Get the tokens of sentence attribute a.
Intuitively, like calling:
span.a
- Parameters
a (str) – The attribute to get tokens for.
- Return type
List
- Returns
The tokens of sentence attribute defined by a for the span.
-
get_bbox
()¶ Get the bounding box.
- Return type
Bbox
-
get_num_words
()¶ Get the number of words in the span.
- Return type
int
- Returns
The number of words in the span (n of the ngrams).
-
get_span
()¶ Return the text of the Span.
- Return type
str
- Returns
The text of the
Span
.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
get_word_end_index
()¶ Get the index of the ending word of the span.
- Return type
int
- Returns
The word-index of the last word of the span.
-
get_word_start_index
()¶ Get the index of the starting word of the span.
- Return type
int
- Returns
The word-index of the start of the span.
-
id
¶ The unique id of the
SpanMention
.
-
meta
¶ Pickled metadata about the
SpanMention
.
-
sentence
¶ The parent
Sentence
.
-
sentence_id
¶ The id of the parent
Sentence
.
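The inclusive end offset described for SpanMention is easy to get wrong when slicing in Python, so a minimal sketch may help. The class `SpanSketch` below is a hypothetical stand-in written for illustration only; it is not part of Fonduer and it assumes the offsets are relative to the start of the sentence text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for a span; not a Fonduer class."""

    sentence_text: str
    char_start: int
    char_end: int  # inclusive, as documented for SpanMention

    def get_span(self) -> str:
        # The end offset is inclusive, so add 1 for Python's exclusive slicing.
        return self.sentence_text[self.char_start : self.char_end + 1]


span = SpanSketch("The BC546 transistor", char_start=4, char_end=8)
print(span.get_span())
```

With an inclusive end, the length of the span text is `char_end - char_start + 1`.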
-
-
class
fonduer.candidates.models.
TableMention
(tc)[source]¶ Bases:
fonduer.parser.models.context.Context
,fonduer.candidates.models.table_mention.TemporaryTableMention
A table Mention.
Initialize TableMention.
-
get_stable_id
()¶ Return a stable id.
- Return type
str
-
id
¶ The unique id of the
TableMention
.
-
table
¶ The parent
Table
.
-
table_id
¶ The id of the parent
Table
.
-
-
fonduer.candidates.models.
candidate_subclass
(class_name, args, table_name=None, cardinality=None, values=None, nullables=None)[source]¶ Create new relation.
Creates and returns a Candidate subclass with the provided argument names, which are of Context type. Creates the table in the DB if it does not exist yet.
Import using:
from fonduer.candidates.models import candidate_subclass
- Parameters
class_name (str) – The name of the class, which should be in camel case, e.g. NewCandidate.
args (List[Mention]) – A list of names of constituent arguments, which refer to the Contexts (representing mentions) that comprise the candidate.
table_name (Optional[str]) – The name of the corresponding table in the DB; if not provided, it is converted from camel case by default, e.g. new_candidate.
cardinality (Optional[int]) – The cardinality of the variable corresponding to the Candidate. By default it is 2, i.e. a binary value, e.g. is or is not a true mention.
values (Optional[List[Any]]) – A list of values a candidate can take as its label.
nullables (Optional[List[bool]]) – The number of nullables must match that of args. If nullables[i]==True, the mention for the ith mention subclass can be NULL. If nullables=None (the default), no mention can be NULL.
- Return type
Type
[Candidate
]
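The default table naming mentioned for table_name (camel case converted to names such as new_candidate) can be sketched with two regular-expression passes. The helper name `camel_to_snake` is an assumption made for this illustration, and Fonduer's own conversion routine may differ in detail.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camel-case class name to a snake-case table name (illustrative)."""
    # Put an underscore before each capitalized word that follows another character.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Put an underscore between a lowercase letter or digit and an uppercase letter.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


print(camel_to_snake("NewCandidate"))
print(camel_to_snake("PartTemp"))
```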
-
fonduer.candidates.models.
mention_subclass
(class_name, cardinality=None, values=None, table_name=None)[source]¶ Create new mention.
Creates and returns a Mention subclass with the provided argument names, which are of Context type. Creates the table in the DB if it does not exist yet.
Import using:
from fonduer.candidates.models import mention_subclass
- Parameters
class_name (str) – The name of the class, which should be in camel case, e.g. NewMention.
table_name (Optional[str]) – The name of the corresponding table in the DB; if not provided, it is converted from camel case by default, e.g. new_mention.
values (Optional[List[Any]]) – The values that the variable corresponding to the Mention can take. By default it will be [True, False].
cardinality (Optional[int]) – The cardinality of the variable corresponding to the Mention. By default it is 2, i.e. a binary value, e.g. is or is not a true mention.
- Return type
Type
[Mention
]
Core Objects
These are Fonduer’s core objects used for Mention and Candidate extraction.
-
class
fonduer.candidates.
MentionExtractor
(session, mention_classes, mention_spaces, matchers, parallelism=1)[source]¶ Bases:
fonduer.utils.udf.UDFRunner
An operator to extract Mention objects from a Context.
- Example
Assuming we want to extract two types of Mentions, a Part and a Temperature, and we have already defined Matchers to use:
    part_ngrams = MentionNgrams(n_max=3)
    temp_ngrams = MentionNgrams(n_max=2)
    Part = mention_subclass("Part")
    Temp = mention_subclass("Temp")
    mention_extractor = MentionExtractor(
        session,
        [Part, Temp],
        [part_ngrams, temp_ngrams],
        [part_matcher, temp_matcher],
    )
- Parameters
session (Session) – An initialized database session.
mention_classes (List[Mention]) – The types of Mention to extract, defined using mention_subclass().
mention_spaces (List[MentionSpace]) – One MentionSpace object or a list of them, one for each relation argument. Defines the space of Contexts to consider.
matchers (List[_Matcher]) – One fonduer.matchers.Matcher object or a list of them, one for each relation argument. Only tuples of Contexts for which each element is accepted by the corresponding Matcher will be returned as Mentions.
parallelism (int) – The number of processes to use in parallel for calls to apply().
- Raises
ValueError – If mention classes, spaces, and matchers are not the same length.
Initialize the MentionExtractor.
-
apply
(docs, clear=True, parallelism=None, progress_bar=True)[source]¶ Run the MentionExtractor.
- Example
To extract mentions from a set of training documents using 4 cores:
mention_extractor.apply(train_docs, parallelism=4)
- Parameters
docs (Collection[Document]) – Set of documents to extract from.
clear (bool) – Whether or not to clear the existing Mentions beforehand.
parallelism (Optional[int]) – How many threads to use for extraction. This will override the parallelism value used to initialize the MentionExtractor if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
- Return type
None
-
clear
()[source]¶ Delete Mentions of each class in the extractor from the given split.
- Return type
None
-
get_mentions
(docs=None, sort=False)[source]¶ Return a list of lists of the mentions associated with this extractor.
Each list of the return will contain the Mentions for one of the mention classes associated with the MentionExtractor.
- Parameters
docs (Union[Document, Iterable[Document], None]) – If provided, return Mentions from these documents. Else, return all Mentions.
sort (bool) – If sort is True, then return all Mentions sorted by stable_id.
- Return type
List[List[Mention]]
- Returns
Mentions for each mention_class.
-
last_docs
: Set[str]¶ The last set of documents that apply() was called on
-
class
fonduer.candidates.
CandidateExtractor
(session, candidate_classes, throttlers=None, self_relations=False, nested_relations=False, symmetric_relations=True, parallelism=1)[source]¶ Bases:
fonduer.utils.udf.UDFRunner
An operator to extract Candidate objects from a Context.
- Example
Assuming we have already defined a Part and Temp Mention subclass, and a throttler called temp_throttler, we can create a candidate extractor as follows:
    PartTemp = candidate_subclass("PartTemp", [Part, Temp])
    candidate_extractor = CandidateExtractor(
        session, [PartTemp], throttlers=[temp_throttler]
    )
- Parameters
session (Session) – An initialized database session.
candidate_classes (List[Type[Candidate]]) – The types of relation to extract, defined using fonduer.candidates.candidate_subclass().
throttlers (list of throttlers) – Optional functions for filtering out candidates, each of which returns a Boolean expressing whether or not the candidate should be instantiated.
self_relations (bool) – Boolean indicating whether to extract Candidates that relate the same context. Only applies to binary relations.
nested_relations (bool) – Boolean indicating whether to extract Candidates that relate one Context with another that contains it. Only applies to binary relations.
symmetric_relations (bool) – Boolean indicating whether to extract symmetric Candidates, i.e., rel(A,B) and rel(B,A), where A and B are Contexts. Only applies to binary relations.
parallelism (int) – The number of processes to use in parallel for calls to apply().
- Raises
ValueError – If throttlers are provided, but the number of throttlers does not match the number of candidate classes.
Set throttlers to match candidate_classes if not provided.
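To make the role of a throttler concrete, the following is a minimal plain-Python sketch of filtering a cross product of mentions with a Boolean function. It is not Fonduer's implementation: the function `extract_candidates`, the `(text, page)` tuples, and the `same_page` throttler are all hypothetical names and data chosen for illustration, and the self, nested, and symmetric relation options are not modelled.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

MentionSketch = Tuple[str, int]  # hypothetical (text, page) stand-in for a mention
CandidateSketch = Tuple[MentionSketch, ...]


def extract_candidates(
    mention_lists: Sequence[Sequence[MentionSketch]],
    throttler: Optional[Callable[[CandidateSketch], bool]] = None,
) -> List[CandidateSketch]:
    """Build every combination of mentions and keep those the throttler accepts."""
    kept = []
    for combo in product(*mention_lists):
        # A throttler returns True when the candidate should be instantiated.
        if throttler is not None and not throttler(combo):
            continue
        kept.append(combo)
    return kept


def same_page(candidate: CandidateSketch) -> bool:
    """Hypothetical throttler: keep pairs whose mentions share a page."""
    part, temp = candidate
    return part[1] == temp[1]


parts = [("BC546", 1), ("BC547", 2)]
temps = [("150", 1), ("-65", 3)]
throttled = extract_candidates([parts, temps], throttler=same_page)
unthrottled = extract_candidates([parts, temps])
print(len(unthrottled), throttled)
```

Filtering at this stage keeps implausible pairs from ever being stored, which is the motivation for supplying one throttler per candidate class.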
-
apply
(docs, split=0, clear=True, parallelism=None, progress_bar=True)[source]¶ Run the CandidateExtractor.
- Example
To extract candidates from a set of training documents using 4 cores:
candidate_extractor.apply(train_docs, split=0, parallelism=4)
- Parameters
docs (Collection[Document]) – Set of documents to extract from.
split (int) – Which split to assign the extracted Candidates to.
clear (bool) – Whether or not to clear the existing Candidates beforehand.
parallelism (Optional[int]) – How many threads to use for extraction. This will override the parallelism value used to initialize the CandidateExtractor if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
- Return type
None
-
clear
(split)[source]¶ Clear Candidates of each class.
Delete Candidates of each class initialized with the CandidateExtractor from the given split in the database.
- Parameters
split (int) – Which split to clear.
- Return type
None
-
clear_all
(split)[source]¶ Delete ALL Candidates from the given split in the database.
- Parameters
split (int) – Which split to clear.
- Return type
None
-
get_candidates
(docs=None, split=0, sort=False)[source]¶ Return a list of lists of the candidates associated with this extractor.
Each list of the return will contain the candidates for one of the candidate classes associated with the CandidateExtractor.
- Parameters
docs (Union[Document, Iterable[Document], None]) – If provided, return candidates from these documents from all splits.
split (int) – If docs is None, then return all the candidates from this split.
sort (bool) – If sort is True, then return all candidates sorted by stable_id.
- Return type
List[List[Candidate]]
- Returns
Candidates for each candidate_class.
-
last_docs
: Set[str]¶ The last set of documents that apply() was called on
MentionSpaces
A MentionSpace defines the space of mentions, i.e., the set of all possible mentions. Depending on your needs, you can use a pre-defined child class of MentionSpace or extend one.
-
class
fonduer.candidates.mentions.
MentionSpace
[source]¶ Bases:
object
Define the space of Mention objects.
Calling apply(x) given an object x returns a generator over mentions in x.
Initialize mention space.
-
class
fonduer.candidates.mentions.
Ngrams
(n_min=1, n_max=5, split_tokens=[])[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Define the space of Mentions as all n-grams in a Sentence.
Define the space of Mentions as all n-grams (n_min <= n <= n_max) in a Sentence x, indexing by character offset.
- Parameters
n_min (int) – Lower limit for the generated n_grams.
n_max (int) – Upper limit for the generated n_grams.
split_tokens (tuple or list of str) – Tokens, on which unigrams are split into two separate unigrams.
Initialize Ngrams.
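As a rough sketch of what "all n-grams (n_min <= n <= n_max), indexing by character offset" means, the plain-Python generator below enumerates inclusive character offsets for every n-gram of a tokenized sentence. The function `ngram_offsets` is hypothetical, it ignores split_tokens, and the order in which it yields spans is not meant to mirror Fonduer.

```python
from typing import Iterator, List, Tuple


def ngram_offsets(
    words: List[str], char_offsets: List[int], n_min: int = 1, n_max: int = 5
) -> Iterator[Tuple[int, int]]:
    """Yield (char_start, char_end) pairs, end inclusive, for every n-gram."""
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            last = i + n - 1
            start = char_offsets[i]
            # End offset is the final character of the last word in the n-gram.
            end = char_offsets[last] + len(words[last]) - 1
            yield start, end


text = "low noise amplifier"
words = text.split()
offsets = [0, 4, 10]  # starting character index of each word in `text`
spans = [text[s : e + 1] for s, e in ngram_offsets(words, offsets, n_min=1, n_max=2)]
print(spans)
```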
-
class
fonduer.candidates.mentions.
MentionNgrams
(n_min=1, n_max=5, split_tokens=[])[source]¶ Bases:
fonduer.candidates.mentions.Ngrams
Defines the space of Mentions as n-grams in a Document.
Defines the space of Mentions as all n-grams (n_min <= n <= n_max) in a Document x, divided into Sentences inside of html elements (such as table cells).
- Parameters
n_min (int) – Lower limit for the generated n_grams.
n_max (int) – Upper limit for the generated n_grams.
split_tokens (tuple or list of str) – Tokens, on which unigrams are split into two separate unigrams.
Initialize MentionNgrams.
-
class
fonduer.candidates.mentions.
MentionFigures
(types=None)[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all figures in a Document x.
- Parameters
types (list, tuple of str) – If specified, only yield TemporaryFigureMentions whose url ends in one of the specified types. Example: types=[“png”, “jpg”, “jpeg”].
Initialize MentionFigures.
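The effect of the types argument can be pictured as a suffix filter over figure URLs. The sketch below is an illustration under assumptions, not Fonduer code: `filter_figure_urls` is a hypothetical helper, and comparing in lowercase is an assumption that the description above does not state.

```python
from typing import Iterable, List, Optional


def filter_figure_urls(urls: Iterable[str], types: Optional[List[str]] = None) -> List[str]:
    """Keep URLs ending in one of `types`; keep everything when `types` is None."""
    if types is None:
        return list(urls)
    # Assumption: the comparison ignores case.
    suffixes = tuple(t.lower() for t in types)
    return [url for url in urls if url.lower().endswith(suffixes)]


urls = ["plots/a.png", "plots/b.gif", "plots/C.JPG"]
print(filter_figure_urls(urls, types=["png", "jpg", "jpeg"]))
```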
-
class
fonduer.candidates.mentions.
MentionSentences
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all sentences in a Document x.
Initialize MentionSentences.
-
class
fonduer.candidates.mentions.
MentionParagraphs
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all paragraphs in a Document x.
Initialize MentionParagraphs.
-
class
fonduer.candidates.mentions.
MentionCaptions
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all captions in a Document x.
Initialize MentionCaptions.
-
class
fonduer.candidates.mentions.
MentionCells
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all cells in a Document x.
Initialize MentionCells.
-
class
fonduer.candidates.mentions.
MentionTables
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all tables in a Document x.
Initialize MentionTables.
-
class
fonduer.candidates.mentions.
MentionSections
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as all sections in a Document x.
Initialize MentionSections.
-
class
fonduer.candidates.mentions.
MentionDocuments
[source]¶ Bases:
fonduer.candidates.mentions.MentionSpace
Defines the space of Mentions as the Document x itself.
Initialize MentionDocuments.
Matchers
This shows the matchers included with Fonduer. These matchers can be used alone, or combined together, to define what spans of text should be made into Mentions.
-
class
fonduer.candidates.matchers.
DateMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are dates, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a date (DATE).
Initialize date matcher.
-
class
fonduer.candidates.matchers.
DictionaryMatch
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Select mention Ngrams that match against a given list d.
- Parameters
d (list of str) – A list of strings representing a dictionary.
ignore_case (bool) – Whether to ignore the case when matching. Default True.
inverse (bool) – Whether to invert the results (e.g., return those which are not in the list). Default False.
stemmer – Optionally provide a stemmer to preprocess the dictionary. Can be any object which has a stem(str) -> str method, like PorterStemmer(). Default None.
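The interaction of ignore_case and inverse can be sketched as a simple membership test. This is a simplified illustration, not the DictionaryMatch implementation: the function `dictionary_match` is hypothetical and it leaves out the optional stemmer.

```python
from typing import Iterable


def dictionary_match(
    text: str, d: Iterable[str], ignore_case: bool = True, inverse: bool = False
) -> bool:
    """Return whether `text` is in the dictionary `d`, optionally inverted."""
    entries = {entry.lower() for entry in d} if ignore_case else set(d)
    probe = text.lower() if ignore_case else text
    found = probe in entries
    # With inverse=True, accept exactly the texts that are not in the dictionary.
    return not found if inverse else found


part_numbers = ["BC546", "BC547"]
print(dictionary_match("bc546", part_numbers))
print(dictionary_match("bc546", part_numbers, ignore_case=False))
```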
-
class
fonduer.candidates.matchers.
LambdaFunctionFigureMatcher
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Select Figures that return True when fed to a function f.
- Parameters
func (function) – The function to evaluate. See
LambdaFunctionMatcher
for details.
-
class
fonduer.candidates.matchers.
LambdaFunctionMatcher
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Select Ngrams that return True when fed to a function f.
- Parameters
func (function) – The function to evaluate with a signature of f: m -> {True, False}, where m denotes a mention. More precisely, m is an instance of a child class of TemporaryContext, depending on which MentionSpace is used. E.g., TemporarySpanMention when MentionNgrams is used.
longest_match_only (bool) – Whether to only return the longest span matched, rather than all spans. Default False.
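A function suitable for the f: m -> {True, False} signature might look like the sketch below. `FakeSpan` is a hypothetical stand-in that exposes only get_span(), so that the function can be exercised without a database; in real use m would be a temporary mention object supplied by the MentionSpace.

```python
class FakeSpan:
    """Hypothetical stand-in exposing only get_span(); not a Fonduer class."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_span(self) -> str:
        return self._text


def is_integer_span(m) -> bool:
    """Return True when the span text parses as an integer, e.g. '150' or '-65'."""
    try:
        int(m.get_span())
    except ValueError:
        return False
    return True


print(is_integer_span(FakeSpan("-65")), is_integer_span(FakeSpan("BC546")))
```

Such a function would then be handed to the matcher through its func parameter.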
-
class
fonduer.candidates.matchers.
LocationMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are the names of locations, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a location (GPE or LOC).
Initialize location matcher.
-
class
fonduer.candidates.matchers.
MiscMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are miscellaneous named entities, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as miscellaneous (MISC).
Initialize miscellaneous matcher.
-
class
fonduer.candidates.matchers.
NumberMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are numbers, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a number (CARDINAL or QUANTITY).
Initialize number matcher.
-
class
fonduer.candidates.matchers.
OrganizationMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are the names of organizations, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as an organization (NORG or ORG).
Initialize organization matcher.
-
class
fonduer.candidates.matchers.
PersonMatcher
(*children, **kwargs)[source]¶ Bases:
fonduer.candidates.matchers.RegexMatchEach
Match Spans that are the names of people, as identified by spaCy.
A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a person (PERSON).
Initialize person matcher.
-
class
fonduer.candidates.matchers.
RegexMatchEach
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._RegexMatch
Match regex pattern on each token.
- Parameters
rgx (str) – The RegEx pattern to use.
ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
longest_match_only (bool) – If True, only return the longest match. Default True.
-
class
fonduer.candidates.matchers.
RegexMatchSpan
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._RegexMatch
Match regex pattern on full concatenated span.
- Parameters
rgx (str) – The RegEx pattern to use.
ignore_case (bool) – Whether or not to ignore case in the RegEx. Default True.
search (bool) – If True, search the regex pattern through the concatenated span. If False, try to match the regex pattern only at its beginning. Default False.
full_match (bool) – If True, wrap the provided rgx with (<rgx>)$. Default True.
longest_match_only (bool) – If True, only return the longest match. Default True. Will be overridden by the parent matcher when it is wrapped by Union, Intersect, or Inverse.
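The difference between matching each token and matching the concatenated span can be sketched with Python's re module. The helpers below are hypothetical and simplified: they apply the documented (<rgx>)$ wrapping for full_match, but joining tokens with a single space is an assumption and the search and longest_match_only options are not modelled.

```python
import re
from typing import List


def wrap_full_match(rgx: str) -> str:
    # Mirrors the documented "(<rgx>)$" wrapping used when full_match is True.
    return f"({rgx})$"


def match_each(tokens: List[str], rgx: str, ignore_case: bool = True) -> bool:
    """Return True only if every token matches the pattern on its own."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(wrap_full_match(rgx), flags)
    return all(pattern.match(token) is not None for token in tokens)


def match_span(tokens: List[str], rgx: str, sep: str = " ", ignore_case: bool = True) -> bool:
    """Return True if the joined span as a whole matches the pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(wrap_full_match(rgx), flags)
    return pattern.match(sep.join(tokens)) is not None


tokens = ["BC", "546"]
print(match_each(tokens, r"[A-Z]+"), match_span(tokens, r"[A-Z]+ \d+"))
```

A per-token pattern has to describe any single token, while a span pattern has to describe the whole joined text, separators included.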
Matcher Operators
These are the operators which can be used to compose matchers.
-
class
fonduer.candidates.matchers.
Concat
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Concatenate mentions generated by Matchers.
Select mentions which are the concatenation of adjacent matches from child operators.
- Example
A concatenation of a NumberMatcher and PersonMatcher could match on a span of text like “10 Obama”.
- Parameters
permutations (bool) – Default False.
left_required (bool) – Whether or not to require the left child to match. Default True.
right_required (bool) – Whether or not to require the right child to match. Default True.
ignore_sep (bool) – Whether or not to ignore the separator. Default True.
sep – If not ignoring the separator, specify which separator to look for. Default sep=" ".
- Raises
ValueError – If Concat is not provided with two child matcher objects.
Note
Currently slices on word index and considers concatenation along these divisions only.
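The note about slicing on word index can be illustrated with a small sketch that tries every word-level split of a span and checks the two halves against a left and a right predicate. This is a simplified picture, not the Concat implementation: `concat_match` and both predicates are hypothetical, and the permutations, left_required, right_required, and separator options are not modelled.

```python
from typing import Callable, List

TokenPredicate = Callable[[List[str]], bool]


def concat_match(tokens: List[str], left: TokenPredicate, right: TokenPredicate) -> bool:
    """Return True if some word-index split yields a matching left and right part."""
    # Only splits between words are considered, never splits inside a word.
    for i in range(1, len(tokens)):
        if left(tokens[:i]) and right(tokens[i:]):
            return True
    return False


def is_number(tokens: List[str]) -> bool:
    return all(token.isdigit() for token in tokens)


def is_capitalized(tokens: List[str]) -> bool:
    return all(token[:1].isupper() for token in tokens)


print(concat_match(["10", "Obama"], is_number, is_capitalized))
```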
-
class
fonduer.candidates.matchers.
Intersect
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Take the intersection of mention sets returned by the provided Matchers.
- Parameters
longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.
-
class
fonduer.candidates.matchers.
Inverse
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Return the opposite result of its child Matcher.
- Raises
ValueError – If more than one Matcher is provided.
- Parameters
longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.
Initialize inverse matcher.
-
class
fonduer.candidates.matchers.
Union
(*children, **opts)[source]¶ Bases:
fonduer.candidates.matchers._Matcher
Take the union of mention sets returned by the provided Matchers.
- Parameters
longest_match_only (bool) – If True, only return the longest match. Default True. Overrides longest_match_only of its child Matchers.
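The set semantics of Union, Intersect, and Inverse can be sketched by composing plain Boolean predicates. The functions below are hypothetical stand-ins that operate on strings rather than mentions, and they do not model longest_match_only.

```python
from typing import Callable

Predicate = Callable[[str], bool]


def union(*children: Predicate) -> Predicate:
    """Accept a text if any child accepts it."""
    return lambda text: any(child(text) for child in children)


def intersect(*children: Predicate) -> Predicate:
    """Accept a text only if every child accepts it."""
    return lambda text: all(child(text) for child in children)


def inverse(child: Predicate) -> Predicate:
    """Accept exactly the texts that the single child rejects."""
    return lambda text: not child(text)


def is_digits(text: str) -> bool:
    return text.isdigit()


def is_short(text: str) -> bool:
    return len(text) <= 3


print(union(is_digits, is_short)("abc"), intersect(is_digits, is_short)("abc"))
```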