Data Model Utilities

This page describes the utility functions included with Fonduer that can be used to label candidates based on textual, structural, tabular, and visual information. The utilities are grouped by the modality of information they leverage.


General Data Model Utilities

fonduer.utils.data_model_utils.utils.get_matches(lf, candidate_set, match_values=[1, -1])[source]

Return a list of candidates that are matched by a particular LF.

A simple helper function to see how many matches (non-zero by default) an LF gets.

Parameters:
  • lf – The labeling function to apply to the candidate_set
  • candidate_set – The set of candidates to evaluate
  • match_values – An optional list of the values to consider as matched. [1, -1] by default.
Return type:

a list of candidates
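
As a rough illustration of the behavior described above, the sketch below applies a labeling function to each candidate and keeps those whose label is in match_values. It is a plain-Python stand-in and not Fonduer's implementation; get_matches_sketch, lf_contains_volt, and the string candidates are hypothetical names chosen for the example.

```python
def get_matches_sketch(lf, candidate_set, match_values=(1, -1)):
    """Return the candidates whose LF label is one of match_values."""
    return [c for c in candidate_set if lf(c) in match_values]


def lf_contains_volt(candidate):
    # Hypothetical LF: vote 1 if "volt" appears, otherwise abstain with 0.
    return 1 if "volt" in candidate.lower() else 0


# Plain strings stand in for real candidates here.
candidates = ["5 Volts", "10 mA", "3.3 volt rail"]
matches = get_matches_sketch(lf_contains_volt, candidates)
print(len(matches), "of", len(candidates), "candidates matched")
```

Passing match_values=[0] instead would list the candidates on which the LF abstains.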

fonduer.utils.data_model_utils.utils.is_superset(a, b)[source]

Check if a is a superset of b.

This is typically used to check if ALL of a list of sentences is in the ngrams returned by an lf_helper.

Parameters:
  • a – A collection of items
  • b – A collection of items
Return type:

boolean

fonduer.utils.data_model_utils.utils.overlap(a, b)[source]

Check if a overlaps b.

This is typically used to check if ANY of a list of sentences is in the ngrams returned by an lf_helper.

Parameters:
  • a – A collection of items
  • b – A collection of items
Return type:

boolean
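
The ALL versus ANY distinction between is_superset and overlap can be sketched with Python sets. This is an assumption about the semantics described above, not Fonduer's source; is_superset_sketch, overlap_sketch, and the row_ngrams list are hypothetical.

```python
def is_superset_sketch(a, b):
    """True if every item of b is also in a."""
    return set(a).issuperset(b)


def overlap_sketch(a, b):
    """True if a and b share at least one item."""
    return not set(a).isdisjoint(b)


# Stand-in for the ngrams an lf_helper might return for one candidate.
row_ngrams = ["collector", "current", "ic", "max"]

# ALL of the keywords must appear in the ngrams.
all_present = is_superset_sketch(row_ngrams, ["collector", "current"])
# ANY one of the keywords is enough.
any_present = overlap_sketch(["voltage", "current"], row_ngrams)
none_present = overlap_sketch(["voltage", "power"], row_ngrams)
```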

Textual Data Model Utilities

fonduer.utils.data_model_utils.textual.get_between_ngrams(c, attrib='words', n_min=1, n_max=1, lower=True)[source]

Return the ngrams between two unary Mentions of a binary-Mention Candidate.

Get the ngrams between two unary Mentions of a binary-Mention Candidate, where both share the same sentence Context.

Parameters:
  • c – The binary-Mention Candidate to evaluate.
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams
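
To make the n_min, n_max, and lower parameters concrete, here is a minimal sketch that works on a plain token list. It assumes each Mention is given as an inclusive (start, end) pair of token indices in one sentence; ngrams_sketch and between_ngrams_sketch are hypothetical helpers, and the order in which Fonduer yields ngrams may differ.

```python
def ngrams_sketch(tokens, n_min=1, n_max=1, lower=True):
    """Yield space-joined ngrams of length n_min..n_max from tokens."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i : i + n])
            yield ngram.lower() if lower else ngram


def between_ngrams_sketch(tokens, span_a, span_b, n_min=1, n_max=1, lower=True):
    """Yield ngrams strictly between two inclusive (start, end) token spans."""
    first, second = sorted([span_a, span_b])
    between = tokens[first[1] + 1 : second[0]]
    yield from ngrams_sketch(between, n_min, n_max, lower)


words = ["BC546", "has", "a", "Maximum", "voltage", "of", "65", "V"]
# Mention 1 covers token 0 ("BC546"); Mention 2 covers tokens 6-7 ("65 V").
result = list(between_ngrams_sketch(words, (0, 0), (6, 7), n_max=2))
```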

fonduer.utils.data_model_utils.textual.get_left_ngrams(mention, window=3, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams within a window to the left from the sentence Context.

For higher-arity Candidates, defaults to the first argument.

Parameters:
  • mention – The Mention to evaluate. If a candidate is given, default to its first Mention.
  • window – The number of tokens to the left of the first argument to return.
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.textual.get_right_ngrams(mention, window=3, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams within a window to the right from the sentence Context.

For higher-arity Candidates, defaults to the last argument.

Parameters:
  • mention – The Mention to evaluate. If a candidate is given, default to its last Mention.
  • window – The number of tokens to the right of the last argument to return
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams
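
The window parameter of the left and right helpers can be pictured as a slice of the sentence's tokens on either side of the Mention. The sketch below handles unigrams only and assumes the Mention is identified by its token indices; left_window_sketch and right_window_sketch are hypothetical and are not part of Fonduer.

```python
def left_window_sketch(tokens, mention_start, window=3, lower=True):
    """Return up to `window` tokens immediately left of the mention."""
    left = tokens[max(0, mention_start - window) : mention_start]
    return [t.lower() for t in left] if lower else left


def right_window_sketch(tokens, mention_end, window=3, lower=True):
    """Return up to `window` tokens immediately right of the mention."""
    right = tokens[mention_end + 1 : mention_end + 1 + window]
    return [t.lower() for t in right] if lower else right


words = ["The", "Maximum", "collector", "current", "is", "100", "mA", "at", "25", "C"]
# The mention "100 mA" covers token indices 5..6 (inclusive).
left = left_window_sketch(words, mention_start=5)
right = right_window_sketch(words, mention_end=6)
```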

Structural Data Model Utilities

fonduer.utils.data_model_utils.structural.common_ancestor(c)[source]

Return the path to the root that is shared between a binary-Mention Candidate.

In particular, this is the common path of HTML tags.

Parameters:c – The binary-Mention Candidate to evaluate
Return type:list of strings
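
One way to picture the shared path is as the common prefix of two root-to-node tag paths. The sketch below assumes each Mention's position is already available as a list of path segments; common_ancestor_sketch and the example paths are hypothetical.

```python
def common_ancestor_sketch(path_a, path_b):
    """Return the shared leading portion of two root-to-node tag paths."""
    shared = []
    for tag_a, tag_b in zip(path_a, path_b):
        if tag_a != tag_b:
            break
        shared.append(tag_a)
    return shared


# Hypothetical tag paths for the two Mentions of one candidate.
path_1 = ["html", "body", "table", "tr[1]", "td[1]"]
path_2 = ["html", "body", "table", "tr[2]", "td[3]"]
shared_path = common_ancestor_sketch(path_1, path_2)
```
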
fonduer.utils.data_model_utils.structural.get_ancestor_class_names(mention)[source]

Return the HTML classes of the Mention’s ancestors.

If a candidate is passed in, only the ancestors of its first Mention are returned.

Parameters:mention – The Mention to evaluate
Return type:list of strings
fonduer.utils.data_model_utils.structural.get_ancestor_id_names(mention)[source]

Return the HTML ids of the Mention’s ancestors.

If a candidate is passed in, only the ancestors of its first Mention are returned.

Parameters:mention – The Mention to evaluate
Return type:list of strings
fonduer.utils.data_model_utils.structural.get_ancestor_tag_names(mention)[source]

Return the HTML tags of the Mention’s ancestors.

For example, [‘html’, ‘body’, ‘p’]. If a candidate is passed in, only the ancestors of its first Mention are returned.

Parameters:mention – The Mention to evaluate
Return type:list of strings
fonduer.utils.data_model_utils.structural.get_attributes(mention)[source]

Return the HTML attributes of the Mention.

If a candidate is passed in, only the attributes of its first Mention are returned.

A sample output of this function on a Mention in a paragraph tag is [u’style=padding-top: 8pt;padding-left: 20pt;text-indent: 0pt;text-align: left;’]

Parameters:mention – The Mention to evaluate
Return type:list of strings representing HTML attributes
fonduer.utils.data_model_utils.structural.get_next_sibling_tags(mention)[source]

Return the HTML tags of the Mention’s next siblings.

Next siblings are Mentions which are at the same level in the HTML tree as the given mention, but are declared after the given mention. If a candidate is passed in, only the next siblings of its last Mention are considered in the calculation.

Parameters:mention – The Mention to evaluate
Return type:list of strings
fonduer.utils.data_model_utils.structural.get_parent_tag(mention)[source]

Return the HTML tag of the Mention’s parent.

These may be tags such as ‘p’, ‘h2’, ‘table’, ‘div’, etc. If a candidate is passed in, only the tag of its first Mention is returned.

Parameters:mention – The Mention to evaluate
Return type:string
fonduer.utils.data_model_utils.structural.get_prev_sibling_tags(mention)[source]

Return the HTML tags of the Mention’s previous siblings.

Previous siblings are Mentions which are at the same level in the HTML tree as the given mention, but are declared before the given mention. If a candidate is passed in, only the previous siblings of its first Mention are considered in the calculation.

Parameters:mention – The Mention to evaluate
Return type:list of strings
fonduer.utils.data_model_utils.structural.get_tag(mention)[source]

Return the HTML tag of the Mention.

If a candidate is passed in, only the tag of its first Mention is returned.

These may be tags such as ‘p’, ‘h2’, ‘table’, ‘div’, etc.

Parameters:mention – The Mention to evaluate
Return type:string

fonduer.utils.data_model_utils.structural.lowest_common_ancestor_depth(c)[source]

Return the minimum distance from the Mentions of a binary-Mention Candidate to their lowest common ancestor.

For example, if the tree looked like this:

html
├──<div> Mention 1 </div>
├──table
│    ├──tr
│    │  └──<th> Mention 2 </th>

we return 1, the distance from Mention 1 to the html root. Smaller values indicate that two Mentions are close structurally, while larger values indicate that two Mentions are spread far apart structurally in the document.

Parameters:c – The binary-Mention Candidate to evaluate
Return type:integer
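
The worked example above can be reproduced with a short sketch. It assumes each Mention is represented by its root-to-node tag path and takes the smaller of the two distances to the deepest shared node; lca_depth_sketch is a hypothetical name and not Fonduer's code.

```python
def lca_depth_sketch(path_a, path_b):
    """Return the smaller of the two distances to the lowest common ancestor."""
    common = 0
    for tag_a, tag_b in zip(path_a, path_b):
        if tag_a != tag_b:
            break
        common += 1
    return min(len(path_a), len(path_b)) - common


# Paths taken from the tree above: Mention 1 sits in html/div,
# Mention 2 sits in html/table/tr/th.
depth = lca_depth_sketch(["html", "div"], ["html", "table", "tr", "th"])
```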

Tabular Data Model Utilities

fonduer.utils.data_model_utils.tabular.get_aligned_ngrams(mention, attrib='words', n_min=1, n_max=1, spread=[0, 0], lower=True)[source]

Get the ngrams from all Cells in the same row or column as the given Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose row and column Cells are being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • spread – The number of rows/cols above/below/left/right to also consider “aligned”.
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams
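
A toy picture of "same row or column" on a table stored as a list of rows follows. It assumes single-cell spans, ignores the spread parameter, and skips the Mention's own cell, all of which are simplifications made for illustration; aligned_words_sketch and the table contents are hypothetical.

```python
def aligned_words_sketch(grid, row, col, lower=True):
    """Yield words from every other cell in the same row or column."""
    for r, cells in enumerate(grid):
        for c, text in enumerate(cells):
            if (r, c) == (row, col):
                continue  # skip the mention's own cell
            if r == row or c == col:
                for word in text.split():
                    yield word.lower() if lower else word


table = [
    ["Parameter", "Symbol", "Value"],
    ["Collector current", "IC", "100 mA"],
    ["Storage temperature", "Tstg", "150 C"],
]
# Suppose the mention sits in the cell at row 1, column 2 ("100 mA").
aligned = list(aligned_words_sketch(table, row=1, col=2))
```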

fonduer.utils.data_model_utils.tabular.get_cell_ngrams(mention, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams that are in the Cell of the given mention, not including itself.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose Cell is being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.tabular.get_col_ngrams(mention, attrib='words', n_min=1, n_max=1, spread=[0, 0], lower=True)[source]

Get the ngrams from all Cells that are in the same column as the given Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose column Cells are being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • spread – The number of cols left and right to also consider “aligned”.
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.tabular.get_head_ngrams(mention, axis=None, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams from the cell in the head of the row or column.

More specifically, this returns the ngrams in the leftmost cell in a row and/or the ngrams in the topmost cell in the column, depending on the axis parameter.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose head Cells are being returned
  • axis – Which axis {‘row’, ‘col’} to search. If None, then both row and col are searched.
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams
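
The effect of the axis parameter can be sketched on a table stored as a list of rows: the row head is the leftmost cell of the row and the column head is the topmost cell of the column. head_words_sketch and the table are hypothetical, and the sketch returns single words rather than general ngrams.

```python
def head_words_sketch(grid, row, col, axis=None, lower=True):
    """Return words from the row head and/or column head of a cell."""
    if axis not in (None, "row", "col"):
        raise ValueError("axis must be 'row', 'col', or None")
    heads = []
    if axis in (None, "row"):
        heads.append(grid[row][0])  # leftmost cell in the row
    if axis in (None, "col"):
        heads.append(grid[0][col])  # topmost cell in the column
    words = [w for text in heads for w in text.split()]
    return [w.lower() for w in words] if lower else words


table = [
    ["Parameter", "Symbol", "Value"],
    ["Collector current", "IC", "100 mA"],
]
row_head = head_words_sketch(table, 1, 2, axis="row")
col_head = head_words_sketch(table, 1, 2, axis="col")
both = head_words_sketch(table, 1, 2)
```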

fonduer.utils.data_model_utils.tabular.get_max_col_num(mention)[source]

Return the largest column number that a Mention occupies.

Parameters:mention – The Mention to evaluate. If a candidate is given, default to its last Mention.
Return type:integer or None
fonduer.utils.data_model_utils.tabular.get_min_col_num(mention)[source]

Return the lowest column number that a Mention occupies.

Parameters:mention – The Mention to evaluate. If a candidate is given, default to its first Mention.
Return type:integer or None
fonduer.utils.data_model_utils.tabular.get_min_row_num(mention)[source]

Return the lowest row number that a Mention occupies.

Parameters:mention – The Mention to evaluate. If a candidate is given, default to its first Mention.
Return type:integer or None
fonduer.utils.data_model_utils.tabular.get_neighbor_cell_ngrams(mention, dist=1, directions=False, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams from all Cells that are within a given Cell distance in one direction from the given Mention.

Note that if a candidate is passed in, all of its Mentions will be searched. If directions=True, each ngram will be returned with a direction in {‘UP’, ‘DOWN’, ‘LEFT’, ‘RIGHT’}.

Parameters:
  • mention – The Mention whose neighbor Cells are being searched
  • dist – The Cell distance within which a neighbor Cell must be to be considered
  • directions – A Boolean expressing whether or not to return the direction of each ngram
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams (or (ngram, direction) tuples if directions=True)
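
The dist and directions parameters can be illustrated on a small grid. The sketch assumes that "one direction" means moving along the Mention's row or column only, and it yields whole cell texts rather than ngrams; neighbor_cells_sketch is a hypothetical helper, not Fonduer's implementation.

```python
def neighbor_cells_sketch(grid, row, col, dist=1, directions=False):
    """Yield text of cells within `dist` cells along the row or column."""
    for r, cells in enumerate(grid):
        for c, text in enumerate(cells):
            if r == row and 0 < abs(c - col) <= dist:
                direction = "LEFT" if c < col else "RIGHT"
            elif c == col and 0 < abs(r - row) <= dist:
                direction = "UP" if r < row else "DOWN"
            else:
                continue  # not a neighbor along either axis
            yield (text, direction) if directions else text


grid = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
# Neighbors of the center cell "e" at distance 1.
plain = list(neighbor_cells_sketch(grid, 1, 1))
labeled = list(neighbor_cells_sketch(grid, 1, 1, directions=True))
```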

fonduer.utils.data_model_utils.tabular.get_neighbor_sentence_ngrams(mention, d=1, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams that are in the neighboring Sentences of the given Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose neighbor Sentences are being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.tabular.get_row_ngrams(mention, attrib='words', n_min=1, n_max=1, spread=[0, 0], lower=True)[source]

Get the ngrams from all Cells that are in the same row as the given Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose row Cells are being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • spread – The number of rows above and below to also consider “aligned”.
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.tabular.get_sentence_ngrams(mention, attrib='words', n_min=1, n_max=1, lower=True)[source]

Get the ngrams that are in the Sentence of the given Mention, not including itself.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention whose Sentence is being searched
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.tabular.is_tabular_aligned(c)[source]

Return True if all Mentions in the given candidate are from the same Row or Col.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean
fonduer.utils.data_model_utils.tabular.same_cell(c)[source]

Return True if all Mentions in the given candidate are from the same Cell.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean
fonduer.utils.data_model_utils.tabular.same_col(c)[source]

Return True if all Mentions in the given candidate are from the same Col.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean
fonduer.utils.data_model_utils.tabular.same_row(c)[source]

Return True if all Mentions in the given candidate are from the same Row.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean
fonduer.utils.data_model_utils.tabular.same_sentence(c)[source]

Return True if all Mentions in the given candidate are from the same Sentence.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean
fonduer.utils.data_model_utils.tabular.same_table(c)[source]

Return True if all Mentions in the given candidate are from the same Table.

Parameters:c – The candidate whose Mentions are being compared
Return type:boolean

Visual Data Model Utilities

fonduer.utils.data_model_utils.visual.get_aligned_lemmas(mention)[source]

Return a set of the lemmas aligned visually with the Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:mention – The Mention to evaluate.
Return type:a set of lemmas
fonduer.utils.data_model_utils.visual.get_horz_ngrams(mention, attrib='words', n_min=1, n_max=1, lower=True, from_sentence=True)[source]

Return all ngrams which are visually horizontally aligned with the Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention to evaluate
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
  • from_sentence – If True, returns ngrams from any horizontally aligned Sentences, rather than just horizontally aligned ngrams themselves.
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.visual.get_page(mention)[source]

Return the page number of the given mention.

If a candidate is passed in, this returns the page of its first Mention.

Parameters:mention – The Mention to get the page number of.
Return type:integer
fonduer.utils.data_model_utils.visual.get_page_horz_percentile(mention, page_width=612, page_height=792)[source]

Return which percentile from the LEFT in the page the Mention is located in.

Percentile is calculated where the left of the page is 0.0, and the right of the page is 1.0.

Page width and height are based on pt values:

Letter      612x792
Tabloid     792x1224
Ledger      1224x792
Legal       612x1008
Statement   396x612
Executive   540x720
A0          2384x3371
A1          1685x2384
A2          1190x1684
A3          842x1190
A4          595x842
A4Small     595x842
A5          420x595
B4          729x1032
B5          516x729
Folio       612x936
Quarto      610x780
10x14       720x1008

and should match the source documents. Letter size is used by default.

Note that if a candidate is passed in, only the horizontal percentile of its first Mention is returned.

Parameters:
  • mention – The Mention to evaluate
  • page_width – The width of the page. Defaults to Letter paper width.
  • page_height – The height of the page. Defaults to Letter paper height.
Return type:

float in [0.0, 1.0]
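
The arithmetic behind the horizontal percentile can be sketched in one line. The sketch assumes the left edge of the Mention's bounding box, measured in points, is the value being compared to the page width; horz_percentile_sketch and bbox_left are hypothetical names.

```python
def horz_percentile_sketch(bbox_left, page_width=612):
    """Return the left edge as a fraction of page width (0.0 = left edge)."""
    return bbox_left / page_width


# A mention whose bounding box starts 153 pt from the left of a Letter page.
percentile = horz_percentile_sketch(153)
print(percentile)  # 153 / 612
```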

fonduer.utils.data_model_utils.visual.get_page_vert_percentile(mention, page_width=612, page_height=792)[source]

Return which percentile from the TOP in the page the Mention is located in.

Percentile is calculated where the top of the page is 0.0, and the bottom of the page is 1.0. For example, a Mention at the top 1/4 of the page will have a percentile of 0.25.

Page width and height are based on pt values:

Letter      612x792
Tabloid     792x1224
Ledger      1224x792
Legal       612x1008
Statement   396x612
Executive   540x720
A0          2384x3371
A1          1685x2384
A2          1190x1684
A3          842x1190
A4          595x842
A4Small     595x842
A5          420x595
B4          729x1032
B5          516x729
Folio       612x936
Quarto      610x780
10x14       720x1008

and should match the source documents. Letter size is used by default.

Note that if a candidate is passed in, only the vertical percentile of its first Mention is returned.

Parameters:
  • mention – The Mention to evaluate
  • page_width – The width of the page. Defaults to Letter paper width.
  • page_height – The height of the page. Defaults to Letter paper height.
Return type:

float in [0.0, 1.0]
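
Because the result depends on the page height, documents that are not Letter sized need matching dimensions. The sketch below looks up a few sizes copied from the table above and assumes the top edge of the bounding box, in points, is divided by the page height; PAGE_SIZES_PT, vert_percentile_sketch, and the paper argument are hypothetical.

```python
# Page sizes in points (width, height), copied from the table above.
PAGE_SIZES_PT = {
    "Letter": (612, 792),
    "Legal": (612, 1008),
    "A4": (595, 842),
}


def vert_percentile_sketch(bbox_top, paper="Letter"):
    """Return the top edge as a fraction of page height (0.0 = top)."""
    _, page_height = PAGE_SIZES_PT[paper]
    return bbox_top / page_height


letter = vert_percentile_sketch(198)          # 198 / 792
a4 = vert_percentile_sketch(421, paper="A4")  # 421 / 842
```

With the real utility, the equivalent step would be passing the page_width and page_height that match the source documents.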

fonduer.utils.data_model_utils.visual.get_vert_ngrams(mention, attrib='words', n_min=1, n_max=1, lower=True, from_sentence=True)[source]

Return all ngrams which are visually vertically aligned with the Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:
  • mention – The Mention to evaluate
  • attrib – The token attribute type (e.g. words, lemmas, poses)
  • n_min – The minimum n of the ngrams that should be returned
  • n_max – The maximum n of the ngrams that should be returned
  • lower – If True, all ngrams will be returned in lower case
  • from_sentence – If True, returns ngrams from any vertically aligned Sentences, rather than just vertically aligned ngrams themselves.
Return type:

a generator of ngrams

fonduer.utils.data_model_utils.visual.get_vert_ngrams_center(c)[source]

Not implemented.

fonduer.utils.data_model_utils.visual.get_vert_ngrams_left(c)[source]

Not implemented.

fonduer.utils.data_model_utils.visual.get_vert_ngrams_right(c)[source]

Not implemented.

fonduer.utils.data_model_utils.visual.get_visual_aligned_lemmas(mention)[source]

Return a generator of the lemmas aligned visually with the Mention.

Note that if a candidate is passed in, all of its Mentions will be searched.

Parameters:mention – The Mention to evaluate.
Return type:a generator of lemmas
fonduer.utils.data_model_utils.visual.get_visual_distance(c, axis=None)[source]

Not implemented.

fonduer.utils.data_model_utils.visual.get_visual_header_ngrams(c, axis=None)[source]

Not implemented.

fonduer.utils.data_model_utils.visual.is_horz_aligned(c)[source]

Return True if all the components of c are horizontally aligned.

Horizontal alignment means that the bounding boxes of the Mentions of c share a similar y-axis value in the visual rendering of the document.

Parameters:c – The candidate to evaluate
Return type:boolean
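
One possible reading of "a similar y-axis value" is that the vertical extents of the bounding boxes overlap. The sketch below implements that reading only; the actual criterion Fonduer uses may differ, and Box and horz_aligned_sketch are hypothetical.

```python
from collections import namedtuple

# Hypothetical bounding box in page coordinates (points, y grows downward).
Box = namedtuple("Box", ["left", "top", "right", "bottom"])


def horz_aligned_sketch(boxes):
    """True if all boxes share some common vertical (y) extent."""
    lowest_top = max(b.top for b in boxes)
    highest_bottom = min(b.bottom for b in boxes)
    return lowest_top <= highest_bottom


same_line = [Box(50, 100, 120, 112), Box(300, 101, 360, 113)]
different_lines = [Box(50, 100, 120, 112), Box(300, 140, 360, 152)]
```
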
fonduer.utils.data_model_utils.visual.is_vert_aligned(c)[source]

Return True if all the components of c are vertically aligned.

Vertical alignment means that the bounding boxes of the Mentions of c share a similar x-axis value in the visual rendering of the document.

Parameters:c – The candidate to evaluate
Return type:boolean
fonduer.utils.data_model_utils.visual.is_vert_aligned_center(c)[source]

Return True if all the components are vertically aligned on their center.

Vertical alignment means that the bounding boxes of the Mentions of c share a similar x-axis value in the visual rendering of the document. In this function the similarity of the x-axis value is based on the center of their bounding boxes.

Parameters:c – The candidate to evaluate
Return type:boolean
fonduer.utils.data_model_utils.visual.is_vert_aligned_left(c)[source]

Return True if all components are vertically aligned on their left border.

Vertical alignment means that the bounding boxes of the Mentions of c share a similar x-axis value in the visual rendering of the document. In this function the similarity of the x-axis value is based on the left border of their bounding boxes.

Parameters:c – The candidate to evaluate
Return type:boolean
fonduer.utils.data_model_utils.visual.is_vert_aligned_right(c)[source]

Return True if all components are vertically aligned on their right border.

Vertical alignment means that the bounding boxes of the Mentions of c share a similar x-axis value in the visual rendering of the document. In this function the similarity of the x-axis value is based on the right border of their bounding boxes.

Parameters:c – The candidate to evaluate
Return type:boolean
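
The left, center, and right variants above differ only in which x value of each bounding box is compared. The sketch below makes that explicit under the assumption that "similar" means within a small tolerance in points; Box, vert_aligned_sketch, the edge argument, and the tolerance value are all hypothetical.

```python
from collections import namedtuple

# Hypothetical bounding box in page coordinates (points).
Box = namedtuple("Box", ["left", "top", "right", "bottom"])


def vert_aligned_sketch(boxes, edge="left", tolerance=1.0):
    """True if the chosen x value of every box is within `tolerance` pt."""
    if edge == "left":
        xs = [b.left for b in boxes]
    elif edge == "right":
        xs = [b.right for b in boxes]
    elif edge == "center":
        xs = [(b.left + b.right) / 2 for b in boxes]
    else:
        raise ValueError("edge must be 'left', 'right', or 'center'")
    return max(xs) - min(xs) <= tolerance


# Two left-aligned boxes of different widths, one below the other.
column = [Box(72, 100, 150, 112), Box(72, 120, 210, 132)]
left_ok = vert_aligned_sketch(column, edge="left")
right_ok = vert_aligned_sketch(column, edge="right")
center_ok = vert_aligned_sketch(column, edge="center")
```
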
fonduer.utils.data_model_utils.visual.same_page(c)[source]

Return True if all the components of c are on the same page of the document.

Page numbers are based on the PDF rendering of the document. If a PDF file is provided, it is used. Otherwise, if only an HTML/XML document is provided, a PDF is created and then used to determine the page number of a Mention.

Parameters:c – The candidate to evaluate
Return type:boolean