Parsing

The first stage of Fonduer’s pipeline is to parse an input corpus of documents into the Fonduer data model.

Multimodal Data Model

The following docs describe elements of Fonduer’s data model. These attributes can be used when creating matchers, throttlers, and labeling functions.

class fonduer.parser.models.Caption(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A Caption Context in a Document.

Used to represent figure or table captions in a document.

Note

As of v0.6.2, <caption> and <figcaption> tags turn into Caption.

document

The parent Document.

document_id

The id of the parent Document.

figure

The parent Figure, if any.

figure_id

The id of the parent Figure, if any.

id

The unique id of the Caption.

name

The name of a Caption.

position

The position of the Caption in the Document.

table

The parent Table, if any.

table_id

The id of the parent Table, if any.

class fonduer.parser.models.Cell(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A cell Context in a Document.

Used to represent the cells that comprise a table in a document.

Note

As of v0.6.2, <th> and <td> tags turn into Cell.

col_end

The end index of the column in the Table the Cell is in.

col_start

The start index of the column in the Table the Cell is in.

document

The parent Document.

document_id

The id of the parent Document.

id

The unique id of the Cell.

name

The name of a Cell.

position

The position of the Cell in the Table.

row_end

The end index of the row in the Table the Cell is in.

row_start

The start index of the row in the Table the Cell is in.

table

The parent Table.

table_id

The id of the parent Table.
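As a rough illustration of how the four span indexes relate to a table grid, the sketch below derives them from an HTML-style rowspan and colspan and checks whether two cells share a row. It does not use Fonduer: CellSpan, from_html_span, and rows_overlap are hypothetical names, and treating the end indexes as inclusive is an assumption rather than something stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSpan:
    """Hypothetical stand-in holding only the four span indexes of a Cell."""

    row_start: int
    row_end: int
    col_start: int
    col_end: int


def from_html_span(row: int, col: int, rowspan: int = 1, colspan: int = 1) -> CellSpan:
    # Assumption: end indexes are inclusive, so a 1x1 cell has start == end.
    return CellSpan(row, row + rowspan - 1, col, col + colspan - 1)


def rows_overlap(a: CellSpan, b: CellSpan) -> bool:
    # Two inclusive ranges intersect when each starts before the other ends.
    return a.row_start <= b.row_end and b.row_start <= a.row_end


header = from_html_span(0, 0, rowspan=2)  # spans rows 0 and 1 of column 0
value = from_html_span(1, 1)              # a plain cell in row 1, column 1
footer = from_html_span(2, 0)             # a plain cell in row 2, column 0
```

Under these assumptions, header and value share a row while header and footer do not, which is the kind of check a throttler or labeling function might make with row_start and row_end.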

class fonduer.parser.models.Context(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

A piece of content from which Candidates are composed.

This serves as the base class of the Fonduer document model.

id

The unique id of the Context.

stable_id

A stable representation of the Context that will not change between runs.

type

The type of the Context represented as a string (e.g. “sentence”, “paragraph”, “figure”).

class fonduer.parser.models.Document(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A document Context.

Represents all the information of a particular document. What becomes a document depends on which child class of DocPreprocessor is used.

Note

As of v0.6.2, each file is one document when HTMLDocPreprocessor or TextDocPreprocessor is used, whereas each line in the input file is treated as one document when CSVDocPreprocessor or TSVDocPreprocessor is used.

id

The unique id of a Document.

meta

Pickled metadata about a document extracted from a document preprocessor.

name

The filename of a Document, without its extension (e.g., “BC818”).

text

The full text of the Document.

class fonduer.parser.models.Figure(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A figure Context in a Document.

Used to represent figures in a document.

Note

As of v0.6.2, <img> and <figure> tags turn into Figure.

cell

The parent Cell, if any.

cell_id

The id of the parent Cell, if any.

document

The parent Document.

document_id

The id of the parent Document.

id

The unique id of the Figure.

name

The name of a Figure.

position

The position of the Figure in the Document.

section

The parent Section.

section_id

The id of the parent Section.

url

The Figure’s URL.

class fonduer.parser.models.Paragraph(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A paragraph Context in a Document.

Represents a grouping of adjacent sentences.

Note

As of v0.6.2, text content found in either of the two properties .text and .tail turns into a Paragraph. See https://lxml.de/tutorial.html#elements-contain-text for details about the .text and .tail properties.
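To make the .text and .tail distinction concrete, here is a minimal sketch that collects the text chunks of a small element tree in document order. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (both expose .text and .tail in the same way); text_chunks is a hypothetical helper, not part of Fonduer, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET


def text_chunks(element):
    """Collect non-empty .text and .tail strings in document order."""
    chunks = []
    # .text is the text between the opening tag and the first child.
    if element.text and element.text.strip():
        chunks.append(element.text.strip())
    for child in element:
        chunks.extend(text_chunks(child))
        # .tail is the text that follows the child's closing tag.
        if child.tail and child.tail.strip():
            chunks.append(child.tail.strip())
    return chunks


root = ET.fromstring(
    "<div>Intro<table><tr><td>cell</td></tr></table>Outro</div>"
)
table = root.find("table")
chunks = text_chunks(root)
```

Here "Intro" is the .text of the div, "cell" is the .text of the td, and "Outro" is the .tail of the table, so each of the three chunks would be a candidate for its own Paragraph.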

caption

The parent Caption, if any.

caption_id

The id of the parent Caption, if any.

cell

The parent Cell, if any.

cell_id

The id of the parent Cell, if any.

document

The parent Document.

document_id

The id of the parent Document.

id

The unique id of the Paragraph.

name

The name of a Paragraph.

position

The position of the Paragraph in the Document.

section

The parent Section.

section_id

The id of the parent Section.

class fonduer.parser.models.Section(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A Section Context in a Document.

Note

As of v0.6.2, each document simply has a single Section. Specifically, <html> and <section> tags turn into Section. Future parsing improvements can add better section recognition, such as the sections of an academic paper.

document

The parent Document.

document_id

The id of the parent Document.

id

The unique id of the Section.

name

The name of a Section.

position

The position of the Section in a Document.

class fonduer.parser.models.Sentence(**kwargs)[source]

Bases: fonduer.parser.models.context.Context, fonduer.parser.models.sentence.TabularMixin, fonduer.parser.models.sentence.LingualMixin, fonduer.parser.models.sentence.VisualMixin, fonduer.parser.models.sentence.StructuralMixin, fonduer.parser.models.sentence.SentenceMixin

A Sentence subclass with Lingual, Tabular, Visual, and HTML attributes.

Note

Unlike other data models, there is no HTML element corresponding to Sentence. One Paragraph comprises one or more Sentences, but how a Paragraph is split depends on which NLP parser (e.g., spaCy) is used.

abs_char_offsets

A list of the character offsets of each word in a Sentence, with respect to the entire document.

bottom

A list of each word’s BOTTOM bounding box coordinate in the Sentence.

cell

The parent Cell, if any.

cell_id

The id of the parent Cell, if any.

char_offsets

A list of the character offsets of each word in a Sentence, with respect to the start of the sentence.

col_end

The col_end of the parent Cell, if any.

col_start

The col_start of the parent Cell, if any.

dep_labels

A list of dependency labels for each word in a Sentence.

dep_parents

A list of the dependency parents for each word in a Sentence.

document

The parent Document.

document_id

The id of the parent Document.

html_attrs

A list of the html attributes of the element containing the Sentence.

html_tag

The HTML tag of the element containing the Sentence.

id

The unique id for the Sentence.

is_cellular() → bool

Whether or not the Sentence contains information about its table cell.

Return type:bool
is_lingual() → bool

Whether or not the Sentence contains NLP information.

Return type:bool
is_structural() → bool

Whether or not the Sentence contains structural information.

Return type:bool
is_tabular() → bool

Whether or not the Sentence contains tabular information.

Return type:bool
is_visual() → bool

Whether or not the Sentence contains visual information.

Return type:bool
left

A list of each word’s LEFT bounding box coordinate in the Sentence.

lemmas

A list of the lemmas for each word in a Sentence.

name

The name of a Sentence.

ner_tags

A list of NER tags for each word in a Sentence.

page

A list of the page index of each word in the Sentence.

Page indexes start at 0.

paragraph

The parent Paragraph.

paragraph_id

The id of the parent Paragraph.

pos_tags

A list of POS tags for each word in a Sentence.

position

The position of the Sentence in the Document.

right

A list of each word’s RIGHT bounding box coordinate in the Sentence.

row_end

The row_end of the parent Cell, if any.

row_start

The row_start of the parent Cell, if any.

section

The parent Section.

section_id

The id of the parent Section.

table

The parent Table, if any.

table_id

The id of the parent Table, if any.

text

The full text of the Sentence.

top

A list of each word’s TOP bounding box coordinate in the Sentence.

words

A list of the words in a Sentence.

xpath

The HTML XPATH to the Sentence.
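Many Sentence attributes (words, page, left, top, right, bottom, and so on) are parallel lists with one entry per word. The sketch below shows one way to zip them into per-word boxes and to compute an enclosing box. It uses a SimpleNamespace with made-up values in place of a real Sentence, and it assumes a coordinate system where top is smaller than bottom; word_boxes and sentence_box are hypothetical helpers.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a parsed Sentence; all values are invented.
sentence = SimpleNamespace(
    words=["Max", "voltage", "5V"],
    page=[0, 0, 0],
    left=[72, 98, 150],
    top=[100, 100, 100],
    right=[94, 146, 168],
    bottom=[112, 112, 112],
)


def word_boxes(s):
    """Pair each word with its page index and (left, top, right, bottom) box."""
    return [
        (word, page, (left, top, right, bottom))
        for word, page, left, top, right, bottom in zip(
            s.words, s.page, s.left, s.top, s.right, s.bottom
        )
    ]


def sentence_box(s):
    """Smallest box enclosing every word; meaningful only on a single page."""
    return (min(s.left), min(s.top), max(s.right), max(s.bottom))


boxes = word_boxes(sentence)
outer = sentence_box(sentence)
```

Because page is also a per-word list, a sentence that crosses a page break would need its words grouped by page before an enclosing box is computed.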

class fonduer.parser.models.Table(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A Table Context in a Document.

Used to represent tables found in a document.

Note

As of v0.6.2, <table> tags turn into Table.

document

The parent Document.

document_id

The id of the parent Document.

id

The unique id of the Table.

name

The name of a Table.

position

The position of the Table in the Document.

section

The parent Section.

section_id

The id of the parent Section.

class fonduer.parser.models.Webpage(**kwargs)[source]

Bases: fonduer.parser.models.context.Context

A Webpage Context enhanced with additional metadata.

crawltime

The timestamp of when the Webpage was crawled.

host

The host of the Webpage.

id

The unique id of the Webpage.

name

The name of a Webpage.

page_type

The type of the Webpage.

raw_content

The raw content of the Webpage.

url

The URL of the Webpage.

Core Objects

This is Fonduer’s core Parser object.

class fonduer.parser.Parser(session: sqlalchemy.orm.session.Session, parallelism: int = 1, structural: bool = True, blacklist: List[str] = ['style', 'script'], flatten: List[str] = ['span', 'br'], language: str = 'en', lingual: bool = True, lingual_parser: Optional[fonduer.parser.lingual_parser.lingual_parser.LingualParser] = None, strip: bool = True, replacements: List[Tuple[str, str]] = [('[‐‑‒–—−]', '-')], tabular: bool = True, visual: bool = False, vizlink: Optional[fonduer.parser.visual_linker.VisualLinker] = None, pdf_path: Optional[str] = None)[source]

Bases: fonduer.utils.udf.UDFRunner

Parses documents into Fonduer’s data model.

Parameters:
  • session – The database session to use.
  • parallelism – The number of processes to use in parallel. Default 1.
  • structural – Whether to parse structural information from a DOM.
  • blacklist – A list of tag types to ignore. Default [“style”, “script”].
  • flatten – A list of tag types to flatten. Default [“span”, “br”]
  • language – Which spaCy NLP language package. Default “en”.
  • lingual – Whether or not to include NLP information. Default True.
  • lingual_parser – A custom lingual parser that inherits from LingualParser. When specified, language will be ignored. Otherwise, spaCy with the given language will be used.
  • strip – Whether or not to strip whitespace during parsing. Default True.
  • replacements – A list of tuples where the regex string in the first position is replaced by the character in the second position. Default [(u”[‐‑‒–—−]”, “-“)], which replaces various unicode variants of a hyphen (e.g. emdash, endash, minus, etc.) with a standard ASCII hyphen.
  • tabular – Whether to include tabular information in the parse.
  • visual – Whether to include visual information in the parse. Requires PDFs for each input document.
  • vizlink – A custom visual linker that inherits VisualLinker. Unless otherwise specified, VisualLinker will be used.
  • pdf_path – The path to the corresponding PDFs used for visual info.
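The replacements parameter is easiest to understand by example. The following sketch applies a list of (regex, replacement) tuples to a string with re.sub, using a pattern that mirrors the documented default. It is only an illustration of the idea, not Fonduer's internal code; apply_replacements is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Mirrors the documented default: several Unicode hyphen, dash, and minus
# characters, written here as escapes so they are easy to tell apart.
DEFAULT_REPLACEMENTS: List[Tuple[str, str]] = [
    ("[\u2010\u2011\u2012\u2013\u2014\u2212]", "-"),
]


def apply_replacements(text: str, replacements: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair to the text, in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


# En dash in a range and a minus sign in a temperature both become "-".
cleaned = apply_replacements("3\u20135 V, \u221240 \u00b0C", DEFAULT_REPLACEMENTS)
```

Normalizing these characters up front means that matchers and labeling functions only have to account for the ASCII hyphen.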
apply(doc_loader: Collection[fonduer.parser.models.document.Document], clear: bool = True, parallelism: Optional[int] = None, progress_bar: bool = True, pdf_path: Optional[str] = None) → None[source]

Run the Parser.

Parameters:
  • doc_loader – An iterable of Documents to parse. Typically, one of Fonduer’s document preprocessors.
  • pdf_path – The path to the PDF documents, if any. This path will override the one used in initialization, if provided.
  • clear (bool) – Whether or not to clear the existing Context objects from the database before parsing.
  • parallelism (int) – How many processes to use for parsing. This will override the parallelism value used to initialize the Parser if it is provided.
  • progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
clear(pdf_path: Optional[str] = None) → None[source]

Clear all of the Context objects in the database.

Parameters:pdf_path – This parameter is ignored.
get_documents() → List[fonduer.parser.models.document.Document][source]

Return all the parsed Documents in the database.

Return type:A list of all Documents in the database ordered by name.
get_last_documents() → List[fonduer.parser.models.document.Document][source]

Return the most recently parsed list of Documents.

Return type:A list of the most recently parsed Documents ordered by name.
class fonduer.parser.visual_linker.VisualLinker(pdf_path: str, time: bool = False, verbose: bool = False)[source]

Bases: object

Link visual information with sentences.

is_linkable(filename: str) → bool[source]

Verify that the file exists and has a PDF extension.

Parameters:filename (str) – The path to the PDF document.
Return type:boolean

Link visual information with sentences.

Parameters:
  • document_name (str) – the document name.
  • sentences (Iterable[Sentence]) – sentences to be linked with visual information.
  • pdf_path (str) – The path to the PDF documents, if any. This path will override the one used in initialization, if provided.
Return type:

A generator of Sentence.

Lingual Parsers

The following docs describe various lingual parsers. They split text into sentences and enrich them with NLP.

class fonduer.parser.lingual_parser.LingualParser[source]

Bases: object

Lingual parser.

enrich_sentences_with_NLP(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence][source]

Add NLP attributes like lemmas, pos_tags, etc. to sentences.

Parameters:sentences – an iterator of Sentence.
Returns:a generator of Sentence.
has_NLP_support() → bool[source]

Returns True when NLP is supported.

Returns:True when NLP is supported.
has_tokenizer_support() → bool[source]

Returns True when a tokenizer is supported.

Returns:True when a tokenizer is supported.
split_sentences(text: str) → Iterable[dict][source]

Split input text into sentences.

Parameters:text (str) – text to be split
Returns:A generator of dict that is used as **kwargs to instantiate Sentence.
class fonduer.parser.lingual_parser.SpacyParser(lang: Optional[str])[source]

Bases: fonduer.parser.lingual_parser.lingual_parser.LingualParser

spaCy https://spacy.io/

Models for each target language need to be downloaded using the following command:

python -m spacy download en

Default named entity types

  • PERSON – People, including fictional.
  • NORP – Nationalities or religious or political groups.
  • FACILITY – Buildings, airports, highways, bridges, etc.
  • ORG – Companies, agencies, institutions, etc.
  • GPE – Countries, cities, states.
  • LOC – Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT – Objects, vehicles, foods, etc. (Not services.)
  • EVENT – Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART – Titles of books, songs, etc.
  • LANGUAGE – Any named language.
  • DATE – Absolute or relative dates or periods.
  • TIME – Times smaller than a day.
  • PERCENT – Percentage, including “%”.
  • MONEY – Monetary values, including unit.
  • QUANTITY – Measurements, as of weight or distance.
  • ORDINAL – “first”, “second”, etc.
  • CARDINAL – Numerals that do not fall under another type.

enrich_sentences_with_NLP(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence][source]

Enrich a list of fonduer Sentence objects with NLP features. We merge and process the text of all Sentences for higher efficiency.

Parameters:sentences – List of fonduer Sentence objects for one document
Returns:
has_NLP_support() → bool[source]

Returns True when NLP is supported.

Returns:True when NLP is supported.
has_tokenizer_support() → bool[source]

Returns True when a tokenizer is supported.

Returns:True when a tokenizer is supported.
static is_package(name: str) → bool[source]

Check if string maps to a package installed via pip.

Parameters:name (str) – Name of package.
Returns:True if installed package, False if not.

From https://github.com/explosion/spaCy/blob/master/spacy/util.py
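For readers who want to see what such a check can look like, here is a minimal sketch built on the standard library's importlib.metadata. It is not the spaCy or Fonduer implementation, which may normalize names or inspect installed packages differently; is_installed_distribution is a hypothetical name chosen to avoid implying otherwise.

```python
from importlib import metadata


def is_installed_distribution(name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        # version() raises PackageNotFoundError for unknown distributions.
        metadata.version(name)
    except metadata.PackageNotFoundError:
        return False
    return True


missing = is_installed_distribution("definitely-not-installed-pkg-0xdeadbeef")
```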

static model_installed(name: str) → bool[source]

Check if spaCy language model is installed.

From https://github.com/explosion/spaCy/blob/master/spacy/util.py

Parameters:name
Returns:
split_sentences(text: str) → Iterator[Dict[str, Any]][source]

Split input text into sentences that match CoreNLP’s default format, but are not yet processed.

Parameters:text – The text of the parent paragraph of the sentences
Returns:
class fonduer.parser.lingual_parser.SimpleParser(delim: str = '<NB>')[source]

Bases: fonduer.parser.lingual_parser.lingual_parser.LingualParser

Tokenizes text on whitespace only using split().

enrich_sentences_with_NLP(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence]

Add NLP attributes like lemmas, pos_tags, etc. to sentences.

Parameters:sentences – an iterator of Sentence.
Returns:a generator of Sentence.
has_NLP_support() → bool[source]

Returns True when NLP is supported.

Returns:True when NLP is supported.
has_tokenizer_support() → bool[source]

Returns True when a tokenizer is supported.

Returns:True when a tokenizer is supported.
split_sentences(str: str) → Iterator[Dict[str, Any]][source]

Parse the document.

Parameters:str – The text contents of the document.
Return type:a generator of tokenized text.
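The whitespace-only approach can be sketched in a few lines. The example below splits the input on a delimiter, tokenizes each chunk with split(), and yields a dict per sentence. It is an illustration under assumptions, not SimpleParser's actual code: treating delim as a sentence separator is inferred from the constructor signature, and the dict keys shown are only a subset chosen for the example.

```python
from itertools import accumulate
from typing import Any, Dict, Iterator


def split_on_whitespace(text: str, delim: str = "<NB>") -> Iterator[Dict[str, Any]]:
    """Yield one dict per delimiter-separated chunk, tokenized on whitespace."""
    for chunk in text.split(delim):
        words = chunk.split()
        if not words:
            continue  # skip chunks that contain only whitespace
        # Offsets are relative to the single-space re-joined sentence text.
        lengths = [len(word) + 1 for word in words]
        char_offsets = [0] + list(accumulate(lengths))[:-1]
        yield {
            "text": " ".join(words),
            "words": words,
            "char_offsets": char_offsets,
        }


sentences = list(split_on_whitespace("a  bc<NB> def g <NB>  "))
```

Each yielded dict could then be passed as keyword arguments when building a sentence object, which is the pattern the split_sentences return type describes.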

Preprocessors

The following docs describe the document preprocessors included with Fonduer, which are used to parse documents of different formats.

class fonduer.parser.preprocessors.CSVDocPreprocessor(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807, header: bool = False, delim: str = ', ', parser_rule: Optional[Dict[int, Callable]] = None)[source]

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A generator which processes a CSV file or directory of CSV files into a set of Document objects. It treats each line in the input file as a document. By default, it assumes that each column is one section and that the content of each column is one paragraph. However, if a column is complex, an advanced parser may be used by specifying the parser_rule parameter as a dict whose keys are column indexes and whose values are the parsers to apply, e.g., column_constructor in fonduer.utils.utils_parser.

Parameters:
  • path (str) – filesystem path to file or directory to parse.
  • encoding (str) – file encoding to use (e.g. “utf-8”).
  • max_docs (int) – the maximum number of Documents to produce.
  • header (bool) – whether the CSV file contains a header row. If it does, the header is used as the Section names. default = False
  • delim (str) – delimiter to be used to separate columns when file has more than one column. It is active only when column is not None. default=’,’
  • parser_rule – The parser rule to be used to parse the specific column. default = None
Return type:

A generator of Documents.
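The row-per-document, column-per-section layout described above can be sketched with the standard csv module. This does not use Fonduer: iter_csv_docs is a hypothetical helper, the sample data is made up, and the assumption that a parser rule is a callable returning a list of paragraph strings is only for illustration (the real column_constructor signature is not shown here).

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Section = Tuple[str, List[str]]  # (section name, paragraphs)


def iter_csv_docs(
    handle,
    header: bool = False,
    delim: str = ",",
    parser_rule: Optional[Dict[int, Callable[[str], List[str]]]] = None,
) -> Iterator[Tuple[int, List[Section]]]:
    """Yield (row index, sections) with one section per column."""
    rules = parser_rule or {}
    reader = csv.reader(handle, delimiter=delim)
    names = next(reader) if header else None
    for row_index, row in enumerate(reader):
        sections = []
        for col, content in enumerate(row):
            name = names[col] if names else str(col)
            rule = rules.get(col)
            # Without a rule, the whole column content is one paragraph.
            paragraphs = rule(content) if rule else [content]
            sections.append((name, paragraphs))
        yield row_index, sections


sample = io.StringIO("title,pins\nBC818,1;2;3\n")
docs = list(
    iter_csv_docs(sample, header=True, parser_rule={1: lambda s: s.split(";")})
)
```

In this sketch the second column is split into three paragraphs by its rule, while the first column falls back to the default of one paragraph per column.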

class fonduer.parser.preprocessors.DocPreprocessor(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]

Bases: object

A generator which processes a file or directory of files into a set of Document objects.

Parameters:
  • path (str) – filesystem path to file or directory to parse.
  • encoding (str) – file encoding to use (e.g. “utf-8”).
  • max_docs (int) – the maximum number of Documents to produce.
Return type:

A generator of Documents.

class fonduer.parser.preprocessors.HTMLDocPreprocessor(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A generator which processes an HTML file or directory of HTML files into a set of Document objects.

Parameters:
  • encoding (str) – file encoding to use (e.g. “utf-8”).
  • path (str) – filesystem path to file or directory to parse.
  • max_docs (int) – the maximum number of Documents to produce.
Return type:

A generator of Documents.

class fonduer.parser.preprocessors.TSVDocPreprocessor(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807, header: bool = False)[source]

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A generator which processes a TSV file or directory of TSV files into a set of Document objects.

The TSV file should have one (doc_name <tab> doc_text) per line.

Parameters:
  • path (str) – filesystem path to file or directory to parse.
  • encoding (str) – file encoding to use (e.g. “utf-8”).
  • max_docs (int) – the maximum number of Documents to produce.
  • header (bool) – whether the TSV file contains a header row. default = False
Return type:

A generator of Documents.
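As a small illustration of the expected input layout, the sketch below reads (doc_name, doc_text) pairs from tab-separated lines, optionally skipping a header row and stopping after max_docs entries. It is a standalone approximation rather than Fonduer's implementation; iter_tsv_docs and the sample rows are hypothetical.

```python
import io
from typing import Iterable, Iterator, Optional, Tuple


def iter_tsv_docs(
    lines: Iterable[str],
    header: bool = False,
    max_docs: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (doc_name, doc_text) for each non-empty tab-separated line."""
    produced = 0
    for index, line in enumerate(lines):
        if header and index == 0:
            continue  # skip the header row
        line = line.rstrip("\r\n")
        if not line:
            continue
        if max_docs is not None and produced >= max_docs:
            break
        # Split on the first tab only, so the text itself may contain tabs.
        doc_name, doc_text = line.split("\t", 1)
        produced += 1
        yield doc_name, doc_text


data = "name\ttext\nBC818\tNPN transistor.\nBC546\tLow-noise amplifier.\n"
docs = list(iter_tsv_docs(io.StringIO(data), header=True))
first_only = list(iter_tsv_docs(io.StringIO(data), header=True, max_docs=1))
```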

class fonduer.parser.preprocessors.TextDocPreprocessor(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A generator which processes a text file or directory of text files into a set of Document objects.

Assumes one document per file.

Parameters:
  • encoding (str) – file encoding to use (e.g. “utf-8”).
  • path (str) – filesystem path to file or directory to parse.
  • max_docs (int) – the maximum number of Documents to produce.
Return type:

A generator of Documents.