Parsing¶
The first stage of Fonduer’s pipeline is to parse an input corpus of documents into the Fonduer data model.
Multimodal Data Model¶
The following docs describe elements of Fonduer’s data model. These attributes can be used when creating matchers, throttlers, and labeling functions.
-
class
fonduer.parser.models.
Caption
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Caption Context in a Document.
Used to represent figure or table captions in a document.
Note
As of v0.6.2, <caption> and <figcaption> tags turn into Caption.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
figure
¶ The parent
Figure
, if any.
-
figure_id
¶ The id of the parent
Figure
, if any.
-
id
¶ The unique id of the
Caption
.
-
name
¶ The name of a
Caption
.
-
position
¶ The position of the
Caption
in the Document
.
-
table
¶ The parent
Table
, if any.
-
table_id
¶ The id of the parent
Table
, if any.
-
-
class
fonduer.parser.models.
Cell
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A cell Context in a Document.
Used to represent the cells that comprise a table in a document.
Note
As of v0.6.2, <th> and <td> tags turn into Cell.
-
col_end
¶ The end index of the column in the
Table
the Cell
is in.
-
col_start
¶ The start index of the column in the
Table
the Cell
is in.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
id
¶ The unique id of the
Cell
.
-
name
¶ The name of a
Cell
.
-
position
¶ The position of the
Cell
in the Table
.
-
row_end
¶ The end index of the row in the
Table
the Cell
is in.
-
row_start
¶ The start index of the row in the
Table
the Cell
is in.
-
table
¶ The parent
Table
.
-
table_id
¶ The id of the parent
Table
.
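The row and column indexes above can be pictured with a small stand-in. This is a hedged sketch, not Fonduer's implementation: `CellSpan`, `covers`, and `same_row` are hypothetical names, and treating the end indexes as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSpan:
    """Hypothetical stand-in for the four span attributes of a Cell."""

    row_start: int
    row_end: int
    col_start: int
    col_end: int


def covers(cell: CellSpan, row: int, col: int) -> bool:
    """Return True if grid position (row, col) falls inside the cell.

    Assumption: end indexes are inclusive.
    """
    return (
        cell.row_start <= row <= cell.row_end
        and cell.col_start <= col <= cell.col_end
    )


def same_row(a: CellSpan, b: CellSpan) -> bool:
    """Return True if the two cells share at least one row index."""
    return a.row_start <= b.row_end and b.row_start <= a.row_end


# A header cell spanning two columns and a value cell spanning two rows.
header = CellSpan(row_start=0, row_end=0, col_start=0, col_end=1)
value = CellSpan(row_start=1, row_end=2, col_start=1, col_end=1)
```

A cell without any row or column span would have `row_start == row_end` and `col_start == col_end` under the same assumption.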
-
-
class
fonduer.parser.models.
Context
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
A piece of content from which Candidates are composed.
This serves as the base class of the Fonduer document model.
-
id
¶ The unique id of the
Context
.
-
stable_id
¶ A stable representation of the
Context
that will not change between runs.
-
type
¶ The type of the
Context
represented as a string (e.g. “sentence”, “paragraph”, “figure”).
-
-
class
fonduer.parser.models.
Document
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A document Context.
Represents all the information of a particular document. What becomes a document depends on which child class of DocPreprocessor is used.
Note
As of v0.6.2, each file is one document when HTMLDocPreprocessor or TextDocPreprocessor is used, whereas each line in the input file is treated as one document when CSVDocPreprocessor or TSVDocPreprocessor is used.
-
id
¶ The unique id of a
Document
.
-
meta
¶ Pickled metadata about a document extracted from a document preprocessor.
-
name
¶ The filename of a
Document
, without its extension (e.g., “BC818”).
-
text
¶ The full text of the
Document
.
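As a small illustration of the name attribute, the following sketch derives a name such as "BC818" from a file path. `document_name` is a hypothetical helper, and how Fonduer treats filenames with several dots is not stated here, so dropping only the last suffix is an assumption.

```python
from pathlib import PurePosixPath


def document_name(path: str) -> str:
    """Return the filename without its directory and last extension."""
    return PurePosixPath(path).stem


name = document_name("data/html/BC818.html")
```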
-
-
class
fonduer.parser.models.
Figure
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A figure Context in a Document.
Used to represent figures in a document.
Note
As of v0.6.2, <img> and <figure> tags turn into Figure.
-
cell
¶ The parent
Cell
, if any.
-
cell_id
¶ The id of the parent
Cell
, if any.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
id
¶ The unique id of the
Figure
.
-
name
¶ The name of a
Figure
.
-
position
¶ The position of the
Figure
in the Document
.
-
section
¶ The parent
Section
.
-
section_id
¶ The id of the parent
Section
.
-
url
¶ The
Figure
’s URL.
-
-
class
fonduer.parser.models.
Paragraph
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A paragraph Context in a Document.
Represents a grouping of adjacent sentences.
Note
As of v0.6.2, the text content held in an element's .text and .tail properties turns into a Paragraph. See https://lxml.de/tutorial.html#elements-contain-text for details about the .text and .tail properties.
-
caption
¶ The parent
Caption
, if any.
-
caption_id
¶ The id of the parent
Caption
, if any.
-
cell
¶ The parent
Cell
, if any.
-
cell_id
¶ The id of the parent
Cell
, if any.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
id
¶ The unique id of the
Paragraph
.
-
name
¶ The name of a
Paragraph
.
-
position
¶ The position of the
Paragraph
in the Document
.
-
section
¶ The parent
Section
.
-
section_id
¶ The id of the parent
Section
.
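The .text and .tail properties mentioned in the Paragraph note can be explored without lxml. The sketch below uses the standard library's `xml.etree.ElementTree`, which exposes the same two properties, as a stand-in. `text_chunks` is a hypothetical helper, and it only shows where the text pieces live; whether each piece becomes exactly one Paragraph in Fonduer also depends on parser options such as flatten.

```python
import xml.etree.ElementTree as ET


def text_chunks(element):
    """Yield non-empty text pieces of an element tree in document order.

    Text before the first child lives in ``element.text``; text that
    follows a child's closing tag lives in that child's ``.tail``.
    """
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from text_chunks(child)
        if child.tail and child.tail.strip():
            yield child.tail.strip()


root = ET.fromstring(
    "<div>Intro text<b>bold</b> trailing text<i>it</i> end</div>"
)
chunks = list(text_chunks(root))
```

Here "trailing text" is the `.tail` of the `<b>` element, not part of the `.text` of the `<div>`.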
-
-
class
fonduer.parser.models.
Section
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Section Context in a Document.
Note
As of v0.6.2, each document simply has a single Section. Specifically, <html> and <section> tags turn into Section. Future parsing improvements can add better section recognition, such as the sections of an academic paper.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
id
¶ The unique id of the
Section
.
-
name
¶ The name of a
Section
.
-
position
¶ The position of the
Section
in a Document
.
-
-
class
fonduer.parser.models.
Sentence
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context, fonduer.parser.models.sentence.TabularMixin, fonduer.parser.models.sentence.LingualMixin, fonduer.parser.models.sentence.VisualMixin, fonduer.parser.models.sentence.StructuralMixin, fonduer.parser.models.sentence.SentenceMixin
A Sentence subclass with Lingual, Tabular, Visual, and HTML attributes.
Note
Unlike other data models, there is no HTML element corresponding to Sentence. One Paragraph comprises one or more Sentences, but how a Paragraph is split depends on which NLP parser (e.g., spaCy) is used.
-
abs_char_offsets
¶ A list of the character offsets of each word in a
Sentence
, with respect to the entire document.
-
bottom
¶ A list of each word’s BOTTOM bounding box coordinate in the
Sentence
.
-
cell
¶ The parent
Cell
, if any.
-
cell_id
¶ The id of the parent
Cell
, if any.
-
char_offsets
¶ A list of the character offsets of each word in a
Sentence
, with respect to the start of the sentence.
-
col_end
¶ The
col_end
of the parent Cell
, if any.
-
col_start
¶ The
col_start
of the parent Cell
, if any.
-
dep_labels
¶ A list of dependency labels for each word in a
Sentence
.
-
dep_parents
¶ A list of the dependency parents for each word in a
Sentence
.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
html_attrs
¶ A list of the html attributes of the element containing the
Sentence
.
-
html_tag
¶ The HTML tag of the element containing the
Sentence
.
-
id
¶ The unique id for the
Sentence
.
-
is_cellular
() → bool¶ Whether or not the
Sentence
contains information about its table cell.
Return type: bool
-
is_lingual
() → bool¶ Whether or not the
Sentence
contains NLP information.
Return type: bool
-
is_structural
() → bool¶ Whether or not the
Sentence
contains structural information.
Return type: bool
-
is_tabular
() → bool¶ Whether or not the
Sentence
contains tabular information.
Return type: bool
-
is_visual
() → bool¶ Whether or not the
Sentence
contains visual information.
Return type: bool
-
left
¶ A list of each word’s LEFT bounding box coordinate in the
Sentence
.
-
lemmas
¶ A list of the lemmas for each word in a
Sentence
.
-
name
¶ The name of a
Sentence
.
-
ner_tags
¶ A list of NER tags for each word in a
Sentence
.
-
page
¶ A list of the page index of each word in the
Sentence
. Page indexes start at 0.
-
paragraph
¶ The parent
Paragraph
.
-
paragraph_id
¶ The id of the parent
Paragraph
.
-
pos_tags
¶ A list of POS tags for each word in a
Sentence
.
-
position
¶ The position of the
Sentence
in the Document
.
-
right
¶ A list of each word’s RIGHT bounding box coordinate in the
Sentence
.
-
row_end
¶ The
row_end
of the parent Cell
, if any.
-
row_start
¶ The
row_start
of the parent Cell
, if any.
-
section
¶ The parent
Section
.
-
section_id
¶ The id of the parent
Section
.
-
table
¶ The parent
Table
, if any.
-
table_id
¶ The id of the parent
Table
, if any.
-
text
¶ The full text of the
Sentence
.
-
top
¶ A list of each word’s TOP bounding box coordinate in the
Sentence
.
-
words
¶ A list of the words in a
Sentence
.
-
xpath
¶ The HTML XPATH to the
Sentence
.
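The difference between char_offsets and abs_char_offsets can be sketched with plain strings. This is an illustrative sketch only: `word_offsets` is a hypothetical helper, and it assumes whitespace tokenization, which is simpler than what an NLP parser such as spaCy does.

```python
from typing import List, Tuple


def word_offsets(sentence: str, sentence_start: int) -> Tuple[List[int], List[int]]:
    """Return (offsets within the sentence, offsets within the document).

    ``sentence_start`` is the index of the sentence's first character
    in the full document text.
    """
    char_offsets = []
    cursor = 0
    for word in sentence.split():
        start = sentence.index(word, cursor)  # search from the previous word's end
        char_offsets.append(start)
        cursor = start + len(word)
    abs_char_offsets = [sentence_start + offset for offset in char_offsets]
    return char_offsets, abs_char_offsets


document_text = "Fonduer parses tables. It also links PDFs."
sentence = "It also links PDFs."
relative, absolute = word_offsets(sentence, document_text.index(sentence))
```

Slicing `document_text` at each absolute offset gives back the corresponding word, while the relative offsets only make sense against `sentence`.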
-
-
class
fonduer.parser.models.
Table
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Table Context in a Document.
Used to represent tables found in a document.
Note
As of v0.6.2, <table> tags turn into Table.
-
document
¶ The parent
Document
.
-
document_id
¶ The id of the parent
Document
.
-
id
¶ The unique id of the
Table
.
-
name
¶ The name of a
Table
.
-
position
¶ The position of the
Table
in the Document
.
-
section
¶ The parent
Section
.
-
section_id
¶ The id of the parent
Section
.
-
-
class
fonduer.parser.models.
Webpage
(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Webpage Context enhanced with additional metadata.
-
crawltime
¶ The timestamp of when the
Webpage
was crawled.
-
host
¶ The host of the
Webpage
.
-
id
¶ The unique id of the
Webpage
.
-
name
¶ The name of a
Webpage
.
-
page_type
¶ The type of the
Webpage
.
-
raw_content
¶ The raw content of the
Webpage
.
-
url
¶ The URL of the
Webpage
.
-
Core Objects¶
This is Fonduer’s core Parser object.
-
class
fonduer.parser.
Parser
(session: sqlalchemy.orm.session.Session, parallelism: int = 1, structural: bool = True, blacklist: List[str] = ['style', 'script'], flatten: List[str] = ['span', 'br'], language: str = 'en', lingual: bool = True, lingual_parser: Optional[fonduer.parser.lingual_parser.lingual_parser.LingualParser] = None, strip: bool = True, replacements: List[Tuple[str, str]] = [('[‐‑‒–—−]', '-')], tabular: bool = True, visual: bool = False, vizlink: Optional[fonduer.parser.visual_linker.VisualLinker] = None, pdf_path: Optional[str] = None)[source]¶ Bases:
fonduer.utils.udf.UDFRunner
Parses documents into Fonduer’s data model.
Parameters: - session – The database session to use.
- parallelism – The number of processes to use in parallel. Default 1.
- structural – Whether to parse structural information from a DOM.
- blacklist – A list of tag types to ignore. Default [“style”, “script”].
- flatten – A list of tag types to flatten. Default [“span”, “br”]
- language – Which spaCy NLP language package. Default “en”.
- lingual – Whether or not to include NLP information. Default True.
- lingual_parser – A custom lingual parser that inherits LingualParser. When specified, language will be ignored. When not, SpacyParser with language will be used.
- strip – Whether or not to strip whitespace during parsing. Default True.
- replacements – A list of tuples where the regex string in the first position is replaced by the character in the second position. Default [('[‐‑‒–—−]', '-')], which replaces various Unicode variants of a hyphen (e.g., em dash, en dash, minus) with a standard ASCII hyphen.
- tabular – Whether to include tabular information in the parse.
- visual – Whether to include visual information in the parse. Requires PDFs for each input document.
- vizlink – A custom visual linker that inherits VisualLinker. Unless otherwise specified, VisualLinker will be used.
- pdf_path – The path to the corresponding PDFs used for visual info.
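The effect of the replacements parameter can be sketched with the standard `re` module. This is a minimal illustration, not Fonduer's internal code: `apply_replacements` is a hypothetical helper, and the escaped characters are believed to match the documented default pattern (hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign).

```python
import re
from typing import List, Tuple

# Assumed equivalent of the documented default, written with escapes.
DEFAULT_REPLACEMENTS: List[Tuple[str, str]] = [
    ("[\u2010\u2011\u2012\u2013\u2014\u2212]", "-"),
]


def apply_replacements(
    text: str, replacements: List[Tuple[str, str]] = DEFAULT_REPLACEMENTS
) -> str:
    """Apply each (regex, replacement) pair to the text in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


normalized = apply_replacements("10\u201320 mA, \u22125 V, part BC\u2011818")
```

Normalizing these variants up front means that matchers written with an ASCII hyphen do not have to account for each Unicode form separately.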
-
apply
(doc_loader: Collection[fonduer.parser.models.document.Document], clear: bool = True, parallelism: Optional[int] = None, progress_bar: bool = True, pdf_path: Optional[str] = None) → None[source]¶ Run the Parser.
Parameters: - doc_loader – An iterable of Documents to parse. Typically, one of Fonduer’s document preprocessors.
- pdf_path – The path to the PDF documents, if any. This path will override the one used in initialization, if provided.
- clear (bool) – Whether or not to clear the existing Context objects before parsing.
- parallelism (int) – How many processes to use for parsing. This will override the parallelism value used to initialize the Parser if it is provided.
- progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
-
clear
(pdf_path: Optional[str] = None) → None[source]¶ Clear all of the
Context
objects in the database.
Parameters: pdf_path – This parameter is ignored.
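Putting the constructor and apply together, a parsing run might look like the sketch below. It uses only the signatures documented above. `parse_html_corpus` is a hypothetical wrapper, and `session` is assumed to be an existing SQLAlchemy session connected to a Fonduer database, whose creation is out of scope here.

```python
def parse_html_corpus(session, docs_path, pdf_path=None, max_docs=10):
    """Parse up to ``max_docs`` HTML files into the Fonduer data model.

    ``session`` is assumed to be an already configured SQLAlchemy session.
    """
    # Imported lazily so this function can be defined without Fonduer installed.
    from fonduer.parser import Parser
    from fonduer.parser.preprocessors import HTMLDocPreprocessor

    doc_preprocessor = HTMLDocPreprocessor(docs_path, max_docs=max_docs)
    parser = Parser(
        session,
        structural=True,
        lingual=True,
        visual=pdf_path is not None,  # visual parsing requires a PDF per document
        pdf_path=pdf_path,
    )
    parser.apply(doc_preprocessor, parallelism=1)
    return parser
```

Passing `pdf_path=None` keeps the documented default of `visual=False`, so no PDFs are needed for a first run.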
-
class
fonduer.parser.visual_linker.
VisualLinker
(pdf_path: str, time: bool = False, verbose: bool = False)[source]¶ Bases:
object
Link visual information with sentences.
-
is_linkable
(filename: str) → bool[source]¶ Verify that the file exists and has a PDF extension.
Parameters: filename (str) – The path to the PDF document.
Return type: bool
-
link
(document_name: str, sentences: List[fonduer.parser.models.sentence.Sentence], pdf_path: str) → Iterator[fonduer.parser.models.sentence.Sentence][source]¶ Link visual information with sentences.
Parameters: - document_name (str) – the document name.
- sentences (Iterable[Sentence]) – sentences to be linked with visual information.
- pdf_path (str) – The path to the PDF documents, if any. This path will override the one used in initialization, if provided.
Return type: A generator of
Sentence
.
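The check described for is_linkable (the file exists and has a PDF extension) can be approximated with the standard library. This is a hedged sketch rather than the method's actual body: `looks_linkable` is a hypothetical name, and matching the extension case-insensitively is an assumption.

```python
import os


def looks_linkable(filename: str) -> bool:
    """Return True if ``filename`` is an existing file ending in .pdf."""
    return os.path.isfile(filename) and filename.lower().endswith(".pdf")
```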
-
Lingual Parsers¶
The following docs describe various lingual parsers. They split text into sentences and enrich them with NLP.
-
class
fonduer.parser.lingual_parser.
LingualParser
[source]¶ Bases:
object
Lingual parser.
-
enrich_sentences_with_NLP
(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence][source]¶ Add NLP attributes like lemmas, pos_tags, etc. to sentences.
Parameters: sentences – an iterator of Sentence.
Returns: a generator of Sentence.
-
has_NLP_support
() → bool[source]¶ Returns True when NLP is supported.
Returns: True when NLP is supported.
-
-
class
fonduer.parser.lingual_parser.
SpacyParser
(lang: Optional[str])[source]¶ Bases:
fonduer.parser.lingual_parser.lingual_parser.LingualParser
spaCy https://spacy.io/
Models for each target language need to be downloaded using the following command:
python -m spacy download en
Default named entity types
PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FACILITY: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LANGUAGE: Any named language.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including “%”.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: “first”, “second”, etc.
CARDINAL: Numerals that do not fall under another type.
-
enrich_sentences_with_NLP
(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence][source]¶ Enrich a list of fonduer Sentence objects with NLP features. We merge and process the text of all Sentences for higher efficiency.
Parameters: sentences – List of fonduer Sentence objects for one document Returns:
-
has_NLP_support
() → bool[source]¶ Returns True when NLP is supported.
Returns: True when NLP is supported.
-
has_tokenizer_support
() → bool[source]¶ Returns True when a tokenizer is supported.
Returns: True when a tokenizer is supported.
-
static
model_installed
(name: str) → bool[source]¶ Check if spaCy language model is installed.
From https://github.com/explosion/spaCy/blob/master/spacy/util.py
Parameters: name – Returns:
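A check of the kind model_installed describes can be approximated by asking Python whether a package with that name is importable. This is only an analogous sketch: `package_installed` is a hypothetical helper built on `importlib`, and spaCy's own utility may work differently.

```python
import importlib.util


def package_installed(name: str) -> bool:
    """Return True if a top-level package called ``name`` can be found."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for malformed names or missing parent packages.
        return False
```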
-
-
class
fonduer.parser.lingual_parser.
SimpleParser
(delim: str = '.')[source]¶ Bases:
fonduer.parser.lingual_parser.lingual_parser.LingualParser
Tokenizes text on whitespace only using split().
Parameters: delim (str) – a delimiter to split text into sentences.
-
enrich_sentences_with_NLP
(sentences: Collection[fonduer.parser.models.sentence.Sentence]) → Iterator[fonduer.parser.models.sentence.Sentence]¶ Add NLP attributes like lemmas, pos_tags, etc. to sentences.
Parameters: sentences – an iterator of Sentence.
Returns: a generator of Sentence.
-
has_NLP_support
() → bool[source]¶ Returns True when NLP is supported.
Returns: True when NLP is supported.
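The behaviour described for SimpleParser, splitting text into sentences on a delimiter and tokenizing on whitespace with split(), can be sketched as follows. `split_sentences` here is an illustrative stand-alone function, not the class's method, and dropping the delimiter from the output is an assumption.

```python
from typing import Iterator, List


def split_sentences(text: str, delim: str = ".") -> Iterator[List[str]]:
    """Yield one list of whitespace-separated words per sentence."""
    for chunk in text.split(delim):
        words = chunk.split()
        if words:  # skip empty pieces, e.g. after a trailing delimiter
            yield words


sentences = list(
    split_sentences("Fonduer parses documents. It builds a data model.")
)
```

Because no language model is involved, an abbreviation such as "e.g." would be split at each period, which is the trade-off of this simple approach.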
-
Preprocessors¶
The following shows descriptions of the various document preprocessors included with Fonduer which are used in parsing documents of different formats.
-
class
fonduer.parser.preprocessors.
CSVDocPreprocessor
(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807, header: bool = False, delim: str = ', ', parser_rule: Optional[Dict[int, Callable]] = None)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A generator which processes a CSV file or directory of CSV files into a set of Document objects. It treats each line in the input file as a document. By default, it assumes that each column is one section and that the content of each column is one paragraph. However, if a column is complex, an advanced parser may be used by specifying the parser_rule parameter as a dict whose keys are column indexes and whose values are the corresponding parsers, e.g., column_constructor in fonduer.utils.utils_parser.
Parameters: - path (str) – filesystem path to file or directory to parse.
- encoding (str) – file encoding to use (e.g. “utf-8”).
- max_docs (int) – the maximum number of Documents to produce.
- header (bool) – whether the CSV file contains a header. If so, the header is used as the Section name. Default False.
- delim (str) – delimiter used to separate columns when the file has more than one column. It is active only when column is not None. Default ’,’.
- parser_rule – The parser rule to be used to parse the specific column. Default None.
Return type: A generator of
Documents
.
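The row-to-document mapping described above (one document per line, one section per column, header values as section names) can be pictured with the standard `csv` module. This is a sketch of the idea only: `rows_to_documents`, the dict layout, and the fallback names such as "column_0" and "row_0" are all hypothetical and do not reflect Fonduer's objects.

```python
import csv
import io
from typing import Dict, List


def rows_to_documents(csv_text: str, header: bool = False, delim: str = ",") -> List[Dict]:
    """Turn each CSV row into a dict with one section per column."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delim))
    names = rows[0] if header and rows else None
    body = rows[1:] if header else rows
    documents = []
    for row_index, row in enumerate(body):
        sections = []
        for col_index, content in enumerate(row):
            if names and col_index < len(names):
                section_name = names[col_index]  # header value names the section
            else:
                section_name = f"column_{col_index}"
            sections.append({"name": section_name, "paragraph": content})
        documents.append({"name": f"row_{row_index}", "sections": sections})
    return documents


docs = rows_to_documents("title,body\nDoc A,First text\nDoc B,Second text\n", header=True)
```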
-
class
fonduer.parser.preprocessors.
DocPreprocessor
(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]¶ Bases:
object
A generator which processes a file or directory of files into a set of Document objects.
Parameters: - path (str) – filesystem path to file or directory to parse.
- encoding (str) – file encoding to use (e.g. “utf-8”).
- max_docs (int) – the maximum number of
Documents
to produce.
Return type: A generator of
Documents
.
-
class
fonduer.parser.preprocessors.
HTMLDocPreprocessor
(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A generator which processes an HTML file or directory of HTML files into a set of Document objects.
Parameters: - encoding (str) – file encoding to use (e.g. “utf-8”).
- path (str) – filesystem path to file or directory to parse.
- max_docs (int) – the maximum number of
Documents
to produce.
Return type: A generator of
Documents
.
-
class
fonduer.parser.preprocessors.
TSVDocPreprocessor
(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807, header: bool = False)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A generator which processes a TSV file or directory of TSV files into a set of Document objects.
The TSV file should have one (doc_name <tab> doc_text) per line.
Parameters: - path (str) – filesystem path to file or directory to parse.
- encoding (str) – file encoding to use (e.g. “utf-8”).
- max_docs (int) – the maximum number of Documents to produce.
- header (bool) – whether the TSV file contains a header. Default False.
Return type: A generator of
Documents
.
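The expected TSV layout, one (doc_name <tab> doc_text) pair per line, can be read with a few lines of plain Python. This is an illustrative sketch: `read_tsv_documents` is a hypothetical helper that yields tuples rather than Document objects, and skipping blank lines is an assumption.

```python
from typing import Iterator, Tuple


def read_tsv_documents(tsv_text: str, header: bool = False) -> Iterator[Tuple[str, str]]:
    """Yield (doc_name, doc_text) for each non-empty line."""
    lines = tsv_text.splitlines()
    if header:
        lines = lines[1:]  # drop the header row
    for line in lines:
        if not line.strip():
            continue
        # Split on the first tab only, so tabs inside the text survive.
        doc_name, doc_text = line.split("\t", 1)
        yield doc_name, doc_text


pairs = list(read_tsv_documents("name\ttext\ndoc1\tHello world.\ndoc2\tA\tB\n", header=True))
```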
-
class
fonduer.parser.preprocessors.
TextDocPreprocessor
(path: str, encoding: str = 'utf-8', max_docs: int = 9223372036854775807)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A generator which processes a text file or directory of text files into a set of Document objects.
Assumes one document per file.
Parameters: - encoding (str) – file encoding to use (e.g. “utf-8”).
- path (str) – filesystem path to file or directory to parse.
- max_docs (int) – the maximum number of
Documents
to produce.
Return type: A generator of
Documents
.