Parsing¶
The first stage of Fonduer’s pipeline is to parse an input corpus of documents into the Fonduer data model.
Fonduer supports different file formats: CSV/TSV, TXT, HTML, and hOCR.
The diagram below illustrates how files in each format are preprocessed and consumed by
Parser. Nodes in dark blue represent original source files.
You have to convert some of them into the formats that Fonduer can consume:
a scanned document (incl. non-searchable PDF) is OCRed and exported in hOCR,
a (born-digital) PDF is converted into hOCR using tools like pdftotree.
It is also possible to convert PDF into HTML using third-party tools,
but this is not recommended (see Visual Parsers).
Multimodal Data Model¶
The following docs describe elements of Fonduer’s data model. These attributes can be used when creating matchers, throttlers, and labeling functions.
Fonduer’s parser model module.
-
class
fonduer.parser.models.Caption(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Caption Context in a Document.
Used to represent figure or table captions in a document.
Note
As of v0.6.2, <caption> and <figcaption> tags turn into Caption.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
figure¶ The parent
Figure, if any.
-
figure_id¶ The id of the parent
Figure, if any.
-
id¶ The unique id of the
Caption.
-
name¶ The name of a
Caption.
-
position¶ The position of the
Caption in the Document.
-
table¶ The parent
Table, if any.
-
table_id¶ The id of the parent
Table, if any.
-
-
class
fonduer.parser.models.Cell(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A cell Context in a Document.
Used to represent the cells that comprise a table in a document.
Note
As of v0.6.2, <th> and <td> tags turn into Cell.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
col_end¶ The end index of the column in the
Table the Cell is in.
-
col_start¶ The start index of the column in the
Table the Cell is in.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
id¶ The unique id of the
Cell.
-
name¶ The name of a
Cell.
-
position¶ The position of the
Cell in the Table.
-
row_end¶ The end index of the row in the
Table the Cell is in.
-
row_start¶ The start index of the row in the
Table the Cell is in.
-
table¶ The parent
Table.
-
table_id¶ The id of the parent
Table.
-
-
class
fonduer.parser.models.Context(**kwargs)[source]¶ Bases:
sqlalchemy.orm.decl_api.Base
A piece of content from which Candidates are composed.
This serves as the base class of the Fonduer document model.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
id¶ The unique id of the
Context.
-
stable_id¶ A stable representation of the
Context that will not change between runs.
-
type¶ The type of the
Context represented as a string (e.g. “sentence”, “paragraph”, “figure”).
-
-
class
fonduer.parser.models.Document(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A document Context.
Represents all the information of a particular document. What becomes a document depends on which child class of DocPreprocessor is used.
Note
As of v0.6.2, each file is one document when HTMLDocPreprocessor or TextDocPreprocessor is used, whereas each line in the input file is treated as one document when CSVDocPreprocessor or TSVDocPreprocessor is used.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
id¶ The unique id of a
Document.
-
meta¶ Pickled metadata about a document extracted from a document preprocessor.
-
name¶ The filename of a
Document, without its extension (e.g., “BC818”).
-
text¶ The full text of the
Document.
-
-
class
fonduer.parser.models.Figure(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A figure Context in a Document.
Used to represent figures in a document.
Note
As of v0.6.2, <img> and <figure> tags turn into Figure.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
cell¶ The parent
Cell, if any.
-
cell_id¶ The id of the parent
Cell, if any.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
id¶ The unique id of the
Figure.
-
name¶ The name of a
Figure.
-
position¶ The position of the
Figure in the Document.
-
section¶ The parent
Section.
-
section_id¶ The id of the parent
Section.
-
url¶ The
Figure’s URL.
-
-
class
fonduer.parser.models.Paragraph(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A paragraph Context in a Document.
Represents a grouping of adjacent sentences.
Note
As of v0.6.2, the text content held in the two properties .text and .tail turns into Paragraph. See https://lxml.de/tutorial.html#elements-contain-text for details about the .text and .tail properties.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
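The .text and .tail distinction mentioned in the note can be seen with a small sketch. It uses the standard library's xml.etree.ElementTree, which exposes the same two properties as lxml; the helper text_pieces is hypothetical and does not reproduce Fonduer's own traversal (for example, it ignores the flatten and blacklist options of Parser).

```python
import xml.etree.ElementTree as ET

# "Intro text" is the .text of <div>; " trailing text" is the .tail of <b>.
root = ET.fromstring("<div>Intro text<b>bold</b> trailing text</div>")
bold = root.find("b")


def text_pieces(element):
    """Yield non-empty .text and .tail strings in document order.

    Hypothetical helper: each yielded piece is the kind of text content
    that the note above says turns into a Paragraph.
    """
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from text_pieces(child)
        if child.tail and child.tail.strip():
            yield child.tail.strip()


pieces = list(text_pieces(root))
print(pieces)
```

The text that follows a closing tag belongs to the .tail of that element rather than to the parent's .text, which is why both properties have to be visited.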
-
caption¶ The parent
Caption, if any.
-
caption_id¶ The id of the parent
Caption, if any.
-
cell¶ The parent
Cell, if any.
-
cell_id¶ The id of the parent
Cell, if any.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
id¶ The unique id of the
Paragraph.
-
name¶ The name of a
Paragraph.
-
position¶ The position of the
Paragraph in the Document.
-
section¶ The parent
Section.
-
section_id¶ The id of the parent
Section.
-
-
class
fonduer.parser.models.Section(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Section Context in a Document.
Note
As of v0.6.2, each document simply has a single Section. Specifically, <html> and <section> tags turn into Section. Future parsing improvements can add better section recognition, such as the sections of an academic paper.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
id¶ The unique id of the
Section.
-
name¶ The name of a
Section.
-
position¶ The position of the
Section in a Document.
-
-
class
fonduer.parser.models.Sentence(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context, fonduer.parser.models.sentence.TabularMixin, fonduer.parser.models.sentence.LingualMixin, fonduer.parser.models.sentence.VisualMixin, fonduer.parser.models.sentence.StructuralMixin, fonduer.parser.models.sentence.SentenceMixin
A Sentence subclass with Lingual, Tabular, Visual, and HTML attributes.
Note
Unlike other data models, there is no HTML element corresponding to Sentence. One Paragraph comprises one or more Sentences, but how a Paragraph is split depends on which NLP parser (e.g., spaCy) is used.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
abs_char_offsets¶ A list of the character offsets of each word in a
Sentence, with respect to the entire document.
-
bottom¶ List of each word’s BOTTOM bounding box coordinate in the
Sentence.
-
cell¶ Parent
Cell, if any.
-
cell_id¶ Id of the parent
Cell, if any.
-
char_offsets¶ A list of the character offsets of each word in a
Sentence, with respect to the start of the sentence.
-
col_end¶ col_end of the parent Cell, if any.
-
col_start¶ col_start of the parent Cell, if any.
-
dep_labels¶ List of dependency labels for each word in a
Sentence.
-
dep_parents¶ List of the dependency parents for each word in a
Sentence.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
get_bbox()¶ Get the bounding box.
- Return type
Bbox
-
html_attrs¶ List of the html attributes of the element containing the
Sentence.
-
html_tag¶ HTML tag of the element containing the
Sentence.
-
id¶ The unique id for the
Sentence.
-
is_cellular()¶ Whether or not the Sentence contains information about its table cell.
- Return type
bool
-
is_lingual()¶ Whether or not the Sentence contains NLP information.
- Return type
bool
-
is_structural()¶ Whether or not the Sentence contains structural information.
- Return type
bool
-
is_tabular()¶ Whether or not the Sentence contains tabular information.
- Return type
bool
-
is_visual()¶ Whether or not the Sentence contains visual information.
- Return type
bool
-
left¶ List of each word’s LEFT bounding box coordinate in the
Sentence.
-
lemmas¶ List of the lemmas for each word in a
Sentence.
-
name¶ The name of a
Sentence.
-
ner_tags¶ List of NER tags for each word in a
Sentence.
-
page¶ List of the page index of each word in the
Sentence. Page indexes start at 1.
-
paragraph¶ The parent
Paragraph.
-
paragraph_id¶ The id of the parent
Paragraph.
-
pos_tags¶ List of POS tags for each word in a
Sentence.
-
position¶ The position of the
Sentence in the Document.
-
right¶ List of each word’s RIGHT bounding box coordinate in the
Sentence.
-
row_end¶ row_end of the parent Cell, if any.
-
row_start¶ row_start of the parent Cell, if any.
-
section¶ The parent
Section.
-
section_id¶ The id of the parent
Section.
-
table¶ Parent
Table, if any.
-
table_id¶ Id of the parent
Table, if any.
-
text¶ The full text of the
Sentence.
-
top¶ List of each word’s TOP bounding box coordinate in the
Sentence.
-
words¶ A list of the words in a
Sentence.
-
xpath¶ HTML XPATH to the
Sentence.
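To make the difference between char_offsets and abs_char_offsets concrete, here is a minimal sketch in plain Python. The function word_offsets and the sentence_start argument are hypothetical names used only for illustration. The code shows the relationship described above (offsets relative to the sentence versus relative to the whole document), not how Fonduer computes these lists internally.

```python
def word_offsets(sentence_text, words, sentence_start=0):
    """Return (char_offsets, abs_char_offsets) for the words of one sentence.

    sentence_start is the assumed character position of the sentence
    within the full document text.
    """
    char_offsets = []
    cursor = 0
    for word in words:
        # Find each word at or after the end of the previous one.
        index = sentence_text.index(word, cursor)
        char_offsets.append(index)
        cursor = index + len(word)
    abs_char_offsets = [sentence_start + offset for offset in char_offsets]
    return char_offsets, abs_char_offsets


text = "Fonduer parses documents"
relative, absolute = word_offsets(text, text.split(), sentence_start=100)
print(relative)
print(absolute)
```

Both lists have one entry per word, in the same order as words, so they can be zipped with the other per-word lists such as lemmas or page.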
-
-
class
fonduer.parser.models.Table(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Table Context in a Document.
Used to represent tables found in a document.
Note
As of v0.6.2, <table> tags turn into Table.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
document¶ The parent
Document.
-
document_id¶ The id of the parent
Document.
-
id¶ The unique id of the
Table.
-
name¶ The name of a
Table.
-
position¶ The position of the
Table in the Document.
-
section¶ The parent
Section.
-
section_id¶ The id of the parent
Section.
-
-
class
fonduer.parser.models.Webpage(**kwargs)[source]¶ Bases:
fonduer.parser.models.context.Context
A Webpage Context enhanced with additional metadata.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
crawltime¶ The timestamp of when the
Webpage was crawled.
-
host¶ The host of the
Webpage.
-
id¶ The unique id of the
Webpage.
-
name¶ The name of a
Webpage.
-
page_type¶ The type of the
Webpage.
-
raw_content¶ The raw content of the
Webpage.
-
url¶ The URL of the
Webpage.
-
Core Objects¶
This is Fonduer’s core Parser object.
Fonduer’s parser module.
-
class
fonduer.parser.Parser(session, parallelism=1, structural=True, blacklist=['style', 'script'], flatten=['span', 'br'], language='en', lingual=True, lingual_parser=None, strip=True, replacements=[('[‐‑‒–—−]', '-')], tabular=True, visual_parser=None)[source]¶ Bases:
fonduer.utils.udf.UDFRunner
Parses documents into Fonduer’s Data Model.
- Parameters
session (Session) – The database session to use.
parallelism (int) – The number of processes to use in parallel. Default 1.
structural (bool) – Whether to parse structural information from a DOM.
blacklist (List[str]) – A list of tag types to ignore. Default [“style”, “script”].
flatten (List[str]) – A list of tag types to flatten. Default [“span”, “br”].
language (str) – Which spaCy NLP language package. Default “en”.
lingual (bool) – Whether or not to include NLP information. Default True.
lingual_parser (Optional[LingualParser]) – A custom lingual parser that inherits LingualParser. When specified, language will be ignored. When not, Spacy with language will be used.
strip (bool) – Whether or not to strip whitespace during parsing. Default True.
replacements (List[Tuple[str, str]]) – A list of tuples where the regex string in the first position is replaced by the character in the second position. Default [(u"[\u2010\u2011\u2012\u2013\u2014\u2212]", "-")], which replaces various unicode variants of a hyphen (e.g. emdash, endash, minus, etc.) with a standard ASCII hyphen.
tabular (bool) – Whether to include tabular information in the parse.
visual_parser (Optional[VisualParser]) – A visual parser that parses visual information. Defaults to None (visual information is not parsed).
Initialize Parser.
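As an illustration of what the default replacements value does, the following sketch applies the same (regex, replacement) pair with Python's re module. The helper apply_replacements is hypothetical and only mirrors the described behaviour. It is not Fonduer's internal code.

```python
import re

# The default rule from the Parser signature: Unicode hyphen variants -> "-".
DEFAULT_REPLACEMENTS = [("[\u2010\u2011\u2012\u2013\u2014\u2212]", "-")]


def apply_replacements(text, replacements=DEFAULT_REPLACEMENTS):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


# En dash, Unicode hyphen, and minus sign are all normalized.
normalized = apply_replacements("10\u201320 V, low\u2010power, \u22125 mA")
print(normalized)
```

Normalizing these characters before matching means a single ASCII hyphen in a matcher pattern can cover ranges and negative values regardless of how the source document encoded them.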
-
apply(doc_loader, clear=True, parallelism=None, progress_bar=True)[source]¶ Run the Parser.
- Parameters
doc_loader (Collection[Document]) – An iterable of Documents to parse. Typically, one of Fonduer’s document preprocessors.
clear (bool) – Whether or not to clear previously parsed data before parsing.
parallelism (Optional[int]) – How many processes to use for parsing. This will override the parallelism value used to initialize the Parser if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.
- Return type
None
-
get_documents()[source]¶ Return all the successfully parsed Documents in the database.
- Return type
List[Document]
- Returns
A list of all Documents in the database ordered by name.
-
get_last_documents()[source]¶ Return the most recently successfully parsed list of Documents.
- Return type
List[Document]
- Returns
A list of the most recently parsed Documents ordered by name.
-
last_docs: Set[str]¶ The last set of documents that apply() was called on
Lingual Parsers¶
The following docs describe various lingual parsers. They split text into sentences and enrich them with NLP.
Fonduer’s lingual parser module.
-
class
fonduer.parser.lingual_parser.LingualParser[source]¶ Bases:
object
Lingual parser.
-
enrich_sentences_with_NLP(sentences)[source]¶ Add NLP attributes like lemmas, pos_tags, etc. to sentences.
-
has_NLP_support()[source]¶ Return True when NLP is supported.
- Return type
bool
- Returns
True when NLP is supported.
-
-
class
fonduer.parser.lingual_parser.SimpleParser(delim='.')[source]¶ Bases:
fonduer.parser.lingual_parser.lingual_parser.LingualParser
Tokenizes text on whitespace only using split().
- Parameters
delim (str) – a delimiter to split text into sentences.
Initialize SimpleParser.
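The idea behind a whitespace-only tokenizer with a sentence delimiter can be sketched in a few lines of plain Python. The function simple_split is a hypothetical stand-in. It shows the general technique (split on delim, then on whitespace) and makes no claim about details of SimpleParser itself, such as whether the delimiter is kept in the output.

```python
def simple_split(text, delim="."):
    """Split text into sentences on delim, then into words on whitespace."""
    sentences = []
    for chunk in text.split(delim):
        words = chunk.split()
        if words:  # skip empty chunks, e.g. after a trailing delimiter
            sentences.append(words)
    return sentences


result = simple_split("Fonduer parses documents. It stores sentences.")
print(result)
```

A parser of this kind provides no lemmas, POS tags, or dependency information, which is what has_NLP_support() is there to report.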
-
enrich_sentences_with_NLP(sentences)¶ Add NLP attributes like lemmas, pos_tags, etc. to sentences.
-
has_NLP_support()[source]¶ Return True when NLP is supported.
- Return type
bool
- Returns
True when NLP is supported.
-
class
fonduer.parser.lingual_parser.SpacyParser(lang)[source]¶ Bases:
fonduer.parser.lingual_parser.lingual_parser.LingualParser
Spacy parser class.
- Parameters
lang (Optional[str]) – Language. This can be one of ["en", "de", "es", "pt", "fr", "it", "nl", "xx", "ja", "zh"]. See the spaCy documentation for details of the supported languages.
Initialize SpacyParser.
-
enrich_sentences_with_NLP(sentences)[source]¶ Enrich a list of fonduer Sentence objects with NLP features.
We merge and process the text of all Sentences for higher efficiency.
- Parameters
sentences (Collection[Sentence]) – List of fonduer Sentence objects for one document.
- Return type
Iterator[Sentence]
- Returns
-
has_NLP_support()[source]¶ Return True when NLP is supported.
- Return type
bool
- Returns
True when NLP is supported.
-
has_tokenizer_support()[source]¶ Return True when a tokenizer is supported.
- Return type
bool
- Returns
True when a tokenizer is supported.
-
static
model_installed(name)[source]¶ Check if spaCy language model is installed.
From https://github.com/explosion/spaCy/blob/master/spacy/util.py
- Parameters
name (str) –
- Return type
bool
- Returns
Visual Parsers¶
The following docs describe various visual parsers. They parse visual information,
e.g., bounding boxes of each word.
Fonduer can parse visual information only for hOCR and HTML files,
with the help of HocrVisualParser and PdfVisualParser, respectively.
It is recommended to provide documents in hOCR instead of HTML,
because PdfVisualParser is not always accurate by its nature and could
assign a wrong bounding box to a word (see #12).
Fonduer’s visual parser module.
-
class
fonduer.parser.visual_parser.HocrVisualParser(replacements=[('[‐‑‒–—−]', '-')])[source]¶ Bases:
fonduer.parser.visual_parser.visual_parser.VisualParser
Visual Parser for hOCR.
Initialize a visual parser.
- Raises
ImportError – an error is raised when spaCy is not 2.3.0 or later.
-
class
fonduer.parser.visual_parser.PdfVisualParser(pdf_path, verbose=False)[source]¶ Bases:
fonduer.parser.visual_parser.visual_parser.VisualParser
Link visual information, extracted from PDF, with parsed sentences.
This linker assumes the following conditions for expected results:
The PDF file exists in a directory specified by pdf_path.
The basename of the PDF file is the same as the document name and its extension is either “.pdf” or “.PDF”.
The PDF has a text layer.
Initialize VisualParser.
- Parameters
pdf_path (str) – a path to the directory that contains PDF files.
verbose (bool) – whether to turn on verbose logging.
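The first two conditions above (the PDF lives in pdf_path, and its basename equals the document name with a “.pdf” or “.PDF” extension) can be checked ahead of time with a small sketch like the one below. The function find_pdf is a hypothetical helper written for illustration. It is not part of Fonduer, and it does not check the third condition (that the PDF has a text layer).

```python
import os


def find_pdf(pdf_path, document_name):
    """Return the path of the PDF matching document_name, or None.

    Looks for <pdf_path>/<document_name>.pdf and then .PDF, following
    the naming conditions listed above.
    """
    for extension in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_path, document_name + extension)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Running such a check over all document names before parsing can surface missing or misnamed PDFs early, before visual information silently goes missing for some documents.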
-
class
fonduer.parser.visual_parser.VisualParser[source]¶ Bases:
abc.ABC
Abstract visual parser.
-
abstract
is_parsable(document_name)[source]¶ Check if visual information can be parsed.
- Parameters
document_name (str) – the document name.
- Return type
bool
- Returns
Whether visual information is parsable.
-
abstract
parse(document_name, sentences)[source]¶ Parse visual information and link them with given sentences.
- Parameters
document_name (str) – the document name.
sentences (Iterable[Sentence]) – sentences to be linked with visual information.
- Yield
sentences with visual information.
- Return type
Iterator[Sentence]
Preprocessors¶
The following shows descriptions of the various document preprocessors included with Fonduer which are used in parsing documents of different formats.
Fonduer’s parser preprocessor module.
-
class
fonduer.parser.preprocessors.CSVDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, header=False, delim=',', parser_rule=None)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A Document generator for CSV files.
It treats each line in the input file as a Document. It assumes that each column is one Section and that the content in each column is one Paragraph by default. However, if the column is complex, an advanced parser may be used by specifying the parser_rule parameter in a dict format where the key is the column index and the value is the specific parser, e.g., column_constructor in fonduer.utils.utils_parser.
Initialize CSV DocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use (e.g. “utf-8”).
max_docs (int) – the maximum number of Documents to produce.
header (bool) – whether the CSV file contains a header. If so, the header is used as the Section name. Default False.
delim (str) – delimiter used to separate columns when the file has more than one column. It is active only when column is not None. Default ‘,’.
parser_rule (Optional[Dict[int, Callable]]) – The parser rule to be used to parse the specific column. Default None.
- Returns
A generator of
Documents.
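The row-to-document mapping described for CSV input can be sketched with the standard csv module. Everything here is a simplified model under stated assumptions: rows_to_documents is a hypothetical function, documents are represented as plain lists of dicts rather than Fonduer objects, and the lambda passed in parser_rule is a stand-in whose call signature is assumed (the signature expected by column_constructor is not shown in this section).

```python
import csv
import io


def rows_to_documents(csv_text, header=False, delim=",", parser_rule=None):
    """Model each CSV row as a document and each column as a section."""
    parser_rule = parser_rule or {}
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delim))
    # With header=True, the first row supplies the section names.
    names = rows.pop(0) if header and rows else None
    documents = []
    for row in rows:
        sections = []
        for index, value in enumerate(row):
            parse = parser_rule.get(index)
            # Default: the whole column value is one paragraph.
            paragraphs = list(parse(value)) if parse else [value]
            name = names[index] if names else None
            sections.append({"name": name, "paragraphs": paragraphs})
        documents.append(sections)
    return documents


docs = rows_to_documents(
    "title,tags\nBC818,npn;smd\n",
    header=True,
    parser_rule={1: lambda cell: cell.split(";")},  # assumed rule for column 1
)
print(docs)
```

Keying parser_rule by column index lets simple columns use the default one-paragraph treatment while only the complex columns get custom handling.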
-
class
fonduer.parser.preprocessors.DocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶ Bases:
object
An abstract class of a Document generator.
Unless otherwise stated by a subclass, it’s assumed that there is one Document per file.
Initialize DocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.
- Returns
A generator of
Documents.
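The path parameter accepts a file, a directory, or a glob pattern, and expects basenames to be unique. A minimal sketch of that contract, using only the standard library, might look as follows. The function resolve_paths is hypothetical and is not how DocPreprocessor is implemented. It only illustrates the three accepted forms and the uniqueness requirement.

```python
import glob
import os
from collections import Counter


def resolve_paths(path):
    """Expand a file, directory, or glob pattern into a sorted file list."""
    if os.path.isdir(path):
        candidates = [os.path.join(path, name) for name in os.listdir(path)]
    elif os.path.isfile(path):
        candidates = [path]
    else:
        candidates = glob.glob(path)
    files = sorted(c for c in candidates if os.path.isfile(c))

    # Basenames should be unique among all files.
    counts = Counter(os.path.basename(f) for f in files)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate basenames: {duplicates}")
    return files
```

Uniqueness matters because a Document's name is derived from its filename, so two files with the same basename in different directories would otherwise collide.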
-
class
fonduer.parser.preprocessors.HOCRDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, space=True)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A Document generator for hOCR files.
hOCR should comply with hOCR v1.2. Note that the ppageno property of ocr_page is optional in hOCR v1.2, but is required by Fonduer.
Initialize HOCRDocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.
space (bool) – boolean value indicating whether each word should have a subsequent space. E.g., English has spaces between words.
- Returns
A generator of
Documents.
-
class
fonduer.parser.preprocessors.HTMLDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A Document generator for HTML files.
Initialize DocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.
- Returns
A generator of
Documents.
-
class
fonduer.parser.preprocessors.TSVDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, header=False)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A Document generator for TSV files.
It treats each line in the input file as a Document. The TSV file should have one (doc_name <tab> doc_text) per line.
Initialize TSV DocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use (e.g. “utf-8”).
max_docs (int) – the maximum number of Documents to produce.
header (bool) – whether the TSV file contains a header. Default False.
- Returns
A generator of
Documents.
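The expected TSV layout, one (doc_name <tab> doc_text) pair per line, can be read with the standard csv module as in the sketch below. The function read_tsv_documents is hypothetical and returns plain tuples. It models the input format only and is not the preprocessor's implementation.

```python
import csv
import io


def read_tsv_documents(tsv_text, header=False, max_docs=None):
    """Return (doc_name, doc_text) pairs, one per non-empty TSV line."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    documents = []
    for line_number, row in enumerate(reader):
        if not row or (header and line_number == 0):
            continue  # skip blank lines and the optional header row
        if max_docs is not None and len(documents) >= max_docs:
            break
        doc_name, doc_text = row  # exactly two columns are expected
        documents.append((doc_name, doc_text))
    return documents


docs = read_tsv_documents("doc1\tFirst text.\ndoc2\tSecond text.\n")
print(docs)
```

Because every line becomes its own document, a single TSV file can carry a whole corpus, in contrast to the one-document-per-file formats.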
-
class
fonduer.parser.preprocessors.TextDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶ Bases:
fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor
A Document generator for plain text files.
Initialize DocPreprocessor.
- Parameters
path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.
- Returns
A generator of
Documents.