Parsing¶

The first stage of Fonduer’s pipeline is to parse an input corpus of documents into the Fonduer data model.

Fonduer supports several file formats: CSV/TSV, TXT, HTML, and hOCR. The diagram below illustrates how files in each format are preprocessed and consumed by Parser. Nodes in dark blue represent original source files. Some of them must first be converted into a format that Fonduer can consume: a scanned document (including a non-searchable PDF) is OCRed and exported as hOCR, and a born-digital PDF is converted into hOCR using tools like pdftotree. It is also possible to convert a PDF into HTML using third-party tools, but this is not recommended (see Visual Parsers).

graph LR; CSV[(CSV)]-->CSVDoc; TXT[(TXT)]-->TXTDoc; HTML[(HTML)]-->HTMLDoc; PDF[(PDF)]--Convert-->HTML2[(HTML)]; HTML2-->HTMLDoc; PDF--Convert-->hOCR2[(hOCR)]; Scan[(Scan)]--OCR-->hOCR[(hOCR)]; hOCR-->HOCRDoc; hOCR2-->HOCRDoc; subgraph Fonduer CSVDoc(CSVDocPreprocessor)-->parser; TXTDoc(TXTDocPreprocessor)-->parser; HTMLDoc(HTMLDocPreprocessor)-->parser; HOCRDoc(HOCRDocPreprocessor)-->parser; parser(Parser)-->others(..); end classDef source fill:#aaf; classDef preproc fill:#afa; class Scan,CSV,TXT,HTML,PDF source; class HOCRDoc,CSVDoc,TXTDoc,HTMLDoc preproc;

Multimodal Data Model¶

The following docs describe elements of Fonduer’s data model. These attributes can be used when creating matchers, throttlers, and labeling functions.

Fonduer’s parser model module.

class fonduer.parser.models.Caption(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A Caption Context in a Document.

Used to represent figure or table captions in a document.

Note

As of v0.6.2, <caption> and <figcaption> tags turn into Caption.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

document¶: The parent Document.

document_id¶: The id of the parent Document.

figure¶: The parent Figure, if any.

figure_id¶: The id of the parent Figure, if any.

id¶: The unique id of the Caption.

name¶: The name of a Caption.

position¶: The position of the Caption in the Document.

table¶: The parent Table, if any.

table_id¶: The id of the parent Table, if any.

class fonduer.parser.models.Cell(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A cell Context in a Document.

Used to represent the cells that comprise a table in a document.

Note

As of v0.6.2, <th> and <td> tags turn into Cell.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

col_end¶: The end index of the column in the Table the Cell is in.

col_start¶: The start index of the column in the Table the Cell is in.

document¶: The parent Document.

document_id¶: The id of the parent Document.

id¶: The unique id of the Cell.

name¶: The name of a Cell.

position¶: The position of the Cell in the Table.

row_end¶: The end index of the row in the Table the Cell is in.

row_start¶: The start index of the row in the Table the Cell is in.

table¶: The parent Table.

table_id¶: The id of the parent Table.

class fonduer.parser.models.Context(**kwargs)[source]¶

Bases: sqlalchemy.orm.decl_api.Base

A piece of content from which Candidates are composed.

This serves as the base class of the Fonduer document model.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
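The keyword-only constructor described above can be illustrated with a small stand-alone sketch in plain Python. This is not Fonduer’s or SQLAlchemy’s code: the class ToyContext, its attributes, and the example values are hypothetical and only mirror the rule that keys which are not attributes of the class are rejected.

```python
class ToyContext:
    """Hypothetical stand-in for a mapped class; not part of Fonduer."""

    # Class-level attributes play the role of mapped columns here.
    id = None
    stable_id = None
    type = None

    def __init__(self, **kwargs):
        cls = self.__class__
        for key, value in kwargs.items():
            # Reject any key that is not already an attribute of the class.
            if not hasattr(cls, key):
                raise TypeError(
                    f"{key!r} is an invalid keyword argument for {cls.__name__}"
                )
            setattr(self, key, value)


# Made-up values, for illustration only.
ctx = ToyContext(stable_id="example-stable-id", type="sentence")
print(ctx.type)
```

Passing a key such as colour="red" to this toy class raises TypeError, which is the behaviour the description above implies for unknown keys.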

id¶: The unique id of the Context.

stable_id¶: A stable representation of the Context that will not change between runs.

type¶: The type of the Context represented as a string (e.g. “sentence”, “paragraph”, “figure”).

class fonduer.parser.models.Document(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A document Context.

Represents all the information of a particular document. What becomes a document depends on which child class of DocPreprocessor is used.

Note

As of v0.6.2, each file is one document when HTMLDocPreprocessor or TextDocPreprocessor is used, whereas each line in the input file is treated as one document when CSVDocPreprocessor or TSVDocPreprocessor is used.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

id¶: The unique id of a Document.

meta¶: Pickled metadata about a document extracted by a document preprocessor.

name¶: The filename of a Document, without its extension (e.g., “BC818”).

text¶: The full text of the Document.

class fonduer.parser.models.Figure(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A figure Context in a Document.

Used to represent figures in a document.

Note

As of v0.6.2, <img> and <figure> tags turn into Figure.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

cell¶: The parent Cell, if any.

cell_id¶: The id of the parent Cell, if any.

document¶: The parent Document.

document_id¶: The id of the parent Document.

id¶: The unique id of the Figure.

name¶: The name of a Figure.

position¶: The position of the Figure in the Document.

section¶: The parent Section.

section_id¶: The id of the parent Section.

url¶: The Figure’s URL.

class fonduer.parser.models.Paragraph(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A paragraph Context in a Document.

Represents a grouping of adjacent sentences.

Note

As of v0.6.2, text content held in the two properties .text and .tail turns into a Paragraph. See https://lxml.de/tutorial.html#elements-contain-text for details about the .text and .tail properties.
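To make the .text and .tail properties mentioned in the note above concrete, here is a minimal sketch using the standard library’s xml.etree.ElementTree, which exposes the same two properties as lxml. It only shows where text lives in the element tree; how Fonduer then groups that text into Paragraphs is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Text before the first child belongs to the parent's .text;
# text after a child element belongs to that child's .tail.
body = ET.fromstring("<body>Hello<br/>World</body>")
br = body.find("br")

print(repr(body.text))  # text that precedes <br/> inside <body>
print(repr(br.text))    # <br/> has no content of its own
print(repr(br.tail))    # text that follows <br/> inside <body>
```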

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

caption¶: The parent Caption, if any.

caption_id¶: The id of the parent Caption, if any.

cell¶: The parent Cell, if any.

cell_id¶: The id of the parent Cell, if any.

document¶: The parent Document.

document_id¶: The id of the parent Document.

id¶: The unique id of the Paragraph.

name¶: The name of a Paragraph.

position¶: The position of the Paragraph in the Document.

section¶: The parent Section.

section_id¶: The id of the parent Section.

class fonduer.parser.models.Section(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A Section Context in a Document.

Note

As of v0.6.2, each document simply has a single Section. Specifically, <html> and <section> tags turn into Section. Future parsing improvements can add better section recognition, such as the sections of an academic paper.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

document¶: The parent Document.

document_id¶: The id of the parent Document.

id¶: The unique id of the Section.

name¶: The name of a Section.

position¶: The position of the Section in a Document.

class fonduer.parser.models.Sentence(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context, fonduer.parser.models.sentence.TabularMixin, fonduer.parser.models.sentence.LingualMixin, fonduer.parser.models.sentence.VisualMixin, fonduer.parser.models.sentence.StructuralMixin, fonduer.parser.models.sentence.SentenceMixin

A Sentence subclass with Lingual, Tabular, Visual, and HTML attributes.

Note

Unlike other data models, there is no HTML element corresponding to Sentence. One Paragraph comprises one or more Sentences, but how a Paragraph is split depends on which NLP parser (e.g., spaCy) is used.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

abs_char_offsets¶: A list of the character offsets of each word in a Sentence, with respect to the entire document.

bottom¶: List of each word’s BOTTOM bounding box coordinate in the Sentence.

cell¶: Parent Cell, if any.

cell_id¶: Id of the parent Cell, if any.

char_offsets¶: A list of the character offsets of each word in a Sentence, with respect to the start of the sentence.

col_end¶: col_end of the parent Cell, if any.

col_start¶: col_start of the parent Cell, if any.

dep_labels¶: List of dependency labels for each word in a Sentence.

dep_parents¶: List of the dependency parents for each word in a Sentence.

document¶: The parent Document.

document_id¶: The id of the parent Document.

get_bbox()¶

Get the bounding box.

Return type: Bbox

html_attrs¶: List of the html attributes of the element containing the Sentence.

html_tag¶: HTML tag of the element containing the Sentence.

id¶: The unique id for the Sentence.

is_cellular()¶

Whether or not the Sentence contains information about its table cell.

Return type: bool

is_lingual()¶

Whether or not the Sentence contains NLP information.

Return type: bool

is_structural()¶

Whether or not the Sentence contains structural information.

Return type: bool

is_tabular()¶

Whether or not the Sentence contains tabular information.

Return type: bool

is_visual()¶

Whether or not the Sentence contains visual information.

Return type: bool

left¶: List of each word’s LEFT bounding box coordinate in the Sentence.

lemmas¶: List of the lemmas for each word in a Sentence.

name¶: The name of a Sentence.

ner_tags¶: List of NER tags for each word in a Sentence.

page¶

List of the page index of each word in the Sentence.

Page indexes start at 1.

paragraph¶: The parent Paragraph.

paragraph_id¶: The id of the parent Paragraph.

pos_tags¶: List of POS tags for each word in a Sentence.

position¶: The position of the Sentence in the Document.

right¶: List of each word’s RIGHT bounding box coordinate in the Sentence.

row_end¶: row_end of the parent Cell, if any.

row_start¶: row_start of the parent Cell, if any.

section¶: The parent Section.

section_id¶: The id of the parent Section.

table¶: Parent Table, if any.

table_id¶: Id of the parent Table, if any.

text¶: The full text of the Sentence.

top¶: List of each word’s TOP bounding box coordinate in the Sentence.

words¶: A list of the words in a Sentence.

xpath¶: HTML XPATH to the Sentence.
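The per-word page, top, bottom, left, and right lists described above can be combined into one box for the whole Sentence. The sketch below is an assumption about how such a box could be derived, not Fonduer’s implementation of get_bbox(): the Box tuple and its field names are hypothetical, the first word’s page is used for the whole sentence, and coordinates are assumed to grow downwards and to the right.

```python
from collections import namedtuple

# Hypothetical stand-in for the Bbox return type; field names are assumed.
Box = namedtuple("Box", ["page", "top", "bottom", "left", "right"])


def enclosing_box(page, top, bottom, left, right):
    """Return the smallest box covering every word box, or None if empty."""
    if not (page and top and bottom and left and right):
        return None  # no visual information available
    return Box(page[0], min(top), max(bottom), min(left), max(right))


# Three words on page 1 with made-up coordinates.
box = enclosing_box(
    page=[1, 1, 1],
    top=[100, 100, 102],
    bottom=[112, 112, 114],
    left=[50, 90, 140],
    right=[85, 135, 180],
)
print(box)
```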

class fonduer.parser.models.Table(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A Table Context in a Document.

Used to represent tables found in a document.

Note

As of v0.6.2, <table> tags turn into Table.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

document¶: The parent Document.

document_id¶: The id of the parent Document.

id¶: The unique id of the Table.

name¶: The name of a Table.

position¶: The position of the Table in the Document.

section¶: The parent Section.

section_id¶: The id of the parent Section.

class fonduer.parser.models.Webpage(**kwargs)[source]¶

Bases: fonduer.parser.models.context.Context

A Webpage Context enhanced with additional metadata.

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

crawltime¶: The timestamp of when the Webpage was crawled.

host¶: The host of the Webpage.

id¶: The unique id of the Webpage.

name¶: The name of a Webpage.

page_type¶: The type of the Webpage.

raw_content¶: The raw content of the Webpage.

url¶: The URL of the Webpage.

Core Objects¶

This is Fonduer’s core Parser object.

Fonduer’s parser module.

class fonduer.parser.Parser(session, parallelism=1, structural=True, blacklist=['style', 'script'], flatten=['span', 'br'], language='en', lingual=True, lingual_parser=None, strip=True, replacements=[('[‐‑‒–—−]', '-')], tabular=True, visual_parser=None)[source]¶

Bases: fonduer.utils.udf.UDFRunner

Parses documents into Fonduer’s Data Model.

Parameters

session (Session) – The database session to use.
parallelism (int) – The number of processes to use in parallel. Default 1.
structural (bool) – Whether to parse structural information from a DOM.
blacklist (List[str]) – A list of tag types to ignore. Default [“style”, “script”].
flatten (List[str]) – A list of tag types to flatten. Default [“span”, “br”]
language (str) – Which spaCy NLP language package. Default “en”.
lingual (bool) – Whether or not to include NLP information. Default True.
lingual_parser (Optional[LingualParser]) – A custom lingual parser that inherits from LingualParser. When specified, language is ignored. Otherwise, spaCy is used with the given language.
strip (bool) – Whether or not to strip whitespace during parsing. Default True.
replacements (List[Tuple[str, str]]) – A list of tuples where the regex string in the first position is replaced by the character in the second position. Default [(u"[\u2010\u2011\u2012\u2013\u2014\u2212]", "-")], which replaces various unicode variants of a hyphen (e.g. emdash, endash, minus, etc.) with a standard ASCII hyphen.
tabular (bool) – Whether to include tabular information in the parse.
visual_parser (Optional[VisualParser]) – A visual parser that parses visual information. Defaults to None (visual information is not parsed).

Initialize Parser.
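The effect of the default replacements value can be reproduced with the standard re module. This is a minimal sketch of the substitution itself; the helper name apply_replacements is hypothetical, and exactly where in the pipeline Fonduer applies the replacements is not shown here.

```python
import re

# Same pattern/replacement pair as the documented default.
DEFAULT_REPLACEMENTS = [("[\u2010\u2011\u2012\u2013\u2014\u2212]", "-")]


def apply_replacements(text, replacements=DEFAULT_REPLACEMENTS):
    """Apply each (regex, replacement) pair to the text in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


# An en dash (U+2013) and a minus sign (U+2212) both become ASCII hyphens.
print(apply_replacements("10\u201320 V, \u22125 \u00b0C"))  # 10-20 V, -5 °C
```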

apply(doc_loader, clear=True, parallelism=None, progress_bar=True)[source]¶

Run the Parser.

Parameters

doc_loader (Collection[Document]) – An iterable of Documents to parse. Typically, one of Fonduer’s document preprocessors.
clear (bool) – Whether or not to clear previously parsed data (see clear()) before parsing.
parallelism (Optional[int]) – How many processes to use for parsing. This will override the parallelism value used to initialize the Parser if it is provided.
progress_bar (bool) – Whether or not to display a progress bar. The progress bar is measured per document.

Return type

None

clear()[source]¶

Clear all of the Context objects in the database.

Return type: None

get_documents()[source]¶

Return all the successfully parsed Documents in the database.

Return type: List[Document]
Returns: A list of all Documents in the database ordered by name.

get_last_documents()[source]¶

Return the most recently successfully parsed list of Documents.

Return type: List[Document]
Returns: A list of the most recently parsed Documents ordered by name.

last_docs: Set[str]¶: The last set of documents that apply() was called on
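Putting the pieces above together, a typical parsing run might look like the following sketch. Parser, HTMLDocPreprocessor, apply(), and get_documents() are the objects documented on this page; the use of Meta.init(...).Session() to obtain a database session is an assumption taken from Fonduer’s wider API rather than from this section, and the function name and its arguments are hypothetical. The imports are done inside the function so that the sketch can be loaded without Fonduer installed.

```python
def parse_html_corpus(conn_string, html_dir, max_docs=10, parallelism=1):
    """Parse the HTML files under html_dir and return the parsed Documents."""
    # Lazy imports: these require Fonduer and a reachable database.
    from fonduer import Meta  # assumption: not documented in this section
    from fonduer.parser import Parser
    from fonduer.parser.preprocessors import HTMLDocPreprocessor

    # Assumption: a session is created from a database connection string.
    session = Meta.init(conn_string).Session()

    # One Document per HTML file, capped at max_docs.
    doc_preprocessor = HTMLDocPreprocessor(html_dir, max_docs=max_docs)

    parser = Parser(session, structural=True, lingual=True)
    parser.apply(doc_preprocessor, parallelism=parallelism)

    # Documents come back ordered by name.
    return parser.get_documents()
```

A visual_parser argument is deliberately left out, so under the documented default no visual information would be parsed in this sketch.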

Lingual Parsers¶

The following docs describe various lingual parsers. They split text into sentences and enrich them with NLP.

Fonduer’s lingual parser module.

class fonduer.parser.lingual_parser.LingualParser[source]¶

Bases: object

Lingual parser.

enrich_sentences_with_NLP(sentences)[source]¶

Add NLP attributes like lemmas, pos_tags, etc. to sentences.

Parameters: sentences (Collection[Sentence]) – an iterator of Sentence.
Return type: Iterator[Sentence]
Returns: a generator of Sentence.

has_NLP_support()[source]¶

Return True when NLP is supported.

Return type: bool
Returns: True when NLP is supported.

has_tokenizer_support()[source]¶

Return True when a tokenizer is supported.

Return type: bool
Returns: True when a tokenizer is supported.

split_sentences(text)[source]¶

Split input text into sentences.

Parameters: text (str) – text to be split
Return type: Iterable[dict]
Returns: A generator of dict that is used as **kwargs to instantiate Sentence.

class fonduer.parser.lingual_parser.SimpleParser(delim='.')[source]¶

Bases: fonduer.parser.lingual_parser.lingual_parser.LingualParser

Tokenizes text on whitespace only using split().

Parameters: delim (str) – a delimiter to split text into sentences.

Initialize SimpleParser.

enrich_sentences_with_NLP(sentences)¶

Add NLP attributes like lemmas, pos_tags, etc. to sentences.

Parameters: sentences (Collection[Sentence]) – an iterator of Sentence.
Return type: Iterator[Sentence]
Returns: a generator of Sentence.

has_NLP_support()[source]¶

Return True when NLP is supported.

Return type: bool
Returns: True when NLP is supported.

has_tokenizer_support()[source]¶

Return True when a tokenizer is supported.

Return type: bool
Returns: True when a tokenizer is supported.

split_sentences(str)[source]¶

Parse the document.

Parameters: str (str) – The text contents of the document.
Return type: Iterator[Dict[str, Any]]
Returns: a generator of tokenized text.
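As a rough illustration of a delimiter-based splitter like the one described above, the following stand-alone sketch splits text on a delimiter, tokenizes each piece on whitespace, and yields one dict per sentence. It is not SimpleParser’s actual code: the function name is hypothetical, and the keys shown (text, words, char_offsets) are only a subset of the Sentence attributes documented on this page.

```python
from itertools import accumulate


def split_on_delimiter(text, delim="."):
    """Yield one dict of Sentence-like keyword arguments per sentence."""
    for chunk in text.split(delim):
        words = chunk.split()  # whitespace tokenization
        if not words:
            continue  # skip empty pieces, e.g. after a trailing delimiter
        # Start offset of each word within the re-joined sentence text.
        ends = list(accumulate(len(word) + 1 for word in words))
        char_offsets = [0] + ends[:-1]
        yield {
            "text": " ".join(words),
            "words": words,
            "char_offsets": char_offsets,
        }


sentences = list(split_on_delimiter("Hello world. Foo bar baz."))
for sentence in sentences:
    print(sentence)
```

Each yielded dict could then be passed as **kwargs when building a sentence object, which is the contract split_sentences() describes.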

class fonduer.parser.lingual_parser.SpacyParser(lang)[source]¶

Bases: fonduer.parser.lingual_parser.lingual_parser.LingualParser

Spacy parser class.

Parameters: lang (Optional[str]) – Language. This can be one of ["en", "de", "es", "pt", "fr", "it", "nl", "xx", "ja", "zh"]. See the spaCy documentation for details of the supported languages.

Initialize SpacyParser.

enrich_sentences_with_NLP(sentences)[source]¶

Enrich a list of fonduer Sentence objects with NLP features.

We merge and process the text of all Sentences for higher efficiency.

Parameters: sentences (Collection[Sentence]) – List of fonduer Sentence objects for one document
Return type: Iterator[Sentence]

has_NLP_support()[source]¶

Return True when NLP is supported.

Return type: bool
Returns: True when NLP is supported.

has_tokenizer_support()[source]¶

Return True when a tokenizer is supported.

Return type: bool
Returns: True when a tokenizer is supported.

static model_installed(name)[source]¶

Check if spaCy language model is installed.

From https://github.com/explosion/spaCy/blob/master/spacy/util.py

Parameters: name (str) –
Return type: bool

split_sentences(text)[source]¶

Split text into sentences.

Split input text into sentences that match CoreNLP’s default format, but are not yet processed.

Parameters: text (str) – The text of the parent paragraph of the sentences
Return type: Iterator[Dict[str, Any]]

Visual Parsers¶

The following docs describe various visual parsers. They parse visual information, e.g., bounding boxes of each word. Fonduer can parse visual information only for hOCR and HTML files, with the help of HocrVisualParser and PdfVisualParser, respectively. It is recommended to provide documents in hOCR instead of HTML, because PdfVisualParser is not always accurate by its nature and could assign a wrong bounding box to a word (see #12).

graph LR; PDF[(PDF)]--Convert-->HTML; PDF--Convert-->hOCR; hOCR[(hOCR)]-->HOCRDoc; HTML[(HTML)]-->HTMLDoc; PDF-.->pdf_visual_parser; subgraph Fonduer HOCRDoc(HOCRDocPreprocessor)-->parser; parser(Parser)-->hocr_visual_parser(HocrVisualParser); HTMLDoc(HTMLDocPreprocessor)-->parser2; parser2(Parser)-->pdf_visual_parser(PdfVisualParser); hocr_visual_parser-->others; pdf_visual_parser-->others(..); end classDef source fill:#aaf; classDef preproc fill:#afa; class PDF source; class HTMLDoc,HOCRDoc preproc;

Fonduer’s visual parser module.

class fonduer.parser.visual_parser.HocrVisualParser(replacements=[('[‐‑‒–—−]', '-')])[source]¶

Bases: fonduer.parser.visual_parser.visual_parser.VisualParser

Visual Parser for hOCR.

Initialize a visual parser.

Raises: ImportError – an error is raised when spaCy is not 2.3.0 or later.

is_parsable(document_name)[source]¶

Whether visual information can be parsed. Currently always returns True.

Parameters: document_name (str) – the document name.
Return type: bool

parse(document_name, sentences)[source]¶

Parse visual information embedded in sentence’s html_attrs.

Parameters

document_name (str) – the document name.
sentences (Iterable[Sentence]) – sentences to be linked with visual information.

Return type

Iterator[Sentence]

Returns

A generator of Sentence.

class fonduer.parser.visual_parser.PdfVisualParser(pdf_path, verbose=False)[source]¶

Bases: fonduer.parser.visual_parser.visual_parser.VisualParser

Link visual information, extracted from PDF, with parsed sentences.

This linker assumes the following conditions for expected results:

The PDF file exists in a directory specified by pdf_path.
The basename of the PDF file is the same as the document name and its extension is either “.pdf” or “.PDF”.
A PDF has a text layer.

Initialize VisualParser.

Parameters

pdf_path (str) – a path to a directory that contains PDF files.
verbose (bool) – whether to turn on verbose logging.

is_parsable(document_name)[source]¶

Verify that the file exists and has a PDF extension.

Parameters: document_name (str) – The path to the PDF document.
Return type: bool
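The file-location conditions listed for PdfVisualParser (a PDF in the pdf_path directory, named after the document, with a “.pdf” or “.PDF” extension) can be checked with a few lines of standard-library code. This is a sketch under those stated conditions, not the parser’s own lookup logic; the function name find_pdf is hypothetical, and it does not check whether the PDF has a text layer.

```python
import os


def find_pdf(pdf_path, document_name):
    """Return the path of the PDF matching document_name, or None."""
    for extension in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_path, document_name + extension)
        if os.path.isfile(candidate):
            return candidate
    return None
```

For example, find_pdf("data/pdf", "BC818") would return "data/pdf/BC818.pdf" if that file exists, where both the directory and the document name are made up for illustration.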

parse(document_name, sentences)[source]¶

Link visual information with sentences.

Parameters

document_name (str) – the document name.
sentences (Iterable[Sentence]) – sentences to be linked with visual information.

Return type

Iterator[Sentence]

Returns

A generator of Sentence.

class fonduer.parser.visual_parser.VisualParser[source]¶

Bases: abc.ABC

Abstract visual parser.

abstract is_parsable(document_name)[source]¶

Check if visual information can be parsed.

Parameters: document_name (str) – the document name.
Return type: bool
Returns: Whether visual information is parsable.

abstract parse(document_name, sentences)[source]¶

Parse visual information and link it with the given sentences.

Parameters

document_name (str) – the document name.
sentences (Iterable[Sentence]) – sentences to be linked with visual information.

Yield

sentences with visual information.

Return type

Iterator[Sentence]
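The two abstract methods above define the whole contract of a visual parser. The following stand-alone sketch mirrors that contract without importing Fonduer: VisualParserSketch and ConstantBoxParser are hypothetical classes, the toy sentence is a plain namespace rather than a real Sentence, and the coordinates are made up. In real use a custom parser would presumably subclass fonduer.parser.visual_parser.VisualParser instead.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class VisualParserSketch(ABC):
    """Hypothetical mirror of the two abstract methods; not Fonduer's class."""

    @abstractmethod
    def is_parsable(self, document_name):
        """Return whether visual information can be parsed."""

    @abstractmethod
    def parse(self, document_name, sentences):
        """Yield the sentences with visual information attached."""


class ConstantBoxParser(VisualParserSketch):
    """Toy parser that gives every word the same made-up box on page 1."""

    def is_parsable(self, document_name):
        return True

    def parse(self, document_name, sentences):
        for sentence in sentences:
            count = len(sentence.words)
            sentence.page = [1] * count
            sentence.top = [0] * count
            sentence.left = [0] * count
            sentence.bottom = [10] * count
            sentence.right = [10] * count
            yield sentence


toy_sentence = SimpleNamespace(words=["Hello", "world"])
parsed = list(ConstantBoxParser().parse("doc1", [toy_sentence]))
print(parsed[0].page, parsed[0].right)
```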

Preprocessors¶

The following shows descriptions of the various document preprocessors included with Fonduer which are used in parsing documents of different formats.

Fonduer’s parser preprocessor module.

class fonduer.parser.preprocessors.CSVDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, header=False, delim=',', parser_rule=None)[source]¶

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A Document generator for CSV files.

It treats each line in the input file as a Document. By default, it assumes that each column is one Section and that the content of each column is one Paragraph. However, if a column is complex, an advanced parser may be used by specifying the parser_rule parameter as a dict whose keys are column indexes and whose values are the parsers to apply, e.g., column_constructor in fonduer.utils.utils_parser.

Initialize CSV DocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use (e.g. “utf-8”).
max_docs (int) – the maximum number of Documents to produce.
header (bool) – whether the CSV file contains a header. If it does, the header is used as the Section names. default = False
delim (str) – delimiter to be used to separate columns when the file has more than one column. It is active only when column is not None. default = ','
parser_rule (Optional[Dict[int, Callable]]) – The parser rule to be used to parse the specific column. default = None

Returns

A generator of Documents.
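The row-to-Document mapping described for CSVDocPreprocessor (one Document per line, one Section per column, one Paragraph per column value, header cells used as Section names) can be sketched with the standard csv module. This is an illustration of the mapping only, not the preprocessor’s code: the function name, the dict layout, and the sample rows are hypothetical, and parser_rule is not modelled.

```python
import csv
import io


def rows_to_documents(stream, header=False, delim=","):
    """Yield one dict per CSV row, with one section per column."""
    reader = csv.reader(stream, delimiter=delim)
    names = next(reader, None) if header else None
    for row_index, row in enumerate(reader):
        sections = []
        for col_index, cell in enumerate(row):
            # Use the header cell as the section name when available.
            has_name = names is not None and col_index < len(names)
            name = names[col_index] if has_name else None
            sections.append({"name": name, "paragraph": cell})
        yield {"row": row_index, "sections": sections}


sample = io.StringIO("part,description\nBC818,NPN transistor\nBC858,PNP transistor\n")
docs = list(rows_to_documents(sample, header=True))
print(len(docs), docs[0]["sections"])
```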

class fonduer.parser.preprocessors.DocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶

Bases: object

An abstract class of a Document generator.

Unless otherwise stated by a subclass, it’s assumed that there is one Document per file.

Initialize DocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.

Returns

A generator of Documents.
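The path parameter accepts a file, a directory, or a glob pattern, and basenames are expected to be unique. The sketch below shows one way to resolve such a path and to detect basename clashes up front using only the standard library. The function name is hypothetical, and whether Fonduer itself raises an error on duplicates or walks subdirectories is not stated in this section, so both choices here are assumptions.

```python
import glob
import os


def resolve_input_files(path):
    """Expand a file, directory, or glob pattern into a sorted file list."""
    if os.path.isdir(path):
        # Assumption: only the directory's direct children are considered.
        candidates = [os.path.join(path, name) for name in os.listdir(path)]
    elif os.path.isfile(path):
        candidates = [path]
    else:
        candidates = glob.glob(path)
    files = sorted(p for p in candidates if os.path.isfile(p))

    # Assumption: fail early when two files share a basename.
    seen = {}
    for file_path in files:
        base = os.path.basename(file_path)
        if base in seen:
            raise ValueError(
                f"duplicate basename {base!r}: {seen[base]} and {file_path}"
            )
        seen[base] = file_path
    return files
```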

class fonduer.parser.preprocessors.HOCRDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, space=True)[source]¶

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A Document generator for hOCR files.

hOCR should comply with hOCR v1.2. Note that ppageno property of ocr_page is optional by hOCR v1.2, but is required by Fonduer.

Initialize HOCRDocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.
space (bool) – boolean value indicating whether each word should have a subsequent space. E.g., English has spaces between words.

Returns

A generator of Documents.
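Because Fonduer requires the ppageno property on every ocr_page element, it can be useful to check hOCR files before parsing them. The following sketch looks for ocr_page elements whose title attribute lacks ppageno, relying on the hOCR convention that properties are stored in title as semicolon-separated entries. The function name and the sample markup are hypothetical, and the sketch assumes the hOCR file is well-formed XML.

```python
import xml.etree.ElementTree as ET


def pages_missing_ppageno(hocr_xml):
    """Return the ids of ocr_page elements that have no ppageno property."""
    root = ET.fromstring(hocr_xml)
    missing = []
    for element in root.iter():
        classes = (element.get("class") or "").split()
        if "ocr_page" not in classes:
            continue
        # Properties look like "bbox 0 0 10 10; ppageno 0".
        entries = (element.get("title") or "").split(";")
        property_names = [e.strip().split(" ", 1)[0] for e in entries if e.strip()]
        if "ppageno" not in property_names:
            missing.append(element.get("id"))
    return missing


sample = (
    "<html><body>"
    '<div class="ocr_page" id="page_1" title="bbox 0 0 10 10; ppageno 0"/>'
    '<div class="ocr_page" id="page_2" title="bbox 0 0 10 10"/>'
    "</body></html>"
)
print(pages_missing_ppageno(sample))  # ['page_2']
```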

class fonduer.parser.preprocessors.HTMLDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A Document generator for HTML files.

Initialize DocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.

Returns

A generator of Documents.

class fonduer.parser.preprocessors.TSVDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807, header=False)[source]¶

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A Document generator for TSV files.

It treats each line in the input file as a Document. The TSV file should have one (doc_name <tab> doc_text) per line.

Initialize TSV DocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use (e.g. “utf-8”).
max_docs (int) – the maximum number of Documents to produce.
header (bool) – whether the TSV file contains a header. default = False

Returns

A generator of Documents.
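The expected TSV layout, one (doc_name <tab> doc_text) pair per line, can be illustrated with a short reader. This is a sketch of the file format only, with a hypothetical function name and made-up sample lines; it splits on the first tab, so any further tabs stay in the text, and a line without a tab raises ValueError.

```python
def read_tsv_documents(lines, header=False):
    """Yield (doc_name, doc_text) pairs from tab-separated lines."""
    rows = iter(lines)
    if header:
        next(rows, None)  # skip the header line
    for line in rows:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only; later tabs remain part of the text.
        doc_name, doc_text = line.split("\t", 1)
        yield doc_name, doc_text


pairs = list(read_tsv_documents(["doc1\tFirst text.\n", "doc2\tSecond\ttext.\n"]))
print(pairs)
```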

class fonduer.parser.preprocessors.TextDocPreprocessor(path, encoding='utf-8', max_docs=9223372036854775807)[source]¶

Bases: fonduer.parser.preprocessors.doc_preprocessor.DocPreprocessor

A Document generator for plain text files.

Initialize DocPreprocessor.

Parameters

path (str) – a path to file or directory, or a glob pattern. The basename (as returned by os.path.basename) should be unique among all files.
encoding (str) – file encoding to use, defaults to “utf-8”.
max_docs (int) – the maximum number of Documents to produce, defaults to sys.maxsize.

Returns

A generator of Documents.