Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning 2.0.0 conventions. The maintainers will create a git tag for each release and increment the version number found in fonduer/_version.py accordingly. We release tagged versions to PyPI automatically using Travis-CI.

Note

Fonduer is still under active development and APIs may still change rapidly. Until we release v1.0.0, a change in the MINOR version indicates backward-incompatible changes.
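As a rough illustration of the versioning policy above, the sketch below checks whether moving between two versions may involve backward-incompatible changes. The helper names parse_version and may_break are hypothetical and are not part of Fonduer.

```python
def parse_version(version: str) -> tuple:
    """Split a 'MAJOR.MINOR.PATCH' string (optionally prefixed with 'v') into ints."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def may_break(current: str, target: str) -> bool:
    """Return True if moving from current to target may break existing code.

    Hypothetical helper: before v1.0.0, a MINOR bump signals breaking
    changes; from v1.0.0 on, only a MAJOR bump does (Semantic Versioning).
    """
    cur, tgt = parse_version(current), parse_version(target)
    if cur[0] == 0 and tgt[0] == 0:
        return cur[1] != tgt[1]
    return cur[0] != tgt[0]
```

For example, may_break("0.6.2", "0.7.0") is True, while may_break("0.6.1", "0.6.2") is False.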

Unreleased

Added

  • @senwu: Refactor Featurization to support user-defined feature extractors and rename the existing feature extractors to match the paper.

Note

Rather than using a fixed multimodal feature library alone, we have added an interface for users to provide customized feature extractors. Please see our full documentation for details.

from fonduer.features import Featurizer, FeatureExtractor

# Example feature extractor
def feat_ext(candidates):
    for candidate in candidates:
        yield candidate.id, f"{candidate.id}", 1

feature_extractors = FeatureExtractor(customize_feature_funcs=[feat_ext])
featurizer = Featurizer(session, [PartTemp], feature_extractors=feature_extractors)

Rather than:

from fonduer.features import Featurizer

featurizer = Featurizer(session, [PartTemp])
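The example extractor above yields (candidate id, feature name, feature value) triples. The following standalone sketch shows that shape without requiring Fonduer or a database; FakeCandidate and its text field are hypothetical stand-ins, not Fonduer classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a Fonduer candidate; only `id` mirrors the example above.
FakeCandidate = namedtuple("FakeCandidate", ["id", "text"])


def length_feature(candidates):
    """Yield (candidate id, feature name, feature value) triples."""
    for candidate in candidates:
        yield candidate.id, f"TEXT_LEN_{len(candidate.text)}", 1


candidates = [FakeCandidate(1, "BC546"), FakeCandidate(2, "150")]
features = list(length_feature(candidates))
```

Because the extractor is a plain generator, it can be checked in isolation like this before it is handed to FeatureExtractor.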
  • @HiromuHota: Add page argument to get_pdf_dim in case pages have different dimensions.
  • @HiromuHota: Add Labeler#upsert_keys.
  • @HiromuHota: Add vizlink as an argument to Parser to be able to plug a custom visual linker. Unless otherwise specified, VisualLinker will be used by default.

Note

Example usage:

from typing import Iterable

from fonduer.parser.models import Sentence
from fonduer.parser.visual_linker import VisualLinker

class CustomVisualLinker(VisualLinker):
    def __init__(self):
        """Your code"""

    def link(self, document_name: str, sentences: Iterable[Sentence], pdf_path: str) -> Iterable[Sentence]:
        """Your code"""

    def is_linkable(self, filename: str) -> bool:
        """Your code"""

from fonduer.parser import Parser
parser = Parser(session, vizlink=CustomVisualLinker())
  • @HiromuHota: Add LingualParser, which any lingual parser like Spacy should inherit from, and add lingual_parser as an argument to Parser to be able to plug a custom lingual parser.
  • @HiromuHota: Add type annotations to some of the classes, including preprocessors and parser/models.

Changed

  • @HiromuHota: Load a spaCy model if possible during Spacy#__init__.
  • @HiromuHota: Rename Spacy to SpacyParser.
  • @HiromuHota: Rename SimpleTokenizer into SimpleParser and let it inherit LingualParser.
  • @HiromuHota: Move all lingual parsers into lingual_parser folder.
  • @HiromuHota: Make load_lang_model private as a model is internally loaded during init.
  • @HiromuHota: Add a unit test for Parser with tabular=False. (#261)

Removed

  • @HiromuHota: Remove __repr__ from each mixin class as the referenced attributes are not available.

Fixed

  • @senwu: Fix legacy code bug in SymbolTable.
  • @HiromuHota: Fix the type of max_docs.
  • @HiromuHota: Associate sentence with section and paragraph no matter what tabular is. (#261)

0.7.0 - 2019-06-12

Added

  • @HiromuHota: Add notes about the current implementation of data models.
  • @HiromuHota: Add Featurizer#upsert_keys.
  • @HiromuHota: Update the doc for OS X about an external dependency on libomp.
  • @HiromuHota: Add test_classifier.py to unit test Classifier and its subclasses.
  • @senwu: Add test_simple_tokenizer.py to unit test simple_tokenizer.
  • @HiromuHota: Add test_spacy_parser.py to unit test spacy_parser.

Changed

  • @HiromuHota: Assign a section for mention spaces.
  • @HiromuHota: Incorporate entity_confusion_matrix as a first-class citizen and rename it to confusion_matrix because it can be used at both the entity level and the mention level.
  • @HiromuHota: Separate Spacy#_split_sentences_by_char_limit so that it can be tested on its own.
  • @HiromuHota: Refactor the custom sentence_boundary_detector for readability and efficiency.
  • @HiromuHota: Remove a redundant argument, document, from Spacy#split_sentences.
  • @HiromuHota: Refactor TokenPreservingTokenizer for readability.

Removed

  • @HiromuHota: Remove data_model_utils.tabular.same_document, which always returns True because a candidate can only have mentions from the same document under the current implementation of CandidateExtractorUDF.

Fixed

  • @senwu: Fix the doc about the PostgreSQL version requirement.

0.6.2 - 2019-04-01

Fixed

  • @lukehsiao: Fix Meta initialization bug which would configure logging upon import rather than allowing the user to configure logging themselves.

0.6.1 - 2019-03-29

Added

  • @senwu: Update the spaCy version to v2.1.x.
  • @lukehsiao: Provide fonduer.init_logging() as a way to configure logging to a temp directory by default.

Note

Although you can still configure logging manually, with this change we also provide a function for initializing logging. For example, you can call:

import logging
import fonduer

# Optionally configure logging
fonduer.init_logging(
  log_dir="log_folder",
  format="[%(asctime)s][%(levelname)s] %(name)s:%(lineno)s - %(message)s",
  level=logging.INFO
)

session = fonduer.Meta.init(conn_string).Session()

which will create logs within the log_folder directory. If logging is not explicitly initialized, we will provide a default configuration which will store logs in a temporary directory.
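The default behavior described above, falling back to a temporary directory when no log folder is given, can be sketched with the standard library alone. This is not Fonduer's implementation; the function name init_logging_sketch, the logger name, and the file name example.log are assumptions made for illustration.

```python
import logging
import os
import tempfile


def init_logging_sketch(log_dir=None, level=logging.INFO):
    """Attach a file handler that writes to log_dir, or to a new temp dir."""
    if log_dir is None:
        # Fall back to a fresh temporary directory when none is given.
        log_dir = tempfile.mkdtemp(prefix="logs_")
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, "example.log")

    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s][%(levelname)s] %(name)s:%(lineno)s - %(message)s"
        )
    )
    logger = logging.getLogger("example")
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger, log_path


logger, log_path = init_logging_sketch()
logger.info("logging initialized")
```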

Changed

  • @senwu: Update the whole logging strategy.

Note

For the whole logging strategy:

With this change, the running log is stored as fonduer.log in the {fonduer.Meta.log_path}/{datetime} folder. Users can specify the log location using fonduer.init_logging(). The same folder also contains the learning logs.

For learning logging strategy:

Previously, model checkpoints were stored in the user-provided folder given by save_dir, and each checkpoint was named {model_name}.mdl.ckpt.{global_step}.

With this change, the model is saved in a subfolder of the same folder, fonduer.Meta.log_path, as the log file. Each learning run creates a subfolder named {datetime}_{model_name} that holds all model checkpoints and the tensorboard log file. To use tensorboard to check the learning curve, run tensorboard --logdir LOG_FOLDER.
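The folder naming described above can be sketched as follows. The timestamp format is an assumption, since this changelog does not specify it, and run_folder is a hypothetical helper rather than a Fonduer function.

```python
import os
from datetime import datetime


def run_folder(log_path: str, model_name: str, now: datetime) -> str:
    """Build the per-run subfolder path {log_path}/{datetime}_{model_name}."""
    # Assumed timestamp format; the real format may differ.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.join(log_path, f"{stamp}_{model_name}")


folder = run_folder("logs", "LSTM", datetime(2019, 3, 29, 12, 0, 0))
```

Under these assumptions, pointing tensorboard at the parent folder (tensorboard --logdir logs) would pick up every run stored beneath it.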

Fixed

  • @senwu: Change the exception condition to make sure the parser runs end to end.
  • @lukehsiao: Fix parser error when text was located in the tail of an LXML table node.
  • @HiromuHota: Store lemmas and pos_tags in case they are returned from a tokenizer.
  • @HiromuHota: Use unidic instead of ipadic for Japanese. (#231)
  • @senwu: Use mecab-python3 version 0.7 for Japanese tokenization since spaCy only supports version 0.7.
  • @HiromuHota: Use black 18.9b0 or higher to be consistent with isort. (#225)
  • @HiromuHota: Workaround no longer required for Japanese as of spaCy v2.1.0. (#224)
  • @senwu: Update the metal version.
  • @senwu: Expose the b and pos_label in training.
  • @senwu: Fix an issue where pdfinfo output caused a parsing error when it contained more than one Page.

0.6.0 - 2019-02-17

Changed

  • @lukehsiao: improved performance of data_model_utils through caching and simplifying the underlying queries. (#212, #215)
  • @senwu: upgrade to PyTorch v1.0.0. (#209)

Removed

  • @lukehsiao: Removed the redundant get_gold_labels function.

Note

Rather than calling get_gold_labels directly, call it from the Labeler:

from fonduer.supervision import Labeler
labeler = Labeler(session, [relations])
L_gold_train = labeler.get_gold_labels(train_cands, annotator='gold')

Rather than:

from fonduer.supervision import Labeler, get_gold_labels
labeler = Labeler(session, [relations])
L_gold_train = get_gold_labels(session, train_cands, annotator_name='gold')

Fixed

  • @senwu: Improve type checking in featurization.
  • @lukehsiao: Fixed sentence.sentence_num bug in get_neighbor_sentence_ngrams.
  • @lukehsiao: Add session synchronization to sqlalchemy delete queries. (#214)
  • @lukehsiao: Update PyYAML dependency to patch CVE-2017-18342. (#205)
  • @KenSugimoto: Fix max/min in visualizer.get_box

0.5.0 - 2019-01-01

Added

  • @senwu: Support CSV, TSV, and Text input data formats. For the CSV format, CSVDocPreprocessor treats each line in the input file as a document. By default, it assumes that each column is one section and that the content of each column is one paragraph. However, if a column is complex, an advanced parser may be used by specifying the parser_rule parameter as a dict whose keys are column indices and whose values are the parsers for those columns.

Note

In Fonduer v0.5.0, you can use CSVDocPreprocessor:

from functools import partial

from fonduer.parser import Parser
from fonduer.parser.preprocessors import CSVDocPreprocessor
from fonduer.utils.utils_parser import column_constructor

max_docs = 10

# Define a specific parser for the third column (index 2). It takes ``text``,
# ``name=None``, ``type="text"``, and ``delim=None`` as input and generates
# ``(content type, content name, content)`` for ``build_node``
# in ``fonduer.utils.utils_parser``.
parser_rule = {
    2: partial(column_constructor, type="figure"),
}

doc_preprocessor = CSVDocPreprocessor(
    PATH_TO_DOCS, max_docs=max_docs, header=True, parser_rule=parser_rule
)

corpus_parser = Parser(session, structural=True, lingual=True, visual=False)
corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

all_docs = corpus_parser.get_documents()

For the TSV format, TSVDocPreprocessor treats each line in the input file as a document, which should follow the (doc_name <tab> doc_text) format.

For Text format, TextDocPreprocessor assumes one document per file.
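To make the TSV layout concrete, the sketch below reads (doc_name <tab> doc_text) lines with the standard csv module. It only illustrates the expected input format and is not how TSVDocPreprocessor is implemented; read_tsv_documents is a hypothetical helper.

```python
import csv
import io


def read_tsv_documents(stream):
    """Yield (doc_name, doc_text) pairs, one per tab-separated line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {len(row)}")
        yield row[0], row[1]


sample = io.StringIO("doc1\tFirst document text.\ndoc2\tSecond document text.\n")
docs = list(read_tsv_documents(sample))
```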

Changed

  • @senwu: Reorganize learning module to use pytorch dataloader, include MultiModalDataset to better handle multimodal information, and simplify the code
  • @senwu: Remove batch_size input argument from _calc_logits, marginals, predict, and score in Classifier
  • @senwu: Rename predictions to predict in Classifier and update the input arguments to include pos_label (the positive label for binary class prediction) and return_probs (if True, also return the predicted probabilities)
  • @senwu: Update score function in Classifier to include: (1) for binary: precision, recall, F-beta score, accuracy, ROC-AUC score; (2) for categorical: accuracy
  • @senwu: Remove LabelBalancer
  • @senwu: Remove original Classifier class, rename NoiseAwareModel to Classifier and use the same setting for both binary and multi-class classifier
  • @senwu: Unify the loss (SoftCrossEntropyLoss) for all settings
  • @senwu: Rename layers in learning module to modules
  • @senwu: Update code to use Python 3.6+’s f-strings
  • @HiromuHota: Reattach doc with the current session at MentionExtractorUDF#apply to avoid doing so at each MentionSpace.

Fixed

  • @HiromuHota: Modify docstring of functions that return get_sparse_matrix
  • @lukehsiao: Fix the behavior of get_last_documents to return Documents that are correctly linked to the database and can be navigated by the user. (#201)
  • @lukehsiao: Fix the behavior of MentionExtractor clear and clear_all to also delete the Candidates that correspond to the Mentions.

0.4.1 - 2018-12-12

Added

  • @senwu: Added alpha spacy support for Chinese tokenizer.

Changed

  • @lukehsiao: Add soft version pinning to avoid failures due to dependency API changes.
  • @j-rausch: Change get_row_ngrams and get_col_ngrams to return None if the passed Mention argument is not inside a table. (#194)

Fixed

  • @senwu: Fix non-deterministic results from get_candidates and get_mentions caused by parallel candidate/mention generation.

0.4.0 - 2018-11-27

Added

  • @senwu: Rename span attribute to context in mention_subclass to better support multimodal mentions. (#184)

Note

The way to retrieve the corresponding data model object from a mention has changed. In Fonduer v0.3.6, we use .span:

# sent_mention is a SentenceMention
sentence = sent_mention.span.sentence

With this release, we use .context:

# sent_mention is a SentenceMention
sentence = sent_mention.context.sentence
  • @senwu: Add support to extract multimodal candidates and add DoNothingMatcher matcher. (#184)

Note

Mention extraction now supports all data types in the data model. In Fonduer v0.3.6, Mention extraction only supports MentionNgrams and MentionFigures:

from fonduer.candidates import (
    MentionFigures,
    MentionNgrams,
)

With this release, it supports all data types:

from fonduer.candidates import (
    MentionCaptions,
    MentionCells,
    MentionDocuments,
    MentionFigures,
    MentionNgrams,
    MentionParagraphs,
    MentionSections,
    MentionSentences,
    MentionTables,
)
  • @senwu: Add support to parse multiple sections in parser, fix webpage context, and add name column for each context in data model. (#182)

Fixed

  • @senwu: Remove unnecessary backref in mention generation.
  • @j-rausch: Improve error handling for invalid row spans. (#183)

0.3.6 - 2018-11-15

Fixed

  • @lukehsiao: Updated snorkel-metal version requirement to ensure new syntax works when a user upgrades Fonduer.
  • @lukehsiao: Improve error messages on PostgreSQL connection and update FAQ.

0.3.5 - 2018-11-04

Added

  • @senwu: Add SparseLSTM support reducing the memory used by the LSTM for large applications. (#175)

Note

With the SparseLSTM discriminative model, we reduce memory usage relative to the original LSTM model at the cost of runtime. In Fonduer v0.3.5, SparseLSTM is used as follows:

from fonduer.learning import SparseLSTM

disc_model = SparseLSTM()
disc_model.train(
    (train_cands, train_feature), train_marginals, n_epochs=5, lr=0.001
)

Fixed

  • @senwu: Fix issue with get_last_documents returning the incorrect number of docs and update the tests. (#176)
  • @senwu: Use the latest MeTaL syntax and fix flake8 issues. (#173)

0.3.4 - 2018-10-17

Changed

  • @senwu: Use sqlalchemy to check connection string. Use postgresql instead of postgres in connection string.
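The scheme requirement above can be illustrated with a small check. Note that Fonduer uses sqlalchemy for this validation; the sketch below swaps in urllib.parse from the standard library purely to show the postgresql:// versus postgres:// distinction, and check_conn_string is a hypothetical helper.

```python
from urllib.parse import urlparse


def check_conn_string(conn_string: str) -> str:
    """Return the database name, or raise ValueError on a malformed string."""
    parsed = urlparse(conn_string)
    if parsed.scheme != "postgresql":
        raise ValueError(
            f"expected a postgresql:// URL, got scheme {parsed.scheme!r}"
        )
    db_name = parsed.path.lstrip("/")
    if not db_name:
        raise ValueError("connection string does not name a database")
    return db_name


db_name = check_conn_string("postgresql://localhost:5432/my_db")
```

With this sketch, a string starting with postgres:// raises ValueError, while the postgresql:// form is accepted.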

Fixed

  • @lukehsiao: The features/labels/gold_label key tables were not properly designed for multiple relations in that they indistinguishably shared the global index of keys. This fixes this issue by including the names of the relations associated with each key. In addition, this ensures that clearing a single relation, or relabeling a single training relation does not inadvertently corrupt the global index of keys. (#167)

0.3.3 - 2018-09-27

Changed

  • @lukehsiao: Added longest_match_only parameter to LambdaFunctionMatcher, which defaults to False, rather than True. (#165)
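This changelog does not spell out the exact semantics of longest_match_only. Assuming it means that matches nested inside a longer match are dropped, the idea can be sketched over plain (start, end) spans as follows; filter_longest is a hypothetical helper, not the matcher's actual code.

```python
def filter_longest(spans):
    """Drop any (start, end) span fully contained in a longer kept span."""
    kept = []
    # Visit longer spans first so nested shorter spans are checked against them.
    for start, end in sorted(spans, key=lambda s: s[0] - s[1]):
        contained = any(
            k_start <= start and end <= k_end for k_start, k_end in kept
        )
        if not contained:
            kept.append((start, end))
    return sorted(kept)


matches = [(0, 5), (0, 2), (3, 5), (7, 9)]
longest = filter_longest(matches)
```

Here (0, 2) and (3, 5) are dropped because they fall inside (0, 5), which corresponds to the behavior one would expect with longest_match_only=True under the stated assumption. With the default of False, all four matches would be kept.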

Fixed

  • @lukehsiao: Fixes the behavior of the get_between_ngrams data model util. (#164)
  • @lukehsiao: Batch queries so that PostgreSQL buffers aren’t exceeded. (#162)

0.3.2 - 2018-09-20

Changed

  • @lukehsiao: MentionNgrams split_tokens now defaults to an empty list and splits on all occurrences, rather than just the first occurrence.
  • @j-rausch: Parser will now skip documents with parsing errors rather than crashing.
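The split_tokens change above, splitting on all occurrences rather than only the first, amounts to the difference shown in this sketch. It illustrates the splitting rule only and does not reproduce how MentionNgrams builds n-grams from the pieces; the function names and the sample string are assumptions.

```python
def split_first(token: str, sep: str):
    """Split on the first occurrence of sep only (the earlier behavior)."""
    return token.split(sep, 1)


def split_all(token: str, sep: str):
    """Split on every occurrence of sep (the behavior after this change)."""
    return token.split(sep)


first = split_first("BC546/547/548", "/")
every = split_all("BC546/547/548", "/")
```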

Fixed

  • @lukehsiao: Fix attribute error when using MentionFigures.

0.3.1 - 2018-09-18

Fixed

  • @lukehsiao: Fix the layers module in fonduer.learning.disc_models.layers.

0.3.0 - 2018-09-18

Added

  • @lukehsiao: Add supporting functions for incremental knowledge base construction. (#154)
  • @j-rausch: Added alpha spacy support for Japanese tokenizer.
  • @senwu: Add sparse logistic regression support.
  • @senwu: Support Python 3.7.
  • @lukehsiao: Allow user to change featurization settings by providing .fonduer-config.yaml in their project.
  • @lukehsiao: Add a new Mention object, and have Candidate objects be composed of Mention objects, rather than directly of Spans. This allows a single Mention to be reused in multiple relations.
  • @lukehsiao: Improved connection-string validation for the Meta class.

Changed

  • @j-rausch: Document.text now returns the modified document text, based on the user-defined html-tag stripping in the parsing stage.
  • @j-rausch: Ngrams now has a n_min argument to specify a minimum number of tokens per extracted n-gram.
  • @lukehsiao: Rename BatchLabelAnnotator to Labeler and BatchFeatureAnnotator to Featurizer. The classes now support multiple relations.
  • @j-rausch: Made the spacy tokenizer the default tokenizer, as long as there is (alpha) support for the chosen language. The lingual argument now specifies whether additional spacy NLP processing shall be performed.
  • @senwu: Reorganize the disc model structure. (#126)
  • @lukehsiao: Add session and parallelism as a parameter to all UDF classes.
  • @j-rausch: Sentence splitting in lingual mode is now performed by spacy’s sentencizer instead of the dependency parser. This can lead to variations in sentence segmentation and tokenization.
  • @j-rausch: Added language argument to Parser for specification of language used by spacy_parser. E.g. language='en'.
  • @senwu: Change weak supervision learning framework from numbskull to MeTaL (https://github.com/HazyResearch/metal). (#119)
  • @senwu: Change learning framework from Tensorflow to PyTorch. (#115)
  • @lukehsiao: Blacklist <script> nodes by default when parsing HTML docs.
  • @lukehsiao: Reorganize ReadTheDocs structure to mirror the repository structure. Now, each pipeline phase’s user-facing API is clearly shown.
  • @lukehsiao: Rather than importing ambiguously from fonduer directly, disperse imports into their respective pipeline phases. This eliminates circular dependencies, and makes imports more explicit and clearer to the user where each import is originating from.
  • @lukehsiao: Provide debug logging of external subprocess calls.
  • @lukehsiao: Use tqdm for progress bar (including multiprocessing).
  • @lukehsiao: Set the default PostgreSQL client encoding to “utf8”.
  • @lukehsiao: Organize documentation for data_model_utils by modality. (#85)
  • @lukehsiao: Rename lf_helpers to data_model_utils, since they can be applied more generally to throttlers or used for error analysis, and are not limited to just being used in labeling functions.
  • @lukehsiao: Update the CHANGELOG to start following KeepAChangelog conventions.
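The n_min argument added to Ngrams in this release sets a lower bound on the number of tokens per extracted n-gram. A minimal sketch of that bound over a plain token list, assuming n_max is the matching upper bound; the ngrams function here is a hypothetical illustration, not the Ngrams class itself.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram with n_min <= n <= n_max tokens."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


grams = list(ngrams(["low", "noise", "amplifier"], n_min=2, n_max=3))
```

With n_min=2, the single-token n-grams are skipped and only the two bigrams and the one trigram are produced.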

Removed

  • @lukehsiao: Remove the XMLMultiDocPreprocessor.
  • @lukehsiao: Remove the reduce option for UDFs, which was unused.
  • @lukehsiao: Remove get parent/children/sentence generator from Context. (#87)
  • @lukehsiao: Remove dependency on pdftotree, which is currently unused.

Fixed

  • @j-rausch: Improve spacy_parser performance. We split the lingual parsing pipeline into two stages. First, we parse structure and gather all sentences for a document. Then, we merge and feed all sentences per document into the spacy NLP pipeline for more efficient processing.
  • @senwu: Speed-up of _get_node using caching.
  • @HiromuHota: Fixed bug with Ngram splitting and empty TemporarySpans. (#108, #112)
  • @lukehsiao: Fixed PDF path validation when using visual=True during parsing.
  • @lukehsiao: Fix Meta bug which would not switch databases when init() was called with a new connection string.

Note

With the addition of Mentions, the process of Candidate extraction has changed. In Fonduer v0.2.3, Candidate extraction was as follows:

candidate_extractor = CandidateExtractor(PartAttr,
                        [part_ngrams, attr_ngrams],
                        [part_matcher, attr_matcher],
                        candidate_filter=candidate_filter)

candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

With this release, you will now first extract Mentions and then extract Candidates based on those Mentions:

# Mention Extraction
part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
temp_ngrams = MentionNgramsTemp(n_max=2)
volt_ngrams = MentionNgramsVolt(n_max=1)

Part = mention_subclass("Part")
Temp = mention_subclass("Temp")
Volt = mention_subclass("Volt")
mention_extractor = MentionExtractor(
    session,
    [Part, Temp, Volt],
    [part_ngrams, temp_ngrams, volt_ngrams],
    [part_matcher, temp_matcher, volt_matcher],
)
mention_extractor.apply(docs, split=0, parallelism=PARALLEL)

# Candidate Extraction
PartTemp = candidate_subclass("PartTemp", [Part, Temp])
PartVolt = candidate_subclass("PartVolt", [Part, Volt])

candidate_extractor = CandidateExtractor(
    session,
    [PartTemp, PartVolt],
    throttlers=[temp_throttler, volt_throttler]
)

candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

Furthermore, because Candidates are now composed of Mentions rather than directly of Spans, to get the Span object from a mention, use the .span attribute of a Mention.
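Conceptually, the two-step flow above pairs up previously extracted Mentions and lets a throttler discard unwanted combinations. The sketch below shows that idea with plain strings. It ignores database storage and the restriction to mentions from the same document, and extract_candidates and positive_temp are hypothetical names.

```python
from itertools import product


def extract_candidates(mention_lists, throttler=None):
    """Combine one mention from each list, keeping tuples the throttler accepts."""
    for combo in product(*mention_lists):
        if throttler is None or throttler(combo):
            yield combo


# Stand-ins for mentions that would come from the mention extraction step.
parts = ["BC546", "BC547"]
temps = ["150", "-65"]


def positive_temp(candidate):
    """Hypothetical throttler: keep only pairs whose temperature is positive."""
    _, temp = candidate
    return not temp.startswith("-")


part_temps = list(extract_candidates([parts, temps], throttler=positive_temp))
```

Because mentions are extracted once and then combined, the same part mention could also feed a second relation (for example a part/voltage pair) without being re-extracted.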

Note

Fonduer has been reorganized to require more explicit import syntax. In Fonduer v0.2.3, nearly everything was imported directly from fonduer:

from fonduer import (
    CandidateExtractor,
    DictionaryMatch,
    Document,
    FeatureAnnotator,
    GenerativeModel,
    HTMLDocPreprocessor,
    Intersect,
    LabelAnnotator,
    LambdaFunctionMatcher,
    MentionExtractor,
    Meta,
    Parser,
    RegexMatchSpan,
    Sentence,
    SparseLogisticRegression,
    Union,
    candidate_subclass,
    load_gold_labels,
    mention_subclass,
)

With this release, you will now import from each pipeline phase. This makes imports more explicit and allows you to more clearly see which pipeline phase each import is associated with:

from fonduer import Meta
from fonduer.candidates import CandidateExtractor, MentionExtractor
from fonduer.candidates.matchers import (
    DictionaryMatch,
    Intersect,
    LambdaFunctionMatcher,
    RegexMatchSpan,
    Union,
)
from fonduer.candidates.models import candidate_subclass, mention_subclass
from fonduer.features import Featurizer
from metal.label_model import LabelModel # GenerativeModel in v0.2.3
from fonduer.learning import SparseLogisticRegression
from fonduer.parser import Parser
from fonduer.parser.models import Document, Sentence
from fonduer.parser.preprocessors import HTMLDocPreprocessor
from fonduer.supervision import Labeler, get_gold_labels

0.2.3 - 2018-07-23

Added

  • @lukehsiao: Support Figures nested in Cell contexts and Paragraphs in Figure contexts. (#84)

0.2.2 - 2018-07-22

Note

Version 0.2.0 and 0.2.1 had to be skipped due to errors in uploading those versions to PyPi. Consequently, v0.2.2 is the version directly after v0.1.8.

Warning

This release is NOT backwards compatible with v0.1.8. The code has now been refactored into submodules, where each submodule corresponds with a phase of the Fonduer pipeline. Consequently, you may need to adjust the paths of your imports from Fonduer.

Added

Changed

  • @senwu: Split up lf_helpers into separate files for each modality. (#81)
  • @lukehsiao: Rename Phrase to Sentence. (#72)
  • @lukehsiao: Split models and preprocessors into individual files. (#60, #64)

Removed

  • @lukehsiao: Remove the futures imports, truly making Fonduer Python 3 only. Also reorganize the codebase into submodules for each pipeline phase. (#59)

Fixed

0.1.8 - 2018-06-01

Added

  • @prabh06: Extend styles parsing and add regex search (#52)

Removed

  • @senwu: Remove the Viewer, which is unused in Fonduer (#55)
  • @lukehsiao: Remove unnecessary encoding in __repr__ (#50)

Fixed

  • @senwu: Fix SimpleTokenizer for when lingual features are disabled (#53)
  • @lukehsiao: Fix LocationMatch NER tags for spaCy (#50)

0.1.7 - 2018-04-04

Warning

This release is NOT backwards compatible with v0.1.6. Specifically, the snorkel submodule in fonduer has been removed. Any previous imports of the form:

from fonduer.snorkel._ import _

Should drop the snorkel submodule:

from fonduer._ import _

Tip

To leverage the logging output of Fonduer, such as in a Jupyter Notebook, you can configure a logger in your application:

import logging
import sys

logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

Added

Removed

  • @lukehsiao: Remove SQLite code, switch to logging, and absorb snorkel codebase directly into the fonduer package for simplicity (#44)
  • @lukehsiao: Remove unused package dependencies (#41)

0.1.6 - 2018-03-31

Changed

  • @lukehsiao: Switch README from Markdown to reStructuredText

Fixed

  • @senwu: Fix support for providing a PostgreSQL username and password as part of the connection string provided to Meta.init() (#40)

0.1.5 - 2018-03-31

Warning

This release is NOT backwards compatible with v0.1.4. Specifically, in order to initialize a session with postgresql, you no longer do

os.environ['SNORKELDB'] = 'postgres://localhost:5432/' + DBNAME
from fonduer import SnorkelSession
session = SnorkelSession()

which had the side-effects of manipulating your database tables on import (or creating a snorkel.db file if you forgot to set the environment variable). Now, you use the Meta class to initialize your session:

from fonduer import Meta
session = Meta.init("postgres://localhost:5432/" + DBNAME).Session()

No side-effects occur until Meta is initialized.

Removed

  • @lukehsiao: Remove reliance on environment vars and remove side-effects of importing fonduer (#36)

Fixed

  • @lukehsiao: Bring codebase into PEP8 compliance and add automatic code-style checks (#37)

0.1.4 - 2018-03-30

Changed

0.1.3 - 2018-03-29

Fixed

Minor hotfix to the README formatting for PyPi.

0.1.2 - 2018-03-29

Added

  • @lukehsiao: Deploy Fonduer to PyPi using Travis-CI