Some links related to Apache Commons Text

kinow @ May 28, 2017 19:50:39

Apache Commons Text is one of the newest components in Apache Commons. It “is a library focused on algorithms working on strings”. I recently collected some links in a bookmark folder that are related to the project in one way or another. If you are interested, have a look at the links below.

  • Morgan Wahl, “Text is More Complicated Than You Think: Comparing and Sorting Unicode”, PyCon 2017
    • Q: Test [text] with some of the examples in this talk to check whether our methods are OK.
    • Q: Canonical decomposition and code point comparisons: are we doing them? Are we doing them right?
    • Q: Do we have casefolding?
    • Q: Do we have multi-level sort?
    • Q: CLDR
  • Łukasz Langa, “Unicode: what is the big deal”, PyCon 2017
    • Q: I am quite sure we have an issue about guessing the encoding of a text… Is there a GPL library for that? Under Mozilla, perhaps?
  • Jiaqi Liu, “Fuzzy Search Algorithms: How and When to Use Them”, PyCon 2017
    • Q: Does OpenNLP have n-grams? Would it make sense to have them in [text]?
    • Q: Where can we find some tokenizers? OpenNLP?
  • Lothaire’s books, such as “Combinatorics on Words” and “Algebraic Combinatorics on Words”.
  • Java tutorial lesson “Working with Text”
  • Mitzi Morris’ Text Processing in Java book
  • StringSearch Java library.
    • “The Java language lacks fast string searching algorithms. StringSearch provides implementations of the Boyer-Moore and the Shift-Or (bit-parallel) algorithms. These algorithms are easily five to ten times faster than the naïve implementation found in java.lang.String”.
  • Jakarta Oro (attic)
    • “The Jakarta-ORO Java classes are a set of text-processing Java classes that provide Perl5 compatible regular expressions, AWK-like regular expressions, glob expressions, and utility classes for performing substitutions, splits, filtering filenames, etc. This library is the successor to the OROMatcher, AwkTools, PerlTools, and TextTools libraries originally from ORO, Inc. Despite little activity in the form of new development initiatives, issue reports, questions, and suggestions are responded to quickly.”
    • Discontinued, but is there anything useful in there? The attic always has interesting things, after all…
  • TextProcessing blog - A Text Processing Portal for Humans
  • twitter-text, Twitter’s text processing library for Java (multi-language, actually…).
  • Python’s text modules
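
To make the Unicode questions above (canonical decomposition, case folding, multi-level sort) a bit more concrete, here is a minimal sketch using only the JDK's java.text.Normalizer and java.text.Collator. It says nothing about what [text] does today. The class name UnicodeCompareSketch and its helper methods are made up for illustration, and roughFold is only an approximation of real Unicode case folding.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not part of Apache Commons Text.
public class UnicodeCompareSketch {

    // Compare two strings after canonical decomposition (NFD), so that
    // precomposed and decomposed spellings of the same text match.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    // Rough stand-in for case folding: upper-case, then lower-case.
    // This is an approximation, NOT full Unicode case folding.
    static String roughFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Multi-level comparison: the Collator strength decides which
    // differences count (PRIMARY: base letters, SECONDARY: accents,
    // TERTIARY: case).
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";     // e with acute accent as one code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true
        System.out.println(roughFold("Stra\u00dfe"));               // strasse

        // Accents are ignored at PRIMARY strength but not at SECONDARY.
        System.out.println(compareAt(Collator.PRIMARY, "resume", "r\u00e9sum\u00e9") == 0);
        System.out.println(compareAt(Collator.SECONDARY, "resume", "r\u00e9sum\u00e9") == 0);
    }
}
```

Plain String.equals compares code units, which is why the two spellings of “é” differ until they are normalized. The Collator strengths are the JDK's take on multi-level sorting. Whether any of this belongs in [text], or should stay in the JDK or ICU4J, is exactly the open question.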

♥ Open Source