Blog

Posts about technology and arts.

Apache Commons Text LookupTranslator

Apache Commons Text includes several algorithms for text processing. Today’s post is about one of the classes available since the 1.0 release, the LookupTranslator.

It translates text using a lookup table. Most users will not use this class directly, at least not knowingly. They are more likely to use StringEscapeUtils, which contains methods to escape and unescape CSV, JSON, XML, Java, and EcmaScript.

String original = "He didn't say, \"stop!\"";
String expected = "He didn't say, \\\"stop!\\\"";
String result   = StringEscapeUtils.escapeJava(original);

StringEscapeUtils uses several CharSequenceTranslator implementations, including LookupTranslator. You can also use it directly to escape other data your text may contain, such as special characters not supported by some third-party library or system, or to handle an even simpler case.

In other words, you would be creating your own StringEscapeUtils. Let’s say you have some text where numbers must never start with the digit zero, due to some restriction in the way you use that data later.

// The Commons Text LookupTranslator constructor takes a Map<CharSequence, CharSequence>
Map<CharSequence, CharSequence> lookupTable = new HashMap<>();
// Replace every "0" with the empty string
lookupTable.put("0", "");
final LookupTranslator escapeNumber0 = new LookupTranslator(lookupTable);
String escaped = escapeNumber0.translate("There are 02 texts waiting for analysis today...");

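That way the resulting text would be “There are 2 texts waiting for analysis today...”, allowing you to proceed with the rest of your analysis. Note that this lookup table removes every “0” in the text, not only the leading ones, so a number like 100 would become 1. This is a very simple example, but hopefully you grokked how LookupTranslator works.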
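To make the mechanics more concrete, here is a minimal sketch of the idea behind a lookup-table translator, written with the JDK only. It is not the Commons Text source: the class LookupSketch and its methods translate and stripLeadingZeros are names made up for this illustration. It also shows the caveat with the "0" example, and one possible regular-expression alternative if only leading zeros of whole numbers should go.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, JDK-only illustration of lookup-table translation (hypothetical, not the library code). */
final class LookupSketch {

    /** Replaces each occurrence of a key with its value, preferring the longest key that matches. */
    static String translate(String input, Map<String, String> table) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            String bestKey = null;
            for (String key : table.keySet()) {
                if (!key.isEmpty() && input.startsWith(key, i)
                        && (bestKey == null || key.length() > bestKey.length())) {
                    bestKey = key;
                }
            }
            if (bestKey != null) {
                // A key matched at this position: emit its replacement and skip the key
                out.append(table.get(bestKey));
                i += bestKey.length();
            } else {
                // No key matched: copy the character unchanged
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** Assumed alternative: drops zeros only when they lead a number (not after a word character or a dot). */
    static String stripLeadingZeros(String input) {
        return input.replaceAll("(?<![\\w.])0+(?=\\d)", "");
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("0", "");
        // prints "There are 2 texts waiting for analysis today..."
        System.out.println(translate("There are 02 texts waiting for analysis today...", table));
        // prints "There are 1 texts", because every "0" is removed
        System.out.println(translate("There are 100 texts", table));
        // prints "There are 2 of 100 texts"
        System.out.println(stripLeadingZeros("There are 02 of 100 texts"));
    }
}
```

The lookup approach is a good fit when every occurrence of a sequence must be replaced; the regular expression is only one way to handle position-dependent rules such as leading zeros.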

♥ Open Source

Some links related to Apache Commons Text

Apache Commons Text is one of the newest components in Apache Commons. It “is a library focused on algorithms working on strings”. I recently collected some links related to the project in a bookmark folder. In case you are interested, check some of the links below.

  • Morgan Wahl, “Text is More Complicated Than You Think: Comparing and Sorting Unicode”, PyCon 2017 (https://www.youtube.com/embed/bx3NOoroV-M?rel=0)
    • Q: Test [text] to check whether our methods are OK with some of the examples in this talk.
    • Q: Canonical Decomposition, and code points comparisons; are we doing it? Are we doing it right?
    • Q: Do we have casefolding?
    • Q: Do we have multi-level sort?
    • Q: CLDR
  • Łukasz Langa, “Unicode: what is the big deal”, PyCon 2017 (https://www.youtube.com/embed/7m5JA3XaZ4k?rel=0)
    • Q: Quite sure we have an issue about guessing the encoding of a text… Is there a GPL library for that? Under Mozilla perhaps?
  • Jiaqi Liu, “Fuzzy Search Algorithms: How and When to Use Them”, PyCon 2017 (https://www.youtube.com/embed/kTS2b6pGElE?rel=0)
    • Q: Does OpenNLP have n-grams? Would it make sense to have that in [text]?
    • Q: Where can we find some tokenizers? OpenNLP?
  • Lothaire’s Books like “Combinatorics on Words” and “Algebraic Combinatorics”.
  • Java tutorial lesson “Working with Text”
  • Mitzi Morris’ Text Processing in Java book
  • StringSearch java library.
    • "The Java language lacks fast string searching algorithms. StringSearch provides implementations of the Boyer-Moore and the Shift-Or (bit-parallel) algorithms. These algorithms are easily five to ten times faster than the naïve implementation found in java.lang.String".
  • Jakarta Oro (attic)
    • The Jakarta-ORO Java classes are a set of text-processing Java classes that provide Perl5 compatible regular expressions, AWK-like regular expressions, glob expressions, and utility classes for performing substitutions, splits, filtering filenames, etc. This library is the successor to the OROMatcher, AwkTools, PerlTools, and TextTools libraries originally from ORO, Inc. Despite little activity in the form of new development initiatives, issue reports, questions, and suggestions are responded to quickly.
    • Discontinued, but is there anything useful in there? The attic always has interesting things after all…
  • TextProcessing blog - A Text Processing Portal for Humans
  • twitter-text, the Twitter Java (multi-language actually…) text processing library.
  • Python’s text modules

♥ Open Source

When you don't realize you need a Comparable

In 2012, I wrote about how you always learn something new by following the Apache dev mailing lists.

After about five years, I am still learning, and still impressed by the knowledge of other developers. A few days ago I was massaging some code in a pull request and a developer suggested that I simplify my code.

Drawing sketch: Page 036


Almost ANZAC day. And almost time to switch back to coding Jenkins plug-ins.

I normally sketch more in between development cycles for Open Source projects or contract work.

  • Krill: Krill are small crustaceans of the order Euphausiacea, and are found in all the world’s oceans. The name krill comes from the Norwegian word krill, meaning “small fry of fish”, which is also often attributed to species of fish.
  • Tui: Tui are boisterous, medium-sized, common and widespread birds of forest and suburbia – unless you live in Canterbury. They look black from a distance, but in good light tui have a blue, green and bronze iridescent sheen, and distinctive white throat tufts (poi). They are usually very vocal, with a complicated mix of tuneful notes interspersed with coughs, grunts and wheezes. In flight, their bodies slant with the head higher than the tail, and their noisy whirring flight is interspersed with short glides.
  • Fatigue or Craziness: Lots of crazy people on Queen Street during holidays.

Troubleshooting a Jenkins Plug-in compatibility issue

This post is probably different from others. I will give a TL;DR, and then a copy of a comment from a Jenkins JIRA issue. I hope you have fun reading it, especially if you maintain Jenkins servers or plug-ins.

TL;DR: there was an issue in the Jenkins Job DSL Plug-in that caused generated jobs to have an invalid script. The fix had not been released, but was already in the master branch on GitHub.

Well, that was fun. Bug reproduced, and I believe I know why it’s happening. Not so complicated to fix… but no fast way to fix it. Here’s the issue analysis (grab a coffee to read it).

  • Downloaded jenkins.war (2.32.3.war)
  • mkdir /tmp/123
  • JENKINS_HOME=/tmp/123 java -jar jenkins.war
  • Entered secret into form and submitted
  • Installed suggested plugins (boy that takes a while)
  • Created temp user
  • Installed (without restart) active-choices-plugin 1.5.3
  • Installed (without restart) job-dsl-plugin
  • Manually stopped Jenkins, and started it again with the same command as before (JENKINS_HOME=/tmp/123 java -jar jenkins.war)
  • Logged in with the temp user, all looking good
  • Created Freestyle job JENKINS-42655 (see attached config.xml)
  • Executed job, and found new job JENKINS-42655-1
  • Never opened the job configuration, clicked on the “Generated Items link to JENKINS-42655-1” to open in a new tab
  • Clicked on Build with Parameters
  • Looked at logs, and noticed the script-security-plugin exceptions
  • Went to “Manage Jenkins” / “In-process Script Approval” and approved scripts
  • Went back to the JENKINS-42655-1 build with parameters screen, and everything worked as expected

Hummm. Issue more or less reproduced. Let’s investigate more.

  • Restarted Jenkins again
  • Changed the JENKINS-42655 seed job configuration to use a different script
  • Copied the config.xml file to another location
  • Went to build with parameters, and now it was broken again
  • Saved the job manually
  • Copied the config.xml file to yet another location
  • Went to build with parameters, and now it worked as reported in this issue

Now comes the interesting part: looking at the diff. I am attaching a screenshot so that others can have fun looking at it too. I installed Kompare as it has some cool features, such as ignoring white space and blank lines in the diff. The whole file changes as you save it. But if you ignore the amount of white space… then you can see that the Job DSL Plug-in is creating a <script> tag, as we used to do before 1.5, I think.
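If you do not have a diff tool at hand, the same "ignore white space and blank lines" comparison can be roughly approximated in a few lines of Java. This is only a sketch: the class WhitespaceDiffSketch and its methods are made up for this illustration, and it is not a real diff, since it ignores line order and duplicate lines. It simply lists the lines that appear in the second copy of a file but not in the first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: compares two files while ignoring white space and blank lines. */
final class WhitespaceDiffSketch {

    /** Trims the line and collapses runs of white space, so indentation changes do not count. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    /** Returns the normalized lines, dropping the ones that are blank. */
    static List<String> significantLines(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String normalized = normalize(line);
            if (!normalized.isEmpty()) {
                result.add(normalized);
            }
        }
        return result;
    }

    /** Lines found in 'after' but not in 'before', once white space and blank lines are ignored. */
    static List<String> addedLines(List<String> before, List<String> after) {
        List<String> added = new ArrayList<>(significantLines(after));
        added.removeAll(significantLines(before));
        return added;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java WhitespaceDiffSketch <before-file> <after-file>");
            return;
        }
        List<String> before = Files.readAllLines(Paths.get(args[0]));
        List<String> after = Files.readAllLines(Paths.get(args[1]));
        // Print only the lines that are new in the second file
        for (String line : addedLines(before, after)) {
            System.out.println("+ " + line);
        }
    }
}
```

Running it against the two saved copies of config.xml, first in one order and then with the arguments swapped, would show what each version has that the other lacks.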
