
Securely using passwords with R

kinow @ Jun 03, 2017 20:21:39

It is quite common to have code that needs to interact with another system, a database, or a third-party application, and that must use some sort of credentials to communicate securely.

Most of the R code I wrote or reviewed normally had no borders (from a systems-analysis perspective) with other systems; it basically just interacted with the file system to retrieve NetCDF or JSON files.

However, after I saw a comment on Reddit [1] about this some time ago, I decided to check what others used. With R Shiny becoming more popular, and more R code moving to the web, I think this will become a common requirement for R code.

There is a blog post from Revolution Analytics [2] that gives a nice summary of the options. The post is from 2015, but I do not think much has changed now in 2017. From the blog post and its comments, there are seven methods listed:

  1. Put credentials in your source code
  2. Put credentials in a file in your project, but do not share this file (#3, #4, and #5 are similar to this one)
  3. Put credentials in a .Rprofile file
  4. Put credentials in a .Renviron file
  5. Put credentials in a JSON or YAML file
  6. Put credentials in a secure store that you can access from R
  7. Ask the user for the credentials when the script is executed (possibly not useful for R Shiny applications)

My preferred way for R is the .Renviron (or a dotEnv) file. You store your password in this file, make sure you do not share it (a global gitignore can help prevent accidents), and read the variables when your code starts.
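A minimal sketch of the approach (the variable names and values below are made up for illustration): the credentials live in a .Renviron file that stays out of version control, and R loads it automatically at startup.

```
# ~/.Renviron -- never commit this file (a global gitignore entry helps)
DB_USER=analyst
DB_PASSWORD=correct-horse-battery-staple
```

In your script you then read a value with Sys.getenv("DB_PASSWORD"), so the secret never appears in the source code you share.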

## Secrets

If you would like to increase security further, you can combine this with a variation of #6: use a .Renviron file together with an encryption service such as Amazon KMS (Key Management Service).

With AWS KMS in R, you can encrypt your values and store them encrypted in your .Renviron. Even if someone gets hold of your .Renviron file, you have an extra layer of protection, as the attacker would also require access to your cloud environment to decrypt the values.


♥ Open Source

Natural Language Processing and Natural Language Understanding

kinow @ Jun 03, 2017 12:30:39

I have used Natural Language Processing (NLP) tools in a few projects over the past years. But it was only recently, while involved in a chatbot project, that I noticed the term Natural Language Understanding (NLU).

NLU can be seen as a subfield of NLP. NLP encompasses all techniques used for parsing text and extracting some knowledge from it: finding the common entities in a text, calculating the likelihood of a certain word being a preposition or a verb, sentiment analysis, or even spell-checking sentences.

Figure: NLP, NLU, and NLG.

When you are processing a language, it does not mean you are understanding the language. You may simply have rules, statistics, or some heuristics built to extract the information you need for some work.

NLU, on the other hand, implies that a system will try to understand the text. A text such as “Given the price of mandarin lately, how likely is it to increase as much as the bananas?” could be parsed with NLP tools such as OpenNLP, CoreNLP, spaCy, etc.

However, you would then have annotated text, entities, sentiment, perhaps coreference, but you still would not be able to build a machine that simply understands the text. You can - with some effort - build rule-based systems, apply semantics and ontologies, or use information retrieval techniques.

But it is still extremely hard to build a system that understands all the ways the same thing could be said, slightly modified, such as “are mandarins getting as expensive as bananas?”, or that takes context such as location or time into consideration: “were mandarins getting as expensive as bananas around the Pacific Islands five years ago?”.

There are other subfields in NLP, such as Natural Language Generation (NLG). Techniques from both NLU and NLG are used in chatbots, to parse the language and also to generate text based on the conversation.

NLP is an extremely interesting topic. 2001: A Space Odyssey, Star Trek, the Foundation series, and other science fiction books and movies all feature intelligent computer systems that interact with humans. But if we tried to re-create these right now, even with all the work on machine learning, we would still get very funny (and perhaps dangerous) results.


Apache Commons Text LookupTranslator

kinow @ Jun 02, 2017 22:50:39

Apache Commons Text includes several algorithms for text processing. Today’s post is about one of the classes available since the 1.0 release, the LookupTranslator.

It is used to translate text using a lookup table. Most users won’t necessarily be - knowingly - using this class. Most likely, they will use the StringEscapeUtils, which contains methods to escape and unescape CSV, JSON, XML, Java, and EcmaScript.

String original = "He didn't say, \"stop!\"";
String expected = "He didn't say, \\\"stop!\\\"";
String result   = StringEscapeUtils.escapeJava(original);

StringEscapeUtils uses CharSequenceTranslator implementations, including LookupTranslator. You can use it directly too, to escape other data your text may contain: special characters not supported by some third-party library or system, or even a simpler case.

In other words, you would be creating your own StringEscapeUtils. Let’s say you have some text where numbers must never start with the zero digit, due to some restriction in the way you use that data later.

final LookupTranslator escapeNumber0 = new LookupTranslator(new String[][] { { "0", "" } });
String escaped = escapeNumber0.translate("There are 02 texts waiting for analysis today...");

That way the resulting text would be “There are 2 texts waiting for analysis today...”, allowing you to proceed with the rest of your analysis. This is a very simple example, but hopefully you grokked how LookupTranslator works.
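The core idea behind a lookup translator - scan the input and substitute every table key found - can be sketched with plain JDK code. This is a simplified, hypothetical version for illustration; the real class works on CharSequences, matches the longest key first, and can stream to a Writer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TinyLookupTranslator {
    private final Map<String, String> table;

    public TinyLookupTranslator(Map<String, String> table) {
        this.table = new LinkedHashMap<>(table);
    }

    public String translate(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try each key at the current position; on a match, emit the
            // replacement and skip past the key
            for (Map.Entry<String, String> e : table.entrySet()) {
                String key = e.getKey();
                if (input.startsWith(key, i)) {
                    out.append(e.getValue());
                    i += key.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No key matched here: copy the character through unchanged
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TinyLookupTranslator t = new TinyLookupTranslator(Map.of("0", ""));
        System.out.println(t.translate("There are 02 texts waiting for analysis today..."));
    }
}
```

Note that, like the example above, this replaces every occurrence of “0”, not only leading zeros; a real escaping table would typically map multi-character sequences.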

♥ Open Source

Some links related to Apache Commons Text

kinow @ May 28, 2017 19:50:39

Apache Commons Text is one of the most recent new components in Apache Commons. It “is a library focused on algorithms working on strings”. I recently collected some links under a bookmark folder that are in some way related to the project. In case you are interested, check some of the links below.

  • Morgan Wahl, “Text is More Complicated Than You Think: Comparing and Sorting Unicode”, PyCon 2017
    • Q: test [text] to check if our methods are OK with some examples from this talk
    • Q: Canonical Decomposition, and code points comparisons; are we doing it? Are we doing it right?
    • Q: Do we have casefolding?
    • Q: Do we have multi-level sort?
    • Q: CLDR
  • Łukasz Langa, “Unicode: what is the big deal?”, PyCon 2017
    • Q: Quite sure we have an issue to guess the encoding for a text…. there is a GPL library for that? Under Mozilla perhaps?
  • Jiaqi Liu, “Fuzzy Search Algorithms: How and When to Use Them”, PyCon 2017
    • Q: Does OpenNLP have N-GRAM’s? Would it make sense to have that in [text]?
    • Q: Where can we find some tokenizers? OpenNLP?
  • Lothaire’s books, like “Combinatorics on Words” and “Algebraic Combinatorics on Words”.
  • Java tutorial lesson “Working with Text”
  • Mitzi Morris’ Text Processing in Java book
  • StringSearch java library.
    • “The Java language lacks fast string searching algorithms. StringSearch provides implementations of the Boyer-Moore and the Shift-Or (bit-parallel) algorithms. These algorithms are easily five to ten times faster than the naïve implementation found in java.lang.String”.
  • Jakarta Oro (attic)
    • The Jakarta-ORO Java classes are a set of text-processing Java classes that provide Perl5 compatible regular expressions, AWK-like regular expressions, glob expressions, and utility classes for performing substitutions, splits, filtering filenames, etc. This library is the successor to the OROMatcher, AwkTools, PerlTools, and TextTools libraries originally from ORO, Inc. Despite little activity in the form of new development initiatives, issue reports, questions, and suggestions are responded to quickly.
    • Discontinued, but is there anything useful in there? The attic has always interesting things after all…
  • TextProcessing blog - A Text Processing Portal for Humans
  • twitter-text, the Twitter Java (multi-language actually…) text processing library.
  • Python’s text modules

♥ Open Source

When you don't realize you need a Comparable

kinow @ May 15, 2017 23:07:39

In 2012, I wrote about how you always learn something new by following the Apache dev mailing lists.

After about five years, I am still learning, and still getting impressed by the knowledge of other developers. Days ago I was massaging some code in a pull request and a developer suggested that I simplify my code.

The suggestion was to make the class Comparable, to both simplify the code and have a better design. I immediately agreed, and in hindsight it was the most logical choice. Yet I simply did not think of it.

// What the code was
    int cmp = 0;
    String c1 = nv1.getCollation();
    String c2 = nv2.getCollation();
    if (c1 != null && c2 != null && c1.equals(c2)) {
        // locales are parsed. Here we could think about caching if necessary
        Locale desiredLocale = Locale.forLanguageTag(c1);
        Collator collator = Collator.getInstance(desiredLocale);
        cmp =, nv2.getString());
    } else {
        cmp = XSDFuncOp.compareString(nv1, nv2);
    }
    return cmp;

// What the code is now
    return ((NodeValueSortKey) nv1).compareTo((NodeValueSortKey) nv2);

This moved the comparison logic into a method of the NodeValueSortKey class, which reduced the complexity of the class containing the switch statement, and also made it easier to write unit tests.
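The pattern is general: instead of comparison logic living at the call site, the class implements Comparable, and callers (and java.util.Collections.sort) can rely on compareTo. A minimal self-contained sketch with hypothetical names, loosely inspired by the collation logic above:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical value class: a string compared according to an optional
// collation (a BCP 47 language tag such as "fr" or "en")
public class SortKey implements Comparable<SortKey> {
    private final String value;
    private final String collation; // may be null

    public SortKey(String value, String collation) {
        this.value = value;
        this.collation = collation;
    }

    public String getValue() {
        return value;
    }

    @Override
    public int compareTo(SortKey other) {
        // If both values share a collation, use a locale-aware Collator...
        if (collation != null && collation.equals(other.collation)) {
            Collator collator = Collator.getInstance(Locale.forLanguageTag(collation));
            return, other.value);
        }
        // ...otherwise fall back to a plain lexicographic comparison
        return value.compareTo(other.value);
    }

    public static void main(String[] args) {
        List<SortKey> keys = new ArrayList<>();
        keys.add(new SortKey("péché", "fr"));
        keys.add(new SortKey("peach", "fr"));
        Collections.sort(keys); // possible because SortKey is Comparable
        System.out.println(keys.get(0).getValue());
    }
}
```

The callers no longer need to know anything about collations; they just sort.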

If you are not involved in Open Source projects yet, I keep my suggestion from five years ago: find a project related to something you like, and start reading the code, lurking in the mailing list, or watching GitHub repositories.

You can always learn more!

♥ Open Source