SciCo 2017: An Introduction to the International Image Interoperability Framework (IIIF)

kinow @ Jul 21, 2017 21:25:03

An Introduction to the International Image Interoperability Framework (IIIF)

Event
Science Coding Conference 2017
Where?
Wellington, New Zealand
When?
2017-08-04

Drawing Archer

kinow @ Jun 27, 2017 00:09:03

Archer, from redditgetsdrawn.

Archer drawing
Archer photo

Simplified and changed her angle. On DeviantArt too.

How to remove the signature from e-mails with NLP?

kinow @ Jun 14, 2017 13:59:33

Some time ago I stumbled across EmailParser, a Python utility to remove e-mail signatures. Here’s a sample input e-mail from the project documentation.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

Best,
Joe Smith

Joe Smith | Strategy & Business Development
111 Market St. Suite 111| San Francisco, CA 94103
M: 111.111.1111| joe@foobar.com

And here’s what it looks like afterwards.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

As you can see, it removed all the lines after the main part of the message (i.e. after the three paragraphs). Here’s what the Python code looks like.

>>> # generate_text is assumed to be exported by Parser alongside the other helpers
>>> from Parser import read_email, strip, prob_block, generate_text
>>> from spacy.en import English

>>> pos = English()  # part-of-speech tagger
>>> msg_raw = read_email('emails/test1.txt')
>>> msg_stripped = strip(msg_raw)  # preprocessing text before POS tagging

>>> # iterate through lines, write to file if not signature block
>>> generate_text(msg_stripped, .9, pos, 'emails/test1_clean.txt')

What interested me about this utility was its use of NLP. I couldn’t imagine how someone could use NLP for that. And I liked the simplicity of the approach, which is not perfect, but may be useful someday.

After the imports, the code creates a part-of-speech (POS) tagger using the spaCy NLP library, reads the e-mail from a file, then strips it and creates an array with each paragraph of the message.

The magic happens in the generate_text function, which receives the array of paragraphs, a threshold, the POS tagger, and the output destination. Here’s what the function does.

for each paragraph (block) in the message
    if probability ( signature block | paragraph ) < threshold
        write to output file

And the formula for calculating the probability is quite simple too.

1. For a given paragraph (message block), find all the sentences in it.
2. Then for each word (token) in the sentence, count the number of times a non-verb appears.
3. Return the proportion of non-verbs per sentence, i.e. number of non-verbs / number of sentences.

In summary, blocks that do not contain enough verbs to count as message text are treated as signature blocks and discarded.
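The filtering idea can be sketched in a few lines of Python. This is a minimal sketch, not EmailParser's actual code: the names `signature_probability`, `keep_message_blocks`, `tokenize` and the tiny `TOY_VERBS` lexicon are hypothetical stand-ins, and a real run would ask a POS tagger such as spaCy whether each token is a verb. The sketch also divides by the number of tokens, so that the score stays between 0 and 1 and can be compared with a threshold such as .9; that normalisation is an assumption on my part.

```python
import re

# Hypothetical stand-in for a real POS tagger: a tiny hand-made verb lexicon.
TOY_VERBS = {"would", "love", "set", "chat", "is", "are", "was", "am",
             "mentioned", "look", "does", "speak"}


def is_toy_verb(token):
    """Return True if the token is in the toy verb lexicon."""
    return token in TOY_VERBS


def tokenize(text):
    """Split text into lowercase word tokens (deliberately naive)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def signature_probability(block, is_verb):
    """Share of tokens in the block that are not verbs, from 0.0 to 1.0."""
    tokens = tokenize(block)
    if not tokens:
        return 1.0  # assumption: a block with no words is not message text
    non_verbs = sum(1 for token in tokens if not is_verb(token))
    return non_verbs / len(tokens)


def keep_message_blocks(blocks, threshold, is_verb):
    """Keep only the blocks whose signature score is below the threshold."""
    return [b for b in blocks if signature_probability(b, is_verb) < threshold]


body = "Would love to set up some time to chat via phone next week."
signature = "Joe Smith | Strategy & Business Development"
closing = "Best,\nJoe Smith"

kept = keep_message_blocks([body, signature, closing], 0.9, is_toy_verb)
print(kept)
```

With the toy lexicon, only the sentence from the message body survives: the signature line and the closing contain no verbs at all, so their score is 1.0 and they fall on the wrong side of the threshold.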

I had never thought about using an approach like this. It could be helpful when doing data analysis, information retrieval, or scraping data from the web. Not necessarily with e-mails and signatures, but you get the gist of it.

♥ Open Source

Backward compatibility and switch statement with constant expressions

kinow @ Jun 10, 2017 17:35:39

Maintaining Open Source software can be challenging. Making sure you keep backward compatibility (not only binary) can be even more challenging. The Apache Commons Lang 3.6 release is happening right now thanks to Benedikt Ritter, and it is on its fourth Release Candidate (i.e. RC4).

RC2 was cancelled due to an IBM JDK 8 compatibility issue: more specifically, the lazy initialization of ArrayList seems to differ between Oracle JDK and IBM JDK.

The RC3 was cancelled due to a change that could affect users using a switch statement.

The change was in the CharEncoding class. The issue was that some of the constants in this class stopped being constant expressions.

A constant expression is an expression denoting a value of primitive type or a String that does not complete abruptly and is composed using only the following (…)

public class CharEncoding {
    // ...
    public static final String ISO_8859_1 = "ISO-8859-1";
    // ...
}

The code above contains a constant (static final) variable initialized with a string literal, so a reference to it is a constant expression. Users can safely use it in switch statements, and the Java compiler won’t complain about it.

public class CharEncoding {
    // ...
    public static final String ISO_8859_1 = StandardCharsets.ISO_8859_1.name();
    // ...
}

The code above is from the change that caused the regression. Any user who used the ISO_8859_1 constant in a switch statement would get a compilation error (e.g. case expressions must be constant expressions) when updating to Apache Commons Lang 3.6. That is because the field is now initialized with a method call, so it is no longer a constant variable and a reference to it is no longer a constant expression.

I think I learned that some time ago, but if you asked me what was wrong with the change, and whether it would break backward compatibility, I would probably fail to spot the issue. There are tools for that now (e.g. Clirr, japicmp), though they may miss some cases too.

Luckily, in this case a user subscribed to the Apache Commons development mailing list spotted the issue and quickly reported it. Maintaining an Open Source project gets easier with constructive feedback from users, like this one.

♥ Open Source

Securely using passwords with R

kinow @ Jun 03, 2017 20:21:39

It is quite common to have code that needs to interact with another system, database, or third party application, and that needs some sort of credentials to communicate securely.

Most of the code I wrote in R, or reviewed, normally had no boundaries (from a systems analysis perspective) with other systems, or only interacted with the file system to retrieve NetCDF or JSON files.

However, after I saw a comment on Reddit [1] some time ago about this, I decided to check what others used. With R Shiny becoming more popular, and more R code moving to the web, I think this will become a common requirement for R code.

There is a blog post from RevolutionAnalytics [2] that gives a nice summary of the options. The post is from 2015, but I do not think much has changed as of 2017. From the blog post and comments, there are seven methods listed:

  1. Put credentials in your source code
  2. Put credentials in a file in your project, but do not share this file (#3, #4, and #5 are similar to this one)
  3. Put credentials in a .Rprofile file
  4. Put credentials in a .Renviron file
  5. Put credentials in a JSON or YAML file
  6. Put credentials in a secure store that you can access from R
  7. Ask the user for the credentials when the script is executed (possibly not useful for R Shiny applications)

My preferred way for R is the .Renviron (or a dotEnv) file. You store your password in this file, make sure you never share it (a global gitignore helps prevent accidents), and read the variables when your code starts, for example with Sys.getenv("MYSQL_PASSWORD").

## Secrets
MYSQL_PASSWORD=secret

To increase security, you can combine it with a variation of #6: keep the .Renviron file, but encrypt its values with a service such as AWS KMS (Key Management Service).

With AWS KMS in R, you encrypt your values and store only the encrypted form in your .Renviron. Even if someone gets hold of the file, you have an extra layer of protection, as the attacker would also need access to your cloud environment to decrypt the values.

## Secrets
MYSQL_PASSWORD=AQECAHga320J8WadplGCqqVAr4HNvDaFSQ+NaiwIBhmm6qDSFwAAAGIwYAYJKoZIhvcNAQcGoFMwUQIBADBMBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEE99+LoLdvYv8l41OhAAIBEIAfx49FFJCLeYrkfMfAw6XlnxP23MmDBdqP8dPp28OoAQ==

References

♥ Open Source