Posts tagged with ‘natural language processing’

How to remove the signature from e-mails with NLP?

kinow @ Jun 14, 2017 13:59:33

Some time ago I stumbled across EmailParser, a Python utility to remove e-mail signatures. Here’s a sample input e-mail from the project documentation.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

Best,
Joe Smith

Joe Smith | Strategy & Business Development
111 Market St. Suite 111| San Francisco, CA 94103
M: 111.111.1111| joe@foobar.com

And here’s what it looks like afterwards.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

As you can see, it removed all the lines after the main part of the message (i.e. after the three paragraphs). Here’s what the Python code looks like.

>>> from Parser import read_email, strip, prob_block, generate_text  # generate_text assumed to live in the same module
>>> from spacy.en import English

>>> pos = English()  # part-of-speech tagger
>>> msg_raw = read_email('emails/test1.txt')
>>> msg_stripped = strip(msg_raw)  # preprocessing text before POS tagging

# iterate through lines, write to file if not signature block
>>> generate_text(msg_stripped, .9, pos, 'emails/test1_clean.txt')

What got me interested in this utility was its use of NLP. I couldn’t imagine how someone could use NLP for that. I also liked the simplicity of the approach, which is not perfect but could come in handy someday.

After the imports, the code creates a part-of-speech tagger using the spaCy NLP library, reads the e-mail from a file, then strips it and creates an array with each paragraph of the message.

The magic happens in the generate_text function, which receives the array of paragraphs, a threshold, the POS tagger, and the output destination. Here’s what the function does.

for each paragraph (message block)
    if probability ( signature block | message ) < threshold
        write to output file

And the formula for calculating the probability is quite simple too.

1. For a given paragraph (message block), find all the sentences in it.
2. Then for each word (token) in the sentence, count how many are not verbs.
3. Return the proportion of non-verbs in the sentence, i.e. number of non-verbs / number of tokens, so that the value can be compared against the threshold.

In summary, it treats blocks that do not contain enough verbs as signature blocks rather than message blocks, and discards them.
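To make the idea concrete, here is a minimal sketch of the heuristic in plain Python. It is not the EmailParser code: the names non_verb_ratio, keep_message_blocks, toy_is_verb and TOY_VERBS are made up for this example, the tiny verb list is only a stand-in for a real part-of-speech tagger such as spaCy, and the ratio is computed per block rather than per sentence to keep things short.

```python
import re

# Toy verb list standing in for a real POS tagger such as spaCy (assumption).
TOY_VERBS = {
    "am", "is", "are", "was", "were", "be", "do", "does", "did", "have", "has",
    "thanks", "moving", "meet", "mentioned", "leading", "would", "love",
    "set", "chat", "look", "speak", "understand", "apply",
}


def toy_is_verb(token):
    """Hypothetical stand-in for a tagger check such as token.pos_ == 'VERB'."""
    return token.lower() in TOY_VERBS


def non_verb_ratio(block, is_verb=toy_is_verb):
    """Return the share of word tokens in `block` that are not verbs."""
    tokens = re.findall(r"[A-Za-z']+", block)
    if not tokens:
        return 1.0  # nothing word-like: treat the block as signature-like
    non_verbs = sum(1 for token in tokens if not is_verb(token))
    return non_verbs / len(tokens)


def keep_message_blocks(blocks, threshold, is_verb=toy_is_verb):
    """Keep only the blocks whose non-verb ratio is below `threshold`."""
    return [b for b in blocks if non_verb_ratio(b, is_verb) < threshold]


blocks = [
    "Would love to set up some time to chat via phone next week.",
    "Joe Smith | Strategy & Business Development",
    "M: 111.111.1111| joe@foobar.com",
]
kept = keep_message_blocks(blocks, 0.9)
print(kept)
```

With the 0.9 threshold from the sample session, only the first block survives, because the two signature lines contain no verbs at all. Swapping toy_is_verb for a function backed by a real tagger is the obvious next step.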

I had never thought about using an approach like this. It could well be helpful when doing data analysis, information retrieval, or scraping data from the web. Not necessarily with e-mails and signatures, but you get the gist of it.

♥ Open Source

Natural Language Processing and Natural Language Understanding

kinow @ Jun 03, 2017 12:30:39

I have used Natural Language Processing (NLP) tools in a few projects over the past years. But only recently, while working on a chatbot project, did I notice the term Natural Language Understanding (NLU).

NLU can be seen as a subfield of NLP. NLP encompasses all techniques used for parsing text and extracting some knowledge from it. That could be finding the common entities in the text, calculating the likelihood of a certain word being a preposition or a verb, sentiment analysis, or even spell checking sentences.

NLP, NLU, and NLG — source http://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf

When you are processing a language, it does not mean you are understanding the language. You may simply have rules, statistics, or some heuristics built to extract the information you need for some work.

NLU, on the other hand, implies that a system will try to understand the text. A text such as “Given the price of mandarins lately, how likely is it to increase as much as the bananas?” could be parsed with NLP tools such as OpenNLP, CoreNLP, spaCy, etc.

However, you would then have annotated text, entities, sentiment, perhaps coreference, but you still would not be able to build a machine that simply understands it. You can, with some effort, build rule-based systems, apply semantics and ontologies, or use information retrieval techniques.

But it is still extremely hard to build a system that understands all the ways the same question could be phrased, such as “are mandarins getting as expensive as bananas?”, or that takes into account context such as location or time, as in “were mandarins getting as expensive as bananas around the Pacific Islands five years ago?”.

There are other subfields in NLP, such as Natural Language Generation (NLG). Techniques from both NLU and NLG are used in chatbots, to parse the language and also to generate text based on the conversation.

NLP is an extremely interesting topic. 2001: A Space Odyssey, Star Trek, the Foundation series, and other science fiction books and movies all have intelligent computer systems that interact with humans. But if we tried to re-create these right now, even with all the work on machine learning, we would still get very funny (and perhaps dangerous) results.

♥ NLP

Using Google Natural Language API in an AWS Elastic Beanstalk application

kinow @ Mar 15, 2017 13:18:03

Besides providing an API for users and developers, Google also provides a series of (very well-written) client libraries in several programming languages. That’s the case for the Google Storage, Google Vision, and Google Natural Language APIs.

I recently had to use the latter in a POC at work, and ran into an interesting issue with our AWS Elastic Beanstalk environment.

Google Language API requires that clients authenticate before sending requests. If you use their google-cloud-java library, you can define the GOOGLE_APPLICATION_CREDENTIALS environment variable, which must point to a JSON file with the Google Language API credentials.

The issue is that while Google Language API (or more precisely google-auth-library-java) looks for an environment variable, in Elastic Beanstalk you are able to specify only system properties (unless you want to try something with ebextensions, maybe some JNI…).

A workaround for this issue in the Google Natural Language API is to create a LanguageServiceSettings and pass it to your LanguageServiceClient. This settings object, when created, must be given a channel provider with a FixedCredentialsProvider.

Of course reading code is way easier than reading this workaround description.

// File: GoogleNaturalLanguageService.java
// ...
    // envvar or property used to specify the Google Application Credentials
    private final static String GOOGLE_APPLICATION_CREDENTIALS = "GOOGLE_APPLICATION_CREDENTIALS";

    /**
     * Google Natural Language API.
     */
    private LanguageServiceClient languageServiceClient;

    @PostConstruct
    public void init() throws Exception {
        // Elastic Beanstalk supports Properties, not Environment Variables.
        // Google credentials library will load
        // the JSON location for the service to authenticate from an envVar. So
        // we need to fix that here.
        String googleApplicationCredentials = System.getenv(GOOGLE_APPLICATION_CREDENTIALS);
        LOGGER.info(String.format("GOOGLE_APPLICATION_CREDENTIALS in environment variable: %s", googleApplicationCredentials));
        if (StringUtils.isBlank(googleApplicationCredentials)) {
            googleApplicationCredentials = System.getProperty(GOOGLE_APPLICATION_CREDENTIALS);
            LOGGER.info(String.format("GOOGLE_APPLICATION_CREDENTIALS in JVM property: %s", googleApplicationCredentials));
        }

        if (StringUtils.isBlank(googleApplicationCredentials)) {
            throw new RuntimeException("Could not locate GOOGLE_APPLICATION_CREDENTIALS variable!");
        }

        final LanguageServiceSettings languageServiceSettings;
        try (InputStream is = new FileInputStream(new File(googleApplicationCredentials))) {
            final GoogleCredentials myCredentials = GoogleCredentials
                    .fromStream(is)
                    .createScoped(
                            Collections.singleton("https://www.googleapis.com/auth/cloud-platform")
                    );
            final CredentialsProvider credentialsProvider = FixedCredentialsProvider.create(myCredentials);

            final InstantiatingChannelProvider channelProvider = LanguageServiceSettings
                    .defaultChannelProviderBuilder()
                    .setCredentialsProvider(credentialsProvider)
                    .build();
            languageServiceSettings = LanguageServiceSettings
                    .defaultBuilder()
                    .setChannelProvider(channelProvider)
                    .build();
        } catch (IOException ioe) {
            LOGGER.error(String.format("IO error creating Google NLP settings: %s", ioe.getMessage()), ioe);
            throw ioe;
        }

        // Create Google API client
        this.languageServiceClient = LanguageServiceClient.create(languageServiceSettings);
    }

    @PreDestroy
    public void destroy() throws Exception {
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug("Destroying Google NLP API client");
        }
        // Close Google API executors and channels
        this.languageServiceClient.close();
    }
// ...

That way you should be able to use the API with AWS Elastic Beanstalk without having to hack your environment to provide the GOOGLE_APPLICATION_CREDENTIALS environment variable.

An alternative would be for google-auth-library-java to look for both an environment variable and a system property. Or maybe AWS could add a way to provide environment variables.

Note also that I included the @PostConstruct and @PreDestroy annotated methods. The API client starts an executor thread pool, so if you do not want to risk problems when re-deploying your application, remember to close the client.

♥ Open Source

Proposed logos for OpenNLP

kinow @ Jan 15, 2017 20:42:03

A couple of logos submitted to OPENNLP-6. Made with Inkscape, fonts from Google Fonts.

Proposed logos for OpenNLP

For more, check out my DeviantArt page.