
Posts tagged with ‘python’

Enabling Markdown Extension Tables For PieCrust

kinow @ Sep 09, 2017 20:35:01

PieCrust is a static site generator written in Python. It lets users write content in Markdown, but if you try adding a table, by default it will be rendered as plain text.

You have to enable the Markdown tables extension; PieCrust will load it when creating the Markdown instance.

# config.yml
markdown:
  extensions:
    - tables
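
With the extension enabled, a table written in the standard pipe syntax will be rendered as an HTML table instead of plain text. For example:

First Header  | Second Header
------------- | -------------
Content cell  | Content cell
Content cell  | Content cell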

Et voilà! Happy blogging!

♥ Open Source

How to remove the signature from e-mails with NLP?

kinow @ Jun 14, 2017 13:59:33

Some time ago I stumbled across EmailParser, a Python utility to remove e-mail signatures. Here’s a sample input e-mail from the project documentation.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

Best,
Joe Smith

Joe Smith | Strategy & Business Development
111 Market St. Suite 111| San Francisco, CA 94103
M: 111.111.1111| joe@foobar.com

And here’s what it looks like afterwards.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

As you can see, it removed all the lines after the main part of the message (i.e. after the three paragraphs). Here’s what the Python code looks like.

>>> from Parser import read_email, strip, generate_text
>>> from spacy.en import English

>>> pos_tagger = English()  # part-of-speech tagger
>>> msg_raw = read_email('emails/test1.txt')
>>> msg_stripped = strip(msg_raw)  # preprocess the text before POS tagging

# iterate through blocks, write to file if not a signature block
>>> generate_text(msg_stripped, .9, pos_tagger, 'emails/test1_clean.txt')

What got me interested in this utility was the use of NLP; I couldn't have imagined how someone could use NLP for that. And I liked the simplicity of the approach, which is not perfect, but may come in handy someday.

After the imports, the code creates a part-of-speech tagger using the spaCy NLP library, reads the e-mail from a file, and strips it, producing an array with each paragraph of the message.

The magic happens in the generate_text function, which receives the array of paragraphs, a threshold, the POS tagger, and the output destination. Here’s what the function does.

for each message block
    if probability(signature block | message block) < threshold
        write block to output file

And the formula for calculating the probability is quite simple too.

1. For a given paragraph (message block), find all the sentences in it.
2. Then, for each word (token) in those sentences, count how many are not verbs.
3. Return the proportion of non-verbs per sentence, i.e. number of non-verbs / number of sentences.

In summary, blocks that do not contain enough verbs to be considered message blocks are treated as signature blocks and discarded.
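
Here is a minimal sketch of that heuristic, written against the modern spaCy API rather than the old spacy.en module; the function name and model choice are mine for illustration, not the project's.

import spacy

def prob_signature_block(paragraph, nlp):
    """Implement the formula above: number of non-verb tokens
    divided by number of sentences in the paragraph."""
    doc = nlp(paragraph)
    sentences = list(doc.sents)
    if not sentences:
        return 1.0  # nothing worth keeping in an empty block
    non_verbs = sum(1 for token in doc if token.pos_ != 'VERB')
    return non_verbs / len(sentences)

# the small English pipeline provides POS tags and sentence boundaries
nlp = spacy.load('en_core_web_sm')
score = prob_signature_block('Joe Smith | Strategy & Business Development', nlp)

Blocks whose score crosses the threshold are then dropped instead of being written to the output file.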

I had never thought about using an approach like this. It may well be helpful when doing data analysis, information retrieval, or scraping data from the web; not necessarily with e-mails and signatures, but you get the gist of it.

♥ Open Source

Writing a binary parser in Python: NumPy vs. Construct

kinow @ Apr 14, 2017 19:21:03

Some time ago I worked with researchers to write a parser for an old data format. The data was generated by a device (a radiosonde) using the vendor's (Vaisala) proprietary binary format.

One of the researchers told me someone had written a parser for his work, and shared it on GitHub. To be honest, that was my first time parsing binary data with Python. I had done it before with C, C++, Perl, and Java, but never with Python.

The code on GitHub used NumPy and looked similar to this one.

import numpy as np

parse_header = np.dtype([('field_a', 'b1'), ('field_b', 'b1', (17,))])

with open('input.dat', 'rb') as f:
    header = np.fromfile(f, dtype=parse_header, count=1)
    # ...
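
Once parsed, each field of the structured array can be accessed by name, e.g. header['field_a'].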

And it indeed worked fine. But in the end, after contacting the author and letting him know what I was about to do, I used the code as a reference, together with an old specification document for the format, and created a parser with Construct.

From Construct’s website:

Construct is a powerful declarative parser (and builder) for binary data.

This is what the code with Construct looked like.

from construct import *

parse_header = Struct("parse_header",
    Enum(Byte("file_ready"),
        READY = 1,
        NOT_READY = 0,
        _default_ = "UNKNOWN",
    ),
    Bytes("reserved", 17)
)

# ...

parse_contents = Struct("parse_contents",
    parse_header,
    Range(mincount=1, maxcount=5, subcon=pre_data),
    OptionalGreedyRange(detailed_data)
)

with open('input.dat', 'rb') as f:
    parse_results = parse_contents.parse_stream(f)
    # ...

Writing the parser with NumPy or Construct would achieve the same result. In the end it came down to personal preference, and my point of view as a Software Engineer. This is NumPy's description:

NumPy is the fundamental package for scientific computing with Python.

NumPy is a project tailored for scientific computing, with a focus on linear algebra, N-dimensional arrays, and so on. While it contains code that can parse binary data, the footprint it adds to a project that includes it as a dependency is quite big.

The parser written with NumPy wasn't using 5% of the NumPy code base, probably less than 1%. Yet updates to NumPy could break the application, even if the update was motivated by something unrelated, such as a new matrix operation or a change in an external dependency.

In Java something similar happens with Google Guava. While I use it sometimes, most of the time I find myself using one of the Apache Commons libraries, or another dependency with just what I need, to avoid including unnecessary code in my application.

If you prefer to use NumPy, that's fine too :-) I just happened to have enough time to rewrite it (it took a couple of hours). In other cases it may still make sense to use another tool or library, even if it was not made specifically for the job ¯\_(ツ)_/¯

♥ Open Source

Using Requests Session Objects for web scraping

kinow @ Nov 25, 2016 01:24:03

Some months ago I had to write a Python script to retrieve solar energy data from a web site. It was basically a handful of HTTP calls, parsing the responses (mostly JSON), and storing the results as JSON and CSV for later processing.

Since it was such a small task, I used the Requests module instead of a complete web scraper. The HTTP requests had to be made with a timeout, and also had to pass certain headers. I started customizing each call, until I learned about Requests Session Objects.

You create a session, much as in an ORM/JPA, which defines a context with certain properties and controls cross-cutting behavior.

import requests

TIMEOUT = 30  # seconds; avoids hanging forever on a slow server

def get(session, url):
    """Utility method to HTTP GET a URL"""
    response = session.get(url, timeout=TIMEOUT)
    return response

def post(session, url, data):
    """Utility method to HTTP POST a URL with data parameters"""
    response = session.post(url, data=data, timeout=TIMEOUT)
    if response.status_code != requests.codes.ok:
        response.raise_for_status()
    return response

user, password = 'myuser', 'mypassword'  # placeholder credentials

with requests.Session() as s:
    # the x-test header will be sent with every request
    s.headers.update({'x-test': 'true'})

    data = {'user': user, 'password': password, 'rememberme': 'false'}
    r = post(s, 'https://portal.login.url', data)

    r = get(s, 'https://portal.home.url')

Besides giving you the ability to add headers to all requests, the session object means you won't have to worry about redirects: the library follows them by default, with a limit of up to 30.
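
If you ever need to change that behavior, the session exposes the limit directly, and an individual request can opt out of redirect handling entirely. A small sketch (the URL is a placeholder):

import requests

with requests.Session() as s:
    s.max_redirects = 5  # lower the default limit of 30

    # or skip redirect handling for a single request
    r = s.get('https://portal.home.url', allow_redirects=False)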

Happy hacking!

Using the AWS API with Python

kinow @ Oct 04, 2016 21:15:03

Amazon Web Services provides a series of cloud services. When you access the web interface, most - if not all - of the actions you do there are actually translated into API calls.

They also provide SDKs in several programming languages. With these SDKs you are able to call the same API used by the web interface. Python has boto (or boto3), which lets you automate several tasks in AWS.

But before you start using the API, you will need to set up your access key.

It is likely that with time you will have different roles, each with different permissions. You have to configure your local environment so that you can use either the command-line utility (installed via pip install awscli) or boto.

The awscli package is handy alongside boto3. After you install it, you need to run aws configure, which will create the ~/.aws/config and ~/.aws/credentials files. You can tweak these files to support multiple roles.

I followed the tutorials but ran into all sorts of issues. After debugging some locally installed dependencies, especially the awscli files, I found that the following settings work for my environment.

# File: config
[profile default]
region = ap-southeast-2

[profile profile1]
region = ap-southeast-2
source_profile = default
role_arn = arn:aws:iam::123:role/Developer

[profile profile2]
region = ap-southeast-2
source_profile = default
role_arn = arn:aws:iam::456:role/Sysops
mfa_serial = arn:aws:iam::789:mfa/user@domain.blabla

and

# File: credentials
[default]
aws_access_key_id = YOUR_KEY_ID
aws_secret_access_key = YOUR_SECRET
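
With those files in place, you can select a role per command with the --profile flag (e.g. aws s3 ls --profile profile1), or export the AWS_PROFILE environment variable to set a default for your shell session.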

Once that is done, you can confirm it is working with some S3 calls in Python, for example:

#!/usr/bin/env python3

import os
import boto3

session = boto3.Session(profile_name='profile2')  # the role defined in ~/.aws/config
s3 = session.resource('s3')

found = False

name = 'mysuperduperbucket'

for bucket in s3.buckets.all():
    if bucket.name == name:
        found = True

if not found:
    print("Creating bucket...")
    s3.create_bucket(Bucket=name)

file_location = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'samplefile.txt')
s3.meta.client.upload_file(Filename=file_location, Bucket=name, Key='book.txt')

The AWS profile in this example uses MFA (multi-factor authentication) too, so the first time you run this code you may be asked for a token, which will be cached for a short time.

That’s it for today.

Happy hacking!