Blog Taxonomy

Posts tagged with 'opensource'

Enabling Markdown Extension Tables For Piecrust

kinow @ Sep 09, 2017 20:35:01

PieCrust is a static site generator written in Python. It allows users to write content in Markdown. But if you try adding a table, by default the content will be rendered as plain text.

You have to enable the Markdown tables extension; PieCrust will load it when creating the Markdown instance.

# config.yml
markdown:
  extensions:
    - tables

Et voilà! Happy blogging!

♥ Open Source

Finding Base64 implementations in Apache Software Foundation projects

kinow @ Sep 01, 2017 20:23:03

New Zealand Grey Warbler (riroriro)

Some time ago, while working on one of the many projects at the Apache Software Foundation (Apache Commons FileUpload, if I remember correctly), I noticed that it had its own Base64 implementation. What caught my attention was that the project was not using the Apache Commons Codec Base64 implementation.

While Apache Commons’ mission is to create components that can be re-used across ASF projects, and also by projects not necessarily under the ASF, it is understandable that some projects prefer to keep their dependencies to a minimum. Carefully managing your dependencies is normally good software engineering practice.

But would Apache Commons FileUpload be the only project in the ASF with its own Base64 implementation?

What is Base64?

Simply put, Base64 is a way to encode bytes as text. It splits the binary input into 6-bit groups, and each group’s value selects an entry in the table used by the Base64 implementation. There are several Base64 variants, though some are now obsolete.

The input text “this is base64!” encodes to “dGhpcyBpcyBiYXNlNjQh”, and decoding that string yields the original input. An image can also be encoded, or a ZIP file. This is helpful for data transfer and storage.
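That round trip can be checked with, for example, Python’s standard base64 module (used here only as an illustration; the implementations discussed below are Java):

```python
import base64

# Encode the sample text, then decode it back to the original bytes.
encoded = base64.b64encode(b"this is base64!").decode("ascii")
print(encoded)  # dGhpcyBpcyBiYXNlNjQh

decoded = base64.b64decode(encoded)
print(decoded.decode("ascii"))  # this is base64!
```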

Apache Commons Codec is well known for providing a Base64 implementation, used in several projects, both open source and in industry. Its implementation is based on RFC-2045.

Java 8 ships a Base64 implementation, so it may very well replace the use of Apache Commons Codec in some projects, though that may take some time. The Java 8 implementation supports RFC-2045 and RFC-4648, including the URL-safe and MIME formats.
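The difference between the standard and URL-safe alphabets is small: the characters ‘+’ and ‘/’ become ‘-’ and ‘_’. A quick illustration, again with Python’s base64 module rather than the Java 8 API, using input bytes chosen to hit those alphabet entries:

```python
import base64

data = bytes([0xFB, 0xFF])  # encodes to indices 62 and 63 of the alphabet
print(base64.b64encode(data))          # b'+/8='
print(base64.urlsafe_b64encode(data))  # b'-_8='
```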

Searching for other Base64 implementations

Using GitHub search, I looked for other Base64 implementations in ASF projects. Here is the resulting table, listing only the custom implementations found after going through some 15 of the more than 100 pages of hits for “base64”.

Project (link to implementation) | JVM | Base64 implementation
Apache ActiveMQ Artemis | 8 | RFC-3548, based on http://iharder.net/base64
Apache AsterixDB Hyracks (Incubator) | 8 | ?
Apache Calcite Avatica | 7 | RFC-3548, based on http://iharder.net/base64
Apache Cayenne | 8 | RFC-2045 (based on codec)
Apache Chemistry | 7 | RFC-3548, based on http://iharder.net/base64
Apache Commons FileUpload | 6 | ?
Apache Commons Net | 6 | RFC-2045 (copy of codec?)
Apache Directory Kerby | 7 | RFC-2045 (copy of codec?)
Apache Felix | 5 (?) | RFC-2045 (copy of codec?)
Apache HBase | 8 | RFC-3548, based on http://iharder.net/base64
Apache Jackrabbit | 8 (?) | ?
Apache James | 6 | RFC-2045 via javax.mail.internet.MimeUtility
Apache James Mime4J | 5 | RFC-2045 (based on codec)
Apache OFBiz | 8 | RFC-2045
Apache Pivot | 6 | RFC-2045
Apache Qpid | 8 | ? uses javax.xml.bind.DatatypeConverter#parseBase64Binary()
Apache Shiro | 6 | RFC-2045 (based on commons)
Apache Tomcat | 8 | RFC-2045 (copy of codec?)
Apache TomEE | 7 | RFC-2045
Apache TomEE (Site-NG) | 6 | RFC-2045
Apache Trafodion (Incubator) | 7 | RFC-3548, based on http://iharder.net/base64
Apache Wave (Incubator) | 7 | RFC-3548 (?), based on http://iharder.net/base64

Notes and conclusions

  • Projects using Java 8 can likely remove their own implementations in favour of the new JDK 8 implementation.
  • Some projects were already using Java 8 Base64 implementation.
  • Some projects were using Apache Commons Codec.
  • Some projects were using the Base64 implementation from http://iharder.net/base64, which claims to be very fast. It could be interesting to investigate it further. For projects that use Base64 heavily, this version could bring a significant performance increase.
  • Even though most of these projects are not using Apache Commons Codec, some have either copied or based their implementations on Apache Commons Codec. Perhaps shading would be more effective? Or maybe adding it as a dependency…
  • I guess these Base64 implementations could be hidden from external users with protected or private scope, as they are probably not part of the public API of projects such as Apache Commons Net or Apache Cayenne. This will eventually be solved by the module system after Java 9…
  • Some implementations do not make it clear which RFC or standard they follow. Some diverged from the reference work (e.g. Apache Wave (Incubator) modified the iharder implementation, removing features…).
  • Some of the projects with many dependencies, like Cayenne and Pivot, may even have Commons Codec on the class path as a transitive dependency. If so, it could be interesting to add it as a direct dependency and remove their own implementations, or simply use Java 8’s.
  • Some implementations, like HBase’s and Trafodion’s, were not using the latest version from http://iharder.net/base64; several issues with invalid inputs were fixed between versions 2.2.1 and 2.3.7.

Future work

  • I will try to investigate which of the projects that have a custom Base64 implementation and are using Java 8 can be updated to throw away their own versions (probably when I have some spare time!).
  • The implementation from http://iharder.net/base64 promises to be very fast, and Apache ActiveMQ Artemis adopted it. We could consider adding a similarly fast version to Apache Commons Codec. Speed could be a reason for a project to keep its own Base64 implementation.
  • Java 8’s Base64 provides the basic, MIME, and URL formats for encoding and decoding. Perhaps we could add more formats to Apache Commons Codec too, even custom formats. This could be a reason for keeping Apache Commons Codec’s implementation.

Happy encoding!

♥ Open Source

Two other Maven Plug-ins: impsort and deptools

kinow @ Aug 12, 2017 22:16:03

Last week I wrote about the ImmobilienScout24/illegal-transitive-dependency-check rule for the Maven Enforcer Plug-in. There are two other Maven Plug-ins that can be useful.

mbknor/deptools

mbknor/deptools is a Maven plug-in that scans your project’s dependency tree looking for transitive dependencies. Whenever it finds a transitive dependency, it keeps track of the versions. And if, because of the way your dependencies and transitive dependencies are organised, you end up with a version that is not the newest, the build will fail.

So, for example, suppose commons-lang3 is a transitive dependency of two of your other dependencies, one pulling in version 3.4 and the other 3.5. If for any reason you end up using 3.4 instead of 3.5, you will get a build error.
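A hypothetical dependency section that could set up that conflict (the com.example artifacts are made up for the illustration; each is assumed to pull in commons-lang3 transitively at a different version, and Maven’s “nearest wins” resolution may pick the older one):

```xml
<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>dependency-a</artifactId> <!-- depends on commons-lang3 3.4 -->
    <version>1.0</version>
  </dependency>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>dependency-b</artifactId> <!-- depends on commons-lang3 3.5 -->
    <version>1.0</version>
  </dependency>
</dependencies>
```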

Here’s an example of the plug-in configuration.

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>deptools.plugin</groupId>
        <artifactId>maven-deptools-plugin</artifactId>
        <version>1.3</version>
        <executions>
          <execution>
            <phase>compile</phase>
            <goals>
              <goal>version-checker</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <pluginRepositories>
    <pluginRepository>
      <id>mbk_mvn_repo</id>
      <name>mbk_mvn_repo</name>
      <url>https://raw.githubusercontent.com/mbknor/mbknor.github.com/master/m2repo/releases</url>
    </pluginRepository>
  </pluginRepositories>
  ...
</project>

Running mvn clean verify will execute the deptools version-checker goal (bound to the compile phase in the configuration above), which performs the check. As you may have noticed, you also need to add the plug-in repository hosted on GitHub, as the plug-in is not released to Maven Central.

I do not use it for this reason, and also because I normally spend some time looking at the dependency tree anyway; but every now and then, when I start working on a new project, I like to run it quickly just to see which dependencies are being shadowed by older versions.

revelc/impsort-maven-plugin

I only found out about this plug-in in the last Apache News Round-up, where it was mentioned that Apache Accumulo uses it to standardise the order of imports in code.

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>net.revelc.code</groupId>
        <artifactId>impsort-maven-plugin</artifactId>
        <version>1.0.0</version>
        <configuration>
          <groups>java.,javax.,org.,com.</groups>
          <staticGroups>java,*</staticGroups>
          <excludes>
            <exclude>**/thrift/*.java</exclude>
          </excludes>
        </configuration>
        <executions>
          <execution>
            <id>sort-imports</id>
            <goals>
              <goal>sort</goal><!-- runs at process-sources phase by default -->
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  ...
</project>

I am neutral on import order, though in projects with several contributors submitting pull requests it can be useful for reducing the number of review iterations. In other words, if a user submits a pull request and you have an automated check, the user is automatically notified about the changes they need to make for their pull request to be accepted.

Conclusion

I am not using either of these two plug-ins, but I wanted to save them somewhere in case I need them in the future, and also to share them with others. Besides the most common plug-ins (PMD, CheckStyle, FindBugs), I normally use at least some Maven Enforcer Plug-in rules, and the OWASP plug-in.

While we spend most of our time writing code, preparing infrastructure, deploying, and testing, I have been bitten by Maven build bugs a few times and had to spend days or weeks debugging some of them. So I hope these posts save someone out there a few hours in a similar situation.

♥ Open Source

Checking for transitive dependencies use with Maven Enforcer Plug-in

kinow @ Aug 06, 2017 17:35:39

Maven Enforcer Plug-in “provides goals to control certain environmental constraints such as Maven version, JDK version and OS family along with many more built-in rules and user created rules”. There are several libraries that provide custom rules, or you can write your own.

One of these libraries is ImmobilienScout24/illegal-transitive-dependency-check, “an additional rule for the maven-enforcer-plugin that checks for classes referenced via transitive Maven dependencies”.

With the following example:

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-enforcer-plugin</artifactId>
        <version>1.3.1</version>
        <dependencies>
          <dependency>
            <groupId>de.is24.maven.enforcer.rules</groupId>
            <artifactId>illegal-transitive-dependency-check</artifactId>
            <version>1.7.4</version>
          </dependency>
        </dependencies>
        <executions>
          <execution>
            <id>enforce</id>
            <phase>verify</phase>
            <goals>
              <goal>enforce</goal>
            </goals>
            <configuration>
              <rules>
                <illegalTransitiveDependencyCheck implementation="de.is24.maven.enforcer.rules.IllegalTransitiveDependencyCheck">
                  <reportOnly>false</reportOnly>
                  <useClassesFromLastBuild>true</useClassesFromLastBuild>
                  <suppressTypesFromJavaRuntime>true</suppressTypesFromJavaRuntime>
                  <listMissingArtifacts>false</listMissingArtifacts>
                </illegalTransitiveDependencyCheck>
              </rules>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  ...
</project>

Running mvn clean verify will execute the Maven Enforcer Plug-in enforce goal, which will call the illegal transitive dependency check.

And the build will fail if your code is using (i.e. importing) any class that is not available in your first-level dependencies. For example, if in your pom.xml you added commons-lang3 and commons-configuration, the latter of which includes commons-lang 2.x, and you used org.apache.commons.lang.StringUtils instead of org.apache.commons.lang3.StringUtils, the build would fail.
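The dependencies section for that scenario might look like this (the versions are illustrative): commons-lang3 is declared directly, while commons-lang 2.x arrives only transitively via commons-configuration, so importing org.apache.commons.lang.StringUtils would trip the rule.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
  </dependency>
  <dependency>
    <groupId>commons-configuration</groupId>
    <artifactId>commons-configuration</artifactId>
    <version>1.10</version> <!-- pulls in commons-lang 2.x transitively -->
  </dependency>
</dependencies>
```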

In order to fix the build, you have to either add the transitive dependency to your pom.xml file, or correct your import statements. This is especially useful to prevent future issues caused by other dependencies being added or updated and changing the version of the transitive dependency.

Bonus points if you combine that with continuous integration and some service like Travis-CI.

♥ Open Source

How to remove the signature from e-mails with NLP?

kinow @ Jun 14, 2017 13:59:33

Some time ago I stumbled across EmailParser, a Python utility to remove e-mail signatures. Here’s a sample input e-mail from the project documentation.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

Best,
Joe Smith

Joe Smith | Strategy & Business Development
111 Market St. Suite 111| San Francisco, CA 94103
M: 111.111.1111| joe@foobar.com

And here’s what it looks like afterwards.

Wendy – thanks for the intro! Moving you to bcc.

Hi Vincent – nice to meet you over email. Apologize for the late reply, I was on PTO for a couple weeks and this is my first week back in office. As Wendy mentioned, I am leading an AR/VR taskforce at Foobar Retail Solutions. The goal of the taskforce is to better understand how AR/VR can apply to retail/commerce and if/what is the role of a shopping center in AR/VR applications for retail.

Wendy mentioned that you would be a great person to speak to since you are close to what is going on in this space. Would love to set up some time to chat via phone next week. What does your availability look like on Monday or Wednesday?

As you can see, it removed all the lines after the main part of the message (i.e. after the three paragraphs). Here’s what the Python code looks like.

>>> from Parser import read_email, strip, prob_block, generate_text
>>> from spacy.en import English

>>> pos = English()  # part-of-speech tagger
>>> msg_raw = read_email('emails/test1.txt')
>>> msg_stripped = strip(msg_raw)  # preprocessing text before POS tagging

>>> # iterate through blocks, write to file if not a signature block
>>> generate_text(msg_stripped, .9, pos, 'emails/test1_clean.txt')

What got me interested in this utility was the use of NLP. I couldn’t imagine how someone could use NLP for that. And I liked the simplicity of the approach, which is not perfect, but can be useful someday.

After the imports, the code creates a part-of-speech tagger using the spaCy NLP library, reads the e-mail from a file, then strips it and creates an array with each paragraph of the message.

The magic happens in the generate_text function, which receives the array of paragraphs, a threshold, the POS tagger, and the output destination. Here’s what the function does.

for each message
    if probability ( signature block | message ) < threshold
        write to output file

And the formula for calculating the probability is quite simple too.

1. For a given paragraph (message block), find all the sentences in it.
2. Then for each word (token) in the sentence, count the number of times a non-verb appears.
3. Return the proportion of non-verbs per sentence, i.e. number of non-verbs / number of sentences.

In summary, it discards blocks that do not contain enough verbs to be considered a message block, treating them as signature blocks instead.
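The scoring step can be sketched roughly as follows. This is a simplified stand-in for the project’s prob_block, working on pre-tagged (token, part-of-speech) pairs instead of invoking spaCy; the function name and tags here are illustrative, not the project’s actual API.

```python
def non_verbs_per_sentence(tagged_sentences):
    """Count the non-verb tokens across a block, divided by its sentence count."""
    non_verbs = sum(1 for sentence in tagged_sentences
                    for _token, tag in sentence if tag != "VERB")
    return non_verbs / len(tagged_sentences)

# A body paragraph tends to contain verbs...
body = [[("Would", "VERB"), ("love", "VERB"), ("to", "PART"),
         ("chat", "VERB"), ("next", "ADJ"), ("week", "NOUN")]]
# ...while a signature block is mostly names, numbers, and punctuation.
signature = [[("Joe", "PROPN"), ("Smith", "PROPN"), ("|", "PUNCT"),
              ("Strategy", "NOUN"), ("&", "CCONJ"), ("Development", "NOUN")]]

print(non_verbs_per_sentence(body))       # 3.0
print(non_verbs_per_sentence(signature))  # 6.0
```

Blocks whose score crosses the threshold are treated as signatures and dropped from the output.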

Never thought about using an approach like this. It may definitely be helpful when doing data analysis, information retrieval, or scraping data from the web. Not necessarily with e-mails and signatures, but you got the gist of it.

♥ Open Source