A real life example that shows the power of the JVM based Python implementation
Posted by Agustín Bartó 2 years, 11 months ago
Jython is an implementation of Python that runs on top of the Java Virtual Machine. Why is it different? Why should I care about it? This blog post tries to answer those questions with a real life example.
I had the privilege of working in Java for almost 15 years before I jumped on the Python bandwagon, so for me the value of Jython is pretty obvious. This might not be the case for you if you’ve never worked with either language, so let me tell you (and show you) what makes Jython awesome and useful.
According to the Jython site, these are the features that make Jython stand out from other JVM based languages:
- Dynamic compilation to Java bytecodes - leads to highest possible performance without sacrificing interactivity.
- Ability to extend existing Java classes in Jython - allows effective use of abstract classes.
- Optional static compilation - allows creation of applets, servlets, beans, ...
- Bean Properties - make use of Java packages much easier.
- Python Language - combines remarkable power with very clear syntax. It also supports a full object-oriented programming model which makes it a natural fit for Java’s OO design.
I think the first, second and fifth bullets require special attention.
For some reason, a lot of people believe that the JVM is slow. This might have been true in the first years of the platform, but the JVM’s performance has improved a lot since then. A lot has been written on this subject but the following Wikipedia article summarizes the situation pretty well.
As mentioned above, it is possible to use Java classes in Jython. Although this statement is true, it fails to convey what I think is the most important aspect of Jython: there are A LOT of high-quality, mature Java libraries out there. The possibility of mixing all these libraries with the flexibility and richness of Python is invaluable. Let me give you a taste of this power.
Until the introduction of the new Date and Time API in Java 8, the only way to handle time properly in Java was to use Joda-Time. Joda-Time is an incredibly powerful and flexible library for handling dates and times in Java (or any JVM language for that matter). Although there are similar libraries in Python, I still haven’t come across one that can give Joda-Time a run for its money. The following shows a Jython shell session using Joda-Time:
    Jython 2.7b2 (default:a5bc0032cf79+, Apr 22 2014, 21:20:17)
    [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_05
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from java.util import Locale
    >>> from org.joda.time import DateTime, Duration, Period
    >>> date_time = DateTime()
    >>> date_time
    2014-07-14T20:06:11.074-03:00
    >>> date_time.getMonthOfYear()
    7
    >>> date_time.withYear(2000)
    2000-07-14T20:06:11.074-03:00
    >>> date_time.monthOfYear().getAsText()
    u'July'
    >>> date_time.monthOfYear().getAsShortText(Locale.FRENCH)
    u'juil.'
    >>> date_time.dayOfMonth().roundFloorCopy()
    2014-07-14T00:00:00.000-03:00
    >>> date_time.plus(Period.days(1))
    2014-07-15T20:06:11.074-03:00
    >>> date_time.plus(Duration(24L*60L*60L*1000L))
    2014-07-15T20:06:11.074-03:00
This was just a quick example of the simplest features of Joda-Time. Most of them are also available in python-dateutil (with the exception of unusual chronologies), so date handling alone is not the whole argument. There are other popular Java libraries without a Python counterpart (I’ll show you one in the next section).
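For comparison, here is a minimal sketch of a few of the same operations using only the standard `datetime` module. It assumes CPython 3.6 or later rather than Jython, and it hard-codes the instant from the session above (with a fixed UTC-03:00 offset) so the results are reproducible. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed instant copied from the Joda-Time session (assumed offset: UTC-03:00).
tz = timezone(timedelta(hours=-3))
date_time = datetime(2014, 7, 14, 20, 6, 11, 74000, tzinfo=tz)

# Rough counterpart of getMonthOfYear()
month_of_year = date_time.month

# Rough counterpart of withYear(2000)
in_year_2000 = date_time.replace(year=2000)

# Rough counterpart of monthOfYear().getAsText(); the result depends on
# the process locale, unlike Joda-Time's explicit Locale argument.
month_name = date_time.strftime("%B")

# Rough counterpart of dayOfMonth().roundFloorCopy()
start_of_day = date_time.replace(hour=0, minute=0, second=0, microsecond=0)

# Rough counterpart of plus(Period.days(1)); with a fixed offset there is
# no daylight-saving transition to account for.
next_day = date_time + timedelta(days=1)

for value in (in_year_2000, start_of_day, next_day):
    print(value.isoformat(timespec="milliseconds"))
```

Note that the standard library has no direct equivalent of passing a `Locale` per call, which is one place where Joda-Time's API is more convenient.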
As I mentioned before, I switched to Python recently. There was a lot involved in that decision, but the language itself played a major role. The possibility of combining this fantastic language with the power of the JVM and all the Java libraries and tools readily available is an interesting proposition.
Let me show you a real life example that I think summarizes perfectly why Jython matters.
Redacting names on comments
Not too long ago, we had to redact names from comments coming from social media sites. Our first idea was to use NLTK’s NERTagger. This class depends on the Stanford Named Entity Recognizer (NER), which is a Java library. The integration is done by invoking a java shell command and parsing its output. Not only is this far from ideal, it can also create problems if your data isn’t a single large piece of text (which is our case).
This limitation is not caused by the NER API but by the way NLTK interacts with it. Wouldn’t it be nice if we could just write Python code that uses this API? Let’s do just that.
We cannot show you the data we had to work with, but I wrote an IPython Notebook that generates fake comments and saves them to a CSV file so our script can work with them.
After the comments have been read, all we need to do is have the classifier tag the tokens, so we can redact the person names from the comments:
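    from edu.stanford.nlp.ie.crf import CRFClassifier
    from edu.stanford.nlp.ling.CoreAnnotations import AnswerAnnotation

    # Load the pre-trained 3-class model shipped with Stanford NER
    classifier = CRFClassifier.getClassifierNoExceptions(
        'stanford-ner-2014-01-04/classifiers/english.all.3class.distsim.crf.ser.gz'
    )

    # dict_reader is set up earlier in the script (not shown in this excerpt)
    for row in dict_reader:
        redacted_text = row['text']
        # classify() returns a list of sentences, each a list of tagged tokens
        classify_result = classifier.classify(row['text'])
        for sentence in classify_result:
            for word in sentence:
                token = word.originalText()
                tag = word.get(AnswerAnnotation)
                if tag == 'PERSON':
                    redacted_text = redacted_text.replace(token, '****')
        row['redacted_text'] = redacted_text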
This is an excerpt from a Python script, available on GitHub, that redacts names from text coming from a CSV file. To run it we need a JRE, a Jython 2.7 distribution and the Stanford NER jars. With those in place, we run the following from the command line:
java -Dpython.path=stanford-ner-2014-01-04/stanford-ner.jar -jar jython-standalone-2.7-b2.jar redact_name_entities.py comments_df.csv comments_df_redacted.csv
Although we cannot run the code directly from Python (CPython, that is), we didn’t need to write a single line of Java to get access to the full power of the Stanford NER API.
I hope by now you have an idea of just how important Jython is. It has some limitations, like the inability to integrate modules written in C and the fact that it is only compatible with Python 2.7, but I think its advantages far outweigh the shortcomings.
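The excerpt above leaves out the CSV plumbing around the redaction loop. The following is a minimal sketch of how that part could look, not the code from the actual script. To keep it runnable without the Stanford jars, it is written for CPython 3 and replaces the classifier with a hypothetical `tag_tokens` function that only knows two names; `redact` and `redact_csv` are likewise illustrative names.

```python
import csv
import io


def tag_tokens(text):
    """Hypothetical stand-in for the NER classifier: yields (token, tag) pairs."""
    known_names = {"Alice", "Bob"}  # toy data, not a real model
    for token in text.split():
        word = token.strip(".,!?")
        yield word, ("PERSON" if word in known_names else "O")


def redact(text):
    """Replace every token tagged PERSON with asterisks."""
    redacted = text
    for token, tag in tag_tokens(text):
        if tag == "PERSON":
            redacted = redacted.replace(token, "****")
    return redacted


def redact_csv(in_file, out_file):
    """Copy rows from in_file to out_file, adding a 'redacted_text' column."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["redacted_text"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["redacted_text"] = redact(row["text"])
        writer.writerow(row)


# In-memory files keep the example self-contained.
source = io.StringIO("id,text\n1,Alice thanked Bob.\n2,Nice weather today\n")
target = io.StringIO()
redact_csv(source, target)
print(target.getvalue())
```

One caveat of the plain `replace` approach, which this sketch shares with the excerpt: it substitutes every occurrence of the token in the comment, including occurrences inside longer words.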
Although we haven’t had the chance to work with .NET, I think the same rationale can be applied to IronPython when it comes to interacting with Microsoft’s framework.