Java Scripting Vulnerabilities

Recently I had to look into processing user-supplied expressions that would enable a user to insert control-flow logic into a workflow based on whatever criteria they wish. We were already using JBoss’s jBPM, an implementation of BPMN, a standard workflow schema. It is a stable, trusted, and widely used tool in the enterprise community.

jBPM allows the workflow to have “constraints” where you may inject variables (or really whatever you wish) and supply either Java code or MVEL expressions to evaluate, ultimately controlling the flow of the workflow. MVEL is a Java-like expression language that provides a nice “scripting” feel while maintaining access to all Java objects and classes without the verbosity of Java. With that said, both Java and MVEL are just as susceptible to malicious code as if you were running that code directly rather than in an interpreter. My first thought was that, either on the MVEL layer or on the jBPM layer, there would likely be some restrictions added to safeguard the application and the underlying system. I was wrong. I was able to invoke Runtime.getRuntime().exec("rm -rf /"), which will obviously yield some pretty devastating results. I’m assuming that jBPM was never written with the intention of allowing external parties to invoke their own code. However, that is precisely what I needed to accomplish without sacrificing the security of the system.
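To make the exposure concrete, here is a minimal sketch of that kind of evaluation using the MVEL library directly (with a harmless command substituted for the destructive one above):

    import org.mvel2.MVEL;

    public class MvelEscapeDemo {
        public static void main(String[] args) {
            // A "constraint" expression exactly as a user might submit it.
            // MVEL resolves java.lang.Runtime like any other class, so the
            // command runs with the full privileges of the host JVM.
            String userExpression =
                "java.lang.Runtime.getRuntime().exec(\"touch /tmp/pwned\")";
            Object result = MVEL.eval(userExpression);
            System.out.println("Evaluated to: " + result);
        }
    }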

I looked into using Rhino (https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino, https://docs.oracle.com/javase/7/docs/technotes/guides/scripting/programmer_guide/), the standard scripting engine shipped with Java 7. Rhino has since been taken out of the spotlight, as Java 8 offers the new and improved Nashorn JavaScript engine. I didn’t evaluate Nashorn for security concerns, as the related project was limited to Java 7. With Rhino I had to issue a slightly different command than with Java/MVEL, providing the fully qualified package name, but the outcome was ultimately the same: I was able to invoke the exact same commands as before.
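For reference, the same escape through the standard javax.script API (which resolves to Rhino on Java 7) looks roughly like this, again with a harmless command substituted:

    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class RhinoEscapeDemo {
        public static void main(String[] args) throws Exception {
            // On Java 7 the default "JavaScript" engine is Rhino.
            ScriptEngine engine =
                new ScriptEngineManager().getEngineByName("JavaScript");
            // Rhino wants the fully qualified name, but the script still
            // reaches straight through to the host JVM.
            engine.eval("java.lang.Runtime.getRuntime().exec('touch /tmp/pwned')");
        }
    }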

I looked around and found this blog post (http://maxrohde.com/2015/08/06/sandboxing-javascript-in-java-app-link-collection/), which led me to a simple little library (https://github.com/javadelight/delight-rhino-sandbox). It combines a number of safeguards, including the initSafeStandardObjects() method, which removes the Java connection entirely. The library also provides some easy ways to still inject desired objects and data.
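Usage is pleasantly small. Here is a sketch based on the library’s README; the RhinoSandboxes factory and the inject()/eval() methods are as I recall them, so check the project page for the current API:

    import delight.rhinosandox.RhinoSandbox;
    import delight.rhinosandox.RhinoSandboxes;

    public class SandboxDemo {
        public static void main(String[] args) {
            RhinoSandbox sandbox = RhinoSandboxes.create();
            // Expose only the data the script legitimately needs.
            sandbox.inject("orderTotal", 150.0);
            // java.lang.Runtime (and the rest of the JVM) is unreachable;
            // the script below is pure JavaScript over injected variables.
            Object result = sandbox.eval("constraint", "orderTotal > 100");
            System.out.println(result); // true
        }
    }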

The result is that with a little bit of code I am able to invoke sandboxed JavaScript within the jBPM constraint evaluation, with the user-supplied script written in JavaScript. All I have done is arrange for the constraint evaluation to delegate to a separate, sandboxed evaluation whose result is the boolean indicating whether or not the corresponding step should be invoked. In reality this sort of logic is not limited to JavaScript via Rhino. Once we separate the execution from reliance on what jBPM supports, there isn’t anything stopping us from supporting any scripting language that can be invoked in the JVM or on the host machine.
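A hypothetical helper along these lines (ScriptConstraints and evaluate() are my own names, not anything jBPM provides) can then be called from a Java-dialect constraint, e.g. return ScriptConstraints.evaluate(script, vars);:

    import java.util.Map;

    import delight.rhinosandox.RhinoSandbox;
    import delight.rhinosandox.RhinoSandboxes;

    public final class ScriptConstraints {

        private ScriptConstraints() {
        }

        // Runs a user-supplied JavaScript constraint in the sandbox and
        // coerces the result to the boolean the workflow gateway expects.
        public static boolean evaluate(String script, Map<String, Object> variables) {
            RhinoSandbox sandbox = RhinoSandboxes.create();
            for (Map.Entry<String, Object> entry : variables.entrySet()) {
                sandbox.inject(entry.getKey(), entry.getValue());
            }
            Object result = sandbox.eval("constraint", script);
            return Boolean.TRUE.equals(result);
        }
    }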

The end result is that we were able to leverage the stability and power of jBPM without sacrificing our system’s security.

Multi-Model Database Evaluation

I have been evaluating various databases recently for a project at work. Without going into the details of the product, all you need to know is that there are several types of data and performance is a high priority. We have a very large amount of data of varied types. We had been using PostgreSQL for relational data, Riak for most of everything else, and Elasticsearch for full-text search. I want to focus on multi-model databases, but I cannot overemphasize that there is no such thing as a one-size-fits-all database. A database is built for specific uses, specific data types, and specific data sizes, and you may need multiple storage engines (the term I will use instead of “database”) to handle your data requirements. The problem you run into is that as the number of storage engines increases, your system becomes more complex. You may have data stored in one storage engine that needs to be combined with data from another, which results in performance issues and general complications. With all of this said, there may be times when a single storage engine is viable and appropriate.

Neo4j is a popular graph database that is well respected and widely used. It is, however, a graph database, not a document store, and may not handle large data as well as key-value stores intended for bulk storage. Neo4j is fast and can be distributed. It uses a visually descriptive query language called Cypher. I happen to find Cypher very easy to use and understand, even having spent most of my life with SQL and plenty of MongoDB and other NoSQL databases. For a graph database it more or less just makes sense. It is supported by Spring Data, which makes using it easier. If you are using it commercially, be prepared to break open the piggy bank, as Neo4j does not come cheap. You may be curious why I am bringing up Neo4j, as it is not a multi-model database. The only reason is that it is the de facto standard for graph databases. It treats the graph model as a genuinely intuitive way to persist your data, not an odd plugin bolted onto another storage engine. It is truly data stored through normalization, as opposed to the denormalization in your traditional RDBMS.
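To give a taste of why Cypher reads so naturally, here is a sketch using Spring Data Neo4j (the Person entity, the KNOWS relationship, and the repository are all made up for illustration): the MATCH pattern is practically ASCII art of the traversal.

    import org.springframework.data.neo4j.annotation.Query;
    import org.springframework.data.neo4j.repository.GraphRepository;

    public interface PersonRepository extends GraphRepository<Person> {

        // The pattern (p)-[:KNOWS]->(friend) is drawn the way you would
        // sketch the relationship on a whiteboard.
        @Query("MATCH (p:Person {name: {0}})-[:KNOWS]->(friend) RETURN friend")
        Iterable<Person> findFriends(String name);
    }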

OrientDB looked at Neo4j and traditional RDBMSs and thought it could do better. OrientDB claims to truly replace your RDBMS with greater flexibility and scalability. It is a graph layer sitting on top of a document store, which in turn sits on top of a key-value store, enabling you to store your entities, relate them, and hold complex nested data. One of the coolest features I encountered was its object-oriented class hierarchy, which allows for polymorphism when querying entities.
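A sketch of that hierarchy with OrientDB’s document API (the class and field names are made up; this follows the 2.x API as I used it):

    import java.util.List;

    import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
    import com.orientechnologies.orient.core.metadata.schema.OClass;
    import com.orientechnologies.orient.core.metadata.schema.OSchema;
    import com.orientechnologies.orient.core.record.impl.ODocument;
    import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

    public class PolymorphismDemo {
        public static void main(String[] args) {
            ODatabaseDocumentTx db = new ODatabaseDocumentTx("memory:demo").create();
            try {
                OSchema schema = db.getMetadata().getSchema();
                OClass person = schema.createClass("Person");
                // Employee extends Person in the database schema itself.
                schema.createClass("Employee", person);

                new ODocument("Person").field("name", "Ada").save();
                new ODocument("Employee").field("name", "Grace").save();

                // Querying the superclass returns Employee records too.
                List<ODocument> results =
                    db.query(new OSQLSynchQuery<ODocument>("select from Person"));
                for (ODocument doc : results) {
                    String name = doc.field("name");
                    System.out.println(name);
                }
            } finally {
                db.drop();
            }
        }
    }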

Unfortunately, OrientDB has its share of problems. Don’t even waste your time with their distributed solution. Their documentation is poor and they have many open GitHub issues. There is an awesome product deep down in there, but there are a lot of rough edges that need to be smoothed out.

After evaluating OrientDB, we looked at ArangoDB, which seems to be on a similar trajectory to OrientDB but more focused on stability and less on features. We ruled it out because Arango lacked a binary protocol, support for binary data (without base64-encoded bloat), and a decent Java API. Their Java API was simply awful; I couldn’t even find Javadocs, let alone a nicely designed API.

As of right now our plan is to partially swap our key-value store for Orient and slowly but surely evaluate it. Given the volatile nature of Orient, we are being cautious and not putting all of our eggs in one basket. If we are successful with large data storage, with high-volume simultaneous writes and reads, we will continue and port the remaining databases to Orient.

During my evaluations, when I was working with large documents (several hundred megabytes), after some time I started getting heap errors on the server side. It is a documented issue that large documents can cause problems. I had no issues with some large documents; it was the bombardment of concurrent writes of large documents that seemed to cause trouble.

I attempted two basic remedies. These documents were JSON. First, I tried breaking the data down into vertices and edges so that there wouldn’t be any single large document. It did work nicely, but a single document could end up as hundreds of thousands of vertices and edges, making the write time very long: several minutes! Traversal time was very good in this case, which is to be expected.
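The decomposition itself was straightforward with the Blueprints graph API that OrientDB 2.x exposes; here is a simplified sketch (the JSON walking is elided, and the property and edge labels are made up):

    import com.tinkerpop.blueprints.Vertex;
    import com.tinkerpop.blueprints.impls.orient.OrientGraph;

    public class JsonToGraph {
        public static void main(String[] args) {
            OrientGraph graph = new OrientGraph("memory:jsondemo");
            try {
                // One vertex per JSON object, one edge per nested reference.
                // A document that explodes into hundreds of thousands of
                // nodes means hundreds of thousands of these calls, which
                // is where the multi-minute write times came from.
                Vertex root = graph.addVertex(null);
                root.setProperty("key", "document-root");

                Vertex child = graph.addVertex(null);
                child.setProperty("key", "some.nested.field");
                graph.addEdge(null, root, child, "contains");

                graph.commit();
            } finally {
                graph.shutdown();
            }
        }
    }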

Alternatively, I tried storing the data as an attribute on the document. Another possibility was to store the data opaquely using the binary record type, chunking the data into small pieces and then streaming it.
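A sketch of the chunked approach using OrientDB’s binary record type, ORecordBytes (the chunk size and the Attachment class name are arbitrary choices of mine):

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
    import com.orientechnologies.orient.core.id.ORID;
    import com.orientechnologies.orient.core.record.impl.ODocument;
    import com.orientechnologies.orient.core.record.impl.ORecordBytes;

    public class ChunkedBlobStore {

        private static final int CHUNK_SIZE = 64 * 1024;

        // Splits a large payload into small binary records and links them
        // from a single small document, so no one record blows up the heap.
        public static ODocument store(ODatabaseDocumentTx db, InputStream in)
                throws Exception {
            List<ORID> chunks = new ArrayList<ORID>();
            byte[] buffer = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.read(buffer)) > 0) {
                ORecordBytes chunk = new ORecordBytes(Arrays.copyOf(buffer, read));
                db.save(chunk);
                chunks.add(chunk.getIdentity());
            }
            ODocument doc = new ODocument("Attachment");
            doc.field("chunks", chunks);
            doc.save();
            return doc;
        }
    }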

I have now realized that for “larger” documents you will run into issues with Orient and will need to dish out the cash for more RAM. I have been running a large export process on a node with 30GB of RAM with -Xmx10g. There is a substantial amount of data being exported, spread across multiple locations and networks, so unfortunately latency is high. The Java process for Orient has stayed more or less consistent at ~25GB for the past few days, rising slowly as the data set grew. Keep in mind that Orient claims its disk cache matters more than its heap. While that may be true, Orient will throw out-of-heap-space errors and die if there isn’t enough heap. My concern was that there seemed to be no end to the amount of memory Orient would consume. It appears that, based on the size of your files, you need a minimum heap setting to keep Orient happy. I do not know how to calculate that magic number, but I can say fairly confidently that 10g hits the sweet spot in this case.

What this means for anyone using or considering Orient is that the size of the data you store must be taken into account. As of now, Orient will not gracefully degrade performance in order to keep itself alive and keep those large data chunks on disk. Ideally Orient would never throw a heap error and would manage its memory better. I’m sure it uses some logic based on the frequency and assumed relevance of each record to decide at what level of memory each record should sit; the size of the heap and its current utilization need to factor into that equation!

As of now, my project is moving forward while temporarily holding off on a storage engine switch. When the discussion comes up again, as it will, my recommendation will be that Orient be heavily considered. With its vast feature set and excellent performance it brings plenty to the table, as long as you know you are going to need to invest in getting things going. As of now Orient will not be a plug-and-play solution without some tinkering, extensive reading of documentation, and even some source code. With that said, I’m looking forward to using Orient on some smaller, less critical projects now and hope to use it more in the future.