October 09, 2015

Multi-Model Database Evaluation

I have been evaluating various databases recently for a project at work. Without going into details about the product, all you need to know is that we handle several types of data and performance is a high priority. We have a very large amount of data in a wide variety of forms. We had been using PostgreSQL to hold relational data, Riak to hold most of everything else, and Elasticsearch to handle full-text search. I want to focus on multi-model databases, but I cannot overemphasize that there is no such thing as a one-size-fits-all database. A database is built for specific uses, specific data types, and specific data sizes. You may need multiple storage engines (the word I am going to use instead of database) to handle your data requirements. The problem you run into is that as the number of storage engines increases, your system becomes more complex. You may have difficulty when data stored in one storage engine requires some sort of decimation of data held in another. This results in performance issues and general complications. With all of that said, there may be times when a single storage engine solution is viable and appropriate.

Neo4j is a popular graph database that is well respected and widely used. It is, however, a graph database, not a document store, and may not handle large data as well as key-value stores that are intended for bulk storage. Neo4j is pretty fast and can be distributed. It uses a visually descriptive query language called Cypher. I happen to find Cypher very easy to use and understand, even after spending most of my life with SQL and plenty of time with MongoDB and other NoSQL databases. For a graph database it more or less just made sense. It is supported by Spring Data, which makes using it easier. If you intend commercial use, be prepared to break open the piggy bank, as Neo4j does not come cheap. You may be curious why I am bringing up Neo4j at all, since it is not a multi-model database. The only reason is that it is the de facto standard for graph databases. It treats the graph as a genuinely more intuitive way to persist your data, not as an odd plugin bolted onto another storage engine. It is truly data stored through normalization, as opposed to the de-normalization of your traditional RDBMS.
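To give a taste of why Cypher clicked for me, here is a minimal sketch using Neo4j's embedded Java API (2.x era); the labels, property names, and database path are invented for illustration, not taken from our project.

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class CypherTaste {
    public static void main(String[] args) {
        // Open (or create) an embedded database; the path is purely illustrative.
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/tmp/neo4j-eval");

        try (Transaction tx = db.beginTx()) {
            // Create two nodes and a relationship in one readable statement.
            db.execute("CREATE (a:Person {name:'Alice'})-[:KNOWS]->(b:Person {name:'Bob'})");

            // Traverse the relationship; the ASCII-art arrows mirror the graph itself.
            Result result = db.execute(
                    "MATCH (:Person {name:'Alice'})-[:KNOWS]->(friend) RETURN friend.name");
            while (result.hasNext()) {
                System.out.println(result.next().get("friend.name"));
            }
            tx.success();
        }
        db.shutdown();
    }
}
```

The point is readability: the MATCH pattern looks like the relationship it is matching.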

OrientDB looked at Neo4j and traditional RDBMSs and thought it could do better. OrientDB claims to truly replace your RDBMS with greater flexibility and scalability. Its graph layer sits on top of a document store, which in turn sits on top of a key-value store, enabling you to store your entities and relate them. Storing complex nested data is possible. One of the coolest features I encountered was its object-oriented class hierarchy, which allows for polymorphism when querying entities.
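To illustrate that hierarchy, here is a minimal sketch assuming OrientDB's document Java API from the 2.x line; the class names, fields, and database path are made up for the example.

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OSchema;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

import java.util.List;

public class PolymorphicQuery {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/orient-eval").create();
        try {
            // Define a small class hierarchy: Car and Truck both extend Vehicle.
            OSchema schema = db.getMetadata().getSchema();
            OClass vehicle = schema.createClass("Vehicle");
            schema.createClass("Car", vehicle);
            schema.createClass("Truck", vehicle);

            new ODocument("Car").field("make", "Honda").save();
            new ODocument("Truck").field("make", "Volvo").save();

            // Querying the parent class polymorphically returns every subclass record.
            List<ODocument> vehicles = db.query(
                    new OSQLSynchQuery<ODocument>("SELECT FROM Vehicle"));
            System.out.println("Vehicles found: " + vehicles.size()); // prints 2
        } finally {
            db.close();
        }
    }
}
```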

Unfortunately, OrientDB has its share of problems. Don't even waste your time with their distributed solution. Their documentation is poor and they have many open GitHub issues. There is an awesome product deep down there, but there are a lot of rough edges that need to be handled.

After evaluating OrientDB we looked at ArangoDB, which seems to be on a similar trajectory as OrientDB but more focused on stability and less on features. We passed on it because Arango lacked a binary protocol, support for binary data (everything becomes base64-encoded bloat), and a decent Java API. Their Java API was simply awful; I couldn't even find Javadocs, let alone a nicely designed API.

As of right now our plan is to partially swap our key-value store with Orient and slowly but surely evaluate it. We are being cautious due to the volatile nature of Orient and do not want to put all of our eggs in one basket. If we are successful with large data storage, both with high-volume simultaneous writes and reads, we will continue and port the remaining database to Orient.

During my evaluation, while working with large documents (several hundred megabytes), I started getting heap errors on the server side after some time. This is a posted issue: large documents can cause problems. To be clear, I had no issues with "some" large documents; it was the heavy bombardment of concurrent writes of large documents that seemed to cause trouble.

I attempted to remedy this in two basic ways. These documents were JSON, so first I tried to break the data down into vertices and edges so that there would not be any single large document. It did work nicely… but a single document could end up as hundreds of thousands of vertices and edges, causing the write time to be very long, several minutes! Traversal time was very good in this case, which is to be expected.
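For context, this is roughly the shape of that first approach, sketched with OrientDB's Blueprints-based graph API; the class name, edge label, and recursive walk are my own simplification of the idea rather than our actual schema.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

import java.util.Iterator;
import java.util.Map;

public class JsonToGraph {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        OrientGraph graph = new OrientGraph("plocal:/tmp/orient-eval");
        try {
            // Create the vertex/edge classes up front if they do not exist yet.
            if (graph.getVertexType("JsonNode") == null) {
                graph.createVertexType("JsonNode");
            }
            if (graph.getEdgeType("contains") == null) {
                graph.createEdgeType("contains");
            }

            JsonNode root = MAPPER.readTree(
                    "{\"name\":\"sensor-1\",\"readings\":{\"temp\":21.5,\"humidity\":40}}");
            toVertex(graph, root);
            graph.commit(); // one commit per source document keeps the transaction bounded
        } finally {
            graph.shutdown();
        }
    }

    // Recursively turn a JSON object into vertices: scalar fields become vertex
    // properties, nested objects become child vertices linked by a "contains" edge.
    private static Vertex toVertex(OrientGraph graph, JsonNode node) {
        Vertex vertex = graph.addVertex("class:JsonNode");
        Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
        while (fields.hasNext()) {
            Map.Entry<String, JsonNode> field = fields.next();
            if (field.getValue().isObject()) {
                Vertex child = toVertex(graph, field.getValue());
                graph.addEdge(null, vertex, child, "contains")
                     .setProperty("key", field.getKey());
            } else {
                vertex.setProperty(field.getKey(), field.getValue().asText());
            }
        }
        return vertex;
    }
}
```

The recursion is also where the write cost comes from: a deeply nested document fans out into one vertex and edge per nested object, which is exactly the hundreds of thousands of records mentioned above.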

Alternatively, I tried storing the data as an attribute in the document. Another possibility was to store the data opaquely with the zero-byte document type, chunking the data into small pieces and then streaming it.
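Here is a minimal sketch of that chunk-and-stream idea, assuming OrientDB's binary record type (ORecordBytes) and the document Java API; the chunk size, class name, and field names are arbitrary choices for illustration.

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.db.record.OIdentifiable;
import com.orientechnologies.orient.core.id.ORID;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.record.impl.ORecordBytes;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedBlobStore {
    private static final int CHUNK_SIZE = 64 * 1024; // 64 KB per chunk, arbitrary

    // Split the incoming stream into small binary records and keep their RIDs
    // in order on a regular document that acts as the table of contents.
    public static ODocument store(ODatabaseDocumentTx db, String name, InputStream in)
            throws IOException {
        List<ORID> chunkIds = new ArrayList<ORID>();
        byte[] buffer = new byte[CHUNK_SIZE];
        int read;
        while ((read = in.read(buffer)) > 0) {
            ORecordBytes chunk = new ORecordBytes(Arrays.copyOf(buffer, read));
            db.save(chunk);
            chunkIds.add(chunk.getIdentity());
        }
        ODocument meta = new ODocument("BinaryAsset");
        meta.field("name", name);
        meta.field("chunks", chunkIds);
        return meta.save();
    }

    // Stream the chunks back out in order; no single huge byte[] is ever built.
    public static void load(ODatabaseDocumentTx db, ODocument meta, OutputStream out)
            throws IOException {
        List<OIdentifiable> chunkIds = meta.field("chunks");
        for (OIdentifiable id : chunkIds) {
            ORecordBytes chunk = db.load(id.getIdentity());
            out.write(chunk.toStream());
        }
    }
}
```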

I have now realized that for "larger" documents you will run into issues with Orient and will need to dish out the cash for more RAM. I have been running a large export process against a node with 30 GB of RAM and -Xmx10g. There is a substantial amount of data being exported, spread across multiple locations and networks, so unfortunately latency is high. The Java process for Orient has held more or less steady at ~25 GB for the past few days, slowly rising as the data set grows. Keep in mind Orient claims that its disk cache is more important than its heap. While that may be true, Orient will throw out-of-heap-space errors and die if there isn't enough heap. My concern was that there is no end to the amount of memory Orient would consume. It seems that, based on the size of your file(s), you need a minimum heap setting to keep Orient happy. I do not know how to calculate that magical number, but I can say fairly confidently that `10g` hits the sweet spot in this case.
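For reference, this is roughly how the server memory ended up being set in my tests. Treat it as a sketch: the ORIENTDB_OPTS_MEMORY variable and storage.diskCache.bufferSize property are what I believe the 2.x server scripts and docs use, and the numbers are specific to this workload, so double-check the names and sizes against the documentation for your version.

```sh
# Heap for the OrientDB server JVM (the 10g "sweet spot" mentioned above).
export ORIENTDB_OPTS_MEMORY="-Xms10g -Xmx10g"

# Off-heap disk cache size in MB; Orient leans on this more than on the heap,
# but the heap still has to be big enough or the server dies with OutOfMemoryError.
export ORIENTDB_SETTINGS="-Dstorage.diskCache.bufferSize=16384"
```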

What this means for anyone using or considering Orient as a solution is that the size of the data you store must be taken into account. As of now, Orient will not gracefully trade away performance in order to keep itself running and leave those large data chunks on disk. Ideally Orient would never throw a heap error and would manage its memory better. I'm sure it uses some logic based on the access frequency and assumed relevance of each record to decide at which level of memory it should sit. The size of the heap and its current utilization need to be taken into that equation!

As of now my project is moving forward while temporarily holding off on a storage engine switch. When the discussion comes up again, as it will, my recommendation will be that Orient be heavily considered. With its vast feature set and excellent performance it brings plenty to the table, as long as you know that you are going to need to invest in getting things going. As of now Orient is not a plug-and-play solution; expect some tinkering, extensive reading of documentation, and even some source code. With that said, I'm looking forward to using Orient on some smaller, less critical projects now and hope to use it more in the future.