Crate.io – Part 1

So I started playing a bit with crate.io while I waited for the power to come back on here at the office. Crate is built on a NoSQL data model with implied typing and optimistic locking. What makes Crate unique is that it attempts to provide the aggregation functionality commonly found in an RDBMS. However, the eventually consistent, non-transactional data model feels very foreign coming from that world. It also has things like full-text search and geometric querying, which are nice but nothing to write home about.

The data types it claims to handle include array and object types not found in traditional SQL systems. PostgreSQL does support these types, but not without limitations. Crate’s handling of JSON input seems a great deal more natural than my experience with PostgreSQL 9.4+, which was fairly awful.

Let’s create the following table:

Now we make an insert with this data:

Now if I attempt to insert another row with a data type that is different from the previous row, there is a problem:

yields:

Now that may be logical. For the complex data types (objects and arrays), Crate detects the data types of the inputted values and creates a schema around those values. There are other data storage engines that behave similarly, creating a schema on the fly. This makes querying faster and more efficient, especially when the database itself is written in a language that is much more type rigid.
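To make the schema-on-the-fly idea concrete, here is a minimal Java sketch of a store that records the type of each key the first time it sees a value for it, and rejects later values of a different type. It is a toy model for illustration only and not how Crate is implemented; the names `InferredSchema`, `insert` and `typeOf` are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a schema inferred from the first inserted values (not Crate's actual code). */
final class InferredSchema {
    // Key name -> type locked in by the first value seen for that key.
    private final Map<String, Class<?>> columnTypes = new LinkedHashMap<>();

    /** Accepts a row only if each value matches the type already recorded for its key. */
    void insert(Map<String, Object> row) {
        // Validate the whole row first so a rejected row leaves the schema untouched.
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Class<?> known = columnTypes.get(e.getKey());
            Class<?> actual = e.getValue().getClass();
            if (known != null && !known.equals(actual)) {
                throw new IllegalArgumentException("key '" + e.getKey() + "' is "
                        + known.getSimpleName() + ", got " + actual.getSimpleName());
            }
        }
        // Only now lock in the types of keys seen for the first time.
        for (Map.Entry<String, Object> e : row.entrySet()) {
            columnTypes.putIfAbsent(e.getKey(), e.getValue().getClass());
        }
    }

    /** Returns the locked type for a key, or null if the key has not been seen. */
    Class<?> typeOf(String key) {
        return columnTypes.get(key);
    }

    public static void main(String[] args) {
        InferredSchema schema = new InferredSchema();
        schema.insert(Map.of("age", 42));          // locks "age" as Integer
        try {
            schema.insert(Map.of("age", "forty")); // String does not match Integer
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

The first insert decides the type, and everything after it has to conform, which is the behavior I ran into above.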

Now this was an unexpected and annoying issue:

yields:

The strongly typed Java roots of Crate may be apparent from these few limitations. The type flexibility within arrays is not an uncommon convention in JSON. As for the rigidity of the schema, I imagine that for performance reasons Crate initially detects inputted types and creates a schema to adhere to it. This is not uncommon for some NoSQL databases that attempt to have what I’ve seen called gradual typing.

Crate supports blob types, like MySQL and others. Blobs are supposed to allow for binary storage of data. Crate doesn’t point out any clear limitation.

I want to quickly summarize my findings so far. I have not reviewed Crate for performance, high availability, reliability, or many other things for that matter. My initial focus was evaluating its support for JSON and a “dynamic schema” while using a SQL variant. It is a non-transactional system that utilizes optimistic locking. If you want to store JSON that has a “fuzzy” or hybrid schema, you may run into problems, because Crate locks into its perceived schema based on inputted data. If your JSON is consistent and you want to support database aggregation (the way it should be), Crate may be for you.

Bottom line: Looks like a promising solution for dealing with data that has a schema with limited fluidity. Has many of the features you would expect from an RDBMS with the scalability of the newer NoSQL variants. Warrants further investigation and looks “optimistic”.

SCP, SMB and more

Recently I was asked to add additional connection protocols as a means to submit samples to be analyzed by our automated forensic analysis platform. We had our UI, a REST API, and multiple REST clients (C#, Java and Python). These are all standard, but all require integration or manual intervention. Protocols like FTP, SFTP, SCP, SMB and many others are used to transfer files every day. They have commercially available clients, and many operating systems already have built-in support as well. My challenge was providing a smart, powerful and flexible integration.

I knew that there are several open servers available to receive files. In order to integrate into our architecture, something would need to consume the uploaded files. Additionally, we would have to handle the authentication and authorization aspects necessary to upload to the server. We can’t just have users created for every system and read directories, as that is clunky, error prone, and not scalable. I opted for an alternate approach.

Our core components are mostly written in Java, and I was looking for a solution that would integrate directly or by means of JNI. I began with SCP and immediately found the Apache project MINA. It provides a complete Java SSHD/SFTP/SCP solution from soup to nuts. SCP/SFTP uploads are intended to be written to disk, but with a little ingenuity I was able to cut out that step completely and stream directly into our system without ever writing to disk.

We are already using Spring Security for our authentication. While I wasn’t going to take the time to extend Spring Security to handle the SSH protocol, I did utilize the ThreadLocal security context, SecurityContextHolder. This connected the authentication mechanism that MINA provides to the data transfer, so the transfer could identify the user from the security context that had been set up. This enabled me to continue using the rest of the application I had already secured. The rest of the system thought the request came from HTTP, or didn’t really care. Ideally I would extend some interfaces in Spring Security and actually bind the protocol, but that would be a nice add-on that I can recommend to the Spring Integration team, who already support SFTP. Click here to view the gist for the scp integration.
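The hand-off between authentication and transfer can be sketched without Spring or MINA at all. In the sketch below, `UserContext` is a hypothetical stand-in for Spring Security’s SecurityContextHolder, and `onAuthenticated` and `onUpload` are made-up names for the points where an SSH server would call into our code; none of this is MINA’s API. It also shows the upload being consumed as a stream instead of from a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ScpContextSketch {
    /** Hypothetical stand-in for SecurityContextHolder: one user per thread. */
    static final class UserContext {
        private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
        static void set(String user) { CURRENT.set(user); }
        static String get() { return CURRENT.get(); }
        static void clear() { CURRENT.remove(); }
    }

    /** Assumed callback: runs once the SSH layer has verified the credentials. */
    static void onAuthenticated(String user) {
        UserContext.set(user);
    }

    /** Assumed callback: receives the incoming file as a stream and never touches disk. */
    static String onUpload(String fileName, InputStream data) {
        String user = UserContext.get();
        if (user == null) {
            throw new IllegalStateException("no authenticated user on this thread");
        }
        long size = 0;
        byte[] buf = new byte[8192];
        try {
            // A real consumer would hand each chunk to the analysis pipeline here.
            for (int n; (n = data.read(buf)) != -1; ) {
                size += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return user + " uploaded " + fileName + " (" + size + " bytes)";
    }

    public static void main(String[] args) {
        byte[] sample = "sample".getBytes(StandardCharsets.UTF_8);
        onAuthenticated("alice");
        try {
            System.out.println(onUpload("a.bin", new ByteArrayInputStream(sample)));
        } finally {
            UserContext.clear(); // do not leak the user to the next task on this thread
        }
    }
}
```

Keep in mind that a ThreadLocal is only visible on the thread that set it, so the context has to be established on the thread that actually performs the transfer, and cleared when the transfer is done.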

Some of this code is just extending the ScpHelper and the ScpCommand. This provided an easy way to access my existing authentication service and set up the security context.

SCP was out of the way, but SMB was the more challenging integration. I didn’t find nearly as much on the topic, and there are a lot more complications to handle. SCP/SFTP run over SSH, which, like the TLS that makes HTTPS secure, uses public/private key cryptography to set up an encrypted channel. After the initial handshake all data sent over the wire is encrypted. This makes authentication simpler than in many other protocols out there. Naively, I was hoping that I would be able to use the stored credentials, already encrypted and protected, that we use with Spring Security. Instead, SMB usually sends hashed credentials that are compared against credentials the server already holds. This challenge-based exchange is a typical practice, meant to conceal the secret itself and guard against forged responses. I had to resort to storing the MD4 hash digest of the user’s password so it could be compared with the hashed password provided by the client.
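For reference, here is a rough sketch of how such a stored digest could be computed. It assumes the stored value follows the NT hash convention used by NTLM, that is, MD4 over the UTF-16LE bytes of the password; check what your SMB library actually expects. The JDK’s standard MessageDigest providers do not generally expose MD4, so the sketch implements it by hand following RFC 1320. The class and method names (`NtHashSketch`, `ntHash`) are made up, and the challenge/response exchange itself is left to the SMB library.

```java
import java.nio.charset.StandardCharsets;

final class NtHashSketch {
    // Per-round left-rotation amounts and the round 3 word order from RFC 1320.
    private static final int[][] SHIFTS = {{3, 7, 11, 19}, {3, 5, 9, 13}, {3, 9, 11, 15}};
    private static final int[] ROUND3_ORDER = {0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15};

    /** MD4 as described in RFC 1320. */
    static byte[] md4(byte[] msg) {
        // Pad with 0x80, then zeros, then the bit length as a 64-bit little-endian value.
        int zeros = (56 - (msg.length + 1) % 64 + 64) % 64;
        byte[] padded = new byte[msg.length + 1 + zeros + 8];
        System.arraycopy(msg, 0, padded, 0, msg.length);
        padded[msg.length] = (byte) 0x80;
        long bitLen = (long) msg.length * 8;
        for (int i = 0; i < 8; i++) {
            padded[padded.length - 8 + i] = (byte) (bitLen >>> (8 * i));
        }

        int[] h = {0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476};
        int[] x = new int[16];
        for (int off = 0; off < padded.length; off += 64) {
            // Load the 64-byte block as sixteen little-endian 32-bit words.
            for (int i = 0; i < 16; i++) {
                int p = off + 4 * i;
                x[i] = (padded[p] & 0xff) | ((padded[p + 1] & 0xff) << 8)
                        | ((padded[p + 2] & 0xff) << 16) | ((padded[p + 3] & 0xff) << 24);
            }
            int a = h[0], b = h[1], c = h[2], d = h[3];
            for (int i = 0; i < 48; i++) {
                int round = i / 16;
                int j = i % 16;
                int f;
                int k;
                if (round == 0) {
                    f = (b & c) | (~b & d);
                    k = j;
                } else if (round == 1) {
                    f = ((b & c) | (b & d) | (c & d)) + 0x5a827999;
                    k = (j % 4) * 4 + j / 4;
                } else {
                    f = (b ^ c ^ d) + 0x6ed9eba1;
                    k = ROUND3_ORDER[j];
                }
                int t = Integer.rotateLeft(a + f + x[k], SHIFTS[round][j % 4]);
                // Rotate the registers so the next step updates the next one in turn.
                a = d;
                d = c;
                c = b;
                b = t;
            }
            h[0] += a;
            h[1] += b;
            h[2] += c;
            h[3] += d;
        }
        byte[] out = new byte[16];
        for (int i = 0; i < 16; i++) {
            out[i] = (byte) (h[i / 4] >>> (8 * (i % 4)));
        }
        return out;
    }

    /** Assumed NT hash convention: MD4 over the UTF-16LE encoding of the password. */
    static byte[] ntHash(String password) {
        return md4(password.getBytes(StandardCharsets.UTF_16LE));
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(ntHash("password")));
    }
}
```

Since the digest is all an attacker needs to answer a challenge, it should be stored with the same care as the password itself.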

The library I used is called JLAN and was developed by Alfresco. It is on the older side and scarcely maintained. Sadly, its documentation was only slightly better than you’d expect to find from a JBoss product. There is a developer’s guide and an installation guide. For general usage it may be fine, but for what I was planning on doing it was a tad more challenging. Some software engineers try to protect their future code by using final for anything and everything, making things very rigid and hard to extend. They only give you access to a very small selection of methods and may not even document those well. I wanted a way to hook in the UserAuthenticationService that we used in the scp service. My challenge was that the SecurityConfigSection would only let you specify the UsersInterface by passing a String naming the class that implements said interface. That class is instantiated and made completely inaccessible thereafter. This made accessing my Spring managed bean very difficult to nearly impossible. Usually I would try to have @ComponentScan pick up the class and either @Autowired the interface or use the BeanFactory to retrieve it dynamically. I came up with a simple but really nice approach to handle this.

In my @Configuration class I set the BeanFactory in the enum making it statically accessible. Thus, even our annoying UsersInterface implementation can take advantage of our managed beans without having to deal with any final mess.
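The idea can be sketched in plain Java as follows. `BeanLookup` is a hypothetical stand-in for Spring’s BeanFactory reduced to a lookup by type, `ReflectiveUsers` plays the role of the UsersInterface implementation that the library creates from a class name, and `publishBeans` stands in for what the @Configuration class does at startup. All of these names, and the body of the service, are assumptions for illustration and not the code from my gist.

```java
import java.util.HashMap;
import java.util.Map;

final class BeanHolderSketch {
    /** Hypothetical stand-in for Spring's BeanFactory, reduced to lookup by type. */
    interface BeanLookup {
        <T> T getBean(Class<T> type);
    }

    /** Hypothetical shape of the authentication service; only the name comes from the post. */
    interface UserAuthenticationService {
        boolean userExists(String name);
    }

    /** Enum singleton that makes the bean lookup statically reachable. */
    enum Beans {
        INSTANCE;

        private volatile BeanLookup lookup;

        void setLookup(BeanLookup lookup) {
            this.lookup = lookup;
        }

        <T> T get(Class<T> type) {
            if (lookup == null) {
                throw new IllegalStateException("bean lookup has not been set yet");
            }
            return lookup.getBean(type);
        }
    }

    /** Created reflectively from its class name, so nothing can be injected into it. */
    public static final class ReflectiveUsers {
        public ReflectiveUsers() { }

        public boolean knows(String name) {
            // Reach back into the container through the static holder.
            return Beans.INSTANCE.get(UserAuthenticationService.class).userExists(name);
        }
    }

    /** What the @Configuration class would do once at startup. */
    static void publishBeans() {
        Map<Class<?>, Object> beans = new HashMap<>();
        UserAuthenticationService auth = name -> name.equals("alice");
        beans.put(UserAuthenticationService.class, auth);
        Beans.INSTANCE.setLookup(new BeanLookup() {
            @Override
            public <T> T getBean(Class<T> type) {
                return type.cast(beans.get(type));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        publishBeans();
        // What the library does: instantiate the implementation from a String.
        Object users = Class.forName(ReflectiveUsers.class.getName())
                .getDeclaredConstructor().newInstance();
        System.out.println(((ReflectiveUsers) users).knows("alice"));
    }
}
```

The trade-off is a piece of global state, so the holder has to be populated before the library instantiates the class, but it avoids fighting the library’s final classes.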

After I worked out those Spring bean issues, I still had to deal with the frustrations of learning the ins and outs of the SMB protocol. This approach also allows for a transfer that requires no writes to disk and authentication that can flow through the existing system. Look here for a gist of the general approach. If you really want to use the JLAN library, realize that Alfresco sells an enterprise license and probably supports a great deal of options with it.

Here is a snippet of what you will want to add to a POM (Maven) to play around with these code samples.

More to come!