Dovid Kopel - Technologist

Technology Consigliere & Innovator, Open Source Evangelist, Geek, Lefty

Musing about computers, technology, and science

June 05, 2016

Big Data: Low Latency Solutions

The powerhouses of the past were gigantic mainframes built for high-yield computing tasks. The notion of distributed computing extends far beyond the “cloud” frontier. I remember running SETI@home as well as Folding@home in the early 2000s. The earliest known project I could find was GIMPS, which dates back to 1997. Suffice it to say, distributed computing has seen real advancement. Unfortunately, the general problems from then are still very much a reality in the present. Applications are forced to choose between processing throughput and latency. Powerful data processing appliances will answer your questions, but only if you are willing to wait around for the computation to complete.

April 05, 2016

Data Evolution to Revolution

I apologize in advance, as this is an attempt to solidify some ideas I have been having into something a tad more cohesive. Let's start small. I am not sure if this is a chicken-and-egg sort of thing, but I think it just may be. Does the evolution of technology yield more data, and thus demand more complex and intricate mechanisms to capture, evaluate, utilize and wield that data? Advances in technology bring about more data, more quickly and more accurately. Strides forward in technology enable greater insight into data that may have existed for years, maybe even decades or more. Most of these technologies are merely building on top of their predecessors, taking things one step further and refining them. Occasionally there are novel ideas that truly disrupt the pace and direction, shattering preconceptions and misconstrued understandings.

We have made very little true advancement in our "PC" age. We have sleeker, smaller, faster infrastructure. We have smart watches, phones, tablets and more. We have the buzzwords "cloud" and "IoT" that people love to throw around. Anyone today can make an "app" and think that it is novel, that it will change the world. We are so far from true advancement at the computing level that the word "intelligence" hardly applies.

I would not pretend to be an expert on anything to do with AI or machine learning. I do, however, know that we have neither the precision nor the speed to come remotely close to anything of substance. We are playing "Go" and Jeopardy, we are writing cookbooks, and more. True creativity is alien to our creations. We are doing nothing more than creating formidable copy-cats. Sure, a system may consume many different approaches to chess or some other topic; ultimately it is illustrating a path that will attempt to "beat" its opponent. I am not enough of a philosopher or scientist to evaluate the state of that level of comprehension. It is certainly complex and well structured, and it may beat a human. It is, however, a very far cry from the human intellect.

March 17, 2016

Crate.io - Part 1

So I started playing a bit with Crate.io while I waited for the power to come back on here at the office. Crate is built around a NoSQL data model with implied typing and optimistic locking. What makes Crate unique is that it attempts to provide the aggregation functionality commonly found in an RDBMS. However, its eventually consistent, non-transactional data model will feel very foreign to relational users. It has things like full text search and geometric querying functionality, which is nice but nothing to write home about.

The data types it handles add array and object types not found in traditional SQL systems. PostgreSQL does support these types, but they are not without their limitations. Crate's handling of JSON-type input feels a great deal more natural than my experience with PostgreSQL 9.4+, which was fairly awful.
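For comparison, here is a rough sketch of what the equivalent looks like in PostgreSQL 9.4+ using jsonb (the table name test_pg is just for illustration):

-- PostgreSQL stores the JSON as an opaque jsonb value and
-- extracts/casts nested values at query time rather than typing them up front
create table test_pg (
  data jsonb,
  label text
);

insert into test_pg values ('{"z": 1}', 'some test');

-- nested values come back as text via ->> and need an explicit cast
select label, (data->>'z')::bigint from test_pg;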

Let's create the following table:

create table test (
  data object,
  label string
);

Now we make an insert with this data:

insert into test values (
  {"z" = 1},
  'some test'
);

Now if I attempt to insert another row with a data type that is different from the previous row, there is a problem:

insert into test values (
  {"z" = [1, 2, 3]},
  'baba bobo'
);

yields:

SQLActionException[Validation failed for data['z']: Invalid long]

Now, that may be logical: for the complex data types (objects and arrays), Crate detects the data types of the inputted values and creates a schema around those values. There are other data storage engines that perform similarly, creating a schema on the fly. This makes querying faster and more efficient, especially where the underlying storage is written in a format that is much more rigidly typed.
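As a rough illustration of the point, once Crate has detected a type for data['z'] it can be selected and filtered like any other typed column (reusing the test table from above):

-- data['z'] was detected as a long from the first insert,
-- so it behaves like a regular typed column in queries
select label, data['z']
from test
where data['z'] >= 1;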

Now this was an unexpected and annoying issue:

insert into test values (
  {"mixed_array" = [1, 'two', 3.0]},
  'mix it up!'
);

yields:

SQLActionException[Validation failed for data: Mixed dataTypes inside a list are not supported]

The strongly typed Java roots of Crate may be apparent from these few limitations. Type flexibility within arrays is not an uncommon convention in JSON. As for the rigidity of the schema, I imagine that for performance reasons Crate detects the inputted types up front and creates a schema that subsequent writes must adhere to. This is not uncommon among NoSQL databases that attempt what I've seen called gradual typing.
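If that rigidity is a problem, Crate's object column policies are worth a look. My understanding is that declaring the column as ignored tells Crate to skip type detection for the nested values entirely; a minimal sketch, assuming that policy (loose_test is a hypothetical table name):

-- "ignored" skips type detection/enforcement for keys inside data,
-- at the cost of those inner values not being indexed for querying
create table loose_test (
  data object(ignored),
  label string
);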

Crate supports blob types, like MySQL and others. Blobs allow for binary storage of data, and Crate doesn't point out any clear size limitation.
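For reference, a minimal sketch of how blob storage appears to work in Crate: blob tables are created separately, and the binary content is then uploaded through Crate's HTTP blob endpoint rather than through SQL (the table name samples is just for illustration):

-- dedicated blob table; payloads are PUT to the HTTP endpoint
-- /_blobs/samples/<sha1-of-content> and fetched the same way
create blob table samples clustered into 3 shards;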

I want to quickly summarize my findings so far. I have not reviewed Crate for performance, high availability, reliability, or many other things for that matter. My initial focus was evaluating its support for JSON and having a "dynamic schema" while using a SQL variant. It is a non-transactional system that utilizes optimistic locking. If you want to store JSON that has a "fuzzy" or hybrid schema, you may run into problems: Crate locks into its perceived schema based on the inputted data. If your JSON is consistent and you want the database to do the aggregation (the way it should be), Crate may be for you.

Bottom line: Crate looks like a promising solution for dealing with data whose schema has limited fluidity. It has many of the features you would expect from an RDBMS, with the scalability of the newer NoSQL variants. It warrants further investigation and looks "optimistic".

March 17, 2016

SCP, SMB and more

Recently I was asked to add additional connection protocols as a means to submit samples to be analyzed by our automated forensic analysis platform. We had our UI, a REST API, and multiple REST clients (C#, Java, and Python). These are all standard, but all require integration or manual intervention. Protocols like FTP, SFTP, SCP, SMB and many others are used to transfer files for everyday use. They have commercially available clients, and many operating systems have built-in support as well. My challenge was providing a smart, powerful, and flexible integration.

January 31, 2016

Nested Workflows with jBPM

I am working on a project where we are utilizing BPMN for authoring and controlling the processing of analysis. Any single analysis task may yield several descendants, much as a .ZIP file has many child files. Additionally, many analyzers yield additional analysis for both the inputted artifact and for newly produced artifacts. Currently we are treating each and every child artifact, as well as each child workflow, as a completely separate entity.