June 05, 2016

Big Data: Low Latency Solutions

The powerhouses of the past were gigantic mainframes built for high-throughput computing tasks. The notion of distributed computing extends far beyond the “cloud” frontier. I remember running SETI@home as well as FOLDING@home in the early 2000s. The earliest known project I could find was GIMPS, which dates back to 1997. Suffice it to say, there have been advancements in distributed computing. Unfortunately, the general problems from then are still very much a reality in the present. Applications are forced to choose between processing throughput and latency. Powerful data-processing appliances will answer your questions, but you will have to wait around for the computation to complete.

This has been the big problem that we knew needed an answer. Depending on the industry, there has been a need to perform computationally intensive tasks, like finding prime numbers, that do not necessarily involve large data. On the flip side, we have financial records, which accumulate to a fairly large size but require much simpler computation. We are still striving to handle the computationally intensive side, with the hope of a quantum computing solution as the light at the end of the tunnel. Until quantum computing becomes a reality and an economically feasible technology, we will have to seek alternative avenues.

Hadoop provided a well-known distributed file system (HDFS), tightly coupled with the ability to load data from disk, process it, and store the results back to disk. HDFS was inspired by Google and their experience with big data. MapReduce is kind of what I think of as the caveman’s approach to data processing. It’s raw and plentiful. It’s fault tolerant, so hopefully it all gets done sometime, eventually. MapReduce on Hadoop is only as sophisticated as the underlying system and the jobs executing. There is no linkage between Hadoop jobs, so the jobs need to be self-contained or connect to external data sources…which adds latency. With complete isolation between map tasks, the reduction at the end is meant to bridge the gap and calculate the final results. The design paradigm works, but there is a lot of room for improvement.
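To make the map/shuffle/reduce flow concrete, here is a minimal sketch of the pattern in plain Python, using the canonical word-count example. It only illustrates the phases; it is not Hadoop’s actual API, and it skips the distribution and fault tolerance that make the real thing useful.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each mapper sees one record (a line) in complete isolation
# and emits (key, value) pairs -- here, (word, 1).
def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle phase: group all intermediate values by key. In Hadoop this is
# the framework moving data between nodes over the network.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reducer sees one key and all of its values, bridging
# the gap between the isolated map tasks to produce the final result.
def reducer(key, values):
    return (key, sum(values))

lines = [
    "hadoop moves data to and from disk",
    "spark keeps data in memory",
    "disk is slow memory is fast",
]

mapped = chain.from_iterable(mapper(line) for line in lines)
grouped = shuffle(mapped)
results = dict(reducer(k, v) for k, v in grouped.items())
print(results)
```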

While this certainly works, time and experience have led experts to much more “intelligent” approaches that remove much of the redundancy and attempt to limit duplication of effort as much as possible. If you were thinking of Spark, you are right. Spark was created as a distributed compute system to maximize throughput and minimize redundancy at all costs. Spark is very fast in the areas where Hadoop was sluggish and simplistic. Spark has multiple libraries that run on top of it: Spark SQL, GraphX, and MLlib. Spark also supports stream processing; we’ll get back to that soon. Ultimately it takes the same approach as Hadoop…just smarter. This will be faster, but still not what you need for an OLTP system.
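One place that “smarter” shows up is keeping intermediate data in memory instead of round-tripping through disk between jobs. Here is a rough PySpark sketch of that idea; it assumes a local Spark installation, and “events.txt” is a hypothetical comma-separated input file, not anything from a real deployment.

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; "events.txt" is a hypothetical input file.
spark = SparkSession.builder.appName("cache-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Load and parse once, then keep the records cached in memory. In the
# Hadoop MapReduce model, each of the two jobs below would re-read the
# data from disk.
events = sc.textFile("events.txt").map(lambda line: line.split(",")).cache()

# Job 1: count events per type (first field of each record).
per_type = events.map(lambda rec: (rec[0], 1)).reduceByKey(lambda a, b: a + b)
print(per_type.collect())

# Job 2: a second pass over the same data, served from the in-memory cache.
print(events.count())

spark.stop()
```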

Did he just mention OLTP!? No, this guy is off his rocker…Hadoop isn’t for OLTP, it’s for batch processing. Come on, everyone knows that.

What are the major differences between an OLTP system and a batch processing system? An OLTP system caters to application consumption and low-latency usage. The number of active users is typically much higher, and their computations are simpler. A batch processing system, on the other hand, is geared towards low concurrent usage with complex computations. There is no question that these are very different tools addressing different users.

My interest is in providing a way for the two systems to live together, able to harness the capacity for concurrent, complex computations as well as parallel consumption at the user level. I am looking into possible solutions that may fill this void. So far https://www.citusdata.com is the closest thing that could fit this. Additionally, http://druid.io seems to be a possible contender. I will review and discuss these technologies in future entries. Keep your eyes open for technologies that are going to stretch your perception of what is a database and what is a data-warehousing tool. I hope to see the lines blur between data storage, data processing, and real-time data analytics. The need is there for these sorts of advancements, and I think with the proper approach the technology isn’t too far from our grasp.
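As a closing illustration of the gap I’d like to see bridged, here is a tiny, hypothetical sketch using Python’s built-in sqlite3: an OLTP-style point lookup next to a batch-style full-table aggregation. It only shows the shape of the two workloads, not either class of system.

```python
import sqlite3

# A toy in-memory table standing in for a pile of financial records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txns (account, amount) VALUES (?, ?)",
    [("acct-1", 10.0), ("acct-2", 25.5), ("acct-1", -4.25), ("acct-3", 100.0)],
)

# OLTP-style work: many concurrent users, each asking a small,
# low-latency question about a specific key.
row = conn.execute("SELECT amount FROM txns WHERE id = ?", (2,)).fetchone()
print("point lookup:", row)

# Batch-style work: few users, each asking a heavier question that
# scans and aggregates the whole data set.
totals = conn.execute(
    "SELECT account, SUM(amount) FROM txns GROUP BY account ORDER BY account"
).fetchall()
print("full aggregation:", totals)

conn.close()
```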