AWS and vendor lock-in

Right now AWS is the leader in cloud computing, without question. With Dell’s recent acquisition of EMC, which happens to own roughly 80% of the eminent virtualization company VMware, one can imagine they have their sights set on competing for a piece of the action as well. AWS is so popular that I recently noticed companies such as Rackspace, once real competitors to AWS, now offering premier managed AWS hosting. That is the kind of smart attitude shift that companies such as Microsoft have started making. Under the leadership of Satya Nadella, Microsoft has been making many wise moves, all rooted in the recognition that it needs to play nicely with other companies and that big bad Redmond isn’t the only company in the ecosystem anymore. With that said, most companies I speak to still treat AWS as synonymous with the cloud, or at least as the de facto cloud provider. There isn’t anything wrong with that, as long as you recognize what that really means.

Back in the day, companies such as Oracle actually innovated and dominated the large-scale database space. In the realm of commercial databases Oracle is still a big player, though technologies rooted in open source are becoming more and more commonplace. Before ORMs, and before there was a buzzword called the “cloud”, the database was a huge, monstrous construct that dominated the stack. Nothing was distributed and hardly anything was clustered. Databases had a single master with some read-only slaves serving stale data. The single master was a bottleneck for writes, and you hoped it was enough, because the options that exist today simply weren’t there. Lots of database vendors offered their own solutions that were specific to their API. Your DBAs and architects would caution you against vendor lock-in: if you implement your system around vendor-specific features, they warned, you will find yourself unable to break free from the vendor’s grasp. I have always said that you should take advantage of vendor-specific features, as long as you architect them so that they are more or less lift-and-shift: bind a standard interface to a vendor-specific implementation, so you can take advantage of the vendor’s approach without sacrificing the modularity of your code base.
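To make that concrete, here is a minimal sketch of what I mean by binding a standard interface to a vendor-specific implementation. The class and method names are my own, purely illustrative, and the S3 example assumes boto3 is installed and AWS credentials are configured.

from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """The standard interface the rest of the code base depends on."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3ObjectStore(ObjectStore):
    """Vendor-specific implementation; switching vendors means writing a new subclass."""

    def __init__(self, bucket: str):
        import boto3  # the vendor SDK stays isolated inside the implementation
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

Application code only ever sees ObjectStore, so moving off S3 becomes a matter of writing and testing another subclass rather than hunting down SDK calls scattered through the code base.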

This notion of vendor lock-in is not only applicable to databases; in my mind it is even more important with hosting solutions such as AWS. AWS isn’t a software company, it is a service company. A handful of the services it offers were born and bred in-house, but most are open-source technologies on which AWS has layered tried-and-true practices, automation, and high availability. AWS is different from any hosting company before it because it provides specialized solutions, not just raw hosting. Other companies can and will compete and offer similar if not better products, and I anticipate that you will start seeing a great deal more specialization. Take IBM, which has a fairly minimal footprint in large-scale hosting but does have a very impressive suite of APIs geared towards machine learning (https://www.ibm.com/watson/developercloud/services-catalog.html). Both Google and AWS have some minimal machine learning offerings, but nothing that compares to the versatile toolkit Watson offers. I’m not telling you that IBM is reliable, performant, or any good at all. I am saying that in addition to its boring Bluemix hosting, it is being innovative with its services.

The bottom line: remember that AWS is only a single vendor. As in all markets, there will be growth and competition. S3 may be a fairly standard key-value store, but its API and domain-level approach may be very different from the next leader in the cloud industry. Learn from experience and invest your time in designing your systems so they can switch cloud partners without a complete rewrite. It’s okay to use vendor-specific features, as long as you develop them in a way that lets you easily accommodate a change.

Big Data: Low Latency Solutions

The powerhouses of the past were gigantic mainframes built for high-yield computing tasks. The notion of distributed computing extends far beyond the “cloud” frontier: I remember running SETI@home as well as Folding@home in the early 2000s, and the earliest project I could find was GIMPS, which dates back to 1997. Suffice it to say, distributed computing has advanced. Unfortunately, the general problems from then are still very much a reality today. Applications are forced to choose between processing throughput and latency. Powerful data-processing appliances will answer your questions, knowing you will have to wait around for the computation to complete.

This has been the big problem we knew needed an answer. Depending on the industry, there has been a need for computationally intensive tasks, like finding prime numbers, that do not necessarily deal with large data. On the flip side, we have had financial records that accumulate to a fairly large size but require much simpler computation. On the computationally intensive front we are still striving, with the light at the end of the tunnel being the hope for a quantum computing solution. Until quantum computing becomes a reality and an economically feasible technology, we will have to seek alternative avenues.

Hadoop provided a well-known distributed file system (HDFS), tightly coupled with the ability to load data from disk, process it, and store the results back to disk. HDFS was inspired by Google and its experience with big data. MapReduce is what I think of as the caveman’s approach to data processing: it’s raw and plentiful. It’s fault tolerant, so hopefully it all gets done sometime, eventually. MapReduce on Hadoop is only as sophisticated as the underlying system and the jobs executing on it. There is no linkage between Hadoop jobs, so jobs need to be self-contained or connect to external data sources…which adds latency. With complete isolation between mapping jobs, the reduction at the end is meant to bridge the gap and calculate the final results. The design paradigm works, but there is a lot of room for improvement.
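To give a feel for how bare-bones this model is, here is a word count sketch for Hadoop Streaming; it is my own illustration, not something lifted from the Hadoop docs, and the file name is an assumption. The mapper and reducer are isolated processes that only talk to the framework through stdin and stdout.

#!/usr/bin/env python
# wordcount.py -- usable as both the mapper and the reducer for Hadoop Streaming.
# Invoke as "wordcount.py map" for the map phase and "wordcount.py reduce" for the reduce phase.
import sys

def mapper():
    # Emit one tab-separated (word, 1) pair per word.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive grouped together.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

You would hand both commands to the hadoop-streaming jar via its -mapper and -reducer options; every stage reads its input from HDFS and writes its output back to HDFS, which is exactly where the latency piles up once you start chaining jobs together.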

While this certainly works, time and experience have led experts to much more “intelligent” approaches that remove much of the redundancy and limit duplication of effort as much as possible. If you were thinking of Spark, you are right. Spark was created as a distributed compute system that maximizes throughput and minimizes redundancy at all costs. Spark is very fast in the areas where Hadoop was sluggish and simplistic. Spark has multiple libraries that run on top of it: SQL, GraphX, and MLlib. Spark also supports stream processing; we’ll get back to that soon. Ultimately it takes the same approach as Hadoop…just smarter. This will be faster, but still not what you need for an OLTP.
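For contrast, here is the same word count as a PySpark sketch (the paths and application name are made up): the intermediate RDDs live in memory and can be cached and reused by later computations instead of being re-read from disk at every step.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
lines = sc.textFile("hdfs:///data/input")             # illustrative input path
counts = (lines.flatMap(lambda line: line.split())    # split lines into words
               .map(lambda word: (word.lower(), 1))   # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b)        # sum counts per word
               .cache())                               # keep the result in memory for reuse
counts.saveAsTextFile("hdfs:///data/output")           # illustrative output path
sc.stop()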

Did he just mention OLTP!? No, this guy is off his rocker…Hadoop isn’t for OLTP, it’s for batch processing. Come on, everyone knows that.

What are the major differences between an OLTP system and a batch processing system? An OLTP system caters to application consumption and low-latency usage. The number of active users is typically much higher, and the scope of their computations is simpler. A batch processing system, by contrast, is geared towards low concurrent usage with complex computations. There is no question that these are very different tools addressing different users. My interest is in providing a way for the two systems to live together, harnessing the capacity for concurrent complex computations as well as parallel consumption at the user level. I am looking into possible solutions that may fill this void. So far https://www.citusdata.com is the closest fit, and http://druid.io seems to be a possible contender as well. I will review and discuss these technologies in future entries. Keep your eyes open for technologies that are going to stretch your perception of what is a database and what is a data-warehousing tool. I hope to see the lines blur between data storage, data processing, and real-time data analytics. The need for these sorts of advancements is there, and I think with the proper approach the technology isn’t too far from our grasp.

Crate.io – Part 1

So I started playing a bit with crate.io while I waited for the power to come back on here at the office. Crate is built around a NoSQL data model with implied typing and optimistic locking. What makes Crate unique is that it attempts to provide the aggregation functionality commonly found in an RDBMS, although its eventually consistent, non-transactional data model is very foreign. It has things like full-text search and geometric querying functionality, which is nice but nothing to write home about.

The data types it claims to handle add array and object types not found in traditional SQL systems. PostgreSQL does support these types, but not without limitations. Crate’s handling of JSON-type input feels a great deal more natural than my experience with PostgreSQL 9.4+, which was fairly awful.

Let’s create the following table:
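What follows is a representative sketch using the crate Python client rather than an exact transcript; the table name, columns, and the localhost address are all illustrative. The interesting part is the OBJECT (DYNAMIC) column, whose sub-columns Crate types on the fly from whatever gets inserted.

from crate import client  # the crate Python client (pip install crate)

connection = client.connect("http://localhost:4200")  # assumes a local Crate node
cursor = connection.cursor()
cursor.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name STRING,
        details OBJECT (DYNAMIC)
    )
""")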

Now we make an insert with this data:
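Again with made-up values; a Python dict passed as a parameter becomes the object column, and this first row is what fixes the inferred types.

# Reusing the cursor from the sketch above.
cursor.execute(
    "INSERT INTO users (id, name, details) VALUES (?, ?, ?)",
    (1, "alice", {"age": 30, "tags": ["admin", "ops"]}),
)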

Now if I attempt to insert another row with a data type that is different from the previous row, there is a problem:
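Sticking with the hypothetical schema, suppose the first row established details['age'] as an integer and the next row sends a string for it:

# 'age' was inferred as an integer from the first insert; a string no longer fits.
cursor.execute(
    "INSERT INTO users (id, name, details) VALUES (?, ?, ?)",
    (2, "bob", {"age": "thirty"}),
)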

yields:
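Roughly speaking, a validation error: the value cannot be cast to the type Crate already inferred for that column (I’m paraphrasing; the exact wording varies with the Crate version).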

Now that may be logical…what Crate does for the complex data types (objects and arrays) is detect the data types of the inputted values and create a schema around those values. There are other data storage engines that behave similarly, creating a schema on the fly. This makes querying faster and more efficient, especially where the underlying storage format is much more type-rigid.

Now this was an unexpected and annoying issue:
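A sketch of the kind of thing that trips this up, still using the hypothetical schema from above: an array whose elements don’t all share one type.

# Arrays in Crate are homogeneous; mixing element types in a single array is not accepted.
cursor.execute(
    "INSERT INTO users (id, name, details) VALUES (?, ?, ?)",
    (3, "carol", {"scores": [1, "two", 3.5]}),
)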

yields:
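Again paraphrasing rather than quoting: an error to the effect that the array elements must all share a single type, since Crate assigns one element type per array column.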

Crate’s strongly typed Java roots may be apparent from these few limitations. Type flexibility within arrays is not an uncommon convention in JSON. As for the rigidity of the schema, I imagine that for performance reasons Crate detects the inputted types up front and creates a schema that it then adheres to. This is not uncommon among NoSQL databases that attempt what I’ve seen called gradual typing.

Crate supports blob types, like MySQL and others. Blobs are supposed to allow for binary storage of data, and Crate doesn’t point out any clear limitations.
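For reference, here is a quick sketch of how the blob support looks as I understand it: blobs live in a dedicated blob table and are pushed and pulled over plain HTTP, addressed by the SHA-1 digest of their content. The table name and addresses are illustrative.

import hashlib
import requests  # blob transfers are plain HTTP, so any HTTP client will do
from crate import client

connection = client.connect("http://localhost:4200")  # assumes a local Crate node
cursor = connection.cursor()
cursor.execute("CREATE BLOB TABLE my_blobs")

payload = b"some binary payload"
digest = hashlib.sha1(payload).hexdigest()
# Upload: PUT the bytes to /_blobs/<table>/<sha1-of-content>.
requests.put("http://localhost:4200/_blobs/my_blobs/%s" % digest, data=payload)
# Fetch it back later with a GET on the same URL.
blob = requests.get("http://localhost:4200/_blobs/my_blobs/%s" % digest).content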

I want to quickly summarize my findings so far. I have not reviewed Crate for performance, high availability, or reliability, or many other things for that matter. My initial focus was evaluating its support for JSON and a “dynamic schema” while using a SQL variant. It is a non-transactional system that utilizes optimistic locking. If you want to store JSON that has a “fuzzy” or hybrid schema, you may run into problems, since Crate locks into the schema it perceives from the inputted data. If your JSON is consistent and you want the database to support aggregation (the way it should be), Crate may be for you.

Bottom line: Crate looks like a promising solution for data whose schema has limited fluidity. It has many of the features you would expect from an RDBMS with the scalability of the newer NoSQL variants. It warrants further investigation and looks “optimistic”.