April 17, 2018

AWS Kinesis and data streams

For a large, complex application, communication between services, as well as with the external world, is the backbone of your system infrastructure. I could go on for a while about the various options for these types of systems and my many experiences with each of them. However, I want to focus on the managed services that AWS offers for this purpose. I’m going to run through SQS/SNS, Amazon MQ, and Kinesis. Each of these deserves its own in-depth analysis, and all truly have great merit on their own.

The first, while technically two separate systems, are often utilized together. SQS is the Simple Queue Service, and SNS is the Simple Notification Service. SNS messages must go somewhere: to a subscribed endpoint, which may be SMS, Lambda, email, or an SQS queue. SNS does not support replying to messages, nor anything more than a completely static topic. More versatile messaging systems offer nuances such as wildcards or hierarchical topics that allow for a powerful routing system.
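
For concreteness, here is a minimal boto3 sketch of the common SNS-to-SQS pairing; the topic and queue names are placeholders, and the queue access policy is only noted in a comment:

```python
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Hypothetical names; substitute your own resources.
topic_arn = sns.create_topic(Name="order-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="order-events-worker")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Subscribe the queue to the topic so every published message lands in SQS.
# (The queue also needs a policy allowing sns.amazonaws.com to SendMessage.)
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# Publishing to the topic fans out to every subscriber: SQS, Lambda, email, SMS.
sns.publish(TopicArn=topic_arn, Message=json.dumps({"order_id": 42}))
```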

SNS may automatically trigger a Lambda, which makes it very powerful. However, SNS with a Lambda trigger is a single point of failure for message execution. If that Lambda errored for any reason, the message would be lost, which for many systems is not an option.

SQS cannot trigger a Lambda; a queue must be polled and its items processed. Ideally, AWS would have the capacity to automatically call a Lambda when an item is added to a queue. I would design the system so that the call’s return status determines whether or not the item is removed from the queue.
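
Something like the following poller is what I have in mind; it is a rough sketch with a hypothetical queue URL and function name, invoking a Lambda per message and deleting the message only on a clean return:

```python
import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def poll_once():
    # Long-poll the queue rather than hammering it.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        result = lam.invoke(
            FunctionName="process-work-item",      # hypothetical function
            InvocationType="RequestResponse",
            Payload=msg["Body"].encode("utf-8"),
        )
        # Only delete the message when the function ran cleanly; otherwise it
        # becomes visible again after the visibility timeout and is retried.
        if result["StatusCode"] == 200 and "FunctionError" not in result:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```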

I think that, to address some of the matters I have pointed out, AWS came out with their own managed message broker: Amazon MQ. I’m pretty disappointed. I know that a number of AWS services are probably heavily tweaked open source software under the covers, but this almost seemed lazy. There are two options for the broker size, and that is all. I would’ve hoped to see a truly scalable “managed service” that more closely reflected SQS. I’m not sure if they released Amazon MQ just to satisfy customers who wanted an official MQ as opposed to maintaining an instance themselves. Now, with ECS Fargate, I have a hard time justifying the difference between the two. With that said, is this their way of saying that this is a technology they do not wish to invest in further?

Stream processing is no simple task, but it’s even more complex if you want to process the stream in a modular way. Do you have one huge monolithic function that analyzes an entire block? More importantly, how do you deal with a block? You may buffer the data, but that doesn’t guarantee you will be able to process it immediately. I’m especially thinking of encrypted data, where the encryption may vary by the type of data. Yes, you can use KMS, but that only works for encrypting the data stream as a whole. What if you want to handle many different types of data, each with its own encryption mechanism or keys?
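
One way around that, sketched below with assumed key aliases and a hypothetical helper, is to encrypt each record client-side under a per-type KMS key before putting it on the stream:

```python
import boto3

kms = boto3.client("kms")
kinesis = boto3.client("kinesis")

# Hypothetical mapping: each data type gets its own key (referenced by alias).
KEYS_BY_TYPE = {
    "billing": "alias/stream-billing",
    "telemetry": "alias/stream-telemetry",
}

def put_encrypted(stream_name: str, data_type: str, payload: bytes) -> None:
    # Encrypt the individual record under the key for its type, rather than
    # relying on one stream-wide key. (kms.encrypt handles payloads up to 4 KB;
    # larger records would need envelope encryption with a generated data key.)
    blob = kms.encrypt(KeyId=KEYS_BY_TYPE[data_type], Plaintext=payload)["CiphertextBlob"]
    kinesis.put_record(
        StreamName=stream_name,
        Data=blob,
        PartitionKey=data_type,  # group records of a type onto the same shard(s)
    )
```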

What if you want to allow external parties to push data onto your stream, as a way to ingest data much like a third party would push it over any other transport protocol? We can grant the third-party user permission to encrypt with the KMS key for the given stream. Then you may consume this external stream, breaking it down into the correct category and finding its appropriate destination. You need to judge based on how large the raw data is, but it is likely not worth passing it around the network repeatedly.
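
Granting that encrypt-only access can be done with a KMS grant; the ARNs below are made up purely for illustration:

```python
import boto3

kms = boto3.client("kms")

# Hypothetical ARNs: the key used for the shared stream and the external
# account's producer role.
key_id = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"
third_party_role = "arn:aws:iam::999999999999:role/partner-producer"

# Let the partner encrypt with (but not decrypt or manage) the stream's key;
# our own consumers keep the decrypt side to themselves.
kms.create_grant(
    KeyId=key_id,
    GranteePrincipal=third_party_role,
    Operations=["Encrypt", "GenerateDataKey"],
)
```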

In general, when playing around with technologies like this, I want to push hard limits to figure out which use cases are appropriate. Depending on the size of the initial payload, the number of consumers, the need for transactions, etc., you can establish a matrix that helps decide whether a technology is a good fit for a given use case, or at all.

Back in the day, when data storage was expensive, we would try to avoid storing redundant data. Now storage is really cheap, and even processing power isn’t too bad. Concurrency and time are still factors, though. Having data stored in several places and many processors to evaluate it against a given condition doesn’t mean you won’t run into non-linear issues with the evaluation. In general, I’ve found the best approach is to define your storage, processing, and analytics needs more clearly, and to separate functions so that each has a single intention rather than cramming in all functionality, like a crappy multi-function printer.

This type of approach, followed to its logical conclusion, would yield a MapReduce-style system. That isn’t a bad thing in itself; it is a bad thing if the steps in your pipeline are redundant and unoptimized. Batch processing data on a Hadoop-style cluster, even with a smarter workflow, is still very clumsy. Even Apache Spark, which attempts to better organize and process your data, is still quite clumsy.

When dealing with YARN, Mesos, or any other job-distributing system, the scheduler really doesn’t understand your data needs, the relationships between the workflow steps, or the conditions in between those steps. Apache Spark evolves slightly in this regard, more intelligently grouping data together so it can be processed with greater performance.

Let’s pretend your data is 100 bytes long.

  1. Do you need the entire 100 bytes at one time? (Decryption, signature validation, checksum, etc.)
  2. Yes.
    1. Okay, you need the entire payload at one time. Let’s assume you want to decrypt it.
    2. It probably makes sense to hash, verify, and decrypt it once and then send it through its channels. Depending on the number of consumers, the number of checks, and the nature of the encryption, this will usually be your better option.
  3. No.
    1. If your data is not encrypted, or
    2. Do you need to process the data serially?
    3. Do you need the output of each step, or only the previous step?
  4. Can you process the 100 bytes in parallel?
    1. Depending on the size of the payload and the number of subsequent operations you need to perform on it, recognize that it may be cheaper to extract the needed information about the data, store it in an in-memory datastore or S3 (whatever), and then reference and retrieve it for the remaining usages (see the sketch after this list). What it all boils down to is being smarter with your data at the outset.
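
Here is a rough sketch of that last point, with an assumed staging bucket and the payload assumed to be a small JSON document: verify and decrypt once, stash the payload in S3, and hand downstream steps a lightweight reference plus only the fields they actually need.

```python
import hashlib
import json
import uuid
import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

BUCKET = "stream-staging-bucket"  # hypothetical bucket

def stage_record(ciphertext: bytes, expected_sha256: str) -> dict:
    # Verify and decrypt the record exactly once...
    plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
    if hashlib.sha256(plaintext).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")

    # ...then park the payload in S3 and pass downstream a small reference,
    # instead of shuttling the full payload through every hop in the pipeline.
    key = f"staged/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=plaintext)

    meta = json.loads(plaintext)  # assumes the payload is a small JSON document
    return {"s3_key": key, "type": meta.get("type"), "size": len(plaintext)}
```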

If you have a very simple data flow that you can glean from the first few bytes of data, you can augment your flow accordingly. The greatest problem with this type of data processing at scale is balancing the ability to easily change the workflow operations against maintaining performance.
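
As a toy illustration (the one-byte type prefix and the handlers are assumptions, not a real protocol), routing on the leading byte can stay this small, which is what keeps the workflow cheap to change:

```python
def handle_metrics(body: bytes) -> None:
    print("metrics:", len(body), "bytes")

def handle_audit(body: bytes) -> None:
    print("audit:", len(body), "bytes")

# Hypothetical one-byte type prefix on each record; real framing will vary.
HANDLERS = {0x01: handle_metrics, 0x02: handle_audit}

def route(record: bytes) -> None:
    # Peek only at the first byte; everything after it is passed through
    # untouched, so handlers can change without the routing layer caring
    # about payload shape.
    handler = HANDLERS.get(record[0])
    if handler is None:
        raise ValueError(f"unknown record type {record[0]:#x}")
    handler(record[1:])

route(b"\x01" + b"cpu=0.42")
```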

I’ve done some extensive R&D with Scala and some expression languages. I’ve found that if your plan of action is well defined in advance, even dynamic code may be able to derive how best to execute your steps to maximize efficiency and minimize operating costs.

You may be familiar with IoC, the Inversion of Control design paradigm. It is commonly used in systems to define how to handle dependencies: instead of attempting to populate all of its needed dependencies itself, a service declares them and relies on the underlying infrastructure to keep its contract, much like you’d expect of an API. This same technique may be utilized for data processing as well. Imagine if you were to indicate that a given method accepts unencrypted data, and that the data must be provided sequentially. You don’t want a step to dictate exactly where it must sit in the pipeline, but it is logical for it to indicate what it may and may not do. For instance, a step that converts data from one format to another should ideally have one input and one output. Not only can the number of possible inputs be defined, but also other matters that determine data flow. This knowledge can be used to decide the more ambiguous or unspecified matters based on the best contextual decision.
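
A minimal sketch of what such declarations could look like, with made-up field names and an intentionally naive planner that only derives a couple of decisions from them:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Step:
    # What a step declares about itself, rather than where it sits in the pipeline.
    name: str
    fn: Callable[[bytes], bytes]
    needs_plaintext: bool = True   # must the data be decrypted before this step?
    ordered: bool = False          # must records arrive sequentially?
    inputs: int = 1                # e.g. a format converter: one in, one out
    outputs: int = 1

def plan(steps: List[Step]) -> List[str]:
    # Derive scheduling decisions from the declarations alone; a real planner
    # would also choose shard fan-out, batching, key usage, and so on.
    decisions = []
    if any(s.needs_plaintext for s in steps):
        decisions.append("decrypt once, before the first step that needs plaintext")
    if all(not s.ordered for s in steps):
        decisions.append("records may be fanned out across workers in any order")
    return decisions

to_json = Step(name="csv-to-json", fn=lambda b: b)  # stand-in converter
print(plan([to_json]))
```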

I’m inclined to approach this sort of stream processing as a transport, not a solution. In general, I feel that AWS services are better looked at as infrastructure rather than fully developed solutions. Depending on the size of your organization and the speed at which you need to move, you may or may not choose to embrace this notion. I know that there are large organizations which use almost nothing but EC2 instances. I believe there is a healthy balance between using AWS (or any cloud vendor, for that matter) merely as a data center replacement and leaning on its high-level services. When you take scalability and uptime over features as a given, you really need to ask yourself: is that really the trade-off? Is there a way you can build a more versatile solution that is just as reliable, or close enough?

Please take this to heart: more nines don’t mean much if you have plenty of bugs, poor security, and other wonderful problems. Don’t let stupid metrics be the primary driver of how you budget your time and money. Prioritize your problems; you don’t have to be the cool kid in class with the biggest numbers.