AMQP + Document Database = ZebraMQ

Part 1

Imagine taking the power and simplicity of AMQP and combining it with a document store. Doing so, you can create a method for message routing far more flexible than a hierarchical dot-notation routing key.

Let’s say you have a sports team organization system, one with different divisions and teams within each division. There are administrators of the league, administrators of each division, and then the coaches of each individual team. You want users who are coaches to be able to communicate with their players as well as with all of their superiors. You don’t want a coach to communicate with players outside of their team, and you want to allow coaches to communicate with their fellow coaches.

With a traditional routing key, having a coach message his players would involve specifying the division, dot, the team, and then indicating that the recipients are players rather than coaches or parents.

Let’s say instead that the publish routing key could be specified as an embedded JSON query, like this:

{
    "type": "user",
    $or: [
          {
              "division": "orange",
              "role": "administrator"
          },
          {
              "teams.the-hawks": {
                   "role": {
                        "$in": [
                             "player",
                             "captain"
                        ]
                   }
              }
          }
    ]   
}

 

That would include the orange division administrators and the players and captain of the Hawks team, excluding non-players on that team. The subscribers’ data entries might look like this:

{
    "firstName": "Jon",
    "lastName": "Smith",
    "type": "user",    
    "teams": {
          "the-hawks": {
            "role": "player"
          }
     } 
}
 
{
    "firstName": "Tom",
    "lastName": "Klein",
    "type": "user",    
    "teams": {
          "the-hawks": {
            "role": "captain"
          }
     } 
}
 
{
    "firstName": "Arnold",
    "lastName": "Palmer",
    "type": "user",
    "division": "orange",
    "role": "administrator"
}

 

This is all valid MongoDB-style querying, which I think is a very simple and readable way of expressing a query. The idea is that instead of having the application fetch the records matching a query and pass them along to the message bus, it would be far more efficient to have the two systems indexed and integrated together. Keeping the data about who receives published messages present and indexed, and publishing only the query used to pull the corresponding records, would provide a whole new way of utilizing message queues.

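As a rough sketch of the idea, here is how a broker-side fanout might look if the subscriber records lived in MongoDB. The collection name, the queue-per-subscriber convention, and the deliver() helper are assumptions for illustration, not an existing API:

from pymongo import MongoClient

client = MongoClient()                   # assumes a local MongoDB instance
subscribers = client.zebra.subscribers   # hypothetical collection of subscriber documents

def deliver(subscriber_id, message):
    # Stand-in for enqueueing onto the subscriber's AMQP queue.
    print(f"deliver to {subscriber_id}: {message}")

def publish(routing_query, message):
    # The routing query is a plain MongoDB filter document, such as the
    # $or query above. The broker resolves it against indexed subscriber
    # records instead of matching a dot-notation key.
    for doc in subscribers.find(routing_query):
        deliver(doc["_id"], message)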
 

I see this working somewhat like the way ElasticSearch uses Lucene under the hood, except here we would use a document store with RabbitMQ under the hood, mostly as a tool for publishing messages to a complex set of users or other endpoints. Theoretically, lambdas could also be used as subscription callbacks, providing a way to subscribe to something and store how the subscriber responds, all within the message system’s context.

That would be very similar to how HDFS utilizes MapReduce: the data never has to leave the message system for its callback function to be processed. Obviously the callback can be application-based instead, but keeping it inside the system would yield a tremendous performance gain, much like stored procedures in an RDBMS.

Part 2

Messaging. Queue. Data.

My initial plan for prototyping the bare minimum functionality was to create a RabbitMQ plugin that could plug into a data store. I still think that having a message bus tightly coupled with its data store will be the most performant option. The roots of AMQP are in finance, where the most important requirement was high performance. If I am going to keep flying the AMQP flag, I want to maintain performance as a priority, while allowing additional configurable options that may tip the scales: nearly negligible variances in performance in exchange for tremendous gains in security and overall functionality.

Instead of building a message bus, or a data store, or both, I would build an abstraction that sits on top of the two (or more), essentially creating a producer/consumer bus that allows for integrating services. All “endpoints” are authenticated and identified by set credentials. Endpoints use AMQP (and optionally HTTP or STOMP on top) to hook in and produce/consume.

Once authenticated, endpoints may publish and subscribe to “messages”. Data stores may be queried directly by publishing a request whose message criteria specify the data store in question (just a routing queue/key parameter) and the corresponding query.

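For example, a direct query request might look something like this (the envelope fields “datastore”, “query”, and “reply-to” are hypothetical, chosen only to illustrate the shape of such a message):

{
    "datastore": "users",
    "query": {
        "division": "orange",
        "role": "administrator"
    },
    "reply-to": "queue.my-results"
}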
 

Optionally, the notion of subscribing to a data store for specific data changes is also a possibility. Depending on the implementation, the system may poll at intervals to check, or, more likely, it will know changes are happening when it receives messages that alter data matching the subscribed criteria. This could be an extremely awesome feature. Note, though, that if the data store is modified by means outside of the application’s context, the system will be unaware of those changes!

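A minimal sketch of that second approach: intercept each write message and test it against the subscribed criteria before (or while) it is persisted. The matches() helper below handles only equality and $in, purely for illustration:

def matches(criteria, doc):
    # Tiny illustrative matcher: equality and $in only.
    for field, cond in criteria.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True

subscription = {"division": "orange", "role": "administrator"}
incoming = {"firstName": "Arnold", "division": "orange", "role": "administrator"}
if matches(subscription, incoming):
    print("notify subscriber")   # stand-in for publishing a change event

Because the matching happens on the write path, no polling is needed; the trade-off is exactly the caveat above, that out-of-band writes bypass it.
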
Endpoints

Depending on the actual AMQP specification, it may be possible to use a standard AMQP client implementation (0.9 or 1.0) to communicate with Zebra. The routing query would need to be serialized; perhaps it can simply ride in the headers. A REST API would be the simplest to implement.

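For instance, with a stock AMQP 0.9.1 client such as pika, the routing query could ride along as a JSON string in a message header (the "zebra-routing-query" header name and the "zebra" exchange are assumptions, not an established convention):

import json
import pika

query = {"division": "orange", "role": "administrator"}

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_publish(
    exchange="zebra",                    # hypothetical Zebra-facing exchange
    routing_key="",                      # unused; the header carries the query
    body=b"practice moved to 6pm",
    properties=pika.BasicProperties(
        headers={"zebra-routing-query": json.dumps(query)},
    ),
)
connection.close()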
 

Security Options

If data requires security restrictions that keep messages of different “types” from being mixed with different data, there are several possible methodologies for handling this.

1. Public key / private key: a key pair may be used to encrypt all data. To send certain types of data, a separate key is needed for each level of security. The message data is encrypted against the security signature’s unique public key. Before the data is delivered, the application checks each recipient’s security access to see if they have access to this “security signature”; if they do, the data is decrypted and sent over a standard SSL connection to the recipients (a sketch follows after these options).

2. Multi-phase queue: routing queries may be specified in multiple phases. A queued phase essentially filters messages against its criteria and passes the matching messages on to the phase holding the next set of criteria. Each phase is essentially a separate exchange and therefore may be configured with all of the standard options (durable, auto-delete, declare).

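As a sketch of the first option using Python's cryptography library (key management and the signature lookup are assumed away here, and in practice one would wrap a symmetric key rather than the payload itself):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Key pair standing in for one "security signature" level.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# Publisher side: encrypt against the signature's public key.
ciphertext = public_key.encrypt(b"sensitive payload", oaep)

# Broker side: only recipients cleared for this signature get a decrypt.
recipient_has_access = True              # stand-in for the broker's access check
if recipient_has_access:
    plaintext = private_key.decrypt(ciphertext, oaep)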
 

Federated Data Storage and Access

Zebra may be plugged into anything that implements its interfaces. Data stores are special, but they are endpoints of Zebra as well; therefore they may subscribe to messages and publish messages. Complex data queries may be performed that merge and aggregate data from multiple data stores, accomplished by querying the stores and providing a set of standard tools. Beyond what is provided, you may implement “middleware”: endpoints that subscribe to common data tasks and analysis, similar to stored procedures, letting you create pluggable endpoints that you can query against.

Part 3

Standard RPC calls through AMQP (in the RabbitMQ implementation) pass a correlation-id and a reply-to field in the message headers. The reply-to field indicates which queue should receive the corresponding response. In reality we don’t need to send that message back to its origin; if it is part of a sequence of events, we can instead direct it to the next step of the sequence.

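The baseline mechanism looks like this with a standard client such as pika (the queue names are illustrative):

import uuid
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Classic AMQP RPC: the consumer of rpc.requests sends its result to
# whatever queue reply_to names. Nothing forces reply_to to be the
# caller's own queue; it could just as well name the next step.
channel.basic_publish(
    exchange="",
    routing_key="rpc.requests",          # illustrative request queue
    body=b"do the work",
    properties=pika.BasicProperties(
        reply_to="sequence.step-2",      # next step instead of the origin
        correlation_id=str(uuid.uuid4()),
    ),
)
connection.close()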
 

Using reply-to together with multi-phase queues, we can build a very powerful conditional workflow and map-reduce system. The notion of multi-phase queues is an interesting one I came up with: take a routing query and assign it a level of priority, allowing multiple phases to run sequentially. Phases may be nested with conditional criteria; when a phase’s criteria evaluate to true, the next phase is evaluated.

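A sketch of a phase worker under this model: it consumes messages in its phase and, rather than replying to the origin, forwards the message to the next phase's queue, only honoring reply-to at the end. The header layout (a JSON-serialized "routing-keys" list plus a "phase" counter) is an assumption for illustration:

import json
import pika

def on_message(channel, method, properties, body):
    # Criteria evaluation is elided; assume this worker's phase matched.
    phases = json.loads(properties.headers["routing-keys"])
    current = int(properties.headers.get("phase", 0))

    if current + 1 < len(phases):
        # More phases remain: forward to the next phase's queue.
        channel.basic_publish(
            exchange="",
            routing_key=f"zebra.phase.{current + 1}",   # illustrative naming
            body=body,
            properties=pika.BasicProperties(
                headers={**properties.headers, "phase": current + 1},
                reply_to=properties.reply_to,
                correlation_id=properties.correlation_id,
            ),
        )
    else:
        # Final phase: only now does reply-to receive the result.
        channel.basic_publish(exchange="", routing_key=properties.reply_to, body=body)
    channel.basic_ack(delivery_tag=method.delivery_tag)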
 

If you want to winnow your data down in a clean, automated way, you could specify the initial criteria in the primary routing query. Then, for the next query, you can specify the more specific criteria, knowing that all of the remaining results are viable.

Let’s say you have a web site that buys and sells high-value automobiles, and you use two separate databases. One has a listing of all automobiles, with a wealth of information and history on each make and model. The second has only the buyers’ and sellers’ automobiles, listing what is for sale or sought for purchase. Depending on traffic and inventory, there may be more buyer/seller automobiles than automobiles ever produced, or the exact opposite. The point is to first narrow down the automobiles by the actual criteria of the car’s specification, and only then deal with whether the automobile is actually available. It all depends on the question the user is trying to answer: is it “which car do I want?”, or “which cars that I want are available?”

HEADERS:

"routing-keys": [
     {
          "brake-horsepower": {
               "$gte": 500
          },
          "makes": {
               "$in": [
                    "Ferrari",
                    "Lamborghini",
                    "Maserati",
                    "Aston Martin"
               ]
          }
     },
     {
          "price": {
               "$lte": 250000
          },
          "condition": "new"
     },
     {
          "location": {
               "$lte": [100, "miles"]
          }
     }
]

 

This example is dealing with cars. The first set of criteria filters all messages down to those with brake horsepower greater than or equal to 500 and one of the specified makes. After that, the next thing the recipient of the message does is publish a new message whose routing query specifies that the price must be less than or equal to $250K and that only a new car will do.

The second message alerts all owners of cars matching the secondary criteria that someone may be interested in purchasing their car. The third and final criterion says that I would really like the option of seeing the car in person, so it must be located within 100 miles of where I live. The final result returns to the original publisher, who may then view the final list.

As a message makes its way along its path, it may accumulate data at each phase and transformation. Beyond determining the message’s next hop, each step may attach new data, and all of the changes made along the message’s journey remain associated with it. This provides a way to gather information throughout a workflow process. There is a great deal of flexibility in this, which we will explore shortly.

Let’s say I have a piece of data that I want analyzed to determine whether it needs to be reported to the system administrator. Perhaps the analyzers pass their analysis along to be fed into the next analyzer, or maybe it is attached to the final result.

Perhaps the first analyzer detects the application’s type. If the application’s type falls into one of two categories, it is routed to the corresponding category. Then let’s say we want the application tested in a sandbox, where the function calls it makes are examined. Finally, we take the list of URLs called by the application, and if any of them appear in a supplied list, the application is returned to the user for further evaluation. The user may want to see the gathered data alongside the fact that it matched the final criteria. This is very similar to a traditional map-reduce pattern used in distributed computing systems.

Each link along the chain may inject its results into the “route-data” header field, and “route-data” may itself be used in routing queries. For instance, say you are looking for an application that is a PE (Portable Executable) file compiled for a 64-bit architecture. The next routing query would be:

{
    "$routeData.$0.fileType": "PE",
    "$routeData.$0.architecture": "64-bit"
}

One possible usage of “$routeData”, as shown here, is to check whether any of the accumulated entries in the “route-data” field satisfy a supplied condition, in this case the findings recorded by the first route.

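A sketch of the producing side, where the first analyzer records its findings for later phases to query (the header layout is again an assumption):

import json

def record_route_data(headers, phase_index, results):
    # Append this hop's findings to the accumulated route data.
    route_data = json.loads(headers.get("route-data", "[]"))
    while len(route_data) <= phase_index:
        route_data.append({})
    route_data[phase_index].update(results)
    headers["route-data"] = json.dumps(route_data)
    return headers

headers = {}
record_route_data(headers, 0, {"fileType": "PE", "architecture": "64-bit"})
# A later phase's routing query can now match:
# {"$routeData.$0.fileType": "PE", "$routeData.$0.architecture": "64-bit"}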
 

Zebra is a fresh approach to combining AMQP-style messaging with document-based databases. AMQP remains the wire transport protocol, and Zebra sits on top of the database and messaging layers as an abstraction. It translates queries into sets of subscribers and then publishes the appropriate messages to each corresponding subscriber’s queue.

Smart Captcha

There is an alternative to the popular Google reCaptcha (https://www.google.com/recaptcha) that I used 10+ years ago, called Text Captcha (try http://api.textcaptcha.com/cp.json for a demo). It asks simple language questions that require minimal effort to answer, but somehow I greatly prefer it over those terrible images I can never seem to understand.

The way to “challenge” this sort of captcha would be through natural language processing. The question variations aren’t that hard, so I’m sure it wouldn’t be too challenging to break. I was thinking, however, of combining this with the now defunct Google Image Labeler (http://en.wikipedia.org/wiki/Google_Image_Labeler), which I used to play once in a while many years ago. Let’s take that game and its logic to the next level. Imagine a series of three images that, put together, convey a theme. For instance: turkey + leaves + football = Thanksgiving. Yeah, that was corny, but it is not a simple thing to crack. It requires understanding what each image may possibly mean (in all contexts) and then narrowing that understanding to the perceived intent given the accompanying images.

An alternate type could be “complete the sequence”: given ‘horse and buggy’, ‘train’, ?, the answer would be an automobile. The sequence, of course, is forms of ground transportation from the past to the most modern.

The two options I have given are not friendly to the visually impaired, but that doesn’t rule them out as a possibility for the majority of the population. I think a more complex text captcha is also possible, but that would still be vulnerable to a simpler NLP attack.

Authentication: The real me

Authentication, as we use it in the security world, obviously comes from the word “authentic”, meaning genuine. Today, the most common means of authentication simply fulfill an already established contract with secret information that only the account owner should possess. This gives no insight into whether the user who initially established the account (let’s call him Bob) is in fact the one logging in, or whether Alice, a third-party listener, has somehow obtained Bob’s credentials. Since Bob’s credentials may be just a username and password pair, the only thing protecting the account is possession of this secret information.

You haven’t missed anything: all I have said is that modern authentication relies on secret or private information that only the account holder should possess. What if we were able to actually establish that Bob is the same Bob who initially created the account, not through knowledge of a simple username/password pair or a selected picture and the like? Rather, what if Bob were somehow able to expose his likes, dislikes, habits, tendencies, interests, etc., and this information were used not to verify that Bob knows his password, but that Bob is Bob?

The closest thing I have seen that somehow utilizes this happens during credit checks. The questions are almost entirely address-verification-related multiple choice questions: did you ever live on street X?

Imagine you have a browser extension installed. You read an article that makes you “sad” or makes you “happy”. You click on your “authentication” icon and use an emoji to express what you feel, or perhaps a few tag keywords to convey your response. This data may be collected and used to build a statistical model for generating verification questions for an authentication service. Your web history, emails, shopping tendencies, the music you listen to: all of this can provide data to help paint a picture of “how” you think and how you react.

Pandora does this to provide you with recommended music: the more you show what you like and dislike in the recommendations, the more you train your station. Imagine if, the next time someone logged in to Bob’s account, he were asked a couple of questions about which songs in a list he would prefer to listen to. This is not ironclad, and different types of data will have different results and predictability.

This is a very general idea, but ultimately, just as we can now use Facebook and Google to log into our accounts, imagine if we used our Pandora, Netflix, and Amazon accounts as seed data for questions that verify our identity!

I am about to post another idea about CAPTCHAs, but look at this post: http://en.wikipedia.org/wiki/Google_Image_Labeler. I remember when this game was around and Google used people to help train/verify its image search results. I could certainly see a game such as this being used both to train and to verify a user’s identity.

This can become a great deal more abstract. I recently read about efforts in keystroke analysis that identify users by their unique typing signature: http://www.washingtontimes.com/news/2015/jan/24/true-cybersecurity-intelligent-computer-keyboard-i/

A bit of a rant, but I think there are some solid ideas here.

Protect your data! – The cure to identity theft

Today identity theft is a very real threat. Many unprotected pieces of information can be used to identify oneself, with little or no internal protection, and the worst part is that we are not in control of our own data. Once we have given information to a third party like an insurance company, a bank, or a utility company, we have little or no control over what happens to it. We are not able to say, “Hey, Mr. Blue Cross, I want to terminate my service with you because I don’t think you secure my information properly.” Sure, you can cancel your service with the company, but what happens to your data?

Recently I opened a checking account with a bank and was surprised when I was told I was done. I’m used to having to sign a digital pad, which I was always told was used for verification purposes. I inquired, “Don’t you need my signature to use for verification?” She responded, “No, checks are no longer verified with their signature.” That was the last straw: I might as well put my checking account information on a billboard to make it easier for my hard-earned funds to be stolen!

I used to have the mindset that if a check was lost or stolen, I was somewhat protected since the check had to be signed. I learned several years ago that even this thinking was a farce: all you need is an ACH account, my routing number, and my account number, and that’s it. Who came up with this system? Were they so trusting that they assumed no check or account would ever be compromised? The financial and private information systems in our country (and likely internationally) need to be updated to handle the real threats we encounter daily!

It sickens me that my email, with its two-factor authentication, is better protected than my bank account! I hear about companies being compromised on a frequent basis, with credit card information and other sensitive data stolen. This is ridiculous. Are you expecting me to believe that in our huge government there isn’t some organization or committee making sure that companies storing this sort of information actually do what is necessary to protect it? Stupid me for making that assumption. Unfortunately, I don’t have a choice: unless I want to be Amish or completely off the grid, some organizations are going to have my sensitive information, and it may be compromised at any moment.

Something needs to be done…now!

Let’s start with the biggest security hole I know of: the Social Security number. Since 1935, this nine-digit number has been used to tie American citizens to their Social Security accounts. For some reason it became the de facto primary means of identification for US citizens: you just rattled off your nine digits to companies, and you were who you said you were. For verification purposes companies usually ask for only the last four digits of your SSN, but they sometimes ask for the entire number as well. Since the SSN was created only as a means of identifying yourself for your Social Security account, it probably was never intended for the widespread usage we see today. There isn’t any other security standard that remains unchanged after almost eighty years.

Pretend for a moment that we have just been tasked with designing a system for American citizens to use as a universal identifier: a common format, something most of the population can use. What would that system look like? Let’s design something right now.

One nice thing about the SSN is that it is only nine digits long and fairly easy to remember. Today, with a much larger population, we would probably want a longer sequence, and we would expand the key-set to include letters as well as numbers. One more thing: let’s also make it case sensitive, nearly doubling the key-set again. Let’s see what our hypothetical ID could look like:

Fy32-h26H-K02D-xM4r (four groups of four characters, 16 characters total)

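Generating such an ID is straightforward; here is a sketch using Python's secrets module (the group format is simply the one shown above):

import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # 62 case-sensitive characters

def generate_id(groups=4, group_len=4):
    # 62^16 possible IDs, roughly 4.8 x 10^28, versus 10^9 for nine digits.
    parts = ["".join(secrets.choice(ALPHABET) for _ in range(group_len))
             for _ in range(groups)]
    return "-".join(parts)

print(generate_id())   # e.g. "Fy32-h26H-K02D-xM4r"
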
Even guessing this ID would be nearly impossible, but given enough time anything is possible. Now let’s add a second factor of authentication to prove that you aren’t just an eavesdropper who got hold of the ID sequence. Let’s require an eight-digit numeric PIN code that must be changed at least once every three months:

35-26-65-16 (Yeah, it’s annoying, but so is having someone steal your identity.)

Okay, let’s take it one step further and incorporate an additional security mechanism that is becoming commonplace in many security setups: a third factor of authentication. Very commonly this is a smartphone application set up with time-based synchronization, yielding a known sequence every n seconds that is unique to your account. This is an ever-changing PIN, based on some initial secret value, that produces a pseudo-random sequence used for verification purposes.

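That third factor is essentially TOTP (RFC 6238); here is a minimal sketch of how such codes are derived from a shared secret:

import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, period=30, digits=6):
    # Derive the current code from the shared secret and the time window.
    key = base64.b32decode(secret_b32)
    counter = int(time.time()) // period
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))   # demo secret; a new code every 30 seconds
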
Okay, so now we have a much stronger identifier, a known password, and a constantly changing third sequence. Are we safe? No, not even close; we are potentially just as vulnerable as with the current SSN scheme. First, how would a third party validate your PIN and third factor of authentication? That presents a problem. The reason the SSN was so easy was that all you needed was for the user to recite their number and compare it to the one on record; if it matched, he must be who he says he is.

We have two major issues and one minor one. The first major issue is that the third party may still store your sensitive data insecurely. The second is: how can that third party validate your sensitive data at all? The minor issue is that we have put a lot more “stuff” on the account holder to remember: a longer ID, a PIN, and some method for the third factor of authentication. People are going to complain about that!

The third party should not need to possess the “plain text”, unprotected form of the ID. We said earlier that there were two basic reasons the SSN was used: first, it is something most people possess; second, it is unique, guaranteeing that no two people have the same ID, so it serves as an easy way to uniquely identify a given user. Knowing the plain-text number may be needed at times, but really, the third party just wants an easy user-verification mechanism.

I won’t claim that every American has a smartphone, but between smartphones and computers, it’s a good bet that most Americans have one or the other. Let’s leverage that assumption to secure our user’s ID. Some banks now offer a “virtual credit card”: a pointer to a real credit card, with the benefit that the user never gives the merchant their actual card number. Should the consumer wish to terminate the relationship, they can unlink the virtual card and immediately prevent the merchant from charging the real one.

Let’s apply this same logic to our ID system. Imagine I call up a company whose services I want to use. They need to input some information about me into their system, so they ask for my ID number. Instead of giving them my real number, I hop onto my phone, open the “ID” application, and click “generate new ID”. The representative on the phone gives me the company’s ID number, which uniquely identifies them as a company. I punch that number into the application, enter my PIN, and verify the third factor of authentication. Now a unique ID has been created just for this organization.

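One way such per-company virtual IDs might be derived (purely a sketch; nothing in the scheme above mandates this construction) is to HMAC the real ID together with the company's ID under a secret held only by the central system, so the virtual ID reveals nothing about the real one and can be revoked independently:

import hashlib
import hmac
import secrets

# Link key held only by the central ID system, never by the company.
link_key = secrets.token_bytes(32)

def virtual_id(real_id: str, company_id: str) -> str:
    # Deterministic per-company alias of the real ID. Revoking it just
    # means deleting the stored link, leaving the company a dead reference.
    msg = f"{real_id}:{company_id}".encode()
    digest = hmac.new(link_key, msg, hashlib.sha256).hexdigest()
    return "-".join(digest[i:i + 4] for i in range(0, 16, 4)).upper()

print(virtual_id("Fy32-h26H-K02D-xM4r", "ACME-0042"))   # hypothetical IDs
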
The application lets me know what type of information the company is requesting and at what level of access. I can choose to grant access, or deny it and hang up the phone. Some companies may only need a verification mechanism: in the future, the user would just open his app, enter his PIN and third-factor validation, and give the company representative the ID plus a corresponding validation code. This enables the company to unlock that specific ID for verification purposes.

Without the verification code, the company holds only a unique ID that is merely a virtual reference to yours. You can review your ID’s activity, seeing every time the company attempts to access your information, what they viewed, and when. At any time you can revoke the virtual ID and suspend their access to your information until you grant it once again.

Within the scope of the granted ID, the company may request certain types of information without needing a new verification code. For instance, a company that wishes to send you a bill in the mail shouldn’t need to call you for a verification code every time it retrieves your address. That access, however, will be audited and viewable by you at any time. The sensitive information still will not be stored on the company’s servers; they will use the ID and their verification code to request the information from a centralized system when needed. An alternative is some sort of locally encrypted, LastPass-style concept; we can work on this.

Well, this was fun; we can brainstorm some more later. Things need to change: the crazy people who like to mess with other people’s lives aren’t going to stop. We need to step up and protect our data.