All posts by Dovid Kopel

Java Scripting Vulnerabilities

Recently I had to look into processing user-supplied expressions that would enable a user to insert control-flow logic into a workflow based on whatever criteria they wish. We are already using JBoss's jBPM, an implementation of BPMN, a standard workflow schema. It is a stable, trusted and widely used tool in the enterprise community.

jBPM allows the workflow to have "constraints" where you may inject variables, or really whatever you wish, and supply either Java code or MVEL expressions to evaluate and ultimately control the flow of the workflow. MVEL is essentially a subset of Java and provides a nice "scripting" feel while maintaining access to all Java objects and classes without the verbosity of Java. With that said, both Java and MVEL are just as susceptible to malicious code as if you were running that code directly rather than in an interpreter. My first thought was that, either on the MVEL layer or on the jBPM layer, there were likely some restrictions added in to safeguard the application and the underlying system. I was wrong. I was able to invoke Runtime.getRuntime().exec("rm -rf /"), which will obviously yield some pretty devastating results. I'm assuming that jBPM was never intended to allow external parties to invoke their own code. However, that is precisely what I needed to accomplish without sacrificing the security of the system.
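
To illustrate the exposure (a minimal sketch, not the actual jBPM constraint plumbing; the expression string stands in for hypothetical user input, with a harmless command substituted), evaluating an untrusted string with MVEL is enough to reach the full JDK:

import org.mvel2.MVEL;

public class MvelExposureDemo {
    public static void main(String[] args) {
        // Pretend this string arrived from an end user's workflow constraint.
        String userExpression = "Runtime.getRuntime().exec(\"touch /tmp/pwned\") != null";

        // MVEL resolves java.lang.Runtime with nothing in the way.
        Object result = MVEL.eval(userExpression);
        System.out.println("Expression evaluated to: " + result);
    }
}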

I looked into using Rhino (https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino, https://docs.oracle.com/javase/7/docs/technotes/guides/scripting/programmer_guide/), which is the standard scripting engine available in Java 7. Rhino has now been taken out of the spotlight, as Java 8 offers the new and improved Nashorn JavaScript engine. I didn't evaluate Nashorn regarding security concerns, since the project this was related to was limited to Java 7. I was able to invoke a slightly different command than I did with Java/MVEL. With Rhino I needed to provide the entire package name, but the outcome was ultimately the same: I was able to invoke the exact same commands as I did earlier.
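
For example, the stock Java 7 scripting API exposes the same capability as long as the fully qualified package name is used (again a sketch, with a harmless command substituted):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class RhinoProbe {
    public static void main(String[] args) throws ScriptException {
        // On Java 7 the default "JavaScript" engine is Rhino.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");

        // Unlike MVEL, the fully qualified package name is required,
        // but the host system is just as reachable.
        engine.eval("java.lang.Runtime.getRuntime().exec('touch /tmp/pwned')");
    }
}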

I looked around and found this blog post (http://maxrohde.com/2015/08/06/sandboxing-javascript-in-java-app-link-collection/), which led me to a simple little library (https://github.com/javadelight/delight-rhino-sandbox). It combines a number of techniques, including Rhino's initSafeStandardObjects() method, which removes the Java bridge entirely. The library also provides some easy ways to still inject desired objects and data.

The result is that with a little bit of code I am able to invoke sandboxed JavaScript within the jBPM constraint evaluation, with the user-supplied script written in JavaScript. All I have done is have the constraint evaluation delegate to a separate, sandboxed evaluation that returns a boolean value indicating whether or not the corresponding step should be invoked. In reality this sort of logic would not be limited to JavaScript via Rhino. Once we separate the execution from reliance on what jBPM supports, there isn't anything stopping us from supporting any scripting language that may be invoked in the JVM or on the host machine.
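
A rough sketch of the approach, using Rhino directly (the class and method around it are my own illustration; initSafeStandardObjects() is available in recent Rhino releases and is what the sandbox library builds on):

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ScriptableObject;

public class SandboxedConstraint {

    /**
     * Evaluates a user-supplied JavaScript expression and returns its boolean
     * result, without exposing the Java bridge (Packages, java.*, etc.).
     */
    public static boolean evaluate(String userScript) {
        Context cx = Context.enter();
        try {
            // Standard JavaScript objects only -- no LiveConnect, so no java.lang.Runtime.
            ScriptableObject scope = cx.initSafeStandardObjects();
            Object result = cx.evaluateString(scope, userScript, "constraint", 1, null);
            return Context.toBoolean(result);
        } finally {
            Context.exit();
        }
    }
}

The constraint then just calls something like SandboxedConstraint.evaluate(script) and uses the returned boolean to decide whether the step fires.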

The end result is that we were able to leverage the stability and power of jBPM without sacrificing our system's security.

Multi-model database Evaluation

I have been evaluating various databases recently for a project at work. Without going into the details of the product, all you need to know is that there are several types of data and performance is a high priority. We have a very large amount of data and varied data types. We had been using PostgreSQL for holding relational data, Riak for holding most of everything else, and ElasticSearch for handling full-text search. I want to focus on multi-model databases, but I cannot overemphasize that there is no such thing as a one-size-fits-all database. A database is built for specific uses, specific data types, and specific data sizes. You may need multiple storage engines (the term I am going to use instead of database) to handle your data requirements. The problem you run into is that as the number of storage engines increases, your system becomes more complex. You may have data stored in one storage engine that requires some sort of decimation against data from another storage engine. This will result in performance issues and general complications. With all of this being said, there may be times when a single storage engine solution is viable and appropriate.

Neo4j is a popular graph database that is well respected and widely used. It is, however, a graph database, not a document store, and may not handle large data as well as key-value stores that are intended for large storage. Neo4j is pretty fast and distributed. It uses a visually descriptive query language called Cypher. I happen to find that Cypher is very easy to use and understand, even having spent most of my life with SQL and plenty of MongoDB and other NoSQL databases. For a graph database it more or less just made sense. It is supported by Spring Data, which makes using it easier. If you are using it for commercial purposes, be prepared to break open the piggy bank, as Neo4j does not come cheap. You may be curious why I am bringing up Neo4j, as it is not a multi-model database. The only reason is that it is the de facto standard for graph databases. It really treats the graph database as a much more intuitive way to persist your data, not as an odd plugin bolted onto another storage engine. It is truly data stored through normalization, as opposed to the de-normalization in your traditional RDBMS.

OrientDB looked at Neo4j and traditional RDBMSs and thought it could do better. OrientDB claims to truly replace your RDBMS with greater flexibility and scalability. It sits on top of a document store, which sits on top of a key-value store, enabling you to store your entities and relate them. Storing complex nested data is possible. One of the coolest features I encountered was its object-oriented class hierarchy, allowing for polymorphism when querying entities.
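
A sketch of what that looks like with the OrientDB 2.x document Java API, written from memory (the Vehicle/Car/Truck classes are invented for illustration):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OSchema;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

import java.util.List;

public class OrientInheritanceDemo {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("memory:demo");
        db.create();
        try {
            // Build a small class hierarchy: Car and Truck both extend Vehicle.
            OSchema schema = db.getMetadata().getSchema();
            OClass vehicle = schema.createClass("Vehicle");
            schema.createClass("Car", vehicle);
            schema.createClass("Truck", vehicle);

            new ODocument("Car").field("make", "Ferrari").save();
            new ODocument("Truck").field("make", "Volvo").save();

            // Querying the superclass returns instances of every subclass.
            List<ODocument> vehicles =
                db.query(new OSQLSynchQuery<ODocument>("select from Vehicle"));
            System.out.println("Vehicles found: " + vehicles.size()); // 2
        } finally {
            db.drop();
        }
    }
}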

Unfortunately, OrientDB has its share of problems. Don't even waste your time with their distributed solution. Their documentation is poor and they have many open GitHub issues. They have an awesome product deep down in there, but there are a lot of rough edges that need to be smoothed out.

After evaluating OrientDB, we looked at ArangoDB, which seems to be on a similar trajectory to OrientDB but more focused on stability and less on features. We passed on Arango because it lacked a binary protocol, support for binary data (without base64-encoded bloat), and a decent Java API. Their Java API was simply awful. I couldn't even find Javadocs, let alone a nicely designed API.

As of right now our plan is to partially swap our key-value store with Orient and slowly but surely evaluate it. We are being cautious due to the volatile nature of Orient and do not want to put all of our eggs in one basket. If we are successful with large data storage, with both high-volume concurrent writes and reads, we will continue and port the remaining databases to Orient.

During my evaluations, when I was working with large documents (several hundred megabytes), after some time I started getting heap errors on the server side. This is a known, posted issue: large documents can cause problems. Now, I had no issues with "some" large documents. It was the heavy bombardment of concurrent writes of large documents that seemed to cause issues.

I attempted to remedy this in two basic ways. These documents were JSON. First, I tried to take the data and break it down into vertices and edges so that there wouldn't be any single large document. It did work nicely… but a single document could end up as hundreds of thousands of vertices and edges, causing the write time to be very long, several minutes! The traversal time was very good in this case, which is to be expected.

Alternatively, I tried storing the data as an attribute in the document. Another possibility was to store the data opaquely with the zero-byte document type, chunking the data into small pieces and then streaming it.
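
The chunking itself is straightforward; a minimal sketch of the idea (the chunk size and how the pieces are persisted and reassembled are left to the application):

import java.util.ArrayList;
import java.util.List;

public class Chunker {

    /** Splits a large payload into fixed-size pieces that can be stored as separate records. */
    public static List<byte[]> chunk(byte[] payload, int chunkSize) {
        List<byte[]> chunks = new ArrayList<byte[]>();
        for (int offset = 0; offset < payload.length; offset += chunkSize) {
            int length = Math.min(chunkSize, payload.length - offset);
            byte[] piece = new byte[length];
            System.arraycopy(payload, offset, piece, 0, length);
            chunks.add(piece);
        }
        return chunks;
    }
}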

I have now realized that for "larger" documents you will run into issues with Orient and will need to dish out the cash for more RAM. I have been running a large export process on a node with 30GB of RAM and -Xmx10g. The data being exported is substantial and spread across multiple locations and networks, so unfortunately latency is high. The Java process for Orient has been more or less consistent at ~25GB for the past few days, slowly rising as the data set grows. Keep in mind that Orient claims its disk cache is more important than its heap. While that may be true, Orient will throw out-of-heap-space errors and die if there isn't enough heap space. My concern was that there would be no end to the amount of memory that Orient would consume. It seems that, based on the size of the file(s), you need a minimum heap setting to keep Orient happy. I do not know how to calculate that magical number, but I can say fairly confidently that 10g hits the sweet spot in this case.

What this means for anyone using or considering Orient as a solution is that the size of the data you store must be taken into account. As of now, Orient will not gracefully trade off performance in order to keep itself going and keep those large data chunks on disk. Ideally Orient would never throw a heap error and would manage its memory better. I'm sure it uses some logic based on the frequency and assumed relevance of each record to decide at what level of memory it should sit. The size of the heap space and its current utilization need to be taken into that equation!

As of now, my project is going to be moving forward while temporarily holding off on a storage engine switch. It will be my recommendation, when the discussion comes up again (as it will), that Orient be heavily considered. With its vast feature set and excellent performance it brings plenty to the table, as long as you know that you are going to need to invest in getting things going. As of now Orient will not be a plug-and-play solution without some tinkering, extensive reading of documentation, and even some source code. With that said, I'm looking forward to using Orient on some smaller, less critical projects now and hope to use it more in the future.

Explaining Douglas Crockford’s Updating the Web

This is my attempt to extract his slides from his presentation and annotate them slightly. I write my interpretations under the headings. Watch the full talk here.

What’s wrong with the web?

Insecure

Complexity

He believes that the security vulnerabilities are due to the overall complexity.

HTTP

Key / Value Pairs

Negotiation

Request / Response

Certificate Authorities

Not trustworthy, vulnerable

HTML

Not for applications…really for describing technical documents

Templating

XSS attacks

Document Object Model

The worst API, very insecure

CSS

Awkward and not intended for application usage

JavaScript

Hot mess, it is pretty terrible but there are some good parts

Many have tried

  • Microsoft, Adobe, Apple, Oracle, many more
  • In most cases, the technology was much better
  • In most cases, the solution was not open
  • There was no transition

Upgrade the Web

Keep the things it does well.

Move on and go down a new path for the things that are still vulnerable.

HDTV transition was possible with set top box

Helper App

Used to open external protocols that were not supported by the browser. For a new protocol “web”, we will have a new way to execute applications

Transition Plan

  • Convince one progressive browser maker to integrate.
  • Convince one secure site to require its customers to use that browser
  • Risk mitigation will compel the other secure sites
  • Competitive measures will move the other browser makers
  • The world will follow for improved security and faster application development
  • Nothing breaks!

Strong Cryptography

  • ECC 521
  • AES 256
  • SHA 3-256
Built upon paranoid levels of crypto, beyond what is deemed necessary by today's standards, keeping things secure and future-proof.

ECC 521 public keys as unique identifiers

No more passwords, no more usernames. This is you.

Secure JSON over TCP

HTTP is limited and not really needed for this. JSON can be encrypted and pushed over the wire asynchronously.

web://  publickey @ ipaddress / capability

It's not pretty, but it's clear. Take the certificate authorities out, and keep…

Trust Management / Petnames

Petnames are a way to make the long and completely foreign identifier recognizable to the user. The initial relationship, like going to "amazon.com", would come from search engines or directories. The idea is that once you know about the site, there is a "relationship" that you wish to maintain. The idea of just typing in a domain name like "money.com", which was a hit-or-miss sort of thing, would no longer really be possible.

Vat

More than a sandbox…only has access to what is granted.

Cooperation under mutual suspicion

I think here he was saying to assume that everything is potentially malicious. Applications can still work together, based on the APIs that they expose.

JavaScript Message Server / Qt

Components are entirely isolated, only sending JSON back and forth. He wants to use something like NodeJS, but with better security in place, to handle the messaging. This would be talking to the remote server and perhaps the individual applications. It is somewhat like an AMQP bus that is secure. Qt is the interaction and rendering framework, and it is very widely used. I'm inclined to say that if you want to approach things this way, let's not limit the potential: why not allow Qt at a minimum, or a JVM that can plug in? Essentially you would be creating a platform for building web-based applications.
He didn’t go into too much detail here at all. The only thing he really mentioned was the clean separation this provides. Qt would handle the rendering of content as well as user interaction. The messaging bus would transfer the content.
I'm thinking that, taking his Vat approach, we can almost describe a secure web-based ecosystem. He didn't discuss how this would work in any way, but I can propose a possible design.
I'm thinking that "applications" have dependencies, but they aren't included in your code at all. You can refer to the original repository of that dependency, or bundle it if you want to. The idea is that the applications will in fact "live" on the web but are installed into the user's browser, much like a Chrome extension. They will have versions. When updates need to be pushed out, part of the application checks a specific address for updates and performs the update accordingly.
I'm inclined to say that, much like an operating system, this platform can have standalone applications as well as applications that other applications can interact with. Let's say you have an eBay application; you want to buy things, and you have the eBay application installed. Let's say that you also have your bank application, say Wells Fargo, installed. There could be a PaymentProvider interface that defines what is necessary to provide payments to other applications. When you want to pay for the item you won on eBay, you can choose any of your applications that implement the PaymentProvider interface.
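
A minimal sketch of what such a hypothetical contract could look like (none of this comes from the talk; the names and methods are invented purely for illustration):

import java.math.BigDecimal;
import java.util.Currency;

/**
 * Hypothetical contract a banking application could implement so that other
 * applications (for example a marketplace) can request payments from it.
 */
public interface PaymentProvider {

    /** Human-readable name shown when the user picks a payment source. */
    String displayName();

    /** Whether this provider can pay the requested amount in the given currency. */
    boolean canPay(BigDecimal amount, Currency currency);

    /** Asks the provider to authorize a payment; returns an opaque receipt token. */
    String requestPayment(String payeeId, BigDecimal amount, Currency currency);
}
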
I'm not entirely certain, but I believe the intention was that the JS message server may be more than just a message bus; the front-end application itself would be written there. The business logic would live on the messaging end, and it would broadcast messages to the Qt side for rendering updates and so on. Assuming I am correct, I think we can play around with this more and refine it.
I really like the application ecosystem concept. Pulling that off, and coming up with logical interfaces and ways of interacting, will not be easy. It is, however, a tremendous improvement over the chaos that we have today. We have an inter-connected system that talks together in many different ways but in fact has very little security and very few clear-cut boundaries in place. OAuth2 does have the notion of granting certain roles to different user types. That is a start, but to allow applications to collaborate, a more descriptive mechanism is needed. This concept is going to be essential in the development of IoT technologies.

The Old Web: Promiscuity

The New Web: Commitment

The mantra for this. The old web will remain, and I am pretty sure that isn't just for the transitional period. Rather, for "finding" new content, what we call "browsing the web", we would use the old web. Once we establish a relationship with a site, we will use the new web to maintain that relationship. Perhaps you can think about it the way HTTP is used for insecure content while some sites "switch" to HTTPS for secure content; so too here.

There’s nothing new here

No new technology at all. Just bringing current technologies together.

TypeScript: Much more than having “closure”

JavaScript, or ECMAScript, has a community that is always finding ways to make JavaScript "suck less": easier to write, deploy or test. What I had never seen is an attempt to make JavaScript more like "Java". No language is the magical unicorn perfect for every single situation; only a novice says stupid things like that. Yes, JavaScript is messy, odd and confusing. Its API is funky and sadly inconsistent from browser to browser. As the content we develop shifts more and more toward full-blown applications rather than mere static content, more conventions are needed to ensure that JavaScript protects itself.

TypeScript was pioneered by Microsoft (yup) back in 2012, and it has been quite well received by the community. The AngularJS team actually abandoned their development of a similar technology in favor of TypeScript.

Type checking, for those who come from Smalltalk or Python, is going to look annoying and cluttered… but that simply isn't the priority. Readable code is nice, and especially useful for quickly understanding a codebase, but it is equally important, if not more important, that developers actually follow the guidelines they have put into place and not introduce errors that may be hard or even impossible to detect.

Java is verbose, there is no denying it. Java 7 made it possible to omit inferred generic type arguments from a variable's initialization, and that does clean things up a bit. My personal feeling is that having greater control trumps speed every time in enterprise applications. For small start-ups that are understaffed and overworked, time is of the essence, but that is a different story altogether.
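
For reference, the Java 7 "diamond" syntax referred to above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiamondDemo {
    public static void main(String[] args) {
        // Java 6: the generic type arguments had to be repeated on both sides.
        Map<String, List<Integer>> before = new HashMap<String, List<Integer>>();

        // Java 7+: the diamond lets the compiler infer them.
        Map<String, List<Integer>> after = new HashMap<>();

        System.out.println(before.equals(after)); // true, both are empty
    }
}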

Gradual typing with smart type inference is a very nice balance that should appease those with statically typed backgrounds and beyond. TypeScript's take on it is quite refreshing. The addition of interfaces adds another dimension to TypeScript. Interfaces in Java always end up with a class implementation; in TypeScript you don't actually need to create a "class" to take advantage of interfaces. They are enormously useful for specifying the shape of function parameters.

I will be writing more on TypeScript soon, but start using it now. Ultimately, this is a better way to develop, and you end up with a much less error-prone code base that requires little more than renaming "js" to "ts" (and compiling).

Angular 2 – Trial Run

I'm currently still on EST, in SF at the Angular U conference. I figured I would give ES6 and Angular 2 a try with the official documentation before I hear the keynotes and all.

I started from the "Quick Start Guide" (https://angular.io/docs/js/latest/quickstart.html). Sadly, right away I found mistakes… The documentation says all that you need to install is the angular2 TypeScript definition, but when the compiler runs it turns out you need a number of additional definitions in order to make the code compile.

Maybe the version of the guide hasn’t been updated to reflect changes in Angular.

Well, anyway, after this the script did in fact compile, and it was fairly trivial. Now for my rant. I really love AngularJS. After developing a lot of applications with jQuery and getting fed up with the fact that there was no structure to my applications, I set out to look at the various libraries and frameworks that best fit my needs and the demands of most of the projects I work on.

AngularJS was the least "preachy", most functional overall, and most forward-thinking framework. You didn't need to embrace any philosophies, file structure, really much of anything. The only mantra that I associated with Angular is: no DOM manipulation anywhere other than in a directive.

After developing with Angular 1.x for a nice chunk of time I have discovered that there is room for improvement and simplification.

Here is my brief list of issues with Angular 1.x. (Some of these are more limitations in how it tends to be used and less issues with the framework directly.)

  1. Dynamic modules – Right now, officially, if you want to use Angular you need to load "all" of your modules at load time in order to use them. For large applications this is not only inefficient, but simply awful. For "websites" this is fine, but full-blown web applications may be huge, and if they are built as single-page applications you want the entire application to be rendered from one base HTML page. For large applications I use RequireJS to dynamically load needed libraries and scripts. There are third-party libraries that dynamically resolve the Angular scripts and trigger digests to propagate throughout the application and mix in the newly loaded modules. This works fine, but it's a hack at best. Which leads to the next issue.
  2. Config phase restrictions – The config phase of the application is very logical. You have access to the raw modules and are able to modify them as needed prior to initialization. This is reminiscent of the Spring Framework for Java, which utilizes @Configuration classes to declare Java beans. This is performed prior to the dependency-injection process, which greatly inspired Angular. Where it falls short is with my first qualm: no third-party module is able to load dynamically and still affect the config phase of the application. For setting up routing, which is one of the core components of a web application, this is a very crucial step.
  3. Directives are overly complex – Everyone says that the two-way binding of Angular is what makes it special. They are wrong; directives are where the power of Angular shines. Two-way binding is the obvious outcome of an MVC architecture trying to truly separate the application domains. Scope isn't super complex, but I do think some of the restrictions and subtleties of directives make them very awkward and confusing. While I understand the notion that only a single isolated scope can exist on a single element, it can make many directives difficult to work with. The need to manually invoke digests using $scope.$apply, because Angular couldn't know otherwise, was really messy and almost hackish. I think this was needed because of the lack of support for native Object.observe functionality.
  4. $scope.$watch – If you are dealing with a large application you will want to limit the number of watches you use. I try to avoid them as much as possible; they consume memory and affect performance. Because Object.observe has not been adopted by all browsers, Angular needs to perform dirty checking, which can be expensive. This hurts performance, and you are pushed toward Angular's broadcast system instead.
  5. Broadcast can be improved – Avoiding $scope.$watch when possible forces you to use some sort of event propagation system. Angular has its own $broadcast and $emit calls that send data down and up the scope (respectively) on the routing key specified. My biggest issue is not so much with the way the broadcast system works, but rather that it is too limited. I want to see an actual AMQP-style event bus that can queue events/messages and use real routing keys, much like you find in RabbitMQ. I have actually developed my own library (https://github.com/CyberPoint/eventBus), a JavaScript event bus. It doesn't deal with Angular scopes at all, but it could. I find that being able to use a dot-notation hierarchy is just as effective as scope alone.

I hope to post a follow-up entry with how Angular2 addresses these items.

Who are you? – Identifying yourself, from a security perspective

They say you are what you eat. I think that you are whoever you seem to be, plus who you really are. Others' perception of you, while not truly important, may contribute to the scope of "who you are".

Who are you?

In a doctor's office they would start off with questions regarding name, address, gender, and family, and then get into the activities you do. They are attempting to triage you based on your lifestyle, the activities you perform and your genetic history. There is obviously merit to this, as it is certainly a strong factor in well-being. The car you drive, the clothes you wear… while they won't affect your health, they certainly factor into how others perceive you. The way you walk, the way you curse (or not), whether you think it is rude to text while talking to someone else. All of these things come together to form an image: you.

Let’s explore these relationships and how understanding them can help identify and understand “you” the best we can.

1. You are a person.
2. Your gender is male.
3. You have dark hair.
4. You wear glasses.
5. You are left handed.
6. You live in Baltimore.
8. You are married.
9. You have two children.
10. You work in Baltimore.
11. You drive to work in a car.
12. You drive a sedan.
13. You own a mobile phone.
14. You are a Software Engineer.
15. You enjoy solving challenging problems.
16. You enjoy classic rock.
17. You are a passionate person.
18. You talk loudly.
19. You do not like hot weather.
20. You like to eat blueberries and do not like bananas.

Okay, so these are all true observations about myself. Let’s analyze this list for a second. Most of this list can be broken up into categories:

1. Observable physical attributes
2. Observable personality traits
3. Family members
4. Possessions
5. Preferences and opinions

I would call all of these "core" attributes. They can change over time; I may drive a different car or own a different phone. Ultimately this list would stay up to date and relevant.

There is a new buzzword being used: IoT, or the Internet of Things. This notion isn't a new idea… just like the "cloud" isn't a new idea. IoT emphasizes the relationships between objects that do not need human interaction. A prime example could be a door with a special lock that is linked to your mobile phone and unlocks when you are within a certain proximity. Most of these items to date have been more about convenience and have not really been adopted by the layman.

I think that IoT can be utilized to fill in the blanks between our lives in more ways than you might think. Combining the proper IoT devices and highly advanced software you can build an ecosystem that can make your security and connectivity as simple as snapping your fingers.

I have a phone at work, my mobile phone when I'm on the go, and a phone at home. Imagine that when I am at work all of my calls were routed to my work phone; when I am on the go, to my mobile phone; and when at home, to my home phone. Aside from being a nice convenience, this buys you a lot more. A call never rings at work when you are not there, and therefore no one else can answer it for you.

Replace a phone call with my computer. I have one at work, and at home. When I’m at home my work computer is locked and home computer is unlocked. When at work, my work computer is unlocked and home computer is locked.

Now replace a computer with a virtual account like your bank account. When the user is “you” you have access to your account. Somebody else doesn’t have access to your bank account.

Today, you use things like inputted secret credentials to authenticate yourself. Since you know this secret information you must be the account holder. Therefore, anyone who knows this secret information may access your account.

Additional precautions have been added to further lock down your account. You need your smartphone in order to receive a code in addition to your secret credentials. Not only do you need to know the secret information, but also have access to your phone. This is an obvious step in the right direction, but certainly makes it more difficult for “you” to access your account. Obviously, to date the extra step has been worth the added security measures to prevent unauthorized access. What if you could just say to your bank account…it’s me let me in!?

Let’s take what we have already established about your core attributes and what we know about secret credentials. What if we could take properties from the five categories we listed above and use them to build a signature that would clearly identify you, and no one else.

Let's pretend that we walk around with a special bleeding-edge recording device that captures all sorts of information for a month. This device takes everything and categorizes its data into these five different categories. It breaks the data down into a knowledge base of facts and assumptions. Associated with each assumption may be a corresponding confidence, expressing the level of certainty of that assumption. Certain types of facts may also have confidence levels; perhaps this fact was observed only rarely or under special circumstances. Assumptions are suggested by observations that haven't yet reached the threshold of a fact.

Next time you want to access your bank account, instead of logging in with your secret credentials and multi-factor code, what if you provided your signature? After you walked around with this recording device, the data was converted into a knowledge base that generated a signature. This signature is a representation of the knowledge about you. Now, when you want to access your account, you need to satisfy the knowledge base by producing a compatible signature.

What is this signature?
How is it derived from the knowledge base?
How do you produce a valid signature that is compatible with the initial signature?

We said earlier that there will be a confidence associated with facts. Assumptions are assertions that are less than a fact but may be true.

If you asked me to write down a list of five items that identify me by my core attributes, I would most likely respond with some version of the list of twenty above.

– Location is easy… high confidence
– Certain attributes change; that would be specified in their definition and taken into account according to the nature of how they change
– Data feeds from other "people" can be linked into yours, like the next evolution in social networking
– I may be acting slightly differently, but because I am sitting here with my son and daughter, I must be me. Use data from other people in conjunction with your own data. Data is published to granted parties for consumption.

You are what you do – Identification based on behavior

Thinking about the desire for a password-less society: when it boils down to it, there are a few major categories of password security.

1. You need something physical that only the owner would possess
2. You need some sort of knowledge that only the owner would possess.

We are familiar with the first and second one. The first can be a simple lock and key. The second a username and password.

The third, less common and much more difficult to achieve, is the password that isn't a password: verifying that you are "acting", or doing something, the same way the authenticated party would. There are movies that use a voice recording and match the voice signature against the authenticated party's known voice. I've read articles about detecting a distinct electrical signature that the owner gives off, unique to that person. I've also heard of individual keystroke patterns, much like handwriting recognition.

I had written about an idea that learned what websites you visit, your purchase history, radio history, Netflix, etc., essentially giving it as much data as possible, all to train a model used to authenticate you with predictive algorithms.

I like this idea, but it’s really complicated and will require significantly sophisticated models.

One additional factor that has not been mentioned is whether the authentication is occurring according to the account holder's will or against it. If an account owner is held at gunpoint, or in some other situation that threatens their life or that of a loved one, they may give up credentials to access the sensitive information. For some things that is obviously okay and the "smart" thing to do. For other things, like matters of national security, some may say that giving that information up is so damaging that they would not want to divulge it even when their life is being threatened.

It is an unfortunate but real situation for certain types of data. A security mechanism would be ideal if it could prevent the account owner from authenticating even if they have "given up" and are trying to save their life… the data must not be compromised no matter what, and a safeguard must be in place.

We can utilize the human factor to add additional layers of security. Biometric data such as heart rate, the account holder's posture, their walking gait, their speech patterns, their hand gestures: all of these characteristics can be used to identify anxious and unusual behavior. If we are dealing with a case of torture, there will certainly be telltale signs.

This is obviously an extreme yet real case one that I used to help illustrate a point. In extreme scenarios even the best trained soldiers will react under pressure. I think that with a well calibrated “mechanism” using a multitude of sensor data a baseline can be established to identify a user. This could not only identify the user but also identify certain behaviors, moods and reactions of the user.

Let’s take facial recognition. Utilizing a few dozen positions on the user’s face measuring the distances and locations of certain parts of the face can yield a very accurate model to identify that individual in the future.

Now take that same facial recognition while the user is watching a comedy, and then a tear-jerking movie. We can establish a baseline for each emotional response we want to associate. Utilizing heart rate, hand gestures, and the like, once the model is well trained, a few quick images could reveal instantly who the user is.

Utilizing tools like the Kinect and Leap Motion, and adding in things like infrared and close-up images of the pupil and the face, a great deal of information can be used to identify a user.

Imagine if you could watch a movie and the next time you do I can predict how you will react at each frame with a percent of certainty.

I am not suggesting that we understand merely the psyche of the user, but more about their innate responses and tendencies…these are not things that can easily be broken.

At a minimum, one thing we can take from this is the ability to add in the "scared" factor, or rather to detect unusual behavior, and with it we can protect many things. I want to use this to identify you, and also to know when it is you but you aren't acting like yourself. Obviously certain traits will be more dominant than others.

We can layer this on top of a standard multi-factor system, incorporating tests that help verify the account holder is not under duress.

The completely different application of this is convenience and AI facilitators. If we can nail down the pattern to identify an account holder and then detect variances in their behavior, we can trigger different things in response. This goes well beyond security and much more into the realm of IoT and automation, but let's explore it.

You come home and you walk in. Of course your car has pulled up, and your home already knows that you are approaching via your Wi-Fi-connected phone. You are emitting your MAC address and a public key, alerting your house that you are approaching. Your door is unlocked automatically with NFC, but really, Wi-Fi with a unique signature ID could trigger that as well. You walk in and your home is already lit to your specifications, with temperature control as well. Nest helps with some of this, as does detecting ambient light in conjunction with the room and the individuals involved. Depending on the activities, different illumination settings can be triggered; when a "reading" action is triggered, lighting should accommodate your preference. Okay… I'm leading up to it… now when you get to your computer, it is unlocked because you are using it. My vision of the ultimate in security and convenience is really one solution: tracking your behavior, your adjustments, your actions, your reactions, and learning from them to better identify you and make your life more secure.

Your house knows it is you because it knows your stride, your face, your smile, and the way you hum. All of these sort of things that your girlfriend may pick up on can be incorporated into the ultimate system which help to “get to know you to protect you”.

AMQP + Document Database = ZebraMQ

Part 1

Imagine taking the power and simplicity of AMQP and combining it with a document store: you can create a method for message routing far more flexible than a hierarchical dot-notation routing key.

Let's say you have a sports team organization system, one where you may have different divisions and teams within a division. You have the administrators of the league, the administrators of each division, and then the coaches of each individual team. You want users who are coaches to be able to communicate with their players as well as all of their superiors. You don't want a coach to communicate with players outside of their team, and you want to allow coaches to communicate with their fellow coaches.

With a traditional routing key, having a coach message his players would involve specifying the division, dot, the team, and then indicating that the recipients are players, not coaches or parents.

Let's say that the publish routing key could instead be specified as a query in an embedded-JSON fashion, like so:

{
    "type": "user",
    "$or": [
        {
            "division": "orange",
            "role": "administrator"
        },
        {
            "teams.the-hawks": {
                "role": {
                    "$in": [
                        "player",
                        "captain"
                    ]
                }
            }
        }
    ]
}

That would include the orange division administrators and the players and captain of the Hawks team, excluding non-players on that team. The subscribers' data entries may look like this:

{
    "firstName": "Jon",
    "lastName": "Smith",
    "type": "user",
    "teams": {
        "the-hawks": {
            "role": "player"
        }
    }
}

{
    "firstName": "Tom",
    "lastName": "Klein",
    "type": "user",
    "teams": {
        "the-hawks": {
            "role": "captain"
        }
    }
}

{
    "firstName": "Arnold",
    "lastName": "Palmer",
    "type": "user",
    "division": "orange",
    "role": "administrator"
}

This is all valid MongoDB-style querying, which I think is just a very simple and readable way of expressing a query. The idea is that instead of having the application obtain the records based on the queries and pass them along to the message bus, it would be far more efficient to have the two systems indexed and integrated together. Having the data about who receives published messages present and indexed, and only publishing a query that is used to pull the corresponding records, would provide a whole new way of utilizing message queues.
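
To make the matching step concrete, here is a rough sketch of my own (not from any existing implementation) that evaluates a small subset of this Mongo-style query language, equality, $or and $in, against a nested subscriber document represented as maps:

import java.util.List;
import java.util.Map;

/** Illustrative matcher for a tiny subset of Mongo-style routing queries:
 *  equality, "$or" and "$in". Not a real Zebra or MongoDB implementation. */
public class RoutingQueryMatcher {

    @SuppressWarnings("unchecked")
    public static boolean matches(Map<String, Object> query, Map<String, Object> document) {
        for (Map.Entry<String, Object> clause : query.entrySet()) {
            String key = clause.getKey();
            Object condition = clause.getValue();

            if ("$or".equals(key)) {
                // At least one sub-query must match.
                boolean any = false;
                for (Object sub : (List<Object>) condition) {
                    if (matches((Map<String, Object>) sub, document)) {
                        any = true;
                        break;
                    }
                }
                if (!any) {
                    return false;
                }
            } else {
                Object value = resolvePath(document, key);
                if (!valueMatches(condition, value)) {
                    return false;
                }
            }
        }
        return true;
    }

    @SuppressWarnings("unchecked")
    private static boolean valueMatches(Object condition, Object value) {
        if (condition instanceof Map) {
            Map<String, Object> map = (Map<String, Object>) condition;
            if (map.containsKey("$in")) {
                return ((List<Object>) map.get("$in")).contains(value);
            }
            // Nested field conditions, e.g. "teams.the-hawks": { "role": ... }
            return value instanceof Map && matches(map, (Map<String, Object>) value);
        }
        return condition.equals(value);
    }

    @SuppressWarnings("unchecked")
    private static Object resolvePath(Map<String, Object> document, String dottedPath) {
        Object current = document;
        for (String part : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<String, Object>) current).get(part);
        }
        return current;
    }
}

Running the earlier routing query against the Arnold Palmer document returns true via the first $or branch, and the Jon Smith document matches through the $in on his team role.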


I see this being somewhat like how ElasticSearch utilizes a document store and Lucene under the hood; here we would utilize a document store with RabbitMQ under the hood. Mostly it would be a tool for publishing messages to a complex set of users or other endpoints. Theoretically, lambdas may also be used for subscription callbacks, thus providing a way to subscribe to something and store how the subscriber responds, all within the message system context.

That would be very similar to how Hadoop runs MapReduce against data stored in HDFS, not having to send the data outside of the message system to process its callback function. Obviously, the callback can be application based, but this way would yield a tremendous performance gain, much like stored procedures in an RDBMS.

Part 2

Messaging. Queue. Data.

My initial plan for implementing the bare minimum functionality as a prototype was to create a RabbitMQ plugin that would be able to plug into a data store. I still think that having a tightly coupled message bus alongside its data store will be the most performant option. The roots of AMQP are in finance, and its most important requirement was high performance. If I am going to fly the AMQP flag I want to maintain performance as a priority, while allowing additional configurable options that may tip the scales with negligible variances in performance but tremendous gains in security and overall functionality.

Instead of building either a message bus or a data store (or both), I would build an abstraction that sits on top of the two (or more), essentially creating a producer/consumer bus that allows for integrating services. All "endpoints" will be authenticated and identified by set credentials. Endpoints use AMQP (and optionally HTTP or STOMP on top) to hook in and produce/consume.

Once authenticated, endpoints may publish and subscribe to "messages". Data stores may be directly queried by publishing a request with message criteria specifying the data store in question (just a routing queue/key parameter) and the corresponding query.

Optionally, the notion of subscribing to a data store for specific data changes is also a possibility. Depending on the implementation, the system may poll at intervals to check, or, more likely, it will know changes are happening when it receives messages that alter data matching the subscribed criteria. This can be an extremely awesome feature. Note that if the data store is modified by means outside of the application's context, it will be unaware of those changes!

Endpoints

Depending on the actual AMQP specification (0.9 or 1.0), it may be possible to use a standard AMQP client implementation to communicate with Zebra. The routing query would need to be serialized; perhaps it can just go in the headers. A REST API would be the simplest to implement.
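
As a sketch of the transport side (the exchange name "zebra.inbound" and the "routing-query" header are invented; the RabbitMQ Java client calls themselves are standard), a publisher could carry the serialized query in a message header:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

public class ZebraPublishSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        try {
            // A Zebra-style broker component would read the header and fan the
            // message out to every subscriber document matching the query.
            channel.exchangeDeclare("zebra.inbound", "fanout");

            String routingQuery =
                "{\"type\":\"user\",\"teams.the-hawks\":{\"role\":{\"$in\":[\"player\",\"captain\"]}}}";

            Map<String, Object> headers =
                Collections.<String, Object>singletonMap("routing-query", routingQuery);

            AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .headers(headers)
                .contentType("application/json")
                .build();

            channel.basicPublish("zebra.inbound", "", props,
                "{\"text\":\"Practice moved to 6pm\"}".getBytes(StandardCharsets.UTF_8));
        } finally {
            channel.close();
            connection.close();
        }
    }
}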


Security Options

If data requires security restrictions that prevent messages of different "types" from being mixed with different data, there are several possible methodologies to handle this.

1. Public key / private key: a key pair may be used to encrypt all data. In order to send certain types of data, a separate key will be needed for each level of security. The message data will be encrypted against the security signature's unique public key. Prior to delivering the data, the application will check each recipient's security access to see if they have access to this "security signature"; if they do, the data is decrypted and sent over a standard SSL connection to the recipients.

2. Multi-phase queue: routing queries may be specified in multiple phases. A queued phase will essentially filter out messages based on given criteria and pass the appropriate messages on to the phase that matches the next set of criteria. Each phase is essentially a separate exchange and therefore may be configured with all of the standard options (durable, auto-delete, declare).

Federated Data Storage and Access

Zebra may be plugged into anything that implements its interfaces. Data stores are special, but they are endpoints of Zebra as well; therefore they may subscribe to messages and publish messages. Complex data queries may be performed that merge and aggregate data from multiple data stores together. This can be accomplished by querying multiple data stores and providing a set of standard tools. Beyond anything provided out of the box, you may implement "middleware" that subscribes to common data tasks and analyses, similar to a stored procedure, enabling you to create pluggable endpoints that you can query against.

Part 3

Standard RPC calls over AMQP (in the RabbitMQ implementation) pass a correlation-id and a reply-to field in the message headers. The reply-to field indicates what queue should be used to send the corresponding response. In reality we don't need to send that message back to its origin; rather, if it is part of a sequence of events, perhaps we can direct it to the next step of the sequence.

Using reply-to and multi-phase queues we can have a very powerful conditional workflow and map-reduce system. The notion of multi-phase queues is an interesting one I came up with: take a routing query and assign it a level of priority, allowing for multiple phases to run sequentially. Phases may be nested with conditional criteria; when the set criteria are found to be true, the next phase will be evaluated.

If you want to decimate your data in a clean and automated way, you could specify the initial criteria in the primary routing query. Then, for the next query, you can specify the more specific query, knowing that all of the results will be viable.

Let's say you have a web site that buys and sells high-value automobiles. You utilize two separate databases: one has a listing of all automobiles, with a wealth of information and history on each make and model; the second has just the buyers' and sellers' automobile listings, what is for sale or sought for purchase. Depending on traffic and inventory, there may be more buyer/seller automobiles than automobiles in the catalog, or the exact opposite. The point is to first narrow down the automobiles by the actual criteria of the car's specification, and then deal with whether the automobile is actually available. This all depends on the question the user is attempting to answer: is it "what is the car I want?", or "what cars are available that I also want?"

HEADERS:

"routing-keys": [
     {
          "brake-horsepower": {
              "$gte": 500
          },
          "makes": {
              "$in": [
              "Ferrari",
              "Lamborghini",
              "Maserati",
              "Aston Martin"
          ]
     },
     {
         "price": {
             "$lte": 250,000
          },
         "condition": "new"
     },
    {
          "location": {
               "$lte": [100, 'miles']
          }
    }
]

 

This example is dealing with cars. The first set of criteria filters all messages down to those with brake horsepower greater than or equal to 500 and one of the specified makes. After this, the next thing the recipient of that message will do is publish a new message with a routing query specifying that the price must be less than or equal to $250K and that I only want to purchase a new car.

The second message will alert all of the owners of cars that match the secondary criteria, letting them know that someone may be interested in purchasing their car. The third and final criterion is that I would really be interested, if possible, in seeing the car in person, so it should be located within 100 miles of where I live. The final result will return to the initial publisher so that they may view the final list.

As a message makes its way along its path, it may accumulate data at each phase and transformation. Beyond determining the message's next hop, each phase may attach new data, and all the changes made at each step of the message's journey may be associated with the message. This provides a way to gather information through a workflow process. There is a great deal of flexibility in this that we will explore shortly.

Let's say I have a piece of data that I want to analyze to see if it is something that needs to be reported to the system administrator. Perhaps the analyzers pass their analysis along to be fed into the next analyzer, or maybe it will be attached to the final result.

Perhaps the first analyzer will detect the application's type. If the application's type falls into one of two categories, it will go to the corresponding category. Then let's say that we want to send the application to be tested in a sandbox and examine the function calls it makes. Finally, we would take the list of URLs called by the application, and if any of them are in the supplied list then the application will be returned to be further evaluated by the user. The user may want to see the gathered data alongside the fact that it matched the final criteria. This is very similar to a traditional map-reduce pattern used in distributed computing systems.

Each of the links along the chain may inject its results into the "route-data" header field. Such "route-data" may then be used in subsequent routing queries. For instance, let's say you are looking for an application that is a PE (Portable Executable) file and is compiled for a 64-bit architecture. The next routing query would be:

{
    "$routeData.$0.fileType": "PE",
    "$routeData.$0.architecture": "64-bit"
}

A possible usage of "$routeData" is to check whether any headers in the "route-data" field satisfied the condition supplied in the first route.

Zebra is a fresh approach to combining AMQP-style messaging with document-based databases. AMQP remains the wire transport protocol, and Zebra sits on top of the database and messaging layers as an abstraction. It translates the queries into sets of subscribers and then publishes the appropriate messages to each corresponding subscriber's queue.

Smart Captcha

There is an alternative to the popular Google reCAPTCHA (https://www.google.com/recaptcha) that I used 10+ years ago, called Text Captcha (try http://api.textcaptcha.com/cp.json for a demo). It asks simple language-based questions that require very minimal effort to answer… but somehow I greatly prefer it over those terrible images I can't seem to understand.

The way to "challenge" this sort of captcha would be through natural language processing. The question variations aren't that hard, so I'm sure it wouldn't be too difficult to break. I was thinking, however, of combining this with the now defunct Google Image Labeler (http://en.wikipedia.org/wiki/Google_Image_Labeler). I used to play this once in a while many years ago. Let's take the game and its logic to the next level. Imagine a series of three images that, when put together, convey a theme. For instance: turkey + leaves + football = Thanksgiving. Yeah, that was corny, but it is not a simple thing to crack. It involves understanding what each image may possibly mean (in all contexts) and then limiting that understanding to the intent conveyed by the accompanying images.

An alternate type could be "complete the sequence": given a horse and buggy, then a train, then ?, the answer would be an automobile. The sequence, of course, is forms of ground transportation going from the past to the most modern.

The two options I have given are not friendly to those who are visually impaired, but that doesn't rule them out as a possibility for a majority of the population. I think that a more complex text captcha is also possible, but it would still be vulnerable to a simpler NLP attack.

Authentication: The real me

Authentication, as we use it in the security world, obviously comes from the word "authentic", meaning genuine. Today, most common authentication mechanisms simply fulfill an already established contract with secret information that only the account owner should possess. This gives no insight into whether the user who initially established the account, let's call him Bob, is in fact the one logging in, or whether Alice, a third-party listener, has somehow obtained Bob's credentials. Since Bob's credentials may be a username and password pair, the only thing that protects this account is possession of that secret information. You haven't missed anything; all I have said is that modern authentication relies on secrecy, on private information that only the account holder should possess. What if we were able to actually establish that Bob is the same Bob who initially created the account? Not through knowledge of a simple pair of username/password credentials, or a selected picture and the like. Rather, what if Bob were somehow able to expose his likes, dislikes, habits, tendencies, interests, etc., and this information were used not to verify that Bob knows his password, but that Bob is Bob?

The closest thing I have seen that somehow utilizes this is during credit checks. The questions are almost entirely address-verification-related multiple choice questions: did you live on street X?

Imagine you have a browser extension installed. You read an article that makes you "sad", or makes you "happy". You click on your "authentication" icon and use an emoji to express what you feel, or perhaps use a few tag keywords to convey your response. This data may be collected and used to build a statistical model that generates verification questions for an authentication service. Your web history, emails, shopping tendencies, the music you listen to: all of this can be used to help paint a picture of "how" you think and how you react.

Pandora does this to provide you with recommended music. The more you show what you like and dislike among the recommended music, the more you train your station. Imagine if the next time someone logged in to Bob's account they were asked a couple of questions about which songs in a list they would prefer to listen to. This is not ironclad, and different types of data will have different results and predictability.

This is a very general idea, but I think that ultimately, just as we can now use Facebook and Google to log into our accounts, imagine if we used our Pandora, Netflix, and Amazon accounts as seed data to ask us questions that verify our identity!

I am about to post another idea about CAPTCHAs, but look at this post: http://en.wikipedia.org/wiki/Google_Image_Labeler. I remember when this game was around and Google used people to help train/verify its image search results. I could certainly see a game such as this being used to both train and verify a user's identity.

This can become a great deal more abstract. I have recently read (http://www.washingtontimes.com/news/2015/jan/24/true-cybersecurity-intelligent-computer-keyboard-i/) about efforts in keystroke analysis, identifying users by their unique electronic signature.

A bit of a rant, but I think there are some solid ideas here.