Using AI to Cluster and Protect Data

One of the oldest previously-classified US documents, written over one hundred years ago, explains how to make secret ink: “Take one ounce of linseed oil, 20 ounces of liquid ammonia, 100 ounces of distilled water. This mixture must be shaken up before using with a quill pen. Write in free space between words written in pencil. To make this writing appear, dip the whole letter in cold water, and read secret writing while wet. Upon drying the writing disappears.”

Leon Panetta, during his tenure at the CIA, decided to declassify this and several similar documents because technology had advanced sufficiently to make protection no longer necessary. This decision, and so many others like it, is part of a huge apparatus in every national government for classifying and protecting sensitive information. Sometimes the process works, and other times (uh, Snowden), it doesn’t.

I had this sense of data classification in mind while chatting recently with Yaniv Avidan of Israeli start-up MinerEye. My friend Malcolm Harkins serves on their advisory board, and I was keen to learn about how the MinerEye team is using artificial intelligence for data protection. What I discovered, to my delight, was a tool that addresses one of the most under-attended aspects of cyber security: Records Information Management.

“Our technology uses AI methods to identify, categorize, and help provide controls around unstructured data repositories in the enterprise,” explained Avidan. “We identify sensitive documents and files either on local premises or stored in cloud repositories, and we categorize assets using our proprietary pattern recognition and machine learning algorithms. This allows enterprise teams to enforce data protection policies.”

The way this process works is that the MinerEye DataTracker virtual machine detects, learns, matches, and updates the stored document and file footprint of the organization. Such activities produce a variety of administrative benefits for IT and security teams, but in an era of one-after-another data exfiltration incidents and intellectual property theft cases, having a clear understanding of data records and information posture is the first step in reducing risk.

I asked Avidan how this would work in practice and his use-case was familiar: “When some document is created by an enterprise user, it is often stored on multiple local and cloud sources,” he explained. “The DataTracker software identifies this content and clusters it with related data. This information can then be fed to data leakage prevention (DLP) tools, records information management (RIM) systems, and much more.”
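To make the idea concrete, here is a minimal sketch of how copies of one document scattered across local and cloud locations might be grouped together. This is not MinerEye's method, which Avidan described as proprietary; it swaps in a textbook stand-in (word shingles compared by Jaccard similarity, with greedy single-pass clustering). The function names, the file paths, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing at all).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.5):
    """Greedy single-pass clustering (illustrative only).

    docs maps a storage path to the document text. Each document joins
    the first cluster whose representative is similar enough; otherwise
    it starts a new cluster and becomes that cluster's representative.
    """
    clusters = []  # list of (representative_shingles, [paths])
    for path, text in docs.items():
        signature = shingles(text)
        for representative, members in clusters:
            if jaccard(signature, representative) >= threshold:
                members.append(path)
                break
        else:
            clusters.append((signature, [path]))
    return [members for _, members in clusters]
```

Given a forecast document saved locally, a lightly edited copy of it in cloud storage, and an unrelated lunch menu, `cluster_documents` would place the two forecast files in one cluster and the menu in another. A list of paths per cluster is the kind of output that could then be handed to a DLP or RIM system.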

I asked Avidan how this clustering works, and this is where the company’s experience in computer vision and machine learning comes into play: “Every file identified by DataTracker is automatically matched to a cluster,” he said. “That cluster can be tagged, and an AI module profiles its behavior. We use protected folders in DLP and IAM systems as learning sets to discover new locations of stored data of interest.”
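The learning-set idea can be sketched in a few lines, again as an illustration rather than a description of DataTracker's internals. The sketch assumes a simple bag-of-words profile built from the files in a protected folder, and uses cosine similarity to flag files elsewhere that resemble that profile. The function names, the paths, and the 0.6 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase word occurrences in text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(protected_texts):
    """Sum term counts over the protected folder (the learning set)."""
    profile = Counter()
    for text in protected_texts:
        profile.update(term_counts(text))
    return profile


def find_unprotected_matches(profile, candidates, threshold=0.6):
    """Return paths of candidate files that resemble the protected profile.

    candidates maps a path outside the protected folder to its text.
    """
    return [
        path
        for path, text in candidates.items()
        if cosine(profile, term_counts(text)) >= threshold
    ]
```

In use, the texts of files from a protected folder go to `build_profile`, and the files found everywhere else go to `find_unprotected_matches`; any path it returns is a location where similar content may be sitting without the same protection. A production system would presumably rely on much richer features than raw word counts.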

While listening to Avidan go deeper into the details of MinerEye, I kept thinking about how so many companies have zero clue about their unstructured data posture. This seems crazy, because intellectual property theft has become such a prominent technical and political issue. Companies complain that their data is being ripped off (see China vs. America), but the targeted company is often unaware of where its sensitive data and records are stored.

Once data clusters have been identified and tagged by DataTracker, a variety of useful management actions are possible – most related to tracking and compliance. “Our software integrates directly with Microsoft Office 365 and Azure to trigger security and system compliance activities,” Avidan explained. “This is important for teams dealing with regulatory issues related to PII, PCI, and other sensitive data.”

My concern going into the discussion with MinerEye was that cloud storage would significantly complicate the process of discovery and tracking. And certainly, if access to stored information is not clearly managed, then it might remain undetected. But MinerEye seems unusually focused on the challenges of cloud storage – even emphasizing the benefits of data posture identification in advance of cloud migration. This made sense to me.

I suspect that early adopters here will be organizations with intense data identification and tracking requirements for unstructured files and documents. And Avidan took me through an impressive list of customers. Where this new technology has the greatest potential, however, lies in its ability to energize and automate RIM policy enforcement in the enterprise. Finding and properly protecting files before they are exfiltrated is a sensible goal.

If you are responsible for data security in the enterprise, or if you are one of those poor souls tasked with managing the company RIM policy, then you should be in touch with MinerEye to learn more about their solution. Taking steps to protect your data will only increase in relevance in the coming years. And perhaps if enough of you deploy DataTracker, then data thieves will have to reuse that linseed oil and ammonia mixture for invisible writing.

As always, after you speak with MinerEye, let us all know what you learned.