Synthetic Data for Cyber

I once tried to create a vivid, non-technical explanation of the promise of deceptive computing in a cyber security book I wrote back in 2008. I set up the premise that you, the reader, are an evil jewel thief, wearing your gloves and black turtleneck, and that you are about to nab some expensive piece of jewelry. You quietly approach the display case and eye your target, which sits on velvet under protective glass. Everything is going as planned.

But as you carefully reach forward to lift the glass, something startling occurs. The entire display case disappears with a poof sound, and your surroundings also change suddenly and dramatically. In an instant, the lights are on around you, and several uniformed police officers have formed a circle around you. You have been caught in a deceptive sting, and it was my contention then, as it is now, that this scenario is made possible in computing by virtualization.

Last week, I welcomed a couple of visitors to Fulton Street for coffee and a technical chat about deception. But unlike most deception offerings that I’ve covered (and I will be writing more and more about this important area in the coming year), my visitors from a small firm called ExactData were offering a different angle. We spent the morning discussing how they create synthetic data for testing – and how this might be useful for cyber security.

“The area in which we operate at ExactData involves high-fidelity, fully-synthetic, automated data generation,” explained John Dawson, CEO of the company. “The technique was established for testing and optimizing data process systems and involves creating data model templates within a sophisticated rules engine that can generate a simulated data universe – including system response files – and then writing the result into a new database.

“Our company has now developed and patented the market-leading technology for such high fidelity synthetic data generation that will engineer data to accurately simulate production systems. That data is 100% artificial, completely realistic, and error free at very deep levels. We believe this is a powerful capability.”

The idea we discussed was how this data generation might be applied to cyber, and it was not hard to see how this could work. The most obvious use-case involves the creation of a test and simulation environment for cyber security tools. Not unlike tech vendors such as Keysight Technologies, which acquired Ixia in 2017, ExactData could easily point their synthetic data generation solutions at our industry – and this seems like a no-brainer for the start-up.

But I think there is more here – and the promise of deception forms the basis for my enthusiasm. Here’s the idea: If you can create synthetic data for testing, then you can create it for any purpose, including live operational support for security. Consider the possibilities of on-demand generation of customer data records that are incredibly realistic, but entirely fake. Such honey content would seem useful for many different engagement types.
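To make the idea of on-demand honey content a little more concrete, here is a minimal Python sketch of my own. It is not ExactData's method, and nothing in it comes from their product. The name lists, the record fields, the `example.com` email domain, and the `9999` card prefix are all assumptions chosen purely for illustration. The only real technique used is the standard Luhn checksum, which makes a fake card number pass basic format validation.

```python
import random

# Hypothetical name pools; a real generator would use far richer data models.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dana", "Elif"]
LAST_NAMES = ["Nguyen", "Okafor", "Petrov", "Quinn", "Rossi"]


def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of digits (check digit excluded)."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(number: str) -> bool:
    """Return True if a full digit string (check digit included) passes Luhn."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def make_record(rng: random.Random, customer_id: int) -> dict:
    """Build one entirely fake customer record."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # "9999" is an assumed prefix used here only to mark the number as fake.
    payload = "9999" + "".join(str(rng.randrange(10)) for _ in range(11))
    return {
        "customer_id": f"C{customer_id:06d}",
        "name": f"{first} {last}",
        "email": f"{first}.{last}{customer_id}@example.com".lower(),
        "card_number": payload + str(luhn_check_digit(payload)),
    }


def generate_records(count: int, seed: int = 0) -> list:
    """Generate `count` fake records; the same seed reproduces the same batch."""
    rng = random.Random(seed)
    return [make_record(rng, i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for record in generate_records(3, seed=42):
        print(record)
```

A toy like this only shows the shape of the problem. The hard part, and presumably where a commercial rules engine earns its keep, is keeping many fields and tables mutually consistent at volume.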

First, synthetic data generation can support dynamic generation of enticing content for the purposes of intrusion detection. Such on-demand capability has the potential to turn honeypots from static, pre-determined entities into live, adjustable resources that can be crafted based on behavioral attributes of the environment. This could be easily extended to machine learning-based algorithms that adjust synthetic content to observed needs.
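The detection side of this can be sketched in a few lines of Python. This is my own illustration under stated assumptions, not a description of any vendor's product: the `HoneyStore` class, the `HT-` token prefix, and the idea of scanning plain-text log lines are all hypothetical. The underlying principle is simply that no legitimate user has a reason to touch a decoy record, so any appearance of its unique marker in a log is worth an alert.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class HoneyStore:
    """Tracks unique markers planted inside decoy records (hypothetical design)."""

    tokens: dict = field(default_factory=dict)  # token -> human-readable label

    def plant(self, label: str) -> str:
        """Create a new unique token to embed in a decoy record."""
        token = "HT-" + secrets.token_hex(8)  # "HT-" prefix is an assumption
        self.tokens[token] = label
        return token

    def scan(self, events):
        """Return (label, event) pairs for every log line that mentions a token."""
        alerts = []
        for event in events:
            for token, label in self.tokens.items():
                if token in event:
                    alerts.append((label, event))
        return alerts


if __name__ == "__main__":
    store = HoneyStore()
    decoy_id = store.plant("decoy CRM customer row")
    log_lines = [
        "GET /index.html 200",
        f"SELECT * FROM customers WHERE id = '{decoy_id}'",
    ]
    for label, line in store.scan(log_lines):
        print(f"ALERT: {label} touched by: {line}")
```

Because tokens are generated on demand, decoys can be planted, rotated, or retired as the environment changes, which is the dynamic quality that static honeypots lack.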

Second, synthetic data can support a variety of security training and compliance framework needs. Customer support representatives, for example, could be required periodically to demonstrate their competence in a simulated data environment, where evaluations can be done using records that cannot be differentiated from real data. The dynamic aspect of synthetic data generation would make such simulators quite effective.

And third, the use of synthetic data for evaluating security tools is already well-established. Many larger companies already use synthetic data to test their tools, and most cyber security vendors have also followed this practice for years. I would imagine that product test companies such as NSS Labs would be well-served to have a look at high-fidelity synthetic data generation solutions to improve the quality of their test and simulation data.

“Our algorithms appear well-suited to cyber security,” Dawson said, “because we do much more than just cut a small slice of your records and then use that to copy it over many times. Instead, we follow advanced methods for creating synthetic data that is realistic and almost impossible to differentiate from the real stuff. We believe this can have a powerful impact on the cyber security technology industry and do it at tremendous data volumes.”

ExactData is a small company with the full intention of getting bigger. If this sounds like an attractive potential partner, then I suggest giving them a call. Obviously, if you have an established relationship with a great company like Keysight, then my advice is to get creative in your discussions. Ask them what’s on the horizon in terms of dynamic content generation based on methods such as machine learning. It’s an exciting area with great promise.

And as always, please share with us what you have learned.