Finding Cyber Signals in Noise

For over a dozen years, I scurried each morning past a looming portrait of Claude Shannon in the lobby of my Florham Park building. (Too bad if your building has stock modern art. Such a missed opportunity to inspire.) Shannon’s work, in case you’ve been away for the last hundred years, formed the basis for information theory, which has enabled much of what we do in computing today. Modern signal processing, for example, is a direct descendant.

A common problem in signal processing involves extracting a weak signal buried under communications noise. If you took a graduate course on this at Stevens or NYU, the professor would write the following equation: f(t) = s(t) + n(t). This is mathematician-speak for: the total signal f(t) is equal to the weak signal s(t) plus the noise n(t). (The equations get considerably more complex after that, but you get the idea.)
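
To make the equation concrete, here is a minimal Python sketch (entirely my own illustration, with made-up numbers, and in no way Quantifind’s method) that buries a weak 50 Hz tone in noise ten times stronger and recovers it by correlating the observed f(t) against candidate templates:

```python
import numpy as np

rng = np.random.default_rng(7)

# f(t) = s(t) + n(t): a weak signal buried in much stronger noise
t = np.linspace(0.0, 1.0, 2000, endpoint=False)
s = 0.1 * np.sin(2 * np.pi * 50 * t)   # weak 50 Hz signal, amplitude 0.1
n = rng.normal(0.0, 1.0, t.size)       # noise ten times stronger
f = s + n                              # the observed total signal

# Correlate f against candidate tones; the true frequency stands out
for freq in (30, 40, 50, 60):
    template = np.sin(2 * np.pi * freq * t)
    score = abs(np.dot(f, template)) / t.size
    print(f"{freq:>2} Hz correlation: {score:.4f}")
```

With enough samples, the correlation at the true frequency rises well above the noise floor, which is the essence of pulling s(t) out of f(t).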

I was pondering f-of-t’s and s-of-t’s and n-of-t’s while meeting with Dr. Ari Tuchman, founder of Quantifind. Now, before we start, I must share that Tuchman is not the usual start-up guy. He holds a B.S. in physics from Harvard and a Ph.D. in atomic physics from Yale, and has served as a research scientist at Stanford working in quantum metrology. As he began to explain his work, I prayed that he wouldn’t quiz me on neutral plasmas.

“What we do at Quantifind involves pulling signals from unstructured data,” he explained, which calmed me down somewhat. “We do this using predictive models that help derive useful information from these noisy, unstructured sources. We’ve had considerable success predicting things like customer churn for mobile service providers, but we are now pondering how our methods might be applied to cyber security.”

This sounded cool, so we dug deeper into what the company is doing, and yes, others on the Quantifind team have frighteningly impressive resumes: Co-founder John Stockton, for example, studied at Caltech and Stanford. Sigh. Anyway, the basis here is that unstructured external data abounds on the Internet in the form of social media, mobile applications, web applications, user forums, and the like. The goal is to find weak signals in this mess of data.

The Quantifind platform first ingests this massive wealth of external information about users and entities of interest. Think of this as collecting, organizing, and indexing f(t) from the Internet and other external sources. The next step uses machine learning algorithms trained with ground truth to find legitimate connections in the data. The result is a powerful playing field for connecting entities based on names, mentions, locations, records, and so on.
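
Quantifind’s actual algorithms are proprietary, but a toy sketch conveys the flavor of connecting entities across records. Everything below – the field names, the weights, the records themselves – is hypothetical:

```python
from difflib import SequenceMatcher

def connection_score(a: dict, b: dict) -> float:
    """Toy score for whether two records refer to the same entity,
    combining fuzzy name similarity with shared attributes."""
    name_sim = SequenceMatcher(None, a["name"].lower(),
                               b["name"].lower()).ratio()
    shared = sum(1 for k in ("location", "phone", "employer")
                 if a.get(k) and a.get(k) == b.get(k))
    return 0.6 * name_sim + 0.4 * min(shared / 2, 1.0)

internal = {"name": "J. Q. Smith", "location": "Newark, NJ",
            "phone": "555-0100"}
external = {"name": "John Quincy Smith", "location": "Newark, NJ",
            "phone": "555-0100", "employer": "Acme LLC"}
print(f"match score: {connection_score(internal, external):.2f}")
```

A real system would weigh many more features and learn the weights from labeled data rather than hard-coding them.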

Let’s go through a simple use case to illustrate how Quantifind connects information domains: Suppose that a bank notices an internal transaction that is marginally anomalous, but does not possess much external context about the user involved. The bank might thus not make a big deal about the transaction, and would likely not issue a Suspicious Activity Report (SAR). This is perfectly acceptable, so long as the activity is truly non-fraudulent.

What Quantifind enables, however, involves connecting this activity to external data about the user or entity involved. There might be, for example, unusual or incriminating evidence on social media, in public records such as arrest records, or in some other open source (Quantifind only uses public sources) that would prompt deeper analysis of the transaction. It is entirely possible that the bank’s final interpretation of the event would thus change and a SAR would be issued.
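
One way to picture this decision is as a simple combining rule, where corroborating external context pushes a marginal internal anomaly over the reporting threshold. The scores and threshold below are made up for illustration:

```python
def should_file_sar(internal_anomaly: float,
                    external_risk: float,
                    threshold: float = 0.7) -> bool:
    """Hypothetical decision rule: a marginal internal anomaly alone
    stays below threshold, but corroborating external context
    (e.g., adverse public records) can push it over."""
    combined = internal_anomaly + (1 - internal_anomaly) * external_risk
    return combined >= threshold

# Marginal anomaly, no external context: no SAR
print(should_file_sar(0.4, 0.0))   # False
# Same anomaly plus adverse public-record signal: SAR
print(should_file_sar(0.4, 0.6))   # True
```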

This functionality lends itself quite well to detecting money laundering (and Quantifind works with many banks today). Tracking illegal behavior, unethical business practices, or other nefarious activity is also in scope, because you can create context by connecting the details of internally observed actions with corresponding external data. Clearance investigators make similar connections when checking whether a person is living beyond their means.

And there is an interesting contra-case, where the absence of an expected signal amidst the background noise might itself represent a suspicious condition. Suppose, for example, that a financial advisor is aggressively recommending some purportedly exciting investment, but that the corresponding expected social chatter, Internet buzz, and public commentary are simply not present. This discovery provides a powerful way to isolate a fraudulent claim.
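
A crude version of this missing-buzz test might compare observed public mentions against the chatter a claim of that size should generate. The rates and thresholds here are invented purely to illustrate the idea:

```python
def missing_buzz_flag(claimed_reach: int,
                      observed_mentions: int,
                      expected_rate: float = 0.001,
                      floor: float = 0.1) -> bool:
    """Flag when public chatter falls far short of what a claim of
    this size should generate. All rates here are illustrative."""
    expected = claimed_reach * expected_rate
    return expected > 0 and observed_mentions < floor * expected

# A fund claiming 50,000 investors but only 3 public mentions
print(missing_buzz_flag(50_000, 3))   # True -> suspiciously quiet
```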

The sixty-four-thousand-dollar question is how to point Quantifind’s predictive models at unstructured data to help security teams find useful indicators of attack or other subtle cyber-related signals amidst the noise. One potential case would be a SOC hunter using the tool to add context to cyber cases, gauging motives and connecting dots between actors. Those in cyber law enforcement would also be well advised to consider using this during investigations.
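
To make the dot-connecting idea tangible, here is a tiny hunt-style sketch – my own hypothetical data and logic, not a Quantifind feature – that links threat actors through shared indicators such as IP addresses, wallets, and forum handles:

```python
from collections import defaultdict

# Hypothetical hunt data: actors and the indicators (IPs, wallets,
# forum handles) each one has been observed with.
observations = {
    "actor_a": {"198.51.100.7", "wallet_x", "handle_1"},
    "actor_b": {"203.0.113.9", "wallet_x"},
    "actor_c": {"198.51.100.7", "handle_2"},
}

# Connect the dots: any indicator seen with more than one actor
shared = defaultdict(set)
for actor, indicators in observations.items():
    for ind in indicators:
        shared[ind].add(actor)

for ind, actors in shared.items():
    if len(actors) > 1:
        print(f"{ind} links {sorted(actors)}")
```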

When I asked Tuchman for his thoughts on cyber, he had an interesting observation: “One of the largest challenges we see in the cyber application involves motivating investigators to truly care about this type of analysis,” he replied. “We would like to see banks being more motivated to detect, for example, that a construction company might be a front for nefarious cyber offensive activity. Maybe we need to introduce financial incentives.”

The bottom line is that pulling cyber signal from noisy, unstructured data is potentially so useful that I thought it prudent to ask you – my social community – for suggestions on how this might be deployed. Please take a moment to offer up any ideas you might have. Let’s see if we can find a useful context for a decent tool from some frighteningly smart people at Quantifind who might help us tilt the balance in favor of the cyber security defender.

I look forward to hearing from you.