Book Review: Malware Data Science by Joshua Saxe with Hillary Sanders

One perk of teaching is the free books. Lots of them. They usually come with a lovely letter suggesting that the new edition of The Grand Handbook of Cyber Security – and they all sound like that – would be just perfect for your graduate or undergraduate students. I usually hand these freebies, unopened, to the nearest student I can find. I’m not saying the books are always terrible, but I prefer to select books on my own, thank you.

Which is why I am so excited to share my enthusiasm for a 2018 book I picked up last week and devoured over the holiday (despite a head cold). Malware Data Science, by Joshua Saxe with Hillary Sanders, is a great read, even though it has three features that would normally add to its suspiciousness score for me. First, it has code, and I usually find code in a book to be padding for bad writers. (The truth is that the code in this book nicely illustrates the concepts.)

Second, the book is published by No Starch Press, a small publisher that turned down a couple of proposals I’d sent them in the past. And yes – damn it, even cold-hearted jerks like me have feelings. I still have the rejection letters in a box somewhere, and I remember thinking that they must have no clue if they hate my proposals. (The truth is that they do have a clue and their books, including this one, are awesome, offering strength where I am weak.)

And finally, the book has chapters on machine learning and deep learning, and like most of you, I’m up-to-here with the artificial intelligence stuff. It’s like – OK, already – we all get that you can build a weighted decision tree (as I was doing back in the 90s) to categorize software as good or bad, like that Wonka scale. (The truth is that Malware Data Science is more than just the usual book about AI. It smoothly combines theory and practice – which is not easy.)

The meat of the book involves developing actual tools that use machine learning to differentiate between malware and benignware samples. An excellent taxonomy of true and false positives and negatives serves as the basis for evaluating machine learning techniques, including logistic regression, k-nearest neighbors, decision trees, and random forests. Deep learning is also explained in a coherent and approachable manner. Nice job.
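For readers who have not met the true/false positive/negative taxonomy before, here is a minimal sketch of how a detector might be scored with it. This is my own illustration, not code from the book: the function names and the tiny label lists are made up, and I am assuming the convention that 1 means malware and 0 means benignware.

```python
def confusion_counts(labels, predictions):
    """Count (TP, FP, TN, FN), assuming 1 = malware and 0 = benignware."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if predicted == 1 and actual == 1:
            tp += 1  # malware correctly flagged
        elif predicted == 1 and actual == 0:
            fp += 1  # benignware wrongly flagged (a false alarm)
        elif predicted == 0 and actual == 0:
            tn += 1  # benignware correctly passed
        else:
            fn += 1  # malware that slipped through
    return tp, fp, tn, fn


def detection_rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate)."""
    tpr = tp / float(tp + fn) if (tp + fn) else 0.0
    fpr = fp / float(fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical ground truth and detector output for eight samples.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(labels, predictions)
tpr, fpr = detection_rates(tp, fp, tn, fn)
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))
print("true positive rate=%.2f, false positive rate=%.2f" % (tpr, fpr))
```

The two rates are what you would compare across classifiers: a detector is only useful if it catches most malware without drowning the analyst in false alarms.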

The authors claim repeatedly that readers can and should use Python 2.7 on Linux in VirtualBox to perform practical analytic tasks. Usually I scoff at such claims from authors, expecting only modest pedagogical value from the open source tools, sample code, and the like. But in this case, I’m going to suggest that if I were a SOC manager today hiring data scientists to investigate malware, I’d make this book required reading.

If you teach college, then maybe a free copy of this book is already on your desk. If not, then I’d suggest you go grab a copy. Despite its hefty $49.95 price (and apparently the proceeds are going to the Environmental Defense Fund – which I like), I think this is a good book for your summer reading list. As always, please share your thoughts with all of us after you go through this book (or if you have read it already). I’ll be watching for your review.