CIT faculty aim to make big data a small issue for law enforcement

Several computer and information technology professors have received a grant to build a cost-effective digital forensics toolkit for law enforcement agencies.

Currently available digital forensic tools are limited in scope, incompatible with one another, unable to adapt to ever-evolving technologies, and not designed to deal with the “big data problem,” according to Kathryn Seigfried-Spellar, assistant professor of computer and information technology.

Network servers and clients such as personal computers and mobile devices all store data, including text, images, video, voice, and numerous other file types. Even relatively simple data transmissions among a multitude of network-connected devices can result in a large quantity of data, but Seigfried-Spellar said quantity isn’t the only “big data” challenge faced by law enforcement.

“The three V’s of big data that are well-known are volume, variety, and velocity,” said Seigfried-Spellar. “There is a fourth V: visualization. The jury, prosecutor, and judge want to see data, but it can be hard for them to understand it.”

If evidence is presented in a manner deemed prejudicial, it can be thrown out. “We want to be able to present digital evidence the way it looked in real time at the moment it was created or transmitted,” Seigfried-Spellar said. “The goal is to show things visually as they appeared or how things are linked between person A and person B.”

Reconstructing the data a suspect accessed or transmitted in a forensically sound manner requires knowing which transmission protocols were used and what the network conditions were at the time, according to John Springer, associate professor of computer and information technology.

“Origin is important. You can spoof email addresses. It’s important to understand there is metadata associated with the movement of data,” Springer said.
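Springer’s point about spoofable origins can be made concrete. As a minimal sketch (not part of File TSAR), the Python standard library’s email parser can walk a message’s “Received” headers, which each relay prepends, and compare the claimed sender against the actual hop-by-hop path; the input file name below is hypothetical:

```python
# Minimal sketch: trace an email's hop-by-hop path from its "Received"
# headers, since the "From" address alone can be spoofed. Uses only the
# Python standard library; "message.eml" is a hypothetical input file.
from email import policy
from email.parser import BytesParser

with open("message.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# Each relay prepends a "Received" header, so reversing the list
# yields the path from origin to destination.
hops = msg.get_all("Received") or []
for i, hop in enumerate(reversed(hops), start=1):
    print(f"hop {i}: {' '.join(hop.split())}")

print("claimed sender:", msg["From"])  # easily forged; compare with hops above
```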

Seigfried-Spellar suggested considering ordinary vehicular traffic as an analogy. “You’ve got your home, workplace, streets, traffic lights, other cars on the road, lanes to drive in, and places to park,” she said. “Some of the roads aren’t going to work due to construction or traffic accidents. Some cars have people carpooling and others have just the driver. Next, are you sending kids to school? Are you driving to work? We are trying to create a snapshot of all of that.”

“Tools do exist to help with this, but they’re not cost-effective for law enforcement,” Springer said. A data mining software tool called Splunk is available, for example, but “it’s not focused on law enforcement and forensics. Using our roads analogy, we want to capture all of the traffic which moves over a series of roads over a period of time in an entire area, and to know what radio station the driver was listening to or if he was talking to someone in the back seat.”

That level of detail in big data on large-scale computer networks might be important to investigators examining criminal activity, such as the transmission of an illegal image from point A to point B, Seigfried-Spellar said. “Some officers want just a snapshot. What did the road look like? Was it sunny? Rainy? Was the road slippery? That’s what comes into play when evidence gets to the jury. We’re trying to balance the quantity of data with what they will need to know and when they will need it.”

Adapting existing tools for use in criminal investigations can be expensive for police departments because of the sheer volume of data.

“We’re talking about a USA-scale of traffic, not a West Lafayette-scale, and traffic is driving at the speed of light,” said Springer. “And we might want to identify individual elements from thousands of miles away.”

To accommodate the scale of data and address the challenges faced by digital forensic examiners, Seigfried-Spellar and Springer, along with Marc Rogers, professor and head of the Department of Computer and Information Technology, and Baijian Yang, associate professor of computer and information technology, are using a grant from the National Institute of Justice to build File Toolkit for Selective Analysis & Reconstruction (File TSAR) for Large Scale Computer Networks.

File TSAR will provide tools that show how data was transmitted, so that examiners can reconstruct it and provide key evidence to prosecutors.

“Our search engine is a forensics tool,” Springer said. “We’ll collect data at the network packet level.” A packet is the individual unit of data sent over a network, somewhat analogous to a letter which the post office delivers to the address on the envelope. A complete transmission consists of many packets, like sending a book via the Postal Service one page at a time.
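As an illustration of packet-level collection (a sketch under the roads analogy, not File TSAR’s actual collector), the open-source Scapy library can capture packets and group them into flows, the “letters” that together make up one “book”:

```python
# Illustrative sketch only, not File TSAR's collector: capture packets
# with Scapy and group them by flow, so a multi-packet transmission
# (the "book") can later be reassembled from its packets (the "pages").
from collections import defaultdict
from scapy.all import sniff, IP, TCP  # requires Scapy and capture privileges

flows = defaultdict(int)

def record(pkt):
    # Key each packet by its source/destination address and port.
    if IP in pkt and TCP in pkt:
        key = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        flows[key] += 1

sniff(prn=record, count=100)  # stop after 100 packets

for (src, sport, dst, dport), n in flows.items():
    print(f"{src}:{sport} -> {dst}:{dport}  {n} packets")
```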

“Right now the collector, indexer, and optimizer could all run on the same machine,” said Springer. “But as we need more compute power, it could be scaled and spread to run across multiple servers.”
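Springer did not detail the implementation, but the idea of co-located stages that can later be distributed can be sketched. In the hypothetical Python below, the collector and indexer communicate only through a queue, so the in-process queue could later be swapped for a networked one (such as Kafka) to spread the stages across servers:

```python
# Hedged sketch of the scaling idea: collector and indexer as separate
# stages that run in one process today but could be split across machines.
# The stage names mirror the article; the implementation is illustrative.
import queue
import threading

capture_q = queue.Queue()  # collector -> indexer channel
index = {}                 # toy in-memory index

def collector():
    # Stand-in for packet capture; emits fake flow records.
    for i in range(5):
        capture_q.put({"flow": f"flow-{i % 2}", "payload": f"pkt-{i}"})
    capture_q.put(None)  # sentinel: capture finished

def indexer():
    # In a scaled-out deployment, this loop would read from a network
    # queue instead of an in-process one.
    while (rec := capture_q.get()) is not None:
        index.setdefault(rec["flow"], []).append(rec["payload"])

t1 = threading.Thread(target=collector)
t2 = threading.Thread(target=indexer)
t1.start(); t2.start(); t1.join(); t2.join()
print(index)  # {'flow-0': ['pkt-0', 'pkt-2', 'pkt-4'], 'flow-1': ['pkt-1', 'pkt-3']}
```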

Investigators would connect to File TSAR via a web browser, he said.

The research team hopes that File TSAR will help make digital evidence more admissible, reliable, and verifiable.

“Digital evidence sometimes gets accepted without challenge, and that’s an issue,” said Rogers, who has provided training for judges. “Because it’s so technical, we need a clear process for validating it.”

The research team expects to have a File TSAR prototype ready for testing by Purdue’s new High Tech Crime Unit late this year. At the end of the project’s second year, they plan to offer a three-day training session for 20 investigators, Seigfried-Spellar said.
