How to build a custom Dataset for Tensorflow

Ivelin Ivanov, Jun 19

Tensorflow inspires developers to experiment with their exciting AI ideas in almost any domain that comes to mind.

The ML community recognizes three well-known factors that make a good Deep Neural Network model do magical things:

- Model architecture
- High quality training data
- Sufficient compute capacity

My area of interest is Real Time Communication.

Coming up with practical ML use cases that may add value to RTC applications is the easy part.

I wrote about a few of these recently.

As my co-founder and good friend Jean Deruelle pointed out, there are many more adjacent use cases if we wander into ambient computing with new generation communication devices seamlessly enhancing home and work experiences.

So I wanted to build a simple prototype and jumped right into connecting Restcomm to Tensorflow.

After a few days of research, I realized that there is no easy way to feed real-time streaming audio/video media (SIP/RTP) into a Tensorflow model.

Something similar to Google Cloud’s Speech-to-Text streaming gRPC API would have been an acceptable initial fallback, but I could not find that in the open source Tensorflow community.

There are ways to read from offline audio files and video files, but that’s quite different from processing real time latency sensitive media streams.

Eventually my search took me to the Tensorflow IO project led by Yong Tang.

TF IO is a young project with a growing community supported by Google, IBM and others.

Yong pointed me to an open GitHub issue for live audio support that was waiting on contributors.

That started a good conversation.

A couple of weekends later I had built up enough courage to take on a small coding challenge: implementing a new Tensorflow Dataset for PCAP network capture files.

PCAP files are closely related to real time media streams, because they are precise historical snapshots of network activity.

PCAP files enable recording and replay of actual network packets as they come into the media processing software including dropped packets and time delays.

Back to the subject of this article. I will now walk you through the main steps in my quest to build a TF PcapDataset and contribute it to the Tensorflow IO project:

1. Fork Tensorflow IO and build from source.

2. Look at the adjacent datasets in the source tree and pick one that’s closest to pcap. I leveraged code from text, cifar and parquet. There is also a document on creating TF ops that proved helpful.

3. Ask for help on the gitter channel. There are folks who pay attention and respond within hours. I got valuable advice from Stephan Uphoff and Yong. There are also monthly conference calls where anyone can chime in on project issues.

4. Submit a pull request when ready. The TF IO team is quite responsive and supportive, guiding contributors through tweaks and fixes to meet best practices.

Step 2 turned out to be the one where I spent most of my weekend-hobby time learning TF infrastructure and APIs.

Let me break it down for you.

Fundamentally TF is a graph structure with operations at each node.

Data comes into the graph, operations take data samples as inputs, process these samples and pass outputs to the next operations in the graph that their node is connected to.
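To make the graph idea concrete, here is a minimal plain-Python sketch of a dataflow graph. It does not use Tensorflow at all: the `Node` class and the `evaluate` function are hypothetical names invented for this illustration, and a real TF graph is far more elaborate.

```python
# Toy dataflow graph: each node holds an operation and the nodes feeding it.
# Illustrative only; this is not how Tensorflow is implemented.
import operator


class Node:
    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # upstream nodes whose outputs feed this node
        self.value = value    # constant value, used only by source nodes


def evaluate(node):
    """Recursively compute a node's output from its upstream nodes."""
    if node.op is None:       # source node: data entering the graph
        return node.value
    return node.op(*(evaluate(n) for n in node.inputs))


a = Node(value=2.0)
b = Node(value=3.0)
total = Node(op=operator.add, inputs=(a, b))       # a + b
result = Node(op=operator.mul, inputs=(total, b))  # (a + b) * b
print(evaluate(result))  # 15.0
```

Data enters at the source nodes `a` and `b`, and each operation node passes its output to whichever node lists it as an input.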

The figure below is an example of a TF graph from the official docs.

Figure: TF Graph example

Operations work with a common data type named tensors (hence the name TensorFlow).

The term tensor has a mathematical definition, but the data structure for a tensor is essentially an n-dimensional array: a 0-D scalar (number, character or string), a 1-D list of scalars, a 2-D matrix of scalars, or a higher-dimensional array of arrays.
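The dimensions listed above can be illustrated with nested Python lists. This is only a sketch: `shape_of` is a hypothetical helper written for this example, and it assumes the nesting is rectangular (every sublist at a given level has the same length), which is what a dense tensor requires.

```python
# Infer the shape of a nested Python list, treating it as a dense tensor.
# Assumes rectangular nesting; only the first element at each level is inspected.
def shape_of(value):
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


scalar = 3.14                                # 0-D: a single number
vector = [1.0, 2.0, 3.0]                     # 1-D: list of scalars
matrix = [[1, 2, 3], [4, 5, 6]]              # 2-D: list of equal-length rows
cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]  # 3-D: list of matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "rank", len(shape_of(t)), "shape", shape_of(t))
```

The number of entries in the shape tuple is the rank, so the scalar has shape `()` and the matrix has shape `(2, 3)`.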

Data has to be pre-processed and formatted into a Tensor data structure before it’s fed into a TF model.

This tensor format requirement is due to the linear algebra extensively used in Deep Neural Networks and the optimizations possible with these structures applying computational parallelism on GPUs or TPUs.

Figure: Tensor examples

It’s helpful to understand the benefits of TF Datasets and all the convenience functions that come out of the box, such as batching, mapping, shuffling and repeating.

These functions make it easier and more efficient to build and train TF models with limited amounts of data and compute power.

Datasets and other TF operations can be built in C++ or Python.

I picked the C++ route just so I can learn some of the TF C++ framework.

Then I wrapped them in Python.

In the future, I plan to write a few pure Python datasets, which should be a bit easier.

Let’s look at the source code file structure for a TF IO Dataset.

Figure: Source code directory structure for the TF IO pcap Dataset

Tensorflow uses Bazel as its build system, which Google open sourced in 2015.

Following is the PcapDataset BUILD file.

It declares the public name of the dynamic pcap library (_pcap_ops.so), lists the two source files to build from (pcap_input.cc and pcap_ops.cc), and declares a few TF dependencies required for the build.

Listing: Main Bazel BUILD file for the pcap dataset

The next source file of significance is pcap_ops.cc, where we declare the TF ops that will be registered with the TF runtime environment and be available to use in TF apps.

Most of the code here is boilerplate.

It says that we are introducing a PcapInput op that can read from pcap files and a PcapDataset op that is populated by a PcapInput.

The relationship between the two will become more apparent in a few moments.

From the time I started my contribution work until it was accepted into the TF master branch, several simplifications were introduced in the base TF 2.0 framework that reduced boilerplate code in my files.

I suspect there will be more of these simplifications in the near future.

The core TF team understands that in order to attract a larger community of contributors, it’s important to lower the barrier to entry.

New contributors should be able to only focus on the net new code they are writing and not sweat the details of interacting with the TF environment until they are ready for that.

The next file in the package is pcap_input.cc.

That’s where most of the heavy lifting takes place.

I spent a fair share of time writing and testing this file.

It has a section that declares the relationship between PcapDataset, PcapInput and PcapInputStream.

We will see what each of these does.

PcapInputStream contains most of the logic reading from a raw pcap file and converting it to a tensor.

To get a flavor of the input, here is a screenshot of the test http.pcap file viewed with CocoaPacketAnalyzer.

Figure: CocoaPacketAnalyzer view of http.pcap

Let me skip the logic specific to pcap files and point out a few defining elements for the conversion from raw binary file data to tensors.

Listing: Read a packet record from the pcap file and convert to tensors

This ReadRecord line reads the next pcap packet from the pcap file and populates two local variables: packet_timestamp (a double) and packet_data_buffer (a string).

    ReadRecord(packet_timestamp, &packet_data_buffer, record_count);

If a new pcap record was populated successfully, the scalars are placed into respective tensor placeholders.

The shape of the resulting output tensor is a matrix with two columns.

One column holds the timestamp scalars for each read pcap packet.

The other column holds the corresponding packet data as a string.

Each row in the output tensor (matrix) corresponds to a pcap packet.

Listing: Processing pcap file input to TF tensor output

    Tensor timestamp_tensor = (*out_tensors)[0];
    timestamp_tensor.flat<double>()(*record_read) = packet_timestamp;
    Tensor data_tensor = (*out_tensors)[1];
    data_tensor.flat<string>()(*record_read) = std::move(packet_data_buffer);

out_tensors are the placeholder tensors prepared when a new batch is requested from the PcapDataset.

That is done before the read loop.

The packet_timestamp scalar is placed in the first column (index [0]) at row (*record_read) using the typed flat function.

Likewise, packet_data_buffer is placed in the second column (index [1]) at the same (*record_read) row.
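For readers who want to experiment with the same two-column idea outside the C++ code, here is a minimal pure-Python sketch of a reader for classic libpcap files. It is not the PcapInputStream implementation: `read_pcap` and `to_columns` are hypothetical names, only the microsecond-resolution classic format is handled (not pcapng), and the way the real C++ code treats byte order, truncation or errors may well differ.

```python
# Minimal pure-Python reader for classic libpcap files (not pcapng).
# Produces one timestamp column and one packet-data column, one row per packet.
import io
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def read_pcap(stream):
    """Yield (timestamp_seconds, packet_bytes) for each record in the stream."""
    global_header = stream.read(24)
    if len(global_header) < 24:
        raise ValueError("truncated pcap global header")
    # The magic number reveals the byte order the file was written in.
    if struct.unpack("<I", global_header[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", global_header[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    while True:
        record_header = stream.read(16)
        if len(record_header) < 16:
            return  # end of file (or a truncated trailing record)
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", record_header)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return  # truncated packet body
        yield ts_sec + ts_usec / 1e6, data


def to_columns(records):
    """Split records into parallel timestamp and data lists (two columns)."""
    timestamps, packets = [], []
    for ts, data in records:
        timestamps.append(ts)
        packets.append(data)
    return timestamps, packets


# Build a tiny in-memory capture with two fake packets to exercise the reader.
header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
rec1 = struct.pack("<IIII", 100, 500000, 3, 3) + b"abc"
rec2 = struct.pack("<IIII", 101, 0, 2, 2) + b"hi"
timestamps, packets = to_columns(read_pcap(io.BytesIO(header + rec1 + rec2)))
print(timestamps, packets)
```

Row i of the two lists plays the same role as row (*record_read) in the two output tensors: `timestamps[i]` and `packets[i]` describe the same packet.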

This covers the key elements of the C++ code.

Now let’s look at the Python files.

The .py file at the top pcap directory level instructs the TF Python documentation generator how to traverse the Python code and extract API reference documentation.

You can read more about the documentation best practices here.

The code in that file instructs the Python API docs generator to focus on the PcapDataset class and ignore other code in this module.

Next, pcap_ops.py wraps the C++ Dataset op and makes it available to Python apps.

The C++ dynamic library is imported as follows:

    from tensorflow_io import _load_library
    pcap_ops = _load_library('_pcap_ops.so')

One of the main roles of the dataset constructor is to provide metadata about the types of tensors the dataset produces.

First it has to describe the tensor types in an individual data sample.

PcapDataset samples are a vector of two scalars: one for the pcap packet timestamp of type tf.float64 and another for the packet data of type tf.string.

    dtypes = [tf.float64, tf.string]

Batch is the number of training examples in one forward/backward pass through the neural network.

In our case, when we define the size of the batch, we also define the shape of the tensor.

When multiple pcap packets are grouped in one batch, both timestamp (tf.float64) and data (tf.string) are 1-D tensors with one element per packet in the batch.

Since we don’t know the total number of samples beforehand, and that total may not be divisible by the batch size, we would rather set the shape to tf.TensorShape([None]) to give us more flexibility.

A batch size of 0 is a special case where the shape of each individual tensor degenerates into tf.TensorShape([]), a 0-D scalar tensor.
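The reason for leaving the batch dimension unspecified shows up in a small plain-Python sketch. The `batched` helper below is a hypothetical stand-in written for this example, not part of Tensorflow; it only demonstrates that the last batch can be shorter than the requested size.

```python
# Group samples into fixed-size batches; the final batch may be shorter.
def batched(samples, batch_size):
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover samples that did not fill a whole batch
        yield batch


packets = ["pkt%d" % i for i in range(7)]  # 7 stand-in packets
sizes = [len(b) for b in batched(packets, 3)]
print(sizes)  # [3, 3, 1]
```

With 7 packets and a batch size of 3, the third batch holds a single packet, so a fixed length of 3 would not describe every batch while an unknown length (None) does.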

    shapes = [
        tf.TensorShape([]), tf.TensorShape([])] if batch == 0 else [
            tf.TensorShape([None]), tf.TensorShape([None])]

Almost there.

We just need a test case.

test_pcap_eager.py exercises PcapDataset while sampling from http.pcap.

The test code is straightforward.

It iterates over all pcap packets and tests the values in the first one against known constants.

To build just the PcapDataset and run its test, I used the following lines from the local io directory:

    $ bazel build -s --verbose_failures //tensorflow_io/pcap/.
    $ pytest tests/test_pcap_eager.py

That’s it! I hope this helps you build your own custom Dataset.

When you do, I hope you will consider contributing it to the TF community to accelerate the progress of open source AI.

Feel free to ask questions in the comments section.

I will try to answer to the best of my ability.

