Introducing LinkedIn’s Avro2TF

A New Feature Transformation Framework for TensorFlow

Jesus Rodriguez · Apr 11

Feature extraction and transformation is one of the main elements of any large-scale machine learning solution.

Conceptually, feature extraction and transformation is the process of deriving key pieces of information from a training dataset in a form that machine learning models can use.

If you are building an isolated machine learning model, the relevance of feature extraction and transformation is easy to grasp because, most likely, the code that accomplishes it is included in the same model.

However, feature transformation quickly becomes a nightmare in environments running multiple machine learning models.

LinkedIn has been dealing with feature extraction and transformation challenges for years and recently open sourced Avro2TF, a new framework for transforming large datasets into TensorFlow-ready features.

The challenge of feature extraction and transformation at large scale is not exclusive to LinkedIn.

For example, Facebook, Google and Uber have built feature transformation capabilities into FBLearner Flow, TFX, and Michelangelo respectively.

The way LinkedIn encountered the problem is somewhat unusual though.

For years, LinkedIn relied on Avro as the data serialization format for its machine learning datasets.

Avro datasets are powering several mission-critical machine learning solutions at LinkedIn such as its Photon-ML personalization engine.

Most of the initial machine learning models at LinkedIn were based on classification systems that had no issues interacting with Avro data.

Over the years, LinkedIn decided to embrace TensorFlow as its main framework to power deep learning capabilities.

That transition created an important challenge, as TensorFlow programs can’t easily consume Avro feature datasets.

This is due to an impedance mismatch between Avro’s sparse vector representation and TensorFlow’s vector data structure.
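To make the mismatch concrete, a sparse feature serialized in Avro is essentially a list of (index, value) pairs plus a dimension, while a TensorFlow model ultimately consumes dense tensors (or its own tf.SparseTensor structure). The following is a minimal Python sketch of the conversion, not Avro2TF code; the record layout is an assumption for illustration:

```python
# Hypothetical sketch (not Avro2TF code): an Avro-style sparse record is a
# list of (index, value) pairs plus a dimension; a TensorFlow model expects
# a dense (or tf.SparseTensor) representation.

def densify(indices, values, size):
    """Expand a sparse (indices, values) pair into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

record = {"indices": [0, 3], "values": [1.0, 2.5], "size": 5}
dense = densify(record["indices"], record["values"], record["size"])
print(dense)  # [1.0, 0.0, 0.0, 2.5, 0.0]
```

Doing this by hand for one record is trivial; the point of a framework is to do it consistently, at scale, with shared metadata.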

To understand the challenges that motivated the creation of Avro2TF, it is important to analyze the problem at LinkedIn’s scale.

Transforming data from Avro into a TensorFlow vector is not a big technical challenge if we are talking about a single instance.

However, when you are doing this across hundreds of machine learning models and tens of thousands of datasets, while trying to achieve efficient levels of reusability, you need a consistent framework to achieve optimal feature transformations.

Entering Avro2TF

The goal of Avro2TF is to provide a flexible feature transformation model between Avro and TensorFlow vector representations.

In the context of LinkedIn’s machine learning strategy, Avro2TF is an important component of Pro-ML, LinkedIn’s core machine learning platform.

Conceptually, Pro-ML enables the core building blocks of machine learning solutions at LinkedIn.

Feature transformation is an important foundational element of the Pro-ML platform and one that is relevant to all other layers of the stack.

In terms of LinkedIn’s machine learning infrastructure stack, Avro2TF is part of a broader framework called TensorFlowIn, which simplifies the implementation of TensorFlow applications for the different engineering teams at LinkedIn.

Avro2TF sits on top of highly scalable TensorFlow runtimes such as Spark and LinkedIn’s own TonY.

Functionally, Avro2TF enables a way to model and execute feature transformations between Avro and TensorFlow vector representations.

The framework provides a simple JSON configuration file for modelers to obtain tensors from existing training data.

Tensor data itself is not self-contained.

In order to be loaded into TensorFlow, tensor data must carry metadata.

Avro2TF also fills this gap by providing a distributed metadata collection job.

Avro2TF provides a bridge to leverage Avro and other sparse vector formats into TensorFlow programs.

The initial implementation of Avro2TF includes several capabilities that are worth highlighting:

Input Data Requirements: Avro2TF supports all data formats that Spark can read, including the most popular formats at LinkedIn, Avro and ORC.

Supported Data Types of Output Tensor: In Avro2TF, the supported data types (dtype) of output tensors are: int, long, float, double, string, boolean, and bytes.

The framework also provides a special data type, sparseVector, to represent categorical/sparse features.

A sparseVector tensor type has two fields: indices and values.
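As an illustration, a sparseVector feature can be thought of as two parallel arrays. The field names (indices, values) match the description above; the surrounding dictionary layout in this Python sketch is an assumption, not Avro2TF’s actual internal representation:

```python
# Hypothetical record layout for a sparseVector-typed tensor: two parallel
# fields, "indices" and "values" (field names as described above); the
# surrounding dict structure is assumed for illustration only.
sparse_feature = {
    "indices": [2, 7, 11],      # positions of the non-zero entries
    "values": [1.0, 0.5, 3.0],  # the value stored at each position
}

def nnz(feature):
    """Number of non-zero entries; the two fields must stay aligned."""
    assert len(feature["indices"]) == len(feature["values"])
    return len(feature["indices"])

print(nnz(sparse_feature))  # 3
```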

Avro2TF Configuration: At the top level, the configuration file contains information about tensors that will be fed to the deep learning training framework.

For each specified tensor, the configuration consists of two kinds of information:

Input feature information, which tells the framework which existing feature(s) should be used to construct the tensor.

Output tensor information, including the name, dtype, and shape of the expected output tensor.

Avro2TF Data Pipeline: This component handles feature extraction, feature transformation, tensor metadata and feature mapping generation, converting strings to numerical indices, and tensor serialization.
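One of the pipeline steps above, converting strings to numerical indices, amounts to building a vocabulary over the training data and replacing each term with its id. The following is a simplified Python sketch of the idea, not the actual Avro2TF implementation:

```python
# Simplified sketch of string-to-index conversion (not Avro2TF's code):
# assign each distinct term a stable integer id, most frequent first.
from collections import Counter

def build_vocab(feature_values, min_count=1):
    """Map each distinct string to an integer id, ordered by frequency."""
    counts = Counter(feature_values)
    frequent = [term for term, c in counts.most_common() if c >= min_count]
    return {term: idx for idx, term in enumerate(frequent)}

genres = ["Comedy", "Drama", "Comedy", "Action", "Comedy", "Drama"]
vocab = build_vocab(genres)
indexed = [vocab[g] for g in genres]
print(vocab)    # {'Comedy': 0, 'Drama': 1, 'Action': 2}
print(indexed)  # [0, 1, 0, 2, 0, 1]
```

In practice this mapping is produced in a distributed fashion and persisted as part of the tensor metadata, so training and serving use the same ids.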

Using Avro2TF starts by defining a configuration file that specifies the mappings between the two vector representations.

The following example is from a MovieLens tutorial included in the release:

```json
{
  "features": [
    {
      "inputFeatureInfo": { "columnExpr": "userId" },
      "outputTensorInfo": { "name": "userId", "dtype": "long", "shape": [-1] }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "movieId",
        "transformConfig": {
          "hashInfo": { "hashBucketSize": 1000, "numHashFunctions": 4 }
        }
      },
      "outputTensorInfo": { "name": "movieId_hashed", "dtype": "long", "shape": [4] }
    },
    {
      "inputFeatureInfo": { "columnExpr": "genreFeatures.term" },
      "outputTensorInfo": { "name": "genreFeatures_term", "dtype": "long", "shape": [-1] }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "genreFeatures": { "whitelist": ["Genre"] },
          "movieLatentFactorFeatures": { "blacklist": ["*"] }
        },
        "transformConfig": {
          "hashInfo": { "hashBucketSize": 100, "combiner": "AVG" }
        }
      },
      "outputTensorInfo": {
        "name": "genreFeatures_movieLatentFactorFeatures",
        "dtype": "sparseVector",
        "shape": []
      }
    }
  ],
  "labels": [
    {
      "inputFeatureInfo": { "columnExpr": "response" },
      "outputTensorInfo": { "name": "response", "dtype": "double", "shape": [] }
    }
  ]
}
```
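Because the configuration is plain JSON, it is easy to inspect programmatically. As a small illustrative check (not part of Avro2TF), the snippet below parses a trimmed-down config and lists the output tensors it declares:

```python
# Illustrative only: parse a trimmed-down Avro2TF-style config and list the
# (name, dtype) of every output tensor it declares. The key names mirror the
# example above; the helper function itself is hypothetical.
import json

config_text = """
{
  "features": [
    {"inputFeatureInfo": {"columnExpr": "userId"},
     "outputTensorInfo": {"name": "userId", "dtype": "long", "shape": [-1]}},
    {"inputFeatureInfo": {"columnExpr": "movieId"},
     "outputTensorInfo": {"name": "movieId_hashed", "dtype": "long", "shape": [4]}}
  ],
  "labels": [
    {"inputFeatureInfo": {"columnExpr": "response"},
     "outputTensorInfo": {"name": "response", "dtype": "double", "shape": []}}
  ]
}
"""

config = json.loads(config_text)

def output_tensors(config):
    """Collect (name, dtype) for every tensor the config declares."""
    entries = config.get("features", []) + config.get("labels", [])
    return [(e["outputTensorInfo"]["name"], e["outputTensorInfo"]["dtype"])
            for e in entries]

print(output_tensors(config))
# [('userId', 'long'), ('movieId_hashed', 'long'), ('response', 'double')]
```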

After that, features can be transformed from Avro to TensorFlow using a few lines of code:

```scala
// Read input data from HDFS to Spark DataFrame
var dataFrame = TensorizeInJobHelper.readDataFromHDFS(spark, params)

// Sanity check on tensor names specified in TensorizeIn config
TensorizeInJobHelper.tensorsNameCheck(dataFrame, params)

// Extracts features that will be converted to tensors
dataFrame = (new FeatureExtraction).run(dataFrame, params)

// Transforms features that will be converted to tensors
dataFrame = (new FeatureTransformation).run(dataFrame, params)

// Generate tensor metadata only in train mode; otherwise, directly load existing ones from working directory
if (params.executionMode == Constants.TRAINING_EXECUTION_MODE) {
  if (params.enableCache) dataFrame.persist(StorageLevel.MEMORY_AND_DISK_SER)

  // Generate tensor metadata
  (new FeatureListGeneration).run(dataFrame, params)
  (new TensorMetadataGeneration).run(dataFrame, params)
}
```

The simplest way to start using Avro2TF is to install the Docker image included in the open source release.

The instance includes several Jupyter notebooks with detailed instructions on how to leverage Avro2TF in TensorFlow applications.

Avro2TF addresses a very common scenario in real world machine learning applications.

Many organizations have already invested in creating robust Spark infrastructures as part of their big data initiatives.

Naturally, that makes a strong case for leveraging Spark runtimes for running machine learning models like those built using TensorFlow.

Solving the impedance mismatch between the Avro/Spark and TensorFlow vector representations removes an important roadblock to enable those scenarios.

Initiatives like Avro2TF, which have proven these capabilities at the scale of an organization like LinkedIn, are certainly a welcome addition to the TensorFlow stack.

