Introducing Feast: Google’s New Feature Store for Machine Learning Applications
Jesus Rodriguez
Jan 23
Feature extraction and storage is one of the most important and often overlooked aspects of machine learning solutions.
Features play a key role in helping machine learning models process and understand datasets for training and production.
If you are building a single machine learning model, feature extraction seems like a very basic thing to do, but the picture gets complicated quickly as your team scales.
Picture a large organization with dozens of data science teams cranking out machine learning models.
Each team needs to process different datasets and extract the corresponding features, which becomes computationally expensive and nearly impossible to scale.
Building mechanisms for reusing features across different models is one of the key challenges faced by high-performance machine learning teams.
A feature store is a pattern that is becoming prevalent in modern machine learning solutions.
Conceptually, a feature store serves as a repository of features that can be used on the training and evaluation of machine learning models.
Despite their obvious value proposition, feature stores are notably missing from most machine learning platforms.
Recently, Google joined forces with the Asian ride-hailing startup GO-JEK to open source Feast, a feature store for machine learning models.
Feast abstracts many of the fundamental building blocks of feature extraction, transformation and discovery which are omnipresent in machine learning applications.
The Motivation
Like other rapidly growing data science organizations, GO-JEK constantly faces challenges in terms of feature extraction and discovery.
GO-JEK’s machine learning models typically reuse common features such as driving time to destination, time of day or driver profile in order to extract intelligence from heterogeneous datasets.
Beyond the obvious benefits of feature extraction and discovery, Google and GO-JEK decided to build Feast with some very tangible goals in mind:
· Feature Standardization: Feast attempts to present a centralized repository for describing features of machine learning models.
This provides structure to the way features are defined and allows teams to reuse features across different machine learning models.
· Feature Discovery: Feast enables the exploration and discoverability of features and their associated information.
This allows for a deeper understanding of features and their specifications, more feature reuse between teams and projects, and faster experimentation.
· Model Training-Serving Consistency: Feast’s standard representations enable feature consistency between model training and serving.
This addresses the constant mismatch between the development and production version of machine learning models.
· Feature Infrastructure Management: A pretty obvious benefit, Feast abstracts the infrastructure needed to extract, store and manage features across machine learning models.
Although conceptually simple, feature extraction is one of those areas that ends up consuming incredibly large amounts of time in machine learning implementations.
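To make the first three goals more concrete, here is a minimal sketch of a centralized feature registry. It is an illustration only, not Feast's API: the `FeatureDefinition` and `FeatureRegistry` classes, the `avg_trip_minutes` feature and the `driver` entity are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical shared description of one feature (not a Feast class)."""
    name: str
    entity: str                         # domain object the feature describes
    description: str
    compute: Callable[[dict], float]    # the single transformation everyone reuses


class FeatureRegistry:
    """Toy central registry illustrating standardization, discovery and consistency."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Standardization: exactly one definition per feature name.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def search(self, entity: str) -> List[str]:
        # Discovery: list the feature names available for an entity.
        return sorted(f.name for f in self._features.values() if f.entity == entity)

    def compute(self, name: str, raw: dict) -> float:
        # Consistency: training and serving both go through this code path.
        return self._features[name].compute(raw)


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="avg_trip_minutes",
    entity="driver",
    description="Mean trip duration in minutes",
    compute=lambda raw: sum(raw["trip_minutes"]) / len(raw["trip_minutes"]),
))

# The same definition produces the value at training time and at serving time.
raw_record = {"trip_minutes": [10, 20, 30]}
training_value = registry.compute("avg_trip_minutes", raw_record)
serving_value = registry.compute("avg_trip_minutes", raw_record)
```

Because both the training pipeline and the serving path call `registry.compute`, there is no second copy of the transformation that could drift out of sync, which is the mismatch the consistency goal describes.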
The Architecture
In order to accomplish the aforementioned goals, Feast relies on a very simple architecture that abstracts the feature analysis process in five simple stages:
· Create: features based on a defined format and programming model
· Ingest: features via streaming input, import from files or BigQuery tables, and write to an appropriate data store
· Store: feature data for both serving and training purposes based on feature access patterns
· Access: features for training and serving
· Discover: information about entities and features stored and served by Feast
The core architecture of Feast is illustrated in the following figure:
[Figure: core architecture of Feast]
Feast relies on BigQuery as the underlying storage mechanism for the feature store.
In BigQuery, a feature is defined by the following attributes:
· Entity: A feature must be associated with a known Entity, which is a domain-specific concept. Examples of Entities can be Customer, Driver or any other relevant domain object.
· ValueType: The feature type must be defined, e.g. String, Bytes, Int64, Int32, Float etc.
· Requirements: Properties related to how a feature should be stored for serving and training
· Granularity: Time series features require a defined granularity
· StorageType: For both serving and training a storage type must be defined
Those basic attributes are enough to represent features in a way that can be used across different machine learning models.
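As a rough illustration, the attributes above could be captured in a small typed record. This is a sketch under assumptions rather than Feast's real specification format: the `FeatureSpec` class, its field names, the `feature_id` naming scheme and the example store names are hypothetical; only the attribute concepts and the value types come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ValueType(Enum):
    # Value types named in the attribute list above.
    STRING = "String"
    BYTES = "Bytes"
    INT64 = "Int64"
    INT32 = "Int32"
    FLOAT = "Float"


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical record of the attributes that describe a feature."""
    name: str
    entity: str                  # e.g. "customer" or "driver"
    value_type: ValueType
    serving_store: str           # storage type used for serving (assumed label)
    warehouse_store: str         # storage type used for training (assumed label)
    granularity: Optional[str] = None   # needed for time series features, e.g. "day"

    def __post_init__(self) -> None:
        # A feature must be tied to a known entity.
        if not self.entity:
            raise ValueError("a feature must be associated with an entity")
        # A storage type is required for both serving and training.
        if not self.serving_store or not self.warehouse_store:
            raise ValueError("a storage type is required for serving and training")

    @property
    def feature_id(self) -> str:
        # Assumed naming scheme for this sketch: entity.granularity.name
        return ".".join([self.entity, self.granularity or "none", self.name])


spec = FeatureSpec(
    name="completed_trips",
    entity="driver",
    value_type=ValueType.INT64,
    serving_store="key_value",
    warehouse_store="bigquery",
    granularity="day",
)
```

Validating the spec at construction time means an incomplete definition fails when it is declared, not later when a model tries to read the feature.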
From the architecture standpoint, Feast is based on four fundamental components:
· Feast Core: The Core subsystem is responsible for managing the different components of Feast.
For instance, Feast Core manages the execution of feature ingestion jobs from batch and streaming sources while also enabling the registration and management of entities, features, data stores, and other system resources.
· Feast Store: Feast supports two fundamental types of stores: warehouse and serving.
Feast Warehouse Stores are based on Google BigQuery and maintain all historical feature data.
The warehouse can be queried for batch datasets which are then used for model training.
Serving Stores are responsible for maintaining feature values for access in a production serving environment.
· Feast Serving API: This API is responsible for the retrieval of feature values by models in production.
The Feast Serving API supports both HTTP and gRPC, which allows for low-latency, high-throughput access to feature values.
· Feast Client Libraries: Feast supports client libraries for different languages such as Java, Go and Python as well as a command-line module.
The client libraries streamline the developer interactions with the platform.
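The split between warehouse and serving stores described in the components above can be sketched with two in-memory classes. This is a simplified illustration, assuming that ingestion writes every row to both stores; the `WarehouseStore`, `ServingStore` and `ingest` names and the row layout are hypothetical and do not reflect Feast's client libraries or Serving API.

```python
from typing import Dict, Iterable, List, Tuple

# (entity_id, feature_name, timestamp, value): an assumed row layout for this sketch
Row = Tuple[str, str, int, float]


class WarehouseStore:
    """Append-only history, standing in for a warehouse such as BigQuery."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, row: Row) -> None:
        self.rows.append(row)

    def history(self, feature_name: str) -> List[Row]:
        # Batch read used to assemble a training dataset, ordered by timestamp.
        matching = (r for r in self.rows if r[1] == feature_name)
        return sorted(matching, key=lambda r: r[2])


class ServingStore:
    """Keeps only the latest value per (entity, feature) for fast lookups."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Tuple[int, float]] = {}

    def write(self, row: Row) -> None:
        entity_id, feature_name, ts, value = row
        key = (entity_id, feature_name)
        current = self._latest.get(key)
        # Overwrite only when the incoming row is at least as recent.
        if current is None or ts >= current[0]:
            self._latest[key] = (ts, value)

    def get(self, entity_id: str, feature_names: Iterable[str]) -> Dict[str, float]:
        # Online read used by a model in production; unknown features are skipped.
        return {
            name: self._latest[(entity_id, name)][1]
            for name in feature_names
            if (entity_id, name) in self._latest
        }


def ingest(rows: Iterable[Row], warehouse: WarehouseStore, serving: ServingStore) -> None:
    # One ingestion path feeds both stores, so training and serving see the same data.
    for row in rows:
        warehouse.write(row)
        serving.write(row)


warehouse, serving = WarehouseStore(), ServingStore()
ingest(
    [
        ("driver_1", "completed_trips", 1, 10.0),
        ("driver_1", "completed_trips", 2, 12.0),
        ("driver_2", "completed_trips", 1, 3.0),
    ],
    warehouse,
    serving,
)

training_rows = warehouse.history("completed_trips")       # full history
online_features = serving.get("driver_1", ["completed_trips"])  # latest value only
```

The warehouse keeps all three rows for training, while the serving store answers with only the most recent value for `driver_1`. That difference in access pattern is the reason the two stores are kept separate.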
There are different ways to get started with Feast but one of the most creative ones is via Kubeflow.
In just a few months, Kubeflow has become one of the most popular runtimes for the execution of machine learning workflows.
Conceptually, Kubeflow is an open source Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable ML workloads.
It helps support reproducibility and collaboration in ML workflow lifecycles, allowing you to manage end-to-end orchestration of ML pipelines.
Feast provides native integration with Kubeflow, which streamlines its adoption in machine learning environments.
As someone who is constantly exposed to real-world machine learning solutions and has experienced the challenges of doing feature management at scale, I am incredibly excited about efforts like Feast.
The commitment from Google can definitely help with the adoption of the platform.
As machine learning evolves, we are likely to see more efforts like Feast that try to abstract the fundamentals for feature extraction and discovery in machine learning solutions.