Introducing Built-in Image Data Source in Apache Spark 2.4

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from sparkdl import DeepImageFeaturizer

    # Read image data using the new image data source
    image_df = spark.read.format("image").load(sample_img_dir)

    # Databricks display includes built-in image display support
    display(image_df)

    # Split training and test datasets
    train_df, test_df = image_df.randomSplit([0.6, 0.4])

    # Train logistic regression on features generated by InceptionV3
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")

    # Build logistic regression transform
    lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")

    # Build ML pipeline
    p = Pipeline(stages=[featurizer, lr])

    # Build our model
    p_model = p.fit(train_df)

    # Run our model against test dataset
    tested_df = p_model.transform(test_df)

    # Evaluate our model
    evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
    print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select("prediction", "label"))))

Note: For Deep Learning Pipelines developers, the new image schema changes the ordering of the color channels from RGB to BGR. To minimize confusion, some of the internal APIs now require you to specify the ordering explicitly.

What's Next

It would be helpful if you could sample the returned DataFrame via df.sample, but sampling is not optimized: every image file is still read before any rows are dropped. To improve this, we need to push down the sampling operator to the image data source so that it doesn't need to read every image file. This feature will be added in DataSource V2 in the future.

New image features are planned for future releases in Apache Spark and Databricks, so stay tuned for updates. You can also try the deep learning example notebook in Databricks Runtime 5.0 ML.

Read More

For further reading on Image Data Source, and how to use it:

Read our documentation on Image Data Source for Azure and AWS.
Try the example notebook on Databricks Runtime 5.0 ML.
Learn about Deep Learning Pipelines.
Visit the Deep Learning Pipelines repository on GitHub.

Acknowledgments

Thanks to Denny Lee, Stephanie Bodoff, and Jules S.
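
To make the note on channel ordering above more concrete, here is a minimal pure-Python sketch of what BGR ordering means for the raw bytes of a 3-channel image. It assumes the bytes are laid out row by row with three bytes per pixel in blue, green, red order. The helper name bgr_bytes_to_rgb_rows is hypothetical and is not part of Spark or Deep Learning Pipelines; it only illustrates the byte layout.

```python
def bgr_bytes_to_rgb_rows(data, height, width):
    """Convert row-major, 3-channel BGR bytes into rows of (R, G, B) tuples.

    Hypothetical helper for illustration only; assumes 3 bytes per pixel.
    """
    n_channels = 3
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(
            "expected %d bytes for a %dx%d BGR image, got %d"
            % (expected, height, width, len(data))
        )

    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Offset of the first byte (blue) of pixel (y, x)
            i = (y * width + x) * n_channels
            blue, green, red = data[i], data[i + 1], data[i + 2]
            # Reorder to the RGB convention most Python imaging code expects
            row.append((red, green, blue))
        rows.append(row)
    return rows


# A 1x2 image: a pure-blue pixel followed by a pure-green pixel, stored as BGR
sample = bytes([255, 0, 0, 0, 255, 0])
rgb_rows = bgr_bytes_to_rgb_rows(sample, height=1, width=2)
```

Code that previously assumed RGB input would silently swap the red and blue channels on such data, which is why specifying the ordering explicitly helps.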
