The Jungle of Koalas, Pandas, Optimus and Spark
What to expect from the newest library from Databricks (Koalas), the Optimus framework and Apache Spark 3.x
Favio Vázquez, Apr 25
If you are as excited about data science as I am, you probably know that the latest Spark + AI Summit started yesterday (April 24th, 2019).
And there are great things to talk about.
But I will do it with a spin-off.
If you’ve been following me, you know that I co-created a framework called Optimus.
If you want to see more about that, check these articles:
Data Science with Optimus. Part 1: Intro. Breaking down data science with Python, Spark and Optimus. (towardsdatascience.com)
Data Science with Optimus. Part 2: Setting your DataOps Environment. Breaking down data science with Python, Spark and Optimus. (towardsdatascience.com)
There I explain a whole data science environment with Optimus and other open source tools.
This could have been part 3, but I’m more interested in showing you other stuff.
com/ironmussa/Optimus
In the beginning, Optimus started as a project for cleaning big data, but we soon realized that there were many more opportunities for the framework.
By now we have created a lot of interesting things, and if you are a data scientist using Pandas or Spark you should check it out.
Right now we have these APIs:
Improved versions of Spark DataFrames (much better for cleaning and munging data).
Easier Machine Learning with Spark.
Easier Deep Learning with Spark and Spark Deep Learning.
Plots directly from Spark Dataframes.
Profiling Spark Dataframes.
Easier database connections (like Amazon Redshift).
Data enrichment by connecting to external APIs.
You can even read data directly from the internet to Spark.
So you can see we have been working hard to improve the world of the data scientist.
One of the things we cared about was creating a simple and usable API. We didn’t love the Pandas API or the Spark API on their own, but a combination of the two, with a little touch of awesomeness, created what you can today call our framework.
Koalas vs Optimus vs Spark vs Pandas
Today Databricks announced the Koalas project as a more productive way of interacting with big data, by augmenting Apache Spark’s Python DataFrame API to be compatible with Pandas.
If you want to try it, check this MatrixDS project:
MatrixDS | The Data Project Workbench. MatrixDS is a place to build, share and manage data projects at any scale.
And this GitHub repo:
FavioVazquez/koalas_optimus_spark. Rocking the world with Spark and friends. (github.com)
So instead of boring you by copy-pasting the Koalas documentation, which you can read right now, I created a simple example of the connection between Koalas, Optimus and Spark.
You’ll need to install Optimus
pip install --user optimuspyspark
and Koalas
pip install --user koalas
I’ll be using this dataset for testing:
https://raw.…csv
Let’s first read the data with vanilla Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_sp = spark.read.csv("…csv", header=True)
For that I needed to upload the dataset beforehand.
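One way to get a remote file onto local storage before handing it to a reader that expects a local path is to download it first. The following is a minimal sketch using only the Python standard library. The helper name download_csv, the file names and the sample rows are all hypothetical, and a file:// URL stands in for the real dataset URL so the sketch runs offline.

```python
import csv
import tempfile
import urllib.request
from pathlib import Path


def download_csv(url: str, dest: Path) -> Path:
    """Download `url` to `dest` and return the local path (hypothetical helper)."""
    urllib.request.urlretrieve(url, dest)
    return dest


# Stand-in for the remote dataset: a tiny CSV with made-up rows, exposed
# through a file:// URL. In practice you would pass the real https:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "remote_sample.csv"
source.write_text("Date,Open,Symbol\n2018-03-27,1.0,AAPL\n")

local_path = download_csv(source.as_uri(), workdir / "sample.csv")

# Sanity check: the downloaded copy is an ordinary local CSV file.
with open(local_path, newline="") as fh:
    rows = list(csv.DictReader(fh))

print(local_path.name, rows[0]["Symbol"])
```

Once the file is local, its path can be passed to the Spark reader, for example spark.read.csv(str(local_path), header=True).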
Let’s see that in Optimus:
from optimus import Optimus
op = Optimus()
df_op = op.load.url("https://raw.…csv")
That was one step simpler, because with Optimus you can read data directly from the web.
What about Koalas?
import databricks.koalas as ks
df = ks.read_csv("https://raw.…csv")
This code will fail. Would that happen in Pandas?
import pandas as pd
df_pd = pd.read_csv("https://raw.…csv")
That would work, because you can read data with Pandas directly from the web.
OK, so let’s make the Koalas code work:
import databricks.koalas as ks
df_ks = ks.read_csv("…csv")
Well, that looks simple enough.
By the way, if you want to read the data from local storage with Optimus, it’s almost the same:
from optimus import Optimus
op = Optimus()
df_op_local = op.load.csv("…csv")
But let’s take a look at what happens next.
What are the types of these DataFrames?
print(type(df_sp))
print(type(df_op))
print(type(df_pd))
print(type(df_ks))
And the result is:
<class 'pyspark.…DataFrame'>
So the only framework that created a Spark DF, apart from Spark itself, was Optimus.
What does this mean?
Let’s see what happens when we want to show the data. For showing data in Spark we normally use the .show() method, and for Pandas the .head() method.
df_sp.show(1)
Will work as expected.
df_op.show(1)
Will work too.
df_pd.head(1)
Will work as well.
But what about our Koalas DF? Well, you need to use the pandas API, because one of the goals of the library is to make the transition from pandas easier.
df_ks.show(1)
Will fail, but
df_ks.head(1)
Will work.
If you are running this code along with me and you called show for Spark, this is what you saw:
+----------+------+------+------+------+----------+----------+----------+-------+-------+------+--------+----------+------+
|      Date|  Open|  High|   Low| Close|    Volume|ExDividend|SplitRatio|AdjOpen|AdjHigh|AdjLow|AdjClose| AdjVolume|Symbol|
+----------+------+------+------+------+----------+----------+----------+-------+-------+------+--------+----------+------+
|2018-03-27|173.…0|  AAPL|
+----------+------+------+------+------+----------+----------+----------+-------+-------+------+--------+----------+------+
only showing top 1 row
Which is kinda awful.
Everyone prefers those pretty HTML tables for looking at their data. Pandas has them, so Koalas inherits them from Pandas.
But remember, these are not Spark DFs.
If you really want to see a prettier version of a Spark DF, with Optimus you can use the .table() method:
df_op.table(1)
and you’ll see an output that shows you the data better, plus information about it like the types of the columns, the number of rows in the DF, and the number of columns and partitions.
Selecting data
Let’s do more with our data, like slicing it.
I’ll choose the columns Date, Open, High and Volume with each framework.
There may be more ways of selecting data; I’m just using the common ones.
With Spark:
# With Spark
df_sp["Date","Open","High","Volume"].…
…table(1)
# or with indices :)
df_op.…
…head(1) # will work
df_ks.…head(1) # will fail
df_ks.select("Date","Open","High","Volume") # will fail
So as you can see, right now we have good support for different things with Optimus. If you love the [] style from Pandas, you can use it with Koalas too, but you can’t select by indices, at least not yet.
The difference here is that with Koalas and Optimus you are running Spark code underneath, so you don’t have to worry about performance.
At least not right now.
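To make the two selection styles concrete, here is a minimal pandas-only sketch. The three-row frame and its values are made up for illustration (only the column names mirror the dataset above). It selects the same columns once by name with the [] style and once by position with .iloc.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
df_demo = pd.DataFrame(
    {
        "Date": ["2018-03-27", "2018-03-26", "2018-03-23"],
        "Open": [1.0, 2.0, 3.0],
        "High": [1.5, 2.5, 3.5],
        "Low": [0.5, 1.5, 2.5],
        "Volume": [100, 200, 300],
        "Symbol": ["AAPL", "AAPL", "AAPL"],
    }
)

# Select by column name, the [] style.
by_name = df_demo[["Date", "Open", "High", "Volume"]]

# Select the same columns by position (0=Date, 1=Open, 2=High, 4=Volume).
by_position = df_demo.iloc[:, [0, 1, 2, 4]]

print(by_name.head(1))
print(by_name.equals(by_position))
```

Both selections return the same four columns, which is the behaviour the by-name and by-index variants above are aiming for.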
More advanced stuff:
Let’s get the frequencies for a column:
Pandas:
df_pd["Symbol"].value_counts()
Koalas:
df_ks["Symbol"].value_counts()
They’re the same, which is very cool.
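As a quick illustration of what value_counts() returns, here is a small pandas sketch. The symbols in the series are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical symbols, not taken from the dataset.
symbols = pd.Series(
    ["AAPL", "MSFT", "AAPL", "GOOG", "AAPL", "MSFT"], name="Symbol"
)

# value_counts() returns one entry per distinct value, most frequent first.
counts = symbols.value_counts()

print(counts)
```

In plain PySpark the usual way to get a comparable result is a groupBy("Symbol") followed by count(), which is noticeably more verbose.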
Spark (some of the bad parts):
df_sp.…show()
Optimus (you can do the same as in Spark):
df_op.…show()
or you can use the .cols attribute to get more functions:
df_op.cols.frequency("Symbol")
Let’s transform our data with One-Hot Encoding:
Pandas:
# This is crazy easy
pd.…head(1)
Koalas:
# This is crazy easy too
ks.…head(1)
Spark (similar enough result but horrible to do):
# I hate this
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
indexer = StringIndexer(inputCol="Symbol", outputCol="SymbolIndex")
df_sp_indexed = indexer.fit(df_sp).transform(df_sp)
encoder = OneHotEncoderEstimator(inputCols=["SymbolIndex"], outputCols=["SymbolVec"])
model = encoder.fit(df_sp_indexed)
df_sp_encoded = model.transform(df_sp_indexed)
df_sp_encoded.show(1)
Optimus (a little better but I still prefer Koalas for this):
from optimus.ml.feature import string_to_index, one_hot_encoder
df_sp_indexed = string_to_index(df_sp, "Symbol")
df_sp_encoded = one_hot_encoder(df_sp_indexed, "Symbol_index")
df_sp_encoded.show()
So in this case the easiest way was the Pandas one, and luckily it’s implemented in Koalas. These types of functions will increase in the future, but right now this is almost all we have, as you can see here:
General functions – Koalas 0.…0 documentation (koalas.…io)
But they run in Spark, so it rocks.
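For reference, this is roughly what the Pandas-style one-hot encoding looks like on a tiny frame. It is a sketch with made-up rows, and it assumes the encoding above is done with get_dummies, which the original snippet does not show in full.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
df_demo = pd.DataFrame(
    {"Symbol": ["AAPL", "MSFT", "AAPL"], "Low": [1.0, 2.0, 3.0]}
)

# One indicator column is created per distinct value of "Symbol";
# the other columns are left untouched.
encoded = pd.get_dummies(df_demo, columns=["Symbol"])

print(encoded.head(1))
```

Compared with the StringIndexer plus OneHotEncoderEstimator pipeline, the whole transformation is a single call, which is why the Pandas-style API is attractive here.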
Plots:
Plotting is an important part of data analysis.
With Pandas we are used to plotting whatever we want very easily, but with Spark it’s not that easy.
We are happy to announce that in the latest version of Optimus (2.
2) we have created a way of creating plots directly from your Spark DataFrames, no subsetting needed.
…hist("Low")
This will not work.
…scatterplot(["Open","Volume"])
Apache Spark 3.x:
These are some of the things to expect from Spark 3.…
…12 support (listing it twice)
Continuous Processing non-experimental
Kubernetes support non-experimental
A more fleshed out version of data source API v2
Hadoop 3.0 support
Improve usability of Spark Deep Learning Pipelines
And more!
databricks.com
Conclusion
Spark is growing exponentially, and if you are not using it now, you definitely should.
Most likely you will be coming from Pandas or something similar, so make use of great libraries like Koalas and Optimus to improve your life in the data science world.
If you have any questions, please write to me here:
Favio Vazquez - Founder / Chief Data Scientist - Ciencia y Datos | LinkedIn
and follow me on Twitter:
Favio Vázquez (@FavioVaz) | Twitter
Have fun learning :)