Python Tutorial for Researchers Who Use R

Installation, Loading Data, Visualization, Linear Regression, Rpy2

Jun Wu, Jul 2

Photo: @wwarby, unsplash.com

This tutorial is aimed at researchers who are used to using R.

Data is at the center of any research project.

To me, every researcher is now empowered with more data than ever before.

This puts a researcher squarely in the role of being a data scientist.

If you are a researcher who’s been using R all this time, you are missing out.

Python puts powerful models for handling larger amounts of data at your disposal.

To upgrade your skills as a data scientist, give Python a try.

Like me, you might find that you still love R for some tasks.

Slowly, as you get used to using Python for other tasks, you may find that it’s more robust in handling larger amounts of data.

I will go over Python installation, data loading, simple visualization, and a linear regression example.

Toward the end, just for novelty's sake, I will show you how to use R in Python.

The Case for Using Python

For researchers who are using R, using Python might seem to be a daunting task.

On the contrary, today’s Python analysis packages, such as pandas, numpy, and sklearn, make it very easy to load, explore, and analyze data as you would in R.

You don’t need to write extensive code for simple analysis.

Using pandas and numpy together: The combination of the two lets you handle most data management tasks.

You create dataframes, much as you would in R.

Using these packages, you can handle missing data and manage columns and rows.

Using sklearn: Most of your machine learning needs are covered by this package.

You can find models for classification, regression, and clustering.

You can also find tools for dimensionality reduction, model selection, and preprocessing.

Using matplotlib: Most of your graphing needs are taken care of by this package.

There are simple graphs such as bar charts and line charts.

There are more complex graphs such as gradients, contours, and heatmaps.
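As a rough illustration of the data management described above, here is a minimal sketch using a small made-up table. The column names and values are hypothetical and are not taken from any dataset in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: 4 subjects, 2 variables (made-up values)
values = np.array([[23.0, 5.1],
                   [31.0, 6.4],
                   [45.0, 7.2],
                   [52.0, 5.9]])

# Build a dataframe from a numpy array, much like data.frame() in R
df = pd.DataFrame(values, columns=["age", "score"])

# Add a derived column computed with numpy
df["log_score"] = np.log(df["score"])

# Keep only the rows where age is above 30
older = df[df["age"] > 30]

# Drop a column that is no longer needed
trimmed = older.drop(columns=["log_score"])

print(trimmed)
```

Row filtering with a boolean condition and column dropping by name cover a large share of everyday data management, and both read much like their R equivalents.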

Mac and Windows Installation of Python

Now, let’s get started by installing Python onto your desktop.

The Anaconda distribution of Python is recommended.

It contains not only Python but also the Spyder editor for development.

Download Anaconda for Windows or Mac.

Choose the Python 3.7 version, download it, and install it.

Open up Terminal and run the command below.

conda list

You should get back a list of the packages installed in your conda environment.

If you do, congratulations, you just installed Python.
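One way to double-check the installation from inside Python is sketched below. The list of package names is an assumption based on the packages named in this tutorial, not something the installer reports.

```python
import importlib.util
import sys

# Report which Python interpreter is running and its version
print(sys.executable)
print(sys.version)

# Packages used later in this tutorial; sklearn is the import name of scikit-learn
packages = ["pandas", "numpy", "sklearn", "matplotlib"]

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in status.items():
    print(name, "found" if found else "missing")
```

If the interpreter path does not point into your Anaconda folder, the terminal is probably picking up a different Python installation.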

If you receive an error, try to do the following and check again.

Errors are usually caused by the shell not being able to find your Anaconda installation in your path.

The Anaconda installation process should already have appended the necessary lines to your .profile file.

Sourcing it one more time in your terminal should do the trick.

source .profile
conda list

Spyder

With Anaconda, you get the Spyder editor for free.

It’s a good default editor to use for Python.

You can start Spyder by simply launching it from the command line.

spyder

Once your editor starts up, you want to create a new file.

Use File -> New File, then File -> Save As to name your new Python code.

Commenting out code in Python

You can comment out a block of code by wrapping it in triple quotes, or comment out a single line of code using “#”.

"""
diabetes=pd.read_csv('data/diabetes.csv')
"""

#diabetes=pd.read_csv('data/diabetes.csv')

Loading Datasets

In R, you can load test datasets by default.

In Python, packages such as sklearn also ship with test datasets that you can load by default.
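For readers used to data(iris) in R, here is a minimal sketch of loading a bundled dataset in Python. It assumes sklearn, whose datasets module ships a copy of the iris data; the variable names are illustrative.

```python
import pandas as pd
from sklearn import datasets

# Load the bundled iris data (no download needed)
iris = datasets.load_iris()

# Wrap the numeric measurements in a dataframe with named columns
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer class labels to species names
iris_df["species"] = iris.target_names[iris.target]

print(iris_df.head())
```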

For completeness' sake, there are many more datasets on Kaggle.

Let’s grab our test data from there.

We’ll use the “Pima Indians Diabetes Database” dataset.

Click to download it.

Loading the data

First, print the column names.

Then, print the first few rows.

import pandas as pd

data = pd.read_csv('data/diabetes.csv')

# Print the data columns
print(data.columns)

# Preview the first 5 lines of the loaded data
print(data.head())

Figure: Summary Statistics by Jun Wu

Summary Statistics

Summary statistics can be run using pandas.

The describe() function is similar to the summary() function in R and outputs the result as a table.

print(data.describe())

The output is shown here in table format.
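Because describe() returns a dataframe, individual statistics can be pulled out of it by label. A minimal sketch, using made-up values rather than the real diabetes data:

```python
import pandas as pd

# Made-up values standing in for two numeric columns
df = pd.DataFrame({
    "Age": [21, 35, 50, 42],
    "Glucose": [85, 120, 160, 99],
})

# describe() returns a dataframe, so individual statistics can be looked up
summary = df.describe()
print(summary)

# Row labels are the statistic names, columns are the variables
print(summary.loc["mean", "Age"])      # mean age
print(summary.loc["50%", "Glucose"])   # median glucose
```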

Creating a sub dataframe to explore

Our columns are: [Pregnancies (number of), Glucose (level of), Blood Pressure (level of), SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome].

We can create a sub dataframe by simply indexing our dataframe (data) with a list of column names.

age = data[["Age", "Glucose"]]
print(age)

Figure: Age vs Glucose by Jun Wu

Plotting and Visualization

Plotting and visualization for models can be done using matplotlib.

Histogram

As a first step, plotting histograms of every column, split by “Outcome”, shows you how each variable is distributed in the two groups.

A simple groupby on the dataframe (data) by the column “Outcome” accomplishes this.

Calling “.hist” on the grouped data generates histograms of all the other columns for each value of “Outcome”.
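To see what the groupby step does before any plotting, here is a minimal sketch on a few made-up rows. The values are hypothetical; only the column names mirror the diabetes data.

```python
import pandas as pd

# Made-up rows; "Outcome" is 0 or 1 as in the diabetes data
df = pd.DataFrame({
    "Age": [25, 31, 47, 52, 38, 60],
    "Glucose": [90, 95, 150, 170, 100, 165],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# groupby splits the rows by Outcome; each group is its own small dataframe
for outcome, group in df.groupby("Outcome"):
    print(outcome, len(group))

# Aggregating per group gives one row per Outcome value
means = df.groupby("Outcome").mean()
print(means)
```

Plotting works on the same split: each Outcome group is handled separately, so the histograms can be compared group by group.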

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('data/diabetes.csv')

data.groupby('Outcome').hist(figsize=(9, 9))

Boxplot

A simple boxplot can be created by using the “by=” parameter of the boxplot function for the dataframe (data).

The “Outcome” column is either positive (1) or negative (0).

In this case, “Age” is grouped by “Outcome”.

This is a boxplot of the ages of people who have either Outcome=0 or Outcome=1.

data.boxplot(column=['Age'], by=['Outcome'])

Missing Data Handling

Checking for missing data is important in any data analysis.

You can use the isnull() and isna() functions to check for missing data.

Pandas has a great tutorial on missing data for more information.

In this case, the check shows that no values are null or NA.
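When a dataset does contain missing values, the usual next step is to drop or fill them. A minimal sketch on a made-up frame follows; the values are hypothetical, and whether dropping or filling is appropriate depends on your analysis.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values (np.nan)
df = pd.DataFrame({
    "Age": [25, 31, np.nan, 52],
    "BMI": [22.5, np.nan, 30.1, 27.8],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill missing values with the column median
filled = df.fillna(df.median())

print(dropped)
print(filled)
```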

print(data.isnull().sum())
print(data.isna().sum())

Figure: Missing Data by Jun Wu

Linear Regression

Linear regression is a great example to start to see the power of sklearn.

This package contains most of the models you need for machine learning.

Instead of using the diabetes dataset that we grabbed earlier, we can use the diabetes dataset that ships with sklearn, which is a different dataset with its own features and target.

The process of running linear regression is as follows:

- split the data into training data and test data (X)
- split the target data into training data and test data (Y)
- create the linear model object (linear_model.LinearRegression())
- train the model using the training sets
- make predictions
- output the scores
- plot the linear regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()

# Use only one feature (column index 2), kept as a 2-D array
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients:', regr.coef_)

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))

# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Figure: Linear Regression by Jun Wu

Using R inside Python

Just for novelty, did you know that you can also use R inside Python?

The rpy2 package allows you to do just that.

You can go back to your terminal and run this command.

conda install -c r rpy2

Then, upon success, you can close Spyder and reopen it from this terminal.

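import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

diabetes = pd.read_csv('data/diabetes.csv')

# pandas.rpy.common is no longer part of pandas; this assumes rpy2 3.x,
# whose pandas2ri converter turns the dataframe into an R data.frame
with localconverter(ro.default_converter + pandas2ri.converter):
    # Assign the converted dataframe to the name "dia" in R's global environment
    ro.globalenv['dia'] = diabetes

print(ro.r('summary(dia)'))

You should get the summary() statistics data from R.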

To find out more about using R inside Python, you can see the tutorials on the rpy2 website.

I hope that this tutorial helps you get started with Python.

As data proliferates in our research lives, learning to use Python and R side by side will give us more tools to analyze the data from our research projects.