Stock Market Data Collection & Feature Engineering Using Python

I contend that the average individual has the resources to gather sufficient data to build useful models.

Storage these days is relatively cheap.

And there are tons of data sources available that most people just aren’t aware of.

Some of these data sources come from academia, some from assorted APIs, some from governments, and some from generous open source contributors who provide the data for free.

Here are some useful data resources:

Favorites:

https://db. [...] world/ This resource is especially helpful.

It’s a compilation of massive amounts of economics datasets from all over the world.

You can select as many datasets as you want → add them to your cart → download all datasets with one click.

https://github.com/addisonlynch/iexfinance This resource is my go-to for gathering data on stock prices, balance sheets, analyst predictions, and all sorts of great information.

Census Data: https://www. [...] html

Useful Collection of Forecasts: http://www. [...] htm

Massive Collection of Indicators: https://www. [...] com/

Labor Statistics: https://www. [...] gov/

Data from the US Treasury: https://home. [...] gov/

Since we’re using Python, our first step is probably to read in the data from our various CSV files.

From there, merge all of this data into a single dataframe and perform some exploratory analysis.
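As a rough sketch of that read-and-merge step: the snippet below uses two tiny in-memory CSVs as stand-ins for downloaded files. The column names (date, asset_price, interest_rate) and the values are made up for illustration; with real files you would pass their paths to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two downloaded CSV files; in practice you would
# pass file paths to pd.read_csv instead of StringIO objects.
prices_csv = io.StringIO(
    "date,asset_price\n"
    "2019-01-02,100.0\n"
    "2019-01-03,101.5\n"
    "2019-01-04,99.8\n"
)
rates_csv = io.StringIO(
    "date,interest_rate\n"
    "2019-01-02,2.40\n"
    "2019-01-03,2.41\n"
    "2019-01-07,2.43\n"
)

prices = pd.read_csv(prices_csv, parse_dates=["date"])
rates = pd.read_csv(rates_csv, parse_dates=["date"])

# An outer join keeps every date from both sources; a date missing from one
# source shows up as NaN in that source's columns.
df = pd.merge(prices, rates, on="date", how="outer").sort_values("date")
df = df.reset_index(drop=True)

# Quick exploratory look at the merged frame.
print(df.head())
print(df.describe())
```

An outer join is only one reasonable choice here. It keeps everything and leaves the holes visible, which suits a cleaning pass afterwards; an inner join would silently drop dates that are missing from either source.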

Preprocess the Data

Cleaning Data

Inevitably, some of our data will have holes in it, some will just be garbage and completely out of left field, and some can be made more effective by scaling or normalizing.

We can also add artificial data to our dataset if we need more data to train the model.

This is especially popular right now in image processing where companies like Nvidia are simulating environments to help train CNNs.

The simplest route to take when preparing data that includes missing or invalid values is to use pandas' built-in functions like dropna().

[...] dropna(inplace=True)

dropna() removes every row that contains a NaN entry from our dataframe.

[...] mean(), inplace=True)

fillna() replaces NaN entries with the value we pass as its argument.

In our example above we use the column mean as a replacement.
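Here is a minimal, self-contained sketch of both options. The dataframe and its column names are made up for illustration, and the calls return new dataframes rather than using inplace=True so the two results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with a few holes in it.
df = pd.DataFrame({
    "asset_price": [100.0, np.nan, 99.8, 102.1],
    "interest_rate": [2.40, 2.41, np.nan, 2.43],
})

# Option 1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Option 2: replace each NaN with the mean of its own column.
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```

Dropping rows is the simplest option but throws away the valid values that sit next to each hole. Filling keeps every row at the cost of inventing a value.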

There are tons of possible approaches.

Another possible approach is to take the average of the nearest two entries to that NaN entry.

Some of these approaches are well-defined in various material on information theory.
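The nearest-two-entries idea can be sketched with pandas' linear interpolation, which for a single missing value between two known ones gives exactly their average. The series below is made up for illustration, and interpolate() is one way to get this behavior, not necessarily the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two isolated gaps.
s = pd.Series([10.0, np.nan, 14.0, 15.0, np.nan, 19.0])

# Linear interpolation fills each isolated NaN with the average of the
# entries directly before and after it.
filled = s.interpolate(method="linear")

print(filled.tolist())
```

For runs of several consecutive NaN values, linear interpolation spaces the fills evenly between the two surrounding known entries, which may or may not be what you want for price data.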

Feature Engineering

Oftentimes it can be useful to identify which feature combinations will be most useful for training our model.

These are going to be features that are highly separable, meaning the different output classes occupy clearly distinct regions of the feature's range.

To that end, a visualization can be especially useful.

We can use the library seaborn to automatically build a visualization with our features vs. our output.


[...] pairplot(data=df, hue="asset_price")

[Figure: Checking for separable features]

These distributions aren’t terribly separable.

We can see the first feature is probably more separable than the second feature.

In a deep learning model we probably just throw every bit of data into the mix.

But in some non-deep-learning applications we can get better performance by training only on the features that are more linearly separable.

We can also score features under the same principle in a more quantitative fashion.

We use Scikit-Learn’s SelectKBest to score our features.
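A minimal sketch of that scoring step, assuming a binary output and non-negative features (scikit-learn's chi2 scorer requires non-negative values). The feature names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: a binary output and three non-negative count features.
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": rng.poisson(lam=2 + 6 * y),  # depends strongly on y
    "weak": rng.poisson(lam=5 + 1 * y),         # depends slightly on y
    "noise": rng.poisson(lam=5, size=n),        # independent of y
})

# Score every feature against the output and keep the two best.
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print("kept:", list(X.columns[selector.get_support()]))
```

If your features can go negative (returns, for example), they would need to be shifted or binned first, or you could swap in a different scoring function such as f_classif.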

[Figure: Output Score Using Chi-squared]

The higher the score, the more impact the feature has on our output according to the scoring function used by SelectKBest.

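In our case, we use the Chi-squared test for our scoring:

[Equation: Chi-squared test]

The details aren’t overly important; just know that your score indicates “goodness of fit”: how far the observed values deviate from what we would expect if the feature and the output were independent.

The implementation amounts to fitting SelectKBest with the Chi-squared scoring function and reading off the score for each feature.

Scaling Data

There are many approaches to this part of the processing game.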

One of the most useful functions provided by Scikit-Learn is scale().

This standardizes each feature to zero mean and unit variance, which brings all of the data into a more digestible form.

You can read more about scaling here.
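A minimal sketch of the scaling step, including flattening a column-shaped output to 1-D. The arrays are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical feature matrix (rows are samples) and column-vector output.
X = np.array([[100.0, 2.40],
              [101.5, 2.41],
              [99.8, 2.43],
              [102.1, 2.45]])
y = np.array([[1], [0], [1], [0]])

# Standardize each column to zero mean and unit variance.
X_scaled = scale(X)

# Flatten the (4, 1) output into the 1-D shape most estimators expect.
y_flat = y.ravel()

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
print(y_flat.shape)
```

One caveat: scale() computes the mean and variance from whatever data you hand it, so if you later split into train and test sets, StandardScaler fitted on the training portion only is usually the safer choice.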

In this case we also flatten our output variable, since 1-D is our desired dimension.

Closing Thoughts

I hope you learned something new, maybe a new method or a fresh approach.

Many of these methods likely already exist in your workflow, but for those that don't, try adding them to your toolkit and see what works best.

If you’re struggling to get good results out of your models, I recommend spending more time on your data and less time trying to find a model that magically fits your data; you’re likely just overfitting.

Most models fail because of a lack of data, low-quality data, or just mismanagement of data.

Data, data, data…