How to Become aFinancial Data ScientistWith R ExampleDiogo RibeiroBlockedUnblockFollowFollowingMay 30Photo by Helloquence on UnsplashThe financial industry has been one of the early adopters of the field of data science and the need for financial data scientist role has been growing rapidly.

Data science, as applied to finance, is the field where you build systems and processes to extract insights from financial data in various forms.

The finance professionals have always been doing data science in the form of statistical analysis, forecasting, and risk analysis, among other things, however, we now have an industry recognized term for it and formal career options around it.

As we know, the financial services companies are highly information-driven and stand to gain tremendously from insights from their information to improve their top-line as well as bottom-line.

Data science can help banks in almost all areas of work including the following:Risk monitoringTrade surveillancePaymentsFraudClaimsFintechSocial MediaCustomer experienceAnd moreFor example, a data scientist could be required to build data models for risk analysis or work with credit cards transactions data to identify fraudulent and risky behavior.

In the field of customer service, banks can serve their customers better by analyzing their transactional behavioral data using various data science algorithms.

Banks can also use data science to forecast various aspects of business such as profitability, delinquency and closure.

Skills Required for Financial Data Scientist RoleA financial data scientist or a team of data scientists working together as a team in a company would have skills around these four areas:1.

Data Analysis / Quantitative TechniquesKnowledge required to perform data analysis which would include statistics, decision sciences, operations research, econometrics, and predictive analytics.

This I think is the most important piece in the data science puzzle.

It is important that the data scientist is able to define the data analysis problem, understand the quality of data, fill the gaps in the data or make the right assumptions about it, select the right statistical models to apply on the data, perform the analysis using the technical tools, correctly infer the results of the analysis, and finally present the results in a meaningful way to the stakeholders.

One thing that needs special mentioning here is to learn Time Series Analysis since most of the financial data is time-series data.

Data analysis is a complete field by itself and you can apply it to any domain and on any data.

You don’t need any big tools or programming skills to know how to perform the analysis.

But this knowledge is foundational to your financial data scientist career.

2.

Technical KnowledgeAlong with the knowledge about the data analysis and the statistical techniques, the second important thing is the knowledge of the tools used to perform the data analysis.

This is a hot area and this is how a data scientist gets their job done.

Typically when you have to perform analysis on a data set, it is going to be large, probably containing hundreds of thousands of records.

You can neither do this analysis manually nor would you be able to efficiently do it with MS Excel.

Since we’re dealing with large amounts of data, data scientists would use a variety of tools and programming languages to perform their job.

Currently, the two most preferred tools of choice for data scientists are Python and R programming.

Both are very popular programming languages for statistics and various kinds of data analysis and visualizations.

Both have their own advantages and disadvantages.

R was specifically developed with data analysis and statisticians in mind.

Python, on the other hand, has grown rapidly in popularity and has a huge community and package support for data analysis capabilities.

I have always suggested that a data scientist should invest in learning both.

Since most of the data is stored in databases, you would also need to learn about databases and how to retrieve data from databases using SQL and NoSQL.

Apart from this, as the financial institutions focus on investing in big data technologies, it would also become important for data scientists to get themselves equipped with dealing with big data.

You will gain an advantage by learning frameworks such as Hadoop, MapReduce, Spark and even machine learning, but that’s for advanced users.

There are lots of other things you can learn to attain mastery but we will leave it out of scope for this article.

3.

Data MungingSkills required to deal with your data is the difference between an average and great data scientist.

Since the data is big you will also need to learn data munging and data cleaning techniques.

Data munging, also called data wrangling involves converting raw data to another format that is easier to access and analyze.

Sure, you do this using the technology tools available to you, but it requires a different bend of mind to be able to absorb and form relationships between various data sources and combine them in an accurate and meaningful way.

You will use technology, statistical skills as well as your experience to do this.

This is probably the most time-consuming job performed by data scientists as the data comes in many forms and from a variety of sources.

4.

Domain KnowledgeThe fourth piece of the puzzle is to have the domain knowledge of the specific field of data that you are looking to analyze.

For example, if the analysis is of loans related data, the data scientist would be expected to understand how loans work, how banks manage loan portfolios, and so on.

In the context of a bank, you would benefit from learning financial analysis, economics, risk analysis, portfolio management, and acquiring knowledge about financial markets among other things.

While learning financial concepts, pay special attention to the maths behind various concepts and the data needs behind them.

When you combine your knowledge of data analysis and statistical techniques, technology skills to perform the analysis, and the financial domain knowledge, you become a financial data scientist and you are in a position to get the best insights from the financial data.

Data Science Project Example in RIn this index analytics project, the S&P 500 index’s performance is evaluated based on its geometric and arithmetic averages for different frequencies — daily, monthly, annual and 5-year periods.

This project also evaluated the performances of 4 different self-financing portfolios in the forms of excess returns for the same 4 frequencies as S&P 500 index performances.

From the analysis, we can draw the conclusion that arithmetic average returns will always be higher than the geometric average returns, with the 5-year period arithmetic returns the highest among all frequencies (~18.

78% for S&P500 performances, ~11.

08% for self-financing portfolios).

However, the 5-year return also largely depends on the assumptions imposed in the calculations.

The overlapping of the long-time series definitely affects the performance outlook of the investments, providing a great trick for funds to manipulate their performance metrics.

Key Questions to AnswerFor the (CRSP) S&P500 with dividends, what was the average arithmetic and geometric historical mean rate of return for daily returns, for monthly returns, for annual returns, and for 5-year returns from 1/1/1973 through 1/1/2015?How would each of the four self-financed strategies with the prevailing short rate, the 30-day Treasury rate, the 1-year Treasury rate, and the 5-year Treasury rates, respectively have performed in terms of excess return?Does overlapping the longer-term series lead to different inference?ComputationsImport S&P 500, Risk — free Data from WRDSlibrary('data.

table')library('dplyr')library('lubridate')library('magrittr')library('zoo')# Import S&P500 data and convert the date column to date objectsp500_return = fread('data/SP500Return.

csv')sp500_return[,caldt := as.

Date(caldt, format = '%m/%d/%Y')]Geometric Means and Arithmetic MeansNext, we proceed to answer the first question.

For the (CRSP) S&P500 with dividends, what was the average arithmetic and geometric historical mean rate of return for daily returns, for monthly returns, for annual returns, and for 5-year returns from 1/1/1973 through 1/1/2015?Daily ReturnsAnnualized daily geometric returnsFrom 1/1/1973 through 1/1/2015, there are in total of 10617 days in the sample.

The daily geometric means could be found byndays represents the number of days in the sample.

# daily geometric meannum_days = length(sp500_return[, caldt])daily_geo_return = (prod(sp500_return[, vwretd] + 1))^(1/num_days) – 1Annualize daily geometric return,annualized_daily_geo_return = (1 + daily_geo_return) ^ 252 – 1Annualized daily arithmetic returnsThe arithmetic means could be found by# daily arithmetic meannum_days = length(sp500_return[, caldt])daily_arith_return = sum(sp500_return[, vwretd])/num_daysannualized_daily_arithm_return = daily_arith_return * 252daily_return = cbind(annualized_daily_geo_return, annualized_daily_arithm_return)Monthly ReturnsAnnualized monthly geometric returnsFrom 1/1/1973 through 1/1/2015, there are in total of 504 months in the sample.

The monthly geometric means could be found bynmonths represents the number of months in the sample.

We need to first convert the data into monthly frequency.

# Convert data into monthly frequencysp500_monthly = sp500_return %>% group_by(month=floor_date(caldt, "month")) %>% summarise(vwretd = (function(r) prod(1 + r))(vwretd))%>% as.

data.

table()# monthly geometric meannum_months = length(sp500_monthly[,month])monthly_geo_return = (prod(sp500_monthly[, vwretd]))^(1/num_months) – 1Annualize monthly geometric return,annualized_monthly_geo_return = (1 + monthly_geo_return)^12 – 1Annualized monthly arithmetic returnsThe monthly arithmetic means could be found byWe could use the same monthly frequency data that is used to calculate the monthly geometric mean.

# monthly arithmetic meannum_months = length(sp500_monthly[,month])monthly_arithm_return = sum(sp500_monthly[, vwretd] – 1)/num_monthsAnnualize the arithmetic return,annualized_monthly_arithm_return = monthly_arithm_return * 12monthly_return = cbind(annualized_monthly_geo_return, annualized_monthly_arithm_return)Annualized monthly arithmetic returnsThe monthly arithmetic means could be found byWe could use the same monthly frequency data that is used to calculate the monthly geometric mean.

# monthly arithmetic meannum_months = length(sp500_monthly[,month])monthly_arithm_return = sum(sp500_monthly[, vwretd] – 1)/num_monthsAnnualize the arithmetic return,annualized_monthly_arithm_return = monthly_arithm_return * 12monthly_return = cbind(annualized_monthly_geo_return, annualized_monthly_arithm_return)Annual ReturnsAnnual geometric returnsFrom 1/1/1973 through 1/1/2015, there are in total of 42 years.

The annual geometric means could be found bynyears represents the number of years in the sample.

We need to first convert the data into annual frequency, but annualization is no longer needed.

Same as monthly returns, due to the limitation of the group_by operation in R, we will need to subtract 1 from the total months since 1/1/2015 does not count as an extra year# Convert data into annual frequencysp500_annual = sp500_return %>% group_by(year=floor_date(caldt, "year")) %>% summarise(vwretd = (function(r) prod(1 + r))(vwretd))%>% as.

data.

table()# annual geometric meannum_years = length(sp500_annual[,year])annual_geo_return = prod(sp500_annual[, vwretd])^(1/num_years) – 1Annualized arithmetic returnsThe arithmetic means could be found byWe could use the same monthly frequency data that is used to calculate annual geometric mean# annual arithmetic meannum_years = length(sp500_annual[,year])annual_arithm_return = sum(sp500_annual[, vwretd]- 1)/num_yearsannual_return = cbind(annual_geo_return, annual_arithm_return)5 Year ReturnsThere are in total of 42 years in this dataset, which indicates that there will be 2 extra years will not be able to make a full 5-year period.

The two extra years will likely affect the magnitude of geometric return, therefore I decided to remove the year of 2014 and 2015 from the dataset, making it 8 total 5-year periods.

The annual geometric means could be found byn5_years represents the number of years in the sample.

We need to first convert the data into 5-year frequency.

# remove the year 2013 and 2014sp500_return_short = sp500_return[year(caldt) >= 1975]# Convert data into 5-year frequencysp500_5_year = sp500_return_short%>% group_by(yr5 = floor_date(caldt, "5 years")) %>% summarise(vwretd = (function(r) prod(1 + r))(vwretd))%>% as.

data.

table()# 5 yr geometric meannum_5_years = length(sp500_5_year[,yr5])five_yr_geo_return = prod(sp500_5_year[, vwretd])^(1/num_5_years) – 1Annualize 5-year geometric return,annualized_5yr_geo_return = (1 + five_yr_geo_return)^(1/5) – 1The arithmetic means could be found byWe could use the same monthly frequency data that is used to calculate 5-year geometric mean# monthly arithmetic meannum_5_yr = length(sp500_5_year[,yr5])five_yr_arithm_return = sum(sp500_5_year[, vwretd]- 1)/num_5_yrAnnualize 5-year geometric return,annualized_5yr_arithm_return = five_yr_arithm_return/5yr5_return = cbind(annualized_5yr_geo_return, annualized_5yr_arithm_return)Summary:We will build a summary table for different period returns for S&P500 to see if there is anything interestingsp500_mean_returns = rbind(daily_return * 100, monthly_return * 100, annual_return * 100, yr5_return * 100)%>% as.

data.

table() %>% setNames(c('Geometric Return (%)', 'Arithmetic Return (%)')) sp500_mean_returns[, ` `:=c('Daily', 'Monthly', 'Annual', '5 Year')]setcolorder(sp500_mean_returns, c(3,1,2))# Display Summary Tableprint(sp500_mean_returns)Geometric Return (%) Arithmetic Return (%)1: Daily 10.

33694 11.

334162: Monthly 10.

35028 11.

091213: Annual 10.

35028 11.

928164: 5 Year 12.

20949 18.

78374From the summary table above, the following conclusions could be drawnarithmetic returns are always larger than geometric returns, regardless of the period chosen as the base of the calculationsannual geometric returns in 1973 and 1974 are negative, therefore the values 5-year return would be lower if the year 1973 and 1974 is includedthe geometric return will be 9.

77%, and arithmetic return will be 13.

91% if the last two years (2013, 2014) are removed instead of the first two years (1973, 1974)removing the last two years could also be an option, and this assumption will result in a totally different result for both 5-year geometric and arithmetic returns5-year arithmetic return is the largest in value among different periods — there is a steep increase in average returns for 5 yearSelf Financing PortfoliosWe will proceed to answer the second question,How would each of the four self-financed strategies with the prevailing short rate, the 30-day Treasury rate, the 1-year Treasury rate, and the 5-year Treasury rates, respectively have performed in terms of excess return?The self-financing strategies are defined as the zero initial investments that our portfolio holds.

We will use the treasury bills to finance the investment in stock marketsData ImportThe prevailing short rate, 30-day, 1-year, and 5-year T-bill data are downloaded from WRDS’s Federal Reserve repositoryExcess Return for Prevailing Short RatesThe prevailing short rate is the federal overnight rates, the data is available in WRDSshort_rate = fread('data/fed_fund_rate.

csv')short_rate[, `:=` (date = as.

Date(as.

character(date), format = '%Y%m%d', origin = "1960-10-01"), daily_rf = FF_O/(100*252))]There are missing data in the short_rate data table, we should first left join the short_rate table to sp500_return to make them have the same lengthexcess_return_daily = merge(sp500_return, short_rate[,c('date', 'daily_rf')],all.

x = T, by.

x = 'caldt', by.

y = 'date')The missing data in the daily_rf column could be imputed using last observation carried forward, which can be accomplished by na.

locf function from package ‘zoo’.

By doing so we assumed the overnight rate does not change from the previous day if the data is missingexcess_return_daily[,daily_rf := na.

locf(daily_rf)]After all the missing values are imputed, the daily excess return could be found.

To find the daily excess return for S&P 500 over risk-free rates, we will need to have take the difference of two series.

excess_return_daily[,excess_r := vwretd – daily_rf]Geometric Mean for Daily Excess ReturnsThe equation is the same as the geometric return equations abovenum_days = length(excess_return_daily[, caldt])daily_geo_excess_return = (prod(excess_return_daily[, excess_r] + 1))^(1/num_days) – 1annualized_daily_geo_excess_return = (1 + daily_geo_excess_return) ^ 252 – 1Arithmetic Mean for Daily Excess ReturnsThe equation is the same as the arithmetic return equations above# daily arithmetic meannum_days = length(excess_return_daily[, caldt])daily_arith_excess_return = sum(excess_return_daily[, excess_r])/num_daysannualized_daily_arithm_excess_return = daily_arith_excess_return * 252daily_excess = cbind(annualized_daily_geo_excess_return, annualized_daily_arithm_excess_return)30-day T-Bill Excess Return# Treasury Bills constant maturityt_bills = fread("data/t_bills.

csv")t_bills = t_bills[,c("date", "TCMNOM_M1", "TCMNOM_Y1", "TCMNOM_Y5")]# import the daily promised treasury yield and convert the date column to date objects# convert the annualized yield to monthly frequencyt_bill_30_day = t_bills[,c("date", "TCMNOM_M1")]t_bill_30_day[, `:=` (month = as.

Date(as.

character(date), format = '%Y%m%d', origin = "1960-10-01"), t30ret = TCMNOM_M1/(100*12))]t_bill_30_day = t_bill_30_day[month >= '2001/07/01', c('month', 't30ret')]To find the excess return for the S&P 500 over risk-free rates, we will need to have taken the difference of two series.

Since the two series have difference frequencies, we will convert the daily S&P 500 return to monthly S&P 500 return, which is stored in the sp500_monthly dataset.

Unfortunately, the 30-day T-bill data is only available after 2001/07/31, the period chosen for this calculations are after 2001/08/01# subtract 1 to make to obtain returnsp500_monthly[month >= '2001/07/01', vwretd:= vwretd – 1]We have a little issue here with the different dates,In the sp500_monthly data table, the monthly data is quoted as the beginning of the month while the risk_free data table quotes the monthly data as the end of the month,print(t_bill_30_day[1:5,])month t30ret1: 2001-07-31 0.

0030583332: 2001-08-31 0.

0029416673: 2001-09-30 0.

0022333334: 2001-10-31 0.

0018916675: 2001-11-30 0.

001658333Therefore, we will need to convert the month-end date to the month-beginning date using floor_datet_bill_30_day[,month := floor_date(month, "month")]print(t_bill_30_day[1:5,])month t30ret1: 2001-07-01 0.

0030583332: 2001-08-01 0.

0029416673: 2001-09-01 0.

0022333334: 2001-10-01 0.

0018916675: 2001-11-01 0.

001658333Merge two data tables and impute the missing values# merge the two data tablesexcess_return_monthly = merge(sp500_monthly, t_bill_30_day, by.

x = 'month', by.

y = 'month')# Fill previous month value to missing valueexcess_return_monthly[,t30ret := na.

locf(t30ret)]# build excess return columnexcess_return_monthly[, excess_r := vwretd – t30ret]Geometric Mean for 30-day Excess ReturnsThe equation is the same as the geometric return equations abovenum_months = length(excess_return_monthly[,month])monthly_geo_excess_return = (prod(1+ excess_return_monthly[, excess_r]))^(1/num_months) – 1annualized_monthly_geo_excess_return = (1 + monthly_geo_excess_return)^12 – 1Arithmetic Mean for 30-day Excess ReturnsThe equation is the same as the arithmetic return equations abovenum_months = length(excess_return_monthly[,month]) – 1monthly_arithm_excess_return = sum(excess_return_monthly[, excess_r])/num_monthsannualized_monthly_arithm_excess_return = sum(monthly_arithm_excess_return)*12monthly_excess = cbind(annualized_monthly_geo_excess_return, annualized_monthly_arithm_excess_return)1-year Excess ReturnThe 1-year treasury yield data is also available on WRDSt_bill_1yr = t_bills[,c("date", "TCMNOM_Y1")]t_bill_1yr[, `:=` (month = as.

Date(as.

character(date), format = '%Y%m%d', origin = "1960-10-01"), b1ret = TCMNOM_Y1/(100*12))]t_bill_1yr = t_bill_1yr[, c('month', 'b1ret')]The 1-year treasury yield is quoted in monthly frequency, therefore should be adjusted to match the 1 year period of sp500_annual# Convert data into annual frequencyt_bill_1yr = t_bill_1yr%>% group_by(year = floor_date(month, "year")) %>% summarise(b1ret = (function(r) prod(1 + r))(b1ret))%>% as.

data.

table()The 1-year excess return could be found by merging with sp500_annual, missing values will again be imputed# merge the two data tablesannual_excess_return = merge(sp500_annual, t_bill_1yr, by.

x = 'year', by.

y = 'year')# impute missing valuesannual_excess_return[, b1ret := na.

locf(b1ret)]# build excess return columnannual_excess_return[, excess_r := vwretd – b1ret]Geometric Mean for 1-year Excess Returnsnum_years = length(annual_excess_return[,year])annual_geo_excess_return = (prod( 1+ annual_excess_return[, excess_r]))^(1/num_years) – 1Arithmetic Mean for 1-year Excess Returnsnum_years = length(annual_excess_return[,year])annual_arithm_excess_return = sum(annual_excess_return[, excess_r])/num_yearsannual_excess = cbind(annual_geo_excess_return, annual_arithm_excess_return)5-year Excess ReturnSame as the 1-year excess returns, we have to find the corresponding 5-year T-bill rate on WRDSt_bill_5yr = t_bills[,c("date", "TCMNOM_Y5")]t_bill_5yr[, `:=` (month = as.

Date(as.

character(date), format = '%Y%m%d', origin = "1960-10-01"), b5ret = TCMNOM_Y5/(100*12))]t_bill_5yr = t_bill_5yr[year(month) >= 1975, c('month', 'b5ret')]The 5-year treasury yield is quoted in annual frequency, therefore should be adjusted to match the 5 year period of sp500_5_year# Convert data into annual frequencyt_bill_5yr = t_bill_5yr%>% group_by(yr5 = floor_date(month, "5 years")) %>% summarise(b5ret = (function(r) prod(1 + r))(b5ret))%>% as.

data.

table()The 5-year excess return could be found by merging with sp500_5_year# merge the two data tablesyr5_excess_return = merge(sp500_5_year, t_bill_5yr, by.

x = 'yr5', by.

y = 'yr5')# build excess return columnyr5_excess_return[, excess_r := vwretd – b5ret]Geometric Mean for 5-year Excess Returnsnum_5_years = length(yr5_excess_return[,yr5])yr5_geo_excess_return = (prod( 1 + yr5_excess_return[, excess_r]))^(1/num_5_years) – 1annualized_5yr_geo_return = (1 + yr5_geo_excess_return)^(1/5) – 1Arithmetic Mean for 1-year Excess Returnsnum_5_years = length(yr5_excess_return[,yr5])yr5_arithm_excess_return = sum(yr5_excess_return[, excess_r])/num_5_yearsannualized_5yr_arithm_return = sum(yr5_arithm_excess_return)/5yr5_excess = cbind(annualized_5yr_geo_return, annualized_5yr_arithm_return)Summary:We will build a summary return table for different period excess returnsexcess_mean_returns = rbind(daily_excess * 100, monthly_excess * 100, annual_excess * 100, yr5_excess * 100) %>% as.

data.

table() %>% setNames(c('Geometric Excess Return (%)', 'Arithmetic Excess Return (%)')) excess_mean_returns[, ` `:=c('Daily', 'Monthly', 'Annual', '5 Year')]setcolorder(excess_mean_returns, c(3,1,2))# Display Summary Tableprint(excess_mean_returns)Geometric Excess Return (%) Arithmetic Excess Return (%)1: Daily 4.

293887 5.

7009622: Monthly 4.

572426 5.

6377943: Annual 4.

428523 6.

1199324: 5 Year 6.

661580 11.

018330We will have a similar conclusion this time,5-year arithmetic returns remains the largest among all return values — way larger than the returns calculated from the other 3 frequenciesagain, assumptions imposed are very important here — the missing data in daily returns could be imputed by different ways (or even removed from the series), while the 5-year returns can be totally different if different two years are removed (for example 2013 and 2014)with the assumptions right now, the 5-year arithmetic excess returns still blows up due to the large S&P 500 5-year arithmetic returnsConclusionBased on the analysis above, we could draw the conclusion that arithmetic return could be the more attractive value for funds to show to their investors than geometric returns.

When there are non-divisible periods of data avaliable, the assumptions imposed would largely affect the result of the calculation.

The overlapping time series could greatly affect the performance outlook for a certain investment, but geometric returns are the ones that investors could really retain.

Therefore, investors should be careful with the quoted return values when choosing their investments of different funds.

Going back to the last key question,Does overlapping the longer-term series lead to different inference?The answer is definitely a yes.

.