ETL and DI in Data Science: usage in financial market data warehousesM.
EmmanuelBlockedUnblockFollowFollowingFeb 18It has been a few months since I started working on backtesting quantitative trading.
Initially, all the focus was on the core activities: strategies and which backtesting engines shall be used, but as soon as the project started to roll over, other real-world small challenges appeared demanding attention.
The endeavour of systematically and professionally approach backtesting requires not only the fancy activities of machine learning, statistical analysis and data analytics, but also a bit of humble scaffolding and plumbing on the data warehouse stage.
In my previous post How to store financial market data for backtesting, I was discussing the retrieval strategies to be used for backtesting.
Using conventional actual warehouse terminology, if in that article we discussed the data pickup challenges in this article we discuss the data putaway strategies.
Provisioning and feeding data into your warehouse deals with the less shiny but equally relevant side of the data analysis business: keeping the raw data flowing, because without proper quality data feeds, missions are doomed to failure.
external providerThere are basically two approaches that can be followed:Contract a third party data feed supplier and consume their APIs as required.
There will be integration costs, but you do not need to build, feed or operate your own data warehouse.
Stockpile data from different sources in your own data warehouse.
There will be integration costs, plus the costs associated to build, maintain and operate your data warehouse.
While the first option shall ensure consistent and quality data (provided that you select a quality data feed provider) it suffers from several limitations:Quality feeds are expensive — quality ones used by large investment firms can be very expensive indeed — .
This will limit the number of assets you can access and it might also limit the creativity to explore strategies in other areas or markets.
Data feeds costs can jeopardize the project budget that could be allocated to more valuable resources and activities.
Third party data feeds for intraday are usually limited in time.
While there are reasonable doubts on the benefits on backtesting intraday data which is distant in time (markets evolve and there is consensus on the impact of algorithmic trading during last years) it is always a good idea to test strategies for long periods.
The reason is those insights on how certain intraday strategies behave during market regimes can be only gained if you count with extensive periods of data because market regimes can last for years.
Not confronting or analysing the impact of market regimes in your strategies will have disastrous consequences, so the more data you rely on the better.
Futures and options contracts in third-party data feeds might not be split by settlement date.
While most strategies would probably want to focus on an aggregated continuous future contract — which does not exist in the real world but it is a convenient artefact to backtest — , you might be interested in evaluating divergences, trends or volume spikes in contracts with different settlement dates.
In that case, it might be challenging to find a third party data feed supplier matching this requirement.
Moving back and forward large data sets for the analysis engine could have a performance impact, although current network performance — especially in cloud services — will make this a minor issue.
Certain third party suppliers impose restrictions on the number of requests that customers can do against the service; this limitation is usually related to the cost of the service — which has been already discussed — .
The benefits of using a third-party supplier are obvious.
It simplifies and externalizes the problem of continuously feeding and storing large amounts of data, providing a consistent API which is often enriched with additional financial data.
A quality third-party data feed will also provide information on price splits — for stock cash assets — and dividends, and some suppliers provide also fundamental figures.
On the contrary, building your own data warehouse will probably represent a lower cost in the long term, provided that you count with the resources to build it fast.
The disadvantages are obvious:Development times will be larger.
You will have to deal with several data feeds and delayed time providers, and the project will require custom and side developments to ensure that you can cover all assets and markets you want to later analyse.
You will still need a one-time historical data to pre-provision your data warehouse.
As discussed, counting with long historical data sets is relevant and required.
If you start stockpiling data today it will not be useful until at least one or two years later.
You need proper online hardware resources.
Probably the less problematic issue, as cloud platforms are now a commodity and huge amounts of storage and computing power can be purchased as a service at fairly cheap prices.
You need operational procedures in place to ensure monitoring and data quality.
While you might expect that once the development of the platform is completed there will be no issues there is no such thing as zero maintenance project.
A proper operational plan must be defined — it can be simple, but it must exist — and for that, you need skilled resources in your side.
This is probably the major drawback to face when you decide to go for an in-house solution.
The main advantages are basically a potential project overall gain in value (your project will have muscle and data) and potential cost savings.
Being the owner of the data warehouse has also the advantage of having full control over what to store and how to exploit it and can have a significant impact on computing-intensive backtesting scenarios.
With no doubt the key factor is cost.
Just the decision of building your own data warehouse or use an external provider does not determine costs: the overall savings or costs strongly depend on the available team and resources.
Overall, based on my experience, building and maintaining your own data warehouse can provide significative cost savings if and only if there are already resources available to build up and maintain the project.
In the end, integrating a third party feed is not as straightforward as one might think, and sort kind of temporary repository will be still required, so the extra mile required to build up your own data repository is not that hard.
On the contrary, if your project uses standard backtesting software and your resources lack the experience to cope with an enterprise-grade data warehouse project it will be wise to stay hands-off.
This scenario is quite common in the industry as people tend to focus and specialize on backtesting software.
Probably the main advantage of purchasing an external data feed is that it reduces the operational needs to the minimum.
Building a data warehouse for financial markets data is not rocket science —data models in financial markets are extremely simple and consistent —, but as anything serious in life, it definitively needs skilled and experienced resources in the team.
It poses development, deployment and operational tasks that while not complex in nature still need to be properly addressed.
Our journey to a production-grade platformIn our project, we decided to build our own custom solution.
The reasons were:Poor functionalities of existing entry-level backtesting tools to perform standard statistical analysis.
Lack of functionalities of existing backtesting tools in qualitative -rather than quantitative- or other AI-based analysis.
Steep learning curves of existing backtesting tools APIs and languages.
Performance issues when dealing with large intraday datasets.
Our project requirements quickly evolved from isolated tests, ideas and concepts into a formal project aimed to get a production-grade solution.
The reason for this evolution was that as soon as first backtesting activities started it was immediately clear that the overhead to prepare data and strategies was much higher than expected.
We also found many operational issues while validating the results that lead to lack of confidence on results.
Even in the pure quantitative field — where existing backtesting solutions shine — , the overhead was really high.
We found that an adequate framework environment with data management, data integration, debugging, data flows, job scheduling, charting, reporting and validation capabilities were required.
Those components do not form part of the actual analysis and testing but they have a huge impact on productivity, so huge that they can solely determine the success or failure of the mission.
In the long run — and quantitative business is definitively a very long journey — , such an environment will provide a huge pay-off and will enable to quickly test, validate, discard and adopt strategies.
Why data integration (DI) and extract-transform-load (ETL) simplify integration and provide further analysis reliabilityAs discussed in the previous paragraph, it was clear from the early stages of the project that a clear solution was required to address data management, integration, debugging, charting, reporting and validation.
After evaluating several alternatives, including the classic options in Data Science such as Jupyter Notebooks, the conclusion was clear: all those solutions are built for drafting, evaluation and prototyping small concepts.
By no means, they can be used to build the required collaboration, development and production environment that any large project requires.
After a quick evaluation and a draft on the main components required, we found that a good solution would be to leverage into an existing data integration platform (which basically further extends the ETL — Extract Transform Load — concept) to cope with all basic tasks and provide a mature flow, job, transformation, debugging and reporting environment.
While data integration does not participate in the actual backtesting analysis, it will simplify the orchestration of the different flows and jobs required to build the solution; in our project, it has already proved to be an extremely efficient and useful companion.
Flow excerpt to provision EUREX Futures intraday data: the visual approach mixed with lightweight custom Java code lead to full implementation of the data provisioning project leg in just two weeks.
Easy to maintain, monitor and self-documented.
Data Integration (DI) platforms (there are several including both commercial and open source) will provide mature, enterprise-grade and extensible frameworks to deal with data processing requirements.
This basic level of integration and manipulation is a constant requirement in any project that involves dealing with several data sources for long period of times, and financial markets are not an exception to this.
Contrary to the popular belief, the key benefits of a data integration platform are NOT related to avoiding or minimise custom coding; they are related to effectively self-document processes and providing flow and job frameworks with proper debugging, scheduling, monitoring and reporting tools — both in development and production environments — .
Those points are the real key benefits.
In our case we decided to use Hitachi Vantara as our Data Integration solution, it is easily expandable by using Java custom classes and the learning curve is really easy.
I am not promoting this solution over others: the take away is to leverage on one and the result would have been probably equally good with other solutions.
We found Hitachi Vantara a bit bulky in terms of resources but it has good performance and outstanding integration with Java —this point was relevant in our case as the Java skills in our team were strong and part of the AI analysis is being prototyped in Java — .
After the adoption of this solution, we managed to solve all major data provisioning issues in around two weeks time — including the learning curve and installation of the platform — .
We found it an extremely useful tool, both for development and production environments, and it represented a huge backing to speed up the project.
SummaryQuantitative trading requires large datasets to test, train and evaluate.
It is relevant to both count with long historical records — to evaluate strategies on different market regimes — and to be able to continuously feed our repository with new data.
This leads to the need fora data warehouse or an external feed which provides data on demand.
It is possible to purchase a subscription service for the assets and markets we want to evaluate.
This will provide an easier way to retrieve data, but it might be expensive if the number of markets is large and we leverage on quality data.
It is probably the smart way to go if your project lacks hands-on experience in enterprise-grade data integration projects.
Building your own repository is not complex and it might be a good idea if your project counts with the right resources on time.
In that scenario, an in-house data warehouse can provide significative savings with a very limited effort.
In the event of going for such direction, leveraging on a data integration platform has proven to be helpful, simplifying both development and production environments and increasing the trustworthiness in our data analysis and processes.
.. More details