Fraud on financial statements can stay undetected even after a financial audit, as the techniques used to cook the books are becoming more sophisticated.

Recently I completed a group project to build a machine learning model that identifies US firms with intentional misstatements in 2010, based on training data from 2005–2009.

I thought it would be interesting to share the approach and key learning points of this data science project. Keep reading if you would like to know the end-to-end process of building this machine learning model.

These are the steps that will be discussed in detail:

- Feature engineering with domain knowledge while leveraging research papers
- Exploratory data analysis to gain insights
- Feature selection with t-test and chi-square test
- Model selection and implementation
- Fine-tuning of the model with Randomized Search CV
- Model evaluation with AUC score
- Model interpretation

Step 1: Feature Engineering

Feature engineering is the most important step of every data science project. Data scientists spend significant hours on feature engineering to create meaningful features that capture the characteristics of the phenomenon being observed (the target variable).

Without quality data, the most sophisticated machine learning algorithm would not be able to outperform a simple model with quality features.

For this project, feature engineering was done from scratch, as the training dataset only contained the firm identifier, the year and whether the firm made an intentional misstatement. No other features were given.

How do you create meaningful features that capture the characteristics of the target variable?

- Read research papers written on a machine learning problem similar to the one you are trying to solve
- Brainstorm by asking the right questions

Leveraging Past Research Papers

Relying on past research papers such as Dechow (2011) and BCE (2018) was very helpful for understanding the types of features that could pick up the characteristics of firms committing accounting fraud.

Here are some
of the different types of variables explored in these research papers:

- Financial variables that capture the financial activities of firms, e.g. change in receivables, change in inventory, book-to-market ratio and whether the firm underwent restructuring
- Textual style features that capture the readability, length, tone and word choice of financial statements
- Features that capture the nature of the business, e.g. whether the company is in the retail, service or computer industry

Including the significant features from these two research papers in our initial Lasso model in R achieved an out-of-sample AUC score of only 0.64. In other words, the model would rank a randomly chosen misstating firm above a randomly chosen non-misstating firm about 64% of the time, only modestly better than the 0.5 expected from random guessing.

Brainstorming with Domain Knowledge

It was clear that more feature engineering was needed. We then asked ourselves: if we were auditors, internal auditors, regulators or investors, how could we detect firms with intentional misstatements using any data available to us? Imagine that you could obtain any data: what features would you add to the model?

After some brainstorming:

- Auditors would have access to the areas and number of material misstatements, the number of management review points, auditor opinions, the industry and whether the industry is high-risk.
- Internal auditors assess the internal controls of firms, so they would have access to internal control strength scores and internal emails. Some possible variables are an opinion on internal control and whether internal emails contain suspicious messages that indicate fraud.
- Regulators would have access to top management information and compensation details. A possible variable would be the turnover rate of the CEO or CFO, as top management tends to leave the company during the period of financial statement manipulation or after the fraud is uncovered by regulators. The number of unexercised stock options held by top management is often associated with aggressive accounting as
well.
- Investors, being the public, have access to financial reports and stock data. Possible areas to look at are the volatility of the stock price and stock returns, as management often manipulates financial statements to meet investor or analyst expectations in order to maintain the stock price.

With these possible variables in mind, we checked their availability in the WRDS database and obtained the following:

- Audit opinion: the auditor’s opinion might signal potentially fraudulent cases, e.g. adverse opinion, qualified opinion and unqualified opinion.
- Opinion on internal control: poor internal control increases the opportunity for fraud, e.g. effective, adverse and disclaimer.
- Auditor of the financial statements: certain audit firms might face greater pressure to sign off on financial statements with material misstatements.
- Industry: some industries are more prone to fraud and have been found to be associated with fraud cases.
- CEO turnover and compensation: this variable was later dropped due to too much missing data.
- Variables related to stock returns and volatility: these variables were later dropped, as they would limit our training dataset to listed firms only.

Step 2: Data Cleaning

After merging all the variables with the given training dataset, which contains the firm identifier, year and the target variable...
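
The merge on firm identifier and year, and the decision to drop a variable with too much missing data, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the column names (`firm_id`, `year`, `misstatement`, `audit_opinion`, `ceo_turnover`), the toy rows and the 50% missing-data cut-off are hypothetical and not taken from the project. The sketch uses only the Python standard library; in practice a `pandas.DataFrame.merge` call with `how="left"` would typically do the same job.

```python
# Hypothetical toy data: names and values are illustrative only.
labels = [
    {"firm_id": "A", "year": 2008, "misstatement": 0},
    {"firm_id": "A", "year": 2009, "misstatement": 1},
    {"firm_id": "B", "year": 2009, "misstatement": 0},
]
audit_features = [
    {"firm_id": "A", "year": 2008, "audit_opinion": "unqualified"},
    {"firm_id": "A", "year": 2009, "audit_opinion": "qualified"},
    {"firm_id": "B", "year": 2009, "audit_opinion": "unqualified"},
]
ceo_features = [
    {"firm_id": "A", "year": 2009, "ceo_turnover": 1},
]

KEYS = ("firm_id", "year")


def left_join(left_rows, right_rows, feature_names):
    """Attach features to every left row by (firm_id, year).

    Rows without a match keep the feature with value None, so no
    labelled firm-year is lost in the merge.
    """
    lookup = {tuple(r[k] for k in KEYS): r for r in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(tuple(row[k] for k in KEYS), {})
        merged = dict(row)
        for name in feature_names:
            merged[name] = match.get(name)
        joined.append(merged)
    return joined


def missing_share(rows, column):
    """Fraction of rows where the column is missing (None)."""
    return sum(r[column] is None for r in rows) / len(rows)


# Merge each feature table onto the labelled firm-years.
panel = left_join(labels, audit_features, ["audit_opinion"])
panel = left_join(panel, ceo_features, ["ceo_turnover"])

# Keep only features whose missing share is at or below an assumed cut-off.
MAX_MISSING = 0.5  # assumption, not a value from the project
shares = {c: missing_share(panel, c) for c in ("audit_opinion", "ceo_turnover")}
kept = [c for c, share in shares.items() if share <= MAX_MISSING]
```

With this toy data, `ceo_turnover` is missing for two of the three firm-years and is filtered out, while `audit_opinion` is kept. A left join is used so that the labelled rows are never dropped; sparse variables can then be removed instead.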