ActiveRecord on MySQL— Iterating over large tables with conditions

This is how the underlying queries look:Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` ASC LIMIT 1 Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` DESC LIMIT 1 Employee Load (1.5ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 10001) AND (employees.emp_no <= 11001) Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 11001) AND (employees.emp_no <= 12001)…Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001)This approach works best if the ids are in order because the iteration wouldn’t have to iterate & skip a lot of missing records in that case³.Iterating with joinsNow, let’s compare performance of these two methods when we add some more complexity to the query.In this new scenario, say, we want to iterate through all employees whose salary was above 80,000 at any point during their employment with the company..The find_each method would look something like this:time = Benchmark.realtime do Employee.select(:emp_no, :first_name, :last_name)..joins(:salaries)..where('salary > 80000')..find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee| endend=> 1181.770457000006On the other hand, the id iterator method for performing the same operation results in an order of magnitude improvement in performance.time = Benchmark.realtime dofirst_id = Employee.first.id last_id = Employee.last.id(first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value| Employee.where('employees.emp_no >= ?', value)..where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE)..joins(:salaries)..where('salary > 80000')..order('employees.emp_no ASC')..select(:emp_no, :first_name, :last_name).each do |employee| end endend=> 72.75677799998084The above results indicate that using the find_each approach results in a much worse performance⁴..The ID iterator approach is about 15x faster than naive find_each..The reason for this becomes clear when you inspect the queries that are made by the two approaches.The find_each method makes this type of query:SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (salary > 80000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000An EXPLAIN on this query reveals the following:1 SIMPLE salaries ALL salary,emp_no NULL NULL NULL 2837536 Using where; Using temporary; Using filesort1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using indexwhich indicates that neither the index on salary nor the index on emp_no is being used to filter the salaries table.The id iterator method makes this type of query:SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001) AND (salary > 80000)An EXPLAIN on this query shows that the query optimizer uses the index on emp_no in the salaries table:1 SIMPLE salaries range salary,emp_no emp_no 4 NULL 1 Using index condition; Using where1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using indexwhich reveals why the find_each method is so much slower than the iterator method.TL;DRThe lesson here is always use EXPLAINs to understand what the MySQL query optimizer actually does so that you can create the most optimized queries⁵.. More details

Leave a Reply