Exporting large data to CSV using MySQL's OUTFILE in a multi-host Rails environment

In one of our recent projects, we had a requirement to allow the site admin to download raw DB data as a CSV file for offline analysis. To fulfill it, we let the admin select a database table and a date range, and download the generated CSV data.

Our initial approach was to generate and send the CSV file asynchronously using Delayed Job. It worked like this:
The admin submits the request, the system queues it, generates the CSV, and sends the admin an email with the file attached. We did this asynchronously to avoid request timeouts during CSV generation. Enqueuing such a request looked roughly like the sketch below.
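This is a minimal illustration, not our actual controller: ‘ExportsController’ and its params are hypothetical, while ‘DataExporter’ is the class shown later in this post, and ‘delay’ is Delayed Job's standard API for queuing a method call.

class ExportsController < ApplicationController
  def create
    # Persist the export request, then let a Delayed Job worker run it asynchronously
    exporter = DataExporter.create!(
      :model_name => params[:model_name],
      :start_date => params[:start_date],
      :end_date   => params[:end_date]
    )
    exporter.delay.process
    redirect_to exports_path, :notice => "Your export has been queued. You will receive an email shortly."
  end
end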

While testing and benchmarking this approach, we encountered two issues when exporting large datasets:

  • The Ruby process consumed too much memory while generating the CSV.
  • The CSV could exceed the attachment size limit of our email service provider, which could result in an undelivered email.

This forced us to rethink and tweak our approach. We decided to try MySQL's SELECT ... INTO OUTFILE query to export the data as a CSV file.
This would have been fairly simple had our DB server and web server been hosted on the same physical host: we could have run the OUTFILE query to dump the CSV data and sent it as an attachment. But our app was running on a master-slave DB architecture in a multi-host environment. So, after running the OUTFILE query, we would have had to SCP the dumped CSV file from the DB server to the Delayed Job server in order to attach it to an email.
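For reference, the general shape of such a query is shown below, with placeholder table, columns, and dates. Note that MySQL writes the output file on the DB server's own filesystem, which is exactly why the SCP step would otherwise be needed.

-- Placeholder example of a SELECT ... INTO OUTFILE query;
-- table, columns, dates, and file path are illustrative only
SELECT id, email, created_at
FROM users
WHERE DATE(created_at) BETWEEN '2014-01-01' AND '2014-01-31'
INTO OUTFILE '/tmp/users-jan-2014.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';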

To avoid this additional SCP step, we dropped the idea of attaching the CSV file to the email. Instead, we decided to send a link in the email from which the data file could be downloaded.

Finally, we ran MySQL's OUTFILE query on our DB slave server to export the data as CSV, and had the admin download the CSV file from the slave server via the link sent in the email, instead of attaching the file to the email itself.

Here is how it works:

  • The admin submits a request to export data by specifying a database table name and a date range.
  • The system queues this request.
  • A Delayed Job worker picks up the job, connects to the DB slave server, executes the query to export the data to CSV (on the slave), and sends the admin an email with a link to download the CSV data.
  • The system cleans up generated CSV files at regular intervals.

Here is the code snippet of our DataExporter class; the ‘process’ method is the one invoked by the Delayed Job worker:


class DataExporter < ActiveRecord::Base

  # Run export queries against the slave so the master is not loaded
  establish_connection Rails.configuration.database_configuration["slave"]

  INPUT_DATE_FORMAT = "%Y-%m-%d"
  DUMP_FOLDER_LOCATION = "/var/www/dumps/"

  attr_accessible :end_date, :start_date, :model_name

  # Called asynchronously by the Delayed Job worker to export the data
  # and email the download link to the admin
  def process
    set_file_name
    to_csv
    transmit
  end

  # Column names as quoted string literals, used as the CSV header row
  def table_headers
    model_column_names.collect { |el| "'#{el}'" }.join(',')
  end

  def columns_to_fetch
    model_column_names.join(',')
  end

  def model_column_names
    model_name.safe_constantize.column_names
  end

  # Path of the dump on the slave's filesystem, inside the directory
  # served by the web server
  def file_path
    "#{DUMP_FOLDER_LOCATION}#{file_name}"
  end

  def host_url
    "https://data.domain.com/"
  end

  # Download link sent in the email
  def file_url
    "#{host_url}dumps/#{file_name}"
  end

  def the_table_name
    model_name.safe_constantize.table_name
  end

  def transmit
    ReportMailer.csv_report(self).deliver
  end

  private

  # Dumps the CSV on the slave. The first SELECT emits the header row and
  # UNION ALL appends the data rows beneath it. The dates are quoted to
  # guard against SQL injection. The MySQL user needs the FILE privilege
  # and write access to DUMP_FOLDER_LOCATION.
  def to_csv
    connection.execute(<<-SQL)
      SELECT #{table_headers}
      UNION ALL
      SELECT #{columns_to_fetch} FROM #{the_table_name}
      WHERE DATE(created_at) >= #{connection.quote(start_date)}
        AND DATE(created_at) <= #{connection.quote(end_date)}
      INTO OUTFILE '#{file_path}'
      FIELDS TERMINATED BY ',' ENCLOSED BY '"'
      LINES TERMINATED BY '\\n'
    SQL
  end

  def set_file_name
    self.file_name = "#{model_name}-#{start_date}-to-#{end_date}-#{Time.now.to_i}.csv"
  end
end
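The ReportMailer referenced in ‘transmit’ simply emails the download link. Its actual code isn't shown here; a minimal sketch could look like this (the sender and recipient addresses are placeholders):

class ReportMailer < ActionMailer::Base
  default :from => "reports@domain.com"   # placeholder sender

  # Emails the admin a link (exporter.file_url) to download the CSV
  def csv_report(exporter)
    @download_url = exporter.file_url
    mail(:to      => "admin@domain.com",  # placeholder recipient
         :subject => "Your #{exporter.model_name} CSV export is ready")
  end
end

The corresponding mail template would just render @download_url as a link.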


We also had to configure a web server on the DB slave to serve the dumped files, so the admin could download the CSV by following the link sent in the email, which looked something like https://data.domain.com/dumps/filename.csv.
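We haven't reproduced our actual web server configuration here, but a minimal sketch, assuming nginx, could look like this (server name and paths match the examples above; SSL certificate directives and access restrictions are omitted for brevity):

server {
  listen 443 ssl;
  server_name data.domain.com;

  # A request for /dumps/foo.csv is served from /var/www/dumps/foo.csv
  location /dumps/ {
    root /var/www;
    autoindex off;  # never list the directory contents
  }
}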

To clean up old files, we configured a cron task to remove files from the dump directory that are older than a specified age; see the sketch below.
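The exact schedule and retention period are up to you; a crontab entry along these lines would do the cleanup (the seven-day window here is just an illustration):

# Every day at 2 AM, delete CSV dumps older than 7 days
0 2 * * * find /var/www/dumps -name '*.csv' -mtime +7 -delete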

So far, we haven't faced any issues with this approach.
Do share your thoughts on this approach, or on what a better implementation might look like.
