In one of our current project we had a requirement where we needed to allow the site admin to download raw DB data as CSV file for some offline analysis. So, according to the requirement, we allowed the admin to select a database table and a date range to download the generated CSV data.
Our initial approach was to send the CSV file asynchronously using delayed job. It worked like this:
Admin submits the request, system queues it and sends admin an email with the attached CSV file. The reason we did this asynchronously was to avoid timeout during CSV generation.
While testing & benchmarking this approach, we ecountered two issues when we had a large data to export.
- Too much memory was used by the ruby process when generating the CSV
- CSV size could be larger than the supported attachment size by the email service provider, hence could result in an undelivered email
This situation forced us to rethink and tweak our approach. We decided to try MySQL’s ‘outfile’ query to export data as CSV file.
This could have been fairly simple had our DB server and web server been hosted on same phyical host. We could have run the outfile query to dump CSV data and send it as an attachment. But our app was running on a master-slave DB architecture and multi-host environment. Hence, after running the db outfile query, we had to SCP the dumped CSV file from DB server to delayed job server in order to attach it to an email.
To avoid this additional step of SCP, we decided to drop the idea of attaching CSV file to the email. Instead of sending CSV file as attachment we decided to send a link in the email which could be used to download the data file.
Finally, we ran MySQL’s ‘outfile’ query to export data as CSV on our DB slave server, and asked admin to download the CSV file from the slave server by following the link sent in the email, instead of attaching the CSV file in email itself.
Here is how it works:
- Admin submits the request to export data by specifying database table name and date range
- Our system queues this request.
- Delayed job worker picks the job, makes connection to DB slave server, executes query to export data to CSV(on slave), and sends an email to admin with the link by which he can download the CSV data.
- System cleans up generated CSV files at a regular interval.
Here is the code snippet of our DataExporter class, the ‘process’ method is the method called by delayed job worker:
class DataExporter < ActiveRecord::Base establish_connection Rails.configuration.database_configuration["slave"] INPUT_DATE_FORMAT = "%Y-%m-%d" DUMP_FOLDER_LOCATION = "/var/www/dumps/" attr_accessible :end_date, :start_date, :model_name # This method is being called asynchronously to export and send details download link to admin def process to_csv transmit end def table_headers model_column_names.collect {|el| "'#{el}'"}.join(',') end def columns_to_fetch model_column_names.join(',') end def model_column_names model_name.safe_constantize.column_names end def file_path "#{DUMP_FOLDER_LOCATION}#{file_name}" end def host_url "https://data.domain.com/" end def file_url "#{host_url}dumps/#{file_name}" end def the_table_name model_name.safe_constantize.table_name end def transmit ReportMailer.csv_report(self) end private def to_csv connection.execute( "SELECT #{table_headers} UNION ALL SELECT #{columns_to_fetch} FROM #{the_table_name} WHERE DATE(created_at) >= '#{start_date}' AND DATE(created_at) <= '#{end_date}' INTO OUTFILE '/tmp/#{file_name}' FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n'") end def set_file_name self.file_name = "#{model_name}-#{start_date}-to-#{end_date}-#{Time.now.to_i}.csv" end end
We also had to configure a web server on the DB slave server to serve the link sent in the email from which admin could download the CSV file, which was something like: https://data.domain.com/dumps/filename.csv
To cleanup old files, we configured a cron task to remove files from the directory which are older than some specified time.
Till now, we haven’t faced any issues with this approach.
Do share your thoughts on this approach, or on what could have been a better implementation.