Optimising big data processing
It happens a lot: applications that were originally designed and tested against small data sets eventually cannot handle big data. For example, as the database grows or a big CSV file becomes the input, issues such as slow performance, timeout errors, and unreasonable memory usage come onto the scene, and optimising the process becomes more crucial than ever. In this post I'm going to share our recent experience and the approaches we took for optimising big data processing in PHP.

Amazon CloudFront Timeout

One of our internal tools is a payment matching tool, mainly used by the finance team to compare payment gateway reports with the internal database. In the original implementation the operation was triggered by uploading a CSV file of transactions. After some basic validation, every row was pushed to an Amazon SQS queue to be processed later. The problems started when the finance team tried to use the tool with big CSV files containing around 15,000 rows. The first issue was an Amazon CloudFront timeout, because even just creating a job per row for a big file took more than 30 seconds. To resolve the issue and speed up the process we used Amazon S3 to store the uploaded files, and instead of creating all the jobs during the request we dispatched a single parent job whose only responsibility was to create the per-row sub-jobs later. This fixed the CloudFront timeout issue.
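As a rough illustration of the parent job idea (the bucket name, queue URL, and overall structure here are assumptions for the sketch, not our actual code), the upload request only stores the file in S3 and enqueues one message; the parent job then fans the CSV out into one sub-job per row:

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\Sqs\SqsClient;

$s3  = new S3Client(['region' => 'ap-southeast-2', 'version' => 'latest']);
$sqs = new SqsClient(['region' => 'ap-southeast-2', 'version' => 'latest']);

// Hypothetical names for this sketch.
$bucket   = 'payment-reports';
$key      = 'uploads/transactions.csv';
$queueUrl = 'https://sqs.ap-southeast-2.amazonaws.com/123456789012/payment-matching';

// The parent job: download the CSV from S3 and enqueue one sub-job per row.
$tmp = tempnam(sys_get_temp_dir(), 'csv');
$s3->getObject(['Bucket' => $bucket, 'Key' => $key, 'SaveAs' => $tmp]);

$handle = fopen($tmp, 'r');
$header = fgetcsv($handle);

while (($row = fgetcsv($handle)) !== false) {
    // Each message becomes a small, independent payment matching job.
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode(array_combine($header, $row)),
    ]);
}

fclose($handle);
unlink($tmp);

Because the heavy fan-out happens in the background worker rather than in the HTTP request, the upload itself finishes well within the timeout.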

Excessive Memory Usage

Unfortunately, that was not the end of the story and we hit another wall down the track. For some reason our custom worker, which processed the queued jobs, threw the following error after processing almost 2,000 jobs:
Fatal error: Allowed memory size of 134217728 bytes exhausted

So obviously there was a memory leak somewhere. We were also not sure the worker itself was efficient enough, so for the sake of testing we ran the worker with the same number of jobs but without doing any actual processing. Interestingly enough, the same issue happened, meaning the worker itself had a memory leak. To track down the issue we enabled profiling with Xdebug by adding the following line to php.ini:
xdebug.profiler_enable = 1
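For completeness, the profiler also needs a writable output directory; a minimal php.ini sketch (assuming Xdebug 2, where these settings live) looks like this:

xdebug.profiler_enable = 1
; cachegrind.out.* files are written here and can be opened with Webgrind
xdebug.profiler_output_dir = /tmp/xdebug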

After analysing the profiling output with Webgrind we had a fair idea of what was causing the issue.

Guzzle

The worker was fetching jobs from the queue inside a loop using Guzzle, and for some reason the memory was not freed after each request. It also turned out other people had hit the same problem. When we called gc_collect_cycles() after finishing each job in the loop, the issue was gone!
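To give a feel for the fix, here is a minimal sketch of that loop; the endpoint, job structure, and processJob() helper are hypothetical stand-ins, not our actual worker code:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Hypothetical stand-in for the real payment matching work.
function processJob(array $job): void
{
    // ... match the transaction against the internal database ...
}

$client = new Client();

while (true) {
    // Fetch the next queued job over HTTP (the endpoint is made up for this sketch).
    $response = $client->get('https://queue.example.com/jobs/next');
    $job = json_decode((string) $response->getBody(), true);

    if ($job === null) {
        break; // nothing left in the queue
    }

    processJob($job);

    // Force collection of circular references left behind by the request;
    // without this the worker's memory kept growing until it hit the 128 MB limit.
    gc_collect_cycles();
}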

However, when we ran the worker with jobs that included the actual payment matching process, the same issue happened, so we had to optimise the jobs as well. This time we used memory_get_usage(true) in different places in the process to find out exactly where the allocated memory was increasing.
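The idea is simple; a rough sketch (with made-up step names, not our real job code) of how the measurements can be sprinkled around the suspect code:

<?php
// Print the real amount of memory allocated from the system at a given point.
function logMemory(string $label): void
{
    printf("%-30s %6.2f MB\n", $label, memory_get_usage(true) / 1048576);
}

logMemory('start');

// Stand-in for loading a big gateway report into memory.
$rows = [];
for ($i = 0; $i < 100000; $i++) {
    $rows[] = ['id' => $i, 'amount' => $i * 0.01];
}
logMemory('after loading report rows');

// Releasing the data should show a corresponding drop; if it does not,
// something is still holding a reference and that is where to dig.
unset($rows);
gc_collect_cycles();
logMemory('after freeing report rows');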

[Figure: memory usage measured at different points of the process]

As you can see, this pinpointed the place in the code causing the issue, which helped us resolve the bug in the end.