Parallel Forking and Process Management
Stop me if you have heard this one before. You have a list of files you need to process in a text file with one item per line. Handling this is fairly simple: you read a line in and process it, over and over again, until you have processed the whole list. This works great, but if that list is 40,000 items long and each item takes up to 30 seconds to run, it suddenly takes a very long time to finish (up to roughly two weeks if run one item at a time). In this case processing each item is just a system call to another CLI application with no shared resources, so the items can be processed in parallel with no fuss. For this task I am using Parallel::ForkManager, and here are the important bits:
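A minimal sketch of that kind of loop, following the basic start/finish pattern from the Parallel::ForkManager POD. The worker count of 8, the list file passed on the command line, and the tool name process_item are all assumptions made for illustration, not values from the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# Assumed value for illustration; tune it to the machine.
my $process_count = 8;

# Run one external command per line of $list_file, with at most
# $max_procs children alive at once. Returns the number of items dispatched.
sub process_list {
    my ($list_file, $max_procs, @command) = @_;
    my $pm         = Parallel::ForkManager->new($max_procs);
    my $dispatched = 0;

    open my $fh, '<', $list_file or die "Cannot open $list_file: $!";
    while (my $item = <$fh>) {
        chomp $item;
        next unless length $item;    # skip blank lines
        $dispatched++;

        $pm->start and next;         # parent: go read the next line

        # Child: run the external tool on this item, then exit.
        my $status = system(@command, $item);
        warn "Failed: $item\n" if $status != 0;
        $pm->finish($status == 0 ? 0 : 1);
    }
    close $fh;

    $pm->wait_all_children;          # block until every child has exited
    return $dispatched;
}

# 'process_item' is a hypothetical stand-in for the real CLI application.
process_list($ARGV[0], $process_count, 'process_item') if @ARGV;
```

Run it as `perl script.pl files.txt`, where files.txt is the one-item-per-line list. The call to start blocks once the configured number of children are running, which is what keeps the machine from being swamped.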
Now the above is basically the first example in the POD and not all that interesting. The "fun" part was discovering that some of the processes were hanging. Digging in, I found that it was the other application I am calling that was hanging, because the file it was trying to read was for some reason unavailable. This kind of failure is acceptable: just log what didn't work and move on. To clean up the stale processes I used Proc::ProcessTable to check for jobs that had been running too long and kill them, running that check once every N items, where N is the number of processes configured for Parallel::ForkManager. The code looks like this now:
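A sketch of how such a sweep could be wired into the loop. The 120-second cutoff, the worker count, the tool name process_item, and the helper names stale_pids and kill_stale_jobs are assumptions for illustration. Matching on the fname field is also an assumption about how the tool shows up in the process table (on Linux that field is truncated to 15 characters).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;
use Proc::ProcessTable;

my $process_count = 8;               # assumed worker count
my $max_runtime   = 120;             # assumed cutoff, in seconds
my $tool          = 'process_item';  # hypothetical CLI name

# Return the PIDs of processes named $name that started more than
# $max_age seconds before $now. $procs is an array ref of hashes
# with pid, fname and start (epoch seconds) keys.
sub stale_pids {
    my ($procs, $name, $max_age, $now) = @_;
    return map  { $_->{pid} }
           grep { ($_->{fname} // '') eq $name && $now - $_->{start} > $max_age }
           @$procs;
}

# Snapshot the process table and kill any stale copies of $name.
sub kill_stale_jobs {
    my ($name, $max_age) = @_;
    my @procs = map { +{ pid => $_->pid, fname => $_->fname, start => $_->start } }
                @{ Proc::ProcessTable->new->table };
    for my $pid (stale_pids(\@procs, $name, $max_age, time)) {
        warn "Killing stale $name (pid $pid)\n";
        kill 'KILL', $pid;
    }
}

sub process_list {
    my ($list_file) = @_;
    my $pm   = Parallel::ForkManager->new($process_count);
    my $seen = 0;

    open my $fh, '<', $list_file or die "Cannot open $list_file: $!";
    while (my $item = <$fh>) {
        chomp $item;
        next unless length $item;

        # Every $process_count items, sweep for jobs that ran too long.
        kill_stale_jobs($tool, $max_runtime) if ++$seen % $process_count == 0;

        $pm->start and next;
        my $status = system($tool, $item);
        warn "Failed: $item\n" if $status != 0;    # log it and move on
        $pm->finish;
    }
    close $fh;
    $pm->wait_all_children;
}

process_list($ARGV[0]) if @ARGV;
```

When a hung tool is killed, system returns a non-zero status in the worker, so the failure is logged and the slot is freed. One limitation of sweeping from inside the dispatch loop: if every slot is hung at the same moment, the parent blocks in start and the sweep never runs, so a per-child timeout may be a safer complement.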
I tweak $process_count based on the number of CPU cores, since these processes are CPU heavy but RAM light. I had multiple versions of this application running on different server sizes for the last week to gather performance metrics for the future. I want to squeeze as much work out of a system as I can every second it runs. Yes, I do love optimizing code and fast results, but this is purely practical. Above I mentioned a list of 40,000 items. Since I started using this little application, that source file has grown to a few hundred thousand items, because processing time is not terribly long.
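One way to get a starting value for $process_count is to ask the system how many cores are online. This is only a sketch: it assumes a getconf command that understands _NPROCESSORS_ONLN (common on Linux and macOS), and cpu_count is a hypothetical helper name. Treat the result as a baseline to tune by hand.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Ask getconf for the number of online cores; fall back to 1 if the
# command is missing or prints something unexpected.
sub cpu_count {
    my $n = `getconf _NPROCESSORS_ONLN 2>/dev/null`;
    return ($n && $n =~ /^(\d+)/) ? $1 : 1;
}

# One worker per core is a reasonable baseline for CPU-bound jobs.
my $process_count = cpu_count();
print "Using $process_count worker processes\n";
```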
This is how I feel now.