Thread pool for a boss/worker model
This is a pretty simple idea - a boss thread assigns work to a pool of worker threads who do nothing until some work enters their queue. This way the boss can fill a queue very quickly and you have multiple back end processes that can consume that queue.
I'm using threading and not an async approach because some of the work I'll be assigning to threads consists of long-polling operations. The workers will hit some REST API route on some other application, and some of those routes take up to 30 seconds to complete or have dependencies or follow-up work. Rather than block and spin in an async call, for my tasks it's easier to have a queue of work and workers that execute it.
Each task may have multiple ordered sub-tasks, so each worker will be assigned a "task group". This way it can manage its own dependencies and I can manage the total load on the server it's sitting on, the cloud, and the database. It's not about "speed" in the sense of less time to execute; it's more about keeping a small pipe filled and not blocking other workers during a long blocking call.
There aren't a lot of thread-pooling modules that I could see on CPAN. It's not a complex task, but there are a lot of things to think about. The Thread::Pool module causes problems with the Perl MongoDB module, and that's really the only recent option that seems to fit my problem set.
I ended up rolling my own module using threads::shared, Thread::Queue, and Thread::Semaphore. I basically spawn X threads, using the semaphore to hand out tokens that keep the number of threads at the right level. I use shared queues, and each worker gets its work through non-blocking checks against the queue. I use a shared variable with each worker thread to control its loop and kill it when it's time. This also allows for things like stopping the boss and waiting for the workers to drain the queue before killing the pool. You can add workers and stop individual workers. You can keep the workers running and kill the boss, or kill the workers and let the boss run. The boss and the workers have callback functions for executing their tasks. It's shiny and runs great.
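To make the moving parts concrete, here is a minimal sketch of that kind of design using only the core modules named above. It is not the actual module's code: the names `do_task` and `add_worker`, the pool size of 3, and the doubling "task" are all made up for illustration. It shows a semaphore capping the number of workers, non-blocking queue checks, and a shared flag that lets the workers drain the queue before they exit.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use Thread::Semaphore;

my $MAX_WORKERS = 3;                                  # assumed pool size
my $slots   = Thread::Semaphore->new($MAX_WORKERS);   # one token per allowed worker
my $queue   = Thread::Queue->new;                     # shared FIFO of work items
my $results = Thread::Queue->new;                     # output, collected for the demo
my $running :shared = 1;                              # workers keep looping while true

# Hypothetical task callback; a real one might make a slow REST call.
sub do_task { my ($n) = @_; return $n * 2 }

sub worker {
    while (1) {
        my $stopping = !$running;            # read the flag first ...
        my $item = $queue->dequeue_nb;       # ... then do a non-blocking check for work
        if (defined $item) {
            $results->enqueue(do_task($item));
            next;
        }
        last if $stopping;                   # queue was empty after stop was requested
        select(undef, undef, undef, 0.05);   # idle briefly before polling again
    }
    $slots->up;                              # hand the token back on exit
}

# Spawn a worker only if a token is free, so the pool never exceeds the cap.
sub add_worker {
    return unless $slots->down_nb;
    return threads->create(\&worker);
}

my @pool;
for (1 .. 5) {                               # asks for 5; the semaphore allows 3
    my $thr = add_worker();
    push @pool, $thr if defined $thr;
}

$queue->enqueue(1 .. 10);                    # the boss fills the queue quickly
$running = 0;                                # stop once the queue has been drained
$_->join for @pool;

my @out;
while (defined(my $r = $results->dequeue_nb)) { push @out, $r }
@out = sort { $a <=> $b } @out;              # completion order is not guaranteed
print scalar(@pool), " workers produced: @out\n";
```

The flag is read before the queue is checked so that a worker cannot see an empty queue, miss a late enqueue, and then exit with work still pending. That only holds if the boss stops enqueuing before it clears the flag, which is an assumption of this sketch.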
I'm 99% certain my code will pass any test case I throw against it and it will work fine for what I'll be using it for (work queue for a back-office cloud controller).
Did I miss a module? Is there something better out there that is less intrusive than Thread::Pool? If not I'll clean up the POD on this and submit it to CPAN but I'd rather not duplicate with Yet Another CPAN Module that already has 10 different variations.
I can propose a very different way using ZeroMQ.
It's a powerful yet simple library for messaging between processes. You can implement everything you need using simple processes that communicate easily with each other.
Using only messaging, you drop a whole class of concurrent access issues.
There is a Perl module for it.
@happymorning - I thought about using an MQ for the queues/thread control/boss messaging since the org is using a lot of ActiveMQ already, but this is pretty simple - a process from the UI or from someplace else says "Take this job and run it!", and the master grabs that job and then says "Worker, do this job". There is only one Boss per work queue, so I'm just using a collection on Mongo for getting work into and out of the Master Queue. Simple, quick, easy.
ZeroMQ is pretty cool from what I can tell! It's just overkill for what I need on the messaging/queue side.
It's *thread management* that I really want to make sure I'm doing right.
Hi Kal
What about these Perl modules:
1) Gearman
2) Hopkins
3) TheSchwartz
4) Helios (which uses TheSchwartz)
In fact, it'd be marvellous if you'd try all of them and write a comparative review for those of us who haven't actually tried any...
Wow, I didn't see these, I'm still used to looking for relatively self-descriptive names I guess. :)
Gearman is definitely lean in terms of dependencies, but it's single-threaded and I'd still have to manage multiple worker threads (and a server thread and a manager to stuff tasks), or just have separate apps to run each of these, though the complexity drops a bit. It fits most of the requirements but has stuff I don't want or need (socketed client/servers, IPC).
Hopkins is a static job runner - they call themselves a better cron job, and this fulfills part of what I need, but it requires static XML configuration up front. Under the hood it runs POE::Component::JobQueue for its heavy lifting, which is async and not threaded, and will block on some of my tasks. Not quite what I need.
TheSchwartz/Helios - these can also register capabilities, but add the requirement of a DBI-driven database. Much closer to what I'm looking for, but a bit of overkill. Also async, and most likely will block on a few of my tasks.
My problem was I was also searching for lighter-weight thread pooling modules, to use across multiple apps for consistency, and I didn't see the modules you listed. I don't need any real IPC other than to signal a thread to exit its loop. I don't really want/need additional sockets open, other than a single outbound connection to a DB or job queue.
One task for this thread pool will be on a VM with only a single worker to manage jobs coming in from a master controller. A single threaded queue/run program would actually suffice here but there are a couple of situations where a system will be heavily using its workers and I'd like them to be able to service more than one at a time.
It's really the controller(s) I'm worried about. But I don't need a million workers. I'm starting with 5 and will be scaling to ~25, and Future Growth may increase that. I was just looking for a Simple Thread Pool management module, very much like Thread::Pool, except one that worked with non-thread-friendly modules like MongoDB. ;(
Another suggestion: beanstalkd and the Perl module Beanstalk::Client.
I use this by forking off a number of workers per machine. Then I add more machines as load increases. Having isolated processes that only talk to the beanstalkd server means I don't have to deal with synchronisation or blocking issues. This makes for very simple and efficient code.
beanstalk seems to be very much like Gearman. It's a TCP client/server model in which you feed jobs into 'tubes' or queues on the server, and workers connect to the server and then consume those jobs. You can't register capabilities, but I'm assuming you could have differently capable workers pulling jobs from different tubes. This looks like a simpler API than Gearman, but the source for Gearman is easier to read - the beanstalk client source looks - err.. optimized, but I'd have to benchmark to see if it makes any difference. :)
The problem with these for my application lies in the additional dependency of a job server. Both of these server daemons are written in C, which I could find packages for, but that's yet another dependency chain to track. A partial framework already exists for this design and it's all in Perl, and I'd rather continue to update this design (DB backend), which works great, than switch to a new client/server model.
I've finished a rough draft of Thread::Workers. I need to polish its edges a bit and make it pretty before I expose it to the world.
It's a simplistic pool/queue design with no priorities - in my case, I need FIFO access for all queues. I assume if one wanted priorities, one could create multiple pools of boss/workers and manage the priorities from the main application thread using this module.
For me, it's enough to create a single pool. I hook the boss's fetch work callback to a mongo query. If the query returns results, they are fed into a FIFO queue which the idle workers pick up on a first come, first served basis.
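A rough sketch of that boss-side flow, again with core modules only and not taken from Thread::Workers itself: `make_fetcher` is a hypothetical stand-in for the Mongo query that hands out a fixed backlog two jobs at a time, and the workers simply record each job id instead of running anything. The workers here use a blocking dequeue with an undef stop signal purely to keep the example short.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $work = Thread::Queue->new;        # FIFO queue the boss feeds
my $done = Thread::Queue->new;        # finished job ids, collected for the demo
my $boss_running :shared = 1;         # cleared to stop the boss loop

# Hypothetical fetch callback: hands out a backlog two jobs at a time.
# A real callback would query the database instead.
sub make_fetcher {
    my @backlog = @_;
    return sub { return splice(@backlog, 0, 2) };
}

sub boss {
    my ($fetch) = @_;
    while ($boss_running) {
        my @jobs = $fetch->();
        if (@jobs) {
            $work->enqueue(@jobs);                 # keep arrival order
        }
        else {
            select(undef, undef, undef, 0.05);     # nothing new; poll again shortly
        }
    }
}

sub worker {
    # An undef item is the stop signal for this sketch.
    while (defined(my $job = $work->dequeue)) {
        $done->enqueue($job);                      # stand-in for running the job
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 2;
my $fetch   = make_fetcher(1 .. 6);
my $boss    = threads->create(sub { boss($fetch) });

# Demo only: wait until the whole backlog has been processed.
select(undef, undef, undef, 0.05) while $done->pending < 6;

$boss_running = 0;                                 # stop the boss first ...
$boss->join;
$work->enqueue(undef) for @workers;                # ... then one stop signal per worker
$_->join for @workers;

my @finished;
while (defined(my $id = $done->dequeue_nb)) { push @finished, $id }
@finished = sort { $a <=> $b } @finished;          # completion order can vary
print "finished jobs: @finished\n";
```

Jobs leave the queue in the order the boss enqueued them, but with more than one worker the order in which they finish is not guaranteed, which is why the demo sorts the ids before printing.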
Have you seen Parallel::Workers or Thread::Pool on CPAN?
Gratuitous self promotion: Thread::Apartment provides a pooling interface, and should allow use of direct method calls to threaded objects.
Hi Kal
Beanstalk::Client is written by Graham Barr, which is a big plus in my books.
BTW: I haven't used any of these, but would be v-e-r-y confused if confronted with the need to do so.
As for overlooking those modules I listed, I'm not surprised. Many modules have obscurantistic names.
That's why I keep a tiddlywiki (check them out!) (one of many) for Perl with a section dedicated to 'Interesting Modules' I stumble across.
I guess it's time to carve out that list into its own web page and put it on my site. I'll do that over the next week and blog on blogs.perl.org.
Right now I'm chasing an embarrassingly huge bug on GraphViz2::Marpa........
Cheers
Ron