Results tagged “stream”


Ever since I've promised to write about my "generic streams" framework on #moose IRC channel (it happened several weeks ago), my conscience wouldn't leave me alone.
I'm still not ready to talk about streams with enough details, so I'll talk about the bigger picture instead.

So, I'm working at Yandex company on a quite large-scale project (it's a blogsearch if you really want to know). Our codebase is separated into something like 400 separate distributions, mostly perl. Every distribution is packaged into debian package and all these packages are cross-dependent in the complicated ways. We've got several hundreds of production hosts, separate test hosts and the separate test cluster. Our group is just a small part of the whole company (less than 1 percent in fact) and so some of our modules are used outside of our blogsearch project.

Why am I telling you all this? Because the software we write and especially the software which we release in the opensource is very much shaped by this highly diverse environment. In the last few years, we encountered some common patterns in the areas of service management, software configuration and streaming data processing again and again, and came up with some interesting (IMHO) and generic solutions.

So here are three things which we can share with the world:

First, there is service management. As you can imagine, we run a lot of daemons - http services and some other stuff, and so we wrote Ubic. I wrote about it many times already, and there is a fairly high probability that you already know what Ubic is. If you don't know, just look at these slides.

Second, there is software configuration. Having a production cluster and testing cluster and separate configuration for the unit testing and maintaining it all for 400 distributions is a nightmare... unless you invent something like Morpheus. Morpheus is the "ultimate configuration engine" - it provides the unified interface to any configuration values, which then can be provided from any source (config files in any format, DB, environment, command line arguments) without any changes to code at all.
Think "Log::Any", but apply its idea to the configs.
I'm really excited about Morpheus and I'll talk about it in detail on Saint Perl 2 in Russia, in St. Petersburg on December, 18. I'll have some slides by then, so if you like the Morpheus idea too and want to know more, just wait for 2 weeks.
(If you are really impatient, you can go on and read the code, but there are almost no docs by now).

Third, streaming data processing.
Creating blogsearch means that you have to fetch a lot of web pages, then put them into storage, then index all the stuff you put into storage, and also extract links from this "new posts" stream, then process these links more (expand short links, for example), also export link to another services, also use these links to calculate antispam metrics, etc.
In my mind, it looks like a giant graph with vertices representing data stored on different hosts and groups of hosts in various forms (it can be log or it can be a DB table, or it can be a rsync share on another host), and nodes between them representing processing programs.
Also, I like to think about this picture.
Thus, streaming framework. Or more like set of libraries, some of which define common APIs, some providing specific implementations (Stream::Log, Stream::DB, etc), some helping to integrate all this stuff into the big picture. I still don't know when I'll find the will to open it into public, not because it's not ready - we are using it in production - but because I'm still trying to find the right approach to decouple it and separate useful parts from the policy as much as possible, and keep it easy to use and deploy at the same time.

Besides these three major directions we also have several small useful modules.
Some of them are already out there in separate distrubtions (Log::Unrotate, AnyEvent::HTTP::LWP::UserAgent, DBD::Safe), some are on their way.
About two of these small modules I'm uncertain if they would be useful to anyone as separate distributions, but they are still too helpful and I had to bundle them with Ubic: Ubic::Lockf, Ubic::Persistent. If you like one of them, tell me and I'll upload them immediately as the separate distributions. Especially if you'll suggest an appropriate name :)

Well... that's all by now.

Can't choose CPAN namespace

So... I've got this set of modules I've been going to opensource for more than a year now. Strangely, main obstacle is a choice of namespace.

Internally, they are called Stream::*. It is multi-layered set of classes with common interfaces, from low-level Stream::File / Stream::Log / Stream::MemoryStorage, to more complex Stream::Queue (local file-based queue supporting multiple parallel clients). Also, there are functional-style filters, catalog which can construct stream objects by their name, pumpers connecting input and output streams, multiplexing, and a lot of abstractions, base classes and roles...
Together, they assist in implementing complex asyncrohous, realtime and possibly distributed data processing.
By now, i think the paradigm I'm trying to implement is called a Flow-based programming, but until recently, I've mostly been thinking in terms of this image:
Anyway. I'll have a chance to talk about it later when this code will become public.

Any hints about namespace?
I don't like Stream::* that much, and there is a Stream-Reader in that namespace already.


About Vyacheslav Matyukhin

user-pic I wrote Ubic. I worked at Yandex for many years, and now i'm building my own startup (formerly PlayPerl). I'm also working on Flux, streaming data processing framework. CPAN ID: MMCLERIC.