Roadmaps

By Vyacheslav Matyukhin on December 6, 2010 11:13 PM

Ever since I promised on the #moose IRC channel to write about my "generic streams" framework (that was several weeks ago), my conscience hasn't left me alone.
I'm still not ready to talk about streams in enough detail, so I'll talk about the bigger picture instead.

So, I'm working at Yandex on a fairly large-scale project (it's a blog search, if you really want to know). Our codebase is split into something like 400 separate distributions, mostly Perl. Every distribution is packaged as a Debian package, and these packages depend on each other in complicated ways. We have several hundred production hosts, separate test hosts and a separate test cluster. Our group is just a small part of the whole company (less than 1 percent, in fact), and some of our modules are used outside of our blogsearch project.

Why am I telling you all this? Because the software we write, and especially the software we release as open source, is very much shaped by this highly diverse environment. Over the last few years we have run into the same patterns again and again in service management, software configuration and streaming data processing, and we came up with some interesting (IMHO) and generic solutions.

So here are three things which we can share with the world:

First, there is service management. As you can imagine, we run a lot of daemons (HTTP services and some other stuff), and so we wrote Ubic. I've written about it many times already, so there is a fairly high probability that you already know what Ubic is. If you don't, just look at these slides.

Second, there is software configuration. Having a production cluster, a testing cluster and a separate configuration for unit testing, and maintaining all of it for 400 distributions, is a nightmare... unless you invent something like Morpheus. Morpheus is the "ultimate configuration engine": it provides a unified interface to any configuration value, which can then be supplied from any source (config files in any format, a DB, the environment, command-line arguments) without any changes to the code at all.
Think "Log::Any", but with the same idea applied to configs.
I'm really excited about Morpheus and I'll talk about it in detail at Saint Perl 2 in St. Petersburg, Russia, on December 18. I'll have some slides by then, so if you like the Morpheus idea too and want to know more, just wait two weeks.
(If you are really impatient, you can go ahead and read the code, but there are almost no docs so far.)
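
To make the "any source, no code changes" idea a bit more concrete, here is a minimal sketch in plain Perl. It is not the Morpheus API: `config_get`, `@sources`, the `/blogsearch/db/host` key and the way keys are mapped to environment variables are all made up for illustration. The only point is that the calling code asks for a key and never learns where the value came from.

```perl
use strict;
use warnings;

# Hypothetical sketch, NOT the Morpheus API.
# Stands in for values parsed from a config file.
my %file_config = ( '/blogsearch/db/host' => 'localhost' );

# Each source is a code ref: it takes a key and returns a value,
# or undef if it knows nothing about that key. Earlier sources win.
my @sources = (
    # Environment: /blogsearch/db/host -> $ENV{BLOGSEARCH_DB_HOST}
    sub {
        my ($key) = @_;
        (my $name = uc $key) =~ s{^/}{};
        $name =~ s{/}{_}g;
        return $ENV{$name};
    },
    # Config file (already parsed into a hash above)
    sub {
        my ($key) = @_;
        return $file_config{$key};
    },
);

# The unified interface: ask every source in order, first defined value wins.
sub config_get {
    my ($key) = @_;
    for my $source (@sources) {
        my $value = $source->($key);
        return $value if defined $value;
    }
    return;
}

print config_get('/blogsearch/db/host'), "\n";
```

In this sketch, switching a test host to another database means setting one environment variable or adding one more source to the list; the code that calls `config_get` stays the same.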

Third, streaming data processing.
Building a blog search means that you have to fetch a lot of web pages, put them into storage, index everything you put into storage, and also extract links from this "new posts" stream, then process these links further (expand short links, for example), export the links to other services, use them to calculate antispam metrics, and so on.
In my mind, it looks like a giant graph with vertices representing data stored on different hosts and groups of hosts in various forms (it can be a log, a DB table, or an rsync share on another host), and edges between them representing the programs that process the data.
Also, I like to think about this picture.
Thus, a streaming framework. Or rather a set of libraries: some define common APIs, some provide specific implementations (Stream::Log, Stream::DB, etc.), and some help to integrate all this stuff into the big picture. I still don't know when I'll find the will to open it to the public, not because it's not ready (we are using it in production), but because I'm still trying to find the right approach to decouple it and separate the useful parts from the policy as much as possible, while keeping it easy to use and deploy.
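
Since the framework itself is not public yet, here is only a toy sketch of the graph idea in plain Perl, not its real API. The names `array_in` and `pipe_through` and the `short.example` URLs are invented for this example. A data node is modelled as a code ref that returns the next item (or undef when it is exhausted), and a processing program is something that wraps one stream and produces another.

```perl
use strict;
use warnings;

# Hypothetical sketch, NOT the real streaming framework.
# An input stream is a code ref: each call returns the next item,
# or undef when there is nothing left.
sub array_in {
    my @items = @_;
    return sub { return shift @items };
}

# A processing step: wraps an input stream and a callback that maps
# one item to a list of zero or more output items.
sub pipe_through {
    my ($in, $step) = @_;
    my @buffer;
    return sub {
        until (@buffer) {
            my $item = $in->();
            return unless defined $item;
            push @buffer, $step->($item);
        }
        return shift @buffer;
    };
}

# "New posts" -> extract links -> keep only short links for later expansion.
my $posts = array_in(
    'see http://example.com/a and http://short.example/x',
    'no links here',
    'short one: http://short.example/y',
);
my $links = pipe_through($posts, sub {
    my ($post) = @_;
    return $post =~ m{(http://\S+)}g;
});
my $short = pipe_through($links, sub {
    my ($link) = @_;
    return $link =~ m{^http://short\.example/} ? ($link) : ();
});

while (defined(my $link = $short->())) {
    print "$link\n";
}
```

In the real setup the array would be replaced by a log, a DB table or a remote share, which is exactly why a common API for reading from all of them is useful.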

Besides these three major directions, we also have several small useful modules.
Some of them are already out there as separate distributions (Log::Unrotate, AnyEvent::HTTP::LWP::UserAgent, DBD::Safe), and some are on their way.
I'm not sure whether two of these small modules would be useful to anyone as separate distributions, but they were too helpful to leave out, so I bundled them with Ubic: Ubic::Lockf and Ubic::Persistent. If you like either of them, tell me and I'll upload it right away as a separate distribution. Especially if you suggest an appropriate name :)

Well... that's all for now.

Tagged as: morpheus, opensource, stream, ubic, yandex

4 Comments

Aristotle | December 10, 2010 9:09 AM

It looks like DBD::Safe is aiming at the same problem as DBIx::Connector, which is an extraction of battle-tested connection management code from DBIx::Class. Is there a reason to choose DBD::Safe over the use of DBIx::Connector?

Vyacheslav Matyukhin replied to comment from Aristotle | December 10, 2010 3:13 PM

DBD::Safe is a DBI driver, so you can use it without changing existing code which expects a DBI database handle. And we have tons of such code.

Sometimes our programs process data for hours, so you can't just call my_function($conn->dbh).

On the other hand, DBIx::Connector's run(sub { ... }) feature looks more powerful and flexible than the current DBD::Safe approach. DBD::Safe pings the DB before executing or preparing a statement, and the statements you get by calling $safe_dbh->prepare() are not safe themselves. We are going to add "safe statements" and something similar to "fixup" mode later, though.

So, yes, the only reason to use DBD::Safe is if you want to get these features without changing your code.

PS: FWIW, DBD::Safe is derived from Catalyst::Model::DBI.

Aristotle | December 10, 2010 7:17 PM

Ah. How did I miss the DBIx::Connector mention?! It was right there in the POD.

It might be useful to add the rest of your explanation from here to the docs, though – as a separate COMPARISON TO section maybe, so people have a better feel for the trade-off.

(The code in Catalyst::Model::DBI has been derived from DBIx::Class in turn, btw. It might be missing things that were tweaked in the DBIx::Class version in the meantime – there was mention of something about this, a while ago, on the Catalyst list or somewhere (maybe in relation to Theory starting work on DBIx::Connector? I don’t remember). It might be worth chasing down the maintainers and having a chat. Although you’ve clearly done your due diligence (whereas I read hastily), so I may not be telling you any news; if so, don’t mind me.)

Vyacheslav Matyukhin replied to comment from Aristotle | December 10, 2010 9:23 PM

Strange, I missed the DBIx::Connector reference in the POD too. As I said, there are several of us, and DBD::Safe was written by another member of our group :)

Thanks for your advice, it's helpful.

About Vyacheslav Matyukhin

I wrote Ubic. I worked at Yandex for many years, and now I'm building my own startup, questhub.io (formerly PlayPerl). I'm also working on Flux, a streaming data processing framework. CPAN ID: MMCLERIC.
