YAPC::Asia Tokyo 2013: Fighting Legacy Code

Due to popular demand, I'm going to post a summary/transcript of my presentation for YAPC::Asia Tokyo 2013, titled "Fighting Legacy Code". It's also the presentation I would have given had I not canceled my attendance at YAPC::NA Austin 2013.

Because I wrote all this in a rush, the language is very terse. Please let me know on twitter (@lestrrat) if something doesn't make sense.

You could also take a look at my Japanese slides while reading through this. You may get a better feel for what I'm writing about.


Fighting Legacy Code

This is the story of an engineer who started working on the codebase for livedoorBlog, Japan's largest blog service, serving billions of requests every day.

livedoorBlog

Some facts:

  • Started 2003 (10 yrs in service!)
  • Very featureful
  • Many, many engineers have worked on, and have moved on from this codebase
  • Approx 220,000 lines of Perl code
  • Everything glued to mod_perl 1.3.x and perl-5.8.8

Pageviews / unique user counts are not publicized, but I can confirm that it handles a pretty scary amount of requests. The content served is dynamic (no static HTML is generated), and we utilize caches in various forms to make it all work fast. I can also confirm that the backend app servers handle about 1 billion raw requests per day.
The whole system consists of a few hundred servers (low hundreds).

The goal of my assignment is to make this 10-year-old mess of a codebase easier to develop, easier to operate, and more modern. This presentation takes you through the stuff I worked through during the last nine months or so.

So what did I do?

Well, so first things first: THERE. IS. NO. SILVER. BULLET.

There's no magic middleware that makes all the problems go away. There's no One True Way To Refactor that gets you all the modern stuff in. There's no buzzword that will solve all your problems.

Instead, this is what you do: You patiently write, refactor, move around, remove code -- one by one, in small steps. The only way to make this work is to work hard at it.

Back to our target

So more details about our system:

  • perl-5.8.8
  • Glued to mod_perl 1.3.x.
  • Makefile.PL doesn't exist. Dependencies are kept in *.tar.gz files
  • Dependencies are old. Versions from 6~8 years ago
  • There is a bunch of code nobody knows the purpose of, and code that hasn't even been used for the last few years

In other words, it's your stereotypical legacy system.

Up to this point, the engineers and staff had been making this system work through sheer power of will and brute-force application of technology. Note: in this talk I will stress the bad aspects, but when you start looking at the code you start realizing that this system is indeed well thought out. It's just... old.

Anyway, from now on we need to clean up the implementation of this system and make it easier to do new things in the future.

Low-Level Goals

  • Upgrade perl, and make it possible to keep upgrading perl in the future
  • Modernize the packaging of the system
  • Say farewell to mod_perl
  • Automate bootstrapping to make it easier for newbies (like myself!) to start working on the codebase
  • And remove anything else that might otherwise become an obstacle to development and operation in the future

Perl Upgrade

Of course, I really wanted to upgrade Perl asap, but ... our code was running on mod_perl 1.3.x built against perl-5.8.8. We'd have to switch to PSGI before we could change perl versions.

So let's step back and see our system breakdown...

  • The distributor component (jp2) - serves content for PC, smartphone, et al.
  • Comments/Trackbacks (blogsys)
  • CMS - where you write your articles and stuff

Of the above, jp2 is written as raw Apache handlers. The other two use Sledge, a web application framework developed around 2000.

So we decided to first move jp2 away from mod_perl. It looked easier than migrating a framework-based component.

Phase 1: Refactor the Distributor Component

So, now to look at the code. What is this doing? Is there a spec? Ha-hah... right... Where are the debug/trace logs? What? No logs?

Let me tell you: logs are the single most important thing when you're trying to modernize a legacy system. YOU WILL LOG. OR ELSE....

At first glance, this is what typical logging in our system looked like:


sub hoge {
  my ($self, $foo, $bar, $baz) = @_;
  ...
  # warn "Got $foo $bar $baz";
  if (! $foo) {
    # warn '$foo is FALSE'
    ...
  }
  ...
}

Ugh. Commented out warns are not logs. They are just doodles in our source code.

We need our logs to be

  1. All over the place: preferably in every important conditional block
  2. Informational: Don't just print "x = 1". Print "x = 1, so XXX needs to happen next" or some such, so that you can figure out what's actually going on
  3. Easily toggle-able: Commenting out logs is a sure way to fail. We need to make all logs appear/disappear with the flip of a switch.

So we use Log::Minimal. We could have brought in a more flexible OO-ish logger, but for our purposes we want to put logs all over the place without any effort. Log::Minimal provides a functional interface that can just be inserted anywhere, so this is good.

Also, we don't want to slow down production code just because of our logs. For that you should use constant folding:


use constant LDBLOGNG_DEBUG => $ENV{LDBLOGNG_DEBUG};
use Log::Minimal;

if (LDBLOGNG_DEBUG) {
    debugf("....");
}

This causes the entire if block to be compiled out when LDBLOGNG_DEBUG is false, saving you extra work in production that you otherwise don't need.

If you start getting a lot of logs, creating "log markers" may be useful:


use Scope::Guard;
use Log::Minimal;

sub marker_guard {
    my $section = shift;
    debugf("[START %s]", $section);
    return Scope::Guard->new(sub {
        debugf("[END %s]", $section);
    });
}

sub foo {
    my $marker = marker_guard("foo()");
    ....
}

This creates nice markers for the duration of the foo() subroutine, so that you can tell what's going on.

Okay, so now you have logs. We can finally figure out what's going on, so we can start hacking.

mod_perl: yak shaving

    Oh mod_perl,
        how do I hate thee?
              Let me count the ways

First thing that hit me was the fact that a lot of the code was written as raw Apache handlers. $r was everywhere.

$r is evil. Apache::Request->instance is even worse. $r has too much power, and so $r should go away.

The first thing I did was to refactor the use of $r into (duh) subroutines.

BEFORE:


sub handler : method {
    my ($class, $r) = @_;

    if (...) {
        $r->status(404);
        $r->send_http_header();
    } else {
        $r->status(200);
        $r->send_http_header();
        $r->print($data);
    }
    return Apache::OK;
}

AFTER:


sub handler : method {
    my ($class, $r) = @_;

    if (...) {
        $class->not_found($r);
    } else {
        $class->output_data($r, {
            status => 200, # default
            body   => $data,
        });
    }
    return Apache::OK;
}

Then I started fiddling with the class hierarchy of the Apache handlers. I moved all the switches to the parent class and made them class-based switches (via Class::Data::Inheritable), so it would be easier for me to convert these into objects later. The sketch below shows the general idea.
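Here's a minimal sketch of such inheritable, class-based switches (the package and switch names are made up for illustration, not the real ones from our codebase):

package Blog::Handler::Base;
use strict;
use warnings;
use base qw(Class::Data::Inheritable);

# Inheritable class data: every subclass sees these defaults...
__PACKAGE__->mk_classdata(use_cache    => 1);
__PACKAGE__->mk_classdata(template_dir => 'tmpl');

package Blog::Handler::Archive;
use base qw(Blog::Handler::Base);

# ...and can override them per handler, no instances required
__PACKAGE__->use_cache(0);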

Then I released the converted version. It contained the new logging mechanism, the new class hierarchy, etc.

Then I immediately got hit by a bug: all of our 404 pages (served via the ErrorDocument directive) turned white. Nothing.

Turns out mod_perl 1.3.x has an odd rule for enabling ErrorDocuments. You want to return a 404 and show an ErrorDocument: can you guess which combination works?

  • $r->status(404), return Apache::OK
  • $r->status(404), return Apache::NOT_FOUND
  • $r->status(200), return Apache::OK
  • $r->status(200), return Apache::NOT_FOUND
  • $r->status(404), $r->send_http_header, return Apache::OK
  • $r->status(404), $r->send_http_header, return Apache::NOT_FOUND
  • $r->status(200), $r->send_http_header, return Apache::OK
  • $r->status(200), $r->send_http_header, return Apache::NOT_FOUND

Answer: $r->status(200), return Apache::NOT_FOUND. Seriously, how are you supposed to guess this.... mod_perl must die.

Stuff like this kept popping up. My hatred of mod_perl grew every day, but nonetheless...

Phase 2: Migrate the Distributor to PSGI

So finally we had a cleaner set of Apache handlers. It was time to make 'em PSGI.

First, all the handlers were turned into Plain Old Perl Objects, and I swapped all the Apache::Request usage to Plack::Request. Then: fix, try, find an error, fix, try... repeat. A converted handler looked roughly like the sketch below.
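Here's a minimal sketch of the shape of a converted handler (the package and method names are made up for illustration):

package Blog::Handler::Archive;
use strict;
use warnings;
use Plack::Request;

sub handle_psgi {
    my ($self, $env) = @_;
    my $req = Plack::Request->new($env);

    # find_entry() is a hypothetical lookup keyed off the request path
    my $entry = $self->find_entry($req->path_info)
        or return [ 404, [ 'Content-Type' => 'text/html' ], [ 'Not Found' ] ];

    return [ 200, [ 'Content-Type' => 'text/html' ], [ $entry->render ] ];
}

1;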

Along the way, I wrote Geest, which is a port of Kage to Perl/AnyEvent. This allowed me to send requests to both the development app and the production app and check whether my implementation behaved differently from the production one. It turned out to be a great tool while I was still learning how the whole system worked. We don't use it anymore, but it's a cool tool to have during certain parts of the development cycle.

Once I started converting stuff to PSGI, I started using a bunch of really useful middlewares and other assorted tools.

And then THIS was the right time to upgrade perl. I upgraded from perl-5.8.8 to perl-5.16.3, making a 7-year leap. Meanwhile, if we were going to modernize, we might as well use cpanfile and Carton. So we did.

At the time, we used cpanfile >= 1.6, with explicit version requirements ("requires DBI => '== 1.628'"), and Carton 0.9. The prereq modules were in a list, so I started manually bringing them in, one by one: install one, run the software, see if it failed, repeat, repeat, repeat. The cpanfile looked something like the sketch below.
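For reference, a cpanfile with pinned versions looks something like this (the DBI line is the one quoted above; the other modules and version numbers are just illustrative):

# cpanfile
requires 'DBI',          '== 1.628';
requires 'DBD::mysql',   '== 4.022';  # illustrative
requires 'Log::Minimal', '== 0.15';   # illustrative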

There were also some modules that were not up on CPAN: in-house modules. I had the tarballs. I wanted to just install them via cpanm/cpanfile, but there was an install-order problem (e.g., we needed Class::Trigger 0.09 -- if I just put it in cpanfile, chances were that some other module depending on Class::Trigger would pull in the latest version instead). So I scripted this part so that all the correct versions were installed, along the lines of the sketch below.
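Something like this (the paths and the in-house module name are hypothetical): the pinned tarballs go in first, in a known order, so that normal dependency resolution can't override them:

#!/usr/bin/env perl
use strict;
use warnings;

# Install pinned tarballs first, in a known-good order
my @pinned = (
    'vendor/Class-Trigger-0.09.tar.gz',
    'vendor/Our-Inhouse-Module-1.23.tar.gz',
);
for my $tarball (@pinned) {
    system('cpanm', '--notest', $tarball) == 0
        or die "failed to install $tarball\n";
}

# Then let carton handle the rest of the cpanfile
system('carton', 'install') == 0
    or die "carton install failed\n";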

Then it hit me: it's not just the prereqs. My coworkers would have to do this whole setup on their boxes, too. Yikes! Nobody's going to adopt my software!

So I made it so that my PSGI app script did ALL the required initialization: creating directories, installing the right mysql binaries, even applying patches to the modules that needed them. All my co-workers needed to do was enable this PSGI app script, and it would do the rest. This way they couldn't complain. The general shape is sketched below.
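A minimal sketch of the bootstrap-at-startup idea (the directory names and the mysql setup helper are hypothetical; the real script did a lot more):

# top of app.psgi: bootstrap the environment before building the app
use strict;
use warnings;
use File::Path qw(make_path);

for my $dir (qw(logs cache tmp)) {
    make_path($dir) unless -d $dir;
}

unless (-x 'local/mysql/bin/mysqld') {
    system('./tools/setup-mysql.sh') == 0
        or die "mysql setup failed\n";
}

# ... then build and return the real PSGI app
my $app = sub { ... };
$app;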

Having done all that, it was time to DEPLOY!

.... Of course, all hell broke loose: performance was down by 30%. Not a good thing when you have billions of hits.

Our resources are recorded via CloudForecast, so first I tried tweaking the number of workers while checking the CloudForecast graphs. Well, turns out that wasn't a good idea: the load average went up, and we started getting many alerts. It was now time to closely review the source code.

My coworker (kazeburo) profiled our app using Devel::NYTProf. He actually found a lot of things there.

One thing he found was that, because I was using a lot of mount() calls, I was wasting precious resources. I changed all those to simple hash lookups, and we were now handling twice as many requests. See https://gist.github.com/kazeburo/5087266. The sketch below shows the flavor of the change.
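A minimal sketch of the idea (the app names are hypothetical): instead of stacking mount() calls Plack::App::URLMap-style, dispatch on the first path segment with a single hash lookup:

my %apps = (
    archives => \&archives_app,    # each value is a PSGI app coderef
    comments => \&comments_app,
);

my $app = sub {
    my $env = shift;
    my ($first) = $env->{PATH_INFO} =~ m{^/([^/]+)};
    my $handler = $apps{ $first // '' }
        or return [ 404, [ 'Content-Type' => 'text/plain' ], [ 'Not Found' ] ];
    return $handler->($env);
};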

Also, Plack < 1.0018 had a little problem: query_parameters() was a bit inefficient. It was calling uri() twice, and it wasn't even caching the result. We monkey-patched it and performance improved by 10~15%. It's fixed in Plack >= 1.0018.

While we were at it, we thought it might be better to micro-optimize this hot path (we were using a LOT of query parameters). So I started writing Text::QueryString, and then was quickly informed of URL::Encode::XS by its author. Turns out URL::Encode::XS was faster AND more correct, so we started using that. We again monkey-patched Plack::Request and squeezed out another 5 to 10%. The sketch below gives the flavor of that patch.
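Something in this spirit (a sketch, not the patch we actually deployed): bypass the URI object entirely and parse QUERY_STRING with the XS parser, caching the result in the same env slot Plack::Request uses:

use Hash::MultiValue;
use Plack::Request;
use URL::Encode::XS ();

no warnings 'redefine';
*Plack::Request::query_parameters = sub {
    my $self = shift;
    $self->env->{'plack.request.query'} ||= Hash::MultiValue->new(
        @{ URL::Encode::XS::url_params_flat($self->env->{QUERY_STRING} || '') }
    );
};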

So it turns out that, at least compared to Plack (even >= 1.0018), Apache was a pretty darn efficient guy. It was faster at HTTP request parsing and such. But memories always seem better than they actually were. It was time to say goodbye to mod_perl regardless...

Now we could finally deploy and be happy. Yes!
... Except then we tried hot-deploying a minor fix by sending HUP to Server::Starter.

If you are on a busy box and you replace generations of apps the way Server::Starter does, you need to realize that by default it will bring up the workers with the new generation code and then kill all of the old workers ALL AT ONCE. That means the load average goes up momentarily, and you might have brief service interruptions. This is clearly a FAIL.

This was actually clearly documented back in 2011 by Server::Starter's author, including a fix: http://blog.kazuhooku.com/2011/04/web-serverstarter-parallelprefork.html

BEFORE:


carton exec -- \
    start_server \
        --port 5000 -- \
    plackup \
        -s Starlet \
        --max-workers=80 \
        --max-reqs-per-child=500 \
        --min-reqs-per-child=350

AFTER:


carton exec -- \
    start_server \
        --port 5000 --signal-on-hup=USR1 -- \
    plackup \
        -s Starlet \
        --max-workers=80 \
        --max-reqs-per-child=500 \
        --min-reqs-per-child=350 \
        --spawn-interval=0.5

By setting --signal-on-hup on start_server and using --spawn-interval with Starlet, you can achieve "slow restarts". Basically, a new-generation worker comes up, but the next one only comes up after the number of seconds specified in --spawn-interval. The same goes for killing the old workers: one old worker is killed, and only after the specified interval is the next one killed.

So you get to see the processes gradually being replaced. Watching this via "watch 'ps -ef | grep plackup'" is kind of fun.

And now we can finally hot-deploy new code properly... (actually there's a bit more to do, but we'll touch on that later)

Phase 3: Migrating Sledge

Now for the other, Sledge-based components. Sledge is, as I previously described, a really old web application framework. I personally hadn't touched it since 2005.

To port these to PSGI (Sledge is natively mod_perl based), we needed a Sledge adapter. There is a Sledge::PSGI module, but I heard through the grapevine that it's not the most efficient adapter (apparently copying the entire contents of the PSGI $env to %ENV for every request adds up). We needed to squeeze out every last bit for this application, so I built a custom Sledge PSGI handler.

As I said, Sledge is natively mod_perl based: it expects Apache::Request everywhere. And the way Apache::Request was being used here was far more extensive than in Phase 2. I didn't want to fix all the Apache::Request / Plack::Request incompatibilities by hand.

Answer: create a facade that is Plack::Request based, but acts like Apache::Request. I saw some modules like it on CPAN, but they weren't exactly what I was looking for... so I wrote my own FakeApacheReq object. The good part was that because I had complete control, I was able to embed warnings for when certain wicked mod_perl-isms were being used. One notable gripe of mine was the upload() method and its weird interaction with param(). See the slides for this. A minimal sketch of the facade follows.
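Here's a minimal sketch of the facade idea (only a couple of methods shown; the method bodies and the warning are illustrative, not our actual implementation):

package FakeApacheReq;
use strict;
use warnings;
use Carp ();
use Plack::Request;

sub new {
    my ($class, $env) = @_;
    return bless { req => Plack::Request->new($env) }, $class;
}

# Mimic Apache::Request->param(): parameter names without arguments,
# value(s) for a given name
sub param {
    my ($self, $name) = @_;
    my $params = $self->{req}->parameters; # a Hash::MultiValue
    return keys %$params if ! defined $name;
    return $params->get_all($name);
}

# Mimic $r->header_in()
sub header_in {
    my ($self, $name) = @_;
    return scalar $self->{req}->header($name);
}

# Complain loudly when a wicked mod_perl-ism is used
sub upload {
    Carp::carp("mod_perl-ism detected: upload() called");
    ...;
}

1;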

Another problem I encountered with the Sledge-based code was the liberal use of %ENV.


sub post_dispatch_add {
  ...
  # Expects %ENV!!!!
  my %cookies = CGI::Cookie->fetch();
  ...
}

Oh god, no. %ENV is a GLOBAL! WHY ARE YOU.... Ugh, never mind. How am I going to escape this one... In the end I decided to track down all the occurrences of this pattern and replace them with an equivalent of:


    local $ENV{ .. } = $r->header(...); 

I wasn't going to populate %ENV wholesale, because for one, that's exactly what we wanted to avoid when we decided not to use Sledge::PSGI, and for two, I was not going to allow my co-workers to EVER use this global again in their code. %ENV usage like this had to go.

As far as Sledge was concerned, this was about it. It turns out that when you have a well-defined API between your environment (mod_perl/PSGI) and your app, migration is much easier.

Phase 4. Going Further

So then most of the code was finally ported to PSGI. It was now time to make some fine adjustments so that things were even easier.

One problem I found was that when you want to fine-tune the number of workers, max-reqs-per-child, etc. on Starlet, you need to send a TERM signal to the start_server program. This is because start_server takes the arguments it was given the first time it was fired up and keeps reusing them to spawn new generations.

Therefore, in the code below, everything after --signal-on-hup... is reused every time you try to hot-deploy the code:


carton exec -- \
    start_server \
        --port 5000 --signal-on-hup=USR1 -- \
    plackup \
        -s Starlet \
        --max-workers=80 \
        --max-reqs-per-child=500 \
        --min-reqs-per-child=350 \
        --spawn-interval=0.5

If you wanted to change --max-workers to 50, you needed to terminate start_server. This is clearly not a good thing, and it defeats the purpose of having hot-deployable code.

The solution is to give start_server a script, instead of the raw command line:


carton exec -- \
    start_server \
        --port 5000 --signal-on-hup=USR1 -- \
    bin/jp2_app

bin/jp2_app is a wrapper that loads the Starlet server via Plack::Loader, loads the PSGI app, and sets all the required parameters. start_server only sees bin/jp2_app, so if you change its contents to pass different parameters to Starlet, the new settings are honored. A sketch of such a wrapper follows.
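Here's a minimal sketch of what such a wrapper can look like (the file layout and parameter values are illustrative; when run under start_server, Starlet picks up the listening socket from the Server::Starter environment):

#!/usr/bin/env perl
# bin/jp2_app
use strict;
use warnings;
use Plack::Loader;
use Plack::Util;

my $app = Plack::Util::load_psgi('etc/app.psgi');

# Change these numbers, send HUP, and the next generation picks them up
my $server = Plack::Loader->load('Starlet',
    max_workers        => 80,
    max_reqs_per_child => 500,
    min_reqs_per_child => 350,
    spawn_interval     => 0.5,
);
$server->run($app);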

The next minor problem/annoyance was the fact that we had too many plack processes. Apps A, B, and C might be running on the same box, and since stuff like "ps | grep plackup" gave me all the apps' processes, it was really annoying to look for the one process we actually wanted.

To work around this, we started to liberally change $0.

For example, this is used to describe the Starlet workers:


$0 = sprintf "%s (%s %s=[%s])",
    $0,
    $apppath,  # etc/app.psgi
    $handler_name, # Starlet
    join ", ",
        map { "$_=$handler_args{$_}" }
        keys %handler_args;

# run_psgi (etc/app.psgi Starlet=[max_workers=80 ... ])

Grepping this is much easier, and the resulting string was more descriptive.

Also, during each request, we'd do something like the following to show which handler was currently processing which request:


sub handle_psgi {
    my ($self, $env) = @_;
    ...
    local $0 = sprintf "%s [%s (%s)] %s",
        Scalar::Util::blessed($self),
        $env->{REQUEST_URI},
        $env->{HTTP_HOST},
        scalar localtime
    ;
}
# HandlerClass [/foo/bar/baz (blog.livedoor.jp)] Tue Sep 17 10:42:03 2013

Next steps

Fighting legacy code definitely doesn't stop here. There's more stuff that I want to do, like more fine-grained clustering, more automated tests, fixing up the job queue mechanism, normalizing databases (imagine a 10-year-old database... it's hell!), upgrading (!) our mysql4 servers to mysql5, using DTrace instead of debug logs...

There are plenty of things to do. Maybe next time I get to talk about all this.


5 Comments

This is awesome! I especially like your guard based log blocks and the $0 = ... idea. Thanks!

Very nice article!

> If you wanted to change --max-workers to 50, you needed to terminate start_server. This is clearly not a good thing, and it defeats the purpose of having a hot-deployable code.

When I ran into this problem, I added a level of indirection by putting settings like --max-workers into a config file:

before:

start_server -- starman --max-workers=50 ...

after:

start_server -- start-my-app

$ cat start-my-app
use strict; use warnings;
use Config::Any;

my $files = Config::Any->load_files({
    use_ext         => 1,
    files           => [ 'my-app-config.yml' ],
    flatten_to_hash => 1,
});
my $config = $files->{'my-app-config.yml'};

exec(
    'starman',
    '--preload-app',
    '--workers' => $config->{workers} // 50,
    # other configs...
    'my-app.psgi',
);

That way you can dynamically change your starman parameters and they will get
used the next time you send a HUP to start_server.

If my URL::Encode::XS makes a difference, you should try Unicode::UTF8 (also my module); it makes a huge difference in my deployments.
--
chansen

