The Problems With CGI

The Common Gateway Interface was revolutionary. It gave us, for the first time, an extremely simple way to provide dynamic content via HTTP. It was one of a combination of technologies that led to the explosive growth of the Web. For anyone writing an application that runs once per HTTP request, there is no other practical option. And for such applications, CGI is almost always adequate.

But modern web applications typically run in persistent environments. For anything with more than a small trickle of traffic, we don't want the overhead of launching a new process for every hit. Even in low-traffic environments, the startup costs of modern Perl frameworks like Moose and DBIx::Class can make non-persistent applications prohibitively slow.

We have things like mod_perl and FastCGI for easily creating persistent applications. But these applications are generally built upon emulating aspects of the stateless, non-persistent CGI protocol within a persistent environment. Even pure mod_perl applications typically receive much of their input via environment variables specified in the CGI standard, often by instantiating CGI.pm or one of its clones.
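For instance, a typical registry script runs persistently under mod_perl yet is still built entirely on CGI-style input. A minimal sketch of the common pattern (the script itself is hypothetical):

#!/usr/bin/perl
# Runs under Apache::Registry / ModPerl::Registry, but still takes
# all of its input from the emulated CGI environment.
use strict;
use warnings;
use CGI;

my $q    = CGI->new;                      # parses QUERY_STRING, STDIN, etc.
my $name = $q->param('name') // 'world';  # straight from the CGI environment
print $q->header('text/plain'), "Hello, $name!\n";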

This model is fundamentally broken. Read on for my list of reasons why CGI should not be used in persistent applications.

First, let me be clear that I'm talking here about the CGI protocol (or interface, or whatever you want to call it) and not the CGI.pm Perl module, which you also shouldn't use in persistent applications, but for unrelated reasons.

Environment Variables

CGI uses environment variables for all input not contained within the request body. That includes incoming headers, the HTTP request method, the HTTP version number, the path information, cookies, and other miscellany. The result is a large mish-mash of request data and server configuration. Almost anyone who has ever done any web development has implemented their own version of:

print "$_: $ENV{$_}\n" for sort keys %ENV;

and had to use it to debug some issue.

Environment variables can make sense when you launch a single subprocess to handle a request, as in the CGI model. But in a persistent application, a single process handles many requests, each of which likely has different input parameters. The persistent framework must take care to properly clean up the environment between requests, lest some important data leak between independent sessions. HTTP is a stateless protocol; the use of environment variables adds unnecessary state to the system. In a vanilla CGI script, this doesn't matter. In a persistent application, leaving an environment variable sitting around can be disastrous.
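Here's a minimal sketch of the kind of scrubbing a persistent framework has to do between requests (the variable list and handler are illustrative, not taken from any particular framework):

# Per-request cleanup in a hypothetical persistent handler.
my @cgi_vars = qw(
    REQUEST_METHOD QUERY_STRING PATH_INFO CONTENT_TYPE
    CONTENT_LENGTH HTTP_COOKIE REMOTE_USER
);

sub handle_request {
    my ($request) = @_;

    # Remove state left over from the previous request; a stale
    # HTTP_COOKIE or REMOTE_USER here could leak one user's session
    # into another's.
    delete @ENV{@cgi_vars};

    # ... repopulate %ENV from $request and dispatch as usual ...
}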

Environment variables cannot contain highly structured data, objects, filehandles, or other useful items. In some environments, their length is severely limited. These are typical limitations for interprocess communication, such as between an HTTP server and a CGI script. But a persistent application can run in the same process as the HTTP server. With the CGI model, we are still stuck with this extremely limited communication channel for providing input data to the application.
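Everything has to round-trip through flat strings. Structured cookie data, for example, arrives as a single encoded scalar that the application must re-parse (a small illustrative sketch):

use strict;
use warnings;

# All the server can hand us is one flat string, e.g.:
local $ENV{HTTP_COOKIE} = 'session=abc123; theme=dark';

# Any structure must be rebuilt by parsing on the receiving end.
my %cookies = map { split /=/, $_, 2 } split /;\s*/, $ENV{HTTP_COOKIE};
print "$_ => $cookies{$_}\n" for sort keys %cookies;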

Solutions to this problem often involve things like session databases, keyed off some value in the input, such as a cookie or the REMOTE_USER variable. It's an unnecessary overcomplication.

Finally, the behavior of some environment variables can be unpredictable. The SCRIPT_NAME and PATH_INFO variables, for example, both hold different portions of the request URI. But these variables cannot always be trusted. For example, Apache sets PATH_INFO differently for mod_perl handlers mounted at the root than for those mounted at an explicit path. Consider the following <Location> directive:

<Location />
    SetHandler perl-script
    PerlHandler My::Application
</Location>

If a request is issued to this server for the URI /foo/bar/baz, Apache sets SCRIPT_NAME to /foo and PATH_INFO to /bar/baz. If we change the <Location> block to live under /myapp instead, then a request to /myapp/foo/bar/baz results in a PATH_INFO of /foo/bar/baz. Very old versions of Apache, circa 1.1, also handled these variables slightly differently than Apache 1.3 and Apache 2 do.

This inconsistent behavior makes relying on PATH_INFO for application dispatch extremely dubious. The only reliable method is to look at the complete URI, but this information is not available in the standard CGI environment variables. (It is easy to get from mod_perl, though, via $r->uri.) A mod_perl or other persistent application that relies on PATH_INFO and/or SCRIPT_NAME can suffer serious and confusing bugs purely because of server configuration.
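For example, a handler can dispatch on the full URI instead of trusting PATH_INFO. A minimal sketch, assuming mod_perl 2 (the dispatch table and do_baz are hypothetical):

package My::Application;
use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO  ();
use Apache2::Const -compile => qw(OK NOT_FOUND);

# Hypothetical dispatch table keyed on the full request path.
my %dispatch = (
    '/foo/bar/baz' => \&do_baz,
);

sub handler {
    my $r = shift;

    # $r->uri returns the complete request path, unaffected by how
    # the <Location> block splits SCRIPT_NAME and PATH_INFO.
    my $code = $dispatch{ $r->uri }
        or return Apache2::Const::NOT_FOUND;

    return $code->($r);
}

sub do_baz {
    my $r = shift;
    $r->content_type('text/plain');
    $r->print("Hello from do_baz\n");
    return Apache2::Const::OK;
}

1;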

stdin and stdout

CGI uses standard input and output for communication between the HTTP server and the CGI program. The request body, if there is one, is sent to standard input, and the response, including headers, is sent to the program's standard output.

This fits well within the Unix philosophy of using small programs as stream filters. But it's a bad model for persistent applications, which typically serve their content via a more complex interface. Various persistent environments like mod_perl and FastCGI have to provide ugly hacks to make STDOUT go to the right place.
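To get a feel for what those hacks look like, here is a minimal sketch of the tied-filehandle trick (the class name is made up; real frameworks do considerably more):

use strict;
use warnings;

# A tiny tied-filehandle class that appends print()ed data to a
# buffer instead of writing to the real standard output.
package CapturedOutput;

sub TIEHANDLE { my ($class, $bufref) = @_; bless { buf => $bufref }, $class }
sub PRINT     { my $self = shift; ${ $self->{buf} } .= join '', @_ }
sub PRINTF    { my $self = shift; $self->PRINT(sprintf shift, @_) }

package main;

my $body = '';
tie *STDOUT, 'CapturedOutput', \$body;
print "Content-Type: text/plain\n\nHello!\n";   # lands in $body, not the terminal
untie *STDOUT;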

In addition, using standard input and output makes code reuse more difficult. A common case has us reusing a CGI script outside of an HTTP environment, for example to create an emailed version of a generated page, or to generate a static version of a dynamic site. The result is that the client program must construct an artificial CGI environment in which to run the reused code, marshalling its input data into environment variables and an input stream, and collecting the output from a subprocess via pipes.
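In practice that looks something like this (page.cgi is a hypothetical script being reused):

use strict;
use warnings;

# Fake up a CGI environment and capture the script's output through
# a pipe, just to reuse its page-generation logic.
local $ENV{REQUEST_METHOD} = 'GET';
local $ENV{QUERY_STRING}   = 'page=about';
local $ENV{SCRIPT_NAME}    = '/cgi-bin/page.cgi';

open my $fh, '-|', './page.cgi'
    or die "Can't run page.cgi: $!";
my $page = do { local $/; <$fh> };   # slurp headers and body as one string
close $fh;

# $page must now be parsed apart before we can mail it or save it.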

Wouldn't it be nicer if we could load a library and call a function instead? To be fair, lots of good programmers write their web applications in this manner, with the actual executable script or handler being only a stub that instantiates and runs a module. Even so, the module just as often returns, or (worse) prints to STDOUT, a single string of unstructured data.
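The friendlier interface might look something like this (My::App and its response format are invented for illustration):

use strict;
use warnings;

# Hypothetical application module: returns a structured response
# instead of printing serialized output to STDOUT.
package My::App;

sub handle_request {
    my ($class, %req) = @_;
    return {
        headers => { 'Content-Type' => 'text/plain' },
        body    => "You requested $req{path}\n",
    };
}

package main;

my $response = My::App->handle_request(method => 'GET', path => '/foo');

# The caller decides what to do with the pieces: serve them over
# HTTP, mail them, or write them to a static file.
print "$_: $response->{headers}{$_}\n" for sort keys %{ $response->{headers} };
print "\n$response->{body}";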

This tight coupling between a web application's functionality and its necessarily serialized input and output is the primary remaining obstacle that keeps us from achieving the goals of true MVC design in our applications.

There is also a striking asymmetry. In CGI, the incoming HTTP headers are munged in various ways and placed into environment variables, with some useful data notably omitted. The application, however, is expected to construct a complete response, with headers and a body, on standard output. HTTP specifies that requests and responses are similarly constructed and structured messages, yet CGI handles them completely differently.

Conclusions

When building persistent web applications that are not also targeted at vanilla CGI, you should avoid emulating the CGI model. mod_perl or a self-contained HTTP server running behind a reverse proxy can provide you with all the necessary information in a more reliable and consistent manner. FastCGI, while a great and useful protocol, suffers from all the weaknesses listed above due to its goal of compatibility with legacy CGI scripts.

4 Comments

For a modernised interoperable foundation for Perl web apps, see PSGI and the Plack reference implementation at http://plackperl.org/
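For a flavor of the interface, a minimal PSGI application is just a code reference that takes the environment as a plain hashref and returns a structured response, addressing both complaints above:

# app.psgi -- run with: plackup app.psgi
my $app = sub {
    my ($env) = @_;
    return [
        200,                                     # status
        [ 'Content-Type' => 'text/plain' ],      # headers as an array ref
        [ "You requested $env->{PATH_INFO}\n" ], # body
    ];
};
$app;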

Hello Mike,

I take issue with some aspects of your premise:

modern web applications typically run in persistent environments.

They do? Based on what metrics? I don't consider PHP applications to be persistent, and they are quite popular.


For anything with more than a small trickle of traffic, we don't want the overhead of launching a new process for every hit.

It's true that there is a point at which this happens, but the characterization of a "small trickle" is misleading. As I benchmarked in my post on the benefits of vanilla CGI, being able to support one request per second should not be particularly hard with an appropriate framework. That equates to well over 50,000 requests per day, hardly what I'd call a small trickle!


Even in low-traffic environments, the startup costs involved with using modern Perl frameworks like Moose and DBIx::Class can make non-persistent applications prohibitive.

I agree, but slow-loading frameworks aren't the only game in town. As I benchmarked, Mojo is an example of a Perl framework I would consider modern yet lightweight.

It's not at all clear to me that there is a trend towards using heavier and Moose-driven frameworks for Perl web application development.

Sri first developed Catalyst, and then moved on to write the lighter-weight Mojo framework.

Matt Trout became the primary Catalyst maintainer, and while Catalyst added Moose as a dependency, Matt has most recently released Web::Simple, a small, lightweight framework.

We continue to see activity around the lightweight CGI::Application solution, such as CGI::Application::Structured. Dancer is another recent lightweight Perl framework, based on Ruby's Sinatra and using CGI.pm.

My takeaway point here is that lightweight solutions are not only simpler by definition; they can run in more environments and, if they are well designed, can scale up for more complex demands. Heavyweight solutions are not only more complex by definition; they nearly always lack the option to "scale down" if you don't need all the weight.


About Mike Friedman

Mike Friedman is a professional computer programmer living and working in the New York City area.