The surprisingly hard task of writing logs
At my $job, we often use files and logs as a cheap way of managing queues of data.
We then read them using the Log::Unrotate module, but that's a topic for another post.
Anyway, it's important that items written to a log are never lost and never get corrupted.
So here is the task: write serialized data to a file from multiple processes.
Sounds easy, right?
Well, the most obvious problem is that print() uses buffered output by default, and lines in the log will get mixed up if you don't turn it off.
OK, let's call $fh->autoflush(1), let's take a lock on the file, let's even use syswrite() because... why not?
But even then, POSIX doesn't promise that write(2) is atomic. And if you call syswrite(), you have to check its return value ("the number of bytes actually written", quoting perldoc) and compare it to the number of bytes you attempted to write.
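For concreteness, here is a minimal sketch of that careful, locked append (the file name and record are placeholders, not our real code):

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    sub append_record {
        my ($path, $line) = @_;

        open my $fh, '>>', $path or die "open $path: $!";
        flock($fh, LOCK_EX)      or die "flock $path: $!";
        # (autoflush only matters for print(); syswrite() bypasses the buffer anyway)

        # syswrite() may write fewer bytes than requested, so loop until
        # the whole record is out.
        my $offset = 0;
        while ($offset < length $line) {
            my $written = syswrite($fh, $line, length($line) - $offset, $offset);
            die "syswrite $path: $!" unless defined $written;
            # If we get killed somewhere in this loop, a partial line is
            # left behind -- which is exactly the problem described below.
            $offset += $written;
        }

        close $fh or die "close $path: $!";    # also releases the lock
    }

    append_record('/var/log/my-queue.log', "some serialized item\n");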
And while you're trying to write the remaining bytes, maybe you won't do it in time and someone will kill your process...
And then another process will run and carelessly append to the broken file:
> line1
> line2
> linline3
> line4
And then you're screwed.
Especially if you're storing Storable strings in these files as we do, because thaw($str) can cause segfaults, out-of-memory conditions and other bizarre errors on invalid lines.
Really, am I missing something obvious here?
Why is it so hard?
The only solution we came up with is to seek() to the last byte of the file before every write and, if it isn't "\n", seek back to the beginning of the broken line.
But it looks ugly and probably slow as well.
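One way the check could look (just a sketch with made-up names, assuming the handle is opened read-write and already locked, and that the broken tail is simply truncated away):

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    # Cut the file back to the end of the last complete line if it does not
    # end with "\n". Assumes $fh was opened with '+<' and is flock()ed.
    sub repair_tail {
        my ($fh) = @_;

        my $size = -s $fh;
        return if $size == 0;

        sysseek($fh, $size - 1, SEEK_SET) or die "sysseek: $!";
        sysread($fh, my $last, 1) == 1    or die "sysread: $!";
        return if $last eq "\n";          # file ends with a complete line

        # Walk backwards until we find the previous "\n" (or the start).
        my $pos = $size - 1;
        while ($pos > 0) {
            sysseek($fh, $pos - 1, SEEK_SET) or die "sysseek: $!";
            sysread($fh, my $byte, 1) == 1   or die "sysread: $!";
            last if $byte eq "\n";
            $pos--;
        }
        truncate($fh, $pos) or die "truncate: $!";    # drop the broken tail
    }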
Multiple processes writing to a single file without any kind of locking or a single-threaded writer is a recipe for disaster.
It's dangerously easy to believe that write() or print() is atomic. As you are discovering, that is not really the case.
Consider using IPC::DirQueue, which will let you create a queue of data to be processed. It retains ordering, and uses a directory modification protocol that ensures things don't get clobbered.
If you want to continue to pursue the log file solution, you may want to consider writing your own logging agent. Have multiple processes connect to a local Unix domain socket that your logging agent makes available, and make the agent responsible for reading a complete record and writing it to the log file before consuming another one.
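A toy sketch of such an agent (assuming newline-terminated records; the socket and log paths are made up, and it handles one client at a time where a real agent would multiplex):

    use strict;
    use warnings;
    use Socket qw(SOCK_STREAM);
    use IO::Socket::UNIX;

    my $sock_path = '/tmp/logger.sock';
    unlink $sock_path;

    my $server = IO::Socket::UNIX->new(
        Type   => SOCK_STREAM,
        Local  => $sock_path,
        Listen => 5,
    ) or die "listen on $sock_path: $!";

    open my $log, '>>', '/var/log/my-queue.log' or die "open log: $!";
    $log->autoflush(1);

    # This process is the only writer, so records never interleave, and a
    # client that dies mid-send can at worst lose its own record.
    while (my $client = $server->accept) {
        while (my $record = <$client>) {
            print {$log} $record if $record =~ /\n\z/;    # complete records only
        }
    }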
I agree with Jason. At one job we did just what he describes: we 'tee'd (a rough description of the logger process) logs to a single server, which made it cheap to scrub log files for process status across a dozen servers. One listener buffers all the log writes (like forwarding syslog), and then we processed a single log file for status updates. None of the logger processes ever saw bottlenecks; the listener/writer happily dealt with a fairly heavy load of incoming data. Its writes were not time-critical and the buffering never harmed any process. You can even loop back on 127.0.0.1 to the listener, so it's all good.
What you're describing here is the *exact* problem that led to the creation of database servers. I'd suggest you just throw an instance of Postgres on a dedicated logging server and have everything log there for the log items that need to be combined.
Concurrent writes from multiple processes into one file are not a problem that can be solved sanely on Unix-like systems.
It looks like IPC::DirQueue creates a separate file (or two) for every new item, so it doesn't scale to the multi-gigabyte volumes that logs reach.
Actually, we have our own local queue implementation (I hope to upload it to CPAN one day), which writes data in chunks, reads it in chunks and scales pretty well. But this queue module is based on our file-writing module, which has exactly the same race condition issues.
Also, logs are better than these queues in that you can easily grep them when necessary.
There are three things that bother me about the logging daemon solution:
1) it'll make unit testing harder or less reliable; programs won't be self-sufficient anymore;
2) it probably won't handle multiple users;
3) the important benefit of logs for me is that they are local, light-weight and low-level; on the other hand, considering my own post, maybe that simplicity is a delusion...
Hi!
I think the reason database servers exist is fast index lookups and structured data rather than locking/consistency issues :)
But anyway, a dedicated server is definitely not a solution in many cases. I'm almost ready to give up simplicity in exchange for consistency and use a dedicated *local* logging daemon, but a separate server is too prone to network problems.
OK, maybe in some cases storing queues remotely or in a distributed fashion would be nice, but for those cases I'd prefer to use ActiveMQ/RabbitMQ/etc. It's not an easy task to make any of them fault-tolerant and HA, though.
(A local queue is not fault-tolerant either, of course, but if I'm going to move data to a remote host, I'd like to get some advantages in exchange for all the future network issues and scaling problems which will definitely happen.)
Why don't you use syslog for concurrent logging?
Because we've got hundreds of scripts and hundreds of logs, and with syslog we'd have to modify its config every time we want to write to a new log. Double work.
I'd prefer a solution which is a library and not a separate daemon, but apparently it is not that easy.
Sorry, I think I wasn't clear here:
The issue is concurrency.
On Unix it is nearly impossible, or at the VERY least extremely hard, to do concurrent writes to a file without destroying it. Databases exist to solve that problem. Lookups are a nice side effect, and honestly, the use case you describe seems to point towards the read capabilities of databases also being something you want.
Since it seems like this is all your code, can't you just use the 'advisory' locking flock() utilizes?
The warning that it's merely advisory is moot here: if you control all the programs writing to this log file, you can modify them to honor that lock.
This lets you avoid a dependency on an additional daemon, but may introduce some contention between your 100s of processes.
Using DBD::SQLite also lets you avoid external dependencies, while letting you store structured data (rather than lines of a Storable image). The SQLite library will handle the locking for you, so your concurrent processes can all write at once.
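For what it's worth, the concurrent-writer part with DBD::SQLite is only a few lines; something like this sketch (the database path and table are made up):

    use strict;
    use warnings;
    use DBI;

    # Every worker process can run this independently; SQLite serializes
    # the writers with its own file locking.
    my $dbh = DBI->connect('dbi:SQLite:dbname=/var/spool/my-queue.sqlite',
                           '', '', { RaiseError => 1, AutoCommit => 1 });
    $dbh->sqlite_busy_timeout(5_000);   # wait instead of failing when the db is locked

    $dbh->do('CREATE TABLE IF NOT EXISTS queue (
                  id   INTEGER PRIMARY KEY AUTOINCREMENT,
                  item BLOB
              )');

    my $item = 'frozen Storable / JSON / plain text payload';
    $dbh->do('INSERT INTO queue (item) VALUES (?)', undef, $item);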
Are they writing to this log file at high frequency? Perhaps it'd be easier to let each process have its own independent log file, and move the complexity to the log consumer, which would reassemble things into order?
The issue is not exactly concurrency.
It's about fixing incomplete lines, which, yes, becomes too much overhead in a concurrent environment.
You could have the same problem even if you write the log from one process. But with a single process, you can check the last line of the log right after open() and seek() to the end of the last complete line.
With multiple processes, you don't know when some other process will write an invalid line to the log and die, even if you use flock() (which we do).
In the bigger picture, the issue is that there are no transactions in the POSIX filesystem API.
See my answer above to Mithaldu about flock() and other stuff. (Short answer: yes, we already do this, as I said in the original post.)
DBD::SQLite is an option for implementing the local queues without a daemon, right.
But again, it's a solution to a different problem. I just wanted a safe way to write text logs, the kind I can grep, wc -l, etc.
Independent log files are a good idea, thanks. Though that too changes the file format, it's a nice alternative.
I seem to have glossed over the part you wrote about locking, both times :)
You're right, multiple processes opening a file have no POSIX primitives to modify it atomically.
However, the VFS layer guarantees that rename() is atomic. This is how things like maildir and IPC::DirQueue do their magic. Unfortunately, this is not useful for you here.
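(For completeness, the write-then-rename trick looks something like this sketch; the paths and payload are made up:)

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $payload = "one complete record\n";

    # Write the full item to a temporary file in the same directory...
    my ($tmp_fh, $tmp_name) = tempfile('item-XXXXXX', DIR => '/var/spool/queue');
    print {$tmp_fh} $payload;
    close $tmp_fh or die "close $tmp_name: $!";

    # ...then publish it with an atomic rename(): readers see either
    # nothing or the complete file, never a half-written one.
    rename $tmp_name, '/var/spool/queue/item-ready'
        or die "rename: $!";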
You could also solve this the easy way, by not caring about the order in which entries are written to the log file. You could even leave autoflush off and keep buffering for better performance.
All you'd have to do is include a hi-res timestamp with each entry, and then when you wish to read through the log, just sort it by timestamp.
To avoid massive sort times, you could rotate the log file often to keep it small, and sort it at rotation time.
And since you said you wanted a library instead of a daemon: just write a simple, small one that prepends an accurate timestamp to each entry in the log file, and add a hook for logrotate to do the sorting when the file is rotated.
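A sketch of that idea (the tab-separated format and the sort command are my assumptions):

    use strict;
    use warnings;
    use Time::HiRes ();

    # Every writer prefixes its entry with a high-resolution timestamp...
    sub log_line {
        my ($fh, $line) = @_;
        printf {$fh} "%.6f\t%s\n", Time::HiRes::time(), $line;
    }

    # ...and a logrotate postrotate hook (or the reader) restores the order:
    #     sort -n rotated.log -o rotated.log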
I don't care about line order.
I care about corrupted (incomplete) lines in log.
Please re-read the original post...
Your problem is really specific: you're using log files to share data that must not get corrupted or incomplete. That is not what log files are usually used for, which is why my last post was so far off. I just assumed you were writing logging statements to the log file, not using it as a database to share data among many processes.
I don't know how to solve your IPC/data-sharing problem. But you should reconsider using log files and consider using a database instead, either the regular server kind like MySQL or the "just a library" kind like SQLite. And if you're going to use a database, maybe an object-relational mapper like DBIx::Class is a better fit than Storable.
In a previous comment you say, "In the bigger picture, the issue is that there are no transactions in the POSIX filesystem API." When I think of transactions, I think of either bank ATMs or databases. So if you need transactions, consider a real database or SQLite.
Also, in place of grep, wc -l and other commands, if you use SQLite you can run whatever queries you want with the sqlite3 command-line tool, without bothering with DBD::SQLite. A database would fix the transaction problem, but then you would have to maintain and test the database. I don't know; I do know that when I think of log files I think of logging helpful debugging information, not sharing data, and when I think of transactions I think of databases.
A simple way to solve your problem without locks: reserve space for each line first, then fill it in.
To write a line $line, append a placeholder of zeroed bytes the same length as $line to the end of the file, then seek back and overwrite the placeholder with the real data (see the sketch below).
And on the reader, just ignore zeros between the lines.
[And don't forget to add error-checking code, or to use autodie!]
oops, in my previous comment code, a s/seek/sysseek/g is required!
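A rough sketch of this approach (with the s/seek/sysseek/ correction already applied; the file name, the NUL-byte placeholder and the O_APPEND-based reservation are assumptions on my part):

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_APPEND O_CREAT SEEK_CUR SEEK_SET);

    my $path = '/var/log/my-queue.log';

    # One handle in append mode for the reservation writes...
    sysopen(my $afh, $path, O_WRONLY | O_APPEND | O_CREAT, 0644)
        or die "sysopen $path: $!";
    # ...and one read-write handle for filling the reserved bytes in later.
    open(my $wfh, '+<', $path) or die "open $path: $!";

    sub write_line {
        my ($line) = @_;    # $line must already end with "\n"

        # Step 1: append a placeholder of the same length as the record.
        # O_APPEND makes concurrent reservations land one after another
        # without any locking.
        my $placeholder = "\0" x length($line);
        syswrite($afh, $placeholder) == length($line) or die "reserve: $!";

        # The append handle now sits just past our reservation.
        my $end   = sysseek($afh, 0, SEEK_CUR) or die "sysseek: $!";
        my $start = $end - length($line);

        # Step 2: overwrite the reserved bytes with the real record.
        sysseek($wfh, $start, SEEK_SET) or die "sysseek: $!";
        syswrite($wfh, $line) == length($line) or die "fill: $!";
    }

    write_line("line1\n");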
Thank you for sticking with the original problem and not trying to replace it with a different one :)
Two small bugs in your code: first, you want to open $wfh with '+<' … Otherwise, it works, and I really like this idea.
But I see one important issue: as I understand it, it is possible to have "aaa\n0000ccc\n" at one moment and "aaa\nbbb\nccc\n" at the next one. But the reader wouldn't be able to decide whether it should wait for the second line to be filled.
Are log lines fixed length or variable length? You mentioned in one post that you're writing 'Storable' images?
There is also a nice thing called a network monitoring system: if you stream the combined log to the NMS, it can work out who to email, what ticket to create, and how to escalate. It has its own DB, so no need to roll your own.
Then replace the zeros by "\n" when the reserved space is not big enough:
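(My guess at what this looks like, reusing $wfh and write_line() from the sketch above; the fixed-size-reservation interpretation is an assumption:)

    # If a previously reserved block of $reserved bytes at offset $start
    # turns out to be too small for $line, blank it out with "\n" (which
    # the reader already skips) and make a new, big-enough reservation.
    sub fill_or_retry {
        my ($start, $reserved, $line) = @_;
        if (length($line) > $reserved) {
            sysseek($wfh, $start, SEEK_SET)  or die "sysseek: $!";
            syswrite($wfh, "\n" x $reserved) or die "blank: $!";
            return write_line($line);
        }
        sysseek($wfh, $start, SEEK_SET) or die "sysseek: $!";
        syswrite($wfh, $line)           or die "fill: $!";
    }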
Line length is usually variable.
It's not always Storable; sometimes it's plain text and sometimes JSON (each format has its advantages and disadvantages, so we use all three depending on the task).
I'm not sure how a *monitoring* system can help me with the task of implementing generic lightweight queues (log-based or not).
See also http://www.reddit.com/r/perl/comments/h4sm5/why_you_want_your_logging_system_to_not_rely_on/c1smyxd for an explanation of why I'm trying to avoid DB-based solutions for this task.
Hmm, I think you missed my point, or I didn't understand yours.
Your solution writes lines to the log without strict order.
You reserve a portion of bytes in the file first, and then fill it in a non-atomic way.
But a reader reads the file sequentially, and if it skips a line of zeroed bytes (or "\n"s, it doesn't matter), it will never read the data you write over those zeroed bytes later.
When the reader finds that some byte in the file is zero, it should (busy-)wait until it becomes non-zero.
Busy wait is evil :)
And it won't work anyway. The writing process can fail (possibly from a SIGKILL) before it writes any non-zero data.
Don't wait forever, just a little bit!