Perl to the rescue: case study of deleting a large directory

When I moved from OpenBSD to FreeBSD a year ago, I also had to move the email being handled by my server. As things were a bit different, I added a "Just-in-case" MailDir for one of my users so that no matter what else happened to the rest of their procmailrc, they'd have a backup copy.

Flash forward a year.

Yeah, you guessed it... we never turned that off. It's been accumulating spam at the rate of a few messages a second. For a year. I couldn't figure out why the 80GB of free space I had a year ago was now dangerously under 15GB.

The MailDir/new directory was 2.5GB. Not the contents. The directory itself! I tried an "ls", realizing after a few minutes that ls would want to load the entire list of names into memory, then sort them, then dump out. Ouch.

So I tried the next obvious thing: "mv new BADnew; rm -rf BADnew". Nothing happened, for a very long time. I tried tracing the process, and saw that "rm" was like "ls" here: it was sucking in all of the names without doing any deletes yet. Foiled again.

I remembered that "find ... -delete" probably wouldn't need to sort things. Tried that. Again, after a few minutes of nothing happening, I traced it. What? Same thing. find was pulling all names in first. Ouch. Foiled again.

On a whim, I fired up "perl -debug", and tested a scalar readdir(). Woo hoo! Name immediately! So I set up a quick loop:

perl -e 'chdir "BADnew" or die; opendir D, "."; while ($n = readdir D) { unlink $n }'

And immediately, names were being deleted, and space was being returned to the OS. It took about four hours to delete the entire directory, but yay, there it was!
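
For the record, here's the same idea as a slightly more defensive script (my own sketch, not part of the original one-liner): the defined() test keeps a file literally named "0" from ending the loop early, the dot entries are skipped explicitly, and failures get a warning instead of vanishing silently.

#!/usr/bin/perl
use strict;
use warnings;

# Stream names straight out of readdir and unlink them one at a time,
# never holding the whole directory listing in memory.
my $dir = shift or die "usage: $0 directory\n";
chdir $dir or die "chdir $dir: $!";

opendir my $dh, '.' or die "opendir: $!";
my $count = 0;
while (defined(my $name = readdir $dh)) {
    next if $name eq '.' or $name eq '..';
    unlink $name or warn "unlink $name: $!";
    $count++;
}
closedir $dh;
print "removed $count entries\n";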

Perl to the rescue!

12 Comments

:-) How many files were in that dir? Last time I remember, the largest Maildir/new I had to empty only contained about 250k (or was it 500k?) files, so I guess I'm lucky...

Heck, I can top that speed...but I have to use "mkfs"... ;)

unlink takes a list of files to unlink, so -

perl -e 'unlink <*>'

Normally I don't use '<*>'; I use glob directly instead. Either way, it probably doesn't matter much.

That’s clever, except it won’t work: <*> will try to build the entire list of file names first – so you have the same problem as all the other tools.
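
The sticking point is list context: unlink takes a list, so the whole glob has to be expanded before the first delete can happen. A tiny illustration of the contrast (mine, not from the thread):

# unlink imposes list context, so <*> expands every name up front:
unlink <*>;                                  # same as: unlink glob '*'

# scalar readdir hands back one name per call, so deletes start immediately:
opendir my $dh, '.' or die "opendir: $!";
unlink $_ while defined($_ = readdir $dh);   # "." and ".." just fail harmlessly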

As a dedicated OpenBSD fan I feel obligated to ask: why did you switch from OpenBSD to FreeBSD?

Looking forward to you having Bob Beck on for FLOSS Weekly ;-)

Randal L Schwartz wrote: When I moved from OpenBSD to FreeBSD a year ago ....

Have you posted or blogged anywhere about why you made that transition? I would be very interested in hearing your reasons.

Thank you very much.

When I do the math, 3-5 spam mails a second for a year works out to more like 100 to 150 million files. Even a million files is enough to choke the UFS2 file system, which is the default file system on FreeBSD.
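
For the back-of-the-envelope arithmetic (my own figure of a steady 4 messages a second, in the middle of that range):

perl -le 'print 4 * 60 * 60 * 24 * 365'   # 126144000, roughly 126 million messages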

A million files is no big deal for a (modern) Linux system:

$ mkdir large
$ time for i in {0..999999} ; do echo > large/$i ; done

real 2m47.775s
user 0m12.032s
sys 0m34.987s
$ du -ks large
4022660 large
$ time rm -r large

real 1m6.592s
user 0m1.657s
sys 0m20.221s

Yikes, that's 3.8 GiB for a million tiny files (each echo writes a single newline, which still occupies a full 4 KiB block)! Run on ext4 under Linux 3.0.3.

No, you're right: it was more like 100 to 150 million files.

Thanks!

For a couple of days I wrestled with exactly the same problem: 10 million files, and every tool I tried wanted to cache the whole directory listing for "performance" reasons...

Now that script is running on a client's OS X server, where an rsync/OS X 10.6 problem spammed a directory.

This post is quite old already, but I found it via a post of mine on the PerlMonks website.

As far as I can tell, on Linux you could call SYS_getdents through Perl's syscall() and get the names into a list (or a text file) without doing a stat() on each file.

After that, it would just be a matter of calling unlink on those names.

The downside is that I'm still trying to figure out how to use getdents with Perl. :-)
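
For anyone who wants to experiment, here is an untested sketch of that approach (the assumptions are all mine: Linux on x86_64, the linux_dirent64 record layout, and a syscall.ph generated by h2ph); it decodes the raw getdents64 records and unlinks each name as it appears:

#!/usr/bin/perl
use strict;
use warnings;
require 'syscall.ph';   # assumed to define SYS_getdents64; run h2ph if it's missing

my $dir = shift // '.';
chdir $dir or die "chdir $dir: $!";

# Open the directory itself; read() won't work on it, but getdents64 will.
sysopen my $fh, '.', 0 or die "open .: $!";          # 0 == O_RDONLY

my $buf = "\0" x 65536;                              # pre-sized buffer for the syscall
while (1) {
    my $n = syscall(SYS_getdents64(), fileno($fh), $buf, length $buf);
    die "getdents64: $!" if $n < 0;
    last if $n == 0;                                 # end of directory
    my $off = 0;
    while ($off < $n) {
        # struct linux_dirent64: u64 d_ino, s64 d_off, u16 d_reclen, u8 d_type, char d_name[]
        my ($reclen) = unpack "\@$off x16 S",  $buf;
        my ($name)   = unpack "\@$off x19 Z*", $buf;
        unlink $name unless $name eq '.' or $name eq '..';
        $off += $reclen;
    }
}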


About Randal L. Schwartz

print "Just another Perl hacker"; # the original!