Perl to the rescue: case study of deleting a large directory
When I moved from OpenBSD to FreeBSD a year ago, I also had to move the email being handled by my server. As things were a bit different, I added a "Just-in-case" MailDir for one of my users so that no matter what else happened to the rest of their procmailrc, they'd have a backup copy.
Flash forward a year.
Yeah, you guessed it... we never turned that off. It's been accumulating spam at the rate of a few messages a second. For a year. I couldn't figure out why my 80GB of free space a year ago was now dangerously under 15GB.
The MailDir/new directory was 2.5GB. Not the contents. The directory itself! I tried an "ls", realizing after a few minutes that ls would want to load the entire list of names into memory, then sort them, then dump out. Ouch.
So I tried the next obvious thing: "mv new BADnew; rm -rf BADnew". Nothing happened, for a very long time. I tried tracing the process, and saw that "rm" was like "ls" here: it was sucking in all of the names without doing any deletes yet. Foiled again.
I remembered that "find ... -delete" probably wouldn't need to sort things. Tried that. Again, after a few minutes of nothing happening, I traced it. What? Same thing. find was pulling all names in first. Ouch. Foiled again.
On a whim, I fired up "perl -debug", and tested a scalar readdir(). Woo hoo! Name immediately! So I set up a quick loop:
perl -e 'chdir "BADnew" or die; opendir D, "."; while ($n = readdir D) { unlink $n }'
And immediately, names were being deleted, and space was being returned to the OS. It took about four hours to delete the entire directory, but yay, there it was!
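For anyone who wants to reuse that, here is the same loop spelled out as a small script with a couple of guards added (skipping the dot entries and printing occasional progress). It is a sketch of the approach rather than the exact command above; like the one-liner, it streams one name at a time via readdir instead of building a list:
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the readdir-and-unlink approach: stream one name at a time,
# never building the full list of directory entries in memory.
my $dir = shift || "BADnew";                 # directory to empty
chdir $dir or die "chdir $dir: $!";
opendir my $dh, "." or die "opendir: $!";

my $count = 0;
while (defined(my $name = readdir $dh)) {
    next if $name eq "." || $name eq "..";   # never try to unlink the dot entries
    unlink $name or warn "unlink $name: $!";
    print "$count...\n" unless ++$count % 100_000;   # occasional progress marker
}
closedir $dh;
print "done: $count entries processed\n";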
Perl to the rescue!
:-) How many files were in that dir? Last time I remember, the largest Maildir/new I had to empty only contained about 250k (or was it 500k?) files, so I guess I'm lucky...
I'll never actually know how many files were there. But consider that the average dirent was probably 20 chars plus a 4-char inode number... that'd put it in the 100 million range. Perhaps off by a few orders of magnitude, but not more. :) Coming at it from the other side... 3-5 spam mails a second for a year would be about 100 million. Hmm, that's starting to pan out then.
So, perhaps, I just deleted 100 million files with Perl.
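For the curious, here is that back-of-the-envelope arithmetic spelled out (a quick sketch using the rough figures above; the 24 bytes per entry is only the guess mentioned in the comment):
use strict;
use warnings;

# Rough sanity check of the estimates above; guesses, not measurements.
my $bytes_per_entry = 24;                  # ~20 chars of name plus a 4-byte inode number
my $dir_bytes       = 2.5 * 1024**3;       # the 2.5GB MailDir/new directory
printf "by directory size: roughly %.0f million entries\n",
    $dir_bytes / $bytes_per_entry / 1e6;

my $secs_per_year = 365 * 24 * 60 * 60;    # 31,536,000 seconds
printf "at %d msgs/sec for a year: roughly %.0f million messages\n",
    $_, $_ * $secs_per_year / 1e6 for 3, 5;
Either way it lands in the 100-million neighborhood.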
Heck, I can top that speed...but I have to use "mkfs"... ;)
unlink takes a list of files to unlink, so something like "unlink <*>" would do the whole thing in one shot.
Normally I don't use the '<*>' form, instead using glob directly. Either way, it probably doesn't matter much.
That's clever, except it won't work: <*> will try to build the entire list of file names first, so you have the same problem as all the other tools. (A quick illustration of the difference follows below.)
As a dedicated OpenBSD fan I feel obligated to ask: Why did you switch from OpenBSD to FreeBSD?
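To make that contrast concrete, here is a minimal illustration (assuming the same BADnew directory as in the post; the two loops are alternatives, shown together only for comparison):
use strict;
use warnings;

chdir "BADnew" or die "chdir: $!";

# One-shot form: <*> (equivalently glob("*")) expands the whole pattern into
# one huge in-memory list before unlink ever sees a single name.
unlink glob("*");

# Streaming form: readdir hands back one name per call, so nothing large
# is ever held in memory; this is what the original one-liner does.
opendir my $dh, "." or die "opendir: $!";
while (defined(my $name = readdir $dh)) {
    next if $name eq "." || $name eq "..";
    unlink $name;
}
closedir $dh;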
Looking forward to you having Bob Beck on for FLOSS Weekly ;-)
Randal L Schwartz wrote: When I moved from OpenBSD to FreeBSD a year ago ....
Have you posted or blogged anywhere about why you made that transition? I would be very interested in hearing your reasons.
Thank you very much.
OpenBSD vs FreeBSD:
FreeBSD has the pf firewall (albeit a ported version, and a little downrev). That's something I wasn't going to give up.
But FreeBSD also has ZFS, including boot-from-ZFS in the latest releases. After playing with OpenSolaris for a while, I was hungry to keep using ZFS, but without the unfamiliarity of Solaris system administration and its lack of good modern ports.
And finally, FreeBSD has a huge, well-maintained ports catalog; OpenBSD ports never seemed to have what I wanted. In particular, "BSDPan" can build non-ported Perl modules and install them as a removable package, just as if they had been ported.
When I do the math, 3-5 spam mails a second for a year gives us more like a million to a million and a half files. Even a million files is enough to choke the UFS2 file system, which is the default file system on FreeBSD.
A million files is no big deal for a (modern) Linux system:
$ mkdir large
$ time for i in {0..999999} ; do echo > large/$i ; done
real 2m47.775s
user 0m12.032s
sys 0m34.987s
$ du -ks large
4022660 large
$ time rm -r large
real 1m6.592s
user 0m1.657s
sys 0m20.221s
Yikes, that's a 3.8 GiB directory! Run with ext4 and Linux 3.0.3.
No, you're right: 3-5 a second for a year works out to something like 100 to 150 million.
Thanks!
For a couple of days I wrestled with exactly the same problem: 10 million files, and every tool I tried wanted to cache the entire directory for "performance" reasons...
Now that script is running on a client's OS X server where the rsync/OS X 10.6 problem spammed a directory.
This post is quite old already, but I found it via a post of mine on the PerlMonks website.
As far as I've checked, on Linux you could call SYS_getdents through Perl's syscall, avoid doing a stat() on each file, and collect the names into a list (or a text file).
After that it would just be a matter of calling unlink on those names.
The downside is that I'm still trying to figure out how to use getdents from Perl. :-)
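In case it helps, here is a rough sketch of what that could look like (assumptions: a 64-bit Linux, a Perl built with 64-bit integers, and an h2ph-generated syscall.ph for the SYS_getdents64 constant; the record layout is struct linux_dirent64, and this only prints the names, as a starting point rather than tested code):
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY O_DIRECTORY);

require "syscall.ph";    # defines &SYS_getdents64 (on x86-64 the raw number is 217)

my $dir = shift or die "usage: $0 directory\n";
sysopen(my $fh, $dir, O_RDONLY | O_DIRECTORY) or die "open $dir: $!";

my $buf = "\0" x 65536;  # the kernel fills this with packed linux_dirent64 records
while (1) {
    my $nread = syscall(&SYS_getdents64, fileno($fh), $buf, length $buf);
    die "getdents64: $!" if $nread == -1;
    last if $nread == 0;                      # end of directory
    my $pos = 0;
    while ($pos < $nread) {
        # struct linux_dirent64: u64 d_ino, s64 d_off, u16 d_reclen, u8 d_type, char d_name[]
        my ($ino, $off, $reclen, $type) = unpack "Q q S C", substr($buf, $pos, 19);
        my $name = unpack "Z*", substr($buf, $pos + 19, $reclen - 19);
        print "$name\n" unless $name eq "." || $name eq "..";
        $pos += $reclen;
    }
}
close $fh;
From there it would just be a matter of feeding those names to unlink, as described above.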