Fun With Fuse
FUSE (Filesystem in USErspace) is a useful kernel module with an API which allows file systems to be implemented in user space applications, mounted and fully integrated into the system's VFS. Originally implemented for Linux, FUSE API-compatible kernel modules are now available on *BSD, OpenSolaris and MacOS. That said, this was written on a Linux machine so there may be assumptions made about the tools available, their output etc. Feel free to submit info on other *nix-like systems for inclusion if you encounter any inconsistencies. This also goes for corrections and suggestions - if I get the details plain wrong or the code stinks and so on...
Some examples of popular FUSE file systems are ntfs-3g, which provides full NTFS support for Linux, and sshfs, which mounts remote directories over SSH using SFTP. FUSE also powers ambitious projects such as the distributed file system MooseFS, and provides easy access to proprietary protocols such as MTP.
We can combine this power with the power of Perl using the excellent FUSE API binding.
In this post we will implement a simple FUSE module to read an RSS feed right in the file system. Why an RSS feed? It's a handy source of data to populate files and demonstrates nicely (I hope) how Perl and FUSE can work together. I will use FUSE here to refer to the API and Fuse to refer to the Perl module / bindings.
FUSE works via callbacks to perform the basic operations of a file system, such as fetching a directory listing or retrieving a file's attributes (type, permissions etc.). These are implemented in the Fuse module by passing coderefs in the arguments of the call to the main loop.
A minimal example of this is the following code, which creates directory listings consisting of the English alphabet. It implements two functions of a file system, getdir (return a directory listing) and getattr (retrieve attributes for a file/directory). References to these functions are passed in the call to Fuse::main - when FUSE needs a directory listing or file attributes (type, owner, permissions, size...) it will call these coderefs.
#!/usr/bin/env perl
# listing.pl
use strict;
use warnings;
use Fuse;
# Directory listing: the names, followed by an errno (0 for success)
sub getdir {
    return ('a'..'z', 0);
}
# Attributes: the same fixed set for every path
sub getattr {
    return (0, 0, 0040700, 1, 0, 0, 0, 0, 0, 0, 0, 4096, 0);
}
Fuse::main(
    mountpoint => "listing",
    getdir     => \&getdir,
    getattr    => \&getattr,
    threaded   => 0
);
So, getdir returns a list, 'a'..'z', but why is the last element 0? The last element of the returned list should be an errno, 0 indicating success. Failure is reported as a negated errno constant, such as -ENOENT(). We will look at cases of returning a failure status in further examples.
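As a small illustration of the failure convention, here is a sketch that only has a listing for the root directory. It uses the core POSIX module and does not call Fuse at all; the function name getdir_checked is made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(ENOENT);

# Hypothetical variant of getdir: only the root directory has a listing.
sub getdir_checked {
    my ($path) = @_;
    # Failure: a single negated errno value and no names
    return ( -ENOENT() ) unless $path eq '/';
    # Success: the names followed by 0
    return ( 'a' .. 'z', 0 );
}

my @ok  = getdir_checked('/');
my @bad = getdir_checked('/nope');
print "root: ", scalar(@ok) - 1, " entries, status $ok[-1]\n";
print "nope: status $bad[-1]\n";
```

The exact number printed for the failure depends on the platform's value for ENOENT.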
What's going on in getattr? For this simple example a set of default values for the directory entry's attributes have been used. We can see the effect this has on the listing when we actually use the module. Yes, we can mount this small module and examine its contents. Let's take a look at what it does:
$ mkdir listing
$ chmod 0755 listing.pl
$ ./listing.pl &
[1] 6928
$ ls listing
a b c d e f g h i j k l m n o p q r s t u v w x y z
We can see the directory listing consists of the array returned from getdir. Where does getattr come in? For each entry in the listing we must return attributes so the shell knows how to present it, access can be granted/denied and countless other things. So, what do some of the values we returned from getattr mean?
We returned (0, 0, 0040700, 1, 0, 0, 0, 0, 0, 0, 0, 4096, 0)
These describe (dev, ino, mode, nlink, uid, gid, rdev, size, atime, mtime, ctime, blksize, blocks).
We don't need to worry about most of these for now (for a more complete description, see the Fuse or stat() documentation), but we can observe how the values returned affect what we see in the listing, like so:
$ ls -ld listing/a
drwx------ 1 root root 0 Jan 1 1970 listing/a
The first column we see is drwx------ - the type and permissions (or mode) of the entry. We passed 0040700 for this. The 0040 part describes the type - directory. If this were a regular file, this value would be 0100. The second part is probably familiar to anyone used to Linux file permissions: it grants read, write and execute (or enter for directories) permission to the entry's owner.
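If you want to check a mode value without mounting anything, the core Fcntl module can take it apart. This is just a sketch for experimenting; the variable names are my own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl qw(:mode);

my $mode = 0040700;    # the value returned by getattr in listing.pl

# S_ISDIR / S_ISREG test the type bits, S_IMODE keeps only the permission bits
my $type = S_ISDIR($mode) ? 'directory'
         : S_ISREG($mode) ? 'regular file'
         :                  'other';
my $perms = S_IMODE($mode);

printf "type: %s, permissions: %04o\n", $type, $perms;
```

This prints "type: directory, permissions: 0700", matching the drwx------ shown by ls.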
Next is the link count, nlink, which counts the number of hard links to that file or directory. Its entry in directory listings counts as a link, hard links created with ln count as a link... The link count can be used by tools such as find to make decisions about recursion. We set this to 1 - there is always at least one link. This feature is where unlink() takes its name from: we are not deleting files as such, we are deleting names which link to a file or directory. The system reclaims the file's space when the last link has been removed.
The next values we see are the uid and gid - we returned each of these as 0 so the directory is owned by root:root.
Next are the time stamps: access time (atime), modification time (mtime) and inode change time (ctime, which is often mistaken for creation time). Each of these is epoch time, or the number of seconds since 00:00 UTC on 1st Jan 1970. Since we return 0 for these, the directory appears to be many years old.
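A quick way to convince yourself of what a time stamp of 0 means, using only core Perl (the variable name is arbitrary):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format epoch time 0 as a UTC date and time
my $stamp = strftime( '%Y-%m-%d %H:%M:%S', gmtime(0) );
print "$stamp UTC\n";
```

This prints "1970-01-01 00:00:00 UTC". ls shows the date in your local time zone, so the listing may differ by a few hours.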
The last column is the directory name. Since we return the same attributes for any getattr request, all the directories will appear much the same. Also, since we return the same set of contents for each getdir callback request we should have a recursively alphabetic directory tree...
$ ls listing/a/a/a/a/a/a
a b c d e f g h i j k l m n o p q r s t u v w x y z
$ ls -ld listing/a/b/c/d/e/f/g
drwx------ 1 root root 0 Jan 1 1970 listing/a/b/c/d/e/f/g
At this point you could also take the opportunity to run the mount command, which should show you the listing.pl module mounted alongside your disks and other virtual file systems, e.g.
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
dev on /dev type devtmpfs (rw,nosuid,relatime,size=895452k,nr_inodes=223863,mode=755)
/dev/sda3 on / type ext4 (rw,relatime,data=ordered)
. . .
/dev/fuse on /home/fuzzix/projects/fuse/listing type fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)
The user_id and group_id are a component of FUSE's security model - unless explicitly granted, FUSE mounts belonging to you are not accessible to any other user, including root.
To unmount the filesystem we use the fusermount tool which should ship with FUSE:
$ fusermount -u listing
[1]+ Done ./listing.pl
Running mount again should confirm that the module has indeed been unmounted.
So far we have demonstrated a functioning FUSE module, though its utility is limited. We can remedy this with the magic of data! A reasonably dynamic source of data is an RSS feed for a busy website. We will use a hash populated with a file system structure and the content of an RSS feed to demonstrate how easily a data set which fits naturally into Perl's capabilities can be presented using the hierarchical file system abstraction (though this particular example won't demonstrate a hierarchy so much).
How will we bend the data from RSS to a file system layout? A simple mapping of the usual RSS feed elements should help; each file's content is a single post, the file name is the title, the file's time stamp is the post date and time... you see where I'm going.
The first task is to populate the hash from an Atom or RSS URL. We can use the XML::FeedPP module to achieve this fairly trivially. The file content itself might just be a dump of the post's content as-is, but for the purposes of this we'll convert it to plain text. HTML::FormatText::WithLinks should prove useful here, providing a readable format while keeping links intact. The post title itself can contain HTML markup and entities which could make the file name difficult to read or even nonsensical - HTML::Strip should clean it up nicely. Retrieving an epoch time stamp from UTC or other formatted time stamps is trivially achieved (from our perspective) with str2time from Date::Parse.
So now we have all the components to turn the RSS feed into a file system, we can proceed. The following function takes a preconfigured feed URL ($source) and chops its components into elements of a hash with file scope containing the file system contents (%files).
sub populate {
    my $feed   = XML::FeedPP->new($source);
    my $format = HTML::FormatText::WithLinks->new();
    my $strip  = HTML::Strip->new();
    my $filename;
    my $content;
    my $feeddate = str2time( $feed->pubDate() ) || time;
    # Start again from a hash holding only the directory entries
    undef %files;
    %files = (
        '.'  => { type => 0040, mode => 0755, ctime => $feeddate },
        '..' => { }
    );
    foreach my $item ( $feed->get_item() ) {
        # Strip markup from the title and replace '/', which is not allowed in a file name
        ( $filename = $strip->parse( $item->title() ) ) =~ s#[/]#-#g;
        $strip->eof();    # reset HTML::Strip's state before the next title
        $content = $format->parse( $item->description() );
        $content .= "Link: " . $item->link() . "\n";
        if ( $item->author() ) {
            $filename .= " (" . $item->author() . ")";
        }
        $files{ $filename . '.txt' } = {
            type    => 0100,
            mode    => 0644,
            # Fall back to the feed's date if the item's date is missing or unparseable
            ctime   => str2time( $item->pubDate() ) || $feeddate,
            content => $content
        };
    }
}
The function starts by adding '.' and '..' to the listing of files - you may have noticed that these seemingly perennial entries were missing from the full directory listings in our first example. They must be managed by the FUSE module explicitly.
Other than '.' and '..', we do not create any other directory entries here - in order to keep things simple (in functions such as rss_getdir()) we'll just have a plain dump of articles into files. To include subdirectories would require a slightly different structure, perhaps using the 'content' key to describe lists of other files and directories. There is no need to specify attributes for '..' here - as '..' refers to another file system in this case, the attributes will be retrieved from that file system module (most likely the file system driver for your home directory).
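To give an idea of what a nested structure could look like, here is a rough sketch in which a directory's 'content' is a hash of child entries and a file's 'content' is its text. None of this is part of rssfs.pl; the %root layout and the resolve() helper are hypothetical, and it runs without Fuse.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical layout: directories hold a hash of children in 'content'
my %root = (
    type    => 0040,
    mode    => 0755,
    content => {
        'perl' => {
            type    => 0040,
            mode    => 0755,
            content => {
                'post.txt' => { type => 0100, mode => 0644, content => "hello\n" },
            },
        },
    },
);

# Walk the path one component at a time; return undef if any part is missing
sub resolve {
    my ($path) = @_;
    my $node = \%root;
    foreach my $part ( grep { length } split m{/}, $path ) {
        my $children = $node->{content};
        return undef unless ref($children) eq 'HASH' && exists $children->{$part};
        $node = $children->{$part};
    }
    return $node;
}

print resolve('/perl/post.txt')->{content};
print defined( resolve('/perl/missing') ) ? "found\n" : "missing\n";
```

A getdir for such a tree would return the keys of the resolved node's 'content' hash, and getattr would call resolve() where the flat version uses a single hash lookup.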
We then iterate through the list of posts, or items, that XML::FeedPP parsed from our feed and create a new file for each entry. The file name is set from the item's title which has any HTML stripped and any occurrence of the '/' character replaced - POSIX restricts use of '/' to the directory separator so it is not allowed in file names.
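The file name handling can be tried in isolation. This sketch assumes the title has already had its HTML stripped; the helper name title_to_filename and the sample title are invented for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a file name from a plain-text title and an optional author
sub title_to_filename {
    my ( $title, $author ) = @_;
    ( my $name = $title ) =~ s#/#-#g;    # '/' may only be a directory separator
    $name .= " ($author)" if defined($author) && length($author);
    return "$name.txt";
}

print title_to_filename( 'Perl 5.14/5.16 notes', 'fuzzix' ), "\n";
print title_to_filename('No author here'), "\n";
```

The first call prints "Perl 5.14-5.16 notes (fuzzix).txt" and the second "No author here.txt".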
The file's contents are set to the item's "description" - the description is generally specified as a field to provide a short description for the provided link, though many feeds place the entire article in here. The original link is then appended to the text.
We now have our hash, populated from the parsed feed. How do we go about presenting it? We saw earlier that the minimal set of implemented callbacks for a working file system amounts to getdir and getattr, so how about we continue by implementing them? Retrieving a directory listing is trivial enough: we simply return the set of entries created in populate().
sub rss_getdir {
    return ( keys %files, 0 );
}
To implement getattr we mix the values we stored in the file's %files entry in populate() with values supplied by Fuse and some hard-coded defaults.
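sub path_to_ref {
    my $path = shift;
    $path =~ s#^/##;
    $path = '.' unless length($path);
    # Plain lookup: returns undef for unknown paths. Writing \%{ $files{$path} }
    # here would autovivify an empty entry, so every path would appear to exist.
    return $files{$path};
}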
sub rss_getattr {
    my $path = path_to_ref(shift);
    return ( -ENOENT() ) unless $path;
    my $context = Fuse::fuse_get_context();
    my $size    = length ( $path->{content} ) || 0;
    my $mode    = ( $path->{type} << 9 ) + $path->{mode};
    my $uid     = $context->{uid};
    my $gid     = $context->{gid};
    my $atime   = my $ctime = my $mtime = $path->{ctime};
    my ( $dev, $ino, $rdev, $blocks, $nlink, $blksize ) =
        ( 0, 0, 0, 1, 1, 1024 );
    return (
        $dev, $ino, $mode, $nlink, $uid, $gid, $rdev,
        $size, $atime, $mtime, $ctime, $blksize, $blocks
    );
}
The path_to_ref function returns a reference to the element in %files which matches the requested path by simply removing the initial directory separator and returning that named key. Since a request for the top-level directory is '/', this must be caught and set to the actual entry we created for the directory in %files, '.'. This function is used by getattr and open.
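One thing to watch in a lookup helper like this is autovivification: a plain hash lookup leaves the hash alone when the key is missing, but taking a reference to a dereferenced element creates it. The sketch below shows the difference with a stand-in %files hash; both function names are made up for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %files = ( '.' => { type => 0040, mode => 0755 } );

# Plain lookup: a missing key yields undef and the hash is untouched
sub lookup_plain {
    my ($name) = @_;
    return $files{$name};
}

# Reference to a dereferenced element: Perl creates the element on demand
sub lookup_autoviv {
    my ($name) = @_;
    return \%{ $files{$name} };
}

my $plain       = lookup_plain('missing');
my $after_plain = exists $files{missing} ? 1 : 0;
my $viv         = lookup_autoviv('missing');
my $after_viv   = exists $files{missing} ? 1 : 0;

printf "plain lookup:   result %s, entry created: %s\n",
    defined($plain) ? 'defined' : 'undef', $after_plain ? 'yes' : 'no';
printf "autoviv lookup: result %s, entry created: %s\n",
    defined($viv) ? 'defined' : 'undef', $after_viv ? 'yes' : 'no';
```

Only the plain lookup lets a caller detect a non-existent file by testing the return value.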
The rss_getattr() function's use of path_to_ref() can trap non-existent files trivially. The file size is set to the number of characters in the file, rather than the number of bytes, which would usually be what a file system module returns - this is good enough for our purposes here. The mode is made up of a bit-shifted type added to the mode flags so a directory type (0040) with mode rwxr-xr-x (755) permissions is returned as 0040755. User and group ID values are pulled from Fuse::fuse_get_context. Timestamps are all set to the post's time we stored in populate().
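Both points can be checked with a few lines of core Perl. This is only a sketch; the sample string is arbitrary and the variable names are my own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Type shifted left by 9 bits, plus the permission bits
my $mode = ( 0040 << 9 ) + 0755;
printf "mode: %07o\n", $mode;

# Characters versus bytes for a string with non-ASCII characters
my $text  = "caf\x{e9} \x{263a}";
my $chars = length($text);
my $bytes = length( encode( 'UTF-8', $text ) );
print "characters: $chars, bytes: $bytes\n";
```

This prints "mode: 0040755" and "characters: 6, bytes: 9". Returning the encoded length as the size would be one way to stop "cheating", assuming the content is served as UTF-8.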
So far we have just enough to download a feed and list its contents in the file system. How do we go about reading the posts? We need to implement just a couple more callbacks to make this happen. The first is open. Since we don't need to generate handles or any other low-level work, all we need do is confirm the requested file exists and is not a directory...
sub rss_open {
    my $path = path_to_ref(shift);
    return -ENOENT() unless $path;
    return -EISDIR() if $path->{type} & 0040;
    return 0;
}
...Well, that would be the case, if we still lived in an 8-bit world - we could use substr in the read callback to return the number of bytes from the location requested. We have already "cheated" in this file system by not returning the size of the content in bytes, but characters. For read, we need to return the actual set of bytes requested, regardless of string encoding.
We can take advantage of a feature in Perl, the in-memory file handle, to get real bytes from our file's content, ignoring the character encoding. If we return a file handle from the open call, subsequent read calls will get that file handle to work on. So, let's try that again:
sub rss_open {
    my $path = path_to_ref(shift);
    return -ENOENT() unless $path;
    return -EISDIR() if $path->{type} & 0040;
    # Open the content as an in-memory file so that reads work on bytes
    open my $fh, "<", \$path->{content} or return -EINVAL();
    binmode $fh;
    return 0, $fh;
}
The (almost) final callback we need to implement for a working file system is read. The parameters passed to the read callback are the path, the number of bytes to return, the offset (the byte to start reading from) and the file handle (since we returned one from rss_open()).
You may be tempted to use sysread() here to ensure you get bytes, not characters (I know I was), but sysread is not equipped to handle in-memory file handles.
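sub rss_read {
    my ( $path, $bytes, $offset, $fh ) = @_;
    my $buffer;
    # read()'s optional fourth argument is an offset into $buffer, not into
    # the file, so move the handle to the requested position first
    seek( $fh, $offset, 0 ) or return -EINVAL();
    my $status = read( $fh, $buffer, $bytes );
    return -EINVAL() unless defined $status;
    # Return the data, or 0 when there is nothing left to read
    return $status > 0 ? $buffer : 0;
}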
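Larger files are read in several requests, each with its own offset. The stand-alone sketch below (no Fuse required) simulates that by fetching an in-memory file in small chunks. It positions the handle with seek() before each read(), since the optional fourth argument of Perl's read() is an offset into the buffer rather than into the file. The helper name read_chunk and the chunk size of 8 are invented for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $content = "Hello, FUSE! This text is fetched in small pieces.\n";
open my $fh, '<', \$content or die "Cannot open in-memory handle: $!";
binmode $fh;

# Return up to $bytes bytes starting at $offset, or undef at end of file
sub read_chunk {
    my ( $handle, $bytes, $offset ) = @_;
    seek( $handle, $offset, 0 ) or return;
    my $got = read( $handle, my $buffer, $bytes );
    return unless $got;    # undef on error, 0 at end of file
    return $buffer;
}

# Ask for consecutive 8 byte chunks, as a series of read requests would
my ( $offset, $copy ) = ( 0, '' );
while ( defined( my $chunk = read_chunk( $fh, 8, $offset ) ) ) {
    $copy   .= $chunk;
    $offset += length($chunk);
}
close $fh;

print $copy eq $content ? "round trip ok\n" : "mismatch\n";
```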
Since we have an open() call which generates a file handle, we should implement a simple close (or release()) call:
sub rss_release {
    my ($path, $flags, $fh) = @_;
    close($fh);
    return 0;
}
The final step is to call populate() and pass references to our functions in the call to the Fuse main loop:
populate();
Fuse::main(
    mountpoint => $mount,
    getdir     => \&rss_getdir,
    getattr    => \&rss_getattr,
    open       => \&rss_open,
    read       => \&rss_read,
    release    => \&rss_release,
    threaded   => 0
);
When we mounted our first demo file system, we had to background it on the command line with '&'. To have the file system background itself, add the following line before the call to the Fuse main loop:
fork and exit;
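If you prefer something a little more defensive than the one-liner, the following sketch reports a failed fork and detaches the child from the terminal. It is an optional variation, not what rssfs.pl does; the function name background is my own, and the call is left commented out so the snippet can be loaded safely.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(setsid);

# Hypothetical helper: continue in a background child process
sub background {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid;    # parent: hand control back to the shell
    setsid();          # child: start a new session, away from the terminal
    return;
}

# background();    # would be called just before the Fuse main loop
```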
I think we're ready to try the module now. There are some small pieces I didn't cover here (retrieving command line parameters, the exact form of module includes...) so you can retrieve the full source.
Now to locate an RSS/Atom feed - I hear good things about blogs.perl.org, so...
$ mkdir blogs.perl.org
$ ./rssfs.pl https://blogs.perl.org/atom.xml blogs.perl.org/
$ ls -ltr blogs.perl.org/
-rw-r--r-- 1 fuzzix users 446 Aug 11 21:52 Getting to the Venue (YAPC::Europe 2012).txt
-rw-r--r-- 1 fuzzix users 4110 Aug 12 20:07 CPAN Testers Summary - July 2012 - Head On The Door (CPAN Testers).txt
-rw-r--r-- 1 fuzzix users 3133 Aug 13 21:08 The solved problem that isn't, is (Jeffrey Kegler).txt
. . .
-rw-r--r-- 1 fuzzix users 1366 Aug 21 15:43 A NYTprof encoding hiccup (Kimmel).txt
-rw-r--r-- 1 fuzzix users 1602 Aug 21 20:36 YAPC::Europe Day 2 (acme).txt
-rw-r--r-- 1 fuzzix users 396 Aug 22 15:28 The goodness of testing (Jerome Eteve).txt
An NYTprof encoding hiccup? Sounds interesting...
$ cat blogs.perl.org/A\ NYTprof\ encoding\ hiccup\ \(Kimmel\).txt
While using Devel::NYTProf on a new application I started getting this
message
. . .
...Well, the module appears to be working. One last thought occurs to me: what if I want to obsessively update the feed to stay right on top of what's new in the Perl community? Well, we could unmount and remount the file system to run populate() again, or we could add some hook which runs it for us.
It's arbitrary enough, but we could add functionality to rss_open to trap, say, `touch update` to repopulate the file system hash. We'll need to add an 'update' file to the initial set of files in populate() as we have no facility to dynamically create arbitrary file system entries:
%files = (
    'update' => { type => 0100, mode => 0644, ctime => $feeddate, content => "" },
    '.'      => { type => 0040, mode => 0755, ctime => $feeddate },
    '..'     => {}
);
The rss_open function can now trap O_WRONLY open requests on '/update' to call populate():
sub rss_open {
    my ($path_txt, $flags) = @_;
    if ($path_txt eq '/update' && $flags & O_WRONLY) {
        populate();
        return 0;
    }
    my $path = path_to_ref(shift);
    . . .
Since we aren't always passing file handles now from open, rss_release() needs a slight modification:
sub rss_release {
    my ( $path, $flags, $fh ) = @_;
    close($fh) if $fh;
    return(0);
}
...and the O_WRONLY symbol needs to be added to those listed in the POSIX module use line:
use POSIX qw(ENOENT EISDIR EINVAL O_WRONLY);
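The `$flags & O_WRONLY` test is enough for touch. On Linux it would not match a file opened with O_RDWR, though, since that is a separate value rather than a combination of bits. A slightly more general check masks the flags with O_ACCMODE first, as sketched below. This is an optional variation rather than part of the module, and the helper name opened_for_writing is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(O_RDONLY O_WRONLY O_RDWR O_ACCMODE O_CREAT);

# True if the open flags request any kind of write access
sub opened_for_writing {
    my ($flags) = @_;
    my $access = $flags & O_ACCMODE();
    return $access == O_WRONLY() || $access == O_RDWR();
}

my @cases = (
    [ 'O_RDONLY',         O_RDONLY() ],
    [ 'O_WRONLY|O_CREAT', O_WRONLY() | O_CREAT() ],
    [ 'O_RDWR',           O_RDWR() ],
);
foreach my $case (@cases) {
    printf "%-18s %s\n", $case->[0],
        opened_for_writing( $case->[1] ) ? 'write' : 'read-only';
}
```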
Since touch also results in a utime call, we will need to add a minimal coderef to the Fuse::main call:
Fuse::main(
    mountpoint => $mount,
    getdir     => \&rss_getdir,
    getattr    => \&rss_getattr,
    open       => \&rss_open,
    read       => \&rss_read,
    release    => \&rss_release,
    utime      => sub { return 0 },
    threaded   => 0
);
We now have enough in place for the following command to generate an update to the RSS feed:
$ touch blogs.perl.org/update
Here is a gist for this version.
Well, I think that's about it for now. Hope you had fun.
Wow, very informative. Never thought it would be so easy!
Thanks for posting!
Thanks for this! A really informative and thorough article. Motivates me to try playing with FUSE one of these days.
Thanks, glad you found it useful.
A thought occurs, that instead of all the binmode/read/handle stuff, you might be able to 'use bytes;' and then use plain substr. That might simplify things, if it worked.