Monday, April 27, 2015

From the Trenches, Tips & Tricks Edition: Hacking "/ on ZFS" and GELI Encrypted Drives, the Old-School Way

Glen Barber is back to kick off our latest From The Trenches series: The Tips and Tricks Edition. 

All my personal machines run FreeBSD.

In fact, all my personal machines run FreeBSD-CURRENT. I do this primarily to keep track of changes that get committed to the head branch, so I can personally test changes (for the things I use, at least) before they get merged to the stable branches.

As one of the Release Engineers, I find it essential that, whenever possible, I find issues so they can be corrected before they are part of a release.

My primary work machine is a laptop, currently a Lenovo Thinkpad T540p. I picked this laptop, and all the other laptops before it, because it met my minimum requirements for a primary workstation: it is capable of supporting a large amount of RAM (16GB for my Thinkpad, 8GB for all previous laptops), an Intel Core i7 CPU, and I could replace the DVD drive with a second hard drive.

In addition to these hardware requirements, I also have a few personal requirements of any workstation - the drives must be encrypted, and the underlying filesystem must be ZFS.

For me, it is not so much about the data I have *on* the laptop that I need to protect, but the kinds of things within the FreeBSD Project I am permitted to access. Without encrypted drives, a lost or stolen laptop would absolutely be my worst possible nightmare, because I would only have my login passphrase protecting my data (GPG key, SSH keys, and so on).

Recent FreeBSD releases allow "/ on ZFS" installation with the option to enable GELI-based encryption. This predates my original installation, however, since each laptop I have purchased for the past several years used the hard drives from the previous laptop. According to zpool history, the installation was at least two and a half years ago, but I know it is much longer than that, because of zfs recv being one of the first things zpool history reports.

So, I needed to do things the old-fashioned way, and manually create the GELI-backed providers and perform the "/ on ZFS" installation myself.

While bsdinstall(8) may now cover the majority of use cases for such installations, there may be cases where someone specifically needs to do something a certain way that the installer does not provide.

Because I only had one hard drive in the system when the system was initially installed (a long time ago), I will only refer to one hard drive when describing the steps I used to perform the installation, for now.

I installed the system using the 9.0-RELEASE or 9.1-RELEASE memory stick installer (memstick.img). I cannot remember which, but that detail is not important, since I did not use the installer anyway.

When I booted from the memory stick, the two drives recognized on the system were the internal hard drive, /dev/ada0, and the external USB flash drive for the installation, /dev/da0. The first menu screen has three options available: "Install", "Shell", "Live CD".

I selected "Live CD", and logged in as root (no password is necessary for the "Live CD" functionality). The hard drive did not have an operating system. Because I purchased the hard drive, in addition to the laptop, with the intention of replacing the laptop's drive, I did not need to remove any partitions from an existing installation. If I did need to remove partitions, I would have done so with:

# gpart destroy -F ada0
Here is where some technical details become important:
  • While you can install "/ on ZFS" on a drive partitioned with MBR (Master Boot Record), using GPT is far easier. In fact, I have forgotten much about how MBR partitioning is actually done.
  • When doing full disk encryption, you must keep /boot contents separate, otherwise loader(8) and the kernel will not be available when the BIOS hands over control to the operating system. As such, /boot should be given its own partition on the disk left unencrypted, and the rest of the system on its own encrypted partition.
I created four partitions on the drive. The first partition is for the boot blocks (not to be confused with the /boot contents), the second partition is for /boot, the third is for swap, and the fourth is for the encrypted system.
# gpart create -s gpt ada0
# gpart add -t freebsd-boot -s 512k -i 1 -l gptboot ada0
# gpart add -t freebsd-zfs -s 10G -i 2 -l bootfs ada0
# gpart add -t freebsd-swap -s 10G -i 3 -l swapfs ada0
# gpart add -t freebsd-zfs -s 180G -i 4 -l rootfs ada0
I decided to put the swap partition between the /boot partition and the rest of the system so that, if I needed to increase or decrease the size of the /boot partition, it would be far easier (and safer) to do.

Then, I loaded the necessary kernel modules for ZFS and GELI:
# kldload /boot/kernel/opensolaris.ko
# kldload /boot/kernel/zfs.ko
# kldload /boot/kernel/geom_eli.ko
Now that GELI functionality is available, I created the backend provider for the ZFS dataset:
# geli init -b -a HMAC/SHA256 -e AES-CBC -l 256 \
    -s 4096 /dev/ada0p4
Then I attached the GELI provider, and wrote data from /dev/random to the new device /dev/ada0p4.eli:
# geli attach ada0p4
# dd if=/dev/random of=/dev/ada0p4.eli bs=4096

This took a while on the system this hard drive was originally installed in, so I probably got coffee at this point. :-)

When the dd(1) command finished, I continued the installation.

I created temporary directories to use to import the pools after they were created:
# mkdir /tmp/zroot
# mkdir /tmp/zboot
Keep in mind, I am installing from a memory stick image, which, by default, is read-only. The /tmp directory is writable, however, because it is an md(4)-backed memory disk filesystem.
# zpool create -O checksum=fletcher4 -O atime=off \
    -m /tmp/zboot zboot /dev/ada0p2
# zpool create -O checksum=fletcher4 -O atime=off \
    -m /tmp/zroot zroot /dev/ada0p4.eli
Then I made a few ZFS datasets for various paths:
# for i in var var/log var/tmp var/db usr usr/home \
    usr/compat usr/ports \
    usr/local tmp; do \
    zfs create zroot/${i}; \
    done
I also made a separate ZFS dataset for the "bootfs" contents, and set the mountpoint to the /boot directory in the temporary working directory:
# zfs create zboot/boot
# zfs set mountpoint=/tmp/zroot/boot zboot/boot
On the memory stick installation media, the distribution sets are located in /usr/freebsd-dist. I extracted their contents into the newly-created filesystem:
# cd /tmp/zroot
# for i in base kernel lib32; do \
    tar -xf /usr/freebsd-dist/${i}.txz -C .; \
    done
Then I wrote the bootcode to the first partition of the drive:
# gpart bootcode -b /tmp/zroot/boot/pmbr \
    -p /tmp/zroot/boot/gptzfsboot -i 1 ada0
Because the "bootfs" (/boot) and "rootfs" (everything else) are both ZFS, I needed to use the gptzfsboot bootcode for the "freebsd-boot" partition.

Now the system is installed, but I needed to make a few modifications before I was ready to reboot. In particular, set a root password, edit /etc/fstab to enable swap, edit /etc/rc.conf to enable the zfs rc(8) startup script, and edit /boot/loader.conf to load the geom_eli.ko, opensolaris.ko, and zfs.ko kernel modules at boot.
# chroot /tmp/zroot
# passwd root
[enter password]
# echo '/dev/gpt/swapfs none swap sw 0 0' \
    >> /etc/fstab
# echo 'zfs_enable="YES"' >> /etc/rc.conf
# echo 'geom_eli_load="YES"' >> /boot/loader.conf
# echo 'zfs_load="YES"' >> /boot/loader.conf
# exit
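The echo commands above append unconditionally, so running them a second time leaves duplicate lines in the configuration files. A minimal sketch of a guard against that, assuming a helper named append_once (a hypothetical name, not part of FreeBSD), could look like this:

```shell
#!/bin/sh
# append_once FILE LINE (hypothetical helper): add LINE to FILE only if an
# identical line is not already present.
append_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string.
    # A missing FILE makes grep fail, so the line is appended and FILE created.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Inside the chroot it would be called as, for example, append_once /boot/loader.conf 'zfs_load="YES"'.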
Before rebooting, I needed to make a few adjustments to where /boot from the zboot/boot dataset would be mounted at boot.
# zfs umount zboot/boot
# zfs set mountpoint=/realboot zboot/boot
This now makes the /boot directory mount as /realboot, so I then needed to point /boot in the zroot dataset to the correct place. This was easily solved with a symbolic link:
# cd /tmp/zroot
# ln -s /realboot boot
Now when the system boots, the filesystem will look something like this:
/boot -> /realboot
Finally, I needed to unmount the zroot dataset, and fix its mountpoints. I only needed to change the zroot mountpoint itself, since all child datasets adjust their paths automatically.
# zfs umount -a
# zfs set mountpoint=/ zroot
At this point, the installation was complete. I rebooted the laptop, entered the GELI passphrase for /dev/ada0p4.eli when prompted, and was greeted by the "login: " prompt we have all grown to love.

Friday, March 13, 2015

15th Anniversary and Spring Fundraising Kickoff

I'm so excited to announce our spring fundraising campaign. I know it's not officially spring yet, but it sure feels like it here at Foundation headquarters in Boulder, Colorado. We're kicking off our fundraising campaign in conjunction with some other exciting events. There's so much to celebrate. First, we are proud to be a Platinum sponsor of AsiaBSDCon. This is the tenth AsiaBSDCon, with over 140 attendees planned, and 31 talks, providing a venue for all things BSD in Asia. People from around the world attend this conference to learn about the BSD operating systems, share their knowledge and experience, and work together to develop, hack, fix, improve, and document the various BSD operating systems.

The most exciting news we have is that we are celebrating our 15th anniversary supporting the FreeBSD Project and community worldwide! We have grown from our president and founder, Justin Gibbs, creating a non-profit to support FreeBSD, to an eight-member board with 7 staff members. In case you missed it earlier, check out Justin's interview about the history of the Foundation on BSDNow.

As the first employee, 9 years ago, I've witnessed incredible growth in our ability to support the Project and community. The year we were founded we raised a whopping $7,000. My first year with the Foundation, in 2006, we raised a little over $100,000. And, last year we raised $2,436,194, spending $877,412 on the project.

When we first started out, we focused on funding project development, conference sponsorships, and travel grants. Fifteen years later, we have increased support in those areas and have now grown to providing legal support for the Project; purchasing and helping manage hardware for FreeBSD infrastructure; providing release engineering support for consistent and timely releases; creating marketing literature and presentations that not only inform people of what FreeBSD is, but also provide detailed information on what's in new releases; attending more conferences to promote FreeBSD; and publishing a professional online FreeBSD magazine, The FreeBSD Journal.

To celebrate our anniversary, we are kicking off a fundraising campaign to help broaden the reach of our mission, by adding 500 new community investors in the next four weeks. What's a new community investor? An individual or organization that makes their first 2015 donation during this spring campaign. 

Why donate to the Foundation? Your donations will help us continue and increase our support in the following areas:
  • Funding improvement and development projects, including: Native iSCSI kernel stack, Updated video console (Newcons), UEFI system boot support, Capsicum component framework, IPv6 support in FreeBSD, Auditdistd improvements for FreeBSD cluster, and adding modern AES modes to OpenCrypto (to support IPsec).
  • Helping to provide consistent and on-time releases.
  • Educating the public and promoting FreeBSD with tools like our high-quality FreeBSD 10X Brochure and company visits to help facilitate collaboration efforts with the Project.
  • Sponsoring BSD conferences and summits in Europe, Japan, Canada, and the US.
  • Protecting FreeBSD IP and providing legal support to the Project.
  • Purchasing hardware to build and improve FreeBSD project infrastructure.
For the last 15 years, you as a community have allowed us to make a major impact on the FreeBSD Project and Community. Please help us continue and increase our support by making a donation today.

Deb Goodkin, Executive Director

FreeBSD From the Trenches: Using autofs(5) to Mount Removable Media

This next FreeBSD From the Trenches story comes to us from Edward Tomasz Napierała, who shares his work on the new FreeBSD automounter.

My big project for 2014 was the new FreeBSD automounter.  Like any proper FreeBSD Foundation sponsored project, it included the usual kind of documentation - man pages and the Handbook chapter.  But there is no document that shows how it works inside, from the advanced system administrator or a power user point of view.

So, here it is.  The article demonstrates how modular the automounter is, and how easy it is to adapt to any mount-related situation you might have, using recently added removable media support as an example.  (And it shows some related mechanisms as a bonus.)

autofs(5) Basics

The purpose of autofs(5) is to mount filesystems on access, in a way that's transparent to the application. In other words, filesystems get mounted when they are first accessed, and then unmounted after some time passes. The application trying to access the filesystem doesn't even notice this event, apart from a slight delay on first access.  It's a mechanism similar to ones available in other systems, in particular to OS X.  It's a completely independent implementation, it's just that OS X is the other operating system I use.

Automounting requires cooperation of four things: the kernel filesystem, autofs.ko, which is responsible, among other things, for "pausing" the application until the filesystem is actually there; the automountd(8) daemon, which is the component that retrieves configuration information from maps (this includes fetching it from remote sources, such as LDAP) and actually mounts the filesystems; the automount(8) utility for various administrative purposes; and then the autounmountd(8) daemon to, well, unmount the filesystems mounted by automountd(8) after a timeout.

Setting it up is fairly simple: you obviously need to have autofs(5) enabled in /etc/rc.conf:
autofs_enable="YES"
And you need to have the autofs(5) daemons running - just like other daemons in FreeBSD, those will get started at system bootup if autofs_enable was set; otherwise you need to start them by hand:
# /etc/rc.d/automount start
# /etc/rc.d/automountd start
# /etc/rc.d/autounmountd start
The kernel driver will get loaded automatically; you can see it in kldstat(8) output.

autofs(5) and Removable Media

Note that at the time of this writing, this is only available in FreeBSD 11-CURRENT. This will change soon.

The main configuration file for autofs(5) is /etc/auto_master; you need to uncomment this line:
/media -media -nosuid
This basically says that there is a /media directory, and automount will mount the "-media" map there, and everything that gets mounted there will have the "nosuid" mount option, for security reasons.
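
To make the three fields of that line concrete, here is a small sketch that splits an auto_master-style entry into mountpoint, map, and options. The function name describe_master_entry is made up for illustration; automount(8) does its own parsing, and this is not its code.

```shell
#!/bin/sh
# describe_master_entry LINE (hypothetical helper): split one auto_master-style
# entry into its whitespace-separated fields and print them.
describe_master_entry() {
    # read splits on whitespace; anything after the second field lands in
    # $options (empty when the entry has no options).
    read -r mountpoint map options <<EOF
$1
EOF
    # Options carry a leading dash in the file; strip it for display.
    options=${options#-}
    printf 'mountpoint=%s map=%s options=%s\n' "$mountpoint" "$map" "$options"
}
```

Calling describe_master_entry '/media -media -nosuid' prints the mountpoint /media, the map -media, and the option nosuid, which matches what "automount -v" reports for this entry.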

If you already had autofs(5) running before uncommenting the line, you must refresh its configuration by running automount(8) as root; run it as "automount -v" for a detailed explanation of what it does.  It looks like this:
# automount -v
automount: parsing auto_master file at "/etc/auto_master"
automount: done parsing "/etc/auto_master"
automount: unmounting stale autofs mounts
automount: skipping /, filesystem type is not autofs
automount: skipping /dev, filesystem type is not autofs
automount: leaving autofs mounted on /net
automount: mounting new autofs mounts
automount: autofs already mounted on /net
automount: nothing mounted on /media; mounting
automount: mounting map -media on /media, prefix "/media", options "nosuid"
If you run mount(8), you will see so-called "trigger nodes" of type autofs(5):
# mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)

Basic usage

With all that done, plug a drive into USB, and here is what happens in a real-world case:
[trasz@brick:~]% ll /media
total 9
drwxr-xr-x  3 root wheel  512 Feb 24 12:54 .
drwxr-xr-x 30 root wheel 1024 Feb 24 12:28 ..
drwxr-xr-x  1 root wheel 4096 Jan  1 1980 ADATA UFD
drwxr-xr-x  3 root wheel  512 Feb 24 12:54 md0
[trasz@brick:~]% cd /media/ADATA\ UFD
[trasz@brick:/media/ADATA UFD]% ll
total 10117
drwxr-xr-x 1 root wheel    4096 Jan  1  1980 .
drwxr-xr-x 3 root wheel     512 Feb 24 12:54 ..
drwxr-xr-x 1 root wheel    4096 Nov 24 00:03 .Spotlight-V100
drwxr-xr-x 1 root wheel    4096 Nov 24 00:03 .Trashes
-rwxr-xr-x 1 root wheel    4096 Nov 24 00:03 ._.Trashes
drwxr-xr-x 1 root wheel    4096 Jan 13 11:24 .fseventsd
drwxr-xr-x 1 root wheel    4096 Nov 22 22:44 Bonus
-rwxr-xr-x 1 root wheel 3309568 Nov 24 14:50 DSC05996.JPG
-rwxr-xr-x 1 root wheel 4063232 Nov 24 14:50 DSC05997.JPG
-rwxr-xr-x 1 root wheel 2953199 Nov 25 21:40 DSC05998.JPG
drwxr-xr-x 1 root wheel    4096 Nov 22 18:24 Meshuggah
drwxr-xr-x 1 root wheel    4096 Nov 22 21:06 System Volume Information
[trasz@brick:/media/ADATA UFD]% mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)
/dev/da0s1 on /media/ADATA UFD (msdosfs, local, nosuid, automounted)
[trasz@brick:/media/ADATA UFD]% cd /
[trasz@brick:/]% sudo automount -u
[trasz@brick:/]% mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)
Two things to notice here: first, the "ADATA UFD" is a factory default filesystem label on the flash drive.  If there were no filesystem label, autofs(5) would use the device name instead - in this case, that would be "da0s1".  Second - if you don't want to wait for autounmountd(8) to unmount the automounted volume, you can use "automount -u".  Or "automount -fu", if you want to force unmount.

Not So Basic Usage

Take a close look at the directory listing for /media in the previous example. Did you notice the "md0" there?  It looks like a device node for a memory disk (md(4)), but is a directory.  That's a leftover from my earlier experimentation, and shows an interesting feature of the autofs(5)-based automounter: it's not limited to removable media, it can mount everything that's available for mounting.  In this case it's a memory disk (kind of a ramdisk, see "man mdconfig").  It can also be an iSCSI LUN.  And, of course, removable media.  How does that work?


In FreeBSD, GEOM is the name of what could otherwise be called a block device layer.  It's a piece of code that manages all the "disk-like devices", both physical and virtual: SATA/SAS/FC/NVMe/USB drives, memory disks, iSCSI LUNs, partitions, encrypted GELI volumes etc.

GEOM has another meaning: an instance of a GEOM class.  The "class" here means the "kind" of device, and the instance is an actual device of that kind. It's easiest to explain it with an example:
# geom disk list
Geom name: cd0
1. Name: cd0
   Mediasize: 0 (0B)
   Sectorsize: 2048
   Mode: r0w0e0
   ident: (null)
   fwsectors: 0
   fwheads: 0

Geom name: ada0
1. Name: ada0
   Mediasize: 250059350016 (233G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e3
   descr: Samsung SSD 850 EVO 250GB
   lunid: 5002538da000f602
   ident: S21PNSAFC02149R
   fwsectors: 63
   fwheads: 16

Geom name: da0
1. Name: da0
   Mediasize: 7654604800 (7.1G)
   Sectorsize: 512
   Mode: r0w0e0
   descr: ADATA USB Flash Drive
   lunname: USB MEMORY BAR
   lunid: 2020030102060804
   ident: 14A0711312300023
   fwsectors: 63
   fwheads: 255

# geom part list
Geom name: ada0
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 488397127
first: 34
entries: 128
scheme: GPT
1. Name: ada0p1
   Mediasize: 65536 (64K)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r0w0e0
   rawuuid: 42dc1b8b-c49b-11e3-8066-001c257ac65f
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: (null)
   length: 65536
   offset: 17408
   type: freebsd-boot
   index: 1
   end: 161
   start: 34
2. Name: ada0p2
   Mediasize: 236223201280 (220G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r1w1e1
   rawuuid: 42dc921f-c49b-11e3-8066-001c257ac65f
   rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 236223201280
   offset: 82944
   type: freebsd-ufs
   index: 2
   end: 461373601
   start: 162
3. Name: ada0p3
   Mediasize: 13836045312 (13G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r1w1e0
   rawuuid: 21a8eef9-a0d4-11e4-ab80-001c257ac65f
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 13836045312
   offset: 236223284224
   type: freebsd-swap
   index: 3
   end: 488397127
   start: 461373602
1. Name: ada0
   Mediasize: 250059350016 (233G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e3

Geom name: da0
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 14950399
first: 1
entries: 4
scheme: MBR
1. Name: da0s1
   Mediasize: 7654576128 (7.1G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 28672
   Mode: r0w0e0
   rawtype: 12
   length: 7654576128
   offset: 28672
   type: !12
   index: 1
   end: 14950399
   start: 56
1. Name: da0
   Mediasize: 7654604800 (7.1G)
   Sectorsize: 512
   Mode: r0w0e0
See?  I've used the geom(8) command to get the information about two GEOM classes: "disk", and "part".  The first one returned information about three instances of the disk class: the DVD drive, the SSD, and the flash.  The second one returned information on the partitions known to the system. Everything that is potentially mountable - a physical disk, a partition, encrypted ELI volume, multipath device, RAID3 volume, memory disks, even volume labels - it all has its GEOM class and can be queried in a similar way.  To see all the GEOM instances in the running system, use:
# sysctl kern.geom.conftxt
Now, notice the "Mode" lines.  Like the one for ada0: "r2w2e3".  Those are three usage counters for the ada0 GEOM: read, write, and exclusive.  They are non-zero, because ada0 is used: there are three partitions on it; three instances of the PART GEOM class hold it open.  The partitions, just like any other GEOM nodes, have their own counters.  Take a look at the first one, ada0p1: the mode there is "r0w0e0".  This means it's not opened by anything.  It's, in other words, available for mounting.  If you check the MD geom class:
# geom md list 
Geom name: md0 
1. Name: md0
   Mediasize: 1073741824 (1.0G) 
   Sectorsize: 512
   Mode: r0w0e0
   type: swap
   access: read-write
   compression: off
   length: 1073741824
   fwsectors: 0
   fwheads: 0
   unit: 0
You will see the same thing: it's not opened.  That's the first thing the autofs(5) "-media" map checks for: zero access counts; if the counts are not zero, it means the node is used by something: it's either mounted (like ada0p2, mounted on /), or there is something "on top of it" - like ada0.
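
The "all counters are zero" test is easy to express in sh(1). The following is only a sketch of the idea, assuming a helper called geom_is_unused (a hypothetical name); the real "-media" map may well perform the check differently.

```shell
#!/bin/sh
# geom_is_unused MODE (hypothetical helper): succeed if a GEOM "Mode" string
# such as "r0w0e0" has its read, write, and exclusive counters all at zero.
geom_is_unused() {
    mode=$1
    # Peel the three counters out of the rNwNeN string.
    r=${mode#r}; r=${r%%w*}
    w=${mode#*w}; w=${w%%e*}
    e=${mode##*e}
    [ "$r" -eq 0 ] && [ "$w" -eq 0 ] && [ "$e" -eq 0 ]
}
```

With the listings above, geom_is_unused r0w0e0 succeeds for md0 and ada0p1, while geom_is_unused r2w2e3 (ada0) and geom_is_unused r1w1e1 (ada0p2) fail.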

But why is there no /media/ada0p1?  Because it's not mountable; there is no filesystem there.  It's a boot loader partition.  How does autofs(5) figure it out?


Before we can do anything with a filesystem, we need to determine what kind of filesystem it is - and whether it actually is a supported filesystem in the first place.  That means we need a piece of code that can take a look at it and determine if it has a format it recognizes.

It is possible to use file(1) for this, e.g.:
# file -s /dev/md0
Vermaden's sysutils/automount port uses this approach.  There are a few problems with doing it this way, though.  First, the output, for a typical FAT filesystem, looks like this:
/dev/md0: DOS/MBR boot sector, code offset 0x3c+2, \
 OEM-ID "BSD4.4  ", sectors/cluster 32, root entries \
 512, sectors/FAT 256, sectors/track 63, heads 255, \
 sectors 2097144 (volumes > 32 MB) , serial number \
 0x668a120e, unlabeled, FAT (16 bit)
It's not particularly easy to parse.  It's even harder to extract the volume label.

Second, file(1) can recognize all kinds of file types, from JPEG to 6502 assembly.  This means that if there are some strange data on the removable media, instead of a filesystem we expect, file(1) will output something our script wasn't tested against, making the first problem even harder.

Third, file(1) had its share of security bugs, e.g. CVE-2014-1943, CVE-2014-9620, or CVE-2014-3710.

For these reasons I've decided the proper fix would be to just write a new utility. The strange name - "fstyp" - comes from the utility of the same name, installed by default on Solaris, IRIX, OS X, and perhaps most other UNIX systems.

fstyp(8) addresses the file(1) issues: the output is easily parsable (just a filesystem name, one word), it only recognizes filesystems supported by FreeBSD, and it uses Capsicum sandboxing to make sure that even if there is a vulnerability, its impact is limited to incorrectly reporting the filesystem type.  It's a good topic for another article, but in short - in FreeBSD, every process can enter what's called a "capability mode". It's one way - a process can enter it, but there is no way to exit it.  Child processes inherit the mode. In capability mode, the kernel will deny all attempts to open new files, create sockets, attach shared memory segments, etc., but the process is pretty much free to do anything it likes with the file descriptors it already had opened before entering the capability mode - and it can receive other file descriptors over a UNIX socket.  So, the fstyp(8) utility opens the device file, then calls cap_enter(2), which switches it into capability mode, and then continues execution, reading from the device to determine what's there.  Should it be compromised, it won't be able to execute /bin/sh, it won't be able to open a socket to transmit the data to some external host, etc.
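
Because the output is a single word and the exit status is non-zero when nothing is recognized, calling fstyp(8) from a script takes only a few lines. A minimal sketch, assuming a wrapper named probe_fstype (a hypothetical name that does not appear in the base system):

```shell
#!/bin/sh
# probe_fstype DEVICE (hypothetical wrapper): print the filesystem type that
# fstyp(8) reports for DEVICE, or fail if it recognizes nothing.
probe_fstype() {
    dev=$1
    # fstyp prints one word (for example "msdosfs") on success and exits
    # non-zero otherwise; its error message is discarded here.
    fstype=$(fstyp "$dev" 2>/dev/null) || return 1
    [ -n "$fstype" ] || return 1
    printf '%s\n' "$fstype"
}
```

Compare this with the parsing that the file(1) output shown earlier would require.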

The "-media" Map

Those are the components underneath the autofs(5), but how does it fit together? Let's start with the actual map.  In FreeBSD, special maps (the ones with names starting with "-") are just executables in /etc/autofs/:
# ls -al /etc/autofs
total 36
drwxr-xr-x   2 root  wheel  512 Feb 14 21:18 .
drwxr-xr-x  25 root  wheel 3072 Feb 24 11:22 ..
-rwxr-xr-x   1 root  wheel 1010 Oct 17 11:26 include_ldap
-rwxr-xr-x   1 trasz wheel   43 Aug 17  2014 include_nis
-rwxr-xr-x   1 root  wheel  367 Oct 17 11:26 special_hosts
-rwxr-xr-x   1 root  wheel 2294 Dec  6 10:15 special_media
-rwxr-xr-x   1 root  wheel  355 Feb 14 21:17 special_noauto
-rwxr-xr-x   1 root  wheel   97 Oct 17 11:26 special_null
-rwxr-xr-x   1 root  wheel  357 Aug 22  2014 special_smb
See the special_media?  That's the one.  It's a shell script.  The reason it's in /etc is that the system administrator can modify it if required, or add new special maps.

Now, let's try to run it by hand, as root:
# /etc/autofs/special_media

# /etc/autofs/special_media md0
-fstype=msdosfs,nosuid  :/dev/md0
That's exactly how automountd(8) uses it, after the kernel component notifies it that it needs the /media directory taken care of.  It's described in more detail in the auto_master(5) manual page.  The shell script is pretty well commented, and I don't think there is any point in explaining it here.

Bottom line: the core autofs itself doesn't know anything about removable devices; the special map "-media" does: it queries GEOM for the list of all disk-like nodes that are not in use, and then uses fstyp(8) to determine whether they contain a useful filesystem.  UNIX.  Modularity.  Plain text.  ;-)
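
To show the calling convention without reproducing the real script, here is a stripped-down sketch of an executable map. It is not /etc/autofs/special_media: list_candidates and detect_fstype are hypothetical stand-ins with hard-coded answers, where a real map would query GEOM and run fstyp(8). The behaviour assumed here is that the map prints the available keys when run without arguments and prints one map entry when given a key.

```shell
#!/bin/sh
# Hypothetical stand-ins with canned answers, for illustration only.
list_candidates() { echo "md0 da0s1"; }
detect_fstype() {
    case "$1" in
        md0|da0s1) echo "msdosfs" ;;
        *) return 1 ;;
    esac
}

# media_map [KEY]: with no argument, print the available keys one per line;
# with a key, print the mount options and location for that key.
media_map() {
    if [ $# -eq 0 ]; then
        for dev in $(list_candidates); do
            echo "$dev"
        done
        return 0
    fi
    # Fail if the key does not hold a recognized filesystem.
    fstype=$(detect_fstype "$1") || return 1
    printf '%s\t%s\n' "-fstype=${fstype},nosuid" ":/dev/$1"
}
```

Running media_map md0 produces an entry of the same shape as the special_media output shown above.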


Now, let's create a second memory disk, 1GB in size (the "1g" below) to see if it all works as intended:
# mdconfig -s1g
# newfs_msdos /dev/md1
newfs_msdos: cannot get number of sectors per track: \
 Operation not supported
newfs_msdos: cannot get number of heads: \
 Operation not supported
newfs_msdos: trim 8 sectors to adjust to a multiple of 63
/dev/md1: 2096576 sectors in 65518 FAT16 clusters \
 (16384 bytes/cluster)
BytesPerSec=512 SecPerClust=32 ResSectors=1 FATs=2 \
 RootDirEnts=512 Media=0xf0 FATsecs=256 SecPerTrack=63 \
 Heads=255 HiddenSecs=0 HugeSectors=2097144

# ll /media
total 5
drwxr-xr-x   3 root wheel  512 Feb 24 12:25 .
drwxr-xr-x  30 root wheel 1024 Feb 23 09:04 ..
drwxr-xr-x   1 root wheel 4096 Jan  1 1980 ADATA UFD
drwxr-xr-x   3 root wheel  512 Feb 24 12:25 md0
Whoops.  Where is /media/md1?

There is one more mechanism for the whole thing to work correctly: the autofs(5) cache needs to be dealt with.

The first paragraph mentioned that it's automountd(8) that does all the map parsing - including running the /etc/autofs/special_media - and actual mounting. Doing it every time someone accesses the /media directory - or any directory, for that matter - would kill performance.  For this reason, after the kernel component asks automountd(8) to do its magic, it doesn't do that again until some time later.  In most cases it doesn't matter - the list of NFS exports for a given host doesn't change too often - but in case of removable media it's not acceptable.  The cache needs to be flushed, using "automount -c".  After that, the subsequent lookup in /media will trigger automountd(8), which will query the devices list and refresh the directory contents.

This obviously needs to happen automatically.  And if you actually went and opened /etc/auto_master in a text editor, you would have noticed this:
# When using the -media special map, make sure to edit devd.conf(5)
# to move the call to "automount -c" out of the comments section.
devd(8) is a daemon responsible for listening for notifications from the kernel and running whatever is configured in its config, /etc/devd.conf. There are all kinds of things there, from running utilities to upload firmware for various USB devices, to launching moused(8) when a mouse gets connected, to switching power profiles, to... discarding autofs(5) caches.  It looks like this:
notify 100 {
   match "system" "GEOM";
   match "subsystem" "DEV";
   action "/usr/sbin/automount -c";
};
If you do "man devd.conf", you will see the description of those events. Note that, just like the "-media" map works the same way for flash drives and encrypted volumes over multipath over iSCSI, this mechanism does not care about any specific hardware either.


There are two caveats, really.  First: you need to run 11-CURRENT.  Second: the nodes in /media never disappear.  I expect to merge this support to 10-STABLE after the second issue is addressed.

Thursday, February 26, 2015

FreeBSD From the Trenches: ZFS, and How to Make a Foot Cannon

This month's story comes to us from Glen Barber, UNIX Systems Administrator.

The ZFS filesystem is well regarded for its robustness and extensive feature set.

Its robustness can be haunting, however, if a mistake is made.  I learned this the hard way through a seemingly innocent typo, a mistake I certainly will not soon repeat.

We use ZFS almost exclusively in the FreeBSD cluster.  I say "almost" because there is one remaining machine that does not use ZFS, because the machine is too underpowered to handle it.

All machines are installed in a netboot environment while logged in at the serial console, providing the utilities necessary for extremely customizable installations.  Most of the installations I have performed on machines in the cluster have been pseudo-scripted, with subtle differences depending on the machine, such as if the disks are da(4) or ada(4), the number of disks, how much space to allocate for swap, the number of ZFS pools, and so on.

For the most part, a basic installation would be done with a very simple sh(1) script that looks something like:

# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt $i; [...]; done
Nothing too fancy at all.

Most times I would copy/paste from an installation script I've used for years, other times I would manually type the commands.  It really depended on what the end result was supposed to be, as far as configuration.

When I installed the FreeBSD Foundation's new server, I typed the commands manually.  You might ask, "Why did you do it this way?"  To this day, I cannot answer that question.  But if I didn't, this story would be far less interesting.

The machine was installed like this, almost verbatim:
# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt /dev/${i}; \
  gpart add -t freebsd-boot -s 512k -i 1 /dev/${i}; \
  gpart bootcode -b /boot/pmbr \
  -p /boot/gptzfsboot -i 1 /dev/${i}; \
  gpart add -t freebsd-swap -s 16G -i 2 /dev/${i}; \
  gpart add -t freebsd-zfs -i 3 /dev/${i}; \
  done
# zpool create zroot mirror /dev/ada0 /dev/ada1
# for i in tmp var var/tmp var/log \
  var/db usr usr/local usr/home; do \
  zfs create -o atime=off zroot/${i}; \
  done
This creates the GPT partition scheme for all available hard disks, writes the partition layout to the disks, writes the GPT boot code to the first partition on each disk, and allocates the swap space and ZFS space.  Then it creates the ZFS pool named 'zroot' configured as a mirror, and creates the ZFS datasets in the new pool.

The problem is not too obvious unless you are looking for it specifically, but instead of using the 'freebsd-zfs' GPT partitions, which are /dev/ada0p3 and /dev/ada1p3, I created the pool on the full disk (/dev/ada0 and /dev/ada1).
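For comparison, the intended command would have named the 'freebsd-zfs' partitions rather than the disks.  The following is a minimal sketch, assuming the two disks are ada0 and ada1 and that the ZFS partition has index 3 as in the layout above.  pool_create_cmd is a hypothetical helper, not part of any FreeBSD tool, and it only prints the command so it can be reviewed before anything is run.

```sh
#!/bin/sh
# Build a 'zpool create' command from partition names (adaXp3)
# instead of whole-disk names (adaX).  Prints the command; runs nothing.
pool_create_cmd() {
    pool=$1; shift
    vdevs=""
    for disk in "$@"; do
        # Index 3 matches 'gpart add -t freebsd-zfs -i 3' (assumed layout).
        vdevs="${vdevs} /dev/${disk}p3"
    done
    echo "zpool create ${pool} mirror${vdevs}"
}

pool_create_cmd zroot ada0 ada1
```

Generating the device names from the disk list in this way removes the step where a hand-typed "ada0" can stand in for "ada0p3".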

Simple enough to fix, right?  Destroy the 'zroot' pool, destroy the GPT partition layout to be safe, and create it again with the correct arguments to 'zpool create'.

So, that's what I did.

Luckily I wasn't ready to put this machine into production yet.  I still wanted to do some basic stress testing on the machine before moving anything critical to it.

Fast forward about a month.

After being satisfied that the machine did not have any obvious stability problems, such as faulty RAM for example, and after having lowered the relevant TTL entries in DNS, I decided to do one more upgrade on the machine before beginning the independent service migrations to the new machine.

This is where things started to go wrong.  Fast.

The source-based upgrade finished, and I rebooted the machine.  In another terminal, attached to the serial console, I saw the machine proceed through the normal reboot routines, killing running services, syncing buffers, and so on.

After the machine completed POST routines, everything went dark.  The machine did not respond to serial console input, and as far as I could tell, this was not due to a change caused by the update.

I should note that, by nature, I am a paranoid sysadmin.  This is a good quality, in my opinion, because I habitually go out of my way to make sure any situation is recoverable if something goes wrong.  Suspecting I did something wrong, I immediately began reviewing the history recorded while being logged in at the console.  Nothing looked suspicious.  This upgrade should have "just worked."

I remotely power-cycled the machine, and booted into our netboot environment to investigate further.

I immediately knew something had gone wrong after importing the 'zroot' pool into a temporary location and seeing several tell-tale signs.  For starters, /etc/rc.conf had a timestamp from before the machine was even shipped to the colocation facility.  More confusingly, /usr/obj was empty, as if the 'buildworld/buildkernel'-style upgrade that took place less than an hour prior had never happened.

Then panic ensued.  The machine didn't panic -- I did.

Everything was gone.

Every configuration change since the initial install, every jail that was created, every package that was installed.  All of it.  Just gone.

While investigating, I sent a heads-up to the other cluster administrators in case there was an issue that affected other installations.  As investigation progressed, Peter realized he had seen this exact behavior in the past, and provided an example scenario with which it could occur.

It was exactly what I had done - used the raw disk for the ZFS pool instead of the 'freebsd-zfs' GPT partition.

So, what's the problem?

The problem is 'zpool destroy' does not implicitly delete pool metadata from the disks, so as far as ZFS is concerned, I had two different ZFS pools, both named 'zroot', which confused the boot blocks just enough to import the wrong pool at boot.  Well, it didn't just import the wrong pool, it imported an empty pool.

Worse yet, because I had allocated the partitions in the order of 'freebsd-boot', 'freebsd-swap', and 'freebsd-zfs', and because 'freebsd-swap' was 16GB, the swap partition had more than enough space to hold on to the metadata from the pool I did not want to exist.  There was no way to force one pool to be chosen over the other, and worse, no way to tell which pool would be chosen by the loader.
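One way to see which device nodes still carry ZFS metadata is to dump the labels from each of them with zdb(8).  The sketch below is an illustration, not something that was run on this machine: label_targets is a hypothetical helper, and the disk names and partition indexes are assumed from the layout above.  'zdb -l' only reads from the device.

```sh
#!/bin/sh
# List every device node on a disk that could carry a ZFS label:
# the whole disk plus each GPT partition index from the assumed layout.
label_targets() {
    for disk in "$@"; do
        echo "/dev/${disk}"
        for idx in 1 2 3; do
            echo "/dev/${disk}p${idx}"
        done
    done
}

# Dump whatever labels exist on each target.  Read-only.
if command -v zdb >/dev/null 2>&1; then
    for dev in $(label_targets ada0 ada1); do
        echo "== ${dev}"
        zdb -l "${dev}" || true   # a device without labels is not an error here
    done
else
    echo "zdb not found; nothing inspected"
fi
```

A whole-disk node that still prints a label for a pool that was supposedly destroyed is the one holding stale metadata, and comparing the pool_guid values in the output tells two pools with the same name apart.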

The only good news at this point was that the machine was not yet in production.

How do you fix this, then?

Peter had a suggestion, since he has run into this before.  Reboot the machine into the netboot environment, and try to force the correct pool into being imported by forcibly removing all device entries for the disks and retrying the ZFS pool import.  This would be done by running:
# rm -f /dev/gptid/* /dev/diskid/* /dev/ada?
# zpool import -o altroot=/tmp/zroot zroot
Unfortunately, the wrong pool was imported again, most likely (but unconfirmed) because I had allocated such a large amount of swap on the disks.
# zpool status
     zroot       ONLINE 
       mirror-0  ONLINE 
         ada0    ONLINE 
         ada1    ONLINE
Then I realized the partition table was also corrupt.

After several attempts to coerce the correct pool to import, I became increasingly uncomfortable with leaving the machine in this condition. At this point, there was only one solution - wipe the disks, and start over.

Ultimately, despite disliking the solution, that is what I did to correct the problem, though at the time, I was unaware of the 'labelclear' command to zpool(8), which would have wiped the ZFS pool metadata from the disks.  But at that point, I was not going to take any chances either way.
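For reference, a teardown that also clears the old labels might look like the following.  This is a minimal sketch under stated assumptions: the pool is 'zroot', the disks are ada0 and ada1, nothing on them needs to be kept, and the ordering (destroy the pool, destroy the partition table, then clear the whole-disk labels) is one reasonable choice rather than a tested procedure from this incident.  teardown_cmds is a hypothetical helper that prints the commands instead of running them, because every one of them is destructive.

```sh
#!/bin/sh
# Print (do not run) a teardown sequence that also clears stale ZFS labels.
# Review the output carefully before running any of it by hand.
teardown_cmds() {
    pool=$1; shift
    echo "zpool destroy ${pool}"
    for disk in "$@"; do
        # Remove the GPT first, then wipe ZFS labels from the raw disk.
        echo "gpart destroy -F ${disk}"
        echo "zpool labelclear -f /dev/${disk}"
    done
}

teardown_cmds zroot ada0 ada1
```

After this, the disks can be partitioned again and the pool created on the 'freebsd-zfs' partitions.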

The takeaway is that, however innocent a mistake may appear at first, when dealing with metadata stored on disk devices, it will surely come back to haunt you sooner or later.

Wednesday, February 25, 2015

SCALE 13x Trip Report: Michael Dexter

The Foundation recently sponsored Michael Dexter to attend SCALE 13x. Michael provides the following trip report:

SCALE 13x was the 13th Southern California Linux Expo and took place February 19th through 20th in Los Angeles, California. Despite its name, this year's event showed sincere outreach to the BSD community, as demonstrated by two booths and several BSD-related talks. The first booth featured FreeBSD, the FreeBSD Foundation, FreeNAS, PC-BSD and pfSense while the second featured OpenBSD and NetBSD. Both booths were filled with familiar faces including Dru Lavigne, Denise Ebery, Matt Olander, James Nixon, David Maxwell, Brooke and Seth and two toddlers!

The FreeBSD Booth Crew -
Photo courtesy of iXsystems

The variety of booth visitors was very familiar for SCALE: a mix of students, consultants, open source developers and military/aerospace contractors. I heard lots of "I got started on FreeBSD" and "I use FreeNAS" plus the occasional "When can we have a military-certified BSD so we can stop using Linux?" The last one is something I have heard at every SCALE I have attended and is representative of the region. Hats off to the SCALE organizers for also attracting such a diverse

The BSD-related talk topics included David Maxwell's newly-released pipecut that he debuted at MeetBSD, Brooks Davis' talk on the BERI CPU that he is working on with Robert Watson, Dru Lavigne's talk on new FreeNAS 9.3 features and my talk on FreeBSD Virtualization Options. There were also many overlapping talks such as those on various system containers, embedded systems and of course Brendan Gregg's talk on systems performance. Brendan kindly updated the Netflix statistics that I was already going to address and both Bryan Smith and Randal Schwartz had great user questions. It truly was a pleasure to speak at SCALE and my sincerest thanks to Brendan for live Tweeting my talk.

Impressively, some SCALE speakers were in their teens and the overall outreach to kids was great including an evening kids-only event. The BSD Certification Group scheduled a BSDA exam but alas it was poorly attended. I humbly invite you to take the BSDA exam if you have not done so already and ask that you help spread the word whenever you get a chance.

In a community where we often preach to the converted, I find SCALE to be a very receptive venue for outreach and encourage you to attend and consider submitting a BSD-related talk to SCALE 14x. Special thanks to Gareth Greenaway for reaching out to the BSD community and for the great attitude demonstrated by his team of volunteers. Finally, I would like to thank the FreeBSD Foundation for covering my air travel and O'Reilly Media for allowing me to share a room with one of their amazing team members.

Friday, December 12, 2014

More From Your Newest Board Member: An Interview with Cheryl R. Blain

Recently, The FreeBSD Foundation announced the addition of Cheryl R. Blain to the Board of Directors. We sat down with Cheryl to find out more about her background and what brought her to the Foundation. Take a look at what she has to say:

Tell us a little about yourself, and how you got involved with FreeBSD?
I was bit by the entrepreneur bug in 1999 when working for a non-profit. I’ve worked with high-tech, venture-backed, small-cap companies ever since.  My typical engagement finds me streamlining operations and sales teams to prepare companies for their next step forward, which most often involves financing.

I have a master’s degree in business administration with a dual emphasis in finance and sustainable enterprise, from Saint Mary’s College and as a visiting student at UNC Kenan-Flagler.

Xinuos is the latest high-tech, venture-backed company to which I’ve plied my wares.  While working for Xinuos, I was exposed to FreeBSD for the first time in 2013.  During my first week on the job, I was asked if I was willing to go to Ottawa, Canada to learn more about FreeBSD and the community of developers.  The head of engineering and I felt the conference was very important to Xinuos’ future, so we decided it was an opportunity not to be missed.  Since the trip was so unexpected, I actually had to have my passport over-night shipped to me in our New Jersey office so I could leave the following day!  My colleague and I attended BSDCan and it was everything we had hoped it would be.  We were welcomed by the development community and pleasantly inundated with inquiries about our interest in FreeBSD.  David Chisnall was an especially helpful evangelist of FreeBSD, and made sure my colleague and I had the information we needed.

Why are you passionate about serving on the FreeBSD Foundation Board?
The FreeBSD community (including the board) is in no small part the reason I chose to learn more about the project as a commercial offering two years ago.  My passion is in building businesses, and I wanted to work on a project that was technologically sound, well supported and attractive to people who I like and respect.  The FreeBSD community quickly forgave me for being the least technical person in the room, and was wonderful in embracing the value I can bring to the community from a business perspective.

I look forward to doing my part to ensure that the FreeBSD project has a vibrant future.

What excited you about our work?
There are many things that make FreeBSD interesting...but the first time I think I got really excited was in Ottawa in 2013, when Matt Ahrens gave his talk on ZFS.  Every developer in the room was abuzz with excitement.  In Matt’s presentation he listed logos of the other open source operating systems using ZFS, but I connected with how the room full of BSD developers really embraced Matt as their own.  His bold move to pack his box at Oracle to continue his open source work, helped me realize the people associated with FreeBSD are not status quo...they are pushing the envelope. Then I met Peter Grehan and Neel Natu and was introduced to their work on bhyve, and Justin and George as Foundation board members and FreeBSD committers and knew that even though the FreeBSD project has been around since 1993, new excitement and innovation is happening right now.  And I haven’t even mentioned Capsicum or Clang! Oh and I can’t forget, I was there for the naming of Groff with all the rowdy laughter and good spirited banter, and it was then that I felt like I was among friends.   

 What are you hoping to bring to the organization and the community through your new leadership role?
I hope that my participation in the planning discussions will encourage other business leaders to join in the discussions as well.   

I also hope to encourage those who use FreeBSD commercially to become more vocal about their experiences and use cases, to encourage others to develop with FreeBSD as well.  In doing so, there is a great opportunity to build an endowment among alum to ensure a vibrant future for FreeBSD.

How do you see your background and experience complementing the current board? 
I will be delighted if I am successful in bringing a business lens to the board discussions.  I would like to help elevate FreeBSD in the minds of technology companies worldwide and see a broader acceptance of the OS as a commercially desirable alternative.

Thursday, December 11, 2014

Super Computing Trip Report: Michael Dexter

Michael Dexter has also provided his trip report for Super Computing:

In case you have not heard of the conference, it is a meeting of 10,000 researchers, computer scientists, engineers, students, managers, sales engineers and three-letter agency representatives that takes place in a different US city every year. I have hosted a booth at the event since 2009 when it passed through Portland and this year showcased the bhyve Hypervisor and explained all things BSD to brilliant attendees from around the world. I was joined by Patrick Masson, General Manager of the Open Source Initiative, who helped shed light on the pervasive yet unrecognized use of open source software by the universities, organizations and companies at the event. Literally 90% or more of the exhibitors rely on open source but few give it any recognition. For years, GNU/Linux has dominated the Top500 list of supercomputers that is announced at the event each year and I set out to help change that by highlighting bhyve, OpenZFS and other great technologies in FreeBSD.

SC14 could not have started on a better note thanks to the announcement on the first day that the FreeBSD Foundation received a million dollar donation from WhatsApp founder Jan Koum. I heard many people say "I used FreeBSD ten years ago" and the news instantly got their attention and set the tone for the rest of the event. By showcasing ZFS, we drew the attention of ex-Sun Microsystems engineers and executives and even had a visit by UC Berkeley CSRG research assistant Clem Cole. The message that "BSD is back" was loud and clear and I canvased the Student Cluster Competition to help inspire a new generation of users who had never heard of the BSDs.

The bhyve booth was in the heart of the ARM pavilion which made for some enlightening conversations. bhyve and the ARM CPU architecture both stand out for operating without emulation, resulting in simplicity and performance for bhyve and significant power savings for ARM. A roadmap exists for bhyve support on ARM and hopefully this will be something to showcase at SC15. Of the exhibiting ARM partners, the SoftIron team stood out as loud and proud users of FreeBSD and I look forward to seeing them at future BSD events.

FreeBSD vendor iXsystems was also at the event demonstrating FreeNAS and TrueNAS, as were the SaltStack team who received a bhyve demo and expressed a sincere desire to include support for bhyve. A handful of other open source vendors like Red Hat were in attendance plus FreeBSD consumers like Spectra Logic, EMC/Isilon, NetApp and Juniper. Many individual open source users came to the booth and my favorite quotation came from a conversation at a Mellanox event: "Our administrators use FreeNAS at home and come work and ask 'why the heck aren't we using ZFS?'" Open source is winning but there is still much work to be done.

Speaking of work, I asked many people, including Navy researchers moving massive uncompressed video streams, what FreeBSD needs to do to get back on the Top500 list of supercomputers. The short list of answers I received was: OFED/OpenFabrics Enterprise Distribution support, OpenMPI/Message Passing Interface support and Lustre distributed file system support. Surprisingly, NUMA/Non-Uniform Memory Access did not come up. Interconnect vendor Chelsio Communications stood out as a solid supporter of FreeBSD and dominant player Mellanox expressed interest in expanding their support for FreeBSD given the opportunity it represents. All in all, people were very receptive to giving FreeBSD and other BSDs a try, especially given that it would be a homecoming for so many users.

I wish to thank the FreeBSD Foundation for sponsoring the bhyve booth at SC14 and I am delighted to hear that ARM has just made a generous $50,000 donation to the Foundation. In total I gave out 250 tri-fold brochures and talked to hundreds of people at SC14. Hopefully those seeds will take root and we will start seeing FreeBSD systems in the Student Cluster Competition and on the 2015 Top500 supercomputer list!