May 05, 2015

Thoughts on a feedback loop for Trinity.

With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.

It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy. We would need to store the parameters from the last run, and extrapolate out a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.

The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested”. I’ve done some experiments with code coverage in the past. My explorations of the GCOV feature in the kernel didn’t get very far, however, for a few reasons (primarily that it slowed things down too much, and also I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time investment for starting large new projects was limited).
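To make the shape of that loop concrete, here is a minimal sketch. Everything in it is hypothetical: fake_syscall() stands in for the coverage measurement that doesn't exist yet, and generate()/mutate() only play the roles of the ->sanitize and ->mutate functions; none of this is trinity code.

```python
import random

def fake_syscall(args):
    """Stand-in for a real syscall: returns the set of 'code paths' it hit.

    Purely hypothetical; this is the measurement we don't have yet.
    """
    paths = {"entry"}
    if args["flags"] & 0x1:
        paths.add("flag_a")
    if args["flags"] & 0x2 and args["len"] > 4096:
        paths.add("big_flag_b")
    return paths

def generate(rng):
    """Fresh random arguments (the role of a ->sanitize function)."""
    return {"flags": rng.randrange(4), "len": rng.randrange(8192)}

def mutate(args, rng):
    """Small tweak of the previous arguments (the role of a ->mutate function)."""
    new = dict(args)
    if rng.random() < 0.5:
        new["flags"] ^= 1 << rng.randrange(2)
    else:
        new["len"] = max(0, new["len"] + rng.randrange(-512, 513))
    return new

def fuzz(iterations, seed=0):
    rng = random.Random(seed)
    seen = set()
    args = generate(rng)
    for _ in range(iterations):
        new_paths = fake_syscall(args) - seen
        seen |= new_paths
        # Rate the last attempt: new coverage means it is worth mutating,
        # otherwise throw it away and do something completely new.
        args = mutate(args, rng) if new_paths else generate(rng)
    return seen

print(sorted(fuzz(1000)))
```

The only decision the loop makes is "mutate or start over", driven by whether the last run reached anything new. A real implementation would need a much better rating than a yes/no.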

After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and trying to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this. Basically counting the number of instructions executed between system call enter/exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point. I compiled it and ran it a few times.

$ ./a.out 
Measuring instruction count for this printf
Used 3212 instructions
$ ./a.out 
Measuring instruction count for this printf
Used 3214 instructions

Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace however. What happens if we tell it to measure the kernel, and measure say.. getpid().

$ ./a.out 
Used 9283 instructions
$ ./a.out 
Used 9367 instructions

Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince, and asked if he had insight as to why the counters weren’t deterministic across different runs. He had actually written a paper on the subject. Turns out we’re also getting event counts here for page faults, hardware interrupts, timers, etc.
x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.
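To put numbers on the precision loss, this small sketch works out how many low bits have to be masked off before the two counts from each pair of runs above compare equal. The helper names are mine; the counts are the ones printed above.

```python
def bits_to_mask(a, b):
    """Number of low bits that must be dropped before a and b compare equal."""
    return (a ^ b).bit_length()

def bucket(count, bits):
    """Clear the bottom `bits` bits of an instruction count."""
    return count & ~((1 << bits) - 1)

userspace = (3212, 3214)   # the two printf runs
kernel = (9283, 9367)      # the two getpid() runs

for name, (a, b) in (("userspace", userspace), ("kernel", kernel)):
    bits = bits_to_mask(a, b)
    print(f"{name}: mask {bits} bits -> {bucket(a, bits)} == {bucket(b, bits)}")
```

The printf runs agree after dropping 2 bits, but the getpid() runs only agree after dropping 8, i.e. buckets of 256 instructions. A second pair of runs could easily need more.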

That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.


Reach the Top With NetworkManager 1.0.2

Summit - Asbjørn Floden (CC BY-NC 2.0)

Just this morning Lubomir released NetworkManager 1.0.2, the latest of the 1.0 stable series. It’s a great cleanup and bugfix release with contributions from lots of community members in many different areas of the project!

Some highlights of new functionality and fixes:

  • Wi-Fi device band capability indications, requested by the GNOME Shell team
  • Devices set to ignore carrier that use DHCP configurations will now wait a period of time for the carrier to appear, instead of failing immediately
  • Startup optimizations allow networking-dependent services to be started much earlier by systemd
  • Memory usage reductions through many memory leak fixes and optimizations
  • teamd interface management is now more robust and teamd is respawned when it terminates
  • dnsmasq is now respawned when it terminates in the local caching nameserver configuration
  • Fixes for an IPv6 DoS issue CVE-2015-2924, similar to one fixed recently in the kernel
  • IPv6 Dynamic DNS updates sent through DHCP now work more reliably (and require a fully qualified name, per the RFCs)
  • An IPv6 router solicitation loop due to a non-responsive IPv6 router has been fixed

While the list of generally interesting enhancements may be short, it masks 373 git commits and over 50 bugzilla issues fixed.  It’s a great release and we recommend that everyone upgrade.

Next up is NetworkManager 1.2, with DNS improvements, Wi-Fi scanning and AP list fixes for mobile uses, NM-in-containers improvements (no udev required!), even less dependence on the obsolete dbus-glib, less logging noise, device management fixes, continuing removal of external dependencies (like avahi-autoipd), configuration reload-ability, and much more!

May 04, 2015

kernel code coverage brain dump.

Someone at work recently asked me about code coverage tooling for the kernel. I played with this a little last year. At the time I was trying to figure out just how much of certain syscalls trinity was exercising. I ended up being a little disappointed at the level of post-processing tools to deal with the information presented, and added some things to my TODO list to find some time to hack up something, which quickly bubbled its way to the bottom.

As I did a write-up based on past experiences with this stuff, I figured I’d share.

requires kernel built with
Note: Setting GCOV_PROFILE_ALL incurs some performance penalty, so any resulting kernel built with this option should _never_ be used for any kind of performance tests.
I can’t exaggerate this enough, it’s miserably slow. Disk operations that took minutes for me now took hours. As an example, here is the same dd run, first without and then with GCOV_PROFILE_ALL:


# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.409712 s, 1.3 GB/s
0.00user 0.40system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 2980maxresident)k
136inputs+1024000outputs (1major+340minor)pagefaults 0swaps


# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.17212 s, 84.9 MB/s
0.00user 7.17system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2940maxresident)k
0inputs+1024000outputs (0major+338minor)pagefaults 0swaps

From 0.41 seconds to over 7 seconds. Ugh.

If we *didn’t* set GCOV_PROFILE_ALL, we’d have to recompile just the files we cared about with the relevant gcc profiling switches. It’s kind of a pain.

For all this to work, gcov expects to see a source tree, with:

  • .o objects
  • source files
  • .gcno files (these are generated during the kernel build)
  • .gcda files containing the runtime counters. These come from sysfs on the running kernel.

After booting the kernel, a subtree appears in sysfs at /sys/kernel/debug/gcov/
These directories mirror the kernel source tree, but instead of source files they contain files that can be fed to the gcov tool. There will be a .gcda file, and a .gcno symlink back to the source tree (with complete path). I.e., the mm directory in that subtree contains, for example (among others..)

-rw------- 1 root root 0 Mar 24 11:46 readahead.gcda
lrwxrwxrwx 1 root root 0 Mar 24 11:46 readahead.gcno -> /home/davej/build/linux-dj/mm/readahead.gcno

It is likely the symlink will be broken on the test machine, because the path doesn’t exist there, unless you (for example) NFS-mount the source tree that the kernel was built from.

I hacked up the script below, which may or may not be useful for anyone else (honestly, it’s way easier to just use nfs).
Run it from within a kernel source tree, and it will populate the source tree with the relevant gcda files, and generate the .gcov output file.

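#!/bin/sh
# Usage: run from the top of the kernel source tree, passing one .c file.
obj=$(echo "$1" | sed 's/\.c$/\.o/')
# Skip source files that were never compiled into this kernel.
if [ ! -f "$obj" ]; then
  exit 0
fi

pwd=$(pwd)
dirname=$(dirname "$1")
gcovfn=$(echo "$(basename "$1")" | sed 's/\.c$/\.gcda/')
if [ -f "/sys/kernel/debug/gcov$pwd/$dirname/$gcovfn" ]; then
  # Copy the runtime counters next to the source, then generate the report.
  cp "/sys/kernel/debug/gcov$pwd/$dirname/$gcovfn" "$dirname"
  gcov -f -r -o "$1" "$obj"
  if [ -f "$(basename "$1").gcov" ]; then
    mv "$(basename "$1").gcov" "$dirname"
  fi
else
  echo "no gcov data for /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn"
fi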

Take that script, and run it on every .c file in the tree like so (<path-to-script> is wherever you saved it)..

$ cd kernel-source-tree
$ find . -type f -name "*.c" -exec <path-to-script> "{}" \;

Running for eg, mm/mmap.c will cause gcov to spit out a mmap.c.gcov file (in the current directory) that has coverage information that looks like..

   135684:  269:static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
        -:  270:{
   135684:  271:        struct vm_area_struct *next = vma->vm_next;
        -:  272:
   135684:  273:        might_sleep();
   135686:  274:        if (vma->vm_ops && vma->vm_ops->close)
     5080:  275:                vma->vm_ops->close(vma);
   135686:  276:        if (vma->vm_file)
    90302:  277:                fput(vma->vm_file);
        -:  278:        mpol_put(vma_policy(vma));
   135686:  279:        kmem_cache_free(vm_area_cachep, vma);
   135686:  280:        return next;
        -:  281:}

The numbers on the left being the number of times that line of code was executed.
Lines beginning with ‘-‘ have no coverage information for whatever reason.
If a branch is not taken, it gets prefixed with ‘#####’, like so..

  4815374:  391:                if (vma->vm_start < pend) {
    #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
        -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }
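Since that format is just count:lineno:source, it is easy to pull apart. Here is a rough sketch of a parser that handles only the three count forms shown here (a number, ‘-’, and ‘#####’); anything else gcov might emit is not handled, and the function names are mine.

```python
def parse_gcov_line(line):
    """Split one .gcov line into (hits, lineno, source).

    hits is None for '-' (no code), 0 for '#####' (never executed),
    otherwise the hit count.
    """
    count, lineno, source = line.split(":", 2)
    count = count.strip()
    if count == "-":
        hits = None
    elif count == "#####":
        hits = 0
    else:
        hits = int(count)
    return hits, int(lineno), source

def coverage(lines):
    """Return (executed, executable) line totals."""
    hits = [h for h, _, _ in map(parse_gcov_line, lines) if h is not None]
    return sum(1 for h in hits if h > 0), len(hits)

# The snippet from mm/mmap.c shown above.
sample = r"""
  4815374:  391:                if (vma->vm_start < pend) {
    #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
        -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }
""".strip("\n").splitlines()

executed, executable = coverage(sample)
print(f"{executed}/{executable} executable lines hit")
```

For the five lines above this reports 1/2: only lines 391 and 392 carry counters, and only 391 was ever executed.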

There are some cases that need a little more digging to explain. eg:

    88105:  237:static void __remove_shared_vm_struct(struct vm_area_struct *vma,
        -:  238:                struct file *file, struct address_space *mapping)
        -:  239:{
    88105:  240:        if (vma->vm_flags & VM_DENYWRITE)
    15108:  241:                atomic_inc(&file_inode(file)->i_writecount);
    88105:  242:        if (vma->vm_flags & VM_SHARED)
        -:  243:                mapping_unmap_writable(mapping);
        -:  244:
        -:  245:        flush_dcache_mmap_lock(mapping);
    88105:  246:        vma_interval_tree_remove(vma, &mapping->i_mmap);
        -:  247:        flush_dcache_mmap_unlock(mapping);
    88104:  248:}

In this example, lines 245 & 247 have no hitcount, even though there’s no way they could have been skipped.
If we look at the definition of flush_dcache_mmap_(un)lock, we see..
#define flush_dcache_mmap_lock(mapping) do { } while (0)
So the compiler never emitted any code, and hence, it gets treated the same way as the blank lines.

There is a /sys/kernel/debug/gcov/reset file that can be written to to reset the counters before each test if desired.

Additional thoughts

  • Not sure how inlining affects things.
  • There needs to be some element of post-processing, to work out percentages of code coverage etc, which may involve things like stripping out comments/preprocessor defines.
  • debug kernels differ in functionality in various low level features. For example LOCKDEP will fundamentally change the way spinlocks work. For coverage purposes though, we can choose to not care and stop drilling down at certain levels.
  • Whatever does the post-processing of results may need to aggregate results from multiple test machines. Think of the situation where we’re running a client/server test: Both machines will be running different code paths.
  • ggcov has some interesting looking tooling for visually displaying results.
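As a rough idea of what the aggregation step could look like, this sketch sums per-line hit counts for the same source file across reports from several machines, then works out a percentage. The two reports are made-up data for a client and a server, and it assumes every machine ran the same kernel build so the line numbers match.

```python
def line_hits(gcov_text):
    """Map line number -> hit count for the executable lines of one .gcov file."""
    hits = {}
    for line in gcov_text.splitlines():
        count, lineno, _ = line.split(":", 2)
        count = count.strip()
        if count == "-":
            continue  # no code was generated for this line
        hits[int(lineno)] = 0 if count == "#####" else int(count)
    return hits

def merge(reports):
    """Sum per-line hit counts from several machines' reports for one file."""
    total = {}
    for report in reports:
        for lineno, n in line_hits(report).items():
            total[lineno] = total.get(lineno, 0) + n
    return total

def percent_covered(hits):
    """Percentage of executable lines hit at least once (hits must be non-empty)."""
    return 100.0 * sum(1 for n in hits.values() if n > 0) / len(hits)

# Made-up reports for the same file from a client and a server machine.
client = "\n".join([
    "       12:   10:        send_request();",
    "    #####:   11:        handle_request();",
    "        -:   12:}",
])
server = "\n".join([
    "    #####:   10:        send_request();",
    "        3:   11:        handle_request();",
    "        -:   12:}",
])

print(percent_covered(line_hits(client)))         # one machine on its own
print(percent_covered(merge([client, server])))   # both machines combined
```

Each machine alone covers half of the executable lines here; merged, they cover all of them, which is the client/server situation described above.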


May 01, 2015

Trinity socket improvements

I’ve been wanting to get back to working on the networking related code in trinity for a long time. I recently carved out some time in the evenings to make a start on some of the lower hanging fruit.

Something that has bugged me for a while is that we create a bunch of sockets on startup, and then when we call, for example, setsockopt() on one of them, the socket options we pass are more likely than not to be for a different protocol than the one the socket was created for. This isn’t always a bad thing; for example, one of the oldest kernel bugs trinity found was found by setting TCP options on a non-TCP socket. But doing this the majority of the time is wasteful, as we’ll just get -EINVAL most of the time.

We actually have the necessary information in trinity to know what kind of socket we were dealing with in a socketinfo struct.

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

struct socketinfo {
        struct socket_triplet triplet;
        int fd;
};

We just had it at the wrong level of abstraction. setsockopt only ever saw a file descriptor. We could have searched through the fd arrays looking for the socketinfo that matched, but that seems like a lame solution. So I changed the various networking syscalls to take an ARG_SOCKETINFO instead of an ARG_FD. As a side-effect, we actually pass sockets to those syscalls more than say, a perf fd, or an epoll fd, or ..

There is still a small chance we pass some crazy fd, just to cover the crazy cases, though those cases don’t tend to trip things up much any more.

After passing down the triplet, it was a simple case of annotating the structures containing the various setsockopt function pointers to indicate which family they belonged to. AF_INET was the only complication, which needed special casing due to the multiple protocols for which we have setsockopt() functions. Creation of a second table, using the protocol instead of the family was enough for the matching code.
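The matching logic amounts to a two-level lookup. This is a sketch of the idea only: the generator functions and table names are invented for illustration, and the real tables are C structures holding function pointers.

```python
import socket

# Hypothetical option generators; in trinity these would be function pointers.
def ip_opts():
    return "ip option"

def tcp_opts():
    return "tcp option"

def udp_opts():
    return "udp option"

def inet6_opts():
    return "ipv6 option"

# First table: keyed by address family.
BY_FAMILY = {socket.AF_INET6: inet6_opts}

# Second table: AF_INET has several protocols, so key on protocol instead.
INET_BY_PROTOCOL = {
    socket.IPPROTO_IP: ip_opts,
    socket.IPPROTO_TCP: tcp_opts,
    socket.IPPROTO_UDP: udp_opts,
}

def pick_sockopt(family, type_, protocol):
    """Return a generator matching the socket's triplet, or None if unknown."""
    if family == socket.AF_INET:
        return INET_BY_PROTOCOL.get(protocol)
    return BY_FAMILY.get(family)

gen = pick_sockopt(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
print(gen())
```

When nothing matches, the caller would presumably fall back to picking an option at random, which keeps the occasional wrong-protocol case alive.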

There are still a ton of improvements I want to make to this code, but it’s going to take a while, so it’s good when some mostly trivial changes like the above come together quickly.


April 14, 2015

the more things change.. 4.0

$ ping gelk
PING ( 56(84) bytes of data.
WARNING: kernel is not very fresh, upgrade is recommended.
$ uname -r

Remember that one time the kernel versioning changed and nothing in userspace broke ? Me either.

Why people insist on trying to think they can get this stuff right is beyond me.


update: this was already fixed, almost exactly a year ago in the ping git tree. The (now removed) commentary kind of explains why they cared. Sigh.


March 31, 2015

Official GNOME SDK runtime builds are out

As people who have followed the work on sandboxed applications know, we have promised a developer preview for GNOME 3.16. Well, 3.16 has now been released, so the time is now!

I spent last week setting up a build system on the GNOME infrastructure, and the output of this is finally available at:

This repository contains the GNOME 3.16 runtime, org.gnome.Platform, as well as a smaller one that is useful for less integrated apps (like games) called org.freedesktop.Platform. It also has corresponding development runtimes (org.gnome.Sdk and org.freedesktop.Sdk) that you can use to create applications for the platforms.

This is a developer preview, so consider these builds weakly supported. This means I will try to keep them somewhat updated if there are major issues and that I will keep them API and ABI stable. I will probably also pick up at least some 3.16.x minor releases as they are released.

I also did the first official release of xdg-app. For easy testing this is available for Fedora 21 and 22 as a copr repo.

Testing the SDK

Using the repo above makes it really easy to test this. Just install the xdg-app package from copr, log out+in (needed to update the environment for the session), then follow these instructions (as a regular user):

  1. Install the Gnome SDK public key into /usr/share/ostree/trusted.gpg.d (or alternatively, use --no-gpg-verify when you add the remote below).
  2. Install the basic Gnome and freedesktop runtimes:
    $ xdg-app add-remote --user gnome-sdk
    $ xdg-app install-runtime --user gnome-sdk org.gnome.Platform 3.16
    $ xdg-app install-runtime --user gnome-sdk org.freedesktop.Platform 1.0
  3. Optionally install some locale packs:
    $ xdg-app install-runtime --user gnome-sdk 3.16
    $ xdg-app install-runtime --user gnome-sdk 1.0
  4. Install some apps from my repository of test apps:
    $ xdg-app add-remote --user --no-gpg-verify test-apps
    $ xdg-app install-app --user test-apps org.gnome.gedit
    $ xdg-app install-app --user test-apps org.freedesktop.glxgears
  5. Run the apps! You should find gedit listed among the regular applications in the shell as it exports a desktop file. But you can also run them manually like this:
    $ xdg-app run org.gnome.gedit
    $ xdg-app run org.freedesktop.glxgears
  6. I also packaged the latest gnome builder from git. It requires the full sdk which takes a bit longer to download:
    $ xdg-app install-runtime --user gnome-sdk org.gnome.Sdk 3.16
    $ xdg-app install-app --user test-apps org.gnome.Builder

All the above install the apps into your home directory (in ~/.local/share/xdg-app). You can also run the commands as root and skip the --user arguments to do system-wide application installs.

Future work

With the basics now laid down to run current applications in a minimally isolated environment the next step is to work on the sandboxing aspects more. This will require lots of work, both in the system side (things like kdbus), the desktop (add sandbox aware APIs, make pulseaudio protect clients from each other, etc)  and in modifying applications.

If you’re interested in this, you can follow the work on the wiki.

Building your own apps

If you download the SDKs you have enough tooling to build your own applications. There is some documentation on how to do this here.

I also created a git repository with the scripts I used to build the test applications above. It uses the gnome-sdk-bundles repository which has some tooling and specfiles to easily bundle dependencies with the application.

Building the SDK

If you ever want to build the SDK yourself, it is available at:

This repository contains the desktop specific parts of the SDK, which is layered on a core Yocto layer. When you build the SDK this will be automatically checked out and built from:

However, if you don’t want to build all of this you can download the pre-built images and put them in the freedesktop-sdk-base/images/x86_64 subdirectory of gnome-sdk-images. This can save you a lot of time and space.

March 22, 2015

Fedora at Midwest Rep Rap Fest 2015

I attended Midwest Rep Rap Fest 2015 this weekend, in Goshen, Indiana. Goshen is about 45 minutes outside of South Bend (the nearest regional airport). This part of Indiana is noteworthy for a few reasons, including the fact that Matthew Miller, the Fedora Project Leader, is from there. It also has a very large Amish population, which makes it one of the few places I've attended a conference where most of the local businesses have a place to tie up your horses.

The Midwest Rep Rap Fest is an event dedicated to Open Source 3d printers (and their surrounding ecosystem). The primary sponsor of the event is SeeMeCNC, a local vendor that makes open source hardware delta 3d printers. A Delta printer is a 3d printer with a circular stationary bed. Attached to the bed are three vertical rods which serve as tracks for three geared motors. The motors move up and down the rods, and are connected to a central extruder which hangs down the center. The extruder is moved in three dimensions by moving the supports along their tracks. Watching a Delta 3d printer do its thing is pretty amazing; it seems to dance like a trapeze artist as it dips and swoops to print the object.

The Delta type of 3d printer was the most common printer at the event, many people had either bought SeeMeCNC printers or had built their own off their open source design. The SeeMeCNC team brought their super-sized Delta, which they think is the largest Delta printer in the world. It was easily 30 feet tall and barely fit in the building we were using (which is saying something, because we were in an exhibition hall at the local state fairgrounds). The owner of the company decided to see how big of a Delta printer he could build, and this was the result!

The printer used a shop vac to blow plastic pellets up a plastic hose into the giant heated end. Originally, they were trying to print a giant model of Groot (shown in progress in my picture above), but they had to leave it running overnight on Friday and when we came back Saturday morning, the print had failed because it had run out of plastic pellets! Later on, they printed a very large basket/vase with it (after fixing it so that it wouldn't run out of plastic).

Fedora had a table in the main room. I brought two open source 3d printers from Lulzbot and controlled them both from my laptop running Fedora 21. My larger printer, the Taz 4, was configured with a dual extruder addon, and I spent four hours on Friday calibrating it to print properly. On Saturday morning, I printed my first completely successful dual color print, a red and white tree frog!

The eyes didn't come out perfect, but it all came out aligned and in one piece. Several people offered me tips and advice on how to improve the print quality with the dual-extruder setup. One of the nice things about the Rep Rap fest was the extremely friendly nature of the community. Everyone was eager to help everyone else solve problems or improve their printers/prints. I used Pronterface to control the Taz 4, since it was better suited to handle the dual extruder controls.

My smaller printer, the Lulzbot Mini, was controlled with Cura-Lulzbot (a package which got added to Fedora a few days before the show!). Cura has a very fast and high quality slicer, but with fewer options for tweaking it than slic3r (the traditional open source slicing tool) has. 3d printers depend on a slicing tool to take a 3d model and convert it into the GCode machine instructions that tell the printer where to move and when to extrude plastic. Cura also has a more polished UI than Pronterface.

The Lulzbot Mini is able to self level, self clean, and self calibrate, which almost eliminates the prep time before a print! One of the vendors at the show was Taulman, who is constantly innovating new filaments for 3d printing. They announced a new filament the weekend of the Rep Rap Fest, 910, and they gave me a sample to try out on the Mini. The Mini can print filaments with a melting point of 300 degrees Celsius or less, so it was well suited for the 910. 910 was interesting because it was incredibly strong, almost as good as polycarbonate! It was also translucent, which made it ideal for me to finish a project I've been working on for a long time: my 3d printed TARDIS model!

I printed four window panels and a topper piece for the lantern on the roof. A few other people had TARDIS models (including one that had storage drawers inside it), but mine was the biggest (and I think, the nicest).

One of Fedora's neighbors was mUVe, an open source SLA 3d printer. SLA 3d printers use a liquid resin and a DLP projector to make incredibly accurate 3d models that would be difficult or impossible to print on other kinds of 3d printers. It seemed like everyone was printing the same Groot model at the event, and they printed one that came out looking incredible. The inventor of the hardware was working their table, and we talked for a while about the importance of open source in hardware. He felt strongly that it was mandatory for him to release his work into open source so that other people could innovate and improve upon the designs he'd created. The mUVe printer was one of the largest SLA printers I've ever seen and the quality of its prints was amazing. The biggest downside is the complexity, it involves chemicals in the resin and in curing the prints once they have finished, but in my opinion, it was worth it. The cost was in the $1500-2000 price range, but he said he's working on something awesome that will bring that cost down. They used Creation Workshop to slice and control their printer, which was new to me, but it was also open source. It's C# though, but I want to see if I can get it working in Mono on Fedora. (They were also in the greater Detroit area, so I encouraged them to come out and demo it at Penguicon!)

Another neighbor had 3d printed an amazingly intricate "home clock". They had used a famous woodworking pattern, converted each of the pieces to a 3d model, then printed them. Each piece was then smoothed and attached together. The only piece they didn't print was the clock at the center! On the table, the top of the clock was taller than me (and I'm 6'4"). It didn't look 3d printed, it looked too nice! It took them 3 months to print it all. The owner said that if you're able to cut this model from wood and assemble it properly, you're considered to be a master in their community. Everyone was definitely in awe of it in this community.

It seemed like everyone showing off something at this event had a clever hack of their own. Some people were creating amazing models, some people had built new open source printers. One printer had color changing LED strips attached underneath it which changed from red to green to indicate the progress of the printing job. Another printer had a Raspberry Pi with camera wired into it so you had a "printer's eye view" as it printed. There was a custom 3d scanner designed to scan people's heads and torsos to make printable busts. There was even a printer that looked like some sort of industrial robot gone mad! The one thing these all had in common? They were open source. No one here was questioning open source, it was just the way they operated, sharing what they knew and building off each other's successes (and failures). There were a few MakerBot Replicators, but all of them had been hacked in some way.

Attendance at this year's event was both up and down. There were more people and companies exhibiting at the event, including Texas Instruments, Hackaday, Lulzbot, Taulman, and Printed Solid. Printed Solid was giving out free samples of some amazing ColorFabb filament. I came home with some BronzeFill (prints into a bronze like material that when polished is heavy and shiny), a new flexible filament, and some carbon-fiber infused filament! They also had some really fantastic glow in the dark filament, but no samples of that were available (and I didn't have the spare cash to buy a full spool). General attendance at the event was about 750 people, which was down from last year (around 1000). The general consensus was that the event wasn't doing all it could to advertise itself, and the location wasn't exactly optimal (45 minutes from the nearest regional airport, almost 2 hours from a major airport). The majority of visitors were local to the Indiana/Michigan area. The event staff said that next year they plan on rebranding the event to a more general FOSS 3d printing event (not limiting themselves to the Midwest region of the US). I think that is the right decision, since they are the only open source 3d printing event that I'm aware of, and I'd really love to see them grow into something bigger and more accessible.

Oh, did I mention we had a celebrity at the event? Ben Heck was there with his Delta printer! He's built a pinball machine. I might want to be him a little bit (but I'm not). He was very friendly and cool, spent a lot of time talking to the other makers and attendees.

Thanks to Ben Williams, Fedora had a very nice booth setup. We had our Fedora tablecloth and lots of stickers to give away. I brought a good sampling of models I'd printed with Fedora and my 3d printers, and I had a lot of good conversations about using Linux and open source to power 3d printing and 3d model creation. My coworker (and celebrity writer) Brian Proffitt stopped by on Saturday and helped out at the table for a while. I was supposed to have Fedora 21 media to hand out, but the promised shipment never arrived. The computers there were a mix of Windows and Linux, very few Macs in this community. Several people were using Fedora, but most of the Linux instances were Debian.

The Fedora event box needs a little love, there wasn't very much in it that was useful anymore. The OLPC in it is very old now, and since the current OLPC hardware runs Android these days, it isn't as "cool" as it used to be. I restocked it with Fedora bubble stickers, but it probably needs a plan to revitalize it.

All in all, it was a very fun weekend event and a great opportunity to connect with the open source 3d printer community. I think it is the responsibility of Fedora (and Red Hat) to reach out to the maker communities and help them be open source in their own ways, and this was an excellent opportunity to do exactly that. Is there a Maker event happening somewhere near you? You can sign up to represent Fedora at that event like I did at MRRF: Fedora Event Calendar

March 16, 2015

virgil3d local rendering test harness

So I've still been working on the virgil3d project along with part time help from Marc-Andre and Gerd at Red Hat, and we've been making steady progress. This post is about a test harness I just finished developing for adding and debugging GL features.

So one of the more annoying issues with working on virgil has been that while adding 3D renderer features or trying to track down a piglit failure, you generally have to run a full VM to do so. This adds a long round trip in your test/development cycle.

I'd always had the idea to do some sort of local system renderer, but there are some issues with calling GL from inside a GL driver. So my plan was to have a renderer process which loads the renderer library that qemu loads, and a mesa driver that hooks into the software rasterizer interfaces. So instead of running llvmpipe or softpipe I have a virpipe gallium wrapper, that wraps my virgl driver and the sw state tracker via a new vtest winsys layer for virgl.

So the virgl pipe driver sits on top of the new winsys layer, and the new winsys instead of using the Linux kernel DRM apis just passes the commands over a UNIX socket to a remote server process.

The remote server process then uses EGL and the renderer library, forks a new copy for each incoming connection and dies off when the rendering is done.

The final rendered result has to be read back over the socket, and then the sw winsys is used to putimage the rendering onto the screen.
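To illustrate the round trip (commands out over a UNIX socket, rendered result read back), here's a toy sketch. It is not the real vtest protocol: the length-prefixed framing is made up, the "renderer" just upper-cases its input, and a thread stands in for the forked server process to keep the example short.

```python
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer hangs up early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf

def send_msg(sock, payload):
    # Made-up framing: 4-byte little-endian length, then the payload.
    sock.sendall(struct.pack("<I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack("<I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def fake_renderer(sock):
    """Stand-in for the server side: 'renders' by upper-casing the commands."""
    with sock:
        send_msg(sock, recv_msg(sock).upper())

def render_remotely(commands):
    client, server = socket.socketpair()  # a connected AF_UNIX pair on Linux
    worker = threading.Thread(target=fake_renderer, args=(server,))
    worker.start()
    with client:
        send_msg(client, commands)   # submit the command stream
        result = recv_msg(client)    # read the "rendered" result back
    worker.join()
    return result

print(render_remotely(b"clear; draw triangle"))
```

The point is only the data flow: everything the driver would normally hand to the kernel goes down the socket, and the finished image has to come back the same way before it can be put on screen.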

So this system is probably going to be slower in raw speed terms, but for developing features or debugging fails it should provide an easier route without the overheads of the qemu process. I was pleasantly surprised it only took two days to pull most of this test harness together which was neat, I'd planned much longer for it!

The code lives in two halves. virgl-mesa-driver

[updated: pushed into the main branches]

Also, the virglrenderer repo is standalone now. It has a bunch of unit tests in it, which are also run under valgrind, in an attempt to lock down some more corners of the API and test for possible ways to escape the host.

March 13, 2015

LSF/MM 2015 recap.

It’s been a long week.
Spent Monday/Tuesday at LSFMM. This year it was in Boston, which was convenient in that I didn’t have to travel anywhere, but less convenient in that I had to get up early and do a rush-hour commute to get to the conference location in time. At least the weather got considerably better this week compared to the frankly stupid amount of snow we’ve had over the last month.
LWN did their usual great write-up which covers everything that was talked about in a lot more detail than my feeble mind can remember.

A lot of things from last year’s event seem to still be getting a lot of discussion. SMR drives & persistent memory being the obvious stand-outs. Lots of discussion surrounding various things related to huge pages (so much so one session overran and replaced a slot I was supposed to share with Sasha, not that I complained. It was interesting stuff, and I learned a few new reasons to dislike the way we handle hugepages & forking), and I lost track of how many times the GFP_NOFAIL discussion came up.

In a passing comment in one session, one of the people Intel sent (Dave Hansen iirc) mentioned that Intel are now shipping an 18 core/36 thread CPU. A bargain at just $4642. Especially when compared to this madness.

A few days before the event, I had been asked if I wanted to do a “how Akamai uses Linux” type talk at LSFMM, akin to what Chris Mason did re: facebook at last year’s event. I declined, given I’m still trying to figure that out myself. Perhaps another time.

Wednesday/Thursday, I attended Vault at the same location.
My takeaways:

  • There were generally a lot more positive vibes around btrfs this year. Even with Josef playing bad cop to Chris’ good cop talk, things generally seemed to be moving away from an “everything is awful” toward “this actually works…” though with the qualifier “.. for facebook’s workload”. Josef did touch on one area where btrfs does still suck, which apparently is database workloads (iirc, due to the copy-on-write nature of btrfs). The spurious ENOSPC failures of the past should hopefully stay in the past. Things generally on the up and up. (Though, this does include the linecount, which has now passed 100KLOC, more than double that of XFS or ext*. Scary).
  • Equally positive vibes surrounding XFS. We celebrated the 20 year anniversary at one evening event, making us all feel just that little bit more like an old fart club. Interesting talk toward the end by Dave Chinner about the future of XFS, and how the current surge of development in XFS is probably its last for various scaling reasons as disks continue to get bigger and bigger. Predicting the future is always hard, but if what Dave said was true, things will start to get ‘interesting’ in about 5 years time, given every other filesystem we support in Linux has the same issues (or worse).
  • People still care a lot about NFS. Especially pNFS. Surprising amount of activity still happening.
  • Even when I worked there, I never really got Red Hat’s “big picture” wrt the several distributed filesystems they supported. Now that I’m not there, I feel even more out of the loop. “ceph is the way forward” “except when it’s glusterfs” or something. Oh, and GFS2 is still a thing apparently, for some reason.
  • As entertaining as Jeremy Allison might be, don’t go to a talk on Samba internals unless you work on it (in which case it’s too late for you). The horrors will likely keep you up at night.
  • Ted’s ext4 talk drew a decent crowd. As fancy as btrfs/xfs etc might be, a *lot* of people still give a crap about extN. Somehow I missed the addition of the ‘lazytime’ option to ext4. Seems neat. Played with it (and also the super-secret ‘dioread_nolock’ mount option). Saw another talk on orphan list scalability in ext4, which was interesting, but didn’t draw as big a crowd.

I got asked “What are you doing at Akamai ?” a lot. (answer right now: trying to bring some coherence to our multiple test infrastructures).
Second most popular question: “What are you going to do after that ?”. (answer: unknown, but likely something more related to digging into networking problems rather than fighting shell scripts, perl and Makefiles).

All that, plus a lot of hallway conversations, long lunches, and evening activities that went on possibly a little later than they should have, led to me almost losing my voice today.
Really good use of time though. I had fun, and it’s always good to catch up with various people.


March 03, 2015

Trinity 1.5 release.

As announced this morning, today I decided that things had slowed down (to an almost standstill of late) enough that it was worth making a tarball release of Trinity, to wrap up everything that’s gone in over the last year.

The email linked above covers most of the major changes, but a lot of the change over the last year has actually been groundwork for those features. Things like..

  • The post-mortem dumper needed the generation of the text and the writing to log files to be decoupled, which wasn’t particularly trivial.
  • Some features involved considerable rewrites. The fd generators are now pretty much isolated from each other, making adding a new one a simple task.
  • Handling of the mapping structs got a lot of cleanup (though there is definitely still a lot of room for improvement there, especially when we do things like splitting a mapping).
  • I should also mention the countless hours spent chasing down quite a few hard-to-reproduce bugs that are fixed in 1.5

As I mentioned in the announcement, I don’t see myself having a huge amount of time for at least this year to work on Trinity. I’ve had a number of people email me asking the status of some feature. Hopefully this demarcation point will answer the question.

So, it’s not abandoned, it just won’t be seeing the volume of change it has over the last few years. I expect my personal involvement will be limited to merging patches, and updating the syscall lists when new syscalls get added.

Trinity used to be on roughly a six month release schedule. We’ll see if by the end of the year there’s enough input from other people to justify doing a 1.6 release.

I’m also hopeful that time working on other projects mean I’ll come back to this at some point with fresh eyes. There are a number of features I wanted to implement that needed a lot more thought. Perhaps working on some other things for a while will give me the perspective necessary to realize those features.


February 23, 2015

backup solutions.

For the longest time, my backup solution has been a series of rsync scripts that have evolved over time into a crufty mess. Having become spoiled on my mac with time machine, I decided to look into something better that didn’t involve a huge time investment on my part.

The general consensus seemed to be that for ready-to-use home-nas type devices, the way to go was either Synology, or Drobo. You just stick in some disks, and set up NFS/SAMBA etc with a bunch of mouse clicking. Perfect.

I had already decided I was going to roll with a 5 disk RAID6 setup, so bit the bullet and laid down $1000 for a Synology 8-Bay DS1815+. It came *triple* boxed, unlike the handful of 3TB HGST drives.
I chose the HGST’s after reading backblaze’s report on failure rates across several manufacturers, and figured that after the RAID6 overhead, 8TB would be more than enough for a long time, even at the rate I accumulate flac and wav files. Also, worst case, I still had 3 spare bays I could expand into later if needed.

Installation was a breeze. The plastic drive caddies felt a little flimsy, but the drives were secure once in them, even if they did feel like they were going to snap as I flexed them to pop them into place. After putting in all the drives, I connected the four ethernet ports and powered it up.
After connecting to its web UI, it wanted to do a firmware update, like just about every internet connected device wants to do these days. It rebooted, and finally I could get about setting things up.

On first logging into the device over ssh, I think the first command I typed was uname. Seeing a 3.2 kernel surprised me a little. I got nervous thinking about how many VFS,EXT4,MD bugfixes hadn’t made their way back to long-term stable, and got the creeps a little. I decided to not think too much about it, and put faith in the Synology people doing backports (though I never got as far as looking into their kernel package).

The web ui is pretty slick, though felt a little sluggish at times. I set up my RAID6 volume with a bunch of clicks, and then listened as all those disks started clattering away. After creation, it wanted to do an initial parity scan. I set it going, and went to bed. The next morning before going to work, I checked on it, and noticed it wasn’t even at 20% done. I left it going while I went into the office the next day. I spent the night away from home, and so didn’t get back to it until another day later.

When I returned home, the volume was now ready, but I noticed the device was now noticeably hotter to touch than I remembered. I figured it had been hammering the disks non-stop for 24hrs, so go figure, and that it would probably cool off a little as it idled. As the device was now ready for exporting, I set up an nfs export, and then spent some time fighting uid mappings, as you do. The device does have the ability to deal with LDAP and some other stuff that I’ve never had time to set up, so I did things the hard way. Once I had the export mounted, I started my first rsync from my existing backups.

While it was running, I remembered I had intended to set up bonding. A little bit of clicky-clicky later, it was done, and transfers started getting even faster. Very nice. I set up two bonds, with a pair of NICs in each. Given my desktop only has a dual NIC, that was good enough. Having a 2nd 2GigE bond I figured was nice in case I had multiple machines wanting to use it while I was doing a backup.

So the backup was going to take a while, so I left it running.
A few hours later, I got back to it, and again, it was getting really hot. There are two pretty big fans in the back of the unit, and they were cranking out heat. Then, things started getting really weird. I noticed that the rsync had hung. I ctrl-c’d it, and tried logging into the device as root. It took _minutes_ to get a command prompt. I typed top and waited. About two minutes later top started. Then it spontaneously rebooted.

When it came back up, I logged in, and poked around the log files, and didn’t see anything out of the ordinary.
I restarted the rsync, and left it go for a while. About 20 minutes later, I came back to check on it again, and found that the box had just hung completely. The rsync was stalled, I couldn’t ssh in. I rebooted the device, cursed a bit, and then decided to think about it for a while, so never restarted the rsync. I clicked around in the interface, to see if there was anything I could turn on/off that would perhaps give me some clues wtf was going on.
Then it rebooted spontaneously again.

It was about this time I was ready to throw the damn thing out the window. I bought this thing because I wanted a turn-key solution that ‘just worked’, and had quickly come to realize that with this device when something went bad, I was pretty screwed. Sometimes “It runs Linux” just isn’t enough. For some people, the Synology might be a great solution, but it wasn’t for me. Reading some of the Amazon reviews, it seems there were a few people complaining about their units overheating, which might explain the random reboots I saw. For a device I wanted to leave switched on 24/7 and never think about, something that overheats (especially when I’m not at home) really doesn’t give me feel good vibes. Some of the other reviews on Amazon rave about the DS1815+. It may be that there was a bad batch, and I got unlucky, but I felt burnt on the whole experience, and even if I had got a replacement, I don’t know if I would have felt like I could have trusted this thing with my data.

I ended up returning it to Amazon for a refund, and used the money to buy a motherboard, cpu, ram etc to build a dedicated backup computer. It might not have the fancy web ui, and it might mean I’ll still be using my crappy rsync scripts, but when things go wrong, I generally have a much better chance of fixing the problems.

Other surprises: When I opened the unit up to install an extra 4GB of RAM (it comes with just 2GB by default), I noticed that it runs off a single 250W power supply, which seemed surprising to me. I thought disks during spin-up used considerably more power, but apparently they’re pretty low power these days.

So, two weeks of wasted time, frustration, and failed experiments. Hopefully by next week I’ll have my replacement solution all set up and can move on to more interesting things instead of fighting appliances.


February 17, 2015

First fully sandboxed Linux desktop app

It’s not a secret that I’ve been working on sandboxed desktop applications recently. In fact, I recently gave a talk about it. However, up until now I’ve mainly been focusing on the bundling and deployment aspects of the problem. I’ve been running applications in their own environment, but having pretty open access to the system.

Now that the basics are working it’s time to start looking at how to create a real sandbox. This is going to require a lot of changes to the Linux stack. For instance, we have to use Wayland instead of X11, because X11 is impossible to secure. We also need to use kdbus to allow desktop integration that is properly filtered at the kernel level.

Recently Wayland has made some pretty big strides though, and we now have working Wayland sessions in Fedora 21. This means we can start testing real sandboxing for simple applications. To get something running I chose to focus on a game, because they require very little interaction with the system. Here is a video I made of Neverball, running in a minimal sandbox:

[Embedded video: Neverball running in a minimal sandbox]

In this example we’re running a regular build of neverball in an environment which:

  • Is independent of the host distribution
  • Has no access to any system or user files other than the ones from the runtime and application itself
  • Has no access to any hardware devices, other than DRI (for GL rendering)
  • Has no network access
  • Can’t see any other processes in the system
  • Can only get input via Wayland
  • Can only show graphics via Wayland
  • Can only output audio via PulseAudio
  • … plus more sandboxing details

Yet the application is still simple to install and integrates nicely with the desktop. If you want to test it yourself, just follow the instructions on the project page and install org.neverball.Neverball.

Of course, there is still a lot to do here. For instance, PulseAudio doesn’t protect clients from each other, and for more complex applications we need to add new APIs to safely grant access to things like user files and devices. The sandbox details page has a more detailed list of what has to be done.

The road is long, but at least we have now started our journey!

February 16, 2015

NetworkManager for Administrators Part 1

[Image via scobleizer, CC BY 2.0]

NetworkManager is a system service that manages network interfaces and connections based on user or automatic configuration. It supports Ethernet, Bridge, Bond, VLAN, team, InfiniBand, Wi-Fi, mobile broadband (WWAN), PPPoE and other devices, and supports a variety of different VPN services.  You can manage it a couple different ways, from config files to a rich command-line client, a curses-like client for non-GUI systems, graphical clients for the major desktop environments, and even web-based management consoles like Cockpit.

There’s an old perception that NetworkManager is only useful on laptops for controlling Wi-Fi, but nothing could be further from the truth.  No laptop I know of has InfiniBand ports.  We recently released NetworkManager 1.0 with a whole load of improvements for workstations, servers, containers, and tiny systems from embedded to RaspberryPi.  In the spirit of making double-plus sure that everyone knows how capable and useful NetworkManager is, let’s take a magical journey into Administrator-land and start at the very bottom…

Daemon Configuration Files

Basic configuration is stored in /etc/NetworkManager/NetworkManager.conf in a standard key/value ini-style format.  The sections and values are well-described by ‘man NetworkManager.conf’.  A standard default configuration looks like this:
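As a rough sketch only, assuming a Fedora/RHEL-style system where the distribution enables the ifcfg-rh settings plugin (other distributions ship different defaults), a minimal file could be as small as:

```ini
# /etc/NetworkManager/NetworkManager.conf
# Illustrative only: the plugin choice is an assumption for a
# Fedora/RHEL-style system; check what your distribution actually ships.
[main]
plugins=ifcfg-rh
```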


You can override default configuration through either command-line switches or by dropping “configuration snippets” into /etc/NetworkManager/conf.d.  These snippets use the same configuration options from ‘man NetworkManager.conf’ but are much easier to distribute among larger numbers of machines through packages or tools like Puppet, or even just to install features through your favorite package manager.  For example, in Fedora, there is a NetworkManager-config-connectivity-fedora RPM package that installs a snippet that enables connectivity checking to Fedora Project servers.  If you don’t care about connectivity checking, you simply ‘rpm -e NetworkManager-config-connectivity-fedora’ instead of tracking down and deleting /etc/NetworkManager/conf.d/20-connectivity-fedora.conf.

Just for kicks, let’s take a walk through the various configuration options, what they do, and why you might care about them in a server, datacenter, or minimal environment…

Configuration Snippets

First, each configuration “snippet” in /etc/NetworkManager/conf.d can override values set in earlier snippets, or even the default configuration (but not command-line options).  So the same option specified in 50-foobar.conf will override that option specified in 10-barfoo.conf.  Many options also support the “+” modifier, which allows their value to be added to earlier ones instead of replacing.  So “plugins+=something-else” will add “something-else” to the list, instead of overwriting any earlier values.  You’ll see why this is quite useful in a minute…
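To make the override and “+” rules concrete, here is a sketch using the two file names from the paragraph above. The option values are illustrative assumptions, and the two files are shown in one block purely for convenience:

```ini
# --- /etc/NetworkManager/conf.d/10-barfoo.conf (read first) ---
[main]
plugins=ifcfg-rh
dhcp=dhclient

# --- /etc/NetworkManager/conf.d/50-foobar.conf (separate file, read later) ---
[main]
# Replaces the dhcp value set in 10-barfoo.conf
dhcp=internal
# Adds to the plugin list from 10-barfoo.conf instead of replacing it
plugins+=ibft
```

With both snippets in place, the effective settings would be dhcp=internal with both the ifcfg-rh and ibft plugins enabled.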

Dive Deep

plugins=ifcfg-rh | ifupdown | ifnet | ifcfg-suse | ibft (default empty)

This option enables or disables certain settings plugins, which are small loadable libraries that read and write distribution-specific network configuration.  For example, Fedora/RHEL would specify ‘plugins=ifcfg-rh’ for reading and writing the ifcfg file format, while Debian/Ubuntu would use ‘plugins=ifupdown’ for reading /etc/network/interfaces, and Gentoo would use ‘plugins=ifnet’.  If you know your distro’s config format like the back of your hand, NetworkManager doesn’t make you change it.

There is one default plugin though, ‘keyfile’, which NetworkManager uses to read and write configurations that the distro-specific plugins can’t handle.  These files go into /etc/NetworkManager/system-connections and are standard .ini-style key/value files.  If you’re interested in the key and value definitions, you can check out ‘man nm-settings’ and ‘man nm-settings-keyfiles’, or even look at some examples.

monitor-connection-files=yes | no (default no)

By popular demand, NetworkManager no longer watches configuration files for changes.  Instead, you make all the changes you want, and then explicitly tell NetworkManager when you’re done with “nmcli con reload” or “nmcli con load <filename>”.  This prevents reading partial configuration and allows you to double-check that everything is correct before making the configuration update.  Note that changes made through the D-Bus interface (instead of the filesystem) always happen immediately.

However, if you want the old behavior back, you can set this option to “yes”.

auth-polkit=yes | no (default yes)

If built with support for it, NetworkManager uses PolicyKit for fine-grained authorization of network actions.  This will be the subject of another article in this series, but the TLDR is that PolicyKit easily allows user A the permission to use WiFi while denying user B WiFi but allowing WWAN.  These things can be done with Unix groups, but that quickly gets unwieldy and isn’t fine-grained enough for some organizations.  In any case, PolicyKit is often unnecessary on small, single-user systems or in datacenters with controlled access.  So even if your distribution builds NetworkManager with PolicyKit enabled, you can turn it off for simpler root-only operation.

dhcp=dhclient | dhcpcd | internal (default determined at build time, dhclient preferred if enabled)

With NetworkManager 1.0 we’ve added a new internal DHCP client (based off systemd code which was based off ConnMan code) which is smaller, faster, and lighter than dhclient or dhcpcd.  It doesn’t do DHCPv6 yet, but we’re working on that.  We think you’ll like it, and it’s certainly much less of a resource hog than a dhclient process for every interface. To use it, set this option to “internal” and restart NetworkManager.

If NetworkManager was built with support for dhclient or dhcpcd, you can use either of these clients by setting this option to the client’s name.  Note that if you enable both dhclient and dhcpcd, dhclient will be preferred for maximum compatibility.

no-auto-default= (default empty)

By default, NetworkManager will create an in-memory DHCP connection for every Ethernet interface on your system, which ensures that you have connectivity when bringing a new system up or booting a live DVD.  But that’s not ideal on large systems with many NICs, or on systems where you’d like to control initial network bring-up yourself.  In that case, you should set this option to “*” to disable the auto-Ethernet behavior for all interfaces, indicating that you’d like to create explicit configuration instead.  You can also use MAC addresses or interface names here!  On Fedora we’ve created a package called NetworkManager-config-server that sets this option to “*” by default.

ignore-carrier= (default empty)

Trip over a cable?  Want to make sure a critical interface stays configured if the switch port goes down?  This option is for you!  Setting it to “*” (all interfaces) or using MAC addresses or interface names here will tell NetworkManager to ignore carrier events after the interface is configured.  For DHCP connections a carrier is obviously required for initial configuration, while static connections can start regardless of carrier status.  After that, feel free to unplug the cable every time Apple sells an iPhone!
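For a server, the last two options are often set together. A minimal sketch of such a snippet follows; the file name is hypothetical, and this is not necessarily what NetworkManager-config-server itself installs:

```ini
# /etc/NetworkManager/conf.d/30-server.conf (hypothetical file name)
[main]
# Do not create automatic DHCP connections for any Ethernet interface
no-auto-default=*
# Keep already-configured interfaces up even if the link drops
ignore-carrier=*
```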

configure-and-quit=yes | no (default no)

New with 1.0 is the “configure and quit” mode where NetworkManager configures interfaces (including, if desired, blocking startup until networking is active) and then quits, spawning small helpers to maintain DHCP leases and IPv6 address lifetimes if required.  In a datacenter or cloud where cycles are money, this can save you some cash and deliver a more stable setup with known behavior.

dns=dnsmasq | unbound | none | default (default empty, equivalent to “default”)

Want to control DNS yourself?  NetworkManager makes it easy!  Don’t want to?  NetworkManager makes that easy too! When you set this option to ‘dnsmasq’ NetworkManager will configure dnsmasq as a local caching nameserver, including split DNS for VPN tunnels.  If you set it to ‘none’ then NetworkManager won’t touch /etc/resolv.conf and you can use dispatcher scripts that NetworkManager calls at various points to set up DNS any way you choose.

Leaving the option empty or setting it to “default” asks NetworkManager to own resolv.conf, updating system DNS with any information from your explicit network settings or those received from automatic means like DHCP.
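As a short sketch, selecting the local caching nameserver looks like this (swap in ‘none’ if you would rather manage /etc/resolv.conf yourself):

```ini
[main]
# Run dnsmasq as a local caching nameserver, with split DNS for VPNs.
# Use dns=none instead to make NetworkManager leave /etc/resolv.conf alone.
dns=dnsmasq
```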

In the upcoming NetworkManager 1.2, DNS information is written to /var/lib/NetworkManager/resolv.conf and, if NM is allowed to manage /etc/resolv.conf, that file will be a symlink to the one in /var similar to systemd-resolved.  This makes it easier for external tools to incorporate the DNS information that NetworkManager combines from multiple sources like DHCP, PPP, IPv6, VPNs, and more.

unmanaged-devices= (default empty)

Want to keep NetworkManager’s hands off a specific device?  That’s what this option is for, where you can use “interface-name:eth0″ or “mac:00:22:68:1c:59:b1″ to prevent automatic management of a device.  While there are some situations that require this, by default NetworkManager doesn’t touch virtual interfaces that it didn’t create, like bridges, bonds, VLANs, teams, macvlan, tun, tap, etc.  So while it’s unusual to need this option, we realize that NetworkManager can be used in concert with other tools, so it’s here if you do.
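A sketch using the two example device specs from above. The [keyfile] section placement and the semicolon separator are assumptions based on the NetworkManager.conf man page of this era, so check ‘man NetworkManager.conf’ for your version:

```ini
# Assumption: in this release the option lives in the [keyfile] section
# and multiple device specs are separated by semicolons.
[keyfile]
unmanaged-devices=interface-name:eth0;mac:00:22:68:1c:59:b1
```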

uri=  (default empty = disabled)
interval=(default 0 = disabled)
response=  (default “NetworkManager is online”)

Connectivity checking helps users log into captive portals and hotspots, while also providing information about whether or not the Internet is reachable.  When NetworkManager connects a network interface, it sends an HTTP request to the given URI and waits for the specified response.  If you’re connected to the Internet and the connectivity server isn’t down, the response should match and NetworkManager will change state from CONNECTED_SITE to CONNECTED.  It will also check connectivity every ‘interval’ seconds so that clients can report status to the user.

If you’re instead connected to a WiFi hotspot or some kind of captive portal like a hotel network, your DNS will be hijacked and the request will be redirected to an authentication server.  The response will be unexpected and NetworkManager will know that you’re behind a captive portal.  Clients like GNOME Shell will then indicate that you must authenticate before you can access the real Internet, and could provide an embedded web browser for this purpose.

Upstream connectivity checking is disabled by default, but some distribution variants (like Fedora Workstation) are now enabling it for desktops, laptops, and workstations.  On a server or embedded system, or where traffic costs a lot of money, you probably don’t want this feature enabled.  To turn it off you can either remove your distro-provided connectivity package (which just drops a file in /etc/NetworkManager/conf.d) or you can remove the options from NetworkManager.conf.
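For reference, a connectivity snippet might look like the sketch below. The file name, URI, and interval are placeholders rather than what any distribution actually ships, and the options are assumed to live in a [connectivity] section:

```ini
# /etc/NetworkManager/conf.d/20-connectivity-example.conf (hypothetical)
[connectivity]
# Placeholder URI: point this at a server you control
uri=http://connectivity.example.com/check.txt
# Re-check every 300 seconds (0 disables periodic checks)
interval=300
# Expected response from the server
response=NetworkManager is online
```

Deleting the snippet, or removing these options from NetworkManager.conf, turns the checking back off.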

Special NetworkManager data files

In the normal course of network management sometimes non-configuration data needs to persist.  NetworkManager does this in the /var/lib/NetworkManager directory, which contains a few different files of interest:


This file contains the BSSIDs (MAC addresses) of WiFi access points that NetworkManager has connected to for each configured WiFi network.  NetworkManager doesn’t do this to spy on you (and the file is readable only by root), but instead to automatically connect to WiFi networks that do not broadcast their SSID.  You almost never need to touch this file, but if you are concerned about privacy feel free to delete this file periodically.


Each time you connect to a network, whether wired, WiFi, etc, NetworkManager updates the timestamp in this file.  This allows NetworkManager to determine which network you last used, which can be used to automatically connect you to more preferred networks.  NetworkManager also uses the timestamp as an indicator that you have successfully connected to the network before, which it uses when deciding whether or not to ask for your WiFi password when you get randomly disconnected or the driver fails.


This file stores persistent user-determined state for Airplane mode for each technology like WiFi, WWAN, and WiMAX.  Normally this is controlled by hardware buttons, but some systems don’t have hardware buttons or the drivers don’t work, plus that state is not persistent across boots.  So NetworkManager stores a user-defined state for each radio type and will ensure the radio stays in that state across reboots too.

DHCP lease and configuration files

When you obtain a DHCP lease, that lease may last longer than your connection to that network.  To ensure that you receive a nominally stable IP address the next time you connect, or to ensure that your TCP sessions are not broken if there is a network hiccup, NetworkManager stores the DHCP lease and attempts to acquire the same lease again.  These files are stored per-connection to ensure that a lease acquired on your home WiFi or ethernet network is not used for work or Starbucks.  Temporary DHCP configuration files are also stored here, which are constructed based on your preferences and on generic DHCP configuration files in /etc for each supported DHCP client.  If you want to wipe the DHCP slate clean, feel free to remove any of the lease or configuration files.

And that’s it for this time, stay tuned for the next part in this series!

January 30, 2015

Back from DevX hackfest

I’m now back from a week in Cambridge at the developer experience hackfest. This was a great event, it was a lot of fun to meet people again, and we got a lot of things done. I spent a lot of time talking to people about things related to xdg-app and sandboxed applications, both spreading information and actually implementing features.

I spent some time with Emmanuele, Ryan and Lars working on glib stuff, which resulted in the G_DECLARE_*_TYPE macros finally being merged. Additionally I reviewed the new list model abstraction which I hope we can land soon, and Ryan and I worked out a new fancy __attribute__(cleanup) approach that we hope to merge into glib soon.

We also worked a bit on Gtk+ OpenGL support. Based on feedback from early users we’re doing some changes in how GL contexts are created to allow you to configure them in more detail. We also decided that we want to completely drop support for legacy OpenGL contexts, as these had issues cooperating with Core 3.2 contexts, and because we don’t live in the 90s anymore. Carlos was working on converting GtkPopover to use (override redirect) toplevels on X11, and I gave him moral support and generally hated on ancient crappy X11 behaviour.

Props to Collabora and Philip for arranging a great event!

January 26, 2015

So much to learn..

At lunch a few days ago, I was discussing with a coworker how right now I’m feeling a little overwhelmed. Not completely surprising I suppose given the volume of new things I need to learn. But as I’m getting acclimated in my new job, it’s becoming clearer which things I either don’t know well enough or am completely unfamiliar with.

I’m only now realizing that not only the scope of what I’m working on changed, but how I work on things. During the ten years I worked on Fedora, because we were woefully understaffed and buried alive in bugs, there was never really time to spend a lot of time on an individual bug, unless a lot of users were hitting it. Because of this, the things that I ended up fixing were usually fairly small fixes. The occasional NULL pointer dereference. Perhaps a use after free. Maybe linked-list corruption. Pretty basic stuff that anyone with familiarity with any part of the kernel could figure out reasonably quickly. Do enough of this, and it becomes less about engineering, and more about pattern recognition.

Even a lot of the bugs that trinity finds aren’t terribly invasive fixes.
(Screwed up error paths being probably the most common thing it picks up, because nothing else really ever tests them).

The more complicated bugs, like a WARN_ON being hit ? Those could require a lot more understanding, which could take considerable time to acquire.

As Fedora kernels were so close to mainline a lot of the time we could chase up the maintainer, pass the bug along, and move onto something else. The result of ten years of this way of working has meant I have a passing understanding of how lots of parts of the kernel interact from a 10,000ft perspective, and perhaps some understanding of some nuances based on past interactions, but no deep architectural understanding of, e.g., how an sk_buff traverses through the various parts of the network stack. A warning deep in the tcp guts, caused by a packet that’s been through bonding, netfilter, etc ? I’m not your guy. At least today.

Thankfully I’ve got some time to ramp up my learning on unfamiliar parts of the kernel, and get a better understanding of how things work on a deeper level. I suspect I’ll end up turning at least some of the stuff I learn into future posts here.

In the meantime, I’ve got a lot of reading to do.
I felt a little reassured at least when my coworker responded “yeah, me too”.


January 21, 2015

Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don’t actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing on long-term stable releases in Fedora after giving it a try circa Fedora 14, and it generally being a disaster because the bugs being fixed didn’t match up much to the bugs our users were seeing. We found that we got more bugs we cared about being fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose “one size fits all” model that distribution kernels have to fit is a much bigger problem to solve, and with the feedback loop of “stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release” being so long, it’s not really a surprise that just having users be closer to bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way) unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.

January 19, 2015

The Whole Damn World Takes Effect to NetworkManager 1.0



Facebook launched.

The first Ubuntu release appeared.

It was the Year of the Linux Desktop.

Novell had just bought Ximian and Mono happened.

Google IPOed.

Firefox 1.0 showed up.

This was your cellphone and PDAs were still a thing.

This love took you over and made you think you got it.

And NetworkManager was first released.

Fast forward to 2014…


NetworkManager 1.0!

Right before the 2014 holidays, and more than 10 years after the first line of NetworkManager was typed, we released version 1.0.  A huge milestone on the way to making NetworkManager more cooperative, more flexible, more configurable, and more useful than ever before.

How you ask?

1: libnm: the new GLib client library

For all the GLib/GObject users out there, we’ve rebuilt libnm-util and libnm-glib from the ground up into a new single library called libnm.  It uses GDBus instead of dbus-glib.  It provides GIO-style asynchronous methods. It also exposes IP addresses, MAC addresses, and other properties as strings instead of byte arrays, and combines the old NMClient and NMRemoteSettings objects into a single NMClient object, among other things.

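import gi
gi.require_version("NM", "1.0")
from gi.repository import GLib, NM

# Assumption: the devices come from a synchronous NM.Client
client = NM.Client.new(None)
for dev in client.get_devices():
    ipcfg = dev.get_ip4_config()
    if ipcfg:
        for addr in ipcfg.get_addresses():
            print("(%s) %s/%d" % (dev.get_iface(), addr.get_address(), addr.get_prefix()))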

2: a smaller, faster DHCP client

While it doesn’t do DHCPv6 (yet!) this internal client (based off systemd/connman code) is much faster than dhclient and dhcpcd, and doesn’t consume huge amounts of memory like dhclient.  Use the ‘dhcp=internal’ option in NetworkManager.conf to enable it and let us know how it works.  We’ll be adding DHCPv6 support and enhancing the recognized options in the near future.

3: configure and quit

Have a more static configuration and still want to use NetworkManager configuration and API to manage it?  The ‘configure-and-quit=yes’ option in NetworkManager.conf will configure your interfaces and quit the NM process, spawning small helpers to preserve DHCP and IPv6 addresses.  This saves cycles (and therefore money) and is simpler to manage.
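For reference, here is a rough sketch of what a NetworkManager.conf fragment with both of the options above could look like. It is generated with Python's configparser purely for illustration; the helper name render_nm_conf is made up, and the [main] section placement is an assumption based on where these options are normally documented.

```python
import configparser
import io


def render_nm_conf(options):
    """Render a NetworkManager.conf-style [main] section from a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser["main"] = options
    buf = io.StringIO()
    # Write "key=value" without spaces around the delimiter
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


# Assumption: both options live in the [main] section
conf_text = render_nm_conf({
    "dhcp": "internal",
    "configure-and-quit": "yes",
})
print(conf_text)
```

In practice you would simply add the two lines to the existing file by hand; the script only shows the shape of the result.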

4: more cooperative

Continuing the trend, NetworkManager 1.0 does a much better job of leaving externally configured interfaces alone until you tell it to do something.  In addition to improvements for IPv6 sysctl recognition and user-added route preservation, externally created virtual interfaces are no longer automatically set IFF_UP, and NetworkManager handles external master/slave relationship changes more smoothly.

5: more powerful nmcli

We’ve added PolicyKit and interactive password support to nmcli, allowing full command-line-only operation for most network connections, even for less privileged users.  There’s a new ‘nmcli dev connect’ command that brings up an interface using the best available connection.  You can also delete virtual interfaces directly through nmcli.

6: improved IPv6

We’ve ensured that if network interfaces are supposed to be down and unconfigured, that the kernel doesn’t assign a link-local address to them, to prevent potential security issues when you think networking is down.  We’ve also added support for IPv6 WWAN connections and fixes to respect router-delivered MTUs.

7: Bluetooth DUN support

Bluez5 changed API for Dial-Up-Networking functionality, which broke the NetworkManager support.  At long last we’ve added that support back, no thanks to Bluez.  Happy mobile networking!

8: more flexible and cooperative routing

Every interface that can have a default route now gets one, and NetworkManager manages the priorities to ensure they don’t conflict.  Plus, if you need to, you can manually manage priorities on a per-connection basis to prefer WiFi over WWAN or WWAN over ethernet, or whatever you need.

9: fewer dependencies

We’ve also removed some direct dependencies (PolicyKit), slimmed down code, and split functionality into selectable plugins, leading to easier installs on limited systems and better configurability.

That’s just the tip of the iceberg; we’ve improved almost every part of NetworkManager and we’re not stopping there.  We’re planning improvements to container use-cases, WiFi, VPNs, power savings, client APIs, and much  more.  2015 is gonna be a great year, and not just because the version number is greater than 1!

January 15, 2015

UAT on RTL-SDR update

About a year ago, when I started playing with ADS-B over 1090ES, I noticed that small airplanes heavily favour UAT in 978 MHz, because it's cheaper. For the purposes of independent on-board traffic, thus, it would be important to tap directly into UAT. If I mingle with airliners and their 1090ES, it's in controlled airspace anyway (yes, I know that most collisions happen in controlled airspace, but I'm grasping at excuses to play with UAT here, okay).

UAT poses a large challenge for RTL-SDR because of its relatively high data rate: 1.041667 Mbit/s. The RTL2832U can only sample at 3.2 MS/s, and is only stable at 2.8 MS/s. Theoretically it should be enough, but everything I saw out there only works with 8 samples per bit, for weak signals. So, for initial experiments, I thought to try a trick: self-clocking. I set the sample rate to 2083334 S/s, exactly two samples per bit, and then do no clock recovery whatsoever. It hits where it hits, and if sample points land well on bits, the packet is recovered, otherwise it's lost.
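A minimal sketch of that trick, assuming the input is an already demodulated list of frequency deviations (positive or negative) and that a positive deviation means a one. The function name slice_bits, the toy signal, and the polarity are all illustrative assumptions, not the actual code from the repo.

```python
BIT_RATE = 1_041_667                      # UAT bit rate in bits per second
SAMPLES_PER_BIT = 2                       # sample rate chosen as twice the bit rate
SAMPLE_RATE = BIT_RATE * SAMPLES_PER_BIT  # 2083334 samples per second


def slice_bits(freq_samples, phase=0):
    """Take one sample per bit at a fixed phase, with no clock recovery at all."""
    picked = freq_samples[phase::SAMPLES_PER_BIT]
    # Assumed polarity: positive deviation is a 1, negative is a 0
    return [1 if s > 0 else 0 for s in picked]


# Toy demodulated signal: every bit is held for two samples
bits = [1, 0, 1, 1, 0]
signal = []
for b in bits:
    level = 1.0 if b else -1.0
    signal += [level, level]

recovered = slice_bits(signal, phase=0)
print(SAMPLE_RATE, recovered)
```

With a clean toy signal either phase works. On a real capture the sample points drift relative to the bits, which is exactly the "hits where it hits" gamble.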

I threw together some code, ran it, and it didn't work. Only received white noise. Oh, well. I moved the repo into a dusty corner of Github and forgot about it.

Fast forward a year, a gentleman by the name of Oliver Jowett noticed a big problem: the phase was computed incorrectly. I fixed that up and suddenly saw some bits recovered (as it happened, zeroes and ones were swapped, which Oliver had to correct again).

After that, things started to move forward. Having bits recovered allowed me to measure reception, and I found that the antenna that I built for the 978 MHz band was much worse than the stock antenna for TV. Imagine my surprise and disappointment: all that soldering for nothing. I don't know where I screwed up, but some suggest that the computer and dongle produce RF noise that screws with the antenna, and a length of coax helps with that despite the losses incurred by the coax (/u/christ0ph liked that in particular).

This is bad.

This is good. Or better at least.

From now on, it's the error recovery. Unfortunately, I have no clue what this means:

The FEC parity generation shall be based on a systematic RS 256-ary code with 8-bit code word symbols. FEC parity generation for each of the six blocks shall be a RS (92,72) code.

Quoted from Annex 10, Volume III,
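For what it is worth, the numbers in that quote can be unpacked with the generic Reed-Solomon rules, without knowing anything UAT-specific. "256-ary with 8-bit symbols" means each symbol is one byte, "systematic" means the data bytes are sent unchanged with parity appended, and RS(n, k) means k data symbols grow into n coded symbols. A code with n - k parity symbols can correct up to (n - k) / 2 corrupted symbols at unknown positions. The helper below is only arithmetic under that general reading; the name rs_params is made up and nothing here comes from the Annex itself.

```python
def rs_params(n, k, symbol_bits=8, blocks=1):
    """Basic bookkeeping for a systematic RS(n, k) code (illustrative helper)."""
    parity = n - k
    return {
        "parity_symbols": parity,
        # Standard RS bound: half the parity symbols' worth of symbol errors
        "correctable_symbol_errors": parity // 2,
        "data_bytes_total": k * blocks * symbol_bits // 8,
        "coded_bytes_total": n * blocks * symbol_bits // 8,
    }


# Six blocks of RS(92,72), as in the quoted requirement
uplink = rs_params(92, 72, blocks=6)
print(uplink)
```

So each block carries 72 data bytes plus 20 parity bytes, and a decoder should be able to repair up to 10 bad bytes per block.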

P.S. Observing Oliver's key involvement, the cynical may conclude that the whole premise of Open Source is that you write drek that does not work, upload it to Github, and someone fixes it up for you, free of charge. ESR wrote a whole book about it ("with enough eyes all bugs are shallow")!

January 08, 2015

the new job reveal.

I let the cat out of the bag earlier this afternoon on twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilize production releases, proactively finding new bugs, misc QA work, development of some new tooling and a whole bunch of other stuff I can’t talk about just yet. (And probably a whole slew of things I don’t even know about yet).

It seemed almost serendipitous that I’ve ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story beginning with DDoS’s of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai related book, No Better Time, the autobiography of the late founder of Akamai, Danny Lewin. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me to a few possibilities, and the final decision to go with Akamai wasn’t an easy one to make. The combination of interesting work, an “easy to commute to” office (I’ll be at the Kendall square office in Cambridge, MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep).

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.

Swift and balance

Swift is on the cusp of getting yet another intricate mechanism that regulates how partitions are placed: so-called "overload". But new users of Swift still keep asking what weights are, and now this? I am not entirely sure it's necessary, but here's a simple explanation of why we're ending up with attempts at complexity (thanks to John Dickinson on IRC).

Suppose you have a system that spreads your replicas (of partitions) across (failure) zones. This works great as long as your zones are about the same size, usually a rack. But then one day you buy a new rack with 8TB drives and suddenly the new zone is several times larger than others. If you do not adjust anything, it ends up only a quarter full at best.

So, fine, we add "weights". Now a zone that has weight 100.0 gets 2 times more replicas than a zone with weight 50.0. This allows you to fill zones better, but this must compromise your dispersion and thus durability. Suppose you only have 4 racks: three with 2TB drives and one with 8TB drives. Not an unreasonable size for a small cloud. So, you set weights to 25, 25, 25, 100. With replication factor of 3, there's still a good probability (which I'm unable to calculate, although I feel it ought to be easy for someone better educated) that the bigger node will end up with 2 replicas for some partitions. Once that node goes down, you lose redundancy completely for those partitions.
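A back-of-the-envelope estimate is possible if one assumes that placement follows the weights exactly. That is a simplification for the sake of arithmetic and not how Swift's ring builder actually works; the function name expected_replicas is made up. Under that assumption each zone holds replicas * weight / total_weight replicas of an average partition.

```python
def expected_replicas(weights, replicas):
    """Average replicas per partition in each zone, if weights are honored exactly."""
    total = sum(weights)
    return [replicas * w / total for w in weights]


weights = [25, 25, 25, 100]   # three 2TB racks and one 8TB rack
share = expected_replicas(weights, replicas=3)
big = share[-1]               # the 8TB zone

# Further assumption: no partition ever has all 3 replicas in the big zone,
# so every partition has either 1 or 2 replicas there. Then the fraction of
# partitions with 2 replicas in the big zone is simply big - 1.
fraction_doubled = max(0.0, big - 1.0)

print(share, fraction_doubled)
```

With these weights the big zone averages 300/175, about 1.71 replicas per partition, so under the stated assumptions roughly 71% of partitions would keep two of their three replicas in the same rack.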

In the small-cloud example above, if you care about your customers' data, you have to eat the imbalance and underutilization until you retire the 2TB drives [1].

<clayg> torgomatic: well if you have 6 failure domains in a tier but their sized 10000 10 10 10 10 10 - you're still sorta screwed

My suggestion would be to just ignore all the complexity we thoughtfully provided for the people with "screwed" clusters. Deploy and maintain your cluster to make it easy for the placement and replication: have a good number of more or less uniform zones that are well aligned to natural failure domains. Everything else is a workaround -- even weights.

P.S. Kinda wondering how Ceph deals with these issues. It is more automagic when it decides what to store where, but surely there ought to be a good and bad way to add OSDs.

[1] Strictly speaking, other options exist. You can delegate to another tier by tying 2 small racks into a zone: yet another layer of Swift's complexity. Or, you could put new 8TB drives on trays and stuff them into existing nodes. But considering that muddies the waters.

UPDATE: See the changelog for better placement in Swift 2.2.2.

Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I’d encountered since I had started work there. I had been fixing up various bugs in Trinity over the tail end of summer that meant on a good kernel, it would run and run for quite some time. I still didn’t figure out exactly what the cause of a self-corruption was, but narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often, the kernel’s NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seemingly random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains, what was scribbling over the HPET in my case ? The /dev/hpet node doesn’t allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it, but..
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.

The only two plausible scenarios I could think of were

  • Trinity generated 0xfed000f0 as a random address, and passed that to a syscall which wrote to it. This seems pretty unlikely, and hopefully the kernel has sufficient access_ok() checks on addresses passed in from userspace. Just to be sure, I had hardwired trinity to pass in that address, and couldn’t reproduce the bug.
  • A hardware bug.
    I’m actually starting to believe this may be the case. When trinity drives the CPU load up past a certain threshold, for whatever reason, the HPET stops ticking and corrupts itself. It still seems a bit “out there”, but is more believable than the other theory at least. An interesting data point showed up when googling for the DMI string of the affected machine. Someone else had seen ‘random lockups’ that looked very similar a year earlier. The associated bugzilla had a few more traces.

So that’s where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus’ hacky workaround didn’t get committed, but he & John Stultz continue to go back & forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we’ll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:

  • Keep better notes.
    Every week that passed, I wished I had written down what I had done the week before. With everything else going on in my life over the last few months, I neglected to document things as well as I could have, and only had old emails to fall back on. Not every bug drags on for months like this, but when you over-optimistically think a bug is going to be solved in a few days, you tend to not bother taking as extensive notes on what has been tried so far.
  • Google for the DMI string of the affected hardware pretty early on.
    That might have given us some clues a lot sooner as to what was going on. Or maybe not, but still – more data.
  • The more people looked into this bug, the more “this doesn’t look right” code was found. There’s never just one bug.

January 07, 2015

continuity of various projects

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

  • Trinity.
    This will proceed pretty much as it has over the last year, perhaps with a little more focus on various areas of the kernel than it has in the past. (Not being intentionally elusive, I’m not entirely sure myself yet).
    From discussions I’ve had so far, there may even be a spin-off into a separate tool, we’ll see.
    • I’ve been wanting to get back to the network related code for some time, and that’s probably going to happen soon-ish.
    • There is still work that could be done with the VM/FS code I’ve worked on over the last year and a half, but it’s already finding bugs, so is “good enough” for a while.
    • There are also still a few lingering bugs that I need to one day sit down and figure out, but they happen so infrequently that I’ve not found the time so far.
    • Finally, there is also a bunch of other feature work that needs fleshing out that I don’t see myself getting to any time soon, that I’ll dump in the TODO over the next week.
  • Upstream testing.
    Things like my daily running of various stress tools against Linus’ master branch with debug builds won’t happen to the level they were previously. The good news is that instead I’ll be doing a lot more testing on various stable branches, which I never did a whole lot of in the past. I expect over time as I get a better feel for my workload I might be able to ramp back up master testing somewhat, but it will be a secondary thing that I do in addition to everything else. Right now, I’m not really doing any of this, so other people running things like trinity against 3.19/3.20 would probably be a good idea if you like collecting crashes, at least until I find my feet again.
  • Coverity scans.
    • I’ll keep doing the scan.coverity runs hopefully once per -rc. I’ve lapsed a little right now, but will pick this back up again soon to get things back up to current. It takes a lot longer to run a whole scan from home now that I don’t have access to my 24-thread Nehalem (was: < 1 hour for compile, pack & upload to coverity, now: ~4 hours just for the compile), so I'll automate these to run overnight when Linus makes a new snapshot, at least until I put together something a little faster than my ~7 year old core duo.
    • I’m not sure I’m going to have a huge amount of time for triage work yet, so we’ll see how that works out. I might have to focus just on certain areas.
    • There will be some element of related work in my new job, but I’m not sure to what degree yet, more info on that as I figure it out.
  • Fedora.
    The biggest change of all.
    • I’m just not going to have time to maintain packages, read mail etc for Fedora, so those all got orphaned yesterday.
    • Josh & Justin pretty much handled all of the Fedora kernel work for the last year or so, so me walking away is not going to make a huge difference there.
    • I might still occasionally take a peek at Fedora bugzilla to see if there’s anything similar to a particular bug, but don’t expect to be doing triage work.
    • I’ll still keep a Fedora box or two at home for a while, but work-wise, I’m expecting a lot more Debian in my life. It’s been over a decade since I last used it seriously. That should prove to be fun.
  • Conferences etc.
    I’ll hopefully be seeing some familiar faces again later this year, and possibly meeting some new ones. The only real change here will be the lack of fudcon/flock/whatever it’s called this week.

So that’s about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated to my new job, so I won’t be as visible as usual, but I’m not going to disappear from the Linux community forever. I made it clear everywhere I interviewed that I didn’t want that to happen.

January 05, 2015

Going beyond ZFS by accident

Yesterday, CKS wrote an article that tries to redress the balance a little in the coverage of ZFS. Apparently, detractors of ZFS were pulling quotes from his operational gripes, so he went forceful with the observation that ZFS remains the only viable advanced filesystem (on Linux or not). CKS has no time for your btrfs bullshit.

The situation where weselovskys of the world hold the only viable advanced filesystem hostage and call everyone else "jackass" is very sad for Linux, but it may not be quite so dire, because it's possible that events are overtaking btrfs and ZFS. I am talking about the march of super-advanced, distributed filesystems downstream.

It all started with the move beyond POSIX, which, admittedly, seemed very silly at the time. The early DHT was laughable and I remember how they struggled for years to even enable writes. However, useful software was developed since then.

The poster child of it is Sage's Ceph, which relies on plain old XFS for back-end storage, composes an object storage out of nodes (called RADOS), and layers a POSIX layer on top for those who want it. It is in field use at Dreamhost. I can easily see someone using it where otherwise a ZFS-backed NFS/CIFS cluster would be deployed.

Another piece of software that I place in the same category is OpenStack Swift. I know, Swift is not competing with ZFS directly. The consistency of its meta layer is not sufficient to emulate POSIX anyway. However, you get all those built-in checksums and all that durability jazz that CKS wants. And Swift aims even further up in scale than Ceph, by being in field use at Rackspace. So, what seems to be happening is that folks who really need to go large are willing at times to forsake even compatibility with POSIX, in part to get the benefits that ZFS provides to CKS. Mercado Libre is one well-hyped case of migration from a pile of NFS filers to a Swift cluster.

Now that these systems are established and have proven themselves, I see constant efforts to take them downscale. Original Swift 1.0 did not even work right if you had fewer than 3 nodes (strictly speaking, if you had fewer zones than replication factor). This has since been fixed by so-called "as good as possible placement" around 1.13 Havana, so you can have 1-node Swift easily. Ceph, similarly, would not consider PGs on the same node healthy and it's a bit of a PITA even in Firefly. So yeah, there are issues, but we're working on it. And in the process, we're definitely coming for chunks of ZFS space.

blitz2 for GNOME catastrophe

After putting blitz2 on my Nexus and poking into GUI buttons, I reckoned that it might be time to stop typing in a terminal like a caveman on Linux too (I'm joking, but only just). And the project rolled smoothly for a while. What took me 2 months to accomplish in Android only took 2 days in GNOME. However, 2 lines of code from the end, it all came to an abrupt halt when I found out that it is impossible to access clipboard from a JavaScript application. See GNOME bugs 579312, 712752.

January 03, 2015

2 months in

Today is my two month monthiversary at my new job. Haven’t had time so far to sit back and reflect and let people know, but now, while packing boxes for our upcoming move downtown, I welcome the distraction.

I dove into the black hole. I joined the borg collective. I’m now working for the little search engine that could.

I sure had my reservations while contemplating this choice. This is the first job I’ve had that I had to interview for – and quite a bit, I might add (though I have to admit that curiosity about the interviewing process is what made me go for the interviews in the first place – I wasn’t even considering a different job at that time). My first job, a four month high school math teaching stint right after I graduated, was suggested to me by an ex-girlfriend, and I was immediately accepted after talking to the headmaster (that job is still a fond memory for many reasons). For my first real job, I informally chatted over dinner with one of the four founders, and then I started working for them without knowing if they were going to pay me. They ended up doing so by the end of the month, and that was that. The next job was offered to me over IRC, and from that Fluendo and Flumotion were born. None of these were through a standard job interview, and when I interviewed at Google I had much more experience on the other side of the interviewing table.

From a bunch of small startups to a company the scale of Google is a big step up, so that was my main reservation. Am I going to be able to adapt to a big company’s way of working? On the other hand, I reasoned, I don’t really know what it’s like to work for a big company, and clearly Google is one of the best of those to work for. I’d rather try out working for a big company while I’m still considered relatively young job-market-wise, so I rack up some experience with both sides of this coin during my professionally mobile years.

But I’m not going to lie either – seeing that giant curious machine from the inside, learning how they do things, being allowed to pierce the veil and peek behind the curtain – there is a curiosity here that was waiting to be satisfied. Does a company like this have all big problems solved already? How do they handle things I’ve had to learn on the fly without anyone else to learn from? I was hiring and leading a small group of engineers – how does a company that big handle that on an industrial scale? How does a search query really work? How many machines are involved?

And Google is delivering in spades on that front. From the very first day, there’s an openness and a sharing of information that I did not expect. (This also explains why I’ve always felt that people who joined Google basically disappeared into a black hole – in return for this openness, you are encouraged to swear yourself to secrecy towards the outside world. I’m surprised that that can work as an approach, but it seems to). By day two we did our first commit (obviously nothing that goes to production, but still.) In my first week I found the way to the elusive (to me at least) roof top terrace by searching through internal documentation. The view was totally worth it.

So far, in my first two months, I’ve only had good surprises. I think that’s normal – even the noogler training itself tells you about the happiness curve, and how positive and excited you feel the first few months. It was easy to make fun of some of the perks from an outside perspective, but what you couldn’t tell from that outside perspective is how these perks are just manifestations of common engineering sense on a company level. You get excellent free lunches so that you go eat with your team mates or run into colleagues and discuss things, without losing brain power on deciding where to go eat (I remember the spreadsheet we had in Barcelona for a while for bike lunch once a week) or losing too much time doing so (in Barcelona, all of the options in the office building were totally shit. If you cared about food it was not uncommon to be out of the office area for ninety minutes or more). You get snacks and drinks so that you know that’s taken care of for you and you don’t have to worry about getting any and leave your workplace for them. There are hammocks and nap pods so you can take a nap and be refreshed in the afternoon. You get massage points for massages because a healthy body makes for a healthy mind. You get a health plan where the good options get subsidized because Google takes that same data-driven approach to their HR approach and figured out how much they save by not having sick employees. None of these perks are altruistic as such, but there is also no pretense of them being so. They are just good business sense – keep your employees healthy, productive, focused on their work, and provide the best possible environment to do their best work in. I don’t think I will ever make fun of free food perks again given that the food is this good, and possibly the favorite part of my day is the smoothie I pick up from the cafe on the way in every morning. 
It’s silly, it’s small, and they probably only do it so that I get enough vitamins to not get the flu in winter and miss work, but it works wonders on me and my morning mood.

I think the bottom line here is that you get treated as a responsible adult by default in this company. I remember silly discussions we had at Flumotion about developer productivity. Of course, that was just a breakdown of a conversation that inevitably stooped to the level of measuring hours worked as a measurement of developer productivity, simply because that’s the end point of any conversation on that topic that spirals out of control. Counting hours worked was the only thing that both sides of that conversation understood as a concept, and paying for hours worked was the only thing that both sides agreed on as a basic rule. But I still considered it a major personal fault to have let the conversation back then get to that point; it was simply too late by then to steer it back in the right direction. At Google? There is no discussion about hours worked, work schedule, expected productivity in terms of hours, or any of that. People get treated like responsible adults, are involved in their short-, mid- and long-term planning, feel responsible for their objectives, and allocate their time accordingly. I’ve come in really early and I’ve come in late (by some personal definition of “on time” that, ever since my second job 15 years ago, I was lucky enough to define as ’10 AM’). I’ve left early on some days and stayed late on more days. I’ve seen people go home early, and I’ve seen people stay late on a Friday night so they could launch a benchmark that was going to run all weekend so there’d be useful data on Monday. I asked my manager one time if I should let him know if I get in later because of a doctor’s visit, and he told me he didn’t need to know, but it helps if I put it on the calendar in case people wanted to have a meeting with me at that hour.

And you know what? It works. Getting this amount of respect by default, and seeing a standard to live up to set all around you – it just makes me want to work even harder to be worthy of that respect. I never had any trouble motivating myself to do work, but now I feel an additional external motivation, one this company has managed to create and maintain over the fifteen+ years they’ve been in business. I think that’s an amazing achievement.

So far, so good, fingers crossed, touch wood and all that. It’s quite a change from what came before, and it’s going to be quite the ride. But I’m ready for it.

(On a side note – the only time my habit of wearing two different shoes was ever considered a no-no for a job was for my previous job – the dysfunctional one where they still owe me money, among other stunts they pulled. I think I can now empirically elevate my shoe habit to a litmus test for a decent job, and I should have listened to my gut on the last one. Live and learn!)


December 31, 2014

blitz2 for Android

An additional upside of blitz2 is that an HTTP client is available on just about any platform. So if I want to share the clipboard with my Nexus tablet, I can, without running an sshd on it.

This is my first Android app, and I haven't touched Java in many years. So, first impressions.

I forgot how insanely wordy Java is. And doing anything takes effort, with all the factories, accessors, and whatnot.

I like checked exceptions. Too bad Python doesn't have them (probably impossible by the very nature of a dynamic language, but I've been bitten by an unexpected exception floating up from the depth of the stack before).

Android docs are excellent and one almost never needs to search for answers. Unfortunately, I managed to step into one such case: the so-called "Up" navigation. My chosen API level is 11. The contemporary docs explain how to emulate "Up" using compatibility libraries for APIs before mine, and explain how to use onNavigateUp(), which comes in API level 16. But there's absolutely nothing, nowhere, that tells how to do it in API 11. I was walking in circles for days. The answer is actually a secret ID namespace, which I would never have figured out if not for random pieces of code on the Internet. Good grief, Google. So close to perfect marks.

Oh, and one more thing: Googlers score good sanity points for reimplementing a stock Java API for HTTP (HttpURLConnection and friends). They could've easily rolled their own, but they didn't. They wrote their own runtime, but it's fully compatible with Oracle's, including dark corners of SSL. This makes it possible to debug most of the difficult parts on a Linux box. Very nice. Just to see what it could be otherwise, look at their gratuitously incompatible Base64.

UPDATE: I forgot to mention that I started with Eclipse, but it was entirely unusable due to crashing all the time (about once an hour, for no discernible reason). I was on Fedora 20 at the time. So, I used command-line tools, and that worked like a charm. There's a Makefile in the blitz2 repo linked above.

December 22, 2014

Cheating around taskotron in Fedora

Yesterday's ntp vulnerability uncovered a trick for Fedora maintainers. You know how it's super annoying that you cannot push an update to F20 without F21? You must herd updates and can never do them in parallel, or else taskotron ruins innocent updates. But at the time of this writing the fixes are live in F20, but not in F21. How does Miroslav do it?

The answer is easy: he keeps ntp intentionally a few releases back in older Fedora (4.2.6p5-19 in F20), so he can bump it with impunity without regard to the newer Fedora (4.2.6p5-25 in F21). Of course, if someone were to upgrade to F21 today, he'd go from a fixed ntp to a broken ntp, but hey... at least the automated checks are defeated.
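
To make the trick concrete, here is a minimal sketch of the kind of upgrade-path comparison being sidestepped. It is not taskotron's actual code: the function names are made up for illustration, and it only compares the integer release of two builds of the same upstream version, whereas real RPM comparison also handles epochs, dist tags and arbitrary version strings.

```python
def split_version_release(version_release):
    """Split a 'version-release' string such as '4.2.6p5-19' into (version, release)."""
    version, _, release = version_release.rpartition("-")
    return version, int(release)


def upgrade_path_ok(older_fedora, newer_fedora):
    """Return True if the build in the newer Fedora is not older than the one in the older Fedora.

    Simplifying assumption: both builds share the same upstream version, so only
    the integer release needs comparing.
    """
    old_version, old_release = split_version_release(older_fedora)
    new_version, new_release = split_version_release(newer_fedora)
    if old_version != new_version:
        raise ValueError("this sketch only compares releases of the same version")
    return old_release <= new_release


def headroom(older_fedora, newer_fedora):
    """Number of release bumps the older Fedora can take before overtaking the newer one."""
    return split_version_release(newer_fedora)[1] - split_version_release(older_fedora)[1]


f20 = "4.2.6p5-19"  # release numbers quoted in the post
f21 = "4.2.6p5-25"
print(upgrade_path_ok(f20, f21), headroom(f20, f21))
```

With the numbers from the post, F20 can be bumped several times while the check still passes, which is exactly why the gap is kept.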

This challenge is similar to writing super ugly OpenStack code that passes PEP8 checks, only the outcome is actually dangerous today.

December 19, 2014

Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.

In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare". At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

  • Our insistence on shipping the latest code, with as few 'special sauce' patches as possible, won over a lot of upstream developers that wouldn't have given us the time of day for similar bugs back in the RHL days. Sometimes painful for our users, but Linux as a whole got better because of our stance here.
  • Decisions like Fedora enabling debug options by default in betas shook out an unbelievable number of bugs almost as soon as they got introduced. Again, painful for users, but from a quality standpoint, we found a ton of bugs in code others were racing to ship first and call "enterprise ready".
  • Fedora enabling features sometimes before they were fully baked got us a lot of love from their respective upstream maintainers.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned, is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.


December 12, 2014


You know how some people attach several monitors to one PC? I don't. I just have several PCs. But then I want copy-paste to work transparently (as transparently as possible). For several years I used blitz to copy the clipboard. It works well enough, but once you have 3 computers, it gets somewhat cumbersome to type the hostname. Also, it always bothered me how it rides ssh authentication. I wanted something independent from ssh.

Behold blitz2. Instead of passing the clipboard to the host where it's needed directly, the clipboard is uploaded to an HTTP server. Seems more complex at first, but it's actually much better, because previously the PC where you copy had to authenticate to the PC where you paste. Now the authentication is symmetric. So, all clients are configured exactly the same, and all can upload and download the clipboard no matter who trusts what ssh keys.
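
As a rough illustration of that design (not blitz2's actual code or protocol), here is a minimal sketch in Python: a tiny in-memory HTTP server holding one clipboard slot, and a client helper that every PC would use identically. The shared token, the X-Token header and the /clipboard path are assumptions made up for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

TOKEN = "shared-secret"  # hypothetical credential, identical on every client


class ClipboardHandler(BaseHTTPRequestHandler):
    clipboard = b""  # one shared slot kept in memory (an assumption of this sketch)

    def _authorized(self):
        if self.headers.get("X-Token") == TOKEN:
            return True
        self.send_response(403)
        self.end_headers()
        return False

    def do_PUT(self):
        # "Copy": any client uploads its clipboard to the server.
        if self._authorized():
            length = int(self.headers.get("Content-Length", 0))
            ClipboardHandler.clipboard = self.rfile.read(length)
            self.send_response(204)
            self.end_headers()

    def do_GET(self):
        # "Paste": any client downloads whatever was uploaded last.
        if self._authorized():
            body = ClipboardHandler.clipboard
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def request(port, method, body=None, token=TOKEN):
    """Send one request to the local demo server and return (status, payload)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request(method, "/clipboard", body=body, headers={"X-Token": token})
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return response.status, data


def demo():
    """Start the server on a free port, upload, download, then try a bad token."""
    server = HTTPServer(("127.0.0.1", 0), ClipboardHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        put_status, _ = request(port, "PUT", b"copied on one PC")
        get_status, pasted = request(port, "GET")
        denied_status, _ = request(port, "GET", token="wrong")
        return put_status, get_status, pasted, denied_status
    finally:
        server.shutdown()
        server.server_close()
```

The point of the sketch is the symmetry: no client needs to trust any other client's keys, only the server and the one credential they all share.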

December 02, 2014

Free-riding and copyleft in cultural commons like Flickr

Flickr recently started selling prints of Creative Commons Attribution-Share Alike photos without sharing any of the revenue with the original photographers. When people were surprised, Flickr said “if you don’t want commercial use, switch the photo to CC non-commercial”.

This seems to have mostly caused two reactions:

  1. “This is horrible! Creative Commons is horrible!”
  2. “Commercial reuse is explicitly part of the license; I don’t understand the anger.”

I think it makes sense to examine some of the assumptions those users (and many license authors) may have had, and what that tells us about license choice and design going forward.

Free ride!!, by Dhinakaran Gajavarathan, under CC BY 2.0

Free riding is why we share-alike…

As I’ve explained before here, a major reason why people choose copyleft/share-alike licenses is to prevent free rider problems: they are OK with you using their thing, but they want the license to nudge (or push) you in the direction of sharing back/collaborating with them in the future. To quote Elinor Ostrom, who won a Nobel for her research on how commons are managed in the wild, “[i]n all recorded, long surviving, self-organized resource governance regimes, participants invest resources in monitoring the actions of each other so as to reduce the probability of free riding.” (emphasis added)

… but share-alike is not always enough

Copyleft is one of our mechanisms for this in our commons, but it isn’t enough. I think experience in free/open/libre software shows that free rider problems are best prevented when three conditions are present:

  • The work being created is genuinely collaborative — i.e., many authors who contribute similarly to the work. This reduces the cost of free riding to any one author. It also makes it more understandable/tolerable when a re-user fails to compensate specific authors, since there is so much practical difficulty for even a good-faith reuser to evaluate who should get paid and contact them.
  • There is a long-term cost to not contributing back to the parent project. In the case of Linux and many large software projects, this long-term cost is about maintenance and security: if you’re not working with upstream, you’re not going to get the benefit of new fixes, and will pay a cost in backporting security fixes.
  • The license triggers share-alike obligations for common use cases. The copyleft doesn’t need to perfectly capture all use cases. But if at least some high-profile use cases require sharing back, that helps discipline other users by making them think more carefully about their obligations (both legal and social/organizational).

Alternately, you may be able to avoid damage from free rider problems by taking the Apache/BSD approach: genuinely, deeply educating contributors, before they contribute, that they should only contribute if they are OK with a high level of free riding. It is hard to see how this can work in a situation like Flickr’s, because contributors don’t have extensive community contact.[1]

The most important takeaway from this list is that if you want to prevent free riding in a community-production project, the license can’t do all the work itself — other frictions that somewhat slow reuse should be present. (In fact, my first draft of this list didn’t mention the license at all — just the first two points.)

Flickr is practically designed for free riding

Flickr fails on all the points I’ve listed above — it has no frictions that might discourage free riding.

  • The community doesn’t collaborate on the works. This makes the selling a deeply personal, “expensive” thing for any author who sees their photo for sale. It is very easy for each of them to find their specific materials being reused, and see a specific price being charged by Yahoo that they’d like to see a slice of.
  • There is no cost to re-users who don’t contribute back to the author—the photo will never develop security problems, or get less useful with time.
  • The share-alike doesn’t kick in for virtually any reuses, encouraging Yahoo to look at the relationship as a purely legal one, and encouraging them to forget about the other relationships they have with Flickr users.
  • There is no community education about the expectations for commercial use, so many people don’t fully understand the licenses they’re using.

So what does this mean?

This has already gone on too long, but a quick thought: what this suggests is that if you have a community dedicated to creating a cultural commons, it needs some features that discourage free riding — and critically, mere copyleft licensing might not be good enough, because of the nature of most production of commons of cultural works. In Flickr’s case, maybe this should simply have included not doing this, or making some sort of financial arrangement despite what was legally permissible; for other communities and other circumstances other solutions to the free-rider problem may make sense too.

And I think this argues for consideration of non-commercial licenses in some circumstances as well. This doesn’t make non-commercial licenses more palatable, but since commercial free riding is typically people’s biggest concern, and other tools may not be available, it is entirely possible it should be considered more seriously than free and open source software dogma might have you believe.

  1. It is open to discussion, I think, whether this works in Wikimedia Commons, and how it can be scaled as Commons grows.

November 30, 2014

23 years

(This post is only about music – for people not from Belgium, Luc de Vos, singer of Gorki, passed away yesterday at 52)

I am 15. I hear a song on the radio, and I don’t understand the lyrics. Why would you ask a piranha to devour you? Still, I’m intrigued. I’d only really gotten into music little by little. My earliest musical memory is hearing my parents’ record player playing ‘I want you’ by Bob Dylan. After that, it was my inexplicable arousal at seeing the Hey You the Rock Steady Crew video in 1983 when I was 7, getting the Top Gun soundtrack on cassette (my first ever music purchase) in 1986, and watching the video for ‘I want your sex’ by George Michael in 1987 over and over on my recording of Veronica’s “Countdown”. At my confirmation (12 years old), when kids typically get some kind of bigger gift they’ve been dreaming of for a long time, I still chose a computer instead of a stereo.

I am 16, I just had my birthday. I am doing a summer job at my family’s company (which processes animal fat) and I am staying with my grandparents in Bavegem. With the money from my birthday I bought a portable stereo CD/cassette player for the incredible amount of 6000 BEF (or 150 euro as the kids would call it these days). I listen to nothing else for weeks on end. I can still hum the amazingly beautiful piano part that closes Mia from memory. It’s been my favorite song ever since.

I am 17, and learning the guitar. It turns out that Mia is quite complicated to get right, because of that perfect 3/4-5/4 tempo, or whatever you’d call it if you knew anything about music. It doesn’t help that I’m left-handed playing on a right-handed guitar, but I make the song my own. To this day though, I still cannot play and sing it at the same time. There is something about the timing of how that third line starts before the music starts, where he sings ‘Mensen als ik’, that I just can’t figure out. It’s magic – it makes this song all the better.

I am 17, and Gorky is now Gorki, with completely new band members. I see them live for the first time, at ‘De Kring’ in Merelbeke, with my best friend Jeremy. I wish I had bought all the t-shirts that night – they had a different one for each of the new songs. The album sounds so different – parts of it recorded in Africa. I don’t listen to that album enough, but I still love playing Berejager on guitar, such a beautiful intro.

I am 17, and it’s my last year of boy scout before becoming a leader. I have a mini-JIN camp called JINTRO during the year, that ends with a party. I dance with a girl to Mia, and one minute into the dance she says, ‘no no, we’re not going to do a one-tile-dance for the rest of the night. Here’s how you do it’ and she teaches me two basic moves to make a slow dance more interesting. Thank you, Karlien, for changing my life.

I am 18, and we travel through Catalunya with the boy and girl scouts group I’m in, and a local Catalan group. This is one of the CD’s we brought with us as a sample of our own culture. The Catalans love it – they say it sounds like Bruce Springsteen. I can see where they’re coming from. At the end of the two weeks, the guitar player of their group nails down a really good version of Mia (without the words of course).

I am 18, and have my first serious girlfriend. Mia is a song that runs through our history together – we must have danced to it at every party that played it (she messaged me yesterday that she immediately thought of me when she heard the news… just like I did of her). Back then, parties still had blocks of 3 slow songs every one or two hours. I miss that tradition… The moves that Karlien taught me put me well ahead of the pack of my fellow young adult males, and that paid off generously in the young adult females agreeing to dance with me at every party. (The theory of compounded interest clearly put in practice, now that I think of it)

I am 19, and one of my fellow boy scout leaders gives me an old demo cassette of Gorky. Among other things, it contains a cover of the Pixies’ “Monkey Gone to Heaven”, some of their songs that didn’t make their debut (but appeared on Boterhammen, like ‘Ik word oud’, or were turned into a b-side). It also contains the original version of Mia, as a fast-paced slurred-sung rocker. They made the right call slowing it down.

I am 21, and I have a radio show at a student radio I helped start up. I am too young to know how the world really works and just send out interview requests to managers and record labels for bands that I like. In those days, I got to interview my favorite band, The Afghan Whigs, as well as other bands like Everclear and The Sheila Divine. But we also managed to get Luc De Vos as a guest on our radio show, and Jeremy and I interviewed him in between songs for an hour. (That tape is at my parents’ place. I have an Excel sheet that tells me exactly which box it’s in, and I hope I can recover it next time I go to Belgium.) I tell him about that demo tape that I have, and he asks for a copy. A little after that, I bring him a copy of that practice tape, I put ‘Congregation’ by The Afghan Whigs on the other side (because I want one of my favorite bands to know another one of my favorite bands), and I go past his house to drop it off. (From the news report this weekend I hear he still lived in the same street, so I can only assume he was still living in the same house he’d lived in for the last 17 years).

I am 22, and Luc De Vos plays solo at the university somewhere, in an auditorium. I think it was one of the first times he ever did that. He probably already read out a column he wrote. But I remember how amazing he was by himself, what beautiful versions of these songs that I knew so well he played, songs that usually they didn’t play live because they were the slower ones. ‘Arme Jongen’, I remember him playing it there like it was yesterday.

I am 26, and I see him at various festivals, always there to either play or enjoy the music. I see him backstage with his son, recently born. He is walking around with some kind of elastic band tied around his waist that keeps his kid from running away more than ten meters from him, and it is hilarious to see in the backstage area.

Time starts moving quicker as I grow up, become an adult, and graduate from college. More and more albums. Every album still contained at least one killer song. ‘Leve de Lente’ still gives me goosebumps when those guitars crash in. ‘Vaarwel Lieveling’ is possibly his most underrated song – I don’t think I’ve ever heard that one played live. ‘Ode an die freude’, ‘We zijn zo jong’, ‘Duitsland wint altijd’ – I love the sound of resignation he has in his voice, like a deep sigh put to music. That album came with a floppy disk (!) with the lyrics. ‘Het voorspel was moordend’, ‘Tijdbom’ – while the music came back to being a bit more conventional, the lyrics got more hermetically sealed. I must admit that I slowly lost track – having moved to Barcelona at some point, it was much harder to catch them live of course. I know their first five albums the best, and while I still bought the others (having missed only one), none of them had the luxury of not having any other album in my collection to compete with like their debut album had. But there is no denying that when they were great, they were still amazing. A song like ‘Veronica komt naar je toe’ managed to pull together so many different things. The title was a recurring slogan of a Dutch channel that was popular among young people in Belgium, for lack of a Belgian alternative. Here’s a great song, with a great chorus, and his ability to sample just this one sentence to evoke a memory of youth every one of my generation remembers (while it evoked at the same time my personal memory of seeing ‘I want your sex’ on Veronica). And then he manages to evoke such a common feeling everyone has, where you are trying to grab that fleeting thing you were thinking just a second ago, straddling typically complicated-to-phrase words in Dutch with effortless ease – ‘Wat was het nu ook alweer/dat ik wou doen/het was iets belangrijks’ (or ‘What was it again/I wanted to do/it was something important’).
In the beginning, his lyrics were quirky in ideas, but fairly straightforward in their phrasing. Further on in their career, they experimented quite a bit musically, but especially the lyrics could get complicated, and with exceptional and inventive phrasing.

I’m 31, and I live in Barcelona, but I travel back to Belgium because Gorki is playing their debut album, Gorky. I wrote about that concert back then, but that memory is still strong. I can’t believe that was 7 years ago…

I always enjoyed reading his columns in Zone 09 whenever I was in my hometown, I thought he had a great gift for writing. I noticed just now he left behind quite a few more books than I had, so I started tracking those down. So many of my memories have his music attached to it. His was the first band that opened me up to a wider range of music, away from the mainstream (not everybody would agree I guess, but I never considered them mainstream. Their debut album certainly was different enough from whatever was considered mainstream at the time, and as often happens this debut was only widely recognized several albums into their career later, while at the same time those later albums never really got the same kind of traction.)

I loved his way of looking at the world, the way he described it in music, lyrics, writing, and interviews. Always with that cheeky look. Like, surprisingly it now turns out, so many of my generation, his music was intertwined with my growing up. Here’s a man I was hoping would live long and make much more music, and grow old playing hundreds of songs in bars and clubs, but it wasn’t to be. He set out to be a successful rock singer, whether that was tongue-in-cheek or not, and by all accounts he achieved what he set out to do. And everything he did, he did it for the best of reasons. He did it for ‘a fistful of bonnekes’.


November 29, 2014

is this a protocol? displaylink3

I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or is the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.

November 11, 2014

systemd For Administrators, Part XXI

Container Integration

For a while now, containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, but is less flexible as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions.

We now have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

# systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.
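
If you want to consume this list from a script, the table is easy to pick apart. The following is a minimal sketch, assuming exactly the output layout shown above (a header row, one row per machine, a blank line, then a summary); the helper name is made up for this example, and other systemd versions may print different columns.

```python
SAMPLE_OUTPUT = """\
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.
"""


def parse_machinectl_list(text):
    """Turn the table printed by a bare `machinectl` into a list of dicts.

    Assumes the layout shown in this post: whitespace-separated columns with
    no spaces inside any value.
    """
    lines = text.splitlines()
    if not lines:
        return []
    header = lines[0].split()
    machines = []
    for line in lines[1:]:
        if not line.strip():
            break  # the blank line separates the table from the summary
        machines.append(dict(zip(header, line.split())))
    return machines


print(parse_machinectl_list(SAMPLE_OUTPUT))
```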

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                │ └─5383 /usr/lib/systemd/systemd-journald
                │ └─5411 /usr/lib/systemd/systemd-logind
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much for the machinectl tool. It knows a couple more commands; please check the man page for details. Note again that even though we use systemd-nspawn as the container manager here, the concepts apply to any container manager that implements the logic described here, including libvirt-lxc, for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, by repeating the systemd-nspawn command from above):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similarly, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here, the blog story is already getting too long).

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, then followed by the units of the one container we have currently running. The units of the containers are prefixed with the container name, and a colon (":"). (The output is shortened again for brevity's sake.)
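If you want to post-process this listing in a script, note that a plain split on the first colon is not enough: host device units (like the PCI one above) contain colons too. A minimal sketch in Python, assuming you already have the set of running machine names (for example from machinectl); split_machine_unit is a hypothetical helper, not something systemd ships:

```python
def split_machine_unit(name, machines):
    """Split 'machine:unit' into (machine, unit).

    Units that belong to the host are returned as (None, name).
    `machines` is the set of known container names; it is needed because
    host unit names may themselves contain colons.
    """
    machine, sep, unit = name.partition(":")
    if sep and machine in machines:
        return machine, unit
    return None, name


if __name__ == "__main__":
    known = {"mycontainer"}
    for row in ("boot.automount", "mycontainer:dev-mqueue.mount"):
        print(split_machine_unit(row, known))
```

Passing the machine set explicitly keeps the helper honest: a prefix is only treated as a container name when a container of that name is actually known.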

The list-machines subcommand of systemctl shows a list of all running containers, querying the system managers within the containers about system state and health. More specifically, it shows whether containers are properly booted up, or whether there are any failed services:

# systemctl list-machines
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state being degraded.

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
2915 -                               emacs contents/projects/
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/ -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus, sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.

systemd-networkd also has support for containers. When run inside a container, it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host, networkd will by default provide a DHCP server and IPv4LL on any veth network interface named ve- followed by a container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module, nss-mymachines, that makes the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above, however, the container shares the network configuration with the host; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now, (assuming that networkd is used in the container and outside) we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer ( 56(84) bytes of data.
64 bytes from mycontainer ( icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer ( icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them venerable ssh.
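To illustrate the point for your own programs: anything that goes through libc's getaddrinfo() picks up the container names without extra work. A minimal sketch in Python, whose socket.getaddrinfo() wraps the libc call; resolve() is a hypothetical helper, and the demo resolves "localhost" only so that it runs anywhere:

```python
import socket


def resolve(name):
    """Return the sorted, de-duplicated addresses getaddrinfo() reports for name."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve("localhost"))
```

On a host where nss-mymachines is enabled and the container runs in its own network namespace, resolve("mycontainer") should return the container's addresses in the same way.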

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but covers VMs too to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, as it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.

November 07, 2014

more on Displaylink3 and HDCP encryption

okay another braindump (still nothing working).

The git repo mentioned in previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated a HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet the USB traces contain is 384-bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try and use this fact. So I've taken a trace I made from Windows, and extracted the necessary bits, and using that I've managed to derive the master key used in that trace, and subsequently managed to derive the session key for it. So I've replayed the first encrypted packet from the trace to the device and got an encrypted response the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me? In theory I should be able to replay the full trace (haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.
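For reference, here is how I understand those inputs to fit together, as a rough Python sketch. The layout (AES key = ks XOR lc128, counter block = (riv XOR streamCtr) followed by inputCtr, with streamCtr XORed into the low 32 bits of riv, all big-endian) is my reading of the HDCP 2.x spec and may not be what DisplayLink actually does. The function names are made up for illustration, and the AES step itself is left out since it needs a third-party library:

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def cipher_key(ks, lc128):
    """AES key as assumed here: the 128-bit session key XORed with lc128."""
    if len(ks) != 16 or len(lc128) != 16:
        raise ValueError("ks and lc128 must be 16 bytes each")
    return xor_bytes(ks, lc128)


def counter_block(riv, stream_ctr, input_ctr):
    """Build the 16-byte AES-CTR input: (riv XOR streamCtr) || inputCtr."""
    if len(riv) != 8:
        raise ValueError("riv must be 8 bytes")
    if not 0 <= stream_ctr < 2 ** 32 or not 0 <= input_ctr < 2 ** 64:
        raise ValueError("counter out of range")
    # streamCtr is 32 bits wide, so the XOR only touches the low half of riv.
    mixed = int.from_bytes(riv, "big") ^ stream_ctr
    return mixed.to_bytes(8, "big") + input_ctr.to_bytes(8, "big")


if __name__ == "__main__":
    # Placeholder values only; the real lc128 is exactly what is missing.
    print(counter_block(bytes(8), 0, 0).hex())
```

With streamctr and inputctr both 0, the counter block is just riv followed by eight zero bytes, which is the case assumed for the first packet.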

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official hdcp lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point. Now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space, probably impossible.
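To put a rough number on "probably impossible", here is a back-of-the-envelope calculation in Python. The guess rate of 10^12 keys per second is an assumption picked to be generous, not a measured figure:

```python
KEYSPACE = 2 ** 128                 # number of possible lc128 values
GUESSES_PER_SECOND = 10 ** 12       # assumed rate, deliberately optimistic
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_exhaust(keyspace=KEYSPACE, rate=GUESSES_PER_SECOND):
    """Years needed to try every key once at the given rate."""
    return keyspace / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_to_exhaust():.2e} years")
```

That works out to roughly 10^19 years for the full space (half that on average), so finding the value in RAM is the only realistic route.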

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

November 03, 2014

Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump. A crashdump mechanism that uses kexec to switch to a different kernel, before writing out memory to disk, nfs, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Someone every single RHEL release invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.

October 31, 2014

a day with DisplayLink USB3 and HDCP

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Note: HDCP1.0 being broken doesn't matter to this; maybe HDCPv2.0 being a bit broken could be used, but I'm not sure how!

The displaylink USB3 protocol is based on the HDCP protocol. I've traced the first few packets and it clearly looks like the host sends two packets


and the device sends back

at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, that you have to verify.

So the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the certificate using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing. If you are openssl knowledgeable and want to look, the hack fest is

It might be that the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.

Now once I get past this hurdle, the larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128, that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the displaylink driver in hex, but I'd hope they at least hide it better than that. It may also possibly be supplied by the OS, Windows or OSX. (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that Displaylink allow non-HDCP encrypted data to be sent to the device, in which case win if I can find out where/how to do that, or it might be the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P

October 29, 2014

Understanding Wikimedia, or, the Heavy Metal Umlaut, one decade on

It has been nearly a full decade since Jon Udell’s classic screencast about Wikipedia’s article on the Heavy Metal Umlaut (current text, Jan. 2005). In this post, written for Paul Jones’ “living and working online” class, I’d like to use the last decade’s changes to the article to illustrate some points about the modern Wikipedia.1

Measuring change

At the end of 2004, the article had been edited 294 times. As we approach the end of 2014, it has now been edited 1,908 times by 1,174 editors.2

This graph shows the number of edits by year – the blue bar is the overall number of edits in each year; the dotted line is the overall length of the article (which has remained roughly constant since a large pruning of band examples in 2007).



The dropoff in edits is not unusual — it reflects both a mature article (there isn’t that much more you can write about metal umlauts!) and an overall slowing in edits in English Wikipedia (from a peak of about 300,000 edits/day in 2007 to about 150,000 edits/day now).3

The overall edit count — 2000 edits, 1000 editors — can be hard to get your head around, especially if you write for a living. Implications include:

  • Style is hard. Getting this many authors on the same page, stylistically, is extremely difficult, and it shows in inconsistencies small and large. If not for the deeply acculturated Encyclopedic Style we all have in our heads, I suspect it would be borderline impossible.
  • Most people are good, most of the time. Something like 3% of edits are “reverted”; i.e., about 97% of edits are positive steps forward in some way, shape, or form, even if imperfect. This is, I think, perhaps the single most amazing fact to come out of the Wikimedia experiment. (We reflect and protect this behavior in one of our guidelines, where we recommend that all editors Assume Good Faith.)

The name change, tools, and norms

In December 2008, the article lost the “heavy” from its name and became, simply, “metal umlaut” (explanation, aka “edit summary“, highlighted in yellow):

Name change

A few takeaways:

  • Talk pages: The screencast explained one key tool for understanding a Wikipedia article – the page history. This edit summary makes reference to another key tool – the talk page. Every Wikipedia article has a talk page, where people can discuss the article, propose changes, etc. In this case, this user discussed the change (in November) and then made the change in December. If you’re reporting on an article for some reason, make sure to dig into the talk page to fully understand what is going on.
  • Sources: The user justifies the name change by reference to sources. You’ll find little reference to them in 2005, but by 2008, finding an old source using a different term is now sufficient rationale to rename the entire page. Relatedly…
  • Footnotes: In 2008, there was talk of sources, but still no footnotes. (Compare the story about Motley Crue in Germany in 2005 and now.) The emphasis on footnotes (and the ubiquitous “citation needed”) was still a growing thing. In fact, when Jon did his screencast in January 2005, the standardized/much-parodied way of saying “citation needed” did not yet exist, and would not until June of that year! (It is now used in a quarter of a million English Wikipedia pages.) Of course, the requirement to add footnotes (and our baroque way of doing so) may also explain some of the decline in editing in the graphs above.

Images, risk aversion, and boldness

Another highly visible change is to the Motörhead art, which was removed in November 2011 and replaced with a Mötley Crüe image in September 2013. The addition and removal present quite a contrast. The removal is explained like this:

remove File:Motorhead.jpg; no fair use rationale provided on the image description page as described at WP:NFCC content criteria 10c

This is clear as mud, combining legal issues (“no fair use rationale”) with Wikipedian jargon (“WP:NFCC content criteria 10c”). To translate it: the editor felt that the “non-free content” rules (abbreviated WP:NFCC) prohibited copyrighted content unless there was a strong explanation of why the content might be permitted under fair use.

This is both great, and sad: as a lawyer, I’m very happy that the community is pre-emptively trying to Do The Right Thing and take down content that could cause problems in the future. At the same time, it is sad that the editors involved did not try to provide the missing fair use rationale themselves. Worse, a rationale was added to the image shortly thereafter, but the image was never added back to the article.

So where did the new image come from? Simply:

boldly adding image to lead

“boldly” here links to another core guideline: “be bold”. Because we can always undo mistakes, as the original screencast showed about spam, it is best, on balance, to move forward quickly. This is in stark contrast to traditional publishing, which has to live with printed mistakes for a long time and so places heavy emphasis on Getting It Right The First Time.

In brief

There are a few other changes worth pointing out, even in a necessarily brief summary like this one.

  • Wikipedia as a reference: At one point, in discussing whether or not to use the phrase “heavy metal umlaut” instead of “metal umlaut”, an editor makes the point that Google has many search results for “heavy metal umlaut”, and another editor points out that all of those search results refer to Wikipedia. In other words, unlike in 2005, Wikipedia is now so popular, and so widely referenced, that editors must be careful not to (indirectly) be citing Wikipedia itself as the source of a fact. This is a good problem to have—but a challenge for careful authors nevertheless.
  • Bots: Careful readers of the revision history will note edits by “ClueBot NG“. Vandalism of the sort noted by Jon Udell has not gone away, but it now is often removed even faster with the aid of software tools developed by volunteers. This is part of a general trend towards software-assisted editing of the encyclopedia.
  • Translations: The left hand side of the article shows that it is in something like 14 languages, including a few that use umlauts unironically. This is not useful for this article, but for more important topics, it is always interesting to compare the perspective of authors in different languages.

Other thoughts?

I look forward to discussing all of these with the class, and to any suggestions from more experienced Wikipedians for other lessons from this article that could be showcased, either in the class or (if I ever get to it) in a one-decade anniversary screencast. :)

  1. I still haven’t found a decent screencasting tool that I like, so I won’t do proper homage to the original—sorry Jon!
  2. Numbers courtesy X’s edit counter.
  3. It is important, when looking at Wikipedia statistics, to distinguish between stats about Wikipedia in English, and Wikipedia globally — numbers and trends will differ vastly between the two.

October 24, 2014

Introducing Gthree

I’ve recently been working on OpenGL support in Gtk+, and last week it landed in master. However, the demos we have are pretty lame and are not very good to show off or even test the OpenGL support. I’ve looked around for some open source demos that used modern GL that we could use, but I didn’t find anything that we could easily use.

What I did find though, was a lot of WebGL demos that used three.js. This looked like a very nice open source library for high-level 3d rendering. At first I had some plans to bind OpenGL to gjs so that we could run three.js, but this turned out to be hard.

Instead I started converting three.js into C + GObject, using the Gtk+ OpenGL support and the vector/matrix library graphene that Emmanuele has been working on recently.

After about a week of frantic hacking it is now at a stage where it may be interesting for others. So, without further ado I introduce:

It does not yet support everything that three.js can do, but it does support meshes with most mesh material types and lighting, including a loader for the json model format of three.js, which means that it is minimally useful.

Here are some screenshots of the examples that ship with the code:

Screenshot: Various types of materials
Screenshot: Some sample models from three.js examples
Screenshot: Some random cubes

This has been a lot of fun to work on as I’ve seen a lot of progress very fast. Mad props to mrdoob and the other three.js developers for creating three.js and making it free software. Gthree is a huge rip-off of their work and would never be possible without it. Thanks also to Emmanuele for his graphene library.

What are you sitting here for, go ahead and play with it! Make some demos, port some more three.js features, marvel at the fancy graphics!

October 23, 2014

Trinity and pages of random data.

Something trinity uses a lot is pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’: a single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnants of the supporting code were some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After today’s work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mmaps after an mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code, and unifying a lot of the support code is that a lot of code paths get simpler, and more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.
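To give a feel for what a length-aware dirtying routine looks like, here is a small sketch. It is in Python rather than trinity's C, and it is not trinity's code: dirty_mapping, maybe_dirty and the DIRTY_INTERVAL constant are hypothetical names, with the interval of 100 taken from the old page_rand behaviour described above.

```python
import mmap
import os

PAGE_SIZE = mmap.PAGESIZE
DIRTY_INTERVAL = 100  # assumed: dirty once every 100 syscalls


def dirty_mapping(mapping, length):
    """Write one random byte into each page of the first `length` bytes.

    Returns the number of pages touched. The explicit length is the point:
    the caller states how much of the mapping may be written to.
    """
    if length > len(mapping):
        raise ValueError("length exceeds mapping size")
    touched = 0
    for offset in range(0, length, PAGE_SIZE):
        mapping[offset] = os.urandom(1)[0]
        touched += 1
    return touched


def maybe_dirty(mapping, length, syscall_count, interval=DIRTY_INTERVAL):
    """Dirty the mapping only on every `interval`-th syscall."""
    if syscall_count % interval == 0:
        return dirty_mapping(mapping, length)
    return 0


if __name__ == "__main__":
    m = mmap.mmap(-1, 4 * PAGE_SIZE)  # anonymous mapping
    print(dirty_mapping(m, len(m)))
    m.close()
```

Because the length travels with the mapping, an oversized request fails loudly instead of scribbling past the end of the buffer.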

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1], [2]. Grumble.

October 19, 2014

Laptop bleg

I'm considering a laptop (actually two). Requirements:

  • 13" to 14" class.
  • Indestructible.
  • Display that is not too wide. Enough with 16:9 already! Aspect of 1.6 would be ideal (Lenovo T400 had that).
  • Light. Indestructible is more important, but it should be light: 2kg or less.
  • No nipple. No Lenovo.

Where it comes from is mostly my wife's Sony Vaio Z. I used to have a Z back in 2001 or so, when they were in 12" format. It was the best laptop ever, but unfortunately it succumbed to a DC-DC converter failure. The modern Z is not like that Z. The most super annoying problem is that the screws holding the battery failed in an interesting way: it is impossible to remove the battery now. Also, the contact between the battery and the motherboard is marginal. I managed to fix the problem by manufacturing a finely shaped wooden wedge that I drove into a gap and thus extended the life of that thing, but man, Sony, this is disappointing.

Unfortunately, I don't remember if it was Kota or Daisuke, but one of the Japanese guys at a recent Swift Hackathon in Boston had a Z of a similar vintage, and it looked impeccable. Maybe Sony figured that it's going to be the predominant mode of care that their wares receive, and so why not make the modern Z this much cheaper than the old, indestructible Z. But they still charge exorbitant prices.

Lenovo wins a special notice because I had a T400 for 3 years and swore never to deal with it ever again. The biggest problem is the keyboard layout, because I use left pinky for control key. I could live with their idiotic placement of Escape, but I refuse to deal with 3 years of physical pain again. Also, their famous quality seems to be slipping, as my mouse button broke within 3 years. Battery died, too. However, the T400 had a very good display, and I would like another like that, if possible.

October 18, 2014

Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscall arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. A fairly quick rewrite took care of that.
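
To illustrate the length-aware rendering described above, here is a minimal sketch in C. It is not trinity's actual code: the struct render_buf type and the render_init/render_append helpers are hypothetical names made up for this example. The idea is simply that every append is told how much space remains, and reports truncation instead of writing past the end.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical render buffer (not trinity's real structure): it tracks
 * how much has been written so every append can be bounded. */
struct render_buf {
	char *data;   /* start of the caller-provided buffer */
	size_t size;  /* total size of the buffer in bytes */
	size_t used;  /* bytes written so far, excluding the trailing NUL */
};

static void render_init(struct render_buf *rb, char *data, size_t size)
{
	assert(data != NULL && size > 0);
	rb->data = data;
	rb->size = size;
	rb->used = 0;
	rb->data[0] = '\0';
}

/* Append formatted text. Returns 0 if everything fit, -1 if the output
 * was truncated (or formatting failed). Never writes past rb->size. */
static int render_append(struct render_buf *rb, const char *fmt, ...)
{
	size_t remaining = rb->size - rb->used;  /* always >= 1 */
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(rb->data + rb->used, remaining, fmt, ap);
	va_end(ap);

	if (n < 0)
		return -1;
	if ((size_t)n >= remaining) {
		/* vsnprintf filled the buffer and NUL-terminated it. */
		rb->used = rb->size - 1;
		return -1;
	}
	rb->used += (size_t)n;
	return 0;
}
```

A caller would initialize the buffer once per syscall, append each rendered argument in turn, and stop (or mark the output as truncated) as soon as render_append returns -1.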

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.
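
A rough sketch of the corrected check, in C, might look like the following. The function and constant names are hypothetical, the 2145 offset is assumed to be in seconds, and trinity's real watchdog may well compare different values. The point is only that the allowance has to include both the adjtimex offset and the one-second check interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Maximum clock offset quoted in the post; treating it as seconds is an assumption. */
#define ADJTIMEX_MAX_OFFSET 2145
/* The watchdog looks at each child once per second. */
#define CHECK_INTERVAL_SECS 1

/*
 * Hypothetical helper: returns true if a child's timestamp is further ahead
 * of the previous check than one interval plus a clock adjustment can explain.
 */
static bool stamp_jumped_into_future(time_t prev_check, time_t stamp)
{
	long long delta = (long long)stamp - (long long)prev_check;

	return delta > ADJTIMEX_MAX_OFFSET + CHECK_INTERVAL_SECS;
}

/* Boundary cases: the extra second is what the buggy check left out. */
static void stamp_check_examples(void)
{
	assert(!stamp_jumped_into_future(1000, 1001));
	assert(!stamp_jumped_into_future(1000, 1000 + 2145 + 1));
	assert(stamp_jumped_into_future(1000, 1000 + 2145 + 2));
}
```

With only the 2145 allowance, the middle case above would have been flagged as corruption even though nothing was wrong.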

There’s been a bunch of small 1-2 line fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: The only interesting bugs that Trinity has turned up this week have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.

October 09, 2014

Emacs hint for Firefox hacking

I started hacking on firefox recently. And, of course, I’ve configured emacs a bit to make hacking on it more pleasant.

The first thing I did was create a .dir-locals.el file with some customizations. Most of the tree has local variable settings in the source files — but some are missing and it is useful to set some globally. (Whether they are universally correct is another matter…)

Also, I like to use bug-reference-mode. What this does is automatically highlight references to bugs in the source code. That is, if you see “bug #1050501”, it will be buttonized and you can click (or C-RET) and open the bug in the browser. (The default regexp doesn’t capture quite enough references so my settings hack this too; but I filed an Emacs bug for it.)

I put my .dir-locals.el just above my git checkout, so I don’t end up deleting it by mistake. It should probably just go directly in-tree, but I haven’t tried to do that yet. Here’s that code:

(
 ;; Generic settings.
 (nil .
      ;; See C-h f bug-reference-prog-mode, e.g., for using this.
      ((bug-reference-url-format . "")
       (bug-reference-bug-regexp . "\\([Bb]ug ?#?\\|[Pp]atch ?#\\|RFE ?#\\|PR [a-z-+]+/\\)\\([0-9]+\\(?:#[0-9]+\\)?\\)")))

 ;; The built-in javascript mode.
 (js-mode .
     ((indent-tabs-mode . nil)
      (js-indent-level . 2)))

 (c++-mode .
     ((indent-tabs-mode . nil)
      (c-basic-offset . 2)))

 (idl-mode .
     ((indent-tabs-mode . nil)
      (c-basic-offset . 2)))
 )


In programming modes I enable bug-reference-prog-mode. This enables highlighting only in comments and strings. This could easily be done from prog-mode-hook, but I made my choice of minor modes depend on the major mode via find-file-hook.

I’ve also found that it is nice to enable this minor mode in diff-mode and log-view-mode. This way you get bug references in diffs and when viewing git logs. The code ends up like:

(defun tromey-maybe-enable-bug-url-mode ()
  (and (boundp 'bug-reference-url-format)
       (stringp bug-reference-url-format)
       (if (or (derived-mode-p 'prog-mode)
	       (eq major-mode 'tcl-mode)	;emacs 23 bug
	       (eq major-mode 'makefile-mode)) ;emacs 23 bug
	   (bug-reference-prog-mode t)
	 (bug-reference-mode t))))

(add-hook 'find-file-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'log-view-mode-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'diff-mode-hook #'tromey-maybe-enable-bug-url-mode)

October 03, 2014

I think it’s better to look odd than to look normal

In the fall of ’98 I had a thing for a girl I didn’t want to have a thing for. I had also just seen one of my favorite movies, Much Ado About Nothing (the original Branagh movie, not the Joss Whedon one that I didn’t know about until recently and have yet to see).

I decided to exorcise my feelings into a good old-fashioned mix cd (well, I guess that wasn’t old fashioned back in ’98). I cut up the movie dialogue into pieces, and interspersed them in between a song selection aiming to match the flow of the movie lyric-wise and, in places, matching them sound-wise too to the movie snippets. It ended up being two cd’s, and a bunch of my friends liked it as well so I think I ended up making about 30 copies of the thing.

Today I needed to recreate those two CD’s plus its original packaging. That means I had to actually buy CD-R’s (didn’t have any anymore after the move to the US), buy jewelcases (can you believe that I actually have actual boxes with actual empty jewelcases that I *kept* in storage in Belgium? These days if you want to buy them they’re a little harder to find than they used to be, even though I’m sure there must be landfills full of them all over the world), and go to a print shop to print the front and back covers.

Being the obsessive backupper that I am, it was easy to find the sound files back (actually, I took a morituri rip that I made at my best friend’s house, who has the CD’s, last time I was there – so that I would have a perfect .cue sheet that would stitch the tracks together). I knew I had the files for the fronts and backs somewhere as well, but they were a little harder to find because I couldn’t remember their names. But I trusted my OCD self that I had backups from fifteen years ago somewhere here with me in NY, and I started looking for files from the same timeframe, until I came across the files I was looking for hidden in a subdirectory.

But then when you find them, what do you do with .cdr CorelDraw files from 1998? I tried inkscape, which uses uniconvertor, which on my F-19 machine failed with a constructor with wrong arguments in Python, which seems like a silly bug. I rebuilt the F-21 version, which gets past that bug, but then doesn’t actually convert anything. I tried an online converter, and it only picked up on the images and none of the text.

So I went the illegal route – I downloaded CorelDraw 11 from the internet, installed it in wine (which was surprisingly easy, it just worked), and I could open the files. Except that it was missing fonts and so the layout was all wrong. Sigh. Hunt random font sites for the missing fonts, install them for wine, open again, rinse, repeat. Eventually the files opened with the right fonts, except that one of the titles was too big to fit on the CD inlay. Oh well, adjust them all manually, make it a little smaller, export to eps, load in gimp, adjust the page as it was perfectly measured for A4 printing but I’m in the US now and the US uses letter which is slightly different, export to pdf so I could go to any random print shop in New York and get it printed.

CD burnt, on to the print shop, fiddle with the printer as nobody in the store can figure out which tray number the tray is where they loaded the card stock paper, and it’s not like the driver on the windows machine knows either – I had to do 5 failed prints to different printers before we even knew which printer was the right one. Cut up the paper by hand with scissors (which I suck at), put it all together, and be on my way.

All this just to say that, while I can be as good about backups as I want to be to bring back to life something I did fifteen years ago, there is still a whole lot of real-world technology fails getting in the way, like outdated proprietary file formats, not having good interchange formats, missing fonts, paper sizes and general Imperial/metric nonsense, ages-old printer crap and just simple manual tasks, which we as humans will probably inflict upon ourselves for forever. I mean, I’d sure like to believe that in the future it will be as simple as pressing a button and getting this 15 year old CD project 3D-printed all at once, but experience has taught me that most likely I will be fiddling just as much with getting 2040’s 3D printer to work with 2025’s data files.

And so it is that I arrive just after 6 at Barnes and Noble in Tribeca, queue up in front of eight registers with only one open, buy a book, get a wristband, go to the back where Emma Thompson is reading from her Peter Rabbit book, in her perfectly English and genuinely funny way, queue after the reading, and hear her say “I think it’s better to look odd than to look normal” to the seven year old twin girls in front of me. I wholeheartedly agree with her. I hand her my copy to sign, give her my two cd’s and tell her what they are and say that I thought this was a good opportunity to give them to her, and she smiles and seems genuinely surprised and pleased.

I think my dad would be genuinely jealous at this point – he always seemed to appreciate seeing her on the screen, and after today I can’t say I blame him. I hope she enjoys the CD’s, and if someone can recommend a good website where I can put these online for others to listen to, that would be great!

September 20, 2014

Emacs Modules

I’ve been working on an odd Emacs package recently — not ready for release — which has turned into more than the usual morass of prefixed names and double hyphens.

So, I took another look at Nic Ferrier’s namespace proposal.

Suddenly it didn’t seem all that hard to implement something along these lines, and after a bit of poking around I wrote emacs-module.

The basic idea is to continue to follow the Emacs approach of prefixing symbol names — but not to require you to actually write out the full names of everything.  Instead, the module system intercepts load and friends to rewrite symbol names as lisp is loaded.

The symbol renaming is done in a simple way, following existing Emacs conventions.  This gives the nice result that existing code doesn’t need to be updated to use the module system directly.  That is, the module system recognizes name prefixes as “implicit” modules, based purely on the module name.

I’d say this is still a proof-of-concept.  I haven’t tried hairier cases, like defclass, and at least declare-function does not work but should.

Here’s the example from the docs:

(define-module testmodule :export (somevar))
(defvar somevar nil)
(defvar private nil)
(provide 'testmodule)

This defines the public variable testmodule-somevar and the “private” variable testmodule--private.

September 12, 2014

Trinity threading improvements and misc

Since my blogging tsunami almost a month ago, I’ve been pretty quiet. The reason being that I’ve been heads down working on some new features for trinity which have turned out to be a lot more involved than I initially anticipated.

Trinity does all of its work in child processes continually forked off from a main process. For a long time I’ve had “investigate using pthreads” as a TODO item, but after various conversations at kernel summit, I decided to bump the priority of that up a little, and spend some time looking at it. I initially guessed that it would take maybe a few weeks to have something usable, but after spending some time working on it, every time I make progress on one issue, it becomes apparent that there’s something else that is also going to need changing.

I’m taking a week off next week to clear my head and hopefully return to this work with fresh eyes, and make more progress, because so far it’s been mostly frustrating, and there may be an easier way to solve some of the problems I’ve been hitting. Sidenote: In the 15+ years I’ve been working on Linux, this is the first time I recall actually ever using pthreads in my own code. I can’t say I’ve been missing out.

Unrelated to that work, a month or so ago I came up with a band-aid fix for a problem where trinity would corrupt its own structures. That ‘fix’ turned out to break the post-mortem work I implemented a few months prior, so I’ve spent some time this week undoing that, and thinking about how I’m going to fix that properly. But before coming up with a fix, I needed to reproduce the problem reliably, and naturally now that I’ve added debug code to determine where the corruption is coming from, the bug has gone into hiding.

I need this vacation.

September 04, 2014

My Wikimania 2014 talks

Primarily what I did during Wikimania was chew on pens.

Discussing Fluid Lobbying at Wikimania 2014, by Sebastiaan ter Burg, under CC BY 2.0

However, I also gave some talks.

The first one was on Creative Commons 4.0, with Kat Walsh. While targeted at Wikimedians, this may be of interest to others who want to learn about CC 4.0 as well.

Second one was on Open Source Hygiene, with Stephen LaPorte. This one is again Wikimedia-specific (and I’m afraid less useful without the speaker notes) but may be of interest to open source developers more generally.

The final one was on sharing; video is below (and I’ll share the slides once I figure out how best to embed the notes, which are pretty key to understanding the slides):

September 02, 2014

Wikimania 2014 Notes – very miscellaneous

A collection of semi-random notes from Wikimania London, published very late:

Gruppenfoto Wikimania 2014 London, by Ralf Roletschek, under CC BY-SA 3.0 Austria

The conference generally

  • Tone: Overall tone of the conference was very positive. It is possibly just small sample size—any one person can only talk to a small number of the few thousand at the conference—but seemed more upbeat/positive than last year.
  • Tone, 2: The one recurring negative theme was concern about community tone, from many angles, including Jimmy. I’m very curious to see how that plays out. I agree, of course, and will do my part, both at WMF and when I’m editing. But that sort of social/cultural change is very hard.
  • Speaker diversity: Heard a few complaints about gender balance and other diversity issues in the speaker lineup, and saw a lot of the same (wonderful!) faces as last year. I’m wondering if there are procedural changes (like maybe blind submissions, or other things from this list) that might bring some new blood and improve diversity.
  • “Outsiders”: The conference seemed to have better representation than last year from “outside” our core community. In particular, it was great for me to see huge swathes of the open content/open access movements represented, as well as other free software projects like Mozilla. We should be a movement that works well with others, and Wikimania can/should be a key part of that, so this was a big plus for me.
  • Types of talks: It would be interesting to see what the balance was of talks (and submissions) between “us learning about the world” (e.g., me talking about CC), “us learning about ourselves” (e.g., the self-research tracks), and “the world learning about us” (e.g., aimed at outsiders). Not sure there is any particular balance we should have between the three of them, but it might be revealing to see what the current balance is.
  • Less speaking, more conversing: Next year I will probably propose mostly (only?) panels and workshops, and I wonder if I can convince others to do the same. I can do a talk+slides and stream it at any time; what I can only do in person is have deeper, higher-bandwidth conversations.
  • Physical space and production values: The hackathon space was amazingly fun for me, though I got the sense not everyone agreed. The production values (and the rest of the space) for the conference were very good. I’m torn on whether or not the high production values are a plus for us, honestly. They raise the bar for participation (bad); make the whole event feel somewhat… un-community-ish(?); but they also make us much more accessible to people who aren’t yet ready for the full-on, super-intense Wikimedian Experience.

The conference for projects I work on

  • LCA: Legal/Community Affairs was pretty awesome on many fronts—our talks, our work behind the scenes, our dealing with both the expected and unexpected, etc. Deeply proud to be part of this dedicated, creative team. Also very appreciative for everyone who thanked us—it means a lot when we hear from people we’ve helped.
  • Maps: Great seeing so much interest in Open Street Map. Had a really enjoyable time at their 10th birthday meetup; was too bad I had to leave early. Now have a better understanding of some of the technical issues after a chat with Kolossos and Katie. Also had just plain fun geeking out about “hard choices” like map boundaries—I find how communities make decisions about problems like that fascinating.
  • Software licensing: My licensing talk with Stephen went well, but probably should have been structured as part of the hackathon rather than for more general audiences. Ultimately this will only work out if engineering (WMF and volunteer) is on board, and will work best if engineering leads. (The question asked by Mako afterwards has already led to patches, which is cool.)
  • Creative Commons: My CC talk with Kat went well, and got some good questions. Ultimately the rubber will meet the road when the translations are out and we start the discussion with the full community. Also great meeting User:Multichill; looking forward to working on license templates with him and May from design.
  • Metadata: The multimedia metadata+licensing work is going to be really challenging, but very interesting and ultimately very empowering for everyone who wants to work with the material on commons. Look forward to working with a large/growing number of people on this project.
  • Advocacy: Advocacy panel was challenging, in a good way. A variety of good, useful suggestions; but more than anything else, I took away that we should probably talk about how we talk when subjects are hard, and consensus may be difficult to reach. Examples would include when there is a short timeline for a letter, or when topics are deeply controversial for good, honest reasons.

The conference for me

  • Lesson (1): Learned a lesson: never schedule a meeting for the day after Wikimania. Odds of being productive are basically zero, though we did get at least some things done.
  • Lesson (2): I badly overbooked myself; it hurt my ability to enjoy the conference and meet everyone I wanted to meet. Next year I’ll try to be more focused in my commitments so I can benefit more from spontaneity, and get to see some slightly less day-job-related (but enjoyable or inspirational) talks/presentations.
  • Research: Love that there is so much good/interesting research going on, and do deeply think that it is important to understand it so that I can apply it to my work. Did not get to see very much of it, though :/
  • Arguing with love: As tweeted about by Phoebe, one of the highlights was a vigorous discussion (violent agreement :) with Mako over dinner about the four freedoms and how they relate to just/empowering software more broadly. Also started a good, vigorous discussion with SJ about communication and product quality, but we sadly never got to finish that.
  • Recharging: Just like GUADEC in my previous life, I find these exhausting but also ultimately exhilarating and recharging. Can’t wait to get to Mexico City!


  • London: I really enjoy London—the mix of history and modernity is amazing. Bonus: I think the beer scene has really improved since the last time I was there.
  • Movies: I hardly ever watch movies anymore, even though I love them. Knocked out 10 movies in the 22 hours in flight. On the way to London:
    • The Grand Budapest Hotel (the same movie as every other one of his movies, which is enjoyable)
    • Jodorowsky’s Dune (awesome if you’re into scifi)
    • Anchorman (finally)
    • Stranger than Fiction (enjoyed it, but Adaptation was better)
    • Captain America: The Winter Soldier (not bad?)
  • On the way back:
    • All About Eve (finally – completely compelling)
    • Appleseed Alpha (weird; the awful dialogue and wooden “faces” of computer animated actors clashed particularly badly with the classically great dialogue and acting of All About Eve)
    • Mary Poppins (having just seen London; may explain my love of magico-realism?)
    • The Philadelphia Story (great cast, didn’t engage me otherwise)
    • Her (very good)

August 31, 2014

Revisiting How We Put Together Linux Systems

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems. I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager then grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of fixing security problems in a timely manner and of helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles than the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.

  • Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.

  • The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where to put x86-64 libraries is different on Fedora and Debian derived systems.

  • Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.

All of this together makes it really hard for many upstreams to work nicely with the way Linux currently works. Often they try to improve the situation for themselves, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there is no way to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages, as well as on package managers as the end-user install and update tool, is incompatible with what many system vendors want.


Users

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like those Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to address this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about them. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we build/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix these problems in a generic way, for all uses, right in the core components of the OS itself.

Linux has achieved tremendous success because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme for solving the problem set described above, one that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their software (regardless of whether it is just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.

  • We want to allow end users and administrators to install these packages on their systems, regardless of which distribution they have installed on them.

  • We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).

What We Propose

So much for the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around the variety of concepts of btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As the first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> -- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The <vendorid> field is replaced by some vendor identifier, maybe a scheme like org.fedoraproject.FedoraWorkstation. The <architecture> field specifies a CPU architecture the OS is designed for, for example x86-64. The <version> field specifies a specific OS version, for example 23.4. An example sub-volume name could hence look like this: usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> -- This refers to an instance of an operating system. It's basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The <name> field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> -- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the usr sub-volumes explained above, however, while a usr sub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> -- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> -- This encapsulates an application bundle. It contains a tree that at runtime is mounted to /opt/<vendorid>, and contains all the application's resources. The <vendorid> could be a string like org.libreoffice.LibreOffice, the <runtime> refers to the vendor id of one specific runtime the application is built for, for example org.gnome.GNOME3_20:3.20.1. The <architecture> and <version> refer to the architecture the application is built for, and of course its version. Example: app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> -- This sub-volume shall refer to the home directory of the specific user. The <user> field contains the user name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.
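To make the naming scheme a bit more concrete, here is a minimal Python sketch of how such names could be split into their fields. The field layouts are taken from the list above; the function name parse_subvolume and the table FIELDS are purely hypothetical and not part of systemd or btrfs.

```python
# Field layouts copied from the naming scheme above. The parser itself is a
# hypothetical illustration, not an existing systemd or btrfs interface.
FIELDS = {
    "usr": ("vendorid", "architecture", "version"),
    "root": ("name", "vendorid", "architecture"),
    "runtime": ("vendorid", "architecture", "version"),
    "framework": ("vendorid", "architecture", "version"),
    "app": ("vendorid", "runtime", "architecture", "version"),
    "home": ("user", "uid", "gid"),
}


def parse_subvolume(name):
    """Split a sub-volume name into its type and its named fields."""
    kind, _, rest = name.partition(":")
    if kind not in FIELDS:
        raise ValueError(f"unknown sub-volume type: {name!r}")
    values = rest.split(":")
    # Every field must be present and non-empty.
    if len(values) != len(FIELDS[kind]) or not all(values):
        raise ValueError(f"malformed {kind} sub-volume name: {name!r}")
    parsed = {"type": kind}
    parsed.update(zip(FIELDS[kind], values))
    return parsed


if __name__ == "__main__":
    print(parse_subvolume("usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4"))
    print(parse_subvolume("home:lennart:1000:1000"))
```

Since ":" is the separator, a sketch like this assumes that none of the fields contain a colon themselves.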

How To Use It

After we introduced this naming scheme let's see what we can build of this:

  • When booting up a system we mount the root directory from one of the root sub-volumes, and then mount /usr from a matching usr sub-volume. Matching here means it carries the same <vendorid> and <architecture>. Of course, by default we should pick the matching usr sub-volume with the newest version.

  • When we boot up an OS container, we do exactly the same as when we boot up a regular system: we simply combine a usr sub-volume with a root sub-volume.

  • When we enumerate the system's users we simply go through the list of home snapshots.

  • When a user authenticates and logs in we mount his home directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the app sub-volume to /opt/<vendorid>/, and the appropriate runtime sub-volume the app picked to /usr, as well as the user's /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where /usr is mounted from the framework sub-volume, and /home/$USER from his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to build software for multiple runtimes or architectures at the same time.
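The app case from the list above can be sketched as a small "mount plan": given an app, a runtime and a home sub-volume, work out what goes where in the new name space. This is only an illustration in Python. The function mount_plan is hypothetical, and so is the matching rule, which assumes that the app's <runtime> field equals the last dotted component of the runtime's vendor id (GNOME3_20 matching org.gnome.GNOME3_20), as the example names suggest.

```python
def mount_plan(app, runtime, home):
    """Return (sub-volume, mount point) pairs for running one app."""
    a_kind, a_vendor, a_runtime, a_arch, _a_version = app.split(":")
    r_kind, r_vendor, r_arch, _r_version = runtime.split(":")
    h_kind, h_user, _uid, _gid = home.split(":")
    if (a_kind, r_kind, h_kind) != ("app", "runtime", "home"):
        raise ValueError("expected an app, a runtime and a home sub-volume")

    # Assumption: the runtime matches if the architecture is the same and
    # the last dotted component of its vendor id is the app's <runtime>.
    if r_arch != a_arch or r_vendor.split(".")[-1] != a_runtime:
        raise ValueError(f"{runtime!r} does not match {app!r}")

    return [
        (runtime, "/usr"),
        (app, f"/opt/{a_vendor}"),
        (home, f"/home/{h_user}"),
    ]


if __name__ == "__main__":
    plan = mount_plan(
        "app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133",
        "runtime:org.gnome.GNOME3_20:x86_64:3.20.1",
        "home:lennart:1000:1000",
    )
    for subvolume, mount_point in plan:
        print(f"{subvolume} -> {mount_point}")
```

The sketch only computes the plan. Actually setting up the name space and doing the mounts is a separate step that is not shown here.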

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user to keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.
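The "pick the newest, fall back to an older one" logic could look roughly like the following Python sketch. Everything here is hypothetical: the proposal does not define how versions are ordered, so version_key simply assumes that numeric runs compare as integers and anything else as text, and try_boot stands in for whatever the boot loader would really do.

```python
import re


def version_key(version):
    """Sort key: numeric runs compare as integers, other runs as text."""
    key = []
    for part in re.findall(r"[0-9]+|[^0-9.]+", version):
        if part[0] in "0123456789":
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key


def usr_candidates(root_name, subvolumes):
    """List the usr sub-volumes matching a root sub-volume, newest first."""
    kind, _name, vendorid, arch = root_name.split(":")
    if kind != "root":
        raise ValueError(f"not a root sub-volume: {root_name!r}")
    matches = []
    for name in subvolumes:
        fields = name.split(":")
        # Matching means: same <vendorid> and same <architecture>.
        if len(fields) == 4 and fields[:3] == ["usr", vendorid, arch]:
            matches.append(name)
    return sorted(matches, key=lambda n: version_key(n.split(":")[3]),
                  reverse=True)


def boot(root_name, subvolumes, try_boot):
    """Try each candidate in turn, reverting to older versions on failure."""
    for usr in usr_candidates(root_name, subvolumes):
        if try_boot(root_name, usr):
            return usr
    return None


if __name__ == "__main__":
    installed = [
        "usr:org.fedoraproject.WorkStation:x86_64:24.7",
        "usr:org.fedoraproject.WorkStation:x86_64:24.9",
        "usr:org.fedoraproject.WorkStation:x86_64:25beta",
        "usr:org.mageia.Client:i386:39.3",
    ]
    root = "root:revolution:org.fedoraproject.WorkStation:x86_64"
    # Pretend the beta fails to boot, so the next older version is used.
    print(boot(root, installed, lambda r, u: not u.endswith("beta")))
```

Note that with this ordering a pre-release such as 25beta sorts as the newest version. Whether betas should be picked by default is a policy question the sketch does not try to answer.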

An Example

Note that in result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install all of these into the same btrfs volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.
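As a rough illustration of the sub-volume list doubling as a user database, the following Python sketch derives passwd-like records from the home sub-volume names alone. The function users_from_subvolumes and the record layout are hypothetical, and the /home/<user> path is an assumption based on the mount points described earlier.

```python
def users_from_subvolumes(subvolumes):
    """Build a minimal user list from home:<user>:<uid>:<gid> names."""
    users = []
    for name in subvolumes:
        fields = name.split(":")
        if len(fields) != 4 or fields[0] != "home":
            continue  # not a home sub-volume, ignore it
        _kind, user, uid, gid = fields
        users.append({
            "name": user,
            "uid": int(uid),
            "gid": int(gid),
            "home": f"/home/{user}",  # assumed mount point
        })
    return sorted(users, key=lambda u: u["uid"])


if __name__ == "__main__":
    listing = [
        "usr:org.fedoraproject.WorkStation:x86_64:24.9",
        "home:hrundivbakshi:1001:1001",
        "home:lennart:1000:1000",
    ]
    for user in users_from_subvolumes(listing):
        print(user)
```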

Building Blocks

With this naming scheme, plus the way we can combine the sub-volumes at execution time, we have already come quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtime, frameworks, vendor OSes, as well as the operation for updating them is done the exact same way for all.
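For the curious, here is a small Python sketch of what driving send-and-receive from a script might look like. It uses the btrfs command line tool, where "btrfs send" writes a stream to standard output (with -p naming a parent snapshot to diff against) and "btrfs receive" reads one from standard input. The helper names and all paths are hypothetical, and the snapshots being sent are assumed to be read-only.

```python
import subprocess


def send_command(snapshot, parent=None):
    """Build the 'btrfs send' command line, incremental if a parent is given."""
    cmd = ["btrfs", "send"]
    if parent is not None:
        cmd += ["-p", parent]  # only send the delta relative to the parent
    cmd.append(snapshot)
    return cmd


def receive_command(target_dir):
    """Build the 'btrfs receive' command line for a target directory."""
    return ["btrfs", "receive", target_dir]


def replicate(snapshot, target_dir, parent=None):
    """Pipe 'btrfs send' into 'btrfs receive' on the same machine."""
    sender = subprocess.Popen(send_command(snapshot, parent),
                              stdout=subprocess.PIPE)
    receiver = subprocess.run(receive_command(target_dir),
                              stdin=sender.stdout)
    sender.stdout.close()
    if sender.wait() != 0 or receiver.returncode != 0:
        raise RuntimeError("btrfs send/receive failed")


if __name__ == "__main__":
    # Only print the commands; the paths below are made up.
    old = "/volume/usr:org.fedoraproject.WorkStation:x86_64:24.8"
    new = "/volume/usr:org.fedoraproject.WorkStation:x86_64:24.9"
    print(send_command(new, parent=old))
    print(receive_command("/mnt/target"))
```

In a real deployment the stream would of course travel over the network or be shipped as a file rather than through a local pipe, but the two commands stay the same.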

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies well to embedded use-cases. Regardless if you build a TV, an IVI system or a phone: you can put together your OS versions as usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is do the steps above, using your installed usr tree as source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or with other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, not more. Live-CDs and installed systems can be fully identical.

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar to how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as it is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already take benefit of what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.

There's no need to fully move to a system that uses only btrfs and follows strictly this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this is in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

  • We want a unified scheme, how we can install and update OS images, user apps, runtimes and frameworks.

  • We want a unified scheme how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as standard feature of the system.

  • We want to allow app vendors to write their programs against very specific frameworks, under the knowledge that they will end up being executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference for doing this in order to be able to boot it on bare metal/VM or as a container.

  • We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at, but there was no interest in my session submissions there...

Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers. This is a reinvention of how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This after all is explicitly not supposed to be a solution for one specific project and one specific vendor product, we care about making this open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!

August 25, 2014

Emacs versus notification area, again

Ages and ages ago I wrote about letting Emacs code access the notification area.  I have more to say about it now, but first I want to bore you with some rambling thoughts and some history.

The “notification area” is also called the “status icon area” or the “systray” — it is a spot that holds some icons that are under control of various applications.

I was a fan of the notification area since it first showed up in Gnome.  I recognized it instantly as the thing I wanted that I hadn’t realized I wanted.

Now, as you know, the notification area has fallen on hard times.  It’s been removed in Gnome 3… I searched a bit for the rationale for this deletion, which as far as I can tell is just that some applications abused it, whatever that means; or that it was used inconsistently, which I think the web has conclusively proven is fine by users.  Coming from the Emacs perspective, where one can customize the somewhat-equivalent of the status area (see those recent posts on diminishing minor-mode lighters in the mode line…), and where a certain amount of per-mode idiosyncrasy is the norm, these seem like inadequate reasons.

However, the reason doesn’t really matter.  I love the notification area!  When I moved more of my daily desktop use back into Emacs (the tides are strong but slow, and take years to come in or go out), I hooked Emacs up to it, and made it a part of my basic configuration.

It’s indispensable now.  What I particularly like about it is that it is both noticeable and unobtrusive — the former because I can have the icons blink on important events, and the latter because the icons don’t move around or obscure other windows.

Ok!  You should use it!  And I totally plan to tell you how, but first some boring history.

My original post relied on a hacked version of the Gnome zenity utility.  This turned out to be a real pain over time.  I had to rebuild it periodically, adding hacks (once removing chunks), etc.  Sharing it with others was hard.  And, for whatever reason, the patches in Gnome bugzilla were completely ignored.  Bah.

A bit later I wrote a big patch to Emacs to put all this into the core.  That patch was rejected, more or less.  Bah two.

Then even later I flirted with KDE for a bit.  Yes.  KDE had the nice idea to expose the notification area via dbus, and Emacs could talk dbus… so I did the obvious thing in elisp.  However the KDE notification area was pretty buggy and in the end I had to abandon it as well.

So, it was back to zenity… until this week, during my funemployment.  I rewrote my hacks in Python.  This was so easy I wish I’d done it years and years ago.

I’m not sure what the moral of this story is.  Maybe that my obsession is your gain.  Or maybe that I have trouble letting go.

Anyway, the result is here, on github, or in marmalade.  You’ll need Python and the new (introspection-based) Python Gtk interfaces.  This of course is no trouble to install.  The package includes the base status icon API, plus basic UIs for ERC and EMMS.  Try it out and let me know what you think.

August 22, 2014

Another Mode Line Hack

While streamlining my mode line, I wrote another little mode-line feature that I thought of ages ago — using the background of the mode-line to indicate the current position in the buffer. I didn’t like this enough to use it, but I thought I’d post it since it was a fun hack.

First, make sure the current mode line is kept:

(defvar tromey-real-mode-line-format mode-line-format)

Now, make a little function that formats the mode line using the standard rules and then applies a property depending on the current position in the buffer:

(defun tromey-compute-mode-line ()
  (let* ((width (frame-width))
         (line (substring
                (concat (format-mode-line tromey-real-mode-line-format)
                        (make-string width ? ))
                0 width)))
    ;; Quote "%"s, since the result is interpreted as a mode line format.
    (setq line
          (mapconcat (lambda (c)
                       (if (eq c ?%)
                           "%%"
                         ;; It's absurd that we must wrap this.
                         (make-string 1 c)))
                     line ""))

    ;; Scale the window's start and end positions to columns of the
    ;; mode line, then highlight that span.
    (let ((start (window-start))
          (end (or (window-end) (point))))
      (add-face-text-property (round (* (/ (float start) (point-max))
                                        (length line)))
                              (round (* (/ (float end) (point-max))
                                        (length line)))
                              'region nil line))
    line))

We have to do this funny wrapping and “%”-quoting business here because the :eval form returns a mode line format — not just text — and because the otherwise appealing :propertize form doesn’t allow computations.

Also, I’ve never understood why mapconcat can’t handle a character result from the map function.  Anybody?

Now set this to be the mode line:

(setq-default mode-line-format '((:eval (tromey-compute-mode-line))))

The function above changes the background of the mode line corresponding to the current window’s start and end positions.  So, for example, here we are in the middle of a buffer that is bigger than the window:

Screenshot - 08222014 - 12:52:19 PM

I left this on for a bit but found it too distracting.  If you like it, use it. You might like to remove the mode-line-position stuff from the mode line, as it seems redundant with the visual display.

August 15, 2014

A breakdown of Linux kernel networking related issues from Coverity scan

For the last of these breakdowns, I’ll focus on fifth place: networking.

Linux supports many different network protocols, so I spent quite a while splitting the net/ tree into per-protocol components. The result looks like this.

Net-802 8
Net-Bluetooth 15
Net-CAIF 9
Net-Core 11
Net-DCCP 5
Net-IRDA 17
Net-NFC 11
Net-SCTP 18
Net-SunRPC 21
Net-Wireless 9
Net-XFRM 6
Net-bridge 14
Net-ipv4 24
Net-ipv6 16
Net-mac80211 12
Net-sched 5
everything else 124

The networking code has gotten noticeably better over the last year. When I initially introduced these components they were all well into double figures. Now, even crap like DECNET has gotten better (both users will be very happy).

“Everything else” above is actually a screw-up on my part. For some reason around 50 or so netfilter issues haven’t been categorized into their appropriate component. The remaining ~70 are quite a mix, but nearly all small numbers of issues in many components. Things like 9p, atm, ax25, batman, can, ceph, l2tp, rds, rxrpc, tipc, vmwsock, and x25. The Lovecraftian protocols you only ever read about.

So networking is in pretty good shape considering just how much stuff it supports. While there’s 24 issues in a common protocol like ipv4, they tend to be mostly benign things rather than OMG 24 WAYS THE NSA IS OWNING YOUR LINUX RIGHT NOW.

That’s the last of these breakdowns I’ll do for now. I’ll do this again maybe in six months to a year, if things are dramatically different, but I expect any changes to be minor and incremental rather than anything too surprising.

After I get back from kernel summit and recover from travelling, I’ll start a series of posts showing code examples of the top checkers.

Breakdown of Linux kernel wireless drivers in Coverity scan

In fourth place on the list of hottest areas of the kernel as seen by Coverity, is drivers/net/wireless.

rtlwifi 96
Atheros 74
brcm80211 67
mwifiex 33
b43 16
iwlwifi 15
everything else 65

I mentioned in my drivers/staging examination that the realtek wifi drivers stood out as especially problematic. Here we see the same situation. Larry Finger has been working on cleaning up this (and other drivers) for some time, but it apparently still has a long way to go.

It’s worth noting that “Atheros” here is actually a number of drivers (ar5523, ath10k, ath5k, ath6k, ath9k, carl9170, wcn36xx, wil6210). I’ve not had time to break those down into smaller components yet, though a quick look shows that ath9k in particular accounts for a sizable portion of those 74 issues.

I was actually surprised at how low the iwlwifi and b43 counts were. I guess there’s something to be said for ubiquitous hardware.

What of all the ancient wireless drivers ? The junky pcmcia/pccard drivers like orinoco and friends ?
They’re in with those 65 “everything else” bugs, and make up fewer than 5-6 issues each. Considering their age, and lack of any real maintenance these days, they’re in surprisingly good shape.

Just for fun, here’s how the drivers above compare against the wireless drivers currently in staging.

rtl8821 102 (Staging)
rtlwifi 96
Atheros 74
brcm80211 67
rtl8188eu 42 (Staging)
mwifiex 33
rtl8712 22 (Staging)
rtl8192u 21 (Staging)
rtl8192e 17 (Staging)
b43 16
iwlwifi 15
everything else 65

A breakdown of Linux kernel filesystem issues in Coverity scans

The filesystem code shows up in the number two position of the list of hottest areas of the kernel. Like the previous post on drivers/scsi, this isn’t because “the filesystem code is terrible”, but more that Linux supports so many filesystems, the cumulative effect of issues present in all of them adds up to a figure that dominates the statistics.

The breakdown looks like this.

fs/*.c 77
9P 3
EXTn 36
GFS2 12
HFSPlus 4
NFS 24
OCFS2 35
Reiserfs 12
UDF 14
XFS 33

fs/*.c accounts for the VFS core, AIO, binfmt parsers, eventfd, epoll, timerfd’s, xattr code and a bunch of assorted miscellany. Little wonder it shows up so high, it’s around 62,000 LOC by itself. Of all the entries on the list, this is perhaps the most concerning area given it affects every filesystem.

A little more concerning perhaps is that btrfs is so high on the list. Btrfs is still seeing a lot of churn each release, so many of these issues come and go, but it seems to be holding roughly at the same rate of new incoming issues each release.

EXTn counts for ext2, ext3, and ext4 combined. Not too bad considering that’s around 74,000 LOC combined. (and another 15K LOC for jbd/jbd2)

The CIFS, NFS and OCFS filesystems stand out as potentially something that might be of concern, especially if those issues are over-the-wire trigger-able.

XFS has been improving over the past year. It was around 60-70 when I started doing regular scans, and continues to move downward each release, with few new issues getting added.

The remaining filesystems: not too shabby. Especially considering some of the niche ones don’t get a lot of attention.

A closer look at drivers/scsi Coverity scans.

drivers/scsi showed up in third place in the list of hottest areas of the kernel. Breaking it down into sub-components, it looks like this.

aic7xxx 15
be2iscsi 15
bfa 26
bnx2fc 6
csiostor 10
isci 11
lpfc 38
megaraid 10
mpt2sas 17
mpt3sas 15
pm8001 9
qla2xxx 42
qla4xxx 17
Everything else 152

All these components have been steadily improving over the last year. The obvious stand-out is “Everything else” that looks like it needs to be broken out into more components.
But drivers/scsi is one area of the kernel where we have a *lot* of legacy drivers, many of them 10-15 years old. (Remarkably, some of these are even still in regular use.) Looking over the list of filenames matching the “Everything else” component, pretty much every driver that isn’t broken out into its own component is on the list: 3w-9xxx, NCR5380, aacraid, advansys, aic94xx, arcmsr, atp870, bnx2i, cxgbi, dc395x, dpt_i2o, eata, esas2, fdomain, fnic, gdth, hpsa, imm, ipr, ips, mvsas, mvumi, osst, pmcraid, qla1280, qlogicfas, stex, storvsc_drv, sym53x8xx, tmscsim.
None of these are particularly worse than the others, most averaging fewer than a half dozen issues each.

Ignoring the problems I currently have adding more components, it’s not particularly helpful to break it down further when the result is going to be components with a half dozen issues. It’s not that there’s a few awful drivers dragging down the average, it’s that there’s so many of them, and they all contribute a little bit of awful.

Something I’d like to component-ize, but can’t easily without crafting and maintaining ugly regexps, is the core scsi functionality and its libraries. The problem is that drivers/scsi/*.c includes both legacy drivers, and also scsi core functionality & library functions. I discussed potentially moving all the old drivers to a “legacy” or “vintage” sub-directory at LSF/MM earlier this year with James, but he didn’t seem overly enthusiastic. So it’s going to continue to be lumped in with “Everything else” for now.

The difficulty with figuring out whether many of these issues are real concerns is that because they are hardware drivers, the scanner has no way of knowing what range of valid responses the HBA will return. So there are a number of issues which are of the form “This can’t actually happen, because if the HBA returned this, then we would have called this other function instead”.
Not a problem unique to SCSI, and something that’s seen across many different parts of the kernel.

And for those ancient 15-year-old drivers? It’s tough to find someone who either remembers how they work on a chip level, or cares enough to go back and revisit them.

drivers/staging under the Coverity microscope.

In my previous post, I mentioned that drivers/staging took the top spot for number of issues in a component.

Here’s a ‘zoomed in’ look at the sub-components under drivers/staging.

bcm 103
comedi 45
iio 13
line6 7
lustre 133
media 10
rtl8188eu 42
rtl8192e 17
rtl8192u 21
rtl8712 22
rtl8821 102
rts5208 19
unisys 14
vt6655 47
vt6656 4
everything else in drivers/staging/ (40 other uncategorized drivers) 95

Some of the sub-components with < 10 issues are likely to have their categories removed soon. When they were initially added, the open issue counts were higher, but over time they’ve improved to the point where they could just be lumped in with “everything else”.
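
As a rough illustration of that clean-up, here is a minimal Python sketch that folds any sub-component with fewer than 10 issues into the catch-all. The counts are the ones from the table above; the function name, the dictionary layout and the threshold parameter are my own assumptions for illustration, not anything Coverity provides.

```python
# Issue counts per drivers/staging sub-component, copied from the table above.
staging = {
    "bcm": 103, "comedi": 45, "iio": 13, "line6": 7, "lustre": 133,
    "media": 10, "rtl8188eu": 42, "rtl8192e": 17, "rtl8192u": 21,
    "rtl8712": 22, "rtl8821": 102, "rts5208": 19, "unisys": 14,
    "vt6655": 47, "vt6656": 4, "everything else": 95,
}

def fold_small(counts, threshold=10, catch_all="everything else"):
    """Merge components with fewer than `threshold` issues into the catch-all.

    Hypothetical helper: the name and signature are illustrative only.
    """
    folded = {catch_all: counts.get(catch_all, 0)}
    for name, issues in counts.items():
        if name == catch_all:
            continue
        if issues < threshold:
            # Too small to be worth its own category.
            folded[catch_all] += issues
        else:
            folded[name] = issues
    return folded

folded = fold_small(staging)
for name, issues in sorted(folded.items(), key=lambda kv: -kv[1]):
    print(f"{name:16} {issues}")
```

With the numbers above, only line6 and vt6656 fall under the cut-off, and the total across all components is unchanged by the folding.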

When Lustre was added back in 3.12, it caused a noticeable jump in new issues detected. The largest delta from any one single addition since I’ve been doing regular scans. It’s continuing to make progress, with 20 or so issues being knocked out each release, and few new issues being introduced. Lustre doesn’t suffer from any one issue overly, but has a grab-bag of issues from the many checkers that Coverity has.
Amusingly, Lustre is the only part of the kernel that has Coverity annotations in the code.

Second on the list is the bcm Wimax driver. This has been around in staging for years, and has had a metric shitload of checkpatch type stylistic changes made to it, but relatively few actual functionality fixes. (confession: I was guilty of ~30 of those cleanups myself, but I couldn’t bear to look at the 1906 line bcm_char_ioctl function: Splitting that up did have a nice side-effect though). A lot of the issues in this driver are duplicates due to a problem in a macro being picked up as a new issue for every instance it gets used.

Something that sticks out in this list is the cluster of rtl* drivers. At time of writing there are seven drivers for various Realtek wireless chips, all of varying quality. Much of the code between these drivers is cut-and-pasted from previous drivers. It seems each time Realtek revs new silicon, they do another code-drop with a new driver. Worse yet, many of the fixes that went into the kernel variants don’t make it back to the driver they based their new work on. There have been numerous cases where a bug fixed in one driver has been reintroduced in a new variant months later. There’s a ton of work going on here, and a lot more needed.
Somewhat depressingly, even the not-in-staging rtlwifi driver that lives in drivers/net/wireless has ~100 issues. Many of them the exact same issues as those in the staging drivers.

As bad as it seems, staging is serving its purpose for the most part, and things have gotten a lot quieter each merge window when the staging tree gets pulled. It’s only when it contains something new and huge like Lustre that it really shows up noticeably in the daily stats after each scan. The number of new issues being added is generally lower than the number being fixed. For the 3.17 pull for example, 67 new issues, 132 eliminated. (Note: Those numbers are kernel wide, not *just* staging, but staging made up the majority of the results change on that day).

Something that bothers me slightly is that a number of drivers have ‘escaped’ drivers/staging into the kernel proper, with a high number of open issues. That said, many of those escapees are no worse than drivers that were added 10+ years ago when standards were lower. More on that in a future post.

Linux kernel Coverity scan ‘hot’ areas.

One of the time-consuming parts of organizing the data generated by Coverity has been sorting it into categories (or components, as Coverity refers to them). A component is a wildcard (or exact filename) that matches a specific subsystem, driver, filesystem etc.

As the Linux kernel has thousands of drivers, it isn’t really practical to add a component per-driver, so I started by generalizing into subsystems, and from there, broke down the larger groupings into per-driver components, while still leaving an “everything else” catch-all for drivers within a subsystem that hadn’t been broken out.
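
To make the idea concrete, here is a minimal Python sketch of how such a table could map a source path to a component. It is not how Coverity does the matching internally; the component names, the patterns and the helper function are all hypothetical, chosen only to show per-driver entries taking precedence over a subsystem-wide catch-all.

```python
from fnmatch import fnmatchcase

# Hypothetical component table: (component name, wildcard pattern).
# Order matters: per-driver patterns are listed before the catch-all
# for their subsystem, because "*" also matches across "/" here.
COMPONENTS = [
    ("scsi: lpfc", "drivers/scsi/lpfc/*"),
    ("scsi: qla2xxx", "drivers/scsi/qla2xxx/*"),
    ("scsi: everything else", "drivers/scsi/*"),
    ("staging: lustre", "drivers/staging/lustre/*"),
    ("staging: everything else", "drivers/staging/*"),
]

def component_for(path):
    """Return the first component whose pattern matches `path`, else None."""
    for name, pattern in COMPONENTS:
        if fnmatchcase(path, pattern):
            return name
    return None

print(component_for("drivers/scsi/lpfc/lpfc_init.c"))
print(component_for("drivers/scsi/hpsa.c"))
```

A driver that has been broken out lands in its own component, while anything else under the same directory falls through to the “everything else” entry.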

According to discussions I’ve had with Coverity, we are actually one of the more ‘heavy’ users of components, and we’ve hit a few scalability problems as we’ve added more and more of them, which has been another reason I’ve not broken things down more than the ~150 components we have so far. Also, if a component has fewer than 10 or so issues, it’s really not worth the effort of splitting up. (I may revise that cut-off even higher at some point just to keep things manageable).

Before the big reveal, some caveats:

  • Something having ‘100’ issues may not be 100 individual problems. For example if a problem is in a macro, Coverity flags a new issue for every use of that macro. For something heavily used, like a formatted printk debug wrapper, this could account for many many warnings.
  • Many of these issues aren’t actual bugs. At the same time, the checker isn’t wrong, but has no way to infer that the use is ok. I’ll explain more about these in a future post when I start showing some actual warnings.
  • Sometimes a combination of both the previous points. As an example: The nouveau driver this week had ~100 issues open against it, making it the #1 in the list of drm drivers with issues. Ben Skeggs spent some time going over them all, and closed out 80-90 of them as intentional/false positives, and came away with around a half dozen or so issues that actually need code changes, and around 20 issues that are still undecided. It’s a laborious time-consuming effort to dig through them, and in many cases, only the person who wrote the code can really determine if the intent matches what the code actually does.

Right now, the top ‘hot areas’ of the kernel (these include accumulated broken-out drivers), sorted by number of issues are:

drivers/staging 694
fs/ 465
drivers/scsi/ 382
drivers/net/wireless 366
net/ 324
drivers/ethernet/ 285
drivers/media/ 262
drivers/usb/ 140
drivers/infiniband/ 109
arch/x86/ 95
sound/ 89

It should come as no surprise really that the staging drivers take the number one spot. If something had beaten it, I think it would have highlighted a somewhat embarrassing problem in our development methods.

In the next posts, I’ll drill down into each of these categories, and figure out exactly why they’re at the top of the list.

For the impatient: once this series is over, I intend to show breakdowns of the various types of issues being detected, but it’s going to take me a while to get to (probably post kernel summit). There’s an absolute ton of data to dig through, and I’m trying to present as much of it in bite-sized chunks as possible, rather than dozens of pages of info.

August 13, 2014

Streamlined Mode Line

The default mode line looks like this:

Screenshot - 08112014 - 01:57:07 PM

At least, it looks sort of like this if you ignore the lamenesses in the screenshot. If you’re like me you probably don’t remember what all these things mean, at least not without looking them up.  What’s that “U”?  Why the “:” or why three hyphens?

At a local Emacs meetup with Damon Haley and Greg Pfeil, Greg mentioned that he’d done some experiments on using unicode characters in his mode-line.  I decided to give it a try.

I took a good look at the above.  I rarely use any of it — I normally don’t care about the coding system or the line ending style.  I can’t remember the last time I had a buffer that was both read-only and modified.  The VC information, when it appears, is generally too verbose and doesn’t show me the one thing I need to know (see below).  And, though I do like to see the name of the major mode, I don’t really need to see most minor mode names; furthermore I like to have a bit of extra space so that I can use other modes that display information that I do want to see in the mode line.

What’s that VC thing?  Well, ordinarily you may see something like Git-master in the mode line. But, usually I already know the version control system being used — or even if I don’t know, I probably don’t care if I am using VC. By default the branch name is in there too. This can be quite long and seems to get stale when I switch branches; and because I do a lot of work via vc-dir, I don’t really need this in every buffer anyway.

However, what is missing is that the mode-line won’t tell me if a buffer should be registered with version control but is not.  This is a pretty common source of errors.

So, first the code to deal with the VC state.  We need a bit more code than you might think, because the information we need isn’t already computed, and my attempts to compute it while updating the mode line caused weird behavior.  Our rule for “should be registered” is “a VC back end claims this file, but the file isn’t actually registered”:

(defvar tromey-vc-mode nil)
(make-variable-buffer-local 'tromey-vc-mode)

(require 'vc)
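(defun tromey-vc-command-hook (&rest args)
  "Note whether the current file should be registered with VC but is not."
  (let ((file-name (buffer-file-name)))
    (setq tromey-vc-mode (and file-name
                              (not (vc-registered file-name))
                              ;; This can signal an error when no back end
                              ;; claims the file, so treat that as nil.
                              (ignore-errors
                                (vc-responsible-backend file-name))))))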

(add-hook 'vc-post-command-functions #'tromey-vc-command-hook)
(add-hook 'find-file-hook #'tromey-vc-command-hook)

(defun tromey-vc-info ()
  (if tromey-vc-mode
      (propertize (string #x26c3 32) 'face 'error)
    " "))

We’ll use that final function in the mode line. Note the odd character in there — my choice was U+26C3 (BLACK DRAUGHTS KING), since I thought it looked disk-drive-like — but you can easily replace it with something else. (Also note the weirdness of using string rather than a string constant. This is just for WordPress’ benefit as its editor kept mangling the actual character.)

To deal with minor modes, I used diminish. This made it easy to remove any display of some modes that I don’t care to know about, and replace the name of some others with a single character:

(require 'diminish)
(diminish 'abbrev-mode)
(diminish 'projectile-mode)
(diminish 'eldoc-mode)
(diminish 'flyspell-mode (string 32 #x2708))
(diminish 'auto-fill-function (string 32 #xa7))
(diminish 'isearch-mode (string 32 #x279c))

Here flyspell is U+2708 (AIRPLANE), auto-fill is U+00A7 (SECTION SIGN), and isearch is U+279C (HEAVY ROUND-TIPPED RIGHTWARDS ARROW).  Haha, Unicode names.

I wanted to try out which-func-mode, now that I had extra space on the mode line, so:

(setq which-func-unknown "")

Finally, we can use all the above and remove some other things from the mode line at the same time:

(setq-default mode-line-format
              '((:eval (if (buffer-modified-p)
                           (propertize (string #x21a7) 'face 'error)
                         " "))
                (:eval (tromey-vc-info))
                " " mode-line-buffer-identification
                "   " mode-line-position
                "  " mode-line-modes))

The “modified” character in there is U+21A7 (DOWNWARDS ARROW FROM BAR).

Here’s how it looks normally (another badly cropped screenshot):

Screenshot - 08112014 - 08:42:33 PM

Here’s how it looks when the file is not registered with the version control system:

Screenshot - 08112014 - 08:43:04 PM

And here’s how it looks when the file is also modified:

Screenshot - 08112014 - 08:43:39 PM

Occasionally I run into some other minor mode I want to diminish, but this is easily done by editing it into my .emacs and evaluating it for immediate effect.

The first year of Coverity Linux kernel scans.

Next week at kernel summit, I’m going to be speaking about the Coverity scans. I have come up with more material than I have time to present in the short slot, so I’ve decided to turn it into a series of blog posts in the hope of kickstarting some discussion ahead of time.

I started doing regular scans against the Linux kernel in July 2013. In that time, I’ve sent a bunch of patches, reported many bugs, and spent hours going through the database categorizing, diagnosing, and closing out issues where possible.

I’ve been doing at least one build per day during each merge window (except obviously on days when there haven’t been any commits), and at least one per -rc once the merge window closes.

A few people have asked me about the config file that I use for the builds.
It’s pretty much an ‘allmodconfig’, except where choices have to be made, I’ve tried to pick the general case that a distribution would select. For some of these, I will occasionally flip between them (e.g. SLAB/SLOB/SLUB, PREEMPT_NONE/PREEMPT_VOLUNTARY/PREEMPT) just for coverage. In total, currently 6955 CONFIG_ options are enabled, 117 disabled. (None by choice, they are all the deselected parts of multi-choice options).

The builds are x86-64 only. At this time, it’s the only architecture Coverity scan supports. I do have CONFIG_COMPILE_TEST set, so non-x86 drivers that can be built do get scanned. The only parts of the kernel we’re not covering are the non-x86 architecture-specific code in arch/ and the drivers not covered by COMPILE_TEST.

Builds take about an hour on a 24-core Nehalem. The results are then uploaded to a server, which takes another 20 minutes. Then a script kicks something at Coverity to pick up the new tarball and scan it. This can take any number of hours. At best, around 5-6 hours, at worst I’ve seen it take as long as 12 hours. This hopefully answers why I don’t do even more builds, or builds of variant trees. (Although I’m still trying to figure out a way to scan linux-next while having it inherit the results of the issues already marked in Linus’ tree). Thankfully much of the build/upload/scan process is automated, so I can do other things while I wait for it to finish.

Over the year, the overall defect density has been decreasing.

3.11 0.68
3.12 0.62
3.13 0.59
3.14 0.55
3.15 0.55
3.16 0.53

Moving in the right direction, though things have slowed a little the last few releases. At least in part due to my spending more time on Trinity than going through the Coverity backlog. The good news is that the incoming rate of new bugs each window has also slowed.

When newer issues are introduced, they are getting jumped on faster than before. Many developers have signed up for accounts and are looking over their subsystems each release, which is great. It means I have to spend less time sending email :)
Eventually I hope that Coverity implements a feature I asked for allowing each component to have a designated email address that new reports get sent to. With that in place, plus active triage on the backlog, a real dent could be made in the ~4700 outstanding issues.

Throughout the past year Coverity has made a number of improvements server-side, some at the behest of the scans, resulting in fewer false positives being found by some checkers. A good example of this was some additional heuristics being added to spot intentional ‘missing break in switch statement’ situations. I’ve also been in constant communication whenever an interesting bug was found upstream that Coverity didn’t detect, so over time, additional checkers should be added to catch more bugs.

How do we compare against other projects?
I picked a few at random.

FreeBSD 0.54 (~15m LOC): 14655 total, 6446 fixed, 8093 outstanding.
Firefox 0.70 (~5.4m LOC): 9008 total, 5066 fixed, 3786 outstanding.
Linux 0.53 (~9m LOC): 13337 total, 7202 fixed, 4761 outstanding.
Python 0.03! (~400k LOC): 1030 total, 895 fixed, 3 outstanding.

(LOC based on C preprocessor output)
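
For anyone wanting to sanity-check those figures, here is a small Python sketch. It assumes defect density means outstanding defects per 1,000 lines of code, which is my reading of how the numbers above are derived; the function name is made up for illustration, and because the LOC values quoted here are heavily rounded, the results only approximate what Coverity reports.

```python
def defect_density(outstanding, loc):
    """Outstanding defects per 1,000 lines of code (assumed definition)."""
    return outstanding / (loc / 1000)

# (outstanding defects, approximate LOC) taken from the list above.
projects = {
    "FreeBSD": (8093, 15_000_000),
    "Firefox": (3786, 5_400_000),
    "Linux": (4761, 9_000_000),
}

for name, (outstanding, loc) in projects.items():
    print(f"{name:8} {defect_density(outstanding, loc):.2f}")
```

The point is only that the density normalizes for codebase size, so a project with far more outstanding issues can still end up with a similar figure.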

FreeBSD’s defect density is pretty much the same as Linux right now, despite having a lot more code. I think they include all their userspace in their scans also, so it’s picked up gcc, sendmail, binutils etc etc.

The Python people have made a big effort to keep their defect density low (afaik, the lowest of all projects in scan). They did however have a lot fewer issues to begin with, and have a much smaller codebase. Firefox by comparison seems to have a lot of the same problems Linux has. A large corpus of pre-existing issues, and a large codebase (probably with few people with ‘global’ knowledge)

In my next post, I’ll go into some detail about where some of the more dense areas of the kernel are for Coverity issues. Much of it should be no surprise (old, unmaintained/neglected code etc), but there are a few interesting cases.

update : added FreeBSD statistics.
update 2 : (hi hackernews!) added blurb about coverity improvements.

August 08, 2014

Week of kernel bugs in review

With the 3.17 merge window opening up this week, it’s been kinda busy.
I also made a few enhancements to Trinity, so it found some bugs that have been there for a while.

In addition to this, I started pulling together a talk for kernel summit based on all the stuff that Coverity has been finding. I’ll eventually get around to turning those into blog posts too, as there’s a lot of material.

Productive week.

compiler sanitizers.

I only recently discovered the sanitizer libraries that both gcc and llvm support despite them being a few years old now. (libasan, liblsan, libtsan and my favorite libubsan for undefined behaviour detection). LLVM also has a -fsanitize=memory.

Building code with -fsanitize={address|leak|undefined} has turned up a number of hard to find issues in various userspace code I’ve written. (Unfortunately doing this on something like Trinity produces a lot of false positives, as it deliberately generates undefined behavior in many cases, like creating an mmap, never writing to it, and then passing it to something that reads it).

There’s also a variant of libasan for the kernel which looks interesting. I know that’s found a bunch of issues in concert with fuzzing via Trinity, and expect it’s something we’ll see more of if/when that functionality gets merged.

Today I was reading about the recent gcc meeting, and these slides by the sanitizer developers caught my attention. What I found of particular interest was the “MSan for Chromium” slide, where they mention they rebuilt ~40 libraries to link with the sanitizer.

I’ve been contemplating doing this for a subset of some userspace packages in Fedora that I care about for a while, but I’ve not had spare cycles to even look into it. I dogfood a lot of bleeding edge code on all my machines, and have been curious for some time to see what the fallout looks like from such a rebuild of various network facing daemons. I suspect with Chromium being more focused on the client side, there hasn’t been a huge amount of research into this for server side code. Looking at ASan’s found bugs wiki page, it does seem to support that hypothesis. I’m curious to see what would fall out from a rebuilt Apache, Bind, Sendmail, nginx, etc.
Hopefully the developers of all the network facing code we ship are just as curious.

There are obvious comparisons to valgrind, which doesn’t require rebuilding, but in my experience so far, the sanitizers have found a bunch of issues that valgrind didn’t (or got lost in the noise). Also, just like with fuzzers, different tools tend to find different bugs even if they have the same intent. I think there’s room for both approaches.
