December 19, 2014

Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.

In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare." At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

  • Our insistence on shipping the latest code, with as few 'special sauce' patches as possible, won over a lot of upstream developers that wouldn't have given us the time of day for similar bugs back in the RHL days. Sometimes painful for our users, but Linux as a whole got better because of our stance here.
  • Decisions like Fedora enabling debug options by default in betas shook out an unbelievable number of bugs almost as soon as they were introduced. Again, painful for users, but from a quality standpoint, we found a ton of bugs in code others were racing to ship first and call "enterprise ready".
  • Fedora enabling features sometimes before they were fully baked got us a lot of love from their respective upstream maintainers.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could find bugs proactively, before users even hit them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began, I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.


December 12, 2014


You know how some people attach several monitors to one PC? I don't. I just have several PCs. But then I want copy-paste to work transparently (as transparently as possible). For several years I used blitz to copy the clipboard. It works well enough, but once you have 3 computers, it gets somewhat cumbersome to type the hostname. Also, it always bothered me how it rides ssh authentication. I wanted something independent from ssh.

Behold blitz2. Instead of passing the clipboard directly to the host where it's needed, the clipboard is uploaded to an HTTP server. It seems more complex at first, but it's actually much better, because previously the PC where you copy had to authenticate to the PC where you paste. Now the authentication is symmetric. So, all clients are configured exactly the same, and all can upload and download the clipboard no matter who trusts what ssh keys.
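
A rough sketch of that symmetric design in C with libcurl (this is not blitz2 itself; the server URL is a placeholder and authentication is left out): every client pushes the clipboard to the same URL and pulls it back from there, so no client ever needs to authenticate to another client.

/* Illustrative only: clipboard exchange through a shared HTTP server.
 * The URL is hypothetical; blitz2's real protocol and auth are not shown. */
#include <stdio.h>
#include <curl/curl.h>

static size_t print_chunk(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void) userdata;
    fwrite(ptr, size, nmemb, stdout);    /* dump the downloaded clipboard to stdout */
    return size * nmemb;
}

static int push_clip(const char *url, const char *text)
{
    CURL *c = curl_easy_init();
    if (!c)
        return -1;
    curl_easy_setopt(c, CURLOPT_URL, url);
    curl_easy_setopt(c, CURLOPT_POSTFIELDS, text);   /* upload the clipboard body */
    CURLcode rc = curl_easy_perform(c);
    curl_easy_cleanup(c);
    return rc == CURLE_OK ? 0 : -1;
}

static int pull_clip(const char *url)
{
    CURL *c = curl_easy_init();
    if (!c)
        return -1;
    curl_easy_setopt(c, CURLOPT_URL, url);
    curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, print_chunk);
    CURLcode rc = curl_easy_perform(c);
    curl_easy_cleanup(c);
    return rc == CURLE_OK ? 0 : -1;
}

int main(void)
{
    const char *url = "https://clip.example.org/clipboard";   /* hypothetical server */
    curl_global_init(CURL_GLOBAL_DEFAULT);
    push_clip(url, "hello from host A");   /* on the machine where you copy */
    pull_clip(url);                        /* on the machine where you paste */
    curl_global_cleanup();
    return 0;
}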

December 02, 2014

Free-riding and copyleft in cultural commons like Flickr

Flickr recently started selling prints of Creative Commons Attribution-Share Alike photos without sharing any of the revenue with the original photographers. When people were surprised, Flickr said “if you don’t want commercial use, switch the photo to CC non-commercial”.

This seems to have mostly caused two reactions:

  1. “This is horrible! Creative Commons is horrible!”
  2. “Commercial reuse is explicitly part of the license; I don’t understand the anger.”

I think it makes sense to examine some of the assumptions those users (and many license authors) may have had, and what that tells us about license choice and design going forward.

Free ride!!, by Dhinakaran Gajavarathan, under CC BY 2.0

Free riding is why we share-alike…

As I’ve explained before here, a major reason why people choose copyleft/share-alike licenses is to prevent free rider problems: they are OK with you using their thing, but they want the license to nudge (or push) you in the direction of sharing back/collaborating with them in the future. To quote Elinor Ostrom, who won a Nobel for her research on how commons are managed in the wild, “[i]n all recorded, long surviving, self-organized resource governance regimes, participants invest resources in monitoring the actions of each other so as to reduce the probability of free riding.” (emphasis added)

… but share-alike is not always enough

Copyleft is one of our mechanisms for this in our commons, but it isn’t enough. I think experience in free/open/libre software shows that free rider problems are best prevented when three conditions are present:

  • The work being created is genuinely collaborative — i.e., many authors who contribute similarly to the work. This reduces the cost of free riding to any one author. It also makes it more understandable/tolerable when a re-user fails to compensate specific authors, since there is so much practical difficulty for even a good-faith reuser to evaluate who should get paid and contact them.
  • There is a long-term cost to not contributing back to the parent project. In the case of Linux and many large software projects, this long-term cost is about maintenance and security: if you’re not working with upstream, you’re not going to get the benefit of new fixes, and will pay a cost in backporting security fixes.
  • The license triggers share-alike obligations for common use cases. The copyleft doesn’t need to perfectly capture all use cases. But if at least some high-profile use cases require sharing back, that helps discipline other users by making them think more carefully about their obligations (both legal and social/organizational).

Alternately, you may be able to avoid damage from free rider problems by taking the Apache/BSD approach: genuinely, deeply educating contributors, before they contribute, that they should only contribute if they are OK with a high level of free riding. It is hard to see how this can work in a situation like Flickr’s, because contributors don’t have extensive community contact.1

The most important takeaway from this list is that if you want to prevent free riding in a community-production project, the license can’t do all the work itself — other frictions that somewhat slow reuse should be present. (In fact, my first draft of this list didn’t mention the license at all — just the first two points.)

Flickr is practically designed for free riding

Flickr fails on all the points I’ve listed above — it has no frictions that might discourage free riding.

  • The community doesn’t collaborate on the works. This makes the selling a deeply personal, “expensive” thing for any author who sees their photo for sale. It is very easy for each of them to find their specific materials being reused, and see a specific price being charged by Yahoo that they’d like to see a slice of.
  • There is no cost to re-users who don’t contribute back to the author—the photo will never develop security problems, or get less useful with time.
  • The share-alike doesn’t kick in for virtually any reuses, encouraging Yahoo to look at the relationship as a purely legal one, and encouraging them to forget about the other relationships they have with Flickr users.
  • There is no community education about the expectations for commercial use, so many people don’t fully understand the licenses they’re using.

So what does this mean?

This has already gone on too long, but a quick thought: what this suggests is that if you have a community dedicated to creating a cultural commons, it needs some features that discourage free riding — and critically, mere copyleft licensing might not be good enough, because of the nature of most production of commons of cultural works. In Flickr’s case, maybe this should simply have included not doing this, or making some sort of financial arrangement despite what was legally permissible; for other communities and other circumstances other solutions to the free-rider problem may make sense too.

And I think this argues for consideration of non-commercial licenses in some circumstances as well. This doesn’t make non-commercial licenses more palatable, but since commercial free riding is typically people’s biggest concern, and other tools may not be available, it is entirely possible it should be considered more seriously than free and open source software dogma might have you believe.

  1. It is open to discussion, I think, whether this works in Wikimedia Commons, and how it can be scaled as Commons grows.

November 30, 2014

23 years

(This post is only about music – for people not from Belgium, Luc de Vos, singer of Gorki, passed away yesterday at 52)

I am 15. I hear a song on the radio, and I don’t understand the lyrics. Why would you ask a piranha to devour you? Still, I’m intrigued. I’d only really gotten into music little by little. My earliest musical memory is hearing my parents’ record player playing ‘I want you’ by Bob Dylan. After that, it was my inexplicable arousal at seeing the ‘Hey You the Rock Steady Crew’ video in 1983 when I was 7, getting the Top Gun soundtrack on cassette (my first ever music purchase) in 1986, and watching the video for ‘I want your sex’ by George Michael in 1987 over and over on my recording of Veronica’s “Countdown”. At my confirmation (12 years old), when kids typically get some kind of bigger gift they’ve been dreaming of for a long time, I still chose a computer instead of a stereo.

I am 16, I just had my birthday. I am doing a summer job at my family’s company (which processes animal fat) and I am staying with my grandparents in Bavegem. With the money from my birthday I bought a portable stereo CD/cassette player for the incredible amount of 6000 BEF (or 150 euro as the kids would call it these days). I listen to nothing else for weeks on end. I can still hum the amazingly beautiful piano part that closes Mia from memory. It’s been my favorite song ever since.

I am 17, and learning the guitar. It turns out that Mia is quite complicated to get right, because of that perfect 3/4-5/4 tempo, or whatever you’d call it if you knew anything about music. It doesn’t help that I’m left-handed playing on a right-handed guitar, but I make the song my own. To this day though, I still cannot play and sing it at the same time. There is something about the timing of how that third line starts before the music starts, where he sings ‘Mensen als ik’, that I just can’t figure out. It’s magic – it makes this song all the better.

I am 17, and Gorky is now Gorki, with completely new band members. I see them live for the first time, at ‘De Kring’ in Merelbeke, with my best friend Jeremy. I wish I had bought all the t-shirts that night – they had a different one for each of the new songs. The album sounds so different – parts of it recorded in Africa. I don’t listen to that album enough, but I still love playing Berejager on guitar, such a beautiful intro.

I am 17, and it’s my last year of boy scout before becoming a leader. I have a mini-JIN camp called JINTRO during the year, that ends with a party. I dance with a girl to Mia, and one minute into the dance she says, ‘no no, we’re not going to do a one-tile-dance for the rest of the night. Here’s how you do it’ and she teaches me two basic moves to make a slow dance more interesting. Thank you, Karlien, for changing my life.

I am 18, and we travel through Catalunya with the boy and girl scouts group I’m in, and a local Catalan group. This is one of the CD’s we brought with us as a sample of our own culture. The Catalans love it – they say it sounds like Bruce Springsteen. I can see where they’re coming from. At the end of the two weeks, the guitar player of their group nails down a really good version of Mia (without the words of course).

I am 18, and have my first serious girlfriend. Mia is a song that runs through our history together – we must have danced to it at every party that played it (she messaged me yesterday that she immediately thought of me when she heard the news… just like I did of her). Back then, parties still had blocks of 3 slow songs every one or two hours. I miss that tradition… The moves that Karlien taught me put me well ahead of the pack of my fellow young adult males, and that paid off generously in the young adult females agreeing to dance with me at every party. (The theory of compounded interest clearly put in practice, now that I think of it)

I am 19, and one of my fellow boy scout leaders gives me an old demo cassette of Gorky. Among other things, it contains a cover of the Pixies’ “Monkey Gone to Heaven”, some of their songs that didn’t make their debut (but appeared on Boterhammen, like ‘Ik word oud’, or were turned into a b-side). It also contains the original version of Mia, as a fast-paced slurred-sung rocker. They made the right call slowing it down.

I am 21, and I have a radio show at a student radio station I helped start up. I am too young to know how the world really works and just send out interview requests to managers and record labels for bands that I like. In those days, I got to interview my favorite band, The Afghan Whigs, as well as other bands like Everclear and The Sheila Divine. But we also managed to get Luc De Vos as a guest on our radio show, and Jeremy and I interviewed him in between songs for an hour. (That tape is at my parents’ place. I have an Excel sheet that tells me exactly which box it’s in, and I hope I can recover it next time I go to Belgium.) I tell him about that demo tape that I have, and he asks for a copy. A little after that, I bring him a copy of that practice tape, I put ‘Congregation’ by The Afghan Whigs on the other side (because I want one of my favorite bands to know another one of my favorite bands), and I go past his house to drop it off. (From the news report this weekend I hear he still lived in the same street, so I can only assume he was still living in the same house he’s lived in for the last 17 years).

I am 22, and Luc De Vos plays solo at the university somewhere, in an auditorium. I think it was one of the first times he ever did that. He probably already read out a column he wrote. But I remember how amazing he was by himself, what beautiful versions of these songs that I knew so well he played, songs that usually they didn’t play live because they were the slower ones. ‘Arme Jongen’, I remember him playing it there like it was yesterday.

I am 26, and I see him at various festivals, always there to either play or enjoy the music. I see him backstage with his son, recently born. He is walking around with some kind of elastic band tied around his waist that keeps his kid from running away more than ten meters from him, and it is hilarious to see in the backstage area.

Time starts moving quicker as I grow up, become an adult, and graduate from college. More and more albums. Every album still contained at least one killer song. ‘Leve de Lente’ still gives me goosebumps when those guitars crash in. ‘Vaarwel Lieveling’ is possibly his most underrated song – I don’t think I’ve ever heard that one played live. ‘Ode an die freude’, ‘We zijn zo jong’, ‘Duitsland wint altijd’ – I love the sound of resignation he has in his voice, like a deep sigh put to music. That album came with a floppy disk (!) with the lyrics. ‘Het voorspel was moordend’, ‘Tijdbom’ – while the music came back to being a bit more conventional, the lyrics got more hermetically sealed. I must admit that I slowly lost track – having moved to Barcelona at some point, it was much harder to catch them live of course. I know their first five albums the best, and while I still bought the others (having missed only one), none of them had the luxury of not having any other album in my collection to compete with like their debut album had. But there is no denying that when they were great, they were still amazing. A song like ‘Veronica komt naar je toe’ managed to pull together so many different things. The title was a recurring slogan of a Dutch channel that was popular among young people in Belgium, for lack of a Belgian alternative. Here’s a great song, with a great chorus, and his ability to sample just this one sentence to evoke a memory of youth every one of my generation remembers (while it evoked at the same time my personal memory of seeing ‘I want your sex’ on Veronica). And then he manages to evoke such a common feeling everyone has, where you are trying to grab that fleeting thing you were thinking just a second ago, straddling typically complicated-to-phrase words in Dutch with effortless ease – ‘Wat was het nu ook alweer/dat ik wou doen/het was iets belangrijks’ (or ‘What was it again/I wanted to do/it was something important’). In the beginning, his lyrics were quirky in ideas, but fairly straightforward in their phrasing. Further on in their career, they experimented quite a bit musically, but especially the lyrics could get complicated, and with exceptional and inventive phrasing.

I’m 31, and I live in Barcelona, but I travel back to Belgium because Gorki is playing their debut album, Gorky. I wrote about that concert back then, but that memory is still strong. I can’t believe that was 7 years ago…

I always enjoyed reading his columns in Zone 09 whenever I was in my hometown; I thought he had a great gift for writing. I noticed just now he left behind quite a few more books than I had, so I started tracking those down. So many of my memories have his music attached to them. His was the first band that opened me up to a wider range of music, away from the mainstream (not everybody would agree I guess, but I never considered them mainstream. Their debut album certainly was different enough from whatever was considered mainstream at the time, and as often happens this debut was only widely recognized several albums into their career, while at the same time those later albums never really got the same kind of traction.)

I loved his way of looking at the world, the way he described it in music, lyrics, writing, and interviews. Always with that cheeky look. Like so many of my generation, it surprisingly now turns out, his music was intertwined with my growing up. Here’s a man I was hoping would live long and make much more music, and grow old playing hundreds of songs in bars and clubs, but it wasn’t to be. He set out to be a successful rock singer, whether that was tongue-in-cheek or not, and by all accounts he achieved what he set out to do. And everything he did, he did it for the best of reasons. He did it for ‘a fistful of bonnekes’.


November 29, 2014

is this a protocol? displaylink3

I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or matches the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.

November 11, 2014

systemd For Administrators, Part XXI

Container Integration

For a while now, containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, however it is less flexible, as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions.

We now have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              ├─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
              ├─5383 /usr/lib/systemd/systemd-journald
              ├─5411 /usr/lib/systemd/systemd-logind
              └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much for the machinectl tool. The tool knows a couple more commands; please check the man page for details. Note again that even though we use systemd-nspawn as the container manager here, the concepts apply to any container manager that implements the logic described here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, repeating the systemd-nspawn command from above):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similarly, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here, the blog story is already getting too long).

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, followed by the units of the one container we currently have running. The units of the containers are prefixed with the container name and a colon (":"). (The output is shortened again for brevity's sake.)

The list-machines subcommand of systemctl shows a list of all running containers, inquiring the system managers within the containers about system state and health. More specifically it shows if containers are properly booted up, or if there are any failed services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state being degraded.

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
2915 -                               emacs contents/projects/
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/ -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus, sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.
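
A minimal sketch of such a client, assuming libsystemd's sd-bus headers are available; the unit name is just an example and error handling is trimmed (in later systemd versions the same call is named sd_bus_open_system_machine()):

/* Sketch: connect to a container's system bus and restart a unit inside it,
 * roughly what "systemctl -M mycontainer restart ..." does. */
#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[])
{
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus *bus = NULL;
        int r;

        /* "mycontainer" matches the container started earlier in this post */
        r = sd_bus_open_system_container(&bus, argc > 1 ? argv[1] : "mycontainer");
        if (r < 0) {
                fprintf(stderr, "Failed to connect to container bus: %s\n", strerror(-r));
                return 1;
        }

        /* Ask the container's systemd instance to restart a unit */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",
                               "/org/freedesktop/systemd1",
                               "org.freedesktop.systemd1.Manager",
                               "RestartUnit",
                               &error, NULL,
                               "ss", "systemd-resolved.service", "replace");
        if (r < 0)
                fprintf(stderr, "RestartUnit failed: %s\n", error.message);

        sd_bus_error_free(&error);
        sd_bus_unref(bus);
        return r < 0;
}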

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.
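
For instance, mapping a PID back to the machine it runs in is only a couple of lines with sd-login.h; a rough, illustrative sketch (see sd-login(3) for the full API):

/* Sketch: print which machine (container) a PID belongs to. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <systemd/sd-login.h>

int main(int argc, char *argv[])
{
        pid_t pid = argc > 1 ? (pid_t) atoi(argv[1]) : getpid();
        char *machine = NULL;
        int r;

        r = sd_pid_get_machine_name(pid, &machine);
        if (r < 0) {
                /* Processes running on the host are not part of any machine */
                fprintf(stderr, "PID %d: no machine (%s)\n", (int) pid, strerror(-r));
                return 1;
        }

        printf("PID %d runs in machine '%s'\n", (int) pid, machine);
        free(machine);
        return 0;
}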

systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host, networkd will by default provide a DHCP server and IPv4LL on any veth network interface named "ve-" followed by the container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module nss-mymachines that makes the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above the container shares the network configuration with the host however; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now, (assuming that networkd is used in the container and outside) we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer ( 56(84) bytes of data.
64 bytes from mycontainer ( icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer ( icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them venerable ssh.
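
Because nss-mymachines plugs into the regular glibc resolver, a plain getaddrinfo() call is all a program needs; a small illustrative sketch (not from the original post):

/* Sketch: resolve a running container's name through the normal resolver. */
#include <stdio.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
        const char *name = argc > 1 ? argv[1] : "mycontainer";
        struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM };
        struct addrinfo *res, *p;
        char buf[INET6_ADDRSTRLEN];

        int r = getaddrinfo(name, NULL, &hints, &res);
        if (r != 0) {
                fprintf(stderr, "getaddrinfo(%s): %s\n", name, gai_strerror(r));
                return 1;
        }

        for (p = res; p; p = p->ai_next) {
                void *addr = p->ai_family == AF_INET
                        ? (void *) &((struct sockaddr_in *) p->ai_addr)->sin_addr
                        : (void *) &((struct sockaddr_in6 *) p->ai_addr)->sin6_addr;
                printf("%s -> %s\n", name, inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }

        freeaddrinfo(res);
        return 0;
}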

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but covers VMs too to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, as it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.

November 07, 2014

more on Displaylink3 and HDCP encryption

okay another braindump (still nothing working).

The git repo mentioned in previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated a HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet the USB traces contain is 384-bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try and use this fact. So I've taken a trace I made from Windows, and extracted the necessary bits, and using that I've managed to derive the master key used in that trace, and subsequently managed to derive the session key for it. So I've replayed the first encrypted packet from the trace to the device and got an encrypted response the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me? In theory I should be able to replay the full trace (I haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.
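
To make those inputs concrete, the decryption attempt boils down to something like the following with OpenSSL; how riv, streamctr and inputctr get packed into the counter block here is just my reading of the spec, and could easily be the bit that's wrong:

/* Sketch of the decryption attempt: HDCP 2.x content encryption is AES-128-CTR
 * with key ks ^ lc128. The counter block layout below (riv with streamctr XORed
 * into its low 32 bits, followed by inputctr) is a guess from the spec. */
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

int hdcp_ctr_decrypt(const uint8_t ks[16], const uint8_t lc128[16],
                     const uint8_t riv[8], uint32_t streamctr, uint64_t inputctr,
                     const uint8_t *in, uint8_t *out, size_t len)
{
        uint8_t key[16], ctr[16];
        int outl = 0, i, ok;

        for (i = 0; i < 16; i++)
                key[i] = ks[i] ^ lc128[i];                            /* ks ^ lc128 */

        memcpy(ctr, riv, 8);                                          /* riv in the top 64 bits */
        for (i = 0; i < 4; i++)
                ctr[4 + i] ^= (uint8_t) (streamctr >> (24 - 8 * i));  /* streamctr into riv's low 32 bits */
        for (i = 0; i < 8; i++)
                ctr[8 + i] = (uint8_t) (inputctr >> (56 - 8 * i));    /* inputctr in the bottom 64 bits */

        EVP_CIPHER_CTX *c = EVP_CIPHER_CTX_new();
        if (!c)
                return -1;
        ok = EVP_DecryptInit_ex(c, EVP_aes_128_ctr(), NULL, key, ctr) == 1 &&
             EVP_DecryptUpdate(c, out, &outl, in, (int) len) == 1;
        EVP_CIPHER_CTX_free(c);
        return ok ? 0 : -1;
}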

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official HDCP lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point; now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space, probably impossible.

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

November 03, 2014

Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump: a crashdump mechanism that uses kexec to switch to a different kernel before writing out memory to disk, NFS, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Every single RHEL release, someone invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.


October 31, 2014

a day with DisplayLink USB3 and HDCP

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Notes, HDCP1.0 being broken doesn't matter to this, maybe HDCPv2.0 being a bit broken could be used, but I'm not sure how!

The displaylink USB3 protocol is based on the HDCP protocol. I've traced the first few packets and it clearly looks like the host sends two packets, and the device sends back AKE_Send_Cert at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, that you have to verify.

so the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the cert using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing; if you are openssl knowledgeable and want to look, the hack fest is in a git repo I've pushed out.

It might be the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.
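
For reference, the verification attempt amounts to something like this with OpenSSL's EVP API; treating the trailing 384 bytes as an RSASSA-PKCS1-v1_5 / SHA-256 signature over everything before them is my assumption from the spec, and is exactly the sort of detail I may have wrong:

/* Sketch: check the DCP LLC signature on the received cert. Offsets, hash and
 * padding scheme are assumptions, not verified against a working implementation. */
#include <stddef.h>
#include <openssl/evp.h>

int verify_cert_rx(EVP_PKEY *dcp_llc_pub, const unsigned char *cert, size_t cert_len)
{
        const size_t sig_len = 384;              /* assumed 3072-bit RSA signature */
        EVP_MD_CTX *md;
        int ok;

        if (cert_len <= sig_len)
                return -1;

        md = EVP_MD_CTX_new();
        ok = md != NULL &&
             EVP_DigestVerifyInit(md, NULL, EVP_sha256(), NULL, dcp_llc_pub) == 1 &&
             EVP_DigestVerifyUpdate(md, cert, cert_len - sig_len) == 1 &&
             EVP_DigestVerifyFinal(md, cert + cert_len - sig_len, sig_len) == 1;
        EVP_MD_CTX_free(md);
        return ok ? 0 : -1;
}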

Now once I get past this hurdle, the larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128 that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the displaylink driver in hex, but I'd hope they at least hide it better than that. It may also possibly be supplied by the OS, Windows or OSX. (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that Displaylink allow non-HDCP encrypted data to be sent to the device, in which case win if I can find out where/how to do that, or it might be the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P

October 29, 2014

Understanding Wikimedia, or, the Heavy Metal Umlaut, one decade on

It has been nearly a full decade since Jon Udell’s classic screencast about Wikipedia’s article on the Heavy Metal Umlaut (current text; Jan. 2005). In this post, written for Paul Jones’ “living and working online” class, I’d like to use the last decade’s changes to the article to illustrate some points about the modern Wikipedia.1

Measuring change

At the end of 2004, the article had been edited 294 times. As we approach the end of 2014, it has now been edited 1,908 times by 1,174 editors.2

This graph shows the number of edits by year – the blue bar is the overall number of edits in each year; the dotted line is the overall length of the article (which has remained roughly constant since a large pruning of band examples in 2007).



The dropoff in edits is not unusual — it reflects both a mature article (there isn’t that much more you can write about metal umlauts!) and an overall slowing in edits in English Wikipedia (from a peak of about 300,000 edits/day in 2007 to about 150,000 edits/day now).3

The overall edit count — 2000 edits, 1000 editors — can be hard to get your head around, especially if you write for a living. Implications include:

  • Style is hard. Getting this many authors on the same page, stylistically, is extremely difficult, and it shows in inconsistencies small and large. If not for the deeply acculturated Encyclopedic Style we all have in our heads, I suspect it would be borderline impossible.
  • Most people are good, most of the time. Something like 3% of edits are “reverted”; i.e., about 97% of edits are positive steps forward in some way, shape, or form, even if imperfect. This is, I think, perhaps the single most amazing fact to come out of the Wikimedia experiment. (We reflect and protect this behavior in one of our guidelines, where we recommend that all editors Assume Good Faith.)

The name change, tools, and norms

In December 2008, the article lost the “heavy” from its name and became, simply, “metal umlaut” (explanation, aka “edit summary“, highlighted in yellow):

Name change

A few takeaways:

  • Talk pages: The screencast explained one key tool for understanding a Wikipedia article – the page history. This edit summary makes reference to another key tool – the talk page. Every Wikipedia article has a talk page, where people can discuss the article, propose changes, etc. In this case, this user discussed the change (in November) and then made the change in December. If you’re reporting on an article for some reason, make sure to dig into the talk page to fully understand what is going on.
  • Sources: The user justifies the name change by reference to sources. You’ll find little reference to them in 2005, but by 2008, finding an old source using a different term is now sufficient rationale to rename the entire page. Relatedly…
  • Footnotes: In 2008, there was talk of sources, but still no footnotes. (Compare the story about Motley Crue in Germany in 2005 and now.) The emphasis on footnotes (and the ubiquitous “citation needed”) was still a growing thing. In fact, when Jon did his screencast in January 2005, the standardized/much-parodied way of saying “citation needed” did not yet exist, and would not until June of that year! (It is now used in a quarter of a million English Wikipedia pages.) Of course, the requirement to add footnotes (and our baroque way of doing so) may also explain some of the decline in editing in the graphs above.

Images, risk aversion, and boldness

Another highly visible change is to the Motörhead art, which was removed in November 2011 and replaced with a Mötley Crüe image in September 2013. The addition and removal present quite a contrast. The removal is explained like this:

remove File:Motorhead.jpg; no fair use rationale provided on the image description page as described at WP:NFCC content criteria 10c

This is clear as mud, combining legal issues (“no fair use rationale”) with Wikipedian jargon (“WP:NFCC content criteria 10c”). To translate it: the editor felt that the “non-free content” rules (abbreviated WP:NFCC) prohibited copyright content unless there was a strong explanation of why the content might be permitted under fair use.

This is both great, and sad: as a lawyer, I’m very happy that the community is pre-emptively trying to Do The Right Thing and take down content that could cause problems in the future. At the same time, it is sad that the editors involved did not try to provide the missing fair use rationale themselves. Worse, a rationale was added to the image shortly thereafter, but the image was never added back to the article.

So where did the new image come from? Simply:

boldly adding image to lead

“boldly” here links to another core guideline: “be bold”. Because we can always undo mistakes, as the original screencast showed about spam, it is best, on balance, to move forward quickly. This is in stark contrast to traditional publishing, which has to live with printed mistakes for a long time and so places heavy emphasis on Getting It Right The First Time.

In brief

There are a few other changes worth pointing out, even in a necessarily brief summary like this one.

  • Wikipedia as a reference: At one point, in discussing whether or not to use the phrase “heavy metal umlaut” instead of “metal umlaut”, an editor makes the point that Google has many search results for “heavy metal umlaut”, and another editor points out that all of those search results refer to Wikipedia. In other words, unlike in 2005, Wikipedia is now so popular, and so widely referenced, that editors must be careful not to (indirectly) be citing Wikipedia itself as the source of a fact. This is a good problem to have—but a challenge for careful authors nevertheless.
  • Bots: Careful readers of the revision history will note edits by “ClueBot NG“. Vandalism of the sort noted by Jon Udell has not gone away, but it now is often removed even faster with the aid of software tools developed by volunteers. This is part of a general trend towards software-assisted editing of the encyclopedia.
  • Translations: The left hand side of the article shows that it is in something like 14 languages, including a few that use umlauts unironically. This is not useful for this article, but for more important topics, it is always interesting to compare the perspective of authors in different languages.

Other thoughts?

I look forward to discussing all of these with the class, and to any suggestions from more experienced Wikipedians for other lessons from this article that could be showcased, either in the class or (if I ever get to it) in a one-decade anniversary screencast. :)

  1. I still haven’t found a decent screencasting tool that I like, so I won’t do proper homage to the original—sorry Jon!
  2. Numbers courtesy X’s edit counter.
  3. It is important, when looking at Wikipedia statistics, to distinguish between stats about Wikipedia in English, and Wikipedia globally — numbers and trends will differ vastly between the two.

October 24, 2014

Introducing Gthree

I’ve recently been working on OpenGL support in Gtk+, and last week it landed in master. However, the demos we have are pretty lame and are not very good for showing off or even testing the OpenGL support. I’ve looked around for some open source demos using modern GL, but I didn’t find anything that we could easily use.

What I did find though, was a lot of WebGL demos that used three.js. This looked like a very nice open source library for high-level 3D rendering. At first I had some plans to bind OpenGL to gjs so that we could run three.js, but this turned out to be hard.

Instead I started converting three.js into C + GObject, using the Gtk+ OpenGL support and the vector/matrix library graphene that Emmanuele has been working on recently.

After about a week of frantic hacking it is now at a stage where it may be interesting for others. So, without further ado, I introduce Gthree.

It does not yet support everything that three.js can do, but it does support meshes with most mesh material types and lighting, including a loader for the JSON model format of three.js, which means that it is minimally useful.

Here are some screenshots of the examples that ships with the code:

Screenshot from 2014-10-24 15:04:47

Various types of materials

Screenshot from 2014-10-24 15:10:00

Some sample models from three.js examples

Screenshot from 2014-10-24 15:31:40

Some random cubes

This has been a lot of fun to work on as I’ve seen a lot of progress very fast. Mad props to mrdoob and the other three.js developers for creating three.js and making it free software. Gthree is a huge rip-off of their work and would never be possible without it. Thanks also to Emmanuele for his graphene library.

What are you sitting here for, go ahead and play with it! Make some demos, port some more three.js features, marvel at the fancy graphics!

October 23, 2014

Trinity and pages of random data.

Something trinity uses a lot of is pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’. A single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnants of the supporting code were some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After today’s work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mmap’s after a mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code, and unifying a lot of the support code is that a lot of code paths get simpler, and more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.
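To illustrate the point about length-aware helpers, here is a minimal sketch of a dirtying routine that carries its own length around. The struct and function names are made up for illustration; this is not trinity’s actual code.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Illustrative stand-in for a mapping that knows its own size. */
struct map {
	void *ptr;
	size_t len;
};

/* Dirty a mapping without ever writing past its end: because the struct
 * carries its own length, callers can't hand us a buffer smaller than
 * we assume. */
static void dirty_mapping(struct map *m)
{
	unsigned char *p = m->ptr;
	for (size_t i = 0; i < m->len; i++)
		p[i] = rand() & 0xff;
}

int main(void)
{
	struct map m;
	m.len = 4096;
	m.ptr = mmap(NULL, m.len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m.ptr == MAP_FAILED)
		return 1;
	dirty_mapping(&m);
	printf("dirtied %zu bytes\n", m.len);
	return 0;
}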

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1], [2]. Grumble.


October 19, 2014

Laptop bleg

I'm considering a laptop (actually two). Requirements:

  • 13" to 14" class.
  • Indestructible.
  • Display that is not too wide. Enough with 16:9 already! Aspect of 1.6 would be ideal (Lenovo T400 had that).
  • Light. Indestructible is more important, but it should be light: 2kg or less.
  • No nipple. No Lenovo.

Where it comes from is mostly my wife's Sony Vaio Z. I used to have a Z back in 2001 or so, when they were in 12" format. It was the best laptop ever, but unfortunately it succumbed to a DC-DC converter failure. The modern Z is not like that Z. The most super annoying problem is that the screws holding the battery failed in an interesting way: it is impossible to remove the battery now. Also, the contact between the battery and the motherboard is marginal. I managed to fix the problem by manufacturing a finely shaped wooden wedge that I drove into a gap and thus extended the life of that thing, but man, Sony, this is disappointing.

Unfortunately, I don't remember if it was Kota or Daisuke, but one of the Japanese guys at a recent Swift Hackathon in Boston had a Z of similar vintage, and it looked impeccable. Maybe Sony figured that this is going to be the predominant mode of care that their wares receive, and so why not make the modern Z that much more cheaply than the old, indestructible Z. But they still charge exorbitant prices.

Lenovo wins a special notice because I had a T400 for 3 years and swore never to deal with them ever again. The biggest problem is the keyboard layout, because I use the left pinky for the control key. I could live with their idiotic placement of Escape, but I refuse to deal with 3 years of physical pain again. Also, their famous quality seems to be slipping, as my mouse button broke within 3 years. Battery died, too. However, the T400 had a very good display, and I would like another like that, if possible.

October 18, 2014

Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscall arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator, I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. A fairly quick rewrite took care of that.
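To illustrate what “taking the remaining space into account” means, here is a generic, self-contained sketch of a bounded append helper. The names are hypothetical; this is not the actual trinity rendering code.

#include <stdarg.h>
#include <stdio.h>

/* Append formatted text to buf, never writing past buf[size - 1].
 * Returns the new end-of-string offset so the caller can keep appending. */
static size_t append_to_buf(char *buf, size_t size, size_t used,
			    const char *fmt, ...)
{
	va_list ap;
	int n;

	if (used >= size)
		return used;

	va_start(ap, fmt);
	n = vsnprintf(buf + used, size - used, fmt, ap);
	va_end(ap);

	if (n < 0)
		return used;
	if ((size_t) n > size - used - 1)
		return size - 1;	/* output was truncated */
	return used + n;
}

int main(void)
{
	char buf[64];
	size_t used = 0;

	used = append_to_buf(buf, sizeof(buf), used, "open(");
	used = append_to_buf(buf, sizeof(buf), used, "\"%s\", %d", "/tmp/x", 0);
	used = append_to_buf(buf, sizeof(buf), used, ")");
	puts(buf);
	return 0;
}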

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.
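Here is a simplified sketch of that kind of check, with the constant taken from the description above; the names are illustrative, not trinity’s actual code.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Maximum offset a random adjtimex() call could have injected. */
#define ADJTIMEX_MAX_OFFSET 2145

/* One second after 'last' was recorded, 'now' must not have gone
 * backwards, and must not be further ahead than that one second plus
 * whatever adjtimex() could account for. */
static bool child_timestamp_sane(time_t last, time_t now)
{
	if (now < last)
		return false;
	return now <= last + 1 + ADJTIMEX_MAX_OFFSET;
}

int main(void)
{
	time_t last = time(NULL);
	printf("%d\n", child_timestamp_sane(last, last + 1));	 /* 1: sane */
	printf("%d\n", child_timestamp_sane(last, last + 5000)); /* 0: bogus */
	return 0;
}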

There’s been a bunch of small 1-2 line fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: the only interesting bugs this week that Trinity has shown up have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.


October 09, 2014

Emacs hint for Firefox hacking

I started hacking on firefox recently. And, of course, I’ve configured emacs a bit to make hacking on it more pleasant.

The first thing I did was create a .dir-locals.el file with some customizations. Most of the tree has local variable settings in the source files — but some are missing and it is useful to set some globally. (Whether they are universally correct is another matter…)

Also, I like to use bug-reference-url-mode. What this does is automatically highlight references to bugs in the source code. That is, if you see “bug #1050501”, it will be buttonized and you can click (or C-RET) and open the bug in the browser. (The default regexp doesn’t capture quite enough references so my settings hack this too; but I filed an Emacs bug for it.)

I put my .dir-locals.el just above my git checkout, so I don’t end up deleting it by mistake. It should probably just go directly in-tree, but I haven’t tried to do that yet. Here’s that code:

 ;; Generic settings.
 ((nil .
       ;; See C-h f bug-reference-prog-mode, e.g., for using this.
       ((bug-reference-url-format . "")
        (bug-reference-bug-regexp . "\\([Bb]ug ?#?\\|[Pp]atch ?#\\|RFE ?#\\|PR [a-z-+]+/\\)\\([0-9]+\\(?:#[0-9]+\\)?\\)")))

  ;; The built-in javascript mode.
  (js-mode .
	   ((indent-tabs-mode . nil)
	    (js-indent-level . 2)))

  (c++-mode .
	    ((indent-tabs-mode . nil)
	     (c-basic-offset . 2)))

  (idl-mode .
	    ((indent-tabs-mode . nil)
	     (c-basic-offset . 2))))


In programming modes I enable bug-reference-prog-mode. This enables highlighting only in comments and strings. This would easily be done from prog-mode-hook, but I made my choice of minor modes depend on the major mode via find-file-hook.

I’ve also found that it is nice to enable this minor mode in diff-mode and log-view-mode. This way you get bug references in diffs and when viewing git logs. The code ends up like:

(defun tromey-maybe-enable-bug-url-mode ()
  (and (boundp 'bug-reference-url-format)
       (stringp bug-reference-url-format)
       (if (or (derived-mode-p 'prog-mode)
	       (eq major-mode 'tcl-mode)	;emacs 23 bug
	       (eq major-mode 'makefile-mode)) ;emacs 23 bug
	   (bug-reference-prog-mode t)
	 (bug-reference-mode t))))

(add-hook 'find-file-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'log-view-mode-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'diff-mode-hook #'tromey-maybe-enable-bug-url-mode)

October 03, 2014

I think it’s better to look odd than to look normal

In the fall of ’98 I had a thing for a girl I didn’t want to have a thing for. I had also just seen one of my favorite movies, Much Ado About Nothing (the original Branagh movie, not the Joss Whedon one that I didn’t know about until recently and have yet to see).

I decided to exorcise my feelings into a good old-fashioned mix CD (well, I guess that wasn’t old-fashioned back in ’98). I cut up the movie dialogue into pieces, and interspersed them in between a song selection aiming to match the flow of the movie lyric-wise and, in places, sound-wise too, matching the songs to the movie snippets. It ended up being two CDs, and a bunch of my friends liked it as well, so I think I ended up making about 30 copies of the thing.

Today I needed to recreate those two CDs plus their original packaging. That means I had to actually buy CD-Rs (didn’t have any anymore after the move to the US), buy jewel cases (can you believe that I actually have actual boxes with actual empty jewel cases that I *kept* in storage in Belgium? These days if you want to buy them they’re a little harder to find than they used to be, even though I’m sure there must be landfills full of them all over the world), and go to a print shop to print the front and back covers.

Being the obsessive backupper that I am, it was easy to find the sound files again (actually, I took a morituri rip that I made at my best friend’s house, who has the CDs, last time I was there – so that I would have a perfect .cue sheet that would stitch the tracks together). I knew I had the files for the fronts and backs somewhere as well, but they were a little harder to find because I couldn’t remember their names. But I trusted my OCD self that I had backups from fifteen years ago somewhere here with me in NY, and I started looking for files from the same timeframe, until I came across the files I was looking for hidden in a subdirectory.

But then when you find them, what do you do with .cdr CorelDraw files from 1998? I tried inkscape, which uses uniconvertor, which on my F-19 machine failed with a constructor with wrong arguments in Python, which seems like a silly bug. I rebuilt the F-21 version, which gets past that bug, but then doesn’t actually convert anything. I tried an online converter, and it only picked up on the images and none of the text.

So I went the illegal route – I downloaded CorelDraw 11 from the internet, installed it in wine (which was surprisingly easy, it just worked), and I could open the files. Except that it was missing fonts and so the layout was all wrong. Sigh. Hunt random font sites for the missing fonts, install them for wine, open again, rinse, repeat. Eventually the files opened with the right fonts, except that one of the titles was too big to fit on the CD inlay. Oh well, adjust them all manually, make it a little smaller, export to eps, load in gimp, adjust the page as it was perfectly measured for A4 printing but I’m in the US now and the US uses letter which is slightly different, export to pdf so I could go to any random print shop in New York and get it printed.

CD burnt, on to the print shop, fiddle with the printer as nobody in the store can figure out which tray number the tray is where they loaded the card stock paper, and it’s not like the driver on the windows machine knows either – I had to do 5 failed prints to different printers before we even knew which printer was the right one. Cut up the paper by hand with scissors (which I suck at), put it all together, and be on my way.

All this just to say that, while I can be as good about backups as I want to be to bring back to life something I did fifteen years ago, there is still a whole lot of real-world technology fails getting in the way, like outdated proprietary file formats, not having good interchange formats, missing fonts, paper sizes and general Imperial/metric nonsense, ages-old printer crap and just simple manual tasks, which we as humans will probably inflict upon ourselves for forever. I mean, I’d sure like to believe that in the future it will be as simple as pressing a button and getting this 15 year old CD project 3D-printed all at once, but experience has taught me that most likely I will be fiddling just as much with getting 2040’s 3D printer to work with 2025’s data files.

And so it is that I arrive just after 6 at Barnes and Noble in Tribeca, queue up in front of eight registers with only one open, buy a book, get a wristband, go to the back where Emma Thompson is reading from her Peter Rabbit book, in her perfectly English and genuinely funny way, queue after the reading, and hear her say “I think it’s better to look odd than to look normal” to the seven year old twin girls in front of me. I wholeheartedly agree with her. I hand her my copy to sign, give her my two cd’s and tell her what they are and say that I thought this was a good opportunity to give them to her, and she smiles and seems genuinely surprised and pleased.

I think my dad would be genuinely jealous at this point – he always seemed to appreciate seeing her on the screen, and after today I can’t say I blame him. I hope she enjoys the CD’s, and if someone can recommend a good website where I can put these online for others to listen to, that would be great!


September 20, 2014

Emacs Modules

I’ve been working on an odd Emacs package recently — not ready for release — which has turned into more than the usual morass of prefixed names and double hyphens.

So, I took another look at Nic Ferrier’s namespace proposal.

Suddenly it didn’t seem all that hard to implement something along these lines, and after a bit of poking around I wrote emacs-module.

The basic idea is to continue to follow the Emacs approach of prefixing symbol names — but not to require you to actually write out the full names of everything.  Instead, the module system intercepts load and friends to rewrite symbol names as lisp is loaded.

The symbol renaming is done in a simple way, following existing Emacs conventions.  This gives the nice result that existing code doesn’t need to be updated to use the module system directly.  That is, the module system recognizes name prefixes as “implicit” modules, based purely on the module name.

I’d say this is still a proof-of-concept.  I haven’t tried hairier cases, like defclass, and at least declare-function does not work but should.

Here’s the example from the docs:

(define-module testmodule :export (somevar))
(defvar somevar nil)
(defvar private nil)
(provide 'testmodule)

This defines the public variable testmodule-somevar and the “private” variable testmodule--private.

September 12, 2014

Trinity threading improvements and misc

Since my blogging tsunami almost a month ago, I’ve been pretty quiet. The reason being that I’ve been heads down working on some new features for trinity which have turned out to be a lot more involved than I initially anticipated.

Trinity does all of its work in child processes continually forked off from a main process. For a long time I’ve had “investigate using pthreads” as a TODO item, but after various conversations at kernel summit, I decided to bump the priority of that up a little, and spend some time looking at it. I initially guessed that it would take maybe a few weeks to have something usable, but after spending some time working on it, every time I make progress on one issue, it becomes apparent that there’s something else that is also going to need changing.

I’m taking a week off next week to clear my head and hopefully return to this work with fresh eyes, and make more progress, because so far it’s been mostly frustrating, and there may be an easier way to solve some of the problems I’ve been hitting. Sidenote: In the 15+ years I’ve been working on Linux, this is the first time I recall actually ever using pthreads in my own code. I can’t say I’ve been missing out.

Unrelated to that work, a month or so ago I came up with a band-aid fix for a problem where trinity would corrupt its own structures. That ‘fix’ turned out to break the post-mortem work I implemented a few months prior, so I’ve spent some time this week undoing that, and thinking about how I’m going to fix that properly. But before coming up with a fix, I needed to reproduce the problem reliably, and naturally now that I’ve added debug code to determine where the corruption is coming from, the bug has gone into hiding.

I need this vacation.


September 04, 2014

My Wikimania 2014 talks

Primarily what I did during Wikimania was chew on pens.

Discussing Fluid Lobbying at Wikimania 2014, by Sebastiaan ter Burg, under CC BY 2.0

However, I also gave some talks.

The first one was on Creative Commons 4.0, with Kat Walsh. While targeted at Wikimedians, this may be of interest to others who want to learn about CC 4.0 as well.

The second one was on Open Source Hygiene, with Stephen LaPorte. This one is again Wikimedia-specific (and I’m afraid less useful without the speaker notes) but may be of interest to open source developers more generally.

The final one was on sharing; video is below (and I’ll share the slides once I figure out how best to embed the notes, which are pretty key to understanding the slides):

September 02, 2014

Wikimania 2014 Notes – very miscellaneous

A collection of semi-random notes from Wikimania London, published very late:

Gruppenfoto Wikimania 2014 London, by Ralf Roletschek, under CC BY-SA 3.0 Austria

The conference generally

  • Tone: Overall tone of the conference was very positive. It is possibly just a small sample size—any one person can only talk to a small number of the few thousand at the conference—but it seemed more upbeat/positive than last year.
  • Tone, 2: The one recurring negative theme was concern about community tone, from many angles, including Jimmy. I’m very curious to see how that plays out. I agree, of course, and will do my part, both at WMF and when I’m editing. But that sort of social/cultural change is very hard.
  • Speaker diversity: Heard a few complaints about gender balance and other diversity issues in the speaker lineup, and saw a lot of the same (wonderful!) faces as last year. I’m wondering if procedural changes (like maybe blind submissions, or other things from this list) might bring some new blood and improve diversity.
  • “Outsiders”: The conference seemed to have better representation than last year from “outside” our core community. In particular, it was great for me to see huge swathes of the open content/open access movements represented, as well as other free software projects like Mozilla. We should be a movement that works well with others, and Wikimania can/should be a key part of that, so this was a big plus for me.
  • Types of talks: It would be interesting to see what the balance was of talks (and submissions) between “us learning about the world” (e.g., me talking about CC), “us learning about ourselves” (e.g., the self-research tracks), and “the world learning about us” (e.g., aimed at outsiders). Not sure there is any particular balance we should have between the three of them, but it might be revealing to see what the current balance is.
  • Less speaking, more conversing: Next year I will probably propose mostly (only?) panels and workshops, and I wonder if I can convince others to do the same. I can do a talk+slides and stream it at any time; what I can only do in person is have deeper, higher-bandwidth conversations.
  • Physical space and production values: The hackathon space was amazingly fun for me, though I got the sense not everyone agreed. The production values (and the rest of the space) for the conference were very good. I’m torn on whether or not the high production values are a plus for us, honestly. They raise the bar for participation (bad); make the whole event feel somewhat… un-community-ish(?); but they also make us much more accessible to people who aren’t yet ready for the full-on, super-intense Wikimedian Experience.

The conference for projects I work on

  • LCA: Legal/Community Affairs was pretty awesome on many fronts—our talks, our work behind the scenes, our dealing with both the expected and unexpected, etc. Deeply proud to be part of this dedicated, creative team. Also very appreciative for everyone who thanked us—it means a lot when we hear from people we’ve helped.
  • Maps: Great seeing so much interest in Open Street Map. Had a really enjoyable time at their 10th birthday meetup; was too bad I had to leave early. Now have a better understanding of some of the technical issues after a chat with Kolossos and Katie. Also had just plain fun geeking out about “hard choices” like map boundaries—I find how communities make decisions about problems like that fascinating.
  • Software licensing: My licensing talk with Stephen went well, but probably should have been structured as part of the hackathon rather than for more general audiences. Ultimately this will only work out if engineering (WMF and volunteer) is on board, and will work best if engineering leads. (The question asked by Mako afterwards has already led to patches, which is cool.)
  • Creative Commons: My CC talk with Kat went well, and got some good questions. Ultimately the rubber will meet the road when the translations are out and we start the discussion with the full community. Also great meeting User:Multichill; looking forward to working on license templates with him and May from design.
  • Metadata: The multimedia metadata+licensing work is going to be really challenging, but very interesting and ultimately very empowering for everyone who wants to work with the material on commons. Look forward to working with a large/growing number of people on this project.
  • Advocacy: Advocacy panel was challenging, in a good way. A variety of good, useful suggestions; but more than anything else, I took away that we should probably talk about how we talk when subjects are hard, and consensus may be difficult to reach. Examples would include when there is a short timeline for a letter, or when topics are deeply controversial for good, honest reasons.

The conference for me

  • Lesson (1): Learned a lesson: never schedule a meeting for the day after Wikimania. Odds of being productive are basically zero, though we did get at least some things done.
  • Lesson (2): I badly overbooked myself; it hurt my ability to enjoy the conference and meet everyone I wanted to meet. Next year I’ll try to be more focused in my commitments so I can benefit more from spontaneity, and get to see some slightly less day-job-related (but enjoyable or inspirational) talks/presentations.
  • Research: Love that there is so much good/interesting research going on, and do deeply think that it is important to understand it so that I can apply it to my work. Did not get to see very much of it, though :/
  • Arguing with love: As tweeted about by Phoebe, one of the highlights was a vigorous discussion (violent agreement :) with Mako over dinner about the four freedoms and how they relate to just/empowering software more broadly. Also started a good, vigorous discussion with SJ about communication and product quality, but we sadly never got to finish that.
  • Recharging: Just like GUADEC in my previous life, I find these exhausting but also ultimately exhilarating and recharging. Can’t wait to get to Mexico City!


  • London: I really enjoy London—the mix of history and modernity is amazing. Bonus: I think the beer scene has really improved since the last time I was there.
  • Movies: I hardly ever watch movies anymore, even though I love them. Knocked out 10 movies in the 22 hours in flight. On the way to London:
    • The Grand Budapest Hotel (the same movie as every other one of his movies, which is enjoyable)
    • Jodorowsky’s Dune (awesome if you’re into scifi)
    • Anchorman (finally)
    • Stranger than Fiction (enjoyed it, but Adaptation was better)
    • Captain America, Winter Soldier (not bad?)
  • On the way back:
    • All About Eve (finally – completely compelling)
    • Appleseed:Alpha (weird; the awful dialogue and wooden "faces" of computer animated actors clashed particularly badly with the classically great dialogue and acting of All About Eve)
    • Mary Poppins (having just seen London; may explain my love of magico-realism?)
    • The Philadelphia Story (great cast, didn’t engage me otherwise)
    • Her (very good)

August 31, 2014

Revisiting How We Put Together Linux Systems

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems. I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager then grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes on the responsibility of ensuring the software is not malicious, of fixing security problems in a timely manner, and of helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles than the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.

  • Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.

  • The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where x86-64 libraries are installed differs between Fedora and Debian derived systems.

  • Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.

This all together makes it really hard for many upstreams to work nicely with the current way Linux distributions work. Often they try to improve the situation for themselves, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there is no way to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages as well as package managers as end-user install and update tool is incompatible with what many system vendors want.


The End User

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like those Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about it. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we built/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix this problem set in a generic way, for all uses, right in the core components of the OS itself.

Linux has been tremendously successful because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we came up with a basic, reusable scheme for solving the problem set described above that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their software (regardless if just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.

  • We want to allow end users and administrators to install these packages on their systems, regardless which distribution they have installed on it.

  • We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around the variety of concepts of btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As the first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> -- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The <vendorid> field is replaced by some vendor identifier, maybe a scheme like org.fedoraproject.FedoraWorkstation. The <architecture> field specifies a CPU architecture the OS is designed for, for example x86-64. The <version> field specifies a specific OS version, for example 23.4. An example sub-volume name could hence look like this: usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> -- This refers to an instance of an operating system. It's basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The <name> field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> -- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the usr sub-volumes explained above, however, while a usr sub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> -- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> -- This encapsulates an application bundle. It contains a tree that at runtime is mounted to /opt/<vendorid>, and contains all the application's resources. The <vendorid> could be a string like org.libreoffice.LibreOffice, the <runtime> refers to the vendor id of one specific runtime the application is built for, for example org.gnome.GNOME3_20:3.20.1. The <architecture> and <version> refer to the architecture the application is built for, and of course its version. Example: app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> -- This sub-volume shall refer to the home directory of the specific user. The <user> field contains the user name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.

How To Use It

Having introduced this naming scheme, let's see what we can build out of it:

  • When booting up a system we mount the root directory from one of the root sub-volumes, and then mount /usr from a matching usr sub-volume. Matching here means it carries the same <vendor-id> and <architecture>. Of course, by default we should pick the matching usr sub-volume with the newest version.

  • When we boot up an OS container, we do exactly the same as when we boot up a regular system: we simply combine a usr sub-volume with a root sub-volume.

  • When we enumerate the system's users we simply go through the list of home snapshots.

  • When a user authenticates and logs in we mount his home directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the app sub-volume to /opt/<vendorid>/, and the appropriate runtime sub-volume the app picked to /usr, as well as the user's /home/$USER to its place (see the sketch after this list).

  • When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where /usr is mounted from the framework sub-volume, and /home/$USER from his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to build software for multiple runtimes or architectures at the same time.
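To make the app-execution step above a bit more concrete, here is a rough C sketch of combining sub-volumes with bind mounts inside a private mount namespace. The /sysroot location and the specific sub-volume paths are assumptions following the naming scheme above; no current tool works exactly like this.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

/* Rough sketch: give an app its own view of the file system by bind
 * mounting sub-volumes in a private mount namespace. The paths follow
 * the naming scheme above but are purely illustrative, and error
 * checking for the individual bind mounts is omitted for brevity. */
int main(void)
{
	/* Assume the big btrfs volume is mounted at /sysroot. */
	if (unshare(CLONE_NEWNS) < 0) {
		perror("unshare");
		return 1;
	}
	/* Keep our mounts from propagating back to the host. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
		perror("mount private");
		return 1;
	}

	/* /usr comes from the runtime the app was built against, */
	mount("/sysroot/runtime:org.gnome.GNOME3_20:x86_64:3.20.1",
	      "/usr", NULL, MS_BIND, NULL);

	/* the app bundle is mounted to /opt/<vendorid>, */
	mount("/sysroot/app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133",
	      "/opt/org.libreoffice.LibreOffice", NULL, MS_BIND, NULL);

	/* and the user's home sub-volume goes to its usual place. */
	mount("/sysroot/home:lennart:1000:1000",
	      "/home/lennart", NULL, MS_BIND, NULL);

	/* The app itself would then be exec'd inside this namespace. */
	puts("app namespace prepared");
	return 0;
}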

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user to keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.

An Example

Note that in result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install all of these into the same btrfs volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.

Building Blocks

With this naming scheme plus the way we can combine them on execution we have already come quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtimes, frameworks, vendor OSes, as well as the operation for updating them is done the exact same way for all.
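As a rough illustration of the receive side, the sketch below hands a vendor-provided delta file to btrfs receive. The file name and target path are hypothetical, and shelling out like this is just the simplest way to show the idea; the actual delivery mechanism is not defined yet.

#include <stdio.h>
#include <stdlib.h>

/* Rough sketch: apply a vendor-provided send-and-receive delta by
 * handing it to btrfs-receive. File name and target path are made up. */
int main(void)
{
	const char *delta =
		"usr-org.fedoraproject.FedoraWorkstation-x86_64-23.4.delta";
	const char *target = "/sysroot";	/* where the btrfs volume is mounted */
	char cmd[512];

	snprintf(cmd, sizeof(cmd), "btrfs receive -f %s %s", delta, target);
	if (system(cmd) != 0) {
		fprintf(stderr, "applying %s failed\n", delta);
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}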

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies great to embedded use-cases. Regardless if you build a TV, an IVI system or a phone: you can put together your OS versions as usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is do the steps above, using your installed usr tree as source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or in other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, not more. Live-CDs and installed systems can be fully identical.

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar to how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as it is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already take benefit of what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.

There's no need to fully move to a system that uses only btrfs and follows strictly this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this is in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

  • We want a unified scheme for how we can install and update OS images, user apps, runtimes and frameworks.

  • We want a unified scheme for how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as standard feature of the system.

  • We want to allow app vendors to write their programs against very specific frameworks, under the knowledge that they will end up being executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference whether we boot it on bare metal, in a VM, or as a container.

  • We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), but there was no interest in my session submissions there...

Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers. This is a reinvention of how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This after all is explicitly not supposed to be a solution for one specific project and one specific vendor product; we care about making this open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!

August 25, 2014

Emacs versus notification area, again

Ages and ages I wrote about letting Emacs code access the notification area.  I have more to say about it now, but first I want to bore you with some rambling thoughts and some history.

The “notification area” is also called the “status icon area” or the “systray” — it is a spot that holds some icons that are under control of various applications.

I was a fan of the notification area since it first showed up in Gnome.  I recognized it instantly as the thing I wanted that I hadn’t realized I wanted.

Now, as you know, the notification area has fallen on hard times.  It’s been removed in Gnome 3… I searched a bit for the rationale for this deletion, which as far as I can tell is just that some applications abused it, whatever that means; or that it was used inconsistently, which I think the web has conclusively proven is fine by users.  Coming from the Emacs perspective, where one can customize the somewhat-equivalent of the status area (see those recent posts on diminishing minor-mode lighters in the mode line…), and where a certain amount of per-mode idiosyncrasy is the norm, these seem like inadequate reasons.

However, the reason doesn’t really matter.  I love the notification area!  When I moved more of my daily desktop use back into Emacs (the tides are strong but slow, and take years to come in or go out), I hooked Emacs up to it, and made it a part of my basic configuration.

It’s indispensable now.  What I particularly like about it is that it is both noticeable and unobtrusive — the former because I can have the icons blink on important events, and the latter because the icons don’t move around or obscure other windows.

Ok!  You should use it!  And I totally plan to tell you how, but first some boring history.

My original post relied on a hacked version of the Gnome zenity utility.  This turned out to be a real pain over time.  I had to rebuild it periodically, adding hacks (once removing chunks), etc.  Sharing it with others was hard.  And, for whatever reason, the patches in Gnome bugzilla were completely ignored.  Bah.

A bit later I wrote a big patch to Emacs to put all this into the core.  That patch was rejected, more or less.  Bah two.

Then even later I flirted with KDE for a bit.  Yes.  KDE had the nice idea to expose the notification area via dbus, and Emacs could talk dbus… so I did the obvious thing in elisp.  However the KDE notification area was pretty buggy and in the end I had to abandon it as well.

So, it was back to zenity… until this week, during my funemployment.  I rewrote my hacks in Python.  This was so easy I wish I’d done it years and years ago.

I’m not sure what the moral of this story is.  Maybe that my obsession is your gain.  Or maybe that I have trouble letting go.

Anyway, the result is here, on github, or in marmalade.  You’ll need Python and the new (introspection-based) Python Gtk interfaces.  This of course is no trouble to install.  The package includes the base status icon API, plus basic UIs for ERC and EMMS.  Try it out and let me know what you think.

August 22, 2014

Another Mode Line Hack

While streamlining my mode line, I wrote another little mode-line feature that I thought of ages ago — using the background of the mode-line to indicate the current position in the buffer. I didn’t like this enough to use it, but I thought I’d post it since it was a fun hack.

First, make sure the current mode line is kept:

(defvar tromey-real-mode-line-format mode-line-format)

Now, make a little function that formats the mode line using the standard rules and then applies a property depending on the current position in the buffer:

(defun tromey-compute-mode-line ()
  (let* ((width (frame-width))
         (line (substring
                (concat (format-mode-line tromey-real-mode-line-format)
                        (make-string width ? ))
                0 width)))
    ;; Quote "%"s.
    (setq line
          (mapconcat (lambda (c)
                       (if (eq c ?%)
                           "%%"
                         ;; It's absurd that we must wrap this.
                         (make-string 1 c)))
                     line ""))
    (let ((start (window-start))
          (end (or (window-end) (point))))
      (add-face-text-property (round (* (/ (float start) (point-max))
                                        (length line)))
                              (round (* (/ (float end) (point-max))
                                        (length line)))
                              'region nil line))
    line))

We have to do this funny wrapping and “%”-quoting business here because the :eval form returns a mode line format — not just text — and because the otherwise appealing :propertize form doesn’t allow computations.

Also, I’ve never understood why mapconcat can’t handle a character result from the map function.  Anybody?

Now set this to be the mode line:

(setq-default mode-line-format '((:eval (tromey-compute-mode-line))))

The function above changes the background of the mode line corresponding to the current window’s start and end positions.  So, for example, here we are in the middle of a buffer that is bigger than the window:

[screenshot: mode line with the window position shown via background shading]

I left this on for a bit but found it too distracting.  If you like it, use it. You might like to remove the mode-line-position stuff from the mode line, as it seems redundant with the visual display.

August 15, 2014

A breakdown of Linux kernel networking related issues from Coverity scan

For the last of these breakdowns, I’ll focus on fifth place: networking.

Linux supports many different network protocols, so I spent quite a while splitting the net/ tree into per-protocol components. The result looks like this.

Net-802 8
Net-Bluetooth 15
Net-CAIF 9
Net-Core 11
Net-DCCP 5
Net-IRDA 17
Net-NFC 11
Net-SCTP 18
Net-SunRPC 21
Net-Wireless 9
Net-XFRM 6
Net-bridge 14
Net-ipv4 24
Net-ipv6 16
Net-mac80211 12
Net-sched 5
everything else 124

The networking code has gotten noticeably better over the last year. When I initially introduced these components they were all well into double figures. Now, even crap like DECNET has gotten better (both users will be very happy).

“Everything else” above is actually a screw-up on my part. For some reason around 50 or so netfilter issues haven’t been categorized into their appropriate component. The remaining ~70 are quite a mix, but nearly all are small numbers of issues spread across many components. Things like 9p, atm, ax25, batman, can, ceph, l2tp, rds, rxrpc, tipc, vmwsock, and x25. The Lovecraftian protocols you only ever read about.

So networking is in pretty good shape considering just how much stuff it supports. While there’s 24 issues in a common protocol like ipv4, they tend to be mostly benign things rather than OMG 24 WAYS THE NSA IS OWNING YOUR LINUX RIGHT NOW.

That’s the last of these breakdowns I’ll do for now. I’ll do this again maybe in six months to a year, if things are dramatically different, but I expect any changes to be minor and incremental rather than anything too surprising.

After I get back from kernel summit and recover from travelling, I’ll start a series of posts showing code examples of the top checkers.

Breakdown of Linux kernel wireless drivers in Coverity scan

In fourth place on the list of hottest areas of the kernel as seen by Coverity, is drivers/net/wireless.

rtlwifi 96
Atheros 74
brcm80211 67
mwifiex 33
b43 16
iwlwifi 15
everything else 65

I mentioned in my drivers/staging examination that the realtek wifi drivers stood out as especially problematic. Here we see the same situation. Larry Finger has been working on cleaning up this (and other drivers) for some time, but it apparently still has a long way to go.

It’s worth noting that “Atheros” here is actually a number of drivers (ar5523, ath10k, ath5k, ath6k, ath9k, carl9170, wcn36xx, wil6210). I’ve not had time to break those down into smaller components yet, though a quick look shows that ath9k in particular accounts for a sizable portion of those 74 issues.

I was actually surprised at how low the iwlwifi and b43 counts were. I guess there’s something to be said for ubiquitous hardware.

What of all the ancient wireless drivers? The junky pcmcia/pccard drivers like orinoco and friends?
They’re in with those 65 “everything else” bugs, and make up fewer than 5-6 issues each. Considering their age, and the lack of any real maintenance these days, they’re in surprisingly good shape.

Just for fun, here’s how the drivers above compare against the wireless drivers currently in staging.

rtl8821 102 (Staging)
rtlwifi 96
Atheros 74
brcm80211 67
rtl8188eu 42 (Staging)
mwifiex 33
rtl8712 22 (Staging)
rtl8192u 21 (Staging)
rtl8192e 17 (Staging)
b43 16
iwlwifi 15
everything else 65

A breakdown of Linux kernel filesystem issues in Coverity scans

The filesystem code shows up in the number two position of the list of hottest areas of the kernel. Like the previous post on drivers/scsi, this isn’t because “the filesystem code is terrible”, but more that Linux supports so many filesystems that the cumulative effect of issues present in all of them adds up to a figure that dominates the statistics.

The breakdown looks like this.

fs/*.c 77
9P 3
EXTn 36
GFS2 12
HFSPlus 4
NFS 24
OCFS2 35
Reiserfs 12
UDF 14
XFS 33

fs/*.c accounts for the VFS core, AIO, binfmt parsers, eventfd, epoll, timerfd’s, xattr code and a bunch of assorted miscellany. Little wonder it shows up so high; it’s around 62,000 LOC by itself. Of all the entries on the list, this is perhaps the most concerning area given it affects every filesystem.

A little more concerning perhaps is that btrfs is so high on the list. Btrfs is still seeing a lot of churn each release, so many of these issues come and go, but it seems to be holding roughly at the same rate of new incoming issues each release.

EXTn accounts for ext2, ext3, and ext4 combined. Not too bad considering that’s around 74,000 LOC combined (and another 15K LOC for jbd/jbd2).

The CIFS, NFS and OCFS filesystems stand out as potentially something that might be of concern, especially if those issues are over-the-wire trigger-able.

XFS has been improving over the past year. It was around 60-70 when I started doing regular scans, and continues to move downward each release, with few new issues getting added.

The remaining filesystems: not too shabby. Especially considering some of the niche ones don’t get a lot of attention.

A closer look at drivers/scsi Coverity scans.

drivers/scsi showed up in third place in the list of hottest areas of the kernel. Breaking it down into sub-components, it looks like this.

aic7xxx 15
be2iscsi 15
bfa 26
bnx2fc 6
csiostor 10
isci 11
lpfc 38
megaraid 10
mpt2sas 17
mpt3sas 15
pm8001 9
qla2xxx 42
qla4xxx 17
Everything else 152

All these components have been steadily improving over the last year. The obvious stand-out is “Everything else” that looks like it needs to be broken out into more components.
But drivers/scsi is one area of the kernel where we have a *lot* of legacy drivers, many of them 10-15 years old. (Remarkably, some of these are even still in regular use). Looking over the list of filenames matching the “Everything else” component, pretty much every driver that isn’t broken out into its own component is on the list. 3w-9xxx, NCR5380, aacraid, advansys, aic94xx, arcmsr, atp870, bnx2i, cxgbi, dc395x, dpt_i2o, eata, esas2, fdomain, fnic, gdth, hpsa, imm, ipr, ips, mvsas, mvumi, osst, pmcraid, qla1280, qlogicfas, stex, storvsc_drv, sym53x8xx, tmscsim.
None of these are particularly worse than the others, most averaging less than a half dozen issues each.

Ignoring the problems I currently have adding more components, it’s not particularly helpful to break it down further when the result is going to be components with a half dozen issues. It’s not that there’s a few awful drivers dragging down the average, it’s that there’s so many of them, and they all contribute a little bit of awful.

Something I’d like to component-ize, but can’t easily without crafting and maintaining ugly regexps, is the core scsi functionality and its libraries. The problem is that drivers/scsi/*.c includes both legacy drivers, and also scsi core functionality & library functions. I discussed potentially moving all the old drivers to a “legacy” or “vintage” sub-directory at LSF/MM earlier this year with James, but he didn’t seem overly enthusiastic. So it’s going to continue to be lumped in with “Everything else” for now.

The difficulty with figuring out whether many of these issues are real concerns is that because they are hardware drivers, the scanner has no way of knowing what range of valid responses the HBA will return. So there are a number of issues which are of the form “This can’t actually happen, because if the HBA returned this, then we would have called this other function instead”.
Not a problem unique to SCSI, and something that’s seen across many different parts of the kernel.

And for those ancient 15 year old drivers? It’s tough to find someone who either remembers how they work on a chip level, or cares enough to go back and revisit them.

drivers/staging under the Coverity microscope.

In my previous post, I mentioned that drivers/staging took the top spot for number of issues in a component.

Here’s a ‘zoomed in’ look at the sub-components under drivers/staging.

bcm 103
comedi 45
iio 13
line6 7
lustre 133
media 10
rtl8188eu 42
rtl8192e 17
rtl8192u 21
rtl8712 22
rtl8821 102
rts5208 19
unisys 14
vt6655 47
vt6656 4
everything else in drivers/staging/ (40 other uncategorized drivers) 95

Some of the sub-components with < 10 issues are likely to have their categories removed soon. When they were initially added, the open issue counts were higher, but over time they’ve improved to the point where they could just be lumped in with “everything else”.

When Lustre was added back in 3.12, it caused a noticeable jump in new issues detected, the largest delta from any single addition since I’ve been doing regular scans. It’s continuing to make progress, with 20 or so issues being knocked out each release, and few new issues being introduced. Lustre doesn’t suffer overly from any one issue, but has a grab-bag of issues from the many checkers that Coverity has.
Amusingly, Lustre is the only part of the kernel that has Coverity annotations in the code.

Second on the list is the bcm Wimax driver. This has been around in staging for years, and has had a metric shitload of checkpatch-type stylistic changes made to it, but relatively few actual functionality fixes. (Confession: I was guilty of ~30 of those cleanups myself, but I couldn’t bear to look at the 1906-line bcm_char_ioctl function. Splitting that up did have a nice side-effect though.) A lot of the issues in this driver are duplicates due to a problem in a macro being picked up as a new issue for every instance it gets used.

Something that sticks out in this list is the cluster of rtl* drivers. At the time of writing there are seven drivers for various Realtek wireless chips, all of varying quality. Much of the code in these drivers is cut-and-pasted from previous drivers. It seems each time Realtek revs new silicon, they do another code-drop with a new driver. Worse yet, many of the fixes that went into the kernel variants don’t make it back to the driver they based their new work on. There have been numerous cases where a bug fixed in one driver has been reintroduced in a new variant months later. There’s a ton of work going on here, and a lot more needed.
Somewhat depressingly, even the not-in-staging rtlwifi driver that lives in drivers/net/wireless has ~100 issues. Many of them the exact same issues as those in the staging drivers.

As bad as it seems, staging is serving its purpose for the most part, and things have gotten a lot quieter each merge window when the staging tree gets pulled. It’s only when it contains something new and huge like Lustre that it really shows up noticeably in the daily stats after each scan. The number of new issues being added are generally lower than the number being fixed. For the 3.17 pull for example, 67 new issues, 132 eliminated. (Note: Those numbers are kernel wide, not *just* staging, but staging made up the majority of the results change on that day).

Something that bothers me slightly is that a number of drivers have ‘escaped’ drivers/staging into the kernel proper, with a high number of open issues. That said, many of those escapees are no worse than drivers that were added 10+ years ago when standards were lower. More on that in a future post.

Linux kernel Coverity scan ‘hot’ areas.

One of the time-consuming parts of organizing the data generated by Coverity has been sorting it into categories, (or components as Coverity refers to them). A component is a wildcard (or exact filename) that matches a specific subsystem, driver, filesystem etc.

As the Linux kernel has thousands of drivers, it isn’t really practical to add a component per-driver, so I started by generalizing into subsystems, and from there, broke down the larger groupings into per-driver components, while still leaving an “everything else” catch-all for drivers within a subsystem that hadn’t been broken out.

According to discussions I’ve had with Coverity, we are actually one of the heavier users of components, and we’ve hit a few scalability problems as we’ve added more and more of them, which has been another reason I’ve not broken things down beyond the ~150 components we have so far. Also, if a component has fewer than 10 or so issues, it’s really not worth the effort of splitting it up. (I may revise that cut-off even higher at some point just to keep things manageable.)

Before the big reveal, some caveats:

  • Something having ‘100’ issues may not be 100 individual problems. For example, if a problem is in a macro, Coverity flags a new issue for every use of that macro. For something heavily used, like a formatted printk debug wrapper, this could account for many, many warnings (see the sketch after this list).
  • Many of these issues aren’t actual bugs. At the same time, the checker isn’t wrong, but has no way to infer that the use is ok. I’ll explain more about these in a future post when I start showing some actual warnings.
  • Sometimes a combination of both the previous points. As an example: The nouveau driver this week had ~100 issues open against it, making it the #1 in the list of drm drivers with issues. Ben Skeggs spent some time going over them all, and closed out 80-90 of them as intentional/false positives, and came away with around a half dozen or so issues that actually need code changes, and around 20 issues that are still undecided. It’s a laborious time-consuming effort to dig through them, and in many cases, only the person who wrote the code can really determine if the intent matches what the code actually does.
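
To illustrate that first caveat, here is a minimal made-up sketch (the struct, macro and function names are all invented, not taken from any real driver): a single questionable construct inside a debug macro is reported once per expansion, so two call sites show up as two separate issues.

#include <stdio.h>

/* Hypothetical driver context, purely for illustration. */
struct ctx {
    const char *name;
};

/*
 * A formatted debug wrapper.  The body dereferences "c" and only then
 * checks it for NULL, a classic check-after-use pattern.  Because the
 * pattern lives in the macro, a scanner reports it once for every
 * expansion rather than once against the macro itself.
 */
#define ctx_dbg(c, fmt, ...) \
    do { \
        printf("%s: " fmt, (c)->name, ##__VA_ARGS__); \
        if ((c) == NULL) \
            puts("bad context"); \
    } while (0)

static void probe_one(struct ctx *c)
{
    ctx_dbg(c, "probing\n");    /* reported as issue #1 */
}

static void remove_one(struct ctx *c)
{
    ctx_dbg(c, "removing\n");   /* reported as issue #2 */
}

int main(void)
{
    struct ctx c = { "demo" };

    probe_one(&c);
    remove_one(&c);
    return 0;
}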

Right now, the top ten ‘hot areas’ of the kernel (these include accumulated broken-out drivers), sorted by number of issues are:

drivers/staging 694
fs/ 465
drivers/scsi/ 382
drivers/net/wireless 366
net/ 324
drivers/ethernet/ 285
drivers/media/ 262
drivers/usb/ 140
drivers/infiniband/ 109
arch/x86/ 95
sound/ 89

It should come as no surprise really that the staging drivers take the number one spot. If something had beaten it, I think it would have highlighted a somewhat embarrassing problem in our development methods.

In the next posts, I’ll drill down into each of these categories, and figure out exactly why they’re at the top of the list.

For the impatient: once this series is over, I intend to show breakdowns of the various types of issues being detected, but it’s going to take me a while to get to (probably post kernel summit). There’s an absolute ton of data to dig through, and I’m trying to present as much of it in bite-sized chunks as possible, rather than dozens of pages of info.

August 13, 2014

Streamlined Mode Line

The default mode line looks like this:

[screenshot: the default mode line]

At least, it looks sort of like this if you ignore the lamenesses in the screenshot. If you’re like me you probably don’t remember what all these things mean, at least not without looking them up.  What’s that “U”?  Why the “:” or why three hyphens?

At a local Emacs meetup with Damon Haley and Greg Pfeil, Greg mentioned that he’d done some experiments on using unicode characters in his mode-line.  I decided to give it a try.

I took a good look at the above.  I rarely use any of it — I normally don’t care about the coding system or the line ending style.  I can’t remember the last time I had a buffer that was both read-only and modified.  The VC information, when it appears, is generally too verbose and doesn’t show me the one thing I need to know (see below).  And, though I do like to see the name of the major mode, I don’t really need to see most minor mode names; furthermore I like to have a bit of extra space so that I can use other modes that display information that I do want to see in the mode line.

What’s that VC thing?  Well, ordinarily you may see something like Git-master in the mode line. But, usually I already know the version control system being used — or even if I don’t know, I probably don’t care if I am using VC. By default the branch name is in there too. This can be quite long and seems to get stale when I switch branches; and anyway because I do a lot of work via vc-dir, I don’t really need this in every buffer anyway.

However, what is missing is that the mode-line won’t tell me if a buffer should be registered with version control but is not.  This is a pretty common source of errors.

So, first the code to deal with the VC state.  We need a bit more code than you might think, because the information we need isn’t already computed, and my tries to compute it while updating the mode line caused weird behavior.  Our rule for “should be registered” is “a VC back end claims this file, but the file isn’t actually registered”:

(defvar tromey-vc-mode nil)
(make-variable-buffer-local 'tromey-vc-mode)

(require 'vc)
(defun tromey-vc-command-hook (&rest args)
  (let ((file-name (buffer-file-name)))
    (setq tromey-vc-mode (and file-name
                              (not (vc-registered file-name))
                              (vc-responsible-backend file-name)))))

(add-hook 'vc-post-command-functions #'tromey-vc-command-hook)
(add-hook 'find-file-hook #'tromey-vc-command-hook)

(defun tromey-vc-info ()
  (if tromey-vc-mode
      (propertize (string #x26c3 32) 'face 'error)
    " "))

We’ll use that final function in the mode line. Note the odd character in there — my choice was U+26C3 (BLACK DRAUGHTS KING), since I thought it looked disk-drive-like — but you can easily replace it with something else. (Also note the weirdness of using string rather than a string constant. This is just for WordPress’ benefit as its editor kept mangling the actual character.)

To deal with minor modes, I used diminish. This made it easy to remove any display of some modes that I don’t care to know about, and replace the name of some others with a single character:

(require 'diminish)
(diminish 'abbrev-mode)
(diminish 'projectile-mode)
(diminish 'eldoc-mode)
(diminish 'flyspell-mode (string 32 #x2708))
(diminish 'auto-fill-function (string 32 #xa7))
(diminish 'isearch-mode (string 32 #x279c))

Here flyspell is U+2708 (AIRPLANE), auto-fill is U+00A7 (SECTION SIGN), and isearch is U+279C (HEAVY ROUND-TIPPED RIGHTWARDS ARROW).  Haha, Unicode names.

I wanted to try out which-func-mode, now that I had extra space on the mode line, so:

(setq which-func-unknown "")

Finally, we can use all the above and remove some other things from the mode line at the same time:

(setq-default mode-line-format
              '((:eval (if (buffer-modified-p)
                           (propertize (string #x21a7) 'face 'error)
                         " "))
                (:eval (tromey-vc-info))
                " " mode-line-buffer-identification
                "   " mode-line-position
                "  " mode-line-modes))

The “modified” character in there is U+21A7 (DOWNWARDS ARROW FROM BAR).

Here’s how it looks normally (another badly cropped screenshot):

[screenshot: the streamlined mode line]

Here’s how it looks when the file is not registered with the version control system:

[screenshot: the mode line when the file is not registered with version control]

And here’s how it looks when the file is also modified:

[screenshot: the mode line when the file is also modified]

Occasionally I run into some other minor mode I want to diminish, but this is easily done by editing it into my .emacs and evaluating it for immediate effect.

The first year of Coverity Linux kernel scans.

Next week at kernel summit, I’m going to be speaking about the Coverity scans, and have come up with more material than I have time to present in the short slot, so I’ve decided to turn it into a series of blog posts in a hope to kickstart some discussion ahead of time.

I started doing regular scans against the Linux kernel in July 2013. In that time, I’ve sent a bunch of patches, reported many bugs, and spent hours going through the database categorizing, diagnosing, and closing out issues where possible.

I’ve been doing at least one build per day during each merge window (except obviously on days when there haven’t been any commits), and at least one per -rc once the merge window closes.

A few people have asked me about the config file that I use for the builds.
It’s pretty much an ‘allmodconfig’, except where choices have to be made, where I’ve tried to pick the general case that a distribution would select. For some of these, I will occasionally flip between them (e.g. SLAB/SLOB/SLUB, PREEMPT_NONE/PREEMPT_VOLUNTARY/PREEMPT) just for coverage. In total, currently 6955 CONFIG_ options are enabled, 117 disabled. (None by choice; they are all the deselected parts of multi-choice options.)

The builds are x86-64 only. At this time, it’s the only architecture Coverity scan supports. I do have CONFIG_COMPILE_TEST set, so non-x86 drivers that can be built do get scanned. The architecture-specific code in arch/ and the drivers not covered by COMPILE_TEST are the only parts of the kernel we’re not covering.

Builds take about an hour on a 24-core Nehalem. The results are then uploaded to a server, which takes another 20 minutes. Then a script kicks something at Coverity to pick up the new tarball and scan it. This can take any number of hours: at best around 5-6, at worst I’ve seen it take as long as 12. This hopefully answers why I don’t do even more builds, or builds of variant trees. (Although I’m still trying to figure out a way to scan linux-next while having it inherit the results of the issues already marked in Linus’ tree.) Thankfully much of the build/upload/scan process is automated, so I can do other things while I wait for it to finish.

Over the year, the overall defect density has been decreasing.

3.11 0.68
3.12 0.62
3.13 0.59
3.14 0.55
3.15 0.55
3.16 0.53

Moving in the right direction, though things have slowed a little over the last few releases, at least in part due to my spending more time on Trinity than going through the Coverity backlog. The good news is that the incoming rate of new bugs each merge window has also slowed.

Newer issues, when they are introduced, are getting jumped on faster than before. Many developers have signed up for accounts and are looking over their subsystems each release, which is great. It means I have to spend less time sending email :)
Eventually I hope that Coverity implements a feature I asked for, allowing each component to have a designated email address that new reports get sent to. With that in place, plus active triage on the backlog, a real dent could be made in the ~4700 outstanding issues.

Throughout the past year Coverity has made a number of improvements server-side, some at the behest of the scans, resulting in fewer false positives being found by some checkers. A good example of this was some additional heuristics being added to spot intentional ‘missing break in switch statement’ situations. I’ve also been in constant communication whenever an interesting bug was found upstream that Coverity didn’t detect, so over time, additional checkers should be added to catch more bugs.
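
As a (made-up) example of the kind of intentional pattern being referred to: a switch statement where one case deliberately falls through into the next, with a comment explaining why. The names below are invented and the comment-based hint is my assumption for illustration; this is only a sketch of the pattern, not code from any real driver.

#include <stdio.h>

enum dev_state { DEV_RESET, DEV_INIT, DEV_RUN };

static void reset_hw(void) { puts("reset"); }
static void init_hw(void)  { puts("init"); }

static void set_state(enum dev_state s)
{
    switch (s) {
    case DEV_RESET:
        reset_hw();
        /* fall through: a reset always requires re-initialisation */
    case DEV_INIT:
        init_hw();
        break;
    case DEV_RUN:
    default:
        break;
    }
}

int main(void)
{
    set_state(DEV_RESET);   /* prints "reset" then "init" */
    return 0;
}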

How do we compare against other projects ?
I picked a few at random.

FreeBSD 0.54 (~15m LOC) 14655 total, 6446 fixed, 8093 outstanding.
Firefox 0.70 (~5.4m LOC) 9008 total. 5066 fixed. 3786 outstanding.
Linux 0.53 (~9m LOC) 13337 total. 7202 fixed. 4761 outstanding.
Python 0.03 ! (~400k LOC) 1030 total. 895 fixed. 3 outstanding.

(LOC based on C preprocessor output)

FreeBSD’s defect density is pretty much the same as Linux right now, despite having a lot more code. I think they include all their userspace in their scans also, so it’s picked up gcc, sendmail, binutils etc etc.

The Python people have made a big effort to keep their defect density low (afaik, the lowest of all projects in scan). They did however have a lot fewer issues to begin with, and have a much smaller codebase. Firefox by comparison seems to have a lot of the same problems Linux has: a large corpus of pre-existing issues, and a large codebase (probably with few people having ‘global’ knowledge).

In my next post, I’ll go into some detail about where some of the more dense areas of the kernel are for Coverity issues. Much of it should be no surprise (old, unmaintained/neglected code etc), but there are a few interesting cases.

update : added FreeBSD statistics.
update 2 : (hi hackernews!) added blurb about coverity improvements.

August 08, 2014

Week of kernel bugs in review

With the 3.17 merge window opening up this week, it’s been kinda busy.
I also made a few enhancements to Trinity, so it found some bugs that have been there for a while.

In addition to this, I started pulling together a talk for kernel summit based on all the stuff that Coverity has been finding. I’ll eventually get around to turning those into blog posts too, as there’s a lot of material.

Productive week.

compiler sanitizers.

I only recently discovered the sanitizer libraries that both gcc and llvm support despite them being a few years old now. (libasan, liblsan, libtsan and my favorite libubsan for undefined behaviour detection). LLVM also has a -fsanitize=memory.

Building code with -fsanitize={address|leak|undefined} has turned up a number of hard to find issues in various userspace code I’ve written. (Unfortunately doing this on something like Trinity produces a lot of false positives, as it deliberately generates undefined behavior in many cases, like creating an mmap, never writing to it, and then passing it to something that reads it).
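
For anyone who hasn’t played with these yet, here is a trivial, self-contained sketch (not taken from any real project) of the sort of bug the undefined behaviour sanitizer catches at runtime:

/* ubsan-demo.c: build with  gcc -g -fsanitize=undefined ubsan-demo.c
 * Running ./a.out makes libubsan print a report pointing at the
 * out-of-range shift below, with a file:line reference. */
#include <stdio.h>

int main(void)
{
    int shift = 40;             /* wider than an int on common targets */
    int value = 1 << shift;     /* undefined behaviour: shift too large */

    printf("%d\n", value);
    return 0;
}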

There’s also a variant of libasan for the kernel which looks interesting. I know that’s found a bunch of issues in concert with fuzzing via Trinity, and expect it’s something we’ll see more of if/when that functionality gets merged.

Today I was reading about the recent gcc meeting, and these slides by the sanitizer developers caught my attention. What I found of particular interest was the “MSan for Chromium” slide, where they mention they rebuilt ~40 libraries to link with the sanitizer.

I’ve been contemplating doing this for a subset of some userspace packages in Fedora that I care about for a while, but I’ve not had spare cycles to even look into it. I dogfood a lot of bleeding edge code on all my machines, and have been curious for some time to see what the fallout looks like from such a rebuild of various network facing daemons. I suspect with Chromium being more focused on the client side, there hasn’t been a huge amount of research into this for server side code. Looking at ASan’s found bugs wiki page, it does seem to support that hypothesis. I’m curious to see what would fall out from a rebuilt Apache, Bind, Sendmail, nginx, etc.
Hopefully the developers of all the network facing code we ship are just as curious.

There are obvious comparisons to valgrind, which doesn’t require rebuilding, but in my experience so far, the sanitizers have found a bunch of issues that valgrind didn’t (or got lost in the noise). Also, just like with fuzzers, different tools tend to find different bugs even if they have the same intent. I think there’s room for both approaches.

August 05, 2014

Linux 3.16 coverity stats

date rev Outstanding fixed defect density
Jun/8/2014 v3.15 4928 6397 0.55
Jun/16/2014 v3.16-rc1 4817 6651 0.53
Jun/23/2014 v3.16-rc2 4815 6653 0.53
Jun/29/2014 v3.16-rc3 4810 6659 0.53
Jul/6/2014 v3.16-rc4 4806 6661 0.53
Jul/14/2014 v3.16-rc5 4801 6663 0.53
Jul/21/2014 v3.16-rc6 4827 7022 0.53
Jul/28/2014 v3.16-rc7 4820 7022 0.53
Aug/4/2014 v3.16 4817 7023 0.53

The 3.16 cycle really started putting a dent in the backlog of older issues. Hundreds of older issues got fixed in -rc1.
There was a small bump at rc5 in new issues being detected, when Coverity upgraded to their 7.5.0 release.
Improvements in that upgrade also meant it closed out more issues than it found new (395 new: 409 eliminated)

Many of the new issues detected look to be real problems. 50 or so of them come from a new checker that looks for patterns like

if (condition)
    do_something();
else
    do_something();

In a lot of drivers however, it seems to be intentional, as these cases come with FIXME comments suggesting that the author doesn’t know what the right thing to do is in the ‘else’ case, or some functionality doesn’t work right yet, so it falls back to doing the same thing in both branches.

It’s now been a year since I first started doing regular builds in Coverity. In that time, the detected defect density has dropped from 0.68 to 0.53 today. We used to see upticks in new issues every time the merge window opened. Now, we’re seeing as many (or more) issues closed as we are seeing new ones. As an example, day 1 of the 3.17 merge window yesterday featured 3638 new changes, including all the questionable code in drivers/staging/. Coverity picked up 67 new issues, but 132 got eliminated.

I’m hoping things continue to improve at this rate.

Linux 3.15 coverity stats

date rev Outstanding fixed defect density
Mar/31/2014 v3.14 4811 6126 0.55
Apr/14/2014 v3.15-rc1 4909 6337 0.55
Apr/15/2014 v3.15-rc2 4881 6369 0.55
Apr/21/2014 v3.15-rc3 4878 6375 0.55
Apr/28/2014 v3.15-rc4 4966 6382 0.56
May/9/2014 v3.15-rc5 4960 6389 0.56
May/22/2014 v3.15-rc6 4956 6390 0.56
May/27/2014 v3.15-rc7 4954 6392 0.56
Jun/2/2014 v3.15-rc8 4932 6393 0.55
Jun/8/2014 v3.15 4928 6397 0.55

A belated dump of the statistics coverity gathers on defect density for the 3.15 kernel.
The most interesting thing for this cycle was the bump around rc4. The number of outstanding bugs increased by almost a hundred new defects. This was due to Coverity implementing a new checker for detecting the heartbleed bug in openssl.

After dismissing a bunch of false positives/intentional cases, we ended the cycle with a delta vs the previous release of just over a hundred new outstanding issues, but the overall defect density remained at 0.55.

Vacation over, back to oopses.

First day back after a week off.
Started the day by screwing up a coverity build, because went away. Fixed up my scripts and started over, running the analysis on 3.16 which came out yesterday. I’ll do another report tomorrow on how things have changed over the last release or two (sudden realization that I haven’t done one since 3.14).

Spent some time in the afternoon beginning some new functionality in trinity, to have some shared files that all threads write/seek/truncate etc to. It didn’t take much code at all before I was staring at my first oops, when btrfs blew up. Pretty sad, given I’ve not even really started doing anything particularly interesting with this code yet.

July 30, 2014

you have a long road to walk, but first you have to leave the house

or why publishing code is STEP ZERO.

If you've been developing code internally for a kernel contribution, you've probably got a lot of reasons not to default to working in the open from the start: you probably don't work for Red Hat or another company with default-to-open policies, or perhaps you are scared of the scary kernel community and want to present a polished gem.

If your company is a pain with legal reviews etc, you have probably spent/wasted months of engineering time on internal reviews and such, so you assume all of this matters later, because why wouldn't it? You just spent (wasted) a lot of time on it, so it must matter.

So you have your polished codebase; why wouldn't those kernel maintainers love to merge it?

Then you publish the source code.

Oh, look you just left your house. The merging of your code is many many miles distant and you just started walking that road, just now, not when you started writing it, not when you started legal review, not when you rewrote it internally the 4th time. You just did it this moment.

You might have to rewrite it externally 6 times, you might never get it merged, it might be something your competitors are also working on, and the kernel maintainers would rather you cooperated with people your management would lose their minds over, that is the kernel development process.

step zero: publish the code. leave the house.

(lately I've been seeing this problem more and more, so I decided to write it up, and it really isn't directed at anyone in particular, I think a lot of vendors are guilty of this).

July 22, 2014

Slide embedding from Commons

A friend of a friend asked this morning:

I suggested Wikimedia Commons, but it turns out she wanted something like Slideshare’s embedding. So here’s a test of how that works (timely, since soon Wikimanians will be uploading dozens of slide decks!)

This is what happens when you use the default Commons “Use this file on the web -> HTML/BBCode” option on a slide deck pdf:

Wikimedia Legal overview 2014-03-19

Not the worst outcome – clicking gets you to a clickable deck. No controls inline in the embed, though. And importantly nothing to show that it is clickable :/

Compare with the same deck, uploaded to Slideshare:

Some work to be done if we want to encourage people to upload to Commons and share later.

Update: a commenter points me at viewer.js, which conveniently includes a wordpress plugin! The plugin is slightly busted (I had to move some files around to get it to work in my install) but here’s a demo:

Update2: bugs are fixed upstream and in an upcoming 0.5.2 release of the plugin. Hooray!

July 16, 2014

Designers and Creative Commons: Learning Through Wikipedia Redesigns

tl;dr: Wikipedia redesigns mostly ignore attribution of Wikipedia authors, and none approach the problem creatively. This probably says as much or more about Creative Commons as it does about the designers.

disclaimer-y thing: so far, this is for fun, not work; haven’t discussed it at the office and have no particular plans to. Yes, I have a weird idea of fun.

[Two of the surveyed mockups: a refresh variant, and a mild refresh]

It is no longer surprising when a new day brings a new redesign of Wikipedia. After seeing one this weekend with no licensing information, I started going back through seventeen of them (most of the ones listed on-wiki) to see how (if at all) they dealt with licensing, attribution, and history. Here’s a summary of what I found.

Completely missing

Perhaps not surprisingly, many designers completely remove attribution (i.e., history) and licensing information in their designs. Seven of the seventeen redesigns I surveyed were in this camp. Some of them were in response to a particular, non-licensing-related challenge, so it may not be fair to lump them into this camp, but good designers still deal with real design constraints, and licensing is one of them.

History survives – sometimes

The history link is important, because it is how we honor the people who wrote the article, and comply with our attribution obligations. Five of the seventeen redesigns lacked any licensing information, but at least kept a history link.

Several of this group included some legal information, such as links to the privacy policy, or in one case, to the Wikimedia Foundation trademark page. This suggests that our current licensing information may be presented in a worse way than some of our other legal information, since it seems to be getting cut out even by designers who are tolerant of some of our other legalese?

Same old, same old

Four of the seventeen designs keep the same old legalese, though one fails to comply by making it impossible to get to the attribution (history) page. Nothing wrong with keeping the existing language, but it could reflect a sad conclusion that licensing information isn’t worth the attention of designers; or (more generously) that they don’t understand the meaning/utility of the language, so it just gets cargo-culted around. (Credit to Hamza Erdoglu, who was the only mockup designer who specifically went out of his way to show the page footer in one of his mockups.)

A winner, sort of!

Of the seventeen sites I looked at, exactly one did something different: Wikiwand. It is pretty minimal, but it is something. The one thing: as part of the redesign, it adds a big header/splash image to the page, and then adds a new credit specifically for the author of the header/splash image down at the bottom of the page with the standard licensing information. Arguably it isn’t that creative, just complying with their obligations from adding a new image, but it’s at least a sign that not everyone is asleep at the wheel.


This is surely not a large or representative sample, so all my observations from this exercise should be taken with a grain of salt. (They’re also speculative since I haven’t talked to the designers.) That said, some thoughts besides the ones above:

  • Virtually all of the designers who wrote about why they did the redesign mentioned our public-edit-nature as one of their motivators. Given that, I expected history to be more frequently/consistently addressed. Not clear whether this should be chalked up to designers not caring about attribution, or the attribution role of history being very unclear to anyone who isn’t an expert. I suspect the latter.
  • It was evident that some of these designers had spent a great deal of time thinking about the site, and yet were unaware of licensing/attribution. This suggests that people who spend less time with the site (i.e., 99.9% of readers) are going to be even more ignorant.
  • None of the designers felt attribution and licensing was even important enough to experiment on or mention in their writeups. As I said above, this is understandable but sort of sad, and I wonder how to change it.

Postscript, added next morning:

I think it’s important to stress that I didn’t link to the individual sites here, because I don’t want to call out particular designers or focus on their failures/oversights. The important (and as I said, sad) thing to me is that designers are, historically, a culture concerned with licensing and attribution. If we can’t interest them in applying their design talents to our problem, in the context of the world’s most famously collaborative project, we (lawyers and other Commoners) need to look hard at what we’re doing, and how we can educate and engage designers to be on our side.

I should also add that the WMF design team has been a real pleasure to work with on this problem, and I look forward to doing more of it. Some stuff still hasn’t made it off the drawing board, but they’re engaged and interested in this challenge. Here is one example.

morituri 0.2.3 ‘moved’ released!

It’s two weeks shy of a year since the last morituri release. It’s been a pretty crazy year for me, getting married and moving to New York, and I haven’t had much time throughout the year to do any morituri hacking at all. I miss it, and it was time to do something about it, especially since there’s been quite a bit of activity on github since I migrated the repository to it.

I wanted to get this release out to combine all of the bug fixes since the last release before I tackle one of the number one asked for issues – not ripping the hidden track one audio if it’s digital silence. There are patches floating around that hopefully will be good enough so I can quickly do another release with that feature, and there are a lot of minor issues that should be easy to fix still floating around.

But the best way to get back into the spirit of hacking and to remove that feeling of it’s-been-so-long-since-a-release-so-now-it’s-even-harder-to-do-one is to just Get It Done.

I look forward to my next hacking stretch!

Happy ripping everybody.

Closure on some old bugs.

July 15, 2014

catch-up after a brief hiatus.

Yikes, almost a month since I last posted.
In that time, I’ve spent pretty much all my time heads down chasing memory corruption bugs in Trinity, and whacking a bunch of smaller issues as I came across them. Some of the bugs I’ve been chasing have taken a while to reproduce, so I’ve deliberately held off on changing too much at once this last few weeks, choosing instead to dribble changes in a few commits at a time, just to be sure things weren’t actually getting worse. Every time I thought I’d finally killed the last bug, I’d do another run for a few hours, and then see the same corrupted structures. Incredibly frustrating. After a process of elimination (I found a hundred places where the bug wasn’t), I think I’ve finally zeroed in on the problematic code, in the functions that generate random filenames.
I pretty much gutted that code today, which should remove both the bug, and a bunch of unnecessary operations that never found any kernel bugs anyway. I’m glad I spent the time to chase this down, because the next bunch of features I plan to implement leverage this code quite heavily, and would have only caused even more headache later on.

The one plus side of chasing this bug the last month or so has been all the added debugging code I’ve come up with. Some of it has been committed for re-use later, while some of the more intrusive debug features (like storing backtraces for every locking operation) I opted not to commit, but will keep the diff around in case it comes in handy again sometime.

Spent the afternoon clearing out my working tree by committing all the clean-up patches I’ve done while doing this work. (Some of them were a little tangled and needed separating into multiple commits.) Next, removing some lingering code that hasn’t really done anything useful for a while.

I’ve been hesitant to start on the newer features until things calmed down, but that should hopefully be pretty close.

July 04, 2014

FUDCON + GNOME.Asia Beijing 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future, what we achieved and where we are going. In my talk I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system to being a set of basic building blocks to build an OS from. Most of the talk was about where we still intend to take systemd, which areas we believe should be covered by systemd, and of course the always difficult question of where to draw the line and what is clearly outside the focus of systemd. The slides of my talk can be found online. (No video recording I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual the best discussions were the ones I had in the hallway track, discussing Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing; we tried all kinds of fantastic stuff, from Peking duck to bullfrog Sichuan style. Yummy. One of these days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!

June 29, 2014

mach 1.0.3 ‘moved’ released

It’s been a very long time since I last posted something. Getting married, moving across the Atlantic, enjoying the city: it’s all taken its time. And the longer you don’t do something, the harder it is to get back into.

So I thought I’d start simple – I updated mach to support Fedora 19 and 20, and started rebuilding some packages.

Get the source, update from my repository, or wait until updates hit the Fedora repository.

Happy packaging!

June 21, 2014

Democracy and Software Freedom

As part of a broader discussion of democracy as the basis for a just socio-economic system, Séverine Deneulin summarizes Robert Dahl’s Democracy, which says democracy requires five qualities:

First, democracy requires effective participation. Before a policy is adopted, all members must have equal and effective opportunities for making their views known to others as to what the policy should be.

Second, it is based on voting equality. When the moment arrives for the final policy decision to be made, every member should have an equal and effective opportunity to vote, and all votes should be counted as equal.

Third, it rests on ‘enlightened understanding’. Within reasonable limits, each member should have equal and effective opportunities for learning about alternative policies and their likely consequences.

Fourth, each member should have control of the agenda, that is, members should have the exclusive opportunity to decide upon the agenda and change it.

Fifth, democratic decision-making should include all adults. All (or at least most) adult permanent residents should have the full rights of citizens that are implied by the first four criteria.

From “An Introduction to the Human Development and Capability Approach”, Ch. 8 – “Democracy and Political Participation”.

“Poll worker explains voting process in southern Sudan referendum” by USAID Africa Bureau, via Wikimedia Commons.

It is striking that, despite talking a lot about freedom, and often being interested in the question of who controls power, these five criteria might as well be (Athenian) Greek to most free software communities and participants: the question of liberty begins and ends with source code, and has nothing to say about organizational structure and decision-making – critical questions serious philosophers always address.

Our licensing, of course, means that in theory points #4 and #5 are satisfied, but saying “you can submit a patch” is, for most people, roughly as satisfying as saying “you could buy a TV ad” to an American voter concerned about the impact of wealth on our elections. Yes, we all have the theoretical option to buy a TV ad/edit our code, but for most voters/users of software that option will always remain theoretical. We’re probably even further from satisfying #1, #2, and #3 in most projects, though one could see the Ada Initiative and GNOME OPW as attempts to deal with some aspects of #1, #3, and #4.

This is not to say that voting is the right way to make decisions about software development, but simply to ask: if we don’t have these checks in place, what are we doing instead? And are those alternatives good enough for us to have certainty that we’re actually enhancing freedom?

We’ll Build A Dream House Of Net

(Note: the final 0.9.10 will be out later this week…  read on for the awesome that it will contain)

Hey wouldn’t it be great if NetworkManager did X and made my life so awesome I could retire to a private island surrounded by things I love?  Like kittens and teddy bears and bright copper kettles and Domaine Leflaive Montrachet Grand Cru?

If only your dream was reality…  Oh wait!  NetworkManager 0.9.10 will be your genie from Aladdin, granting every wish you dream of, except this time you can wish for more wishes.  But you still can’t bring awful networking back from the dead because it’s just not pretty.  Don’t do it.



Tons of new features, yet somehow smaller and nimbler! (via cuxclipper, CC BY 2.0)

What is pretty is NetworkManager 0.9.10; it’s like the lightning-quick racing yacht that Larry Ellison doesn’t have and really, really wants, but which somehow also adds a Triple-E-Class-worth of new features just for you.  Somebody (maybe you!) wished for every single thing you’re about to see.  And then a magic genie showed up, snapped its fingers, and gave it to them.


We found a usability gap between full-fledged CLI tools like nmcli and GUI-based ones, and thus nmtui was born.  Sometimes you don’t want to remember esoteric commands and options, but you also don’t want to run X. Boom, first wish granted: a curses-based tool for configuring and managing your network, no X involved:



The command-line still rules with divine mandate, and we’re here to please so nmcli was a huge focus for this release.  We’ve added interactive editing support, single-command editing, detailed help, tab completion, and enhanced bash completion.  You really need to check this out; almost anything you can do with GUI tools can now be done with nmcli, and there’s even some stuff nmcli can do that the GUI tools can’t.  If you’re comfortable with terminals, NetworkManager 0.9.10 is right up your alley.


Size Does Matter

Continuing on the quest to be more nimble and streamlined, we’ve split Wi-Fi, WWAN, Bluetooth, ADSL, and WiMAX device support into plugins which you don’t need to install if you like a minimal system.  Distributions should package these separately so they can be added/removed independently of NetworkManager itself, which reduces disk usage, runtime memory usage, and packaging dependency chains.  We’ve also spent time slimming down and optimizing the code.  The core NetworkManager daemon is now just over 1MB in size!

dbus-daemon is also no longer required for root-only or early-boot operation, with communication using a private root-only Unix socket. Similarly, PolicyKit is no longer used for root operation, though it could always be disabled at build-time anyway.

To facilitate remote and SSH-based management, the “at_console” D-Bus permission has been removed, which also helpfully harmonizes authorization settings between Fedora and Debian-based distributions.  All permissions authorization now happens through PolicyKit instead.


NetworkManager works here (via scobleizer, CC BY 2.0)

The Enterprise

When you Absolutely Positively MUST have your ethernet frames delivered on-time and without loss you turn to Data Center Bridging.  DCB provides the reliability and robustness that iSCSI and FibreChannel over Ethernet (FCoE) need so you don’t have to keep shovelling money into a proprietary SAN.  Since users requested it, we snapped our fingers and added support to NetworkManager 0.9.10 for configuring DCB on your ethernet interfaces.

We’ve also upped our game with IP-level configuration support for many more software interfaces like GRE, macvlan, macvtap, tun, tap, veth, and vxlan.  And when you have services that aren’t yet network-aware, the NetworkManager-wait-online systemd service is more reliable to ensure your legacy services start up with the resources they require.

Customization Galore

You dreamed, we listened.  Creepy, no?  Yeah, we know what you want.  And top of the list was more flexible configuration:

  • Connection configuration files are no longer watched for changes by default, which used to cause problems with backups, filesystem copies, half-configured connections, etc.  If you want that behavior you can turn it back on (monitor-connection-files=true), but instead, edit them as much as you want and when you’re done, “nmcli con reload”.
  • Connections can now be locked to interface names instead of just MAC addresses
  • A new “ignore-carrier” option is available to ensure your critical app doesn’t fail just because you got drunk on Captain Morgan + Coke, and tripped over a cable
  • Want to manage /etc/resolv.conf yourself?  You can!  “dns=none” is your new best friend.
  • Configuration file snippets can be dropped into /etc/NetworkManager/conf.d to change smaller sets of configuration options (see the sample drop-in after this list)
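
As a rough sketch of how a couple of the options above could be combined in such a snippet (the file name is made up, and the section each key belongs in should be double-checked against the NetworkManager.conf documentation for your version):

# /etc/NetworkManager/conf.d/90-local.conf   (hypothetical file name)
[main]
# Keep managing /etc/resolv.conf by hand.
dns=none
# Re-enable watching connection files for changes, instead of editing
# freely and running "nmcli con reload" when done.
monitor-connection-files=true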

The NetworkManager dispatcher got some enhancements too.  It now has a “pre-up” event that allows scripts to execute before NetworkManager announces connectivity to applications.  We also added a “pre-down” event that lets network filesystems flush data before the interface is actually disconnected from the network.

Seamless Cooperation

Do you love /sbin/ip?  ifconfig?  brctl?  vconfig?  Keep using them!  Changes you make outside of NetworkManager get picked up, respected, and reflected in the D-Bus API.  NetworkManager 0.9.10 also goes to great lengths to read the existing configuration of interfaces and not touch them.  Most network interfaces known to the kernel are now exposed in the D-Bus API, and you can even change their IP configuration right from NetworkManager.  There’s more work to do here but we hope you’ll appreciate the new situational awareness as much as we do.

Get Your VPN On

We’ve improved support for routing-only VPNs like Openswan/Libreswan/Strongswan.  We’ve added full details of the VPN’s IP configuration to the D-Bus API.  And best yet, VPN plugins can now request additional passwords during the connection process if the ones you previously gave them are wrong or changed.

All the Rest

For clients, more properties are exposed in the D-Bus API.  We’ve added support for custom IP ranges to the Internet Connection Sharing functionality.  We’ve added WWAN autoconnect support and more reliable airplane mode behavior.  Fatal connection errors now more reliably block reconnect, which means better handling of wrong Wi-Fi passwords and access point failures.  Captive portal/hotspot support is moving forward, as are DNSSEC enhancements.

Geez, are we done yet?

Not even close!  Seriously, there’s more but I’m kinda tired of typing.  Try it out (the final release will be out later this week) and tell us what you think.  Then tell us what you want.  Don’t be afraid to dream a little bigger, darling!

(via Alexandra Guerson, CC BY-NC-ND 2.0)

June 20, 2014

Transferring maintainership of x86info.

On Mon Feb 26 2001, I committed the first version of x86info. 13 years later I’m pretty much done with it.
Despite a surge of activity back in February, when I set a new monthly commit count record, it’s been a project I’ve had little time for over the last few years. Luckily, someone has stepped up to take over. Going forward, Nick Black will be maintaining it here.

I might still manage the occasional commit, but I won’t be feeling guilty about neglecting it any more.

An unfinished splinter project I started a while back, x86utils, was intended to be a replacement of sorts for x86info, splitting it into multiple small utilities rather than the one monolithic blob that x86info is. I’m undecided whether I’ll continue to work on this, especially given my time commitments to Trinity and related projects.


June 19, 2014

PSA: Fedora 21, NetworkManager, and DNF

Recently we posted a Fedora 21 update delivering the huge new awesome that is NetworkManager 0.9.10-beta1.  Among a literal Triple E Class boatload of enhancements and fixes, this update continues our fine tradition of making the core of NetworkManager smaller and more flexible by splitting out Wi-Fi support into the NetworkManager-wifi package.  If you don’t have or don’t use Wi-Fi on your system, you don’t need to install stuff for it and you can save some disk space and RAM.

The Problem

To ensure upgrades work correctly and people don't unexpectedly lose Wi-Fi support, we set RPM Obsoletes so that the new package gets installed on upgrade even though it didn't exist before (the usual spec pattern is sketched below).  Unfortunately this didn't work for those using DNF instead of yum for package management.
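
For the curious, the usual RPM trick for this kind of package split looks roughly like the following (an illustrative sketch, not the actual Fedora spec; the version string is made up):

    # in the NetworkManager spec file
    %package wifi
    Summary: Wi-Fi support for NetworkManager
    Requires: %{name}%{?_isa} = %{epoch}:%{version}-%{release}
    # pull the new subpackage in when upgrading from a pre-split NetworkManager
    Obsoletes: NetworkManager < 1:0.9.10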

It turns out that DNF treats RPM obsoletes differently than yum, and it’s unclear right now how package splitting is actually supposed to work with DNF.  You can track the issue here.

The Workaround

If you suddenly find yourself without Wi-Fi, find a wired network connection and:

dnf install NetworkManager-wifi
systemctl restart NetworkManager

and harmony will return to the Universe.  We understand the pain, will continue to monitor the situation, and will update the Fedora NetworkManager packages when DNF has a solution.

June 17, 2014

Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe to that if you are interested in more frequent technical updates on what we are working on.)

In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking advantage of the /usr merge that a number of distributions have completed, we want to bring the runtime behaviour of Linux systems to the next level. With the /usr merge completed, most static vendor-supplied OS data is found exclusively in /usr, and only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
  2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as a factory reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP, or are capable of discovering their parameters automatically from the available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particularly interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.


A number of Linux-based operating systems have tried to implement some of the schemes described above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind and cannot cover the general-purpose case. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system, we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

  1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
  2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will call these stateless systems.)

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like the first described mode. Note that a mode where /etc is flushed but /var is not is something we do not intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).


Booting up a system without a populated /var is relatively straightforward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, much software does not. In cases like this it is necessary to ship a couple of additional tmpfiles lines that set up at boot time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time (a sample snippet follows).
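
Such a drop-in might look like this minimal sketch (the service name and paths are hypothetical; the columns are the usual tmpfiles ones: type, path, mode, user, group, age):

    # /usr/lib/tmpfiles.d/myservice.conf
    # recreate the state directory in /var at boot if it is missing
    d /var/lib/myservice 0750 myservice myservice -
    # and an empty state file inside it
    f /var/lib/myservice/state 0640 myservice myservice -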

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var, there needs to be a way for these directories to be upgraded transparently when necessary, for example by recreating caches in /etc or adding missing system users to /etc/passwd on the next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

  • A new tool systemd-sysusers has been added. It introduces a new drop-in directory /usr/lib/sysusers.d/. Minimal descriptions of necessary system users and groups can be placed there (a sample file is sketched after this list). Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.) The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative -- which is how system users have traditionally been created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.

    To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.

    Some OS designs use a static, fixed user/group list stored in /usr as the primary database for users/groups, with fixed UID/GID mappings. While this works for specific systems, it cannot cover the general-purpose case. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made of this space and only the UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.

    Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.

  • We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been updated since the last boot (a sample unit is sketched after this list). This is implemented based on the mtime timestamp of /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary.
  • We added a number of service files that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementioned systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc.
  • If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.
  • We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability to copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc. Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, which always contains the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as it provides administrators with an easy place to diff their own configuration in /etc against, to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd, invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that combines this into simple switches, reusing the vocabulary introduced earlier.
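
To make the sysusers and ConditionNeedsUpdate= pieces from the list above a little more concrete, here is a rough sketch (the file names, the "myservice" user and the rebuild command are hypothetical; the formats follow the descriptions above):

    # /usr/lib/sysusers.d/myservice.conf
    # Type  Name       ID  GECOS                 Home
    g       myservice  -
    u       myservice  -   "My Service Daemon"   /var/lib/myservice

    # /usr/lib/systemd/system/myservice-update.service
    [Unit]
    Description=Rebuild the myservice cache after an offline /usr update
    ConditionNeedsUpdate=/etc

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/myservice-rebuild-cache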

What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incompatible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (simply because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

  • When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped privileges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.
  • When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.
  • When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.)

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.
  • Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.


Before we finish, let me stress again why we are doing all this:

  1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
  2. For embedded machines we want a generic way to reset devices. We also want a way for every single boot to be identical to a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
  7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all the discussions on how /usr was to be organized this was frequently mentioned. A setup like this has so far only worked in very specific cases; with this scheme we want to make it work in the general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Let's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.

Daily log June 16th 2014

Catch up from the last week or so.
Spent a lot of time turning a bunch of code in trinity inside out. After splitting the logging code into “render” and “write out buffer”, I had a bunch of small bugs to nail down. While chasing those I kept occasionally hitting a strange bug where the list of pids would get corrupted. Two weeks on, and I’m still no closer to figuring out the exact cause, but I’ve got a lot more clues (mostly, by now, knowing what it _isn’t_, thanks to a lot of code rewriting).
Some of that rewriting has been on my mind for a while. The shared ‘shm’ structure between all processes was growing to the point that I was having trouble remembering why certain variables were shared there rather than just being globals in main, inherited at fork time. So I moved a bunch of variables from the shm to another structure named ‘childdata’, an array of pointers to which lives in the shm.

Some of the code then looked a little convoluted with references that used to be shm->data being replaced with shm->children[childno].data, which I remedied by having each child instantiate a ‘this_child’ var when it starts up, allowing this_child->data to point to the right thing.

All of this code churn served in part to refresh my memory on how some of this worked, hoping that I’d stumble across the random scribble that occasionally happens. I also added a bunch of code to things like dump backtraces, which has been handy for turning up 1-2 unrelated bugs.

On Friday afternoon I discovered that the bug only shows up if MALLOC_PERTURB_ is set. Which is curious, because the struct in question that gets scribbled over is never free()’d. It’s allocated once at startup, and inherited in each child when it fork()’s.

Debugging continues..


June 12, 2014

Building clean docker images

I’ve been doing some experiments recently with ways to build docker images. As a test bed for this I’m building an image with the Gtk+ Broadway backend, and some apps. This will both be useful (have a simple way to show off Broadway) as well as being a complex real-world use case.

I have two goals for the image. First of all, building of custom code has to be repeatable and controlled. Secondly, the produced image should be clean and have a minimal size. In particular, we don’t want any of the build dependencies or intermediate results in any layer in the final docker image.

The approach I’m using involves a build Dockerfile, which installs all the build tools and the development dependencies. Inside this container I build a set of rpms. Then another Dockerfile creates a runtime image which just installs the necessary runtime rpms from the rpms built in the first container.

However, there is a complication to this. The runtime image is created using a Dockerfile, but there is no way for a Dockerfile to access the files produced from the build container, as volumes don’t work during docker build. We could extract the rpms from the container and use ADD in the dockerfile to insert the files into the container before installing them, but this would create an extra layer in the runtime image that contained the rpms, making it unnecessarily large.

To solve this I made the build container construct a yum repository of all the built rpms, and then, when the container is run, start lighttpd to export it. So, by doing:

docker build -t broadway-repo .
docker run -d -p 9999:80 broadway-repo

I get a yum repository server at port 9999 that can be used by the second container. Ideally we would want to use docker links to get access to the build container, but unfortunately links don’t work with docker build. Instead we use a hack in the Makefile to generate a yum repo file with the host ip address and add that to the runtime image.
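
The generated repo file is just an ordinary yum repository definition pointing at the build container, something along these lines (the file name and IP address are illustrative and get filled in by the Makefile):

    # broadway.repo
    [broadway-repo]
    name=Broadway build rpms
    baseurl=http://172.17.42.1:9999/
    enabled=1
    gpgcheck=0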

Here is the Dockerfile used for the build container. There are some interesting things to notice in it:

  • Instead of starting by adding all the source files we add them one by one as needed. This allows us to use the caching that docker build does, so that if you change something at the end of the Dockerfile docker will reuse the previously built images all the way up to the first line that has changed.
  • We try to do multiple commands in each RUN invocation, chaining them together with &&. This way we can avoid the current limit of 127 layers in a docker image (see the fragment after this list).
  • At the end we use createrepo to create the repository and set the default command to run lighttpd on the repository at port 80.
  • We rebuild all packages in the Gtk+ stack, so there are no X11 dependencies in the final image. Additionally, we strip a bunch of dependencies and use some other tricks to make the rpms themselves a bit smaller.
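
For illustration, a fragment in that style (not the actual Dockerfile; the spec file and package names are placeholders) chains one logical build phase into a single RUN line:

    ADD gtk3.spec /root/
    RUN yum install -y rpm-build gcc make && \
        rpmbuild -bb /root/gtk3.spec && \
        yum clean all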

Here is the Dockerfile for the runtime container. It installs the required rpms, creates a user and sets up an init script and a super-simple panel/launcher app written in javascript. We also use a simple init as pid 1 in the container so that processes that die are reaped.

Here is a screenshot of the final results:


You can try it yourself by running “docker run -d -p 8080:8080 alexl/broadway” and loading it in a browser.

June 10, 2014

Chromium revisited

It's been more than a year since I've had a successful build of Chromium that I was willing to share with anyone else, but last night I pushed out a Fedora 20 x86_64 build of the current stable Chromium. Here's where you can go and get it:

1) Get this repo file: and put it in /etc/yum.repos.d/
2) EDIT - I've signed the packages with my personal GPG key, upon request. This means you also need to download my public key. You can either get it from:
or by running:
gpg --recv-key 93054260 ; gpg --export --armor 93054260 > spot.gpg
Then, copy it (as root) to /etc/pki/rpm-gpg/
cp -a spot.gpg /etc/pki/rpm-gpg/
3) Run yum install chromium chromium-v8

Why has it been so long?

A) I really believe in building things from source code. At the very least, I should be able to take the source code and know that I can follow instructions (whether in an elaborate README or in src.rpm) and get a predictable binary that matches it. This is one of the core reasons that I got involved with FOSS oh so many years ago.

B) I'm a bit of a perfectionist. I wanted any Chromium builds to be as fully functional as possible, and not just put something out there with disabled features.

C) Chromium's not my day job. I don't get paid to do it by Red Hat (they don't mind me doing it as long as I also do the work they _do_ pay me for).

D) I also build Chromium using as many system libraries as possible, because I really get annoyed at Google's policy of bundling half the planet. I also pull out and package independently some components that are reasonably independent (webrtc, ffmpegsumo).

If my schedule is reasonably clear, it takes me about 7-10 _days_ to get a Chromium build up and going, with all of its components. Why did it take a year this time around? Here's the specifics:

AA) I changed jobs in November, and that didn't leave me very much time to do anything else. It also didn't leave me very motivated to drill down into Chromium for quite some time. I've gotten my feet back underneath me though, so I made some time to revisit the problem....

BB) ... and the core problem was the toolchains that Chromium requires. Chromium relies upon two independent (sortof) toolchains to build its Native Client (NaCl) and Portable Native Client (PNaCl) support. NaCl is binutils/gcc/newlib (it's also glibc and other things, but Chromium doesn't need those so I don't build them), and PNaCl is binutils/llvm/clang/a whole host of other libs. NaCl was reasonably easy to figure out how to package, even if you do have to do a bootstrap pass through gcc to do it, but for a very long time, I had no success getting PNaCl to build whatsoever. I tried teasing it apart into its components, but while it depends on the NaCl toolchain to be built, it also builds and uses incompatible versions of libraries that conflict with that toolchain. Eventually, I tried just building it from the giant "here is all the PNaCl source in one git checkout" that Google loosely documents, but it never worked (and it kept trying to download pre-built NaCl binaries to build itself which I didn't want to use).

** deep breath **

After a few months not looking at Chromium or NaCl or PNaCl, I revisited it with fresh eyes and brain. Roland McGrath was very helpful in giving me advice and feedback as to where I was going wrong with my efforts, and I finally managed to produce a PNaCl package that builds. It's not done the way I'd want it to be (it uses the giant git checkout of all PNaCl sources instead of breaking it out into components), but it is built entirely from source and it uses my NaCl RPMs. The next hurdle was the build system inside Chromium. The last time I'd done a build, I used gyp to generate Makefiles, because, for all of make's eccentricities, it is the devil we understand. I bet you can guess the next part... Makefile generation no longer works. Someone reported a bug on it, and Google's response is paraphrased as "people use that? we should disable it." They've moved the build tool over to something called "ninja", which was written by Google for Chromium. It's not the worst tool ever, but it's new, and learning new things takes time. Make packages, test packages, build, repeat. Namespace off the v8 that chromium needs into a chromium-v8 package that doesn't conflict with the v8 in Fedora proper that node.js uses. Discover that Google has made changes to namespace zlib (to be fair, it's the same hack Firefox uses), so we have to use their bundled copy. Discover that Google has added code assuming that icu is bundled and no longer works with the system copy. Discover that Google's fork of libprotobuf is not compatible with the system copy, but the API looks identical, so it builds against the system copy but does not work properly (and coredumps when you try to set up sync). Add all the missing files that need to go into the package (there is no "make install" equivalent).

Then, we test. Discover that NaCl/PNaCl sortof works, but nothing graphical does. Figure out that we need to enable the "Override software rendering list" in chrome://flags because intel graphics are blacklisted on Linux (it works fine once I do that, at least on my Thinkpad T440s, your mileage may vary). Test WebRTC (seems to work). Push packages and hope for the best. Wait for the inevitable bugs to roll in.


I didn't do an i686 build (but some of the libraries that are likely to be multilib on an x86_64 system are present in i686 builds as well), because I'm reasonably sure there are not very many chromium users for that arch. If I'm wrong, let me know. I also haven't built for any older targets.

June 06, 2014

Monthly Fedora kernel bug statistics – May 2014

                             19     20   rawhide   (total)
Open                         88    308       158     (554)
Opened since 2014-05-01       9    112        20     (141)
Closed since 2014-05-01      26     68        17     (111)
Changed since 2014-05-01     49    293        31     (373)


June 03, 2014

Daily log June 2nd 2014

Spent the day making some progress on trinity’s “dump buffers after we detect tainted kernel” code, before hitting a roadblock. It’s possible for a fuzzing child process to allocate a buffer to create (for example) a mangled filename. Because this allocation is local to the child, attempts to reconstruct what it pointed to from the dumping process will fail.
I spent a while thinking about this before starting on a pretty severe rewrite of how the existing logging code works. Instead of just writing the description of what is in the registers to stdout/logs, it now writes it to a buffer, which can be referenced directly by any of the other processes. A lot of the output code actually looks a lot easier to read as an unintended consequence.
No doubt I’ve introduced a bunch of bugs in the process, which I’ll spend time trying to get on top of tomorrow.


May 28, 2014

Daily log May 27th 2014

Spent much of the day getting on top of the inevitable email backlog.
Later started cleaning up some of the code in trinity that dumps out syscall parameters to tty/log files.
One of the problems that has come up with trinity is that many people run it with logging disabled, because it’s much faster when you’re not constantly spewing to files, or scrolling text. This unfortunately leads to situations where we get an oops and we have no idea what actually happened. We have some information that we could dump in that case, which we aren’t right now.
So I’ve been spending time lately working towards a ‘post-mortem dump’ feature. The two pieces missing are 1. moving the logging functions away from being special-cased to be called from within a child process, so that they can just chew on a syscall record (this work is now almost done), and 2. maintaining a ringbuffer of these records per child (not a huge amount of work once everything else is in place).

Hopefully I’ll have this implemented soon, and then I can get back on track with some of the more interesting fuzzing ideas I’ve had for improving it.


May 27, 2014

Next stop, SMT

The little hardware project, mentioned previously, continues to chug along. A prototype is now happily blinking LEDs on a veroboard:


Now it's time to use a real technology, so the thing could be used onboard a car or airplane. Since the heart of the design is a chip in LGA-14 package, we are looking at soldering a surface-mounted element with 0.35 mm pitch.

I reached out to members of a local hackerspace for advice, and they suggested forgetting what people post to the Internet about wave-soldering in an oven. Instead, buy a cheap 10x microscope, a fine tip for my iron, and some flux-solder paste, then do it all by hand. As Yoda said, there is only do and do not, there is no try.

May 25, 2014

Digital life in 2014

I do not get out often and my clamshell cellphone incurs a bill of $3/month. So, imagine my surprise when at a dinner with colleagues everyone pulled out a charger brick and plugged their smartphone into it. Or, actually, one guy did not have a brick with him, so someone else let him tap into a 2-port brick (pictured above). The same ritual repeated at every dinner!

I can offer a couple of explanations. One is that it is much cheaper to buy a smartphone with a poor battery life and a brick than to buy a cellphone with a decent battery life (decent being >= 1 day, so you could charge it overnight). Or, that market forces are aligned in such a way that there are no smartphones on the market that can last for a whole day (anymore).

UPDATE: My wife is a smartphone user and explained it to me. Apparently, sooner or later everyone hits an app which is an absolute energy hog for no good reason, and is unwilling to discard it (in her case it was Pomodoro, but it could be anything). Once that happens, no technology exists to pack enough battery into a smartphone. Or so the story goes. In other words, they buy a smartphone hoping that they will not need the brick, but inevitably they do.

May 23, 2014

OpenStack Atlanta 2014

The best part, I think, came during the Swift Ops session, when our PTL, John Dickinson, asked for a show of hands. For example -- without revealing any specific proprietary information -- how many people run clusters of less than 10 nodes, 10 to 100, 100 to 1000, etc. He also asked how old the production clusters were, and if anyone deployed Swift in 2014. Most clusters were older, and only 1 man raised his hand. John asked him what were the difficulties setting it up, and the man said: "we had some, but then we paid Red Hat to fix it up for us, and they did, so it works okay now." I felt so useful!

The hardware show was pretty bare, compared to Portland, where Dell and OCP brought out cool things.

HGST showed their Ethernet drive, but they used a chassis so dire, I don't even want to post a picture. OCP did the same this year: they brought a German partner who demonstrated a storage box that looked built in a basement in Kherson, Ukraine, while under siege by Kiev forces.

Here's a poor pic of cute Mellanox Ethernet wares: a switch, NICs, and some kind of modern equivalent of GBICs for fiber.

Interestingly enough, although Mellanox displayed Ethernet only, I heard in other sessions that Infiniband was not entirely dead. Basically if you need to step beyond bonded 10GbE, there's nothing else for your overworked Swift proxies: it's Infiniband or nothing. My interlocutor from SwiftStack implied that a router existed into which you could plug your Infiniband pipe, I think.

DisplayPort MST dock support for Fedora 20 copr repo.

So I've created a copr with a kernel + intel driver that should provide support for DisplayPort MST on Intel Haswell hardware. It doesn't do any of the fancy Dell monitor stuff yet; it's primarily for people who have Lenovo or Dell docks and laptops that can't currently multihead.

The kernel source is from this branch which backports a chunk of stuff to v3.14 to support this.

It might still have some bugs and crashes, but the basics should in theory work.

Daily log May 22nd 2014

Found another hugepage VM bug.
This one is VM_BUG_ON_PAGE(PageTail(page), page). The filemap.c bug has gone into hiding since last weekend. Annoying.

On the trinity list, Michael Ellerman reported a couple bugs that looked like memory corruption since my changes yesterday to make fd registration more dynamic. So I started the day trying out -fsanitize=address without much success. (I had fixed a number of bugs recently using it). However, reading the gcc man page, I found out about -fsanitize=undefined, which I was previously unaware of. This is new in gcc 4.9, so ships with Fedora rawhide, but my test box is running F20. Luckily, clang has supported it even longer.

So I spent a while fixing up a bunch of silly bugs, like (1<<31) where it should have been (1UL<<31), and some not so obvious alignment bugs. A dozen or so fixes for bugs that had been there for years. Sadly though, nothing that explained the corruption Michael has been seeing. I spent some more time looking over yesterday’s changes without success. Annoying that I don’t see the corruption when I run it.

Decided not to take on anything too ambitious for the rest of the day given I’m getting an early start on the long weekend, by taking off tomorrow. Hopefully when I return with fresh eyes on Tuesday I’ll have some idea of what I can do to reproduce the bug.
