July 29, 2015

Conference submission and voting

Generally I feel that I do not do any work that's important enough to present at conferences. My previous presentation was at OLS back in 2005, concerning usbmon. usbmon is something a guy learning C would program: it's a circular buffer into which the kernel drops tracing events; Wireshark pulls them out. Hardly conference material, but at the time I thought it was supremely important to proselytize the basic techniques of always-on tracing, because it would improve the quality and the ease of debugging of the kernel overall. I really wanted the FireWire guys to adopt a similar tracing scheme, because it was hell on a stick debugging juju with just printk(). Needless to say, that was a miserable failure, as was FireWire itself. I don't think anyone who came to listen to my presentation in Ottawa received their money's worth.

Or did they? Recently an epiphany occurred to me. I really should not even think about whether anyone is interested. That is the conference organizers' job, not mine! As a result, I sent a proposal to OpenStack Tokyo, entitled "The Plot to Destroy OpenStack Swift Using C++: Enhancements of Swift API Compatibility in Ceph RADOS Gateway". It's basically a compendium of practical issues that occur when running Swift apps on top of Ceph RGW and what we do to help people do that.

Things are a little different from 10 years ago, because attendees can vote on the submissions. This sounds democratic. I went through all submissions on the storage track and voted on them according to my preference. It took a very long time and I suspect that I was crowdsourced by the organizers in the best traditions of Web 2.0. I wonder if they'll even read the abstracts. :-)

July 28, 2015

Announcing systemd.conf 2015

We are happy to announce the inaugural systemd.conf 2015 conference of the systemd project.

The conference takes place November 5th-7th, 2015 in Berlin, Germany.

Only a limited number of tickets are available, hence make sure to sign up quickly.

For further details consult the conference website.

July 23, 2015

Wikimania 2015 – random thoughts and observations

Random thoughts from Wikimania, 2015 edition (2013, 2014):

"Wikimania 2015 Reception at Laboratorio Arte Alameda - 02" by Jarek Tuszynski,  under CC BY 4.0
  • Dancing: After five Wikimedia events (not counting WMF all-hands) I was finally dragged onto the dance floor on the last night. I’ll never be Garfield, but I had fun anyway. The amazing setting did not hurt.
  • Our hosts: The conference was excellently organized and run. I’ve never had Mexico City high on my list of “places I must see” but it moved up many spots after this trip.
  • First timers: I always enjoy talking to people who have never been to Wikimania before. They almost always seem to have enjoyed it, but of course the ones I talk to are typically the ones who are more outgoing and better equipped to enjoy things. I do hope we’re also being welcoming to people who don’t already know folks, or who aren’t as outgoing.
  • Luis von Ahn: Good to chat briefly with my long-ago classmate. I thought the Q&A section of his talk was one of the best I’ve seen in a long time. There were both good questions and interesting answers, which is more rare than it should be.
  • Keynotes: I’d love to have one keynote slot each year for a contributor to talk about their work within the movement. Finding the right person would be a challenge, of course, as could language barriers, but it seems like it should be doable.
  • US English: I was corrected on my Americanisms and the occasional complexity of my sentence structure. It was a good reminder that even for fairly sophisticated speakers of English as a second language, California-English is not terribly clear. This is especially true when spoken. Verbose slides can help, which is a shame, since I usually prefer minimal slides. I will try to work on that in the future, and see how we can help other WMFers do the same.
  • Mobile: Really hope someday we can figure out how to make the schedule legible on a mobile device :) Good reminder we’ve got a long way to go there.
  • Community engagement: I enjoyed my department’s “engage with” session, but I think next year we need to make it more interactive—probably with something like an introduction/overview followed by a World Cafe-style discussion. One thing we did right was to take questions on written cards. This helped indicate what the most important topics were (when questions were repeated), avoided the problem of lecture-by-question, and opened the floor to people who might otherwise be intimidated because of language barriers or personality. Our booth was also excellent and I’m excited to see some of the stories that came out of it.
  • Technology and culture: After talking about how we’d used cards to change the atmosphere of a talk, someone deliberately provoked me: shouldn’t we address on-wiki cultural issues the same way, by changing the “technology” used for discussion? I agree that technology can help improve things, and we should think about it more than we do (e.g.) but ultimately it can only be part of the solution – our most difficult problems will definitely require work on culture as well as interfaces. (Surprisingly, my 2009 post on this topic holds up pretty well.)
  • Who is this for? I’ve always felt there was some tension around whether the conference is for “us” or for the public, but never had language for it. An older gentleman who I spoke with for a while finally gave me the right term: is it an annual meeting or is it a public conference? Nothing I saw here changed my position, which is that it is more annual meeting than public conference, at least until we get much better at turning new users into long-term users.
  • Esino Lario looks like it will be a lot of fun. I strongly support the organizing committee’s decision to focus less on brief talks and more on longer, more interactive conversations. That is clearly the best use of our limited time together. I’m also excited that they’re looking into blind submissions (which I suggested in my Wikimania post from last year).
  • Being an exec: I saw exactly one regular talk that was not by my department, though I did have lots and lots of conversations. I’m still not sure how I feel about this tradeoff, but I know it will become even harder if we truly do transition to a model with more workshops/conversations and fewer lectures, since those will be both more valuable and more time-consuming/less flexible.
  • Some day: I wrote most of this post in the Mexico City airport, and saw that there are flights from there to La Habana. I hope someday we can do a Wikimania there.

July 20, 2015

Fedora 22 killed IPv6 and I'm fine

I upgraded Fedora on my home router to F22 and immediately IPv6 disappeared on the internal network. The problem is that radvd started throwing its usual "no linklocal address configured on ethmain.5" (although the message is only visible with "IgnoreIfMissing off;"), which leads to "interface ethmain.5 does not exist or is not set up properly". With the default IgnoreIfMissing, radvd continues running but refuses to work, quietly. Needless to say, the interface has a perfectly valid link-local address, same as it had in F21 before the upgrade.

There used to be a time when I took a problem like this as an affront to the idea of IPv6 superiority and the reputation of Fedora as a platform for a roll-your-own home router. Now though, I don't give a rat's tail for IPv6. Let Comcast and Google care and pay someone to care. Okay, I lied. I cared enough to file bug 1244428, but I'm not rushing to build from SRPMs, reinstall old versions, and such.

July 16, 2015

NetworkManager 1.0.4 Released!

Just a quick note that we’ve released the latest stable NetworkManager, version 1.0.4.  This is mainly a bugfix release, though it does have a couple new features and larger changes.  Some of those are:

  • Some configuration options can now be changed without restarting the NM daemon.  Those include the ‘dns’, ‘connectivity’, and ‘ignore-carrier’ settings.
  • Devices that have only an IPv6 link-local address are no longer assumed to be connected; by default whenever an interface is set “up” the kernel will assign an IPv6 link-local address which could theoretically be used for communication.  NetworkManager used to interpret this as the interface being available for network communication, while this is rarely what users want or expect.
  • Correct routing is now maintained when two interfaces of the same priority are connected to the same network.
  • udev rules can now be used to tell NetworkManager to manage or unmanage specific devices.
  • Connections with children (bridge, bond, team, etc) can now optionally bring up their slave interfaces when the master is started.
  • Many, many bugs and crashers have also been fixed in the core daemon, the libraries, and nmcli.

Grab the 1.0.4 release here!

We’re well into the development cycle of NetworkManager 1.2 as well, where among other things, we’ll finally be moving to GDBus instead of dbus-glib.  We’ll also have support for structured logging with journald, indicating that certain connections are metered, advertising LLDP-discovered information, built-in IPv4 link-local addressing to get rid of avahi-autoipd, improvements to Wi-Fi Access Point scanning, less verbose logging by default, improvements to DNS handling, and more.  Look for it later this year!

 

July 12, 2015

Future development of Trinity.

It’s been an odd few weeks regarding Trinity based things.

First an email from a higher-up at my former employer asking (paraphrased)..

"That thing we asked you to stop working on when you worked here, any chance now you've left you'll implement these features."

I’m still trying to get my head around the thought process that led to that being a reasonable thing to ask. I’ve made the occasional commit over the last six months, but it’s mostly been code motion, clean-up work, and things like syscall table updates. New feature development came to a halt long ago.

It’s no coincidence that the number of bugs reported as found with Trinity has dropped off sharply since the beginning of the year, and I don’t think it’s because the Linux kernel suddenly got lots better. Rather, it’s due to the lack of real ongoing development to “try something else” when some approaches dry up. Sadly we now live in a world where it’s easier to get paid to run someone else’s fuzzer than it is to develop one.

Then earlier this week, came the revelation that the only people prepared to fund that kind of new feature development are pretty much the worst people.

Apparently Hacking Team modified Trinity to fuzz ioctl() on Android, which yielded some results. I’ve done no analysis on whether those crashes are exploitable/fixed/only relevant to Android etc. (Frankly, I’m past caring). I’m not convinced their approach is particularly sound even if it was finding results Trinity wasn’t, so it looks unlikely there are even ideas to borrow here. (We all already knew that ioctl was rife with bugs, and had practically zero test coverage).

It bothers me that my work was used as a foundation for their hack-job. Then again, maybe if I hadn’t released Trinity, they’d have based it on iknowthis, or some other less useful fuzzer. None of this really should surprise me. I’ve known for some time that there are some “security” people who have their own modifications they have no intention of sending my way. Thanks to the way that people who release 0-days are revered in this circus, there’s no incentive for people to share their modifications if it means that someone else might beat them to finding their precious bugs.

It’s unfortunate that this project has attracted so many awful people. When I began it, the motivation had nothing to do with security. Back in 2010 we were inundated with weird oopses that we couldn’t reproduce, many times triggered by JVMs. I came up with the idea that maybe a fuzzer could create a realistic enough workload to tickle some of those same bugs. Turned out I was right, and so began a series of huge page and other VM related bug fixes.

In the five years that I’ve made Trinity available, I’ve received notable contributions from perhaps a half dozen people. In return I’ve made my changes available before I’d even given them runtime myself.

It’s a project everyone wants to take from, but no-one wants to give back to.

And that’s why for the foreseeable future, I’m unlikely to make public any further feature work I do on it.
I’m done enabling assholes.

Future development of Trinity. is a post from: codemonkey.org.uk

July 02, 2015

Time for a new GPG key

My GPG key has lasted me well, over 18 years, but it's a v2 key and therefore no longer supported by newer versions of GnuPG. So it's time to move to a new one. I've made a transition statement available. If you signed my old key please consider signing the new one.

June 24, 2015

xdg-app moving to freedesktop.org

For anyone following the development of xdg-app, all development has now moved to freedesktop.org. Here is where things are happening now:

June 18, 2015

The new sd-bus API of systemd

With the new v221 release of systemd we are declaring the sd-bus API shipped with systemd stable. sd-bus is our minimal D-Bus IPC C library, supporting as back-ends both classic socket-based D-Bus and kdbus. The library has been part of systemd for a while, but has only been used internally, since we wanted to have the liberty to still make API changes without affecting external consumers of the library. However, now we are confident enough to commit to a stable API for it, starting with v221.

In this blog story I hope to provide you with a quick overview on sd-bus, a short reiteration on D-Bus and its concepts, as well as a few simple examples how to write D-Bus clients and services with it.

What is D-Bus again?

Let's start with a quick reminder what D-Bus actually is: it's a powerful, generic IPC system for Linux and other operating systems. It knows concepts like buses, objects, interfaces, methods, signals, properties. It provides you with fine-grained access control, a rich type system, discoverability, introspection, monitoring, reliable multicasting, service activation, file descriptor passing, and more. There are bindings for numerous programming languages that are used on Linux.

D-Bus has been a core component of Linux systems for more than 10 years. It is certainly the most widely established high-level local IPC system on Linux. Since systemd's inception it has been the IPC system it exposes its interfaces on. And even before systemd, it was the IPC system Upstart used to expose its interfaces. It is used by GNOME, by KDE and by a variety of system components.

D-Bus refers to both a specification, and a reference implementation. The reference implementation provides both a bus server component, as well as a client library. While there are multiple other, popular reimplementations of the client library – for both C and other programming languages –, the only commonly used server side is the one from the reference implementation. (However, the kdbus project is working on providing an alternative to this server implementation as a kernel component.)

D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However, the protocol may be used on top of TCP/IP as well. It does not natively support encryption, hence using D-Bus directly on TCP is usually not a good idea. It is possible to combine D-Bus with a transport like ssh in order to secure it. systemd uses this to make many of its APIs accessible remotely.

A frequently asked question about D-Bus is why it exists at all, given that AF_UNIX sockets and FIFOs already exist on UNIX and have been used for a long time successfully. To answer this question let's make a comparison with popular web technology of today: what AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX sockets/FIFOs only shovel raw bytes between processes, D-Bus defines actual message encoding and adds concepts like method call transactions, an object system, security mechanisms, multicasting and more.
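To make the "raw bytes" point concrete, here is a minimal sketch of what one has to invent by hand on a bare socket before anything resembling a method call exists. The length prefix, the JSON payload and the field names ("serial", "method", "reply_to") are all assumptions made up for this illustration; this is not the D-Bus wire format, only the kind of convention D-Bus standardizes so that nobody has to reinvent it.

```python
import json
import socket
import struct


def send_msg(sock, obj):
    """Frame one message: 4-byte big-endian length, then a JSON payload."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may hand them over in pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Undo the framing done by send_msg()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def demo():
    """One hand-rolled 'method call' and its reply over a socket pair."""
    client, service = socket.socketpair()
    try:
        send_msg(client, {"serial": 1, "method": "Ping", "args": []})
        call = recv_msg(service)
        # The reply has to name the call it answers; the socket won't do it.
        send_msg(service, {"reply_to": call["serial"], "result": "pong"})
        return recv_msg(client)
    finally:
        client.close()
        service.close()


if __name__ == "__main__":
    print(demo())
```

Message boundaries, request/reply matching and argument encoding all had to be made up here, and there is still no type checking, access control or multicasting.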

From our 10+ years of experience with D-Bus we know today that while there are some areas where we can improve things (and we are working on that, both with kdbus and sd-bus), it generally appears to be a very well designed system that has stood the test of time, aged well and is widely established. Today, if we sat down and designed a completely new IPC system incorporating all the experience and knowledge we gained with D-Bus, I am sure the result would be very close to what D-Bus already is.

Or in short: D-Bus is great. If you hack on a Linux project and need a local IPC, it should be your first choice. Not only because D-Bus is well designed, but also because there aren't many alternatives that can cover similar functionality.

Where does sd-bus fit in?

Let's discuss why sd-bus exists, how it compares with the other existing C D-Bus libraries and why it might be a library to consider for your project.

For C, there are two established, popular D-Bus libraries: libdbus, as it is shipped in the reference implementation of D-Bus, as well as GDBus, a component of GLib, the low-level tool library of GNOME.

Of the two libdbus is the much older one, as it was written at the time the specification was put together. The library was written with a focus on being portable and to be useful as back-end for higher-level language bindings. Both of these goals required the API to be very generic, resulting in a relatively baroque, hard-to-use API that lacks the bits that make it easy and fun to use from C. It provides the building blocks, but few tools to actually make it straightforward to build a house from them. On the other hand, the library is suitable for most use-cases (for example, it is OOM-safe making it suitable for writing lowest level system software), and is portable to operating systems like Windows or more exotic UNIXes.

GDBus is a much newer implementation. It was written after considerable experience with using a GLib/GObject wrapper around libdbus. GDBus is implemented from scratch and shares no code with libdbus. Its design differs substantially from libdbus: it contains code generators that make it especially easy to expose GObject objects on the bus, or to talk to D-Bus objects as GObject objects. It translates D-Bus data types to GVariant, which is GLib's powerful data serialization format. If you are used to GLib-style programming then you'll feel right at home; hacking D-Bus services and clients with it is a lot simpler than using libdbus.

With sd-bus we now provide a third implementation, sharing no code with either libdbus or GDBus. For us, the focus was on providing kind of a middle ground between libdbus and GDBus: a low-level C library that actually is fun to work with, that has enough syntactic sugar to make it easy to write clients and services with, but on the other hand is more low-level than GDBus/GLib/GObject/GVariant. To be able to use it in systemd's various system-level components it needed to be OOM-safe and minimal. Another major point we wanted to focus on was supporting a kdbus back-end right from the beginning, in addition to the socket transport of the original D-Bus specification ("dbus1"). In fact, we wanted to design the library closer to kdbus' semantics than to dbus1's, wherever they are different, but still cover both transports nicely. In contrast to libdbus or GDBus portability is not a priority for sd-bus, instead we try to make the best of the Linux platform and expose specific Linux concepts wherever that is beneficial. Finally, performance was also an issue (though a secondary one): neither libdbus nor GDBus will win any speed records. We wanted to improve on performance (throughput and latency) -- but simplicity and correctness are more important to us. We believe the result of our work delivers our goals quite nicely: the library is fun to use, supports kdbus and sockets as back-end, is relatively minimal, and the performance is substantially better than both libdbus and GDBus.

To decide which of the three APIs to use for your C project, here are short guidelines:

  • If you hack on a GLib/GObject project, GDBus is definitely your first choice.

  • If portability to non-Linux kernels -- including Windows, Mac OS and other UNIXes -- is important to you, use either GDBus (which more or less means buying into GLib/GObject) or libdbus (which requires a lot of manual work).

  • Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here, this is all about plain C only. But do note: if you use Qt, then QtDBus is the D-Bus API of choice, being a wrapper around libdbus.)

Introduction to D-Bus Concepts

To the uninitiated D-Bus usually appears to be a relatively opaque technology. It uses lots of concepts that appear unnecessarily complex and redundant at first sight. But actually, they make a lot of sense. Let's have a look:

  • A bus is where you look for IPC services. There are usually two kinds of buses: a system bus, of which there's exactly one per system, and which is where you'd look for system services; and a user bus, of which there's one per user, and which is where you'd look for user services, like the address book service or the mail program. (Originally, the user bus was actually a session bus -- so that you get multiple of them if you log in many times as the same user --, and on most setups it still is, but we are working on moving things to a true user bus, of which there is only one per user on a system, regardless how many times that user happens to log in.)

  • A service is a program that offers some IPC API on a bus. A service is identified by a name in reverse domain name notation. Thus, the org.freedesktop.NetworkManager service on the system bus is where NetworkManager's APIs are available and org.freedesktop.login1 on the system bus is where systemd-logind's APIs are exposed.

  • A client is a program that makes use of some IPC API on a bus. It talks to a service, monitors it and generally doesn't provide any services on its own. That said, lines are blurry and many services are also clients to other services. Frequently the term peer is used as a generalization to refer to either a service or a client.

  • An object path is an identifier for an object on a specific service. In a way this is comparable to a C pointer, since that's how you generally reference a C object, if you hack object-oriented programs in C. However, C pointers are just memory addresses, and passing memory addresses around to other processes would make little sense, since they of course refer to the address space of the service, the client couldn't make sense of it. Thus, the D-Bus designers came up with the object path concept, which is just a string that looks like a file system path. Example: /org/freedesktop/login1 is the object path of the 'manager' object of the org.freedesktop.login1 service (which, as we remember from above, is still the service systemd-logind exposes). Because object paths are structured like file system paths they can be neatly arranged in a tree, so that you end up with a venerable tree of objects. For example, you'll find all user sessions systemd-logind manages below the /org/freedesktop/login1/session sub-tree, for example called /org/freedesktop/login1/session/_7, /org/freedesktop/login1/session/_55 and so on. How services precisely label their objects and arrange them in a tree is completely up to the developers of the services.

  • Each object that is identified by an object path has one or more interfaces. An interface is a collection of signals, methods, and properties (collectively called members), that belong together. The concept of a D-Bus interface is actually pretty much identical to what you know from programming languages such as Java, which also know an interface concept. Which interfaces an object implements is up to the developers of the service. Interface names are in reverse domain name notation, much like service names. (Yes, that's admittedly confusing, in particular since it's pretty common for simpler services to reuse the service name string also as an interface name.) A couple of interfaces are standardized though and you'll find them available on many of the objects offered by the various services. Specifically, those are org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer and org.freedesktop.DBus.Properties.

  • An interface can contain methods. The word "method" is more or less just a fancy word for "function", and is a term used pretty much the same way in object-oriented languages such as Java. The most common interaction between D-Bus peers is that one peer invokes one of these methods on another peer and gets a reply. A D-Bus method takes a couple of parameters, and returns others. The parameters are transmitted in a type-safe way, and the type information is included in the introspection data you can query from each object. Usually, method names (and the other member types) follow a CamelCase syntax. For example, systemd-logind exposes an ActivateSession method on the org.freedesktop.login1.Manager interface that is available on the /org/freedesktop/login1 object of the org.freedesktop.login1 service.

  • A signature describes a set of parameters a function (or signal, property, see below) takes or returns. It's a series of characters that each encode one parameter by its type. The set of types available is pretty powerful. For example, there are simpler types like s for string, or u for an unsigned 32-bit integer, but also complex types such as as for an array of strings or a(sb) for an array of structures consisting of one string and one boolean each. See the D-Bus specification for the full explanation of the type system. The ActivateSession method mentioned above takes a single string as parameter (the parameter signature is hence s), and returns nothing (the return signature is hence the empty string). Of course, the signature can get a lot more complex, see below for more examples.

  • A signal is another member type that the D-Bus object system knows. Much like a method it has a signature. However, they serve different purposes. While in a method call a single client issues a request on a single service, and that service sends back a response to the client, signals are for general notification of peers. Services send them out when they want to tell one or more peers on the bus that something happened or changed. In contrast to method calls and their replies they are hence usually broadcast over a bus. While method calls/replies are used for duplex one-to-one communication, signals are usually used for simplex one-to-many communication (note however that that's not a requirement, they can also be used one-to-one). Example: systemd-logind broadcasts a SessionNew signal from its manager object each time a user logs in, and a SessionRemoved signal every time a user logs out.

  • A property is the third member type that the D-Bus object system knows. It's similar to the property concept known by languages like C#. Properties also have a signature, and are more or less just variables that an object exposes, that can be read or altered by clients. Example: systemd-logind exposes a property Docked of the signature b (a boolean). It reflects whether systemd-logind thinks the system is currently in a docking station of some form (only applies to laptops …).
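To get a feeling for how signatures such as s, as or a(sb) are put together, here is a small sketch that splits a signature string into its individual parameters ("single complete types" in the specification's wording). It is an illustration written for this text, not code from sd-bus or any other D-Bus library, and it skips some rules of the specification such as nesting depth and length limits.

```python
# Type codes of the basic (non-container) D-Bus types.
BASIC_TYPES = set("ybnqiuxtdsogh")


def _read_one(sig, pos):
    """Return the index just past the single complete type starting at pos."""
    if pos >= len(sig):
        raise ValueError("signature ends unexpectedly")
    c = sig[pos]
    if c in BASIC_TYPES or c == "v":  # basic type or variant
        return pos + 1
    if c == "a":  # array: 'a' followed by exactly one element type
        if pos + 1 < len(sig) and sig[pos + 1] == "{":
            # Dict entry {KV}: only valid as an array element, basic-type key.
            key = pos + 2
            if key >= len(sig) or sig[key] not in BASIC_TYPES:
                raise ValueError("dict entry key must be a basic type")
            end = _read_one(sig, key + 1)
            if end >= len(sig) or sig[end] != "}":
                raise ValueError("unterminated dict entry")
            return end + 1
        return _read_one(sig, pos + 1)
    if c == "(":  # struct: one or more complete types in parentheses
        pos += 1
        if pos < len(sig) and sig[pos] == ")":
            raise ValueError("empty struct")
        while pos < len(sig) and sig[pos] != ")":
            pos = _read_one(sig, pos)
        if pos >= len(sig):
            raise ValueError("unterminated struct")
        return pos + 1
    raise ValueError(f"unexpected character {c!r}")


def split_signature(sig):
    """Split a signature into one string per parameter."""
    types, pos = [], 0
    while pos < len(sig):
        end = _read_one(sig, pos)
        types.append(sig[pos:end])
        pos = end
    return types


if __name__ == "__main__":
    for example in ["s", "", "as", "a(sb)", "ssa{sv}"]:
        print(repr(example), "->", split_signature(example))
```

The empty signature of a method that returns nothing, like ActivateSession, simply splits into an empty list.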

So much for the various concepts D-Bus knows. Of course, all these new concepts might be overwhelming. Let's look at them from a different perspective. I assume many of the readers have an understanding of today's web technology, specifically HTTP and REST. Let's try to compare the concept of an HTTP request with the concept of a D-Bus method call:

  • An HTTP request is issued on a specific network. It could be the Internet, or it could be your local LAN, or a company VPN. Depending on which network you issue the request on, you'll be able to talk to a different set of servers. This is not unlike the "bus" concept of D-Bus.

  • On the network you then pick a specific HTTP server to talk to. That's roughly comparable to picking a service on a specific bus.

  • On the HTTP server you then ask for a specific URL. The "path" part of the URL (by which I mean everything after the host name of the server, up to the last "/") is pretty similar to a D-Bus object path.

  • The "file" part of the URL (by which I mean everything after the last slash, following the path, as described above), then defines the actual call to make. In D-Bus this could be mapped to an interface and method name.

  • Finally, the parameters of an HTTP call follow the path after the "?"; they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit like comparing apples and oranges. However, I think it's still useful to get a bit of a feeling of what maps to what.
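The mapping can also be written down as a small data structure. The following sketch is only an illustration: the MethodCall class and the split_http() helper are made up for this text, the URL is an invented example, and "31" is an illustrative session ID. The one real thing is the argument order of the busctl call command (service, object path, interface, method, then signature and arguments); the sketch only builds the command line and does not run it.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


def split_http(url):
    """Split a URL into the pieces used in the comparison above."""
    parts = urlsplit(url)
    directory, _, leaf = parts.path.rpartition("/")
    # (server, "path" part up to the last "/", "file" part, parameters)
    return parts.netloc, directory + "/", leaf, parts.query


@dataclass(frozen=True)
class MethodCall:
    bus: str           # "system" or "user"; plays the role of the network
    service: str       # plays the role of the HTTP server
    object_path: str   # plays the role of the "path" part of the URL
    interface: str     # interface + member play the role of the "file" part
    member: str
    signature: str     # describes the parameters that follow
    args: tuple = ()

    def busctl_argv(self):
        """Build (but do not execute) the equivalent busctl command line."""
        argv = ["busctl", f"--{self.bus}", "call", self.service,
                self.object_path, self.interface, self.member]
        if self.signature:
            argv.append(self.signature)
            argv.extend(str(a) for a in self.args)
        return argv


if __name__ == "__main__":
    print(split_http("http://intranet.example/api/sessions/activate?id=31"))
    call = MethodCall("system", "org.freedesktop.login1",
                      "/org/freedesktop/login1",
                      "org.freedesktop.login1.Manager",
                      "ActivateSession", "s", ("31",))
    print(" ".join(call.busctl_argv()))
```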

From the shell

So much about the concepts and the gray theory behind them. Let's make this exciting, let's actually see how this feels on a real system.

For a while now, systemd has included a tool busctl that is useful to explore and interact with the D-Bus object system. When invoked without parameters, it will show you a list of all peers connected to the system bus. (Use --user to see the peers of your user bus instead):

$ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.4                                       708 systemd-logind  root             :1.4          systemd-logind.service    -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
[…]
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.login1                     708 systemd-logind  root             :1.4          systemd-logind.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -
[…]

(I have shortened the output a bit, to keep things brief).

The list begins with a list of all peers currently connected to the bus. They are identified by peer names like ":1.11". These are called unique names in D-Bus nomenclature. Basically, every peer has a unique name, and they are assigned automatically when a peer connects to the bus. They are much like an IP address, if you will. You'll notice that a couple of peers are already connected, including our little busctl tool itself as well as a number of system services. The list then shows all actual services on the bus, identified by their service names (as discussed above; to discern them from the unique names these are also called well-known names). In many ways well-known names are similar to DNS host names, i.e. they are a friendlier way to reference a peer, but on the lower level they just map to an IP address, or in this comparison the unique name. Much like you can connect to a host on the Internet by either its host name or its IP address, you can also connect to a bus peer either by its unique or its well-known name. (Note that each peer can have as many well-known names as it likes, much like an IP address can have multiple host names referring to it).
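The relationship between unique and well-known names can be modelled in a few lines. The ToyBus class below is a deliberately naive model invented for this text: it hands out ":1.N" names on connect and keeps a table from well-known names to unique names, and it ignores everything a real bus does around name contention, access control and disconnects. The name org.example.Alias is a made-up second name.

```python
import itertools


class ToyBus:
    """Naive model of name handling on a bus; not how a real bus works."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._peers = set()   # unique names of connected peers
        self._owners = {}     # well-known name -> unique name

    def connect(self):
        """Every peer gets a unique name automatically (the 'IP address')."""
        unique = f":1.{next(self._ids)}"
        self._peers.add(unique)
        return unique

    def request_name(self, unique, well_known):
        """Attach a well-known name (the 'host name') to a connected peer."""
        if unique not in self._peers:
            raise KeyError(unique)
        owner = self._owners.get(well_known, unique)
        if owner != unique:
            raise ValueError(f"{well_known} is already owned by {owner}")
        self._owners[well_known] = unique

    def resolve(self, name):
        """Accept either kind of name and return the unique name."""
        if name.startswith(":"):
            if name not in self._peers:
                raise KeyError(name)
            return name
        return self._owners[name]

    def names_of(self, unique):
        """All well-known names currently pointing at one peer."""
        return sorted(n for n, o in self._owners.items() if o == unique)


if __name__ == "__main__":
    bus = ToyBus()
    logind = bus.connect()
    bus.request_name(logind, "org.freedesktop.login1")
    bus.request_name(logind, "org.example.Alias")
    print(logind, bus.names_of(logind))
    print(bus.resolve("org.freedesktop.login1") == bus.resolve(logind))
```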

OK, that's already kinda cool. Try it for yourself, on your local machine (all you need is a recent, systemd-based distribution).

Let's now go the next step. Let's see which objects the org.freedesktop.login1 service actually offers:

$ busctl tree org.freedesktop.login1
└─/org/freedesktop/login1
  ├─/org/freedesktop/login1/seat
  │ ├─/org/freedesktop/login1/seat/seat0
  │ └─/org/freedesktop/login1/seat/self
  ├─/org/freedesktop/login1/session
  │ ├─/org/freedesktop/login1/session/_31
  │ └─/org/freedesktop/login1/session/self
  └─/org/freedesktop/login1/user
    ├─/org/freedesktop/login1/user/_1000
    └─/org/freedesktop/login1/user/self

Pretty, isn't it? What's even nicer, and what the output does not show, is that there's full command line completion available: as you press TAB the shell will auto-complete the service names for you. It's a real pleasure to explore your D-Bus objects that way!

The output shows some objects that you might recognize from the explanations above. Now, let's go further. Let's see what interfaces, methods, signals and properties one of these objects actually exposes:

$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
org.freedesktop.login1.Session      interface -         -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Leader                             property  u         762                                      const
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.Service                            property  s         "gdm-autologin"                          const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, hence both the service name and the object path used are easily put together on the shell simply by pressing TAB. The output shows the methods, properties, signals of one of the session objects that are currently made available by systemd-logind. There's a section for each interface the object knows. The second column tells you what kind of member is shown in the line. The third column shows the signature of the member. For method calls that's the input parameters, and the fourth column shows what is returned. For properties, the fourth column shows their current value.

So far, we just explored. Let's take the next step now: let's become active - let's call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don't think I need to mention this anymore, but anyway: again there's full command line completion available. The third argument is the interface name, the fourth the method name; both can be easily completed by pressing TAB. In this case we picked the Lock method, which activates the screen lock for the specific session. And yup, the instant I pressed enter on this line my screen lock turned on (this only works on DEs that correctly hook into systemd-logind. GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no parameters and returns none. Of course, it can get more complicated for some calls. Here's another example, this time using one of systemd's own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the signature string that follows the method name (as usual, command line completion helps you get this right). Following the signature the next two parameters are simply the two strings to pass. The specified signature string hence indicates what comes next. systemd's StartUnit method call takes the unit name to start as first parameter, and the mode in which to start it as second. The call returned a single object path value. It is encoded the same way as the input parameters: a signature (just o for the object path) followed by the actual value.
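If you wonder how a signature string "indicates what comes next", here's a rough sketch of a signature walker in plain C. It is my own illustration, not code from busctl or sd-bus, and the function names skip_complete_type() and count_arguments() are made up. It assumes the usual D-Bus type codes: the basic types ybnqiuxtdsogh, the variant v, arrays introduced by a, structs in parentheses and dictionary entries in curly braces inside arrays. It counts how many top-level arguments a signature describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: skip one complete type starting at p.
 * Returns a pointer just past it, or NULL if the signature is malformed. */
static const char *skip_complete_type(const char *p) {
        const char *q;

        if (*p == '\0')
                return NULL;

        /* Basic types and the variant consist of a single character */
        if (strchr("ybnqiuxtdsogvh", *p))
                return p + 1;

        if (*p == 'a') {
                if (p[1] == '{') {
                        /* Dictionary entry: a basic key type, then one value type */
                        q = p + 2;
                        if (*q == '\0' || !strchr("ybnqiuxtdsogh", *q))
                                return NULL;
                        q = skip_complete_type(q + 1);
                        if (!q || *q != '}')
                                return NULL;
                        return q + 1;
                }

                /* Plain array: 'a' followed by exactly one complete type */
                return skip_complete_type(p + 1);
        }

        if (*p == '(') {
                q = p + 1;
                if (*q == ')')
                        return NULL; /* empty structs are not allowed */
                while (*q != ')') {
                        q = skip_complete_type(q);
                        if (!q)
                                return NULL;
                }
                return q + 1;
        }

        return NULL;
}

/* Count the top-level arguments in a signature, or return -1 if malformed */
static int count_arguments(const char *signature) {
        int n = 0;

        assert(signature != NULL);

        while (*signature != '\0') {
                signature = skip_complete_type(signature);
                if (!signature)
                        return -1;
                n++;
        }

        return n;
}
```

With this, "ss" counts as two arguments (the two strings of StartUnit), "o" as one, and "sa{sv}as" from the PropertiesChanged signal as three: a string, a dictionary and an array of strings.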

Of course, some method call parameters can get a ton more complex, but with busctl it's relatively easy to encode them all. See the man page for details.

busctl knows a number of other operations. For example, you can use it to monitor D-Bus traffic as it happens (including generating a .cap file for use with Wireshark!) or you can set or get specific properties. However, this blog story was supposed to be about sd-bus, not busctl, hence let's cut this short here, and let me direct you to the man page in case you want to know more about the tool.

busctl (like the rest of systemd) is implemented using the sd-bus API. Thus it exposes many of the features of sd-bus itself. For example, you can use it to connect to remote or container buses. It understands both kdbus and classic D-Bus, and more!

sd-bus

But enough! Let's get back on topic, let's talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file sd-bus.h.

Here's a random selection of features of the library, that make it compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients in services. Currently 34 individual fields are supported, from the PID of the client to the cgroup or capability sets.

  • Support for tracking the life-cycle of peers in order to release local objects automatically when all peers referencing them disconnected.

  • The client builds an efficient decision tree to determine which handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and back (this is lossy though), to ensure best integration of D-Bus into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on the bus. Automatically generates introspection as necessary.
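To illustrate what "lossy" means for the error translation mentioned in the list above, here's a toy sketch in plain C. It is not sd-bus's actual mapping table or API: the table, the function names and the choice of EIO as the fallback are my assumptions for the sake of the example. The D-Bus error names themselves are standard ones.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy table for illustration only, NOT the table sd-bus uses internally */
struct toy_error_map {
        const char *name;
        int code;
};

static const struct toy_error_map toy_map[] = {
        { "org.freedesktop.DBus.Error.NoMemory",     ENOMEM },
        { "org.freedesktop.DBus.Error.AccessDenied", EACCES },
        { "org.freedesktop.DBus.Error.InvalidArgs",  EINVAL },
        { "org.freedesktop.DBus.Error.FileNotFound", ENOENT },
};

/* D-Bus error name to errno; unknown names collapse into EIO (assumption) */
static int toy_error_name_to_errno(const char *name) {
        size_t i;

        assert(name != NULL);

        for (i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
                if (strcmp(toy_map[i].name, name) == 0)
                        return toy_map[i].code;

        return EIO;
}

/* errno back to a D-Bus error name; unknown codes become the generic failure */
static const char *toy_errno_to_error_name(int code) {
        size_t i;

        for (i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
                if (toy_map[i].code == code)
                        return toy_map[i].name;

        return "org.freedesktop.DBus.Error.Failed";
}
```

A well-known error such as AccessDenied survives the round trip, but a custom error name like "net.poettering.DivisionByZero" turns into EIO on the way in and comes back out as the generic Failed error. That loss of detail is the price for fitting rich D-Bus errors into plain UNIX error codes.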

The API is currently not fully documented, but we are working on completing the set of manual pages. For details see all pages starting with sd_bus_.

Invoking a Method, from C, with sd-bus

So much about the library in general. Here's an example for connecting to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the response message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.\n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to run it as root though, since access to the StartUnit method is privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that's it already, our first example. It showed how we invoked a method call on the bus. The actual function call of the method is very close to the busctl command line we used before. I hope the code excerpt needs little further explanation. It's supposed to give you a taste of how to write D-Bus clients with sd-bus. For more information please have a look at the header file, the man page or even the sd-bus sources.

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic example. Let's have a look at how to write a bus service. We'll write a small calculator service that exposes a single object, which implements an interface that exposes two methods: one to multiply two 64bit signed integers, and one to divide one 64bit signed integer by another.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to user bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to install object: %s\n", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right-away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let's run it:

$ ./bus-service

In another terminal, let's try to talk to it. Note that this service is now on the user bus, not on the system bus as before. We do this for simplicity: on the system bus access to services is tightly controlled, so unprivileged clients cannot request privileged operations. On the user bus, however, things are simpler: as only processes of the user owning the bus can connect, no further policy enforcement will complicate this example. Because the service is on the user bus, we have to pass the --user switch on the busctl command line. Let's start by looking at the service's object tree.

$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there's only a single object on the service, which is not surprising, given that our code above only registered one. Let's see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces, as mentioned above. But the first interface we see is actually the one we added! It shows our two methods, and both take "xx" (two 64bit signed integers) as input parameters, and return one "x". Great! But does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually multiplied them for us and returned a single integer 35! Let's try the other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let's trick it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this and returned a clean error about it. If you look at the source code example above you'll see precisely how we generated the error.

And that's really all I have for today. Of course, the examples I showed are short, and I don't get into detail here on what precisely each line does. However, this is supposed to be a short introduction into D-Bus and sd-bus, and it's already way too long for that …

I hope this blog story was useful to you. If you are interested in using sd-bus for your own programs, I hope this gets you started. If you have further questions, check the (incomplete) man pages, and ask us on IRC or the systemd mailing list. If you need more examples, have a look at the systemd source tree; all of systemd's many bus services use sd-bus extensively.

June 17, 2015

Testing rawhide apps using xdg-app

An important aspect of xdg-app is application sandboxing, which will require application changes to use sandbox-specific APIs. However, xdg-app is also a good way to deploy and run non-sandboxed (or partially sandboxed) regular applications.

A very interesting use case for this is to have an image-based operating system, for instance a Workstation spin of Fedora Atomic. Such a system would have a basic workstation installation with a read-only /usr, and atomic updates/rollback. However, installing an application is painful, and customizing your install in that way undoes many of the advantages of an image-based OS.

With xdg-app you can install apps into /var (or $HOME) and have them fully integrate with the system, while still being isolated from changes to the host. This makes for a great combination, just like atomic + docker is a good combination for the server space.

I’ve spent some time recently making a prototype runtime based on the Fedora packages, as reported on the desktop list. This is kind of interesting as it lets you test applications from rawhide on Fedora 21 or 22. Just install xdg-app from fedora-updates and then install the runtime:

$ xdg-app add-remote --no-gpg-verify --user fedora http://fedorapeople.org/~alexl/repo/
$ xdg-app install-runtime --user fedora org.fedoraproject.Platform 23

And then you can try gedit 3.17.0:

$ xdg-app install-app --user fedora org.gnome.gedit
$ xdg-app run org.gnome.gedit

Or evince 3.17.2:

$ xdg-app install-app --user fedora org.gnome.evince
$ xdg-app run org.gnome.evince

Once installed you can also just start them from the desktop environment as usual.  They should be there like any regular application as the desktop files and icons are exported to the host.

June 11, 2015

Rich and comments

Rich Jones posted an article about being banned by Boing-Boing, supposedly for bringing attention to their use of affiliate links (the practice that Gamergate groups criticized as well — and scored a regulatory win against). Meanwhile, all my comments at Rich's blog are blackholed, which is quite ironic. Generally, I am not into this "blog comment" thing. Ani-nouto never had any comments and is doing great that way. But some people like comments, so I leave them as necessary.

June 06, 2015

What tools are changing our world next?

Quick brain dump after a bike ride home: free software took a huge leap in the late 90s and early 00s in large part because of non-ideological advantages that the rest of the world is now competing with or surpassing:

HDR automatically created by Google Photos from my old pictures of Muir Woods. Not perfect, but better than I ever bothered to do!
  • Collaboration tools: Because we got to the ‘net first, our tools for collaborating with each other were simply better than what proprietary developers were doing: cvs, mailman, wiki, etc., were all better than the silo’d old-school tools. Modern best-of-breed collaboration tools have all learned from what we did and added proprietary sauce on top: github, slack, Google Docs, etc. So our tools are now (at best) as productive as their proprietary counterparts, and sometimes less productive but ideologically agreeable.
  • Release processes: “Release early/release often” made us better partners for our users. We’re now actively behind here: compare how often a mobile app or web user gets updates, exactly as the author intended, relative to a user of a modern Linux distro.
  • Zero cost: We did things for no (direct) cost by subsidizing our work through college, startups, or consulting gigs; now everyone has a subsidize-by-selling-something-else model (usually advertising, though sometimes freemium). Again, advantage (mostly?) lost.
  • Knowing our users: We knew a lot about our users, because we were our biggest users, and we talked to other users a lot; this was more effective than what passed for software design in the late 90s. This has been eclipsed by extensive a/b testing throughout the industry, and (to a lesser extent) by more extensive usage of direct user testing and design-thinking.

None of these are terribly original observations – all of these have been remarked on before. But after playing some with Google Photos this weekend, I’m ready to add another one to the list:

Worth asking what your project is doing that could be radically changed if your competitors get access to new technology. For example, for Wikipedia:

  • Collaborating: Wiki was best-of-breed (or close); it isn’t anymore. Visual Editor helps get editing back to par, but the social aspect of collaboration is still lacking relative to the expectations of many users.
  • Knowledge creation: big groups of humans, working together wiki-style, is the state of the art for creating useful, non-BS knowledge at scale. With the aforementioned machine learning, I suspect this will no longer be the case in a (growing) number of domains.

I’m sure there are others…

May 29, 2015

Cool hardware in Vancouver

There wasn't much, but more than in Atlanta. The most "pro" looking kit was presented by NEC: basically a bladeserver, but the "blades" are SBCs, each of them accompanied by a dedicated drive card. I can see downsides of this design, but very cute.

Unfortunately, they only offer CPU cards based on Atom. No ARM or anything.

The only other interesting booth belonged to StackVelocity, a subsidiary of JB Circuits that does custom design.

I'm sorry to say, their wares looked decidedly pedestrian, which is to be expected: their sales point is low cost, and stuff of that nature underpins the modern datacenter. One curious thing, however, is the variety of flash cards they offer. Basically Fusion-IO on a budget. One was particularly tricky by having 2 layers. At first I even thought it could have flash chips mounted sideways, but nope, the science of low-cost computing is not there yet.

P.S. NEC also sell the same chassis with CPU cards instead of drive cards under the index "DX1000".

Semi-hard numbers from Rackspace

Previously in hard numbers: China, Wikimedia, Amazon S3. Rackspace previously reported in creiht's preso 18 months ago. This time, scotty went public at the Vancouver (Liberty) summit with the following:

> 50 billion objects
> 100 PB data (sanitized number, but way higher than 85 PB)
= 6 global clusters
3:1 PUT:GET ratio
10k+ requests/second

The number of objects is roughly 40 times less than in Amazon S3.

May 19, 2015

GDB Preattach

In Firefox development, it’s normal to do most development tasks via the mach command. Build? Use mach. Update UUIDs? Use mach. Run tests? Use mach. Debug tests? Yes, mach mochitest --debugger gdb.

Now, normally I run gdb inside emacs, of course. But this is hard to do when I’m also using mach to set up the environment and invoke gdb.

This is really an Emacs bug. GUD, the Emacs interface to all kinds of debuggers, is written as its own mode, but there’s no really great reason for this. It would be way cooler to have an adaptive shell mode, where running the debugger in the shell would magically change the shell-ish buffer into a gud-ish buffer. And somebody — probably you! — should work on this.

But anyway this is hard and I am lazy. Well, sort of lazy and when I’m not lazy, also unfocused, since I came up with three other approaches to the basic problem. Trying stuff out and all. And these are even the principled ways, not crazy stuff like screenify.

Oh right, the basic problem.  The basic problem with running gdb from mach is that then you’re just stuck in the terminal. And unless you dig the TUI, which I don’t, terminal gdb is not that great to use.

One of the ideas, in fact the one this post is about, since this post isn’t about the one that I couldn’t get to work, or the one that is also pretty cool but that I’m not ready to talk about, was: hey, can’t I just attach gdb to the test firefox? Well, no, of course not, the test program runs too fast (sometimes) and racing to attach is no fun. What would be great is to be able to pre-attach — tell gdb to attach to the next instance of a given program.

This requires kernel support. Once upon a time there were some gdb and kernel patches (search for “global breakpoints”) to do this, but they were never merged. Though hmm! I can do some fun kernel stuff with SystemTap…

Specifically what I did was write a small SystemTap script to look for a specific exec, then deliver a SIGSTOP to the process. Then the script prints the PID of the process. On the gdb side, there’s a new command written in Python that invokes the SystemTap script, reads the PID, and invokes attach. It’s a bit hacky and a bit weird to use (the SIGSTOP appears in gdb to have been delivered multiple times or something like that). But it works!

It would be better to have this functionality directly in the kernel. Somebody — probably you! — should write this. But meanwhile my hack is available, along with a few other gdb scripts, in my gdb helpers github repository.

May 06, 2015

How Mitchell Baker made me divorce

Well, nearly did. Deleting history in Firefox 37 is very slow and the UI locks up while you do that. "Very slow" means an operation that takes 13 minutes (not exaggerating - it's reproducible). The UI lock-up means a non-dismissable context menu floating over everything; Firefox itself being, of course, entirely unresponsive. See the screencap.

The screencap is from Linux where I confirmed the problem, but the story started on Windows, where my wife tried to tidy up a bit. So, when Firefox locked up, she killed it, and repeated the process a few times. And what else would you do? We are not talking about hanging up for seconds - it literally was many minutes. Firefox did not pop a dialog with "Please wait, deleting 108,534 objects with separate SQLite transactions", a progress gauge, and a "Cancel" button. Instead, it pretended to lock up.

Interestingly enough, remember when Firefox had a default to keep the history for a week? This mode is gone now - FF keeps the history potentially forever. Instead, it offers a technical limit: 108,534 entries are saved in the "Places" database at the most, in order to prevent SQLite from eating all your storage. Now I understand why my brown "visited" links never go back to blue anymore.

The problem is, there's no alternative. I tried to use Midori as my main browser for a month or two in early 2014, but it was a horrible crash city. I had no choice but to give up and go back to Firefox and its case of Featuritis Obesum.

Come work with me – developer edition!

It has been a long time since I was able to say to developer friends “come work with me” in anything but the most abstract “come work under the same roof” kind of sense. But today I can say to developers “come work with me” and really mean it. Which is fun :)

By Supercarwaar (Own work) [CC BY-SA 3.0], via Wikimedia Commons
Details: Wikimedia’s new community tech team is hiring for a community tech developer and a team lead. This will be extremely community-intensive work, so if you enjoy and get energy from working with a community and helping them achieve their goals, this could be a great role for you. This team will work intensely with my department to ensure that we’re correctly identifying and prioritizing the needs of our most active editors. If that sounds like fun, get in touch :)

[And I realize that I’ve been bad and not posted here, so here’s my new job announce: “my department” is the Foundation’s new Community Engagement department, where we work to support healthy contributor communities and help WMF-community collaboration. It is a detour from law, but I’ve always said law was just a way to help people do their thing — so in that sense is the same thing I’ve always been doing. It has been an intense roller coaster of a first two months, and I look forward to much more of the same.]

May 05, 2015

Thoughts on a feedback loop for Trinity.

With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.

It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy. We would need to store the parameters from the last run, and extrapolate out a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.

The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested”. I’ve done some experiments with code coverage in the past. Explorations of the GCOV feature in the kernel didn’t get very far, however, for a few reasons (primarily that it really slowed things down too much, and also I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time investment for starting large new projects was limited).

After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and trying to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this. Basically counting the number of instructions executed between system call enter/exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point. I compiled it and ran it a few times.

$ ./a.out 
Measuring instruction count for this printf
Used 3212 instructions
$ ./a.out 
Measuring instruction count for this printf
Used 3214 instructions

Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace however. What happens if we tell it to measure the kernel, and measure say.. getpid().

$ ./a.out 
Used 9283 instructions
$ ./a.out 
Used 9367 instructions

Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince, and asked if he had insight as to why the counters weren’t deterministic across different runs. He had actually written a paper on the subject. Turns out we’re also getting event counts here for page faults, hardware interrupts, timers, etc.
x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.

That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.

update:
It occurred to me after posting this that measuring instructions isn’t going to work regardless of the amount of precision the counters offer. Consider a syscall that operates on vma’s for example. Over the lifetime of a process, the number of executed instructions of a call to such a syscall will vary even with the same input parameters, as the lengths of various linked lists that have to be walked will change. Number of instructions, or number of branches taken/untaken etc just isn’t a good match for this idea. Approximating “have we been here before” isn’t really achievable with this approach afaics, so I’m starting to think something like the initial gcov idea is the only way this could be done.

Thoughts on a feedback loop for Trinity. is a post from: codemonkey.org.uk

Reach the Top With NetworkManager 1.0.2

Summit – Asbjørn Floden (CC BY-NC 2.0)

Just this morning Lubomir released NetworkManager 1.0.2, the latest of the 1.0 stable series. It’s a great cleanup and bugfix release with contributions from lots of community members in many different areas of the project!

Some highlights of new functionality and fixes:

  • Wi-Fi device band capability indications, requested by the GNOME Shell team
  • Devices set to ignore carrier that use DHCP configurations will now wait a period of time for the carrier to appear, instead of failing immediately
  • Startup optimizations allow networking-dependent services to be started much earlier by systemd
  • Memory usage reductions through many memory leak fixes and optimizations
  • teamd interface management is now more robust and teamd is respawned when it terminates
  • dnsmasq is now respawned when it terminates in the local caching nameserver configuration
  • Fixes for an IPv6 DoS issue CVE-2015-2924, similar to one fixed recently in the kernel
  • IPv6 Dynamic DNS updates sent through DHCP now work more reliably (and require a fully qualified name, per the RFCs)
  • An IPv6 router solicitation loop due to a non-responsive IPv6 router has been fixed

While the list of generally interesting enhancements may be short, it masks 373 git commits and over 50 bugzilla issues fixed.  It’s a great release and we recommend that everyone upgrade.

Next up is NetworkManager 1.2, with DNS improvements, Wi-Fi scanning and AP list fixes for mobile uses, NM-in-containers improvements (no udev required!), even less dependence on the obsolete dbus-glib, less logging noise, device management fixes, continuing removal of external dependencies (like avahi-autoipd), configuration reload-ability, and much more!

May 04, 2015

kernel code coverage brain dump.

Someone at work recently asked me about code coverage tooling for the kernel. I played with this a little last year. At the time I was trying to figure out just how much of certain syscalls trinity was exercising. I ended up being a little disappointed at the level of post-processing tools to deal with the information presented, and added some things to my TODO list to find some time to hack up something, which quickly bubbled its way to the bottom.

As I did a write-up based on past experiences with this stuff, I figured I’d share.

gcov/gprof
requires kernel built with
CONFIG_GCOV_KERNEL=y
CONFIG_GCOV_PROFILE_ALL=y
CONFIG_GCOV_FORMAT_AUTODETECT=y
Note: Setting GCOV_PROFILE_ALL incurs some performance penalty, so any resulting kernel built with this option should _never_ be used for any kind of performance tests.
I can’t stress this enough: it’s miserably slow. Disk operations that took minutes for me now took hours. As an example:

Before:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.409712 s, 1.3 GB/s
0.00user 0.40system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 2980maxresident)k
136inputs+1024000outputs (1major+340minor)pagefaults 0swaps

After:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.17212 s, 84.9 MB/s
0.00user 7.17system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2940maxresident)k
0inputs+1024000outputs (0major+338minor)pagefaults 0swaps

From 0.41 seconds to over 7 seconds. Ugh.

If we *didn’t* set GCOV_PROFILE_ALL, we’d have to recompile just the files we cared about with the relevant gcc profiling switches. It’s kind of a pain.

For all this to work, gcov expects to see a source tree, with:

  • .o objects
  • source files
  • .gcno files (these are generated during the kernel build)
  • .gcda files containing the runtime counters. These come from debugfs (/sys/kernel/debug) on the running kernel.

After booting the kernel, a subtree appears in debugfs at /sys/kernel/debug/gcov/
These directories mirror the kernel source tree, but instead of source files, they contain files that can be fed to the gcov tool. There will be a .gcda file, and a .gcno symlink back to the source tree (with complete path). For example, the mm directory inside that subtree contains (among others..)

-rw------- 1 root root 0 Mar 24 11:46 readahead.gcda
lrwxrwxrwx 1 root root 0 Mar 24 11:46 readahead.gcno -> /home/davej/build/linux-dj/mm/readahead.gcno

It is likely the symlink will be broken on the test machine, because the path doesn’t exist there, unless you, for example, NFS-mount the source tree that the kernel was built from.

I hacked up the script below, which may or may not be useful for anyone else (honestly, it’s way easier to just use nfs).
Run it from within a kernel source tree, and it will populate the source tree with the relevant gcda files, and generate the .gcov output file.

  
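#!/bin/sh
# gen-gcov-data.sh: copy the runtime counters for one source file out of
# debugfs and run gcov on it. Run from the top of the kernel source tree.
# Usage: gen-gcov-data.sh path/to/file.c

# Skip source files that were never compiled into this kernel.
obj=$(echo "$1" | sed 's/\.c$/.o/')
if [ ! -f "$obj" ]; then
  exit
fi

pwd=$(pwd)
dirname=$(dirname "$1")
gcovfn=$(basename "$1" | sed 's/\.c$/.gcda/')
gcda="/sys/kernel/debug/gcov$pwd/$dirname/$gcovfn"

if [ -f "$gcda" ]; then
  # Put the .gcda next to the .gcno that the build left in the tree.
  cp "$gcda" "$dirname"
  gcov -f -r -o "$1" "$obj"

  # gcov writes its report into the current directory; move it next to the source.
  if [ -f "$(basename "$1").gcov" ]; then
    mv "$(basename "$1").gcov" "$dirname"
  fi
else
  echo "no gcov data for $gcda"
fi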

Take that script, and run it like so..

$ cd kernel-source-tree
$ find . -type f -name "*.c" -exec gen-gcov-data.sh "{}" \;

Running, for example, gen-gcov-data.sh mm/mmap.c will cause gcov to spit out a mmap.c.gcov file in the current directory (the script then moves it into mm/) that has coverage information that looks like..

 
   135684:  269:static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
        -:  270:{
   135684:  271:        struct vm_area_struct *next = vma->vm_next;
        -:  272:
   135684:  273:        might_sleep();
   135686:  274:        if (vma->vm_ops && vma->vm_ops->close)
     5080:  275:                vma->vm_ops->close(vma);
   135686:  276:        if (vma->vm_file)
    90302:  277:                fput(vma->vm_file);
        -:  278:        mpol_put(vma_policy(vma));
   135686:  279:        kmem_cache_free(vm_area_cachep, vma);
   135686:  280:        return next;
        -:  281:}

The numbers on the left being the number of times that line of code was executed.
Lines beginning with ‘-’ have no coverage information for whatever reason.
If a line contains code that was never executed (a branch that is not taken, for example), it gets prefixed with ‘#####’, like so..

 
  4815374:  391:                if (vma->vm_start < pend) {
    #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
        -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }

There are some cases that need a little more digging to explain. eg:

    88105:  237:static void __remove_shared_vm_struct(struct vm_area_struct *vma,
        -:  238:                struct file *file, struct address_space *mapping)
        -:  239:{
    88105:  240:        if (vma->vm_flags & VM_DENYWRITE)
    15108:  241:                atomic_inc(&file_inode(file)->i_writecount);
    88105:  242:        if (vma->vm_flags & VM_SHARED)
        -:  243:                mapping_unmap_writable(mapping);
        -:  244:
        -:  245:        flush_dcache_mmap_lock(mapping);
    88105:  246:        vma_interval_tree_remove(vma, &mapping->i_mmap);
        -:  247:        flush_dcache_mmap_unlock(mapping);
    88104:  248:}

In this example, lines 245 & 247 have no hitcount, even though there’s no way they could have been skipped.
If we look at the definition of flush_dcache_mmap_(un)lock, we see..
#define flush_dcache_mmap_lock(mapping) do { } while (0)
So the compiler never emitted any code, and hence, it gets treated the same way as the blank lines.

There is a /sys/kernel/debug/gcov/reset file that can be written to to reset the counters before each test if desired.

Additional thoughts

  • Not sure how inlining affects things.
  • There needs to be some element of post-processing, to work out percentages of code coverage etc, which may involve things like stripping out comments/preprocessor defines.
  • debug kernels differ in functionality in various low level features. For example LOCKDEP will fundamentally change the way spinlocks work. For coverage purposes though, we can choose to not care and stop drilling down at certain levels.
  • Whatever does the post-processing of results may need to aggregate results from multiple test machines. Think of the situation where we’re running a client/server test: Both machines will be running different code paths.
  • ggcov has some interesting looking tooling for visually displaying results.

kernel code coverage brain dump. is a post from: codemonkey.org.uk

May 01, 2015

Trinity socket improvements

I’ve been wanting to get back to working on the networking related code in trinity for a long time. I recently carved out some time in the evenings to make a start on some of the lower hanging fruit.

Something that has bugged me for a while is that we create a bunch of sockets on startup, and then when we call, for example, setsockopt() on one of them, the socket options we pass are more likely than not to be for a different protocol than the one the socket was created for. This isn’t always a bad thing; for example, one of the oldest kernel bugs trinity found was triggered by setting TCP options on a non-TCP socket. But doing this the majority of the time is wasteful, as we’ll just get -EINVAL most of the time.

We actually have the necessary information in trinity to know what kind of socket we were dealing with in a socketinfo struct.

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

struct socketinfo {
        struct socket_triplet triplet;
        int fd; 
};

We just had it at the wrong level of abstraction. setsockopt only ever saw a file descriptor. We could have searched through the fd arrays looking for the socketinfo that matched, but that seems like a lame solution. So I changed the various networking syscalls to take an ARG_SOCKETINFO instead of an ARG_FD. As a side-effect, we actually pass sockets to those syscalls more often than, say, a perf fd, or an epoll fd, or ..

There is still a small chance we pass some crazy fd, just to cover the crazy cases, though those cases don’t tend to trip things up much any more.

After passing down the triplet, it was a simple case of annotating the structures containing the various setsockopt function pointers to indicate which family they belonged to. AF_INET was the only complication, which needed special casing due to the multiple protocols for which we have setsockopt() functions. Creation of a second table, using the protocol instead of the family was enough for the matching code.

There are still a ton of improvements I want to make to this code, but it’s going to take a while, so it’s good when some mostly trivial changes like the above come together quickly.

Trinity socket improvements is a post from: codemonkey.org.uk

April 14, 2015

the more things change.. 4.0


$ ping gelk
PING gelk.kernelslacker.org (192.168.42.30) 56(84) bytes of data.
WARNING: kernel is not very fresh, upgrade is recommended.
...
$ uname -r
4.0.0

Remember that one time the kernel versioning changed and nothing in userspace broke? Me either.

Why people insist on trying to think they can get this stuff right is beyond me.

YOU’RE PING. WHY DO YOU EVEN CARE WHAT KERNEL VERSION IS RUNNING.

update: this was already fixed, almost exactly a year ago in the ping git tree. The (now removed) commentary kind of explains why they cared. Sigh.

the more things change.. 4.0 is a post from: codemonkey.org.uk

March 31, 2015

Official GNOME SDK runtime builds are out

As people who have followed the work on sandboxed applications know, we have promised a developer preview for GNOME 3.16. Well, 3.16 has now been released, so the time is now!

I spent last week setting up a build system on the GNOME infrastructure, and the output of this is finally available at:

http://sdk.gnome.org/repo/

This repository contains the GNOME 3.16 runtime, org.gnome.Platform, as well as a smaller one that is useful for less integrated apps (like games) called org.freedesktop.Platform. It also has corresponding development runtimes (org.gnome.Sdk and org.freedesktop.Sdk) that you can use to create applications for the platforms.

This is a developer preview, so consider these builds weakly supported. This means I will try to keep them somewhat updated if there are major issues and that I will keep them API and ABI stable. I will probably also pick up at least some 3.16.x minor releases as they are released.

I also did the first official release of xdg-app. For easy testing this is available for Fedora 21 and 22 as a copr repo.

Testing the SDK

Using the repo above makes it really easy to test this. Just install the xdg-app package from copr, log out+in (needed to update the environment for the session), then follow these instructions (as a regular user):

  1. Install the GNOME SDK public key into /usr/share/ostree/trusted.gpg.d (or alternatively, use --no-gpg-verify when you add the remote below).
  2. Install the basic Gnome and freedesktop runtimes:
    $ xdg-app add-remote --user gnome-sdk http://sdk.gnome.org/repo/
    $ xdg-app install-runtime --user gnome-sdk org.gnome.Platform 3.16
    $ xdg-app install-runtime --user gnome-sdk org.freedesktop.Platform 1.0
  3. Optionally install some locale packs:
    $ xdg-app install-runtime --user gnome-sdk org.gnome.Platform.Locale.se 3.16
    $ xdg-app install-runtime --user gnome-sdk org.freedesktop.Platform.Locale.se 1.0
  4. Install some apps from my repository of test apps:
    $ xdg-app add-remote --user --no-gpg-verify test-apps https://people.gnome.org/~alexl/test-apps/repo/
    $ xdg-app install-app --user test-apps org.gnome.gedit
    $ xdg-app install-app --user test-apps org.freedesktop.glxgears
  5. Run the apps! You should find gedit listed among the regular applications in the shell as it exports a desktop file. But you can also run them manually like this:
    $ xdg-app run org.gnome.gedit
    $ xdg-app run org.freedesktop.glxgears
  6. I also packaged the latest gnome builder from git. It requires the full sdk which takes a bit longer to download:
    $ xdg-app install-runtime --user gnome-sdk org.gnome.Sdk 3.16
    $ xdg-app install-app --user test-apps org.gnome.Builder

All the above install the apps into your home directory (in ~/.local/share/xdg-app). You can also run the commands as root and skip the --user arguments to do system-wide application installs.

Future work

With the basics now laid down to run current applications in a minimally isolated environment, the next step is to work on the sandboxing aspects more. This will require lots of work, both on the system side (things like kdbus), in the desktop (add sandbox aware APIs, make pulseaudio protect clients from each other, etc) and in modifying applications.

If you’re interested in this, you can follow the work on the wiki.

Building your own apps

If you download the SDKs you have enough tooling to build your own applications. There is some documentation on how to do this here.

I also created a git repository with the scripts I used to build the test applications above. It uses the gnome-sdk-bundles repository which has some tooling and specfiles to easily bundle dependencies with the application.

Building the SDK

If you ever want to build the SDK yourself, it is available at:

https://git.gnome.org/browse/gnome-sdk-images

This repository contains the desktop specific parts of the SDK, which is layered on a core Yocto layer. When you build the SDK this will be automatically checked out and built from:

https://git.gnome.org/browse/freedesktop-sdk-base

However, if you don’t want to build all of this you can download the pre-built images from http://sdk.gnome.org/images/x86_64/ and put them in the freedesktop-sdk-base/images/x86_64 subdirectory of gnome-sdk-images. This can save you a lot of time and space.

March 22, 2015

Fedora at Midwest Rep Rap Fest 2015

I attended Midwest Rep Rap Fest 2015 this weekend, in Goshen, Indiana. Goshen is about 45 minutes outside of South Bend (the nearest regional airport). This part of Indiana is noteworthy for a few reasons, including the fact that Matthew Miller, the Fedora Project Leader, is from there. It also has a very large Amish population, which makes it one of the few places I've attended a conference where most of the local businesses have a place to tie up your horses. The Midwest Rep Rap Fest is an event dedicated to Open Source 3d printers (and their surrounding ecosystem). The primary sponsor of the event is SeeMeCNC, a local vendor that makes open source hardware delta 3d printers. A Delta printer is a 3d printer with a circular stationary bed. Attached to the bed are three vertical rods which serve as tracks for three geared motors. The motors move up and down the rods, and are connected to a central extruder which hangs down the center. The extruder is moved in three dimensions by moving the supports along their tracks. Watching a Delta 3d printer do its thing is pretty amazing; it seems to dance like a trapeze artist as it dips and swoops to print the object.

The Delta type of 3d printer was the most common printer at the event, many people had either bought SeeMeCNC printers or had built their own off their open source design. The SeeMeCNC team brought their super-sized Delta, which they think is the largest Delta printer in the world. It was easily 30 feet tall and barely fit in the building we were using (which is saying something, because we were in an exhibition hall at the local state fairgrounds). The owner of the company decided to see how big of a Delta printer he could build, and this was the result!



The printer used a shop vac to blow plastic pellets up a plastic hose into the giant heated end. Originally, they were trying to print a giant model of Groot (shown in progress in my picture above), but they had to leave it running overnight on Friday and when we came back Saturday morning, the print had failed because it had run out of plastic pellets! Later on, they printed a very large basket/vase with it (after fixing it so that it wouldn't run out of plastic).

Fedora had a table in the main room. I brought two open source 3d printers from Lulzbot and controlled them both from my laptop running Fedora 21. My larger printer, the Taz 4, was configured with a dual extruder addon, and I spent four hours on Friday calibrating it to print properly. On Saturday morning, I printed my first completely successful dual color print, a red and white tree frog!



The eyes didn't come out perfect, but it all came out aligned and in one piece. Several people offered me tips and advice on how to improve the print quality with the dual-extruder setup. One of the nice things about the Rep Rap fest was the extremely friendly nature of the community. Everyone was eager to help everyone else solve problems or improve their printers/prints. I used Pronterface to control the Taz 4, since it was better suited to handle the dual extruder controls.

My smaller printer, the Lulzbot Mini, was controlled with Cura-Lulzbot (a package which got added to Fedora a few days before the show!). Cura has a very fast and high quality slicer, but with fewer options for tweaking it than slic3r (the traditional open source slicing tool) has. 3d printers depend on a slicing tool to take a 3d model and convert it into the GCode machine instructions that tell the printer where to move and when to extrude plastic. Cura also has a more polished UI than Pronterface.

The Lulzbot Mini is able to self level, self clean, and self calibrate, which almost eliminates the prep time before a print! One of the vendors at the show was Taulman, who is constantly innovating new filaments for 3d printing. They announced a new filament the weekend of the Rep Rap Fest, 910, and they gave me a sample to try out on the Mini. The Mini can print filaments with a melting point of 300 degrees Celsius or less, so it was well suited for the 910. 910 was interesting because it was incredibly strong, almost as good as polycarbonate! It was also translucent, which made it ideal for me to finish a project I've been working on for a long time: my 3d printed TARDIS model!



I printed four window panels and a topper piece for the lantern on the roof. A few other people had TARDIS models (including one that had storage drawers inside it), but mine was the biggest (and I think, the nicest).

One of Fedora's neighbors was mUVe, an open source SLA 3d printer. SLA 3d printers use a liquid resin and a DLP projector to make incredibly accurate 3d models that would be difficult or impossible to print on other kinds of 3d printers. It seemed like everyone was printing the same Groot model at the event, and they printed one that came out looking incredible. The inventor of the hardware was working their table, and we talked for a while about the importance of open source in hardware. He felt strongly that it was mandatory for him to release his work into open source so that other people could innovate and improve upon the designs he'd created. The mUVe printer was one of the largest SLA printers I've ever seen and the quality of its prints was amazing. The biggest downside is the complexity, it involves chemicals in the resin and in curing the prints once they have finished, but in my opinion, it was worth it. The cost was in the $1500-2000 price range, but he said he's working on something awesome that will bring that cost down. They used Creation Workshop to slice and control their printer, which was new to me, but it was also open source. It's C# though, but I want to see if I can get it working in Mono on Fedora. (They were also in the greater Detroit area, so I encouraged them to come out and demo it at Penguicon!)



Another neighbor had 3d printed an amazingly intricate "home clock". They had used a famous woodworking pattern, converted each of the pieces to a 3d model, then printed them. Each piece was then smoothed and attached together. The only piece they didn't print was the clock at the center! On the table, the top of the clock was taller than me (and I'm 6'4"). It didn't look 3d printed, it looked too nice! It took them 3 months to print it all. The owner said that if you're able to cut this model from wood and assemble it properly, you're considered to be a master in their community. Everyone was definitely in awe of it in this community.



It seemed like everyone showing off something at this event had a clever hack of their own. Some people were creating amazing models, some people had built new open source printers. One printer had color changing LED strips attached underneath it which changed from red to green to indicate the progress of the printing job. Another printer had a Raspberry Pi with camera wired into it so you had a "printer's eye view" as it printed. There was a custom 3d scanner designed to scan people's heads and torsos to make printable busts. There was even a printer that looked like some sort of industrial robot gone mad! The one thing these all had in common? They were open source. No one here was questioning open source, it was just the way they operated, sharing what they knew and building off each other's successes (and failures). There were a few MakerBot Replicators, but all of them had been hacked in some way.

Attendance at this year's event was both up and down. There were more people and companies exhibiting at the event, including Texas Instruments, Hackaday, Lulzbot, Taulman, and Printed Solid. Printed Solid was giving out free samples of some amazing ColorFabb filament. I came home with some BronzeFill (prints into a bronze like material that when polished is heavy and shiny), a new flexible filament, and some carbon-fiber infused filament! They also had some really fantastic glow in the dark filament, but no samples of that were available (and I didn't have the spare cash to buy a full spool). General attendance at the event was about 750 people, which was down from last year (around 1000). The general consensus was that the event wasn't doing all it could to advertise itself, and the location wasn't exactly optimal (45 minutes from the nearest regional airport, almost 2 hours from a major airport). The majority of visitors were local to the Indiana/Michigan area. The event staff said that next year they plan on rebranding the event to a more general FOSS 3d printing event (not limiting themselves to the Midwest region of the US). I think that is the right decision, since they are the only open source 3d printing event that I'm aware of, and I'd really love to see them grow into something bigger and more accessible.

Oh, did I mention we had a celebrity at the event? Ben Heck was there with his Delta printer! He's built a pinball machine. I might want to be him a little bit (but I'm not). He was very friendly and cool, spent a lot of time talking to the other makers and attendees.

Thanks to Ben Williams, Fedora had a very nice booth setup. We had our Fedora tablecloth and lots of stickers to give away. I brought a good sampling of models I'd printed with Fedora and my 3d printers, and I had a lot of good conversations about using Linux and open source to power 3d printing and 3d model creation. My coworker (and celebrity writer) Brian Proffitt stopped by on Saturday and helped out at the table for a while. I was supposed to have Fedora 21 media to hand out, but the promised shipment never arrived. The computers there were a mix of Windows and Linux, very few Macs in this community. Several people were using Fedora, but most of the Linux instances were Debian.



The Fedora event box needs a little love, there wasn't very much in it that was useful anymore. The OLPC in it is very old now, and since the current OLPC hardware runs Android these days, it isn't as "cool" as it used to be. I restocked it with Fedora bubble stickers, but it probably needs a plan to revitalize it.

All in all, it was a very fun weekend event and a great opportunity to connect with the open source 3d printer community. I think it is the responsibility of Fedora (and Red Hat) to reach out to the maker communities and help them be open source in their own ways, and this was an excellent opportunity to do exactly that. Is there a Maker event happening somewhere near you? You can sign up to represent Fedora at that event like I did at MRRF: Fedora Event Calendar

March 16, 2015

virgil3d local rendering test harness

So I've still been working on the virgil3d project along with part time help from Marc-Andre and Gerd at Red Hat, and we've been making steady progress. This post is about a test harness I just finished developing for adding and debugging GL features.

So one of the more annoying issues with working on virgil has been that while working on adding 3D renderer features or trying to track down a piglit failure, you generally have to run a full VM to do so. This adds a long round trip in your test/development cycle.

I'd always had the idea to do some sort of local system renderer, but there are some issues with calling GL from inside a GL driver. So my plan was to have a renderer process which loads the renderer library that qemu loads, and a mesa driver that hooks into the software rasterizer interfaces. So instead of running llvmpipe or softpipe I have a virpipe gallium wrapper, that wraps my virgl driver and the sw state tracker via a new vtest winsys layer for virgl.

So the virgl pipe driver sits on top of the new winsys layer, and the new winsys instead of using the Linux kernel DRM apis just passes the commands over a UNIX socket to a remote server process.

The remote server process then uses EGL and the renderer library, forks a new copy for each incoming connection and dies off when the rendering is done.

The final rendered result has to be read back over the socket, and then the sw winsys is used to putimage the rendering onto the screen.
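As a rough illustration of the shape of this design (not the actual vtest protocol), here is a minimal Python sketch: a client sends commands over a UNIX socket pair, a forked server process consumes them, and the result is read back over the same socket. The length-prefixed framing, the READBACK marker, and all function names are assumptions made up for the example.

```python
import os
import socket
import struct

# Hypothetical framing: 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def send_msg(sock, payload):
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the socket")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def render_server(conn):
    """Stand-in for the renderer process: collect commands, return 'pixels'."""
    commands = []
    while True:
        cmd = recv_msg(conn)
        if cmd == b"READBACK":  # hypothetical "send me the result" marker
            break
        commands.append(cmd)
    # A real renderer would execute GL here; this just echoes the commands.
    send_msg(conn, b"|".join(commands))


def run_client(commands):
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: one server copy per connection, exits when rendering is done.
        client.close()
        try:
            render_server(server)
        finally:
            server.close()
            os._exit(0)
    server.close()
    for cmd in commands:
        send_msg(client, cmd)
    send_msg(client, b"READBACK")
    result = recv_msg(client)  # the "rendered" result comes back over the socket
    client.close()
    os.waitpid(pid, 0)
    return result
```

Calling run_client([b"CLEAR", b"DRAW"]) forks one short-lived server for that connection and returns whatever the server sent back, which is the same round trip the harness makes before handing the image to the sw winsys.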

So this system is probably going to be slower in raw speed terms, but for developing features or debugging fails it should provide an easier route without the overheads of the qemu process. I was pleasantly surprised it only took two days to pull most of this test harness together which was neat, I'd planned much longer for it!

The code lives in two halves.
http://cgit.freedesktop.org/~airlied/virglrenderer
http://cgit.freedesktop.org/~airlied/mesa virgl-mesa-driver

[updated: pushed into the main branches]

The virglrenderer repo is also standalone now. It has a bunch of unit tests in it that are run under valgrind, in an attempt to lock down some more corners of the API and test for possible ways to escape the host.

March 13, 2015

LSF/MM 2015 recap.

It’s been a long week.
Spent Monday/Tuesday at LSFMM. This year it was in Boston, which was convenient in that I didn’t have to travel anywhere, but less convenient in that I had to get up early and do a rush-hour commute to get to the conference location in time. At least the weather got considerably better this week compared to the frankly stupid amount of snow we’ve had over the last month.
LWN did their usual great write-up which covers everything that was talked about in a lot more detail than my feeble mind can remember.

A lot of things from last year’s event seem to still be getting a lot of discussion. SMR drives & persistent memory being the obvious stand-outs. Lots of discussion surrounding various things related to huge pages (so much so one session overran and replaced a slot I was supposed to share with Sasha, not that I complained. It was interesting stuff, and I learned a few new reasons to dislike the way we handle hugepages & forking), and I lost track of how many times the GFP_NOFAIL discussion came up.

In a passing comment in one session, one of the people Intel sent (Dave Hansen iirc) mentioned that Intel are now shipping an 18 core/36 thread CPU. A bargain at just $4642. Especially when compared to this madness.

A few days before the event, I had been asked if I wanted to do a “how Akamai uses Linux” type talk at LSFMM, akin to what Chris Mason did re: facebook at last year’s event. I declined, given I’m still trying to figure that out myself. Perhaps another time.

Wednesday/Thursday, I attended Vault at the same location.
My takeaways:

  • There were generally a lot more positive vibes around btrfs this year. Even with Josef playing bad cop to Chris’ good cop talk, things generally seemed to be moving away from an “everything is awful” toward “this actually works…” though with the qualifier “.. for facebook’s workload”. Josef did touch on one area where btrfs does still suck, which apparently is database workloads (iirc, due to the copy-on-write nature of btrfs). The spurious ENOSPC failures of the past should hopefully stay in the past. Things generally on the up and up. (Though, this does include the linecount, which has now passed 100KLOC, more than double that of XFS or ext*. Scary).
  • Equally positive vibes surrounding XFS. We celebrated the 20 year anniversary at one evening event, making us all feel just that little bit more like an old fart club. Interesting talk toward the end by Dave Chinner about the future of XFS, and how the current surge of development in XFS is probably its last for various scaling reasons as disks continue to get bigger and bigger. Predicting the future is always hard, but if what Dave said was true, things will start to get ‘interesting’ in about 5 years time, given every other filesystem we support in Linux has the same issues (or worse).
  • People still care a lot about NFS. Especially pNFS. Surprising amount of activity still happening.
  • Even when I worked there, I never really got Red Hat’s “big picture” wrt the several distributed filesystems they supported. Now that I’m not there, I feel even more out of the loop. “ceph is the way forward” “except when it’s glusterfs” or something. Oh, and GFS2 is still a thing apparently, for some reason.
  • As entertaining as Jeremy Allison might be, don’t go to a talk on Samba internals unless you work on it (in which case it’s too late for you). The horrors will likely keep you up at night.
  • Ted’s ext4 talk drew a decent crowd. As fancy as btrfs/xfs etc might be, a *lot* of people still give a crap about extN. Somehow I missed the addition of the ‘lazytime’ option to ext4. Seems neat. Played with it (and also the super-secret ‘dioread_nolock’ mount option). Saw another talk on orphan list scalability in ext4, which was interesting, but didn’t draw as big a crowd.

I got asked “What are you doing at Akamai ?” a lot. (answer right now: trying to bring some coherence to our multiple test infrastructures).
Second most popular question: “What are you going to do after that ?”. (answer: unknown, but likely something more related to digging into networking problems rather than fighting shell scripts, perl and Makefiles).

All that, plus a lot of hallway conversations, long lunches, and evening activities that went on possibly a little later than they should have, led to me almost losing my voice today.
Really good use of time though. I had fun, and it’s always good to catch up with various people.

LSF/MM 2015 recap. is a post from: codemonkey.org.uk

March 03, 2015

Trinity 1.5 release.

As announced this morning, today I decided that things had slowed down (to an almost standstill of late) enough that it was worth making a tarball release of Trinity, to wrap up everything that’s gone in over the last year.

The email linked above covers most of the major changes, but a lot of the change over the last year has actually been groundwork for those features. Things like..

  • The post-mortem dumper needed the generation of the text and the writing to log files to be decoupled, which wasn’t particularly trivial.
  • Some features involved considerable rewrites. The fd generators are now pretty much isolated from each other, making adding a new one a simple task.
  • Handling of the mapping structs got a lot of cleanup (though there is definitely still a lot of room for improvement there, especially when we do things like splitting a mapping).
  • I should also mention the countless hours spent chasing down quite a few hard-to-reproduce bugs that are fixed in 1.5

As I mentioned in the announcement, I don’t see myself having a huge amount of time for at least this year to work on Trinity. I’ve had a number of people email me asking the status of some feature. Hopefully this demarcation point will answer the question.

So, it’s not abandoned, it just won’t be seeing the volume of change it has over the last few years. I expect my personal involvement will be limited to merging patches, and updating the syscall lists when new syscalls get added.

Trinity used to be on roughly a six month release schedule. We’ll see if by the end of the year there’s enough input from other people to justify doing a 1.6 release.

I’m also hopeful that time working on other projects means I’ll come back to this at some point with fresh eyes. There are a number of features I wanted to implement that needed a lot more thought. Perhaps working on some other things for a while will give me the perspective necessary to realize those features.

Trinity 1.5 release. is a post from: codemonkey.org.uk

February 23, 2015

backup solutions.

For the longest time, my backup solution has been a series of rsync scripts that have evolved over time into a crufty mess. Having become spoiled on my mac with time machine, I decided to look into something better that didn’t involve a huge time investment on my part.

The general consensus seemed to be that for ready-to-use home-nas type devices, the way to go was either Synology, or Drobo. You just stick in some disks, and setup NFS/SAMBA etc with a bunch of mouse clicking. Perfect.

I had already decided I was going to roll with a 5 disk RAID6 setup, so bit the bullet and laid down $1000 for a Synology 8-Bay DS1815+. It came *triple* boxed, unlike the handful of 3TB HGST drives.
I chose the HGST’s after reading backblaze’s report on failure rates across several manufacturers, and figured that after the RAID6 overhead, 8TB would be more than enough for a long time, even at the rate I accumulate flac and wav files. Also, worst case, I still had 3 spare bays I could expand into later if needed.
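The 8TB figure follows from simple arithmetic, assuming RAID6 reserves two disks' worth of space for parity and that the usable space is counted in binary units (both are assumptions about how the number was reached, not something stated above). A quick sketch:

```python
def raid6_usable_tb(disks, size_tb):
    """Usable capacity in decimal TB: RAID6 gives up two disks to parity."""
    if disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks - 2) * size_tb


def tb_to_tib(tb):
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


usable = raid6_usable_tb(5, 3)          # five 3TB drives
print(usable, round(tb_to_tib(usable), 2))
```

With five 3TB drives this gives 9 TB of usable space, which is a little over 8 TiB once converted, before any filesystem overhead.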

Installation was a breeze. The plastic drive caddies felt a little flimsy, but the drives were secure once in them, even if they did feel like they were going to snap as I flexed them to pop them into place. After putting in all the drives, I connected the four ethernet ports and powered it up.
After connecting to its web UI, it wanted to do a firmware update, like just about every internet connected device wants to do these days. It rebooted, and finally I could get about setting things up.

On first logging into the device over ssh, I think the first command I typed was uname. Seeing a 3.2 kernel surprised me a little. I got nervous thinking about how many VFS,EXT4,MD bugfixes hadn’t made their way back to long-term stable, and got the creeps a little. I decided to not think too much about it, and put faith in the Synology people doing backports (though I never got as far as looking into their kernel package).

The web ui is pretty slick, though felt a little sluggish at times. I set up my RAID6 volume with a bunch of clicks, and then listened as all those disks started clattering away. After creation, it wanted to do an initial parity scan. I set it going, and went to bed. The next morning before going to work, I checked on it, and noticed it wasn’t even at 20% done. I left it going while I went into the office the next day. I spent the night away from home, and so didn’t get back to it until another day later.

When I returned home, the volume was now ready, but I noticed the device was now noticeably hotter to touch than I remembered. I figured it had been hammering the disks non-stop for 24hrs, so go figure, and that it would probably cool off a little as it idled. As the device was now ready for exporting, I set up an nfs export, and then spent some time fighting uid mappings, as you do. The device does have the ability to deal with LDAP and some other stuff that I’ve never had time to set up, so I did things the hard way. Once I had the export mounted, I started my first rsync from my existing backups.

While it was running, I remembered I had intended to set up bonding. A little bit of clicky-clicky later, it was done, and transfers started getting even faster. Very nice. I set up two bonds, with a pair of NICs in each. Given my desktop only has a dual NIC, that was good enough. Having a 2nd 2GigE bond I figured was nice in case I had multiple machines wanting to use it while I was doing a backup.

So the backup was going to take a while, so I left it running.
A few hours later, I got back to it, and again, it was getting really hot. There are two pretty big fans in the back of the units, and they were cranking out heat. Then, things started getting really weird. I noticed that the rsync had hung. I ctrl-c’d it, and tried logging into the device as root. It took _minutes_ to get a command prompt. I typed top and waited. About two minutes later top started. Then it spontaneously rebooted.

When it came back up, I logged in, and poked around the log files, and didn’t see anything out of the ordinary.
I restarted the rsync, and left it go for a while. About 20 minutes later, I came back to check on it again, and found that the box had just hung completely. The rsync was stalled, I couldn’t ssh in. I rebooted the device, cursed a bit, and then decided to think about it for a while, so never restarted the rsync. I clicked around in the interface, to see if there was anything I could turn on/off that would perhaps give me some clues wtf was going on.
Then it rebooted spontaneously again.

It was about this time I was ready to throw the damn thing out the window. I bought this thing because I wanted a turn-key solution that ‘just worked’, and had quickly come to realize that with this device when something went bad, I was pretty screwed. Sometimes “It runs Linux” just isn’t enough. For some people, the Synology might be a great solution, but it wasn’t for me. Reading some of the Amazon reviews, it seems there were a few people complaining about their units overheating, which might explain the random reboots I saw. For a device I wanted to leave switched on 24/7 and never think about, something that overheats (especially when I’m not at home) really doesn’t give me feel good vibes. Some of the other reviews on Amazon rave about the DS1815+. It may be that there was a bad batch, and I got unlucky, but I felt burnt on the whole experience, and even if I had got a replacement, I don’t know if I would have felt like I could have trusted this thing with my data.

I ended up returning it to Amazon for a refund, and used the money to buy a motherboard, cpu, ram etc to build a dedicated backup computer. It might not have the fancy web ui, and it might mean I’ll still be using my crappy rsync scripts, but when things go wrong, I generally have a much better chance of fixing the problems.

Other surprises: When I opened the unit up to install an extra 4GB of RAM (it comes with just 2GB by default), I noticed that it runs off a single 250W power supply, which seemed surprising to me. I thought disks during spin-up used considerably more power, but apparently they’re pretty low power these days.

So, two weeks of wasted time, frustration, and failed experiments. Hopefully by next week I’ll have my replacement solution all set up and can move on to more interesting things instead of fighting appliances.

backup solutions. is a post from: codemonkey.org.uk

February 17, 2015

First fully sandboxed Linux desktop app

It’s not a secret that I’ve been working on sandboxed desktop applications recently. In fact, I recently gave a talk at devconf.cz about it. However, up until now I’ve mainly been focusing on the bundling and deployment aspects of the problem. I’ve been running applications in their own environment, but having pretty open access to the system.

Now that the basics are working it’s time to start looking at how to create a real sandbox. This is going to require a lot of changes to the Linux stack. For instance, we have to use Wayland instead of X11, because X11 is impossible to secure. We also need to use kdbus to allow desktop integration that is properly filtered at the kernel level.

Recently Wayland has made some pretty big strides though, and we now have working Wayland sessions in Fedora 21. This means we can start testing real sandboxing for simple applications. To get something running I chose to focus on a game, because they require very little interaction with the system. Here is a video I made of Neverball, running in a minimal sandbox:

Click here to view the embedded video.

In this example we’re running a regular build of neverball in an environment which:

  • Is independent of the host distribution
  • Has no access to any system or user files other than the ones from the runtime and application itself
  • Has no access to any hardware devices, other than DRI (for GL rendering)
  • Has no network access
  • Can’t see any other processes in the system
  • Can only get input via Wayland
  • Can only show graphics via Wayland
  • Can only output audio via PulseAudio
  • … plus more sandboxing details

Yet the application is still simple to install and integrates nicely with the desktop. If you want to test it yourself, just follow the instructions on the project page and install org.neverball.Neverball.

Of course, there is still a lot to do here. For instance, PulseAudio doesn’t protect clients from each other, and for more complex applications we need to add new APIs to safely grant access to things like user files and devices. The sandbox details page has a more detailed list of what has to be done.

The road is long, but at least we have now started our journey!

February 16, 2015

NetworkManager for Administrators Part 1

(Image via scobleizer, CC BY 2.0)

NetworkManager is a system service that manages network interfaces and connections based on user or automatic configuration. It supports Ethernet, Bridge, Bond, VLAN, team, InfiniBand, Wi-Fi, mobile broadband (WWAN), PPPoE and other devices, and supports a variety of different VPN services.  You can manage it a couple different ways, from config files to a rich command-line client, a curses-like client for non-GUI systems, graphical clients for the major desktop environments, and even web-based management consoles like Cockpit.

There’s an old perception that NetworkManager is only useful on laptops for controlling Wi-Fi, but nothing could be further from the truth.  No laptop I know of has InfiniBand ports.  We recently released NetworkManager 1.0 with a whole load of improvements for workstations, servers, containers, and tiny systems from embedded to RaspberryPi.  In the spirit of making double-plus sure that everyone knows how capable and useful NetworkManager is, let’s take a magical journey into Administrator-land and start at the very bottom…

Daemon Configuration Files

Basic configuration is stored in /etc/NetworkManager/NetworkManager.conf in a standard key/value ini-style format.  The sections and values are well-described by ‘man NetworkManager.conf’.  A standard default configuration looks like this:

[main]
plugins=ifcfg-rh

You can override default configuration through either command-line switches or by dropping “configuration snippets” into /etc/NetworkManager/conf.d.  These snippets use the same configuration options from ‘man NetworkManager.conf’ but are much easier to distribute among larger numbers of machines through packages or tools like Puppet, or even just to install features through your favorite package manager.  For example, in Fedora, there is a NetworkManager-config-connectivity-fedora RPM package that installs a snippet that enables connectivity checking to Fedora Project servers.  If you don’t care about connectivity checking, you simply ‘rpm -e NetworkManager-config-connectivity-fedora’ instead of tracking down and deleting /etc/NetworkManager/conf.d/20-connectivity-fedora.conf.

Just for kicks, let’s take a walk through the various configuration options, what they do, and why you might care about them in a server, datacenter, or minimal environment…

Configuration Snippets

First, each configuration “snippet” in /etc/NetworkManager/conf.d can override values set in earlier snippets, or even the default configuration (but not command-line options).  So the same option specified in 50-foobar.conf will override that option specified in 10-barfoo.conf.  Many options also support the “+” modifier, which allows their value to be added to earlier ones instead of replacing.  So “plugins+=something-else” will add “something-else” to the list, instead of overwriting any earlier values.  You’ll see why this is quite useful in a minute…
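To make the override rules concrete, here is a small Python model of them. It is only a sketch of the behavior described above, not NetworkManager's actual parser: it assumes snippets are applied in sorted filename order, that a plain “=” replaces an earlier value, and that “+=” appends to it with a comma (the separator is an assumption). The function name merge_snippets is made up for the example.

```python
def merge_snippets(snippets):
    """Merge {filename: text} snippets; later filenames win, '+=' appends."""
    merged = {}
    for name in sorted(snippets):
        section = None
        for raw in snippets[name].splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            if key.endswith("+"):
                # "key+=value": add to whatever earlier snippets set.
                key = key[:-1].strip()
                earlier = merged.get((section, key), "")
                merged[(section, key)] = f"{earlier},{value}" if earlier else value
            else:
                # "key=value": replace anything set earlier.
                merged[(section, key)] = value
    return merged


snippets = {
    "50-foobar.conf": "[main]\nplugins+=something-else\ndhcp=internal\n",
    "10-barfoo.conf": "[main]\nplugins=ifcfg-rh\ndhcp=dhclient\n",
}
merged = merge_snippets(snippets)
print(merged[("main", "plugins")], merged[("main", "dhcp")])
```

Here 50-foobar.conf adds to the plugins list set by 10-barfoo.conf but replaces its dhcp value outright.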

Dive Deep

[main]
plugins=ifcfg-rh | ifupdown | ifnet | ifcfg-suse | ibft (default empty)

This option enables or disables certain settings plugins, which are small loadable libraries that read and write distribution-specific network configuration.  For example, Fedora/RHEL would specify ‘plugins=ifcfg-rh’ for reading and writing the ifcfg file format, while Debian/Ubuntu would use ‘plugins=ifupdown’ for reading /etc/network/interfaces, and Gentoo would use ‘plugins=ifnet’.  If you know your distro’s config format like the back of your hand, NetworkManager doesn’t make you change it.

There is one default plugin though, ‘keyfile’, which NetworkManager uses to read and write configurations that the distro-specific plugins can’t handle.  These files go into /etc/NetworkManager/system-connections and are standard .ini-style key/value files.  If you’re interested in the key and value definitions, you can check out ‘man nm-settings’ and ‘man nm-settings-keyfiles’, or even look at some examples.

[main]
monitor-connection-files=yes | no (default no)

By popular demand, NetworkManager no longer watches configuration files for changes.  Instead, you make all the changes you want, and then explicitly tell NetworkManager when you’re done with “nmcli con reload” or “nmcli con load <filename>”.  This prevents reading partial configuration and allows you to double-check that everything is correct before making the configuration update.  Note that changes made through the D-Bus interface (instead of the filesystem) always happen immediately.

However, if you want the old behavior back, you can set this option to “yes”.

[main]
auth-polkit=yes | no (default yes)

If built with support for it, NetworkManager uses PolicyKit for fine-grained authorization of network actions.  This will be the subject of another article in this series, but the TLDR is that PolicyKit easily allows user A the permission to use WiFi while denying user B WiFi but allowing WWAN.  These things can be done with Unix groups, but that quickly gets unwieldy and isn’t fine-grained enough for some organizations.  In any case, PolicyKit is often unnecessary on small, single-user systems or in datacenters with controlled access.  So even if your distribution builds NetworkManager with PolicyKit enabled, you can turn it off for simpler root-only operation.

[main]
dhcp=dhclient | dhcpcd | internal (default determined at build time, dhclient preferred if enabled)

With NetworkManager 1.0 we’ve added a new internal DHCP client (based off systemd code which was based off ConnMan code) which is smaller, faster, and lighter than dhclient or dhcpcd.  It doesn’t do DHCPv6 yet, but we’re working on that.  We think you’ll like it, and it’s certainly much less of a resource hog than a dhclient process for every interface. To use it, set this option to “internal” and restart NetworkManager.

If NetworkManager was built with support for dhclient or dhcpcd, you can use either of these clients by setting this option to the client’s name.  Note that if you enable both dhclient and dhcpcd, dhclient will be preferred for maximum compatibility.

[main]
no-auto-default= (default empty)

By default, NetworkManager will create an in-memory DHCP connection for every Ethernet interface on your system, which ensures that you have connectivity when bringing a new system up or booting a live DVD.  But that’s not ideal on large systems with many NICs, or on systems where you’d like to control initial network bring-up yourself.  In that case, you should set this option to “*” to disable the auto-Ethernet behavior for all interfaces, indicating that you’d like to create explicit configuration instead.  You can also use MAC addresses or interface names here too!  On Fedora we’ve created a package called NetworkManager-config-server that sets this option to “*” by default.

[main]
ignore-carrier= (default empty)

Trip over a cable?  Want to make sure a critical interface stays configured if the switch port goes down?  This option is for you!  Setting it to “*” (all interfaces) or using MAC addresses or interface names here will tell NetworkManager to ignore carrier events after the interface is configured.  For DHCP connections a carrier is obviously required for initial configuration, while static connections can start regardless of carrier status.  After that, feel free to unplug the cable every time Apple sells an iPhone!

[main]
configure-and-quit=yes | no (default no)

New with 1.0 is the “configure and quit” mode where NetworkManager configures interfaces (including, if desired, blocking startup until networking is active) and then quits, spawning small helpers to maintain DHCP leases and IPv6 address lifetimes if required.  In a datacenter or cloud where cycles are money, this can save you some cash and deliver a more stable setup with known behavior.

[main]
dns=dnsmasq | unbound | none | default (default empty, equivalent to “default”)

Want to control DNS yourself?  NetworkManager makes it easy!  Don’t want to?  NetworkManager makes that easy too! When you set this option to ‘dnsmasq’ NetworkManager will configure dnsmasq as a local caching nameserver, including split DNS for VPN tunnels.  If you set it to ‘none’ then NetworkManager won’t touch /etc/resolv.conf and you can use dispatcher scripts that NetworkManager calls at various points to set up DNS any way you choose.

Leaving the option empty or setting it to “default” asks NetworkManager to own resolv.conf, updating system DNS with any information from your explicit network settings or those received from automatic means like DHCP.

In the upcoming NetworkManager 1.2, DNS information is written to /var/lib/NetworkManager/resolv.conf and, if NM is allowed to manage /etc/resolv.conf, that file will be a symlink to the one in /var similar to systemd-resolved.  This makes it easier for external tools to incorporate the DNS information that NetworkManager combines from multiple sources like DHCP, PPP, IPv6, VPNs, and more.
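For the ‘dns=none’ case, a dispatcher script is just an executable that NetworkManager runs with the interface name and the action as arguments.  The Python sketch below shows the core of what such a script might do; it assumes the dispatcher environment provides IP4_NAMESERVERS and IP4_DOMAINS as space-separated lists, and the function names and the header comment are made up for the example.  It only builds the file contents and leaves writing /etc/resolv.conf to the caller.

```python
def build_resolv_conf(env):
    """Build resolv.conf text from dispatcher-style environment variables."""
    servers = env.get("IP4_NAMESERVERS", "").split()
    domains = env.get("IP4_DOMAINS", "").split()
    lines = ["# written by a hypothetical dispatcher script"]
    if domains:
        lines.append("search " + " ".join(domains))
    lines.extend("nameserver " + server for server in servers)
    return "\n".join(lines) + "\n"


def handle_event(argv, env):
    """argv mimics sys.argv: [script, interface, action]. Act only on 'up'."""
    if len(argv) < 3 or argv[2] != "up":
        return None
    return build_resolv_conf(env)


example_env = {
    "IP4_NAMESERVERS": "192.0.2.1 192.0.2.2",
    "IP4_DOMAINS": "example.com",
}
print(handle_event(["10-dns", "eth0", "up"], example_env))
```

A real script would pass sys.argv and os.environ to handle_event and write the result wherever your DNS setup expects it.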

[keyfile]
unmanaged-devices= (default empty)

Want to keep NetworkManager’s hands off a specific device?  That’s what this option is for, where you can use “interface-name:eth0” or “mac:00:22:68:1c:59:b1” to prevent automatic management of a device.  While there are some situations that require this, by default NetworkManager doesn’t touch virtual interfaces that it didn’t create, like bridges, bonds, VLANs, teams, macvlan, tun, tap, etc.  So while it’s unusual to need this option, we realize that NetworkManager can be used in concert with other tools, so it’s here if you do.
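As a rough model of how such a list of device specs could be matched against a device, here is a short Python sketch.  It only handles the two spec forms shown above, assumes that multiple specs are separated by semicolons, and is not NetworkManager's real matching code; the function name is_unmanaged is made up for the example.

```python
def is_unmanaged(specs, ifname, mac):
    """Return True if the device matches any spec in the list."""
    for spec in specs.split(";"):
        spec = spec.strip()
        if spec.startswith("interface-name:"):
            if spec[len("interface-name:"):] == ifname:
                return True
        elif spec.startswith("mac:"):
            # MAC addresses are compared case-insensitively.
            if spec[len("mac:"):].lower() == mac.lower():
                return True
    return False


specs = "interface-name:eth0;mac:00:22:68:1c:59:b1"
print(is_unmanaged(specs, "eth0", "aa:bb:cc:dd:ee:ff"))
print(is_unmanaged(specs, "eth1", "00:22:68:1C:59:B1"))
print(is_unmanaged(specs, "eth2", "aa:bb:cc:dd:ee:ff"))
```

The first device matches by name, the second by MAC address, and the third matches neither spec, so it would stay managed.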

[connectivity]
uri=  (default empty = disabled)
interval=(default 0 = disabled)
response=  (default “NetworkManager is online”)

Connectivity checking helps users log into captive portals and hotspots, while also providing information about whether or not the Internet is reachable.  When NetworkManager connects a network interface, it sends an HTTP request to the given URI and waits for the specified response.  If you’re connected to the Internet and the connectivity server isn’t down, the response should match and NetworkManager will change state from CONNECTED_SITE to CONNECTED.  It will also check connectivity every ‘interval’ seconds so that clients can report status to the user.

If you’re instead connected to a WiFi hotspot or some kind of captive portal like a hotel network, your DNS will be hijacked and the request will be redirected to an authentication server.  The response will be unexpected and NetworkManager will know that you’re behind a captive portal.  Clients like GNOME Shell will then indicate that you must authenticate before you can access the real Internet, and could provide an embedded web browser for this purpose.
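The decision described in the last two paragraphs can be sketched in a few lines of Python.  This is a simplified model and not NetworkManager's implementation: it assumes the response body is compared to the expected text after stripping whitespace, it treats a failed request as “still CONNECTED_SITE”, and the “PORTAL” label and the function names are made up for the example.

```python
from urllib.request import urlopen

DEFAULT_RESPONSE = "NetworkManager is online"


def http_fetch(uri, timeout=5):
    """Fetch the URI and return the body as text (needs a real network)."""
    with urlopen(uri, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def classify_connectivity(fetch, uri, expected=DEFAULT_RESPONSE):
    """Classify connectivity from one HTTP check using the given fetcher."""
    try:
        body = fetch(uri)
    except OSError:
        # Request failed: no evidence that the Internet is reachable.
        return "CONNECTED_SITE"
    if body.strip() == expected:
        return "CONNECTED"
    # Something answered, but not with the expected text: likely a portal.
    return "PORTAL"


def fake_online(uri):
    return "NetworkManager is online\n"


def fake_portal(uri):
    return "<html>Please log in to the hotel Wi-Fi</html>"


def fake_down(uri):
    raise OSError("network unreachable")


print(classify_connectivity(fake_online, "http://example.invalid/check"))
print(classify_connectivity(fake_portal, "http://example.invalid/check"))
print(classify_connectivity(fake_down, "http://example.invalid/check"))
```

Passing the fetcher in as an argument keeps the classification logic testable without a network; http_fetch can be swapped in for a real check against whatever URI your configuration uses.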

Upstream connectivity checking is disabled by default, but some distribution variants (like Fedora Workstation) are now enabling it for desktops, laptops, and workstations.  On a server or embedded system, or where traffic costs a lot of money, you probably don’t want this feature enabled.  To turn it off you can either remove your distro-provided connectivity package (which just drops a file in /etc/NetworkManager/conf.d) or you can remove the options from NetworkManager.conf.

Special NetworkManager data files

In the normal course of network management sometimes non-configuration data needs to persist.  NetworkManager does this in the /var/lib/NetworkManager directory, which contains a few different files of interest:

seen-bssids

This file contains the BSSIDs (MAC addresses) of WiFi access points that NetworkManager has connected to for each configured WiFi network.  NetworkManager doesn’t do this to spy on you (and the file is readable only by root), but instead to automatically connect to WiFi networks that do not broadcast their SSID.  You almost never need to touch this file, but if you are concerned about privacy feel free to delete this file periodically.

timestamps

Each time you connect to a network, whether wired, WiFi, etc, NetworkManager updates the timestamp in this file.  This allows NetworkManager to determine which network you last used, which can be used to automatically connect you to more preferred networks.  NetworkManager also uses the timestamp as an indicator that you have successfully connected to the network before, which it uses when deciding whether or not to ask for your WiFi password when you get randomly disconnected or the driver fails.

NetworkManager.state

This file stores persistent user-determined state for Airplane mode for each technology like WiFi, WWAN, and WiMAX.  Normally this is controlled by hardware buttons, but some systems don’t have hardware buttons or the drivers don’t work, plus that state is not persistent across boots.  So NetworkManager stores a user-defined state for each radio type and will ensure the radio stays in that state across reboots too.

DHCP lease and configuration files

When you obtain a DHCP lease, that lease may last longer than your connection to that network.  To ensure that you receive a nominally stable IP address the next time you connect, or to ensure that your TCP sessions are not broken if there is a network hiccup, NetworkManager stores the DHCP lease and attempts to acquire the same lease again.  These files are stored per-connection to ensure that a lease acquired on your home WiFi or ethernet network is not used for work or Starbucks.  Temporary DHCP configuration files are also stored here, which are constructed based on your preferences and on generic DHCP configuration files in /etc for each supported DHCP client.  If you want to wipe the DHCP slate clean, feel free to remove any of the lease or configuration files.

And that’s it for this time, stay tuned for the next part in this series!

January 30, 2015

Back from DevX hackfest

I’m now back from a week in Cambridge at the developer experience hackfest. This was a great event, it was a lot of fun to meet people again, and we got a lot of things done. I spent a lot of time talking to people about things related to xdg-app and sandboxed applications, both spreading information and actually implementing features.

I spent some time with Emmanuele, Ryan and Lars working on glib stuff, which resulted in the G_DECLARE_*_TYPE macros finally being merged. Additionally I reviewed the new list model abstraction which I hope we can land soon, and Ryan and I worked out a new fancy __attribute__(cleanup) approach that we hope to merge into glib soon.

We also worked a bit on Gtk+ OpenGL support. Based on feedback from early users we’re doing some changes in how GL contexts are created to allow you to configure them in more detail. We also decided that we want to completely drop support for legacy OpenGL contexts, as these had issues cooperating with Core 3.2 contexts, and because we don’t live in the 90s anymore. Carlos was working on converting GtkPopover to use (override redirect) toplevels on X11, and I gave him moral support and generally hated on ancient crappy X11 behaviour.

Props to Collabora and Philip for arranging a great event!

January 26, 2015

So much to learn..

At lunch a few days ago, I was discussing with a coworker how right now I’m feeling a little overwhelmed. Not completely surprising I suppose, given the volume of new things I need to learn. But as I’m getting acclimated in my new job, it’s becoming clearer which things I either don’t know well enough or am completely unfamiliar with.

I’m only now realizing that not only has the scope of what I’m working on changed, but also how I work on things. During the ten years I worked on Fedora, because we were woefully understaffed and buried alive in bugs, there was never really time to spend a lot of time on an individual bug, unless a lot of users were hitting it. Because of this, the things that I ended up fixing were usually fairly small fixes. The occasional NULL pointer dereference. Perhaps a use after free. Maybe linked-list corruption. Pretty basic stuff that anyone with familiarity with any part of the kernel could figure out reasonably quickly. Do enough of this, and it becomes less about engineering, and more about pattern recognition.

Even a lot of the bugs that trinity finds aren’t terribly invasive fixes.
(Screwed up error paths being probably the most common thing it picks up, because nothing else really ever tests them).

The more complicated bugs, like a WARN_ON being hit? Those could take a lot more understanding, which in turn could take considerable time to acquire.

As Fedora kernels were so close to mainline a lot of the time we could chase up the maintainer, pass the bug along, and move onto something else. The result of ten years of this way of working has meant I have a passing understanding of how lots of parts of the kernel interact from a 10,000ft perspective, and perhaps some understanding of some nuances based on past interactions, but no deep architectural understanding of, e.g., how an sk_buff traverses through the various parts of the network stack. A warning deep in the tcp guts, caused by a packet that’s been through bonding, netfilter, etc? I’m not your guy. At least today.

Thankfully I’ve got some time to ramp up my learning on unfamiliar parts of the kernel, and get a better understanding of how things work on a deeper level. I suspect I’ll end up turning at least some of the stuff I learn into future posts here.

In the meantime, I’ve got a lot of reading to do.
I felt a little reassured at least when my coworker responded “yeah, me too”.

So much to learn.. is a post from: codemonkey.org.uk

January 21, 2015

Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don’t actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing on long-term stable releases in Fedora after giving it a try circa Fedora 14, and it generally being a disaster because the bugs being fixed didn’t match up much to the bugs our users were seeing. We found that we got more bugs we cared about being fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose “one size fits all” model that distribution kernels have to fit is a much bigger problem to solve, and with the feedback loop of “stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release” being so long, it’s not really a surprise that just having users be closer to bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way) unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.

Thoughts on long-term stable kernels. is a post from: codemonkey.org.uk

January 19, 2015

The Whole Damn World Takes Effect to NetworkManager 1.0

2004

Facebook launched.

The first Ubuntu release appeared.

It was the Year of the Linux Desktop.

Novell had just bought Ximian and Mono happened.

Google IPOed.

Firefox 1.0 showed up.

This was your cellphone and PDAs were still a thing.

This love took you over and made you think you got it.

And NetworkManager was first released.

Fast forward to 2014…

NetworkManager 1.0!

Right before the 2014 holidays, and more than 10 years after the first line of NetworkManager was typed, we released version 1.0.  A huge milestone on the way to making NetworkManager more cooperative, more flexible, more configurable, and more useful than ever before.

How you ask?

1: libnm: the new GLib client library

For all the GLib/GObject users out there, we’ve rebuilt libnm-util and libnm-glib from the ground up into a new single library called libnm.  It uses GDBus instead of dbus-glib.  It provides GIO-style asynchronous methods. It also exposes IP addresses, MAC addresses, and other properties as strings instead of byte arrays, and combines the old NMClient and NMRemoteSettings objects into a single NMClient object, among other things.

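import gi
gi.require_version("NM", "1.0")  # select the libnm introspection data
from gi.repository import NM

# Print every IPv4 address on every device NetworkManager knows about.
client = NM.Client.new(None)
for dev in client.get_devices():
    ipcfg = dev.get_ip4_config()
    if ipcfg:
        for addr in ipcfg.get_addresses():
            print("(%s) %s/%d" % (dev.get_iface(), addr.get_address(), addr.get_prefix()))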

2: a smaller, faster DHCP client

While it doesn’t do DHCPv6 (yet!) this internal client (based off systemd/connman code) is much faster than dhclient and dhcpcd, and doesn’t consume huge amounts of memory like dhclient.  Use the ‘dhcp=internal’ option in NetworkManager.conf to enable it and let us know how it works.  We’ll be adding DHCPv6 support and enhancing the recognized options in the near future.

3: configure and quit

Have a more static configuration and still want to use NetworkManager configuration and API to manage it?  The ‘configure-and-quit=yes’ option in NetworkManager.conf will configure your interfaces and quit the NM process, spawning small helpers to preserve DHCP and IPv6 addresses.  This saves cycles (and therefore money) and is simpler to manage.
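As a rough illustration of the two NetworkManager.conf options named in sections 2 and 3, the sketch below renders a fragment in memory with Python's configparser. Both keys belong to the [main] section; where you put the resulting text (the main file or a drop-in) is left to you, and the snippet itself writes nothing to disk.

```python
import configparser
import io

# Build the [main] section of a NetworkManager.conf fragment in memory.
conf = configparser.ConfigParser()
conf["main"] = {
    "dhcp": "internal",           # use the built-in DHCP client (section 2)
    "configure-and-quit": "yes",  # configure interfaces, then exit (section 3)
}

# Render without spaces around '=' to match the style used in the text.
buf = io.StringIO()
conf.write(buf, space_around_delimiters=False)
fragment = buf.getvalue()
print(fragment)
```

The printed fragment is a [main] header followed by the two key=value lines. Try one option at a time if you want to tell their effects apart.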

4: more cooperative

Continuing the trend, NetworkManager 1.0 does a much better job of leaving externally configured interfaces alone until you tell it to do something.  In addition to improvements for IPv6 sysctl recognition and user-added route preservation, externally created virtual interfaces are no longer automatically set IFF_UP, and NetworkManager handles external master/slave relationship changes more smoothly.

5: more powerful nmcli

We’ve added PolicyKit and interactive password support to nmcli, allowing full command-line-only operation for most network connections, even for less privileged users.  There’s a new ‘nmcli dev connect’ command that brings up an interface using the best available connection.  You can also delete virtual interfaces directly through nmcli.

6: improved IPv6

We’ve ensured that if network interfaces are supposed to be down and unconfigured, that the kernel doesn’t assign a link-local address to them, to prevent potential security issues when you think networking is down.  We’ve also added support for IPv6 WWAN connections and fixes to respect router-delivered MTUs.

7: Bluetooth DUN support

Bluez5 changed API for Dial-Up-Networking functionality, which broke the NetworkManager support.  At long last we’ve added that support back, no thanks to Bluez.  Happy mobile networking!

8: more flexible and cooperative routing

Every interface that can have a default route now gets one, and NetworkManager manages the priorities to ensure they don’t conflict.  Plus, if you need to, you can manually manage priorities on a per-connection basis to prefer WiFi over WWAN or WWAN over ethernet, or whatever you need.
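To make the priority idea concrete: when several default routes exist, Linux prefers the one with the lowest metric. The sketch below only models that selection rule. The interface names, gateways and metric values are made up for illustration and are not NetworkManager's actual defaults or code.

```python
# Hypothetical default routes; names, gateways and metrics are invented.
default_routes = [
    {"iface": "wwan0", "gateway": "10.64.0.1", "metric": 700},
    {"iface": "wlan0", "gateway": "192.168.1.1", "metric": 600},
    {"iface": "eth0", "gateway": "192.168.0.1", "metric": 100},
]


def preferred_default(routes):
    """Return the default route with the lowest metric (highest priority)."""
    return min(routes, key=lambda r: r["metric"])


best = preferred_default(default_routes)
print(best["iface"])
```

Giving a connection a lower metric than its neighbours in this model is what "preferring" it amounts to.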

9: fewer dependencies

We’ve also removed some direct dependencies (PolicyKit), slimmed down code, and split functionality into selectable plugins, leading to easier installs on limited systems and better configurability.

That’s just the tip of the iceberg; we’ve improved almost every part of NetworkManager and we’re not stopping there.  We’re planning improvements to container use-cases, WiFi, VPNs, power savings, client APIs, and much more.  2015 is gonna be a great year, and not just because the version number is greater than 1!

January 15, 2015

UAT on RTL-SDR update

About a year ago, when I started playing with ADS-B over 1090ES, I noticed that small airplanes heavily favour UAT in 978 MHz, because it's cheaper. For the purposes of independent on-board traffic, thus, it would be important to tap directly into UAT. If I mingle with airliners and their 1090ES, it's in controlled airspace anyway (yes, I know that most collisions happen in controlled airspace, but I'm grasping at excuses to play with UAT here, okay).

UAT poses a large challenge for RTL-SDR because of its relatively high data rate: 1.041667 Mbit/s. The RTL2832U can only sample at 3.2 MS/s, and is only stable at 2.8 MS/s. Theoretically it should be enough, but everything I saw out there only works with 8 samples per bit, for weak signals. So, for initial experiments, I thought to try a trick: self-clocking. I set the sample rate to 2083334, exactly twice the bit rate, and then do no clock recovery whatsoever. It hits where it hits, and if the sample points land well on the bits, the packet is recovered, otherwise it's lost.
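The arithmetic behind those rates is easy to check. This small sketch uses only the figures quoted above (the labels are mine) and prints how many samples each sample rate gives per UAT bit.

```python
UAT_BIT_RATE = 1_041_667  # bits per second, from the text

# Sample rates quoted in the text, in samples per second.
sample_rates = {
    "self-clocking rate": 2_083_334,
    "stable maximum": 2_800_000,
    "absolute maximum": 3_200_000,
}

samples_per_bit = {name: rate / UAT_BIT_RATE for name, rate in sample_rates.items()}
for name, spb in samples_per_bit.items():
    print(f"{name}: {spb:.3f} samples per bit")
```

Even flat out, the dongle gives only about three samples per bit, far short of eight. The self-clocking rate gives exactly two.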

I threw together some code, ran it, and it didn't work. Only received white noise. Oh, well. I moved the repo into a dusty corner of Github and forgot about it.

Fast forward a year, a gentleman by the name of Oliver Jowett noticed a big problem: the phase was computed incorrectly. I fixed that up and suddenly saw some bits recovered (as it happened, zeroes and ones were swapped, which Oliver had to correct again).

After that, things started to move forward. Having bits recovered allowed me to measure reception, and I found that the antenna that I built for the 978 MHz band was much worse than the stock antenna for TV. Imagine my surprise and disappointment: all that soldering for nothing. I don't know where I screwed up, but some suggest that the computer and dongle produce RF noise that screws with the antenna, and a length of coax helps with that despite the losses incurred by the coax (/u/christ0ph liked that in particular).

This is bad.

This is good. Or better at least.

From now on, it's the error recovery. Unfortunately, I have no clue what this means:

The FEC parity generation shall be based on a systematic RS 256-ary code with 8-bit code word symbols. FEC parity generation for each of the six blocks shall be a RS (92,72) code.

Quoted from Annex 10, Volume III, 12.4.4.2.1.
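Read with standard Reed-Solomon terminology, RS(92,72) with 8-bit symbols means each block carries 72 data bytes plus 20 parity bytes, and a decoder can repair up to half as many corrupted bytes as there are parity bytes. The sketch below only does that bookkeeping from the numbers in the quote. It is not a decoder, and the function and key names are mine.

```python
def rs_parameters(n, k, symbol_bits=8):
    """Basic bookkeeping for a systematic Reed-Solomon (n, k) code."""
    parity = n - k
    return {
        "field_size": 2 ** symbol_bits,     # "256-ary" for 8-bit symbols
        "parity_symbols": parity,           # appended after the k data symbols
        "correctable_errors": parity // 2,  # symbol errors fixable per block
        "code_rate": k / n,
    }


block = rs_parameters(92, 72)
blocks = 6  # "each of the six blocks" in the quoted text
print(block)
print("data bytes:", blocks * 72, "bytes on air:", blocks * 92)
```

So each 92-byte block survives up to 10 corrupted bytes. A full-length code over 256 symbols has 255 symbols, so this is a shortened one.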

P.S. Observing Oliver's key involvement, the cynical may conclude that the whole premise of Open Source is that you write drek that does not work, upload it to Github, and someone fixes it up for you, free of charge. ESR wrote a whole book about it ("with enough eyes all bugs are shallow")!

January 08, 2015

the new job reveal.

I let the cat out of the bag earlier this afternoon on twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilize production releases, proactively finding new bugs, misc QA work, development of some new tooling and a whole bunch of other stuff I can’t talk about just yet. (And probably a whole slew of things I don’t even know about yet).

It seemed almost serendipitous that I’ve ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story beginning with DDoS’s of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai related book, No Better Time, the autobiography of the late founder of Akamai, Danny Lewin. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me to a few possibilities, and the final decision to go with Akamai wasn’t an easy one to make. The combination of interesting work, an “easy to commute to” office (I’ll be at the Kendall Square office in Cambridge, MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep).

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.

the new job reveal. is a post from: codemonkey.org.uk

Swift and balance

Swift is on the cusp of getting yet another intricate mechanism that regulates how partitions are placed: so-called "overload". But new users of Swift even keep asking what weights are, and now this? I am not entirely sure it's necessary, but here's a simple explanation why we're ending with attempts at complexity (thanks to John Dickinson on IRC).

Suppose you have a system that spreads your replicas (of partitions) across (failure) zones. This works great as long as your zones are about the same size, usually a rack. But then one day you buy a new rack with 8TB drives and suddenly the new zone is several times larger than others. If you do not adjust anything, it ends up only a quarter full at best.

So, fine, we add "weights". Now a zone that has weight 100.0 gets 2 times more replicas than a zone with weight 50.0. This allows you to fill zones better, but this must compromise your dispersion and thus durability. Suppose you only have 4 racks: three with 2TB drives and one with 8TB drives. Not an unreasonable size for a small cloud. So, you set weights to 25, 25, 25, 100. With replication factor of 3, there's still a good probability (which I'm unable to calculate, although I feel it ought to be easy for someone better educated) that the bigger node will end with 2 replicas for some partitions. Once that node goes down, you lose redundancy completely for those partitions.

In the small-cloud example above, if you care about your customers' data, you have to eat the imbalance and underutilization until you retire the 2TB drives [1].
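For a rough feel of that probability, here is a back-of-the-envelope sketch under a deliberately naive assumption: each of the 3 replicas lands in a zone independently, with probability proportional to the zone's weight. That is not how Swift's ring builder works (it actively tries to disperse replicas), so treat the number as a worst-case intuition, not a prediction.

```python
from fractions import Fraction
from math import comb

weights = [25, 25, 25, 100]  # three 2TB racks and one 8TB rack, from the text
replicas = 3

# Naive model (an assumption): each replica independently picks the big zone
# with probability weight / total weight.
p_big = Fraction(weights[-1], sum(weights))  # 100/175 = 4/7


def prob_at_least(k, n, p):
    """Binomial probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_two_or_more = prob_at_least(2, replicas, p_big)
print(p_two_or_more, float(p_two_or_more))
```

Under this model about 61% of partitions would have at least two replicas in the big zone. That is the scale of the problem dispersion-aware placement has to fight.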

<clayg> torgomatic: well if you have 6 failure domains in a tier but their sized 10000 10 10 10 10 10 - you're still sorta screwed

My suggestion would be to just ignore all the complexity we thoughtfully provided for the people with "screwed" clusters. Deploy and maintain your cluster to make it easy for the placement and replication: have a good number of more or less uniform zones that are well aligned to natural failure domains. Everything else is a workaround -- even weights.

P.S. Kinda wondering how Ceph deals with these issues. It is more automagic when it decides what to store where, but surely there ought to be a good and bad way to add OSDs.

[1] Strictly speaking, other options exist. You can delegate to another tier by tying 2 small racks into a zone: yet another layer of Swift's complexity. Or, you could put new 8TB drives on trays and stuff them into existing nodes. But considering that muddies the waters.

UPDATE: See the changelog for better placement in Swift 2.2.2.

Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I’d encountered since I had started work there. I had been fixing up various bugs in Trinity over the tail end of summer that meant on a good kernel, it would run and run for quite some time. I still didn’t figure out exactly what the cause of a self-corruption was, but narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often, the kernel’s NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seemingly random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains, what was scribbling over the HPET in my case ? The /dev/hpet node doesn’t allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it, but..
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.

The only two plausible scenarios I could think of were

  • Trinity generated 0xfed000f0 as a random address, and passed that to a syscall which wrote to it. This seems pretty unlikely, and hopefully the kernel has sufficient access_ok() checks on addresses passed in from userspace. Just to be sure, I had hardwired trinity to pass in that address, and couldn’t reproduce the bug.
  • A hardware bug.
    I’m actually starting to believe this may be the case. When trinity drives the CPU load up past a certain threshold, for whatever reason, the HPET stops ticking and corrupts itself. It still seems a bit “out there”, but is more believable than the other theory at least. An interesting data point showed up when googling for the DMI string of the affected machine. Someone else had seen ‘random lockups’ that looked very similar a year earlier. The associated bugzilla had a few more traces.

So that’s where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus’ hacky workaround didn’t get committed, but he & John Stultz continue to go back & forth on hardening the clock management code in the face of screwed up hardware, so maybe soon we’ll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:

  • Keep better notes.
    Every week that passed, I wished I had written down what I had done the week before. With everything else going on in my life over the last few months, I neglected to document things as well as I could have, and only had old emails to fall back on. Not every bug drags on for months like this, but when you over-optimistically think a bug is going to be solved in a few days, you tend to not bother taking as extensive notes on what has been tried so far.
  • Google for the DMI string of the affected hardware pretty early on.
    That might have given us some clues a lot sooner as to what was going on. Or maybe not, but still – more data.
  • The more people looked into this bug, the more “this doesn’t look right” code was found. There’s never just one bug.

Some closure on a particularly nasty bug. is a post from: codemonkey.org.uk

January 07, 2015

continuity of various projects

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

  • Trinity.
    This will proceed pretty much as it has over the last year, perhaps with a little more focus on various areas of the kernel than it has in the past. (Not being intentionally elusive, I’m not entirely sure myself yet).
    From discussions I’ve had so far, there may even be a spin-off into a separate tool, we’ll see.
    • I’ve been wanting to get back to the network related code for some time, and that’s probably going to happen soon-ish.
    • There is still work that could be done with the VM/FS code I’ve worked on over the last year and a half, but it’s already finding bugs, so is “good enough” for a while.
    • There are also still a few lingering bugs that I need to one day sit down and figure out, but they happen so infrequently that I’ve not found the time so far.
    • Finally, there is also a bunch of other feature work that needs fleshing out that I don’t see myself getting to any time soon, that I’ll dump in the TODO over the next week.
  • Upstream testing.
    Things like my daily running of various stress tools against Linus’ master branch with debug builds won’t happen to the level they were previously. The good news is that instead I’ll be doing a lot more testing on various stable branches, which I never did a whole lot of in the past. I expect over time as I get a better feel for my workload I might be able to ramp back up master testing somewhat, but it will be a secondary thing that I do in addition to everything else. Right now, I’m not really doing any of this, so other people running things like trinity against 3.19/3.20 would probably be a good idea if you like collecting crashes, at least until I find my feet again.
  • Coverity scans.
    • I’ll keep doing the scan.coverity runs hopefully once per -rc. I’ve lapsed a little right now, but will pick this back up again soon to get things back up to current. It takes a lot longer to run a whole scan from home now that I don’t have access to my 24-thread Nehalem (was: < 1 hour for compile, pack & upload to coverity, now: ~4 hours just for the compile), so I'll automate these to run overnight when Linus makes a new snapshot, at least until I put together something a little faster than my ~7 year old core duo.
    • I’m not sure I’m going to have a huge amount of time for triage work yet, so we’ll see how that works out. I might have to focus just on certain areas.
    • There will be some element of related work in my new job, but I’m not sure to what degree yet, more info on that as I figure it out.
  • Fedora.
    The biggest change of all.
    • I’m just not going to have to maintain packages, read mail etc for Fedora, so those all got orphaned yesterday.
    • Josh & Justin pretty much handled all of the Fedora kernel work for the last year or so, so me walking away is not going to make a huge difference there.
    • I might still occasionally take a peek at Fedora bugzilla to see if there’s anything similar to a particular bug, but don’t expect to be doing triage work.
    • I’ll still keep a Fedora box or two at home for a while, but work-wise, I’m expecting a lot more Debian in my life. It’s been over a decade since I last used it seriously. That should prove to be fun.
  • Conferences etc.
    I’ll hopefully be seeing some familiar faces again later this year, and possibly meeting some new ones. The only real change here will be the lack of fudcon/flock/whatever it’s called this week.

So that’s about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated to my new job, so I won’t be as visible as usual. But I’m not going to disappear from the Linux community forever; I made it clear everywhere I interviewed that I didn’t want that to happen.

continuity of various projects is a post from: codemonkey.org.uk

January 05, 2015

Going beyond ZFS by accident

Yesterday, CKS wrote an article that tries to redress the balance a little in the coverage of ZFS. Apparently, detractors of ZFS were pulling quotes from his operational gripes, so he went forceful with the observation that ZFS remains the only viable advanced filesystem (on Linux or not). CKS has no time for your btrfs bullshit.

The situation where weselovskys of the world hold the only viable advanced filesystem hostage and call everyone else "jackass" is very sad for Linux, but it may not be quite so dire, because it's possible that events are overtaking btrfs and ZFS. I am talking about the march of super-advanced, distributed filesystems downstream.

It all started with the move beyond POSIX, which, admittedly, seemed very silly at the time. The early DHT was laughable and I remember how they struggled for years to even enable writes. However, useful software was developed since then.

The poster child of it is Sage's Ceph, which relies on plain old XFS for back-end storage, composes an object storage out of nodes (called RADOS), and layers a POSIX layer on top for those who want it. It is in field use at Dreamhost. I can easily see someone using it where otherwise a ZFS-backed NFS/CIFS cluster would be deployed.

Another piece of software that I place in the same category is OpenStack Swift. I know, Swift is not competing with ZFS directly. The consistency of its meta layer is not sufficient to emulate POSIX anyway. However, you get all those built-in checksums and all that durability jazz that CKS wants. And Swift aims even further up in scale than Ceph, by being in field use at Rackspace. So, what seems to be happening is that folks who really need to go large are willing at times to forsake even compatibility with POSIX, in part to get the benefits that ZFS provides to CKS. Mercado Libre is one well-hyped case of migration from a pile of NFS filers to a Swift cluster.

Now that these systems are established and have proven themselves, I see constant efforts to take them downscale. Original Swift 1.0 did not even work right if you had fewer than 3 nodes (strictly speaking, if you had fewer zones than the replication factor). This has since been fixed by so-called "as good as possible placement" around 1.13 Havana, so you can have 1-node Swift easily. Ceph, similarly, would not consider PGs on the same node healthy and it's a bit of a PITA even in Firefly. So yea, there are issues, but we're working on it. And in the process, we're definitely coming for chunks of ZFS space.

blitz2 for GNOME catastrophe

After putting blitz2 on my Nexus and poking into GUI buttons, I reckoned that it might be time to stop typing in a terminal like a caveman on Linux too (I'm joking, but only just). And the project rolled smoothly for a while. What took me 2 months to accomplish in Android only took 2 days in GNOME. However, 2 lines of code from the end, it all came to an abrupt halt when I found out that it is impossible to access the clipboard from a JavaScript application. See GNOME bugs 579312, 712752.

January 03, 2015

2 months in

Today is my two month monthiversary at my new job. Haven’t had time so far to sit back and reflect and let people know, but now, while packing boxes for our upcoming move downtown, I welcome the distraction.

I dove into the black hole. I joined the borg collective. I’m now working for the little search engine that could.

I sure had my reservations while contemplating this choice. This is the first job I’ve had that I had to interview for – and quite a bit, I might add (though I have to admit that curiosity about the interviewing process is what made me go for the interviews in the first place – I wasn’t even considering a different job at that time). My first job, a four month high school math teaching stint right after I graduated, was suggested to me by an ex-girlfriend, and I was immediately accepted after talking to the headmaster (that job is still a fond memory for many reasons). For my first real job, I informally chatted over dinner with one of the four founders, and then I started working for them without knowing if they were going to pay me. They ended up doing so by the end of the month, and that was that. The next job was offered to me over IRC, and from that Fluendo and Flumotion were born. None of these were through a standard job interview, and when I interviewed at Google I had much more experience on the other side of the interviewing table.

From a bunch of small startups to a company the scale of Google is a big step up, so that was my main reservation. Am I going to be able to adapt to a big company’s way of working? On the other hand, I reasoned, I don’t really know what it’s like to work for a big company, and clearly Google is one of the best of those to work for. I’d rather try out working for a big company while I’m still considered relatively young job-market-wise, so I rack up some experience with both sides of this coin during my professionally mobile years.

But I’m not going to lie either – seeing that giant curious machine from the inside, learning how they do things, being allowed to pierce the veil and peek behind the curtain – there is a curiosity here that was waiting to be satisfied. Does a company like this have all big problems solved already? How do they handle things I’ve had to learn on the fly without anyone else to learn from? I was hiring and leading a small group of engineers – how does a company that big handle that on an industrial scale? How does a search query really work? How many machines are involved?

And Google is delivering in spades on that front. From the very first day, there’s an openness and a sharing of information that I did not expect. (This also explains why I’ve always felt that people who joined Google basically disappeared into a black hole – in return for this openness, you are encouraged to swear yourself to secrecy towards the outside world. I’m surprised that that can work as an approach, but it seems to). By day two we did our first commit (obviously nothing that goes to production, but still.) In my first week I found the way to the elusive (to me at least) roof top terrace by searching through internal documentation. The view was totally worth it.

So far, in my first two months, I’ve only had good surprises. I think that’s normal – even the noogler training itself tells you about the happiness curve, and how positive and excited you feel the first few months. It was easy to make fun of some of the perks from an outside perspective, but what you couldn’t tell from that outside perspective is how these perks are just manifestations of common engineering sense on a company level. You get excellent free lunches so that you go eat with your teammates or run into colleagues and discuss things, without losing brain power on deciding where to go eat (I remember the spreadsheet we had in Barcelona for a while for bike lunch once a week) or losing too much time doing so (in Barcelona, all of the options in the office building were totally shit. If you cared about food it was not uncommon to be out of the office area for ninety minutes or more). You get snacks and drinks so that you know that’s taken care of for you and you don’t have to worry about getting any or leave your workplace for them. There are hammocks and nap pods so you can take a nap and be refreshed in the afternoon. You get massage points for massages because a healthy body makes for a healthy mind. You get a health plan where the good options get subsidized because Google takes that same data-driven approach to HR and has figured out how much they save by not having sick employees. None of these perks are altruistic as such, but there is also no pretense of them being so. They are just good business sense – keep your employees healthy, productive, focused on their work, and provide the best possible environment to do their best work in. I don’t think I will ever make fun of free food perks again given that the food is this good, and possibly the favorite part of my day is the smoothie I pick up from the cafe on the way in every morning. It’s silly, it’s small, and they probably only do it so that I get enough vitamins to not get the flu in winter and miss work, but it works wonders on me and my morning mood.

I think the bottom line here is that you get treated as a responsible adult by default in this company. I remember silly discussions we had at Flumotion about developer productivity. Of course, that was just a breakdown of a conversation that inevitably stooped to the level of measuring hours worked as a measurement of developer productivity, simply because that’s the end point of any conversation on that topic that spirals out of control. Counting hours worked was the only thing that both sides of that conversation understood as a concept, and paying for hours worked was the only thing that both sides agreed on as a basic rule. But I still considered it a major personal fault to have let the conversation back then get to that point; it was simply too late by then to steer it back in the right direction. At Google? There is no discussion about hours worked, work schedule, expected productivity in terms of hours, or any of that. People get treated like responsible adults, are involved in their short-, mid- and long-term planning, feel responsible for their objectives, and allocate their time accordingly. I’ve come in really early and I’ve come in late (by some personal definition of “on time” that, ever since my second job 15 years ago, I was lucky enough to define as ‘10 AM’). I’ve left early on some days and stayed late on more days. I’ve seen people go home early, and I’ve seen people stay late on a Friday night so they could launch a benchmark that was going to run all weekend so there’d be useful data on Monday. I asked my manager one time if I should let him know if I get in later because of a doctor’s visit, and he told me he didn’t need to know, but it helps if I put it on the calendar in case people wanted to have a meeting with me at that hour.

And you know what? It works. Getting this amount of respect by default, and seeing a standard to live up to set all around you – it just makes me want to work even harder to be worthy of that respect. I never had any trouble motivating myself to do work, but now I feel an additional external motivation, one this company has managed to create and maintain over the fifteen+ years they’ve been in business. I think that’s an amazing achievement.

So far, so good, fingers crossed, touch wood and all that. It’s quite a change from what came before, and it’s going to be quite the ride. But I’m ready for it.

(On a side note – the only time my habit of wearing two different shoes was ever considered a no-no for a job was for my previous job – the dysfunctional one where they still owe me money, among other stunts they pulled. I think I can now empirically elevate my shoe habit to a litmus test for a decent job, and I should have listened to my gut on the last one. Live and learn!)


December 31, 2014

blitz2 for Android

An additional upside for blitz2 is that an HTTP client is available on just about any platform. So if I want to share the clipboard with my Nexus tablet, I can, without running an sshd on it.

This is my first Android app, and I haven't touched Java in many years. So, first impressions.

I forgot how insanely wordy Java is. And doing anything takes effort, with all the factories, accessors, and whatnot.

I like checked exceptions. Too bad Python doesn't have them (probably impossible by the very nature of a dynamic language, but I've been bitten by an unexpected exception floating up from the depth of the stack before).

Android docs are excellent and one almost never needs to search for answers. Unfortunately, I managed to step into one such case: the so-called "Up" navigation. My chosen API level is 11. The contemporary docs explain how to emulate "Up" using compatibility libraries for APIs before mine, and explain how to use onNavigateUp(), which arrives in API level 16. But there's absolutely nothing, nowhere, that tells how to do it in API 11. I was walking in circles for days. The answer is actually a secret ID namespace, particularly android.R.id.home. I would never have figured it out if not for random pieces of code on the Internet. Good grief, Google. So close to perfect marks.

Oh, and one more thing: Googlers score good sanity points for reimplementing a stock Java API for HTTP (HttpURLConnection and friends). They could've easily rolled their own, but they didn't. They wrote their own runtime, but it's fully compatible with Oracle, including dark corners of SSL. This makes it possible to debug most of the difficult parts on a Linux box. Very nice. Just to see what it could be otherwise, look at their gratuitously incompatible Base64.

UPDATE: I forgot to mention that I started with Eclipse, but it was entirely unusable due to crashing all the time (about once an hour, for no discernible reason). I was on Fedora 20 at the time. So, I used command-line tools, and that worked like a charm. There's a Makefile in the blitz2 repo linked above.

December 22, 2014

Cheating around taskotron in Fedora

Yesterday's ntp vulnerability uncovered a trick for Fedora maintainers. You know how it's super annoying that you cannot push an update to F20 without F21? You must herd updates and can never do them in parallel, or else taskotron ruins innocent updates. But at the time of this writing the fixes are live in F20, but not in F21. How does Miroslav do it?

The answer is easy: he keeps ntp intentionally a few releases back in older Fedora (4.2.6p5-19 in F20), so he can bump it with impunity without regard to the newer Fedora (4.2.6p5-25 in F21). Of course, if someone were to upgrade to F21 today, he'd go from a fixed ntp to a broken ntp, but hey... at least the automated checks are defeated.
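To make the trick concrete, here is a rough Python sketch of the kind of ordering rule an upgrade-path check enforces. It is a deliberate simplification and not taskotron's code: real tooling compares full RPM epoch/version/release strings (with dist tags), while this only compares the numeric release of two builds of the same upstream version. The function names are made up for illustration, and the "-20" and "-26" builds in the usage note are hypothetical.

```python
def split_build(build):
    """Split 'VERSION-RELEASE' (e.g. '4.2.6p5-19') into (version, release)."""
    version, release = build.rsplit("-", 1)
    return version, int(release)


def upgrade_path_ok(older_fedora_build, newer_fedora_build):
    """Simplified rule: the build in the newer Fedora must not be lower
    than the build in the older Fedora.

    Assumption: both builds share the same upstream version, so only the
    numeric release needs comparing. Real RPM comparison is more involved.
    """
    old_version, old_release = split_build(older_fedora_build)
    new_version, new_release = split_build(newer_fedora_build)
    if old_version != new_version:
        raise ValueError("this sketch only compares builds of one upstream version")
    return old_release <= new_release
```

With the numbers above, upgrade_path_ok("4.2.6p5-19", "4.2.6p5-25") holds, and it would keep holding for a hypothetical fixed F20 build such as "4.2.6p5-20", because the gap up to 25 leaves room to bump. A hypothetical "4.2.6p5-26" in F20 would fail the rule. Note that the rule says nothing about which build actually contains the fix, which is exactly the loophole.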

This challenge is similar to writing super ugly OpenStack code that passes PEP8 checks, only the outcome is actually dangerous today.

December 19, 2014

Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.


In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare". At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

  • Our insistence on shipping the latest code, with as few 'special sauce' patches as possible, won over a lot of upstream developers that wouldn't have given us the time of day for similar bugs back in the RHL days. Sometimes painful for our users, but Linux as a whole got better because of our stance here.
  • Decisions like Fedora enabling debug options by default in betas shook out an unbelievable number of bugs almost as soon as they got introduced. Again, painful for users, but from a quality standpoint, we found a ton of bugs in code others were racing to ship first and call "enterprise ready".
  • Fedora enabling features sometimes before they were fully baked got us a lot of love from their respective upstream maintainers.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.

Moving on from Red Hat. is a post from: codemonkey.org.uk

December 12, 2014

blitz2

You know how some people attach several monitors to one PC? I don't. I just have several PCs. But then I want copy-paste to work transparently (as transparently as possible). For several years I used blitz to copy the clipboard. It works well enough, but once you have 3 computers, it gets somewhat cumbersome to type the hostname. Also, it always bothered me how it rides ssh authentication. I wanted something independent from ssh.

Behold blitz2. Instead of passing the clipboard to the host where it's needed directly, the clipboard is uploaded to an HTTP server. Seems more complex at first, but it's actually much better, because previously the PC where you copy had to authenticate to the PC where you paste. Now the authentication is symmetric. So, all clients are configured exactly the same, and all can upload and download the clipboard no matter who trusts what ssh keys.
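As a rough illustration of what "symmetric" means here, consider the Python sketch below. It is not blitz2's actual protocol: the URL, the use of HTTP Basic authentication, and the PUT/GET split are all assumptions made for the example. The only point it makes is that upload and download are built from one identical configuration, so no machine needs to trust another machine's keys.

```python
import base64
import urllib.request

# Hypothetical settings; every machine would carry the same three values.
CONFIG = {
    "url": "http://clipboard.example.com/clip",
    "user": "me",
    "password": "secret",
}


def _authorized_request(config, method, data=None):
    """Build (but do not send) a request carrying the shared credentials."""
    raw = f"{config['user']}:{config['password']}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    req = urllib.request.Request(config["url"], data=data, method=method)
    req.add_header("Authorization", "Basic " + token)
    return req


def upload_request(config, clipboard_text):
    """Request that would store the clipboard on the server (assumed PUT)."""
    return _authorized_request(config, "PUT", clipboard_text.encode("utf-8"))


def download_request(config):
    """Request that would fetch the clipboard from the server (assumed GET)."""
    return _authorized_request(config, "GET")
```

Either request could then be sent with urllib.request.urlopen(). Since both directions authenticate the same way against the same server, adding a third or fourth PC is a matter of copying the configuration.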

December 02, 2014

Free-riding and copyleft in cultural commons like Flickr

Flickr recently started selling prints of Creative Commons Attribution-Share Alike photos without sharing any of the revenue with the original photographers. When people were surprised, Flickr said “if you don’t want commercial use, switch the photo to CC non-commercial”.

This seems to have mostly caused two reactions:

  1. “This is horrible! Creative Commons is horrible!”
  2. “Commercial reuse is explicitly part of the license; I don’t understand the anger.”

I think it makes sense to examine some of the assumptions those users (and many license authors) may have had, and what that tells us about license choice and design going forward.

Free ride!!, by Dhinakaran Gajavarathan (https://www.flickr.com/photos/dhinakaran/), under CC BY 2.0

Free riding is why we share-alike…

As I’ve explained before here, a major reason why people choose copyleft/share-alike licenses is to prevent free rider problems: they are OK with you using their thing, but they want the license to nudge (or push) you in the direction of sharing back/collaborating with them in the future. To quote Elinor Ostrom, who won a Nobel for her research on how commons are managed in the wild, “[i]n all recorded, long surviving, self-organized resource governance regimes, participants invest resources in monitoring the actions of each other so as to reduce the probability of free riding.” (emphasis added)

… but share-alike is not always enough

Copyleft is one of our mechanisms for this in our commons, but it isn’t enough. I think experience in free/open/libre software shows that free rider problems are best prevented when three conditions are present:

  • The work being created is genuinely collaborative — i.e., many authors who contribute similarly to the work. This reduces the cost of free riding to any one author. It also makes it more understandable/tolerable when a re-user fails to compensate specific authors, since there is so much practical difficulty for even a good-faith reuser to evaluate who should get paid and contact them.
  • There is a long-term cost to not contributing back to the parent project. In the case of Linux and many large software projects, this long-term cost is about maintenance and security: if you’re not working with upstream, you’re not going to get the benefit of new fixes, and will pay a cost in backporting security fixes.
  • The license triggers share-alike obligations for common use cases. The copyleft doesn’t need to perfectly capture all use cases. But if at least some high-profile use cases require sharing back, that helps discipline other users by making them think more carefully about their obligations (both legal and social/organizational).

Alternately, you may be able to avoid damage from free rider problems by taking the Apache/BSD approach: genuinely, deeply educating contributors, before they contribute, that they should only contribute if they are OK with a high level of free riding. It is hard to see how this can work in a situation like Flickr’s, because contributors don’t have extensive community contact.[1]

The most important takeaway from this list is that if you want to prevent free riding in a community-production project, the license can’t do all the work itself — other frictions that somewhat slow reuse should be present. (In fact, my first draft of this list didn’t mention the license at all — just the first two points.)

Flickr is practically designed for free riding

Flickr fails on all the points I’ve listed above — it has no frictions that might discourage free riding.

  • The community doesn’t collaborate on the works. This makes the selling a deeply personal, “expensive” thing for any author who sees their photo for sale. It is very easy for each of them to find their specific materials being reused, and see a specific price being charged by Yahoo that they’d like to see a slice of.
  • There is no cost to re-users who don’t contribute back to the author—the photo will never develop security problems, or get less useful with time.
  • The share-alike doesn’t kick in for virtually any reuses, encouraging Yahoo to look at the relationship as a purely legal one, and encouraging them to forget about the other relationships they have with Flickr users.
  • There is no community education about the expectations for commercial use, so many people don’t fully understand the licenses they’re using.

So what does this mean?

This has already gone on too long, but a quick thought: what this suggests is that if you have a community dedicated to creating a cultural commons, it needs some features that discourage free riding — and critically, mere copyleft licensing might not be good enough, because of the nature of most production of commons of cultural works. In Flickr’s case, maybe this should simply have included not doing this, or making some sort of financial arrangement despite what was legally permissible; for other communities and other circumstances other solutions to the free-rider problem may make sense too.

And I think this argues for consideration of non-commercial licenses in some circumstances as well. This doesn’t make non-commercial licenses more palatable, but since commercial free riding is typically people’s biggest concern, and other tools may not be available, it is entirely possible it should be considered more seriously than free and open source software dogma might have you believe.

  1. It is open to discussion, I think, whether this works in Wikimedia Commons, and how it can be scaled as Commons grows.

November 30, 2014

23 years

(This post is only about music – for people not from Belgium, Luc de Vos, singer of Gorki, passed away yesterday at 52)

I am 15. I hear a song on the radio, and I don’t understand the lyrics. Why would you ask a piranha to devour you? Still, I’m intrigued. I’d only really gotten into music little by little. My earliest musical memory is hearing my parents’ record player playing ‘I want you’ by Bob Dylan. After that, it was my inexplicable arousal at seeing the ‘Hey You the Rock Steady Crew’ video in 1983 when I was 7, getting the Top Gun soundtrack on cassette (my first ever music purchase) in 1986, and watching the video for ‘I want your sex’ by George Michael in 1987 over and over on my recording of Veronica’s “Countdown”. At my confirmation (12 years old), when kids typically get some kind of bigger gift they’ve been dreaming of for a long time, I still chose a computer instead of a stereo.

I am 16, I just had my birthday. I am doing a summer job at my family’s company (which processes animal fat) and I am staying with my grandparents in Bavegem. With the money from my birthday I bought a portable stereo CD/cassette player for the incredible amount of 6000 BEF (or 150 euro as the kids would call it these days). I listen to nothing else for weeks on end. I can still hum the amazingly beautiful piano part that closes Mia from memory. It’s been my favorite song ever since.

I am 17, and learning the guitar. It turns out that Mia is quite complicated to get right, because of that perfect 3/4-5/4 tempo, or whatever you’d call it if you knew anything about music. It doesn’t help that I’m left-handed playing on a right-handed guitar, but I make the song my own. To this day though, I still cannot play and sing it at the same time. There is something about the timing of how that third line starts before the music starts, where he sings ‘Mensen als ik’, that I just can’t figure out. It’s magic – it makes this song all the better.

I am 17, and Gorky is now Gorki, with completely new band members. I see them live for the first time, at ‘De Kring’ in Merelbeke, with my best friend Jeremy. I wish I had bought all the t-shirts that night – they had a different one for each of the new songs. The album sounds so different – parts of it recorded in Africa. I don’t listen to that album enough, but I still love playing Berejager on guitar, such a beautiful intro.

I am 17, and it’s my last year of boy scouts before becoming a leader. I have a mini-JIN camp called JINTRO during the year, that ends with a party. I dance with a girl to Mia, and one minute into the dance she says, ‘no no, we’re not going to do a one-tile-dance for the rest of the night. Here’s how you do it’ and she teaches me two basic moves to make a slow dance more interesting. Thank you, Karlien, for changing my life.

I am 18, and we travel through Catalunya with the boy and girl scouts group I’m in, and a local Catalan group. This is one of the CD’s we brought with us as a sample of our own culture. The Catalans love it – they say it sounds like Bruce Springsteen. I can see where they’re coming from. At the end of the two weeks, the guitar player of their group nails down a really good version of Mia (without the words of course).

I am 18, and have my first serious girlfriend. Mia is a song that runs through our history together – we must have danced to it at every party that played it (she messaged me yesterday that she immediately thought of me when she heard the news… just like I did of her). Back then, parties still had blocks of 3 slow songs every one or two hours. I miss that tradition… The moves that Karlien taught me put me well ahead of the pack of my fellow young adult males, and that paid off generously in the young adult females agreeing to dance with me at every party. (The theory of compounded interest clearly put in practice, now that I think of it)

I am 19, and one of my fellow boy scout leaders gives me an old demo cassette of Gorky. Among other things, it contains a cover of the Pixies’ “Monkey Gone to Heaven”, some of their songs that didn’t make their debut (but appeared on Boterhammen, like ‘Ik word oud’, or were turned into a b-side). It also contains the original version of Mia, as a fast-paced slurred-sung rocker. They made the right call slowing it down.

I am 21, and I have a radio show at a student radio I helped start up. I am too young to know how the world really works and just send out interview requests to managers and record labels for bands that I like. In those days, I got to interview my favorite band, The Afghan Whigs, as well as other bands like Everclear and The Sheila Divine. But we also managed to get Luc De Vos as a guest on our radio show, and Jeremy and I interviewed him in between songs for an hour. (That tape is at my parents’ place. I have an Excel sheet that tells me exactly which box it’s in, and I hope I can recover it next time I go to Belgium.) I tell him about that demo tape that I have, and he asks for a copy. A little after that, I bring him a copy of that practice tape, I put ‘Congregation’ by The Afghan Whigs on the other side (because I want one of my favorite bands to know another one of my favorite bands), and I go past his house to drop it off. (From the news report this weekend I hear he still lived in the same street, so I can only assume he was still living in the same house he’s lived in for the last 17 years).

I am 22, and Luc De Vos plays solo at the university somewhere, in an auditorium. I think it was one of the first times he ever did that. He probably already read out a column he wrote. But I remember how amazing he was by himself, what beautiful versions of these songs that I knew so well he played, songs that usually they didn’t play live because they were the slower ones. ‘Arme Jongen’, I remember him playing it there like it was yesterday.

I am 26, and I see him at various festivals, always there to either play or enjoy the music. I see him backstage with his son, recently born. He is walking around with some kind of elastic band tied around his waist that keeps his kid from running away more than ten meters from him, and it is hilarious to see in the backstage area.

Time starts moving quicker as I grow up, become an adult, and graduate from college. More and more albums. Every album still contained at least one killer song. ‘Leve de Lente’ still gives me goosebumps when those guitars crash in. ‘Vaarwel Lieveling’ is possibly his most underrated song – I don’t think I’ve ever heard that one played live. ‘Ode an die freude’, ‘We zijn zo jong’, ‘Duitsland wint altijd’ – I love the sound of resignation he has in his voice, like a deep sigh put to music. That album came with a floppy disk (!) with the lyrics. ‘Het voorspel was moordend’, ‘Tijdbom’ – while the music came back to being a bit more conventional, the lyrics got more hermetically sealed. I must admit that I slowly lost track – having moved to Barcelona at some point, it was much harder to catch them live of course. I know their first five albums the best, and while I still bought the others (having missed only one), none of them had the luxury of not having any other album in my collection to compete with like their debut album had. But there is no denying that when they were great, they were still amazing. A song like ‘Veronica komt naar je toe’ managed to pull together so many different things. The title was a recurring slogan of a Dutch channel that was popular among young people in Belgium, for lack of a Belgian alternative. Here’s a great song, with a great chorus, and his ability to sample just this one sentence to evoke a memory of youth every one of my generation remembers (while it evoked at the same time my personal memory of seeing ‘I want your sex’ on Veronica). And then he manages to evoke such a common feeling everyone has, where you are trying to grab that fleeting thing you were thinking just a second ago, straddling typically complicated-to-phrase words in Dutch with effortless ease – ‘Wat was het nu ook alweer/dat ik wou doen/het was iets belangrijks’ (or ‘What was it again/I wanted to do/it was something important’).
In the beginning, his lyrics were quirky in ideas, but fairly straightforward in their phrasing. Further on in their career, they experimented quite a bit musically, but especially the lyrics could get complicated, and with exceptional and inventive phrasing.

I’m 31, and I live in Barcelona, but I travel back to Belgium because Gorki is playing their debut album, Gorky. I wrote about that concert back then, but that memory is still strong. I can’t believe that was 7 years ago…

I always enjoyed reading his columns in Zone 09 whenever I was in my hometown, I thought he had a great gift for writing. I noticed just now he left behind quite a few more books than I had, so I started tracking those down. So many of my memories have his music attached to them. His was the first band that opened me up to a wider range of music, away from the mainstream (not everybody would agree I guess, but I never considered them mainstream. Their debut album certainly was different enough from whatever was considered mainstream at the time, and as often happens this debut was only widely recognized several albums into their career later, while at the same time those later albums never really got the same kind of traction.)

I loved his way of looking at the world, the way he described it in music, lyrics, writing, and interviews. Always with that cheeky look. As for so many of my generation, surprisingly it now turns out, his music was intertwined with my growing up. Here’s a man I was hoping would live long and make much more music, and grow old playing hundreds of songs in bars and clubs, but it wasn’t to be. He set out to be a successful rock singer, whether that was tongue-in-cheek or not, and by all accounts he achieved what he set out to do. And everything he did, he did it for the best of reasons. He did it for ‘a fistful of bonnekes’.


November 29, 2014

is this a protocol? displaylink3

I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or is the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.

November 11, 2014

systemd For Administrators, Part XXI

Container Integration

For a while now, containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, but is less flexible, as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions.

Now that we have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

# systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much for the machinectl tool. It knows a few more commands; please check the man page for details. Note again that even though we use systemd-nspawn as the container manager here, the concepts apply to any container manager that implements the logic described here, including libvirt-lxc, for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, by repeating the systemd-nspawn command from above):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similarly, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (The output is shortened here; the blog story is already getting too long.)

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Here we first see the units of the host, followed by the units of the one container we currently have running. The units of containers are prefixed with the container name and a colon (":"). (The output is shortened again for brevity's sake.)

The list-machines subcommand of systemctl shows a list of all running containers, querying the system managers within the containers about system state and health. More specifically, it shows whether containers are properly booted up, or whether there are any failed services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state being degraded.
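If you want to act on that table from a script, a small filter is enough. The following is only a sketch: the helper name failed_machines is made up for illustration, and it assumes the column layout shown above (NAME, STATE, FAILED, JOBS, plus a "machines listed." footer), which other systemd versions may decorate differently.

```shell
#!/bin/sh
# failed_machines (hypothetical helper): read `systemctl list-machines`
# style rows on stdin and print the name of every machine whose FAILED
# count is non-zero. Assumes the column layout shown in the listing above.
failed_machines() {
    awk '
        # Skip the header row, blank lines and the "N machines listed." footer.
        $1 == "NAME" || NF == 0 || /machines listed\.$/ { next }
        # FAILED is the second-to-last field and JOBS the last one; counting
        # from the end copes with the extra "(host)" word on the host row.
        $(NF-1) + 0 > 0 { print $1 }
    '
}

# Intended use on a live system (not executed here):
#   systemctl list-machines | failed_machines
```

Fed the listing above, this would print only miau, the one machine with a failed service.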

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.
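The MACHINE column also lends itself to quick summaries. As a rough sketch (the helper name count_by_machine is invented here, and it assumes a procps build that knows the machine column, as in the listing above), this counts processes per machine:

```shell
#!/bin/sh
# count_by_machine (hypothetical helper): read the output of
# `ps -eo machine` on stdin (a MACHINE header followed by one value per
# process) and print "<count> <machine>" for each distinct value.
# "-" stands for the host, as in the listing above.
count_by_machine() {
    awk '
        NR == 1 && $1 == "MACHINE" { next }      # drop the header row
        NF > 0 { count[$1]++ }                   # tally one process
        END { for (m in count) print count[m], m }
    ' | LC_ALL=C sort -k2
}

# Intended use on a live system (not executed here):
#   ps -eo machine | count_by_machine
```

The sort at the end only makes the output order stable; awk itself does not guarantee any particular order for the tally.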

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus, sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.

systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host, networkd will by default provide a DHCP server and IPv4LL on any veth network interface whose name is ve- followed by a container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module, nss-mymachines, that makes the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above, however, the container shares the network configuration with the host; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now (assuming that networkd is used in the container and outside), we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping; it works with all other tools that use libc's gethostbyname() or getaddrinfo() too, among them the venerable ssh.
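If the name does not resolve, a first thing to check is whether the NSS module is enabled for host lookups at all. Here is a minimal sketch, assuming the module is listed as mymachines on the hosts: line of /etc/nsswitch.conf; the helper name hosts_uses_mymachines is made up for this example:

```shell
#!/bin/sh
# hosts_uses_mymachines (hypothetical helper): succeed if the "hosts:" line
# of an nsswitch.conf-style file lists the mymachines module.
# The file defaults to /etc/nsswitch.conf; pass another path to test a copy.
hosts_uses_mymachines() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++)
                if ($i == "mymachines") found = 1
        }
        END { exit !found }
    ' "${1:-/etc/nsswitch.conf}"
}

# Intended use on a live system (not executed here):
#   hosts_uses_mymachines && getent hosts mycontainer
```

getent hosts performs its lookup through the same NSS configuration as other libc users, which makes it a handy way to check resolution without sending any packets.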

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but also covers VMs to a certain degree. However, the integration is not as close, because accessing a VM's internals is not as easy as for containers: it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.

systemd For Administrators, Part XXI

Container Integration

Since a while containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, however is less flexible as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distribution.

We now have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple of more commands, please check the man page for details. Note again that even though we use systemd-nspawn as container manager here the concepts apply to any container manager that implements the logic described here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, repeating the systemd-nspawn command from above.):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similar, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here, the blog story is already getting too long).

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, then followed by the units of the one container we have currently running. The units of the containers are prefixed with the container name, and a colon (":"). (The output is shortened again for brevity's sake.)

The list-machines subcommand of systemctl shows a list of all running containers, inquiring the system managers within the containers about system state and health. More specifically it shows if containers are properly booted up, or if there are any failed services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state to be degraded.

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.

systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host networkd will by default provide a DHCP server and IPv4LL on veth network interface named ve- followed by a container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module nss-mymachines that make the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above the the container shares the network configuration with the host however; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now, (assuming that networkd is used in the container and outside) we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them the venerable ssh.
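As a rough illustration of "any tool that uses getaddrinfo()", here is a minimal Python sketch. Python's socket.getaddrinfo() goes through the libc resolver, so on a host where nss-mymachines is enabled in /etc/nsswitch.conf it should see container names as well. The helper name resolve_all and its resolver parameter are made up for this example.

```python
import socket


def resolve_all(name, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses the resolver reports for name.

    `resolver` defaults to the libc-backed socket.getaddrinfo; it is a
    parameter only so the lookup can be swapped out (hypothetical helper).
    """
    results = resolver(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address for both IPv4 and IPv6.
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in results})
```

Calling resolve_all("mycontainer") on the host from the example above would be expected to return the container's address; if the name cannot be resolved, socket.gaierror is raised.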

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but covers VMs too to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, as it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.

November 07, 2014

more on Displaylink3 and HDCP encryption

okay another braindump (still nothing working).

The git repo mentioned in previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated an HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet the USB traces contain is 384 bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try and use this fact. So I've taken a trace I made from Windows, and extracted the necessary bits, and using that I've managed to derive the master key used in that trace, and subsequently managed to derive the session key for it. So I've replayed the first encrypted packet from the trace to the device and got an encrypted response the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me? In theory I should be able to replay the full trace (haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.
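To make those inputs concrete, here is a small Python sketch of how the AES key and the 128-bit counter block could be assembled from ks, lc128, riv, streamctr and inputctr. The byte layout (streamctr XORed into the low 32 bits of riv, followed by a 64-bit inputctr) is an assumption based on one reading of the HDCP 2.x spec, not something confirmed against the device, and the function names are invented for illustration. No AES is performed here, and any lc128 passed in is a placeholder.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def hdcp_cipher_inputs(ks, lc128, riv, stream_ctr=0, input_ctr=0):
    """Return (aes_key, counter_block) for the assumed HDCP 2.x layout."""
    if len(ks) != 16 or len(lc128) != 16 or len(riv) != 8:
        raise ValueError("ks and lc128 must be 16 bytes, riv 8 bytes")
    # The AES key is the session key XORed with the global constant.
    key = xor_bytes(ks, lc128)
    # Assumption: streamctr (32 bits) is XORed into the low 32 bits of riv,
    # then the 64-bit inputctr is appended to form the counter block.
    riv_mixed = int.from_bytes(riv, "big") ^ (stream_ctr & 0xFFFFFFFF)
    counter_block = riv_mixed.to_bytes(8, "big") + input_ctr.to_bytes(8, "big")
    return key, counter_block
```

With streamctr and inputctr both 0, as guessed above for the first packet, the counter block is simply riv followed by eight zero bytes.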

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official hdcp lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point. Now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space, probably impossible.

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

November 03, 2014

Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump: a crashdump mechanism that uses kexec to switch to a different kernel before writing out memory to disk, NFS, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Someone every single RHEL release invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.

Thoughts on crashdumps. is a post from: codemonkey.org.uk

October 31, 2014

a day with DisplayLink USB3 and HDCP

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Notes, HDCP1.0 being broken doesn't matter to this, maybe HDCPv2.0 being a bit broken could be used, but I'm not sure how!

The displaylink USB3 protocol is based on the HDCP protocol. I've traced the first few packets and it clearly looks like the host sends two packets

AKE_Init,
AKE_Transmitter_Info

and the device sends back
AKE_Send_Cert

at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, that you have to verify.
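As a sketch of how those 522 bytes might break down, here is a small Python parser. The field sizes (5-byte receiver id, 128-byte modulus plus 3-byte exponent for the public key, 2 misc bytes, 384-byte signature) are an assumption based on one reading of the HDCP 2.x spec; they do add up to 522, but have not been checked against a DisplayLink device. All names are made up for the example, and no signature verification is attempted.

```python
from typing import NamedTuple

# Assumed field sizes in bytes (one reading of the HDCP 2.x spec).
RECEIVER_ID_LEN = 5
MODULUS_LEN = 128
EXPONENT_LEN = 3
MISC_LEN = 2
SIGNATURE_LEN = 384
CERT_LEN = RECEIVER_ID_LEN + MODULUS_LEN + EXPONENT_LEN + MISC_LEN + SIGNATURE_LEN


class ReceiverCert(NamedTuple):
    receiver_id: bytes
    modulus: int
    exponent: int
    misc: bytes
    signature: bytes
    signed_part: bytes  # everything the signature is assumed to cover


def parse_cert(blob: bytes) -> ReceiverCert:
    """Split a 522-byte certificate blob into its assumed fields."""
    if len(blob) != CERT_LEN:
        raise ValueError(f"expected {CERT_LEN} bytes, got {len(blob)}")
    pos = 0
    fields = []
    for size in (RECEIVER_ID_LEN, MODULUS_LEN, EXPONENT_LEN, MISC_LEN, SIGNATURE_LEN):
        fields.append(blob[pos:pos + size])
        pos += size
    receiver_id, modulus, exponent, misc, signature = fields
    return ReceiverCert(
        receiver_id=receiver_id,
        modulus=int.from_bytes(modulus, "big"),
        exponent=int.from_bytes(exponent, "big"),
        misc=misc,
        signature=signature,
        signed_part=blob[:CERT_LEN - SIGNATURE_LEN],
    )
```

Verifying the cert would then mean checking the signature over signed_part with the DCP LLC public key.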

So the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the cert using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing. If you are openssl knowledgeable and want to look, the hack fest is
http://cgit.freedesktop.org/~airlied/dl3dev/

It might be that the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.

Now once I get past this hurdle, the larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128, that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the displaylink driver in hex, but I'd hope they at least hide it better than that. It may also be supplied by the OS, Windows or OSX. (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that Displaylink allow non-HDCP encrypted data to be sent to the device, in which case win if I can find out where/how to do that, or it might be the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P

October 29, 2014

Understanding Wikimedia, or, the Heavy Metal Umlaut, one decade on

It has been nearly a full decade since Jon Udell’s classic screencast about Wikipedia’s article on the Heavy Metal Umlaut (current text, Jan. 2005). In this post, written for Paul Jones’ “living and working online” class, I’d like to use the last decade’s changes to the article to illustrate some points about the modern Wikipedia.[1]

Measuring change

At the end of 2004, the article had been edited 294 times. As we approach the end of 2014, it has now been edited 1,908 times by 1,174 editors.[2]

This graph shows the number of edits by year – the blue bar is the overall number of edits in each year; the dotted line is the overall length of the article (which has remained roughly constant since a large pruning of band examples in 2007).

[Figure: Edits by year]

The dropoff in edits is not unusual — it reflects both a mature article (there isn’t that much more you can write about metal umlauts!) and an overall slowing in edits in English Wikipedia (from a peak of about 300,000 edits/day in 2007 to about 150,000 edits/day now).[3]

The overall edit count — 2000 edits, 1000 editors — can be hard to get your head around, especially if you write for a living. Implications include:

  • Style is hard. Getting this many authors on the same page, stylistically, is extremely difficult, and it shows in inconsistencies small and large. If not for the deeply acculturated Encyclopedic Style we all have in our heads, I suspect it would be borderline impossible.
  • Most people are good, most of the time. Something like 3% of edits are “reverted”; i.e., about 97% of edits are positive steps forward in some way, shape, or form, even if imperfect. This is, I think, perhaps the single most amazing fact to come out of the Wikimedia experiment. (We reflect and protect this behavior in one of our guidelines, where we recommend that all editors Assume Good Faith.)

The name change, tools, and norms

In December 2008, the article lost the “heavy” from its name and became, simply, “metal umlaut” (explanation, aka “edit summary”, highlighted in yellow):

[Figure: Name change]

A few takeaways:

  • Talk pages: The screencast explained one key tool for understanding a Wikipedia article – the page history. This edit summary makes reference to another key tool – the talk page. Every Wikipedia article has a talk page, where people can discuss the article, propose changes, etc. In this case, this user discussed the change (in November) and then made the change in December. If you’re reporting on an article for some reason, make sure to dig into the talk page to fully understand what is going on.
  • Sources: The user justifies the name change by reference to sources. You’ll find little reference to them in 2005, but by 2008, finding an old source using a different term is now sufficient rationale to rename the entire page. Relatedly…
  • Footnotes: In 2008, there was talk of sources, but still no footnotes. (Compare the story about Motley Crue in Germany in 2005 and now.) The emphasis on footnotes (and the ubiquitous “citation needed”) was still a growing thing. In fact, when Jon did his screencast in January 2005, the standardized/much-parodied way of saying “citation needed” did not yet exist, and would not until June of that year! (It is now used in a quarter of a million English Wikipedia pages.) Of course, the requirement to add footnotes (and our baroque way of doing so) may also explain some of the decline in editing in the graphs above.

Images, risk aversion, and boldness

Another highly visible change is to the Motörhead art, which was removed in November 2011 and replaced with a Mötley Crüe image in September 2013. The addition and removal present quite a contrast. The removal is explained like this:

remove File:Motorhead.jpg; no fair use rationale provided on the image description page as described at WP:NFCC content criteria 10c

This is clear as mud, combining legal issues (“no fair use rationale”) with Wikipedian jargon (“WP:NFCC content criteria 10c”). To translate it: the editor felt that the “non-free content” rules (abbreviated WP:NFCC) prohibited copyrighted content unless there was a strong explanation of why the content might be permitted under fair use.

This is both great, and sad: as a lawyer, I’m very happy that the community is pre-emptively trying to Do The Right Thing and take down content that could cause problems in the future. At the same time, it is sad that the editors involved did not try to provide the missing fair use rationale themselves. Worse, a rationale was added to the image shortly thereafter, but the image was never added back to the article.

So where did the new image come from? Simply:

boldly adding image to lead

“boldly” here links to another core guideline: “be bold”. Because we can always undo mistakes, as the original screencast showed about spam, it is best, on balance, to move forward quickly. This is in stark contrast to traditional publishing, which has to live with printed mistakes for a long time and so places heavy emphasis on Getting It Right The First Time.

In brief

There are a few other changes worth pointing out, even in a necessarily brief summary like this one.

  • Wikipedia as a reference: At one point, in discussing whether or not to use the phrase “heavy metal umlaut” instead of “metal umlaut”, an editor makes the point that Google has many search results for “heavy metal umlaut”, and another editor points out that all of those search results refer to Wikipedia. In other words, unlike in 2005, Wikipedia is now so popular, and so widely referenced, that editors must be careful not to (indirectly) be citing Wikipedia itself as the source of a fact. This is a good problem to have—but a challenge for careful authors nevertheless.
  • Bots: Careful readers of the revision history will note edits by “ClueBot NG”. Vandalism of the sort noted by Jon Udell has not gone away, but it now is often removed even faster with the aid of software tools developed by volunteers. This is part of a general trend towards software-assisted editing of the encyclopedia.
  • Translations: The left hand side of the article shows that it is in something like 14 languages, including a few that use umlauts unironically. This is not useful for this article, but for more important topics, it is always interesting to compare the perspective of authors in different languages.

Other thoughts?

I look forward to discussing all of these with the class, and to any suggestions from more experienced Wikipedians for other lessons from this article that could be showcased, either in the class or (if I ever get to it) in a one-decade anniversary screencast. :)

  1. I still haven’t found a decent screencasting tool that I like, so I won’t do proper homage to the original—sorry Jon!
  2. Numbers courtesy X’s edit counter.
  3. It is important, when looking at Wikipedia statistics, to distinguish between stats about Wikipedia in English, and Wikipedia globally — numbers and trends will differ vastly between the two.

October 24, 2014

Introducing Gthree

I’ve recently been working on OpenGL support in Gtk+, and last week it landed in master. However, the demos we have are pretty lame and are not very good to show off or even test the OpenGL support. I’ve looked around for some open source demos that used modern GL that we could use, but I didn’t find anything that we could easily use.

What I did find though, was a lot of WebGL demos that used three.js. This looked like a very nice open source library for high-level 3d rendering. At first I had some plans to bind OpenGL to gjs so that we could run three.js, but this turned out to be hard.

Instead I started converting three.js into C + GObject, using the Gtk+ OpenGL support and the vector/matrix library graphene that Emmanuele has been working on recently.

After about a week of frantic hacking it is now at a stage where it may be interesting for others. So, without further ado I introduce:

https://github.com/alexlarsson/gthree

It does not yet support everything that three.js can do, but it does support meshes with most mesh material types and lighting, including a loader for the json model format of three.js, which means that it is minimally useful.

Here are some screenshots of the examples that ship with the code:

[Screenshot: Various types of materials]
[Screenshot: Some sample models from three.js examples]
[Screenshot: Some random cubes]

This has been a lot of fun to work on as I’ve seen a lot of progress very fast. Mad props to mrdoob and the other three.js developers for creating three.js and making it free software. Gthree is a huge rip-off of their work and would never be possible without it. Thanks also to Emmanuele for his graphene library.

What are you sitting here for, go ahead and play with it! Make some demos, port some more three.js features, marvel at the fancy graphics!

October 23, 2014

Trinity and pages of random data.

Something trinity uses a lot is pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’: a single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnants of the supporting code were some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After today’s work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mmap’s after a mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code, and unifying a lot of the support code is that a lot of code paths get simpler, and more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.
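A loose Python sketch of the idea, for readers who don't want to dig through the C: one "dirty" routine that works on a mapping struct and takes a length, so it cannot scribble past the end of the buffer. The Mapping class and the function names are hypothetical stand-ins, not trinity's actual structures.

```python
import mmap
import os
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical stand-in for a map struct: a buffer plus its size."""
    buf: mmap.mmap
    size: int


def new_mapping(size: int = mmap.PAGESIZE) -> Mapping:
    """Create an anonymous mapping of the given size."""
    return Mapping(mmap.mmap(-1, size), size)


def dirty_mapping(m: Mapping, length: int) -> int:
    """Scribble random bytes over at most `length` bytes of the mapping.

    The length is clamped to the mapping size, so a caller asking for
    too much cannot overrun the buffer. Returns the bytes written.
    """
    n = max(0, min(length, m.size))
    m.buf[:n] = os.urandom(n)
    return n
```

Because every caller goes through the same length-aware routine, there is a single place to check sizes rather than one per buffer type.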

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1], [2]. Grumble.

Trinity and pages of random data. is a post from: codemonkey.org.uk

October 19, 2014

Laptop bleg

I'm considering a laptop (actually two). Requirements:

  • 13" to 14" class.
  • Indestructible.
  • Display that is not too wide. Enough with 16:9 already! Aspect of 1.6 would be ideal (Lenovo T400 had that).
  • Light. Indestructible is more important, but it should be light: 2kg or less.
  • No nipple. No Lenovo.

Where it comes from is mostly my wife's Sony Vaio Z. I used to have a Z back in 2001 or so, when they were in 12" format. It was the best laptop ever, but unfortunately it succumbed to a DC-DC converter failure. The modern Z is not like that Z. The most super annoying problem is that the screws holding the battery failed in an interesting way: it is impossible to remove the battery now. Also, the contact between the battery and the motherboard is marginal. I managed to fix the problem by manufacturing a finely shaped wooden wedge that I drove into a gap and thus extended the life of that thing, but man, Sony, this is disappointing.

Unfortunately, I don't remember if it was Kota or Daisuke, but one of the Japanese guys at a recent Swift Hackathon in Boston had a Z of a similar vintage, and it looked impeccable. Maybe Sony figured that it's going to be the predominant mode of care that their wares receive, and so why not make the modern Z this much cheaper than the old, indestructible Z. But they still charge exorbitant prices.

Lenovo wins a special notice because I had a T400 for 3 years and swore never to deal with it ever again. The biggest problem is the keyboard layout, because I use my left pinky for the control key. I could live with their idiotic placement of Escape, but I refuse to deal with 3 years of physical pain again. Also, their famous quality seems to be slipping, as my mouse button broke within 3 years. Battery died, too. However, the T400 had a very good display, and I would like another like that, if possible.

October 18, 2014

Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscall arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. A fairly quick rewrite took care of that.
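The shape of that rewrite can be sketched in a few lines of Python: every append knows how much room is left and truncates rather than overruns. This is an illustration of the technique only; the class and function names are invented, and trinity's real code is C.

```python
class BoundedBuffer:
    """Fixed-capacity text buffer that truncates instead of overrunning."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = bytearray()

    def remaining(self) -> int:
        """Bytes still available before the buffer is full."""
        return self.capacity - len(self.data)

    def append(self, text: str) -> int:
        """Append as much of text as fits; return the bytes written."""
        chunk = text.encode("utf-8", "replace")[: self.remaining()]
        self.data += chunk
        return len(chunk)


def render_syscall(name: str, args, capacity: int = 64) -> str:
    """Render 'name(arg, ...)' into a buffer of at most `capacity` bytes."""
    buf = BoundedBuffer(capacity)
    buf.append(name + "(")
    buf.append(", ".join(repr(a) for a in args))
    buf.append(")")
    return buf.data.decode("utf-8", "replace")
```

A garbage pathname a thousand characters long now just produces a truncated line in the log instead of trampling whatever sits after the buffer.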

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.
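In sketch form, the corrected "have we jumped into the future?" check looks something like the Python below. The constant 2145 is the adjtimex offset quoted above; treating it as seconds, and the names used here, are assumptions for illustration rather than trinity's actual code.

```python
MAX_ADJTIME_OFFSET = 2145  # maximum adjtimex offset quoted above (assumed seconds)
CHECK_INTERVAL = 1         # seconds between watchdog checks


def timestamp_in_future(child_ts: int, now: int) -> bool:
    """True if the child's last-syscall timestamp is implausibly far ahead.

    The allowance covers both a worst-case adjtimex jump and the one
    second that has passed since the previous check; forgetting the
    latter was the bug described above.
    """
    return child_ts > now + MAX_ADJTIME_OFFSET + CHECK_INTERVAL
```

Without the CHECK_INTERVAL term, a timestamp exactly 2146 seconds ahead would be flagged as corruption even though adjtimex plus one tick can legitimately produce it.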

There’s been a bunch of small 1-2 line fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: The only interesting bugs Trinity has shown up this week have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.

Trinity updates is a post from: codemonkey.org.uk

October 09, 2014

Emacs hint for Firefox hacking

I started hacking on firefox recently. And, of course, I’ve configured emacs a bit to make hacking on it more pleasant.

The first thing I did was create a .dir-locals.el file with some customizations. Most of the tree has local variable settings in the source files — but some are missing and it is useful to set some globally. (Whether they are universally correct is another matter…)

Also, I like to use bug-reference-mode. What this does is automatically highlight references to bugs in the source code. That is, if you see “bug #1050501”, it will be buttonized and you can click (or C-RET) and open the bug in the browser. (The default regexp doesn’t capture quite enough references so my settings hack this too; but I filed an Emacs bug for it.)

I put my .dir-locals.el just above my git checkout, so I don’t end up deleting it by mistake. It should probably just go directly in-tree, but I haven’t tried to do that yet. Here’s that code:

(
 ;; Generic settings.
 (nil .
      ;; See C-h f bug-reference-prog-mode, e.g, for using this.
      ((bug-reference-url-format . "https://bugzilla.mozilla.org/show_bug.cgi?id=%s")
       (bug-reference-bug-regexp . "\\([Bb]ug ?#?\\|[Pp]atch ?#\\|RFE ?#\\|PR [a-z-+]+/\\)\\([0-9]+\\(?:#[0-9]+\\)?\\)")))

 ;; The built-in javascript mode.
 (js-mode .
     ((indent-tabs-mode . nil)
      (js-indent-level . 2)))

 (c++-mode .
	   ((indent-tabs-mode . nil)
	    (c-basic-offset . 2)))

 (idl-mode .
	   ((indent-tabs-mode . nil)
	    (c-basic-offset . 2)))

)
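To see what that regexp actually catches without firing up Emacs, here is a rough Python translation. The pattern is meant to be equivalent to the bug-reference-bug-regexp above (Emacs `\(`, `\|` become plain `(`, `|`); the function name bug_urls is made up for the example.

```python
import re

# Python translation of the bug-reference-bug-regexp above.
# Group 1 is the prefix ("bug #", "Patch #", ...), group 2 the bug number.
BUG_RE = re.compile(
    r"([Bb]ug ?#?|[Pp]atch ?#|RFE ?#|PR [a-z+-]+/)([0-9]+(?:#[0-9]+)?)"
)

URL_FORMAT = "https://bugzilla.mozilla.org/show_bug.cgi?id=%s"


def bug_urls(text: str, url_format: str = URL_FORMAT):
    """Return the bug URL for every bug reference found in text."""
    return [url_format % m.group(2) for m in BUG_RE.finditer(text)]
```

For instance, bug_urls("see bug #1050501") yields a one-element list pointing at show_bug.cgi?id=1050501, which is the link Emacs would open on click.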

In programming modes I enable bug-reference-prog-mode. This enables highlighting only in comments and strings. This could easily be done from prog-mode-hook, but I made my choice of minor modes depend on the major mode via find-file-hook.

I’ve also found that it is nice to enable this minor mode in diff-mode and log-view-mode. This way you get bug references in diffs and when viewing git logs. The code ends up like:

(defun tromey-maybe-enable-bug-url-mode ()
  (and (boundp 'bug-reference-url-format)
       (stringp bug-reference-url-format)
       (if (or (derived-mode-p 'prog-mode)
	       (eq major-mode 'tcl-mode)	;emacs 23 bug
	       (eq major-mode 'makefile-mode)) ;emacs 23 bug
	   (bug-reference-prog-mode t)
	 (bug-reference-mode t))))

(add-hook 'find-file-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'log-view-mode-hook #'tromey-maybe-enable-bug-url-mode)
(add-hook 'diff-mode-hook #'tromey-maybe-enable-bug-url-mode)