Kernel Planet

January 25, 2015

James Bottomley: Adventures in Embedded UEFI with Intel Galileo

At one of the Intel Technology Days conferences a while ago, Intel gave us a gift of a Galileo board, which is based on the Quark SoC, just before the general announcement.  The promise of the Quark SoC was that it would be a fully open (down to the firmware) embedded system based on UEFI.  When the board first came out, though, the UEFI code was missing (promised for later), so I put it on a shelf and forgot about it.   Recently, the UEFI Security Subteam has been considering issues that impinge on embedded architectures (mostly arm) so having an actual working embedded development board could prove useful.  This is the first part of the story of trying to turn the Galileo into an embedded reference platform for UEFI.

The first problem with getting the Galileo working is that if you want to talk to the UEFI part, it's done over a serial interface with a 3.5mm jack connection. However, a quick trip to Amazon solved that one. Equipped with the serial interface, it's now possible to start running UEFI binaries. Just using the default firmware (with no secure boot) I began testing the efitools binaries. Unfortunately, they kept growing the size of the secure variables (my startup.nsh script does an append write to db) and eventually the thing hit an assert failure on entering the UEFI handoff. This led to the discovery that the recovery straps on the board don't work, there's no way to clear the variable NVRAM, and the only way to get control back is with an external firmware flash tool. So it's off for an unexpected trip to uncharted territory unless I want the board to stay bricked.

The flash tool Intel recommends is the Dediprog SF 100. These are a bit expensive (around US$350) and there's no US supplier, meaning you have to order from abroad, wait ages for it to be delivered and deal with US customs on import, something I've done before and have no wish to repeat. So, casting about for a better solution, I came up with the Bus Pirate. It's a fully open hardware component (including a Google Code repository for the firmware and the schematics), plus it's only US$35 (that's 10x cheaper than the Dediprog), available from Amazon, and works well with Linux (for full disclosure, Amazon was actually sold out when I bought mine, so I got it here instead).

The Bus Pirate comes as a bare circuit board (no case or cables), so everything you might need to use it has to be bought separately. I knew I'd need a ribbon cable with SPI plugs (the Galileo has an SPI connector for the Dediprog), so I ordered one with the card. The first surprise when the card arrived was that the USB connector is actually Mini-B, not the now-standard Micro connector. I've not seen one of those since I had an Android G1, but, after looking in vain for my old Android one, Staples still has the cables. The next problem is that, being open hardware, there are multiple manufacturers. They all supply a nice multi-coloured ribbon cable, but there are two possible orientations and both are produced. It turns out I have a SparkFun cable, which is the opposite way around from the colour values in the firmware (which is why the first attempt to talk to the chip didn't work). The Galileo has diode isolators so the SPI flash chip can be powered up and operated independently by the Bus Pirate; accounting for cable orientation, and disconnecting the Galileo from all other external power, this now works. Finally, there's a nice Linux project, flashrom, which allows you to flash all manner of chips, and it has a programmer mode for the Bus Pirate. Great, except that the default USB serial speed is 115200 baud, and at that rate it takes about ten minutes just to read an 8MB SPI flash (flashrom will read, program and then verify, giving you about 25 minutes each time you redo the firmware). Speeding this up is easy: there's an unapplied patch to increase the baud rate to 2 Mbit/s, and I wrote some code to flash only designated areas of the chip (note to self: send this upstream). The result is available on the OpenSUSE Build Service. The outcome is that I'm now able to build and reprogram the firmware in around a minute.
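As a sanity check on those numbers, here is the back-of-the-envelope arithmetic (my own sketch, assuming an 8 MiB flash and 8N1 serial framing, so each byte costs 10 bits on the wire; protocol overhead is ignored, so real transfers run a little slower):

# Rough check of the quoted flash read times over the serial link.
FLASH_BYTES = 8 * 1024 * 1024

for baud in (115200, 2000000):
    seconds = FLASH_BYTES * 10 / baud
    print(f"{baud:>8} baud: ~{seconds / 60:.1f} min per full read")

# 115200 baud -> ~12 minutes; 2 Mbit/s -> ~0.7 minutes, which lines
# up with the "about ten minutes" versus "around a minute" experience.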

By now this is two weeks, a couple of hacks to a tool I didn’t know I’d need and around US$60 after I began the project, but at least I’m now an embedded programmer and have the scars to prove it.  Next up is getting Secure Boot actually working ….

January 25, 2015 12:44 AM

January 23, 2015

Michael Kerrisk (manpages): man-pages-3.78 is released

I've released man-pages-3.78. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from many smaller fixes and improvements to various pages, the most notable changes in man-pages-3.78 are the following:

January 23, 2015 08:28 AM

James Bottomley: efitools now working for both x86 and x86_64

Some people noticed a while ago that version 1.5.1 was building for ia32, but 1.5.2 is the release that's tested and works (1.5.1 had problems with the hash computations). The primary reason for this work is bringing up secure boot on the Intel Quark SoC board (I've got the Galileo Kips Bay D, but the link is to the new Gen 2 board). There's a corresponding release of OVMF, with a fix for a DxeImageVerification problem that meant IA32 hashes weren't getting computed properly. I'll post more on the Galileo and what I've been doing with it in a different post (tagged for embedded, since it's mostly a story of embedded board programming).

With these tools, you can now test out IA32 images for both UEFI and the key manipulation and signing tools.

January 23, 2015 05:05 AM

January 21, 2015

Dave Jones: Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don't actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing Fedora on long-term stable releases after giving it a try circa Fedora 14; it was generally a disaster, because the bugs being fixed didn't match up much with the bugs our users were seeing. We found that we got more of the bugs we cared about fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose "one size fits all" model that distribution kernels have to fit is a much bigger problem to solve. With the feedback loop of "stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release" being so long, it's not really a surprise that just having users closer to the bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way), unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.


January 21, 2015 04:31 AM

January 19, 2015

Daniel Vetter: LCA 2015: Botching Up IOCTLs

So I'm stuck at an airport somewhere, jetlagged, on my return trip from my very first LCA in Auckland - awesome conference, except for being on the wrong side of the globe: very broad range of speakers, awesome people all around, great organization, and since it's still a community conference, none of the marketing nonsense and sales-pitch keynotes.

I also gave a presentation about botching up ioctls. Compared to my blog post it has a bunch more detail on technical issues, plus some overall comments on what's really important to avoid a v2 ioctl because v1 ended up being unsalvageable. Slides and video (courtesy of the LCA video team).

January 19, 2015 05:55 AM

January 15, 2015

Pete Zaitcev: UAT on RTL-SDR update

About a year ago, when I started playing with ADS-B over 1090ES, I noticed that small airplanes heavily favour UAT on 978 MHz, because it's cheaper. Thus, for the purposes of independent on-board traffic, it would be important to tap directly into UAT. If I'm mingling with airliners and their 1090ES, I'm in controlled airspace anyway (yes, I know that most collisions happen in controlled airspace, but I'm grasping at excuses to play with UAT here, okay?).

UAT poses a large challenge for RTL-SDR because of its relatively high data rate: 1.041667 Mbit/s. The RTL2832U can only sample at 3.2 MS/s, and is only stable at 2.8 MS/s. Theoretically that should be enough, but everything I saw out there only works with 8 samples per bit, for weak signals. So, for initial experiments, I thought I'd try a trick: self-clocking. I set the sample rate to 2083334 (exactly 2 samples per bit), and then do no clock recovery whatsoever. It hits where it hits, and if the sample points land well onto the bits, the packet is recovered; otherwise it's lost.
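To make the trick concrete, here is a toy sketch of the idea (my illustration, not the code in the repo; UAT's FSK demodulation is heavily simplified):

import numpy as np

# Sample at exactly 2x the bit rate and slice bits with no clock
# recovery at all: packets decode only when the sample points happen
# to land well inside the bits.
BIT_RATE = 1041667          # UAT bits per second
SAMPLE_RATE = 2 * BIT_RATE  # 2083334 S/s, the rate set above

def slice_bits(phase):
    # UAT is FSK, so the sign of the phase change between successive
    # samples approximates the bit; keep one decision per 2 samples.
    dphi = np.diff(np.unwrap(phase))
    return (dphi[::2] > 0).astype(np.uint8)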

I threw together some code, ran it, and it didn't work: it only received white noise. Oh, well. I moved the repo into a dusty corner of GitHub and forgot about it.

Fast forward a year, and a gentleman by the name of Oliver Jowett noticed a big problem: the phase was computed incorrectly. I fixed that up and suddenly saw some bits recovered (as it happened, zeroes and ones were swapped, which Oliver had to correct again).

After that, things started to move forward. Having bits recovered made it possible to measure reception, and I found that the antenna I built for the 978 MHz band was much worse than the stock antenna for TV. Imagine my surprise and disappointment: all that soldering for nothing. I don't know where I screwed up, but some suggest that the computer and dongle produce RF noise that interferes with the antenna, and that a length of coax helps despite the losses the coax incurs (/u/christ0ph liked that in particular).

(Two reception captures illustrated this in the original post, captioned "This is bad." and "This is good. Or better at least.")

From here on, it's error recovery. Unfortunately, I have no clue what this means:

The FEC parity generation shall be based on a systematic RS 256-ary code with 8-bit code word symbols. FEC parity generation for each of the six blocks shall be a RS (92,72) code.

Quoted from Annex 10, Volume III, 12.4.4.2.1.
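The block structure, at least, can be unpacked. Here is my reading, illustrated with the third-party Python reedsolo package; the real UAT code uses specific field and generator parameters from the spec, so this only shows the shape:

from reedsolo import RSCodec

# RS(92,72) over GF(256): 92 one-byte symbols per block, of which 72
# are data and 20 are parity, correcting up to 10 corrupted symbols.
# Illustrative only -- reedsolo's default field polynomial is not
# necessarily the one the UAT spec mandates.
rs = RSCodec(nsym=20, nsize=92)   # shortened code: 92-symbol codewords
block = bytes(range(72))          # 72 data bytes
codeword = rs.encode(block)       # 92 bytes: 72 data + 20 parity
assert len(codeword) == 92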

P.S. Observing Oliver's key involvement, the cynical may conclude that the whole premise of Open Source is that you write drek that does not work, upload it to GitHub, and someone fixes it up for you, free of charge. ESR wrote a whole book about it ("given enough eyeballs, all bugs are shallow")!

January 15, 2015 07:52 PM

January 12, 2015

Pavel Machek: Why avoid ACPI on ARM (and everywhere else, too)

Grant Likely published an article about ACPI and ARM at http://www.secretlab.ca/archives/151. He acknowledges that systems with ACPI are harder to debug, but because Microsoft says so, we have to use ACPI (basically).

I believe making the wrong technical choice "because Microsoft says so" is a wrong thing to do.

Yes, ACPI gives more flexibility to hardware vendors. Imagine replacing block device drivers with interpreted bytecode coming from ROM. That is obviously bad, right? So why is it good for power management?

It is not.

Besides being harder to debug, there are more disadvantages:
* The size, speed and complexity cost of a bytecode interpreter in the kernel.
* Many more drivers. Imagine a GPIO switch controlling rfkill (for example). In the device tree case, that's a few lines in the .dts specifying which GPIO the switch is on.
In the ACPI case, each hardware vendor initially implements the rfkill switch in AML, differently. After a few years, each vendor implements a (different) kernel<->AML interface for querying rfkill state and toggling it in software. A few years after that, we implement kernel drivers for those AML interfaces, to properly integrate them in the kernel.
* Incompatibility. ARM servers will now be very different from other ARM systems.


Now, are there some arguments for ACPI? Yes -- it allows hardware vendors to hack up half-working drivers without touching kernel sources. (Half-working: such drivers are not properly integrated with all the various subsystems.) Grant claims that power management is somehow special, so that the requirement for real drivers is OK for normal drivers (block, video) but not for power management. Now, getting a driver merged into the kernel does not take that long -- less than half a year if you know what you are doing. Plus, for power management, you can really just initialize the hardware in the bootloader (into a working but not optimal state). Basic drivers are likely to be merged fast, and then you'll just have to supply DT tables.
Avoid ACPI. It only makes things more complex and harder to debug.

January 12, 2015 05:00 PM

January 10, 2015

Michael Kerrisk (manpages): man-pages-3.77 is released

I've released man-pages-3.77. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from many smaller fixes and improvements to various pages, the most notable changes in man-pages-3.77 are the following:

January 10, 2015 04:17 PM

Grant Likely: Why ACPI on ARM?

Why are we doing ACPI on ARM? That question has been asked many times, but we haven’t yet had a good summary of the most important reasons for wanting ACPI on ARM. This article is an attempt to state the rationale clearly.

During an email conversation late last year, Catalin Marinas asked for a summary of exactly why we want ACPI on ARM, Dong Wei replied with the following list:
> 1. Support multiple OSes, including Linux and Windows
> 2. Support device configurations
> 3. Support dynamic device configurations (hot add/removal)
> 4. Support hardware abstraction through control methods
> 5. Support power management
> 6. Support thermal management
> 7. Support RAS interfaces

The above list is certainly true in that all of them need to be supported. However, that list doesn’t give the rationale for choosing ACPI. We already have DT mechanisms for doing most of the above, and can certainly create new bindings for anything that is missing. So, if it isn’t an issue of functionality, then how does ACPI differ from DT and why is ACPI a better fit for general purpose ARM servers?

The difference is in the support model. To explain what I mean, I’m first going to expand on each of the items above and discuss the similarities and differences between ACPI and DT. Then, with that as the groundwork, I’ll discuss how ACPI is a better fit for the general purpose hardware support model.

Device Configurations

2. Support device configurations
3. Support dynamic device configurations (hot add/removal)

From day one, DT was about device configurations. There isn’t any significant difference between ACPI & DT here. In fact, the majority of ACPI tables are completely analogous to DT descriptions. With the exception of the DSDT and SSDT tables, most ACPI tables are merely flat data used to describe hardware.

DT platforms have also supported dynamic configuration and hotplug for years. There isn't a lot here that differentiates between ACPI and DT. The biggest difference is that dynamic changes to the ACPI namespace can be triggered by ACPI methods, whereas for DT, changes are received as messages from firmware and have been very much platform specific (e.g. IBM pSeries does this).

Power Management Model

4. Support hardware abstraction through control methods
5. Support power management
6. Support thermal management

Power, thermal, and clock management can all be dealt with as a group. ACPI defines a power management model (OSPM) that both the platform and the OS conform to. The OS implements the OSPM state machine, but the platform can provide state change behaviour in the form of bytecode methods. Methods can access hardware directly or hand off PM operations to a coprocessor. The OS really doesn’t have to care about the details as long as the platform obeys the rules of the OSPM model.

With DT, the kernel has device drivers for each and every component in the platform, and configures them using DT data. DT itself doesn’t have a PM model. Rather the PM model is an implementation detail of the kernel. Device drivers use DT data to decide how to handle PM state changes. We have clock, pinctrl, and regulator frameworks in the kernel for working out runtime PM. However, this only works when all the drivers and support code have been merged into the kernel. When the kernel’s PM model doesn’t work for new hardware, then we change the model. This works very well for mobile/embedded because the vendor controls the kernel. We can change things when we need to, but we also struggle with getting board support mainlined.

This difference has a big impact when it comes to OS support. Engineers from hardware vendors, Microsoft, and most vocally Red Hat have all told me bluntly that rebuilding the kernel doesn't work for enterprise OS support. Their model is based around a fixed OS release that ideally boots out-of-the-box. It may still need additional device drivers for specific peripherals/features, but from a system view, the OS works. When additional drivers are provided separately, those drivers fit within the existing OSPM model for power management. This is where ACPI has a technical advantage over DT. The ACPI OSPM model and its bytecode give the HW vendors a level of abstraction under their control, not the kernel's. When the hardware behaves differently from what the OS expects, the vendor is able to change the behaviour without changing the HW or patching the OS.

At this point you’d be right to point out that it is harder to get the whole system working correctly when behaviour is split between the kernel and the platform. The OS must trust that the platform doesn’t violate the OSPM model. All manner of bad things happen if it does. That is exactly why the DT model doesn’t encode behaviour: It is easier to make changes and fix bugs when everything is within the same code base. We don’t need a platform/kernel split when we can modify the kernel.

However, the enterprise folks don’t have that luxury. The platform/kernel split isn’t a design choice. It is a characteristic of the market. Hardware and OS vendors each have their own product timetables, and they don’t line up. The timeline for getting patches into the kernel and flowing through into OS releases puts OS support far downstream from the actual release of hardware. Hardware vendors simply cannot wait for OS support to come online to be able to release their products. They need to be able to work with available releases, and make their hardware behave in the way the OS expects. The advantage of ACPI OSPM is that it defines behaviour and limits what the hardware is allowed to do without involving the kernel.

What remains is sorting out how we make sure everything works. How do we make sure there is enough cross platform testing to ensure new hardware doesn’t ship broken and that new OS releases don’t break on old hardware? Those are the reasons why a UEFI/ACPI firmware summit is being organized, it’s why the UEFI forum holds plugfests 3 times a year, and it is why we’re working on FWTS and LuvOS.

Reliability, Availability & Serviceability (RAS)

7. Support RAS interfaces

This isn’t a question of whether or not DT can support RAS. Of course it can. Rather it is a matter of RAS bindings already existing for ACPI, including a usage model. We’ve barely begun to explore this on DT. This item doesn’t make ACPI technically superior to DT, but it certainly makes it more mature.

Multiplatform support

1. Support multiple OSes, including Linux and Windows

I’m tackling this item last because I think it is the most contentious for those of us in the Linux world. I wanted to get the other issues out of the way before addressing it.

The separation between hardware vendors and OS vendors in the server market is new for ARM. For the first time, ARM hardware and OS release cycles are completely decoupled from each other, and neither is expected to have specific knowledge of the other (i.e. the hardware vendor doesn't control the choice of OS). ARM and their partners want to create an ecosystem of independent OSes and hardware platforms that don't explicitly require the former to be ported to the latter.

Now, one could argue that Linux is driving the potential market for ARM servers, and therefore Linux is the only thing that matters, but hardware vendors don’t see it that way. For hardware vendors it is in their best interest to support as wide a choice of OSes as possible in order to catch the widest potential customer base. Even if the majority choose Linux, some will choose BSD, some will choose Windows, and some will choose something else. Whether or not we think this is foolish is beside the point; it isn’t something we have influence over.

During early ARM server planning meetings between ARM, its partners and other industry representatives (myself included) we discussed this exact point. Before us were two options, DT and ACPI. As one of the Linux people in the room, I advised that ACPI’s closed governance model was a show stopper for Linux and that DT is the working interface. Microsoft on the other hand made it abundantly clear that ACPI was the only interface that they would support. For their part, the hardware vendors stated the platform abstraction behaviour of ACPI is a hard requirement for their support model and that they would not close the door on either Linux or Windows.

However, the one thing that all of us could agree on was that supporting multiple interfaces doesn’t help anyone: It would require twice as much effort on defining bindings (once for Linux-DT and once for Windows-ACPI) and it would require firmware to describe everything twice. Eventually we reached the compromise to use ACPI, but on the condition of opening the governance process to give Linux engineers equal influence over the specification. The fact that we now have a much better seat at the ACPI table, for both ARM and x86, is a direct result of these early ARM server negotiations. We are no longer second class citizens in the ACPI world and are actually driving much of the recent development.

I know that this line of thought is more about market forces rather than a hard technical argument between ACPI and DT, but it is an equally significant one. Agreeing on a single way of doing things is important. The ARM server ecosystem is better for the agreement to use the same interface for all operating systems. This is what is meant by standards compliant. The standard is a codification of the mutually agreed interface. It provides confidence that all vendors are using the same rules for interoperability.

Summary

To summarize, here is the short form rationale for ACPI on ARM.

At the beginning of this article I made the statement that the difference is in the support model. For servers, responsibility for hardware behaviour cannot be purely the domain of the kernel, but rather is split between the platform and the kernel. ACPI frees the OS from needing to understand all the minute details of the hardware, so that the OS doesn't need to be ported to each and every device individually. It allows the hardware vendors to take responsibility for PM behaviour without depending on an OS release cycle that is not under their control.

ACPI is also important because hardware and OS vendors have already worked out how to use it to support the general purpose ecosystem. The infrastructure is in place, the bindings are in place, and the process is in place. DT does exactly what we need it to when working with vertically integrated devices, but we don't have good processes for supporting what the server vendors need. We could potentially get there with DT, but doing so doesn't buy us anything. ACPI already does what the hardware vendors need, Microsoft won't collaborate with us on DT, and the hardware vendors would still need to provide two completely separate firmware interfaces: one for Linux and one for Windows.

January 10, 2015 02:00 PM

January 09, 2015

Pavel Machek: fight with 2.6.28

...was not easy. 2.6.28 will not compile with a semi-recent toolchain (eldk-5.4), and it needs fixes even with an older toolchain (eldk-5.2). To add difficulty, it also needs fixes to work with new make (!). Anyway, now it boots, including nfsroot. That should make userspace development possible. I still hope to get voice calls working, but that is not easy on 3.18/3.19 due to all the audio problems.

January 09, 2015 10:32 PM

January 08, 2015

Dave Jones: the new job reveal.

I let the cat out of the bag earlier this afternoon on twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilize production releases, proactively finding new bugs, misc QA work, development of some new tooling, and a whole bunch of other stuff I can't talk about just yet. (And probably a whole slew of things I don't even know about yet.)

It seemed almost serendipitous that I’ve ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story beginning with DDoS’s of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai related book, No Better Time, the autobiography of the late founder of Akamai, Danny Lewin. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me to a few possibilities, and the final decision to go with Akamai wasn't an easy one to make. The combination of interesting work, an "easy to commute to" office (I'll be at the Kendall Square office in Cambridge, MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep.)

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.


January 08, 2015 09:59 PM

Pete Zaitcev: Swift and balance

Swift is on the cusp of getting yet another intricate mechanism that regulates how partitions are placed: so-called "overload". But new users of Swift keep asking what weights are, and now this? I am not entirely sure it's necessary, but here's a simple explanation of why we keep ending up with attempts at complexity (thanks to John Dickinson on IRC).

Suppose you have a system that spreads your replicas (of partitions) across (failure) zones. This works great as long as your zones are about the same size, usually a rack. But then one day you buy a new rack with 8TB drives, and suddenly the new zone is several times larger than the others. If you do not adjust anything, it ends up only a quarter full at best.

So, fine, we add "weights". Now a zone that has weight 100.0 gets 2 times more replicas than a zone with weight 50.0. This allows you to fill zones better, but it must compromise your dispersion and thus durability. Suppose you only have 4 racks: three with 2TB drives and one with 8TB drives. Not an unreasonable size for a small cloud. So, you set weights to 25, 25, 25, 100. With a replication factor of 3, there's still a good probability (which I'm unable to calculate, although I feel it ought to be easy for someone better educated) that the bigger node will end up with 2 replicas for some partitions. Once that node goes down, you lose redundancy completely for those partitions.
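For a rough feel of the odds, here is my own arithmetic, assuming the ring honours the weights exactly and ignoring dispersion and overload logic entirely:

weights = [25, 25, 25, 100]
replicas = 3

# The big rack's share of all replicas follows from the weights.
share = weights[-1] / sum(weights)   # 100/175, about 0.57
per_partition = replicas * share     # about 1.71 replicas per partition
print(f"big rack holds ~{per_partition:.2f} of each partition's 3 replicas")

# Since replica counts are whole numbers, averaging ~1.71 means at
# least ~71% of partitions keep 2 of their 3 replicas in that single
# rack; lose the rack and those partitions are down to one replica.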

In the small-cloud example above, if you care about your customers' data, you have to eat the imbalance and underutilization until you retire the 2TB drives [1].

<clayg> torgomatic: well if you have 6 failure domains in a tier but their sized 10000 10 10 10 10 10 - you're still sorta screwed

My suggestion would be to just ignore all the complexity we thoughtfully provided for the people with "screwed" clusters. Deploy and maintain your cluster in a way that makes placement and replication easy: have a good number of more or less uniform zones that are well aligned to natural failure domains. Everything else is a workaround -- even weights.

P.S. Kinda wondering how Ceph deals with these issues. It is more automagic when it decides what to store where, but surely there ought to be a good and bad way to add OSDs.

[1] Strictly speaking, other options exist. You can delegate to another tier by tying 2 small racks into a zone: yet another layer of Swift's complexity. Or, you could put the new 8TB drives on trays and stuff them into existing nodes. But considering those here would muddy the waters.

January 08, 2015 06:17 PM

Dave Jones: Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I'd encountered since I had started work there. I had been fixing up various bugs in Trinity over the tail end of summer, which meant that on a good kernel it would run and run for quite some time. I still didn't figure out exactly what the cause of a self-corruption was, but I narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often the kernel's NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seemingly random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains: what was scribbling over the HPET in my case? The /dev/hpet node doesn't allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it (sketched below), but:
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.
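For the curious, the root-only route looks roughly like this (my sketch; 0xFED00000 is the conventional x86 HPET base, and a kernel built with CONFIG_STRICT_DEVMEM will normally refuse the mapping):

import mmap, os

# Reading the HPET main counter via /dev/mem. Writing to these
# registers is exactly the "scribbling" that caused the hangs, so
# this sketch only reads. Requires root.
HPET_BASE = 0xFED00000
fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, 0x400, offset=HPET_BASE)
main_counter = int.from_bytes(regs[0xF0:0xF8], "little")
print(hex(main_counter))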

The only two plausible scenarios I could think of were

So that's where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus' hacky workaround didn't get committed, but he and John Stultz continue to go back and forth on hardening the clock management code in the face of screwed-up hardware, so maybe soon we'll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:


January 08, 2015 05:09 PM

January 07, 2015

Dave Jones: continuity of various projects

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

So that's about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated to my new job, so I won't be as visible as usual, but I'm not going to disappear from the Linux community forever -- something I made clear I didn't want to happen, everywhere I interviewed.


January 07, 2015 03:08 PM

January 05, 2015

Pete Zaitcev: Going beyond ZFS by accident

Yesterday, CKS wrote an article that tries to redress the balance a little in the coverage of ZFS. Apparently, detractors of ZFS were pulling quotes from his operational gripes, so he went forceful with the observation that ZFS remains the only viable advanced filesystem (on Linux or not). CKS has no time for your btrfs bullshit.

The situation where weselovskys of the world hold the only viable advanced filesystem hostage and call everyone else "jackass" is very sad for Linux, but it may not be quite so dire, because it's possible that events are overtaking btrfs and ZFS. I am talking about the march of super-advanced, distributed filesystems downstream.

It all started with the move beyond POSIX, which, admittedly, seemed very silly at the time. The early DHT was laughable, and I remember how they struggled for years to even enable writes. However, useful software has been developed since then.

The poster child is Sage's Ceph, which relies on plain old XFS for back-end storage, composes an object store out of nodes (called RADOS), and layers a POSIX layer on top for those who want it. It is in field use at Dreamhost. I can easily see someone using it where a ZFS-backed NFS/CIFS cluster would otherwise be deployed.

Another piece of software that I place in the same category is OpenStack Swift. I know, Swift is not competing with ZFS directly. The consistency of its meta layer is not sufficient to emulate POSIX anyway. However, you get all those built-in checksums and all that durability jazz that CKS wants. And Swift aims even further up in scale than Ceph, being in field use at Rackspace. So, what seems to be happening is that folks who really need to go large are willing at times to forsake even compatibility with POSIX, in part to get the benefits that ZFS provides to CKS. Mercado Libre is one well-hyped case of migration from a pile of NFS filers to a Swift cluster.

Now that these systems are established and have proven themselves, I see constant efforts to take them downscale. The original Swift 1.0 did not even work right if you had fewer than 3 nodes (strictly speaking, if you had fewer zones than the replication factor). This has since been fixed by the so-called "as good as possible placement" around 1.13 Havana, so you can have a 1-node Swift easily. Ceph, similarly, would not consider PGs on the same node healthy, and it's a bit of a PITA even in Firefly. So yea, there are issues, but we're working on them. And in the process, we're definitely coming for chunks of ZFS's space.

January 05, 2015 06:25 PM

Pete Zaitcev: blitz2 for GNOME catastrophe

After putting blitz2 on my Nexus and poking at its GUI buttons, I reckoned that it might be time to stop typing in a terminal like a caveman on Linux, too (I'm joking, but only just). And the project rolled smoothly for a while. What took me 2 months to accomplish on Android took only 2 days in GNOME. However, 2 lines of code from the end, it all came to an abrupt halt when I found out that it is impossible to access the clipboard from a JavaScript application. See GNOME bugs 579312, 712752.

January 05, 2015 04:27 PM

January 01, 2015

Michael Kerrisk (manpages): man-pages-3.76 is released

I've released man-pages-3.76. We've just passed 12k commits in the project, and this release, my 158th,  marks the completion of my tenth year as maintainer of the man-pages project.

The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from a large number of smaller fixes and improvements to many pages, the most notable changes in man-pages-3.76 are the following:

January 01, 2015 08:27 AM

December 31, 2014

Pete Zaitcev: blitz2 for Android

An additional upside for blitz2 is that an HTTP client is available on just about any platform. So if I want to share the clipboard with my Nexus tablet, I can, without running an sshd on it.

This is my first Android app, and I hadn't touched Java in many years. So, first impressions.

I forgot how insanely wordy Java is. And doing anything takes effort, with all the factories, accessors, and whatnot.

I like checked exceptions. Too bad Python doesn't have them (probably impossible by the very nature of a dynamic language, but I've been bitten by an unexpected exception floating up from the depth of the stack before).

Android docs are excellent and one almost never needs to search for answers. Unfortunately, I managed to step into one such case: the so-called "Up" navigation. My chosen API level is 11. The contemporary docs explain how to emulate "Up" using compatibility libraries for APIs before mine, and explain how to use onNavigateUp(), which comes in API level 16. But there's absolutely nothing, nowhere, that tells you how to do it in API 11. I was walking in circles for days. The answer is actually a secret ID namespace, in particular android.R.id.home. I would never have figured it out if not for random pieces of code on the Internet. Good grief, Google. So close to perfect marks.

Oh, and one more thing: Googlers score good sanity points for reimplementing the stock Java API for HTTP (HttpURLConnection and friends). They could've easily rolled their own, but they didn't. They wrote their own runtime, but it's fully compatible with Oracle's, including the dark corners of SSL. That makes it possible to debug the difficult parts mostly on a Linux box. Very nice. Just to see what it could have been otherwise, look at their gratuitously incompatible Base64.

UPDATE: I forgot to mention that I started with Eclipse, but it was entirely unusable due to crashing all the time (about once an hour, for no discernible reason). I was on Fedora 20 at the time. So I used the command-line tools instead, and that worked like a charm. There's a Makefile in the blitz2 repo linked above.

December 31, 2014 11:15 PM

Michael Kerrisk (manpages): man-pages-3.74 is released

I've released man-pages-3.74. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from various minor changes to many pages, the most notable changes in man-pages-3.74 are the following:

December 31, 2014 07:04 AM

Michael Kerrisk (manpages): man-pages-3.75 is released

I've released man-pages-3.75. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This is a quite small release. The most notable changes in man-pages-3.75 are the following:

December 31, 2014 07:04 AM

December 24, 2014

Pavel Machek: Merry Christmas from DRAM vendors

As a Christmas present, you may be interested to learn that your computer's memory (DRAM) does not work as well as you thought.

Iteration 139 (after 326.41s)
49.460 nanosec per iteration: 2.13668 sec for 43200000 iterations
check
(check took 0.213835s)
Iteration 140 (after 328.76s)
48.805 nanosec per iteration: 2.1084 sec for 43200000 iterations
check
error at 0x890f1118: got 0xfeffffffffffffff
(check took 0.244179s)
** exited with status 256 (0x100)

https://www.ece.cmu.edu/~safari/pubs/kim-isca14.pdf


And yes, it probably means "your memory, too". The thread on lkml is "DRAM unreliable under..."

December 24, 2014 10:42 PM

December 23, 2014

James Bottomley: UEFI Secure Boot Tools Updated for 2.4

UEFI 2.4 has been out for a while (18 months to be exact).  However, it’s taken me a while to redo the tianocore builds for the 2.4 base.  This is now done, so the OVMF packages are now building 2.4 roms

https://build.opensuse.org/package/show/home:jejb1:UEFI/OVMF

Please remember that this is bleeding edge.  I found a few bugs while testing the tools, so there are likely many more lurking in there.  I’ve also updated the tools package

https://build.opensuse.org/package/show/home:jejb1:UEFI/efitools

As of version 1.5.1, there’s a new generator for hashed revocation certificates and KeyTool now supports the timestamp signature database.  I’ve also published a version of sbsigntools

https://build.opensuse.org/package/show/home:jejb1:UEFI/sbsigntools

Which, as of 0.7, has support for multiple signatures.

December 23, 2014 03:09 AM

December 22, 2014

Pete Zaitcev: Cheating around taskotron in Fedora

Yesterday's ntp vulnerability uncovered a trick for Fedora maintainers. You know how it's super annoying that you cannot push an update to F20 without F21? You must herd updates and can never do them in parallel, or else taskotron ruins innocent updates. But at the time of this writing, the fixes are live in F20 but not in F21. How does Miroslav do it?

The answer is easy: he keeps ntp intentionally a few releases back in older Fedora (4.2.6p5-19 in F20), so he can bump it with impunity without regard to the newer Fedora (4.2.6p5-25 in F21). Of course, if someone were to upgrade to F21 today, he'd go from a fixed ntp to a broken ntp, but hey... at least the automated checks are defeated.

This challenge is similar to writing super-ugly OpenStack code that passes PEP8 checks, only the outcome is actually dangerous this time.

December 22, 2014 04:44 PM

December 19, 2014

Dave Jones: Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.


In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9; you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare." At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began, I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned, is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.


December 19, 2014 04:07 PM

December 16, 2014

Daniel Vetter: Neat drm/i915 stuff for 3.19

So kernel version 3.18 is out the door and it's time for our regular look at what's in the next merge window.
First, looking at new hardware, the big item is basic Skylake support. There are still a few small things missing, but mostly it's there now. This has been contributed by Damien, Satheeshakrishna and a lot of other folks. Looking at other platforms, there have also been a lot of changes for vlv/chv: improved backlight code, completely refactored interrupt handling to bring it in line with other platforms, and rewritten panel power sequencing code, all from Ville. Rodrigo contributed PSR support for vlv/chv together with a lot of other fixes for PSR. Unfortunately it's still not enabled by default.

Moving on to Broadwell and the render side of things, Mika and Arun provided patches to improve the render workaround code and bring the set of workarounds up to date. Execlists (the new command submission support on Gen8+) are also being polished with the addition of on-demand pinning of context objects, with patches from Thomas Daniel and Oscar Mateo. Finally, the RPS/render-turbo code has seen a lot of polish from Imre, with a few fixes from Tom O'Rourke.

Otherwise not a lot of really big things happened on the GEM side: Just a few patches to fix issues in ppgtt (unfortunately still not enabled by default anywhere due to fun with context switches). And there's a bit of prep work and reorg all over for new stuff landing hopefully soon.

Looking at overall infrastructure changes, the big thing is certainly the preparations for atomic display updates. The drm core/driver interface for atomic, and all the helper library code to convert drivers, has landed in 3.19, along with some initial conversions. On the Intel side it's been just prep work under the hood thus far, with patches from Ander to precompute display PLL state. The new code to use vblank evasion for pageflips has also landed, which is needed for atomic plane updates. And prep patches from Gustavo Padovan started to split the low-level plane update functions into check and commit steps. Lots more patches from different people are in flight, and some have been merged for 3.20 already.

Besides these driver-internal changes for atomic, there has been other work to improve the codebase: Imre reorganized our handlers for suspend, resume, thawing and freezing. Jani reworked the audio and ELD code, which is the gfx side of the puzzle needed to make audio over HDMI or DP work. Jesse provided patches to track infoframes more accurately, which is needed to correctly fastboot (i.e. without modesets if possible) on external screens.

For older machines Ville has spent a few spare cycles to make them more useful: GPU reset support for gen3/4 should mitigate some of the recent chromium crashes on mesa, and the modeset code on i830M might work correctly for the first time, ever.


And of course the usual pile of smaller fixes and improvements all over.

Not directly related to code or features is the start of documenting i915 driver internals: with this release we now have some of the interrupt handling, FIFO underrun reporting, frontbuffer tracking and runtime PM support newly documented. And there's lots more in flight, so hopefully this will soon be fairly useful.

December 16, 2014 09:03 PM

December 14, 2014

Eric Sandeen: Neptune RF water meter frequency hopping pattern

A couple of years ago, my water utility installed new remote-read water meters, Neptune e-Coder R900i, in every home in their service area. As any casual reader of this blog knows, I'm a big fan of measuring and data when it comes to household resource consumption. I've got electricity pretty well covered, but water and gas are still lacking. I could install secondary meters with pulse counters, but that seems silly -- I already have remote-read meters installed, it's just that the data they send isn't accessible to me. Let's see if we can start to remedy that!

Reading the product literature for this meter, we learn a few things right off the bat:

Transmitter Specifications:

Great!  We know nothing about the signal, but we know where to start looking to find it.  Is there any cheap hardware which could listen in on these frequencies?  Why yes, there is!

nooelec dvb tuner [amzn]

The rtl-sdr project has repurposed Digital Video Broadcasting (DVB) receivers as general-purpose software-defined radios.  We can control frequency and gain, and listen in on signals within the tuned bandwidth.

For now, we’re just trying to find the signals; we don’t yet care what they look like, we just want to know where they are so we can start listening.  To this end we can use the rtl_power utility from the rtl-sdr suite, which listens in on a given bandwidth, slices it into smaller buckets, and gives us time-stamped signal strength for each frequency bucket.  I used keenerd’s tree on github, as it has many bleeding-edge fixes and features (he was also supremely helpful on the IRC channel, thanks!)

However, the device can't listen in on the entire 910-920 MHz spectrum at once; its maximum bandwidth is about 2 MHz. So to listen across all 10 MHz of the published range would require sweeping, and we might miss the short signals the meter emits if we're sweeping the wrong range at the time it sends the signal. But that's OK; we know that the periodicity of the signal is once every 14 seconds, and that it hops across 50 frequencies. We don't know how random the sequence is across the 50 frequencies, but let's assume it's simple, and that it repeats every 14×50 seconds, or every 700 seconds. So let's try listening to dedicated 1 MHz bands, one at a time, for 700s each, and line up the results; essentially "binning" the signals we get into 50 14-second-aligned slots.

#!/bin/bash

# 50 channels, each for 14s; 50*14 is 700, gather 1 cycle at each freq

for I in `seq 910 1 919`; do
    let J=$I+1
    rtl_power -g 0.9 -p -39.906 -f ${I}M:${J}M:5000 -c 20% -i 1s -e 700s -E rtl_power-${I}M-${J}M-5000-i1.csv
done

This tells rtl_power to use gain 0.9 (pretty low; the antenna was very near the meter, and I don't want my neighbors' meters interfering), apply a frequency correction of -39.906 ppm, crop the gathered data by 20% (data at the edges of the bandwidth can be dodgy), record power every 1s, and stop after 700s. We record the range in 5 kHz bins, fairly fine-grained. The -E switch tells rtl_power to record timestamps as seconds since the epoch. We do this for each of the ten 1 MHz ranges between 910 MHz and 920 MHz in succession.

This gives us 10 CSV files, with one row per second and a time-stamped power reading for each frequency bin. Now what? We'd like to know what the meter is doing across the entire documented frequency range.

Assuming we were right about the sequence repeating every 700s, we can now just "horizontally concatenate" the frequency columns from each run. It's easy enough to use awk to strip out the timestamp data and retain only the frequency columns. But how to concatenate them? paste(1)! Did you know about paste? I did not. Paste is awesome for this.

After simple awk scripts gather only the frequency columns from each csv file, we combine them like so, using a comma as the delimiter:

paste -d , rtl_power-910M-911M-5000-i1.csv 911M-912M.csv 912M-913M.csv 913M-914M.csv 914M-915M.csv 915M-916M.csv 916M-917M.csv 917M-918M.csv 918M-919M.csv 919M-920M.csv > all.csv

Frikkin’ trivial!  And now we’d like to visualize it.  Turns out keenerd has a great tool for this too; heatmap.py.  Again, super easy:

heatmap.py all.csv > all.png

and voila!

That's the heatmap of power across frequencies and time; the yellow lines are the higher-energy emissions during the signals, and the white text is my own annotation of the time and frequency sequences. Looks amazingly orderly, no?

I actually ran the capture for 3 cycles after this, and confirmed that the cycle repeats the above pattern ad infinitum. We know that the time scale is once every 14s, but it's hard to tell from the image above what the frequencies might be; we could count pixels, or look into the CSV files for the frequency bins, but this DVB tuner isn't super accurate... so, looking at my meter, its FCC ID is P2SNTGECR900DL. And looking at the test report for that FCC filing, we find the measured transmitter frequencies.

They match extremely well; the lower frequency is stated in the test report as 911.06 MHz, and the upper as 919.08 MHz. It also states that the channel separation was measured at 131.58 kHz. So, counting the peaks above, counting up (17 peaks) or down (33 peaks) from the extremes, correlating to the sequence we measured, and sorting -- here is the table of frequencies and sequences for the Neptune e-Coder R900i RF water meter. Now that we know where to look, we can start investigating the signals. (Note: table below edited 12/21, spreadsheet thinko corrected)

Seq     Frequency
1       911.06
2       915.92
3       912.38
4       917.24
5       915.26
6       918.55
7       911.72
8       916.58
9       913.03
10      917.90
11      911.19
12      916.05
13      912.51
14      917.37
15      915.40
16      918.69
17      911.85
18      916.71
19      913.17
20      918.03
21      911.32
22      916.19
23      912.64
24      917.50
25      915.53
26      918.82
27      911.98
28      916.84
29      914.87
30      918.16
31      911.45
32      916.32
33      912.77
34      917.63
35      915.66
36      918.95
37      912.11
38      916.97
39      915.00
40      918.29
41      911.59
42      916.45
43      912.90
44      917.76
45      915.79
46      919.08
47      912.24
48      917.11
49      915.13
50      918.42
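As a cross-check (mine, not from the test report), every entry in the table should land on the 131.58 kHz channel grid anchored at 911.06 MHz:

BASE_MHZ = 911.06
STEP_MHZ = 0.13158

# Spot-check a few table entries against the grid from the FCC report.
for f in (911.06, 915.92, 912.38, 917.24, 919.08):
    k = round((f - BASE_MHZ) / STEP_MHZ)
    grid = BASE_MHZ + k * STEP_MHZ
    assert abs(grid - f) < 0.01, f   # within the table's rounding
    print(f"{f:.2f} MHz -> grid channel {k}")

The 50 hopping frequencies turn out to occupy a subset of the 62 grid slots between 911.06 and 919.08 MHz.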

December 14, 2014 04:46 AM

December 12, 2014

Pete Zaitcev: blitz2

You know how some people attach several monitors to one PC? I don't. I just have several PCs. But then I want copy-paste to work transparently (or as transparently as possible). For several years I used blitz to copy the clipboard around. It works well enough, but once you have 3 computers, it gets somewhat cumbersome to type the hostname. Also, it always bothered me how it rides on ssh authentication. I wanted something independent from ssh.

Behold blitz2. Instead of passing the clipboard to the host where it's needed directly, the clipboard is uploaded to an HTTP server. Seems more complex at first, but it's actually much better, because previously the PC where you copy had to authenticate to the PC where you paste. Now the authentication is symmetric. So, all clients are configured exactly the same, and all can upload and download the clipboard no matter who trusts what ssh keys.
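The moving parts are trivial enough to sketch.  A toy version of the idea in Python, with a made-up URL and token header standing in for whatever blitz2 actually uses:

import sys
import urllib.request

URL = "https://clip.example.net/clipboard"   # hypothetical server
TOKEN = "shared-secret"                      # same on every client

def copy(text):
    # Upload ("copy"): any host can do this, no host-to-host trust needed.
    req = urllib.request.Request(URL, data=text.encode(), method="PUT",
                                 headers={"X-Token": TOKEN})
    urllib.request.urlopen(req)

def paste():
    # Download ("paste"): symmetric -- same credentials, any host.
    req = urllib.request.Request(URL, headers={"X-Token": TOKEN})
    return urllib.request.urlopen(req).read().decode()

if __name__ == "__main__":
    if sys.stdin.isatty():
        print(paste(), end="")
    else:
        copy(sys.stdin.read())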

December 12, 2014 08:55 PM

November 29, 2014

Dave Airlie: is this a protocol? displaylink3

I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or is the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.

November 29, 2014 05:42 AM

November 20, 2014

Pavel Machek: fight with pulseaudio

On the Nokia n900, pulseaudio is needed for calls to work correctly. Unfortunately, that piece of software fights back.

pavel@n900:~$ pulseaudio --start
N: [pulseaudio] main.c: User-configured server at {d3b6d0d847a14a3390b6c41ef280dbac}unix:/run/user/1000/pulse/native, refusing to start/autospawn.
pavel@n900:~$

Ok, I'd really like to avoid the complexity of users here. Let me try as root.

root@n900:/home/pavel# pulseaudio --start
W: [pulseaudio] main.c: This program is not intended to be run as root (unless --system is specified).
N: [pulseaudio] main.c: User-configured server at {d3b6d0d847a14a3390b6c41ef280dbac}unix:/run/user/1000/pulse/native, refusing to start/autospawn.

Ok, I don't need per-user sessions, this is a cellphone. Let's specify --system.

root@n900:/home/pavel# pulseaudio --start --system
E: [pulseaudio] main.c: --start not supported for system instances.

Yeah, ok.

root@n900:/home/pavel# pulseaudio --system
W: [pulseaudio] main.c: Running in system mode, but --disallow-exit not set!
W: [pulseaudio] main.c: Running in system mode, but --disallow-module-loading not set!
N: [pulseaudio] main.c: Running in system mode, forcibly disabling SHM mode!
N: [pulseaudio] main.c: Running in system mode, forcibly disabling exit idle time!
W: [pulseaudio] main.c: OK, so you are running PA in system mode. Please note that you most likely shouldn't be doing that.
W: [pulseaudio] main.c: If you do it nonetheless then it's your own fault if things don't work as expected.
W: [pulseaudio] main.c: Please read http://pulseaudio.org/wiki/WhatIsWrongWithSystemMode for an explanation why system mode is usually a bad idea.

Totally my fault that someone forgot to document this pile of code. Thanks for blaming me. I'd actually like to read what is wrong with that, except that the page referenced does not exist. :-(.

November 20, 2014 09:18 PM

Paul E. Mc Kenney: Stupid RCU Tricks: rcutorture Catches an RCU Bug

My previous posting described an RCU bug that I might plausibly blame on falsehoods from firmware. The RCU bug in this post, alas, I can blame only on myself.

In retrospect, things were going altogether too smoothly while I was getting my RCU commits ready for the 3.19 merge window. That changed suddenly when my automated testing kicked out a “BUG: FAILURE, 1 instances”. This message indicates a grace-period failure, in other words that RCU failed to be RCU, which is of course really really really bad. Fortunately, it was still some weeks until the merge window, so there was some time for debugging and fixing.

Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug is difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours.

My first attempt at “something better” was to inspect the suspect patches. This effort did locate some needed fixes, but nothing that would explain the grace-period failures. My next attempt was to take a closer look at the dmesg logs of the two runs with failures, which produced much better results.

You see, rcutorture detects failures using the RCU API, for example, invoking synchronize_rcu() and waiting for it to return. This overestimates the grace-period duration, because RCU's grace-period primitives wait not only for a grace period to elapse, but also for the RCU core to notice their requests and also for RCU to inform them of the grace period's end, as shown in the following figure.

[Figure: RCUnearMiss.svg]

This means that a given RCU read-side critical section might overlap a grace period by a fair amount without rcutorture being any the wiser. However, some RCU implementations provide rcutorture access to the underlying grace-period counters, which in theory provide rcutorture with a much more precise view of each grace period's duration. These counters have long been recorded in the rcutorture output as “Reader Batch” counts, as shown on the second line of the following (the first line is the full-API data):


rcu-torture: !!! Reader Pipe:  13341924415 88927 1 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



This shows rcutorture output from a run containing a failure. On each line, the first two numbers correspond to legal RCU activity: An RCU read-side critical section might have been entirely contained within a single RCU grace period (first number) or it might have overlapped the boundary between an adjacent pair of grace periods (second number). However, a single RCU read-side critical section is most definitely not allowed to overlap three different grace periods, because that would mean that the middle grace period was by definition too short. And exactly that error occurred above, as indicated by the exclamation marks and the “1” in the third position of the “Reader Pipe” line.

If RCU were working perfectly, the output would instead look something like this, without the exclamation marks and without non-zero values in the third and subsequent positions:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 0 0 0 0 0 0 0 0 0



In practice, the “Reader Batch” counters were intended only for statistical use, and access to them is therefore unsynchronized, as indicated by the jagged grace-period-start and grace-period-end lines in the above figure. Furthermore, any attempt to synchronize them in rcutorture's RCU readers would incur so much overhead that rcutorture could not possibly do a good job of torturing RCU. This means that these counters can result in false positives, and false positives are not something that I want in my automated test suite. In other words, it is at least theoretically possible that we might legitimately see something like this from time to time:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



We clearly need to know how bad this false-positive problem is. One way of estimating the level of false-positive badness is to scan my three months' worth of rcutorture test results. This scan showed that there were no suspicious Reader Batch counts in more than 1,300 hours of rcutorture testing, aside from the roughly one per hour from the TREE03 test, which happened to also be the only test to produce real failures. This suggested that I could use the Reader Batch counters as indications of a “near miss” to guide my debugging effort. The once-per-hour failure rate suggested a ten-hour test duration, which I was able to compress into two hours by running five concurrent tests.

Why ten hours? Because at roughly one near miss per hour, a failure-free ten-hour run provides reasonable statistical confidence that a given change actually helped, rather than the test simply getting lucky.

Since TREE03 had been generating Reader Batch near misses all along, this was not a recent problem. Therefore, git bisect was more likely to be confused by unrelated ancient errors than to be of much help. I therefore instead bisected by configuration and by rcutorture parameters. This alternative bisection showed that the problem occurred only with normal grace periods (as opposed to expedited grace periods), only with CONFIG_RCU_BOOST=y, and only with concurrent CPU-hotplug operations. Of course, one possibility was that the bug was in rcutorture rather than RCU, but rcutorture was exonerated via priority-boost testing on a kernel built with CONFIG_RCU_BOOST=n, which showed neither failures nor Reader Batch near misses. This combination of test results points the finger of suspicion directly at the rcu_preempt_offline_tasks() function, which is the only part of RCU's CPU-hotplug code path that has a non-trivial dependency on CONFIG_RCU_BOOST.

One very nice thing about this combination is that it is very unlikely that people are encountering this problem in production. After all, for this problem to appear, you must be doing the following:


  1. Running a kernel with both CONFIG_PREEMPT=y and CONFIG_RCU_BOOST=y.
  2. Running on hardware containing more than 16 CPUs (assuming the default value of 16 for CONFIG_RCU_FANOUT_LEAF).
  3. Carrying out CPU-hotplug operations so frequently that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time.


That said, if this does describe your system and workload, you should consider applying this patch set. You should also consider modifying your workload.

Returning to the rcu_preempt_offline_tasks() function that the finger of suspicion now points to:


 1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
 2                                      struct rcu_node *rnp,
 3                                      struct rcu_data *rdp)
 4 {
 5   struct list_head *lp;
 6   struct list_head *lp_root;
 7   int retval = 0;
 8   struct rcu_node *rnp_root = rcu_get_root(rsp);
 9   struct task_struct *t;
10 
11   if (rnp == rnp_root) {
12     WARN_ONCE(1, "Last CPU thought to be offlined?");
13     return 0;
14   }
15   WARN_ON_ONCE(rnp != rdp->mynode);
16   if (rcu_preempt_blocked_readers_cgp(rnp) && rnp->qsmask == 0)
17     retval |= RCU_OFL_TASKS_NORM_GP;
18   if (rcu_preempted_readers_exp(rnp))
19     retval |= RCU_OFL_TASKS_EXP_GP;
20   lp = &rnp->blkd_tasks;
21   lp_root = &rnp_root->blkd_tasks;
22   while (!list_empty(lp)) {
23     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
24     raw_spin_lock(&rnp_root->lock);
25     smp_mb__after_unlock_lock();
26     list_del(&t->rcu_node_entry);
27     t->rcu_blocked_node = rnp_root;
28     list_add(&t->rcu_node_entry, lp_root);
29     if (&t->rcu_node_entry == rnp->gp_tasks)
30       rnp_root->gp_tasks = rnp->gp_tasks;
31     if (&t->rcu_node_entry == rnp->exp_tasks)
32       rnp_root->exp_tasks = rnp->exp_tasks;
33 #ifdef CONFIG_RCU_BOOST
34     if (&t->rcu_node_entry == rnp->boost_tasks)
35       rnp_root->boost_tasks = rnp->boost_tasks;
36 #endif
37     raw_spin_unlock(&rnp_root->lock);
38   }
39   rnp->gp_tasks = NULL;
40   rnp->exp_tasks = NULL;
41 #ifdef CONFIG_RCU_BOOST
42   rnp->boost_tasks = NULL;
43   raw_spin_lock(&rnp_root->lock);
44   smp_mb__after_unlock_lock();
45   if (rnp_root->boost_tasks != NULL &&
46       rnp_root->boost_tasks != rnp_root->gp_tasks &&
47       rnp_root->boost_tasks != rnp_root->exp_tasks)
48     rnp_root->boost_tasks = rnp_root->gp_tasks;
49   raw_spin_unlock(&rnp_root->lock);
50 #endif
51   return retval;
52 }



First, we should of course check the code under #ifdef CONFIG_RCU_BOOST. The code starting at line 43 is quite suspicious, as it is not clear why it is safe to make these modifications after having dropped the rnp_root structure's ->lock on line 37. And removing lines 43-49 does in fact reduce the number of Reader Batch near misses by an order of magnitude, but sadly not to zero.

I was preparing to dig into this function to find the bug, but then I noticed that the loop spanning lines 22-38 is executed with interrupts disabled. Given that the number of iterations through this loop is limited only by the number of tasks in the system, this loop is horrendously awful for real-time response.

Or at least it is now—at the time I wrote that code, the received wisdom was “Never do CPU-hotplug operations on a system that runs realtime applications!” I therefore had invested zero effort into maintaining good realtime latencies during these operations. That expectation has changed because some recent real-time applications offline then online CPUs in order to clear off irrelevant processing that might otherwise degrade realtime latencies. Furthermore, the large CPU counts on current systems are an invitation to run multiple real-time applications on a single system, so that the CPU hotplug operations run as part of one application's startup might interfere with some other application. Therefore, this loop clearly needs to go. So I abandoned my debugging efforts and focused instead on getting rid of rcu_preempt_offline_tasks() entirely, along with all of its remaining bugs.

The point of the loop spanning lines 22-38 is to handle any tasks on the rnp structure's ->blkd_tasks list. This list accumulates tasks that block while in an RCU read-side critical section while running on a CPU associated with the leaf rcu_node structure pointed to by rnp. This blocking will normally be due to preemption, but could also be caused by -rt's blocking spinlocks. This function is called only when the last CPU associated with the leaf rcu_node structure is going offline, after which there will no longer be any online CPUs associated with this leaf rcu_node structure. This loop then moves those tasks to the root rcu_node structure's ->blkd_tasks list. Because there is always at least one CPU online somewhere in the system, there will always be at least one online CPU associated with the root rcu_node structure, which means that RCU will be guaranteed to take proper care of these tasks.

The obvious question at this point is “why not just leave the tasks on the leaf rcu_node structure?” After all, this clearly avoids the unbounded time spent moving tasks while interrupts are disabled. However, it also means that RCU's grace-period computation must change in order to account for blocked tasks associated with a CPU-free leaf rcu_node structure.

In the old approach, the leaf rcu_node structure in question was excluded from RCU's grace-period computations immediately. In the new approach, RCU will need to continue including that leaf rcu_node structure until the last task on its list exits its outermost RCU read-side critical section. At that point, the last task will remove itself from the list, leaving the list empty, thus allowing RCU to ignore the leaf rcu_node structure from that point forward.

This patch set makes this change by abstracting an rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu(), which can then be called from rcu_read_unlock_special() when the last task removes itself from the ->blkd_tasks list. This change allows the rcu_preempt_offline_tasks() function to be dispensed with entirely. With this patch set applied, TREE03 ran for 1,000 hours with neither failures nor near misses, which gives a high degree of confidence that this patch set made the bug less likely to appear. This, along with the decreased complexity and the removal of a source of realtime latency, is a definite plus!

I also modified the rcutorture scripts to pay attention to the “Reader Batch” near-miss output, giving a warning if a single such near miss occurs in a run, and giving an error if two or more near misses occur at a rate of at least one near miss per three hours of test time. This should inform me should I make a similar mistake in the future.
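The check itself amounts to very little code.  A rough Python equivalent, assuming console-log lines in the format shown earlier (the real rcutorture scripts are shell, and their thresholds also factor in test duration):

import re
import sys

# Flag runs whose "Reader Batch" line has non-zero counts in the third
# or later positions -- the signature of a too-short grace period.
def near_misses(log_text):
    total = 0
    for line in log_text.splitlines():
        m = re.search(r"Reader Batch:\s+([\d ]+)", line)
        if m:
            counts = [int(c) for c in m.group(1).split()]
            total += sum(counts[2:])
    return total

n = near_misses(sys.stdin.read())
if n >= 2:       # real scripts also require at least one per three hours
    print("ERROR: %d Reader Batch near misses" % n)
elif n == 1:
    print("WARNING: one Reader Batch near miss")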

Given that RCU priority boosting has been in the Linux kernel for many years and given that I regularly run rcutorture with CPU-hotplug testing enabled on CONFIG_RCU_BOOST=y kernels, it is only fair to ask why rcutorture did not locate this bug years ago. The likely reason is that I only recently added RCU callback flood testing, in which rcutorture registers 20,000 RCU callbacks, waits a jiffy, registers 20,000 more callbacks, waits another jiffy, then finally registers a third set of 20,000 callbacks. This causes call_rcu() to take evasive action, which has the effect of starting the RCU grace period more quickly, which in turn makes rcutorture more sensitive to too-short grace periods. This view is supported by the fact that I saw actual failures only on recent kernels whose rcutorture testing included callback-flood testing.

Another question is “exactly what was the bug?” I have a fairly good idea of what stupid mistake led to that bug, but given that I have completely eliminated the offending function, I am not all that motivated to chase it down. In fact, it seems much more productive to leave it as an open challenge for formal verification. So, if you have a favorite formal-verification tool, why not see what it can make of rcu_preempt_offline_tasks()?

Somewhat embarrassingly, slides 39-46 of this Collaboration Summit presentation feature my careful development of rcu_preempt_offline_tasks(). I suppose that I could hide behind the “More changes due to RCU priority boosting” on slide 46, but the fact remains that I simply should not have written this function in the first place, whether carefully before priority boosting or carelessly afterwards.

In short, it is not enough to correctly code a function. You must also code the correct function!

November 20, 2014 07:18 PM

November 17, 2014

Pavel Machek: gcc trying to be helpful... in pretty unhelpful way

gcc tried to help me with figuring out a pulseaudio-module-cmtspeech-n9xx compilation failure... It says:

/lib/x86_64-linux-gnu/libz.so.1: error adding symbols: DSO missing from command line

To decrypt it, you should understand that a "DSO" is a shared library. The message means the binary uses symbols from libz without linking against it explicitly, so gcc wants you to add /lib/x86_64-linux-gnu/libz.so.1 (or simply -lz) to the final link command line. It took me a while to figure out...

November 17, 2014 08:45 PM

November 10, 2014

Eric Sandeen: Residential boiler monitoring via ModBus

Graph of boiler operation

We recently upgraded our boiler to a high-efficiency modulating/condensing Triangle Tube Prestige Trimax Solo, and in perusing the manual, I found that the boiler has a ModBus interface.  Woohoo, project!  The final result is live charting like you see above; for more details, read on!

Per the boiler docs, the ModBus interface is Modbus/RTU using RS-485 for the physical layer.  First thing, obviously, is to find an RS-485 adapter.  Fortunately, I found one super cheap (under $10) at Amazon, a “Kedsum USB to RS485 Converter” [amzn] and despite the cheap price and appearance, it seems to work just fine:

Cheapo USB Adapter at Amazon

The adapter only has A and B connections, and no GND, but it works fine.  I used a twisted pair from a CAT5 cable to connect to the A and B terminals on the boiler, plugged it into a spare Raspberry Pi, and was ready to test this thing.

Boiler ModBus Connections

Raspberry Pi w/ RS-485 Adapter

libmodbus makes communication simple; basically open the serial port and read addresses.  A little test program I wrote verified that things are working:

# ./tt-status -s /dev/ttyUSB0 
Status:
 DHW Mode
 Flame Present
 DHW Pump
Supply temp:		168 °F
Return temp:		145 °F
DHW Storage temp:	102 °F
Flue temp:		132 °F
Outdoor temp:		 35 °F
Flame Ionization:	 11 μA
Firing rate:		 25 %
Boiler Setpoint:	170 °F
CH1 Maximum Setpoint:	143 °F
DHW Setpoint:		125 °F

Ok!  So to get the pretty graph above, I obviously needed some data collection; for this I used collectd simply because it already had a ModBus plugin.  It did require some patching and hacking though; the non-bloggy details are on this page; the graph you see above was made by Visage.
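If you want to poke at registers by hand before (or instead of) wiring up collectd, Python works too.  A sketch using the third-party minimalmodbus package, with the slave address, register addresses and scaling all invented for illustration (the real map comes from the boiler's ModBus documentation):

import minimalmodbus

# Modbus/RTU over the USB RS-485 adapter; serial settings must match
# the boiler's.  All addresses below are placeholders.
boiler = minimalmodbus.Instrument("/dev/ttyUSB0", slaveaddress=1)
boiler.serial.baudrate = 38400            # assumption; check the manual

REGISTERS = {                             # name -> (address, scale)
    "Supply temp": (0x300, 0.1),
    "Return temp": (0x301, 0.1),
    "Firing rate": (0x302, 1.0),
}

for name, (addr, scale) in sorted(REGISTERS.items()):
    raw = boiler.read_register(addr, functioncode=4)  # read input register
    print("%s:\t%.1f" % (name, raw * scale))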

Overall this has been very useful; the boiler controls are fairly involved, but primarily we want to get the outdoor reset curve tweaked so that we get nice long runtimes and low return temps, which keep the boiler in its most efficient condensing mode.  My installer hit the default buttons and left, as contractors often do.  Seeing how the boiler was working over time definitely helped me tweak things to improve performance, efficiency and comfort.  Primarily, I drastically lowered the high end of the outdoor reset curve, which defaulted to 170°F for cast iron radiators, because we have much more radiation in this house than it needs at 170°F output.  I currently have the high end set to 140°F.  I may do another post on all that later… suffice it to say, doing a heat loss calculation and measuring radiation, and using that information to guide boiler setup, is not rocket science, and you really need to do it for the sake of comfort and efficiency.

collectd, at least with the rrd output, doesn’t keep fine-grained data around for long.  The current setup is useful for seeing how things behaved in the past couple days, but I’ll probably start logging fine-grained data to a database at some point, and can start doing fun things like using the firing rate to estimate gas usage over the month, look at therms per heating degree days in near real time, etc.
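The firing-rate-to-gas estimate, for instance, is just an integral.  A back-of-the-envelope sketch, where the 100,000 BTU/hr maximum input and the 10 s sample interval are placeholders for the boiler's actual rated input and the collectd polling period:

MAX_INPUT_BTU_HR = 100000   # placeholder: use the boiler's rated input
SAMPLE_SECONDS = 10         # placeholder: collectd polling interval

def therms(firing_rate_pct_samples):
    # Each sample contributes rate% of max input for one sample
    # interval; 1 therm = 100,000 BTU.
    btu = sum(pct / 100.0 * MAX_INPUT_BTU_HR * SAMPLE_SECONDS / 3600.0
              for pct in firing_rate_pct_samples)
    return btu / 100000.0

# A day at a steady 25% firing rate burns about 6 therms:
print(therms([25] * (24 * 3600 // SAMPLE_SECONDS)))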

Other boilers have ModBus interfaces as well; in particular Lochinvar and NTI both mention the capability, although I haven’t yet found documentation on what data is available.  If you have a boiler with ModBus and want to try this hack, drop me a line – I’d love to see how this looks in other homes, and I’d be glad to help you set up a collectd config file.  I may see about creating a Raspberry Pi image to make this all more or less work out of the box.

Again, take a look at the more detailed writeup for a bit more info, in particular the patches I used with the upstream tools, and if you try this at home & make it work, share what you find!

November 10, 2014 11:14 PM

November 09, 2014

Pavel Machek: Dialer for ofono?

I have stock Debian running on a Nokia n900, with the ofono stack (on 3.18-rc1 and nfsroot)... and would like some GUI dialer. There's none in the ofono project. mer had some, so I went to http://gitweb.merproject.org/gitweb ... but it gives me "service temporarily unavailable"... I was told to look at telepathy-ring, but that is not in Debian 7.7, and would have a lot of dependencies. Any ideas where to get the sources of the dialer from mer, or what other software to use?

And... Is there a recommended camera application in Debian? I'd like to test that the drivers still work...

November 09, 2014 09:46 PM

November 07, 2014

Dave Airlie: more on Displaylink3 and HDCP encryption

okay another braindump (still nothing working).

The git repo mentioned in the previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated a HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet in the USB traces contains 384 bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try and use this fact. So I've taken a trace I made from Windows, extracted the necessary bits, and using that I've managed to derive the master key used in that trace, and subsequently managed to derive the session key for it. So I've replayed the first encrypted packet from the trace to the device and got an encrypted response the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me, in theory I should be able to replay the full trace (haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official hdcp lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point; now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space; probably impossible.
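To make the guessing concrete, here is the decryption attempt sketched in Python with pycryptodome.  The key really is ks XOR lc128 per the spec, but the counter-block layout below (riv XOR'd with streamctr in the top half, inputctr counting blocks in the bottom half) is my loose reading and may well be wrong; treat it as a starting point for experiments, nothing more:

from Crypto.Cipher import AES

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def hdcp_decrypt(ct, ks, lc128, riv, streamctr=0, inputctr=0):
    key = xor(ks, lc128)               # 128-bit AES key, per the spec
    aes = AES.new(key, AES.MODE_ECB)   # CTR mode built by hand so the
    out = b""                          # guessed block layout is explicit
    for off in range(0, len(ct), 16):
        hi = xor(riv, streamctr.to_bytes(8, "big"))        # guessed
        lo = (inputctr + off // 16).to_bytes(8, "big")     # guessed
        out += xor(ct[off:off + 16], aes.encrypt(hi + lo))
    return out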

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

November 07, 2014 04:20 AM

November 03, 2014

Daniel Vetter: Atomic Modeset Support for KMS Drivers

So I've just reposted my atomic modeset helper series, and since the main goal of all that work was to ensure a smooth and simple transition for existing drivers to the promised atomic land, it's time to elaborate a bit. The big problem is that the existing helper libraries and callbacks to driver backends don't really fit the new semantics, so some shuffling was required to avoid long-term pain. So if you are a driver writer and just interested in the details, then read on for what needs to be done to support atomic modeset updates using these new helper libraries.

Phase 1: Reworking the Driver Backend Functions for Planes

The first phase is reworking the driver backend callbacks to fit the new world. There are two big mismatches between the new atomic semantics and legacy ioctl interfaces:

Both issues are addressed by adding new driver backend callbacks. Furthermore a few transitional helper functions are provided to implement the legacy entry points in terms of these new callbacks. That way the driver backend can be reworked without the additional hassle of needing to deal with all the atomic state object handling and check/commit semantics.

The first step is to rework the ->disable/update_plane hooks using the transitional helper implementations drm_plane_helper_update/disable. These need the following new driver callbacks:
With this it's also easy to implement universal plane support directly, instead of with the default implementation which doesn't allow the primary plane to be disabled. Universal planes are a requirement for atomic and need to be implemented in phase 1, but testing the primary plane support is also a good preparation for the next step:

The new crtc->mode_set_nofb callback must be implemented; it just updates the CRTC timings and data in the hardware without touching the primary plane state at all. The provided helper functions drm_helper_crtc_mode_set and drm_helper_crtc_mode_set_base then implement the callbacks required by the CRTC helpers in terms of the new ->mode_set_nofb callback and the newly implemented plane helper callbacks above.

Phase 2: Wire up the Atomic State Object Scaffolding

With the completion of phase 1 all the driver backend functions have been adapted to the new requirements of the atomic helper library. The goal of phase 2 is to get all the state object handling needed for atomic updates into place. There are three steps to that:

Phase 3: Rolling out Atomic Support

With the driver backend changes from phase 1 and the state handling changes from phase 2, everything is ready for the step-by-step rollout of atomic support. Presuming nothing was missed, this just consists of wiring up the ->atomic_check and ->atomic_commit implementations from the atomic helper library, and then replacing all the legacy entry points with the corresponding functions from the atomic helper library to implement them in terms of atomic.

The recommended order is to start with planes first, then test the ->set_config functionality. Page flips and properties are best done later since they likely need some additional work:

Besides these two complications (which might require a bit of work depending upon the driver) this is all that's needed for full atomic modeset and pageflip support.

Follow-up Driver Cleanups

But there's of course quite a bit of cleanup work possible afterwards!

There are some big differences between the old CRTC helper modeset logic and the new one (using the same callbacks, but completely rewritten otherwise) in the atomic helper library:

These are all lessons learned from the i915 modeset rewrite. The only thing missing in the atomic helpers compared to i915 is the state readout and cross-checking support - everything else is there. But even that can be easily implemented by adding hardware state readout callbacks and using them in the various state reset functions (to reconstruct matching software state) and also to cross-check state.

The other big cleanup task is to stop using all the legacy state variables and switch all the driver backend code to only look at the state object structures. The two big examples here are crtc->mode and the plane->fb pointer.

So What Now?

With all that converting drivers should be simple and can be done with a series of not-too-invasive refactorings. But my patch series doesn't yet contain the actual atomic modeset ioctl. So what's left to be done in the drm core?

So still a few things to do, besides adding atomic support to all drivers.

Update: The explanation for how to implement state readout and cross checking was a bit confused, so I reworded that.

November 03, 2014 10:23 PM

Dave Jones: Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump: a crashdump mechanism that uses kexec to switch to a different kernel before writing out memory to disk, nfs, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Someone every single RHEL release invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty, every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it another try over the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.


November 03, 2014 04:02 PM

November 02, 2014

Daniel Vetter: Neat drm/i915 stuff for 3.18

Since Dave Airlie moved the feature cut-off of the drm-next tree roughly one month ahead, it is already time for our regular look at what's ahead, even though the 3.17 features aren't even released yet.

On the modeset side of things we now have the final pieces for plane rotation support from Sonika Jindal and Ville. The DisplayPort code has also seen lots of improvements, with updated training values in preparation for the latest eDP standard (Sonika Jindal) and support for DP training pattern 3 (Ville). DSI panels now support burst mode (Shobhit) and HDMI conformance has been improved with some fixes from Clint Taylor.

For eDP panels we also have improved panel power sequencing code, mostly to fix issues on Cherryview, from Ville. Ville has also contributed fixes to the VDD handling code, which is used to temporarily enable panel power. And the backlight code learned to handle the bl_power setting so that the backlight can be turned off completely without upsetting the panel's power sequencing, contributed by Jani.

Chris Wilson has also been fairly busy on the modeset code: 3.18 includes his patches to cache EDIDs for a single probe call - unfortunately the full caching solution to keep the EDID around between multiple probe calls isn't merged yet. And pageflips have now improved error detection and recovery logic: In case something goes wrong we shouldn't end up stuck any longer waiting for a pageflip to complete that has been lost by either the hardware or the driver.

Moving on to platform-specific work, there's been lots of preparation for Skylake, most of it from Damien and Sonika. The actual initial platform enabling is delayed until 3.19 though. On the other end of the timeline, Ville fixed up i830M modeset support on a rainy weekend during his vacation, and 3.18 now has all that code. And there have been a lot of Cherryview fixes all over.

Cherryview also gained support for power wells and hence runtime pm (Ville). And on the platform-agnostic front, a lot of the preparation for DRRS (dynamic refresh rate switching) is merged; hopefully the actual feature patches from Vandana Kannan will land in 3.19.

Moving on to the render side of the driver, there's been a lot of patches to beat the full ppgtt support into shape. The context code has been cleaned up, lifetime handling for ppgtt address spaces is fixed, and bad interactions with secure batches are now also rectified. Enabling full ppgtt missed the feature cutoff by a hair though, but it's already lined up for the following release.

Basic support for execlists command submission from Ben Widawsky, Oscar Mateo and Thomas Daniel was also merged. This is the fancy new way to submit commands available on Gen8 and subsequent platforms. It's not yet enabled by default, but since it's a requirement for a lot of cool new features, keep an eye on what's going on here. There is also a lot of work going on underneath to enable all this new code in GEM, like preparing to switch away from sequence numbers to tracking gpu progress more abstractly using the driver's request structures.

And this time around there is also some cool stuff going on in the drm core worthy of a shout-out: The vblank handling code is massively revamped, hopefully plugging all the small races, inconsistencies and inefficiencies in that code. And thanks to David Herrmann it is finally possible to write a drm driver without the drm midlayer getting completely in the way of a proper driver load and unload sequence! Unfortunately i915 can't be converted right away since the legacy usermodesetting code crucially relies on this midlayer functionality. But that's now well deprecated and hopefully can be removed in one of the next releases.

November 02, 2014 02:31 PM

October 31, 2014

Dave Airlie: a day with DisplayLink USB3 and HDCP

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Notes: HDCP 1.0 being broken doesn't matter to this; maybe HDCP v2.0 being a bit broken could be used, but I'm not sure how!

The displaylink USB3 protocol is based on the HDCP protocol. I've traced the first few packets and it clearly looks like the host sends two packets

AKE_Init,
AKE_Transmitter_Info

and the device sends back
AKE_Send_Cert

at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, which you have to verify.

So the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the cert using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing. If you are openssl knowledgeable and want to look, the hack fest is
http://cgit.freedesktop.org/~airlied/dl3dev/

It might be that the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.

Now once I get past this hurdle, a larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128 that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the displaylink driver in hex, but I'd hope they at least hide it better than that. It may also possibly be supplied by the OS, Windows or OSX (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that Displaylink allow non-HDCP encrypted data to be sent to the device, in which case I win if I can find out where/how to do that, or it might be that the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P

October 31, 2014 06:25 AM

October 30, 2014

Matthew Garrett: Hacker News metrics (first rough approach)

I'm not a huge fan of Hacker News[1]. My impression continues to be that it ends up promoting stories that align with the Silicon Valley narrative (meritocracy, technology will fix everything, regulation is the cancer killing agile startups) and discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves.

But as a good data-driven person[2], wouldn't it be nice to have numbers rather than just handwaving? In the absence of a good public dataset, I scraped Hacker Slide to get just over two months of data in the form of hourly snapshots of stories, their age, their score and their position. I then applied a trivial test:

  1. If the story is younger than any other story
  2. and the story has a higher score than that other story
  3. and the story has a worse ranking than that other story
  4. and at least one of these two stories is on the front page
then the story is considered to have been penalised.

(note: "penalised" can have several meanings. It may be due to explicit flagging, or it may be due to an automated system deciding that the story is controversial or appears to be supported by a voting ring. There may be other reasons. I haven't attempted to separate them, because for my purposes it doesn't matter. The algorithm is discussed here.)

Now, ideally I'd classify my dataset based on manual analysis and classification of stories, but I'm lazy (see [2]) and so just tried some keyword analysis:
Keyword    Penalised  Unpenalised
Women      13         4
Harass     2          0
Female     5          1
Intel      2          3
x86        3          4
ARM        3          4
Airplane   1          2
Startup    46         26

A few things to note:
  1. Lots of stories are penalised. Of the front page stories in my dataset, I count 3240 stories that have some kind of penalty applied, against 2848 that don't. The default seems to be that some kind of detection will kick in.
  2. Stories containing keywords that suggest they refer to issues around social justice appear more likely to be penalised than stories that refer to technical matters.
  3. There are other topics that are also disproportionately likely to be penalised. That's interesting, but not really relevant - I'm not necessarily arguing that social issues are penalised out of an active desire to make them go away, merely that the existing ranking system tends to result in it happening anyway.

This clearly isn't an especially rigorous analysis, and in future I hope to do a better job. But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues. An interesting next step would be to attempt to infer whether the reasons for the penalties are similar between different categories of penalised stories[3], but I'm not sure how practical that is with the publicly available data.

(Raw data is here, penalised stories are here, unpenalised stories are here)


[1] Moving to San Francisco has resulted in it making more sense, but really that just makes me even more depressed.
[2] Ha ha like fuck my PhD's in biology
[3] Perhaps stories about startups tend to get penalised because of voter ring detection from people trying to promote their startup, while stories about social issues tend to get penalised because of controversy detection?


October 30, 2014 06:18 PM

Matthew Garrett: Linux Container Security

First, read these slides. Done? Good.

(Edit: Just to clarify - these are not my slides. They're from a presentation Jerome Petazzoni gave at Linuxcon NA earlier this year)

Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment[1].

Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.

I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace[2].

So that seems like kind of a wash - there's maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.

I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input[3]. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there's tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).

But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:


These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.

[1] Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
[2] There's apparently some support for loading per-namespace Apparmor policies, but that means that the process is no longer confined by the sVirt policy
[3] To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.


October 30, 2014 01:11 AM

Matthew Garrett: On joining the FSF board

I joined the board of directors of the Free Software Foundation a couple of weeks ago. I've been travelling a bunch since then, so haven't really had time to write about it. But since I'm currently waiting for a test job to finish, why not?

It's impossible to overstate how important free software is. A movement that began with a quest to work around a faulty printer is now our greatest defence against a world full of hostile actors. Without the ability to examine software, we can have no real faith that we haven't been put at risk by backdoors introduced through incompetence or malice. Without the freedom to modify software, we have no chance of updating it to deal with the new challenges that we face on a daily basis. Without the freedom to pass that modified software on to others, we are unable to help people who don't have the technical skills to protect themselves.

Free software isn't sufficient for building a trustworthy computing environment, one that not merely protects the user but respects the user. But it is necessary for that, and that's why I continue to evangelise on its behalf at every opportunity.

However.

Free software has a problem. It's natural to write software to satisfy our own needs, but in doing so we write software that doesn't provide as much benefit to people who have different needs. We need to listen to others, improve our knowledge of their requirements and ensure that they are in a position to benefit from the freedoms we espouse. And that means building diverse communities, communities that are inclusive regardless of people's race, gender, sexuality or economic background. Free software that ends up designed primarily to meet the needs of well-off white men is a failure. We do not improve the world by ignoring the majority of people in it. To do that, we need to listen to others. And to do that, we need to ensure that our community is accessible to everybody.

That's not the case right now. We are a community that is disproportionately male, disproportionately white, disproportionately rich. This is made strikingly obvious by looking at the composition of the FSF board, a body made up entirely of white men. In joining the board, I have perpetuated this. I do not bring new experiences. I do not bring an understanding of an entirely different set of problems. I do not serve as an inspiration to groups currently under-represented in our communities. I am, in short, a hypocrite.

So why did I do it? Why have I joined an organisation whose founder I publicly criticised for making sexist jokes in a conference presentation? I'm afraid that my answer may not seem convincing, but in the end it boils down to feeling that I can make more of a difference from within than from outside. I am now in a position to ensure that the board never forgets to consider diversity when making decisions. I am in a position to advocate for programs that build us stronger, more representative communities. I am in a position to take responsibility for our failings and try to do better in future.

People can justifiably conclude that I'm making excuses, and I can make no argument against that other than to be asked to be judged by my actions. I hope to be able to look back at my time with the FSF and believe that I helped make a positive difference. But maybe this is hubris. Maybe I am just perpetuating the status quo. If so, I absolutely deserve criticism for my choices. We'll find out in a few years.


October 30, 2014 12:45 AM

October 27, 2014

Paul E. Mc Kenney: Lies that firmware tells RCU

One of the complaints that real-time people have against some firmware is that it lies about its age, attempting to cover up cycle-stealing via SMIs by reprogramming the TSC. Some firmware goes farther and lies about the number of CPUs on the system, apparently on the grounds that more is better, regardless of how many of those alleged CPUs actually exist.

RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. On some systems, this resulted in hundreds of such kthreads on systems with only a few tens of CPUs. But RCU can choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!

Mostly solved, that is.

Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in _rcu_barrier(). So what is rcu_barrier()?

The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses call_rcu(), as described in this LWN article. It is important to note that rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.

However, it is possible to offload invocation of a given CPU's RCU callbacks to rcuo kthreads, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.

In fact, rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.

This approach works quite well in practice.

At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.

The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.
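The failure mode and the fix are easy to model outside the kernel.  A toy Python model (emphatically not kernel code) of the logic just described:

# Callbacks enqueued for an offloaded CPU are only ever invoked by that
# CPU's rcuo kthread, so a CPU that never comes online never drains them.
advertised_cpus = range(8)       # what the firmware claims
spawned_kthreads = {0, 1, 2, 3}  # CPUs that actually came online

def rcu_barrier(skip_never_online):
    """Return True if the barrier completes, False if it hangs."""
    pending = []
    for cpu in advertised_cpus:
        if skip_never_online and cpu not in spawned_kthreads:
            continue   # the fix: no rcuo kthread, nothing to wait for
        pending.append(cpu)
    return all(cpu in spawned_kthreads for cpu in pending)

print(rcu_barrier(skip_never_online=False))  # False: waits forever
print(rcu_barrier(skip_never_online=True))   # True: barrier completes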

With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of for_each_possible_cpu(), so here is hoping! ;-)

October 27, 2014 10:09 PM

James Morris: Linux Security Summit 2014 Wrap-Up

The slides from the 2014 Linux Security Summit in August may be found linked at the schedule.

LWN covered both the James Bottomley keynote, and the SELinux on Android talk by Stephen Smalley.

We had an engaging and productive two days, with strong attendance throughout.  We’ll likely follow a similar format next year at LinuxCon.  I hope we can continue to expand the contributor base beyond mostly kernel developers.  We’re doing ok, but can certainly do better.  We’ll also look at finding a sponsor for food next year.

Thanks to those who contributed and attended, to the program committee, and of course, to the events crew at Linux Foundation, who do all of the heavy lifting logistics-wise.

See you next year!

October 27, 2014 12:56 PM

October 23, 2014

Dave Jones: Trinity and pages of random data.

Something trinity uses a lot is pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’. A single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnants of the supporting code was some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After todays work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mappings after an mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code and unifying a lot of the support code is that a lot of code paths get simpler, and, more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.
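
As a rough illustration of why those ‘len’ arguments matter, here is a sketch (mine, not trinity’s actual code) of a unified mapping struct with a length-aware dirtying routine:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: a unified mapping struct plus a dirtying helper
 * that knows how big the mapping really is. */
struct map {
        void *ptr;
        size_t size;
};

static void dirty_mapping(struct map *map, size_t len)
{
        /* Clamp to the real size, so a too-large len can no longer
         * cause the overruns the old page_rand code suffered from. */
        if (len > map->size)
                len = map->size;

        memset(map->ptr, rand() & 0xff, len);
}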

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1] [2]. Grumble.

“Trinity and pages of random data” is a post from: codemonkey.org.uk

October 23, 2014 02:33 AM

October 19, 2014

Pete Zaitcev: Laptop bleg

I'm considering a laptop (actually two). Requirements:

Where it comes from is mostly my wife's Sony Vaio Z. I used to have a Z back in 2001 or so, when they were in 12" format. It was the best laptop ever, but unfortunately it succumbed to a DC-DC converter failure. The modern Z is not like that Z. The most super annoying problem is that the screws holding the battery failed in an interesting way: it is impossible to remove the battery now. Also, the contact between the battery and the motherboard is marginal. I managed to fix the problem by manufacturing a finely shaped wooden wedge that I drove into a gap and thus extended the life of that thing, but man, Sony, this is disappointing.

Unfortunately, I don't remember if it was Kota or Daisuke, but one of the Japanese guys at a recent Swift Hackathon in Boston had a Z of a similar vintage, and it looked impeccable. Maybe Sony figured that this is going to be the predominant mode of care that their wares receive, and so why not make the modern Z that much cheaper than the old, indestructible Z. But they still charge exorbitant prices.

Lenovo wins a special notice because I had a T400 for 3 years and swore to never deal with them ever again. The biggest problem is the keyboard layout, because I use my left pinky for the Control key. I could live with their idiotic placement of Escape, but I refuse to deal with 3 years of physical pain again. Also, their famous quality seems to be slipping, as my mouse button broke within 3 years. The battery died, too. However, the T400 had a very good display, and I would like another like that, if possible.

October 19, 2014 04:29 PM

October 18, 2014

Pavel Machek: N900 nfs root

So you'd like to develop on the Nokia N900... It has a serial port, but with an "interesting" connector. It has a keyboard, but with an "interesting" keyboard map, so you mostly need full X for it to be useful... and it is too small for serious typing, anyway. You could put the root filesystem on an SD card, but that is disconnected when the back cover is removed. And with the back cover in place, you can't reset the machine.

Ok, so NFS. Insecure and tricky to set up, but it actually makes development usable. I started with commit 4f3e8d263^ (because that should have working USB networking according to the mailing lists), and with the config from the same page. The disadvantage is that video does not work with that configuration... but setting up a system blind should not be that hard, right?

Assembling a minimal system with busybox so I could run the second stage of debootstrap was tricky, and hacking into the resulting Debian was not easy either, but now I have telnet connections and things should only improve.
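
(For reference, pointing the kernel at an NFS root over USB networking is done on the kernel command line; something along these lines, where the server address, export path, and usb0 addresses are placeholders for whatever your setup uses:

root=/dev/nfs rw nfsroot=192.168.2.14:/srv/n900,tcp,v3 ip=192.168.2.15:192.168.2.14::255.255.255.0::usb0:off

The ip= fields are client, server, gateway, netmask, hostname, device, and autoconf, per the kernel's nfsroot documentation.)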

October 18, 2014 07:43 PM

Dave Jones: Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscall arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example was the code where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator, I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. A fairly quick rewrite took care of that.
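
The shape of that rewrite, roughly (a sketch of the idea, not trinity’s actual code): every rendering step knows where the buffer ends and reports how far it got, so nothing can write past the end.

#include <stdio.h>

/* Sketch only: append one rendered argument without ever writing
 * past 'end'; returns the new write position. */
static char *render_arg(char *pos, char *end, unsigned long reg)
{
        int n = snprintf(pos, end - pos, " 0x%lx", reg);

        if (n < 0 || pos + n >= end)
                return end;     /* output truncated; caller stops */
        return pos + n;
}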

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the timestamp at which a child performed its last syscall, and check one second later to make sure it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did; that takes a maximum offset of 2145. The code checked for that, but forgot to also add the one second since the last time we checked.
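
In other words, the corrected check looks something like this sketch (the constant and the names come from the description above, not from the trinity source):

#include <stdbool.h>
#include <time.h>

#define ADJTIMEX_MAX_OFFSET 2145  /* biggest jump adjtimex can cause */
#define WATCHDOG_INTERVAL   1     /* seconds between watchdog checks */

/* The timestamp is only "impossible" if it is further in the future
 * than the adjtimex allowance PLUS the interval since we last looked;
 * forgetting the latter was the bug. */
static bool timestamp_corrupt(time_t last_syscall, time_t now)
{
        return last_syscall > now + ADJTIMEX_MAX_OFFSET + WATCHDOG_INTERVAL;
}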

There’s been a bunch of small 1-2 fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: the only interesting bugs Trinity has turned up this week have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed in trinity’s post-mortem code. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fds in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.

“Trinity updates” is a post from: codemonkey.org.uk

October 18, 2014 03:11 PM

October 10, 2014

Grant Likely: git.secretlab.ca is down

For anyone who has been using git.secretlab.ca, the server is currently down and I don’t know when it will be back up. I’ve moved my Linux kernel tree over to kernel.org. The new tree can be found here:

https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git

The other trees will be back when git.secretlab.ca returns to life.

October 10, 2014 01:53 PM

October 06, 2014

Valerie Aurora: Operating systems war story: How feminism helped me solve one of file systems’ oldest conundrums

[Photo: Valerie Aurora in 2009, pink-haired and smiling, in a t-shirt with the word “O_PONIES” in Courier font (keep reading to find out why). CC BY-NC-SA Robert Kaye]

Hi, my name is Valerie Aurora, and I am the inventor of a software feature that has prevented billions of unnecessary writes to hard drives, saving energy and making our computers faster. My invention is called “relative atime,” and this is the story of how my feminist approach to computing helped me invent it – and what you can do to support women in open source software. (If you’re already convinced we need more women in open source, here’s a link to donate now to the Ada Initiative’s 2014 fundraising drive. My operating systems war story will still be here when you’re finished!)

Donate now

First, a little background for those of you who don’t live and breathe UNIX file systems performance. Ingo Molnar once called the access time, or “atime,” feature of UNIX file systems “perhaps the most stupid Unix design idea of all times.” That’s harsh but fair. See, every time you read a file on a UNIX operating system – which includes OS X, Linux, and Android[1] – it is supposed to update the file to record the last time it was read, or accessed. This is called the access time, or atime. Cool, right? You can imagine why it’s helpful to know the last time anything read a particular file – you can tell if you have new mail, for example, or figure out which files you haven’t used in a while and can throw away.

The problem with the atime feature is that updating atime requires writing to the disk. So every read of a file creates a tiny disk write – and writes are expensive and slow. (SSDs don’t get rid of this problem; you still don’t want to do unnecessary writes, and most of the world’s data is still on spinning disks.) Here’s what Ingo said about this in 2006: “Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_.”

So, atime is a terrible idea – why don’t we just turn it off? That’s what many people did, using the “noatime” option that many file systems provide. The problem was that many programs did need to know the atime of a file to work properly. So most Linux distributions shipped with atime on, and it was up to the user to remember to turn it off (if they could). It was a bad situation.

[Image: the LinuxChix logo, a cartoon of a woman driving a robot penguin]

In 2006, I was a Linux file systems developer and also an active member of LinuxChix, a group for women who used Linux. LinuxChix existed in part because it was impossible to have technical discussions about Linux on most mailing lists without people insulting and flaming you for asking the simplest questions – and it was ten times worse for people with feminine usernames. Tell a cautionary story about installing RAM correctly, and the response might be a sneering, “Oh, you didn’t let out the magic smoke, did you?” On LinuxChix, that kind of obnoxiousness wasn’t allowed (though we still got a lot of what is now called mansplaining.)

So when I advised several people in LinuxChix to turn off atime, a friend felt safe telling me that hey, performance on her laptop was better, but now Mutt, the email reader we both used, thought she always had new email. This was because, in her configuration, Mutt would look at an email file and compare its atime with the file’s last written time to figure out if any new email had arrived since the last time it read the file.

Now, the typical answer to “Mutt doesn’t work with noatime” was “Switch to a slower directory-based method,” or “Use a file size hack that had bugs,” or any number of other unhelpful things. Mostly, people just wouldn’t bother reporting things that broke with noatime. But I was part of a culture – a feminist culture – in which I respected people like my friend, and programmers who attempted to use fully defined, useful features of UNIX in order to implement features efficiently.

I decided to look at the problem from a human point of view. What my friend and the Mutt programmers really wanted to know was this: Has this file been written since the last time I read it? They didn’t particularly care about the exact time of the last read, they just wanted to know if it had been read before or after the last write. I had an idea: What if we only updated a file’s atime if it would change the answer to the question, “Has this file been read since the last time it was written?” I called it “relative atime.”[2]

The amazing thing is: it worked! Matthew Garrett (also a known feminist), Ingo Molnar, and Andrew Morton made some changes to the patch, including updating the atime if the current atime was more than 24 hours old. Other than that, this incredibly simple algorithm worked well enough that in 2009, relative atime became the default in the mainline Linux kernel tree. Now, by default, people’s computers were fast and their programs worked.
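
The heart of the idea fits in a few lines. Here is a simplified sketch that follows the description above (the kernel’s real implementation lives in the VFS code and differs in detail; types here are simplified):

#include <stdbool.h>
#include <time.h>

/* Simplified sketch of the relative atime test: update atime only if
 * doing so changes the answer to “has this file been read since it was
 * last written?”, or if the stored atime is more than a day stale. */
static bool relatime_need_update(time_t atime, time_t mtime,
                                 time_t ctime, time_t now)
{
        if (mtime >= atime)     /* written since last recorded read */
                return true;
        if (ctime >= atime)     /* metadata changed since last read */
                return true;
        if (now - atime >= 24 * 60 * 60)        /* cap staleness */
                return true;
        return false;
}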

I came up with this idea and the original patch in 2006, when the atime problem had been known for many years. Previous solutions had taken a very file-system-centric point of view, mainly along the lines of buffering up atime updates in memory and writing them out when we ran out of memory. What led me to a creative, simple, and extremely fast solution was being part of a feminist community in which people felt comfortable sharing their technical problems, wanted to help each other, and respected each other’s intelligence. Those are all feminist principles, and they make file systems development better.

I try to take that human-centered, feminist approach to other topics in file systems, including the great fsync()/rename() debate of 2009 (a.k.a. “O_PONIES”), in which I argued that file systems developers should strive to make life easier for developers and users, not harder. As recently as 2013, a leading file systems developer was still arguing that file systems didn’t have to save file data reliably, by mocking users for playing computer games.

I was working on another human-centered file system feature, union mounts, when I heard that a friend of mine had been groped at an open source conference for the third time in one year. While I loved my file systems work, I felt like stopping sexual harassment and assault of women in open source was more urgent, and that I was uniquely qualified to work on it. (I myself had been groped by another Linux storage developer.) So I quit my job as a Linux kernel developer and co-founded the Ada Initiative, whose mission is supporting women in open technology and culture. Unfortunately, as a result of my work, several more Linux storage developers came out publicly in favor of harassment and assault.

That’s one reason why I’m so excited that Ceph developer Sage Weil challenged the open storage community to raise $8192 for the Ada Initiative by Wednesday, Oct. 8 – and he’ll personally match that amount if we reach the goal! UPDATED: Sage and Mike Perez raised this to $16,384!!! The number of Linux file systems and storage developers who both donated to Sage’s challenge and wanted to be listed publicly as supporters is reminding me that the vast majority of the people I worked with in Linux want women to feel safe and comfortable in their community. I love file systems development, I love writing kernel code, and I miss working with and seeing my Linux friends. And as you can tell by the lack of something like union mounts in the mainline kernel 21 years after the first implementation, Linux file systems and storage does not have enough developers, and can’t afford to keep driving off women developers.

[Photo: me teaching the Ally Skills Workshop, sitting at a table explaining something with my hands]

The Ada Initiative is capable of changing this situation. In August 2014, I taught the first Ally Skills Workshop at a Linux Foundation-run conference, LinuxCon North America. The Ally Skills Workshop teaches men simple everyday ways to support women in their workplaces and communities, and teaching it is my favorite part of my work. I was happy to see several Linux file systems and storage developers at the workshop. I was still nervous about running into the developers who support harassment and assault, but seeing how excited people were after the Ally Skills Workshop made it all worthwhile.

If you’d like to see more people working on Linux storage and file systems, and especially more women, please join Sage Weil and more than 30 other open storage developers in supporting the Ada Initiative. Donate now:

Donate now

Edited to add 10/6/2014: Sage made his goal, hurray! And here’s my favorite comment from the HN thread about this story, the only one actually flagged into non-existence (plenty of other creepy misogyny elsewhere though):

[Screenshot of the flagged Hacker News comment]

Also, I had no idea Lennart Poettering planned to post this detailed description of the abuse, harassment, and death threats he’s suffered as an open source developer.

We’re still raising money for Ada Initiative to fight this kind of harassment, so feel free to donate:

Donate now

[1] Yes, Android is Linux too, I’m just naming the brands that non-operating systems experts would recognize.

[2] “Relative atime” isn’t so bad, but the name of the option that you pass to the kernel, “relatime”, showed my usual infelicity with naming things as it looks like a misspelling of “realtime”.



October 06, 2014 09:36 PM

October 02, 2014

Matthew Garrett: Actions have consequences (or: why I'm not fixing Intel's bugs any more)

A lot of the kernel work I've ended up doing has involved dealing with bugs on Intel-based systems - figuring out interactions between their hardware and firmware, reverse engineering features that they refuse to document, improving their power management support, handling platform integration stuff for their GPUs and so on. Some of this I've been paid for, but a bunch has been unpaid work in my spare time[1].

Recently, as part of the anti-women #GamerGate campaign[2], a set of awful humans convinced Intel to terminate an advertising campaign because the site hosting the campaign had dared to suggest that the sexism present throughout the gaming industry might be a problem. Despite being awful humans, it is absolutely their right to request that a company choose to spend its money in a different way. And despite it being a dreadful decision, Intel is obviously entitled to spend their money as they wish. But I'm also free to spend my unpaid spare time as I wish, and I no longer wish to spend it doing unpaid work to enable an abhorrently-behaving company to sell more hardware. I won't be working on any Intel-specific bugs. I won't be reverse engineering any Intel-based features[3]. If the backlight on your laptop with an Intel GPU doesn't work, the number of fucks I'll be giving will fail to register on even the most sensitive measuring device.

On the plus side, this is probably going to significantly reduce my gin consumption.

[1] In the spirit of full disclosure: in some cases this has resulted in me being sent laptops in order to figure stuff out, and I was not always asked to return those laptops. My current laptop was purchased by me.

[2] I appreciate that there are some people involved in this campaign who earnestly believe that they are working to improve the state of professional ethics in games media. That is a worthy goal! But you're allying yourself to a cause that disproportionately attacks women while ignoring almost every other conflict of interest in the industry. If this is what you care about, find a new way to do it - and perhaps deal with the rather more obvious cases involving giant corporations, rather than obsessing over indie developers.

For avoidance of doubt, any comments arguing this point will be replaced with the phrase "Fart fart fart".

[3] Except for the purposes of finding entertaining security bugs

October 02, 2014 11:27 PM

October 01, 2014

Matt Domsch: Spamfighting: updated opendmarc packages, handling DMARC p=reject

I took a few months off from dealing with my spam problems, choosing to stick my head in the sand. Probably not my wisest move…

In the interim, the opendmarc developers have been busy, releasing version 1.3.0, which also adds the nice feature of doing SPF checking internally. This lets me CLOSE WONTFIX the smf-spf and libspf2 packages from the Fedora review process and remove them from my system. “All code has bugs. Unmaintained code with bugs that you aren’t running can’t harm you.” New packages and the open Fedora review are available.

I’ve also had several complaints from friends, all @yahoo.com users, who have been sending mail to me and my family @domsch.com. In most cases, @domsch.com simply forwards the emails on to yet another mail provider – it’s providing a mail forwarding service for a vanity domain name. However, now that Yahoo and AOL are publishing DMARC p=reject rules, after smtp.domsch.com forwarded the mail on to its ultimate home, those downstream servers were rejecting the messages (presumably on SPF grounds – smtp.domsch.com isn’t a valid mail server for @yahoo.com).

My solution to this is a bit awkward, but will work for a while. Instead of forwarding mail from domains with DMARC p=reject or p=quarantine, I now store those messages locally and serve them up via POP3/IMAP to their ultimate destination. I’m using procmail to make the decision:


DEFAULT="/home/mdomsch/Mail/"
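# Preserve the original envelope sender: extract it from Return-Path
# and hand it back to sendmail when re-sending.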
SENDER=`formail -c -x Return-Path`
SENDMAILFLAGS="-oi -f $SENDER"

# forward all mail except dmarc policy reject|quarantine.
:0 H
* ? formail -x'From:' | grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' | xargs opendmarc-check | egrep -s 'Domain policy: (reject|quarantine)'
${DEFAULT}

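# No reject/quarantine policy found: forward on to the final destination.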
:0
! mdomsch@example.com

This introduces quite a bit of latency (on the order of an hour) for mail delivery from my friends with @yahoo.com addresses, but it keeps them from getting rejected due to their email provider’s lousy choice of policy.

Tim Draegen, the guy behind the excellent dmarcian.com, is chairing a new IETF working group focusing on proper handling of “indirect email flows” such as mailing lists and vanity domain forwarding. I’m hoping to have time to get involved there. If you care, follow along on their mailing lists.

I’m choosing to ignore the fact that domsch.com is getting spoofed 800k times a week (as reported by 8 mail providers and visualized nicely on dmarcian.com), at least for now. I’m hoping the new working group can come up with a method to help solve this.

Do your friends use a mail service publishing DMARC p=reject? Has it caused problems for you? Let me know in the comments below.

October 01, 2014 08:45 PM

Eric Sandeen: Sage’s challenge to the open storage community – support the Ada Iniative!

The Ada Initiative supports women in open technology and culture – looking around my place of work and various conference halls I’ve visited, I think there’s little doubt that we’ve still got an old boy network running here, and I’d like to see that change.  I have daughters who may or may not get deep into tech culture, and I’m glad there are people like Val working to make it a better place for them if they do.

And she’s recently gotten a nice bit of potential help: Sage Weil of Ceph & Dreamhost fame has issued a challenge: Raise $8192 in 8 days, and he’ll match it.  They’re already on their way.  If you can help, please do.  Power-of-two donations encouraged, but not required.  :) Click the counter below to see their donation page.  Thanks!


Donate now

October 01, 2014 08:14 PM

September 26, 2014

Matt Domsch: [REPOST] Who am I?

I’ve started blogging again on the Dell TechCenter site, Enterprise Mobility Management section, along with the rest of my team.

Here’s the intro to my first post, “Who am I?”:

The existential question, asked by everyone and everything throughout their lifetimes – who am I? High school seniors choosing a college, college seniors considering grad school or entering the job market, adults in the midst of their mid-life crisis—the question comes far easier than the answer.

In the world of technology, who you are depends on the technology with which you are interacting. On Facebook, you are your quirky personal self, with pictures of your family and vacations you take. On LinkedIn, you are your professional self, sharing articles and achievements that are aligned with your career.

What about on the myriad devices you carry around? On the smartphone in my pocket, I have several personas—personal, business, gamer (my kids borrow my phone), constantly context-switching between them. In the not-too-distant past, people would carry two phones—one for personal use and one for work, keeping the personas separate via physical separation—two personas, two devices.

Read more…

September 26, 2014 03:47 PM

September 25, 2014

Andy Grover: A program calling command-line tools is the moral equivalent of web scraping.

I gave this talk at LPC 2012. It promotes the idea that layering programs on top of human-centric interfaces is a bad idea.

Download the PDF file.
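
A concrete illustration of the point (my own, not one from the slides): rather than scraping the human-oriented output of df, a program should ask the kernel directly.

#include <stdio.h>
#include <sys/statvfs.h>

/* The API equivalent of parsing "df /home": stable types and units
 * instead of column positions and locale-dependent text. */
int main(void)
{
        struct statvfs sv;

        if (statvfs("/home", &sv) != 0) {
                perror("statvfs");
                return 1;
        }
        printf("free bytes: %llu\n",
               (unsigned long long)sv.f_bavail * sv.f_frsize);
        return 0;
}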

The timing of this post with the announcement of the most recent bash vulnerability is not entirely coincidental.

September 25, 2014 06:48 PM

September 24, 2014

Ted Tso: “How Google Works” giveaway

The authors of “How Google Works” have given electronic versions of “How Google Works” to all Google employees. Since I had already purchased a copy via pre-order, to make life interesting, I’ve decided to give my Google Play coupon code to someone via an electronic lottery.  (Edit: the coupon code will only work for people with Google accounts in the US; if you live outside of the US, my apologies, but the coupon code will not work for you.)

I will be using the procedure documented in RFC-3797 to select someone from the list of people who have sent email to google-book-giveaway@thunk.org, snapshotted at noon US/Eastern on Friday, September 26th, 2014, using as inputs into the RFC-3797 algorithm: (1) the daily volume of GOOG on September 26th, 2014 as reported by finance.google.com, (2) the daily volume of GOOGL on September 26th, 2014 as reported by finance.google.com, and (3) the Massachusetts Powerball Lottery Numbers for September 27th, 2014. If any of these values are not available for any reason, the values for the next trading day or lottery drawing will be used.

Have fun!

September 24, 2014 02:55 PM

Matthew Garrett: My free software will respect users or it will be bullshit

I had dinner with a friend this evening and ended up discussing the FSF's four freedoms. The fundamental premise of the discussion was that the freedoms guaranteed by free software are largely academic unless you fall into one of two categories - someone who is sufficiently skilled in the arts of software development to examine and modify software to meet their own needs, or someone who is sufficiently privileged[1] to be able to encourage developers to modify the software to meet their needs.

The problem is that most people don't fall into either of these categories, and so the benefits of free software are often largely theoretical to them. Concentrating on philosophical freedoms without considering whether these freedoms provide meaningful benefits to most users risks these freedoms being perceived as abstract ideals, divorced from the real world - nice to have, but fundamentally not important. How can we tie these freedoms to issues that affect users on a daily basis?

In the past the answer would probably have been along the lines of "Free software inherently respects users", but reality has pretty clearly disproven that. Unity is free software that is fundamentally designed to tie the user into services that provide financial benefit to Canonical, with user privacy as a secondary concern. Despite Android largely being free software, many users are left with phones that no longer receive security updates[2]. Textsecure is free software but the author requests that builds not be uploaded to third party app stores because there's no meaningful way for users to verify that the code has not been modified - and there's a direct incentive for hostile actors to modify the software in order to circumvent the security of messages sent via it.

We're left in an awkward situation. Free software is fundamental to providing user privacy. The ability for third parties to continue providing security updates is vital for ensuring user safety. But in the real world, we are failing to make this argument - the freedoms we provide are largely theoretical for most users. The nominal security and privacy benefits we provide frequently don't make it to the real world. If users do wish to take advantage of the four freedoms, they frequently do so at a potential cost of security and privacy. Our focus on the four freedoms may be coming at a cost to the pragmatic freedoms that our users desire - the freedom to be free of surveillance (be that government or corporate), the freedom to receive security updates without having to purchase new hardware on a regular basis, the freedom to choose to run free software without having to give up basic safety features.

That's why projects like the GNOME safety and privacy team are so important. This is an example of tying the four freedoms to real-world user benefits, demonstrating that free software can be written and managed in such a way that it actually makes life better for the average user. Designing code so that users are fundamentally in control of any privacy tradeoffs they make is critical to empowering users to make informed decisions. Committing to meaningful audits of all network transmissions to ensure they don't leak personal data is vital in demonstrating that developers fundamentally respect the rights of those users. Working on designing security measures that make it difficult for a user to be tricked into handing over access to private data is going to be a necessary precaution against hostile actors, and getting it wrong is going to ruin lives.

The four freedoms are only meaningful if they result in real-world benefits to the entire population, not a privileged minority. If your approach to releasing free software is merely to ensure that it has an approved license and throw it over the wall, you're doing it wrong. We need to design software from the ground up in such a way that those freedoms provide immediate and real benefits to our users. Anything else is a failure.

(title courtesy of My Feminism will be Intersectional or it will be Bullshit by Flavia Dzodan. While I'm less angry, I'm solidly convinced that free software that does nothing to respect or empower users is an absolute waste of time)

[1] Either in the sense of having enough money that you can simply pay, having enough background in the field that you can file meaningful bug reports or having enough followers on Twitter that simply complaining about something results in people fixing it for you

[2] The free software nature of Android often makes it possible for users to receive security updates from a third party, but this is not always the case. Free software makes this kind of support more likely, but it is in no way guaranteed.

September 24, 2014 06:59 AM

September 23, 2014

Grant Likely: Don’t fear ACPI on ARM

Before everyone freaks out about Matthew Garrett’s post regarding ACPI on ARM, here are a few things to keep in mind:

First, when we’re talking about Linux and ACPI on ARM, we’re talking about general purpose servers. In the general purpose server market, Linux is already the dominant OS, regardless of the CPU architecture. Servers are designed, built and sold to run Linux. It is already the situation that x86 server vendors build their ACPI tables to work with Linux. Supporting Linux on ARM servers is merely an extension of what vendors are already doing to support Linux on x86. Despite Matthew’s concern, I don’t think we’re entering new territory in this regard.

Second, many of us have bad memories of getting ACPI to work with Linux. However, it is worth remembering that most of our problems have been with machines where the vendor really doesn’t care about Linux – usually desktop or laptop PCs. It’s not surprising that we have problems with these machines since they’ve only been tested with Windows! Server vendors, on the other hand, have a vested interest in ensuring that Linux runs well on their hardware and so they regularly test with Linux. The negative lessons learned in the laptop and desktop markets don’t carry over to machines built to run Linux.

Third, the ACPI world has changed in the last 2 years. It used to be that the ACPI spec was governed in a closed process by 5 companies: HP, Intel, Microsoft, Phoenix, and Toshiba, with nary a Linux person to be seen. Last year ACPI governance was transferred to the UEFI Forum and we’ve got plenty of Linux engineers sitting at the table. In light of that, it is no longer true that ACPI only caters to the needs of Windows, and we have the ability to propose changes to the spec. In fact, if you look at the revision history in version 5.1 of the spec, you’ll find changes that were proposed by Linux engineers to make ARMv8 work.

That said, the issues raised by Matthew are important. There is a big question about how Linux should declare itself to the platform. Claiming to be compatible with “Windows 8” in the ACPI _OSI (Operating System Interface) method obviously isn’t appropriate on ARM. There is some talk about removing _OSI entirely on ARM since the way Linux uses it isn’t actually useful, and the _OSC (Operating System Capability) method has been proposed as a better way to declare what the OS supports. There is also a need to make sure vendors are testing with linux-next and mainline kernels so that we know when breakage happens and we can either do something about it, or work with vendors to fix their firmware.

Both of these are important issues and I think we need to propose solutions before merging ARM ACPI support into the kernel. Some of this work has already started: Linaro is running Canonical’s Firmware Test Suite (FWTS), the ACPI API tests, and the ACPI ASL tests on ARM, and we’re porting the Linux UEFI Verification (LUV) project which packages all the test suites into an easy to use distribution.

While I agree with Matthew that getting the interface between firmware and the OS right is hard, I do not see the nightmare scenario he is describing. It certainly hasn’t played out that way on x86 servers, where Linux is already the preferred OS. Besides, I really cannot agree with the premise that Linux being the dominant OS is a bad thing! We have a lot more influence than we give ourselves credit for.

September 23, 2014 10:00 PM

September 21, 2014

Michael Kerrisk (manpages): man-pages-3.73 is released

I've released man-pages-3.73. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

The most notable changes in man-pages-3.73 are various new and modified pages describing namespaces in general, and user and PID namespaces in detail:

September 21, 2014 11:39 AM

September 16, 2014

Matthew Garrett: ACPI, kernels and contracts with firmware

ACPI is a complicated specification - the latest version is 980 pages long. But that's because it's trying to define something complicated: an entire interface for abstracting away hardware details and making it easier for an unmodified OS to boot diverse platforms.

Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.

This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.

Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.

In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.

This sounds horrifying, but it's actually important. The existence of well-defined[1] OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version[2] the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.

That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?

Our even grimmer meathook future welcomes ARM to the ACPI world.

Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.

So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.

Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:

1) Commit to never modifying the ACPI behaviour of Linux.
2) Expose an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bump it whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.

(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.

[1] Defined by implementation, not defined by specification
[2] Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.
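
For the curious, the OS side of that handshake is little more than matching the string the firmware asks about against a table. An illustrative fragment (the variable name is mine; the version strings are the well-known Windows mapping):

/* The OS answers _OSI("...") queries against a list like this --
 * which is how it signs the contract described in the post above. */
static const char * const osi_strings[] = {
        "Windows 2001",         /* Windows XP */
        "Windows 2006",         /* Windows Vista */
        "Windows 2009",         /* Windows 7 */
        "Windows 2012",         /* Windows 8 */
};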

September 16, 2014 10:51 PM

Andy Grover: Emacs and using multiple C code styles

I primarily work on Linux, so I put this in my Emacs config:

; Linux mode for C
(setq c-default-style
      '((c-mode . "linux") (other . "gnu")))

However, other projects like QEMU have their own style preferences. So here’s what I added to use a different style for that code. First, I found the qemu C style defined here. Then, to only use this on some C code, we attach a hook that overrides the default C style only if the filename contains “qemu” – an imperfect but decent-enough test.

(defconst qemu-c-style
  '((indent-tabs-mode . nil)
    (c-basic-offset . 4)
    (tab-width . 8)
    (c-comment-only-line-offset . 0)
    (c-offsets-alist . ((statement-block-intro . +)
                        (substatement-open . 0)
                        (label . 0)
                        (statement-cont . +)
                        (innamespace . 0)
                        (inline-open . 0)
                        ))
    (c-hanging-braces-alist .
                            ((brace-list-open)
                             (brace-list-intro)
                             (brace-list-entry)
                             (brace-list-close)
                             (brace-entry-open)
                             (block-close . c-snug-do-while)
                             ;; structs have hanging braces on open
                             (class-open . (after))
                             ;; ditto if statements
                             (substatement-open . (after))
                             ;; and no auto newline at the end
                             (class-close)
                             ))
    )
  "QEMU C Programming Style")

(c-add-style "qemu" qemu-c-style)

(defun maybe-qemu-style ()
  (when (and buffer-file-name
             (string-match "qemu" buffer-file-name))
    (c-set-style "qemu")))

(add-hook 'c-mode-hook 'maybe-qemu-style)
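
(To check which style ended up active in a given buffer, inspecting the c-indentation-style variable – C-h v c-indentation-style – should tell you, since cc-mode sets it to the name of the current style.)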

September 16, 2014 01:16 AM