Kernel Planet

August 31, 2015

Matthew Garrett: Working with the kernel keyring

The Linux kernel keyring is effectively a mechanism to allow shoving blobs of data into the kernel and then setting access controls on them. It's convenient for a couple of reasons: the first is that these blobs are available to the kernel itself (so it can use them for things like NFSv4 authentication or module signing keys), and the second is that once they're locked down there's no way for even root to modify them.

But there's a corner case that can be somewhat confusing here, and it's one that I managed to crash into multiple times when I was implementing some code that works with this. Keys can be "possessed" by a process, and have permissions that are granted to the possessor orthogonally to any permissions granted to the user or group that owns the key. This is important because it allows for the creation of keyrings that are only visible to specific processes - if my userspace keyring manager is using the kernel keyring as a backing store for decrypted material, I don't want any arbitrary process running as me to be able to obtain those keys[1]. As described in keyrings(7), keyrings exist at the session, process and thread levels of granularity.

This is absolutely fine in the normal case, but gets confusing when you start using sudo. sudo by default doesn't create a new login session - when you're working with sudo, you're still working with key posession that's tied to the original user. This makes sense when you consider that you often want applications you run with sudo to have access to the keys that you own, but it becomes a pain when you're trying to work with keys that need to be accessible to a user no matter whether that user owns the login session or not.

I spent a while talking to David Howells about this and he explained the easiest way to handle this. If you do something like the following:
$ sudo keyctl add user testkey testdata @u
a new key will be created and added to UID 0's user keyring (indicated by @u). This is possible because the keyring defaults to 0x3f3f0000 permissions, giving both the possessor and the user read/write access to the keyring. But if you then try to do something like:
$ sudo keyctl setperm 678913344 0x3f3f0000
where 678913344 is the ID of the key we created in the previous command, you'll get permission denied. This is because the default permissions on a key are 0x3f010000, meaning that the possessor has permission to do anything to the key but the user only has permission to view its attributes. The cause of this confusion is that although we have permission to write to UID 0's keyring (because the permissions are 0x3f3f0000), we don't possess it - the only permissions we have for this key are the user ones, and the default state for user permissions on new keys only gives us permission to view the attributes, not change them.

But! There's a way around this. If we instead do:
$ sudo keyctl add user testkey testdata @s
then the key is added to the current session keyring (@s). Because the session keyring belongs to us, we possess any keys within it and so we have permission to modify the permissions further. We can then do:
$ sudo keyctl setperm 678913344 0x3f3f0000
and it works. Hurrah! Except that if we log in as root, we'll be part of another session and won't be able to see that key. Boo. So, after setting the permissions, we should:
$ sudo keyctl link 678913344 @u
which ties it to UID 0's user keyring. Someone who logs in as root will then be able to see the key, as will any processes running as root via sudo. But we probably also want to remove it from the unprivileged user's session keyring, because that's readable/writable by the unprivileged user - they'd be able to revoke the key from underneath us!
$ sudo keyctl unlink 678913344 @s
will achieve this, and now the key is configured appropriately - UID 0 can read, modify and delete the key, other users can't.

This is part of our ongoing work at CoreOS to make rkt more secure. Moving the signing keys into the kernel is the first step towards rkt no longer having to trust the local writable filesystem[2]. Once keys have been enrolled the keyring can be locked down - rkt will then refuse to run any images unless they're signed with one of these keys, and even root will be unable to alter them.

[1] (obviously it should also be impossible to ptrace() my userspace keyring manager)
[2] Part of our Secure Boot work has been the integration of dm-verity into CoreOS. Once deployed this will mean that the /usr partition is cryptographically verified by the kernel at runtime, making it impossible for anybody to modify it underneath the kernel. / remains writable in order to permit local configuration and to act as a data store, and right now rkt stores its trusted keys there.

comment count unavailable comments

August 31, 2015 05:18 PM

August 26, 2015

James Morris: Linux Security Summit 2015 – Wrapup, slides

The slides for all of the presentations at last week’s Linux Security Summit are now available at the schedule page.

Thanks to all of those who participated, and to all the events folk at Linux Foundation, who handle the logistics for us each year, so we can focus on the event itself.

As with the previous year, we followed a two-day format, with most of the refereed presentations on the first day, with more of a developer focus on the second day.  We had good attendance, and also this year had participants from a wider field than the more typical kernel security developer group.  We hope to continue expanding the scope of participation next year, as it’s a good opportunity for people from different areas of security, and FOSS, to get together and learn from each other.  This was the first year, for example, that we had a presentation on Incident Response, thanks to Sean Gillespie who presented on GRR, a live remote forensics tool initially developed at Google.

The keynote by sysadmin, Konstantin Ryabitsev, was another highlight, one of the best talks I’ve seen at any conference.

Overall, it seems the adoption of Linux kernel security features is increasing rapidly, especially via mobile devices and IoT, where we now have billions of Linux deployments out there, connected to everything else.  It’s interesting to see SELinux increasingly play a role here, on the Android platform, in protecting user privacy, as highlighted in Jeffrey Vander Stoep’s presentation on whitelisting ioctls.  Apparently, some major corporate app vendors, who were not named, have been secretly tracking users via hardware MAC addresses, obtained via ioctl.

We’re also seeing a lot of deployment activity around platform Integrity, including TPMs, secure boot and other integrity management schemes.  It’s gratifying to see the work our community has been doing in the kernel security/ tree being used in so many different ways to help solve large scale security and privacy problems.  Many of us have been working for 10 years or more on our various projects  — it seems to take about that long for a major security feature to mature.

One area, though, that I feel we need significantly more work, is in kernel self-protection, to harden the kernel against coding flaws from being exploited.  I’m hoping that we can find ways to work with the security research community on incorporating more hardening into the mainline kernel.  I’ve proposed this as a topic for the upcoming Kernel Summit, as we need buy-in from core kernel developers.  I hope we’ll have topics to cover on this, then, at next year’s LSS.

We overlapped with Linux Plumbers, so LWN was not able to provide any coverage of the summit.  Paul Moore, however, has published an excellent write-up on his blog. Thanks, Paul!

The committee would appreciate feedback on the event, so we can make it even better for next year.  We may be contacted via email per the contact info at the bottom of the event page.

August 26, 2015 07:09 PM

August 25, 2015

Gustavo F. Padovan: Collabora contributions to Linux Kernel 4.2

A total of 63 patches were contributed upsteam by Collabora engineers as part of our current projects.

In the ARM multi_v7_defconfig we have the addition of support for Exynos Chromebooks, all options that had a tristate Kconfig option were added as module. After this change it was found that a few drivers weren’t working  properly when built as module, so this was fixed. This work was done by Javier Martinez.

Javier also added multi EC support as newer Chromebooks have more than one Embedded Controller in the system.

Tomeu Vizoso added EMC (External Memory Controller) support to the Tegra124 platform.

On the DRM side initial support for Atomic Modesetting was added to Exynos devices by Gustavo Padovan. The Atomic Modesetting interface allows all screen updates such as changing modes, pageflip and set planes/cursors to happen in the same IOCTL. Thus everything can be updated atomically. More on that can be found at Daniel Vetter’s post at Another contribution, from Daniel Stone, to Atomic Modesetting was the addition of the CRTC state mode property, it is through this property that userspace configure a modeset that will be updated via an Atomic Modesetting ioctl.

Following is a list of all patches submitted by Collabora for this kernel release:

Daniel Stone (17):

Gustavo Padovan (17):

Javier Martinez Canillas (19):

Tomeu Vizoso (11):

August 25, 2015 01:47 PM

August 19, 2015

Matt Domsch: Dell Desktop / Notebook Linux Engineering position available

Come help Dell ensure Linux “just works!” on Dell notebooks, desktops, and devices! The Dell Client Linux Engineering team has opening for a Senior Software Engineer. This team works closely with the Linux community, device manufacturers, and Dell engineering teams to provide the best Linux experience across the entire client product line.

Visit the Dell Jobs site to apply. If you’re a friend of mine and are interested, drop me a line and I’ll make sure you get in front of the hiring manager quickly!

August 19, 2015 09:31 PM

LPC 2015: Bird-of-a-Feather Sessions

We have a great slate of bird-of-a-feather (BoF) sessions on Thursday evening! However, there are still a few BoF slots left, so proposals are still welcome here. First come, first served!

August 19, 2015 09:22 PM

August 18, 2015

Matthew Garrett: Canonical's deliberately obfuscated IP policy

I bumped into Mark Shuttleworth today at Linuxcon and we had a brief conversation about Canonical's IP policy. The short summary:

The even shorter summary: Canonical won't clarify their IP policy because they believe they can make more money if they don't.

Why do I keep talking about this? Because Canonical are deliberately making it difficult to create derivative works, and that's one of the core tenets of the definition of free software. Their IP policy is fundamentally incompatible with our community norms, and that's something we should care about rather than ignoring.

comment count unavailable comments

August 18, 2015 07:02 PM

August 17, 2015

Andi Kleen: Announcing simple-pt — A simple Processor Trace implementation

Modern Intel Core CPUs (5th and 6th generation) have a Intel Processor Trace (PT) feature to trace branch execution with low overhead. This is useful for performance analysis and debugging.

simple-pt is a simple standalone driver and decoder tool to implement PT on Linux.

Starting with Linux 4.1 Linux already has a integrated PT implementation in perf (see ). simple-pt is an alternative implementation. It has many disadvantages over the perf PT implementation, such as:
- needs to run as root
- no long term tracing or sampling with interrupts
- no support for interactive debugging (use gdb 7.10 on perf for that)
- no support for histograms
- somewhat experimental
- not as well supported as perf

On the positive side simple-pt is:
- simple
- standalone. No kernel changes needed. Could be ported to older kernels or other operating systems
- easy to modify and experiment with
- more ftrace like decoding tool
- support for kprobes based triggers
- modular “unix style” design with simple tools that do only one thing each
- BSD licensed

Example output:

        % sptcmd  -c tcall taskset -c 0 ./tcall
        cpu   0 offset 1027688,  1003 KB, writing to ptout.0
        Wrote sideband to ptout.sideband
        % sptdecode --sideband ptout.sideband --pt ptout.0 | less
        frequency 32
        0        [+0]     [+   1] _dl_aux_init+436
                          [+   6] __libc_start_main+455 -> _dl_discover_osversion
                          [+  13] __libc_start_main+446 -> main
                          [+   9]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1

Available from

August 17, 2015 04:27 AM

August 16, 2015

Daniel Vetter: Atomic Modesetting Design Overview

After a few years of development the atomic display update IOCTL for drm drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road, with a lot of drivers already converted over to atomic and even more in progress, the atomic helper libraries and support code in the drm subsystem sufficiently polished. But what's really missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented like they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design and the exact semantics of atomic modessetting updates. Happy Reading!

August 16, 2015 01:52 PM

August 15, 2015

Rusty Russell: Broadband Speeds, New Data

Thanks to edmundedgar on reddit I have some more accurate data to update my previous bandwidth growth estimation post: OFCOM UK, who released their November 2014 report on average broadband speeds.  Whereas Akamai numbers could be lowered by the increase in mobile connections, this directly measures actual broadband speeds.

Extracting the figures gives:

  1. Average download speed in November 2008 was 3.6Mbit
  2. Average download speed in November 2014 was 22.8Mbit
  3. Average upload speed in November 2014 was 2.9Mbit
  4. Average upload speed in November 2008 to April 2009 was 0.43Mbit/s

So in 6 years, downloads went up by 6.333 times, and uploads went up by 6.75 times.  That’s an annual increase of 36% for downloads and 37% for uploads; that’s good, as it implies we can use download speed factor increases as a proxy for upload speed increases (as upload speed is just as important for a peer-to-peer network).

This compares with my previous post’s Akamai’s UK numbers of 3.526Mbit in Q4 2008 and 10.874Mbit in Q4 2014: only a factor of 3.08 (26% per annum).  Given how close Akamai’s numbers were to OFCOM’s in November 2008 (a year after the iPhone UK release, but probably too early for mobile to have significant effect), it’s reasonable to assume that mobile plays a large part of this difference.

If we assume Akamai’s numbers reflected real broadband rates prior to November 2008, we can also use it to extend the OFCOM data back a year: this is important since there was almost no bandwidth growth according to Akamai from Q4 2007 to Q7 2008: ignoring that period gives a rosier picture than my last post, and smells of cherrypicking data.

So, let’s say the UK went from 3.265Mbit in Q4 2007 (Akamai numbers) to 22.8Mbit in Q4 2014 (OFCOM numbers).  That’s a factor of 6.98, or 32% increase per annum for the UK. If we assume that the US Akamai data is under-representing Q4 2014 speeds by the same factor (6.333 / 3.08 = 2.056) as the UK data, that implies the US went from 3.644Mbit in Q4 2007 to 11.061 * 2.056 = 22.74Mbit in Q4 2014, giving a factor of 6.24, or 30% increase per annum for the US.

As stated previously, China is now where the US and UK were 7 years ago, suggesting they’re a reasonable model for future growth for that region.  Thus I revise my bandwidth estimates; instead of 17% per annum this suggests 30% per annum as a reasonable growth rate.

August 15, 2015 04:54 AM

August 14, 2015

Pete Zaitcev: Tablet Uber Alles Or Is It

Given the trouble with modern laptops, I'm seriously thinking if I should make a jump to a gigantic tablet with a keyboard. You run "make" on VM. Not enough RAM? Order in the cloud! The idea was planted in my mind by that jerk Atwood, who penned an article claiming a death of PC. And a month ago I saw someone at Python meetup using Canopy. It kinda worked, actually. I expect Github Atom to be even better.

Unfortunately, there are problems in 3 broad categories still.

First, the hotspot Internet connectivity sucks. It is plain unreliable. VPN, ssh, and IRC are often blocked; it's necessary to remember "Connectivity Through Anything" lessons and tehcniques. When it works, it's often slow. These problems extend to venues such as Intel's Executive Briefing Center. If "executives" eating their awesome snacks cannot obtain a decent WiFi, what hope do I have? I do not have cellphone data, but I hear bitching about it.

Second, the usual questions about privacy and security apply. Non-proprietary tablets suck immensely, from what I heard.

Third, tablets top out at 10..11 inch. Sorry, but that is not enough to kill laptops while laptops continue to be made. Certainly, Atwood made an argument that as tablets absorb users, PC makers will stop. The day the last one quits, we'll have to use the least shitty tablet regardless of size. But today is not that day.

August 14, 2015 09:38 PM

Pete Zaitcev: User-facing hardware

New business trip, new hardware pictures.

It was almost a year, and I'm still looking for a decent laptop, same criteria. I saw a couple of guys using Lenovo X1 Carbon, which looks good. Most importantly, the left Ctrl is now extends to its proper position. Almost a winner, but unfortunately, there are issues. Apparently, the screen on the X1 is not touching the main frame flat when it's closed, so a bundle of clothing pressing in the middle between the hinges is capable to making a nasty crack in plastic. Not acceptable for what is a $1,400 laptop even with Amazon's "discount" of $900. Way to go, Lenovo. Almost had me this time.

Meanwhile, a $500 Dell Vostro continues to soldier on. It's showing its age: building Ceph with "make -j${N}" requires more RAM that it has for any reasonable N, and dialog windows started to outgrow its screen (notably, some of GNOME preferences). I still need a laptop, but can't find a suitable one. The Lenovo X1 tops out at 8GB, which was another strike against it.

I was a little sad when Google stopped making Nexus 7. I have a 2013 version and it is quite good. In the same meeting, I bumped into a guy with a projected update to Nexus 7 that became orphaned when Google pulled the plug. ASUS continued to build them and market them as "MemoPad 7". However, taking the page from Microsoft playbook with their "Surface" and "Surface Pro", ASUS sell "MemoPad 7" versions ranging from worthless piece of junk with 1024x600 to actual Nexus 7 replacements with 1920x1200. Allegedly, the battery life and speed are much improved by using Intel's embedded Atom core. Some of the ARM-optimized apps may not work (example is some kind of music editing thing for podcasters).

August 14, 2015 09:18 PM

August 13, 2015

Dave Jones: The case of the mysterious disappearing I211

Day one of unemployed life saw me finally getting around to the first of several hardware related maintenance items that I’ve been putting off until I’ve had the time.

I got a lot of life out of my desktop machine that I had been using since 2007. Earlier this year, I decided it was long overdue an upgrade, and ended up building a ridiculously over-specced machine in the hopes it too would last me a while. After some research, I ended up with a 6-core Haswell-E i7-5820K, and a frankly ridiculously over-featured motherboard.
Once I had delved through the absurd number of BIOS options to convince it that I *really* didn’t want to overclock my CPU or my RAM, or anything else, it was very stable.

It has exceeded all my expectations. In the time it took my old desktop to build one kernel, I can build kernel .deb’s for every machine I own, and still have time spare. It’s an absolute beast.

One of the features that sold me on this board was the two onboard ethernet ports. I had been wanting to do a bunch of networking experiments, and the possibility of using bonding, without having to screw around with add-in cards was appealing.

So I was a little irked one evening after updating its BIOS, to notice that the bond only had one interface active. After some investigation, I noticed that the PCI ID of one of the onboard NICs had changed.

What was once

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

Was now

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation Device 1532 (rev 03)

My I211 had changed its PCI ID, and the e1000 driver wouldn’t bind to this new device.

At first I thought “Cool, some kind of NIC firmware update”, and assumed that e1000 hadn’t been updated yet to support this new feature. Googling for “i211 1532” told a much sadder story however.

If you read the spec update for the i211, you find this interesting table:

I211 Device ID Code Vendor ID Device ID Revision ID1
WGI211AT (not programmed/factory default) 0x8086 0x1532 0x3
WGI211AT (programmed) 0x8086 0x1539 0x3

Uh, not cool. Somehow the BIOS update procedure had wiped the NVRAM on the NIC.

A long protracted conversation with ASUS support followed, including such gems as “I understand you’re seeing blue screens” and “Have you tried removing the DIMMs, rubbing the contacts with an eraser and replacing them”. Eventually I think they got to the end of their script, and agreed to RMA the board. Somewhat annoying, given there’s probably a tool somewhere that can rewrite the flash, but Intel only seems to make that available to integrators, not end-users, and the ASUS representatives denied all knowledge.

It was gone for about two weeks, and finally returned yesterday. Its PCI ID is 0x1539 again, and it has its old MAC address once more. (I’m now hesitant to ever upgrade the BIOS on this machine again). So what happened ? Anyone’s guess, but this isn’t the first time I’ve seen this happen. We had a bunch of these NICs at Akamai too that occasionally had the same thing happen to them.

The whole thing is reminiscent of a painful old bug where ftrace would corrupt the e1000e ROM. Hopefully Linux isn’t to blame this time.

So, long story short: If you see an i211 with a PCI ID of 1532, you’re looking at an RMA.

The post The case of the mysterious disappearing I211 appeared first on

August 13, 2015 04:09 PM

LPC 2015: LPC closing party on the water

The Linux Plumbers Conference will have its closing party on Friday, August 21 at the Palisade Restaurant at Elliott Bay Marina on the waters of Puget Sound. Buses will be leaving from the Sheraton around 18:00 for the 15-minute journey to the restaurant. The evening will start with a champagne and seafood tower reception in the courtyard overlooking the marina. It will include shrimp, lobster, and oysters, along with plenty of vegetarian choices. There will be a buffet inside the restaurant after that with entrees including sushi (with vegetarian selections), salmon, risotto, and steak. All of that will be followed by dessert and coffee. There will be local wine and beer selections and all of the food will be locally sourced as much as possible.

It should make for a fabulous evening, with great views (perhaps even of Mount Rainier), and excellent company. We look forward to seeing you on Friday.

August 13, 2015 02:13 PM

August 12, 2015

LPC 2015: How to find Room WSCC 3AB

This year, because of space constraints, the single Wednesday Microconference track (for LLVM in the morning and the Development Tools Tutorial in the afternoon, see schedule for details) is happening offsite at the Washington State Convention Centre (WSCC).  Breakfast will still be at the Sheraton, but the rest of the Microconference will happen over at WSCC in room 3AB.  To get there from the Sheraton, take the escalators down to the ground floor, exit at the corner of Pike and 6th Avenue. Turn right on to Pike, cross 7th Avenue, continue a little way up Pike then turn right into the Convention Centre. Room 3AB is up the escalators on the third floor.



Wireless in the Washington State Convention Centre is on a different system.  Unfortunately it’s portal based, not WEP key based, so we can’t hide the difference.  The portal details are

SSID: Exhibitor Internet

Which is an open network, then browse to any page and enter the login at the prompt:
USER ID: linux
Password: seattle

August 12, 2015 01:34 AM

August 11, 2015

Dave Jones: Moving on from Akamai.

Today was my last day at Akamai. It’s been brief (Just over seven months), but things weren’t really working out for me there for a number of reasons. I’ve mentioned to a number of people who have known about my decision for a while, that it’s not that it’s a bad place to work, but it never felt like a good fit for me, and I came to realize that I’ve spent most of this last year being in denial of just how unhappy I was, in the hope “things would get better”.

There are a lot of smart people working there, working on really difficult problems, but a lot of those problems just don’t align with my interests, especially when they don’t always involve contributing code back upstream. [clarification: There is some upstream work going on there, just not as much as I’d like].

Add to this my disdain for some of the proprietary tooling that’s prevalent there, and it was becoming clear it was not a matter of “if”, but “when” I was going to leave. As an example; I joked a few months ago to co-workers “next time I’m looking for a job, the first question I ask is ‘do you use perforce’?”. Only it wasn’t really a joke, I was dead serious. User-hostile software has no place in my life.
Even little things like “let’s use git” translating to “let’s license Atlassian stash” rather than “run a git-daemon somewhere” started getting me down.

The final project I worked on there was a continuous rebase strategy for the kernel, moving away from perforce to git. It’s a move in the right direction, but ultimately, not the sort of work that gets me excited, and it’s going to be a multi-year project before it starts really bearing fruit. Given how perforce is ingrained in so many of Akamai’s systems, it would also have been extremely unlikely I’d have been able to purge all knowledge of ever having used it.

The rebase work itself also started to bother me that many of the kernel changes we made had no chance of ever even being submitted, let alone accepted upstream. (In part because many of them are very unique to Akamai’s CDN — you won’t find any of the trickery employed there described in a Richard Stevens book, and they’re unlikely to ever be official RFC’s due to the competitive edge they gain from those changes).
There are exceptions to all of this, and the kernel team is trying to do a better job there with upstreaming most of the newer changes, but many of the older legacy patches are under-documented, and/or understood well by few people, with the original authors no longer around, making it a frustrating exercise to get up to speed; especially when you’re trying to learn what the upstream code is doing at the same time.

Someone with less experience dealing exclusively with open-source for most of their career would probably find many of my reasons for leaving trivial. Those same people would probably find Akamai a great place to work. There are a lot of opportunities there if you have a higher tolerance for such things than I did. It was eye-opening recently, mentoring some of the interns there. Optimism. The unjaded outlook that comes with youth. Not getting bent out of shape at crappy tooling because they don’t know different. It made me realize I wasn’t going to ever be like this here.

On a particularly bad day a few weeks back, a recruiter reached out to me, to find out if I was interested in a second chance at an offer I received last time I was looking for a new job. It worked. Enduring an unhappy situation in the hopes things will get better isn’t a great strategy when there are other options.

So, I start at Facebook in September.

I have no delusions that things are going to be perfect there, but at least from the outside right now, the grass looks greener. I feel bad walking away from problems unfinished, but going home miserable or angry or some other negative emotion every day was really starting to get take its toll. It’s not a healthy way to live.

When I was interviewing last December, I read Being Geek to death, so it’s fitting that I’ve picked it up again recently. One paragraph in particular jumps out at me.

My single worst gig was one where I got everything I wanted out the of the offer letter, but in my exuberance for being highly valued, I totally forgot that my gut read on the gig was "meh". Ninety days later, I couldn't care less that I got a 15% raise and a sign-on bonus. I couldn't stand the mundanity of the daily work, and I happily resigned a few months later, taking both a pay cut and returning my sign-on bonus for the opportunity to work at Netscape.

Anachronisms and minor details aside, that paragraph played through my head this afternoon as I wrote the check to pay back the remainder of my sign-on bonus. I wasn’t quite thinking “meh”, but I knew I was making compromises on what I really valued from day one.

Walking away from unvested RSUs, giving up this months paycheck, and writing that check stings a little, but when I did my exit interview this morning, I knew that I too, was “happily resigning” for a great opportunity.

I’m feeling uncharacteristically optimistic right now. Hopefully it’ll last.

I’ll be in Seattle next week, but due to complications with my registration being transferred to another Akamai employee, I won’t actually be at the Linux plumbers conf. If you’re also going to be there and want to catch up, drop me a mail, or <ahem> hit me up on facebook.

The post Moving on from Akamai. appeared first on

August 11, 2015 10:33 PM

Pete Zaitcev: git submodule

It's a familiar sign to anyone dealing with a project that includes submodules: you run "make" and see something like this:

rgw/ In member function ‘virtual int RGWMongooseFrontend::run()’:
rgw/ error: ‘struct mg_callbacks’ has no member named ‘log_access’
cb.log_access = rgw_civetweb_log_access_callback;

Ah, yes. Submodule civetweb is obviously out of date. Type "git submodule init; git submodule update" and... nothing happens. The goddamn submodules are stuck.

At this point, running "git diff origin" produces an output like:

--- a/ceph-object-corpus
+++ b/ceph-object-corpus
@@ -1 +1 @@
-Subproject commit 20351c6bae6dd4802936a5a9fd76e41b8ce2bad0
+Subproject commit bb3cee6b85b93210af5fb2c65a33f3000e341a11

So yeah, obviously you fetched the right thing from the origin, but you cannot merge or rebase no matter what. You may spend a good part of a hackathon reading man pages for git subcommands, all for naught.

Fortunately, the stuck submodules can be worked around, by looking at the "git diff origin" above, then doing this:

git update-index --replace --cacheinfo 160000,20351c6bae6dd4802936a5a9fd76e41b8ce2bad0,ceph-object-corpus

You get the idea: force the right commit from the origin into the local index. This allows "git submodule update" to clone and checkout the right thing and you're off to the races. The fixups in the index will stick out in "git status", so create an empty commit to get rid of them (but only after "git submodule update").

When you're done, you might want to kick in the nuts whoever chose to use submodules in your project.

P.S. "git --version" yields "git version 2.4.3".

P.P.S. You verify what you have in the index by running "git ls-files -s ceph-object-corpus" (or src/civetweb). The mode must be 160000 and the hash should match the upstream. Note that "git diff origin" continues to display a disparity until you've run the "git submodules update".

August 11, 2015 03:08 AM

Pete Zaitcev: the future is here


10005 zaitcev   20   0  809920 755384  13220 R  99.7 12.5   0:20.47 cc1plus
 9894 zaitcev   20   0 1946748 1.806g  15800 R  99.3 31.4   1:46.60 cc1plus
 9956 zaitcev   20   0 1652076 1.524g  15832 R  99.0 26.5   1:30.64 cc1plus
   72 root      20   0       0      0      0 S   4.0  0.0   0:04.60 kswapd0
 9957 zaitcev   20   0   56648  43536   1436 S   2.7  0.7   0:00.49 as
 9895 zaitcev   20   0   79480  66368   1480 S   2.0  1.1   0:00.89 as
 2870 zaitcev   20   0 1989524 533104 160868 S   1.3  8.9  60:28.10 firefox
 2035 zaitcev   20   0 2018216 166872  20028 S   0.7  2.8  16:50.66 gnome-sh

That's right, boys and girls, a compiler with a bigger resident size than Firefox. Three times bigger.

August 11, 2015 02:12 AM

August 10, 2015

Lucas De Marchi: “Throw away” linux images in seconds

Generating a new rootfs from scratch in order to test changes to early parts of the software stack or just to have a pristine environment is something I needed several times in the past.

Since I use Archlinux in my desktop something that I like is to have a similar environment in the target test rootfs. I decided to re-use and improve a script from Kay Sievers to create an installer that can be booted as a VM, as a container or in bare metal: Originally  it was a script to bootstrap a Fedora image and I think that with some small changes that would still be possible.

$ time sudo -l ~/vm/test.img
real 0m31.238s
user 0m22.277s
sys 0m2.473s

30 seconds later I have a complete pristine image that can be used as a VM with qemu, as a container with systemd-nspawn or just copied to a pendrive/sdcard to boot for example a Minnow Board Max.


$ sudo systemd-nspawn -b -i ~/vm/test.img


sudo kvm-that ~/vm/test.img

Note: ‘kvm-that’ is also a script available in the same repository so I don’t have to type all the options to qemu.

In order to boot another computer or a board like Minnow Board Max just dd the image to a usb disk or sdcard. You can also generate the image directly to the final destination:

$ sudo -l /dev/mmcblk0

The script has also some nice options to make it easy to customize the final image.  One thing that I’m often doing is giving an overlay directory with configuration files for wpa_supplicant. This way I can already access my WiFi networks in the target image.

If you always need certain packages you can use the  example debug-tools hook that is executed before the image is finalized. By mixing hooks like that and the overlay directory mentioned above it’s possible to add your local repository to pacman.conf and install packages not available in Archlinux. Or packages that you’d like to maintain on your own. In my use cases with Minnow Board Max I maintain my own kernel with configurations suited to run ardupilot on it.

August 10, 2015 10:44 AM

August 08, 2015

Michael Kerrisk (manpages): man-pages-4.02 is released

I've released man-pages-4.02. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports,and  comments from around 15 contributors. As well as a large number of minor fixes to nearly 400 man pages, the more significant changes in man-pages-4.02 include the following:

August 08, 2015 10:10 PM

Matthew Garrett: Difficult social problems are still difficult problems

After less than a week of complaints, the TODO group have decided to pause development of their code of conduct. This seems to have been triggered by the public response to the changes I talked about here, which TODO appear to have been completely unprepared for.

While disappointing in a bunch of ways, this is probably the correct decision. TODO stumbled into this space with a poor understanding of the problems that they were trying to solve. Nikki Murray pointed out that the initial draft lacked several of the key components that help ensure less privileged groups can feel that their concerns are taken seriously. This was mostly rectified last week, but nobody involved appeared to be willing to stand behind those changes in a convincing way. This wasn't helped by almost all of this appearing to land on Github's plate, with the rest of the TODO group largely missing in action[1]. Where were Google in this? Yahoo? Facebook? Left facing an angry mob with nobody willing to make explicit statements of support, it's unsurprising that Github would try to back away from the situation.

But that doesn't remove their blame for being in the situation in the first place. The statement claims
We are consulting with stakeholders, community leaders, and legal professionals, which is great. It's also far too late. If an industry body wrote a new kernel from scratch and deployed it without any external review, then discovered that it didn't work and only then consulted any of the existing experts in the field, we'd never take them seriously again. But when an industry body turns up with a new social policy, fucks up spectacularly and then goes back to consult experts, it's expected that we give them a pass.

Why? Because we don't perceive social problems as difficult problems, and we assume that anybody can solve them by simply sitting down and talking for a few hours. When we find out that we've screwed up we throw our hands in the air and admit that this is all more difficult than we imagined, and we give up. We ignore the lessons that people have learned in the past. We ignore the existing work that's been done in the field. We ignore the people who work full time on helping solve these problems.

We wouldn't let an industry body with no experience of engineering build a bridge. We need to accept that social problems are outside our realm of expertise and defer to the people who are experts.

[1] The repository history shows the majority of substantive changes were from Github, with the initial work appearing to be mostly from Twitter.

comment count unavailable comments

August 08, 2015 08:09 PM

August 05, 2015

Andi Kleen: Generating Flame graphs with Processor Trace

How to generate a FlameGraph with Processor Trace. Everybody loves Flame Graphs.

Processor trace allows to do as very exact histograms of a program’s run time. Normal sampling has shadow effects, which can hide some details. Processor traces every branch, so it can be much more accurate than normal sampling.

You need a Intel Broadwell or Skylake CPU.
Running at 4.1 or later Linux kernel where perf supports PT.
You can verify the kernel supports pt with

ls /sys/devices/intel_pt

You need perf user tools built from
(this should soon be fixed when the user tools code is merged into Linux mainline)

Build perf with PT support

# set up https_proxy as needed
git clone
cd linux-perf/tools/perf

Copy the resulting perf binary to where you want to run it

Get the flamegraph code

git clone

Collect data from the workload. Best to not collect too long traces as they take much longer to process and may need too much disk space.

perf record -e intel_pt// workload (or -a sleep 1 to collect 1s globally)

Decode the data. This may take quite some time

perf script --itrace=i100usg | /path/to/FlameGraph/ | > workload.folded

The i100us means the trace decoder samples an instruction every 100us. This can be made more accurate (down to 1ns), at the cost of longer decoding time. The ‘g’ tells the decoder to add callgraphs.

Then generate the Flamegraph with

/path/to/FlameGraph/ workloaded.folded > workload.svg

Then view the resulting SVG in a SVG viewer, such as google chrome

google-chrome workload.svg

It is possible to click around.

Here’s a larger svg example from a gcc build (2.5MB). May need chrome or firefox to view.

In principle the trace also has support for more information not in normal sampling, such as determining the exact run time of individual functions from the trace. This is unfortunately not (yet?) supported by the Flame Graph tools.

August 05, 2015 11:13 PM

August 04, 2015

Matthew Garrett: Reverse this

The TODO group is an industry body that appears to be trying to define community best practices or something. I don't really know what their backstory is and whether they're trying to do meaningful work or just provide a fig leaf of respectability to organisations that dislike being criticised for doing nothing to improve the state of online communities but don't want to have to actually do anything, and their initial work on codes of conduct was, perhaps, suboptimal. But they do appear to be trying to improve things - this commit added a set of inappropriate behaviours, and also clarified that reverseisms were not actionable behaviour.

At which point Reddit lost its shit, because Reddit is garbage. And now the repository is a mess of white men attempting to explain how any policy that could allow them to be criticised is the real racism.

Fuck that shit.

Being a cis white man who's a native English speaker from a fairly well-off background, I'm pretty familiar with privilege. Spending my teenage years as an atheist of Irish Catholic upbringing in a Protestant school in a region of Northern Ireland that made parts of the bible belt look socially progressive, I'm also pretty familiar with the idea that that said privilege doesn't shield me from everything bad in life. Having privilege isn't a guarantee that my life will be better, in the same way that avoiding smoking doesn't mean I won't die of lung cancer. But there's an association in both cases, one that's strong enough to alter the statistical likelihood in meaningful ways.

And that inherently affects discussions about race or gender or sexuality. The probability that I've been subject to systematic discrimination because of these traits is vanishingly small. In the communities this policy is intended to cover, I'm the default. It's very difficult for any minority to exercise power over me. "You're white, you wouldn't understand" isn't fundamentally about my colour, it's about the fact that my colour means I haven't been subject to society trying to make my life more difficult at every opportunity. A community that considers saying that to be racist is a community that will never change the default, a community that will never be able to empower people who didn't grow up with that privilege. A code of conduct that makes it clear that "reverse racism" isn't grounds for complaint makes it clear that certain conversations are legitimate and helps ensure we have the framework we need to gradually change that default, and as such is better than one that doesn't.

(comments disabled because I don't trust any of you)

comment count unavailable comments

August 04, 2015 09:59 PM

Rusty Russell: The Bitcoin Blocksize: A Summary

There’s a significant debate going on at the moment in the Bitcoin world; there’s a great deal of information and misinformation, and it’s hard to find a cogent summary in one place.  This post is my attempt, though I already know that it will cause me even more trouble than that time I foolishly entitled a post “If you didn’t run code written by assholes, your machine wouldn’t boot”.

The Technical Background: 1MB Block Limit

The bitcoin protocol is powered by miners, who gather transactions into blocks, producing a block every 10 minutes (but it varies a lot).  They get a 25 bitcoin subsidy for this, plus whatever fees are paid by those transactions.  This subsidy halves every 4 years: in about 12 months it will drop to 12.5.

Full nodes on the network check transactions and blocks, and relay them to others.  There are also lightweight nodes which simply listen for transactions which affect them, and trust that blocks from miners are generally OK.

A normal transaction is 250 bytes, and there’s a hard-coded 1 megabyte limit on the block size.  This limit was introduced years ago as a quick way of avoiding a miner flooding the young network, though the original code could only produce 200kb blocks, and the default reference code still defaults to a 750kb limit.

In the last few months there have been increasing runs of full blocks, causing backlogs for a few hours.  More recently, someone deliberately flooded the network with normal-fee transactions for several days; any transactions paying less fees than those had to wait for hours to be processed.

There are 5 people who have commit access to the bitcoin reference implementation (aka. “bitcoin-core”), and they vary significantly in their concerns on the issue.

The Bitcoin Users’ Perspective

From the bitcoin users perspective, blocks should be infinite, and fees zero or minimal.  This is the basic position of respected (but non-bitcoin-core) developer Mike Hearn, and has support from bitcoin-core ex-lead Gavin Andresen.  They work on the wallet and end-user side of bitcoin, and they see the issue as the most urgent.  In an excellent post arguing why growth is so important, Mike raises the following points, which I’ve paraphrased:

  1. Currencies have network effects. A currency that has few users is simply not competitive with currencies that have many.
  2. A decentralised currency that the vast majority can’t use doesn’t change the amount of centralisation in the world. Most people will still end up using banks, with all the normal problems.
  3. Growth is a part of the social contract. It always has been.
  4. Businesses will only continue to invest in bitcoin and build infrastructure if they are assured that the market will grow significantly.
  5. Bitcoin needs users, lots of them, for its political survival. There are many people out there who would like to see digital cash disappear, or be regulated out of existence.

At this point, it’s worth mentioning another bitcoin-core developer: Jeff Garzik.  He believes that the bitcoin userbase has been promised that transactions will continue to be almost free.  When a request to change the default mining limit from 750kb to 1M was closed by the bitcoin lead developer Wladimir van der Laan as unimportant, Jeff saw this as a symbolic moment:

Disappointing. New #Bitcoin Core policy: stealth fee increases Zero plan to communicate this to BTC users :(

— Jeff Garzik (@jgarzik) July 21, 2015

What Happens If We Don’t Increase Soon?

Mike Hearn has a fairly apocalyptic view of what would happen if blocks fill.  That was certainly looking likely when the post was written, but due to episodes where the blocks were full for days, wallet designers are (finally) starting to estimate fees for timely processing (miners process larger fee transactions first).  Some wallets and services didn’t even have a way to change the setting, leaving users stranded during high-volume events.

It now seems that the bursts of full blocks will arrive with increasing frequency; proposals are fairly mature now to allow users to post-increase fees if required, which (if all goes well) could make for a fairly smooth transition from the current “fees are tiny and optional” mode of operation to a “there will be a small fee”.

But even if this rosy scenario is true, this begsavoids the bigger question of how high fees can become before bitcoin becomes useless.  1c?  5c?  20c? $1?

So What Are The Problems With Increasing The Blocksize?

In a word, the problem is miners.  As mining has transitioned from a geek pastime, semi-hobbyist, then to large operations with cheap access to power, it has become more concentrated.

The only difference between bitcoin and previous cryptocurrencies is that instead of a centralized “broker” to ensure honesty, bitcoin uses an open competition of miners. Given bitcoin’s endurance, it’s fair to count this a vital property of bitcoin.  Mining centralization is the long-term concern of another bitcoin-core developer (and my coworker at Blockstream), Gregory Maxwell.

Control over half the block-producing power and you control who can use bitcoin and cheat anyone not using a full node themselves.  Control over 2/3, and you can force a rule change on the rest of the network by stalling it until enough people give in.  Central control is also a single point to shut the network down; that lets others apply legal or extra-legal pressure to restrict the network.

What Drives Centralization?

Bitcoin mining is more efficient at scale. That was to be expected[7]. However, the concentration has come much faster than expected because of the invention of mining pools.  These pools tell miners what to mine, in return for a small (or in some cases, zero) share of profits.  It saves setup costs, they’re easy to use, and miners get more regular payouts.  This has caused bitcoin to reel from one centralization crisis to another over the last few years; the decline in full nodes has been precipitous by some measures[5] and continues to decline[6].

Consider the plight of a miner whose network is further away from most other miners.  They find out about new blocks later, and their blocks get built on later.  Both these effects cause them to create blocks which the network ignores, called orphans.  Some orphans are the inevitable consequence of miners racing for the same prize, but the orphan problem is not symmetrical.  Being well connected to the other miners helps, but there’s a second effect: if you discover the previous block, you’ve a head-start on the next one.  This means a pool which has 20% of the hashing power doesn’t have to worry about delays at all 20% of the time.

If the orphan rate is very low (say, 0.1%), the effect can be ignored.  But as it climbs, the pressure to join a pool (the largest pool) becomes economically irresistible, until only one pool remains.

Larger Blocks Are Driving Up Orphan Rates

Large blocks take longer to propagate, increasing the rate of orphans.  This has been happening as blocks increase.  Blocks with no transactions at all are smallest, and so propagate fastest: they still get a 25 bitcoin subsidy, though they don’t help bitcoin users much.

Many people assumed that miners wouldn’t overly centralize, lest they cause a clear decentralization failure and drive the bitcoin price into the ground.  That assumption has proven weak in the face of climbing orphan rates.

And miners have been behaving very badly.  Mining pools orchestrate attacks on each other with surprising regularity; DDOS and block withholding attacks are both well documented[1][2].  A large mining pool used their power to double spend and steal thousands of bitcoin from a gambling service[3].  When it was noticed, they blamed a rogue employee.  No money was returned, nor any legal action taken.  It was hoped that miners would leave for another pool as they approached majority share, but that didn’t happen.

If large blocks can be used as a weapon by larger miners against small ones[8], it’s expected that they will be.

More recently (and quite by accident) it was discovered that over half the mining power aren’t verifying transactions in blocks they build upon[4].  They did this in order to reduce orphans, and one large pool is still doing so.  This is a problem because lightweight bitcoin clients work by assuming anything in the longest chain of blocks is good; this was how the original bitcoin paper anticipated that most users would interact with the system.

The Third Side Of The Debate: Long Term Network Funding

Before I summarize, it’s worth mentioning the debate beyond the current debate: long term network support.  The minting of new coins decreases with time; the plan of record (as suggested in the original paper) is that total transaction fees will rise to replace the current mining subsidy.  The schedule of this is unknown and generally this transition has not happened: free transactions still work.

The block subsidy as I write this is about $7000.  If nothing else changes, miners would want $3500 in fees in 12 months when the block subsidy halves, or about $2 per transaction.  That won’t happen; miners will simply lose half their income.  (Perhaps eventually they form a cartel to enforce a minimum fee, causing another centralization crisis? I don’t know.)

It’s natural for users to try to defer the transition as long as possible, and the practice in bitcoin-core has been to aggressively reduce the default fees as the bitcoin price rises.  Core developers Gregory Maxwell and Pieter Wuille feel that signal was a mistake; that fees will have to rise eventually and users should not be lulled into thinking otherwise.

Mike Hearn in particular has been holding out the promise that it may not be necessary.  On this he is not widely supported: that some users would offer to pay more so other users can continue to pay less.

It’s worth noting that some bitcoin businesses rely on the current very low fees and don’t want to change; I suspect this adds bitterness and vitriol to many online debates.


The bitcoin-core developers who deal with users most feel that bitcoin needs to expand quickly or die, that letting fees emerge now will kill expansion, and that the infrastructure will improve over time if it has to.

Other bitcoin-core developers feel that bitcoin’s infrastructure is dangerously creaking, that fees need to emerge anyway, and that if there is a real emergency a blocksize change could be rolled out within a few weeks.

At least until this is resolved, don’t count on future bitcoin fees being insignificant, nor promise others that bitcoin has “free transactions”.

[1] “Bitcoin Mining Pools Targeted in Wave of DDOS Attacks” Coinbase 2015

[2] “Block Withholding Attacks – Recent Research” N T Courtois 2014

[3] “GHash.IO and double-spending against BetCoin Dice” mmtech et. al 2013

[4] “Questions about the July 4th BIP66 fork”

[5] “350,000 full nodes to 6,000 in two years…” P Todd 2015

[6] “Reachable nodes during the last 365 days.”

[7] “Re: Scalability and transaction rate” Satoshi 2010

[8] “[Bitcoin-development] Mining centralization pressure from non-uniform propagation speed” Pieter Wuille 2015

August 04, 2015 02:32 AM

August 03, 2015

LPC 2015: Thursday night reception for LPC

Thanks to the generous sponsorship of Intel, the Linux Plumbers Conference is pleased to announce that there will be an additional social event this year. On Thursday August 20th, we will be gathering at the Seattle Rock Bottom Brewery—just a short walk from the conference venue and hotel—for drinks and dinner in a relaxed setting. The evening’s event will be showcasing local beers, wines, and spirits, but some of the more standard items (like single-malt scotches and cocktails) will also be available.

Since there will be various BoFs and extended microconferences going later in the evening on Thursday, the event has been structured to accommodate that. The event will not have a buffet and will, instead, provide food made to order. It will run until midnight and dinner orders can be placed up until 23:30, so folks can show up any time and still get the food of their choice, hot and fresh. That said, if we all order right at the 18:00 start, the waiting time may get long. So, if you aren’t working late, a walk around Seattle (perhaps after popping in for a drink) would work well to put some space around the food orders. The Rock Bottom is a large venue with lots of tables for discussions and the like, so continuing a conversation there, rather than at the venue, will work out well.

We look forward to seeing everyone at the Rock Bottom on Thursday!

August 03, 2015 10:33 PM

August 02, 2015

Pete Zaitcev: DNF - Debugging Not Finished

It's 100% like CKS said:

[root@kvm-rei zaitcev]# dnf check-update openstack-swift
Last metadata expiration check performed 0:10:02 ago on Sun Aug  2 18:42:13 2015.

openstack-swift.noarch                   2.3.0-2.fc23                    rawhide
[root@kvm-rei zaitcev]# dnf update openstack-swift
Last metadata expiration check performed 0:10:07 ago on Sun Aug  2 18:42:13 2015.
Dependencies resolved.
Nothing to do.
[root@kvm-rei zaitcev]# rpm -q openstack-swift
[root@kvm-rei zaitcev]# 

Searching for a good way to make it unstuck.

August 02, 2015 11:12 PM

July 29, 2015

Pete Zaitcev: Conference submission and voting

Generally I feel that I do not do any work that's important enough to present at conferences. My previous presentation was at OLS back in 2005, concerning usbmon. The usbmon is something a guy learning C would program: it's a circular buffer into which kernel drops tracing events; Wireshark pulls them out. Hardly a conference material, but at the time I thought it was supremely important to proseltize the basic techniques of always-on tracing, because it would improve the quality and the ease of debugging of the kernel overall. I really wanted FireWire guys to adopt a similar tracing scheme, because it was a hell on a stick debugging juju with just printk(). Needless to say, that was a miserable failure, as was FireWire itself. I don't think anyone who came to listen to my presentation in Ottawa received their money's worth.

Or did they? Recently an epiphany occured to me. I really should not even think if anyone is interested. That is conference organizers' job, not mine! As a result, I sent a proposal to OpenStack Tokyo, entitled "The Plot to Destroy OpenStack Swift Using C++: Enhancements of Swift API Compatibility in Ceph RADOS Gateway". It's basically a compendum of practical issues that occur when running Swift apps on top of Ceph RGW and what we do to help people do that.

The things are a little different from 10 years ago, because attendees can vote on the submissions. This sounds democratic. I went through all submissions on the storage track and voted them according to my preference. It took a very long time and I suspect that I was crowdsourced by the organizers in the best traditions of Web 2.0. I wonder if they'll even read the abstracts. :-)

July 29, 2015 04:23 PM

July 28, 2015

Matthew Garrett: Your Ubuntu-based container image is probably a copyright violation

Update: A Canonical employee responded here, but doesn't appear to actually contradict anything I say below.

I wrote about Canonical's Ubuntu IP policy here, but primarily in terms of its broader impact, but I mentioned a few specific cases. People seem to have picked up on the case of container images (especially Docker ones), so here's an unambiguous statement:

If you generate a container image that is not a 100% unmodified version of Ubuntu (ie, you have not removed or added anything), Canonical insist that you must ask them for permission to distribute it. The only alternative is to rebuild every binary package you wish to ship[1], removing all trademarks in the process. As I mentioned in my original post, the IP policy does not merely require you to remove trademarks that would cause infringement, it requires you to remove all trademarks - a strict reading would require you to remove every instance of the word "ubuntu" from the packages.

If you want to contact Canonical to request permission, you can do so here. Or you could just derive from Debian instead.

[1] Other than ones whose license explicitly grants permission to redistribute binaries and which do not permit any additional restrictions to be imposed upon the license grants - so any GPLed material is fine

comment count unavailable comments

July 28, 2015 08:06 PM

LPC 2015: Microconference schedule now available

The Linux Plumbers Conference starts in less than three weeks and so the schedule for Microconferences is now available!  Looking forward to seeing you all there!

July 28, 2015 07:40 PM

July 27, 2015

Andi Kleen: Energy efficient servers book review

Energy efficient servers – Blue prints for data center optimization from Gough/Steiner/Sanders is a new book on power tuning on servers that was recently published at Apress. I got my copy a few weeks ago and read it and it is great.

Disclaimer: I contributed a few pages to the book, but have no financial interest in its success.

As you probably already know power efficiency is very important for modern computing. It matters to mobile devices to extend battery time, it matters to desktops and servers to avoid exceeding the thermal/power capacity and lower energy costs.

Modern chips cannot run all their transistors at full speed at the same time due to the dark silicon problem. This results in the somewhat paradoxical situation that power management is needed, even if energy costs don’t matter, just to give the best performance (such as the highest Turbo frequencies)

Power management in modern systems is quite complex, with many different moving parts, hardware, operating systems, drivers, firmware, embedded micro-controllers working together to be as efficient as possible. I’m not aware of any good overview of all of this.

There is some lore around — for example you may have heard of race to idle, that is running as fast as possible to go idle again — but nothing really that puts it all into a larger context. BTW race-to-idle is not always a good idea, as the book explains.

The new book makes an attempt to explain all of this together for Intel servers (the basic concepts are similar on other systems and also on client systems).

It starts with a (short) introduction of the underlying physical principles and then moves on to the basic CPU and platform power management techniques, such as frequency scaling and idle state and thermal management. It has a discussion on modern memory subsystems and describes the trade offs between different DIMM configurations. It describes the power management differences between larger servers and micro servers. And there is a overview of thermal management and power supply, such as energy efficient power supplies and voltage regulators.

Then it moves on to an overview of the software involved in power management, including firmware, rack level power management software, and operating systems. Then there is an extensive chapter how to instrument and measure power management

Finally (and perhaps most valuable) the book lays out a systematic power tuning methodology, starting with measurements and then concrete steps to optimize existing workloads for the best power efficiency.

The book is written not as an academic text book, but intended for people who solve concrete problems on shipping systems. It is quite readable, explaining any complicated concepts. You can clearly tell the authors have deep knowledge on the topic. While the details are intended for Intel servers, I would expect the book to be useful even to people working on clients or also other architectures.

One possible issue with the book is that it may be too specific for today’s systems. We’ll see how well it ages to future systems. But right now, as it just came out, it it very up-to-date and a good guide. It has some descriptions of data center design (such as efficient cooling), but these parts are quite short and are clearly not the main focus.

The ebook version is currently available as a free download both at the the publisher after registration, or at amazon as free kindle edition, or as reasonable priced paperback.

July 27, 2015 06:14 AM

July 24, 2015

James Morris: Linux Security Summit 2015 Update: Free Registration

In previous years, attending the Linux Security Summit (LSS) has required full registration as a LinuxCon attendee.  This year, LSS has been upgraded to a hosted event.  I didn’t realize that this meant that LSS registration was available entirely standalone.  To quote an email thread:

If you are only planning on attending the The Linux Security Summit, there is no need to register for LinuxCon North America. That being said you will not have access to any of the booths, keynotes, breakout sessions, or breaks that come with the LinuxCon North America registration.  You will only have access to The Linux Security Summit.

Thus, if you wish to attend only LSS, then you may register for that alone, at no cost.

There may be a number of people who registered for LinuxCon but who only wanted to attend LSS.   In that case, please contact the program committee at

Apologies for any confusion.

July 24, 2015 03:46 AM

July 23, 2015

Michael Kerrisk (manpages): man-pages-4.01 is released

I've released man-pages-4.01. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports,and  comments from nearly 50 contributors. As well as a large number of minor fixes to over 100 man pages, the more significant changes in man-pages-4.01 include the following:

July 23, 2015 06:03 PM

July 20, 2015

Pete Zaitcev: Fedora 22 killed IPv6 and I'm fine

I upgraded Fedora on my home router to F22 and immediately IPv6 disappeared on the internal network. The problem is that radvd started throwing its usual "no linklocal address configured on ethmain.5" (although the message is only visible with "IgnoreIfMissing off;"), which leads to "interface ethmain.5 does not exist or is not set up properly". With the default IgnoreIfMissing, radvd continues running but refuses to work, quietly. Needless to say, the interface has a perfectly valid link-local address, same as it had in F21 before the upgrade.

There used to be a time when I took a problem like this as an affront to the idea of IPv6 superiority and the reputation of Fedora as a platform for roll-your-own home router. Now though, I don't give a rat's tail for IPv6. Let Comcast and Google care and pay someone to care. Okay, I lied. I cared enough for file a bug 1244428, but I'm not rushing to build from SRPMs, reinstall old versions, and such.

(Frankly, if we just engage Lennart's attention for an hour, he'll incoporate a perfectly serviceable radvd function into systemd-networkd. Of course, one would need journalctl to see any messages from it, but since it is certain to work, nobody would actually attempt that. The bug report would go unanswered like radvd bugs today, but again, there would not be any bugs.)

UPDATE: The root cause turned out to be an incorrect link-local address after all. I presumed that RFC meant whole fe80::/10 prefix to be used, so each interface had a different address within a node. Ergo, fe80:0:0:1::1/64 address. As it turns out, I may have confused Link-Local address with Site-Local address. The RFC 4291 specifies a fe80::/64 prefix and in F22 the radvd started enforcing it. Note that apparently the lower part has to be unique across the link.

July 20, 2015 04:18 PM

Mel Gorman: Continual testing of mainline kernels

It is not widely known that the SUSE Performance team runs continual testing of mainline kernels and collects data on machines that would be otherwise idle. Testing is a potential topic for Kernel Summit 2015 topic so now seems like a good a time introduce Marvin. Marvin is a system that continually runs performance-related tests and is named after another robot doomed with repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux for Enterprise kernels for performance regressions but it is also configured to run tests against mainline releases. There are four primary components Marvin of interest.

The first component is the test client which is a copy of MMTests. The use of MMTests ensures that the tests can be independently replicated and the methodology examined. The second component is Bob which is a builder that monitors git trees for new kernels to test, builds the kernel when it's released and schedules it to be tested. In practice this monitors the SLE kernel tree continually and checks the mainline git tree once a month for new releases. Bob only builds and queues released kernels and ignores -rc kernels in mainline. The reason for this is simple -- time. The full battery of tests can take up to a month to complete in come cases and it's impractical to do that on every -rc release. There are times when a small subset of tests will be checked for a pre-release kernel but only when someone on the performance team is checking a specific series of patches and it's urgent to get the results quickly. When tests complete, it's Bob that generates the report. The third component is Marvin which runs on the server and one instance exists per test machine. It checks the queue, prepares the test machine and executes tests when the machine is ready. The final component is a configuration manager that is responsible for reserving machines for exclusive use, managing power, managing serial consoles and deploying distributions automatically. The inventory management does not have a specific name as it's different depending on where Marvin is setup.

There are two installations of Marvin -- one that runs in my house and a second that runs within SUSE and they have slightly different configurations. Technically Marvin supports testing on different distributions but only openSUSE and SLE are deployed. SLE kernels are tested on the corresponding SLE distribution. The Marvin instance in my house tests kernels 3.0 up to 3.12 on openSUSE 13.1 and then kernels 3.12 up to current mainline on openSUSE 13.2. In the SUSE instance, SLE 11 SP3 is used as the distribution for testing kernels 3.0 up to 3.12 and openSUSE 13.2 is used for 3.12 and later kernels. The kernel configuration used corresponds to the distribution. The raw results are not publicly available but the reports generated on private servers and mirrored once a week to the following locations;

Dashboard for kernels 3.0 to 3.12 on openSUSE 13.1 running on home machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on home machines

Dashboard for kernels 3.0 to 3.12 on SLE 11 SP3 running on SUSE machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on SUSE machines

The dashboard is intended to be a very high-level view detailing if there are regressions or not in comparison to a baseline. For example, in the first report linked above, the baseline is always going to be a 3.0-based kernel. It needs a human to establish if the regression is real or if it's an acceptable trade-off. The top section makes a guess as to where the biggest regressions might be but it's not perfect so double check. Each test that was conducted is then listed. The name of the test corresponds to a MMTests configuration file in configs/ with extentions naming the filesystem used if that is applicable. The columns are then machines with a number which represents a performance delta. 1 means there is no difference. 1.02 would mean there is a 2% difference and the colour indicates whether it is a performance regression or gain. Green is good, red is bad, gray or white is neutral. It will automatically guess if the result is significant which is why 0.98 on one test might be a 2% performance regression in one test (red) and in the noise for another.

It is important to note that the dashboard figure is a very rough estimate that often decomposing multiple values into a single number. There is no substitute for reading the detailed report and making an assessment. It is also important to note that Marvin is not up to date and some machines have not started testing 4.1. It is known that the reports are very ugly but making it prettier has yet to climb up the list of priorities. Where possible we are instead picking a regression and doing something about it instead of making HTML pages look pretty.

The obvious question is what has been done with this data. When Marvin was first assembled, the intent was to identify and fix regressions between 2.6.32 (yes, really) and 3.12. This is one of the reasons why 3.12-stable contains so many performance related fixes. When a regression was found there are generally one of three outcomes. The first is that it gets fixed obviously. The second is that it is identified as an apparent, but not real, regression. Usually this means the kernel was buggy in an old kernel in a manner that happened to benefit a particular benchmark. Tiobench is an excellent example. On old kernels there was a bug that preserved old pages and reclaimed new pages in certain circumstances. For most workloads, this is terrible but in tiobench it means that parts of the file were cached and the IO appeared to complete faster but it was a lie. The third possible outcome is that it's slower but it's a tradeoff to win somewhere else and the tradeoff is acceptable. Some scheduler regressions fall under this heading where a context-switch micro-benchmark might be hurt but it's because the scheduler is making an intelligent placement decision.

The focus on 3.12 is also why Marvin is not widely advertised within the community. It is rare that mainline developers are concerned with performance in -stable kernels unless the most recent kernel is also discussed. In some cases the most recent kernel may have the same regression but it is common to discover there is simply a different mix of problems in a recent kernel. Each problem must be identified and addressed in turn and time is spent on that instead of adding volume to LKML. Advertising the existence of Marvin wasalso postponed because some of the tests or reporting were buggy and each time I wanted to fix the problem. There are very few that are known to be problematic now but it takes a surprising amount of time to address all problems that crop up when running tests across large numbers of machines. There are still issues lurking in there but if a particularly issue is important to you then let me know and I'll see if it can be examined faster.

An obvious question is how this compares to other performance-based automated testing such as Intel's 0-day kernel test infrastructure. The answer is that they are complementary. The 0-day infrastructure tests every commit to quickly identify both performance gains and regressions. The tests are short-lived by necessity and are invaluable at quickly catching some classes of problems. The tests run by Marvin are much longer-lived and there is only overlap in a small number of places. The two systems are simply looking for different problems. Hence, in 2012 I was tempted to try integrating parts of what became Marvin with 0-day but ultimately it was unnecessary and there is value in both. The other system worth looking at is the results reported on Phoronix Test Suite. In that case, it's relatively rare that the data needed to debug a problem is included in the reports which complicates matters. In a few cases I examined in detail I had problems with the testing methodology. As MMTests already supported large amounts of what I was looking for there was no benefit to discarding it and starting again with Phoronix and addressing any perceived problems there. Finally, on the site that reports the results, there is a frequent emphasis there on graphics performance or the relative performance between different hardware configurations. It is relatively rare that this is the type of comparison my team is interested in.

The next obvious question is how are recent releases performing? At this time I so not want to make a general statement as I have not examined all the data in sufficient detail and am currently developing a series aimed at one of the problems. When I work on mainline patches, it's usually with reference to the problem I picked out after browsing through reports, targeting a particular subsystem area or in response to a bug report. I'm not applying a systematic process to identify all regressions at this point and it's still considered a manual process to determine if a reported regression is real, apparent or a tradeoff. When a real regression is found then Marvin can optionally conduct an automated bisection but that process is usually "invisible" and is only reported indirectly in a changelog if the regression gets fixed.

So what's next? The first is that more attention is going to be paid to recent kernels and checking if regressions were introduced since 3.12 that need addressing. The second is identifying any bottlenecks that exist in mainline that are not regressions but still should be addressed. The last of course if coverage. The first generation of Marvin focused on some common workloads and for a long time it was very useful. The number of problems it is finding is now declining so other workloads will be added over time. Each time a new configuration is added, Marvin will go back through all the old kernels and collect data. This is probably not a task that will ever finish. There always will be some new issue be it due to a hardware change, a new class of workload as the usage of computers evolve or a modification that fixed one problem and introduced another. Fun times!

July 20, 2015 02:55 PM

July 15, 2015

Matthew Garrett: Canonical's Ubuntu IP policy is garbage

(In order to avoid any ambiguity here, this is a personal opinion. The Free Software Foundation's opinion on this matter is here)

Canonical have a legal policy surrounding reuse of Intellectual Property they own in Ubuntu, and you can find it here. It's recently been modified to handle concerns raised by various people including the Free Software Foundation[1], who have some further opinions on the matter here. The net outcome is that Canonical made it explicit that if the license a piece of software is under explicitly says you can do something, you can do that even if the Ubuntu IP policy would otherwise forbid it.

Unfortunately, "Canonical have made it explicit that they're not attempting to violate the GPL" is about the nicest thing you can say about this. The most troubling statement is Any redistribution of modified versions of Ubuntu must be approved, certified or provided by Canonical if you are going to associate it with the Trademarks. Otherwise you must remove and replace the Trademarks and will need to recompile the source code to create your own binaries.. The apparent aim here is to avoid situations where people take Ubuntu, modify it and continue to pass it off as Ubuntu. But it reaches far further than that. Cases where this may apply include (but are not limited to):

In each of these cases, a strict reading of the policy indicates that you are distributing a modified version of Ubuntu and therefore must either get it approved by Canonical or remove the trademarks and rebuild everything. The strange thing is that this doesn't limit itself to rebuilding packages that include Canonical's trademarks - there's a requirement that you rebuild all binaries.

Now obviously this is good engineering practice in a whole bunch of ways, but it's a huge pain in the ass. And to make things worse, Canonical won't clarify what they consider to be use of their trademarks. Many Ubuntu packages rebuilt from Debian include the word "ubuntu" in their version string. Many Ubuntu packages will contain the word "ubuntu" in maintainer email addresses. Many Ubuntu packages include references to Ubuntu (for instance, documentation might say "This configuration file is located under /etc/default in Debian and Ubuntu"). And many Ubuntu packages will include the compiler version string, which will include the word "ubuntu". Realistically, there's no risk of confusion by using the trademarks in this way, and as a consequence there would be no infringement under trademark law. But Canonical aren't using trademark law here. Canonical assert that they hold copyright over binaries that they have built form source, and require that for you to have permission to redistribute these binaries under copyright law you must remove the trademarks. This means that it doesn't matter whether your use of the trademarks would be infringing or not - you're required to remove them, because fuck you that's why.

This is a huge overreach. It's hostile to free software, in that it makes it significantly more difficult to produce derivative works of Ubuntu and doesn't benefit the community in the process. It's hostile to our understanding of IP law, in that it claims that the mechanical process of turning source code into binaries creates an independently copyrightable work. And in some cases it may make it impossible to create derivative works that interoperate with Ubuntu due to applications making assumptions about the presence of strings.

It'd be easy write this off as an over the top misinterpretation of the policy if it hadn't been confirmed by the Ubuntu Community Manager that any binaries shipped by Ubuntu under licenses that don't grant an explicit right to redistribute the binaries can't be redistributed without permission or rebuilding. When I asked for clarification from Canonical over a year ago, I got no response[2]. Perhaps Canonical don't want to force you to remove every single use of the word Ubuntu from derivative works, but their policy is written such that the natural reading is that they do, and they've refused every single opportunity they've been given to clarify the point.

So, we're left with a policy that makes it hugely impractical to redistribute modified versions of Ubuntu unless Canonical approve of it. That's not freedom, and it's certainly not Ubuntu. If Canonical are serious about participating in the free software community then they need to demonstrate their willingness to continue improving this policy to bring it closer to our goals. Failure to do so will give a strong indication of their priorities.

[1] While I'm a member of the FSF's board of directors, I'm not involved in the majority of the FSF's day to day activities and was not part of this process
[2] Nebula's OS was a mixture of binary packages we pulled straight from Ubuntu and packages we rebuilt, so we were obviously pretty interested in what the answer was

comment count unavailable comments

July 15, 2015 07:20 PM

July 12, 2015

Dave Jones: Future development of Trinity.

It’s been an odd few weeks regarding Trinity based things.

First an email from a higher-up at my former employer asking (paraphrased)..

"That thing we asked you to stop working on when you worked here, any chance now you've left you'll implement these features."

I’m still trying to get my head around the thought process that led to that being a reasonable thing to ask. I’ve made the occasional commit over the last six months, but it’s mostly been code motion, clean-up work, and things like syscall table updates. New feature development came to a halt long ago.

It’s no coincidence that the number of bugs reported found with Trinity have dropped off sharply since the beginning of the year, and I don’t think it’s because the Linux kernel suddenly got lots better. Rather, it’s due to the lack of real ongoing development to “try something else” when some approaches dry up. Sadly we now live in a world where it’s easier to get paid to run someone else’s fuzzer these days than it is to develop one.

Then earlier this week, came the revelation that the only people prepared to fund that kind of new feature development are pretty much the worst people.

Apparently Hacking Team modified Trinity to fuzz ioctl() on Android, which yielded some results. I’ve done no analysis on whether those crashes are are exploitable/fixed/only relevant to Android etc. (Frankly, I’m past caring). I’m not convinced their approach is particularly sound even if it was finding results Trinity wasn’t, so it looks unlikely there are even ideas to borrow here. (We all already knew that ioctl was ripe with bugs, and had practically zero coverage testing).

It bothers me that my work was used as a foundation for their hack-job. Then again, maybe if I hadn’t released Trinity, they’d have based on iknowthis, or some other less useful fuzzer. None of this really should surprise me. I’ve known for some time that there are some “security” people that have their own modifications they have no intention of sending my way. Thanks to the way that people that release 0-days are revered in this circus, there’s no incentive for people to share their modifications if it means that someone else might beat them to finding their precious bugs.

It’s unfortunate that this project has attracted so many awful people. When I began it, the motivation had nothing to do with security. Back in 2010 we were inundated in weird oopses that we couldn’t reproduce, many times triggered by jvm’s. I came up with the idea that maybe a fuzzer could create a realistic enough workload to tickle some of those same bugs. Turned out I was right, and so began a series of huge page and other VM related bug fixes.

In the five years that I’ve made Trinity available, I’ve received notable contributions from perhaps a half dozen people. In return I’ve made my changes available before I’d even given them runtime myself.

It’s a project everyone wants to take from, but no-one wants to give back to.

And that’s why for the foreseeable future, I’m unlikely to make public any further feature work I do on it.
I’m done enabling assholes.

The post Future development of Trinity. appeared first on

July 12, 2015 09:37 PM

July 10, 2015

Andi Kleen: Speeding up less

Often when doing performance analysis or debugging, it boils down to stare at long text trace files with the less text viewer. Yes you can do a lot of analysis with custom scripts, but at some point it’s usually needed to also look at the raw data.

The first annoyance in less when opening a large file is the time it takes to count lines (less counts lines at the beginning to show you the current position as a percentage). The line counting has the easy workaround of hitting Ctrl-C or using less -n to disable percentage. But it would be still better if that wasn’t needed.

Nicolai Haenle speeded the process by about 20x in his less repository.

One thing that always bothered me was that searching in less is so slow. If you’re browsing a tens to hundreds of MB file file it can easily take minutes to search for a string. When browsing log and trace files searching over longer distances is often very important.

And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.

Some profiling with perf shows that most of the time searching is spent converting each line. Less internally cleans up the line, convert it to canonical case, remove backspace bold, and some other changes. The conversion loop processes each character in a inefficient way. Most of the time this is not needed, so I replaced that with a quick check if the line contains any backspaces using the optimized strchr() from the standard C library. For case conversion the string search functions (either regular expression or fixed string search) can also handle case insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (when the search string contains no regular expression meta characters) can be also done using the optimized C library functions.

The resulting less version searches ~85% faster on my benchmarks. I tried to submit the patch to the less maintainer, but it was ignored unfortunately. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.

One side effect of the patch is that less now defaults to case sensitive searches. The original less had a feature (or bug) to default to case-insensitive even without the -i option. To get case insensitive searches now “less -i” needs to be used.

[Edit: Fix typos]

July 10, 2015 08:26 PM

Pavel Machek: Front USB connectors are evil

NFSroot over USB on n900 was only giving me 300KiB/sec... and I thought that was normal. Now I plugged the cable to back USB port (not front one) and... speed went from 300KiB/sec to 2.5MiB/sec. Not bad for old cellphone.

Someone must be joking?
root@n900:/sys/devices/platform/68000000.ocp/480ab000.usb_otg_hs/ cat vbus
Vbus off, timeout 1100 msec

It looks like my n900 was bitten by famous "all calls disabled" problem. ( example solution ). Prague Brmlab helped a lot, and baking n900 for 15minutes at 250C seems to have fixed the problem... for a week. Now it looks like it slowly creeps back.

July 10, 2015 10:11 AM

July 09, 2015

Andi Kleen: toplev tutorial and manual

toplev, part of pmu-tools is a tool to determine the CPU bottleneck of workloads. Now finally there is a tutorial and manual available for toplev.,

July 09, 2015 05:57 PM

Andi Kleen: Adding Processor Trace support to Linux

I published an article at LWN: Adding processor trace to Linux. It describes the Linux perf support for the Intel Processor Trace feature on Intel Broadwell and other CPUs. Processor Trace allows fine grained tracing of program control flow.

July 09, 2015 05:51 PM

July 08, 2015

Rusty Russell: The Megatransaction: Why Does It Take 25 Seconds?

Last night f2pool mined a 1MB block containing a single 1MB transaction.  This scooped up some of the spam which has been going to various weakly-passworded “brainwallets”, gaining them 0.5569 bitcoins (on top of the normal 25 BTC subsidy).  You can see the megatransaction on

It was widely reported to take about 25 seconds for bitcoin core to process this block: this is far worse than my “2 seconds per MB” result in my last post, which was considered a pretty bad case.  Let’s look at why.

How Signatures Are Verified

The algorithm to check a transaction input (of this form) looks like this:

  1. Strip the other inputs from the transaction.
  2. Replace the input script we’re checking with the script of the output it’s trying to spend.
  3. Hash the resulting transaction with SHA256, then hash the result with SHA256 again.
  4. Check the signature correctly signed that hash result.

Now, for a transaction with 5570 inputs, we have to do this 5570 times.  And the bitcoin core code does this by making a copy of the transaction each time, and using the marshalling code to hash that; it’s not a huge surprise that we end up spending 20 seconds on it.

How Fast Could Bitcoin Core Be If Optimized?

Once we strip the inputs, the result is only about 6k long; hashing 6k 5570 times takes about 265 milliseconds (on my modern i3 laptop).  We have to do some work to change the transaction each time, but we should end up under half a second without any major backflips.

Problem solved?  Not quite….

This Block Isn’t The Worst Case (For An Optimized Implementation)

As I said above, the amount we have to hash is about 6k; if a transaction has larger outputs, that number changes.  We can fit in fewer inputs though.  A simple simulation shows the worst case for 1MB transaction has 3300 inputs, and 406000 byte output(s): simply doing the hashing for input signatures takes about 10.9 seconds.  That’s only about two or three times faster than the bitcoind naive implementation.

This problem is far worse if blocks were 8MB: an 8MB transaction with 22,500 inputs and 3.95MB of outputs takes over 11 minutes to hash.  If you can mine one of those, you can keep competitors off your heels forever, and own the bitcoin network… Well, probably not.  But there’d be a lot of emergency patching, forking and screaming…

Short Term Steps

An optimized implementation in bitcoind is a good idea anyway, and there are three obvious paths:

  1. Optimize the signature hash path to avoid the copy, and hash in place as much as possible.
  2. Use the Intel and ARM optimized SHA256 routines, which increase SHA256 speed by about 80%.
  3. Parallelize the input checking for large numbers of inputs.

Longer Term Steps

A soft fork could introduce an OP_CHECKSIG2, which hashes the transaction in a different order.  In particular, it should hash the input script replacement at the end, so the “midstate” of the hash can be trivially reused.  This doesn’t entirely eliminate the problem, since the sighash flags can require other permutations of the transaction; these would have to be carefully explored (or only allowed with OP_CHECKSIG).

This soft fork could also place limits on how big an OP_CHECKSIG-using transaction could be.

Such a change will take a while: there are other things which would be nice to change for OP_CHECKSIG2, such as new sighash flags for the Lightning Network, and removing the silly DER encoding of signatures.

July 08, 2015 03:09 AM

July 07, 2015

James Morris: Linux Security Summit 2015 Schedule Published

The schedule for the 2015 Linux Security Summit is now published!

The refereed talks are:

There will be several discussion sessions:

Also featured are brief updates on kernel security subsystems, including SELinux, Smack, AppArmor, Integrity, Capabilities, and Seccomp.

The keynote speaker will be Konstantin Ryabitsev, sysadmin for  Check out his Reddit AMA!

See the schedule for full details, and any updates.

This year’s summit will take place on the 20th and 21st of August, in Seattle, USA, as a LinuxCon co-located event.  As such, all Linux Security Summit attendees must be registered for LinuxCon. Attendees are welcome to attend the Weds 19th August reception.

Hope to see you there!

July 07, 2015 03:04 PM

July 06, 2015

Rusty Russell: Bitcoin Core CPU Usage With Larger Blocks

Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).

The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear.  (A 17MB block took 171 milliseconds).

Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block.  At least it’s scaling linearly, but it’s just slow.

So, 16 Seconds Per 8MB Block?

I did some digging.  Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking[1]…

Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction.  I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB.  On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).

So, 4 Seconds Per 8MB Block?

But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.

So, 1.6 Seconds Per 8MB Block?

I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s libsecp256k1, and an 8MB block.

Let’s Say 1.1 Seconds for an 8MB Block

This is with some assumptions about parallelism; and remember this is on my laptop which has a fairly low-end CPU.  While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.


[1] I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.

So I added a print with a sleep, so I could run perf.  Then I disabled optimization, so I’d get understandable backtraces with perf.  Then I rebuilt perf because Ubuntu’s perf doesn’t demangle C++ symbols, which is part of the kernel source package. (Are we having fun yet?).  I even hacked up a small program to help run perf on just that part of bitcoind.   Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.

July 06, 2015 09:58 PM

Matthew Garrett: Anti Evil Maid 2 Turbo Edition

The Evil Maid attack has been discussed for some time - in short, it's the idea that most security mechanisms on your laptop can be subverted if an attacker is able to gain physical access to your system (for instance, by pretending to be the maid in a hotel). Most disk encryption systems will fall prey to the attacker replacing the initial boot code of your system with something that records and then exfiltrates your decryption passphrase the next time you type it, at which point the attacker can simply steal your laptop the next day and get hold of all your data.

There are a couple of ways to protect against this, and they both involve the TPM. Trusted Platform Modules are small cryptographic devices on the system motherboard[1]. They have a bunch of Platform Configuration Registers (PCRs) that are cleared on power cycle but otherwise have slightly strange write semantics - attempting to write a new value to a PCR will append the new value to the existing value, take the SHA-1 of that and then store this SHA-1 in the register. During a normal boot, each stage of the boot process will take a SHA-1 of the next stage of the boot process and push that into the TPM, a process called "measurement". Each component is measured into a separate PCR - PCR0 contains the SHA-1 of the firmware itself, PCR1 contains the SHA-1 of the firmware configuration, PCR2 contains the SHA-1 of any option ROMs, PCR5 contains the SHA-1 of the bootloader and so on.

If any component is modified, the previous component will come up with a different measurement and the PCR value will be different, Because you can't directly modify PCR values[2], this modified code will only be able to set the PCR back to the "correct" value if it's able to generate a sequence of writes that will hash back to that value. SHA-1 isn't yet sufficiently broken for that to be practical, so we can probably ignore that. The neat bit here is that you can then use the TPM to encrypt small quantities of data[3] and ask it to only decrypt that data if the PCR values match. If you change the PCR values (by modifying the firmware, bootloader, kernel and so on), the TPM will refuse to decrypt the material.

Bitlocker uses this to encrypt the disk encryption key with the TPM. If the boot process has been tampered with, the TPM will refuse to hand over the key and your disk remains encrypted. This is an effective technical mechanism for protecting against people taking images of your hard drive, but it does have one fairly significant issue - in the default mode, your disk is decrypted automatically. You can add a password, but the obvious attack is then to modify the boot process such that a fake password prompt is presented and the malware exfiltrates the data. The TPM won't hand over the secret, so the malware flashes up a message saying that the system must be rebooted in order to finish installing updates, removes itself and leaves anyone except the most paranoid of users with the impression that nothing bad just happened. It's an improvement over the state of the art, but it's not a perfect one.

Joanna Rutkowska came up with the idea of Anti Evil Maid. This can take two slightly different forms. In both, a secret phrase is generated and encrypted with the TPM. In the first form, this is then stored on a USB stick. If the user suspects that their system has been tampered with, they boot from the USB stick. If the PCR values are good, the secret will be successfully decrypted and printed on the screen. The user verifies that the secret phrase is correct and reboots, satisfied that their system hasn't been tampered with. The downside to this approach is that most boots will not perform this verification, and so you rely on the user being able to make a reasonable judgement about whether it's necessary on a specific boot.

The second approach is to do this on every boot. The obvious problem here is that in this case an attacker simply boots your system, copies down the secret, modifies your system and simply prints the correct secret. To avoid this, the TPM can have a password set. If the user fails to enter the correct password, the TPM will refuse to decrypt the data. This can be attacked in a similar way to Bitlocker, but can be avoided with sufficient training: if the system reboots without the user seeing the secret, the user must assume that their system has been compromised and that an attacker now has a copy of their TPM password.

This isn't entirely great from a usability perspective. I think I've come up with something slightly nicer, and certainly more Web 2.0[4]. Anti Evil Maid relies on having a static secret because expecting a user to remember a dynamic one is pretty unreasonable. But most security conscious people rely on dynamic secret generation daily - it's the basis of most two factor authentication systems. TOTP is an algorithm that takes a seed, the time of day and some reasonably clever calculations and comes up with (usually) a six digit number. The secret is known by the device that you're authenticating against, and also by some other device that you possess (typically a phone). You type in the value that your phone gives you, the remote site confirms that it's the value it expected and you've just proven that you possess the secret. Because the secret depends on the time of day, someone copying that value won't be able to use it later.

But instead of using your phone to identify yourself to a remote computer, we can use the same technique to ensure that your computer possesses the same secret as your phone. If the PCR states are valid, the computer will be able to decrypt the TOTP secret and calculate the current value. This can then be printed on the screen and the user can compare it against their phone. If the values match, the PCR values are valid. If not, the system has been compromised. Because the value changes over time, merely booting your computer gives your attacker nothing - printing an old value won't fool the user[5]. This allows verification to be a normal part of every boot, without forcing the user to type in an additional password.

I've written a prototype implementation of this and uploaded it here. Do pay attention to the list of limitations - without a bootloader that measures your kernel and initrd, you're still open to compromise. Adding TPM support to grub is on my list of things to do. There are also various potential issues like an attacker being able to use external DMA-capable devices to obtain the secret, especially since most Linux distributions still ship kernels that don't enable the IOMMU by default. And, of course, if your firmware is inherently untrustworthy there's multiple ways it can subvert this all. So treat this very much like a research project rather than something you can depend on right now. There's a fair amount of work to do to turn this into a meaningful improvement in security.

[1] I wrote about them in more detail here, including a discussion of whether they can be used for general purpose DRM (answer: not really)

[2] In theory, anyway. In practice, TPMs are embedded devices running their own firmware, so who knows what bugs they're hiding.

[3] On the order of 128 bytes or so. If you want to encrypt larger things with a TPM, the usual way to do it is to generate an AES key, encrypt your material with that and then encrypt the AES key with the TPM.

[4] Is that even a thing these days? What do we say instead?

[5] Assuming that the user is sufficiently diligent in checking the value, anyway

comment count unavailable comments

July 06, 2015 05:39 PM

Matthew Garrett: Internet abuse culture is a tech industry problem

After Jesse Frazelle blogged about the online abuse she receives, a common reaction in various forums[1] was "This isn't a tech industry problem - this is what being on the internet is like"[2]. And yes, they're right. Abuse of women on the internet isn't limited to people in the tech industry. But the severity of a problem is a product of two separate factors: its prevalence and what impact it has on people.

Much of the modern tech industry relies on our ability to work with people outside our company. It relies on us interacting with a broader community of contributors, people from a range of backgrounds, people who may be upstream on a project we use, people who may be employed by competitors, people who may be spending their spare time on this. It means listening to your users, hearing their concerns, responding to their feedback. And, distressingly, there's significant overlap between that wider community and the people engaging in the abuse. This abuse is often partly technical in nature. It demonstrates understanding of the subject matter. Sometimes it can be directly tied back to people actively involved in related fields. It's from people who might be at conferences you attend. It's from people who are participating in your mailing lists. It's from people who are reading your blog and using the advice you give in their daily jobs. The abuse is coming from inside the industry.

Cutting yourself off from that community impairs your ability to do work. It restricts meeting people who can help you fix problems that you might not be able to fix yourself. It results in you missing career opportunities. Much of the work being done to combat online abuse relies on protecting the victim, giving them the tools to cut themselves off from the flow of abuse. But that risks restricting their ability to engage in the way they need to to do their job. It means missing meaningful feedback. It means passing up speaking opportunities. It means losing out on the community building that goes on at in-person events, the career progression that arises as a result. People are forced to choose between putting up with abuse or compromising their career.

The abuse that women receive on the internet is unacceptable in every case, but we can't ignore the effects of it on our industry simply because it happens elsewhere. The development model we've created over the past couple of decades is just too vulnerable to this kind of disruption, and if we do nothing about it we'll allow a large number of valuable members to be driven away. We owe it to them to make things better.

[1] Including Hacker News, which then decided to flag the story off the front page because masculinity is fragile

[2] Another common reaction was "But men get abused as well", which I'm not even going to dignify with a response

comment count unavailable comments

July 06, 2015 05:37 PM

July 03, 2015

Rusty Russell: Wrapper for running perf on part of a program.

Linux’s perf competes with early git for title of least-friendly Linux tool.  Because it’s tied to kernel versions, and the interfaces changes fairly randomly, you can never figure out how to use the version you need to use (hint: always use -g).

But when it works, it’s very useful.  Recently I wanted to figure out where bitcoind was spending its time processing a block; because I’m a cool kid, I didn’t use gprof, I used perf.  The problem is that I only want information on that part of bitcoind.  To start with, I put a sleep(30) and a big printf in the source, but that got old fast.

Thus, I wrote “perfme.c“.  Compile it (requires some trivial CCAN headers) and link perfme-start and perfme-stop to the binary.  By default it runs/stops perf record on its parent, but an optional pid arg can be used for other things (eg. if your program is calling it via system(), the shell will be the parent).

July 03, 2015 03:19 AM

June 29, 2015

Paul E. Mc Kenney: In Practice, What is the C Language, Really?

The official definition of the C Language is the standard, but the standard doesn't actually compile any programs. One can argue that the actual implementations are the real definition of the C Language, although further thought along this line usually results in a much greater appreciation of the benefits of having standards. Nevertheless, the implementations usually win any conflicts with the standard, at least in the short term.

Another interesting source of definitions is the opinions of the developers who actually write C. And both the standards bodies and the various implementations do take these opinions into account at least some of the time. Differences of opinion within the standards bodies are sometimes settled by surveying existing usage, and implementations sometimes provide facilities outside the standard based on user requests. For example, relatively few compiler warnings are actually mandated by the standard.

Although one can argue that the standard is the end-all and be-all definition of the C Language, the fact remains that if none of the implementers provide a facility called out by the standard, the implementers win. Similarly, if nobody uses a facility that is called out by the standard, the users win—even if that facility is provided by each and every implementation. Of course, things get more interesting if the users want something not guaranteed by the standard.

Therefore, it is worth knowing what users expect, even if only to adjust their expectations, as John Regehr has done for number of topics, perhaps most notably signed integer overflow. Some researchers have been taking a more proactive stance, with one example being Peter Sewell's group from the University of Cambridge. This group has put together a survey on padding bytes, pointer arithmetic, and unions. This survey is quite realistic, with “that would be crazy” being a valid answer to a number of the questions.

So, if you think you know a thing or two about C's handling of padding bytes, pointer arithmetic, and unions, take the survey!

June 29, 2015 04:46 PM

June 26, 2015

Daniel Vetter: Neat drm/i915 stuff for 4.2

The 4.1 kernel release is still a few weeks off and hence a bit early to talk about 4.2. But the drm subsystem feature cut-off already passed and I'm going on vacation for 2 weeks, so here we go.

First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.2.

There has also been other feature work going on on the modeset side: Ville cleaned&fixed up the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence, maybe this will finally fix the remaining "DP port stuck" issues we still seem to have.

Looking at newer platforms the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms skl now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which have been enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for atom chips was already shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label for the i915 driver.

There's also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forward requests and handing back results. Mika Kahola has optimized the DP link training, the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work, unfortunately it's still not yet enabled by default. And finally there's been lots of cleanups and improvements under the hood all over, as usual.

A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release.

Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We've added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and ever since then paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual enviroments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.

June 26, 2015 08:58 AM

June 25, 2015

Rusty Russell: Hashing Speed: SHA256 vs Murmur3

So I did some IBLT research (as posted to bitcoin-dev ) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them to create a 16-bit index offset.  Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blopockalypse.

For txid48, we hash an 8 byte seed with the 32-byte txid; I ignored the 8 byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT): (implementation in CCAN)

  1. Bitcoin’s SHA256: 527.7+/-0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
  3. Intel’s asm rorx: 314.1+/-0.3 nsec
  4. Intel’s asm SSE4 337.5+/-0.5 nsec
  5. Intel’s asm RORx-x8ms 458.6+/-2.2 nsec
  6. Intel’s asm AVX 336.1+/-0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT.  This is too slow (though it’s fairly trivially parallelizable).  However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.

June 25, 2015 07:51 AM

June 24, 2015

Matthew Garrett: Python for remote reconfiguration of server firmware

One project I've worked on at Nebula is a Python module for remote configuration of server hardware. You can find it here, but there's a few caveats:

  1. It's not hugely well tested on a wide range of hardware
  2. The interface is not yet guaranteed to be stable
  3. You'll also need this module if you want to deal with IBM (well, Lenovo now) servers
  4. The IBM support is based on reverse engineering rather than documentation, so who really knows how good it is

There's documentation in the README, and I'm sorry for the API being kind of awful (it suffers rather heavily from me writing Python while knowing basically no Python). Still, it ought to work. I'm interested in hearing from anybody with problems, anybody who's interested in getting it on Pypi and anybody who's willing to add support for new HP systems.

(Edited to update URL after Nebula went out of business and stopped paying for github)

comment count unavailable comments

June 24, 2015 11:56 PM

June 19, 2015

Rusty Russell: Mining on a Home DSL connection: latency for 1MB and 8MB blocks

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions/

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well).  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7


Using the rough formula of 1-exp(-t/600), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.




June 19, 2015 02:37 AM

June 16, 2015

Michael Kerrisk (manpages): Linux/UNIX System Programming course scheduled for September 2015, Munich

I've scheduled a further 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 14-18 September 2015.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory), and network programming (sockets).
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via

June 16, 2015 02:24 PM

June 11, 2015

Pete Zaitcev: Rich and comments

Rich Jones posted an article about being banned by Boing-Boing, supposedly for bringing attention to their use of affiliate links (the practice that Gamergate groups criticized as well — and scored a regulatory win against). Meanwhile, all my comments at Rich's blog are blackholed, which is quite ironic. Generally, I am not into this "blog comment" thing. Ani-nouto never had any comments and is doing great that way. But some people like comments, so I leave them as necessary.

June 11, 2015 05:38 PM

June 10, 2015

LPC 2015: General Registration is now Closed

We’re pleased to announce that thanks to overwhelming support, interest in Linux Plumbers Conference has exceeded expectations.  The downside is that the conference is now officially full. Originally we were going to post a warning today, but a sudden surge of corporate registrations caught us off guard, so we had to close immediately.

If you still haven’t registered but would like to participate, contact us. We are running a waiting list on a first come first serve basis but with priority given to people who have accepted microconference topics. You could also try to use one of the sponsors tickets if your employer can provide one to you.

We look forward to seeing you in Seattle for a memorable conference.

June 10, 2015 02:14 AM

June 09, 2015

LPC 2015: Graphics Microconference Accepted into 2015 Linux Plumbers Conference

Although the Year of the Linux Desktop has yet to arrive, a surprising number of Linux users nevertheless need graphics support. This is because there have been a number of years of the Linux smartphone, the Linux television, the Linux digital sign/display/billboard, the Linux automobile, and more. This microconference will cover a number of topics including atomic modesetting in KMS, buffer allocation, verified-secure graphics pipelines, fencing and synchronisation, Wayland, and more.

For more information on this important topic, see the wiki page.

June 09, 2015 03:55 PM

June 05, 2015

James Morris: Hiring Subsystem Maintainers

The regular LWN kernel development stats have been posted here for version 4.1 (if you really don’t have a subscription, email me for a free link).  In this, Jon Corbet notes:

over 60% of the changes going into this kernel passed through the hands of developers working for just five companies. This concentration reflects a simple fact: while many companies are willing to support developers working on specific tasks, the number of companies supporting subsystem maintainers is far smaller. Subsystem maintainership is also, increasingly, not a job for volunteer developers..

As most folks reading this would know, I lead the mainline Linux Kernel team at Oracle.  We do have several people on the team who work in leadership roles in the kernel community (myself included), and what I’d like to make clear is that we are actively looking to support more such folk.

If you’re a subsystem maintainer (or acting in a comparable leadership role), please always feel free to contact me directly via email to discuss employment possibilities.  You can also contact Oracle kernel folk who may be presenting or attending Linux conferences.

June 05, 2015 05:34 AM

June 04, 2015

LPC 2015: Thermal Microconference Accepted into 2015 Linux Plumbers Conference

In stark contrast with decades past, thermal issues in computer systems now means much more than fans and heat sinks, and this microconference looks at some of the things that are now handled in software. The topics include the thermal framework, handling of temperature sensors, and different approaches to handling overtemperature conditions, ranging up to and including closed-loop control. Of course, software that is not tested can be assumed not to work, and the best way to ensure that testing happens when needed is to automate it, so automated testing of thermal subsystems is also on the agenda. Coordination with userspace is useful in order to determine how best to shed computational load, as is coordination among multiple cooling devices.

For more information on this important new-to-Plumbers topic, see the wiki page.

June 04, 2015 04:04 PM

June 03, 2015

LPC 2015: File and Storage Systems Microconference Accepted into 2015 Linux Plumbers Conference

Despite having been in production use for many decades, file and storage systems are very active areas, and have retained the ability to provide many new-technology surprises. This year’s edition of the File and Storage Systems Microconference will look at improved error reporting, filesystem-level encryption in traditional filesystems, SMR drives, online fsck, persistent memory, smart block-layer support in traditional filesystems, better interoperability of and support for NFS and Samba, userspace filesystem innovations, and much else besides.

For more information on this topic, see the wiki page.

June 03, 2015 04:33 PM

Rusty Russell: What Transactions Get Crowded Out If Blocks Fill?

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

Sending 2 Or More Sub-$1 Outputs

Let’s zoom in to the interesting part, first, since there’s very little difference before 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

Since March 2013…

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

Blocksizes Without Small Output Transactions

This eliminates far more transactions.  We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the block “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on Satoshdice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R of suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for.  I wish it were different, but we’re going to need other things in the mean time.

I hope this provided useful facts, whatever your opinions.

June 03, 2015 03:57 AM

Rusty Russell: Current Blocksize, by graphs.

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)

Woah! We hit 1M blocks in a month! PAAAANIC!

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

Oh, we’re only halfway there….

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

What! We’re already over 1M blocks?? Maths, you lied to me!

Clearer Graphs: 1 week Moving Average

Actual Weekly Running Average Blocksize

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.

June 03, 2015 02:34 AM

June 02, 2015

LPC 2015: Boot, Init, and Config Microconference Accepted into 2015 Linux Plumbers Conference

The combination of security issues, kernel tinification, and a renewed concern about fast boot has intensified focus on system boot, initialization, and configuration, so much so that there is now a Linux Plumbers Conference Microconference focused on these topics.

In addition to secure boot and minimizing size and bloat, this microconference will delve into a number of topics related to boot speed. These topics include tuning systemd for embedded systems, optimizing and/or delaying memory initialization, deferring initcall-based initialization, introducing parallelism and multicore earlier in boot, speeding up early-boot I/O, pre-loading known configurations, speeding up installation, and of course improved timing and tracing analysis earlier in system startup. In short, the fast-boot work has definitely moved into the sub-second realm. A final bonus topic is better configuring for cloud- and container-based workloads.

For more information on this topic, see the wiki page.

June 02, 2015 03:58 PM