April 18, 2014

Weekly Fedora kernel bug statistics – April 18th 2014

                            19    20  rawhide  (total)
Open:                      103   204      149    (456)
Opened since 2014-04-11      3    14        8     (25)
Closed since 2014-04-11      6     9        6     (21)
Changed since 2014-04-11     6    29       14     (49)

Weekly Fedora kernel bug statistics – April 18th 2014 is a post from: codemonkey.org.uk

April 17, 2014

Daily log April 16th 2014

Added some code to trinity to use random open flags on the fds it opens on startup.

Spent most of the day hitting the same VM bugs as yesterday, or others that Sasha had already reported.
Later in the day, I started seeing this bug after applying a not-yet-merged patch to fix a leak that Coverity had picked up on recently. Spent some time looking into that, without making much progress.
Rounded out the day by trying out latest builds on my freshly reinstalled laptop, and walked into this.

Daily log April 16th 2014 is a post from: codemonkey.org.uk

April 16, 2014

Daily log April 15th 2014

Spent all of yesterday attempting recovery (and failing) on the /home partition of my laptop.
On the weekend, I decided I’d unsuspend it to send an email, and just got a locked up desktop. The disk IO light was stuck on, but it was completely dead to input, couldn’t switch to console. Powered it off and back on. XFS wanted me to run xfs_repair. So I did. It complained that there was pending metadata in the log, and that I should mount the partition to replay it first. I tried. It failed miserably, so I re-ran xfs_repair with -L to zero the log. Pages and pages of scrolly text zoomed up the screen.
Then I rebooted and.. couldn’t log in any more. Investigating as root showed that /home/davej was now /home/lost+found, and within it were a couple dozen numbered directories containing mostly uninteresting files.

So that’s the story about how I came to lose pretty much everything I’ve written in the last month that I hadn’t already pushed to github. I’m still not entirely sure what happened, but I point the finger of blame more at dm-crypt than at xfs at this point, because the non-encrypted partitions were fine.

Ultimately I gave up, reformatted and reinstalled. Kind of a waste of a day (and a half).

Things haven’t been entirely uneventful though:

So there’s still some fun VM/FS horrors lurking. Sasha has been hitting a bunch more huge page bugs too. It never ends.

Daily log April 15th 2014 is a post from: codemonkey.org.uk

April 11, 2014

Weekly Fedora kernel bug statistics – April 11th 2014

                            19    20  rawhide  (total)
Open:                      102   197      143    (442)
Opened since 2014-04-04      3    19        7     (29)
Closed since 2014-04-04      7    13        6     (26)
Changed since 2014-04-04     9    32       10     (51)

Weekly Fedora kernel bug statistics – April 11th 2014 is a post from: codemonkey.org.uk

April 04, 2014

Weekly Fedora kernel bug statistics – April 04 2014

                            19    20  rawhide  (total)
Open:                       99   186      142    (427)
Opened since 2014-03-28      4    17        9     (30)
Closed since 2014-03-28     12    17        4     (33)
Changed since 2014-03-28    15    29       10     (54)

Weekly Fedora kernel bug statistics – April 04 2014 is a post from: codemonkey.org.uk

Monthly Fedora kernel bug statistics – March 2014

                            19    20  rawhide  (total)
Open:                       99   186      142    (427)
Opened since 2014-03-01     32    92       19    (143)
Closed since 2014-03-01    156   183       99    (438)
Changed since 2014-03-01    83   111       31    (225)

Monthly Fedora kernel bug statistics – March 2014 is a post from: codemonkey.org.uk

April 02, 2014

why I suck at finishing stuff, or how I learned to stop working and love DisplayPort MST

DisplayPort 1.2 Multi-stream Transport is a feature that allows daisy chaining of DP devices that support MST into all kinds of wonderful networks of devices. It's been on the TODO list for many developers for a while, but the hw never quite materialised near a developer.

At the start of the week my local Red Hat IT guy asked me if I knew anything about DP MST. It turns out the Lenovo T440s and T540s docks have started to use DP MST, so they have one DP port to the dock, and then the dock has DP->VGA, DP->DVI/DP, and DP->HDMI/DP ports on it, all using MST. So when they bought some of these laptops and plugged two monitors into the dock, it fell back to using SST mode and only showed one image. This is not optimal, I'd call it a bug :)

Now I have a damaged in transit T440s (the display panel is in pieces) with a dock, and have spent a couple of days with the DP 1.2 spec in one hand (monitor), and a lot of my hair in the other. DP MST has a network topology discovery process that is built on sideband msgs sent over the auxch, which is used in normal DP to read/write a bunch of registers on the plugged in device. You can then send auxch msgs over the sideband msgs over auxch to read/write registers on other devices in the hierarchy!

Today I achieved my first goal of correctly encoding the topology discovery message and getting a response from the dock:
[ 2909.990743] link address reply: 4
[ 2909.990745] port 0: input 1, pdt: 1, pn: 0
[ 2909.990746] port 1: input 0, pdt: 4, pn: 1
[ 2909.990747] port 2: input 0, pdt: 0, pn: 2
[ 2909.990748] port 3: input 0, pdt: 4, pn: 3

There are a lot more steps to take before I can produce anything, along with dealing with the fact that KMS doesn't handle dynamic connectors so well. It should make for a fun tangent away from the job I should be doing, which is finishing virgil.

I've ordered another DP MST hub that I can plug into AMD and nvidia gpus that should prove useful later, also for doing deeper topologies, and producing loops.

Also, some 4k monitors use DP MST, as they are really two panels, but I don't have one of them, so unless one appears I'm mostly going to concentrate on the Lenovo docks for now.

March 31, 2014

Linux 3.14 coverity stats

date         rev        Outstanding  fixed  defect density
Jan/20/2014  v3.13      5096         5705   0.59
Feb/03/2014  v3.14-rc1  4904         5789   0.56
Feb/09/2014  v3.14-rc2  4886         5810   0.56
Feb/16/2014  v3.14-rc3  4816         5836   0.55
Feb/23/2014  v3.14-rc4  4792         5841   0.55
Mar/03/2014  v3.14-rc5  4779         5842   0.55
Mar/10/2014  v3.14-rc6  4755         5852   0.54
Mar/17/2014  v3.14-rc7  4934         6123   0.56
Mar/27/2014  v3.14-rc8  4809         6126   0.55
Mar/31/2014  v3.14      4811         6126   0.55

The big thing that stands out this cycle is that the defect ratio was going down until we hit around 3.14-rc7, and then we got a few hundred new issues. What happened?
Nothing in the kernel thankfully. This was due to an upgrade server side to a new version of Coverity which has some new checkers. Some of the existing ones got improved too, so a bunch of false positives we had sitting around in the database are no longer reported. The number of new issues unfortunately was greater than the known false positives[1]. In the days following, I did a first sweep through these and closed out the easy ones, bringing the defect density back down.

note: I stopped logging the ‘dismissed’ totals. With Coverity 7.0, the number can go backwards.
If a file gets deleted, the issues against that file that were dismissed also disappear.
Given this happens fairly frequently, the number isn’t really indicative of anything useful.

With the 3.15 merge window now open, I’m hoping a bunch of the queued fixes I sent over the last few weeks get merged, but I’m fully expecting to need to do some resending.

[1] It was actually worse than this, the ratio went back up to 0.57 right before rc7

Linux 3.14 coverity stats is a post from: codemonkey.org.uk

LSF/MM collaboration summit recap.

It’s been a busy week.
A week ago I flew out to Napa, CA for two days of discussions with various kernel people (ok, and some postgresql people too) about all things VM and FS/IO related. I learned a lot. These short focussed conferences have way more value to me these days personally than the conferences of years ago with a bunch of tracks, and day after day of presentations.

I gave two sessions relating to testing, there are some good write-ups on lwn. It was more of an extended Q&A than a presentation, so I got a lot of useful feedback (and especially afterwards in the hallway sessions). A couple people asked if trinity was doing certain things yet, which led to some code walkthroughs, and a lot of brainstorming about potential solutions.

By the end of the week I was overflowing with ideas for new things it could be doing, and have started on some of the code for this already. One feature I’d had in mind for a while (children doing root operations) but hadn’t gotten around to writing could be done in a much simpler way, which opens the doors to a bunch more interesting things. I might end up rewriting the current ioctl fuzzing (which isn’t finding a huge amount of bugs right now anyway) once this stuff has landed, because I think it could be doing much more ‘targeted’ things.

It was good to meet up with a bunch of people that I’ve interacted with for a while online and discuss some things. Was surprised to learn Sasha Levin is actually local to me, yet we both had to fly 3000 miles to meet.

Two sessions at LSF/MM were especially interesting outside of my usual work.
The postgresql session where they laid out their pain points with the kernel IO was enlightening, as they started off with a quick overview of postgresql’s process model, and how things interact. The session felt like it went off in a bunch of random directions at once, but the end goal (getting a test case kernel devs can run without needing a full postgresql setup) seemed to be reached the following day.

The second session I found interesting was the “Facebook linux problems” session. As mentioned in the lwn write-up, one of the issues was this race in the pipe code. “This is *very* hard to trigger in practice, since the race window is very small”. Facebook were hitting it 500 times a day. Gave me thoughts on a whole bunch of “testing at scale” problems. A lot of the testing I do right now is tiny in comparison. I do stress tests & fuzz runs on a handful of machines, and most of it is done by hand. Doing this kind of thing on a bigger scale makes it a little impractical to do in a non-automated way. But given I’ve been buried alive in bugs with just this small number, it has left me wondering “would I find a load more bugs with more machines, or would it just mean the mean time between reproducing issues gets shorter”. (Given the reproducibility problems I’ve had with fuzz testing sometimes, the latter wouldn’t necessarily be a bad thing). More good thoughts on this topic can be found in a post Google made a few years ago.

Coincidentally, I’m almost through reading How Google Tests Software, which is a decent book, but with not a huge amount of “this is useful, I can apply this” type knowledge. It’s very focussed on the testing of various web-apps, with no real mention of testing of Android, Chrome etc. (The biggest insights in the book aren’t actually testing related, but more the descriptions of Google’s internal re-hiring processes when people move between teams).

Collaboration summit followed from Wednesday onwards. One highlight for me was learning that the tracing code has something coming in 3.15/3.16 that I’ve been hoping for for a while. At last year’s kernel summit, Andi Kleen suggested it might be interesting if trinity had some interaction with ftrace to get traces of “what the hell just happened”. The tracing changes landing over the next few months will allow that to be a bit more useful. Right now, we can only do that on a global system-wide basis, but with that moving to be per-process, things can get a lot more useful.

Another interesting talk was the llvmlinux session. I haven’t checked in on this project in a while, so was surprised to learn how far along they are. Apparently all the necessary llvm changes to build the kernel are either merged, or very close to merging. The kernel changes still have a ways to go, but this too has improved a lot since I last looked. Some good discussion afterwards about the crossover between things like clang’s static analysis warnings and the stuff I’m doing with Coverity.

Speaking of, I left early on Friday to head back to San Francisco to meet up with Coverity. Lots of good discussion about potential workflow improvements, false positive/heuristic improvements etc. A good first meeting if only to put faces to names I’ve been dealing with for the last year. I bugged them about a feature request I’ve had for a while (that a few people the days preceding had also nagged me about); the ability to have per-subsystem notification emails instead of the one global email. If they can hook this up, it’ll save me a lot of time having to manually craft mails to maintainers when new issues are detected.

Busy, busy week, with so many new ideas I felt like my head was full by the time I got on the plane to get back.
Taking it easy for a day or two, before trying to make progress on some of the things I made notes on last week.

LSF/MM collaboration summit recap. is a post from: codemonkey.org.uk

March 29, 2014

I am the CADT; and advice on NEEDINFOing old bugs en masse

[Attention conservation notice: probably not of interest to lawyers; this is about my previous life in software development.]

Bugsquad barnstar, under MPL 1.1 (https://commons.wikimedia.org/wiki/File:MW_Bug_Squad_Barnstar.svg)

Someone recently mentioned JWZ’s old post on the CADT (Cascade of Attention Deficit Teenagers) development model, and that finally has pushed me to say:

I am the CADT.

I did the bug closure that triggered Jamie’s rant, and I wrote the text he quotes in his blog post.[1]

Jamie got some things right, and some things wrong. The main thing he got right is that it is entirely possible to get into a cycle where instead of seriously trying to fix bugs, you just do a rewrite and cross your fingers that it fixes old bugs. And yes, this can particularly happen when you’re young and writing code for fun, where the joy of a from-scratch rewrite can overwhelm some of your other good senses. Jamie also got right that I communicated the issue pretty poorly. Consider this post a belated explanation (as well as a reference for the next time I see someone refer to CADT).

But that wasn’t what GNOME was doing when Jamie complained about it, and I doubt it is actually something that happens very often in any project large enough to have a large bug tracking system (BTS). So what were we doing?

First, as Brendan Eich has pointed out, sometimes a rewrite really is a good idea. GNOME 2 was such a rewrite – not only was a lot of the old code a hairy mess, we decided (correctly) to radically revise the old UI. So in that sense, the rewrite was not a “CADT” decision – the core bugs being fixed were the kinds of bugs that could only be fixed with massive, non-incremental change, rather than “hey, we got bored with the old code”. (Immediately afterwards, GNOME switched to time-based releases, and stuck to that schedule for the better part of a decade, which should be further proof we weren’t cascading.)

This meant there were several thousand old bugs that had been filed against UIs that no longer existed, and often against code that no longer existed or had been radically rewritten. So you’ve got new code and old bugs. What do you do with the old bugs?

It is important to know that open bugs in a BTS are not free. Old bugs impose a cost on developers, because when they are trying to search relevant bugs, old bugs can make it harder to find the things they really should be working on. In the best case, this slows them down; in the worst case, it drives them to use other tools to track the work they want to do – making the BTS next to useless. This violates rule #1 of a BTS: it must be useful for developers, or else it all falls apart.

So why did we choose to reduce these costs by closing bugs filed against the old codebase as NEEDINFO (and asking people to reopen if they were still relevant) instead of re-testing and re-triaging them one-by-one, as Jamie would have suggested? A few reasons:

  • number of triagers v. number of bugs: there were, at the time, around a half-dozen active bug volunteers, and thousands of pre-GNOME 2 bugs. It was simply unlikely that we’d ever be able to review all the old bugs even if we did nothing else.
  • focus on new bugs: new bugs are where triagers and developers are much more likely to be relevant – those bugs are against fresh code; the original filer is much more likely to respond to clarifying questions; etc. So all else being equal, time spent on new bugs was going to be much better for the software than time spent on old bugs.
  • steady flow of new bugs: if you’ve got a small number of new bugs coming in, perhaps you split your time – but we had no shortage of new bugs, nor of motivated bug reporters. So we may have paid some cost (by demotivating some reporters) but our scarce resource (developers) greatly appreciated it.
  • relative burden: with thousands of open bugs from thousands of reporters, it made sense to ask them to test their bug against the new code. Reviewing their old bugs was a small burden for each of them, once we distributed it.

So when isn’t it a good idea to close old bugs and ask for more information about them?

  • Great at keeping old bugs triaged/relevant: If you have a very small number of old bugs that haven’t been touched in a long time, then they aren’t putting much burden on developers.
  • Slow code turnover: If your development process is such that it is highly likely that old bugs are still relevant (e.g., core has remained mostly untouched for many years, or effective use of TDD has kept the number of accidental new bugs low) this might not be a good idea.
  • No triggering event: In GNOME, there was a big event, plus a new influx of triagers, that made it make sense to do radical change. I wouldn’t recommend this “just because” – it should go hand-in-hand with other large changes, like a major release or important policy changes that will make future triaging more effective.

Relatedly, the team practices mailing list has been discussing good practices for migrating bug tracking systems in the past few days, which has been interesting to follow. I don’t take a strong position on where Wikimedia’s bugzilla falls on this point – Mediawiki has a fairly stable core, and the volume of incoming bugs may make triage of old bugs more plausible. But everyone running a very large bugzilla for an active project should remember that this is a part of their toolkit.

  1. Both had help from others, but it was eventually my decision.

March 28, 2014

VPN versus DNS

For years, I did my best to ignore the problem, but CKS inspired me to blog the curious networking banality, in case anyone has wisdom to share.

The deal is simple: I have a laptop with a VPN client (I use vpnc). The client creates a tun0 interface and some RFC 1918 routes. My home RFC 1918 routes are more specific, so routing works great. The name service does not.

Obviously, if we trust the DHCP-supplied nameserver, it has no work-internal names in it. The stock solution is to let vpnc install an /etc/resolv.conf pointing to work-internal nameservers. Unfortunately this does not work for me, because I have a home DNS zone, zaitcev.lan. Work-internal DNS does not know about that one.

Thus I would like some kind of solution that routes DNS requests somehow according to a configuration. Requests to work-internal namespaces (such as *.redhat.com) would go to nameservers delivered by vpnc (I think I can make it write something like /etc/vpnc/resolv.conf that does not conflict). Other requests go to the infrastructure name service, be it a hotel network or home network. Home network is capable of serving its own private authoritative zones and forwarding the rest. That's the ideal, so how to accomplish it?
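To pin down the policy (and only the policy, this is not a working resolver), here is a minimal Python sketch of the per-domain selection described above. The function name, the route table and both addresses are made-up placeholders, not anything vpnc or a real resolver provides.

```python
def pick_nameservers(qname, routes, default):
    """Return the nameservers to ask for qname.

    routes maps a domain suffix to a list of nameservers; the longest
    matching suffix wins.  Anything that matches no suffix goes to the
    default (infrastructure) nameservers.
    """
    name = qname.rstrip('.').lower()
    best_suffix, best_servers = '', default
    for suffix, servers in routes.items():
        s = suffix.strip('.').lower()
        # Match the zone itself or any name below it, but not
        # unrelated names that merely end in the same characters.
        matches = name == s or name.endswith('.' + s)
        if matches and len(s) > len(best_suffix):
            best_suffix, best_servers = s, servers
    return best_servers


# Placeholder addresses: one work nameserver reachable over tun0,
# one nameserver handed out by the local (home or hotel) DHCP.
WORK_ROUTES = {'redhat.com': ['10.0.0.53']}
LOCAL_DEFAULT = ['192.168.1.1']

print(pick_nameservers('mail.redhat.com', WORK_ROUTES, LOCAL_DEFAULT))
print(pick_nameservers('printer.zaitcev.lan', WORK_ROUTES, LOCAL_DEFAULT))
```

The dnsmasq man page documents a server=/domain/address option that expresses the same kind of per-domain forwarding, which may be the piece to look at there.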

I attempted to apply a local dnsmasq, but could not figure out whether it can do what I want and, if yes, how.

For now, I have some scripting that caches work-internal hostnames in /etc/hosts. That works, somewhat. Still, I cannot imagine that nobody thought of this problem. Surely, thousands are on VPNs, and some of them have home networks. And... nobody? (I know that a few people just run VPN on the home infrastructure; that does not help my laptop, unfortunately).

March 21, 2014

Weekly Fedora kernel bug statistics – March 21st 2014

                            19    20  rawhide  (total)
Open:                       98   167      135    (400)
Opened since 2014-03-14      6    19        2     (27)
Closed since 2014-03-14      6   133       84    (223)
Changed since 2014-03-14    18    46       14     (78)

Weekly Fedora kernel bug statistics – March 21st 2014 is a post from: codemonkey.org.uk

March 19, 2014

Fedora rawhide should have GL 3.3 on radeonsi supported hardware

Enabling OpenGL 3.3 on radeonsi required some patches backported to llvm 3.4. I managed to get some time to do this, and rebuilt mesa against the new llvm, so if you have an AMD GPU that is supported by radeonsi you should now see GL 3.3.

For F20 this isn't an option, as backporting llvm is a bit tricky there, though I'm considering doing a copr that has a private llvm build in it. It might screw up some apps, but for most use cases it might be fine.

March 14, 2014

Weekly Fedora kernel bug statistics – March 14th 2014

                            19    20  rawhide  (total)
Open:                       96   274      136    (506)
Opened since 2014-03-07      8    18        7     (33)
Closed since 2014-03-07    133    20        6    (159)
Changed since 2014-03-07    79    35       11    (125)

Weekly Fedora kernel bug statistics – March 14th 2014 is a post from: codemonkey.org.uk

Daily log March 13th 2014

High (low?) point of the day was taking delivery of my new remote power switch.
You know that weird PCB chemical smell some new electronics have? Once I got this thing out the box it smelled so strong, I almost gagged. I let it ‘air’ for a little while, hoping it would dissipate. It didn’t. Or if it did, I couldn’t tell because now the whole room stunk. Then I made the decision to plug it in anyway. Within about a minute the smell went away. Well, not so much “went away”. More like, “was replaced with the smell of burning electronics”.
So that’s another fun “hardware destroys itself as soon as I get a hold of it” story, and yet another Amazon return.
(And I guess I’m done with ‘digital loggers’ products).

In non hardware-almost-burning-down-my-house news:

  • More chasing the VM bugs. Specifically the bad rss-counter messages.
  • Wrote some code to fix up a trinity deadlock I introduced. It fixes the problem, but not happy with it, so haven’t committed it yet. Should be done tomorrow.

Daily log March 13th 2014 is a post from: codemonkey.org.uk

March 13, 2014

Daily log March 12th 2014

I’ve been trying to chase down the VM crashes I’ve been seeing. I’ve managed to find ways to reproduce some of them a little faster, but not really getting any resolution so far. Hacked up a script to run a subset of random VM related syscalls. (Trinity already has ‘-g vm’, but I wanted something even more fine-grained). Within minutes I was picking up VM traces that so far I’ve only seen Sasha reporting on -next.

Daily log March 12th 2014 is a post from: codemonkey.org.uk

March 12, 2014

Okay, it's all broken. Now what?

A rant in ;login was making rounds recently (h/t @jgarzik), which I thought was not all that relevant... until I remembered that Swift Power Calculator has mysteriously stopped working for me. Its creator is powerless to do anything about it, and so am I.

So, it's relevant all right. We're in big trouble even if Gmail kind of works most of the time. But the rant makes no recommendations, only observations. So it's quite unsatisfying.

BTW, it reminds me about a famous preso by Jeff Mogul, "What's wrong with HTTP and why it does not matter". Except Mogul's rant was more to the point. They don't make engineers like they used to, apparently. Also notably, I think, Mogul prompted development of RESTful improvements. But there's nothing we can do about excessive thickness of our stacks (that I can see). It's just spiralling out of control.

March 11, 2014

Daily log March 10th 2014

Been hitting a number of VM related bugs the last few days.

The first bug is the one that concerns me most right now, though the 2nd is feasibly something that some non-fuzzer workloads may hit too. Other than these bugs, 3.14-rc6 is working pretty well for me.

Daily log March 10th 2014 is a post from: codemonkey.org.uk

March 10, 2014


I suppose everyone has to pass through a hardware phase, and mine is now, for which I implemented an LED blinker with an AVRtiny2313. I don't think it even merits the usual blog laydown. Basically all it took was following tutorials to the letter.

For the initial project, I figured that learning gEDA would take too much, so I unleashed an inner hipster and used Fritzing. Hey, it allows planning breadboards, so there. And well, it was a learning experience and no mistake. Crashes, impossible to undo changes, UI elements outside of the screen, everything. Black magic everywhere: I could never figure out how to merge wires, dedicate a ground wire/plane, or edit labels (so all of them are incorrect in the schematic above). The biggest problem was the lack of library support together with an awful parts editor. Editing schematics in Inkscape was so painful that I resigned myself to doing a piss-poor job, evident in all the crooked lines around the AVRtiny2313. I understand that Fritzing's main focus is iPad, but this is just at the level of a typical outsourced Windows application.

Inkscape deserves a special mention due to the way Fritzing requires SVG files to be in a particular format. If you load and edit some of those, the grouping defeats Inkscape features, so one cannot even select elements at times. And editing the raw XML causes the weirdest effects, so it's not like LyX-on-TeX, edit and visualize. At least our flagship vector graphics package didn't crash.

The avr-gcc is awesome though. 100% turnkey: yum install and you're done. Same for avrdude. No muss, no fuss, everything works.

March 08, 2014

Weekly Fedora kernel bug statistics – March 07 2014

                            19    20  rawhide  (total)
Open:                      225   263      135    (623)
Opened since 2014-02-28     12    29        9     (50)
Closed since 2014-02-28      8    41        8     (57)
Changed since 2014-02-28    18    54       16     (88)

Weekly Fedora kernel bug statistics – March 07 2014 is a post from: codemonkey.org.uk

March 07, 2014

Suddenly, Python Magic

Looking at a review by Solly today, I saw something deeply disturbing. A simplified version that I tested follows:

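import unittest

class Context(object):
    def __init__(self):
        self.func = None
    def kill(self):
        # Simulate signal delivery by invoking the installed handler.
        # (The signal number is an arbitrary stand-in.)
        self.func(31)

class TextGuruMeditationMock(object):

    # The .run() normally is implemented in the report.Text.
    def run(self):
        return "Guru Meditation Example"

    @classmethod
    def setup_autorun(cls, ctx, dump_with=None):
        # The lambda closes over dump_with and forwards it to the handler.
        ctx.func = lambda *args: cls.handle_signal(dump_with,
                                                   *args)

    @classmethod
    def handle_signal(cls, dump_func, *args):
        try:
            res = cls().run()
        except Exception:
            dump_func("Unable to run")
        else:
            dump_func(res)

class TestSomething(unittest.TestCase):

    def test_dump_with(self):
        ctx = Context()

        class Writr(object):
            def __init__(self):
                self.res = ''

            def go(self, out):
                self.res += out

        target = Writr()
        # target.go is handed over as a plain argument here.
        TextGuruMeditationMock.setup_autorun(ctx, dump_with=target.go)
        ctx.kill()
        self.assertIn('Guru Meditation', target.res)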

Okay, obviously we're setting a signal handler, which is a little lambda, which invokes the dump_with, which ... is a class method? How does it receive its self?!

I guess that the deep Python magic occurs in how the method target.go is prepared to become an argument. The only explanation I see is that Python creates some kind of activation record for this, which includes the instance (target) and the method, and that record is the object being passed down as dump_with. I knew that Python did it for scoped functions, where we have global dict, local dict, and all that good stuff. But this is different, isn't it? How does it even know that target.go belongs to target? In what part of the Python spec is it described?

UPDATE: Commenters provided hints with the key idea being a "bound method" (a kind of user-defined method).

A user-defined method object combines a class, a class instance (or None) and any callable object (normally a user-defined function).

When a user-defined method object is created by retrieving a user-defined function object from a class, its im_self attribute is None and the method object is said to be unbound. When one is created by retrieving a user-defined function object from a class via one of its instances, its im_self attribute is the instance, and the method object is said to be bound.

Thanks, Josh et al.!

UPDATE: See also Chris' explanation and Peter Donis' comment re. unbound methods gone from py3.
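The bound-method behaviour quoted above can be poked at directly. This is a small sketch reusing the Writr class from the test; call_later is a made-up helper standing in for whatever eventually invokes the callback. It uses the __self__ and __func__ attribute names, which exist on bound methods in Python 2.6+ and Python 3 (im_self in the quoted text is the older Python 2 spelling).

```python
class Writr(object):
    def __init__(self):
        self.res = ''

    def go(self, out):
        self.res += out


def call_later(callback):
    # Hypothetical helper: it knows nothing about Writr or target,
    # it only calls whatever it was given.
    callback("Guru Meditation")


target = Writr()
bound = target.go   # looking up go via the instance yields a bound method

# The bound method object carries the instance and the plain function.
assert bound.__self__ is target
assert bound.__func__ is Writr.__dict__['go']

call_later(bound)   # equivalent to Writr.__dict__['go'](target, "Guru Meditation")
print(target.res)
```

So the "activation record" guess was close: the object passed around as dump_with is a small method object that remembers its instance.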

March 06, 2014

DAILY LOG March 5TH 2014

Spent some time chasing down what looks like a race condition in the watchdog code in trinity.
The symptom was a crash on x86-64, where it would try and decode a 32-bit syscall using the 64-bit syscall table. This segfaulted, because the 64-bit table is shorter. I stared at the code for quite a while, and adding debugging printfs at the crash site made the bug disappear. What I think was happening was that the child processes are updating two separate variables (one, a bool that says if we’re doing 32 or 64 bit calls, and two the syscall number), and the watchdog code was reading them in the middle of them being updated. I added some locking code to make sure we don’t read either value before an update is complete.
I’ve not managed to reproduce the bug since, so I’m really hoping I got it right.
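For illustration only, here is the shape of that fix as a Python sketch. Trinity itself is C and the children are separate processes, so this uses threads and a threading.Lock purely as a stand-in; the class, the field names and both table lengths are invented for the example.

```python
import threading

# Hypothetical table sizes, chosen only so that some 32-bit syscall
# numbers are out of range for the (shorter) 64-bit table.
TABLE_LEN_32 = 350
TABLE_LEN_64 = 320


class SyscallRecord(object):
    """Stand-in for the per-child state that the watchdog reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._do32bit = False
        self._nr = 0

    def update(self, do32bit, nr):
        # Both fields change under the lock, so a reader can never
        # observe the new flag paired with the old syscall number.
        with self._lock:
            self._do32bit = do32bit
            self._nr = nr

    def snapshot(self):
        with self._lock:
            return self._do32bit, self._nr


def is_consistent(do32bit, nr):
    limit = TABLE_LEN_32 if do32bit else TABLE_LEN_64
    return 0 <= nr < limit


def run_demo(iterations=20000):
    """Return the list of inconsistent pairs the watchdog saw."""
    rec = SyscallRecord()
    torn = []
    done = threading.Event()

    def child():
        for i in range(iterations):
            if i % 2:
                rec.update(True, 340)   # only valid in the 32-bit table
            else:
                rec.update(False, 5)
        done.set()

    def watchdog():
        while not done.is_set():
            do32bit, nr = rec.snapshot()
            if not is_consistent(do32bit, nr):
                torn.append((do32bit, nr))

    threads = [threading.Thread(target=child),
               threading.Thread(target=watchdog)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torn


print(run_demo())
```

With the lock in place the watchdog only ever sees matching pairs, so the list comes back empty. Without it, a reader could pick up the 64-bit flag together with syscall number 340, which is the kind of out-of-range lookup described above.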

In other news:
Hit a page locking bug, and some DMA debug spew.

DAILY LOG March 5TH 2014 is a post from: codemonkey.org.uk

March 04, 2014

Spin Class


via Roadsidepictures (CC BY-NC 2.0)

We’ve recently binged on making NetworkManager work better in more places, mostly enterprise and virtualization related.  But one thing we’ve wanted to do for a long time was make things more modular.  And that got landed this week via the dev-plugins branch, which makes ATM, Bluetooth, and WWAN into shared libraries loaded optionally at startup.

Distro packagers can now create separate NetworkManager-atm, NetworkManager-bluetooth, and NetworkManager-wwan packages, each with their own dependencies, while the NetworkManager core goes forward as a slimmer, smaller, more efficient version of its previous self.  If you’re installing NetworkManager into minimal environments, you can just ignore or remove these plugins and revel in your newfound minimalism.

The core NM binary is now about 15% smaller, and there’s a corresponding 7.5% RSS reduction at runtime when no plugins are loaded.  What’s next?  Possibly WiFi, which would save about 6 – 8% of the core binary size.

March 01, 2014

Monthly Fedora kernel bug statistics – February 2014

                            19    20  rawhide  (total)
Open:                      215   275      132    (622)
Opened since 2014-02-01     15    76       32    (123)
Closed since 2014-02-01     33    66       29    (128)
Changed since 2014-02-01    33   256       36    (325)

Monthly Fedora kernel bug statistics – February 2014 is a post from: codemonkey.org.uk

February 28, 2014

Weekly Fedora kernel bug statistics – February 28th 2014

                            19    20  rawhide  (total)
Open:                      216   276      133    (625)
Opened since 2014-02-21      1    23        7     (31)
Closed since 2014-02-21      5    25        4     (34)
Changed since 2014-02-21    11   232       11    (254)

Weekly Fedora kernel bug statistics – February 28th 2014 is a post from: codemonkey.org.uk

February 26, 2014

On filesystem testing.

Last year I hacked up a small shell script to test various IO related things like “create a RAID5 array, put an XFS file system on it, create a bunch of files on it”.

Despite its crudeness, it ended up finding a bunch of kernel bugs. Unfortunately many of them were not easily reproducible, and required hours of runtime. There were also some problems with scaling the tests. Every time I wanted to add another test, or another filesystem, the overall runtime grew dramatically. Before my test box with 4 SATA disks died, it would take over 3 hours for a single run.

So I’ve been sketching up ideas for a replacement to address a number of these shortcomings.
Firstly, it’s in C. Shell was fun for coming up with an initial proof of concept, but for some things like better management of threads, it’s just not going to work. Speaking of threads, one of the reasons that the runtime was previously so long was that it never took advantage of idle disks. So if for example, I have 4 disks, and I want to run a 2 disk RAID0 stripe in one test, I should be able to launch additional threads to do something interesting with the other 2 idle disks.
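
The replacement itself is in C and isn't published yet, so the following is only a rough Python sketch of the "don't leave disks idle" idea, not the tool's actual code. The function name, the device names and the test names are all made up for illustration.

```python
def plan_round(idle_disks, pending_tests):
    """Greedily hand idle disks to pending tests.

    idle_disks:    list of device names (hypothetical, e.g. "sda").
    pending_tests: list of (test_name, disks_needed), in priority order.
    Returns (assignments, still_pending): assignments maps each test that
    can start now to the disks it gets; still_pending keeps the rest.
    """
    free = list(idle_disks)
    assignments = {}
    still_pending = []
    for name, needed in pending_tests:
        if needed <= len(free):
            # Enough idle disks: this test can run in its own thread now.
            assignments[name] = free[:needed]
            free = free[needed:]
        else:
            # Not enough disks left; try again once something finishes.
            still_pending.append((name, needed))
    return assignments, still_pending


if __name__ == "__main__":
    disks = ["sda", "sdb", "sdc", "sdd"]
    tests = [("raid0-stripe", 2), ("raid5-xfs", 3),
             ("single-disk-a", 1), ("single-disk-b", 1)]
    running, waiting = plan_round(disks, tests)
    print(running)
    print(waiting)
```

With 4 disks, the 2 disk RAID0 test starts straight away and the two single-disk tests fill the remaining disks, while the 3 disk RAID5 test waits for the next round.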

The code for this is still very early, and doesn’t do much of anything yet, but it’ll show up on github at some point.

In the meantime, I’ve been trying to put together something to test on. For reasons unexplained, the quad opteron that held all my disks no longer powers up. I spent a couple hours trying to revive it with various spare parts, without luck.
Yesterday the idea occurred to me that I could just use a USB hub and a bunch of old memory sticks for now.
It would have the advantage of being easily portable while travelling. Then I rediscovered just how crap no-name Chinese USB hubs are. Devices sometimes showing up, sometimes not. Devices falling off the bus. Sometimes the whole hub disappearing. Sometimes refusing to even power up. I tossed the idea. For now, I’ve got this usb-sata thing connected to an SSD. Portable, fast, and surprisingly, entirely stable.

I’ve got a bunch of other ideas for this tool beyond what the io-tests shell script did, and I suspect after next month’s VM/FS summit, I’ll have a load more.

On filesystem testing. is a post from: codemonkey.org.uk

February 23, 2014

Hacking a Wifi Kettle

Here is a quick writeup of the protocol for the iKettle taken from my Google+ post earlier this month. This protocol allows you to write your own software to control your iKettle or get notifications from it, so you can integrate it into your desktop or existing home automation system.

The iKettle is advertised as the first wifi kettle, available in the UK since February 2014. I bought mine on pre-order back in October 2013. When you first turn on the kettle it acts as a wifi hotspot and they supply an app for Android and iPhone that reconfigures the kettle to then connect to your local wifi hotspot instead. The app then communicates with the kettle on your local network enabling you to turn it on, set some temperature options, and get a notification when it has boiled.

Once connected to your local network the device responds to ping requests and listens on two tcp ports, 23 and 2000. The wifi connectivity is enabled by a third party serial to wifi interface board and it responds similarly to an HLK-WIFI-M03. Port 23 is used to configure the wifi board itself (to tell it what network to connect to and so on). Port 2000 is passed through to the processor in the iKettle to handle the main interface to the kettle.

Port 2000, main kettle interface

The iKettle wifi interface listens on tcp port 2000; all devices that connect to port 2000 share the same interface and therefore receive the same messages. The specification for the wifi serial board states that the device can only handle a few connections to this port at a time. The iKettle app also uses this port to do the initial discovery of the kettle on your network.


Sending the string "HELLOKETTLE\n" to port 2000 will return with "HELLOAPP\n". You can use this to check you are talking to a kettle (and if the kettle has moved addresses due to DHCP you could scan the entire local network looking for devices that respond in this way). You might receive other HELLOAPP messages at later points as other apps on the network connect to the kettle.
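
As a minimal sketch of that handshake in Python: the port number and the two strings are from the description above, while the function names, the timeout and the 64 byte read size are my own choices. It assumes the reply arrives in a single read, which is not guaranteed.

```python
import socket

KETTLE_PORT = 2000
HELLO = b"HELLOKETTLE\n"
EXPECTED = b"HELLOAPP"


def is_kettle_reply(data):
    """True if the bytes look like the answer to HELLOKETTLE."""
    return data.strip().startswith(EXPECTED)


def probe(host, timeout=2.0):
    """Return True if `host` answers the handshake on port 2000."""
    try:
        with socket.create_connection((host, KETTLE_PORT),
                                      timeout=timeout) as sock:
            sock.sendall(HELLO)
            return is_kettle_reply(sock.recv(64))
    except OSError:
        # Refused, unreachable or timed out: not a kettle we can talk to.
        return False
```

Calling probe() on each address of your local subnet in turn is one way to find a kettle that has moved addresses. Keep the scan sequential, as the board only handles a few connections at a time.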

Initial Status

Once connected you need to figure out if the kettle is currently doing anything as you will have missed any previous status messages. To do this you send the string "get sys status\n". The kettle will respond with the string "sys status key=\n" or "sys status key=X\n" where X is a single character. Bitfields in character X tell you what buttons are currently active:

Bit 6    Bit 5    Bit 4    Bit 3    Bit 2    Bit 1

So, for example if you receive "sys status key=!" then buttons "100C" and "On" are currently active (and the kettle is therefore turned on and heating up to 100C).
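
Here is a small Python sketch of decoding that reply. It only reports which bit numbers are set rather than naming buttons, and it assumes bit 1 is the least significant bit. That assumption fits the "!" example (0x21, the lowest and highest of the six bits), but it is my reading rather than something the kettle documents. The function name is made up.

```python
PREFIX = "sys status key="


def active_bits(line):
    """Return the bit numbers (1 to 6) set in the key character."""
    if line.endswith("\n"):
        line = line[:-1]          # drop the message terminator only
    if not line.startswith(PREFIX):
        raise ValueError("not a key status line: %r" % line)
    key = line[len(PREFIX):]
    if key == "":
        return []                 # "sys status key=" means nothing is active
    value = ord(key[0])
    # Assumption: bit 1 is the least significant bit of the character.
    return [bit for bit in range(1, 7) if value & (1 << (bit - 1))]


if __name__ == "__main__":
    print(active_bits("sys status key=!\n"))
```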

Status messages

As the state of the kettle changes, either by someone pushing the physical button on the unit, using an app, or sending the command directly you will get async status messages. Note that although the status messages start with "0x" they are not really hex. Here are all the messages you could see:

sys status 0x100     100C selected
sys status 0x95      95C selected
sys status 0x80      80C selected
sys status 0x100     65C selected
sys status 0x11      Warm selected
sys status 0x10      Warm has ended
sys status 0x5       Turned on
sys status 0x0       Turned off
sys status 0x8005    Warm length is 5 minutes
sys status 0x8010    Warm length is 10 minutes
sys status 0x8020    Warm length is 20 minutes
sys status 0x3       Reached temperature
sys status 0x2       Problem (boiled dry?)
sys status 0x1       Kettle was removed (whilst on)

You can receive multiple status messages given one action, for example if you turn the kettle on you should get a "sys status 0x5" and a "sys status 0x100" showing the "on" and "100C" buttons are selected. When the kettle boils and turns off you'd get a "sys status 0x3" to notify you it boiled, followed by a "sys status 0x0" to indicate all the buttons are now off.
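
A rough Python sketch of turning those async messages into descriptions. The codes and descriptions are copied from the list above; the function and table names are made up. Since the codes aren't really hex, they are treated as plain strings. The list gives 0x100 for both 100C and 65C, so this sketch can't tell those two apart.

```python
STATUS_PREFIX = "sys status "

STATUS_MEANINGS = {
    "0x100": "100C selected",   # the list also gives 0x100 for 65C
    "0x95": "95C selected",
    "0x80": "80C selected",
    "0x11": "Warm selected",
    "0x10": "Warm has ended",
    "0x5": "Turned on",
    "0x0": "Turned off",
    "0x8005": "Warm length is 5 minutes",
    "0x8010": "Warm length is 10 minutes",
    "0x8020": "Warm length is 20 minutes",
    "0x3": "Reached temperature",
    "0x2": "Problem (boiled dry?)",
    "0x1": "Kettle was removed (whilst on)",
}


def describe_status(line):
    """Describe one async status line, or return None if it isn't one."""
    line = line.strip()
    if not line.startswith(STATUS_PREFIX):
        return None               # e.g. a stray HELLOAPP
    code = line[len(STATUS_PREFIX):]
    if code.startswith("key="):
        return None               # reply to "get sys status", not async
    return STATUS_MEANINGS.get(code, "unknown status %s" % code)


if __name__ == "__main__":
    for msg in ("sys status 0x5\n", "sys status 0x100\n", "sys status 0x3\n"):
        print(describe_status(msg))
```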

Sending an action

To send an action to the kettle you send one or more action messages corresponding to the physical keys on the unit. After sending an action you'll get status messages to confirm them.

set sys output 0x80      Select 100C button
set sys output 0x2       Select 95C button
set sys output 0x4000    Select 80C button
set sys output 0x200     Select 65C button
set sys output 0x8       Select Warm button
set sys output 0x8005    Warm option is 5 mins
set sys output 0x8010    Warm option is 10 mins
set sys output 0x8020    Warm option is 20 mins
set sys output 0x4       Select On button
set sys output 0x0       Turn off
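
A minimal Python sketch of building those action messages. The codes are the ones listed above; the short button names and function names are my own labels. It also assumes each message ends with "\n" like the HELLOKETTLE and status strings do, which is not stated explicitly for actions.

```python
ACTIONS = {
    "100C": "0x80",
    "95C": "0x2",
    "80C": "0x4000",
    "65C": "0x200",
    "warm": "0x8",
    "warm_5min": "0x8005",
    "warm_10min": "0x8010",
    "warm_20min": "0x8020",
    "on": "0x4",
    "off": "0x0",
}


def action_message(name):
    """Return the bytes to send for one button press."""
    # Assumption: actions are newline-terminated like the other messages.
    return ("set sys output %s\n" % ACTIONS[name]).encode("ascii")


def send_actions(sock, names):
    """Send several button presses over an already connected socket."""
    for name in names:
        sock.sendall(action_message(name))


if __name__ == "__main__":
    print(action_message("on"))
    print(action_message("100C"))
```

After send_actions(sock, ["on", "100C"]) you would then read the socket and expect the confirming status messages described earlier.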

Port 23, wifi interface

The user manual for this board is available online, so there is no need to repeat it here. The iKettle uses the device with the default password of "000000" and disables the web interface.

If you're interested in looking at the web interface you can enable it by connecting to port 23 using telnet or nc, entering the password, then issuing the commands "AT+WEBS=1\n" then "AT+PMTF\n" then "AT+Z\n" and then you can open up the webserver on port 80 of the kettle and change or review the settings. I would not recommend you mess around with this interface: you could easily break the iKettle in a way that you can't easily fix. The interface gives you the option of uploading new firmware, but if you do this you could get into a state where the kettle processor can't correctly configure the interface and you're left with a broken kettle. Also the firmware is just for the wifi serial interface, not for the kettle control (the port 2000 stuff above), so there probably isn't much point.

Missing functions

The kettle processor knows the temperature but it doesn't expose that in any status message. I did try brute forcing the port 2000 interface using combinations of words in the dictionary, but I found no hidden features (and the folks behind the kettle confirmed there is no temperature read out). This is a shame since you could combine the temperature reading with time and figure out how full the kettle is whilst it is heating up. Hopefully they'll address this in a future revision.

Security Implications

The iKettle is designed to be contacted only through the local network - you don't want to be port forwarding to it through your firewall for example because the wifi serial interface is easily crashed by too many connections or bad packets. If you have access to a local network on which there is an iKettle you can certainly cause mischief by boiling the kettle, resetting it to factory settings, and probably even bricking it forever. However the cleverly designed segmentation between the kettle control and wifi interface means it's pretty unlikely you can do something more serious like overriding safety (i.e. keeping the kettle element on until something physically breaks).

February 21, 2014

Weekly Fedora kernel bug statistics – February 21st 2014

  18 19 20 rawhide  
Open: 0 219 278 129 (626)
Opened since 2014-02-14 0 3 23 3 (29)
Closed since 2014-02-14 0 10 14 7 (31)
Changed since 2014-02-14 0 11 40 7 (58)

Weekly Fedora kernel bug statistics – February 21st 2014 is a post from: codemonkey.org.uk

February 19, 2014


Some work today on trinity to rid it of some hard-coded limits on the number of child processes. Now, if you have some ridiculously overpowered machine with hundreds of processors, it should run at least one child process per thread instead of maxing out at 64 like before. (It also allows overriding the maximum number of running children with the -C parameter as always, now with no upper bound, other than memory allocation for all the arrays).

Aside from that, some digging through coverity, and some abortive attempts at cleaning up some more “big function” drivers. Some of them are such a mess they need bigger changes than simply hoisting code out into functions. There’s a point though where I start to feel uncomfortable changing them without hardware to test on.

DAILY LOG FEBRUARY 19TH 2014 is a post from: codemonkey.org.uk

February 17, 2014


  • Last week’s cleanups to the staging/bcm driver had a neat side-effect. Dan Carpenter’s smatch tool started picking up some new warnings now that the functions are bite-sized enough for it to parse.
  • This incentivized me enough to continue working on splitting up some of the mega-functions we have in the kernel. Not done by a stretch yet, but should have a bunch of patches for 3.15 by the end of the week.
  • Finally got around to doing something about this atrocity. I haven’t really cared about reiserfs for the better part of a decade, but damn that was too ugly to live.

DAILY LOG FEBRUARY 18TH 2014 is a post from: codemonkey.org.uk

February 14, 2014

Weekly Fedora kernel bug statistics – February 14th 2014

  18 19 20 rawhide  
Open: 0 226 264 130 (620)
Opened since 2014-02-07 0 5 16 16 (37)
Closed since 2014-02-07 1 4 12 14 (31)
Changed since 2014-02-07 0 12 32 20 (64)

Weekly Fedora kernel bug statistics – February 14th 2014 is a post from: codemonkey.org.uk


For reasons not entirely clear, I felt the urge to hack some on x86info again. Ended up committing dozens of changes cleaning up old crap. A lot of that code was written over 10 years ago, and I like to believe that my skills have improved somewhat in that time. Lots of it really wants gutting and rewriting from scratch, but the effort really isn’t worth it given that for many of its features there now exist better tools (lscpu for example). It’s likely that 1.31 will be the final release. After that, I’m thinking of stripping out some parts of it to standalone tools, and throwing away the rest.

Chasing the xfs overflow continued.

Looked over some coverity bugs. One of the features it offers is a view that shows functions with a high cyclomatic complexity. What this typically translates to is “really long functions that are in dire need of splitting up”.

  • Top of the list is a function in lustre called lustre_assert_wire_constants, which is 4493 lines long!
    Thankfully, this function is generated and wasn’t written by a human.
  • In second place is cx23885_dif_setup, at 2909 lines. This one actually looks hand-written, but is pretty much just one giant switch statement.
  • Third place is taken by altera_execute which is pretty horrendous. 1910 lines of state machine.
    I tried a few times at cleaning it up, before giving up. Perhaps I’ll revisit some time.
  • In fourth place was a 1906 line ioctl dispatch switch statement in the staging/bcm driver. I spent a while splitting this out to one function per ioctl. 38 patches later, and it looks a lot more readable, and opens things up to some further cleanups that weren’t immediately obvious. I’ll come back to that later maybe.

After those four, the next dozen or so items on the list were still quite long functions, but I somehow quickly lost the urge to look at reiserfs and isdn.

A lot of this kind of ‘cleaning’ is pretty mind-numbing, boring work, but sometimes when I see something that horrific, I just can’t walk away..

DAILY LOG FEBRUARY 13TH 2014 is a post from: codemonkey.org.uk

February 13, 2014

Linux 3.13 coverity stats

A belated statistics dump of how Coverity looked during 3.13 development.

date rev Outstanding dismissed
Nov/4/2013 v3.12 5215 1061
Nov/24/2013 v3.13-rc1 5164 1099
Nov/30/2013 v3.13-rc2 5149 1105
Dec/6/2013 v3.13-rc3 5146 1108
Dec/14/2013 v3.13-rc4 5130 1110
Dec/23/2013 v3.13-rc5 5116 1121
—- v3.13-rc6 —-
Jan/7/2014 v3.13-rc7 5098 1136
Jan/14/2014 v3.13-rc8 5096 1136
Jan/20/2014 v3.13 5096 1136

Things kinda slowed down over the xmas break, but the overall trend from 3.12 -> 3.13 is ~200 open issues lower, even after including new issues being introduced.
Which is what 3.11 -> 3.12 showed too. Just 25 more releases, and we’re done :-)
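
As a back-of-the-envelope check on that "25 more releases", here is the arithmetic in Python, using only the Outstanding column of the table above. The ~200 per release rate is the rough figure quoted in the text. The table's own v3.12 to v3.13 difference is smaller, so both rates are shown. The function name is made up.

```python
# Outstanding counts copied from the table above.
OUTSTANDING = {"v3.12": 5215, "v3.13": 5096}


def releases_until_zero(outstanding, drop_per_release):
    """Releases needed to hit zero at a constant drop per release."""
    return outstanding / float(drop_per_release)


# Drop in the Outstanding column between the two releases.
table_drop = OUTSTANDING["v3.12"] - OUTSTANDING["v3.13"]

if __name__ == "__main__":
    print(table_drop)
    # At the rough ~200 per release quoted in the text:
    print(round(releases_until_zero(OUTSTANDING["v3.13"], 200)))
    # At the rate the Outstanding column alone shows:
    print(round(releases_until_zero(OUTSTANDING["v3.13"], table_drop)))
```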

The new issues are getting jumped on pretty quickly too, which is good to see.

Linux 3.13 coverity stats is a post from: codemonkey.org.uk

February 12, 2014


Hit an interesting bug yesterday without really trying. What looks like a case of interrupts being disabled when they shouldn’t be is believed to actually be a stack overrun. But rather than crashing, we’re corrupting the state of the irq flags. I’m a little sceptical still that this is the actual cause, but right now it’s the best answer being offered. As such, people are starting to look at the amount of stack used during some of the IO paths.
As can be seen in the linked stack trace above, the callchain can get pretty deep, and seems to be getting worse over time.

Fixed up a handful of small trinity bugs that caused children to segfault. There’s still a few remaining. I’ve held off on fixing them for now, because the current state of trinity segv’ing with certain parameters is a useful reproducer for the bug mentioned above.

Continued chipping away at the coverity backlog.

DAILY LOG FEBRUARY 12TH 2014 is a post from: codemonkey.org.uk

February 11, 2014


FOSDEM is pretty much _the_ European community FOSS event. I've been going on and off for a few years now, but in the last few years, it has had a dedicated Legal devroom, and I really enjoy that aspect of it. I spoke in a short session in the Legal devroom on H264 and Cisco's donation of openh264. I thought that talk went okay, but every time I give a new presentation, I immediately realize 10-20 ways I could have improved it (even if I never give that talk again). Afterwards, someone from Mozilla came over to argue that the Cisco release of openh264 was a net win for FOSS and Linux distros, and I think we had to agree to disagree on that point. His point eventually boiled down to "we're losing users to Chrome, we desperately need openh264 to compete", which is a bit like me saying "Fedora is losing users to other distros, we desperately need non-free software to compete". Ahem.

Anyways, I was also on a panel about Governance in FOSS communities, which I thought went well, even if most of us on the panel were not entirely sure whether we were qualified to speak on that topic. :) Karen Sandler had some good questions, as did the audience, and it was a packed room.

Not to take away from any packed room, but FOSDEM has really really outgrown its venue. The Université libre de Bruxelles is nice, and it is free (or mostly free from what I hear), but 3 out of 4 sessions I'd have liked to see were full before I even had a chance. They need a lot bigger rooms (or more days with repeat sessions).

I also brought a Lulzbot Taz 3 3d printer with me, but because I'm an idiot (and assumed an auto-switching power supply), I cooked the power supply in the first hour. Later, we thought we had a working power supply replacement, but it was a 110V (and the Taz 3 really needs a 230V supply). Thankfully, the Fedorans had brought some Rep Rap printers, so we had 3d printing the whole time, just not on the Taz 3 so much. Lesson learned. Lulzbot donated that Taz 3 (and a replacement power supply) to hackerspace.be.

I had a lot of good hallway discussions with people (there was a larger than normal contingent of US Fedora people around because of devconf.cz, which was a week after FOSDEM, but I opted out this year), and a good sampling of delicious Belgian beer. After FOSDEM, I flew to Prague for two days, to scope out the venues for Flock 2 (Electric Boogaloo).

Daily log February 10th 2014

First day back after taking a week off. Which of course meant being buried alive in email. I deleted a ton of stuff, so if you were expecting a reply from me and didn’t get one, resend.

Started poking at trinity bugs. There’s a case where the watchdog got hung in circumstances where the main process had quit (ironically, it got stuck in the routine that checked if it was still alive). Fixed that up, but still uncertain why that path was ever being taken. The new dropprivs code has shaken out a few more corner cases, such as fork() failures that weren’t being checked for. Hopefully that’s a bit more robust after today’s changes.

Hit a weird irda/lockdep bug. Not sure what’s going on there, and everyone is reluctant to dig into irda too much. Can’t say I blame them tbh, it’s pretty grotty and unmaintained.

Spent some time looking over the recent coverity issues, and dismissed a bunch. Requested a bunch more components for some of the more ‘bug heavy’ parts of the kernel.
I realized I never did a statistics dump for 3.13 like I did for 3.12. I’ll sort that out tomorrow.

Daily log February 10th 2014 is a post from: codemonkey.org.uk

February 08, 2014

Weekly Fedora kernel bug statistics – February 07 2014

  18 19 20 rawhide  
Open: 0 224 261 127 (612)
Opened since 2014-01-31 0 4 17 10 (31)
Closed since 2014-01-31 37 12 11 5 (65)
Changed since 2014-01-31 0 11 53 17 (81)

Weekly Fedora kernel bug statistics – February 07 2014 is a post from: codemonkey.org.uk

February 02, 2014

Monthly Fedora kernel bug statistics – January 2014

  18 19 20 rawhide  
Open: 36 229 258 120 (643)
Opened since 2014-01-01 1 29 102 31 (163)
Closed since 2014-01-01 8 70 49 23 (150)
Changed since 2014-01-01 2 186 168 57 (413)

Monthly Fedora kernel bug statistics – January 2014 is a post from: codemonkey.org.uk

January 31, 2014

Why I can never pass Google interview, part 3

Seen a hilarious blog post about Google interviews, which contains the following gem:

Code in C++ or Java, which shows maturity. If you can only code in Python or bash shell, you're going to have trouble.

(emphasis mine)

Reminds me immediately how Google paid 1.6 Billion dollars for a website coded entirely in Python.

Previously: FizzBuzz.

Weekly Fedora kernel bug statistics – January 31st 2014

  18 19 20 rawhide  
Open: 36 229 256 120 (641)
Opened since 2014-01-24 0 3 17 12 (32)
Closed since 2014-01-24 2 12 5 3 (22)
Changed since 2014-01-24 0 12 38 21 (71)

Weekly Fedora kernel bug statistics – January 31st 2014 is a post from: codemonkey.org.uk


Been grinding through some of the coverity backlog the last few days.
Closed out a lot of non-issues marking them as intentional. A handful of false positives, and sent a few patches for a handful of real bugs. Nothing too scary. Net result: outstanding issues went down from 5141 to 4844. If only I could sustain that rate of closures.

Spent some time trying to dig into a perf bug that I’d been sporadically triggering back when there was an ftrace bug that meant any user could trigger perf events. When that got fixed, the bug went into hiding because only root was able to trigger the code path necessary. But, now that trinity has dropprivs mode, it’s reachable again. Hrmph.

Tried reproducing it with Vince Weaver’s perf_event_test fuzzer. Ended up triggering a different bug instead. Grumble.

Feeling a little burnt out. Taking next week off. Will still do the daily coverity builds, but won’t be doing much else.

DAILY LOG JANUARY 30th 2014 is a post from: codemonkey.org.uk

January 30, 2014

N950 charging WTF

Courtesy of Zeeshan I came into an N950 last year, so that I could make the N9/N950 work great with ModemManager.  Mission accomplished.  And one of the most annoying things about it is that when the battery is charged, it stops charging even though it’s still plugged in.  So, of course, I wake up in the morning and since it’s stopped charging while plugged in, the battery is now down to 75% or even lower.

So I humbly ask the Internet, is there a way to tell the N9/950 not to drain the battery when it’s done charging and still plugged in?

Trinity and ‘root mode’.

Some people have been running trinity as root for a while (thankfully, in virtual machines, doing so on real hardware can end in not so hilarious results, like your bios settings getting screwed to the point you can’t power up until you cover up the CMOS jumper, or your laptop battery no longer returning sensible information over i2c, or a whole slew of worse things).

Meanwhile, those who want to use trinity as an unprivileged user have been unable to fuzz certain aspects of the kernel. For example, creating and binding certain socket types can only be done by a user with CAP_SYS_ADMIN.

So I’ve recently committed some code which allows running trinity as root, dropping privileges before starting the child processes.
This is all very work in progress (and is quite buggy still), and not recommended for anyone to try right now unless you’re interested in debugging. Once things are stable again, I’ll move on to creating certain child processes that only do root-required things, like various ioctls.

Trinity and ‘root mode’. is a post from: codemonkey.org.uk

January 24, 2014

Weekly Fedora kernel bug statistics – January 24th 2014

  18 19 20 rawhide  
Open: 36 235 242 108 (621)
Opened since 2014-01-17 1 3 25 8 (37)
Closed since 2014-01-17 2 4 14 10 (30)
Changed since 2014-01-17 1 18 43 16 (78)

Weekly Fedora kernel bug statistics – January 24th 2014 is a post from: codemonkey.org.uk


  • morning of phone calls. bleh.
  • afternoon of git bisect. bleh. After spending a while trying to debug a hang during early boot I gave up debugging and tried the brute force method. Nothing conclusive. In fact, after completing the unsuccessful bisect, I can no longer reproduce the bug at all. Crap day.

DAILY LOG JANUARY 23rd 2014 is a post from: codemonkey.org.uk

January 23, 2014


DAILY LOG JANUARY 22nd 2014 is a post from: codemonkey.org.uk

January 22, 2014


  • Now that 3.13 is out, and the merge window is open, things are starting to get interesting again with the coverity runs.
    The current trend indicates that with each point release, we seem to be 200 issues lower overall.

    Ver outstanding issues Defect density
    3.11 5425 0.67
    3.12 5215 0.62
    3.13 5096 0.59

    Looks like we’ll soon be below 5000 for the first time since I started running the daily scans.
    (There’s still a load of known false positives/intentional issues, so the actual bug count is less already).

    Since the merge window opened, 64 new issues have been detected so far. Not bad considering that includes some of the worst offenders (like drivers/staging/). Additionally 38 issues got fixed.

  • Some trinity enhancements today/yesterday.
    • munmap now sometimes just unmaps part of a mapping. No bugs found so far.
    • Some updates for new sched syscalls coming in 3.14
    • Added some missing flags to madvise()
    • Vince Weaver fixed up a divide by zero (for real this time).
    • Jiri Slaby fixed a double free in modify_ldt()

DAILY LOG JANUARY 21st 2014 is a post from: codemonkey.org.uk

January 21, 2014

OpenStack and a Core Developer

Real quick: you know how BSDs were supposed to have "core bit" for "core committers"? If one was "on core", he could issue "cvs commit". Everyone else had to e-mail a patch to one of the core guys. One problem with that setup was that people are weak. A core guy could be severely tempted to commit something that was not rigorously tested or otherwise questionable.

OpenStack addresses this problem by not actually letting "core" people commit anything. I'm on core for Swift, but I cannot do "git push" to master. I can only do "git review" and then ask someone else to approve it. Of course, this is easy to attack if one wants to. For example, committers can enter into a conspiracy and buddy-approve. It's possible. But at least we're somewhat protected against a late night commit before a deadline by a guy who was hurried by problems at work.

January 17, 2014

Weekly Fedora kernel bug statistics – January 17th 2014

  18 19 20 rawhide  
Open: 36 235 226 100 (597)
Opened since 2014-01-10 0 12 30 8 (50)
Closed since 2014-01-10 1 11 12 10 (34)
Changed since 2014-01-10 1 29 51 12 (93)

Weekly Fedora kernel bug statistics – January 17th 2014 is a post from: codemonkey.org.uk

January 10, 2014


Started getting a damn cold again, making it hard to think clearly most of the day.

Still managed to grind through a bunch of cleanups in trinity.
The rewrite of the logging code a few months ago kind of blew up in terms of complexity, leading to things like functions with 8 arguments. Spent a while cleaning that up, and reduced it to 5 args.

Later I decided to shorten a lot of code by introducing a bunch of local pointers to the syscall structs. After I was feeling pleased with myself, I got an email telling me I had broken the compile on older versions of gcc. Turned out that having a struct named ‘syscall’ and a function named syscall() was a bad idea.
Newer gcc can disambiguate between the two, but older versions freak out when compiled with -Wshadow.

Which led to the great syscall struct renaming of 2014.

And that was pretty much all I managed to get my head around today. Even sudafed couldn’t make me more productive.

DAILY LOG JANUARY 10TH 2014 is a post from: codemonkey.org.uk

Weekly Fedora kernel bug statistics – January 10th 2014

  18 19 20 rawhide  
Open: 37 236 206 99 (578)
Opened since 2014-01-03 0 11 27 3 (41)
Closed since 2014-01-03 4 42 14 2 (62)
Changed since 2014-01-03 1 174 88 22 (285)

Weekly Fedora kernel bug statistics – January 10th 2014 is a post from: codemonkey.org.uk


Started the day with a WARN_ON in perf.

More trinity work. Every time I made a start on writing new code I found myself adding to my TODO without making much forward progress. Still, a plan is forming.

Started reworking the mapping code a little to make it easier to implement some new things.
Then spent the afternoon doing some more improvements to the network socket code (unpushed changes, should finish tomorrow).

Fixed up some boring stuff that cppcheck found.

Found some time to play with PeterZ’s new lockdep feature, and quickly found some breakage.

Ended the day with an oops in RDS.

DAILY LOG JANUARY 9TH 2014 is a post from: codemonkey.org.uk

talk from LCA2014 - Virtio GPU

Talked yesterday about the virgil project, a virtio based GPU, and where it’s going.

watch it here,

January 09, 2014


Starting to get back into the swing of things.
A bunch of trinity changes today.

  • Now ARG_ADDRESS sometimes passes a page of ptrs to shared mmaps.
  • The “fault in some pages” code got a few new access patterns.
  • Children now do one less getuid() call per iteration. getuid is a pretty fast syscall, but still eliminating it makes traces look a bit nicer.
  • Some fixes to the socket creation. Now passing -P with impossible arguments gets trapped.
  • The “only do network related syscalls” code got some cleanup, and now picks every syscall that uses an fd, or sockaddr.
  • Now when something wants an IP address (say, a sockaddr), we cache the last one picked and re-use it five times.
  • A bunch of other small bug fixes and cleanups.
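
The address caching mentioned in the list above is simple enough to sketch. This is an illustrative Python version, not trinity's C code. The class name and the injectable picker are made up, and it assumes "re-use it five times" means each picked address is handed out five times in total.

```python
import random
import socket
import struct


def random_ipv4(rng=random):
    """Pick a random IPv4 address as a dotted-quad string."""
    return socket.inet_ntoa(struct.pack("!I", rng.getrandbits(32)))


class CachedAddress(object):
    """Hand out the same address `uses` times before picking a new one."""

    def __init__(self, picker=random_ipv4, uses=5):
        self.picker = picker
        self.uses = uses
        self._current = None
        self._remaining = 0

    def next(self):
        if self._remaining == 0:
            # Cache exhausted: pick a fresh address.
            self._current = self.picker()
            self._remaining = self.uses
        self._remaining -= 1
        return self._current


if __name__ == "__main__":
    cache = CachedAddress()
    print([cache.next() for _ in range(6)])
```

Re-using the address means consecutive calls (a bind followed by a connect, say) have a chance of referring to the same endpoint instead of always talking past each other.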

Started thinking about logging structs that trinity creates. Right now, we just see an address in the logs, which isn’t particularly helpful when we want to know what the struct members were. Should be trivial to hack up, so will likely do that tomorrow.

DAILY LOG JANUARY 8TH 2014 is a post from: codemonkey.org.uk

January 08, 2014


Coverity scans are happening again. The machine I was doing them on still isn’t recovered, so they’re taking a little longer to run, but they’re at least going to happen regularly again.

Been doing some reading on various oddball network protocols. Been figuring out some plans for enhanced fuzzing there in trinity.
At the same time, spent some time looking at old VM/FS bugs, and trying to figure out ways trinity could have triggered them.

DAILY LOG JANUARY 7TH 2014 is a post from: codemonkey.org.uk

January 07, 2014

Daily log January 6th 2014

First day back at work after a much needed break.
As expected, buried alive in email, so skimmed through the interesting parts of that, and marked pretty much everything else as read.
Lots of the day spent updating kernel trees to Linus’ latest. Surprisingly, nothing broke, which is a good thing given we’re at rc7. Later in the day, one fuzz testing box turned itself off spontaneously. Unsure if kernel bug, or first hardware death of the year.

Right before the break, I remotely started a fedup upgrade of the box that I do the daily coverity builds on.
It didn’t survive the reboot. So until I get that fixed up, the scans won’t be updating. Hopefully I’ll get on top of it soon.

Daily log January 6th 2014 is a post from: codemonkey.org.uk

January 01, 2014

import gdb

Occasionally I see questions about how to import gdb from the ordinary Python interpreter.  This turns out to be surprisingly easy to implement.

First, a detour into PIE and symbol visibility.

“PIE” stands for “Position Independent Executable”.  It uses essentially the same approach as a shared library, except it can be applied to the executable.  You can easily build a PIE by compiling the objects with the -fPIE flag, and then linking the resulting executable with -pie.  Normally PIEs are used as a security feature, but in our case we’re going to compile gdb this way so we can have Python dlopen it, following the usual Python approach: we install it as _gdb.so and add a module initialization function, init_gdb. (We actually name the module “_gdb”, because that is what the gdb C code creates; the “gdb” module itself is already plain Python that happens to “import _gdb”.)

Why install the PIE rather than make a true shared library?  It is just more convenient — it doesn’t require a lot of configure and Makefile hacking, and it doesn’t slow down the build by forcing us to link gdb against a new library.

Next, what about all those functions in gdb?  There are thousands of them… won’t they possibly cause conflicts at dlopen time?  Why yes… but that’s why we have symbol visibility.  Symbol visibility is an ELF feature that lets us hide all of gdb’s symbols from any dlopen caller.  In fact, I found out during this process that you can even hide main, as ld.so seems to ignore visibility bits for this function.

Making this work is as simple as adding -fvisibility=hidden to our CFLAGS, and then marking our Python module initialization function with __attribute__((visibility("default"))).  Two notes here.  First, it’s odd that “default” means “public”; just one of those mysterious details.  Second, Python’s PyMODINIT_FUNC macro ought to do this already, but it doesn’t; there’s a Python bug.

Those are the low-level mechanics.  At this point gdb is a library, albeit an unusual one that has a single entry point.  After this I needed a few tweaks to gdb’s startup process in order to make it work smoothly.  This too was no big deal.  Now I can write scripts from Python to do gdb things:

import gdb
gdb.execute('file ./install/bin/gdb')
print('sizeof = %d' % gdb.lookup_type('struct minimal_symbol').sizeof)


$ python zz.py

Soon I’ll polish all the patches and submit this upstream.

Glie, eventlet, and Python 3

Speaking of the Python 3 debacle, I honestly meant to do Glie in py3, but I wanted a built-in webserver and used eventlet for it. But eventlet is Python 2.x only, isn't it? What's a decent embeddable mini webserver with WSGI interface for py3?

December 31, 2013

Hello, Glie

Although it was mentioned obliquely before, Glie now exists. It receives ADS-B data from an RTL-SDR in the 1090ES band and produces an image. I can hit refresh in the browser and watch airplanes coming in to land at a nearby airport in real time.

This stuff is hardly groundbreaking. Many such programs exist; some are quite sophisticated in interfacing with various mapping, geography, schedule, and airframe information services, as well as in their UI. This one is mainly different because it's my toy.

Actually, the general aim of this project is also different, because unlike most stuff out there, it is not meant to be a surveillance tool, but to provide a traffic awareness readout. No persistent database of any kind is involved. No map either. Instead, I'm going to focus on onboard features, such as relative motion history (so one can easily identify targets on a collision course).
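
As a rough illustration of why relative motion matters, here is a sketch of the textbook closest-point-of-approach calculation. This is not what Glie does (it only draws a history trail); the function name, the flat 2-D geometry, and the units (nautical miles and knots) are all assumptions made for the example.

```python
import math


def closest_approach(rel_pos, rel_vel):
    """Return (time_hours, distance_nm) of the closest point of approach.

    rel_pos: (x, y) of the target relative to own ship, in nautical miles.
    rel_vel: (vx, vy) of the target relative to own ship, in knots.
    Both assume straight-line, constant-speed motion on a flat plane.
    """
    px, py = rel_pos
    vx, vy = rel_vel
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # No relative motion: the separation never changes.
        return 0.0, math.hypot(px, py)
    # Time at which the relative position vector is shortest.
    t = -(px * vx + py * vy) / speed_sq
    # A negative time means the closest approach is already behind us.
    t = max(t, 0.0)
    return t, math.hypot(px + vx * t, py + vy * t)


# Target 10 nm to the east and 5 nm north, closing from the east at 100 knots.
t, d = closest_approach((10.0, 5.0), (-100.0, 0.0))
print("closest approach in %.2f h at %.1f nm" % (t, d))
```

A target whose relative track points straight at the center of the display has a miss distance near zero, which is exactly what a relative motion trail makes visible at a glance.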

But mostly, it's for fun and education. And already Glie is facing a few technical challenges:

  • The best orientation for onboard display is "nose-up" (obviously). However, one can only derive a "track-up" orientation from a GPS. This is obviously wrong in the case of Glie flying on a helicopter that can fly sideways. It is less obviously wrong in the case of a crosswind, but can create a significant distortion. To get the nose direction, I have to acquire a compass readout, which seems quite challenging. It's not like common airplanes have AHRS sockets under panels.
  • I really want the graphics anti-aliased for better looks and finer precision, but I have no clue how to accomplish it. The standard TIS-B symbology essentially requires it, too, so I'm stuck with ad-hoc diamonds.
  • The darn FAA split ADS-B in the U.S. into two bands: 1090ES and UAT. The 1090 is a solved problem in RTL-SDR space (although performance of receivers is not excellent and I'm thinking about building semi-hardware receivers in the future). However, UAT has a 1 Mbit/s data rate and RTL-SDR falls flat on its face when trying to deal with it. I probably need something like an Ettus GNU Radio receiver, which is expensive - $800 and up, last I checked. Unfortunately, it increasingly looks like all the interesting traffic is going to follow the UAT route, come 2020 A.D. (the year of the ADS-B mandate).
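
To put a number on the track-up versus nose-up distortion from the first item, here is a sketch of the standard wind triangle. It is an illustration only, not Glie code: the function names are invented, and it assumes the true airspeed and the wind are known, which is precisely the information a GPS alone does not provide.

```python
import math


def crab_angle_deg(track_deg, tas_kt, wind_from_deg, wind_kt):
    """Return the angle between heading and ground track, in degrees.

    Positive means the nose points to the right of the track.
    wind_from_deg is the direction the wind blows from; tas_kt must be > 0.
    """
    ratio = wind_kt / tas_kt * math.sin(math.radians(wind_from_deg - track_deg))
    if abs(ratio) > 1.0:
        raise ValueError("wind too strong to hold this track")
    return math.degrees(math.asin(ratio))


def track_up_to_nose_up(bearing_from_track_deg, crab_deg):
    """Convert a bearing measured from the ground track to one off the nose."""
    return (bearing_from_track_deg - crab_deg) % 360.0


# 100-knot airplane tracking north with a 20-knot wind from due east.
crab = crab_angle_deg(track_deg=0.0, tas_kt=100.0, wind_from_deg=90.0, wind_kt=20.0)
print("crab angle: %.1f degrees" % crab)
print("target on the track line is %.1f degrees off the nose"
      % track_up_to_nose_up(0.0, crab))
```

For this case the crab angle comes out to about 11.5 degrees, so a target that a track-up display draws dead ahead is really well to the left of the nose.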

For now I'm going to take it easy and play with what I have, aiming for some kind of a portable system. Perhaps someone will develop an open source UAT receiver on a practical platform in the meantime.

December 20, 2013

Trinity 1.3 release.

I just tagged and pushed out a 1.3 tarball for trinity.
Most people who use it will likely be staying on the bleeding edge running latest git, but given it’s currently finding all kinds of interesting bugs in the VM, this seems to be a good point to tag a release, for people chasing those bugs to have something stable they can run.

It should also build on older distributions, which might be interesting for people wanting to test enterprise distributions. (I tend to break the build in git on a semi-regular basis, because I’m only building/testing on current Fedora).

In the new year, I’ve got a bunch of things I want to investigate adding/enhancing in this VM related code, as well as some interesting ideas for the networking code.

Trinity 1.3 release. is a post from: codemonkey.org.uk