|  |  |  |  | (total) |
|---|---|---|---|---|
| Opened since 2014-04-11 | 3 | 14 | 8 | (25) |
| Closed since 2014-04-11 | 6 | 9 | 6 | (21) |
| Changed since 2014-04-11 | 6 | 29 | 14 | (49) |
Added some code to trinity to use random open flags on the fds it opens on startup.
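The idea, sketched in Python (trinity itself is C, and its real flag list and mixing logic are more involved than this):

```python
import os
import random

# Base access mode, plus a random sprinkling of modifier flags.
ACCESS_MODES = [os.O_RDONLY, os.O_WRONLY, os.O_RDWR]
MODIFIERS = [os.O_NONBLOCK, os.O_APPEND, os.O_SYNC]
# Linux-only flags, if this build of Python knows about them.
MODIFIERS += [getattr(os, n) for n in ("O_DIRECT", "O_NOATIME") if hasattr(os, n)]

def random_open(path):
    flags = random.choice(ACCESS_MODES)
    for f in MODIFIERS:
        if random.random() < 0.5:
            flags |= f
    # Some combinations will fail with EINVAL; for a fuzzer, that's fine too.
    return os.open(path, flags)
```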
Spent most of the day hitting the same VM bugs as yesterday, or others that Sasha had already reported.
Later in the day, I started seeing this bug after applying a not-yet-merged patch to fix a leak that Coverity had picked up on recently. Spent some time looking into that, without making much progress.
Rounded out the day by trying out latest builds on my freshly reinstalled laptop, and walked into this.
Spent all of yesterday attempting recovery (and failing) on the /home partition of my laptop.
On the weekend, I decided I’d unsuspend it to send an email, and just got a locked-up desktop. The disk IO light was stuck on, but it was completely dead to input; I couldn’t switch to a console. Powered it off and back on. XFS wanted me to run xfs_repair. So I did. It complained that there was pending metadata in the log, and that I should mount the partition to replay it first. I tried. It failed miserably, so I re-ran xfs_repair with -L to zero the log. Pages and pages of scrolly text zoomed up the screen.
Then I rebooted and.. couldn’t log in any more. Investigating as root showed that /home/davej was now /home/lost+found, and within it were a couple dozen numbered directories containing mostly uninteresting files.
So that’s the story about how I came to lose pretty much everything I’ve written in the last month that I hadn’t already pushed to github. I’m still not entirely sure what happened, but I point the finger of blame more at dm-crypt than at xfs at this point, because the non-encrypted partitions were fine.
Ultimately I gave up, reformatted and reinstalled. Kind of a waste of a day (and a half).
Things haven’t been entirely uneventful though:
So there’s still some fun VM/FS horrors lurking. Sasha has been hitting a bunch more huge page bugs too. It never ends.
DisplayPort 1.2 Multi-stream Transport is a feature that allows daisy chaining of DP devices that support MST into all kinds of wonderful networks of devices. It's been on the TODO list for many developers for a while, but the hw never quite materialised near a developer.
At the start of the week my local Red Hat IT guy asked me if I knew anything about DP MST. It turns out the Lenovo T440s and T540s docks have started to use DP MST: there is one DP port to the dock, and the dock has DP->VGA, DP->DVI/DP, and DP->HDMI/DP ports on it, all using MST. So when they bought some of these laptops and plugged two monitors into the dock, it fell back to SST mode and only showed one image. This is not optimal, I'd call it a bug :)
Now I have a damaged-in-transit T440s (the display panel is in pieces) with a dock, and have spent a couple of days with the DP 1.2 spec in one hand (monitor), and a lot of my hair in the other. DP MST has a network topology discovery process built on sideband msgs sent over the auxch, which in normal DP is used to read/write a bunch of registers on the plugged-in device. You can then send auxch msgs over the sideband msgs over the auxch to read/write registers on other devices in the hierarchy!
Today I achieved my first goal of correctly encoding the topology discovery message and getting a response from the dock:
```
[ 2909.990743] link address reply: 4
[ 2909.990745] port 0: input 1, pdt: 1, pn: 0
[ 2909.990746] port 1: input 0, pdt: 4, pn: 1
[ 2909.990747] port 2: input 0, pdt: 0, pn: 2
[ 2909.990748] port 3: input 0, pdt: 4, pn: 3
```
There are a lot more steps to take before I can produce anything usable, and that, along with dealing with the fact that KMS doesn't handle dynamic connectors so well, should make for a fun tangent away from the job I should be doing, which is finishing virgil.
I've ordered another DP MST hub that I can plug into AMD and nvidia GPUs, which should prove useful later, also for building deeper topologies and producing loops.
Some 4k monitors also use DP MST, as they are really two panels, but I don't have one of those, so unless one appears I'm mostly going to concentrate on the Lenovo docks for now.
The big thing that stands out this cycle is that the defect ratio was going down until we hit around 3.14-rc7, and then we got a few hundred new issues. What happened?
Nothing in the kernel thankfully. This was due to an upgrade server side to a new version of Coverity which has some new checkers. Some of the existing ones got improved too, so a bunch of false positives we had sitting around in the database are no longer reported. The number of new issues unfortunately was greater than the known false positives. In the days following, I did a first sweep through these and closed out the easy ones, bringing the defect density back down.
Note: I stopped logging the ‘dismissed’ totals. With Coverity 7.0, the number can go backwards: if a file gets deleted, the dismissed issues against that file also disappear. Given this happens fairly frequently, the number isn’t really indicative of anything useful.
With the 3.15 merge window now open, I’m hoping a bunch of the queued fixes I sent over the last few weeks get merged, but I’m fully expecting to need to do some resending.
(It was actually worse than this: the ratio went back up to 0.57 right before rc7.)
It’s been a busy week.
A week ago I flew out to Napa, CA for two days of discussions with various kernel people (ok, and some postgresql people too) about all things VM and FS/IO related. I learned a lot. These short, focussed conferences have way more value to me these days than the conferences of years ago, with a bunch of tracks and day after day of presentations.
I gave two sessions relating to testing; there are some good write-ups on lwn. It was more of an extended Q&A than a presentation, so I got a lot of useful feedback (especially afterwards in the hallway sessions). A couple of people asked if trinity was doing certain things yet, which led to some code walkthroughs, and a lot of brainstorming about potential solutions.
By the end of the week I was overflowing with ideas for new things it could be doing, and have started on some of the code for this already. One feature I’d had in mind for a while (children doing root operations) but hadn’t gotten around to writing could be done in a much simpler way, which opens the doors to a bunch more interesting things. I might end up rewriting the current ioctl fuzzing (which isn’t finding a huge amount of bugs right now anyway) once this stuff has landed, because I think it could be doing much more ‘targeted’ things.
It was good to meet up with a bunch of people that I’ve interacted with for a while online and discuss some things. Was surprised to learn Sasha Levin is actually local to me, yet we both had to fly 3000 miles to meet.
Two sessions at LSF/MM were especially interesting outside of my usual work.
The postgresql session where they laid out their pain points with the kernel IO was enlightening, as they started off with a quick overview of postgresql’s process model, and how things interact. The session felt like it went off in a bunch of random directions at once, but the end goal (getting a test case kernel devs can run without needing a full postgresql setup) seemed to be reached the following day.
The second session I found interesting was the “Facebook linux problems” session. As mentioned in the lwn write-up, one of the issues was this race in the pipe code. “This is *very* hard to trigger in practice, since the race window is very small”. Facebook were hitting it 500 times a day. It gave me thoughts on a whole bunch of “testing at scale” problems. A lot of the testing I do right now is tiny in comparison: I do stress tests & fuzz runs on a handful of machines, and most of it is done by hand. Doing this kind of thing on a bigger scale makes it a little impractical to do in a non-automated way. But given I’ve been buried alive in bugs with just this small number of machines, it has left me wondering: would I find a load more bugs with more machines, or would the mean time between reproducing issues just get shorter? (Given the reproducibility problems I’ve had with fuzz testing sometimes, the latter wouldn’t necessarily be a bad thing.) More good thoughts on this topic can be found in a post Google made a few years ago.
Coincidentally, I’m almost through reading How Google Tests Software, which is a decent book, but without a huge amount of “this is useful, I can apply this” type knowledge. It’s very focussed on the testing of various web apps, with no real mention of the testing of Android, Chrome etc. (The biggest insights in the book aren’t actually testing related, but rather the descriptions of Google’s internal re-hiring processes when people move between teams.)
Collaboration Summit followed from Wednesday onwards. One highlight for me was learning that the tracing code has something coming in 3.15/3.16 that I’ve been hoping for for a while. At last year’s kernel summit, Andi Kleen suggested it might be interesting if trinity had some interaction with ftrace to get traces of “what the hell just happened”. The tracing changes landing over the next few months will allow that to be a bit more useful. Right now, we can only do that on a global, system-wide basis, but with that moving to be per-process, things can get a lot more useful.
Another interesting talk was the llvmlinux session. I haven’t checked in on this project in a while, so was surprised to learn how far along they are. Apparently all the necessary llvm changes to build the kernel are either merged, or very close to merging. The kernel changes still have a ways to go, but this too has improved a lot since I last looked. Some good discussion afterwards about the crossover between things like clang’s static analysis warnings and the stuff I’m doing with Coverity.
Speaking of, I left early on Friday to head back to San Francisco to meet up with Coverity. Lots of good discussion about potential workflow improvements, false positive/heuristic improvements etc. A good first meeting if only to put faces to names I’ve been dealing with for the last year. I bugged them about a feature request I’ve had for a while (that a few people the days preceding had also nagged me about); the ability to have per-subsystem notification emails instead of the one global email. If they can hook this up, it’ll save me a lot of time having to manually craft mails to maintainers when new issues are detected.
Busy, busy week, with so many new ideas that I felt like my head was full by the time I got on the plane to head back.
Taking it easy for a day or two, before trying to make progress on some of the things I made notes on last week.
[Attention conservation notice: probably not of interest to lawyers; this is about my previous life in software development.]
Someone recently mentioned JWZ’s old post on the CADT (Cascade of Attention Deficit Teenagers) development model, and that has finally pushed me to say:
I am the CADT.
I did the bug closure that triggered Jamie’s rant, and I wrote the text he quotes in his blog post.
Jamie got some things right, and some things wrong. The main thing he got right is that it is entirely possible to get into a cycle where instead of seriously trying to fix bugs, you just do a rewrite and cross your fingers that it fixes old bugs. And yes, this can particularly happen when you’re young and writing code for fun, where the joy of a from-scratch rewrite can overwhelm some of your other good senses. Jamie also got right that I communicated the issue pretty poorly. Consider this post a belated explanation (as well as a reference for the next time I see someone refer to CADT).
But that wasn’t what GNOME was doing when Jamie complained about it, and I doubt it is actually something that happens very often in any project large enough to have a large bug tracking system (BTS). So what were we doing?
First, as Brendan Eich has pointed out, sometimes a rewrite really is a good idea. GNOME 2 was such a rewrite – not only was a lot of the old code a hairy mess, we decided (correctly) to radically revise the old UI. So in that sense, the rewrite was not a “CADT” decision – the core bugs being fixed were the kinds of bugs that could only be fixed with massive, non-incremental change, rather than “hey, we got bored with the old code”. (Immediately afterwards, GNOME switched to time-based releases, and stuck to that schedule for the better part of a decade, which should be further proof we weren’t cascading.)
This meant there were several thousand old bugs that had been filed against UIs that no longer existed, and often against code that no longer existed or had been radically rewritten. So you’ve got new code and old bugs. What do you do with the old bugs?
It is important to know that open bugs in a BTS are not free. Old bugs impose a cost on developers, because when they are trying to search relevant bugs, old bugs can make it harder to find the things they really should be working on. In the best case, this slows them down; in the worst case, it drives them to use other tools to track the work they want to do – making the BTS next to useless. This violates rule #1 of a BTS: it must be useful for developers, or else it all falls apart.
So why did we choose to reduce these costs by closing bugs filed against the old codebase as NEEDINFO (and asking people to reopen if they were still relevant) instead of re-testing and re-triaging them one-by-one, as Jamie would have suggested? A few reasons:
So when isn’t it a good idea to close old bugs and ask for more information?
Relatedly, the team practices mailing list has been discussing good practices for migrating bug tracking systems in the past few days, which has been interesting to follow. I don’t take a strong position on where Wikimedia’s bugzilla falls on this point – Mediawiki has a fairly stable core, and the volume of incoming bugs may make triage of old bugs more plausible. But everyone running a very large bugzilla for an active project should remember that this is a part of their toolkit.
For years, I did my best to ignore the problem, but CKS inspired me to blog the curious networking banality, in case anyone has wisdom to share.
The deal is simple: I have a laptop with a VPN client (I use vpnc). The client creates a tun0 interface and some RFC 1918 routes. My home RFC 1918 routes are more specific, so routing works great. The name service does not.
Obviously, if we trust the DHCP-supplied nameserver, it has no work-internal names in it. The stock solution is to let vpnc install an /etc/resolv.conf pointing to work-internal nameservers. Unfortunately this does not work for me, because I have a home DNS zone, zaitcev.lan. Work-internal DNS does not know about that one.
Thus I would like some kind of solution that routes DNS requests according to a configuration. Requests for work-internal namespaces (such as *.redhat.com) would go to nameservers delivered by vpnc (I think I can make it write something like /etc/vpnc/resolv.conf that does not conflict). Other requests go to the infrastructure name service, be it a hotel network or the home network. The home network is capable of serving its own private authoritative zones and forwarding the rest. That's the ideal, so how to accomplish it?
I attempted to apply a local dnsmasq, but could not figure out if it can do what I want and, if yes, how.
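For reference, what I was trying to express would look something like this in dnsmasq's server=/domain/address syntax, if that indeed does what it appears to do (addresses made up, and I have not verified the behavior):

```
# /etc/dnsmasq.conf -- a sketch, untested; addresses hypothetical
# Work-internal names go to the nameserver vpnc delivered:
server=/redhat.com/10.0.0.53
# The home zone goes to the home router:
server=/zaitcev.lan/192.168.1.1
# Everything else falls through to the normal upstream servers.
```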
For now, I have some scripting that caches work-internal hostnames in /etc/hosts. That works, somewhat. Still, I cannot imagine that nobody thought of this problem. Surely, thousands are on VPNs, and some of them have home networks. And... nobody? (I know that a few people just run VPN on the home infrastructure; that does not help my laptop, unfortunately).
Enabling OpenGL 3.3 on radeonsi required some patches backported to llvm 3.4. I managed to get some time to do this, and rebuilt mesa against the new llvm, so if you have an AMD GPU supported by radeonsi you should now see GL 3.3.
For F20 this isn't an option, as backporting llvm is a bit tricky there, though I'm considering doing a copr that has a private llvm build in it; it might screw up some apps, but for most use cases it might be fine.
High (low?) point of the day was taking delivery of my new remote power switch.
You know that weird PCB chemical smell some new electronics have? Once I got this thing out of the box it smelled so strong I almost gagged. I let it ‘air’ for a little while, hoping it would dissipate. It didn’t. Or if it did, I couldn’t tell, because now the whole room stunk. Then I made the decision to plug it in anyway. Within about a minute the smell went away. Well, not so much “went away”. More like, “was replaced with the smell of burning electronics”.
So that’s another fun “hardware destroys itself as soon as I get a hold of it” story, and yet another Amazon return.
(And I guess I’m done with ‘digital loggers’ products).
In non hardware-almost-burning-down-my-house news:
I’ve been trying to chase down the VM crashes I’ve been seeing. I’ve managed to find ways to reproduce some of them a little faster, but not really getting any resolution so far. Hacked up a script to run a subset of random VM related syscalls. (Trinity already has ‘-g vm’, but I wanted something even more fine-grained). Within minutes I was picking up VM traces that so far I’ve only seen Sasha reporting on -next.
A rant in ;login was making the rounds recently (h/t @jgarzik), which I thought was not all that relevant... until I remembered that Swift Power Calculator has mysteriously stopped working for me. Its creator is powerless to do anything about it, and so am I.
So, it's relevant all right. We're in big trouble even if Gmail kind of works most of the time. But the rant makes no recommendations, only observations. So it's quite unsatisfying.
BTW, it reminds me of a famous preso by Jeff Mogul, "What's wrong with HTTP and why it does not matter". Except Mogul's rant was more to the point. They don't make engineers like they used to, apparently. Also notably, I think, Mogul prompted development of RESTful improvements. But there's nothing we can do about the excessive thickness of our stacks (that I can see). It's just spiralling out of control.
Been hitting a number of VM related bugs the last few days.
The first bug is the one that concerns me most right now, though the 2nd is feasibly something that some non-fuzzer workloads may hit too. Other than these bugs, 3.14-rc6 is working pretty well for me.
I suppose everyone has to pass through a hardware phase, and mine is now, for which I implemented an LED blinker with an ATtiny2313. I don't think it even merits the usual blog laydown. Basically all it took was following tutorials to the letter.
For the initial project, I figured that learning gEDA would take too long, so I unleashed an inner hipster and used Fritzing. Hey, it allows you to plan breadboards, so there. And it was a learning experience and no mistake. Crashes, impossible-to-undo changes, UI elements outside of the screen, everything. Black magic everywhere: I could never figure out how to merge wires, dedicate a ground wire/plane, or edit labels (so all of them are incorrect in the schematic above). The biggest problem was the lack of library support together with an awful parts editor. Editing schematics in Inkscape was so painful that I resigned myself to doing a piss-poor job, evident in all the crooked lines around the ATtiny2313. I understand that Fritzing's main focus is the iPad, but this is just at the level of a typical outsourced Windows application.
Inkscape deserves a special mention due to the way Fritzing requires SVG files to be in a particular format. If you load and edit some of those, the grouping defeats Inkscape features, so one cannot even select elements at times. And editing the raw XML causes the weirdest effects, so it's not like LyX-on-TeX, edit and visualize. At least our flagship vector graphics package didn't crash.
The avr-gcc is awesome though. 100% turnkey: yum install and you're done. Same for avrdude. No muss, no fuss, everything works.
Looking at a review by Solly today, I saw something deeply disturbing. A simplified version that I tested follows:
```python
import unittest

class Context(object):
    def __init__(self):
        self.func = None

    def kill(self):
        self.func(31)

class TextGuruMeditationMock(object):
    # The .run() normally is implemented in the report.Text.
    def run(self):
        return "Guru Meditation Example"

    @classmethod
    def setup_autorun(cls, ctx, dump_with=None):
        ctx.func = lambda *args: cls.handle_signal(dump_with, *args)

    @classmethod
    def handle_signal(cls, dump_func, *args):
        try:
            res = cls().run()
        except Exception:
            dump_func("Unable to run")
        else:
            dump_func(res)

class TestSomething(unittest.TestCase):
    def test_dump_with(self):
        ctx = Context()

        class Writr(object):
            def __init__(self):
                self.res = ''

            def go(self, out):
                self.res += out

        target = Writr()
        TextGuruMeditationMock.setup_autorun(ctx, dump_with=target.go)
        ctx.kill()
        self.assertIn('Guru Meditation', target.res)
```
Okay, obviously we're setting a signal handler, which is a little lambda, which invokes dump_with, which ... is a class method? How does it receive its self?!
I guess that the deep Python magic occurs in how the method target.go is prepared to become an argument. The only explanation I see is that Python creates some kind of activation record for this, which includes the instance (target) and the method, and that record is the object being passed down as dump_with. I knew that Python did it for scoped functions, where we have global dict, local dict, and all that good stuff. But this is different, isn't it? How does it even know that target.go belongs to target? In what part of the Python spec is it described?
UPDATE: Commenters provided hints with the key idea being a "bound method" (a kind of user-defined method).
A user-defined method object combines a class, a class instance (or None) and any callable object (normally a user-defined function).
When a user-defined method object is created by retrieving a user-defined function object from a class, its im_self attribute is None and the method object is said to be unbound. When one is created by retrieving a user-defined function object from a class via one of its instances, its im_self attribute is the instance, and the method object is said to be bound.
Thanks, Josh et al.!
UPDATE: See also Chris' explanation and Peter Donis' comment re. unbound methods gone from py3.
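A quick demonstration of what the commenters meant (the attribute is spelled im_self in py2 and __self__ in py3; the latter works in both):

```python
class Writr(object):
    def __init__(self):
        self.res = ''

    def go(self, out):
        self.res += out

target = Writr()
m = target.go                  # retrieving via the instance creates a *bound* method
print(m.__self__ is target)    # True: the instance rides along inside the method object
m("hello")                     # no explicit self needed; Python supplies target
print(target.res)              # 'hello'
```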
Spent some time chasing down what looks like a race condition in the watchdog code in trinity.
The symptom was a crash on x86-64, where it would try to decode a 32-bit syscall using the 64-bit syscall table. This segfaulted, because the 64-bit table is shorter. I stared at the code for quite a while, and adding debugging printfs at the crash site made the bug disappear. What I think was happening was that the child processes were updating two separate variables (one, a bool that says whether we’re doing 32- or 64-bit calls; and two, the syscall number), and the watchdog code was reading them in the middle of an update. I added some locking code to make sure we don’t read either value before an update is complete.
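The shape of the fix, sketched in Python (trinity is C and shares these values between processes, but the principle is the same: the pair is only ever updated and read under one lock):

```python
import threading

lock = threading.Lock()
# Two fields that must always be observed as a consistent pair.
syscall_state = {"do32bit": False, "nr": 0}

def child_update(do32bit, nr):
    with lock:                       # update both fields atomically
        syscall_state["do32bit"] = do32bit
        syscall_state["nr"] = nr

def watchdog_snapshot():
    with lock:                       # never see a half-finished update
        return syscall_state["do32bit"], syscall_state["nr"]
```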
I’ve not managed to reproduce the bug since, so I’m really hoping I got it right.
We’ve recently binged on making NetworkManager work better in more places, mostly enterprise and virtualization related. But one thing we’ve wanted to do for a long time was make things more modular. And that got landed this week via the dev-plugins branch, which makes ATM, Bluetooth, and WWAN into shared libraries loaded optionally at startup.
Distro packagers can now create separate NetworkManager-atm, NetworkManager-bluetooth, and NetworkManager-wwan packages, each with their own dependencies, while the NetworkManager core goes forward a slimmer, smaller, more efficient version of its previous self. If you’re installing NetworkManager into minimal environments, you can just ignore or remove these plugins and revel in your newfound minimalism.
The core NM binary is now about 15% smaller, and there’s a corresponding 7.5% RSS reduction at runtime when no plugins are loaded. What’s next? Possibly WiFi, which would save about 6 – 8% of the core binary size.
Last year I hacked up a small shell script to test various IO related things like “create a RAID5 array, put an XFS file system on it, create a bunch of files on it”.
Despite its crudeness, it ended up finding a bunch of kernel bugs. Unfortunately many of them were not easily reproducible, and required hours of runtime. There were also some problems with scaling the tests. Every time I wanted to add another test, or another filesystem, the overall runtime grew dramatically. Before my test box with 4 SATA disks died, it would take over 3 hours for a single run.
So I’ve been sketching up ideas for a replacement to address a number of these shortcomings.
Firstly, it’s in C. Shell was fun for coming up with an initial proof of concept, but for some things, like better management of threads, it’s just not going to work. Speaking of threads, one of the reasons the runtime was previously so long was that it never took advantage of idle disks. So if, for example, I have 4 disks and I want to run a 2-disk RAID0 stripe in one test, I should be able to launch additional threads to do something interesting with the other 2 idle disks.
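The scheduling idea, sketched in Python (the real tool is in C; all names here are hypothetical): each test claims just the disks it needs and releases them when done, so other tests can run on whatever is idle.

```python
import threading

class DiskPool(object):
    def __init__(self, disks):
        self.free = set(disks)
        self.cond = threading.Condition()

    def claim(self, n):
        with self.cond:
            while len(self.free) < n:
                self.cond.wait()       # block until enough disks are idle
            return [self.free.pop() for _ in range(n)]

    def release(self, disks):
        with self.cond:
            self.free.update(disks)
            self.cond.notify_all()

def run_test(pool, ndisks, test):
    disks = pool.claim(ndisks)
    try:
        test(disks)                    # e.g. build RAID0 on 2 disks, mkfs, thrash it
    finally:
        pool.release(disks)

# With 4 disks, a 2-disk RAID0 test and two single-disk tests can all run
# concurrently instead of serializing the whole suite.
```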
The code for this is still very early, and doesn’t do much of anything yet, but it’ll show up on github at some point.
In the meantime, I’ve been trying to put together something to test on. For reasons unexplained, the quad opteron that held all my disks no longer powers up. I spent a couple hours trying to revive it with various spare parts, without luck.
Yesterday the idea occurred to me that I could just use a USB hub and a bunch of old memory sticks for now.
It would have the advantage of being easily portable while travelling. Then I rediscovered just how crap no-name Chinese USB hubs are. Devices sometimes showing up, sometimes not. Devices falling off the bus. Sometimes the whole hub disappearing. Sometimes refusing to even power up. I tossed the idea. For now, I’ve got this usb-sata thing connected to an SSD. Portable, fast, and, surprisingly, entirely stable.
I’ve got a bunch of other ideas for this tool beyond what the io-tests shell script did, and I suspect after next months VM/FS summit, I’ll have a load more.
Here is a quick writeup of the protocol for the iKettle taken from my Google+ post earlier this month. This protocol allows you to write your own software to control your iKettle or get notifications from it, so you can integrate it into your desktop or existing home automation system.
The iKettle is advertised as the first wifi kettle, available in the UK since February 2014. I bought mine on pre-order back in October 2013. When you first turn on the kettle it acts as a wifi hotspot, and they supply an app for Android and iPhone that reconfigures the kettle to connect to your local wifi hotspot instead. The app then communicates with the kettle on your local network, enabling you to turn it on, set some temperature options, and get a notification when it has boiled.
Once connected to your local network the device responds to ping requests and listens on two TCP ports, 23 and 2000. The wifi connectivity is provided by a third-party serial-to-wifi interface board that responds similarly to an HLK-WIFI-M03. Port 23 is used to configure the wifi board itself (to tell it what network to connect to, and so on). Port 2000 is passed through to the processor in the iKettle and handles the main interface to the kettle.
| Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 |
|-------|-------|-------|-------|-------|-------|
| 100C  | 95C   | 80C   | 65C   | Warm  | On    |

So, for example, if you receive "sys status key=!" then buttons "100C" and "On" are currently active (and the kettle is therefore turned on and heating up to 100C).
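A small Python helper decoding that bitfield, per my reading of the table above (the character's low six bits map to the buttons, bit 1 being the least significant):

```python
BUTTONS = {1: "On", 2: "Warm", 3: "65C", 4: "80C", 5: "95C", 6: "100C"}

def decode_key(ch):
    value = ord(ch)
    return [name for bit, name in sorted(BUTTONS.items())
            if value & (1 << (bit - 1))]

print(decode_key('!'))   # '!' is 0x21: bits 1 and 6 -> ['On', '100C']
```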
| Status message | Meaning |
|---|---|
| sys status 0x100 | 100C selected |
| sys status 0x95 | 95C selected |
| sys status 0x80 | 80C selected |
| sys status 0x65 | 65C selected |
| sys status 0x11 | Warm selected |
| sys status 0x10 | Warm has ended |
| sys status 0x5 | Turned on |
| sys status 0x0 | Turned off |
| sys status 0x8005 | Warm length is 5 minutes |
| sys status 0x8010 | Warm length is 10 minutes |
| sys status 0x8020 | Warm length is 20 minutes |
| sys status 0x3 | Reached temperature |
| sys status 0x2 | Problem (boiled dry?) |
| sys status 0x1 | Kettle was removed (whilst on) |
You can receive multiple status messages given one action, for example if you turn the kettle on you should get a "sys status 0x5" and a "sys status 0x100" showing the "on" and "100C" buttons are selected. When the kettle boils and turns off you'd get a "sys status 0x3" to notify you it boiled, followed by a "sys status 0x0" to indicate all the buttons are now off.
| Command | Action |
|---|---|
| set sys output 0x80 | Select 100C button |
| set sys output 0x2 | Select 95C button |
| set sys output 0x4000 | Select 80C button |
| set sys output 0x200 | Select 65C button |
| set sys output 0x8 | Select Warm button |
| set sys output 0x8005 | Warm option is 5 mins |
| set sys output 0x8010 | Warm option is 10 mins |
| set sys output 0x8020 | Warm option is 20 mins |
| set sys output 0x4 | Select On button |
| set sys output 0x0 | Turn off |
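As an illustration, a minimal Python client that presses the On button and then watches for status messages might look like this (a sketch only: it assumes newline-terminated messages, and the kettle's address is made up):

```python
import socket

KETTLE_ADDR = ("192.168.1.99", 2000)   # hypothetical address of the kettle

def kettle_on():
    s = socket.create_connection(KETTLE_ADDR, timeout=300)
    try:
        s.sendall(b"set sys output 0x4\n")           # press the On button
        buf = b""
        while True:
            data = s.recv(256)
            if not data:
                break
            buf += data
            # Report each "sys status ..." line as it arrives.
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                print(line.decode("ascii", "replace"))
                if line.strip() == b"sys status 0x3":  # reached temperature
                    return
    finally:
        s.close()

kettle_on()
```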
If you're interested in looking at the web interface you can enable it by connecting to port 23 using telnet or nc, entering the password, then issuing the commands "AT+WEBS=1\n" then "AT+PMTF\n" then "AT+Z\n" and then you can open up a webserver on port 80 of the kettle and change or review the settings. I would not recommend you mess around with this interface, you could easily break the iKettle in a way that you can't easily fix. The interface gives you the option of uploading new firmware, but if you do this you could get into a state where the kettle processor can't correctly configure the interface and you're left with a broken kettle. Also the firmware is just for the wifi serial interface, not for the kettle control (the port 2000 stuff above), so there probably isn't much point.
Some work today on trinity to rid it of some hard-coded limits on the number of child processes. Now, if you have some ridiculously overpowered machine with hundreds of processors, it should run at least one child process per thread instead of maxing out at 64 like before. (It also allows overriding the maximum number of running children with the -C parameter as always, now with no upper bound, other than memory allocation for all the arrays).
Aside from that, some digging through coverity, and some abortive attempts at cleaning up some more “big function” drivers. Some of them are such a mess they need bigger changes than simply hoisting code out into functions. There’s a point, though, where I start to feel uncomfortable changing them without hardware to test on.
For reasons not entirely clear, I felt the urge to hack some on x86info again. Ended up committing dozens of changes cleaning up old crap. A lot of that code was written over 10 years ago, and I like to believe that my skills have improved somewhat in that time. Lots of it really wants gutting and rewriting from scratch, but the effort really isn’t worth it, given that for many of its features there now exist better tools (lscpu, for example). It’s likely that 1.31 will be the final release. After that, I’m thinking of stripping out some parts of it into standalone tools, and throwing away the rest.
Chasing the xfs overflow continued.
Looked over some coverity bugs. One of the features it offers is a view that shows functions with a high cyclomatic complexity. What this typically translates to is “really long functions that are in dire need of splitting up”.
After those four, the next dozen or so items on the list were still quite long functions, but I somehow quickly lost the urge to look at reiserfs and isdn.
A lot of this kind of ‘cleaning’ is pretty mind-numbingly boring work, but sometimes when I see something that horrific, I just can’t walk away..
A belated statistics dump of how Coverity looked during 3.13 development.
Things kinda slowed down over the xmas break, but the overall trend from 3.12 -> 3.13 is ~200 open issues lower, even after including new issues being introduced.
Which is what 3.11 -> 3.12 showed too. Just 25 more releases, and we’re done :-)
The new issues are getting jumped on pretty quickly too, which is good to see.
Hit an interesting bug yesterday without really trying. What looks like a case of interrupts being disabled when they shouldn’t be is believed to actually be a stack overrun. But rather than crashing, we’re corrupting the state of the irq flags. I’m a little sceptical still that this is the actual cause, but right now it’s the best answer being offered. As such, people are starting to look at the amount of stack used during some of the IO paths.
As can be seen in the linked stack trace above, the callchain can get pretty deep, and seems to be getting worse over time.
Fixed up a handful of small trinity bugs that caused children to segfault. There’s still a few remaining. I’ve held off on fixing them for now, because the current state of trinity segv’ing with certain parameters is a useful reproducer for the bug mentioned above.
Continued chipping away at the coverity backlog.
FOSDEM is pretty much _the_ European community FOSS event. I've been going on and off for a few years now, but in the last few years, it has had a dedicated Legal devroom, and I really enjoy that aspect of it. I spoke in a short session in the Legal devroom on H264 and Cisco's donation of openh264. I thought that talk went okay, but every time I give a new presentation, I immediately realize 10-20 ways I could have improved it (even if I never give that talk again). Afterwards, someone from Mozilla came over to argue that the Cisco release of openh264 was a net win for FOSS and Linux distros, and I think we had to agree to disagree on that point. His point eventually boiled down to "we're losing users to Chrome, we desperately need openh264 to compete", which is a bit like me saying "Fedora is losing users to other distros, we desperately need non-free software to compete". Ahem.
Anyways, I was also on a panel about Governance in FOSS communities, which I thought went well, even if most of us on the panel were not entirely sure whether we were qualified to speak on that topic. :) Karen Sandler had some good questions, as did the audience, and it was a packed room.
Not to take away from any packed room, but FOSDEM has really really outgrown its venue. The Université libre de Bruxelles is nice, and it is free (or mostly free from what I hear), but 3 out of 4 sessions I'd have liked to see were full before I even had a chance. They need a lot bigger rooms (or more days with repeat sessions).
I also brought a Lulzbot Taz 3 3D printer with me, but because I'm an idiot (and assumed an auto-switching power supply), I cooked the power supply in the first hour. Later, we thought we had a working replacement power supply, but it was 110V (and the Taz 3 really needs a 230V supply). Thankfully, the Fedorans had brought some RepRap printers, so we had 3D printing the whole time, just not on the Taz 3 so much. Lesson learned. Lulzbot donated that Taz 3 (and a replacement power supply) to hackerspace.be.
I had a lot of good hallway discussions with people (there were a larger than normal contingent of US Fedora people around because of devconf.cz, which was a week after FOSDEM, but I opted out this year), and a good sampling of delicious Belgian beer. After FOSDEM, I flew to Prague for two days, to scope out the venues for Flock 2 (Electric Boogaloo).
First day back after taking a week off. Which of course meant being buried alive in email. I deleted a ton of stuff, so if you were expecting a reply from me and didn’t get one, resend.
Started poking at trinity bugs. There’s a case where the watchdog got hung in circumstances where the main process had quit (ironically, it got stuck in the routine that checked if it was still alive). Fixed that up, but still uncertain why that path was ever being taken. The new dropprivs code has shaken out a few more corner cases, such as fork() failures that weren’t being checked for. Hopefully that’s a bit more robust after todays changes.
Hit a weird irda/lockdep bug. Not sure what’s going on there, and everyone is reluctant to dig into irda too much. Can’t say I blame them tbh, it’s pretty grotty and unmaintained.
Spent some time looking over the recent coverity issues, and dismissed a bunch. Requested a bunch more components for some of the more ‘bug heavy’ parts of the kernel.
I realized I never did a statistics dump for 3.13 like I did for 3.12. I’ll sort that out tomorrow.
Saw a hilarious blog post about Google interviews, which contains the following gem:
Code in C++ or Java, which shows maturity. If you can only code in Python or bash shell, you're going to have trouble.
Reminds me immediately how Google paid 1.6 billion dollars for a website coded entirely in Python.
Been grinding through some of the coverity backlog the last few days.
Closed out a lot of non-issues marking them as intentional. A handful of false positives, and sent a few patches for a handful of real bugs. Nothing too scary. Net result: outstanding issues went down from 5141 to 4844. If only I could sustain that rate of closures.
Spent some time trying to dig into a perf bug that I’d been sporadically triggering back when there was an ftrace bug that meant any user could trigger perf events. When that got fixed, the bug went into hiding because only root was able to trigger the code path necessary. But, now that trinity has dropprivs mode, it’s reachable again. Hrmph.
Tried reproducing it with Vince Weaver’s perf_event_test fuzzer. Ended up triggering a different bug instead. Grumble.
Feeling a little burnt out. Taking next week off. Will still do the daily coverity builds, but won’t be doing much else.
Courtesy of Zeeshan, I came into an N950 last year, so that I could make the N9/N950 work great with ModemManager. Mission accomplished. But one of the most annoying things about it is that when the battery is charged, it stops charging even though it’s still plugged in. So, of course, I wake up in the morning and, since it stopped charging while plugged in, the battery is now down to 75% or even lower.
So I humbly ask the Internet, is there a way to tell the N9/950 not to drain the battery when it’s done charging and still plugged in?
Some people have been running trinity as root for a while (thankfully, in virtual machines, doing so on real hardware can end in not so hilarious results, like your bios settings getting screwed to the point you can’t power up until you cover up the CMOS jumper, or your laptop battery no longer returning sensible information over i2c, or a whole slew of worse things).
Meanwhile, those who want to use trinity as an unprivileged user have been unable to fuzz certain aspects of the kernel. For example, creating and binding certain socket types can only be done by a user with CAP_SYS_ADMIN.
So I’ve recently committed some code which allows running trinity as root, dropping privileges before starting the child processes.
This is all very much a work in progress (and is quite buggy still), and not recommended for anyone to try right now unless you’re interested in debugging. Once things are stable again, I’ll move on to creating certain child processes that only do root-required things, like various ioctls.
| Ver | Outstanding issues | Defect density |
|-----|--------------------|----------------|
Looks like we’ll soon be below 5000 for the first time since I started running the daily scans.
(There’s still a load of known false positives/intentional issues, so the actual bug count is less already).
Since the merge window opened, 64 new issues have been detected so far. (Not bad considering that includes some of the worst offenders, like drivers/staging/.) Additionally, 38 issues got fixed.
Real quick: you know how the BSDs were supposed to have a "core bit" for "core committers"? If one was "on core", he could issue "cvs commit". Everyone else had to e-mail patches to one of the core guys. One problem with that setup was that people are weak. A core guy could be severely tempted to commit something that was not rigorously tested or otherwise questionable.
OpenStack addresses this problem by not actually letting "core" people commit anything. I'm on core for Swift, but I cannot do "git push" to master. I can only do "git review" and then ask someone else to approve it. Of course, this is easy to attack if one wants to. For example, committers can enter into a conspiracy and buddy-approve. It's possible. But at least we're somewhat protected against a late-night commit before a deadline by a guy who was hurried by problems at work.
Started getting a damn cold again, making it hard to think clearly most of the day.
Still managed to grind through a bunch of cleanups in trinity.
The rewrite of the logging code a few months ago kind of blew up in terms of complexity, leading to things like functions with 8 arguments. Spent a while cleaning that up, and reduced it to 5 args.
Later I decided to shorten a lot of code by introducing a bunch of local pointers to the syscall structs. Just as I was feeling pleased with myself, I got an email telling me I had broken the compile on older versions of gcc. It turned out that having a struct named ‘syscall’ and a function named syscall() was a bad idea.
Newer gcc can disambiguate between the two, but older versions freak out when compiled with -Wshadow.
Which led to the great syscall struct renaming of 2014.
And that was pretty much all I managed to get my head around today. Even Sudafed couldn’t make me more productive.
Started the day with a WARN_ON in perf.
More trinity work. Every time I made a start on writing new code I found myself adding to my TODO without making much forward progress. Still, a plan is forming.
Started reworking the mapping code a little to make it easier to implement some new things.
Then spent the afternoon doing some more improvements to the network socket code (unpushed changes, should finish tomorrow).
Fixed up some boring stuff that cppcheck found.
Found some time to play with PeterZ’s new lockdep feature, and quickly found some breakage.
Ended the day with an oops in RDS.
Talked yesterday about the virgil project, a virtio-based GPU, and where it’s going.
Watch it here.
Starting to get back into the swing of things.
A bunch of trinity changes today.
Started thinking about logging structs that trinity creates. Right now, we just see an address in the logs, which isn’t particularly helpful when we want to know what the struct members were. Should be trivial to hack up, so will likely do that tomorrow.
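Something like this is what I have in mind, sketched in Python with ctypes (trinity is C; the struct and the formatting here are just illustrative):

```python
import ctypes

class sockaddr_in(ctypes.Structure):
    _fields_ = [("sin_family", ctypes.c_ushort),
                ("sin_port", ctypes.c_ushort),
                ("sin_addr", ctypes.c_uint32)]

def render_struct(s):
    # Log the decoded members rather than a bare pointer value.
    fields = ", ".join("%s=%#x" % (name, getattr(s, name))
                       for name, _ in type(s)._fields_)
    return "%s { %s }" % (type(s).__name__, fields)

sa = sockaddr_in(2, 0x5000, 0x0100007f)
print(render_struct(sa))
# sockaddr_in { sin_family=0x2, sin_port=0x5000, sin_addr=0x100007f }
```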
Coverity scans are happening again. The machine I was doing them on still isn’t recovered, so they’re taking a little longer to run, but they’re at least going to happen regularly again.
Been doing some reading on various oddball network protocols. Been figuring out some plans for enhanced fuzzing there in trinity.
At the same time, spent some time looking at old VM/FS bugs, and trying to figure out ways trinity could have triggered them.
First day back at work after a much needed break.
As expected, buried alive in email, so skimmed through the interesting parts of that, and marked pretty much everything else as read.
Lots of the day spent updating kernel trees to Linus’ latest. Surprisingly, nothing broke, which is a good thing given we’re at rc7. Later in the day, one fuzz testing box turned itself off spontaneously. Unsure if kernel bug, or first hardware death of the year.
Right before the break, I remotely started a fedup upgrade of the box that I do the daily coverity builds on.
It didn’t survive the reboot. So until I get that fixed up, the scans won’t be updating. Hopefully I’ll get on top of it soon.
Occasionally I see questions about how to import gdb from the ordinary Python interpreter. This turns out to be surprisingly easy to implement.
First, a detour into PIE and symbol visibility.
“PIE” stands for “Position Independent Executable”. It uses essentially the same approach as a shared library, except it can be applied to the executable. You can easily build a PIE by compiling the objects with the -fPIE flag, and then linking the resulting executable with -pie. Normally PIEs are used as a security feature, but in our case we’re going to compile gdb this way so we can have Python dlopen it, following the usual Python approach: we install it as _gdb.so and add a module initialization function, init_gdb. (We actually name the module “_gdb”, because that is what the gdb C code creates; the “gdb” module itself is already plain Python that happens to “import _gdb”.)
Why install the PIE rather than make a true shared library? It is just more convenient — it doesn’t require a lot of configure and Makefile hacking, and it doesn’t slow down the build by forcing us to link gdb against a new library.
Next, what about all those functions in gdb? There are thousands of them… won’t they possibly cause conflicts at dlopen time? Why yes… but that’s why we have symbol visibility. Symbol visibility is an ELF feature that lets us hide all of gdb’s symbols from any dlopen caller. In fact, I found out during this process that you can even hide main, though ld.so seems to ignore visibility bits for this function.
Making this work is as simple as adding -fvisibility=hidden to our CFLAGS, and then marking our Python module initialization function with __attribute__((visibility("default"))). Two notes here. First, it’s odd that “default” means “public”; just one of those mysterious details. Second, Python’s PyMODINIT_FUNC macro ought to do this already, but it doesn’t; there’s a Python bug.
Those are the low-level mechanics. At this point gdb is a library, albeit an unusual one that has a single entry point. After this I needed a few tweaks to gdb’s startup process in order to make it work smoothly. This too was no big deal. Now I can write scripts from Python to do gdb things:
```python
#!/usr/bin/python
import gdb

gdb.execute('file ./install/bin/gdb')
print 'sizeof = %d' % gdb.lookup_type('struct minimal_symbol').sizeof
```

```
$ python zz.py
sizeof = 72
```
Soon I’ll polish all the patches and submit this upstream.
Although it was mentioned obliquely before, Glie now exists. It receives the ADS-B data from RTL-SDR in 1090ES band and produces an image. I can hit refresh in the browser and watch airplanes coming in to land at a nearby airport in real time.
This stuff is hardly groundbreaking. Many such programs exist, some are quite sophisticated in interfacing to various mapping, geography, schedule, and airframe information services, as well as in the UI. This one is mainly different because it's my toy.
Actually, the general aim of this project is also different: unlike most stuff out there, it is not meant to be a surveillance tool, but to provide a traffic awareness readout. No persistent database of any kind is involved. No map either. Instead, I'm going to focus on onboard features, such as relative motion history (so one can easily identify targets on a collision course).
But mostly, it's for fun and education. And already Glie is facing a few technical challenges:
For now I'm going to take it easy and play with what I have, aiming for some kind of a portable system. Perhaps someone will develop an open source UAT receiver on a practical platform in the meanwhile.
I just tagged and pushed out a 1.3 tarball for trinity.
Most people who use it will likely stay on the bleeding edge running latest git, but given it’s currently finding all kinds of interesting bugs in the VM, this seems like a good point to tag a release, so that people chasing those bugs have something stable they can run.
It should also build on older distributions, which might be interesting for people wanting to test enterprise distributions. (I tend to break the build in git on a semi-regular basis, because I’m only building/testing on current Fedora).
In the new year, I’ve got a bunch of things I want to investigate adding/enhancing in this VM related code, as well as some interesting ideas for the networking code.