Hi everyone,
While most of you are probably excited about the possibilities of the recently announced “Librem 5” phone, today I am sharing a technical progress report about our existing laptops, particularly findings about getting coreboot to be “production-ready” on the Skylake-based Librem 13 and 15, where you will see one of the primary reasons we experienced a delay in shipping last month (and how we solved the issue).
TL;DR: Shortly before we began shipping from inventory, the coreboot port was considered done, but we found some weird SATA issues at the last minute, and those needed to be fixed before those orders could ship.
I previously considered the coreboot port “done” for the new Skylake-based laptops, and as I went to the coreboot conference, I thought I’d be coming back home and finally be free to take care of the other stuff in my ever-increasing TODO list. But when I came back, I received an email from Zlatan (who was inside our distribution center that week), saying that some machines couldn’t boot, throwing errors such as:
Read Error
…in SeaBIOS, or
error: failure reading sector 0x802 from 'hd0'
or
error: no such partition. entering rescue mode
…in GRUB before dropping into the GRUB rescue shell.
That was odd, as I had never encountered those issues except once, very early in the development of the coreboot port, when we were seeing some ATA error messages in dmesg; that was fixed, and neither Matt nor I had seen such errors since. So of course, I didn’t believe Zlatan at first, thinking that maybe the OS was not installed properly… but the issue was definitely occurring on multiple machines that were being prepared to ship out. Zlatan then booted into the PureOS Live USB and re-installed the original AMI BIOS; after that he had no more issues booting into his SSD, but when he’d flash coreboot back, it would fail to boot.
Intrigued, I tested on my machine again with the “final release” coreboot image I had sent them and I couldn’t boot into my OS either. Wait—What!? It was working fine just before I went to the coreboot conference.
Madness? THIS—IS—SATA!
After extensive testing, we finally came to the conclusion that whether or not the machine would manage to boot depended entirely on which machine and which drive we were testing with.
The most astonishing (and frustrating) thing is that during the three weeks when Matt and I had previously been working on the coreboot port, we never encountered any “can’t boot” scenario—and we were rebooting those machines probably 10 times per hour or more… but now, we were suddenly both getting those errors, pretty consistently.
After a day or two of debugging, it suddenly started working without any errors again for a couple of hours, then the errors started happening again. On my end, the problem typically seemed to happen with SATA SSDs on the M.2 port (I didn’t get any issues when using a 2.5″ HDD, and Matt was in the same situation). However, even with a 2.5″ HDD, Zlatan was hitting the same issues we were seeing on the M.2 connector.
So the good news was that we were at least able to reproduce the error pretty frequently now; the bad news was that Purism couldn’t ship its newest laptops until this issue was fixed—and we had promised the laptops would be shipping out in droves by that time! Y’know, just to add a bit of stress to the mix.
When I was doing the v1 port, I had a more or less similar issue with the M.2 SATA port, but it was much more consistent: it would always fail with “Read Error”, instead of failing with a different error on every boot and “sometimes failing, sometimes working”. Some of you may remember my explanation of how I fixed the issue on the v1 in February: back then, I had to set the DTLE setting on the IOBP register of the SATA port. What that means is anyone’s guess, but I found this article explaining that “DTLE” means “Discrete Time Linear Equalization”, and that having the wrong DTLE values can cause drives to “run slower than intended, and may even be subject to intermittent link failures”. Intermittent link failures! Well! Doesn’t that sound familiar?
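For the curious, that v1 fix ended up being a tiny bit of code in coreboot’s Broadwell SATA init. A rough sketch of its shape is below; the IOBP address, mask and shift are illustrative placeholders rather than the real values, and I’m relying on the pch_iobp_update() helper from coreboot’s Broadwell PCH code:

/* Rough sketch of the v1 (Broadwell) DTLE fix in coreboot: read-modify-write
 * the DTLE field of the SATA port's IOBP register. The address, mask and
 * shift below are illustrative placeholders, not the real values. */
#include <stdint.h>

/* coreboot's Broadwell PCH code provides an IOBP read-modify-write helper
 * roughly like this (normally pulled in from its PCH headers). */
void pch_iobp_update(uint32_t address, uint32_t andvalue, uint32_t orvalue);

#define SATA_PORT0_DTLE_IOBP 0xea002550U   /* placeholder IOBP address */
#define SATA_DTLE_MASK       (0xfU << 24)  /* placeholder DTLE field */
#define SATA_DTLE_SHIFT      24

static void sata_port0_set_dtle(uint32_t dtle)
{
	pch_iobp_update(SATA_PORT0_DTLE_IOBP, ~SATA_DTLE_MASK,
			(dtle << SATA_DTLE_SHIFT) & SATA_DTLE_MASK);
}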
Unfortunately, I don’t know how to set the DTLE setting on the Skylake platform, since coreboot doesn’t have support for it. The IOBP registers that were on the Broadwell platform do not exist in Skylake (they have been replaced by a P2SB—Primary to SideBand—controller), and the DTLE setting does not exist in the P2SB registers either, according to someone with access to the NDA’ed datasheet.
When the computer was booting, some ATA errors would appear in dmesg, looking something like this:
ata3: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/04:00:d4:82:85/00:00:1f:00:00/40 tag 0 ncq 2048 in
         res 40/00:18:d3:82:85/00:00:1f:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
Everywhere I found this error referenced, such as in forums, the final conclusion was typically “the SATA connector is defective” or “it’s a power-related issue” (with the errors disappearing after upgrading the power supply), etc. That sort of makes sense considering that a wrong DTLE setting can cause similar intermittent failures.
It also looks strikingly similar to Ubuntu bug #550559, where there is no insight on the cause other than “disabling NCQ in the kernel fixes it”… but the original (AMI) BIOS does not disable NCQ support in the controller, and disabling NCQ wouldn’t address the underlying DTLE-style misconfiguration anyway.
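For reference, that kernel-side workaround boils down to a libata boot parameter, something along these lines on the kernel command line (syntax from memory, so double-check it against the kernel’s documentation):

libata.force=noncq

That disables NCQ for every port, which masks the symptom on the OS side but, as said above, changes nothing in the firmware’s SATA setup.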
So, not knowing what to do exactly and not finding any information in datasheets, I decided to try and figure it out using some good old reverse engineering.
First, I needed to see what the original BIOS did… but when I opened it in UEFIExtract, it turned out there’s a bunch of “modules” in it. What I mean by “a bunch” is about 1581 modules in the AMI UEFI BIOS, from what I could count. Yep. And “somewhere” in one of those, the answer must lie. I didn’t know what to look for; some modules are named, some aren’t, so I obviously started with the file called “SataController”—I thought I’d find the answer in it quickly enough simply by opening it up with IDA, but nope: that module pretty much doesn’t do anything. I also tried “PcieSataController” and “PcieSataDynamicSetup”, but those weren’t of much help either.
I then looked at the code in coreboot to see how exactly it initializes the SATA controller, and found this bit of code:
/* Step 1 */
sir_write(dev, 0x64, 0x883c9003);
I don’t really know what this does, but to me it looks suspiciously like a “magic number”, where for some reason that value needs to be written to that register for the SATA controller to be initialized. So I searched for that value in all of the UEFI modules and found one module containing that same magic value, called “PchInitDxe”. Progress! But the code was complex and I quickly realized it would take me a long time to reverse engineer it all, and time was something I didn’t have—remember, shipments were blocked by this, and customers were asking us daily about their order status!
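In case you’re wondering how you even find a 32-bit constant inside ~1581 binary blobs: you scan each extracted module for the little-endian byte sequence of that value. I used whatever tools were at hand, but a minimal standalone scanner would look something like this (file names and usage are just examples):

/* Minimal sketch: search one extracted UEFI module for the little-endian
 * byte pattern of the 0x883c9003 magic value and print the matching offsets. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const uint8_t needle[4] = { 0x03, 0x90, 0x3c, 0x88 }; /* 0x883c9003, LE */
	uint8_t window[4] = { 0 };
	long offset = 0;
	int c, found = 0;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <module.bin>\n", argv[0]);
		return 1;
	}
	FILE *f = fopen(argv[1], "rb");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while ((c = fgetc(f)) != EOF) {
		/* keep a rolling 4-byte window of the file contents */
		window[0] = window[1];
		window[1] = window[2];
		window[2] = window[3];
		window[3] = (uint8_t)c;
		offset++;
		if (offset >= 4 && !memcmp(window, needle, sizeof(needle))) {
			printf("match at offset 0x%lx\n", offset - 4);
			found = 1;
		}
	}
	fclose(f);
	return found ? 0 : 2;
}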
One realization that I had was that the error is always about this “READ FPDMA QUEUED” command… which means it’s somehow related to DMA, and therefore related to RAM—so, could there be RAM corruption occurring? Obviously, I tested the RAM with memtest and no issues turned up, and since we had finally received the hardware, I could push for receiving the schematics from the motherboard designer (I was previously told it would be a distraction to pursue schematics when there were so many logistical issues to fix first).
What else could I do? “If only there was a way to run the original BIOS in an emulator and catch every I/O it does to initialize the SATA controller!”
Well, there is something like that: it’s called serialICE, and it’s part (sort of?) of the coreboot umbrella project. I was very happy to find it, but after a while I realized I couldn’t make use of it (at least not easily): it requires replacing the BIOS with serialICE itself, a very, very minimal firmware that basically only initializes the serial lines and then listens for commands; you then run the BIOS you want to study inside qemu on another machine, “connect” the two over the serial port, and every hardware I/O access the BIOS makes gets forwarded over that serial link and executed on the real machine, so you end up with a full log of everything the BIOS does to the hardware… That’s great, and exactly what I need; unfortunately, our laptops don’t even have a serial port to hook into.
Thankfully, I was told that there is a way to use xHCI USB debugging capabilities even on Skylake, and Nico Huber wrote libxhcidbg, a library implementing the xHCI USB debug features. So, all I would need to do to make serialICE work would be to wire that xHCI debug support into it, so it could send its log over USB instead of a serial port.
Another issue is that for the USB debug to work, USB needs to be initialized, and there is no way for me to know if the AMI BIOS initializes the SATA controller before or after the USB controller, so it might not even be helpful to do all that yak shaving.
The other solution (using flashconsole, i.e. storing the log in the flash chip instead of sending it out over a serial link) might not work either: we have 16MB of flash, and I expect that a log of all I/O accesses would take a lot more space than that.
And even if one or both of those solutions actually worked, sifting through thousands of I/O accesses to find just the right one I need might be like looking for a needle in a haystack.
Considering the amount of work involved, the uncertainty of whether or not it would even work, and the fact that I really didn’t have time for such animal cruelty (remember: shipments on hold until this is fixed!), I needed to find a quicker solution.
At that point, I was starting to lose hope for a quick solution and I couldn’t find any more tables to flip:
“This issue is so weird! I can’t figure out the cause, nothing makes sense, and there’s no easy way to track down what needs to be done in order to get it fixed.”
And then I noticed something. While the machine would sometimes fail to boot, sometimes boot without issues, sometimes trigger ATA errors in dmesg and sometimes stay silent… one thing was consistent: once Linux booted, we didn’t experience any issues—no kernel panic because the disk couldn’t be accessed, no “input/output error” when reading files… no real visible issue at all, other than the few ATA errors showing up in dmesg early in the boot, and those errors never re-appeared later.
After doing quite a few tests, I noticed that whenever the ATA errors happened a few times, the Linux kernel ended up dropping the SATA link speed to 3Gbps instead of the default 6Gbps, and that once it did, no errors happened afterwards. I eventually came to the conclusion that those ATA errors share the same cause as the boot errors from SeaBIOS/GRUB, and that they only happen when the controller is set up to use 6Gbps speeds.
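That speed drop is easy to spot, by the way: after enough failed commands, dmesg shows a line along the lines of

ata3: limiting SATA link speed to 3.0 Gbps

right before the errors stop appearing.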
What if I was wrong about the DTLE setting, and potential RAM issues? What if all of this is because of a misconfiguration of the controller itself? What if all AMI does is to disable the 6Gbps speed setting on the controller so it can’t be used?!
So, of course, I checked, and nope, it’s not disabled: when booting Linux from the AMI BIOS, the link was set up at 6Gbps and had no issues… so it had to be something else, somehow related to the link speed. I dumped every configuration of the SATA controller—not only the PCI configuration space, but also the AHCI ABAR memory-mapped registers, and any other registers I could find that were related to the SATA/AHCI controller—and I made sure they matched exactly between the AMI BIOS and coreboot, and… still nothing. It made even less sense! If all the SATA PCI configuration space and AHCI registers were exactly the same, then why wouldn’t it work?
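For anyone who wants to do that kind of comparison themselves: the PCI configuration space is the easy part (something like “lspci -xxx -s 00:17.0”, assuming the SATA controller sits at 00:17.0 as it does here), and the memory-mapped AHCI registers can be dumped from Linux with a small /dev/mem tool roughly like the sketch below. The base address in it is just an example; the real one is BAR5 (the AHCI “ABAR”) as reported by lspci, and reading it this way needs root plus a kernel that doesn’t restrict /dev/mem access to that range.

/* Minimal sketch: dump the AHCI ABAR (BAR5 of the SATA controller) so the
 * register values can be diffed between an AMI BIOS boot and a coreboot boot.
 * ABAR_BASE is an example value; read the real BAR5 address from lspci. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ABAR_BASE 0xdf230000UL  /* example only: BAR5 from "lspci -v -s 00:17.0" */
#define ABAR_SIZE 0x800UL       /* generic AHCI registers plus a few ports */

int main(void)
{
	int fd = open("/dev/mem", O_RDONLY | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}
	volatile uint32_t *abar = mmap(NULL, ABAR_SIZE, PROT_READ, MAP_SHARED,
				       fd, ABAR_BASE);
	if (abar == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (unsigned long off = 0; off < ABAR_SIZE; off += 4)
		printf("0x%03lx: 0x%08" PRIx32 "\n", off, abar[off / 4]);
	munmap((void *)abar, ABAR_SIZE);
	close(fd);
	return 0;
}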
I gave up!
…ok, I actually didn’t. I temporarily gave up trying to fix the problem’s root cause, but only because I had an idea for a workaround that could yield a quick win instead: if Linux is able to drop the link speed to 3Gbps and stop having any issues, then why can’t I do the same in coreboot? Then both SeaBIOS and GRUB would stop having issues reading from the drive, and the machine would boot properly.
I decided I would basically do the same thing as Linux, but do it purposely in coreboot from the start, instead of it being done in Linux only after errors start appearing.
While not the “ideal fix”, such a workaround would at least let the Skylake-based Librems boot reliably for all users, allowing us to release the shipments so customers can start receiving their machines as soon as possible, after which I would be able to take the time to devise the “ideal” fix, and provide it as a firmware update.
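Conceptually, the change on the coreboot side is small. Here is a rough sketch of the idea (the generic AHCI mechanism, not necessarily the exact patch we ended up shipping): cap each port at Gen2 speed through the SPD field of the port’s SControl register, which is the same knob Linux’s libata pokes when it decides to downgrade a flaky link.

#include <stdint.h>

/* AHCI port registers live at ABAR + 0x100 + port * 0x80. PxSCTL sits at
 * offset 0x2c inside that block, and its SPD field (bits 7:4) caps the
 * negotiated speed: 1 = Gen1 (1.5 Gbps), 2 = Gen2 (3 Gbps), 3 = Gen3 (6 Gbps). */
#define AHCI_PORT_REGS(abar, port) ((abar) + 0x100 + (port) * 0x80)
#define PORT_SCTL                  0x2c
#define SCTL_SPD_MASK              (0xfu << 4)
#define SCTL_SPD_GEN2              (0x2u << 4)

static void ahci_port_limit_to_gen2(uintptr_t abar, int port)
{
	volatile uint32_t *sctl =
		(volatile uint32_t *)(AHCI_PORT_REGS(abar, port) + PORT_SCTL);

	/* Cap negotiation at 3 Gbps; the limit takes effect on the next
	 * link reset (COMRESET via the DET field of the same register). */
	*sctl = (*sctl & ~SCTL_SPD_MASK) | SCTL_SPD_GEN2;
}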
I put my plan in motion: I made coreboot cap the SATA link speed at 3Gbps, verified that SeaBIOS, GRUB and Linux could all boot reliably on the machines that had been failing, and we were finally able to release the shipments.
As you can see, small issues like that are a real puzzle, and that’s the kind of thing that can make you waste a month of work just to “get it working” (let alone “find the perfect fix”). This is why I typically don’t give time estimates on this sort of work. We’re committed, though, to getting you the best experience with your machines, so we’re still actively working on everything.
Here’s a summary of the current situation: the Skylake-based Librem 13 and 15 are shipping with coreboot using the 3Gbps workaround, so they boot reliably; the root cause of the 6Gbps errors is still unknown, and I’m continuing to dig into it so that the proper fix can eventually be delivered as a firmware update.
It’s taken me much longer than anticipated to write this blog post (2 months exactly), as other things kept getting in the way—avalanches of emails, other bugs to fix, patches to test/verify, scripts to write, and a lot of things to catch up on from the one month of intense debugging during which I had neglected all my other responsibilities.
While I was writing this status report, I didn’t make much progress on the issue—I’ve had 3 or 4 enlightenments where I thought I had suddenly figured it all out, only to end up at a dead end once again. Well, once I do figure it out, I will let you all know! Thanks for reading and thanks for your patience.
End notes: if you didn’t catch some references, or paragraph titles in this post, then you need to read The KingKiller Chronicles by Patrick Rothfuss (The Name of the Wind and The Wise Man’s Fear). These are some of the best fantasy books I’ve ever read, but be aware that the final book of the trilogy may not be released for another 10 years because the author loves to do millions of revisions of his manuscripts until they are “perfect”.