Firmware debugging is uniquely challenging, because most conventional software debugging tools aren’t available. With coreboot’s specialized tooling, support from the amazing community, and a little bit of creativity, we fixed a regression in coreboot 4.17 that caused reboot loops on the Librem Mini.
When coreboot makes a new release, I rebase our Librem-specific patches onto it, then run a set of tests on each device. On 4.17, rebooting caused a boot loop on the Librem Mini. A normal boot involves coreboot, SeaBIOS, GRUB, and Linux, in that order. In the reboot loop, GRUB would try to start the Linux kernel, but then SeaBIOS would appear again.
Lots of problems look like this, so the first thing to do is to try to narrow it down. I checked this on a Librem 14 laptop and a Librem Mini v1. The Mini v1 showed the same problem, but the Librem 14 was fine.
I also wanted to know if the fault was in GRUB or Linux, but since Librem Mini doesn’t have a UART (serial port), it’s difficult to get Linux kernel boot logs when it faults very early.
Troubleshooting the problem wasn’t going anywhere fast, so I decided to look at the changes between releases instead. There were 1306 commits between 4.16 and 4.17, so I should be able to bisect that in roughly 10 steps.
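As a quick sanity check on that estimate, here is a minimal Python sketch of the arithmetic. It assumes a linear history in which every suggested commit can be tested; the variable names are purely illustrative.

```python
import math

commits = 1306  # commits between 4.16 and 4.17, as counted above

# Each bisect step roughly halves the remaining candidates, so the number
# of steps grows with log2 of the number of commits.
exact = math.log2(commits)
steps_low, steps_high = math.floor(exact), math.ceil(exact)

print(f"log2({commits}) = {exact:.2f}, so about {steps_low} to {steps_high} steps")
```

Untestable commits or a branchy history can add a few steps on top of this.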
‘git bisect’ is a wonderful tool – tell Git about a good commit and a bad commit, and it will suggest a commit to test. For simple histories, this is just the midpoint, but ‘git bisect’ also knows how to walk complex histories. On top of that, if you can’t test the commit Git suggested, just pick any commit you can test and mark it – ‘git bisect’ will pick up from there.
I started a bisect from 4.16 to 4.17:
For each test, I built the firmware, flashed it on the Librem Mini, and then tested a reboot. I applied our patches each time in case they contributed to the issue – I didn’t want to change too many factors at once. Along the way, I ran into some commits that would not boot. In those cases I had to pick another commit to test and then reflash externally.
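To illustrate the search itself, here is a small Python simulation of bisecting a linear history, including the case where the suggested commit cannot be tested and a nearby one is tried instead. This is only a sketch of the idea, not how Git implements ‘git bisect’: the function name, the culprit index, and the set of unbootable commits are all made up for the example.

```python
def bisect_history(n_commits, is_bad, can_test=lambda i: True):
    """Find the first bad commit in a linear history.

    Commit 0 is known good, commit n_commits is known bad. Returns the
    index of the first commit found bad and the number of tests run.
    """
    good, bad = 0, n_commits
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        # Prefer the midpoint; if it cannot be tested (e.g. it does not
        # boot), fall back to the nearest testable commit in the range.
        candidates = sorted(range(good + 1, bad), key=lambda i: abs(i - mid))
        pick = next((i for i in candidates if can_test(i)), None)
        if pick is None:
            break  # nothing testable left; culprit is somewhere in (good, bad]
        tests += 1
        if is_bad(pick):
            bad = pick
        else:
            good = pick
    return bad, tests


# Hypothetical scenario: commit 900 introduced the regression, and two
# commits in the middle of the range do not boot at all.
FIRST_BAD = 900
UNBOOTABLE = {653, 654}

culprit, tests = bisect_history(
    1306,
    is_bad=lambda i: i >= FIRST_BAD,
    can_test=lambda i: i not in UNBOOTABLE,
)
print(f"first bad commit: {culprit}, found after {tests} tests")
```

In a real session each call to `is_bad` corresponds to one build, flash, and reboot cycle, which is why the number of tests matters so much.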
Each build/flash/reboot test cycle takes a little while, so the total bisection took a couple of hours. It was completely worth the time, because it found the commit I was looking for:
This only raised new questions though – all this change did was increase the size of an allocation. How could that cause a reboot loop, and why would it only affect Librem Mini and not Librem 14?
Coreboot allocates this region to create SMBIOS tables, which provide management information from the BIOS to the operating system. While looking around, I noticed that write_smbios_table() didn’t clear the unused part of the allocation. In 4.17, there would now be at least 28 KB of uninitialized memory in this region on Mini v2.
The OS finds SMBIOS tables by searching for a specific signature – there’s no ‘table pointer’ passed from firmware to OS. On a hunch, I patched coreboot to clear the uninitialized part of the SMBIOS allocation. It worked!
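The effect of clearing the unused part of an allocation can be shown with a toy model in Python. This is not coreboot code: `write_table` and `find_signature` are hypothetical helpers, the region contents are invented, and `STALE_SIG` stands in for whatever leftover signature a scanner might be looking for. Only the `_SM_` anchor string is taken from the SMBIOS 32-bit entry point format.

```python
KIB = 1024
STALE_SIG = b"OLDTABLE"  # hypothetical leftover signature from an earlier boot


def write_table(region: bytearray, table: bytes, clear_unused: bool) -> None:
    """Write a table at the start of a region, optionally zeroing the rest."""
    if len(table) > len(region):
        raise ValueError("table does not fit in region")
    region[:len(table)] = table
    if clear_unused:
        region[len(table):] = bytes(len(region) - len(table))


def find_signature(region: bytes, signature: bytes, align: int = 16) -> list:
    """Return every aligned offset at which the signature appears."""
    last = len(region) - len(signature)
    return [off for off in range(0, last + 1, align)
            if region[off:off + len(signature)] == signature]


def make_dirty_region() -> bytearray:
    """A 32 KiB region that still holds a stale signature near its end."""
    region = bytearray(b"\xaa" * (32 * KIB))
    region[28 * KIB:28 * KIB + len(STALE_SIG)] = STALE_SIG
    return region


new_table = b"_SM_" + bytes(2 * KIB - 4)  # made-up 2 KiB table

uncleared = make_dirty_region()
write_table(uncleared, new_table, clear_unused=False)

cleared = make_dirty_region()
write_table(cleared, new_table, clear_unused=True)

print("without clearing:", find_signature(uncleared, STALE_SIG))
print("with clearing:   ", find_signature(cleared, STALE_SIG))
```

Without clearing, the scan still reports the stale signature at the 28 KiB offset; with clearing, it finds nothing.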
This solved the problem, but why? Is this actually the right fix?
I used the ‘coreinfo’ test payload to look around in this region of memory after a reboot – there was a leftover ACPI table signature. The OS finds ACPI tables the same way – by searching for a signature in a region of memory.
ACPI tables aren’t in the SMBIOS allocation; they’re in the allocation before that. The old ACPI signature is still here because the entire memory region shifted by 4KB from first boot to reboot – it’s now at the end of the SMBIOS region. ACPI tables are much larger than 4KB, so if the OS found this stale signature, it would reference partially overwritten ACPI structures from the first boot.
This didn’t happen in coreboot 4.16 because the SMBIOS region itself was only 4KB – the SMBIOS table always overwrote the old ACPI table. Extending the SMBIOS region moved the table out of the way; the old ACPI table now survived. Librem 14 was fine because the memory regions don’t shift between boot and reboot on that system.
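The reasoning above can be captured as a few lines of offset arithmetic. The Python sketch below rests on several assumptions that are not stated in the text: that the ACPI region starts exactly where the SMBIOS region ends, that the shift moves both regions upward, that the 4.17 SMBIOS region is 32 KiB (inferred from 4 KiB plus the "at least 28 KB" figure), and that the SMBIOS table itself is about 2 KiB. The function name is hypothetical.

```python
KIB = 1024


def stale_acpi_header_survives(smbios_region, smbios_table, shift,
                               clear_unused=False):
    """Return True if the first boot's ACPI header is still intact after reboot.

    Offsets are relative to the start of the new SMBIOS region; the new ACPI
    region is assumed to begin right where the SMBIOS region ends.
    """
    new_acpi_base = smbios_region
    old_acpi_base = new_acpi_base - shift  # where the first boot's tables began
    if not 0 <= old_acpi_base < smbios_region:
        # Either nothing shifted (the new ACPI tables overwrite the old ones)
        # or the old header lies outside the region modelled here.
        return False
    overwritten_by_smbios = old_acpi_base < smbios_table
    return not (overwritten_by_smbios or clear_unused)


TABLE = 2 * KIB  # assumed SMBIOS table size

cases = {
    "4.16, 4 KiB region, 4 KiB shift": (4 * KIB, TABLE, 4 * KIB, False),
    "4.17, 32 KiB region, 4 KiB shift": (32 * KIB, TABLE, 4 * KIB, False),
    "4.17, 32 KiB region, no shift": (32 * KIB, TABLE, 0, False),
    "4.17 with cleared region": (32 * KIB, TABLE, 4 * KIB, True),
}
for name, args in cases.items():
    print(f"{name}: stale header survives = {stale_acpi_header_survives(*args)}")
```

Under these assumptions only the second case leaves the stale header readable, which matches the behaviour described for the Librem Mini on 4.17.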
The coreboot community agreed that no allocated memory should be left uninitialized. With that analysis and support, this was the right solution. I submitted this patch to coreboot, where reviewers asked if Linux logged an error during boot.
Remember when I said the Librem Mini doesn’t have a UART? That’s not completely true – the UARTs do exist, but there are no connectors and some of the circuitry is unpopulated. I soldered a connector onto the UART pads and configured both coreboot and Linux to log to it.
I still didn’t know why Linux found the stale ACPI signature at all, because it is supposed to search just below 1MB, not at the much higher address where the stale signature actually was. It seemed clear that Linux was finding it, though, and since coreboot never intended to leave this stale signature behind, the precise reason it was found didn’t matter in this case.
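For reference, the search just below 1MB looks roughly like the following Python sketch. It follows the legacy BIOS rule from the ACPI specification: the ‘RSD PTR ’ signature must sit on a 16-byte boundary between 0xE0000 and 0xFFFFF, and the first 20 bytes must sum to zero modulo 256. The sketch skips the EBDA search and the extended ACPI 2.0 checksum, it is not the Linux implementation, and the RSDT address in the example is made up.

```python
import struct

RSDP_SIG = b"RSD PTR "
SCAN_START, SCAN_END = 0xE0000, 0x100000  # legacy BIOS area below 1 MiB


def make_rsdp_v1(oem_id: bytes, rsdt_addr: int) -> bytes:
    """Build a 20-byte ACPI 1.0 style RSDP with a valid checksum."""
    body = struct.pack("<8sB6sBI", RSDP_SIG, 0, oem_id, 0, rsdt_addr)
    checksum = (-sum(body)) & 0xFF  # makes all 20 bytes sum to 0 mod 256
    return body[:8] + bytes([checksum]) + body[9:]


def find_rsdp(mem: bytes):
    """Scan 16-byte boundaries below 1 MiB for a checksummed RSDP."""
    for addr in range(SCAN_START, SCAN_END, 16):
        candidate = mem[addr:addr + 20]
        if candidate[:8] == RSDP_SIG and sum(candidate) & 0xFF == 0:
            return addr
    return None


# Fake first megabyte of memory. 0xF6190 matches the RSDP address in the
# boot log; the RSDT address 0x12345000 is a placeholder.
mem = bytearray(0x100000)
mem[0xE0000:0xE0008] = RSDP_SIG  # decoy: signature only, checksum is wrong
mem[0xF6190:0xF6190 + 20] = make_rsdp_v1(b"COREv4", 0x12345000)

rsdp_addr = find_rsdp(mem)
print(hex(rsdp_addr))
```

A scan like this never looks at addresses above 1 MiB, which is why a stale signature far higher in memory should not have been found this way.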
Now that I had boot logs over the UART, there was no doubt: Linux was definitely finding the stale signature. Take a look at the garbage in the reboot log:
Normal cold boot:
[ 0.008615] ACPI: RSDP 0x00000000000F6190 000024 (v02 COREv4)
[ 0.008619] ACPI: XSDT 0x0000000099B480E0 00005C (v01 COREv4 COREBOOT 00000000 CORE 20220331)
[ 0.008624] ACPI: FACP 0x0000000099B4A2A0 000114 (v06 COREv4 COREBOOT 00000000 CORE 20220331)
[ 0.008634] ACPI: DSDT 0x0000000099B48280 00201F (v02 COREv4 COREBOOT 20110725 INTL 20220331)
...

Reboot with corrupt table:
[ 0.008820] ACPI: RSDP 0x00000000000F6190 000024 (v02 COREv4)
[ 0.008823] ACPI: XSDT 0x0000000099B480E0 00005C (v01 COREv4 COREBOOT 00000000 CORE 20220331)
[ 0.008828] ACPI: ???G 0x0000000099B4A2A0 20002001 (v00 ?G?$ 47020100 ?, 47020100)
[ 0.008831] ACPI: �y 0x0000000099B4A3C0 54523882 (v67 ?_HID? A�? 65520D4E al T 20656D69)
...
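Garbage like this is easy to spot by eye, but it can also be flagged automatically when running reboot tests in bulk. The Python sketch below is a heuristic, not a kernel or coreboot tool: it assumes the log line layout shown above, treats any signature that is not four characters from A-Z, 0-9 or underscore as suspicious, and uses an arbitrary 1 MiB limit on table length.

```python
import re

# Matches the "ACPI: <sig> 0x<address> <length>" part of a kernel log line.
LINE_RE = re.compile(
    r"ACPI: (?P<sig>.*?) 0x(?P<addr>[0-9A-Fa-f]{16}) (?P<length>[0-9A-Fa-f]+)"
)
VALID_SIG = re.compile(r"[A-Z0-9_]{4}")


def suspicious_acpi_lines(log_lines, max_len=0x100000):
    """Return log lines whose ACPI signature or table length looks corrupt."""
    flagged = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an ACPI table line
        bad_sig = VALID_SIG.fullmatch(m["sig"]) is None
        too_long = int(m["length"], 16) > max_len
        if bad_sig or too_long:
            flagged.append(line)
    return flagged


sample = [
    "[ 0.008823] ACPI: XSDT 0x0000000099B480E0 00005C "
    "(v01 COREv4 COREBOOT 00000000 CORE 20220331)",
    "[ 0.008828] ACPI: ???G 0x0000000099B4A2A0 20002001 "
    "(v00 ?G?$ 47020100 ?, 47020100)",
]
flagged = suspicious_acpi_lines(sample)
for line in flagged:
    print("suspicious:", line)
```

Only the second sample line is flagged, both for its signature and for its implausible length.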
The change then landed in coreboot, solving the problem! I included this patch in our 4.17 release. It is upstreamed for coreboot 4.18, fixing this issue for any affected boards!