As we outlined in a previous post, the Librem 14 is the first Purism laptop to ship with our new, free software Librem-EC firmware for the laptop’s embedded controller (EC). This was a big undertaking, and as with any effort of this magnitude, issues arise in corner cases that often don’t show themselves during developmental testing, when only a small number of devices are tested. One such issue was with the power sequencing — the order and timing of all the different voltage rails and power sources/signals in the laptop.
We were preparing to ship out the first batch of Librem 14s right as we were putting the finishing touches on the firmware (both for the embedded controller, and the main coreboot/Pureboot firmware), and everything was looking good to finally get devices flashed and into our users’ hands. But as we flashed the laptops’ firmware to prep for shipping, a small but not insignificant number of devices were failing to boot. At first we thought the issue was with our flashing process, which was new for the Librem 14, as both the EC and main firmware are flashed in sequence on a live system. But re-flashing these problematic devices externally (using a USB flash programmer and a chip clip) did not resolve the issue, so the issue wasn’t with the flashing process itself.
Since our flashing process didn’t appear to be the issue, the next step was to determine if the issue lie within the EC firmware or coreboot. As it is easier to get live debug output from the EC than coreboot, we started there. We compiled and flashed a debug build of the EC firmware, and attached the debugger. The debug console showed that the EC was booting and properly transitioning the main CPU from off (the S5 power state) to normal boot (the S0 power state). Since the laptop was in S0, it meant that coreboot should be running, so it was time to see what was going on there.
Getting debug output from coreboot is slightly more difficult than the EC, as the most common method — serial UART output — isn’t exposed anywhere on the Librem 14’s mainboard. Luckily we have another option: the SPI flash console (developed in-house and upstreamed into coreboot), which writes the coreboot console log to a dedicated region of the firmware flash chip itself. We compiled and flashed a debug build of coreboot with the flash console debugger, attempted a boot, and then read the flash chip back to see where coreboot was dying.
Reading back the flash console log showed that RAM init (performed by Intel’s FSP-m blob) was failing, but the error code was so generic it provided no clue as to the root cause of the failure. At this point, we’d need to get serial/UART output somehow, in order to get more info from FSP as to the reason for the failure. Reviewing the schematics showed a promising location to get the UART output (TX) line from the CPU, and next-day delivery from a friendly internet retailer for some needed hardware meant we were in business — or so we thought. Despite verifying all of the hardware separately (and verifying coreboot was correctly configured for UART output), no debug output was received. We ordered more hardware, and spent more time verifying all the links in the chain independently, to no avail.
As we were unable to get any debug output from coreboot, we wanted to verify that the hardware was working, which we could also do from a booted OS while running the proprietary EC and system firmware. One of the problematic devices was flashed back to “stock” configuration, and booted right up without issue (as expected), but disappointingly still failed to provide any output via the UART debug port. On a whim, we flashed coreboot on the device (with the proprietary EC firmware), and to our surprise it booted right up. This serendipitous occurrence told us that the issue was almost certainly in our Librem EC firmware, not in coreboot.
To give a bit of background, the power sequence to boot a modern CPU (transition from S5 to S0) is a very, very complicated beast. It requires a specific sequence and precise timing between the EC, PCH, and voltage regulators for the enablement of the power rails and “power good” signals. There’s also an additional low power state (DS5 / deep S5) in which the EC sits when it first gets power (either from the internal battery, or an external power source), before anything else happens. So we need to precisely manage the order and timing of turning on power sources (rails) and signals (“power good”) from DS5 to S5 and then to S0.
When developing the Librem EC firmware, we didn’t start from a completely blank slate. We based our firmware on System 76’s open EC firmware (we forked it, in free software development terms), and we had the source code to the proprietary EC firmware. Both of these had their own baggage though: the original code was used for a different EC chip, and our board design/layout was very different; the proprietary EC code was a spaghetti mess with headers indicating development in the 2006-2009 time frame, and no comments or documentation. Despite these hurdles, we managed to reverse engineer the power sequencing from the proprietary EC code, and then apply it to the Librem EC code, along with all the other customization needed for the Librem 14. And it worked perfectly on all our development machines.
But now with large-scale flashing of production devices, we were experiencing boot failures. We knew that 1) the problem was with RAM initialization, and 2) the problem was in the EC firmware. The EC doesn’t directly control the power to the RAM, but it does tell the main CPU (or more precisely, the platform control hub / PCH) that power to the RAM is on and stable via the aforementioned “power good” signals. These signals tell the PCH that it’s safe to turn on other hardware components, and eventually to bring the CPU out of reset and start the boot process (this is the transition from the S5 power state to S0).
Figuring that these “power good” signals were a good place to start looking, we again compared the proprietary EC code to our Librem EC firmware. This reexamination revealed that although we’d matched the sequence and timing of the power rails/signals, the way certain functions were called in the Librem EC code meant that there was some variability in the timing of the enablement of two of the “power good” signals. Under certain conditions, it was possible for them to be turned on too early, before the voltage regulator had stabilized the power to the RAM.
Having identified the potential root cause, we adjusted the Librem EC code to ensure those signals didn’t turn on until the voltage regulator indicated the RAM was ready, crossed our fingers, and began testing. To our collective relief, this small adjustment to the power sequencing did indeed fix the issue. Our factory images were updated, and all previously “bricked” Librem 14s were updated with the new Librem EC firmware.
But the issue only affected a small percentage of Librem 14s – why were some affected and not others? That’s a damn good question, and one we wish we had a good answer for. There’s always some variation in hardware components, and even when all components are within spec, a few microseconds difference across a number of components can add up to enough to make the difference between the laptop booting or not. The good news is now that the issue has been identified and resolved, all future Librem 14s will ship with the updated EC firmware, and existing ones will receive it as an update at some point in the future (less critical here, since these devices are functioning normally).