Youness Alaoui – Purism

Adventures with coreboot and NVM Express storage

Youness Alaoui — Thu, 11 Oct 2018 23:11:21 +0000

Let me tell you how I made NVMe SSD support work on the first generation Librem laptops. This story is pretty old, from before the Librem 13 version 2 was even released, so it has been simplified and brought back to the current state of things as much as possible. The solutions presented here have been implemented a long time ago in our coreboot ports, but the technical insights you may derive from this post today should prove interesting nonetheless.

During internal beta testing of the install script a while ago, we realized that coreboot didn’t work with our NVMe SSDs, as all my testing had been done with a SATA M.2 SSD. I spent some time fixing coreboot so that it would initialize the NVMe SSD, and SeaBIOS so that it could boot from the NVMe drive, and then I figured out how to fix the NVMe issues I had been having after Linux booted.

The story began with my blog post about the interference of the AMI BIOS with coreboot. What I didn’t mention back then is that after I figured out the issue and managed to unbrick Francois’ Librem, he wasn’t able to boot into his SSD from coreboot because it wasn’t getting detected. I then realized that he had an NVMe SSD and not a SATA SSD.

A SATA drive is controlled using the SATA controller on the motherboard which talks to the SATA drive using 4 data lines.
An NVMe drive is actually a PCIe device all on its own, which could use up to 16 data lines (4 per lane, and the M.2 specification defines up to 4 lanes per device).

If a SATA device is detected, the integrated SATA controller talks to the drive using the SATA protocol. If a PCIe device is detected, the device is initialized like any other PCIe device (the Wi-Fi module, for example) and, in the case of NVMe drives, the NVMe protocol is used directly (without passing through an onboard controller) to communicate with the device.

With that knowledge, I figured out my simple mistake then: the PCIe port used for the M.2 connector is Port #6, which is a “Flexible I/O” which can be used for either SATA or PCIe according to the Intel Broadwell datasheet. Unfortunately, in the Librem 13 coreboot configuration, the PCIe Port #6 was disabled (since it was never used, but that was only because I only ever tried a SATA drive). So the fix was simple: enable the PCIe port #6, and once coreboot initializes that PCIe port, the NVMe drive is initialized and working.

Francois tested this for me and confirmed he could boot on his drive. I needed to do my own tests however, so I ordered an NVMe drive (The Intel 600p Series SSD) and before I received it, someone (a regular Librem user) found and decided to test my script. With a lot of courage and determination, he was the first non-purism-employee beta tester of the coreboot install script and unfortunately, it didn’t work for him. While the coreboot install was fine, SeaBIOS wasn’t detecting his NVMe drive (he could boot into a live USB and flash back his Factory BIOS, so nothing to be alarmed about). I didn’t know why it worked for Francois but not for jsparber (our volunteer beta tester). I then realized that SeaBIOS itself didn’t have NVMe support, or more precisely, the NVMe support that was added to SeaBIOS was never tested outside of the qemu emulator, and was actually disabled for real hardware, so I enabled NVMe support for non-qemu hardware and sent the updated image to our beta tester who confirmed it to be working.

Why was it working for Francois if SeaBIOS didn’t have NVMe support, though? That’s a bit mysterious, but I think that his specific NVMe drive had some sort of SATA-compatibility mode in order to allow booting from older BIOS that don’t support NVMe devices.

Once I received my own NVMe SSD, I thought that it would just be a formality to get it to work, and indeed, it was detected by SeaBIOS but I couldn’t boot on it because it was still blank, so I tried to install PureOS on it. Unfortunately, that failed. I was getting an error halfway through the installation and the NVMe device was disappearing completely. My dmesg output had (among a flood of I/O errors) :

nvme 0000:04:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
nvme 0000:04:00.0: Refused to change power state, currently in D3
nvme nvme0: Removing after probe failure status: -19
nvme0n1: detected capacity change from 256060514304 to 0
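One detail in this log is worth a closer look: both register values are all ones. A read from a PCI device that has stopped responding typically comes back as all ones, which is consistent with the drive "disappearing". The sketch below is only an illustration under that assumption; the helper names are hypothetical and are not taken from the kernel. It also decodes the two low bits of the PCI power management control/status register, which encode the device power state (0 = D0 through 3 = D3), so a reading of 0xffff would decode as D3 in this model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a 32-bit register read came back as all
 * ones, which is what a read from an unresponsive PCI device usually
 * looks like. */
bool read_is_all_ones32(uint32_t value)
{
    return value == UINT32_MAX;
}

/* The two low bits of the PCI power management control/status register
 * encode the power state: 0 = D0, 1 = D1, 2 = D2, 3 = D3. */
unsigned pm_state_from_pmcsr(uint16_t pmcsr)
{
    return pmcsr & 0x3u;
}

/* Convert a byte count (as printed in the capacity message) to whole
 * decimal gigabytes. */
uint64_t bytes_to_gb(uint64_t bytes)
{
    return bytes / 1000000000ULL;
}
```

Feeding it the values from the log, CSTS=0xffffffff is reported as all ones, a status word of 0xffff decodes as state 3, and the capacity of 256060514304 bytes corresponds to a 256 GB drive.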

After searching for a long time, I found a few mentions of this error. At first, I thought that this Launchpad bug was the one affecting me, but it was reported as fixed in kernels 4.4.0 and 4.8. The PureOS live USB was sporting kernel 4.7 back then, so I thought that maybe I needed a higher kernel version still. I tried to install Ubuntu 17.04, but it wouldn’t even boot, and the Fedora 25 live USB had the same issue, even though it had kernel 4.9.6. I decided to try the Arch Linux installer, which had the 4.10.6 kernel, and I had the same problems. Then I found this other bug, which says the problem might happen because kernel 4.10.0 added support for APST and some drives have quirks that make them fail when APST is enabled. The fix there was simple, a kernel option at boot and the bug should disappear, so I tried it, but no luck.

At that point, in order to remove coreboot from the equation, I had flashed back an AMI BIOS on my Librem, but I was still getting all these issues.

I gave up after a while, and I figured out that if it’s not a kernel issue, maybe it’s not a Linux issue at all, so I tried installing Windows on the NVMe, and the same issue happened again! “Well, if the problem happens with Windows and the factory BIOS, then there’s nothing I can do, the problem is with the drive itself, it’s defective! Right?”

While I was trying to find the original packaging to ship it back for a replacement, I had an idea: I put the NVMe drive in the Librem 13 v2 prototype that I had, and it worked! So I figured that the problem was with my own Librem 13 v1 which might have had defective hardware (maybe a scratch on the motherboard or something?)

However, for a week, while working on other things, I kept thinking that there must be something else I can do, that the issue can’t be as simple as “it’s a hardware issue”, but I didn’t know what more I could do if Windows+AMI Bios were failing, and the SSD itself was fine.

Then Francois told me that he was having issues with his Librem, where the NVMe device would “disappear”. This looked a lot like the problem I’ve been having, but for him, it wouldn’t happen within the first 5 minutes of use like me, it would happen after 48 hours instead, sometimes after putting the laptop to sleep, sometimes not—very unpredictable. Unfortunately, he had also reinstalled his system at the same time when he flashed coreboot, so this new problem could be coming from coreboot or from the new OS he installed—but lo and behold, after he flashed his original BIOS back, he was still having the same issue of his NVMe disappearing.

Since I don’t believe in coincidences, I decided to start my research again from scratch (forgetting all the various links, explanations and datasheets I had found) and look at the problem from a blank slate. After I searched again for the error I was getting, I found a post on the Lenovo forums where someone was complaining about the same issue on their ThinkPad X270, and the thread was marked as “SOLVED”, so that was very promising. After reading through it, I found that the solution was a new BIOS update for the Lenovo X270 that fixed the problem. And when I looked at the changelog of that update, this is what it said about NVMe support:

- (New) Disable NVMe L1.2
- (New) Disable NVMe CLKREQ

Now that was interesting… what was this “L1.2” and “CLKREQ”? I did some more research and I found an article that explained that L1.2 is simply a lower-power mode of operation for a PCIe device. Going back to the original dmesg output, I then realized that it said something very interesting about the drive and its power state:

Refused to change power state, currently in D3

According to this MSDN article, the “D3” there refers to a device power state, and more precisely, D3 is the lowest power state of the device. That seems to coincide with the L1.2 PCIe state, which is also the lowest power state. I decided to do what Lenovo did and disable CLKREQ and L1.2 in the PCIe device. The CLKREQ signal seems to be used by the CPU or by the device to request activation of the clock and to allow exiting L1.2. I found a document describing the PCI specification that states:

“The CLKREQ# signal is also used by the L1 PM Substates mechanism. In this case, CLKREQ# can be asserted by either the system or the device to initiate an L1 exit”

The “L1 PM substates” is referring to that L1.2 (L1.1 and L1.2 are referred to as L1 substates, and “PM” here means “Power Management”), so my theory was that the drive goes into low power mode, and when it needs to get out of it, the CLKREQ should be used, but wasn’t working, causing the drive to never know that it needs to wake up. Disabling CLKREQ would fix it because some other mechanism would be used to wake the drive, or disabling L1.2 would also fix it because the drive would never go into that D3 low power mode.

I looked extensively at the Broadwell LP datasheet and I saw that the CLKREQ for PCIe port #6 is multiplexed with GPIO 23, and looking at the gpio.h in coreboot for the Librem 13, I see GPIO 23 set as “INPUT”, while GPIO 18 and 19 (which are for PCIe ports #1 and #2) are set as “NATIVE”. So I’ve set GPIO 23 to NATIVE and tried it, but this made NVMe undetectable; coreboot was simply unable to detect anything on port #6, and I have no idea why. Not only do I not know what a “native” gpio means, but I also don’t know why changing it from input to native would cause the PCI bus scan to fail.

Either way, I’ve set it back to “INPUT” and tried to see how to disable CLKREQ from some PCI configuration. Unfortunately, the code in soc/intel/broadwell/pcie.c—which mentions “CLKREQ”—does things that I can’t understand, it modifies/sets PCI configuration values on offsets that don’t match anything in the datasheet and I have no idea if I’ve been reading the datasheet incorrectly or if the code is wrong somehow.

One simple example is this code snippet:

/* Per-Port CLKREQ# handling. */
if (gpio_is_native(18 + rp - 1))
    /*
     * In addition to D28Fx PCICFG 420h[30:29] = 11b,
     * set 420h[17] = 0b and 420[0] = 1b for L1 SubState.
     */
    pci_update_config32(dev, 0x420, ~0x20000, (3 << 29) | 1);

First of all, it’s checking for the GPIO to be native (which is what I did before without success). It then sets the PCI configuration at offset 0x420, but the only offset 0x420 I see in the datasheet (page 736) is the “PCI Express* Configuration Register 3”:

Bit    Description

31:1   Reserved 
0      PEC3 Field 1—R/W. BIOS may set this bit to 1b

Possibly these 31 “Reserved” bits are only described in a confidential Intel document, but in any case I didn’t know what that code was doing and I wouldn’t know what to change to make it behave the way I want it to.
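Even without knowing what the reserved bits mean, the bit arithmetic of that call can be worked out. The sketch below assumes that pci_update_config32 performs a read, an AND with the mask, an OR with the value, and a write back; the function names update32 and apply_clkreq_update are hypothetical stand-ins that model only the arithmetic on a plain integer, not the actual PCI access.

```c
#include <stdint.h>

/* Pure model of a read-modify-write update: keep the bits selected by
 * and_mask, then set the bits in or_value. */
uint32_t update32(uint32_t old, uint32_t and_mask, uint32_t or_value)
{
    return (old & and_mask) | or_value;
}

/* Same arguments as the snippet uses for offset 0x420:
 * ~0x20000 clears bit 17, (3 << 29) sets bits 30:29, and 1 sets bit 0. */
uint32_t apply_clkreq_update(uint32_t old)
{
    return update32(old, ~UINT32_C(0x20000), (UINT32_C(3) << 29) | UINT32_C(1));
}
```

Starting from a register value of 0, the result is 0x60000001; starting from 0xFFFFFFFF, the result is 0xFFFDFFFF. This matches the comment in the snippet: bits 30:29 become 11b, bit 17 becomes 0b and bit 0 becomes 1b.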

I eventually found that this low power mechanism is called “ASPM” and the cbmem output from coreboot had a line that said “ASPM: enabled L1” which didn’t match any string in that soc/intel/broadwell/pcie.c file, so after I searched for the “ASPM:” string, I found that there is code in device/pciexp_device.c which is what actually configures the ASPM on the device!

The code in pciexp_device is rather straightforward since it does this:

/* Check for and enable Common Clock */
if (IS_ENABLED(CONFIG_PCIEXP_COMMON_CLOCK))
    pciexp_enable_common_clock(root, root_cap, dev, cap);

/* Check if per port CLK req is supported by endpoint*/
if (IS_ENABLED(CONFIG_PCIEXP_CLK_PM))
    pciexp_enable_clock_power_pm(dev, cap);

/* Enable L1 Sub-State when both root port and endpoint support */
if (IS_ENABLED(CONFIG_PCIEXP_L1_SUB_STATE))
    pciexp_config_L1_sub_state(root, dev);

/* Check for and enable ASPM */
if (IS_ENABLED(CONFIG_PCIEXP_ASPM))
    pciexp_enable_aspm(root, root_cap, dev, cap);

Unfortunately, those configs for L1_SUB_STATE and CLK_PM are force-enabled in the menuconfig of coreboot, so I couldn’t disable them (I had already noticed them before but couldn’t disable them). I just changed the code to remove the line that calls the pciexp_enable_clock_power_pm function, and tested it. I could then see in the cbmem log that coreboot didn’t enable CLKREQ anymore, but the install was still failing, so I also removed the code that calls pciexp_config_L1_sub_state, and tried again, and my installation was successful!

I had previously done around 50 installation attempts on that NVMe and the drive would always crash between 50% and 80% through the installation. With my new changes, I had now done 3 successive installs that went all the way to 100% without crashing a single time. This demonstrated that my changes worked, that disabling the CLKREQ+L1.2 substate on the NVMe drive fixed the issue. My “fix” was obviously not the most elegant way of solving the issue, but I was now happy to report that we would be able to use the NVMe drives.

Some users might be wondering whether not being able to put their NVMe drives into low power mode would affect battery life, and the answer is, “In theory yes”, but in practice the difference would be a very small percentage. Back then, I doubted anyone would actually notice it, and so far it seems nobody did, so it looks like the issue is fairly minor in the grand scheme of things.

Purism customers now had working NVMe support for their Librem laptops running coreboot, and this solved a big headache for our operations & support team (who had temporarily put on hold all NVMe-based orders because of the bug, favouring the SATA-based laptop configurations as they were more reliable at that time).

Interestingly, this also meant that we had a superior user experience to similar laptops with a proprietary BIOS: users now had NVMe drives working with coreboot, NVMe drives that had never worked on the AMI BIOS we compared to!

During my testing of the install script, I had also tweaked some of the coreboot options and we had coreboot booting in about 350 milliseconds, which is a lot faster than the few seconds it took for the AMI BIOS to boot.

My fix was merged into coreboot in July 2017, via these two patches and this patch to SeaBIOS.

Some additional notes…

One might wonder if a possible reason behind the problem could have been an error in the design of the motherboard on the first-generation Librems, where the CLKREQ signal wouldn’t be properly routed, though it doesn’t look that way according to the schematics, so I’m not entirely sure why it was happening after all. At least the fix was “simple” enough, and it worked on the Librem 13 v1 I had available to test.

Interestingly enough, François’ NVMe drive kept failing on his Librem 13 v1 after 2 to 3 days of use, even with my final fix. I was unable to figure out why that was still happening back then; why would his NVMe drive go into D3 power state if coreboot wasn’t enabling the L1.2 substate anymore? We eventually tabled the matter for a while as François switched to the newly released Librem 13 “version 2” a few weeks later. The answer came to me completely by chance, a year or so later, as I was looking through some PCIe code and saw that the PCIe device itself could have “L1.2 support” set even if it’s not enabled, so maybe his Linux kernel was enabling L1.2 if it saw that the device “supported” it. Unfortunately, by then, François wasn’t able to reproduce the issue on his NVMe drive anymore even with his old laptop, so it was impossible to test our hypothesis. The question of why he had started having those issues “all of a sudden” back then (when he didn’t encounter such issues before) shall remain a mystery!
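The distinction between “supported” and “enabled” can be sketched in code. In the L1 PM Substates extended capability, as I understand its layout, a capabilities register advertises what the device can do and a separate control register records what has actually been switched on, with the L1.2 bits at positions 0 (PCI-PM L1.2) and 2 (ASPM L1.2) in both. The macro and function names below are illustrative only; they are not coreboot’s or the kernel’s.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit names, assumed to apply to both the capabilities
 * register and the first control register of the L1 PM Substates
 * extended capability. */
#define L1SS_PCIPM_L1_2 (1u << 0)
#define L1SS_PCIPM_L1_1 (1u << 1)
#define L1SS_ASPM_L1_2  (1u << 2)
#define L1SS_ASPM_L1_1  (1u << 3)

/* Does the device advertise any flavour of L1.2? */
bool l1_2_supported(uint32_t capabilities)
{
    return (capabilities & (L1SS_PCIPM_L1_2 | L1SS_ASPM_L1_2)) != 0;
}

/* Has any flavour of L1.2 actually been turned on? */
bool l1_2_enabled(uint32_t control1)
{
    return (control1 & (L1SS_PCIPM_L1_2 | L1SS_ASPM_L1_2)) != 0;
}

/* The situation described above: firmware left L1.2 off, but the device
 * still advertises it, so later software could choose to enable it. */
bool l1_2_left_for_os_to_enable(uint32_t capabilities, uint32_t control1)
{
    return l1_2_supported(capabilities) && !l1_2_enabled(control1);
}
```

With a capabilities value of 0x1f and a control value of 0, the last function returns true, which is the state the hypothesis relies on. Whether the kernel on that laptop actually acted on it remains untested, as noted above.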

The post Adventures with coreboot and NVM Express storage appeared first on Purism.

Intel FSP reverse engineering: finding the real entry point!

Youness Alaoui — Mon, 02 Apr 2018 13:37:27 +0000

2018-05-10 UPDATE: Intel politely asked Purism to remove this document which Intel believes may conflict with a licensing term. Since this post was informational only and has no impact on the future goals of Purism, we have complied. If you would like the repository link of the Intel FSP provided from Intel, please visit their publicly available code on the subject.

2018-04-23 UPDATE: after receiving a courtesy request from Intel’s Director of Software Infrastructure, we have decided to remove this post’s technical contents while we investigate our options. You are still welcome to learn about reverse engineering in general with my introductory post on the matter, Introduction to Reverse Engineering: A Primer Guide.

Hi everyone, it’s time for another blog post from your favorite Purism Reverse Engineer (that’s me! ’cause I’m the only one…)!

After attending 34C3 in Leipzig at the end of December, in which we (Zlatan and me) met with some of you, and had a lot of fun, I took some time off to travel Europe and fall victim to the horrible Influenza virus that so many people caught this year. After a couple more weeks of bed rest, I continued my saga in trying to find the real entry point of the Intel FSP-S module.

Here’s the non-technical summary of the current situation: I made some good progress in reverse engineering both the FSP-S and FSP-M and I’m very happy with it so far. Unfortunately, all the code I’ve seen so far has been about setting up the FSP itself, so I haven’t actually been able to start reverse engineering the actual Silicon initialization code.


February 2018 coreboot update now available

Youness Alaoui — Thu, 22 Feb 2018 22:48:45 +0000

Hey everyone, I’m happy to announce the release of an update to our coreboot images for Librem 13 v2 and Librem 15 v3 machines.

All new laptops will come pre-loaded with this new update, and everyone else can update their machines using our existing build script which was updated to build the newest image. Some important remarks:

Please read the instructions below to make sure the image gets built properly and make sure to select the correct machine type in the menu for the build script.
The build script was initially written as a tool for internal use, and therefore isn’t as polished as it could be, so if you want something that just quickly applies updates without building/compiling the whole thing, we hope to provide such a (simpler) script in the future.

What’s new?

This is a follow-up to Kyle’s previous blog post, and now that the image has been fully tested, you can all enjoy it and get one of our most requested features: VT-d support, which allows Qubes 4.0 to work.

The new version is “4.7-Purism-1” and here is the ChangeLog:

Update to coreboot 4.7

Update to FSP 2.0

Add IOMMU support

Enable TPM support

Fixed ATA errors at 6Gbps

While coreboot 4.7 has not been officially released, it was “tagged” on October 31st in coreboot’s git repository, and this release is based on that tag with the IOMMU (VT-d) and TPM support added on top of it.

If your laptop came with the TPM chip installed, you need to update your coreboot image to this version in order to use the TPM hardware.

How to build it?

To build the latest coreboot image:

1. Download the build script:
   mkdir building-coreboot && cd building-coreboot && wget https://code.wp.puri.sm/kakaroto/coreboot-files/raw/master/build_coreboot.sh
2. Install the required dependencies:
   sudo apt-get install git build-essential bison flex m4 zlib1g-dev gnat libpci-dev libusb-dev libusb-1.0-0-dev dmidecode bsdiff python2.7
3. Run the script on your Librem machine:
   chmod +x build_coreboot.sh && ./build_coreboot.sh
4. Follow the instructions on the screen, be sure to select your correct Librem laptop revision (Librem 13v2 or Librem 15v3), and give it time to build the image.
5. Once done, if everything went according to plan, it will ask you if you want to flash the newly built image.
6. Make sure you are not running on low battery and select Yes.
7. Reboot your machine once the flashing process is done.

For matters specifically related to this build script (not related to how to use a TPM per se), you may also want to check out the main forum thread about our coreboot build script, where discussion and testing has been going on over the past few months.

Verifying the presence of a TPM

If you are unsure whether or not you have a TPM installed on your system, install the tpm-tools package and then run sudo tpm_version to confirm that a TPM is detected on your system.

$ sudo tpm_version
  TPM 1.2 Version Info:
  Chip Version:        1.2.4.40
  Spec Level:          2
  Errata Revision:     3
  TPM Vendor ID:       IFX
  Vendor Specific data: 04280077 0074706d 3631ffff ff
  TPM Version:         01010000
  Manufacturer Info:   49465800

If your machine came with a TPM, you can now take advantage of its capabilities, if you already have particular uses planned for it. Enjoy!


A Primer Guide to Reverse Engineering

Youness Alaoui — Fri, 17 Nov 2017 20:00:15 +0000

Over the years, many people have asked me to teach them what I do, or to explain to them how to reverse engineer assembly code in general. Sometimes I hear the infamous “How hard can it be?” catchphrase. Last week, someone I was talking with thought that assembly language is just like a regular programming language, but in binary form. It’s easy to make that mistake if you’ve never seen what assembly is or looks like. Historically, I’ve always said that reverse engineering and ASM are “too complicated to explain” or that “If you need help to get started, then you won’t be able to finish it on your own”, along with various other vague responses. I often wanted to explain to others why I said things like that, but I never found a way to do it. You see, when something is complex, it’s easy to say that it’s complex, but it’s much harder to explain to people why it’s complex.

I was lucky to recently stumble, while reverse engineering, onto a little function that was both simple and complex, where figuring out what it does was an interesting challenge that I can easily walk you through. This function wasn’t difficult to understand, and it’s far from being one of the hard or complex things to reverse engineer, but it is “small and complex enough” to be a perfect example to explain without writing an entire book or getting into the more complex aspects of reverse engineering. So today’s post serves as a “primer” guide to reverse engineering for all of those interested in the subject. It is required reading in order to understand the next blog posts I will be writing about reverse engineering.

Note: some function and component names in this blog post have been scrambled, with names such as “BOB”, “ALICE” and “QUEEN”.

Ready? Strap on your geek helmet and let’s get started!

DISCLAIMER: I might make false statements in the blog post below, some by mistake, some intentionally for the purpose of simplifying the explanations. For example, when I say below that there are 9 registers in x86, I know there are more (SSE, FPU, or even just the DS or EFLAGS registers, or purposefully using EAX instead of RAX, etc.), but I just don’t want to complicate matters by going too wide in my explanations.

A prelude

First things first, you need to understand some basic concepts, such as “what is ASM exactly”. I will explain some basic concepts but not all the basic concepts you might need. I will assume that you know at least what a programming language is and know how to write a simple “hello world” in at least one language, otherwise you’ll be completely lost.

So, ASM is the Assembly language, but it’s not the actual binary code that executes on the machine. It is, however, very similar to it. To be more exact, the assembly language is a textual representation of the binary instructions given to the microprocessor. You see, when you compile your regular C program into an executable, the compiler will transform all your code into some very, very, very basic instructions. Those instructions are what the CPU will understand and execute. By combining a lot of small, simple and specific instructions, you can do more complex things. That’s the basis of any programming language, of course, but with assembly, the building blocks that you get are very limited. Before I talk about instructions, I want to explain two concepts which you’ll need to follow the rest of the story.

The stack

First I’ll explain what “the stack” is. You may have heard of it before, or maybe you didn’t, but the important thing to know is that when you write code, you have two types of memory:

The first one is your “dynamic memory”, that’s when you call ‘malloc’ or ‘new’ to allocate new memory, this goes from your RAM upward (or left-to-right), in the sense that if you allocate 10 bytes, you’ll first get address 0x1000 for example, then when you allocate another 30 bytes, you’ll get address 0x100A, then if you allocate another 16 bytes, you’ll get 0x1028, etc.
The second type of memory that you have access to is the stack, which is different, instead it grows downward (or right-to-left), and it’s used to store local variables in a function. So if you start with the stack at address 0x8000, then when you enter a function with 16 bytes worth of local variables, your stack now points to address 0x7FF0, then you enter another function with 64 bytes worth of local variables, and your stack now points to address 0x7FB0, etc. The way the stack works is by “stacking” data into it, you “push” data in the stack, which puts the variable/data into the stack and moves the stack pointer down, you can’t remove an item from anywhere in the stack, you can always only remove (pop) the last item you added (pushed). A stack is actually an abstract type of data, like a list, an array, a dictionary, etc. You can read more about what a stack is on wikipedia and it shows you how you can add and remove items on a stack with this image:

The image shows you what we call a LIFO (Last-In-First-Out) and that’s what a stack is. In the case of the computer’s stack, it grows downward in the RAM (as opposed to upward in the above image) and is used to store local variables as well as the return address for your function (the instruction that comes after the call to your function in the parent function). So when you look at a stack, you will see multiple “frames”, you’ll see your current function’s stack with all its variables, then the return address of the function that called it, and above it, you’ll see the previous function’s frame with its own variables and the address of the function that called it, and above, etc. all the way to the main function which resides at the top of the stack.
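The address arithmetic from the two examples above can be reproduced with a toy model. This is only a sketch: real allocators add headers and alignment, and real stack frames also hold return addresses, so the types and function names here (toy_heap, toy_stack and friends) are hypothetical and exist purely to show the two directions of growth.

```c
#include <stdint.h>

/* Toy "dynamic memory": each allocation is handed the next free address,
 * and the free pointer moves upward by the requested size. */
typedef struct { uint32_t next; } toy_heap;

/* Toy stack: entering a function moves the stack pointer downward by the
 * size of its local variables; leaving the function moves it back up. */
typedef struct { uint32_t sp; } toy_stack;

uint32_t toy_alloc(toy_heap *heap, uint32_t size)
{
    uint32_t address = heap->next;
    heap->next += size;     /* heap grows upward */
    return address;
}

uint32_t toy_enter(toy_stack *stack, uint32_t locals_size)
{
    stack->sp -= locals_size;   /* stack grows downward */
    return stack->sp;
}

void toy_leave(toy_stack *stack, uint32_t locals_size)
{
    stack->sp += locals_size;   /* last frame in is the first one out */
}
```

Starting the heap at 0x1000 and allocating 10, 30 and 16 bytes yields 0x1000, 0x100A and 0x1028. Starting the stack at 0x8000 and entering functions with 16 and then 64 bytes of locals yields 0x7FF0 and 0x7FB0, and leaving the second function brings the stack pointer back to 0x7FF0.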

Here is another image that exemplifies this:

The registers

The second thing I want you to understand is that the processor has multiple “registers”. You can think of a register as a variable, but there are only 9 total registers on x86, with only 7 of them usable. So, on the x86 processor, the various registers are: EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP, EIP.

There are two registers in there that are special:

The EIP (Instruction Pointer) contains the address of the current instruction being executed.
The ESP (Stack Pointer) contains the address of the stack.

Access to the registers is extremely fast when compared to accessing the data in the RAM (the stack also resides on the RAM, but towards the end of it) and most operations (instructions) have to happen on registers. You’ll understand more when you read below about instructions, but basically, you can’t use an instruction to say “add value A to value B and store it into address C”, you’d need to say “move value A into register EAX, then move value B into register EBX, then add register EAX to register EBX and store the result in register ECX, then store the value of register ECX into the address C”.
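To make that sequence concrete, here is a small sketch that performs “add value A to value B and store it into address C” the long way, one register step at a time. The toy_regs structure and the function name are hypothetical, and memory is modeled as a plain array indexed by position rather than by real addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Three of the general-purpose registers, modeled as plain variables. */
typedef struct { uint32_t eax, ebx, ecx; } toy_regs;

/* Computes memory[c] = memory[a] + memory[b], but only ever operates on
 * registers, mirroring the four steps described in the text. */
void add_via_registers(uint32_t *memory, size_t a, size_t b, size_t c)
{
    toy_regs regs;

    regs.eax = memory[a];           /* move value A into EAX */
    regs.ebx = memory[b];           /* move value B into EBX */
    regs.ecx = regs.eax + regs.ebx; /* add EAX and EBX, result in ECX */
    memory[c] = regs.ecx;           /* store ECX at address C */
}
```

A single line of C such as memory[c] = memory[a] + memory[b] hides all four of these steps, which is part of why assembly listings are so much longer than the source they came from.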

The instructions

Let’s go back to explaining instructions now. As I explained before, the instructions are the basic building blocks of the programs, and they are very simple, they take the form of:

INS OP1, OP2, OP3

Where “INS” is the instruction, and OP1, OP2, OP3 are what we call the “operands”. Most instructions will only take 2 operands; some will take no operands, some will take one operand and others will take 3 operands. The operands are usually registers. Sometimes, the operand can be an actual value (what we call an “immediate value”) like “1”, “2” or “3”, and sometimes the operand is a relative position from a register, for example “[%eax + 4]”, meaning the address pointed to by the %eax register + 4 bytes. We’ll see more of that shortly. For now, here is the list of the most common instructions:

“MOV”: Move data from one operand into another
“ADD/SUB/MUL/DIV”: Add, Subtract, Multiply, Divide one operand with another and store the result in a register
“AND/OR/XOR/NOT/NEG”: Perform logical and/or/xor/not/negate operations on the operand
“SHL/SHR”: Shift Left/Shift Right the bits in the operand
“CMP/TEST”: Compare one register with an operand
“JMP/JZ/JNZ/JB/JS/etc.”: Jump to another instruction (Jump unconditionally, Jump if Zero, Jump if Not Zero, Jump if Below, Jump if Sign, etc.)
“PUSH/POP”: Push an operand into the stack, or pop a value from the stack into a register
“CALL”: Call a function. This is the equivalent of pushing the address of the next instruction into the stack, followed by a “JMP”. I’ll get into calling conventions later.
“RET”: Return from a function. This is the equivalent of doing a “POP %EIP”

That’s about it, that’s what most programs are doing. Of course, there’s a lot more instructions, you can see a full list here, but you’ll see that most of the other instructions are very obscure or very specific or variations on the above instructions, so really, this represents most of the instructions you’ll ever encounter.
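The relationship between PUSH, POP, CALL and RET from the list above can be modeled in a few lines. This is a toy sketch, not an emulator: the toy_cpu type, its tiny 16-word stack and the explicit instruction length argument are all assumptions made for illustration, and there is no bounds checking.

```c
#include <stdint.h>

#define TOY_STACK_WORDS 16

typedef struct {
    uint32_t eip;                     /* address of the current instruction */
    uint32_t esp;                     /* index into stack[], grows downward */
    uint32_t stack[TOY_STACK_WORDS];
} toy_cpu;

void toy_init(toy_cpu *cpu, uint32_t eip)
{
    cpu->eip = eip;
    cpu->esp = TOY_STACK_WORDS;       /* empty stack: pointer is at the top */
}

void toy_push(toy_cpu *cpu, uint32_t value)
{
    cpu->stack[--cpu->esp] = value;   /* move down, then store */
}

uint32_t toy_pop(toy_cpu *cpu)
{
    return cpu->stack[cpu->esp++];    /* load, then move back up */
}

/* CALL: push the address of the instruction after the call, then jump. */
void toy_call(toy_cpu *cpu, uint32_t target, uint32_t call_length)
{
    toy_push(cpu, cpu->eip + call_length);
    cpu->eip = target;
}

/* RET: pop the saved return address back into the instruction pointer. */
void toy_ret(toy_cpu *cpu)
{
    cpu->eip = toy_pop(cpu);
}
```

For example, a call located at address 0x100 that is 5 bytes long and targets 0x200 leaves 0x105 on the stack and 0x200 in the instruction pointer; the matching return restores 0x105 and leaves the stack empty again.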

I want to explain one thing before we go further down: there is an additional register I didn’t mention before called the FLAGS register, which is basically just a status register that contains “flags” that indicate when some arithmetic condition happened on the last arithmetic operation. For example, if you add 1 to 0xFFFFFFFF, it will give you ‘0’ but the “Carry flag” will be set in the FLAGS register (the “Overflow flag” is reserved for signed overflow, such as adding 1 to 0x7FFFFFFF). If you subtract 5 from 0, it will give you 0xFFFFFFFB and the “Sign flag” will be set because the result is negative, and if you subtract 3 from 3, the result will be zero and the “Zero flag” will be set.
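The flag behaviour can be checked with a small model of 32-bit addition and subtraction. The alu_result type and the function names are hypothetical; only four flags are modeled. Note that in this model the wrap-around of 0xFFFFFFFF + 1 shows up in the carry flag, while the overflow flag tracks signed overflow only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;
    bool zf;   /* zero flag: result is 0 */
    bool sf;   /* sign flag: top bit of the result is set */
    bool cf;   /* carry flag: unsigned wrap-around (or borrow) */
    bool of;   /* overflow flag: signed result does not fit */
} alu_result;

alu_result alu_add(uint32_t a, uint32_t b)
{
    alu_result r;
    r.result = a + b;
    r.zf = (r.result == 0);
    r.sf = (r.result >> 31) != 0;
    r.cf = r.result < a;
    /* Signed overflow: operands share a sign that the result lacks. */
    r.of = ((~(a ^ b) & (a ^ r.result)) >> 31) != 0;
    return r;
}

alu_result alu_sub(uint32_t a, uint32_t b)
{
    alu_result r;
    r.result = a - b;
    r.zf = (r.result == 0);
    r.sf = (r.result >> 31) != 0;
    r.cf = a < b;
    /* Signed overflow: operands differ in sign and the result's sign
     * differs from the first operand's. */
    r.of = (((a ^ b) & (a ^ r.result)) >> 31) != 0;
    return r;
}
```

Running the three cases from the text through it: 0xFFFFFFFF + 1 gives 0 with the zero and carry flags set, 0 - 5 gives 0xFFFFFFFB with the sign flag set, and 3 - 3 gives 0 with the zero flag set.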

I’ve shown you the “CMP” instruction, which is used to compare a register with an operand, but you might be wondering, “What does it mean exactly to ‘compare’?” Well, it’s simple: the CMP instruction is the same thing as the SUB instruction, in that it subtracts one operand from another, but the difference is that it doesn’t store the result anywhere. However, it does update the flags in the FLAGS register. For example, if I wanted to compare the %EAX register with the value ‘2’, and %EAX contains the value 3, this is what’s going to happen: 2 is subtracted from the value and the result is 1, but you don’t care about that. What you care about is that the ZF (Zero flag) is not set and the SF (Sign flag) is not set, which means that %eax and ‘2’ are not equal (otherwise, ZF would be set), and that the value in %eax is greater than 2 (because SF is not set), so you know that “%eax > 2”, and that’s what the CMP does.

The TEST instruction is very similar but it does a logical AND on the two operands for testing, so it’s used for comparing logical values instead of arithmetic values (“TEST %eax, 1” can be used to check if %eax contains an odd or even number for example).

This is useful because the next bunch of instructions I explained in the list above are the conditional jump instructions, like “JZ” (jump if zero), “JB” (jump if below), “JS” (jump if sign), etc. This is what is used to implement “if, for, while, switch/case, etc.”: it’s as simple as doing a “CMP” followed by a “JZ” or “JNZ” or “JB”, “JA”, “JS”, etc.

And you might be wondering what’s the difference between “jump if below”, “jump if sign” and “jump if less”, since they all mean that the comparison gave a negative result, right? Well, “jump if below” (JB) is used for unsigned integers, while “jump if less” (JL) is used for signed integers, and “jump if sign” (JS) can be misleading: an unsigned 3 - 4 would give us a very high positive result (0xFFFFFFFF), even though the Sign Flag gets set. In practice, JB checks the Carry Flag, JS checks the Sign Flag, and JL checks whether the Sign Flag differs from the Overflow Flag. See the Conditional Jump page for more details.
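Here is a small sketch of how those jump conditions relate to the flags left behind by a CMP. The struct and function names are hypothetical, and only the four conditions discussed above are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags left behind by "cmp a, b", which computes a - b and discards it. */
struct cmp_flags {
    bool cf, zf, sf, of;
};

struct cmp_flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b; /* the result itself is thrown away by CMP */
    struct cmp_flags f;
    f.cf = a < b;                             /* borrow: unsigned a < b */
    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;  /* signed overflow */
    return f;
}

/* Each predicate answers "would this conditional jump be taken?" */
bool jz_taken(struct cmp_flags f) { return f.zf; }
bool jb_taken(struct cmp_flags f) { return f.cf; }         /* unsigned a < b */
bool jl_taken(struct cmp_flags f) { return f.sf != f.of; } /* signed a < b */
bool js_taken(struct cmp_flags f) { return f.sf; }         /* result has its top bit set */
```

Comparing 0x80000000 with 1 shows why the three are not interchangeable: as a signed value the left side is the most negative integer, so JL is taken, but JB and JS are not.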

A practical example

Here’s a very small and simple practical example, if you have a simple C program like this:

int add_a_and_b(int a, int b);

int main() {
   return add_a_and_b(2, 3);
}

int add_a_and_b(int a, int b) {
   return a + b;
}

It would compile into something like this:

_main:
   push   3                ; Push the second argument '3' into the stack
   push   2                ; Push the first argument '2' into the stack
   call   _add_a_and_b     ; Call the _add_a_and_b function. This will put the address of the next
                           ; instruction (add) into the stack, then it will jump into the _add_a_and_b
                           ; function by putting the address of the first instruction in the _add_a_and_b
                           ; label (push %ebx) into the EIP register
   add    %esp, 8          ; Add 8 to the esp, which effectively pops out the two values we just pushed into it
   ret                     ; Return to the parent function.... 

_add_a_and_b:
   push   %ebx             ; We're going to modify %ebx, so we need to push it to the stack
                           ; so we can restore its value when we're done
   mov    %eax, [%esp+8]   ; Move the first argument (8 bytes above the stack pointer) into EAX
   mov    %ebx, [%esp+12]  ; Move the second argument (12 bytes above the stack pointer) into EBX
   add    %eax, %ebx       ; Add EAX and EBX and store the result into EAX
   pop    %ebx             ; Pop EBX to restore its previous value
   ret                     ; Return back into the main. This will pop the value on the stack (which was
                           ; the address of the next instruction in the main function that was pushed into
                           ; the stack when the 'call' instruction was executed) into the EIP register

Yep, something as simple as that, can be quite complicated in assembly. Well, it’s not really that complicated actually, but a couple of things can be confusing.

You have only 7 usable registers, and one stack. Every function gets its arguments passed through the stack, and can return its return value through the %eax register. If every function modified every register, your code would break, so every function has to ensure that the other registers are unmodified when it returns (other than %eax). You pass the arguments on the stack and your return value through %eax, so what should you do if you need to use a register in your function? Easy: you keep a copy on the stack of any registers you’re going to modify so you can restore them at the end of your function. In the _add_a_and_b function, I did that for the %ebx register as you can see. For more complex functions, it can get a lot more complicated than that, but let’s not get into that for now (for the curious: compilers will create what we call a “prologue” and an “epilogue” in each function. In the prologue, you store the registers you’re going to modify and set up the %ebp (base pointer) register to point to the base of the stack when your function was entered, which allows you to access things without keeping track of the pushes/pops you do throughout the function. Then in the epilogue, you pop the registers back and restore %esp to the value that was saved in %ebp before you return).

The second thing you might be wondering about is with these lines:

mov %eax, [%esp+8]
mov %ebx, [%esp+12]

And to explain it, I will simply show you this drawing of the stack’s contents when we call those two instructions above:

For the purposes of this exercise, we’re going to assume that the _main function is located in memory at the address 0xFFFF0000, and that each instruction is 4 bytes long (in reality, the size of each instruction varies depending on the instruction and on its operands). So you can see, we first pushed 3 into the stack, %esp was lowered, then we pushed 2 into the stack, %esp was lowered, then we did a ‘call _add_a_and_b’, which stored the address of the next instruction (the fourth instruction in main, so three 4-byte instructions in, at ‘_main+12’) into the stack and %esp was lowered, then we pushed %ebx, which I assumed here contained a value of 0, and %esp was lowered again. If we now want to access the first argument to the function (2), we need to access %esp+8, which lets us skip the saved %ebx and the ‘Return address’ that are in the stack (since we’re working with 32 bits, each value is 4 bytes). And in order to access the second argument (3), we need to access %esp+12.
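If you prefer code to drawings, the same stack layout can be replayed with a toy model. This is only a sketch: the toy_stack type and its helpers are invented for the illustration, and the return address and saved %ebx are passed in as parameters rather than taken from a real program.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Toy 32-bit stack; esp is a byte offset into mem and grows downward. */
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    uint32_t esp;
};

void stack_init(struct toy_stack *s)
{
    s->esp = STACK_WORDS * 4; /* empty stack: esp sits just past the top */
}

void push32(struct toy_stack *s, uint32_t value)
{
    s->esp -= 4;                /* esp is lowered first... */
    s->mem[s->esp / 4] = value; /* ...then the value is stored */
}

/* Read the dword at [esp + offset], as "mov reg, [esp+offset]" would. */
uint32_t load_esp_relative(const struct toy_stack *s, uint32_t offset)
{
    return s->mem[(s->esp + offset) / 4];
}

/* Replay the four pushes that happen before the two mov instructions run. */
void replay_call_sequence(struct toy_stack *s, uint32_t return_address,
                          uint32_t saved_ebx)
{
    stack_init(s);
    push32(s, 3);              /* second argument */
    push32(s, 2);              /* first argument */
    push32(s, return_address); /* pushed by the call instruction */
    push32(s, saved_ebx);      /* pushed by "push %ebx" */
}
```

After replay_call_sequence(), load_esp_relative(&s, 8) yields 2 and load_esp_relative(&s, 12) yields 3, while offsets 0 and 4 hold the saved %ebx and the return address.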

Binary or assembly?

One question that may (or may not) be popping into your mind now is “wait, isn’t this supposed to be the ‘computer language’, so why isn’t this binary?” Well, it is… in a way. As I explained earlier, “the assembly language is a textual representation of the binary instructions given to the microprocessor”, what it means is that those instructions are given to the processor as is, there is no transformation of the instructions or operands or anything like that. However, the instructions are given to the microprocessor in binary form, and the text you see above is just the textual representation of it.. kind of like how “68 65 6c 6c 6f” is the hexadecimal representation of the ASCII text “hello”. What this means is that each instruction in assembly language, which we call a ‘mnemonic’ represents a binary instruction, which we call an ‘opcode’, and you can see the opcodes and mnemonics in the list of x86 instructions I gave you above. Let’s take the CALL instruction for example. The opcode/mnemonic list is shown as:

Opcode	Mnemonic	Description
`E8 cw`	`CALL rel16`	Call near, relative, displacement relative to next instruction
`E8 cd`	`CALL rel32`	Call near, relative, displacement relative to next instruction
`FF /2`	`CALL r/m16`	Call near, absolute indirect, address given in r/m16
`FF /2`	`CALL r/m32`	Call near, absolute indirect, address given in r/m32
`9A cd`	`CALL ptr16:16`	Call far, absolute, address given in operand
`9A cp`	`CALL ptr16:32`	Call far, absolute, address given in operand
`FF /3`	`CALL m16:16`	Call far, absolute indirect, address given in m16:16
`FF /3`	`CALL m16:32`	Call far, absolute indirect, address given in m16:32

This means that this same “CALL” mnemonic can take several kinds of target addresses. Actually, there are four different possibilities, each having a 16-bit and a 32-bit variant. The first possibility is to call a function with a relative displacement (Call the function 100 bytes below this current position), or an absolute address given in a register (Call the function whose address is stored in %eax) or an absolute address given as a pointer (Call the function at address 0xFFFF0100), or an absolute address given as an offset to a segment (I won’t explain segments now). In our example above, the “call _add_a_and_b” was probably stored as a relative call. The displacement is counted from the next instruction, so it would be 8 bytes (4 bytes per instruction, and we have the ADD and RET instructions to skip). This means that the instruction in the binary file was encoded as “E8 08 00 00 00” (the E8 opcode to mean a “CALL near, relative”, and “08 00 00 00” to mean 8, stored in little-endian order with the least significant byte first). Now, the most observant of you have probably noticed that this CALL instruction takes 5 bytes total, not 4, but as I said above, we will assume it’s 4 bytes per instruction just for the sake of keeping things simple. But yes, the CALL (in this case) is 5 bytes, and other instructions will sometimes have more or fewer bytes as well.
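As a rough illustration of the “E8 cd” form from the table, here is a sketch that encodes a near relative call. The function name is made up, and it assumes a flat 32-bit address space where the displacement is counted from the end of the 5-byte instruction and stored least significant byte first.

```c
#include <stdint.h>

/* Encode "call rel32" for a call located at call_addr that targets target.
 * Writes exactly 5 bytes into out. */
void encode_call_rel32(uint32_t call_addr, uint32_t target, uint8_t out[5])
{
    uint32_t next = call_addr + 5; /* address of the following instruction */
    uint32_t rel = target - next;  /* wraps correctly for backward calls */

    out[0] = 0xE8;                 /* opcode: CALL near, relative */
    out[1] = (uint8_t)(rel & 0xFF);
    out[2] = (uint8_t)((rel >> 8) & 0xFF);
    out[3] = (uint8_t)((rel >> 16) & 0xFF);
    out[4] = (uint8_t)((rel >> 24) & 0xFF);
}
```

A call placed at 0x1000 that targets 0x1011 comes out as E8 0C 00 00 00, since 0x1011 is 12 bytes past the end of the instruction at 0x1005.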

I chose the CALL instruction above as an example because I think it’s the least complicated to explain. Other instructions have even more complicated opcodes and operands (see the ADD and ADC (Add with Carry) instructions for example: you’ll even notice the same opcodes shared between them, so they are almost the same instruction, but it’s easier to give them separate mnemonics to differentiate their behaviors).

Here’s a screenshot showing a side by side view of the Assembly of a function with the hexadecimal view of the binary:

As you can see, I have my cursor on address 0xFFF6E1D6 on the assembly view on the left, which is also highlighted on the hex view on the right. That address is a CALL instruction, and you can see the equivalent hex of “E8 B4 00 00 00”, which means it’s a CALL near, relative (E8 being the opcode for it) and the function is 0xB4 (180) bytes past the end of this 5-byte instruction, so at 0xFFF6E1DB + 0xB4 = 0xFFF6E28F.
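This is essentially what a disassembler does for this one opcode. The sketch below is a hypothetical helper (not anything taken from IDA) that decodes a 5-byte “E8” call and computes where it lands, assuming 32-bit addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a 5-byte "call rel32" found at insn_addr.
 * Returns false when the first byte is not the E8 opcode. */
bool decode_call_rel32(uint32_t insn_addr, const uint8_t bytes[5],
                       uint32_t *target)
{
    if (bytes[0] != 0xE8)
        return false;

    /* The displacement is stored least significant byte first. */
    uint32_t rel = (uint32_t)bytes[1] |
                   ((uint32_t)bytes[2] << 8) |
                   ((uint32_t)bytes[3] << 16) |
                   ((uint32_t)bytes[4] << 24);

    /* It is relative to the instruction that follows the call. */
    *target = insn_addr + 5 + rel;
    return true;
}
```

Feeding it the bytes E8 B4 00 00 00 at address 0xFFF6E1D6 gives a target of 0xFFF6E28F.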

If you open the file with a hexadecimal editor, you’ll only see the hex view on the right, but you need to put the file into a Disassembler (such as the IDA disassembler which I’m using here, but there are cheaper alternatives as well, the list can be long), and the disassembler will interpret those binary opcodes to show you the textual assembly representation which is much much easier to read.

Some actual reverse engineering

Now that you have the basics, let’s do a quick reverse engineering exercise… This is a very simple function that I’ve reversed recently, it comes from the SiliconInit, and it’s used to validate the UPD configuration structure (used to tell it what to do).

Here is the Assembly code for that function:

This was disassembled using IDA 7.0 (The Interactive DisAssembler) which is an incredible (but expensive) piece of software. There are other disassemblers which can do similar jobs, but I prefer IDA personally. Let’s first explain what you see on the screen.

On the left side, you see “seg000:FFF40xxx”, which means that we are in the segment “seg000” at the address 0xFFF40xxx. I won’t explain what a segment is, because you don’t need to know it. The validate_upd_config function starts at address 0xFFF40311 in the RAM, and there’s not much else to understand. You can see how the address increases from one instruction to the next, which can help you calculate the size in bytes that each instruction takes in RAM, if you’re curious of course… (the XOR is 2 bytes, the CMP is 3 bytes, etc.).

As you’ve seen in my previous example, anything after a semicolon (“;”) is considered a comment and can be ignored. The “CODE XREF” comments are added by IDA to tell us that this code has a cross-reference from (is being reached by) some other code. So when you see “CODE XREF: validate_upd_config+9” (at 0xFFF40363, the RETN instruction), it means this instruction is referenced from the function validate_upd_config, and the “+9” means 9 bytes into the function (so since the function starts at 0xFFF40311, it means it’s being referenced from the instruction at address 0xFFF4031A). The little “up” arrow next to it means that it comes from above the current position in the code, and if you follow the grey lines on the left side of the screen, you can follow that reference up to the address 0xFFF4031A which contains the instruction “jz short locret_FFF40363”. I assume the “j” letter right after the up arrow is to tell us that the reference comes from a “jump” instruction.

As you can see on the left side of the screen, there are a lot of arrows, which means that there’s a lot of jumping around in the code, even though it’s not immediately obvious. The awesome IDA software has a “layout view” which gives us a much nicer view of the code, and it looks like this:

Now you can see each block of code separately in their own little boxes, with arrows linking all of the boxes together whenever a jump happens. The green arrows mean that it’s a conditional jump when the condition is successful, while the red arrows means the condition was not successful. This means that a “JZ” will show a green arrow towards the code it would jump to if the result is indeed zero, and a red arrow towards the block where the result is not zero. A blue arrow means that it’s an unconditional jump.

I usually do my reverse engineering using the layout view, as I find it much easier to read/follow, but for the purpose of this exercise, I will use the regular linear view instead, since I think it will be easier for you to follow. The reason is mostly that the layout view doesn’t display the address of each instruction, and it’s easier to have you follow along if I can point out exactly which instruction I’m looking at by mentioning its address.

Now that you know how to read the assembly code and you understand the various instructions, I feel you should be ready to reverse engineer this very simple assembly code (even though it might seem complex at first). I just need to give you the following hints first:

Because I’ve already reverse engineered it, you get the beautiful name “validate_upd_config” for the function, but originally it was simply called “sub_FFF40311”.
I had already reverse engineered the function that called it so I know that this function is receiving its arguments in an unusual way. The arguments aren’t pushed to the stack, instead, the first argument is stored in %ecx, and the second argument is stored in %edx
The first argument (%ecx, remember?) is an enum to indicate what type of UPD structure to validate, let me help you out and say that type ‘3’ is “ALICE” (The configuration structure for the QueenM, the MemoryInit function), and that type ‘5’ is “BOB” (The configuration structure for the QueenS, the SiliconInit function).
Reverse engineering is really about reading one line at a time, in a sequential manner, keep track of which blocks you reversed and be patient. You can’t look at it and expect to understand the function by viewing the big picture.
It is very very useful in this case to have a dual monitor, so you can have one monitor for the assembly, and the other monitor for your C code editor. In my case, I actually recently bought an ultra-wide monitor and I split screen between my IDA window and my emacs window and it’s great. It’s hard otherwise to keep going back and forth between the assembly and the C code. That being said, I would suggest you do the same thing here and have a window on the side showing you the assembly image above (not the layout view) while you read the explanation on how to reverse engineer it below.

Got it? All done? No? Stop sweating and hyperventilating… I’ll explain exactly how to reverse engineer this function in the next paragraph, and you will see how simple it turns out to be!

Let’s get started!

The first thing I do is write the function in C. Since I know the name and its arguments already, I’ll do that:

void validate_upd_config (uint8_t action, void *config) {
}

Yeah, there’s not much to it yet, and I set it to return “void” because I don’t know if it returns anything else, and I gave the first argument “action” as a “uint8_t” because in the parent function it’s passed in a single-byte register (I won’t explain for now how to differentiate 1-byte, 2-byte, 4-byte and 8-byte registers). The second argument is a pointer, but I don’t know what kind of structure it points to exactly, so I just set it as a void *.

The first instruction is a “xor eax, eax”. What does this do? It XORs the eax register with the eax register and stores the result in the eax register itself, which is the same thing as “mov eax, 0”, because 1 XOR 1 = 0 and 0 XOR 0 = 0, so if every bit in the eax register is logically XORed with itself, it will give 0 for the result. If you’re asking yourself “Why did the compiler decide to do ‘xor eax, eax’ instead of ‘mov eax, 0’ ?” then the answer is simple: the XOR takes 2 bytes as you can see above (the address of the instructions jumped from FFF40311 to FFF40313), while a “mov eax, 0” would have taken 5 bytes, so it keeps the code smaller. It also runs at least as fast as the move, since x86 processors recognize it as the standard way of zeroing a register.
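A tiny sketch of both points, the zeroing and the size difference. The byte arrays show one valid encoding of each instruction (an assembler may pick an equivalent one, such as 33 C0 for the XOR), and the names are invented for this example.

```c
#include <stdint.h>

/* What "xor eax, eax" computes: every bit XORed with itself gives 0. */
uint32_t xor_with_self(uint32_t eax)
{
    return eax ^ eax;
}

/* One valid machine encoding of each instruction, for size comparison. */
const uint8_t XOR_EAX_EAX[] = {0x31, 0xC0};                   /* 2 bytes */
const uint8_t MOV_EAX_ZERO[] = {0xB8, 0x00, 0x00, 0x00, 0x00}; /* 5 bytes */
```

Whatever value goes into xor_with_self(), zero comes out, and sizeof confirms the 2-byte versus 5-byte difference.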

Alright, so now we know that eax is equal to 0, let’s keep that in mind and move along (I like to keep track of things like that as comments in my C code). The next instruction does a “cmp ecx, 3”, so it’s comparing ecx, which we already know is our first argument (uint8_t action), with 3. OK, it’s a comparison, not much to do here, so again let’s keep that in mind and continue… the next instruction does a “jnz short loc_FFF40344”, which is more interesting: if the previous comparison is NOT ZERO, then jump to the label loc_FFF40344 (for now ignore the “short”, it just helps us differentiate between the various encodings, and it means that the jump uses a relative offset that fits in a single byte, which is why the whole jnz instruction takes only 2 bytes of code). Great, so there’s a jump if the result is NOT ZERO, which means that if the result is zero, the code will just continue. In other words, if the ecx register (action variable) is EQUAL to 3 (the subtraction gives zero), the code will continue down to the next instruction instead of jumping… let’s do that, and in the meantime we’ll update our C code:

void validate_upd_config (uint8_t action, void *config) {
   // eax = 0
   if (action == 3) {
      // 0xFFF40318 
   } else {
      // loc_FFF40344
   }
}

The next instruction is “test edx, edx”. We know that the edx register is our second argument, which is the pointer to the configuration structure. As I explained above, the “test” is just like a comparison, but it does an AND instead of a subtraction, so basically, you AND edx with itself. Of course, that doesn’t change the value: 1 AND 1 = 1, and 0 AND 0 = 0, so why is it useful to test a register against itself? Simply because the TEST will update our FLAGS register… so when the next instruction is “JZ” it basically means “Jump if the edx register was zero”… And yes, doing a “TEST edx, edx” is more optimized than doing a “CMP edx, 0”, you’re starting to catch on, yay!
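In C terms, the zero flag after a TEST can be sketched like this. These are hypothetical helpers that model only ZF and ignore the other flags TEST updates.

```c
#include <stdbool.h>
#include <stdint.h>

/* ZF after "test reg, reg": ANDing a value with itself gives the value
 * back, so ZF is set exactly when the register holds 0. */
bool zf_after_test_self(uint32_t reg)
{
    return (reg & reg) == 0;
}

/* ZF after "test reg, mask": set when none of the mask bits are set in reg. */
bool zf_after_test(uint32_t reg, uint32_t mask)
{
    return (reg & mask) == 0;
}
```

So a “jz” right after “test edx, edx” is taken when zf_after_test_self(edx) is true, which for a pointer means it is NULL. The second helper covers the earlier odd/even trick: zf_after_test(value, 1) is true for even numbers.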

And indeed, the next instruction is “jz locret_FFF40363”, so if the edx register is ZERO, then jump to locret_FFF40363, and if we look at that locret_FFF40363, it’s a very simple “retn” instruction. So our code becomes:

void validate_upd_config (uint8_t action, void *config) {
  // eax = 0
  if (action == 3) {
    if (config == NULL)
       return; 
  } else {
    // loc_FFF40344
  }
}

Next! Now it gets slightly more complicated… the instruction is: “cmp dword ptr [edx], 554C424Bh”, which means we do a comparison of a dword (4 bytes), of the data pointed to by the pointer edx, with no offset (“[edx]” is the same as saying “edx[0]” if it was a C array for example), and we compare it to the value 554C424Bh… the “h” at the end means it’s a hexadecimal value, and with experience you can quickly notice that the hexadecimal value is all within the ASCII range, so using a Hex to ASCII converter, we realize that those 4 bytes represent the ASCII letters “KBLU” (which is why I manually added them as a comment to that instruction, so I won’t forget). So basically the instruction compares the first 4 bytes of the structure (the content pointed to by the edx pointer) to the string “KBLU”. The next instruction does a “jnz loc_FFF4035E” which means that if the comparison result is NOT ZERO (so, if they are not equal) we jump to loc_FFF4035E.
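If you don’t have a hex to ASCII converter at hand, a few lines of C will do. This is just a sketch with a made-up function name: it spells out a 32-bit immediate as the four bytes it occupies in memory on a little-endian machine such as x86.

```c
#include <stdint.h>

/* Spell out a 32-bit immediate as the 4 bytes it occupies in little-endian
 * memory. out must hold 5 chars: 4 characters plus the terminator. */
void dword_to_ascii(uint32_t value, char out[5])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((value >> (8 * i)) & 0xFF); /* lowest byte first */
    out[4] = '\0';
}
```

Passing 0x554C424B produces the string KBLU, because the lowest byte (0x4B, the letter K) is the one stored first in memory.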

Instead of continuing sequentially, I will see what that loc_FFF4035E contains (of course, I did the same thing in all the previous jumps, and had to decide if I wanted to continue reverse engineering the jump or the next instruction; in this case, it seems better for me to jump, you’ll see why soon). The loc_FFF4035E label contains the following instruction: “mov eax, 80000002h”, which means it stores the value 0x80000002 into the eax register, and then it jumps to (not really, it just naturally flows to the next instruction which happens to be the label) locret_FFF40363, which is just a “retn”. This turns our code into this:

uint32_t validate_upd_config (uint8_t action, void *config) {
  // eax = 0
  if (action == 3) {
    if (config == NULL)
       return 0; 
    if (((uint32_t *)config)[0] != 0x554C424B)
       return 0x80000002;
  } else {
    // loc_FFF40344
  }
}

The observant here will notice that I’ve changed the function prototype to return a uint32_t instead of “void” and my previous “return” has become “return 0” and the new code has a “return 0x80000002”. That’s because I realized at this point that the “eax” register is used to return a uint32_t value. And since the first instruction was “xor eax, eax”, and we kept in the back of our mind that “eax is initialized to 0”, it means that the use case with the (config == NULL) will return 0. That’s why I made all these changes…

Very well, let’s go back to where we were. Since we’ve exhausted this jump, we’ll go back to the address FFF40322 and continue from there to the next instruction. It’s a “cmp dword ptr [edx+4], 4D5F4450h”, which compares the dword at edx+4 to 0x4D5F4450, which I know to be the ASCII for “PD_M”; this means that the last 3 instructions are used to compare the first 8 bytes of our pointer to “KBLUPD_M”… ohhh, light bulb above our heads, it’s comparing the pointer to the Signature of the ALICE structure (don’t forget, you weren’t supposed to know that the function is called validate_upd_config, or that the argument is a config pointer… just that it’s a pointer)! OK, now it makes sense, and while we’re at it (since we are, of course, reading the QUEEN integration guide PDF), we then also realize what the 0x80000002 actually means. At this point, our code now becomes:

EFI_STATUS validate_upd_config (uint8_t action, void *config) {
  if (action == 3) {
    ALICE *upd = (ALICE *) config;
    if (upd == NULL)
       return EFI_SUCCESS; 
    if (upd->QueenUpdHeader.Signature != 0x4D5F4450554C424B /* 'KBLUPD_M'*/)
       return EFI_INVALID_PARAMETERS;
  } else {
    // loc_FFF40344
  }
}
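To double check how the two 32-bit comparisons collapse into a single 64-bit Signature, here is a quick sketch. The helper names are invented, and it assumes the dword values quoted from the disassembly (0x554C424B at offset 0 and 0x4D5F4450 at offset 4) on a little-endian machine.

```c
#include <stdint.h>

/* Combine the dword at offset 0 (low) and the dword at offset 4 (high)
 * into the 64-bit value a little-endian CPU would read at offset 0. */
uint64_t signature_from_dwords(uint32_t low_dword, uint32_t high_dword)
{
    return ((uint64_t)high_dword << 32) | low_dword;
}

/* Build the same 64-bit value from 8 characters of text,
 * where the first character lands in the lowest byte. */
uint64_t signature_from_text(const char text[8])
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)(uint8_t)text[i] << (8 * i);
    return value;
}
```

Both signature_from_dwords(0x554C424B, 0x4D5F4450) and signature_from_text("KBLUPD_M") give 0x4D5F4450554C424B.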

Yay, this is starting to look like something… Now you probably got the hang of it, so let’s do things a little faster now.

The next line, “cmp [edx+28h], eax”, compares the dword stored at edx+0x28 to eax. Thankfully, we now know that edx points to the ALICE structure, and we can calculate that offset 0x28 inside that structure is the field StackBase within the QueenmArchUpd field…
and since we still have in the back of our minds that ‘eax’ is initialized to zero, we know that the next 2 instructions are just checking if upd->QueenmArchUpd.StackBase is == NULL.
Then we compare the StackSize with 0x26000, but the comparison uses “jb” for the jump, which is “jump if below”, so it checks if StackSize < 0x26000.
Finally, it does a “test” of the dword at “edx+30h” (which is the BootloaderTolumSize field) against 0xFFF, then it does an unconditional jump to loc_FFF4035C, which itself does a “jz” to the return,
which means that if ((BootloaderTolumSize & 0xFFF) == 0) it will return whatever EAX contained (which is zero),
but if it isn’t, then it will continue to the next instruction, which is the “mov eax, 80000002h”.
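To show how an offset like 0x28 maps to a field name, here is a hypothetical layout that reproduces the offsets from the disassembly. Only Signature, StackBase, StackSize and BootloaderTolumSize come from the walkthrough. Every “reserved” member is padding I invented to make the numbers line up, the 0x2C offset for StackSize is my assumption (it sits between the two known fields), and the real structure is defined in the integration guide.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: 8-byte signature followed by invented padding. */
struct upd_header_sketch {
    uint64_t Signature;    /* offset 0x00 */
    uint8_t  reserved[24]; /* 0x08..0x1F, invented padding */
};

/* Hypothetical arch block, placed right after the header at 0x20. */
struct arch_upd_sketch {
    uint8_t  reserved[8];         /* 0x20..0x27, invented padding */
    uint32_t StackBase;           /* 0x28: a 32-bit pointer value */
    uint32_t StackSize;           /* 0x2C (assumed) */
    uint32_t BootloaderTolumSize; /* 0x30 */
};

struct alice_sketch {
    struct upd_header_sketch QueenUpdHeader;
    struct arch_upd_sketch   QueenmArchUpd;
};

/* "test value, 0FFFh" followed by "jz": true when the low 12 bits are
 * clear, i.e. the value is a multiple of 0x1000 (4 KiB). */
bool is_4k_multiple(uint32_t value)
{
    return (value & 0xFFF) == 0;
}
```

With this layout, offsetof(struct alice_sketch, QueenmArchUpd) plus offsetof(struct arch_upd_sketch, StackBase) comes out to 0x28, matching the “[edx+28h]” operand.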

So, we end up with this code:

EFI_STATUS validate_upd_config (uint8_t action, void *config) {
  // eax = 0
  if (action == 3) {
    ALICE *upd = (ALICE *) config;
    if (upd == NULL)
       return 0;
    if (upd->QueenUpdHeader.Signature != 0x4D5F4450554C424B /* 'KBLUPD_M'*/)
       return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.StackBase == NULL)
        return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.StackSize < 0x26000)
        return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.BootloaderTolumSize & 0xFFF)
        return EFI_INVALID_PARAMETERS;
  } else {
    // loc_FFF40344
  }
  return EFI_SUCCESS;
}

Great, we just solved half of our code! Don’t forget, we jumped one way instead of another at the start of the function, now we need to go back up and explore the second branch of the code (at offset 0xFFF40344). The code is very similar, but it checks for “KBLUPD_S” Signature, and nothing else. Now we can also remove any comment/notes we have (such as the note that eax is initialized to 0) and clean up, and simplify the code if there is a need.

So our function ends up being (this is the final version of the function):

EFI_STATUS validate_upd_config (uint8_t action, void *config) {
  if (action == 3) {
    ALICE *upd = (ALICE *) config;
    if (upd == NULL)
       return EFI_SUCCESS;
    if (upd->QueenUpdHeader.Signature != 0x4D5F4450554C424B /* 'KBLUPD_M'*/)
       return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.StackBase == NULL)
        return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.StackSize < 0x26000)
        return EFI_INVALID_PARAMETERS;
    if (upd->QueenmArchUpd.BootloaderTolumSize & 0xFFF)
        return EFI_INVALID_PARAMETERS;
  } else {
    BOB *upd = (BOB *) config;
    if (upd == NULL)
        return EFI_SUCCESS;
    if (upd->QueenUpdHeader.Signature != 0x535F4450554C424B /* 'KBLUPD_S'*/)
        return EFI_INVALID_PARAMETERS;
  }
  return EFI_SUCCESS;
}
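If you want to actually compile and poke at the result, here is a self-contained sketch that stays closer to the assembly: it reads dwords at raw byte offsets instead of going through the structure definitions. The helper names and the STATUS_* macros are mine, the 0x2C offset for StackSize is an assumption, the 0x26000 threshold is the one quoted in the walkthrough, and reading bytes explicitly keeps the sketch independent of the host’s endianness.

```c
#include <stddef.h>
#include <stdint.h>

#define STATUS_OK                0x00000000u
#define STATUS_INVALID_PARAMETER 0x80000002u

/* Read a little-endian dword at a byte offset, like "dword ptr [edx+off]". */
static uint32_t dword_at(const uint8_t *base, size_t off)
{
    return (uint32_t)base[off] | ((uint32_t)base[off + 1] << 8) |
           ((uint32_t)base[off + 2] << 16) | ((uint32_t)base[off + 3] << 24);
}

/* Store a little-endian dword at a byte offset (handy for building inputs). */
void store_dword(uint8_t *base, size_t off, uint32_t value)
{
    for (int i = 0; i < 4; i++)
        base[off + i] = (uint8_t)((value >> (8 * i)) & 0xFF);
}

uint32_t validate_upd_config_raw(uint8_t action, const uint8_t *config)
{
    if (config == NULL)                      /* test edx, edx ; jz */
        return STATUS_OK;
    if (dword_at(config, 0) != 0x554C424Bu)  /* "KBLU" */
        return STATUS_INVALID_PARAMETER;
    if (action == 3) {
        if (dword_at(config, 4) != 0x4D5F4450u)   /* "PD_M" */
            return STATUS_INVALID_PARAMETER;
        if (dword_at(config, 0x28) == 0)          /* StackBase */
            return STATUS_INVALID_PARAMETER;
        if (dword_at(config, 0x2C) < 0x26000u)    /* StackSize, offset assumed */
            return STATUS_INVALID_PARAMETER;
        if (dword_at(config, 0x30) & 0xFFFu)      /* BootloaderTolumSize */
            return STATUS_INVALID_PARAMETER;
    } else {
        if (dword_at(config, 4) != 0x535F4450u)   /* "PD_S" */
            return STATUS_INVALID_PARAMETER;
    }
    return STATUS_OK;
}
```

To try it, fill a zeroed buffer of at least 0x34 bytes with store_dword() and pass it in: a buffer with both signature dwords set, a non-zero StackBase, a StackSize of 0x26000 and a BootloaderTolumSize of 0x1000 validates with action 3, and changing any one of those fields to a bad value yields 0x80000002.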

Now this wasn’t so bad, was it? I mean, it’s time consuming, sure, it can be a little disorienting if you’re not used to it, and you have to keep track of which branches (which blocks in the layout view) you’ve already gone through, etc. but the function turned out to be quite small and simple. After all, it was mostly only doing CMP/TEST and JZ/JNZ.

That’s pretty much all I do when I do my reverse engineering, I go line by line, I understand what it does, I try to figure out how it fits into the bigger picture, I write equivalent C code to keep track of what I’m doing and to be able to understand what happens, so that I can later figure out what the function does exactly… Now try to imagine doing that for hundreds of functions, some of them that look like this (random function taken from the QueenM module):

You can see on the right, the graph overview which shows the entirety of the function layout diagram. The part on the left (the assembly) is represented by the dotted square on the graph overview (near the middle). You will notice some arrows that are thicker than the others, that’s used in IDA to represent loops. On the left side, you can notice one such thick green line coming from the bottom and the arrow pointing to a block inside our view. This means that there’s a jump condition below that can jump back to a block that is above the current block and this is basically how you do a for/while loop with assembly, it’s just a normal jump that points backwards instead of forwards.
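To picture what such a backward jump looks like, here is a loop written in C with goto, laid out roughly the way a compiler might emit it. It is only a sketch with a made-up function, and the instruction names in the comments are indicative rather than taken from real compiler output.

```c
#include <stdint.h>

/* Sum 1..n using only a compare and a conditional backward jump. */
uint32_t sum_to_n_with_goto(uint32_t n)
{
    uint32_t total = 0;
    uint32_t i = 1;
    goto check;       /* jmp to the loop condition */
body:
    total += i;       /* the loop body */
    i++;
check:
    if (i <= n)       /* cmp i, n */
        goto body;    /* jbe body: the jump points backwards, so it loops */
    return total;
}
```

In a graph view, the “goto body” edge would be the thick arrow going back up to an earlier block.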

Finally, the challenge!

At the beginning of this post, I mentioned a challenging function to reverse engineer. It’s not extremely challenging—it’s complex enough that you can understand the kind of things I have to deal with sometimes, but it’s simple enough that anyone who was able to follow up until now should be able to understand it (and maybe even be able to reverse engineer it on their own).

So, without further ado, here’s this very simple function:

Since I’m a very nice person, I renamed the function so you won’t know what it does, and I removed my comments so it’s as virgin as it was when I first saw it. Try to reverse engineer it. Take your time, I’ll wait:

Alright, so, the first instruction is a “call $+5”, what does that even mean?

When I looked at the hex dump, the instruction was simply “E8 00 00 00 00” which according to our previous CALL opcode table means “Call near, relative, displacement relative to next instruction”, so it wants to call the instruction 0 bytes from the next instruction. Since the call opcode itself is taking 5 bytes, that means it’s doing a call to its own function but skipping the call itself, so it’s basically jumping to the “pop eax”, right? Yes… but it’s not actually jumping to it, it’s “calling it”, which means that it just pushed into the stack the return address of the function… which means that our stack contains the address 0xFFF40244 and our next instruction to be executed is the one at the address 0xFFF40244. That’s because, if you remember, when we do a “ret”, it will pop the return address from the stack into the EIP (instruction pointer) register, that’s how it knows where to go back when the function finishes.
So, then the instruction does a “pop eax” which will pop that return address into EAX, thus removing it from the stack and making the call above into a regular jump (since there is no return address in the stack anymore).
Then it does a “sub eax, 0FFF40244h”, which means it’s subtracting 0xFFF40244 from eax (which should contain 0xFFF40244), so eax now contains the value “0”, right? You bet!
Then it adds to eax, the value “0xFFF4023F”, which is the address of our function itself. So, eax now contains the value 0xFFF4023F.
It will then subtract from EAX the value pointed to by [eax-15], which means the dword (4 bytes) value at the address 0xFFF4023F - 0xF, so the value at 0xFFF40230, right… that value is 0x1AB (yep, I know, you didn’t have this information)… so, 0xFFF4023F - 0x1AB = 0xFFF40094!
And then the function returns.. with the value 0xFFF40094 in EAX, so it returns 0xFFF40094, which happens to be the pointer to the QUEEN_INFO_HEADER structure in the binary.

So, the function just returns 0xFFF40094, but why did it do it in such a convoluted way? The reason is simple: because the QUEEN-S code is technically meant to be loaded in RAM at the address 0xFFF40000, but it can actually reside anywhere in the RAM when it gets executed. Coreboot for example doesn’t load it in the right memory address when it executes it, so instead of returning the wrong address for the structure and crashing (remember, most of the jumps and calls use relative addresses, so the code should work regardless of where you put it in memory, but in this case returning the wrong address for a structure in memory wouldn’t work), the code tries to dynamically verify if it has been relocated and if it is, it will calculate how far away it is from where it’s supposed to be, and calculate where in memory the QUEEN_INFO_HEADER structure ended up being.

Here’s the explanation why:

If the Queen was loaded into a different memory address, then the “call $+5” would put the exact memory address of the next instruction into the stack, so when you pop it into eax and then subtract from it the expected address 0xFFF40244, eax will contain the offset from where it was supposed to be.
Above, we said eax would be equal to zero, and that’s true, but only in the use case where the Queen is in the right memory address, as expected. Otherwise, eax would simply contain the offset. Then you add to it 0xFFF4023F, which is the expected address of our function, and with the offset, that means eax now contains the exact memory address of the current function, wherever it was actually placed in RAM!
Then when it grabs the value 0x1AB (because that value is stored in RAM 15 bytes before the start of the function, that will work just fine) and subtracts it from our current position, it gives us the address in RAM of the QUEEN_INFO_HEADER (because the compiler knows that the structure is located exactly 0x1AB bytes before the current function). This just makes everything relative.

Isn’t that great!? 😉 It’s so simple, but it does require some thinking to figure out what it does and some thinking to understand why it does it that way… but then you end up with the problem of “How do I write this in C”? Honestly, I don’t know how, I just wrote this in my C file:

// Use Position-independent code to make this relocatable
void *get_queen_info_header(void) {
    return (void *)0xFFF40094;
}

I think the compiler takes care of doing all that magic on its own when you use the -fPIC compiler option (for gcc), which means “Position-Independent Code”.

What this means for Purism

On my side, I’ve finished reverse engineering the entry code—from the entry point all the way to the end of the function and all the subfunctions that it calls.

This only represents 9 functions however, and about 115 lines of C code; I haven’t yet fully figured out where exactly it’s going in order to execute the rest of the code. What happens is that the last function it calls (it actually jumps into it) grabs a variable from some area in memory, and within that variable, it will copy a value into the ESP, thus replacing our stack pointer, and then it does a “RETN”… which means that it’s not actually returning to the function that called it (coreboot), it’s returning… “somewhere”, depending on what the new stack contains, but I don’t know where (or how) this new stack is created, so I need to track it down in order to find what the return address is, find where the “retn” is returning us into, so I can unlock plenty of new functions and continue reverse engineering this.
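
The “replace ESP, then RETN” trick can be modelled with a toy machine. This is only a sketch to show why the return lands somewhere other than the caller: the struct, the function names and the memory layout are all invented for the example and have nothing to do with the real binary.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy 32-bit machine state: just a stack pointer and a program counter. */
struct toy_cpu {
    uint32_t esp; /* byte offset into mem[] */
    uint32_t eip;
};

/* RETN: pop a 32-bit return address from wherever ESP points right now. */
static void toy_retn(struct toy_cpu *cpu, const uint32_t *mem)
{
    cpu->eip = mem[cpu->esp / 4];
    cpu->esp += 4;
}

/* Replace ESP with a value read from memory, then "return". */
uint32_t pivot_and_return(struct toy_cpu *cpu, const uint32_t *mem, size_t var_index)
{
    cpu->esp = mem[var_index]; /* the variable holds the new stack pointer */
    toy_retn(cpu, mem);        /* pops from the NEW stack, not the caller's */
    return cpu->eip;
}
```

Whatever return address the caller left on the old stack is simply never read; the destination is decided entirely by what sits at the top of the new stack.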

I’ve already made some progress on that front (I know where the new stack tells us to return into) but you will have to wait until my next blog post before I can explain it all to you. It’s long and complicated enough that it needs its own post, and this one is long enough already.

Perseverance prevails

In conclusion:

Reverse engineering isn’t just about learning a new language, it’s a very different experience from “learning Java/Python/Rust after you’ve mastered C”, because of the way it works; it can sometimes be very easy and boring, sometimes it will be very challenging for a very simple piece of code.
It’s all about perseverance, being very careful (it’s easy to get lost or make a mistake, and very hard to track down and fix a mistake/typo if you make one), and being very patient. We’re talking days, weeks, months. That’s why reverse engineering is something that very few people do (compared to the number of people who do general software development). Remember also that our first example was 82 bytes of code, and the second one was only 19 bytes long, and most of the time, when you need to reverse engineer something, it’s many hundreds of KBs of code.

All that being said, the satisfaction you get when you finish reverse engineering some piece of code, when you finally understand how it works and can reproduce its functionality with open source software of your own, cannot be described with words. The feeling of achievement that you get makes all the efforts worth it!

I hope this write-up helps everyone get a fresh perspective on what it means to “reverse engineer the code”, why it takes so long, and why it’s rare to find someone with the skills, experience and patience to do this kind of stuff for months—as it can be frustrating, and we sometimes need to take a break from it and do something else in order to renew our brain cells.

The post A Primer Guide to Reverse Engineering appeared first on Purism.

Deep dive into Intel Management Engine disablement

Youness Alaoui — Thu, 19 Oct 2017 15:38:48 +0000

Starting today, our second generation of laptops (based on the 6th gen Intel Skylake platform) will now come with the Intel Management Engine neutralized and disabled by default. Users who already received their orders can also update their flash to disable the ME on their machines.

In this post, I will dig deeper and explain in more detail what this means exactly, and why it wasn’t done before today for the laptops that were shipping this spring and summer.

The life and times of the ME

Think of the ME as having 4 possible states:

Fully operational ME: the ME is running normally like it does on other manufacturers’ machines (note that this could be a consumer or corporate ME image, which vary widely in the features they ‘provide’)
Neutralized ME: the ME is neutralized/neutered by removing the most “mission-critical” components from it, such as the kernel and network stack.
Disabled ME: the ME is officially “disabled” and is known to be completely stopped and non-functional
Removed ME: the ME is completely removed and doesn’t execute anything at any time, at all.

In my previous blog post about taming the ME, we discussed how we neutralize the ME (note that this was on the first generation, Broadwell-based Purism laptops back then), but we’ve taken things one step further today by not only neutralizing the ME but also by disabling it. The difference between the two might not be immediately visible to some of you, so I’ll clarify below.

A neutralized ME is a ME image which had most of its code removed.
- The ME firmware is packaged on the flash in the form of multiple modules, and each module has a specific task, such as: hardware initialization, firmware updates, kernel, network stack, audio/video processing, HECI communication over PCI, Java virtual machine, etc. When the ME is neutralized using the me_cleaner tool, most of the modules are removed. As we’ve seen on Broadwell, that meant almost 93% of the code was removed and only 7% remained (that proportion is different on Skylake, see further below).
- A neutralized ME means that the ME firmware will encounter an error during its regular boot cycle: it will not find some of its critical modules, so it will throw an error and fail to proceed. However, the ME remains operational, it just can’t do anything “valuable”. While it’s unable to communicate with the main CPU through the HECI commands, the PCI interface to the ME processor is still active and lets us poke at the status of the ME, for example, which lets us see which error caused it to stop functioning.
When the ME is disabled using the “HAP” method (thanks to Positive Technologies for discovering this trick), however, it doesn’t throw an error “because it can’t load a module”: it actually stops itself in a graceful manner, by design.

The two approaches are similar in that they both stop the execution of the ME during the hardware initialization (BUP) phase, but with the ME disabled through the HAP method, the ME stops on its own, without putting up a fight, potentially disabling things that the forceful “me_cleaner” approach, with the “unexpected error” state, wouldn’t have disabled. The PCI interface for example, is entirely unable to communicate with the ME processor, and the status of the ME is not even retrievable.

So the big, visible difference for us between a neutralized and a disabled ME is that the neutralized ME might appear “normal” when coreboot accesses its status, or it might show that it has terminated due to an error, while a disabled ME simply doesn’t give us a status at all, so coreboot will even think that the ME partition is corrupted. Another advantage is that, from my understanding of Positive Technologies’ research, a disabled ME stops its execution before a neutralized ME does, so there is at least a little bit of extra code that doesn’t get executed when the ME is disabled, compared to a neutralized ME.

Kill it with fire! Then dump it into a volcano.

In our case, we went with an ME that is both neutered and disabled. By doing so, we provide maximum security; even if the disablement of the ME isn’t functioning properly, the ME would still fail to load its mission-critical modules and will therefore be safe from any potential exploits or backdoors (unless one is found in the very early boot process of the ME).

I want to talk about the neutralizing of the Skylake ME then follow up on how the ME was disabled. However, I first want you to understand the differences between the ME on Broadwell systems (ME version 10.x) and the ME on Skylake systems (ME version 11.0.x).

The Intel Management Engine can be seen as two things; first, the isolated processor core that runs the Management Engine is considered “The ME”, and second, the firmware that runs on the ME Core is also considered as being “the ME”. I often used the two terms interchangeably, but to avoid confusion, I will from now on (try to) refer to them, respectively, as the ME Core and the ME Firmware, but note that if I simply say the ME, then I am probably referring to the ME Firmware.
The ME Firmware 10.x was used on Broadwell systems which had an ARC core, while the ME Firmware 11.0.x used on Skylake systems uses an x86 core. What this means is that the architecture used by the ME core is completely different (kind of like how PowerPC and Intel Macs used a different architecture, or how most mobile devices use an ARM architecture, the Broadwell ME Core used an ARC architecture). This means that the difference between the 10.x and 11.0.x ME firmwares is major, and the cores themselves are also very different. It’s a bit like comparing Arabic to Korean!
As the format of the ME firmware changed significantly, it took a while to figure out how to decompress the modules and understand how to remove the modules without breaking anything else. Nicola Corna, the author of the me_cleaner tool, recently was able to add support for Skylake machines by removing all the non-essential modules.

In my last ME-related post, I gave everyone a rundown of the modules that were in the ME 10.x firmware and which ones were remaining after it was neutered, so, for Skylake, here is the list of modules in a regular ME 11.0.x firmware:

-rw-r--r-- 1 kakaroto kakaroto 184320 Aug 29 16:33 bup.mod
-rw-r--r-- 1 kakaroto kakaroto  36864 Aug 29 16:33 busdrv.mod
-rw-r--r-- 1 kakaroto kakaroto  32768 Aug 29 16:33 cls.mod
-rw-r--r-- 1 kakaroto kakaroto 163840 Aug 29 16:33 crypto.mod
-rw-r--r-- 1 kakaroto kakaroto 389120 Aug 29 16:33 dal_ivm.mod
-rw-r--r-- 1 kakaroto kakaroto  24576 Aug 29 16:33 dal_lnch.mod
-rw-r--r-- 1 kakaroto kakaroto  49152 Aug 29 16:33 dal_sdm.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 evtdisp.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 fpf.mod
-rw-r--r-- 1 kakaroto kakaroto  45056 Aug 29 16:33 fwupdate.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 gpio.mod
-rw-r--r-- 1 kakaroto kakaroto   8192 Aug 29 16:33 hci.mod
-rw-r--r-- 1 kakaroto kakaroto  36864 Aug 29 16:33 heci.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 hotham.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 icc.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 ipc_drv.mod
-rw-r--r-- 1 kakaroto kakaroto  11832 Aug 29 16:33 ish_bup.mod
-rw-r--r-- 1 kakaroto kakaroto  24576 Aug 29 16:33 ish_srv.mod
-rw-r--r-- 1 kakaroto kakaroto  73728 Aug 29 16:33 kernel.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 loadmgr.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 maestro.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 mca_boot.mod
-rw-r--r-- 1 kakaroto kakaroto  24576 Aug 29 16:33 mca_srv.mod
-rw-r--r-- 1 kakaroto kakaroto  36864 Aug 29 16:33 mctp.mod
-rw-r--r-- 1 kakaroto kakaroto  32768 Aug 29 16:33 nfc.mod
-rw-r--r-- 1 kakaroto kakaroto 409600 Aug 29 16:33 pavp.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 pmdrv.mod
-rw-r--r-- 1 kakaroto kakaroto  24576 Aug 29 16:33 pm.mod
-rw-r--r-- 1 kakaroto kakaroto  61440 Aug 29 16:33 policy.mod
-rw-r--r-- 1 kakaroto kakaroto  12288 Aug 29 16:33 prtc.mod
-rw-r--r-- 1 kakaroto kakaroto 167936 Aug 29 16:33 ptt.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 rbe.mod
-rw-r--r-- 1 kakaroto kakaroto  12288 Aug 29 16:33 rosm.mod
-rw-r--r-- 1 kakaroto kakaroto  49152 Aug 29 16:33 sensor.mod
-rw-r--r-- 1 kakaroto kakaroto 110592 Aug 29 16:33 sigma.mod
-rw-r--r-- 1 kakaroto kakaroto  20480 Aug 29 16:33 smbus.mod
-rw-r--r-- 1 kakaroto kakaroto  36864 Aug 29 16:33 storage.mod
-rw-r--r-- 1 kakaroto kakaroto   8192 Aug 29 16:33 syncman.mod
-rw-r--r-- 1 kakaroto kakaroto  94208 Aug 29 16:33 syslib.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Aug 29 16:33 tcb.mod
-rw-r--r-- 1 kakaroto kakaroto  28672 Aug 29 16:33 touch_fw.mod
-rw-r--r-- 1 kakaroto kakaroto  12288 Aug 29 16:33 vdm.mod
-rw-r--r-- 1 kakaroto kakaroto  98304 Aug 29 16:33 vfs.mod

And here is the list of modules in a neutered ME:

-rw-r--r-- 1 kakaroto kakaroto 184320 Oct  4 16:21 bup.mod
-rw-r--r-- 1 kakaroto kakaroto  73728 Oct  4 16:21 kernel.mod
-rw-r--r-- 1 kakaroto kakaroto  16384 Oct  4 16:21 rbe.mod
-rw-r--r-- 1 kakaroto kakaroto  94208 Oct  4 16:21 syslib.mod

The total ME size dropped from 2.5MB to 360KB, which means that 14.42% of the code remains, while 85.58% of the code was neutralized with me_cleaner.

The reason the neutering on Skylake-based systems removed less code than on Broadwell-based systems is because of the code in the ME’s read-only memory (ROM). What this “ROM” means is that a small part of the ME firmware is actually burned in the silicon of the ME Core. The ROM content is the first code executed, loaded internally from the ROM, by the ME core, and it has the simple task of reading the ME firmware from the flash, verifying its signature, making sure it hasn’t been tampered with, loading it in the ME Core’s memory and executing it.

On Broadwell, there is about 128KB of code burned in the ME Core’s ROM. That 128KB of code contains the bootloader as well as some system APIs that the other modules can use.
On Skylake, the ROM code was decreased to 17KB, leaving only the basic bootloader, and moving the system APIs to a module of their own inside the ME firmware.
This means that the total amount of code remaining, including the ROM is 360+17KB out of 2524+17KB = 377/2541 = 14.84% for Skylake, while on Broadwell, it’s 120 + 128KB out of 1624+128KB = 248/1752 = 14.15% of code remaining. The difference is much smaller now when we account for the code hidden in the ROM of the processor.
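
The ROM-inclusive comparison is easy to recompute. A small sketch (the function name is mine; the sizes are the rounded KB figures quoted above): because the ROM can never be removed, its size is added to both the remaining code and the original total.

```c
/* Percentage of ME code remaining when the on-die ROM is counted too.
 * All sizes are in KB; the ROM cannot be removed, so it is added to
 * both the kept and the original totals. */
double remaining_percent(double flash_kept_kb, double flash_total_kb, double rom_kb)
{
    return 100.0 * (flash_kept_kb + rom_kb) / (flash_total_kb + rom_kb);
}
```

Calling remaining_percent(360, 2524, 17) reproduces the Skylake figure of roughly 14.8%, and remaining_percent(120, 1624, 128) the Broadwell figure of roughly 14.2%.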

The problem with the code in the ROM is that it cannot be removed because it’s inside of the processor itself and, well, it’s Read-Only Memory—it cannot be overwritten in any way, by definition. On the bright side, it is nice to see that most of the code that was previously in the ROM has now been moved to the flash in Skylake systems.

The ME firmware itself has multiple “partitions”, each containing something that the ME firmware needs. Some of those partitions will contain code modules, some will contain configuration files, and some will contain “other data” (I don’t really know what). Either way, the ME firmware contains about a dozen different partitions, each for a specific purpose, and two of those partitions contain the majority of the code modules.

Schrödinger’s Wi-Fi

I’ll now explain what has been done to get to this point in the project. When I was done with the coreboot port to the new Skylake machines, I tried to neutralize the ME, thinking it would be a breeze, since me_cleaner claimed support for Skylake. Unfortunately, it wasn’t working as it should and I spent the entire hacking day at the coreboot conference trying to fix it.

The problem was that once the ME was neutralized with me_cleaner, the Wi-Fi module on the Librem was unpredictable: it sometimes would work and sometimes wouldn’t, which was confusing. I eventually realized that if I rebooted after replacing the ME, the wifi would keep the same state as it was in before:

if I neutralized the ME and rebooted, the wifi would still work, but after powering the machine off and turning it back on, it would stop working;
if I restored a full ME (instead of a neutralized one) and rebooted, the wifi would remain dead;
…but if I powered the machine off and turned it back on, the wifi would finally be restored.

I figured that it had something to do with how the PCI-Express card is initialized, and I spent quite some time trying to “enable it” from coreboot with a neutralized ME. I’ll spare you the details but I eventually realized that I couldn’t get it to work because the PCIe device completely ignored all my commands and would simply refuse to power up. It turns out that the ME controls the ICC (Integrated Clock Controller), so without the ME, the clock for the PCIe device is never enabled and the wifi card can’t work; there is nothing you can do about it because only the ME has control over the ICC registers. I tested a handful of different ME firmware versions, but surprisingly, the wifi module never worked on any of those images, even when the ME was not neutralized. Obviously, it meant that the ME firmware was not properly configured, so I used the Intel FIT tool (which is used to configure ME images, allowing us to set things like PCIe lanes, which clocks to enable, and all of that). Unfortunately, even when an image was configured the exact same way as the original ME image we had, the wifi would still not work, and I couldn’t figure out why.

I shelved the problem to concentrate on the release of coreboot and eventually on the SATA issues we were experiencing. The decision was made to release the Librem 13 v2 and Librem 15 v3 with a regular ME until more work was done on that front, because we couldn’t hold back shipments any longer (and because we can provide updates after shipment). Also note that at that time, the support for Skylake in me_cleaner was very rough—it was removing only half of the ME code because the format of the new ME 11.x firmware wasn’t fully known yet.

A few weeks later, I saw the release of unME11 from Positive Technologies and a week later, Nicola Corna pushed more complete support for Skylake in a testing branch of me_cleaner. I immediately jumped on it and tested it on our machines. Unfortunately, the wifi issue was still there. I decided to debug the cause by figuring out what me_cleaner does that could be affecting the ME firmware that way.

As I mentioned earlier in this post, the ME firmware is made up of a dozen partitions, some of those containing code modules, and me_cleaner will remove all the partitions except one, in which it will remove most of the modules and leave only the critical modules needed for the startup of the system. Therefore, I started progressively whitelisting more modules so me_cleaner wouldn’t remove them, and testing if it affected the wifi module. This was annoying to test because I’d have to change me_cleaner, neutralize the ME firmware, copy the image from my main PC to the Librem, flash the new image, power off, then restart the machine; and whenever the Wifi wasn’t working, which was 99% of the time, I had to copy the files through a USB drive. I eventually restored all of the modules and it was still not working, which made me suspect the cause might be in one of the other partitions, so I gradually added one partition at a time, until the Wifi suddenly worked. I had just added the “MFS” partition, so I started removing the other partitions again one at a time, but keeping the “MFS” partition, and the Wifi was still working. I eventually removed all of the code modules (apart from the critical ones) while keeping the MFS partition, and the wifi was still working. So I had found my fix: I just needed to keep the “MFS” partition in the image and the wifi would work.

So many firmwares, so little time

So, what is this mysterious “MFS” partition? There’s not a lot of information about it anywhere online, other than one forum or mailing list user mentioning the MFS partition as “ME File System”. I decided to use a comparative approach.

The fun thing when comparing ME firmware images: not only are there multiple versions (ex: 10.x vs 11.x), for each single ME version there are multiple “flavors” of it, such as “Consumer” or “Corporate”, and there are also multiple flavors for “mobile” and “desktop”.

When I extracted and compared all the partitions of all the variants and flavors, the only difference between a mobile and a desktop image is in the MFS partition, as every other partition shares the same hash between two flavors of the same version.
I then compared the various partitions between a configured and a non configured ME firmware, and noticed that what the Intel FIT tool does when you change the system’s configuration is to simply write that configuration inside of the MFS partition.
This means that the MFS partition, which doesn’t contain any code modules, is used for storage of configuration files used by the ME firmware. This is somewhat confirmed by the fact that the MFS partition is marked as containing data.

After modifying me_cleaner to add support for the Librem, which allows us to neutralize the ME while keeping the Wifi module working, I discussed with Nicola Corna how to best integrate the feature into me_cleaner. We came to the conclusion that having a new option to allow users to select which partitions to keep would be a better method, so I sent a pull request that adds such a feature.

Unfortunately, while the wifi module was working with this change, I also had an adverse side-effect when adding the MFS partition back into the ME firmware: my machine would refuse to power off, for example, and would have trouble rebooting.

The exact behavior is that if I power off the machine, Linux would do the entire power off sequence then stop, and I would have to manually force shutdown the Librem by holding the power button for 5 seconds. As for the rebooting issue, instead of actually rebooting when Linux finishes its poweroff sequence, the system will be frozen for a few seconds before suddenly shutting itself down forcibly, then turning itself back on 5 seconds later, on its own. This isn’t the most critical of issues, but it would be very annoying to users, and unfortunately, I couldn’t find the cause of this strange behavior. All I knew was that if I remove the MFS partition, coreboot says the ME partition is corrupted, and the wifi module doesn’t work, and if I keep the MFS partition, coreboot says the ME partition is valid, the wifi module works, but the poweroff/reboot issues automatically appear.
The solution for these issues turned out to be unexpectedly simple. After another of our developers said he was ready to live with the poweroff/reboot issues, and I sent him a neutralized ME for his system, I was told that his machine was working fine with no side-effects at all. I didn’t know what the difference between his machine and mine was, other than the fact that my machine is a prototype and his was a “production” machine. I then tested my neutralized ME on the “production” Librem 13 unit I had on hand, and I didn’t have any side effects of the neutralizing of the ME firmware. I then updated my coreboot build script to add the neutralization option and asked users on our forums to test it, and everyone who tested the neutralized ME reported back success with no side-effects. I then realized the problem is probably only caused by the prototype machine that I was using. Well, I can live with that.

Disabling the ME

The next step for me was to start reverse-engineering the ME firmware, like I had done before. This is of course a very long and arduous process that took a while and for which I don’t really have much progress to show. One thing I wanted to reverse-engineer was the MFS file system format so I could see which configuration files are within it and to start eliminating as much from it as possible. I started from the beginning however, by reverse engineering the entry point in the ROM. I will spare you much of the detail and the troubles in trying to understand some of the instructions, and mostly some of the memory accesses. The important thing to know is that before I got too far along, Positive Technologies announced the discovery of a way to disable the Intel ME, and I needed to test it.

Unfortunately, enabling the HAP bit which disables the ME Core, didn’t work on the Librem: it was causing the power LED to blink very slowly, and nothing I could do would stop it until I removed the battery. I first thought the machine was stuck in a boot loop, but it was just blinking really slowly. I figured out eventually that the reason was that the “HAP” bit was not added in version 11.0.0, but rather in version 11.0.x (where x > 0). I decided to try a newer ME firmware version and the HAP bit did work on that, which confirmed that the ME disablement was a feature added to the ME after the version the Librem came with (11.0.0.1180). So now I have a newer ME (version 11.0.18.1002) that is disabled thanks to the HAP bit, but… no Wi-Fi again.
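
A build script could guard against this with a version check before setting the HAP bit. The sketch below encodes only my own observation from this post (11.0.0 did not honour the bit, a later 11.0.x did); it is not a documented Intel rule, and the function name is made up.

```c
#include <stdio.h>

/*
 * Returns 1 if an ME 11.0.x version string has a hotfix number above 0,
 * 0 if it does not, and -1 if the string cannot be parsed or is not 11.0.x.
 */
int me11_version_has_hap(const char *version)
{
    unsigned major, minor, hotfix, build;

    if (sscanf(version, "%u.%u.%u.%u", &major, &minor, &hotfix, &build) != 4)
        return -1;
    if (major != 11 || minor != 0)
        return -1;
    return hotfix > 0;
}
```

For the two images mentioned above, "11.0.0.1180" gives 0 and "11.0.18.1002" gives 1.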

I decided to retry using the FIT tool to configure the ME with the exact same settings as the old ME firmware. I went through every setting available to make sure it matches, and when I tried booting it again, the ME Core was disabled and the Wifi module was working. Great Success!

Obviously, I then needed to do plenty of testing, make sure it’s all working as it should, confirm that the ME Core was disabled, test the behavior of the system with a ME firmware both disabled and neutralized, and that it has no side effects other than what we wanted.

My previous coreboot build script was using the ME image from the local machine, but unfortunately, I can’t do that now for disabling the ME since it’s not supported on the ME image that most people have on their machines. So I updated my coreboot build script to make it download the new ME version from a public link (found here), and I used bsdiff to patch the ME image with the proper configuration for the WiFi to work. I made sure to check that the only changes to the ME image are in the MFS partition and are configuration data, so the binary patch does not contain any binary code and we can safely distribute it.
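
The “only the MFS partition changed” check boils down to comparing the two images byte by byte and making sure every difference falls inside one range. A minimal sketch, assuming you already know the offset and length of the partition in the image (the function name is hypothetical, and this is not the script I used):

```c
#include <stddef.h>

/*
 * Returns 1 if every byte that differs between `before` and `after`
 * lies inside [start, start + length), and 0 otherwise.
 */
int changes_confined_to(const unsigned char *before, const unsigned char *after,
                        size_t size, size_t start, size_t length)
{
    for (size_t i = 0; i < size; i++) {
        if (before[i] != after[i] && (i < start || i - start >= length))
            return 0; /* found a change outside the allowed range */
    }
    return 1;
}
```

This only proves where the changes are, not what they are; confirming that the changed bytes are configuration data still means looking at the partition contents.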

Moving towards the FSP

The next step will be to continue the reverse-engineering efforts, but for now, I’ve put that on hold because Positive Technologies have announced that they found an exploit in the ME Firmware allowing the execution of unsigned code. This exploit will be announced at the BlackHat Europe 2017 conference in December, so we’ll have to wait and see how their exploit works and what we can achieve with it before going further. Also, once Positive Technologies release their information, it might be possible for us to work together and share our knowledge. I am hoping that I can get some information from them on code that they already reverse engineered, so I don’t have to duplicate all of their efforts. I’d also like to mention that, just as last time, Igor Skochinsky has generously shared his research with us, but also getting data from Positive Technologies would be a tremendous help, considering how much work they have already invested on this.

Right now, I have decided to move my focus to investigating the FSP, which is another important binary that needs to be reverse-engineered and removed from coreboot. I don’t think that anyone is currently actively working on it, so hopefully, I can achieve something without duplicating someone else’s work, and we can advance the cause much faster this way. I think I will concentrate first on the PCH initialization code, then move to the memory initialization.

The post Deep dive into Intel Management Engine disablement appeared first on Purism.

Coreboot and Skylake, part 2: A Beautiful Game!

Youness Alaoui — Tue, 29 Aug 2017 15:00:41 +0000

Hi everyone,

While most of you are probably excited about the possibilities of the recently announced “Librem 5” phone, today I am sharing a technical progress report about our existing laptops, particularly findings about getting coreboot to be “production-ready” on the Skylake-based Librem 13 and 15, where you will see one of the primary reasons we experienced a delay in shipping last month (and how we solved the issue).

TL;DR: Shortly before we began shipping from inventory, the coreboot port was considered done, but we found some weird SATA issues at the last minute, and those needed to be fixed before shipping those orders.

The bug was sometimes preventing booting any operating system, which is why it became a blocker for shipments.
I didn’t find the “perfect” fix yet, I simply worked around the problem; the workaround corrects the behavior without any major consequences for users, other than warnings showing up during boot with the Linux kernel, which allowed us to resume shipments.
Once I come up with the proper/perfect fix, an update will be made available for users to update their coreboot install post-facto. So, for now, do not worry if you see ATA errors during boot (or in dmesg) in your new Librem laptops shipped this summer: it is normal, harmless, and hopefully will be fixed soon.

The SATA-killer Chronicles

I previously considered the coreboot port “done” for the new Skylake-based laptops, and as I went to the coreboot conference, I thought I’d be coming back home and finally be free to take care of the other stuff in my ever-increasing TODO list. But when I came back, I received an email from Zlatan (who was inside our distribution center that week), saying that some machines couldn’t boot, throwing errors such as:

Read Error

…in SeaBIOS, or

error: failure reading sector 0x802 from 'hd0'

error: no such partition. entering rescue mode

…in GRUB before dropping into the GRUB rescue shell.

That was odd, as I had never encountered those issues except one time very early in the development of the coreboot port, where we were seeing some ATA error messages in dmesg but that was fixed, and neither Matt nor I ever saw such errors again since. So of course, I didn’t believe Zlatan at first, thinking that maybe the OS was not installed properly… but the issue was definitely occurring on multiple machines that were being prepared to ship out. Zlatan then booted into the PureOS Live USB and re-installed the original AMI BIOS; then he had no more issues booting into his SSD, but when he’d flash coreboot back, it would fail to boot.

The ever changing name of the wind

Intrigued, I tested on my machine again with the “final release” coreboot image I had sent them and I couldn’t boot into my OS either. Wait—What!? It was working fine just before I went to the coreboot conference.

Did something change recently? No, I remember specifically sending the image that I had been testing for weeks, and I hadn’t rebased coreboot because I very specifically wanted to avoid any potential new bug being introduced “at the last minute” from the latest coreboot git base.
Just to be sure, I went back to an even older image I had saved (which was known to work as well), and the issue occurred there as well—so not a compiling-related problem either.
I asked Matt to test on his machine, and when he booted the machine, it was failing for him with the same error. He hadn’t even flashed a new coreboot image! It was still the same image he had on the laptop for the past few weeks, which was working perfectly for him… until now, as it now refused to boot.

Madness? THIS—IS—SATA!

After extensive testing, we finally came to the conclusion that whether or not the machine would manage to boot was entirely dependent on the following conditions:

The time of day
The current phase of the moon
The alignment of the planets in some distant galaxy
The mood of my neighbor’s cat

The most astonishing (and frustrating) thing is that during the three weeks where Matt and I had been working on the coreboot port previously, we never encountered any “can’t boot” scenario, and we were rebooting those machines probably 10 times per hour or more… but now, we were suddenly both getting those errors, pretty consistently.

After a day or two of debugging, it suddenly started working without any errors again for a couple of hours, then it started bugging again. On my end, the problem seemed to typically happen with SATA SSDs on the M.2 port (I didn’t get any issues when using a 2.5″ HDD, and Matt was in the same situation). However, even with a 2.5″ HDD, Zlatan was having the same issues we were seeing with the M.2 connector.

So the good news was that we were at least able to encounter the error pretty frequently now, the bad news was that Purism couldn’t ship its newest laptops until this issue was fixed—and we had promised the laptops would be shipping out in droves by that time! Y’know, just to add a bit of stress to the mix.

The Eolian presents: DTLE

When I was doing the v1 port, I had a more or less similar issue with the M.2 SATA port, but it was much more stable: it would always fail with “Read Error”, instead of failing with a different error on every boot and “sometimes failing, sometimes working”. Some of you may remember my explanation of how I fixed the issue on the v1 in February: back then, I had to set the DTLE setting on the IOBP register of the SATA port. What this means is anyone’s guess, but I found this article explaining that “DTLE” means “Discrete Time Linear Equalization”, and that having the wrong DTLE values can cause the drives to “run slower than intended, and may even be subject to intermittent link failures”. Intermittent link failures! Well! Doesn’t that sound familiar?

Unfortunately, I don’t know how to set the DTLE setting on the Skylake platform, since coreboot doesn’t have support for it. The IOBP registers that were on the Broadwell platform do not exist in Skylake (they have been replaced by a P2SB—Primary to SideBand—controller), and the DTLE setting does not exist in the P2SB registers either, according to someone with access to the NDA’ed datasheet.

When the computer was booting, there were some ATA errors appearing in dmesg, and they looked something like this:

ata3: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/04:00:d4:82:85/00:00:1f:00:00/40 tag 0 ncq 2048 in
res 40/00:18:d3:82:85/00:00:1f:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
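For readers who want to see what that “cmd” line encodes, here is a minimal Python sketch that decodes it. It assumes the field order used by the kernel's libata error report (command/feature:nsect:lbal:lbam:lbah/hob fields/device) and the NCQ convention that the sector count travels in the feature fields; the function name is mine, not something from the kernel.

```python
# Decode the "cmd" taskfile from a libata error line.
# Assumed field order (libata's error report format):
#   command/feature:nsect:lbal:lbam:lbah/
#   hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device

READ_FPDMA_QUEUED = 0x60  # ATA opcode for the NCQ read command


def decode_ncq_cmd(taskfile):
    """Return the interesting fields of an NCQ taskfile string (hypothetical helper)."""
    groups = [[int(b, 16) for b in g.split(":")] for g in taskfile.split("/")]
    command = groups[0][0]
    feature, nsect, lbal, lbam, lbah = groups[1]
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = groups[2]
    # 48-bit LBA: the "hob" (high order byte) fields carry the upper 24 bits.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    # For NCQ commands the sector count is carried in the feature fields,
    # and the tag sits in bits 7:3 of the sector-count field.
    sectors = hob_feature << 8 | feature
    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,
        "tag": nsect >> 3,
    }


decoded = decode_ncq_cmd("60/04:00:d4:82:85/00:00:1f:00:00/40")
print(hex(decoded["command"]), hex(decoded["lba"]), decoded["bytes"], decoded["tag"])
```

The decoded size (4 sectors of 512 bytes) agrees with the “ncq 2048 in” and “tag 0” parts of the same log line, which is a reasonable sanity check on the assumed field order.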

Everywhere I found this error referenced, such as in forums, the final conclusion was typically “the SATA connector is defective”, or “it’s a power related issue” where the errors disappeared after upgrading the power supply, etc. It sort of makes sense with regards to the DTLE setting causing a similar issue.

It also looks strikingly similar to Ubuntu bug #550559 where there is no insight on the cause, other than “disabling NCQ in the kernel fixes it”… but the original (AMI) BIOS does not disable NCQ support in the controller, and it doesn’t fix the DTLE setting itself.

Chasing the wind

So, not knowing what to do exactly and not finding any information in datasheets, I decided to try and figure it out using some good old reverse engineering.

First, I needed to see what the original BIOS did… but when I opened it in UEFIExtract, it turned out there was a bunch of “modules” in it. What I mean by “a bunch” is about 1581 modules in the AMI UEFI BIOS, from what I could count. Yep. And “somewhere” in one of those, the answer must lie. I didn’t know what to look for; some modules are named, some aren’t, so I obviously started with the file called “SataController”. I thought I’d find the answer in it quickly enough simply by opening it up with IDA, but nope: that module file pretty much doesn’t do anything. I also tried “PcieSataController” and “PcieSataDynamicSetup” but those weren’t of much help either.

I then looked at the code in coreboot to see how exactly it initializes the SATA controller, and found this bit of code:

 /* Step 1 */
 sir_write(dev, 0x64, 0x883c9003);

I don’t really know what this does but to me it looks suspiciously like a “magic number”, where for some reason that value would need to be written to that register for the SATA controller to be initialized. So I looked for that value in all of the UEFI modules and found one module that has that same magic value, called “PchInitDxe”. Progress! But the code was complex and I quickly realized it would take me a long time to reverse engineer it all, and time was something I didn’t have: remember, shipments were blocked by this, and customers were asking us daily about their order status!

The RAM in storm

One realization that I had was that the error is always about this “READ FPDMA QUEUED” command… which means it’s somehow related to DMA, and therefore related to RAM—so, could there be RAM corruption occurring? Obviously, I tested the RAM with memtest and no issues turned up, and since we had finally received the hardware, I could push for receiving the schematics from the motherboard designer (I was previously told it would be a distraction to pursue schematics when there were so many logistical issues to fix first).

As I finally received the schematics and started studying them, I found that there were some discrepancies between the RComp resistor values in the schematics and what I had set in coreboot, so I fixed that… but it made no difference.
I thought that maybe the issue then was with the DQ/DQS settings of the RAM initialization (which are meant for synchronization), but I didn’t have the DQ/DQS settings for this motherboard and I couldn’t figure them out from the schematics. So what I did was simply hexdump all of the UEFI modules and grep for “79 00 51”, which is the little-endian 16-bit value of “121” followed by the first byte of the 16-bit value of “81”, two of the RComp resistor values. That allowed me to find 2 modules which contained the values of the RComp resistors for this board, and from there, I was able to find the DQ and DQS settings that were stored in the same module, just a few bytes above the RComp values, as expected. I tested with these new values, and… it made no difference. No joy.
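The byte-pattern search is easy to reproduce. Below is a minimal Python sketch of the same idea: build the pattern from the two known resistor values and scan a binary blob for it. The helper names and the sample “module” contents are made up for illustration; only the values 121 and 81 come from the text above.

```python
import struct


def rcomp_pattern(first, second):
    """Little-endian 16-bit `first`, followed by the low byte of 16-bit `second`."""
    return struct.pack("<H", first) + struct.pack("<H", second)[:1]


def find_all(blob, pattern):
    """Return every offset at which `pattern` occurs in `blob`."""
    offsets = []
    pos = blob.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(pattern, pos + 1)
    return offsets


pattern = rcomp_pattern(121, 81)

# Hypothetical module contents: 16 bytes of padding, then three 16-bit
# values (the third one, 100, is invented), then some filler.
module = b"\x00" * 16 + struct.pack("<3H", 121, 81, 100) + b"\xff" * 4

print(pattern.hex(" "), find_all(module, pattern))
```

Searching for only the first byte of the second value keeps the pattern short while still being specific enough to narrow thousands of modules down to a handful of candidates.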

A night with no moon

What else could I do? “If only there was a way to run the original BIOS in an emulator and catch every I/O it does to initialize the SATA controller!”

Well, there is something like that, it’s called serialICE and it’s part (sort of?) of the coreboot umbrella project. I was very happy to find that, but after a while I realized I can’t make use of it (at least not easily): it requires us to replace the BIOS with this serialICE which is a very very minimal BIOS that basically only initializes the UART lines and loads up qemu, then you can “connect” to it using the serial port, send it the BIOS you want to run, and while serialICE runs the BIOS it will output all the I/O access over the serial port back to you… That’s great, and exactly what I need, unfortunately:

- the Librems do not have a serial port that I can use for that;
- looking at the schematics, the only UART pad that is available is TX (the machine’s transmit line, which would only let me receive data from it), not RX (which I would need in order to send data to the machine);
- I can’t find the TX pad on the motherboard, so I can’t even use that.

Thankfully, I was told that there is a way to use xHCI usb debugging capabilities even on Skylake, and Nico Huber wrote libxhcidbg which is a library implementing the xHCI usb debug features. So, all I would need to make serialICE work would be to:

- port coreboot to use libxhcidbg to have the USB debugging feature, test it and make sure it all works, or…
- port my previous flashconsole work to serialICE, then find a way to somehow send/bundle the AMI BIOS inside serialICE, or put it somewhere in the flash so serialICE can grab it directly without me needing to feed it through serial.

Another issue is that for the USB debug to work, USB needs to be initialized, and there is no way for me to know if the AMI BIOS initializes the SATA controller before or after the USB controller, so it might not even be helpful to do all that yak shaving.

The other solution (to use flashconsole) might not work either because we have 16MB of flash and I expect that a log of all I/O accesses will probably take a lot more space than that, so it might not be useful either.

And even if one or both of the solutions actually worked, sifting through thousands of I/O accesses to find just the right one that I need, might be like looking for a needle in a haystack.

Considering the amount of work involved, the uncertainty of whether or not it would even work, and the fact that I really didn’t have time for such animal cruelty (remember: shipments on hold until this is fixed!), I needed to find a quicker solution.

The anger of a gentle man

At that point, I was starting to lose hope for a quick solution and I couldn’t find any more tables to flip:

“This issue is so weird! I can’t figure out the cause, nothing makes sense, and there’s no easy way to track down what needs to be done in order to get it fixed.”

And then I noticed something. While it will sometimes fail to boot, sometimes will boot without issues, sometimes will trigger ATA errors in dmesg, sometimes will stay silent… one thing was consistent: once Linux boots, we don’t experience any issues. There was no kernel panic “because the disk can’t be accessed”, no “input/output error” when reading files… there is no real visible issue other than the few ATA errors we see in dmesg at the beginning when booting Linux, and those errors don’t re-appear later.

After doing quite a few tests, I noticed that whenever the ATA errors happen for a few times, the Linux kernel ends up dropping the ATA link speed to 3Gbps instead of the default 6Gbps, and that once it does, there aren’t any errors happening afterwards. I eventually came to the conclusion that those ATA errors are the same issue causing the boot errors from SeaBIOS/GRUB, and that they only happened when the controller was set up to use 6Gbps speeds.

What if I was wrong about the DTLE setting, and potential RAM issues? What if all of this is because of a misconfiguration of the controller itself? What if all AMI does is to disable the 6Gbps speed setting on the controller so it can’t be used?!

So, of course, I checked, and nope, it’s not disabled, and when booting Linux from the AMI BIOS, the link was set up to 6Gbps and had no issues… so it must be something else, related to that. I dumped every configuration of the SATA controller—not only the PCI address space, but also the AHCI ABAR memory mapped registers, and any other registers I could find that were related to the SATA/AHCI controller—and I made sure that they matched exactly between the AMI BIOS and the coreboot registers, and… still nothing. It made even less sense! If all the SATA PCI address space and AHCI registers were exactly the same, then why wouldn’t it work?

I gave up!

…ok, I actually didn’t. I temporarily gave up trying to fix the problem’s root cause, but only because I had an idea for a workaround that could yield a quick win instead: if Linux is able to drop the link speed to 3Gbps and stop having any issues, then why can’t I do the same in coreboot? Then both SeaBIOS and GRUB would stop having issues trying to read from the drive, ensuring the drive will allow booting properly.

I decided I would basically do the same thing as Linux, but do it purposely in coreboot, instead of it being done “in Linux” after errors start appearing.
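This post doesn’t include the register writes involved, so the following is only a sketch of what “limit the link to 3Gbps” usually means at the AHCI level. It assumes the standard SControl (PxSCTL) layout from the SATA/AHCI specifications: DET in bits 3:0 and the SPD speed limit in bits 7:4. The helper names and the starting register value are hypothetical, and the real sequence also involves waiting for the link to come back up.

```python
# SControl (AHCI PxSCTL) field layout, as assumed from the SATA/AHCI specs.
DET_MASK = 0x00F    # bits 3:0, device detection / interface initialization
SPD_MASK = 0x0F0    # bits 7:4, highest allowed link speed
SPD_SHIFT = 4

SPD_LIMIT = {"none": 0, "1.5Gbps": 1, "3Gbps": 2, "6Gbps": 3}
DET_NONE = 0        # no action requested
DET_COMRESET = 1    # request interface re-initialization


def limit_speed(scontrol, limit):
    """Return `scontrol` with its SPD field replaced by the given limit."""
    return (scontrol & ~SPD_MASK) | (SPD_LIMIT[limit] << SPD_SHIFT)


def with_det(scontrol, det):
    """Return `scontrol` with its DET field replaced."""
    return (scontrol & ~DET_MASK) | det


# Hypothetical starting value: no speed limit, power-management transitions disabled.
scontrol = 0x300
limited = limit_speed(scontrol, "3Gbps")
reset_requested = with_det(limited, DET_COMRESET)  # renegotiate under the new limit
reset_released = with_det(reset_requested, DET_NONE)

print(hex(limited), hex(reset_requested), hex(reset_released))
```

The point of the sketch is that a speed limit only takes effect once the link is renegotiated, which is why changing the SPD field alone is not enough.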

While not the “ideal fix”, such a workaround would at least let the Skylake-based Librems boot reliably for all users, allowing us to release the shipments so customers can start receiving their machines as soon as possible, after which I would be able to take the time to devise the “ideal” fix, and provide it as a firmware update.

Sleeping under the wagon: an overnight workaround

I put my plan in motion:

- I looked at the datasheet and how to configure the controller’s speed, and found that I could indeed disable the 6Gbps speed, but for some reason, that didn’t work.
- Then I tried to make it switch to 3Gbps, and that still didn’t work.
- I went into the Linux kernel’s SATA driver to see what it does exactly, and realized that I didn’t do the switch to 3Gbps correctly. So I fixed my code in coreboot, and the machines started booting again.
- I also learned what exactly happens in the Linux kernel: when there’s an error reading the drive, it will retry a couple of times; if the error keeps happening over and over again, then it will drop the speed to 3Gbps, otherwise, it keeps it as-is. That explains why we sometimes see only one ATA error, sometimes 3, and some other times 20 or more; it all depends on whether the retries worked or not.
- Once I changed the speed of the controller to 3Gbps, I stopped having troubles booting into the system because both SeaBIOS and GRUB were working on 3Gbps and were not having any issues reading the data. However, once Linux boots, it resets the controller, which cancels out the changes that I did, and Linux starts using the drive at 6Gbps. That’s not really a problem because I know that Linux will retry any reads, and will drop to 3Gbps on its own once errors start happening, but it has the side effect that users will be seeing these ATA error messages on their boot screen or in dmesg.
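The retry-then-downshift behaviour described in the list above can be modelled in a few lines. This is a toy Python model, not the kernel’s actual error handling: the retry count, the speed list and the function name are all assumptions chosen for illustration.

```python
def read_with_downshift(attempt_read, speeds=(6.0, 3.0, 1.5), max_retries=3):
    """Retry a read at each speed, stepping down when retries are exhausted.

    `attempt_read(speed)` returns True on success. Returns a tuple of
    (speed that finally worked, number of errors seen along the way).
    """
    errors = 0
    for speed in speeds:
        for _ in range(max_retries):
            if attempt_read(speed):
                return speed, errors
            errors += 1  # each failure would show up as an ATA error in dmesg
    raise IOError("link failed at every speed")


# A link that never works at 6Gbps: three errors, then success at 3Gbps.
always_bad_at_6 = read_with_downshift(lambda speed: speed <= 3.0)

# A link with one transient failure: a single error, speed stays at 6Gbps.
outcomes = iter([False, True])
transient = read_with_downshift(lambda speed: next(outcomes))

print(always_bad_at_6, transient)
```

Even this simplified model shows why the number of logged errors varies from boot to boot: it depends entirely on how many attempts fail before one succeeds.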

The next chapter: probably less than 10 years from now

As you can see, small issues like that are a real puzzle, and that’s the kind of thing that can make you waste a month of work just to “get it working” (let alone “find the perfect fix”). This is why I typically don’t give time estimates on this sort of work. We’re committed, though, to getting you the best experience with your machines, so we’re still actively working on everything.

Here’s a summary of the current situation:

- You will potentially see errors in your boot screen, but it’s not a problem since Linux will fix it.
- It’s not a hardware issue, since it doesn’t happen with the AMI BIOS, we just need to figure out what to configure to make it work.
- There is nothing to be worried about, and I expect to fix it in a future coreboot firmware update, which we’ll release to everyone once it’s available (we’re working on integration with fwupd, so maybe we’ll release it through that, I don’t know yet).

It’s taken me much longer than anticipated to write this blog post (2 months exactly), as other things kept getting in the way—avalanches of emails, other bugs to fix, patches to test/verify, scripts to write, and a lot of things to catch up on from the one month of intense debugging during which I had neglected all my other responsibilities.

While I was writing this status report, I didn’t make much progress on the issue—I’ve had 3 or 4 enlightenments and I thought I suddenly figured it all out, only to end up in a dead end once again. Well, once I do figure it out, I will let you all know! Thanks for reading and thanks for your patience.

End notes: if you didn’t catch some references, or paragraph titles in this post, then you need to read The KingKiller Chronicles by Patrick Rothfuss (The Name of the Wind and The Wise Man’s Fear). These are some of the best fantasy books I’ve ever read, but be aware that the final book of the trilogy may not be released for another 10 years because the author loves to do millions of revisions of his manuscripts until they are “perfect”.


Coreboot on the Librem 13 v2, part 1

Youness Alaoui — Thu, 15 Jun 2017 20:38:18 +0000

Hello everyone! I am very happy to announce that the coreboot port to the Librem 13 v2 as well as the Librem 15 v3 is done! Wow, what an adventure! The entire thing took about 2 weeks of hard work, and an additional week of testing, fixing small issues that kept popping up, and cleaning up the code/commits.

It was truly an adventure, and I would have liked to stop and take the time to write 10 blog posts during that time, one for every major bump in the road or milestone, but I was under a strict deadline because we needed to finish the port before we started shipping the new Librem 13 v2 hardware (from now on referred to as ‘the v2’), so it could be shipping with coreboot pre-installed from day one. Now that the port is finished, I can finally start writing the first chapter in the story.

TL;DR: in the process of porting the Skylake-based Librem 13 v2 to coreboot, I have implemented a new debugging method (“flashconsole”) and added it to coreboot. It has been reviewed and merged upstream. The “flashconsole” driver is a debugging method for coreboot to write its console log to the SPI flash itself. So if you want to port a board to coreboot and you don’t have access to UART (or don’t want to solder UART wires to the motherboard), and can’t use USB debugging (on Skylake for example), then you can enable the CONSOLE_SPI_FLASH configuration option and the console log will be written to the flash. When you use your external programmer, just dump the flash first, then you can use ‘cbfstool rom.bin read -r CONSOLE -f console.log’ to extract the console log from it. No wires, no mess, no soldering required. Well, you do still need the external flasher, but you already have to use it to unbrick the machine since it wasn’t booting (and if it was booting, then you already have log access through cbmem, so you don’t need UART or flash console).

Getting your feet wet

Since I learned my lesson when I first tried to do the v2 port (my, oh so embarrassing first attempt), I decided to grab all the logs I could get from the v2 before doing anything else. After running all the commands from the Motherboard Porting Guide, I copied the files over to my work laptop, and when I tried to look at the flash contents, I couldn’t find the rom.bin file! Maybe the cat ate it? It turns out, when I was trying to dump the flash, I hadn’t noticed that the ‘flashrom’ command returned this error:

Found chipset "Intel Sunrise Point (Skylake-U Premium)" with PCI ID 8086:9d48.
ERROR: This chipset is not supported yet.

Well, that’s interesting, flashrom doesn’t support the Skylake processors (confirmed here). So my first task would be to add Skylake support to flashrom. After I mentioned that in IRC, Nico Huber said that he already did port flashrom to Skylake, but it’s just been untested/unreviewed/unmerged. I decided to start reviewing those changes, to get my feet wet, understand how flashrom works, how to read the Intel PCH datasheet, and also to contribute something to the coreboot/flashrom community other than by submitting new code. I’ve sent my comments on a few of those patches, some things got fixed and/or merged, until I had to stop because my deadline was catching up to me.

The boring part

The first thing I had to do when I started the port was to understand what needed to be done. I watched this talk by Shawn Nematbakhsh in the 2014 Chrome OS Firmware Summit where he explained the process of porting a new Chromebook board to coreboot. Since I knew that the v2 was based on the Skylake family of Intel processors, I looked at existing Skylake boards in the coreboot tree and I found a few: Google Chell, Google Glados, Google Lars, Intel KBLRVP and Intel Kunimitsu. I decided to use the Google Chell board as my starting point (this was a random choice), so I copied the mainboard/google/chell directory into mainboard/purism/librem13v2 and I started to edit the files. I mainly edited the Kconfig/Kconfig.name/board_info.txt files to replace ‘google/chell’ with ‘purism/librem13’ everywhere I found it. Then I started removing files or references that were Chrome OS specific and after fumbling a bit in the dark, and removing anything that I thought wasn’t needed or that I didn’t understand, I managed to get coreboot to compile.

I also took the time of course to extract the vbios, download the FSP image, adapt the Kconfig file, and update the GPIO values to match the original firmware (more on that in a future post).

When I tested it though, it didn’t work, obviously. I needed to know what went wrong, I needed to debug it! Unfortunately, I couldn’t find any UART pads on the motherboard, and when I looked at the usbdebug (which I used for the v1), I realized that it is not implemented for Skylake. After some research, I realized that usbdebug was actually a feature of Broadwell processors, but on Skylake it’s different, it’s called DCI, and it requires some proprietary hardware, and proprietary software to talk over a proprietary protocol in order to get the DCI debugging working. I did not want to do that, so I looked for an alternative.

Now the fun begins

For me, the obvious choice for debugging was to use the SPI flash itself. After all, I was testing different things blindly, and every time that it failed to work, I was using my external flasher to write a new coreboot version to it. Why not read the flash at the same time and grab the log from it? This felt like such an obvious method for debugging but to my surprise, it wasn’t implemented in coreboot, so I decided to do just that.

First, I needed to be able to test my implementation, so I decided to implement it for Broadwell so I could test it on the v1 hardware first. Once it would work on the v1, then I could try it on the v2. I looked at the API used by coreboot to write to the flash, and used that to write my log to a fixed offset in the flash (which I knew was unused), then tested it. It worked! Wow, that was fast. This is going to be soooo easy! Yeah… right…

So now that it worked with a hard-coded offset, I needed to make it write to a CBFS file. The CBFS (CoreBoot FileSystem, I assume) has various sections in it called ‘files’, and after looking at the API for a bit, I figured out how to make it parse the CBFS and give me the offset and size of the ‘console’ file. I then changed the Makefile.inc so it would add a ‘console’ file to CBFS when the option is enabled. I then tested, and my ‘cbfsconsole’ logger didn’t work anymore. I played with it for a while without understanding the problem. Eventually, I found out that if I was writing at offset 0x200000 (my hardcoded value, pointing to unused space), it was working, but if I was writing to offset 0x260000 (the offset of the ‘console’ file in CBFS), it wasn’t working. I even removed the cbfs-related code and just hardcoded the value to 0x260000, and it was still not working. Aaron Durbin (adurbin on IRC) came to the rescue: after understanding what I was doing and the issues I was having, he asked me *how* I was creating the ‘console’ file that was added to cbfs, and he immediately saw the problem. The file was copied from /dev/zero, so it was all zeroes, but that’s not how NOR flash works. Apparently, you can’t write a ‘1’ bit to NOR flash: a write can only change a ‘1’ into a ‘0’, never a ‘0’ into a ‘1’. Since my console file was all zeroes, none of my writes had any effect. Well, that was news to me, I didn’t know that’s how NOR flash worked. So, what do you do if you want to turn a ‘0’ into a ‘1’? Well, that’s simple, you do a ‘sector erase’, which will erase the entire sector (turning it entirely into 1s (0xff data)), *then* you can write back your data. That also explains this life-long question I always had of “why does flashrom always erase sectors before it writes the data into it”.

So, here is the right way of populating any sector of the flash on which you expect to write anything :

dd if=/dev/zero count=1 bs=$(__cbfsconsole_size) | tr '\000' '\377'
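To make the “writes can only clear bits” rule concrete, here is a small Python model of a NOR flash sector. The class and its method names are invented for this illustration; it models only the bit-level behaviour described above, not any real flash chip’s command set.

```python
class NorSector:
    """Toy model of one NOR flash sector (hypothetical, for illustration)."""

    def __init__(self, size):
        self.data = bytearray(b"\xff" * size)  # erased state: every bit is 1

    def program(self, offset, payload):
        # Programming can only turn 1 bits into 0 bits, so the stored byte
        # becomes the bitwise AND of the old byte and the new one.
        for i, byte in enumerate(payload):
            self.data[offset + i] &= byte

    def erase(self):
        # Erasing is the only way to get 1 bits back, and it hits the whole sector.
        self.data[:] = b"\xff" * len(self.data)


# A sector pre-filled with zeroes, like a file created from /dev/zero.
zeroed = NorSector(16)
zeroed.data[:] = bytes(16)
zeroed.program(0, b"log")

# A properly blanked sector, equivalent to the dd | tr output above.
blank = NorSector(16)
blank.program(0, b"log")

print(bytes(zeroed.data[:3]), bytes(blank.data[:3]))
```

The zero-filled sector silently swallows the write, while the 0xff-filled one stores it, which matches the symptom of writes working at one offset and not at the other.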

Now that I had a proper ‘console’ file and my driver was writing the log, I added the code to do a sector erase so whenever the PC boots, the ‘console’ file would be erased entirely so the log can be written to it safely. After all, if I try to write over existing data, it will just be a mess of old and new data. Unfortunately, this caused me my second headache.

Before I get into that, I will first quickly talk about all the hours I wasted trying to enable cbfsconsole for bootblock and romstage. You see, coreboot has 3 stages, each being executed as independent programs. The bootblock is the very first code that gets executed when you turn on the computer; it will set up the processor to use its cache to act as RAM, then it will execute the romstage. The romstage will initialize the RAM, then the ramstage is executed, which will do most of the actual hardware initialization. When I wrote cbfsconsole, I had it initially enabled only for ramstage, and now that it was working, I decided to enable it for the bootblock and romstage stages. Unfortunately, that didn’t work because I was using a global variable in my code, and because the SPI driver is doing crazy things like ‘malloc’ (allocating memory) and also using global variables, which of course, you can’t do before the RAM has been initialized.

So after messing with it for a while, I found out that coreboot uses a nice trick where you can declare global variables in a certain way (using the CAR_GLOBAL macro) and access them differently (using car_get_var and car_set_var), and then you can use global variables, which will just be stored in the cache-as-ram (CAR) section of the binary, instead of being stored in RAM. After I ported my cbfsconsole to use the CAR_GLOBAL trick, and then spent a few hours trying to get it to compile, I eventually realized that it was a crazy idea: cbfsconsole used a global variable, but it also used the SPI flasher API which itself used global variables, and the SPI flasher itself used various SPI driver implementations which themselves used malloc and global variables. Even after I disabled all the drivers and only left the Winbond driver, and ported it to use CAR_GLOBAL, it was not finished, because the SPI driver uses Broadwell-specific PCI interfaces, which I had to include and which themselves used global variables. It was a never-ending game where it was not trivial to port the entire thing to use CAR_GLOBAL. So I eventually gave up and thought that I could never get the logger to work on the bootblock and romstage stages (which is crucially important because the crash on the v2 was happening in those stages).

The following day however, as I was trying to compile coreboot for the v2 (skylake), I enabled the cbfsconsole by mistake and it compiled. Shocked, I realized that on skylake, there was a new hardware-sequencing implementation for the SPI driver which didn’t use any global variables or anything. So yeah, cbfsconsole would have worked if I had just tried it on Skylake before trying it on Broadwell!

Since my cbfsconsole was now properly compiling for Skylake, it was time to test it on the v2. Unfortunately, it was obviously not working. At that point though, I was already joined by Matt “Mr. Chromebox” DeVillier, who had access to a Chromebook Chell with the Servo (debug) connector installed, and he volunteered to test the cbfsconsole on his Google Chell to see if it would work. Once he enabled cbfsconsole, his Chell Chromebook stopped booting, but using his servo debugger, he was able to get the debug logs which showed him this peculiar line :

SPI Transaction Timeout (Exceeded 15 ms) at Flash Offset d0f000 HSFSTS = 0x3f066020

After a lot of debugging, reading the datasheet for the SPI hardware sequencer and testing various things, the only conclusion I could get to was that the Skylake hardware SPI sequencer was frozen for no particular reason, and that was causing everything to just freeze. Well, to be honest, it was freezing because the SPI transactions would timeout, causing the SPI driver to print the debug line “SPI Transaction Timeout” above, which itself would make it try to write it again to the flash, and we end up in an infinite loop.

I gave up again on trying to understand it and I went to the #coreboot IRC channel asking if someone knew why this would happen. Aaron Durbin came to the rescue once again, telling us that some implementations might freeze if we try to erase a sector that is already erased. Of course the sector was already erased, since it was all full of 0xff data (see above), and trying to erase it is what’s causing all our problems! Once I removed the code that erases the sectors, and replaced it with code that would read the ‘console’ file, and set its starting offset to the first occurrence of the 0xff byte, the cbfsconsole worked! Matt reported that his Chell device booted, and when I tested on the v2, I finally got some output!
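The “start at the first 0xff byte” approach can be sketched in a few lines of Python. This is a model of the idea rather than the coreboot driver itself; the function name is hypothetical, and it relies on the assumption that the log is plain ASCII text, so a 0xff byte can never be part of the log.

```python
def first_erased_offset(region):
    """Return the offset of the first still-erased (0xff) byte in `region`.

    If the region is completely full, return its length, meaning there
    is no room left to append anything.
    """
    pos = region.find(b"\xff")
    return len(region) if pos == -1 else pos


# Hypothetical console region: one previous boot log, then erased space.
region = b"bootblock starting...\n" + b"\xff" * 10
append_at = first_erased_offset(region)

# A region with no erased bytes left.
full_at = first_erased_offset(b"x" * 8)

print(append_at, full_at)
```

Appending after the existing text means the log from earlier boots is kept, and no erase is ever issued from within the firmware.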

Finalizing things

Now that we had a way to get a debug log out of coreboot, it was time to get it to work on the v2. Unfortunately, it was crashing it seems on the very first thing that it was trying to do. This was the entire log :

coreboot-4.5-1805-g7da7ddf-dirty Fri May 12 20:15:36 UTC 2017 bootblock starting...
Calling FspTempRamInit

It didn’t take long to fix that because Matt had experience with that. It looks like the FSP needs us to give it a Microcode address and length that are not zero. Even though the FSP Integration Guide says that setting the MicrocodeRegionBase and MicrocodeRegionLength parameters to zero means that no microcode update is available, the reality is that the FSP simply freezes or crashes if a valid non-zero-sized Microcode is not provided to it.

In this case, coreboot was disappointing, because on Broadwell, the Microcode files were automatically included in the build since they are distributed with coreboot (in the ‘blobs’ repository), but for Skylake, they are not available, and that’s because Intel decided to make you accept a non-distribute license for downloading the microcode files, so they were of course never added to coreboot. The problem here is that coreboot simply adds a zero-sized cpu_microcode_blob.bin file to CBFS instead of complaining that the file could not be found. Also, we have to manually set the memory location and file size in the config file, even though those values could easily be programmatically retrieved using the same CBFS APIs I used for the ‘console’ file.

Anyways, once I added the microcode file and set its position and length in the config, the FspTempRamInit was successful and the bootblock finished and loaded the romstage, which of course crashed at the FspMemoryInit call.

I will spare you the details today of how we got the memory init to work, and I will leave that for a future blog post. Suffice to say that it wasn’t trivial, but once we got memory working, we ended up with another crash. This time, the last line from the log was :

Calling FspTempRamExit API

After investigating the code, it looked like right after that call, the romstage main was returning, and it was jumping back into assembly code to tear down the Cache-as-RAM system that had been set up initially, then it would call the ramstage. I started trying to understand all it did, and trying to figure out why it would crash in the assembly code, and what might have been different in the various registers, etc. Thankfully, I didn’t waste too much time on this, because Matt was asked to test something on a Google Sentry Chromebook, so he switched his environment to the Sentry and tested what he had to test for someone else, and then he saw that it was crashing at the exact same place as us on the v2. It was working for him before, and yet, it wasn’t working now. He realized he still had the cbfsconsole enabled, and once he disabled it, it booted. This was an incredibly lucky thing for us to realize (without wasting a week on this), that cbfsconsole was the one blocking coreboot. Once I disabled cbfsconsole support for bootblock and romstage, I was happy to see the ramstage booting and the screen light up!

For a while though, I kept it like that. We didn’t need to have cbfsconsole working for both romstage and ramstage, as long as ramstage was booting, that’s all I needed to see the logs of. Fast forward a couple of weeks later, and after the port was done, I came back to this issue. Actually, I simply mentioned the problem that I “still have to fix” on IRC and Aaron Durbin diagnosed the problem and came up with the solution right away. You see, when you use CAR_GLOBAL variables and FspTempRamExit is called (which tears down the cache-as-ram), the contents of those global CAR variables are copied to the actual RAM, and it works just fine. However, if your variable itself contains a pointer to another CAR variable, then your data was copied but its content still points to the old cache address, so when you try to access it, you can’t. This is what was happening for cbfsconsole. The SPI handler itself had the bug because it was storing a pointer to the SPI driver and that variable was wrong after the CAR was removed. The trick to fix it was to simply replace the calls to “car_get_var_ptr” with “car_sync_var_ptr”, which would migrate the data to the appropriate RAM location before returning the CAR pointer. Once I did that, the cbfsconsole was working on all 3 stages at the same time and without issues.
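The stale-pointer problem is easier to see in a model. The Python sketch below treats memory as a dictionary from address to value; the base addresses, offsets and function names are all invented for illustration, and `sync_ptr` only mimics the behaviour described above rather than reproducing coreboot’s implementation.

```python
# Hypothetical address ranges, for illustration only.
CAR_BASE, CAR_SIZE = 0xFEF00000, 0x1000
RAM_BASE = 0x00100000


def in_car(addr):
    """True if `addr` falls inside the (now torn down) cache-as-RAM window."""
    return CAR_BASE <= addr < CAR_BASE + CAR_SIZE


def migrate(car):
    """Copy every CAR cell to the same offset in RAM, leaving values untouched."""
    return {RAM_BASE + (addr - CAR_BASE): value for addr, value in car.items()}


def sync_ptr(ptr):
    """Translate a pointer into the old CAR window to its migrated RAM address."""
    return RAM_BASE + (ptr - CAR_BASE) if in_car(ptr) else ptr


# Two CAR globals: some driver state, and a handle that points at it.
car = {
    CAR_BASE + 0x10: "spi driver state",
    CAR_BASE + 0x20: CAR_BASE + 0x10,  # a pointer to the first variable
}

ram = migrate(car)               # what happens when cache-as-RAM is torn down
stale = ram[RAM_BASE + 0x20]     # the copied pointer still holds a CAR address
fixed = sync_ptr(stale)          # translated to where the data lives now

print(hex(stale), stale in ram, hex(fixed), ram[fixed])
```

A plain byte-for-byte copy preserves the pointer’s value, which is exactly the problem: the value is an address that no longer exists once the cache-as-RAM window is gone.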

Porting to FMAP

There was one discussion we had on IRC about cbfsconsole, which was about how it should not be writing to the CBFS in the first place. My argument was that the MRC cache is also written in a CBFS file. The MRC cache is a cache of memory settings: memory discovery, test and initialization take about 10 seconds, but the resulting RAM settings are saved in the cache and can be provided to the FSP, so subsequent boots only take 200 ms to initialize the RAM. Aaron thought that was wrong too, and pointed out that the CBFS API doesn’t provide any writing mechanism. That is true, after all: I used the CBFS API to find the offset and size in the flash for the console file, and used a different API to write to it. Coreboot has a different concept called FMAP (Flash map), which is used to map different areas in the flash to be used by coreboot. The FMAP defines the size of the flash, the offset and size of the BIOS region, and, within that region, the offset and size of the CBFS. Aaron believes that the console log should be written to a separate FMAP area beside the CBFS rather than inside a CBFS file. The reason is simple: while it is currently possible to write to a CBFS file, the CBFS could implement a checksum system in the future, and those checksums would break if we wrote to it directly in the manner I just described (I think it actually already has checksums, I just ignore them).
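To make the nesting concrete, here is a minimal sketch of a flash map as plain data, with a console area sitting beside the CBFS instead of inside it. Every offset and size below is hypothetical (they do not come from a real flash map description), the `Region` class and `read_region` helper are invented for illustration, and only the CONSOLE name is taken from this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """A named flash area described by its offset and size in bytes."""
    name: str
    offset: int
    size: int

    @property
    def end(self):
        return self.offset + self.size

    def contains(self, other):
        """True when `other` lies entirely inside this region."""
        return self.offset <= other.offset and other.end <= self.end

    def overlaps(self, other):
        """True when the two regions share at least one byte."""
        return self.offset < other.end and other.offset < self.end

def read_region(image, region):
    """Slice a region out of a full flash dump."""
    return bytes(image[region.offset:region.end])

# Hypothetical layout: an 8MB flash, a BIOS region, and two areas inside it.
FLASH = Region("FLASH", 0x000000, 0x800000)
BIOS = Region("BIOS", 0x200000, 0x600000)
CONSOLE = Region("CONSOLE", 0x200000, 0x020000)
COREBOOT = Region("COREBOOT", 0x220000, 0x5E0000)  # the area holding the CBFS

layout_ok = (
    FLASH.contains(BIOS)
    and BIOS.contains(CONSOLE)
    and BIOS.contains(COREBOOT)
    and not CONSOLE.overlaps(COREBOOT)
)

# A blank stand-in for a flash dump, with a few log bytes in the console area.
dump = bytearray(FLASH.size)
dump[CONSOLE.offset:CONSOLE.offset + 5] = b"hello"
console_log = read_region(dump, CONSOLE)
```

Because the console area does not overlap the CBFS area, writing log bytes into it can never disturb CBFS contents or any checksum computed over them, which is the property Aaron was arguing for.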

After the discussion about the pros and cons of using FMAP instead of CBFS to store the console log, I decided to rename my cbfsconsole driver to ‘flashconsole’, and I implemented support for using an FMAP area. The code is even simpler now, because the FMAP API can directly return the right read/write object for us to use and there is no need to work around the limitations of the CBFS API. When the option is enabled, the CONSOLE area is automatically added to the resulting coreboot image, and once you dump your flash, you can extract the FMAP console using this command:

cbfstool rom.bin read -r CONSOLE -f console.log

And that’s it.

Conclusion

The Librem port was a great experience, but it would never have been possible without having access to a debug log. I am happy to have implemented the flashconsole method of debugging, which will be very useful in the future for anyone who wants to port coreboot to a new board, for which they do not have easy access to UART pads. I have sent this feature upstream for review, and it has been merged. You can see it here.

In my next blog post, I will explain how we got the memory init to work and the various issues we ran into. I will also cover the actual steps needed to port a new board to coreboot (GPIO, memory, PCI, ACPI) and what I’ve learned in the past 2 weeks.

Stay tuned!

The post Coreboot on the Librem 13 v2, part 1 appeared first on Purism.

Reverse-engineering the Intel Management Engine’s ROMP module

Youness Alaoui — Wed, 10 May 2017 16:12:30 +0000

Last month, while I was waiting for hardware to arrive and undergo troubleshooting, I had some spare time to begin some Intel ME reverse engineering work.

First, I need to give a shout-out to Igor Skochinsky, a Hex-Rays developer, who has been working on reverse engineering the Intel ME for a while and who has been very generous in sharing his notes and research on the ME with us. This is going to be a huge help and will cut down months of reverse engineering and guesswork. Igor was also very helpful in getting me to understand the bits that didn’t make sense to me.

The first thing I wanted to try and reverse was the ROMP module. It is one of the two modules that me_cleaner doesn’t remove, and given how small it is (less than 1KB of code+data), I thought it would be a good starting point. Turns out my hunch was right, as I finished reverse engineering that module after only a couple of days.

I have uploaded the C equivalent of the code to my GitHub account and you can see the file here: romp.c as well as the rapi.h header that I used for defining RAPI (ROM API) calls and data structures (most of that info was taken from Igor’s shared information). Note that this romp.c/rapi.h code is not meant to be compiled (for now), but serves more as a proof of concept, or a way for others who are less at ease with assembly to audit the code and understand what it does exactly. A long term goal would be to make it compile and generate a binary-compatible result (with the same hash as the Intel files).

There is some more good news too: in that small bit of code, I have already found one bug in their implementation. I doubt that particular bug instance is exploitable as-is, but it’s a good indicator that their code is probably going to be full of bugs and that it won’t be long before we find an exploitable one.

The bug is simple: when the ROMP module reads the partition from the SPI flash, after it validates the RSA signature, it copies it to a memory address and locks that memory address so it can’t be modified by the main CPU. However, they made a mistake and didn’t shift the size by 2 (effectively multiply by 4), which means that they are only locking ¼ of the region that needs to be locked. This could mean that there is a portion of the partition manifest that is accessible by the CPU which could allow us to modify the hash of a module and put our own code in it, since the signature has already been checked, and the module hash is in that ¾ portion that isn’t locked.
Unfortunately, we can’t seem to be able to use it because the ROMP module is executed very early, before the DRAM is initialized, and so it’s probably using/locking the internal RAM of the Intel ME ARC core, not the RAM of the main CPU. If, however, I can find the same bug (crossing fingers for a lazy Intel developer doing a copy/paste of that code) in the BUP module or any other module that executes on the main RAM, it would become exploitable and would allow us unsigned code execution on the ME processor… Which would be a nice shortcut for us.
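The arithmetic behind the “only ¼ is locked” claim can be sketched as follows. This assumes, based only on the “shift by 2” remark above, that the size is stored as a count of 4-byte words while the lock takes a length in bytes. The function names and the example size are hypothetical and are not taken from romp.c.

```python
def locked_bytes(size_in_words, shift_applied):
    """Number of bytes the lock covers, with or without the missing shift."""
    return size_in_words << 2 if shift_applied else size_in_words

def unlocked_range(base, size_in_words, shift_applied):
    """Half-open (start, end) byte range of the copied data left unprotected."""
    total = size_in_words << 2  # the copy itself covers the full region
    return (base + locked_bytes(size_in_words, shift_applied), base + total)

# Hypothetical example: a 0x100-word (0x400-byte) manifest copied to address 0.
buggy_gap = unlocked_range(0, 0x100, shift_applied=False)
fixed_gap = unlocked_range(0, 0x100, shift_applied=True)
exposed_fraction = (buggy_gap[1] - buggy_gap[0]) / (0x100 << 2)
```

With the shift missing, the gap runs from the end of the first quarter to the end of the region, so three quarters of the copied data stay writable; with the shift applied the gap is empty.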

As the Intel Management Engine’s ROMP module is now reversed and auditable (bringing us one little step closer in the freedom roadmap), we now understand much better what it’s doing (Igor thinks it might be a way to recover from an incomplete ME update, since it looks for an FTPR-named partition in the NFTP partition of the FPT header). We’ll continue digging soon. For the time being, as I received hardware prototypes for the new batch of Librems, I need to get back to porting them to coreboot.

The post Reverse-engineering the Intel Management Engine’s ROMP module appeared first on Purism.

Preventing AMI’s BIOS from interfering with coreboot flashing on the Librem 13

Youness Alaoui — Fri, 14 Apr 2017 21:05:07 +0000

I wrote recently about one of our collaborators having tried to install coreboot and unfortunately bricking his laptop in the process. I sent him mine as a replacement, then he swapped hard drives and he sent me his testing unit, so that I could investigate what happened. This is what we found.

Once I received his laptop, I obviously wanted to dump its flash to see what went wrong, but it was… not cooperating. My external programmer setup was working fine with my previous laptop, but now it was somehow completely unable to detect the ROM chip. I set out to figure out why and tried everything: I used my logic analyzer to see if the SPI data had the right values, I tried a different power supply, a different FTDI chip, different USB cables, anything I could think of… and it made no sense whatsoever. Sometimes it would show me the right data in the logic analyzer, but the FTDI would not see the same data; most of the time though, it looked as if the MISO pin was being pulled high, so I even tried to wire a resistor to ground in order to pull down the MISO pin, without luck.

Then, after four hours of intense troubleshooting, it suddenly worked, without warning. What had I done to make it work? I had no idea at first, but I remembered my last sequence of events before it worked:

I had a suspicion of a defective FTDI chip and was using my backup chip (which was also not working up until now);
I had powered down the ATX power supply providing 3.3V to the motherboard;
I used the laptop’s power supply to turn on the laptop and try to read the chip while it was on (knowing quite well that it wouldn’t work since the motherboard itself would be driving the SPI chip’s pins high/low, thus interfering with the FTDI chip);
I shut down the laptop, removed the charging cable and put the external power supply on again;
I tried to dump the flash, and that’s when it worked!

I had done similar manipulations hundreds of times already, and whenever I connected the flasher to the ROM (battery removed of course and external ATX power supply on), the flasher couldn’t detect the chip. Quite annoying.

Then I found a way to trick the hardware. While the FTDI chip is connected to the SPI flash, if I insert the charging cable into the laptop for about 2 seconds (and quickly remove it) then turn the power supply back on… suddenly the FTDI can detect and read/write to the flash chip without problems. Without a doubt this makes no sense at all (the joys of hardware), but it is 100% reproducible, so when I discovered that trick I simply thanked the lords of Kobol for their offering and I dumped the flash, then unbricked the laptop. Oh, and yes, my original FTDI chip had become defective, so switching to the backup one was also a required step.

Encounters with chimeras

Behold, the “AMI-coreboot” chimera.

Inspecting the flash dump, I was in for a surprise: the coreboot image was corrupted in a strange way, it had parts of the old AMI BIOS in the middle of the coreboot BIOS. Essentially, a chimera.

Let’s study this strange creature.

The old factory BIOS started at offset 0x200000 in the ROM (the first 2MB of the ROM are reserved for the ME)
After coreboot was flashed, the old factory BIOS had apparently “moved” and was now in the ROM at offset 0x220000…
Unfortunately, coreboot was already at offset 0x200000, so 128KB (0x20000 bytes) into the coreboot image’s region, the old BIOS had somehow inserted itself there and was corrupting our coreboot image.
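One way to measure that kind of displacement between two dumps is to take a block from the original image and look for it in the corrupted one. The sketch below uses synthetic stand-in images rather than real flash contents, and `find_displacement` is a hypothetical helper, not part of flashrom or of my script. It also only reports the first match, so it assumes the probed block is distinctive.

```python
def find_displacement(original, chimera, offset, length):
    """Return how far a block of `original` has moved inside `chimera`.

    The result is the byte distance between the block's old offset and the
    first place it is found in `chimera`, or None if it is not found at all.
    """
    needle = bytes(original[offset:offset + length])
    found = bytes(chimera).find(needle)
    return None if found < 0 else found - offset

# Synthetic stand-ins: a recognizable 4KB block at 0x200000 in the "factory"
# image, and the same block at 0x220000 inside an image filled with 0xCB.
block = bytes(range(256)) * 16
original = bytearray(0x240000)
original[0x200000:0x200000 + len(block)] = block
chimera = bytearray(b"\xcb" * 0x240000)
chimera[0x220000:0x220000 + len(block)] = block

displacement = find_displacement(original, chimera, 0x200000, len(block))
displacement_kib = displacement // 1024
```

For these stand-ins the displacement is 0x20000 bytes, the same 128KB shift seen in the real dumps.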

I was completely baffled by that discovery. How could it have happened? The only thing I knew was that our collaborator had flashed coreboot from within Qubes OS, so I installed Qubes OS, spent some time learning how to use it, figured out how to install flashrom into “dom0”, how to move my coreboot image into dom0 and how to flash it. I rebooted and… it was working fine. So the problem wasn’t QubesOS, which brought me back to square one. I shelved the problem at that time and moved on to writing my coreboot installation script instead.

After my coreboot installation script was done, it was time to beta-test it (beyond just testing it myself). François Téchéné from our team volunteered to try it on his Librem 13. To be sure, I hopped on a conference call with him for this operation. Everything was going well, and after he powered off his laptop… it was a brick too. Curses!

Well, we’re back to the chimera then. Now we have seen two chimeras, so we know they’re real. Why was it working for me (I tested at least 100 times during development of the script) and Todd but not for some others? I tried the script again on my machine, powered off the laptop, and then got a brick too. I retried multiple times and kept getting the same corrupted coreboot image with the displaced old BIOS written in the middle of it. This was making less and less sense, and making me more and more tense. What was wrong with my alchemy?!

My script was already testing that flashrom was writing the ROM properly (by reading it back and ensuring it would get the exact same hash), but in this case, after turning off the computer, I would get a different reading (the corrupt image) when reading with an external flash programmer. Below are some comparisons (left vs right) where you can see, on the left, the original BIOS image, and on the right, the chimera:

You see how the images match but the offsets are different (from 0x200000 to 0x220000):

And this is the end of the changes, where you can see that the data is not entirely the same (the top part and some of the lower part are mangled), and that the data up to 0x21FFFF in the original BIOS matches the data up to 0x23FFFF in the chimera:

I wondered if there was a bug in flashrom or in the internal SPI programmer that would cause this corruption. Maybe it was writing incorrectly but somehow caching the data so that when we read it back (and when flashrom itself verifies the data after it wrote it) it would give us the data we asked to write but not the one that was actually written?
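The read-back check and the later disagreement can be illustrated with a small sketch. The three byte strings are synthetic stand-ins (the image handed to the flasher, the immediate read-back, and a later external dump). The helper names are hypothetical and nothing here invokes flashrom.

```python
import hashlib

def sha256_hex(data):
    """Hex SHA-256 digest of a bytes-like image."""
    return hashlib.sha256(data).hexdigest()

def first_mismatch(expected, actual):
    """Offset of the first differing byte, or None if the images are identical."""
    for offset, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return offset
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None

# Synthetic stand-ins, not real flash contents.
intended = bytes([0xCB]) * 1024
readback = bytes(intended)            # what the flasher reads right after writing
external = bytearray(intended)        # what an external programmer reads later
external[512:520] = b"NVRAMold"       # something rewrote part of the chip

readback_ok = sha256_hex(readback) == sha256_hex(intended)
external_ok = sha256_hex(external) == sha256_hex(intended)
corruption_at = first_mismatch(intended, external)
```

A matching hash right after the write only proves the chip was correct at that moment; it says nothing about what happens to the flash afterwards, which is why the external dump was needed to see the corruption.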

Finding the alkahest

As I asked around on IRC in the #flashrom channel, agaran (IRC nickname) came to the rescue and after analyzing the binary files and the diffs (images above), he noticed that the “old BIOS” data looked more like some “NVRAM Storage” instead of being actual BIOS code. He suggested that maybe something was writing that data when I was powering off the laptop.

At that moment, I had an epiphany: all these last few months, I was almost always rebooting (or suspending) the machine after flashing coreboot, and when I was actually powering it off, it was quite probably always when I was testing a new coreboot version on a laptop that already had a previous version of our coreboot image! When I was in the conference call with François, he said, “I’ll power off the laptop”, which somehow stuck in my mind, so when testing on my own later that day I was subconsciously using the “poweroff” command instead of the “reboot” command, which made all the difference!

I then spent a couple of days testing: I flashed the original factory BIOS, powered off, then flashed coreboot, then powered off again (or suspended, or rebooted, depending on the test case), probably 20 times for each scenario (every poweroff required flashing using the external hardware flasher, which required the charging cable trick to make it detect the ROM). I was then able to confirm that:

Whenever we use “poweroff”, or even “echo o > /proc/sysrq-trigger”, the flash gets corrupted.
If we “reboot”, “pm-suspend”, “halt”, do a forced shutdown by holding the power button, or “echo b > /proc/sysrq-trigger”, the flash does not get corrupted and coreboot is fine.
No corruption happens if I power off when coreboot is running. This only happens if the currently-running BIOS is the factory AMI BIOS.

I had eventually connected my logic analyzer again, and realized that the corruption was happening after the Linux power off sequence, right after the screen backlight turns off and before the power LED turns off. I was happy to have written my previous script to analyze the logic trace and return a command execution log, which showed me that indeed, that’s what was running and rewriting my sectors.

The best explanation we have so far is that the factory BIOS has an SMM hook on the S5-state transition (shut down), which gets part of the old BIOS to execute and verify whether the BIOS settings it has in memory still match the values in the ROM; if they don’t match, it writes them back again. Because why not. And the reason the data is displaced by an offset of 128KB is probably that the same BIOS code does not find the NVRAM storage at its usual 0x200000 offset, so it decides to leave that area intact and instead write it at the 0x220000 offset.

Turning lead into gold

In theory, the workaround is easy: reboot and do not power off after installing coreboot. However, as I’m writing an install script that should work for every Librem 13 user, and I want it to be as fool-proof as possible, I cannot allow that. Sure, I could have the script’s startup tell people that they need to reboot their machine at the end of the script and that it “can’t be postponed”, and then have the reboot done as part of the script itself, but what if someone prevents it from rebooting, and ends up shutting down their laptop instead? I’m sure there will be someone who does that, bricking their laptop in the process, and we certainly don’t want that. So I started looking for an alternative solution that leaves no margin for error.

There is no obvious way (that I could find) to disable that SMM hook, so I eventually decided to simply shift the coreboot image to a different region in the flash ROM, leaving that “AMI BIOS settings” area free. By doing so, we waste about 256KB of the 6MB BIOS region, but we are not that constrained for space and it’s a much better solution than risking a brick.
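As a back-of-the-envelope check of that trade-off, the sketch below works out how much of the BIOS region stays usable and whether a 256KB hole would cover the range the AMI code was seen writing to. The BIOS start, BIOS size, 256KB figure and the 0x220000 to 0x23FFFF write range come from this post. Placing the hole at the very start of the BIOS region is an assumption made here for illustration, and `covers` is a hypothetical helper.

```python
KIB = 1024

BIOS_START = 0x200000               # the first 2MB of the ROM belong to the ME
BIOS_SIZE = 6 * 1024 * KIB          # 6MB BIOS region
RESERVED_SIZE = 256 * KIB           # space left unused by the shifted image
OBSERVED_WRITE = (0x220000, 0x240000)  # half-open range of the displaced data

# Assumption: the unused window starts right at the beginning of the BIOS region.
reserved = (BIOS_START, BIOS_START + RESERVED_SIZE)

def covers(outer, inner):
    """True when half-open range `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

hole_covers_write = covers(reserved, OBSERVED_WRITE)
usable_bytes = BIOS_SIZE - RESERVED_SIZE
wasted_percent = 100.0 * RESERVED_SIZE / BIOS_SIZE
```

Under that assumption the hole covers both the usual 0x200000 settings location and the displaced 0x220000 one, at a cost of a little over 4% of the BIOS region.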

So now, I have a coreboot image that works in both situations; whether you power off or reboot your laptop, it will not get corrupted anymore, no matter what (but my script still does an automatic reboot at the end to make sure the old BIOS SMM hook doesn’t get executed, as an extra safety measure). What a great way to end the week!

— this epic tale was brought to you by Youness’ scientific rigor and Jeff’s homeric storytelling

The post Preventing AMI’s BIOS from interfering with coreboot flashing on the Librem 13 appeared first on Purism.

Neutralizing the Intel Management Engine on Librem Laptops

Youness Alaoui — Thu, 09 Mar 2017 15:00:37 +0000

In my last blog post, I spoke of the completion of the Purism coreboot port for the Librem 13 v1 and mentioned that I had some good news about the Intel Management Engine disablement efforts (to go further than our existing quarantine), asking you to “stay tuned” for more information. Since then I got a little side-tracked with some more work on coreboot (more below), but now it’s time to share the good news with you!

Ladies and Gentlemen, Clean Your Engines!

I am happy to say that neutralizing the ME works! I investigated the effectiveness of neutralizing the Management Engine using the me_cleaner tool (which is an amazing feat of the community), and then I tested to make sure the ME was indeed neutralized and that the Librem 13 stays on for over 30 minutes. We plan to go even further than that in the future and reverse-engineer the remaining parts just so we can attain 100% freedom.

First of all, you need to understand what me_cleaner does. You can of course go read the technical details on their wiki on how it works, but to put it simply, the ME is organized in multiple modules, each handling a specific task. The me_cleaner tool deletes most modules (utilities, kernel, network stack, and a Java virtual machine—Yes! You read that right), pretty much everything except the hardware initialization (BUP = Bring UP) module in the ME image. After the BUP module is executed, it can’t find the other modules, so it stops executing (as it has nothing to execute into), but at that point the 30 minutes watchdog has already been disabled by the BUP itself, so we can keep running. This is already a great improvement! The watchdog is precisely the issue that had been blocking us when we did our initial investigations a year ago.

After I ran the me_cleaner script on the BIOS image, and flashed it, I needed to test and make sure that the ME was indeed neutralized. I used the “intelmetool” that comes with coreboot and which is used to communicate with the ME PCI device to get information from it. Unfortunately, the intelmetool kept crashing, which was a good sign because it apparently couldn’t find the ME, but a crash (segmentation fault) is not really a conclusive answer… so I looked at its code, figured out what it did wrong, fixed it, then tested it again. This time, it gave me lots of output, and it confirmed that the ME was basically unresponsive. I then checked the output of “cbmem”—coreboot’s debug log during the boot sequence—and it showed that the ME was now stuck in “bring up phase”, its state was “recovery” instead of “normal”.

Bring out the Champagne! The ME is not only quarantined, it is now officially neutralized and the Librem remains working beyond the 30 minutes time limit that Intel had put in place!

Pictured: our recommended “Defeat of the Intel ME” celebration kit.

The remains

A question remains, however: “What exactly did we remove, and what remains?” So I tried to dig into that as well.

First of all, the Intel ME image takes 2MB of space in the BIOS flash, but not all of those 2MB are used. It’s made of different modules, which can be compressed with LZMA or with a private/secret Huffman dictionary. There is a total of 1.2MB of actual compressed code in the image, which gives us a total of 1.6MB (1662976 bytes) of uncompressed code in 23 modules.

Of those 23 modules, 21 modules are completely removed from the ME partition, and we leave only 2 modules: ROMP and BUP. The ROMP module is a “ROM bypass” module which is used to bypass the ROM initialization code and it’s less than 1KB of code, used to load the BUP module and execute it. The BUP module is a 116KB module which is used to initialize the ME hardware. So we end up with 120KB (122880) of data (108224 bytes actually, if we ignore the end of the ROMP and BUP modules which are empty) which represents 7.38% of the total ME code. We have effectively removed over 92.6% of the ME code without any adverse effects (but see further below).
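The percentages above can be re-derived from the byte counts quoted in this section. This is just the arithmetic spelled out; the variable names are mine, and the byte counts are the ones given in the text.

```python
TOTAL_UNCOMPRESSED = 1662976   # bytes of uncompressed code in all 23 modules
KEPT_ON_DISK = 122880          # ROMP (4KB) + BUP (116KB) as extracted
KEPT_NON_EMPTY = 108224        # the same two modules, ignoring empty tails

def percent(part, whole):
    """Share of `whole` represented by `part`, as a percentage."""
    return 100.0 * part / whole

kept_percent = percent(KEPT_ON_DISK, TOTAL_UNCOMPRESSED)
removed_percent = 100.0 - kept_percent
kept_non_empty_percent = percent(KEPT_NON_EMPTY, TOTAL_UNCOMPRESSED)
```

Using the 120KB figure, the kept share rounds to 7.39% (7.38% if truncated), leaving just over 92.6% removed. Counting only the non-empty bytes, the kept share drops to about 6.5%.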

And so we removed plenty of stuff, but most importantly, we completely removed the ME kernel as well as the network stack. You can see the full list of modules here:

## Original ME modules :
total 1.6M
8.0K -rw-r--r-- 1 kakaroto kakaroto 8.0K Feb 28 17:08 AFWS-20687000.mod
12K -rw-r--r-- 1 kakaroto kakaroto 12K Feb 28 17:08 BOP-20392000.mod
116K -rw-r--r-- 1 kakaroto kakaroto 116K Feb 28 17:08 BUP-200d4000.mod
16K -rw-r--r-- 1 kakaroto kakaroto 16K Feb 28 17:08 CLS-206e0000.mod
4.0K -rw-r--r-- 1 kakaroto kakaroto 4.0K Feb 28 17:08 ClsPriv-20716000.mod
12K -rw-r--r-- 1 kakaroto kakaroto 12K Feb 28 17:08 FPF-206b3000.mod
132K -rw-r--r-- 1 kakaroto kakaroto 140K Feb 28 17:08 FTPM-20777000.mod
60K -rw-r--r-- 1 kakaroto kakaroto 60K Feb 28 17:08 HOSTCOMM-20396000.mod
24K -rw-r--r-- 1 kakaroto kakaroto 24K Feb 28 17:08 HOTHAM-2032b000.mod
16K -rw-r--r-- 1 kakaroto kakaroto 16K Feb 28 17:08 ICC-203ad000.mod
272K -rw-r--r-- 1 kakaroto kakaroto 272K Feb 28 17:08 JOM-208c2000.mod
344K -rw-r--r-- 1 kakaroto kakaroto 344K Feb 28 17:08 KERNEL-200f8000.mod
28K -rw-r--r-- 1 kakaroto kakaroto 28K Feb 28 17:08 MCTP-20379000.mod
28K -rw-r--r-- 1 kakaroto kakaroto 28K Feb 28 17:08 ME_TUNNEL-203b4000.mod
52K -rw-r--r-- 1 kakaroto kakaroto 52K Feb 28 17:08 NET_STACK-20383000.mod
20K -rw-r--r-- 1 kakaroto kakaroto 20K Feb 28 17:08 NFC-208bb000.mod
196K -rw-r--r-- 1 kakaroto kakaroto 204K Feb 28 17:08 Pavp-20040000.mod
124K -rw-r--r-- 1 kakaroto kakaroto 124K Feb 28 17:08 POLICY-2034d000.mod
4.0K -rw-r--r-- 1 kakaroto kakaroto 4.0K Feb 28 17:08 ROMP-200d2000.mod
60K -rw-r--r-- 1 kakaroto kakaroto 60K Feb 28 17:08 SESSMGR-20719000.mod
44K -rw-r--r-- 1 kakaroto kakaroto 44K Feb 28 17:08 SESSMGR_PRIV-2015a000.mod
4.0K -rw-r--r-- 1 kakaroto kakaroto 4.0K Feb 28 17:08 UPDATE-2003e000.mod
32K -rw-r--r-- 1 kakaroto kakaroto 32K Feb 28 17:08 utilities-2036f000.mod
## Cleaned ME modules :
total 120K
4.0K -rw-r--r-- 1 kakaroto kakaroto 4.0K Feb 28 17:15 ROMP-200d2000.mod
116K -rw-r--r-- 1 kakaroto kakaroto 116K Feb 28 17:15 BUP-200d4000.mod

A few things to watch out for

Possible graphics problems

Unfortunately for me, on one of my machine setups, the i915 graphics driver would constantly crash with Wayland. I have tried an Ubuntu 16.04 live USB and haven’t had any problems with it, but when trying with two different PureOS installs, one was extremely stable while the other had the graphics driver crashing. I could still SSH into the machine and do what I wanted, but I couldn’t log in to my desktop. Running “startx” in a terminal was working, however, without causing additional crashes of the graphics driver.

Other people on the team tested on their own Librem 13 and couldn’t reproduce the issue.

I spent some time trying to debug that phenomenon, but without much success. I tried updating/downgrading the kernel and comparing Wayland versions, and couldn’t figure out what was different between my two PureOS installs. I eventually put that aside because I had other things to do, and this could wait, given that I only experienced this problem on one particular machine.

Microcode or no microcode, that is the question!

Then came the idea of removing the microcode update from coreboot. This is a tricky question.

The way the CPU is made, it comes with a predefined “microcode”, basically some sort of “arrangement” of the low-level transistor blocks to define the “high-level” x86 instruction sets the processor supports. Sometimes if an instruction doesn’t behave the way it should, Intel will release a microcode update to “re-arrange” the transistor blocks in order to fix bugs in how the instructions are behaving. Those bugs can be anything: silent data corruption, security flaws, or very visible kernel panics.
Some people, however, may decide not to have a microcode update in their BIOS because it’s technically an unknown binary—even though the CPU hardware itself already comes with an initial microcode configuration pre-burned in its silicon.

After researching the implications of removing the microcode update from coreboot, I tested it. I ran prime95 for over 28 hours without any errors (what I forgot to mention in my previous blog post is that my prime95 results back then were actually obtained on a microcode-less system!). The system seems to run fine, boots, logs me in without problems and is perfectly usable, so that’s great… but it’s of course no guarantee that the system won’t have hidden bugs that I can’t notice, or small data corruption, etc. If anyone wants to remove the microcode updates from their BIOS, they can do that, and they can be safe in knowing that the system will be “usable”, but of course this comes with a big disclaimer on the risks involved. Todd (our CEO) has tested his machine extensively with coreboot without microcode updates and said that the machine would lock up completely in less than 24 hours. After a few days of testing, he added back the microcode updates and the system became stable again. Your mileage may vary.

Here are some comments about this that I’ve received from the #coreboot IRC channel:

microcode problems are weird. They can appear in many different problems. I have encountered: VT-X very broken, movies playing (SIMD broken?), wrong CPUID, …
I guess the most common issues fixed by microcode updates are typically related to caching bugs, which often have security impact. TLB broken, write-back unstable, stuff like that. There’s hardly a single tool you can run and hope that it’ll catch all those bugs.
I have never seen an x86 chip that didn’t fail in some way without ucode updates, so don’t get your hopes up

Back for more coreboot work

In the introduction, I mentioned some coreboot issues that distracted me. Nothing major at the beginning: there was a small typo (and apparently without consequences) in one of the commits I wrote in coreboot, which was not noticed until after it was merged, so I had to fix that (quick and easy) and send it to coreboot for review/merging (not so quick and easy). The coreboot team was great in giving me good feedback pretty quickly (my commit message wasn’t up to their standards because I did it in a rush), and it got merged upstream.

Then, unfortunately, during beta testing, someone in the team bricked his Librem 13 (good thing we’re testing with our own devices first, huh?) We’re not yet sure why this happened, so I’m waiting to receive this person’s unit to debug that. In the meantime, I had to send my Librem 13 to them so they can get their laptop back to work. Once I receive their “brick”, I’ll be able to investigate why it’s not booting, whether it was a problem in flashing coreboot (due to QubesOS or a bad version of flashrom or due to user error), if it’s a problem with coreboot itself, or if it’s a problem with the hardware (that laptop might have been an early prototype, which means the hardware may be different, I’ll have to check to make sure).

For now, I’m also still working on a fool-proof (as much as possible) install script that will build coreboot for you and install it, limiting any risks of user-error that might cause a bricked machine. Once I know what happened to that bricked Librem (and fix it), then I’ll be able to continue working on that installation script (it’s hard to test it without any hardware on hand!), after which we’ll do some more “in-house” beta testing before releasing a public beta test for everybody.

Thankfully, any brick of the laptop can easily be recovered by using an external hardware flasher and the original BIOS. I have the original BIOS from that specific machine, since a backup was made before coreboot was flashed to it, so I can recover it quickly.

The post Neutralizing the Intel Management Engine on Librem Laptops appeared first on Purism.

The Librem 13 v1 coreboot port is now complete

Youness Alaoui — Sat, 25 Feb 2017 22:00:29 +0000

Here is the news you’ve been waiting for: the coreboot port for the Librem 13 v1 is 100% done! I fixed all of the remaining issues, and it is now fully working and stable, ready for others to enjoy. I fixed the instability problem with the M.2 SATA port, finished running all the tests to ensure coreboot is working correctly, fixed the headphone jack that was not working, made the boot prettier, and started investigating the Intel Management Engine issue. Read on for details.

Currently our test matrix looks like this—100% tested and working:

Cold boot: memory controller works.
Cold boot: all installed DRAM is online.
Cold boot: graphics controller works.
Cold boot: SATA controller succeeds.
Cold boot: EC controller responds ok to init code.
Cold boot: LCD backlight turns on.
Cold boot: linux boots ok in text mode.
Cold boot: linux boots ok in framebuffer (boot splash) mode.
Cold boot: X initializes the LCD at full native resolution.
Cold boot: X enables hardware acceleration.
Boot time: Cold boot to grub succeeds in less than a set timeout.
Boot time: Reboot from linux back to linux succeeds in less than a set timeout.
Boot time: Power down succeeds in less than a set timeout.
SeaBIOS test: keyboard works.
Grub test: keyboard works.
Grub test: text mode and framebuffer graphics work.
Cold boot to USB linux succeeds.
Reboot to USB linux succeeds.
EC test: fan spins.
EC test: holding power for >5 seconds forces a power down.
ACPI test: lid switch works.
ACPI test: power button event received ok.
ACPI test: AC power on/off event received ok.
ACPI+EC+battery test: battery percentage works.
Media keys on keyboard work in linux.
Device tests: internal mic, internal speakers, webcam, webcam mic, wifi, bluetooth, hard drive, SSD, SD card, each USB port, headphone jack.
prime95 (one instance bound to each hyperthread) for a fixed time to test CPU thermal management.
glxgears for a fixed time to test GPU thermal management.
During prime95 test, CPU digital thermal sensor should give reasonable results.
Linux suspend ok.
LCD backlight adjustable in linux.
Linux kernel boot messages should not contain too many errors.

The next steps will be:

Write a script for users to beta test the coreboot release easily, and document the whole thing;
Determine the best method to extract the existing BIOS parts and flash coreboot, avoiding any code redistribution;
Package it all in a .deb that PureOS users can simply apply as an update to get their BIOS replaced by coreboot.

How did you fix those last issues?

The M.2 SATA port

In my previous post I noted some strange issues with the M.2 SATA port. I thought it had something to do with the PCI configuration (the subsystem ID was different). I hadn’t had time to investigate this too much, but I had an idea that needed to be reconsidered, something I had tried very early on but it had failed because coreboot didn’t support it, and I thought that I had the wrong idea…

Basically, the devicetree.cb of the Librem 13 says this:

# Port 0 tuning for link stability
register "sata_port0_gen3_dtle" = "9"

Since the 2.5″ SATA is on port 0, while the M.2 SATA is on port 3, and my M.2 SATA issues were of a “link stability” nature, I figured, I’d just add this to the device tree and that this might solve the problem:

register "sata_port3_gen3_dtle" = "9"

…but that didn’t work outright (didn’t compile), so I shelved that idea.

After posting my previous blog entry mentioning the M.2 issue, I went on coreboot’s IRC channel and asked if this “gen3_dtle” thing could be the cause, and someone mentioned that the behavior I was seeing was exactly what he had seen on his board until he added a similar line to the devicetree.cb file for his board. That at least confirmed that this was my problem… but the SATA initialization code for the Intel Broadwell SoC does not have support for that register for port 3, only for port 0 and port 1. After looking at the code, I realized that it’s probably only because nobody had needed to use it for port 3 until then. Then comes the obvious question: what does that value mean, what is the SATA DTLE, and how do I add it for port 3? Well, the soc/intel/broadwell/sata.c file uses that devicetree register and sets some “IOBP Registers” on the SATA controller using that value, and the “IOBP registers” are defined in soc/intel/broadwell/include/soc/sata.h like this:

/* SATA IOBP Registers */
#define SATA_IOBP_SP0_SECRT88 0xea002688
#define SATA_IOBP_SP1_SECRT88 0xea002488
#define SATA_IOBP_SP0DTLE_DATA 0xea002750
#define SATA_IOBP_SP0DTLE_EDGE 0xea002754
#define SATA_IOBP_SP1DTLE_DATA 0xea002550
#define SATA_IOBP_SP1DTLE_EDGE 0xea002554

So the obvious next questions are “What are these magic numbers?” and “What magic number should I use for port 2 and port 3?”. Unfortunately, there really is no information about what these “IOBP registers” mean, or what those values are, or where to get them from. Someone in #coreboot said that information comes from an Intel specification document that is only available under NDA.

Considering that ‘SP0’ is for port 0 and ‘SP1’ is for port 1, and that the value goes from 0xea002750 for port 0 to 0xea002550 for port 1… I thought, “It would be funny if…” and set port 2 as 0xea002350, port 3 as 0xea002150, and tested it. It worked! Those guessed magic values for port 2 and 3 fixed the M.2 instability issues I was seeing. Well. That went better than expected.
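The guess boils down to simple arithmetic: the address drops by 0x200 from port 0 to port 1, so keep subtracting. Here is a minimal Python sketch of that extrapolation. Only the port 0 and port 1 values come from sata.h; the stride, the helper name `guess_dtle_registers` and the EDGE addresses for ports 2 and 3 are assumptions derived from the pattern, not documented values.

```python
# Known values from soc/intel/broadwell/include/soc/sata.h (ports 0 and 1 only).
SP0DTLE_DATA = 0xEA002750
SP1DTLE_DATA = 0xEA002550

# For ports 0 and 1, the EDGE register sits 4 bytes after the DATA register.
EDGE_OFFSET = 0x4

# Stride between consecutive ports, inferred from ports 0 and 1 (assumption).
PORT_STRIDE = SP0DTLE_DATA - SP1DTLE_DATA  # 0x200


def guess_dtle_registers(port):
    """Return the guessed (DATA, EDGE) IOBP addresses for a SATA port."""
    data = SP0DTLE_DATA - port * PORT_STRIDE
    return data, data + EDGE_OFFSET


for port in range(4):
    data, edge = guess_dtle_registers(port)
    print(f"port {port}: DATA=0x{data:08x} EDGE=0x{edge:08x}")
```

For ports 2 and 3 this yields the DATA addresses 0xea002350 and 0xea002150, which are the values that happened to work on the Librem 13.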

I then booted the Librem from the SSD and started testing everything else in my list, at which point I found out that the headphone jack wasn’t working.

The Harrowing Jeopardy of the Headphone Jack

“This might be a PulseAudio bug”, thought I. Alas, after attempting everything I could with PulseAudio, the issue remained—from the headphones came nothing but deafening silence. I pondered the untimely physical death of Headphone Jack (“A hardware issue? Inconceivable!”), so I hammered the vendor BIOS back into the corpse to be sure… and lo and behold, Headphone Jack was alive! Thus I started the investigation into the causes of its disappearance amidst coreboot.

After a few days of looking at various possible causes and finding nothing, I realized that most patches in the coreboot git log mentioning “headphones” were modifying hda_verb.c or hda_verb.h (HDA means “Intel High Definition Audio”) and I realized the file contains the structure that is used to initialize the “codec” that runs on the sound card.

I recalled the codec#0 file that I had grabbed from the machine with the original BIOS according to the motherboard porting guide and I compared it with the one from coreboot, and I found very few (and insignificant) differences, so it didn’t make any sense. Eventually, I decided to compare the content of hda_verb.c with the data from the codec#0 file even though it didn’t change from the AMI BIOS to coreboot, and I noticed something strange right away.

codec#0:
Codec: Intel Broadwell HDMI
Address: 0
AFG Function Id: 0x1 (unsol 0)
Vendor Id: 0x80862808
Subsystem Id: 0x80860101

hda_verb.c:
0x19910269, /* Codec Vendor / Device ID: Realtek ALC269 */
0x19910269, /* Subsystem ID */

Well, well! That’s not the same thing at all: the codec ID is different, the subsystem ID is different, the device ID/codec name is not even the same. Then I noticed the ‘0x8086’ in the codec ID, which is the same as Intel’s PCI vendor ID, so on a hunch and out of curiosity, I decided to run “lspci” and search for the audio PCI device to see if the PCI vendor/product ID matched the codec ID from the codec#0 file:

lspci -nn | grep -i audio
00:03.0 Audio device [0403]: Intel Broadwell-U Audio Controller [8086:160c] (rev 09)
00:1b.0 Audio device [0403]: Intel Wildcat Point-LP High Definition Audio Controller [8086:9ca0] (rev 03)

Hey, I have two audio controllers for some reason! I looked at the script I ran to get the codec#0 file and it grabbed the file from /proc/asound/card0/codec#0, but when I looked at my /proc/asound directory, I had both a “card0” directory and a “card1” directory. I peeked into the card1/codec#0 file and found this:

Codec: Realtek ALC269VB
Address: 0
AFG Function Id: 0x1 (unsol 1)
Vendor Id: 0x10ec0269
Subsystem Id: 0x19910269
...
Control: name="Headphone Playback Switch", index=0, device=0

Ah-ha! Now, to figure out why it didn’t work. I copied that card1/codec#0 file, inserted the headphones and copied it again, compared the two and found differences (it detects when the headphones are inserted), flashed coreboot, copied the files again, with and without the headphones, then I compared the codec files from the coreboot system to the codec files from the AMI BIOS… and there were no differences between the files!

Some hours of head scratching ensued. Then, as I was looking at the codec#0 file, I noticed that it did not match what I had seen before, so I copied the files yet again and compared the codec files from coreboot and AMI… and suddenly, they were completely different. I shrugged and continued my investigation. I later realized that the codec does not get reinitialized during a reboot, so that’s why the codec had not changed after I flashed coreboot and rebooted—I had to do a full shutdown and power on again in order to have the codec re-initialized by coreboot!
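Comparing codec dumps by eye gets old quickly, so here is a minimal sketch of how the comparison could be scripted with Python's difflib. The two excerpts are hypothetical stand-ins for real copies of /proc/asound/card1/codec#0, and the helper name `codec_diff` is made up for illustration. Given the lesson above, the dumps to compare should be taken after a full power-off, not after a reboot.

```python
import difflib


def codec_diff(before, after):
    """Return the changed lines between two codec#0 dumps as a unified diff."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0))


# Hypothetical excerpts; real dumps would be read from saved copies of
# /proc/asound/card1/codec#0 taken under each firmware.
unplugged = "Codec: Realtek ALC269VB\nPin-ctls: 0x00\n"
plugged = "Codec: Realtek ALC269VB\nPin-ctls: 0xc0\n"

for line in codec_diff(unplugged, plugged):
    print(line)
```

An empty result means the two dumps are identical, which is exactly the misleading answer a warm reboot produces.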

So, after comparing the codecs from the AMI and coreboot BIOSes and comparing with the contents of hda_verb.c, I saw differences that I couldn’t explain, and after a while I wanted to debug what was happening in coreboot and realized that there are already some debug messages being printed by the code that initializes the codec using the hda_verb.c data. I ran the ‘cbmem’ utility, printed the coreboot debug messages, and found this:

HDA: Initializing codec #0
HDA: codec viddid: 10ec0269
HDA: No verb table entry found

That’s when I realized the error. The codec viddid (vendor ID/device ID) is 0x10ec0269, yet hda_verb.c had it set to 0x19910269 (which is actually the subsystem ID), so coreboot was simply never finding the data from hda_verb.c and never initializing the codec. So I fixed the codec ID in hda_verb.c and recompiled coreboot. I fully shut down the laptop and powered it back on, and then the headphone jack was working. Hooray!
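To make the failure mode concrete, here is a toy Python model of the lookup. It is not coreboot's actual code: the dictionary layout and the `find_verb_table` helper are hypothetical, and only the two ID values come from the logs above. The point is that the table is selected by the ID the codec reports, so a table keyed by the subsystem ID never matches.

```python
CODEC_VIDDID = 0x10EC0269   # what the codec reports (from the cbmem log)
SUBSYSTEM_ID = 0x19910269   # what hda_verb.c mistakenly used as the codec ID


def find_verb_table(tables, viddid):
    """Return the verb table whose codec ID matches viddid, or None."""
    for table in tables:
        if table["codec_id"] == viddid:
            return table
    return None


# Hypothetical table layouts; the verb lists are left empty on purpose.
broken = [{"codec_id": SUBSYSTEM_ID, "subsystem_id": SUBSYSTEM_ID, "verbs": []}]
fixed = [{"codec_id": CODEC_VIDDID, "subsystem_id": SUBSYSTEM_ID, "verbs": []}]

for name, tables in (("broken", broken), ("fixed", fixed)):
    found = find_verb_table(tables, CODEC_VIDDID)
    print(name, "->", "verb table found" if found else "No verb table entry found")
```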

The boot splash—Beauty is Pain

My next step was to add a nice bootsplash image to the boot process. It wasn’t strictly necessary at this point, but I wanted to do that, so I did. The problem is that it wasn’t working (surprise, surprise).

First, you have to add the bootsplash image in menuconfig, which adds it to the coreboot.rom… but doesn’t use it. So then you need to tell coreboot to actually show the bootsplash image, which didn’t work.

I figured it’s because the VGA graphics aren’t initialized since it’s SeaBIOS that runs the VGA option ROM (vbios), so I enabled coreboot to run the VGA option ROM, then I enabled it to run all PCI option ROMs.
I then realized I probably had to use an image with the exact resolution of the VESA mode being used… and it still didn’t work.
At that point I thought, “Maybe it was happening too fast for me to see”, so I enabled the option to keep the graphics in VESA mode.

I asked on IRC and those present told me that it was supposed to be “as simple as adding the image to coreboot and that’s it”. Eventually, while debugging, I saw an error message in cbmem from the SeaBIOS payload itself:

jpeg_decode failed with return code 9...

But I didn’t care about SeaBIOS because I wanted coreboot to show the image.

Eventually, I had another “What if…” moment and asked someone on IRC who had a working bootsplash to send me the image they used. I tried it and it worked! So the problem was my own image. After I made sure the resolutions matched, I had no idea what to look for next, and online, I couldn’t find any information on what requirements the image had to have.

I also realized that coreboot could show the bootsplash, but SeaBIOS would also find the bootsplash image in CBFS (in the bios filesystem basically) and would also show it, so I had both coreboot and SeaBIOS attempting to show the bootsplash, and if SeaBIOS was showing an error in decoding the JPEG, it was probably the same reason why coreboot was refusing to display it. I then looked at the jpeg module in the SeaBIOS code and found the error 9 that it returned… it had something to do with the colorspace being wrong, as jpeg_decode only supports YCBCR:22:11:11:

#define ERR_NOT_YCBCR_221111 9

I looked at how to determine the colorspace of the JPEG image and how to change to YCBCR:22:11:11 but couldn’t find much information, so I started reading the code of jpeg_decode and understanding the binary structure of the JPEG file format (did anyone say yak shaving?), until I found which bits in the JPEG header were specifying the colorspace. I then opened GIMP and tried various things until I found where to change the colorspace, then wrangled with the options until I found which one was setting the bits to perfectly match the YCBCR:22:11:11 colorspace that SeaBIOS required. At that point, I was just using “hexdump” and reading the jpeg data structure to determine if it would work. In GIMP, we need to set the “Subsampling” advanced option to “4:2:0 (chroma quartered)”.

Once I did that, I booted and… still. No. Bootsplash. ლ(ಠ益ಠლ)

I looked at the cbmem log again and this time it was error 11, instead of 9:

#define ERR_NOT_SEQUENTIAL_DCT 11

Now, that didn’t mean much; I kept looking for this sequential DCT information, but couldn’t find any information about it. There were 3 bytes in the SOS (Start of Scan) marker of the JPEG headers, which were the ones that the jpeg_decode function was using to decide if it returned that error or not, and everywhere I looked, it was either not explaining what those bytes meant or defining them as “ignorableBytes” or explaining them as “unused” or “skip 3 bytes” or something like that, without explaining what they were or what they were for… eventually, I tried to bruteforce it, and I found that disabling the “progressive” option in GIMP will set those 3 bytes to the value that SeaBIOS’ jpeg_decode requires (which is 0x003F00, by the way). This gives us these JPEG export options in GIMP:

And voilà! With those settings, the splash image gets shown. Easy peasy, huh?

The Librem 13 v1 successfully showing the splash with coreboot (temporary artwork)
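Instead of squinting at hexdump output, the two conditions can be checked with a short script. In the JPEG format, the three bytes at the end of the SOS header are the spectral selection start, the spectral selection end and the successive approximation byte, which are 0, 63 and 0 for a baseline sequential image; that matches the 0x003F00 found above. The sketch below is a minimal Python check written from the description in this post, not a port of SeaBIOS' jpeg_decode, and the `check_bootsplash_jpeg` name is made up. It assumes a simple file where every segment before the scan data carries a length field.

```python
import struct


def check_bootsplash_jpeg(data):
    """Return a list of reasons why the image would likely be rejected."""
    if data[:2] != b"\xff\xd8":
        return ["not a JPEG (missing SOI marker)"]
    problems = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return problems + ["corrupt marker stream"]
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + length]
        if marker in (0xC0, 0xC1, 0xC2):  # SOF0/SOF1/SOF2 frame header
            ncomp = body[5]
            # Each component entry is: id, sampling factors (H<<4 | V), table.
            sampling = [body[6 + 3 * i + 1] for i in range(ncomp)]
            if sampling != [0x22, 0x11, 0x11]:
                problems.append("not YCbCr 22:11:11 (export with 4:2:0)")
        elif marker == 0xDA:  # SOS: the header ends with Ss, Se, Ah/Al
            if body[-3:] != b"\x00\x3f\x00":
                problems.append("not sequential DCT (disable progressive)")
            break  # compressed image data follows, stop parsing
        pos += 2 + length
    return problems


# Synthetic header-only sample: SOI, one SOF0 segment, one SOS header.
good = bytes.fromhex(
    "ffd8"
    "ffc0 0011 08 0010 0010 03 012200 021101 031101"
    "ffda 000c 03 0100 0211 0311 003f00")
print(check_bootsplash_jpeg(good) or "looks fine")
```

To check a real image, read the file in binary mode and pass its bytes to the function.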

Afterwards, we asked François to create some more minimalistic bootsplash. This is roughly what it looks like now (keep in mind the picture is a bit overexposed here as well, so it looks better with the naked eye):

Upstreaming the work

I went through all of our test matrix and verified that everything works as expected. I ran prime95 for 28.5 hours without issues and verified that the CPU/GPU temperatures remain acceptable under both heavy CPU load (prime95) as well as heavy GPU load (uncapped glxgears):

…and eventually came to the conclusion that our coreboot release is done, stable and working.

I reviewed, cleaned and committed my changes, and sent the commits upstream (to coreboot) for review. Unlike in academia, the reviews were quick and painless: no changes were asked and it was all merged into the coreboot master branch on February 22nd.

While finishing the coreboot port, I also started to play around with the me_cleaner and testing the Librem with the Intel Management Engine disabled and various CPU configurations. We have some good news to report on this. Stay tuned.

The post The Librem 13 v1 coreboot port is now complete appeared first on Purism.

Librem 13 coreboot report – February 3rd, 2017: It’s Alive!

Youness Alaoui — Fri, 03 Feb 2017 21:17:45 +0000

Memtest with coreboot on the Librem 13

Hi again everyone and welcome to the “Good news” post!

It’s been 3 weeks since I wrote my last blog post but this is going to be a short update, in big part because I’ve spent the first two weeks sick in bed and thus wasn’t able to do much at all. However, in the last week I did manage to make some big progress, and the result represents such a great milestone that it warrants a blog post of its own. And, well, I doubt many will complain about not having to read through a wall of text for today’s blog post 🙂

So the good news is: coreboot is working on the Librem 13. The laptop boots into Linux and most things are working! The only issue I have found so far is that the M.2 SATA port doesn’t seem to work properly yet (see below for more info).

Getting video output

You may remember that, at the end of my 2nd blog post, I had finally managed to build coreboot with all of the binary blobs included and it should have worked but it didn’t for “some reason”, so I was going to try to enable debugging to see where it froze.

After installing the “Screwdriver” image on the BeagleBone Black and enabling the EHCI debugging in the coreboot config, I was able to get the debug output from coreboot. It was really quite easy to achieve. The screwdriver image for BBB is pretty much a “boot it and it will Just Work™” thing, no configuration or app installs to do on it. As for enabling EHCI debugging, it took me a couple of tries because I had to enable two options in different config menus, not just one (enable EHCI log debugging, then enable the option to send the log via USB), but thankfully the wiki page explained that so once I followed the docs, it was quite simple. And for the curious/future reference, the USB port which outputs the EHCI debugging is the one on the right side of the laptop.

Once I had the boot log from coreboot, I noticed that it hadn’t frozen anywhere, the last line was about coreboot launching the payload SeaBIOS, so coreboot did everything up until the end. I checked the various steps and it had initialized the RAM, the refcode, the VBIOS, etc. I figured, “Maybe it’s a configuration issue”, so I checked my lspci output from before, and saw that the VGA Controller PCI ID was “8086,1616”, then I went into the coreboot config and saw that it was set to “8086,0406”. So I changed that, and flashed coreboot and when I booted the machine, the video controller worked and I saw the SeaBIOS prompt. Hurray!
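That kind of mismatch is easy to check mechanically. Below is a small Python sketch that pulls the VGA controller's vendor and device ID out of `lspci -nn` style output and compares it with the configured value. The sample line is hypothetical (only the 8086:1616 ID comes from my machine), and the `vga_pci_id` helper is made up for illustration.

```python
import re


def vga_pci_id(lspci_nn_output):
    """Return 'vendor,device' for the first VGA controller, or None."""
    for line in lspci_nn_output.splitlines():
        if "VGA compatible controller" in line:
            # The class code looks like [0300]; the ID pair looks like [8086:1616].
            ids = re.findall(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", line)
            if ids:
                vendor, device = ids[-1]
                return f"{vendor},{device}"
    return None


# Hypothetical lspci line built around the real ID.
sample = ("00:02.0 VGA compatible controller [0300]: "
          "Intel Corporation HD Graphics 5500 [8086:1616] (rev 09)")
configured = "8086,0406"  # the value that was set in the coreboot config

detected = vga_pci_id(sample)
if detected != configured:
    print(f"VGA ID mismatch: config has {configured}, hardware reports {detected}")
```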

The Curious Case of the M.2 SSD

Unfortunately, once I tried booting Linux, it failed with a “Read Error”:

After spending some time trying to figure it out and not being able to (there is no “Read Error” string or anything that could print such a string in the SeaBIOS code, so I couldn’t track down where the error came from, and there is no EHCI debug since coreboot is already done booting and the issue is from SeaBIOS), I tried booting PureOS from the USB installation drive instead, and I was able to boot into the live environment without any problems. Wow, first success! PureOS is booting with coreboot! There was much rejoicing.

To begin investigating the SSD issue, I used the same set of commands from the Motherboard porting guide and started comparing the results, and there were a few differences, but I’m still not sure what they mean. Here’s an example of some of the differences between the two lspci outputs (the problematic SATA controller):

For reference, you can see the full lspci output with and without coreboot.

Even after booting into Linux, the internal SSD was not accessible, and ‘dmesg’ was showing errors initializing the SATA controller.

SeaBIOS was sometimes seeing the M.2 SSD (but was never able to boot from it):

Sometimes, it wouldn’t see the M.2 SSD at all:

…and sometimes, it would just show garbage:

However, it had no issues detecting and booting from the USB stick, so I had an idea; I installed a 2.5″ HDD into my Librem 13 and tried that. It was immediately detected by the PureOS liveUSB. So I installed PureOS on the HDD, and rebooted. While SeaBIOS still didn’t detect the SSD, it detected the 2.5″ HDD and was able to boot flawlessly with it. Still no SSD detection even with PureOS fully booted from the HDD however, and dmesg still complained about various SATA initialization issues.

I took the opportunity to test the wifi, video card, speakers, and everything seemed to work. I then booted into MemTest86+ and tested the RAM overnight. There were no errors after more than 17 hours of RAM testing.

As I booted Linux again I noticed the ME PCI device wasn’t in the lspci output, so I wondered if I somehow messed up the ME partition, therefore I left the computer running for a couple of hours to make sure it wouldn’t shut down (due to ME watchdog), then I noticed something weird: I suddenly had a /dev/sdb* set of devices. The output of ‘dmesg’ showed that it magically was able to detect it somehow and I was now able to access the M.2 SSD.

So I did a few more tests, and it seems that after a few minutes (30 minutes to an hour), the M.2 SSD connector will suddenly start responding and Linux will be able to initialize it and detect/access the SSD. It also seems that suspending/resuming the laptop helps trigger the M.2 initialization much faster. I still have no idea why this happens. And once, it managed to initialize the SSD after only 3 seconds instead of the usual 30 minutes, as you can see in this ‘dmesg’ output here:

I have now started reading up on the PCI Configuration space in order to understand the differences in the lspci output and hopefully fix the M.2 issues. My current theory is that since the PCI subsystem ID is different when using the vendor BIOS than from using the coreboot BIOS, it’s possible that the subsystem ID somehow tells SeaBIOS/Linux that this specific SATA controller has a quirk that changes the initialization timings. This is only a wild guess for the time being, hopefully in the next few weeks I’ll understand enough about the way PCI initialization works to be able to figure out what goes wrong.

Summarizing

My current status is that PureOS boots and is perfectly usable, however the M.2 controller doesn’t work reliably. Also, the MEI PCI device as well as the USB EHCI device have disappeared from the ‘lspci’ output (both USB ports are working though). The lspci output is also different for most of the other devices when compared to the original BIOS.

One other thing worth mentioning is that I have stopped using the IC clip already. Since I am able to boot into Linux with coreboot, I can now use flashrom to flash the BIOS directly from Linux and I’ve used it to do my BIOS updates while testing in the last few days. This is great, because not only does it speed up development, but it also confirms/tests the process that existing Librem 13 owners will go through to update their laptops to coreboot.

Here is the Acceptance Test Matrix that I mentioned in my previous article, which I’ve found in an old post on the coreboot blog, where I’ve stricken whatever I have had time to test and confirm as working, and made bold anything known not to work:

~~Cold boot: memory controller works.~~
~~Cold boot: all installed DRAM is online.~~
~~Cold boot: graphics controller works.~~
Cold boot: SATA controller succeeds.
Cold boot: EC controller responds ok to init code.
~~Cold boot: LCD backlight turns on.~~
~~Cold boot: linux boots ok in text mode.~~
~~Cold boot: linux boots ok in framebuffer (boot splash) mode.~~
~~Cold boot: X initializes the LCD at full native resolution.~~
~~Cold boot: X enables hardware acceleration.~~
Boot time: Cold boot to grub succeeds in less than a set timeout.
Boot time: Reboot from linux back to linux succeeds in less than a set timeout.
Boot time: Power down succeeds in less than a set timeout.
~~SeaBIOS test: keyboard works.~~
~~Grub test: keyboard works.~~
~~Grub test: text mode and framebuffer graphics work.~~
~~Cold boot to USB linux succeeds. (We plan to use SeaBIOS for boot device selection, barring major bugs.)~~
~~Reboot to USB linux succeeds.~~
~~EC test: fan spins.~~
~~EC test: holding power for >5 seconds forces a power down.~~
~~ACPI test: lid switch works.~~
~~ACPI test: power button event received ok.~~
~~ACPI test: AC power on/off event received ok.~~
~~ACPI+EC+battery test: battery percentage works.~~
Media keys on keyboard work in linux.
Device tests: internal mic, ~~internal speakers,~~ webcam, webcam mic, ~~wifi~~, bluetooth, ~~hard drive~~, SSD, SD card, ~~each USB port~~, headphone jack.
prime95 (one instance bound to each hyperthread) for a fixed time to test CPU thermal management.
glxgears for a fixed time to test GPU thermal management.
During prime95 test, CPU digital thermal sensor should give reasonable results.
~~Linux suspend ok.~~
~~LCD backlight adjustable in linux.~~
~~Linux kernel boot messages should not contain too many errors.~~ (Only the SATA errors are appearing)

As you can see we have at least 22 out of 32 items that are considered tested and done, which means we’re at least two-thirds there—most of the other items are probably working as well, I just hadn’t had time to test them yet.

I hope to have the M.2 issues fixed within the next couple of weeks, then, after making sure it is perfectly safe to flash coreboot to any Librem 13, we’ll probably release a beta image for people to test (it will come with plenty of disclaimers though!) After that, I’ll work on disabling the Intel ME (first by using the me_cleaner tool, then testing if it works as expected).

We’ll keep you posted on the progress.

The post Librem 13 coreboot report – February 3rd, 2017: It’s Alive! appeared first on Purism.

Librem 13 coreboot report – January 12, 2017

Youness Alaoui — Thu, 12 Jan 2017 17:45:13 +0000

Hello again Purists! I’ve made some progress on the coreboot port to the Librem 13 v1 hardware.

You probably remember that my initial post about coreboot development was mostly about the v2 hardware and all the mistakes I made while getting familiar with BIOS development. One comment I’ve heard on the previous post is that it was over-complicated to use a Logic Analyzer in order to do a dump of the flash. Of course it was over-complicated, but I did it for reasons other than just creating a dump. I did it because:

I hadn’t received the sockets/other hardware I ordered.
It sounded like a fun thing to do while I wait. And I simply love using the Saleae Logic Analyzer, and I look for any excuse to use it!
The logic analyzer doesn’t only give me a dump, it gives me a trace of all accesses, it allows me to see “which offsets are accessed when”.
When I see the process of what happens exactly (which bytes are read/executed first, where does it jump, etc.), it helps me better understand the boot process.
It lets me see which areas of the flash are ignored/not accessed. That also allows me to see which areas/partitions of the ME region are not protected by a signature (if the reads are not all sequential).
If I see the ME code being read in its entirety twice, it might mean that there is a ToCToU exploit that can be used there (signature is checked during the first read, then the ME code is loaded into memory).

So yeah, it might have been overkill, but it was interesting to do, that’s why I did it.
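To illustrate the kind of questions a trace can answer, here is a minimal Python sketch that takes a list of (offset, length) reads and reports which byte ranges were never touched and which were read more than once. The input format and the `coverage` helper are assumptions for illustration; a real logic analyzer capture would first need to be decoded into such a list. The per-byte counter is deliberately simple rather than efficient.

```python
def coverage(reads, flash_size):
    """Summarise (offset, length) reads captured from the SPI bus.

    Returns (never_read, read_more_than_once), each a list of
    (start, end) byte ranges with end exclusive.
    """
    counts = [0] * flash_size
    for offset, length in reads:
        for addr in range(offset, min(offset + length, flash_size)):
            counts[addr] += 1

    def ranges(predicate):
        found, start = [], None
        for addr, count in enumerate(counts):
            if predicate(count) and start is None:
                start = addr
            elif not predicate(count) and start is not None:
                found.append((start, addr))
                start = None
        if start is not None:
            found.append((start, flash_size))
        return found

    return ranges(lambda c: c == 0), ranges(lambda c: c > 1)


# Tiny made-up trace on a 32-byte "flash": one region is read twice.
never_read, repeated = coverage([(0x10, 4), (0, 8), (0, 8)], 32)
print("never read:", never_read)
print("read more than once:", repeated)
```

Ranges that are never read hint at data that is ignored at boot, while a region read twice in full is the pattern worth a closer look for a possible ToCToU issue.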

Dumping the Librem 13 v1 BIOS from software

At the end of my previous post, I explained that I had finally found a Librem 13 v1 that I could use. Given how it took us 3 weeks to finally manage to get one, I don’t want to be making the same mistakes this time around and bricking a very hard-to-replace laptop. Therefore, I decided against soldering a socket on the motherboard this time, or to be more precise, I decided to only solder the socket after/if I brick the laptop (once it becomes necessary). So my first task was to try, by any means necessary, to dump the flash contents using software only. Of course, I still did my logic analyzer trace, just for fun 🙂

I found the flash chip to be an 8MB MX25L6406E chip, and the vendor BIOS was from American Megatrends (AMI). Since I knew that flashrom wouldn’t work on laptops (more on that later), the first thing I tried was to find the flashing tool from the AMI utilities. I didn’t know what kind of BIOS I had, but looking at the screenshot from the Aptio-V page, it was clear that it wasn’t Aptio-V, so I had the choice between AMIBIOS or Aptio-4… so I first tried the one for AMIBIOS, I created a bootable DOS USB stick and copied AFUDOS to it… but it failed, then I tried the tool for Aptio-4, it failed, but it gave me an error message that said I need to use the tool for Aptio-5, so I finally downloaded the AFUDOS utility for the Aptio-V BIOS and that one worked (facepalm!). So I dumped my BIOS, success! Hmm, it’s only a 6MB file, weird…

Now, I figured that “It’s nice and all, but I need to be able to do it from a GNU/Linux platform”, because Librem users will be using PureOS so they’d need a way to dump/flash their BIOS from there. Thankfully, I found that AMI also has a tool called “AFULNX” for GNU/Linux, but for some reason it’s not available for download. I eventually found a link to it, which I can’t seem to find again now, but I also found this great article by Roman Hargrave that explained just how “awesome” (emphasis: sarcasm) AFULNX is. To make a long story short, the AFULNX tool will extract the source for a kernel module and try to build it, which will always fail because they are missing an include, then it deletes those source files, so I’d have to be quick enough to suspend the process before it deletes the files, then make a copy and fix the code so it compiles, then I need that kernel module file to be renamed (otherwise on the next execution it deletes it and retries).

Anyway, I managed to run AFULNX and dump my BIOS. It’s another 6MB binary, which is weird because I know from the spec of the chipset, that the flash is supposed to be 8MB. So I compare the BIOS I dumped with the one that was dumped with the Logic Analyzer… Surprise, surprise, they are completely different. After a little investigating, I realize they are not so different, but the BIOS dumped by AFULNX is actually the last 6MB of the flash, while the first 2MB were completely skipped. Comparing the images from the v1 and v2 BIOS images, and also looking at the LA trace, I could see that the first thing being read is the offset 0x10, which contains the magic number 0x5aa5f00f in both images, so I figured that the flash has some sort of “filesystem” which itself contains the BIOS file, and that the AFULNX binary only dumps that BIOS file without the filesystem itself. This meant that my dump was useless in the case of a bricked laptop. I needed a full flash image, otherwise I couldn’t recover.
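The two observations above are easy to turn into a quick sanity check for any dump. This is a minimal Python sketch; the helper names are made up, and the magic is compared byte for byte in the order it appears in the file at offset 0x10.

```python
MIB = 1024 * 1024

# The four bytes found at offset 0x10, in file order.
DESCRIPTOR_MAGIC = bytes.fromhex("5aa5f00f")


def has_descriptor(image):
    """True if the image starts with the structure seen in full flash dumps."""
    return image[0x10:0x14] == DESCRIPTOR_MAGIC


def bios_offset(flash_size, bios_dump_size):
    """Offset of a BIOS dump that covers the tail end of the flash."""
    return flash_size - bios_dump_size


# An 8MB flash with a 6MB BIOS dump leaves the first 2MB unaccounted for.
print(hex(bios_offset(8 * MIB, 6 * MIB)))
```

A dump for which `has_descriptor` returns False is a partial image and is not enough to recover a bricked machine.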

I then started investigating what this “filesystem” is, and I can’t say that I found a lot of documentation! Thankfully, I know the magic number of 0x5aa5f00f and that helped me find some stuff about it such as a patch on flashrom talking about supporting “ROM layout from IFD”, then, I found the magic number in the ichdesc.py from the tool called romdump which seems to be used to split the raw flash into multiple parts (one of which is the 6MB BIOS file from AFULNX), the file calls it the “ICH flash descriptor”… which led me to something called “ich9gen” that is used to generate that structure (so, is it an ICH9 or an IFD format?). Obviously ICH is a reference to the Intel I/O Controller Hub, which I assume is the one parsing that flash structure (but then why does the coreboot developer manual say that the CPU just loads the top most 16 bytes from the flash? That’s clearly not the case since there is an actual structure to the BIOS and the ICH is the one parsing it prior to the BIOS getting executed). Anyway, I eventually figured out that it’s called (in coreboot speak) a “descriptor” which is used to define “regions” in the flash, one of those regions is the Intel ME firmware, another one is the GbE (Gigabit Ethernet) configuration data, which is used to configure the Intel integrated GbE device (it contains the MAC address for example), as well as the BIOS region which contains the actual UEFI BIOS.
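For the curious, here is a rough Python sketch of how the region table in the descriptor can be read. The field layout (FLMAP0 at offset 0x14, the region base pointer in its bits 16 to 23, and region registers holding base and limit in 4KB units) is my understanding of what coreboot's ifdtool does for this generation of hardware; treat it as an assumption to verify against ifdtool, not as a reference. The `parse_ifd_regions` helper and the synthetic descriptor are made up for illustration.

```python
import struct

IFD_SIGNATURE = bytes.fromhex("5aa5f00f")  # bytes at offset 0x10, file order
REGION_NAMES = ["descriptor", "bios", "me", "gbe", "platform data"]


def parse_ifd_regions(flash):
    """Return {region name: (first byte, last byte)} for the used regions."""
    if flash[0x10:0x14] != IFD_SIGNATURE:
        raise ValueError("no flash descriptor signature at offset 0x10")
    (flmap0,) = struct.unpack_from("<I", flash, 0x14)
    frba = ((flmap0 >> 16) & 0xFF) << 4  # offset of the region registers
    regions = {}
    for index, name in enumerate(REGION_NAMES):
        (flreg,) = struct.unpack_from("<I", flash, frba + 4 * index)
        base = (flreg & 0x7FFF) << 12
        limit = (((flreg >> 16) & 0x7FFF) << 12) | 0xFFF
        if base <= limit:  # base above limit marks an unused region
            regions[name] = (base, limit)
    return regions


# Synthetic 4KB descriptor for an 8MB flash whose last 6MB are the BIOS.
demo = bytearray(0x1000)
demo[0x10:0x14] = IFD_SIGNATURE
struct.pack_into("<I", demo, 0x14, 0x04 << 16)  # region registers at 0x40
flregs = [0x00000000, 0x07FF0200, 0x01FF0001, 0x00001FFF, 0x00001FFF]
for i, flreg in enumerate(flregs):
    struct.pack_into("<I", demo, 0x40 + 4 * i, flreg)

regions = parse_ifd_regions(bytes(demo))
for name, (start, end) in regions.items():
    print(f"{name}: 0x{start:06x}-0x{end:06x}")
```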

So… at that point I still haven’t managed to do a proper full flash dump using a software method, but at least, I have good BIOS dumps from AFULNX and AFUDOS, right? I then decide to use romdump to extract the BIOS from my logic-analyzer dump, and compare the BIOS dump from AFULNX with the one from the Logic Analyzer, to see what kind of corruption I got from the logic analyzer… and when comparing the two, I realize that the BIOS dumped from AFULNX is the one that is corrupted!

In the AFULNX BIOS dump, there were some huge chunks that were all “0xffffffff” but were filled with data in the logic analyzer BIOS dump. There were also differences in some bytes; when I looked at the full LA trace, I saw those bytes being read about 20 times with the exact same data being returned each time (confirming the data to be valid), and yet the data from AFULNX was different…

So I compared the AFULNX dump with the BIOS dump from AFUDOS, and… the files were different! The two BIOS dumps from AFULNX and AFUDOS were not only very different from the logic analyzer dump, they were also highly different from each other, confirming that they were both corrupted. That just makes everything worse… Why would a software-based dump of the BIOS result in corrupted data? And that’s using the official tool!
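This sort of comparison is simple to script. The Python sketch below lists the byte ranges where two equally sized dumps differ and flags the ones that are entirely 0xff in one of them, which is the pattern described above. The function names and the tiny sample data are made up for illustration.

```python
def diff_ranges(a, b):
    """Return (start, end) ranges, end exclusive, where two dumps differ."""
    if len(a) != len(b):
        raise ValueError("dumps must be the same size")
    ranges, start = [], None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = i
        elif x == y and start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, len(a)))
    return ranges


def is_blank(dump, start, end):
    """True if the range is all 0xff, i.e. looks erased or never read."""
    return all(byte == 0xFF for byte in dump[start:end])


# Made-up 8-byte "dumps": one chunk is blanked out, one byte is simply wrong.
la_dump = bytes([1, 2, 3, 4, 5, 6, 7, 8])
afulnx_dump = bytes([1, 2, 0xFF, 0xFF, 5, 6, 9, 8])

for start, end in diff_ranges(la_dump, afulnx_dump):
    kind = "blank in second dump" if is_blank(afulnx_dump, start, end) else "different data"
    print(f"0x{start:x}-0x{end:x}: {kind}")
```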

If the data is corrupted, how can I trust it? Most importantly how can I write to the BIOS if writing might cause it to be corrupted (i.e.: “is coreboot failing or is the image that was written to the flash wrong”)? Since AFULNX is not open-source, I was hoping to eventually reverse engineer AFULNX itself, to figure out what it does and port those “exceptions” (if there are any) to flashrom to make it support the Librem 13 hardware. That would allow the users to dump and write their flash with the coreboot BIOS using the free/libre and open source flashrom tool, but if that data can’t be trusted, it’s unacceptable. I need to find a way to do a proper and reliable dump of the flash.

Interlude: building coreboot

While I was thinking on how to do the dump, I decided to move to something else: building coreboot for the Librem (I had previously built coreboot for qemu, following the instructions on the wiki, but that’s not very helpful for the Librem 13). So I configured coreboot with ‘make menuconfig’, set the mainboard to the Librem 13, and ran ‘make’. Of course, it failed for some magical reason that I can’t remember right now, but after wrestling a bit with it I eventually realized that I was missing the BLOBs in the 3rdparty/blobs directory, and without it, I couldn’t build it. So I tried to disable the blobs from the config, but it still complained. I couldn’t find information on where to find those blobs that it needed, and I was a bit lost in all of that, to be honest. The only information I found was about the binary situation of coreboot but not on how and where to find the binary blobs, what filename to give to the files, in which directory to put them, etc. I’m going to skip ahead now and spare you the details of long nights crying and wondering “What am I doing wrong?” and just tell you that the coreboot project has a separate repository for binary blobs which needs to be cloned into the ‘3rdparty/’ directory (replacing the empty ‘blobs’ directory in it). Also, even if you ask for a fake IFD in the config, removing the need for the descriptor.bin and me.bin binaries, it will still look for the microcode blobs, which is a separate config option (which was causing things to fail for me).

So, the good news is, the CPU microcode blobs are in that blobs repository… but the descriptor.bin and me.bin binaries are not, and those will have to be created/copied manually; they come from my own flash and can be extracted from my full raw BIOS dump using the ‘util/ifdtool’ tool. Great, I don’t have that!

Back to the dump!

Time to try some alternative methods. In the previous blog post, I spoke about the possibility of using an IC test clip (actually, about using the Logic Analyzer’s test clips) and an external power supply to power the flash chip, but I considered that to be a dangerous thing to do (based on some post I read on the flashrom mailing list). It turns out that while I was afraid of doing that, it’s really not that dangerous and it is actually the preferred method of doing things.

I apparently got ‘tricked’ into using the socket method simply because I stumbled onto this page in the coreboot wiki and thought that’s what I needed to do.
Last week, Zlatan was at the CCC and participated in the coreboot install session, where he got someone from coreboot to dump his flash, compile coreboot and flash it on his Librem 13. It didn’t work, but it was a helpful event because I realized, “If coreboot developers are using the IC test clip and it’s safe, then I probably also should”, so I researched that further and found posts/wiki pages that confirm it to be a good method to use, then I ordered one, received it overnight, and started playing around with doing a dump of the flash using hardware—but without soldering!

With the IC test clip on hand, I started looking for docs about that tool and saw a lot of wiki pages explaining how to use it. Most of them said to use an ATX power supply to supply the 3.3V to the chip, and to always connect the power line last. So my first try was to connect the IC clip (so easy to use! Wonderful.) to my breadboard with the same FTDI UM232H as before, then disconnect the battery from the laptop, then connect the power line. The LED on the FTDI suddenly dimmed (it was powering much more than just the flash chip), and my PC was suddenly unable to access the FTDI chip. It still recognized it, but it complained about being unable to reset it. So I brought out a second FTDI chip, and used the first one for all the data/signal lines and the second one for the power line only (and I did connect the grounds of both FTDI chips together), but it was still failing: it just couldn’t recognize the flash. I eventually realized that flashrom was trying to read from the wrong FTDI (since I had two connected to my PC), but once I specified the right serial number to use, it was still failing, this time with “unable to read serial from device”. I suppose that the second FTDI was simply draining my laptop’s USB power or something; I’m not sure exactly what the issue was, but I decided to just go grab myself an ATX power supply and use that instead. After I shorted the PS_ON pin on the power supply to get it to turn on, and made sure the orange wires were for the 3.3V power rail, I connected it all and tried again. It still failed! But at least I was getting a response from the chip:

RDID returned 0xe1 0x10 0x0b. RDID byte 0 parity violation. probe_spi_rdid_generic: id1 0xe1, id2 0x100b

But when I checked the datasheet for the MX25L6406E chip, it said that the RDID response (the manufacturer/device ID) should be 0xc2 0x20 0x17, and I noticed that 0xe1 is 0xc2 shifted right by one bit with a 1 filling the top bit (and likewise 0x10 and 0x0b are 0x20 and 0x17 shifted right by one bit), so I figured that the FTDI must be too fast for the flash chip (even though the FTDI was running at 30MHz, and the flash datasheet says it supports a clock of 86MHz; maybe only for reads?). I decided to try lowering the clock, so I changed it to 15MHz and it worked! So yay, finally, I was able to dump the flash. I did five dumps, and they all had the exact same checksum, which meant that there was no corruption in any of the dumps. Now I can get back to work!
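To make the "one bit shifted" observation concrete, here is a small Python sketch that treats the expected RDID bytes as a single bit stream, shifts it right by one bit, and compares it with what flashrom printed. The function name is mine, and filling the top bit with a 1 is only an assumption that happens to reproduce the observed bytes; it is not a statement about what the hardware was actually doing.

```python
EXPECTED_RDID = bytes([0xC2, 0x20, 0x17])  # value quoted from the datasheet in the text
OBSERVED_RDID = bytes([0xE1, 0x10, 0x0B])  # value flashrom reported at 30MHz


def shift_right_one_bit(data: bytes, fill_bit: int = 1) -> bytes:
    """Shift a byte string right by one bit, treating it as one bit stream.

    fill_bit is the bit inserted at the front of the stream. Using 1 is an
    assumption made for this illustration only.
    """
    width = len(data) * 8
    value = int.from_bytes(data, "big") >> 1
    value |= (fill_bit & 1) << (width - 1)
    return value.to_bytes(len(data), "big")


if __name__ == "__main__":
    shifted = shift_right_one_bit(EXPECTED_RDID)
    print("expected:", EXPECTED_RDID.hex())
    print("shifted :", shifted.hex())
    print("matches observed:", shifted == OBSERVED_RDID)
```

All three bytes line up after the shift, which is consistent with the data being sampled one clock late rather than with random corruption.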

For the curious, here is the command I used to dump the flash:

../flashrom/flashrom -p ft2232_spi:type=232H,serial=FTVDZ6J5,divisor=4 -c "MX25L6406E/MX25L6408E" -V -r v1.rom
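The `divisor=4` in that command is what lowers the clock. As a rough sketch, assuming the FT232H's SPI clock is a 60MHz base clock divided by the divisor (an assumption on my part, though it matches the 30MHz and 15MHz figures above), the relationship looks like this in Python; the constant and function names are made up for the illustration:

```python
FTDI_BASE_CLOCK_MHZ = 60.0  # assumed base clock; not taken from the original post


def spi_clock_mhz(divisor: int) -> float:
    """Return the SPI clock in MHz for a given divisor, under the assumption above."""
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    return FTDI_BASE_CLOCK_MHZ / divisor


if __name__ == "__main__":
    for divisor in (2, 4, 8):
        print(f"divisor={divisor}: {spi_clock_mhz(divisor):g} MHz")
```

Under that assumption, a divisor of 2 corresponds to the 30MHz that failed for me, and a divisor of 4 to the 15MHz that worked.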

I was then able to use ifdtool to split the rom image into the 3 regions it contained (the descriptor, the ME and the BIOS regions). It didn’t contain a GbE region because the Librem does not use the Intel network card. I put the right files in the right directories and coreboot finally compiled!
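For those wondering how a tool can tell which regions an image contains, here is a rough Python sketch of the idea, based on my understanding of the older (version 1) Intel flash descriptor layout: a signature near the start of the image, a pointer to a table of region registers, and a base/limit pair packed into each register. The function names and the demo values are invented for the illustration and do not come from my actual dump; ifdtool remains the tool to use for real work.

```python
import struct

FD_SIGNATURE = struct.pack("<I", 0x0FF0A55A)  # flash descriptor signature, little-endian
REGION_NAMES = ["Descriptor", "BIOS", "ME", "GbE", "Platform Data"]


def list_regions(image: bytes) -> dict:
    """Return {name: (base, limit)} for every region that is present in the image."""
    sig = image.find(FD_SIGNATURE)
    if sig < 0:
        raise ValueError("no flash descriptor signature found")
    # FLMAP0 follows the signature; bits 16-23 hold the region table
    # offset (FRBA) in units of 16 bytes from the start of the image.
    (flmap0,) = struct.unpack_from("<I", image, sig + 4)
    frba = ((flmap0 >> 16) & 0xFF) << 4
    regions = {}
    for index, name in enumerate(REGION_NAMES):
        (flreg,) = struct.unpack_from("<I", image, frba + 4 * index)
        base = (flreg & 0xFFF) << 12
        limit = (((flreg >> 16) & 0xFFF) << 12) | 0xFFF
        if base <= limit:  # an unused region has base > limit
            regions[name] = (base, limit)
    return regions


def make_demo_image() -> bytes:
    """Build a tiny synthetic descriptor with made-up region values."""
    image = bytearray(b"\xff" * 0x1000)
    image[0x10:0x14] = FD_SIGNATURE
    struct.pack_into("<I", image, 0x14, 0x04 << 16)  # FRBA = 0x40
    flregs = [0x00000000, 0x07FF0200, 0x01FF0001, 0x00000FFF, 0x00000FFF]
    struct.pack_into("<5I", image, 0x40, *flregs)
    return bytes(image)


if __name__ == "__main__":
    for name, (base, limit) in list_regions(make_demo_image()).items():
        print(f"{name:12s} 0x{base:06x} - 0x{limit:06x}")
```

In the synthetic example, the GbE register is set to the "unused" pattern, so only the descriptor, BIOS and ME regions are listed, which mirrors what I saw on the Librem image.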

A good scare

Now, before testing out coreboot, I decided to go grab all the information I could on my hardware, so that if something happened, I would at least have that. So I went to the Motherboard Porting Guide on coreboot.org, read it, installed all the dependencies, then, after quickly looking at the commands that I was told to copy-paste (just some lspci, lsusb, etc.), I proceeded to paste this block of commands in my terminal in order to gather all the logs that I would need:

 lspci -nnvvvxxxx > lspci.log 2>lspci.err.log
 lsusb -vvv > lsusb.log 2>lsusb.err.log
 superiotool -deV > superiotool.log 2> superiotool.err.log
 inteltool -a > inteltool.log 2> inteltool.err.log
 ectool -i > ectool.log 2>ectool.err.log
 msrtool > msrtool.log 2>msrtool.err.log
 dmidecode > dmidecode.log 2>dmidecode.err.log
 biosdecode > biosdecode.log 2>biosdecode.err.log
 nvramtool -x > nvramtool.log 2>nvramtool.err.log
 dmesg > dmesg.log 2>dmesg.err.log
 flashrom -V -p internal:laptop=force_I_want_a_brick > flashrom_info.log 2>flashrom_info.err.log
 flashrom -V -p internal:laptop=force_I_want_a_brick -r rom.bin > flashrom_read.log 2>flashrom_read.err.log
 acpidump > acpidump.log 2>acpidump.err.log
 for x in /sys/class/sound/card0/hw*; do cat "$x/init_pin_configs" > pin_"$(basename "$x")"; done
 for x in /proc/asound/card0/codec#*; do cat "$x" > "$(basename "$x")"; done
 cat /proc/cpuinfo > cpuinfo.log 2>cpuinfo.err.log
 cat /proc/ioports > ioports.log 2>ioports.err.log
 cat /sys/class/input/input*/id/bustype > input_bustypes.log

And only after I pasted it did I see the next line in the wiki (below the block of commands) that says, “Save all logs in safe place, and also rom.bin file.”, and notice that flashrom is used in there too. Since I knew flashrom typically doesn’t work on laptops (as explained in my last blog post), I assumed that command probably just didn’t work and there would be no rom.bin… until I saw the option given to flashrom: laptop=force_I_want_a_brick …

I was half horrified, half amused—this seemed like a ridiculously dangerous thing to put in a long “to be copy/pasted” block of seemingly harmless commands. “Those options shouldn’t be part of a script, they should be set consciously by the user after a big warning!”, I thought. I later discussed this on IRC with the coreboot developers; while most agree that the risk is minimal and it is fairly safe, it was also agreed upon that the option shouldn’t be there in the wiki because of the risk to the user. So that will change soon (if not by a coreboot dev, then I will edit the wiki to add clear warnings).

Thankfully, the resulting “rom.bin” had the full 8MB BIOS dump in my case, and it was all correct (I later compared it with my dump made with the IC clip, and it was almost identical; the small differences must be due to the ME constantly writing data to the flash and thus changing some bytes). And so it seems I can indeed dump the flash from software only. Whew!
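Comparing two dumps like this can be done with a checksum first, and then, if the checksums differ, by listing the byte ranges that changed to see whether they cluster in one area (such as the ME region). A minimal Python sketch of that idea follows; the helper names are mine and the data in the demo is synthetic, not taken from my dumps:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a dump as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_ranges(a: bytes, b: bytes) -> list:
    """Return half-open (start, end) offset ranges where two equal-sized dumps differ."""
    if len(a) != len(b):
        raise ValueError("dumps have different sizes")
    ranges = []
    start = None
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = offset
        elif start is not None:
            ranges.append((start, offset))
            start = None
    if start is not None:
        ranges.append((start, len(a)))
    return ranges


if __name__ == "__main__":
    clip_dump = bytes(64)            # stand-in for the IC clip dump
    soft_dump = bytearray(clip_dump)  # stand-in for the software dump
    soft_dump[10:14] = b"\x01\x02\x03\x04"
    soft_dump = bytes(soft_dump)
    print("same checksum:", sha256_hex(clip_dump) == sha256_hex(soft_dump))
    for start, end in differing_ranges(clip_dump, soft_dump):
        print(f"differs at 0x{start:06x}-0x{end - 1:06x}")
```

To use it on real files, read each dump with `open(path, "rb").read()` and pass the two byte strings in.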

After some research and discussion on IRC, it turns out that the big problem with flashrom and laptops is the EC (the Embedded Controller, which handles battery charging, the power on/off events, etc.): if the EC is reading the flash, then you can get some conflicts. Sometimes the EC will appear as a fake flash to the PCH and act as a proxy, but either way, the real problem is just the conflict between the EC and the PCH trying to access the flash, and it happens to be safe on the Librem 13 because the EC uses a separate flash chip. An old blog post from 2015 on the Purism blog confirms that. So, no more need for AFULNX (or for reverse engineering what it does to bypass the EC): we can certainly use flashrom to dump the flash and write the coreboot update to the Librem 13 laptops.

Finding Nemo and Dory: the MRC and the VBIOS

I then used my IC clip setup to write the coreboot image into the flash, and turned on the laptop. The laptop stayed on, but the screen remained black. I asked around on the coreboot IRC channel to get some pointers. After posting my config file, I was told that I was missing 2 important items: the mrc.bin binary blob, and the VBIOS (Video/VGA BIOS); without a VBIOS there would be no graphics support, and without mrc.bin there would be no RAM. So I started searching for this “mrc.bin”, and found very little information on what “MRC” is, what it is for and where I can get it. Thankfully, the folks on IRC were very nice and helped me figure things out.

The MRC blob does the RAM initialization, and it is necessary on the Broadwell architecture because there is no native memory init support in coreboot for Haswell/Broadwell chips, unlike older Intel chips. From my understanding, it’s in some kind of Google-specific format, or Google is the one creating those mrc.bin binary blobs, and apparently it can’t be redistributed, but I’m not sure… So I have 3 options: use the mrc.bin from the coreboot image of a Chromebook with a similar chipset, but not be able to redistribute it (users would have to extract it from the Chromebook BIOS image themselves); use the Intel FSP, whose support in coreboot is incomplete, so I’d have to add that support somehow; or reverse engineer the memory initialization and re-implement it natively in coreboot. I think reverse engineering it may be the best thing for the future, but for now all I want is to get this running as fast as possible, so I’ll go with the MRC solution.

In any case, I couldn’t just download the mrc.bin file from somewhere, and it was not in coreboot’s blobs repository either. It’s apparently the same blob as the one used in the Google Chromebook Samus, which uses a similar Broadwell chipset, so all I needed to do was to run the ‘./util/chromebook/crosfirmware.sh samus’ command, which would download the Chromebook recovery image for the Samus device and extract the BIOS from it. Then I could use cbfstool to extract the mrc.bin binary from it as explained in this commit (see additions to the README file). Unfortunately, cbfstool would refuse to extract/print the BIOS because it couldn’t detect it as being a coreboot BIOS. Thankfully, someone on IRC (coolstar) quickly told me to use the ‘-r BOOT_STUB’ option, which fixes things. Once I did that, I got the mrc.bin file.

Then I was told I would also need the refcode… “What is the refcode?” Well, they didn’t know, and it isn’t clear from the coreboot config help page either, as it only says it’s an “external reference code to be put in cbfs” (by the way, cbfs is the coreboot filesystem, which contains all the coreboot code, blobs, etc.). Eventually, I found an email on the U-Boot mailing list saying that the Broadwell chipsets need an additional reference code to be executed in order to initialize the PCH. So I extracted the refcode binary blob as well and added it to coreboot.

The VBIOS, on the other hand, was the easy part. I followed the instructions from the coreboot wiki and first tried the bios_extract method, but it couldn’t recognize my BIOS as being an AMI BIOS. I looked at the bios_extract source code, figured out how it detects the BIOS vendor (it looks for specific strings in the image), and realized that my BIOS didn’t fit (it didn’t have anything close to those strings at all), so it wasn’t a simple bug that I could fix to make it work. So I figured that maybe it’s considered a “UEFI” BIOS, and followed the “UEFI method”, which uses UEFITool: I installed it, ran it, gave it my vendor BIOS, and it recognized it as a UEFI BIOS and showed me a tree of “lots of stuff”.

To find the VBIOS, the wiki says to:

* Look for the “CSMCORE” DXE Driver (usually having the hash ‘a062cf1f-8473-4aa3-8793-600bc4ffe9a8’)
* Search for text “VGA Compatible BIOS” (uncheck unicode)
* Search for text “PCIR” (uncheck unicode)

And so I did: I found the DXE Driver called “CSM DXE” with the correct hash. I thought that I would then open that item and search for “VGA Compatible BIOS” in it, but nope, when I did the search, it gave me a completely different module, and when I searched for “PCIR” it gave me a bunch of different modules which matched… and I had no idea which one to choose. I based my choice on the single result from “VGA Compatible BIOS” and extracted that one. I tried it and it didn’t work, so I wasn’t sure if it was right, but it felt right. Eventually, I put my original vendor BIOS back on the flash, booted the Librem 13 into PureOS, and used the “extraction from mapped memory” method to grab the VBIOS directly from GNU/Linux. When I compared it, it was identical to the vbios.bin that I had extracted with UEFITool, so I at least knew that I had the right file.
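The “PCIR” search works because a PCI option ROM (which is what the VBIOS is) starts with the bytes 0x55 0xAA and carries, at offset 0x18, a pointer to a data structure that begins with the string “PCIR” and holds the vendor and device IDs. Here is a hedged Python sketch of how one could narrow down the candidates by following that pointer instead of searching for the bare string. The function names are mine, the demo blob is synthetic, and the device ID in it is an arbitrary example, not necessarily the Librem’s:

```python
import struct

ROM_SIGNATURE = b"\x55\xaa"


def find_option_roms(blob: bytes) -> list:
    """Return (offset, vendor_id, device_id, size_in_bytes) for each option ROM found."""
    results = []
    pos = blob.find(ROM_SIGNATURE)
    while pos >= 0:
        if pos + 0x1A <= len(blob):
            # Offset 0x18 of the ROM header points to the PCI data structure.
            (pcir_ptr,) = struct.unpack_from("<H", blob, pos + 0x18)
            pcir = pos + pcir_ptr
            if pcir + 0x12 <= len(blob) and blob[pcir:pcir + 4] == b"PCIR":
                vendor, device = struct.unpack_from("<HH", blob, pcir + 4)
                # Image length is stored in units of 512 bytes.
                (blocks,) = struct.unpack_from("<H", blob, pcir + 0x10)
                results.append((pos, vendor, device, blocks * 512))
        pos = blob.find(ROM_SIGNATURE, pos + 1)
    return results


def make_demo_blob() -> bytes:
    """Build a synthetic blob with one fake option ROM at offset 0x200."""
    rom = bytearray(0x100)
    rom[0:2] = ROM_SIGNATURE
    struct.pack_into("<H", rom, 0x18, 0x40)            # pointer to "PCIR"
    rom[0x40:0x44] = b"PCIR"
    struct.pack_into("<HH", rom, 0x44, 0x8086, 0x1616)  # Intel vendor ID, example device ID
    struct.pack_into("<H", rom, 0x50, 0x80)            # 0x80 * 512 = 64KB
    return b"\xff" * 0x200 + bytes(rom) + b"\xff" * 0x100


if __name__ == "__main__":
    for offset, vendor, device, size in find_option_roms(make_demo_blob()):
        print(f"ROM at 0x{offset:x}: vendor 0x{vendor:04x}, device 0x{device:04x}, {size} bytes")
```

Filtering the results on the Intel vendor ID (0x8086) and on the graphics device ID reported by lspci would be one way to pick the right module out of several “PCIR” matches.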

Next steps

I now have coreboot with the proper descriptor and ME regions, the VBIOS, and mrc.bin (and refcode), but I don’t know yet why it still doesn’t boot. The next step for me will be to see how I can debug it. There are some ways to debug coreboot, but I haven’t looked at them yet. There’s an old blog post about debugging coreboot on the Librem 13, but I’m not sure I’m in the mood to be soldering on the LPC lines (and where are they?). However, Zlatan said that in the CCC coreboot session the coreboot developer was able to determine that it had stopped at the VGA init stage for him (they didn’t have the VBIOS, I think), so there is obviously a way to do that, and I doubt they soldered on the LPC lines in the middle of a conference. I think it’s safe to assume that I could use a BeagleBone Black (which I already have) and the EHCI debug port to capture debug information. That will be my challenge for next week, and once I achieve that, I will be able to debug what happens and hopefully figure out what went wrong and fix coreboot so it can boot the Librem 13!

Now, there were still some questions that I needed answered.

For example, assuming I get coreboot to work on my Librem 13, how can I know that it’s doing its job? I need a checklist of what coreboot is supposed to be doing and a way to check that it does it correctly. Thankfully, I found this old article that lists a checklist of 32 items for the Librem 13 coreboot port to be considered done.
I also wanted to know what are the risks to the motherboard if I have a wrong configuration in coreboot, are there components that could get fried? I got the answer today from IRC (thank you agra, avph, felix_, icon, for all the help), and the answer is that it’s “highly unlikely” that I could damage the board, although it is possible, such as supplying a too high voltage to the memory controller. Most boards are safe however, as the voltage regulator would not allow such things, but some motherboards might not be as safe as others, so I’d need to check the voltage regulators on my board to see what kind of values they allow.

So, now, I feel much more at ease with everything. My knowledge is of course still quite incomplete, but I don’t feel as blind as I felt last month, and I have a clear path towards what I need to be doing. Hopefully, I can find out easily how to enable debugging next week, figure out where everything went wrong, fix it, and “Tada! coreboot works!” But yeah, it’s probably going to be a little more complicated than that 😉

The post Librem 13 coreboot report – January 12, 2017 appeared first on Purism.