By Daniel Drake
In June 2022, a community of low-income women entrepreneurs in the United States was approached with the possibility of obtaining their first personal computer under Endless Laptop's innovative financing plans. They attended a local event, made a small down payment, and walked away with a powerful personal computer full of tools and content ready to support the growth of their businesses and the education of their families.
Our device access program was finally underway, reaching families that had never had a PC in their home before. This initiative was swiftly extended to underserved communities in Guatemala and Mexico, but as the user base grew, a significant flaw emerged: the laptop battery would run down very quickly when the system was in sleep mode.
In this article I wish to share the story of how Endless OS Foundation's highly skilled Tech team relentlessly chased this issue through several twists into the guts of the PC architecture, eventually pinpointing the problem to the most unlikely of things: a misbehaving device driver corresponding to a type of disk not even present in the product.
Due to the complexity and specificity of this problem, this recap unavoidably needs to draw on fairly technical jargon. I hope it is accessible to those who have at least a loose interest in PC and operating systems architecture!
Solving excessive battery drain through a stroke of good luck?
We first became aware of issues caused by rapid battery depletion in sleep mode at the same time that our Taipei lab staff were coincidentally debugging a different power management issue on a newer variant of the same laptop. It turned out that the new product, which was being considered as a successor, could not sleep at all: upon waking up, it would lose access to the disk.
Our software platform, Endless OS, is based on open source software (Linux), which allows us to tap into an extensive public developer community. In the wider discussion of this issue, we tested and approved a workaround that would have both laptop models revert to an older implementation of sleep mode.
In addition to allowing the newer product to sleep and wake without issue, this change had the additional effect of greatly reducing the amount of power used during sleep on the original Endless Laptop model that we had already delivered to our user base. This exercise had coincidentally solved the power usage issue being seen by our users. "That's handy!" we thought, as we swiftly rolled out this change to our user base via a software update.
Failures to wake up from sleep mode
Our attention was called back to this issue when we later identified a slow but steady stream of support requests from our users reporting that the device would occasionally get stuck in sleep mode. When this issue was encountered, the system would be unresponsive to any attempt to wake it up from the low power state. It was very hard to reproduce this failure, but we were eventually able to hit the failure and characterize it in detail.
Our workaround to the battery drain issue above was causing these systems to use S3 legacy suspend, a historical implementation of PC sleep mode. In this mode, control of the device is fully handed over to the system firmware when going into sleep mode, and the CPU is powered down while the RAM is kept in a low-power state that preserves its contents. Because the wakeup failure was happening in this mode, it was apparent that the issue was emerging at the system firmware level, beyond the reach of the operating system. It is perhaps not surprising that such a firmware issue may exist: this product was not designed for S3 legacy suspend, S3 is likely untested and unsupported on this device, and we should probably not be using it.
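For readers curious how to tell which sleep implementation a Linux system is using: the kernel lists the available variants in `/sys/power/mem_sleep` and puts square brackets around the selected one, where `deep` corresponds to S3 legacy suspend and `s2idle` to suspend-to-idle. The following is a minimal sketch that parses that text. The function names and the `MODE_NAMES` table are illustrative assumptions, not part of Endless OS.

```python
# Kernel names for the two suspend variants, with friendlier descriptions.
# This table is an illustrative assumption, not an official mapping.
MODE_NAMES = {
    "s2idle": "suspend-to-idle (used by Modern Standby designs)",
    "deep": "S3 legacy suspend (suspend-to-RAM)",
}


def active_sleep_mode(mem_sleep_text: str) -> str:
    """Return the bracketed (currently selected) entry, e.g. 'deep'."""
    for token in mem_sleep_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {mem_sleep_text!r}")


def describe_sleep_mode(mem_sleep_text: str) -> str:
    """Translate the selected mode into a human-readable description."""
    mode = active_sleep_mode(mem_sleep_text)
    return MODE_NAMES.get(mode, f"unknown mode {mode!r}")


# Example with the kind of text the kernel reports when S3 is selected.
print(describe_sleep_mode("s2idle [deep]"))  # S3 legacy suspend (suspend-to-RAM)
```

On a real machine the input string would come from reading `/sys/power/mem_sleep` rather than from a literal.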
Despite the initial indication that we had got lucky with the workaround to use S3 legacy suspend, it turned out to be unreliable and we knew we had to drop this and go back to understand the original problem in more depth. We had two questions to answer:
Why was the newer variant of the product failing to access the disk after waking up, before we put the (problematic) workaround in place?
Why was the device draining so much power during sleep mode, before we put the (problematic) workaround in place?
Intel Volume Management Device - failed disk access after wakeup
We compared the two product variants closely and spotted the reason why the newer variant had disk access issues after wakeup: it had the disks configured differently.
The original product had been rolled out with Intel VMD, a system function enabling powerful data storage setups, not entirely relevant for our home PC use case. The newer sample had been configured to access the disks in the traditional way, without VMD. And the non-VMD configuration was experiencing the lack of disk access after waking up.
We looked closely and found that our Linux-based operating system was completely powering down the disk in non-VMD sleep mode. This makes sense, because you want to save as much power as possible while the system is sleeping. But we observed that the device was unable to restore power to the disk from that state, and using advanced debugging tools, we observed that Windows, a different operating system, was not cutting the disk power during sleep mode on this product.
We still don't know why the power is retained in that configuration. Nevertheless, we updated the Linux behaviour to match. The problem was now avoided, but this time in a way where we had a far more precise understanding of the issue.
Modern Standby: understanding power usage
Now we had both laptop models able to sleep and wake up, regardless of disk configuration, without using the problematic legacy suspend method. It was time to return our attention to the original problem: why is so much power consumed when the system is asleep?
This product uses a Modern Standby design where the core system processor and operating system actually remain active during sleep mode. However, the operating system attempts to turn off as many hardware components as it can (screen, Wi-Fi, disk, etc.), pause all apps, and get the processor into an ultra-low power mode where it has almost no work to do. The goal is that power consumption will reduce so drastically that the system can be in sleep mode for days, even though technically you could regard the core system as being awake and running.
In our case, clearly this power consumption goal was not being met. The battery was being drained in a matter of hours in sleep mode.
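To make "hours" versus "days" concrete, the arithmetic can be sketched from two battery readings taken before and after a period of sleep. On Linux, many batteries report their stored energy in microwatt-hours through `energy_now` under `/sys/class/power_supply/`, although some expose charge-based files instead. The function names and the numbers in the example are hypothetical and are not measurements from the Endless Laptop.

```python
def average_drain_watts(energy_before_uwh: int, energy_after_uwh: int,
                        asleep_seconds: float) -> float:
    """Average power draw while asleep, from two energy readings in µWh."""
    if asleep_seconds <= 0:
        raise ValueError("asleep_seconds must be positive")
    used_wh = (energy_before_uwh - energy_after_uwh) / 1_000_000
    return used_wh / (asleep_seconds / 3600)


def hours_until_empty(energy_full_uwh: int, drain_watts: float) -> float:
    """How long a full battery would last at the given average draw."""
    if drain_watts <= 0:
        raise ValueError("drain_watts must be positive")
    return (energy_full_uwh / 1_000_000) / drain_watts


# Hypothetical example: a 40 Wh battery that loses 4 Wh during 2 hours asleep.
watts = average_drain_watts(40_000_000, 36_000_000, asleep_seconds=7200)
print(watts)                                 # 2.0
print(hours_until_empty(40_000_000, watts))  # 20.0
```

With these made-up figures a full battery would last under a day asleep. Reaching a week on the same battery would require an average draw of roughly a quarter of a watt.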
We called upon some low-level debugging features of the Intel processor that identify which specific parts of the system are reaching their lowest power states during suspend, and which are not. This revealed that the SATA disk controller was preventing the CPU from going into low power mode.
This was a very surprising finding. SATA refers to a type of disk, but this product uses a more modern type of storage (NVMe) - not SATA! Why on earth would the unused SATA controller be getting in our way? What could cause it to prevent the CPU from deep sleep?
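The first step in this kind of investigation is confirming that the platform never reaches its deepest idle state while asleep. One place Linux can expose this is the cumulative counter `/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us`, which reports microseconds spent in the lowest-power system state on platforms that support it. The sketch below compares two readings of such a counter against the time spent asleep. It is not the Intel debugging tooling we used, and the function names and the 90% threshold are assumptions chosen for illustration.

```python
def low_power_share(residency_before_us: int, residency_after_us: int,
                    asleep_seconds: float) -> float:
    """Fraction of the sleep period spent in the deepest platform idle state."""
    if asleep_seconds <= 0:
        raise ValueError("asleep_seconds must be positive")
    delta_us = residency_after_us - residency_before_us
    if delta_us < 0:
        raise ValueError("residency counter went backwards")
    # Clamp to 1.0 in case the two clocks disagree slightly.
    return min(1.0, delta_us / (asleep_seconds * 1_000_000))


def blocked_from_deep_sleep(share: float, threshold: float = 0.9) -> bool:
    """True if the system spent less than `threshold` of its sleep in the deepest state."""
    return share < threshold


# A counter that does not move at all during 10 minutes of sleep:
stuck = low_power_share(5_000_000, 5_000_000, asleep_seconds=600)
print(stuck, blocked_from_deep_sleep(stuck))  # 0.0 True
```

A share near zero only says that something is holding the platform awake. Identifying which component is responsible, the SATA controller in our case, needs the more detailed per-component reporting described above.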
The mysterious Tiger Lake SATA power savings issue
Harnessing the power of the open source community, we were able to ask those questions directly to Intel engineers highly familiar with the workings of the hardware. That quickly gave us the exact direction we needed: SATA power savings had been intentionally disabled for this specific Intel "Tiger Lake" processor family on Linux. When power savings had been enabled at an earlier point, it had caused multiple users to mysteriously find themselves unable to boot their computers; nobody knew why.
This suggested that there was probably a whole range of products suffering from this power drain issue. It also meant that we would have to solve this SATA disk issue in order to make progress, despite our product not even making use of SATA.
Refocusing around this challenge, Endless's Jian-Hong Pan impressed us all by quickly spotting a peculiar detail that had evaded everyone else for years: the code being used to turn on power savings for Intel SATA controllers was quietly and unexpectedly activating an additional behavior change for these devices. Much older Intel SATA controllers needed a "quirk" in order to support multiple disks, and this behavior change had been intended to be restricted to Intel hardware up to around 2017. Six years later, however, Linux was inadvertently applying the quirk to most present-day Intel SATA controllers. And for whatever reason, applying this obsolete quirk to the Intel Tiger Lake processors would cause the SATA disks to become completely inaccessible.
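The shape of this bug, two unrelated behaviors accidentally coupled behind one switch, can be illustrated with a toy model. This is not the Linux driver code: every name below is hypothetical, the 2017 cutoff is simply the "around 2017" mentioned above, and 2020 is used as the release year of Tiger Lake.

```python
from dataclasses import dataclass

QUIRK_LAST_INTENDED_YEAR = 2017  # "up to around 2017", per the description above


@dataclass(frozen=True)
class ControllerFlags:
    power_savings: bool
    legacy_quirk: bool  # the multi-disk quirk meant for much older hardware


def flags_before_fix(release_year: int, power_savings: bool) -> ControllerFlags:
    # Modelled bug: the quirk rides along with power savings, and the
    # controller's age (release_year) is never consulted.
    return ControllerFlags(power_savings=power_savings, legacy_quirk=power_savings)


def flags_after_fix(release_year: int, power_savings: bool) -> ControllerFlags:
    # Modelled fix: the two decisions are independent; the quirk depends
    # only on how old the controller is.
    return ControllerFlags(
        power_savings=power_savings,
        legacy_quirk=release_year <= QUIRK_LAST_INTENDED_YEAR,
    )


def disks_accessible(flags: ControllerFlags, breaks_with_quirk: bool) -> bool:
    """Disks stay reachable unless the quirk is applied to hardware it breaks."""
    return not (flags.legacy_quirk and breaks_with_quirk)


# Tiger Lake in this toy model: released 2020, breaks when the quirk is applied.
print(disks_accessible(flags_before_fix(2020, True), breaks_with_quirk=True))  # False
print(disks_accessible(flags_after_fix(2020, True), breaks_with_quirk=True))   # True
```

In this model, the only way to keep disks accessible before the fix is to leave power savings off, which matches the situation we found: a bootable system that could not reach deep sleep.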
Mission accomplished, time to sleep
Thanks to our findings, the Linux SATA maintainers were able to restrict the application of the SATA quirk and activate power savings for Tiger Lake SATA, which should improve power usage on a whole range of devices in addition to ours. We then prevented our disk being problematically turned off during sleep mode and re-enabled Modern Standby for this product, which is now able to achieve around a week of battery life in sleep mode. These fixes were all incorporated into official versions of Linux, and rapidly rolled out to our user base in Endless OS 5.1.2. With the problem incidence rate having subsequently dropped to zero, we can comfortably conclude that Endless Laptop's first-time PC users in underserved communities are now enjoying long battery life on their devices.
That was a long, hard, fascinating ride. What started with power usage issues took us through suspend mechanisms, firmware issues, disk power management, and quirks for unrelated hardware predating our product by several years. This example demonstrates the skill and resilience of the Endless team, the power of open source communities, and the importance of solving technical issues through truly understanding their root cause, no matter how deep you have to go.
Credit to Jian-Hong Pan and Cassidy Blaede at Endless for their detailed investigation of this issue, David Box and Mika Westerberg from Intel for their speedy and invaluable direction, and Linux SATA maintainers Niklas Cassel and Mario Limonciello for pushing the crucial fixes over the finish line.
Author - Daniel Drake, VP Engineering.
Daniel is passionate about extending the positive impacts of technology throughout the world and holds a specific interest in free & open source software.