My take on the "NMI Watchdog detected hard LOCKUP on cpu" issue

Since I ran into this issue too, I want to share what I did to solve it. It is NOT a general solution, but it may be helpful for others in similar circumstances. The concrete issue is that the system freezes completely without leaving anything in the logs. The only message appears in an open console/shell, like:

NMI Watchdog detected hard LOCKUP on cpu 0

Period. Frozen. Nothing other than a cold reboot is possible.

The lockups occurred randomly, but always during an X session. More precisely, as I realized later, the system locked up whenever X was running, even while I was working in a (remote) console session. If X was not started at all, I had no lockups. Before going on, let's have a look at the

Hardware

I have had the issue on two different hardware platforms:

ASRock Q2900 with Intel Pentium J2900 (Baytrail)
MSI Workstation C236M with Xeon E3 (Skylake)

Both are µATX boards with Intel on-board graphics and a second Nvidia GT 630 graphics card. Only one graphics adapter was connected to a monitor at any time. The BIOS (UEFI) was set to its safe defaults on both boards. Overall the differences are small: both platforms mostly use the same kernel drivers.

The Baytrail J2900 was just a test system, so the solution described here was worked out on the Skylake Xeon E3. I assume it applies to the Baytrail system as well.

Getting the solution

I couldn't find a solution by searching the web. Trying some of the suggested power-management parameters, such as intel_idle.max_cstate on the kernel command line, didn't solve the issue. After I installed X including kde-plasma, I noticed that the boot process hangs for a while on

Waiting for uevents to be processed ...

This indicates that something is wrong with a module/driver. So I decided to start from here and removed all modules that are not necessary for my hardware from the kernel. Then I disabled eudev, rebooted, and loaded/unloaded modules by hand. The interesting snippets are:

$ lspci -k
... <snip> ...
00:02.0 VGA compatible controller: Intel Corporation Device 191d (rev 06)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7972
        Kernel driver in use: i915
        Kernel modules: i915
... <snip> ...
01:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 630 Rev. 2] (rev a1)
        Subsystem: CardExpert Technology GK208 [GeForce GT 630 Rev. 2]
        Kernel modules: nouveau
... <snip> ...
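
The pattern to look for in this output is a device that lists a kernel module but no "Kernel driver in use" line. As a rough sketch (not something I used at the time), the helper below filters lspci -k output for exactly that case. The function name unbound_devices is made up for illustration, and it assumes the usual lspci -k layout shown above.

```shell
# Hypothetical helper: read `lspci -k` output on stdin and print every
# device that has a "Kernel modules:" line but no "Kernel driver in use:".
unbound_devices() {
    awk '
        function flush() {
            # Report the previous device if no driver was bound to it.
            if (dev != "" && !inuse && mods != "") print dev " -> " mods
        }
        # Device header lines start with a hex bus address, e.g. 01:00.0
        /^[0-9a-f]/ { flush(); dev = $0; inuse = 0; mods = ""; next }
        /Kernel driver in use:/ { inuse = 1 }
        /Kernel modules:/ { sub(/^.*Kernel modules: */, ""); mods = $0 }
        END { flush() }
    '
}
```

Typical use would be: lspci -k | unbound_devices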

The module for the GT 630 did not load, and dmesg showed an error message. Why? And then a

$ modprobe -f nouveau

froze the system immediately! So I removed the nouveau driver completely from the kernel and rebuilt it. After the reboot I reinstalled the whole graphical environment using nvidia-drivers 361.28. And voilà, the system has now been running stable 24/7 for 4 weeks. The uevents are processed at normal speed again as well.

One thing I did afterwards may or may not be related to the lockup: dmesg showed that the i915 driver couldn't load the firmware it needed. I was using kernel version 4.4.26. The correct firmware was in linux-firmware-20160331, which I installed shortly afterwards.
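
To spot such firmware messages quickly, a small filter over the kernel log can help. This is only a sketch: the function name firmware_lines is made up, and it simply assumes that the relevant messages mention both the driver name and the word "firmware".

```shell
# Hypothetical helper: read kernel log text on stdin and keep only lines
# that mention both the given driver name ($1) and the word "firmware".
firmware_lines() {
    grep -i -- "$1" | grep -i 'firmware'
}
```

Typical use would be: dmesg | firmware_lines i915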

I recommend completely disabling all graphics drivers in the kernel that will not be used. So the kernel .config should contain (example):

# CONFIG_DRM_NOUVEAU is not set
# CONFIG_FB_NVIDIA is not set
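
Whether an option really is unset can be checked with a short script instead of reading the .config by eye. The following is a minimal sketch. The function name config_state is made up, and the path in the usage line is an assumption (the common location of the kernel sources), so adjust it to wherever your .config lives.

```shell
# Hypothetical helper: report the state of one kernel config option.
# $1: path to a kernel .config file, $2: option name (e.g. CONFIG_DRM_NOUVEAU)
# Prints "unset", "set", or "absent" (option not mentioned in the file).
config_state() {
    if grep -q "^# $2 is not set\$" "$1"; then
        echo "unset"
    elif grep -q "^$2=" "$1"; then
        echo "set"
    else
        echo "absent"
    fi
}
```

Typical use would be: config_state /usr/src/linux/.config CONFIG_DRM_NOUVEAU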

A careful conclusion

I want to be very careful here: please keep in mind that something in this conclusion might not be entirely correct. It reflects the state of things at the time of writing. So, in short, in my case:

The nouveau kernel driver is broken, or something in combination with it is!

During my research I found some points that I want to put together with my own thoughts:

The issue seems to have started with kernel 3.12.x and continues through 4.6.x; maybe it is still going on. Some people pointed out that 4.8 is fine, others that 4.9 is, but I don't believe this as of now. Some people confirmed success with older versions like 4.6 too quickly and retracted it afterwards.
With a 3.x kernel the “idle” workaround may work, at least for a daily session (8 hours?), but I cannot confirm it for 24/7 operation.
It looks like in most cases low-power systems, especially Baytrail platforms, are affected. Rarely, I have read of other systems with an Intel i5 (Haswell?). Another indicator may be Intel on-board graphics. Only once did I see an AMD system mentioned.
Replacing new broken hardware with newer broken hardware doesn't make any sense to me. The software has to control the hardware. If a hardware feature is broken, the software shouldn't use it or should work around it. How much hardware in the past was not broken in some way during its short lifetime, even though it could be used successfully?
The time range of this issue is quite interesting.

I can definitely confirm that I first had issues, maybe related to this one, when I tried to upgrade from kernel 3.10.x to 3.12.x. Then the uevents (see above) hung completely during the boot process. I don't remember exactly, but it had something to do with the kernel graphics drivers.

That's my outcome. I'm not a kernel hacker, and my knowledge is far from sufficient to solve the bug. From my knowledge base: I know enough to know that I know nothing. (based on Socrates)
