My take on the "NMI Watchdog detected hard LOCKUP on cpu" issue

I ran into this issue too, so I want to share what I did to solve it. It is NOT a general solution, but it may be helpful for others in similar circumstances. Concretely, the system freezes completely without leaving anything in the logs. The only message appears in an open console/shell:

NMI Watchdog detected hard LOCKUP on cpu 0

Period. Frozen. Nothing other than a cold reboot is possible.

The lockups occurred randomly, but always during an X session. More precisely - I realized this later on - the system locked up whenever X was running, even while I was working in a (remote) console session. If X was not started at all, I had no lockups. Before going on, let's have a look at the

Hardware

I have had the issue on two different hardware platforms:

Both are µATX boards with Intel on-board graphics and an additional Nvidia GT630 graphics card. Only one of the two graphics adapters was connected to a monitor at any time. The BIOS (UEFI) was set to its safe defaults on each board. Overall the differences are not big; both platforms mostly use the same kernel drivers.

The Baytrail J2900 was just a test system, so the solution described here was worked out on the Skylake Xeon E3. I assume it applies to both.

Getting the solution

I couldn't find a solution by searching the web. Trying some of the suggested ACPI-related settings, such as intel_idle.max_cstate on the kernel command line, didn't solve the issue. After I installed X including kde-plasma, I noticed that the boot process stalled for a while at

Waiting for uevents to be processed ...

This indicates that something is wrong with a module/driver. So I decided to start from there and removed from the kernel all modules that are not necessary for my hardware. I then disabled eudev, rebooted, and loaded/unloaded modules by hand. These are the interesting snippets:

$ lspci -k
... <snip> ...
00:02.0 VGA compatible controller: Intel Corporation Device 191d (rev 06)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7972
        Kernel driver in use: i915
        Kernel modules: i915
... <snip> ...
01:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 630 Rev. 2] (rev a1)
        Subsystem: CardExpert Technology GK208 [GeForce GT 630 Rev. 2]
        Kernel modules: nouveau
... <snip> ...

The module for the GT 630 was not loaded, and dmesg showed an error message. Why? And then a

$ modprobe -f nouveau

freezes the system immediately! So I removed the nouveau driver completely from the kernel and rebuilt it. After the reboot I reinstalled the whole graphical environment using nvidia-drivers 361.28. And voilà, the system has now been running stable 24/7 for 4 weeks. The uevents are processed at normal speed again as well.
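To spot a graphics device without a bound driver more quickly than by reading the full listing, the `lspci -k` output can be filtered. The following is only a minimal sketch: the function name `unbound_vga` is my own invention, and it assumes the output layout shown above (device line starting with the PCI address, indented detail lines below it).

```shell
#!/bin/sh
# Hypothetical helper (not part of pciutils): reads `lspci -k` output on stdin
# and prints the PCI address of every VGA controller that has no
# "Kernel driver in use:" line, i.e. no driver is bound to it.
unbound_vga() {
    awk '
        # A device header starts in column 0 with the PCI address.
        /^[0-9a-f]/ {
            if (vga && !bound) print addr   # report the previous device
            addr  = $1
            vga   = ($0 ~ /VGA compatible controller/)
            bound = 0
            next
        }
        /Kernel driver in use:/ { bound = 1 }
        END { if (vga && !bound) print addr }
    '
}

# Usage on a live system:
#   lspci -k | unbound_vga
```

With the listing above, this would report only the Nvidia card, because the Intel device has i915 bound to it.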

One thing I did afterwards may or may not be related to the lockup. The dmesg output showed that the i915 driver couldn't load the necessary firmware. I was using kernel version 4.4.26. The correct firmware was in linux-firmware-20160331, which I installed shortly after.
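A quick way to look for such firmware messages is to filter the kernel log. This is just a sketch: the helper name `i915_firmware_lines` is made up, and the exact wording of the messages depends on the kernel version, so it simply keeps every line that mentions both i915 and firmware.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and keeps only the
# lines that mention both "i915" and "firmware" (case-insensitive).
i915_firmware_lines() {
    grep -i 'i915' | grep -i 'firmware'
}

# Usage on a live system (reading dmesg may require root):
#   dmesg | i915_firmware_lines
```

If nothing is printed, the log contains no firmware-related i915 message.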

I recommend completely disabling all graphics drivers in the kernel that will not be used. So the kernel .config should contain (example):

# CONFIG_DRM_NOUVEAU is not set
# CONFIG_FB_NVIDIA is not set
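Whether these options are really disabled can be checked against the config file before rebuilding. A minimal sketch follows: `check_unset` is a hypothetical helper name, and the path in the usage comment is an assumption about where the kernel sources live on your system.

```shell
#!/bin/sh
# Hypothetical helper: reports for each given option whether it is disabled
# in a kernel .config file. An option counts as enabled only if a line
# "OPTION=..." exists (e.g. =y or =m); a missing option or a
# "# OPTION is not set" comment both mean it is disabled.
# Returns non-zero if at least one option is still enabled.
check_unset() {
    config=$1
    shift
    status=0
    for opt in "$@"; do
        if grep -q "^${opt}=" "$config"; then
            echo "$opt is still enabled"
            status=1
        else
            echo "$opt is not set"
        fi
    done
    return $status
}

# Usage (the path is an assumption, adjust it to your kernel source tree):
#   check_unset /usr/src/linux/.config CONFIG_DRM_NOUVEAU CONFIG_FB_NVIDIA
```

The non-zero return value makes it easy to use the check in a script that should abort before a kernel build if an unwanted driver is still selected.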

A careful conclusion

I want to be very careful here, so please keep in mind that parts of this conclusion may not be entirely correct. It reflects what I knew at the time of writing. In short, in my case:

The nouveau kernel driver is broken, or something in combination with it is!

During my research I found some points that I want to put together with my own thoughts:

I can definitely confirm that I first had issues - maybe related to this - when I tried to upgrade from kernel 3.10.x to 3.12.x. Back then the uevents (see above) hung completely during the boot process. I don't remember exactly, but it had something to do with the kernel graphics drivers.

That's my outcome. I'm not a kernel hacker, and my knowledge is far from sufficient to fix the bug. I know enough to know that I know nothing. ;-) (based on Socrates)

Looking back, I wasted a lot of time going through some threads. Nevertheless