Bug 1271537 - help analyzing "mce: [Hardware Error]: Machine check events logged" problem
Status: CLOSED CANTFIX
Product: Fedora
Classification: Fedora
Component: qemu
Version: 25
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Assigned To: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
Reported: 2015-10-14 04:46 EDT by Mikhail
Modified: 2017-03-02 13:41 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-02 13:41:59 EST
Type: Bug


Attachments
Screenshot (746.20 KB, image/png)
2015-10-14 04:46 EDT, Mikhail
oops-2015-10-14-13:07:03-894-0 (57.30 KB, application/x-gzip)
2015-10-14 04:47 EDT, Mikhail
oops-2015-10-14-13:07:28-894-0 (57.36 KB, application/x-gzip)
2015-10-14 04:48 EDT, Mikhail
boxes-unknown.log (3.07 KB, text/plain)
2015-12-23 21:38 EST, Mikhail
/proc/cpuinfo (7.94 KB, text/plain)
2015-12-23 21:57 EST, Mikhail
journalctl -u mcelog > mce.txt (26.66 KB, text/plain)
2017-02-20 03:48 EST, Mikhail

Description Mikhail 2015-10-14 04:46:34 EDT
Created attachment 1082747 [details]
Screenshot

Description of problem:
I have noticed that every time I run a virtual machine in gnome-boxes, I get this strange error on the host:

окт 14 13:07:02 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
окт 14 13:07:02 localhost.localdomain mcelog[865]: Hardware event. This is not a software error.
окт 14 13:07:02 localhost.localdomain mcelog[865]: MCE 0
окт 14 13:07:02 localhost.localdomain mcelog[865]: CPU 3 BANK 0
окт 14 13:07:02 localhost.localdomain mcelog[865]: TIME 1444810022 Wed Oct 14 13:07:02 2015
окт 14 13:07:02 localhost.localdomain mcelog[865]: MCG status:
окт 14 13:07:02 localhost.localdomain mcelog[865]: MCi status:
окт 14 13:07:02 localhost.localdomain mcelog[865]: Corrected error
окт 14 13:07:02 localhost.localdomain mcelog[865]: Error enabled
окт 14 13:07:02 localhost.localdomain mcelog[865]: MCA: Internal parity error
окт 14 13:07:02 localhost.localdomain mcelog[865]: STATUS 90000040000f0005 MCGSTATUS 0
окт 14 13:07:02 localhost.localdomain mcelog[865]: MCGCAP c09 APICID 6 SOCKETID 0
окт 14 13:07:02 localhost.localdomain mcelog[865]: CPUID Vendor Intel Family 6 Model 60
окт 14 13:07:27 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
окт 14 13:07:27 localhost.localdomain mcelog[865]: Hardware event. This is not a software error.
окт 14 13:07:27 localhost.localdomain mcelog[865]: MCE 0
окт 14 13:07:27 localhost.localdomain mcelog[865]: CPU 1 BANK 0
окт 14 13:07:27 localhost.localdomain mcelog[865]: TIME 1444810047 Wed Oct 14 13:07:27 2015
окт 14 13:07:27 localhost.localdomain mcelog[865]: MCG status:
окт 14 13:07:27 localhost.localdomain mcelog[865]: MCi status:
окт 14 13:07:27 localhost.localdomain mcelog[865]: Corrected error
окт 14 13:07:27 localhost.localdomain mcelog[865]: Error enabled
окт 14 13:07:27 localhost.localdomain mcelog[865]: MCA: Internal parity error
окт 14 13:07:27 localhost.localdomain mcelog[865]: STATUS 90000040000f0005 MCGSTATUS 0
окт 14 13:07:27 localhost.localdomain mcelog[865]: MCGCAP c09 APICID 2 SOCKETID 0
окт 14 13:07:27 localhost.localdomain mcelog[865]: CPUID Vendor Intel Family 6 Model 60
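For readers who want to see how the STATUS value in these records maps onto the text mcelog prints, here is a minimal sketch that splits the 64-bit value into fields. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM; the function name `decode_mci_status` and the dictionary keys are made up for illustration and are not part of mcelog.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields.

    Bit positions follow the Intel SDM layout (an assumption of this sketch).
    """
    return {
        "valid": bool((status >> 63) & 1),            # VAL: record holds a real error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means the error was corrected
        "enabled": bool((status >> 60) & 1),          # EN: reporting enabled for this error
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Bits 52:38 hold a corrected-error count only when MCG_CAP bit 10
        # (CMCI_P) is set; MCGCAP c09 in the log above has that bit set.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,                  # 0x0005 is "internal parity error"
    }


fields = decode_mci_status(0x90000040000F0005)  # STATUS value from the log above
for name, value in fields.items():
    print(f"{name}: {value}")
```

With the value from the log, `uncorrected` is false and `enabled` is true, which matches the "Corrected error" and "Error enabled" lines, and `mca_code` 5 matches "Internal parity error".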


Inside the virtual machine, at around the same time, I see these records in the journal log:
Oct 14 13:04:48 localhost.localdomain libvirtd[3265]: internal error: QEMU / QMP failed:
                                                      (process:3345): GLib-WARNING **: gmem.c:482: custom memory
                                                      Could not access KVM kernel module: No such file or direct
                                                      failed to initialize KVM: No such file or directory
Oct 14 13:04:48 localhost.localdomain libvirtd[3265]: Failed to probe capabilities for /usr/bin/qemu-kvm: intern
                                                      (process:3345): GLib-WARNING **: gmem.c:482: custom memory
                                                      Could not access KVM kernel module: No such file or direct
                                                      failed to initialize KVM: No such file or directory
Oct 14 13:05:24 localhost.localdomain kernel: kworker/dying (116) used greatest stack depth: 4024 bytes left
Oct 14 13:05:33 localhost.localdomain systemd[1]: Reloading.
Oct 14 13:05:33 localhost.localdomain systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is m
Oct 14 13:05:42 localhost.localdomain dhclient[1312]: DHCPREQUEST on ens3 to 192.168.124.1 port 67 (xid=0x68b318
Oct 14 13:05:42 localhost.localdomain dhclient[1312]: DHCPACK from 192.168.124.1 (xid=0x68b31823)
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    address 192.168.124.12
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    plen 24 (255.255.255.0)
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    gateway 192.168.124.1
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    server identifier 192.168.124.1
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    lease time 3600
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>    nameserver '192.168.124.1'
Oct 14 13:05:43 localhost.localdomain NetworkManager[898]: <info>  (ens3): DHCPv4 state changed bound -> bound


I'm sure that this is not a mere coincidence.

Of course, abrt would not let me report this.
Can you help me find the real culprit here?
Comment 1 Mikhail 2015-10-14 04:47 EDT
Created attachment 1082748 [details]
oops-2015-10-14-13:07:03-894-0
Comment 2 Mikhail 2015-10-14 04:48 EDT
Created attachment 1082749 [details]
oops-2015-10-14-13:07:28-894-0
Comment 3 Mikhail 2015-10-14 05:00:19 EDT
Found interesting discussion here: https://bugs.launchpad.net/qemu/+bug/1307225
Comment 6 Jakub Filak 2015-12-10 04:33:02 EST
Hi Mikhail, unfortunately I am neither an MCE nor a virtualization expert. Maybe the QEMU folks can help us here.
Comment 7 Cole Robinson 2015-12-23 15:40:38 EST
Can you provide:

- host /proc/cpuinfo
- ~/.cache/libvirt/qemu/log/$vmname.log

If you remove qemu\* from inside the VM, and then re-run the VM, does it still trigger the MCE on the host?
Comment 8 Mikhail 2015-12-23 21:38 EST
Created attachment 1109126 [details]
boxes-unknown.log
Comment 9 Mikhail 2015-12-23 21:57 EST
Created attachment 1109127 [details]
/proc/cpuinfo
Comment 10 Mikhail 2015-12-23 22:15:35 EST
> If you remove qemu\* from inside the VM, and then re-run the VM

You mean run the command
# dnf remove qemu\*
inside the guest machine? And reboot after it?
Comment 11 Cole Robinson 2015-12-24 11:54:02 EST
(In reply to Mikhail from comment #10)
> > If you remove qemu\* from inside the VM, and then re-run the VM
> 
> You mean run the command
> # dnf remove qemu\*
> inside the guest machine? And reboot after it?

Yeah, I wonder if something trying to use virt _inside_ the VM is causing the host issue; nested virt is tricky.
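
As a rough way to check on the host whether nested virtualization is exposed to guests at all, the sketch below reads the `nested` parameter of the `kvm_intel` module. It assumes an Intel host with `kvm_intel` loaded (on AMD the module would be `kvm_amd`); the helper name `nested_virt_enabled` is hypothetical.

```python
from pathlib import Path


def nested_virt_enabled(param_file="/sys/module/kvm_intel/parameters/nested"):
    """Return True/False for the nested setting, or None if the file is absent.

    The default path assumes an Intel host with the kvm_intel module loaded.
    """
    path = Path(param_file)
    if not path.exists():
        return None  # module not loaded, or not an Intel KVM host
    # The kernel reports this parameter as "Y"/"N" or "1"/"0" depending on version.
    return path.read_text().strip() in ("Y", "1")


print("nested virtualization:", nested_virt_enabled())
```

A result of `None` only means the parameter file was not found, not that nested virtualization is off.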
Comment 12 Mikhail 2016-02-17 14:39:44 EST
Hmm, it seems that removing qemu\* helps.
Comment 13 Mikhail 2016-05-02 03:42:29 EDT
This also occurs when the guest system is Windows 10.
Comment 14 Cole Robinson 2016-05-02 16:53:09 EDT
(In reply to Mikhail from comment #13)
> This also occurred when guest system is Windows 10

Are you using any virtualization software inside the Windows VM?
Comment 15 Mikhail 2016-05-03 00:42:27 EDT
(In reply to Cole Robinson from comment #14)
> (In reply to Mikhail from comment #13)
> > This also occurred when guest system is Windows 10
> 
> Are you using any virtualization software inside the windows VM?

No
Comment 16 Fedora End Of Life 2016-11-24 07:47:05 EST
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 17 Cole Robinson 2017-02-16 14:33:37 EST
Mikhail, do you still see this with an up-to-date F25 kernel and qemu?
Comment 18 Mikhail 2017-02-20 03:48 EST
Created attachment 1255609 [details]
journalctl -u mcelog > mce.txt
Comment 19 Mikhail 2017-02-20 03:50:13 EST
So we see that it occurred today with the latest kernel:
# uname -r
4.9.10-200.fc25.x86_64
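
To see whether the events cluster on particular cores, a journal dump like the attached mce.txt could be tallied per CPU and bank. This is only a sketch: it assumes the "CPU n BANK m" line format shown in the description, uses two of those lines as sample input, and the names `RECORD` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Matches mcelog journal lines such as "... mcelog[865]: CPU 3 BANK 0".
RECORD = re.compile(r"mcelog\[\d+\]: CPU (\d+) BANK (\d+)")


def tally(lines):
    """Count machine check records per (cpu, bank) pair."""
    counts = Counter()
    for line in lines:
        match = RECORD.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Sample input copied from the description; a real run would read mce.txt instead.
sample = [
    "окт 14 13:07:02 localhost.localdomain mcelog[865]: CPU 3 BANK 0",
    "окт 14 13:07:02 localhost.localdomain mcelog[865]: MCA: Internal parity error",
    "окт 14 13:07:27 localhost.localdomain mcelog[865]: CPU 1 BANK 0",
]

for (cpu, bank), count in sorted(tally(sample).items()):
    print(f"CPU {cpu} BANK {bank}: {count} event(s)")
```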
Comment 20 Cole Robinson 2017-02-20 18:21:24 EST
Check out this page:

https://en.wikipedia.org/wiki/Machine-check_exception

Machine check exceptions are hardware errors. It's possible that software is tickling some hardware bug, I guess, but in most cases MCEs seem to be caused by CPU overheating.

Does the issue always pop up when your machine is heavily loaded, or when you're actively using a VM?
Comment 21 Mikhail 2017-02-20 23:18:02 EST
Yes, I am actively using a VM and running Google Chrome inside it.
The CPU is not overheating.
Comment 22 Cole Robinson 2017-03-02 13:41:59 EST
I just looked more closely at the Launchpad link; thanks for finding that. It basically points to this being an issue with the Intel Haswell architecture, which throws spurious errors with virt stuff. Comments there mention that this happens with VMware and VirtualBox as well, and cite info from Intel that this is a known but harmless issue:

https://bugs.launchpad.net/qemu/+bug/1307225/comments/9

So this doesn't seem to be qemu/kvm specific, but rather an 'issue' with your particular hardware. The one possible workaround is to do what the LP bug mentions FreeBSD did: suppress these messages in the Linux kernel. If you care about that, you should probably take the suggestion upstream.

But I think this is CANTFIX for qemu, so closing. Please reopen if I've missed something.
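
For anyone checking whether their own host is in the affected CPU family, a rough sketch follows. The mcelog output in the description reports "Family 6 Model 60", which is a Haswell client part. The set of model numbers below is my own assumption about which family 6 models are Haswell, the sample text is a hypothetical /proc/cpuinfo excerpt rather than the attached file, and the helper names are made up.

```python
# Family 6 model numbers assumed here to be Haswell parts (an assumption of this sketch).
HASWELL_MODELS = {60, 63, 69, 70}


def parse_first_cpu(cpuinfo_text):
    """Return the key/value fields of the first processor block in /proc/cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            break  # a blank line ends the first processor block
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_haswell(fields):
    """True if the parsed fields describe an Intel family 6 CPU with a Haswell model number."""
    return (
        fields.get("vendor_id") == "GenuineIntel"
        and fields.get("cpu family") == "6"
        and int(fields.get("model", "-1")) in HASWELL_MODELS
    )


# Hypothetical excerpt; on a real host, read the text from /proc/cpuinfo instead.
sample = "processor\t: 0\nvendor_id\t: GenuineIntel\ncpu family\t: 6\nmodel\t\t: 60\n"
print("Haswell:", is_haswell(parse_first_cpu(sample)))
```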
