Description of problem:

I'm getting the above error from cron.hourly:

From root.com Fri Jun 19 09:01:02 2009
Return-Path: <root.com>
Date: Fri, 19 Jun 2009 09:01:01 -0700
From: root.com (Cron Daemon)
To: root.com
Subject: Cron <root@tlondon> run-parts /etc/cron.hourly
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
X-Cron-Env: <SHELL=/bin/bash>
X-Cron-Env: <PATH=/sbin:/bin:/usr/sbin:/usr/bin>
X-Cron-Env: <MAILTO=root>
X-Cron-Env: <HOME=/>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
Status: RO

/etc/cron.hourly/mcelog.cron:

mcelog: warning: record length longer than expected. Consider update.

&

I hadn't seen this before. Here is a copy of my /var/log/mcelog:

[root@tlondon log]# cat /var/log/mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 153abbd64e
STATUS 88490180 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC d6f60d181
STATUS 88400180 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 1daf43d2da
STATUS 88400180 MCGSTATUS 0
[root@tlondon log]#

Is this right? The system is a Lenovo ThinkPad X200, x86_64.

Version-Release number of selected component (if applicable):

mcelog-0.7-3.fc11.x86_64
[root@tlondon log]# mcelog --dmi
mcelog: warning: record length longer than expected. Consider update.
[root@tlondon log]#

I'm running kernel-2.6.31-0.11.rc0.git13.fc12.x86_64.
I pulled the latest code from XXXXX and compiled it on my system, adding a bit more detail to the error output:

[root@tlondon mcelog-0.8pre]# ./mcelog
mcelog: warning: record length (88) longer than expected (72). Consider update.
[root@tlondon mcelog-0.8pre]#

Is it possible that something in newer kernels adds 16 bytes to 'struct mce'?
OK. Just a bit more digging.

/usr/include/asm/mce.h defines 'struct mce' as:

struct mce {
    __u64 status;
    __u64 misc;
    __u64 addr;
    __u64 mcgstatus;
    __u64 ip;
    __u64 tsc;        /* cpu time stamp counter */
    __u64 time;       /* wall time_t when error was detected */
    __u8  cpuvendor;  /* cpu vendor as encoded in system.h */
    __u8  pad1;
    __u16 pad2;
    __u32 cpuid;      /* CPUID 1 EAX */
    __u8  cs;         /* code segment */
    __u8  bank;       /* machine check bank */
    __u8  cpu;        /* cpu number; obsolete; use extcpu now */
    __u8  finished;   /* entry is valid */
    __u32 extcpu;     /* linux cpu number that detected the error */
    __u32 socketid;   /* CPU socket ID */
    __u32 apicid;     /* CPU initial apic ID */
    __u64 mcgcap;     /* MCGCAP MSR: machine check capabilities of CPU */
};

Whereas mcelog/mcelog.h defines it as:

struct mce {
    __u64 status;
    __u64 misc;
    __u64 addr;
    __u64 mcgstatus;
    __u64 rip;
    __u64 tsc;        /* cpu time stamp counter */
    __u64 res1;       /* for future extension */
    __u64 res2;       /* dito. */
    __u8  cs;         /* code segment */
    __u8  bank;       /* machine check bank */
    __u8  cpu;        /* cpu that raised the error */
    __u8  finished;   /* entry is valid */
    __u32 pad;
};

So the __u32 'pad' is replaced with the __u32 'extcpu', and the structure is extended with a __u32 'socketid', a __u32 'apicid' and a __u64 'mcgcap'. The reserved fields 'res1' and 'res2' are now used for 'time' and for 'cpuvendor', 'pad1', 'pad2' and 'cpuid', which does not change the size. The extra fields add 2*4 + 8 = 16 bytes, which accounts for the difference between 72 and 88.
I've patched mcelog with the updated kernel header on my new-kernel boxes. In addition to changing mcelog.h, I also had to replace the "rip" structure entries with "ip". (I believe they are the same, just a name change.) It seems to work, and no longer produces errors when run. I haven't seen an MCE event yet, so I can't confirm that it still detects them.
*** Bug 508797 has been marked as a duplicate of this bug. ***
*** Bug 509950 has been marked as a duplicate of this bug. ***
This kernel structure appears to have changed again with kernel-2.6.31-0.67.rc2.git9.fc12.x86_64.

Below is a (not completely clean) hack that seems to quiet mcelog. I basically copied the structure declaration from /usr/include/asm/mce.h, added support for '__u16', and renamed 'ip' to 'rip'.

--- mcelog.h.orig 2009-06-25 12:04:39.000000000 -0700
+++ mcelog.h 2009-07-14 07:47:45.748420922 -0700
@@ -1,15 +1,18 @@
 typedef unsigned long long u64;
 typedef unsigned int u32;
+typedef unsigned short u16;
 typedef unsigned char u8;
 #define __u64 u64
 #define __u32 u32
+#define __u16 u16
 #define __u8 u8
 /* kernel structure: */
 /* Fields are zero when not available */
+#if 0
 struct mce {
     __u64 status;
     __u64 misc;
@@ -23,8 +26,34 @@
     __u8 bank; /* machine check bank */
     __u8 cpu; /* cpu that raised the error */
     __u8 finished; /* entry is valid */
-    __u32 pad;
+    __u32 extcpu; /* linux cpu number that detected the error */
+    __u32 socketid; /* CPU socket ID */
+    __u32 apicid; /* CPU initial apic ID */
+    __u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
 };
+#else
+struct mce {
+    __u64 status;
+    __u64 misc;
+    __u64 addr;
+    __u64 mcgstatus;
+    __u64 rip;
+    __u64 tsc; /* cpu time stamp counter */
+    __u64 time; /* wall time_t when error was detected */
+    __u8 cpuvendor; /* cpu vendor as encoded in system.h */
+    __u8 pad1;
+    __u16 pad2;
+    __u32 cpuid; /* CPUID 1 EAX */
+    __u8 cs; /* code segment */
+    __u8 bank; /* machine check bank */
+    __u8 cpu; /* cpu number; obsolete; use extcpu now */
+    __u8 finished; /* entry is valid */
+    __u32 extcpu; /* linux cpu number that detected the error */
+    __u32 socketid; /* CPU socket ID */
+    __u32 apicid; /* CPU initial apic ID */
+    __u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
+};
+#endif
 #define MCE_OVERFLOW 0 /* bit 0 in flags means overflow */
I have upgraded the mcelog version to 0.9 and patched it with the new kernel info, if anybody wants to use it: http://xney.com/RPM/mcelog-0.9-1.src.rpm This compiles cleanly and seems to work. I haven't had an MCE error yet to confirm that it detects it cleanly. The patch file is in there as %Source99 and does two things: updates the mce structure to the latest version, and changes the structure name in the source code (mce->rip to mce->ip)
Will your patch be added officially?
Any more info or a fix on the way for these error messages?
*** Bug 517400 has been marked as a duplicate of this bug. ***
The Cc: list of this bug is happily growing, but there is no feedback from the mcelog package maintainer(s), even though the patch is known. Is this package still being maintained?
I've checked in and built 0.9pre1 patched with the above patch to handle the kernel changes. Will be requesting rel-eng to include in F-12. http://koji.fedoraproject.org/koji/taskinfo?taskID=1728442
Just an update. I've spoken with Andi Kleen several times about mcelog recently. I should have noted it here. The package is getting an upstream overhaul so I put off pushing an update due to that...but it's taking a while for that, so maybe we need an interim update. I will look at this again.
So f11 will not be fixed?
Well, Andi hasn't emailed me recently. I will try to figure out what's going on.
Anything new? The clone of this bug for RHEL (#532983) has already been closed with CURRENTRELEASE.
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
The problem here is that upstream had dried up for basically years. There is some new work happening there now, but I had held off pending an outcome on whether it would live on or change completely. That was totally not the best solution and I apologize. I will pull the same struct size fix back into F11 and release an updated package, and similarly update rawhide and F13. But then I need to catch up on the latest upstream progress. As I mentioned on your blog, Yenya, there have been various plans for mce logging recently, including using perf to capture MCE events. So in the future something quite different is likely to happen. I have spoken with various folks upstream about mce logging over the past few months but have not adequately kept you in the loop, and for that I do apologize. Still, you are completely right that this package has not had the right updates and that will be fixed immediately. Jon.
mcelog-0.7-4.fc11.x86_64, downloaded yesterday from Koji, apparently fixes the problem. I installed it on three servers yesterday and have not seen the "mcelog length" error mail from cron since then. Running mcelog from the command line does not report any error either. Jon, thanks for fixing this!
(In reply to comment #20)
> mcelog-0.7-4.fc11.x86_64 downloaded yesterday from Koji apparently fixes the
> problem

Say it again? Precisely after installing mcelog-0.7-4.fc11.x86_64 from the latest updates, and only after that, I am getting the following every hour in logs:

mcelog: warning: 18446744073709551600 bytes ignored in each record
mcelog: consider an update

where 18446744073709551600 prints in hex as 0xfffffffffffffff0. This is with kernel-2.6.30.10-105.2.23.fc11.x86_64, i.e. the latest one on F11.

Granted, "... bytes ignored" is a different message from the original "... longer than expected", which somehow I did not see.

mcelog-0.9pre1-0.1.fc12.x86_64 on F12 is quiet, but it is dated "Mon 05 Oct 2009".
(In reply to comment #21)
> ... I am getting every hour in logs:

Sorry! Just to avoid confusion: not in logs, but in mail from cron.
Same here; this version now produces the following message every hour:

-----------------------
/etc/cron.hourly/mcelog.cron

mcelog: warning: 18446744073709551600 bytes ignored in each record
mcelog: consider an update
-----------------------
Might need to change the version of Fedora to something else; at least, I don't see the problem on an F13 install.
Problem is this update was pushed to Fedora 11, which runs an older kernel and is not compatible with it.
DOWNGRADE WORKS
# rpm -q mcelog
mcelog-0.9pre-1.27.el5

! DOESN'T WORK !
# rpm -q mcelog
mcelog-0.9pre-1.29.el5

UPGRADE WORKS
# mcelog --version
mcelog 1.0pre
Can you articulate what it is that you need?
I'd like Red Hat to either roll back to mcelog-0.9pre-1.27.el5 or roll forward to mcelog 1.0pre, so that I don't get thousands of error emails every time mcelog (1.29) is run on my 64-bit RHEL systems!
64-bit RHEL systems? This bug relates to Fedora.
Sorry! I've just created a new RHEL bug for you: https://bugzilla.redhat.com/show_bug.cgi?id=625761
Thank you.
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.