507026 – mcelog: warning: record length longer than expected. Consider update.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mcelog
Sub Component:
Version:	13
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jon Masters
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	508797 509950 517400
Depends On:
Blocks:	F12Target 532983

Reported:	2009-06-19 21:38 UTC by Tom London
Modified:	2011-06-27 14:15 UTC
CC List:	22 users
Fixed In Version:
Clone Of:
Clones:	532983
Environment:
Last Closed:	2011-06-27 14:15:02 UTC
Type:	---
Embargoed:
Dependent Products:

Description Tom London 2009-06-19 21:38:49 UTC

Description of problem:
I'm getting the above error from cron.hourly:

From root.com  Fri Jun 19 09:01:02 2009
Return-Path: <root.com>
Date: Fri, 19 Jun 2009 09:01:01 -0700
From: root.com (Cron Daemon)
To: root.com
Subject: Cron <root@tlondon> run-parts /etc/cron.hourly
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
X-Cron-Env: <SHELL=/bin/bash>
X-Cron-Env: <PATH=/sbin:/bin:/usr/sbin:/usr/bin>
X-Cron-Env: <MAILTO=root>
X-Cron-Env: <HOME=/>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
Status: RO

/etc/cron.hourly/mcelog.cron:

mcelog: warning: record length longer than expected. Consider update.

& 

Hadn't seen this before.

Here is a copy of my /var/log/mcelog:

[root@tlondon log]# cat /var/log/mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 153abbd64e 
STATUS 88490180 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC d6f60d181 
STATUS 88400180 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 1daf43d2da 
STATUS 88400180 MCGSTATUS 0
[root@tlondon log]# 

Is this right?

System is Lenovo Thinkpad X200, x86_64.

Version-Release number of selected component (if applicable):
mcelog-0.7-3.fc11.x86_64

Comment 1 Tom London 2009-06-19 23:08:40 UTC

[root@tlondon log]# mcelog --dmi
mcelog: warning: record length longer than expected. Consider update.
[root@tlondon log]#

I'm running kernel-2.6.31-0.11.rc0.git13.fc12.x86_64

Comment 2 Tom London 2009-06-25 17:37:11 UTC

I pulled the latest code from XXXXX and compiled it on my system, adding a bit more to the error output:

[root@tlondon mcelog-0.8pre]# ./mcelog
mcelog: warning: record length (88) longer than expected (72). Consider update.
[root@tlondon mcelog-0.8pre]# 

Possible there is something in newer kernels that adds 16 bytes to 'struct mce'?

Comment 3 Tom London 2009-06-25 17:51:03 UTC

OK. Just a bit more digging.

/usr/include/asm/mce.h defines 'struct mce' as:

struct mce {
        __u64 status;
        __u64 misc;
        __u64 addr;
        __u64 mcgstatus;
        __u64 ip;
        __u64 tsc;      /* cpu time stamp counter */
        __u64 time;     /* wall time_t when error was detected */
        __u8  cpuvendor;        /* cpu vendor as encoded in system.h */
        __u8  pad1;
        __u16 pad2;
        __u32 cpuid;    /* CPUID 1 EAX */
        __u8  cs;               /* code segment */
        __u8  bank;     /* machine check bank */
        __u8  cpu;      /* cpu number; obsolete; use extcpu now */
        __u8  finished;   /* entry is valid */
        __u32 extcpu;   /* linux cpu number that detected the error */
        __u32 socketid; /* CPU socket ID */
        __u32 apicid;   /* CPU initial apic ID */
        __u64 mcgcap;   /* MCGCAP MSR: machine check capabilities of CPU */
};

Where mcelog/mcelog.h defines it thusly:

struct mce {
        __u64 status;
        __u64 misc;
        __u64 addr;
        __u64 mcgstatus;
        __u64 rip;
        __u64 tsc;      /* cpu time stamp counter */
        __u64 res1;     /* for future extension */
        __u64 res2;     /* dito. */
        __u8  cs;               /* code segment */
        __u8  bank;     /* machine check bank */
        __u8  cpu;      /* cpu that raised the error */
        __u8  finished;   /* entry is valid */
        __u32 pad;
};

So, the __u32 'pad' is replaced with the __u32 'extcpu', and the structure is extended with a __u32 'socketid', a __u32 'apicid' and a __u64 'mcgcap'.

The extra fields add 2*4 + 8 = 16 bytes.
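
The size arithmetic above can be checked mechanically. The following is a minimal sketch, assuming the two layouts are exactly as quoted in this comment; the names mce_userspace, mce_kernel, extra_record_bytes and print_record_sizes are made up for illustration and do not appear in mcelog or the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Layout quoted from mcelog/mcelog.h (struct renamed for this sketch). */
struct mce_userspace {
    uint64_t status;
    uint64_t misc;
    uint64_t addr;
    uint64_t mcgstatus;
    uint64_t rip;
    uint64_t tsc;
    uint64_t res1;
    uint64_t res2;
    uint8_t  cs;
    uint8_t  bank;
    uint8_t  cpu;
    uint8_t  finished;
    uint32_t pad;
};

/* Layout quoted from /usr/include/asm/mce.h (struct renamed for this sketch). */
struct mce_kernel {
    uint64_t status;
    uint64_t misc;
    uint64_t addr;
    uint64_t mcgstatus;
    uint64_t ip;
    uint64_t tsc;
    uint64_t time;
    uint8_t  cpuvendor;
    uint8_t  pad1;
    uint16_t pad2;
    uint32_t cpuid;
    uint8_t  cs;
    uint8_t  bank;
    uint8_t  cpu;
    uint8_t  finished;
    uint32_t extcpu;
    uint32_t socketid;
    uint32_t apicid;
    uint64_t mcgcap;
};

/* Bytes the kernel record carries beyond what the userspace struct expects. */
size_t extra_record_bytes(void)
{
    assert(sizeof(struct mce_kernel) >= sizeof(struct mce_userspace));
    return sizeof(struct mce_kernel) - sizeof(struct mce_userspace);
}

/* Print both sizes and their difference. */
void print_record_sizes(void)
{
    printf("userspace struct: %zu bytes\n", sizeof(struct mce_userspace));
    printf("kernel struct:    %zu bytes\n", sizeof(struct mce_kernel));
    printf("extra bytes:      %zu\n", extra_record_bytes());
}
```

Calling print_record_sizes() from a small main() on x86_64 should report 72 and 88 bytes, matching the lengths in the warning from Comment 2.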

Comment 4 Karl Mueller 2009-06-26 22:05:16 UTC

I've patched mcelog with the updated kernel header on my new-kernel boxes.  In addition to changing mcelog.h, I also had to replace the "rip" structure entries with "ip".  (I believe they are the same, just a name change.)

It seems to work, and not produce errors when run now.  I haven't seen an MCE event to confirm it is successfully detecting those.

Comment 5 Yanko Kaneti 2009-06-29 22:20:31 UTC

*** Bug 508797 has been marked as a duplicate of this bug. ***

Comment 6 Yanko Kaneti 2009-07-07 06:22:29 UTC

*** Bug 509950 has been marked as a duplicate of this bug. ***

Comment 7 Tom London 2009-07-14 19:17:21 UTC

This kernel structure appears to have changed again with kernel-2.6.31-0.67.rc2.git9.fc12.x86_64.

Below is a (not completely clean) hack that seems to quiet mcelog.

I basically copied the structure declaration from /usr/include/asm/mce.h, added support for '__u16', and renamed 'ip' to 'rip'.


--- mcelog.h.orig	2009-06-25 12:04:39.000000000 -0700
+++ mcelog.h	2009-07-14 07:47:45.748420922 -0700
@@ -1,15 +1,18 @@
 
 typedef unsigned long long u64;
 typedef unsigned int u32;
+typedef unsigned short u16;
 typedef unsigned char u8;
 
 #define __u64 u64
 #define __u32 u32
+#define __u16 u16
 #define __u8  u8
 
 /* kernel structure: */
 
 /* Fields are zero when not available */
+#if 0
 struct mce {
 	__u64 status;
 	__u64 misc;
@@ -23,8 +26,34 @@
 	__u8  bank;	/* machine check bank */
 	__u8  cpu;	/* cpu that raised the error */
 	__u8  finished;   /* entry is valid */
-	__u32 pad;   
+	__u32 extcpu;   /* linux cpu number that detected the error */
+	__u32 socketid; /* CPU socket ID */
+	__u32 apicid;   /* CPU initial apic ID */
+	__u64 mcgcap;   /* MCGCAP MSR: machine check capabilities of CPU */
 };
+#else
+struct mce {
+        __u64 status;
+        __u64 misc;
+        __u64 addr;
+        __u64 mcgstatus;
+        __u64 rip;
+        __u64 tsc;      /* cpu time stamp counter */
+        __u64 time;     /* wall time_t when error was detected */
+        __u8  cpuvendor;        /* cpu vendor as encoded in system.h */
+        __u8  pad1;
+        __u16 pad2;
+        __u32 cpuid;    /* CPUID 1 EAX */
+        __u8  cs;               /* code segment */
+        __u8  bank;     /* machine check bank */
+        __u8  cpu;      /* cpu number; obsolete; use extcpu now */
+        __u8  finished;   /* entry is valid */
+        __u32 extcpu;   /* linux cpu number that detected the error */
+        __u32 socketid; /* CPU socket ID */
+        __u32 apicid;   /* CPU initial apic ID */
+        __u64 mcgcap;   /* MCGCAP MSR: machine check capabilities of CPU */
+};
+#endif
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */

Comment 8 Karl Mueller 2009-07-14 22:03:00 UTC

I have upgraded the mcelog version to 0.9 and patched it with the new kernel info, if anybody wants to use it:

http://xney.com/RPM/mcelog-0.9-1.src.rpm

This compiles cleanly and seems to work.  I haven't had an MCE error yet to confirm that it detects it cleanly.

The patch file is included as %Source99 and does two things: it updates the mce structure to the latest version, and it renames the structure member in the source code (mce->rip to mce->ip).

Comment 9 Frank Murphy 2009-07-29 22:15:08 UTC

Will your patch be added officially?

Comment 10 Mike Chambers 2009-08-05 21:40:10 UTC

Any more info or a fix on the way for these error messages?

Comment 11 Orion Poplawski 2009-08-14 21:49:16 UTC

*** Bug 517400 has been marked as a duplicate of this bug. ***

Comment 12 Jan "Yenya" Kasprzak 2009-10-05 07:07:45 UTC

The Cc: list of this bug is happily growing, but there is no feedback from the mcelog package maintainer(s), even though the patch is known. Is this package still being maintained?

Comment 13 Orion Poplawski 2009-10-05 15:15:58 UTC

I've checked in and built 0.9pre1 patched with the above patch to handle the kernel changes.  Will be requesting rel-eng to include in F-12.

http://koji.fedoraproject.org/koji/taskinfo?taskID=1728442

Comment 14 Jon Masters 2009-10-22 16:51:56 UTC

Just an update. I've spoken with Andi Kleen several times about mcelog recently. I should have noted it here. The package is getting an upstream overhaul so I put off pushing an update due to that...but it's taking a while for that, so maybe we need an interim update. I will look at this again.

Comment 15 udo 2009-10-25 07:32:56 UTC

So f11 will not be fixed?

Comment 16 Jon Masters 2009-12-01 06:37:03 UTC

Well, Andi hasn't emailed me recently. I will try to figure out what's going on.

Comment 17 Jan "Yenya" Kasprzak 2010-01-22 11:36:18 UTC

Anything new? The clone of this bug for RHEL (#532983) has already been closed with CURRENTRELEASE.

Comment 18 Bug Zapper 2010-03-15 12:39:42 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 19 Jon Masters 2010-03-16 09:47:22 UTC

The problem here is that upstream had dried up for basically years. There is some new work happening there now, but I had held off pending an outcome on whether it would live on or change completely. That was totally not the best solution and I apologize. I will pull the same struct size fix back into F11 and release an updated package, and similarly update rawhide and F13.

But then I need to catch up on the latest upstream progress. As I mentioned on your blog, Yenya, there have been various plans for mce logging recently, including using perf to capture MCE events. So in the future something quite different is likely to happen. I have spoken with various folks upstream about mce logging over the past few months but have not adequately kept you in the loop, and for that I do apologize. Still, you are completely right that this package has not had the right updates and that will be fixed immediately.

Jon.

Comment 20 Jan "Yenya" Kasprzak 2010-03-17 07:30:13 UTC

mcelog-0.7-4.fc11.x86_64, downloaded yesterday from Koji, apparently fixes the problem - I installed it on three servers yesterday and have not seen the "mcelog length" error mail from cron since then. Running mcelog from the command line does not report any error either.

Jon, thanks for fixing this!

Comment 21 Michal Jaegermann 2010-03-20 21:11:34 UTC

(In reply to comment #20)
> mcelog-0.7-4.fc11.x86_64 downloaded yesterday from Koji apparently fixes the
> problem

Say it again?  Precisely after installing mcelog-0.7-4.fc11.x86_64 from the latest updates, and only after that, I am getting every hour in logs:

mcelog: warning: 18446744073709551600 bytes ignored in each record
mcelog: consider an update

where 18446744073709551600 is 0xfffffffffffffff0 in hex.  This is with kernel-2.6.30.10-105.2.23.fc11.x86_64, i.e. the latest one on F11.  Granted,
"... bytes ignored" is a different message from the original "... longer than expected", which I somehow did not see.

mcelog-0.9pre1-0.1.fc12.x86_64 on F12 is quiet but it is dated "Mon 05 Oct 2009".
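
The value 0xfffffffffffffff0 is what -16 looks like as an unsigned 64-bit number. The sketch below shows that arithmetic. It rests on two assumptions that this report does not confirm: that the F11 kernel still hands out 72-byte records while the updated mcelog expects 88, and that mcelog computes the difference in an unsigned type. The function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Unsigned subtraction wraps modulo 2^64, so a record that is SHORTER
 * than expected yields a huge "ignored bytes" count instead of a
 * negative number.
 */
uint64_t ignored_bytes_unsigned(uint64_t record_len, uint64_t expected_len)
{
    return record_len - expected_len;
}

/* The same difference computed with signed arithmetic. */
int64_t length_difference_signed(uint64_t record_len, uint64_t expected_len)
{
    assert(record_len <= INT64_MAX && expected_len <= INT64_MAX);
    return (int64_t)record_len - (int64_t)expected_len;
}

/* Print the difference both ways for comparison. */
void print_length_difference(uint64_t record_len, uint64_t expected_len)
{
    uint64_t wrapped = ignored_bytes_unsigned(record_len, expected_len);

    printf("unsigned: %" PRIu64 " (0x%" PRIx64 ")\n", wrapped, wrapped);
    printf("signed:   %" PRId64 "\n",
           length_difference_signed(record_len, expected_len));
}
```

With those assumed lengths, print_length_difference(72, 88) shows the wrapped value 18446744073709551600 next to the signed difference of -16.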

Comment 22 Michal Jaegermann 2010-03-20 21:20:58 UTC

(In reply to comment #21)

> ... I am getting every hour in logs:

Sorry!  Just to avoid confusion: not in logs but in mail from cron.

Comment 23 JM 2010-03-21 14:17:11 UTC

Same here; this version now produces the following message every hour:

-----------------------
/etc/cron.hourly/mcelog.cron 
mcelog: warning: 18446744073709551600 bytes ignored in each record
mcelog: consider an update
-----------------------

Comment 24 Mike Chambers 2010-03-21 14:23:19 UTC

Might need to change the version of Fedora to something else, or at least I don't see the problem on an F13 install.

Comment 25 Orion Poplawski 2010-03-23 14:53:09 UTC

The problem is that this update was pushed to Fedora 11, which runs an older kernel that the update is not compatible with.
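
One way a decoder could tolerate a record-length mismatch in either direction is to copy only the bytes both sides know about and zero the rest. This is a hypothetical sketch, not mcelog's actual code; copy_record and its parameters are invented for illustration. Zero-filling follows the "Fields are zero when not available" note in mcelog.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy one raw record of record_len bytes into a decoder-side buffer of
 * expected_len bytes. Returns the number of trailing record bytes that
 * were ignored (0 when the record fits).
 */
size_t copy_record(void *dst, size_t expected_len,
                   const void *src, size_t record_len)
{
    size_t n = record_len < expected_len ? record_len : expected_len;

    assert(dst != NULL && src != NULL);
    memset(dst, 0, expected_len); /* fields absent from a short record read as zero */
    memcpy(dst, src, n);          /* copy only the part both sides agree on */
    return record_len - n;        /* cannot wrap, because n <= record_len */
}
```

An 88-byte record copied into a 72-byte structure reports 16 ignored bytes, while a 72-byte record copied into an 88-byte structure reports 0 and leaves the newer fields zeroed.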

Comment 26 clive darra 2010-08-18 13:29:24 UTC

DOWNGRADE WORKS
# rpm -q mcelog
mcelog-0.9pre-1.27.el5

! DOESNT WORK !
# rpm -q mcelog
mcelog-0.9pre-1.29.el5

UPGRADE WORKS
# mcelog --version
mcelog 1.0pre

Comment 27 Jon Masters 2010-08-18 19:29:16 UTC

Can you articulate what it is that you need?

Comment 28 clive darra 2010-08-19 08:15:36 UTC

I'd like Red Hat to either roll back to mcelog-0.9pre-1.27.el5 or roll forward to mcelog 1.0pre, so that I don't get thousands of error emails every time mcelog (1.29) is run on my 64-bit RHEL systems!

Comment 29 Jon Masters 2010-08-19 09:42:52 UTC

64-bit RHEL systems? This bug relates to Fedora.

Comment 30 clive darra 2010-08-20 11:42:15 UTC

Sorry! I've just created a new RHEL bug for you:
https://bugzilla.redhat.com/show_bug.cgi?id=625761

Comment 31 Jon Masters 2010-08-20 18:41:52 UTC

Thank you.

Comment 32 Bug Zapper 2011-06-02 18:00:07 UTC

This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 33 Bug Zapper 2011-06-27 14:15:02 UTC

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
