Bug 227365

Summary: diskdumpmsg fails with dumps from large memory systems
Product: Red Hat Enterprise Linux 4
Reporter: Bryn M. Reeves <bmr>
Component: diskdumputils
Assignee: Takao Indoh <tindoh>
Status: CLOSED ERRATA
QA Contact:
Severity: medium
Docs Contact:
Priority: high
Version: 4.4
CC: jhrozek, ktokunag, lwang, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0717
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-11-15 15:59:02 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 222397, 234251    
Attachments:
Description Flags
diskdumpmsg.diff none

Description Bryn M. Reeves 2007-02-05 16:47:45 UTC
Description of problem:
diskdumpmsg emits a backtrace when run on a vmcore from a system with a large
amount of RAM (reported on an ia64 with ~2 TB):

Traceback (most recent call last):
 File "/sbin/diskdumpmsg", line 916, in ?
   vmcore = Vmcore.generate(vmcorefile)
 File "/sbin/diskdumpmsg", line 306, in generate
   return subclass(vmcore, map)
 File "/sbin/diskdumpmsg", line 621, in __init__
   self.memory_dump(self.datafilename)
 File "/sbin/diskdumpmsg", line 678, in memory_dump
   page_desc_raw = self.fd.read(pd_size * self.header.max_mapnr)
OverflowError: requested number of bytes is more than a Python string can hold

Version-Release number of selected component (if applicable):

That self.fd.read call seems to be trying to slurp up the entire set of page
descriptors in one read. When max_mapnr exceeds 178956970, this amounts to more
than 4 GB and fails with "requested number of bytes is more than a Python string
can hold".

How reproducible:
100%

Steps to Reproduce:
1. Generate a vmcore from a machine with several TB of memory
2. Attempt to process the core with diskdumpmsg
  
Actual results:
Backtrace listed above.

Expected results:
diskdumpmsg reads the core correctly.

Additional info:

Looks like that one big read should be split up to pull the descriptors in one
at a time or in small groups.
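
A minimal sketch of what such a split read could look like. This is not the
attached patch: the function name iter_page_descs and the batch size
CHUNK_DESCS are hypothetical, the 24-byte descriptor size is the figure quoted
later in this report, and it is written for a current Python rather than the
interpreter shipped with RHEL 4.

```python
import io

PD_SIZE = 24        # bytes per page descriptor (figure quoted in this report)
CHUNK_DESCS = 4096  # hypothetical batch size; any modest value would do


def iter_page_descs(fd, max_mapnr, pd_size=PD_SIZE, chunk_descs=CHUNK_DESCS):
    """Yield raw page descriptors one at a time, reading them in small batches.

    No single read() asks for more than pd_size * chunk_descs bytes, so the
    request size no longer grows with max_mapnr.
    """
    remaining = max_mapnr
    while remaining > 0:
        count = min(remaining, chunk_descs)
        raw = fd.read(pd_size * count)
        if len(raw) != pd_size * count:
            raise IOError("short read while loading page descriptors")
        # Slice the batch back into individual fixed-size descriptors.
        for off in range(0, len(raw), pd_size):
            yield raw[off:off + pd_size]
        remaining -= count


# Usage with an in-memory stand-in for the dump file: 10 fake descriptors.
fake_dump = io.BytesIO(bytes(range(10 * PD_SIZE)))
descs = list(iter_page_descs(fake_dump, max_mapnr=10, chunk_descs=3))
```

With a batch size of 4096 each read requests under 100 KB, regardless of how
much memory the dumped system had.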

Comment 1 Nobuhiro Tachino 2007-02-14 16:51:09 UTC
Created attachment 148066 [details]
diskdumpmsg.diff

This patch fixes the problem.

Comment 2 Nobuhiro Tachino 2007-02-21 16:03:43 UTC
The patch was merged in diskdumputils v1.3.27.


Comment 3 RHEL Program Management 2007-05-09 07:56:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Dave Anderson 2007-07-13 19:02:41 UTC
MODIFIED -- CVS Tag: diskdumputils-1_4_0-1



Comment 6 Dave Anderson 2007-07-17 18:05:55 UTC
The diskdumputils package has been re-spun -- CVS Tag: diskdumputils-1_4_1-1

Please post QA results here.  I will transfer the test results
to the errata's QA report, and then set this bugzilla to VERIFIED
via the errata interface.

Comment 7 RHEL Program Management 2007-09-13 19:50:49 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 10 Dave Anderson 2007-09-28 19:48:37 UTC
Reminder -- we are still awaiting QA results from Fujitsu for this bugzilla.

Comment 11 Dave Anderson 2007-10-01 15:12:39 UTC
The diskdumputils package has been re-spun -- CVS Tag: diskdumputils-1_4_1-2

Please post QA results here.  I will transfer the test results
to the errata's QA report, and then set this bugzilla to VERIFIED
via the errata interface.

Comment 12 Dave Anderson 2007-10-03 15:57:51 UTC
Nobuhiro Tachino (ntachino) and Akira Imamura 
(aimamura) are no longer here at the Westford 
facility working as embedded engineers for Fujitsu, and 
therefore cannot complete the QA for this bugzilla's 
RHEL4-6 errata.

For that reason, it is essential that the Issue Tracker
REPORTER test this issue, and report back the results
to this bugzilla.  

If the QA is successful, I (as the proxy maintainer) will
transfer the test results to the errata's QA report, and 
then set this bugzilla to VERIFIED via the errata interface.  

Thanks,
  Dave Anderson

Comment 15 Dave Anderson 2007-10-04 15:54:09 UTC
Given that mmatsuya has requested in IT #112576 that Fujitsu
perform the QA for this bugzilla:

> Event posted 10-03-2007 09:13pm by mmatsuya 	
> Hi Indo-san,
>
> Have you already been in Westford?
> Can anyone in Fujitsu re-test with the new packages diskdumputils-1.4.1-2?

I have set this bugzilla's NEEDINFO to our remaining in-house Fujitsu
representative, ktokunag, as he has offered off-line to
help move things along.




Comment 18 Dave Anderson 2007-10-08 14:07:09 UTC
Based upon the last two comments in IT #112576, I'm changing
the NEEDINFO from ktokunag to mmatsuya:

> Event posted 10-05-2007 03:22am by L3support_kernel 	
> Hi matsuya-san,
>
> I take charge of this issue instead of Indoh.
> I will confirm diskdumputils-1.4.1-2 by Oct 12th.
>
> Kazuhiro Yoshida

> Event posted 10-05-2007 02:47am by mmatsuya 	
> Please discuss who in Fujitsu will test this issue with diskdumputils-1.4.1-2
> without confusion. onsite team in Westford or anyone in Fujitsu Japan.


Comment 20 Larry Troan 2007-10-11 17:59:52 UTC
Yoshida-san, have you confirmed that this is fixed?

Comment 24 Dave Anderson 2007-10-12 13:15:21 UTC
I'm not entirely clear on this, but as I understand it, the diskdumpmsg Python
script would fail only if the system had a contiguous range of physical memory
large enough to require a contiguous array of more than 178956970 page_desc_t
structures (at 24 bytes each), which would make the total array size greater
than 4GB.  That could not occur on a 64GB machine, given that it
would only have 4194304 total pages.  
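
The figures above can be checked with a few lines of arithmetic. This is only
a sketch: the 16 KB page size is an assumption inferred from 64GB / 4194304
pages, and the limit is taken to be 4GB (2**32 bytes) as described in this
report.

```python
PD_SIZE = 24          # bytes per page_desc_t, as stated above
LIMIT = 2 ** 32       # 4GB, the read size at which the failure is reported
PAGE_SIZE = 16 * 1024 # assumption: 16 KB pages, inferred from 64GB / 4194304

# Largest descriptor count whose array still fits below the limit.
max_ok = (LIMIT - 1) // PD_SIZE

# A 64GB machine: total pages and the size of its descriptor array.
pages_64g = 64 * 2 ** 30 // PAGE_SIZE
desc_bytes_64g = pages_64g * PD_SIZE

print(max_ok, pages_64g, desc_bytes_64g)
```

Under these assumptions the descriptor array for a 64GB machine is about 96 MB,
far below the limit.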

That being said, the code has been restructured so that the page_desc_t
reads are broken up, so it would be impossible to see the same failure
with the new diskdumpmsg.

So I'm not sure what is the best way to continue.

I've tried reassigning this bugzilla to our new embedded Fujitsu engineer,
Takao Indoh, who will be the diskdump maintainer in the future, and he is
familiar with this issue, but his email address (tindoh) is not
"known" by the bugzilla system yet.

Kei, can you check with Takao and ask him for his suggestion on how
best to proceed?




Comment 25 Keiichiro Tokunaga 2007-10-12 18:27:47 UTC
Per the discussion with Takao, it should occur on a machine with 256GB or 
more of memory.  We have a PRIMEQUEST with 512GB of memory in the lab that is 
waiting for power, so we will try to find a way to use it for the QA.

Comment 26 Keiichiro Tokunaga 2007-10-15 20:36:45 UTC
I finally set up the machine with 512GB of memory and installed the latest 
RHEL4.6 (re20071011.0) on it. I first confirmed that the issue is fixed in 
diskdumputils-1.4.1-2.

Then I installed the old version of diskdumputils (1.3.25), which does not 
have the fix, and confirmed that the issue reproduces with it.

Comment 31 errata-xmlrpc 2007-11-15 15:59:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0717.html