Description of problem:
With some vmcores, retrace-server appears to process them ok but does not add group read permissions on the vmcore. Anyone running 'retrace-server-interact <taskid> crash' is then unable to read the core, and crash fails. I am not sure if this is a regression or something that has been there for a long time. I think it only shows up with certain vmcores.

Example:
$ retrace-server-interact 556682995 crash
WARNING:root: 2014-03-15 10:39:13 Unable to list modules: crash exited with 1:
crash: /cores/retrace/tasks/556682995/crash/vmcore: Permission denied

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE  (dumpfile form)
  crash [OPTION]... [NAMELIST]             (live system form)

Enter "crash -h" for details.

If you want to execute the command manually, you can run
$ crash -i /cores/retrace/tasks/556682995/crashrc /cores/retrace/tasks/556682995/crash/vmcore /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.6.2.el6.x86_64/vmlinux

crash 7.0.1
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: /cores/retrace/tasks/556682995/crash/vmcore: Permission denied

Usage:
  crash [OPTION]... NAMELIST MEMORY-IMAGE  (dumpfile form)
  crash [OPTION]... [NAMELIST]             (live system form)

Enter "crash -h" for details.

$ ls -lh /cores/retrace/tasks/556682995/crash/vmcore
-rw-------. 1 retrace gss-eng-collab 2.9G Mar 14 09:30 /cores/retrace/tasks/556682995/crash/vmcore

Version-Release number of selected component (if applicable):
retrace-server-1.11-1.el6.noarch

How reproducible:
Unclear. I think it is only certain vmcores. Perhaps ones where the vmcore is contained in a tar or gz file and the final perms are more restrictive than they need to be, but this is just a guess.

Steps to Reproduce:
1. Submit vmcore to retrace-server.
2. vmcore completes processing ok.
3. 'retrace-server-interact <taskid> crash' or the crash tool fails because crash is unable to read the file.

Actual results:
crash: /cores/retrace/tasks/556682995/crash/vmcore: Permission denied

Expected results:
The vmcore in 'crash/vmcore' should have 'group' and 'other' read perms after extraction so crash can read it.

Additional info:
I have an example of one vmcore which fails and one that succeeds.
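For anyone triaging a similar report, the 'ls -lh' check above can be scripted. The following is only a rough sketch: the helper names read_classes and check_vmcore are made up for illustration and are not part of retrace-server.

```python
import os
import stat


def read_classes(mode):
    """Return the permission classes that have read access in a mode value."""
    bits = (("user", stat.S_IRUSR),
            ("group", stat.S_IRGRP),
            ("other", stat.S_IROTH))
    return [name for name, bit in bits if mode & bit]


def check_vmcore(path):
    """Print who can read a file, e.g. a task's crash/vmcore (hypothetical helper)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    readers = ", ".join(read_classes(mode)) or "nobody"
    print("%s: mode %s, readable by: %s" % (path, oct(mode), readers))
```

A file with mode -rw------- like the one above would show up as readable by 'user' only, which matches the Permission denied error for anyone other than the retrace user.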
Ok, I'm more convinced this is a regression now. I've been successfully getting vmcores from this one customer with no problems. Now the latest ones I submitted tonight are unreadable with the same problem.

$ for t in 370500522 373515740 154200319 347574662; do ls -lh /cores/retrace/tasks/$t/crash/vmcore; done
-rw-------. 1 retrace gss-eng-collab 474M Mar 15 04:13 /cores/retrace/tasks/370500522/crash/vmcore
-rw-------. 1 retrace gss-eng-collab 555M Mar 15 12:23 /cores/retrace/tasks/373515740/crash/vmcore
-rw-------. 1 retrace gss-eng-collab 531M Mar 15 07:55 /cores/retrace/tasks/154200319/crash/vmcore
-rw-------. 1 retrace gss-eng-collab 372M Mar 15 13:23 /cores/retrace/tasks/347574662/crash/vmcore
NOTE: For now, we have a workaround in place on our production system: a cron job that looks for vmcore files at /cores/retrace/tasks/<taskid>/crash/vmcore and runs 'chmod go+r' on them. Once this is fixed we'll remove the workaround.
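The actual cron job is not shown in this bug. A minimal sketch of what such a workaround could look like is below, assuming the task layout from the paths in this report. The function names are made up for illustration.

```python
import glob
import os
import stat

TASKS_DIR = "/cores/retrace/tasks"  # path taken from the report; adjust as needed


def add_group_other_read(path):
    """Equivalent of 'chmod go+r'. Returns True if the mode was changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    wanted = mode | stat.S_IRGRP | stat.S_IROTH
    if wanted == mode:
        return False
    os.chmod(path, wanted)
    return True


def fix_all(tasks_dir=TASKS_DIR):
    """Fix every <taskid>/crash/vmcore under tasks_dir and return the changed paths."""
    fixed = []
    for vmcore in glob.glob(os.path.join(tasks_dir, "*", "crash", "vmcore")):
        try:
            if add_group_other_read(vmcore):
                fixed.append(vmcore)
        except OSError:
            # File vanished or is not ours to change; a later run can retry.
            continue
    return fixed


if __name__ == "__main__":
    for path in fix_all():
        print("fixed %s" % path)
```

Run from cron as the retrace user, this only touches files that are missing the bits, so repeated runs are harmless.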
Ah, I think I found it. I think this is a problem only if we skip makedumpfile. This may have been introduced with the fix to skip over makedumpfile, https://bugzilla.redhat.com/show_bug.cgi?id=1067188

Latest code: /usr/lib/python2.6/site-packages/retrace/retrace.py

    skip_makedumpfile = CONFIG["VmcoreDumpLevel"] <= 0 or CONFIG["VmcoreDumpLevel"] >= 32
    if (dump_level is not None and
        (dump_level & CONFIG["VmcoreDumpLevel"]) == CONFIG["VmcoreDumpLevel"]):
        log_info("Stripping to %d would have no effect" % CONFIG["VmcoreDumpLevel"])
        skip_makedumpfile = True    <--- we don't do a chmod in this case

    if not skip_makedumpfile:
        log_debug("Executing makedumpfile")
        start = time.time()
        strip_vmcore(vmcore, kernelver)
        dur = int(time.time() - start)
        st = os.stat(vmcore)
        if (st.st_mode & stat.S_IRGRP) == 0:    <--- probably needs to be moved outside and below the 'if' statements so it always gets executed
            try:
                os.chmod(vmcore, st.st_mode | stat.S_IRGRP)    <--- here is the chmod; only done underneath 'if not skip_makedumpfile'
            except Exception as ex:
                log_warn("File '%s' is not group readable and chmod"
                         " failed. The process will continue but if"
                         " it fails this is the likely cause."
                         % vmcore)
        log_info("Stripped size: %s" % human_readable_size(st.st_size))
        log_info("Makedumpfile took %d seconds and saved %s"
                 % (dur, human_readable_size(oldsize - st.st_size)))

Now comparing with earlier code from https://bugzilla.redhat.com/show_bug.cgi?id=1067188#c0

    def download_remote(self, unpack=True, timeout=0, kernelver=None):
        """Downloads all remote resources and returns a list of errors."""
        ...
            if os.path.isfile(vmcore):
                oldsize = os.path.getsize(vmcore)
                log_info("Vmcore size: %s" % human_readable_size(oldsize))
                if CONFIG["VmcoreDumpLevel"] > 0 and CONFIG["VmcoreDumpLevel"] < 32:
                    log_debug("Executing makedumpfile")
                    start = time.time()
                    strip_vmcore(vmcore, kernelver)
                    dur = int(time.time() - start)
                    st = os.stat(vmcore)
                    os.chmod(vmcore, st.st_mode | stat.S_IRGRP)    <--- used to do a chmod here unconditionally
                    log_info("Stripped size: %s" % human_readable_size(st.st_size))
                    log_info("Makedumpfile took %d seconds and saved %s"
                             % (dur, human_readable_size(oldsize - st.st_size)))

It looks like download_remote has been refactored significantly though, perhaps for multiple bug fixes.

Actually, it looks like we always had a form of the bug. It just went unnoticed because our config file was set such that we always did a makedumpfile stripping, and the chmod came after that. Now that we have logic to skip makedumpfile, if the tarball was created with a vmcore without group read perms, it remains that way.

I'm not sure what a good fix is right now. We may just want to add similar chmod code to the non-stripped case, or perhaps better, put the chmod below the 'if' conditionals.
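To illustrate the second option (chmod below the conditionals), here is a rough standalone sketch. It is not the attached patch: ensure_group_readable and process_vmcore are hypothetical names, and the strip argument is just a placeholder for strip_vmcore(vmcore, kernelver).

```python
import os
import stat


def ensure_group_readable(path):
    """Add the group-read bit if it is missing. Returns False if chmod fails."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & stat.S_IRGRP:
        return True
    try:
        os.chmod(path, mode | stat.S_IRGRP)
        return True
    except OSError:
        # Mirrors the original intent: warn and carry on rather than abort.
        return False


def process_vmcore(vmcore, skip_makedumpfile, strip=lambda path: None):
    """Hypothetical stand-in for the tail of download_remote()."""
    if not skip_makedumpfile:
        strip(vmcore)  # placeholder for the real makedumpfile step
    # Outside the 'if', so skipping makedumpfile no longer skips the chmod.
    return ensure_group_readable(vmcore)
```

The point of the layout is that the permission fix no longer depends on which branch was taken, so a vmcore extracted from a tarball with restrictive perms is handled the same way as a stripped one.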
Created attachment 875615 [details] Patch to fix this bug, v1
(In reply to Dave Wysochanski from comment #6)
> Created attachment 875615 [details]
> Patch to fix this bug, v1

NOTE: This is completely untested but it's a first stab.
I just verified the patch in comment #7 fixes the bug.
I guess this is not fully fixed. Today someone produced another vmcore that had another permissions issue. This one ran makedumpfile, but makedumpfile saved 0 bytes. I tested the original vmcores in this bug and those work, so there must be another subtlety I missed for the case where makedumpfile is run.
Created attachment 888095 [details]
Patch to fix this bug on top of previous patch.

Move 'stat' and 'chmod' to the very end after all extraction and makedumpfile processing. Fix bug introduced where if makedumpfile ran it would always report 'saved 0 bytes' when it may have saved signfica
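The attachment itself is not reproduced here, and the exact cause of the 'saved 0 bytes' report is not stated in this bug. As a sketch only, one ordering that keeps the size accounting separate from the final permission fix could look like this. finalize_vmcore and its strip argument are hypothetical.

```python
import os
import stat


def finalize_vmcore(vmcore, strip=None):
    """Optionally strip the vmcore, then fix permissions last. Returns bytes saved."""
    oldsize = os.path.getsize(vmcore)  # measured before any stripping
    if strip is not None:
        strip(vmcore)                  # hypothetical stand-in for strip_vmcore()
    newsize = os.path.getsize(vmcore)  # measured after stripping, before chmod

    # Very last step: stat and chmod run no matter what happened above.
    mode = stat.S_IMODE(os.stat(vmcore).st_mode)
    if not mode & stat.S_IRGRP:
        os.chmod(vmcore, mode | stat.S_IRGRP)
    return oldsize - newsize
```

Taking both size readings around the strip step means the reported savings cannot be affected by whatever the later stat/chmod code does, and the chmod still covers the case where stripping leaves the file with restrictive perms.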
This looks fixed in retrace-server-1.11-4.el6.noarch