Bug 687936

Summary: Argument to exec occasionally incorrectly copied as NULL
Product: Red Hat Enterprise Linux 5
Reporter: Marc Milgram <mmilgram>
Component: kernel
Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high    
Version: 5.5   
Target Milestone: rc   
Target Release: ---   
Hardware: s390   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2011-03-29 14:14:34 UTC

Description Marc Milgram 2011-03-15 18:43:12 UTC
Description of problem:
The customer has a shell script wrapper around /bin/kill.  It is called about once a second with the arguments:
  killwrapper -0 <pid>

The call is used to test whether a given process is still running.
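
For readers unfamiliar with the signal-0 idiom, the check that "kill -0 <pid>" performs can be sketched as follows. This is a minimal Python illustration, not the customer's wrapper (which is a shell script and is not shown in this report); the function name is_running is made up for the example.

```python
import os


def is_running(pid: int) -> bool:
    """Return True if a process with this PID exists.

    Signal 0 performs the existence and permission checks of kill(2)
    without delivering any signal to the target.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no process with this PID
    except PermissionError:
        return True   # EPERM: the process exists but is owned by another user
    return True
```

Calling is_running(os.getpid()) returns True, since the calling process necessarily exists.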

We used a SystemTap script to record the command line of each exec call.

In two cases, the exec of the kill wrapper had the expected arguments, but when the wrapper then called exec for the real kill command, the command line recorded for the kill wrapper showed one of its arguments as NULL instead of the original value.

The kill wrapper does not modify its command line arguments.  It writes data to a file whenever its first argument is not the expected one.  In the observed cases, it did not write any data.
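
The wrapper behaviour described above can be sketched roughly as below. This is a hypothetical Python reconstruction for illustration only: the real wrapper is a shell script that is not attached to this bug, and the log path, the log format, and the names log_if_unexpected and main are assumptions. Only the "-0" flag and /bin/kill come from this report.

```python
import os
import time

EXPECTED_FIRST_ARG = "-0"          # from the report: killwrapper -0 <pid>
REAL_KILL = "/bin/kill"            # from the report
LOG_PATH = "/tmp/killwrapper.log"  # hypothetical; the real path is not given


def log_if_unexpected(argv, log_path=LOG_PATH):
    """Append argv to log_path when argv[1] is not the expected flag.

    Returns True if a log entry was written, False otherwise.
    """
    first = argv[1] if len(argv) > 1 else None
    if first == EXPECTED_FIRST_ARG:
        return False
    with open(log_path, "a") as log:
        log.write(f"{time.time():.0f} unexpected argv: {argv!r}\n")
    return True


def main(argv):
    """Check the arguments, then replace this process with the real kill."""
    log_if_unexpected(argv)
    # Arguments are passed through unmodified, as the report states.
    os.execv(REAL_KILL, [REAL_KILL] + list(argv[1:]))
```

A script built on this sketch would call main(sys.argv) as its last step.  The relevant point for this bug is that the check inspects the arguments the wrapper itself received, so an empty log means the check did not see an unexpected first argument.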

Version-Release number of selected component (if applicable):
kernel-2.6.18-194.3.1.el5.s390

How reproducible:
Difficult to reproduce.  It reproduces at the customer site once every 2 weeks to 2 months while running Oracle clustering.

Steps to Reproduce:
1. Run Oracle RAC clustering between several nodes
2. Keep it under heavy load for several weeks
  
Actual results:
Cluster nodes evicted

Expected results:
Cluster remains running

Additional info:
There is plenty of memory available.

Comment 5 Marc Milgram 2011-03-29 14:14:34 UTC
Reportedly, the machines in question did not have the problem with the -194 kernel, but do have it with the -194.3.1 kernel.  This may be a regression caused by the fix for BZ 545527.

This appears to have been fixed in the -238 kernel by the fix for BZ 627298.

*** This bug has been marked as a duplicate of bug 627298 ***