Bug 835806 - Using a waitpid/WNOHANG as response to sigaction kills the process.
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assigned To: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Blocks: 1002711
Reported: 2012-06-27 03:30 EDT by bjorn.enestrom
Modified: 2015-09-18 09:55 EDT
Doc Type: Bug Fix
Last Closed: 2015-09-18 09:55:55 EDT
Type: Bug


Description bjorn.enestrom 2012-06-27 03:30:51 EDT
Details:
A program that runs batch jobs was redesigned to reap zombie children using pid = waitpid(-1, &status, WNOHANG). When waitpid() returns -1 with errno set to ECHILD, the code handles this, but the process is also forcibly killed by the OS, which gives error 10 (ECHILD) as the reason for the kill. This is very annoying.
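For reference, the three possible outcomes of a non-blocking waitpid() call can be seen in isolation. The following is a minimal sketch, not the reporter's code: the helper name errno_after_last_child is hypothetical, and it assumes the calling process has no other children. It reaps one child and then polls again, at which point waitpid() is expected to fail with ECHILD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork one short-lived child, reap it, then poll again.
 * Returns the errno of the second (failing) waitpid() call,
 * or 0 if anything unexpected happens along the way. */
int errno_after_last_child(void)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return 0;
    if (child == 0)
        _exit(0);                       /* child: terminate immediately */

    /* Blocking wait for that specific child: returns its pid on success. */
    if (waitpid(child, &status, 0) != child)
        return 0;
    assert(WIFEXITED(status));          /* the child ended via _exit() */

    /* Assumption: no other children exist, so the non-blocking poll
     * returns -1 and sets errno to ECHILD. A return of 0 would instead
     * mean "children exist, but none has terminated yet". */
    errno = 0;
    pid_t pid = waitpid(-1, &status, WNOHANG);
    return (pid == -1) ? errno : 0;
}
```

Calling errno_after_last_child() from a process without other children should yield ECHILD, which is error number 10 on Linux.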

Old code that drops zombie processes now and then:

static void child_handler(int sig) {
  int status;
  pid_t pid;

  if ((pid = wait(&status)) == -1)	// Some children are lost and left as zombies
  {
    // Write error to DB
  }
  else
  {
    // Write success to DB
  }
}

int start_one_batch(void)
{
  char runbuf[256], bnr[16], bid[16];
  int pid;

  // Check DB for run_str, bnr, bid. Return EMPTY if no job found.
 
  if ( (pid = fork ()) == 0 ){
    (void) nice ( 0 );
    (void) execlp ( run_str, run_str, bnr, bid, (char *) NULL );
    (void) execl ( "/bin/sh", "/bin/sh", "-c", runbuf, (char *) NULL );
    (void) exit ( 1 );
  } else if ( pid == -1 ) {
    // Log error
    return EMPTY;
  } else return OK;
}

static void alarm_handler(int sig){
  // Check DB for errors. Here: do nothing
}

int main ()
{
 int empty_w_queue,
     i;

 int wait_time    = 60*5;

  struct sigaction alarm_action, child_action;
 
  memset(&alarm_action, 0, sizeof(alarm_action));
  memset(&child_action, 0, sizeof(child_action));

  alarm_action.sa_handler = alarm_handler;
  sigaction(SIGALRM, &alarm_action, NULL);

  child_action.sa_handler = child_handler;
  sigaction(SIGCHLD, &child_action, NULL);

 while ( 1 ) {
 
   empty_w_queue = 0;
   i = 0;
   while ( !empty_w_queue ) {
     switch ( start_one_batch () ){
     case OK:
       i++;
       n_active++; 
       break;
     case EMPTY:
       empty_w_queue = 1;
       break;
     }
   }

  if ( (time(0) - prev) < wait_time ) {
    set_time = (prev + wait_time) - time(0);
  } else {
    set_time = wait_time;
    prev     = time(0);
  }
  (void) alarm  ( set_time <= 0 ? wait_time : set_time);
  (void) pause();
 }
}

=============================================== 
New child_handler that will get killed by OS:

…

static void child_handler(int sig) {
  int status;
  pid_t pid;

  while ((pid = waitpid(-1, &status, WNOHANG)) != 0) {
    if (pid == -1) {
      // Write error to DB
    }
    else
    {
      // Write success to DB
    }
  }

…
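Note that in the loop as posted, a return value of -1 does not end the loop, because the condition is != 0; once no children remain, every further call returns -1 again. Whether that is related to the reported kill is not established in this report. A commonly used variant loops only while waitpid() returns a positive pid. The sketch below illustrates that idiom under stated assumptions: reap_children, reaped_count and run_and_reap are hypothetical names, and the counter stands in for the DB logging, which would not be safe to do inside a signal handler anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped_count = 0;   /* hypothetical counter */

/* Reap every child that has already terminated, then return. */
static void reap_children(int sig)
{
    int saved_errno = errno;    /* waitpid() may overwrite errno */
    int status;

    (void)sig;
    /* > 0: a child was reaped. 0: children exist but none has exited yet.
     * -1: error, typically ECHILD. Both 0 and -1 end the loop. */
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped_count++;

    errno = saved_errno;
}

/* Hypothetical driver: start n children and wait until all are reaped. */
int run_and_reap(int n)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;
    int started = 0;

    assert(n >= 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    /* Keep SIGCHLD blocked except inside sigsuspend(), so no signal
     * can slip in between checking the counter and going to sleep. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    reaped_count = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1)
            break;
        if (pid == 0)
            _exit(0);           /* child: simulate a finished job */
        started++;
    }

    while (reaped_count < started)
        sigsuspend(&wait_mask); /* atomically unblock SIGCHLD and wait */

    sigprocmask(SIG_SETMASK, &old, NULL);
    return reaped_count;
}
```

Because several SIGCHLD signals can be merged into one delivery, the handler loops until nothing more is ready rather than reaping a single child per signal.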

=============================================== 
Alternate solution:

/************************************************************************/
/* Author: Anders Aslund                                                */
/* Descr: Simulate problem with zombie processes in MV batchmonitor     */
/*                                                                      */
/* Instr:                                                               */
/* Compile: gcc mbatchmon_sim.cpp                                       */
/* Run: ./a.out                                                         */
/*                                                                      */
/* Prompt will tell what PID is used - [PID] Waiting for SIGHUP          */
/* Remember PID and open another terminal                               */
/* Send SIGHUP with command (exchange PID with number from other        */
/* terminal): kill -s SIGHUP PID                                        */
/*                                                                      */
/* Useful command (lists processes spawned): ps -C a.out                */
/************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

// THIS WORKS - Double fork for each batchjob (can the DB/Log handle concurrent access?)
void batchJobExecuterWorkaround(int signo)
{
  pid_t pid, pid2;
  int status;
  printf("[%d] signal received \n",getpid());
  pid = fork();
  if(pid  ==  0)
  {
    printf("[%d] I m Child process \n",getpid());
    pid2 = fork();
    if (pid2 == 0)
    {
      printf("[%d] I m GrandChild process \n", getpid());
      // Start batchjob here
      sleep(rand()%15); // Simulate batchjob
      printf("[%d] Exit GrandChild\n",getpid());
      exit(0);
    }
    waitpid(pid2, &status, 0);
    printf("[%d] Exit Child\n",getpid());
    // Do work here - log to DB etc. A possible problem here is that the DB/Log can't handle concurrent access
    exit(0);
  }
  else if(pid > 0)
  {
    printf("[%d] I m Parent process \n",getpid());
    // Returns to wait for SIGHUP
  }
  else
  {
    printf("[%d] Fork Failed \n",getpid());
  }
}

int main ()
{
  // Simple hack to asynchronously start batchjobs - simulating scheduler
  signal(SIGHUP, batchJobExecuterWorkaround);
  signal(SIGCHLD, SIG_IGN); // Let Linux terminate w/o using signals - logging solved in batchjobexecutor

  // Simple workaround using sigaction - not perfect because you miss logging the zombies
/*  struct sigaction sa;
  sa.sa_handler = childHandler2;
  sa.sa_flags = SA_NOCLDWAIT;
  if (sigaction(SIGCHLD, &sa, NULL) == -1)
  {
     perror("sigaction");
     exit(1);
  }*/

  //
  while (1)
  {
    printf("[%d] Waiting for SIGHUP \n",getpid());
    sleep(2);
  }
  printf("[%d] Exit \n",getpid());
}
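The commented-out sigaction variant above leaves the struct uninitialized and refers to a handler that is not shown. As a hedged illustration of what SA_NOCLDWAIT does on Linux, the sketch below (hypothetical helper name wait_with_nocldwait, default handler instead of childHandler2) sets the flag with a zeroed struct and then tries to wait for a child. Since such children are not turned into zombies, the wait is expected to fail with ECHILD once the child has gone, which is why the exit status cannot be logged this way.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno produced by waitpid() when
 * SIGCHLD is configured with SA_NOCLDWAIT, or 0 if the wait succeeds
 * or the setup fails. */
int wait_with_nocldwait(void)
{
    struct sigaction sa, old;
    int status, err;
    pid_t child, r;

    memset(&sa, 0, sizeof sa);          /* never pass uninitialized fields */
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDWAIT;         /* children do not become zombies */
    if (sigaction(SIGCHLD, &sa, &old) == -1)
        return 0;

    child = fork();
    if (child == -1) {
        sigaction(SIGCHLD, &old, NULL);
        return 0;
    }
    if (child == 0)
        _exit(0);                       /* child: terminate immediately */
    assert(child > 0);

    /* Blocks until the child is gone, then reports that there is
     * nothing to collect: the status of the child is lost. */
    errno = 0;
    r = waitpid(child, &status, 0);
    err = (r == -1) ? errno : 0;

    sigaction(SIGCHLD, &old, NULL);     /* restore the previous disposition */
    return err;
}
```

Setting SIGCHLD to SIG_IGN, as main() above does, has the same effect on zombie creation under Linux.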

Reproducibility:
Happens very soon, usually within the first 20 batch jobs.

Steps to Reproduce:
1.	Feed the code some batch jobs by changing start_one_batch into

int start_one_batch(void)
{
  char runbuf[256], bnr[16], bid[16];
  int pid;

  if ((rand()%100) > 95) // Will start 10-30 jobs before stopping
    return EMPTY;
 
  if ( (pid = fork ()) == 0 ){
    (void) nice ( nice_value );
    sleep(rand()%15); // Simulate batchjob
    (void) exit ( 1 );
  } else if ( pid == -1 ) {
    // Log error
    return EMPTY;
  } else return OK;
}

2.	Run and watch stderr

Actual Results:
The OS will terminate the process somewhere along the line, indicating “<pid> terminated with error 10” to stderr.
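The message alone does not say whether 10 is an exit code or a signal number. On x86_64 Linux both readings are possible in principle: ECHILD is errno 10 and SIGUSR1 is signal 10. If the batch monitor is itself started by a parent process, that parent could tell the two apart from the wait status. A minimal sketch, assuming hypothetical helpers describe_status and child_status that are not part of the reported code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write a readable description of a wait status. */
int describe_status(int status, char *buf, size_t len)
{
    assert(buf != NULL);
    if (WIFEXITED(status))
        return snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    if (WIFSIGNALED(status))
        return snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    return snprintf(buf, len, "other status 0x%x", (unsigned)status);
}

/* Hypothetical helper: run a child that either raises `sig` (if nonzero)
 * or exits with `exit_code`, and return its raw wait status (-1 on error). */
int child_status(int exit_code, int sig)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (sig != 0)
            raise(sig);         /* terminates the child if the signal is fatal */
        _exit(exit_code);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

For example, describe_status(child_status(10, 0), buf, sizeof buf) fills buf with a description of a normal exit with code 10, while a child ended by a fatal signal is reported through WIFSIGNALED and WTERMSIG instead.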

Expected Results:
All batch jobs performed.

Additional Information:
This code has been ported from HP-UX to RHEL 6.2. It has also been rewritten to use sigaction() instead of signal().

Severity:
Medium. Workaround exists.

Security:
Not security sensitive
Comment 2 RHEL Product and Program Management 2012-12-14 03:26:01 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 5 RHEL Product and Program Management 2013-10-14 00:54:05 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 7 Chris Williams 2015-09-18 09:55:55 EDT
This Bugzilla has been reviewed by Red Hat and is not planned on being addressed in Red Hat Enterprise Linux 6 and therefore will be closed. If this bug is critical to production systems, please contact your Red Hat support representative and provide sufficient business justification.
