Bug 2223974

Summary: Hang in os::Linux::get_namespace_pid with jps command
Product: Red Hat Enterprise Linux 8
Reporter: Paulo Andrade <pandrade>
Component: java-11-openjdk
Assignee: Andrew John Hughes <ahughes>
Status: NEW ---
QA Contact: OpenJDK QA <java-qa>
Severity: low
Docs Contact:
Priority: unspecified
Version: ---
CC: myamazak
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Paulo Andrade 2023-07-19 12:51:17 UTC
User had a coredump of a process that was apparently hung in
os::Linux::get_namespace_pid. My previous analysis from almost
one year ago:

"""
  Thread #0 is waiting for thread #1 to finish.

  Thread 1 is in the fgetc call in the "for(;;)" loop below:

// Determine if the vmid is the parent pid for a child in a PID namespace.
// Return the namespace pid if so, otherwise -1.
int os::Linux::get_namespace_pid(int vmid) {
  char fname[24];
  int retpid = -1;

  snprintf(fname, sizeof(fname), "/proc/%d/status", vmid);
  FILE *fp = os::fopen(fname, "r");

  if (fp) {
    int pid, nspid;
    int ret;
    while (!feof(fp) && !ferror(fp)) {
      ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);
      if (ret == 1) {
        break;
      }
      if (ret == 2) {
        retpid = nspid;
        break;
      }
      for (;;) {
        int ch = fgetc(fp);
        if (ch == EOF || ch == (int)'\n') break;
      }
    }
    fclose(fp);
  }
  return retpid;
}
"""

  Suspected issues:

* errno value is 3:
  #define ESRCH            3      /* No such process */

* The fp flags have _IO_ERR_SEEN set:
  #define _IO_ERR_SEEN          0x0020

* The FILE* fp did open "/proc/8503/status"
  Maybe there was a race condition and that process had already exited.

* The fgetc call should be returning EOF, not hanging, so it might be
  some issue with the procfs file.

  I suspect it might be related to vm.swappiness=1 in /etc/sysctl.conf

  It would be useful to have a hung jps process present while the
sosreport is generated, as this could provide some extra data.

  It would also help to know where in the kernel it is hung, for
example from the output of:

for pid in $(pidof jps); do echo ==$pid==; cat /proc/$pid/stack; done

while it is hung.

  If related to vm.swappiness=1, the system should be in some low memory
condition. Experimenting with the default vm.swappiness=60 should settle this.

  This small program should reproduce the hang if it were a general
case, but the hang probably depends on some more complex condition...
"""
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/wait.h>

int
main(int argc, char *argv[])
{
  pid_t     pid;
  FILE     *fp;

  if ((pid = fork()) == -1) {
    perror("failed to fork");
    exit(1);
  }
  if (pid == 0) {
    printf("child: about to sleep 3\n");
    execl("/usr/bin/sleep", "sleep", "3", (char *)NULL);
    /* execl only returns on failure */
    perror("failed to start sleep");
    exit(1);
  }
  else {
    int ch, status;
    sleep(1);
    char path[256];
    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    if ((fp = fopen(path, "r")) == NULL) {
      perror("failed to open /proc/pid/status");
      exit(1);
    }
    printf("parent: opened %s\n", path);
    do {
      printf("parent: waiting for %d\n", pid);
      if (waitpid(pid, &status, WUNTRACED | WCONTINUED) == -1) {
        perror("failed to waitpid");
        exit(1);
      }
    } while (!WIFEXITED(status) && !WIFSIGNALED(status));
    printf("parent: process %d exited\n", pid);
    for (;;) {
      errno = 0;  /* so strerror() below reflects this fgetc call only */
      ch = fgetc(fp);
      printf("ch = %d, errno = %s, feof = %d, ferror = %d\n",
             ch, strerror(errno), feof(fp), ferror(fp));
      if (ch == EOF || feof(fp)) {
        break;
      }
      fputc(ch, stdout);
    }
  }
  return 0;
}
"""

  The user experienced the issue again.
  Besides vm.swappiness=1, the user also has several entries of the form:

$USER	hard	nofile	819200
$USER	soft	nofile	819200

in /etc/security/limits.conf for several different users.

  The user ran perf when the issue happened again, and the process is
indeed looping in the kernel and using too much CPU time.

  The fix should be mostly trivial in the JVM code: if fgetc returns EOF,
exit the main loop and return -1, instead of only breaking out of the
inner for loop and returning to the outer while loop.
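
  A minimal sketch of that control flow, written as standalone C so it
can be compiled and tested outside the JVM. This is not the actual
HotSpot patch: read_namespace_pid() and its path argument are
hypothetical names used for illustration, and plain fopen() stands in
for os::fopen(). The only behavioural change from the function quoted
above is that EOF (from either fscanf or fgetc) leaves the outer loop
directly instead of relying on the feof()/ferror() test.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for os::Linux::get_namespace_pid(), taking the
 * status file path directly.
 *
 * Returns the second value on the "NSpid:" line, or -1 if the file
 * cannot be opened, the line is missing, the line has only one value,
 * or the input ends (EOF or read error) before the line is found.
 */
int read_namespace_pid(const char *path) {
  FILE *fp = fopen(path, "r");
  if (fp == NULL) {
    return -1;
  }

  int retpid = -1;
  for (;;) {
    int pid, nspid;
    int ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);
    if (ret == 2) {
      retpid = nspid;
    }
    if (ret == 1 || ret == 2 || ret == EOF) {
      break;  /* NSpid line handled, or nothing more to read */
    }

    /* Not the NSpid line: discard the rest of it. */
    int ch;
    do {
      ch = fgetc(fp);
    } while (ch != EOF && ch != '\n');

    if (ch == EOF) {
      break;  /* EOF or error: leave the outer loop too */
    }
  }
  fclose(fp);
  return retpid;
}
```

  Calling read_namespace_pid("/proc/self/status") exercises it against a
live procfs file. Whether this change alone avoids the hang is an
assumption: the perf data shows the time being spent inside read() in
the kernel, so the sketch only removes the extra trip through the outer
loop after fgetc has already reported EOF.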

...
    78.34%     0.00%  jps              [unknown]                 [k] 0000000000000000
            |
            ---0
               |          
               |--75.76%--read
               |          |          
               |          |--45.28%--entry_SYSCALL_64_after_hwframe
...