Bug 2223974 - Hang in os::Linux::get_namespace_pid with jps command
Summary: Hang in os::Linux::get_namespace_pid with jps command
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: java-11-openjdk
Version: ---
Hardware: All
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Andrew John Hughes
QA Contact: OpenJDK QA
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-19 12:51 UTC by Paulo Andrade
Modified: 2023-07-20 00:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-162735 0 None None None 2023-07-19 12:53:34 UTC

Description Paulo Andrade 2023-07-19 12:51:17 UTC
User had a coredump of a process that was apparently hung in
os::Linux::get_namespace_pid. My previous analysis from almost
one year ago:

"""
  Thread #0 is waiting for thread #1 to finish.

  Thread 1 is in the fgetc call in the "for(;;)" loop below:

// Determine if the vmid is the parent pid for a child in a PID namespace.
// Return the namespace pid if so, otherwise -1.
int os::Linux::get_namespace_pid(int vmid) {
  char fname[24];
  int retpid = -1;

  snprintf(fname, sizeof(fname), "/proc/%d/status", vmid);
  FILE *fp = os::fopen(fname, "r");

  if (fp) {
    int pid, nspid;
    int ret;
    while (!feof(fp) && !ferror(fp)) {
      ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);
      if (ret == 1) {
        break;
      }
      if (ret == 2) {
        retpid = nspid;
        break;
      }
      for (;;) {
        int ch = fgetc(fp);
        if (ch == EOF || ch == (int)'\n') break;
      }
    }
    fclose(fp);
  }
  return retpid;
}
"""

  Suspected issues:

* errno value is 3:
  #define ESRCH            3      /* No such process */

* The fp flags have _IO_ERR_SEEN set:
  #define _IO_ERR_SEEN          0x0020

* The FILE* fp was opened on "/proc/8503/status".
  Maybe there was a race condition and the target had already exited.

* The fgetc call should return EOF rather than hang, so this might be
  an issue with the procfs file.

  I suspect it might be related to vm.swappiness=1 in /etc/sysctl.conf

  It would be useful to have a hung jps process present while the
sosreport is generated, as this could provide some extra data.

  It is also desirable to know where in the kernel it is hung, for
example, have the output of:

for pid in $(pidof jps); do echo ==$pid==; cat /proc/$pid/stack; done

while it is hung.

  If related to vm.swappiness=1, the system should be in some low-memory
condition. Experimenting with the default vm.swappiness=60 should settle this.

  This small program would reproduce the hang if it were a general
case, but the trigger is probably some more complex condition...
"""
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/wait.h>

int
main(int argc, char *argv[])
{
  pid_t     pid;
  FILE     *fp;

  if ((pid = fork()) == -1) {
    perror("failed to fork");
    exit(1);
  }
  if (pid == 0) {
    printf("child: about to sleep 3\n");
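    /* The argument list must end with a null pointer of type char *. */
    execl("/usr/bin/sleep", "sleep", "3", (char *)NULL);
    /* execl only returns on failure. */
    perror("failed to start sleep");
    exit(1);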
  }
  else {
    int ch, status;
    sleep(1);
    char path[256];
    /* pid_t is an int on Linux; print it with %d and bound the write. */
    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    if ((fp = fopen(path, "r")) == NULL) {
      perror("failed to open /proc/pid/status");
      exit(1);
    }
    printf("parent: opened %s\n", path);
    do {
      printf("parent: waiting for %d\n", pid);
      if (waitpid(pid, &status,  WUNTRACED | WCONTINUED) == -1) {
	perror("failed to waitpid");
	exit(1);
      }
    } while (!WIFEXITED(status));
    printf("parent: process %d exited\n", pid);
    for (;;) {
      ch = fgetc(fp);
      printf("ch = %d, errno = %s, feof = %d, ferror = %d\n",
	     ch, strerror(errno), feof(fp), ferror(fp));
      if (ch == EOF || feof(fp)) {
	break;
      }
      fputc(ch, stdout);
    }
  }
  return 0;
}
"""

  User experienced the issue again.
  Besides vm.swappiness=1, user also has several entries in the pattern:

$USER	hard	nofile	819200
$USER	soft	nofile	819200

in /etc/security/limits.conf for several different users.
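
  To confirm which of these settings are actually in effect for a given
session, something like the sketch below could be run as the affected
user. It only reports values; report_settings and read_swappiness are
hypothetical helper names, not part of any existing tool.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: returns the current vm.swappiness value,
   or -1 if /proc/sys/vm/swappiness cannot be read or parsed. */
int read_swappiness(void)
{
  FILE *fp = fopen("/proc/sys/vm/swappiness", "r");
  int value = -1;

  if (fp == NULL)
    return -1;
  if (fscanf(fp, "%d", &value) != 1)
    value = -1;
  fclose(fp);
  return value;
}

/* Hypothetical helper: prints the soft and hard nofile limits of the
   calling process and the current vm.swappiness.
   Returns 0 on success, -1 if getrlimit fails. */
int report_settings(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    perror("getrlimit");
    return -1;
  }
  printf("nofile soft = %llu, hard = %llu\n",
         (unsigned long long)rl.rlim_cur,
         (unsigned long long)rl.rlim_max);
  printf("vm.swappiness = %d\n", read_swappiness());
  return 0;
}
```

  Calling report_settings() from a small main() under the same login
session as jps shows the limits that process would inherit.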

  The user ran perf when the issue happened again, and the process is
indeed looping in the kernel and using too much CPU time.

  The fix should be mostly trivial in the OpenJDK code: if EOF is
returned, exit the main loop and return -1, instead of just breaking
out of the for loop and returning to the main while loop.
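
  The change described above can be sketched in plain C as follows.
This is only an illustration of the idea, not the actual OpenJDK patch:
parse_nspid and get_namespace_pid_sketch are hypothetical names, and
the parsing is split out so it can be exercised on any FILE*.

```c
#include <stdio.h>

/* Hypothetical helper: scan a /proc/<pid>/status style stream for the
   "NSpid:" line. Returns the second (namespace) pid if present,
   otherwise -1. Any EOF or read error ends the scan immediately. */
int parse_nspid(FILE *fp)
{
  int retpid = -1;

  for (;;) {
    int pid, nspid, ch;
    int ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);

    if (ret == 2) {
      retpid = nspid;           /* process is in a PID namespace */
      break;
    }
    if (ret == 1 || ret == EOF)
      break;                    /* single pid, end of file, or read error */

    /* Not the NSpid line: skip the rest of it. */
    do {
      ch = fgetc(fp);
    } while (ch != EOF && ch != '\n');

    if (ch == EOF)
      break;                    /* leave the outer loop too, do not retry */
  }
  return retpid;
}

/* Hypothetical wrapper mirroring the shape of the original function. */
int get_namespace_pid_sketch(int vmid)
{
  char fname[32];
  FILE *fp;
  int retpid;

  snprintf(fname, sizeof(fname), "/proc/%d/status", vmid);
  fp = fopen(fname, "r");
  if (fp == NULL)
    return -1;
  retpid = parse_nspid(fp);
  fclose(fp);
  return retpid;
}
```

  The point of the sketch is that the outer loop no longer depends on
feof/ferror being observed on the next iteration: an EOF from either
fscanf or fgetc ends the function with the current result.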

...
    78.34%     0.00%  jps              [unknown]                 [k] 0000000000000000
            |
            ---0
               |          
               |--75.76%--read
               |          |          
               |          |--45.28%--entry_SYSCALL_64_after_hwframe
...

