User had a coredump of a process that was apparently hung in os::Linux::get_namespace_pid.

My previous analysis from almost one year ago:

"""
Thread #0 is waiting for thread #1 to finish.
Thread #1 is in the fgetc call in the "for (;;)" loop below:

// Determine if the vmid is the parent pid for a child in a PID namespace.
// Return the namespace pid if so, otherwise -1.
int os::Linux::get_namespace_pid(int vmid) {
  char fname[24];
  int retpid = -1;

  snprintf(fname, sizeof(fname), "/proc/%d/status", vmid);
  FILE *fp = os::fopen(fname, "r");

  if (fp) {
    int pid, nspid;
    int ret;
    while (!feof(fp) && !ferror(fp)) {
      ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);
      if (ret == 1) {
        break;
      }
      if (ret == 2) {
        retpid = nspid;
        break;
      }
      for (;;) {
        int ch = fgetc(fp);
        if (ch == EOF || ch == (int)'\n') break;
      }
    }
    fclose(fp);
  }
  return retpid;
}
"""

Suspected issues:

* errno value is 3:
  #define ESRCH 3 /* No such process */
* The fp flags have _IO_ERR_SEEN set:
  #define _IO_ERR_SEEN 0x0020
* The FILE* fp did open "/proc/8503/status". Maybe there was a race condition and this thread had already exited.
* The fgetc call should be returning EOF, not hanging, so it might be some issue with the procfs file.

I suspect it might be related to vm.swappiness=1 in /etc/sysctl.conf.

It would be useful if there was a hung jps process while generating the sosreport, as this could provide some extra data. It is also desirable to know where in the kernel it is hung, for example, by capturing the output of:

  for pid in $(pidof jps); do echo ==$pid==; cat /proc/$pid/stack; done

while it is hung.

If related to vm.swappiness=1, the system should be in some low-memory state. Experimenting with the default vm.swappiness=60 should settle this.

This small program should reproduce the hang if it were a generalized case, but it is probably some more complex condition...
""" #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <errno.h> #include <sys/wait.h> int main(int argc, char *argv[]) { pid_t pid; FILE *fp; if ((pid = fork()) == -1) { perror("failed to fork"); exit(1); } if (pid == 0) { printf("child: about to sleep 3\n"); if (execl("/usr/bin/sleep", "sleep", "3", NULL, NULL)) perror("failed to start sleep"); } else { int ch, status; sleep(1); char path[256]; sprintf(path, "/proc/%ld/status", pid); if ((fp = fopen(path, "r")) == NULL) { perror("failed to open /proc/pid/status"); exit(1); } printf("parent: opened %s\n", path); do { printf("parent: waiting for %d\n", pid); if (waitpid(pid, &status, WUNTRACED | WCONTINUED) == -1) { perror("failed to waitpid"); exit(1); } } while (!WIFEXITED(status)); printf("parent: process %d exited\n", pid); for (;;) { ch = fgetc(fp); printf("ch = %d, errno = %s, feof = %d, ferror = %d\n", ch, strerror(errno), feof(fp), ferror(fp)); if (ch == EOF || feof(fp)) { break; } fputc(ch, stdout); } } return 0; } """ User experienced the issue again. Besides vm.swappiness=1, user also has several entries in the pattern: $USER hard nofile 819200 $USER soft nofile 819200 in /etc/security/limits.conf for several different users. Now user tested perf when the issue happened again, and indeed the process is looping in the kernel, and using too much cpu time. Fix should be mostly trivial in java code, and if EOF is returned, exit the main loop and return -1, not just break the for loop and return to the main while loop. ... 78.34% 0.00% jps [unknown] [k] 0000000000000000 | ---0 | |--75.76%--read | | | |--45.28%--entry_SYSCALL_64_after_hwframe ...