User had a coredump of a process that was apparently hung in
os::Linux::get_namespace_pid. My previous analysis from almost
one year ago:
"""
Thread #0 is waiting for thread #1 to finish.
Thread #1 is in the fgetc call in the "for(;;)" loop below:
// Determine if the vmid is the parent pid for a child in a PID namespace.
// Return the namespace pid if so, otherwise -1.
int os::Linux::get_namespace_pid(int vmid) {
  char fname[24];
  int retpid = -1;

  snprintf(fname, sizeof(fname), "/proc/%d/status", vmid);
  FILE *fp = os::fopen(fname, "r");

  if (fp) {
    int pid, nspid;
    int ret;
    while (!feof(fp) && !ferror(fp)) {
      ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);
      if (ret == 1) {
        break;
      }
      if (ret == 2) {
        retpid = nspid;
        break;
      }
      for (;;) {
        int ch = fgetc(fp);
        if (ch == EOF || ch == (int)'\n') break;
      }
    }
    fclose(fp);
  }
  return retpid;
}
"""
Suspected issues:
* The errno value is 3:
  #define ESRCH 3 /* No such process */
* The fp flags have _IO_ERR_SEEN set:
  #define _IO_ERR_SEEN 0x0020
* The FILE* fp did open "/proc/8503/status".
  Maybe there was a race condition and process 8503 had already exited.
* The fgetc call should be returning EOF, not hanging, so it might be
  some issue with the procfs file.
I suspect it might be related to vm.swappiness=1 in /etc/sysctl.conf.
It would be useful to have a hung jps process present while generating
the sosreport, as this could provide some extra data.
It is also desirable to know where in the kernel it is hung; for
example, collect the output of:
for pid in $(pidof jps); do echo ==$pid==; cat /proc/$pid/stack; done
while it is hung.
If it is related to vm.swappiness=1, the system should be in some
low-memory condition. Experimenting with the default vm.swappiness=60
should settle this.
This small program should reproduce the hang if it were a general
case, but the hang probably depends on some more complex condition...
"""
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

int
main(int argc, char *argv[])
{
    pid_t pid;
    FILE *fp;

    if ((pid = fork()) == -1) {
        perror("failed to fork");
        exit(1);
    }
    if (pid == 0) {
        printf("child: about to sleep 3\n");
        fflush(stdout);
        /* execl only returns on failure */
        execl("/usr/bin/sleep", "sleep", "3", (char *)NULL);
        perror("failed to start sleep");
        _exit(1);
    }
    else {
        int ch, status;
        char path[256];

        /* give the child time to exec before opening its status file */
        sleep(1);
        snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
        if ((fp = fopen(path, "r")) == NULL) {
            perror("failed to open /proc/pid/status");
            exit(1);
        }
        printf("parent: opened %s\n", path);
        /* reap the child while fp is still open */
        do {
            printf("parent: waiting for %d\n", (int)pid);
            if (waitpid(pid, &status, WUNTRACED | WCONTINUED) == -1) {
                perror("failed to waitpid");
                exit(1);
            }
        } while (!WIFEXITED(status) && !WIFSIGNALED(status));
        printf("parent: process %d exited\n", (int)pid);
        /* read from the stale stream; fgetc should return EOF, not hang */
        for (;;) {
            errno = 0;
            ch = fgetc(fp);
            printf("ch = %d, errno = %s, feof = %d, ferror = %d\n",
                   ch, strerror(errno), feof(fp), ferror(fp));
            if (ch == EOF || feof(fp)) {
                break;
            }
            fputc(ch, stdout);
        }
        fclose(fp);
    }
    return 0;
}
"""
The user experienced the issue again.
Besides vm.swappiness=1, the user also has several entries of the form:
$USER hard nofile 819200
$USER soft nofile 819200
in /etc/security/limits.conf for several different users.
The user ran perf when the issue happened again, and the process is
indeed looping in the kernel and consuming excessive CPU time.
The fix should be mostly trivial in the Java code: if EOF is returned,
exit the main loop and return -1, rather than only breaking out of the
inner for loop and returning to the main while loop.
...
78.34% 0.00% jps [unknown] [k] 0000000000000000
|
---0
|
|--75.76%--read
| |
| |--45.28%--entry_SYSCALL_64_after_hwframe
...
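The fix described above (leave the main loop as soon as EOF is returned) can be sketched as follows. This is only an illustration under assumptions, not the actual HotSpot change: get_namespace_pid_from is a hypothetical standalone C variant that takes the status file path directly instead of a vmid, and it uses plain fopen instead of os::fopen.

```c
#include <stdio.h>

/*
 * Sketch only, not the real HotSpot patch.
 * Hypothetical variant of os::Linux::get_namespace_pid that takes the path
 * of the status file, so it can be exercised against any file.
 * Returns the namespace pid if the NSpid line has two values, otherwise -1.
 */
int get_namespace_pid_from(const char *fname)
{
    int retpid = -1;
    FILE *fp = fopen(fname, "r");

    if (fp == NULL)
        return -1;

    for (;;) {
        int pid, nspid, ch;
        int ret = fscanf(fp, "NSpid: %d %d", &pid, &nspid);

        if (ret == 2)
            retpid = nspid;   /* second value is the pid inside the namespace */
        if (ret == 1 || ret == 2)
            break;            /* NSpid line found, stop scanning */
        if (ret == EOF)
            break;            /* end of file or read error reported by fscanf */

        /* Not the NSpid line: skip the rest of this line. */
        do {
            ch = fgetc(fp);
        } while (ch != EOF && ch != '\n');

        if (ch == EOF)
            break;            /* leave the main loop too, never retry the read */
    }
    fclose(fp);
    return retpid;
}
```

The only behavioral difference from the quoted function is that an EOF from fgetc or fscanf ends the scan immediately, instead of relying on feof/ferror being checked on the next pass of the outer loop. On a host that is not inside a nested PID namespace, get_namespace_pid_from("/proc/self/status") is expected to return -1.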
Comment 1 RHEL Program Management
2023-09-12 23:12:18 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.
Comment 2 RHEL Program Management
2023-09-12 23:14:06 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.
Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.
To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:
"Bugzilla Bug" = 1234567
In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.