Description of problem:
In OSP 16.2, ironic_inspector_dnsmasq is failing its healthcheck. This was found by an automated Jenkins job that executes the commands below:

sudo podman ps -a | grep ironic_inspector_dnsmasq
3298098dd074 rhos-qe-mirror-rdu2.usersys.redhat.com:5002/rh-osbs/rhosp16-openstack-ironic-inspector:16.2_20210514.1 kolla_start About an hour ago Up About an hour ago ironic_inspector_dnsmasq

sudo systemctl list-units --failed --plain --no-legend --no-pager | grep healthcheck.service | grep ironic
tripleo_ironic_inspector_dnsmasq_healthcheck.service loaded failed failed ironic_inspector_dnsmasq healthcheck

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210514.n.0.
Container healthcheck passed in: RHOS-16.2-RHEL-8-20210420.n.0

How reproducible:
Every time

Steps to Reproduce:
1. Execute command: sudo systemctl list-units --failed --plain --no-legend --no-pager | grep healthcheck.service | grep ironic

Actual results:
Container healthcheck failed

Expected results:
Container healthcheck passes

Additional info:
This is also impacting our FFWD upgrade CI jobs now that we have enabled validations. The tripleo_ironic_inspector_dnsmasq_healthcheck.service unit appears as failed, which makes the validation fail.

I did some debugging and the problem seems to be in the healthcheck_port function (after sourcing healthcheck_port and get_user_from_process):

[root@undercloud-0 /]# healthcheck_port 'dnsmasq' 67
exit
[stack@undercloud-0 ~]$ echo $?
1

More specifically, the problem appears when searching for -ilname "socket*" in the proc directory:

[root@undercloud-0 /]# ports="${ports}|$(printf '%0.4x' 67)"
[root@undercloud-0 /]# ports=":(${ports:1})"
[root@undercloud-0 /]# echo $ports
:(0043)
[root@undercloud-0 /]# sockets=$(awk -i join -v m=${ports} '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/{tcp,udp})
[root@undercloud-0 /]# echo $sockets
403523
[root@undercloud-0 /]# match=$(( $match+$(sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null | grep -c -E "(${sockets})") ))
[root@undercloud-0 /]# echo $match
0

Because match is 0, the function ends up exiting with result 1. When executing sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" on its own, I got:

[root@undercloud-0 /]# sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n"
find: ‘/proc/8/fd/’: Permission denied

However, without the sudo -u dnsmasq I could retrieve the sockets:

[root@undercloud-0 /]# find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null
socket:[403523]
socket:[403524]

It looks to me like the problem is here: https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L79
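For anyone reproducing the first step of the debugging session above, here is a minimal sketch of how a list of ports turns into the pattern matched against /proc/net/{tcp,udp}. The helper name port_pattern is hypothetical and is not part of common.sh; it only mirrors the printf and string handling shown in the transcript. The kernel prints ports in those files as hexadecimal, which is why the transcript's awk call sets IGNORECASE=1.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of tripleo-common's common.sh): build the
# ":(hex|hex|...)" pattern from one or more decimal port numbers, the same
# way the transcript does it by hand for port 67.
port_pattern() {
    pattern=""
    for port in "$@"; do
        # Four hex digits, zero padded, e.g. 67 -> 0043
        hex=$(printf '%04x' "$port")
        if [ -z "$pattern" ]; then
            pattern="$hex"
        else
            pattern="${pattern}|${hex}"
        fi
    done
    printf ':(%s)\n' "$pattern"
}

# Only print when executed with arguments, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    port_pattern "$@"
fi
```

Calling port_pattern 67 should reproduce the :(0043) value seen in the transcript.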
My suspicion was correct. After removing the "sudo -u $puser" and restarting the healthcheck, the service succeeds again:

[stack@undercloud-0 ~]$ sudo systemctl status tripleo_ironic_inspector_dnsmasq_healthcheck
● tripleo_ironic_inspector_dnsmasq_healthcheck.service - ironic_inspector_dnsmasq healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ironic_inspector_dnsmasq_healthcheck.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-06-21 10:25:21 UTC; 26s ago
  Process: 1029745 ExecStart=/usr/bin/podman exec --user root ironic_inspector_dnsmasq /openstack/healthcheck (code=exited, status=0/SUCCESS)
 Main PID: 1029745 (code=exited, status=0/SUCCESS)

Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Starting ironic_inspector_dnsmasq healthcheck...
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: 8
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: Checking dnsmasq port(s) 67.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: tripleo_ironic_inspector_dnsmasq_healthcheck.service: Succeeded.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Started ironic_inspector_dnsmasq healthcheck.
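The comparison that decides the result can be looked at in isolation. This is a minimal sketch, not the code in common.sh: the function name count_socket_matches is hypothetical, and it reads the "socket:[inode]" link targets from stdin so the matching step can be tried without touching /proc. It tightens the transcript's grep pattern slightly by anchoring on the socket:[...] form.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of common.sh): count how many fd link targets
# on stdin refer to one of the given socket inodes.
# $1 is an alternation of inodes as produced from /proc/net/{tcp,udp},
# e.g. "403523" or "403523|403999".
count_socket_matches() {
    # grep -c prints 0 (and returns non-zero) when nothing matches; only the
    # printed count is of interest here.
    grep -c -E "socket:\[($1)\]"
}

# Example with the two link targets from this report:
#   printf 'socket:[403523]\nsocket:[403524]\n' | count_socket_matches 403523
# On a live container the input would come from something like
#   find /proc/<pid>/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null
# run as a user that is allowed to read that directory.
```

When find fails with Permission denied and its stderr is discarded, grep receives empty input and the count is 0, which matches the match=0 result shown earlier.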
Digging a little deeper, this looks like a permissions issue when compared to other containers (healthcheck_port is used in other healthcheck services as well). The /proc/8 directory is owned by dnsmasq, but /proc/8/fd isn't:

[root@undercloud-0 /]# ls -larth /proc/8/
total 0
dr-xr-xr-x. 481 root root 0 Jun 17 12:34 ..
-r--r--r--. 1 root root 0 Jun 21 10:19 status
-r--r--r--. 1 root root 0 Jun 21 10:19 cmdline
dr-xr-xr-x. 9 dnsmasq dnsmasq 0 Jun 21 10:19 .
dr-x------. 2 root root 0 Jun 21 10:19 fd
-r--r--r--. 1 root root 0 Jun 21 10:21 stat
-r--r--r--. 1 root root 0 Jun 21 10:34 wchan
-rw-r--r--. 1 root root 0 Jun 21 10:34 uid_map
-rw-rw-rw-. 1 root root 0 Jun 21 10:34 timerslack_ns
-r--r--r--. 1 root root 0 Jun 21 10:34 timers
-rw-r--r--. 1 root root 0 Jun 21 10:34 timens_offsets
dr-xr-xr-x. 3 dnsmasq dnsmasq 0 Jun 21 10:34 task
-r--------. 1 root root 0 Jun 21 10:34 syscall
-r--r--r--. 1 root root 0 Jun 21 10:34 statm
-r--------. 1 root root 0 Jun 21 10:34 stack
-r--r--r--. 1 root root 0 Jun 21 10:34 smaps_rollup
-r--r--r--. 1 root root 0 Jun 21 10:34 smaps
-rw-r--r--. 1 root root 0 Jun 21 10:34 setgroups
-r--r--r--. 1 root root 0 Jun 21 10:34 sessionid
-r--r--r--. 1 root root 0 Jun 21 10:34 schedstat
-rw-r--r--. 1 root root 0 Jun 21 10:34 sched
lrwxrwxrwx. 1 root root 0 Jun 21 10:34 root -> /
-rw-r--r--. 1 root root 0 Jun 21 10:34 projid_map
-r--------. 1 root root 0 Jun 21 10:34 personality
-r--------. 1 root root 0 Jun 21 10:34 patch_state
-r--------. 1 root root 0 Jun 21 10:34 pagemap
-rw-r--r--. 1 root root 0 Jun 21 10:34 oom_score_adj
-r--r--r--. 1 root root 0 Jun 21 10:34 oom_score
-rw-r--r--. 1 root root 0 Jun 21 10:34 oom_adj
-r--r--r--. 1 root root 0 Jun 21 10:34 numa_maps
dr-x--x--x. 2 root root 0 Jun 21 10:34 ns
dr-xr-xr-x. 6 dnsmasq dnsmasq 0 Jun 21 10:34 net
-r--------. 1 root root 0 Jun 21 10:34 mountstats
-r--r--r--. 1 root root 0 Jun 21 10:34 mounts
-r--r--r--. 1 root root 0 Jun 21 10:34 mountinfo
-rw-------. 1 root root 0 Jun 21 10:34 mem
-r--r--r--. 1 root root 0 Jun 21 10:34 maps
dr-x------. 2 root root 0 Jun 21 10:34 map_files
-rw-r--r--. 1 root root 0 Jun 21 10:34 loginuid
-r--r--r--. 1 root root 0 Jun 21 10:34 limits
-r--------. 1 root root 0 Jun 21 10:34 io
-rw-r--r--. 1 root root 0 Jun 21 10:34 gid_map
dr-x------. 2 root root 0 Jun 21 10:34 fdinfo
lrwxrwxrwx. 1 root root 0 Jun 21 10:34 exe -> /usr/sbin/dnsmasq
-r--------. 1 root root 0 Jun 21 10:34 environ
lrwxrwxrwx. 1 root root 0 Jun 21 10:34 cwd -> /
-r--r--r--. 1 root root 0 Jun 21 10:34 cpuset
-r--r--r--. 1 root root 0 Jun 21 10:34 cpu_resctrl_groups
-rw-r--r--. 1 root root 0 Jun 21 10:34 coredump_filter
-rw-r--r--. 1 root root 0 Jun 21 10:34 comm
--w-------. 1 root root 0 Jun 21 10:34 clear_refs
-r--r--r--. 1 root root 0 Jun 21 10:34 cgroup
-r--------. 1 root root 0 Jun 21 10:34 auxv
-rw-r--r--. 1 root root 0 Jun 21 10:34 autogroup
dr-xr-xr-x. 2 dnsmasq dnsmasq 0 Jun 21 10:34 attr

That is why we get the "Permission denied" error when the dnsmasq user tries to search it: fd is owned by root with mode dr-x------.

For comparison, here is mistral_engine, which also invokes healthcheck_port in its healthcheck:

[root@undercloud-0 /]# ps -edf
UID PID PPID C STIME TTY TIME CMD
mistral 1 0 0 Jun17 ? 00:00:00 dumb-init --single-child -- kolla_start
mistral 7 1 0 Jun17 ? 00:34:58 /usr/bin/python3 /usr/bin/mistral-server --config-file=/etc/mistral/mistral.conf --log-file=/var/log/m
root 71904 0 1 10:36 pts/0 00:00:00 bash
root 71917 71904 0 10:37 pts/0 00:00:00 ps -edf

[root@undercloud-0 /]# ls -larth /proc/7
ls: cannot read symbolic link '/proc/7/cwd': Permission denied
ls: cannot read symbolic link '/proc/7/root': Permission denied
ls: cannot read symbolic link '/proc/7/exe': Permission denied
total 0
dr-xr-xr-x. 479 root root 0 Jun 17 12:33 ..
dr-xr-xr-x. 9 mistral mistral 0 Jun 21 08:41 .
dr-x------. 2 mistral mistral 0 Jun 21 08:41 fd
-r--r--r--. 1 mistral mistral 0 Jun 21 10:06 status
-r--r--r--. 1 mistral mistral 0 Jun 21 10:06 cmdline
-r--r--r--. 1 mistral mistral 0 Jun 21 10:29 stat
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 wchan
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 uid_map
-rw-rw-rw-. 1 mistral mistral 0 Jun 21 10:37 timerslack_ns
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 timers
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 timens_offsets
dr-xr-xr-x. 3 mistral mistral 0 Jun 21 10:37 task
-r--------. 1 mistral mistral 0 Jun 21 10:37 syscall
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 statm
-r--------. 1 mistral mistral 0 Jun 21 10:37 stack
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 smaps_rollup
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 smaps
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 setgroups
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 sessionid
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 schedstat
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 sched
lrwxrwxrwx. 1 mistral mistral 0 Jun 21 10:37 root
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 projid_map
-r--------. 1 mistral mistral 0 Jun 21 10:37 personality
-r--------. 1 mistral mistral 0 Jun 21 10:37 patch_state
-r--------. 1 mistral mistral 0 Jun 21 10:37 pagemap
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 oom_score_adj
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 oom_score
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 oom_adj
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 numa_maps
dr-x--x--x. 2 mistral mistral 0 Jun 21 10:37 ns
dr-xr-xr-x. 6 mistral mistral 0 Jun 21 10:37 net
-r--------. 1 mistral mistral 0 Jun 21 10:37 mountstats
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 mounts
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 mountinfo
-rw-------. 1 mistral mistral 0 Jun 21 10:37 mem
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 maps
dr-x------. 2 mistral mistral 0 Jun 21 10:37 map_files
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 loginuid
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 limits
-r--------. 1 mistral mistral 0 Jun 21 10:37 io
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 gid_map
dr-x------. 2 mistral mistral 0 Jun 21 10:37 fdinfo
lrwxrwxrwx. 1 mistral mistral 0 Jun 21 10:37 exe
-r--------. 1 mistral mistral 0 Jun 21 10:37 environ
lrwxrwxrwx. 1 mistral mistral 0 Jun 21 10:37 cwd
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 cpuset
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 cpu_resctrl_groups
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 coredump_filter
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 comm
--w-------. 1 mistral mistral 0 Jun 21 10:37 clear_refs
-r--r--r--. 1 mistral mistral 0 Jun 21 10:37 cgroup
-r--------. 1 mistral mistral 0 Jun 21 10:37 auxv
-rw-r--r--. 1 mistral mistral 0 Jun 21 10:37 autogroup
dr-xr-xr-x. 2 mistral mistral 0 Jun 21 10:37 attr

Everything under /proc/7 is owned by mistral.
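The ownership difference between the two containers can be checked without reading through the full listings. The sketch below is only an illustration: same_owner and check_fd_owner are hypothetical names, and it assumes GNU coreutils stat (the -c '%u' format option) is available inside the container.

```shell
#!/bin/sh
# Hypothetical helpers (NOT part of common.sh). Assumes GNU stat.

# Succeeds when both paths are owned by the same numeric UID.
same_owner() {
    [ "$(stat -c '%u' "$1")" = "$(stat -c '%u' "$2")" ]
}

# Report whether /proc/<pid>/fd has the same owner as /proc/<pid>.
# A mismatch is the situation seen for dnsmasq (PID 8) in this report;
# for mistral-server (PID 7) both were owned by mistral.
check_fd_owner() {
    pid=$1
    if same_owner "/proc/$pid" "/proc/$pid/fd"; then
        echo "pid $pid: fd directory has the same owner as the process directory"
    else
        echo "pid $pid: fd directory owner differs from the process directory owner"
    fi
}
```

Running check_fd_owner 8 as root inside the ironic_inspector_dnsmasq container would be expected to report a mismatch, given the listing above.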
*** Bug 2011676 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1001