Bug 1961237

Summary: OSP16.2 ironic inspector dnsmasq fail healthcheck
Product: Red Hat OpenStack Reporter: David Rosenfeld <drosenfe>
Component: openstack-tripleo-common Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.2 (Train) CC: asalvati, gkadam, igallagh, jamsmith, jfrancoa, kthakre, mburns, nm-s, pweeks, sbaker, slinaber, uemit.seren
Target Milestone: z2 Keywords: Regression, Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-11.7.1-2.20210802105338.9991292.el8ost Doc Type: Bug Fix
Doc Text:
Before this update, the dnsmasq healthcheck failed even when dnsmasq ran correctly. The healthcheck failed because it used the dnsmasq user rather than the root user, and did not have access to the `/proc` files. This resulted in incorrect systemd journal messages and failures when validations were enabled. With this update, the dnsmasq healthcheck is disabled because it is of limited use and it is being phased out in later releases. The dnsmasq container is now marked as healthy as long as it is running.
Story Points: ---
Clone Of:
: 2021204 Environment:
Last Closed: 2022-03-23 22:10:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2021204    

Description David Rosenfeld 2021-05-17 15:33:47 UTC
Description of problem: In OSP 16.2, ironic_inspector_dnsmasq is failing its healthcheck. This was found by an automated Jenkins job that executes the commands below:

sudo podman ps -a | grep ironic_inspector_dnsmasq
3298098dd074  rhos-qe-mirror-rdu2.usersys.redhat.com:5002/rh-osbs/rhosp16-openstack-ironic-inspector:16.2_20210514.1           kolla_start           About an hour ago  Up About an hour ago                  ironic_inspector_dnsmasq


sudo systemctl list-units --failed --plain --no-legend  --no-pager | grep healthcheck.service | grep ironic
tripleo_ironic_inspector_dnsmasq_healthcheck.service loaded failed failed ironic_inspector_dnsmasq healthcheck
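The check performed by the two grep calls above can be sketched as a single filter. This is only an illustration: the function name failed_healthchecks is hypothetical and is not part of the Jenkins job. It reads `systemctl list-units --failed --plain --no-legend` output on stdin and prints the matching unit names.

```shell
# Hypothetical helper (not from the Jenkins job): print the names of failed
# healthcheck units whose unit name contains the substring given as $1.
# Input is expected on stdin in `systemctl list-units --plain --no-legend` form,
# where the first column is the unit name.
failed_healthchecks() {
    awk -v p="$1" '$1 ~ /healthcheck\.service$/ && index($1, p) { print $1 }'
}

# Usage on a live system (not run here):
#   sudo systemctl list-units --failed --plain --no-legend --no-pager \
#       | failed_healthchecks ironic
```

An empty result means no failed healthcheck unit matched the substring.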

Version-Release number of selected component (if applicable): RHOS-16.2-RHEL-8-20210514.n.0. Container healthcheck passed in: RHOS-16.2-RHEL-8-20210420.n.0


How reproducible: Every time


Steps to Reproduce:
1. Execute command: sudo systemctl list-units --failed --plain --no-legend  --no-pager | grep healthcheck.service | grep ironic

Actual results: Container healthcheck failed


Expected results: Container healthcheck passes


Additional info:

Comment 1 Jose Luis Franco 2021-06-21 10:24:11 UTC
This is also impacting us in our FFWD upgrades CI jobs now that we have enabled validations. The tripleo_ironic_inspector_dnsmasq_healthcheck.service appears as failed, causing the validation to fail.

I did some debugging, and the problem seems to be in the healthcheck_port function:

(after sourcing healthcheck_port and get_user_from_process)

[root@undercloud-0 /]# healthcheck_port 'dnsmasq' 67
exit
[stack@undercloud-0 ~]$ echo $?
1

To be more specific, the problem seems to occur when searching the proc directory with find -ilname "socket*":

[root@undercloud-0 /]# ports="${ports}|$(printf '%0.4x' 67)"
[root@undercloud-0 /]# ports=":(${ports:1})"
[root@undercloud-0 /]# echo $ports
:(0043)
[root@undercloud-0 /]# sockets=$(awk -i join -v m=${ports} '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/{tcp,udp})
[root@undercloud-0 /]# echo $sockets
403523
[root@undercloud-0 /]# match=$(( $match+$(sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null | grep -c -E "(${sockets})") ))
[root@undercloud-0 /]# echo $match
0

As it returns 0 the function ends up exiting with result 1.
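For reference, the port-to-inode lookup done by the awk command above can be sketched in a smaller form. This is a simplified illustration, not the code from common.sh: the function name socket_inodes_for_port is hypothetical, it handles a single port, and unlike the original pattern it anchors the match at the end of the address field.

```shell
# Hypothetical, simplified version of the lookup: read /proc/net/tcp or
# /proc/net/udp style lines on stdin and print the inode (field 10) of every
# socket whose local (field 2) or remote (field 3) address ends in the port.
# The port appears in those files as four hex digits after the colon.
socket_inodes_for_port() {
    port_hex=$(printf '%04X' "$1")
    awk -v p=":${port_hex}\$" 'toupper($2) ~ p || toupper($3) ~ p { print $10 }'
}

# Usage inside the container (not run here):
#   cat /proc/net/tcp /proc/net/udp | socket_inodes_for_port 67
```

Port 67 becomes 0043, which matches the ":(0043)" pattern shown in the transcript.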

And when trying to execute sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" alone, I got:

[root@undercloud-0 /]# sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n"
find: ‘/proc/8/fd/’: Permission denied

However, without the sudo -u dnsmasq I could retrieve the sockets:

[root@undercloud-0 /]# find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null
socket:[403523]
socket:[403524]

It looks to me that the problem is here https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L79
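Note that in the commands above the find call redirects stderr to /dev/null, so the "Permission denied" error is hidden and the match count is 0, the same result as when nothing is listening on the port. A minimal sketch of a check that keeps the two cases apart follows. Both function names are hypothetical and this is not the upstream implementation; it assumes GNU find, as used in the transcript.

```shell
# Hypothetical helper: print socket link targets under an fd directory, but
# return 2 with a message when the directory cannot be read, instead of
# silently printing nothing.
list_socket_links() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 2
    fi
    find "$1" -ilname 'socket*' -printf '%l\n'
}

# Hypothetical helper: count stdin lines of the form socket:[INODE] whose
# inode matches the alternation in $1 (for example "403523|403524").
# grep -c exits non-zero on zero matches, so the status is normalised to 0.
count_matching_sockets() {
    grep -c -E "socket:\[(${1})\]" || true
}

# Usage inside the container (not run here), assuming PID 8 is dnsmasq:
#   list_socket_links /proc/8/fd | count_matching_sockets 403523
```

With this split, an unreadable fd directory shows up as an error from the first function rather than as a count of 0.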

Comment 2 Jose Luis Franco 2021-06-21 10:27:02 UTC
My suspicion was correct: after removing the "sudo -u $puser" and restarting the healthcheck, the service runs and exits successfully:

[stack@undercloud-0 ~]$ sudo systemctl status tripleo_ironic_inspector_dnsmasq_healthcheck
● tripleo_ironic_inspector_dnsmasq_healthcheck.service - ironic_inspector_dnsmasq healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ironic_inspector_dnsmasq_healthcheck.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-06-21 10:25:21 UTC; 26s ago
  Process: 1029745 ExecStart=/usr/bin/podman exec --user root ironic_inspector_dnsmasq /openstack/healthcheck (code=exited, status=0/SUCCESS)
 Main PID: 1029745 (code=exited, status=0/SUCCESS)

Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Starting ironic_inspector_dnsmasq healthcheck...
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: 8
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: Checking dnsmasq port(s) 67.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: tripleo_ironic_inspector_dnsmasq_healthcheck.service: Succeeded.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Started ironic_inspector_dnsmasq healthcheck.

Comment 3 Jose Luis Franco 2021-06-21 10:38:39 UTC
Digging a little deeper, this looks like a permissions issue when compared to other containers (healthcheck_port is also used in other healthcheck services). The /proc/8 directory is owned by dnsmasq, but /proc/8/fd isn't:

[root@undercloud-0 /]# ls -larth /proc/8/
total 0
dr-xr-xr-x. 481 root    root    0 Jun 17 12:34 ..
-r--r--r--.   1 root    root    0 Jun 21 10:19 status
-r--r--r--.   1 root    root    0 Jun 21 10:19 cmdline
dr-xr-xr-x.   9 dnsmasq dnsmasq 0 Jun 21 10:19 .
dr-x------.   2 root    root    0 Jun 21 10:19 fd
-r--r--r--.   1 root    root    0 Jun 21 10:21 stat
-r--r--r--.   1 root    root    0 Jun 21 10:34 wchan
-rw-r--r--.   1 root    root    0 Jun 21 10:34 uid_map
-rw-rw-rw-.   1 root    root    0 Jun 21 10:34 timerslack_ns
-r--r--r--.   1 root    root    0 Jun 21 10:34 timers
-rw-r--r--.   1 root    root    0 Jun 21 10:34 timens_offsets
dr-xr-xr-x.   3 dnsmasq dnsmasq 0 Jun 21 10:34 task
-r--------.   1 root    root    0 Jun 21 10:34 syscall
-r--r--r--.   1 root    root    0 Jun 21 10:34 statm
-r--------.   1 root    root    0 Jun 21 10:34 stack
-r--r--r--.   1 root    root    0 Jun 21 10:34 smaps_rollup
-r--r--r--.   1 root    root    0 Jun 21 10:34 smaps
-rw-r--r--.   1 root    root    0 Jun 21 10:34 setgroups
-r--r--r--.   1 root    root    0 Jun 21 10:34 sessionid
-r--r--r--.   1 root    root    0 Jun 21 10:34 schedstat
-rw-r--r--.   1 root    root    0 Jun 21 10:34 sched
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 root -> /
-rw-r--r--.   1 root    root    0 Jun 21 10:34 projid_map
-r--------.   1 root    root    0 Jun 21 10:34 personality
-r--------.   1 root    root    0 Jun 21 10:34 patch_state
-r--------.   1 root    root    0 Jun 21 10:34 pagemap
-rw-r--r--.   1 root    root    0 Jun 21 10:34 oom_score_adj
-r--r--r--.   1 root    root    0 Jun 21 10:34 oom_score
-rw-r--r--.   1 root    root    0 Jun 21 10:34 oom_adj
-r--r--r--.   1 root    root    0 Jun 21 10:34 numa_maps
dr-x--x--x.   2 root    root    0 Jun 21 10:34 ns
dr-xr-xr-x.   6 dnsmasq dnsmasq 0 Jun 21 10:34 net
-r--------.   1 root    root    0 Jun 21 10:34 mountstats
-r--r--r--.   1 root    root    0 Jun 21 10:34 mounts
-r--r--r--.   1 root    root    0 Jun 21 10:34 mountinfo
-rw-------.   1 root    root    0 Jun 21 10:34 mem
-r--r--r--.   1 root    root    0 Jun 21 10:34 maps
dr-x------.   2 root    root    0 Jun 21 10:34 map_files
-rw-r--r--.   1 root    root    0 Jun 21 10:34 loginuid
-r--r--r--.   1 root    root    0 Jun 21 10:34 limits
-r--------.   1 root    root    0 Jun 21 10:34 io
-rw-r--r--.   1 root    root    0 Jun 21 10:34 gid_map
dr-x------.   2 root    root    0 Jun 21 10:34 fdinfo
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 exe -> /usr/sbin/dnsmasq
-r--------.   1 root    root    0 Jun 21 10:34 environ
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 cwd -> /
-r--r--r--.   1 root    root    0 Jun 21 10:34 cpuset
-r--r--r--.   1 root    root    0 Jun 21 10:34 cpu_resctrl_groups
-rw-r--r--.   1 root    root    0 Jun 21 10:34 coredump_filter
-rw-r--r--.   1 root    root    0 Jun 21 10:34 comm
--w-------.   1 root    root    0 Jun 21 10:34 clear_refs
-r--r--r--.   1 root    root    0 Jun 21 10:34 cgroup
-r--------.   1 root    root    0 Jun 21 10:34 auxv
-rw-r--r--.   1 root    root    0 Jun 21 10:34 autogroup
dr-xr-xr-x.   2 dnsmasq dnsmasq 0 Jun 21 10:34 attr

That is why we receive the "Permission denied" error when trying to search it. For comparison, here is mistral_engine, which also invokes healthcheck_port in its healthcheck:

[root@undercloud-0 /]# ps -edf
UID          PID    PPID  C STIME TTY          TIME CMD
mistral        1       0  0 Jun17 ?        00:00:00 dumb-init --single-child -- kolla_start
mistral        7       1  0 Jun17 ?        00:34:58 /usr/bin/python3 /usr/bin/mistral-server --config-file=/etc/mistral/mistral.conf --log-file=/var/log/m
root       71904       0  1 10:36 pts/0    00:00:00 bash
root       71917   71904  0 10:37 pts/0    00:00:00 ps -edf
[root@undercloud-0 /]# ls -larth /proc/7
ls: cannot read symbolic link '/proc/7/cwd': Permission denied
ls: cannot read symbolic link '/proc/7/root': Permission denied
ls: cannot read symbolic link '/proc/7/exe': Permission denied
total 0
dr-xr-xr-x. 479 root    root    0 Jun 17 12:33 ..
dr-xr-xr-x.   9 mistral mistral 0 Jun 21 08:41 .
dr-x------.   2 mistral mistral 0 Jun 21 08:41 fd
-r--r--r--.   1 mistral mistral 0 Jun 21 10:06 status
-r--r--r--.   1 mistral mistral 0 Jun 21 10:06 cmdline
-r--r--r--.   1 mistral mistral 0 Jun 21 10:29 stat
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 wchan
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 uid_map
-rw-rw-rw-.   1 mistral mistral 0 Jun 21 10:37 timerslack_ns
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 timers
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 timens_offsets
dr-xr-xr-x.   3 mistral mistral 0 Jun 21 10:37 task
-r--------.   1 mistral mistral 0 Jun 21 10:37 syscall
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 statm
-r--------.   1 mistral mistral 0 Jun 21 10:37 stack
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 smaps_rollup
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 smaps
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 setgroups
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 sessionid
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 schedstat
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 sched
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 root
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 projid_map
-r--------.   1 mistral mistral 0 Jun 21 10:37 personality
-r--------.   1 mistral mistral 0 Jun 21 10:37 patch_state
-r--------.   1 mistral mistral 0 Jun 21 10:37 pagemap
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_score_adj
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_score
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_adj
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 numa_maps
dr-x--x--x.   2 mistral mistral 0 Jun 21 10:37 ns
dr-xr-xr-x.   6 mistral mistral 0 Jun 21 10:37 net
-r--------.   1 mistral mistral 0 Jun 21 10:37 mountstats
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 mounts
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 mountinfo
-rw-------.   1 mistral mistral 0 Jun 21 10:37 mem
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 maps
dr-x------.   2 mistral mistral 0 Jun 21 10:37 map_files
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 loginuid
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 limits
-r--------.   1 mistral mistral 0 Jun 21 10:37 io
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 gid_map
dr-x------.   2 mistral mistral 0 Jun 21 10:37 fdinfo
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 exe
-r--------.   1 mistral mistral 0 Jun 21 10:37 environ
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 cwd
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cpuset
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cpu_resctrl_groups
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 coredump_filter
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 comm
--w-------.   1 mistral mistral 0 Jun 21 10:37 clear_refs
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cgroup
-r--------.   1 mistral mistral 0 Jun 21 10:37 auxv
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 autogroup
dr-xr-xr-x.   2 mistral mistral 0 Jun 21 10:37 attr


Everything under /proc/7 is owned by mistral.
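The ownership difference between the two listings can also be checked directly, without reading the full output. A small sketch, assuming GNU stat is available in the container; the function name owned_by_uid is hypothetical.

```shell
# Hypothetical helper: succeed when the numeric owner of the path in $1
# equals the UID given in $2.
owned_by_uid() {
    [ "$(stat -c '%u' "$1")" = "$2" ]
}

# Usage inside the containers (not run here), assuming the PIDs shown above:
#   owned_by_uid /proc/8/fd "$(id -u dnsmasq)" || echo "fd dir not owned by dnsmasq"
#   owned_by_uid /proc/7/fd "$(id -u mistral)" && echo "fd dir owned by mistral"
```

Based on the listings above, the first check would be expected to fail in the dnsmasq container and the second to succeed in the mistral_engine container.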

Comment 6 Rabi Mishra 2021-10-07 08:31:36 UTC
*** Bug 2011676 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2022-03-23 22:10:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001