Bug 1961237 - OSP16.2 ironic inspector dnsmasq fail healthcheck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Steve Baker
QA Contact: David Rosenfeld
URL:
Whiteboard:
Duplicates: 2011676
Depends On:
Blocks: 2021204
Reported: 2021-05-17 15:33 UTC by David Rosenfeld
Modified: 2022-03-23 22:10 UTC (History)

Fixed In Version: openstack-tripleo-common-11.7.1-2.20210802105338.9991292.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, the dnsmasq healthcheck failed even when dnsmasq ran correctly. The healthcheck failed because it used the dnsmasq user rather than the root user, and did not have access to the `/proc` files. This resulted in incorrect systemd journal messages and failures when validations were enabled. With this update, the dnsmasq healthcheck is disabled because it is of limited use and it is being phased out in later releases. The dnsmasq container is now marked as healthy as long as it is running.
Clone Of:
Clones: 2021204
Environment:
Last Closed: 2022-03-23 22:10:09 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 799158 0 None MERGED Remove ironic-inspector dnsmasq healthcheck 2022-02-15 15:15:42 UTC
Red Hat Issue Tracker OSP-3954 0 None None None 2021-11-17 05:34:12 UTC
Red Hat Product Errata RHBA-2022:1001 0 None None None 2022-03-23 22:10:45 UTC

Description David Rosenfeld 2021-05-17 15:33:47 UTC
Description of problem: In OSP 16.2, ironic_inspector_dnsmasq is failing its healthcheck. This was found by an automated Jenkins job that runs the commands below:

sudo podman ps -a | grep ironic_inspector_dnsmasq
3298098dd074  rhos-qe-mirror-rdu2.usersys.redhat.com:5002/rh-osbs/rhosp16-openstack-ironic-inspector:16.2_20210514.1           kolla_start           About an hour ago  Up About an hour ago                  ironic_inspector_dnsmasq


sudo systemctl list-units --failed --plain --no-legend  --no-pager | grep healthcheck.service | grep ironic
tripleo_ironic_inspector_dnsmasq_healthcheck.service loaded failed failed ironic_inspector_dnsmasq healthcheck
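
The check above can be wrapped in a small filter for reuse in a CI job. This is only a minimal sketch: failed_healthchecks is a hypothetical helper name (not part of TripleO), and it assumes the unit name is the first column of the `systemctl list-units --plain --no-legend` output, as in the line shown above.

```shell
# Sketch: read `systemctl list-units --failed --plain --no-legend` output on
# stdin and print the failed healthcheck units that mention a given service.
# failed_healthchecks is a hypothetical name used only for illustration.
failed_healthchecks() {
    local service="$1"
    # Column 1 is the unit name. Keep only *_healthcheck.service units whose
    # name contains the requested service string.
    awk -v svc="$service" \
        '$1 ~ /_healthcheck\.service$/ && index($1, svc) { print $1 }'
}
```

Usage, under the same assumptions: sudo systemctl list-units --failed --plain --no-legend --no-pager | failed_healthchecks ironic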

Version-Release number of selected component (if applicable): RHOS-16.2-RHEL-8-20210514.n.0. Container healthcheck passed in: RHOS-16.2-RHEL-8-20210420.n.0


How reproducible: Every time


Steps to Reproduce:
1. Execute command: sudo systemctl list-units --failed --plain --no-legend  --no-pager | grep healthcheck.service | grep ironic

Actual results: Container healthcheck failed


Expected results: Container healthcheck passes


Additional info:

Comment 1 Jose Luis Franco 2021-06-21 10:24:11 UTC
This is also affecting our FFWD upgrade CI jobs now that we have enabled validations. The tripleo_ironic_inspector_dnsmasq_healthcheck.service unit appears as failed, which causes the validation to fail.

I did some debugging, and the problem seems to be in the healthcheck_port function:

(after sourcing healthcheck_port and get_user_from_process)

[root@undercloud-0 /]# healthcheck_port 'dnsmasq' 67
exit
[stack@undercloud-0 ~]$ echo $?
1

More specifically, the problem occurs when find looks for -ilname "socket*" links in the process's /proc fd directory:

[root@undercloud-0 /]# ports="${ports}|$(printf '%0.4x' 67)"
[root@undercloud-0 /]# ports=":(${ports:1})"
[root@undercloud-0 /]# echo $ports
:(0043)
[root@undercloud-0 /]# sockets=$(awk -i join -v m=${ports} '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/{tcp,udp})
[root@undercloud-0 /]# echo $sockets
403523
[root@undercloud-0 /]# match=$(( $match+$(sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null | grep -c -E "(${sockets})") ))
[root@undercloud-0 /]# echo $match
0

Because match ends up as 0, the function exits with status 1.
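
For reference, the port pattern built by hand at the start of the session above can be sketched as a small function. port_pattern is a hypothetical name chosen for illustration; the body only mirrors the two ports= assignments shown in the transcript and is not copied from common.sh.

```shell
# Sketch: build the hex port pattern that is matched against the address
# columns of /proc/net/{tcp,udp}. port_pattern is a hypothetical name.
port_pattern() {
    local ports="" p
    for p in "$@"; do
        # /proc/net/tcp and /proc/net/udp print ports as 4-digit hex,
        # so decimal 67 becomes 0043.
        ports="${ports}|$(printf '%0.4x' "$p")"
    done
    # Drop the leading "|" and wrap the alternatives as ":(a|b)".
    printf ':(%s)\n' "${ports:1}"
}
```

Calling port_pattern 67 prints :(0043), which matches the value echoed in the session.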

Running sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n" on its own, without discarding stderr, shows the error:

[root@undercloud-0 /]# sudo -u dnsmasq find /proc/8/fd/ -ilname "socket*" -printf "%l\n"
find: ‘/proc/8/fd/’: Permission denied

However, without the sudo -u dnsmasq I could retrieve the sockets:

[root@undercloud-0 /]# find /proc/8/fd/ -ilname "socket*" -printf "%l\n" 2>/dev/null
socket:[403523]
socket:[403524]

The problem appears to be here: https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L79
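
The effect of the permission error on the match count can be illustrated with a minimal sketch. count_socket_matches is a hypothetical helper, not a function from common.sh; it counts fd link targets read from stdin, anchoring on the socket:[inode] form (slightly stricter than the plain grep -c -E "(${sockets})" used in the session above).

```shell
# Sketch: count fd symlink targets (one per line on stdin) whose socket inode
# is in a "|"-separated inode list. count_socket_matches is a hypothetical
# name used only for illustration.
count_socket_matches() {
    local sockets="$1"
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the printed count usable under `set -e`.
    grep -c -E "socket:\[(${sockets})\]" || true
}
```

When find is denied access and its stderr is discarded, the function receives no input and prints 0, which is the situation that makes healthcheck_port report a failure even though dnsmasq is listening.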

Comment 2 Jose Luis Franco 2021-06-21 10:27:02 UTC
My suspicion was correct: after removing the "sudo -u $puser" and rerunning the healthcheck, the service completes successfully again:

[stack@undercloud-0 ~]$ sudo systemctl status tripleo_ironic_inspector_dnsmasq_healthcheck
● tripleo_ironic_inspector_dnsmasq_healthcheck.service - ironic_inspector_dnsmasq healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ironic_inspector_dnsmasq_healthcheck.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-06-21 10:25:21 UTC; 26s ago
  Process: 1029745 ExecStart=/usr/bin/podman exec --user root ironic_inspector_dnsmasq /openstack/healthcheck (code=exited, status=0/SUCCESS)
 Main PID: 1029745 (code=exited, status=0/SUCCESS)

Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Starting ironic_inspector_dnsmasq healthcheck...
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: 8
Jun 21 10:25:21 undercloud-0.redhat.local healthcheck_ironic_inspector_dnsmasq[1029745]: Checking dnsmasq port(s) 67.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: tripleo_ironic_inspector_dnsmasq_healthcheck.service: Succeeded.
Jun 21 10:25:21 undercloud-0.redhat.local systemd[1]: Started ironic_inspector_dnsmasq healthcheck.

Comment 3 Jose Luis Franco 2021-06-21 10:38:39 UTC
Digging a little deeper, this looks like a permissions issue when compared with other containers (healthcheck_port is used by other healthcheck services too). The /proc/8 directory is owned by dnsmasq, but /proc/8/fd is owned by root:

[root@undercloud-0 /]# ls -larth /proc/8/
total 0
dr-xr-xr-x. 481 root    root    0 Jun 17 12:34 ..
-r--r--r--.   1 root    root    0 Jun 21 10:19 status
-r--r--r--.   1 root    root    0 Jun 21 10:19 cmdline
dr-xr-xr-x.   9 dnsmasq dnsmasq 0 Jun 21 10:19 .
dr-x------.   2 root    root    0 Jun 21 10:19 fd
-r--r--r--.   1 root    root    0 Jun 21 10:21 stat
-r--r--r--.   1 root    root    0 Jun 21 10:34 wchan
-rw-r--r--.   1 root    root    0 Jun 21 10:34 uid_map
-rw-rw-rw-.   1 root    root    0 Jun 21 10:34 timerslack_ns
-r--r--r--.   1 root    root    0 Jun 21 10:34 timers
-rw-r--r--.   1 root    root    0 Jun 21 10:34 timens_offsets
dr-xr-xr-x.   3 dnsmasq dnsmasq 0 Jun 21 10:34 task
-r--------.   1 root    root    0 Jun 21 10:34 syscall
-r--r--r--.   1 root    root    0 Jun 21 10:34 statm
-r--------.   1 root    root    0 Jun 21 10:34 stack
-r--r--r--.   1 root    root    0 Jun 21 10:34 smaps_rollup
-r--r--r--.   1 root    root    0 Jun 21 10:34 smaps
-rw-r--r--.   1 root    root    0 Jun 21 10:34 setgroups
-r--r--r--.   1 root    root    0 Jun 21 10:34 sessionid
-r--r--r--.   1 root    root    0 Jun 21 10:34 schedstat
-rw-r--r--.   1 root    root    0 Jun 21 10:34 sched
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 root -> /
-rw-r--r--.   1 root    root    0 Jun 21 10:34 projid_map
-r--------.   1 root    root    0 Jun 21 10:34 personality
-r--------.   1 root    root    0 Jun 21 10:34 patch_state
-r--------.   1 root    root    0 Jun 21 10:34 pagemap
-rw-r--r--.   1 root    root    0 Jun 21 10:34 oom_score_adj
-r--r--r--.   1 root    root    0 Jun 21 10:34 oom_score
-rw-r--r--.   1 root    root    0 Jun 21 10:34 oom_adj
-r--r--r--.   1 root    root    0 Jun 21 10:34 numa_maps
dr-x--x--x.   2 root    root    0 Jun 21 10:34 ns
dr-xr-xr-x.   6 dnsmasq dnsmasq 0 Jun 21 10:34 net
-r--------.   1 root    root    0 Jun 21 10:34 mountstats
-r--r--r--.   1 root    root    0 Jun 21 10:34 mounts
-r--r--r--.   1 root    root    0 Jun 21 10:34 mountinfo
-rw-------.   1 root    root    0 Jun 21 10:34 mem
-r--r--r--.   1 root    root    0 Jun 21 10:34 maps
dr-x------.   2 root    root    0 Jun 21 10:34 map_files
-rw-r--r--.   1 root    root    0 Jun 21 10:34 loginuid
-r--r--r--.   1 root    root    0 Jun 21 10:34 limits
-r--------.   1 root    root    0 Jun 21 10:34 io
-rw-r--r--.   1 root    root    0 Jun 21 10:34 gid_map
dr-x------.   2 root    root    0 Jun 21 10:34 fdinfo
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 exe -> /usr/sbin/dnsmasq
-r--------.   1 root    root    0 Jun 21 10:34 environ
lrwxrwxrwx.   1 root    root    0 Jun 21 10:34 cwd -> /
-r--r--r--.   1 root    root    0 Jun 21 10:34 cpuset
-r--r--r--.   1 root    root    0 Jun 21 10:34 cpu_resctrl_groups
-rw-r--r--.   1 root    root    0 Jun 21 10:34 coredump_filter
-rw-r--r--.   1 root    root    0 Jun 21 10:34 comm
--w-------.   1 root    root    0 Jun 21 10:34 clear_refs
-r--r--r--.   1 root    root    0 Jun 21 10:34 cgroup
-r--------.   1 root    root    0 Jun 21 10:34 auxv
-rw-r--r--.   1 root    root    0 Jun 21 10:34 autogroup
dr-xr-xr-x.   2 dnsmasq dnsmasq 0 Jun 21 10:34 attr

That is why we get "Permission denied" when searching it. For comparison, here is mistral_engine, which also invokes healthcheck_port in its healthcheck:

[root@undercloud-0 /]# ps -edf
UID          PID    PPID  C STIME TTY          TIME CMD
mistral        1       0  0 Jun17 ?        00:00:00 dumb-init --single-child -- kolla_start
mistral        7       1  0 Jun17 ?        00:34:58 /usr/bin/python3 /usr/bin/mistral-server --config-file=/etc/mistral/mistral.conf --log-file=/var/log/m
root       71904       0  1 10:36 pts/0    00:00:00 bash
root       71917   71904  0 10:37 pts/0    00:00:00 ps -edf
[root@undercloud-0 /]# ls -larth /proc/7
ls: cannot read symbolic link '/proc/7/cwd': Permission denied
ls: cannot read symbolic link '/proc/7/root': Permission denied
ls: cannot read symbolic link '/proc/7/exe': Permission denied
total 0
dr-xr-xr-x. 479 root    root    0 Jun 17 12:33 ..
dr-xr-xr-x.   9 mistral mistral 0 Jun 21 08:41 .
dr-x------.   2 mistral mistral 0 Jun 21 08:41 fd
-r--r--r--.   1 mistral mistral 0 Jun 21 10:06 status
-r--r--r--.   1 mistral mistral 0 Jun 21 10:06 cmdline
-r--r--r--.   1 mistral mistral 0 Jun 21 10:29 stat
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 wchan
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 uid_map
-rw-rw-rw-.   1 mistral mistral 0 Jun 21 10:37 timerslack_ns
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 timers
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 timens_offsets
dr-xr-xr-x.   3 mistral mistral 0 Jun 21 10:37 task
-r--------.   1 mistral mistral 0 Jun 21 10:37 syscall
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 statm
-r--------.   1 mistral mistral 0 Jun 21 10:37 stack
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 smaps_rollup
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 smaps
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 setgroups
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 sessionid
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 schedstat
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 sched
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 root
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 projid_map
-r--------.   1 mistral mistral 0 Jun 21 10:37 personality
-r--------.   1 mistral mistral 0 Jun 21 10:37 patch_state
-r--------.   1 mistral mistral 0 Jun 21 10:37 pagemap
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_score_adj
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_score
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 oom_adj
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 numa_maps
dr-x--x--x.   2 mistral mistral 0 Jun 21 10:37 ns
dr-xr-xr-x.   6 mistral mistral 0 Jun 21 10:37 net
-r--------.   1 mistral mistral 0 Jun 21 10:37 mountstats
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 mounts
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 mountinfo
-rw-------.   1 mistral mistral 0 Jun 21 10:37 mem
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 maps
dr-x------.   2 mistral mistral 0 Jun 21 10:37 map_files
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 loginuid
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 limits
-r--------.   1 mistral mistral 0 Jun 21 10:37 io
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 gid_map
dr-x------.   2 mistral mistral 0 Jun 21 10:37 fdinfo
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 exe
-r--------.   1 mistral mistral 0 Jun 21 10:37 environ
lrwxrwxrwx.   1 mistral mistral 0 Jun 21 10:37 cwd
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cpuset
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cpu_resctrl_groups
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 coredump_filter
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 comm
--w-------.   1 mistral mistral 0 Jun 21 10:37 clear_refs
-r--r--r--.   1 mistral mistral 0 Jun 21 10:37 cgroup
-r--------.   1 mistral mistral 0 Jun 21 10:37 auxv
-rw-r--r--.   1 mistral mistral 0 Jun 21 10:37 autogroup
dr-xr-xr-x.   2 mistral mistral 0 Jun 21 10:37 attr


Everything under /proc/7 is owned by mistral.
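
The ownership comparison between the two listings can be automated with a short filter. This is a sketch under stated assumptions: owner_mismatches is a hypothetical name, and it assumes the ls -la column layout shown above (owner in column 3, entry name in column 9).

```shell
# Sketch: read `ls -la /proc/<pid>/` output on stdin and print every entry
# whose owner differs from the owner of "." (the process directory itself).
# owner_mismatches is a hypothetical name used only for illustration.
owner_mismatches() {
    awk '
        # Skip the "total" line and the parent directory entry.
        NF >= 9 && $9 != ".." { owner[$9] = $3; order[n++] = $9 }
        END {
            for (i = 0; i < n; i++)
                if (owner[order[i]] != owner["."])
                    print order[i], owner[order[i]]
        }'
}
```

Fed the dnsmasq listing, this would report fd (among others) as owned by root; fed the mistral listing, it should print nothing, since every entry there is owned by mistral.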

Comment 6 Rabi Mishra 2021-10-07 08:31:36 UTC
*** Bug 2011676 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2022-03-23 22:10:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001

