This bug was initially created as a copy of Bug #1979524.

I am copying this bug because this issue also affects 16.1, which wasn't mentioned back then. Basically, we need to get the following Change-Id in: 248622aa5aa13dd9c498e976e368bdd5f5e4008a

Description of problem:

The FFWD 13 to 16.2 composable roles job fails when upgrading the first controller, during the healthcheck verification commands:

2021-07-05 14:25:32 | 2021-07-05 14:25:31.412461 | 52540096-c27d-6344-9f0b-000000003207 | TIMING | Get nova-api healthcheck status | controller-0 | 0:20:32.651323 | 182.97s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.475215 | 52540096-c27d-6344-9f0b-000000003208 | TASK | Fail if nova-api healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:31.534673 | 52540096-c27d-6344-9f0b-000000003208 | SKIPPED | Fail if nova-api healthcheck report failed status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:31.536229 | 52540096-c27d-6344-9f0b-000000003208 | TIMING | Fail if nova-api healthcheck report failed status | controller-0 | 0:20:32.775062 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.609007 | 52540096-c27d-6344-9f0b-00000000320a | TASK | Get nova-conductor healthcheck status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.086642 | 52540096-c27d-6344-9f0b-00000000320a | OK | Get nova-conductor healthcheck status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:32.101296 | 52540096-c27d-6344-9f0b-00000000320a | TIMING | Get nova-conductor healthcheck status | controller-0 | 0:20:33.340073 | 0.49s
2021-07-05 14:25:32 | 2021-07-05 14:25:32.178409 | 52540096-c27d-6344-9f0b-00000000320b | TASK | Fail if nova-conductor healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.239516 | 52540096-c27d-6344-9f0b-00000000320b | FATAL | Fail if nova-conductor healthcheck report failed status | controller-0 | error={"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}
2021-07-05 14:25:32 | 2021-07-05 14:25:32.241538 | 52540096-c27d-6344-9f0b-00000000320b | TIMING | Fail if nova-conductor healthcheck report failed status | controller-0 | 0:20:33.480353 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:32 | PLAY RECAP *********************************************************************
2021-07-05 14:25:32 | controller-0 : ok=322 changed=177 unreachable=0 failed=1 skipped=167 rescued=0 ignored=0
2021-07-05 14:25:32 | database-0 : ok=267 changed=143 unreachable=0 failed=0 skipped=149 rescued=0 ignored=0
2021-07-05 14:25:32 | messaging-0 : ok=265 changed=144 unreachable=0 failed=0 skipped=152 rescued=0 ignored=0
2021-07-05 14:25:32 | networker-0 : ok=288 changed=150 unreachable=0 failed=0 skipped=150 rescued=0 ignored=0

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/61/undercloud-0/home/stack/overcloud_upgrade_run-controller-0,database-0,messaging-0,networker-0.log.gz

When running the healthcheck script on the controller-0 node, we can see:

[root@controller-0 /]# bash -x /openstack/healthcheck 5672
+ . /usr/share/openstack-tripleo-common/healthcheck/common.sh
++ : 0
++ '[' 0 -ne 0 ']'
++ exec
++ : 10
++ : curl-healthcheck
++ : pyrequests-healthcheck
++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
++ : /dev/null
+ process=nova-conductor
+ args=5672
+ healthcheck_port nova-conductor 5672
+ process=nova-conductor
+ shift 1
+ ports=
++ get_user_from_process nova-conductor
++ process=nova-conductor
+++ pgrep -d , -f nova-conductor
++ pid=7,14,15
++ ps -h -q7,14,15 -o user
++ head -n1
+ puser=nova
+ for p in $@
++ printf %0.4x 5672
+ ports='|1628'
+ ports=':(1628)'
++ awk -i join -v 'm=:(1628)' '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/tcp /proc/net/udp
+ sockets=
+ test -z
+ exit 1

Digging in a little more, the issue seems to occur because the common.sh script expects the port to be in use on the same controller, whereas in this composable roles job the rabbitmq service runs on the messaging node:

[nova@controller-0 /]$ lsof -P -p 15 | grep -i tcp
nova-cond 15 nova 5u sock 0,9 0t0 1154005 protocol: TCPv6
nova-cond 15 nova 8u IPv6 18828319 0t0 TCP controller-0.redhat.local:50915->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 9u IPv6 1154528 0t0 TCP controller-0.redhat.local:55380->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 10u IPv6 1303330 0t0 TCP controller-0.redhat.local:45688->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 11u IPv6 18913527 0t0 TCP controller-0.redhat.local:57781->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 12u IPv6 2275861 0t0 TCP controller-0.redhat.local:57642->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 13u IPv6 18966674 0t0 TCP controller-0.redhat.local:37153->overcloud.internalapi.localdomain:3306 (ESTABLISHED)

Version-Release number of selected component (if applicable):

How reproducible:
Running CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:
The upgrade fails because the healthcheck validation gives a false negative.

Expected results:
The healthcheck passes, and so does the upgrade.

Additional info:
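For reference, the check that fails in the bash -x trace above boils down to converting the port to 4-digit hex and scanning the container's /proc/net tables for a matching socket. The sketch below is a simplified reconstruction of that matching logic for illustration only (function names and structure are my own, not the shipped common.sh):

```shell
#!/bin/sh
# Simplified sketch of the port-matching logic visible in the trace above.
# NOT the actual common.sh: names and structure here are hypothetical.

# /proc/net/tcp stores ports as 4-digit lowercase hex, e.g. 5672 -> 1628.
port_to_hex() {
    printf '%0.4x' "$1"
}

# Print the inode ($10) of any socket whose local ($2) or remote ($3)
# address ends in the given port. Note that only /proc/net/tcp and
# /proc/net/udp are scanned, so sockets that exist only in the IPv6
# tables (/proc/net/tcp6, /proc/net/udp6) are never matched.
find_socket_inode() {
    hex=$(port_to_hex "$1")
    awk -v m=":(${hex})" \
        '{ if ($2 ~ m || $3 ~ m) print $10 }' \
        /proc/net/tcp /proc/net/udp
}

port_to_hex 5672   # prints 1628, matching the trace's ports=':(1628)'
```

An empty result makes the healthcheck `exit 1`, which is exactly the `sockets=` / `test -z` / `exit 1` sequence seen in the trace.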
This BZ was originally found in 16.2 using the ffu job in the description. The 16.1 version of that job, DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr, executes without hitting the healthcheck failure.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762