Description of problem:

The FFWD 13 to 16.2 composable roles job is failing when upgrading the first controller during the healthcheck verification commands:

2021-07-05 14:25:32 | 2021-07-05 14:25:31.412461 | 52540096-c27d-6344-9f0b-000000003207 | TIMING | Get nova-api healthcheck status | controller-0 | 0:20:32.651323 | 182.97s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.475215 | 52540096-c27d-6344-9f0b-000000003208 | TASK | Fail if nova-api healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:31.534673 | 52540096-c27d-6344-9f0b-000000003208 | SKIPPED | Fail if nova-api healthcheck report failed status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:31.536229 | 52540096-c27d-6344-9f0b-000000003208 | TIMING | Fail if nova-api healthcheck report failed status | controller-0 | 0:20:32.775062 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.609007 | 52540096-c27d-6344-9f0b-00000000320a | TASK | Get nova-conductor healthcheck status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.086642 | 52540096-c27d-6344-9f0b-00000000320a | OK | Get nova-conductor healthcheck status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:32.101296 | 52540096-c27d-6344-9f0b-00000000320a | TIMING | Get nova-conductor healthcheck status | controller-0 | 0:20:33.340073 | 0.49s
2021-07-05 14:25:32 | 2021-07-05 14:25:32.178409 | 52540096-c27d-6344-9f0b-00000000320b | TASK | Fail if nova-conductor healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.239516 | 52540096-c27d-6344-9f0b-00000000320b | FATAL | Fail if nova-conductor healthcheck report failed status | controller-0 | error={"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}
2021-07-05 14:25:32 | 2021-07-05 14:25:32.241538 | 52540096-c27d-6344-9f0b-00000000320b | TIMING | Fail if nova-conductor healthcheck report failed status | controller-0 | 0:20:33.480353 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:32 | PLAY RECAP
*********************************************************************
2021-07-05 14:25:32 | controller-0 : ok=322 changed=177 unreachable=0 failed=1 skipped=167 rescued=0 ignored=0
2021-07-05 14:25:32 | database-0 : ok=267 changed=143 unreachable=0 failed=0 skipped=149 rescued=0 ignored=0
2021-07-05 14:25:32 | messaging-0 : ok=265 changed=144 unreachable=0 failed=0 skipped=152 rescued=0 ignored=0
2021-07-05 14:25:32 | networker-0 : ok=288 changed=150 unreachable=0 failed=0 skipped=150 rescued=0 ignored=0

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/61/undercloud-0/home/stack/overcloud_upgrade_run-controller-0,database-0,messaging-0,networker-0.log.gz

When running the healthcheck script on the controller-0 node, we can see:

[root@controller-0 /]# bash -x /openstack/healthcheck 5672
+ . /usr/share/openstack-tripleo-common/healthcheck/common.sh
++ : 0
++ '[' 0 -ne 0 ']'
++ exec
++ : 10
++ : curl-healthcheck
++ : pyrequests-healthcheck
++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
++ : /dev/null
+ process=nova-conductor
+ args=5672
+ healthcheck_port nova-conductor 5672
+ process=nova-conductor
+ shift 1
+ ports=
++ get_user_from_process nova-conductor
++ process=nova-conductor
+++ pgrep -d , -f nova-conductor
++ pid=7,14,15
++ ps -h -q7,14,15 -o user
++ head -n1
+ puser=nova
+ for p in $@
++ printf %0.4x 5672
+ ports='|1628'
+ ports=':(1628)'
++ awk -i join -v 'm=:(1628)' '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/tcp /proc/net/udp
+ sockets=
+ test -z
+ exit 1

Digging in a little more, the issue seems to occur because the common.sh script expects the port to be open on the same controller, while in this composable roles job the rabbitmq service is running on the messaging node:

[nova@controller-0 /]$ lsof -P -p 15 | grep -i tcp
nova-cond 15 nova 5u sock 0,9 0t0 1154005 protocol: TCPv6
nova-cond 15 nova 8u IPv6 18828319 0t0 TCP controller-0.redhat.local:50915->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 9u IPv6 1154528 0t0 TCP controller-0.redhat.local:55380->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 10u IPv6 1303330 0t0 TCP controller-0.redhat.local:45688->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 11u IPv6 18913527 0t0 TCP controller-0.redhat.local:57781->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 12u IPv6 2275861 0t0 TCP controller-0.redhat.local:57642->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 13u IPv6 18966674 0t0 TCP controller-0.redhat.local:37153->overcloud.internalapi.localdomain:3306 (ESTABLISHED)

Version-Release number of selected component (if applicable):

How reproducible:
Running the CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:
The upgrade fails because the healthcheck validation gives a false negative.

Expected results:
The healthcheck passes, and so does the upgrade.

Additional info:
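For context, the failing check converts the decimal port to 4-digit hex and scans the kernel socket tables for it. A minimal standalone sketch of that logic, adapted from the healthcheck_port trace above (the function names here are hypothetical; the real code lives in /usr/share/openstack-tripleo-common/healthcheck/common.sh):

```shell
# /proc/net/tcp and friends list ports in 4-digit hex,
# so 5672 (AMQP) becomes 1628.
port_to_hex() {
    printf '%0.4x' "$1"
}

# Match the hex port against the local ($2) or remote ($3) address
# column of the given socket table files and print the socket inode
# ($10) for every hit. IGNORECASE only takes effect under gawk, which
# is what the real script uses.
sockets_on_port() {
    m=":($(port_to_hex "$1"))"
    shift
    awk -v m="$m" '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) print $10}' "$@"
}

port_to_hex 5672   # prints 1628
```

If `sockets_on_port` prints nothing, the healthcheck concludes the service has no connection on that port and exits 1 — which is exactly what happens in the trace above.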
Sergii found out the actual issue: the IPv6 network was overlooked in the healthcheck_port method. The patch is therefore really easy: it's just a matter of adding two files, tcp6 and udp6, to the check. I'm on it! Thanks José and Sergii for your time - I didn't think about v6 back then -.-'. Sorry! Cheers, C.
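The fix described above can be sketched as follows. On an IPv6 deployment the nova-conductor connections to rabbitmq only appear in /proc/net/tcp6, so the original tcp/udp-only scan came back empty and the healthcheck exited 1; passing the v6 tables to the same awk scan makes them visible. This is a hedged simplification with a hypothetical function name, not the exact common.sh patch:

```shell
# Same matching logic as before: print the socket inode ($10) of every
# entry whose local ($2) or remote ($3) address column matches the hex
# port pattern m, across all socket table files passed in.
find_sockets() {
    m="$1"
    shift
    awk -v m="$m" '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) print $10}' "$@"
}

# Before the fix, only the IPv4 tables were scanned:
#   find_sockets ':(1628)' /proc/net/tcp /proc/net/udp
# After the fix, the IPv6 tables are included as well:
#   find_sockets ':(1628)' /proc/net/tcp /proc/net/udp \
#                          /proc/net/tcp6 /proc/net/udp6
```

With the v6 tables in the file list, the established tcp6 connections to messaging-0:5672 shown in the lsof output are found and the healthcheck passes.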
The failed nova-conductor healthcheck is no longer seen in the job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483