This bug was initially created as a copy of Bug #1979524.

I am copying this bug because this issue also affects 16.1, which wasn't mentioned back then. Basically, we need to get the following Change-Id in: 248622aa5aa13dd9c498e976e368bdd5f5e4008a

Description of problem:

The FFWD 13 to 16.2 composable roles job fails when upgrading the first controller, during the healthcheck verification commands:

2021-07-05 14:25:32 | 2021-07-05 14:25:31.412461 | 52540096-c27d-6344-9f0b-000000003207 | TIMING | Get nova-api healthcheck status | controller-0 | 0:20:32.651323 | 182.97s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.475215 | 52540096-c27d-6344-9f0b-000000003208 | TASK | Fail if nova-api healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:31.534673 | 52540096-c27d-6344-9f0b-000000003208 | SKIPPED | Fail if nova-api healthcheck report failed status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:31.536229 | 52540096-c27d-6344-9f0b-000000003208 | TIMING | Fail if nova-api healthcheck report failed status | controller-0 | 0:20:32.775062 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.609007 | 52540096-c27d-6344-9f0b-00000000320a | TASK | Get nova-conductor healthcheck status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.086642 | 52540096-c27d-6344-9f0b-00000000320a | OK | Get nova-conductor healthcheck status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:32.101296 | 52540096-c27d-6344-9f0b-00000000320a | TIMING | Get nova-conductor healthcheck status | controller-0 | 0:20:33.340073 | 0.49s
2021-07-05 14:25:32 | 2021-07-05 14:25:32.178409 | 52540096-c27d-6344-9f0b-00000000320b | TASK | Fail if nova-conductor healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.239516 | 52540096-c27d-6344-9f0b-00000000320b | FATAL | Fail if nova-conductor healthcheck report failed status | controller-0 | error={"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}
2021-07-05 14:25:32 | 2021-07-05 14:25:32.241538 | 52540096-c27d-6344-9f0b-00000000320b | TIMING | Fail if nova-conductor healthcheck report failed status | controller-0 | 0:20:33.480353 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:32 | PLAY RECAP *********************************************************************
2021-07-05 14:25:32 | controller-0 : ok=322 changed=177 unreachable=0 failed=1 skipped=167 rescued=0 ignored=0
2021-07-05 14:25:32 | database-0 : ok=267 changed=143 unreachable=0 failed=0 skipped=149 rescued=0 ignored=0
2021-07-05 14:25:32 | messaging-0 : ok=265 changed=144 unreachable=0 failed=0 skipped=152 rescued=0 ignored=0
2021-07-05 14:25:32 | networker-0 : ok=288 changed=150 unreachable=0 failed=0 skipped=150 rescued=0 ignored=0

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/61/undercloud-0/home/stack/overcloud_upgrade_run-controller-0,database-0,messaging-0,networker-0.log.gz

When running the healthcheck script on the controller-0 node, we can see:

[root@controller-0 /]# bash -x /openstack/healthcheck 5672
+ . /usr/share/openstack-tripleo-common/healthcheck/common.sh
++ : 0
++ '[' 0 -ne 0 ']'
++ exec
++ : 10
++ : curl-healthcheck
++ : pyrequests-healthcheck
++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
++ : /dev/null
+ process=nova-conductor
+ args=5672
+ healthcheck_port nova-conductor 5672
+ process=nova-conductor
+ shift 1
+ ports=
++ get_user_from_process nova-conductor
++ process=nova-conductor
+++ pgrep -d , -f nova-conductor
++ pid=7,14,15
++ ps -h -q7,14,15 -o user
++ head -n1
+ puser=nova
+ for p in $@
++ printf %0.4x 5672
+ ports='|1628'
+ ports=':(1628)'
++ awk -i join -v 'm=:(1628)' '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/tcp /proc/net/udp
+ sockets=
+ test -z
+ exit 1

Digging in a little more, the issue seems to occur because the common.sh script expects the port to be in use on the same controller, whereas in this composable roles job the rabbitmq service runs on the messaging node:

[nova@controller-0 /]$ lsof -P -p 15 | grep -i tcp
nova-cond 15 nova 5u sock 0,9 0t0 1154005 protocol: TCPv6
nova-cond 15 nova 8u IPv6 18828319 0t0 TCP controller-0.redhat.local:50915->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 9u IPv6 1154528 0t0 TCP controller-0.redhat.local:55380->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 10u IPv6 1303330 0t0 TCP controller-0.redhat.local:45688->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 11u IPv6 18913527 0t0 TCP controller-0.redhat.local:57781->overcloud.internalapi.localdomain:3306 (ESTABLISHED)
nova-cond 15 nova 12u IPv6 2275861 0t0 TCP controller-0.redhat.local:57642->messaging-0.redhat.local:5672 (ESTABLISHED)
nova-cond 15 nova 13u IPv6 18966674 0t0 TCP controller-0.redhat.local:37153->overcloud.internalapi.localdomain:3306 (ESTABLISHED)

Version-Release number of selected component (if applicable):

How reproducible:
Running CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:
The upgrade fails because the healthcheck validation gives a false negative.

Expected results:
The healthcheck passes, and so does the upgrade.

Additional info:
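For reference, the check that fails in the bash -x trace above boils down to converting the port to 4-digit hex and scanning the container's /proc/net tables for a matching socket. The sketch below is a simplified reconstruction of that matching logic for illustration only (function names and structure are my own, not the shipped common.sh):

```shell
#!/bin/sh
# Simplified sketch of the port-matching logic visible in the trace above.
# NOT the actual common.sh: names and structure here are hypothetical.

# /proc/net/tcp stores ports as 4-digit lowercase hex, e.g. 5672 -> 1628.
port_to_hex() {
    printf '%0.4x' "$1"
}

# Print the inode ($10) of any socket whose local ($2) or remote ($3)
# address ends in the given port. Note that only /proc/net/tcp and
# /proc/net/udp are scanned, so sockets that exist only in the IPv6
# tables (/proc/net/tcp6, /proc/net/udp6) are never matched.
find_socket_inode() {
    hex=$(port_to_hex "$1")
    awk -v m=":(${hex})" \
        '{ if ($2 ~ m || $3 ~ m) print $10 }' \
        /proc/net/tcp /proc/net/udp
}

port_to_hex 5672   # prints 1628, matching the trace's ports=':(1628)'
```

An empty result makes the healthcheck `exit 1`, which is exactly the `sockets=` / `test -z` / `exit 1` sequence seen in the trace.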
This BZ was originally found in 16.2 using the ffu job in the description. The 16.1 version of that job, DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr, executes without hitting the healthcheck failure.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762