Bug 1979524 - nova-conductor isn't working (healthcheck failed) in composable roles
Summary: nova-conductor isn't working (healthcheck failed) in composable roles
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Cédric Jeanneret
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-06 09:48 UTC by Jose Luis Franco
Modified: 2021-11-15 08:31 UTC (History)

Fixed In Version: openstack-tripleo-common-11.6.1-2.20210603180856.el8ost.3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 07:16:37 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1934772 0 None None None 2021-07-06 11:18:46 UTC
OpenStack gerrit 799643 0 None NEW Add missing IPv6 support for healthcheck_port 2021-07-06 11:18:46 UTC
Red Hat Issue Tracker OSP-5844 0 None None None 2021-11-15 08:31:36 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:17:00 UTC

Description Jose Luis Franco 2021-07-06 09:48:07 UTC
Description of problem:

The FFWD 13 to 16.2 composable roles job fails while upgrading the first controller, during the healthcheck verification commands:

2021-07-05 14:25:32 | 2021-07-05 14:25:31.412461 | 52540096-c27d-6344-9f0b-000000003207 |     TIMING | Get nova-api healthcheck status | controller-0 | 0:20:32.651323 | 182.97s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.475215 | 52540096-c27d-6344-9f0b-000000003208 |       TASK | Fail if nova-api healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:31.534673 | 52540096-c27d-6344-9f0b-000000003208 |    SKIPPED | Fail if nova-api healthcheck report failed status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:31.536229 | 52540096-c27d-6344-9f0b-000000003208 |     TIMING | Fail if nova-api healthcheck report failed status | controller-0 | 0:20:32.775062 | 0.06s
2021-07-05 14:25:32 | 2021-07-05 14:25:31.609007 | 52540096-c27d-6344-9f0b-00000000320a |       TASK | Get nova-conductor healthcheck status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.086642 | 52540096-c27d-6344-9f0b-00000000320a |         OK | Get nova-conductor healthcheck status | controller-0
2021-07-05 14:25:32 | 2021-07-05 14:25:32.101296 | 52540096-c27d-6344-9f0b-00000000320a |     TIMING | Get nova-conductor healthcheck status | controller-0 | 0:20:33.340073 | 0.49s
2021-07-05 14:25:32 | 2021-07-05 14:25:32.178409 | 52540096-c27d-6344-9f0b-00000000320b |       TASK | Fail if nova-conductor healthcheck report failed status
2021-07-05 14:25:32 | 2021-07-05 14:25:32.239516 | 52540096-c27d-6344-9f0b-00000000320b |      FATAL | Fail if nova-conductor healthcheck report failed status | controller-0 | error={"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}
2021-07-05 14:25:32 | 2021-07-05 14:25:32.241538 | 52540096-c27d-6344-9f0b-00000000320b |     TIMING | Fail if nova-conductor healthcheck report failed status | controller-0 | 0:20:33.480353 | 0.06s
2021-07-05 14:25:32 | 
2021-07-05 14:25:32 | PLAY RECAP *********************************************************************
2021-07-05 14:25:32 | controller-0               : ok=322  changed=177  unreachable=0    failed=1    skipped=167  rescued=0    ignored=0   
2021-07-05 14:25:32 | database-0                 : ok=267  changed=143  unreachable=0    failed=0    skipped=149  rescued=0    ignored=0   
2021-07-05 14:25:32 | messaging-0                : ok=265  changed=144  unreachable=0    failed=0    skipped=152  rescued=0    ignored=0   
2021-07-05 14:25:32 | networker-0                : ok=288  changed=150  unreachable=0    failed=0    skipped=150  rescued=0    ignored=0   

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/61/undercloud-0/home/stack/overcloud_upgrade_run-controller-0,database-0,messaging-0,networker-0.log.gz

When running the healthcheck script in the controller-0 node, we could see:

[root@controller-0 /]# bash -x /openstack/healthcheck 5672                                                                                                                                     
+ . /usr/share/openstack-tripleo-common/healthcheck/common.sh                                                                                                                                  
++ : 0                                                                                                                                                                                         
++ '[' 0 -ne 0 ']'                                                                                                                                                                             
++ exec                                                                                                                                                                                        
++ : 10                                                                                                                                                                                        
++ : curl-healthcheck                                                                                                                                                                          
++ : pyrequests-healthcheck                                                                                                                                                                    
++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'                                                                                                                
++ : /dev/null                                                                                                                                                                                 
+ process=nova-conductor                                                                                                                                                                       
+ args=5672                                                                                                                                                                                    
+ healthcheck_port nova-conductor 5672                                                                                                                                                         
+ process=nova-conductor                                                                                                                                                                       
+ shift 1                                                                                                                                                                                      
+ ports=                                                                                                                                                                                       
++ get_user_from_process nova-conductor                                                                                                                                                        
++ process=nova-conductor                                                                                                                                                                      
+++ pgrep -d , -f nova-conductor                                                                                                                                                               
++ pid=7,14,15                                                                                                                                                                                 
++ ps -h -q7,14,15 -o user                                                                                                                                                                     
++ head -n1                                                                                                                                                                                    
+ puser=nova                                                                                                                                                                                   
+ for p in $@                                                                                                                                                                                  
++ printf %0.4x 5672                                                                                                                                                                           
+ ports='|1628'                                                                                                                                                                                
+ ports=':(1628)'                                                                                                                                                                              
++ awk -i join -v 'm=:(1628)' '{IGNORECASE=1; if ($2 ~ m || $3 ~ m) {output[counter++] = $10} } END{if (length(output)>0) {print join(output, 0, length(output)-1, "|")}}' /proc/net/tcp /proc/net/udp
+ sockets=                                                                                                                                                                                     
+ test -z                                                                                                                                                                                      
+ exit 1   
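
The matching step in the trace above converts the port to four hex digits (5672 becomes 1628) and compares it against the local and remote address columns of the kernel socket tables, keeping column 10 (the socket inode). A minimal sketch of that step follows. The helper name `socket_inodes_for_port` is hypothetical and not part of common.sh, and it uses plain awk rather than the gawk `join` library seen in the trace. The tables are passed as arguments so the function can be tried against sample data.

```shell
#!/bin/bash
# Hypothetical helper (not part of common.sh): print the inode (column 10) of
# every socket whose local or remote port is $1, reading the socket tables
# passed as the remaining arguments.
socket_inodes_for_port() {
    local port_hex table
    # 0-pad to four hex digits; the kernel prints them in upper case.
    port_hex=$(printf '%04X' "$1")
    shift
    for table in "$@"; do
        [ -r "$table" ] || continue   # skip tables that do not exist
        # Column 2 is the local "ADDR:PORT", column 3 the remote one.
        awk -v m=":${port_hex}\$" '$2 ~ m || $3 ~ m { print $10 }' "$table"
    done
}

# Example call with the same two tables as the trace:
#   socket_inodes_for_port 5672 /proc/net/tcp /proc/net/udp
```

An empty result from this call corresponds to the empty `sockets=` value in the trace, which is what makes the script exit with status 1.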

Digging a little further, the issue seems to occur because the common.sh script expects the port to be in use on the controller itself, whereas in this composable roles job the rabbitmq service runs on the messaging node:

[nova@controller-0 /]$ lsof -P -p 15 | grep -i tcp                                                                                                                                             
nova-cond  15 nova    5u     sock      0,9      0t0   1154005 protocol: TCPv6                                                                                                                  
nova-cond  15 nova    8u     IPv6 18828319      0t0       TCP controller-0.redhat.local:50915->overcloud.internalapi.localdomain:3306 (ESTABLISHED)                                            
nova-cond  15 nova    9u     IPv6  1154528      0t0       TCP controller-0.redhat.local:55380->messaging-0.redhat.local:5672 (ESTABLISHED)                                                     
nova-cond  15 nova   10u     IPv6  1303330      0t0       TCP controller-0.redhat.local:45688->messaging-0.redhat.local:5672 (ESTABLISHED)                                                     
nova-cond  15 nova   11u     IPv6 18913527      0t0       TCP controller-0.redhat.local:57781->overcloud.internalapi.localdomain:3306 (ESTABLISHED)                                            
nova-cond  15 nova   12u     IPv6  2275861      0t0       TCP controller-0.redhat.local:57642->messaging-0.redhat.local:5672 (ESTABLISHED)                                                     
nova-cond  15 nova   13u     IPv6 18966674      0t0       TCP controller-0.redhat.local:37153->overcloud.internalapi.localdomain:3306 (ESTABLISHED)       


Version-Release number of selected component (if applicable):


How reproducible:

Running CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-latest_cdn-3cont_3db_3msg_2net_3hci-ipv6-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:

The upgrade fails because the healthcheck validation gives a false negative.

Expected results:

The healthcheck passes and the upgrade completes.

Additional info:

Comment 2 Cédric Jeanneret 2021-07-06 11:13:36 UTC
Sergii found the actual issue: IPv6 networking was overlooked in the healthcheck_port method. The patch is therefore really easy: it's just a matter of adding 2 files, tcp6 and udp6, to the check.

I'm on it!

Thanks José and Sergii for your time - I didn't think about v6 back then -.-'. Sorry!

Cheers,

C.
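
The idea in the comment above, checking the IPv6 tables as well as the IPv4 ones, can be pictured with the sketch below. It is an illustration under assumptions, not the merged patch. The function name `port_in_use` and the `PROC_NET` override (used to point the check at sample data) are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch, not the merged patch: report the first socket table
# (IPv4 or IPv6) that contains the given port as a local or remote port.
# PROC_NET is an assumed override for the table directory, for testing.
port_in_use() {
    local port_hex table dir
    dir="${PROC_NET:-/proc/net}"
    port_hex=$(printf '%04X' "$1")      # e.g. 5672 -> 1628
    for table in tcp udp tcp6 udp6; do
        [ -r "${dir}/${table}" ] || continue
        # Exit status 0 from awk means at least one row matched the port.
        if awk -v m=":${port_hex}\$" \
            '$2 ~ m || $3 ~ m { found = 1 } END { exit !found }' \
            "${dir}/${table}"; then
            echo "${table}"
            return 0
        fi
    done
    return 1
}
```

On a host where nova-conductor reaches rabbitmq over IPv6, as in the lsof output from the description, a call such as `port_in_use 5672` would be expected to print `tcp6`, while a check limited to `tcp` and `udp` finds nothing.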

Comment 18 errata-xmlrpc 2021-09-15 07:16:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

