Bug 1619092

Summary: [InstanceHA] check-run-nova-compute connecting by default to publicURL
Product: Red Hat OpenStack
Reporter: Robin Cernin <rcernin>
Component: openstack-tripleo-heat-templates
Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA
QA Contact: pkomarov
Severity: urgent
Docs Contact:
Priority: urgent
Version: 13.0 (Queens)
CC: abeekhof, akaris, berrange, chintha.govardhan, chjones, cswanson, dasmith, eglynn, fdinitto, jhakimra, jraju, jschluet, jsisul, kchamart, lmarsh, mburns, michele, mschuppe, msufiyan, nm-s, pkomarov, rcernin, sbauza, sgordon, srevivo, vromanso
Target Milestone: z3
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.4-24.el7ost
Doc Type: Bug Fix
Doc Text:
One of the instance HA scripts connected to the publicURL keystone endpoint. This has now been moved to the internalURL endpoint by default. Additionally, an operator can override this via the '[placement]/valid_interfaces' configuration entry point in nova.conf.
Story Points: ---
Clone Of:
: 1637805 (view as bug list) Environment:
Last Closed: 2018-11-13 22:28:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1623181    
Bug Blocks: 1581398, 1637805, 1639358    

Description Robin Cernin 2018-08-20 04:44:51 UTC
Description of problem:

If the architecture design doesn't allow the compute nodes to access the publicURL, check-run-nova-compute will fail and mark the container as unhealthy.

https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/extraconfig/tasks/instanceha/check-run-nova-compute#L112-L132

As a workaround, it is possible to modify the client.Client call in check-run-nova-compute to point to the internalURL endpoint instead:

        if clientargs:
            # OSP < Ocata
            # ArgSpec(args=['version', 'username', 'password', 'project_id', 'auth_url'],
            #         varargs=None,
            #         keywords='kwargs', defaults=(None, None, None, None))
            nova = client.Client(version,
                                 None, # User
                                 None, # Password
                                 None, # Tenant
                                 None, # Auth URL
                                 insecure=options["insecure"],
                                 region_name=options["os_region_name"][0],
                                 session=keystone_session, auth=keystone_auth,
                                 http_log_debug=options.has_key("verbose"),
                                 endpoint_type='internalURL')
        else:
            # OSP >= Ocata
            # ArgSpec(args=['version'], varargs='args', keywords='kwargs', defaults=None)
            nova = client.Client(version,
                                 region_name=options["os_region_name"][0],
                                 session=keystone_session, auth=keystone_auth,
                                 http_log_debug=options.has_key("verbose"),
                                 endpoint_type='internalURL') 

However, this configuration is not the default and cannot be set through director. Does this mean that the publicURL always has to be reachable from the compute nodes?

Version-Release number of selected component (if applicable):

Queens OSP13

Comment 14 Cody Swanson 2018-09-12 20:49:35 UTC
One thing I've noted with this change is the path to the python executable appears to change between versions.

openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch.rpm

head -n 1 /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/check-run-nova-compute
#!/bin/python -utt

openstack-tripleo-heat-templates-8.0.4-24.el7ost.noarch.rpm

head -n 1 /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/check-run-nova-compute
#!/usr/bin/python -utt

Not sure if this change is intentional or not, both paths are valid on my RHOS13 undercloud lab system:

[root@undercloud-0 ~]# ls -ld /bin/python /usr/bin/python
lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /bin/python -> python2
lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /usr/bin/python -> python2

I just wanted to highlight this in case it is an unintended regression.

Comment 15 Michele Baldessari 2018-09-12 20:51:57 UTC
(In reply to Cody Swanson from comment #14)
> One thing I've noted with this change is the path to the python executable
> appears to change between versions.
> 
> openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch.rpm
> 
> head -n 1
> /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/
> check-run-nova-compute
> #!/bin/python -utt
> 
> openstack-tripleo-heat-templates-8.0.4-24.el7ost.noarch.rpm
> 
> head -n 1
> /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/
> check-run-nova-compute
> #!/usr/bin/python -utt
> 
> Not sure if this change is intentional or not, both paths are valid on my
> RHOS13 undercloud lab system:
> 
> [root@undercloud-0 ~]# ls -ld /bin/python /usr/bin/python
> lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /bin/python -> python2
> lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /usr/bin/python -> python2
> 
> I just wanted to highlight this in case it is an unintended regression.

Yeah this is intended (see https://bugzilla.redhat.com/show_bug.cgi?id=1612088): we want /usr/bin/python

Comment 26 Andrew Beekhof 2018-10-03 02:18:58 UTC
> So we fenced it again and this time either did NOT unfence or the logs ran out before the compute came back up.


And the reason we fenced again is that, due to the long power-on cycle, the node wasn't ready by the time we tried to connect. Setting reconnect_interval will also help with this, as it tells the cluster that it should not try to connect immediately.
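
For illustration only: the effect of a reconnect interval can be sketched in Python as a loop that waits before every connection attempt. This is not Pacemaker's implementation. The names try_connect, max_attempts and sleep are hypothetical, and only reconnect_interval comes from the comment above.

```python
import time


def reconnect_with_interval(try_connect, reconnect_interval, max_attempts,
                            sleep=time.sleep):
    """Wait reconnect_interval seconds before each connection attempt.

    Returns the number of the attempt that succeeded, or None if all
    attempts failed. try_connect is a hypothetical callable that returns
    True once the node is reachable.
    """
    for attempt in range(1, max_attempts + 1):
        # Do not connect immediately: give a slow-booting node time to come up.
        sleep(reconnect_interval)
        if try_connect():
            return attempt
    return None
```

Passing sleep as a parameter keeps the sketch testable without real delays.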

Comment 27 Andrew Beekhof 2018-10-08 01:44:24 UTC
Can we get confirmation if the proposed fix was sufficient?

Comment 38 pkomarov 2018-10-14 12:21:34 UTC
Verified,

[stack@undercloud-0 ~]$ cat core_puddle_version
2018-10-02.1


[stack@undercloud-0 ~]$ ansible compute -b -mshell -a'cat /var/lib/nova/instanceha/check-run-nova-compute|grep internalURL'
 [WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
    nova_endpoint_type = 'internalURL'
    # We default to internalURL but we allow this to be overridden via

overcloud-novacomputeiha-1 | SUCCESS | rc=0 >>
    nova_endpoint_type = 'internalURL'
    # We default to internalURL but we allow this to be overridden via
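
The grep output truncates the comment, but the Doc Text of this bug names '[placement]/valid_interfaces' in nova.conf as the override. A minimal sketch of such a lookup follows, assuming Python 3. The helper pick_endpoint_type and the mapping table are hypothetical and are not taken from the shipped script, whose actual logic may differ.

```python
import configparser

# Hypothetical mapping from interface names to the endpoint_type strings
# used in the workaround earlier in this report.
INTERFACE_TO_ENDPOINT_TYPE = {
    "internal": "internalURL",
    "public": "publicURL",
    "admin": "adminURL",
}


def pick_endpoint_type(nova_conf_text, default="internalURL"):
    """Return the endpoint type, honouring [placement]/valid_interfaces if set."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    raw = parser.get("placement", "valid_interfaces", fallback="")
    # Assumption: the option may hold a comma-separated list; use the first entry.
    first = raw.split(",")[0].strip()
    # Unknown or missing values fall back to the default (internalURL).
    return INTERFACE_TO_ENDPOINT_TYPE.get(first, default)
```

With no [placement] section or no valid_interfaces entry, the sketch returns internalURL, matching the default described in the Doc Text.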




Verification as in:
https://review.openstack.org/#/c/595903/

[stack@undercloud-0 ~]$ . stackrc 
(undercloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput
| 1cf4cdfd4f1f4fe59c556283db92a964 | regionOne | nova             | compute                 | True    | internal  | http://192.168.24.1:8774/v2.1                  |
| a5ddeeeb70674d91b200fa407425bae2 | regionOne | nova             | compute                 | True    | admin     | http://192.168.24.1:8774/v2.1                  |
| e263d4f49c324a009fd0ba3822ce3f94 | regionOne | nova             | compute                 | True    | public    | http://192.168.24.1:8774/v2.1                  |
(undercloud) [stack@undercloud-0 ~]$ . overcloudrc 
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput
| 1d1681389ac54c448ae08dfec30c2125 | regionOne | nova         | compute        | True    | public    | http://10.0.0.110:8774/v2.1                   |
| 5cc9634f850b4089b5dc2603e93e1eda | regionOne | nova         | compute        | True    | internal  | http://172.17.1.10:8774/v2.1                  |
| a4b33f4b4ae94d2ca6faba941d5e7024 | regionOne | nova         | compute        | True    | admin     | http://172.17.1.10:8774/v2.1                  |
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput|grep internal|sed 's@.*//@@g'|sed 's@:8774.*@@g'
172.17.1.10
(overcloud) [stack@undercloud-0 ~]$ export internal_api_ip=`openstack endpoint list |grep comput|grep internal|sed 's@.*//@@g'|sed 's@:8774.*@@g'`
(overcloud) [stack@undercloud-0 ~]$ echo $internal_api_ip
172.17.1.10
(overcloud) [stack@undercloud-0 ~]$ ansible compute -b -mshell -a"tcpdump -c 10 -i any -nn host $internal_api_ip and port 8774"
 [WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
12:16:37.185523 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [F.], seq 566107819, ack 4214161515, win 243, options [nop,nop,TS val 6267312 ecr 4294735945], length 0
12:16:37.185523 IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [F.], seq 0, ack 1, win 243, options [nop,nop,TS val 6267312 ecr 4294735945], length 0
12:16:37.197940 IP 172.17.1.17.57724 > 172.17.1.10.8774: Flags [F.], seq 1, ack 1, win 259, options [nop,nop,TS val 4294745915 ecr 6267312], length 0
12:16:37.198887 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [S], seq 3810977952, win 29200, options [mss 1460,sackOK,TS val 4294745916 ecr 0,nop,wscale 7], length 0
12:16:37.200734 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [.], ack 2, win 243, options [nop,nop,TS val 6267327 ecr 4294745915], length 0
12:16:37.200757 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57728: Flags [S.], seq 3928995206, ack 3810977953, win 28960, options [mss 1460,sackOK,TS val 6267327 ecr 4294745916,nop,wscale 7], length 0
12:16:37.200734 IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [.], ack 2, win 243, options [nop,nop,TS val 6267327 ecr 4294745915], length 0
12:16:37.200757 IP 172.17.1.10.8774 > 172.17.1.17.57728: Flags [S.], seq 3928995206, ack 3810977953, win 28960, options [mss 1460,sackOK,TS val 6267327 ecr 4294745916,nop,wscale 7], length 0
12:16:37.200847 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [.], ack 1, win 229, options [nop,nop,TS val 4294745918 ecr 6267327], length 0
12:16:37.201044 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [P.], seq 1:471, ack 1, win 229, options [nop,nop,TS val 4294745918 ecr 6267327], length 470
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
10 packets captured
10 packets received by filter
0 packets dropped by kernel
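
The grep/sed pipeline used above to extract the internal API address can also be sketched in Python. This assumes the table layout printed by 'openstack endpoint list' in this comment (seven pipe-separated columns). internal_compute_host is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def internal_compute_host(table_text):
    """Return the host of the internal compute endpoint, or None if absent."""
    for line in table_text.splitlines():
        # Drop the outer pipes, then split the row into its cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 7:
            continue  # not a data row in the assumed layout
        service_type, interface, url = cells[3], cells[5], cells[6]
        if service_type == "compute" and interface == "internal":
            # urlsplit handles the scheme, port and path for us.
            return urlsplit(url).hostname
    return None
```

Parsing the URL rather than cutting at ':8774' avoids hard-coding the port number.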

Comment 43 errata-xmlrpc 2018-11-13 22:28:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587