Description of problem:

If the architecture design does not allow the compute nodes to access the publicURL, check-run-nova-compute fails and the container is marked as unhealthy.

https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/extraconfig/tasks/instanceha/check-run-nova-compute#L112-L132

As a workaround, the client.Client call in check-run-nova-compute can be modified to point to the internalURL endpoint instead:

    if clientargs:
        # OSP < Ocata
        # ArgSpec(args=['version', 'username', 'password', 'project_id', 'auth_url'],
        #         varargs=None,
        #         keywords='kwargs', defaults=(None, None, None, None))
        nova = client.Client(version,
                             None, # User
                             None, # Password
                             None, # Tenant
                             None, # Auth URL
                             insecure=options["insecure"],
                             region_name=options["os_region_name"][0],
                             session=keystone_session, auth=keystone_auth,
                             http_log_debug=options.has_key("verbose"),
                             endpoint_type='internalURL')
    else:
        # OSP >= Ocata
        # ArgSpec(args=['version'], varargs='args', keywords='kwargs', defaults=None)
        nova = client.Client(version,
                             region_name=options["os_region_name"][0],
                             session=keystone_session, auth=keystone_auth,
                             http_log_debug=options.has_key("verbose"),
                             endpoint_type='internalURL')

However, this configuration is not the default and cannot be set through director. Does this mean that the publicURL always has to be reachable from the compute nodes?

Version-Release number of selected component (if applicable):
Queens OSP13
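The workaround in the description hard-codes endpoint_type='internalURL' in both branches. A minimal sketch of how the endpoint type might be kept in one place and defaulted to internalURL is shown here. build_nova_client_kwargs and its parameters are hypothetical names used only for illustration and are not part of check-run-nova-compute; the session and auth values are stand-ins so the snippet runs without novaclient installed.

```python
def build_nova_client_kwargs(options, session, auth, endpoint_type="internalURL"):
    """Collect keyword arguments for the Nova client (hypothetical helper).

    The default of "internalURL" mirrors the workaround in the description;
    callers can still pass "publicURL" explicitly.
    """
    return {
        "region_name": options["os_region_name"][0],
        "session": session,
        "auth": auth,
        # dict.has_key() exists only on Python 2; "in" works on both 2 and 3.
        "http_log_debug": "verbose" in options,
        "endpoint_type": endpoint_type,
    }


if __name__ == "__main__":
    # Assumed option layout: os_region_name is a list, as in the description.
    options = {"os_region_name": ["regionOne"], "verbose": True}
    kwargs = build_nova_client_kwargs(options, session=None, auth=None)
    print(kwargs["endpoint_type"])
```

With the real library, the resulting dictionary would presumably be passed as client.Client(version, **kwargs) in the ">= Ocata" branch, so the endpoint type is decided in a single place.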
One thing I've noted with this change is that the path to the python executable appears to change between versions.

openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch.rpm

head -n 1 /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/check-run-nova-compute
#!/bin/python -utt

openstack-tripleo-heat-templates-8.0.4-24.el7ost.noarch.rpm

head -n 1 /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/check-run-nova-compute
#!/usr/bin/python -utt

Not sure if this change is intentional or not; both paths are valid on my RHOS13 undercloud lab system:

[root@undercloud-0 ~]# ls -ld /bin/python /usr/bin/python
lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /bin/python -> python2
lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /usr/bin/python -> python2

I just wanted to highlight this in case it is an unintended regression.
(In reply to Cody Swanson from comment #14)
> One thing I've noted with this change is the path to the python executable
> appears to change between versions.
>
> openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch.rpm
>
> head -n 1
> /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/
> check-run-nova-compute
> #!/bin/python -utt
>
> openstack-tripleo-heat-templates-8.0.4-24.el7ost.noarch.rpm
>
> head -n 1
> /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/instanceha/
> check-run-nova-compute
> #!/usr/bin/python -utt
>
> Not sure if this change is intentional or not, both paths are valid on my
> RHOS13 undercloud lab system:
>
> [root@undercloud-0 ~]# ls -ld /bin/python /usr/bin/python
> lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /bin/python -> python2
> lrwxrwxrwx. 1 root root 7 Jul 28 11:37 /usr/bin/python -> python2
>
> I just wanted to highlight this in case it is an unintended regression.

Yeah, this is intended (see https://bugzilla.redhat.com/show_bug.cgi?id=1612088): we want /usr/bin/python.
> So we fenced it again and this time either did NOT unfence or the logs ran out before the compute came back up.

And the reason we fenced again is that, due to the long power-on cycle, the node wasn't ready by the time we tried to connect. Setting reconnect_interval will also help with this, as it tells the cluster that it should not try to connect immediately.
Can we get confirmation if the proposed fix was sufficient?
Verified,

[stack@undercloud-0 ~]$ cat core_puddle_version
2018-10-02.1[stack@undercloud-0 ~]$
[stack@undercloud-0 ~]$ ansible compute -b -mshell -a'cat /var/lib/nova/instanceha/check-run-nova-compute|grep internalURL'
 [WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
    nova_endpoint_type = 'internalURL'
    # We default to internalURL but we allow this to be overridden via
overcloud-novacomputeiha-1 | SUCCESS | rc=0 >>
    nova_endpoint_type = 'internalURL'
    # We default to internalURL but we allow this to be overridden via
[stack@undercloud-0 ~]$ cat core_puddle_version
2018-10-02.1[stack@undercloud-0 ~]$
[stack@undercloud-0 ~]$

verification as in: https://review.openstack.org/#/c/595903/

[stack@undercloud-0 ~]$ . stackrc
(undercloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput
| 1cf4cdfd4f1f4fe59c556283db92a964 | regionOne | nova | compute | True | internal | http://192.168.24.1:8774/v2.1 |
| a5ddeeeb70674d91b200fa407425bae2 | regionOne | nova | compute | True | admin | http://192.168.24.1:8774/v2.1 |
| e263d4f49c324a009fd0ba3822ce3f94 | regionOne | nova | compute | True | public | http://192.168.24.1:8774/v2.1 |
(undercloud) [stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput
| 1d1681389ac54c448ae08dfec30c2125 | regionOne | nova | compute | True | public | http://10.0.0.110:8774/v2.1 |
| 5cc9634f850b4089b5dc2603e93e1eda | regionOne | nova | compute | True | internal | http://172.17.1.10:8774/v2.1 |
| a4b33f4b4ae94d2ca6faba941d5e7024 | regionOne | nova | compute | True | admin | http://172.17.1.10:8774/v2.1 |
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list |grep comput|grep internal|sed 's@.*//@@g'|sed 's@:8774.*@@g'
172.17.1.10
(overcloud) [stack@undercloud-0 ~]$ export internal_api_ip=`openstack endpoint list |grep comput|grep internal|sed 's@.*//@@g'|sed 's@:8774.*@@g'`
echo $(overcloud) [stack@undercloud-0 ~]$ echo $internal_api_ip
172.17.1.10
(overcloud) [stack@undercloud-0 ~]$ ansible compute -b -mshell -a"tcpdump -c 10 -i any -nn host $internal_api_ip and port 8774"
 [WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
12:16:37.185523 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [F.], seq 566107819, ack 4214161515, win 243, options [nop,nop,TS val 6267312 ecr 4294735945], length 0
12:16:37.185523 IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [F.], seq 0, ack 1, win 243, options [nop,nop,TS val 6267312 ecr 4294735945], length 0
12:16:37.197940 IP 172.17.1.17.57724 > 172.17.1.10.8774: Flags [F.], seq 1, ack 1, win 259, options [nop,nop,TS val 4294745915 ecr 6267312], length 0
12:16:37.198887 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [S], seq 3810977952, win 29200, options [mss 1460,sackOK,TS val 4294745916 ecr 0,nop,wscale 7], length 0
12:16:37.200734 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [.], ack 2, win 243, options [nop,nop,TS val 6267327 ecr 4294745915], length 0
12:16:37.200757 ethertype IPv4, IP 172.17.1.10.8774 > 172.17.1.17.57728: Flags [S.], seq 3928995206, ack 3810977953, win 28960, options [mss 1460,sackOK,TS val 6267327 ecr 4294745916,nop,wscale 7], length 0
12:16:37.200734 IP 172.17.1.10.8774 > 172.17.1.17.57724: Flags [.], ack 2, win 243, options [nop,nop,TS val 6267327 ecr 4294745915], length 0
12:16:37.200757 IP 172.17.1.10.8774 > 172.17.1.17.57728: Flags [S.], seq 3928995206, ack 3810977953, win 28960, options [mss 1460,sackOK,TS val 6267327 ecr 4294745916,nop,wscale 7], length 0
12:16:37.200847 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [.], ack 1, win 229, options [nop,nop,TS val 4294745918 ecr 6267327], length 0
12:16:37.201044 IP 172.17.1.17.57728 > 172.17.1.10.8774: Flags [P.], seq 1:471, ack 1, win 229, options [nop,nop,TS val 4294745918 ecr 6267327], length 470
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
10 packets captured
10 packets received by filter
0 packets dropped by kernel
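The verification extracts the internal API address from the endpoint URL with two sed expressions that assume port 8774. As a rough alternative sketch, the same host can be taken from the URL with Python's standard library, which does not depend on the port number. endpoint_host and internal_compute_host are hypothetical helper names, and the (interface, url) tuples are an assumed input shape rather than the actual output format of the openstack CLI.

```python
from urllib.parse import urlsplit


def endpoint_host(url):
    """Return only the host part of an endpoint URL, without scheme, port or path."""
    return urlsplit(url).hostname


def internal_compute_host(endpoints):
    """Return the host of the first endpoint whose interface is 'internal'.

    `endpoints` is assumed to be an iterable of (interface, url) tuples.
    Returns None when no internal endpoint is present.
    """
    for interface, url in endpoints:
        if interface == "internal":
            return endpoint_host(url)
    return None


if __name__ == "__main__":
    # Values copied from the overcloud endpoint list in the verification.
    endpoints = [
        ("public", "http://10.0.0.110:8774/v2.1"),
        ("internal", "http://172.17.1.10:8774/v2.1"),
        ("admin", "http://172.17.1.10:8774/v2.1"),
    ]
    print(internal_compute_host(endpoints))
```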
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587