Bug 1421874

Summary: openstack-nova: nova commands against overcloud get stuck and exit with: Unknown Error (HTTP 504) (only on HA setup)
Product: Red Hat OpenStack Reporter: Alexander Chuzhoy <sasha>
Component: openstack-tripleo-heat-templates Assignee: Eoghan Glynn <eglynn>
Status: CLOSED ERRATA QA Contact: Prasanth Anbalagan <panbalag>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata) CC: aschultz, berrange, dasmith, eglynn, fdinitto, jschluet, kchamart, mburns, mcornea, oblaut, rhel-osp-director-maint, sasha, sbauza, sferdjao, sgordon, srevivo, ushkalim, vromanso
Target Milestone: beta Keywords: AutomationBlocker, Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.20170218023452.edbaaa9.el7ost Doc Type: If docs needed, set a value
Last Closed: 2017-05-17 19:59:18 UTC Type: Bug

Description Alexander Chuzhoy 2017-02-13 22:38:57 UTC
openstack-nova: nova commands against the overcloud get stuck and exit with: Unknown Error (HTTP 504)

Environment:
openstack-nova-cert-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
openstack-nova-compute-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
python-nova-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
instack-undercloud-6.0.0-0.20170130174946.5388cd1.el7ost.noarch
openstack-nova-api-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
openstack-nova-conductor-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
openstack-nova-common-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
puppet-nova-10.2.1-0.20170130234756.84cc5b0.el7ost.noarch
openstack-nova-placement-api-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
openstack-nova-scheduler-15.0.0-0.20170129152957.f9d7b38.el7ost.noarch
python-novaclient-6.0.0-0.20170125131648.25117fa.el7ost.noarch


Steps to reproduce:
1. Deploy overcloud with ironic services:
openstack overcloud deploy --debug --templates --libvirt-type kvm \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/ironic.yaml \
-e virt/ceph.yaml \
-e virt/hostnames.yml \
-e virt/network/network-environment.yaml \
-e ironic.yaml \
-e flat_networks.yaml \
-e vxlan_args_osp11 \
--log-file overcloud_deployment_48.log


[stack@undercloud-0 ~]$ cat ironic.yaml 
parameter_defaults:
  NtpServer: ["clock.redhat.com","clock2.redhat.com"]
  ComputeCount: 2
  ControllerCount: 3
  CephStorageCount: 2
  OvercloudControlFlavor: controller
  OvercloudComputeFlavor: compute
  OvercloudCephStorageFlavor: ceph
  IronicEnabledDrivers:
      - pxe_ssh
  NovaSchedulerDefaultFilters:
      - RetryFilter
      - AggregateInstanceExtraSpecsFilter
      - AvailabilityZoneFilter
      - RamFilter
      - DiskFilter
      - ComputeFilter
      - ComputeCapabilitiesFilter
      - ImagePropertiesFilter
  IronicCleaningDiskErase: metadata
  IronicIPXEEnabled: true
  ControllerExtraConfig:
      ironic::drivers::ssh::libvirt_uri: 'qemu:///system'


[stack@undercloud-0 ~]$ cat flat_networks.yaml 
parameter_defaults:
  NeutronBridgeMappings: datacentre:br-ex,baremetal:br-baremetal
  NeutronFlatNetworks: datacentre,baremetal






2. Attempt to run "nova list" or "openstack server list" against the overcloud.

Result: the command gets stuck:

[stack@undercloud-0 ~]$ nova list --debug
DEBUG (extension:169) found extension EntryPoint.parse('v1password = swiftclient.authv1:PasswordLoader')
DEBUG (extension:169) found extension EntryPoint.parse('gnocchi-basic = gnocchiclient.auth:GnocchiBasicLoader')
DEBUG (extension:169) found extension EntryPoint.parse('gnocchi-noauth = gnocchiclient.auth:GnocchiNoAuthLoader')
DEBUG (extension:169) found extension EntryPoint.parse('token_endpoint = openstackclient.api.auth_plugin:TokenEndpoint')
DEBUG (extension:169) found extension EntryPoint.parse('v2token = keystoneauth1.loading._plugins.identity.v2:Token')
DEBUG (extension:169) found extension EntryPoint.parse('v3oauth1 = keystoneauth1.extras.oauth1._loading:V3OAuth1')
DEBUG (extension:169) found extension EntryPoint.parse('admin_token = keystoneauth1.loading._plugins.admin_token:AdminToken')
DEBUG (extension:169) found extension EntryPoint.parse('v3oidcauthcode = keystoneauth1.loading._plugins.identity.v3:OpenIDConnectAuthorizationCode')
DEBUG (extension:169) found extension EntryPoint.parse('v2password = keystoneauth1.loading._plugins.identity.v2:Password')
DEBUG (extension:169) found extension EntryPoint.parse('v3samlpassword = keystoneauth1.extras._saml2._loading:Saml2Password')
DEBUG (extension:169) found extension EntryPoint.parse('v3password = keystoneauth1.loading._plugins.identity.v3:Password')
DEBUG (extension:169) found extension EntryPoint.parse('v3oidcaccesstoken = keystoneauth1.loading._plugins.identity.v3:OpenIDConnectAccessToken')
DEBUG (extension:169) found extension EntryPoint.parse('v3oidcpassword = keystoneauth1.loading._plugins.identity.v3:OpenIDConnectPassword')
DEBUG (extension:169) found extension EntryPoint.parse('v3kerberos = keystoneauth1.extras.kerberos._loading:Kerberos')
DEBUG (extension:169) found extension EntryPoint.parse('token = keystoneauth1.loading._plugins.identity.generic:Token')
DEBUG (extension:169) found extension EntryPoint.parse('v3oidcclientcredentials = keystoneauth1.loading._plugins.identity.v3:OpenIDConnectClientCredentials')
DEBUG (extension:169) found extension EntryPoint.parse('v3tokenlessauth = keystoneauth1.loading._plugins.identity.v3:TokenlessAuth')
DEBUG (extension:169) found extension EntryPoint.parse('v3token = keystoneauth1.loading._plugins.identity.v3:Token')
DEBUG (extension:169) found extension EntryPoint.parse('v3totp = keystoneauth1.loading._plugins.identity.v3:TOTP')
DEBUG (extension:169) found extension EntryPoint.parse('password = keystoneauth1.loading._plugins.identity.generic:Password')
DEBUG (extension:169) found extension EntryPoint.parse('v3fedkerb = keystoneauth1.extras.kerberos._loading:MappedKerberos')
DEBUG (extension:169) found extension EntryPoint.parse('aodh-noauth = aodhclient.noauth:AodhNoAuthLoader')
DEBUG (session:347) REQ: curl -g -i -X GET http://10.0.0.105:5000/v2.0 -H "Accept: application/json" -H "User-Agent: nova keystoneauth1/2.18.0 python-requests/2.10.0 CPython/2.7.5"
INFO (connectionpool:213) Starting new HTTP connection (1): 10.0.0.105
DEBUG (connectionpool:395) "GET /v2.0 HTTP/1.1" 200 227
DEBUG (session:395) RESP: [200] Date: Mon, 13 Feb 2017 22:19:57 GMT Server: Apache Vary: X-Auth-Token,Accept-Encoding x-openstack-request-id: req-5571d868-8568-4b72-b42e-2518a3e76a1d Content-Encoding: gzip Content-Length: 227 Content-Type: application/json 
RESP BODY: {"version": {"status": "deprecated", "updated": "2016-08-04T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v2.0+json"}], "id": "v2.0", "links": [{"href": "http://10.0.0.105:5000/v2.0/", "rel": "self"}, {"href": "http://docs.openstack.org/", "type": "text/html", "rel": "describedby"}]}}

DEBUG (session:640) GET call to None for http://10.0.0.105:5000/v2.0 used request id req-5571d868-8568-4b72-b42e-2518a3e76a1d
DEBUG (v2:63) Making authentication request to http://10.0.0.105:5000/v2.0/tokens
DEBUG (connectionpool:395) "POST /v2.0/tokens HTTP/1.1" 200 1196
REQ: curl -g -i -X GET http://10.0.0.105:8774/v2.1 -H "User-Agent: python-novaclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}588c7a54ee9b44fc194b3c79f172c10b7ddbcc78"
DEBUG (session:347) REQ: curl -g -i -X GET http://10.0.0.105:8774/v2.1 -H "User-Agent: python-novaclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}588c7a54ee9b44fc194b3c79f172c10b7ddbcc78"
INFO (connectionpool:213) Starting new HTTP connection (1): 10.0.0.105
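
The hang shown above can be caught in automation by running the client under a deadline instead of waiting for it indefinitely. The following is a minimal Python sketch and not part of the original report; the function name run_with_deadline and the 30-second deadline in the usage note are assumptions chosen for illustration.

```python
import subprocess


def run_with_deadline(cmd, deadline_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hung'.

    cmd is an argument list such as ["nova", "list"]; deadline_s is the
    number of seconds to wait before declaring the command hung
    (both are illustrative, not taken from the bug report).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=deadline_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so no stuck client is left behind.
        return "hung"
    return "ok" if proc.returncode == 0 else "failed"
```

With overcloud credentials sourced, a call such as run_with_deadline(["nova", "list"], 30) would be expected to return "hung" while the API is stuck, or "failed" if the client gives up first with the HTTP 504 error.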



Checking the nova and httpd logs on all controllers, I see this error:
==> /var/log/nova/nova-compute.log <==
2017-02-13 22:30:40.021 99539 ERROR nova.compute.manager [req-2892d6f2-0fbf-450c-a25a-5e0ff02d9ce3 - - - - -] No compute node record for host controller-2.localdomain
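
To check every controller for the same message, the log lines can be scanned for the error text quoted above. This is a rough Python sketch under the assumption that the message format matches the line shown; the names NO_RECORD_RE and hosts_without_compute_record are hypothetical.

```python
import re

# Matches the nova.compute.manager error quoted in this report and
# captures the host name at the end of the message.
NO_RECORD_RE = re.compile(
    r"ERROR nova\.compute\.manager .*No compute node record for host (\S+)"
)


def hosts_without_compute_record(lines):
    """Return the sorted, de-duplicated host names named in the error."""
    hosts = set()
    for line in lines:
        match = NO_RECORD_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return sorted(hosts)
```

The function accepts any iterable of strings, so an open file handle for /var/log/nova/nova-compute.log can be passed directly.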


If I restart httpd on all controllers, nova list starts working, but it gets stuck again after a very short time (a few seconds).

Below is the output from when it was temporarily working:
[stack@undercloud-0 ~]$ nova list
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
+----+------+--------+------------+-------------+----------+


[stack@undercloud-0 ~]$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 12 | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2017-02-13T22:35:42.000000 | -               |
| 18 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2017-02-13T22:35:49.000000 | -               |
| 21 | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2017-02-13T22:35:48.000000 | -               |
| 24 | nova-compute     | controller-1.localdomain | nova     | enabled | up    | 2017-02-13T22:35:50.000000 | -               |
| 27 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2017-02-13T22:35:48.000000 | -               |
| 36 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2017-02-13T22:35:42.000000 | -               |
| 39 | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2017-02-13T22:35:51.000000 | -               |
| 42 | nova-compute     | controller-2.localdomain | nova     | enabled | up    | 2017-02-13T22:35:48.000000 | -               |
| 48 | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-02-13T22:35:48.000000 | -               |
| 51 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2017-02-13T22:35:50.000000 | -               |
| 60 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-02-13T22:35:50.000000 | -               |
| 69 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-02-13T22:35:45.000000 | -               |
| 72 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-02-13T22:35:48.000000 | -               |
| 75 | nova-compute     | controller-0.localdomain | nova     | enabled | up    | 2017-02-13T22:35:50.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
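
For automated runs, the service table above can be parsed to confirm that every service is reported as enabled and up, which helps separate an API hang from a service that is actually down. The sketch below is illustrative only; it assumes the ASCII table layout shown above, and the helper names parse_service_table and services_not_up are hypothetical.

```python
def parse_service_table(text):
    """Parse an ASCII table like the 'nova service-list' output into dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def services_not_up(rows):
    """Return (binary, host) pairs that are not both enabled and up."""
    return [
        (row["Binary"], row["Host"])
        for row in rows
        if row["Status"] != "enabled" or row["State"] != "up"
    ]
```

For the table in this report, every row is enabled and up, so services_not_up would return an empty list even though the API calls hang shortly afterwards.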

Comment 2 Alexander Chuzhoy 2017-02-15 01:43:59 UTC
The issue reproduced even when the overcloud was deployed without ironic:

openstack overcloud deploy --debug --templates --libvirt-type kvm \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e virt/ceph.yaml \
-e virt/hostnames.yml \
-e virt/network/network-environment.yaml \
-e flat_networks.yaml \
-e vxlan_args_osp11 

[stack@undercloud-0 ~]$ cat flat_networks.yaml 
parameter_defaults:
  NeutronBridgeMappings: datacentre:br-ex,baremetal:br-baremetal
  NeutronFlatNetworks: datacentre,baremetal
  NtpServer: ["clock.redhat.com","clock2.redhat.com"]
  ComputeCount: 2
  ControllerCount: 3
  CephStorageCount: 2
  OvercloudControlFlavor: controller
  OvercloudComputeFlavor: compute
  OvercloudCephStorageFlavor: ceph



[stack@undercloud-0 ~]$ cat vxlan_args_osp11 
parameter_defaults:
  NeutronNetworkType: 'vxlan'
  NeutronTunnelTypes: 'vxlan'

Comment 3 Udi Shkalim 2017-02-15 09:46:38 UTC
Blocks our Automation scripts.

Comment 4 Alex Schultz 2017-02-15 13:39:27 UTC
The fix was merged upstream.

Comment 7 Alexander Chuzhoy 2017-02-15 16:23:54 UTC
It worked for me after applying this patch: https://review.openstack.org/#/c/430183

The patch is applied before deploying the overcloud.

Comment 13 errata-xmlrpc 2017-05-17 19:59:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245