Description of problem:
OSP10z8 + RHEL 7.5 freshly deployed overcloud.

1) openstack undercloud upgrade

2) update plan:

export TOP_DIR="$(cd ${PATH_SCRIPT}/..; pwd)"
export TRIPLEO_DIR="/usr/share/openstack-tripleo-heat-templates"
source ${TOP_DIR}/pre_deploy.sh || exit 127
set -x
time openstack overcloud deploy --update-plan-only \
--templates \
--ntp-server '10.20.0.1", "10.20.0.2' \
--validation-errors-fatal \
-r ${TOP_DIR}/roles_data.yaml \
-e ${TOP_DIR}/node_info_small.yaml \
-e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
-e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
-e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/rhel-registration-environment.yaml \
-e ${TOP_DIR}/storage-environment.yaml \
-e ${TOP_DIR}/krynn-environment.yaml \
-e ${TOP_DIR}/extraconfig-environment.yaml \
-e ${TOP_DIR}/enable-tls.yaml \
-e ${TOP_DIR}/inject-trust-anchor.yaml \
-e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
-e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
-e ${TOP_DIR}/local-environment.yaml \
-e ${TOP_DIR}/token_flush-environment.yaml \
[...]
"$@" || exit 127

3) start minor update:

[stack@instack ~]$ openstack overcloud update stack \
> -i overcloud \
> -e ${TOP_DIR}/node_info_small.yaml \
> -e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
> -e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
> -e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
> -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
> -e ${TOP_DIR}/rhel-registration-environment.yaml \
> -e ${TOP_DIR}/storage-environment.yaml \
> -e ${TOP_DIR}/krynn-environment.yaml \
> -e ${TOP_DIR}/extraconfig-environment.yaml \
> -e ${TOP_DIR}/enable-tls.yaml \
> -e ${TOP_DIR}/inject-trust-anchor.yaml \
> -e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
> -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
> -e ${TOP_DIR}/local-environment.yaml \
> -e ${TOP_DIR}/token_flush-environment.yaml \
> -e ${TOP_DIR}/gnocchi_tuning.yaml \
> -e ${TOP_DIR}/disable_telemetry.yaml \
> -e ${TOP_DIR}/aodh_policy.yaml \
> -e ${TOP_DIR}/ceilometer_policy.yaml \
> -e ${TOP_DIR}/cinder_policy.yaml \
> -e ${TOP_DIR}/glance_policy.yaml \
> -e ${TOP_DIR}/gnocchi_policy.yaml \
> -e ${TOP_DIR}/heat_policy.yaml \
> -e ${TOP_DIR}/ironic_policy.yaml \
> -e ${TOP_DIR}/keystone_policy.yaml \
> -e ${TOP_DIR}/manila_policy.yaml \
> -e ${TOP_DIR}/mistral_policy.yaml \
> -e ${TOP_DIR}/neutron_policy.yaml \
> -e ${TOP_DIR}/nova_policy.yaml \
> -e ${TOP_DIR}/sahara_policy.yaml \
> -e ${TOP_DIR}/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Aborts almost immediately but stack stays in 'UPDATE_IN_PROGRESS'.

[stack@instack ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| b5424905-ec4d-4357-acc0-13eb15161135 | overcloud  | UPDATE_IN_PROGRESS | 2018-06-20T02:18:47Z | 2018-06-21T15:24:40Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
One thing that I've noticed is that if I reduce the number of YAMLs on the update line (take out all *_policy.yaml although they were on the original deploy line), -then- I don't get the error again: yes "" | openstack overcloud update stack \ -i overcloud \ -e ${TOP_DIR}/node_info_small.yaml \ -e ${TRIPLEO_DIR}/environments/network-isolation.yaml \ -e ${TRIPLEO_DIR}/environments/storage-environment.yaml \ -e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \ -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \ -e ${TOP_DIR}/rhel-registration-environment.yaml \ -e ${TOP_DIR}/storage-environment.yaml \ -e ${TOP_DIR}/krynn-environment.yaml \ -e ${TOP_DIR}/extraconfig-environment.yaml \ -e ${TOP_DIR}/enable-tls.yaml \ -e ${TOP_DIR}/inject-trust-anchor.yaml \ -e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \ -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \ -e ${TOP_DIR}/local-environment.yaml \ -e ${TOP_DIR}/token_flush-environment.yaml \ -e ${TOP_DIR}/gnocchi_tuning.yaml \ -e ${TOP_DIR}/disable_telemetry.yaml \ "$@" which gives: + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e 
/home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml WAITING on_breakpoint: [u'krynn-ctrl-0', u'krynn-ceph-1', u'krynn-ceph-0', u'krynn-cmpt-0']
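Since the failure seems to track how many environment files are on the update line, one way to compare the failing and working invocations is to total the size of the files passed with -e. The following is a minimal Python 3 sketch, not part of the reported setup: the helper names env_files_from_argv and size_report are hypothetical, and only the -e flag seen in the commands above is recognized.

```python
import os
from typing import Iterable, List, Tuple


def env_files_from_argv(argv: Iterable[str]) -> List[str]:
    """Return the path that follows each -e flag on a command line."""
    args = list(argv)
    paths = []
    # Stop one short of the end so a trailing "-e" with no value is ignored.
    for index, arg in enumerate(args[:-1]):
        if arg == "-e":
            paths.append(args[index + 1])
    return paths


def size_report(paths: Iterable[str]) -> Tuple[int, List[Tuple[str, int]]]:
    """Return (total_bytes, [(path, bytes), ...]) sorted largest first.

    Files that do not exist are reported with a size of 0.
    """
    sizes = []
    for path in paths:
        size = os.path.getsize(path) if os.path.isfile(path) else 0
        sizes.append((path, size))
    sizes.sort(key=lambda item: item[1], reverse=True)
    total = sum(size for _, size in sizes)
    return total, sizes
```

Running size_report(env_files_from_argv(sys.argv)) on both command lines (with and without the *_policy.yaml files) would show how much extra payload the policy files add. It only measures the files on disk, not what tripleoclient actually sends.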
Hi, so the problem is certainly happening during [1]:

stack_fields = templates.process_templates( workflow, container=parsed_args.stack)

in overcloud_update.py in tripleoclient. The reasoning is that it runs before do_interactive_update and that it seems linked to the number of templates to be processed. So the relevant logs should be the mistral ones, but I couldn't find the actual error in the sos report. Note that it's missing mistral/api.log and mistral/engine.log, so maybe that's where the error is. I'm passing this one to the workflow and api dfg for help here.

[1] can't find a place on the web where I can reference it ...
Hi Sofer, I checked /var/log/mistral/engine.log and noticed this: 2018-08-22 14:32:31.596 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ComputeSriov 2018-08-22 14:32:32.144 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/computesriov-role.yaml 2018-08-22 14:32:32.181 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ComputeDpdk 2018-08-22 14:32:32.833 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/computedpdk-role.yaml 2018-08-22 14:32:32.866 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role BlockStorage 2018-08-22 14:32:32.866 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/blockstorage-role.yaml, defined in {'name': ['puppet/controller-role.yaml ', 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']} 2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ObjectStorage 2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/objectstorage-role.yaml, defined in {'name': ['puppet/controller-role.yam l', 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']} 2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role CephStorage 2018-08-22 14:32:32.868 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/cephstorage-role.yaml, defined in {'name': ['puppet/controller-role.yaml' , 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']} 2018-08-22 14:32:32.868 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role Networker 2018-08-22 14:32:33.443 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/networker-role.yaml 2018-08-22
14:32:45.684 2043 INFO swiftclient [-] REQ: curl -i https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz COPY -H "Destination: overcloud-swift-rings/swift-rings.tar.gz-1534962765" -H "X-Auth-Token: 150d888ef5f14a69..." 2018-08-22 14:32:45.685 2043 INFO swiftclient [-] RESP STATUS: 404 Not Found 2018-08-22 14:32:45.686 2043 INFO swiftclient [-] RESP HEADERS: {u'Date': u'Wed, 22 Aug 2018 18:32:45 GMT', u'Content-Length': u'70', u'Content-Type': u'text/html; charset=U TF-8', u'X-Trans-Id': u'tx55ad64c3aa6e4c37a4973-005b7dac4d'} 2018-08-22 14:32:45.686 2043 INFO swiftclient [-] RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html> 2018-08-22 14:32:45.687 2043 ERROR swiftclient [-] Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-ring s.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< 2018-08-22 14:32:45.687 2043 ERROR swiftclient Traceback (most recent call last): 2018-08-22 14:32:45.687 2043 ERROR swiftclient File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1649, in _retry 2018-08-22 14:32:45.687 2043 ERROR swiftclient service_token=self.service_token, **kwargs) 2018-08-22 14:32:45.687 2043 ERROR swiftclient File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1404, in copy_object 2018-08-22 14:32:45.687 2043 ERROR swiftclient raise ClientException.from_response(resp, 'Object COPY failed', body) 2018-08-22 14:32:45.687 2043 ERROR swiftclient ClientException: Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< 2018-08-22 14:32:45.687 2043 ERROR swiftclient 2018-08-22 14:32:45.687 2043 INFO tripleo_common.actions.deployment [-] Perfoming Heat 
stack create 2018-08-22 14:32:56.863 2043 INFO mistral.engine.rpc_backend.rpc [-] Received RPC request 'run_action'[rpc_ctx=MistralContext {u'project_name': u'admin', u'user_id': u'3572fc8db37046d68cbea3a2eaee80ac', u'roles': [u'admin'], u'auth_uri': u'https://10.162.200.113:13000/v3', u'auth_cacert': None, u'auth_token': u'150d888ef5f14a699a088c311aedf1e6', u'expires_at': u'2018-08-22T22:30:44.000000Z', u'is_trust_scoped': False, u'service_catalog': u'[{"endpoints": [{"adminURL": "http://10.20.0.2:8080", "region": "regionOne", "internalURL": "http://10.20.0.2:8080/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a", "publicURL": "https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a"}], "type": "object-store", "name": "swift"}, {"endpoints": [{"adminURL": "http://10.20.0.2:9696", "region": "regionOne", "internalURL": "http://10.20.0.2:9696", "publicURL": "https://10.162.200.113:13696"}], "type": "network", "name": "neutron"}, {"endpoints": [{"adminURL": "ws://10.20.0.2:9000", "region": "regionOne", "internalURL": "ws://10.20.0.2:9000", "publicURL": "wss://10.162.200.113:9000"}], "type": "messaging-websocket", "name": "zaqar-websocket"}, {"endpoints": [{"adminURL": "http://10.20.0.2:8888", "region": "regionOne", "internalURL": "http://10.20.0.2:8888", "publicURL": "https://10.162.200.113:13888"}], "type": "messaging", "name": "zaqar"}, {"endpoints": [{"adminURL": "http://10.20.0.2:8989/v2", "region": "regionOne", "internalURL": "http://10.20.0.2:8989/v2", "publicURL": "https://10.162.200.113:13989/v2"}], "type": "workflowv2", "name": "mis So there's an error related to: Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-ring s.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< Not sure if it's related. Let me try to reproduce when the overcloud finishes updating.
Sorry, that was /var/log/mistral/executor.log, not engine.log.
Is it possible something is getting flooded somewhere? I was checking the mistral logs and I noticed that I could 'see' the tripleo YAML policies that I'm always passing to the deployment to implement the 'read-only' role that was developed for one of our telco customers.
[root@instack mistral]# grep -c readonly engine.log
1418
[root@instack mistral]# grep -c readonly api.log
1624
[root@instack mistral]# grep -c readonly executor.log
467
Did you manage to reproduce after the update? Would it be possible to see the complete mistral logs? I'm not sure that specific error is likely to be the cause, but we can learn lots from the mistral logs.
Hi Dougal, I'm reproducing the bug every time on OSP10 now. In fact, I've not been able to do a minor update because of this. Would you like a copy of /var/log/mistral? Any other logs?
Here goes: [stack@instack ~]$ time bash -x ./OSP/osp10/bin/deploy24_update.sh + '[' /usr/bin/bash ']' ++++ whence -- ./OSP/osp10/bin/deploy24_update.sh ++++ type -p -- ./OSP/osp10/bin/deploy24_update.sh +++ /usr/bin/dirname ./OSP/osp10/bin/deploy24_update.sh ++ cd ./OSP/osp10/bin ++ pwd + export PATH_SCRIPT=/home/stack/OSP/osp10/bin + PATH_SCRIPT=/home/stack/OSP/osp10/bin ++ cd /home/stack/OSP/osp10/bin/.. ++ pwd + export TOP_DIR=/home/stack/OSP/osp10 + TOP_DIR=/home/stack/OSP/osp10 + export TRIPLEO_DIR=/usr/share/openstack-tripleo-heat-templates + TRIPLEO_DIR=/usr/share/openstack-tripleo-heat-templates + set -x + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e 
/home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml starting package update on stack overcloud ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html> + set +x Building ansible hosts file.. # Collecting information from Nova and Heat.....................Done! Editing /etc/hosts to add overcloud nodes... 10.20.0.111 krynn-ctrl-0.lasthome.solace.krynn krynn-ctrl-0 ctrl0 10.20.0.106 krynn-ctrl-1.lasthome.solace.krynn krynn-ctrl-1 ctrl1 10.20.0.114 krynn-ctrl-2.lasthome.solace.krynn krynn-ctrl-2 ctrl2 10.20.0.107 krynn-cmpt-0.lasthome.solace.krynn krynn-cmpt-0 cmpt0 10.20.0.109 krynn-cmpt-1.lasthome.solace.krynn krynn-cmpt-1 cmpt1 10.20.0.113 krynn-ceph-0.lasthome.solace.krynn krynn-ceph-0 ceph0 ‘/home/stack/.ssh/config’ -> ‘/home/stack/.ssh/config.orig’ real 5m34.637s user 0m19.168s sys 0m4.472s
Here's what I did to keep only the last 1000 lines:

[root@instack mistral]# tail -1000 api.log > ../mistral_new/api.log
[root@instack mistral]# tail -1000 engine.log > ../mistral_new/engine.log
[root@instack mistral]# tail -1000 executor.log > ../mistral_new/executor.log
[root@instack mistral]# tar cvzf /tmp/mistral_logs.tgz /var/log/mistral_new
tar: Removing leading `/' from member names
/var/log/mistral_new/
/var/log/mistral_new/api.log
/var/log/mistral_new/engine.log
/var/log/mistral_new/executor.log

Size is much more reasonable:

[root@instack mistral]# ls -lh /tmp/mistral_logs.tgz
-rw-r--r--. 1 root root 6.6M Aug 24 20:50 /tmp/mistral_logs.tgz

It seems those logs have -very- long lines (probably because of the huge templates I am testing with).
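To put a number on the "-very- long lines" observation, the longest line in each log can be measured directly. This is a small Python 3 sketch under stated assumptions: the function name longest_line_lengths is hypothetical, and it reads the files as bytes so that odd characters in the logs cannot cause decode errors.

```python
from typing import Dict, Iterable


def longest_line_lengths(paths: Iterable[str]) -> Dict[str, int]:
    """Map each log path to the byte length of its longest line."""
    result = {}
    for path in paths:
        longest = 0
        # Binary mode: stream line by line without decoding the content.
        with open(path, "rb") as handle:
            for line in handle:
                longest = max(longest, len(line.rstrip(b"\r\n")))
        result[path] = longest
    return result
```

It could be pointed at api.log, engine.log and executor.log under /var/log/mistral; the files are streamed, so large logs do not need to fit in memory.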
Created attachment 1478685: last 1000 lines of mistral logs
Also, I've noticed this:

MariaDB [mistral]> select max(length(output)) AS Max_Length_String from mistral.action_executions_v2;
+-------------------+
| Max_Length_String |
+-------------------+
|            818097 |
+-------------------+
1 row in set (0.01 sec)
There are lots of other places in the Mistral DB where they have gigantic rows:

MariaDB [mistral]> select count(variables) from mistral.environments_v2;
+------------------+
| count(variables) |
+------------------+
|                3 |
+------------------+
1 row in set (0.01 sec)

MariaDB [mistral]> select max(length(variables)) AS Max_Length_String from mistral.environments_v2;
+-------------------+
| Max_Length_String |
+-------------------+
|            158172 |
+-------------------+
1 row in set (0.00 sec)
(In reply to Dougal Matthews from comment #10)
> Did you manage to reproduce after the update?
>
> Would it be possible to see the complete mistral logs?
>
> I'm not sure that specific error is likely to be the cause, but we can learn
> lots from the mistral logs.

Hi Dougal,

Just to clarify: this is a 'fresh' deploy that I'm using every time I'm testing this issue. My workflow goes like this:
- delete overcloud
- deploy overcloud using the same templates.
- attempt to do a minor update (this is where it fails).

So every time I'm doing this, the overcloud I'm trying to update is pretty 'fresh' (it has at most a few hours of idle use).
I have spent some time going through the mistral logs and couldn't find anything suspicious. When you next try and do a deployment it is useful to add "--debug" to the end of the CLI command. This will give us the full Python traceback at the end and can help.
Ok, here we go: [stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh (II) Using /home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e 
/home/stack/OSP/osp10/zaqar_policy.yaml --debug [...]
Here's the end of the run: REQ: curl -g -i -X GET https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 -H "User-Agent: python-heatclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}a98638b170e6fef36bae5539573cf2e6563bd3b1" "GET /v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 HTTP/1.1" 302 429 RESP: [302] Location: https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 Content-Length: 429 Content-Type: application/json; charset=UTF-8 X-Openstack-Request-Id: req-f6442346-97e5-4995-8fe7-c9661eed634a Date: Thu, 30 Aug 2018 18:13:51 GMT RESP BODY: {"message": "The resource was found at <a href=\"https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5\">https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5</a>;\nyou should be redirected automatically.\n\n", "code": "302 Found", "title": "Found"} "GET /v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 HTTP/1.1" 504 None RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged. ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. 
</body></html> Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand result = cmd.run(parsed_args) File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run return super(Command, self).run(parsed_args) File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run return self.take_action(parsed_args) or 0 File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_update.py", line 92, in take_action update_manager.do_interactive_update() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 89, in do_interactive_update status, _ = self.get_status() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 67, in get_status resources = self._resources_by_state() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 129, in _resources_by_state self.stack.id, nested_depth=self.nested_depth) File "/usr/lib/python2.7/site-packages/heatclient/v1/resources.py", line 71, in list return self._list(url, "resources") File "/usr/lib/python2.7/site-packages/heatclient/openstack/common/apiclient/base.py", line 135, in _list body = self.client.get(url).json() File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 187, in get return self.request(url, 'GET', **kwargs) File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 318, in request raise exc.from_response(resp) HTTPException: ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html> clean_up UpdateOvercloud: ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. 
</body></html> Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run ret_val = super(OpenStackShell, self).run(argv) File "/usr/lib/python2.7/site-packages/cliff/app.py", line 267, in run result = self.run_subcommand(remainder) File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 180, in run_subcommand ret_value = super(OpenStackShell, self).run_subcommand(argv) File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand result = cmd.run(parsed_args) File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run return super(Command, self).run(parsed_args) File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run return self.take_action(parsed_args) or 0 File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_update.py", line 92, in take_action update_manager.do_interactive_update() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 89, in do_interactive_update status, _ = self.get_status() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 67, in get_status resources = self._resources_by_state() File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 129, in _resources_by_state self.stack.id, nested_depth=self.nested_depth) File "/usr/lib/python2.7/site-packages/heatclient/v1/resources.py", line 71, in list return self._list(url, "resources") File "/usr/lib/python2.7/site-packages/heatclient/openstack/common/apiclient/base.py", line 135, in _list body = self.client.get(url).json() File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 187, in get return self.request(url, 'GET', **kwargs) File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 318, in request raise exc.from_response(resp) HTTPException: ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html> END return value: 1
So it seems the error is actually coming from heatclient/Heat, not Mistral. I'm glad I got that traceback to verify the source. I'm not sure how to debug failures like this. I guess we need to see the Heat logs? I think it is best that Upgrades take another look at this one. Possibly DF? Happy to work with them if I can help.
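The traceback shows _resources_by_state calling heatclient's resource list with nested_depth, and the failing request in the debug output uses nested_depth=5. One way to narrow this down is to time the same listing at increasing depths. The sketch below is Python 3 and only an illustration: time_resource_listing is a hypothetical helper, and it takes the listing callable as a parameter (for example the resources.list method of an already-authenticated heat client) rather than assuming how the client is built.

```python
import time
from typing import Callable, Dict, Iterable


def time_resource_listing(list_resources: Callable[..., Iterable],
                          stack_id: str,
                          depths: Iterable[int] = (0, 1, 2, 5)) -> Dict[int, dict]:
    """Time list_resources(stack_id, nested_depth=d) for each depth d.

    Returns {depth: {"seconds": float, "resources": int or None,
                     "error": str or None}}.
    """
    results = {}
    for depth in depths:
        started = time.monotonic()
        try:
            # Materialize the result so lazy iterators are fully fetched.
            count = len(list(list_resources(stack_id, nested_depth=depth)))
            error = None
        except Exception as exc:  # e.g. the HTTPException from the traceback
            count, error = None, str(exc)
        results[depth] = {
            "seconds": time.monotonic() - started,
            "resources": count,
            "error": error,
        }
    return results
```

If shallow depths return quickly and only the deepest one fails, that would suggest the listing itself is slow for this stack rather than Heat being unreachable. This is a diagnostic idea, not a confirmed cause.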
Hi, I was unable to reproduce this, and I agree with Dougal that this came from the heat client, which points to keystone or heat. How big is the machine in terms of resources? Can we get a reproducer or logs from heat too? Thanks
Hi Lukas, The undercloud VM I am using has 24 GB of RAM and 4 vCPUs. What logs would you like to have, specifically? Would giving you a tarball of the templates help?
Hi, I'd need all the /var/log/heat logs and ideally zaqar, httpd and keystone as a bonus. Thanks
Hi Lukas, Thanks for getting back to me. Ok, so /var/log/heat, /var/log/httpd, /var/log/zaqar and /var/log/keystone. Anything else? A sosreport?
Just those, please make sure you reboot UC if you are doing openstack undercloud upgrade.
(In reply to Lukas Bezdicka from comment #28)
> Just those, please make sure you reboot UC if you are doing openstack
> undercloud upgrade.

Hi Lukas,

I always do that, no worries (rebooting the undercloud). Especially when the 'undercloud upgrade' brings in a new kernel. :)

Vincent
I've upgraded to OSP10z9, I'm currently re-deploying with:
1 x controller
2 x compute HCI (Ceph OSDs)
1 x ceph-storage

This will make it easier to sift through logs.
[stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh (II) Using /home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e 
/home/stack/OSP/osp10/zaqar_policy.yaml starting package update on stack overcloud ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html> + set +x Building ansible hosts file.. # Collecting information from Nova and Heat...............Done! Editing /etc/hosts to add overcloud nodes... 10.20.0.111 krynn-ctrl-0.lasthome.solace.krynn krynn-ctrl-0 ctrl0 10.20.0.114 krynn-cmpt-0.lasthome.solace.krynn krynn-cmpt-0 cmpt0 10.20.0.112 krynn-cmpt-1.lasthome.solace.krynn krynn-cmpt-1 cmpt1 10.20.0.105 krynn-ceph-0.lasthome.solace.krynn krynn-ceph-0 ceph0 ‘/home/stack/.ssh/config’ -> ‘/home/stack/.ssh/config.orig’
Reproduced on OSP10z9, attaching logs...
[root@instack ~]# zip -9r /tmp/logs.zip /var/log/heat /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone updating: var/log/heat/ (stored 0%) updating: var/log/heat/heat-api.log (deflated 95%) updating: var/log/heat/heat-engine.log zip warning: file size changed while zipping /var/log/heat/heat-engine.log (deflated 95%) updating: var/log/heat/heat-api-cfn.log (deflated 91%) updating: var/log/mistral/ (stored 0%) updating: var/log/mistral/engine.log (deflated 86%) updating: var/log/mistral/api.log (deflated 85%) updating: var/log/mistral/executor.log (deflated 86%) updating: var/log/zaqar/ (stored 0%) updating: var/log/zaqar/zaqar.log (deflated 92%) updating: var/log/httpd/ (stored 0%) updating: var/log/httpd/tripleo-ui_error.log (stored 0%) updating: var/log/httpd/default_error.log (stored 0%) updating: var/log/httpd/ipxe_vhost_error.log (stored 0%) updating: var/log/httpd/access_log (stored 0%) updating: var/log/httpd/keystone_wsgi_admin_access.log (deflated 94%) updating: var/log/httpd/error_log (deflated 78%) updating: var/log/httpd/keystone_wsgi_admin_error.log (stored 0%) updating: var/log/httpd/nova_api_wsgi_access.log (deflated 97%) updating: var/log/httpd/keystone_wsgi_main_access.log (deflated 97%) updating: var/log/httpd/nova_api_wsgi_error.log (deflated 94%) updating: var/log/httpd/keystone_wsgi_main_error.log (stored 0%) updating: var/log/httpd/tripleo-ui_access.log (stored 0%) updating: var/log/httpd/ipxe_vhost_access.log (deflated 93%) updating: var/log/keystone/ (stored 0%) updating: var/log/keystone/keystone.log (deflated 94%)
I had to split the logs:

zip -9qr /tmp/logs1.zip /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone
zip -9qr /tmp/logs2.zip /var/log/heat
Created attachment 1484903: zip -9qr /tmp/logs1.zip /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone
Created attachment 1484904: zip -9qr /tmp/logs2.zip /var/log/heat
Also, please make note of the following:

[root@instack heat]# wc -l heat-engine.log
468491 heat-engine.log
[root@instack heat]# ls -lh heat-engine.log
-rw-r--r--. 1 heat heat 235M Sep 19 15:31 heat-engine.log
[root@instack heat]# head -1 heat-engine.log
2018-09-19 03:32:52.355 10902 DEBUG heat.engine.service [req-061419c6-bcb6-4393-8113-84a62c79dc49 - - - - -] Service f5a638e8-ae2b-4a55-bed4-d7927f9add8a is updated service_manage_report /usr/lib/python2.7/site-packages/

This is a huge logfile.
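To see what is filling a log like this, one option is to tally lines per log level. The sketch below assumes the line layout visible in the `head -1` output above (date, time, PID, then level, so the level is the fourth whitespace-separated field). `count_log_levels` is a hypothetical helper name, not part of any OpenStack tooling.

```shell
#!/bin/sh
# Tally log lines per level. Assumes the level is the 4th whitespace-separated
# field, as in: "2018-09-19 03:32:52.355 10902 DEBUG heat.engine.service ..."
# count_log_levels is a hypothetical helper, not an OpenStack command.
count_log_levels() {
    awk 'NF >= 4 { count[$4]++ } END { for (lvl in count) print lvl, count[lvl] }' "$1" | sort
}
```

It could be run as `count_log_levels /var/log/heat/heat-engine.log`. A tally dominated by DEBUG would suggest that the size comes mostly from debug logging rather than from errors.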
I may have found something... (check the log below.) The only difference is a 'cd' into OSP/osp10 -before- running the update.: RUN #1 stack@osp5p ~]$ ./OSP/osp10/bin/deploy24_update.sh (II) Using /home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e 
/home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml starting package update on stack overcloud ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html> + set +x RUN #2: [stack@osp5p ~]$ cd OSP/osp10 [stack@osp5p osp10]$ ./bin/deploy24_update.sh (II) Using /home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e 
/home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml WAITING on_breakpoint: [u'krynn-ceph-0', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-cmpt-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear 35cb0bd0-70a6-4846-9a59-69cda1d66214), no=cancel update, C-c=quit interactive mode: IN_PROGRESS WAITING completed: [u'krynn-cmpt-0'] on_breakpoint: [u'krynn-ceph-0', u'krynn-cmpt-1', u'krynn-ctrl-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear ce55271d-e148-4517-be4c-829e01fd52a0), no=cancel update, C-c=quit interactive mode: IN_PROGRESS IN_PROGRESS IN_PROGRESS
Well, that was a false positive.. (the 'cd' thing). I'm still getting failures: [stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh [...] + cd /home/stack/OSP/osp10 + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml starting package update on stack overcloud ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. 
</body></html> + set +x Building ansible hosts file..
Hi Vincent, I'm not sure I read your last log correctly. It seems you still cd into /home/stack/OSP/osp10 before running the update. Just to make sure: you *should* (I think you actually have to) run that command from the $HOME directory. Can you confirm that your script is adjusted so that the update command is run from the $HOME directory? Thanks,
Hi, No progress on my side. I've made sure to run everything from $HOME and it still fails: [stack@osp5p ~]$ ./OSP/osp10/bin/deploy24_update.sh + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml starting package update on stack overcloud ERROR: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. 
</body></html>
[stack@osp5p ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| b4170f46-8afc-4bf3-9436-d60355156fef | overcloud  | UPDATE_IN_PROGRESS | 2019-01-20T21:38:20Z | 2019-01-20T22:48:52Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+

(This was on a fresh deploy of OSP10z9):

[stack@osp5p ~]$ rpm -qa rhosp\*
rhosp-director-images-ipa-10.0-20180821.1.el7ost.noarch
rhosp-director-images-10.0-20180821.1.el7ost.noarch
How long does it take to run this command: openstack stack resource list overcloud -n5 > resources
Hi Lukas,

[stack@osp5p ~]$ time openstack stack resource list overcloud -n5 > resources

real 0m34.875s
user 0m1.014s
sys 0m0.156s
[stack@osp5p ~]$ wc -l resources
1027 resources
Hi Lukas,

Here's a summary of what I'm currently doing:

- Fresh OSP10z9 undercloud (24gb ram, 4vcpus), same templates/tooling as before.
- I ran a fresh 1ctrl + 1cmpt + ceph deploy, without the readonly policies in place.
===============================================================================================
time openstack overcloud deploy \
  --templates \
  --validation-errors-fatal \
  -r ${TOP_DIR}/roles_data.yaml \
  -e ${TOP_DIR}/node_info_micro.yaml \
  -e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
  -e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
  -e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
  -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
  -e ${TOP_DIR}/rhel-registration-environment.yaml \
  -e ${TOP_DIR}/storage-environment.yaml \
  -e ${TOP_DIR}/krynn-environment.yaml \
  -e ${TOP_DIR}/extraconfig-environment.yaml \
  -e ${TOP_DIR}/enable-tls.yaml \
  -e ${TOP_DIR}/inject-trust-anchor.yaml \
  -e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
  -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
  -e ${TOP_DIR}/local-environment.yaml \
  -e ${TOP_DIR}/token_flush-environment.yaml \
  -e ${TOP_DIR}/disable_telemetry.yaml \
  "$@" || exit 127
===============================================================================================
(Note the absence of the previous *_policy.yaml files).
- I checked the logs and the error didn't show up: [root@osp5p ~]# journalctl --no-pager --all --boot|grep -i YaqlEvaluationException [root@osp5p ~]# But then I noticed this: [root@osp5p ~]# journalctl --no-pager --all --boot|grep ERROR.*Exception Jan 21 09:29:15 osp5p mistral-server[5425]: 2019-01-21 09:29:15.465 5425 ERROR swiftclient raise ClientException.from_response(resp, 'Object COPY failed', body) Jan 21 09:29:15 osp5p mistral-server[5425]: 2019-01-21 09:29:15.465 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< Jan 21 09:54:42 osp5p mistral-server[5425]: 2019-01-21 09:54:42.318 5425 ERROR swiftclient raise ClientException.from_response(resp, 'Object COPY failed', body) Jan 21 09:54:42 osp5p mistral-server[5425]: 2019-01-21 09:54:42.318 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< Jan 21 11:41:33 osp5p mistral-server[5425]: 2019-01-21 11:41:33.415 5425 ERROR swiftclient raise ClientException.from_response(resp, 'Object COPY failed', body) Jan 21 11:41:33 osp5p mistral-server[5425]: 2019-01-21 11:41:33.415 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.< (This was from this morning's attempts) My update CLI this afternoon is like this: =============================================================================================== [stack@osp5p ~]$ time ./OSP/osp10/bin/deploy24_update.sh (II) Using 
/home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml starting package update on stack overcloud IN_PROGRESS WAITING on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode: IN_PROGRESS IN_PROGRESS WAITING on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0'] Breakpoint reached, continue? 
Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'krynn-ceph-0']
===============================================================================================

In retrospect, removing the readonly-role policy yamls seems to make the update work properly instead of aborting with a 504. The only difference between this deploy and those that failed to update a little earlier is the removal of these files:

[stack@osp5p osp10]$ git diff bin/deploy24_micro.sh
diff --git a/osp10/bin/deploy24_micro.sh b/osp10/bin/deploy24_micro.sh
index 419de66..3940515 100755
--- a/osp10/bin/deploy24_micro.sh
+++ b/osp10/bin/deploy24_micro.sh
@@ -31,22 +31,7 @@ time openstack overcloud deploy \
 -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
 -e ${TOP_DIR}/local-environment.yaml \
 -e ${TOP_DIR}/token_flush-environment.yaml \
--e ${TOP_DIR}/gnocchi_tuning.yaml \
 -e ${TOP_DIR}/disable_telemetry.yaml \
--e ${TOP_DIR}/aodh_policy.yaml \
--e ${TOP_DIR}/ceilometer_policy.yaml \
--e ${TOP_DIR}/cinder_policy.yaml \
--e ${TOP_DIR}/glance_policy.yaml \
--e ${TOP_DIR}/gnocchi_policy.yaml \
--e ${TOP_DIR}/heat_policy.yaml \
--e ${TOP_DIR}/ironic_policy.yaml \
--e ${TOP_DIR}/keystone_policy.yaml \
--e ${TOP_DIR}/manila_policy.yaml \
--e ${TOP_DIR}/mistral_policy.yaml \
--e ${TOP_DIR}/neutron_policy.yaml \
--e ${TOP_DIR}/nova_policy.yaml \
--e ${TOP_DIR}/sahara_policy.yaml \
--e ${TOP_DIR}/zaqar_policy.yaml \
 "$@" || exit 127
 set +x

Please note that these policies are supposed to be backward-compatible with the default policies provided with RHOSP10. Since they seem to break something (updates), I'll do more research on my end.
Can you provide the templates? If it's the Yaql issue, it is probably a typo in the templates. Also, I don't think the swift error is the cause. Does it fail every time, or does a rerun always work?
Hi Lukas, With the *policy.yaml files, behaviour is like this 100% of the time: - deploy succeeds (CREATE_COMPLETE) - deploy update succeeds (run the same deploy command): UPDATE_COMPLETE - stack minor update (openstack overcloud update stack -i overcloud) fails with the 504 error. Without the *policy.yaml files, everything works as expected: - deploy succeeds (CREATE_COMPLETE) - deploy update succeeds (run the same deploy command): UPDATE_COMPLETE - stack minor update (openstack overcloud update stack -i overcloud) seems to work as it is supposed to: [stack@osp5p ~]$ time ./OSP/osp10/bin/deploy24_update.sh (II) Using /home/stack/OSP/osp10/enable-tls.yaml... (II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml... (II) Using /home/stack/OSP/osp10/local-environment.yaml... (II) Using /home/stack/OSP/osp10/overcloud.pem... (II) Using /home/stack/OSP/osp10/undercloud.pem... + yes '' + openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml starting package update on stack overcloud IN_PROGRESS WAITING 
on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode: IN_PROGRESS IN_PROGRESS IN_PROGRESS WAITING completed: [u'krynn-ceph-0'] on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear c3c39e32-af25-48d2-baaf-e03e85b816a0), no=cancel update, C-c=quit interactive mode: IN_PROGRESS IN_PROGRESS WAITING completed: [u'krynn-cmpt-0', u'krynn-ceph-0'] on_breakpoint: [u'krynn-ctrl-0'] Breakpoint reached, continue? Regexp or Enter=proceed (will clear 914abd21-fefc-4b7f-b992-552022eb1bed), no=cancel update, C-c=quit interactive mode: IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS I'll be providing the templates shortly.
(In reply to Lukas Bezdicka from comment #51)
> Can you provide the templates, if it's the Yaql issue it would be probably
> typo in templates. Also I don't think it should be the swift error. Does it
> fail every time or rerun always works?

Also, I'd like to clarify some things: A re-run isn't actually a re-run (I wouldn't be able to re-run since the stack itself is in UPDATE_IN_PROGRESS until it times out 4 hours later). When I'm testing 'running' the upgrade again, I need to delete the stack, re-create it and -then- I can attempt the overcloud update again.
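Since a failed attempt leaves the stack in UPDATE_IN_PROGRESS, it can help to check the stack status before trying again. The following is a minimal sketch, not an official procedure: `stack_is_busy` and `current_stack_status` are hypothetical helper names, and the only real command used is `openstack stack show` with its standard `-f value -c stack_status` output selectors.

```shell
#!/bin/sh
# Return 0 (true) when a Heat status string denotes an operation that is still
# running, e.g. UPDATE_IN_PROGRESS. stack_is_busy is a hypothetical helper.
stack_is_busy() {
    case "$1" in
        *_IN_PROGRESS) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the current status of a stack (defaults to "overcloud").
# Hypothetical wrapper around the real `openstack stack show` command.
current_stack_status() {
    openstack stack show "${1:-overcloud}" -f value -c stack_status
}
```

A script could then guard the update with something like `if stack_is_busy "$(current_stack_status overcloud)"; then echo "stack still busy"; fi` before calling `openstack overcloud update stack -i overcloud`.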
The mystery thickens. Following yet another 504 failure, here's what I got:

[stack@osp5p osp10]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| 656a250b-65dc-45d6-8bb2-ec02fd1d1553 | overcloud  | UPDATE_IN_PROGRESS | 2019-01-22T21:03:44Z | 2019-01-22T22:17:18Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
[stack@osp5p osp10]$ cd
[stack@osp5p ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

[stack@osp5p ~]$ time openstack overcloud update stack -i overcloud
WAITING
on_breakpoint: [u'krynn-cmpt-0', u'krynn-ceph-0', u'krynn-ctrl-0', u'krynn-cmpt-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 235043d9-e640-40e3-887d-01ffb4fa6777), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
I've been hitting brick walls on this. There is no definitive answer as to which tripleo policy yaml is causing it. I'm starting to think it could be related to the quantity of yaml lines passed to the deploy. Even policy yamls that are not consumed (by puppet) in OSP10 seem to flip the 504/UPDATE behaviour: removing one or two makes the update work, while keeping those un-consumed files results in a 504. I'm talking about these files:

[raistlin@daltigoth ~/World/Vincent/Code/GIT]$ ls -la OSP/osp10/*policy.yaml
-rw-r--r--. 1 raistlin users  2452 May  3  2018 OSP/osp10/aodh_policy.yaml
-rw-r--r--. 1 raistlin users  1985 May  3  2018 OSP/osp10/ceilometer_policy.yaml
-rw-r--r--. 1 raistlin users 20731 May  3  2018 OSP/osp10/cinder_policy.yaml
-rw-r--r--. 1 raistlin users  6578 May  3  2018 OSP/osp10/glance_policy.yaml
-rw-r--r--. 1 raistlin users  5119 May  3  2018 OSP/osp10/gnocchi_policy.yaml
-rw-r--r--. 1 raistlin users 14590 May  3  2018 OSP/osp10/heat_policy.yaml
-rw-r--r--. 1 raistlin cdrom    94 Jun 11  2018 OSP/osp10/ironic_policy.yaml
-rw-r--r--. 1 raistlin users 28352 May  3  2018 OSP/osp10/keystone_policy.yaml
-rw-r--r--. 1 raistlin users 17857 May  3  2018 OSP/osp10/manila_policy.yaml
-rw-r--r--. 1 raistlin users  7552 May  3  2018 OSP/osp10/mistral_policy.yaml
-rw-r--r--. 1 raistlin cdrom 30807 Jun 11  2018 OSP/osp10/neutron_policy.yaml
-rw-r--r--. 1 raistlin users 48342 May  3  2018 OSP/osp10/nova_policy.yaml
-rw-r--r--. 1 raistlin users 10598 May  3  2018 OSP/osp10/sahara_policy.yaml
-rw-r--r--. 1 raistlin users  5203 May  3  2018 OSP/osp10/zaqar_policy.yaml
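If the sheer volume of YAML is the suspect, it may be worth measuring how many bytes a given set of environment files adds to the request. The helper below is only a sketch; `total_bytes` is a hypothetical name used for illustration.

```shell
#!/bin/sh
# Sum the sizes, in bytes, of the files given as arguments.
# total_bytes is a hypothetical helper name used for illustration only.
total_bytes() {
    # The arithmetic expansion strips any padding that wc may emit.
    echo $(( $(cat -- "$@" | wc -c) ))
}
```

Running `total_bytes OSP/osp10/*policy.yaml` before and after dropping a file would show how much each one contributes. For reference, the sizes in the listing above add up to 200260 bytes.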
Agreed with Zane: the root cause was most likely the haproxy timeout. It looks like we could close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1391375 and https://bugzilla.redhat.com/show_bug.cgi?id=1289315 The solution would be to increase the worker count (which later OSP10 releases already do) and also to change the timeout options in /etc/haproxy/haproxy.cfg.
Closing as duplicate. Sadly, I can only suggest increasing the haproxy timeouts if you happen to hit this issue.
*** This bug has been marked as a duplicate of bug 1391375 ***
For the record, and to help others: I've had good results using the following changes (they speed up the update as well as the original deploy):

[stack@osp5p ~]$ grep overr undercloud.conf
hieradata_override = /home/stack/OSP/osp10/undercloud-override.yaml
[stack@osp5p ~]$ cat /home/stack/OSP/osp10/undercloud-override.yaml
# HAProxy timeouts
tripleo::haproxy::ssl_cipher_suite: "!SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES:!MEDIUM"
tripleo::haproxy::ssl_options: 'no-sslv3 no-tls-tickets'
tripleo::haproxy::haproxy_global_maxconn: 65536
tripleo::haproxy::haproxy_default_maxconn: 16384
tripleo::haproxy::haproxy_default_timeout:
  - 'http-request 10s'
  - 'queue 3m'
  - 'connect 10s'
  - 'client 5m'
  - 'server 5m'
  - 'check 10s'
# DNS Domain
nova::network::neutron::dhcp_domain: lasthome.solace.krynn
neutron::dns_domain: lasthome.solace.krynn
# Memcached
aodh::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
barbican::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
ceilometer::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
cinder::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
congress::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
ec2api::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
glance::api::authtoken::memcached_servers: "127.0.0.1:11211"
gnocchi::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
heat::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
heat::cache::memcache_servers: "127.0.0.1:11211"
horizon::cache_server_ip: "127.0.0.1:11211"
ironic::api::authtoken::memcached_servers: "127.0.0.1:11211"
ironic::inspector::authtoken::memcached_servers: "127.0.0.1:11211"
keystone::cache_memcache_servers: "127.0.0.1:11211"
manila::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
mistral::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
neutron::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
nova::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
nova::cache::memcache_servers: "127.0.0.1:11211"
panko::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
sahara::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
swift::proxy::authtoken::memcache_servers: "127.0.0.1:11211"
swift::proxy::cache::memcache_servers: "127.0.0.1:11211"
tacker::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
zaqar::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
swift::objectexpirer::memcached_servers: "127.0.0.1:11211"
# Workers
heat::api::workers: 4
heat::api_cfn::workers: 4
heat::engine::num_engine_workers: 4
heat::rpc_response_timeout: 600
nova::compute::ironic::max_concurrent_builds: 4
nova::rpc_response_timeout: '600'
ironic::config:
  'DEFAULT/rpc_thread_pool_size':
    value => 8
ironic::rpc_response_timeout: 600
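One way to check whether timeout overrides like these actually reached the rendered configuration is to list the `timeout` directives in the HAProxy config file. This is a minimal sketch: it assumes the undercloud's HAProxy configuration lives at /etc/haproxy/haproxy.cfg (the stock RHEL location) and that the hieradata override has already been applied by re-running the undercloud install. `list_haproxy_timeouts` is a hypothetical helper name.

```shell
#!/bin/sh
# Print every "timeout ..." directive from an HAProxy configuration file.
# The default path is an assumption (stock RHEL location); pass another path
# as the first argument if the file lives elsewhere.
# list_haproxy_timeouts is a hypothetical helper, not part of TripleO.
list_haproxy_timeouts() {
    grep -E '^[[:space:]]*timeout[[:space:]]' "${1:-/etc/haproxy/haproxy.cfg}"
}
```

With the override above in place, one would expect entries such as `timeout client 5m` and `timeout server 5m` to appear in the output; if the older values are still listed, the override has probably not been applied yet.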