Bug 1593811 - OSP10z9, minor update exits quickly with: ERROR: <html><body><h1>504 Gateway Time-out</h1>
Summary: OSP10z9, minor update exits quickly with: ERROR: <html><body><h1>504 Gateway ...
Keywords:
Status: CLOSED DUPLICATE of bug 1391375
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 10.0 (Newton)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: zstream
Target Release: 10.0 (Newton)
Assignee: Lukas Bezdicka
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-21 15:39 UTC by Vincent S. Cojot
Modified: 2019-02-21 20:45 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-21 10:33:08 UTC
Target Upstream Version:
Embargoed:


Attachments:
last 1000 lines of mistral logs (6.56 MB, application/x-gzip)
2018-08-25 00:54 UTC, Vincent S. Cojot
zip -9qr /tmp/logs1.zip /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone (12.96 MB, application/zip)
2018-09-19 19:30 UTC, Vincent S. Cojot
zip -9qr /tmp/logs2.zip /var/log/heat (12.37 MB, application/zip)
2018-09-19 19:31 UTC, Vincent S. Cojot

Description Vincent S. Cojot 2018-06-21 15:39:54 UTC
Description of problem:

OSP10z8 + RHEL 7.5 freshly deployed overcloud.

1) openstack undercloud upgrade
2) update plan:

export TOP_DIR="$(cd ${PATH_SCRIPT}/..; pwd)"
export TRIPLEO_DIR="/usr/share/openstack-tripleo-heat-templates"

source ${TOP_DIR}/pre_deploy.sh || exit 127
set -x
time openstack overcloud deploy --update-plan-only \
--templates \
--ntp-server '10.20.0.1", "10.20.0.2' \
--validation-errors-fatal \
-r ${TOP_DIR}/roles_data.yaml \
-e ${TOP_DIR}/node_info_small.yaml \
-e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
-e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
-e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/rhel-registration-environment.yaml \
-e ${TOP_DIR}/storage-environment.yaml \
-e ${TOP_DIR}/krynn-environment.yaml \
-e ${TOP_DIR}/extraconfig-environment.yaml \
-e ${TOP_DIR}/enable-tls.yaml \
-e ${TOP_DIR}/inject-trust-anchor.yaml \
-e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
-e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
-e ${TOP_DIR}/local-environment.yaml \
-e ${TOP_DIR}/token_flush-environment.yaml \
[...]
"$@" || exit 127

3) start minor update:

[stack@instack ~]$ openstack overcloud update stack \
> -i overcloud \
> -e ${TOP_DIR}/node_info_small.yaml \
> -e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
> -e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
> -e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
> -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
> -e ${TOP_DIR}/rhel-registration-environment.yaml \
> -e ${TOP_DIR}/storage-environment.yaml \
> -e ${TOP_DIR}/krynn-environment.yaml \
> -e ${TOP_DIR}/extraconfig-environment.yaml \
> -e ${TOP_DIR}/enable-tls.yaml \
> -e ${TOP_DIR}/inject-trust-anchor.yaml \
> -e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
> -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
> -e ${TOP_DIR}/local-environment.yaml \
> -e ${TOP_DIR}/token_flush-environment.yaml \
> -e ${TOP_DIR}/gnocchi_tuning.yaml \
> -e ${TOP_DIR}/disable_telemetry.yaml \
> -e ${TOP_DIR}/aodh_policy.yaml \
> -e ${TOP_DIR}/ceilometer_policy.yaml \
> -e ${TOP_DIR}/cinder_policy.yaml \
> -e ${TOP_DIR}/glance_policy.yaml \
> -e ${TOP_DIR}/gnocchi_policy.yaml \
> -e ${TOP_DIR}/heat_policy.yaml \
> -e ${TOP_DIR}/ironic_policy.yaml \
> -e ${TOP_DIR}/keystone_policy.yaml \
> -e ${TOP_DIR}/manila_policy.yaml \
> -e ${TOP_DIR}/mistral_policy.yaml \
> -e ${TOP_DIR}/neutron_policy.yaml \
> -e ${TOP_DIR}/nova_policy.yaml \
> -e ${TOP_DIR}/sahara_policy.yaml \
> -e ${TOP_DIR}/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Aborts almost immediately but stack stays in 'UPDATE_IN_PROGRESS'.

[stack@instack ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| b5424905-ec4d-4357-acc0-13eb15161135 | overcloud  | UPDATE_IN_PROGRESS | 2018-06-20T02:18:47Z | 2018-06-21T15:24:40Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
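
A minimal sketch, assuming the ASCII table layout printed by heat stack-list above, of how a wrapper script could flag a stack that is left in an *_IN_PROGRESS state after the client has already exited. parse_table and stacks_in_progress are hypothetical helper names, not part of any OpenStack client.

```python
def parse_table(text):
    """Turn a '| a | b |' style table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders ('+---+'), warnings and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stacks_in_progress(text):
    """Return the names of stacks whose status ends in _IN_PROGRESS."""
    return [
        row["stack_name"]
        for row in parse_table(text)
        if row.get("stack_status", "").endswith("_IN_PROGRESS")
    ]
```

Feeding the saved output of the stack listing to stacks_in_progress right after the update command returns an error makes the "client gave up, stack still updating" condition explicit.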

Comment 4 Vincent S. Cojot 2018-07-11 21:23:25 UTC
One thing I've noticed: if I reduce the number of YAML files on the update command line (removing all the *_policy.yaml files, even though they were on the original deploy line), then I no longer get the error:

yes "" | openstack overcloud update stack \
-i overcloud \
-e ${TOP_DIR}/node_info_small.yaml \
-e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
-e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
-e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/rhel-registration-environment.yaml \
-e ${TOP_DIR}/storage-environment.yaml \
-e ${TOP_DIR}/krynn-environment.yaml \
-e ${TOP_DIR}/extraconfig-environment.yaml \
-e ${TOP_DIR}/enable-tls.yaml \
-e ${TOP_DIR}/inject-trust-anchor.yaml \
-e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
-e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
-e ${TOP_DIR}/local-environment.yaml \
-e ${TOP_DIR}/token_flush-environment.yaml \
-e ${TOP_DIR}/gnocchi_tuning.yaml \
-e ${TOP_DIR}/disable_telemetry.yaml \
"$@"

which gives:
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml
WAITING
on_breakpoint: [u'krynn-ctrl-0', u'krynn-ceph-1', u'krynn-ceph-0', u'krynn-cmpt-0']
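
The observation above suggests the failure depends on how many environment files are passed. A minimal sketch, assuming the command line is available as a single string (for example the "+ openstack overcloud update stack ..." trace printed by set -x), for counting the -e files and how many of them are *_policy.yaml. The helper names are hypothetical, and only the short -e option seen in this report is handled.

```python
import os
import shlex


def environment_files(command_line):
    """Return the path that follows each -e option on the command line."""
    tokens = shlex.split(command_line)
    paths = []
    for index, token in enumerate(tokens[:-1]):
        if token == "-e":
            paths.append(tokens[index + 1])
    return paths


def summarize(command_line):
    """Count environment files, policy files, and bytes found on disk."""
    paths = environment_files(command_line)
    policy = [path for path in paths if path.endswith("_policy.yaml")]
    # Only files that exist locally contribute to the size total.
    size = sum(os.path.getsize(path) for path in paths if os.path.isfile(path))
    return {"total": len(paths), "policy": len(policy), "bytes_on_disk": size}
```

Running summarize on the failing and on the working command line gives a quick side-by-side of how much the policy files add to what is sent to the undercloud.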

Comment 5 Sofer Athlan-Guyot 2018-08-17 17:05:57 UTC
Hi,

so the problem is certainly happening during[1]

        stack_fields = templates.process_templates(
            workflow, container=parsed_args.stack)

in overcloud_update.py in tripleoclient.

The reasoning is that it's before the do_interactive_update and that it seems linked to the number of templates to be processed.

So the relevant logs should be the mistral ones, but in the sos report I couldn't find the actual error. Note that it's missing mistral/api.log and mistral/engine.log, so maybe that's where the error is.

I'm passing this one to the workflow and api dfg for help here.

[1] can't find a place on the web where I can reference it ...

Comment 6 Vincent S. Cojot 2018-08-22 19:46:55 UTC
Hi Sofer,
I checked /var/log/mistral/engine.log and noticed this:

2018-08-22 14:32:31.596 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ComputeSriov
2018-08-22 14:32:32.144 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/computesriov-role.yaml
2018-08-22 14:32:32.181 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ComputeDpdk
2018-08-22 14:32:32.833 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/computedpdk-role.yaml
2018-08-22 14:32:32.866 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role BlockStorage
2018-08-22 14:32:32.866 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/blockstorage-role.yaml, defined in {'name': ['puppet/controller-role.yaml', 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']}
2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role ObjectStorage
2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/objectstorage-role.yaml, defined in {'name': ['puppet/controller-role.yaml', 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']}
2018-08-22 14:32:32.867 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role CephStorage
2018-08-22 14:32:32.868 2043 INFO tripleo_common.actions.templates [-] Skipping rendering of puppet/cephstorage-role.yaml, defined in {'name': ['puppet/controller-role.yaml', 'puppet/compute-role.yaml', 'puppet/blockstorage-role.yaml', 'puppet/objectstorage-role.yaml', 'puppet/cephstorage-role.yaml']}
2018-08-22 14:32:32.868 2043 INFO tripleo_common.actions.templates [-] jinja2 rendering role Networker
2018-08-22 14:32:33.443 2043 INFO tripleo_common.actions.templates [-] Writing rendered template puppet/networker-role.yaml
2018-08-22 14:32:45.684 2043 INFO swiftclient [-] REQ: curl -i https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz COPY -H "Destination: overcloud-swift-rings/swift-rings.tar.gz-1534962765" -H "X-Auth-Token: 150d888ef5f14a69..."
2018-08-22 14:32:45.685 2043 INFO swiftclient [-] RESP STATUS: 404 Not Found
2018-08-22 14:32:45.686 2043 INFO swiftclient [-] RESP HEADERS: {u'Date': u'Wed, 22 Aug 2018 18:32:45 GMT', u'Content-Length': u'70', u'Content-Type': u'text/html; charset=UTF-8', u'X-Trans-Id': u'tx55ad64c3aa6e4c37a4973-005b7dac4d'}
2018-08-22 14:32:45.686 2043 INFO swiftclient [-] RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html>
2018-08-22 14:32:45.687 2043 ERROR swiftclient [-] Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
2018-08-22 14:32:45.687 2043 ERROR swiftclient Traceback (most recent call last):
2018-08-22 14:32:45.687 2043 ERROR swiftclient   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1649, in _retry
2018-08-22 14:32:45.687 2043 ERROR swiftclient     service_token=self.service_token, **kwargs)
2018-08-22 14:32:45.687 2043 ERROR swiftclient   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1404, in copy_object
2018-08-22 14:32:45.687 2043 ERROR swiftclient     raise ClientException.from_response(resp, 'Object COPY failed', body)
2018-08-22 14:32:45.687 2043 ERROR swiftclient ClientException: Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
2018-08-22 14:32:45.687 2043 ERROR swiftclient 
2018-08-22 14:32:45.687 2043 INFO tripleo_common.actions.deployment [-] Perfoming Heat stack create
2018-08-22 14:32:56.863 2043 INFO mistral.engine.rpc_backend.rpc [-] Received RPC request 'run_action'[rpc_ctx=MistralContext {u'project_name': u'admin', u'user_id': u'3572fc8db37046d68cbea3a2eaee80ac', u'roles': [u'admin'], u'auth_uri': u'https://10.162.200.113:13000/v3', u'auth_cacert': None, u'auth_token': u'150d888ef5f14a699a088c311aedf1e6', u'expires_at': u'2018-08-22T22:30:44.000000Z', u'is_trust_scoped': False, u'service_catalog': u'[{"endpoints": [{"adminURL": "http://10.20.0.2:8080", "region": "regionOne", "internalURL": "http://10.20.0.2:8080/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a", "publicURL": "https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a"}], "type": "object-store", "name": "swift"}, {"endpoints": [{"adminURL": "http://10.20.0.2:9696", "region": "regionOne", "internalURL": "http://10.20.0.2:9696", "publicURL": "https://10.162.200.113:13696"}], "type": "network", "name": "neutron"}, {"endpoints": [{"adminURL": "ws://10.20.0.2:9000", "region": "regionOne", "internalURL": "ws://10.20.0.2:9000", "publicURL": "wss://10.162.200.113:9000"}], "type": "messaging-websocket", "name": "zaqar-websocket"}, {"endpoints": [{"adminURL": "http://10.20.0.2:8888", "region": "regionOne", "internalURL": "http://10.20.0.2:8888", "publicURL": "https://10.162.200.113:13888"}], "type": "messaging", "name": "zaqar"}, {"endpoints": [{"adminURL": "http://10.20.0.2:8989/v2", "region": "regionOne", "internalURL": "http://10.20.0.2:8989/v2", "publicURL": "https://10.162.200.113:13989/v2"}], "type": "workflowv2", "name": "mis

So there's an error related to:
 Object COPY failed: https://10.162.200.113:13808/v1/AUTH_a84a70f025ab4b0a8921df929b8fcc5a/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

Not sure if it's related.
Let me try to reproduce when the overcloud finishes updating.

Comment 7 Vincent S. Cojot 2018-08-22 19:47:34 UTC
that was /var/log/mistral/executor.log, sorry.

Comment 8 Vincent S. Cojot 2018-08-22 19:55:06 UTC
Is it possible something is getting flooded somewhere? I was checking the mistral logs and I noticed that I could 'see' the tripleo YAML policies that I'm always passing to the deployment to implement the 'read-only' role that was developed for one of our telco customers.

Comment 9 Vincent S. Cojot 2018-08-22 19:55:48 UTC
[root@instack mistral]# grep -c readonly engine.log 
1418
[root@instack mistral]# grep -c readonly api.log 
1624
[root@instack mistral]# grep -c readonly executor.log 
467

Comment 10 Dougal Matthews 2018-08-23 10:44:35 UTC
Did you manage to reproduce after the update?

Would it be possible to see the complete mistral logs?

I'm not sure that specific error is likely to be the cause, but we can learn lots from the mistral logs.

Comment 11 Vincent S. Cojot 2018-08-25 00:35:14 UTC
Hi Dougal,
I'm reproducing the bug every time on OSP10 now. In fact, I've not been able to do a minor update because of this.
Would you want a copy of /var/log/mistral? Any other logs?

Comment 12 Vincent S. Cojot 2018-08-25 00:46:25 UTC
Here goes:
[stack@instack ~]$ time bash -x ./OSP/osp10/bin/deploy24_update.sh
+ '[' /usr/bin/bash ']'
++++ whence -- ./OSP/osp10/bin/deploy24_update.sh
++++ type -p -- ./OSP/osp10/bin/deploy24_update.sh
+++ /usr/bin/dirname ./OSP/osp10/bin/deploy24_update.sh
++ cd ./OSP/osp10/bin
++ pwd
+ export PATH_SCRIPT=/home/stack/OSP/osp10/bin
+ PATH_SCRIPT=/home/stack/OSP/osp10/bin
++ cd /home/stack/OSP/osp10/bin/..
++ pwd
+ export TOP_DIR=/home/stack/OSP/osp10
+ TOP_DIR=/home/stack/OSP/osp10
+ export TRIPLEO_DIR=/usr/share/openstack-tripleo-heat-templates
+ TRIPLEO_DIR=/usr/share/openstack-tripleo-heat-templates
+ set -x
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

+ set +x
Building ansible hosts file..
# Collecting information from Nova and Heat.....................Done!
Editing /etc/hosts to add overcloud nodes...
10.20.0.111     krynn-ctrl-0.lasthome.solace.krynn krynn-ctrl-0  ctrl0
10.20.0.106     krynn-ctrl-1.lasthome.solace.krynn krynn-ctrl-1  ctrl1
10.20.0.114     krynn-ctrl-2.lasthome.solace.krynn krynn-ctrl-2  ctrl2
10.20.0.107     krynn-cmpt-0.lasthome.solace.krynn krynn-cmpt-0  cmpt0
10.20.0.109     krynn-cmpt-1.lasthome.solace.krynn krynn-cmpt-1  cmpt1
10.20.0.113     krynn-ceph-0.lasthome.solace.krynn krynn-ceph-0  ceph0
‘/home/stack/.ssh/config’ -> ‘/home/stack/.ssh/config.orig’

real    5m34.637s
user    0m19.168s
sys     0m4.472s

Comment 14 Vincent S. Cojot 2018-08-25 00:52:45 UTC
Here's what I did to keep only the last 1000 lines:
[root@instack mistral]# tail -1000 api.log  > ../mistral_new/api.log
[root@instack mistral]# tail -1000 engine.log > ../mistral_new/engine.log
[root@instack mistral]# tail -1000 executor.log > ../mistral_new/executor.log
[root@instack mistral]# tar cvzf /tmp/mistral_logs.tgz /var/log/mistral_new
tar: Removing leading `/' from member names
/var/log/mistral_new/
/var/log/mistral_new/api.log
/var/log/mistral_new/engine.log
/var/log/mistral_new/executor.log

Size is much more reasonable:
[root@instack mistral]# ls -lh /tmp/mistral_logs.tgz
-rw-r--r--. 1 root root 6.6M Aug 24 20:50 /tmp/mistral_logs.tgz

It seems those logs have -very- long lines (probably because of the huge templates I am testing with).
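
A minimal sketch, assuming the logs are plain text read line by line, for putting numbers on the "very long lines" impression and on the grep -c counts from comment 9 in one pass. line_stats is a hypothetical helper; like grep -c it counts matching lines, not individual occurrences.

```python
def line_stats(lines, needle=None):
    """Return line count, longest line length, and lines containing needle."""
    count = 0
    longest = 0
    matches = 0
    for line in lines:
        count += 1
        # Measure the line without its trailing newline.
        longest = max(longest, len(line.rstrip("\n")))
        if needle is not None and needle in line:
            matches += 1
    return {"lines": count, "longest": longest, "matches": matches}
```

An open file object can be passed directly, for example: with open("/var/log/mistral/engine.log", errors="replace") as handle: print(line_stats(handle, "readonly")).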

Comment 15 Vincent S. Cojot 2018-08-25 00:54:22 UTC
Created attachment 1478685
last 1000 lines of mistral logs

Comment 16 Vincent S. Cojot 2018-08-25 01:10:48 UTC
Also, I've noticed this:
MariaDB [mistral]> select max(length(output)) AS Max_Length_String from mistral.action_executions_v2;
+-------------------+
| Max_Length_String |
+-------------------+
|            818097 |
+-------------------+
1 row in set (0.01 sec)

Comment 17 Vincent S. Cojot 2018-08-25 01:15:59 UTC
There are lots of other places in the Mistral DB where they have gigantic rows:
MariaDB [mistral]> select count(variables) from mistral.environments_v2;
+------------------+
| count(variables) |
+------------------+
|                3 |
+------------------+
1 row in set (0.01 sec)

MariaDB [mistral]> select max(length(variables)) AS Max_Length_String from mistral.environments_v2;
+-------------------+
| Max_Length_String |
+-------------------+
|            158172 |
+-------------------+
1 row in set (0.00 sec)

Comment 18 Vincent S. Cojot 2018-08-27 15:02:47 UTC
(In reply to Dougal Matthews from comment #10)
> Did you manage to reproduce after the update?
> 
> Would it be possible to see the complete mistral logs?
> 
> I'm not sure that specific error is likely to be the cause, but we can learn
> lots from the mistral logs.

Hi Dougal,
Just to clarify: this is a 'fresh' deploy that I'm using every time I'm testing this issue.
My workflow goes like this:
- delete overcloud
- deploy overcloud using the same templates.
- attempt to do a minor update (this is where it fails).

So every time I'm doing this, the overcloud I'm trying to update is pretty 'fresh' (It has at most a few hours of idle use).

Comment 19 Dougal Matthews 2018-08-29 14:34:50 UTC
I have spent some time going through the mistral logs and couldn't find anything suspicious.

The next time you run the deployment, it would be useful to add "--debug" to the end of the CLI command. This will give us the full Python traceback at the end, which can help.

Comment 20 Vincent S. Cojot 2018-08-30 18:12:03 UTC
Ok, here we go:
[stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml --debug
[...]

Comment 21 Vincent S. Cojot 2018-08-30 18:17:04 UTC
Here's the end of the run:
REQ: curl -g -i -X GET https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 -H "User-Agent: python-heatclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}a98638b170e6fef36bae5539573cf2e6563bd3b1"
"GET /v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 HTTP/1.1" 302 429
RESP: [302] Location: https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 Content-Length: 429 Content-Type: application/json; charset=UTF-8 X-Openstack-Request-Id: req-f6442346-97e5-4995-8fe7-c9661eed634a Date: Thu, 30 Aug 2018 18:13:51 GMT
RESP BODY: {"message": "The resource was found at <a href=\"https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5\">https://10.162.200.113:13004/v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5</a>;\nyou should be redirected automatically.\n\n", "code": "302 Found", "title": "Found"}

"GET /v1/a84a70f025ab4b0a8921df929b8fcc5a/stacks/overcloud/f342a11a-7fd1-46a8-a2bb-593aafbb9134/resources?nested_depth=5 HTTP/1.1" 504 None
RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.

ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_update.py", line 92, in take_action
    update_manager.do_interactive_update()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 89, in do_interactive_update
    status, _ = self.get_status()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 67, in get_status
    resources = self._resources_by_state()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 129, in _resources_by_state
    self.stack.id, nested_depth=self.nested_depth)
  File "/usr/lib/python2.7/site-packages/heatclient/v1/resources.py", line 71, in list
    return self._list(url, "resources")
  File "/usr/lib/python2.7/site-packages/heatclient/openstack/common/apiclient/base.py", line 135, in _list
    body = self.client.get(url).json()
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 187, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 318, in request
    raise exc.from_response(resp)
HTTPException: ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

clean_up UpdateOvercloud: ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 267, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 180, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_update.py", line 92, in take_action
    update_manager.do_interactive_update()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 89, in do_interactive_update
    status, _ = self.get_status()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 67, in get_status
    resources = self._resources_by_state()
  File "/usr/lib/python2.7/site-packages/tripleo_common/_stack_update.py", line 129, in _resources_by_state
    self.stack.id, nested_depth=self.nested_depth)
  File "/usr/lib/python2.7/site-packages/heatclient/v1/resources.py", line 71, in list
    return self._list(url, "resources")
  File "/usr/lib/python2.7/site-packages/heatclient/openstack/common/apiclient/base.py", line 135, in _list
    body = self.client.get(url).json()
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 187, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 318, in request
    raise exc.from_response(resp)
HTTPException: ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


END return value: 1

Comment 22 Dougal Matthews 2018-08-31 08:47:15 UTC
So it seems the error is actually coming from heatclient/Heat, not Mistral. I'm glad I got that traceback to verify the source. I'm not sure how to debug failures like this.

I guess we need to see the Heat logs?

I think then it is best that Upgrades take another look at this one. Possibly DF? Happy to work with them if I can help.
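
The --debug output in comment 21 shows that the failing call is the Heat resource listing (resources?nested_depth=5), which returns 504 after the 302 redirect. A minimal sketch, assuming the quoted request-line format printed in that output, for pulling method, path and status out of a saved --debug log. parse_requests and failed_requests are hypothetical helper names.

```python
import re

# Matches lines such as: "GET /v1/.../resources?nested_depth=5 HTTP/1.1" 504 None
REQUEST_LINE = re.compile(
    r'^"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/\d\.\d" '
    r'(?P<status>\d{3}) (?P<length>\S+)$'
)


def parse_requests(lines):
    """Yield (method, path, status) for every request line found."""
    for line in lines:
        match = REQUEST_LINE.match(line.strip())
        if match:
            yield match.group("method"), match.group("path"), int(match.group("status"))


def failed_requests(lines, minimum_status=500):
    """Return the requests whose HTTP status is at or above minimum_status."""
    return [req for req in parse_requests(lines) if req[2] >= minimum_status]
```

Applied to the saved output, failed_requests narrows a long --debug log down to the calls that the gateway or the service rejected, which is the information that pointed at Heat rather than Mistral here.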

Comment 24 Lukas Bezdicka 2018-09-10 14:06:01 UTC
Hi, I was unable to reproduce this, and I agree with Dougal that this came from the heat client, which points to keystone or heat. How big is the machine in terms of resources? Can we get a reproducer or logs from heat too? Thanks

Comment 25 Vincent S. Cojot 2018-09-12 03:33:31 UTC
Hi Lukas,
The undercloud VM I am using has 24 GB of RAM and 4 vCPUs.
Which logs would you like to have, specifically?
Would giving you a tarball of the templates help?

Comment 26 Lukas Bezdicka 2018-09-12 08:13:42 UTC
Hi, I'd need all the /var/log/heat logs and ideally zaqar, httpd and keystone as bonus. Thanks

Comment 27 Vincent S. Cojot 2018-09-12 16:21:05 UTC
Hi Lukas,
Thanks for getting back to me.
Ok, so /var/log/heat, /var/log/httpd, /var/log/zaqar and /var/log/keystone. Anything else? A sosreport?

Comment 28 Lukas Bezdicka 2018-09-13 10:31:20 UTC
Just those, please make sure you reboot UC if you are doing openstack undercloud upgrade.

Comment 29 Vincent S. Cojot 2018-09-19 17:20:13 UTC
(In reply to Lukas Bezdicka from comment #28)
> Just those, please make sure you reboot UC if you are doing openstack
> undercloud upgrade.

Hi Lukas,
I always do that, no worries (rebooting the undercloud). Especially when the 'undercloud upgrade' brings in a new kernel. :)

Vincent

Comment 30 Vincent S. Cojot 2018-09-19 17:22:43 UTC
I've upgraded to OSP10z9, I'm currently re-deploying with:
1 x controller
2 x compute HCI (Ceph OSDs)
1 x ceph-storage

This will make it easier to sift through logs.

Comment 31 Vincent S. Cojot 2018-09-19 19:19:43 UTC
[stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

+ set +x
Building ansible hosts file..
# Collecting information from Nova and Heat...............Done!
Editing /etc/hosts to add overcloud nodes...
10.20.0.111     krynn-ctrl-0.lasthome.solace.krynn krynn-ctrl-0  ctrl0
10.20.0.114     krynn-cmpt-0.lasthome.solace.krynn krynn-cmpt-0  cmpt0
10.20.0.112     krynn-cmpt-1.lasthome.solace.krynn krynn-cmpt-1  cmpt1
10.20.0.105     krynn-ceph-0.lasthome.solace.krynn krynn-ceph-0  ceph0
‘/home/stack/.ssh/config’ -> ‘/home/stack/.ssh/config.orig’

Comment 32 Vincent S. Cojot 2018-09-19 19:20:07 UTC
Reproduced on OSP10z9, attaching logs...

Comment 33 Vincent S. Cojot 2018-09-19 19:23:35 UTC
[root@instack ~]# zip -9r /tmp/logs.zip /var/log/heat /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone
updating: var/log/heat/ (stored 0%)
updating: var/log/heat/heat-api.log (deflated 95%)
updating: var/log/heat/heat-engine.log
        zip warning:  file size changed while zipping /var/log/heat/heat-engine.log
 (deflated 95%)
updating: var/log/heat/heat-api-cfn.log (deflated 91%)
updating: var/log/mistral/ (stored 0%)
updating: var/log/mistral/engine.log (deflated 86%)
updating: var/log/mistral/api.log (deflated 85%)
updating: var/log/mistral/executor.log (deflated 86%)
updating: var/log/zaqar/ (stored 0%)
updating: var/log/zaqar/zaqar.log (deflated 92%)
updating: var/log/httpd/ (stored 0%)
updating: var/log/httpd/tripleo-ui_error.log (stored 0%)
updating: var/log/httpd/default_error.log (stored 0%)
updating: var/log/httpd/ipxe_vhost_error.log (stored 0%)
updating: var/log/httpd/access_log (stored 0%)
updating: var/log/httpd/keystone_wsgi_admin_access.log (deflated 94%)
updating: var/log/httpd/error_log (deflated 78%)
updating: var/log/httpd/keystone_wsgi_admin_error.log (stored 0%)
updating: var/log/httpd/nova_api_wsgi_access.log (deflated 97%)
updating: var/log/httpd/keystone_wsgi_main_access.log (deflated 97%)
updating: var/log/httpd/nova_api_wsgi_error.log (deflated 94%)
updating: var/log/httpd/keystone_wsgi_main_error.log (stored 0%)
updating: var/log/httpd/tripleo-ui_access.log (stored 0%)
updating: var/log/httpd/ipxe_vhost_access.log (deflated 93%)
updating: var/log/keystone/ (stored 0%)
updating: var/log/keystone/keystone.log (deflated 94%)

Comment 34 Vincent S. Cojot 2018-09-19 19:30:18 UTC
I had to split the logs:
zip -9qr /tmp/logs1.zip /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone
zip -9qr /tmp/logs2.zip /var/log/heat

Comment 35 Vincent S. Cojot 2018-09-19 19:30:51 UTC
Created attachment 1484903 [details]
zip -9qr /tmp/logs1.zip /var/log/mistral /var/log/zaqar /var/log/httpd /var/log/keystone

Comment 36 Vincent S. Cojot 2018-09-19 19:31:29 UTC
Created attachment 1484904 [details]
zip -9qr /tmp/logs2.zip /var/log/heat

Comment 37 Vincent S. Cojot 2018-09-19 19:32:41 UTC
Also, please make note of the following:

[root@instack heat]# wc -l heat-engine.log 
468491 heat-engine.log

[root@instack heat]# ls -lh heat-engine.log
-rw-r--r--. 1 heat heat 235M Sep 19 15:31 heat-engine.log
[root@instack heat]# head -1 heat-engine.log

2018-09-19 03:32:52.355 10902 DEBUG heat.engine.service [req-061419c6-bcb6-4393-8113-84a62c79dc49 - - - - -] Service f5a638e8-ae2b-4a55-bed4-d7927f9add8a is updated service_manage_report /usr/lib/python2.7/site-packages/

This is a huge logfile.

Comment 38 Vincent S. Cojot 2018-09-20 16:10:01 UTC
I may have found something (check the log below). The only difference is a 'cd' into OSP/osp10 -before- running the update:

RUN #1
[stack@osp5p ~]$ ./OSP/osp10/bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

+ set +x

RUN #2:
[stack@osp5p ~]$ cd OSP/osp10
[stack@osp5p osp10]$ ./bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
WAITING
on_breakpoint: [u'krynn-ceph-0', u'krynn-cmpt-1', u'krynn-ctrl-0', u'krynn-cmpt-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 35cb0bd0-70a6-4846-9a59-69cda1d66214), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
WAITING
completed: [u'krynn-cmpt-0']
on_breakpoint: [u'krynn-ceph-0', u'krynn-cmpt-1', u'krynn-ctrl-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear ce55271d-e148-4517-be4c-829e01fd52a0), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS

Comment 39 Vincent S. Cojot 2018-09-21 17:09:48 UTC
Well, that was a false positive (the 'cd' thing). I'm still getting failures:

[stack@instack ~]$ ./OSP/osp10/bin/deploy24_update.sh 
[...]
+ cd /home/stack/OSP/osp10
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

+ set +x
Building ansible hosts file..

Comment 40 Sofer Athlan-Guyot 2018-11-21 15:28:33 UTC
Hi Vincent,

I'm not sure I read your last log correctly. It seems you still cd into /home/stack/OSP/osp10 before running the update.

Just to make sure: you *should* (I think you actually have to) run that command from the $HOME directory.

Can you confirm that your script is adjusted so that the update command is run from the $HOME directory?

Thanks,

Comment 41 Vincent S. Cojot 2019-01-20 22:55:17 UTC
Hi,
No progress on my side. I've made sure to run everything from $HOME and it still fails:
[stack@osp5p ~]$ ./OSP/osp10/bin/deploy24_update.sh 
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/gnocchi_tuning.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml -e /home/stack/OSP/osp10/aodh_policy.yaml -e /home/stack/OSP/osp10/ceilometer_policy.yaml -e /home/stack/OSP/osp10/cinder_policy.yaml -e /home/stack/OSP/osp10/glance_policy.yaml -e /home/stack/OSP/osp10/gnocchi_policy.yaml -e /home/stack/OSP/osp10/heat_policy.yaml -e /home/stack/OSP/osp10/ironic_policy.yaml -e /home/stack/OSP/osp10/keystone_policy.yaml -e /home/stack/OSP/osp10/manila_policy.yaml -e /home/stack/OSP/osp10/mistral_policy.yaml -e /home/stack/OSP/osp10/neutron_policy.yaml -e /home/stack/OSP/osp10/nova_policy.yaml -e /home/stack/OSP/osp10/sahara_policy.yaml -e /home/stack/OSP/osp10/zaqar_policy.yaml
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

[stack@osp5p ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| b4170f46-8afc-4bf3-9436-d60355156fef | overcloud  | UPDATE_IN_PROGRESS | 2019-01-20T21:38:20Z | 2019-01-20T22:48:52Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+

(This was on a fresh deploy of OSP10z9):
[stack@osp5p ~]$ rpm -qa rhosp\*
rhosp-director-images-ipa-10.0-20180821.1.el7ost.noarch
rhosp-director-images-10.0-20180821.1.el7ost.noarch

Comment 42 Lukas Bezdicka 2019-01-21 10:08:26 UTC
How long does it take to run this command:

openstack stack resource list overcloud -n5 > resources

Comment 43 Vincent S. Cojot 2019-01-21 16:17:58 UTC
Hi Lukas,

[stack@osp5p ~]$ time openstack stack resource list overcloud -n5 > resources

real    0m34.875s
user    0m1.014s
sys     0m0.156s

[stack@osp5p ~]$ wc -l resources
1027 resources

Comment 50 Vincent S. Cojot 2019-01-21 18:42:16 UTC
Hi Lukas,
Here's a summary of what I'm currently doing:

- Fresh OSP10z9 undercloud (24 GB RAM, 4 vCPUs), same templates/tooling as before.
- I ran a fresh 1 ctrl + 1 cmpt + ceph deploy, without the readonly policies in place.
===============================================================================================
time openstack overcloud deploy \
--templates \
--validation-errors-fatal \
-r ${TOP_DIR}/roles_data.yaml \
-e ${TOP_DIR}/node_info_micro.yaml \
-e ${TRIPLEO_DIR}/environments/network-isolation.yaml \
-e ${TRIPLEO_DIR}/environments/storage-environment.yaml \
-e ${TRIPLEO_DIR}/environments/ceph-radosgw.yaml \
-e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
-e ${TOP_DIR}/rhel-registration-environment.yaml \
-e ${TOP_DIR}/storage-environment.yaml \
-e ${TOP_DIR}/krynn-environment.yaml \
-e ${TOP_DIR}/extraconfig-environment.yaml \
-e ${TOP_DIR}/enable-tls.yaml \
-e ${TOP_DIR}/inject-trust-anchor.yaml \
-e ${TRIPLEO_DIR}/environments/tls-endpoints-public-ip.yaml \
-e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
-e ${TOP_DIR}/local-environment.yaml \
-e ${TOP_DIR}/token_flush-environment.yaml \
-e ${TOP_DIR}/disable_telemetry.yaml \
"$@" || exit 127
===============================================================================================
(Note the absence of the previous *_policy.yaml files).

- I checked the logs and the error didn't show up:
[root@osp5p ~]# journalctl --no-pager --all --boot|grep -i YaqlEvaluationException
[root@osp5p ~]# 

But then I noticed this:
[root@osp5p ~]# journalctl --no-pager --all --boot|grep ERROR.*Exception
Jan 21 09:29:15 osp5p mistral-server[5425]: 2019-01-21 09:29:15.465 5425 ERROR swiftclient     raise ClientException.from_response(resp, 'Object COPY failed', body)
Jan 21 09:29:15 osp5p mistral-server[5425]: 2019-01-21 09:29:15.465 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
Jan 21 09:54:42 osp5p mistral-server[5425]: 2019-01-21 09:54:42.318 5425 ERROR swiftclient     raise ClientException.from_response(resp, 'Object COPY failed', body)
Jan 21 09:54:42 osp5p mistral-server[5425]: 2019-01-21 09:54:42.318 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
Jan 21 11:41:33 osp5p mistral-server[5425]: 2019-01-21 11:41:33.415 5425 ERROR swiftclient     raise ClientException.from_response(resp, 'Object COPY failed', body)
Jan 21 11:41:33 osp5p mistral-server[5425]: 2019-01-21 11:41:33.415 5425 ERROR swiftclient ClientException: Object COPY failed: https://10.20.0.3:13808/v1/AUTH_e3fb0c660e384a15b748228ab215d6be/overcloud-swift-rings/swift-rings.tar.gz 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

(This was from this morning's attempts)

My update CLI this afternoon is like this:
===============================================================================================
[stack@osp5p ~]$ time ./OSP/osp10/bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml
starting package update on stack overcloud
IN_PROGRESS
WAITING
on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
WAITING
on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'krynn-ceph-0']
===============================================================================================
In retrospect, removing the readonly-role policy YAMLs seems to make it work properly and not abort with a 504.
The only difference between this deploy and those that failed to update a little earlier is the removal of these files:

[stack@osp5p osp10]$ git diff bin/deploy24_micro.sh
diff --git a/osp10/bin/deploy24_micro.sh b/osp10/bin/deploy24_micro.sh
index 419de66..3940515 100755
--- a/osp10/bin/deploy24_micro.sh
+++ b/osp10/bin/deploy24_micro.sh
@@ -31,22 +31,7 @@ time openstack overcloud deploy \
 -e ${TRIPLEO_DIR}/environments/hyperconverged-ceph.yaml \
 -e ${TOP_DIR}/local-environment.yaml \
 -e ${TOP_DIR}/token_flush-environment.yaml \
--e ${TOP_DIR}/gnocchi_tuning.yaml \
 -e ${TOP_DIR}/disable_telemetry.yaml \
--e ${TOP_DIR}/aodh_policy.yaml \
--e ${TOP_DIR}/ceilometer_policy.yaml \
--e ${TOP_DIR}/cinder_policy.yaml \
--e ${TOP_DIR}/glance_policy.yaml \
--e ${TOP_DIR}/gnocchi_policy.yaml \
--e ${TOP_DIR}/heat_policy.yaml \
--e ${TOP_DIR}/ironic_policy.yaml \
--e ${TOP_DIR}/keystone_policy.yaml \
--e ${TOP_DIR}/manila_policy.yaml \
--e ${TOP_DIR}/mistral_policy.yaml \
--e ${TOP_DIR}/neutron_policy.yaml \
--e ${TOP_DIR}/nova_policy.yaml \
--e ${TOP_DIR}/sahara_policy.yaml \
--e ${TOP_DIR}/zaqar_policy.yaml \
 "$@" || exit 127
 
 set +x

Please note that these policies are supposed to be backward-compatible with the default policies provided with RHOSP10. Since it seems they break something (updates), I'll do more research on my end.

Comment 51 Lukas Bezdicka 2019-01-21 18:49:08 UTC
Can you provide the templates? If it's the Yaql issue, it would probably be a typo in the templates. Also, I don't think the swift error should be the cause. Does it fail every time, or does a rerun always work?

Comment 52 Vincent S. Cojot 2019-01-21 18:57:41 UTC
Hi Lukas,

With the *policy.yaml files, behaviour is like this 100% of the time:
- deploy succeeds (CREATE_COMPLETE)
- deploy update succeeds (run the same deploy command): UPDATE_COMPLETE
- stack minor update (openstack overcloud update stack -i overcloud) fails with the 504 error.

Without the *policy.yaml files, everything works as expected:
- deploy succeeds (CREATE_COMPLETE)
- deploy update succeeds (run the same deploy command): UPDATE_COMPLETE
- stack minor update (openstack overcloud update stack -i overcloud) seems to work as it is supposed to:



[stack@osp5p ~]$ time ./OSP/osp10/bin/deploy24_update.sh
(II) Using /home/stack/OSP/osp10/enable-tls.yaml...
(II) Using /home/stack/OSP/osp10/inject-trust-anchor.yaml...
(II) Using /home/stack/OSP/osp10/local-environment.yaml...
(II) Using /home/stack/OSP/osp10/overcloud.pem...
(II) Using /home/stack/OSP/osp10/undercloud.pem...
+ yes ''
+ openstack overcloud update stack -i overcloud -e /home/stack/OSP/osp10/node_info_small.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-radosgw.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/hyperconverged-ceph.yaml -e /home/stack/OSP/osp10/local-environment.yaml -e /home/stack/OSP/osp10/token_flush-environment.yaml -e /home/stack/OSP/osp10/disable_telemetry.yaml
starting package update on stack overcloud
IN_PROGRESS
WAITING
on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0', u'krynn-ceph-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear ca31e96c-a9e0-44f0-8f2a-4b6c0a74ae86), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'krynn-ceph-0']
on_breakpoint: [u'krynn-ctrl-0', u'krynn-cmpt-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear c3c39e32-af25-48d2-baaf-e03e85b816a0), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'krynn-cmpt-0', u'krynn-ceph-0']
on_breakpoint: [u'krynn-ctrl-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 914abd21-fefc-4b7f-b992-552022eb1bed), no=cancel update, C-c=quit interactive mode: IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS

I'll be providing the templates shortly.

Comment 54 Vincent S. Cojot 2019-01-22 16:03:05 UTC
(In reply to Lukas Bezdicka from comment #51)
> Can you provide the templates? If it's the Yaql issue, it would probably be
> a typo in the templates. Also, I don't think the swift error should be the
> cause. Does it fail every time, or does a rerun always work?

Also, I'd like to clarify something: a re-run isn't actually a re-run. I wouldn't be able to re-run, since the stack itself stays in UPDATE_IN_PROGRESS until it times out 4 hours later.
When I test running the update again, I need to delete the stack, re-create it, and only -then- can I attempt the overcloud update again.
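
Since a stack left in UPDATE_IN_PROGRESS blocks any further attempt, it can help to check the status before retrying. A minimal sketch follows; the helper name stack_status is made up for illustration, and it assumes the table layout shown in the heat stack-list output in this report (id, stack_name, stack_status as the first three columns). Other client versions may order the columns differently.

```shell
# Hypothetical helper: read a stack list table on stdin and print the
# status of the stack whose name is given as $1.
# Assumes columns in the order: id | stack_name | stack_status | ...
stack_status() {
    awk -F'|' -v name="$1" '
        {
            n = $3; s = $4          # stack_name and stack_status cells
            gsub(/[ \t]/, "", n)    # strip the table padding
            gsub(/[ \t]/, "", s)
            if (n == name) print s
        }'
}
```

Usage would be along the lines of: heat stack-list | stack_status overcloud. Nothing is printed if no stack with that name is found.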

Comment 55 Vincent S. Cojot 2019-01-22 22:43:00 UTC
The mystery thickens. Following yet another 504 failure, here's what I got:


[stack@osp5p osp10]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| 656a250b-65dc-45d6-8bb2-ec02fd1d1553 | overcloud  | UPDATE_IN_PROGRESS | 2019-01-22T21:03:44Z | 2019-01-22T22:17:18Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+

[stack@osp5p osp10]$ cd

[stack@osp5p ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

[stack@osp5p ~]$ time openstack overcloud update stack -i overcloud
WAITING
on_breakpoint: [u'krynn-cmpt-0', u'krynn-ceph-0', u'krynn-ctrl-0', u'krynn-cmpt-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 235043d9-e640-40e3-887d-01ffb4fa6777), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS

Comment 56 Vincent S. Cojot 2019-01-30 21:34:13 UTC
I've been hitting brick walls on this. There is no final answer on which TripleO policy YAML is causing this.
I'm starting to think it could be related to the quantity of YAML lines passed to the deploy.
Even policy YAMLs that puppet does not consume in OSP10 seem to flip the 504/UPDATE behaviour (removing one or two makes the update work; keeping those unconsumed files results in a 504).
I'm talking about these files:

[raistlin@daltigoth ~/World/Vincent/Code/GIT]$ ls -la OSP/osp10/*policy.yaml
-rw-r--r--. 1 raistlin users  2452 May  3  2018 OSP/osp10/aodh_policy.yaml
-rw-r--r--. 1 raistlin users  1985 May  3  2018 OSP/osp10/ceilometer_policy.yaml
-rw-r--r--. 1 raistlin users 20731 May  3  2018 OSP/osp10/cinder_policy.yaml
-rw-r--r--. 1 raistlin users  6578 May  3  2018 OSP/osp10/glance_policy.yaml
-rw-r--r--. 1 raistlin users  5119 May  3  2018 OSP/osp10/gnocchi_policy.yaml
-rw-r--r--. 1 raistlin users 14590 May  3  2018 OSP/osp10/heat_policy.yaml
-rw-r--r--. 1 raistlin cdrom    94 Jun 11  2018 OSP/osp10/ironic_policy.yaml
-rw-r--r--. 1 raistlin users 28352 May  3  2018 OSP/osp10/keystone_policy.yaml
-rw-r--r--. 1 raistlin users 17857 May  3  2018 OSP/osp10/manila_policy.yaml
-rw-r--r--. 1 raistlin users  7552 May  3  2018 OSP/osp10/mistral_policy.yaml
-rw-r--r--. 1 raistlin cdrom 30807 Jun 11  2018 OSP/osp10/neutron_policy.yaml
-rw-r--r--. 1 raistlin users 48342 May  3  2018 OSP/osp10/nova_policy.yaml
-rw-r--r--. 1 raistlin users 10598 May  3  2018 OSP/osp10/sahara_policy.yaml
-rw-r--r--. 1 raistlin users  5203 May  3  2018 OSP/osp10/zaqar_policy.yaml
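
To put a number on the "quantity of YAML" idea, here is a minimal sketch for totalling the size of a set of environment files. The helper name total_bytes is made up for illustration and is not part of TripleO; it only sums file sizes and says nothing about whether size is really the trigger.

```shell
# Hypothetical helper: print the combined size in bytes of the files
# given as arguments. Returns non-zero if a file cannot be read.
total_bytes() {
    local total=0 size f
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    echo "$total"
}
```

For example: total_bytes OSP/osp10/*_policy.yaml. Going by the sizes in the listing above, the 14 policy files add up to 200260 bytes.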

Comment 68 Lukas Bezdicka 2019-02-19 17:43:37 UTC
Agreed with Zane: the root cause was most likely an haproxy timeout. It looks like we could close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1391375 and https://bugzilla.redhat.com/show_bug.cgi?id=1289315

The solution would be to increase the worker count, which happens in later OSP10 releases, and also to change the timeout options in /etc/haproxy/haproxy.cfg.
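
As a rough sketch, the timeouts haproxy is currently configured with can be listed like this. The function name haproxy_timeouts is made up here, and the path used in the usage note is an assumption (the usual RHEL location); adjust it to your setup.

```shell
# Hypothetical helper: print the name and value of every "timeout"
# directive found in the HAProxy config file given as $1.
haproxy_timeouts() {
    awk '$1 == "timeout" { print $2, $3 }' "$1"
}
```

Usage would be something like: haproxy_timeouts /etc/haproxy/haproxy.cfg. On a director-managed undercloud, manual edits to that file may be overwritten the next time the configuration is regenerated, so a hieradata override is probably the safer place for a permanent change.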

Comment 69 Lukas Bezdicka 2019-02-21 10:33:08 UTC
Closing as a duplicate. Sadly, I can only suggest increasing the haproxy timeouts if you happen to hit this issue.

*** This bug has been marked as a duplicate of bug 1391375 ***

Comment 70 Vincent S. Cojot 2019-02-21 20:45:49 UTC
For the record and to help others, I've had good results using the following changes (this speeds up the update as well as the original deploy):

[stack@osp5p ~]$ grep overr undercloud.conf 
hieradata_override = /home/stack/OSP/osp10/undercloud-override.yaml

[stack@osp5p ~]$ cat /home/stack/OSP/osp10/undercloud-override.yaml
# HAProxy timeouts
tripleo::haproxy::ssl_cipher_suite: "!SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES:!MEDIUM"
tripleo::haproxy::ssl_options: 'no-sslv3 no-tls-tickets'
tripleo::haproxy::haproxy_global_maxconn: 65536
tripleo::haproxy::haproxy_default_maxconn: 16384
tripleo::haproxy::haproxy_default_timeout:
  - 'http-request 10s'
  - 'queue 3m'
  - 'connect 10s'
  - 'client 5m'
  - 'server 5m'
  - 'check 10s'
# DNS Domain
nova::network::neutron::dhcp_domain: lasthome.solace.krynn
neutron::dns_domain: lasthome.solace.krynn
# Memcached
aodh::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
barbican::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
ceilometer::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
cinder::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
congress::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
ec2api::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
glance::api::authtoken::memcached_servers: "127.0.0.1:11211"
gnocchi::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
heat::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
heat::cache::memcache_servers: "127.0.0.1:11211"
horizon::cache_server_ip: "127.0.0.1:11211"
ironic::api::authtoken::memcached_servers: "127.0.0.1:11211"
ironic::inspector::authtoken::memcached_servers: "127.0.0.1:11211"
keystone::cache_memcache_servers: "127.0.0.1:11211"
manila::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
mistral::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
neutron::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
nova::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
nova::cache::memcache_servers: "127.0.0.1:11211"
panko::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
sahara::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
swift::proxy::authtoken::memcache_servers: "127.0.0.1:11211"
swift::proxy::cache::memcache_servers: "127.0.0.1:11211"
tacker::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
zaqar::keystone::authtoken::memcached_servers: "127.0.0.1:11211"
swift::objectexpirer::memcached_servers: "127.0.0.1:11211"
# Workers
heat::api::workers: 4
heat::api_cfn::workers: 4
heat::engine::num_engine_workers: 4
heat::rpc_response_timeout: 600
nova::compute::ironic::max_concurrent_builds: 4
nova::rpc_response_timeout: '600'
ironic::config:
  'DEFAULT/rpc_thread_pool_size': value => 8
ironic::rpc_response_timeout: 600
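
An override file of this length is easy to get subtly wrong, for instance by repeating a key. A quick check for repeated top-level keys could look like the sketch below; the helper name duplicate_keys is made up for illustration, and it only looks at unindented "key:" lines rather than parsing YAML properly.

```shell
# Hypothetical helper: list top-level keys that appear more than once
# in the hieradata file given as $1. Indented lines, list items and
# comments are ignored.
duplicate_keys() {
    awk '/^[^[:space:]#-]/ { key = $1; sub(/:$/, "", key); print key }' "$1" \
        | sort | uniq -d
}
```

Usage would be something like: duplicate_keys /home/stack/OSP/osp10/undercloud-override.yaml. No output means no repeated top-level key was found.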

