Bug 1693834 - Deployment workflow: mistral action error because heartbeat wasn't received
Summary: Deployment workflow: mistral action error because heartbeat wasn't received
Keywords:
Status: CLOSED DUPLICATE of bug 1700044
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 15.0 (Stein)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: Emilien Macchi
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Duplicates: 1693426
Depends On:
Blocks:
Reported: 2019-03-28 18:26 UTC by Alistair Tonner
Modified: 2023-02-22 23:02 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-13 12:55:18 UTC
Target Upstream Version:
Embargoed:


Attachments
deployment shell script to install openstack in virt env (8.58 KB, application/x-shellscript)
2019-03-28 18:28 UTC, Alistair Tonner


Links
System                 ID         Private  Priority  Status     Summary                                                          Last Updated
Launchpad              1821611    0        None      None       None                                                             2019-04-01 11:51:39 UTC
OpenStack gerrit       647531     0        'None'    MERGED     Add more options to configure Heartbeat options                  2021-02-04 14:19:44 UTC
OpenStack gerrit       647597     0        'None'    MERGED     mistral: configure heartbeat parameters to avoid action timeout  2021-02-04 14:19:44 UTC
OpenStack gerrit       651802     0        'None'    ABANDONED  undercloud: increase MistralCheckInterval to 240s                2021-02-04 14:19:44 UTC
Red Hat Issue Tracker  OSP-11039  0        None      None       None                                                             2021-11-24 16:56:51 UTC

Description Alistair Tonner 2019-03-28 18:26:15 UTC
Description of problem:
  
Deploying the overcloud fails during TASK [Debug output for task Start Containers for step 2]
with: No such key: \"tripleo::oslo_messaging_rpc::mysql_user
Several of these errors stack up, followed by this traceback:

Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 30, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 949, in take_action
    verbosity=self.app_args.verbose_level)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 327, in config_download
    raise exceptions.DeploymentError("Overcloud configuration failed.")
 


Version-Release number of selected component (if applicable):

RHEL8 -> 1830 image

puddle: RHOS_TRUNK-15.0-RHEL-8-20190328.n.1

ansible-role-tripleo-modify-image.noarch      1.0.1-0.20190322190304.c0bcc3c.el8ost                @rhelosp-15.0-trunk
ansible-tripleo-ipsec.noarch                  9.0.1-0.20190220162047.f60ad6c.el8ost                @rhelosp-15.0-trunk
openstack-tripleo-common.noarch               10.6.1-0.20190327210341.25250f0.el8ost               @rhelosp-15.0-trunk
openstack-tripleo-common-containers.noarch    10.6.1-0.20190327210341.25250f0.el8ost               @rhelosp-15.0-trunk
openstack-tripleo-heat-templates.noarch       10.4.1-0.20190328020342.b79f438.el8ost               @rhelosp-15.0-trunk
openstack-tripleo-image-elements.noarch       10.3.1-0.20190325204940.253fe88.el8ost               @rhelosp-15.0-trunk
openstack-tripleo-puppet-elements.noarch      10.2.1-0.20190327211339.0f6cacb.el8ost               @rhelosp-15.0-trunk
openstack-tripleo-validations.noarch          10.3.1-0.20190326150349.de9812b.el8ost               @rhelosp-15.0-trunk
puppet-tripleo.noarch                         10.3.1-0.20190327210329.5b176cb.el8ost               @rhelosp-15.0-trunk
python3-tripleo-common.noarch                 10.6.1-0.20190327210341.25250f0.el8ost               @rhelosp-15.0-trunk
python3-tripleoclient.noarch                  11.3.1-0.20190328080340.0132e7d.el8ost               @rhelosp-15.0-trunk
python3-tripleoclient-heat-installer.noarch   11.3.1-0.20190328080340.0132e7d.el8ost               @rhelosp-15.0-trunk


How reproducible:

Consistent between 4 runs with the attached deploy script and referenced patch/workarounds.


Steps to Reproduce:
Run the attached deploy script.

Actual results:

2019-03-28 17:55:19.385 198105 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud [  admin] Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 30, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 949, in take_action
    verbosity=self.app_args.verbose_level)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 327, in config_download
    raise exceptions.DeploymentError("Overcloud configuration failed.")
tripleoclient.exceptions.DeploymentError: Overcloud configuration failed.
2019-03-28 17:55:19.386 198105 ERROR openstack [  admin] Overcloud configuration failed.





Expected results:

Overcloud deploys successfully


Additional info:

Comment 1 Alistair Tonner 2019-03-28 18:28:36 UTC
Created attachment 1549187
deployment shell script to install openstack in virt env

Comment 2 Emilien Macchi 2019-04-01 11:50:02 UTC
The actual bug is: https://bugs.launchpad.net/tripleo/+bug/1821611, I'll rename this bug.

Comment 3 Emilien Macchi 2019-04-01 11:53:08 UTC
The bug at this stage is supposed to be fixed, but I've seen it again with John Fulton when he tried to deploy Ceph. I'll keep it open until we are sure it's fixed.

Comment 4 Emilien Macchi 2019-04-01 12:56:21 UTC
*** Bug 1693426 has been marked as a duplicate of this bug. ***

Comment 8 John Fulton 2019-04-05 11:43:32 UTC
Did a stack update with increased timeouts but hit the same issue, just after more time. 

TASK [Start containers for step 5] *********************************************     
Thursday 04 April 2019  22:09:29 +0000 (0:00:00.181)       0:39:15.130 ********
ok: [overcloud-controller-2] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
                      
Timed out waiting for messages from Execution (ID: 3cbcbde8-6e6e-4cbc-b517-1e638b989067, State: RUNNING). The WebSocket timed out before the Workflow completed.
Exception occured while running the command
Traceback (most recent call last):                                                    
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 949, in take_action
    verbosity=self.app_args.verbose_level)                                      
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 321, in config_download
    for payload in base.wait_for_messages(workflow_client, ws, execution):      
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/base.py", line 61, in wait_for_messages
    for payload in websocket.wait_for_messages(timeout=timeout):             
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 153, in wait_for_messages
    message = self.recv()                                          
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 131, in recv
    return json.loads(self._ws.recv())
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads                 
    return _default_decoder.decode(s)    
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode                   
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode      
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)                                                                                                                     
                          
During handling of the above exception, another exception occurred:                                                                                                                         
                       
Traceback (most recent call last):                                                                                                                                                          
  File "/usr/lib/python3.6/site-packages/websocket/_socket.py", line 81, in recv
    bytes_ = sock.recv(bufsize)                                                                                                                                                             
  File "/usr/lib64/python3.6/ssl.py", line 953, in recv
    return self.read(buflen)             
  File "/usr/lib64/python3.6/ssl.py", line 830, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 589, in read
    v = self._sslobj.read(len)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
                                                                                
Traceback (most recent call last):                                             
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 153, in wait_for_messages                                                                                           
    message = self.recv()                                                                              
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 131, in recv                   
    return json.loads(self._ws.recv())
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 310, in recv 
    opcode, data = self.recv_data()                                            
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 327, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 340, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 374, in recv_frame
    return self.frame_buffer.recv_frame()                                      
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 361, in recv_frame                                                                         
    self.recv_header()
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 309, in recv_header                                                                          
    header = self.recv_strict(2)           
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 396, in recv_strict
    bytes_ = self.recv(min(16384, shortage))                                                            
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 449, in _recv
    return recv(self.sock, bufsize)                                                                          
  File "/usr/lib/python3.6/site-packages/websocket/_socket.py", line 84, in recv
    raise WebSocketTimeoutException(message)                                                            
websocket._exceptions.WebSocketTimeoutException: The read operation timed out
                                                                                                 
During handling of the above exception, another exception occurred:
                                                                                    
Traceback (most recent call last):    
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 30, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)     
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0                
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 953, in take_action                                                                                    
    plan=stack.stack_name)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 419, in set_deployment_status                                                                         
    _WORKFLOW_TIMEOUT):
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/base.py", line 61, in wait_for_messages                                                                                    
    for payload in websocket.wait_for_messages(timeout=timeout):                
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 158, in wait_for_messages                                                                                           
    raise exceptions.WebSocketTimeout()                
tripleoclient.exceptions.WebSocketTimeout
                                                       
                                         
real    246m4.751s                                     
user    0m24.449s             
sys     0m1.520s                            
(undercloud) [stack@undercloud-0 ~]$

(undercloud) [stack@undercloud-0 ~]$ sudo grep -A 4  action_heartbeat /var/lib/config-data/puppet-generated/mistral/etc/mistral/mistral.conf 
[action_heartbeat]
max_missed_heartbeats=30
check_interval=40
first_heartbeat_timeout=8200

(undercloud) [stack@undercloud-0 ~]$ 
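
Note: as a rough aid for reading the [action_heartbeat] values above, the sketch below parses that section and converts the settings to time windows. It assumes (from the option names, not from anything confirmed in this bug) that max_missed_heartbeats multiplied by check_interval approximates how long an action can go without a heartbeat before Mistral fails it, and that first_heartbeat_timeout is a separate grace period for the first heartbeat. The function name heartbeat_windows and the result keys are made up for illustration.

```python
import configparser

# Values copied verbatim from the grep output above.
MISTRAL_CONF = """
[action_heartbeat]
max_missed_heartbeats=30
check_interval=40
first_heartbeat_timeout=8200
"""


def heartbeat_windows(conf_text):
    """Return the heartbeat time windows, in seconds, from mistral.conf text.

    ASSUMPTION: missed-heartbeat tolerance is taken to be
    max_missed_heartbeats * check_interval; this is an interpretation of
    the option names, not something stated in the bug.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["action_heartbeat"]
    missed = section.getint("max_missed_heartbeats")
    interval = section.getint("check_interval")
    first = section.getint("first_heartbeat_timeout")
    return {
        "missed_window_s": missed * interval,
        "first_heartbeat_timeout_s": first,
    }


windows = heartbeat_windows(MISTRAL_CONF)
for name, seconds in windows.items():
    print(f"{name}: {seconds} s (~{seconds / 60:.1f} min)")
```

Under that assumption, the settings shown here tolerate 30 x 40 = 1200 seconds of missed heartbeats.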

(undercloud) [stack@undercloud-0 ~]$ cat deploy_ceph.sh 
#!/bin/bash

export THT=/usr/share/openstack-tripleo-heat-templates

time openstack overcloud deploy \
  --timeout 600 \
  --templates $THT \
  --libvirt-type kvm \
  --stack overcloud \
  -r /home/stack/composable_roles_ceph/roles/roles_data.yaml \
  -e /home/stack/composable_roles_ceph/roles/nodes.yaml \
  -e /home/stack/composable_roles_ceph/config_lvm.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e /home/stack/composable_roles_ceph/network/network-environment.yaml \
  -e ~/fencing.yaml \
  -e /home/stack/composable_roles_ceph/inject-trust-anchor.yaml \
  -e $THT/environments/services/neutron-ovn-ha.yaml \
  -e /home/stack/composable_roles_ceph/debug.yaml \
  -e /home/stack/composable_roles_ceph/config_heat.yaml \
  -e ~/extraconfigpre_env.yaml \
  -e ~/containers-prepare-parameter.yaml \
  -e /home/stack/composable_roles_ceph/docker-images.yaml \
  -e $THT/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/ceph/ceph.yaml \
  --log-file overcloud_deployment_ceph.log
(undercloud) [stack@undercloud-0 ~]$ 

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------------------------+--------+------------------------+----------------+------------+
| ID                                   | Name                   | Status | Networks               | Image          | Flavor     |
+--------------------------------------+------------------------+--------+------------------------+----------------+------------+
| 033a9b16-d17d-43aa-82a0-ef1a4f5d02ce | overcloud-controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 23d7d520-8062-4725-a559-7de4352df26f | overcloud-computehci-1 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute    |
| 37993bf7-748e-442f-9b8a-9059f410ab24 | overcloud-computehci-2 | ACTIVE | ctlplane=192.168.24.21 | overcloud-full | compute    |
| c995e5de-d0f3-44e1-95ba-fc33ccd10957 | overcloud-controller-1 | ACTIVE | ctlplane=192.168.24.14 | overcloud-full | controller |
| f8abee7b-2d27-4c99-83c6-eec04126852d | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
| e2389313-143f-4b5b-bea3-5e20cd343037 | overcloud-computehci-0 | ACTIVE | ctlplane=192.168.24.6  | overcloud-full | compute    |
+--------------------------------------+------------------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.6 "sudo podman ps"
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
CONTAINER ID  IMAGE                                        COMMAND               CREATED       STATUS           PORTS  NAMES
8168b97740be  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest  /opt/ceph-contain...  14 hours ago  Up 14 hours ago         ceph-osd-0
(undercloud) [stack@undercloud-0 ~]$

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.8 "sudo podman ps"
Warning: Permanently added '192.168.24.8' (ECDSA) to the list of known hosts.                                                                                                               
CONTAINER ID  IMAGE                                                                                                  COMMAND               CREATED       STATUS           PORTS  NAMES
4f4a72e483d0  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-gnocchi-api:latest              dumb-init --singl...  14 hours ago  Up 14 hours ago         gnocchi_db_sync
243a720fec7a  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-proxy-server:latest       dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_proxy
4546645d1e2a  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-panko-api:latest                dumb-init --singl...  14 hours ago  Up 14 hours ago         panko_api
0bee35dd76fb  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_metadata
ade0ec1133a9  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_api
a98fcffa8668  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-glance-api:latest               dumb-init --singl...  14 hours ago  Up 14 hours ago         glance_api
8ddbe1c06e72  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-controller:latest           dumb-init --singl...  14 hours ago  Up 14 hours ago         ovn_controller
f3c3e426c476  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-placement-api:latest       dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_placement
1e279c8eae4e  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-object:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_rsync
2c0589025754  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-object:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_object_updater
c9b01fa9ac63  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-object:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_object_server
94b5bf42eaa1  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-object:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_object_replicator
93f33d1e8a7e  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-proxy-server:latest       dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_object_expirer
6495d2fbe116  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-object:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_object_auditor
eb0d90b16a4f  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-container:latest          dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_container_updater
2fffce3e1eca  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-container:latest          dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_container_server
2216afd29758  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-container:latest          dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_container_replicator
e762cf381d0b  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-container:latest          dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_container_auditor
578b6071d794  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-account:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_account_server
632004d3be1c  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-account:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_account_replicator
f2ac7585b93e  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-account:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_account_reaper
d774e640bd0d  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-swift-account:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         swift_account_auditor
09ae4fd047a2  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-novncproxy:latest          dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_vnc_proxy
1e84cc85635b  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-scheduler:latest           dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_scheduler
38d7a2685cc0  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-consoleauth:latest         dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_consolauth
6372e24f5b21  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-conductor:latest           dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_conductor
261200e7a44d  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-nova-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         nova_api_cron
b6ff6b93c012  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-neutron-server-ovn:latest       dumb-init --singl...  14 hours ago  Up 14 hours ago         neutron_api
3dced2021732  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cron:latest                     dumb-init --singl...  14 hours ago  Up 14 hours ago         logrotate_crond
9a99c602733c  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-heat-engine:latest              dumb-init --singl...  14 hours ago  Up 14 hours ago         heat_engine
b21c04625a28  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-heat-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         heat_api_cron
22fbec4a8bb4  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-heat-api-cfn:latest             dumb-init --singl...  14 hours ago  Up 14 hours ago         heat_api_cfn
78e9c4bdf227  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-heat-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         heat_api
4f2187802c9c  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-scheduler:latest         dumb-init --singl...  14 hours ago  Up 14 hours ago         cinder_scheduler
be8dc6bd665b  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-api:latest               dumb-init --singl...  14 hours ago  Up 14 hours ago         cinder_api_cron
0e89e10b0fe5  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-cinder-api:latest               dumb-init --singl...  14 hours ago  Up 14 hours ago         cinder_api
02a45d30adc3  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ceilometer-notification:latest  dumb-init --singl...  14 hours ago  Up 14 hours ago         ceilometer_agent_notification
39389ff72279  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ceilometer-central:latest       dumb-init --singl...  14 hours ago  Up 14 hours ago         ceilometer_agent_central
530f18e666c8  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-aodh-notifier:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         aodh_notifier
ab33c2c4f363  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-aodh-listener:latest            dumb-init --singl...  14 hours ago  Up 14 hours ago         aodh_listener
fc55389aa63f  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-aodh-evaluator:latest           dumb-init --singl...  14 hours ago  Up 14 hours ago         aodh_evaluator
bc6e8512fb81  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-aodh-api:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         aodh_api
6938ada702d9  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-keystone:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         keystone_cron
da18c78e4050  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-keystone:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         keystone
fc955ce80499  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-iscsid:latest                   dumb-init --singl...  14 hours ago  Up 14 hours ago         iscsid
50f9e95b0d69  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-northd:latest               dumb-init --singl...  14 hours ago  Up 14 hours ago         ovn-dbs-bundle-podman-0
8b1d922b5230  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-horizon:latest                  dumb-init --singl...  14 hours ago  Up 14 hours ago         horizon
47d590edc293  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:latest                  dumb-init --singl...  14 hours ago  Up 14 hours ago         haproxy-bundle-podman-0
25300f34d18f  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-redis:latest                    dumb-init --singl...  14 hours ago  Up 14 hours ago         redis-bundle-podman-0
fa613ee02243  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-rabbitmq:latest                 dumb-init --singl...  14 hours ago  Up 14 hours ago         rabbitmq-bundle-podman-0
0433270497c0  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:latest                  dumb-init /bin/ba...  14 hours ago  Up 14 hours ago         galera-bundle-podman-0
395387fd2ff7  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mariadb:latest                  dumb-init kolla_s...  14 hours ago  Up 14 hours ago         clustercheck
f8f178004a4d  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest                                                            /opt/ceph-contain...  14 hours ago  Up 14 hours ago         ceph-mon-overcloud-controller-0
4c4d2dd71e8f  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest                                                            /opt/ceph-contain...  14 hours ago  Up 14 hours ago         ceph-mgr-overcloud-controller-0
d2bde7014e64  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-memcached:latest                dumb-init --singl...  14 hours ago  Up 14 hours ago         memcached
(undercloud) [stack@undercloud-0 ~]$

Comment 9 John Fulton 2019-04-05 11:54:54 UTC
Could the reason the containers on the compute nodes aren't coming back up be a network issue and a different bug? I see the ovn-container died.

(undercloud) [stack@undercloud-0 ~]$ for x in $(cat computes); do echo $x; ssh heat-admin@$x "sudo podman ps -a"; done
192.168.24.11
Warning: Permanently added '192.168.24.11' (ECDSA) to the list of known hosts.
CONTAINER ID  IMAGE                                                                                         COMMAND               CREATED       STATUS                   PORTS  NAMES
df4b718121f9  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest                                                   /opt/ceph-contain...  14 hours ago  Up 14 hours ago                 ceph-osd-2
29006f4ced33  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-controller:latest  /var/lib/containe...  14 hours ago  Exited (4) 14 hours ago         container-puppet-ovn_controller
192.168.24.21
Warning: Permanently added '192.168.24.21' (ECDSA) to the list of known hosts.
CONTAINER ID  IMAGE                                                                                         COMMAND               CREATED       STATUS                   PORTS  NAMES
b69da4bdfdf7  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest                                                   /opt/ceph-contain...  14 hours ago  Up 14 hours ago                 ceph-osd-1
a04f03905042  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-controller:latest  /var/lib/containe...  14 hours ago  Exited (4) 14 hours ago         container-puppet-ovn_controller
192.168.24.6
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
CONTAINER ID  IMAGE                                                                                         COMMAND               CREATED       STATUS                   PORTS  NAMES
8168b97740be  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest                                                   /opt/ceph-contain...  14 hours ago  Up 14 hours ago                 ceph-osd-0
d2cdd1540597  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-controller:latest  /var/lib/containe...  14 hours ago  Exited (4) 14 hours ago         container-puppet-ovn_controller
(undercloud) [stack@undercloud-0 ~]$
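
To pick out dead containers from a listing like the one above without reading every row, the output can be scanned for an "Exited (N)" status. This is only a sketch: the function name `exited_containers` is made up for illustration, and the sample text is two rows from the first compute node with the image column shortened in width only. It assumes columns are separated by two or more spaces, as in the paste.

```python
import re

# Two rows copied from the 192.168.24.11 listing above (padding trimmed).
PODMAN_PS = """\
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
df4b718121f9  quay.io/rhceph-dev/rhceph-4.0-rhel-8:latest  /opt/ceph-contain...  14 hours ago  Up 14 hours ago  ceph-osd-2
29006f4ced33  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-ovn-controller:latest  /var/lib/containe...  14 hours ago  Exited (4) 14 hours ago  container-puppet-ovn_controller
"""

EXITED = re.compile(r"^Exited \((\d+)\)")


def exited_containers(ps_output):
    """Return (container name, exit code) for every row whose status is Exited."""
    found = []
    for line in ps_output.splitlines():
        # Assumption: columns are separated by at least two spaces.
        cols = re.split(r"\s{2,}", line.strip())
        for col in cols:
            match = EXITED.match(col)
            if match:
                # NAMES is the last column in `podman ps -a` output.
                found.append((cols[-1], int(match.group(1))))
    return found


for name, code in exited_containers(PODMAN_PS):
    print(f"{name} exited with code {code}")
```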

Comment 10 Alistair Tonner 2019-04-11 13:43:21 UTC
I think I'm *still* seeing this timeout issue:
(Note the differing time zones:
# date
Thu Apr 11 16:35:32 IDT 2019
[root@titan56 tmp]#

[stack@undercloud-0 ~]$ date
Thu Apr 11 09:35:24 EDT 2019
)

(action heartbeat settings:
[action_heartbeat]
max_missed_heartbeats=40
check_interval=50
first_heartbeat_timeout=9000
)
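
For reference, here is a rough calculation of what those settings imply. It rests on an assumption about Mistral's behaviour that should be checked against the installed version: that an action is considered dead once no heartbeat has arrived for max_missed_heartbeats * check_interval seconds, and that first_heartbeat_timeout is a separate grace period for the first heartbeat. The helper name `heartbeat_windows` is hypothetical.

```python
import configparser

# Values copied from the [action_heartbeat] section quoted above.
MISTRAL_CONF = """
[action_heartbeat]
max_missed_heartbeats=40
check_interval=50
first_heartbeat_timeout=9000
"""


def heartbeat_windows(conf_text):
    """Return (stale_after_seconds, first_heartbeat_grace_seconds)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["action_heartbeat"]
    missed = section.getint("max_missed_heartbeats")
    interval = section.getint("check_interval")
    grace = section.getint("first_heartbeat_timeout")
    # Assumption: the stale window is the product of the two settings.
    return missed * interval, grace


stale_after, first_grace = heartbeat_windows(MISTRAL_CONF)
print(f"stale after {stale_after}s (~{stale_after / 60:.1f} min)")
print(f"first-heartbeat grace {first_grace}s ({first_grace / 60:.0f} min)")
```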
==============================================================================================================

From ir deploy command script:

PLAY [Verify overcloud deployment] 
______________________________________
TASK [fail] 
Thursday 11 April 2019  02:29:28 +0300 (0:00:00.374)       0:50:59.702 ********
fatal: [undercloud-0]: FAILED! => {"changed": false, "msg": "Overcloud deployment failed... :("}
_____________________________________
NO MORE HOSTS LEFT
        to retry, use: --limit @/tmp/RHEL8_test.jDqnJY7O0m/plugins/tripleo-overcloud/main.retry

PLAY RECAP
hypervisor                 : ok=2    changed=0    unreachable=0    failed=0
localhost                  : ok=6    changed=2    unreachable=0    failed=0
undercloud-0               : ok=174  changed=80   unreachable=0    failed=1
_______________________________________________________________  (-7 hours 19:29:30 )

From undercloud-0:

openstack task execution list |grep ERROR
| 7e0d02b4-2630-4d6f-a1cd-86866769e8a1 | run_ansible                     | tripleo.deployment.v1.config_download_deploy          |                    | 25884e19-cfbb-4eed-9bf2-103bae94c6c2 | ERROR   | Heartbeat wasn't received... | 2019-04-10 22:57:34 | 2019-04-10 23:22:28 |
| 095e3224-8785-4cf3-a2bb-ad30b9c8a1fc | get_messages                    | tripleo.plan_management.v1.publish_ui_logs_to_swift   |                    | a9ead942-c46d-48e8-9d60-27d4852ba164 | ERROR   | Heartbeat wasn't received... | 2019-04-10 23:01:12 | 2019-04-10 23:22:28 |
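
To see how long each failed action lived before it was marked ERROR, the "Created at" and "Updated at" columns of the rows above can be subtracted. A small sketch, with the column padding trimmed from the two rows; `error_durations` is an illustrative name, and the column order is assumed to match the listing above.

```python
from datetime import datetime

# The two ERROR rows from the listing above, with column padding trimmed.
ROWS = """\
| 7e0d02b4-2630-4d6f-a1cd-86866769e8a1 | run_ansible | tripleo.deployment.v1.config_download_deploy | | 25884e19-cfbb-4eed-9bf2-103bae94c6c2 | ERROR | Heartbeat wasn't received... | 2019-04-10 22:57:34 | 2019-04-10 23:22:28 |
| 095e3224-8785-4cf3-a2bb-ad30b9c8a1fc | get_messages | tripleo.plan_management.v1.publish_ui_logs_to_swift | | a9ead942-c46d-48e8-9d60-27d4852ba164 | ERROR | Heartbeat wasn't received... | 2019-04-10 23:01:12 | 2019-04-10 23:22:28 |
"""

FMT = "%Y-%m-%d %H:%M:%S"


def error_durations(table_text):
    """Map task name -> seconds between 'Created at' and 'Updated at'."""
    result = {}
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expect 9 columns: ID, Name, Workflow name, Workflow namespace,
        # Workflow Execution ID, State, State info, Created at, Updated at.
        if len(cells) != 9 or cells[5] != "ERROR":
            continue
        created = datetime.strptime(cells[7], FMT)
        updated = datetime.strptime(cells[8], FMT)
        result[cells[1]] = int((updated - created).total_seconds())
    return result


for task, seconds in error_durations(ROWS).items():
    print(f"{task}: {seconds}s (~{seconds / 60:.1f} min)")
```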
  
    

from /var/lib/mistral/overcloud/ansible.log:

2019-04-10 19:52:35,247 p=601 u=mistral |  PLAY [Server Post Deployments] *************************************************
2019-04-10 19:52:35,312 p=601 u=mistral |  TASK [include_tasks] ***********************************************************
2019-04-10 19:52:35,312 p=601 u=mistral |  Wednesday 10 April 2019  19:52:35 -0400 (0:00:00.409)       0:54:58.728 *******
2019-04-10 19:52:35,647 p=601 u=mistral |  PLAY [External deployment Post Deploy tasks] ***********************************
2019-04-10 19:52:35,651 p=601 u=mistral |  PLAY RECAP *********************************************************************
2019-04-10 19:52:35,651 p=601 u=mistral |  compute-0                  : ok=182  changed=79   unreachable=0    failed=0
2019-04-10 19:52:35,651 p=601 u=mistral |  compute-1                  : ok=182  changed=79   unreachable=0    failed=0
2019-04-10 19:52:35,651 p=601 u=mistral |  controller-0               : ok=260  changed=143  unreachable=0    failed=0
2019-04-10 19:52:35,651 p=601 u=mistral |  controller-1               : ok=249  changed=140  unreachable=0    failed=0
2019-04-10 19:52:35,651 p=601 u=mistral |  controller-2               : ok=249  changed=140  unreachable=0    failed=0
2019-04-10 19:52:35,652 p=601 u=mistral |  undercloud                 : ok=11   changed=7    unreachable=0    failed=0
2019-04-10 19:52:35,652 p=601 u=mistral |  Wednesday 10 April 2019  19:52:35 -0400 (0:00:00.340)       0:54:59.068 *******
2019-04-10 19:52:35,652 p=601 u=mistral |  ===============================================================================
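
Because the three sources above use different clocks, it may help to put them on one timeline. The sketch below converts the ir failure time (+0300) and the ansible.log PLAY RECAP time (-0400) to UTC and compares them with the task's "Updated at" value. It assumes the Mistral timestamps are already UTC, which the thread does not state; `to_utc` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # undercloud, per "-0400" in ansible.log
IDT = timezone(timedelta(hours=3))   # hypervisor, per "+0300" in the ir output


def to_utc(stamp, tz):
    """Interpret a naive 'YYYY-MM-DD HH:MM:SS' string in tz and convert to UTC."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


recap = to_utc("2019-04-10 19:52:35", EDT)    # PLAY RECAP in ansible.log
ir_fail = to_utc("2019-04-11 02:29:28", IDT)  # TASK [fail] in the ir output
# Assumption: Mistral's "Updated at" for run_ansible is UTC.
task_error = datetime(2019, 4, 10, 23, 22, 28, tzinfo=timezone.utc)

print("task marked ERROR:", task_error.isoformat())
print("ir reported failure:", ir_fail.isoformat())
print("ansible PLAY RECAP:", recap.isoformat())
print("recap minus task error:", int((recap - task_error).total_seconds()), "s")
```

Under that assumption the playbook kept running, and finished with failed=0, after the run_ansible action had already been marked ERROR.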

openstack task execution show 7e0d02b4-2630-4d6f-a1cd-86866769e8a1
+-----------------------+----------------------------------------------+
| Field                 | Value                                        |
+-----------------------+----------------------------------------------+
| ID                    | 7e0d02b4-2630-4d6f-a1cd-86866769e8a1         |
| Name                  | run_ansible                                  |
| Workflow name         | tripleo.deployment.v1.config_download_deploy |
| Workflow namespace    |                                              |
| Workflow Execution ID | 25884e19-cfbb-4eed-9bf2-103bae94c6c2         |
| State                 | ERROR                                        |
| State info            | Heartbeat wasn't received.                   |
| Created at            | 2019-04-10 22:57:34                          |
| Updated at            | 2019-04-10 23:22:28                          |
+-----------------------+----------------------------------------------+
(undercloud) [stack@undercloud-0 ~]$ openstack task execution show 095e3224-8785-4cf3-a2bb-ad30b9c8a1fc
+-----------------------+-----------------------------------------------------+
| Field                 | Value                                               |
+-----------------------+-----------------------------------------------------+
| ID                    | 095e3224-8785-4cf3-a2bb-ad30b9c8a1fc                |
| Name                  | get_messages                                        |
| Workflow name         | tripleo.plan_management.v1.publish_ui_logs_to_swift |
| Workflow namespace    |                                                     |
| Workflow Execution ID | a9ead942-c46d-48e8-9d60-27d4852ba164                |
| State                 | ERROR                                               |
| State info            | Heartbeat wasn't received.                          |
| Created at            | 2019-04-10 23:01:12                                 |
| Updated at            | 2019-04-10 23:22:28                                 |
+-----------------------+-----------------------------------------------------+


 openstack workflow execution show 25884e19-cfbb-4eed-9bf2-103bae94c6c2

+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field              | Value                                                                                                                                                                 |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID                 | 25884e19-cfbb-4eed-9bf2-103bae94c6c2                                                                                                                                  |
| Workflow ID        | 5e54a2f6-245e-4de4-a31d-941682e63655                                                                                                                                  |
| Workflow name      | tripleo.deployment.v1.config_download_deploy                                                                                                                          |
| Workflow namespace |                                                                                                                                                                       |
| Description        |                                                                                                                                                                       |
| Task Execution ID  | <none>                                                                                                                                                                |
| Root Execution ID  | <none>                                                                                                                                                                |
| State              | ERROR                                                                                                                                                                 |
| State info         | Failed to run task [error=Failed to find workflow [name=tripleo.messaging.v1.send] [namespace=], wf=tripleo.deployment.v1.config_download_deploy, task=send_message]: |
|                    | Traceback (most recent call last):                                                                                                                                    |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task                                                                        |
|                    |     task.run()                                                                                                                                                        |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                                |
|                    |     result = f(*args, **kwargs)                                                                                                                                       |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run                                                                                   |
|                    |     self._run_new()                                                                                                                                                   |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                                |
|                    |     result = f(*args, **kwargs)                                                                                                                                       |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new                                                                              |
|                    |     self._schedule_actions()                                                                                                                                          |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 569, in _schedule_actions                                                                     |
|                    |     timeout=self._get_timeout()                                                                                                                                       |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                                |
|                    |     result = f(*args, **kwargs)                                                                                                                                       |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 586, in schedule                                                                            |
|                    |     wf_spec_name=self.wf_name                                                                                                                                         |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 91, in resolve_workflow_definition                                                            |
|                    |     (wf_spec_name, namespace)                                                                                                                                         |
|                    | mistral.exceptions.WorkflowException: Failed to find workflow [name=tripleo.messaging.v1.send] [namespace=]                                                           |
|                    |                                                                                                                                                                       |
| Created at         | 2019-04-10 22:56:44                                                                                                                                                   |
| Updated at         | 2019-04-10 23:22:28                                                                                                                                                   |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
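
When comparing several failed executions, the interesting parts of the "State info" line (which workflow could not be found, and which task asked for it) can be pulled out with a regular expression. A sketch only; the pattern is written against the exact message shown above and `parse_state_info` is an illustrative name, not part of any Mistral tooling.

```python
import re

# First line of the "State info" field from the execution shown above.
STATE_INFO = (
    "Failed to run task [error=Failed to find workflow "
    "[name=tripleo.messaging.v1.send] [namespace=], "
    "wf=tripleo.deployment.v1.config_download_deploy, task=send_message]:"
)

PATTERN = re.compile(
    r"Failed to find workflow \[name=(?P<missing>[^\]]*)\] "
    r"\[namespace=(?P<namespace>[^\]]*)\], "
    r"wf=(?P<wf>[^,]+), task=(?P<task>[^\]]+)\]"
)


def parse_state_info(text):
    """Return the missing workflow, namespace, parent workflow and task, or None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


print(parse_state_info(STATE_INFO))
```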

openstack workflow execution show a9ead942-c46d-48e8-9d60-27d4852ba164
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field              | Value                                                                                                                                                                                                                             |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID                 | a9ead942-c46d-48e8-9d60-27d4852ba164                                                                                                                                                                                              |
| Workflow ID        | 0142824d-29a1-4506-8262-50ae4c961637                                                                                                                                                                                              |
| Workflow name      | tripleo.plan_management.v1.publish_ui_logs_to_swift                                                                                                                                                                               |
| Workflow namespace |                                                                                                                                                                                                                                   |
| Description        | {"description": "Workflow execution created by cron trigger '(edcd71bf-a4f6-441b-acda-ea53ff66b91d)'.", "triggered_by": {"type": "cron_trigger", "id": "edcd71bf-a4f6-441b-acda-ea53ff66b91d", "name": "publish-ui-logs-hourly"}} |
| Task Execution ID  | <none>                                                                                                                                                                                                                            |
| Root Execution ID  | <none>                                                                                                                                                                                                                            |
| State              | RUNNING                                                                                                                                                                                                                           |
| State info         | None                                                                                                                                                                                                                              |
| Created at         | 2019-04-10 23:00:57                                                                                                                                                                                                               |
| Updated at         | 2019-04-10 23:00:57                                                                                                                                                                                                               |

Comment 12 Jose Luis Franco 2019-05-10 10:43:00 UTC
We're seeing the very same error on an upgraded OSP15 undercloud when using mistral to upgrade the operating system on the overcloud nodes. The "overcloud upgrade run" command fails for no apparent reason (the last ansible task has an rc of 0), and when we check the failed mistral actions we get:

(undercloud) [stack@undercloud-0 ~]$ openstack task execution list |grep ERROR
| 8128e053-74ec-42e3-bd4d-31989f096bc9 | node_update                          | tripleo.package_update.v1.update_nodes               |                    | 6c289cea-22d7-459e-8e4f-ab38aeed584c | ERROR   | Heartbeat wasn't received... | 2019-05-09 11:25:06 | 2019-05-09 11:30:16 |
| ee4aafde-5fa6-40d4-8820-ea96a861d2ce | node_update                          | tripleo.package_update.v1.update_nodes               |                    | 0389abcf-5cf1-400c-957d-61350285594d | ERROR   | Heartbeat wasn't received... | 2019-05-09 13:07:21 | 2019-05-09 13:12:39 |

(undercloud) [stack@undercloud-0 ~]$ openstack task execution show ee4aafde-5fa6-40d4-8820-ea96a861d2ce
+-----------------------+----------------------------------------+
| Field                 | Value                                  |
+-----------------------+----------------------------------------+
| ID                    | ee4aafde-5fa6-40d4-8820-ea96a861d2ce   |
| Name                  | node_update                            |
| Workflow name         | tripleo.package_update.v1.update_nodes |
| Workflow namespace    |                                        |
| Workflow Execution ID | 0389abcf-5cf1-400c-957d-61350285594d   |
| State                 | ERROR                                  |
| State info            | Heartbeat wasn't received.             |
| Created at            | 2019-05-09 13:07:21                    |
| Updated at            | 2019-05-09 13:12:39                    |
+-----------------------+----------------------------------------+
(undercloud) [stack@undercloud-0 ~]$ openstack workflow execution show 0389abcf-5cf1-400c-957d-61350285594d
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field              | Value                                                                                                                                                           |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID                 | 0389abcf-5cf1-400c-957d-61350285594d                                                                                                                            |
| Workflow ID        | c9b0b04a-c4a3-4ebf-9e98-03e8d8e00cde                                                                                                                            |
| Workflow name      | tripleo.package_update.v1.update_nodes                                                                                                                          |
| Workflow namespace |                                                                                                                                                                 |
| Description        |                                                                                                                                                                 |
| Task Execution ID  | <none>                                                                                                                                                          |
| Root Execution ID  | <none>                                                                                                                                                          |
| State              | ERROR                                                                                                                                                           |
| State info         | Failed to run task [error=Failed to find workflow [name=tripleo.messaging.v1.send] [namespace=], wf=tripleo.package_update.v1.update_nodes, task=send_message]: |
|                    | Traceback (most recent call last):                                                                                                                              |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task                                                                  |
|                    |     task.run()                                                                                                                                                  |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                          |
|                    |     result = f(*args, **kwargs)                                                                                                                                 |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run                                                                             |
|                    |     self._run_new()                                                                                                                                             |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                          |
|                    |     result = f(*args, **kwargs)                                                                                                                                 |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new                                                                        |
|                    |     self._schedule_actions()                                                                                                                                    |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 569, in _schedule_actions                                                               |
|                    |     timeout=self._get_timeout()                                                                                                                                 |
|                    |   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper                                                                          |
|                    |     result = f(*args, **kwargs)                                                                                                                                 |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 561, in schedule                                                                      |
|                    |     wf_spec_name=self.wf_name                                                                                                                                   |
|                    |   File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 91, in resolve_workflow_definition                                                      |
|                    |     (wf_spec_name, namespace)                                                                                                                                   |
|                    | mistral.exceptions.WorkflowException: Failed to find workflow [name=tripleo.messaging.v1.send] [namespace=]                                                     |
|                    |                                                                                                                                                                 |
| Created at         | 2019-05-09 13:07:12                                                                                                                                             |
| Updated at         | 2019-05-09 13:25:48                                                                                                                                             |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

Comment 13 Alistair Tonner 2019-05-13 12:55:18 UTC
   Okay -- 
   If I've followed the full thread here correctly, we've conflated two issues in this bug:
   the original case was mistral itself timing out and failing the deploy;
   the second case was the ceph node deployment hitting the issue of grub-install taking up to 2 minutes per disk attached to the node, due to the missing bind mount in the IPA image.

  mistral timeouts covered in https://bugzilla.redhat.com/show_bug.cgi?id=1700044

  ceph node grub issues covered in https://bugzilla.redhat.com/show_bug.cgi?id=1691551

*** This bug has been marked as a duplicate of bug 1700044 ***

