Bug 1703618

Summary: ceph_install mistral task stuck in state RUNNING
Product: Red Hat OpenStack Reporter: John Fulton <johfulto>
Component: openstack-tripleo-common Assignee: John Fulton <johfulto>
Status: CLOSED DUPLICATE QA Contact: Yogev Rabl <yrabl>
Severity: medium Docs Contact:
Priority: medium    
Version: 13.0 (Queens) CC: dhill, gcharot, gfidente, gprocunier, ldenny, mburns, pveiga, slinaber
Target Milestone: --- Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-09-19 18:42:46 UTC Type: Bug

Description John Fulton 2019-04-26 22:46:27 UTC
While deploying an overcloud that uses Ceph, the ceph_install Mistral task [1] became stuck in the RUNNING state even after /var/log/mistral/ceph-install-workflow.log indicated that the ceph-ansible install was complete. According to the Mistral database [2], the subsequent tasks in the same workbook, save_fetch_directory and purge_fetch_directory, had not yet run.



[1] https://github.com/openstack/tripleo-common/blob/34f1c505f19371a9110fc14a31fb0d95b31b2af2/workbooks/ceph-ansible.yaml#L144

[2] 
MariaDB [mistral]> select name, state, created_at, updated_at from task_executions_v2 where workflow_execution_id='2cfbc2ec-2abd-4a40-bea7-d9ceb69296d7' order by created_at, updated_at;
+--------------------------+---------+---------------------+---------------------+
| name                     | state   | created_at          | updated_at          |
+--------------------------+---------+---------------------+---------------------+
| set_swift_container      | SUCCESS | 2019-04-21 17:28:31 | 2019-04-21 17:28:32 |
| collect_puppet_hieradata | SUCCESS | 2019-04-21 17:28:32 | 2019-04-21 17:28:33 |
| check_hieradata          | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_ip_lists             | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_blacklisted_ips      | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| merge_ip_lists           | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| enable_ssh_admin         | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:29:18 |
| get_private_key          | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| restore_fetch_directory  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| make_fetch_directory     | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| verify_container_exists  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| collect_nodes_uuid       | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:20 |
| parse_node_data_lookup   | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| set_ip_uuids             | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| map_node_data_lookup     | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:21 |
| set_role_vars            | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| build_extra_vars         | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| ceph_install             | RUNNING | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
+--------------------------+---------+---------------------+---------------------+
18 rows in set (0.001 sec)

MariaDB [mistral]>

Comment 1 John Fulton 2019-04-26 22:55:56 UTC
WORKAROUND: 

Identify the stuck task's action ID and manually set its state to SUCCESS.

 
To identify the stuck task's action id, use a command like the following:

 source ~/stackrc
 WORKFLOW='tripleo.storage.v1.ceph-install'
 UUID=$(mistral execution-list --limit=-1 | grep "$WORKFLOW" | awk '{print $2}' | tail -1)
 for TASK_ID in $(mistral task-list "$UUID" | awk '{print $2}' | grep -Ev 'ID|^$'); do
     echo $TASK_ID
     mistral task-get $TASK_ID
 done

Using the $TASK_ID for the ceph_install task, find its action ID using:

 mistral action-execution-list $TASK_ID

The above will return the action_execution_id in the first column of output. Use this ID to tell Mistral that the task completed successfully: 

 mistral action-execution-update --state SUCCESS action_execution_id

After doing the above, the workbook proceeded to the save_fetch_directory and purge_fetch_directory tasks, the heat stack update continued, and the deployment completed.
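
The task lookup in this workaround can be narrowed to the one task of interest instead of printing every task. The following is a minimal sketch, not a tested tool: task_id_by_name is a hypothetical helper, and it assumes that `mistral task-list` prints a pipe-delimited table with the task ID in the first column and the task name in the second.

```shell
# Hypothetical helper: read a mistral-style table on stdin and print the ID
# of each row whose second column (assumed to be the task name) equals $1.
task_id_by_name() {
    awk -F'|' -v name="$1" '
        NF > 2 {
            id = $2; n = $3
            # Trim the padding that the table layout adds around each cell.
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", n)
            if (n == name) print id
        }'
}

# Usage on the undercloud (not run here; UUID is the workflow execution ID
# found with the commands above):
#   TASK_ID=$(mistral task-list "$UUID" | task_id_by_name ceph_install)
#   echo "$TASK_ID"
```

Border rows and the header row are skipped because they either contain no cells or do not match the requested name, so an empty result means no task with that name was found.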

Comment 3 John Fulton 2019-04-26 23:12:40 UTC
Environment:

The undercloud where this issue presented itself was restored from a VM snapshot. Running `mistral execution-list --limit=-1 | grep ceph` after the snapshot was restored showed two other existing executions of tripleo.storage.v1.ceph-install with the ceph_install action still running. It's possible that Mistral was in an inconsistent state.

The undercloud also did not have the latest errata applied including the following:

 https://github.com/openstack/tripleo-common/commit/87cef2ea1db6efc420a853d5c3b573960e032747

which might explain why it was stuck on returning the result of a successful ceph deployment (see bug 1687795).

Comment 10 Greg Procunier 2019-08-28 17:27:16 UTC
Hey John,

We are running into this exact situation on bare metal using rhosp13 with an August live channel snapshot.

The workaround works, but we want to demonstrate overcloud updates that don't require brain surgery :) We have a case open with GSS regarding this (GSS case #02459386) and were wondering if you had any thoughts, given that it's not virtual machine related.

Comment 18 John Fulton 2019-09-06 14:30:52 UTC
In the undercloud,

  /etc/my.cnf.d/mariadb-server.cnf :

  [server]
  max_allowed_packet=64M

Before

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 16777216 |
+--------------------+----------+
1 row in set (0.00 sec)

After

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 67108864 |
+--------------------+----------+
1 row in set (0.00 sec)
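
The two values above are the same setting expressed in bytes: 16M is 16 * 1024 * 1024 = 16777216 and 64M is 64 * 1024 * 1024 = 67108864. A small sketch of that conversion, useful when comparing the config file against the SHOW VARIABLES output; to_bytes is a hypothetical helper, not part of MariaDB, and it handles only the K, M and G suffixes.

```shell
# Hypothetical helper: convert a size such as 64M into bytes.
# Values without a recognized suffix are printed unchanged.
to_bytes() {
    local value=$1 num
    case "$value" in
        *[Kk]) num=${value%[Kk]}; echo $(( num * 1024 )) ;;
        *[Mm]) num=${value%[Mm]}; echo $(( num * 1024 * 1024 )) ;;
        *[Gg]) num=${value%[Gg]}; echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

to_bytes 16M   # 16777216, the value before the change
to_bytes 64M   # 67108864, the value after the change
```

A value set in the config file is read when the server starts, so the "After" output presumably reflects a MariaDB restart on the undercloud once the file was edited.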

Comment 19 John Fulton 2019-09-06 14:31:22 UTC
WORKAROUND 2 (easier): see comment #18

Comment 20 John Fulton 2019-09-06 14:33:41 UTC
Despite our efforts in BZ 1687795, Mistral is still trying to write data larger than max_allowed_packet to the DB and getting stuck in the RUNNING state when it does so. We think this patch will address it:

 https://review.opendev.org/#/c/680709/

Comment 22 John Fulton 2019-09-10 14:02:09 UTC
https://review.opendev.org/#/c/680709 has merged in queens upstream

Comment 24 John Fulton 2019-09-19 18:42:46 UTC

*** This bug has been marked as a duplicate of bug 1747126 ***

Comment 25 John Fulton 2020-05-11 13:55:04 UTC
*** Bug 1830148 has been marked as a duplicate of this bug. ***