Bug 1703618 - ceph_install mistral task stuck in state RUNNING
Summary: ceph_install mistral task stuck in state RUNNING
Keywords:
Status: CLOSED DUPLICATE of bug 1747126
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Duplicates: 1830148
Depends On:
Blocks:
 
Reported: 2019-04-26 22:46 UTC by John Fulton
Modified: 2023-10-06 18:16 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-19 18:42:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1772001 0 None None None 2019-08-30 13:34:28 UTC
Launchpad 1796592 0 None None None 2019-08-30 13:34:55 UTC
OpenStack gerrit 679475 0 'None' MERGED Honor trash_output when not using queue 2021-02-03 20:04:13 UTC
OpenStack gerrit 680709 0 'None' MERGED Honor trash_output when not using queue 2021-02-03 20:04:13 UTC
Red Hat Bugzilla 1687795 0 urgent CLOSED ceph_install workflow fails in debug mode on z5 because of too big publish message 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 4091811 0 None None None 2019-04-28 14:20:29 UTC

Internal Links: 1687795

Description John Fulton 2019-04-26 22:46:27 UTC
While deploying an overcloud which uses ceph the ceph_install mistral task [1] became stuck in a state of running even after the /var/log/mistral/ceph-install-workflow.log indicated that the ceph-ansible install was complete. The subsequent tasks in the same workbook, save_fetch_directory and purge_fetch_directory, did not yet run according to the mistral database [2].



[1] https://github.com/openstack/tripleo-common/blob/34f1c505f19371a9110fc14a31fb0d95b31b2af2/workbooks/ceph-ansible.yaml#L144

[2] 
MariaDB [mistral]> select name, state, created_at, updated_at from task_executions_v2 where workflow_execution_id='2cfbc2ec-2abd-4a40-bea7-d9ceb69296d7' order by created_at, updated_at;
+--------------------------+---------+---------------------+---------------------+
| name                     | state   | created_at          | updated_at          |
+--------------------------+---------+---------------------+---------------------+
| set_swift_container      | SUCCESS | 2019-04-21 17:28:31 | 2019-04-21 17:28:32 |
| collect_puppet_hieradata | SUCCESS | 2019-04-21 17:28:32 | 2019-04-21 17:28:33 |
| check_hieradata          | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_ip_lists             | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_blacklisted_ips      | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| merge_ip_lists           | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| enable_ssh_admin         | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:29:18 |
| get_private_key          | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| restore_fetch_directory  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| make_fetch_directory     | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| verify_container_exists  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| collect_nodes_uuid       | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:20 |
| parse_node_data_lookup   | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| set_ip_uuids             | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| map_node_data_lookup     | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:21 |
| set_role_vars            | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| build_extra_vars         | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| ceph_install             | RUNNING | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
+--------------------------+---------+---------------------+---------------------+
18 rows in set (0.001 sec)

MariaDB [mistral]>
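
As a cross-check, the same stuck state can be confirmed from the action-execution side of the database. The following is only a sketch; it assumes the mistral schema's action_executions_v2 table (id, name, state and task_execution_id columns) and that the mysql client on the undercloud can connect as root without extra credentials:

 sudo mysql mistral -e "select id, name, state from action_executions_v2 where task_execution_id in (select id from task_executions_v2 where name='ceph_install' and state='RUNNING');"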

Comment 1 John Fulton 2019-04-26 22:55:56 UTC
WORKAROUND: 

Identify the stuck task's action execution ID and manually set its state to SUCCESS.

 
First, identify the stuck task's ID with a command like the following:

 source ~/stackrc
 WORKFLOW='tripleo.storage.v1.ceph-install'
 UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk {'print $2'} | tail -1)
 for TASK_ID in $(mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'); do
     echo $TASK_ID
     mistral task-get $TASK_ID
 done
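
If the task list is long, the listing can be narrowed to the stuck task directly. A sketch of that, assuming the default table output (ID in the first column) and that the ceph_install task is the one left in state RUNNING:

 TASK_ID=$(mistral task-list $UUID | grep ceph_install | grep RUNNING | awk {'print $2'})
 echo $TASK_ID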

Using the $TASK_ID for the ceph_install task, find its action execution ID using:

 mistral action-execution-list $TASK_ID

The above will return the action_execution_id in the first column of output. Use this ID to tell Mistral that the task completed successfully: 

 mistral action-execution-update --state SUCCESS action_execution_id
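
For example, the lookup and the update can be combined. This is only a sketch and assumes that exactly one action execution belongs to the stuck ceph_install task:

 ACTION_ID=$(mistral action-execution-list $TASK_ID | grep ceph_install | awk {'print $2'} | tail -1)
 mistral action-execution-update --state SUCCESS $ACTION_ID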

After doing the above, the workbook proceeded to the save_fetch_directory and purge_fetch_directory tasks, the heat stack update continued, and the deployment completed.

Comment 3 John Fulton 2019-04-26 23:12:40 UTC
Environment:

The undercloud where this issue presented itself had been restored from a VM snapshot, and running `mistral execution-list --limit=-1 | grep ceph` after the restore showed two other pre-existing tripleo.storage.v1.ceph-install executions whose ceph_install task was still running. It's possible that mistral was in an inconsistent state.

The undercloud also did not have the latest errata applied, including the following:

 https://github.com/openstack/tripleo-common/commit/87cef2ea1db6efc420a853d5c3b573960e032747

which might explain why it was stuck on returning the result of a successful ceph deployment (see bug 1687795).

Comment 10 Greg Procunier 2019-08-28 17:27:16 UTC
Hey John,

We are running into this exact situation on bare metal using rhosp13 with an August live channel snapshot.

The workaround works, but we want to demonstrate overcloud updates that don't require brain surgery :)  We have a case open with GSS regarding this and were wondering if you had any thoughts on it, given that it's not virtual-machine related (GSS case #02459386).

Comment 18 John Fulton 2019-09-06 14:30:52 UTC
In the undercloud, set the following in /etc/my.cnf.d/mariadb-server.cnf:

  [server]
  max_allowed_packet=64M

Before

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 16777216 |
+--------------------+----------+
1 row in set (0.00 sec)

After

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 67108864 |
+--------------------+----------+
1 row in set (0.00 sec)
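
To make MariaDB pick up the new value, restart the service and re-check the variable. A sketch assuming a non-containerized undercloud MariaDB managed by systemd, as on OSP 13:

 sudo systemctl restart mariadb
 sudo mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"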

Comment 19 John Fulton 2019-09-06 14:31:22 UTC
WORKAROUND 2 (easier): see comment #18

Comment 20 John Fulton 2019-09-06 14:33:41 UTC
Despite our efforts in BZ 1687795, Mistral is still trying to write data larger than max_allowed_packet to the DB and gets stuck in the RUNNING state when doing so. We think this patch will address it:

 https://review.opendev.org/#/c/680709/

Comment 22 John Fulton 2019-09-10 14:02:09 UTC
https://review.opendev.org/#/c/680709 has merged upstream in Queens.

Comment 24 John Fulton 2019-09-19 18:42:46 UTC

*** This bug has been marked as a duplicate of bug 1747126 ***

Comment 25 John Fulton 2020-05-11 13:55:04 UTC
*** Bug 1830148 has been marked as a duplicate of this bug. ***

