While deploying an overcloud which uses Ceph, the ceph_install Mistral task [1] became stuck in the RUNNING state even after /var/log/mistral/ceph-install-workflow.log indicated that the ceph-ansible install was complete. According to the Mistral database [2], the subsequent tasks in the same workbook, save_fetch_directory and purge_fetch_directory, had not yet run.

[1] https://github.com/openstack/tripleo-common/blob/34f1c505f19371a9110fc14a31fb0d95b31b2af2/workbooks/ceph-ansible.yaml#L144

[2] MariaDB [mistral]> select name, state, created_at, updated_at from task_executions_v2 where workflow_execution_id='2cfbc2ec-2abd-4a40-bea7-d9ceb69296d7' order by created_at, updated_at;
+--------------------------+---------+---------------------+---------------------+
| name                     | state   | created_at          | updated_at          |
+--------------------------+---------+---------------------+---------------------+
| set_swift_container      | SUCCESS | 2019-04-21 17:28:31 | 2019-04-21 17:28:32 |
| collect_puppet_hieradata | SUCCESS | 2019-04-21 17:28:32 | 2019-04-21 17:28:33 |
| check_hieradata          | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_ip_lists             | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_blacklisted_ips      | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| merge_ip_lists           | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| enable_ssh_admin         | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:29:18 |
| get_private_key          | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| restore_fetch_directory  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| make_fetch_directory     | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| verify_container_exists  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| collect_nodes_uuid       | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:20 |
| parse_node_data_lookup   | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| set_ip_uuids             | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| map_node_data_lookup     | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:21 |
| set_role_vars            | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| build_extra_vars         | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| ceph_install             | RUNNING | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
+--------------------------+---------+---------------------+---------------------+
18 rows in set (0.001 sec)
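For reference, the same task state can be pulled without opening an interactive MySQL session. This is only a sketch: it assumes the root user can reach the local MariaDB socket without a password (as on our undercloud) and that the workflow execution ID is looked up first with the Mistral CLI.

# Find the ID of the most recent ceph-install workflow execution.
# EXEC_ID is a hypothetical variable name; the grep/awk pattern mirrors
# the workaround below.
source ~/stackrc
EXEC_ID=$(mistral execution-list --limit=-1 | \
          grep tripleo.storage.v1.ceph-install | awk '{print $2}' | tail -1)

# Inspect the state of each task in that execution directly in the DB.
# ASSUMPTION: root can connect to the local MariaDB without a password.
sudo mysql mistral -e "select name, state, created_at, updated_at \
  from task_executions_v2 \
  where workflow_execution_id='$EXEC_ID' \
  order by created_at, updated_at;"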
WORKAROUND: Identify the stuck task's action ID and manually set its state to SUCCESS.

To identify the stuck task's action ID, use a command like the following:

source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk {'print $2'} | tail -1)
for TASK_ID in $(mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'); do
  echo $TASK_ID
  mistral task-get $TASK_ID
done

Using the $TASK_ID for the ceph_install task, find its action ID using:

mistral action-execution-list

The above will return the action_execution_id in the first column of output. Use this ID to tell Mistral that the task completed successfully:

mistral action-execution-update --state SUCCESS action_execution_id

After doing the above, the workbook proceeded to the save_fetch_directory and purge_fetch_directory tasks, the heat stack update continued, and the deployment completed.
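The steps above can be strung together into a single pass. This is a rough sketch rather than a polished tool: the awk field positions assume the default table layout of the Mistral CLI, and the grep for the task ID assumes that ID appears on the matching row of `mistral action-execution-list`, as it did in our environment.

# One-shot version of the workaround above (a sketch, not a polished
# tool). In the CLI's default table output, the ID is awk field 2 and
# the task name field 4.
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk '{print $2}' | tail -1)

# Find the task execution ID of the stuck ceph_install task.
TASK_ID=$(mistral task-list $UUID | awk '$4 == "ceph_install" {print $2}')

# Find the action execution belonging to that task.
# ASSUMPTION: the task execution ID is printed on the matching row of
# the action-execution-list output.
ACTION_ID=$(mistral action-execution-list | grep $TASK_ID | awk '{print $2}')

# Tell Mistral the action finished successfully so the workflow resumes.
mistral action-execution-update --state SUCCESS $ACTION_ID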
https://docs.openstack.org/mistral/ocata/guides/cli_guide.html
Environment: The undercloud where this issue presented itself was restored from a VM snapshot. Running `mistral execution-list --limit=-1 | grep ceph` after the snapshot was restored showed two other pre-existing executions called tripleo.storage.v1.ceph-install with the ceph_install action still running, so it's possible that Mistral was in an inconsistent state. The undercloud also did not have the latest errata applied, including the following: https://github.com/openstack/tripleo-common/commit/87cef2ea1db6efc420a853d5c3b573960e032747 which might explain why it was stuck on returning the result of a successful Ceph deployment (see bug 1687795).
Hey John, we are running into this exact situation on bare metal using RHOSP 13 with an August live channel snapshot. The workaround works, but we want to demonstrate overcloud updates that don't require brain surgery :) We have a case open with GSS regarding this and were wondering if you had any thoughts, given that it's not virtual-machine related (GSS case #02459386).
In the undercloud, /etc/my.cnf.d/mariadb-server.cnf:

[server]
max_allowed_packet=64M

Before:

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 16777216 |
+--------------------+----------+
1 row in set (0.00 sec)

After:

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 67108864 |
+--------------------+----------+
1 row in set (0.00 sec)
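For completeness, a sketch of how to make the new value take effect; it assumes the undercloud's database runs as the mariadb systemd service.

# Restart MariaDB so mysqld rereads /etc/my.cnf.d/mariadb-server.cnf.
# ASSUMPTION: the service is named mariadb, as on a RHEL-based undercloud.
sudo systemctl restart mariadb

# Or raise the limit on the running server without a restart
# (64M = 67108864 bytes). A SET GLOBAL change is lost on the next
# restart, so keep the config file edit as well.
sudo mysql -e "SET GLOBAL max_allowed_packet = 67108864;"

# Verify the running value.
sudo mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"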
WORKAROUND 2 (easier): see comment #18.
Mistral is still, despite our efforts in BZ 1687795, trying to write data larger than max_allowed_packet to the database, and it gets stuck in the RUNNING state when doing so. We think this patch will address it: https://review.opendev.org/#/c/680709/
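One hypothetical way to check whether stored results are approaching the limit is to measure the largest action outputs in the DB. This is only a sketch: it assumes the result lands in the output column of action_executions_v2, and column names may differ between Mistral releases.

# Diagnostic sketch: list the ten largest stored action outputs so their
# size can be compared against max_allowed_packet (16 MiB by default).
# ASSUMPTION: results are stored in the `output` column of
# action_executions_v2; column names may vary between Mistral releases.
sudo mysql mistral -e "select id, name, state, length(output) as output_bytes \
  from action_executions_v2 \
  order by output_bytes desc limit 10;"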
https://review.opendev.org/#/c/680709 has merged upstream in queens
*** This bug has been marked as a duplicate of bug 1747126 ***
*** Bug 1830148 has been marked as a duplicate of this bug. ***