While deploying an overcloud which uses Ceph, the ceph_install Mistral task [1] became stuck in the RUNNING state even after /var/log/mistral/ceph-install-workflow.log indicated that the ceph-ansible install was complete. According to the Mistral database [2], the subsequent tasks in the same workbook, save_fetch_directory and purge_fetch_directory, had not yet run.

[1] https://github.com/openstack/tripleo-common/blob/34f1c505f19371a9110fc14a31fb0d95b31b2af2/workbooks/ceph-ansible.yaml#L144

[2] MariaDB [mistral]> select name, state, created_at, updated_at from task_executions_v2 where workflow_execution_id='2cfbc2ec-2abd-4a40-bea7-d9ceb69296d7' order by created_at, updated_at;
+--------------------------+---------+---------------------+---------------------+
| name                     | state   | created_at          | updated_at          |
+--------------------------+---------+---------------------+---------------------+
| set_swift_container      | SUCCESS | 2019-04-21 17:28:31 | 2019-04-21 17:28:32 |
| collect_puppet_hieradata | SUCCESS | 2019-04-21 17:28:32 | 2019-04-21 17:28:33 |
| check_hieradata          | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_ip_lists             | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| set_blacklisted_ips      | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| merge_ip_lists           | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:28:33 |
| enable_ssh_admin         | SUCCESS | 2019-04-21 17:28:33 | 2019-04-21 17:29:18 |
| get_private_key          | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| restore_fetch_directory  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| make_fetch_directory     | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| verify_container_exists  | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:18 |
| collect_nodes_uuid       | SUCCESS | 2019-04-21 17:29:18 | 2019-04-21 17:29:20 |
| parse_node_data_lookup   | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| set_ip_uuids             | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:20 |
| map_node_data_lookup     | SUCCESS | 2019-04-21 17:29:20 | 2019-04-21 17:29:21 |
| set_role_vars            | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| build_extra_vars         | SUCCESS | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
| ceph_install             | RUNNING | 2019-04-21 17:29:21 | 2019-04-21 17:29:21 |
+--------------------------+---------+---------------------+---------------------+
18 rows in set (0.001 sec)
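For reference, the same task state can be pulled without opening an interactive MySQL session. This is only a sketch: it assumes the root user can reach the local MariaDB socket without a password (as on our undercloud) and that the workflow execution ID is looked up first with the Mistral CLI.

# Find the ID of the most recent ceph-install workflow execution.
# EXEC_ID is a hypothetical variable name; the grep/awk pattern mirrors
# the workaround below.
source ~/stackrc
EXEC_ID=$(mistral execution-list --limit=-1 | \
          grep tripleo.storage.v1.ceph-install | awk '{print $2}' | tail -1)

# Inspect the state of each task in that execution directly in the DB.
# ASSUMPTION: root can connect to the local MariaDB without a password.
sudo mysql mistral -e "select name, state, created_at, updated_at \
  from task_executions_v2 \
  where workflow_execution_id='$EXEC_ID' \
  order by created_at, updated_at;"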
WORKAROUND: Identify the stuck task's action ID and manually set its state to SUCCESS.

To identify the stuck task's action ID, use a command like the following:

source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk {'print $2'} | tail -1)
for TASK_ID in $(mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'); do
  echo $TASK_ID
  mistral task-get $TASK_ID
done

Using the $TASK_ID for the ceph_install task, find its action ID using:

mistral action-execution-list

The above will return the action_execution_id in the first column of output. Use this ID to tell Mistral that the task completed successfully:

mistral action-execution-update --state SUCCESS action_execution_id

After doing the above, the workbook proceeded to the save_fetch_directory and purge_fetch_directory tasks, the heat stack update continued, and the deployment completed.
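The steps above can be strung together into a single pass. This is a rough sketch rather than a polished tool: the awk field positions assume the default table layout of the Mistral CLI, and the grep for the task ID assumes that ID appears on the matching row of `mistral action-execution-list`, as it did in our environment.

# One-shot version of the workaround above (a sketch, not a polished
# tool). In the CLI's default table output, the ID is awk field 2 and
# the task name field 4.
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk '{print $2}' | tail -1)

# Find the task execution ID of the stuck ceph_install task.
TASK_ID=$(mistral task-list $UUID | awk '$4 == "ceph_install" {print $2}')

# Find the action execution belonging to that task.
# ASSUMPTION: the task execution ID is printed on the matching row of
# the action-execution-list output.
ACTION_ID=$(mistral action-execution-list | grep $TASK_ID | awk '{print $2}')

# Tell Mistral the action finished successfully so the workflow resumes.
mistral action-execution-update --state SUCCESS $ACTION_ID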
https://docs.openstack.org/mistral/ocata/guides/cli_guide.html
Environment: The undercloud where this issue presented itself was restored from a VM snapshot. Running `mistral execution-list --limit=-1 | grep ceph` after the snapshot was restored showed two other pre-existing executions called tripleo.storage.v1.ceph-install with the ceph_install action still running, so it's possible that Mistral was in an inconsistent state. The undercloud also did not have the latest errata applied, including the following: https://github.com/openstack/tripleo-common/commit/87cef2ea1db6efc420a853d5c3b573960e032747 which might explain why it was stuck on returning the result of a successful Ceph deployment (see bug 1687795).
Hey John, we are running into this exact situation on bare metal using RHOSP 13 with an August live channel snapshot. The workaround works, but we want to demonstrate overcloud updates that don't require brain surgery :) We have a case open with GSS regarding this and were wondering if you had any thoughts, given that it's not virtual-machine related (GSS case #02459386).
In the undercloud, /etc/my.cnf.d/mariadb-server.cnf:

[server]
max_allowed_packet=64M

Before:

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 16777216 |
+--------------------+----------+
1 row in set (0.00 sec)

After:

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 67108864 |
+--------------------+----------+
1 row in set (0.00 sec)
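For completeness, a sketch of how to make the new value take effect; it assumes the undercloud's database runs as the mariadb systemd service.

# Restart MariaDB so mysqld rereads /etc/my.cnf.d/mariadb-server.cnf.
# ASSUMPTION: the service is named mariadb, as on a RHEL-based undercloud.
sudo systemctl restart mariadb

# Or raise the limit on the running server without a restart
# (64M = 67108864 bytes). A SET GLOBAL change is lost on the next
# restart, so keep the config file edit as well.
sudo mysql -e "SET GLOBAL max_allowed_packet = 67108864;"

# Verify the running value.
sudo mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"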
WORKAROUND 2 (easier): see comment #18.
Mistral is still, despite our efforts in BZ 1687795, trying to write data larger than max_allowed_packet to the database, and it gets stuck in the RUNNING state when doing so. We think this patch will address it: https://review.opendev.org/#/c/680709/
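One hypothetical way to check whether stored results are approaching the limit is to measure the largest action outputs in the DB. This is only a sketch: it assumes the result lands in the output column of action_executions_v2, and column names may differ between Mistral releases.

# Diagnostic sketch: list the ten largest stored action outputs so their
# size can be compared against max_allowed_packet (16 MiB by default).
# ASSUMPTION: results are stored in the `output` column of
# action_executions_v2; column names may vary between Mistral releases.
sudo mysql mistral -e "select id, name, state, length(output) as output_bytes \
  from action_executions_v2 \
  order by output_bytes desc limit 10;"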
https://review.opendev.org/#/c/680709 has merged upstream in queens
*** This bug has been marked as a duplicate of bug 1747126 ***
*** Bug 1830148 has been marked as a duplicate of this bug. ***