Description of problem:

When deploying a new environment we are hitting a time-out issue; looking at the Mistral workflows, tripleo.storage.v1.ceph-install is the one that is stuck.

~~~
(undercloud) [stack@director ~]$ openstack workflow execution list --fit-width
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
| ID                | Workflow ID       | Workflow name     | Description       | Task Execution ID | State   | State info |
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
...
| 30443f5b-a7e7-46a | 84f1067d-c6d0-4dc | tripleo.storage.v | sub-workflow      | a8273ec5-a77d-442 | RUNNING | None       |
| 1-9370-7fdba789e1 | 0-9ea5-a961d566b1 | 1.ceph-install    | execution         | 6-87e5-effc136f77 |         |            |
...
~~~

However, looking at ceph-install-workflow.log we can see the Ceph deployment completes:

~~~
2020-04-29 12:37:38,627 p=6519 u=mistral | PLAY RECAP *********************************************************************
...
2020-04-29 12:37:38,631 p=6519 u=mistral | INSTALLER STATUS ***************************************************************
2020-04-29 12:37:38,635 p=6519 u=mistral | Install Ceph Monitor : Complete (0:06:24)
2020-04-29 12:37:38,635 p=6519 u=mistral | Install Ceph Manager : Complete (0:01:42)
2020-04-29 12:37:38,635 p=6519 u=mistral | Install Ceph OSD : Complete (0:26:29)
2020-04-29 12:37:38,635 p=6519 u=mistral | Install Ceph RGW : Complete (0:01:40)
2020-04-29 12:37:38,635 p=6519 u=mistral | Install Ceph Client : Complete (0:12:05)
2020-04-29 12:37:38,635 p=6519 u=mistral | Wednesday 29 April 2020 12:37:38 -0400 (0:00:00.357) 0:54:55.575 *******
2020-04-29 12:37:38,636 p=6519 u=mistral | ===============================================================================
~~~

Looking at KCS https://access.redhat.com/solutions/4091811 we can see a similar issue where the Ceph install completes but Mistral is not updated to reflect that, and the execution needs to be updated manually using the following steps:

~~~
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk {'print $2'} | tail -1)
for TASK_ID in $(mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'); do
  echo $TASK_ID
  mistral task-get $TASK_ID
done
~~~

We don't see the same `InternalError: (1153, u"Got a packet bigger than 'max_allowed_packet' bytes")` errors in the Mistral engine log, so I am not sure whether this is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1703618.

Note the environment is not using the latest z-stream (openstack-tripleo-common-8.6.8-5.el7ost.noarch), so this might be the issue.

How reproducible:

Every time the deployment is run the workaround needs to be followed, as the deployment halts on the Ceph install.
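For completeness, below is a rough sketch of the recovery applied after confirming in ceph-install-workflow.log that the ceph-ansible run finished. It extends the KCS snippet above; the `mistral execution-update` call and its `-s/--state` option are assumptions about the mistralclient version in use, and the KCS article remains the authoritative procedure.

~~~
# Hedged sketch only -- follow the KCS article for the exact supported steps.
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep $WORKFLOW | awk {'print $2'} | tail -1)

# Inspect the task states first; only tasks still stuck in RUNNING are of interest.
mistral task-list $UUID

# If the ceph-ansible play completed, mark the hung execution as finished so the
# overcloud deployment can continue (assumes execution-update supports -s).
mistral execution-update $UUID -s SUCCESS
~~~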
We have seen this happen on older z-streams when CephAnsiblePlaybookVerbosity was set to a non-zero value, because that greatly increases the amount of log output produced by ceph-ansible, all of which has to be stored in the Mistral database table. Can you please try setting CephAnsiblePlaybookVerbosity to 0, if that isn't the case already?
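For reference, a minimal sketch of one way to pin the parameter through a custom environment file; the file name and path below are illustrative, not taken from this environment, and the value may also already be set in one of the environment files currently passed to the deploy command.

~~~
# Hedged sketch: create a small environment file that forces the verbosity to 0.
cat > /home/stack/ceph-verbosity.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 0
EOF

# Then add "-e /home/stack/ceph-verbosity.yaml" to the existing
# "openstack overcloud deploy" command, after the other -e environment files,
# and re-run the deployment.
~~~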