Bug 1830148

Summary: tripleo.storage.v1.ceph-install workflow showing as running when ceph-install-workflow.log shows as complete
Product: Red Hat OpenStack
Component: ceph-ansible
Version: 13.0 (Queens)
Status: CLOSED DUPLICATE
Reporter: ldenny
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Yogev Rabl <yrabl>
CC: gfidente, johfulto
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-05-11 13:55:04 UTC

Description ldenny 2020-04-30 22:50:19 UTC
Description of problem:
When deploying a new environment we hit a time-out. Inspecting the Mistral workflows shows that tripleo.storage.v1.ceph-install is the one stuck in RUNNING.

~~~
(undercloud) [stack@director ~]$ openstack workflow execution list --fit-width
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
| ID                | Workflow ID       | Workflow name     | Description       | Task Execution ID | State   | State info |
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
...
| 30443f5b-a7e7-46a | 84f1067d-c6d0-4dc | tripleo.storage.v | sub-workflow      | a8273ec5-a77d-442 | RUNNING | None       |
| 1-9370-7fdba789e1 | 0-9ea5-a961d566b1 | 1.ceph-install    | execution         | 6-87e5-effc136f77 |         |            |
...
~~~

However, ceph-install-workflow.log shows that the Ceph deployment completed:

~~~
2020-04-29 12:37:38,627 p=6519 u=mistral |  PLAY RECAP *********************************************************************
... 
2020-04-29 12:37:38,631 p=6519 u=mistral |  INSTALLER STATUS ***************************************************************
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Monitor        : Complete (0:06:24)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Manager        : Complete (0:01:42)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph OSD            : Complete (0:26:29)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph RGW            : Complete (0:01:40)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Client         : Complete (0:12:05)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Wednesday 29 April 2020  12:37:38 -0400 (0:00:00.357)       0:54:55.575 ******* 
2020-04-29 12:37:38,636 p=6519 u=mistral |  =============================================================================== 
~~~
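The INSTALLER STATUS block can be checked mechanically before digging into Mistral. A minimal sketch on a canned excerpt (in a real run the log typically lives at /var/log/mistral/ceph-install-workflow.log on the undercloud):

```shell
# Canned excerpt standing in for ceph-install-workflow.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Monitor        : Complete (0:06:24)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph OSD            : Complete (0:26:29)
EOF
# Any "Install Ceph ..." status line not marked Complete means the play did not finish.
if grep 'Install Ceph' "$log" | grep -qv ': Complete'; then
  echo "ceph-ansible did not finish"
else
  echo "ceph-ansible finished"
fi
rm -f "$log"
```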

KCS https://access.redhat.com/solutions/4091811 describes a similar issue in which the Ceph install completes but Mistral is never updated to reflect it; the workflow state has to be inspected and updated manually using the following steps:
~~~
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list --limit=-1 | grep "$WORKFLOW" | awk '{print $2}' | tail -1)
for TASK_ID in $(mistral task-list "$UUID" | awk '{print $2}' | egrep -v 'ID|^$'); do
  echo "$TASK_ID"
  mistral task-get "$TASK_ID"
done
~~~
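As a side note on the pipeline above: the `awk '{print $2}'` stages rely on the CLI's table layout, where the leading pipe character makes the UUID the second whitespace-separated field. A tiny self-contained illustration on canned output (hypothetical UUID, no undercloud needed):

```shell
# Canned row mimicking a `mistral execution-list` table line.
row='| 30443f5b-a7e7-46a1-9370-7fdba789e10c | tripleo.storage.v1.ceph-install | RUNNING |'
# Field 1 is the leading "|", so the UUID lands in field 2.
uuid=$(echo "$row" | awk '{print $2}')
echo "$uuid"
```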

We don't see the same `InternalError: (1153, u"Got a packet bigger than 'max_allowed_packet' bytes")` errors in the Mistral engine log, so I am not sure whether this is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1703618

Note that the environment is not using the latest z stream (openstack-tripleo-common-8.6.8-5.el7ost.noarch), so this might be the issue.


How reproducible:
Every time the deployment is run, the workaround has to be applied because the deployment halts on the Ceph install.

Comment 1 Giulio Fidente 2020-05-04 13:26:30 UTC
We have seen this happen on older z streams when CephAnsiblePlaybookVerbosity was set to a non-zero value, because that greatly increased the number of log lines produced by ceph-ansible, all of which had to be stored in the Mistral database.

Can you please try setting CephAnsiblePlaybookVerbosity to 0, if that isn't already the case?
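If verbosity was raised, it can be set back through a parameter override in an environment file passed to the deploy command; a minimal fragment (the file name is illustrative):

```yaml
# e.g. ~/templates/ceph-verbosity.yaml (illustrative file name)
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 0
```

Include it with `-e ~/templates/ceph-verbosity.yaml` on the `openstack overcloud deploy` command line.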