Bug 1830148 - tripleo.storage.v1.ceph-install workflow showing as RUNNING when ceph-install-workflow.log shows the install as complete
Keywords:
Status: CLOSED DUPLICATE of bug 1703618
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph-ansible
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-30 22:50 UTC by ldenny
Modified: 2023-10-06 19:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-11 13:55:04 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-28423 (last updated 2023-09-07 23:05:33 UTC)

Description ldenny 2020-04-30 22:50:19 UTC
Description of problem:
When deploying a new environment we are hitting a time-out. Looking at the Mistral workflow executions, tripleo.storage.v1.ceph-install is the one that is stuck:

~~~
(undercloud) [stack@director ~]$ openstack workflow execution list --fit-width
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
| ID                | Workflow ID       | Workflow name     | Description       | Task Execution ID | State   | State info |
+-------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+
...
| 30443f5b-a7e7-46a | 84f1067d-c6d0-4dc | tripleo.storage.v | sub-workflow      | a8273ec5-a77d-442 | RUNNING | None       |
| 1-9370-7fdba789e1 | 0-9ea5-a961d566b1 | 1.ceph-install    | execution         | 6-87e5-effc136f77 |         |            |
...
~~~
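
For reference, the stuck execution can be inspected further from the undercloud with something along these lines (replace <execution_id> with the full UUID of the sub-workflow execution; it is truncated in the --fit-width output above):
~~~
# Show the overall state of the stuck sub-workflow execution
openstack workflow execution show <execution_id>
# List its task executions to see which task is still marked RUNNING
openstack task execution list <execution_id>
~~~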

However, looking at ceph-install-workflow.log we can see that the Ceph deployment completes:

~~~
2020-04-29 12:37:38,627 p=6519 u=mistral |  PLAY RECAP *********************************************************************
... 
2020-04-29 12:37:38,631 p=6519 u=mistral |  INSTALLER STATUS ***************************************************************
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Monitor        : Complete (0:06:24)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Manager        : Complete (0:01:42)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph OSD            : Complete (0:26:29)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph RGW            : Complete (0:01:40)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Install Ceph Client         : Complete (0:12:05)
2020-04-29 12:37:38,635 p=6519 u=mistral |  Wednesday 29 April 2020  12:37:38 -0400 (0:00:00.357)       0:54:55.575 ******* 
2020-04-29 12:37:38,636 p=6519 u=mistral |  =============================================================================== 
~~~
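
For completeness, the log above is the ceph-ansible output driven by Mistral on the undercloud; assuming the default OSP 13 location for that log, it can be checked with something like:
~~~
# Tail the ceph-ansible log written by the ceph-install workflow on the undercloud
sudo tail -n 20 /var/log/mistral/ceph-install-workflow.log
~~~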

Looking at KCS https://access.redhat.com/solutions/4091811, we can see a similar issue where the Ceph install completes but Mistral is not updated to reflect that, and the workflow has to be updated manually using the following steps:
~~~
source ~/stackrc
WORKFLOW='tripleo.storage.v1.ceph-install'
# Grab the most recent execution ID of the ceph-install workflow
UUID=$(mistral execution-list --limit=-1 | grep "$WORKFLOW" | awk '{print $2}' | tail -1)
# Walk every task in that execution to find the one still marked RUNNING
for TASK_ID in $(mistral task-list "$UUID" | awk '{print $2}' | egrep -v 'ID|^$'); do
  echo "$TASK_ID"
  mistral task-get "$TASK_ID"
done
~~~

We don't see the same `InternalError: (1153, u"Got a packet bigger than 'max_allowed_packet' bytes")` errors in the Mistral engine log, so I am not sure whether this is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1703618
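
For reference, a check along these lines should show whether the engine log contains that error (assuming the default Mistral log location on the undercloud):
~~~
# Look for the max_allowed_packet error described in bug 1703618
sudo grep -i "max_allowed_packet" /var/log/mistral/engine.log
~~~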

Note that the environment is not on the latest z-stream (openstack-tripleo-common-8.6.8-5.el7ost.noarch), so this might be the issue.
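
For anyone wanting to confirm the installed z-stream on the undercloud, a plain rpm query is enough:
~~~
# Show the openstack-tripleo-common build currently installed
rpm -q openstack-tripleo-common
~~~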


How reproducible:
Every time the deployment is run, the workaround needs to be applied because the deployment halts on the Ceph install.

Comment 1 Giulio Fidente 2020-05-04 13:26:30 UTC
We have seen this happen on older z-streams when CephAnsiblePlaybookVerbosity was set to a non-zero value, because that greatly increases the amount of log output produced by ceph-ansible, all of which has to be stored in the Mistral database table.

Can you please check whether CephAnsiblePlaybookVerbosity is already set to 0, and set it to 0 if it is not?
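
As a sketch only (the environment file name below is an assumption), the verbosity can be forced down with a small environment file passed to the deploy command:
~~~
# Write a minimal environment file that sets ceph-ansible verbosity to 0,
# then include it in `openstack overcloud deploy` with an extra -e argument.
cat > ~/ceph-verbosity.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 0
EOF
~~~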

