1364547 – execution step of Create Cluster task is reported as failed while the ceph-installer task finises with success

Bug 1364547 - execution step of Create Cluster task is reported as failed while the ceph-installer task finises with success

Summary: execution step of Create Cluster task is reported as failed while the ceph-in...

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Red Hat Storage Console
Classification:	Red Hat Storage
Component:	core
Sub Component:
Version:	2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3
Assignee:	Nishanth Thomas
QA Contact:	sds-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-08-05 16:40 UTC by Martin Bukatovic
Modified:	2018-11-19 05:44 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:

Attachments	(Terms of Use)
screenshot 1: task detail page of Create Cluster task (104.59 KB, image/png) 2016-08-05 16:40 UTC, Martin Bukatovic	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1341525	0	unspecified	CLOSED	cluster creation task has failed but cluster seems installed properly	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1356016	0	unspecified	CLOSED	Host initialization task strange fail	2021-02-22 00:41:40 UTC

Internal Links: 1341525 1356016

Description Martin Bukatovic 2016-08-05 16:40:35 UTC

Created attachment 1187977 [details]
screenshot 1: task detail page of Create Cluster task

Description of problem
======================

During Create Cluster task, a "Package installation" execution step can be
reported as failed, while the actual task is performed properly without any
error.

Moreover reasoning behind the fail (Was it timeout? Or something else went
wrong?) is not communicated by RHSC 2.0. Again: which is extra confusing,
because the underlying ceph-installer task didn't fail, but finished with
success.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-ui-0.0.51-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
ceph-installer-1.0.14-1.20160802git5cd3d17.el7scon.noarch
ceph-ansible-1.0.5-32.el7scon.noarch

On Ceph 2.0 machines:

rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

How reproducible
================

Unknown in general.

But if we knew how to make one "Package installation" execution step fail in
this particular way, it would be 100% reproducible.

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept few nodes for the ceph cluster.
3. Create new ceph cluster named 'alpha', so that at least one
   "Package installation" step would fail[1].
4. Check task detail page of the Create Cluster task and compare it with
   reports from ceph-installer service.

[1] My current best guess (for a reasoning, see "Additional info" section) is
that it's caused by a timeout, so that one would need to make sure that the
step takes extra long, but finishes after all. But this idea haven't been
confirmed so far - check if QE team haven't updated the reproducer in comments
below.

Actual results
==============

On the task detail page for the Create Cluster task (see screenshot 1), I see
that 1st failed execution step was the 8th one:

> Package installation failed on dhcp-126-80.lab.eng.brq.redhat.com

with a timestamp:

> Aug 3 2016, 06:29:04 PM

But when I query ceph-installer service to find the task which were concerned
wit this step:

~~~
$ curl http://usm-server.example.com:8181/api/tasks/ | less
$ curl http://usm-server.example.com:8181/api/tasks/47bb7577-a061-4297-8952-7b72e41400f6/ > task
$ jq '.stdout' task | sed 's!\\\\n!\n!g;s!\\n!\n!g' > task.stdout
~~~

Before we go on, let's check to demonstrate we are looking at the right task
(it should be package installation on dhcp-126-80 machine):

~~~
$ jq '.request' task
"{\"calamari\":true,\"hosts\":[\"dhcp-126-80.lab.eng.brq.redhat.com\"],\"redhat_storage\":true}"
$ jq '.command' task
"/bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/site.yml.sample -i /tmp/47bb7577-a061-4297-8952-7b72e41400f6_eYx3oK --extra-vars {\"fetch_directory\": \"/var/lib/ceph-installer/fetch\", \"ceph_stable_rh_storage_cdn_install\": true, \"calamari\": true, \"ceph_stable_rh_storage\": true} --tags package-install"
~~~

Good, it's the right task.

Going deeper, I see this:

~~~
$ jq '.succeeded' task
true
$ jq '.exit_code' task
0
$ jq '.stderr' task
""
$ tail task.stdout
PLAY [restapis] *************************************************************** 
skipping: no hosts matched

PLAY [rbdmirrors] ************************************************************* 
skipping: no hosts matched

PLAY RECAP ******************************************************************** 
dhcp-126-80.lab.eng.brq.redhat.com : ok=17   changed=3    unreachable=0    failed=0   

"
~~~

In other words: everything seems to indicate that the ceph-installer task
finished properly without any error. Which matches with manual checking on
the machine itself - all the packages were installed properly.

Expected results
================

RHSC 2.0 should not report a failure for execution step, which is finished
properly by it's ceph-ansible task.

Additional info
===============

My best guess is that some RHSC 2.0 timeout is involved here.

When I checked the full stdout output of the ceph-installer task (see
attachment task.stdout), I noticed that yum installation failed multiple times
trying to download packages, until it found a working mirror in the end so that
the installation did go on and then finished fine.

This could have increased the time of the task so much that RHSC 2.0 decided to
kill it declaring it as failed. And it seems that only the RHSC 2.0 execution
step was killed, while the ceph-installer task wasn't so that it finished with
success anyway.

Moreover when I check the timestamps associated with the ceph-installer task:

~~~
$ jq '.started' task 
"2016-08-03 18:14:03.202477"
$ jq '.ended' task 
"2016-08-03 18:44:29.647375"
~~~

And compare it with the task detail page of Create Cluster task as reported by
RHSC 2.0 (see screenshot 1), I see that:

 * timestamp "Aug 3 2016, 06:29:04 PM" reported for this task (8th step)
   doesn't match started nor ended timestamp as reported by ceph-installer
 * but timestamps from ceph-installer task (shown above) matches timestamps
   for 7th and 9th execution step.

Which may mean that:

1) Task #7 finished at 2016-08-03 18:14:02 (as shown in screenshot 1)
2) Then task #8 was started at 2016-08-03 18:14:03
   (as reported by ceph-installer, see `jq '.started' task`)
3) At 2016-08-03 18:29, RHSC 2.0 decided to kill the task (due to timeout?),
   but the ceph-installer task was still running
4) Since the ceph-installer task was still running, new execution step #9
   had to wait for ceph-installer to finish task for step #8 anyway
5) The next step #9 started at 2016-08-03 18:44 (as shown in screenshot 1),
   which also matches the 'ended' timestamp from the ceph-installer task
   (see `jq '.ended' task`)

Comment 1 Martin Bukatovic 2016-08-05 16:44:22 UTC

Could dev team verify my theory as described in Additional info section?

Comment 4 Martin Bukatovic 2016-08-05 16:59:28 UTC

Moreover this throws a shadow on a fix of these BZs, which describes
a similar issue for other types of tasks (task fails, but the action is done):

BZ 1356016 - initialize task
BZ 1341525 - Create Cluster task (as a whole)

Comment 6 Nishanth Thomas 2016-08-08 07:33:39 UTC

Please have a look at bz 1341525 for the details. Especially look at comment 4 and 6 which will give answers for your questions.

Comment 7 Martin Bukatovic 2016-08-08 13:55:28 UTC

(In reply to Nishanth Thomas from comment #6)
> Please have a look at bz 1341525 for the details. Especially look at comment
> 4 and 6 which will give answers for your questions.

So I read this as "indeed, my hypothesis is right".

Comment 9 Shubhendu Tripathi 2018-11-19 05:43:36 UTC

This product is EOL now

Note You need to log in before you can comment on or make changes to this bug.