Bug 1364547

Summary: execution step of Create Cluster task is reported as failed while the ceph-installer task finishes with success
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Martin Bukatovic <mbukatov>
Component: core
Assignee: Nishanth Thomas <nthomas>
Sub component: provisioning
QA Contact: sds-qe-bugs
Status: CLOSED EOL
Docs Contact:
Severity: medium
Priority: unspecified
CC: mbukatov, nthomas, sankarshan
Version: 2   
Target Milestone: ---   
Target Release: 3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
screenshot 1: task detail page of Create Cluster task

Description Martin Bukatovic 2016-08-05 16:40:35 UTC
Created attachment 1187977 [details]
screenshot 1: task detail page of Create Cluster task

Description of problem
======================

During a Create Cluster task, a "Package installation" execution step can be
reported as failed, even though the actual task is performed properly without
any error.

Moreover, the reasoning behind the failure (was it a timeout, or did
something else go wrong?) is not communicated by RHSC 2.0, which is extra
confusing because the underlying ceph-installer task didn't fail, but
finished with success.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-ui-0.0.51-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
ceph-installer-1.0.14-1.20160802git5cd3d17.el7scon.noarch
ceph-ansible-1.0.5-32.el7scon.noarch

On Ceph 2.0 machines:

rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

How reproducible
================

Unknown in general.

But if we knew how to make a "Package installation" execution step fail in
this particular way, it would be 100% reproducible.

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha', so that at least one
   "Package installation" step fails[1].
4. Check task detail page of the Create Cluster task and compare it with
   reports from ceph-installer service.

[1] My current best guess (for the reasoning, see the "Additional info"
section) is that it's caused by a timeout, so one would need to make the
step take extra long but still finish in the end. This idea hasn't been
confirmed so far, though; check whether the QE team has updated the
reproducer in the comments below.

Actual results
==============

On the task detail page for the Create Cluster task (see screenshot 1), I see
that the first failed execution step was the 8th one:

> Package installation failed on dhcp-126-80.lab.eng.brq.redhat.com

with a timestamp:

> Aug 3 2016, 06:29:04 PM

But when I query the ceph-installer service to find the task associated with
this step:

~~~
$ curl http://usm-server.example.com:8181/api/tasks/ | less
$ curl http://usm-server.example.com:8181/api/tasks/47bb7577-a061-4297-8952-7b72e41400f6/ > task
$ jq '.stdout' task | sed 's!\\\\n!\n!g;s!\\n!\n!g' > task.stdout
~~~
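
As an aside, `jq -r` emits the raw string and turns the JSON `\n` escapes
into real newlines in one step, so a simpler extraction may also work. This
is only a sketch with demo data; the sed variant above additionally handles
doubly-escaped output:

~~~
# Demo stand-in for the task JSON (the real file comes from curl above)
printf '%s' '{"stdout":"PLAY RECAP ****\nok=17 failed=0"}' > task

# -r prints the raw string, so JSON \n escapes become real newlines
jq -r '.stdout' task > task.stdout
head -1 task.stdout
~~~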

Before we go on, let's verify that we are looking at the right task
(it should be package installation on the dhcp-126-80 machine):

~~~
$ jq '.request' task
"{\"calamari\":true,\"hosts\":[\"dhcp-126-80.lab.eng.brq.redhat.com\"],\"redhat_storage\":true}"
$ jq '.command' task
"/bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/site.yml.sample -i /tmp/47bb7577-a061-4297-8952-7b72e41400f6_eYx3oK --extra-vars {\"fetch_directory\": \"/var/lib/ceph-installer/fetch\", \"ceph_stable_rh_storage_cdn_install\": true, \"calamari\": true, \"ceph_stable_rh_storage\": true} --tags package-install"
~~~

Good, it's the right task.

Going deeper, I see this:

~~~
$ jq '.succeeded' task
true
$ jq '.exit_code' task
0
$ jq '.stderr' task
""
$ tail task.stdout
PLAY [restapis] *************************************************************** 
skipping: no hosts matched

PLAY [rbdmirrors] ************************************************************* 
skipping: no hosts matched

PLAY RECAP ******************************************************************** 
dhcp-126-80.lab.eng.brq.redhat.com : ok=17   changed=3    unreachable=0    failed=0   

"
~~~

In other words, everything indicates that the ceph-installer task finished
properly without any error. This matches a manual check on the machine
itself: all the packages were installed properly.
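
The three jq checks above can be rolled into a single one-shot test. This is
only a sketch: the demo file stands in for the real `task` JSON fetched with
curl earlier.

~~~
# Demo stand-in for the `task` JSON fetched with curl above
printf '%s' '{"succeeded": true, "exit_code": 0, "stderr": ""}' > task

# The ceph-installer task is healthy iff it succeeded, exited 0,
# and wrote nothing to stderr
if [ "$(jq -r '.succeeded' task)" = "true" ] \
   && [ "$(jq -r '.exit_code' task)" = "0" ] \
   && [ -z "$(jq -r '.stderr' task)" ]; then
    echo "ceph-installer task finished cleanly"
fi
~~~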

Expected results
================

RHSC 2.0 should not report a failure for an execution step whose underlying
ceph-ansible task finished successfully.

Additional info
===============

My best guess is that some RHSC 2.0 timeout is involved here.

When I checked the full stdout output of the ceph-installer task (see
attachment task.stdout), I noticed that yum failed multiple times trying to
download packages, until it eventually found a working mirror, after which
the installation went on and finished fine.

This could have prolonged the task so much that RHSC 2.0 decided to kill it,
declaring it failed. And it seems that only the RHSC 2.0 execution step was
killed, while the ceph-installer task kept running and finished with success
anyway.

Moreover, when I check the timestamps associated with the ceph-installer task:

~~~
$ jq '.started' task 
"2016-08-03 18:14:03.202477"
$ jq '.ended' task 
"2016-08-03 18:44:29.647375"
~~~

and compare them with the task detail page of the Create Cluster task as
reported by RHSC 2.0 (see screenshot 1), I see that:

 * the timestamp "Aug 3 2016, 06:29:04 PM" reported for this task (8th step)
   matches neither the started nor the ended timestamp reported by
   ceph-installer
 * but the timestamps from the ceph-installer task (shown above) match the
   timestamps for the 7th and 9th execution steps.

This may mean that:

1) Task #7 finished at 2016-08-03 18:14:02 (as shown in screenshot 1)
2) Then task #8 was started at 2016-08-03 18:14:03
   (as reported by ceph-installer, see `jq '.started' task`)
3) At 2016-08-03 18:29, RHSC 2.0 decided to kill the task (due to timeout?),
   but the ceph-installer task was still running
4) Since the ceph-installer task was still running, the new execution step #9
   had to wait for ceph-installer to finish the task for step #8 anyway
5) The next step #9 started at 2016-08-03 18:44 (as shown in screenshot 1),
   which also matches the 'ended' timestamp from the ceph-installer task
   (see `jq '.ended' task`)
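
This timeline can be sanity-checked with a short shell sketch. Note that the
15-minute gap between the step start and the reported failure is only
inferred from the timestamps above; it is not a confirmed RHSC 2.0 timeout
setting:

~~~
# Timestamps taken from the jq output and screenshot 1 above
start="2016-08-03 18:14:03"     # ceph-installer task .started
failed="2016-08-03 18:29:04"    # failure time reported by RHSC 2.0
end="2016-08-03 18:44:29"       # ceph-installer task .ended

to_s() { date -d "$1" +%s; }    # GNU date, as on a RHEL 7 machine

# RHSC 2.0 gave up ~15 minutes after the step started...
echo "reported failed after $(( ($(to_s "$failed") - $(to_s "$start")) / 60 )) min"
# ...while the ceph-installer task actually ran for ~30 minutes
echo "task actually ran for $(( ($(to_s "$end") - $(to_s "$start")) / 60 )) min"
~~~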

Comment 1 Martin Bukatovic 2016-08-05 16:44:22 UTC
Could dev team verify my theory as described in Additional info section?

Comment 4 Martin Bukatovic 2016-08-05 16:59:28 UTC
Moreover, this casts doubt on the fixes of these BZs, which describe
a similar issue for other types of tasks (the task fails, but the action is
done):
BZ 1356016 - initialize task
BZ 1341525 - Create Cluster task (as a whole)

Comment 6 Nishanth Thomas 2016-08-08 07:33:39 UTC
Please have a look at bz 1341525 for the details. Especially look at comment 4 and 6 which will give answers for your questions.

Comment 7 Martin Bukatovic 2016-08-08 13:55:28 UTC
(In reply to Nishanth Thomas from comment #6)
> Please have a look at bz 1341525 for the details. Especially look at comment
> 4 and 6 which will give answers for your questions.

So I read this as "indeed, my hypothesis is right".

Comment 9 Shubhendu Tripathi 2018-11-19 05:43:36 UTC
This product is EOL now