Created attachment 1187977 [details] screenshot 1: task detail page of Create Cluster task Description of problem ====================== During Create Cluster task, a "Package installation" execution step can be reported as failed, while the actual task is performed properly without any error. Moreover reasoning behind the fail (Was it timeout? Or something else went wrong?) is not communicated by RHSC 2.0. Again: which is extra confusing, because the underlying ceph-installer task didn't fail, but finished with success. Version-Release =============== On RHSC 2.0 server machine: rhscon-core-0.0.39-1.el7scon.x86_64 rhscon-ui-0.0.51-1.el7scon.noarch rhscon-core-selinux-0.0.39-1.el7scon.noarch rhscon-ceph-0.0.39-1.el7scon.x86_64 ceph-installer-1.0.14-1.20160802git5cd3d17.el7scon.noarch ceph-ansible-1.0.5-32.el7scon.noarch On Ceph 2.0 machines: rhscon-core-selinux-0.0.39-1.el7scon.noarch rhscon-agent-0.0.16-1.el7scon.noarch How reproducible ================ Unknown in general. But if we knew how to make one "Package installation" execution step fail in this particular way, it would be 100% reproducible. Steps to Reproduce ================== 1. Install RHSC 2.0 following the documentation. 2. Accept few nodes for the ceph cluster. 3. Create new ceph cluster named 'alpha', so that at least one "Package installation" step would fail[1]. 4. Check task detail page of the Create Cluster task and compare it with reports from ceph-installer service. [1] My current best guess (for a reasoning, see "Additional info" section) is that it's caused by a timeout, so that one would need to make sure that the step takes extra long, but finishes after all. But this idea haven't been confirmed so far - check if QE team haven't updated the reproducer in comments below. Actual results ============== On the task detail page for the Create Cluster task (see screenshot 1), I see that 1st failed execution step was the 8th one: > Package installation failed on dhcp-126-80.lab.eng.brq.redhat.com with a timestamp: > Aug 3 2016, 06:29:04 PM But when I query ceph-installer service to find the task which were concerned wit this step: ~~~ $ curl http://usm-server.example.com:8181/api/tasks/ | less $ curl http://usm-server.example.com:8181/api/tasks/47bb7577-a061-4297-8952-7b72e41400f6/ > task $ jq '.stdout' task | sed 's!\\\\n!\n!g;s!\\n!\n!g' > task.stdout ~~~ Before we go on, let's check to demonstrate we are looking at the right task (it should be package installation on dhcp-126-80 machine): ~~~ $ jq '.request' task "{\"calamari\":true,\"hosts\":[\"dhcp-126-80.lab.eng.brq.redhat.com\"],\"redhat_storage\":true}" $ jq '.command' task "/bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/site.yml.sample -i /tmp/47bb7577-a061-4297-8952-7b72e41400f6_eYx3oK --extra-vars {\"fetch_directory\": \"/var/lib/ceph-installer/fetch\", \"ceph_stable_rh_storage_cdn_install\": true, \"calamari\": true, \"ceph_stable_rh_storage\": true} --tags package-install" ~~~ Good, it's the right task. Going deeper, I see this: ~~~ $ jq '.succeeded' task true $ jq '.exit_code' task 0 $ jq '.stderr' task "" $ tail task.stdout PLAY [restapis] *************************************************************** skipping: no hosts matched PLAY [rbdmirrors] ************************************************************* skipping: no hosts matched PLAY RECAP ******************************************************************** dhcp-126-80.lab.eng.brq.redhat.com : ok=17 changed=3 unreachable=0 failed=0 " ~~~ In other words: everything seems to indicate that the ceph-installer task finished properly without any error. Which matches with manual checking on the machine itself - all the packages were installed properly. Expected results ================ RHSC 2.0 should not report a failure for execution step, which is finished properly by it's ceph-ansible task. Additional info =============== My best guess is that some RHSC 2.0 timeout is involved here. When I checked the full stdout output of the ceph-installer task (see attachment task.stdout), I noticed that yum installation failed multiple times trying to download packages, until it found a working mirror in the end so that the installation did go on and then finished fine. This could have increased the time of the task so much that RHSC 2.0 decided to kill it declaring it as failed. And it seems that only the RHSC 2.0 execution step was killed, while the ceph-installer task wasn't so that it finished with success anyway. Moreover when I check the timestamps associated with the ceph-installer task: ~~~ $ jq '.started' task "2016-08-03 18:14:03.202477" $ jq '.ended' task "2016-08-03 18:44:29.647375" ~~~ And compare it with the task detail page of Create Cluster task as reported by RHSC 2.0 (see screenshot 1), I see that: * timestamp "Aug 3 2016, 06:29:04 PM" reported for this task (8th step) doesn't match started nor ended timestamp as reported by ceph-installer * but timestamps from ceph-installer task (shown above) matches timestamps for 7th and 9th execution step. Which may mean that: 1) Task #7 finished at 2016-08-03 18:14:02 (as shown in screenshot 1) 2) Then task #8 was started at 2016-08-03 18:14:03 (as reported by ceph-installer, see `jq '.started' task`) 3) At 2016-08-03 18:29, RHSC 2.0 decided to kill the task (due to timeout?), but the ceph-installer task was still running 4) Since the ceph-installer task was still running, new execution step #9 had to wait for ceph-installer to finish task for step #8 anyway 5) The next step #9 started at 2016-08-03 18:44 (as shown in screenshot 1), which also matches the 'ended' timestamp from the ceph-installer task (see `jq '.ended' task`)
Could dev team verify my theory as described in Additional info section?
Moreover this throws a shadow on a fix of these BZs, which describes a similar issue for other types of tasks (task fails, but the action is done): BZ 1356016 - initialize task BZ 1341525 - Create Cluster task (as a whole)
Please have a look at bz 1341525 for the details. Especially look at comment 4 and 6 which will give answers for your questions.
(In reply to Nishanth Thomas from comment #6) > Please have a look at bz 1341525 for the details. Especially look at comment > 4 and 6 which will give answers for your questions. So I read this as "indeed, my hypothesis is right".
This product is EOL now