Bug 1571244
| Summary: | Import cluster job fails for a while but then finishes successfully | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Filip Balák <fbalak> |
| Component: | web-admin-tendrl-notifier | Assignee: | gowtham <gshanmug> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.4 | CC: | dahorak, fbalak, mbukatov, nthomas, rhs-bugs, sankarshan |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tendrl-commons-1.6.3-8.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-04 07:04:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1503137, 1602858 | | |
| Attachments: | | | |

Description
Filip Balák
2018-04-24 11:49:18 UTC
Add a screenshot.

Created attachment 1427153
task page during import
Created attachment 1427154
task page after import
Created attachment 1427155
logs and configuration files

I reproduced this problem. It happens because the same job is executed by two different nodes.

This issue is fixed in https://github.com/Tendrl/commons/pull/954

*** Bug 1576717 has been marked as a duplicate of this bug. ***

I ran the script that discovered this issue 25 times (the cluster was imported successfully 23 times) and this issue didn't appear. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Created attachment 1455248
01 Import cluster fail
Created attachment 1455249
02 ImportCluster later passed
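
For illustration only: the root cause given in this report is that the same job was executed by two different nodes. A minimal sketch of one general way to rule out such double execution is an atomic claim step, where a node may start a job only if it is the one that moved the job out of its initial state. Everything below is an assumption made for the example: `JobStore`, `try_claim`, `race_for_job`, the status values and the node names are hypothetical, and the store is a plain in-memory dictionary rather than the shared store Tendrl uses. It does not reproduce the changes made in the pull requests linked in this report.

```python
import threading


class JobStore:
    """In-memory stand-in for a shared job queue (hypothetical, not Tendrl's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def add_job(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"status": "new", "locked_by": None}

    def try_claim(self, job_id, node_id):
        """Atomically move a job from 'new' to 'processing' for one node.

        The status check and the update run under a single lock, so together
        they act like a compare-and-swap: only the first caller sees 'new'.
        """
        with self._lock:
            job = self._jobs[job_id]
            if job["status"] != "new":
                return False  # another node already owns this job
            job["status"] = "processing"
            job["locked_by"] = node_id
            return True

    def owner(self, job_id):
        with self._lock:
            return self._jobs[job_id]["locked_by"]


def race_for_job(store, job_id, node_ids):
    """Let several simulated nodes race for one job; return the nodes that won."""
    winners = []
    winners_lock = threading.Lock()
    start = threading.Barrier(len(node_ids))

    def worker(node_id):
        start.wait()  # release all simulated nodes at the same moment
        if store.try_claim(job_id, node_id):
            with winners_lock:
                winners.append(node_id)

    threads = [threading.Thread(target=worker, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


if __name__ == "__main__":
    store = JobStore()
    store.add_job("import-cluster")
    winners = race_for_job(store, "import-cluster", ["node-1", "node-2"])
    print("claimed by:", winners)
```

Which node wins the race is not deterministic, but exactly one node should be reported as the owner. If the check and the update in `try_claim` were two separate steps without the lock, both nodes could see the job as new and both would run it, which matches the symptom described above.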

This issue is seen with the following reproducer:
1. Change the tendrl-node-devel repository baseurl to a bad one
2. Start cluster import

Cluster import fails for a while and looks Failed (so the user can run unmanage cluster and import it again, as described in BZ 1570048), but after a few minutes the job continues and finishes successfully. The cluster looks imported in the UI, but tendrl-gluster-integration is not installed on the node with the error. This behaviour is visible in attachments 1455248 and 1455249. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

The PR is under review: https://github.com/Tendrl/commons/pull/1016
Last time I had this new change in my PR, but because of last-minute confusion I removed it from the PR :)

Created attachment 1458054
01 Import cluster fail - new
Created attachment 1458055
02 ImportCluster later passed - new

The issue appeared again when I was trying to import a cluster with 6 nodes and 2 volumes. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

https://bugzilla.redhat.com/show_bug.cgi?id=1571244

@rohan, I need help with this. I tried a lot, but I have no idea why this bug comes up again.

This bug, which appears in the last screenshot, is really hard to reproduce and debug. I fixed this bug in a lot of scenarios, but the issue shown in the last screenshot is really difficult to fix. @rohan and @shubhendu, I need help.

Filip, without clear reproducer steps I am struggling to proceed further on this issue.

I tried many times but I can't find clear reproducer steps for this issue. It happens very rarely. I accidentally found it when importing a cluster with 6 nodes and 2 volumes and with the cluster name set.

Patches have already gone in to fix the issue. Comment https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c12 says that the issue is not seen and moved the bug to 'Verified'. That means the situation has improved. The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20 is not reproducible, and the submitter also commented that it is a rare case. I had a discussion with Martin and we decided to split the BZ into two so that the original BZ can be verified. The new BZ will be fixed when we have a clear procedure to reproduce the issue.

Classifying this as medium severity because errors like this, without a clear root cause, could raise the cost of debugging and problem solving in production.

Based on discussion with Nishanth (as noted in comment 25), I'm limiting the scope of this BZ to the fixes created by Gowtham. QE is expected to verify that the problem described in this BZ is less likely to happen compared to the original report (verifying that the fixes improved the situation significantly).

New BZ 1602858 is created to track effort on:
* figuring out the root cause of the remaining part of this issue
* figuring out a reproducer, or clarifying how likely this problem is to appear
* fixing the problem entirely

I am keeping this BZ in ON QA state and will wait for Filip to discuss whether we can verify this BZ based on previous reports or whether we need to run a new verification. My opinion is that this should be retested unless Filip has high confidence that this is not necessary.

Filip, either retest or mark this BZ as verified based on previous testing, as noted in comment 27.

I tested this again but I was unable to reproduce it. Based on comment 27 I VERIFY this BZ. If I am able to reproduce it again in the future, I will add info to BZ 1602858.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616