Bug 1341525
| Summary: | cluster creation task has failed but cluster seems installed properly | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Storage Console | Reporter: | Martin Kudlej <mkudlej> |
| Component: | core | Assignee: | Nishanth Thomas <nthomas> |
| core sub component: | provisioning | QA Contact: | Martin Bukatovic <mbukatov> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | mbukatov, nthomas |
| Version: | 2 | | |
| Target Milestone: | --- | | |
| Target Release: | 2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhscon-core-0.0.26-1.el7scon.x86_64 rhscon-ceph-0.0.26-1.el7scon.x86_64 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-19 05:34:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1344192 | | |
| Attachments: | | | |

Description
Martin Kudlej
2016-06-01 09:02:55 UTC
It sounds like 1) the root cause needs to be addressed here to determine why the console loses contact with the installation process. There also 2) needs to be some way for a cluster creation definition to be "recovered" if it is lost. In other words, the console should go out and verify the state of the nodes it was attempting to install on and correct errors. It is my understanding that ceph-ansible (and ansible in general) is idempotent, meaning it can be safely run again without doing harm. If it finds items already completed, it should not matter. So my suggestion is to look into ways to simply repeat ansible operations on a node or nodes if we lose contact. This may or may not result in a resolution, but it should be tried first and then fail with info if a second try times out. It may be that we are simply hitting timeouts but the operations are succeeding after the timeout.

1. There are situations where the southbound API requests hang for various reasons. Currently USM doesn't handle this effectively. USM should detect these and time out requests which have been hung for a long time.
2. Detecting the current state and repairing nodes might not be a good option at this point. For example, the disks might already have been partitioned and configuration might already have been done. Cleaning all of this up might not be the job of USM. If the southbound API can provide this functionality, USM can implement the business logic around it. This probably requires more work and may not be feasible for 2.0.
3. Re-attempting the task is another option. USM should cache the request and replay it when requested by the user. I assume that, as Jeff mentioned, ceph-ansible is idempotent. This also requires a good amount of work, with changes both in the UI and in the core.

So is it a good solution to implement option 1 at this point and think about the others post 2.0?

Yes, let's proceed with 1) as suggested.
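For illustration, the "try again, then fail with info if a second try times out" suggestion could look roughly like the sketch below. It is not taken from USM or ceph-ansible: `run_idempotent` and its `operation` argument are hypothetical names, and the sketch assumes the operation raises `TimeoutError` when it gets no answer and is safe to repeat.

```python
def run_idempotent(operation, attempts=2):
    """Run `operation`, repeating it if it times out.

    `operation` is a hypothetical zero-argument callable that raises
    TimeoutError when the call does not answer in time. Repeating it is
    only safe under the assumption that the operation is idempotent.
    """
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError as exc:
            # Remember why this attempt failed so the final error is informative.
            errors.append(f"attempt {attempt}: {exc}")
    # Every attempt timed out: fail and report all collected details.
    raise RuntimeError("operation timed out; " + "; ".join(errors))
```

With the default `attempts=2`, an operation that times out once and then succeeds returns normally, while one that times out twice fails with both timeout messages attached.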
As a fix for this issue, we have introduced a timeout for ceph-installer calls: any call that does not respond within 15 minutes is timed out. That particular call is treated as a failure and we proceed with the next requests.

I still see this, as you can see in the screenshot. How can I recognize whether this is 2) or 3), which will not be fixed in 2.0?

Created attachment 1184668 [details]
failure and success
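A rough sketch of the timeout behaviour described in the fix comment (limit each call, treat a timed-out call as a failure, proceed with the next requests) is given below. This is not the actual rhscon-core code: `run_requests` and the `(name, callable)` request shape are assumptions made for the example, and only the 15-minute limit comes from the comment.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

CALL_TIMEOUT_S = 15 * 60  # the 15-minute limit mentioned in the fix comment


def run_requests(requests, timeout_s=CALL_TIMEOUT_S):
    """Run each (name, callable) pair in turn, with a time limit per call.

    Hypothetical helper: returns a dict mapping each name to
    "ok", "failed" or "timed out".
    """
    results = {}
    for name, call in requests:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call)
        try:
            future.result(timeout=timeout_s)
            results[name] = "ok"
        except FutureTimeout:
            # No answer within the limit: record it as a failure.
            results[name] = "timed out"
        except Exception:
            # The call answered, but with an error.
            results[name] = "failed"
        finally:
            # Do not wait for a hung call; move on to the next request.
            pool.shutdown(wait=False)
    return results
```

Note that a timed-out call is only abandoned, not cancelled. It may still complete later, which would match the symptom in this bug: a task reported as failed while the cluster ends up installed.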
Created attachment 1184673 [details]
log from monitor2
The QE team will check whether this issue happens again during the whole testing phase. If the issue does not appear, we are going to close it. We are going to assume the fix in the sense of option 1 from comment 4.

Based on comment 11, I'm moving this BZ into the VERIFIED state, since the QE team hasn't seen this issue again during our testing phase. The fact that the QE team hasn't reproduced this issue most likely means that the changes made by the dev team made this particular problem either impossible or less likely to happen. That said, the general root cause is not fixed, as noted in comment 4 and as demonstrated by the recent BZ 1364547, which means that a similar issue could still happen.