Description of problem:
I tried to create a cluster and the cluster creation task timed out. However, even though the cluster creation task failed, the Ceph cluster appears to have been installed properly. I think this is not an acceptable state: there should be consistency between the Console and the Ceph cluster. As it stands, the user cannot be sure that the Ceph cluster is installed according to the inputs given in the cluster creation wizard in the Console. My guess is that the cluster creation task in the Console timed out while the calamari task continued, so the cluster got installed anyway. Either way, this is not correct behavior of the Console.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-15.el7scon.noarch
ceph-installer-1.0.11-1.el7scon.noarch
rhscon-ceph-0.0.20-1.el7scon.x86_64
rhscon-core-0.0.21-1.el7scon.x86_64
rhscon-ui-0.0.34-1.el7scon.noarch

How reproducible:
100%

Steps to Reproduce:
1. Try to create a cluster so that the creation task times out.
2. Check whether the Ceph cluster has been created anyway.

Actual results:
The cluster creation task status in the Console is inconsistent with the actual Ceph cluster status. The user cannot be sure whether the Ceph cluster was created according to the inputs given in the cluster creation wizard in the Console.

Expected results:
The cluster creation task status in the Console is consistent with the actual Ceph cluster status.
It sounds like 1) the root cause needs to be addressed here to determine why the Console loses contact with the installation process, and 2) there needs to be some way for a cluster creation definition to be "recovered" if it is lost. In other words, the Console should go out, verify the state of the nodes it was attempting to install on, and correct any errors. It is my understanding that ceph-ansible (and ansible in general) is idempotent, meaning it can be safely run again without doing harm: if it finds items already completed, they do not matter. So my suggestion is to look into ways to simply repeat the ansible operations on a node or nodes if we lose contact (see the sketch after this comment). This may or may not result in a resolution, but it should be tried first, and then fail with information if a second try also times out. It may be that we are simply hitting timeouts while the operations are succeeding after the timeout.
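For illustration only, here is a minimal sketch (not the actual Console/ceph-installer code; the helper name and arguments are hypothetical) of what "repeat the ansible operations if we lose contact" could look like, relying on ansible's idempotency so that a second run skips already-completed work:

    import subprocess

    def run_playbook_with_retry(playbook, inventory, timeout_s=900, retries=1):
        """Run ansible-playbook; retry once if the first attempt times out."""
        for attempt in range(retries + 1):
            try:
                subprocess.run(
                    ["ansible-playbook", "-i", inventory, playbook],
                    check=True,
                    timeout=timeout_s,
                )
                return True
            except subprocess.TimeoutExpired:
                # The run timed out; since ceph-ansible is idempotent it is
                # safe to simply run the same playbook again.
                continue
            except subprocess.CalledProcessError:
                # The playbook itself failed; retrying will not help here.
                return False
        # All attempts timed out: fail with info so the user can investigate.
        return False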
1. There are situations where southbound API requests hang for various reasons. Currently USM doesn't handle this effectively; USM should detect these cases and time out requests that have been hanging for a long time.
2. Detecting the current state and repairing the nodes might not be a good option at this point. For example, the disks might already have been partitioned and the configuration might already have been applied, and cleaning all of this up is arguably not USM's job. If the southbound API can provide this functionality, USM can implement the business logic around it, but that probably requires more work and may not be feasible for 2.0.
3. Re-attempting the task is another option: USM would cache the request and replay it when requested by the user. I assume that, as Jeff mentioned, ceph-ansible is idempotent. This also requires a good amount of work, with changes both in the UI and in the core.
So is it a good solution to implement option 1 at this point and think about the others post 2.0?
Yes, let's proceed with 1) as suggested.
As a fix for this issue, we have introduced a timeout for ceph-installer calls: any call that does not respond within 15 minutes is timed out. That particular call is then treated as a failure and we proceed with the next requests.
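To make the behavior concrete, here is a minimal sketch of the approach (in Python for brevity; this is not the actual Console code, and the endpoint and helper name are hypothetical): an HTTP call to the ceph-installer API that is aborted and treated as a failed task if no response arrives within 15 minutes:

    import requests

    CEPH_INSTALLER_TIMEOUT = 15 * 60  # seconds; matches the 15-minute limit above

    def call_ceph_installer(url, payload):
        """POST to a ceph-installer endpoint, treating a hang as a failure."""
        try:
            # requests' timeout covers connect and read inactivity, which is
            # enough to catch a call that hangs without ever responding.
            resp = requests.post(url, json=payload, timeout=CEPH_INSTALLER_TIMEOUT)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            # Timed out or failed: mark this call as failed and move on
            # to the next request.
            return False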
I still see this issue, as you can see in the attached screenshot. How can I tell whether this is case 2) or 3) from comment 4, which will not be fixed in 2.0?
Created attachment 1184668 [details] failure and success
Created attachment 1184673 [details] log from monitor2
The QE team will be checking whether this issue happens again during the whole testing phase. If the issue does not appear, we are going to close it, assuming the fix in the sense of option 1 from comment 4.
Based on comment 11, I'm moving this BZ into VERIFIED state, since the QE team hasn't seen this issue again during our testing phase. The fact that the QE team hasn't reproduced this issue again most likely means that the changes made by the dev team made this particular problem either impossible or less likely to happen. That said, the general root cause is not fixed, as noted in comment 4 and as demonstrated by the recent BZ 1364547, which means that a similar issue could still happen.