Description of problem:
gluster-blockd intermittently fails to come up on a freshly installed OCP 3.11 + OCS 3.11.2 setup, and the deployment fails as a result. The issue is not hit every time.

Version-Release number of selected component (if applicable):
OCP 3.11 + OCS 3.11.2

How reproducible:
Intermittently (2/3)

Steps to Reproduce:
1. Install OCP 3.11 + OCS 3.11.2 using deploy_cluster.yml
2. Sometimes gluster-blockd fails to come up on one or two of the pods, and the gluster install fails because of this.

Actual results:
gluster-blockd intermittently does not come up on the node.

Expected results:
gluster-blockd should always come up, and the install should not hit any issues.

Additional info:
14:42:51 <kasturi> can you tell me the file path
14:42:56 <kasturi> let me check in other nodes too
14:43:04 <xiubli> if tcmu-runner start up just after gluster-blockd service, then the gluster-blockd will fail
14:43:07 <xiubli> as expected
14:43:27 <xiubli> # /usr/lib/systemd/system/gluster-blockd.service
14:44:14 <kasturi> other nodes as well i see the same thing
14:44:14 <kasturi> sh-4.2# cat /usr/lib/systemd/system/gluster-blockd.service
14:44:14 <kasturi> [Unit]
14:44:14 <kasturi> Description=Gluster block storage utility
14:44:14 <kasturi> Requisite=glusterd.service
14:44:14 <kasturi> Requires=
14:44:14 <kasturi> BindsTo=gluster-block-target.service
14:44:14 <kasturi> After=gluster-block-target.service
14:44:23 <kasturi> where the pods are in 0/1 state
14:45:18 <kasturi> sorry, i see the requires empty in other pods too where they are up and running
14:45:34 <xiubli> yeah,
14:45:48 <kasturi> so, is this a bug then
14:46:00 <kasturi> i do not have a previous setup to compare
14:46:06 <xiubli> ==================
14:46:07 <xiubli> [2019-03-06 09:02:39.555722] ERROR: tcmu-runner not running [at gluster-blockd.c+383 :<blockNodeSanityCheck>]
14:46:13 <kasturi> yes
14:46:18 <kasturi> but it is actually running
14:46:47 <xiubli> Active: active (running) since Wed 2019-03-06 09:02:41 UTC; 13min ago
14:47:09 <kasturi> yes
14:47:12 <xiubli> you can see the logs time is 02:39
14:47:27 <xiubli> and the tcmu-runner's active time is 02:41
14:47:42 <xiubli> means the tcmu-runner start after gluster-blockd
14:47:44 <kasturi> so do you think tcmu is taking some time
14:47:49 <kasturi> to start
14:50:04 <xiubli> yeah
14:50:49 <xiubli> the gluster-blockd will depend on the tcmu-runner service, only after tcmu-runner is up, then the gluster-blockd will be successfully up
14:54:18 <kasturi> okay
14:54:35 <kasturi> do you think the above is a race then ?
14:57:52 <xiubli> yeah
14:58:07 <xiubli> BTW, will this be hit 100% in this pod ?
14:58:41 <kasturi> did not get when you said 100% in this pod
14:58:51 <kasturi> did you want to respin the pod and see if this is happening
15:00:13 <xiubli> yeah, I meant could we reproduce this issue in that pod always when respinning ?
15:00:21 <kasturi> i am not sure
15:00:23 <kasturi> let me try that
15:00:28 <xiubli> okay
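
The ordering argument in the chat above (the gluster-blockd error is logged at 09:02:39, while tcmu-runner only reports active from 09:02:41) can be checked mechanically. The following is a minimal sketch, not part of gluster-block: the helper names are hypothetical, the two timestamps are copied from the log excerpts above, and it assumes both timestamps are in UTC.

```python
from datetime import datetime


def parse_blockd_log_time(stamp: str) -> datetime:
    """Parse the bracketed timestamp of a gluster-blockd log line."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def parse_systemd_active_time(stamp: str) -> datetime:
    """Parse the 'since ...' part of a systemctl status 'Active:' line."""
    # Drop the weekday and timezone tokens; both sides are assumed to be UTC.
    _weekday, date, time, _tz = stamp.split()
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


# Values copied from the chat log excerpts.
blockd_error_time = parse_blockd_log_time("2019-03-06 09:02:39.555722")
tcmu_active_time = parse_systemd_active_time("Wed 2019-03-06 09:02:41 UTC")

# Positive gap: tcmu-runner became active only after gluster-blockd
# had already run its sanity check and logged the error.
gap_seconds = (tcmu_active_time - blockd_error_time).total_seconds()
started_after_error = gap_seconds > 0

print(f"tcmu-runner active {gap_seconds:.3f}s after the gluster-blockd error")
```

A positive gap is consistent with the race described in the chat: the sanity check ran before tcmu-runner had finished starting.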
Hello Prasanna,

Ashmitha has provided the ansible logs in the Bugzilla. I am not sure whether you have already gone through them. Please let us know if they are not enough for debugging. I do not have sosreports collected as of now, but I can collect them when I run the deployment again and provide them.

Thanks,
kasturi
Clearing the needinfo as I have provided the setup to xiubli.
Verification of this bug is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1699209. Also clearing the needinfo on Xiubo Li, since I have already spoken to him about the issue.
Verified the fix in gluster-block-0.2.1-33.el7rhgs.x86_64. Moving the bug to the verified state based on comment 50 and comment 51; I have also not encountered any issue during a fresh install of OCP 3.11.z + OCS 3.11.4. The gluster-block logs from all the gluster pods are copied to the link below. http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/1699209/
*** Bug 1737218 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3256