Bug 1699209 - gluster-blockd fails to come up on a fresh install setup of OCP 3.11 + OCS 3.11.2
Summary: gluster-blockd fails to come up on a fresh install setup of OCP 3.11 + OCS 3...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 3.11.z Batch Update 4
Assignee: Xiubo Li
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1737218
Depends On: 1728645
Blocks:
Reported: 2019-04-12 06:46 UTC by RamaKasturi
Modified: 2019-10-30 12:33 UTC (History)

Fixed In Version: gluster-block-0.2.1-33.el7rhgs
Doc Type: No Doc Update
Doc Text:
Cause: When the upgrade script finishes and exits successfully, the gluster-blockd daemon it started in the background may not have been killed yet. When the gluster-blockd service then tries to start, it hits the following error: ERROR: gluster-blockd is already running...
Consequence: The gluster-blockd service fails to start.
Fix: A gluster-blockd daemon launched in the background may take a while to start, so the upgrade script now always waits up to 5 seconds to make sure the gluster-blockd daemon it invoked has fully exited.
Result: Only one gluster-blockd daemon is ever running, so the failure no longer occurs and the gluster-blockd service starts successfully.
Clone Of:
Environment:
Last Closed: 2019-10-30 12:33:28 UTC
Embargoed:
xiubli: needinfo-
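
The bounded wait described in the Doc Text above can be sketched roughly as follows. This is not the actual upgrade script shipped in gluster-block; the function name wait_until_gone and the pgrep probe in the usage comment are assumptions made for illustration only.

```shell
# wait_until_gone TIMEOUT CMD [ARGS...]   (hypothetical helper, not from gluster-block)
# Polls CMD about once per second. CMD must exit 0 while the daemon is still alive.
# Returns 0 as soon as CMD fails (daemon gone),
# or 1 if CMD still succeeds after TIMEOUT seconds of waiting.
wait_until_gone() {
    local timeout=$1 waited=0
    shift
    while "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use at the end of an upgrade-style script (the probe is an assumption):
# if ! wait_until_gone 5 pgrep -x gluster-blockd; then
#     echo "gluster-blockd still running after 5 seconds" >&2
# fi
```

Passing the liveness probe as a command keeps the loop independent of how the daemon is detected (pid file, pgrep, or something else), which the bug report does not specify.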


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3256 0 None None None 2019-10-30 12:33:33 UTC

Description RamaKasturi 2019-04-12 06:46:50 UTC
Description of problem:

I see that gluster-blockd fails to come up on a freshly installed setup of OCP 3.11 + OCS 3.11.2, and because of this the deployment fails. The issue is not hit every time; we hit it intermittently.


Version-Release number of selected component (if applicable):
OCP 3.11 + OCS 3.11.2

How reproducible:
Intermittently (2/3)

Steps to Reproduce:
1. Install OCP 3.11 + OCS 3.11.2 using deploy_cluster.yml
2. Sometimes gluster-blockd fails to come up on one or two of the pods, and because of this the gluster install fails.

Actual results:
The issue is hit intermittently: gluster-blockd does not come up on the node.


Expected results:
gluster-blockd should always come up, and we should not hit any issues.

Additional info:

14:42:51 <kasturi>      can you tell me the file path
14:42:56 <kasturi>      let me check in other nodes too
14:43:04 <xiubli>       if tcmu-runner starts up just after the gluster-blockd service, then gluster-blockd will fail
14:43:07 <xiubli>       as expected
14:43:27 <xiubli>       # /usr/lib/systemd/system/gluster-blockd.service
14:44:14 <kasturi>      other nodes as well i see the same thing
14:44:14 <kasturi>      sh-4.2# cat /usr/lib/systemd/system/gluster-blockd.service
14:44:14 <kasturi>      [Unit]
14:44:14 <kasturi>      Description=Gluster block storage utility
14:44:14 <kasturi>      Requisite=glusterd.service
14:44:14 <kasturi>      Requires=
14:44:14 <kasturi>      BindsTo=gluster-block-target.service
14:44:14 <kasturi>      After=gluster-block-target.service
14:44:23 <kasturi>      where the pods are in 0/1 state
14:45:18 <kasturi>      sorry, i see the requires empty in other pods too where they are up and running 
14:45:34 <xiubli>       yeah,
14:45:48 <kasturi>      so, is this a bug then
14:46:00 <kasturi>      i do not have a previous setup to compare
14:46:06 <xiubli>       ==================
14:46:07 <xiubli>       [2019-03-06 09:02:39.555722] ERROR: tcmu-runner not running [at gluster-blockd.c+383 :<blockNodeSanityCheck>]
14:46:13 <kasturi>      yes
14:46:18 <kasturi>      but it is actually running
14:46:47 <xiubli>       Active: active (running) since Wed 2019-03-06 09:02:41 UTC; 13min ago
14:47:09 <kasturi>      yes
14:47:12 <xiubli>       you can see the logs time is 02:39
14:47:27 <xiubli>       and the tcmu-runner's active time is 02:41
14:47:42 <xiubli>       means tcmu-runner started after gluster-blockd
14:47:44 <kasturi>      so do you think tcmu is taking some time
14:47:49 <kasturi>      to start
14:50:04 <xiubli>       yeah
14:50:49 <xiubli>       gluster-blockd depends on the tcmu-runner service; only after tcmu-runner is up can gluster-blockd come up successfully
14:54:18 <kasturi>      okay
14:54:35 <kasturi>      do you think the above is a race then ?
14:57:52 <xiubli>       yeah
14:58:07 <xiubli>       BTW, will this be hit 100% in this pod ?
14:58:41 <kasturi>      did not get when you said 100% in this pod
14:58:51 <kasturi>      did you want to respin the pod and see if this is happening
15:00:13 <xiubli>       yeah, I meant can we always reproduce this issue in that pod when respinning it?
15:00:21 <kasturi>      i am not sure
15:00:23 <kasturi>      let me try that
15:00:28 <xiubli>       okay
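
The ordering argument in the chat above rests on two timestamps: the gluster-blockd error logged at 09:02:39 and tcmu-runner reported active since 09:02:41. A small sketch of that comparison follows. It assumes GNU date (for the -d option) and that both timestamps are in UTC; the variable and function names are made up for illustration, and the log timestamp is truncated to whole seconds.

```shell
# Timestamps copied from the chat log (assumed to be UTC, microseconds dropped).
blockd_error="2019-03-06 09:02:39"   # "tcmu-runner not running" error in the gluster-blockd log
tcmu_active="2019-03-06 09:02:41"    # tcmu-runner "active (running) since" time

# to_epoch converts a "YYYY-MM-DD HH:MM:SS" UTC timestamp to seconds since the epoch.
# Requires GNU date.
to_epoch() {
    date -u -d "$1" +%s
}

gap=$(( $(to_epoch "$tcmu_active") - $(to_epoch "$blockd_error") ))
if [ "$gap" -gt 0 ]; then
    echo "tcmu-runner became active ${gap}s after gluster-blockd ran its sanity check"
fi
```

A positive gap means the sanity check in gluster-blockd ran before tcmu-runner was active, which matches the race discussed in the chat.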

Comment 7 RamaKasturi 2019-04-17 10:03:31 UTC
Hello Prasanna,

   Ashmitha has provided the Ansible logs in the Bugzilla. I am not sure whether you have already gone through them. Please let us know if they are not enough for debugging. I do not have sosreports collected as of now, but I can collect them when I run the deployment again and provide them.

Thanks
kasturi

Comment 26 RamaKasturi 2019-05-08 07:40:43 UTC
Clearing the needinfo as I have provided the setup to xiubli.

Comment 46 RamaKasturi 2019-07-10 10:56:44 UTC
Blocked on verifying this bug due to https://bugzilla.redhat.com/show_bug.cgi?id=1699209. Also clearing the needinfo on Xiubo Li since I have already spoken to him about the issue.

Comment 52 RamaKasturi 2019-07-17 13:05:38 UTC
Verified the fix in gluster-block-0.2.1-33.el7rhgs.x86_64.

Moving the bug to verified state based on comment 50 and comment 51. I also have not encountered any issue during a fresh install of OCP 3.11.z + OCS 3.11.4.

All the gluster-block logs from all the gluster pods are copied in the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/1699209/

Comment 55 Bipin Kunal 2019-08-05 08:32:57 UTC
*** Bug 1737218 has been marked as a duplicate of this bug. ***

Comment 80 errata-xmlrpc 2019-10-30 12:33:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3256

