Description of problem: In a node reboot case (for example), if glusterd has not come up before the gluster-blockd service, we see a lot of failures due to the absence of storage. More generally, this can happen in any case where gluster-blockd is brought up before glusterd.
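For reference, a minimal way to hit the same condition without a reboot, using the unit names from this report (the exact error text may differ):

# stop glusterd first, then try to bring gluster-blockd up on its own
systemctl stop glusterd
systemctl restart gluster-blockd   # fails, since the glusterd-side dependency chain is not up
systemctl status gluster-blockd    # unit stays inactive/failed with a dependency error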
Scenario 1:
* Stop glusterd - gluster-blockd goes into inactive state
* Start/restart gluster-blockd - fails with a dependency error (as expected)

Scenario 2:
* Stop gluster-block-target - gluster-blockd again goes into inactive state
* Stop glusterd
* Start/restart gluster-blockd - fails with a dependency error; tcmu-runner, gluster-block-target and glusterd all remain down
* Start glusterd - tcmu-runner, gluster-block-target and gluster-blockd all remain down
* Start gluster-blockd - all the mentioned services come up successfully

Scenario 3:
* Stop glusterd - gluster-blockd goes into inactive state
* Start/restart gluster-blockd - fails with a dependency error (as expected)
* Start glusterd - glusterd comes up; gluster-blockd continues to remain down
* Start gluster-blockd - gluster-blockd comes up

I have tested the above scenarios and multiple permutations of the services. All of them confirm the ordering: gluster-blockd depends on tcmu-runner, tcmu-runner depends on gluster-block-target, which in turn depends on glusterd. (A sketch of that chain follows below.)

One last question before I move this bug to verified, regarding scenario 3, step 3: if glusterd is brought up, are we expecting gluster-blockd to come up automatically? In other words, after the start/restart in step 2, should the gluster-blockd service check for glusterd at regular intervals, so that it can get itself back online as soon as it sees glusterd up? It presently doesn't.

Prasanna/Atin, please ignore the question in comment 7. Keeping the need_info on this bug for the query mentioned above.
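Purely for illustration, the chain verified above maps to unit directives roughly as sketched here. The gluster-blockd fragment matches the unit shown later in this bug; the tcmu-runner and gluster-block-target fragments are only an approximation of the dependencies observed in testing, and the actual shipped units may use different directives:

# gluster-blockd.service (fragment)
[Unit]
BindsTo=tcmu-runner.service rpcbind.service
After=tcmu-runner.service rpcbind.service

# tcmu-runner.service (illustrative fragment)
[Unit]
Requires=gluster-block-target.service
After=gluster-block-target.service

# gluster-block-target.service (illustrative fragment)
[Unit]
Requires=glusterd.service
After=glusterd.service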
Sweta, is it expected that gluster-blockd is started when glusterd is brought back? If there is such a requirement, then we should explore the Wants= or PartOf= options in the systemd units.
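For context, a minimal sketch of what those options could look like as admin drop-ins; the file names and exact directives here are illustrative, not a proposed patch:

# /etc/systemd/system/glusterd.service.d/50-gluster-blockd.conf
# Starting glusterd also pulls gluster-blockd in (start propagation)
# and orders glusterd before it.
[Unit]
Wants=gluster-blockd.service
Before=gluster-blockd.service

# /etc/systemd/system/gluster-blockd.service.d/50-partof.conf
# Stopping or restarting glusterd propagates to gluster-blockd.
# PartOf= does not propagate plain starts, hence the Wants= above.
[Unit]
PartOf=glusterd.service

Run "systemctl daemon-reload" after adding the drop-ins for them to take effect.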
Just to confirm that we understand it right: add "WantedBy=glusterd.service" to the [Install] section of gluster-blockd.service (WantedBy= is an [Install] directive, so systemd would ignore it in [Unit]). The modified unit looks like:

# cat /usr/lib/systemd/system/gluster-blockd.service
[Unit]
Description=Gluster block storage utility
BindsTo=tcmu-runner.service rpcbind.service
After=tcmu-runner.service rpcbind.service

[Service]
Type=simple
Environment="GB_GLFS_LRU_COUNT=5"
Environment="GB_LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/gluster-blockd
ExecStart=/usr/sbin/gluster-blockd --glfs-lru-count $GB_GLFS_LRU_COUNT --log-level $GB_LOG_LEVEL $GB_EXTRA_ARGS
KillMode=process

[Install]
WantedBy=multi-user.target
WantedBy=glusterd.service

That should give you what you are asking for, but we might need a justification for why we would need this.
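A rough sketch of how such a change could be verified once the modified unit is installed; these are standard systemctl invocations and the output details will vary:

systemctl daemon-reload
systemctl enable gluster-blockd
# with WantedBy=glusterd.service in [Install], enabling the unit also creates
# a symlink under glusterd.service.wants/
systemctl list-dependencies glusterd | grep gluster-blockd
systemctl restart glusterd         # gluster-blockd should be pulled back in along with glusterd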
My opinion - it is nice to have. At /this/ stage of the release? We can live with it for now, unless it becomes a bigger problem in CNS.

Karthick/Humble, thoughts? If glusterd goes down, does the entire pod go down? If yes, then we might not hit this scenario at all. If no, then please guide/reply to comment 9.

I will be moving this bug to verified if this is acceptable in the CNS environment. A new bug can be raised (if needed) for the new change.
(In reply to Sweta Anandpara from comment #11)
> My opinion - it is nice to have. At /this/ stage of the release? We can live
> with it for now, unless it becomes a bigger problem in CNS.
>
> Karthick/Humble, thoughts? If glusterd goes down, does the entire pod go
> down? If yes, then we might not hit this scenario at all. If no, then please
> guide/reply to comment 9.

Yes, if glusterd is down, the pod is restarted.

> I will be moving this bug to verified if this is acceptable in the CNS
> environment. A new bug can be raised (if needed) for the new change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2773