Description of problem:
-------------------------
When a storage server in a cluster goes down, the bricks that reside on that server should be shown as down. Currently the status of such bricks remains UP as long as the volume is started.

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.2-0.0.scratch.beta1.el6_4

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster and add two hosts to it.
2. Create a volume with bricks on both servers.
3. Bring down one of the hosts.

Actual results:
The bricks that reside on the server that is down are still shown with UP status.

Expected results:
Bricks on a server that is down should be shown as down.

Additional info:
Hi Sahina,

I have the following observations -

1. Power off one server in a cluster of 4 servers - the server moves to non-responsive, but the bricks are still shown as UP in the UI.

2. Kill glusterd on one of the servers in a cluster of 4 servers - the server moves to non-operational, and the bricks are now shown as DOWN in the UI.

I was expecting the bricks to be shown as DOWN in case 1, as the bricks are actually not usable. In case 2, even though glusterd is not running on the server, the bricks are still usable. So is it right to show the bricks as DOWN?

Bricks that reside on servers that are down, or on servers where glusterd is not running, are not displayed in the output of the 'gluster volume status' command. For example, in a cluster of the following 4 servers:

10.70.37.84  - powered off
10.70.37.132 - glusterd down
10.70.37.64  - up and running
10.70.37.176 - up and running

the following commands were run on 10.70.37.64 -

[root@rhs ~]# gluster volume info dis_vol

Volume Name: dis_vol
Type: Distribute
Volume ID: a7a904f8-b4ca-4ba2-a176-966e4a286fab
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.84:/rhs/brick1/b1
Brick2: 10.70.37.132:/rhs/brick1/b1
Brick3: 10.70.37.64:/rhs/brick1/b1
Brick4: 10.70.37.176:/rhs/brick1/b1
Options Reconfigured:
auth.allow: *
user.cifs: enable
nfs.disable: off

[root@rhs ~]# gluster volume status dis_vol
Status of volume: dis_vol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.64:/rhs/brick1/b1                        49152   Y       12035
Brick 10.70.37.176:/rhs/brick1/b1                       49152   Y       22190
NFS Server on localhost                                 2049    Y       21872
NFS Server on 10.70.37.176                              2049    Y       31865

Task Status of Volume dis_vol
------------------------------------------------------------------------------
There are no active volume tasks
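To illustrate the point above, here is a minimal sketch (not vdsm or engine code; the volume name and the line formats it parses are taken from the output pasted above and may differ on other gluster versions) that compares the bricks a volume is configured with against the bricks that actually report in 'gluster volume status':

#!/usr/bin/env python
# Illustrative sketch only: bricks hosted on a powered-off server, or on a
# server where glusterd is stopped, simply do not appear in the
# 'gluster volume status' output, so their state cannot be read from it.
import re
import subprocess

VOLUME = "dis_vol"  # volume name from the example output above


def configured_bricks(volume):
    # 'gluster volume info' lines look like:
    #   Brick1: 10.70.37.84:/rhs/brick1/b1
    out = subprocess.check_output(
        ["gluster", "volume", "info", volume], universal_newlines=True)
    return set(m.group(1) for m in
               (re.match(r"Brick\d+:\s*(\S+)", line) for line in out.splitlines())
               if m)


def reporting_bricks(volume):
    # 'gluster volume status' lines look like:
    #   Brick 10.70.37.64:/rhs/brick1/b1    49152   Y   12035
    out = subprocess.check_output(
        ["gluster", "volume", "status", volume], universal_newlines=True)
    return set(line.split()[1] for line in out.splitlines()
               if line.startswith("Brick "))


if __name__ == "__main__":
    missing = configured_bricks(VOLUME) - reporting_bricks(VOLUME)
    for brick in sorted(missing):
        print("not reported by 'gluster volume status': " + brick)

Run against the cluster above, this would list the bricks on 10.70.37.84 and 10.70.37.132 as missing from the status output.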
After the non-operational host is brought back up by starting glusterd on it, the brick status is set to UP only as part of the periodic sync job, and not immediately along with the host status changing to UP.
Case 1 - host is non-responsive.
The host has moved to the non-responsive state due to a network error between the engine and the host (this could have many causes, for instance vdsm not running, server powered off, etc.). In this case the brick status can be moved to UNKNOWN, as the engine cannot determine the status unless the sync job returns the brick status from another host.

Proposed flow: if the host is non-responsive, change the brick status to UNKNOWN. If the sync job determines the status from another server, the status will be moved to UP/DOWN.

Case 2 - host is non-operational.
The host moves to non-operational because glusterd is not running. In this case 'gluster volume status' does not return the brick status either, operations on the brick like remove-brick and brick advanced details fail, and the brick is offline for all practical purposes. Hence, moving the brick to DOWN status seems appropriate.
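A minimal, language-neutral sketch of the proposed transitions (the engine itself is written in Java; the function and state names below are hypothetical and only illustrate the flow described above, they are not the real engine API):

# Illustrative sketch only of the proposed brick status transitions.
NON_RESPONSIVE = "NON_RESPONSIVE"    # engine cannot reach the host
NON_OPERATIONAL = "NON_OPERATIONAL"  # host reachable, but glusterd is down


def brick_status_on_host_state_change(host_state):
    """Status to apply to bricks residing on the affected host."""
    if host_state == NON_RESPONSIVE:
        # Case 1: the engine cannot tell whether the brick processes are
        # alive, so mark the bricks UNKNOWN until the sync job learns the
        # real state from another server in the cluster.
        return "UNKNOWN"
    if host_state == NON_OPERATIONAL:
        # Case 2: glusterd is down, brick operations fail, so treat the
        # bricks as DOWN.
        return "DOWN"
    return None  # no change for other host states


def sync_job_update(current_status, status_from_peer):
    """Periodic sync: a peer's 'gluster volume status' resolves UNKNOWN."""
    if current_status == "UNKNOWN" and status_from_peer in ("UP", "DOWN"):
        return status_from_peer
    return current_status

In other words, case 1 leaves the final word to the sync job, while case 2 marks the bricks DOWN immediately.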
Moving this bug to ASSIGNED to take care of Case 1. Dusmant, please confirm the proposed flow.
proposed flow is fine.
I am observing the following behavior when the steps below are performed -

1. On a cluster of 4 nodes, power off one node and stop the network service on another. This causes both these servers to be in the non-responsive state, and the bricks residing on these servers are set to '?' (unknown) status.

2. Bring the powered-off server back up. The bricks residing on this server are supposed to come back up, but did not. This turns out to be because the "gluster volume status" command fails, as per the vdsm logs, due to BZ #1045374.

Will verify this BZ after the above BZ is fixed.
Verified as fixed in Red Hat Storage Console Version: 2.1.2-0.30.el6rhs.
Please review the edited DocText and sign off.
Looks ok
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html