Description of problem:
=======================
Had a two-node cluster with a Distributed volume. After stopping glusterd on one node, removing the volfile directory, and starting glusterd again, the volume status shows the bricks of the node where the volfiles were removed as offline, with error messages in the brick log like the following:

[2016-01-11 11:05:54.547543] E [socket.c:769:__socket_server_bind] 0-socket.glusterfsd: binding to failed: Address already in use
[2016-01-11 11:05:54.547574] E [socket.c:772:__socket_server_bind] 0-socket.glusterfsd: Port is already in use
[2016-01-11 11:05:54.547585] W [rpcsvc.c:1604:rpcsvc_transport_create] 0-rpc-service: listening on transport failed
[2016-01-11 11:05:54.562789] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] 0-mgmt: Volume file changed

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-15

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a two-node cluster (e.g. node-1 and node-2).
2. Create a Distributed volume using bricks from both nodes.
3. Stop glusterd on node-2.
4. Remove the vols directory on node-2, i.e. "rm -rf /var/lib/glusterd/vols".
5. Start glusterd on node-2.
6. Check the volume status on both nodes; the bricks of node-2 will be in the offline state.

Actual results:
===============
After the volume sync, the volume status shows the bricks as offline.

Expected results:
=================
The bricks should be online after the volume sync.

Additional info:
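The "Address already in use" lines in the brick log correspond to the EADDRINUSE errno from bind(2). A minimal standalone sketch (plain Python sockets, not gluster code) showing the same failure mode: a second bind to a port that an existing process is already listening on fails with exactly this errno.

```python
import errno
import socket

# The first socket simulates the brick process that is still listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # let the kernel pick a free port
first.listen(1)
port = first.getsockname()[1]

# A second bind to the same port fails with EADDRINUSE -- the errno
# behind the "Address already in use" message in the brick log.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    bind_errno = 0
except OSError as exc:
    bind_errno = exc.errno

print(bind_errno == errno.EADDRINUSE)  # True: the port is already taken
second.close()
first.close()
```

This is why the restarted brick attempt in the report fails: the original brick process never exited and still owns the port.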
The problem here is that once the sync happened through the handshake, glusterd attempted to start a brick that was already running, which resulted in the "port already in use" failure. We shouldn't have attempted to start the bricks, and here is the reason: brick pidfiles are stored in /var/lib/glusterd/vols/<volname>/run/, which is itself wrong, as we shouldn't store any runtime information in the configuration path. We already have a bug to correct the pidfile path across the Gluster stack; refer to [1]. Because of this, when the configuration folder was deleted, the pidfile was also removed on node-2, so node-2 concluded that the brick process was not running when it actually was, and the restart attempt failed with "port already in use".

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1258561

Reducing the severity to low as we have not seen usage of volume sync by the users/customers.
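The failure chain above can be illustrated with a simplified pidfile liveness check. This is a sketch, not glusterd's actual code: it reads a PID from a file and probes it with signal 0. Once the pidfile is deleted along with the configuration tree, the check wrongly reports the brick as down, which is what triggers the doomed restart.

```python
import os
import tempfile

def brick_is_running(pidfile):
    """Return True if the pidfile names a live process.

    A simplified sketch (not glusterd's actual code) of a pidfile
    liveness check: read the PID, then probe it with signal 0.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing/unreadable pidfile: brick assumed down
    try:
        os.kill(pid, 0)  # signal 0 only tests for process existence
    except ProcessLookupError:
        return False  # stale pidfile: no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True

# Store the pidfile under a temporary "configuration" directory and then
# delete it, mimicking "rm -rf /var/lib/glusterd/vols": the brick (here,
# this very process) is alive, but the check reports it as down, so a
# second start would be attempted and hit "port already in use".
with tempfile.TemporaryDirectory() as confdir:
    pidfile = os.path.join(confdir, "brick.pid")
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))
    print(brick_is_running(pidfile))  # True: pidfile intact
    os.remove(pidfile)
    print(brick_is_running(pidfile))  # False: pidfile gone, brick "lost"
```

The fix direction discussed in [1] is to keep the pidfile out of the configuration path entirely, so deleting or restoring /var/lib/glusterd cannot invalidate the liveness check.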
(In reply to Atin Mukherjee from comment #3)
>
> Reducing the severity to low as we have not seen usage of volume sync by the
> users/customers.

This issue is not limited to volume sync invoked by users/customers; it is all about the correct placement of PID files.

PID files should be maintained in the per-node runtime-state directory, which is /var/run or /var/run/gluster. That is the standard Linux convention.

This will affect users who perform an offline upgrade or an upgrade from an ISO image, where they back up /var/lib/glusterd and restore it post-upgrade.

Based on the above facts, raising the severity to HIGH.
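The placement argument can be sketched concretely. The helper names and exact filename layouts below are hypothetical, for illustration only: the point is that the current pidfile lives under the configuration tree (so a backup/restore or an "rm -rf" of /var/lib/glusterd takes it along), whereas runtime state conventionally belongs under /var/run.

```python
import os

def pidfile_in_config(volname, brick):
    # Current (problematic) placement: runtime state under the
    # configuration tree at /var/lib/glusterd/vols/<volname>/run/,
    # which gets deleted or restored together with the configuration.
    return os.path.join("/var/lib/glusterd/vols", volname, "run",
                        brick + ".pid")

def pidfile_in_run(volname, brick):
    # Placement per the Linux convention argued for above: per-node
    # runtime state under /var/run, never part of a config backup.
    # The "<volname>-<brick>.pid" naming here is illustrative.
    return os.path.join("/var/run/gluster", volname + "-" + brick + ".pid")

print(pidfile_in_config("distvol", "node-2-brick1"))
print(pidfile_in_run("distvol", "node-2-brick1"))
```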
(In reply to SATHEESARAN from comment #6)
> Based on the above facts, raising the severity to HIGH

Missed updating the severity earlier; doing it now.
(In reply to SATHEESARAN from comment #6)
> (In reply to Atin Mukherjee from comment #3)
> >
> > Reducing the severity to low as we have not seen usage of volume sync by the
> > users/customers.
>
> This issue is not related to volume sync invoked by the users/customers, its
> all about correct placement of PID files.
>
> PIDs should be maintained per-node run-state dir which is /var/run or
> /var/run gluster.
>
> That is as per standard linux configuration.

Yes, agreed. Although volume sync is one of the use cases that exposed this problem, the underlying issue is the misplaced pid files.

>
> This will affect the users who go for offline upgrade or upgrade from ISO
> image, where they backup /var/lib/glusterd and restore post upgrade.
>
> Based on the above facts, raising the severity to HIGH

I somewhat disagree here. If you follow the documented offline-upgrade procedure, in which the brick processes are stopped, you would never encounter this. However, the pid files would still be incorrectly placed under the configuration path.
(In reply to Atin Mukherjee from comment #8)
> I kind of disagree here. If we go along with the documented procedures of
> offline upgrade where the brick processes are stopped you'd never encounter
> this. However the incorrect pid files will be placed in the configuration.

Thanks, Atin, for correcting me. I agree. The offline-upgrade and peer-replacement procedures mandate stopping the bricks, so this issue will never be encountered unless a customer/user accidentally removes /var/lib/glusterd while bricks are running.

As you suggested, placing the PID files as per the standard will avoid any unforeseen problems in the future. I suggest prioritizing this for an RHGS 3.1 z-stream release.