Bug 1297634 - [GlusterD]: After volume sync (vol dir), bricks went to offline with error message "glusterfsd: Port is already in use"

Status: CLOSED NEXTRELEASE
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 3.1
Hardware/OS: x86_64 Linux
Priority: low, Severity: high
Assigned To: Satish Mohan
Reporter: Byreddy
Keywords: ZStream

Reported: 2016-01-11 23:39 EST by Byreddy
Modified: 2017-08-17 00:36 EDT
Doc Type: Bug Fix
Last Closed: 2017-08-17 00:36:02 EDT
Type: Bug
Description Byreddy 2016-01-11 23:39:49 EST
Description of problem:
=======================
Had a two-node cluster with a Distributed volume. On one node, I stopped glusterd, removed the vol file directory, and started glusterd again. Checking the volume status afterwards, the bricks of the node where the vol file directory was removed show as offline, with error messages like the following in the brick log:

"""
[2016-01-11 11:05:54.547543] E [socket.c:769:__socket_server_bind] 0-socket.glusterfsd: binding to  failed: Address already in use
[2016-01-11 11:05:54.547574] E [socket.c:772:__socket_server_bind] 0-socket.glusterfsd: Port is already in use
[2016-01-11 11:05:54.547585] W [rpcsvc.c:1604:rpcsvc_transport_create] 0-rpc-service: listening on transport failed
[2016-01-11 11:05:54.562789] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] 0-mgmt: Volume file changed
"""


Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.5-15.

How reproducible:
================
Always


Steps to Reproduce:
===================
1. Create a two-node cluster (e.g. node-1 and node-2).
2. Create a Distributed volume using bricks from both nodes.
3. Stop glusterd on node-2.
4. Remove the vol directory on node-2, i.e. "rm -rf /var/lib/glusterd/vols"
5. Start glusterd on node-2.
6. Check the volume status on both nodes // bricks of node-2 will be in offline state.
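The steps above can be sketched as commands (a rough sketch, not the exact test run: the hostnames, volume name, and brick paths are illustrative, and the commands assume a live two-node cluster):

```shell
# On node-1: create and start a distributed volume with a brick on each node
gluster peer probe node-2
gluster volume create distvol node-1:/bricks/b1 node-2:/bricks/b2
gluster volume start distvol

# On node-2: stop glusterd and remove the volume configuration directory
systemctl stop glusterd
rm -rf /var/lib/glusterd/vols

# On node-2: start glusterd again; the vol files sync back via peer handshake
systemctl start glusterd

# On either node: node-2's brick now shows as offline
gluster volume status distvol
```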

Actual results:
===============
After volume sync, the volume status shows the bricks as offline.


Expected results:
=================
Bricks should be online after volume sync.

Additional info:
Comment 3 Atin Mukherjee 2016-01-11 23:47:52 EST
The problem here is that once the sync happened through handshaking, a start
was attempted for a brick that was already running, which resulted in a
'port already in use' failure. In this case we shouldn't have attempted to
start the bricks, and here is the reason:

Brick pidfiles are stored in /var/lib/glusterd/vols/<volname>/run/, which
is itself wrong, as we shouldn't store any runtime information in the
configuration path. We already have a bug to correct the pidfile path across the Gluster stack; refer to [1]. Because of this, when the configuration
folder was deleted, the pidfile was also removed on node 2. As a result,
node 2 concluded that the brick process was not running when it actually was, and the subsequent start attempt failed with 'port already in use'.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1258561

Reducing the severity to low, as we have not seen volume sync being used by users/customers.
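The liveness check described above can be sketched as follows (a minimal illustration of a pidfile-based "is the brick running?" check, not glusterd's actual code; the path under /tmp is a hypothetical stand-in for the real pidfile location):

```shell
#!/bin/sh
# /tmp/demo-brick.pid stands in for /var/lib/glusterd/vols/<volname>/run/<brick>.pid
PIDFILE=/tmp/demo-brick.pid

echo $$ > "$PIDFILE"                # record a live PID (this shell's own PID)
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "pidfile present and PID alive: skip brick start"
fi

rm -f "$PIDFILE"                    # what 'rm -rf /var/lib/glusterd/vols' did
if [ ! -f "$PIDFILE" ]; then
    echo "pidfile missing: glusterd would (re)start the brick"
    # ...but the old brick process still holds its port -> EADDRINUSE
fi
```

Because the check keys off the pidfile alone, deleting the configuration tree makes a running brick look dead, and the redundant start collides with the port the old process still owns.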
Comment 6 SATHEESARAN 2016-01-12 01:35:38 EST
(In reply to Atin Mukherjee from comment #3)
> 
> Reducing the severity to low as we have not seen usage of volume sync by the
> users/customers.

This issue is not related to volume sync invoked by users/customers; it is all about the correct placement of PID files.

PID files should be maintained in the per-node run-state directory, which is /var/run (or /var/run/gluster).

That is per standard Linux convention.

This will affect users who go for an offline upgrade or an upgrade from an ISO image, where they back up /var/lib/glusterd and restore it post upgrade.

Based on the above facts, raising the severity to HIGH.
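The backup/restore hazard described above can be simulated without a cluster (paths under /tmp are stand-ins for the real /var/lib/glusterd tree; the volume name and PID are made up):

```shell
#!/bin/sh
CONF=/tmp/demo-glusterd
BACKUP=/tmp/demo-glusterd-backup
rm -rf "$CONF" "$BACKUP"            # start from a clean slate

# Runtime state (a pidfile) living inside the configuration path
mkdir -p "$CONF/vols/distvol/run"
echo 12345 > "$CONF/vols/distvol/run/brick.pid"

cp -a "$CONF" "$BACKUP"             # "back up /var/lib/glusterd" carries the pidfile along
rm -rf "$CONF"                      # upgrade wipes/reinstalls the config tree
cp -a "$BACKUP" "$CONF"             # restore post upgrade

cat "$CONF/vols/distvol/run/brick.pid"   # the stale PID is back in place
```

Because the pidfile rides along with the configuration backup, the restored tree reintroduces a PID that no longer corresponds to any live brick, which is exactly why runtime state belongs under /var/run rather than /var/lib/glusterd.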
Comment 7 SATHEESARAN 2016-01-12 01:36:19 EST
(In reply to SATHEESARAN from comment #6)
> Based on the above facts, raising the severity to HIGH

Missed updating the severity in the previous comment.
Comment 8 Atin Mukherjee 2016-01-12 01:54:12 EST
(In reply to SATHEESARAN from comment #6)
> (In reply to Atin Mukherjee from comment #3)
> > 
> > Reducing the severity to low as we have not seen usage of volume sync by the
> > users/customers.
> 
> This issue is not related to volume sync invoked by the users/customers, its
> all about correct placement of PID files.
> 
> PIDs should be maintained per-node run-state dir which is /var/run or
> /var/run gluster.
> 
> That is as per standard linux configuration.
Yes, agreed. Although volume sync is one of the use cases that triggered this problem, the underlying issue is the misplaced pid files.
> 
> This will affect the users who go for offline upgrade or upgrade from ISO
> image, where they backup /var/lib/glusterd and restore post upgrade.
> 
> Based on the above facts, raising the severity to HIGH
I kind of disagree here. If we follow the documented offline upgrade procedure, where the brick processes are stopped, you'd never encounter this. However, the incorrect pid files will still be placed in the configuration path.
Comment 9 SATHEESARAN 2016-01-12 02:08:08 EST
(In reply to Atin Mukherjee from comment #8)

> I kind of disagree here. If we go along with the documented procedures of
> offline upgrade where the brick processes are stopped you'd never encounter
> this. However the incorrect pid files will be placed in the configuration.

Thanks Atin for correcting me.
I agree. The offline upgrade and peer replacement procedures mandate stopping the bricks, so this issue will never be encountered unless the customer/user accidentally removes /var/lib/glusterd while bricks are running.

As you suggested, placing the PID files per standards will avoid any unforeseen problems arising in the future. I suggest prioritizing it for RHGS 3.1 zstream.
