Bug 1260007 - glusterd tries to start before network is online and fails to start on RHGS3.1.1 nodes based on RHEL7 after a reboot
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Duplicates: 1379451
Depends On: 1262231
Blocks:
 
Reported: 2015-09-04 07:08 UTC by RamaKasturi
Modified: 2019-10-10 10:09 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
RHEL-7
Last Closed: 2017-02-08 13:24:12 UTC
Embargoed:



Description RamaKasturi 2015-09-04 07:08:12 UTC
Description of problem:
glusterd fails to start on RHEL 7 based RHGS 3.1.1 nodes after the machine is rebooted. Further debugging has shown that glusterd tries to come up before the network is online, and therefore fails to start.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-14.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the latest RHGS 3.1.1 ISO, based on RHEL 7.1.
2. Create a volume and start it.
3. Reboot the node.

Actual results:
glusterd fails to start once the system comes back up.

Expected results:
glusterd should start successfully.

Additional info:

Comment 3 Kaushal 2015-09-04 10:16:13 UTC
The glusterd systemd unit file is as follows:
```
[Unit]
Description=GlusterFS, a clustered file-system server
After=network.target rpcbind.service
Before=network-online.target

[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid
KillMode=process

[Install]
WantedBy=multi-user.target
```

We see that the unit is set to start after network.target and rpcbind.service, but before network-online.target.

The network and network-online targets are special systemd units whose behaviour is not clear at first glance.

network.target implies only that the networking devices have been brought up, not that the network is actually usable. network-online.target implies that at least one network connection is up.
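
For context: on RHEL 7, network-online.target only delays units ordered after it until the network is really configured if a "wait-online" service is enabled to implement it; with NetworkManager managing the network, that is NetworkManager-wait-online.service. A quick way to check:

```
# Check whether the wait-online implementation is enabled; without it,
# network-online.target can be reached before addresses are configured.
systemctl is-enabled NetworkManager-wait-online.service

# Enable it so units ordered After=network-online.target really wait
# for the network to come up.
systemctl enable NetworkManager-wait-online.service
```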

One would think that glusterd should be brought up after network-online.target instead of before it. But this ordering is set to allow mounts with _netdev to happen correctly: systemd performs _netdev mounts after network-online.target is reached, so glusterd adds the `Before` requirement to ensure such mounts happen only after it has started.

With the latest versions of systemd (I checked with systemd-224), a new mount option, 'x-systemd.requires', is available. It can be used to schedule mounts after a specific service instead of after the general network-online.target. Using this, we could have glusterd start after network-online.target but still have mounts happen after glusterd. This option is not available in RHEL 7, which ships systemd-208.
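
For illustration, a hypothetical fstab entry using that option (the server, volume name, and mount point are made up):

```
# Hypothetical /etc/fstab entry; x-systemd.requires needs a newer
# systemd than the systemd-208 shipped in RHEL 7.
server1:/testvol  /mnt/testvol  glusterfs  defaults,_netdev,x-systemd.requires=glusterd.service  0 0
```

With mounts pinned to glusterd.service like this, the unit's [Unit] section could in principle be reordered, for example:

```
# Sketch only, not the shipped unit file.
[Unit]
Description=GlusterFS, a clustered file-system server
Wants=network-online.target
After=network-online.target rpcbind.service
```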

Comment 5 SATHEESARAN 2015-10-14 06:32:11 UTC
Noticed a similar issue, where glusterd was killed with signum 0, with the following logs:

[2015-10-14 06:05:10.798516] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
[2015-10-14 06:05:10.798765] E [MSGID: 101075] [common-utils.c:3143:gf_is_local_addr] 0-management: error in getaddrinfo: Name or service not known

[2015-10-14 06:05:10.798800] E [MSGID: 106187] [glusterd-store.c:4244:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-10-14 06:05:10.798879] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-10-14 06:05:10.798895] E [MSGID: 101066] [graph.c:326:glusterfs_graph_init] 0-management: initializing translator failed
[2015-10-14 06:05:10.798904] E [MSGID: 101176] [graph.c:672:glusterfs_graph_activate] 0-graph: init failed
[2015-10-14 06:05:10.798993] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
[2015-10-14 06:05:10.801447] E [MSGID: 101075] [common-utils.c:3143:gf_is_local_addr] 0-management: error in getaddrinfo: Name or service not known

pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-10-14 06:05:10
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
[2015-10-14 06:05:10.808636] W [glusterfsd.c:1219:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7faf5d57817d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7faf5d578026] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7faf5d577609] ) 0-: received signum (0), shutting down

Comment 6 SATHEESARAN 2015-10-14 06:33:22 UTC
(In reply to SATHEESARAN from comment #5)
> Noticed a similar issue, where glusterd was killed with SIGNUM 0 with the
> following logs :
> 
This issue was seen in RHGS 3.1.1 based on RHEL 7.1 (glusterfs-3.7.1-16.el7rhgs).

Comment 7 SATHEESARAN 2015-10-28 07:52:47 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1262231 is the upstream BZ for the same issue.

Comment 8 Byreddy 2015-12-08 04:06:54 UTC
This issue was reproduced with RHGS build glusterfs-3.7.5-9.

No core file was found in /var/log/core.

Steps I performed:
=============
1. Created a volume (Distribute type) and started it.
2. Rebooted the node.
3. Checked glusterd status; it was not running.


Glusterd log:
=============

[2015-12-07 11:25:09.930108] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory
[2015-12-07 11:25:10.040640] W [MSGID: 103071] [rdma.c:4592:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2015-12-07 11:25:10.040705] W [MSGID: 103055] [rdma.c:4899:init] 0-rdma.management: Failed to initialize IB Device
[2015-12-07 11:25:10.040724] W [rpc-transport.c:358:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2015-12-07 11:25:10.040894] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2015-12-07 11:25:10.040923] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2015-12-07 11:25:13.278110] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30706
[2015-12-07 11:25:14.015619] E [MSGID: 106187] [glusterd-store.c:4267:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-12-07 11:25:14.015662] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-12-07 11:25:14.015674] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2015-12-07 11:25:14.015680] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2015-12-07 11:25:14.016225] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7f05f730c2fd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7f05f730c1a6] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7f05f730b789] ) 0-: received signum (0), shutting down

Comment 9 Byreddy 2015-12-08 08:43:01 UTC
I am able to reproduce this issue consistently, and it occurs only when I add the node to RHEVM.

I tried the following to confirm, on two newly installed RHEL 7.2 nodes with RHGS 3.1.2 (glusterfs-3.7.5-9).

Node-1:
=====
1. Created a simple Distribute volume using one brick.
2. Started the volume.
3. Rebooted the node several times.
4. After every reboot, glusterd started automatically.

Node-2:
=====
1. Created a simple Distribute volume using one brick.
2. Started the volume.
3. Added the node to RHEVM.
4. Removed it from RHEVM.
5. Rebooted the node several times.
6. After every reboot, glusterd was not coming up automatically (see the diagnostic sketch below).
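
A quick way to confirm the ordering on an affected node (standard systemd tooling; this sketch is illustrative, not part of the original reproduction):

```
# Show glusterd's messages from the current boot, to catch the
# "resolve brick failed in restore" errors at startup.
journalctl -b -u glusterd.service

# Show the ordering-critical chain for glusterd, to check whether it
# started before network-online.target was reached.
systemd-analyze critical-chain glusterd.service
```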

Comment 10 Atin Mukherjee 2016-08-31 04:57:34 UTC
As per https://bugzilla.redhat.com/show_bug.cgi?id=1262231#c7, the issue is not seen on the RHEL 7.2 platform. Can we check whether the issue persists? If not, we can close this BZ.

Comment 11 Oonkwee Lim 2016-09-28 18:28:57 UTC
*** Bug 1379451 has been marked as a duplicate of this bug. ***

Comment 12 Atin Mukherjee 2017-02-08 13:24:12 UTC
I am closing this bug as I have not heard from QE on this for a long time. Kindly reopen if the issue persists.

