Description of problem:
glusterd fails to start on RHEL 7 based RHGS 3.1.1 nodes after the machine is rebooted. Further debugging showed that glusterd tries to come up before the network is up, and so it fails to start.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-14.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the latest RHGS 3.1.1 ISO based on RHEL 7.1.
2. Create a volume and start it.
3. Reboot the node.

Actual results:
glusterd fails to start once the system comes back online.

Expected results:
glusterd should start successfully.

Additional info:
The glusterd systemd unit file is as follows:

```
[Unit]
Description=GlusterFS, a clustered file-system server
After=network.target rpcbind.service
Before=network-online.target

[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid
KillMode=process

[Install]
WantedBy=multi-user.target
```

We see that the unit is set to start after network.target and rpcbind.service, but before network-online.target. The network and network-online targets are special systemd units whose behaviour is not clear at first glance. network.target implies only that the networking devices have been brought up, not that the network is up. network-online.target implies that at least one network is up.

One would think that glusterd should be started after network-online.target instead of before it. But the ordering is set this way to allow mounts with _netdev to happen correctly: systemd performs _netdev mounts after network-online.target is reached, so glusterd uses the `Before=` ordering to ensure that those mounts happen only after it has started. The side effect is that glusterd can start before any network is guaranteed to be up.

With newer versions of systemd (I checked with systemd-224), a new mount option, 'x-systemd.requires', is available. It can be used to schedule a mount after a specific service instead of after the general network-online.target. Using it, we could have glusterd start after network-online.target and still have mounts happen after glusterd. This option is not available in RHEL 7 with systemd-208.
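The ordering argument above can be checked with a toy model. This Python 3.9+ sketch uses the standard-library `graphlib.TopologicalSorter` and encodes each unit's "starts after" edges, recording glusterd's `Before=network-online.target` as the equivalent edge on network-online.target (which is itself ordered after network.target). It is a simplification, not how systemd actually resolves jobs. `mnt-gluster.mount` is a hypothetical _netdev mount, and the ALTERNATIVE graph assumes the mount carries `x-systemd.requires=glusterd.service`, which per systemd.mount(5) adds both `Requires=` and `After=` on the named unit.

```python
"""Toy model of the systemd start ordering discussed above (not systemd itself)."""
from graphlib import TopologicalSorter


def start_order(after):
    """Return one valid start order; `after` maps unit -> units it starts after."""
    return list(TopologicalSorter(after).static_order())


def starts_before(order, first, second):
    """True if `first` appears earlier than `second` in `order`."""
    return order.index(first) < order.index(second)


# Current unit: After=network.target rpcbind.service, Before=network-online.target.
# Before= on glusterd is recorded as network-online.target starting after glusterd.
# A _netdev mount (hypothetical name) is ordered after network-online.target.
CURRENT = {
    "glusterd.service": {"network.target", "rpcbind.service"},
    "network-online.target": {"network.target", "glusterd.service"},
    "mnt-gluster.mount": {"network-online.target"},
}

# Hypothetical alternative: glusterd After=network-online.target, and the mount
# uses x-systemd.requires=glusterd.service (adds Requires= and After=).
ALTERNATIVE = {
    "network-online.target": {"network.target"},
    "glusterd.service": {"network.target", "rpcbind.service", "network-online.target"},
    "mnt-gluster.mount": {"network-online.target", "glusterd.service"},
}

if __name__ == "__main__":
    for name, graph in (("current", CURRENT), ("alternative", ALTERNATIVE)):
        order = start_order(graph)
        print(f"{name}: {' -> '.join(order)}")
        print("  glusterd before network-online.target:",
              starts_before(order, "glusterd.service", "network-online.target"))
        print("  mount after glusterd:",
              starts_before(order, "glusterd.service", "mnt-gluster.mount"))
```

In both graphs the mount starts after glusterd. The difference is that in the current graph glusterd is ordered before network-online.target, so nothing guarantees a working network when it starts. The relative order of network.target and rpcbind.service is not fixed, so the printed chain may vary between runs.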
Noticed a similar issue, where glusterd was killed with SIGNUM 0 with the following logs:

[2015-10-14 06:05:10.798516] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
[2015-10-14 06:05:10.798765] E [MSGID: 101075] [common-utils.c:3143:gf_is_local_addr] 0-management: error in getaddrinfo: Name or service not known
[2015-10-14 06:05:10.798800] E [MSGID: 106187] [glusterd-store.c:4244:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-10-14 06:05:10.798879] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-10-14 06:05:10.798895] E [MSGID: 101066] [graph.c:326:glusterfs_graph_init] 0-management: initializing translator failed
[2015-10-14 06:05:10.798904] E [MSGID: 101176] [graph.c:672:glusterfs_graph_activate] 0-graph: init failed
[2015-10-14 06:05:10.798993] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
[2015-10-14 06:05:10.801447] E [MSGID: 101075] [common-utils.c:3143:gf_is_local_addr] 0-management: error in getaddrinfo: Name or service not known
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-10-14 06:05:10
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
[2015-10-14 06:05:10.808636] W [glusterfsd.c:1219:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7faf5d57817d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7faf5d578026] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7faf5d577609] ) 0-: received signum (0), shutting down
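The failing call in these logs is getaddrinfo: the brick hostnames cannot be resolved yet, so glusterd_resolve_all_bricks fails and initialization of the 'management' volume is aborted. The Python sketch below only illustrates that failure mode and a retry-until-resolvable loop. `wait_for_resolution` is a hypothetical helper, not part of glusterd or systemd, and the attempt count and delay are arbitrary. The resolver and sleep functions are parameters so the loop can be exercised without a network.

```python
"""Sketch: retry hostname resolution instead of failing on the first attempt."""
import socket
import sys
import time


def wait_for_resolution(host, attempts=10, delay=1.0,
                        resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, or False after `attempts` failures."""
    for attempt in range(attempts):
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # e.g. EAI_NONAME ("Name or service not known"), which is what the
            # glusterd logs show while the network is not up yet.
            if attempt + 1 < attempts:
                sleep(delay)
    return False


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    print(host, "resolves:", wait_for_resolution(host, attempts=3))
```

Waiting like this is roughly what ordering a service after network-online.target is meant to provide at the systemd level; the sketch just makes the dependency on name resolution explicit.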
(In reply to SATHEESARAN from comment #5)
> Noticed a similar issue, where glusterd was killed with SIGNUM 0 with the
> following logs:

This issue was seen in RHGS 3.1.1 based on RHEL 7.1 (glusterfs-3.7.1-16.el7rhgs).
https://bugzilla.redhat.com/show_bug.cgi?id=1262231 is the upstream BZ for the same issue.
This issue reproduced with RHGS build glusterfs-3.7.5-9. No core file was found in /var/log/core.

Steps I followed:
=================
1. Created a volume (Distributed type) and started it.
2. Rebooted the node.
3. Checked glusterd status (it was not running).

Glusterd log:
=============
[2015-12-07 11:25:09.930108] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory
[2015-12-07 11:25:10.040640] W [MSGID: 103071] [rdma.c:4592:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2015-12-07 11:25:10.040705] W [MSGID: 103055] [rdma.c:4899:init] 0-rdma.management: Failed to initialize IB Device
[2015-12-07 11:25:10.040724] W [rpc-transport.c:358:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2015-12-07 11:25:10.040894] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2015-12-07 11:25:10.040923] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2015-12-07 11:25:13.278110] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30706
[2015-12-07 11:25:14.015619] E [MSGID: 106187] [glusterd-store.c:4267:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-12-07 11:25:14.015662] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-12-07 11:25:14.015674] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2015-12-07 11:25:14.015680] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2015-12-07 11:25:14.016225] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7f05f730c2fd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7f05f730c1a6] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7f05f730b789] ) 0-: received signum (0), shutting down
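When comparing logs like the two above, it can help to pull out only the error-level (E) entries. The Python sketch below assumes the line layout visible in these logs, `[timestamp] LEVEL [MSGID: n] [file.c:line:function] xlator: message`, with the MSGID field optional; it is a hypothetical helper, not a glusterfs tool. Note that the first E entry in the 3.7.5 log (the rdma listener failure in `init`) is not fatal, since glusterd reports that it is continuing with the succeeded transport; the fatal chain starts at glusterd_resolve_all_bricks.

```python
"""Extract error-level entries from glusterd log lines (layout assumed from the logs above)."""
import re
import sys
from typing import NamedTuple, Optional

LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} [\d:.]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID: (?P<msgid>\d+)\]\s+)?"  # MSGID is absent on some lines
    r"\[(?P<src>[^:\]]+):(?P<lineno>\d+):(?P<func>[^\]]+)\]\s*"
    r"(?P<msg>.*)"
)


class Entry(NamedTuple):
    ts: str
    level: str
    msgid: Optional[str]
    func: str
    msg: str


def parse(lines):
    """Yield an Entry for each line matching the assumed layout; skip the rest."""
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m:
            yield Entry(m["ts"], m["level"], m["msgid"], m["func"], m["msg"])


def errors(lines):
    """Return only the E-level entries, in log order."""
    return [e for e in parse(lines) if e.level == "E"]


SAMPLE = """\
[2015-12-07 11:25:10.040724] W [rpc-transport.c:358:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2015-12-07 11:25:10.040923] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2015-12-07 11:25:14.015619] E [MSGID: 106187] [glusterd-store.c:4267:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-12-07 11:25:14.015674] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
"""

if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            lines = f.readlines()
    else:
        lines = SAMPLE.splitlines()
    for e in errors(lines):
        print(e.ts, e.msgid or "-", e.func, "->", e.msg)
```

Crash-dump lines such as `pending frames:` do not match the pattern and are skipped.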
I am able to reproduce this issue consistently, and it occurs only if I add the node to RHEVM. To confirm, I tried the following on two newly installed RHEL 7.2 nodes with RHGS 3.1.2 (glusterfs-3.7.5-9).

Node-1:
=====
1. Created a simple distributed volume using one brick.
2. Started the volume.
3. Rebooted the node several times.
4. After every reboot, glusterd started automatically.

Node-2:
=====
1. Created a simple distributed volume using one brick.
2. Started the volume.
3. Added the node to RHEVM.
4. Removed it from RHEVM.
5. Rebooted the node several times.
6. After every reboot, glusterd did not come up automatically.
As per https://bugzilla.redhat.com/show_bug.cgi?id=1262231#c7, the issue is not seen on the RHEL 7.2 platform. Can we check whether the issue persists? If not, we can close this BZ.
*** Bug 1379451 has been marked as a duplicate of this bug. ***
I am closing this bug as I have not heard from QE on this for a long time. Kindly reopen if the issue persists.