Description of problem:
Upgrade from older versions 3.12.15 and 5.5 to gluster 6.3-1 fails.
Similar to issue https://bugzilla.redhat.com/show_bug.cgi?id=1684029, which is supposed to be fixed.

Version-Release number of selected component (if applicable): 6.3-1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
How to reproduce:
1) Have a cluster of 3 nodes on 3.12.15.
2) Upgrade the 1st node to 6.3-1; bricks on that node go offline and cannot be brought online until the node is backed out to 3.12.15.
3) Upgrading the 2nd node takes its bricks offline as well, and so on.

The following are a few of the lines reported in /var/log/glusterfs/glusterd.log on the upgraded node:

[root@gfs-bcmt-01-control-02 glusterfs]# cat glusterd.log
[2019-07-08 03:11:18.641072] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:11:18.641072] and [2019-07-08 03:11:18.641729]
[2019-07-08 03:31:32.071964] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:31:32.071964] and [2019-07-08 03:31:32.072051]
[root@gfs-bcmt-01-control-02 glusterfs]#

This is really blocking! We thought release 6.3-1 addressed all the upgrade issues, not that it would fail at the very first step.
Can you run 'glusterd -N --debug' on the node that gave the above output, and share the logs with us?
[root@gfs-bcmt-01-control-02 log]# glusterd -N --debug
[2019-07-08 04:12:38.074980] I [MSGID: 100030] [glusterfsd.c:2849:main] 0-glusterd: Started running glusterd version 6.3 (args: glusterd -N --debug)
[2019-07-08 04:12:38.075332] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 8786
[2019-07-08 04:12:38.075456] D [logging.c:1841:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2019-07-08 04:12:38.078077] D [MSGID: 0] [glusterfsd.c:801:get_volfp] 0-glusterfsd: loading volume file /etc/glusterfs/glusterd.vol
[2019-07-08 04:12:38.082612] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-07-08 04:12:38.083042] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd as working directory
[2019-07-08 04:12:38.083261] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-07-08 04:12:38.084562] D [MSGID: 0] [glusterd.c:458:glusterd_rpcsvc_options_build] 0-glusterd: listen-backlog value: 1024
[2019-07-08 04:12:38.084955] D [rpcsvc.c:2835:rpcsvc_init] 0-rpc-service: RPC service inited.
[2019-07-08 04:12:38.085647] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2019-07-08 04:12:38.085884] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/socket.so
[2019-07-08 04:12:38.090909] D [socket.c:4505:socket_init] 0-socket.management: Configued transport.tcp-user-timeout=42
[2019-07-08 04:12:38.091148] D [socket.c:4523:socket_init] 0-socket.management: Reconfigued transport.keepalivecnt=9
[2019-07-08 04:12:38.091307] D [socket.c:4209:ssl_setup_connection_params] 0-socket.management: SSL support on the I/O path is NOT enabled
[2019-07-08 04:12:38.091460] D [socket.c:4212:ssl_setup_connection_params] 0-socket.management: SSL support for glusterd is NOT enabled
[2019-07-08 04:12:38.091633] D [name.c:575:server_fill_address_family] 0-socket.management: option address-family not specified, defaulting to inet
[2019-07-08 04:12:38.092093] I [socket.c:931:__socket_server_bind] 0-socket.management: process started listening on port (43971)
[2019-07-08 04:12:38.092405] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/rdma.so
[2019-07-08 04:12:38.092622] E [rpc-transport.c:297:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/6.3/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2019-07-08 04:12:38.092777] W [rpc-transport.c:301:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2019-07-08 04:12:38.092926] W [rpcsvc.c:1985:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-07-08 04:12:38.093065] E [MSGID: 106244] [glusterd.c:1785:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-07-08 04:12:38.093563] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc peer, Num: 1238437, Ver: 2, Port: 0
[2019-07-08 04:12:38.093996] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc cli read-only, Num: 1238463, Ver: 2, Port: 0
[2019-07-08 04:12:38.094461] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc mgmt, Num: 1238433, Ver: 2, Port: 0
[2019-07-08 04:12:38.094932] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc mgmt v3, Num: 1238433, Ver: 3, Port: 0
[2019-07-08 04:12:38.095438] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Portmap, Num: 34123456, Ver: 1, Port: 0
[2019-07-08 04:12:38.096060] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Handshake, Num: 14398633, Ver: 2, Port: 0
[2019-07-08 04:12:38.096475] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster MGMT Handshake, Num: 1239873, Ver: 1, Port: 0
[2019-07-08 04:12:38.096614] D [rpcsvc.c:2835:rpcsvc_init] 0-rpc-service: RPC service inited.
[2019-07-08 04:12:38.096865] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2019-07-08 04:12:38.096989] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/socket.so
[2019-07-08 04:12:38.097178] D [socket.c:4465:socket_init] 0-socket.management: disabling nodelay
[2019-07-08 04:12:38.097288] D [socket.c:4505:socket_init] 0-socket.management: Configued transport.tcp-user-timeout=42
[2019-07-08 04:12:38.097427] D [socket.c:4523:socket_init] 0-socket.management: Reconfigued transport.keepalivecnt=9
[2019-07-08 04:12:38.097608] D [socket.c:4209:ssl_setup_connection_params] 0-socket.management: SSL support on the I/O path is NOT enabled
[2019-07-08 04:12:38.097807] D [socket.c:4212:ssl_setup_connection_params] 0-socket.management: SSL support for glusterd is NOT enabled
[2019-07-08 04:12:38.100379] I [socket.c:902:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 10
[2019-07-08 04:12:38.101254] E [socket.c:912:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2019-07-08 04:12:38.101449] E [socket.c:914:__socket_server_bind] 0-socket.management: Port is already in use
[2019-07-08 04:12:38.101640] E [socket.c:3688:socket_listen] 0-socket.management: __socket_server_bind failed;closing socket 9
[2019-07-08 04:12:38.101867] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2019-07-08 04:12:38.102124] D [MSGID: 0] [glusterd.c:1136:glusterd_init_uds_listener] 0-management: Failed to create listener
[2019-07-08 04:12:38.102383] E [MSGID: 106236] [glusterd.c:1160:glusterd_init_uds_listener] 0-management: Failed to start glusterd unix domain socket listener.
[2019-07-08 04:12:38.102574] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-07-08 04:12:38.102867] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-management: initializing translator failed
[2019-07-08 04:12:38.103081] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed
[2019-07-08 04:12:38.103275] D [logging.c:1813:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2019-07-08 04:12:38.103396] D [logging.c:1816:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
[2019-07-08 04:12:38.103841] W [glusterfsd.c:1570:cleanup_and_exit] (-->glusterd(glusterfs_volumes_init+0xfd) [0x558527feadcd] -->glusterd(glusterfs_process_volfp+0x249) [0x558527feacb9] -->glusterd(cleanup_and_exit+0x6b) [0x558527fea06b] ) 0-: received signum (-1), shutting down
[2019-07-08 04:12:38.104044] W [mgmt-pmap.c:133:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload
[root@gfs-bcmt-01-control-02 log]#
Created attachment 1588229 [details]
/var/log/glusterfs directory after running "glusterd -N --debug"
Looks like 'glusterd' is already running successfully on that machine:

> [2019-07-08 04:12:38.101449] E [socket.c:914:__socket_server_bind] 0-socket.management: Port is already in use

What does 'netstat -ntlp' say? Is port 24007 in use, and by which process?
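The port check above can be scripted for reuse across nodes. This is a sketch, not part of the original triage: it only reports whether anything is listening on 24007 (on the affected node you would still run `netstat -ntlp | grep 24007` to see the owning PID), and it assumes `ss` from iproute2 is available.

```shell
# Sketch: report whether anything is already listening on the glusterd port.
# A missing ss binary is treated the same as "nothing found".
PORT=24007
if ss -ltn 2>/dev/null | grep -q ":$PORT\b"; then
    echo "port $PORT is in use"
else
    echo "port $PORT looks free"
fi
```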
Amgad,

I tried upgrading from 3.12.15 to 6.3 and haven't observed any issue with bricks coming up; they are online. You will face https://bugzilla.redhat.com/show_bug.cgi?id=1728126 during an in-service upgrade; the fix for that bug is posted.

Please let me know if I'm missing something.

Thanks,
Sanju
Thanks Amar for the troubleshooting. Three issues were found and need follow-up:
1) Check why glusterd didn't start on port 24007.
2) Check the systemctl restart issue and respond.
3) Check why "gluster volume status" hung from the old servers (on the older release, i.e. 3.12.x).
Amgad,

1. Please share the output of "netstat -ntlp | grep 24007".
2. I didn't understand what the 2nd issue is.
3. The fix for this is already posted upstream: https://review.gluster.org/23014

I also request you to confirm about the in-service upgrade; I haven't observed the bricks-not-coming-up issue. We may have to change the summary of this bug to a more appropriate one.

Thanks,
Sanju
Hi Sanju:

For 1: the output showed a different port (not 24007) every time glusterd was started, until Amar advised adding it explicitly in /etc/glusterfs/glusterd.vol:

...
option transport.socket.listen-port 24007
...

After that it started on 24007. It is supposed to be the default.

For 2: after the failure, we were not able to restart glusterd using systemctl. We get the following error:

[root@gfs ~]# systemctl restart glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@gfs ~]#
[root@gfs-bcmt-1906-control-02 ~]# systemctl -l status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2019-07-12 16:08:43 UTC; 2s ago
  Process: 12324 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 0B (limit: 3.8G)

Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: Unit glusterd.service entered failed state.
Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: glusterd.service failed.
Amgad, I was able to find out the reason for the failure to bind to 24007. The culprit is me :-)

ref: https://review.gluster.org/#/c/glusterfs/+/21249/

With this change, we added an explicit listen-port value of 24007 to the glusterd.vol file. Since we had not had a case of anyone changing glusterd.vol before, in most upgrades the update happened properly: the new glusterd.vol file overwrote the old one. But in your case, glusterd.vol had been edited, so the rpm install did not overwrite it, and that resulted in the newer (glusterfs-6.x+) glusterd process coming up on a different port. So the actual fix is to add the line to the volfile, which we did as a 'workaround' earlier. Please make this change through your scripts (add 'option transport.socket.listen-port 24007' to your glusterd.vol files before the upgrade).

About systemctl, I figured out that because we handled the process outside of cgroups and the other required parameters once it started, it was getting into trouble. A 'systemctl disable glusterd' followed by 'systemctl enable glusterd' is the right way to get back to normalcy.

About 3, I guess Sanju's patch (https://review.gluster.org/23014) should be the fix, which makes all the observed issues addressed.
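The pre-upgrade change described above can be scripted roughly as follows. This is a sketch: it runs against a temporary sample volfile so it is safe to try anywhere; on a real node you would point VOLFILE at /etc/glusterfs/glusterd.vol and run it before the rpm upgrade. The sample volume block contents are illustrative, not copied from any node.

```shell
# Sketch of the pre-upgrade workaround: idempotently pin glusterd to 24007
# by inserting the explicit listen-port option before end-volume.
# Demonstrated on a temporary sample file; requires GNU sed for -i.
VOLFILE=$(mktemp)
cat > "$VOLFILE" <<'EOF'
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
end-volume
EOF

# Only add the option if it is not already present, so reruns are harmless.
if ! grep -q 'option transport.socket.listen-port' "$VOLFILE"; then
    sed -i 's/^end-volume/    option transport.socket.listen-port 24007\nend-volume/' "$VOLFILE"
fi

grep 'listen-port' "$VOLFILE"
```

After editing the real volfile, the systemctl recovery from the comment above ('systemctl disable glusterd' then 'systemctl enable glusterd', then restart) would complete the procedure.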
Amgad,

Looking at Amar's reply, I can say that all the observed issues are addressed. I would like to close this bug as there is nothing pending. Please confirm.

Thanks,
Sanju
Amar: Thanks for the responses.

>> With this change, we actually added listen-port value of 24007 to glusterd.vol file. But as we had not had a case of anyone changing glusterd.vol before, in most of the upgrades, the update happened properly, as new glusterd.vol file was overwriting the old one.

During the upgrade we don't change "glusterd.vol". How does it detect that the existing glusterd.vol was modified during installation of the older release? I'll try to test the systemctl case.

Sanju: I guess if the systemctl workaround works fine, you can close the issue. By the way, in which release will the "https://review.gluster.org/23014" patch be included?
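On the "how does it detect" question, this is standard rpm behavior rather than anything gluster-specific: if the file is packaged as %config(noreplace) (which the behavior described in this thread suggests for glusterd.vol), rpm compares the on-disk file against the checksum recorded at install time, keeps a modified copy in place, and writes the new default alongside it as glusterd.vol.rpmnew. A guarded sketch to inspect this on a node (it just prints a message where rpm or the package is absent):

```shell
# Sketch: show the config flags and modification state of glusterd.vol as rpm
# sees them. 'cn' in the flags column means %config(noreplace); '5' in the
# rpm -V output means the file's checksum differs from the packaged version.
if command -v rpm >/dev/null 2>&1 && rpm -q glusterfs-server >/dev/null 2>&1; then
    rpm -q --qf '[%{FILEFLAGS:fflags} %{FILENAMES}\n]' glusterfs-server | grep glusterd.vol
    rpm -V glusterfs-server | grep glusterd.vol || echo "glusterd.vol unmodified"
else
    echo "glusterfs-server rpm not present on this host; nothing to inspect"
fi
```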
Amgad,

It will be part of release 6.4.

I'm closing this bug. If you happen to see that the systemctl case is not working fine, please raise a new bug.

Thanks,
Sanju