Bug 1727682 - Upgrade from older versions 3.12.15 and 5.5 to gluster 6.3-1 fails - bricks become offline
Summary: Upgrade from older versions 3.12.15 and 5.5 to gluster 6.3-1 fails - bricks become offline
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-08 03:27 UTC by Amgad
Modified: 2019-07-18 07:42 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-18 07:42:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
/var/log/glusterfs - directory after running: "glusterd -N --debug" (65.52 KB, application/gzip)
2019-07-08 04:21 UTC, Amgad
no flags Details

Description Amgad 2019-07-08 03:27:08 UTC
Description of problem:

Upgrade from older versions 3.12.15 and 5.5 to gluster 6.3-1 fails
Similar to issue https://bugzilla.redhat.com/show_bug.cgi?id=1684029, which was supposed to be fixed.

Version-Release number of selected component (if applicable):
6.3-1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Amgad 2019-07-08 03:35:12 UTC
How to reproduce:

1) Have a cluster of 3 nodes on 3.12.15
2) Upgrade the 1st node to 6.3-1; the bricks on that node go offline and can't be brought online until the node is backed out to 3.12.15 (a quick status check is sketched after this list)
3) If the 2nd node is upgraded, the bricks on that node go offline as well, etc.
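
For reference, a rough way to run the check mentioned in step 2 on the upgraded node; the volume name 'testvol' below is only a placeholder for whatever volumes exist in the cluster:

gluster --version                  # confirm the node is actually on 6.3-1
gluster peer status                # peers should still show as connected
gluster volume status testvol      # when this bug hits, bricks on the upgraded node show "N" in the Online column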

The following are the few lines reported in /var/log/glusterfs/glusterd.log on the upgraded node:

[root@gfs-bcmt-01-control-02 glusterfs]# cat glusterd.log    
[2019-07-08 03:11:18.641072] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:11:18.641072] and [2019-07-08 03:11:18.641729]
[2019-07-08 03:31:32.071964] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api
The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.3/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2019-07-08 03:31:32.071964] and [2019-07-08 03:31:32.072051]
[root@gfs-bcmt-01-control-02 glusterfs]# 


This is really blocking! We thought release 6.3-1 addressed all the upgrade issues, not that it would fail at the very first step!

Comment 2 Amar Tumballi 2019-07-08 03:59:39 UTC
Can you run 'glusterd -N --debug' on the node which gave the above output, and share the logs with us?
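
For context, '-N' keeps glusterd running in the foreground instead of daemonizing, and '--debug' raises the log level to DEBUG and writes it to the console, so one simple way to capture everything is:

glusterd -N --debug 2>&1 | tee /tmp/glusterd-debug.log    # the /tmp path is just an example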

Comment 3 Amgad 2019-07-08 04:14:32 UTC
[root@gfs-bcmt-01-control-02 log]# glusterd -N --debug
[2019-07-08 04:12:38.074980] I [MSGID: 100030] [glusterfsd.c:2849:main] 0-glusterd: Started running glusterd version 6.3 (args: glusterd -N --debug)
[2019-07-08 04:12:38.075332] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 8786
[2019-07-08 04:12:38.075456] D [logging.c:1841:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2019-07-08 04:12:38.078077] D [MSGID: 0] [glusterfsd.c:801:get_volfp] 0-glusterfsd: loading volume file /etc/glusterfs/glusterd.vol
[2019-07-08 04:12:38.082612] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-07-08 04:12:38.083042] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd as working directory
[2019-07-08 04:12:38.083261] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-07-08 04:12:38.084562] D [MSGID: 0] [glusterd.c:458:glusterd_rpcsvc_options_build] 0-glusterd: listen-backlog value: 1024
[2019-07-08 04:12:38.084955] D [rpcsvc.c:2835:rpcsvc_init] 0-rpc-service: RPC service inited.
[2019-07-08 04:12:38.085647] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2019-07-08 04:12:38.085884] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/socket.so
[2019-07-08 04:12:38.090909] D [socket.c:4505:socket_init] 0-socket.management: Configued transport.tcp-user-timeout=42
[2019-07-08 04:12:38.091148] D [socket.c:4523:socket_init] 0-socket.management: Reconfigued transport.keepalivecnt=9
[2019-07-08 04:12:38.091307] D [socket.c:4209:ssl_setup_connection_params] 0-socket.management: SSL support on the I/O path is NOT enabled
[2019-07-08 04:12:38.091460] D [socket.c:4212:ssl_setup_connection_params] 0-socket.management: SSL support for glusterd is NOT enabled
[2019-07-08 04:12:38.091633] D [name.c:575:server_fill_address_family] 0-socket.management: option address-family not specified, defaulting to inet
[2019-07-08 04:12:38.092093] I [socket.c:931:__socket_server_bind] 0-socket.management: process started listening on port (43971)
[2019-07-08 04:12:38.092405] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/rdma.so
[2019-07-08 04:12:38.092622] E [rpc-transport.c:297:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/6.3/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2019-07-08 04:12:38.092777] W [rpc-transport.c:301:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2019-07-08 04:12:38.092926] W [rpcsvc.c:1985:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-07-08 04:12:38.093065] E [MSGID: 106244] [glusterd.c:1785:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-07-08 04:12:38.093563] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc peer, Num: 1238437, Ver: 2, Port: 0
[2019-07-08 04:12:38.093996] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc cli read-only, Num: 1238463, Ver: 2, Port: 0
[2019-07-08 04:12:38.094461] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc mgmt, Num: 1238433, Ver: 2, Port: 0
[2019-07-08 04:12:38.094932] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GlusterD svc mgmt v3, Num: 1238433, Ver: 3, Port: 0
[2019-07-08 04:12:38.095438] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Portmap, Num: 34123456, Ver: 1, Port: 0
[2019-07-08 04:12:38.096060] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Handshake, Num: 14398633, Ver: 2, Port: 0
[2019-07-08 04:12:38.096475] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster MGMT Handshake, Num: 1239873, Ver: 1, Port: 0
[2019-07-08 04:12:38.096614] D [rpcsvc.c:2835:rpcsvc_init] 0-rpc-service: RPC service inited.
[2019-07-08 04:12:38.096865] D [rpcsvc.c:2337:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2019-07-08 04:12:38.096989] D [rpc-transport.c:293:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/6.3/rpc-transport/socket.so
[2019-07-08 04:12:38.097178] D [socket.c:4465:socket_init] 0-socket.management: disabling nodelay
[2019-07-08 04:12:38.097288] D [socket.c:4505:socket_init] 0-socket.management: Configued transport.tcp-user-timeout=42
[2019-07-08 04:12:38.097427] D [socket.c:4523:socket_init] 0-socket.management: Reconfigued transport.keepalivecnt=9
[2019-07-08 04:12:38.097608] D [socket.c:4209:ssl_setup_connection_params] 0-socket.management: SSL support on the I/O path is NOT enabled
[2019-07-08 04:12:38.097807] D [socket.c:4212:ssl_setup_connection_params] 0-socket.management: SSL support for glusterd is NOT enabled
[2019-07-08 04:12:38.100379] I [socket.c:902:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 10
[2019-07-08 04:12:38.101254] E [socket.c:912:__socket_server_bind] 0-socket.management: binding to  failed: Address already in use
[2019-07-08 04:12:38.101449] E [socket.c:914:__socket_server_bind] 0-socket.management: Port is already in use
[2019-07-08 04:12:38.101640] E [socket.c:3688:socket_listen] 0-socket.management: __socket_server_bind failed;closing socket 9
[2019-07-08 04:12:38.101867] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2019-07-08 04:12:38.102124] D [MSGID: 0] [glusterd.c:1136:glusterd_init_uds_listener] 0-management: Failed to create listener
[2019-07-08 04:12:38.102383] E [MSGID: 106236] [glusterd.c:1160:glusterd_init_uds_listener] 0-management: Failed to start glusterd unix domain socket listener.
[2019-07-08 04:12:38.102574] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-07-08 04:12:38.102867] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-management: initializing translator failed
[2019-07-08 04:12:38.103081] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed
[2019-07-08 04:12:38.103275] D [logging.c:1813:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2019-07-08 04:12:38.103396] D [logging.c:1816:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
[2019-07-08 04:12:38.103841] W [glusterfsd.c:1570:cleanup_and_exit] (-->glusterd(glusterfs_volumes_init+0xfd) [0x558527feadcd] -->glusterd(glusterfs_process_volfp+0x249) [0x558527feacb9] -->glusterd(cleanup_and_exit+0x6b) [0x558527fea06b] ) 0-: received signum (-1), shutting down
[2019-07-08 04:12:38.104044] W [mgmt-pmap.c:133:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload
[root@gfs-bcmt-01-control-02 log]#

Comment 4 Amgad 2019-07-08 04:21:19 UTC
Created attachment 1588229 [details]
/var/log/glusterfs - directory after running: "glusterd -N --debug"

Comment 5 Amar Tumballi 2019-07-08 04:24:55 UTC
Looks like 'glusterd' is successfully running in that machine.

> [2019-07-08 04:12:38.101449] E [socket.c:914:__socket_server_bind] 0-socket.management: Port is already in use

Comment 6 Amar Tumballi 2019-07-08 04:26:16 UTC
Looks like 'glusterd' is successfully running in that machine.

> [2019-07-08 04:12:38.101449] E [socket.c:914:__socket_server_bind] 0-socket.management: Port is already in use

What does 'netstat -ntlp' say? Is port 24007 in use, and by which process?
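
A minimal sketch of that check (ss is an alternative if net-tools is not installed):

netstat -ntlp | grep 24007     # shows the PID/program name bound to the glusterd port, if any
ss -ntlp | grep 24007          # same information on systems that only ship iproute2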

Comment 7 Sanju 2019-07-09 08:26:55 UTC
Amgad,

I tried upgrading from 3.12.15 to 6.3 and I haven't observed any issue with bricks coming up. They are online.

You will face https://bugzilla.redhat.com/show_bug.cgi?id=1728126 during an in-service upgrade. A fix for that bug has been posted.

Please let me know if I'm missing something.

Thanks,
Sanju

Comment 8 Amgad 2019-07-09 14:06:40 UTC
Thanks, Amar, for the troubleshooting. Three issues were found and need follow-up:

          1) Check why glusterd didn't start on 24007
          2) systemctl issue check and respond
          3) Check why "gluster volume status" hung from old servers (on older release, i.e. 3.12.x)

Comment 9 Sanju 2019-07-11 02:25:16 UTC
Amgad, 
1. Please share the output of "netstat -ntlp | grep 24007"
2. I didn't understand what the 2nd issue is.
3. A fix for this is already posted upstream: https://review.gluster.org/23014

I also request you to confirm the in-service upgrade behavior; I haven't observed the issue of bricks not coming up. We may have to change the summary of this bug to a more appropriate one.

Thanks,
Sanju

Comment 10 Amgad 2019-07-12 16:10:02 UTC
Hi Sanju:

For 1. The output showed a different port every time glusterd was started (not 24007), until Amar advised adding it explicitly in /etc/glusterfs/glusterd.vol:
...
     option transport.socket.listen-port 24007
...
Then it started on 24007. It's supposed to be the default.
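
For reference, a minimal sketch of what the management stanza in /etc/glusterfs/glusterd.vol looks like with that line added; the surrounding options shown here are typical defaults and may differ per installation:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.listen-port 24007
end-volume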
       

For 2. After the failure, we were not able to restart glusterd using systemctl.
        We got the following error:

[root@gfs ~]# systemctl restart glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@gfs ~]# 

[root@gfs-bcmt-1906-control-02 ~]# systemctl -l  status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2019-07-12 16:08:43 UTC; 2s ago
  Process: 12324 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 0B (limit: 3.8G)

Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: Unit glusterd.service entered failed state.
Jul 12 16:08:43 gfs-bcmt-1906-control-02 systemd[1]: glusterd.service failed.

Comment 11 Amar Tumballi 2019-07-13 03:58:58 UTC
Amgad, I was able to find out the reason for the failure to bind to 24007. The culprit is me :-) ref: https://review.gluster.org/#/c/glusterfs/+/21249/

With this change, we actually added a listen-port value of 24007 to the glusterd.vol file. But since we had never had a case of anyone changing glusterd.vol before, in most upgrades the update happened properly, as the new glusterd.vol file overwrote the old one.

But in your case, because you had edited the glusterd.vol file, the rpm install did not overwrite it, and that resulted in the latest (glusterfs-6.x+) glusterd process coming up on a different port. So the actual fix is to add the line to the volfile, which we did as a 'workaround' earlier. Please make this change through your scripts (add 'option transport.socket.listen-port 24007' to your glusterd.vol files before the upgrade).
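
A minimal sketch of that pre-upgrade script change, assuming the stock volfile layout (adjust to however your configuration is managed; the inserted line has no indentation, which the volfile parser does not require):

# add the listen-port option to a locally modified glusterd.vol if it is not there yet
if ! grep -q 'transport.socket.listen-port' /etc/glusterfs/glusterd.vol; then
    sed -i '/^end-volume/i option transport.socket.listen-port 24007' /etc/glusterfs/glusterd.vol
fi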


About systemctl, I figured out that because we handled the process outside of the cgroups and other required parameters once it started, it was getting into problems. A 'systemctl disable glusterd' followed by 'systemctl enable glusterd' is the right way to get back to normalcy.
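
A minimal sketch of that reset, assuming the stock glusterd unit file:

systemctl disable glusterd
systemctl enable glusterd
systemctl daemon-reload        # pick up the refreshed unit state
systemctl start glusterd
systemctl -l status glusterd   # confirm the service is active and stays up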

About 3, I guess Sanju's patch (linked above) should be the option, which addresses all the observed issues.

Comment 12 Sanju 2019-07-15 13:22:33 UTC
Amgad,

Looking at Amar's reply, I can say that all the issues observed are addressed. I would like to close this bug as there is nothing pending. Please confirm.

Thanks,
Sanju

Comment 13 Amgad 2019-07-17 14:25:15 UTC
Amar:

Thanks for the responses - 

>> With this change, we actually added a listen-port value of 24007 to the glusterd.vol file. But since we had never had a case of anyone changing glusterd.vol before, in most upgrades the update happened properly, as the new glusterd.vol file overwrote the old one.

During the upgrade we don't change "glusterd.vol". How does it detect that the existing glusterd.vol was modified since the installation of the older release?
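
For what it's worth, a hedged way to check this from rpm itself, assuming the glusterfs-server package owns the file: rpm compares the on-disk file against the checksum it recorded at install time, and a config file that differs is left in place rather than overwritten.

rpm -qc glusterfs-server | grep glusterd.vol   # confirm glusterd.vol is tracked as a config file
rpm -V glusterfs-server | grep glusterd.vol    # a '5' in the verify flags means the file was locally modified
ls /etc/glusterfs/glusterd.vol.rpmnew          # may exist if the packaged default was saved alongside instead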

I'll try to test the systemctl case.

Sanju:
I guess if the systemctl workaround works fine, you can close the issue.

By the way, in which release will the https://review.gluster.org/23014 patch be included?

Comment 14 Sanju 2019-07-18 07:42:24 UTC
Amgad, it will be part of release 6.4.

I'm closing this bug. If you happen to see that the systemctl case is not working, please raise a new bug.

Thanks,
Sanju

