Bug 1410132
Summary: | [NFS-Ganesha] Linux kernel untar failed after add-brick | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Prasad Desala <tdesala> |
Component: | nfs-ganesha | Assignee: | Kaleb KEITHLEY <kkeithle> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | surabhi <sbhaloth> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.2 | CC: | amukherj, dang, ffilz, jthottan, mbenjamin, nbalacha, pkarampu, ravishankar, rgowdapp, rhs-bugs, sankarshan, skoduri, smohan, spalai, sraj, storage-qa-internal, tdesala |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | rhgs-3.3.0 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-08-21 12:45:32 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Prasad Desala
2017-01-04 14:41:35 UTC
Thanks Prasad. Even with the original volume, we are unable to reproduce the issue. Looking at the logs/sosreport provided -

tar: linux-4.4.1/arch/arm/mach-pxa/colibri-pxa3xx.c: Cannot open: Invalid argument
tar: linux-4.4.1/arch/arm/mach-pxa/corgi.c: Cannot open: Invalid argument
tar: linux-4.4.1/arch/arm/mach-pxa/corgi_pm.c: Cannot open: Invalid argument

In gfapi.log, I see

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/Prasad/1410132/ganesha-gfapi/server-1/ganesha-gfapi.log

[2017-01-04 13:03:40.845609] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 6-distrep-dht: Found anomalies in /1/linux-4.4.1/arch/arm (gfid = 8b3b468c-2c5a-4438-9147-2d1510d1120a). Holes=3 overlaps=0
[2017-01-04 13:03:40.845645] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 6-distrep-dht: Directory selfheal failed: 3 subvolumes down.Not fixing. path = /1/linux-4.4.1/arch/arm, gfid = 8b3b468c-2c5a-4438-9147-2d1510d1120a
[2017-01-04 13:03:40.845718] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 6-distrep-dht: no subvolume for hash (value) = 3001413547
[2017-01-04 13:03:40.845912] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 6-distrep-replicate-1: no subvolumes up
[2017-01-04 13:03:40.846116] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 6-distrep-replicate-3: no subvolumes up
[2017-01-04 13:03:40.846288] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 6-distrep-replicate-5: no subvolumes up
[2017-01-04 13:03:40.847437] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 6-distrep-dht: Found anomalies in /1/linux-4.4.1/arch/arm/mach-pxa (gfid = ce4eabb8-607f-4f26-a549-d67dd6f253dd). Holes=3 overlaps=0
[2017-01-04 13:03:40.847501] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 6-distrep-dht: Directory selfheal failed: 3 subvolumes down.Not fixing. path = /1/linux-4.4.1/arch/arm/mach-pxa, gfid = ce4eabb8-607f-4f26-a549-d67dd6f253dd
[2017-01-04 13:03:40.847948] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 6-distrep-replicate-1: no subvolumes up

Susant confirmed that when there are network disconnects, directory healing doesn't happen. From the newly-added brick log (bricks-brick5-b5.log) -

[2017-01-04 13:03:44.445721] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-distrep-server: disconnecting connection from dhcp47-155.lab.eng.blr.redhat.com-6911-2017/01/04-13:03:26:328233-distrep-client-12-0-0
[2017-01-04 13:03:44.445912] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-distrep-server: Shutting down connection dhcp47-155.lab.eng.blr.redhat.com-6911-2017/01/04-13:03:26:328233-distrep-client-12-0-0
[2017-01-04 13:03:44.457352] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-distrep-server: disconnecting connection from dhcp46-101.lab.eng.blr.redhat.com-7157-2017/01/04-13:03:26:612690-distrep-client-12-0-0
[2017-01-04 13:03:44.457529] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-distrep-server: Shutting down connection dhcp46-101.lab.eng.blr.redhat.com-7157-2017/01/04-13:03:26:612690-distrep-client-12-0-0
[2017-01-04 13:03:44.560467] I [login.c:76:gf_auth] 0-auth/login: allowed user names: dfec5104-52aa-44a6-95d1-79dfcd333c61
[2017-01-04 13:03:44.560522] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-distrep-server: accepted client from dhcp47-155.lab.eng.blr.redhat.com-13639-2017/01/03-10:05:27:67298-distrep-client-12-6-0 (version: 3.8.4)
[2017-01-04 13:03:44.668517] I [login.c:76:gf_auth] 0-auth/login: allowed user names: dfec5104-52aa-44a6-95d1-79dfcd333c61
[2017-01-04 13:03:44.668560] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-distrep-server: accepted client from dhcp47-167.lab.eng.blr.redhat.com-32528-2017/01/04-13:03:27:935078-distrep-client-12-0-0 (version: 3.8.4)
[2017-01-04 13:03:44.708822] I [login.c:76:gf_auth] 0-auth/login: allowed user names: dfec5104-52aa-44a6-95d1-79dfcd333c61
[2017-01-04 13:03:44.708868] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-distrep-server: accepted client from dhcp47-167.lab.eng.blr.redhat.com-7769-2017/01/03-10:05:27:68091-distrep-client-12-6-0 (version: 3.8.4)
[2017-01-04 13:03:44.778648] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-distrep-server: disconnecting connection from dhcp47-167.lab.eng.blr.redhat.com-32528-2017/01/04-13:03:27:935078-distrep-client-12-0-0
[2017-01-04 13:03:44.778866] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-distrep-server: Shutting down connection dhcp47-167.lab.eng.blr.redhat.com-32528-2017/01/04-13:03:27:935078-distrep-client-12-0-0

There are disconnect messages. So most probably the directory heal didn't happen because of the network disconnects, which then resulted in these errors on the console.

However, from the logs there seems to be another issue -

[2017-01-04 13:25:57.756641] W [MSGID: 108019] [afr-lk-common.c:1064:afr_log_entry_locks_failure] 6-distrep-replicate-6: Unable to obtain sufficient blocking entry locks on at least one child while attempting MKDIR on {pgfid:b314cb37-f8a5-447a-a66c-147191de23ef, name:io}.
[2017-01-04 13:25:57.760166] I [MSGID: 108019] [afr-transaction.c:1870:afr_post_blocking_entrylk_cbk] 6-distrep-replicate-6: Blocking entrylks failed.

If I do a lookup on the parent gfid directory at the backend -

[root@dhcp46-42 gluster]# ls /bricks/brick5/b5/.glusterfs/b3/
00/ 05/ 06/ 15/ 29/ 2e/ 65/ 71/ 80/ 9c/ e2/ f4/

There is no such directory created on the newly added brick - b5

[root@dhcp46-42 gluster]# ls /bricks/brick7/b7/.glusterfs/b3/1
10/ 12/ 13/ 14/ 15/ 16/ 1a/ 1c/ 1e/ 1f/
[root@dhcp46-42 gluster]# ls /bricks/brick7/b7/.glusterfs/b3/14/
b314cb37-f8a5-447a-a66c-147191de23ef

However, that directory is present in the earlier bricks. Checked with Ravi on this.
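
The manual `ls` check above relies on the backend layout visible in the brick7 listing: the gfid entry sits under `.glusterfs/<first two hex characters>/<next two>/<full gfid>`. A minimal Python sketch of the same check across several bricks follows. The helper names `gfid_backend_path` and `bricks_missing_gfid` are hypothetical (they are not part of any Gluster tooling), and the brick roots in the usage note are simply the paths quoted in this report.

```python
import os
from pathlib import Path


def gfid_backend_path(brick_root, gfid):
    """Build the .glusterfs path for a gfid, following the layout seen in
    the listing above: .glusterfs/<chars 0-1>/<chars 2-3>/<gfid>."""
    g = gfid.lower()
    return Path(brick_root) / ".glusterfs" / g[0:2] / g[2:4] / g


def bricks_missing_gfid(brick_roots, gfid):
    """Return the brick roots that have no backend entry for the gfid.

    lexists() is used so that an entry is counted as present even if it
    is a symlink whose target cannot be resolved.
    """
    return [
        root
        for root in brick_roots
        if not os.path.lexists(gfid_backend_path(root, gfid))
    ]
```

Run on the brick host, `bricks_missing_gfid(["/bricks/brick5/b5", "/bricks/brick7/b7"], "b314cb37-f8a5-447a-a66c-147191de23ef")` would list the bricks on which the parent directory has not been created or healed yet.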
If the directory heal hasn't happened on replicate-6 subvol, we are not sure why the fops(/entrylk) are sent to AFR in the first place. Request dht team to take a look and provide comments.

As suggested by Susant doing 'ls -R' on mountpoint seemed to have healed the directories on all the bricks now.

When Soumya pointed to this setup, there were network disconnections. If the network disconnection happens once the mkdir completes on all the bricks, but before layout setting on them, then there will be holes and tar operations will fail.

Soumya/Prasad, Is this reproduced all the time? Ideally without any network disconnect, we should not hit this bug.

Raghavendra and myself have looked at the logs -

[root@dhcp46-42 ~]# grep -i "14-distrep" /var/log/ganesha-gfapi.log-20170109 | grep -v "2017-01-05" | grep -E "client-6|client-7" | grep -v "Connection refused" | grep -v " changing port"
[2017-01-06 06:43:49.281871] I [MSGID: 114020] [client.c:2356:notify] 14-distrep-client-6: parent translators are ready, attempting connect on transport
[2017-01-06 06:43:49.285108] I [MSGID: 114020] [client.c:2356:notify] 14-distrep-client-7: parent translators are ready, attempting connect on transport
[2017-01-06 06:43:49.810107] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 14-distrep-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-01-06 06:43:49.816029] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 14-distrep-client-7: Connected to distrep-client-7, attached to remote volume '/bricks/brick8/b8'.
[2017-01-06 06:43:49.816058] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 14-distrep-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2017-01-06 06:43:49.816192] I [MSGID: 108005] [afr-common.c:4654:afr_notify] 14-distrep-replicate-3: Subvolume 'distrep-client-7' came back up; going online.
[2017-01-06 06:43:49.843843] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 14-distrep-client-7: Server lk version = 1
[2017-01-06 06:43:49.846150] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 14-distrep-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-01-06 06:43:49.849027] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 14-distrep-client-6: Connected to distrep-client-6, attached to remote volume '/bricks/brick8/b8'.
[2017-01-06 06:43:49.849052] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 14-distrep-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2017-01-06 06:43:49.850164] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 14-distrep-client-6: Server lk version = 1
[2017-01-06 07:06:15.409507] W [socket.c:590:__socket_rwv] 14-distrep-client-7: readv on 10.70.47.167:49153 failed (No data available)
[2017-01-06 07:06:15.410199] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 14-distrep-client-7: disconnected from distrep-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-06 11:03:09.948565] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 14-distrep-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-01-06 11:03:09.949940] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 14-distrep-client-7: Connected to distrep-client-7, attached to remote volume '/bricks/brick8/b8'.
[2017-01-06 11:03:09.949983] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 14-distrep-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2017-01-06 11:03:09.950451] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 14-distrep-client-7: Server lk version = 1
[root@dhcp46-42 ~]#

Client-7 was disconnected for about 4 hours with the error "Connection refused". Not sure what caused the network disconnect. There seem to be iptables rules for the brick ports as well. Not sure if they blocked any I/Os.

[root@dhcp46-42 ~]# grep -i "14-distrep" /var/log/ganesha-gfapi.log-20170109 | grep -v "2017-01-05" | grep -i "replicate-3"
[2017-01-06 06:43:49.816192] I [MSGID: 108005] [afr-common.c:4654:afr_notify] 14-distrep-replicate-3: Subvolume 'distrep-client-7' came back up; going online.
[root@dhcp46-42 ~]#

Since there are no errors reported with respect to on-going I/O in the logs, maybe there were some errors in the ganesha server-client communication itself. Otherwise, request Prasad to reproduce and provide us the setup while client I/Os are going on. Thanks!

However, replicate-3 was up and never seems to have gone down. So I/Os shouldn't have been disrupted. We shall need a live setup to debug further using gdb or tcpdump.

Though there is an rpc network disconnection, it shouldn't

I am moving this BZ back to NFS-Ganesha as the issue looks to be on the nfs client & ganesha server side as per comment 14.

Reopen if seen again.
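
The "about 4 hours" figure for client-7 can be read off the timestamps of the `disconnected from` and `Connected to` entries. A rough Python sketch of that calculation is below. It assumes the log line shape shown in the excerpts above, and it keys on MSGID 114018 and 114046 only because those IDs accompany the disconnect and connect messages in this report. The function name `disconnect_windows` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2017-01-06 07:06:15.410199] I [MSGID: 114018] [client.c:...] 14-distrep-client-7: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] \w "
    r"\[MSGID: (?P<msgid>\d+)\] \[[^\]]*\] (?P<xl>\S+): "
)

# Assumption: IDs taken from the excerpts in this report.
DISCONNECTED = "114018"  # "disconnected from ..."
CONNECTED = "114046"     # "Connected to ..., attached to remote volume ..."


def disconnect_windows(lines, xlator):
    """Return (down, up) datetime pairs for one client xlator,
    e.g. xlator="14-distrep-client-7"."""
    windows = []
    down_since = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("xl") != xlator:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        if m.group("msgid") == DISCONNECTED and down_since is None:
            down_since = ts
        elif m.group("msgid") == CONNECTED and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
    return windows
```

Feeding it the lines of ganesha-gfapi.log with `xlator="14-distrep-client-7"` gives one window per outage; subtracting the two datetimes of a pair yields the outage length. Lines without a MSGID field, such as the socket.c warning, are skipped.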