Description of problem:
Using auth.allow with symbolic hostnames breaks glusterfs for replicated or distributed-replicated (with or without arbiter) volumes. When a client mounts the volume and writes files, running `gluster volume heal ${volname} info` reports "Status: Transport endpoint is not connected" or "volgeo7: Not able to fetch volfile from glusterd Volume heal failed."

Version-Release number of selected component (if applicable):
glusterfs-server-3.8.4-54.8.el7rhgs.x86_64

How reproducible:
regularly

Steps to Reproduce:
1. Create a replicated or distributed-replicated volume (e.g. volgeo6).
2. Create a host list: `for i in $(seq 1 254); do echo gl-dummycl-$i >> hostlist_names_254; done`
3. Add a valid client name to the list.
4. gluster volume set volgeo6 auth.allow `cat hostlist_names_254`
5. Mount the volume from the client: `mount -t glusterfs gl-n4:/volgeo6 /gluster/volgeo6`
6. Write data: `cp -r /usr/lib/* /gluster/volgeo6/`
7. On the RHGS server: gluster vol heal volgeo6 info

Actual results:
[root@gl-n5 glusterfs]# gluster vol heal volgeo6 info date
Brick gl-n4.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n5.private-eval.local:/rhgs/brick_o/brick
Status: Transport endpoint is not connected
Number of entries: -

Brick gl-n6.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Expected results:
[root@gl-n5 glusterfs]# gluster vol heal volgeo6 info date
Brick gl-n4.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n5.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n6.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Additional info:
Using IP addresses in auth.allow, this works.
Affected volumes in the collected data: volgeo6, volgeo7 (using hostnames).
Volumes not affected: volgeo4, volgeo5 (using IP addresses).
Data available from: https://github.com/mattmuench/bug-gluster/tree/master/bug-authallow-hostnames
Hi Matthias,

I don't think the problem is that the glusterfs code does not accept FQDN names. I believe the problem is in your environment: DNS is not able to resolve the hostnames successfully. Below are the errors logged when the hostnames are resolved through getaddrinfo. As you can see, it reports "Name or service not known".

>>>>>>>>>>>>>>>>>>>
[2018-05-05 10:40:10.022307] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-45", received addr = "172.20.11.15"
[2018-05-05 10:40:10.023607] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-45: error in getaddrinfo: Name or service not known
[2018-05-05 10:40:10.023625] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-46", received addr = "172.20.11.15"
[2018-05-05 10:40:10.024874] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-46: error in getaddrinfo: Name or service not known
[2018-05-05 10:40:10.024900] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-47", received addr = "172.20.11.15"
[2018-05-05 10:40:10.026125] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-47: error in getaddrinfo: Name or service not known
[2018-05-05 10:40:10.026143] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-48", received addr = "172.20.11.15"
[2018-05-05 10:40:10.027485] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-48: error in getaddrinfo: Name or service not known
>>>>>>>>>>>>>>>>>>>>>

You either have to add these names to your DNS or add them to /etc/hosts so that they resolve successfully.

Before passing the names to auth.allow, you can use the attached program to check whether getaddrinfo resolves a hostname successfully:

1) Compile the attached program
   gcc getaddr.c -o getaddr
2) Run the program as below
   getaddr <host-name> <ip-addr>

Regards
Mohit Agrawal
Created attachment 1445355: getaddr.c
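The attachment itself is not reproduced in this thread. As a rough illustration of the same kind of check, here is a minimal Python sketch (not the attached getaddr.c) that asks getaddrinfo whether a name resolves and whether one of its addresses matches the expected IP. The function names `resolves_to` and `unresolvable` and the `resolver` parameter are assumptions made for this example; the IP comparison is a plain string match, which is an assumed simplification.

```python
import socket


def resolves_to(hostname, ip, resolver=socket.getaddrinfo):
    """Return True if one of the addresses hostname resolves to equals ip.

    Returns False if the name resolves only to other addresses or does
    not resolve at all. `resolver` is a hypothetical hook so the lookup
    can be replaced in tests; by default the system resolver is used.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as err:
        # Same failure class as the "error in getaddrinfo" brick log lines.
        print(f"{hostname}: error in getaddrinfo: {err}")
        return False
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as a string.
    addresses = {info[4][0] for info in infos}
    return ip in addresses


def unresolvable(hostnames, resolver=socket.getaddrinfo):
    """Return the names from hostnames that fail to resolve, in order."""
    failed = []
    for name in hostnames:
        try:
            resolver(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

Run on a brick server, `unresolvable(open("hostlist_names_254").read().split())` would list the entries to fix in DNS or /etc/hosts before the list is passed to auth.allow, and `resolves_to("<host-name>", "<ip-addr>")` checks a single client.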
I checked again with all hostnames set up in DNS. Once the names resolve properly, it works. Toggling back to unknown names, by removing the names from DNS again, easily triggers the problem. For RHGS 3.3, the issue seems to be related to FQDNs/hostnames that cannot be resolved properly. It should be checked whether this can still be triggered in RHGS 3.4.0 using unresolvable FQDNs.