Bug 1579928

Summary: using auth.allow with hostnames or fqdn breaks volume; volume heal info errors
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Matthias Muench <mmuench>
Component: glusterfs Assignee: Sanju <srakonde>
Status: CLOSED CURRENTRELEASE QA Contact: Bala Konda Reddy M <bmekala>
Severity: medium Docs Contact:
Priority: medium    
Version: rhgs-3.3 CC: amukherj, mmuench, moagrawa, rhs-bugs, sankarshan, vbellur, vdas
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-06 07:54:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
getaddr.c none

Description Matthias Muench 2018-05-18 17:29:04 UTC
Description of problem:
Using auth.allow with symbolic hostnames breaks glusterfs for replicated or distributed-replicated volumes (with or without arbiter). After a client has mounted the volume and written files, running `gluster volume heal ${volname} info` reports either "Status: Transport endpoint is not connected" or "volgeo7: Not able to fetch volfile from glusterd" followed by "Volume heal failed."


Version-Release number of selected component (if applicable):
glusterfs-server-3.8.4-54.8.el7rhgs.x86_64

How reproducible:
regularly


Steps to Reproduce:
1. create a replicated or distributed-replicated volume (e.g. volgeo6)
2. create a hostlist: `for i in `seq 1 254`; do echo gl-dummycl-$i >> hostlist_names_254; done`
3. add valid client name to list
4. gluster volume set volgeo6 auth.allow `cat hostlist_names_254`
5. mount volume from client: `mount -t glusterfs gl-n4:/volgeo6 /gluster/volgeo6`
6. write data: `cp -r /usr/lib/* /gluster/volgeo6/`
7. on RHGS server: gluster vol heal volgeo6 info

Actual results:
[root@gl-n5 glusterfs]# gluster vol heal volgeo6 info
date
Brick gl-n4.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n5.private-eval.local:/rhgs/brick_o/brick
Status: Transport endpoint is not connected
Number of entries: -

Brick gl-n6.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0



Expected results:
[root@gl-n5 glusterfs]# gluster vol heal volgeo6 info
date
Brick gl-n4.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n5.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0

Brick gl-n6.private-eval.local:/rhgs/brick_o/brick
Status: Connected
Number of entries: 0




Additional info:
Using IP addresses, this works.
Affected volumes from data: volgeo6, volgeo7 (using hostnames). Not affected volumes: volgeo4, volgeo5 (using IP addresses)
data available from: https://github.com/mattmuench/bug-gluster/tree/master/bug-authallow-hostnames

Comment 2 Mohit Agrawal 2018-05-29 11:34:41 UTC
Hi Matthias,

   I don't think the problem is that the glusterfs code does not accept FQDNs. I believe the
   problem is in your environment: DNS is not able to resolve the hostnames.

   Below are the errors logged when the hostnames are resolved with getaddrinfo;
   as you can see, the call fails with "Name or service not known".

   >>>>>>>>>>>>>>>>>>>
    
   [2018-05-05 10:40:10.022307] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-45", received addr = "172.20.11.15"
[2018-05-05 10:40:10.023607] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-45: error in getaddrinfo: Name or service not known

[2018-05-05 10:40:10.023625] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-46", received addr = "172.20.11.15"
[2018-05-05 10:40:10.024874] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-46: error in getaddrinfo: Name or service not known

[2018-05-05 10:40:10.024900] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-47", received addr = "172.20.11.15"
[2018-05-05 10:40:10.026125] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-47: error in getaddrinfo: Name or service not known

[2018-05-05 10:40:10.026143] I [addr.c:55:compare_addr_and_update] 0-/rhgs/brick_m/brick: allowed = "gl-dummycl-48", received addr = "172.20.11.15"
[2018-05-05 10:40:10.027485] W [MSGID: 101075] [common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-48: error in getaddrinfo: Name or service not known


 >>>>>>>>>>>>>>>>>>>>>
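
 To list every affected name from a longer brick log, warnings like the ones above can be filtered mechanically. This is a minimal Python sketch, assuming the log lines have the same shape as the excerpt; the regular expression and the function name `unresolved_names` are inferred for illustration and are not taken from the glusterfs source.

```python
import re

# Pattern inferred from the log excerpt above (an assumption, not the
# glusterfs source): the failing name follows the "<number>-" prefix.
FAILED_LOOKUP = re.compile(
    r"\[common-utils\.c:\d+:gf_is_same_address\] \d+-(?P<name>\S+): "
    r"error in getaddrinfo: (?P<reason>.+)$"
)


def unresolved_names(log_lines):
    """Return host names whose getaddrinfo lookup failed, in first-seen order."""
    seen = []
    for line in log_lines:
        match = FAILED_LOOKUP.search(line.strip())
        if match and match.group("name") not in seen:
            seen.append(match.group("name"))
    return seen


# Two lines copied from the excerpt above.
sample = [
    '[2018-05-05 10:40:10.022307] I [addr.c:55:compare_addr_and_update] '
    '0-/rhgs/brick_m/brick: allowed = "gl-dummycl-45", received addr = "172.20.11.15"',
    '[2018-05-05 10:40:10.023607] W [MSGID: 101075] '
    '[common-utils.c:3550:gf_is_same_address] 0-gl-dummycl-45: '
    'error in getaddrinfo: Name or service not known',
]
print(unresolved_names(sample))
```

 Only the warning lines are matched; the informational `compare_addr_and_update` lines are ignored.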

 Either add these names to your DNS or add them to /etc/hosts so that they resolve. Before passing the names to auth.allow, you can use the attached program to check whether getaddrinfo resolves a hostname successfully:

 1) Compile the attached program
    gcc getaddr.c -o getaddr
 2) Run the program as shown below
    ./getaddr <host-name> <ip-addr>

Regards
Mohit Agrawal

Comment 3 Mohit Agrawal 2018-05-29 11:35:49 UTC
Created attachment 1445355 [details]
getaddr.c
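
The source of the attachment is not reproduced in this report. As a rough stand-in, the same kind of check (does a host name resolve via getaddrinfo to a given IP address?) can be sketched in Python. The function name `resolves_to` and the injectable `resolver` parameter are hypothetical and exist only for this illustration; this is not the attached program.

```python
import socket


def resolves_to(hostname, ip_addr, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to ip_addr, False otherwise.

    A failed lookup (for example "Name or service not known") counts
    as False. `resolver` is a hypothetical hook so the function can be
    exercised without a live DNS.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        return False
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address in text form. This is a plain string
    # comparison, so differently written IPv6 forms will not match.
    return any(info[4][0] == ip_addr for info in results)


# A numeric address needs no DNS, so this call works offline.
print(resolves_to("127.0.0.1", "127.0.0.1"))
```

Calling `resolves_to("gl-dummycl-45", "172.20.11.15")` on a server where the name is neither in DNS nor in /etc/hosts would be expected to return False, which corresponds to the getaddrinfo warnings in Comment 2.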

Comment 6 Matthias Muench 2018-06-29 15:21:45 UTC
I checked again with all hostnames set up in DNS. Once the names resolve properly, it works. After switching back to unknown names by removing them from DNS again, the problem can easily be triggered.
For RHGS 3.3, it seems to be related to not being able to properly resolve FQDNs/hostnames.

It should be checked in RHGS 3.4.0 whether this can be triggered again, using unresolvable FQDN.
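
For such a recheck it may help to know up front which entries of the host list do not resolve on a given server. This is a minimal Python sketch, assuming one name per line as in `hostlist_names_254` from the steps to reproduce; the function name `split_resolvable` and the `resolver` parameter are hypothetical.

```python
import socket


def split_resolvable(names, resolver=socket.getaddrinfo):
    """Split host names into (resolvable, unresolvable) lists.

    `resolver` is a hypothetical hook that defaults to the system
    getaddrinfo; blank lines in the input are skipped.
    """
    resolvable, unresolvable = [], []
    for name in names:
        name = name.strip()
        if not name:
            continue
        try:
            resolver(name, None)
        except socket.gaierror:
            # Lookup failed, e.g. "Name or service not known".
            unresolvable.append(name)
        else:
            resolvable.append(name)
    return resolvable, unresolvable
```

On a RHGS node this could be called as `split_resolvable(open("hostlist_names_254"))`. A non-empty second list would indicate names that are expected to produce the getaddrinfo warnings seen in the brick log.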