Bug 682833

Summary: Autofs doesn't resolve correctly with dnsnames after a connection.
Product: Red Hat Enterprise Linux 6
Component: autofs
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: low
Target Milestone: rc
Reporter: Patrik Martinsson <martinsson.patrik>
Assignee: Ian Kent <ikent>
QA Contact: yanfu,wang <yanwang>
CC: ikent, rwheeler
Doc Type: Bug Fix
Last Closed: 2011-08-15 12:34:38 UTC
Attachments (description, flags):
  Name lookup alternate program (flags: none)
  Logs, recording and getaddr-sample. (flags: none)
  Another name lookup test program (flags: none)
  Video of autofs tests. (flags: none)

Description Patrik Martinsson 2011-03-07 17:57:59 UTC
Description of problem:
Autofs doesn't seem to be able to get host info through getaddrinfo() when using DNS names. The feature works as expected while the network is up, but if the connection goes down and is then brought up again, lookups no longer work properly.


Version-Release number of selected component (if applicable):
autofs-5.0.5-23.el6.x86_64

How reproducible:
Always. 

Steps to Reproduce:
1. Create a map, 
   test -auto,intr,vers=3  fileserver:/vol/test
   
2. /usr/sbin/automount -d -f -n 1 
   
3. cd /autofs/test 

4. Disconnect network. 

5. Connect network. 

6. Run a simple getaddrinfo() program to verify that we can actually reach the filer.
====
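// gcc getaddr.c -o getaddr
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>

int main(int argc, char *argv[])
{
  /* Initialize the result pointer so it never holds an indeterminate value. */
  struct addrinfo hints, *ni = NULL;
  int ret;

  /* The host name to look up must be given as the only argument. */
  if (argc != 2) {
    fprintf(stderr, "usage: %s <hostname>\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  memset(&hints, 0, sizeof(struct addrinfo));
  hints.ai_flags = AI_ADDRCONFIG;
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_DGRAM;

  ret = getaddrinfo(argv[1], NULL, &hints, &ni);
  if (ret) {
    fprintf(stderr, "getaddrinfo: %s (%s)\n", gai_strerror(ret), argv[1]);
    exit(EXIT_FAILURE);
  }

  fprintf(stdout, "getaddrinfo: Success on %s\n", argv[1]);

  /* Release the result list returned by getaddrinfo(). */
  freeaddrinfo(ni);
  return 0;
}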
====

7. Connection verified, try to cd /autofs/test again.

Actual results:
autofs complains about, 
add_host_addrs: hostname lookup failed: Temporary failure in name resolution

Expected results:
Successful mount. 


Additional info:
If an IP address is used instead of a DNS name, this works as expected.
The time it takes after a disconnect/connect before a mount succeeds varies: sometimes it works after a couple of seconds, sometimes it takes minutes. On each failed attempt you will see "add_host_addrs: hostname lookup failed: Temporary failure in name resolution", even though the sample program runs fine.

Any help is appreciated. Maybe I'm way off here, but I'm very confused as to why my sample getaddrinfo program is able to return an addrinfo structure while autofs is not, which makes the mount fail.

Comment 2 Ian Kent 2011-03-08 02:02:53 UTC
Yes, that's strange behaviour.
I'll check it out.
Ian

Comment 3 Ian Kent 2011-03-08 04:41:13 UTC
(In reply to comment #0)
> Description of problem:
> Autofs doesn't seem to be able to get host info through getaddrinfo() when using
> dns-names. The feature works well and as expected when having a network, but if
> the connection goes down, and then brought up again it doesn't seem to work
> properly. 

My initial impression is that I can't duplicate this, but I haven't
yet set up proper DNS.

> 
> 
> Version-Release number of selected component (if applicable):
> autofs-5.0.5-23.el6.x86_64
> 
> How reproducible:
> Always. 
> 

There are a couple of things wrong with this procedure as
well. Perhaps you can redo your test taking account of the
comments below.

> Steps to Reproduce:
> 1. Create a map, 
>    test -auto,intr,vers=3  fileserver:/vol/test

OK, good start, use an invalid mount option, auto, in the map
entry. Probably not an actual problem though.

> 
> 2. /usr/sbin/automount -d -f -n 1 
> 
> 3. cd /autofs/test 
> 
> 4. Disconnect network. 
> 
> 5. Connect network. 
> 
> 6. Run simple getaddrinfo-program to test just to see that we actually can
> reach filer.  

snip ...

Right, so run an independent lookup .... mmm, which doesn't
check if glibc is caching results and is somehow affected by
the network disconnect ..... again probably not the case.

> 
> 6. connection verified, try to cd /autofs/test  

But your working directory is already /autofs/test so
the previous mount should still be mounted and this
should do nothing but set the directory again without
contacting the autofs daemon.

> 
> Actual results:
> autofs complains about, 
> add_host_addrs: hostname lookup failed: Temporary failure in name resolution

You have run the daemon with debug but haven't included the
output so we have no idea what actually happened. Like whether
the mount was expired (but you would have had to change working
directory away, which isn't included in your description) and a
new mount was attempted.

How about redoing the test with a little more detailed feedback
please.

Ian

Comment 4 Ian Kent 2011-03-08 04:44:04 UTC
Created attachment 482834 [details]
Name lookup alternate program

Running this before, during and after the disconnect as part
of the test should provide a check for some sort of glibc
caching causing a problem.

Comment 5 Patrik Martinsson 2011-03-08 10:12:11 UTC
Thanks for the quick response, and sorry for being somewhat sloppy in the bug report.

In your getaddr program I added ni = NULL; after the declaration, because otherwise freeaddrinfo() segfaults if the host is not reachable. I also changed it so that the program never exits. I've attached it together with a debug log and a recording of how I ran the test.
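
The attached program itself is not reproduced in this report, so the following is only a rough sketch of the pattern described: the result pointer starts out as NULL, freeaddrinfo() is called only after a successful lookup, and the lookup is repeated once per Enter press instead of exiting. The names try_lookup and lookup_on_enter are hypothetical, and passing the flags as a parameter is an assumption made for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper: perform one lookup of `host` with the given
 * ai_flags and return the getaddrinfo() result code (0 on success). */
int try_lookup(const char *host, int flags)
{
    struct addrinfo hints;
    struct addrinfo *ni = NULL;   /* start from a known value */
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_flags = flags;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    ret = getaddrinfo(host, NULL, &hints, &ni);
    if (ret != 0) {
        /* On failure no result list was returned, so nothing is freed. */
        fprintf(stderr, "getaddrinfo: %s (%s)\n", gai_strerror(ret), host);
        return ret;
    }

    printf("getaddrinfo: Success on %s\n", host);
    freeaddrinfo(ni);
    return 0;
}

/* Hypothetical driver: run one lookup per line read from `in` (one per
 * Enter press when `in` is stdin), so results before, during and after
 * a network outage can be compared within a single process.
 * Returns the number of successful lookups. */
int lookup_on_enter(const char *host, int flags, FILE *in)
{
    char line[256];
    int ok = 0;

    while (fgets(line, sizeof(line), in) != NULL) {
        if (try_lookup(host, flags) == 0)
            ok++;
    }
    return ok;
}
```

Calling lookup_on_enter(argv[1], AI_ADDRCONFIG, stdin) from main() would mirror the flags used in the sample program from the description.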

I've tested it with the approach you suggest, 

1. map is now, 
temp -defaults,intr,vers=3 fs5:/vol/vol3/no_backup/temp

2. /usr/sbin/automount -d -f -n 1, the log is attached. 

3. Having your getaddr app running. 

4. cd /autofs/test, mount works. 

5. Pressing enter on getaddr program shows successful getaddrinfo on host.

6. cd / 

7. Disconnect network. 

8. Pressing enter on getaddr program shows unsuccessful getaddrinfo on host.

9. cd /autofs/test, mount does not work, as expected with no network. 

10. Connect network again. 

11. Pressing enter on getaddr program shows successful getaddrinfo on host.

12. cd /autofs/test, mount does not work; the debug output says temporary failure in name resolution. Retrying this a couple of times gives the same result.

13. Pressing enter on getaddr program shows successful getaddrinfo on host.

14. cd /autofs/test, mount does not work, debug output says temporary failure in name resolution. 

15. Closing test. 

I've recorded the test so you can see exactly how it was done. I cannot reproduce this when using an IP address instead of a hostname. This is really not a big issue since we never change IP addresses on the filers, but I thought I should mention the problem in case anybody else has the same issue. And as I mentioned earlier, sometimes it works after a couple of tries, and sometimes I never seem to be able to mount without restarting the daemon.

Comment 6 Patrik Martinsson 2011-03-08 10:13:34 UTC
Created attachment 482867 [details]
Logs, recording and getaddr-sample.

Comment 7 Ian Kent 2011-03-09 08:16:33 UTC
(In reply to comment #5)
> Thanks for the quick response, sorry for being somewhat sloppy in the
> bugreport. 

No problem.

> 
> In your getaddr-program I added, ni = NULL; after the declaration, otherwise
> freeaddrinfo segfaults if a host is not reachable. Also changed so we never
> exit the program, I've attached it together with a debuglog and a recording of
> how I made the test. 

Got that.

At first I was wondering why the directory change after the
interface comes back up caused a new mount but I see that the
existing mount gets umounted when the interface goes down,
probably by NetworkManager (in my case), it's definitely not
autofs doing it. I wasn't aware of that behaviour.

At this point I still don't see the name lookup problem you
are seeing but that's on Fedora 14. I don't have an up to
date RHEL-6 install, so that's the next thing I need to do.

I will also update the dns lookup program to do the same
thing as the daemon does. It does an address lookup on the
passed in string and then progresses to a name lookup if
that fails, in case the name is actually an address string.

Ian
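
The two-stage lookup described in this comment could look roughly like the sketch below. It is not taken from the autofs source: the function name lookup_addr_then_name is hypothetical, and the AI_ADDRCONFIG flag in the second stage is an assumption borrowed from the reporter's sample program.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Hypothetical helper: resolve `name` in two stages.
 * On success returns 0, stores the result list in *res (the caller
 * frees it with freeaddrinfo()) and sets *numeric to 1 if `name` was
 * an address string, or 0 if it had to be resolved as a host name.
 * On failure returns the getaddrinfo() error code of the name lookup.
 */
int lookup_addr_then_name(const char *name, struct addrinfo **res,
                          int *numeric)
{
    struct addrinfo hints;
    int ret;

    *res = NULL;

    /* Stage 1: treat the string as a numeric address. AI_NUMERICHOST
     * prevents any name resolution from taking place. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_flags = AI_NUMERICHOST;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    ret = getaddrinfo(name, NULL, &hints, res);
    if (ret == 0) {
        *numeric = 1;
        return 0;
    }

    /* Stage 2: the string was not an address, so do a name lookup.
     * AI_ADDRCONFIG here is an assumption, not confirmed by the thread. */
    *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_flags = AI_ADDRCONFIG;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    ret = getaddrinfo(name, NULL, &hints, res);
    if (ret == 0)
        *numeric = 0;
    return ret;
}
```

With this shape, only the second stage depends on the resolver, which fits the observation that mounts by IP address keep working.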

Comment 8 Ian Kent 2011-03-14 11:16:49 UTC
Created attachment 484147 [details]
Another name lookup test program

This is basically what the daemon does when looking up a name.
Does using this for the name lookup exhibit the same problem
you are seeing?

Comment 9 Patrik Martinsson 2011-03-16 15:39:34 UTC
Hey Ian, 

I've redone the tests with the new name lookup program, and I'm seeing the same behaviour as earlier, you can watch the video. 

The test is not conclusive, since it sometimes works and sometimes doesn't. As you can see in the video, the first mount after the disconnect works as expected, but when I disconnect/connect again I can't get it to mount.

Offhand, this sounds like some kind of caching issue, as you mentioned earlier, but I don't know how much time and effort we should spend on this matter, since it appears that I'm the only one with this problem and it's very random.
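
One kind of caching that may be worth ruling out, although nothing in this thread confirms it as the cause: older glibc releases (before 2.26, as far as I know) read /etc/resolv.conf once per process and do not notice later changes, so a long-running daemon can keep stale resolver settings after the network configuration is rewritten. The sketch below calls res_init() before each lookup to force a re-read. The function name lookup_after_res_init is hypothetical and this is not something autofs is known to do.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <netdb.h>
#include <resolv.h>

/*
 * Hypothetical helper: re-initialize the resolver state, then look up
 * `name` with the given ai_flags. Returns the getaddrinfo() result code
 * and stores the result list in *res (caller frees with freeaddrinfo()).
 */
int lookup_after_res_init(const char *name, int flags,
                          struct addrinfo **res)
{
    struct addrinfo hints;

    *res = NULL;

    /* Ask the resolver to re-read its configuration. If this fails,
     * the lookup simply proceeds with whatever state already exists. */
    (void) res_init();

    memset(&hints, 0, sizeof(hints));
    hints.ai_flags = flags;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    return getaddrinfo(name, NULL, &hints, res);
}
```

If a test program built this way keeps resolving after a reconnect while an otherwise identical one without the res_init() call does not, that would point towards stale resolver configuration. In a multi-threaded daemon the call would need more thought, since it changes resolver state outside the calling function.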

As I said earlier, we use the IP-addresses instead and it works fine. 

Although it's always a bit annoying to not find the real issue :)

Best regards, 
Patrik Martinsson

Comment 10 Patrik Martinsson 2011-03-16 15:52:24 UTC
Created attachment 485771 [details]
Video of autofs tests.

Comment 11 Ian Kent 2011-03-17 11:15:13 UTC
(In reply to comment #9)
> Hey Ian, 
> 
> I've redone the tests with the new name lookup program, and I'm seeing the same
> behaviour as earlier, you can watch the video. 

I was hoping the name lookup program would fail in the same way
as automount. I thought maybe it was an initialization problem
specific to your machine because you had to add initialization
to my original program and I didn't see that problem when I used
it. It may still be that since the environment within automount
isn't the same as the test program.

> 
> The test is not conclusive since it tends to work sometimes and sometimes not,
> as you can see in the video the first mount after the disconnect works as
> expected, but then when I disconnect/connect again I can't get it to mount. 
> 
> Spontaneously I think this sounds like some caching issue as you've mentioned
> earlier, but I don't know how much time and effort we should spend on this
> matter - since it appears that I'm the only one with this problem and it's very
> random. 

Yes, it's quite odd.

Adding specific initialization to the name lookup parts of
autofs is straightforward; it's done in only two places.
I can make a test package on the chance it would fix the
problem.

Another thing from the video is that it looks like the "-n 1"
option isn't being honoured. The video clearly shows the mount
request immediately returning a fail long after the one second
negative timeout. That deserves some investigation. I'll have
a look around and see if I can see what might cause that.

Comment 14 Ian Kent 2011-06-10 01:51:11 UTC
I'm setting devel_ack+ on this bug so I can investigate (and
fix) the observed problem with setting the negative timeout.
Since I can't reproduce the DNS problem at all I can't fix
it until we get a report that has a scenario that gives a
different view of the problem and a lead to follow as to
what the problem is.

Comment 16 Ian Kent 2011-08-04 14:37:51 UTC
I have been unable to reproduce this problem despite a fair
amount of effort. Consequently I don't have a resolution
ready for RHEL-6.2.

Deferring to 6.3.

Comment 17 Patrik Martinsson 2011-08-15 12:12:32 UTC
Hi again Ian, 

I haven't looked into this issue since we changed to RHEL 6.1 and also switched to IP addresses instead of hostnames. I know that our DNS setup is somewhat "not the way it should be", so maybe that has something to do with it (although you would figure that the test program would get the same result then).

I think you can close this as invalid and I can look into it in the future and reopen the bug if I hit this issue again. 

Thanks for the great work anyway!

Comment 18 Ian Kent 2011-08-15 12:34:38 UTC
(In reply to comment #17)
> Hi again Ian, 
> 
> I haven't looked into this issue since we changed to rhel 6.1, and since we
> also changed to ip-addr instead of hostnames. I know that our dns setup is
> somewhat "not the way it should be", so maybe that has something to do with it
> (although you would figure that the test-program would get the same result
> then). 

Yes, puzzling.

> 
> I think you can close this as invalid and I can look into it in the future and
> reopen the bug if I hit this issue again. 
> 
> Thanks for the great work anyway!

OK, thanks for getting back to us.
I'll close this INSUFFICIENT_DATA for want of a better status.

Ian