I have about 150 machines running Fedora 22 which are experiencing frequent automount segfaults. I'm fixing up the unit files to auto-restart the daemon since it's critical to proper operation of the machine, but I figured it's worth reporting the actual segfault. Might be worth having the unit auto-restart; I have Restart=on-abnormal set currently (in the [Service] section).

autofs-5.1.0-12.fc22.x86_64
openldap-2.4.40-12.fc22.x86_64
kernel-4.1.8-200.fc22.x86_64 (though the issue has been consistent across kernels)

My maps (including the master) are all in LDAP; sssd_autofs is caching the maps, though I don't know if it actually caches the master. I guess if it did then there would be no reason for autofs to ever talk to an LDAP server directly.

LDAP clients are working fine; there should be no problem with the certificates. Autofs works fine until it segfaults. Please let me know if there's any additional information I can provide.

Here's what coredumpctl says:

           PID: 727 (automount)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 6 (ABRT)
     Timestamp: Fri 2015-10-02 16:08:49 CDT (4 days ago)
  Command Line: /usr/sbin/automount --pid-file /run/autofs.pid
    Executable: /usr/sbin/automount
 Control Group: /system.slice/autofs.service
          Unit: autofs.service
         Slice: system.slice
       Boot ID: 7555696902d44ee9a612b01a0292177a
    Machine ID: 33e4e1055dd74af3b82a4ca2e3395a37
      Hostname: ld110.e.math.uh.edu
       Message: Process 727 (automount) of user 0 dumped core.
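For the record, the auto-restart setup is just a systemd drop-in override, something like the following (the drop-in file name is one I chose, and RestartSec is an extra I'm experimenting with; only Restart=on-abnormal is what I actually described above):

```ini
# /etc/systemd/system/autofs.service.d/restart.conf (local drop-in; hypothetical path)
[Service]
Restart=on-abnormal
RestartSec=5s
```

followed by a systemctl daemon-reload to pick it up.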
Stack trace of thread 15383:
#0  0x00007f5bbb56d9c8 raise (libc.so.6)
#1  0x00007f5bbb56f65a abort (libc.so.6)
#2  0x00007f5bbb5b0a92 __libc_message (libc.so.6)
#3  0x00007f5bbb5bcd2d __libc_free (libc.so.6)
#4  0x00007f5bb35e25e5 ldap_int_tls_destroy (libldap-2.4.so.2)
#5  0x00007f5bb35c7af7 ldap_ld_free (libldap-2.4.so.2)
#6  0x00007f5bb81e7228 __unbind_ldap_connection (lookup_ldap.so)
#7  0x00007f5bb81e728e unbind_ldap_connection (lookup_ldap.so)
#8  0x00007f5bb81e7668 do_connect (lookup_ldap.so)
#9  0x00007f5bb81e7e68 do_reconnect (lookup_ldap.so)
#10 0x00007f5bb81eb654 lookup_mount (lookup_ldap.so)
#11 0x000055ac6fe1dc37 lookup_nss_mount (automount)
#12 0x000055ac6fe14d84 do_mount_indirect (automount)
#13 0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#14 0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 727:
#0  0x00007f5bbc9f11b6 sigwait (libpthread.so.0)
#1  0x000055ac6fe112ca main (automount)
#2  0x00007f5bbb559700 __libc_start_main (libc.so.6)
#3  0x000055ac6fe117b9 _start (automount)

Stack trace of thread 728:
#0  0x00007f5bbc9ed8e9 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000055ac6fe2a576 alarm_handler (automount)
#2  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#3  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 773:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 764:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 783:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 784:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 775:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 785:
#0  0x00007f5bbb6302fd poll (libc.so.6)
#1  0x000055ac6fe12bc5 handle_packet (automount)
#2  0x000055ac6fe14218 handle_mounts (automount)
#3  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#4  0x00007f5bbb63bb9d __clone (libc.so.6)

Stack trace of thread 729:
#0  0x00007f5bbc9ed540 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000055ac6fe1f5d3 st_queue_handler (automount)
#2  0x00007f5bbc9e8555 start_thread (libpthread.so.0)
#3  0x00007f5bbb63bb9d __clone (libc.so.6)
I've been struggling with this for a while now.
Would you mind getting a full debug log of this, please? Don't forget to ensure syslog is recording facility daemon at level debug and above.
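In autofs terms that's the logging option, plus making sure the daemon facility is actually captured. A sketch of both pieces (the autofs.conf option is as in 5.1.x; the rsyslog selector only matters if you're using rsyslog rather than the journal, and the log file path is just an example):

```ini
# /etc/autofs.conf
[ autofs ]
logging = debug

# /etc/rsyslog.d/autofs-debug.conf (only if using rsyslog rather than the journal)
# daemon.*   /var/log/autofs-debug.log
```

With the journal, journalctl -u autofs should pick the messages up once logging = debug is set and the daemon restarted.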
(In reply to Jason Tibbitts from comment #0)
> I have about 150 machines running Fedora 22 which are experiencing frequent
> automount segfaults. I'm fixing up the unit files to auto-restart the
> daemon since it's critical to any proper operation of the machine, but I
> figured it's worth reporting the actual segfault. Might be worth having the
> unit auto-restart; I have Restart=on-abnormal set currently (in the
> [Service] section).
>
> autofs-5.1.0-12.fc22.x86_64
> openldap-2.4.40-12.fc22.x86_64
> kernel-4.1.8-200.fc22.x86_64 (though the issue has been consistent across
> kernels).
>
> My maps (including the master) are all in LDAP; sssd_autofs is caching the
> maps though I don't know if it actually caches the master. I guess if it
> did then there would be no reason for autofs to ever talk to an LDAP server
> directly.

If you're using sss as a map source, autofs shouldn't be using the ldap lookup module at all. But it's easy to get that wrong by specifying the map source as ldap in master map entries. That essentially overrides the nsswitch source selection and makes autofs use ldap directly instead of sss.

> LDAP clients are working fine; there should be no problem with the
> certificates. Autofs works fine until it segfaults.

How long does it last before it fails?

Ian
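To be explicit about where that source selection comes from: when the master map entry gives only a map name, autofs consults the automount line in /etc/nsswitch.conf. A sketch of an ordering that prefers sss (illustrative only, not a claim about your current configuration):

```
# /etc/nsswitch.conf -- illustrative automount source ordering
automount: sss files
```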
At least I'm not the only person seeing this.

Unfortunately I can't reproduce this easily. On some machines it crashes relatively often; on others, not so much. So on some machines I have this:

ld110
Thu 2015-08-27 23:06:48 CDT    711  0  0  11    /usr/sbin/automount
Mon 2015-09-14 20:14:33 CDT    719  0  0   6    /usr/sbin/automount
Tue 2015-09-22 19:12:20 CDT    712  0  0   6    /usr/sbin/automount
Mon 2015-09-28 21:12:24 CDT    734  0  0   6    /usr/sbin/automount
Fri 2015-10-02 16:08:51 CDT    727  0  0   6    /usr/sbin/automount
Wed 2015-10-07 16:21:48 CDT   6121  0  0   4  * /usr/sbin/automount
Wed 2015-10-07 20:07:57 CDT  12009  0  0   6  * /usr/sbin/automount

ld69
Sun 2015-07-19 07:01:17 CDT    710  0  0   6    /usr/sbin/automount
Tue 2015-07-21 17:12:00 CDT  23746  0  0  11    /usr/sbin/automount
Thu 2015-08-06 14:04:16 CDT    714  0  0   4    /usr/sbin/automount
Thu 2015-10-01 07:01:49 CDT    716  0  0   6    /usr/sbin/automount
Sat 2015-10-03 02:07:24 CDT   5854  0  0   6    /usr/sbin/automount
Mon 2015-10-05 15:04:49 CDT   4440  0  0  11  * /usr/sbin/automount
Wed 2015-10-07 12:39:02 CDT   7049  0  0   4  * /usr/sbin/automount
Wed 2015-10-07 16:25:00 CDT  15852  0  0   4  * /usr/sbin/automount

while on others it hasn't crashed at all.

I took the machines where it crashes most often and set logging = debug. We'll see if I can capture a crash. No syslog daemons here; everything's in the journal.

My master map has entries like:

dn: cn=/storage,nisMapName=auto.master,dc=math,dc=uh,dc=edu
objectClass: nisObject
cn: /storage
nisMapName: auto.master
nisMapEntry: ldap:nisMapName=auto.storage,dc=math,dc=uh,dc=edu

which I'm guessing is the wrong way to do it. What should the map entries look like?
(In reply to Jason Tibbitts from comment #4)
> At least I'm not the only person seeing this.
>
> Unfortunately I can't reproduce this easily. On some machines it crashes
> relatively often; on others, not so much. So on some machines I have this:
>
> ld110
> Thu 2015-08-27 23:06:48 CDT    711  0  0  11    /usr/sbin/automount
> Mon 2015-09-14 20:14:33 CDT    719  0  0   6    /usr/sbin/automount
> Tue 2015-09-22 19:12:20 CDT    712  0  0   6    /usr/sbin/automount
> Mon 2015-09-28 21:12:24 CDT    734  0  0   6    /usr/sbin/automount
> Fri 2015-10-02 16:08:51 CDT    727  0  0   6    /usr/sbin/automount
> Wed 2015-10-07 16:21:48 CDT   6121  0  0   4  * /usr/sbin/automount
> Wed 2015-10-07 20:07:57 CDT  12009  0  0   6  * /usr/sbin/automount
>
> ld69
> Sun 2015-07-19 07:01:17 CDT    710  0  0   6    /usr/sbin/automount
> Tue 2015-07-21 17:12:00 CDT  23746  0  0  11    /usr/sbin/automount
> Thu 2015-08-06 14:04:16 CDT    714  0  0   4    /usr/sbin/automount
> Thu 2015-10-01 07:01:49 CDT    716  0  0   6    /usr/sbin/automount
> Sat 2015-10-03 02:07:24 CDT   5854  0  0   6    /usr/sbin/automount
> Mon 2015-10-05 15:04:49 CDT   4440  0  0  11  * /usr/sbin/automount
> Wed 2015-10-07 12:39:02 CDT   7049  0  0   4  * /usr/sbin/automount
> Wed 2015-10-07 16:25:00 CDT  15852  0  0   4  * /usr/sbin/automount

I don't get that; what do the columns mean?

> while on others it hasn't crashed at all.
>
> I took the machines where it crashes most often and set logging = debug.
> We'll see if I can capture a crash. No syslog daemons here; everything's in
> the journal.

OK, let's see what we get; that will tell me about your configuration and usage pattern, which might be useful.

> My master map has entries like:
>
> dn: cn=/storage,nisMapName=auto.master,dc=math,dc=uh,dc=edu
> objectClass: nisObject
> cn: /storage
> nisMapName: auto.master
> nisMapEntry: ldap:nisMapName=auto.storage,dc=math,dc=uh,dc=edu
>
> which I'm guessing is the wrong way to do it. What should the map entries
> look like?

There's not really a wrong way, but if you put a map source in the map entry autofs must use it, i.e. the ldap: above.
The nsswitch function is meant to work when only the map name is given; clearly the dn isn't relevant to the files source, for example.

Which means you need to set the base dn and probably the ldap server uri in either the openldap configuration or the autofs configuration, and then that's no longer centrally managed.

Ian
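Concretely, the locally-managed settings I mean would look something like this in the autofs configuration (a sketch; the option names are as I recall them from recent autofs.conf, and the server name is a placeholder, so verify against autofs.conf(5) on your version):

```ini
# /etc/autofs.conf -- setting locally what was previously held in the map entries
[ autofs ]
ldap_uri = "ldap://ldap.example.com"
search_base = "dc=math,dc=uh,dc=edu"
```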
(In reply to Ian Kent from comment #5)
> The nsswitch function is meant to work when only the map name
> is given. Clearly the dn isn't relevant to the files source for
> example.
>
> Which means you need to set the base dn and probably the ldap
> server uri in either the openldap configuration or the autofs
> configuration and then that's no longer centrally managed.

Or configure sss appropriately. You should also be able to use sss:<map name only> to prevent local file map overrides when you have "automount: files sss", for example.

> Ian
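So a master map entry using the map name alone, or one that explicitly pins the sss source, would look something like this (a sketch; auto.storage is taken from the earlier LDIF, and the mount point is just for illustration):

```
# map name only -- source chosen via the automount line in nsswitch.conf
/storage   auto.storage

# or explicitly pin the sss source on machines that support it
/storage   sss:auto.storage
```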
Sorry, that's the output of coredumpctl | grep automount. I should have just done coredumpctl list automount:

ld69:~> coredumpctl list automount
TIME                            PID  UID  GID  SIG  PRESENT  EXE
Sun 2015-07-19 07:01:17 CDT     710    0    0    6           /usr/sbin/automount
Tue 2015-07-21 17:12:00 CDT   23746    0    0   11           /usr/sbin/automount
Thu 2015-08-06 14:04:16 CDT     714    0    0    4           /usr/sbin/automount
Thu 2015-10-01 07:01:49 CDT     716    0    0    6           /usr/sbin/automount
Sat 2015-10-03 02:07:24 CDT    5854    0    0    6           /usr/sbin/automount
Mon 2015-10-05 15:04:49 CDT    4440    0    0   11        *  /usr/sbin/automount
Wed 2015-10-07 12:39:02 CDT    7049    0    0    4        *  /usr/sbin/automount
Wed 2015-10-07 16:25:00 CDT   15852    0    0    4        *  /usr/sbin/automount

I think "PRESENT" and the asterisk mean that the actual core file is still available. I can grab one of them if you like.

I'll experiment with changing the maps a bit, but these maps have to work back to EL5 so I have to be careful. In any case, I'll leave things alone long enough to hopefully catch a full debug log.
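In case you want one of the cores marked present, I believe they can be pulled straight off the machine with coredumpctl, along these lines (subcommands as in the systemd shipped with F22; worth double-checking, and the PID is just one from the list above):

```
coredumpctl info 15852                            # full metadata for that crash
coredumpctl dump 15852 -o automount-15852.core    # write the core to a file
coredumpctl gdb 15852                             # or open it directly in gdb
```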
(In reply to Jason Tibbitts from comment #7)
> Sorry, that's the output of coredumpctl | grep automount. I should have
> just done coredumpctl list automount:
>
> ld69:~> coredumpctl list automount
> TIME                            PID  UID  GID  SIG  PRESENT  EXE
> Sun 2015-07-19 07:01:17 CDT     710    0    0    6           /usr/sbin/automount
> Tue 2015-07-21 17:12:00 CDT   23746    0    0   11           /usr/sbin/automount
> Thu 2015-08-06 14:04:16 CDT     714    0    0    4           /usr/sbin/automount
> Thu 2015-10-01 07:01:49 CDT     716    0    0    6           /usr/sbin/automount
> Sat 2015-10-03 02:07:24 CDT    5854    0    0    6           /usr/sbin/automount
> Mon 2015-10-05 15:04:49 CDT    4440    0    0   11        *  /usr/sbin/automount
> Wed 2015-10-07 12:39:02 CDT    7049    0    0    4        *  /usr/sbin/automount
> Wed 2015-10-07 16:25:00 CDT   15852    0    0    4        *  /usr/sbin/automount
>
> I think "PRESENT" and the asterisk mean that the actual core file is still
> available. I can grab one of them if you like.

Ahhh .. right.

I've seen quite a few back traces of this problem. They all appear to be the same. What I most often don't get is a debug log, so I don't have half the information I need to make informed conclusions.

The thing is, this problem looks like automount has released the same ldap connection handle twice but, in spite of looking high and low, I just can't see how that could happen.

One thing I do have is a RHEL-6 package which is the current best chance of possibly resolving this. I could make that available to you to install on a couple of machines showing the most problems, but you would need to build the srpm on a target Fedora for it to work.

Problem is, this has gone on for so long that I have accumulated quite a number of patches that, for various reasons, I think might help, and applying them to the Fedora rpm will be quite time consuming; I don't know if it will resolve the issue anyway.

Let me know if you're interested in carrying out this test.

> I'll experiment with changing the maps a bit, but these maps have to work
> back to EL5 so I have to be careful. In any case, I'll leave things alone
> long enough to hopefully catch a full debug log.

LOL, that's a fair way back.

The EL5 machines won't understand the sss source, so you would have to use the map-name-alone approach and set the appropriate sources in nsswitch.conf if you really wanted to use sss on newer machines.

Ian
I have my own build infrastructure and you can just throw patches over the wall if you like. I can incorporate them into the package and test. Heck, I can even push updates to the Fedora packages if you like.

In any case, it looks like switching entirely to sssd for map access is less trivial than it first appeared, so I'm going to stick with the current setup for at least a little while, which will give me the chance to do some debugging on this issue. Currently, though, I will wait to see if I can capture a crash with a debug log before I start pushing fixes. Nothing so far this morning, but it usually takes a few days and everything was restarted yesterday.
(In reply to Jason Tibbitts from comment #9)
> I have my own build infrastructure and you can just throw patches over the
> wall if you like. I can incorporate them into the package and test. Heck,
> I can even push updates to the Fedora packages if you like.

I guess we could do that; not sure that would be easier for either of us though.

I generally work on things in the upstream source then back-port, and I always work with patches until they are committed to the upstream repository.

I'd need to send patches to bring F22 up to 5.1.1, looks like 5 patches.

And then send the patches in my 5.1.2 queue, that's 50 patches. I could re-order them and get that down to between 36 and 41 patches depending on what bug fixes I drop.

That's why I was thinking of using the current rhel rpm that has selected patches back-ported already.

The real difficulty with this is that, while you've only seen one problem, there seem to be other problems which may or may not be related. These problems might relate to a simple lack of initialization of an automount variable, releasing the ldap context more than once (which I've totally been unable to find), or shared library data segment inconsistency due to soft-reloading of the nss library on fork(). It's been quite hard going.

Ian
(In reply to Ian Kent from comment #10)
> (In reply to Jason Tibbitts from comment #9)
> > I have my own build infrastructure and you can just throw patches over the
> > wall if you like. I can incorporate them into the package and test. Heck,
> > I can even push updates to the Fedora packages if you like.
>
> I guess we could do that, not sure that would be easier
> for either of us though.
>
> I generally work on things in the upstream source then
> back port and I always work with patches until they are
> committed to the upstream repository.
>
> I'd need to send patches to bring F22 up to 5.1.1, looks
> like 5 patches.
>
> And then send the patches in my 5.1.2 queue, that's 50
> patches. I could re-order them and get that down to
> between 36 to 41 patches depending on what bug fixes
> I drop.
>
> That's why I was thinking of using the current rhel rpm
> that has selected patches back ported already.

I'm also tempted to try to use a private branch and make a scratch build but, even though I use that all the time for RHEL, it seemed different in Fedora and I haven't worked out what I need to do for it in Fedora.

Anyway, I have a whole bunch of patches which apply to the current F22 source; your call on which way we go, if we even pursue this at all.

Ian
Well, you can make a scratch build without having a branch or committing anything at all. Just fedpkg srpm and koji build --scratch f22 foo.srpm. Then hand me the URL and I can try it out. The build will be deleted in a couple of weeks. However, please note that I'm leaving in the morning to go on vacation for a week and while I'll be reading email I won't be spending much time on autofs issues. I did have another crash with the same backtrace today, but of course it was on a machine that hadn't had a crash previously so I hadn't turned on debugging output. Aargh. Hopefully we'll catch one soon.
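Spelled out, the sequence I mean is roughly this (a sketch; foo.srpm stands in for whatever fedpkg srpm actually produces in your checkout):

```
fedpkg srpm                          # build the srpm from the current working tree
koji build --scratch f22 foo.srpm    # scratch build against f22; nothing is committed
```

The scratch build task page then gives a URL for the resulting rpms.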
(In reply to Jason Tibbitts from comment #12)
> Well, you can make a scratch build without having a branch or committing
> anything at all. Just fedpkg srpm and koji build --scratch f22 foo.srpm.
> Then hand me the URL and I can try it out. The build will be deleted in a
> couple of weeks.

Right, I have used that before, OK.

> However, please note that I'm leaving in the morning to go on vacation for a
> week and while I'll be reading email I won't be spending much time on autofs
> issues.

Ummm .. I'm not quite sure what that means; morning in what time zone?

Even though I'm confident there won't be a problem with the build I'd hand over (other than possibly the existing one), I'm not sure it would be a good idea to push out such a large change to even only one or two machines right before going on leave.

Ian
(In reply to Jason Tibbitts from comment #12)
> Well, you can make a scratch build without having a branch or committing
> anything at all. Just fedpkg srpm and koji build --scratch f22 foo.srpm.
> Then hand me the URL and I can try it out. The build will be deleted in a
> couple of weeks.

Anyway, just in case you do wish to test this out, I've made a scratch build that can be found at:

https://koji.fedoraproject.org/koji/taskinfo?taskID=11380780

It is essentially a snapshot of the current upstream patches. Almost all the patches have been either verified by rhel customers, run through the regression test suite (110+ tests) I use, or included in test builds given to customers.

The most recent couple of patches have only been run through the regression test suite, and a series to change autofs to use the monotonic clock source hasn't been much tested by me. I can't assume the contributor has tested them either, but they are fairly straightforward.

There's also a change for a security issue that affects the standard environment variable names for program maps, so if you use program maps that expect $USER, $UID, $GID etc. you might not want to use this build, or make sure you set "force_standard_program_map_env = yes" in /etc/autofs.conf to force the existing usage.

Ian
(In reply to Ian Kent from comment #14)
> There's also a change for a security issue that affects the
> standard environment variable names for program maps so if
> you use program maps that expect $USER, $UID, $GID etc. you
> might not want to use this build, or make sure you set
> "force_standard_program_map_env = yes" in /etc/autofs.conf
> to force the existing usage.

Ummm .. now that there's more than one section, you'd need to ensure you put this in the "[ autofs ]" section.

Ian
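That is, with the multi-section autofs.conf in this build it would look like:

```ini
# /etc/autofs.conf
[ autofs ]
force_standard_program_map_env = yes
```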
Thanks; I downloaded the builds but I'm about to head out so I'll try them when I'm back on the job.
(In reply to Jason Tibbitts from comment #16)
> Thanks; I downloaded the builds but I'm about to head out so I'll try them
> when I'm back on the job.

Did you happen to try the rpm provided?
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.