Description of problem:
After upgrading to autofs-5.0.1-0.rc3.29 (and lots of other packages as well :-( ), our systems tend to stop automounting some of the mountpoints specified by NIS maps after some variable time (hours to several days).

Version-Release number of selected component (if applicable):
autofs-5.0.1-0.rc3.29

How reproducible:
Often

Steps to Reproduce:
1. Use the system for some time in such a way that automounting is triggered.

Actual results:
The automounter stops responding for some mountpoints.

Expected results:
Automounting keeps responding.

Additional info:
Created attachment 155705 [details]
Possibly relevant 'ps -eLf', 'gdb' and 'strace' data
Kernel version? Could you post your maps? Could we have a debug log as well, please?

Ian
Created attachment 155783 [details]
/proc/kallsyms
On this particular machine (my development machine): 2.6.21-rc7

The problem has most often manifested itself on: 2.6.20-1.2948.fc6

I have enclosed /proc/kallsyms.

What do you mean by a debug log? (Remember, the problem is not easily reproducible, and I would hate to destroy a specimen of it that I can keep alive as long as I see fit.)
(In reply to comment #4)
> On this particular machine (my development machine): 2.6.21-rc7

Oh really, and you've seen it using this kernel?

> The problem has most often manifested itself on: 2.6.20-1.2948.fc6
>
> I have enclosed /proc/kallsyms.
>
> What do you mean by a debug log? (Remember, the problem is not easily
> reproducible, and I would hate to destroy a specimen of it that I can keep
> alive as long as I see fit.)

Yep, I know it's a big ask and setting debug logging will most likely prevent the bug from occurring but, if you have enough disk space, give it a try if you can. http://people.redhat.com/jmoyer describes what needs to be set.

btw, autofs-5.0.1-0.rc3.30 should be on its way to the testing directory on the mirrors. I'm not sure this will help but please give it a try when it arrives. Or grab the source rpm from Rawhide or F7, build it and give that a try.

Ian
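For reference, enabling autofs debug logging on Fedora usually comes down to two settings; this is a sketch assuming the stock sysconfig and syslog layout (see the page linked above for the authoritative steps):

# /etc/sysconfig/autofs: make the daemon emit debug-level messages
LOGGING="debug"

# /etc/syslog.conf: route daemon debug output to a file with enough space
daemon.*        /var/log/autofs.debug

The daemon has to be restarted to pick up the change, so this only helps catch a future occurrence, not examine an already-hung instance.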
Created attachment 155790 [details]
strace of good mount

I'm still dodging the 'restart automounter' request, since I want to gather as much info from the current hang (the machine has been up for 35 days, the automounter for 18 days before the bug manifested itself). Usually this bug hits various desktop machines for users who don't like me to debug their machines for hours :-(.

So I have submitted two straces, one for a request that works (strace.ok) and one for a failed request (strace.bad). Since the strace of the failed request is totally silent, it seems like there is a pending mount request somewhere deep down in the kernel that is stopping the request from reaching the automounter.
Created attachment 155792 [details]
strace of failed mount
Another point of possible interest (assuming it's a kernel bug):

An affected system stops auto-unmounting unused filesystems. I'll continue to try to understand where that behaviour intersects with a stuck mount.
(In reply to comment #6)
> Created an attachment (id=155790) [edit]
> strace of good mount
>
> I'm still dodging the 'restart automounter' request, since I want to gather
> as much info from the current hang (the machine has been up for 35 days, the
> automounter for 18 days before the bug manifested itself). Usually this bug
> hits various desktop machines for users who don't like me to debug their
> machines for hours :-(.

We probably won't get anything more that's useful from the broken machine; thanks for getting the stuff so far.

We need to record the kernel version of machines that show this problem. We need to confirm whether you see it on machines with 2.6.21 or not. A patch for a race between mount and expire is included in it. It may not be the problem you're seeing but it seems to have come up a few times lately (actually suddenly, I don't know why; it was really hard to reproduce when we were debugging it). The patch was present from 2.6.21-rc1.

> So I have submitted two straces, one for a request that works (strace.ok) and
> one for a failed request (strace.bad). Since the strace of the failed request
> is totally silent, it seems like there is a pending mount request somewhere
> deep down in the kernel that is stopping the request from reaching the
> automounter.

Not sure if we got any extra evidence with these. The symptoms of the race above are that you don't see mount requests coming in when there should be. The straces do appear to indicate that, but this should only happen with specific configurations and specific usage patterns. In some cases it will recover several seconds after it happens (maybe 10) and just start working again.

OTOH the gdb trace looks like we have a new problem, a deadlock, which would have symptoms of stopping any further mounts to that mount point and possibly (it looks probable from the gdb output) other mount points as well.

The package I mentioned above (0.rc3.30) has some significant changes to the map re-reading and identification of changed maps. The gdb trace shows a map re-read happening, and that's why I wanted you to use that for further testing. I'd hate for us to spend heaps of time solving something that's fixed already.

btw, a sysrq-t log dump is often useful.

Lastly, if this is a new race then things get more difficult. Installing the corresponding autofs-debuginfo package so we can get slightly more useful gdb backtrace info would be good. I'll dig up my instructions for what to get. Basically, list the threads and then get a stack trace (bt) for threads that don't look as they should (much like what you did already), essentially ignoring anything waiting in poll and a couple of other worker threads (if they look ok).

What's really important though is your setup, maps, nsswitch and particularly usage patterns. Any of this info is really useful.

So, bummer, having this bug, sorry.

Ian
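The per-thread backtrace collection described above would look roughly like this (a sketch; the PID and thread number are placeholders, and which threads count as suspicious depends on the traces):

# attach gdb to the running daemon
gdb /usr/sbin/automount <pid>
(gdb) info threads          # list all threads
(gdb) thread 5              # select a thread that is not idling in poll
(gdb) bt                    # backtrace for that thread
(gdb) thread apply all bt   # or simply dump backtraces for every thread

# sysrq-t, the kernel-side equivalent: dump all task states to the log
echo t > /proc/sysrq-trigger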
(In reply to comment #8)
> Another point of possible interest (assuming it's a kernel bug):
>
> An affected system stops auto-unmounting unused filesystems. I'll continue to
> try to understand where that behaviour intersects with a stuck mount.

Mmm .. I hadn't noticed that with the race I mentioned above, but we were more interested in fixing the dead mount problem, so it could be related to the same problem. But let's not make assumptions; the gdb output looks like a pthreads deadlock in the user space daemon.

Ian
Ahh, it was the automount maps (silly me).

# Master map for automounter
/home         auto.home         --ghost
/work         auto.work         --ghost
/auto.direct  /etc/auto.direct

# /etc/nsswitch.conf
passwd:     files nis
shadow:     files nis
group:      files nis
hosts:      files dns nis
ethers:     files nis
netmasks:   files
networks:   files
protocols:  files nis
rpc:        files
services:   files nis
netgroup:   files nis
automount:  files nis
aliases:    files

Some more stray thoughts:

Restarting the automounter makes the problem go away (kill -9 required). This indicates that the kernel is waiting for the stuck mount, blocking further requests to that mountpoint from succeeding (back to pointing fingers at automount, I guess).

If I understand you correctly, the problem (probably) resides in automount, since 2.6.21-rc7 [SMP PREEMPT] exhibits the problem. I am slowly bringing all our computers over to autofs-5.0.1-0.rc3.30.
(In reply to comment #11)
> Ahh, it was the automount maps (silly me).
>
> # Master map for automounter
> /home         auto.home         --ghost
> /work         auto.work         --ghost
> /auto.direct  /etc/auto.direct

Are the maps themselves simple maps, i.e. no multi-mount, submount entries or "+" map inclusion?

> # /etc/nsswitch.conf
> passwd:     files nis
> shadow:     files nis
> group:      files nis
> hosts:      files dns nis
> ethers:     files nis
> netmasks:   files
> networks:   files
> protocols:  files nis
> rpc:        files
> services:   files nis
> netgroup:   files nis
> automount:  files nis
> aliases:    files

All normal.

> Some more stray thoughts:
>
> Restarting the automounter makes the problem go away (kill -9 required). This
> indicates that the kernel is waiting for the stuck mount, blocking further
> requests to that mountpoint from succeeding (back to pointing fingers at
> automount, I guess).
>
> If I understand you correctly, the problem (probably) resides in automount,
> since 2.6.21-rc7 [SMP PREEMPT] exhibits the problem. I am slowly bringing all
> our computers over to autofs-5.0.1-0.rc3.30.

Yep, there are no other known problems with the kernel module. I'm keen to see how the update goes. It's been pushed so it should find its way to the testing directory on the mirrors soonish.

Ian
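For context, these are hypothetical examples of the map entry types being asked about (server names and paths invented):

# simple entry: one key, one filesystem
anders    fileserver:/export/home/anders

# multi-mount entry: one key with several offset mounts
proj      /src fileserver:/export/src  /doc fileserver:/export/doc

# "+" map inclusion: pull the entries of another map into this one
+auto.home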
(In reply to comment #11)
> Restarting the automounter makes the problem go away (kill -9 required). This
> indicates that the kernel is waiting for the stuck mount, blocking further
> requests to that mountpoint from succeeding (back to pointing fingers at
> automount, I guess).

OK, that sounds more like a daemon problem, yes.

> If I understand you correctly, the problem (probably) resides in automount,
> since 2.6.21-rc7 [SMP PREEMPT] exhibits the problem. I am slowly bringing all
> our computers over to autofs-5.0.1-0.rc3.30.

It would be good to understand the workload better. The best way for me to work out what's wrong is to be able to duplicate it, so understanding the workload is the first step. I've got a dual core notebook here so I may be able to simulate the workload and just leave it running. That also involves using maps similar to what you have in the same way.

Ian
> Are the maps themselves simple maps, i.e. no multi-mount,
> submount entries or "+" map inclusion?

Yes, mainly people's /home directories (coming from 2 different servers) and /work directories residing on people's workstations (around 50), usually mounted locally, but accessible from all machines.

> Yep, there are no other known problems with the kernel module.
> I'm keen to see how the update goes. It's been pushed so it
> should find its way to the testing directory on the mirrors soonish.

Installed on my development machine :-)

> It would be good to understand the workload better.

Basically random mounts with a huge preference for one /home and one /work directory per machine. I have started the following script on the upgraded machine and one other machine (I will report back if I get any interesting results):

#!/usr/bin/python
import nis
import os
import random
import time

if __name__ == "__main__":
    HOME = nis.cat("auto.home").keys()
    WORK = nis.cat("auto.work").keys()
    while True:
        dir = random.choice(["/home/%s/." % random.choice(HOME),
                             "/work/%s/." % random.choice(WORK)])
        try:
            print dir, os.stat(dir).st_mtime
        except OSError, e:
            print e
        time.sleep(30)
Created attachment 156053 [details]
gdb + strace info of rc3.30 hung machine

A conclusive answer was reached after this weekend's stress test: 6 out of 39 computers running autofs-5.0.1-0.rc3.30 are stuck!
Created attachment 156069 [details]
Better gdb trace (with symbols)

Finally managed to get my hands on the debug symbols (despite the recent release of Fedora 7).

Deadlock:

Thread 11:
  alarm_handler(void *arg):
    alarm_lock()
    state_mutex_lock(ap)

Thread 5:
  expire_cleanup(void *arg):
    state_mutex_lock(ap);
    alarm_add(struct autofs_point *ap, time_t seconds):
      alarm_lock()

And the following lock chain evolves:

11 <-+- 4
 ^   |
 |   +- 6 <-+- 2
 v          |
 5          +- 3 <- 1

To be honest, there is a lot of locking going on, and it seems (to me) like one big lock and condwaits could do the job; if that is not the case, a strict locking hierarchy should be observed!
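The inversion described above is the classic AB-BA pattern: one thread takes the alarm lock and then a state mutex, while another holds that state mutex and asks for the alarm lock. A minimal self-contained sketch of the same shape (names invented; this is not the actual autofs code):

#include <pthread.h>
#include <unistd.h>

/* Stand-ins for alarm_lock's mutex and ap->state_mutex. */
static pthread_mutex_t alarm_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Thread 11's order: alarm lock first, then state lock. */
static void *alarm_handler_like(void *arg)
{
    pthread_mutex_lock(&alarm_mutex);
    sleep(1);                          /* widen the race window */
    pthread_mutex_lock(&state_mutex);  /* blocks: the other thread holds it */
    pthread_mutex_unlock(&state_mutex);
    pthread_mutex_unlock(&alarm_mutex);
    return NULL;
}

/* Thread 5's order: state lock first, then alarm lock -- the inversion. */
static void *expire_cleanup_like(void *arg)
{
    pthread_mutex_lock(&state_mutex);
    sleep(1);
    pthread_mutex_lock(&alarm_mutex);  /* deadlock: each holds what the other wants */
    pthread_mutex_unlock(&alarm_mutex);
    pthread_mutex_unlock(&state_mutex);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, alarm_handler_like, NULL);
    pthread_create(&t2, NULL, expire_cleanup_like, NULL);
    pthread_join(t1, NULL);  /* never returns once each thread holds one lock */
    pthread_join(t2, NULL);
    return 0;
}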
(In reply to comment #16)
> Created an attachment (id=156069) [edit]
> Better gdb trace (with symbols)
>
> Finally managed to get my hands on the debug symbols (despite the
> recent release of Fedora 7).
>
> Deadlock:
>
> Thread 11:
>   alarm_handler(void *arg):
>     alarm_lock()
>     state_mutex_lock(ap)
>
> Thread 5:
>   expire_cleanup(void *arg):
>     state_mutex_lock(ap);
>     alarm_add(struct autofs_point *ap, time_t seconds):
>       alarm_lock()

Mmm .. I thought I got rid of that out of order locking. Thanks for the detective work, I'll go through the code and see what I can find. I'm not sure why I thought I needed to add the alarm under the state mutex.

> And the following lock chain evolves:
>
> 11 <-+- 4
>  ^   |
>  |   +- 6 <-+- 2
>  v          |
>  5          +- 3 <- 1
>
> To be honest, there is a lot of locking going on, and it seems (to me)
> like one big lock and condwaits could do the job; if that is not the case,
> a strict locking hierarchy should be observed!

True, it is too complex and there is plenty of room for improvement. It may be possible to use that approach; I certainly thought about it and started down that path but it rapidly got very difficult. Perhaps in the next version, when what we have settles down and fewer changes are being made.

Ian
Please send the result (patch or similar) as soon as you have something to test, and I'll put it under stress...
As an extra layer of paranoia, how about adding something like this (to detect locks that are not granted within TIMEOUT seconds)? If you think it's a good idea, I could create an actual patch with something that even might compile :-)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

struct watch_queue {
    struct list_head list;
    pthread_mutex_t *mutex;
    pthread_t thread;
    time_t time;
};

static LIST_HEAD(watch);

/* Use this to monitor deadlocks: register the lock about to be taken,
 * block on it, and deregister once it has been granted. */
int watch_pthread_mutex_lock(pthread_mutex_t *m)
{
    int result;
    struct watch_queue w;

    w.mutex = m;
    w.time = time(NULL);
    w.thread = pthread_self();

    pthread_mutex_lock(&mutex);
    list_add(&w.list, &watch);
    pthread_mutex_unlock(&mutex);

    result = pthread_mutex_lock(m);

    pthread_mutex_lock(&mutex);
    list_del(&w.list);
    pthread_mutex_unlock(&mutex);

    return result;
}

/* To be run in a separate thread: periodically report any lock that
 * has been waited on for longer than TIMEOUT seconds. */
void *watch_thread(void *arg)
{
    while (1) {
        struct list_head *p;
        time_t now;

        sleep(10);
        now = time(NULL);

        pthread_mutex_lock(&mutex);
        list_for_each(p, &watch) {
            struct watch_queue *w = list_entry(p, struct watch_queue, list);
            if (now - w->time > TIMEOUT)
                report_or_die();
        }
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}
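Hooking the sketch in would then just be a matter of starting the watcher once and routing lock calls through the wrapper; roughly (using the invented names from the sketch above):

pthread_t watcher;
pthread_create(&watcher, NULL, watch_thread, NULL);  /* start the monitor */
...
watch_pthread_mutex_lock(&ap->state_mutex);  /* instead of pthread_mutex_lock() */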
(In reply to comment #19)
> As an extra layer of paranoia, how about adding something like this (to
> detect locks that are not granted within TIMEOUT seconds)? If you think it's
> a good idea, I could create an actual patch with something that even might
> compile :-)

Ha .. thanks, good idea but I'd really rather not add to the complexity just now.

I've had a quick look and I have a simple patch; I haven't even checked to see if it compiles. I do want to be able to add alarms while holding the state mutex, at least for now. Bearing in mind that I still need to think more about this, would you be willing to give the patch a try anyway? I'll check it compiles and give it a quick run against the autofs connectathon suite first.

Ian
Of course, but I'll be gone for today within 35 minutes, so it will probably be tomorrow!
(In reply to comment #21)
> Of course, but I'll be gone for today within 35 minutes, so it will probably
> be tomorrow!

And I'm out much of tomorrow. I'll post the patch after testing and catch up tomorrow afternoon (probably near your lunch time).

Ian
Probably not, I'm on CET (GMT + 1), but since I don't know your timezone, what do I know...
(In reply to comment #23)
> Probably not, I'm on CET (GMT + 1), but since I don't know your timezone,
> what do I know...

WST (GMT + 8).
Created attachment 156089 [details]
Test patch for the alarm deadlock problem

Can you give this patch a try when you get a chance please? I've run it against the autofs connectathon test suite, but only on a single CPU machine.
Created attachment 156232 [details]
.tar file of various pieces to show presence of [possible] deadlock

I feel confident that it will deadlock (eventually). This is based on actual traces of a short execution on my machine, and what I hope is a correct analysis of the output of that trace. The following output from the autofs_analyze tool on the automount.log file illustrates my point (last entry in analyze.output, which is generated by './autofs_analyze automount.log'):

['0x8002fa68', '0x8004c2a4', '0x8002f740', '0x8004c2a4']
0xb7e886c0 :
    0x8002fa68                           0x8004c2a4
    &master_mutex                        &ap->state_mutex
    master.c:1149 master_mount_mounts    master.c:1192 master_mount_mounts
0xb7e87b90 :
    0x8002f740                           0x8004c2a4
    &mutex                               &ap->state_mutex
    state.c:899 st_queue_handler         state.c:819 run_state_task
0xb7ab6b90 :
    0x8004c2a4                           0x8002f740
    &ap->state_mutex                     &mutex
    state.c:108 expire_cleanup           state.c:876 st_set_done

The deadlock will occur because tasks 0xb7e87b90 and 0xb7ab6b90 take the mutexes in reverse order. There are NO shortcuts to proper lock ordering (that's my experience from more than 20 years of realtime programming), but it can take years for it to actually come and bite you...

It might be that some of the possible deadlocks are impossible due to the lifetime of tasks, but I wouldn't bet on it...
(In reply to comment #26)
> Created an attachment (id=156232) [edit]
> .tar file of various pieces to show presence of [possible] deadlock
>
> ['0x8002fa68', '0x8004c2a4', '0x8002f740', '0x8004c2a4']
> 0xb7e886c0 :
>     0x8002fa68                           0x8004c2a4
>     &master_mutex                        &ap->state_mutex
>     master.c:1149 master_mount_mounts    master.c:1192 master_mount_mounts
> 0xb7e87b90 :
>     0x8002f740                           0x8004c2a4
>     &mutex                               &ap->state_mutex
>     state.c:899 st_queue_handler         state.c:819 run_state_task
> 0xb7ab6b90 :
>     0x8004c2a4                           0x8002f740
>     &ap->state_mutex                     &mutex
>     state.c:108 expire_cleanup           state.c:876 st_set_done
>
> The deadlock will occur because tasks 0xb7e87b90 and 0xb7ab6b90 take the
> mutexes in reverse order. There are NO shortcuts to proper lock ordering
> (that's my experience from more than 20 years of realtime programming), but
> it can take years for it to actually come and bite you...
>
> It might be that some of the possible deadlocks are impossible due to the
> lifetime of tasks, but I wouldn't bet on it...

Good work Anders; give me a bit to work through this but, on the face of it, it looks like your analyzer is a smart little cookie!

Ian
Created attachment 156238 [details]
lock reorder patch

Not smart enough :-(, with the attached patch I get a lot of warnings like:

['0x8004b6a4', '0x8002f660', '0x8002fa80', '0x8004c2a4', '0x8002f660']
0xb7df9b90 :
    0x8004b6a4                        0x8002f660                        0x8002fa80
    &ap->state_mutex                  &suc.mutex                        &instance_mutex
    automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance
0xb7b64b90 :
    0x8004c2a4                        0x8002f660                        0x8002fa80
    &ap->state_mutex                  &suc.mutex                        &instance_mutex
    automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance

I think these are benign (even if I feel a bit queasy about them). But I don't know if it is correct to temporarily drop the ap_lock and then retake it in st_set_done (but this time after st_mutex_lock has been taken).
Created attachment 156240 [details]
new autofs_analyze

OK, modified autofs_analyze tool. It reports the obvious deadlock problems and is quiet about the ones I think are benign. This really needs to be cleaned up and scrutinized if we are to rely on it...
(In reply to comment #28)
> Created an attachment (id=156238) [edit]
> lock reorder patch
>
> Not smart enough :-(, with the attached patch I get a lot of warnings like:
>
> ['0x8004b6a4', '0x8002f660', '0x8002fa80', '0x8004c2a4', '0x8002f660']
> 0xb7df9b90 :
>     0x8004b6a4                        0x8002f660                        0x8002fa80
>     &ap->state_mutex                  &suc.mutex                        &instance_mutex
>     automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance
> 0xb7b64b90 :
>     0x8004c2a4                        0x8002f660                        0x8002fa80
>     &ap->state_mutex                  &suc.mutex                        &instance_mutex
>     automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance
>
> I think these are benign (even if I feel a bit queasy about them).
> But I don't know if it is correct to temporarily drop the ap_lock and then
> retake it in st_set_done (but this time after st_mutex_lock has been taken).

OK, I'll have a look there as well.

The other trace is quite interesting but I'm not sure that deadlock is possible. The only thing I could see (assuming the output from the analyzer is correct, and it looks to be) is a suggested deadlock between st_queue_handler->st_run_task and expire_cleanup->st_set_done.

For a given ap (struct autofs_point, one for each mount) st_run_task is run only when a new task is present and doesn't dequeue another until that one has finished, by calling st_set_done. At least that's the way it's supposed to work. OTOH this assumption may be violated at shutdown, so I'll go through that and check. OTOOH I think there really isn't any need to call st_set_done from under the ap state_mutex so I'm going to move it out anyway.

So I guess what I'm saying is that maybe the analyzer picked invocations of st_run_task and expire_cleanup that each correspond to a different ap and hence a different state_mutex.

Thanks for your effort, I appreciate it very much.

btw, I have no idea how you produced the automount.log, what did you use for that?

Ian
> The other trace is quite interesting but I'm not sure that
> deadlock is possible.

Neither am I, but the risk is that some very fast multiprocessor manages to stumble over it :-)

> btw, I have no idea how you produced the automount.log, what
> did you use for that?

The hack.patch from the earlier .tar file (don't tell anyone I'm responsible for that one :-) )

Good that you pointed to shutdown as an important thing to check; a shutdown exposed the following possible problems:

['0x8002fa68', '0x8002f660', '0x8004c2a4', '0x8002fa68']
0xb7e0b6c0 :
    0x8002fa68                                 0x8002f660
    &master_mutex                              &suc.mutex
    master.c:1149 master_mount_mounts          master.c:999 master_do_mount
0xb7b64b90 :
    0x8004c2a4                        0x8002f660                        0x8002fa80
    &ap->state_mutex                  &suc.mutex                        &instance_mutex
    automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance
0xb7874b90 :
    0x8002fa68                                 0x8004c2a4
    &master_mutex                              &ap->state_mutex
    master.c:939 master_notify_state_change    master.c:948 master_notify_state_change

['0x8002fa68', '0x8002f660', '0x8004b6a4', '0x8002fa68']
0xb7e0b6c0 :
    0x8002fa68                                 0x8002f660
    &master_mutex                              &suc.mutex
    master.c:1149 master_mount_mounts          master.c:999 master_do_mount
0xb7df9b90 :
    0x8004b6a4                        0x8002f660                        0x8002f7ac
    &ap->state_mutex                  &suc.mutex                        &parse_mutex
    automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    nss_parse.y:142 parse_mutex_lock
0xb7874b90 :
    0x8002fa68                                 0x8004b6a4
    &master_mutex                              &ap->state_mutex
    master.c:939 master_notify_state_change    master.c:948 master_notify_state_change

['0x8002fa68', '0x8002f660', '0x8004bc74', '0x8002fa68']
0xb7e0b6c0 :
    0x8002fa68                                 0x8002f660
    &master_mutex                              &suc.mutex
    master.c:1149 master_mount_mounts          master.c:999 master_do_mount
0xb7c65b90 :
    0x8004bc74                        0x8002f660                        0x8002f7ac
    &ap->state_mutex                  &suc.mutex                        &parse_mutex
    automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    nss_parse.y:142 parse_mutex_lock
0xb7874b90 :
    0x8002fa68                                 0x8004bc74
    &master_mutex                              &ap->state_mutex
    master.c:939 master_notify_state_change    master.c:948 master_notify_state_change
(In reply to comment #31)
> > The other trace is quite interesting but I'm not sure that
> > deadlock is possible.
>
> Neither am I, but the risk is that some very fast multiprocessor manages to
> stumble over it :-)

Maybe; you'd be talking about between the call to st_set_done and state_mutex_unlock in expire_cleanup, since I've set the next alarm above it. But it's worse because the user can send a prune event with a signal, so yes, I'll move the st_set_done out from under the lock. I can't see any reason not to move it out but I'm still thinking about it.

> > btw, I have no idea how you produced the automount.log, what
> > did you use for that?
>
> The hack.patch from the earlier .tar file (don't tell anyone I'm responsible
> for that one :-) )

Oh yes, I missed that.

> Good that you pointed to shutdown as an important thing to check; a shutdown
> exposed the following possible problems:

And I'll have a look at these below.

snip ...

Ian
(In reply to comment #32)
> Maybe; you'd be talking about between the call to st_set_done and
> state_mutex_unlock in expire_cleanup, since I've set the next alarm
> above it. But it's worse because the user can send a prune event with
> a signal, so yes, I'll move the st_set_done out from under the lock.

Nope, this deadlock is not possible. As it stands the code is independent of execution order and timing. There's no need to move the st_set_done or add additional state mutex locking.

I think the reason for the false positive is that the watch tool doesn't take account of the state queue handler predicate (struct state_queue) control variables, task->busy and task->done, and it assumes that the ap state_mutex referred to in st_run_task is the same ap state_mutex that is currently held. When st_run_task is called the ap can be any ap except one that has a currently active task, and the one that holds the state mutex is clearly an active task.

Now, onto the shutdown.

Ian
(In reply to comment #31)
> Good that you pointed to shutdown as an important thing to check; a shutdown
> exposed the following possible problems:
>
> ['0x8002fa68', '0x8002f660', '0x8004c2a4', '0x8002fa68']
> 0xb7e0b6c0 :
>     0x8002fa68                                 0x8002f660
>     &master_mutex                              &suc.mutex
>     master.c:1149 master_mount_mounts          master.c:999 master_do_mount
> 0xb7b64b90 :
>     0x8004c2a4                        0x8002f660                        0x8002fa80
>     &ap->state_mutex                  &suc.mutex                        &instance_mutex
>     automount.c:1165 handle_mounts    automount.c:1167 handle_mounts    master.c:353 master_find_source_instance
> 0xb7874b90 :
>     0x8002fa68                                 0x8004c2a4
>     &master_mutex                              &ap->state_mutex
>     master.c:939 master_notify_state_change    master.c:948 master_notify_state_change

I'm having a lot of trouble understanding how this sequence could be a problem. In order for master_notify_state_change to acquire the master_mutex, master_mount_mounts must have completed, in which case the master_mutex will have been released.

It is possible for the last mount (handle_mounts) to not have completed its startup at this time, and if there is an event which calls master_notify_state_change and it is able to take the master_mutex, it may need to wait on an ap state_mutex if it's held.

It looks like the analysis tool doesn't take account of the passage of time and so is a little inaccurate in its reporting.

Ian
> I'm having a lot of trouble understanding how this sequence
> could be a problem.

So have I on closer inspection; once more barking up the wrong tree...

Let's hope that all deadlock chances are removed.

Sorry to bother you with non-problems...
(In reply to comment #35)
> So have I on closer inspection; once more barking up the wrong tree...
>
> Let's hope that all deadlock chances are removed.
>
> Sorry to bother you with non-problems...

Absolutely no problem. There's no need to be sorry. I welcome your efforts to find problems and it gives me the opportunity to focus on validating specific possible problem areas of the code. I've tried to identify all the races and deadlocks but Murphy's law rules, so time will tell.

Thanks
Ian
btw, I'm going to go ahead with the patch we have for this so far, as it's certainly a bug that needs to be fixed.

Ian
Created attachment 156433 [details]
alternate implementation of alarm_handler

(In reply to comment #37)
> btw, I'm going to go ahead with the patch we have for
> this so far, as it's certainly a bug that needs to be
> fixed.

If it's the patch to alarm.c, I have attached an alternate implementation of alarm_handler:

* no code duplication (less risk of reintroducing the bug in one branch :-)
* one extra call to pthread_cond_timedwait per expired alarm, but no call to time(), so I believe performance would be on par with the current implementation
(In reply to comment #38)
> If it's the patch to alarm.c, I have attached an alternate implementation of
> alarm_handler:
> * no code duplication (less risk of reintroducing the bug in one branch :-)
> * one extra call to pthread_cond_timedwait per expired alarm, but no call to
>   time(), so I believe performance would be on par with the current
>   implementation

I'm not sure about this alternate implementation.

Why incur the overhead of possibly unlocking the mutex, waiting (I hope it's smarter than that) and locking the mutex again for an alarm that's already passed?

We have to assume that the mutex will be unlocked and then locked, even if the time has passed, since that's the way condition waits are defined, so it's possible for current to be invalid on return.

So it's not quite right as is.

Ian
(In reply to comment #39)
> I'm not sure about this alternate implementation.
>
> Why incur the overhead of possibly unlocking the mutex, waiting (I hope it's
> smarter than that) and locking the mutex again for an alarm that's already
> passed?
>
> We have to assume that the mutex will be unlocked and then locked, even if
> the time has passed, since that's the way condition waits are defined, so
> it's possible for current to be invalid on return.
>
> So it's not quite right as is.

Also, what happens if the pthread_cond_timedwait is signaled, not timed out, and an alarm has been added with a timeout before the alarm we're waiting for?

Ian
Created attachment 156437 [details]
new try on alarm_handler

(In reply to comment #40)
> Also, what happens if the pthread_cond_timedwait is
> signaled, not timed out, and an alarm has been added
> with a timeout before the alarm we're waiting for?

We start over from the top of the while loop. The new version of alarm_handler addresses the [VERY] valid point that current must be the first element in the alarm list (otherwise alarms fire out of sequence).

I probably should try to keep my mouth shut :-)
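The shape being converged on here is the standard condition-variable timed-wait loop: after every wakeup, whether by timeout or by signal, re-read the head of the list, because a signal may mean an earlier alarm was inserted. A minimal compilable sketch of that pattern (not the actual autofs alarm.c; the list here is a plain sorted singly linked list):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t alarm_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  alarm_cond  = PTHREAD_COND_INITIALIZER;

struct alarm {
    time_t when;
    struct alarm *next;
};
static struct alarm *alarms; /* kept sorted, earliest deadline first */

/* Insert an alarm in deadline order and wake the handler so it
 * re-checks the head of the list. */
void alarm_add(time_t when)
{
    struct alarm *a = malloc(sizeof(*a)), **pp;
    a->when = when;
    pthread_mutex_lock(&alarm_mutex);
    for (pp = &alarms; *pp && (*pp)->when <= when; pp = &(*pp)->next)
        ;
    a->next = *pp;
    *pp = a;
    pthread_cond_signal(&alarm_cond);
    pthread_mutex_unlock(&alarm_mutex);
}

void *alarm_handler(void *arg)
{
    pthread_mutex_lock(&alarm_mutex);
    for (;;) {
        if (!alarms) {
            /* Nothing queued: sleep until alarm_add() signals us. */
            pthread_cond_wait(&alarm_cond, &alarm_mutex);
            continue;
        }
        if (alarms->when <= time(NULL)) {
            struct alarm *a = alarms;
            alarms = a->next;
            /* The real daemon must mind lock order here: firing must not
             * take other locks while alarm_mutex is held, which is exactly
             * the deadlock found earlier in this bug. */
            printf("fired alarm for %ld\n", (long)a->when);
            free(a);
            continue;
        }
        /* Wait for the earliest deadline, but treat timeout and signal
         * identically: loop and re-read the head, since an earlier alarm
         * may have been inserted while we slept. */
        struct timespec ts = { .tv_sec = alarms->when, .tv_nsec = 0 };
        pthread_cond_timedwait(&alarm_cond, &alarm_mutex, &ts);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, alarm_handler, NULL);
    alarm_add(time(NULL) + 2);
    alarm_add(time(NULL) + 1); /* lands ahead of the first: handler must re-check */
    sleep(3);
    return 0;
}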
(In reply to comment #41)
> Created an attachment (id=156437) [edit]
> new try on alarm_handler
>
> We start over from the top of the while loop. The new version of
> alarm_handler addresses the [VERY] valid point that current must be the
> first element in the alarm list (otherwise alarms fire out of sequence).

It looks like it's still not quite right. But we're getting away from fixing a bug here now. If you'd like to contribute to autofs please feel welcome to join the autofs mailing list at http://linux.kernel.org/mailman/listinfo/autofs

I prefer code submissions as unified diffs against either the current release tarball (with all patches from kernel.org applied) or the development git repository.

Tarball: http://www.kernel.org/pub/linux/daemons/autofs/v5/autofs-5.0.1.tar.gz
git: git://git.kernel.org/pub/scm/linux/storage/autofs/autofs.git

Ian
(In reply to comment #42)
> I prefer code submissions as unified diffs against
> either the current release tarball (with all patches
> from kernel.org applied) or the development git
> repository.

OK, will do.
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change the bug to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.

And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers
This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug; we are sorry it could not be fixed.