Description of problem:

When sanlock starts up, it fails in sched_setscheduler():

2015-07-13 12:03:03+0300 1622 [13612]: sanlock daemon started 3.2.2 host ade0d225-5bd3-424b-bed5-ca739c40b7dd.bamba.tlv.
2015-07-13 12:03:03+0300 1622 [13612]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted

Version-Release number of selected component (if applicable):
3.2.2

How reproducible:
Always

Steps to Reproduce:
1. Start sanlock service

This failure does not happen on RHEL 7.1, Fedora 20, or Fedora 21. It looks like a kernel issue in Fedora 22.

We suspect that running without SCHED_RR may lead to I/O timeouts and unneeded fencing of the SPM, which fails any operation running on the SPM.
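For readers unfamiliar with the call that fails here, the following is a minimal sketch of a request for SCHED_RR with the reset-on-fork flag at the maximum priority. It is not sanlock's source: the function name try_set_rr_max() is hypothetical, and the message format is only modelled on the log line above. On Linux, sched_setscheduler(2) documents EPERM as the error returned when the caller lacks the privileges needed for the requested policy.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc exposes SCHED_RESET_ON_FORK under this */
#endif
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000   /* Linux value, in case the header hides it */
#endif

/*
 * Hypothetical helper (an assumption, not sanlock's code): ask for
 * SCHED_RR at the maximum priority, with the policy reset in children
 * created by fork(). Returns 0 on success or a negative errno value.
 */
static int try_set_rr_max(void)
{
    struct sched_param param;
    int prio = sched_get_priority_max(SCHED_RR);   /* 99 on Linux */

    if (prio < 0)
        return -errno;

    memset(&param, 0, sizeof(param));
    param.sched_priority = prio;

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_RR | SCHED_RESET_ON_FORK, &param) < 0) {
        int err = errno;   /* save errno before fprintf can change it */

        fprintf(stderr,
                "set scheduler RR|RESET_ON_FORK priority %d failed: %s\n",
                prio, strerror(err));
        return -err;
    }
    return 0;
}
```

Calling try_set_rr_max() from a small test program, both as root and as an unprivileged user, is one way to check whether the failure is specific to the sanlock service environment or affects any process on the host.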
Created attachment 1054317: sanlock.log
This seems a likely cause for the timeouts.

The wdmd daemon also does the same scheduler steps, so I'd expect the same errors from wdmd to be in /var/log/messages. If not, could you check if wdmd was able to set its scheduling successfully?

Running:

  ps ax -o pid,stat,cmd,class,rtprio | grep wdmd

should show this:

  14282 SLs wdmd RR 99
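The same check can be made from a program instead of ps. The sketch below reads a process's policy and real-time priority with sched_getscheduler(2) and sched_getparam(2) and maps the policy to the abbreviation ps prints in its CLS column. The helper names sched_class_name() and get_sched_info() are hypothetical, and only the three classes relevant to this report are mapped.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc exposes SCHED_RESET_ON_FORK under this */
#endif
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000   /* Linux value, in case the header hides it */
#endif

/* Map a policy value to the abbreviation ps prints in its CLS column. */
static const char *sched_class_name(int policy)
{
    /* The reset-on-fork flag may be OR'ed into the returned policy. */
    switch (policy & ~SCHED_RESET_ON_FORK) {
    case SCHED_OTHER: return "TS";
    case SCHED_FIFO:  return "FF";
    case SCHED_RR:    return "RR";
    default:          return "?";    /* other classes are not mapped here */
    }
}

/*
 * Hypothetical helper: fetch the scheduling class and real-time
 * priority of a process (pid 0 means the caller).
 * Returns 0 on success or a negative errno value.
 * Note: for non-real-time classes the priority is 0, where ps prints "-".
 */
static int get_sched_info(pid_t pid, const char **cls, int *rtprio)
{
    struct sched_param param;
    int policy = sched_getscheduler(pid);

    if (policy < 0)
        return -errno;
    if (sched_getparam(pid, &param) < 0)
        return -errno;

    *cls = sched_class_name(policy);
    *rtprio = param.sched_priority;
    return 0;
}
```

For a wdmd process that set its scheduling successfully, get_sched_info() with its pid would be expected to report "RR" and 99, matching the ps output.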
(In reply to David Teigland from comment #2)
> This seems a likely cause for the timeouts.
>
> The wdmd daemon also does the same scheduler steps, so I'd expect the same
> errors from wdmd to be in /var/log/messages. If not, could you check if
> wdmd was able to set its scheduling successfully?

I see:

# ps axf -o pid,stat,cmd,class,rtprio
  721 SLs  wdmd -G sanlock              RR  99
  723 SLsl sanlock daemon -U sanlock -  RR  99
  724 S     \_ sanlock daemon -U sanlo  TS   -

And I also do not see any error after yesterday at 01:30 - maybe the issue disappeared after reboot?
I rebooted the host, and I see:

(reboot)

# sanlock.log
2015-07-21 22:10:31+0300 11 [747]: sanlock daemon started 3.2.2 host 708de246-f98b-4f9a-b9b2-de8d8a10a291.bamba.tlv.
2015-07-21 22:10:53+0300 33 [752]: cmd_add_lockspace 3,9 f4f54f47-9ccf-4978-a9a7-12a6d89bf94e:2:/rhev/data-center/mnt/multipass.eng.lab.tlv.redhat.com:_export_images_rnd_ahadas

# ps axf -o pid,stat,cmd,class,rtprio
  742 SLs  wdmd -G sanlock              RR  99
  747 SLsl sanlock daemon -U sanlock -  RR  99
  748 S     \_ sanlock daemon -U sanlo  TS   -

And when running the tests program, it works now.

This seems to have been a temporary failure that I cannot reproduce now. How do you suggest we proceed with this?
That's good and bad I suppose. I don't have any clue what could have happened.
Since the sched_setscheduler(2) issue disappeared, we cannot do much about this. Closing until we have more data.