1245181 – Sanlock fails to set scheduler to SCHED_RR

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	sanlock
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1243935

Reported:	2015-07-21 12:08 UTC by Nir Soffer
Modified:	2015-07-26 15:28 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-07-26 15:28:08 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
sanlock.log (120.14 KB, text/plain) 2015-07-21 12:10 UTC, Nir Soffer

Description Nir Soffer 2015-07-21 12:08:50 UTC

Description of problem:

When sanlock starts up, it fails in sched_setscheduler():

2015-07-13 12:03:03+0300 1622 [13612]: sanlock daemon started 3.2.2 host ade0d225-5bd3-424b-bed5-ca739c40b7dd.bamba.tlv.
2015-07-13 12:03:03+0300 1622 [13612]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
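
For reference, the failing step in the log corresponds to a sched_setscheduler(2) request for SCHED_RR with the SCHED_RESET_ON_FORK flag at priority 99. Below is a minimal Python sketch of the same request. It is not sanlock's actual code (sanlock is written in C); the helper name `request_rr` and its `setter` parameter are hypothetical, added only so the EPERM path can be exercised without touching a real process.

```python
import errno
import os


def request_rr(pid=0, priority=99, setter=os.sched_setscheduler):
    """Ask for SCHED_RR|SCHED_RESET_ON_FORK, mirroring the log message.

    Returns None on success, or the error text on failure.
    `setter` is a hypothetical hook so the real call can be swapped out.
    """
    policy = os.SCHED_RR | os.SCHED_RESET_ON_FORK
    try:
        setter(pid, policy, os.sched_param(priority))
    except OSError as exc:
        return "set scheduler RR|RESET_ON_FORK priority %d failed: %s" % (
            priority, exc.strerror)
    return None


def deny(pid, policy, param):
    # Simulates the kernel refusing the request with EPERM.
    raise PermissionError(errno.EPERM, os.strerror(errno.EPERM))


# Exercise only the simulated failure path; no real scheduling change is made.
print(request_rr(setter=deny))
```

Calling `request_rr()` with the default setter issues the real request for the calling process and normally needs root or an equivalent privilege.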

Version-Release number of selected component (if applicable):
3.2.2

How reproducible:
Always

Steps to Reproduce:
1. Start the sanlock service

This failure does not happen on RHEL 7.1, Fedora 20, or Fedora 21.

This looks like a kernel issue in Fedora 22.

We suspect that running without SCHED_RR may lead to I/O timeouts and unneeded
fencing of the SPM, which fails any operation running on the SPM.

Comment 1 Nir Soffer 2015-07-21 12:10:19 UTC

Created attachment 1054317
sanlock.log

Comment 2 David Teigland 2015-07-21 15:17:40 UTC

This seems a likely cause for the timeouts.

The wdmd daemon also does the same scheduler steps, so I'd expect the same errors from wdmd to be in /var/log/messages.  If not, could you check if wdmd was able to set its scheduling successfully?

Running:
ps ax -o pid,stat,cmd,class,rtprio | grep wdmd

Should show this:
14282 SLs  wdmd                        RR      99
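
The same check can be sketched in Python with os.sched_getscheduler() and os.sched_getparam(), which read the policy and real-time priority that ps reports in its class and rtprio columns. This is only an illustration: the helper name `sched_info` is hypothetical, and the abbreviations follow the ps(1) conventions.

```python
import os

# Map kernel policies to the abbreviations ps prints in its "class" column.
CLASS_NAMES = {
    os.SCHED_OTHER: "TS",
    os.SCHED_FIFO: "FF",
    os.SCHED_RR: "RR",
    os.SCHED_BATCH: "B",
    os.SCHED_IDLE: "IDL",
}


def sched_info(pid=0):
    """Return (class, rtprio) for pid; pid 0 means the calling process."""
    # The kernel may OR SCHED_RESET_ON_FORK into the returned policy; mask it.
    policy = os.sched_getscheduler(pid) & ~os.SCHED_RESET_ON_FORK
    cls = CLASS_NAMES.get(policy, "?")
    if policy in (os.SCHED_FIFO, os.SCHED_RR):
        rtprio = str(os.sched_getparam(pid).sched_priority)
    else:
        rtprio = "-"  # ps shows "-" for non-realtime policies
    return cls, rtprio


print(sched_info())
```

If wdmd set its scheduling successfully, `sched_info()` called with the wdmd pid would be expected to return `("RR", "99")`.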

Comment 3 Nir Soffer 2015-07-21 19:11:05 UTC

(In reply to David Teigland from comment #2)
> This seems a likely cause for the timeouts.
> 
> The wdmd daemon also does the same scheduler steps, so I'd expect the same
> errors from wdmd to be in /var/log/messages.  If not, could you check if
> wdmd was able to set its scheduling successfully?

I see:

# ps axf -o pid,stat,cmd,class,rtprio
  721 SLs  wdmd -G sanlock             RR      99
  723 SLsl sanlock daemon -U sanlock - RR      99
  724 S     \_ sanlock daemon -U sanlo TS       -

I also do not see any errors since yesterday at 01:30. Maybe the issue
disappeared after a reboot?

Comment 4 Nir Soffer 2015-07-21 19:18:18 UTC

I rebooted the host, and I see:

(reboot)

# sanlock.log
2015-07-21 22:10:31+0300 11 [747]: sanlock daemon started 3.2.2 host 708de246-f98b-4f9a-b9b2-de8d8a10a291.bamba.tlv.
2015-07-21 22:10:53+0300 33 [752]: cmd_add_lockspace 3,9 f4f54f47-9ccf-4978-a9a7-12a6d89bf94e:2:/rhev/data-center/mnt/multipass.eng.lab.tlv.redhat.com:_export_images_rnd_ahadas

# ps axf -o pid,stat,cmd,class,rtprio
  742 SLs  wdmd -G sanlock             RR      99
  747 SLsl sanlock daemon -U sanlock - RR      99
  748 S     \_ sanlock daemon -U sanlo TS       -

The test program also works now when I run it.

This seems like a temporary failure that I can no longer reproduce.

How do you suggest we proceed with this?

Comment 5 David Teigland 2015-07-22 14:21:45 UTC

That's good and bad I suppose.  I don't have any clue what could have happened.

Comment 6 Nir Soffer 2015-07-26 15:28:08 UTC

Since the sched_setscheduler(2) issue disappeared, we cannot do much about
this. Closing until we have more data.
