Bug 1245181

Summary:

Sanlock fails to set scheduler to SCHED_RR

Product:

[Fedora] Fedora

Reporter:

Nir Soffer <nsoffer>

Component:

sanlock

Assignee:

David Teigland <teigland>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Priority:

unspecified

Version:

rawhide

CC:

cfeist, fsimonce, teigland

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Last Closed:

2015-07-26 15:28:08 UTC

Type:

Bug

Bug Blocks:

1243935

Attachments:

Description	Flags
sanlock.log	none

Description Nir Soffer 2015-07-21 12:08:50 UTC

Description of problem:

When sanlock starts up, it fails in sched_setscheduler():

2015-07-13 12:03:03+0300 1622 [13612]: sanlock daemon started 3.2.2 host ade0d225-5bd3-424b-bed5-ca739c40b7dd.bamba.tlv.
2015-07-13 12:03:03+0300 1622 [13612]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
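
The request that fails here can be approximated outside sanlock. The following is a minimal sketch, assuming Python 3 on Linux; it is not sanlock's code. The function name try_set_rr and its setter parameter are hypothetical, introduced only so the error path can be exercised without changing a real process. It asks for SCHED_RR with the SCHED_RESET_ON_FORK flag at priority 99 and formats a message similar to the log line above when the kernel refuses.

```python
import os


def try_set_rr(priority=99, pid=0, setter=os.sched_setscheduler):
    """Request SCHED_RR|SCHED_RESET_ON_FORK for pid (0 means the caller).

    Returns (ok, message). `setter` is a hypothetical hook for illustration;
    by default it is the real os.sched_setscheduler().
    """
    policy = os.SCHED_RR | os.SCHED_RESET_ON_FORK
    try:
        setter(pid, policy, os.sched_param(priority))
    except OSError as e:
        # EPERM is reported as "Operation not permitted".
        return False, "set scheduler RR|RESET_ON_FORK priority %d failed: %s" % (
            priority, os.strerror(e.errno))
    return True, "set scheduler RR|RESET_ON_FORK priority %d ok" % priority


if __name__ == "__main__":
    ok, message = try_set_rr()
    print(message)
```

Running this as an unprivileged user and as root on the affected host may help tell a permissions problem apart from a kernel regression.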

Version-Release number of selected component (if applicable):
3.2.2

How reproducible:
Always

Steps to Reproduce:
1. Start sanlock service

This failure does not happen on RHEL 7.1, Fedora 20, or Fedora 21.

It looks like a kernel issue in Fedora 22.

We suspect that running without SCHED_RR may lead to I/O timeouts and unnecessary
fencing of the SPM, which fails every operation running on the SPM.

Comment 1 Nir Soffer 2015-07-21 12:10:19 UTC

Created attachment 1054317
sanlock.log

Comment 2 David Teigland 2015-07-21 15:17:40 UTC

This seems a likely cause for the timeouts.

The wdmd daemon also does the same scheduler steps, so I'd expect the same errors from wdmd to be in /var/log/messages.  If not, could you check if wdmd was able to set its scheduling successfully?

Running:
ps ax -o pid,stat,cmd,class,rtprio | grep wdmd

Should show this:
14282 SLs  wdmd                        RR      99
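
The same check can be made programmatically. This is a rough sketch, assuming Python 3 on Linux; the helper name sched_class is hypothetical, and the short names only mirror the ps "class" column for the three common policies. Note that ps prints "-" for the rtprio of a TS process, while this helper returns 0.

```python
import os

# Short names as printed in the ps "class" column (common policies only).
_CLASS_NAMES = {
    os.SCHED_OTHER: "TS",
    os.SCHED_FIFO: "FF",
    os.SCHED_RR: "RR",
}


def sched_class(pid=0):
    """Return (class_name, rtprio) for pid; pid 0 means the calling process."""
    raw = os.sched_getscheduler(pid)
    # The kernel may report SCHED_RESET_ON_FORK OR'ed into the policy value,
    # so mask it off before looking up the name.
    policy = raw & ~os.SCHED_RESET_ON_FORK
    rtprio = os.sched_getparam(pid).sched_priority
    return _CLASS_NAMES.get(policy, str(policy)), rtprio


if __name__ == "__main__":
    print(sched_class())
```

Passing the pid of wdmd or sanlock instead of 0 should return ("RR", 99) when the scheduler was set successfully.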

Comment 3 Nir Soffer 2015-07-21 19:11:05 UTC

(In reply to David Teigland from comment #2)
> This seems a likely cause for the timeouts.
> 
> The wdmd daemon also does the same scheduler steps, so I'd expect the same
> errors from wdmd to be in /var/log/messages.  If not, could you check if
> wdmd was able to set its scheduling successfully?

I see:

# ps axf -o pid,stat,cmd,class,rtprio
  721 SLs  wdmd -G sanlock             RR      99
  723 SLsl sanlock daemon -U sanlock - RR      99
  724 S     \_ sanlock daemon -U sanlo TS       -

I also do not see any errors since yesterday at 01:30. Maybe the issue
disappeared after a reboot?

Comment 4 Nir Soffer 2015-07-21 19:18:18 UTC

I rebooted the host, and I see:

(reboot)

# sanlock.log
2015-07-21 22:10:31+0300 11 [747]: sanlock daemon started 3.2.2 host 708de246-f98b-4f9a-b9b2-de8d8a10a291.bamba.tlv.
2015-07-21 22:10:53+0300 33 [752]: cmd_add_lockspace 3,9 f4f54f47-9ccf-4978-a9a7-12a6d89bf94e:2:/rhev/data-center/mnt/multipass.eng.lab.tlv.redhat.com:_export_images_rnd_ahadas

# ps axf -o pid,stat,cmd,class,rtprio
  742 SLs  wdmd -G sanlock             RR      99
  747 SLsl sanlock daemon -U sanlock - RR      99
  748 S     \_ sanlock daemon -U sanlo TS       -

And the test program works now when I run it.

This seems to be a temporary failure that I cannot reproduce now.

How do you suggest we proceed with this?

Comment 5 David Teigland 2015-07-22 14:21:45 UTC

That's good and bad I suppose.  I don't have any clue what could have happened.

Comment 6 Nir Soffer 2015-07-26 15:28:08 UTC

Since the sched_setscheduler(2) issue disappeared, we cannot do much about
this. Closing until we have more data.