Bug 1419633 - [GSS] CTDB service on gluster server is stopped and cannot be started
Summary: [GSS] CTDB service on gluster server is stopped and cannot be started
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: ctdb
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Anoop C S
QA Contact: surabhi
URL:
Whiteboard:
Depends On:
Blocks: 1351530
 
Reported: 2017-02-06 15:45 UTC by Otakar Masek
Modified: 2023-03-24 13:45 UTC
CC: 12 users

Fixed In Version: ctdb-4.6.3-4.el7rhgs
Doc Type: Known Issue
Doc Text:
CTDB fails to start on setups where realtime schedulers have been disabled, for example where vdsm is installed. Workaround: enable realtime schedulers with "echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us" and then restart the ctdb service. Refer to the cgroup section of the Red Hat Enterprise Linux System Administrator's Guide (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/System_Administrators_Guide/index.html) to make this change permanent.
Clone Of:
Environment:
Last Closed: 2017-09-14 08:14:45 UTC
Embargoed:


Attachments: none


Links
System ID                                    Last Updated
Red Hat Knowledge Base (Solution) 3673191    2018-11-27 11:01:22 UTC

Description Otakar Masek 2017-02-06 15:45:10 UTC
Description of problem:

Customer has two Gluster nodes that use CTDB for a Samba share. Customer started the service on one node, but on the other node the service does not start; it fails with the following error in /var/log/log.ctdb:

2017/02/02 10:03:48.510907 [ 4898]: CTDB starting on node
2017/02/02 10:03:48.514958 [ 4899]: Starting CTDBD (Version 4.4.5) as PID: 4899
2017/02/02 10:03:48.515105 [ 4899]: Created PID file /run/ctdb/ctdbd.pid
2017/02/02 10:03:48.515144 [ 4899]: Unable to set scheduler to SCHED_FIFO (Operation not permitted)
2017/02/02 10:03:48.515149 [ 4899]: CTDB daemon shutting down
2017/02/02 10:03:49.515388 [ 4899]: Removed PID file /run/ctdb/ctdbd.pid

Version-Release number of selected component (if applicable):

CTDBD (Version 4.4.5)
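
The "Operation not permitted" from setting SCHED_FIFO points at the realtime CPU budget of the cgroup that ctdbd runs in. A minimal check on the failing node (a sketch; the controller path shown is the RHEL 7.3 one and differs on earlier releases):

----
# top-level realtime budget; the RHEL 7 default is 950000
cat /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us
# budget of the slice ctdbd starts in; 0 here means realtime scheduling
# is disallowed and setting SCHED_FIFO fails with EPERM
cat /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
----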

Comment 3 Raghavendra Talur 2017-02-08 08:24:51 UTC
Here are the dots that need to be connected.

What is the problem?
ctdb is not permitted to set the scheduling policy for its threads.
This should not happen, and does not happen with the same systemd unit
files on non-vdsm setups.

What could be the problem?
Maybe vdsm changes something in
"/sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us".


Workaround to be tried:
1. echo 10000 > /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us
2. systemctl stop ctdb.service
3. systemctl start ctdb.service
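
To confirm the workaround took effect, a minimal check (a sketch, assuming the cpu controller path used above; on RHEL 7.3 the directory is cpu,cpuacct instead, as noted in the next comment):

----
cat /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us   # should now print 10000
systemctl status ctdb.service                           # should be active (running)
grep SCHED_FIFO /var/log/log.ctdb                       # the EPERM message should not recur
----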

Comment 4 Raghavendra Talur 2017-02-08 08:25:48 UTC
Otakar replied that the following workaround was sufficient:


-------------------------------------------------------------------------------
Issue fixed after execution of:

1. echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
2. systemctl stop ctdb.service
3. systemctl start ctdb.service
I had to change the value and the cpu.rt_runtime_us file path because the customer's RHEL is 7.3.
---------------------------------------------------------------------------------
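
The echo above only lasts until the next reboot. One way to make it persistent, as the Doc Text suggests via the RHEL cgroup documentation, is a small oneshot unit ordered before ctdb. This is a sketch; the unit name rt-runtime-workaround.service is hypothetical:

----
cat > /etc/systemd/system/rt-runtime-workaround.service <<'EOF'
[Unit]
Description=Restore realtime CPU budget for system.slice (bug 1419633 workaround)
Before=ctdb.service

[Service]
Type=oneshot
# same write the workaround performs by hand
ExecStart=/bin/sh -c 'echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable rt-runtime-workaround.service
----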

Comment 5 Sahina Bose 2017-02-08 09:18:49 UTC
This problem appears on restarting a node after vdsm is installed.
Yaniv, does vdsm change any global systemd settings?

Comment 7 Yaniv Bronhaim 2017-02-08 16:21:29 UTC
It doesn't sound like a systemd issue, but the ctdb service fails to start without setting cpu.rt_runtime_us in the cgroup. I'm not aware of touching this in vdsm scope, but maybe we do as part of the SLA stuff? Check the value before the change; maybe the default in CentOS is wrong and needs an update?

Comment 9 Raghavendra Talur 2017-03-06 10:00:02 UTC
(In reply to Yaniv Bronhaim from comment #7)
> It doesn't sound like a systemd issue, but the ctdb service fails to start
> without setting cpu.rt_runtime_us in the cgroup. I'm not aware of touching
> this in vdsm scope, but maybe we do as part of the SLA stuff? Check the
> value before the change; maybe the default in CentOS is wrong and needs an
> update?

The default in RHEL 7 is 950000.

After a vdsm installation is complete, we see that it has been changed to 0. I am not sure which package makes this change.

Is there a mailing list where we can ask this question? It is for sure related to virt.
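
To narrow down which cgroup in the hierarchy loses its budget after the vdsm installation, one could dump cpu.rt_runtime_us across the whole cpu controller before and after the reboot and diff the two outputs (a sketch):

----
# print every realtime-budget file with its value; run once before
# installing vdsm and once after the reboot, then compare
find /sys/fs/cgroup/cpu,cpuacct -name cpu.rt_runtime_us -exec grep -H . {} + | sort
----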

Comment 10 Yaniv Bronhaim 2017-03-06 12:17:11 UTC
Is this something new? Is it always reproducible after installing vdsm?
I tried to reproduce it on CentOS 7.2, and the file was not set at all after vdsm installation.

I tried on CentOS 7.3 to remove and re-install vdsm after setting it to 950000, and it is not changed. Same if I reinstalled libvirt.


---- snip
[root@localhost ~]# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 
[root@localhost ~]# cat /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
cat: /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us: No such file or directory

[root@localhost ~]# rpm -qa | grep vdsm
vdsm-jsonrpc-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-api-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-client-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-python-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-yajsonrpc-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-4.20.0-422.git13530cc.el7.centos.x86_64
vdsm-tests-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-xmlrpc-4.20.0-422.git13530cc.el7.centos.noarch
vdsm-hook-vmfex-dev-4.20.0-422.git13530cc.el7.centos.noarch
----

Comment 11 Raghavendra Talur 2017-03-07 12:40:08 UTC
(In reply to Yaniv Bronhaim from comment #10)
> Is this something new? Is it always reproducible after installing vdsm?
> I tried to reproduce it on CentOS 7.2, and the file was not set at all
> after vdsm installation.
> 

It is NOT new. However, as you have observed, the file was named differently until RHEL 7.2. I don't remember the exact path, but it was certainly under /sys/fs/cgroup/.


> I tried on CentOS 7.3 to remove and re-install vdsm after setting it to
> 950000, and it is not changed. Same if I reinstalled libvirt.

I think you did not perform a restart. I have not yet figured out how systemd and cgroups interact, but after the vdsm and virt packages are installed and the machine is restarted, this option changes. Maybe there is some other config file that is changed.


Comment 12 Raghavendra Talur 2017-03-08 12:59:56 UTC
This link has the best possible info on the cgroup for realtime cpu and systemd interaction.
https://www.freedesktop.org/wiki/Software/systemd/MyServiceCantGetRealtime/
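
The page linked above explains that systemd does not assign realtime runtime to the slices it creates, so a unit that wants SCHED_FIFO inside a CPU-controlled cgroup needs the budget granted explicitly. An alternative to a separate oneshot unit is a drop-in that grants the budget right before ctdbd starts; a sketch, with a hypothetical drop-in name:

----
mkdir -p /etc/systemd/system/ctdb.service.d
cat > /etc/systemd/system/ctdb.service.d/rt-budget.conf <<'EOF'
[Service]
# grant system.slice a realtime budget before ctdbd tries SCHED_FIFO
ExecStartPre=/bin/sh -c 'echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'
EOF
systemctl daemon-reload
systemctl restart ctdb.service
----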

Comment 13 Yaniv Bronhaim 2017-03-09 10:47:24 UTC
I tried now with a fresh installation of the latest CentOS: ran yum upgrade, deployed using ovirt-engine, rebooted the host, and still the file does not exist at all.

I saw this 950000 value in some setups, but I can't reproduce the description with vdsm and engine, on both 4.1 and master code. So I assume it's not changed by vdsm rpm installation or the deploy flow.

Comment 14 Bhavana 2017-03-13 16:23:26 UTC
Updated the doc text slightly for the release notes.

Comment 19 Anoop C S 2017-09-01 09:43:16 UTC
Hi Otakar,

Is there anything pending from the Engineering side? I have provided the solution in comment #16. Can you please confirm whether or not it worked for the customer?

