Bug 1444728

Summary: Pacemaker does not detect a change in CPU count while it is running.
Product: Red Hat Enterprise Linux 7    Reporter: Xu Yin <xyin>
Component: pacemaker    Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA    QA Contact: cluster-qe <cluster-qe>
Severity: low
Priority: medium
Version: 7.3    CC: abeekhof, cluster-maint, mnovacek, qguo
Target Milestone: rc   
Target Release: 7.4   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: pacemaker-1.1.16-9.el7 Doc Type: Enhancement
Doc Text:
Feature: Pacemaker will now detect CPU cores that have been hot-plugged into a virtual machine used as a cluster node. Reason: Pacemaker throttles the number of actions it will execute concurrently based on the system load and the number of CPU cores. Result: Pacemaker's action throttling is now more accurate when CPU cores are hot-plugged.
Last Closed: 2017-08-01 17:54:39 UTC Type: Bug
Attachments: Test Time: April 25, 13:20

Description Xu Yin 2017-04-24 06:43:33 UTC
Created attachment 1273519 [details]
Test Time:  April 25, 13:20

Description of problem:
Pacemaker does not detect a change in CPU count while it is running.

Version-Release number of selected component (if applicable):
The same issue occurs in my test environment and in the customer's environment.
My test env:
~~~
pacemaker-1.1.15-11.el7_3.2.x86_64
pacemaker-cluster-libs-1.1.15-11.el7_3.2.x86_64
pacemaker-libs-1.1.15-11.el7_3.2.x86_64
pacemaker-cli-1.1.15-11.el7_3.2.x86_64
~~~
Customer's env:
~~~
pacemaker-1.1.13-10.el7.x86_64
pacemaker-cli-1.1.13-10.el7.x86_64
pacemaker-cluster-libs-1.1.13-10.el7.x86_64
pacemaker-libs-1.1.13-10.el7.x86_64
~~~

How reproducible:


Steps to Reproduce:
1. Hot-plug CPUs into the test VM. In my test case, I changed the CPU count from 2 to 12. The threshold for the "notice: High CPU load detected: XX" message is calculated as the CPU count * 1.6, so before adding CPUs it should be 3.2, and after the change it should be 19.2. Verify that the CPUs were added successfully:
~~~
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(4) Converting physical 4 to logical package 2
Apr 24 13:21:50 pcmk1 kernel: CPU2 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(6) Converting physical 6 to logical package 3
Apr 24 13:21:50 pcmk1 kernel: CPU3 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(8) Converting physical 8 to logical package 4
Apr 24 13:21:50 pcmk1 kernel: CPU4 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(a) Converting physical 10 to logical package 5
Apr 24 13:21:50 pcmk1 kernel: CPU5 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(c) Converting physical 12 to logical package 6
Apr 24 13:21:50 pcmk1 kernel: CPU6 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(e) Converting physical 14 to logical package 7
Apr 24 13:21:50 pcmk1 kernel: CPU7 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(10) Converting physical 16 to logical package 8
Apr 24 13:21:50 pcmk1 kernel: CPU8 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(12) Converting physical 18 to logical package 9
Apr 24 13:21:50 pcmk1 kernel: CPU9 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(14) Converting physical 20 to logical package 10
Apr 24 13:21:50 pcmk1 kernel: CPU10 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: APIC(16) Converting physical 22 to logical package 11
Apr 24 13:21:50 pcmk1 kernel: CPU11 has been hot-added
Apr 24 13:21:50 pcmk1 kernel: smpboot: Booting Node 0 Processor 10 APIC 0x14
Apr 24 13:21:50 pcmk1 kernel: Disabled fast string operations
Apr 24 13:21:50 pcmk1 kernel: mce: CPU supports 0 MCE banks
...

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 2047 MB
node 0 free: 915 MB
node distances:
node   0 
  0:  10 
~~~

2. Run something to generate load; in my test case, I used a simple script:
~~~
# cat /tmp/workload 
#!/bin/python
while True:
	pass
~~~
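
For the 12-core scenario it helps to saturate more than one core. Here is a minimal per-core variant (my own sketch, not part of the original reproducer; it uses only the Python standard library):
~~~
#!/usr/bin/python
# Sketch: spawn one busy loop per core currently visible to the OS,
# so the load average climbs quickly past the per-core threshold.
import multiprocessing

def spin():
    while True:
        pass

if __name__ == '__main__':
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=spin).start()
~~~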

3. You will notice the "High CPU load" messages in /var/log/messages, like the following:
~~~
Apr 24 13:29:16 pcmk1 crmd[1840]:  notice: High CPU load detected: 7.600000
Apr 24 13:29:45 pcmk1 root: zuodasi
Apr 24 13:29:46 pcmk1 crmd[1840]:  notice: High CPU load detected: 7.760000
....
~~~
but with 12 cores the above notice should only be printed when the load is higher than 19.2 (12 cores * 1.6).
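
For reference, a minimal sketch of the behavior as I understand it (the 1.6-per-core factor is my observation from the logs, not a confirmed constant in the pacemaker source):
~~~
#!/usr/bin/python
# Sketch of the observed behavior: the "High CPU load detected"
# threshold appears to scale with the core count (~1.6 per core).
# The factor and the stale cached core count are assumptions
# inferred from the log messages above.
PER_CORE_FACTOR = 1.6

def high_load(load_1min, cores):
    return load_1min > cores * PER_CORE_FACTOR

print(high_load(7.6, 2))   # True:  stale 2-core count -> threshold 3.2 (what we see)
print(high_load(7.6, 12))  # False: 12-core count -> threshold 19.2 (what we expect)
~~~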

4. Pacemaker can be made to re-read the CPU count by restarting it; after the restart, the threshold appears to be correct.
~~~
...
Apr 24 13:29:16 pcmk1 crmd[1840]:  notice: High CPU load detected: 7.600000
Apr 24 13:29:45 pcmk1 root: zuodasi
Apr 24 13:29:46 pcmk1 crmd[1840]:  notice: High CPU load detected: 7.760000
Apr 24 13:29:53 pcmk1 systemd: Stopping Pacemaker High Availability Cluster Manager...
Apr 24 13:29:53 pcmk1 pacemakerd[1773]:  notice: Caught 'Terminated' signal
Apr 24 13:29:53 pcmk1 pacemakerd[1773]:  notice: Shutting down Pacemaker
Apr 24 13:29:53 pcmk1 pacemakerd[1773]:  notice: Stopping crmd
...
Apr 24 13:29:59 pcmk1 systemd: Starting Corosync Cluster Engine...
...
Apr 24 13:32:31 pcmk1 crmd[8319]:  notice: High CPU load detected: 24.120001
Apr 24 13:33:01 pcmk1 crmd[8319]:  notice: High CPU load detected: 29.160000
...
~~~

Actual results:
Pacemaker does not notice that the CPU count has changed.

Expected results:
Pacemaker should recalculate its load thresholds after CPUs or other hardware resources are hot-plugged.

Additional info:

An sosreport has been uploaded; please help check it.
Test Time:  April 25, 13:20
Environment: 
ESXi6
RHEL7.3 (3.10.0-514.el7.x86_64)
pacemaker-1.1.15-11.el7_3.2.x86_64

Comment 2 Ken Gaillot 2017-04-24 19:03:27 UTC
Hi,

I thought Pacemaker checked the CPU count each time, but you're right: it's cached after the first check. That makes sense, since the check can run frequently, and incurring extra I/O and processing on every check, for changes that are extremely rare, would be inefficient.

My planned solution is:

* Pacemaker can subscribe to udev notifications of hot-plug events. This would automatically keep the CPU count current, as long as the libgudev1 library package is available and udev is functioning properly (see the sketch after this list).

* We could implement a user command to manually refresh the CPU count, something like "pcs cluster cpu-update". This would be a fallback method if the automatic method was unavailable or not working.
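
To illustrate the first idea: Pacemaker itself would do this in C with libgudev, but the mechanism looks roughly like this pyudev sketch (illustrative only; the 'cpu' subsystem filter and the event handling are my assumptions):
~~~
#!/usr/bin/python
# Illustrative sketch of subscribing to CPU hot-plug events via udev.
# pyudev is used here only to show the mechanism; the actual daemon
# would use libgudev in C.
import pyudev

context = pyudev.Context()
monitor = pyudev.Monitor.from_netlink(context)
monitor.filter_by(subsystem='cpu')

for device in iter(monitor.poll, None):
    # On a hot-add/online event, a daemon would refresh its cached
    # core count here instead of printing.
    print('%s: %s' % (device.action, device.sys_path))
~~~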

Comment 5 Ken Gaillot 2017-04-25 16:29:35 UTC
I've thought of a cleaner solution that doesn't add any overhead:

We already process /proc/stat every throttle check, to get the I/O load. That file has an identifiable line for each CPU, so we can count the cores there.
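
For example, counting the per-CPU lines (a minimal sketch of the idea, not the actual C implementation):
~~~
#!/usr/bin/python
# Sketch: count cores from /proc/stat. Each online CPU has a "cpuN"
# line; the first "cpu" line (no digit) is the aggregate and is skipped.
import re

def count_cores():
    with open('/proc/stat') as f:
        return sum(1 for line in f if re.match(r'cpu\d+\s', line))

print(count_cores())
~~~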

Comment 9 Ken Gaillot 2017-04-28 20:38:10 UTC
Fixed upstream: https://github.com/ClusterLabs/pacemaker/pull/1267

The CPU count is now updated from /proc/stat every time it is needed. The I/O checks were determined to be unnecessary and removed.

QA: How to test:

Prep:
1. Configure a cluster of at least 2 VMs (as full cluster nodes) on a single host (which is not part of the cluster). For this example, I'll assume all VMs are created with 1 vcpu each. It's fine to use a different number and adjust the commands accordingly.
2. Choose one of the VMs (I'll call it $VM). Ensure it is stopped.
3. On the host, give the VM an additional CPU with:

        virsh setvcpus $VM 2 --maximum --config

   Confirm the change with:

        virsh dumpxml $VM | grep vcpu

    which will show something like:

        <vcpu placement='static' current='1'>2</vcpu>

4. Start $VM and login. Verify the number of CPUs is 1 (e.g. grep processor /proc/cpuinfo). Edit /etc/sysconfig/pacemaker and set PCMK_debug=crmd to get more throttling messages in the pacemaker detail log.

Repeat this test with the old and new packages:
1. On the VM, install the packages, reboot, and login.
2. Verify that the VM has 1 cpu.
3. pcs cluster start --all
4. Watch the pacemaker detail log for messages containing "throttle".
   Before the fix, you will see messages like:

        crmd:    debug: throttle_load_avg:        Current load is 0.270000 (full: 0.27 0.15 0.10 3/159 15885)
        crmd:    debug: throttle_io_load: Current IO load is 0.000000

   The "IO load" will always be 0. After the fix, you will see:

        crmd:    debug: throttle_mode:    Current load is 0.000000 across 1 core(s)

   with no IO load messages (see the sketch after these steps). The "Current load" is the
   1-minute load average as reported by /proc/loadavg, top, etc.
5. Do something to generate CPU load. This can be as simple as "while true; do true; done", or
   using the "stress" command, or anything else. Watch the "Current load" go up in the log.
   If the load goes up enough, you will see messages like:

        crmd:   notice: throttle_check_thresholds:      High CPU load detected: 1.180000
        crmd:     info: throttle_send_command:  New throttle mode: 0100 (was 0000)
        crmd:    debug: throttle_update:        Host rhel7-1 supports a maximum of 2 jobs and throttle mode 0100.  New job limit is 1

6. On the host, hot-plug an additional CPU with:

        virsh setvcpus $VM 2

7. On the VM, verify the number of CPUs again. Continue watching the logs; with the older packages, nothing will change. With the new packages, the log message will show the correct number of cores, and the throttle mode should change back.

It would be worthwhile to check the new packages on all supported architectures just to verify that the load and number of cores are reported correctly (even after hot-plugging, if supported).
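
For comparison with the throttle_mode log line, QA can read the same two inputs directly (a sketch, not the actual crmd code):
~~~
#!/usr/bin/python
# Sketch: reproduce the inputs to the "Current load is X across N core(s)"
# message -- the 1-minute load average from /proc/loadavg and the core
# count from /proc/stat.
import re

with open('/proc/loadavg') as f:
    load1 = float(f.read().split()[0])

with open('/proc/stat') as f:
    cores = sum(1 for line in f if re.match(r'cpu\d+\s', line))

print('Current load is %f across %d core(s)' % (load1, cores))
~~~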

Comment 11 Ken Gaillot 2017-05-09 21:55:51 UTC
Docs: not sure if this is worth a release note or not.

Comment 13 michal novacek 2017-05-24 14:18:07 UTC
I have verified that hot-added CPUs are correctly detected by pacemaker-1.1.16-9.

----


[root@big-02 ~]# virsh setvcpus virt-135.cluster-qe.lab.eng.brq.redhat.com 2 --maximum --config

[root@big-02 ~]# virsh dumpxml virt-135.cluster-qe.lab.eng.brq.redhat.com | grep \<vcpu
  <vcpu placement='static' current='1'>2</vcpu>

[root@big-02 ~]# virsh start virt-135.cluster-qe.lab.eng.brq.redhat.com
Domain virt-135.cluster-qe.lab.eng.brq.redhat.com started

...

[root@virt-135 ~]# grep 'core id' /proc/cpuinfo 
core id         : 0

[root@virt-135 ~]# pcs cluster stop --all
virt-135: Stopping Cluster (pacemaker)...
virt-134: Stopping Cluster (pacemaker)...
virt-136: Stopping Cluster (pacemaker)...
virt-135: Stopping Cluster (corosync)...
virt-134: Stopping Cluster (corosync)...
virt-136: Stopping Cluster (corosync)...

[root@virt-135 ~]# ( while /bin/true; do yes > /dev/null; done ) &


before the fix (pacemaker-1.1.16-1.el7.x86_64)
==============================================

[root@virt-135 ~]# pcs cluster start --all
virt-136: Starting Cluster...
virt-134: Starting Cluster...
virt-135: Starting Cluster...


/var/log/cluster/corosync.log:
...
May 24 16:10:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_cib_load:      cib load: 0.000333 (1 ticks in 30s)
May 24 16:10:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_load_avg:      Current load is 0.050000 (full: 0.05 0.03 0.05 2/166 8967)
May 24 16:10:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_io_load:       Current IO load is 0.000000

# insert new cpu
[root@big-01 ~]# virsh setvcpus virt-135.cluster-qe.lab.eng.brq.redhat.com 2

[root@virt-135 ~]# grep 'core id' /proc/cpuinfo 
core id         : 0
core id         : 0

May 24 16:14:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_io_load:       Current IO load is 0.000000
May 24 16:14:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: throttle_send_command:  New throttle mode: 0001 (was 0000)
May 24 16:14:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_update:        Host virt-135 supports a maximum of 2 jobs and throttle mode 0001.  New job limit is 1
May 24 16:14:59 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_cib_load:      cib load: 0.000000 (0 ticks in 30s)
May 24 16:14:59 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_load_avg:      Current load is 0.800000 (full: 0.80 0.32 0.15 4/175 9630)
May 24 16:14:59 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_handle_load:   Noticeable CPU load detected: 0.800000
May 24 16:14:59 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_io_load:       Current IO load is 0.000000
May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_cib_load:      cib load: 0.000000 (0 ticks in 30s)
May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_load_avg:      Current load is 0.880000 (full: 0.88 0.39 0.18 3/175 9703)
May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: throttle_handle_load:   Moderate CPU load detected: 0.880000
May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_io_load:       Current IO load is 0.000000
May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: throttle_send_command:  New throttle mode: 0010 (was 0001)
>> May 24 16:15:29 [8485] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_update:        Host virt-135 supports a maximum of 2 jobs and throttle mode 0010.  New job limit is 1

no cpu addition detected


after the fix (pacemaker-1.1.16-9)
==================================


/var/log/cluster/corosync.log:
...
May 24 14:40:10 [2629] virt-135       crmd:    debug: throttle_mode:  Current load is 0.170000 across 1 core(s)
May 24 14:40:10 [2629] virt-135       crmd:     info: throttle_send_command:  New throttle mode: 0000 (was ffffffff)
May 24 14:40:10 [2629] virt-135       crmd:    debug: throttle_update:        Host virt-135 supports a maximum of 2 jobs and throttle mode 0000.  New job limit is 2
May 24 14:40:40 [2629] virt-135       crmd:    debug: throttle_cib_load:      cib load: 0.000333 (1 ticks in 30s)
May 24 14:40:40 [2629] virt-135       crmd:    debug: throttle_mode:  Current load is 0.100000 across 1 core(s)
May 24 14:41:10 [2629] virt-135       crmd:    debug: throttle_cib_load:      cib load: 0.000333 (1 ticks in 30s)
May 24 14:41:10 [2629] virt-135       crmd:    debug: throttle_mode:  Current load is 0.060000 across 1 core(s)

[root@big-01 ~]# virsh setvcpus virt-135.cluster-qe.lab.eng.brq.redhat.com 2

[root@virt-135 ~]# grep 'core id' /proc/cpuinfo 
core id         : 0
core id         : 0

/var/log/cluster/corosync.log:
...
May 24 14:57:41 [2629] virt-135       crmd:     info: throttle_send_command:  New throttle mode: 0000 (was 0100)
>> May 24 14:57:41 [2629] virt-135       crmd:    debug: throttle_update:        Host virt-135 supports a maximum of 2 jobs and throttle mode 0000.  New job limit is 2
May 24 14:58:11 [2629] virt-135       crmd:    debug: throttle_cib_load:      cib load: 0.000000 (0 ticks in 30s)
May 24 14:58:11 [2629] virt-135       crmd:    debug: throttle_mode:  Current load is 2.420000 across 2 core(s)
...

[root@big-01 ~]# virsh setvcpus virt-135.cluster-qe.lab.eng.brq.redhat.com 1
...
May 24 15:08:41 [2629] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_cib_load:      cib load: 0.000000 (0 ticks in 30s)
May 24 15:08:41 [2629] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_mode:  Current load is 2.870000 across 1 core(s)
May 24 15:08:41 [2629] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: throttle_check_thresholds:      High CPU load detected: 2.870000
May 24 15:08:41 [2629] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: throttle_send_command:  New throttle mode: 0100 (was 0010)
May 24 15:08:41 [2629] virt-135.cluster-qe.lab.eng.brq.redhat.com       crmd:    debug: throttle_update:        Host virt-135 supports a maximum of 2 jobs and throttle mode 0100.  New job limit is 1
...

Comment 14 errata-xmlrpc 2017-08-01 17:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862