Bug 1299091 - Multiple active attempts to start kdump
Summary: Multiple active attempts to start kdump
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kexec-tools
Version: 7.3
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: rc
Target Release: 7.3
Assignee: Xunlei Pang
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1366034 1394638
 
Reported: 2016-01-15 22:35 UTC by George Beshers
Modified: 2017-03-03 05:45 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-04 17:06:26 UTC
Target Upstream Version:



Description George Beshers 2016-01-15 22:35:32 UTC
Description of problem:
   We have observed that, with the parallel initialization of page tables
   (see bz727269, just posted), the delayed initialization by kswapd
   triggers the hotplug mechanism for systemd, which then sends a kill to
   the current kdump and starts a new one.  Since several CPUs add memory
   simultaneously on a large NUMA system, the situation gets messy.



Comment 2 Lukáš Nykrýn 2016-01-18 11:16:30 UTC
I assume we are talking about 98-kexec.rules, which belongs to kexec-tools.

Comment 4 Dave Young 2016-06-21 03:11:52 UTC
Sorry, we have no bandwidth to work on this in 7.3; we will work on it in 7.4.

Comment 5 George Beshers 2016-08-03 02:53:51 UTC
This has been resolved by patches included in 727269.

Comment 6 George Beshers 2016-08-18 17:17:53 UTC
This has become an issue again.
Have not tracked down why.

I marked it as a regression but only medium priority
because I have not seen it cause kdump to not be
functional after the system has settled.

That is, when all is done..

[root@harp50-sys ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: activating (start) since Wed 2016-08-17 17:49:06 CDT; 422ms ago
  Process: 197224 ExecStop=/usr/bin/kdumpctl stop (code=exited, status=0/SUCCESS)
 Main PID: 197907 (kdumpctl)
   CGroup: /system.slice/kdump.service
           ├─197907 /bin/sh /usr/bin/kdumpctl start
           └─197909 /bin/sh /usr/bin/kdumpctl start


But this is annoying....


[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started LSB: gr_systat setup.
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Started Crash recovery kernel arming.
         Stopping Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...
[  OK  ] Stopped Crash recovery kernel arming.
         Starting Crash recovery kernel arming...

Comment 8 Xunlei Pang 2016-08-30 13:04:23 UTC
(In reply to George Beshers from comment #6)
> This has become an issue again.
> Have not tracked down why.

Hi George,

This is not a regression. It is an issue triggered by too many memory hotplug uevents, and it was most likely exposed by the new patches for parallel initialization of page tables.

kdump.service uses udev (98-kexec.rules) to listen for memory/CPU hotplug events and to trigger a kdump.service restart in response (because kdump has to update the elfcorehdr note info when memory banks or CPUs change).

The current rule in 98-kexec.rules for memory online is:
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"
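
To get a feel for how many times this rule fires, one option is to capture the kernel uevents and count the memory "online" entries. The helper below is only a sketch: the function name `count_memory_online` and the log path are made up for illustration, and the line format it matches is an assumption about what `udevadm monitor --kernel` prints on this system.

```shell
# Count memory "online" uevents in captured `udevadm monitor --kernel`
# output read from stdin. Each matching event would run the PROGRAM in the
# rule above once.
# Assumed line format (check against a real capture first):
#   KERNEL[12.345678] online   /devices/system/memory/memory32 (memory)
count_memory_online() {
    grep -c -E '^KERNEL\[[0-9.]+\] +online +/devices/system/memory/memory[0-9]+ \(memory\)$'
}

# Example capture during boot or hotplug (stop with Ctrl-C), then count:
#   udevadm monitor --kernel --subsystem-match=memory > /tmp/mem-events.log
#   count_memory_online < /tmp/mem-events.log
```

A count in the hundreds or thousands on a large NUMA box would be consistent with the stop/start storm shown in comment #6.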

I don't think there is a way to control the number of kernel events delivered to udev, and from there to the systemd udev threads that invoke PROGRAM. I'd be very glad to hear any effective idea on this.

So these restart actions will produce a lot of log messages, which is normal. But I agree that the whole process should not last too long or waste too many CPU cycles.

For a systemd service, two parameters control the number of service restarts: "StartLimitInterval" and "StartLimitBurst". But the service becomes unavailable (with a failed status) if the number of restarts exceeds the limit defined by these two parameters, so they don't help in our situation (kdump.service has StartLimitInterval=0, allowing an unlimited number of restarts).

I thought of a way to mitigate this kdump service restart issue (it will hopefully eliminate the side effect of hogging the CPU for a perceivable period):
udev calls "systemctl try-restart kdump.service" for every event, and the whole restart process is actually very time-consuming, because kdump checks a lot of possible conditions to decide whether an initramfs rebuild is needed. For CPU/memory hotplug scenarios, however, we don't need any initramfs rebuild; we can simply reload the kernel. I found that this saves a large portion of the time.

I made an updated kexec-tools build with this idea patched in. Could you please verify whether it helps with the issue? Thanks!
https://people.redhat.com/~xpang/.bz1299091/kexec-tools-2.0.7-49.el7.x86_64.rpm

Regards,
Xunlei


Comment 9 Xunlei Pang 2016-08-30 13:11:51 UTC
(In reply to Xunlei Pang from comment #8)

Sorry, I forgot to mention two important extra steps:
1) After installing the updated kexec-tools I provided, please change every
PROGRAM="/bin/systemctl try-restart kdump.service"
to
PROGRAM="/bin/systemctl reload-or-try-restart kdump.service"

in "/usr/lib/udev/rules.d/98-kexec.rules".

2) Run "sudo systemctl daemon-reload" or reboot, then do the test.
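
Step 1 can also be scripted. This is a minimal sketch, not part of the provided build: `switch_to_reload` is a hypothetical helper name, and it assumes the rules use exactly the "/bin/systemctl try-restart kdump.service" string quoted above. Trying it on a copy of the rules file first is probably wise.

```shell
# Hypothetical helper: rewrite "try-restart" to "reload-or-try-restart" in a
# udev rules file. The path is passed in so it can be tried on a copy first.
switch_to_reload() {
    rules_file=$1
    tmp=$(mktemp) || return 1
    # Match the full "/bin/systemctl try-restart" prefix so that running the
    # helper twice does not rewrite an already-updated rule again.
    sed 's|/bin/systemctl try-restart kdump\.service|/bin/systemctl reload-or-try-restart kdump.service|' \
        "$rules_file" > "$tmp" && cat "$tmp" > "$rules_file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (as root, on the real file), followed by step 2:
#   switch_to_reload /usr/lib/udev/rules.d/98-kexec.rules
#   systemctl daemon-reload
```

Writing the result back with `cat` rather than moving the temporary file over the original keeps the ownership and permissions of the rules file unchanged.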

Regards,
Xunlei

Comment 10 George Beshers 2016-09-15 15:01:45 UTC
Hi Xunlei,

First, I completely agree with your analysis.

The large system has been co-opted again.

The good news is that I am not seeing the problem
when the system has 24 TB or 31 TB of memory using
the beta and snap2 bits.  NOTE: 32 TB fails to
start kdump altogether, which is a different problem.

Of course, this means that I have not tested
your patch either.

When I get the system again in a couple of weeks
I will pursue this, because there is a real
possibility that it is an intermittent problem.

Regards,
George

Comment 11 Xunlei Pang 2016-09-30 06:11:35 UTC
Hi George,

It would be good if you could test it. I will wait for your test results.

Regards,
Xunlei

Comment 13 George Beshers 2017-01-04 17:06:26 UTC
We have not seen this in some time.
Closing.

