Description of problem:
We have observed that, with the parallel initialization of page tables (see bz727269, just posted), the delayed initialization by kswapd triggers the hotplug mechanism for systemd, which then sends a kill to the current kdump and starts a new one. Since several CPUs add memory simultaneously on a large NUMA system, the situation gets messy.
I assume we are talking about 98-kexec.rules, which belongs to kexec-tools.
Sorry, we have no bandwidth to work on this in 7.3; we will work on it in 7.4.
This has been resolved by patches included in 727269.
This has become an issue again. I have not tracked down why.

I marked it as a regression but only medium priority, because I have not seen it leave kdump non-functional after the system has settled.

That is, when all is done:

[root@harp50-sys ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: activating (start) since Wed 2016-08-17 17:49:06 CDT; 422ms ago
  Process: 197224 ExecStop=/usr/bin/kdumpctl stop (code=exited, status=0/SUCCESS)
 Main PID: 197907 (kdumpctl)
   CGroup: /system.slice/kdump.service
           ├─197907 /bin/sh /usr/bin/kdumpctl start
           └─197909 /bin/sh /usr/bin/kdumpctl start

But this is annoying:

[ OK ] Started Crash recovery kernel arming.
       Stopping Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Started LSB: gr_systat setup.
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
[ OK ] Started Crash recovery kernel arming.
       Stopping Crash recovery kernel arming...
[... the same Stopped/Starting pairs, with an occasional Started/Stopping pair, repeat many more times ...]
[ OK ] Stopped Crash recovery kernel arming.
       Starting Crash recovery kernel arming...
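The amount of restart churn in a console capture like the one above can be quantified with a small helper. This is only a sketch: `count_kdump_starts` is a hypothetical name, and it assumes the console output was saved to a file with one message per line.

```shell
#!/bin/sh
# Count how many times kdump.service was (re)started in a saved console log.
# count_kdump_starts is a hypothetical helper; the log path comes from the caller.
# Assumption: the log has one console message per line.
count_kdump_starts() {
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps that from looking like a failure.
    grep -c 'Starting Crash recovery kernel arming' "$1" || true
}
```

For example, `count_kdump_starts console.log` (with a hypothetical `console.log`) prints a single number, which makes it easier to compare boots with different memory sizes or kexec-tools builds.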
(In reply to George Beshers from comment #6)
> This has become an issue again.
> Have not tracked down why.

Hi George,

This is not a regression. It is an issue triggered by too many memory hotplug uevents, and it was presumably exposed by the new parallel initialization of page tables patches.

kdump.service uses udev (98-kexec.rules) to listen for memory/CPU hotplug events and then trigger a kdump.service restart, because kdump has to update the elfcorehdr note info when memory banks or CPUs change.

The current rule in 98-kexec.rules for memory online is:
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"

I don't think there is a way to control the number of kernel events delivered to udev, and from there to the systemd udev threads that invoke PROGRAM. I'd be very glad if anyone has an effective idea for this.

So these restart actions will produce a lot of log messages, which is the normal case. But I agree that the whole process should not last too long or waste too many CPU cycles.

For a systemd service there are parameters that control the number of service restarts: "StartLimitInterval" and "StartLimitBurst". But the service becomes unavailable (with a failed status) if the number of restarts exceeds the limit defined by these two parameters. They don't help in our situation (kdump.service has StartLimitInterval=0, allowing an unlimited number of restarts).

I thought of a way to mitigate this kdump service restart issue (it will hopefully eliminate the side effect of hogging the CPU for a perceivable period): udev calls "systemctl try-restart kdump.service" for every event, and the whole restart process is very time-consuming, because kdump checks many possible conditions to decide whether an initramfs rebuild is needed. For CPU/memory hotplug scenarios, however, we don't need any initramfs rebuild; we can simply reload the kernel. I found that this saves a large portion of the time.
I made an updated kexec-tools build with this idea patched in. Could you please help verify whether it helps with the issue? Thanks!
https://people.redhat.com/~xpang/.bz1299091/kexec-tools-2.0.7-49.el7.x86_64.rpm

Regards,
Xunlei
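To see how many memory "online" uevents are actually arriving (each one matches the rule quoted above and so triggers one try-restart), the kernel uevents can be counted. The sketch below is an illustration, not part of kexec-tools: `count_memory_online` is a hypothetical helper, and it assumes the usual `udevadm monitor --kernel` line layout of `KERNEL[timestamp] action devpath (subsystem)`.

```shell
#!/bin/sh
# Count kernel "online" uevents for the memory subsystem.
# Reads udevadm monitor output on stdin; count_memory_online is a hypothetical name.
# Assumed line layout: KERNEL[<timestamp>] <action> <devpath> (<subsystem>)
count_memory_online() {
    awk '$1 ~ /^KERNEL\[/ && $2 == "online" && $NF == "(memory)" { n++ }
         END { print n + 0 }'
}

# Possible usage on a live system (needs root and real hotplug activity):
#   timeout 60 udevadm monitor --kernel --subsystem-match=memory | count_memory_online
```

Comparing this count with the number of kdump restarts seen on the console would show whether udev is invoking PROGRAM once per event, as the analysis above expects.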
(In reply to Xunlei Pang from comment #8)
> I made an updated kexec-tools build with the idea patched in, could you
> please help verify to see if it helps the issue? Thanks!
> https://people.redhat.com/~xpang/.bz1299091/kexec-tools-2.0.7-49.el7.x86_64.rpm

Sorry, I forgot to include two important extra steps:

1) After installing the updated kexec-tools I provided, please change every
   PROGRAM="/bin/systemctl try-restart kdump.service"
   to
   PROGRAM="/bin/systemctl reload-or-try-restart kdump.service"
   in "/usr/lib/udev/rules.d/98-kexec.rules".
2) Run "sudo systemctl daemon-reload" or reboot, then do the test.

Regards,
Xunlei
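Step 1 can be scripted rather than edited by hand. The following is a minimal sketch, assuming GNU sed (for `-i`); `switch_to_reload` is a hypothetical helper name, and the rules file path is passed in by the caller so the edit can be tried on a copy first.

```shell
#!/bin/sh
# Rewrite the kdump udev rules so hotplug events reload kdump instead of
# restarting it. switch_to_reload is a hypothetical helper name.
# Assumption: GNU sed, which supports in-place editing with -i.
switch_to_reload() {
    rules_file=$1
    # Only "systemctl try-restart" is matched, so running this twice does not
    # touch lines that already say "systemctl reload-or-try-restart".
    sed -i 's|systemctl try-restart kdump\.service|systemctl reload-or-try-restart kdump.service|g' "$rules_file"
}

# Possible usage, as root, followed by step 2 from the comment above:
#   cp /usr/lib/udev/rules.d/98-kexec.rules /root/98-kexec.rules.bak
#   switch_to_reload /usr/lib/udev/rules.d/98-kexec.rules
#   systemctl daemon-reload
```

Keeping a backup copy, as in the usage comment, makes it easy to return to the original try-restart behaviour if the test build does not help.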
Hi Xunlei,

First, I completely agree with your analysis.

The large system has been co-opted again. The good news is that I am not seeing the problem when the system has 24 TB or 31 TB of memory using the beta and snap2 bits.

NOTE: 32 TB fails to start kdump altogether, which is a different problem.

Of course, this means that I have not tested your patch either. When I get the system again in a couple of weeks I will pursue this, because there is a real possibility that it is an intermittent problem.

Regards,
George
Hi George,

It would be good if you could test it; I will wait for your test results.

Regards,
Xunlei
We have not seen this in some time. Closing.