Bug 1714162

Summary: [Hyper-V][RHEL7.6] kexec-tools: kdump saves vmcore failed with enabled dynamic memory and login graphical mode
Product: Red Hat Enterprise Linux 7 Reporter: HuijingHei <hhei>
Component: kexec-tools Assignee: Kairui Song <kasong>
Status: CLOSED ERRATA QA Contact: Emma Wu <xiawu>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.6 CC: bhsharma, bhu, boyang, dhildenb, hhei, kasong, ldu, leiwang, ruyang, xialiu, xiaofwan, xiawu, xuli, yacao, yzheng
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kexec-tools-2.0.15-33.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1718771 Environment:
Last Closed: 2019-08-06 12:55:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1718771    
Bug Blocks: 1661416    

Description HuijingHei 2019-05-27 09:31:43 UTC
Description of problem:
On a Gen2 VM on Hyper-V with dynamic memory enabled, log in to graphical mode and trigger kdump. kdump fails to save the vmcore and leaves a vmcore-incomplete file.

Version-Release number of selected component (if applicable):
kernel 3.10.0-957.el7.x86_64
<Host> hyper-v windows
<Hyper-V Virtual Machine>
Generation: Gen2
Secure Boot: Disabled
The number of virtual CPUs: 4
Virtual memory: 4096MB (Dynamic memory enabled)

How reproducible: 80%


Steps to Reproduce:
1. Create a Hyper-V virtual machine as described above.

2. Install RHEL 7.6 (kernel: 3.10.0-957.el7.x86_64) with Software Selection: [Server with GUI], and Kdump: enabled.

3. Reboot OS after installation.

4. Log in to RHEL in graphical mode and execute the following command.
   # echo c > /proc/sysrq-trigger

Actual results:
--- console log ---
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Checking for memory holes                         : [  0.0 %] /
Checking for memory holes                         : [100.0 %] |
Excluding unnecessary pages                       : [100.0 %] \
Copying data                                      : [ 86.3 %] -           eta: 0s
[    6.633284] traps: makedumpfile[1237] general protection ip:7f682e343d69 sp:7ffd54fee178 error:0 in libc-2.17.so[7f682e1f0000+1c2000]
/lib/kdump-lib-initramfs.sh: line 86:  1237 Segmentation fault      $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed
[FAILED] Failed to start Kdump Vmcore Save Service.
--------------------
After the above console log, the guest reboots. Checking afterwards shows that a vmcore-incomplete file exists. Sometimes kdump fails to start.

Expected results:
kdump should save the vmcore successfully.


Additional info:
1. With the VM memory changed to 8 GB, the issue does not occur.
2. RHEL8 does not have the same issue.

Comment 2 Kairui Song 2019-05-27 15:32:33 UTC
This looks like a similar issue with:
https://bugzilla.redhat.com/show_bug.cgi?id=1644600

On the same VM, if you use latest RHEL-8.1 instead, is it still reproducible?

Comment 3 HuijingHei 2019-05-28 05:44:24 UTC
(In reply to Kairui Song from comment #2)
> This looks like a similar issue with:
> https://bugzilla.redhat.com/show_bug.cgi?id=1644600

On RHEL 7.6, using set-vmmemory can also result in vmcore-incomplete and similar console logs. It seems to be the same issue as on RHEL 8.0 (kexec-tools-2.0.15-21.el7_6.3.x86_64).

> 
> On the same VM, if you use latest RHEL-8.1 instead, is it still reproducible?
No, the issue does not exist on RHEL-8.1 (20190523.0) with kexec-tools-2.0.19-3.el8.x86_64

Comment 5 David Hildenbrand 2019-05-29 12:19:41 UTC
Upstream: PG_offline essentially replaces PG_balloon.

However, in RHEL7, PG_balloon is still needed for other purposes: balloon compaction

We could

a) Backport b1123ea6d3b3d ("mm: balloon: use general non-lru movable page feature") and friends, to free up PG_balloon

b) Introduce a new MAPCOUNT value for PG_offline downstream, letting it co-exist with PG_balloon

c) Let it remain broken in RHEL7

I *guess* b) would be more feasible than a). I suspect that a) is quite involved.

Comment 6 Kairui Song 2019-06-10 08:01:03 UTC
(In reply to David Hildenbrand from comment #5)
> Upstream: PG_offline essentially replaces PG_balloon.
> 
> However, in RHEL7, PG_balloon is still needed for other purposes: balloon
> compaction
> 
> We could
> 
> a) Backport b1123ea6d3b3d ("mm: balloon: use general non-lru movable page
> feature") and friends, to free up PG_balloon
> 
> b) Introduce a new MAPCOUNT value for PG_offline downstream, letting it
> co-exist with PG_balloon
> 
> c) Let it remain broken in RHEL7
> 
> I *guess* b) would be more feasible than a). I suspect that a) is quite
> involved.

Thanks, I agree that plan b) is a feasible solution. I've cloned a bug for the kernel fix, bz1718771. Will you implement it for RHEL-7?

Comment 7 David Hildenbrand 2019-06-11 10:37:18 UTC
The kexec-tools backport should be pretty easy I assume.

I'll have a look at the 7.7? backport and let you know when I run into issues.

Comment 8 David Hildenbrand 2019-06-18 13:16:33 UTC
Testing with virtio-balloon without balloon compaction, not with Hyper-V, leaving that to the experts. To test with virtio-balloon, a special kernel build is required (CONFIG_BALLOON_COMPACTION=n).

-> Task info: https://brewweb.devel.redhat.com/taskinfo?taskID=22226003

[cloud-user@rhel7 ~]$ uname -a
Linux rhel7 3.10.0-1057.el7.test.x86_64 #1 SMP Tue Jun 18 07:21:55 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@rhel7 cloud-user]# grep "BALLOON_COMPACTION" /boot/config-3.10.0-1057.el7.test.x86_64
# CONFIG_BALLOON_COMPACTION is not set



1. Start a guest with 8GB of memory, modified kernel and custom built "makedumpfile" installed.

[cloud-user@rhel7 ~]$ cat /proc/meminfo 
MemTotal:        8008712 kB
MemFree:         7683636 kB
MemAvailable:    7630212 kB
Buffers:            2088 kB

2. Inflate the balloon (notice that the crashkernel area also consumes memory)

[dhildenb@virtlab412 ~]$ echo "balloon 700" | sudo nc -U /var/tmp/monitor
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) balloon 700
[dhildenb@virtlab412 ~]$ echo "info balloon" | sudo nc -U /var/tmp/monitor
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) info balloon
balloon: actual=700

[cloud-user@rhel7 ~]$ cat /proc/meminfo 
MemTotal:         336904 kB
MemFree:          136264 kB
MemAvailable:      23600 kB
Buffers:             724 kB

3. Modify /etc/kdump.conf to display verbose information when dumping

-> core_collector makedumpfile -l --message-level 31 -d 31

4. Restart kdump

[guest] $ systemctl restart kdump

5. Trigger a kernel crash

[guest] $ echo 1 > /proc/sys/kernel/sysrq
[guest] $ echo c > /proc/sysrq-trigger
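As a rough illustration of what the "-d 31" in step 3 asks for: makedumpfile's dump level is a bit mask, and 31 sets all five bits. The sketch below decodes a dump level into the page types it excludes. The bit meanings are taken from the makedumpfile(8) man page as I understand it; the function name excluded_page_types is made up for this example and is not part of kexec-tools.

```python
# Bit meanings of makedumpfile's -d dump level (per makedumpfile(8)).
# The labels mirror the wording used in makedumpfile's report output.
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "non-private cache pages",
    4: "private cache pages",
    8: "user process data pages",
    16: "free pages",
}


def excluded_page_types(dump_level):
    """Return the page types excluded for a given -d value (0..31).

    Hypothetical helper for illustration only.
    """
    if not 0 <= dump_level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS.items() if dump_level & bit]


if __name__ == "__main__":
    # -d 31 excludes every page type that the dump level can select.
    for name in excluded_page_types(31):
        print(name)
```

Note that offline pages are not selected by any dump-level bit in this table; the report in this comment lists them as a separate exclusion category.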


The guest restarts into the kdump kernel and performs the dump. The output scrolls by quickly, so it has to be captured fast:

Original pages  : 0x00000000001f7514
  Excluded pages   : 0x00000000001e9150
    Pages filled with zero  : 0x0000000000006bad
    Non-private cache pages : 0x000000000000370a
    Private cache pages     : 0x000000000000000f
    User process data pages : 0x0000000000002a10
    Free pages              : 0x000000000000807a
    Hwpoison pages          : 0x0000000000000000
    Offline pages           : 0x00000000001d4400
  Remaining pages  : 0x000000000000e3c4
  (The number of pages is reduced to 2%.)
Memory Hole     : 0x0000000000048aec
--------------------------------------------------
Total pages     : 0x0000000000240000
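A quick sanity check of the report above, as a sketch: the per-category exclusions should add up to "Excluded pages", and the remaining counters should be consistent with each other. All numbers are copied from the report; the variable names are only for this example.

```python
# Page counts copied from the makedumpfile report above (hexadecimal).
original = 0x1F7514
excluded = 0x1E9150
excluded_parts = {
    "Pages filled with zero": 0x6BAD,
    "Non-private cache pages": 0x370A,
    "Private cache pages": 0x000F,
    "User process data pages": 0x2A10,
    "Free pages": 0x807A,
    "Hwpoison pages": 0x0000,
    "Offline pages": 0x1D4400,
}
remaining = 0xE3C4
memory_hole = 0x48AEC
total = 0x240000

# The categories sum to the "Excluded pages" line.
assert sum(excluded_parts.values()) == excluded
# Original minus excluded gives the "Remaining pages" line.
assert original - excluded == remaining
# Original pages plus the memory hole give the "Total pages" line.
assert original + memory_hole == total

# Integer percentage of pages that remain, matching "reduced to 2%".
percent_remaining = remaining * 100 // original
print(percent_remaining)
```

Offline pages account for the bulk of the exclusions here, which is what the patch is supposed to achieve.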


8008712 kB - 336904 kB = 7671808 kB == 1917952 pages (4 kB each) == 0x1D4400 pages

-> All inflated pages (offline) got excluded.
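The arithmetic above can be reproduced with a few lines. This is a minimal sketch that assumes 4 kB base pages (the x86_64 default); the MemTotal values are the ones from /proc/meminfo before and after inflating the balloon, and the helper name inflated_pages is hypothetical.

```python
PAGE_SIZE_KB = 4  # assumption: 4 kB base pages on x86_64

MEM_TOTAL_BEFORE_KB = 8008712  # MemTotal before inflating the balloon
MEM_TOTAL_AFTER_KB = 336904    # MemTotal after "balloon 700"
REPORTED_OFFLINE_PAGES = 0x1D4400  # "Offline pages" from the makedumpfile report


def inflated_pages(before_kb, after_kb, page_size_kb=PAGE_SIZE_KB):
    """Number of pages removed from the guest by balloon inflation."""
    delta_kb = before_kb - after_kb
    if delta_kb < 0 or delta_kb % page_size_kb:
        raise ValueError("MemTotal delta is not a whole number of pages")
    return delta_kb // page_size_kb


pages = inflated_pages(MEM_TOTAL_BEFORE_KB, MEM_TOTAL_AFTER_KB)
print(pages, hex(pages))
print(pages == REPORTED_OFFLINE_PAGES)
```

The match between the MemTotal delta and the "Offline pages" counter is what shows that every inflated page was excluded from the dump.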


A similar approach will also work for testing under Hyper-V (inflate the balloon differently: enable dynamic memory).

In contrast to RHEL8, this will *not* work with
- XEN balloon - XEN patch to mark pages offline is not included
- virtio-balloon (with CONFIG_BALLOON_COMPACTION=y) - Pages are marked PageOffline() and PageBalloon() -> kdump cannot handle this yet.

So this really is only to fix Hyper-V.

Comment 10 Kairui Song 2019-06-19 03:23:44 UTC
Hi David,

Thanks for the work! I've backported your "[PATCH] exclude pages that are logically offline". But to get the patch merged we need three acks and the blocker flag for RHEL-7.7.

Can you also give devel_ack to the kernel bug? The dependency issue arises because that bug was cloned, and a clone gets a default dependency. I'll fix that.

Comment 24 errata-xmlrpc 2019-08-06 12:55:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2134