1974639 – Kdump is failing on RHCOS node

Summary: Kdump is failing on RHCOS node

Keywords:
Status:	CLOSED DUPLICATE of bug 1971739
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.8
Hardware:	ppc64le
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Timothée Ravier
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-22 08:37 UTC by Satwinder Singh
Modified:	2021-06-23 09:21 UTC
CC List:	6 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-22 10:24:28 UTC
Target Upstream Version:
Embargoed:

Description Satwinder Singh 2021-06-22 08:37:38 UTC

---

OCP Version at Install Time:
# oc version
Client Version: 4.8.0-rc.0
Server Version: 4.8.0-rc.0
Kubernetes Version: v1.21.0-rc.0+120883f


RHCOS Version at Install Time:
$ sudo cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="48.84.202106130219-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 48.84.202106130219-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.8"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.8"
OPENSHIFT_VERSION="4.8"
RHEL_VERSION="8.4"
OSTREE_VERSION='48.84.202106130219-0'



Platform: IBM Power Systems
Architecture: ppc64le


What are you trying to do? What is your use case?
I was trying to configure kdump on the RHCOS worker nodes.


What happened? What went wrong or what did you expect?
kdump.service failed to start:
# systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-06-21 07:10:54 UTC; 5h 8min ago
 Main PID: 1471 (code=exited, status=1/FAILURE)
      CPU: 130ms
Jun 21 07:10:53 worker-0 systemd[1]: Starting Crash recovery kernel arming...
Jun 21 07:10:54 worker-0 kdumpctl[1471]: kdump: No kdump initial ramdisk found.
Jun 21 07:10:54 worker-0 kdumpctl[1471]: kdump: Rebuilding /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/initramfs-4.18.0-305.3.1.el8_4.ppc64lekdump.img
Jun 21 07:10:54 worker-0 kdumpctl[1471]: kdump: /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716 does not have write permission.Can not rebuild /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/initramfs-4.18.0-305.3.1.el8_4.ppc64lekdump.img
Jun 21 07:10:54 worker-0 kdumpctl[1471]: kdump: Starting kdump: [FAILED]
Jun 21 07:10:54 worker-0 systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Jun 21 07:10:54 worker-0 systemd[1]: kdump.service: Failed with result 'exit-code'.
Jun 21 07:10:54 worker-0 systemd[1]: Failed to start Crash recovery kernel arming.
Jun 21 07:10:54 worker-0 systemd[1]: kdump.service: Consumed 130ms CPU time


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.
I followed the kdump setup steps in this document: https://docs.openshift.com/container-platform/4.7/support/troubleshooting/troubleshooting-operating-system-issues.html



Other details:

# journalctl -b 0 | grep kdumpctl
Jun 22 06:17:23 worker-0 kdumpctl[1419]: kdump: No kdump initial ramdisk found.
Jun 22 06:17:23 worker-0 kdumpctl[1419]: kdump: Rebuilding /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/initramfs-4.18.0-305.3.1.el8_4.ppc64lekdump.img
Jun 22 06:17:23 worker-0 kdumpctl[1419]: kdump: /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716 does not have write permission. Can not rebuild /boot/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/initramfs-4.18.0-305.3.1.el8_4.ppc64lekdump.img
Jun 22 06:17:23 worker-0 kdumpctl[1419]: kdump: Starting kdump: [FAILED]
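
The "does not have write permission" message suggests that /boot is mounted read-only on this node. A minimal sketch for confirming that, assuming findmnt from util-linux is available on the node. The helper name mount_is_readonly is hypothetical and not part of kexec-tools:

```shell
# Hypothetical helper: succeeds (exit 0) if a comma-separated mount
# option string contains the standalone "ro" option.
mount_is_readonly() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage on a node (assumes findmnt from util-linux is installed):
#   opts=$(findmnt -no OPTIONS /boot)
#   if mount_is_readonly "$opts"; then echo "/boot is read-only"; fi
```

Wrapping the option string in commas keeps options such as errors=remount-ro from matching by accident.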


# cat /proc/cmdline
BOOT_IMAGE=(ieee1275//vdevice/v-scsi@30000002/disk@8200000000000000,gpt3)/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/vmlinuz-4.18.0-305.3.1.el8_4.ppc64le random.trust_cpu=on console=tty0 console=hvc0,115200n8 ostree=/ostree/boot.1/rhcos/952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/0 ignition.platform.id=openstack root=UUID=80edca18-0f7e-48b4-bd59-336c5d3fe2a8 rw rootflags=prjquota crashkernel=1024M

# dmesg | grep crashkernel
[    0.000000] Reserving 1024MB of memory at 128MB for crashkernel (System RAM: 32768MB)
[    0.000000] Kernel command line: BOOT_IMAGE=(ieee1275//vdevice/v-scsi@30000002/disk@8200000000000000,gpt3)/ostree/rhcos-952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/vmlinuz-4.18.0-305.3.1.el8_4.ppc64le random.trust_cpu=on console=tty0 console=hvc0,115200n8 ostree=/ostree/boot.1/rhcos/952b979da3b3785d4154f56213b0c66d327a4f29fd167a9d8121ded93d158716/0 ignition.platform.id=openstack root=UUID=80edca18-0f7e-48b4-bd59-336c5d3fe2a8 rw rootflags=prjquota crashkernel=1024M
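
The dmesg output above shows that the 1024MB reservation succeeded, so the crashkernel= argument itself does not appear to be the problem. A small sketch for extracting that value from a kernel command line, for example to compare it across nodes. The helper name crashkernel_arg is hypothetical:

```shell
# Hypothetical helper: print the value of every crashkernel= argument
# found in a kernel command line string passed as $1.
crashkernel_arg() {
    # $1 is intentionally unquoted so it splits on whitespace.
    for word in $1; do
        case "$word" in
            crashkernel=*) printf '%s\n' "${word#crashkernel=}" ;;
        esac
    done
}

# Usage on a node:
#   crashkernel_arg "$(cat /proc/cmdline)"
#   cat /sys/kernel/kexec_crash_loaded   # 1 once a crash kernel is loaded
```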


# grep ^[^#] /etc/sysconfig/kdump
KDUMP_KERNELVER=""
KDUMP_COMMANDLINE=""
KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory numa=off udev.children-max=2 ehea.use_mcs=0 panic=10 rootflags=nofail kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd"
KEXEC_ARGS="--dt-no-old-root"
KDUMP_IMG="vmlinuz"
KDUMP_IMG_EXT=""


# grep ^[^#] /etc/kdump.conf
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31


# rpm -qa |  grep kexec-tools
kexec-tools-2.0.20-46.el8.ppc64le

Comment 1 Timothée Ravier 2021-06-22 10:24:28 UTC


*** This bug has been marked as a duplicate of bug 1971739 ***

Comment 2 Timothée Ravier 2021-06-22 10:27:07 UTC

This is likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1971739. The current workaround is to remount /boot read-write before kdump starts. You can do that by placing the following drop-in configuration in /etc/systemd/system/kdump.service.d/: https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/12kdump/usr/lib/systemd/system/kdump.service.d/remount-boot.conf
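
A rough sketch of what installing such a drop-in could look like. The unit content below is an assumption based on the description of the workaround (remount /boot read-write before kdump starts) and is not copied from the linked remount-boot.conf, so check the linked file before using it. The helper name write_kdump_dropin is hypothetical:

```shell
# Hypothetical helper: write a kdump.service drop-in that remounts /boot
# read-write before kdump starts. The unit content is an assumption and
# may differ from the upstream remount-boot.conf.
write_kdump_dropin() {
    # Target directory; defaults to the location named in the comment above.
    dropin_dir="${1:-/etc/systemd/system/kdump.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/remount-boot.conf" <<'EOF'
[Service]
ExecStartPre=/usr/bin/mount -o remount,rw /boot
EOF
}

# Usage on a node, as root:
#   write_kdump_dropin
#   systemctl daemon-reload
#   systemctl restart kdump.service
```

The directory is a parameter so the helper can be tried against a scratch directory before touching /etc.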
