Enabling kdump on RHCOS currently requires /boot to be RW until https://src.fedoraproject.org/rpms/kexec-tools/c/75bdcb7399b6fe48032a8db534e18b01206601bc?branch=rawhide is backported to RHEL 8.4.
Moving back to POST as this needs the second PR.
@travier I wasn't able to get this to work. I think this requires a boot image bump to enable kdump day-1.

-- Logs begin at Wed 2021-06-30 12:51:02 UTC, end at Wed 2021-06-30 14:18:20 UTC. --
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal systemd[1]: Starting Crash recovery kernel arming...
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal kdumpctl[312655]: kdump: No kdump initial ramdisk found.
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal kdumpctl[312655]: kdump: Rebuilding /boot/ostree/rhcos-804feafb00025bb841960dcf2cce555bb8988>
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal kdumpctl[312655]: kdump: /boot/ostree/rhcos-804feafb00025bb841960dcf2cce555bb8988d5b4bda50cc>
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal kdumpctl[312655]: kdump: Starting kdump: [FAILED]
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal systemd[1]: kdump.service: Failed with result 'exit-code'.
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal systemd[1]: Failed to start Crash recovery kernel arming.
Jun 30 14:18:20 ip-10-0-131-173.us-west-2.compute.internal systemd[1]: kdump.service: Consumed 82ms CPU time

sh-4.4# sudo rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eb121c97e974d7372eba219506205f364ab8f7809de74a3ef4c0157c4372938c
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202106162024-0 (2021-06-16T20:28:08Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-30-034414   True        False         66m     Cluster version is 4.9.0-0.nightly-2021-06-30-034414
This was merged in master on Jun 23, 2021 so 48.84.202106162024-0 will not have it. You can check if the file is included with: `systemctl cat kdump.service`
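A rough sketch of how that check could be scripted. The helper name `has_boot_remount` is made up for illustration, and the pattern assumes the drop-in contributes an `ExecStartPre=` line that remounts /boot read-write, matching the `ExecStartPre=/usr/bin/mount -o remount,rw /boot` entry that appears in the `systemctl status kdump` output from the fixed build. The exact contents of the drop-in file are not shown in this report, so adjust the pattern if they differ.

```shell
# Hypothetical helper: reads unit file text on stdin and returns 0 if it
# contains an ExecStartPre line that remounts /boot read-write, 1 otherwise.
# The pattern is an assumption based on the status output in this report.
has_boot_remount() {
    grep -Eq '^ExecStartPre=.*mount .*remount,rw /boot'
}
```

On a node this would be used as `systemctl cat kdump.service | has_boot_remount && echo "drop-in present"`.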
This should not require a boot image bump AFAIK.
Confirmed, this does not need a boot image bump but https://github.com/openshift/os/pull/561 needs to be in the boot image in order for this to work.
A boot image bump is only needed for https://github.com/openshift/os/pull/561 if kdump is set up on day-1 via Ignition or a MC, not for manual setup.
Thanks for the clarification. I was able to verify this fix on enabling kdump day-2.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-07-021823   True        False         35m     Cluster version is 4.9.0-0.nightly-2021-07-07-021823

$ cat 99-kdump-worker.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-kdump
spec:
  kernelArguments:
    - 'crashkernel=256M'
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - enabled: true
          name: kdump.service

$ oc create -f 99-kdump-worker.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-kdump created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
00-worker                                          95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
01-master-container-runtime                        95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
01-master-kubelet                                  95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
01-worker-container-runtime                        95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
01-worker-kubelet                                  95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
99-master-generated-registries                     95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
99-master-ssh                                                                                 3.2.0             63m
99-worker-generated-registries                     95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
99-worker-kdump                                                                               3.2.0             9s
99-worker-ssh                                                                                 3.2.0             63m
rendered-master-38ccce6954aff7d3ae792692e8962bb1   95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m
rendered-worker-0b5f5063c2b34c7b307de02d3cb44eaa   95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             4s
rendered-worker-da71f2941166daf4ff90249dc4d88a52   95dcdb123f1a5fa8887a1fb66a044c9ad74191ce   3.2.0             60m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-38ccce6954aff7d3ae792692e8962bb1   True      False      False      3              3                   3                     0                      62m
worker   rendered-worker-da71f2941166daf4ff90249dc4d88a52   False     True       False      3              0                   0                     0                      62m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-38ccce6954aff7d3ae792692e8962bb1   True      False      False      3              3                   3                     0                      72m
worker   rendered-worker-0b5f5063c2b34c7b307de02d3cb44eaa   True      False      False      3              3                   3                     0                      72m

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-148-96.us-west-2.compute.internal    Ready    worker   65m   v1.21.1+0228142
ip-10-0-154-86.us-west-2.compute.internal    Ready    master   72m   v1.21.1+0228142
ip-10-0-164-96.us-west-2.compute.internal    Ready    master   73m   v1.21.1+0228142
ip-10-0-178-111.us-west-2.compute.internal   Ready    worker   64m   v1.21.1+0228142
ip-10-0-209-56.us-west-2.compute.internal    Ready    master   72m   v1.21.1+0228142
ip-10-0-215-192.us-west-2.compute.internal   Ready    worker   65m   v1.21.1+0228142

$ oc debug node/ip-10-0-148-96.us-west-2.compute.internal
Starting pod/ip-10-0-148-96us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl is-enabled kdump
enabled
sh-4.4# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kdump.service.d
           └─remount-boot.conf
   Active: active (exited) since Wed 2021-07-07 15:11:52 UTC; 7min ago
  Process: 1348 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
  Process: 1342 ExecStartPre=/usr/bin/mount -o remount,rw /boot (code=exited, status=0/SUCCESS)
 Main PID: 1348 (code=exited, status=0/SUCCESS)
      CPU: 1min 3.582s

Jul 07 15:10:57 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: Stored kernel commandline:
Jul 07 15:10:57 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: No dracut internal kernel commandline stored in the initramfs
Jul 07 15:10:57 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: *** Install squash loader ***
Jul 07 15:10:59 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: *** Squashing the files inside the initramfs ***
Jul 07 15:11:46 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: *** Squashing the files inside the initramfs done ***
Jul 07 15:11:46 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: *** Creating image file '/boot/ostree/rhcos-4a92af59dcf52f5d7c3a58e>
Jul 07 15:11:50 ip-10-0-148-96.us-west-2.compute.internal dracut[1643]: *** Creating initramfs image file '/boot/ostree/rhcos-4a92af59dcf52>
Jul 07 15:11:52 ip-10-0-148-96.us-west-2.compute.internal kdumpctl[1348]: kdump: kexec: loaded kdump kernel
Jul 07 15:11:52 ip-10-0-148-96.us-west-2.compute.internal kdumpctl[1348]: kdump: Starting kdump: [OK]
Jul 07 15:11:52 ip-10-0-148-96.us-west-2.compute.internal systemd[1]: Started Crash recovery kernel arming.

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ca82ee51a648e95a6cc2c2fd99260d28a89a9be8467d2575ebcc1491c4383f
              CustomOrigin: Managed by machine-config-operator
                   Version: 49.84.202107051924-0 (2021-07-05T19:28:02Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ca82ee51a648e95a6cc2c2fd99260d28a89a9be8467d2575ebcc1491c4383f
              CustomOrigin: Managed by machine-config-operator
                   Version: 49.84.202107051924-0 (2021-07-05T19:28:02Z)

sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-4a92af59dcf52f5d7c3a58e9722f057bf98a16b7625d925326e9af2fdd4060b3/vmlinuz-4.18.0-305.7.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/4a92af59dcf52f5d7c3a58e9722f057bf98a16b7625d925326e9af2fdd4060b3/0 ignition.platform.id=aws root=UUID=daa7e97d-faaf-46e9-995e-1e41f03bd55f rw rootflags=prjquota crashkernel=256M
exit
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759