Created attachment 1570915 [details]
pod / pvc / pv / storageclass dumps

Description of problem:
During configuration of a fresh HTB6 install, the monitoring clusteroperator failed due to a storage mounting issue.

Events:
  Type     Reason       Age                    From                                  Message
  ----     ------       ----                   ----                                  -------
  Warning  FailedMount  6m23s (x756 over 35h)  kubelet, ip-10-0-129-19.ec2.internal  Unable to mount volumes for pod "prometheus-k8s-1_openshift-monitoring(e0214bbe-790a-11e9-bf2a-128845da01f8)": timeout expired waiting for volumes to attach or mount for pod "openshift-monitoring"/"prometheus-k8s-1". list of unmounted volumes=[prometheus-k8s-db]. list of unattached volumes=[prometheus-k8s-db config config-out prometheus-k8s-rulefiles-0 secret-kube-etcd-client-certs secret-prometheus-k8s-tls secret-prometheus-k8s-proxy secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle configmap-kubelet-serving-ca-bundle prometheus-k8s-token-l47bh]
  Warning  FailedMount  80s (x1229 over 35h)   kubelet, ip-10-0-129-19.ec2.internal  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-e01ca78c-790a-11e9-bf2a-128845da01f8" : failed to mount the volume as "ext4", it already contains unknown data, probably partitions. Mount error: mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d --scope -- mount -t ext4 -o defaults /dev/xvdcc /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d
Output: Running scope as unit: run-rc658e4b20a404c48b09cbf1e00ccc68f.scope
mount: /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d: wrong fs type, bad option, bad superblock on /dev/xvdcc, missing codepage or helper program, or other error.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

How reproducible:
Unknown

Steps to Reproduce:
1. Fresh install of HTB6
2. Configure openshift-ingress (this is probably unrelated other than it caused the monitoring pods to restart)

Actual results:
The monitoring operator stays in a degraded state.

Expected results:
The PV should mount again and the monitoring operator should report ready.

Additional info:
See attachments
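For triage, the affected resources can be mapped back to the underlying EBS volume with something like the following (a sketch only; the PVC name is assumed from the usual StatefulSet naming convention and should be checked against the attached dumps):

# Map the failing pod to its claim, PV and EBS volume ID.
oc -n openshift-monitoring describe pod prometheus-k8s-1
oc -n openshift-monitoring get pvc prometheus-k8s-db-prometheus-k8s-1
oc get pv pvc-e01ca78c-790a-11e9-bf2a-128845da01f8 -o yaml | grep -A2 awsElasticBlockStore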
Version was set to 4.1.z, so I changed it to 4.1.0 and the target to 4.1.z - please review whether this is correct. Since there hasn't been a 4.1.z release yet, the bug version should be 4.1.0 and the target 4.1.z so the fix can be delivered in a z-stream release.
Looks like the FS on the volume is corrupted:

May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: I0518 01:19:00.624834    1353 operation_generator.go:510] MountVolume.WaitForAttach succeeded for volume "pvc-e01ca78c-790a-11e9-bf2a-128845da01f8" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0b2f4de47fffb627d") pod "prometheus-k8s-1" (UID: "e0214bbe-790a-11e9-bf2a-128845da01f8") DevicePath "/dev/xvdcc"
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: I0518 01:19:00.666392    1353 mount_linux.go:454] `fsck` error fsck from util-linux 2.32.1
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: fsck.ext2: Bad magic number in super-block while trying to open /dev/xvdcc
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: /dev/xvdcc:
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: The superblock could not be read or does not describe a valid ext2/ext3/ext4
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: filesystem. If the device is valid and it really contains an ext2/ext3/ext4
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: filesystem (and not swap or ufs or something else), then the superblock
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: is corrupt, and you might try running e2fsck with an alternate superblock:
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:     e2fsck -b 8193 <device>
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:  or
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:     e2fsck -b 32768 <device>
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: Found a atari partition table in /dev/xvdcc

Did you encounter the problem multiple times?
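If it helps the analysis, a few read-only checks on the node should confirm exactly which signatures are present on the device (a sketch; /dev/xvdcc is taken from the log above, and the e2fsck step should only be attempted against a snapshot of the EBS volume):

# Show any filesystem or partition-table signatures visible on the device.
blkid -p /dev/xvdcc
lsblk -f /dev/xvdcc

# List (without erasing) the magic signatures wipefs can see, e.g. the stray
# "atari" partition table reported by fsck.
wipefs --no-act /dev/xvdcc

# Only if the device is believed to really contain ext4: try an alternate
# superblock, as the fsck output suggests.
e2fsck -b 32768 /dev/xvdcc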
I tried restarting the pod with an ext4 volume mounted, but could not reproduce it. @jupierce, could you provide detailed steps?
Just an additional observation: it looks like we're mounting an unformatted volume that is not completely empty (the "atari partition table"), so Kubernetes will not try to format it.
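For context, the effective decision the kubelet makes before mounting looks roughly like the following (a simplified sketch of the behaviour, not the actual mount_linux.go code): the device is only formatted when it appears completely blank, so any leftover signature, even a bogus partition table, blocks the format, and the plain mount then fails.

# Simplified sketch of the format-and-mount decision (illustration only).
DEVICE=/dev/xvdcc
FSTYPE=ext4
TARGET=/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d

if [ -z "$(blkid -p -o value -s TYPE "$DEVICE")" ] && \
   [ -z "$(blkid -p -o value -s PTTYPE "$DEVICE")" ]; then
  # Device looks completely empty: safe to create the requested filesystem.
  mkfs."$FSTYPE" "$DEVICE"
  mount -t "$FSTYPE" -o defaults "$DEVICE" "$TARGET"
else
  # Anything already on the device (a filesystem or a partition table, even a
  # bogus "atari" one) means no format is attempted; the plain mount is tried
  # and fails here with exit status 32, as seen in the events above.
  mount -t "$FSTYPE" -o defaults "$DEVICE" "$TARGET"
fi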
@chaoyang - I don't know how to reproduce it reliably. The steps reported were simply what triggered the condition in my environment.
Thanks. I don't expect to be given steps to reproduce. I only wondered whether this is something you encounter frequently. The "error 32" is actually not a problem in itself, but the junk that looks like a partition table is.
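If the data on this particular volume is expendable (a fresh Prometheus data volume usually is), one possible workaround - offered only as a sketch, not a verified procedure, and with the PVC name assumed from the usual StatefulSet naming - is to either clear the stray signatures on the node so the kubelet formats the device on the next attach, or delete the claim so a new volume gets provisioned:

# Option 1: on the node, wipe the bogus signatures so the next mount attempt
# formats the volume (this destroys whatever is currently on /dev/xvdcc).
wipefs --all /dev/xvdcc

# Option 2: discard the volume and let the StatefulSet re-provision it
# (PVC name assumed; verify with 'oc -n openshift-monitoring get pvc').
oc -n openshift-monitoring delete pvc prometheus-k8s-db-prometheus-k8s-1
oc -n openshift-monitoring delete pod prometheus-k8s-1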
This has only happened once. I installed an identical cluster later in the day and it did not experience this. The cluster affected by the corrupted PV is still in this state, if that would help analysis.
The only other interesting thing on the node is the following kernel backtrace, emitted for 4 of the 8 cores. It's definitely strange, but whether it's related to the problem is hard to guess (possible memory corruption?):

May 18 01:11:14 localhost kernel: installing Xen timer for CPU 7
May 18 01:11:14 localhost kernel: #7
May 18 01:11:14 localhost kernel: WARNING: CPU: 0 PID: 1 at drivers/xen/events/events_base.c:1110 unbind_from_irqhandler+0x34/0x40
May 18 01:11:14 localhost kernel: Modules linked in:
May 18 01:11:14 localhost kernel: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W --------- - - 4.18.0-80.1.2.el8_0.x86_64 #1
May 18 01:11:14 localhost kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
May 18 01:11:14 localhost kernel: RIP: 0010:unbind_from_irqhandler+0x34/0x40
May 18 01:11:14 localhost kernel: Code: 89 fb e8 ff 1c c3 ff 48 85 c0 74 1e 48 8b 40 10 48 83 78 08 00 74 13 89 df 48 89 ee e8 75 f9 c2 ff 89 df 5b 5d e9 bc fe ff ff <0f> 0b 5b 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e8 46 f2 ff ff
May 18 01:11:14 localhost kernel: RSP: 0000:ffff9cb283137d58 EFLAGS: 00010246
May 18 01:11:14 localhost kernel: RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000006
May 18 01:11:14 localhost kernel: RDX: 0000000000000028 RSI: 00000000ffffffff RDI: ffffffff95459e40
May 18 01:11:14 localhost kernel: RBP: 0000000000000000 R08: ffff8fe08780d4d1 R09: ffffffff94218d96
May 18 01:11:14 localhost kernel: R10: ffffc2d7441f0040 R11: 000000000000000d R12: ffff8fe76f9c0000
May 18 01:11:14 localhost kernel: R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff942aaf30
May 18 01:11:14 localhost kernel: FS:  0000000000000000(0000) GS:ffff8fe76f800000(0000) knlGS:0000000000000000
May 18 01:11:14 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 01:11:14 localhost kernel: CR2: ffff8fe565ef0000 CR3: 00000005e540a001 CR4: 00000000001606f0
May 18 01:11:14 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 18 01:11:14 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 18 01:11:14 localhost kernel: Call Trace:
May 18 01:11:14 localhost kernel:  xen_uninit_lock_cpu+0x28/0x62
May 18 01:11:14 localhost kernel:  xen_hvm_cpu_die+0x21/0x30
May 18 01:11:14 localhost kernel:  takedown_cpu+0x9d/0xe0
May 18 01:11:14 localhost kernel:  cpuhp_invoke_callback+0x94/0x550
May 18 01:11:14 localhost kernel:  ? ring_buffer_record_is_set_on+0x10/0x10
May 18 01:11:14 localhost kernel:  _cpu_up+0x141/0x150
May 18 01:11:14 localhost kernel:  ? do_early_param+0x91/0x91
May 18 01:11:14 localhost kernel:  do_cpu_up+0x7b/0xc0
May 18 01:11:14 localhost kernel:  smp_init+0xc8/0xcd
May 18 01:11:14 localhost kernel:  kernel_init_freeable+0x112/0x258
May 18 01:11:14 localhost kernel:  ? rest_init+0xaa/0xaa
May 18 01:11:14 localhost kernel:  kernel_init+0xa/0x106
May 18 01:11:14 localhost kernel:  ret_from_fork+0x35/0x40
May 18 01:11:14 localhost kernel: ---[ end trace 66b766237d675325 ]---

Do we still have the machine running somewhere?
Yes, this cluster is still running. I can grab whatever you need or give you ssh access if necessary.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
*** Bug 1974313 has been marked as a duplicate of this bug. ***