Created attachment 1570915 [details]
pod / pvc / pv / storageclass dumps

Description of problem:
During configuration of a fresh HTB6 install, the monitoring clusteroperator failed due to a storage mounting issue.

Events:
  Type     Reason       Age                    From                                  Message
  ----     ------       ----                   ----                                  -------
  Warning  FailedMount  6m23s (x756 over 35h)  kubelet, ip-10-0-129-19.ec2.internal  Unable to mount volumes for pod "prometheus-k8s-1_openshift-monitoring(e0214bbe-790a-11e9-bf2a-128845da01f8)": timeout expired waiting for volumes to attach or mount for pod "openshift-monitoring"/"prometheus-k8s-1". list of unmounted volumes=[prometheus-k8s-db]. list of unattached volumes=[prometheus-k8s-db config config-out prometheus-k8s-rulefiles-0 secret-kube-etcd-client-certs secret-prometheus-k8s-tls secret-prometheus-k8s-proxy secret-prometheus-k8s-htpasswd secret-kube-rbac-proxy configmap-serving-certs-ca-bundle configmap-kubelet-serving-ca-bundle prometheus-k8s-token-l47bh]
  Warning  FailedMount  80s (x1229 over 35h)   kubelet, ip-10-0-129-19.ec2.internal  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-e01ca78c-790a-11e9-bf2a-128845da01f8" : failed to mount the volume as "ext4", it already contains unknown data, probably partitions. Mount error: mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d --scope -- mount -t ext4 -o defaults /dev/xvdcc /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d
Output: Running scope as unit: run-rc658e4b20a404c48b09cbf1e00ccc68f.scope
mount: /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d: wrong fs type, bad option, bad superblock on /dev/xvdcc, missing codepage or helper program, or other error.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

How reproducible:
Unknown

Steps to Reproduce:
1. Fresh install of HTB6
2. Configure openshift-ingress (this is probably unrelated other than it caused the monitoring pods to restart)

Actual results:
The monitoring operator stays in a degraded state.

Expected results:
The PV should mount again and the monitoring operator should report ready.

Additional info:
See attachments
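For triage, the affected resources can be mapped back to the underlying EBS volume with something like the following (a sketch only; the PVC name is assumed from the usual StatefulSet naming convention and should be checked against the attached dumps):

# Map the failing pod to its claim, PV and EBS volume ID.
oc -n openshift-monitoring describe pod prometheus-k8s-1
oc -n openshift-monitoring get pvc prometheus-k8s-db-prometheus-k8s-1
oc get pv pvc-e01ca78c-790a-11e9-bf2a-128845da01f8 -o yaml | grep -A2 awsElasticBlockStore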
Version was set to 4.1.z, so I changed it to 4.1.0 and the target to 4.1.z - please review whether this is correct. Since there hasn't been a 4.1.z release yet, the bug version should be 4.1.0 and the target 4.1.z so the fix can be delivered in a z-stream release.
Looks like the FS on the volume is corrupted:

May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: I0518 01:19:00.624834    1353 operation_generator.go:510] MountVolume.WaitForAttach succeeded for volume "pvc-e01ca78c-790a-11e9-bf2a-128845da01f8" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0b2f4de47fffb627d") pod "prometheus-k8s-1" (UID: "e0214bbe-790a-11e9-bf2a-128845da01f8") DevicePath "/dev/xvdcc"
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: I0518 01:19:00.666392    1353 mount_linux.go:454] `fsck` error fsck from util-linux 2.32.1
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: fsck.ext2: Bad magic number in super-block while trying to open /dev/xvdcc
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: /dev/xvdcc:
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: The superblock could not be read or does not describe a valid ext2/ext3/ext4
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: filesystem. If the device is valid and it really contains an ext2/ext3/ext4
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: filesystem (and not swap or ufs or something else), then the superblock
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: is corrupt, and you might try running e2fsck with an alternate superblock:
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:     e2fsck -b 8193 <device>
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:  or
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]:     e2fsck -b 32768 <device>
May 18 01:19:00 ip-10-0-129-19 hyperkube[1353]: Found a atari partition table in /dev/xvdcc

Did you encounter the problem multiple times?
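If it helps the analysis, a few read-only checks on the node should confirm exactly which signatures are present on the device (a sketch; /dev/xvdcc is taken from the log above, and the e2fsck step should only be attempted against a snapshot of the EBS volume):

# Show any filesystem or partition-table signatures visible on the device.
blkid -p /dev/xvdcc
lsblk -f /dev/xvdcc

# List (without erasing) the magic signatures wipefs can see, e.g. the stray
# "atari" partition table reported by fsck.
wipefs --no-act /dev/xvdcc

# Only if the device is believed to really contain ext4: try an alternate
# superblock, as the fsck output suggests.
e2fsck -b 32768 /dev/xvdcc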
I tried restarting the pod with an ext4 volume mounted, but could not reproduce it. @jupierce, could you provide detailed steps?
Just an additional observation: it looks like we're mounting an unformatted volume that is not completely empty (the "atari partition table"), so Kubernetes will not try to format it.
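For context, the effective decision the kubelet makes before mounting looks roughly like the following (a simplified sketch of the behaviour, not the actual mount_linux.go code): the device is only formatted when it appears completely blank, so any leftover signature, even a bogus partition table, blocks the format, and the plain mount then fails.

# Simplified sketch of the format-and-mount decision (illustration only).
DEVICE=/dev/xvdcc
FSTYPE=ext4
TARGET=/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1a/vol-0b2f4de47fffb627d

if [ -z "$(blkid -p -o value -s TYPE "$DEVICE")" ] && \
   [ -z "$(blkid -p -o value -s PTTYPE "$DEVICE")" ]; then
  # Device looks completely empty: safe to create the requested filesystem.
  mkfs."$FSTYPE" "$DEVICE"
  mount -t "$FSTYPE" -o defaults "$DEVICE" "$TARGET"
else
  # Anything already on the device (a filesystem or a partition table, even a
  # bogus "atari" one) means no format is attempted; the plain mount is tried
  # and fails here with exit status 32, as seen in the events above.
  mount -t "$FSTYPE" -o defaults "$DEVICE" "$TARGET"
fi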
@chaoyang - I don't know how to reproduce it reliably. The steps reported were simply what triggered the condition in my environment.
Thanks. I don't expect to be given steps to reproduce. I only wondered whether this is something you encounter frequently. The "error 32" is actually not a problem in itself, but the junk that looks like a partition table is.
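If the data on this particular volume is expendable (a fresh Prometheus data volume usually is), one possible workaround - offered only as a sketch, not a verified procedure, and with the PVC name assumed from the usual StatefulSet naming - is to either clear the stray signatures on the node so the kubelet formats the device on the next attach, or delete the claim so a new volume gets provisioned:

# Option 1: on the node, wipe the bogus signatures so the next mount attempt
# formats the volume (this destroys whatever is currently on /dev/xvdcc).
wipefs --all /dev/xvdcc

# Option 2: discard the volume and let the StatefulSet re-provision it
# (PVC name assumed; verify with 'oc -n openshift-monitoring get pvc').
oc -n openshift-monitoring delete pvc prometheus-k8s-db-prometheus-k8s-1
oc -n openshift-monitoring delete pod prometheus-k8s-1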
This has only happened once. I installed an identical cluster later in the day and it did not experience this. The cluster affected by the corrupted PV is still in this state, if that would help analysis.
The only other interesting thing on the node is the following kernel backtrace, emitted for 4 of the 8 cores. It's definitely strange, but whether it's related to the problem is hard to guess (possible memory corruption?):

May 18 01:11:14 localhost kernel: installing Xen timer for CPU 7
May 18 01:11:14 localhost kernel: #7
May 18 01:11:14 localhost kernel: WARNING: CPU: 0 PID: 1 at drivers/xen/events/events_base.c:1110 unbind_from_irqhandler+0x34/0x40
May 18 01:11:14 localhost kernel: Modules linked in:
May 18 01:11:14 localhost kernel: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W --------- - - 4.18.0-80.1.2.el8_0.x86_64 #1
May 18 01:11:14 localhost kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
May 18 01:11:14 localhost kernel: RIP: 0010:unbind_from_irqhandler+0x34/0x40
May 18 01:11:14 localhost kernel: Code: 89 fb e8 ff 1c c3 ff 48 85 c0 74 1e 48 8b 40 10 48 83 78 08 00 74 13 89 df 48 89 ee e8 75 f9 c2 ff 89 df 5b 5d e9 bc fe ff ff <0f> 0b 5b 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e8 46 f2 ff ff
May 18 01:11:14 localhost kernel: RSP: 0000:ffff9cb283137d58 EFLAGS: 00010246
May 18 01:11:14 localhost kernel: RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000006
May 18 01:11:14 localhost kernel: RDX: 0000000000000028 RSI: 00000000ffffffff RDI: ffffffff95459e40
May 18 01:11:14 localhost kernel: RBP: 0000000000000000 R08: ffff8fe08780d4d1 R09: ffffffff94218d96
May 18 01:11:14 localhost kernel: R10: ffffc2d7441f0040 R11: 000000000000000d R12: ffff8fe76f9c0000
May 18 01:11:14 localhost kernel: R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff942aaf30
May 18 01:11:14 localhost kernel: FS:  0000000000000000(0000) GS:ffff8fe76f800000(0000) knlGS:0000000000000000
May 18 01:11:14 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 01:11:14 localhost kernel: CR2: ffff8fe565ef0000 CR3: 00000005e540a001 CR4: 00000000001606f0
May 18 01:11:14 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 18 01:11:14 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 18 01:11:14 localhost kernel: Call Trace:
May 18 01:11:14 localhost kernel:  xen_uninit_lock_cpu+0x28/0x62
May 18 01:11:14 localhost kernel:  xen_hvm_cpu_die+0x21/0x30
May 18 01:11:14 localhost kernel:  takedown_cpu+0x9d/0xe0
May 18 01:11:14 localhost kernel:  cpuhp_invoke_callback+0x94/0x550
May 18 01:11:14 localhost kernel:  ? ring_buffer_record_is_set_on+0x10/0x10
May 18 01:11:14 localhost kernel:  _cpu_up+0x141/0x150
May 18 01:11:14 localhost kernel:  ? do_early_param+0x91/0x91
May 18 01:11:14 localhost kernel:  do_cpu_up+0x7b/0xc0
May 18 01:11:14 localhost kernel:  smp_init+0xc8/0xcd
May 18 01:11:14 localhost kernel:  kernel_init_freeable+0x112/0x258
May 18 01:11:14 localhost kernel:  ? rest_init+0xaa/0xaa
May 18 01:11:14 localhost kernel:  kernel_init+0xa/0x106
May 18 01:11:14 localhost kernel:  ret_from_fork+0x35/0x40
May 18 01:11:14 localhost kernel: ---[ end trace 66b766237d675325 ]---

Do we still have the machine running somewhere?
Yes, this cluster is still running. I can grab whatever you need or give you ssh access if necessary.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
*** Bug 1974313 has been marked as a duplicate of this bug. ***