Bug 1836641 - Pods using local storage do not spin up and report xfs errors
Summary: Pods using local storage do not spin up and report xfs errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Hemant Kumar
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-17 15:14 UTC by Sai Sindhur Malleni
Modified: 2020-07-13 17:39 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:39:31 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHBA-2020:2409 - Last Updated: 2020-07-13 17:39:51 UTC

Description Sai Sindhur Malleni 2020-05-17 15:14:26 UTC
Description of problem:
Used the local-storage operator to expose one NVMe disk per master node as a PV:
apiVersion: v1
kind: Namespace
metadata:
  name: local-storage
---
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: local-storage
spec:
  targetNamespaces:
    - local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: local-storage
spec:
  channel: "4.4"
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
=========================================================
[kni@e19-h24-b01-fc640 local-storage]$ cat volume.yaml 
apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-disks"
  namespace: "local-storage"
spec:
  tolerations:
    - key: storage
      operator: Equal 
      value: "true"
  storageClassDevices:
    - storageClassName: "local-sc"
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme0n1

All master nodes are tainted with: oc adm taint node master-<node> storage=true:NoSchedule
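For reference, a sketch of how that taint might be applied and verified across the masters (the node names master-0/1/2 are illustrative, not taken from this report):

# Taint each master so only tolerating workloads (e.g. the ES pods) schedule there
for node in master-0 master-1 master-2; do
  oc adm taint node "$node" storage=true:NoSchedule
done

# Verify the taint landed on a node
oc describe node master-2 | grep Taints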


Then tried to deploy cluster-logging to use those PVs:
[kni@e19-h24-b01-fc640 local-storage]$ cat ~/logging/instance.yaml 
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      tolerations:
        - key: storage
          operator: Equal
          value: "true"
      nodeCount: 3
      storage:
        storageClassName: local-sc
        size: 100G
      redundancyPolicy: "SingleRedundancy"
  visualization:
    tolerations:
      - key: storage
        operator: Equal
        value: "true"
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "30 3 * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

The ES pods never go into Running and are stuck in ContainerCreating.

Looking at oc describe pod:

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    19m                  default-scheduler  Successfully assigned openshift-logging/elasticsearch-cdm-w4a280ij-1-6bd64d7578-x8nb6 to master-2
  Warning  FailedMount  17m                  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage elasticsearch-config certificates]: timed out waiting for the condition
  Warning  FailedMount  9m18s (x2 over 11m)  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-config certificates elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage]: timed out waiting for the condition
  Warning  FailedMount  7m15s                kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-metrics elasticsearch-storage elasticsearch-config certificates elasticsearch-token-vnrv8]: timed out waiting for the condition
  Warning  FailedMount  3m8s (x3 over 15m)   kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[certificates elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage elasticsearch-config]: timed out waiting for the condition
  Warning  FailedMount  68s (x17 over 19m)   kubelet, master-2  MountVolume.MountDevice failed for volume "local-pv-a2502689" : local: failed to mount device /mnt/local-storage/local-sc/nvme0n1 at /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-a2502689 (fstype: xfs), error 'xfs_repair' found errors on device /mnt/local-storage/local-sc/nvme0n1 but could not correct them: Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
  Warning  FailedMount  65s (x2 over 13m)  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-storage elasticsearch-config certificates elasticsearch-token-vnrv8 elasticsearch-metrics]: timed out waiting for the condition 

Please find journal on master-2 attached.
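For what it's worth, the xfs_repair output above describes a manual recovery path. A hedged sketch of what that would look like when run directly on the affected node (device path taken from the LocalVolume spec above; the temporary mountpoint is illustrative, and this is only what the error message itself suggests, not the fix for this bug):

# On master-2: mount the device once so XFS can replay its log, then unmount
mkdir -p /mnt/xfs-recover
mount -t xfs /dev/nvme0n1 /mnt/xfs-recover
umount /mnt/xfs-recover

# Re-run xfs_repair now that the log has been replayed
xfs_repair /dev/nvme0n1

# Last resort only (risks data loss), as the message itself warns:
# xfs_repair -L /dev/nvme0n1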

Version-Release number of selected component (if applicable):
4.5
How reproducible:

Steps to Reproduce:
1. Deploy cluster
2. Deploy local-storage operator
3. Try to use a PVC from the local-sc StorageClass (a minimal example PVC is sketched below)
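For step 3, a minimal PVC targeting the local-sc StorageClass would look roughly like this (the name and requested size are illustrative, not taken from the report; in the report the PVCs were created by the ClusterLogging/Elasticsearch operator):

$ cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-local-pvc        # illustrative name
  namespace: openshift-logging
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-sc
  resources:
    requests:
      storage: 100G

With a local StorageClass the claim typically stays Pending until a consuming pod is scheduled (WaitForFirstConsumer binding).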

Actual results:
Pods stuck in ContainerCreating

Expected results:
Pods should spin up successfully

Master Log
http://rdu-storage01.scalelab.redhat.com/sai/journal.tar.gz
Node Log (of failed PODs):

PV Dump:
[kni@e19-h24-b01-fc640 logging]$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                          STORAGECLASS   REASON   AGE
local-pv-2f659adb   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-1   local-sc                19m
local-pv-5e03c2b0   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-2   local-sc                20m
local-pv-a2502689   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-3   local-sc                19m

(Please note: after hitting the issue, I deleted the PVs and recreated them to try to get things working, but that did not help.)
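For reference, that delete/recreate attempt amounts to something like the following (PV names taken from the dump above; recreation is assumed to be handled by the local-storage operator's provisioner once the backing devices are rediscovered):

oc delete pv local-pv-2f659adb local-pv-5e03c2b0 local-pv-a2502689
# The local-storage operator's provisioner should recreate PVs for the devices
oc get pv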

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 1 Hemant Kumar 2020-05-20 16:23:21 UTC
We seem to have a buggy version of the mount library bundled in OCP 4.5. Opened a fix against master - https://github.com/openshift/origin/pull/25006 - and will backport once merged.

Comment 7 Qin Ping 2020-06-01 06:59:35 UTC
Verified with: 4.5.0-0.nightly-2020-05-30-025738

Comment 9 errata-xmlrpc 2020-07-13 17:39:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

