Bug 1836641 - Pods using local storage do not spin up and report xfs errors
Summary: Pods using local storage do not spin up and report xfs errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Hemant Kumar
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-17 15:14 UTC by Sai Sindhur Malleni
Modified: 2020-07-13 17:39 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:39:31 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHBA-2020:2409 - Last Updated: 2020-07-13 17:39:51 UTC

Description Sai Sindhur Malleni 2020-05-17 15:14:26 UTC
Description of problem:
Used the local-storage operator to expose one NVMe disk per master node as a PV:
apiVersion: v1
kind: Namespace
metadata:
  name: local-storage
---
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: local-storage
spec:
  targetNamespaces:
    - local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: local-storage
spec:
  channel: "4.4"
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
=========================================================
[kni@e19-h24-b01-fc640 local-storage]$ cat volume.yaml 
apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-disks"
  namespace: "local-storage"
spec:
  tolerations:
    - key: storage
      operator: Equal 
      value: "true"
  storageClassDevices:
    - storageClassName: "local-sc"
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme0n1

All master nodes are tainted with: oc adm taint node master-<node> storage=true:NoSchedule
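For reference, a sketch of how that taint might be applied and verified across the masters (the node names master-0/1/2 are illustrative, not taken from this report):

# Taint each master so only tolerating workloads (e.g. the ES pods) schedule there
for node in master-0 master-1 master-2; do
  oc adm taint node "$node" storage=true:NoSchedule
done

# Verify the taint landed on a node
oc describe node master-2 | grep Taints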


Then tried to deploy cluster-logging to use those PVs:
[kni@e19-h24-b01-fc640 local-storage]$ cat ~/logging/instance.yaml 
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      tolerations:
        - key: storage
          operator: Equal
          value: "true"
      nodeCount: 3
      storage:
        storageClassName: local-sc
        size: 100G
      redundancyPolicy: "SingleRedundancy"
  visualization:
    tolerations:
      - key: storage
        operator: Equal
        value: "true"
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "30 3 * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

The ES pods never go into Running and are stuck in ContainerCreating.

Looking at oc describe pod:

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    19m                  default-scheduler  Successfully assigned openshift-logging/elasticsearch-cdm-w4a280ij-1-6bd64d7578-x8nb6 to master-2
  Warning  FailedMount  17m                  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage elasticsearch-config certificates]: timed out waiting for the condition
  Warning  FailedMount  9m18s (x2 over 11m)  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-config certificates elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage]: timed out waiting for the condition
  Warning  FailedMount  7m15s                kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-metrics elasticsearch-storage elasticsearch-config certificates elasticsearch-token-vnrv8]: timed out waiting for the condition
  Warning  FailedMount  3m8s (x3 over 15m)   kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[certificates elasticsearch-token-vnrv8 elasticsearch-metrics elasticsearch-storage elasticsearch-config]: timed out waiting for the condition
  Warning  FailedMount  68s (x17 over 19m)   kubelet, master-2  MountVolume.MountDevice failed for volume "local-pv-a2502689" : local: failed to mount device /mnt/local-storage/local-sc/nvme0n1 at /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-a2502689 (fstype: xfs), error 'xfs_repair' found errors on device /mnt/local-storage/local-sc/nvme0n1 but could not correct them: Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
  Warning  FailedMount  65s (x2 over 13m)  kubelet, master-2  Unable to attach or mount volumes: unmounted volumes=[elasticsearch-storage], unattached volumes=[elasticsearch-storage elasticsearch-config certificates elasticsearch-token-vnrv8 elasticsearch-metrics]: timed out waiting for the condition 

Please find journal on master-2 attached.
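For what it's worth, the xfs_repair output above describes a manual recovery path. A hedged sketch of what that would look like when run directly on the affected node (device path taken from the LocalVolume spec above; the temporary mountpoint is illustrative, and this is only what the error message itself suggests, not the fix for this bug):

# On master-2: mount the device once so XFS can replay its log, then unmount
mkdir -p /mnt/xfs-recover
mount -t xfs /dev/nvme0n1 /mnt/xfs-recover
umount /mnt/xfs-recover

# Re-run xfs_repair now that the log has been replayed
xfs_repair /dev/nvme0n1

# Last resort only (risks data loss), as the message itself warns:
# xfs_repair -L /dev/nvme0n1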

Version-Release number of selected component (if applicable):
4.5
How reproducible:

Steps to Reproduce:
1. Deploy cluster
2. Deploy local-storage operator
3. Try to use a PVC from the local-sc StorageClass (a minimal example PVC is sketched below)
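For step 3, a minimal PVC targeting the local-sc StorageClass would look roughly like this (the name and requested size are illustrative, not taken from the report; in the report the PVCs were created by the ClusterLogging/Elasticsearch operator):

$ cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-local-pvc        # illustrative name
  namespace: openshift-logging
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-sc
  resources:
    requests:
      storage: 100G

With a local StorageClass the claim typically stays Pending until a consuming pod is scheduled (WaitForFirstConsumer binding).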

Actual results:
Pods stuck in ContainerCreating

Expected results:
Pods should spin up successfully

Master Log
http://rdu-storage01.scalelab.redhat.com/sai/journal.tar.gz
Node Log (of failed PODs):

PV Dump:
[kni@e19-h24-b01-fc640 logging]$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                          STORAGECLASS   REASON   AGE
local-pv-2f659adb   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-1   local-sc                19m
local-pv-5e03c2b0   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-2   local-sc                20m
local-pv-a2502689   1490Gi     RWO            Delete           Released   openshift-logging/elasticsearch-elasticsearch-cdm-5b7zrrqe-3   local-sc                19m

(Please note: after hitting the issue, I deleted the PVs and recreated them to try to get things working, but that did not help.)
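For reference, that delete/recreate attempt amounts to something like the following (PV names taken from the dump above; recreation is assumed to be handled by the local-storage operator's provisioner once the backing devices are rediscovered):

oc delete pv local-pv-2f659adb local-pv-5e03c2b0 local-pv-a2502689
# The local-storage operator's provisioner should recreate PVs for the devices
oc get pv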

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 1 Hemant Kumar 2020-05-20 16:23:21 UTC
We seem to have a buggy version of the mount library bundled in OCP 4.5. Opened a fix against master - https://github.com/openshift/origin/pull/25006 - and will backport once merged.

Comment 7 Qin Ping 2020-06-01 06:59:35 UTC
Verified with: 4.5.0-0.nightly-2020-05-30-025738

Comment 9 errata-xmlrpc 2020-07-13 17:39:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

