Bug 1646945 - logging-es-data-master pod went into "CrashLoopBackOff" state post upgrading the OCP 3.9 nodes from RHEL 7.5 to RHEL 7.6
Summary: logging-es-data-master pod went into "CrashLoopBackOff" state post upgrading ...
Keywords:
Status: CLOSED DUPLICATE of bug 1624678
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Xiubo Li
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1624670 1624678
Blocks:
 
Reported: 2018-11-06 11:14 UTC by Manisha Saini
Modified: 2019-07-05 18:10 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 02:51:20 UTC
Embargoed:


Attachments

Description Manisha Saini 2018-11-06 11:14:29 UTC
Description of problem:

I had a CRS setup with OCP 3.9 and OCS 3.9, configured using the Ansible playbooks. I upgraded OCS from 3.9 to 3.11 (RHEL 7.5), with OCP left at 3.9 (not upgraded).
All pods were up and running after the upgrade to OCS 3.11.

Later I upgraded the operating system on the OCP nodes (OCP version 3.9) from RHEL 7.5 to RHEL 7.6, one node at a time. While upgrading the OS on my infra node, I drained the pods on that node, ran yum update, and then made the node schedulable again.

After that, I observed that "logging-es-data-master" went into "CrashLoopBackOff" state. Everything was up and running before the upgrade. Describing the pod shows the error "Back-off restarting failed container".

Note: the logging pods were using a block PV as storage
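
For reference, the failing container can be inspected with the standard oc commands below (a minimal sketch using the pod and namespace from this report).

Logs from the previous, crashed run of the elasticsearch container:
# oc logs logging-es-data-master-aefwnylm-1-kkj4s -c elasticsearch --previous -n logging

Kubelet events and the last recorded container state:
# oc describe pod logging-es-data-master-aefwnylm-1-kkj4s -n logging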

=============
# oc get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                       KIND      SUBOBJECT                        TYPE      REASON    SOURCE                                      MESSAGE
11m         13h          162       logging-es-data-master-aefwnylm-1-kkj4s.1564561f0d6759c1   Pod       spec.containers{elasticsearch}   Normal    Pulled    kubelet, dhcp46-34.lab.eng.blr.redhat.com   Container image "registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43" already present on machine
1m          13h          3639      logging-es-data-master-aefwnylm-1-kkj4s.156456221d8f7776   Pod       spec.containers{elasticsearch}   Warning   BackOff   kubelet, dhcp46-34.lab.eng.blr.redhat.com   Back-off restarting failed container

=============

# oc project
Using project "logging" on server "https://dhcp46-169.lab.eng.blr.redhat.com:8443".

# oc get pods
NAME                                      READY     STATUS             RESTARTS   AGE
logging-curator-1-22rmf                   1/1       Running            0          21h
logging-es-data-master-aefwnylm-1-kkj4s   1/2       CrashLoopBackOff   160        21h
logging-fluentd-2l4s7                     1/1       Running            2          9d
logging-fluentd-57b9k                     1/1       Running            1          9d
logging-fluentd-l9m2n                     1/1       Running            1          9d
logging-fluentd-pcpbt                     1/1       Running            2          9d
logging-kibana-1-px7rn                    2/2       Running            0          21h

# oc get pvc
NAME           STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
logging-es-0   Bound     pvc-6c2eacc3-da11-11e8-9779-005056a5b7d6   20Gi       RWO            glusterfs-registry-block   9d


Version-Release number of selected component (if applicable):

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

# rpm -qa | grep gluster-block
gluster-block-0.2.1-28.el7rhgs.x86_64

glusterfs-3.12.2-25.el7rhgs.x86_64

OCP: 3.9 (RHEL 7.6)
OCS: 3.11

How reproducible:
1/1


Steps to Reproduce:
1. Configure a CRS setup with OCS 3.9 and OCP 3.9 using the Ansible playbooks
(4 OCP nodes, 4 gluster nodes, 3 glusterfs-registry nodes)

2. Upgrade only OCS from 3.9 to 3.11
3. Upgrade the base OS on the glusterfs-registry nodes and the gluster nodes from RHEL 7.5 to RHEL 7.6
4. Upgrade the base OS on all the OCP nodes from RHEL 7.5 to RHEL 7.6, one node at a time, starting with the master node

To perform the OS upgrade steps, drain the node that needs to be upgraded, run yum update, and then make the node schedulable again (see the command sketch below).
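
A minimal sketch of the per-node procedure, assuming the standard oc adm drain/uncordon workflow (<node> is a placeholder; reboot only if a new kernel was installed):

# oc adm drain <node> --ignore-daemonsets --delete-local-data
# yum update -y
# reboot
# oc adm uncordon <node>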


Actual results:

After upgrading the infra node, the "logging-es-data-master-aefwnylm-1-kkj4s" pod went into "CrashLoopBackOff" state and failed with "Back-off restarting failed container".


Expected results:
After the OS upgrade, all pods should come back up and running; no pods should be left in Pending or CrashLoopBackOff state.

Additional info:

dmesg on the node where the pod was trying to come up shows the following trace:

=========================
[15904.570921]  1} (detected by 2, t=74458 jiffies, g=1635418, c=1635417, q=329)
[15904.570925] Task dump for CPU 1:
[15904.570928] openshift-route R  running task        0 19072  18920 0x00000088
[15904.570932] Call Trace:
[15904.570944]  [<ffffffffb2f6770f>] ? __schedule+0x3ff/0x890
[15904.570950]  [<ffffffffb28c5cad>] ? hrtimer_start_range_ns+0x1ed/0x3c0
[15904.570953]  [<ffffffffb2f67bc9>] schedule+0x29/0x70
[15904.570959]  [<ffffffffb290ce56>] futex_wait_queue_me+0xc6/0x130
[15904.570962]  [<ffffffffb290db3b>] futex_wait+0x17b/0x280
[15904.570966]  [<ffffffffb28c57e0>] ? hrtimer_get_res+0x50/0x50
[15904.570969]  [<ffffffffb290ce34>] ? futex_wait_queue_me+0xa4/0x130
[15904.570972]  [<ffffffffb290f886>] do_futex+0x106/0x5a0
[15904.570978]  [<ffffffffb2a57044>] ? poll_select_copy_remaining+0x144/0x180
[15904.570981]  [<ffffffffb290fda0>] SyS_futex+0x80/0x190
[15904.570987]  [<ffffffffb2f74ddb>] system_call_fastpath+0x22/0x27
[15904.573437]  crc_t10dif crct10dif_generic ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm vmxnet3 drm ahci ata_piix libahci libata vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_multipath dm_mod
[15904.574594] CPU: 3 PID: 18152 Comm: fluentd Kdump: loaded Not tainted 3.10.0-957.el7.x86_64 #1
[15904.574598] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[15904.574601] task: ffff9a8cfdb0b0c0 ti: ffff9a8d03f00000 task.ti: ffff9a8d03f00000
[15904.575395] RIP: 0010:[<ffffffffb2f6b4f0>]  [<ffffffffb2f6b4f0>] retint_careful+0xe/0x32
[15904.575406] RSP: 0000:ffff9a8d03f03f88  EFLAGS: 00000203
[15904.575408] RAX: 0140000000000000 RBX: ffff9a8cfcfac270 RCX: ffff9a8d03f00000
[15904.575410] RDX: 0000000000000008 RSI: ffff9a8d03f03f78 RDI: 000000000000fe0e
[15904.575412] RBP: 0000000000000008 R08: 0000000000000000 R09: ffff9a8cfdb0b0c0
[15904.575414] R10: 00007f5c7bf504a8 R11: 00007f5c7b7ff220 R12: ffff9a8cfcfac270
[15904.575415] R13: ffff9a8cfcfac000 R14: 00000000fcfac270 R15: ffff9a8cfcfac288
[15904.575418] FS:  00007f5c7a53e700(0000) GS:ffff9a8d7fd80000(0000) knlGS:0000000000000000
[15904.575420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15904.575422] CR2: 000000c42aefd000 CR3: 00000007b97a8000 CR4: 00000000003607e0
[15904.575487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15904.575489] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[15904.575491] Call Trace:
[15904.575494] Code: 54 24 30 48 8b 74 24 38 48 8b 7c 24 40 48 83 c4 50 48 cf 0f 1f 40 00 0f 1f 40 00 48 cf 0f ba e2 03 73 2c fb 0f 1f 80 00 00 00 00 <57> e8 0a d5 ff ff 5f 65 48 8b 0c 25 78 0e 01 00 48 81 e9 d8 3f
[15904.577618]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4310546003, last ping 4310546002, now 4310570655
[15904.609212]  connection2:0: detected conn error (1022)
[15904.958513]  connection17:0: detected conn error (1020)
[15904.958797]  connection15:0: detected conn error (1020)
[15905.004930] sd 47:0:0:0: [sdj] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[15905.004940] sd 47:0:0:0: [sdj] CDB: Test Unit Ready 00 00 00 00 00 00
[15907.399802] XFS (dm-41): Mounting V5 Filesystem
[15907.590178] XFS (dm-41): Ending clean mount
[15907.605316] XFS (dm-39): Unmounting Filesystem
[15907.678689] XFS (dm-41): Unmounting Filesystem

=============================

# oc describe pod logging-es-data-master-aefwnylm-1-kkj4s
Name:           logging-es-data-master-aefwnylm-1-kkj4s
Namespace:      logging
Node:           dhcp46-34.lab.eng.blr.redhat.com/10.70.46.34
Start Time:     Tue, 06 Nov 2018 02:30:37 +0530
Labels:         component=es
                deployment=logging-es-data-master-aefwnylm-1
                deploymentconfig=logging-es-data-master-aefwnylm
                logging-infra=elasticsearch
                provider=openshift
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=logging-es-data-master-aefwnylm
                openshift.io/deployment.name=logging-es-data-master-aefwnylm-1
                openshift.io/scc=restricted
Status:         Running
IP:             10.128.0.47
Controlled By:  ReplicationController/logging-es-data-master-aefwnylm-1
Containers:
  elasticsearch:
    Container ID:   docker://6f24e5e71f9ec0604592c6a449e1f82756d27b06928a605ee29d932a10171e18
    Image:          registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43
    Image ID:       docker-pullable://registry.access.redhat.com/openshift3/logging-elasticsearch@sha256:da926931413f2470b37d149b071077b70987449dfe90fc99cd5e74dbd6e2db22
    Ports:          9200/TCP, 9300/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 06 Nov 2018 16:16:14 +0530
      Finished:     Tue, 06 Nov 2018 16:16:18 +0530
    Ready:          False
    Restart Count:  163
    Limits:
      memory:  8Gi
    Requests:
      cpu:      1
      memory:   8Gi
    Readiness:  exec [/usr/share/java/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    Environment:
      DC_NAME:                  logging-es-data-master-aefwnylm
      NAMESPACE:                logging (v1:metadata.namespace)
      KUBERNETES_TRUST_CERT:    true
      SERVICE_DNS:              logging-es-cluster
      CLUSTER_NAME:             logging-es
      INSTANCE_RAM:             8Gi
      HEAP_DUMP_LOCATION:       /elasticsearch/persistent/heapdump.hprof
      NODE_QUORUM:              1
      RECOVER_EXPECTED_NODES:   1
      RECOVER_AFTER_TIME:       5m
      READINESS_PROBE_TIMEOUT:  30
      POD_LABEL:                component=es
      IS_MASTER:                true
      HAS_DATA:                 true
      PROMETHEUS_USER:          system:serviceaccount:openshift-metrics:prometheus
    Mounts:
      /elasticsearch/persistent from elasticsearch-storage (rw)
      /etc/elasticsearch/secret from elasticsearch (ro)
      /usr/share/java/elasticsearch/config from elasticsearch-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-elasticsearch-token-b6hvv (ro)
  proxy:
    Container ID:  docker://1c809fd4f1997f2bbcf96a6a139cfa4ae82069d00a9d103a812cf20cd8057614
    Image:         registry.access.redhat.com/openshift3/oauth-proxy:v3.9.43
    Image ID:      docker-pullable://registry.access.redhat.com/openshift3/oauth-proxy@sha256:ba9fba2531a9af5fdca95b948a0d1cf974e787c6af074e9695d8c63edfd61f0c
    Port:          4443/TCP
    Args:
      --upstream-ca=/etc/elasticsearch/secret/admin-ca
      --https-address=:4443
      -provider=openshift
      -client-id=system:serviceaccount:logging:aggregated-logging-elasticsearch
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret=bmZJeXpoaDhWZ041WW44SQ==
      -basic-auth-password=tdzWWpPLNZLXuUYA
      -upstream=https://localhost:9200
      -openshift-sar={"namespace": "logging", "verb": "view", "resource": "prometheus", "group": "metrics.openshift.io"}
      -openshift-delegate-urls={"/": {"resource": "prometheus", "verb": "view", "group": "metrics.openshift.io", "namespace": "logging"}}
      --tls-cert=/etc/tls/private/tls.crt
      --tls-key=/etc/tls/private/tls.key
      -pass-access-token
      -pass-user-headers
    State:          Running
      Started:      Tue, 06 Nov 2018 02:31:05 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  64Mi
    Requests:
      cpu:        100m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /etc/elasticsearch/secret from elasticsearch (ro)
      /etc/tls/private from proxy-tls (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-elasticsearch-token-b6hvv (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  proxy-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-tls
    Optional:    false
  elasticsearch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-elasticsearch
    Optional:    false
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  logging-es-0
    ReadOnly:   false
  aggregated-logging-elasticsearch-token-b6hvv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-elasticsearch-token-b6hvv
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  region=infra
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason   Age                  From                                       Message
  ----     ------   ----                 ----                                       -------
  Normal   Pulled   14m (x162 over 13h)  kubelet, dhcp46-34.lab.eng.blr.redhat.com  Container image "registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43" already present on machine
  Warning  BackOff  4m (x3639 over 13h)  kubelet, dhcp46-34.lab.eng.blr.redhat.com  Back-off restarting failed container

Comment 2 Prasanna Kumar Kalever 2018-11-06 11:24:26 UTC
Can you please attach the following (one way to collect them is sketched after the list):
1. sos-reports
2. the /etc/target directory, as a tarball, from all the pods
3. targetcli ls output from all the pods
4. the /block-meta/ directory, as a tarball, from all the block hosting volumes
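
A sketch of one way to gather this data, run on each gluster node (or inside each glusterfs pod via oc rsh); the /mnt/<block-hosting-volume> mount point is only a placeholder for wherever the block hosting volume is mounted:

# sosreport
# tar czf /tmp/etc-target-$(hostname).tar.gz /etc/target
# targetcli ls > /tmp/targetcli-ls-$(hostname).txt
# tar czf /tmp/block-meta-$(hostname).tar.gz /mnt/<block-hosting-volume>/block-meta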


Thanks!
--
Prasanna

Comment 18 Xiubo Li 2018-11-19 02:51:20 UTC

*** This bug has been marked as a duplicate of bug 1624678 ***

