Description of problem:
I had a CRS setup with OCP 3.9 and OCS 3.9 configured using Ansible playbooks. I upgraded OCS from 3.9 to 3.11 (RHEL 7.5), with OCP still at 3.9 (not upgraded). All pods were up and running after the upgrade to OCS 3.11. Later I upgraded the operating system on the OCP nodes (OCP version 3.9) from RHEL 7.5 to RHEL 7.6 one by one. While upgrading the OS on my infra node, I drained the pods on that node, ran yum update, and made the node schedulable again. After that, the "logging-es-data-master" pod went into "CrashLoopBackOff" state; everything had been up and running before the upgrade. Describing that pod shows the error "Back-off restarting failed container".

Note: the logging pods were using a block PV as storage.

=============
# oc get events
LAST SEEN  FIRST SEEN  COUNT  NAME                                                      KIND  SUBOBJECT                       TYPE     REASON   SOURCE                                     MESSAGE
11m        13h         162    logging-es-data-master-aefwnylm-1-kkj4s.1564561f0d6759c1  Pod   spec.containers{elasticsearch}  Normal   Pulled   kubelet, dhcp46-34.lab.eng.blr.redhat.com  Container image "registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43" already present on machine
1m         13h         3639   logging-es-data-master-aefwnylm-1-kkj4s.156456221d8f7776  Pod   spec.containers{elasticsearch}  Warning  BackOff  kubelet, dhcp46-34.lab.eng.blr.redhat.com  Back-off restarting failed container
=============
# oc project
Using project "logging" on server "https://dhcp46-169.lab.eng.blr.redhat.com:8443".
# oc get pods
NAME                                      READY  STATUS            RESTARTS  AGE
logging-curator-1-22rmf                   1/1    Running           0         21h
logging-es-data-master-aefwnylm-1-kkj4s   1/2    CrashLoopBackOff  160       21h
logging-fluentd-2l4s7                     1/1    Running           2         9d
logging-fluentd-57b9k                     1/1    Running           1         9d
logging-fluentd-l9m2n                     1/1    Running           1         9d
logging-fluentd-pcpbt                     1/1    Running           2         9d
logging-kibana-1-px7rn                    2/2    Running           0         21h

# oc get pvc
NAME          STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS              AGE
logging-es-0  Bound   pvc-6c2eacc3-da11-11e8-9779-005056a5b7d6  20Gi      RWO           glusterfs-registry-block  9d

Version-Release number of selected component (if applicable):
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
# rpm -qa | grep gluster-block
gluster-block-0.2.1-28.el7rhgs.x86_64
glusterfs-3.12.2-25.el7rhgs.x86_64
OCP: 3.9 (RHEL 7.6)
OCS: 3.11

How reproducible:
1/1

Steps to Reproduce:
1. Configure a CRS setup with OCS 3.9 and OCP 3.9 using Ansible playbooks (4 OCP nodes, 4 gluster nodes, 3 glusterfs-registry nodes).
2. Upgrade only OCS 3.9 to OCS 3.11.
3. Upgrade the base OS on the gluster-registry nodes and gluster nodes from RHEL 7.5 to RHEL 7.6.
4. Upgrade the base OS on the OCP nodes from RHEL 7.5 to RHEL 7.6 one by one, starting with the master node. For each node: drain the node that needs to be upgraded, run yum update, then make the node schedulable again.

Actual results:
After upgrading the infra node, the "logging-es-data-master-aefwnylm-1-kkj4s" pod went into "CrashLoopBackOff" state and failed with "Back-off restarting failed container".

Expected results:
After the OS upgrade, all pods should come up and be running; no pods should be left in Pending or CrashLoopBackOff state.
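The per-node procedure in step 4 (drain, update, reschedule) can be sketched as a small shell function. This is an illustrative sketch only: the drain flags and the ssh-based yum update are assumptions about how the upgrade was performed, not commands taken from the report.

```shell
# Sketch of the per-node OS upgrade flow described in the steps above.
# Flags and the ssh step are assumptions; adjust for the actual cluster.
upgrade_node() {
    local node=$1
    # Evict workloads from the node (DaemonSet pods such as fluentd remain).
    oc adm drain "$node" --ignore-daemonsets --delete-local-data
    # Update the base OS on the node itself and reboot into the new kernel.
    ssh "$node" 'yum -y update && reboot'
    # Once the node is back up, mark it schedulable again.
    oc adm uncordon "$node"
}
# usage: upgrade_node dhcp46-34.lab.eng.blr.redhat.com
```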
Additional info:
dmesg on the node where the pod was trying to come up has the following trace:
=========================
[15904.570921] 1} (detected by 2, t=74458 jiffies, g=1635418, c=1635417, q=329)
[15904.570925] Task dump for CPU 1:
[15904.570928] openshift-route R running task 0 19072 18920 0x00000088
[15904.570932] Call Trace:
[15904.570944] [<ffffffffb2f6770f>] ? __schedule+0x3ff/0x890
[15904.570950] [<ffffffffb28c5cad>] ? hrtimer_start_range_ns+0x1ed/0x3c0
[15904.570953] [<ffffffffb2f67bc9>] schedule+0x29/0x70
[15904.570959] [<ffffffffb290ce56>] futex_wait_queue_me+0xc6/0x130
[15904.570962] [<ffffffffb290db3b>] futex_wait+0x17b/0x280
[15904.570966] [<ffffffffb28c57e0>] ? hrtimer_get_res+0x50/0x50
[15904.570969] [<ffffffffb290ce34>] ? futex_wait_queue_me+0xa4/0x130
[15904.570972] [<ffffffffb290f886>] do_futex+0x106/0x5a0
[15904.570978] [<ffffffffb2a57044>] ? poll_select_copy_remaining+0x144/0x180
[15904.570981] [<ffffffffb290fda0>] SyS_futex+0x80/0x190
[15904.570987] [<ffffffffb2f74ddb>] system_call_fastpath+0x22/0x27
[15904.573437] crc_t10dif crct10dif_generic ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm vmxnet3 drm ahci ata_piix libahci libata vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_multipath dm_mod
[15904.574594] CPU: 3 PID: 18152 Comm: fluentd Kdump: loaded Not tainted 3.10.0-957.el7.x86_64 #1
[15904.574598] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[15904.574601] task: ffff9a8cfdb0b0c0 ti: ffff9a8d03f00000 task.ti: ffff9a8d03f00000
[15904.575395] RIP: 0010:[<ffffffffb2f6b4f0>] [<ffffffffb2f6b4f0>] retint_careful+0xe/0x32
[15904.575406] RSP: 0000:ffff9a8d03f03f88 EFLAGS: 00000203
[15904.575408] RAX: 0140000000000000 RBX: ffff9a8cfcfac270 RCX: ffff9a8d03f00000
[15904.575410] RDX: 0000000000000008 RSI: ffff9a8d03f03f78 RDI: 000000000000fe0e
[15904.575412] RBP: 0000000000000008 R08: 0000000000000000 R09: ffff9a8cfdb0b0c0
[15904.575414] R10: 00007f5c7bf504a8 R11: 00007f5c7b7ff220 R12: ffff9a8cfcfac270
[15904.575415] R13: ffff9a8cfcfac000 R14: 00000000fcfac270 R15: ffff9a8cfcfac288
[15904.575418] FS: 00007f5c7a53e700(0000) GS:ffff9a8d7fd80000(0000) knlGS:0000000000000000
[15904.575420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15904.575422] CR2: 000000c42aefd000 CR3: 00000007b97a8000 CR4: 00000000003607e0
[15904.575487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15904.575489] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[15904.575491] Call Trace:
[15904.575494] Code: 54 24 30 48 8b 74 24 38 48 8b 7c 24 40 48 83 c4 50 48 cf 0f 1f 40 00 0f 1f 40 00 48 cf 0f ba e2 03 73 2c fb 0f 1f 80 00 00 00 00 <57> e8 0a d5 ff ff 5f 65 48 8b 0c 25 78 0e 01 00 48 81 e9 d8 3f
[15904.577618] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4310546003, last ping 4310546002, now 4310570655
[15904.609212] connection2:0: detected conn error (1022)
[15904.958513] connection17:0: detected conn error (1020)
[15904.958797] connection15:0: detected conn error (1020)
[15905.004930] sd 47:0:0:0: [sdj] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[15905.004940] sd 47:0:0:0: [sdj] CDB: Test Unit Ready 00 00 00 00 00 00
[15907.399802] XFS (dm-41): Mounting V5 Filesystem
[15907.590178] XFS (dm-41): Ending clean mount
[15907.605316] XFS (dm-39): Unmounting Filesystem
[15907.678689] XFS (dm-41): Unmounting Filesystem
=============================
# oc describe pod logging-es-data-master-aefwnylm-1-kkj4s
Name:           logging-es-data-master-aefwnylm-1-kkj4s
Namespace:      logging
Node:           dhcp46-34.lab.eng.blr.redhat.com/10.70.46.34
Start Time:     Tue, 06 Nov 2018 02:30:37 +0530
Labels:         component=es
                deployment=logging-es-data-master-aefwnylm-1
                deploymentconfig=logging-es-data-master-aefwnylm
                logging-infra=elasticsearch
                provider=openshift
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=logging-es-data-master-aefwnylm
                openshift.io/deployment.name=logging-es-data-master-aefwnylm-1
                openshift.io/scc=restricted
Status:         Running
IP:             10.128.0.47
Controlled By:  ReplicationController/logging-es-data-master-aefwnylm-1
Containers:
  elasticsearch:
    Container ID:   docker://6f24e5e71f9ec0604592c6a449e1f82756d27b06928a605ee29d932a10171e18
    Image:          registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43
    Image ID:       docker-pullable://registry.access.redhat.com/openshift3/logging-elasticsearch@sha256:da926931413f2470b37d149b071077b70987449dfe90fc99cd5e74dbd6e2db22
    Ports:          9200/TCP, 9300/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 06 Nov 2018 16:16:14 +0530
      Finished:     Tue, 06 Nov 2018 16:16:18 +0530
    Ready:          False
    Restart Count:  163
    Limits:
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Readiness:  exec [/usr/share/java/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    Environment:
      DC_NAME:                  logging-es-data-master-aefwnylm
      NAMESPACE:                logging (v1:metadata.namespace)
      KUBERNETES_TRUST_CERT:    true
      SERVICE_DNS:              logging-es-cluster
      CLUSTER_NAME:             logging-es
      INSTANCE_RAM:             8Gi
      HEAP_DUMP_LOCATION:       /elasticsearch/persistent/heapdump.hprof
      NODE_QUORUM:              1
      RECOVER_EXPECTED_NODES:   1
      RECOVER_AFTER_TIME:       5m
      READINESS_PROBE_TIMEOUT:  30
      POD_LABEL:                component=es
      IS_MASTER:                true
      HAS_DATA:                 true
      PROMETHEUS_USER:          system:serviceaccount:openshift-metrics:prometheus
    Mounts:
      /elasticsearch/persistent from elasticsearch-storage (rw)
      /etc/elasticsearch/secret from elasticsearch (ro)
      /usr/share/java/elasticsearch/config from elasticsearch-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-elasticsearch-token-b6hvv (ro)
  proxy:
    Container ID:  docker://1c809fd4f1997f2bbcf96a6a139cfa4ae82069d00a9d103a812cf20cd8057614
    Image:         registry.access.redhat.com/openshift3/oauth-proxy:v3.9.43
    Image ID:      docker-pullable://registry.access.redhat.com/openshift3/oauth-proxy@sha256:ba9fba2531a9af5fdca95b948a0d1cf974e787c6af074e9695d8c63edfd61f0c
    Port:          4443/TCP
    Args:
      --upstream-ca=/etc/elasticsearch/secret/admin-ca
      --https-address=:4443
      -provider=openshift
      -client-id=system:serviceaccount:logging:aggregated-logging-elasticsearch
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret=bmZJeXpoaDhWZ041WW44SQ==
      -basic-auth-password=tdzWWpPLNZLXuUYA
      -upstream=https://localhost:9200
      -openshift-sar={"namespace": "logging", "verb": "view", "resource": "prometheus", "group": "metrics.openshift.io"}
      -openshift-delegate-urls={"/": {"resource": "prometheus", "verb": "view", "group": "metrics.openshift.io", "namespace": "logging"}}
      --tls-cert=/etc/tls/private/tls.crt
      --tls-key=/etc/tls/private/tls.key
      -pass-access-token
      -pass-user-headers
    State:          Running
      Started:      Tue, 06 Nov 2018 02:31:05 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:  <none>
    Mounts:
      /etc/elasticsearch/secret from elasticsearch (ro)
      /etc/tls/private from proxy-tls (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-elasticsearch-token-b6hvv (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  proxy-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-tls
    Optional:    false
  elasticsearch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-elasticsearch
    Optional:    false
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  logging-es-0
    ReadOnly:   false
  aggregated-logging-elasticsearch-token-b6hvv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-elasticsearch-token-b6hvv
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  region=infra
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason   Age                  From                                       Message
  ----     ------   ----                 ----                                       -------
  Normal   Pulled   14m (x162 over 13h)  kubelet, dhcp46-34.lab.eng.blr.redhat.com  Container image "registry.access.redhat.com/openshift3/logging-elasticsearch:v3.9.43" already present on machine
  Warning  BackOff  4m (x3639 over 13h)  kubelet, dhcp46-34.lab.eng.blr.redhat.com  Back-off restarting failed container
Can you please attach:
1. sos-reports
2. The /etc/target dir as a tarball from all the pods
3. targetcli ls output from all the pods
4. The /block-meta/ dir as a tarball from all the block hosting volumes

Thanks!
-- Prasanna
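A quick way to gather the four items requested above on each node is a small collection function. This is a sketch under assumptions: the output directory and the /block-meta mount point are placeholders, so adjust the latter to wherever the block hosting volume is actually mounted.

```shell
# Collect the requested gluster-block diagnostics on one node.
# The output dir and the /block-meta path are assumptions, not from the report.
collect_block_debug() {
    local out=${1:-/tmp/block-debug-$(hostname -s)}
    mkdir -p "$out"
    sosreport --batch --tmp-dir "$out"            # 1. sos-report
    tar czf "$out/etc-target.tar.gz" /etc/target  # 2. iSCSI target configuration
    targetcli ls > "$out/targetcli-ls.txt"        # 3. current target tree
    tar czf "$out/block-meta.tar.gz" /block-meta  # 4. block-meta dir from the block hosting volume
}
# usage (as root, on each gluster/block node): collect_block_debug /tmp/bz-debug
```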
*** This bug has been marked as a duplicate of bug 1624678 ***