Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
OCP 4.4 + OCS 4.4.1-rc2 cluster on VMware

On a 3-node OCS cluster, performed a prolonged shutdown of a node which was hosting RGW, rook-ceph-operator, ocs-operator, noobaa-db, osd-0, osd-3 and mon-a, amongst other pods. In this case, compute-0 was powered OFF @ Wed Jul 1 15:17:47 UTC 2020

Following were some observations
=======================================
1. The original RGW pod on compute-0 stayed in Terminating state forever on compute-0 (as expected)
2. The newly spun-up RGW pod on the new node is continuously transitioning between Running and CrashLoopBackOff state

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v   0/1   CrashLoopBackOff   23   71m   10.128.2.28   compute-2   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v   1/1   Terminating        16   8h    10.131.0.93   compute-0   <none>   <none>

P.S: The suffix is different, but the pod prefix is the same (as is the case with mon-a-xxx and so on)

>> [nberry@localhost rgw-CLBO]$ oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v
debug 2020-07-01 16:32:30.866 7f6294b52100  0 framework: beast
debug 2020-07-01 16:32:30.866 7f6294b52100  0 framework conf key: port, val: 80
debug 2020-07-01 16:32:30.867 7f6294b52100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:32:30.867 7f6294b52100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1

>> oc describe
Events:
  Type     Reason     Age                 From                 Message
  ----     ------     ----                ----                 -------
  Normal   Scheduled  21m                 default-scheduler    Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v to compute-2
  Normal   Pulled     21m                 kubelet, compute-2   Container image "quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e" already present on machine
  Normal   Created    21m                 kubelet, compute-2   Created container chown-container-data-dir
  Normal   Started    21m                 kubelet, compute-2   Started container chown-container-data-dir
  Normal   Created    19m (x3 over 21m)   kubelet, compute-2   Created container rgw
  Normal   Started    19m (x3 over 21m)   kubelet, compute-2   Started container rgw
  Warning  Unhealthy  19m (x9 over 21m)   kubelet, compute-2   Liveness probe failed: Get http://10.128.2.28:80/swift/healthcheck: dial tcp 10.128.2.28:80: connect: connection refused
  Normal   Killing    19m (x3 over 21m)   kubelet, compute-2   Container rgw failed liveness probe, will be restarted
  Normal   Pulled     16m (x6 over 21m)   kubelet, compute-2   Container image "quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e" already present on machine
  Warning  BackOff    88s (x52 over 15m)  kubelet, compute-2   Back-off restarting failed container

Version of all relevant components (if applicable):
----------------------------------------------------------------------
OCS version = ocs-operator.v4.4.1-465.ci
OCP version = 4.4.0-0.nightly-2020-06-29-071755
Ceph = RHCS 4.1 (ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable))

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
Yes. It is not in the user's hands whether the RGW pod failing over to another node takes the same name. But maybe due to this, the new pod keeps going into CrashLoopBackOff (CLBO) state and the RGW functionality is broken.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
Not sure

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
Yes. In case the failed-over RGW pod gets the same name as the Terminating RGW pod, one faces this issue.

Can this issue reproduce from the UI?
----------------------------------------------------------------------

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.4 + OCS 4.4.1 cluster (non GA'd)
2. Power off a node which was hosting the RGW pod
3. See if, after 5 minutes, a new RGW pod is spun up on another node. The original pod should stay in Terminating state
4. Check ceph status. It doesn't show the RGW details at all under the "services" section

Actual results:
----------------------------------------------------------------------
The new RGW pod which came up on another node continuously cycles between CrashLoopBackOff and Running states.

Expected results:
----------------------------------------------------------------------
The new RGW pod should be in Running state on another node.

Additional info:
----------------------------------------------------------------------
The powered-off node had the following important pods running on it before it was shut down:

1. lib-bucket-provisioner-5b9cb4f848-t2qhg
2. noobaa-db-0
3. noobaa-operator-6f9db4df44-gd2hg
4. ocs-operator-7fdf85b7b4-kxlvk
5. rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998d2lpk7
6. rook-ceph-mgr-a-7ddcffcb68-m57fb
7. rook-ceph-operator-68557c549b-6nwcq
8. rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v
9. mon-a
10. osd-0 and osd-3

>> Current POD status

[nberry@localhost vmware]$ oc get pods -o wide -n openshift-storage
NAME                                                              READY   STATUS             RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
compute-0-debug                                                   1/1     Terminating        0          5h12m   10.46.27.134   compute-0   <none>   <none>
csi-cephfsplugin-8k6m8                                            3/3     Running            0          2d6h    10.46.27.134   compute-0   <none>   <none>
csi-cephfsplugin-92tng                                            3/3     Running            0          2d6h    10.46.27.138   compute-1   <none>   <none>
csi-cephfsplugin-nrp29                                            3/3     Running            0          2d6h    10.46.27.133   compute-2   <none>   <none>
csi-cephfsplugin-provisioner-6fdd566b4d-2kgfc                     5/5     Running            0          2d6h    10.129.2.20    compute-1   <none>   <none>
csi-cephfsplugin-provisioner-6fdd566b4d-62ffh                     5/5     Terminating        0          8h      10.131.0.87    compute-0   <none>   <none>
csi-cephfsplugin-provisioner-6fdd566b4d-tchmw                     5/5     Running            0          71m     10.128.2.29    compute-2   <none>   <none>
csi-rbdplugin-jv8bk                                               3/3     Running            0          2d6h    10.46.27.138   compute-1   <none>   <none>
csi-rbdplugin-provisioner-7d784b9d7c-k9ss7                        5/5     Running            0          71m     10.128.2.27    compute-2   <none>   <none>
csi-rbdplugin-provisioner-7d784b9d7c-m4lvw                        5/5     Running            0          2d1h    10.129.2.128   compute-1   <none>   <none>
csi-rbdplugin-provisioner-7d784b9d7c-xdjqk                        5/5     Terminating        0          8h      10.131.0.84    compute-0   <none>   <none>
csi-rbdplugin-pz9c7                                               3/3     Running            0          2d6h    10.46.27.133   compute-2   <none>   <none>
csi-rbdplugin-xq97h                                               3/3     Running            0          2d6h    10.46.27.134   compute-0   <none>   <none>
lib-bucket-provisioner-5b9cb4f848-t2qhg                           1/1     Terminating        0          8h      10.131.0.89    compute-0   <none>   <none>
lib-bucket-provisioner-5b9cb4f848-t569j                           1/1     Running            0          71m     10.128.2.18    compute-2   <none>   <none>
noobaa-core-0                                                     1/1     Running            0          2d1h    10.129.2.130   compute-1   <none>   <none>
noobaa-db-0                                                       1/1     Terminating        0          7h49m   10.131.0.100   compute-0   <none>   <none>
noobaa-endpoint-58ff4bdfd6-br76q                                  0/1     CrashLoopBackOff   18         2d1h    10.129.2.127   compute-1   <none>   <none>
noobaa-operator-6f9db4df44-gd2hg                                  1/1     Terminating        0          8h      10.131.0.91    compute-0   <none>   <none>
noobaa-operator-6f9db4df44-pknzr                                  1/1     Running            0          71m     10.128.2.30    compute-2   <none>   <none>
ocs-operator-7fdf85b7b4-d7n5r                                     0/1     Running            0          71m     10.128.2.24    compute-2   <none>   <none>
ocs-operator-7fdf85b7b4-kxlvk                                     1/1     Terminating        0          8h      10.131.0.85    compute-0   <none>   <none>
rook-ceph-crashcollector-compute-0-6d6894d4-9gsbc                 0/1     Pending            0          70m     <none>         <none>      <none>   <none>
rook-ceph-crashcollector-compute-0-7fbb7b4b76-dkgz8               1/1     Terminating        0          2d1h    10.131.0.51    compute-0   <none>   <none>
rook-ceph-crashcollector-compute-0-7fbb7b4b76-xgg79               0/1     Pending            0          71m     <none>         <none>      <none>   <none>
rook-ceph-crashcollector-compute-1-6d4656d5f5-v2dvs               1/1     Running            0          2d6h    10.129.2.26    compute-1   <none>   <none>
rook-ceph-crashcollector-compute-2-88cb7fddb-9n9v9                1/1     Running            0          7h36m   10.128.2.5     compute-2   <none>   <none>
rook-ceph-drain-canary-compute-0-6c7798b9b8-rx7wd                 1/1     Terminating        0          2d1h    10.131.0.50    compute-0   <none>   <none>
rook-ceph-drain-canary-compute-0-6c7798b9b8-txnkx                 0/1     Pending            0          70m     <none>         <none>      <none>   <none>
rook-ceph-drain-canary-compute-1-7499bc7d6f-7pb22                 1/1     Running            0          2d6h    10.129.2.25    compute-1   <none>   <none>
rook-ceph-drain-canary-compute-2-6b8dd759db-84vpj                 1/1     Running            0          7h36m   10.128.2.4     compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998d2lpk7   1/1     Terminating        0          8h      10.131.0.86    compute-0   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998dnb6kb   1/1     Running            0          71m     10.128.2.31    compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5c6964ccxp5fk   1/1     Running            0          2d6h    10.129.2.28    compute-1   <none>   <none>
rook-ceph-mgr-a-7ddcffcb68-ggc62                                  1/1     Running            0          71m     10.128.2.26    compute-2   <none>   <none>
rook-ceph-mgr-a-7ddcffcb68-m57fb                                  1/1     Terminating        0          8h      10.131.0.92    compute-0   <none>   <none>
rook-ceph-mon-a-7766b47887-xrmlp                                  0/1     Pending            0          71m     <none>         <none>      <none>   <none>
rook-ceph-mon-b-669cc6c5d4-k5cgm                                  1/1     Running            1          2d6h    10.129.2.22    compute-1   <none>   <none>
rook-ceph-mon-c-689976cc99-mjrjn                                  1/1     Running            0          7h36m   10.128.2.9     compute-2   <none>   <none>
rook-ceph-mon-d-canary-9c576d797-pnwz4                            0/1     Pending            0          22s     <none>         <none>      <none>   <none>
rook-ceph-operator-68557c549b-5lc8h                               1/1     Running            0          71m     10.128.2.25    compute-2   <none>   <none>
rook-ceph-operator-68557c549b-6nwcq                               1/1     Terminating        0          8h      10.131.0.90    compute-0   <none>   <none>
rook-ceph-osd-0-544cf5448b-d4nc9                                  0/1     Pending            0          71m     <none>         <none>      <none>   <none>
rook-ceph-osd-1-7ffcfdd9d5-pdqpq                                  1/1     Running            0          2d6h    10.129.2.27    compute-1   <none>   <none>
rook-ceph-osd-2-f95cffd6d-2l6p6                                   1/1     Running            0          7h36m   10.128.2.8     compute-2   <none>   <none>
rook-ceph-osd-3-75fb657f59-jnhjx                                  0/1     Pending            0          71m     <none>         <none>      <none>   <none>
rook-ceph-osd-4-6bf4d8dbc7-mzfz4                                  1/1     Running            0          7h36m   10.128.2.7     compute-2   <none>   <none>
rook-ceph-osd-5-77b688dffb-bglr5                                  1/1     Running            0          2d1h    10.129.2.134   compute-1   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-1-0-bx9h5-8cldc               0/1     Completed          0          2d6h    10.129.2.24    compute-1   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-1-1-vtgbd-n7t59               0/1     Completed          0          2d1h    10.129.2.133   compute-1   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v   0/1     CrashLoopBackOff   23         71m     10.128.2.28    compute-2   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v   1/1     Terminating        16         8h      10.131.0.93    compute-0   <none>   <none>
rook-ceph-tools-b7789c95-kwn7x                                    1/1     Terminating        0          8h      10.46.27.134   compute-0   <none>   <none>
rook-ceph-tools-b7789c95-z6s27                                    1/1     Running            0          71m     10.46.27.133   compute-2   <none>   <none>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Current ceph status

===== ceph status ==== Wed Jul 1 16:18:11 UTC 2020
  cluster:
    id:     ca6c02dd-1942-4f11-bd38-1a03979fc85e
    health: HEALTH_WARN
            2 osds down
            2 hosts (2 osds) down
            1 rack (2 osds) down
            Reduced data availability: 1 pg inactive
            Degraded data redundancy: 31265/93789 objects degraded (33.335%), 324 pgs degraded, 400 pgs undersized
            1/3 mons down, quorum b,c

  services:
    mon: 3 daemons, quorum b,c (age 60m), out of quorum: a
    mgr: a(active, since 54m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 4 up (since 59m), 6 in (since 2d)    <<--- RGW is not listed

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle
      mds.ocs-storagecluster-cephfilesystem-c: idle
      mds.ocs-storagecluster-cephfilesystem-d: idle

  data:
    pools:   11 pools, 400 pgs
    objects: 31.26k objects, 116 GiB
    usage:   207 GiB used, 1.8 TiB / 2.0 TiB avail
    pgs:     0.250% pgs not active
             31265/93789 objects degraded (33.335%)
             323 active+undersized+degraded
             76  active+undersized
             1   undersized+degraded+peered

  io:
    client: 1.2 KiB/s rd, 77 KiB/s wr, 2 op/s rd, 0 op/s wr

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
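The degraded-data line in the `ceph status` above is consistent with exactly one of the three racks being down. A quick sanity check of the arithmetic (assuming 3x replication, one replica per rack, which is the layout described elsewhere in this report):

```python
# Sanity check of the line "31265/93789 objects degraded (33.335%)"
# from the ceph status above.
degraded, total = 31265, 93789

percent = 100 * degraded / total
print(f"{percent:.3f}%")  # -> 33.335%, matching what ceph reports

# With 3x replication and one of three racks (2 of 6 OSDs) down,
# roughly one replica of every object is unavailable, i.e. about 1/3
# of all object copies -- exactly what the status shows.
assert abs(percent - 100 / 3) < 0.01
```

The ~33.3% figure is therefore expected while a full rack is down; the real anomaly is the single inactive PG discussed in the later comments.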
Due to a network problem, log collection is still ongoing. I will update the logs once collected.
In the log for the rgw pod in terminating state:

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/must-gather.local.7750024955620505231/ceph/namespaces/openshift-storage/pods/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v/rgw/rgw/logs/current.log

The rgw log shows that there were responses generated for over 18 minutes:

2020-07-01T16:46:10.46609365Z debug 2020-07-01 16:46:10.464 7fe1a788f700  1 ====== starting new request req=0x7fe1e90c3890 =====
2020-07-01T16:46:10.466406957Z debug 2020-07-01 16:46:10.465 7fe1a788f700  1 ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0.00100006s ======
...
2020-07-01T17:04:50.46582237Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1 ====== starting new request req=0x7fe1e90c3890 =====
2020-07-01T17:04:50.465990506Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1 ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0s ======

In the pod status, it also shows the rgw pod as running:

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/ocs-pod-logs/openshift-storage/pods.json

"metadata": {
    "name": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v",
    "generateName": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fff-",
...
"status": {
    "phase": "Running",

Is it possible this must-gather is from a different cluster? I just don't see the crash info for the rgw pod. Is there an `oc describe <rgw-pod>` available in the must-gather that I'm missing for the crashing rgw pod?
Ah, now I see the liveness probe failure in your original pod description:

Warning  Unhealthy  19m (x9 over 21m)   kubelet, compute-2   Liveness probe failed: Get http://10.128.2.28:80/swift/healthcheck: dial tcp 10.128.2.28:80: connect: connection refused
Normal   Killing    19m (x3 over 21m)   kubelet, compute-2   Container rgw failed liveness probe, will be restarted
(In reply to Travis Nielsen from comment #6)
> In the log for the rgw pod in terminating state:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/
> must-gather.local.7750024955620505231/ceph/namespaces/openshift-storage/pods/
> rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v/rgw/rgw/logs/
> current.log
>
> The rgw log shows that there were responses generated for at over 18 minutes:
>
> 2020-07-01T16:46:10.46609365Z debug 2020-07-01 16:46:10.464 7fe1a788f700 1
> ====== starting new request req=0x7fe1e90c3890 =====
> 2020-07-01T16:46:10.466406957Z debug 2020-07-01 16:46:10.465 7fe1a788f700 1
> ====== req done req=0x7fe1e90c3890 op status=0 http_status=200
> latency=0.00100006s ======
> ...
> 2020-07-01T17:04:50.46582237Z debug 2020-07-01 17:04:50.464 7fe15b7f7700 1
> ====== starting new request req=0x7fe1e90c3890 =====
> 2020-07-01T17:04:50.465990506Z debug 2020-07-01 17:04:50.464 7fe15b7f7700 1
> ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0s
> ======
>
> In the pod status, it also shows the rgw pod as running:
>
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/
> ocs-pod-logs/openshift-storage/pods.json
>
> "metadata": {
>     "name": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v",
>     "generateName": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fff-",
> ...
>
> "status": {
>     "phase": "Running",
>
> Is it possible this must gather is from a different cluster? I just don't
> see the crash info for the rgw pod. Is there an `oc describe <rgw-pod>`
> available in the must-gather that I'm missing for the crashing rgw pod?

Hi Travis,

As I mentioned, I was unable to collect the must-gather unless I powered the node back ON. Once that is done, the new RGW pod stabilizes and we stop seeing the restarts.
The rgw pod appears to be running successfully, but the liveness probe is just failing.

Warning  Unhealthy  3m38s (x14 over 8m28s)  kubelet, compute-2  Liveness probe failed: Get http://10.131.0.67:80/swift/healthcheck: dial tcp 10.131.0.67:80: connect: connection refused

From the rgw logs:

debug 2020-07-01 19:30:27.373 7f8b9df2c100  0 framework: beast
debug 2020-07-01 19:30:27.373 7f8b9df2c100  0 framework conf key: port, val: 80
debug 2020-07-01 19:30:27.374 7f8b9df2c100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 19:30:27.374 7f8b9df2c100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1

From inside the same pod the liveness probe is also failing:

sh-4.4# curl http://localhost/swift/healthcheck
curl: (7) Failed to connect to localhost port 80: Connection refused
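The kubelet behavior seen in the events is just an HTTP GET against `/swift/healthcheck` on port 80 (path and port taken from the events above). A minimal, self-contained sketch of that probe logic, with a local stub server standing in for a healthy RGW frontend (the stub and helper names are illustrative, not the actual kubelet code):

```python
import http.server
import threading
import urllib.error
import urllib.request

class Healthcheck(http.server.BaseHTTPRequestHandler):
    """Stub for a healthy RGW beast frontend answering /swift/healthcheck."""
    def do_GET(self):
        self.send_response(200 if self.path == "/swift/healthcheck" else 404)
        self.end_headers()
    def log_message(self, *args):  # keep the demo quiet
        pass

def probe(url):
    """Return True on a 2xx answer, False on refusal/timeout (what kubelet sees)."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # The "connection refused" from the events lands here: the TCP port
        # is never opened because radosgw never starts its frontend.
        return False

server = http.server.HTTPServer(("127.0.0.1", 0), Healthcheck)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

print(probe(f"http://127.0.0.1:{port}/swift/healthcheck"))  # True: healthy RGW
server.shutdown()
server.server_close()
print(probe(f"http://127.0.0.1:{port}/swift/healthcheck"))  # False: refused -> kubelet restarts the container
```

This matches the symptom: the radosgw process is alive (so the pod looks Running), but since the beast frontend never binds port 80, every probe is refused and the kubelet keeps killing the container.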
@Jiffin @Matt Who from RGW could look into this issue with RGW not responding?
IMO the rgw server itself did not start. Please compare with the logs from the other rgw pod:

==== START logs for container rgw of pod openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v ====
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 framework: beast
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 framework conf key: port, val: 80
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:46:00.619 7fe1e90dc100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1
debug 2020-07-01 16:46:01.004 7fe1e90dc100  0 starting handler: beast
debug 2020-07-01 16:46:01.006 7fe1e90dc100  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:46:01.024 7fe1e90dc100  1 mgrc service_daemon_register rgw.ocs.storagecluster.cephobjectstore.a metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable),ceph_version_short=14.2.8-59.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v,container_image=quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e,cpu=Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 8.2 (Ootpa),distro_version=8.2,frontend_config#0=beast port=80,frontend_type#0=beast,hostname=compute-2,kernel_description=#1 SMP Wed Jun 10 19:19:16 UTC 2020,kernel_version=4.18.0-147.20.1.el8_1.x86_64,mem_swap_kb=0,mem_total_kb=32934540,num_handles=1,os=Linux,pid=1,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v,pod_namespace=openshift-storage,zone_id=e102bd57-92b4-4481-a8a4-e7cd17cf65df,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=95556bcc-928d-4016-815a-7001b1804c89,zonegroup_name=ocs-storagecluster-cephobjectstore}

Note the "starting handler: beast" line, which the crashing pod never reaches. But I am not sure where or why it got stuck.
(In reply to Travis Nielsen from comment #12) > @Jiffin @Matt Who from RGW could look into this issue with RGW not > responding? sure, thanks @mark, could this be related to the issue you're working on? Matt
Not all the PGs are "active", therefore reading/writing to the failed PG(s) would be blocked. Still following up on the cause of the inactive PG...

    pgs:     0.250% pgs not active
             9129/30015 objects degraded (30.415%)
             875/30015 objects misplaced (2.915%)
             298 active+undersized+degraded
             79  active+undersized
             20  active+clean+remapped
             2   active+clean
             1   undersized+degraded+peered
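The reason one PG can block I/O while the others stay active follows from the replicated-pool activation rule: a PG serves I/O only while its acting set has at least `min_size` replicas. A toy model of that rule (the `size=3`/`min_size=2` defaults here are an assumption about this cluster's pools, not read from it):

```python
# Toy model of replicated-PG availability: a PG with fewer acting
# replicas than min_size stays "peered" and blocks I/O.
# size=3 / min_size=2 are typical defaults, assumed here.

def pg_state(acting, size=3, min_size=2):
    if len(acting) >= size:
        return "active+clean"
    if len(acting) >= min_size:
        return "active+undersized+degraded"   # still serving I/O
    return "undersized+degraded+peered"       # inactive: I/O blocked

print(pg_state([1, 4, 3]))  # all three replicas up
print(pg_state([0, 4]))     # one replica lost, still active
print(pg_state([4]))        # pg 1.1's situation: only osd.4 left
```

With only osd.4 acting (as the `ceph pg 1.1 query` below confirms), the PG lands in exactly the `undersized+degraded+peered` state shown in the status.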
Here are the details for the inactive PG:

sh-4.4# ceph pg 1.1 query
{
    "state": "undersized+degraded+peered",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 1169,
    "up": [ 4 ],
    "acting": [ 4 ],
    "acting_recovery_backfill": [ "4" ],
    "info": {
        "pgid": "1.1",
        "last_update": "38'1",
        "last_complete": "38'1",
        "log_tail": "38'1",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": [],
        "history": {
            "epoch_created": 8,
            "epoch_pool_created": 8,
            "last_epoch_started": 866,
            "last_interval_started": 865,
            "last_epoch_clean": 866,
            "last_interval_clean": 865,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 1121,
            "same_interval_since": 1121,
            "same_primary_since": 1121,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_clean_scrub_stamp": "2020-07-01 14:29:50.960751"
        },
        "stats": {
            "version": "38'1",
            "reported_seq": "594",
            "reported_epoch": "1169",
            "state": "undersized+degraded+peered",
            "last_fresh": "2020-07-01 19:56:10.470098",
            "last_change": "2020-07-01 18:49:42.253350",
            "last_active": "2020-07-01 18:49:41.245332",
            "last_peered": "2020-07-01 19:56:10.470098",
            "last_clean": "2020-07-01 17:40:53.421560",
            "last_became_active": "2020-07-01 17:40:56.544617",
            "last_became_peered": "2020-07-01 18:49:42.253350",
            "last_unstale": "2020-07-01 19:56:10.470098",
            "last_undegraded": "2020-07-01 18:49:42.247539",
            "last_fullsized": "2020-07-01 18:49:42.247516",
            "mapping_epoch": 1121,
            "log_start": "38'1",
            "ondisk_log_start": "38'1",
            "created": 8,
            "last_epoch_clean": 866,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_clean_scrub_stamp": "2020-07-01 14:29:50.960751",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 46,
                "num_objects": 1,
                "num_object_clones": 0,
                "num_object_copies": 3,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 2,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 1,
                "num_whiteouts": 0,
                "num_read": 7,
                "num_read_kb": 7,
                "num_write": 1,
                "num_write_kb": 1,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 2,
                "num_bytes_recovered": 92,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [ 4 ],
            "acting": [ 4 ],
            "avail_no_missing": [ "4" ],
            "object_location_counts": [
                { "shards": "4", "objects": 1 }
            ],
            "blocked_by": [],
            "up_primary": 4,
            "acting_primary": 4,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 866,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-01 18:49:42.247533",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-01 18:49:41.245437"
        }
    ],
    "agent_state": {}
}
Per conversation in chat with Josh, we are hitting a corner case where the PG is stuck.

From Josh:
  the pg is only getting assigned to one osd
  this is a rare quirk that can come up from crush not finding a 2nd osd in enough tries, which can occur more easily with small clusters
  we can make it less likely with some crush tuning
  ...
  this would have been present in ceph forever
  it's unrelated to writes, just the fact that there are only 2 hosts up makes it more likely though (cluster is even smaller)

@Josh, tagging you for more details on the crush tuning, thanks.

@Neha Since this is not a regression and does not happen frequently, does it sound good to remove the blocker tag from 4.4.z and move it to 4.5? Implementing the crush tuning will take more effort.
Thanks for the update Travis / Josh! If it is not a regression, then at least we should not block 4.4.1 for it.
Agreed, let's consider for 4.5 if the scope of the crush tuning is small.
@Josh, could it be possible that the update to RHCS 4.1 (from RHCS 4.0) has introduced some small change that has made it more likely to hit the problem that Travis described? Just pure speculation here, since I'm not well aware what's the delta between 4.0 and 4.1... :-)
As a follow up to Michael's query, the concern here is about the fact that we have encountered this only in 4.4.1 so far.
@Neha, thanks for the details. And don't get me wrong, I was not suggesting that RHCS introduced a regression. Josh already explained that this has been there for a long time. I was merely speculating that if a change in any component could have made it more likely to hit it now, it would have been RHCS itself, which was updated from 4.0 to 4.1. I agree with you that it has probably just been pure luck until now.
In summary, there are two problems, not the crush issue I originally thought this was:

1) OSDs never got marked out.
   Recommended fix: set mon_osd_down_out_subtree_limit to 'datacenter' or something else above 'rack' in AWS.

2) The .rgw.root pool is distributed across hosts, not racks.
   Recommended fix: separate across racks for all pools in AWS.

More detail below.

(In reply to Travis Nielsen from comment #21)
> Per conversation in chat with Josh, we are hitting a corner case where the
> PG is stuck.
>
> From Josh:
> the pg is only getting assigned to one osd
> this is a rare quirk that can come up from crush not finding a 2nd osd in
> enough tries, which can occur more easily with small clusters
> we can make it less likely with some crush tuning
> ...
> this would have been present in ceph forever
> it's unrelated to writes, just the fact that there are only 2 hosts up makes
> it more likely though (cluster is even smaller)
>
> @Josh, tagging you for more details on the crush tuning, thanks
>
> @Neha Since this is not a regression and does not happen frequently, sound
> good to remove the blocker tag from 4.4.z and move it to 4.5? Implementing
> the crush tuning will take more effort.

The first issue is that the OSDs that were down (not running) were never marked out (so data was still mapped to them).
You can see this by trying to map pg 1.1 with this osdmap:

$ osdmaptool osdmap --tree
osdmaptool: osdmap file 'osdmap'
ID  CLASS WEIGHT  TYPE NAME                         STATUS REWEIGHT PRI-AFF
 -1       3.00000 root default
 -8       1.00000     rack rack0
 -7       0.50000         host ocs-deviceset-0-0-j4mtq
  1   hdd 0.50000             osd.1                 down  1.00000 1.00000
-15       0.50000         host ocs-deviceset-0-1-zqnq7
  3   hdd 0.50000             osd.3                 down  1.00000 1.00000
 -4       1.00000     rack rack1
 -3       0.50000         host ocs-deviceset-1-0-pcs7b
  0   hdd 0.50000             osd.0                   up  1.00000 1.00000
-17       0.50000         host ocs-deviceset-1-1-st982
  4   hdd 0.50000             osd.4                   up  1.00000 1.00000
-12       1.00000     rack rack2
-11       0.50000         host ocs-deviceset-2-0-vbkqq
  2   hdd 0.50000             osd.2                   up  1.00000 1.00000
-19       0.50000         host ocs-deviceset-2-1-48dxg
  5   hdd 0.50000             osd.5                   up  1.00000 1.00000

$ osdmaptool --test-map-pg 1.1 osdmap
bin/osdmaptool: osdmap file 'osdmap'
 parsed '1.1' -> 1.1
1.1 raw ([1,4,3], p1) up ([4], p4) acting ([4], p4)

The raw crush choice (before accounting for down osds) includes osds 1 and 3, which are both down, so the only one left is 4.

Normally OSDs are marked out (and thus no longer considered by crush) after being down for 10 minutes (mon_osd_down_out_interval). There are a few limits on this which can be hit on small clusters like this, though:

1) mon_osd_min_in_ratio = 0.75            # do not automatically mark OSDs 'out' if fewer than this many OSDs are 'in'
2) mon_osd_down_out_subtree_limit = rack  # do not automatically mark OSDs 'out' if an entire subtree of this size is down

Since this cluster had a virtual 'rack' go down, and it represented 33% of the OSDs, both of these limits would prevent the osds from being marked out. In bare metal, where we don't have virtual 'racks' but separate across hosts, we wouldn't hit the 2nd limit, and the 1st limit would only have stopped the 2nd osd from being marked out, so crush would still be able to find 2 copies for each pg.
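The two guards described above can be sketched as a small decision function. This is a simplified model of the mon's auto-mark-out behaviour using the option names and defaults quoted above, not the actual ceph-mon code:

```python
# Simplified model of the mon's decision to auto-mark a down OSD "out",
# using mon_osd_min_in_ratio and mon_osd_down_out_subtree_limit as
# described above. Illustrative only -- not the real ceph-mon logic.

def would_mark_out(num_in, num_total, down_subtree_type,
                   min_in_ratio=0.75, subtree_limit="rack"):
    # Guard 1: never let the fraction of "in" OSDs drop below min_in_ratio.
    if (num_in - 1) / num_total < min_in_ratio:
        return False
    # Guard 2: never auto-out OSDs when a whole subtree of the limit
    # type (or larger) is down.
    ranks = ["osd", "host", "rack", "datacenter", "root"]
    if ranks.index(down_subtree_type) >= ranks.index(subtree_limit):
        return False
    return True

# This cluster: 6 OSDs, all still "in", an entire virtual rack down.
print(would_mark_out(num_in=6, num_total=6, down_subtree_type="rack"))
# -> False: guard 2 fires (and guard 1 would stop the second mark-out anyway)

# With the recommended fix of raising the subtree limit above "rack":
print(would_mark_out(6, 6, "rack", subtree_limit="datacenter"))
# -> True: a down rack no longer blocks marking its OSDs out
```

This also shows the bare-metal contrast Josh describes: with `down_subtree_type="host"` the second guard does not fire, and only the min-in ratio stops the second OSD from being marked out.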
(In reply to Michael Adam from comment #24)
> @Josh, could it be possible that the update to RHCS 4.1 (from RHCS 4.0) has
> introduced some small change that has made it more likely to hit the problem
> that Travis described?
>
> Just pure speculation here, since I'm not well aware what's the delta
> between 4.0 and 4.1... :-)

I don't think anything about this behaviour changed between 4.0 and 4.1. Did rook change any settings related to marking osds out, or anything about how racks are set up?

There's still an element of chance, in that this would only cause an availability problem if there happened to be a pg mapped to the two osds that were stuck down+in.

The 2nd problem, which caused this to be noticeable, was that the .rgw.root pool does not have a crush rule set up for it (perhaps the pool was auto-created by rgw?), so it uses the default crush rule that separates replicas across hosts instead of racks. If it was separated across racks instead, a whole rack going down would not cause an availability problem even if the osds were never marked out, since there would be 2 osds still up to make the pg available.
@Neha Do you have an operator log from the original deployment of Rook? In the must-gather I see a later reconcile of the object store instead of the original reconcile.

@Josh Rook creates all of the rgw pools explicitly before starting rgw the first time, so I don't know what would have created the pool without the crush rule as expected. But this certainly sounds like the cause of the issues if the pool doesn't have the rack failure domain as expected.

When I did a local install of rook based on the release-4.4 branch, I see the .rgw.root pool created with the rack-based crush rule as expected:

2020-07-07 21:57:35.741027 I | exec: Running command: ceph osd crush rule create-replicated .rgw.root default rack --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/240953093
2020-07-07 21:57:36.343092 I | exec: Running command: ceph osd pool create .rgw.root 8 replicated .rgw.root --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/824994976

And I see the crush rule as expected here:

$ ceph osd crush dump
...
    {
        "rule_id": 6,
        "rule_name": ".rgw.root",
        "ruleset": 6,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "rack"
            },
            {
                "op": "emit"
            }
        ]
    },
...
@Neha Do you have the original log from the deployment that would show the configuration of the crush rules? The operator log in the must-gather must have been restarted so I didn't see the original reconcile there.
Ok, tracked down the issue after reading through the original operator log (thanks Neha!). The issue is that an object user was getting created before the object store was fully configured by Rook. Thus, Ceph was creating the .rgw.root pool with the default host failure domain when the object user was being created. About 30 seconds later when Rook finally attempted to create the .rgw.root pool, it already existed and therefore Rook did not set its failure domain to rack. The object user controller was intending to wait for rgw pods to be running before creating a user, however, after querying for the pods it didn't check if any pods were actually running. The fix is extremely simple and low risk to check for the number of rgw pods before allowing the object user to be created. This issue does not repro in the 4.5 or newer codebase. This issue had already been fixed by a refactor in the object store with the move to the controller runtime. This only affects 4.4, so moving back to that milestone to discuss if it's still a blocker to consider for 4.4.z. It's such a small change, I would suggest we take it. Here is the simple downstream PR, pending agreement to include in 4.4.z. https://github.com/openshift/rook/pull/81
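The fix described above, checking for running RGW pods before reconciling the object user, can be sketched like this. The function and field names are illustrative (the actual fix is in Go, in the linked downstream PR), but the logic mirrors the description:

```python
# Sketch of the fix: refuse to create the CephObjectUser until at least
# one RGW pod of the store is actually Running, so that user creation
# cannot race ahead of Rook's pool creation and let radosgw-admin
# auto-create .rgw.root with the default host-level crush rule.
# Names here are illustrative, not Rook's actual (Go) implementation.

def rgw_pods_running(pods, store="ocs-storagecluster-cephobjectstore"):
    """Count the store's RGW pods that are in phase Running."""
    prefix = f"rook-ceph-rgw-{store}"
    return sum(1 for p in pods
               if p["name"].startswith(prefix) and p["phase"] == "Running")

def reconcile_object_user(pods):
    # The bug: the old code listed the RGW pods but never checked the
    # count, so the user was created while the pods were still Pending.
    if rgw_pods_running(pods) == 0:
        return "requeue: no running rgw pods yet"
    return "create object user"

pods = [{"name": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-x",
         "phase": "Pending"}]
print(reconcile_object_user(pods))  # requeue: no running rgw pods yet
pods[0]["phase"] = "Running"
print(reconcile_object_user(pods))  # create object user
```

Requeueing until the count is non-zero guarantees the object store reconcile (which creates all pools with the rack failure domain) has finished before any user operation can touch .rgw.root.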
I just observed that this BZ was moved to MODIFIED because the downstream PR was merged, which is not appropriate because:

1. We still don't have any ack to merge the fix to 4.4.z. AFAIK, the PR should be merged only once we have all the acks.
2. 4.4.z is not open to take any fixes because right now we are working on 4.4.2, which is a CVE-only release, and any change targeted for 4.4.z should be merged when we are ready for 4.4.3.

Ideally, we should revert this PR.

Michael, please correct me if I am wrong.
(In reply to Mudit Agarwal from comment #37)
> I just observed that this BZ is moved to MODIFIED because the downstream PR
> is merged which is not appropriate because:
>
> 1. We still don't have any ack to merge the fix to 4.4.z. AFAIK, once we
> have all the acks then only PR should be merged.
> 2. 4.4.z is not open to take any fixes because right now we are working on
> 4.4.2 which is CVE only release and any change targeted for 4.4.z should be
> merged when we are ready for 4.4.3
>
> Ideally, we shouled revert this PR.
>
> Michael, please correct me if I am wrong.

You are absolutely correct.

Revert PR: https://github.com/openshift/rook/pull/82
Yep, we had a miscommunication on merging this. Seb and I sync'd up about it.
@Michael Are we still taking fixes for 4.4.z or should we close this?
*** Bug 1880447 has been marked as a duplicate of this bug. ***
Closing since it is already fixed in 4.5 and there is no more 4.4.z release planned.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days