Bug 1852983 - OCS 4.4.1-rc2: On hosting node shutdown, the failed-over RGW on 2nd node is transitioning between CrashLoopBackOff -> Running
Summary: OCS 4.4.1-rc2: On hosting node shutdown, the failed-over RGW on 2nd node is transitioning between CrashLoopBackOff -> Running
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 1880447
Depends On:
Blocks:
 
Reported: 2020-07-01 17:03 UTC by Neha Berry
Modified: 2023-12-15 18:22 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-22 04:09:34 UTC
Embargoed:
mkogan: needinfo-



Description Neha Berry 2020-07-01 17:03:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------------
OCP 4.4 + OCS 4.4.1-rc2 cluster on VMware

On a 3-node OCS cluster, performed a prolonged shutdown of a node which was hosting RGW, rook-ceph-operator, ocs-operator, noobaa-db, osd-0, osd-3 and mon-a, amongst other pods.

In this case, compute-0 was powered OFF @Wed Jul  1 15:17:47 UTC 2020

The following were some observations:
=======================================

1. The original RGW pod stayed in Terminating state forever on compute-0 (as expected).
2. The newly spun-up RGW pod on the new node continuously transitions between Running and CrashLoopBackOff.

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v   0/1     CrashLoopBackOff   23         71m     10.128.2.28    compute-2   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v   1/1     Terminating        16         8h      10.131.0.93    compute-0   <none>           <none>

P.S.: The suffix is different, but the pod name prefix is the same (as is the case with mon-a-xxx and so on).

>> [nberry@localhost rgw-CLBO]$ oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v
debug 2020-07-01 16:32:30.866 7f6294b52100  0 framework: beast
debug 2020-07-01 16:32:30.866 7f6294b52100  0 framework conf key: port, val: 80
debug 2020-07-01 16:32:30.867 7f6294b52100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:32:30.867 7f6294b52100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1

>> oc describe

Events:
  Type     Reason     Age                 From                Message
  ----     ------     ----                ----                -------
  Normal   Scheduled  21m                 default-scheduler   Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v to compute-2
  Normal   Pulled     21m                 kubelet, compute-2  Container image "quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e" already present on machine
  Normal   Created    21m                 kubelet, compute-2  Created container chown-container-data-dir
  Normal   Started    21m                 kubelet, compute-2  Started container chown-container-data-dir
  Normal   Created    19m (x3 over 21m)   kubelet, compute-2  Created container rgw
  Normal   Started    19m (x3 over 21m)   kubelet, compute-2  Started container rgw
  Warning  Unhealthy  19m (x9 over 21m)   kubelet, compute-2  Liveness probe failed: Get http://10.128.2.28:80/swift/healthcheck: dial tcp 10.128.2.28:80: connect: connection refused
  Normal   Killing    19m (x3 over 21m)   kubelet, compute-2  Container rgw failed liveness probe, will be restarted
  Normal   Pulled     16m (x6 over 21m)   kubelet, compute-2  Container image "quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e" already present on machine
  Warning  BackOff    88s (x52 over 15m)  kubelet, compute-2  Back-off restarting failed container





Version of all relevant components (if applicable):
----------------------------------------------------------------------
OCS version = ocs-operator.v4.4.1-465.ci

OCP version = 4.4.0-0.nightly-2020-06-29-071755

Ceph = RHCS 4.1 (ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)) 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------------------------
Yes. The user has no control over the fact that the RGW pod failing over to another node keeps the same name prefix. Possibly because of this, the new pod keeps going into CrashLoopBackOff and the RGW functionality is broken.


Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
Not sure


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Can this issue reproducible?
----------------------------------------------------------------------
Yes. The issue is seen when the failed-over RGW pod gets the same name prefix as the Terminating RGW pod.
Can this issue reproduce from the UI?
----------------------------------------------------------------------

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.4 + OCS 4.4.1 cluster (non-GA'd)
2. Power off a node which was hosting the RGW pod
3. See if, after 5 minutes, a new RGW pod is spun up on another node. The original pod should stay in Terminating state
4. Check ceph status. It doesn't show the RGW daemon at all under the "services" section (see the sketch after this list)
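
For step 4, a minimal sketch of checking the Ceph service map from the toolbox pod (assuming the usual app=rook-ceph-tools label on the toolbox; pod names differ per cluster):

$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
$ oc -n openshift-storage exec "$TOOLS" -- ceph status
# Under "services:" there should normally be an "rgw:" line; in this scenario it is missing.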

Actual results:
----------------------------------------------------------------------
The new RGW pod which came up on another node continuously cycles between CrashLoopBackOff and Running.

Expected results:
----------------------------------------------------------------------
The new RGW pod should be in Running state on another node.

Additional info:
----------------------------------------------------------------------

The powered-off node had the following important pods running on it before it was shut down.

1. lib-bucket-provisioner-5b9cb4f848-t2qhg
2. noobaa-db-0
3. noobaa-operator-6f9db4df44-gd2hg
4. ocs-operator-7fdf85b7b4-kxlvk
5. rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998d2lpk7
6. rook-ceph-mgr-a-7ddcffcb68-m57fb
7. rook-ceph-operator-68557c549b-6nwcq
8. rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v
9. mon-a
10. osd-0 and osd-3


>>Current POD status

[nberry@localhost vmware]$ oc get pods -o wide -n openshift-storage
NAME                                                              READY   STATUS             RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
compute-0-debug                                                   1/1     Terminating        0          5h12m   10.46.27.134   compute-0   <none>           <none>
csi-cephfsplugin-8k6m8                                            3/3     Running            0          2d6h    10.46.27.134   compute-0   <none>           <none>
csi-cephfsplugin-92tng                                            3/3     Running            0          2d6h    10.46.27.138   compute-1   <none>           <none>
csi-cephfsplugin-nrp29                                            3/3     Running            0          2d6h    10.46.27.133   compute-2   <none>           <none>
csi-cephfsplugin-provisioner-6fdd566b4d-2kgfc                     5/5     Running            0          2d6h    10.129.2.20    compute-1   <none>           <none>
csi-cephfsplugin-provisioner-6fdd566b4d-62ffh                     5/5     Terminating        0          8h      10.131.0.87    compute-0   <none>           <none>
csi-cephfsplugin-provisioner-6fdd566b4d-tchmw                     5/5     Running            0          71m     10.128.2.29    compute-2   <none>           <none>
csi-rbdplugin-jv8bk                                               3/3     Running            0          2d6h    10.46.27.138   compute-1   <none>           <none>
csi-rbdplugin-provisioner-7d784b9d7c-k9ss7                        5/5     Running            0          71m     10.128.2.27    compute-2   <none>           <none>
csi-rbdplugin-provisioner-7d784b9d7c-m4lvw                        5/5     Running            0          2d1h    10.129.2.128   compute-1   <none>           <none>
csi-rbdplugin-provisioner-7d784b9d7c-xdjqk                        5/5     Terminating        0          8h      10.131.0.84    compute-0   <none>           <none>
csi-rbdplugin-pz9c7                                               3/3     Running            0          2d6h    10.46.27.133   compute-2   <none>           <none>
csi-rbdplugin-xq97h                                               3/3     Running            0          2d6h    10.46.27.134   compute-0   <none>           <none>
lib-bucket-provisioner-5b9cb4f848-t2qhg                           1/1     Terminating        0          8h      10.131.0.89    compute-0   <none>           <none>
lib-bucket-provisioner-5b9cb4f848-t569j                           1/1     Running            0          71m     10.128.2.18    compute-2   <none>           <none>
noobaa-core-0                                                     1/1     Running            0          2d1h    10.129.2.130   compute-1   <none>           <none>
noobaa-db-0                                                       1/1     Terminating        0          7h49m   10.131.0.100   compute-0   <none>           <none>
noobaa-endpoint-58ff4bdfd6-br76q                                  0/1     CrashLoopBackOff   18         2d1h    10.129.2.127   compute-1   <none>           <none>
noobaa-operator-6f9db4df44-gd2hg                                  1/1     Terminating        0          8h      10.131.0.91    compute-0   <none>           <none>
noobaa-operator-6f9db4df44-pknzr                                  1/1     Running            0          71m     10.128.2.30    compute-2   <none>           <none>
ocs-operator-7fdf85b7b4-d7n5r                                     0/1     Running            0          71m     10.128.2.24    compute-2   <none>           <none>
ocs-operator-7fdf85b7b4-kxlvk                                     1/1     Terminating        0          8h      10.131.0.85    compute-0   <none>           <none>
rook-ceph-crashcollector-compute-0-6d6894d4-9gsbc                 0/1     Pending            0          70m     <none>         <none>      <none>           <none>
rook-ceph-crashcollector-compute-0-7fbb7b4b76-dkgz8               1/1     Terminating        0          2d1h    10.131.0.51    compute-0   <none>           <none>
rook-ceph-crashcollector-compute-0-7fbb7b4b76-xgg79               0/1     Pending            0          71m     <none>         <none>      <none>           <none>
rook-ceph-crashcollector-compute-1-6d4656d5f5-v2dvs               1/1     Running            0          2d6h    10.129.2.26    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-2-88cb7fddb-9n9v9                1/1     Running            0          7h36m   10.128.2.5     compute-2   <none>           <none>
rook-ceph-drain-canary-compute-0-6c7798b9b8-rx7wd                 1/1     Terminating        0          2d1h    10.131.0.50    compute-0   <none>           <none>
rook-ceph-drain-canary-compute-0-6c7798b9b8-txnkx                 0/1     Pending            0          70m     <none>         <none>      <none>           <none>
rook-ceph-drain-canary-compute-1-7499bc7d6f-7pb22                 1/1     Running            0          2d6h    10.129.2.25    compute-1   <none>           <none>
rook-ceph-drain-canary-compute-2-6b8dd759db-84vpj                 1/1     Running            0          7h36m   10.128.2.4     compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998d2lpk7   1/1     Terminating        0          8h      10.131.0.86    compute-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7794998dnb6kb   1/1     Running            0          71m     10.128.2.31    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5c6964ccxp5fk   1/1     Running            0          2d6h    10.129.2.28    compute-1   <none>           <none>
rook-ceph-mgr-a-7ddcffcb68-ggc62                                  1/1     Running            0          71m     10.128.2.26    compute-2   <none>           <none>
rook-ceph-mgr-a-7ddcffcb68-m57fb                                  1/1     Terminating        0          8h      10.131.0.92    compute-0   <none>           <none>
rook-ceph-mon-a-7766b47887-xrmlp                                  0/1     Pending            0          71m     <none>         <none>      <none>           <none>
rook-ceph-mon-b-669cc6c5d4-k5cgm                                  1/1     Running            1          2d6h    10.129.2.22    compute-1   <none>           <none>
rook-ceph-mon-c-689976cc99-mjrjn                                  1/1     Running            0          7h36m   10.128.2.9     compute-2   <none>           <none>
rook-ceph-mon-d-canary-9c576d797-pnwz4                            0/1     Pending            0          22s     <none>         <none>      <none>           <none>
rook-ceph-operator-68557c549b-5lc8h                               1/1     Running            0          71m     10.128.2.25    compute-2   <none>           <none>
rook-ceph-operator-68557c549b-6nwcq                               1/1     Terminating        0          8h      10.131.0.90    compute-0   <none>           <none>
rook-ceph-osd-0-544cf5448b-d4nc9                                  0/1     Pending            0          71m     <none>         <none>      <none>           <none>
rook-ceph-osd-1-7ffcfdd9d5-pdqpq                                  1/1     Running            0          2d6h    10.129.2.27    compute-1   <none>           <none>
rook-ceph-osd-2-f95cffd6d-2l6p6                                   1/1     Running            0          7h36m   10.128.2.8     compute-2   <none>           <none>
rook-ceph-osd-3-75fb657f59-jnhjx                                  0/1     Pending            0          71m     <none>         <none>      <none>           <none>
rook-ceph-osd-4-6bf4d8dbc7-mzfz4                                  1/1     Running            0          7h36m   10.128.2.7     compute-2   <none>           <none>
rook-ceph-osd-5-77b688dffb-bglr5                                  1/1     Running            0          2d1h    10.129.2.134   compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-0-bx9h5-8cldc               0/1     Completed          0          2d6h    10.129.2.24    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-1-vtgbd-n7t59               0/1     Completed          0          2d1h    10.129.2.133   compute-1   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v   0/1     CrashLoopBackOff   23         71m     10.128.2.28    compute-2   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7jzv5v   1/1     Terminating        16         8h      10.131.0.93    compute-0   <none>           <none>
rook-ceph-tools-b7789c95-kwn7x                                    1/1     Terminating        0          8h      10.46.27.134   compute-0   <none>           <none>
rook-ceph-tools-b7789c95-z6s27                                    1/1     Running            0          71m     10.46.27.133   compute-2   <none>           <none>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

>> Current ceph status


=====ceph status ====
Wed Jul  1 16:18:11 UTC 2020
  cluster:
    id:     ca6c02dd-1942-4f11-bd38-1a03979fc85e
    health: HEALTH_WARN
            2 osds down
            2 hosts (2 osds) down
            1 rack (2 osds) down
            Reduced data availability: 1 pg inactive
            Degraded data redundancy: 31265/93789 objects degraded (33.335%), 324 pgs degraded, 400 pgs undersized
            1/3 mons down, quorum b,c
 
  services:
    mon: 3 daemons, quorum b,c (age 60m), out of quorum: a
    mgr: a(active, since 54m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 4 up (since 59m), 6 in (since 2d)
                                                    <<--- RGW is not listed
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
        mds.ocs-storagecluster-cephfilesystem-c: idle
        mds.ocs-storagecluster-cephfilesystem-d: idle
 
  data:
    pools:   11 pools, 400 pgs
    objects: 31.26k objects, 116 GiB
    usage:   207 GiB used, 1.8 TiB / 2.0 TiB avail
    pgs:     0.250% pgs not active
             31265/93789 objects degraded (33.335%)
             323 active+undersized+degraded
             76  active+undersized
             1   undersized+degraded+peered
 
  io:
    client:   1.2 KiB/s rd, 77 KiB/s wr, 2 op/s rd, 0 op/s wr


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Comment 4 Neha Berry 2020-07-01 17:50:13 UTC
Due to a network problem, log collection is still ongoing. I will update with the logs once collected.

Comment 6 Travis Nielsen 2020-07-01 18:52:56 UTC
In the log for the rgw pod in terminating state:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/must-gather.local.7750024955620505231/ceph/namespaces/openshift-storage/pods/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v/rgw/rgw/logs/current.log

The rgw log shows that responses were being generated for over 18 minutes:

2020-07-01T16:46:10.46609365Z debug 2020-07-01 16:46:10.464 7fe1a788f700  1 ====== starting new request req=0x7fe1e90c3890 =====
2020-07-01T16:46:10.466406957Z debug 2020-07-01 16:46:10.465 7fe1a788f700  1 ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0.00100006s ======
...
2020-07-01T17:04:50.46582237Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1 ====== starting new request req=0x7fe1e90c3890 =====
2020-07-01T17:04:50.465990506Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1 ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0s ======

In the pod status, it also shows the rgw pod as running:

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/ocs-pod-logs/openshift-storage/pods.json

            "metadata": {
                "name": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v",
                "generateName": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fff-",
...

            "status": {
                "phase": "Running",

Is it possible this must gather is from a different cluster? I just don't see the crash info for the rgw pod. Is there an `oc describe <rgw-pod>` available in the must-gather that I'm missing for the crashing rgw pod?

Comment 7 Travis Nielsen 2020-07-01 18:56:07 UTC
Ah, now I see the liveness probe failed from your original pod description:

  Warning  Unhealthy  19m (x9 over 21m)   kubelet, compute-2  Liveness probe failed: Get http://10.128.2.28:80/swift/healthcheck: dial tcp 10.128.2.28:80: connect: connection refused
  Normal   Killing    19m (x3 over 21m)   kubelet, compute-2  Container rgw failed liveness probe, will be restarted

Comment 8 Neha Berry 2020-07-01 19:03:52 UTC
(In reply to Travis Nielsen from comment #6)
> In the log for the rgw pod in terminating state:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/
> must-gather.local.7750024955620505231/ceph/namespaces/openshift-storage/pods/
> rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v/rgw/rgw/logs/
> current.log
> 
> The rgw log shows that responses were being generated for over 18 minutes:
> 
> 2020-07-01T16:46:10.46609365Z debug 2020-07-01 16:46:10.464 7fe1a788f700  1
> ====== starting new request req=0x7fe1e90c3890 =====
> 2020-07-01T16:46:10.466406957Z debug 2020-07-01 16:46:10.465 7fe1a788f700  1
> ====== req done req=0x7fe1e90c3890 op status=0 http_status=200
> latency=0.00100006s ======
> ...
> 2020-07-01T17:04:50.46582237Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1
> ====== starting new request req=0x7fe1e90c3890 =====
> 2020-07-01T17:04:50.465990506Z debug 2020-07-01 17:04:50.464 7fe15b7f7700  1
> ====== req done req=0x7fe1e90c3890 op status=0 http_status=200 latency=0s
> ======
> 
> In the pod status, it also shows the rgw pod as running:
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1852983-rgw-clbo/
> ocs-pod-logs/openshift-storage/pods.json
> 
>             "metadata": {
>                 "name":
> "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v",
>                 "generateName":
> "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fff-",
> ...
> 
>             "status": {
>                 "phase": "Running",
> 
> Is it possible this must gather is from a different cluster? I just don't
> see the crash info for the rgw pod. Is there an `oc describe <rgw-pod>`
> available in the must-gather that I'm missing for the crashing rgw pod?

Hi Travis

As I mentioned, I was unable to collect the must-gather until I powered the node back ON. Once the node is back, the new RGW pod stabilizes and we stop seeing the restarts.

Comment 11 Travis Nielsen 2020-07-01 19:33:54 UTC
The rgw pod appears to be running successfully, but the liveness probe is just failing.

  Warning  Unhealthy  3m38s (x14 over 8m28s)  kubelet, compute-2  Liveness probe failed: Get http://10.131.0.67:80/swift/healthcheck: dial tcp 10.131.0.67:80: connect: connection refused

From the rgw logs:

debug 2020-07-01 19:30:27.373 7f8b9df2c100  0 framework: beast
debug 2020-07-01 19:30:27.373 7f8b9df2c100  0 framework conf key: port, val: 80
debug 2020-07-01 19:30:27.374 7f8b9df2c100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 19:30:27.374 7f8b9df2c100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1

From inside the same pod the liveness probe is also failing:

sh-4.4# curl http://localhost/swift/healthcheck
curl: (7) Failed to connect to localhost port 80: Connection refused
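
For completeness, the probe the kubelet is enforcing can be read straight off the pod spec; a quick check along these lines (standard kubectl jsonpath; the app=rook-ceph-rgw label is the usual Rook label and is assumed here) confirms it is an HTTP GET on /swift/healthcheck, port 80:

$ RGW_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-rgw -o name | head -n1)
$ oc -n openshift-storage get "$RGW_POD" \
    -o jsonpath='{.spec.containers[?(@.name=="rgw")].livenessProbe}'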

Comment 12 Travis Nielsen 2020-07-01 23:14:12 UTC
@Jiffin @Matt Who from RGW could look into this issue with RGW not responding?

Comment 13 Jiffin 2020-07-02 08:01:23 UTC
IMO the rgw server itself has not started; please see the logs from the other rgw pod:

==== START logs for container rgw of pod openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v ====
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 framework: beast
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 framework conf key: port, val: 80
debug 2020-07-01 16:46:00.618 7fe1e90dc100  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:46:00.619 7fe1e90dc100  0 ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable), process radosgw, pid 1
debug 2020-07-01 16:46:01.004 7fe1e90dc100  0 starting handler: beast
debug 2020-07-01 16:46:01.006 7fe1e90dc100  0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-07-01 16:46:01.024 7fe1e90dc100  1 mgrc service_daemon_register rgw.ocs.storagecluster.cephobjectstore.a metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable),ceph_version_short=14.2.8-59.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v,container_image=quay.io/rhceph-dev/rhceph@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e,cpu=Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 8.2 (Ootpa),distro_version=8.2,frontend_config#0=beast port=80,frontend_type#0=beast,hostname=compute-2,kernel_description=#1 SMP Wed Jun 10 19:19:16 UTC 2020,kernel_version=4.18.0-147.20.1.el8_1.x86_64,mem_swap_kb=0,mem_total_kb=32934540,num_handles=1,os=Linux,pid=1,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79b8bc7fvj5v,pod_namespace=openshift-storage,zone_id=e102bd57-92b4-4481-a8a4-e7cd17cf65df,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=95556bcc-928d-4016-815a-7001b1804c89,zonegroup_name=ocs-storagecluster-cephobjectstore}

But I am not sure where or why it got stuck.

Comment 15 Matt Benjamin (redhat) 2020-07-02 12:43:55 UTC
(In reply to Travis Nielsen from comment #12)
> @Jiffin @Matt Who from RGW could look into this issue with RGW not
> responding?

sure, thanks

@mark, could this be related to the issue you're working on?

Matt

Comment 18 Travis Nielsen 2020-07-02 15:23:36 UTC
Not all the PGs are "active"; therefore reads and writes to the inactive PG(s) would be blocked. Still following up on the cause of the inactive PG (see the command sketch after the listing below)...

   pgs:     0.250% pgs not active
             9129/30015 objects degraded (30.415%)
             875/30015 objects misplaced (2.915%)
             298 active+undersized+degraded
             79  active+undersized
             20  active+clean+remapped
             2   active+clean
             1   undersized+degraded+peered
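
One quick way to pin down which PG is inactive, assuming the standard pg commands run from the toolbox:

$ ceph pg dump_stuck inactive   # lists the stuck/inactive PG id(s)
$ ceph pg 1.1 query             # then query it, as in the next comment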

Comment 19 Travis Nielsen 2020-07-02 16:33:48 UTC
Here are the details for the inactive PG:

sh-4.4# ceph pg 1.1 query
{
    "state": "undersized+degraded+peered",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 1169,
    "up": [
        4
    ],
    "acting": [
        4
    ],
    "acting_recovery_backfill": [
        "4"
    ],
    "info": {
        "pgid": "1.1",
        "last_update": "38'1",
        "last_complete": "38'1",
        "log_tail": "38'1",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 1,
        "purged_snaps": [],
        "history": {
            "epoch_created": 8,
            "epoch_pool_created": 8,
            "last_epoch_started": 866,
            "last_interval_started": 865,
            "last_epoch_clean": 866,
            "last_interval_clean": 865,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 1121,
            "same_interval_since": 1121,
            "same_primary_since": 1121,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_clean_scrub_stamp": "2020-07-01 14:29:50.960751"
        },
        "stats": {
            "version": "38'1",
            "reported_seq": "594",
            "reported_epoch": "1169",
            "state": "undersized+degraded+peered",
            "last_fresh": "2020-07-01 19:56:10.470098",
            "last_change": "2020-07-01 18:49:42.253350",
            "last_active": "2020-07-01 18:49:41.245332",
            "last_peered": "2020-07-01 19:56:10.470098",
            "last_clean": "2020-07-01 17:40:53.421560",
            "last_became_active": "2020-07-01 17:40:56.544617",
            "last_became_peered": "2020-07-01 18:49:42.253350",
            "last_unstale": "2020-07-01 19:56:10.470098",
            "last_undegraded": "2020-07-01 18:49:42.247539",
            "last_fullsized": "2020-07-01 18:49:42.247516",
            "mapping_epoch": 1121,
            "log_start": "38'1",
            "ondisk_log_start": "38'1",
            "created": 8,
            "last_epoch_clean": 866,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2020-07-01 14:29:50.960751",
            "last_clean_scrub_stamp": "2020-07-01 14:29:50.960751",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 46,
                "num_objects": 1,
                "num_object_clones": 0,
                "num_object_copies": 3,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 2,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 1,
                "num_whiteouts": 0,
                "num_read": 7,
                "num_read_kb": 7,
                "num_write": 1,
                "num_write_kb": 1,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 2,
                "num_bytes_recovered": 92,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [
                4
            ],
            "acting": [
                4
            ],
            "avail_no_missing": [
                "4"
            ],
            "object_location_counts": [
                {
                    "shards": "4",
                    "objects": 1
                }
            ],
            "blocked_by": [],
            "up_primary": 4,
            "acting_primary": 4,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 866,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-01 18:49:42.247533",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-01 18:49:41.245437"
        }
    ],
    "agent_state": {}
}

Comment 21 Travis Nielsen 2020-07-02 18:40:14 UTC
Per conversation in chat with Josh, we are hitting a corner case where the PG is stuck. 

From Josh:
the pg is only getting assigned to one osd
this is a rare quirk that can come up from crush not finding a 2nd osd in enough tries, which can occur more easily with small clusters
we can make it less likely with some crush tuning
...
this would have been present in ceph forever
it's unrelated to writes, just the fact that there are only 2 hosts up makes it more likely though (cluster is even smaller)

@Josh, tagging you for more details on the crush tuning, thanks

@Neha Since this is not a regression and does not happen frequently, sound good to remove the blocker tag from 4.4.z and move it to 4.5? Implementing the crush tuning will take more effort.

Comment 22 Michael Adam 2020-07-02 22:46:38 UTC
Thanks for the update Travis / Josh!

If it is not a regression, then at least we should not block 4.4.1 for it.

Comment 23 Travis Nielsen 2020-07-02 22:49:52 UTC
Agreed, let's consider for 4.5 if the scope of the crush tuning is small.

Comment 24 Michael Adam 2020-07-02 22:58:16 UTC
@Josh, could it be possible that the update to RHCS 4.1 (from RHCS 4.0) has introduced some small change that has made it more likely to hit the problem that Travis described?

Just pure speculation here, since I'm not well aware of the delta between 4.0 and 4.1... :-)

Comment 25 Elad 2020-07-02 23:00:00 UTC
As a follow-up to Michael's query, the concern here is that we have encountered this only in 4.4.1 so far.

Comment 27 Michael Adam 2020-07-03 13:42:43 UTC
@Neha, thanks for the details.

And don't get me wrong: I was not suggesting that RHCS introduced a regression. Josh already explained that this has been there for a long time. I was merely speculating that, if a change in any component could have made it more likely to hit it now, it would have been RHCS itself, which was updated from 4.0 to 4.1.

I agree with you that it has probably just been pure luck until now.

Comment 31 Josh Durgin 2020-07-06 22:31:08 UTC
In summary, there are two problems, not the crush issue I originally thought this was:

1) osds never got marked out

Recommended fix: set mon_osd_down_out_subtree_limit to 'datacenter' or something else above 'rack' in AWS

2) .rgw.root pool is distributed across hosts, not racks

Recommended fix: separate across racks for all pools in AWS

More detail below.

(In reply to Travis Nielsen from comment #21)
> Per conversation in chat with Josh, we are hitting a corner case where the
> PG is stuck. 
> 
> From Josh:
> the pg is only getting assigned to one osd
> this is a rare quirk that can come up from crush not finding a 2nd osd in
> enough tries, which can occur more easily with small clusters
> we can make it less likely with some crush tuning
> ...
> this would have been present in ceph forever
> it's unrelated to writes, just the fact that there are only 2 hosts up makes
> it more likely though (cluster is even smaller)
> 
> @Josh, tagging you for more details on the crush tuning, thanks
> 
> @Neha Since this is not a regression and does not happen frequently, sound
> good to remove the blocker tag from 4.4.z and move it to 4.5? Implementing
> the crush tuning will take more effort.

The first issue is that the OSDs that were down (not running) were never marked out (so data was still mapped to them).

You can see this by trying to map pg 1.1 with this osdmap:

$ osdmaptool osdmap --tree
osdmaptool: osdmap file 'osdmap'
ID   CLASS  WEIGHT   TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
 -1         3.00000  root default                                                   
 -8         1.00000      rack rack0                                                 
 -7         0.50000          host ocs-deviceset-0-0-j4mtq                           
  1    hdd  0.50000              osd.1                       down   1.00000  1.00000
-15         0.50000          host ocs-deviceset-0-1-zqnq7                           
  3    hdd  0.50000              osd.3                       down   1.00000  1.00000
 -4         1.00000      rack rack1                                                 
 -3         0.50000          host ocs-deviceset-1-0-pcs7b                           
  0    hdd  0.50000              osd.0                         up   1.00000  1.00000
-17         0.50000          host ocs-deviceset-1-1-st982                           
  4    hdd  0.50000              osd.4                         up   1.00000  1.00000
-12         1.00000      rack rack2                                                 
-11         0.50000          host ocs-deviceset-2-0-vbkqq                           
  2    hdd  0.50000              osd.2                         up   1.00000  1.00000
-19         0.50000          host ocs-deviceset-2-1-48dxg                           
  5    hdd  0.50000              osd.5                         up   1.00000  1.00000

$ osdmaptool --test-map-pg 1.1 osdmap

bin/osdmaptool: osdmap file 'osdmap'
 parsed '1.1' -> 1.1
1.1 raw ([1,4,3], p1) up ([4], p4) acting ([4], p4)

The raw crush choice (before accounting for down osds) includes osds 1 and 3, which are both down, so the only one left is 4.

Normally OSDs are marked out (and thus no longer considered by crush) after being down for 10 minutes (mon_osd_down_out_interval).
There are a few limits on this which can be hit on small clusters like this though:

1) mon_osd_min_in_ratio = 0.75 # do not automatically mark OSDs 'out' if fewer than this many OSDs are 'in'

2) mon_osd_down_out_subtree_limit = rack # do not automatically mark OSDs 'out' if an entire subtree of this size is down

Since this cluster had a virtual 'rack' go down, and it represented 33% of the OSDs, both of these limits would prevent the osds from being marked out.

On bare metal, where we don't have virtual 'racks' but instead separate across hosts, we wouldn't hit the 2nd limit, and the 1st limit would only have stopped the 2nd osd from being marked out, so crush would still be able to find 2 copies for each pg.
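
As an illustration of these limits and of recommended fix 1, this is the kind of check/tune that could be done from the toolbox, assuming the Nautilus `ceph config` interface (in an OCS deployment such an override would normally be applied through Rook rather than by hand):

$ ceph config get mon mon_osd_down_out_interval       # 600s before a down OSD is marked out
$ ceph config get mon mon_osd_min_in_ratio            # 0.75: don't mark out below this 'in' ratio
$ ceph config get mon mon_osd_down_out_subtree_limit  # 'rack': don't mark out a whole down rack
# Recommended fix 1: raise the subtree limit above 'rack' so a single
# (virtual) rack going down no longer blocks automatic mark-out
$ ceph config set mon mon_osd_down_out_subtree_limit datacenter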

(In reply to Michael Adam from comment #24)
> @Josh, could it be possible that the update to RHCS 4.1 (from RHCS 4.0) has
> introduced some small change that has made it more likely to hit the problem
> that Travis described?
> 
> Just pure speculation here, since I'm not well aware what's the delta
> between 4.0 and 4.1... :-)

I don't think anything about this behaviour changed between 4.0 and 4.1. Did rook change any settings related to marking osds out, or anything about how racks are set up? There's still an element of chance, in that this would only cause an availability problem if there happened to be a pg mapped to the two osds that were stuck down+in.

The 2nd problem, which caused this to be noticeable, was that the .rgw.root pool did not have a dedicated crush rule set up for it (perhaps the pool was auto-created by rgw?), so it uses the default crush rule that separates replicas across hosts instead of racks.
If it were separated across racks instead, a whole rack going down would not cause an availability problem even if the osds were never marked out, since there would still be 2 osds up to keep the pg available.
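
And as an illustration of recommended fix 2, an already-created pool can be moved onto a rack-separated rule by hand; a sketch of the kind of commands involved (the rule name below is arbitrary, and Rook normally creates such a rule itself, see comment 32):

$ ceph osd crush rule create-replicated rgw-root-rack default rack
$ ceph osd pool set .rgw.root crush_rule rgw-root-rack
$ ceph osd pool get .rgw.root crush_rule   # verify the pool now uses the rack rule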

Comment 32 Travis Nielsen 2020-07-07 22:04:19 UTC
@Neha Do you have an operator log from the original deployment of Rook? In the must-gather I see a later reconcile of the object store instead of the original reconcile.

@Josh Rook creates all of the rgw pools explicitly before starting rgw the first time, so I don't know what would have created the pool without the crush rule as expected. But this certainly sounds like the cause of the issues if the pool doesn't have the rack failure domain as expected.

When I did a local install of rook based on the release-4.4 branch, I see the .rgw.root pool created with the rack-based crush rule as expected.

2020-07-07 21:57:35.741027 I | exec: Running command: ceph osd crush rule create-replicated .rgw.root default rack --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/240953093
2020-07-07 21:57:36.343092 I | exec: Running command: ceph osd pool create .rgw.root 8 replicated .rgw.root --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/824994976

And I see the crush rule as expected here:

$ ceph osd crush dump
...
        {
            "rule_id": 6,
            "rule_name": ".rgw.root",
            "ruleset": 6,
            "type": 1,
            "min_size": 1,
            "max_size": 10,
            "steps": [
                {
                    "op": "take",
                    "item": -1,
                    "item_name": "default"
                },
                {
                    "op": "chooseleaf_firstn",
                    "num": 0,
                    "type": "rack"
                },
                {
                    "op": "emit"
                }
            ]
        },
...

Comment 33 Travis Nielsen 2020-07-08 23:31:38 UTC
@Neha Do you have the original log from the deployment that would show the configuration of the crush rules? The operator captured in the must-gather must have been restarted, so I didn't see the original reconcile there.

Comment 35 Travis Nielsen 2020-07-13 23:26:01 UTC
Ok, tracked down the issue after reading through the original operator log (thanks Neha!). 

The issue is that an object user was getting created before the object store was fully configured by Rook. Thus, Ceph was creating the .rgw.root pool with the default host failure domain when the object user was being created. About 30 seconds later when Rook finally attempted to create the .rgw.root pool, it already existed and therefore Rook did not set its failure domain to rack.

The object user controller was intended to wait for rgw pods to be running before creating a user; however, after querying for the pods it didn't check whether any pods were actually running. The fix is extremely simple and low risk: check the number of running rgw pods before allowing the object user to be created.

This issue does not repro in the 4.5 or newer codebase. This issue had already been fixed by a refactor in the object store with the move to the controller runtime.

This only affects 4.4, so moving back to that milestone to discuss if it's still a blocker to consider for 4.4.z. It's such a small change, I would suggest we take it.

Here is the simple downstream PR, pending agreement to include in 4.4.z.
https://github.com/openshift/rook/pull/81
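
For reference, the actual fix lives in the operator's object user controller (PR above); as a rough sketch of the ordering it enforces, the CLI equivalent would be to require the rgw pod(s) to be Ready before creating any object user (the app=rook-ceph-rgw label is the usual Rook label and is assumed here):

$ oc -n openshift-storage wait pod -l app=rook-ceph-rgw \
    --for=condition=Ready --timeout=300s
# ...and only then create the CephObjectStoreUser / object bucket claim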

Comment 37 Mudit Agarwal 2020-07-15 14:02:31 UTC
I just observed that this BZ was moved to MODIFIED because the downstream PR was merged, which is not appropriate because:

1. We still don't have any ack to merge the fix to 4.4.z. AFAIK, the PR should be merged only once we have all the acks.
2. 4.4.z is not open to take any fixes because right now we are working on 4.4.2, which is a CVE-only release, and any change targeted for 4.4.z should be merged when we are ready for 4.4.3.


Ideally, we should revert this PR.

Michael, please correct me if I am wrong.

Comment 38 Michael Adam 2020-07-15 16:03:31 UTC
(In reply to Mudit Agarwal from comment #37)
> I just observed that this BZ was moved to MODIFIED because the downstream PR
> was merged, which is not appropriate because:
> 
> 1. We still don't have any ack to merge the fix to 4.4.z. AFAIK, the PR
> should be merged only once we have all the acks.
> 2. 4.4.z is not open to take any fixes because right now we are working on
> 4.4.2, which is a CVE-only release, and any change targeted for 4.4.z should
> be merged when we are ready for 4.4.3.
> 
> 
> Ideally, we should revert this PR.
> 
> Michael, please correct me if I am wrong.

You are absolutely correct.

https://github.com/openshift/rook/pull/82

revert PR ^^

Comment 39 Travis Nielsen 2020-07-15 16:47:00 UTC
Yep, we had a miscommunication on merging this. Seb and I sync'd up about it.

Comment 40 Travis Nielsen 2020-09-09 19:29:02 UTC
@Michael Are we still taking fixes for 4.4.z or should we close this?

Comment 41 Travis Nielsen 2020-09-21 21:55:01 UTC
*** Bug 1880447 has been marked as a duplicate of this bug. ***

Comment 42 Travis Nielsen 2020-09-22 04:09:34 UTC
Closing since it is already fixed in 4.5 and there is no more 4.4.z release planned.

Comment 43 Red Hat Bugzilla 2023-09-14 06:03:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

