Bug 2177202
| Summary: | All worker nodes replaced, ODF in error state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Gayathri Menath <gmenath> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | Flags: | tnielsen: needinfo? (gmenath) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-20 20:00:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Gayathri Menath
2023-03-10 11:54:39 UTC
$ oc describe storagecluster ocs-storagecluster -n openshift-storage
Name: ocs-storagecluster
Namespace: openshift-storage
Labels: <none>
Annotations: uninstall.ocs.openshift.io/cleanup-policy: delete
uninstall.ocs.openshift.io/mode: graceful
API Version: ocs.openshift.io/v1
Kind: StorageCluster
Metadata:
Creation Timestamp: 2023-03-10T07:58:42Z
Finalizers:
storagecluster.ocs.openshift.io
Generation: 2
Managed Fields:
API Version: ocs.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:uninstall.ocs.openshift.io/cleanup-policy:
f:uninstall.ocs.openshift.io/mode:
f:finalizers:
.:
v:"storagecluster.ocs.openshift.io":
f:spec:
f:version:
Manager: ocs-operator
Operation: Update
Time: 2023-03-10T07:58:43Z
API Version: ocs.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.:
k:{"uid":"bf74285d-62f3-41c9-910d-c92784f62ca1"}:
f:spec:
.:
f:arbiter:
f:encryption:
.:
f:kms:
f:externalStorage:
f:managedResources:
.:
f:cephBlockPools:
f:cephCluster:
f:cephConfig:
f:cephDashboard:
f:cephFilesystems:
f:cephObjectStoreUsers:
f:cephObjectStores:
f:cephToolbox:
f:mirroring:
f:monDataDirHostPath:
f:storageDeviceSets:
Manager: manager
Operation: Update
Time: 2023-03-10T07:58:54Z
API Version: ocs.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
f:externalStorage:
.:
f:grantedCapacity:
f:failureDomain:
f:failureDomainKey:
f:failureDomainValues:
f:images:
.:
f:ceph:
.:
f:actualImage:
f:desiredImage:
f:noobaaCore:
.:
f:actualImage:
f:desiredImage:
f:noobaaDB:
.:
f:actualImage:
f:desiredImage:
f:kmsServerConnection:
f:nodeTopologies:
.:
f:labels:
.:
f:kubernetes.io/hostname:
f:topology.kubernetes.io/region:
f:topology.kubernetes.io/zone:
f:phase:
f:relatedObjects:
Manager: ocs-operator
Operation: Update
Subresource: status
Time: 2023-03-10T11:52:08Z
Owner References:
API Version: odf.openshift.io/v1alpha1
Kind: StorageSystem
Name: ocs-storagecluster-storagesystem
UID: bf74285d-62f3-41c9-910d-c92784f62ca1
Resource Version: 305666
UID: a3d071fb-58a0-4dfe-b17e-709cc61f315a
Spec:
Arbiter:
Encryption:
Kms:
External Storage:
Managed Resources:
Ceph Block Pools:
Ceph Cluster:
Ceph Config:
Ceph Dashboard:
Ceph Filesystems:
Ceph Object Store Users:
Ceph Object Stores:
Ceph Toolbox:
Mirroring:
Mon Data Dir Host Path: /var/lib/rook
Storage Device Sets:
Config:
Count: 1
Data PVC Template:
Metadata:
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 1
Storage Class Name: localblock
Volume Mode: Block
Status:
Name: ocs-deviceset
Placement:
Prepare Placement:
Replica: 3
Resources:
Version: 4.11.0
Status:
Conditions:
Last Heartbeat Time: 2023-03-10T11:52:08Z
Last Transition Time: 2023-03-10T11:01:47Z
Message: Reconcile completed successfully
Reason: ReconcileCompleted
Status: True
Type: ReconcileComplete
Last Heartbeat Time: 2023-03-10T11:52:08Z
Last Transition Time: 2023-03-10T11:01:47Z
Message: CephCluster error: Failed to configure ceph cluster
Reason: ClusterStateError
Status: False
Type: Available
Last Heartbeat Time: 2023-03-10T11:47:49Z
Last Transition Time: 2023-03-10T10:24:44Z
Message: CephCluster is creating: Configuring Ceph Mons
Reason: ClusterStateCreating
Status: True
Type: Progressing
Last Heartbeat Time: 2023-03-10T11:52:08Z
Last Transition Time: 2023-03-10T11:01:47Z
Message: CephCluster error: Failed to configure ceph cluster
Reason: ClusterStateError
Status: True
Type: Degraded
Last Heartbeat Time: 2023-03-10T11:47:49Z
Last Transition Time: 2023-03-10T11:03:22Z
Message: CephCluster is creating: Configuring Ceph Mons
Reason: ClusterStateCreating
Status: False
Type: Upgradeable
External Storage:
Granted Capacity: 0
Failure Domain: zone
Failure Domain Key: topology.kubernetes.io/zone
Failure Domain Values:
us-south-2
us-south-1
us-south-3
Images:
Ceph:
Actual Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:957294824e1cbf89ca24a1a2aa2a8e8acd567cfb5a25535e2624989ad1046a60
Desired Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:957294824e1cbf89ca24a1a2aa2a8e8acd567cfb5a25535e2624989ad1046a60
Noobaa Core:
Actual Image: registry.redhat.io/odf4/mcg-core-rhel8@sha256:f3470a4dc896b30d77bce7f1c887340ca48b4be9e4fc353e81aef8e1834cbcf5
Desired Image: registry.redhat.io/odf4/mcg-core-rhel8@sha256:f3470a4dc896b30d77bce7f1c887340ca48b4be9e4fc353e81aef8e1834cbcf5
Noobaa DB:
Actual Image: registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
Desired Image: registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
Kms Server Connection:
Node Topologies:
Labels:
kubernetes.io/hostname:
bha-odf-mar9-p-2645-host-4
bha-odf-mar9-p-2645-host-6
bha-odf-mar9-p-2645-host-8
bha-odf-mar9-p-2645-host-7
bha-odf-mar9-p-2645-host-3
bha-odf-mar9-p-2645-host-5
topology.kubernetes.io/region:
us-south
topology.kubernetes.io/zone:
us-south-2
us-south-1
us-south-3
Phase: Progressing
Related Objects:
API Version: ceph.rook.io/v1
Kind: CephCluster
Name: ocs-storagecluster-cephcluster
Namespace: openshift-storage
Resource Version: 304625
UID: ae876acf-bbed-491f-b2a1-a9d3b7d955a1
API Version: noobaa.io/v1alpha1
Kind: NooBaa
Name: noobaa
Namespace: openshift-storage
Resource Version: 229429
UID: a70e769e-d85f-4a02-b9e8-340dc50862bc
Events: <none>
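The same failure surfaces on the CephCluster resource listed under Related Objects above. A quick way to check it directly, along with the mon/OSD pods it is waiting on, is sketched below (it uses the default ODF resource names shown above and standard Rook pod names; adjust if they differ):

# Phase, health and message of the CephCluster the StorageCluster condition points at
$ oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage

# Mon and OSD pods, with the node each one is scheduled on (or stuck Pending for)
$ oc get pods -n openshift-storage -o wide | grep -E 'rook-ceph-(mon|osd)'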
OSD removal job is in error state

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
mon_data_avail_warn = 15
[osd]
osd_memory_target_cgroup_limit_ratio = 0.8
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2023-03-10 11:33:19.520767 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2023-03-10 11:33:34.625564 C | rookcmd: failed to get osd dump: failed to get osd dump: exit status 1

Travis Nielsen

This cluster is running on local PVs, and the mons are using the host path to store their metadata. If you replace all the nodes at the same time, you immediately lose mon quorum and all OSDs, and the cluster cannot recover. Notice that all the mon pods are Pending, which means they are waiting for their original nodes to come back online; mon quorum is lost until those same nodes return. Replacing all the nodes at the same time can only be done if the mons and OSDs are portable. Have you tested this on an AWS cluster?

Please reopen if there is still an issue to discuss.
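For anyone reproducing this, a minimal way to confirm that the mons are host-path backed and pinned to the replaced nodes is sketched below. It assumes the standard Rook label app=rook-ceph-mon and that non-portable mon deployments carry a kubernetes.io/hostname node selector; adjust if your deployment differs:

# Mon data location; with local-storage device sets this is a host path, not a PVC
$ oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage \
    -o jsonpath='{.spec.dataDirHostPath}{"\n"}'

# Pending mon pods, and the hostname each mon deployment is pinned to
$ oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide
$ oc get deploy -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.nodeSelector}{"\n"}{end}'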