Bug 2315464 - [GSS] CephCluster progressing state due to validation check fail, sees two zones instead of three
Summary: [GSS] CephCluster progressing state due to validation check fail, sees two zones instead of three
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.14
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Santosh Pillai
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-09-28 14:39 UTC by Shriya Mulay
Modified: 2025-07-11 08:29 UTC
CC: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links:
- GitHub: red-hat-storage/ocs-operator pull 2841 (Merged) - Adjust Node Label Conditions Based on Full Label Name (2024-10-21 08:06:48 UTC)
- Red Hat Issue Tracker: OCSBZM-9414 (2024-10-21 08:08:02 UTC)

Description Shriya Mulay 2024-09-28 14:39:55 UTC
Description of problem (please be as detailed as possible and provide log snippets):

- For a stretch cluster, an upgrade was performed. It appeared to be successful, but it was later noticed that the Ceph components were still using the older hotfix images; this was fixed by removing "reconcileStrategy: ignore".
- Afterwards, the storagecluster went into an "error" state with the following condition:
-----
 - lastHeartbeatTime: "2024-09-28T11:07:32Z"
    lastTransitionTime: "2024-09-28T10:47:01Z"
    message: 'CephCluster error: failed to perform validation before cluster creation:
      expecting exactly three zones for the stretch cluster, but found 2'
    reason: ClusterStateError
    status: "True"
    type: Degraded
-----
- Additionally, the registry is unable to mount the CephFS volumes because it cannot reach the mon service; I suspect this is due to the mismatched Ceph versions. The rook-ceph-csi-config was missing the mon IPs (see the checks sketched after the YAML below).

- We tried applying the zone failureDomainKey and failureDomainValue to the ODF nodes, but it had no effect.
- Below is the relevant config from the storagecluster YAML:

----
  failureDomain: zone
  failureDomainKey: topology.kubernetes.io/zone-principal
  failureDomainValues:
  - "true"

<snip>

  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - <node>-hnfz8
      - <node>-whnrh
      - <node>-9xv56
      - <node>-pgjxm
      topology.kubernetes.io/zone-principal:
      - "true"
---------------
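
For reference, a hedged sketch of two checks that follow from the points above; the label keys and the ConfigMap name are the ones quoted elsewhere in this bug, and the openshift-storage namespace is an assumption:

----
# Show which zone labels and the ODF storage label each node carries
oc get nodes \
  -L topology.kubernetes.io/zone,topology.kubernetes.io/zone-principal,cluster.ocs.openshift.io/openshift-storage

# Inspect the CSI config that should carry the mon IPs
oc -n openshift-storage get configmap rook-ceph-csi-config -o yaml
----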


Version of all relevant components (if applicable):
ODF 4.14.10


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Unable to mount volumes, ceph version mismatch


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
NA

Can this issue be reproduced from the UI?


Actual results:
ODF is unable to detect all three zones even though they are present.


Expected results:
ODF should detect three zones.

Additional info:
Next update.

Comment 11 khover 2024-09-30 12:55:12 UTC
The arbiter node is missing our label:

Name:               slocp4oat101-dgmwp
Roles:              storage,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    dynatrace=none
                    env=global
                    kind=storage
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=slocp4oat101-dgmwp
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/storage=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/zone=arbiter
                    topology.kubernetes.io/zone-principal=true
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.77.5.107

Needs our label
cluster.ocs.openshift.io/openshift-storage=

      message: 'CephCluster error: failed to perform validation before cluster creation:
        expecting exactly three zones for the stretch cluster, but found 2'
      reason: ClusterStateError


Storagecluster CR

    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - slocp4odt100-hnfz8
        - slocp4odt101-whnrh
        - slocp4odt202-9xv56
        - slocp4odt203-pgjxm
        topology.kubernetes.io/zone-principal:
        - "true"
    phase: Error
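
If the missing label turns out to be the cause, a minimal sketch of adding it to the arbiter node (node name taken from the output above) would be:

----
oc label node slocp4oat101-dgmwp cluster.ocs.openshift.io/openshift-storage=""
----

Once the label is in place, the operator should see the arbiter zone as well, and the pre-creation validation should then find all three zones.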

