Travis, I haven't had a chance to test anything atm; I just got 11k PVCs created :) Will update once I have more info... Setting needinfo on myself.
Travis,

Okay, I got a few moments to do some testing. From the UI, I see these events:

- Pod openshift-storage/rook-ceph-operator-554974456c-4pvn7 has been in a non-ready state for longer than 15 minutes.
- pod/rook-ceph-operator-554974456c-tw7gr in namespace openshift-storage on container rook-ceph-operator has been in waiting state for longer than 1 hour.
- Deployment openshift-storage/rook-ceph-operator has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partitioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.

That doesn't seem to be correct at first glance:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-4pvn7
rook-ceph-operator-554974456c-4pvn7   1/1   Running   0   50s
kelson@quorra:~$ oc get deployment|grep rook-ceph-operator
rook-ceph-operator   1/1   1   1   57d

However, I can't seem to view the logs:

kelson@quorra:~$ oc logs rook-ceph-operator-554974456c-4pvn7
Error from server (NotFound): pods "rook-ceph-operator-554974456c-4pvn7" not found

Seems the pod has restarted:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-9585s
rook-ceph-operator-554974456c-9585s   1/1   Running   0   50s

but it's gone again:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-9585s
kelson@quorra:~$

Hmm... it keeps getting redeployed:

rook-ceph-operator-554974456c-zkgg9   1/1   Terminating         0   60s
rook-ceph-operator-554974456c-tx67x   0/1   Pending             0   0s
rook-ceph-operator-554974456c-tx67x   0/1   Pending             0   0s
rook-ceph-operator-554974456c-tx67x   0/1   Pending             0   0s
rook-ceph-operator-554974456c-tx67x   0/1   ContainerCreating   0   0s
rook-ceph-operator-554974456c-zkgg9   0/1   Terminating         0   60s
rook-ceph-operator-554974456c-tx67x   0/1   ContainerCreating   0   1s
rook-ceph-operator-554974456c-zkgg9   0/1   Terminating         0   61s
rook-ceph-operator-554974456c-zkgg9   0/1   Terminating         0   61s

Logs at the time of the pod restart:

2024-09-06 01:07:27.006485 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/portable=0
2024-09-06 01:07:27.006524 D | op-config: setting "rook/cluster/storage/device-set/count/non-portable"="3" option in the mon config-key store
2024-09-06 01:07:27.006548 D | exec: Running command: ceph config-key set rook/cluster/storage/device-set/count/non-portable 3 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2024-09-06 01:07:27.101556 I | operator: shutdown signal received, exiting... context canceled
2024-09-06 01:07:27.101811 I | op-mon: stopping monitoring of mons in namespace "openshift-storage"
2024-09-06 01:07:27.102027 I | op-osd: stopping monitoring of OSDs in namespace "openshift-storage"
2024-09-06 01:07:27.102077 I | ceph-cluster-controller: stopping monitoring of ceph status

Sadly, I don't know what triggers this "operator: shutdown signal received, exiting... context canceled". Viewing [1] shows where the message comes from, but not why it fires. Why, though?
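Since each replacement pod is gone before I can pull its logs, a few commands that might help catch the next restart in the act (a sketch, using only the namespace and deployment names already shown above):

oc get pods -n openshift-storage -w | grep rook-ceph-operator
    # watch replacement pods appear and terminate in real time
oc logs -n openshift-storage -f deployment/rook-ceph-operator
    # picks a pod backing the deployment and follows its logs until that pod goes away
oc get events -n openshift-storage --sort-by=.lastTimestamp | grep rook-ceph-operator
    # Killing/Scheduled/Started events around each restart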
Kubelet logs:

Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.186971 2539 kubelet.go:2419] "SyncLoop DELETE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.187904 2539 kuberuntime_container.go:745] "Killing container with a grace period" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" podUID="09e2cf5d-557a-461b-92ec-6d6559a43796" containerName="rook-ceph-operator" containerID="cri-o://44837cc7ff3875bfc8699ccd15cc513a61ed4d9cdf68a05098358c45b8ebdcf8" gracePeriod=30
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228353 2539 kubelet.go:2403] "SyncLoop ADD" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228517 2539 topology_manager.go:215] "Topology Admit Handler" podUID="0c8924ab-324c-45fe-86b2-2d21dac5a462" podNamespace="openshift-storage" podName="rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: E0906 01:13:27.228720 2539 cpu_manager.go:396] "RemoveStaleState: removing container" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228738 2539 state_mem.go:107] "Deleted CPUSet assignment" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228871 2539 memory_manager.go:346] "RemoveStaleState removing state" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.267752 2539 kubelet.go:2410] "SyncLoop UPDATE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.326164 2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.332108 2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.332184 2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433535 2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433611 2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433649 2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.434043 2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.434252 2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.457487 2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.603034 2539 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.997613 2539 kubelet.go:2410] "SyncLoop UPDATE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.071517 2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" event={"ID":"09e2cf5d-557a-461b-92ec-6d6559a43796","Type":"ContainerDied","Data":"44837cc7ff3875bfc8699ccd15cc513a61ed4d9cdf68a05098358c45b8ebdcf8"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.071560 2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" event={"ID":"09e2cf5d-557a-461b-92ec-6d6559a43796","Type":"ContainerDied","Data":"da911b7cae3e737e6bf235750bc7f0b70a2bc9f51dff63f3dda4d0862d79c141"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.073460 2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc" event={"ID":"0c8924ab-324c-45fe-86b2-2d21dac5a462","Type":"ContainerStarted","Data":"d31051774bbd3f7627c036ee2163a455ce84f7197af08916518fd0cd59feed4e"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.143705 2539 kubelet.go:2419] "SyncLoop DELETE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.146430 2539 kubelet.go:2413] "SyncLoop REMOVE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:29 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:29.082163 2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc" event={"ID":"0c8924ab-324c-45fe-86b2-2d21dac5a462","Type":"ContainerStarted","Data":"0208fd9eea43321abea50652ad50a003ae928f65522044b43860f33be8eb9521"}

I have to stop for today; I'll pick this up tomorrow...

[1] https://github.com/rook/rook/blob/master/pkg/operator/ceph/operator.go#L98
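One lead for tomorrow: the kubelet entries above show the deletions arriving as "SyncLoop DELETE" source="api", i.e. the pods are being deleted through the API server rather than dying on their own, so the kube-apiserver audit log should name whichever client is issuing the DELETEs. Roughly (a sketch; exact audit log paths vary by cluster, and <master-node> is a placeholder for a control-plane node name):

oc adm node-logs --role=master --path=kube-apiserver/
    # list the audit log files available on each master
oc adm node-logs <master-node> --path=kube-apiserver/audit.log | grep '"verb":"delete"' | grep rook-ceph-operator-554974456c
    # find the delete requests and the user/serviceaccount behind them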
Sorry, I've been slacking on this. Hopefully I'll have some time to get to this today...
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days