Description of problem:
A cluster with the dev addon that contains the topology changes related to ODFMS-55 cannot finish the ODF addon installation and stays stuck in the Installing state. The size parameter was set to 8.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.7

How reproducible:
1/1

Steps to Reproduce:
1. Install the provider:
   rosa create service --type ocs-provider-dev --name fbalak-pr --machine-cidr 10.0.0.0/16 --size 8 --onboarding-validation-key <key> --subnet-ids <subnet-ids> --region us-east-1
2. Wait until the installation finishes.

Actual results:
The installation does not finish after 3 hours. There are events on the ocs-operator pod indicating insufficient memory:
0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Expected results:
The cluster finishes the installation.

Additional info:
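A few read-only checks can narrow down which node is cordoned, which taints are in play, and how much CPU/memory is already committed on the workers. A rough sketch (the node name below is only an example taken from the pod listing later in this bug):

$ # Scheduling failures recorded for pods in the storage namespace
$ oc get events -n openshift-storage --field-selector reason=FailedScheduling
$ # Which nodes are cordoned and what taints they carry
$ oc get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable,TAINTS:.spec.taints[*].key
$ # Committed vs. allocatable resources on a suspect worker (example node name)
$ oc describe node ip-10-0-146-95.ec2.internal | grep -A 10 'Allocated resources'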
Deployment was tested again after changing the OSD memory to 5800Mi. Some pods are not running. $ oc get pods -o wide -n openshift-storage NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES 6af63c67149ab3b269c737daf8def41d78c6bb149170553e4c3914cd31rjbtb 0/1 Completed 0 3h11m 10.129.2.41 ip-10-0-146-95.ec2.internal <none> <none> addon-ocs-provider-dev-catalog-q2666 1/1 Running 0 3h11m 10.129.2.39 ip-10-0-146-95.ec2.internal <none> <none> alertmanager-managed-ocs-alertmanager-0 2/2 Running 0 177m 10.128.2.17 ip-10-0-138-237.ec2.internal <none> <none> alertmanager-managed-ocs-alertmanager-1 0/2 Pending 0 177m <none> <none> <none> <none> alertmanager-managed-ocs-alertmanager-2 2/2 Running 0 178m 10.128.2.15 ip-10-0-138-237.ec2.internal <none> <none> cb3d906f639ebb485aaea5f79d4dac57d31d0e663c1f60667880d6e7fd9srmj 0/1 Completed 0 3h11m 10.129.2.40 ip-10-0-146-95.ec2.internal <none> <none> csi-addons-controller-manager-b8b965868-tr7jq 2/2 Running 3 (3h7m ago) 3h9m 10.129.2.50 ip-10-0-146-95.ec2.internal <none> <none> ocs-metrics-exporter-577574796b-2jbvn 1/1 Running 0 3h7m 10.131.0.14 ip-10-0-169-230.ec2.internal <none> <none> ocs-operator-5c77756ddd-hg8gc 1/1 Running 0 178m 10.128.2.14 ip-10-0-138-237.ec2.internal <none> <none> ocs-osd-controller-manager-5fb6bc955d-stwtv 2/3 Running 3 (3h8m ago) 3h10m 10.129.2.45 ip-10-0-146-95.ec2.internal <none> <none> ocs-provider-server-7694d4875b-vtfb7 1/1 Running 3 (3h7m ago) 3h9m 10.131.0.8 ip-10-0-169-230.ec2.internal <none> <none> odf-console-585db6ddb-gtjhq 1/1 Running 0 178m 10.128.2.11 ip-10-0-138-237.ec2.internal <none> <none> odf-operator-controller-manager-7866b5fdbb-jlb4d 2/2 Running 0 178m 10.128.2.12 ip-10-0-138-237.ec2.internal <none> <none> prometheus-managed-ocs-prometheus-0 3/3 Running 0 3h9m 10.131.0.10 ip-10-0-169-230.ec2.internal <none> <none> prometheus-operator-8547cc9f89-h4nns 1/1 Running 0 3h5m 10.131.0.21 ip-10-0-169-230.ec2.internal <none> <none> rook-ceph-crashcollector-ip-10-0-133-162.ec2.internal-6959xbz9h 1/1 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none> rook-ceph-crashcollector-ip-10-0-138-237.ec2.internal-cb59x27wc 1/1 Running 0 3h1m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none> rook-ceph-crashcollector-ip-10-0-146-95.ec2.internal-69946tw5bs 0/1 Pending 0 178m <none> <none> <none> <none> rook-ceph-crashcollector-ip-10-0-152-183.ec2.internal-5b69pcgt4 1/1 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none> rook-ceph-crashcollector-ip-10-0-168-88.ec2.internal-66777jssmj 0/1 Pending 0 177m <none> <none> <none> <none> rook-ceph-crashcollector-ip-10-0-169-230.ec2.internal-d444s7z59 0/1 Pending 0 3h1m <none> <none> <none> <none> rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-849f56ccjl4jm 2/2 Running 0 178m 10.0.146.95 ip-10-0-146-95.ec2.internal <none> <none> rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cbdbf889tmv7n 2/2 Running 0 178m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none> rook-ceph-mgr-a-598d65f8cb-vmxhz 2/2 Running 0 3h1m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none> rook-ceph-mon-a-6f5f986bc4-4vfdg 2/2 Running 0 3h6m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none> rook-ceph-mon-b-857c67df96-jf9bl 2/2 Running 0 3h3m 10.0.146.95 ip-10-0-146-95.ec2.internal <none> <none> rook-ceph-mon-c-57bcdfbff9-jkpzn 2/2 Running 0 3h3m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none> rook-ceph-operator-564cb5cb98-bg7r7 1/1 Running 0 178m 10.128.2.13 ip-10-0-138-237.ec2.internal <none> <none> rook-ceph-osd-0-6b6965d658-5sr4p 
2/2 Running 0 3h 10.0.168.88 ip-10-0-168-88.ec2.internal <none> <none> rook-ceph-osd-1-58d45d8b65-8nx5p 0/2 Pending 0 3h <none> <none> <none> <none> rook-ceph-osd-2-cdb5cb56-d9kwz 2/2 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none> rook-ceph-osd-3-f9fc495b9-rfm25 2/2 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none> rook-ceph-osd-4-674ccf987-6tgbn 2/2 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none> rook-ceph-osd-5-55477d4879-vbnzt 2/2 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none> rook-ceph-osd-prepare-default-0-data-02lr54-xpn58 0/1 Completed 0 3h1m 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none> rook-ceph-osd-prepare-default-1-data-0pclls-6n6vl 0/1 Completed 0 3h1m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none> rook-ceph-osd-prepare-default-1-data-1gh97s-lmdcr 0/1 Completed 0 3h1m 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none> rook-ceph-osd-prepare-default-2-data-1njrsp-6rrjv 0/1 Completed 0 3h1m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none> rook-ceph-tools-787676bdbd-jgpz4 1/1 Running 0 3h9m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none> Insufficient cpu for OSD pod. $ oc describe pod rook-ceph-osd-1-58d45d8b65-8nx5p Name: rook-ceph-osd-1-58d45d8b65-8nx5p Namespace: openshift-storage Priority: 2000001000 Priority Class Name: system-node-critical Node: <none> Labels: app=rook-ceph-osd app.kubernetes.io/component=cephclusters.ceph.rook.io app.kubernetes.io/created-by=rook-ceph-operator app.kubernetes.io/instance=1 app.kubernetes.io/managed-by=rook-ceph-operator app.kubernetes.io/name=ceph-osd app.kubernetes.io/part-of=ocs-storagecluster-cephcluster ceph-osd-id=1 ceph-version=16.2.7-126 ceph.rook.io/DeviceSet=default-2 ceph.rook.io/pvc=default-2-data-08hqkr ceph_daemon_id=1 ceph_daemon_type=osd failure-domain=default-2-data-08hqkr osd=1 pod-template-hash=58d45d8b65 portable=true rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63 rook.io/operator-namespace=openshift-storage rook_cluster=openshift-storage topology-location-host=default-2-data-08hqkr topology-location-region=us-east-1 topology-location-root=default topology-location-zone=us-east-1c Annotations: openshift.io/scc: rook-ceph Status: Pending IP: IPs: <none> Controlled By: ReplicaSet/rook-ceph-osd-1-58d45d8b65 Init Containers: blkdevmapper: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: /bin/bash -c set -xe PVC_SOURCE=/default-2-data-08hqkr PVC_DEST=/var/lib/ceph/osd/ceph-1/block CP_ARGS=(--archive --dereference --verbose) if [ -b "$PVC_DEST" ]; then PVC_SOURCE_MAJ_MIN=$(stat --format '%t%T' $PVC_SOURCE) PVC_DEST_MAJ_MIN=$(stat --format '%t%T' $PVC_DEST) if [[ "$PVC_SOURCE_MAJ_MIN" == "$PVC_DEST_MAJ_MIN" ]]; then CP_ARGS+=(--no-clobber) else echo "PVC's source major/minor numbers changed" CP_ARGS+=(--remove-destination) fi fi cp "${CP_ARGS[@]}" "$PVC_SOURCE" "$PVC_DEST" Limits: cpu: 1750m memory: 5800Mi Requests: cpu: 1750m memory: 5800Mi Environment: <none> Mounts: /var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1") /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) Devices: /default-2-data-08hqkr from default-2-data-08hqkr activate: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: ceph-bluestore-tool Args: 
prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /var/lib/ceph/osd/ceph-1 --no-mon-config Limits: cpu: 1750m memory: 5800Mi Requests: cpu: 1750m memory: 5800Mi Environment: <none> Mounts: /var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1") /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) Devices: /var/lib/ceph/osd/ceph-1/block from default-2-data-08hqkr expand-bluefs: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: ceph-bluestore-tool Args: bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1 Limits: cpu: 1750m memory: 5800Mi Requests: cpu: 1750m memory: 5800Mi Environment: <none> Mounts: /var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1") /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) chown-container-data-dir: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: chown Args: --verbose --recursive ceph:ceph /var/log/ceph /var/lib/ceph/crash /var/lib/ceph/osd/ceph-1 Limits: cpu: 1750m memory: 5800Mi Requests: cpu: 1750m memory: 5800Mi Environment: <none> Mounts: /etc/ceph from rook-config-override (ro) /run/udev from run-udev (rw) /var/lib/ceph/crash from rook-ceph-crash (rw) /var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1") /var/lib/rook from rook-data (rw) /var/log/ceph from rook-ceph-log (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) Containers: osd: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: ceph-osd Args: --foreground --id 1 --fsid 70287815-1de0-4251-b5fa-df8a7ace78b8 --setuser ceph --setgroup ceph --crush-location=root=default host=default-2-data-08hqkr region=us-east-1 zone=us-east-1c --osd-recovery-sleep=0.1 --osd-snap-trim-sleep=2 --osd-delete-sleep=2 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false Limits: cpu: 1750m memory: 5800Mi Requests: cpu: 1750m memory: 5800Mi Liveness: exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.1.asok status] delay=10s timeout=1s period=10s #success=1 #failure=3 Startup: exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.1.asok status] delay=10s timeout=1s period=10s #success=1 #failure=9 Environment Variables from: rook-ceph-osd-env-override ConfigMap Optional: true Environment: ROOK_NODE_NAME: default-2-data-08hqkr ROOK_CLUSTER_ID: 5de55d50-3129-4929-965a-9ce6994e9f0c ROOK_CLUSTER_NAME: ocs-storagecluster-cephcluster ROOK_PRIVATE_IP: (v1:status.podIP) ROOK_PUBLIC_IP: (v1:status.podIP) POD_NAMESPACE: openshift-storage ROOK_MON_ENDPOINTS: <set to the key 'data' of config map 'rook-ceph-mon-endpoints'> Optional: false ROOK_MON_SECRET: <set to the key 'mon-secret' in secret 'rook-ceph-mon'> Optional: false ROOK_CEPH_USERNAME: <set to the key 'ceph-username' in secret 'rook-ceph-mon'> Optional: false ROOK_CEPH_SECRET: <set to the key 'ceph-secret' in secret 'rook-ceph-mon'> Optional: false ROOK_CONFIG_DIR: /var/lib/rook ROOK_CEPH_CONFIG_OVERRIDE: /etc/rook/config/override.conf ROOK_FSID: <set to the key 'fsid' in secret 'rook-ceph-mon'> Optional: false NODE_NAME: (v1:spec.nodeName) ROOK_CRUSHMAP_ROOT: default 
ROOK_CRUSHMAP_HOSTNAME: default-2-data-08hqkr CEPH_VOLUME_DEBUG: 1 CEPH_VOLUME_SKIP_RESTORECON: 1 DM_DISABLE_UDEV: 1 CONTAINER_IMAGE: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 POD_NAME: rook-ceph-osd-1-58d45d8b65-8nx5p (v1:metadata.name) POD_MEMORY_LIMIT: 6081740800 (limits.memory) POD_MEMORY_REQUEST: 6081740800 (requests.memory) POD_CPU_LIMIT: 2 (limits.cpu) POD_CPU_REQUEST: 2 (requests.cpu) ROOK_OSD_UUID: a70cfe80-0c33-4dac-a478-95fe269c0b82 ROOK_OSD_ID: 1 ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST) ROOK_BLOCK_PATH: /mnt/default-2-data-08hqkr ROOK_CV_MODE: raw ROOK_OSD_DEVICE_CLASS: nvme ROOK_OSD_PVC_SIZE: 4Ti ROOK_TOPOLOGY_AFFINITY: topology.kubernetes.io/zone=us-east-1c ROOK_PVC_BACKED_OSD: true Mounts: /etc/ceph from rook-config-override (ro) /run/udev from run-udev (rw) /var/lib/ceph/crash from rook-ceph-crash (rw) /var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1") /var/lib/rook from rook-data (rw) /var/log/ceph from rook-ceph-log (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) log-collector: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 Port: <none> Host Port: <none> Command: /bin/bash -x -e -m -c CEPH_CLIENT_ID=ceph-osd.1 PERIODICITY=24h LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph if [ -z "$PERIODICITY" ]; then PERIODICITY=24h fi # edit the logrotate file to only rotate a specific daemon log # otherwise we will logrotate log files without reloading certain daemons # this might happen when multiple daemons run on the same machine sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE" while true; do sleep "$PERIODICITY" echo "starting log rotation" logrotate --verbose --force "$LOG_ROTATE_CEPH_FILE" echo "I am going to sleep now, see you in $PERIODICITY" done Limits: cpu: 50m memory: 80Mi Requests: cpu: 50m memory: 80Mi Environment: <none> Mounts: /etc/ceph from rook-config-override (ro) /var/lib/ceph/crash from rook-ceph-crash (rw) /var/log/ceph from rook-ceph-log (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro) Conditions: Type Status PodScheduled False Volumes: rook-data: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> rook-config-override: Type: Projected (a volume that contains injected data from multiple sources) ConfigMapName: rook-config-override ConfigMapOptional: <nil> rook-ceph-log: Type: HostPath (bare host directory volume) Path: /var/lib/rook/openshift-storage/log HostPathType: rook-ceph-crash: Type: HostPath (bare host directory volume) Path: /var/lib/rook/openshift-storage/crash HostPathType: default-2-data-08hqkr: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: default-2-data-08hqkr ReadOnly: false default-2-data-08hqkr-bridge: Type: HostPath (bare host directory volume) Path: /var/lib/rook/openshift-storage/default-2-data-08hqkr HostPathType: DirectoryOrCreate run-udev: Type: HostPath (bare host directory volume) Path: /run/udev HostPathType: kube-api-access-kf79w: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Guaranteed 
Node-Selectors: <none> Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 5s node.ocs.openshift.io/osd=true:NoSchedule node.ocs.openshift.io/storage=true:NoSchedule Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 178m default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 176m (x3 over 177m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 174m (x10 over 176m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 2m47s (x168 over 172m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector. 
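The pod's topology labels pin it to us-east-1c, and beyond the defaults it only tolerates the node.ocs.openshift.io/osd and node.ocs.openshift.io/storage taints, so the only candidates are the workers in that zone. A sketch for checking whether any of them has enough free CPU for the OSD pod's requests (1750m for the osd container plus 50m for log-collector); the zone value is taken from the pod labels above, the node name is only an example:

$ # Workers in the zone the pending OSD is pinned to
$ oc get nodes -l topology.kubernetes.io/zone=us-east-1c -L topology.kubernetes.io/zone
$ # Taints on those candidates; anything beyond the osd/storage taints rules a node out
$ oc get nodes -l topology.kubernetes.io/zone=us-east-1c -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
$ # CPU/memory already committed on one candidate (example node name)
$ oc describe node ip-10-0-168-88.ec2.internal | grep -A 10 'Allocated resources'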
$ oc describe pod alertmanager-managed-ocs-alertmanager-1 Name: alertmanager-managed-ocs-alertmanager-1 Namespace: openshift-storage Priority: 0 Node: <none> Labels: alertmanager=managed-ocs-alertmanager app.kubernetes.io/instance=managed-ocs-alertmanager app.kubernetes.io/managed-by=prometheus-operator app.kubernetes.io/name=alertmanager app.kubernetes.io/version=0.23.0 controller-revision-hash=alertmanager-managed-ocs-alertmanager-6dc4cdcbc6 statefulset.kubernetes.io/pod-name=alertmanager-managed-ocs-alertmanager-1 Annotations: kubectl.kubernetes.io/default-container: alertmanager openshift.io/scc: restricted Status: Pending IP: IPs: <none> Controlled By: StatefulSet/alertmanager-managed-ocs-alertmanager Containers: alertmanager: Image: registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10.0-202204090935.p0.g0133959.assembly.stream Ports: 9093/TCP, 9094/TCP, 9094/UDP Host Ports: 0/TCP, 0/TCP, 0/UDP Args: --config.file=/etc/alertmanager/config/alertmanager.yaml --storage.path=/alertmanager --data.retention=120h --cluster.listen-address=[$(POD_IP)]:9094 --web.listen-address=:9093 --web.route-prefix=/ --cluster.peer=alertmanager-managed-ocs-alertmanager-0.alertmanager-operated:9094 --cluster.peer=alertmanager-managed-ocs-alertmanager-1.alertmanager-operated:9094 --cluster.peer=alertmanager-managed-ocs-alertmanager-2.alertmanager-operated:9094 --cluster.reconnect-timeout=5m Limits: cpu: 100m memory: 200Mi Requests: cpu: 100m memory: 200Mi Liveness: http-get http://:web/-/healthy delay=0s timeout=3s period=10s #success=1 #failure=10 Readiness: http-get http://:web/-/ready delay=3s timeout=3s period=5s #success=1 #failure=10 Environment: POD_IP: (v1:status.podIP) Mounts: /alertmanager from alertmanager-managed-ocs-alertmanager-db (rw) /etc/alertmanager/certs from tls-assets (ro) /etc/alertmanager/config from config-volume (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p7g6k (ro) config-reloader: Image: registry.redhat.io/openshift4/ose-prometheus-config-reloader:v4.10.0-202204090935.p0.g73ddd44.assembly.stream Port: 8080/TCP Host Port: 0/TCP Command: /bin/prometheus-config-reloader Args: --listen-address=:8080 --reload-url=http://localhost:9093/-/reload --watched-dir=/etc/alertmanager/config Limits: cpu: 100m memory: 50Mi Requests: cpu: 100m memory: 50Mi Environment: POD_NAME: alertmanager-managed-ocs-alertmanager-1 (v1:metadata.name) SHARD: -1 Mounts: /etc/alertmanager/config from config-volume (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p7g6k (ro) Conditions: Type Status PodScheduled False Volumes: config-volume: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-managed-ocs-alertmanager-generated Optional: false tls-assets: Type: Projected (a volume that contains injected data from multiple sources) SecretName: alertmanager-managed-ocs-alertmanager-tls-assets-0 SecretOptionalName: <nil> alertmanager-managed-ocs-alertmanager-db: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> kube-api-access-p7g6k: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Guaranteed Node-Selectors: <none> Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 
300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 179m default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate. Warning FailedScheduling 179m (x3 over 179m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate. Warning FailedScheduling 6m32s (x207 over 178m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. $ oc get csv NAME DISPLAY VERSION REPLACES PHASE mcg-operator.v4.10.6 NooBaa Operator 4.10.6 mcg-operator.v4.10.5 Succeeded ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded ocs-osd-deployer.v2.0.7 OCS OSD Deployer 2.0.7 Installing odf-csi-addons-operator.v4.10.5 CSI Addons 4.10.5 odf-csi-addons-operator.v4.10.4 Succeeded odf-operator.v4.10.5 OpenShift Data Foundation 4.10.5 odf-operator.v4.10.4 Succeeded ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded route-monitor-operator.v0.1.422-151be96 Route Monitor Operator 0.1.422-151be96 route-monitor-operator.v0.1.420-b65f47e Succeeded $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.33 True False 3h21m Cluster version is 4.10.33
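When a CSV sits in Installing, its status usually records which requirement or deployment it is still waiting on. A sketch for surfacing that (CSV name taken from the listing above; field names as exposed by OLM in the CSV status):

$ # Phase plus the operator's own explanation for the ocs-osd-deployer CSV
$ oc get csv ocs-osd-deployer.v2.0.7 -n openshift-storage -o jsonpath='{.status.phase}{"\n"}{.status.reason}{": "}{.status.message}{"\n"}'
$ # Per-requirement status (APIs, deployments) the CSV is still waiting for
$ oc get csv ocs-osd-deployer.v2.0.7 -n openshift-storage -o jsonpath='{range .status.requirementStatus[*]}{.kind}{" "}{.name}{" -> "}{.status}{"\n"}{end}'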
Must-gather (comment #2): http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-3-pr/jijoy-3-pr_20221003T095311/logs/testcases_1664813044/

Hi,
- When I checked last time, I observed from the must-gather that the nodes were being updated, so some of them were temporarily in an unschedulable (cordoned) state, which is why the scheduler logs "1 node(s) were unschedulable".
- Once the upgrade completes, that error should go away.
- However, there was an issue with 4.10.33 (an observation only; I did not find a Bugzilla for it) where the node allocatable resources are quite a bit lower than expected (expected: 3920m vs. actual: 3500m).
- When I tested with 4.10.34 this morning, I did not observe the above issue, so this is most likely a regression on the OCP side.
- So I'd ask to create a new cluster and retest the scaling/installation.
Thanks,
Leela.
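The capacity-vs-allocatable gap mentioned above can be checked directly on a cluster with something like the following (a rough sketch):

$ # Capacity vs. allocatable CPU per node; a shrinking allocatable value points at system/kube reservations
$ oc get nodes -o custom-columns=NAME:.metadata.name,CPU_CAPACITY:.status.capacity.cpu,CPU_ALLOCATABLE:.status.allocatable.cpu,MEM_ALLOCATABLE:.status.allocatable.memory
$ # Nodes still cordoned by the rolling update
$ oc get nodes --field-selector spec.unschedulable=true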
- Still awaiting a response on whether this was hit again after redeployment.
- More info on this issue:
1. It is a known flow that AMIs can get updated even while the addon is being installed.
2. In that scenario one of the nodes will be in the Unschedulable state.
3. It is a legitimate issue only when all nodes have been in the Ready state for a considerably long time and the addon is still stuck in the installing/failed state.
4. An OSD cannot run on another node because its underlying PVC is bound to a specific zone, so during upgrades or a single-node reboot it is expected that one of the OSDs stays in the Pending state; in other words, the fault tolerance is in play (see the sketch after this list).
5. It was confirmed that during node upgrades the allocatable CPU is reduced; this should not cause issues for our addon while that specific node is being upgraded.
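For point 4, one way to confirm that a Pending OSD is simply zone-bound rather than genuinely stuck is to compare the zone recorded on its backing PV with the zone of the cordoned node. A sketch (the PVC name is taken from the pod's ceph.rook.io/pvc label in the describe output above; the exact nodeAffinity path can differ slightly per CSI driver):

$ # Zone constraint recorded on the PV behind the pending OSD's data PVC
$ PV=$(oc get pvc default-2-data-08hqkr -n openshift-storage -o jsonpath='{.spec.volumeName}')
$ oc get pv "$PV" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}{"\n"}'
$ # Zone of the node that is currently cordoned for the upgrade
$ oc get nodes --field-selector spec.unschedulable=true -L topology.kubernetes.io/zone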
We will schedule a deployment of a size 8 cluster for next week.
- The bug is resolved; the dependent Jira issue has been fixed on the OCM side.
@Leela We are seeing this issue in new deployments with the QE addon.

Tested in version:
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.17            Observability Operator        0.0.17            observability-operator.v0.0.17-rc         Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded

Some pods are in the Pending state due to insufficient resources. This is the status right after deployment; no other operations were done on the cluster.
$ oc get pods -o wide | grep -v 'Running\|Completed'
NAME                                                              READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
rook-ceph-crashcollector-ip-10-0-163-204.ec2.internal-58b9bjvt8   0/1     Pending   0          5h10m   <none>   <none>   <none>           <none>
rook-ceph-osd-2-987bd7c5-89hwh                                    0/2     Pending   0          5h13m   <none>   <none>   <none>           <none>
rook-ceph-osd-3-5784d8b7c8-pnvxj                                  0/2     Pending   0          5h13m   <none>   <none>   <none>           <none>

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  43s (x399 over 5h24m)  default-scheduler  0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.

The status of one worker node is 'SchedulingDisabled'.
$ oc get nodes
NAME                           STATUS                     ROLES          AGE     VERSION
ip-10-0-128-210.ec2.internal   Ready                      worker         5h28m   v1.23.12+8a6bfe4
ip-10-0-131-98.ec2.internal    Ready                      master         5h49m   v1.23.12+8a6bfe4
ip-10-0-133-16.ec2.internal    Ready                      infra,worker   5h29m   v1.23.12+8a6bfe4
ip-10-0-134-41.ec2.internal    Ready                      worker         5h44m   v1.23.12+8a6bfe4
ip-10-0-148-246.ec2.internal   Ready                      infra,worker   5h29m   v1.23.12+8a6bfe4
ip-10-0-152-117.ec2.internal   Ready                      worker         5h28m   v1.23.12+8a6bfe4
ip-10-0-154-160.ec2.internal   Ready                      master         5h50m   v1.23.12+8a6bfe4
ip-10-0-157-189.ec2.internal   Ready                      worker         5h44m   v1.23.12+8a6bfe4
ip-10-0-162-36.ec2.internal    Ready                      master         5h49m   v1.23.12+8a6bfe4
ip-10-0-163-204.ec2.internal   Ready,SchedulingDisabled   worker         5h28m   v1.23.12+8a6bfe4
ip-10-0-164-197.ec2.internal   Ready                      infra,worker   5h29m   v1.23.12+8a6bfe4
ip-10-0-174-227.ec2.internal   Ready                      worker         5h40m   v1.23.12+8a6bfe4

logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-bz37-pr/jijoy-bz37-pr_20221223T042745/logs/testcases_1671792496/
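Since one worker has been SchedulingDisabled for hours, checking whether the machine-config rollout is stuck on that node (for example on a drain that cannot evict an OSD) would narrow this down. A sketch, assuming the managed cluster exposes these resources (node name taken from the listing above):

$ # Is the worker pool still rolling out an update, and is it degraded?
$ oc get machineconfigpool worker
$ # MCO state annotation on the cordoned node (Done / Working / Degraded)
$ oc get node ip-10-0-163-204.ec2.internal -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
$ # Pods still running on that node that a drain would have to evict
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-163-204.ec2.internal -o wide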
> 5h13m
- Pending for such a long time is quite a tricky scenario.
- I can think of two reasons why this is happening: the drain on a new node could not proceed because of a PDB, or the ongoing drain hit some other issue.
- Either way, we can proceed only after looking at a live cluster.
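To tell those two scenarios apart on a live cluster, the Ceph disruption budgets and Ceph health are the first things to look at. A sketch, assuming the rook-ceph-tools deployment seen in the pod listing is still available:

$ # Disruption budgets protecting the Ceph daemons; 0 allowed disruptions will block a node drain
$ oc get pdb -n openshift-storage
$ oc describe pdb -n openshift-storage
$ # Ceph's own view; Rook keeps OSD disruptions blocked while PGs are not clean
$ oc rsh -n openshift-storage deployment/rook-ceph-tools ceph status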
Hi Leela, we faced this issue with a size 20 cluster as well.

$ oc get pods | egrep -v '(Running|Completed)'
NAME                                                              READY   STATUS    RESTARTS   AGE
alertmanager-managed-ocs-alertmanager-0                           0/2     Pending   0          110m
ocs-metrics-exporter-5dd96c885b-lf46k                             0/1     Pending   0          110m
rook-ceph-crashcollector-ip-10-0-142-35.ec2.internal-7d947nsp5c   0/1     Pending   0          110m
rook-ceph-crashcollector-ip-10-0-154-93.ec2.internal-56688gknf4   0/1     Pending   0          111m
rook-ceph-crashcollector-ip-10-0-160-72.ec2.internal-7f454ffxjm   0/1     Pending   0          109m
rook-ceph-osd-14-66c75f68dc-sxs7z                                 0/2     Pending   0          111m
rook-ceph-osd-6-55d85cccf8-kddcx                                  0/2     Pending   0          111m
rook-ceph-osd-7-74d599f4b6-pcx7j                                  0/2     Pending   0          111m
rook-ceph-osd-8-78bd587c97-mgz54                                  0/2     Pending   0          111m
rook-ceph-osd-9-5b68fc68b4-2p99f                                  0/2     Pending   0          111m

$ oc get nodes
NAME                           STATUS                     ROLES          AGE    VERSION
ip-10-0-128-44.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-132-177.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-133-84.ec2.internal    Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-136-188.ec2.internal   Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-142-35.ec2.internal    Ready                      worker         132m   v1.23.12+8a6bfe4
ip-10-0-143-114.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-147-121.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-151-231.ec2.internal   Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-153-87.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-154-93.ec2.internal    Ready                      worker         131m   v1.23.12+8a6bfe4
ip-10-0-155-208.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-157-56.ec2.internal    Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-160-174.ec2.internal   Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-160-72.ec2.internal    Ready,SchedulingDisabled   worker         120m   v1.23.12+8a6bfe4
ip-10-0-161-135.ec2.internal   Ready                      worker         134m   v1.23.12+8a6bfe4
ip-10-0-162-68.ec2.internal    Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-164-82.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-168-82.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4

Events from one of the pods (rook-ceph-osd-14-66c75f68dc-sxs7z):
  Warning  FailedScheduling  25m (x112 over 114m)  default-scheduler  0/18 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 9 node(s) didn't match Pod's node affinity/selector.

$ rosa list addons -c jijoy-size20-pr | grep ocs-provider-qe
ocs-provider-qe   Red Hat OpenShift Data Foundation Managed Service Provider (QE)   installing

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.17            Observability Operator        0.0.17            observability-operator.v0.0.17-rc         Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Installing
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Installing
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded

[Note: ocs-operator.v4.10.9 also showed the Failed state at times, and ocs-osd-deployer.v2.0.11 also showed the Pending state.]
managedocs status:
  status:
    components:
      alertmanager:
        state: Pending
      prometheus:
        state: Ready
      storageCluster:
        state: Ready

OCS and OCP must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-size20-pr/jijoy-size20-pr_20221229T154719/logs/failed_testcase_ocs_logs_1672329486/deployment_ocs_logs/
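The deployer reports alertmanager as the only component that is not Ready, which matches alertmanager-managed-ocs-alertmanager-0 being Pending above. With several OSDs Pending as well, mapping each Pending pod to its device-set PVC and comparing against the cordoned node would show whether everything traces back to the single SchedulingDisabled worker. A sketch using labels visible in the describe output earlier in this bug (the managedocs CR instance is assumed to be named "managedocs"):

$ # Per-component state reported by the deployer
$ oc get managedocs managedocs -n openshift-storage -o jsonpath='{.status.components}{"\n"}'
$ # Pending OSD pods together with the device-set PVC each one is tied to
$ oc get pods -n openshift-storage -l app=rook-ceph-osd --field-selector=status.phase=Pending -L ceph.rook.io/pvc
$ # Scheduler events for the Pending alertmanager replica
$ oc get events -n openshift-storage --field-selector involvedObject.name=alertmanager-managed-ocs-alertmanager-0,reason=FailedScheduling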
Hi Leela, as suggested I have opened a new bug, bug #2156988, for the installation issue with size 20 (comment #20).
*** Bug 2156988 has been marked as a duplicate of this bug. ***
Please try the test on the latest build.
Deployment and tier1 tests were executed.
Tested versions:
OCP 4.10.50
ODF 4.10.9-7
Test results - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy8-14-c1/jijoy8-14-c1_20230214T072244/multicluster/logs/test_report_1676369952.html
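For the record, the version check on the verification cluster boils down to something like this (a sketch; the operators are assumed to be installed in openshift-storage):

$ # OCP level of the verification cluster
$ oc get clusterversion
$ # Deployer and ODF operator builds actually installed
$ oc get csv -n openshift-storage | grep -E 'ocs-osd-deployer|odf-operator|ocs-operator'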
Verified in deployer version 2.0.11
Closing this bug as fixed in v2.0.11 and tested by QE.