Bug 2131237
| Summary: | Managed Service cluster with size 8 can not be installed | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | odf-managed-service | Assignee: | Leela Venkaiah Gangavarapu <lgangava> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | aeyal, dbindra, ebenahar, jijoy, lgangava, nberry, ocs-bugs, odf-bz-bot, rchikatw |
| Target Milestone: | --- | Flags: | rchikatw: needinfo- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-14 15:33:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The deployment was tested again after changing the OSD memory to 5800Mi. Some pods are not running:
$ oc get pods -o wide -n openshift-storage
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
6af63c67149ab3b269c737daf8def41d78c6bb149170553e4c3914cd31rjbtb 0/1 Completed 0 3h11m 10.129.2.41 ip-10-0-146-95.ec2.internal <none> <none>
addon-ocs-provider-dev-catalog-q2666 1/1 Running 0 3h11m 10.129.2.39 ip-10-0-146-95.ec2.internal <none> <none>
alertmanager-managed-ocs-alertmanager-0 2/2 Running 0 177m 10.128.2.17 ip-10-0-138-237.ec2.internal <none> <none>
alertmanager-managed-ocs-alertmanager-1 0/2 Pending 0 177m <none> <none> <none> <none>
alertmanager-managed-ocs-alertmanager-2 2/2 Running 0 178m 10.128.2.15 ip-10-0-138-237.ec2.internal <none> <none>
cb3d906f639ebb485aaea5f79d4dac57d31d0e663c1f60667880d6e7fd9srmj 0/1 Completed 0 3h11m 10.129.2.40 ip-10-0-146-95.ec2.internal <none> <none>
csi-addons-controller-manager-b8b965868-tr7jq 2/2 Running 3 (3h7m ago) 3h9m 10.129.2.50 ip-10-0-146-95.ec2.internal <none> <none>
ocs-metrics-exporter-577574796b-2jbvn 1/1 Running 0 3h7m 10.131.0.14 ip-10-0-169-230.ec2.internal <none> <none>
ocs-operator-5c77756ddd-hg8gc 1/1 Running 0 178m 10.128.2.14 ip-10-0-138-237.ec2.internal <none> <none>
ocs-osd-controller-manager-5fb6bc955d-stwtv 2/3 Running 3 (3h8m ago) 3h10m 10.129.2.45 ip-10-0-146-95.ec2.internal <none> <none>
ocs-provider-server-7694d4875b-vtfb7 1/1 Running 3 (3h7m ago) 3h9m 10.131.0.8 ip-10-0-169-230.ec2.internal <none> <none>
odf-console-585db6ddb-gtjhq 1/1 Running 0 178m 10.128.2.11 ip-10-0-138-237.ec2.internal <none> <none>
odf-operator-controller-manager-7866b5fdbb-jlb4d 2/2 Running 0 178m 10.128.2.12 ip-10-0-138-237.ec2.internal <none> <none>
prometheus-managed-ocs-prometheus-0 3/3 Running 0 3h9m 10.131.0.10 ip-10-0-169-230.ec2.internal <none> <none>
prometheus-operator-8547cc9f89-h4nns 1/1 Running 0 3h5m 10.131.0.21 ip-10-0-169-230.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-133-162.ec2.internal-6959xbz9h 1/1 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-138-237.ec2.internal-cb59x27wc 1/1 Running 0 3h1m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-146-95.ec2.internal-69946tw5bs 0/1 Pending 0 178m <none> <none> <none> <none>
rook-ceph-crashcollector-ip-10-0-152-183.ec2.internal-5b69pcgt4 1/1 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-168-88.ec2.internal-66777jssmj 0/1 Pending 0 177m <none> <none> <none> <none>
rook-ceph-crashcollector-ip-10-0-169-230.ec2.internal-d444s7z59 0/1 Pending 0 3h1m <none> <none> <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-849f56ccjl4jm 2/2 Running 0 178m 10.0.146.95 ip-10-0-146-95.ec2.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cbdbf889tmv7n 2/2 Running 0 178m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none>
rook-ceph-mgr-a-598d65f8cb-vmxhz 2/2 Running 0 3h1m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none>
rook-ceph-mon-a-6f5f986bc4-4vfdg 2/2 Running 0 3h6m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none>
rook-ceph-mon-b-857c67df96-jf9bl 2/2 Running 0 3h3m 10.0.146.95 ip-10-0-146-95.ec2.internal <none> <none>
rook-ceph-mon-c-57bcdfbff9-jkpzn 2/2 Running 0 3h3m 10.0.138.237 ip-10-0-138-237.ec2.internal <none> <none>
rook-ceph-operator-564cb5cb98-bg7r7 1/1 Running 0 178m 10.128.2.13 ip-10-0-138-237.ec2.internal <none> <none>
rook-ceph-osd-0-6b6965d658-5sr4p 2/2 Running 0 3h 10.0.168.88 ip-10-0-168-88.ec2.internal <none> <none>
rook-ceph-osd-1-58d45d8b65-8nx5p 0/2 Pending 0 3h <none> <none> <none> <none>
rook-ceph-osd-2-cdb5cb56-d9kwz 2/2 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none>
rook-ceph-osd-3-f9fc495b9-rfm25 2/2 Running 0 3h 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none>
rook-ceph-osd-4-674ccf987-6tgbn 2/2 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none>
rook-ceph-osd-5-55477d4879-vbnzt 2/2 Running 0 179m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-02lr54-xpn58 0/1 Completed 0 3h1m 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-0pclls-6n6vl 0/1 Completed 0 3h1m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-1gh97s-lmdcr 0/1 Completed 0 3h1m 10.0.133.162 ip-10-0-133-162.ec2.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-1njrsp-6rrjv 0/1 Completed 0 3h1m 10.0.152.183 ip-10-0-152-183.ec2.internal <none> <none>
rook-ceph-tools-787676bdbd-jgpz4 1/1 Running 0 3h9m 10.0.169.230 ip-10-0-169-230.ec2.internal <none> <none>
Insufficient CPU for the OSD pod (see the describe output below and the headroom check sketched after its scheduling events).
$ oc describe pod rook-ceph-osd-1-58d45d8b65-8nx5p
Name: rook-ceph-osd-1-58d45d8b65-8nx5p
Namespace: openshift-storage
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <none>
Labels: app=rook-ceph-osd
app.kubernetes.io/component=cephclusters.ceph.rook.io
app.kubernetes.io/created-by=rook-ceph-operator
app.kubernetes.io/instance=1
app.kubernetes.io/managed-by=rook-ceph-operator
app.kubernetes.io/name=ceph-osd
app.kubernetes.io/part-of=ocs-storagecluster-cephcluster
ceph-osd-id=1
ceph-version=16.2.7-126
ceph.rook.io/DeviceSet=default-2
ceph.rook.io/pvc=default-2-data-08hqkr
ceph_daemon_id=1
ceph_daemon_type=osd
failure-domain=default-2-data-08hqkr
osd=1
pod-template-hash=58d45d8b65
portable=true
rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
rook.io/operator-namespace=openshift-storage
rook_cluster=openshift-storage
topology-location-host=default-2-data-08hqkr
topology-location-region=us-east-1
topology-location-root=default
topology-location-zone=us-east-1c
Annotations: openshift.io/scc: rook-ceph
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/rook-ceph-osd-1-58d45d8b65
Init Containers:
blkdevmapper:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
set -xe
PVC_SOURCE=/default-2-data-08hqkr
PVC_DEST=/var/lib/ceph/osd/ceph-1/block
CP_ARGS=(--archive --dereference --verbose)
if [ -b "$PVC_DEST" ]; then
PVC_SOURCE_MAJ_MIN=$(stat --format '%t%T' $PVC_SOURCE)
PVC_DEST_MAJ_MIN=$(stat --format '%t%T' $PVC_DEST)
if [[ "$PVC_SOURCE_MAJ_MIN" == "$PVC_DEST_MAJ_MIN" ]]; then
CP_ARGS+=(--no-clobber)
else
echo "PVC's source major/minor numbers changed"
CP_ARGS+=(--remove-destination)
fi
fi
cp "${CP_ARGS[@]}" "$PVC_SOURCE" "$PVC_DEST"
Limits:
cpu: 1750m
memory: 5800Mi
Requests:
cpu: 1750m
memory: 5800Mi
Environment: <none>
Mounts:
/var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
Devices:
/default-2-data-08hqkr from default-2-data-08hqkr
activate:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
ceph-bluestore-tool
Args:
prime-osd-dir
--dev
/var/lib/ceph/osd/ceph-1/block
--path
/var/lib/ceph/osd/ceph-1
--no-mon-config
Limits:
cpu: 1750m
memory: 5800Mi
Requests:
cpu: 1750m
memory: 5800Mi
Environment: <none>
Mounts:
/var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
Devices:
/var/lib/ceph/osd/ceph-1/block from default-2-data-08hqkr
expand-bluefs:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
ceph-bluestore-tool
Args:
bluefs-bdev-expand
--path
/var/lib/ceph/osd/ceph-1
Limits:
cpu: 1750m
memory: 5800Mi
Requests:
cpu: 1750m
memory: 5800Mi
Environment: <none>
Mounts:
/var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
chown-container-data-dir:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
chown
Args:
--verbose
--recursive
ceph:ceph
/var/log/ceph
/var/lib/ceph/crash
/var/lib/ceph/osd/ceph-1
Limits:
cpu: 1750m
memory: 5800Mi
Requests:
cpu: 1750m
memory: 5800Mi
Environment: <none>
Mounts:
/etc/ceph from rook-config-override (ro)
/run/udev from run-udev (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1")
/var/lib/rook from rook-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
Containers:
osd:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
ceph-osd
Args:
--foreground
--id
1
--fsid
70287815-1de0-4251-b5fa-df8a7ace78b8
--setuser
ceph
--setgroup
ceph
--crush-location=root=default host=default-2-data-08hqkr region=us-east-1 zone=us-east-1c
--osd-recovery-sleep=0.1
--osd-snap-trim-sleep=2
--osd-delete-sleep=2
--log-to-stderr=true
--err-to-stderr=true
--mon-cluster-log-to-stderr=true
--log-stderr-prefix=debug
--default-log-to-file=false
--default-mon-cluster-log-to-file=false
Limits:
cpu: 1750m
memory: 5800Mi
Requests:
cpu: 1750m
memory: 5800Mi
Liveness: exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.1.asok status] delay=10s timeout=1s period=10s #success=1 #failure=3
Startup: exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.1.asok status] delay=10s timeout=1s period=10s #success=1 #failure=9
Environment Variables from:
rook-ceph-osd-env-override ConfigMap Optional: true
Environment:
ROOK_NODE_NAME: default-2-data-08hqkr
ROOK_CLUSTER_ID: 5de55d50-3129-4929-965a-9ce6994e9f0c
ROOK_CLUSTER_NAME: ocs-storagecluster-cephcluster
ROOK_PRIVATE_IP: (v1:status.podIP)
ROOK_PUBLIC_IP: (v1:status.podIP)
POD_NAMESPACE: openshift-storage
ROOK_MON_ENDPOINTS: <set to the key 'data' of config map 'rook-ceph-mon-endpoints'> Optional: false
ROOK_MON_SECRET: <set to the key 'mon-secret' in secret 'rook-ceph-mon'> Optional: false
ROOK_CEPH_USERNAME: <set to the key 'ceph-username' in secret 'rook-ceph-mon'> Optional: false
ROOK_CEPH_SECRET: <set to the key 'ceph-secret' in secret 'rook-ceph-mon'> Optional: false
ROOK_CONFIG_DIR: /var/lib/rook
ROOK_CEPH_CONFIG_OVERRIDE: /etc/rook/config/override.conf
ROOK_FSID: <set to the key 'fsid' in secret 'rook-ceph-mon'> Optional: false
NODE_NAME: (v1:spec.nodeName)
ROOK_CRUSHMAP_ROOT: default
ROOK_CRUSHMAP_HOSTNAME: default-2-data-08hqkr
CEPH_VOLUME_DEBUG: 1
CEPH_VOLUME_SKIP_RESTORECON: 1
DM_DISABLE_UDEV: 1
CONTAINER_IMAGE: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
POD_NAME: rook-ceph-osd-1-58d45d8b65-8nx5p (v1:metadata.name)
POD_MEMORY_LIMIT: 6081740800 (limits.memory)
POD_MEMORY_REQUEST: 6081740800 (requests.memory)
POD_CPU_LIMIT: 2 (limits.cpu)
POD_CPU_REQUEST: 2 (requests.cpu)
ROOK_OSD_UUID: a70cfe80-0c33-4dac-a478-95fe269c0b82
ROOK_OSD_ID: 1
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST)
ROOK_BLOCK_PATH: /mnt/default-2-data-08hqkr
ROOK_CV_MODE: raw
ROOK_OSD_DEVICE_CLASS: nvme
ROOK_OSD_PVC_SIZE: 4Ti
ROOK_TOPOLOGY_AFFINITY: topology.kubernetes.io/zone=us-east-1c
ROOK_PVC_BACKED_OSD: true
Mounts:
/etc/ceph from rook-config-override (ro)
/run/udev from run-udev (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/osd/ceph-1 from default-2-data-08hqkr-bridge (rw,path="ceph-1")
/var/lib/rook from rook-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
log-collector:
Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-x
-e
-m
-c
CEPH_CLIENT_ID=ceph-osd.1
PERIODICITY=24h
LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph
if [ -z "$PERIODICITY" ]; then
PERIODICITY=24h
fi
# edit the logrotate file to only rotate a specific daemon log
# otherwise we will logrotate log files without reloading certain daemons
# this might happen when multiple daemons run on the same machine
sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE"
while true; do
sleep "$PERIODICITY"
echo "starting log rotation"
logrotate --verbose --force "$LOG_ROTATE_CEPH_FILE"
echo "I am going to sleep now, see you in $PERIODICITY"
done
Limits:
cpu: 50m
memory: 80Mi
Requests:
cpu: 50m
memory: 80Mi
Environment: <none>
Mounts:
/etc/ceph from rook-config-override (ro)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf79w (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
rook-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
rook-config-override:
Type: Projected (a volume that contains injected data from multiple sources)
ConfigMapName: rook-config-override
ConfigMapOptional: <nil>
rook-ceph-log:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/openshift-storage/log
HostPathType:
rook-ceph-crash:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/openshift-storage/crash
HostPathType:
default-2-data-08hqkr:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: default-2-data-08hqkr
ReadOnly: false
default-2-data-08hqkr-bridge:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/openshift-storage/default-2-data-08hqkr
HostPathType: DirectoryOrCreate
run-udev:
Type: HostPath (bare host directory volume)
Path: /run/udev
HostPathType:
kube-api-access-kf79w:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
node.ocs.openshift.io/osd=true:NoSchedule
node.ocs.openshift.io/storage=true:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 178m default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 176m (x3 over 177m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 174m (x10 over 176m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 2m47s (x168 over 172m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
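A quick way to confirm the shortfall (suggested sketch, not output captured from this cluster): the OSD is pinned to zone us-east-1c, so only workers in that zone are candidates; check the pod's affinity and the CPU headroom on such a worker. <worker-in-us-east-1c> is a placeholder for one of the ip-10-0-*.ec2.internal workers in that zone.
# Zone/node affinity the pending OSD carries:
$ oc -n openshift-storage get pod rook-ceph-osd-1-58d45d8b65-8nx5p \
    -o jsonpath='{.spec.affinity.nodeAffinity}{"\n"}'
# Allocatable CPU vs. CPU already requested on a candidate worker:
$ oc get node <worker-in-us-east-1c> -o jsonpath='{.status.allocatable.cpu}{"\n"}'
$ oc describe node <worker-in-us-east-1c> | grep -A 8 'Allocated resources'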
$ oc describe pod alertmanager-managed-ocs-alertmanager-1
Name: alertmanager-managed-ocs-alertmanager-1
Namespace: openshift-storage
Priority: 0
Node: <none>
Labels: alertmanager=managed-ocs-alertmanager
app.kubernetes.io/instance=managed-ocs-alertmanager
app.kubernetes.io/managed-by=prometheus-operator
app.kubernetes.io/name=alertmanager
app.kubernetes.io/version=0.23.0
controller-revision-hash=alertmanager-managed-ocs-alertmanager-6dc4cdcbc6
statefulset.kubernetes.io/pod-name=alertmanager-managed-ocs-alertmanager-1
Annotations: kubectl.kubernetes.io/default-container: alertmanager
openshift.io/scc: restricted
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/alertmanager-managed-ocs-alertmanager
Containers:
alertmanager:
Image: registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.10.0-202204090935.p0.g0133959.assembly.stream
Ports: 9093/TCP, 9094/TCP, 9094/UDP
Host Ports: 0/TCP, 0/TCP, 0/UDP
Args:
--config.file=/etc/alertmanager/config/alertmanager.yaml
--storage.path=/alertmanager
--data.retention=120h
--cluster.listen-address=[$(POD_IP)]:9094
--web.listen-address=:9093
--web.route-prefix=/
--cluster.peer=alertmanager-managed-ocs-alertmanager-0.alertmanager-operated:9094
--cluster.peer=alertmanager-managed-ocs-alertmanager-1.alertmanager-operated:9094
--cluster.peer=alertmanager-managed-ocs-alertmanager-2.alertmanager-operated:9094
--cluster.reconnect-timeout=5m
Limits:
cpu: 100m
memory: 200Mi
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get http://:web/-/healthy delay=0s timeout=3s period=10s #success=1 #failure=10
Readiness: http-get http://:web/-/ready delay=3s timeout=3s period=5s #success=1 #failure=10
Environment:
POD_IP: (v1:status.podIP)
Mounts:
/alertmanager from alertmanager-managed-ocs-alertmanager-db (rw)
/etc/alertmanager/certs from tls-assets (ro)
/etc/alertmanager/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p7g6k (ro)
config-reloader:
Image: registry.redhat.io/openshift4/ose-prometheus-config-reloader:v4.10.0-202204090935.p0.g73ddd44.assembly.stream
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/prometheus-config-reloader
Args:
--listen-address=:8080
--reload-url=http://localhost:9093/-/reload
--watched-dir=/etc/alertmanager/config
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 100m
memory: 50Mi
Environment:
POD_NAME: alertmanager-managed-ocs-alertmanager-1 (v1:metadata.name)
SHARD: -1
Mounts:
/etc/alertmanager/config from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p7g6k (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: alertmanager-managed-ocs-alertmanager-generated
Optional: false
tls-assets:
Type: Projected (a volume that contains injected data from multiple sources)
SecretName: alertmanager-managed-ocs-alertmanager-tls-assets-0
SecretOptionalName: <nil>
alertmanager-managed-ocs-alertmanager-db:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-p7g6k:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 179m default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
Warning FailedScheduling 179m (x3 over 179m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
Warning FailedScheduling 6m32s (x207 over 178m) default-scheduler 0/12 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.6 NooBaa Operator 4.10.6 mcg-operator.v4.10.5 Succeeded
ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded
ocs-osd-deployer.v2.0.7 OCS OSD Deployer 2.0.7 Installing
odf-csi-addons-operator.v4.10.5 CSI Addons 4.10.5 odf-csi-addons-operator.v4.10.4 Succeeded
odf-operator.v4.10.5 OpenShift Data Foundation 4.10.5 odf-operator.v4.10.4 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.422-151be96 Route Monitor Operator 0.1.422-151be96 route-monitor-operator.v0.1.420-b65f47e Succeeded
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.33 True False 3h21m Cluster version is 4.10.33
Must-gather (comment #2): http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-3-pr/jijoy-3-pr_20221003T095311/logs/testcases_1664813044/

Hi,
- When I checked last time, I observed from the must-gather that the nodes were being updated, so some of them were going into an unschedulable (cordoned) state and the kubelet was logging "1 node(s) were unschedulable".
- Once the upgrade is completed, the above error should go away.
- However, there was an issue with 4.10.33 (observation only, I didn't get hold of any Bugzilla) wherein node allocatable resources are quite a bit less than expected (expected: 3920m vs actual: 3500m).
- When I tested with 4.10.34 this morning, I didn't observe the above issue; it is most likely a regression on the OCP side.
- So I'd ask to create a new cluster and test the scaling/installation.
Thanks, Leela.

- Still awaiting a response on whether this was hit when re-deployed.

More info on this issue:
1. It's a known flow that AMIs will get updated even while the addon is being installed.
2. In that scenario one of the nodes will be in an Unschedulable state.
3. It's a legitimate issue only when all nodes are in a Ready state for a considerably long time and the addon is still stuck in an installing/failed state.
4. The OSDs can't run on another node because the underlying PVC is in another zone, so during upgrades or a single-node reboot it's expected that one of the OSDs will stay in a Pending state; in other words, the fault tolerance will be in play.
5. It was confirmed that during node upgrades the allocatable CPU will be reduced, and it shouldn't cause issues with our addon while that specific node is being upgraded.

We will schedule a deployment of a cluster with size 8 for next week.

- Bug is resolved; the dependent Jira issue is fixed from the OCM side.
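For the two conditions described above (worker pool still rolling out new AMIs, and lower-than-expected allocatable CPU), a couple of read-only checks can tell them apart; this is a suggested sketch, not output captured from the affected cluster.
# Is the worker MachineConfigPool still updating? UPDATING=True means cordons/drains are expected.
$ oc get machineconfigpool worker
# Per-worker schedulability and allocatable CPU, to spot the ~3500m vs ~3920m difference noted above:
$ oc get nodes -l node-role.kubernetes.io/worker= \
    -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable,CPU:.status.allocatable.cpu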
@Leela,
We are seeing this issue in new deployments with the QE addon.
Tested in version:
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.9 NooBaa Operator 4.10.9 mcg-operator.v4.10.8 Succeeded
observability-operator.v0.0.17 Observability Operator 0.0.17 observability-operator.v0.0.17-rc Succeeded
ocs-operator.v4.10.9 OpenShift Container Storage 4.10.9 ocs-operator.v4.10.8 Succeeded
ocs-osd-deployer.v2.0.11 OCS OSD Deployer 2.0.11 ocs-osd-deployer.v2.0.10 Succeeded
odf-csi-addons-operator.v4.10.9 CSI Addons 4.10.9 odf-csi-addons-operator.v4.10.8 Succeeded
odf-operator.v4.10.9 OpenShift Data Foundation 4.10.9 odf-operator.v4.10.8 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.451-3df1ed1 Route Monitor Operator 0.1.451-3df1ed1 route-monitor-operator.v0.1.450-6e98c37 Succeeded
Some pods are in a Pending state due to insufficient resources. This is the status after deployment; no other operations were done on the cluster.
$ oc get pods -o wide | grep -v 'Running\|Completed'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-crashcollector-ip-10-0-163-204.ec2.internal-58b9bjvt8 0/1 Pending 0 5h10m <none> <none> <none> <none>
rook-ceph-osd-2-987bd7c5-89hwh 0/2 Pending 0 5h13m <none> <none> <none> <none>
rook-ceph-osd-3-5784d8b7c8-pnvxj 0/2 Pending 0 5h13m <none> <none> <none> <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 43s (x399 over 5h24m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
The status of one worker node is 'SchedulingDisabled' (a follow-up check is sketched after the node listing).
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-210.ec2.internal Ready worker 5h28m v1.23.12+8a6bfe4
ip-10-0-131-98.ec2.internal Ready master 5h49m v1.23.12+8a6bfe4
ip-10-0-133-16.ec2.internal Ready infra,worker 5h29m v1.23.12+8a6bfe4
ip-10-0-134-41.ec2.internal Ready worker 5h44m v1.23.12+8a6bfe4
ip-10-0-148-246.ec2.internal Ready infra,worker 5h29m v1.23.12+8a6bfe4
ip-10-0-152-117.ec2.internal Ready worker 5h28m v1.23.12+8a6bfe4
ip-10-0-154-160.ec2.internal Ready master 5h50m v1.23.12+8a6bfe4
ip-10-0-157-189.ec2.internal Ready worker 5h44m v1.23.12+8a6bfe4
ip-10-0-162-36.ec2.internal Ready master 5h49m v1.23.12+8a6bfe4
ip-10-0-163-204.ec2.internal Ready,SchedulingDisabled worker 5h28m v1.23.12+8a6bfe4
ip-10-0-164-197.ec2.internal Ready infra,worker 5h29m v1.23.12+8a6bfe4
ip-10-0-174-227.ec2.internal Ready worker 5h40m v1.23.12+8a6bfe4
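A node-level check (suggested, assuming the cordon comes from the machine-config rollout): the MCO annotations on the cordoned worker show whether it is still mid-update; a state of "Working", or currentConfig differing from desiredConfig, means the SchedulingDisabled status should clear on its own once the update finishes.
$ oc get node ip-10-0-163-204.ec2.internal -o yaml | grep machineconfiguration.openshift.io
# machineconfiguration.openshift.io/currentConfig, desiredConfig and state are the fields of interest.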
logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-bz37-pr/jijoy-bz37-pr_20221223T042745/logs/testcases_1671792496/
> 5h13m
- Pending for a long time; this is quite a tricky scenario.
- I can think of two scenarios why this is happening: either the drain of a new node couldn't proceed due to a PDB, or the ongoing drain hit some issue.
- Either way, we can proceed only after looking at a live cluster.
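For the PDB scenario, a couple of checks that can be run against the live cluster (suggested sketch; PDB names vary per deployment):
# PodDisruptionBudgets in the storage namespace; "ALLOWED DISRUPTIONS" of 0 will block a node drain:
$ oc get pdb -n openshift-storage
$ oc describe pdb -n openshift-storage | grep -E 'Name:|Allowed disruptions:'
# Which node is currently cordoned for the drain:
$ oc get nodes | grep SchedulingDisabled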
Hi Leela,
We faced this issue with a size 20 cluster as well.
$ oc get pods | egrep -v '(Running|Completed)'
NAME READY STATUS RESTARTS AGE
alertmanager-managed-ocs-alertmanager-0 0/2 Pending 0 110m
ocs-metrics-exporter-5dd96c885b-lf46k 0/1 Pending 0 110m
rook-ceph-crashcollector-ip-10-0-142-35.ec2.internal-7d947nsp5c 0/1 Pending 0 110m
rook-ceph-crashcollector-ip-10-0-154-93.ec2.internal-56688gknf4 0/1 Pending 0 111m
rook-ceph-crashcollector-ip-10-0-160-72.ec2.internal-7f454ffxjm 0/1 Pending 0 109m
rook-ceph-osd-14-66c75f68dc-sxs7z 0/2 Pending 0 111m
rook-ceph-osd-6-55d85cccf8-kddcx 0/2 Pending 0 111m
rook-ceph-osd-7-74d599f4b6-pcx7j 0/2 Pending 0 111m
rook-ceph-osd-8-78bd587c97-mgz54 0/2 Pending 0 111m
rook-ceph-osd-9-5b68fc68b4-2p99f 0/2 Pending 0 111m
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-44.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-132-177.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-133-84.ec2.internal Ready infra,worker 121m v1.23.12+8a6bfe4
ip-10-0-136-188.ec2.internal Ready master 140m v1.23.12+8a6bfe4
ip-10-0-142-35.ec2.internal Ready worker 132m v1.23.12+8a6bfe4
ip-10-0-143-114.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-147-121.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-151-231.ec2.internal Ready master 140m v1.23.12+8a6bfe4
ip-10-0-153-87.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-154-93.ec2.internal Ready worker 131m v1.23.12+8a6bfe4
ip-10-0-155-208.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-157-56.ec2.internal Ready infra,worker 121m v1.23.12+8a6bfe4
ip-10-0-160-174.ec2.internal Ready infra,worker 121m v1.23.12+8a6bfe4
ip-10-0-160-72.ec2.internal Ready,SchedulingDisabled worker 120m v1.23.12+8a6bfe4
ip-10-0-161-135.ec2.internal Ready worker 134m v1.23.12+8a6bfe4
ip-10-0-162-68.ec2.internal Ready master 140m v1.23.12+8a6bfe4
ip-10-0-164-82.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
ip-10-0-168-82.ec2.internal Ready worker 120m v1.23.12+8a6bfe4
Events from one of the pods (rook-ceph-osd-14-66c75f68dc-sxs7z):
Warning FailedScheduling 25m (x112 over 114m) default-scheduler 0/18 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 9 node(s) didn't match Pod's node affinity/selector.
$ rosa list addons -c jijoy-size20-pr | grep ocs-provider-qe
ocs-provider-qe Red Hat OpenShift Data Foundation Managed Service Provider (QE) installing
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.9 NooBaa Operator 4.10.9 mcg-operator.v4.10.8 Succeeded
observability-operator.v0.0.17 Observability Operator 0.0.17 observability-operator.v0.0.17-rc Succeeded
ocs-operator.v4.10.9 OpenShift Container Storage 4.10.9 ocs-operator.v4.10.8 Installing
ocs-osd-deployer.v2.0.11 OCS OSD Deployer 2.0.11 ocs-osd-deployer.v2.0.10 Installing
odf-csi-addons-operator.v4.10.9 CSI Addons 4.10.9 odf-csi-addons-operator.v4.10.8 Succeeded
odf-operator.v4.10.9 OpenShift Data Foundation 4.10.9 odf-operator.v4.10.8 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.451-3df1ed1 Route Monitor Operator 0.1.451-3df1ed1 route-monitor-operator.v0.1.450-6e98c37 Succeeded
[Note: ocs-operator.v4.10.9 also showed a Failed state at times, and ocs-osd-deployer.v2.0.11 also showed a Pending state.]
managedocs status:
status:
  components:
    alertmanager:
      state: Pending
    prometheus:
      state: Ready
    storageCluster:
      state: Ready
OCS and OCP must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-size20-pr/jijoy-size20-pr_20221229T154719/logs/failed_testcase_ocs_logs_1672329486/deployment_ocs_logs/
Hi Leela,
As suggested, I have opened a new bug #2156988 for the installation issue with size 20 (comment #20).

*** Bug 2156988 has been marked as a duplicate of this bug. ***

Try the test on the latest build.

Deployment and tier1 tests were executed.
Tested in version:
OCP 4.10.50
ODF 4.10.9-7
Test results - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy8-14-c1/jijoy8-14-c1_20230214T072244/multicluster/logs/test_report_1676369952.html

Verified in deployer version 2.0.11.

Closing this bug as fixed in v2.0.11 and tested by QE.
Description of problem:
A cluster with the dev addon that contains the topology changes related to ODFMS-55 cannot finish the ODF addon installation and is stuck in the Installing state. The size parameter was set to 8.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.7

How reproducible:
1/1

Steps to Reproduce:
1. Install the provider:
rosa create service --type ocs-provider-dev --name fbalak-pr --machine-cidr 10.0.0.0/16 --size 8 --onboarding-validation-key <key> --subnet-ids <subnet-ids> --region us-east-1
2. Wait until the installation finishes.

Actual results:
The installation doesn't finish after 3 hours. There are events on the ocs-operator pod that indicate insufficient memory:
0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate, 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Expected results:
The cluster finishes installation.

Additional info: