Bug 2162514
| Summary: | test_node_maintenance_restart_activate[worker] is failing on IBM Power |
|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component: | rook |
| Version: | 4.12 |
| Hardware: | ppc64le |
| OS: | Linux |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | Aaruni Aggarwal <aaaggarw> |
| Assignee: | Travis Nielsen <tnielsen> |
| QA Contact: | Neha Berry <nberry> |
| CC: | ocs-bugs, odf-bz-bot |
| Flags: | tnielsen: needinfo? (aaaggarw) |
| Target Milestone: | --- |
| Target Release: | --- |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Regression: | --- |
| Last Closed: | 2023-02-06 23:57:44 UTC |
| Attachments: | logfile for the testcase (attachment 1939215) |
node_names = ['syd05-worker-0.rdr-abhi.ibm.com'], status = 'NotReady'
timeout = 900
def wait_for_nodes_status(node_names=None, status=constants.NODE_READY, timeout=180):
"""
Wait until all nodes are in the given status
Args:
node_names (list): The node names to wait for to reached the desired state
If None, will wait for all cluster nodes
status (str): The node status to wait for
(e.g. 'Ready', 'NotReady', 'SchedulingDisabled')
timeout (int): The number in seconds to wait for the nodes to reach
the status
Raises:
ResourceWrongStatusException: In case one or more nodes haven't
reached the desired state
"""
try:
if not node_names:
for sample in TimeoutSampler(60, 3, get_node_objs):
if sample:
node_names = [node.name for node in sample]
break
nodes_not_in_state = copy.deepcopy(node_names)
log.info(f"Waiting for nodes {node_names} to reach status {status}")
for sample in TimeoutSampler(timeout, 3, get_node_objs, nodes_not_in_state):
for node in sample:
if node.ocp.get_resource_status(node.name) == status:
log.info(f"Node {node.name} reached status {status}")
nodes_not_in_state.remove(node.name)
if not nodes_not_in_state:
break
log.info(f"The following nodes reached status {status}: {node_names}")
except TimeoutExpiredError:
log.error(
f"The following nodes haven't reached status {status}: "
f"{nodes_not_in_state}"
)
error_message = (
f"{node_names}, {[n.describe() for n in get_node_objs(node_names)]}"
)
> raise exceptions.ResourceWrongStatusException(error_message)
E ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource ['syd05-worker-0.rdr-abhi.ibm.com'], ['Name: syd05-worker-0.rdr-abhi.ibm.com\nRoles: worker\nLabels: beta.kubernetes.io/arch=ppc64le\n beta.kubernetes.io/os=linux\n cluster.ocs.openshift.io/openshift-storage=\n kubernetes.io/arch=ppc64le\n kubernetes.io/hostname=syd05-worker-0.rdr-abhi.ibm.com\n kubernetes.io/os=linux\n node-role.kubernetes.io/worker=\n node.kubernetes.io/instance-type=e980\n node.openshift.io/os_id=rhcos\n topology.kubernetes.io/region=syd\n topology.kubernetes.io/zone=syd05\nAnnotations: csi.volume.kubernetes.io/nodeid:\n {"openshift-storage.cephfs.csi.ceph.com":"syd05-worker-0.rdr-abhi.ibm.com","openshift-storage.rbd.csi.ceph.com":"syd05-worker-0.rdr-abhi.i...\n k8s.ovn.org/host-addresses: ["192.168.0.167"]\n k8s.ovn.org/l3-gateway-config:\n {"default":{"mode":"shared","interface-id":"br-ex_syd05-worker-0.rdr-abhi.ibm.com","mac-address":"fa:e2:90:e8:6a:20","ip-addresses":["192....\n k8s.ovn.org/node-chassis-id: 61a3f05a-ae79-4517-8cd8-98984ca7fea6\n k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}\n k8s.ovn.org/node-mgmt-port-mac-address: ba:07:14:3f:10:9b\n k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.0.167/24"}\n k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}\n machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable\n machineconfiguration.openshift.io/currentConfig: rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n machineconfiguration.openshift.io/desiredConfig: rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n machineconfiguration.openshift.io/reason: \n machineconfiguration.openshift.io/state: Done\n volumes.kubernetes.io/controller-managed-attach-detach: true\nCreationTimestamp: Fri, 13 Jan 2023 00:29:38 -0500\nTaints: node.kubernetes.io/unreachable:NoExecute\n node.kubernetes.io/unreachable:NoSchedule\n node.kubernetes.io/unschedulable:NoSchedule\nUnschedulable: true\nLease:\n HolderIdentity: syd05-worker-0.rdr-abhi.ibm.com\n AcquireTime: <unset>\n RenewTime: Thu, 19 Jan 2023 08:17:53 -0500\nConditions:\n Type Status LastHeartbeatTime LastTransitionTime Reason Message\n ---- ------ ----------------- ------------------ ------ -------\n MemoryPressure Unknown Thu, 19 Jan 2023 08:13:38 -0500 Thu, 19 Jan 2023 08:18:36 -0500 NodeStatusUnknown Kubelet stopped posting node status.\n DiskPressure Unknown Thu, 19 Jan 2023 08:13:38 -0500 Thu, 19 Jan 2023 08:18:36 -0500 NodeStatusUnknown Kubelet stopped posting node status.\n PIDPressure Unknown Thu, 19 Jan 2023 08:13:38 -0500 Thu, 19 Jan 2023 08:18:36 -0500 NodeStatusUnknown Kubelet stopped posting node status.\n Ready Unknown Thu, 19 Jan 2023 08:13:38 -0500 Thu, 19 Jan 2023 08:18:36 -0500 NodeStatusUnknown Kubelet stopped posting node status.\nAddresses:\n InternalIP: 192.168.0.167\n Hostname: syd05-worker-0.rdr-abhi.ibm.com\nCapacity:\n cpu: 16\n ephemeral-storage: 125419500Ki\n hugepages-16Gi: 0\n hugepages-16Mi: 0\n memory: 66888512Ki\n pods: 250\nAllocatable:\n cpu: 15500m\n ephemeral-storage: 114512869185\n hugepages-16Gi: 0\n hugepages-16Mi: 0\n memory: 65737536Ki\n pods: 250\nSystem Info:\n Machine ID: 461bc1445bed42d9a39469f2db5d7b28\n System UUID: IBM,02212F40W\n Boot ID: 8faba2f4-52e6-4ad5-a183-a58d8e24e017\n Kernel Version: 4.18.0-372.39.1.el8_6.ppc64le\n OS Image: Red Hat Enterprise 
Linux CoreOS 412.86.202212170457-0 (Ootpa)\n Operating System: linux\n Architecture: ppc64le\n Container Runtime Version: cri-o://1.25.1-5.rhaos4.12.git6005903.el8\n Kubelet Version: v1.25.4+77bec7a\n Kube-Proxy Version: v1.25.4+77bec7a\nNon-terminated Pods: (17 in total)\n Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age\n --------- ---- ------------ ---------- --------------- ------------- ---\n openshift-cluster-node-tuning-operator tuned-bbnfx 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 6d7h\n openshift-dns dns-default-kt7qn 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 6d8h\n openshift-dns node-resolver-qvphc 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 6d8h\n openshift-image-registry node-ca-zhc5m 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 6d8h\n openshift-ingress-canary ingress-canary-9k7mk 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 6d8h\n openshift-local-storage diskmaker-discovery-gwhs4 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 3d8h\n openshift-local-storage diskmaker-manager-pn69p 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 3d8h\n openshift-machine-config-operator machine-config-daemon-fzl9w 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 6d8h\n openshift-monitoring node-exporter-pgn6d 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 6d8h\n openshift-multus multus-additional-cni-plugins-2vrhq 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 6d8h\n openshift-multus multus-dtldv 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 6d8h\n openshift-multus network-metrics-daemon-tznbz 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 6d8h\n openshift-network-diagnostics network-check-target-4xjbn 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 6d8h\n openshift-ovn-kubernetes ovnkube-node-pclxz 50m (0%) 0 (0%) 660Mi (1%) 0 (0%) 6d7h\n openshift-storage csi-cephfsplugin-ljvpj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d7h\n openshift-storage csi-rbdplugin-s9fdz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d7h\n powervm-rmc powervm-rmc-4tcb6 100m (0%) 0 (0%) 500Mi (0%) 1Gi (1%) 6d7h\nAllocated resources:\n (Total limits may be over 100 percent, i.e., overcommitted.)\n Resource Requests Limits\n -------- -------- ------\n cpu 384m (2%) 0 (0%)\n memory 1868Mi (2%) 1Gi (1%)\n ephemeral-storage 0 (0%) 0 (0%)\n hugepages-16Gi 0 (0%) 0 (0%)\n hugepages-16Mi 0 (0%) 0 (0%)\nEvents:\n Type Reason Age From Message\n ---- ------ ---- ---- -------\n Normal Starting 33m kubelet Starting kubelet.\n Normal NodeHasSufficientMemory 33m (x2 over 33m) kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasSufficientMemory\n Normal NodeHasNoDiskPressure 33m (x2 over 33m) kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasNoDiskPressure\n Normal NodeHasSufficientPID 33m (x2 over 33m) kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasSufficientPID\n Normal NodeNotReady 33m kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotReady\n Normal NodeAllocatableEnforced 33m kubelet Updated Node Allocatable limit across pods\n Normal NodeReady 33m kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeReady\n Normal NodeSchedulable 33m kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeSchedulable\n Normal NodeNotSchedulable 15m (x2 over 33m) kubelet Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotSchedulable\n Normal NodeNotReady 14m (x2 over 19h) node-controller Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotReady\n']
ocs_ci/ocs/node.py:162: ResourceWrongStatusException
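For context, the wait that times out in the traceback above can be approximated outside ocs-ci with a small polling loop against `oc`. This is a minimal sketch, not the ocs-ci implementation: the node name and 900s timeout are taken from the failure, while the function name and the jsonpath query on the Ready condition are assumptions made for illustration.

```python
import subprocess
import time


def wait_for_node_status(node_name, expected="True", timeout=900, interval=3):
    """Poll the node's Ready condition until it reports `expected`.

    expected: "True" when waiting for Ready, "False" or "Unknown" for NotReady.
    Raises TimeoutError if the node does not reach the state in time.
    """
    jsonpath = '{.status.conditions[?(@.type=="Ready")].status}'
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["oc", "get", "node", node_name, "-o", f"jsonpath={jsonpath}"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0 and result.stdout.strip() == expected:
            return result.stdout.strip()
        time.sleep(interval)
    raise TimeoutError(
        f"Node {node_name} did not report Ready={expected} within {timeout}s"
    )


# Hypothetical usage for the node in the failure above, whose Ready
# condition went to Unknown after the kubelet stopped posting status:
# wait_for_node_status("syd05-worker-0.rdr-abhi.ibm.com", expected="Unknown")
```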
Tried draining the worker node manually as well:

[root@rdr-abhi-syd05-bastion-0 ~]# oc adm drain syd05-worker-0.rdr-abhi.ibm.com
node/syd05-worker-0.rdr-abhi.ibm.com already cordoned
error: unable to drain node "syd05-worker-0.rdr-abhi.ibm.com" due to error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-node-tuning-operator/tuned-bbnfx, openshift-dns/dns-default-kt7qn, openshift-dns/node-resolver-qvphc, openshift-image-registry/node-ca-zhc5m, openshift-ingress-canary/ingress-canary-9k7mk, openshift-local-storage/diskmaker-discovery-gwhs4, openshift-local-storage/diskmaker-manager-pn69p, openshift-machine-config-operator/machine-config-daemon-fzl9w, openshift-monitoring/node-exporter-pgn6d, openshift-multus/multus-additional-cni-plugins-2vrhq, openshift-multus/multus-dtldv, openshift-multus/network-metrics-daemon-tznbz, openshift-network-diagnostics/network-check-target-4xjbn, openshift-ovn-kubernetes/ovnkube-node-pclxz, openshift-storage/csi-cephfsplugin-ljvpj, openshift-storage/csi-rbdplugin-s9fdz, powervm-rmc/powervm-rmc-4tcb6, continuing command...
There are pending nodes to be drained: syd05-worker-0.rdr-abhi.ibm.com
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-node-tuning-operator/tuned-bbnfx, openshift-dns/dns-default-kt7qn, openshift-dns/node-resolver-qvphc, openshift-image-registry/node-ca-zhc5m, openshift-ingress-canary/ingress-canary-9k7mk, openshift-local-storage/diskmaker-discovery-gwhs4, openshift-local-storage/diskmaker-manager-pn69p, openshift-machine-config-operator/machine-config-daemon-fzl9w, openshift-monitoring/node-exporter-pgn6d, openshift-multus/multus-additional-cni-plugins-2vrhq, openshift-multus/multus-dtldv, openshift-multus/network-metrics-daemon-tznbz, openshift-network-diagnostics/network-check-target-4xjbn, openshift-ovn-kubernetes/ovnkube-node-pclxz, openshift-storage/csi-cephfsplugin-ljvpj, openshift-storage/csi-rbdplugin-s9fdz, powervm-rmc/powervm-rmc-4tcb6

Moving out of 4.12 as it's not a blocker.

Aaruni, is there someone from Red Hat QE you are working with on running the tests? It would be best to confirm with them how to track the test issues. I don't believe QE tracks the test issues in Bugzilla; usually BZs are opened for product issues.

Travis, actually we opened the issue in ocs-ci: https://github.com/red-hat-storage/ocs-ci/issues/6735, but Elad asked us to create a BZ.

(In reply to Aaruni Aggarwal from comment #5)
> Travis, Actually we opened the issue in ocs-ci:
> https://github.com/red-hat-storage/ocs-ci/issues/6735, but Elad asked us to
> create a BZ.

Ok, thanks for that background. Could you also provide an analysis of the issue and how engineering can help investigate? We really don't know the background on the downstream CI, so it is difficult to troubleshoot without an analysis of why the test failed.

Please reopen if you have more details for engineering to investigate.
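The manual drain above fails only because `oc adm drain` refuses to evict DaemonSet-managed pods unless told to skip them, as the error message itself suggests. A minimal sketch of retrying the drain with that option; the helper name and the `--timeout` value are illustrative choices, not part of this report.

```python
import subprocess


def drain_node(node_name, timeout="300s"):
    """Cordon and drain a node, skipping DaemonSet-managed pods.

    --ignore-daemonsets is the option suggested by the drain error above;
    the --timeout value is an arbitrary choice for illustration.
    """
    subprocess.run(
        [
            "oc", "adm", "drain", node_name,
            "--ignore-daemonsets",
            f"--timeout={timeout}",
        ],
        check=True,
    )


# Hypothetical usage against the worker from this report:
# drain_node("syd05-worker-0.rdr-abhi.ibm.com")
```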
Created attachment 1939215 [details]: logfile for the testcase

Description of problem (please be detailed as possible and provide log snippets):
test_node_maintenance_restart_activate[worker] is failing on IBM Power.

Version of all relevant components (if applicable):
ODF: 4.12
OCP: 4.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1-5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Drain one of the worker nodes.
2. Restart the worker node.

Actual results:

Expected results:

Additional info:
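The reproduce steps amount to a maintenance cycle: drain the worker, restart it, then make it schedulable again and wait for it to recover. A rough end-to-end sketch of that cycle, reusing the hypothetical helpers from the earlier snippets; the restart itself happens out of band (for example from the PowerVM side) and is not shown.

```python
import subprocess

NODE = "syd05-worker-0.rdr-abhi.ibm.com"  # worker node from this report

# 1. Put the node into maintenance: cordon + drain, skipping DaemonSet pods.
drain_node(NODE)

# 2. Restart the worker out of band, then wait for it to come back Ready.
wait_for_node_status(NODE, expected="True", timeout=900)

# 3. Activate the node again so workloads can be scheduled onto it.
subprocess.run(["oc", "adm", "uncordon", NODE], check=True)
```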