Bug 2162514

Summary: test_node_maintenance_restart_activate[worker] is failing on IBM Power
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aaruni Aggarwal <aaaggarw>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Neha Berry <nberry>
Severity: unspecified
Priority: unspecified
Version: 4.12
CC: ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Flags: tnielsen: needinfo? (aaaggarw)
Hardware: ppc64le
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-02-06 23:57:44 UTC
Type: Bug
Regression: ---
Embargoed:

Attachments: logfile for the testcase

Description Aaruni Aggarwal 2023-01-19 18:47:30 UTC
Created attachment 1939215 [details]
logfile for the testcase

Description of problem (please be as detailed as possible and provide log snippets):

test_node_maintenance_restart_activate[worker] is failing on IBM Power 

Version of all relevant components (if applicable):

ODF - 4.12
OCP - 4.12

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Drain one of the worker nodes.
2. Restart the worker node (a rough manual equivalent of the flow is sketched below).
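
The test itself drives this through ocs-ci's node helpers; the sketch below only illustrates the flow with plain oc commands. It assumes cluster-admin access and that the host can be rebooted from a debug pod, which is an assumption for illustration and not necessarily how the test restarts the node:

    # Minimal sketch only; not taken from the failing test.
    import subprocess
    import time

    NODE = "syd05-worker-0.rdr-abhi.ibm.com"  # node name taken from this report

    def oc(*args):
        return subprocess.run(
            ["oc", *args], check=True, capture_output=True, text=True
        ).stdout

    # 1. Drain the worker node (DaemonSet pods have to be ignored, see comment 3).
    oc("adm", "drain", NODE, "--ignore-daemonsets", "--delete-emptydir-data", "--force")

    # 2. Restart the node, here by rebooting the host from a debug pod
    #    (one common approach; assumption, not what ocs-ci necessarily does).
    subprocess.run(["oc", "debug", f"node/{NODE}", "--",
                    "chroot", "/host", "systemctl", "reboot"])

    # 3. Uncordon the node and wait up to 900 s (the test's timeout) for Ready.
    oc("adm", "uncordon", NODE)
    for _ in range(180):
        ready = oc("get", "node", NODE, "-o",
                   'jsonpath={.status.conditions[?(@.type=="Ready")].status}')
        if ready.strip() == "True":
            break
        time.sleep(5)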


Actual results:

The test fails: wait_for_nodes_status() times out after 900 s waiting for node
syd05-worker-0.rdr-abhi.ibm.com to reach status 'NotReady' and raises
ocs_ci.ocs.exceptions.ResourceWrongStatusException (see comment 2).

Expected results:

The test passes: the drained and restarted worker node moves through the
expected states and returns to Ready/schedulable within the timeout.

Additional info:

Comment 2 Aaruni Aggarwal 2023-01-19 18:52:02 UTC
node_names = ['syd05-worker-0.rdr-abhi.ibm.com'], status = 'NotReady'
timeout = 900

    def wait_for_nodes_status(node_names=None, status=constants.NODE_READY, timeout=180):
        """
        Wait until all nodes are in the given status
    
        Args:
            node_names (list): The node names to wait for to reach the desired state
                If None, will wait for all cluster nodes
            status (str): The node status to wait for
                (e.g. 'Ready', 'NotReady', 'SchedulingDisabled')
            timeout (int): The number in seconds to wait for the nodes to reach
                the status
    
        Raises:
            ResourceWrongStatusException: In case one or more nodes haven't
                reached the desired state
    
        """
        try:
            if not node_names:
                for sample in TimeoutSampler(60, 3, get_node_objs):
                    if sample:
                        node_names = [node.name for node in sample]
                        break
            nodes_not_in_state = copy.deepcopy(node_names)
            log.info(f"Waiting for nodes {node_names} to reach status {status}")
            for sample in TimeoutSampler(timeout, 3, get_node_objs, nodes_not_in_state):
                for node in sample:
                    if node.ocp.get_resource_status(node.name) == status:
                        log.info(f"Node {node.name} reached status {status}")
                        nodes_not_in_state.remove(node.name)
                if not nodes_not_in_state:
                    break
            log.info(f"The following nodes reached status {status}: {node_names}")
        except TimeoutExpiredError:
            log.error(
                f"The following nodes haven't reached status {status}: "
                f"{nodes_not_in_state}"
            )
            error_message = (
                f"{node_names}, {[n.describe() for n in get_node_objs(node_names)]}"
            )
>           raise exceptions.ResourceWrongStatusException(error_message)
E           ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource ['syd05-worker-0.rdr-abhi.ibm.com'], ['Name:               syd05-worker-0.rdr-abhi.ibm.com\nRoles:              worker\nLabels:             beta.kubernetes.io/arch=ppc64le\n                    beta.kubernetes.io/os=linux\n                    cluster.ocs.openshift.io/openshift-storage=\n                    kubernetes.io/arch=ppc64le\n                    kubernetes.io/hostname=syd05-worker-0.rdr-abhi.ibm.com\n                    kubernetes.io/os=linux\n                    node-role.kubernetes.io/worker=\n                    node.kubernetes.io/instance-type=e980\n                    node.openshift.io/os_id=rhcos\n                    topology.kubernetes.io/region=syd\n                    topology.kubernetes.io/zone=syd05\nAnnotations:        csi.volume.kubernetes.io/nodeid:\n                      {"openshift-storage.cephfs.csi.ceph.com":"syd05-worker-0.rdr-abhi.ibm.com","openshift-storage.rbd.csi.ceph.com":"syd05-worker-0.rdr-abhi.i...\n                    k8s.ovn.org/host-addresses: ["192.168.0.167"]\n                    k8s.ovn.org/l3-gateway-config:\n                      {"default":{"mode":"shared","interface-id":"br-ex_syd05-worker-0.rdr-abhi.ibm.com","mac-address":"fa:e2:90:e8:6a:20","ip-addresses":["192....\n                    k8s.ovn.org/node-chassis-id: 61a3f05a-ae79-4517-8cd8-98984ca7fea6\n                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}\n                    k8s.ovn.org/node-mgmt-port-mac-address: ba:07:14:3f:10:9b\n                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.0.167/24"}\n                    k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}\n                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable\n                    machineconfiguration.openshift.io/currentConfig: rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8cf89dd62e71532acb6311e39bcf5d86\n                    machineconfiguration.openshift.io/reason: \n                    machineconfiguration.openshift.io/state: Done\n                    volumes.kubernetes.io/controller-managed-attach-detach: true\nCreationTimestamp:  Fri, 13 Jan 2023 00:29:38 -0500\nTaints:             node.kubernetes.io/unreachable:NoExecute\n                    node.kubernetes.io/unreachable:NoSchedule\n                    node.kubernetes.io/unschedulable:NoSchedule\nUnschedulable:      true\nLease:\n  HolderIdentity:  syd05-worker-0.rdr-abhi.ibm.com\n  AcquireTime:     <unset>\n  RenewTime:       Thu, 19 Jan 2023 08:17:53 -0500\nConditions:\n  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message\n  ----             ------    -----------------                 ------------------                ------              -------\n  MemoryPressure   Unknown   Thu, 19 Jan 2023 08:13:38 -0500   Thu, 19 Jan 2023 08:18:36 -0500   NodeStatusUnknown   Kubelet stopped posting node status.\n  DiskPressure     Unknown   Thu, 19 Jan 2023 08:13:38 -0500   Thu, 19 Jan 2023 08:18:36 -0500   NodeStatusUnknown   Kubelet stopped posting node status.\n  PIDPressure      Unknown   Thu, 19 Jan 2023 
08:13:38 -0500   Thu, 19 Jan 2023 08:18:36 -0500   NodeStatusUnknown   Kubelet stopped posting node status.\n  Ready            Unknown   Thu, 19 Jan 2023 08:13:38 -0500   Thu, 19 Jan 2023 08:18:36 -0500   NodeStatusUnknown   Kubelet stopped posting node status.\nAddresses:\n  InternalIP:  192.168.0.167\n  Hostname:    syd05-worker-0.rdr-abhi.ibm.com\nCapacity:\n  cpu:                16\n  ephemeral-storage:  125419500Ki\n  hugepages-16Gi:     0\n  hugepages-16Mi:     0\n  memory:             66888512Ki\n  pods:               250\nAllocatable:\n  cpu:                15500m\n  ephemeral-storage:  114512869185\n  hugepages-16Gi:     0\n  hugepages-16Mi:     0\n  memory:             65737536Ki\n  pods:               250\nSystem Info:\n  Machine ID:                             461bc1445bed42d9a39469f2db5d7b28\n  System UUID:                            IBM,02212F40W\n  Boot ID:                                8faba2f4-52e6-4ad5-a183-a58d8e24e017\n  Kernel Version:                         4.18.0-372.39.1.el8_6.ppc64le\n  OS Image:                               Red Hat Enterprise Linux CoreOS 412.86.202212170457-0 (Ootpa)\n  Operating System:                       linux\n  Architecture:                           ppc64le\n  Container Runtime Version:              cri-o://1.25.1-5.rhaos4.12.git6005903.el8\n  Kubelet Version:                        v1.25.4+77bec7a\n  Kube-Proxy Version:                     v1.25.4+77bec7a\nNon-terminated Pods:                      (17 in total)\n  Namespace                               Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age\n  ---------                               ----                                   ------------  ----------  ---------------  -------------  ---\n  openshift-cluster-node-tuning-operator  tuned-bbnfx                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         6d7h\n  openshift-dns                           dns-default-kt7qn                      60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         6d8h\n  openshift-dns                           node-resolver-qvphc                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         6d8h\n  openshift-image-registry                node-ca-zhc5m                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         6d8h\n  openshift-ingress-canary                ingress-canary-9k7mk                   10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         6d8h\n  openshift-local-storage                 diskmaker-discovery-gwhs4              20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         3d8h\n  openshift-local-storage                 diskmaker-manager-pn69p                20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         3d8h\n  openshift-machine-config-operator       machine-config-daemon-fzl9w            40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         6d8h\n  openshift-monitoring                    node-exporter-pgn6d                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         6d8h\n  openshift-multus                        multus-additional-cni-plugins-2vrhq    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         6d8h\n  openshift-multus                        multus-dtldv                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         6d8h\n  openshift-multus                        network-metrics-daemon-tznbz           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         6d8h\n  openshift-network-diagnostics   
        network-check-target-4xjbn             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         6d8h\n  openshift-ovn-kubernetes                ovnkube-node-pclxz                     50m (0%)      0 (0%)      660Mi (1%)       0 (0%)         6d7h\n  openshift-storage                       csi-cephfsplugin-ljvpj                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d7h\n  openshift-storage                       csi-rbdplugin-s9fdz                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d7h\n  powervm-rmc                             powervm-rmc-4tcb6                      100m (0%)     0 (0%)      500Mi (0%)       1Gi (1%)       6d7h\nAllocated resources:\n  (Total limits may be over 100 percent, i.e., overcommitted.)\n  Resource           Requests     Limits\n  --------           --------     ------\n  cpu                384m (2%)    0 (0%)\n  memory             1868Mi (2%)  1Gi (1%)\n  ephemeral-storage  0 (0%)       0 (0%)\n  hugepages-16Gi     0 (0%)       0 (0%)\n  hugepages-16Mi     0 (0%)       0 (0%)\nEvents:\n  Type    Reason                   Age                From             Message\n  ----    ------                   ----               ----             -------\n  Normal  Starting                 33m                kubelet          Starting kubelet.\n  Normal  NodeHasSufficientMemory  33m (x2 over 33m)  kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasSufficientMemory\n  Normal  NodeHasNoDiskPressure    33m (x2 over 33m)  kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasNoDiskPressure\n  Normal  NodeHasSufficientPID     33m (x2 over 33m)  kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeHasSufficientPID\n  Normal  NodeNotReady             33m                kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotReady\n  Normal  NodeAllocatableEnforced  33m                kubelet          Updated Node Allocatable limit across pods\n  Normal  NodeReady                33m                kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeReady\n  Normal  NodeSchedulable          33m                kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeSchedulable\n  Normal  NodeNotSchedulable       15m (x2 over 33m)  kubelet          Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotSchedulable\n  Normal  NodeNotReady             14m (x2 over 19h)  node-controller  Node syd05-worker-0.rdr-abhi.ibm.com status is now: NodeNotReady\n']

ocs_ci/ocs/node.py:162: ResourceWrongStatusException
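
For context, the failing call can be reconstructed from the parameter values shown at the top of the traceback (node_names=['syd05-worker-0.rdr-abhi.ibm.com'], status='NotReady', timeout=900); the exact call site in the test is not part of this log, so the snippet below is illustrative only:

    from ocs_ci.ocs.node import wait_for_nodes_status

    # Values taken from the traceback header above; the call site itself is
    # not shown in this log, so treat this as a reconstruction.
    wait_for_nodes_status(
        node_names=["syd05-worker-0.rdr-abhi.ibm.com"],
        status="NotReady",  # the drained/restarted node is expected to go NotReady first
        timeout=900,
    )

The TimeoutExpiredError path above means get_resource_status() never returned the exact string 'NotReady' for this node within the 900 s window. The describe() output embedded in the exception shows the node unschedulable, carrying node.kubernetes.io/unreachable taints, with all conditions at Unknown because the kubelet stopped posting node status.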

Comment 3 Aaruni Aggarwal 2023-01-19 18:53:43 UTC
Tried draining the worker node manually as well: 

[root@rdr-abhi-syd05-bastion-0 ~]# oc adm drain syd05-worker-0.rdr-abhi.ibm.com
node/syd05-worker-0.rdr-abhi.ibm.com already cordoned
error: unable to drain node "syd05-worker-0.rdr-abhi.ibm.com" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-node-tuning-operator/tuned-bbnfx, openshift-dns/dns-default-kt7qn, openshift-dns/node-resolver-qvphc, openshift-image-registry/node-ca-zhc5m, openshift-ingress-canary/ingress-canary-9k7mk, openshift-local-storage/diskmaker-discovery-gwhs4, openshift-local-storage/diskmaker-manager-pn69p, openshift-machine-config-operator/machine-config-daemon-fzl9w, openshift-monitoring/node-exporter-pgn6d, openshift-multus/multus-additional-cni-plugins-2vrhq, openshift-multus/multus-dtldv, openshift-multus/network-metrics-daemon-tznbz, openshift-network-diagnostics/network-check-target-4xjbn, openshift-ovn-kubernetes/ovnkube-node-pclxz, openshift-storage/csi-cephfsplugin-ljvpj, openshift-storage/csi-rbdplugin-s9fdz, powervm-rmc/powervm-rmc-4tcb6, continuing command...
There are pending nodes to be drained:
 syd05-worker-0.rdr-abhi.ibm.com
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-node-tuning-operator/tuned-bbnfx, openshift-dns/dns-default-kt7qn, openshift-dns/node-resolver-qvphc, openshift-image-registry/node-ca-zhc5m, openshift-ingress-canary/ingress-canary-9k7mk, openshift-local-storage/diskmaker-discovery-gwhs4, openshift-local-storage/diskmaker-manager-pn69p, openshift-machine-config-operator/machine-config-daemon-fzl9w, openshift-monitoring/node-exporter-pgn6d, openshift-multus/multus-additional-cni-plugins-2vrhq, openshift-multus/multus-dtldv, openshift-multus/network-metrics-daemon-tznbz, openshift-network-diagnostics/network-check-target-4xjbn, openshift-ovn-kubernetes/ovnkube-node-pclxz, openshift-storage/csi-cephfsplugin-ljvpj, openshift-storage/csi-rbdplugin-s9fdz, powervm-rmc/powervm-rmc-4tcb6
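
For reference, the error above is the standard oc/kubectl drain behaviour whenever DaemonSet-managed pods are present on a node; it is not specific to ODF or to ppc64le. A drain that skips those pods would look roughly like this (flags assumed from standard oc adm drain usage, not taken from the failing test):

    import subprocess

    # Usual flags for draining a node that hosts DaemonSet pods; verify them
    # against the oc version in use.
    subprocess.run(
        [
            "oc", "adm", "drain", "syd05-worker-0.rdr-abhi.ibm.com",
            "--ignore-daemonsets",     # skip DaemonSet-managed pods
            "--delete-emptydir-data",  # allow eviction of pods using emptyDir
            "--force",                 # evict pods not managed by a controller
        ],
        check=True,
    )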

Comment 4 Travis Nielsen 2023-01-19 21:30:38 UTC
Moving out of 4.12 as it's not a blocker.

Aaruni, is there someone from Red Hat QE you are working with on running the tests? 
It would be best to confirm with them how to track the test issues.
I don't believe QE tracks test issues in Bugzilla; BZs are usually opened
for product issues.

Comment 5 Aaruni Aggarwal 2023-01-20 08:17:04 UTC
Travis, we actually opened the issue in ocs-ci: https://github.com/red-hat-storage/ocs-ci/issues/6735, but Elad asked us to create a BZ.

Comment 6 Travis Nielsen 2023-01-20 15:27:53 UTC
(In reply to Aaruni Aggarwal from comment #5)
> Travis, we actually opened the issue in ocs-ci:
> https://github.com/red-hat-storage/ocs-ci/issues/6735, but Elad asked us to
> create a BZ.

OK, thanks for that background. Could you also provide an analysis of the issue and how engineering can help investigate? We really don't know the background on the downstream CI, so it is difficult to troubleshoot without an analysis of why the test failed.

Comment 9 Travis Nielsen 2023-02-06 23:57:44 UTC
Please reopen if you have more details for engineering to investigate.