Description of problem:
User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope, clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found

How reproducible:
Always

Steps to Reproduce:
1. Deploy NPD on OCP 3.10 with the inventory settings below:

$ cat qe-inventory-host-file
...
openshift_node_problem_detector_state=present
openshift_node_problem_detector_image_prefix=****/openshift3/ose-
openshift_node_problem_detector_image_version=v3.10.0
...

$ ansible-playbook -v -i qe-inventory-host-file ~/openshift-ansible/playbooks/openshift-node-problem-detector/config.yml

2. On the node, simulate kernel log messages as below:

echo "kernel: task docker:1234 blocked for more than 300 seconds." >> /dev/kmsg
echo "kernel: task docker:1234 blocked for more than 400 seconds." >> /dev/kmsg
echo "kernel: task docker:1234 blocked for more than 500 seconds." >> /dev/kmsg
echo "kernel: task docker:1234 blocked for more than 600 seconds." >> /dev/kmsg
echo "kernel: task docker:1234 blocked for more than 700 seconds." >> /dev/kmsg

3. Check the node conditions.

Actual results:
3. The node condition stays at "KernelDeadlock True Wed, 11 Apr 2018 01:32:28 -0400 Wed, 11 Apr 2018 01:14:18 -0400 DockerHung task docker:1234 blocked for more than 300 seconds." even after the later messages are injected:

# oc describe no qe-weinliu310-master-etcd-1
Name:               qe-weinliu310-master-etcd-1
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-highmem-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/hostname=qe-weinliu310-master-etcd-1
                    node-role.kubernetes.io/master=true
                    role=node
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 10 Apr 2018 21:35:42 -0400
Conditions:
  Type                Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----                ------  -----------------                 ------------------                ------                      -------
  KernelDeadlock      True    Wed, 11 Apr 2018 01:32:28 -0400   Wed, 11 Apr 2018 01:14:18 -0400   DockerHung                  task docker:1234 blocked for more than 300 seconds.
  NetworkUnavailable  False   Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Apr 2018 21:35:42 -0400   RouteCreated                openshift-sdn cleared kubelet-set NoRouteCreated
  OutOfDisk           False   Wed, 11 Apr 2018 01:32:59 -0400   Tue, 10 Apr 2018 21:35:42 -0400   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure      False   Wed, 11 Apr 2018 01:32:59 -0400   Tue, 10 Apr 2018 21:35:42 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure        False   Wed, 11 Apr 2018 01:32:59 -0400   Tue, 10 Apr 2018 21:35:42 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready               True    Wed, 11 Apr 2018 01:32:59 -0400   Tue, 10 Apr 2018 21:36:42 -0400   KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:  10.240.0.11
  ExternalIP:  35.193.44.247
  Hostname:    qe-weinliu310-master-etcd-1
Capacity:
  cpu:     4
  memory:  26587956Ki
  pods:    250
Allocatable:
  cpu:     4
  memory:  26485556Ki
  pods:    250
System Info:
  Machine ID:                 4fc628ddb2fb4859b28234ccec9c4399
  System UUID:                EB77B14E-C6B2-0F45-EB9E-CC7AFD909892
  Boot ID:                    6488c8c7-af91-4f6d-bf5a-d2410ff95a1c
  Kernel Version:             3.10.0-862.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.5 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://1.13.1
  Kubelet Version:            v1.9.1+a0ce1bc657
  Kube-Proxy Version:         v1.9.1+a0ce1bc657
ExternalID:                   7617027894977551420
Non-terminated Pods:          (8 in total)
  Namespace                          Name                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                          ----                          ------------  ----------  ---------------  -------------
  kube-service-catalog               apiserver-qvsht               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-service-catalog               controller-manager-7qxl8      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-infra                    node-problem-detector-x677l   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-node                     sync-ff6k7                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-sdn                      ovs-qc967                     100m (2%)     200m (5%)   200Mi (0%)       300Mi (1%)
  openshift-sdn                      sdn-5w4zp                     100m (2%)     0 (0%)      200Mi (0%)       0 (0%)
  openshift-template-service-broker  apiserver-8cqvn               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-web-console              webconsole-86df987b49-fmr6f   100m (2%)     0 (0%)      100Mi (0%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  300m (7%)     200m (5%)   500Mi (1%)       300Mi (1%)
Events:
  Type     Reason      Age  From                                         Message
  ----     ------      ---  ----                                         -------
  Warning  TaskHung    18m  kernel-monitor, qe-weinliu310-master-etcd-1  task docker:1234 blocked for more than 300 seconds.
  Warning  TaskHung    15m  kernel-monitor, qe-weinliu310-master-etcd-1  task docker:1234 blocked for more than 400 seconds.
  Warning  TaskHung    10m  kernel-monitor, qe-weinliu310-master-etcd-1  task docker:1234 blocked for more than 500 seconds.
  Warning  TaskHung    8m   kernel-monitor, qe-weinliu310-master-etcd-1  task docker:1234 blocked for more than 600 seconds.
  Warning  TaskHung    7m   kernel-monitor, qe-weinliu310-master-etcd-1  task docker:1234 blocked for more than 700 seconds.
  Warning  KernelOops  2m   kernel-monitor, qe-weinliu310-master-etcd-1  BUG: unable to handle kernel NULL pointer dereference at 12345

Expected results:
The node condition matches the latest simulated log message, "kernel: task docker:1234 blocked for more than 700 seconds."

Additional info:
# oc version
oc v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu310-master-etcd-1:8443
openshift v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
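As a side note, the problem condition can also be checked directly with a jsonpath query instead of scanning the full "oc describe" output. The command below is only an illustration (it was not part of the original verification) and reuses the node name from the reproduction above:

# oc get node qe-weinliu310-master-etcd-1 -o jsonpath='{range .status.conditions[?(@.type=="KernelDeadlock")]}{.reason}: {.message}{"\n"}{end}'

With the behavior reported here, this keeps printing the first match ("DockerHung: task docker:1234 blocked for more than 300 seconds.") even after the later messages are written to /dev/kmsg.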
@Joel, could you help confirm whether this is an issue or an invalid scenario?
Joel, can you respond to this?
I'm not certain, but this behavior may be intended. In other words, perhaps NPD only records the first reason and message when the condition transitions from false to true. I started looking through the code to verify this, but haven't found the relevant part yet. I'll be out of the office this week, but I'll try to spend some time on it anyway so we can decide whether this bug needs to be fixed. Thanks!
I finally found the code responsible, and it appears that this behavior is intentional. I believe the reason is to prevent noisy log entries from triggering many node status updates when the node condition is not changing in a material way. If you're interested, here is the code and the comment explaining it: https://github.com/kubernetes/node-problem-detector/blob/10dbfef1a88d9ea05fabb600d31609f9dbdf2045/pkg/systemlogmonitor/log_monitor.go#L138-L144 With this in mind, I'm closing this as NOTABUG. Thanks for your testing efforts.
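For anyone landing here later, the gist of the linked behavior can be sketched roughly as follows. This is a simplified illustration only, not the actual node-problem-detector source; the type and function names are made up, and the real logic lives at the link above.

package main

import "fmt"

// Condition is a minimal stand-in for a node problem condition as the
// log monitor tracks it (illustrative only, not the real NPD types).
type Condition struct {
	Type    string
	Status  bool
	Reason  string
	Message string
}

// updateCondition captures the reason and message only when the condition
// transitions from false to true. Later matching log lines keep the
// condition true but do not overwrite the original reason/message, which
// avoids patching the node object for every noisy log entry.
func updateCondition(cond *Condition, reason, message string) {
	if cond.Status {
		// Already true: leave reason/message alone so the node
		// status is not updated again for the same ongoing problem.
		return
	}
	cond.Status = true
	cond.Reason = reason
	cond.Message = message
}

func main() {
	cond := &Condition{Type: "KernelDeadlock"}
	updateCondition(cond, "DockerHung", "task docker:1234 blocked for more than 300 seconds.")
	updateCondition(cond, "DockerHung", "task docker:1234 blocked for more than 700 seconds.")
	// Prints the 300-second message: the later match did not replace it,
	// which matches the behavior observed in this report.
	fmt.Printf("%s %v %s: %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}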