Description of problem:
User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope; clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found.

How reproducible:
Always

Steps to Reproduce:
1. Deploy NPD on OCP 3.10 with the inventory settings below:

$ cat qe-inventory-host-file
...
openshift_node_problem_detector_state=present
openshift_node_problem_detector_image_prefix=***/openshift3/ose-
openshift_node_problem_detector_image_version=v3.10.0
...

$ ansible-playbook -v -i qe-inventory-host-file ~/openshift-ansible/playbooks/openshift-node-problem-detector/config.yml

2. Check the pod logs.

Actual results:

# oc get pod -n openshift-infra
NAME                          READY     STATUS    RESTARTS   AGE
node-problem-detector-5pjlm   1/1       Running   0          27m
node-problem-detector-dllj5   1/1       Running   0          27m
node-problem-detector-n6ksn   1/1       Running   0          27m

# oc logs node-problem-detector-5pjlm
...
E0408 05:19:03.342379 1 manager.go:160] failed to update node conditions: nodes "qe-dma310-node-registry-router-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-dma310-node-registry-router-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
...
Expected results:
No errors in pod logs

Additional info:

# oc version
oc v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-dma310-master-etcd-1:8443
openshift v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657

[root@qe-dma310-master-etcd-1 ~]# oc get sa node-problem-detector
NAME                    SECRETS   AGE
node-problem-detector   2         29m

[root@qe-dma310-master-etcd-1 ~]# oc describe sa node-problem-detector
Name:                node-problem-detector
Namespace:           openshift-infra
Labels:              <none>
Annotations:         <none>
Image pull secrets:  node-problem-detector-dockercfg-nfvwx
Mountable secrets:   node-problem-detector-dockercfg-nfvwx
                     node-problem-detector-token-plw2x
Tokens:              node-problem-detector-token-5qhv2
                     node-problem-detector-token-plw2x
Events:              <none>
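The forbidden error names the exact RBAC object that is missing. A quick way to confirm this on an affected cluster (diagnostic sketch only; requires a logged-in client with cluster-admin rights, so it cannot be run outside the cluster):

```shell
# The ClusterRole named in the error message; on an affected cluster
# this is expected to fail with "NotFound".
oc get clusterrole node-problem-detector

# Any cluster role bindings whose name references NPD.
oc get clusterrolebindings | grep node-problem-detector
```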
Upstream PR with the fix: https://github.com/openshift/openshift-ansible/pull/7856
Merged upstream
Still fails on the env below. Could you help to double check?

[root@qe-weinliu-master-etcd-1 ~]# oc version
oc v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-weinliu-master-etcd-1:8443
openshift v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8

[root@qe-weinliu-master-etcd-1 ~]# uname -a
Linux qe-weinliu-master-etcd-1 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@qe-weinliu-master-etcd-1 ~]# oc logs node-problem-detector-4bhqn
I0514 04:28:10.705502 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0514 04:28:10.705864 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0514 04:28:10.706291 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0514 04:28:10.706333 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0514 04:28:10.708389 1 log_monitor.go:72] Start log monitor
I0514 04:28:10.712128 1 log_watcher.go:69] Start watching journald
I0514 04:28:10.712166 1 log_monitor.go:72] Start log monitor
I0514 04:28:10.712292 1 log_watcher.go:69] Start watching journald
I0514 04:28:10.712309 1 problem_detector.go:73] Problem detector started
I0514 04:28:10.712658 1 log_monitor.go:163] Initialize condition generated: []
I0514 04:28:10.712715 1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-05-14 04:28:10.712708442 -0400 EDT m=+0.025078977 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
E0514 04:28:11.724221 1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:28:22.711113 1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
... (the same error repeats roughly every 10 seconds)
What openshift-ansible rpm did you use? Can you list the cluster roles and search for node-problem-detector?

This should already be fixed. I tested the deployment with https://github.com/openshift/openshift-ansible/pull/7856.
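For reference, the checks asked for above can be run on the affected host as follows (sketch; assumes a logged-in admin client, so it can only be run against the cluster itself):

```shell
# Which openshift-ansible rpm performed the deployment
rpm -q openshift-ansible

# Search the cluster roles for the NPD role
oc get clusterroles | grep -i node-problem-detector
```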
(In reply to Jan Chaloupka from comment #4)
> What openshift-ansible rpm did you use? Can you list the cluster roles and
> search for node-problem-detector?
>
> This should be already fixed. I tested the deployment with
> https://github.com/openshift/openshift-ansible/pull/7856.

OPENSHIFT_ANSIBLE_URL git:openshift/openshift-ansible

I worked around the issue by creating a ClusterRoleBinding:

# oc create -f crb.yaml

# cat crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: openshift-infra
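Once such a binding exists, the permission can be verified directly instead of waiting for the next patch attempt to show up in the pod logs. A sketch, assuming a reasonably recent oc client (`oc auth can-i` may not be present on older 3.x clients; `oc policy who-can` is the older OpenShift-native equivalent):

```shell
# Impersonate the NPD service account and ask the authorizer directly;
# expected answer after the workaround is "yes".
oc auth can-i patch nodes/status \
  --as=system:serviceaccount:openshift-infra:node-problem-detector

# Alternative on older clients: list subjects allowed to patch node status
oc policy who-can patch nodes/status
```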
Bug verified to be fixed.

[root@ip-172-18-14-36 ~]# oc version
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip-172-18-14-36.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8

[root@ip-172-18-14-36 ~]# uname -a
Linux ip-172-18-14-36.ec2.internal 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@ip-172-18-14-36 ~]# oc logs node-problem-detector-vc7t9
I0515 23:03:24.960521 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0515 23:03:24.960737 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0515 23:03:24.960862 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0515 23:03:24.960895 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0515 23:03:24.961544 1 log_monitor.go:72] Start log monitor
I0515 23:03:24.963479 1 log_watcher.go:69] Start watching journald
I0515 23:03:24.963512 1 log_monitor.go:72] Start log monitor
I0515 23:03:24.963700 1 log_watcher.go:69] Start watching journald
I0515 23:03:24.963717 1 problem_detector.go:73] Problem detector started
I0515 23:03:24.963755 1 log_monitor.go:163] Initialize condition generated: []
I0515 23:03:24.963739 1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-05-15 23:03:24.963714476 -0400 EDT m=+0.049238160 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0515 23:10:29.853322 1 log_monitor.go:114] New status generated: &{Source:kernel-monitor Events:[{Severity:warn Timestamp:2018-05-15 23:10:29.85287 -0400 EDT Reason:TaskHung Message:task docker:1234 blocked for more than 700 seconds.}] Conditions:[{Type:KernelDeadlock Status:false Transition:2018-05-15 23:03:24.963714476 -0400 EDT m=+0.049238160 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]}
I0515 23:10:29.853491 1 log_monitor.go:114] New status generated: &{Source:kernel-monitor Events:[] Conditions:[{Type:KernelDeadlock Status:true Transition:2018-05-15 23:10:29.85287 -0400 EDT Reason:DockerHung Message:task docker:1234 blocked for more than 700 seconds.}]}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816