Bug 1564852

Summary: User cannot patch nodes/status at the cluster scope for node-problem-detector pods
Product: OpenShift Container Platform
Component: Installer
Version: 3.10.0
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Weinan Liu <weinliu>
Assignee: Jan Chaloupka <jchaloup>
QA Contact: Weinan Liu <weinliu>
CC: aos-bugs, dma, jchaloup, jokerman, mmccomas
Clones: 1565980 (view as bug list)
Type: Bug
Last Closed: 2018-07-30 19:12:20 UTC

Description Weinan Liu 2018-04-08 09:53:57 UTC
Description of problem:
User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope, clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found


How reproducible:
always

Steps to Reproduce:
1. Deploy NPD on OCP 3.10 with an inventory file containing the entries below:
$ cat qe-inventory-host-file
...
openshift_node_problem_detector_state=present
openshift_node_problem_detector_image_prefix=***/openshift3/ose-
openshift_node_problem_detector_image_version=v3.10.0
...
$ ansible-playbook -v -i qe-inventory-host-file ~/openshift-ansible/playbooks/openshift-node-problem-detector/config.yml

2. Check the pod logs


Actual results:
2. # oc get pod -n openshift-infra
NAME                          READY     STATUS    RESTARTS   AGE
node-problem-detector-5pjlm   1/1       Running   0          27m
node-problem-detector-dllj5   1/1       Running   0          27m
node-problem-detector-n6ksn   1/1       Running   0          27m

# oc logs -n openshift-infra node-problem-detector-5pjlm
...
E0408 05:19:03.342379       1 manager.go:160] failed to update node conditions: nodes "qe-dma310-node-registry-router-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-dma310-node-registry-router-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
...

Expected results:
No errors in pod logs

Additional info:
# oc version
oc v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-dma310-master-etcd-1:8443
openshift v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657

[root@qe-dma310-master-etcd-1 ~]# oc get sa node-problem-detector
NAME                    SECRETS   AGE
node-problem-detector   2         29m
[root@qe-dma310-master-etcd-1 ~]# oc describe sa node-problem-detector
Name:                node-problem-detector
Namespace:           openshift-infra
Labels:              <none>
Annotations:         <none>
Image pull secrets:  node-problem-detector-dockercfg-nfvwx
Mountable secrets:   node-problem-detector-dockercfg-nfvwx
                     node-problem-detector-token-plw2x
Tokens:              node-problem-detector-token-5qhv2
                     node-problem-detector-token-plw2x
Events:              <none>

Comment 1 Jan Chaloupka 2018-04-09 14:13:43 UTC
Upstream PR with the fix: https://github.com/openshift/openshift-ansible/pull/7856

Comment 2 Jan Chaloupka 2018-04-09 14:33:02 UTC
Merged upstream

Comment 3 Weinan Liu 2018-05-14 08:39:58 UTC
This still fails in the environment below. Could you help double-check?

[root@qe-weinliu-master-etcd-1 ~]# oc version
oc v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-master-etcd-1:8443
openshift v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
[root@qe-weinliu-master-etcd-1 ~]# uname -a
Linux qe-weinliu-master-etcd-1 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux



[root@qe-weinliu-master-etcd-1 ~]# oc logs node-problem-detector-4bhqn
I0514 04:28:10.705502       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0514 04:28:10.705864       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0514 04:28:10.706291       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0514 04:28:10.706333       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0514 04:28:10.708389       1 log_monitor.go:72] Start log monitor
I0514 04:28:10.712128       1 log_watcher.go:69] Start watching journald
I0514 04:28:10.712166       1 log_monitor.go:72] Start log monitor
I0514 04:28:10.712292       1 log_watcher.go:69] Start watching journald
I0514 04:28:10.712309       1 problem_detector.go:73] Problem detector started
I0514 04:28:10.712658       1 log_monitor.go:163] Initialize condition generated: []
I0514 04:28:10.712715       1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-05-14 04:28:10.712708442 -0400 EDT m=+0.025078977 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
E0514 04:28:11.724221       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:28:22.711113       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:28:32.714392       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:28:43.715188       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:28:53.710952       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
E0514 04:29:04.715792       1 manager.go:160] failed to update node conditions: nodes "qe-weinliu-master-etcd-1" is forbidden: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot patch nodes/status at the cluster scope: User "system:serviceaccount:openshift-infra:node-problem-detector" cannot "patch" "nodes/status" with name "qe-weinliu-master-etcd-1" in project "": clusterrole.rbac.authorization.k8s.io "node-problem-detector" not found
...

Comment 4 Jan Chaloupka 2018-05-14 10:35:24 UTC
What openshift-ansible rpm did you use? Can you list the cluster roles and search for node-problem-detector?

This should already be fixed; I tested the deployment with https://github.com/openshift/openshift-ansible/pull/7856.
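
For reference, a quick way to run the check requested above (a sketch; both are standard oc queries run as a cluster admin):

# oc get clusterroles | grep node-problem-detector
# oc get clusterrolebindings | grep node-problem-detector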

Comment 5 Weinan Liu 2018-05-14 10:58:15 UTC
(In reply to Jan Chaloupka from comment #4)
> What openshift-ansible rpm did you use? Can you list the cluster roles and
> search for node-problem-detector?
> 
> This should be already fixed. I tested the deployment with
> https://github.com/openshift/openshift-ansible/pull/7856.

OPENSHIFT_ANSIBLE_URL git:openshift/openshift-ansible

I worked around the issue by creating a ClusterRoleBinding:

# oc create -f crb.yaml

# cat crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: openshift-infra
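
Note: this binding assumes a ClusterRole named system:node-problem-detector already exists on the cluster. One way to verify the workaround took effect, assuming oc auth can-i is available in this release and the caller has impersonation rights (the service account name is taken from the error message above):

# oc auth can-i patch nodes/status --as=system:serviceaccount:openshift-infra:node-problem-detector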

Comment 7 Weinan Liu 2018-05-16 03:14:03 UTC
Verified the bug is fixed.

[root@ip-172-18-14-36 ~]# oc version
oc v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-36.ec2.internal:8443
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
[root@ip-172-18-14-36 ~]# uname -a
Linux ip-172-18-14-36.ec2.internal 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


[root@ip-172-18-14-36 ~]# oc logs node-problem-detector-vc7t9
I0515 23:03:24.960521       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0515 23:03:24.960737       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0515 23:03:24.960862       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0515 23:03:24.960895       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0515 23:03:24.961544       1 log_monitor.go:72] Start log monitor
I0515 23:03:24.963479       1 log_watcher.go:69] Start watching journald
I0515 23:03:24.963512       1 log_monitor.go:72] Start log monitor
I0515 23:03:24.963700       1 log_watcher.go:69] Start watching journald
I0515 23:03:24.963717       1 problem_detector.go:73] Problem detector started
I0515 23:03:24.963755       1 log_monitor.go:163] Initialize condition generated: []
I0515 23:03:24.963739       1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-05-15 23:03:24.963714476 -0400 EDT m=+0.049238160 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0515 23:10:29.853322       1 log_monitor.go:114] New status generated: &{Source:kernel-monitor Events:[{Severity:warn Timestamp:2018-05-15 23:10:29.85287 -0400 EDT Reason:TaskHung Message:task docker:1234 blocked for more than 700 seconds.}] Conditions:[{Type:KernelDeadlock Status:false Transition:2018-05-15 23:03:24.963714476 -0400 EDT m=+0.049238160 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]}
I0515 23:10:29.853491       1 log_monitor.go:114] New status generated: &{Source:kernel-monitor Events:[] Conditions:[{Type:KernelDeadlock Status:true Transition:2018-05-15 23:10:29.85287 -0400 EDT Reason:DockerHung Message:task docker:1234 blocked for more than 700 seconds.}]}
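
The DockerHung status update above shows the patch now succeeds; the injected condition should also be visible on the node object (a verification sketch; the node name here is assumed to be the one hosting this pod):

# oc describe node ip-172-18-14-36.ec2.internal | grep -A 2 KernelDeadlock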

Comment 9 errata-xmlrpc 2018-07-30 19:12:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816