Bug 1962638

Summary: missing secret for sriov-network-config-daemon after upgrade from OCP 4.5.16 to 4.6.17
Product: OpenShift Container Platform
Component: Networking
Networking sub component: SR-IOV
Reporter: Andreas Karis <akaris>
Assignee: Peng Liu <pliu>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED NEXTRELEASE
CC: dansmall, gdiotte, pibanezr
Version: 4.6
Whiteboard: Telco
Type: Bug
Last Closed: 2021-06-24 06:11:48 UTC

Description Andreas Karis 2021-05-20 12:41:56 UTC
Description of problem:

During an upgrade from OCP 4.5.16 to 4.6.17, an sriov-network-config-daemon pod went into CrashLoopBackOff state. The pod recovered after being deleted and recreated.

This bug report was created to identify whether this is a known issue or one that warrants attention, and to track it through to resolution.

The problem appeared after one of the load-balancer nodes was rebooted and its pods were recreated. While the pod was in CrashLoopBackOff, the events in `oc -n openshift-sriov-network-operator describe pod sriov-network-config-daemon-xxxxx` showed that a required secret did not exist.

We no longer have the exact name of the secret that was missing, but it was likely the one listed below, as per the `oc adm inspect ns/openshift-sriov-network-operator` output at the time of the OCP 4.6.17 upgrade:

~~~
$ cat oc_adm_inspect_ns_openshift-sriov-network-operator/inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-6lml7/sriov-network-config-daemon/sriov-network-config-daemon/logs/current.log | grep "Unable to rotate token" | tail -n1
2021-05-11T19:29:56.613845500Z E0511 19:29:56.613835   10308 token_source.go:152] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
~~~
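
The "Unable to rotate token" message comes from client-go's periodic re-read of the projected service account token file. A minimal Go sketch of that failing read (the `rotateToken` helper here is a hypothetical stand-in, not actual client-go code):

~~~
package main

import (
	"fmt"
	"os"
)

// Path of the projected service account token that kubelet mounts into
// every pod.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// rotateToken imitates client-go's periodic re-read of the token file.
// If the path no longer resolves from the process's point of view, the
// read fails exactly like the log line above.
func rotateToken() (string, error) {
	b, err := os.ReadFile(tokenPath)
	if err != nil {
		return "", fmt.Errorf("failed to read token file %q: %w", tokenPath, err)
	}
	return string(b), nil
}

func main() {
	if _, err := rotateToken(); err != nil {
		fmt.Println("Unable to rotate token:", err)
	}
}
~~~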


Comment 2 Peng Liu 2021-05-24 06:33:42 UTC
@akaris The file '/var/run/secrets/kubernetes.io/serviceaccount/token' should be injected into the pod automatically by Kubernetes. Since you were performing an upgrade, can you check the status of the MCP with 'oc get mcp'? Also, please check whether any other pods on the same node report the same error.

Comment 3 Gabriel Diotte 2021-06-03 12:15:50 UTC
The requested oc get mcp output can be found below. Also of note: this has only been observed on this specific sriov-network-config-daemon pod on this worker. It appears to be tied to reboots of the node, as a pull-secret update that triggered a reboot reproduced the issue.

Node: worker-06 (0020-sosreport-worker-06-2021-05-18-qcorose.tar.xz)

~~~
[akaris@supportshell 02943641]$ omg get nodes --show-labels | grep loadbalancer | awk '{print $1}'
[WARN] Skipped 2/489 lines from the end of master-0.yaml to the load the yaml file properly
[WARN] Skipped 4/927 lines from the end of worker-04.yaml to the load the yaml file properly
worker-06
worker-07
~~~

~~~
[akaris@supportshell 02943641]$ omg get machineconfigpool
NAME          CONFIG                                                  UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
loadbalancer  rendered-loadbalancer-e67038cac0e0c6059a2a452518ae1085  True     False     False     2             2                  2                    0                     241d
master        rendered-master-42a372c895c06a4bda9512763a899f20        True     False     False     3             3                  3                    0                     241d
worker        rendered-worker-b516ea73de067359e6f3d7cf3fe4d627        True     False     False     6             6                  6                    0                     241d
~~~

~~~
[akaris@supportshell oc_adm_inspect_ns_openshift-sriov-network-operator]$ grep -R nodeName: inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov*
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-cni-559cn/sriov-cni-559cn.yaml:  nodeName: worker-06
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-cni-cftwr/sriov-cni-cftwr.yaml:  nodeName: worker-07
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-device-plugin-6kt5r/sriov-device-plugin-6kt5r.yaml:  nodeName: worker-06
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-device-plugin-q58q5/sriov-device-plugin-q58q5.yaml:  nodeName: worker-07
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-46tb4/sriov-network-config-daemon-46tb4.yaml:  nodeName: worker-06
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-6lml7/sriov-network-config-daemon-6lml7.yaml:  nodeName: worker-07
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-bqrlx/sriov-network-config-daemon-bqrlx.yaml:  nodeName: worker-02
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-hvvxl/sriov-network-config-daemon-hvvxl.yaml:  nodeName: worker-01
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-mnhj9/sriov-network-config-daemon-mnhj9.yaml:  nodeName: worker-00
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-q5bqg/sriov-network-config-daemon-q5bqg.yaml:  nodeName: worker-04
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-rlwcv/sriov-network-config-daemon-rlwcv.yaml:  nodeName: worker-03
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-config-daemon-x9ssn/sriov-network-config-daemon-x9ssn.yaml:  nodeName: worker-05
inspect.local.8920316685580493059/namespaces/openshift-sriov-network-operator/pods/sriov-network-operator-5879fb4869-v4xdn/sriov-network-operator-5879fb4869-v4xdn.yaml:  nodeName: master-2
~~~

~~~
[akaris@supportshell 02943641]$ omg get mcp loadbalancer -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: '2020-09-16T20:37:17Z'
  generation: 27
  name: loadbalancer
  resourceVersion: '385038868'
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/loadbalancer
  uid: 94bc148b-4701-456d-8a77-8de94fe99f5d
spec:
  configuration:
    name: rendered-loadbalancer-e67038cac0e0c6059a2a452518ae1085
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 05-hugepages-kernelarg
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 06-blacklist-sctp-module
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 11-worker-bonding
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 12-worker-sssd
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 15-load-eric-amf-modules
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 50-sshd-crypto-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 50-worker-idmap
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-coredns-override-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-loadbalancer-kernelarg-nosmt
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-crio-capabilities
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values:
      - worker
      - load-balancer
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/load-balancer: ''
  paused: false
status:
  conditions:
  - lastTransitionTime: '2020-09-16T20:37:55Z'
    message: ''
    reason: ''
    status: 'False'
    type: NodeDegraded
  - lastTransitionTime: '2021-05-02T00:03:29Z'
    message: ''
    reason: ''
    status: 'False'
    type: RenderDegraded
  - lastTransitionTime: '2021-05-02T00:03:34Z'
    message: ''
    reason: ''
    status: 'False'
    type: Degraded
  - lastTransitionTime: '2021-05-11T19:30:44Z'
    message: All nodes are updated with rendered-loadbalancer-e67038cac0e0c6059a2a452518ae1085
    reason: ''
    status: 'True'
    type: Updated
  - lastTransitionTime: '2021-05-11T19:30:44Z'
    message: ''
    reason: ''
    status: 'False'
    type: Updating
  configuration:
    name: rendered-loadbalancer-e67038cac0e0c6059a2a452518ae1085
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 05-hugepages-kernelarg
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 06-blacklist-sctp-module
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 11-worker-bonding
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 12-worker-sssd
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 15-load-eric-amf-modules
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 50-sshd-crypto-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 50-worker-idmap
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-coredns-override-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-loadbalancer-kernelarg-nosmt
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-crio-capabilities
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 0
  machineCount: 2
  observedGeneration: 27
  readyMachineCount: 2
  unavailableMachineCount: 0
  updatedMachineCount: 2
~~~

~~~
[akaris@supportshell 02943641]$ omg get machineconfig -A
NAME                                                    GENERATEDBYCONTROLLER                     IGNITIONVERSION  AGE
00-master                                               fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
00-worker                                               fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
01-master-container-runtime                             fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
01-master-kubelet                                       fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
01-worker-container-runtime                             fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
01-worker-kubelet                                       fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            241d
05-hugepages-kernelarg                                                                            2.2.0            241d
06-blacklist-sctp-module                                                                          2.2.0            219d
11-master-bonding                                                                                 2.2.0            241d
11-worker-bonding                                                                                 2.2.0            241d
12-master-sssd                                                                                    2.2.0            241d
12-worker-sssd                                                                                    2.2.0            241d
15-load-eric-amf-modules                                                                          2.2.0            219d
50-sshd-crypto-master                                                                             2.2.0            151d
50-sshd-crypto-worker                                                                             2.2.0            151d
50-worker-idmap                                                                                   2.2.0            151d
99-coredns-override-master                                                                        3.1.0            6d
99-coredns-override-worker                                                                        3.1.0            6d
99-loadbalancer-kernelarg-nosmt                                                                   2.2.0            241d
99-master-generated-crio-capabilities                                                             2.2.0            151d
99-master-generated-registries                          fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
99-master-ssh                                                                                     2.2.0            241d
99-worker-generated-crio-capabilities                                                             2.2.0            151d
99-worker-generated-registries                          fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
99-worker-ssh                                                                                     2.2.0            241d
rendered-loadbalancer-082d32812ca9768cdee577980a8103f4  287dd2cfa692ecbbce7b3bc1913b99b3e2d2f5c7  2.2.0            151d
rendered-loadbalancer-13a8335f2046645d727637f5ee2c72f2  cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-loadbalancer-13d976bfebd30a32a75f701aa54c3096  601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            219d
rendered-loadbalancer-1a7a6f58ac5b36732e0e07f5c6d3e24a  cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-loadbalancer-205302633a5c29a3eee35a5ec330ebdf  480accd5d4f631d34e560aa5c8a3dfab0c7bbe27  2.2.0            219d
rendered-loadbalancer-42cb1af1078712cf191710afa588e4b2  cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-loadbalancer-45cf8b5a4ff8fde713c3b5b02c207a0c  cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-loadbalancer-6bb55ae458d3c56eda7d3d789c7c4bcb  fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-loadbalancer-718eec92432e91c77e644525f8552c9f  480accd5d4f631d34e560aa5c8a3dfab0c7bbe27  2.2.0            241d
rendered-loadbalancer-d8d5c6141f75c9e80b1b05473dba8cc5  cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-loadbalancer-e67038cac0e0c6059a2a452518ae1085  fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-loadbalancer-ec7c478427ea92c589d4d7eccac50b3e  601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            194d
rendered-master-0606b8cd9cb3a1328dc1baeb511bea76        601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            219d
rendered-master-42a372c895c06a4bda9512763a899f20        fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-master-52b76832aadf250ab2ac67b450d44164        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-master-57e1dae84829f1ad67e80feb8560d24c        287dd2cfa692ecbbce7b3bc1913b99b3e2d2f5c7  2.2.0            151d
rendered-master-62977f3ca03c2ed0be287ca6649975a0        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-master-96189514f64425ea300a36634233b8e3        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-master-a19f36ea563c2586a02e18250cdbac08        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-master-a4a5a77c90c26e6b43590edae3abd8e9        480accd5d4f631d34e560aa5c8a3dfab0c7bbe27  2.2.0            241d
rendered-master-acef0d1f36a94836c42895441f85c866        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            6d
rendered-master-c29a5491507884db73e0a3cfacf7bb28        fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-master-d01f03e10cf5f4155acf642ef166b71d        601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            194d
rendered-worker-17d2b9d62c511967a453d310cbbd36ec        601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            194d
rendered-worker-577409a7776da128db21bebab94dfe30        480accd5d4f631d34e560aa5c8a3dfab0c7bbe27  2.2.0            241d
rendered-worker-581ce191b1a159ff439b26b3f62eae11        fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-worker-73ab53151ec3d7ca201b15376fc0d612        287dd2cfa692ecbbce7b3bc1913b99b3e2d2f5c7  2.2.0            151d
rendered-worker-755eb578712a7ba1d27d24219dc5140a        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-worker-a77ffedbcca3cf125b0f67aebd523e85        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
rendered-worker-b25d232de2c0579357fe8c1075a8e324        601c2285f497bf7c73d84737b9977a0e697cb86a  2.2.0            219d
rendered-worker-b516ea73de067359e6f3d7cf3fe4d627        fc2e69c4408d898b24760eea9e889f0673369e67  3.1.0            6d
rendered-worker-e83002e9c5b12547b429517b338e56be        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            16d
rendered-worker-fa9440b6170bfeb7a801931bc56a30f1        480accd5d4f631d34e560aa5c8a3dfab0c7bbe27  2.2.0            219d
rendered-worker-faa847eb307c77e0edf12e109fa68391        cdce2822a6b3bff31b5aafc23b773f7dcbea2caa  2.2.0            151d
~~~

Comment 4 Peng Liu 2021-06-03 13:46:47 UTC
Could you help collect the kubelet logs from that node?

Comment 5 Peng Liu 2021-06-04 15:05:09 UTC
I think I may have found the root cause. Can you upgrade the SR-IOV operator to 4.6 and see whether the issue is resolved? I believe you hit a bug in the 4.5 code that has been fixed in 4.6.

Comment 6 Andreas Karis 2021-06-04 15:46:28 UTC
Hi,

I think the issue can no longer be reproduced for this case. Could you point out the code section you believe is the culprit, and possibly the patch, and we'll relay that to the customer.

Thanks so much!

- Andreas

Comment 7 Dan Small 2021-06-06 17:01:34 UTC
Hi Peng,
New attachments have been added to the salesforce case that include the kubelet logs you requested.

Cheers,
Dan

Comment 8 Peng Liu 2021-06-07 03:00:48 UTC
@akaris

In the 4.5 code (https://github.com/openshift/sriov-network-operator/blob/7637810f42a401af61095dbed107101beb774170/pkg/plugins/generic/generic_plugin.go#L121), we chroot to the host root path with `utils.Chroot`. Normally, if `utils.SyncNodeState` returns no error, `exit()` is invoked and the process chroots back to the pod's own root path, where /var/run/secrets/kubernetes.io/serviceaccount/token is mounted. In your case, however, an error was returned, so `exit()` was skipped and the process could no longer find /var/run/secrets/kubernetes.io/serviceaccount/token. In 4.6, the logic was changed to ensure that `exit()` is always invoked.
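
For illustration, a minimal self-contained Go sketch of that control flow (the `chroot`, `syncNodeState`, `buggy`, and `fixed` functions here are hypothetical stand-ins for `utils.Chroot`, `utils.SyncNodeState`, the 4.5 plugin code, and the 4.6 change, respectively, not the operator's actual code):

~~~
package main

import (
	"fmt"
	"os"
	"syscall"
)

// chroot switches the process root to path and returns a function that
// switches back (keeping a file descriptor on the old "/" open so we can
// escape), mirroring what the operator's utils.Chroot helper does.
func chroot(path string) (func() error, error) {
	root, err := os.Open("/")
	if err != nil {
		return nil, err
	}
	if err := syscall.Chroot(path); err != nil {
		root.Close()
		return nil, err
	}
	return func() error {
		defer root.Close()
		if err := root.Chdir(); err != nil {
			return err
		}
		return syscall.Chroot(".")
	}, nil
}

// syncNodeState stands in for utils.SyncNodeState failing on this node.
func syncNodeState() error { return fmt.Errorf("simulated sync failure") }

// buggy mirrors the 4.5 control flow: exit() is only reached on the
// success path, so an error from syncNodeState leaves the process
// chrooted into /host. The service account token is mounted in the pod's
// original root, so every later read of it fails with ENOENT.
func buggy() error {
	exit, err := chroot("/host")
	if err != nil {
		return err
	}
	if err := syncNodeState(); err != nil {
		return err // BUG: exit() is skipped on this path
	}
	return exit()
}

// fixed mirrors the 4.6 change: defer guarantees the chroot is undone on
// every return path, error or not.
func fixed() error {
	exit, err := chroot("/host")
	if err != nil {
		return err
	}
	defer exit()
	return syncNodeState()
}

func main() {
	fmt.Println("buggy:", buggy()) // process would stay chrooted after this
	fmt.Println("fixed:", fixed()) // chroot is restored despite the error
}
~~~

This also matches the observed recovery after deleting the pod: the replacement container process starts in the pod's own root again, where the token is visible.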

Comment 9 Red Hat Bugzilla 2023-09-15 01:06:56 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days