Bug 1868799 - NFD Operator not feature complete on Z
Summary: NFD Operator not feature complete on Z
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: jschinta
QA Contact: Barry Donahue
URL:
Whiteboard: multi-arch
Depends On: 1957846
Blocks: ocp-46-z-tracker ocp-47-z-tracker
Reported: 2020-08-13 20:13 UTC by Cheryl A Fillekes
Modified: 2021-05-21 10:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-21 10:44:52 UTC
Target Upstream Version:
Embargoed:




Links
System: Github
ID: openshift node-feature-discovery pull 42
Private: 0
Priority: None
Status: closed
Summary: Bug 1868799: Mount /usr inside the Pod
Last Updated: 2021-05-18 14:39:20 UTC

Description Cheryl A Fillekes 2020-08-13 20:13:04 UTC

Comment 1 Cheryl A Fillekes 2020-08-13 20:27:45 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install NFD Operator from OperatorHub
2. Instantiate NFD Operator in one namespace
3. Query Node Features

Actual results:

sh-4.2$ nfd-worker
2020/08/13 20:26:07 Node Feature Discovery Worker v0.4.0
2020/08/13 20:26:07 NodeName: 'worker-1.pok-31.pok.stglabs.ibm.com'
INFO: 2020/08/13 20:26:07 parsed scheme: ""
INFO: 2020/08/13 20:26:07 scheme "" not registered, fallback to default scheme
INFO: 2020/08/13 20:26:07 ccResolverWrapper: sending update to cc: {[{localhost:8080 0  <nil>}] <nil>}
INFO: 2020/08/13 20:26:07 ClientConn switching balancer to "pick_first"
2020/08/13 20:26:07 CONF: {{[BMI1 BMI2 CLMUL CMOV CX16 ERMS F16C HTT LZCNT MMX MMXEXT NX POPCNT RDRAND RDSEED RDTSCP SGX SSE SSE2 SSE3 SSE4.1 SSE4.2 SSSE3] []}}
2020/08/13 20:26:07 cpu-cpuid.EDAT = true
2020/08/13 20:26:07 cpu-cpuid.VX = true
2020/08/13 20:26:07 cpu-cpuid.ESAN3 = true
2020/08/13 20:26:07 cpu-cpuid.VXE = true
2020/08/13 20:26:07 cpu-cpuid.STFLE = true
2020/08/13 20:26:07 cpu-cpuid.VXD = true
2020/08/13 20:26:07 cpu-cpuid.ETF3EH = true
2020/08/13 20:26:07 cpu-cpuid.MSA = true
2020/08/13 20:26:07 cpu-cpuid.GS = true
2020/08/13 20:26:07 cpu-cpuid.EIMM = true
2020/08/13 20:26:07 cpu-cpuid.DFP = true
2020/08/13 20:26:07 cpu-cpuid.LDISP = true
2020/08/13 20:26:07 cpu-cpuid.HIGHGPRS = true
2020/08/13 20:26:07 cpu-cpuid.TE = true
2020/08/13 20:26:07 cpu-cpuid.ZARCH = true
2020/08/13 20:26:07 Failed to read /proc/config.gz: open /proc/config.gz: no such file or directory
2020/08/13 20:26:07 ERROR: Failed to read kconfig: open /host-boot/config-4.18.0-211.el8.s390x: no such file or directory
2020/08/13 20:26:07 kernel-version.full = 4.18.0-211.el8.s390x
2020/08/13 20:26:07 kernel-version.major = 4
2020/08/13 20:26:07 kernel-version.minor = 18
2020/08/13 20:26:07 kernel-version.revision = 0
2020/08/13 20:26:07 kernel-selinux.enabled = true
WARNING: 2020/08/13 20:26:07 grpc: addrConn.createTransport failed to connect to {localhost:8080 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused". Reconnecting...
2020/08/13 20:26:07 SR-IOV not supported for network interface: enc2e0: open /sys/class/net/enc2e0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vxlan_sys_4789: open /sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: tun0: open /sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethaad46e9a: open /sys/class/net/vethaad46e9a/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethc4cf3119: open /sys/class/net/vethc4cf3119/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth6126a652: open /sys/class/net/veth6126a652/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1c80cd42: open /sys/class/net/veth1c80cd42/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd6880663: open /sys/class/net/vethd6880663/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1db88e34: open /sys/class/net/veth1db88e34/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth0e035530: open /sys/class/net/veth0e035530/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth75fa65aa: open /sys/class/net/veth75fa65aa/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethcb69d03d: open /sys/class/net/vethcb69d03d/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd21a951e: open /sys/class/net/vethd21a951e/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 storage-nonrotationaldisk = true
2020/08/13 20:26:07 system-os_release.VERSION_ID.major = 4
2020/08/13 20:26:07 system-os_release.VERSION_ID.minor = 6
2020/08/13 20:26:07 system-os_release.ID = rhcos
2020/08/13 20:26:07 system-os_release.VERSION_ID = 4.6
2020/08/13 20:26:07 Sendng labeling request nfd-master
2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
2020/08/13 20:26:07 ERROR: failed to advertise labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
sh-4.2$ 


Expected results:

NFD should pick up more details regarding network interfaces, IOMMU, memory features, and storage on Z.


Additional info:

Comment 2 Carvel Baus 2020-08-17 14:20:40 UTC
How is the NFD for s390x different from NFD on x86 as far as features go?

Comment 3 Dennis Gilmore 2020-08-18 19:27:26 UTC
Moving this RFE to Jira: https://issues.redhat.com/browse/MULTIARCH-384

Comment 4 Cheryl A Fillekes 2020-09-14 14:40:41 UTC
The missing features are:

1. Kernel Configuration: 

2020/08/13 20:26:07 Failed to read /proc/config.gz: open /proc/config.gz: no such file or directory
2020/08/13 20:26:07 ERROR: Failed to read kconfig: open /host-boot/config-4.18.0-211.el8.s390x: no such file or directory

2. Network:

WARNING: 2020/08/13 20:26:07 grpc: addrConn.createTransport failed to connect to {localhost:8080 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused". Reconnecting...
2020/08/13 20:26:07 SR-IOV not supported for network interface: enc2e0: open /sys/class/net/enc2e0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vxlan_sys_4789: open /sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: tun0: open /sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethaad46e9a: open /sys/class/net/vethaad46e9a/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethc4cf3119: open /sys/class/net/vethc4cf3119/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth6126a652: open /sys/class/net/veth6126a652/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1c80cd42: open /sys/class/net/veth1c80cd42/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd6880663: open /sys/class/net/vethd6880663/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1db88e34: open /sys/class/net/veth1db88e34/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth0e035530: open /sys/class/net/veth0e035530/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth75fa65aa: open /sys/class/net/veth75fa65aa/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethcb69d03d: open /sys/class/net/vethcb69d03d/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd21a951e: open /sys/class/net/vethd21a951e/device/sriov_totalvfs: no such file or directory

Whatever it's trying to find here:

2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
2020/08/13 20:26:07 ERROR: failed to advertise labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"

And no information seems to be available from NFD on the nodes' disk or memory configuration.

Comment 5 jschinta 2021-03-26 13:56:54 UTC
Seeing the same behaviour in 4.7:

2021/03/25 15:00:59 Node Feature Discovery Worker 1.15
2021/03/25 15:00:59 NodeName: 'worker-01.s8343015.lnxne.boe'
INFO: 2021/03/25 15:00:59 parsed scheme: ""
INFO: 2021/03/25 15:00:59 scheme "" not registered, fallback to default scheme
INFO: 2021/03/25 15:00:59 ccResolverWrapper: sending update to cc: {[{nfd-master:12000  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2021/03/25 15:00:59 ClientConn switching balancer to "pick_first"
2021/03/25 15:00:59 Configuration successfully loaded from "/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
2021/03/25 15:00:59 cpu-cpuid.ZARCH = true
2021/03/25 15:00:59 cpu-cpuid.STFLE = true
2021/03/25 15:00:59 cpu-cpuid.EIMM = true
2021/03/25 15:00:59 cpu-cpuid.ETF3EH = true
2021/03/25 15:00:59 cpu-cpuid.TE = true
2021/03/25 15:00:59 cpu-cpuid.ESAN3 = true
2021/03/25 15:00:59 cpu-cpuid.MSA = true
2021/03/25 15:00:59 cpu-cpuid.LDISP = true
2021/03/25 15:00:59 cpu-cpuid.HIGHGPRS = true
2021/03/25 15:00:59 cpu-cpuid.VX = true
2021/03/25 15:00:59 cpu-cpuid.DFP = true
2021/03/25 15:00:59 cpu-cpuid.EDAT = true
2021/03/25 15:00:59 kernel-version.full = 4.18.0-240.10.1.el8_3.s390x
2021/03/25 15:00:59 kernel-version.major = 4
2021/03/25 15:00:59 kernel-version.minor = 18
2021/03/25 15:00:59 kernel-version.revision = 0
2021/03/25 15:00:59 kernel-selinux.enabled = true
2021/03/25 15:00:59 storage-nonrotationaldisk = true
2021/03/25 15:00:59 system-os_release.RHEL_VERSION = 8.3
2021/03/25 15:00:59 system-os_release.ID = rhcos
2021/03/25 15:00:59 system-os_release.VERSION_ID = 4.7
2021/03/25 15:00:59 system-os_release.VERSION_ID.major = 4
2021/03/25 15:00:59 system-os_release.VERSION_ID.minor = 7
2021/03/25 15:00:59 Sending labeling request to nfd-master
2021/03/25 15:00:59 ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-240.10.1.el8_3.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-240.10.1.el8_3.s390x/config /usr/lib/ostree-boot/config-4.18.0-240.10.1.el8_3.s390x /usr/lib/kernel/config-4.18.0-240.10.1.el8_3.s390x /usr/src/linux-headers-4.18.0-240.10.1.el8_3.s390x/.config /lib/modules/4.18.0-240.10.1.el8_3.s390x/build/.config /host-boot/config-4.18.0-240.10.1.el8_3.s390x]:
2021/03/25 15:00:59 SR-IOV not supported for network interface: encf22d: open /host-sys/class/net/encf22d/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: tun0: open /host-sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth12ba4c9e: open /host-sys/class/net/veth12ba4c9e/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth3214bc5c: open /host-sys/class/net/veth3214bc5c/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth7257aecb: open /host-sys/class/net/veth7257aecb/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth748ca087: open /host-sys/class/net/veth748ca087/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth85a36285: open /host-sys/class/net/veth85a36285/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth8cb39d1c: open /host-sys/class/net/veth8cb39d1c/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethb93d0c66: open /host-sys/class/net/vethb93d0c66/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethda5edc13: open /host-sys/class/net/vethda5edc13/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethfc871087: open /host-sys/class/net/vethfc871087/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vxlan_sys_4789: open /host-sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 INFO: Custom features: [{Name:rdma.capable MatchOn:[{PciID:0xc000238320 UsbID:<nil> LoadedKMod:<nil> CpuID:<nil> Kconfig:<nil>}]} {Name:rdma.available MatchOn:[{PciID:<nil> UsbID:<nil> LoadedKMod:0xc000207460 CpuID:<nil> Kconfig:<nil>}]}]
2021/03/25 15:01:59 Configuration successfully loaded from "/etc/kubernetes/node-feature-discovery/nfd-worker.conf"

Comment 6 jschinta 2021-04-09 11:19:26 UTC
Update: I looked through the NFD Operator, and the only feature that is actually missing is the kernel configuration. On s390x RHCOS, the kernel configuration is located only at /usr/lib/modules/<kernel-version>/config. Starting with OCP 4.7, the NFD Operator does search for the config file under this path; however, the path is not mounted inside the Pod, which results in this error:
2021/03/25 15:00:59 ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-240.10.1.el8_3.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-240.10.1.el8_3.s390x/config /usr/lib/ostree-boot/config-4.18.0-240.10.1.el8_3.s390x /usr/lib/kernel/config-4.18.0-240.10.1.el8_3.s390x /usr/src/linux-headers-4.18.0-240.10.1.el8_3.s390x/.config /lib/modules/4.18.0-240.10.1.el8_3.s390x/build/.config /host-boot/config-4.18.0-240.10.1.el8_3.s390x]:
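
To illustrate why the lookup fails, here is a minimal Python sketch of a "first existing candidate wins" search. It is not NFD's actual code (NFD is written in Go); the function names and the `root` parameter are hypothetical, and the candidate list is simply copied from the error message above. If /usr/lib/modules is not mounted into the container, every candidate is missing and the search comes back empty.

```python
import os
import platform


def kconfig_candidates(kernel_version, root="/"):
    """Return candidate kernel config paths, in the order shown in the error message."""
    rel_paths = [
        "proc/config.gz",
        f"usr/src/linux-{kernel_version}/.config",
        "usr/src/linux/.config",
        f"usr/lib/modules/{kernel_version}/config",
        f"usr/lib/ostree-boot/config-{kernel_version}",
        f"usr/lib/kernel/config-{kernel_version}",
        f"usr/src/linux-headers-{kernel_version}/.config",
        f"lib/modules/{kernel_version}/build/.config",
        f"host-boot/config-{kernel_version}",
    ]
    # `root` is a hypothetical knob so the search can be pointed at a test tree.
    return [os.path.join(root, p) for p in rel_paths]


def find_kconfig(kernel_version, root="/"):
    """Return the first candidate that exists as a regular file, or None."""
    for path in kconfig_candidates(kernel_version, root):
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    version = platform.release()
    print(find_kconfig(version) or "no kernel config found in any candidate path")
```

Run inside the worker container, a result of None for the running kernel version would match the error above; run on the s390x RHCOS host itself, the /usr/lib/modules/<kernel-version>/config candidate would be expected to match.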

As for the network log messages: NFD checks the SR-IOV capabilities of each network interface, but only writes labels for the capable and configured states. The per-interface result is written to the log, which produces the output shown above.
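
A rough Python sketch of this per-interface check follows; it is an illustration, not NFD's implementation. The sriov_totalvfs path is taken from the log lines. The use of sriov_numvfs for the "configured" state, the label names network-sriov.capable and network-sriov.configured, and the helper names are assumptions on my part.

```python
import os


def read_int(path):
    """Read a sysfs attribute as an integer; return None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def sriov_summary(sys_root="/sys"):
    """Return (labels, unsupported_interfaces) for all interfaces under sys_root."""
    net_dir = os.path.join(sys_root, "class", "net")
    labels = {}
    unsupported = []
    for iface in sorted(os.listdir(net_dir)):
        device_dir = os.path.join(net_dir, iface, "device")
        total_vfs = read_int(os.path.join(device_dir, "sriov_totalvfs"))
        if total_vfs is None:
            # Corresponds to the "SR-IOV not supported" log line; no label is written.
            unsupported.append(iface)
            continue
        if total_vfs > 0:
            labels["network-sriov.capable"] = True  # assumed label name
            num_vfs = read_int(os.path.join(device_dir, "sriov_numvfs"))
            if num_vfs is not None and num_vfs > 0:
                labels["network-sriov.configured"] = True  # assumed label name
    return labels, unsupported
```

On a node where no interface exposes sriov_totalvfs, as in the logs in this bug, the labels dictionary stays empty and every interface lands in the unsupported list, which is informational rather than a failure.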

Lastly:
2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"

This seems to be a temporary connection problem between master and worker pods.
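
To check whether the worker can reach the master at all, a plain TCP probe is one option. The Python sketch below is only an illustration; the function name is hypothetical, and the two endpoints in the usage section are the ones that appear in the logs in this bug (localhost:8080 in the 2020 run, nfd-master:12000 in the 4.7 run).

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    for host, port in [("localhost", 8080), ("nfd-master", 12000)]:
        state = "reachable" if can_connect(host, port) else "not reachable"
        print(f"{host}:{port} is {state}")
```

A "connection refused" on localhost:8080 from inside the worker container only says that nothing is listening there; it does not by itself tell whether the nfd-master service is down.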

Comment 7 Cheryl A Fillekes 2021-04-09 15:30:42 UTC
1. Network: So NFD doesn't do any feature discovery for network interfaces unless they are SR-IOV-enabled PCI devices?
   

2. Memory: no memory features are detected by NFD on Z, whereas the following features are detected on x86: https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/features.html#memory

3. Kernel configuration file https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/deployment-and-usage#configuration

4. Storage: no storage information seems to be labelled or queried


The message seems to be that NFD doesn't detect features on Z nodes because NFD is only designed to look for features on x86 and Arm;
therefore, when Z features go undiscovered, it works as designed. dgilmore, is this correct?

Comment 8 jschinta 2021-04-13 14:19:13 UTC
1. Network: NFD only discovers SR-IOV as a feature.
2. Memory: As written, only NUMA and NVDIMM are discovered.
3. Kernel configuration file: A possible workaround is to deploy NFD manually, not as an operator, and mount /usr/lib/modules inside the pod.
4. Storage: NFD detects non-rotational disks: 2021/03/25 15:00:59 storage-nonrotationaldisk = true

But yes, NFD has not currently implemented any s390x-specific features. We only get the general features, but you can already do a lot with custom features:
https://kubernetes-sigs.github.io/node-feature-discovery/v0.8/get-started/features.html#custom
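
The workaround from item 3 amounts to adding a hostPath volume and a matching volume mount to the worker pod. The Python sketch below patches a pod spec held as a plain dictionary; the field names (volumes, hostPath, volumeMounts, mountPath, readOnly) are standard Kubernetes pod spec fields, while the function name, the volume name, and the container name in the usage note are hypothetical. Mounting at the same path inside the container is an assumption based on the candidate paths in the error log.

```python
import copy


def with_kernel_config_mount(pod_spec, container_name, host_path="/usr/lib/modules"):
    """Return a copy of pod_spec with a read-only hostPath mount added to one container."""
    spec = copy.deepcopy(pod_spec)
    volume_name = "host-usr-lib-modules"  # hypothetical volume name
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "hostPath": {"path": host_path}}
    )
    for container in spec.get("containers", []):
        if container["name"] == container_name:
            # Same path inside the container, so the worker's existing
            # /usr/lib/modules/<kernel-version>/config candidate can match.
            container.setdefault("volumeMounts", []).append(
                {"name": volume_name, "mountPath": host_path, "readOnly": True}
            )
            return spec
    raise ValueError(f"container {container_name!r} not found in pod spec")
```

For example, with_kernel_config_mount({"containers": [{"name": "nfd-worker"}]}, "nfd-worker") returns a new spec with one volume and one read-only mount, leaving the input untouched.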

Comment 9 jschinta 2021-04-28 08:29:24 UTC
Hello,

I created a fix for the kernel config upstream: https://github.com/kubernetes-sigs/node-feature-discovery/pull/519.
I'm now in the process of backporting it to 4.8: https://github.com/openshift/node-feature-discovery/pull/42

Comment 10 Cheryl A Fillekes 2021-04-28 10:18:57 UTC
Thanks for working on this, Jan! Re-opening and targeting for OCP 4.8  jschinta! :)

Comment 12 jschinta 2021-04-28 12:11:13 UTC
Hello Eric,

the PR for the backport needs the Bugzilla to be targeted to 4.8. Could you change the Target Release to 4.8?

eparis

Comment 13 Dan Li 2021-04-28 12:16:06 UTC
Hi Jan, I have changed the Target Release to 4.8.0. Are you the right assignee for this bug?

Comment 14 jschinta 2021-04-28 12:42:43 UTC
Hi Dan,

Yes, you can assign this to me.

Comment 15 Dan Li 2021-04-28 14:05:41 UTC
Hi @Jan, do you think this PR will be merged before the end of this sprint? If not, I'd like to add "Reviewed-in-Sprint" flag.

Comment 16 jschinta 2021-04-29 07:46:49 UTC
Hi @Dan, no, I don't think the PR will be merged this sprint.

Comment 17 Dan Li 2021-04-29 11:20:54 UTC
Setting "reviewed-in-sprint+"

Comment 19 jschinta 2021-05-07 10:02:12 UTC
Still waiting on Operator PR https://github.com/openshift/cluster-nfd-operator/pull/164

Downstream backport https://github.com/openshift/node-feature-discovery/pull/42 has been merged.

Comment 20 Tom Dale 2021-05-07 13:50:56 UTC
Looks like NFD can now be installed and used properly on s390x OpenShift, version 4.8.0-0.nightly-s390x-2021-05-07-075507.

Comment 21 jschinta 2021-05-10 14:09:50 UTC
Hi @Tom,

Yes, NFD works on 4.6 - 4.8, but you should currently still see the kernel config error in the container log.

Comment 22 Tom Dale 2021-05-11 14:22:31 UTC
Oh, I see. Sorry, yes, those errors are still in the worker container logs:
```
ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-293.el8.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-293.el8.s390x/config /usr/lib/ostree-boot/config-4.18.0-293.el8.s390x /usr/lib/kernel/config-4.18.0-293.el8.s390x /usr/src/linux-headers-4.18.0-293.el8.s390x/.config /lib/modules/4.18.0-293.el8.s390x/build/.config /host-boot/config-4.18.0-293.el8.s390x]:

SR-IOV not supported for network interface: 0f291e079a988a8: open /host-sys/class/net/0f291e079a988a8/device/sriov_totalvfs: no such file or directory
```

Comment 23 Douglas Slavens 2021-05-11 15:06:23 UTC
Marking this as assigned since it does not appear to be ready to be tested.

Comment 24 Dan Li 2021-05-17 19:50:40 UTC
Hi Jan, looks like this bug is back to assigned - do you think it will be resolved before the end of this sprint? If not, I'd like to set a "reviewed-in-sprint" flag

Comment 25 jschinta 2021-05-18 14:38:51 UTC
Hi Dan,
I can verify that the fix for the kernel config works in the latest NFD version, 4.8.0-202105131518.p0.

