Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install NFD Operator from OperatorHub
2. Instantiate NFD Operator in one namespace
3. Query Node Features

Actual results:
sh-4.2$ nfd-worker
2020/08/13 20:26:07 Node Feature Discovery Worker v0.4.0
2020/08/13 20:26:07 NodeName: 'worker-1.pok-31.pok.stglabs.ibm.com'
INFO: 2020/08/13 20:26:07 parsed scheme: ""
INFO: 2020/08/13 20:26:07 scheme "" not registered, fallback to default scheme
INFO: 2020/08/13 20:26:07 ccResolverWrapper: sending update to cc: {[{localhost:8080 0 <nil>}] <nil>}
INFO: 2020/08/13 20:26:07 ClientConn switching balancer to "pick_first"
2020/08/13 20:26:07 CONF: {{[BMI1 BMI2 CLMUL CMOV CX16 ERMS F16C HTT LZCNT MMX MMXEXT NX POPCNT RDRAND RDSEED RDTSCP SGX SSE SSE2 SSE3 SSE4.1 SSE4.2 SSSE3] []}}
2020/08/13 20:26:07 cpu-cpuid.EDAT = true
2020/08/13 20:26:07 cpu-cpuid.VX = true
2020/08/13 20:26:07 cpu-cpuid.ESAN3 = true
2020/08/13 20:26:07 cpu-cpuid.VXE = true
2020/08/13 20:26:07 cpu-cpuid.STFLE = true
2020/08/13 20:26:07 cpu-cpuid.VXD = true
2020/08/13 20:26:07 cpu-cpuid.ETF3EH = true
2020/08/13 20:26:07 cpu-cpuid.MSA = true
2020/08/13 20:26:07 cpu-cpuid.GS = true
2020/08/13 20:26:07 cpu-cpuid.EIMM = true
2020/08/13 20:26:07 cpu-cpuid.DFP = true
2020/08/13 20:26:07 cpu-cpuid.LDISP = true
2020/08/13 20:26:07 cpu-cpuid.HIGHGPRS = true
2020/08/13 20:26:07 cpu-cpuid.TE = true
2020/08/13 20:26:07 cpu-cpuid.ZARCH = true
2020/08/13 20:26:07 Failed to read /proc/config.gz: open /proc/config.gz: no such file or directory
2020/08/13 20:26:07 ERROR: Failed to read kconfig: open /host-boot/config-4.18.0-211.el8.s390x: no such file or directory
2020/08/13 20:26:07 kernel-version.full = 4.18.0-211.el8.s390x
2020/08/13 20:26:07 kernel-version.major = 4
2020/08/13 20:26:07 kernel-version.minor = 18
2020/08/13 20:26:07 kernel-version.revision = 0
2020/08/13 20:26:07 kernel-selinux.enabled = true
WARNING: 2020/08/13 20:26:07 grpc: addrConn.createTransport failed to connect to {localhost:8080 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused". Reconnecting...
2020/08/13 20:26:07 SR-IOV not supported for network interface: enc2e0: open /sys/class/net/enc2e0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vxlan_sys_4789: open /sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: tun0: open /sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethaad46e9a: open /sys/class/net/vethaad46e9a/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethc4cf3119: open /sys/class/net/vethc4cf3119/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth6126a652: open /sys/class/net/veth6126a652/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1c80cd42: open /sys/class/net/veth1c80cd42/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd6880663: open /sys/class/net/vethd6880663/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1db88e34: open /sys/class/net/veth1db88e34/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth0e035530: open /sys/class/net/veth0e035530/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth75fa65aa: open /sys/class/net/veth75fa65aa/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethcb69d03d: open /sys/class/net/vethcb69d03d/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd21a951e: open /sys/class/net/vethd21a951e/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 storage-nonrotationaldisk = true
2020/08/13 20:26:07 system-os_release.VERSION_ID.major = 4
2020/08/13 20:26:07 system-os_release.VERSION_ID.minor = 6
2020/08/13 20:26:07 system-os_release.ID = rhcos
2020/08/13 20:26:07 system-os_release.VERSION_ID = 4.6
2020/08/13 20:26:07 Sendng labeling request nfd-master
2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
2020/08/13 20:26:07 ERROR: failed to advertise labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
sh-4.2$

Expected results: NFD should pick up more details regarding network interfaces, IOMMU, memory features and storage on Z.

Additional info:
How is the NFD for s390x different from NFD on x86 as far as features go?
moving this RFE to Jira https://issues.redhat.com/browse/MULTIARCH-384
The missing features are:

1. Kernel Configuration:
2020/08/13 20:26:07 Failed to read /proc/config.gz: open /proc/config.gz: no such file or directory
2020/08/13 20:26:07 ERROR: Failed to read kconfig: open /host-boot/config-4.18.0-211.el8.s390x: no such file or directory

2. Network:
WARNING: 2020/08/13 20:26:07 grpc: addrConn.createTransport failed to connect to {localhost:8080 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused". Reconnecting...
2020/08/13 20:26:07 SR-IOV not supported for network interface: enc2e0: open /sys/class/net/enc2e0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vxlan_sys_4789: open /sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: tun0: open /sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethaad46e9a: open /sys/class/net/vethaad46e9a/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethc4cf3119: open /sys/class/net/vethc4cf3119/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth6126a652: open /sys/class/net/veth6126a652/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1c80cd42: open /sys/class/net/veth1c80cd42/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd6880663: open /sys/class/net/vethd6880663/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth1db88e34: open /sys/class/net/veth1db88e34/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth0e035530: open /sys/class/net/veth0e035530/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: veth75fa65aa: open /sys/class/net/veth75fa65aa/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethcb69d03d: open /sys/class/net/vethcb69d03d/device/sriov_totalvfs: no such file or directory
2020/08/13 20:26:07 SR-IOV not supported for network interface: vethd21a951e: open /sys/class/net/vethd21a951e/device/sriov_totalvfs: no such file or directory

Whatever it's trying to find here:
2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"
2020/08/13 20:26:07 ERROR: failed to advertise labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"

And no information seems to be available from NFD on the nodes' disk or memory configuration.
Seeing the same behaviour in 4.7:

2021/03/25 15:00:59 Node Feature Discovery Worker 1.15
2021/03/25 15:00:59 NodeName: 'worker-01.s8343015.lnxne.boe'
INFO: 2021/03/25 15:00:59 parsed scheme: ""
INFO: 2021/03/25 15:00:59 scheme "" not registered, fallback to default scheme
INFO: 2021/03/25 15:00:59 ccResolverWrapper: sending update to cc: {[{nfd-master:12000 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2021/03/25 15:00:59 ClientConn switching balancer to "pick_first"
2021/03/25 15:00:59 Configuration successfully loaded from "/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
2021/03/25 15:00:59 cpu-cpuid.ZARCH = true
2021/03/25 15:00:59 cpu-cpuid.STFLE = true
2021/03/25 15:00:59 cpu-cpuid.EIMM = true
2021/03/25 15:00:59 cpu-cpuid.ETF3EH = true
2021/03/25 15:00:59 cpu-cpuid.TE = true
2021/03/25 15:00:59 cpu-cpuid.ESAN3 = true
2021/03/25 15:00:59 cpu-cpuid.MSA = true
2021/03/25 15:00:59 cpu-cpuid.LDISP = true
2021/03/25 15:00:59 cpu-cpuid.HIGHGPRS = true
2021/03/25 15:00:59 cpu-cpuid.VX = true
2021/03/25 15:00:59 cpu-cpuid.DFP = true
2021/03/25 15:00:59 cpu-cpuid.EDAT = true
2021/03/25 15:00:59 kernel-version.full = 4.18.0-240.10.1.el8_3.s390x
2021/03/25 15:00:59 kernel-version.major = 4
2021/03/25 15:00:59 kernel-version.minor = 18
2021/03/25 15:00:59 kernel-version.revision = 0
2021/03/25 15:00:59 kernel-selinux.enabled = true
2021/03/25 15:00:59 storage-nonrotationaldisk = true
2021/03/25 15:00:59 system-os_release.RHEL_VERSION = 8.3
2021/03/25 15:00:59 system-os_release.ID = rhcos
2021/03/25 15:00:59 system-os_release.VERSION_ID = 4.7
2021/03/25 15:00:59 system-os_release.VERSION_ID.major = 4
2021/03/25 15:00:59 system-os_release.VERSION_ID.minor = 7
2021/03/25 15:00:59 Sending labeling request to nfd-master
2021/03/25 15:00:59 ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-240.10.1.el8_3.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-240.10.1.el8_3.s390x/config /usr/lib/ostree-boot/config-4.18.0-240.10.1.el8_3.s390x /usr/lib/kernel/config-4.18.0-240.10.1.el8_3.s390x /usr/src/linux-headers-4.18.0-240.10.1.el8_3.s390x/.config /lib/modules/4.18.0-240.10.1.el8_3.s390x/build/.config /host-boot/config-4.18.0-240.10.1.el8_3.s390x]:
2021/03/25 15:00:59 SR-IOV not supported for network interface: encf22d: open /host-sys/class/net/encf22d/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: tun0: open /host-sys/class/net/tun0/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth12ba4c9e: open /host-sys/class/net/veth12ba4c9e/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth3214bc5c: open /host-sys/class/net/veth3214bc5c/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth7257aecb: open /host-sys/class/net/veth7257aecb/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth748ca087: open /host-sys/class/net/veth748ca087/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth85a36285: open /host-sys/class/net/veth85a36285/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: veth8cb39d1c: open /host-sys/class/net/veth8cb39d1c/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethb93d0c66: open /host-sys/class/net/vethb93d0c66/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethda5edc13: open /host-sys/class/net/vethda5edc13/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vethfc871087: open /host-sys/class/net/vethfc871087/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 SR-IOV not supported for network interface: vxlan_sys_4789: open /host-sys/class/net/vxlan_sys_4789/device/sriov_totalvfs: no such file or directory
2021/03/25 15:00:59 INFO: Custom features: [{Name:rdma.capable MatchOn:[{PciID:0xc000238320 UsbID:<nil> LoadedKMod:<nil> CpuID:<nil> Kconfig:<nil>}]} {Name:rdma.available MatchOn:[{PciID:<nil> UsbID:<nil> LoadedKMod:0xc000207460 CpuID:<nil> Kconfig:<nil>}]}]
2021/03/25 15:01:59 Configuration successfully loaded from "/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
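To compare what the worker reports between releases, the `name = value` entries can be pulled out of a worker log. The following is only a rough sketch, assuming one log entry per line in the timestamp format shown in this bug; `parse_features` is a hypothetical helper, not part of NFD.

```python
import re
from typing import Dict

# Matches entries such as "2021/03/25 15:00:59 cpu-cpuid.ZARCH = true".
# Lines prefixed with INFO:/WARNING: and free-text error lines do not match.
FEATURE_LINE = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\S+) = (\S+)$"
)


def parse_features(log_text: str) -> Dict[str, str]:
    """Collect feature name/value pairs from an nfd-worker log (hypothetical helper)."""
    features: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = FEATURE_LINE.match(line.strip())
        if match:
            features[match.group(1)] = match.group(2)
    return features
```

Running it on two logs and comparing the key sets, for example `set(old) - set(new)`, would show which features one worker reported and the other did not.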
Update: I looked through the NFD Operator, and the only feature that is actually missing is the kernel configuration. On s390x RHCOS the kernel configuration is located only at /usr/lib/modules/<kernel-version>/config. Starting with OCP 4.7 the NFD Operator does search for the config file under this path, but the path is not mounted inside the pod, which results in the error:

2021/03/25 15:00:59 ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-240.10.1.el8_3.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-240.10.1.el8_3.s390x/config /usr/lib/ostree-boot/config-4.18.0-240.10.1.el8_3.s390x /usr/lib/kernel/config-4.18.0-240.10.1.el8_3.s390x /usr/src/linux-headers-4.18.0-240.10.1.el8_3.s390x/.config /lib/modules/4.18.0-240.10.1.el8_3.s390x/build/.config /host-boot/config-4.18.0-240.10.1.el8_3.s390x]:

As for the network log messages: NFD checks the SR-IOV capabilities of each network interface, but only writes labels for "capable" and "configured". The per-interface results are written to the log, which produces the output shown.

Lastly:

2020/08/13 20:26:07 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:8080: connect: connection refused"

This seems to be a temporary connection problem between the master and worker pods.
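The kernel-config lookup described above can be illustrated with a small sketch. It is not NFD's implementation (NFD is written in Go); it only tries the candidate paths listed in the error message in order. The `root` parameter and the function name `find_kconfig` are assumptions added for illustration, so that the lookup can be tried against a scratch directory.

```python
from pathlib import Path
from typing import Optional

# Candidate locations copied from the error message in the worker log.
# {ver} is the running kernel version, e.g. "4.18.0-240.10.1.el8_3.s390x".
CANDIDATES = [
    "/proc/config.gz",
    "/usr/src/linux-{ver}/.config",
    "/usr/src/linux/.config",
    "/usr/lib/modules/{ver}/config",
    "/usr/lib/ostree-boot/config-{ver}",
    "/usr/lib/kernel/config-{ver}",
    "/usr/src/linux-headers-{ver}/.config",
    "/lib/modules/{ver}/build/.config",
    "/host-boot/config-{ver}",
]


def find_kconfig(kernel_version: str, root: str = "/") -> Optional[Path]:
    """Return the first candidate that exists under `root`, else None.

    `root` is a hypothetical stand-in for the container's view of the
    filesystem; it is not an NFD option.
    """
    for template in CANDIDATES:
        relative = template.format(ver=kernel_version).lstrip("/")
        path = Path(root) / relative
        if path.is_file():
            return path
    return None
```

If none of the host directories is mounted into the pod, every candidate is missing from the container's point of view and the lookup returns `None`, which corresponds to the "Failed to read kconfig" error even though the file exists on the node.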
1. Network: So NFD doesn't do any feature discovery for network interfaces unless they are SR-IOV enabled PCI devices?
2. Memory: no memory features are detected by NFD on Z, whereas the following features are detected on x86: https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/features.html#memory
3. Kernel configuration file: https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/deployment-and-usage#configuration
4. Storage: no storage information seems to be labelled or queried.

The message seems to be that NFD doesn't detect features on Z nodes because NFD is only designed to look for features on x86 and arm; therefore, when Z features go undiscovered, it works as designed. dgilmore is this correct?
1. Network: NFD only discovers SR-IOV as a feature.
2. Memory: As written, only numa and nvdimm are discovered.
3. Kernel configuration file: A possible workaround is to deploy NFD manually, not as an operator, and mount /usr/lib/modules inside the pod.
4. Storage: Detects nonrotational disks:
2021/03/25 15:00:59 storage-nonrotationaldisk = true

But yes, NFD currently has not implemented any s390x-specific features. We only get the general features, but you can already do a lot with custom features: https://kubernetes-sigs.github.io/node-feature-discovery/v0.8/get-started/features.html#custom
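On the network point, the per-interface check behind the "SR-IOV not supported" messages can be sketched roughly as follows. This is an illustration only, not NFD's code: it assumes the check reads `sriov_totalvfs` under each interface's `device` directory (as the log paths suggest), and it assumes that "configured" means `sriov_numvfs` is greater than zero. The label names and the helper names are likewise assumptions.

```python
from pathlib import Path
from typing import Dict


def read_int(path: Path) -> int:
    """Read a sysfs-style integer file; a missing or malformed file counts as 0."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def sriov_labels(sys_class_net: str = "/sys/class/net") -> Dict[str, bool]:
    """Derive node-level SR-IOV labels from per-interface sysfs entries (sketch)."""
    labels: Dict[str, bool] = {}
    base = Path(sys_class_net)
    if not base.is_dir():
        return labels
    for iface in sorted(base.iterdir()):
        device = iface / "device"
        # Virtual interfaces (veth*, tun0, vxlan_*) have no sriov_totalvfs file,
        # so they contribute nothing; NFD logs such interfaces instead.
        if read_int(device / "sriov_totalvfs") > 0:
            labels["network-sriov.capable"] = True
            if read_int(device / "sriov_numvfs") > 0:
                labels["network-sriov.configured"] = True
    return labels
```

On a node where no interface exposes `sriov_totalvfs`, the result is an empty dictionary, which matches the observation that no network labels appear while every interface produces a log line.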
Hello, I created an upstream fix for the kernel config: https://github.com/kubernetes-sigs/node-feature-discovery/pull/519 I'm now in the process of backporting it to 4.8: https://github.com/openshift/node-feature-discovery/pull/42
Thanks for working on this, Jan! Re-opening and targeting for OCP 4.8 jschinta! :)
Hello Eric, the PR for the backport needs the Bugzilla to be targeted to 4.8. Could you change the Target Release to 4.8? eparis
Hi Jan, I have changed the Target Release to 4.8.0. Are you the right assignee for this bug?
Hi Dan, yes you can assign this to me.
Hi @Jan, do you think this PR will be merged before the end of this sprint? If not, I'd like to add "Reviewed-in-Sprint" flag.
Hi @Dan, no, I don't think the PR will be merged this sprint.
Setting "reviewed-in-sprint+"
Still waiting on the Operator PR: https://github.com/openshift/cluster-nfd-operator/pull/164
The downstream backport https://github.com/openshift/node-feature-discovery/pull/42 has been merged.
Looks like NFD can now be installed and used properly on s390x openshift Version: 4.8.0-0.nightly-s390x-2021-05-07-075507
Hi @Tom, yes NFD works on 4.6 - 4.8. But you should currently still see the error with the kernel-config in the container log.
Oh I see, sorry, yes, those errors are still in the worker container logs:
```
ERROR: Failed to read kconfig: Failed to read kernel config from [ /proc/config.gz /usr/src/linux-4.18.0-293.el8.s390x/.config /usr/src/linux/.config /usr/lib/modules/4.18.0-293.el8.s390x/config /usr/lib/ostree-boot/config-4.18.0-293.el8.s390x /usr/lib/kernel/config-4.18.0-293.el8.s390x /usr/src/linux-headers-4.18.0-293.el8.s390x/.config /lib/modules/4.18.0-293.el8.s390x/build/.config /host-boot/config-4.18.0-293.el8.s390x]:
SR-IOV not supported for network interface: 0f291e079a988a8: open /host-sys/class/net/0f291e079a988a8/device/sriov_totalvfs: no such file or directory
```
Marking this as assigned since it does not appear like it is ready to be tested.
Hi Jan, looks like this bug is back to assigned - do you think it will be resolved before the end of this sprint? If not, I'd like to set a "reviewed-in-sprint" flag
Hi Dan, I can verify that the fix for the kernel config works in the latest NFD version, 4.8.0-202105131518.p0.