1. Please describe the problem:
It is not possible to ssh into a machine that was installed with the Fedora ELN distro. Since the same issue happens with a RHEL 9.3 installation running a Fedora ELN kernel, I'm filing this as a kernel issue. Also, on impacted machines it is not possible to perform dnf updates from HTTPS-based repositories, which again points to a possible kernel issue.
Note that some machines are not impacted, for example the Beaker host wsfd-advnetlab197.anl.eng.rdu2.dc.redhat.com. An example of an impacted host is wsfd-netdev54.ntdv.lab.eng.bos.redhat.com.

2. What is the Version-Release number of the kernel:
kernel-6.8.0-0.rc1.20240124git615d30064886.13.eln134

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
This worked fine with kernel-6.8.0-0.rc1.12.eln134

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
Install a Fedora ELN distro on the Beaker host wsfd-netdev54.ntdv.lab.eng.bos.redhat.com.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
I have not checked yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
Since the symptom is present in ssh, here's the log from sshd:

Feb 20 09:10:57 wsfd-netdev38.ntdv.lab.eng.bos.redhat.com sshd[1525]: ssh_dispatch_run_fatal: Connection from 10.43.3.80 port 51010: incomplete message [preauth]
Feb 20 09:11:23 wsfd-netdev38.ntdv.lab.eng.bos.redhat.com sshd[1552]: error: kex_input_kexinit: discard proposal: incomplete message [preauth]

Reproducible: Always
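The two sshd messages above share the suffix "incomplete message [preauth]". As a rough aid for checking other hosts, the following Python sketch counts such lines per reporting function. The regular expression and the function name count_incomplete_handshakes are illustrative assumptions, not part of any existing tool; the sample lines are the ones quoted above.

```python
import re
from collections import Counter

# Matches sshd journal lines ending in "incomplete message [preauth]" and
# captures the reporting function name; an optional "error: " prefix is skipped.
INCOMPLETE_RE = re.compile(
    r"sshd\[\d+\]: (?:error: )?(\w+): .*incomplete message \[preauth\]"
)


def count_incomplete_handshakes(lines):
    """Count pre-auth 'incomplete message' failures per sshd function name."""
    counts = Counter()
    for line in lines:
        match = INCOMPLETE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the two log lines quoted in this report.
SAMPLE_LOG = [
    "Feb 20 09:10:57 wsfd-netdev38.ntdv.lab.eng.bos.redhat.com sshd[1525]: "
    "ssh_dispatch_run_fatal: Connection from 10.43.3.80 port 51010: "
    "incomplete message [preauth]",
    "Feb 20 09:11:23 wsfd-netdev38.ntdv.lab.eng.bos.redhat.com sshd[1552]: "
    "error: kex_input_kexinit: discard proposal: incomplete message [preauth]",
]

if __name__ == "__main__":
    for name, count in count_incomplete_handshakes(SAMPLE_LOG).items():
        print(name, count)
```

Feeding it the sshd journal of a host under test gives a quick indication of whether that host shows the same pre-auth failure.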
Machine details:
* impacted: wsfd-netdev54.ntdv.lab.eng.bos.redhat.com, PowerEdge R730 with Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
* not impacted: wsfd-advnetlab197.anl.eng.rdu2.dc.redhat.com, PowerEdge R750 with Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
I've checked the following Rawhide kernel on an impacted machine, and the kernel/sshd works fine:
6.8.0-0.rc5.41.fc41.x86_64 + Fedora-Rawhide-20240220.n.0 Server x86_64
Seen also on wsfd-advnetlab63.anl.eng.rdu2.dc.redhat.com: PowerEdge R740 with Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
It seems the issue affects only the ELN kernel. The latest tested ELN kernel is kernel-6.8.0-0.rc5.41.eln136 - ssh fails on roughly 50% of systems. We could not identify a pattern of which systems are failing and which are not.

Here are the Beaker jobs that aborted because of a failing ssh connection:
https://beaker.engineering.redhat.com/jobs/?jobsearch-0.table=Whiteboard&jobsearch-0.operation=contains&jobsearch-0.value=2024-02-19T23%3A22%3A21.013496&jobsearch-1.table=Status&jobsearch-1.operation=is&jobsearch-1.value=Aborted

Jan Tluka has successfully tested the Fedora Rawhide kernel (see comment #2):
6.8.0-0.rc5.41.fc41.x86_64 + Fedora-Rawhide-20240220.n.0 Server x86_64
He ran the test on one server.

I have now scheduled a comprehensive test with Fedora-Rawhide-20240225.n.0 and kernel-6.8.0-0.rc5.20240222git39133352cbed.44.fc41 to confirm that Fedora Rawhide is working fine:
https://beaker.engineering.redhat.com/jobs/?jobsearch-0.value=2024-02-26T13%3A23%3A22.805320&jobsearch-0.table=Whiteboard&list_tgp_ordering=-id&list_tgp_no=1&list_tgp_limit=50&jobsearch-0.operation=contains&jobsearch--repetitions=1

Could somebody please check the differences between the ELN and Fedora Rawhide kernel configs to see if anything explains why the ELN kernel has trouble establishing SSH connections?
The Fedora-ELN distro has the same problem. Another issue with affected ELN kernels is that Beaker jobs fetching XML over HTTPS receive only the first chunk of the XML file. Based on this, we think that the problem is in the ELN kernel itself.
The problem occurred between kernels kernel-6.8.0-0.rc1.12.eln134 and kernel-6.8.0-0.rc1.20240124git615d30064886.13.eln134. The kernel-ark log diff is below (minus btrfs). We don't see anything obvious, though there are a couple of network changes there.

[dzickus@zickus kernel-ark]$ git log --no-merges --oneline kernel-6.8.0-0.rc1.12..kernel-6.8.0-0.rc1.615d30064886.13|grep -v btrfs
4b3f708e4eec [redhat] kernel-6.8.0-0.rc1.615d30064886.13
834bf76add3e eventfs: Save directory inodes in the eventfs_inode structure
728a55b08c48 [redhat] kernel-6.8.0-0.rc1.7ed2632ec7d7.12
7ed2632ec7d7 drm/ttm: fix ttm pool initialization for no-dma-device drivers
2b44760609e9 tracing: Ensure visibility when inserting an element into tracing_map
c06975c5f4e6 CI: include aarch64 in CKI container image gating
a5e0ace04fbf init: Kconfig: Disable -Wstringop-overflow for GCC-11
113a61863ecb Makefile: Enable -Wstringop-overflow globally
bccf33735c37 [redhat] New configs in fs/erofs
65f64d322a2d redhat: spec: Fix update_scripts run for CentOS builds
183a614b78ee Remove CONFIG_NET_EMATCH_STACK file for RHEL
6c5be56d7c70 [redhat] New configs in lib/Kconfig.debug
caef59dd2cf9 [redhat] New configs in drivers/vfio
8ff73a4522a9 [redhat] New configs in drivers/video
8c99692b64a5 [redhat] New configs in drivers/usb
91f51d2d954a [redhat] New configs in drivers/rtc
ba4cf919e073 [redhat] New configs in arch/x86
77cfdc08dfcd New configs in drivers/crypto
8002107c7d1b [redhat] New configs in drivers/mtd
1a4a64f05eca [redhat] New configs in drivers/misc
1504af5a539f [redhat] New configs in drivers/iio
0b16af2cf5bc net: bump CONFIG_MAX_SKB_FRAGS to 45
4b729efef0f3 Enable CONFIG_MARVELL_88Q2XXX_PHY
c7ec4f2d684e xen-netback: don't produce zero-size SKB frags

Since the Fedora Rawhide kernel is not affected, we think that the difference might be caused by the kernel config file. I have attached the kernel config files as a tarball. The diff reduced to the "net" keyword is below.
Kernel_Config_Files/kernel-6.8.0-0.rc5.41.eln136.x86_64.config
Kernel_Config_Files/kernel-6.8.0-0.rc5.20240222git39133352cbed.44.fc41.x86_64.config

Can you spot any possible suspects?

$ diff * | grep -i net
< CONFIG_COMPAT_NETLINK_MESSAGES=y
< CONFIG_NET_FOU=m
< CONFIG_NET_FOU_IP_TUNNELS=y
> # CONFIG_NET_FOU is not set
> # CONFIG_NET_FOU_IP_TUNNELS is not set
< CONFIG_NETFILTER_NETLINK_ACCT=m
> # CONFIG_NETFILTER_NETLINK_ACCT is not set
< # CONFIG_NETFILTER_NETLINK_GLUE_CT is not set
> CONFIG_NF_CT_NETLINK_TIMEOUT=m
> CONFIG_NF_CT_NETLINK_HELPER=m
> CONFIG_NETFILTER_NETLINK_GLUE_CT=y
< CONFIG_NETFILTER_XT_TARGET_LED=m
> # CONFIG_NETFILTER_XT_TARGET_LED is not set
< CONFIG_NETFILTER_XT_MATCH_IPCOMP=m
> # CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
< CONFIG_NETFILTER_XT_MATCH_L2TP=m
> # CONFIG_NETFILTER_XT_MATCH_L2TP is not set
< CONFIG_NETFILTER_XT_MATCH_NFACCT=m
> # CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
< CONFIG_NETFILTER_XT_MATCH_TIME=m
< CONFIG_NETFILTER_XT_MATCH_U32=m
> # CONFIG_NETFILTER_XT_MATCH_TIME is not set
> # CONFIG_NETFILTER_XT_MATCH_U32 is not set
< CONFIG_NET_DSA=m
< CONFIG_NET_DSA_TAG_NONE=m
< # CONFIG_NET_DSA_TAG_AR9331 is not set
< CONFIG_NET_DSA_TAG_BRCM_COMMON=m
< CONFIG_NET_DSA_TAG_BRCM=m
< CONFIG_NET_DSA_TAG_BRCM_LEGACY=m
< CONFIG_NET_DSA_TAG_BRCM_PREPEND=m
< CONFIG_NET_DSA_TAG_HELLCREEK=m
< CONFIG_NET_DSA_TAG_GSWIP=m
< CONFIG_NET_DSA_TAG_DSA_COMMON=m
< CONFIG_NET_DSA_TAG_DSA=m
< CONFIG_NET_DSA_TAG_EDSA=m
< CONFIG_NET_DSA_TAG_MTK=m
< CONFIG_NET_DSA_TAG_KSZ=m
< CONFIG_NET_DSA_TAG_OCELOT=m
< CONFIG_NET_DSA_TAG_OCELOT_8021Q=m
< CONFIG_NET_DSA_TAG_QCA=m
< CONFIG_NET_DSA_TAG_RTL4_A=m
< CONFIG_NET_DSA_TAG_RTL8_4=m
< # CONFIG_NET_DSA_TAG_RZN1_A5PSW is not set
< CONFIG_NET_DSA_TAG_LAN9303=m
< CONFIG_NET_DSA_TAG_SJA1105=m
< CONFIG_NET_DSA_TAG_TRAILER=m
< CONFIG_NET_DSA_TAG_XRS700X=m
> # CONFIG_NET_DSA is not set
< CONFIG_NET_SCH_SFB=m
> # CONFIG_NET_SCH_SFB is not set
< CONFIG_NET_SCH_TEQL=m
> # CONFIG_NET_SCH_TEQL is not set
< CONFIG_NET_SCH_DRR=m
> # CONFIG_NET_SCH_DRR is not set
< CONFIG_NET_SCH_CHOKE=m
< CONFIG_NET_SCH_QFQ=m
< CONFIG_NET_SCH_CODEL=m
> # CONFIG_NET_SCH_CHOKE is not set
> # CONFIG_NET_SCH_QFQ is not set
> # CONFIG_NET_SCH_CODEL is not set
< CONFIG_NET_SCH_HHF=m
< CONFIG_NET_SCH_PIE=m
< CONFIG_NET_SCH_FQ_PIE=m
> # CONFIG_NET_SCH_HHF is not set
> # CONFIG_NET_SCH_PIE is not set
< CONFIG_NET_SCH_PLUG=m
> # CONFIG_NET_SCH_PLUG is not set
< # CONFIG_NET_SCH_DEFAULT is not set
> CONFIG_NET_SCH_DEFAULT=y
> CONFIG_DEFAULT_NET_SCH="fq_codel"
< CONFIG_NET_CLS_BASIC=m
< CONFIG_NET_CLS_ROUTE4=m
> # CONFIG_NET_CLS_BASIC is not set
> # CONFIG_NET_CLS_ROUTE4 is not set
< CONFIG_NET_EMATCH=y
< CONFIG_NET_EMATCH_STACK=32
< CONFIG_NET_EMATCH_CMP=m
< CONFIG_NET_EMATCH_NBYTE=m
< CONFIG_NET_EMATCH_U32=m
< CONFIG_NET_EMATCH_META=m
< CONFIG_NET_EMATCH_TEXT=m
< CONFIG_NET_EMATCH_CANID=m
< CONFIG_NET_EMATCH_IPSET=m
< CONFIG_NET_EMATCH_IPT=m
> # CONFIG_NET_EMATCH is not set
< CONFIG_NET_ACT_IPT=m
< CONFIG_NET_ACT_NAT=m
> # CONFIG_NET_ACT_IPT is not set
> # CONFIG_NET_ACT_NAT is not set
< CONFIG_NET_ACT_SIMP=m
> # CONFIG_NET_ACT_SIMP is not set
< CONFIG_NET_ACT_CONNMARK=m
> # CONFIG_NET_ACT_CONNMARK is not set
< CONFIG_NET_ACT_SKBMOD=m
< CONFIG_NET_ACT_IFE=m
> # CONFIG_NET_ACT_SKBMOD is not set
> # CONFIG_NET_ACT_IFE is not set
< CONFIG_NET_ACT_GATE=m
< CONFIG_NET_IFE_SKBMARK=m
< CONFIG_NET_IFE_SKBPRIO=m
< CONFIG_NET_IFE_SKBTCINDEX=m
> # CONFIG_NET_ACT_GATE is not set
< CONFIG_NET_MPLS_GSO=m
> CONFIG_NET_MPLS_GSO=y
< CONFIG_NET_NSH=m
> CONFIG_NET_NSH=y
< CONFIG_NET_NCSI=y
> # CONFIG_NET_NCSI is not set
< CONFIG_NETROM=m
< # AX.25 network device drivers
< # end of AX.25 network device drivers
< CONFIG_NET_9P=m
< CONFIG_NET_9P_FD=m
< CONFIG_NET_9P_VIRTIO=m
< CONFIG_NET_9P_XEN=m
< CONFIG_NET_9P_RDMA=m
< # CONFIG_NET_9P_DEBUG is not set
> # CONFIG_NET_9P is not set
< CONFIG_NET_IFE=m
> # CONFIG_NET_IFE is not set
< CONFIG_PATA_NETCELL=m
> # CONFIG_PATA_NETCELL is not set
< CONFIG_FIREWIRE_NET=m
< CONFIG_NET_TEAM=m
< CONFIG_NET_TEAM_MODE_BROADCAST=m
< CONFIG_NET_TEAM_MODE_ROUNDROBIN=m
< CONFIG_NET_TEAM_MODE_RANDOM=m
< CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=m
< CONFIG_NET_TEAM_MODE_LOADBALANCE=m
> # CONFIG_NET_TEAM is not set
< CONFIG_NETKIT=y
> # CONFIG_NETKIT is not set
< CONFIG_NET_DSA_BCM_SF2=m
< CONFIG_NET_DSA_LOOP=m
< CONFIG_NET_DSA_HIRSCHMANN_HELLCREEK=m
< # CONFIG_NET_DSA_LANTIQ_GSWIP is not set
< CONFIG_NET_DSA_MT7530=m
< CONFIG_NET_DSA_MT7530_MDIO=m
< CONFIG_NET_DSA_MT7530_MMIO=m
< # CONFIG_NET_DSA_MV88E6060 is not set
< # CONFIG_NET_DSA_MICROCHIP_KSZ_COMMON is not set
< CONFIG_NET_DSA_MV88E6XXX=m
< CONFIG_NET_DSA_MV88E6XXX_PTP=y
< # CONFIG_NET_DSA_AR9331 is not set
< CONFIG_NET_DSA_QCA8K=m
< CONFIG_NET_DSA_QCA8K_LEDS_SUPPORT=y
< # CONFIG_NET_DSA_SJA1105 is not set
< CONFIG_NET_DSA_XRS700X=m
< CONFIG_NET_DSA_XRS700X_I2C=m
< CONFIG_NET_DSA_XRS700X_MDIO=m
< CONFIG_NET_DSA_REALTEK=m
< # CONFIG_NET_DSA_REALTEK_MDIO is not set
< # CONFIG_NET_DSA_REALTEK_SMI is not set
< CONFIG_NET_DSA_REALTEK_RTL8365MB=m
< CONFIG_NET_DSA_REALTEK_RTL8366RB=m
< CONFIG_NET_DSA_SMSC_LAN9303=m
< CONFIG_NET_DSA_SMSC_LAN9303_I2C=m
< CONFIG_NET_DSA_SMSC_LAN9303_MDIO=m
< # CONFIG_NET_DSA_VITESSE_VSC73XX_SPI is not set
< # CONFIG_NET_DSA_VITESSE_VSC73XX_PLATFORM is not set
< CONFIG_NET_VENDOR_3COM=y
< CONFIG_NET_VENDOR_ADAPTEC=y
< CONFIG_NET_VENDOR_AGERE=y
> # CONFIG_NET_VENDOR_3COM is not set
> # CONFIG_NET_VENDOR_ADAPTEC is not set
> # CONFIG_NET_VENDOR_AGERE is not set
< CONFIG_NET_VENDOR_ALTEON=y
> # CONFIG_NET_VENDOR_ALTEON is not set
< CONFIG_PCNET32=m
> # CONFIG_PCNET32 is not set
< CONFIG_NET_VENDOR_ARC=y
> # CONFIG_NET_VENDOR_ARC is not set
< CONFIG_NET_VENDOR_CADENCE=y
> # CONFIG_NET_VENDOR_CADENCE is not set
< CONFIG_NET_VENDOR_DAVICOM=y
> # CONFIG_NET_VENDOR_DAVICOM is not set
< CONFIG_NET_TULIP=y
> # CONFIG_NET_TULIP is not set
< # CONFIG_BE2NET_HWMON is not set
< CONFIG_BE2NET_BE2=y
< CONFIG_BE2NET_BE3=y
> CONFIG_BE2NET_HWMON=y
> # CONFIG_BE2NET_BE2 is not set
> # CONFIG_BE2NET_BE3 is not set
< CONFIG_NET_VENDOR_ENGLEDER=y
> # CONFIG_NET_VENDOR_ENGLEDER is not set
< # CONFIG_NET_VENDOR_FUJITSU is not set
< CONFIG_NET_VENDOR_FUNGIBLE=y
> # CONFIG_NET_VENDOR_FUNGIBLE is not set
< # CONFIG_NET_VENDOR_HUAWEI is not set
> CONFIG_NET_VENDOR_HUAWEI=y
< CONFIG_NET_VENDOR_ADI=y
< CONFIG_NET_VENDOR_LITEX=y
> # CONFIG_NET_VENDOR_ADI is not set
> # CONFIG_NET_VENDOR_LITEX is not set
< CONFIG_NET_VENDOR_MICREL=y
> # CONFIG_NET_VENDOR_MICREL is not set
< CONFIG_NET_VENDOR_NATSEMI=y
< CONFIG_NET_VENDOR_NETERION=y
> # CONFIG_NET_VENDOR_NATSEMI is not set
> # CONFIG_NET_VENDOR_NETERION is not set
< CONFIG_NET_VENDOR_8390=y
< CONFIG_PCMCIA_AXNET=m
< CONFIG_PCMCIA_PCNET=m
< CONFIG_NET_VENDOR_NVIDIA=y
> # CONFIG_NET_VENDOR_NVIDIA is not set
< CONFIG_NET_VENDOR_PACKET_ENGINES=y
> # CONFIG_NET_VENDOR_PACKET_ENGINES is not set
< CONFIG_NET_VENDOR_QUALCOMM=y
< CONFIG_RMNET=m
< CONFIG_NET_VENDOR_RDC=y
> # CONFIG_NET_VENDOR_QUALCOMM is not set
> # CONFIG_NET_VENDOR_RDC is not set
< CONFIG_NET_VENDOR_SILAN=y
< CONFIG_NET_VENDOR_SIS=y
> # CONFIG_NET_VENDOR_SILAN is not set
> # CONFIG_NET_VENDOR_SIS is not set
< CONFIG_NET_VENDOR_SMSC=y
> # CONFIG_NET_VENDOR_SMSC is not set
< CONFIG_NET_VENDOR_SUN=y
> # CONFIG_NET_VENDOR_SUN is not set
< CONFIG_NET_VENDOR_TEHUTI=y
< CONFIG_NET_VENDOR_TI=y
< CONFIG_NET_VENDOR_VERTEXCOM=y
< CONFIG_NET_VENDOR_VIA=y
< CONFIG_NET_VENDOR_WANGXUN=y
< CONFIG_NET_VENDOR_WIZNET=y
< CONFIG_WIZNET_W5100=m
< CONFIG_WIZNET_W5300=m
< # CONFIG_WIZNET_BUS_DIRECT is not set
< # CONFIG_WIZNET_BUS_INDIRECT is not set
< CONFIG_WIZNET_BUS_ANY=y
< CONFIG_WIZNET_W5100_SPI=m
< CONFIG_NET_VENDOR_XILINX=y
< CONFIG_NET_VENDOR_XIRCOM=y
> # CONFIG_NET_VENDOR_TEHUTI is not set
> # CONFIG_NET_VENDOR_TI is not set
> # CONFIG_NET_VENDOR_VERTEXCOM is not set
> # CONFIG_NET_VENDOR_VIA is not set
> # CONFIG_NET_VENDOR_WANGXUN is not set
> # CONFIG_NET_VENDOR_WIZNET is not set
> # CONFIG_NET_VENDOR_XILINX is not set
< CONFIG_USB_NET_SR9700=m
> # CONFIG_USB_NET_SR9700 is not set
< CONFIG_USB_NET_AQC111=m
> # CONFIG_USB_NET_AQC111 is not set
< CONFIG_XEN_NETDEV_BACKEND=m
< # CONFIG_REGULATOR_NETLINK_EVENTS is not set
< CONFIG_DVB_NET=y
< CONFIG_DVB_NETUP_UNIDVB=m
< CONFIG_LEDS_TRIGGER_NETDEV=m
> # CONFIG_LEDS_TRIGGER_NETDEV is not set
< CONFIG_SNET_VDPA=m
> # CONFIG_SNET_VDPA is not set
< # Network Analyzer, Impedance Converters
< # end of Network Analyzer, Impedance Converters
Created attachment 2018951 [details]
Kernel_Config_Files for Fedora Rawhide kernel (working) and ELN kernel (broken)

Kernel_Config_Files/kernel-6.8.0-0.rc5.41.eln136.x86_64.config => broken kernel
Kernel_Config_Files/kernel-6.8.0-0.rc5.20240222git39133352cbed.44.fc41.x86_64.config => works fine
I can confirm that the Fedora Rawhide kernel is not affected:
6.8.0-0.rc5.20240222git39133352cbed.44.fc41.x86_64 - runs fine

The latest ELN kernel, kernel-6.8.0-0.rc6.49.eln136, still has the same problem - ssh is not working on some servers.

Here is an example of a console log from Beaker:
https://beaker.engineering.redhat.com/recipes/15636736#task174154981
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2024/02/89666/8966696/15636736/console.log

ssh -vvv -E ssh_log_kernel-6.8.0-0.rc6.49.eln136.log root.eng.brq2.redhat.com

ends with
===============================================================
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: aes256-gcm MAC: <implicit> compression: none
debug1: kex: client->server cipher: aes256-gcm MAC: <implicit> compression: none
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Connection closed by 10.37.153.175 port 22
===============================================================
Created attachment 2019132 [details] ssh log showing the problem to server running kernel-6.8.0-0.rc6.49.eln136 ssh -vvv -E ssh_log_kernel-6.8.0-0.rc6.49.eln136.log root.eng.brq2.redhat.com
Marcelo Leitner has found that reverting the commit

0b16af2cf5bc net: bump CONFIG_MAX_SKB_FRAGS to 45

fixes the problem. Note that the effect even varied by NIC driver: igb showed one behavior and mlx5 another.
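One detail worth noting: the option name CONFIG_MAX_SKB_FRAGS does not contain the string "net", so a name-based filter such as `grep -i net` over a config diff cannot list it. The following Python sketch compares two kernel config files option by option instead of by keyword. The function names and the two sample fragments are hypothetical; in particular the value 17 is an assumed stand-in for an unbumped setting and is not taken from the attached config files.

```python
import re


def parse_kconfig(text):
    """Map option name -> value; '# CONFIG_X is not set' is stored as None."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset:
            options[unset.group(1)] = None
            continue
        assigned = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if assigned:
            options[assigned.group(1)] = assigned.group(2)
    return options


def config_differences(left_text, right_text):
    """Return {option: (left_value, right_value)} for options that differ.

    An option missing from a file is treated the same as 'is not set'.
    """
    left, right = parse_kconfig(left_text), parse_kconfig(right_text)
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(left.keys() | right.keys())
        if left.get(name) != right.get(name)
    }


# Hypothetical fragments for illustration only; not the attached files.
SAMPLE_A = "CONFIG_MAX_SKB_FRAGS=45\nCONFIG_EXAMPLE_ONLY=y\n"
SAMPLE_B = "CONFIG_MAX_SKB_FRAGS=17\n# CONFIG_EXAMPLE_ONLY is not set\n"

if __name__ == "__main__":
    for name, (a, b) in config_differences(SAMPLE_A, SAMPLE_B).items():
        print(f"{name}: {a} -> {b}")
```

Running it against the two real config files (read with `open(path).read()`) and looking up `"CONFIG_MAX_SKB_FRAGS"` in the result would show whether the two kernels were built with different values for that option.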