Bug 1537597 - ssh_dispatch_run_fatal: Connection to <server-ip> port 22: message authentication code incorrect [NEEDINFO]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 27
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-23 15:04 UTC by Tadej Janež
Modified: 2018-08-29 15:25 UTC
CC List: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-29 15:25:17 UTC
Type: Bug
Embargoed:
jforbes: needinfo?


Attachments
Debug log of failed scp transfer with F27 openssh (52.66 KB, text/plain)
2018-01-23 16:30 UTC, Tadej Janež
Debug log of failed scp transfer with F26 openssh (49.60 KB, text/plain)
2018-01-23 16:30 UTC, Tadej Janež
Debug log of successful scp transfer with F26 openssh (62.41 KB, text/plain)
2018-01-23 16:31 UTC, Tadej Janež

Description Tadej Janež 2018-01-23 15:04:50 UTC
Description of problem:
After upgrading my workstation from F26 to F27, I'm no longer able to download files using scp, rsync, ... because the SSH connection breaks with the following error:
ssh_dispatch_run_fatal: Connection to <server-ip> port 22: message authentication code incorrect


Version-Release number of selected component (if applicable):
openssh-clients-7.6p1-3.fc27.x86_64


How reproducible:
Always.


Steps to Reproduce:
- scp <remote-server>:<path-to-a-larger-file> ./

or:
- rsync --partial --progress <remote-server>:<path-to-a-larger-file> ./

Actual results:
Download breaks at random points, for example:
[tadej@toronto production-dbs]$ scp <remote-server>:/home/genialis/genialis_base_dump-20180123-043002.gz ./
genialis_base_dump-20180123-043002.gz                                                                       0%    0     0.0KB/s   --:-- ETA
ssh_dispatch_run_fatal: Connection to <server-ip> port 22: message authentication code incorrect
lost connection
[tadej@toronto production-dbs]$ scp <remote-server>:/home/genialis/genialis_base_dump-20180123-043002.gz ./
genialis_base_dump-20180123-043002.gz                                                                       4% 4736KB   1.3MB/s   01:18 ETA
ssh_dispatch_run_fatal: Connection to <server-ip> port 22: message authentication code incorrect
lost connection


Expected results:
Download would complete normally.


Additional info:
I've tried downloading from a number of different remote servers, all of them running the latest versions of CentOS or RHEL 7.4 with openssh-server-7.4p1-13.el7_4.x86_64. Same errors occurred with all of them.

I've also tried downloading from an old Debian server. There, I get a "Corrupted MAC on input." error before the ssh_dispatch_run_fatal error:
[tadej@toronto production-dbs]$ scp <remote-server>:/home/genialis/genialis_base_dump-20180123-043002.gz ./
genialis_base_dump-20180123-043002.gz                                                                       2% 8144KB   8.0MB/s   00:35 ETA
Corrupted MAC on input.
ssh_dispatch_run_fatal: Connection to <server-ip> port 22: message authentication code incorrect
lost connection

If you need more assistance in debugging the issue, I'm happy to help.

Comment 1 Jakub Jelen 2018-01-23 15:37:34 UTC
I have not noticed this on Fedora 27, which I use every day, but admittedly I do not use it to transfer very large amounts of data.

First of all, seeing the debug log (with -vvv arguments to scp for example) should give us some idea what is going on. Second, I would check if the old version of OpenSSH still works (either by downgrading to the older F27 packages or to the F26 version).

This error message looks like something is inspecting the packets and is modifying them on the network. Do you see these problems even if you try to transfer files to "localhost"?
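
To illustrate why any modification of a packet in transit produces this kind of error, here is a minimal Python sketch. It uses HMAC-SHA256 from the standard library as a stand-in; the MAC that OpenSSH actually negotiates for a given connection may differ, and the key, payload, and helper names are hypothetical.

```python
import hashlib
import hmac


def mac_ok(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Return True if tag is the HMAC-SHA256 of payload under key."""
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the lowest bit of byte `index` flipped."""
    corrupted = bytearray(data)
    corrupted[index] ^= 0x01
    return bytes(corrupted)


if __name__ == "__main__":
    key = b"hypothetical-session-key"           # assumption: not a real SSH key
    payload = b"chunk of file data" * 64         # assumption: stand-in for packet data
    tag = hmac.new(key, payload, hashlib.sha256).digest()

    print(mac_ok(key, payload, tag))                 # intact packet: True
    print(mac_ok(key, flip_bit(payload, 100), tag))  # one bit damaged: False
```

A single flipped bit anywhere between sender and receiver (network equipment, NIC, driver) is enough for the check to fail, which is why the receiving side drops the connection rather than accept the data.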

Comment 2 Tadej Janež 2018-01-23 16:28:17 UTC
Thanks for such a quick response!

(In reply to Jakub Jelen from comment #1)
> 
> First of all, seeing the debug log (with -vvv arguments to scp for example)
> should give us some idea what is going on. 

No problem, I'll attach scp's debug log.

> Second, I would check if
> the old version of OpenSSH still works (either by downgrading to the older
> F27 packages or to the F26 version).

In terms of bisection, I went straight to F26's latest version:
openssh-clients.x86_64 7.5p1-4.fc26

I was able to reproduce the problem there also. I'll attach the output of two runs with F26's openssh, one for a successful download and one for an unsuccessful download.

> This error message looks like something is inspecting the packets and is
> modifying them on the network. Do you see these problems even if you try to
> transfer files to "localhost"?

I couldn't reproduce the issue when transferring a 2GB file through the SSH server on localhost 5 times.

I have a secondary machine that still runs F26, and I could connect it in the same way as the main F27 machine. Would it be useful if I tried the transfers there?
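
A repeat test like the one above can be scripted. The following is a minimal Python sketch, not a tool used in this report: it runs the same scp command several times and counts the runs that exit with a non-zero status. The helper names and the `run` parameter (there so the loop can be tried without a network) are illustrative assumptions.

```python
import subprocess
from typing import Callable, List, Sequence


def scp_command(source: str, dest: str) -> List[str]:
    """Build the argv for a single scp download."""
    return ["scp", source, dest]


def count_failures(argv: Sequence[str], attempts: int,
                   run: Callable = subprocess.run) -> int:
    """Run argv `attempts` times and count runs with a non-zero exit status."""
    failures = 0
    for _ in range(attempts):
        result = run(list(argv))
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, `count_failures(scp_command("localhost:/tmp/testfile", "./"), 5)` would attempt five transfers through the local SSH server; the path is a placeholder.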

Comment 3 Tadej Janež 2018-01-23 16:30:09 UTC
Created attachment 1384931
Debug log of failed scp transfer with F27 openssh

Comment 4 Tadej Janež 2018-01-23 16:30:55 UTC
Created attachment 1384933
Debug log of failed scp transfer with F26 openssh

Comment 5 Tadej Janež 2018-01-23 16:31:47 UTC
Created attachment 1384934
Debug log of successful scp transfer with F26 openssh

Comment 6 Tadej Janež 2018-01-23 16:35:37 UTC
(In reply to Tadej Janež from comment #2)
> 
> I was able to reproduce the problem there also. I'll attach the output of
> two runs with F26's openssh, one for a successful download and one for an
> unsuccessful download.

FWIW, I was also able to successfully download the file with F27's openssh.

Comment 7 Jakub Jelen 2018-01-23 17:17:26 UTC
Do I understand correctly that the downgraded Fedora 26 package on Fedora 27 fails the same way as the new one, but the Fedora 26 box on the same network works?

In that case, it sounds like a bug in the kernel, the network, or some hardware issue. Is it a normal LAN, Wi-Fi, or something special?

If so, there is no way to fix it in openssh. The debug logs do not show anything wrong.

I saw similar issues, which ended up as hardware errors [1], but that is hard to verify unless you try to replace the network card, or try a different one.

[1] https://unix.stackexchange.com/a/288550/121504

Comment 8 Tadej Janež 2018-01-25 15:55:09 UTC
(In reply to Jakub Jelen from comment #7)
> Do I understand correctly that the downgraded Fedora 26 package on Fedora 27
> fails the same way as the new one, but the Fedora 26 box on the same network
> works?

Yes, that is the case.

> In that case, it sounds like a bug in the kernel, the network, or some hardware
> issue. Is it a normal LAN, Wi-Fi, or something special?

I was using the ordinary LAN port of the Dell Thunderbolt TB16 docking station connected to a Dell XPS 15 9560 laptop.

> If so, there is no way to fix it in openssh. The debug logs do not show
> anything wrong.
> 
> I saw similar issues, which ended up as hardware errors [1], but that is
> hard to verify unless you try to replace the network card, or try a different one.
> 
> [1] https://unix.stackexchange.com/a/288550/121504

You are right. If I download the files through the laptop's WiFi or another Ethernet device connected through USB, things work fine.

So, I have to debug further to see if this is a bug in the kernel or some hardware issue.

Thanks for your help!

Comment 9 George B. Magklaras 2018-02-15 12:17:41 UTC
I confirm I am facing the same issue with a Precision 5520 when docked on a TB16 docking station.

dnf list installed | grep openssh
openssh.x86_64                             7.6p1-5.fc27                @updates 
openssh-askpass.x86_64                     7.6p1-5.fc27                @updates 
openssh-clients.x86_64                     7.6p1-5.fc27                @updates 
openssh-server.x86_64                      7.6p1-5.fc27                @updates 

The problem goes away when I choose a different NIC (i.e. the on-board WiFi). Occasionally, I see this in dmesg from (what I think is) the TB16 docking station:

Feb 15 11:52:09 slartibartfast3 kernel: pcieport 0000:00:1d.6:    [12] Replay Timer Timeout  
Feb 15 11:52:09 slartibartfast3 kernel: pcieport 0000:00:1d.6:   device [8086:a11e] error status/mask=00001000/00002000
Feb 15 11:52:09 slartibartfast3 kernel: pcieport 0000:00:1d.6: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00ee(Transmitter ID)


This might be relevant, or it might not. The issue has also become worse since I upgraded to kernel 4.14.18-300.fc27.x86_64, which is not good.
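
To judge whether such messages coincide with the failed transfers, it can help to count them. The following is a minimal Python sketch that tallies "PCIe Bus Error" lines per device, severity, and layer from kernel log text on standard input. The regular expression is modelled only on the log lines quoted above and may not match other kernel versions; the function and script names are hypothetical.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Pattern modelled on the "PCIe Bus Error" line quoted in this comment.
AER_LINE = re.compile(
    r"pcieport (?P<device>\S+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def count_aer_errors(lines: Iterable[str]) -> Counter:
    """Count PCIe bus error lines, keyed by (device, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            key = (match["device"], match["severity"].strip(), match["layer"].strip())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (device, severity, layer), n in count_aer_errors(sys.stdin).items():
        print(f"{device} {severity} {layer}: {n}")
```

Assuming the script is saved as `aer_count.py` (a made-up name), it could be fed with `journalctl -k | python3 aer_count.py`.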

My Thunderbolt config below, from lspci -v:

06:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 126
	Bus: primary=06, secondary=07, subordinate=3e, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: d4000000-ea0fffff [size=353M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

07:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 127
	Bus: primary=07, secondary=08, subordinate=08, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: ea000000-ea0fffff [size=1M]
	Prefetchable memory behind bridge: None
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

07:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 128
	Bus: primary=07, secondary=09, subordinate=3d, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: d4000000-e9efffff [size=351M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

07:02.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 129
	Bus: primary=07, secondary=3e, subordinate=3e, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: e9f00000-e9ffffff [size=1M]
	Prefetchable memory behind bridge: None
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

08:00.0 System peripheral: Intel Corporation DSL6340 Thunderbolt 3 NHI [Alpine Ridge 2C 2015]
	Subsystem: Device 2222:1111
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Memory at ea000000 (32-bit, non-prefetchable) [size=256K]
	Memory at ea040000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: thunderbolt
	Kernel modules: thunderbolt

09:00.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 130
	Bus: primary=09, secondary=0a, subordinate=3d, sec-latency=0
	I/O behind bridge: 00002000-00002fff [size=4K]
	Memory behind bridge: d4000000-e9efffff [size=351M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

0a:01.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 131
	Bus: primary=0a, secondary=0b, subordinate=0b, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: None
	Prefetchable memory behind bridge: None
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

0a:04.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 132
	Bus: primary=0a, secondary=0c, subordinate=3d, sec-latency=0
	I/O behind bridge: 00002000-00002fff [size=4K]
	Memory behind bridge: d4000000-e9efffff [size=351M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

0c:00.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 133
	Bus: primary=0c, secondary=0d, subordinate=3d, sec-latency=0
	I/O behind bridge: 00002000-00002fff [size=4K]
	Memory behind bridge: d4000000-e9efffff [size=351M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

0d:01.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 134
	Bus: primary=0d, secondary=0e, subordinate=0e, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: d4000000-d40fffff [size=1M]
	Prefetchable memory behind bridge: None
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

0d:04.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 135
	Bus: primary=0d, secondary=0f, subordinate=3d, sec-latency=0
	I/O behind bridge: 00002000-00002fff [size=4K]
	Memory behind bridge: d4100000-e9efffff [size=350M]
	Prefetchable memory behind bridge: 0000000090000000-00000000b1ffffff [size=544M]
	Capabilities: <access denied>
	Kernel driver in use: pcieport
	Kernel modules: shpchp

Comment 10 Jakub Jelen 2018-02-15 14:20:50 UTC
That is certainly not an OpenSSH bug. I am moving it to the kernel, which is probably responsible for the hardware support. Hopefully, they will be able to figure out more.

Comment 11 Jeremy Cline 2018-02-15 18:03:01 UTC
This sounds like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1460789. Can you see if this issue is still present with the 4.15 kernel? It should be in updates-testing at the moment.

Thanks!

Comment 12 George B. Magklaras 2018-02-16 14:03:10 UTC
I cannot personally move to the 4.15 kernel at the moment, as I am running bumblebee on that system and would like to do more testing before I do so (one thing is to mess up thunderbolt wired networking and another to do this *and* mess up my cuda/nvidia setup :-) ) . What I can confirm is that the workaround from 1460789 does work, so chances are it's the same bug because doing a:

ethtool --offload $DEVNAME rx off

does indeed work, and I am able to run the 4.14.18-300.fc27.x86_64 kernel with the wired interface. It would be good for others to verify that this works for them; if it does, please patch the 4.14.x kernels before you push 4.15 into production. People who run known complex setups (such as bumblebee) will feel safer and thank you for this, IMHO.

Cheers,
GM
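
The workaround above could be wrapped in a small script. This is a minimal Python sketch under assumptions: the function names are hypothetical, and only the `ethtool --show-offload` and `ethtool --offload <dev> rx off` invocations are real commands. By default it only prints what it would run; changing offload settings normally requires root, and settings made with ethtool are normally lost on reboot.

```python
import shlex
import subprocess
from typing import List


def rx_offload_commands(devname: str) -> List[List[str]]:
    """Commands to show offload state, disable RX checksum offload, show again."""
    return [
        ["ethtool", "--show-offload", devname],
        ["ethtool", "--offload", devname, "rx", "off"],
        ["ethtool", "--show-offload", devname],
    ]


def disable_rx_offload(devname: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping on the first error."""
    for argv in rx_offload_commands(devname):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `disable_rx_offload("eth0")` prints the three commands for review; `eth0` is a placeholder for the wired interface behind the dock, and passing `dry_run=False` actually runs them.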

Comment 13 Justin M. Forbes 2018-07-23 15:35:25 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.

If you experience different issues, please open a new bug report for those.

Comment 14 Justin M. Forbes 2018-08-29 15:25:17 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

