Bug 1913279 - [Tracker] PTP faults and sync loss on Mellanox MT27800 Family [ConnectX-5]
Summary: [Tracker] PTP faults and sync loss on Mellanox MT27800 Family [ConnectX-5]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Unknown
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sudha Ponnaganti
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On: 1918456
Blocks:
 
Reported: 2021-01-06 12:30 UTC by Vitaly Grinberg
Modified: 2021-05-11 15:35 UTC

Fixed In Version: 4.7.0-0.nightly-2021-03-01-085007
Doc Type: Release Note
Doc Text:
Issue: Precision Time Protocol (PTP) faults are observed on the Mellanox MT27800 Family [ConnectX-5] of adapter cards. In the ptp4l log, errors are observed which disturb clock synchronization. These errors result in larger than normal system clock updates due to the NIC hardware clock resetting. The root cause of this issue is unknown and no workaround currently exists.
Clone Of:
Environment:
Last Closed: 2021-05-11 15:35:11 UTC
Target Upstream Version:
Embargoed:



Description Vitaly Grinberg 2021-01-06 12:30:34 UTC
Description of problem:
PTP faults are observed in the linuxptp-daemon log:
2021-01-05T13:29:23.472548651+00:00 stdout F ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T13:29:39.490561597+00:00 stdout F ptp4l[180012.630]: port 1: FAULTY to LISTENING on INIT_COMPLETE

2021-01-05T14:49:36.872517937+00:00 stdout F ptp4l[184810.012]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T14:49:56.101147349+00:00 stdout F ptp4l[184829.241]: port 1: FAULTY to LISTENING on INIT_COMPLETE

2021-01-05T18:52:59.733716420+00:00 stdout F ptp4l[199412.873]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T18:53:27.548388704+00:00 stdout F ptp4l[199440.688]: port 1: FAULTY to LISTENING on INIT_COMPLETE

Each fault leads to PTP synchronization loss.
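
As a rough aid for quantifying these faults, the following Python sketch pairs each transition into FAULTY with the next transition out of FAULTY on the same port and reports how long the port stayed faulty. It assumes the log line layout shown in the excerpt above; the function name fault_windows and the regular expression are illustrative and not part of linuxptp. Note that the port returns to LISTENING, not SLAVE, so the real synchronization outage is likely longer than the value computed here.

```python
import re
from datetime import datetime, timezone

# Assumed line layout, taken from the excerpt above:
# <UTC timestamp> stdout F ptp4l[<monotonic seconds>]: port N: OLD to NEW on EVENT
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\+00:00 "
    r"stdout F ptp4l\[(?P<mono>\d+\.\d+)\]: "
    r"port (?P<port>\d+): (?P<old>\w+) to (?P<new>\w+) on (?P<event>\w+)"
)


def fault_windows(lines):
    """Return (port, fault_start_utc, seconds_in_FAULTY) for each completed fault."""
    open_faults = {}  # port -> (wall-clock start, ptp4l monotonic start)
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and unrelated log output
        port = int(m["port"])
        mono = float(m["mono"])
        if m["new"] == "FAULTY":
            start = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
            open_faults[port] = (start.replace(tzinfo=timezone.utc), mono)
        elif m["old"] == "FAULTY" and port in open_faults:
            start, start_mono = open_faults.pop(port)
            # Use the ptp4l monotonic stamp for the duration, not the wall clock.
            windows.append((port, start, round(mono - start_mono, 3)))
    return windows


if __name__ == "__main__":
    import sys

    for port, start, seconds in fault_windows(sys.stdin):
        print(f"port {port}: FAULTY at {start.isoformat()} for {seconds} s")
```

The script reads the log from standard input, so saved linuxptp-daemon-container output can be piped into it. Durations are taken from the ptp4l monotonic timestamps in brackets so that they are unaffected by any system clock step caused by the fault itself.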


Version-Release number of selected component (if applicable):

86:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
	Subsystem: Mellanox Technologies Device 0091
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 205
	NUMA node: 1
	Region 0: Memory at d6000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at d3800000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn+
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: Mellanox ConnectX-5 Dual Port 25GbE SFP Network Adapter
		Read-only fields:
			[PN] Part number: 0TDNNT
			[EC] Engineering changes: A00
			[MN] Manufacture ID: 1028
			[SN] Serial number: IL0TDNNT7403102A0095
			[VA] Vendor specific: DSV1028VPDR.VER2.1
			[VB] Vendor specific: FFV16.25.40.62
			[VC] Vendor specific: NPY2
			[VD] Vendor specific: PMT78
			[VE] Vendor specific: NMVMellanox Technologies, Inc.
			[VF] Vendor specific: DTINIC
			[VG] Vendor specific: DCM1001FFFFFF1202FFFFFF1403FFFFFF1604FFFFFF2101FFFFFF2302FFFFFF2503FFFFFF2704FFFFFF
			[VH] Vendor specific: L1D0
			[RV] Reserved: checksum good, 3 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 5, Total VFs: 5, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 2, stride: 1, Device ID: 1018
		Supported Page Size: 000007ff, System Page Size: 00000001
		Region 0: Memory at 00000000d8500000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [1c0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [230 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core


How reproducible:
The setup consists of a network switch acting as a boundary clock master and an OpenShift 4.7 bare-metal worker node as a subordinate.
The PTP telecom profile is configured.

Steps to Reproduce:
Monitor the linuxptp-daemon-container log

Actual results:
Faults are observed during an overnight run.

Expected results:
No faults

Additional info:
The logs were compared with those from a different station that uses Intel NICs. No faults were observed there.

Comment 1 Vitaly Grinberg 2021-01-06 19:30:48 UTC
Synchronization faults are not always reproducible.
We observed 24 hours of fault-free operation today.

