Bug 1913279

Summary: [Tracker] PTP faults and sync loss on Mellanox MT27800 Family [ConnectX-5]
Product: OpenShift Container Platform Reporter: Vitaly Grinberg <vgrinber>
Component: Unknown Assignee: Sudha Ponnaganti <sponnaga>
Status: CLOSED NOTABUG QA Contact: Jianwei Hou <jhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.7 CC: aos-bugs, ayosef, dosmith, eparis, fpaoline, jokerman, keyoung, kquinn, sscheink
Target Milestone: --- Keywords: Tracking
Target Release: 4.8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 4.7.0-0.nightly-2021-03-01-085007 Doc Type: Release Note
Doc Text:
Issue: Precision Time Protocol (PTP) faults are observed on Mellanox MT27800 Family [ConnectX-5] adapter cards. Errors in the ptp4l log show that clock synchronization is disturbed. Because the NIC hardware clock resets, these errors result in larger than normal system clock updates. The root cause of this issue is unknown and no workaround currently exists.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-11 15:35:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1918456    
Bug Blocks:    

Description Vitaly Grinberg 2021-01-06 12:30:34 UTC
Description of problem:
PTP faults are observed in the linuxptp-daemon log:
2021-01-05T13:29:23.472548651+00:00 stdout F ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T13:29:39.490561597+00:00 stdout F ptp4l[180012.630]: port 1: FAULTY to LISTENING on INIT_COMPLETE

2021-01-05T14:49:36.872517937+00:00 stdout F ptp4l[184810.012]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T14:49:56.101147349+00:00 stdout F ptp4l[184829.241]: port 1: FAULTY to LISTENING on INIT_COMPLETE

2021-01-05T18:52:59.733716420+00:00 stdout F ptp4l[199412.873]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

2021-01-05T18:53:27.548388704+00:00 stdout F ptp4l[199440.688]: port 1: FAULTY to LISTENING on INIT_COMPLETE

Each fault leads to PTP synchronization loss.
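
For triage, the fault windows can be extracted from the log automatically. The following is a minimal Python sketch, not part of the linuxptp-daemon tooling: the function name fault_windows and the regular expression are illustrative assumptions based only on the line format shown above. It pairs each transition into FAULTY with the next transition out of FAULTY on the same port and reports how long the port stayed faulty, using the ptp4l bracketed timestamp (seconds).

```python
import re

# Assumed line shape, taken from the excerpt above:
#   ... ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
TRANSITION_RE = re.compile(
    r"ptp4l\[(?P<ts>\d+\.\d+)\]: port (?P<port>\d+): "
    r"(?P<old>\w+) to (?P<new>\w+) on (?P<event>\w+)"
)


def fault_windows(lines):
    """Return (port, fault_start, seconds_in_faulty) for each completed fault."""
    open_faults = {}  # port number -> ptp4l timestamp at which it entered FAULTY
    windows = []
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match is None:
            continue  # not a port state transition line
        port = int(match["port"])
        ts = float(match["ts"])
        if match["new"] == "FAULTY":
            open_faults[port] = ts
        elif match["old"] == "FAULTY" and port in open_faults:
            start = open_faults.pop(port)
            windows.append((port, start, round(ts - start, 3)))
    return windows


# The six transitions quoted in this report.
SAMPLE_LOG = [
    "2021-01-05T13:29:23.472548651+00:00 stdout F ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
    "2021-01-05T13:29:39.490561597+00:00 stdout F ptp4l[180012.630]: port 1: FAULTY to LISTENING on INIT_COMPLETE",
    "2021-01-05T14:49:36.872517937+00:00 stdout F ptp4l[184810.012]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
    "2021-01-05T14:49:56.101147349+00:00 stdout F ptp4l[184829.241]: port 1: FAULTY to LISTENING on INIT_COMPLETE",
    "2021-01-05T18:52:59.733716420+00:00 stdout F ptp4l[199412.873]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
    "2021-01-05T18:53:27.548388704+00:00 stdout F ptp4l[199440.688]: port 1: FAULTY to LISTENING on INIT_COMPLETE",
]

if __name__ == "__main__":
    for port, start, duration in fault_windows(SAMPLE_LOG):
        print(f"port {port}: faulty at {start:.3f} for {duration:.3f} s")
```

Applied to the excerpt above, this reports three faults on port 1 lasting roughly 16.0, 19.2 and 27.8 seconds. The durations cover only the FAULTY state; the time needed to return from LISTENING to SLAVE is not visible in the quoted lines.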


Version-Release number of selected component (if applicable):

86:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
	Subsystem: Mellanox Technologies Device 0091
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 205
	NUMA node: 1
	Region 0: Memory at d6000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at d3800000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn+
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: Mellanox ConnectX-5 Dual Port 25GbE SFP Network Adapter
		Read-only fields:
			[PN] Part number: 0TDNNT
			[EC] Engineering changes: A00
			[MN] Manufacture ID: 1028
			[SN] Serial number: IL0TDNNT7403102A0095
			[VA] Vendor specific: DSV1028VPDR.VER2.1
			[VB] Vendor specific: FFV16.25.40.62
			[VC] Vendor specific: NPY2
			[VD] Vendor specific: PMT78
			[VE] Vendor specific: NMVMellanox Technologies, Inc.
			[VF] Vendor specific: DTINIC
			[VG] Vendor specific: DCM1001FFFFFF1202FFFFFF1403FFFFFF1604FFFFFF2101FFFFFF2302FFFFFF2503FFFFFF2704FFFFFF
			[VH] Vendor specific: L1D0
			[RV] Reserved: checksum good, 3 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 5, Total VFs: 5, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 2, stride: 1, Device ID: 1018
		Supported Page Size: 000007ff, System Page Size: 00000001
		Region 0: Memory at 00000000d8500000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [1c0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [230 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core


How reproducible:
The setup consists of a network switch acting as a boundary clock (master) and an OpenShift 4.7 bare-metal worker node as the subordinate.
The PTP telecom profile is configured.

Steps to Reproduce:
Monitor the linuxptp-daemon-container log

Actual results:
Faults were observed during an overnight run

Expected results:
No faults

Additional info:
Compared with the logs from a different station with Intel NICs: no faults were observed there.

Comment 1 Vitaly Grinberg 2021-01-06 19:30:48 UTC
Synchronization faults are not always reproducible.
We observed 24 hours of fault-free operation today.