Description of problem:
PTP faults are observed in the linuxptp-daemon log:

2021-01-05T13:29:23.472548651+00:00 stdout F ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T13:29:39.490561597+00:00 stdout F ptp4l[180012.630]: port 1: FAULTY to LISTENING on INIT_COMPLETE
2021-01-05T14:49:36.872517937+00:00 stdout F ptp4l[184810.012]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T14:49:56.101147349+00:00 stdout F ptp4l[184829.241]: port 1: FAULTY to LISTENING on INIT_COMPLETE
2021-01-05T18:52:59.733716420+00:00 stdout F ptp4l[199412.873]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T18:53:27.548388704+00:00 stdout F ptp4l[199440.688]: port 1: FAULTY to LISTENING on INIT_COMPLETE

Each fault leads to PTP synchronization loss.

Version-Release number of selected component (if applicable):

86:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    Subsystem: Mellanox Technologies Device 0091
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 205
    NUMA node: 1
    Region 0: Memory at d6000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at d3800000 [disabled] [size=1M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
        DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x8 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
            10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS-, TPHComp-, ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
            AtomicOpsCtl: ReqEn+
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [48] Vital Product Data
        Product Name: Mellanox ConnectX-5 Dual Port 25GbE SFP Network Adapter
        Read-only fields:
            [PN] Part number: 0TDNNT
            [EC] Engineering changes: A00
            [MN] Manufacture ID: 1028
            [SN] Serial number: IL0TDNNT7403102A0095
            [VA] Vendor specific: DSV1028VPDR.VER2.1
            [VB] Vendor specific: FFV16.25.40.62
            [VC] Vendor specific: NPY2
            [VD] Vendor specific: PMT78
            [VE] Vendor specific: NMVMellanox Technologies, Inc.
            [VF] Vendor specific: DTINIC
            [VG] Vendor specific: DCM1001FFFFFF1202FFFFFF1403FFFFFF1604FFFFFF2101FFFFFF2302FFFFFF2503FFFFFF2704FFFFFF
            [VH] Vendor specific: L1D0
            [RV] Reserved: checksum good, 3 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
        AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 5, Total VFs: 5, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 2, stride: 1, Device ID: 1018
        Supported Page Size: 000007ff, System Page Size: 00000001
        Region 0: Memory at 00000000d8500000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [1c0 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
        LaneErrStat: 0
    Capabilities: [230 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

How reproducible:
The setup consists of a network switch boundary clock as the master and an OpenShift 4.7 bare-metal worker node as the subordinate. The PTP telecom profile is configured.

Steps to Reproduce:
Monitor the linuxptp-daemon-container log.

Actual results:
Faults were observed during an overnight run.

Expected results:
No faults.

Additional info:
We compared these logs with the logs from a different station that uses Intel NICs. No faults were observed there.

Synchronization faults are not always reproducible. We observed 24 hours of fault-free operation today.
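
Since the faults are intermittent, it may help to extract them from the linuxptp-daemon-container log automatically instead of reading it by eye. Below is a minimal sketch that pairs each "to FAULTY" transition with the next transition out of FAULTY on the same port and reports how long the port stayed faulty, using the ptp4l timestamp in square brackets. The function name fault_intervals and the regular expression are my own assumptions based only on the log lines quoted in this report; other ptp4l message formats are not handled.

```python
import re

# Assumed line shape, taken from the log excerpt in this report:
# "... ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)"
TRANSITION = re.compile(
    r"ptp4l\[(?P<ts>\d+\.\d+)\]: port (?P<port>\d+): "
    r"(?P<old>\w+) to (?P<new>\w+) on (?P<event>\w+)"
)

# The six transitions quoted in the description.
SAMPLE_LOG = """\
2021-01-05T13:29:23.472548651+00:00 stdout F ptp4l[179996.612]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T13:29:39.490561597+00:00 stdout F ptp4l[180012.630]: port 1: FAULTY to LISTENING on INIT_COMPLETE
2021-01-05T14:49:36.872517937+00:00 stdout F ptp4l[184810.012]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T14:49:56.101147349+00:00 stdout F ptp4l[184829.241]: port 1: FAULTY to LISTENING on INIT_COMPLETE
2021-01-05T18:52:59.733716420+00:00 stdout F ptp4l[199412.873]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
2021-01-05T18:53:27.548388704+00:00 stdout F ptp4l[199440.688]: port 1: FAULTY to LISTENING on INIT_COMPLETE
"""


def fault_intervals(lines):
    """Return a list of (port, fault_ts, recovery_ts) tuples.

    Timestamps are the ptp4l values in square brackets, in seconds.
    A fault that never leaves FAULTY within `lines` is not reported.
    """
    open_faults = {}  # port -> timestamp at which it entered FAULTY
    intervals = []
    for line in lines:
        match = TRANSITION.search(line)
        if match is None:
            continue  # not a port state transition
        port = int(match["port"])
        ts = float(match["ts"])
        if match["new"] == "FAULTY":
            open_faults[port] = ts
        elif match["old"] == "FAULTY" and port in open_faults:
            intervals.append((port, open_faults.pop(port), ts))
    return intervals


if __name__ == "__main__":
    for port, start, end in fault_intervals(SAMPLE_LOG.splitlines()):
        print(f"port {port}: FAULTY for {end - start:.3f} s (ptp4l ts {start})")
```

The same function can be fed a saved copy of the container log, for example fault_intervals(open("ptp.log")), where ptp.log is a placeholder file name.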