Bug 1008902

Summary: ptp: phc2sys sys offset suddenly increasing very large
Product: Red Hat Enterprise Linux 6
Reporter: Dong Zhu <dZhu>
Component: linuxptp
Assignee: Jiri Benc <jbenc>
Status: CLOSED NOTABUG
QA Contact: Dong Zhu <dZhu>
Severity: unspecified
Priority: unspecified
Version: 6.5
CC: dZhu, jbenc, jipan, mlichvar, qcai
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2013-09-24 09:47:46 UTC
Type: Bug
Attachments:
ptp4l and phc2sys logs from slave (flags: none)
Comment (flags: none)

Description Dong Zhu 2013-09-17 10:12:03 UTC
Created attachment 915768
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 1 Jiri Benc 2013-09-17 10:21:40 UTC
Could you attach the full ptp4l log from the slave?

Comment 3 Dong Zhu 2013-09-18 05:17:15 UTC
Created attachment 799074
ptp4l and phc2sys logs from slave

Comment 4 Miroslav Lichvar 2013-09-18 07:52:53 UTC
In the phc2sys log, the ~35 second offset appears around second 1926, but the ptp4l log doesn't seem to show anything interesting around that second. This suggests that some other process may be setting the system clock.

Is there any chance that ntpd is running, or that ntpdate/hwclock/rdate is called periodically?
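
To locate the second at which such a jump appears, the phc2sys log can be scanned for a sudden change between consecutive offset values. The following is a minimal sketch, not a linuxptp tool: the function name `find_offset_jumps`, the one second threshold, and the sample lines are all hypothetical, and the log line shape is an assumption (the wording differs between linuxptp versions, so the pattern relies only on the bracketed timestamp and the word "offset" followed by an integer in nanoseconds).

```python
import re

# Assumed line shape (not copied from the attachment):
#   phc2sys[1926.101]: sys offset 35000000040 s2 freq +140 delay 2395
# Only the bracketed timestamp and "offset <integer>" are relied upon.
LINE_RE = re.compile(r"phc2sys\[(\d+(?:\.\d+)?)\]:.*?\boffset\s+([-+]?\d+)")


def find_offset_jumps(lines, threshold_ns=1_000_000_000):
    """Return (timestamp, offset_ns) pairs where the offset changed by more
    than threshold_ns compared with the previous parsed line."""
    jumps = []
    previous = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not offset reports
        timestamp = float(match.group(1))
        offset = int(match.group(2))
        if previous is not None and abs(offset - previous) > threshold_ns:
            jumps.append((timestamp, offset))
        previous = offset
    return jumps


# Made-up sample lines for illustration only.
sample = [
    "phc2sys[1924.101]: sys offset 12 s2 freq +150 delay 2400",
    "phc2sys[1925.101]: sys offset -8 s2 freq +140 delay 2410",
    "phc2sys[1926.101]: sys offset 35000000040 s2 freq +140 delay 2395",
]
print(find_offset_jumps(sample))
```

A jump reported here with no matching event in the ptp4l log at the same second would be consistent with the system clock having been changed by something other than phc2sys.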

Comment 6 Miroslav Lichvar 2013-09-18 08:46:46 UTC
Ok, ntpd probably just stepped the clock after its stepout interval (900 seconds).
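
One way to confirm that the system clock is being stepped by another process is to watch the difference between CLOCK_REALTIME and CLOCK_MONOTONIC: frequency adjustments affect both clocks, while a step changes only CLOCK_REALTIME. The sketch below assumes a Linux host and Python 3; the helper names `clock_delta` and `find_steps`, the 0.1 s threshold, and the sample values are hypothetical.

```python
import time


def clock_delta():
    """Wall-clock time minus monotonic time, in seconds.
    This value stays nearly constant unless the system clock is stepped."""
    return (time.clock_gettime(time.CLOCK_REALTIME)
            - time.clock_gettime(time.CLOCK_MONOTONIC))


def find_steps(deltas, threshold_s=0.1):
    """Return the indices at which consecutive deltas differ by more than
    threshold_s, that is, the samples taken right after a clock step."""
    return [i for i in range(1, len(deltas))
            if abs(deltas[i] - deltas[i - 1]) > threshold_s]


# Made-up samples: the clock is stepped by 35 s before the third sample.
recorded = [1000.0, 1000.0, 1035.0, 1035.0]
print(find_steps(recorded))
```

In practice `clock_delta()` would be sampled about once per second while the test runs, and the collected values passed to `find_steps`. Note that phc2sys itself may also step the clock, so a detected step only narrows down the time at which to look.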

But there is a strange offset in the slave ptp4l log at 1438. Is the master running ntpd? Is the PHC synchronized by phc2sys from the system clock?

Also, any explanation why the master is dropping out? Was it restarted or blocked by the firewall?

Comment 7 Jiri Benc 2013-09-18 09:11:46 UTC
(In reply to Miroslav Lichvar from comment #6)
> But there is a strange offset in the slave ptp4l log at 1438.

Looking at the three logs, it seems likely that at that time a Sync message was delayed by a switch by 1.7 ms, the subsequent Delay_Req and Delay_Resp messages were delayed by an unknown amount of time, and no further traffic got through until ~21 seconds later.

Are you doing anything with the communication path (switch, network cables, etc.) during the testing?
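
To illustrate why a single delayed Sync message shows up as an offset spike, here is a minimal sketch of the standard PTP delay request-response arithmetic. It is not ptp4l's implementation: the function name `offset_and_delay`, the 10 µs path delay, and the timestamps are hypothetical, and filtering of the path delay is not modeled.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Delay request-response estimate; all times in nanoseconds.
    t1: master sends Sync        t2: slave receives Sync
    t3: slave sends Delay_Req    t4: master receives Delay_Req
    Returns (slave offset from master, one-way path delay)."""
    master_to_slave = t2 - t1
    slave_to_master = t4 - t3
    offset = (master_to_slave - slave_to_master) // 2
    delay = (master_to_slave + slave_to_master) // 2
    return offset, delay


ONE_WAY_NS = 10_000        # hypothetical symmetric path delay
EXTRA_SYNC_NS = 1_700_000  # a Sync message held up by 1.7 ms

# Clocks are perfectly aligned in both cases; only the Sync transit time differs.
normal = offset_and_delay(0, ONE_WAY_NS, 50_000, 50_000 + ONE_WAY_NS)
delayed = offset_and_delay(0, ONE_WAY_NS + EXTRA_SYNC_NS, 50_000, 50_000 + ONE_WAY_NS)
print(normal)   # (0, 10000)
print(delayed)  # (850000, 860000)
```

With only the Sync message delayed, half of the extra 1.7 ms is attributed to clock offset and half to path delay. If a previously measured path delay is reused instead, the apparent offset would be closer to the full 1.7 ms.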

Comment 9 Jiri Benc 2013-09-18 13:07:56 UTC
Stopping ntpd helped with one of the problems (the one described in comment 4).

The problem described in comment 7 still remains, but it's a separate one. Miroslav captured packets on both machines and the captures confirm my theory (see comment 7). This can be caused by the hardware at the master not sending the frames, the hardware at the slave not receiving the frames properly, or the switch discarding the frames.

Comment 10 Jiri Benc 2013-09-18 13:44:28 UTC
I tried to find out which of the cases in the previous comment applies, but the problem does not seem to reproduce with a ping running in parallel.

There doesn't seem to be anything wrong with ptp4l/phc2sys. If this is a Linux issue, the only component that could be at fault is the NIC driver. I think a hardware problem is more likely, though.

Could you try with a different switch? Or with a master running on a different NIC?

Comment 15 Jiri Benc 2013-09-24 08:20:05 UTC
Thanks for doing the testing. I'm very much inclined to say this was a problem with the switch and its handling of multicast packets. The only thing preventing me from saying for sure that this is not a RHEL bug is that Jimmy Pan reproduced the problem with igb cards and a Cisco Catalyst 3750 switch.

I'll try a few things with a modified linuxptp.

Comment 16 Jiri Benc 2013-09-24 09:02:32 UTC
For the record, I cannot reproduce it anymore on the machines that showed the problem originally.

Comment 17 Jiri Benc 2013-09-24 09:38:04 UTC
For the record, Jimmy Pan experienced the problem on the same machines.

Comment 19 Jiri Benc 2013-10-01 09:14:14 UTC
*** Bug 1011356 has been marked as a duplicate of this bug. ***

Comment 20 Jiri Benc 2013-10-01 09:14:17 UTC
*** Bug 1011367 has been marked as a duplicate of this bug. ***

Comment 21 Jiri Benc 2013-10-01 09:14:43 UTC
*** Bug 1011363 has been marked as a duplicate of this bug. ***

Comment 22 Jiri Benc 2013-10-01 09:14:46 UTC
*** Bug 1011368 has been marked as a duplicate of this bug. ***