Bug 1008902 - ptp: phc2sys sys offset suddenly increasing very large
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: linuxptp
Version: 6.5
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jiri Benc
QA Contact: Dong Zhu
Duplicates: 1011356 1011363 1011367 1011368
Depends On:
Blocks:
Reported: 2013-09-17 06:12 EDT by Dong Zhu
Modified: 2014-03-16 21:48 EDT (History)

Doc Type: Bug Fix
Last Closed: 2013-09-24 05:47:46 EDT
Type: Bug


Attachments
ptp4l and phc2sys logs from slave (58.54 KB, application/x-gzip)
2013-09-18 01:17 EDT, Dong Zhu
Comment (113.77 KB, text/plain)
2013-09-17 06:12 EDT, Dong Zhu

Description Dong Zhu 2013-09-17 06:12:03 EDT
Created attachment 915768 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).
Comment 1 Jiri Benc 2013-09-17 06:21:40 EDT
Could you attach the full ptp4l log from the slave?
Comment 3 Dong Zhu 2013-09-18 01:17:15 EDT
Created attachment 799074 [details]
ptp4l and phc2sys logs from slave
Comment 4 Miroslav Lichvar 2013-09-18 03:52:53 EDT
In the phc2sys log the ~35 second offset appears around second 1926, but in the ptp4l log there doesn't seem to be anything interesting around that second. This looks like some other process may be setting the system clock.

Is there any chance that ntpd is running, or that ntpdate/hwclock/rdate is called periodically?
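
A step of the system clock by another process shows up as a discontinuity between consecutive phc2sys offset samples. The following is a minimal Python sketch of how such a step could be located; it is not part of linuxptp. The function name, the 1-second threshold, and the sample values are hypothetical, and it assumes the (timestamp, offset) pairs have already been extracted from the phc2sys log.

```python
from typing import List, Tuple


def find_offset_steps(samples: List[Tuple[float, int]],
                      threshold_ns: int = 1_000_000_000) -> List[Tuple[float, int]]:
    """Return (timestamp_s, jump_ns) for every sample whose offset differs
    from the previous sample by more than threshold_ns."""
    steps = []
    # Walk over consecutive pairs of samples.
    for (_, prev_offset), (ts, offset) in zip(samples, samples[1:]):
        jump = offset - prev_offset
        if abs(jump) > threshold_ns:
            steps.append((ts, jump))
    return steps


# Hypothetical samples (seconds, nanoseconds); NOT taken from the attached logs.
samples = [
    (1924.0, 15),
    (1925.0, -8),
    (1926.0, 35_000_000_000),   # offset suddenly jumps by about 35 s
    (1927.0, 34_999_999_990),
]
steps = find_offset_steps(samples)
print(steps)
```

If the timestamps reported this way have no matching event in the ptp4l log, the jump most likely came from outside the PTP stack.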
Comment 6 Miroslav Lichvar 2013-09-18 04:46:46 EDT
Ok, ntpd probably just stepped the clock after its stepout interval (900 seconds).

But there is a strange offset in the slave ptp4l log at 1438. Is the master running ntpd? Is the PHC synchronized by phc2sys from the system clock?

Also, any explanation why the master is dropping out? Was it restarted or blocked by the firewall?
Comment 7 Jiri Benc 2013-09-18 05:11:46 EDT
(In reply to Miroslav Lichvar from comment #6)
> But there is a strange offset in the slave ptp4l log at 1438.

Looking at the three logs, it seems likely that at that time a Sync message was delayed by a switch by 1.7 ms, the subsequent Delay_Req and Delay_Resp messages were delayed by an unknown amount of time, and no further traffic got through until ~21 seconds later.

Are you doing anything with the communication path (switch, network cables, etc.) during the testing?
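
To illustrate why a single delayed Sync message produces an offset spike, here is a minimal Python sketch of the textbook PTP delay request-response calculation. It is not ptp4l's actual code, which also filters the path delay. The function name and all timestamps are hypothetical; only the 1.7 ms figure comes from the comment above.

```python
def offset_and_delay(t1: int, t2: int, t3: int, t4: int):
    """Textbook PTP end-to-end calculation, all values in nanoseconds.

    t1: master sends Sync        t2: slave receives Sync
    t3: slave sends Delay_Req    t4: master receives Delay_Req
    Assumes a symmetric path; any asymmetry appears as an offset error.
    """
    master_to_slave = t2 - t1
    slave_to_master = t4 - t3
    delay = (master_to_slave + slave_to_master) / 2
    offset = (master_to_slave - slave_to_master) / 2
    return offset, delay


# Hypothetical timestamps: clocks in sync, 10 us path delay each way.
normal = offset_and_delay(0, 10_000, 2_000_000, 2_010_000)

# Same exchange, but the Sync message is held up by 1.7 ms in the switch.
delayed = offset_and_delay(0, 10_000 + 1_700_000, 2_000_000, 2_010_000)

print(normal, delayed)
```

In this simplified model half of the extra delay is reported as offset and half as path delay, even though the clocks did not move. A servo that subtracts a previously filtered path delay would see close to the full 1.7 ms as offset.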
Comment 9 Jiri Benc 2013-09-18 09:07:56 EDT
Stopping ntpd helped with one of the problems (the one described in comment 4).

The problem described in comment 7 still remains, but it's a separate one. Miroslav captured packets on both machines and the captures confirm my theory (see comment 7). This can be caused by the hardware at the master not sending the frames, the hardware at the slave not receiving the frames properly, or the switch discarding the frames.
Comment 10 Jiri Benc 2013-09-18 09:44:28 EDT
I tried to find out which case (see previous comment) it is, but it seems the problem does not reproduce with a ping running in parallel.

There doesn't seem to be anything wrong with ptp4l/phc2sys. If this is a Linux issue, then the only point that could be wrong is the NIC driver. I rather suspect a hardware problem, though.

Could you try with a different switch? Or with a master running on a different NIC?
Comment 15 Jiri Benc 2013-09-24 04:20:05 EDT
Thanks for doing the testing. I'm very much inclined to say this was a problem with the switch and its handling of multicast packets. The only thing preventing me from saying for sure that this is not a RHEL bug is that Jimmy Pan reproduced the problem with igb cards and a Cisco Catalyst 3750 switch.

I'll try a few things with a modified linuxptp.
Comment 16 Jiri Benc 2013-09-24 05:02:32 EDT
For the record, I cannot reproduce it anymore on the machines that showed the problem originally.
Comment 17 Jiri Benc 2013-09-24 05:38:04 EDT
For the record, Jimmy Pan experienced the problem on the same machines.
Comment 19 Jiri Benc 2013-10-01 05:14:14 EDT
*** Bug 1011356 has been marked as a duplicate of this bug. ***
Comment 20 Jiri Benc 2013-10-01 05:14:17 EDT
*** Bug 1011367 has been marked as a duplicate of this bug. ***
Comment 21 Jiri Benc 2013-10-01 05:14:43 EDT
*** Bug 1011363 has been marked as a duplicate of this bug. ***
Comment 22 Jiri Benc 2013-10-01 05:14:46 EDT
*** Bug 1011368 has been marked as a duplicate of this bug. ***