Description of problem:
The customer is experiencing application problems because time fails to advance across a nanosleep() call, even though nanosleep() returns no error.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-92.1.10.el5
Also seen with earlier kernels, -67 and -92.

How reproducible:
Always on the customer system, using the test program.

Steps to Reproduce:
1. Run the attached nanotest program.

Actual results:
Results vary widely over multiple runs:

backwards = 0 no_time = 0 short_sleep = 0 long_sleep = 22 Elapsed: 240
--
backwards = 0 no_time = 5262 short_sleep = 20 long_sleep = 50 Elapsed: 156
--
backwards = 0 no_time = 9864 short_sleep = 37 long_sleep = 99 Elapsed: 99
--
backwards = 0 no_time = 9863 short_sleep = 35 long_sleep = 102 Elapsed: 102
--
backwards = 0 no_time = 6627 short_sleep = 22 long_sleep = 71 Elapsed: 146

Expected results:
backwards = 0 no_time = 0 short_sleep = 0

Additional info:
System is "PRIMERGY RX200 S3" with one socket filled:

Handle 0x0004, DMI type 4, 35 bytes.
Processor Information
    Socket Designation: CPU 1
    Type: Central Processor
    Family: Xeon
    Manufacturer: Intel
    ID: FB 06 00 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 15, Stepping 11
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (Fast floating-point save and restore)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Hyper-threading technology)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Intel(R) Xeon(R) CPU 5148 @
    Voltage: 1.5 V
    External Clock: 1333 MHz
    Max Speed: 2333 MHz
    Current Speed: 2333 MHz
    Status: Populated, Enabled
    Upgrade: ZIF Socket
    L1 Cache Handle: 0x0006
    L2 Cache Handle: 0x0007
    L3 Cache Handle: 0x0008
    Serial Number: Not Specified
    Asset Tag: Not Specified
    Part Number: Not Specified
Created attachment 317183
nanosleep() test program
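The attachment itself is not reproduced in this thread. For readers without access to it, the following is a minimal sketch of how a test of this kind could be written. It is not the customer's program: the struct, the function names, the use of clock_gettime(CLOCK_REALTIME), and the slack threshold for long_sleep are all assumptions made for illustration. Only the counter names (backwards, no_time, short_sleep, long_sleep) come from the reported output.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Counters named after the fields in the reported output.
 * Their exact definitions in the real test program are unknown;
 * the meanings below are assumptions. */
struct sleep_stats {
    long backwards;   /* clock went backwards across the sleep */
    long no_time;     /* clock did not advance at all */
    long short_sleep; /* clock advanced, but by less than requested */
    long long_sleep;  /* clock advanced by more than request + slack */
};

/* Convert a timespec to nanoseconds. */
static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Classify one before/after sample against the requested sleep.
 * slack_ns is a hypothetical tolerance for scheduling delay. */
static void classify_sleep(struct sleep_stats *st, int64_t before_ns,
                           int64_t after_ns, int64_t request_ns,
                           int64_t slack_ns)
{
    int64_t delta = after_ns - before_ns;

    assert(st != NULL);
    if (delta < 0)
        st->backwards++;
    else if (delta == 0)
        st->no_time++;
    else if (delta < request_ns)
        st->short_sleep++;
    else if (delta > request_ns + slack_ns)
        st->long_sleep++;
}

/* Sleep `iterations` times and record how the wall clock behaved.
 * Returns 0 on success, -1 if a clock read or nanosleep() fails
 * (this sketch treats EINTR as a failure rather than retrying). */
static int run_sleep_test(struct sleep_stats *st, int iterations,
                          long request_ns, long slack_ns)
{
    struct timespec req, before, after;
    int i;

    req.tv_sec = request_ns / 1000000000L;
    req.tv_nsec = request_ns % 1000000000L;

    for (i = 0; i < iterations; i++) {
        if (clock_gettime(CLOCK_REALTIME, &before) != 0)
            return -1;
        if (nanosleep(&req, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_REALTIME, &after) != 0)
            return -1;
        classify_sleep(st, ts_to_ns(&before), ts_to_ns(&after),
                       request_ns, slack_ns);
    }
    return 0;
}
```

A caller would zero a struct sleep_stats, call run_sleep_test() with the desired iteration count and request length, and print the four counters. On a system where nanosleep() and the clock behave as expected, backwards, no_time and short_sleep should all stay at 0.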
The latest tests were run with the notsc kernel boot option. Similar results are seen in the RHEL5 DomU.
Created attachment 317208
backport of upstream patch that rounds up sleep

I am building a test kernel with this backported patch now and will test the nanosleep() test program with and without this patch.
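The patch is not shown in this thread, so its exact mechanism is unknown. Assuming "rounds up sleep" means converting the requested sleep to whole timer ticks with a ceiling division instead of truncation, the difference can be sketched as below. The function names and the 4 ms tick length used in the note are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating conversion: a request shorter than one tick becomes
 * zero ticks, so the sleep may return without time advancing. */
static int64_t ns_to_ticks_truncate(int64_t request_ns, int64_t tick_ns)
{
    assert(request_ns >= 0 && tick_ns > 0);
    return request_ns / tick_ns;
}

/* Rounding-up conversion: any non-zero request waits for at least
 * one full tick, so the sleep is never shorter than requested. */
static int64_t ns_to_ticks_round_up(int64_t request_ns, int64_t tick_ns)
{
    assert(request_ns >= 0 && tick_ns > 0);
    return (request_ns + tick_ns - 1) / tick_ns;
}
```

With an assumed 4 ms tick, a 1 ms request truncates to 0 ticks but rounds up to 1 tick. A zero-tick sleep is the kind of behaviour that would be counted as no_time or short_sleep by the test program.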
I cannot seem to reproduce the short sleep problem on my hardware, even though I do think I understand why it can happen. Do we have any hardware in-house on which the bug is reproducible?
Martin (Poole), if I get you a test kernel with the patch, could you get it tested at the customer site? I have not found any hardware here that reproduces the bug, but the patch is low risk enough that testing at the customer site should be enough to get it approved for merging in a RHEL update.
The customer is willing to test an experimental kernel and can install it today (October 13th) or tomorrow (14th) if we provide it now.

Internal Status set to 'Waiting on Engineering'
Status set to: Waiting on Tech

This event sent from IssueTracker by akunysz
issue 173294
I have made test kernels available at http://people.redhat.com/riel/.bz462853/ Please let me know whether the test kernel resolves the issue.
Thank you. Customer has been given test kernel. Waiting for feedback.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by akunysz
issue 173294
Since the patch is upstream, safe, obviously correct and greatly improves the test case for the customer, I will submit it for inclusion in a RHEL update. There may be other unrelated time bugs that caused the issue to show up on one of the domUs.
Posted the patch for internal review.
The bug happens on only one specific system and cannot be reproduced on other systems of the same model. Putting in a workaround for one specific system entails too much risk for next to no gain, so CLOSED WORKSFORME. Please reopen if the bug can be triggered on multiple systems.