1200353 – regression in ktime.h breaks infrared IR blaster/transmitter after upgrade to Fedora 21

Bug 1200353 - regression in ktime.h breaks infrared IR blaster/transmitter after upgrade to Fedora 21

Summary: regression in ktime.h breaks infrared IR blaster/transmitter after upgrade to...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	21
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-03-10 11:33 UTC by Trevor Cordes
Modified:	2015-06-01 17:06 UTC (History)
CC List:	11 users (show)
Fixed In Version:	kernel-4.0.4-202.fc21
Clone Of:
Environment:
Last Closed:	2015-06-01 17:03:50 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
diff between good and bad version of mceusb.c (2.39 KB, patch) 2015-03-12 17:11 UTC, Trevor Cordes	no flags	Details \| Diff
PoC reverts the changes that cause my bug, for kernel-3.19.1-201.fc21.i686 (788 bytes, patch) 2015-03-23 17:04 UTC, Trevor Cordes	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	95431	0	None	None	None	Never

Description Trevor Cordes 2015-03-10 11:33:43 UTC

Description of problem:
After upgrading to Fedora 21 my lirc IR blaster setup (for MythTV) no longer works. Everything worked fine in F19 and earlier. I've boiled the problem down to basic lirc/module commands. I'm using a mceusb USB transceiver device.

Trying to issue basic an irsend command results in:
irsend: timeout

messages. If I do enough irsends in a short time span, irsend eventually hangs and I've even gotten the kernel to oops (bug #1200304).

The funny thing is, the first few irsend commands I issue seem to be blasted to my set-top-box! So that means everything is "sort of" working in the chain, down to the bare hardware, but something is keeping irsend from getting back a valid response in time. My hunch is it's possibly the kernel module to blame (if so we can change the bug component).

I'd love to debug but I can't figure out any way to get lircd to output debug or get the modules into debug mode (modprobe mceusb debug=1 just gives a "mceusb: unknown parameter 'debug' ignored" message. I've checked out the irsend source and it's clear it's timing out waiting for a valid response from lircd.

Version-Release number of selected component (if applicable):
kernel-PAE-3.18.8-201.fc21.i686
lirc-0.9.1a-4.fc21.i686

How reproducible:
Always

Steps to Reproduce:
1. Setup a valid blaster lircd.conf (like my old one that worked in F19)
2. lircd -n -H default (or just use systemctl start lircd.service)
3. irsend SEND_ONCE dct700 5

Actual results:
<pause>
irsend: timeout
Exit 1

and blasted IR code sometimes actually is sent

Expected results:
instantaneous return with no error message and all blasted IR codes are sent

Additional info:
#lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 003: ID 0471:0815 Philips (or NXP) eHome Infrared Receiver
Bus 002 Device 002: ID 046d:c313 Logitech, Inc. Internet 350 Keyboard
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

Beginning of lircd.conf stanza:

# Modified DCT2000 entry for DCT700 (sends zeros)
# For use with the USB MCE ir receiver

begin remote

name dct700
flags RAW_CODES
eps 30
aeps 100

ptrail 520
repeat 0 0
gap 150000

begin raw_codes

Comment 1 Trevor Cordes 2015-03-10 11:39:41 UTC

See attachment #999788 [details] for messages output for the kernel drivers loading, etc.  Also, I should make it clear and mention that I am *not* trying to use a IR remote (ie have lirc receive and control myth), I'm just trying to use the blaster/transmitter aspect.  (Though I do have the remote setup in my lirc, I haven't used it in years.)

Comment 2 Trevor Cordes 2015-03-10 12:09:42 UTC

Further messages logs info.  I tried changing USB ports (note, these are old mobo USB2 ports, it doesn't have USB3 so I'm not hitting the google-described USB3 bug).

Right after, I issued an irsend LIST "" "" at Mar 10 07:06:06, note how that runs 100% OK and returns immediately, and lircd says "removed client" right away.

Then I issued irsend SEND_ONCE dct700 info at Mar 10 07:06:14 and then lircd *never* says "removed client"!  That's why irsend is timing out, lircd is getting screwed up somehow.

Mar 10 07:04:17 piles kernel: [10665.500078] usb 2-2: USB disconnect, device number 3
Mar 10 07:05:01 piles kernel: [10709.639030] usb 3-2: new full-speed USB device number 2 using uhci_hcd
Mar 10 07:05:02 piles kernel: [10710.483059] usb 3-2: New USB device found, idVendor=0471, idProduct=0815
Mar 10 07:05:02 piles kernel: [10710.564576] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Mar 10 07:05:02 piles kernel: [10710.644881] usb 3-2: Product: eHome Infrared Transceiver
Mar 10 07:05:02 piles kernel: [10710.723675] usb 3-2: Manufacturer: Philips
Mar 10 07:05:02 piles kernel: [10710.800965] usb 3-2: SerialNumber: PH00FsaI
Mar 10 07:05:03 piles kernel: [10710.904065] Registered IR keymap rc-rc6-mce
Mar 10 07:05:03 piles kernel: [10710.980857] input: Media Center Ed. eHome Infrared Remote Transceiver (0471:0815) as /devices/pci0000:00/0000:00:1d.1/usb3/3-2/3-2:1.0/rc/rc0/input15
Mar 10 07:05:03 piles kernel: [10711.135596] rc0: Media Center Ed. eHome Infrared Remote Transceiver (0471:0815) as /devices/pci0000:00/0000:00:1d.1/usb3/3-2/3-2:1.0/rc/rc0
Mar 10 07:05:03 piles kernel: [10711.291674] input: MCE IR Keyboard/Mouse (mceusb) as /devices/virtual/input/input16
Mar 10 07:05:03 piles kernel: [10711.369295] rc rc0: lirc_dev: driver ir-lirc-codec (mceusb) registered at minor = 0
Mar 10 07:05:03 piles kernel: [10711.589040] mceusb 3-2:1.0: Registered Philips eHome Infrared Transceiver with mce emulator interface version 1
Mar 10 07:05:03 piles kernel: [10711.666949] mceusb 3-2:1.0: 2 tx ports (0x0 cabled) and 2 rx sensors (0x1 active)
Mar 10 07:05:03 piles kernel: [10711.744051] usbcore: registered new interface driver mceusb
Mar 10 07:06:03 piles lircd: 'lirc' written to protocols file /sys/class/rc/rc0/protocols
Mar 10 07:06:03 piles lircd-0.9.1a[4535]: lircd(default) ready, using /var/run/lirc/lircd
Mar 10 07:06:06 piles lircd-0.9.1a[4535]: accepted new client on /var/run/lirc/lircd
Mar 10 07:06:06 piles lircd-0.9.1a[4535]: removed client
Mar 10 07:06:14 piles lircd-0.9.1a[4535]: accepted new client on /var/run/lirc/lircd

Comment 3 Trevor Cordes 2015-03-10 13:03:39 UTC

For kicks I installed the necessary lirc stuff from Fedora 22 on F21: lirc-0.9.2a-1.fc22, etc.

#irsend SEND_ONCE dct700 info

lircd output:
lircd-0.9.2a[5151]: Debug: registering local client
lircd-0.9.2a[5151]: Notice: accepted new client on /var/run/lirc/lircd
lircd-0.9.2a[5151]: Debug: driver supports both sending and receiving
lircd-0.9.2a[5151]: Debug: received command: "SEND_ONCE dct700 info"
lircd-0.9.2a[5151]: Debug: Sending once, msg: SEND_ONCE dct700 info
, args: dct700 info, once: 1
lircd-0.9.2a[5151]: Debug: clearing transmit buffer
lircd-0.9.2a[5151]: Debug: transmit buffer ready
... hangs here until ^C, then:
^Clircd-0.9.2a[5151]: Debug: Sending success
lircd-0.9.2a[5151]: Info: removed client
lircd-0.9.2a[5151]: Notice: caught signal


messages log:
Mar 10 07:23:58 piles lircd-0.9.2a[4885]: Notice: accepted new client on /var/run/lirc/lircd
Mar 10 07:23:59 piles irsend: fill_string: timeout
Mar 10 07:24:00 piles irsend: fill_string: timeout
Mar 10 07:24:01 piles irsend: fill_string: timeout
Mar 10 07:24:02 piles irsend: fill_string: timeout
... forever! until ^C lircd


Doing an strace on the lircd I can see it is hanging on a write() to fd7 which is /dev/lirc0!

writev(2, [{"lircd-0.9.2a[5399]: Debug: trans"..., 48}, {"\n", 1}], 2lircd-0.9.2a[5399]: Debug: transmit buffer ready
) = 49
send(5, "<135>Mar 10 07:51:57 lircd-0.9.2"..., 69, MSG_NOSIGNAL) = 69
write(7, "Z#\0\0000\21\0\0&\2\0\0000\21\0\0&\2\0\0000\21\0\0X\2\0\0f\10\0\0"..., 140


strace on irsend shows trying to read on fd3 which is "/proc/5649/fd/3 -> socket:[28512]".  Not sure how to tell where socket 28512 is going, probably lircd?

socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
connect(4, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = 0
send(4, "<15>Mar 10 08:00:45 irsend: lirc"..., 77, MSG_NOSIGNAL) = 77
write(3, "SEND_ONCE dct700 info\n", 22) = 22
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO, "\1\0\0\0\0\0\0\0", 8) = 0
read(3, 0xbfd67e99, 256)                = -1 EAGAIN (Resource temporarily unavailable)
time(NULL)                              = 1425992446
send(4, "<13>Mar 10 08:00:46 irsend: fill"..., 49, MSG_NOSIGNAL) = 49
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO, "\1\0\0\0\0\0\0\0", 8) = 0
read(3, 0xbfd67e99, 256)                = -1 EAGAIN (Resource temporarily unavailable)
time(NULL)                              = 1425992447


Anyways, it just repeats the EAGAIN for ever.

Something is barfing and not communicating according to the lircd/irsend protocol.  I'm poking around the source but I'm starting to get in over my head.

Comment 4 Alec Leamas 2015-03-10 13:06:18 UTC

Hm... this might be a hard one.

Have you tried to raise the lircd loglevel? In a standard installation, edit /etc/lirc/lirc_options.conf and set loglevel to 10. Then run, and check "journalctl -b /usr/sbin/lircd". Perhaps, here can be some more hints.

Comment 5 Trevor Cordes 2015-03-10 13:19:18 UTC

I seem to have gotten loglevel 10 with my tests now in the new 0.9.2 version:

strace lircd -n -H default --loglevel=trace2

I also tried it in lirc_options.conf with

debug          = 10
logfile        = /tmp/l.log

But the output is the same.  journalctl doesn't tell me anything because I seem to have successfully disabled it in favor of rsyslog (yay!).  Or I'm not using it right.  The way I have it configured, everything that journalctl should have seen goes to rsyslog now and /v/l/messages.

I think I need to figure out why that lircd write() is hanging, and who is it trying to write to?  If someone knows the lirc protocol being spoken, the strace output above should tell us what the "message" is.  I can try hitting the source some more...

Comment 6 Trevor Cordes 2015-03-11 10:01:24 UTC

I installed F20's lirc lirc-0.9.0-21.fc20.i686 (and related rpms), that's the last version that worked for me (except I was using fc19).  No go, still the exact same timeout with irsend.  Still the same-looking straces.

Does this mean the bug is in the kernel -- the new in-kernel version of mceusb?  Does lircd write() (the hanging write) to the kernel?  Right before the hanging write() is an ioctl.

Comment 7 Alec Leamas 2015-03-11 11:07:40 UTC

Hm....

Already at 0.9.0 the mceusb was in-kernel, so it's not that simple. Still, the kernel code might of course have been updated.

The kernel module history is at [1]. Perhaps one of these commits is the culprit?

I can see two ways to proceed into this. I can try to support you either way, but since you are the only one which can reproduce the error...

One way would be to patch the current irsend code. Here are some obvious areas which could be improved (client logging, handling of EAGAIN/timeout). Fixing these should basically be done on upstream master: [2]. I don't really think the proposed patches would fix this, but it might (should?) give more information than we have today. I don't think you need to install lirc to run these tests, it could be done on an un-installed but built tree.

The other way would be to try to bisect the mceusb driver. Rebuilding the driver isn't hard, there is example code in drivers/ (upstream tree). Question is just how far back we can go before the driver becomes incompatible with current overall kernel (no idea).


[1] https://github.com/torvalds/linux/commits/master/drivers/media/rc/mceusb.c
[2] http://sourceforge.net/p/lirc/git/ci/master/tree/

Comment 8 Trevor Cordes 2015-03-11 13:01:09 UTC

For kicks I installed a kernel I found in the repos for F20:
kernel-PAE-3.11.10-301.fc20.i686

on my F21 userspace (ugly!)

Rebooted, ran the lircd / irsend tests (still with lirc-0.9.0-21.fc20 installed) and it all works fine!  No bug!

I think this proves it is a kernel (mceusb module probably) bug, right?  Should I reassign this bugzilla?

Comparing the straces is interesting:

BUGGY VERSIONS:
writev(2, [{"lircd-0.9.0[24802]: accepted new"..., 62}, {"\n", 1}], 2lircd-0.9.0[24802]: accepted new client on /var/run/lirc/lircd
) = 63
send(5, "<29>Mar 11 04:59:41 lircd-0.9.0["..., 82, MSG_NOSIGNAL) = 82
stat64("/dev/lirc0", {st_mode=S_IFCHR|0666, st_rdev=makedev(248, 0), ...}) = 0
open("/dev/lirc0", O_RDWR)              = 7
ioctl(7, I2OGETIOPS or LIRC_GET_FEATURES, 0xb77ed168) = 0
gettimeofday({1426067981, 29156}, NULL) = 0
select(8, [4 6 7], NULL, NULL, NULL)    = 1 (in [6])
gettimeofday({1426067981, 29247}, NULL) = 0
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "SEND_ONCE dct700 info\n", 256) = 22
ioctl(7, IPMICTL_SET_MY_LUN_CMD or LIRC_SET_SEND_CARRIER, 0xbfd902b8) = 38000
write(7, "Z#\0\0000\21\0\0&\2\0\0000\21\0\0&\2\0\0000\21\0\0X\2\0\0f\10\0\0"..., 140^Clircd-0.9.0[24802]: removed client


OLD WORKING VERSIONS:
writev(2, [{"lircd-0.9.0[1620]: accepted new "..., 61}, {"\n", 1}], 2lircd-0.9.0[1620]: accepted new client on /var/run/lirc/lircd
) = 62
send(5, "<29>Mar 11 07:55:59 lircd-0.9.0["..., 81, MSG_NOSIGNAL) = 81
stat64("/dev/lirc0", {st_mode=S_IFCHR|0600, st_rdev=makedev(248, 0), ...}) = 0
open("/dev/lirc0", O_RDWR)              = 7
ioctl(7, I2OGETIOPS or LIRC_GET_FEATURES, 0xb77aa168) = 0
gettimeofday({1426078559, 984679}, NULL) = 0
select(8, [4 6 7], NULL, NULL, NULL)    = 1 (in [6])
gettimeofday({1426078559, 984969}, NULL) = 0
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "SEND_ONCE dct700 info\n", 256) = 22
ioctl(7, IPMICTL_SET_MY_LUN_CMD or LIRC_SET_SEND_CARRIER, 0xbfe487a8) = 38000
write(7, "Z#\0\0000\21\0\0&\2\0\0000\21\0\0&\2\0\0000\21\0\0X\2\0\0f\10\0\0"..., 140) = 140
^^^^^^^^^^^^ !!!!!!!! WHERE NEWER VERSIONS HANG
gettimeofday({1426078560, 7368}, NULL)  = 0
...


From this it's clear now fd7 is /dev/lirc0 which I think speaks directly to the kernel modules?  OK, so it seems clear that the bug is in the kernel modules, which are hanging or freezing or something and never accepting the write() to /dev/lirc0 from lircd.  Note, in both versions above the string sent to /dev/lirc0 is identical.

I wish there were more interim kernel versions to test but I'm having trouble finding more stock kernel rpm's in the Fedora repo world.  Is there somewhere that keeps every kernel rpm ever released to stable (or testing)?

I've never had luck with bisecting "fedora-ized" kernel with git using Fedora, especially on 32-bit systems where I hit bugs using instructions I found from Fedora people.  I'm clueless at the process and would need some good links with instructions.  It's so much easier to find interim rpm versions if I can.

Help!

Comment 9 Trevor Cordes 2015-03-11 13:10:36 UTC

(In reply to Alec Leamas from comment #7)
> Hm....

Thanks for the help!  I'll do anything.  I've already started some of what you suggest, let me tackle the points:

> The kernel module history is at [1]. Perhaps one of these commits is the
> culprit?

I've gitted Linus's entire tree and looked at the git changelog for mceusb and some commits do sound fishy, like 0cacb46ace1f433f0ab02af10686f6dc50b5d268, and others.

> One way would be to patch the current irsend code. Here are some obvious
> areas which could be improved (client logging, handling of EAGAIN/timeout).
> Fixing these should basically be done on upstream master: [2]. I don't
> really think the proposed patches would fix this, but it might (should?)
> give more information than we have today. I don't think you need to install
> lirc to run these tests, it could be done on an un-installed but built tree.

OK, but read my most recent comment, maybe it's for sure now a kernel thing and not irsend/lircd's fault?  I think I'll concentrate on mceusb module before hitting the lirc git unless you think the above paragraph is a wiser place to start still.

> The other way would be to try to bisect the mceusb driver. Rebuilding the
> driver isn't hard, there is example code in drivers/ (upstream tree).
> Question is just how far back we can go before the driver becomes
> incompatible with current overall kernel (no idea).

Ya, I tried compiling the "fedora way" an older version of mceusb to that git commit I list above and it wouldn't build due to missing fn rc_set_allowed_protocols which seems to have been removed from an include file, as it's no where to be grepped in the vanilla source now.  That's only Apr 2014 which isn't very far back at all.

I guess I could keep stepping back a commit at a time until I find one that compiles.  But no other newer commits seem to have anything interesting that might be causing this bug (but I'm no expert).

I think I'll need to hone in on some suspicious commits and back-port a reversal of the change?  (If that's even the right terminology?)  That should allow me to build and test various reverts without rebooting.

I can hack the diffs manually, as again I have no idea how to meld the idea of "fedora way" with "vanilla way" for a bisect, and I have no idea how to bisect just one module.  If you can point me to useful fedora-centric websites for this stuff, that would be great, or I'll just hack around and see what I can get to work.

Thanks a ton!!  No mythtv recordings going on until this works...

Comment 10 Alec Leamas 2015-03-11 13:27:45 UTC

Perhaps... If you created a new git repo just containing the mceusb.c file. And added it some version long time ago, and then each version in the mceusb.c history. From that point you should be able to make a clean bisect.

Given all facts, IMHO it makes sense to start with bisecting the mceusb driver.

Comment 11 Alec Leamas 2015-03-11 13:31:59 UTC

You have this Makefile, right?

http://sourceforge.net/p/lirc/git/ci/master/tree/drivers/lirc_wpc8769l/

Comment 12 Trevor Cordes 2015-03-11 13:44:01 UTC

(In reply to Alec Leamas from comment #10)
> Perhaps... If you created a new git repo just containing the mceusb.c file.
> And added it some version long time ago, and then each version in the
> mceusb.c history. From that point you should be able to make a clean bisect.
> 
> Given all facts, IMHO it makes sense to start with bisecting the mceusb
> driver.

Sorry, over my head on that one, I know what you're saying but not how to do it using a fedora kernel src.rpm.  And I don't know how to make vanilla do anything without running configure, which I don't want to do as it won't give me a "fedora" setup :-(  I just don't know the magic commands.  I know all the theory about git, bisect, etc, but not how to actually do it.  If you want to throw a few command examples at me (no explanation needed, just raw commands) that would help, or send me to a good link.

For right now, I'm working on undoing some git code changes that look nasty manually in mceusb.c and compiling them/testing.  That much I know how to do.  If I have luck I'll report here immediately.

Thanks!!

Comment 13 Alec Leamas 2015-03-11 14:03:02 UTC

You don't need a src.rpm, you should just compile a module. Use the Makefile in comment #11, you just need to patch the obj-m += ... line. Using that, it should be a simple 'make' + 'sudo make install'.

As for the git repo, the trick is to clone torvald's kernel repo on github. You have done that, right. After that you could do something like

- Checkout an old version (start of bisect) of Torvalds repo at somewhere/
- create a new empty directory newrepo/
   + cd newrepo
   + git init
   + cp /somewhere/driver/media/rc/mceucb.c .
   + git add mceusb.c
   + git commit am "Initial: parent hash fa1afe1"
- then for each hash listed by 'git log --oneline mceusb.c' starting with first commit after the initial:
   + checkout commit in torvald's repo 
   + cp /somewhere/driver/media/rc/mceucb.c .
   + git commit -am "parent hash deadbee"

After this you could make a clean bisect, using make + "make install" to test.

Comment 14 Trevor Cordes 2015-03-11 16:01:40 UTC

OK, I'm trying your git steps, and your Makefile works great.

We have a big problem though, the bisect and make show that functions in other files are changing not that long ago and the make is failing with missing functions, like ‘rc_set_allowed_protocols’ and missing struct members like rc_dev allowed_protos.

I can only compile down to c5540fb but not 0cacb46 (linus commit hashes).

It looks like I need to be able to compile going back 10 linus commits on that file in order to bisect back to where I think it was last working.  So how do I do that if other dependent files are changing also?  Is there a way to do your steps on multiple files?  I guess it might depend on how far up the module dep chain we must go :-(

Comment 15 Alec Leamas 2015-03-11 16:30:02 UTC

> OK, I'm trying your git steps, and your Makefile works great.
> 
> We have a big problem though, the bisect and make show that functions in other
> files are changing not that long ago and the make is failing with missing
> functions, like ‘rc_set_allowed_protocols’ and missing struct members like
> rc_dev allowed_protos.
> 
> I can only compile down to c5540fb but not 0cacb46 (linus commit hashes). 

This is as anticipated...

> It looks like I need to be able to compile going back 10 linus commits on that
> file in order to bisect back to where I think it was last working.  So how do I
> do that if other dependent files are changing also?  Is there a way to do your
> steps on multiple files? 

Never done that. But if this is more the one file, it'd probably better to make a full kernel rebuild and use that for bisecting (horrors...)

I think...

--alec

Comment 16 Trevor Cordes 2015-03-11 16:48:21 UTC

OK, then please allow me to ask one question... How do I use linus tree for bisection but still use/compile with the fedora specs/config/options ie: do it the "fedora" way?  I never figured this out and can't find the info anywhere.  Is it even possible?  I don't want (or know how) to customize the kernel config myself to be exactly like Fedora 21 stock kernel :-(

I don't relish compiling the whole kernel on this Mythtv box as it's only a Pentium-4-D!!  Ouch x 2.

In the meantime I'll keep hacking at my other ideas/options.  Maybe I'll get lucky.

Thanks!!

Comment 17 Trevor Cordes 2015-03-11 17:03:01 UTC

This is weird.  I put some debug outputs in mceusb to see what paths it was taking for the bit int vs bulk changes in 0cacb46ace1f433f0ab02af10686f6dc50b5d268.

I ran the module and my test and it shows that out is interrupt but in is bulk?  Is that even valid?  Isn't the usb setup either bulk or int for a device but not both?  Maybe that's the problem?

But if I try forcing int on everything, trying to recreate what the old code did, the module throws "BOGUS urb xfer" stack traces.  There must be some other init code somewhere that handles the setup for that.

Comment 18 Trevor Cordes 2015-03-11 18:20:48 UTC

I put a bunch more debug printks in the mceusb and it's weird, it doesn't seem to be hanging in that module.  I'm playing with lirc_dev.c now to see if it's hanging in there.  This is getting complicated as the code path is tough to follow for me.  Maybe an entire kernel bisect will have to be done.  (But I don't know how the "Fedora" way!)

Comment 19 Trevor Cordes 2015-03-11 18:21:34 UTC

I see some debug code in the modules already: how do I turn it on so I can see it?

Comment 20 Trevor Cordes 2015-03-12 14:22:47 UTC

I found where all the kernel builds live on koji (yay!) and am "bisecting" using those to narrow things down.

http://koji.fedoraproject.org/koji/packageinfo?buildStart=150&packageID=8&buildOrder=-completion_time&tagOrder=name&tagStart=0#buildlist

In terms of location on the page, I have it narrowed down now:

works:
3.14.0-1.fc21.i686+PAE
3.16.0-1.fc21.i686+PAE
3.16.3-302.fc21.i686+PAE

broken:
3.17.0-0.rc7.git1.1.fc22.i686+PAE
3.17.0-301.fc21.i686+PAE
3.17.1-302.fc21.i686+PAE
3.17.3-300.fc21.i686+PAE
3.18.8

In what order should I try the last few?  Is this enough to narrow it down to changes on the relevant module files?  I have a funny feeling I'm going to find the very first 3.17.x build is broken.  So should I keep going on the 3.16.3+ builds?  The problem is they are only built for fc20, but I guess that shouldn't matter?  I've already run a fc20 kernel on this box once briefly.  It always freaks me out though.

Now, how do we turn koji builds into git commit hashes?

Comment 21 Trevor Cordes 2015-03-12 17:09:15 UTC

OK, I've narrowed it down to a tiny set of source code changes, not using git or bisect, but by using the koji kernel build rpms and then the src rpms to diff the fedora version of the sources.

The changes to mceusb look innocuous.  I will attach the diff.  I bet the problem is in another module.  My question is, what .c files should I be looking at?  Can't I just follow the chain up from mceusb?  Or are there any "core" or "library" .c files that have functions that are heavily used when trying to transmit with the blaster?  It would be nice to narrow my focus.

Comment 22 Trevor Cordes 2015-03-12 17:11:19 UTC

Created attachment 1001148 [details]
diff between good and bad version of mceusb.c

good 3.16.3-302.fc21.i686+PAE
bad  3.17.0-0.rc0.git4.2.fc22.i686+PAE

fedora version, not vanilla

Comment 23 Alec Leamas 2015-03-12 18:46:08 UTC

Thanks Trevor for a great job. Re-assigning to kernel to get some feedback on this.

Comment 24 Trevor Cordes 2015-03-12 18:51:13 UTC

The corresponding commits (only as far as mceusb.c is concerned) that match the diff of versions in comment #2 are:

c5540fbb9de39ceec108a889133664a887c2f55a
9683e01effc9ab3eb3c4584a39a6b87dadf5f391
5b8c8d41919ac4985dfacd31a860e4f00f52d3a6

The newest 3 commits in fact.  So if the bug is in mceusb.c itself, which I'm starting to doubt, the bug is in those commits.  Perhaps I could use those commit objects to look at changes in other relevant files that might also cause the bug?

Or if I could figure out how to combine fedora build setup with vanilla commits for a bisect I could do a real bisect and some kernel compiles...

Comment 25 Trevor Cordes 2015-03-14 14:16:23 UTC

I'm ready to do the kernel bisect between the commits I've identified to solve this bug, but I still don't know how to merge the concept of vanilla bisects with Fedora's rpmbuild/config process.

I attempt to use Igor's wonderful script:
http://fedoraproject.org/wiki/User:Ignatenkobrain/Kernel/Bisection

But it still has the 32-bit bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1090186

So that can't be used.

Then I read on Fedora mailing lists that I can copy over the config, but then I see all you get is a vmlinuz and I have no idea how I can get that to install on my box "the fedora way" so I can get a bootable system: who installs the zillion modules?  Who makes the initrd?  I've never made my own kernel config before, and I don't want to now, as I'll screw something up and blow up this production machine I'm doing all this testing on.

Isn't there any resource besides Igor's on how to bisect vanilla "the Fedora way"?  Why doesn't the Fedora ecosystem provide some good instructions on how to do this?  Surely more Fedora people would help solve kernel bugs if we had an easy way to bisect (that works on 32-bit).

Thanks!!

Comment 26 Trevor Cordes 2015-03-18 22:29:16 UTC

I'm trying to do a vanilla bisect with a Fedora config, as per Josh Boyer's comments:

https://lists.fedoraproject.org/pipermail/devel/2013-August/187988.html

But I hit yet another roadblock.  There's a bug in scripts/basic/fixdep on my system!  (Nothing is going right when it comes to fixing my blaster!)

n19 make bzImage
  HOSTCC  scripts/basic/fixdep
fixdep: error fstat'ing depfile: scripts/basic/.fixdep.d: Value too large for defined data type
scripts/Makefile.host:118: recipe for target 'scripts/basic/fixdep' failed
make[2]: *** [scripts/basic/fixdep] Error 2
/data/Tmp/Linux-git/linux-git/Makefile:443: recipe for target 'scripts_basic' failed
make[1]: *** [scripts_basic] Error 2
  SYSTBL  arch/x86/syscalls/../include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_x32.h
  HOSTCC  scripts/basic/fixdep
fixdep: error fstat'ing depfile: scripts/basic/.fixdep.d: Value too large for defined data type
scripts/Makefile.host:118: recipe for target 'scripts/basic/fixdep' failed
make[1]: *** [scripts/basic/fixdep] Error 2
Makefile:443: recipe for target 'scripts_basic' failed
make: *** [scripts_basic] Error 2


Googling around I see that this is because my box is 32-bit and my filesystem is XFS which now (Fedora default) uses 64-bit inodes.  Should I open another bug for this?

I'm going to try to find a workaround and/or move my build files to a smaller drive using ext4 instead of 64-bit XFS.

Comment 27 Trevor Cordes 2015-03-18 22:38:28 UTC

I found this on the lkml regarding adding -D_FILE_OFFSET_BITS=64 to the compile of fixdep to fix my problem in comment #26:

https://lkml.org/lkml/2013/8/29/195

But I can't figure out where that setting would go in which Makefile.  Can anyone provide guidance so I can start bisecting?  Thanks!

Comment 28 Mauro Carvalho Chehab 2015-03-18 23:35:56 UTC

There are very few changes at Remote Controller code between 3.16 and 3.17:

$ git lg v3.16..v3.17 drivers/media/rc
32570579cbde [media] ir-rc5-decoder: print where decoding fails
d24b69fc6b0a [media] rc-main: allow raw protocol drivers to restrict the allowed protos
5b8c8d41919a [media] mceusb: select default keytable based on vendor
9683e01effc9 [media] mceusb: add support for newer cx231xx devices
1dee9b59d69a [media] rc: Add support for decoding XMP protocol
356237d67a44 [media] ati_remote: Better default keycodes
220edfb0f724 [media] ati_remote: Add comments to keycode table
d701c8318bc2 [media] ati_remote: Sort buttons in top-to-bottom order
1e182708c26c [media] ati_remote: Use non-alomic __set_bit
89b0c0d2779e [media] ati_remote: Merge some duplicate code
3f245b9cb444 [media] ati_remote: Shrink the ati_remote_tbl even more
8ecd5e32d8c8 [media] ati_remote: Generalize KIND_ACCEL to accept diagonals
f95589cc41ff [media] ati_remote: Delete superfluous input_sync()
528abb9888d8 [media] ati_remote: Shrink ati_remote_tbl structure
20d7e3ca1be7 [media] ati_remote: Check the checksum
d6740d86deda [media] rc-core: fix various sparse warnings
4924a311a62f [media] rc-core: rename ir-raw.c
e87b540be2dd [media] rc-core: merge rc5 and streamzap decoders
9d2f1d3cdedd [media] rc-core: rename dev->scanmask to dev->scancode_mask
c5540fbb9de3 [media] rc-core: remove protocol arrays
da6e162d6a46 [media] rc-core: simplify sysfs code
0bc56cbef9fb [media] saa7134: NEC scancode fix
120703f9eb32 [media] rc-core: document the protocol type
2886f013c312 [media] bt8xx: fixup RC5 decoding
b4e3e59fb59c [media] rc: add sunxi-ir driver
2ac6f6305efe [media] rc: img-ir: Expand copyright headers with GPL notices
50e59058a420 Merge tag 'v3.16-rc1' into patchwork

Most of the above are on other drivers.The two changes at mceusb are related to the support for WinTV 930C-HD and PCTV QuatroStick. Likely, won't be affecting you. So, it is more likely that some of the rc-core changes is the culprit,
e. g. one of those:

d6740d86deda [media] rc-core: fix various sparse warnings
4924a311a62f [media] rc-core: rename ir-raw.c
e87b540be2dd [media] rc-core: merge rc5 and streamzap decoders
9d2f1d3cdedd [media] rc-core: rename dev->scanmask to dev->scancode_mask
c5540fbb9de3 [media] rc-core: remove protocol arrays
da6e162d6a46 [media] rc-core: simplify sysfs code
120703f9eb32 [media] rc-core: document the protocol type

On a quick look the changes that would more likely break it would be:

c5540fbb9de3 [media] rc-core: remove protocol arrays
da6e162d6a46 [media] rc-core: simplify sysfs code

You could try to revert each of them in order to see what patch broke for LIRC.

Before that, perhaps you should check if the LIRC protocol is enabled. 

Btw, there was indeed a regression for LIRC due to the above patches. Basically, the LIRC protocol were disabled after those patches. The fix is
done via this changeset:

commit 14edb593338e3811e818aba286237c365f8881a1
Author: Tomas Melin <tomas.melin>
Date:   Tue Oct 28 15:43:14 2014 -0300

    [media] rc-core: fix protocol_change regression in ir_raw_event_register
    
    IR receiver using nuvoton-cir and lirc required additional configuration
    steps after upgrade from kernel 3.16 to 3.17-rcX. Bisected regression to
    commit da6e162d6a4607362f8478c715c797d84d449f8b ("[media] rc-core:
    simplify sysfs code").
    
    The regression comes from adding function change_protocol in ir-raw.c.
    It changes behaviour so that only the protocol enabled by driver's
    map_name will be active after registration. This breaks user space
    behaviour, lirc does not get key press signals anymore.
    
    Enable lirc protocol by default for ir raw decoders to restore original
    behaviour.
    
    Cc: stable.org # for v3.17
    Signed-off-by: Tomas Melin <tomas.melin>
    Acked-by: David Härdeman <david>
    Signed-off-by: Mauro Carvalho Chehab <mchehab.com>

Perhaps the fixup is not covering your case, or was not backported to Fedora.

Comment 29 Trevor Cordes 2015-03-19 03:53:09 UTC

(In reply to Mauro Carvalho Chehab from comment #28)
> 
> d6740d86deda [media] rc-core: fix various sparse warnings
> 4924a311a62f [media] rc-core: rename ir-raw.c
> e87b540be2dd [media] rc-core: merge rc5 and streamzap decoders
> 9d2f1d3cdedd [media] rc-core: rename dev->scanmask to dev->scancode_mask
> c5540fbb9de3 [media] rc-core: remove protocol arrays
> da6e162d6a46 [media] rc-core: simplify sysfs code
> 120703f9eb32 [media] rc-core: document the protocol type
> 
> On a quick look the changes that would more likely break it would be:
> 
> c5540fbb9de3 [media] rc-core: remove protocol arrays
> da6e162d6a46 [media] rc-core: simplify sysfs code
> 
> You could try to revert each of them in order to see what patch broke for
> LIRC.

I will look at those commits!  That was exactly what I was looking for, someone who knew where "up the source chain" I could look next since mceusb.c doesn't seem to be the problem.

> Before that, perhaps you should check if the LIRC protocol is enabled.

How do I do that?  The lirc modules are in lsmod.  The /dev/lirc0 is there.  The Fedora .config for my kernel shows all entries with *lirc* as y or m.

> Btw, there was indeed a regression for LIRC due to the above patches.
> Basically, the LIRC protocol were disabled after those patches. The fix is
> done via this changeset:
> 
> commit 14edb593338e3811e818aba286237c365f8881a1
>     [media] rc-core: fix protocol_change regression in ir_raw_event_register
>
> Perhaps the fixup is not covering your case, or was not backported to Fedora.

I checked the source of the Fedora kernel I am running and 14edb593338e3811e818aba286237c365f8881a1 is in it.  So like you said, it might not be covering my case.  Or it's not the culprit.  Is there a way I can tell if it's applying to me?  Would it be helpful for me to put printk's inside the if's and recompile and see if the code is running?  If you have any pointers on what code paths/values to look for, I can tinker and find out.

Thank you so much for taking a moment to share your wisdom, you have set me on a good path of further things to try and I will see if I can't get this fixed and even offer a patch!

Comment 30 Trevor Cordes 2015-03-20 20:46:23 UTC

Hooray, I manged to get enough space on a non-64bit-inode disk to compile vanilla and do a real bisect using Josh Boyer's tips in that email above.

After 2 days of non-stop bisects I got the culprit (I'm pretty sure):

9d2f1d3cdedd4d3efff8d14f1f49cf73e2f5cc36 is the first bad commit
commit 9d2f1d3cdedd4d3efff8d14f1f49cf73e2f5cc36
Author: David Härdeman <david>
Date:   Thu Apr 3 20:32:26 2014 -0300

    [media] rc-core: rename dev->scanmask to dev->scancode_mask
    
    We already have dev->scancode_filter and dev->scancode_wakeup_filter
    so rename dev->scanmask to dev->scancode_mask for consistency.
    
    Signed-off-by: David Härdeman <david>
    Signed-off-by: Mauro Carvalho Chehab <m.chehab>

:040000 040000 6b7369e197f4c35b41818ec6ebae9478fa2d88a3 e0fd23081240a74852497af1380b37a1827b1a37 M	drivers
:040000 040000 155f3d47dcad21ead01e81e82bfd7e791fb24be9 ede1ce73b9166a4ab1de6680c955e343c870d0f8 M	include


I will look at the code now, but I was having trouble figuring out some of the bigger commits as some touch many files going up many levels.  If anyone wants to give me some ideas of what to try to undo from this commit and apply it to origin-head (or better yet fedora's latest src.rpm) I can play around with it.

Thanks for everyone's help.  Almost there!

Comment 31 Trevor Cordes 2015-03-20 20:51:47 UTC

Hmm, 9d2f1d3cdedd doesn't seem to really be doing anything much.  Could I have made a mistake with my bisect?

git bisect start 'drivers/media/rc'
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [9d2f1d3cdedd4d3efff8d14f1f49cf73e2f5cc36] [media] rc-core: rename dev->scanmask to dev->scancode_mask
git bisect bad 9d2f1d3cdedd4d3efff8d14f1f49cf73e2f5cc36
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# good: [67dd8f35c2d8ed80f26c9654b474cffc11c6674d] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 67dd8f35c2d8ed80f26c9654b474cffc11c6674d
# good: [2886f013c312d288fdd4b8472b82a0a9590e3134] [media] bt8xx: fixup RC5 decoding
git bisect good 2886f013c312d288fdd4b8472b82a0a9590e3134
# good: [0bc56cbef9fb496c79e34c46329f016064d3b0f6] [media] saa7134: NEC scancode fix
git bisect good 0bc56cbef9fb496c79e34c46329f016064d3b0f6
# good: [c5540fbb9de39ceec108a889133664a887c2f55a] [media] rc-core: remove protocol arrays
git bisect good c5540fbb9de39ceec108a889133664a887c2f55a
# first bad commit: [9d2f1d3cdedd4d3efff8d14f1f49cf73e2f5cc36] [media] rc-core: rename dev->scanmask to dev->scancode_mask


I confirmed the initial good but I don't think I confirmed the initial bad, I think I pulled that from my earlier tests.  I'm going to rebisect this time with a for-sure bad starting point.  Thankfully I can use some of the above results to narrow the next set of builds.

Comment 32 Trevor Cordes 2015-03-22 01:07:45 UTC

Hooray #2 (The Real One):
ae045e2455429c418a418a3376301a9e5753a0a8 is the first bad commit

This time I had a good bisect with some good and some bad ones, so for sure this is the correct bad commit.

git bisect start '--' 'drivers/media'
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# good: [67dd8f35c2d8ed80f26c9654b474cffc11c6674d] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 67dd8f35c2d8ed80f26c9654b474cffc11c6674d
# good: [2886f013c312d288fdd4b8472b82a0a9590e3134] [media] bt8xx: fixup RC5 decoding
git bisect good 2886f013c312d288fdd4b8472b82a0a9590e3134
# good: [0bc56cbef9fb496c79e34c46329f016064d3b0f6] [media] saa7134: NEC scancode fix
git bisect good 0bc56cbef9fb496c79e34c46329f016064d3b0f6
# good: [c5540fbb9de39ceec108a889133664a887c2f55a] [media] rc-core: remove protocol arrays
git bisect good c5540fbb9de39ceec108a889133664a887c2f55a
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# good: [36efec48e2e6016e05364906720a0ec350a5d768] [media] cx23885: Add si2165 support for HVR-5500
git bisect good 36efec48e2e6016e05364906720a0ec350a5d768
# good: [833e1063266ebbf75934f1171c6546cdf33e9848] [media] media: atmel-isi: convert the pdata from pointer to structure
git bisect good 833e1063266ebbf75934f1171c6546cdf33e9848
# bad: [6a03dc92cc2edfa2257502557b9f714893987383] [media] cx18: fix kernel oops with tda8290 tuner
git bisect bad 6a03dc92cc2edfa2257502557b9f714893987383
# good: [f4d33337eac4007793ca11fd1ab68d91ce7aa762] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good f4d33337eac4007793ca11fd1ab68d91ce7aa762
# bad: [ae045e2455429c418a418a3376301a9e5753a0a8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad ae045e2455429c418a418a3376301a9e5753a0a8
# good: [c835a677331495cf137a7f8a023463afd9f032f8] net: set name_assign_type in alloc_netdev()
git bisect good c835a677331495cf137a7f8a023463afd9f032f8
# good: [8fd90bb889635fa1e7f80a3950948cc2e74c1446] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 8fd90bb889635fa1e7f80a3950948cc2e74c1446
# first bad commit: [ae045e2455429c418a418a3376301a9e5753a0a8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

I'll dive into the code shortly.  Anyone else have any ideas?

ae045e is a bit unexpected as the culprit...

Comment 33 Trevor Cordes 2015-03-22 04:14:11 UTC

I looked at the diff for my above bisect and none looked like they could cause this bug.  So I played with git a bit more...

Oh, I see, because I started the bisect with:
git bisect start '--' 'drivers/media'

that means I'm not really done, because the problem probably (as it turns out) isn't in drivers/media.

I just started a blanket bisect (no path specified) between ae045e and 8fd90b and there's a few thousand more commits to go.  I'll report back in 2-3 days when I finally (hopefully!) have the *real* commit that causes this bug.  Wish my buggy computer was faster than a P4.

Comment 34 Trevor Cordes 2015-03-23 05:15:47 UTC

Hooray #3 (The Real Real One):

166afb64511eef08e13331b970c44fe91cea45ef is the first bad commit

This time I didn't set any path when bisecting, I bisected every commit between a known/tested good and bad, and my bisect gave me both good and bad commits during testing.

166afb64511eef08e13331b970c44fe91cea45ef seems like a very odd culprit.  I hope again I didn't do anything wrong.  Here is my procedure:

git bisect [good|bad]
n19 make bzImage && n19 make modules && n19 make modules_install && n19 make install
# then confirm grub2 is set to boot correct (newly built) kernel, confirm with ll -t /boot
reboot
uname -a
# confirm in newly built kernel

strace lircd
irsend SEND_ONCE dct700 info

# if it times out, it's "bad", if it returns immediately with no error, it's "good"

repeat.

I hope I'm not supposed to do any "clean" or anything before repeating.  I didn't really do anything else besides the above.

Here's my bisect log.  Can anyone see any possible way this commit causes the problem and how?

git bisect start
# good: [8fd90bb889635fa1e7f80a3950948cc2e74c1446] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 8fd90bb889635fa1e7f80a3950948cc2e74c1446
# bad: [ae045e2455429c418a418a3376301a9e5753a0a8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad ae045e2455429c418a418a3376301a9e5753a0a8
# good: [b9aaea39f65e242303103b5283abeaefd8e538a4] staging: comedi: addi_apci_1564: remove diagnostic interrupt support code
git bisect good b9aaea39f65e242303103b5283abeaefd8e538a4
# good: [f67d251a87ccb288a3a164c5226c6ee9ce8ea53d] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
git bisect good f67d251a87ccb288a3a164c5226c6ee9ce8ea53d
# good: [ed5c41d30ef2ce578fd6b6e2f7ec23f2a58b1eba] x86: MCE: Add raw_lock conversion again
git bisect good ed5c41d30ef2ce578fd6b6e2f7ec23f2a58b1eba
# good: [61ab9efddf51cbc0d57356a4d650785cf5721fbe] net/usb/hso: Add support for Option GTM671WFS
git bisect good 61ab9efddf51cbc0d57356a4d650785cf5721fbe
# bad: [e7fda6c4c3c1a7d6996dd75fd84670fa0b5d448f] Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad e7fda6c4c3c1a7d6996dd75fd84670fa0b5d448f
# bad: [154adc14b1ce941b642a3775f2423c20d55c026b] mfd: cros_ec_spi: Use ktime_get_ns()
git bisect bad 154adc14b1ce941b642a3775f2423c20d55c026b
# good: [3252a646aa2cf706b2a26433a8bd9cb2e5dce410] clocksource: exynos_mct: Only use 32-bits where possible
git bisect good 3252a646aa2cf706b2a26433a8bd9cb2e5dce410
# bad: [7c032df5570388044b4efda3d9f4d2ffb96a3116] timekeeping: Provide internal ktime_t based data
git bisect bad 7c032df5570388044b4efda3d9f4d2ffb96a3116
# good: [24e4a8c3e8868874835b0f1ad6dd417341e99822] ktime: Kill non-scalar ktime_t implementation for 2038
git bisect good 24e4a8c3e8868874835b0f1ad6dd417341e99822
# bad: [7d489d15ce4be5310ca60e5896df833f9b3b4088] timekeeping: Convert timekeeping core to use timespec64s
git bisect bad 7d489d15ce4be5310ca60e5896df833f9b3b4088
# bad: [b17b20d70dcbe48dd1aa6aba073a60ddfce5d7db] ktime: Change ktime_set() to take 64bit seconds value
git bisect bad b17b20d70dcbe48dd1aa6aba073a60ddfce5d7db
# bad: [166afb64511eef08e13331b970c44fe91cea45ef] ktime: Sanitize ktime_to_us/ms conversion
git bisect bad 166afb64511eef08e13331b970c44fe91cea45ef
# first bad commit: [166afb64511eef08e13331b970c44fe91cea45ef] ktime: Sanitize ktime_to_us/ms conversion


I won't include the bad commit's diff, unless someone requests, I assume everyone can easily git it:

git diff 166afb64511^ 166afb64511

Thanks for any help!!

Comment 35 Trevor Cordes 2015-03-23 06:37:55 UTC

Ah!  Possibly a bug in this commit?!

ktime_to_us returns s64, but the commit changes it so ktime_to_us just returns what ktime_divns returns, and ktime_divns returns a u64!  If the u64 is big enough, wouldn't it wrap s64 around to a negative number?  Or, perhaps if some caller is passing in negative ktime_t to begin with it will trigger without having to hit big numbers.  With my limited knowledge of C, I am stabbing in the dark here.

And maybe no one else sees this because I'm one of the last people on 32-bit?  Or maybe it's a combo of 32-bit + infrared code.  Seems strange, as this code has been in the kernel many months.

I am right now compiling the latest FC21 kernel 3.19.1-201.fc21 with a self-made patch to undo the changes to just those 2 functions to be what they were before.  In a few hours I'll be able to boot into the test kernel and know for sure if this is the bug!

These functions are called 75 times in the source tree, including in drivers, but none in drivers/media.  I'm crossing my fingers that my little patch fixes this problem and it's solved once and for all.  Otherwise I have no idea what I did wrong in the bisect this time.

Comment 36 Trevor Cordes 2015-03-23 17:02:16 UTC

Final hooray!  This commit *IS INDEED* my bug!  I reverted a tiny bit of the commit to use the old method instead of the function call and compiled it into stock kernel-PAE-3.19.1-201.fc21.i686 src.rpm and built/installed and *my bug goes away*!

I just posted my results to the LKML to see if we can get some input on what the "proper" fix is.  Just reverting to the old method is a kludge.  We'd want to fix the new method to work properly in my use case.  I'm still leaning towards the u64/s64 types being the problem.

I'll report back here with any news.

In the meantime, any chance my patch can get applied to Fedora 21 stock kernel rpm? :-)

Comment 37 Trevor Cordes 2015-03-23 17:04:05 UTC

Created attachment 1005505 [details]
PoC reverts the changes that cause my bug, for kernel-3.19.1-201.fc21.i686

Not the final solution, but solves my bug for me right now.

Comment 38 Fedora Kernel Team 2015-04-28 18:31:47 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 21 kernel bugs.

Fedora 21 has now been rebased to 3.19.5-200.fc21.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 22, and are still experiencing this issue, please change the version to Fedora 22.

If you experience different issues, please open a new bug report for those.

Comment 39 Trevor Cordes 2015-04-29 08:14:25 UTC

This bug persists in 3.19.5-200.fc21.

I have bisected this bug down to the 1 or 2 tiny functions that cause it, like 2 lines of code, and posted to both kernel and rhbz, and posted to kernel mailing list yet not a single person has responded?  What do I have to do to get some eyeballs on this?  From what I can tell by the offending code, this bug could hit more people than just the mceusb driver.  One would think the ktime functions are important enough to make bug free?

In the meantime I'm running 3.19.1-201.fc21 with my reversion of the offending patch, and it's worked great for over a month.

Comment 40 Josh Boyer 2015-04-29 11:47:27 UTC

LKML is kind of a dumping ground for patches and if you didn't have the right maintainers CC'd it likely didn't even get read.

I don't have a copy of your email locally any longer for some reason.  Could you reply to it and make sure you have John Stultz <john.stultz> and Thomas Gleixner <tglx> on CC?

Comment 41 Trevor Cordes 2015-05-11 16:52:46 UTC

The LKML guys have solved this, and it looks like a patch will make its way into mainline.  The details are on LKML "ktime: Fix ktime_divns to do signed division" and I copied some of the info into upstream kernel bz: https://bugzilla.kernel.org/show_bug.cgi?id=95431 along with the patch.

If this patch could make it into Fedora 21 that would be great!  To apply to F21 3.19.5 you also need to apply one other git commit patch first.  I'm not sure of the protocol for this.  I made a single patch containing both patches in order to rpmbuild a test kernel.  Let me know how best to provide these things for inclusion in F21.

Thanks to all for fixing this bug!

Comment 42 Fedora Update System 2015-05-28 20:40:44 UTC

kernel-4.0.4-303.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/kernel-4.0.4-303.fc22

Comment 43 Fedora Update System 2015-05-28 20:42:00 UTC

kernel-4.0.4-202.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-4.0.4-202.fc21

Comment 44 Fedora Update System 2015-05-28 20:42:03 UTC

kernel-4.0.4-101.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-4.0.4-101.fc20

Comment 45 Trevor Cordes 2015-05-30 04:24:10 UTC

4.0.4-201.fc21.i686+PAE does *not* fix this bug for me.  Are you sure it contains the relevant patch or commit?

11:19pm ~#uptime ; uname -a
 23:19:42 up 1 min,  1 user,  load average: 1.26, 0.59, 0.22
Linux piles.mydomain.com 4.0.4-201.fc21.i686+PAE #1 SMP Thu May 21 16:14:41 UTC 2015 i686 i686 i386 GNU/Linux
11:19pm ~#irsend SEND_ONCE dct700 info
irsend: timeout
Exit 1

[... reboot ...]

11:23pm ~#uptime ; uname -a
 23:23:10 up 1 min,  1 user,  load average: 0.93, 0.44, 0.16
Linux piles.mydomain.com 3.19.5-201.fc21.i686+PAE #1 SMP Fri May 8 16:41:05 CDT 2015 i686 i686 i386 GNU/Linux
11:23pm ~#irsend SEND_ONCE dct700 info
11:23pm ~#irsend SEND_ONCE dct700 info

Comment 46 Laura Abbott 2015-05-30 15:11:21 UTC

It isn't fixed in 201, it is fixed in 202. Please try the newest version.

Comment 47 Fedora Update System 2015-05-30 15:46:27 UTC

Package kernel-4.0.4-202.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-4.0.4-202.fc21'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-9127/kernel-4.0.4-202.fc21
then log in and leave karma (feedback).

Comment 48 Trevor Cordes 2015-05-31 06:40:23 UTC

Sorry!  Didn't know it was in -testing only.  I got it from testing, rebooted and it does indeed fix the bug.  Hooray!

Thanks to everyone for your help, thanks especially to Josh and the Redhatters helping me to get attention from the other kernel devs.

I will leave karma shortly (signing up now, first time!).

Comment 49 Fedora Update System 2015-06-01 17:03:50 UTC

kernel-4.0.4-303.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 50 Fedora Update System 2015-06-01 17:06:52 UTC

kernel-4.0.4-202.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.