Bug 493431

Summary: sm-notify should recover from temporary DNS resolution failures
Product: [Fedora] Fedora
Reporter: Chuck Lever <chuck.lever>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 10
CC: dcbw, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-11 11:05:53 EDT
Attachments:
Untested proposed fix (flags: none)

Description Chuck Lever 2009-04-01 14:07:32 EDT
Description of problem:

sm-notify is used to notify NFS peers of local reboots so that NFS lock recovery can be initiated.  The sm-notify command fails if DNS resolution isn't working.  This might occur during a boot sequence before networking is configured.

Version-Release number of selected component (if applicable):

Seen on Fedora 10, but likely exists on any recent system using NetworkManager.

How reproducible:

100%

Steps to Reproduce:
1. Run any application that acquires NFS locks on an NFS client that uses NetworkManager
2. Trigger a crash of the client (e.g. "echo b > /proc/sysrq-trigger")
3. When the client reboots, there will be error reports in the system log from sm-notify
  
Actual results:

Mar 30 13:11:34 ingres sm-notify[1692]: tarkus.1015granger.net doesn't seem to be a valid address, skipped
Mar 30 13:11:34 ingres Backgrounding to notify hosts...
Mar 30 13:11:35 ingres kernel: RPC: Registered udp transport module.
Mar 30 13:11:35 ingres kernel: RPC: Registered tcp transport module.
Mar 30 13:11:35 ingres acpid: starting up
Mar 30 13:11:36 ingres kernel: it87: Found IT8718F chip at 0xe80, revision 5
Mar 30 13:11:36 ingres kernel: it87: in3 is VCC (+5V)
Mar 30 13:11:36 ingres kernel: it87: in7 is VCCH (+5V Stand-By)
Mar 30 13:11:36 ingres acpid: client connected from 1936[68:68]
Mar 30 13:11:37 ingres NetworkManager: <info>  starting...

I then added the fixed IP address of tarkus to ingres' /etc/hosts file, and tried again.

Mar 30 13:25:43 ingres rpc.statd[1692]: Version 1.1.4 Starting
Mar 30 13:25:43 ingres sm-notify[1694]: Sending Reboot Notification to  'tarkus.1015granger.net' failed: errno 101 (Network is unreachable)
Mar 30 13:25:43 ingres kernel: RPC: Registered udp transport module.
Mar 30 13:25:43 ingres kernel: RPC: Registered tcp transport module.
Mar 30 13:25:44 ingres acpid: starting up
Mar 30 13:25:45 ingres kernel: it87: Found IT8718F chip at 0xe80, revision 5
Mar 30 13:25:45 ingres kernel: it87: in3 is VCC (+5V)
Mar 30 13:25:45 ingres kernel: it87: in7 is VCCH (+5V Stand-By)
Mar 30 13:25:45 ingres acpid: client connected from 1936[68:68]
Mar 30 13:25:45 ingres sm-notify[1694]: Sending Reboot Notification to 'tarkus.1015granger.net' failed: errno 101 (Network is unreachable)
Mar 30 13:25:45 ingres NetworkManager: <info>  starting...

...

Mar 30 13:25:49 ingres sm-notify[1694]: Sending Reboot Notification to 'tarkus.1015granger.net' failed: errno 101 (Network is unreachable)
Mar 30 13:25:50 ingres NetworkManager: <info>  (eth0): device state change: 1 -> 2
Mar 30 13:25:50 ingres NetworkManager: <info>  (eth0): bringing up device.

...

Mar 30 13:25:54 ingres NetworkManager: <info>  (eth0): device state change: 7 -> 8
Mar 30 13:25:54 ingres NetworkManager: <info>  Policy set 'System eth0' (eth0) as default for routing and DNS.
Mar 30 13:25:54 ingres NetworkManager: <info>  Activation (eth0) successful, device activated.
Mar 30 13:25:54 ingres NetworkManager: <info>  Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.

...

Then finally:

Mar 30 13:25:57 ingres Backgrounding to notify hosts...

(This message appears only after notification is complete: it is issued
before daemon() and openlog() are called, so it is buffered up.)
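
The second log shows the send failing repeatedly with errno 101 (Network is unreachable) until eth0 comes up, after which notification completes. A minimal sketch of how such send errors might be sorted into "retry later" and "give up" is below. The function name send_error_is_transient() and the exact set of errno values are assumptions for illustration; they are not taken from the sm-notify source.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return true when a failed send is likely to
 * succeed later, for example once the boot sequence has configured
 * the network. The list of errno values is an assumption.
 */
static bool send_error_is_transient(int err)
{
    switch (err) {
    case ENETUNREACH:   /* no route to the network yet */
    case EHOSTUNREACH:  /* no route to the host yet */
    case ENETDOWN:      /* interface is not up */
    case EAGAIN:        /* resource temporarily unavailable */
    case EINTR:         /* interrupted by a signal */
        return true;
    default:
        return false;
    }
}
```

A caller would check errno after a failed sendto() and keep the host on its retry list when this returns true.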

Expected results:

sm-notify should be able to recover from temporary DNS resolution failures, just as it does from temporary rpcbind problems.  When networking becomes available, sm-notify should then be able to notify hosts of the reboot.
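
One way to tell a temporary resolution failure from a permanent one is to look at the getaddrinfo() return code. The sketch below is illustrative only and is not the attached patch: classify_lookup() and try_resolve() are hypothetical names, and treating only EAI_AGAIN as retryable is an assumption. A resolver that cannot reach any name server during boot may report a different code, so a real fix might choose to retry on any lookup failure.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

enum lookup_result { LOOKUP_OK, LOOKUP_RETRY, LOOKUP_FAILED };

/* Map a getaddrinfo() return code to an action (assumed policy). */
static enum lookup_result classify_lookup(int gai_err)
{
    if (gai_err == 0)
        return LOOKUP_OK;
    if (gai_err == EAI_AGAIN)   /* temporary failure in name resolution */
        return LOOKUP_RETRY;
    return LOOKUP_FAILED;       /* anything else is treated as permanent */
}

/*
 * Resolve a peer hostname to an IPv4 datagram address.
 * On LOOKUP_OK the caller owns *res and must call freeaddrinfo().
 */
static enum lookup_result try_resolve(const char *hostname,
                                      struct addrinfo **res)
{
    struct addrinfo hints;
    int err;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_DGRAM;

    *res = NULL;
    err = getaddrinfo(hostname, NULL, &hints, res);
    return classify_lookup(err);
}
```

With this split, a host whose lookup returns LOOKUP_RETRY would stay on the notification list instead of being skipped.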

Additional info:

You could look at this as a boot-time ordering problem.  However, I think sm-notify should be smart enough to recover from temporary DNS resolution failures.
Comment 1 Chuck Lever 2009-04-01 14:10:22 EDT
I've rewritten sm-notify to support IPv6.  The rewrite has a potential fix for this problem, which I hope will appear upstream soon.

However, currently released versions of sm-notify should be updated to address this issue.  Lock recovery failures can easily result in data corruption.
Comment 2 Chuck Lever 2009-04-15 11:04:54 EDT
Created attachment 339694 [details]
Untested proposed fix

I coded up a possible solution for this problem.  The RPC scheduler in sm-notify already retries if the RPCs time out, so I changed the logic in notify_host() to retry the notification later if the DNS lookup fails.

The patch builds, but is untested.
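
The idea of retrying a failed lookup later can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the attached patch: struct peer, peer_due(), peer_record_lookup() and the delay constants are hypothetical, and the doubling back-off is just one possible policy.

```c
#include <stdbool.h>
#include <time.h>

#define RETRY_INITIAL_SECS 2    /* assumed first retry delay */
#define RETRY_MAX_SECS     60   /* assumed upper bound on the delay */

/* Hypothetical per-host state kept while notification is pending. */
struct peer {
    const char *name;          /* hostname to resolve */
    bool        resolved;      /* true once the lookup has succeeded */
    time_t      next_attempt;  /* earliest time of the next lookup */
    unsigned    delay;         /* current back-off in seconds */
};

/* True when this peer still needs a lookup and its timer has expired. */
static bool peer_due(const struct peer *p, time_t now)
{
    return !p->resolved && now >= p->next_attempt;
}

/*
 * Record the outcome of a lookup attempt. On failure the peer is not
 * dropped; it is rescheduled with a doubled, capped delay.
 */
static void peer_record_lookup(struct peer *p, bool lookup_ok, time_t now)
{
    if (lookup_ok) {
        p->resolved = true;
        return;
    }
    p->delay = (p->delay == 0) ? RETRY_INITIAL_SECS : p->delay * 2;
    if (p->delay > RETRY_MAX_SECS)
        p->delay = RETRY_MAX_SECS;
    p->next_attempt = now + p->delay;
}
```

A main loop would call peer_due() for each pending host, attempt the lookup, and pass the result to peer_record_lookup(), so a host that cannot be resolved at boot is tried again once networking is up.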
Comment 3 Steve Dickson 2009-06-11 11:05:53 EDT
Fixed in nfs-utils-1.2.0-3