System hang occurs after UDP Tx, when using Tx descriptor writeback #6
Comments
These sysctls look interesting:
dev.ixl.1.iflib.txq00.ring_state: pidx_head: 1668 pidx_tail: 1668 cidx: 1670 state: STALLED
dev.ixl.1.iflib.txq00.r_drops: 32212354
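To make the ring_state value easier to compare across samples, here is a minimal Python sketch that parses it into fields. The free-slot helper is an assumption: it uses the common circular-buffer convention of keeping one slot reserved, and the ring size of 2048 is a hypothetical value that does not affect the result when cidx is ahead of pidx_head. Neither function name comes from iflib.

```python
import re

def parse_ring_state(value):
    """Split a 'key: value key: value ...' ring_state string into a dict."""
    fields = dict(re.findall(r"(\w+):\s*(\S+)", value))
    # Numeric fields become ints; the state name stays a string.
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}

def free_slots(pidx_head, cidx, size):
    """Free producer slots, ASSUMING one slot is always kept reserved."""
    if cidx == pidx_head:
        return size - 1
    if cidx > pidx_head:
        return cidx - pidx_head - 1
    return size - 1 - pidx_head + cidx

state = parse_ring_state(
    "pidx_head: 1668 pidx_tail: 1668 cidx: 1670 state: STALLED"
)
# size=2048 is a hypothetical ring size, unused on this code path.
slots = free_slots(state["pidx_head"], state["cidx"], size=2048)
print(state["state"], slots)
```

Under that assumption the ring has almost no room left for new buffers, which would be consistent with a large r_drops count.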
So r_drops indicates that there was no room in the mp ring to add the buffer. It sounds like the ring is simply not draining. If you enable INVARIANTS, you'll get some extra counters, which will include the number of encap failures.
Or just manually define IFLIB_DEBUG_COUNTERS to 1 in iflib.c.
FYI, the additional counters show up under net.iflib, not dev.X.Y.iflib.
Manually defining IFLIB_DEBUG_COUNTERS to 1 made the problem go away for me. :/
Belay that -- after a dozen tries of netperf, I see the hang again.
net.iflib.tx_seen seems to increment pretty rapidly. The igb interface I'm using for remote access is affecting all of the other debug counters, but that one seems to be going up by a couple hundred thousand per second, compared to tx_encap, which only increments by one every second (which I guess might be due to igb and me using mosh). The encap_ and txq_drain_ sysctls are all at 0.
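For comparing how fast the debug counters move, a rough sketch like the following turns two snapshots into per-second rates. The helper names are hypothetical, read_counter assumes the stock `sysctl -n` command is available, and the sample numbers below are made up purely to show the arithmetic; they are not measurements from this system.

```python
import subprocess

def read_counter(name):
    """Read one integer sysctl, e.g. 'net.iflib.tx_seen', via `sysctl -n`."""
    out = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip())

def counter_rates(before, after, interval_s):
    """Per-second increase of every counter present in both snapshots."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }

# Made-up snapshots taken 2 seconds apart (illustration only).
before = {"net.iflib.tx_seen": 1_000_000, "net.iflib.tx_encap": 500}
after = {"net.iflib.tx_seen": 1_400_000, "net.iflib.tx_encap": 502}
rates = counter_rates(before, after, interval_s=2.0)
print(rates)
```

On a live system the snapshots would come from calling read_counter on each name, sleeping for the interval, and reading again.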
I was going to try using dtrace, but after the script runs for ~30s it stops with the message:
That's still enough time to start checking what functions are getting called while the system is hanging, I think.
These functions are all getting called rapidly during the hang, according to dtrace probes
I'm going to agree with Drew's guess: iflib is detecting that the queues are hung, and while trying to recover the queues, it gets into a deadlock over the txq lock and the ctx lock. But why are the Tx queues hanging? Is one of the Tx/Rx functions in the driver not returning the right values for iflib to use? Are Tx interrupts not working correctly, maybe?
Updates:
Disregard, I think this was an environment issue on my part :)
There's no need to perform the interrupt unbind while holding the blkback lock, and doing so leads to the following LOR:

lock order reversal: (sleepable after non-sleepable)
 1st 0xfffff8000802fe90 xbbd1 (xbbd1) @ /usr/src/sys/dev/xen/blkback/blkback.c:3423
 2nd 0xffffffff81fdf890 intrsrc (intrsrc) @ /usr/src/sys/x86/x86/intr_machdep.c:224
stack backtrace:
#0 0xffffffff80bdd993 at witness_debugger+0x73
#1 0xffffffff80bdd814 at witness_checkorder+0xe34
#2 0xffffffff80b7d798 at _sx_xlock+0x68
#3 0xffffffff811b3913 at intr_remove_handler+0x43
#4 0xffffffff811c63ef at xen_intr_unbind+0x10f
#5 0xffffffff80a12ecf at xbb_disconnect+0x2f
#6 0xffffffff80a12e54 at xbb_shutdown+0x1e4
#7 0xffffffff80a10be4 at xbb_frontend_changed+0x54
#8 0xffffffff80ed66a4 at xenbusb_back_otherend_changed+0x14
#9 0xffffffff80a2a382 at xenwatch_thread+0x182
#10 0xffffffff80b34164 at fork_exit+0x84
#11 0xffffffff8101ec9e at fork_trampoline+0xe

Reported by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
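The general shape of that kind of fix can be sketched in user-space Python: take what you need under the lock, release it, and only then call the operation that may sleep. This is a toy analogy, not the blkback code; the Backend class, its fields, and the unbind callback are all hypothetical names.

```python
import threading

class Backend:
    """Toy stand-in for a device backend with one lock and one IRQ handle."""

    def __init__(self, irq_handle, unbind):
        self.lock = threading.Lock()
        self.irq_handle = irq_handle
        self._unbind = unbind  # may block, so it must not run under self.lock

    def disconnect(self):
        # Detach the handle while holding the lock...
        with self.lock:
            handle, self.irq_handle = self.irq_handle, None
        # ...and perform the potentially sleeping unbind after releasing it.
        if handle is not None:
            self._unbind(handle)

calls = []
backend = Backend(7, lambda h: calls.append((h, backend.lock.locked())))
backend.disconnect()
backend.disconnect()  # second call is a no-op: the handle is already gone
print(calls)
```

Clearing the handle under the lock keeps a concurrent second disconnect from unbinding twice, while the unbind itself never runs with the lock held.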
System console becomes unresponsive after running UDP Tx using netperf or tftp.
Hardware: