System hang occurs after UDP Tx, when using Tx descriptor writeback #6

Open
ricera opened this issue Mar 29, 2018 · 12 comments

ricera commented Mar 29, 2018

System console becomes unresponsive after running UDP Tx using netperf or tftp.

  • ssh/mosh sessions continue to function, and programs can be run in them
  • kldunload if_ixl or shutting down the system will kill ssh/mosh sessions, but then the system hangs without unloading the driver or completing the shutdown
  • no error/watchdog messages appear in the system log
  • when running netperf in demo mode, the interim throughput shown is higher than the physical capability of the port (e.g. 14Gb/s throughput on a 10G port)

Hardware:

  • Occurs on 10/25 GbE cards, but not 40GbE cards

ricera commented Mar 29, 2018

These sysctls look interesting:

dev.ixl.1.iflib.txq00.ring_state: pidx_head: 1668 pidx_tail: 1668 cidx: 1670 state: STALLED

  • There are only 1024 tx/rx descriptors by default, so these indices look weird
  • EDIT: The buf_ring is created with a size of 2048, so these indices would be valid

dev.ixl.1.iflib.txq00.r_drops: 32212354

  • If I run netperf UDP_STREAM again after noticing the system hang, this is the only queue stat that increments -- this matches Drew's suggestion that this may be an issue with how drops are being handled
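
A minimal sketch of how that packed ring state can be read (the field layout approximates FreeBSD's sys/net/mp_ring.c; the names and sizes here are my assumptions, not the in-tree code):

```c
/*
 * Toy reconstruction of the ring_state sysctl output. The field layout is an
 * approximation of sys/net/mp_ring.c, for illustration only.
 */
#include <stdint.h>
#include <stdio.h>

union ring_state {
	struct {
		uint16_t pidx_head;	/* producer index: head of in-progress enqueues */
		uint16_t pidx_tail;	/* producer index: fully published entries */
		uint16_t cidx;		/* consumer index: advanced as entries drain */
		uint16_t flags;		/* IDLE / BUSY / STALLED / ... */
	};
	uint64_t state;			/* all four fields fit in one 64-bit word */
};

/* Entries still waiting to drain, for a power-of-two ring of 'size' slots. */
static unsigned
ring_pending(uint16_t pidx_tail, uint16_t cidx, uint16_t size)
{
	return (pidx_tail - cidx) & (size - 1);
}

int
main(void)
{
	/* Values from dev.ixl.1.iflib.txq00.ring_state above. */
	union ring_state rs = { .pidx_head = 1668, .pidx_tail = 1668, .cidx = 1670 };
	uint16_t size = 2048;	/* soft ring size, larger than the 1024 hw descriptors */

	/*
	 * Prints 2046 under the usual "pending = pidx_tail - cidx (mod size)"
	 * reading: either the ring is essentially full and not draining, or
	 * cidx has run past the producer, which would itself be a bug.
	 */
	printf("pending: %u\n", ring_pending(rs.pidx_tail, rs.cidx, size));
	return (0);
}
```

Since the indices wrap at the 2048-entry soft ring rather than at the 1024 hardware descriptors, values like 1668/1670 are in range; the notable part is that the ring is marked STALLED.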

ricera added this to the Initial iflib-ixl conversion release milestone Mar 29, 2018

W8BSD commented Mar 29, 2018

So r_drops indicates that there was no room in the mp ring to add the buffer. It sounds like the ring is simply not draining. If you enable INVARIANTS, you'll get some extra counters which will include the number of encap failures.
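
A toy model (not the actual ifmp_ring code) of the failure mode described here: once the consumer side stops advancing, every further enqueue hits the no-room branch, bumps a counter analogous to r_drops, and returns ENOBUFS, the same errno (55) that shows up in the dtrace output later in the thread.

```c
/* Toy producer ring showing how a drop counter grows when nothing drains. */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 2048			/* matches the soft ring size noted above */

struct toy_ring {
	void	*items[RING_SIZE];
	uint16_t pidx;			/* producer index */
	uint16_t cidx;			/* consumer index, never advanced in this demo */
	uint64_t enqueues;
	uint64_t drops;			/* analogous to the r_drops sysctl */
};

static int
toy_ring_enqueue(struct toy_ring *r, void *m)
{
	uint16_t used = (r->pidx - r->cidx) & (RING_SIZE - 1);

	if (used == RING_SIZE - 1) {	/* no room: the ring is not draining */
		r->drops++;
		return (ENOBUFS);
	}
	r->items[r->pidx] = m;
	r->pidx = (r->pidx + 1) & (RING_SIZE - 1);
	r->enqueues++;
	return (0);
}

int
main(void)
{
	struct toy_ring r = { .cidx = 0 };
	int dummy;

	/* With a stuck consumer, everything after the first RING_SIZE - 1
	 * packets is counted as a drop. */
	for (int i = 0; i < 100000; i++)
		(void)toy_ring_enqueue(&r, &dummy);
	printf("enqueued %ju, dropped %ju\n",
	    (uintmax_t)r.enqueues, (uintmax_t)r.drops);
	return (0);
}
```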


W8BSD commented Mar 29, 2018

Or just manually define IFLIB_DEBUG_COUNTERS to 1 in iflib.c


W8BSD commented Mar 29, 2018

FYI, the additional counters show up under net.iflib not dev.X.Y.iflib
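
For polling a single one of those counters from userland, something like this works (assuming the counter is exported as a plain integer sysctl; net.iflib.tx_seen is the one mentioned further down, and any other name printed by `sysctl net.iflib` can be substituted):

```c
/* Read one net.iflib debug counter by name (FreeBSD userland). */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *name = "net.iflib.tx_seen";	/* substitute any net.iflib.* counter */
	uint64_t buf = 0;
	size_t len = sizeof(buf);

	if (sysctlbyname(name, &buf, &len, NULL, 0) == -1) {
		perror(name);
		return (1);
	}
	if (len == sizeof(uint32_t)) {			/* counter exported as a 32-bit int */
		uint32_t v32;

		memcpy(&v32, &buf, sizeof(v32));
		printf("%s: %u\n", name, v32);
	} else {					/* or as a 64-bit value */
		printf("%s: %ju\n", name, (uintmax_t)buf);
	}
	return (0);
}
```

Plain `sysctl net.iflib` from a shell dumps all of them at once; the snippet is only handy for watching a single counter in a tight loop.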


ricera commented Mar 29, 2018

Manually defining IFLIB_DEBUG_COUNTERS to 1 made the problem go away for me. :/


ricera commented Mar 29, 2018

Belay that -- after a dozen tries of netperf, I see the hang again


ricera commented Mar 29, 2018

net.iflib.tx_seen seems to increment pretty rapidly.

The igb interface I'm using for remote access skews all of the other debug counters, but tx_seen seems to be going up by a couple hundred thousand per second, compared to tx_encap, which only increments by about one per second (which I'd guess is due to igb and my mosh session).

The encap_ and txq_drain_ sysctls are all at 0.


ricera commented Mar 30, 2018

I was going to try using dtrace, but after the script runs for ~30s it stops with the message:

dtrace: processing aborted: Abort due to systemic unresponsiveness

That's still enough time to start checking what functions are getting called while the system is hanging, I think.


ricera commented Mar 30, 2018

ixl_if_tx_queue_intr_enable:entry
ixl_isc_txd_credits_update:return (retval = 0)
_task_fn_tx:entry  (mtx name for txq = ixl1:tx(0):call)
ifmp_ring_check_drainage:entry
ifmp_ring_enqueue:return (retval = 55 / ENOBUFS)

These functions are all getting called rapidly during the hang, according to the dtrace probes.
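
Putting those probes together: ixl_isc_txd_credits_update() returning 0 means iflib never reclaims a completed Tx descriptor, so the consumer index can't advance, the mp ring stays full, and every new enqueue keeps failing with ENOBUFS while the Tx task gets rescheduled. A toy model of that cycle (the names echo the probes; the bodies are stand-ins, not the real iflib paths):

```c
/* Toy model of the spin seen in the probes: no credits, no progress. */
#include <errno.h>
#include <stdio.h>

static unsigned pending = 2046;		/* ring effectively full and stuck */

static unsigned
txd_credits_update(void)		/* stands in for ixl_isc_txd_credits_update */
{
	return (0);			/* no completed descriptors ever reported */
}

static int
enqueue_attempt(void)			/* stands in for the failing ifmp_ring_enqueue */
{
	return (pending > 0 ? ENOBUFS : 0);
}

int
main(void)
{
	/* Each pass models one interrupt/drainage check: zero credits come
	 * back, pending never shrinks, and the next enqueue fails again. */
	for (int pass = 0; pass < 3; pass++) {
		pending -= txd_credits_update();
		printf("pass %d: pending=%u enqueue=%d (ENOBUFS=%d)\n",
		    pass, pending, enqueue_attempt(), ENOBUFS);
	}
	return (0);
}
```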


ricera commented Apr 2, 2018

I'm going to agree with Drew's guess: iflib is detecting that the queues are hung, and while trying to recover the queues, it gets into a deadlock over the txq lock and the ctx lock.

But why are the Tx queues hanging? Is one of the Tx/Rx functions in the driver not returning the right values for iflib to use? Are Tx interrupts not working correctly, maybe?
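
A generic illustration of that kind of lock-order reversal (the lock names here are hypothetical stand-ins, not iflib's actual ctx/txq locks): one path takes ctx-then-txq, the recovery path takes txq-then-ctx, and if the two interleave, each ends up waiting on the lock the other holds.

```c
/* Minimal lock-order-reversal demo; hypothetical locks, not iflib code.
 * Build with: cc -o lor lor.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t txq_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
normal_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&ctx_lock);		/* 1st: ctx */
	usleep(1000);				/* widen the race window */
	pthread_mutex_lock(&txq_lock);		/* 2nd: txq */
	pthread_mutex_unlock(&txq_lock);
	pthread_mutex_unlock(&ctx_lock);
	return (NULL);
}

static void *
recovery_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&txq_lock);		/* 1st: txq -- reversed order */
	usleep(1000);
	/* With a blocking lock here the two threads would deadlock; trylock
	 * is used only so this demo terminates and reports the conflict. */
	if (pthread_mutex_trylock(&ctx_lock) != 0)
		printf("lock order reversal: would deadlock here\n");
	else
		pthread_mutex_unlock(&ctx_lock);
	pthread_mutex_unlock(&txq_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, normal_path, NULL);
	pthread_create(&b, NULL, recovery_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return (0);
}
```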


ricera commented Apr 4, 2018

Updates:

  • Drew found that reverting to head writeback for TX descriptor completion reporting stopped the hangs; I'll integrate that into another commit soon. We still need descriptor writeback to work for the future, though, so I'll continue to investigate, leading into...

  • I found that adding "#define NO_64BIT_ATOMICS 1" to mp_ring.h caused the hangs to stop. There might be an issue there, but I don't know why that wouldn't have shown up in em or ixgbe.
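
For context on what that define changes, as I read the mp_ring code (an approximation, not the in-tree implementation): without it, the whole packed ring state is advanced with a single 64-bit compare-and-swap; with NO_64BIT_ATOMICS defined, the state transitions are serialized under a lock instead. The locked variant not hanging would be consistent with a lost update or ordering problem somewhere in the CAS path. A sketch of the two shapes:

```c
/* Sketch of the two update strategies; illustrative only, not mp_ring.c. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

union ring_state {
	struct {
		uint16_t pidx_head;
		uint16_t pidx_tail;
		uint16_t cidx;
		uint16_t flags;
	};
	uint64_t state;
};

/* 64-bit atomics available: publish a whole new state with one CAS,
 * retrying on contention. A mistake in computing the new state, or a missed
 * retry, could leave the ring looking permanently busy or stalled. */
static void
advance_cidx_cas(_Atomic uint64_t *state, uint16_t new_cidx)
{
	union ring_state os, ns;

	do {
		os.state = atomic_load(state);
		ns = os;
		ns.cidx = new_cidx;
	} while (!atomic_compare_exchange_weak(state, &os.state, ns.state));
}

/* NO_64BIT_ATOMICS variant: every transition happens under a lock, which
 * sidesteps lost-update races at the cost of extra contention. */
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

static void
advance_cidx_locked(union ring_state *rs, uint16_t new_cidx)
{
	pthread_mutex_lock(&ring_lock);
	rs->cidx = new_cidx;
	pthread_mutex_unlock(&ring_lock);
}

int
main(void)
{
	_Atomic uint64_t packed = 0;
	union ring_state rs = { .state = 0 };

	advance_cidx_cas(&packed, 1670);
	advance_cidx_locked(&rs, 1670);
	return (0);
}
```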


jepiepe commented Apr 5, 2018

Disregard, I think this was an environment issue on my part :)

ricera changed the title from "System hang occurs after UDP Tx" to "System hang occurs after UDP Tx using Tx descriptor writeback" Apr 5, 2018
ricera changed the title from "System hang occurs after UDP Tx using Tx descriptor writeback" to "System hang occurs after UDP Tx, when using Tx descriptor writeback" Apr 5, 2018
ricera pushed a commit that referenced this issue May 25, 2018
There's no need to perform the interrupt unbind while holding the
blkback lock, and doing so leads to the following LOR:

lock order reversal: (sleepable after non-sleepable)
 1st 0xfffff8000802fe90 xbbd1 (xbbd1) @ /usr/src/sys/dev/xen/blkback/blkback.c:3423
 2nd 0xffffffff81fdf890 intrsrc (intrsrc) @ /usr/src/sys/x86/x86/intr_machdep.c:224
stack backtrace:
#0 0xffffffff80bdd993 at witness_debugger+0x73
#1 0xffffffff80bdd814 at witness_checkorder+0xe34
#2 0xffffffff80b7d798 at _sx_xlock+0x68
#3 0xffffffff811b3913 at intr_remove_handler+0x43
#4 0xffffffff811c63ef at xen_intr_unbind+0x10f
#5 0xffffffff80a12ecf at xbb_disconnect+0x2f
#6 0xffffffff80a12e54 at xbb_shutdown+0x1e4
#7 0xffffffff80a10be4 at xbb_frontend_changed+0x54
#8 0xffffffff80ed66a4 at xenbusb_back_otherend_changed+0x14
#9 0xffffffff80a2a382 at xenwatch_thread+0x182
#10 0xffffffff80b34164 at fork_exit+0x84
#11 0xffffffff8101ec9e at fork_trampoline+0xe

Reported by:    Nathan Friess <nathan.friess@gmail.com>
Sponsored by:   Citrix Systems R&D