Networking Stack

Introduction

The OSv networking stack originates from FreeBSD as of circa 2013 but has since been heavily modified to implement Van Jacobson's "network channels" design, to reduce the number of locks and lock operations. For more theory and high-level design details please read the "Network Channels" chapter of the OSv paper.

This wiki page focuses instead on the code and on where these design ideas are implemented. It still covers only the tip of the iceberg: the code of the networking stack, located mostly under the bsd/ subtree.

Studying the Networking Stack

One can use trace.py to effectively study the OSv networking stack. There are numerous trace points that can be enabled when running a given app and then extracted and analyzed using the aforementioned tool as described in this wiki page.

A good test bed is the httpserver-monitoring-api app, which can be built and started with the following trace points enabled:

./scripts/build image=httpserver-monitoring-api.fg

./scripts/run.py --trace=net_packet*,tcp*,tso*,inpcb*,in_lltable* --trace-backtrace --api

curl http://localhost:8000/os/threads # Just to trigger some networking activity

./scripts/trace.py extract
./scripts/trace.py list --tcpdump -blLF

0xffff800001981040 /libhttpserver-  0         0.117564227 tcp_state            tp=0xffffa000015b9c00, CLOSED -> CLOSED
  tcpcb::set_state(int) bsd/sys/netinet/tcp_var.h:233
  tcp_attach bsd/sys/netinet/tcp_usrreq.cc:1618
  tcp_usr_attach bsd/sys/netinet/tcp_usrreq.cc:130
  socreate bsd/sys/kern/uipc_socket.cc:335
  socreate(int, int, int) bsd/sys/kern/uipc_syscalls.cc:118
  sys_socket bsd/sys/kern/uipc_syscalls.cc:133
  linux_socket bsd/sys/compat/linux/linux_socket.cc:616
  socket bsd/sys/kern/uipc_syscalls_wrap.cc:333

Data Structures

A number of data structures are key to understanding the networking stack. Most of them originate from FreeBSD, and only some have been extended with OSv-specific data, especially for net channel needs.

Mbuf (Memory Buffer)

The struct mbuf is a key structure used to exchange data between a network driver and the rest of the stack. It is used at the very bottom of the stack. It consists of a header (32 bytes) followed by data. Multiple mbufs are typically linked together as a list, an "mbuf chain", to carry a single packet. For more details read here.
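As a rough illustration of the chain idea only, the sketch below uses a hypothetical, heavily simplified toy_mbuf (not the real struct mbuf, whose header carries many more fields) and walks a chain to total its payload length. All names in it are assumptions made for this example.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for struct mbuf: the real header has
// many more fields; only the chaining idea is shown here.
struct toy_mbuf {
    toy_mbuf*   next;   // next buffer in the same chain, or nullptr
    const char* data;   // start of the payload held by this buffer
    std::size_t len;    // number of payload bytes in this buffer
};

// Walk the chain and add up the payload bytes of every buffer.
std::size_t chain_length(const toy_mbuf* m)
{
    std::size_t total = 0;
    for (; m != nullptr; m = m->next) {
        total += m->len;
    }
    return total;
}
```

For instance, a two-element chain holding 6 and 5 bytes reports a total length of 11 bytes.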

Socket

The struct socket represents a socket object behind a socket API and typically maps to a single connection.

More [TODO]

Net Channels: Slow Path vs Fast Path

The net channel is a direct bottom-up traffic line flowing from a network driver, acting as a producer, to an app thread calling recv(), poll(), epoll() and similar functions, acting as a consumer. Net channels are designed to avoid most of the locking involved in the traditional layer-by-layer traversal. Relatedly, one can see many references in the code to both "fast path" and "slow path". To understand these two paths and net channels, one can start by looking at this code in the virtio-net driver (there is similar code in the vmxnet3 driver):

void net::receiver()
{
...
  bool fast_path = _ifn->if_classifier.post_packet(m_head);
  if (!fast_path) {
      (*_ifn->if_input)(_ifn, m_head);
  }
...
}

In essence, this code is called to process incoming data (RX) from the network card and it tries to "push" the resulting mbuf via the network channel (fast-path). If that fails it falls back to the if_input from the FreeBSD way of doing things.

The if_classifier, a member of struct ifnet (which describes a network interface and is defined in if_var.h), is an instance of the class classifier. The post_packet() method used in the code above is part of the "producer" interface: it checks whether the mbuf in question has a corresponding net channel and, if so, pushes the mbuf onto that net channel and wakes the channel's consumers. So the network card driver, virtio-net in this example, is a "producer" in the context of the net channel, and threads blocked in calls such as send, recv and poll are "consumers". Also, an instance of a net channel corresponds to a single TCP connection.
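The producer side described above can be sketched in a few lines. This is a minimal, single-threaded model under stated assumptions, not the OSv classifier: toy_packet, toy_channel, toy_classifier, and receive() are hypothetical names, a plain std::unordered_map stands in for the real lookup structure, a flow_id integer stands in for the connection's address and port tuple, and waking consumers is left out.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_map>

// All names here are hypothetical; this is not the OSv classifier API.
struct toy_packet {
    std::uint64_t flow_id;  // stands in for the connection's address/port tuple
    int           payload;
};

struct toy_channel {
    std::deque<toy_packet> queue;   // packets waiting for a consumer
};

class toy_classifier {
public:
    void add(std::uint64_t flow_id, toy_channel* ch) { _channels[flow_id] = ch; }
    void remove(std::uint64_t flow_id) { _channels.erase(flow_id); }

    // Returns true if the packet was queued on a channel (fast path).
    bool post_packet(const toy_packet& p)
    {
        auto it = _channels.find(p.flow_id);
        if (it == _channels.end()) {
            return false;           // no channel registered for this flow
        }
        it->second->queue.push_back(p);
        return true;
    }
private:
    std::unordered_map<std::uint64_t, toy_channel*> _channels;
};

// Same shape as the receiver code above: try the fast path first, then
// fall back to the traditional input routine.
// Returns true when the fast path was taken.
bool receive(toy_classifier& c, const toy_packet& p,
             const std::function<void(const toy_packet&)>& slow_path_input)
{
    bool fast_path = c.post_packet(p);
    if (!fast_path) {
        slow_path_input(p);
    }
    return fast_path;
}
```

Until a channel is registered for a flow, every packet of that flow takes the slow path. Once add() has been called, packets land on the channel's queue and the slow-path callback is no longer invoked for them.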

Here is an example of the "successful" fast path traversal:

0xffff8000015ff040 virtio-net-rx    0        21.143180806 net_packet_in        b'IP truncated-ip - 14 bytes missing! 192.168.122.1.36394 > 192.168.122.15.8000: Flags [P.], seq 2688002:2688090, ack 2893961834, win 65535, length 88'
  log_packet_in(mbuf*, int) core/net_trace.cc:143
  classifier::post_packet(mbuf*) core/net_channel.cc:133
  virtio::net::receiver() drivers/virtio-net.cc:542
  std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
  __invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  sched::thread::main() core/sched.cc:1267
  thread_main_c arch/x64/arch-switch.hh:325
  thread_main arch/x64/entry.S:116

Now, how exactly does post_packet() "classify" the packet? Under the hood, it calls the method classify_ipv4_tcp(), which first verifies whether the packet belongs in the "fast path" category, meaning more or less:

is it an IP packet?
does it carry a TCP payload?
is the underlying TCP connection in the right state, that is, does the segment carry none of the TH_SYN, TH_FIN, or TH_RST flags?

The last condition effectively means that only sockets in the state - ESTABLISHED, CLOSE_WAIT, FIN_WAIT_2, and TIME_WAIT - would "participate" in the fast path traversal. In other words, the fast path only plays a role when a TCP connection is established and the slow path is what happens during establishing and tear-down of a TCP connection.
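The three checks listed above can be expressed as a small predicate. This is a sketch only: it assumes the relevant header fields have already been extracted into a hypothetical toy_headers struct, whereas the real classify_ipv4_tcp() works on the mbuf itself. The constants use the standard numeric values (EtherType 0x0800 for IPv4, protocol number 6 for TCP, and the usual TCP flag bits) under lowercase names chosen for this example.

```cpp
#include <cstdint>

// Standard numeric values, repeated under hypothetical lowercase names so
// that the sketch compiles on its own.
constexpr std::uint16_t ethertype_ip = 0x0800;  // IPv4 EtherType
constexpr std::uint8_t  ipproto_tcp  = 6;       // TCP protocol number
constexpr std::uint8_t  th_fin = 0x01;          // TCP FIN flag bit
constexpr std::uint8_t  th_syn = 0x02;          // TCP SYN flag bit
constexpr std::uint8_t  th_rst = 0x04;          // TCP RST flag bit

// Hypothetical container for the already-parsed header fields.
struct toy_headers {
    std::uint16_t ether_type;  // EtherType from the Ethernet header
    std::uint8_t  ip_proto;    // protocol field from the IPv4 header
    std::uint8_t  tcp_flags;   // flags byte from the TCP header
};

// True only for IPv4 TCP segments without SYN, FIN, or RST set.
bool fast_path_candidate(const toy_headers& h)
{
    if (h.ether_type != ethertype_ip) {
        return false;   // not an IP packet
    }
    if (h.ip_proto != ipproto_tcp) {
        return false;   // not carrying TCP
    }
    return (h.tcp_flags & (th_syn | th_fin | th_rst)) == 0;
}
```

A plain ACK or PSH+ACK data segment passes the check, while a SYN, a FIN+ACK, or any UDP or ARP frame does not, which matches the traces on this page where the [P.] segment takes the fast path and the [F.] segment takes the slow path.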

The post_packet() pushes an mbuf onto the net channel only if one exists. But when does a net channel get created? The net channel gets constructed by tcp_setup_net_channel() and destroyed by tcp_teardown_net_channel() or tcp_free_net_channel(). The tcp_setup_net_channel() gets called when a TCP connection gets established in tcp_do_segment() here and there. The tcp_teardown_net_channel() gets called by tcp_do_segment() when a socket in the ESTABLISHED state transitions to CLOSE_WAIT, and by tcp_usr_close() and tcp_usrclosed() when an established socket is closed. The tcp_free_net_channel(), on the other hand, gets called by tcp_discardcb() when the closing of a TCP connection begins in other TCP state machine cases.

The tcp_setup_net_channel() is key as it binds the "consumers" of a net channel by calling add_poller() and add_epoll(). It also registers the new net channel in the RCU hashtable kept as part of the classifier.

Another question remains: how do "consumers" consume data of net channels? The post_packet() method discussed above wakes all interested consumers after a successful push. The consumers include the _waiting_thread - a member of the net channel - and pollers and "epollers" woken by wake_pollers(). Now, how exactly does an mbuf get consumed? The critical places in the bsd/ tree are sprinkled with calls to process_queue(), which pops an mbuf off a queue and passes it on by invoking the callback _process_packet, which was set when constructing the net channel. This may be illustrated by this stack trace:

0xffff800001de8040 >/tests/misc-tc  3        24.416126421 net_packet_handling  b'IP truncated-ip - 1366 bytes missing! 192.168.122.1.9999 > 192.168.122.15.20728: Flags [.], seq 10059266:10060706, ack 2978885776, win 65535, length 1440'
  log_packet_handling(mbuf*, int) core/net_trace.cc:153
  std::_Function_handler<void (mbuf*), tcp_setup_net_channel::{lambda(mbuf*)#1}>::_M_invoke(std::_Any_data const&, mbuf*&&) bsd/sys/netinet/tcp_input.cc:3193
  operator() bsd/sys/netinet/tcp_input.cc:3231
  __invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
  net_channel::process_queue() core/net_channel.cc:37
  int sbwait_tmo<osv::clock::uptime>(socket*, sockbuf*, boost::optional<std::chrono::time_point<osv::clock::uptime, osv::clock::uptime::duration> >) bsd/sys/kern/uipc_sockbuf.cc:167
  sbwait bsd/sys/kern/uipc_sockbuf.cc:190
  soreceive_generic bsd/sys/kern/uipc_socket.cc:1464
  kern_recvit bsd/sys/kern/uipc_syscalls.cc:607
  sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:673
  sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:707
  linux_recv bsd/sys/compat/linux/linux_socket.cc:866
  recv bsd/sys/kern/uipc_syscalls_wrap.cc:183 #libc API call

The sbwait_tmo() in the stack above waits on the net channel associated with the socket object in question and once awoken proceeds to call process_queue() as can be seen in the shortened version of the code:

int sbwait_tmo(socket* so, struct sockbuf *sb, boost::optional<std::chrono::time_point<Clock>> timeout)
{
...
    if (so->so_nc && !so->so_nc_busy) {
        so->so_nc_busy = true;
        sched::thread::wait_for(SOCK_MTX_REF(so), *so->so_nc, sb->sb_cc_wq, tmr, sc);
        so->so_nc_busy = false;
        so->so_nc_wq.wake_all(SOCK_MTX_REF(so));
    } else {
        sched::thread::wait_for(SOCK_MTX_REF(so), so->so_nc_wq, sb->sb_cc_wq, tmr, sc);
    }
...
    if (so->so_nc) {
        so->so_nc->process_queue();
    }

    return 0;
}

Lastly, the _process_packet invoked by process_queue() is actually the function tcp_net_channel_packet(), which disassembles the mbuf popped off a net channel queue and pushes it up the stack by calling the tcp_do_segment() function, which is itself long and pretty complicated.
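The consumer side, a queue drained through a callback fixed at construction time, can be modeled roughly as follows. This is a single-threaded sketch with hypothetical names (toy_net_channel, push()); an int stands in for an mbuf pointer, and the cross-thread queueing and wake-up logic of the real net_channel are not reproduced.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical, single-threaded stand-in for a net channel. It does not
// attempt to reproduce the real class's concurrency handling or wake-ups.
class toy_net_channel {
public:
    // The callback plays the role of _process_packet.
    explicit toy_net_channel(std::function<void(int)> process_packet)
        : _process_packet(std::move(process_packet)) {}

    // Producer side: queue a packet (an int stands in for an mbuf*).
    void push(int packet) { _queue.push(packet); }

    // Consumer side: drain the queue in arrival order, handing each
    // packet to the callback.
    void process_queue()
    {
        while (!_queue.empty()) {
            int packet = _queue.front();
            _queue.pop();
            _process_packet(packet);
        }
    }
private:
    std::function<void(int)> _process_packet;
    std::queue<int> _queue;
};
```

In this model nothing is processed at push() time. The callback runs only when a consumer calls process_queue(), in the same way that the traces above show tcp_do_segment() work running in the thread that called recv() or poll().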

Consuming the data through net channels in the context of poll and epoll is handled differently; the key elements are the socket_file and epoll_file classes respectively. Here is an example stack trace illustrating some of the details:

0xffff800001981040 /libhttpserver-  0        19.880613508 net_packet_handling  b'IP truncated-ip - 14 bytes missing! 192.168.122.1.36398 > 192.168.122.15.8000: Flags [P.], seq 2496002:2496090, ack 233273190, win 65535, length 88'
  log_packet_handling(mbuf*, int) core/net_trace.cc:153
  std::_Function_handler<void (mbuf*), tcp_setup_net_channel::{lambda(mbuf*)#1}>::_M_invoke(std::_Any_data const&, mbuf*&&) bsd/sys/netinet/tcp_input.cc:3193
  operator() bsd/sys/netinet/tcp_input.cc:3231
  __invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
  net_channel::process_queue() core/net_channel.cc:37
  socket_file::poll(int) bsd/sys/kern/sys_socket.cc:260
  epoll_file::add(epoll_key, epoll_event*) core/epoll.cc:99
  epoll_ctl core/epoll.cc:308

Coming back to the original code: if the "fast path" fails, that is, post_packet() returns false, the if_input function (the "slow path") is called. Here is an example of a "slow path" execution:

0xffff800001783040 virtio-net-rx    0        19.881495336 net_packet_in        b'IP 192.168.122.1.36398 > 192.168.122.15.8000: Flags [F.], seq 2496090, ack 233281200, win 65535, length 0'
  log_packet_in(mbuf*, int) core/net_trace.cc:143
  netisr_dispatch_src bsd/sys/net/netisr.cc:768
  virtio::net::receiver() drivers/virtio-net.cc:544
  std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
  __invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  sched::thread::main() core/sched.cc:1267
  thread_main_c arch/x64/arch-switch.hh:325
  thread_main arch/x64/entry.S:116

The netisr_dispatch_src(), a FreeBSD stack routine, is what the if_input member of struct ifnet ends up calling (if_input itself is set to ether_input()).

To conclude, the fast path is fast because it hands the packet directly to the net channel, whereas the slow path traverses all the traditional stack call paths, which involve many locks.

Top-down Direction

There are many ways the network stack can be dissected and analyzed, but a common one is to follow the direction of traffic and how it travels through the layers. The top-down direction starts at the socket layer with libc functions such as send() and recv(), implemented in bsd/sys/kern/uipc_syscalls_wrap.cc and called by an application. The layers below convert user buffers to TCP packets, attach IP headers to those TCP packets, and finally send them out via the network card driver. Here is an example of a stack trace illustrating a send() call traversing all the way down the stack to push an mbuf out onto the network interface (note ether_output_frame()):

0xffff8000019db040 >/tests/misc-tc  2       121.338417028 virtio_net_tx_packet_size vring 0xffffa000011a9200 vec_sz 3
  virtio::net::txq::try_xmit_one_locked(virtio::net::net_req*) drivers/virtio-net.cc:712
  virtio::net::txq::try_xmit_one_locked(void*) drivers/virtio-net.cc:655
  osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::xmit(mbuf*) include/osv/percpu_xmit.hh:293
  ether_output_frame bsd/sys/net/if_ethersubr.cc:398
  ether_output bsd/sys/net/if_ethersubr.cc:366
  ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) bsd/sys/netinet/ip_output.cc:621
  tcp_output bsd/sys/netinet/tcp_output.cc:1385
  tcp_usr_send(socket*, int, mbuf*, bsd_sockaddr*, mbuf*, thread*) bsd/sys/netinet/tcp_usrreq.cc:832
  sosend_generic bsd/sys/kern/uipc_socket.cc:1075
  kern_sendit bsd/sys/kern/uipc_syscalls.cc:515
  sys_sendto bsd/sys/kern/uipc_syscalls.cc:470
  sys_sendto bsd/sys/kern/uipc_syscalls.cc:554
  linux_send bsd/sys/compat/linux/linux_socket.cc:859
  send bsd/sys/kern/uipc_syscalls_wrap.cc:239

In another example below:

0xffff8000019db040 >/tests/misc-tc  2       121.339244385 virtio_net_tx_packet_size vring 0xffffa000011a9200 vec_sz 2
  virtio::net::txq::try_xmit_one_locked(virtio::net::net_req*) drivers/virtio-net.cc:712
  virtio::net::txq::try_xmit_one_locked(void*) drivers/virtio-net.cc:655
  osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::xmit(mbuf*) include/osv/percpu_xmit.hh:293
  ether_output_frame bsd/sys/net/if_ethersubr.cc:398
  ether_output bsd/sys/net/if_ethersubr.cc:366
  ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) bsd/sys/netinet/ip_output.cc:621
  tcp_output bsd/sys/netinet/tcp_output.cc:1385
  tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&) bsd/sys/netinet/tcp_input.cc:1421
  tcp_net_channel_packet bsd/sys/netinet/tcp_input.cc:3212
  operator() bsd/sys/netinet/tcp_input.cc:3231
  __invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
  net_channel::process_queue() core/net_channel.cc:37
  int sbwait_tmo<osv::clock::uptime>(socket*, sockbuf*, boost::optional<std::chrono::time_point<osv::clock::uptime, osv::clock::uptime::duration> >) bsd/sys/kern/uipc_sockbuf.cc:167
  sbwait bsd/sys/kern/uipc_sockbuf.cc:190
  soreceive_generic bsd/sys/kern/uipc_socket.cc:1464
  kern_recvit bsd/sys/kern/uipc_syscalls.cc:607
  sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:673
  sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:707
  linux_recv bsd/sys/compat/linux/linux_socket.cc:866
  recv bsd/sys/kern/uipc_syscalls_wrap.cc:183

we have the recv() function trying to receive data over a socket that traverses over the net channel and eventually hits the tcp_do_segment() that triggers TCP output to send an ACK. Just like in the former stacktrace, the ether_output_frame() is the one which calls the if_transmit - a member of the struct ifnet - to push out an mbuf onto the associated network card.

Transmit Queues

To transmit data efficiently in the top-down flow, OSv uses an optimization technique based on per-CPU TX queues. The main idea is to use the xmitter class, introduced by this commit, to try to push an mbuf directly onto the network card if there is no contention, and otherwise to put it on a per-CPU queue that is processed later by a special worker thread. For more details about how the xmitter has been integrated into the virtio-net and vmxnet3 drivers, please look at this commit and that one respectively.
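The "send directly if uncontended, otherwise defer" idea can be sketched as below. This is a deliberately simplified model and not the osv::xmitter implementation: toy_xmitter and its members are hypothetical, a single deque stands in for the per-CPU queues, drain() stands in for the worker thread, and the deferred queue is not itself thread-safe. The try_acquire() and release() methods are exposed only so that contention can be simulated.

```cpp
#include <atomic>
#include <deque>
#include <vector>

// Hypothetical sketch: one queue stands in for the per-CPU queues and
// drain() stands in for the worker thread. The deferred queue is not
// protected, so this model is for single-threaded illustration only.
class toy_xmitter {
public:
    // Try to take ownership of the "device"; true on success.
    bool try_acquire() { return !_busy.exchange(true, std::memory_order_acquire); }
    void release() { _busy.store(false, std::memory_order_release); }

    // Transmit directly when uncontended, otherwise defer the packet.
    // Returns true when the packet went out immediately.
    bool xmit(int packet)
    {
        if (try_acquire()) {
            _wire.push_back(packet);
            release();
            return true;
        }
        _deferred.push_back(packet);
        return false;
    }

    // What a worker thread would do later: flush deferred packets in order.
    void drain()
    {
        if (!try_acquire()) {
            return;     // someone else owns the device; try again later
        }
        while (!_deferred.empty()) {
            _wire.push_back(_deferred.front());
            _deferred.pop_front();
        }
        release();
    }

    const std::vector<int>& wire() const { return _wire; }
private:
    std::atomic<bool> _busy{false};  // true while the "device" is owned
    std::deque<int>   _deferred;     // packets waiting for the worker
    std::vector<int>  _wire;         // packets that reached the "device"
};
```

The point of the design is that the common, uncontended case pays only for one atomic operation and no queueing, while contended senders never block: they leave the packet behind and return.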

Bottom-up Direction

A good part of this direction has been extensively discussed in the section about net channels, and the slow and fast paths above. But here you see another "slow path" example illustrating the TCP state transition when data arrives:

0xffff800001783040 virtio-net-rx    1       140.273629729 tcp_state            tp=0xffffa00002a8b400, FIN_WAIT_1 -> FIN_WAIT_2
  tcpcb::set_state(int) ./bsd/sys/netinet/tcp_var.h:233
  tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&) bsd/sys/netinet/tcp_input.cc:2277
  tcp_input bsd/sys/netinet/tcp_input.cc:956
  ip_input bsd/sys/netinet/ip_input.cc:774
  netisr_dispatch_src bsd/sys/net/netisr.cc:769
  netisr_dispatch_src bsd/sys/net/netisr.cc:769
  virtio::net::receiver() drivers/virtio-net.cc:544
  std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
  __invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
  __invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
  _M_invoke /usr/include/c++/11/bits/std_function.h:290
  sched::thread::main() core/sched.cc:1267
  thread_main_c arch/x64/arch-switch.hh:325
  thread_main arch/x64/entry.S:116

Domains and Protocols

Switch Tables

As mbufs travel up and down the stack, relevant functions get called to process them depending on family and protocol. To accommodate it, OSv re-uses the switch tables from FreeBSD.

For example, netisr_dispatch_src() is called by a network driver to propagate an mbuf up the stack. The call goes through ether_input(), which is what the if_input member of struct ifnet is set to:

int netisr_dispatch_src(u_int proto, uintptr_t source, struct mbuf *m)
{
    ...
    netisr_proto[proto].np_handler(m);
    ...
}

In this case, the ether netisr handler - ip_input() - gets called for the protocol NETISR_ETHER (the np_handler gets set by netisr_register() routine).

The ip_input() ends up calling the tcp_input() function using the switch table ip_protox like so:

void ip_input(struct mbuf *m)
{
    uint8_t protocol;
    int hlen;
    m = ip_preprocess_packet(m, protocol, hlen);
    if (!m) {
        return;
    }
    (*inetsw[ip_protox[protocol]].pr_input)(m, hlen);
}

The switch tables for the inet domain are set up in bsd/sys/netinet/in_proto.cc.
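The two-level lookup used by ip_input() above, an index array keyed by protocol number that selects an entry in a table of handlers, can be illustrated with a miniature model. Everything here is hypothetical (toy_protosw, toy_inetsw, toy_ip_protox and the handlers); in particular, routing unknown protocols to a catch-all "raw" entry at index 0 is an assumption of this sketch.

```cpp
#include <array>
#include <cstdint>

// Hypothetical miniature of a protocol switch entry.
struct toy_protosw {
    const char* name;
    int (*pr_input)(int payload);   // input handler for this protocol
};

// Toy handlers: each tags the payload so the caller can tell them apart.
inline int toy_raw_input(int payload) { return payload; }
inline int toy_tcp_input(int payload) { return payload + 1000; }
inline int toy_udp_input(int payload) { return payload + 2000; }

// Entry 0 is the catch-all used for protocols without a dedicated handler.
const std::array<toy_protosw, 3> toy_inetsw = {{
    {"raw", toy_raw_input},
    {"tcp", toy_tcp_input},
    {"udp", toy_udp_input},
}};

// Build the index: every protocol number maps to entry 0 unless it has
// a dedicated entry in toy_inetsw.
inline std::array<std::uint8_t, 256> make_protox()
{
    std::array<std::uint8_t, 256> protox{};  // zero-initialized -> "raw"
    protox[6]  = 1;   // TCP protocol number
    protox[17] = 2;   // UDP protocol number
    return protox;
}

const std::array<std::uint8_t, 256> toy_ip_protox = make_protox();

// Same shape as the dispatch line in ip_input(): index, then call.
int toy_ip_input(std::uint8_t protocol, int payload)
{
    return (*toy_inetsw[toy_ip_protox[protocol]].pr_input)(payload);
}
```

With this layout, adding a protocol means adding one table entry and one index assignment; the dispatch code itself never changes.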
