An in-depth code review of Linux Kernel Routing (of kernel version v6.10.3) Part 1: Ingress

This is a compilation of kernel code, documentation, and other sources that tries to give a complete view of how packets are handled by the Linux kernel. The idea comes from this blog post, but updated to the current version of the kernel. A really helpful tool for the creation of this page was the elixir.bootlin kernel source page.

I will try to go over the logic of the code as well as the semantics of Linux kernel C code-isms. For easier reading I decided to put most of my comments inside the code itself.

Interrupt handling: Softirq

Incoming physical packets are handled by the NIC, which puts them into the driver's DMA (Direct Memory Access) memory, a section of RAM that both the CPU and the physical device can access, and then raises an interrupt to tell the machine that data has arrived. The Linux kernel has been using software interrupts, or softirqs, since 2004 [1]. Softirqs are an implementation of bottom-half interrupt handling: the top half is the classic hardware interrupt handler, which quickly deals with everything that cannot wait while other interrupts are disabled; the bottom half does the time-intensive part of the work and is usually interruptible. Softirqs are not to be confused with "bottom halves" (BHs), the older Linux mechanism that was used before softirqs and was deprecated when they were adopted [2].

The current softirqs include/linux/interrupt.h, line 553

  • HI_SOFTIRQ
  • TIMER_SOFTIRQ
  • NET_TX_SOFTIRQ
  • NET_RX_SOFTIRQ
  • BLOCK_SOFTIRQ
  • IRQ_POLL_SOFTIRQ
  • TASKLET_SOFTIRQ
  • SCHED_SOFTIRQ
  • HRTIMER_SOFTIRQ
  • RCU_SOFTIRQ

For our purposes we will be talking about NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, where RX is the receive softirq and TX is the transmit softirq. The number and names of the softirqs are static, but their actions are registered dynamically at boot time. [3]
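The split between a fixed set of softirq numbers and handlers that are installed at boot can be sketched in userspace as follows. This is only an illustration of the idea, not kernel code: every `demo_` name is hypothetical, and `demo_open_softirq` merely mimics the kind of registration step the kernel performs.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed, compile-time list of softirq numbers (same order as the list above). */
enum demo_softirq {
	DEMO_HI, DEMO_TIMER, DEMO_NET_TX, DEMO_NET_RX, DEMO_BLOCK,
	DEMO_IRQ_POLL, DEMO_TASKLET, DEMO_SCHED, DEMO_HRTIMER, DEMO_RCU,
	DEMO_NR_SOFTIRQS
};

typedef void (*demo_action_t)(void);

/* One handler slot per softirq number, filled in at "boot". */
static demo_action_t demo_vec[DEMO_NR_SOFTIRQS];

/* Hypothetical registration step: store the handler for one number. */
static void demo_open_softirq(enum demo_softirq nr, demo_action_t action)
{
	assert(nr < DEMO_NR_SOFTIRQS);
	demo_vec[nr] = action;
}

static int demo_rx_runs;	/* counts how often the RX handler ran */

static void demo_net_rx_action(void)
{
	demo_rx_runs++;
}

/* Run the handler registered for nr, if any; returns 1 if one ran. */
static int demo_run(enum demo_softirq nr)
{
	assert(nr < DEMO_NR_SOFTIRQS);
	if (!demo_vec[nr])
		return 0;
	demo_vec[nr]();
	return 1;
}
```

Calling `demo_run(DEMO_NET_RX)` before registration does nothing; after `demo_open_softirq(DEMO_NET_RX, demo_net_rx_action)` the same call invokes the handler.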

For our kernel version, the softirq handling code can be found here:

kernel/softirq.c, line 442

// asmlinkage - Tells the compiler which calling convention the function uses when it is called from assembly.
//              On 32-bit x86 this means that all arguments are passed on the stack, as opposed to registers
// __visible - Marks the symbol as externally visible, so the compiler keeps it even if it sees no caller

asmlinkage __visible void do_softirq(void)
{
        // __u32 - fixed-width unsigned 32-bit integer type that is also exported in the userspace headers
	__u32 pending;
	unsigned long flags;
	
	// in_interrupt() evaluates irq_count(): the NMI, hardirq and softirq bits of preempt_count
	// ---------------------------------------------------
        // A non-zero value means we are already in interrupt context or bottom halves are disabled, so softirqs must not run here.
	// relevant code snippet:
	// linux/preempt.h:139 * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
	// NMI - Non maskable interrupt
	// IRQ - Hardware interrupt
	// SoftIRQ - Software interrupt
	// BH - Bottom halves
	if (in_interrupt())
		return;


// Saves the CPU's irq flags, and disables interrupts on the CPU this code is running on.
// -------------------------------------------------------------
// It looks weird because it doesn't use pointers (it is a macro instead of a function)
// It puts the irq flags into the flags variable
// 
// It eventually jumps to a code snippet a bit like this
// arch/x86/include/asm/irqflags.h:101
// static __always_inline unsigned long arch_local_irq_save(void)
// {
//	unsigned long flags = arch_local_save_flags();
//	arch_local_irq_disable();
//	return flags;
// }
	local_irq_save(flags);

// Returns which softirqs are pending (raised, i.e. requested to run).
//      Raised is not the same as enabled: a softirq can be pending while bottom halves are disabled
// ----------------------------------------------------------
// pending is a bitmask where each bit represents one of the pre-defined softirqs. If no softirq is pending,
// it is 0.
	pending = local_softirq_pending();

	if (pending)
	// Deep, deeper, yet deeper
		do_softirq_own_stack();

// Restores the irq state
	local_irq_restore(flags);
}
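The reason local_irq_save(flags) can fill in flags without taking its address is that it is a macro, so the variable name is substituted textually. Below is a minimal userspace sketch of the same pattern. The `demo_irq_enabled` variable is a made-up stand-in for the CPU's interrupt-enable flag, and all `demo_` names are hypothetical.

```c
#include <assert.h>

/* Stand-in for the CPU interrupt-enable flag; 1 = enabled. */
static unsigned long demo_irq_enabled = 1;

static inline unsigned long demo_arch_irq_save(void)
{
	unsigned long flags = demo_irq_enabled;	/* remember the old state */

	demo_irq_enabled = 0;			/* "disable interrupts" */
	return flags;
}

/* Macros: they assign to the caller's variable directly, no pointer needed. */
#define demo_irq_save(flags)    do { (flags) = demo_arch_irq_save(); } while (0)
#define demo_irq_restore(flags) do { demo_irq_enabled = (flags); } while (0)

/* Nested save/restore pairs put back whatever state they found. */
static unsigned long demo_nested(void)
{
	unsigned long outer, inner;

	demo_irq_save(outer);		/* outer holds the state on entry */
	demo_irq_save(inner);		/* inner = 0, still disabled */
	demo_irq_restore(inner);	/* stays disabled */
	assert(demo_irq_enabled == 0);
	demo_irq_restore(outer);	/* back to the state on entry */
	return demo_irq_enabled;
}
```

Restoring the saved value instead of unconditionally enabling interrupts is what makes the pair safe to nest: the inner restore leaves interrupts off because they were already off when it saved.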

do_softirq_own_stack is architecture-specific code that switches to the stack on which the softirq code will run. On x86 it looks like this:

arch/x86/include/asm/irq_stack.h, line 213

/*
 * Macro to invoke __do_softirq on the irq stack. This is only called from
 * task context when bottom halves are about to be reenabled and soft
 * interrupts are pending to be processed. The interrupt stack cannot be in
 * use here.
 */
#define do_softirq_own_stack()						\
{									\
	__this_cpu_write(pcpu_hot.hardirq_stack_inuse, true);		\
	call_on_irqstack(__do_softirq, ASM_CALL_ARG0);			\
	__this_cpu_write(pcpu_hot.hardirq_stack_inuse, false);		\
}

On x86 there is a separate per-CPU stack for interrupts. Both hardware interrupts and softirqs run on it, separate from the stack of the task that was interrupted. From here we jump to __do_softirq (the double underscore marks it as a function for internal use [4]).

kernel/softirq.c, line 586

// __softirq_entry - The compiler will put this function in the .softirqentry.text section of the ELF binary
//    
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
	handle_softirqs(false);
}

There are other ways to reach the softirq handler [5], for example on exit from a hardware interrupt or from the ksoftirqd thread; the path shown here is only one of them.

kernel/softirq.c, line 511

// ksirqd - This boolean is true, if the function was called by the ksoftirqd daemon
static void handle_softirqs(bool ksirqd)
{
      // Sets the deadline for softirq processing
      // -----------------------------------------------
      // jiffies is the kernel's basic tick counter: it is incremented on every timer interrupt,
      // so one jiffy lasts 1/HZ seconds (10 ms with HZ=100, 4 ms with HZ=250, 1 ms with HZ=1000).
      // kernel/softirq.c:479 #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
      // So the deadline is 2 ms rounded up to whole jiffies: 1 jiffy with HZ=100 or 250, 2 jiffies with HZ=1000
	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
	
	// current - pointer to the task_struct of the task that is running on this CPU
	// flags - the per-task PF_* flags (PF_MEMALLOC, PF_KTHREAD, ...)
	unsigned long old_flags = current->flags;
	
	// MAX_SOFTIRQ_RESTART = 10
	int max_restart = MAX_SOFTIRQ_RESTART;
  
      // this struct has only a single function pointer
      //-------------------------------------
      // include/linux/interrupt.h:591
      //  struct softirq_action
      //  {
      //	void	(*action)(struct softirq_action *);
      //  };
	struct softirq_action *h;
	
	bool in_hardirq;
	__u32 pending;
	int softirq_bit;

	/*
	 * Mask out PF_MEMALLOC as the current task context is borrowed for the
	 * softirq. A softirq handled, such as network RX, might set PF_MEMALLOC
	 * again if the socket is related to swapping.
	 *
        */
	// PF_MEMALLOC = 0x00000800	
	// 		This flags the process as a memory allocator. kswapd sets this flag 
	//		and it is set for any process that is about to be killed by the Out Of Memory (OOM) killer
	//  	It tells the buddy allocator to ignore zone watermarks and assign the pages if at all possible
	//		https://www.kernel.org/doc/gorman/html/understand/understand009.html
	current->flags &= ~PF_MEMALLOC;

        // Gets the pending softirqs
	pending = local_softirq_pending();
  
        // On non-RT kernels this marks the start of softirq processing by adding SOFTIRQ_OFFSET to preempt_count
        //-------------------------------------------------
        // With CONFIG_TRACE_IRQFLAGS set it also feeds the irq-flag tracing that can be used to debug interrupts and locks
        // some additional literature on this
	softirq_handle_begin();
	
	// This is also CONFIG_TRACE_IRQFLAGS only
	in_hardirq = lockdep_softirq_start();
	
       // Accounting, from here spent CPUtime is accounted under softirqs
       //------------------------------------------------
       // include/linux/vtime.h:133
       //  static inline void account_softirq_enter(struct task_struct *tsk)
       //  {
       //		vtime_account_irq(tsk, SOFTIRQ_OFFSET);
       //		irqtime_account_irq(tsk, SOFTIRQ_OFFSET);
       //  }
       //  For more about accounting read this
	account_softirq_enter(current);

// If new softirqs were raised while the handlers ran, processing restarts from here
//-------------------------------
// gotos are used extensively in the linux kernel for error handling, retry loops and exiting nested loops
restart:
        // Resets so it won't trigger for the same interrupts again
	/* Reset the pending bitmask before enabling irqs */
	set_softirq_pending(0);

        // Enables maskable interrupts
	local_irq_enable();

        // softirq_vec is the static array that holds one softirq_action per softirq
        h = softirq_vec;

        // source/arch/x86/include/asm/bitops.h:L337
       /**
        * ffs - find first set bit in word
        * @x: the word to search
        *
        * This is defined the same way as the libc and compiler builtin ffs
        * routines, therefore differs in spirit from the other bitops.
        *
        * ffs(value) returns 0 if value is 0 or the position of the first
        * set bit if value is nonzero. The first (least significant) bit
        * is at position 1.
        */
	while ((softirq_bit = ffs(pending))) {
		unsigned int vec_nr;
		int prev_count;
  
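                // Advances h to the softirq_action of the lowest pending bit.
                // On the first pass this is &(softirq_vec[softirq_bit-1]); on later passes h has already moved forward
		h += softirq_bit - 1;

               // vec_nr is the absolute index of that softirq in softirq_vec
               //-----------------------------
               //  pointer subtraction yields the number of elements between the two pointers,
               //  so on the first pass: (softirq_vec + softirq_bit - 1) - softirq_vec = softirq_bit - 1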
		vec_nr = h - softirq_vec;
		
		// preempt_count: per-CPU counter that encodes the preemption-disable depth together with the softirq, hardirq and NMI nesting counts
		prev_count = preempt_count();

                // kernel statistics that can be read from /proc/stat
		kstat_incr_softirqs_this_cpu(vec_nr);

                // tracepoint entry (irq:softirq_entry)
                //----------------------
                // Tracing tools use it to see which softirq vector is about to run
		trace_softirq_entry(vec_nr);
		
		// The softirq handler
		h->action(h);
		
		// tracepoint exit (irq:softirq_exit)
		//---------------------
		// a tracer can measure the handler's run time between these two events
		trace_softirq_exit(vec_nr);
		
		// If the preempt count changed during the handler, there's a problem
		//----------------------------------------------------------------------
		// unlikely - branch prediction hint: tells the compiler the condition is expected to be false

		if (unlikely(prev_count != preempt_count())) {
			pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
			       vec_nr, softirq_to_name[vec_nr], h->action,
			       prev_count, preempt_count());
			preempt_count_set(prev_count);
		}
		
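		// Moves on to the next vector and shifts the handled bits out of pending
		//-------------------------------------------------
		// Example: pending = 00110100, ffs() returns 3, so vector 2 has just been handled.
		// h++ makes h point to vector 3, and pending >>= 3 leaves 00000110,
		// so bit 0 of pending corresponds to the vector h points to again.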
		h++;
		pending >>= softirq_bit;
	}

	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && ksirqd)
		rcu_softirq_qs();

        // Disables interrupts again
	local_irq_disable();

        // Checks if new softirqs came in during the handling of the softirq
	pending = local_softirq_pending();
	if (pending) {
		// If the deadline has not passed, no reschedule is needed and restarts are left, it handles the new softirqs as well
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;
    
                // Otherwise it wakes up ksoftirqd, the per-CPU kernel thread, and leaves the rest to it
		wakeup_softirqd();
	}

        // Stops accounting
	account_softirq_exit(current);

        // lockdep_softirq_end is CONFIG_TRACE_IRQFLAGS only; softirq_handle_end takes SOFTIRQ_OFFSET off preempt_count again
	lockdep_softirq_end(in_hardirq);
	softirq_handle_end();
	
	current_restore_flags(old_flags, PF_MEMALLOC);
}
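The bit-scanning loop in the middle of handle_softirqs is easier to follow in isolation. The sketch below replays it in userspace with an integer index in place of the h pointer and a recording function in place of the real handlers. The `demo_` names are hypothetical; `ffs()` here is the libc function from `<strings.h>`, which follows the same convention as the kernel helper quoted above.

```c
#include <assert.h>
#include <strings.h>	/* ffs() */

#define DEMO_NR 10	/* number of vectors in this sketch */

static int demo_order[DEMO_NR];	/* vector numbers in the order they ran */
static int demo_ran;

static void demo_action(int vec_nr)
{
	demo_order[demo_ran++] = vec_nr;
}

/* Walk the set bits of pending from least to most significant. */
static int demo_handle(unsigned int pending)
{
	int h = 0;		/* index playing the role of the h pointer */
	int softirq_bit;

	demo_ran = 0;
	while ((softirq_bit = ffs((int)pending))) {
		h += softirq_bit - 1;		/* skip over the clear bits */
		assert(h < DEMO_NR);
		demo_action(h);			/* h is the absolute vector number */
		h++;				/* next candidate vector */
		pending >>= softirq_bit;	/* bit 0 lines up with h again */
	}
	return demo_ran;
}
```

With `pending = 0x34` (bits 2, 4 and 5 set) the loop runs vectors 2, 4 and 5 in that order, which also shows why lower-numbered softirqs such as HI_SOFTIRQ are served before higher-numbered ones.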

From here we enter the actual network related code:

RX packet handling

The net_rx_action function polls the registered drivers that have scheduled themselves. Since the introduction of softirqs, Linux has been using the NAPI interface to limit the number of interrupts on the driver side.

A few important concepts come into view here:

NAPI: The New API is the event handling system of the Linux network stack. Instead of raising an interrupt for each packet and having the kernel do the full routing on a per-packet basis, packets are accumulated and handled in batches. [6][7][8]

SKB: Socket Buffers are the structure the kernel uses to reference and manage packets. These house the metadata of the packet, the raw data of the packet, as well as any header information. [9]

NAPI drivers signal to the kernel that they have new packets by adding themselves to the poll_list of the current CPU's softnet_data struct. On each poll a driver handles the data that has accumulated since the last poll, up to its budget.
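The effect of this scheme on the interrupt count can be shown with a toy model. It is a sketch under simplifying assumptions: one device, a plain counter instead of a ring, and a boolean instead of the real scheduling state. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_napi {
	bool scheduled;		/* waiting to be polled */
	bool irq_enabled;	/* device RX interrupt enabled? */
	int  rx_pending;	/* packets waiting in the device ring */
};

static int demo_interrupts;	/* how many interrupts the device raised */

/* A packet arrives: interrupt only if the device is not in polling mode. */
static void demo_packet_arrives(struct demo_napi *n)
{
	n->rx_pending++;
	if (!n->irq_enabled)
		return;			/* already in polling mode: no interrupt */
	demo_interrupts++;
	n->irq_enabled = false;		/* the driver masks its RX interrupt ... */
	n->scheduled = true;		/* ... and asks to be polled instead */
}

/* Poll: process up to budget packets; leave polling mode once drained. */
static int demo_poll(struct demo_napi *n, int budget)
{
	int done = n->rx_pending < budget ? n->rx_pending : budget;

	assert(n->scheduled);
	n->rx_pending -= done;
	if (done < budget) {		/* ring drained */
		n->scheduled = false;
		n->irq_enabled = true;	/* back to interrupt mode */
	}
	return done;
}
```

In this model a burst of 100 packets costs a single interrupt and two polls with a budget of 64 (64 packets, then 36), instead of 100 interrupts.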

net/core/dev.c, line 6870

//__latent_entropy - Marks the function for the GCC latent entropy plugin, which instruments it so that
// its execution mixes some extra, hard to predict bits into the kernel's entropy pool [10][11]
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
  
  // Incoming data is put into per-CPU queues; these live in the softnet_data struct
  // Here's a deprecated but useful explanation, and the kernel struct
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	
	// READ_ONCE forces the compiler to load the value exactly once, without tearing, repeating or caching the read
	// net_hotdata - a structure that groups frequently used networking data
	// netdev_budget_usecs - Maximum number of microseconds in one NAPI polling cycle

	unsigned long time_limit = jiffies +
		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
		
	
	// Maximum number of packets taken from all interfaces in one polling cycle (NAPI poll)
	// 	In one polling cycle interfaces which are registered to polling are probed in a round-robin manner. [12]
	int budget = READ_ONCE(net_hotdata.netdev_budget);
	
	// LIST_HEAD - declares and initializes an empty linked list head [13]
       //      It looks a lot like this, but it's different: https://linux.die.net/man/3/list_head
	LIST_HEAD(list);
	LIST_HEAD(repoll);

start:
        // Marks this CPU as being inside net_rx_action, so scheduling a NAPI in the meantime does not need to raise the softirq again
	sd->in_net_rx_action = true;

	local_irq_disable();
        // Moves all entries of poll_list onto list, and reinitializes poll_list as empty [14]
	list_splice_init(&sd->poll_list, &list);
	local_irq_enable();

	for (;;) {
		// a napi_struct represents one NAPI instance of a driver (typically one per receive queue)
		struct napi_struct *n;

                // Frees the skbs on this CPU's defer_list (skbs that other CPUs handed back to be freed here)
                //  for each of them it calls this function: napi_consume_skb(skb, 1)
		skb_defer_free_flush(sd);

		if (list_empty(&list)) {
			if (list_empty(&repoll)) {
				sd->in_net_rx_action = false;
				
				// Compiler barrier
				//-------------------------------------
				// The compiler will not reorder memory accesses
				// from one side of this statement to the other. This has no effect on
				// the order that the processor actually executes the generated instructions. [15]
				//  
				barrier();
				/* We need to check if ____napi_schedule()
				 * had refilled poll_list while
				 * sd->in_net_rx_action was true.
				 */
				if (!list_empty(&sd->poll_list))
					goto start;
					
				// RPS - receive packet steering
				// IPI - inter processor interrupt
				// 	
                                // When a CPU has queued packets for another CPU, that CPU is put on a list that waits for an IPI [16]
				// Only does anything if the kernel was built with RPS support
				if (!sd_has_rps_ipi_waiting(sd))
					goto end;
			}
			break;
		}

                // Takes out the first element of the list
                //-----------------------------
                // elements of the  linux kernel linked list are pointers to list_head structures,
                // which themselves are only fields of a struct of any type
                // this macro resolves the struct itself by providing the list,
                // the type of the struct in question, and the name of the field
                // inside the struct which is the list_head struct
		n = list_first_entry(&list, struct napi_struct, poll_list);
		
		// Polls the driver, and returns with the packets handled
		budget -= napi_poll(n, &repoll);

		/* If softirq window is exhausted then punt.
		 * Allow this to run for 2 jiffies since which will allow
		 * an average latency of 1.5/HZ.
		 */
		if (unlikely(budget <= 0 ||
			     time_after_eq(jiffies, time_limit))) {
                        // If we didn't reach the end of the list, but our budget or time is up, we increment time_squeeze
			sd->time_squeeze++;
			break;
		}
	}

	local_irq_disable();
        
        // Again takes whatever was added to poll_list in the meantime and reinitializes it
	list_splice_tail_init(&sd->poll_list, &list);
	// Appends the instances that asked to be polled again
	list_splice_tail(&repoll, &list);
	// And moves everything back into poll_list
	list_splice(&list, &sd->poll_list);
	if (!list_empty(&sd->poll_list))
		// If poll_list is not empty there is still work left, so the softirq is raised again
		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	else
		sd->in_net_rx_action = false;
  
        // If RPS is enabled, it wakes up other CPUs here
	net_rps_action_and_irq_enable(sd);
end:;
}

So the function iterates over all NAPI drivers that signalled that they have packets inbound, and polls them.

Next come two wrapper functions that sit between this loop and the driver code.
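The two limits that end the loop, the packet budget and the time limit, can be sketched in a few lines. This is a toy model with assumptions of its own: devices are plain backlog counters, every poll costs exactly one tick, and the `demo_` names are hypothetical. The comparison helper is modelled on the wraparound-safe idea behind time_after_eq(); the cast from unsigned to signed relies on two's-complement behaviour, as GCC and Clang provide.

```c
#include <assert.h>

/* Wraparound-safe "a is at or after b" for a free-running tick counter. */
static int demo_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Poll devices round-robin until every backlog is empty, the packet
 * budget is used up, or the time limit is reached. backlog[i] is the
 * number of packets queued on device i; one poll takes at most weight
 * packets. Returns the number of packets processed.
 */
static int demo_rx_action(int *backlog, int ndev, int budget, int weight,
			  unsigned long now, unsigned long time_limit)
{
	int total = 0;

	assert(weight > 0);
	for (;;) {
		int polled = 0;

		for (int i = 0; i < ndev; i++) {
			int work = backlog[i] < weight ? backlog[i] : weight;

			if (!work)
				continue;	/* nothing queued on this device */
			polled = 1;
			backlog[i] -= work;
			total += work;
			budget -= work;
			now++;			/* toy model: one poll costs one tick */
			if (budget <= 0 || demo_time_after_eq(now, time_limit))
				return total;	/* out of budget or out of time */
		}
		if (!polled)
			return total;		/* every device is drained */
	}
}
```

Note that the budget is only checked after a poll returns, so a run can overshoot it by up to one weight: with a budget of 300 and a weight of 64, a busy device is polled five times and 320 packets are processed.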

net/core/dev.c, line 6781

static int napi_poll(struct napi_struct *n, struct list_head *repoll)
{
	bool do_repoll = false;
	void *have;
	int work;

	list_del_init(&n->poll_list);

  // netpoll - low level packet handling for debug / crash dumps, used e.g. for remote KGDB
  // When netpoll is active on the device this takes ownership of the NAPI instance, so only this CPU polls it at this time
	have = netpoll_poll_lock(n);

	work = __napi_poll(n, &do_repoll);

	if (do_repoll)
		list_add_tail(&n->poll_list, repoll);

	netpoll_poll_unlock(have);

	return work;
}

net/core/dev.c, line 6708

static int __napi_poll(struct napi_struct *n, bool *repoll)
{
	int work, weight;

        // The per-poll budget of this NAPI instance
        //-----------------------------------------------
        // weight is the maximum number of packets the poll function may process in one call (64 by default)
	weight = n->weight;

	/* This NAPI_STATE_SCHED test is for avoiding a race
	 * with netpoll's poll_napi().  Only the entity which
	 * obtains the lock and sees NAPI_STATE_SCHED set will
	 * actually make the ->poll() call.  Therefore we avoid
	 * accidentally calling ->poll() when NAPI is not scheduled.
	 */
	work = 0;
	
	// The NAPI could have been unscheduled in the meantime, e.g. by being disabled
	if (napi_is_scheduled(n)) {
		// This is where it handles packets
		work = n->poll(n, weight);
		// This is also a dynamically set function pointer
		// So it has a tracepoint to find it afterwards
		trace_napi_poll(n, work, weight);
		
 
                // Checks whether the xdp_do_flush function was called in the handler, in case
                // XDP redirection was used [17]
                //----------------------------------------------------
                // XDP - eXpress Data Path, packet processing with eBPF (extended Berkeley Packet Filter) programs
                // It only sends a warning, and only if CONFIG_DEBUG_NET && CONFIG_BPF_SYSCALL are set
		xdp_do_check_flushed(n);
	}


	if (unlikely(work > weight))
		netdev_err_once(n->dev, "NAPI poll function %pS returned %d, exceeding its budget of %d.\n",
				n->poll, work, weight);

	if (likely(work < weight))
		return work;
	// from here on work >= weight: the driver used up its whole budget

	/* Drivers must not modify the NAPI state if they
	 * consume the entire weight.  In such cases this code
	 * still "owns" the NAPI instance and therefore can
	 * move the instance around on the list at-will.
	 */
	// It only enters here if napi_disable() has been called on this NAPI instance
	//-----------------------------------
	// napi_complete(n) is napi_complete_done(n, 0): it takes the instance out of polling mode
        //
	if (unlikely(napi_disable_pending(n))) {
		napi_complete(n);
		return work;
	}

	/* The NAPI context has more processing work, but busy-polling
	 * is preferred. Exit early.
	 */
	// busy polling - Busy polling allows a user process to check for incoming packets
        // before the device interrupt fires
	if (napi_prefer_busy_poll(n)) {
		if (napi_complete_done(n, work)) {
			/* If timeout is not set, we need to make sure
			 * that the NAPI is re-scheduled.
			 */
			napi_schedule(n);
		}
		return work;
	}

	if (n->gro_bitmask) {
		/* flush too old packets
		 * If HZ < 1000, flush all packets.
		 */
		napi_gro_flush(n, HZ >= 1000);
	}

	gro_normal_list(n);

	/* Some drivers may have called napi_schedule
	 * prior to exhausting their budget.
	 */
	if (unlikely(!list_empty(&n->poll_list))) {
		pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
			     n->dev ? n->dev->name : "backlog");
		return work;
	}

	*repoll = true;

	return work;
}

The poll function is driver specific and handles a lot of offloading, but eventually it reaches __netif_receive_skb_core. Between these two points the driver and the netif wrapper functions create socket buffers from the data, perform validity checks, optionally group them with GRO, and perform memory management tasks.

These poll functions are also registered dynamically, with the netif_napi_add function. I will go through a brief example with the Intel e1000 line of cards. [18]

RX Ring: memory-mapped ring buffer in DMA memory where the incoming packets are put. If the ring overflows, newly arriving packets are dropped.

GRO: Generic Receive Offload is a way for the kernel to merge similar packets before they are handled. Packets that belong to the same flow are combined and travel up the stack together. [19]
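The merging idea behind GRO can be sketched with a very reduced packet model. The assumptions are mine, not the kernel's: a flow is a single integer, a packet can only be merged into the most recently held one, and only the exactly next in-order segment qualifies. Real GRO tracks several flows at once and checks many more header fields. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_pkt {
	unsigned int flow;	/* stands in for the address/port tuple */
	unsigned int seq;	/* sequence number of the first payload byte */
	unsigned int len;	/* payload length */
};

/* Try to append p to held; succeeds only for the next in-order segment. */
static bool demo_gro_merge(struct demo_pkt *held, const struct demo_pkt *p)
{
	if (held->flow != p->flow || held->seq + held->len != p->seq)
		return false;
	held->len += p->len;
	return true;
}

/* Coalesce in[0..n) into out[]; returns the number of packets left. */
static size_t demo_gro(const struct demo_pkt *in, size_t n,
		       struct demo_pkt *out)
{
	size_t m = 0;

	assert(in != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (m > 0 && demo_gro_merge(&out[m - 1], &in[i]))
			continue;	/* merged into the previous packet */
		out[m++] = in[i];	/* starts a new held packet */
	}
	return m;
}
```

Two back-to-back 1000-byte segments of the same flow come out as one 2000-byte packet, so the rest of the stack does its per-packet work once instead of twice.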

drivers/net/ethernet/intel/e1000/e1000_main.c, line 3792


/**
 * e1000_clean - NAPI Rx polling callback
 * @napi: napi struct containing references to driver info
 * @budget: budget given to driver for receive packets
 **/
static int e1000_clean(struct napi_struct *napi, int budget)
{
  // container_of - cast a member of a structure out to the containing structure
  // The NAPI struct is in the adapter structure
	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter,
						     napi);
	int tx_clean_complete = 0, work_done = 0;

        // e1000_clean_tx_irq - Reclaim resources after transmit completes
        //-----------------------------------------------------------------
        // Transmit doesn't happen here, it just frees memory
	tx_clean_complete = e1000_clean_tx_irq(adapter, &adapter->tx_ring[0]);

	adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);

	if (!tx_clean_complete || work_done == budget)
		return budget;

	/* Exit the polling mode, but don't re-enable interrupts if stack might
	 * poll us due to busy-polling
	 */
	if (likely(napi_complete_done(napi, work_done))) {
		if (likely(adapter->itr_setting & 3))
			e1000_set_itr(adapter);
		if (!test_bit(__E1000_DOWN, &adapter->flags))
			e1000_irq_enable(adapter);
	}

	return work_done;
}
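The container_of call at the top of e1000_clean can be reproduced in userspace. The macro below is a simplified version without the type checking that the kernel macro adds, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>	/* offsetof */

/* Simplified userspace version: step back from the member to its container. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_napi {
	int weight;
};

struct demo_adapter {
	int id;
	struct demo_napi napi;	/* embedded in the adapter, not a pointer */
};

/* Like a poll callback, this only receives the embedded member. */
static int demo_adapter_id(struct demo_napi *napi)
{
	struct demo_adapter *adapter;

	assert(napi != NULL);
	adapter = demo_container_of(napi, struct demo_adapter, napi);
	return adapter->id;
}
```

This is why the generic NAPI code can hand a driver nothing but a `struct napi_struct *`: as long as that struct is embedded in the driver's private structure, the driver can always find its way back to its own data.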

It then jumps to the adapter's cleaning code, clean_rx. Most drivers' naming convention uses the clean prefix for these functions. This is the variant with jumbo frame support:

drivers/net/ethernet/intel/e1000/e1000_main.c, line 4126

/**
 * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
 * @adapter: board private structure
 * @rx_ring: ring to clean
 * @work_done: amount of napi work completed this call
 * @work_to_do: max amount of work allowed for this call to do
 *
 * the return value indicates whether actual cleaning was done, there
 * is no guarantee that everything was cleaned
 */
 
 static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
				     struct e1000_rx_ring *rx_ring,
				     int *work_done, int work_to_do){
				
// I'm skipping a few parts of memory management here
// This part is here to handle fragmented packets
// It is inside a while loop

#define rxtop rx_ring->rx_skb_top
process_skb:
    // EOP = /* End of Packet */
		if (!(status & E1000_RXD_STAT_EOP)) {
			/* this descriptor is only the beginning (or middle) */
			// There is no skb for this, so this is the first fragment of the packet
			if (!rxtop) {
				/* this is the beginning of a chain */
				// This creates the skb or gets it from the napi struct
				rxtop = napi_get_frags(&adapter->napi);
				if (!rxtop)
					break;
        
                               /**
                                * skb_fill_page_desc - initialise a paged fragment in an skb
                                * @skb: buffer containing fragment to be initialised
                                * @i: paged fragment index to initialise
                                * @page: the page to use for this fragment
                                * @off: the offset to the data with @page
                                * @size: the length of the data
                                *
                                * As per __skb_fill_page_desc() -- initialises the @i'th fragment of
                                * @skb to point to @size bytes at offset @off within @page. In
                                * addition updates @skb such that @i is the last fragment.
                                *
                                * Does not take any additional reference on the fragment.
                                */
				skb_fill_page_desc(rxtop, 0,
						   buffer_info->rxbuf.page,
						   0, length);
			} else {
				/* this is the middle of a chain */
				skb_fill_page_desc(rxtop,
				    skb_shinfo(rxtop)->nr_frags,
				    buffer_info->rxbuf.page, 0, length);
			}
			// Grows the skb's length according to the new fragment
			// (e1000_consume_page is a helper function for jumbo packets)
			e1000_consume_page(buffer_info, rxtop, length);
			// Reads the next section of data on the ring
			goto next_desc;
		} else {
			if (rxtop) {
				/* end of the chain */
				skb_fill_page_desc(rxtop,
				    skb_shinfo(rxtop)->nr_frags,
				    buffer_info->rxbuf.page, 0, length);
				skb = rxtop;
				rxtop = NULL;
				e1000_consume_page(buffer_info, skb, length);
			} else {
				struct page *p;
				/* no chain, got EOP, this buf is the packet
				 * copybreak to save the put_page/alloc_page
				 */
				p = buffer_info->rxbuf.page;
				// copybreak - "Maximum size of packet that is copied to a new buffer on receive"
				if (length <= copybreak) {
					// If NETIF_F_RXFCS is set, the frame check sequence (4 bytes long) is kept in the skb as well; otherwise it is cut off here
					if (likely(!(netdev->features & NETIF_F_RXFCS)))
						length -= 4;
						
					// Creates a new skb
					skb = e1000_alloc_rx_skb(adapter,
								 length);
					if (!skb)
						break;

					memcpy(skb_tail_pointer(skb),
					       page_address(p), length);

					/* re-use the page, so don't erase
					 * buffer_info->rxbuf.page
					 */
					/**
                                         *	skb_put - add data to a buffer
                                         *	@skb: buffer to use
                                         *	@len: amount of data to add
                                         *
                                         *	This function extends the used data area of the buffer. If this would
                                         *	exceed the total buffer size the kernel will panic. A pointer to the
                                         *	first byte of the extra data is returned.
                                         */
					skb_put(skb, length);
					// e1000_rx_checksum - Receive Checksum Offload for 82543, a type of NIC
					e1000_rx_checksum(adapter,
							  status | rx_desc->errors << 24,
							  le16_to_cpu(rx_desc->csum), skb);

					total_rx_bytes += skb->len;
					total_rx_packets++;
                                        
                                        // This is the important part, the skb is handled
					e1000_receive_skb(adapter, status,
							  rx_desc->special, skb);

                                        // Jumps to the next_desc label inside the while loop to continue with the next descriptor
					goto next_desc;
				} else {
                                        // Too big to copy: the page is attached to an skb as a fragment instead
					skb = napi_get_frags(&adapter->napi);
					if (!skb) {
						adapter->alloc_rx_buff_failed++;
						break;
					}
					skb_fill_page_desc(skb, 0, p, 0,
							   length);

					e1000_consume_page(buffer_info, skb,
							   length);
				}
			}
		}

		/* Receive Checksum Offload XXX recompute due to CRC strip? */
		e1000_rx_checksum(adapter,
				  (u32)(status) |
				  ((u32)(rx_desc->errors) << 24),
				  le16_to_cpu(rx_desc->csum), skb);

		total_rx_bytes += (skb->len - 4); /* don't count FCS */
		if (likely(!(netdev->features & NETIF_F_RXFCS)))
			pskb_trim(skb, skb->len - 4);
		total_rx_packets++;

		if (status & E1000_RXD_STAT_VP) {
			__le16 vlan = rx_desc->special;
			u16 vid = le16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;

			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vid);
		}
                // Hands the fragment based skb to GRO, basically the same as the e1000_receive_skb path,
                //  but with frags in its name
		napi_gro_frags(&adapter->napi);

This code turns the data on the RX ring into skbs and passes them on to GRO. The important function here is e1000_receive_skb:
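The way a jumbo frame is stitched together from several descriptors can be reduced to the handling of the End of Packet bit. The sketch below only tracks lengths and leaves out pages, skbs and the copybreak shortcut; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_desc {
	unsigned int length;	/* bytes in this buffer */
	bool eop;		/* End of Packet bit from the descriptor status */
};

/*
 * Walk ring[0..n) and write the total length of every completed packet
 * into pkt_len[]. Returns the number of completed packets. A trailing
 * chain without EOP stays pending (the role rx_skb_top plays in the
 * driver) and is not counted.
 */
static size_t demo_clean_rx(const struct demo_desc *ring, size_t n,
			    unsigned int *pkt_len)
{
	unsigned int chain = 0;	/* bytes collected for the packet in progress */
	size_t done = 0;

	assert(ring != NULL && pkt_len != NULL);
	for (size_t i = 0; i < n; i++) {
		chain += ring[i].length;	/* first, middle or last fragment */
		if (!ring[i].eop)
			continue;		/* more fragments follow */
		pkt_len[done++] = chain;	/* packet complete: hand it up */
		chain = 0;
	}
	return done;
}
```

A 9000-byte frame spread over buffers of 4096, 4096 and 808 bytes is reported as one packet once the third descriptor, the one with EOP set, has been seen.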

/**
 * e1000_receive_skb - helper function to handle rx indications
 * @adapter: board private structure
 * @status: descriptor status field as written by hardware
 * @vlan: descriptor vlan field as written by hardware (no le/be conversion)
 * @skb: pointer to sk_buff to be indicated to stack
 */
static void e1000_receive_skb(struct e1000_adapter *adapter, u8 status,
			      __le16 vlan, struct sk_buff *skb)
{
        // Reads the EtherType from the Ethernet header, stores it as the protocol of the skb, and moves past the Ethernet header
	skb->protocol = eth_type_trans(skb, adapter->netdev);
        
        // Puts the vlan ID if it exists
	if (status & E1000_RXD_STAT_VP) {
		u16 vid = le16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;

		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vid);
	}
        // Handles the GRO, and sends the packet up to skb handling
	napi_gro_receive(&adapter->napi, skb);
}
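The two lookups in e1000_receive_skb, the protocol and the VLAN ID, both come down to reading big-endian fields. The sketch below shows the simplest possible version. It is not what eth_type_trans does in full, since the real function also handles 802.3 length fields and classifies the destination address. The `DEMO_` constants and function names are hypothetical; the 12-bit mask reflects the width of the 802.1Q VLAN ID field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN  14	/* dst MAC (6) + src MAC (6) + EtherType (2) */
#define DEMO_VLAN_MASK 0x0FFF	/* the low 12 bits of the tag are the VLAN ID */

/* Read the big-endian EtherType field; returns 0 if the frame is too short. */
static uint16_t demo_eth_type(const uint8_t *frame, size_t len)
{
	assert(frame != NULL);
	if (len < DEMO_ETH_HLEN)
		return 0;
	return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Extract the VLAN ID from a 16-bit tag that is already in CPU byte order. */
static uint16_t demo_vlan_id(uint16_t tag)
{
	return tag & DEMO_VLAN_MASK;
}
```

For an IPv4 frame bytes 12 and 13 are 0x08 and 0x00, so `demo_eth_type` returns 0x0800; a tag of 0xE064 carries priority bits in its upper part and VLAN ID 100 in its lower 12 bits.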

net/core/gro.c, line 626

gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
	gro_result_t ret;

	skb_mark_napi_id(skb, napi);
	trace_napi_gro_receive_entry(skb);

	skb_gro_reset_offset(skb, 0);

	ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
	trace_napi_gro_receive_exit(ret);

	return ret;
}

dev_gro_receive starts a chain of function calls: each layer checks whether there is a held GRO packet the new packet can be merged with, then calls the gro_receive function of the protocol above it.

The valid gro return types are as below:

include/linux/netdevice.h, line 408

  • GRO_MERGED – The packet was merged into a list, it will be handled later as a list
  • GRO_MERGED_FREE – The packet’s data was merged into a different skb, it will be handled later, and the current skb can be deleted
  • GRO_HELD – The SKB was put into a new list
  • GRO_NORMAL – The skb can be handled like normal
  • GRO_CONSUMED – The packet was consumed by the gro_receive function call, it doesn’t need further processing

net/core/gro.c (napi_skb_finish)

static gro_result_t napi_skb_finish(struct napi_struct *napi,
				    struct sk_buff *skb,
				    gro_result_t ret)
{
	switch (ret) {
	case GRO_NORMAL:
		gro_normal_one(napi, skb, 1);
		break;

	case GRO_MERGED_FREE:
		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
			napi_skb_free_stolen_head(skb);
		else if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
			__kfree_skb(skb);
		else
			__napi_kfree_skb(skb, SKB_CONSUMED);
		break;

	case GRO_HELD:
	case GRO_MERGED:
	case GRO_CONSUMED:
		break;
	}

	return ret;
}

This part of the code makes sure that the kernel only continues handling the packet right away if GRO is done with it, that is, the skb was neither merged into another packet nor held back.

The next two functions make sure the kernel doesn’t have to handle packets one by one.

include/net/gro.h, line 523


/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
static inline void gro_normal_list(struct napi_struct *napi)
{
	if (!napi->rx_count)
		return;
	netif_receive_skb_list_internal(&napi->rx_list);
	INIT_LIST_HEAD(&napi->rx_list);
	napi->rx_count = 0;
}

/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
 * pass the whole batch up to the stack.
 */
static inline void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb, int segs)
{
	list_add_tail(&skb->list, &napi->rx_list);
	napi->rx_count += segs;
	if (napi->rx_count >= READ_ONCE(net_hotdata.gro_normal_batch))
		gro_normal_list(napi);
}
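The batching done by this pair of functions is easy to model with counters. In the sketch below the skb list is replaced by a count, and the batch size of 8 is an assumption made for the example rather than a value read from the kernel configuration. All `demo_` names are hypothetical.

```c
#include <assert.h>

#define DEMO_BATCH 8	/* assumed batch size for this sketch */

struct demo_napi {
	int rx_count;	/* packets queued but not yet passed up */
	int flushes;	/* how many times a batch was handed to the stack */
	int delivered;	/* total packets handed to the stack */
};

/* Pass everything that is queued up to the stack in one go. */
static void demo_normal_list(struct demo_napi *n)
{
	if (!n->rx_count)
		return;
	n->delivered += n->rx_count;
	n->flushes++;
	n->rx_count = 0;
}

/* Queue one packet (worth segs segments); flush when the batch is full. */
static void demo_normal_one(struct demo_napi *n, int segs)
{
	assert(segs > 0);
	n->rx_count += segs;
	if (n->rx_count >= DEMO_BATCH)
		demo_normal_list(n);
}
```

Queueing 20 single-segment packets triggers two flushes of 8 and leaves 4 packets waiting; those are passed up by the explicit gro_normal_list call at the end of __napi_poll, modelled here by calling `demo_normal_list` directly.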

netif_receive_skb_list_internal calls __netif_receive_skb_list; between them they take care of RPS steering and split the list according to PF_MEMALLOC. Next comes __netif_receive_skb_list_core. This function calls for further processing on each skb, but also groups them, so that consecutive packets with the same protocol and originating device are handed over together as one sublist.

net/core/dev.c, line 5677

static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
{
	/* Fast-path assumptions:
	 * - There is no RX handler.
	 * - Only one packet_type matches.
	 * If either of these fails, we will end up doing some per-packet
	 * processing in-line, then handling the 'last ptype' for the whole
	 * sublist.  This can't cause out-of-order delivery to any single ptype,
	 * because the 'last ptype' must be constant across the sublist, and all
	 * other ptypes are handled per-packet.
	 */
	/* Current (common) ptype of sublist */
	struct packet_type *pt_curr = NULL;
	/* Current (common) orig_dev of sublist */
	struct net_device *od_curr = NULL;
	struct list_head sublist;
	struct sk_buff *skb, *next;

	INIT_LIST_HEAD(&sublist);
	list_for_each_entry_safe(skb, next, head, list) {
		struct net_device *orig_dev = skb->dev;
		struct packet_type *pt_prev = NULL;

		skb_list_del_init(skb);
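                // Does the per-skb handling (taps, ingress hooks, vlan, rx_handler)
                // On return, pt_prev is the last matching packet_type, which has not been called yet
                // The rest of the loop collects skbs with the same packet_type and orig device in one sublist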
		__netif_receive_skb_core(&skb, pfmemalloc, &pt_prev);
		if (!pt_prev)
			continue;
		if (pt_curr != pt_prev || od_curr != orig_dev) {
			/* dispatch old sublist */
                        // Fast path: passes the whole sublist to the protocol's list_func (e.g. ip_list_rcv), if it has one
			__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
			/* start new sublist */
			INIT_LIST_HEAD(&sublist);
			pt_curr = pt_prev;
			od_curr = orig_dev;
		}
		list_add_tail(&skb->list, &sublist);
	}

	/* dispatch final sublist */
	__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
}
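
The sublist grouping can be sketched in userspace C as follows. This is a simplified illustration under stated assumptions: struct pkt, struct stats, flush_run and group_and_dispatch are hypothetical names, integers stand in for the packet_type and orig_dev pointers, and a proto of 0 models the case where __netif_receive_skb_core leaves pt_prev as NULL.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKTS 32             /* upper bound for the sketch (assumption) */

/* Hypothetical packet: only the two keys the grouping looks at. */
struct pkt {
	int proto;              /* stands in for the matched packet_type, 0 = none */
	int dev;                /* stands in for orig_dev */
};

/* Records how many sublists were dispatched and how long each one was. */
struct stats {
	size_t runs;
	size_t sizes[MAX_PKTS];
};

/* Dispatch one sublist; an empty sublist is ignored, as in the kernel helper. */
static void flush_run(struct stats *st, size_t len)
{
	if (len == 0)
		return;
	st->sizes[st->runs++] = len;
}

/* Walk the packets once, starting a new sublist whenever proto or dev changes. */
static void group_and_dispatch(const struct pkt *p, size_t n, struct stats *st)
{
	int cur_proto = 0, cur_dev = 0;
	size_t run_len = 0;

	assert(n <= MAX_PKTS);
	st->runs = 0;
	for (size_t i = 0; i < n; i++) {
		if (p[i].proto == 0)
			continue;               /* nothing left to deliver */
		if (p[i].proto != cur_proto || p[i].dev != cur_dev) {
			flush_run(st, run_len); /* dispatch old sublist */
			run_len = 0;            /* start new sublist */
			cur_proto = p[i].proto;
			cur_dev = p[i].dev;
		}
		run_len++;
	}
	flush_run(st, run_len);                 /* dispatch final sublist */
}
```

Note that a skipped packet does not end the current sublist, and that only consecutive packets are grouped: an ARP packet in the middle of an IPv4 stream splits the stream into two sublists.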

__netif_receive_skb_core does the following:

  • XDP filtering
  • Non protocol specific / promiscuous packet handling
    It duplicates packets to monitors (taps) here
  • TC and netfilter ingress filtering
  • VLAN stripping and VLAN device handling
  • Running the rx_handler (macvlan, ipvlan and the like)
  • Protocol specific packet handling

Protocol handlers are registered with the dev_add_pack function.20 The net device’s rx_handler runs before them. It is registered with the function netdev_rx_handler_register, and devices such as ipvlan and macvlan use it to forward packets to different devices.
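
As a rough sketch of how registered protocol handlers can be found by EtherType, the following userspace C code keeps them in a small hash table. The names struct ptype, add_pack and lookup are hypothetical, and the 16 buckets are an assumption modelled on the ptype_base[ntohs(type) & PTYPE_HASH_MASK] lookup in the receive code. The kernel converts the type from network byte order first and delivers to every matching entry in the bucket; the sketch uses host-order values and returns the first match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_SIZE 16                    /* number of buckets (assumption) */
#define HASH_MASK (HASH_SIZE - 1)

/* Hypothetical, much reduced version of a packet_type entry. */
struct ptype {
	uint16_t type;                  /* EtherType in host byte order */
	const char *name;
	struct ptype *next;             /* chain within one bucket */
};

static struct ptype *buckets[HASH_SIZE];

/* Register a handler: push it onto the bucket selected by its EtherType. */
static void add_pack(struct ptype *pt)
{
	struct ptype **head;

	assert(pt != NULL);
	head = &buckets[pt->type & HASH_MASK];
	pt->next = *head;
	*head = pt;
}

/* Find the handler for an EtherType; colliding types share a bucket. */
static const struct ptype *lookup(uint16_t type)
{
	for (const struct ptype *pt = buckets[type & HASH_MASK]; pt; pt = pt->next)
		if (pt->type == type)
			return pt;
	return NULL;
}
```

IPv4 (0x0800) and 802.1Q (0x8100) land in the same bucket with this mask, which is why the lookup still has to compare the full type.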

/**
 * enum rx_handler_result - Possible return values for rx_handlers.
 * @RX_HANDLER_CONSUMED: skb was consumed by rx_handler, do not process it
 * further.
 * @RX_HANDLER_ANOTHER: Do another round in receive path. This is indicated in
 * case skb->dev was changed by rx_handler.
 * @RX_HANDLER_EXACT: Force exact delivery, no wildcard.
 * @RX_HANDLER_PASS: Do nothing, pass the skb as if no rx_handler was called.
 *
 * rx_handlers are functions called from inside __netif_receive_skb(), to do
 * special processing of the skb, prior to delivery to protocol handlers.
 *
 * Currently, a net_device can only have a single rx_handler registered. Trying
 * to register a second rx_handler will return -EBUSY.
 *
 * To register a rx_handler on a net_device, use netdev_rx_handler_register().
 * To unregister a rx_handler on a net_device, use
 * netdev_rx_handler_unregister().
 *
 * Upon return, rx_handler is expected to tell __netif_receive_skb() what to
 * do with the skb.
 *
 * If the rx_handler consumed the skb in some way, it should return
 * RX_HANDLER_CONSUMED. This is appropriate when the rx_handler arranged for
 * the skb to be delivered in some other way.
 *
 * If the rx_handler changed skb->dev, to divert the skb to another
 * net_device, it should return RX_HANDLER_ANOTHER. The rx_handler for the
 * new device will be called if it exists.
 *
 * If the rx_handler decides the skb should be ignored, it should return
 * RX_HANDLER_EXACT. The skb will only be delivered to protocol handlers that
 * are registered on exact device (ptype->dev == skb->dev).
 *
 * If the rx_handler didn't change skb->dev, but wants the skb to be normally
 * delivered, it should return RX_HANDLER_PASS.
 *
 * A device without a registered rx_handler will behave as if rx_handler
 * returned RX_HANDLER_PASS.
 */
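
The return values described in this comment can be tried out with a small userspace model. Everything here is a hypothetical simplification: struct dev, struct pkt, rx_handler_register, redirect_to_upper and receive are made-up names, and the real code does much more between the rounds.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum rx_result { RX_CONSUMED, RX_ANOTHER, RX_EXACT, RX_PASS };

struct dev;
struct pkt { struct dev *dev; };
typedef enum rx_result (*rx_handler_t)(struct pkt *p);

/* Hypothetical device with room for a single rx_handler. */
struct dev {
	const char *name;
	rx_handler_t rx_handler;
	struct dev *upper;      /* where the sketch's handler redirects to */
};

/* Only one handler per device: a second registration fails with -EBUSY. */
static int rx_handler_register(struct dev *d, rx_handler_t h)
{
	if (d->rx_handler)
		return -EBUSY;
	d->rx_handler = h;
	return 0;
}

/* Handler that retargets the packet, roughly what a macvlan-like device does. */
static enum rx_result redirect_to_upper(struct pkt *p)
{
	if (!p->dev->upper)
		return RX_PASS;
	p->dev = p->dev->upper;
	return RX_ANOTHER;
}

struct outcome {
	int rounds;             /* how often the receive path was entered */
	int consumed;           /* handler took the packet */
	int exact;              /* only exact-device delivery allowed */
	const struct dev *final_dev;
};

static struct outcome receive(struct pkt *p)
{
	struct outcome o = { 0, 0, 0, NULL };

	assert(p != NULL && p->dev != NULL);
another_round:
	o.rounds++;
	if (p->dev->rx_handler) {
		switch (p->dev->rx_handler(p)) {
		case RX_CONSUMED:
			o.consumed = 1;
			o.final_dev = p->dev;
			return o;
		case RX_ANOTHER:
			goto another_round;     /* p->dev changed, start over */
		case RX_EXACT:
			o.exact = 1;
			break;
		case RX_PASS:
			break;
		}
	}
	o.final_dev = p->dev;   /* protocol delivery would happen here */
	return o;
}
```

A packet arriving on a device with the redirecting handler takes two rounds: the first one changes its device, the second one finds no handler on the new device and falls through to protocol delivery.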

net/core/dev.c, line 5457

static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
	struct packet_type *ptype, *pt_prev;
	rx_handler_func_t *rx_handler;
	struct sk_buff *skb = *pskb;
	struct net_device *orig_dev;
	bool deliver_exact = false;
	int ret = NET_RX_DROP;
	__be16 type;

	net_timestamp_check(!READ_ONCE(net_hotdata.tstamp_prequeue), skb);

	trace_netif_receive_skb(skb);

	orig_dev = skb->dev;

	skb_reset_network_header(skb);
	if (!skb_transport_header_was_set(skb))
		skb_reset_transport_header(skb);
	skb_reset_mac_len(skb);

	pt_prev = NULL;

another_round:
	skb->skb_iif = skb->dev->ifindex;

	__this_cpu_inc(softnet_data.processed);

        // Runs generic XDP if it is enabled; any verdict other than XDP_PASS skips all further handling
	if (static_branch_unlikely(&generic_xdp_needed_key)) {
		int ret2;

		migrate_disable();
		ret2 = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog),
				      &skb);
		migrate_enable();

		if (ret2 != XDP_PASS) {
			ret = NET_RX_DROP;
			goto out;
		}
	}
        // Takes off the vlan tag
	if (eth_type_vlan(skb->protocol)) {
		skb = skb_vlan_untag(skb);
		if (unlikely(!skb))
			goto out;
	}
        
        // TC is the traffic control subsystem in Linux21; it is skipped if the skb is flagged to bypass classification
	if (skb_skip_tc_classify(skb))
		goto skip_classify;

        // pfmemalloc: the skb was allocated from the emergency memory reserves22, so it skips the taps
	if (pfmemalloc)
		goto skip_taps;
        
        // This part is for handling promiscuous packets
        //RCU -  “Read, Copy, Update” 23
        // A synchronisation method for data that is accessed in parallel
        /**
        * list_for_each_entry_rcu	-	iterate over rcu list of given type
        * @pos:	the type * to use as a loop cursor.
        * @head:	the head for your list.
        * @member:	the name of the list_head within the struct.
        * @cond:	optional lockdep expression if called from non-RCU protection.
        *
        * This list-traversal primitive may safely run concurrently with
        * the _rcu list-mutation primitives such as list_add_rcu()
        * as long as the traversal is guarded by rcu_read_lock().
        */
	list_for_each_entry_rcu(ptype, &net_hotdata.ptype_all, list) {
		if (pt_prev)
                        // deliver_skb calls the handler function of the packet_type (for taps typically packet_rcv)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
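        // net_hotdata.ptype_all (above): promiscuous packet_types irrespective of netdevice.
        // skb->dev->ptype_all (below): promiscuous packet_types bound to this netdevice.
        // An AF_PACKET socket listening for all protocols (ETH_P_ALL) adds a packet_type to one of these lists.
        // packet_rcv() is called to pass the packet to userspace.24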
	list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}

skip_taps:
#ifdef CONFIG_NET_INGRESS
	if (static_branch_unlikely(&ingress_needed_key)) {
		bool another = false;
                // sets the skip egress flag
		nf_skip_egress(skb, true);
                // Does traffic control
                // It returns with another being true if the packet was redirected
		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
					 &another);
		if (another)
			goto another_round;
		if (!skb)
			goto out;

		nf_skip_egress(skb, false);
                // NF_INGRESS hook
		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
			goto out;
	}
#endif
	skb_reset_redirect(skb);
skip_classify:
        // Drops pfmemalloc skbs if the protocol cannot handle them
	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
		goto drop;
        // If a vlan tag is present, the pending ptype (pt_prev) is delivered first, then the vlan code takes over
	if (skb_vlan_tag_present(skb)) {
		if (pt_prev) {
			ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = NULL;
		}
                // sets the skb's device to vlan device
		if (vlan_do_receive(&skb))
			goto another_round;
		else if (unlikely(!skb))
			goto out;
	}

	rx_handler = rcu_dereference(skb->dev->rx_handler);
	if (rx_handler) {
		if (pt_prev) {
			ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = NULL;
		}
		switch (rx_handler(&skb)) {
		case RX_HANDLER_CONSUMED:
			ret = NET_RX_SUCCESS;
			goto out;
		case RX_HANDLER_ANOTHER:
			goto another_round;
		case RX_HANDLER_EXACT:
			deliver_exact = true;
			break;
		case RX_HANDLER_PASS:
			break;
		default:
			BUG();
		}
	}

	if (unlikely(skb_vlan_tag_present(skb)) && !netdev_uses_dsa(skb->dev)) {
check_vlan_id:
		if (skb_vlan_tag_get_id(skb)) {
			/* Vlan id is non 0 and vlan_do_receive() above couldn't
			 * find vlan device.
			 */
			skb->pkt_type = PACKET_OTHERHOST;
		} else if (eth_type_vlan(skb->protocol)) {
			/* Outer header is 802.1P with vlan 0, inner header is
			 * 802.1Q or 802.1AD and vlan_do_receive() above could
			 * not find vlan dev for vlan id 0.
			 */
			__vlan_hwaccel_clear_tag(skb);
			skb = skb_vlan_untag(skb);
			if (unlikely(!skb))
				goto out;
			if (vlan_do_receive(&skb))
				/* After stripping off 802.1P header with vlan 0
				 * vlan dev is found for inner header.
				 */
				goto another_round;
			else if (unlikely(!skb))
				goto out;
			else
				/* We have stripped outer 802.1P vlan 0 header.
				 * But could not find vlan dev.
				 * check again for vlan id to set OTHERHOST.
				 */
				goto check_vlan_id;
		}
		/* Note: we might in the future use prio bits
		 * and set skb->priority like in vlan_do_receive()
		 * For the time being, just ignore Priority Code Point
		 */
		__vlan_hwaccel_clear_tag(skb);
	}

	type = skb->protocol;

        // Protocol specific handling
        // Global packet handling
	/* deliver only exact match when indicated */
	if (likely(!deliver_exact)) {
		deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
				       &ptype_base[ntohs(type) &
						   PTYPE_HASH_MASK]);
	}
        // The original ingress device
	deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
			       &orig_dev->ptype_specific);
        // The current device
	if (unlikely(skb->dev != orig_dev)) {
		deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
				       &skb->dev->ptype_specific);
	}

	if (pt_prev) {
		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
			goto drop;
		*ppt_prev = pt_prev;
	} else {
drop:
		if (!deliver_exact)
			dev_core_stats_rx_dropped_inc(skb->dev);
		else
			dev_core_stats_rx_nohandler_inc(skb->dev);
		kfree_skb_reason(skb, SKB_DROP_REASON_UNHANDLED_PROTO);
		/* Jamal, now you will not able to escape explaining
		 * me how you were going to use this. :-)
		 */
		ret = NET_RX_DROP;
	}

out:
	/* The invariant here is that if *ppt_prev is not NULL
	 * then skb should also be non-NULL.
	 *
	 * Apparently *ppt_prev assignment above holds this invariant due to
	 * skb dereferencing near it.
	 */
	*pskb = skb;
	return ret;
}
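
The pt_prev variable that runs through this function implements a "deliver to the previous one, keep the last one pending" idiom. A minimal userspace sketch of that idiom, assuming made-up names (struct buf, struct handler, deliver_shared, deliver_deferred) and a handler that drops its reference as soon as it is done:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer with a reference count, like skb->users. */
struct buf {
	int users;
	int extra_refs;         /* how many times an extra reference was taken */
};

struct handler {
	int seen;               /* number of buffers this handler was given */
};

/* Like deliver_skb(): a non-final handler gets its own reference. */
static void deliver_shared(struct buf *b, struct handler *h)
{
	assert(b->users > 0);
	b->users++;
	b->extra_refs++;
	h->seen++;
	b->users--;             /* the handler releases its reference when done */
}

/* Deliver to all handlers except the last one, which is returned so the
 * caller can hand over the buffer itself without taking another reference.
 */
static struct handler *deliver_deferred(struct buf *b, struct handler *hs,
					size_t n)
{
	struct handler *prev = NULL;

	for (size_t i = 0; i < n; i++) {
		if (prev)
			deliver_shared(b, prev);
		prev = &hs[i];
	}
	return prev;
}
```

With three handlers only two extra references are taken. The third handler is returned undelivered, which corresponds to *ppt_prev being filled in at the end of __netif_receive_skb_core.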

net/core/dev.c, line 5702

static inline void __netif_receive_skb_list_ptype(struct list_head *head,
						  struct packet_type *pt_prev,
						  struct net_device *orig_dev)
{
	struct sk_buff *skb, *next;

	if (!pt_prev)
		return;
	if (list_empty(head))
		return;
	if (pt_prev->list_func != NULL)
		INDIRECT_CALL_INET(pt_prev->list_func, ipv6_list_rcv,
				   ip_list_rcv, head, pt_prev, orig_dev);
	else
		list_for_each_entry_safe(skb, next, head, list) {
			skb_list_del_init(skb);
			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
		}
}
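
The choice between a list handler and a per-packet fallback can be sketched like this in userspace C. The names struct proto_handler, struct counters, dispatch, count_one and count_list are hypothetical, and integers stand in for skbs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protocol handler: func is mandatory, list_func is optional. */
struct proto_handler {
	void (*func)(int pkt, void *ctx);
	void (*list_func)(const int *pkts, size_t n, void *ctx);
};

struct counters {
	size_t single_calls;    /* calls to func */
	size_t list_calls;      /* calls to list_func */
	size_t packets;         /* packets seen in total */
};

/* Use the batch handler if the protocol has one, else call func per packet. */
static void dispatch(const struct proto_handler *h, const int *pkts, size_t n,
		     void *ctx)
{
	if (!h || n == 0)
		return;
	if (h->list_func) {
		h->list_func(pkts, n, ctx);
		return;
	}
	assert(h->func != NULL);
	for (size_t i = 0; i < n; i++)
		h->func(pkts[i], ctx);
}

/* Example handlers that only count what they receive. */
static void count_one(int pkt, void *ctx)
{
	struct counters *c = ctx;

	(void)pkt;
	c->single_calls++;
	c->packets++;
}

static void count_list(const int *pkts, size_t n, void *ctx)
{
	struct counters *c = ctx;

	(void)pkts;
	c->list_calls++;
	c->packets += n;
}
```

A protocol with a list handler sees three packets in one call, while a protocol without one is called three times.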

If it was an IPv4 packet, the handler is ip_rcv. Its list variant, ip_list_rcv, does the same for a whole sublist: it calls ip_rcv_core on each skb and then runs the NF_INET_PRE_ROUTING hook.

net/ipv4/ip_input.c, line 560

/*
 * IP receive entry point
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
	   struct net_device *orig_dev)
{
	struct net *net = dev_net(dev);
        // Mostly does header and checksum checks
	skb = ip_rcv_core(skb, net);
	if (skb == NULL)
		return NET_RX_DROP;
        // Checks the NETFILTER hook of IPv4 prerouting
	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
		       net, NULL, skb, dev, NULL,
		       ip_rcv_finish);
}
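
The shape of the NF_HOOK call, "run the hooks, and continue with the given function only if they accept the packet", can be sketched in userspace C. This is a strongly simplified model with hypothetical names (run_hooks, drop_expired, accept_all, finish). Real netfilter has more verdicts than accept and drop, and the -1 return value is a simplification of its error handling.

```c
#include <assert.h>
#include <stddef.h>

enum verdict { V_DROP, V_ACCEPT };

struct pkt {
	int ttl;
	int finished;           /* set by the continuation */
};

typedef enum verdict (*hook_fn)(struct pkt *p);
typedef int (*okfn_t)(struct pkt *p);

/* Run the hooks in order; call okfn only if every hook accepts.
 * Returns okfn's result, or -1 if a hook dropped the packet.
 */
static int run_hooks(hook_fn const *hooks, size_t n, struct pkt *p, okfn_t okfn)
{
	assert(okfn != NULL);
	for (size_t i = 0; i < n; i++)
		if (hooks[i](p) != V_ACCEPT)
			return -1;
	return okfn(p);
}

/* Example hooks and continuation. */
static enum verdict drop_expired(struct pkt *p)
{
	return p->ttl > 0 ? V_ACCEPT : V_DROP;
}

static enum verdict accept_all(struct pkt *p)
{
	(void)p;
	return V_ACCEPT;
}

static int finish(struct pkt *p)
{
	p->finished = 1;        /* stands in for ip_rcv_finish */
	return 0;
}
```

With no hooks registered the continuation is called directly, which matches the role of ip_rcv_finish as the last argument of NF_HOOK above.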

ip_rcv_finish and the functions it calls handle the routing of packets, which will be looked into in part 2 of this series.

  1. http://ftp.gnumonks.org/pub/doc/packet-journey-2.4.html ↩︎
  2. https://lkp.pierreolivier.eu/slides/13_BottomHalves.pdf ↩︎
  3. kernel/softirq.c, line 703 (as a function) ↩︎
  4. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf , page 133 ↩︎
  5. https://events.static.linuxfound.org/sites/events/files/slides/Chaiken_ELCE2016.pdf ↩︎
  6. https://wiki.linuxfoundation.org/networking/napi ↩︎
  7. https://www.kernel.org/doc/html/next/networking/napi.html ↩︎
  8. https://people.redhat.com/pladd/MHVLUG_2017-04_Network_Receive_Stack.pdf ↩︎
  9. vger.kernel.org/~davem/skb.html ↩︎
  10. https://grsecurity.net/pipermail/grsecurity/2012-July/001093.html ↩︎
  11. https://lwn.net/Articles/688492/ ↩︎
  12. https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html ↩︎
  13. include/linux/list.h, line 23 ↩︎
  14. include/linux/list.h, line 561 ↩︎
  15. https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel ↩︎
  16. https://www.kernel.org/doc/html/latest/networking/scaling.html ↩︎
  17. https://elixir.bootlin.com/linux/v6.10.3/source/net/core/filter.c#L4246 ↩︎
  18. https://elixir.bootlin.com/linux/v6.10.3/source/drivers/net/ethernet/intel/e1000/e1000_main.c#L1013 ↩︎
  19. https://lwn.net/Articles/358910/ ↩︎
  20. ↩︎
  21. https://www.man7.org/linux/man-pages/man8/tc.8.html ↩︎
  22. https://lkml.iu.edu/hypermail/linux/kernel/2104.2/01282.html ↩︎
  23. https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html ↩︎
  24. ↩︎