Debugging a crashdump in SmartOS
I was trying to pick up earlier work that adds the SO_REUSEPORT TCP option. Almost all the code was already there; it just needed to be brought in sync with the current sources. I had been running this change on my machine for a couple of weeks when I hit a panic(), so I needed to find out whether my changes caused it. At first glance, they certainly seemed to be involved.
    > ::status
    debugging crash dump vmcore.4 (64-bit) from dev01
    operating system: 5.11 joyent_20240305T151949Z (i86pc)
    git branch: master
    git rev: d621dbfba634273486d2f66ce0b6f66a1eb4cfa6
    image uuid: (not set)
    panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe0011b9ea20 addr=8 occurred in module "ipnet" due to a NULL pointer dereference
    dump content: kernel pages only
The patch changed the ipnet module and related code to add support for this new TCP option, so I needed to dig deeper. One powerful reason to develop and deploy software on illumos is that observability tools come integrated with the system. In this case I needed mdb to inspect the state of the kernel and its data structures at the time of the panic.
The first step is to check the stack backtrace:
    > $c
    ipnet_nicevent_task+0x3c(fffffe0bf633c4b0)
    taskq_thread+0x2a6(fffffe0bd67ac158)
    thread_start+0xb()
So the last code executing at the time of the panic was ipnet_nicevent_task. This can also be checked by looking at the instruction pointer (rip):
    > <rip/p
    ipnet_nicevent_task+0x3c:       0x2444394908458b49
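As a side note, the value printed here can be cross-checked against the instruction that mdb disassembles at this address. Assuming the 8 bytes at rip were read as a single little-endian 64-bit word (x86 is little-endian), splitting the word back into memory order should give the opcode bytes of the faulting instruction. A minimal sketch in C; `word_to_bytes` is a hypothetical helper written for this illustration, not part of mdb or illumos:

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a 64-bit word, as printed by mdb, into the
 * order the bytes occupy in memory on a little-endian machine, that is,
 * least significant byte first.
 */
void
word_to_bytes(uint64_t word, uint8_t out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)((word >> (8 * i)) & 0xff);
}
```

For 0x2444394908458b49 this yields 49 8b 45 08 49 39 44 24. The first four bytes are the encoding of movq 0x8(%r13),%rax, and the remaining four are the beginning of the cmpq that follows it.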
Now, I want to see the instructions prior to this code.
    > ipnet_nicevent_task+0x3c::dis
    ipnet_nicevent_task+0x17:       movq   %rbx,-0x38(%rbp)
    ipnet_nicevent_task+0x1b:       movq   %r13,-0x28(%rbp)
    ipnet_nicevent_task+0x1f:       movq   %r14,-0x20(%rbp)
    ipnet_nicevent_task+0x23:       movl   0x10(%rdi),%edi
    ipnet_nicevent_task+0x26:       call   +0x79d1905 <netstack_find_by_stackid>
    ipnet_nicevent_task+0x2b:       testq  %rax,%rax
    ipnet_nicevent_task+0x2e:       movq   %rax,%r15
    ipnet_nicevent_task+0x31:       je     +0x65 <ipnet_nicevent_task+0x98>
    ipnet_nicevent_task+0x33:       movq   0x88(%rax),%r13
    ipnet_nicevent_task+0x3a:       xorl   %ebx,%ebx
    ipnet_nicevent_task+0x3c:       movq   0x8(%r13),%rax      --> here we panic
    ipnet_nicevent_task+0x40:       cmpq   %rax,0x8(%r12)
    ipnet_nicevent_task+0x45:       leaq   0x20(%r13),%r14
    ipnet_nicevent_task+0x49:       movq   %r14,%rdi
    ipnet_nicevent_task+0x4c:       sete   %bl
    ipnet_nicevent_task+0x4f:       call   +0x781df0c <mutex_enter>
    ipnet_nicevent_task+0x54:       movl   (%r12),%eax
    ipnet_nicevent_task+0x58:       cmpl   $0x6,%eax
    ipnet_nicevent_task+0x5b:       je     +0x8f <ipnet_nicevent_task+0xf0>
    ipnet_nicevent_task+0x61:       ja     +0x5d <ipnet_nicevent_task+0xc0>
    ipnet_nicevent_task+0x63:       cmpl   $0x1,%eax
Let’s now read the source code to see what is happening. Next to the C code I wrote the assembly instructions that I think each statement corresponds to.
    static void
    ipnet_nicevent_task(void *arg)
    {
            ipnet_nicevent_t *ipne = arg;
            netstack_t *ns;
            ipnet_stack_t *ips;
            boolean_t isv6;

            if ((ns = netstack_find_by_stackid(ipne->ipne_stackid)) == NULL)
                    goto done;
            ips = ns->netstack_ipnet;                        --> movq 0x88(%rax),%r13
            isv6 = (ipne->ipne_protocol == ips->ips_ndv6);   --> xorl %ebx,%ebx
                                                                 movq 0x8(%r13),%rax
                                                                 cmpq %rax,0x8(%r12)
                                                                 leaq 0x20(%r13),%r14
                                                                 movq %r14,%rdi
                                                                 sete %bl
            mutex_enter(&ips->ips_event_lock);               --> call +0x781df0c <mutex_enter>
            ...
As ::status reported, the panic was caused by a NULL pointer dereference. So where is the NULL pointer?
ipnet_nicevent_task+0x3c: movq 0x8(%r13),%rax
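This instruction loads from the address %r13 + 0x8, so %r13 is the base pointer of the source operand and the NULL pointer must be in %r13. A base of 0 plus the 0x8 displacement also matches the addr=8 reported in the panic message.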
    > <r13=K
                    0
    > <r13/p
    mdb: failed to read data from target: no mapping for address 0:
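The relation between the register value, the displacement and the faulting address can be sketched in C. This is only an illustration under stated assumptions: `demo_ipnet_stack_t` is a hypothetical two-member struct, not the real ipnet_stack_t layout, chosen so that its second member sits at offset 0x8 like the field read by the faulting instruction; `effective_address` is likewise a made-up helper.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a kernel structure. Only the offset of the
 * second member matters here; the real ipnet_stack_t is different.
 */
typedef struct demo_ipnet_stack {
	uint64_t demo_first;	/* offset 0x0 */
	uint64_t demo_second;	/* offset 0x8 */
} demo_ipnet_stack_t;

/* Two 8-byte members, so the second one always lands at offset 0x8. */
_Static_assert(offsetof(demo_ipnet_stack_t, demo_second) == 0x8,
    "demo_second is expected at offset 0x8");

/*
 * Address touched by an operand of the form disp(%base): the base
 * register value plus the constant displacement.
 */
uintptr_t
effective_address(uintptr_t base, size_t disp)
{
	return (base + disp);
}
```

With a base of 0, as found in %r13, and the displacement `offsetof(demo_ipnet_stack_t, demo_second)`, `effective_address` returns 0x8, the same value as the addr=8 in the panic message. A fault at a small non-zero address is therefore a typical sign of a member access through a NULL structure pointer.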
Now, looking at the previous code listing, what writes to %r13?
    ips = ns->netstack_ipnet;                        --> movq 0x88(%rax),%r13
    isv6 = (ipne->ipne_protocol == ips->ips_ndv6);   --> xorl %ebx,%ebx
                                                         movq 0x8(%r13),%rax
                                                         cmpq %rax,0x8(%r12)
                                                         leaq 0x20(%r13),%r14
                                                         movq %r14,%rdi
                                                         sete %bl
    mutex_enter(&ips->ips_event_lock);               --> call +0x781df0c <mutex_enter>
This means that ns->netstack_ipnet is NULL. Now the real work is to follow the code and find out what caused this state.
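To make the finding concrete, here is a small user-space model of the two-link pointer chain. It is a sketch under assumptions, not kernel code and not a proposed fix: the `demo_*` types and `check_chain` are hypothetical names, and the structs keep only the one member needed for the illustration. The function in the dump checks the first link (the result of netstack_find_by_stackid) but dereferences the second without a check, which is the case reported below as CHAIN_NO_IPNET.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
typedef struct demo_ipnet_stack {
	int demo_placeholder;
} demo_ipnet_stack_t;

typedef struct demo_netstack {
	demo_ipnet_stack_t *netstack_ipnet;
} demo_netstack_t;

typedef enum {
	CHAIN_OK,		/* both links are valid */
	CHAIN_NO_NETSTACK,	/* lookup returned NULL; the source handles this */
	CHAIN_NO_IPNET		/* netstack exists but has no ipnet state */
} chain_state_t;

/* Report which link of ns -> netstack_ipnet is missing, if any. */
chain_state_t
check_chain(const demo_netstack_t *ns)
{
	if (ns == NULL)
		return (CHAIN_NO_NETSTACK);
	if (ns->netstack_ipnet == NULL)
		return (CHAIN_NO_IPNET);
	return (CHAIN_OK);
}
```

A check like the second one would only hide the symptom. The open question remains why a netstack was reachable while its netstack_ipnet pointer was NULL.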
The WIP patch for SO_REUSEPORT is here
Figure 1: A Ship on the High Seas Caught by a Squall, Known as ‘The Gust’ (c. 1680)