Authored by Robin Randhawa

Trying to understand kernel stack management on x86_64

This is to help me resolve a nagging issue with the kernel side implementation of the clone system call for AArch64.


What I want to know:

  • Is there a distinction between the 'primary' kernel stack and kernel stacks for user-space Context instances ?
  • Assuming there is, how does the kernel arrange for the primary kernel stack to be used during initialization ?
  • How does the kernel arrange for kernel stacks to be used on syscall entry ?
  • How does the kernel arrange for user-space Context specific kernel stacks to be switched in on context switch time ?

Here's a 'pseudo' call graph of the early init flow focusing only on items of interest:

  • kstart

    • gdt::init_paging(tcb_offset, stack_base + stack_size)

      The stack_base and stack_size are given to kstart by the bootloader and I believe this qualifies as the 'primary' stack in that it is not associated with any Context instance.

      • set_tss_stack(stack_offset)

        Writes stack_offset to TSS.rsp[0]. TSS is a mutable static instance of type TaskStateSegment.

      • task::load_tr(SegmentSelector::new(GDT_TSS as u16, PrivilegeLevel::Ring0))

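        This loads the task register (tr) with a selector that refers to the TSS entry (index 7) in the GDT.

From this I gather that the processor now 'recognises' the kernel stack: from this point on, an interrupt or exception taken from user-space causes a switch to kernel mode and onto the stack held in TSS.rsp[0]. One caveat: the x86_64 syscall instruction does not load rsp from the TSS by itself, so for system calls made that way the kernel's entry code has to switch to this stack in software.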
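
To make the TSS step concrete, here is a minimal user-space model of it. It is a sketch, not the kernel's code: the struct mirrors the architectural 64-bit TSS layout (104 bytes, with rsp[0] at byte offset 4), but the field names, the explicit `tss` parameter (the kernel uses a mutable static instead) and the helpers `init_primary_stack` and `kernel_stack_top` are assumptions made for illustration.

```rust
/// Model of the 64-bit Task State Segment. Field names are illustrative.
#[repr(C, packed)]
struct TaskStateSegment {
    reserved: u32,
    /// rsp[0] is the stack the CPU loads when an interrupt or exception
    /// raises the privilege level to ring 0.
    rsp: [u64; 3],
    reserved2: u64,
    ist: [u64; 7],
    reserved3: u64,
    reserved4: u16,
    iomap_base: u16,
}

// The architectural 64-bit TSS is 104 bytes long; checked at compile time.
const _: () = assert!(std::mem::size_of::<TaskStateSegment>() == 104);

impl TaskStateSegment {
    const fn new() -> Self {
        TaskStateSegment {
            reserved: 0,
            rsp: [0; 3],
            reserved2: 0,
            ist: [0; 7],
            reserved3: 0,
            reserved4: 0,
            iomap_base: 0,
        }
    }
}

/// Counterpart of set_tss_stack: record the top of the kernel stack.
/// Takes the TSS explicitly (an assumption) to avoid a mutable static here.
fn set_tss_stack(tss: &mut TaskStateSegment, stack: usize) {
    tss.rsp[0] = stack as u64;
}

/// Hypothetical helper: read back the stack top stored in TSS.rsp[0].
fn kernel_stack_top(tss: &TaskStateSegment) -> u64 {
    tss.rsp[0]
}

/// Hypothetical helper modelling the relevant part of gdt::init_paging:
/// stacks grow downwards, so the value handed over is base + size.
fn init_primary_stack(tss: &mut TaskStateSegment, stack_base: usize, stack_size: usize) {
    set_tss_stack(tss, stack_base + stack_size);
}
```

Calling `init_primary_stack(&mut tss, base, size)` on a fresh `TaskStateSegment::new()` leaves `base + size` in rsp[0], which is the only state the rest of these notes depend on.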


Here's a call-graph of the user-space init flow:

  • kstart
    • kmain

      • context::init()

        This initializes the global static CONTEXTS list (of type ContextList), creates a new Context (which is added to the CONTEXTS list) and sets up some fields in it. I assume this Context represents the kernel's own 'primary' context, which is not attached to any user-space context. However, I don't see context.arch.set_stack() called for it, nor do I see context.kstack assigned any value.

      • context::contexts_mut().spawn(userspace_init)

        A new Context is created here. A 64 KiB vector, turned into a boxed slice, is allocated as its stack. The last element of this stack is filled with the address of the function passed in (userspace_init). This implies that if a return instruction executes with the stack pointer at that element, execution will continue at userspace_init.

        • context.arch.set_stack(stack.as_ptr() as usize + offset)

          The Context struct has an arch-specific Context struct embedded within it. The arch-specific Context struct's rsp field is set to the address passed in, which (given how the stack was filled) should be that of the last element, just below the top of the stack.

        context.kstack is then set to hold this stack; context::switch later computes the top of the stack from it. spawn() returns Ok(context_lock). kmain matches on this and, on success, inserts the environment information passed to kmain into the context-specific holder for environment data.

      • loop {}

        This is the main scheduling loop. It manages the global interrupt enable/disable state and calls context::switch().

        • context::switch()

          Gets a pointer to the current Context (from_ptr) and one to the next runnable Context (to_ptr).

          • gdt::set_tss_stack(stack.as_ptr() as usize + stack.len())

            The stack in question here is the next runnable Context's kstack. As seen before, set_tss_stack updates the TSS with the new stack's top.

          • arch.switch_to()

            Saves the old thread's relevant registers and restores state for the new thread. This ends up saving the current stack pointer to Context.arch.rsp for the outbound context, which in this instance is the 'primary' Context. When this function returns, the stack pointer points to the last item of the new thread's kstack, so the ret transfers control to userspace_init.
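
The stack preparation described for spawn() above can be sketched as follows. This is a safe user-space model written from the description in these notes, not a copy of the kernel source: the type names `Context` and `ArchContext`, the constant `KSTACK_SIZE` and the helper `return_slot` are assumptions, and the function address is only stored, never jumped to.

```rust
use std::mem::size_of;

/// Assumed stack size: 64 KiB, as in the notes.
const KSTACK_SIZE: usize = 64 * 1024;

/// Hypothetical stand-in for the arch-specific Context.
#[derive(Default)]
struct ArchContext {
    rsp: usize,
}

impl ArchContext {
    fn set_stack(&mut self, address: usize) {
        self.rsp = address;
    }
}

/// Hypothetical stand-in for the generic Context.
#[derive(Default)]
struct Context {
    arch: ArchContext,
    kstack: Option<Box<[u8]>>,
}

/// Placeholder for the real entry point; does nothing in this model.
fn userspace_init() {}

fn spawn(func: fn()) -> Context {
    let mut context = Context::default();
    let mut stack = vec![0u8; KSTACK_SIZE].into_boxed_slice();

    // Store the entry address in the last pointer-sized slot, so that a
    // `ret` executed with rsp at that slot would continue at `func`.
    let offset = stack.len() - size_of::<usize>();
    let entry = func as *const () as usize;
    stack[offset..].copy_from_slice(&entry.to_ne_bytes());

    // The saved stack pointer refers to that slot; the Context then owns
    // the stack. Moving the Box does not move the heap allocation.
    context.arch.set_stack(stack.as_ptr() as usize + offset);
    context.kstack = Some(stack);
    context
}

/// Hypothetical helper: read the value the saved rsp points at.
fn return_slot(context: &Context) -> usize {
    let stack = context.kstack.as_ref().expect("context has no kstack");
    let offset = context.arch.rsp - stack.as_ptr() as usize;
    let mut bytes = [0u8; size_of::<usize>()];
    bytes.copy_from_slice(&stack[offset..offset + size_of::<usize>()]);
    usize::from_ne_bytes(bytes)
}
```

After `spawn(userspace_init)`, the saved rsp sits one pointer-sized slot below the top of the stack and `return_slot` yields the address of the function that was passed in.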

From this I gather that when this new thread (which execs /sbin/init) enters the kernel from user-space, for instance to make a syscall, the kernel will execute on the kstack that context::switch installed in the TSS. From that point on, every call to context::switch points the TSS at the incoming context's kstack and switches the processor's stack pointer to the incoming context's saved value. If no thread is runnable, control may eventually pass back to the primary context, which runs on the primary stack since no kstack was assigned to it.
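
The two effects of context::switch described above (retargeting TSS.rsp[0] and swapping the stack pointer) can be modelled in plain Rust. This is a sketch under stated assumptions: `Cpu` is a made-up struct standing in for the real registers and TSS, `Context` is reduced to the two fields that matter here, and updating the TSS only when the incoming context has a kstack is my assumption about how a kstack-less primary context would be handled.

```rust
/// Hypothetical model of the per-CPU state relevant to these notes.
struct Cpu {
    /// Current stack pointer.
    rsp: usize,
    /// Value of TSS.rsp[0]: the stack used on entry from user-space.
    tss_rsp0: usize,
}

/// Reduced Context: only the saved stack pointer and the kernel stack.
struct Context {
    /// Saved stack pointer (Context.arch.rsp in the notes).
    arch_rsp: usize,
    /// None for the primary context, which has no kstack of its own.
    kstack: Option<Box<[u8]>>,
}

fn switch(cpu: &mut Cpu, from: &mut Context, to: &Context) {
    // Step 1: point the TSS at the top of the incoming context's kstack,
    // so that later entries from user-space land on that stack.
    if let Some(stack) = to.kstack.as_ref() {
        cpu.tss_rsp0 = stack.as_ptr() as usize + stack.len();
    }

    // Step 2 (arch.switch_to in the notes): save the outgoing stack
    // pointer and load the incoming one. Other registers are omitted.
    from.arch_rsp = cpu.rsp;
    cpu.rsp = to.arch_rsp;
}
```

Switching from a primary context with `kstack: None` to a spawned one leaves the primary context's old stack pointer saved in its `arch_rsp`, and both `cpu.rsp` and `cpu.tss_rsp0` inside the new context's stack. Switching back restores the primary stack pointer but, in this model, leaves `tss_rsp0` untouched.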


I think I now have answers to all my questions:


  • Is there a distinction between the 'primary' kernel stack and kernel stacks for user-space Context instances ?

    Yes


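  • Assuming there is, how does the kernel arrange for the primary kernel stack to be used during initialization ?

    Done in gdt::init_paging, which passes the top of the bootloader-provided stack (stack_base + stack_size) to set_tss_stack.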


  • How does the kernel arrange for kernel stacks to be used on syscall entry ?

    The kstack pointed to by the TSS is what is used on syscall entry. Given that the TSS entry is updated by context::switch, the implication is that when a user-space context invokes a syscall, the kernel runs on that same Context instance's kstack.


  • How does the kernel arrange for user-space Context specific kernel stacks to be switched in on context switch time ?

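    context::switch calls gdt::set_tss_stack with the top of the incoming Context's kstack, and arch.switch_to() then switches the stack pointer itself.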

