Threading implementation using setjmp.h

Hi guys,

A little bit of preamble will help me segue into the topic at hand. I've been using chatGPT quite often over the past 4-6 weeks. Although not always 100% accurate, it's still a great tool and has definitely augmented my programming experience. I have found it explains conceptual topics nicely. With that being said, I emphasise "not always 100% accurate" as I have a hunch that it's not telling me the correct execution flow, or alternatively, I'm mistaken.

This is quite a complex subject, and I'm probably out of my depth here. But I still want to jump into the deep end. So I asked chatGPT to create code to simulate threading, and to show what a (very) simple threading library would potentially look like (just a rough-and-ready example, not production code).

What better way to learn about user-level threading than delving deep into the implementation details (as crude as they may be). Below is the code; if anyone could please fill in the gaps in my understanding, that would be more than appreciated. I believe that I'm failing to understand the execution flow of the program and what longjmp and setjmp are really doing.

So I'll begin: setjmp() is used to save the current execution context (CPU registers, stack pointer, etc). We pass a jmp_buf as the argument to this; jmp_buf is a buffer that saves the context. When we call longjmp(), we pass in that jmp_buf and we basically load the context from it and continue execution from that point. When we first call setjmp() it returns 0, indicating that we have not jumped yet. When we call longjmp() we jump back to where setjmp() was called; this time setjmp() returns the nonzero value passed to longjmp() (1 in the code below), indicating that we have jumped back to that execution context.
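
For example, this tiny standalone program (my own sketch, separate from the chatGPT code below) shows both return paths of setjmp():

#include <stdio.h>
#include <setjmp.h>

static jmp_buf env;

int main(void) {
    int rc = setjmp(env);       /* first pass: context saved, returns 0 */
    printf("setjmp returned %d\n", rc);
    if (rc == 0)
        longjmp(env, 42);       /* restore context; setjmp now returns 42 */
    return 0;                   /* second pass: rc == 42 */
}

This prints "setjmp returned 0" and then "setjmp returned 42".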

So now for the code (note this code was written by chatGPT and not me):

1) we declare two threads
2) then call create_thread(&thread1, example_thread_function) for the first thread
2 i) we point our start routine to the routine that we specify as an argument
2 ii) we allocate memory on the heap which will be used for our stack
2 iii) call setjmp() and pass it thread1's context buffer (which stores the context)
2 iv) setjmp() returns 0 and the if block is executed
2 v) set the stack pointer to point to our dynamically allocated stack
2 vi) callq calls the function passed as an argument, pushing the return address onto thread1's stack
2 vii) we are now in the function example_thread_function
2 viii) we loop 5 times
2 ix) i = 0, i < 4: so we call yield_thread(&thread1, &thread2)
2 x) **this is where my understanding breaks down (below)...
2 xi) we are now in yield_thread(): we call setjmp(current->context)...
2 xii) which is thread1->context


So I'm assuming my understanding of the execution flow is largely correct up until "2 x". So what happens here? Since current equals thread1, we save the current context in thread1's context buffer. But now we have just overwritten the buffer which contained the previous context, saved just before we called example_thread_function() in create_thread(). Why have we done this, and what's the point of saving our context in create_thread() if we just overwrite it in yield_thread()?

Next, since setjmp() returns 0, we execute longjmp(next->context, 1), but where are we jumping to? next = thread2, yet we haven't reached the line of code that actually creates the second thread, so I thought thread2's context buffer hasn't been set yet. So again, where are we jumping to?

Thanks



#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>

#define STACK_SIZE 4096 // define stack size on the heap

typedef struct {
    jmp_buf context;
    void (*start_routine)(void);
    void *stack;
} thread_t;

void create_thread(thread_t *thread, void (*start_routine)(void)) {
    thread->start_routine = start_routine;
    thread->stack = malloc(STACK_SIZE);

    if (thread->stack == NULL) {
        perror("Thread creation failed");
        exit(EXIT_FAILURE);
    }

    if (setjmp(thread->context) == 0) {
        // init the stack pointer and jump to the start routine
        __asm__ volatile (
            "movq %[stack], %%rsp;"
            "callq *%[start_routine];"
            :
            : [stack] "r" (thread->stack + STACK_SIZE),
              [start_routine] "r" (start_routine)
            : "memory"
        );
    }
}

void yield_thread(thread_t *current, thread_t *next) {
    // Save the current context and jump to the next context
    if (setjmp(current->context) == 0) {
        longjmp(next->context, 1);
    }
}

void join_thread(thread_t *thread) {
    free(thread->stack);
}

thread_t thread1, thread2; // declared before example_thread_function, which uses them

void example_thread_function() {
    for (int i = 0; i < 5; ++i) {
        printf("Thread: %d\n", i);
        // simulate thread yielding to another thread
        if (i < 4) {
            yield_thread(&thread1, &thread2);
        }
    }
}

int main() {
    // creation of two new threads
    create_thread(&thread1, example_thread_function);
    create_thread(&thread2, example_thread_function);

    // schedule the threads
    for (int i = 0; i < 5; ++i) {
        yield_thread(&thread1, &thread2);
        yield_thread(&thread2, &thread1);
    }

    // join threads
    join_thread(&thread1);
    join_thread(&thread2);

    return 0;
}
Do you understand setjmp() and longjmp()? See:

https://en.cppreference.com/w/c/program/setjmp
https://en.cppreference.com/w/c/program/longjmp

Also note that this design gets very messy when the number of threads is increased. Often you'd use a circular list of threads so that when the currently executing thread yields, the next one in the list executes if it's ready to execute (and if not, the scheduler advances through the list until one that can execute is found); see the sketch below. This design is also non-pre-emptive scheduling, in that a thread will execute until it yields.

https://www.geeksforgeeks.org/preemptive-and-non-preemptive-scheduling/
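
To illustrate the circular-list idea, here's a minimal sketch of just the data structure and selection logic (tcb_t, the ready flag, and pick_next are names I've made up; the actual switch is the same setjmp/longjmp pair as in yield_thread above):

#include <setjmp.h>
#include <stddef.h>

typedef struct tcb {
    jmp_buf context;        /* saved register context            */
    int ready;              /* nonzero if this thread can run    */
    struct tcb *next;       /* circular link to the next thread  */
} tcb_t;

/* Walk the ring starting after 'current' until a ready thread is
   found; NULL means no other thread is ready. */
static tcb_t *pick_next(tcb_t *current) {
    for (tcb_t *t = current->next; t != current; t = t->next)
        if (t->ready)
            return t;
    return NULL;
}

static void yield(tcb_t *current) {
    tcb_t *next = pick_next(current);
    if (next == NULL)
        return;             /* nothing else runnable; carry on   */
    if (setjmp(current->context) == 0)
        longjmp(next->context, 1);
    /* execution resumes here when another thread yields back to us */
}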

I just read the following links. Yeah, that's pretty much my perception of what setjmp() and longjmp() do respectively. setjmp() saves the current context, i.e. register values, stack pointer and any other values associated with that execution context. setjmp(env) will return 0 on its first invocation and will save the context data to env. longjmp(env, status) will jump back to the point where env was saved, i.e. restore that execution context, and setjmp() will this time return the status number. So essentially the code will pick up from where setjmp(env) was called, but it will not be the first invocation of setjmp(), and its return value will equal the status passed by longjmp().

So my question is (and with reference to the code chatGPT supplied): in yield_thread() we call setjmp(current->context), which overwrites the context buffer's data so that we can restore the context later. But if we have overwritten the context, what was the point of the first setjmp(thread->context) when it will just be overwritten, considering we don't actually make any longjmp() calls to it?

Also, we call longjmp(next->context, 1) before we even create the second (next) thread, thus next->context has not yet been set with a setjmp(). I pushed chatGPT on this and it stated that the code may exhibit unexpected or erroneous behaviour. I'm not sure if that was just me nagging chatGPT, or if there is any truth in that reply. To me it just doesn't make sense, as we are using longjmp() in yield_thread() to jump to what seems like a context that has not been set up/saved with setjmp().
After yet again pushing chatGPT, I have come to a preliminary conclusion, aided by chatGPT, that although we do save the context in the setjmp() in create_thread(), we don't actually ever jump to it; it's merely there as a placeholder / for demo purposes.

I also believe that both my assertions, including the one above, still hold true: we never initialise the second thread before jumping to its context buffer, which again was never set, thus leading to potential undefined behaviour.

Are those assertions correct?

I believe the code chatGPT produced for a simple threading library is just incomplete and the majority of it is just for demo purposes, i.e. not the greatest example. But I could be wrong?
Here's my understanding of threads: a thread is just a "thread" of execution. Threading libraries create threads and manually perform context switching between threads at the user level, with the help of assembly code and other libraries in C or C++. Each thread has its own stack dynamically allocated on the heap. The thread library will have a scheduler that schedules threads. For example, let's say we have an I/O-bound thread and a CPU-bound thread. If the I/O-bound thread is awaiting an I/O completion, that thread will be blocked and the thread scheduler will context switch to the CPU-bound thread. But here is where my understanding breaks down...


So operating systems have no conception of user-level threads; they're managed at the user level, as previously mentioned. When a process makes a system call such as an I/O system call, the process will be placed in a blocked state waiting for the I/O completion. So with that being said...

Suppose a process uses threads via a user-level threading library and has two threads, one I/O-bound and the other CPU-bound. Given what I said about the OS generally having no idea about user-level threads: if the OS sees the process make an I/O system call, how does the threading library ensure that the process isn't put into a blocked state by the OS for the duration of the I/O call, so that it can switch to the CPU-bound thread?

Thanks
There are several considerations if you're going to create your own threading package.
Each thread requires its own stack area to keep stack frames. Classically, there is one stack per thread. Some hardware architectures (Intel Itanium) implement two stacks per thread.
Each thread has a stack at a distinct, static address within the virtual memory space of the process. An alternative approach called "stack swapping" employs only a single stack timeshared by all active threads; the frame images of waiting stacks are swapped to separate holding spaces. Static stacks have the advantage of much faster context switches, but they are impractical when the architecture restricts total stack space. An advantage of stack swapping is that all threads share the special attributes of one stack area, including its stack overflow detection mechanism.

Stack swapping may also require less total memory, because the swap area needs to hold only the frames of a waiting thread; an active thread might call additional procedures that (without preemption) will exit before the thread waits. When an individual thread stack becomes too big, it can be allocated a larger swap area, but moving a static stack to a larger area is impractical. A key disadvantage of stack swapping is that local variables on the thread stack cannot be passed by reference to other threads or used as I/O or message buffers in operations that may proceed while the thread waits.

The active part of a stack component extends from the "stack origin" to the "stack tip." The origin is anchored; the tip moves as procedures are entered and exited, causing the stack to grow and shrink. The "stack root" is the set of activation records (stack frames) for procedures that were invoked before multithreading was initialized and do not exit until after the last thread has terminated. These activation records, along with the process global and heap, remain visible to all threads in both "static" and "swapped" models.
Another distinction in a multithreading model is the relationship of the initial context of the process to all other threads: it may be "peer" or "master." A peer protocol is symmetric; context can switch from any thread directly to any other; the "Main" (initial) thread differs from others only in that the program does not initiate or terminate it. In a master protocol, the original process context is considered "unthreaded" and used when no thread is active; context switches occur only from the master to some thread and back.
The master protocol incurs more context switches than with peers, but in a stack-swapping model there are no "Main-thread" activation records to swap. Individual thread stacks can be somewhat smaller with this protocol because the "dispatch" function (choosing the next thread or waiting for something to do) occurs only in the unthreaded master state.

Thread Model Variations
The simplest thread model utilizes static stacks: the thread context to be saved includes only the values of certain processor registers. For swapped threads, the context also includes the procedure activation records (stack frames) on the stack.
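
In C terms, the saved context in each model might look roughly like this (a sketch; the type and field names are mine):

#include <setjmp.h>
#include <stddef.h>

/* Static-stack model: the register context alone is enough. */
typedef struct {
    jmp_buf regs;
} static_ctx_t;

/* Swapped-stack model: registers plus the thread's frame image,
   copied out of the shared stack while the thread waits. */
typedef struct {
    jmp_buf regs;
    void   *swap_area;      /* holding space for the saved frames */
    size_t  frame_bytes;    /* how much of the stack was copied   */
} swapped_ctx_t;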

For a swapping model, part of the process stack is time-shared by all the threads. (These diagrams show a single stack component growing from left to right.) With the peer protocol, the swapped portion of each thread originates at the same point. The location of the swap origin is not critical; it is typically placed just beyond the stack root, minimizing the amount of stack-frame data to be swapped for the main thread.

|--- stack root ---|--- main thread ---|
|--- thread 1 ---|
|--- thread 2 ---------|
|...

In the master protocol, the swap origin must occur beyond the stack frames of those unthreaded procedures that remain on the stack while a thread is running. These frames are marked * in the diagram and may or may not remain while the unthreaded master code is running.
 
|--- stack root ---|--- unthreaded master ---|
|- * -|--- thread 1 ---|
|- * -|--- thread 2 ---------|
|- * -|...


For example, let's say we have an I/O bound thread and a CPU bound thread.

Simple threading packages don't have a concept of I/O bound or CPU bound. If a thread initiates an I/O, it's up to the programmer to yield control back to the master thread. When the master thread has no more threads that can be initiated, it calls an operating system routine to wait for a pending I/O operation to complete. When the OS returns control to the master thread, it's up to the master thread to activate the thread that initiated the I/O.
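
In outline, the master's dispatch loop is something like this (pure sketch: thread_t here and every helper function named below are hypothetical placeholders, not a real API):

/* Master-protocol dispatch loop, sketched in C form. */
void master_dispatch(void) {
    for (;;) {
        thread_t *t = next_ready_thread();       /* scan the thread list */
        if (t != NULL) {
            switch_to(t);                        /* runs until t yields back */
        } else if (io_pending()) {
            thread_t *done = wait_for_any_io();  /* the process (not a thread) waits in the OS */
            mark_ready(done);                    /* reactivate the thread that started the I/O */
        } else {
            break;                               /* nothing to run, nothing pending */
        }
    }
}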

So operating systems have no conception of user level threads

Not necessarily true. OS's generally have some built-in support to make writing threading packages possible.

When a process makes a system call such as an I/O system call, the process will be placed in a blocked state waiting for the I/O completion.

If you're writing a program using a simple threading package, you want to use non-blocking I/O so that other threads can execute, i.e. you don't want the entire process to block.
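
On POSIX systems that's typically arranged with O_NONBLOCK, e.g. (a minimal sketch):

#include <fcntl.h>

/* Put an already-open descriptor into non-blocking mode. After this,
   read()/write() on fd fail with errno == EAGAIN/EWOULDBLOCK instead
   of blocking, so the threading package can yield to another thread
   and retry later. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}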




If I'm understanding correctly, if the OS supports user-level non-blocking I/O threads: again, let's say we had two threads, one for computation and one for I/O. If the I/O thread makes an I/O request via a system call, the process may make a context switch to the computation thread, is that correct? I can definitely see the efficiency gains in this.

I'm going to use some code as an example; let's assume we have a single-core processor with only one logical core. How would calling these functions using threads be faster or more efficient than without the threads? I assume that after a time slice the threads will switch/yield to one another; this would obviously involve a context switch (at the user level). With this added context switch, wouldn't running the code as threads actually be slower than calling the computation function twice in a row?

I can see how running such code on a multi-core system with a library that permits parallelism would result in a clear efficiency gain, but I don't understand how said efficiency gain would hold for a single-core processor, considering the added context switch.

Sorry guys for my late response; really good posts, and the help is much appreciated. I took the weekend off due to burnout.

Thanks


/*******************************************************************************
*
* Program: Threads Demonstration
* 
* Description: Example of using threads in C with the pthread.h library (POSIX
* thread library).
*
* YouTube Lesson: https://www.youtube.com/watch?v=ldJ8WGZVXZk 
*
* Author: Kevin Browne @ https://portfoliocourses.com
*
*******************************************************************************/
#include <stdio.h>
#include <pthread.h>

void *computation(void *add);

int main()
{
  // pthread_t will be used to uniquely identify, create, and manage threads
  pthread_t thread1;
  pthread_t thread2;
  
  long value1 = 1;
  long value2 = 2;
  
  // Create two threads that will each run the computation function, one is 
  // passed a void-casted pointer to value1 as an argument, the other is 
  // passed a void-casted pointer to value2 as an argument.
  pthread_create(&thread1, NULL, computation, (void*) &value1);
  pthread_create(&thread2, NULL, computation, (void*) &value2);
  
  // execution will pause at pthread_join until the thread provided as an 
  // argument has completed its execution
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);

  return 0;
}

// Accepts a void pointer and returns a void pointer as pthread_create expects
// both of these properties.  We pass in a value via the void pointer.  The 
// function does some meaningless work to simulate meaningful computational 
// working occurring.
void *computation(void *add)
{
  long sum = 0;

  // cast the void pointer add to a long pointer
  long *add_num = (long *) (add);
  
  // de-reference add_num to get at the value pointed to by add_num, have 
  // the loop run many, many times doing some computational work
  for (long i = 0; i < 1000000000; i++)
    sum += *add_num;

  return NULL;
}


*note: the code is not mine; here is the link to the author's GitHub -> https://github.com/portfoliocourses/c-example-code/blob/main/thread_intro.c

I'm guessing, and from what I researched, that the code I provided above wouldn't have any benefit running on a single-core processor, since both threads are computational, and with the overhead of context switching and other handling routines from a threading library it would actually slow down performance slightly if anything. Is that correct?

An example of where threading on a single-core processor makes sense would be if we have one thread for I/O and one that is computation-bound, i.e. when the I/O-bound thread requests I/O, the computational thread can run while the I/O request is being serviced.

Would this be correct? Thanks



It usually makes no sense to have more CPU-bound threads than there are processor cores.
If I'm understanding correctly, if the OS supports user-level non blocking I/O threads.

That's a mis-characterization. All modern OS's provide blocking and non-blocking I/O. This is a feature that is not specifically related to threads. If you're writing a simple threads package, it's important that your threads do only non-blocking I/O.
If the I/O thread makes an I/O request via a system call, the process may make a context switch to the computation thread, is that correct?

No. That is not correct. At the OS level, the OS is generally unaware of threads. Threading is a concept that allows a normally single threaded process (from the OS point of view) to multi-thread.
I assume that after a time slice the threads will switch/yield to one another

That's up to the threading package. A simple threading package, as we've been discussing, does not do that. More sophisticated threading packages can have a scheduler thread running that can preempt another thread upon receipt of a timer signal (non-blocking) from the OS.
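
On POSIX, the timer part of that might be set up roughly like this (a sketch: the handler only sets a flag that the package checks at safe points, since jumping straight out of a signal handler needs sigsetjmp/siglongjmp and considerably more care):

#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t preempt_requested = 0;

static void on_tick(int sig) {
    (void)sig;
    preempt_requested = 1;    /* scheduler checks this at safe points */
}

static void start_preemption_timer(void) {
    struct sigaction sa;
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval tv;
    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = 10000;   /* 10 ms tick */
    tv.it_value = tv.it_interval;
    setitimer(ITIMER_REAL, &tv, NULL);
}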

It's not unusual in a threading package to want to do "timed I/O", particularly when dealing with communication threads (sockets, etc). In this case the threading package has to know the user wants to do timed-I/O, put the I/O on a timer list in addition to being on the pending I/O list and then deal with whether the timer completes (tell the OS to cancel the I/O and notify the thread that the I/O completed with a timeout error), or handle the I/O completion from the OS and remove the I/O from the timer list.
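
The simplest concrete form of timed I/O on POSIX is poll() with a timeout (a sketch):

#include <poll.h>

/* Wait up to timeout_ms for fd to become readable. Returns >0 if
   ready, 0 on timeout, -1 on error: the three outcomes the
   timer-list logic above has to juggle. */
int timed_wait_readable(int fd, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    return poll(&p, 1, timeout_ms);
}
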
with this added context switch, wouldn't running the code as threads actually be slower than calling the computation function twice in a row?

Yes.
I can see how running such code on a multi-core system with a library that permits parallelism

Don't forget that threads have to keep out of each other's way, i.e. you can't have two threads modifying the same variable at the same time.
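
With pthreads, that's what a mutex is for, e.g.:

#include <pthread.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Safe for any number of threads to call concurrently: the lock
   ensures only one thread modifies counter at a time. */
void increment_counter(void) {
    pthread_mutex_lock(&counter_lock);
    counter++;
    pthread_mutex_unlock(&counter_lock);
}
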
when the I/O-bound thread requests I/O, the computational thread can run while the I/O request is being serviced.

Generally yes.
*If the I/O thread makes an I/O request via a system call, the process may make a context switch to the computation thread, is that correct?


No. That is not correct. At the OS level, the OS is generally unaware of threads. Threading is a concept that allows a normally single threaded process (from the OS point of view) to multi-thread.


Sorry, I think the way I posed that question was incorrect, so I'll rewrite it*: if the I/O thread makes an I/O request via a system call such as write or read, the process may make a (user-level) context switch to another thread, i.e. the code written by the programmer leverages a threading library to then switch to the computation-bound thread. Would this be more accurate?

That's a mis-characterization. All modern OS's provide blocking and non-blocking I/O. This is a feature that is not specifically related to threads. If you're writing a simple threads package, it's important that your threads do only non-blocking I/O.


But I thought modern OS's have some concept of user-level threads, or do they just know the concept of kernel-level threads?

If so (and sorry if I'm repeating or missing any of the points): if the OS has no concept of user-level threads, how do threads make a user-level (done by code) context switch to another thread if a write system call blocks the thread? I would guess that it has something to do with non-blocking I/O, as you mentioned.


If the I/O thread makes an I/O request via a system call such as write or read, the process may make a (user-level) context switch to another thread, i.e. the code written by the programmer leverages a threading library to then switch to the computation-bound thread. Would this be more accurate?

Yes, that would be more accurate. Keep in mind that we've been talking about what I have been referring to as "simple" threading packages. pthreads being the benchmark.
But I thought modern OS's have some concept of user-level threads, or do they just know the concept of kernel-level threads?

This can get to be a grey area and depends on the operating system. For example, pthreads is an API that is released with all POSIX compliant OS's. Does that make it part of the operating system? From one point of view, it's just a library, like C++'s STL.
There are complex threading packages that have extensive OS support. IBM's ACP/TPF comes to mind. In this same vein are VM hypervisors that run multiple guest OS's. These hypervisors have to isolate each guest OS and emulate all the hardware level services each guest OS expects.
As I explained in an earlier response, thread stacks in a "simple" thread package don't generally have their own hardware protection, i.e. one thread can overwrite another thread's storage without it being detected. This can be obviated by using swapped stacks at the expense of higher context switch costs.
if the OS has no concept of user-level threads, how do threads make a user-level (done by code) context switch to another thread if a write system call blocks the thread? I would guess that it has something to do with non-blocking I/O, as you mentioned.

If you’re writing your own threads package, there are a couple of ways to do this:
1) You can initiate an I/O using the operating system’s non-blocking I/O. Then call an entry point in your threads package that waits for the I/O to complete.
    OS_NONBLOCK_WRITE (filenum, bufaddr, count);
    MYTHREADPKG_IO_WAIT (filenum);

2) You can provide an I/O call in your threads package that is a wrapper for the OS’s non-blocking I/O and automatically puts the thread in an I/O wait state until the I/O completes.
 
    MYTHREADPKG_WAITED_WRITE(filenum, bufaddr, count);


[edit]
The second example is simply a combination of the two calls in the first example.
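
i.e. roughly (a sketch using the same hypothetical names as above, not a real API):

/* OS_NONBLOCK_WRITE and MYTHREADPKG_IO_WAIT are the hypothetical
   calls from the first example. */
int MYTHREADPKG_WAITED_WRITE(int filenum, const void *bufaddr, int count) {
    int rc = OS_NONBLOCK_WRITE(filenum, bufaddr, count);  /* start the I/O */
    if (rc != 0)
        return rc;                        /* couldn't even start it            */
    return MYTHREADPKG_IO_WAIT(filenum);  /* thread sleeps until it completes  */
}
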
The examples I’ve given are simplistic. i.e. they don’t take into consideration timed-I/O which I mentioned in an earlier response. The examples also don’t take into consideration multiple concurrent I/Os on the same thread. This can occur in socket programming where you might want a read and write outstanding on the same socket at the same time.