GDB

Debug A High Availability Program

1. Introduction

High availability systems are designed for continuous service, typically relying on multi-threaded programs where some threads manage real-time tasks while others handle external events. Debugging these systems is more challenging because interrupting or pausing the program to investigate issues can disrupt service and affect system reliability.

2. Lab

2.1. High Availability Program

In this lab, we will the debug the below OpenSAF AMF program.

  
int main(int argc, char ** argv)
{
    //...    
    saAmfInitialize_o4(&amfHandler, &callbacks, &apiVersion);
    
    saAmfComponentRegister(amfHandler, &name, 0);

    pthread_create(&thread, NULL, threadFunction, NULL);

    saAmfSelectionObjectGet(amfHandler, &monitorFd);    
    struct pollfd fds[1];
    fds[0].fd = monitorFd;
    fds[0].events = POLLIN;
    
    while (poll(fds, 1, -1) != -1)
    {
        if (fds[0].revents & POLLIN)
        {            
            saAmfDispatch(amfHandler, SA_DISPATCH_ONE);
        }
    }
}

void* threadFunction(void* arg)
{
    trace();
    for (int i = 0; ; i++)
    {
        trace("continue... i=%d\n", i);
        sleep(5);
    }
}
  
  
start() {
    logger -st demo 'start'
    start-stop-daemon --start --background \
        --make-pidfile --pidfile /var/run/demo.pid \
        --exec /opt/demo/main
}

stop() {
    logger -st demo 'stop'
    start-stop-daemon --stop --pidfile /var/run/demo.pid
}
  
  
$ mkdir /opt/demo
$ cp worker_thread/* /opt/demo
$ cd /opt/demo
$ ./build.sh
$ systemctl start opensafd
$ ./import-imm.sh
$ amf-state su all safSu=SC-1,safSg=demo,safApp=demo
  

We will debug the program in two phases:

2.2. Debug Entrypoint

When debugging entrypoint of a AMF application, we encounters the below points:

First, we adjust these timeouts to have enough time for debugging.

  
$ immcfg -a saAmfCtDefClcCliTimeout=600000000000 safVersion=1,safCompType=demo
  
  
start() {
    logger -st demo 'start'
    start-stop-daemon --start --background \
        --make-pidfile --pidfile /var/run/demo.pid \
        --exec /usr/bin/gdbserver localhost:5555 /opt/demo/main
}
  
  
$ gdb
  
  
(gdb) set tcp auto-retry on
(gdb) set tcp connect-timeout unlimited
  
  
(gdb) target remote localhost:5555
  
  
$ amf-adm unlock-in safSu=SC-1,safSg=demo,safApp=demo
  
  
# console log
Remote debugging using localhost:5555
Reading /opt/demo/main from remote target...
  
  
(gdb) info inferiors
  
  
# console log
  Num  Description       Connection                Executable        
* 1    process 379783    1 (remote localhost:5555) target:/opt/demo/main
  
  
(gdb) break main
  
  
# console log
Breakpoint 1 at 0xaaaaaaaa0b64: file main.c, line 40.
  
  
(gdb) continue
  
  
# console log
Continuing.

Breakpoint 1, main (argc=1, argv=0xfffffffffba8) at main.c:40
40          openlog("demo", LOG_PID, LOG_USER);
  
  
(gdb) continue
  
  
# syslog
2024-12-31T23:46:23.560527-05:00 SC-1 demo: start
2024-12-31T23:46:23.563433-05:00 SC-1 demo[384582]: main: start
2024-12-31T23:46:23.565075-05:00 SC-1 demo[384582]: main: receive poll event
2024-12-31T23:46:23.565192-05:00 SC-1 demo[384582]: healthcheckCallback: count = 0
2024-12-31T23:46:23.565281-05:00 SC-1 demo[384582]: threadFunction: 
2024-12-31T23:46:23.565324-05:00 SC-1 demo[384582]: threadFunction: continue... i=0
2024-12-31T23:46:28.566365-05:00 SC-1 demo[384582]: threadFunction: continue... i=1
2024-12-31T23:46:33.566979-05:00 SC-1 demo[384582]: threadFunction: continue... i=2
2024-12-31T23:46:33.652628-05:00 SC-1 demo[384582]: main: receive poll event
2024-12-31T23:46:33.652835-05:00 SC-1 demo[384582]: healthcheckCallback: count = 1
2024-12-31T23:46:38.568911-05:00 SC-1 demo[384582]: threadFunction: continue... i=3
2024-12-31T23:46:43.569081-05:00 SC-1 demo[384582]: threadFunction: continue... i=4
  
  
(gdb) detach
(gdb) quit
  

2.2. Debug Running Process

After each healthcheck timeout, AMF sends a ping to the running process. If there is no response, AMF assumes the process is down and triggers failover. Below is list of healthcheck timeouts:

For convinience, we can adjust these timeouts to have more time for debugging.

  
$ immcfg -a saAmfHctDefPeriod=600000000000 safHealthcheckKey=demo,safVersion=1,safCompType=demo
  

The healthcheck ping will be handled by some threads of the AMF program, while other threads manage different tasks. In the below output, threads 1, 2, 3 handle ping requests; while thread 4 executes function threadFunction() - a while loop.

  
$ gdb -q -p `pidof /opt/demo/main` -batch -ex 'thread apply all backtrace'
  
  
// console log
Thread 4
#0  __GI___clock_nanosleep ()   at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  __GI___nanosleep ()         at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  __sleep ()                  at ../sysdeps/posix/sleep.c:55
#3  threadFunction ()           at main.c:157
#4  start_thread ()             at ./nptl/pthread_create.c:442
#5  thread_start ()             at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread 3
#0  __GI___poll ()              at ../sysdeps/unix/sysv/linux/poll.c:41
#1  poll ()                     at /usr/include/aarch64-linux-gnu/bits/poll2.h:39
#2  mdtm_process_recv_events_tcp () at src/mds/mds_dt_trans.c:986
#3  start_thread ()             at ./nptl/pthread_create.c:442
#4  thread_start ()             at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread 2
#0  __GI___poll ()              at ../sysdeps/unix/sysv/linux/poll.c:41
#1 poll ()                      at /usr/include/aarch64-linux-gnu/bits/poll2.h:39
#2  osaf_poll_no_timeout ()     at src/base/osaf_poll.c:31
#3  osaf_poll ()                at src/base/osaf_poll.c:44
#4  osaf_poll_one_fd ()         at src/base/osaf_poll.c:133
#5  ncs_tmr_wait ()             at ./src/base/timer/timer_handle.h:76
#6  start_thread ()             at ./nptl/pthread_create.c:442
#7  thread_start ()             at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread 1
#0  __GI___poll ()              at ../sysdeps/unix/sysv/linux/poll.c:41
#1  main ()                     at main.c:95
  

We will use gdb non-stop mode to keep thread 1,2,3 running while pause thread 4 for debugging.

  
$ gdb
(gdb) set non-stop on  
(gdb) set pagination off
(gdb) set confirm off
  
  
(gdb) shell pidof main
(gdb) attach 403184
  
  
# console log
Attaching to process 403184
Thread 2 "main" stopped.
Thread 3 "main" stopped.
Thread 4 "main" stopped.
  
  
(gdb) continue -a &
  
  
(gdb) info threads
  
  
# console log
  Id   Target Id                                 Frame 
* 1    Thread 0xffff929a4440 (LWP 403184) "main" (running)
  2    Thread 0xffff923af120 (LWP 403185) "main" (running)
  3    Thread 0xffff9237f120 (LWP 403186) "main" (running)
  4    Thread 0xffff9234f120 (LWP 403187) "main" (running)
  
  
(gdb) thread 4
  
  
# console log
[Switching to thread 4 (Thread 0xffff9234f120 (LWP 403187))](running)
  
  
(gdb) interrupt
  
  
# console log
Thread 4 "main" stopped.
  
  
(gdb) info threads
  
  
# console log
  Id   Target Id                                 Frame 
  1    Thread 0xffff929a4440 (LWP 403184) "main" (running)
  2    Thread 0xffff923af120 (LWP 403185) "main" (running)
  3    Thread 0xffff9237f120 (LWP 403186) "main" (running)
* 4    Thread 0xffff9234f120 (LWP 403187) "main" __GI___clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
  
  
(gdb) backtrace
  
  
// console log
#0  __GI___clock_nanosleep ()   at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  __GI___nanosleep ()         at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  __sleep ()                  at ../sysdeps/posix/sleep.c:55
#3  threadFunction ()           at main.c:157
#4  start_thread ()             at ./nptl/pthread_create.c:442
#5  thread_start ()             at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
  

3. Cheatsheets

3.1. Debug Entrypoint

  
set tcp auto-retry on
set tcp connect-timeout unlimited
  
  
gdb -q -x debug.gdb -ex 'target remote localhost:5555'
  
  
gdbserver localhost:5555 /opt/demo/main
  

3.2. Debug Running Process

  
set non-stop on
set pagination off
set confirm off
  
  
gdb -x debug.gdb -ex "attach $(pidof /opt/demo/main)"
  
  
(gdb) continue -a &
(gdb) thread 4
(gdb) interrupt
  

References