Debug A High Availability Program
High availability systems are designed for continuous service, typically relying on multi-threaded programs where some threads manage real-time tasks while others handle external events. Debugging these systems is more challenging because interrupting or pausing the program to investigate issues can disrupt service and affect system reliability.
In this lab, we will the debug the below OpenSAF AMF program.
Source code: github.com/thachmpham/samples
File main.c
int main(int argc, char ** argv)
{
//...
(&amfHandler, &callbacks, &apiVersion);
saAmfInitialize_o4
(amfHandler, &name, 0);
saAmfComponentRegister
(&thread, NULL, threadFunction, NULL);
pthread_create
(amfHandler, &monitorFd);
saAmfSelectionObjectGetstruct pollfd fds[1];
[0].fd = monitorFd;
fds[0].events = POLLIN;
fds
while (poll(fds, 1, -1) != -1)
{
if (fds[0].revents & POLLIN)
{
(amfHandler, SA_DISPATCH_ONE);
saAmfDispatch}
}
}
void* threadFunction(void* arg)
{
();
tracefor (int i = 0; ; i++)
{
("continue... i=%d\n", i);
trace(5);
sleep}
}
start() {
logger -st demo 'start'
start-stop-daemon --start --background \
--make-pidfile --pidfile /var/run/demo.pid \
--exec /opt/demo/main
}
stop() {
logger -st demo 'stop'
start-stop-daemon --stop --pidfile /var/run/demo.pid
}
$ mkdir /opt/demo
$ cp worker_thread/* /opt/demo
$ cd /opt/demo
$ ./build.sh
$ systemctl start opensafd
$ ./import-imm.sh
$ amf-state su all safSu=SC-1,safSg=demo,safApp=demo
We will debug the program in two phases:
When debugging entrypoint of a AMF application, we encounters the below points:
saAmfCtRelPathInstantiateCmd
of class
SaAmfCompType
.saAmfCtDefClcCliTimeout
of class
SaAmfCompType
, AMF will assume that the operation failed
and stop the operation.saAmfCompInstantiateTimeout
of class
SaAmfComp
, AMF will automatically kill and restart the
process.First, we adjust these timeouts to have enough time for debugging.
$ immcfg -a saAmfCtDefClcCliTimeout=600000000000 safVersion=1,safCompType=demo
start() {
logger -st demo 'start'
start-stop-daemon --start --background \
--make-pidfile --pidfile /var/run/demo.pid \
--exec /usr/bin/gdbserver localhost:5555 /opt/demo/main
}
$ gdb
(gdb) set tcp auto-retry on
(gdb) set tcp connect-timeout unlimited
(gdb) target remote localhost:5555
$ amf-adm unlock-in safSu=SC-1,safSg=demo,safApp=demo
# console log
Remote debugging using localhost:5555
Reading /opt/demo/main from remote target...
(gdb) info inferiors
# console log
Num Description Connection Executable
* 1 process 379783 1 (remote localhost:5555) target:/opt/demo/main
(gdb) break main
# console log
Breakpoint 1 at 0xaaaaaaaa0b64: file main.c, line 40.
(gdb) continue
# console log
Continuing.
Breakpoint 1, main (argc=1, argv=0xfffffffffba8) at main.c:40
40 openlog("demo", LOG_PID, LOG_USER);
(gdb) continue
# syslog
2024-12-31T23:46:23.560527-05:00 SC-1 demo: start
2024-12-31T23:46:23.563433-05:00 SC-1 demo[384582]: main: start
2024-12-31T23:46:23.565075-05:00 SC-1 demo[384582]: main: receive poll event
2024-12-31T23:46:23.565192-05:00 SC-1 demo[384582]: healthcheckCallback: count = 0
2024-12-31T23:46:23.565281-05:00 SC-1 demo[384582]: threadFunction:
2024-12-31T23:46:23.565324-05:00 SC-1 demo[384582]: threadFunction: continue... i=0
2024-12-31T23:46:28.566365-05:00 SC-1 demo[384582]: threadFunction: continue... i=1
2024-12-31T23:46:33.566979-05:00 SC-1 demo[384582]: threadFunction: continue... i=2
2024-12-31T23:46:33.652628-05:00 SC-1 demo[384582]: main: receive poll event
2024-12-31T23:46:33.652835-05:00 SC-1 demo[384582]: healthcheckCallback: count = 1
2024-12-31T23:46:38.568911-05:00 SC-1 demo[384582]: threadFunction: continue... i=3
2024-12-31T23:46:43.569081-05:00 SC-1 demo[384582]: threadFunction: continue... i=4
(gdb) detach
(gdb) quit
After each healthcheck timeout, AMF sends a ping to the running process. If there is no response, AMF assumes the process is down and triggers failover. Below is list of healthcheck timeouts:
saAmfHctDefPeriod
in class
SaAmfHealthcheckType
.saAmfHealthcheckPeriod
in class
SaAmfHealthcheck
.For convinience, we can adjust these timeouts to have more time for debugging.
$ immcfg -a saAmfHctDefPeriod=600000000000 safHealthcheckKey=demo,safVersion=1,safCompType=demo
The healthcheck ping will be handled by some threads of the AMF program, while other threads manage different tasks. In the below output, threads 1, 2, 3 handle ping requests; while thread 4 executes function threadFunction() - a while loop.
$ gdb -q -p `pidof /opt/demo/main` -batch -ex 'thread apply all backtrace'
// console log
4
Thread #0 __GI___clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 __GI___nanosleep () at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 __sleep () at ../sysdeps/posix/sleep.c:55
#3 threadFunction () at main.c:157
#4 start_thread () at ./nptl/pthread_create.c:442
#5 thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
3
Thread #0 __GI___poll () at ../sysdeps/unix/sysv/linux/poll.c:41
#1 poll () at /usr/include/aarch64-linux-gnu/bits/poll2.h:39
#2 mdtm_process_recv_events_tcp () at src/mds/mds_dt_trans.c:986
#3 start_thread () at ./nptl/pthread_create.c:442
#4 thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
2
Thread #0 __GI___poll () at ../sysdeps/unix/sysv/linux/poll.c:41
#1 poll () at /usr/include/aarch64-linux-gnu/bits/poll2.h:39
#2 osaf_poll_no_timeout () at src/base/osaf_poll.c:31
#3 osaf_poll () at src/base/osaf_poll.c:44
#4 osaf_poll_one_fd () at src/base/osaf_poll.c:133
#5 ncs_tmr_wait () at ./src/base/timer/timer_handle.h:76
#6 start_thread () at ./nptl/pthread_create.c:442
#7 thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
1
Thread #0 __GI___poll () at ../sysdeps/unix/sysv/linux/poll.c:41
#1 main () at main.c:95
We will use gdb non-stop mode to keep thread 1,2,3 running while pause thread 4 for debugging.
$ gdb
(gdb) set non-stop on
(gdb) set pagination off
(gdb) set confirm off
(gdb) shell pidof main
(gdb) attach 403184
# console log
Attaching to process 403184
Thread 2 "main" stopped.
Thread 3 "main" stopped.
Thread 4 "main" stopped.
(gdb) continue -a &
(gdb) info threads
# console log
Id Target Id Frame
* 1 Thread 0xffff929a4440 (LWP 403184) "main" (running)
2 Thread 0xffff923af120 (LWP 403185) "main" (running)
3 Thread 0xffff9237f120 (LWP 403186) "main" (running)
4 Thread 0xffff9234f120 (LWP 403187) "main" (running)
(gdb) thread 4
# console log
[Switching to thread 4 (Thread 0xffff9234f120 (LWP 403187))](running)
(gdb) interrupt
# console log
Thread 4 "main" stopped.
(gdb) info threads
# console log
Id Target Id Frame
1 Thread 0xffff929a4440 (LWP 403184) "main" (running)
2 Thread 0xffff923af120 (LWP 403185) "main" (running)
3 Thread 0xffff9237f120 (LWP 403186) "main" (running)
* 4 Thread 0xffff9234f120 (LWP 403187) "main" __GI___clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
(gdb) backtrace
// console log
#0 __GI___clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 __GI___nanosleep () at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 __sleep () at ../sysdeps/posix/sleep.c:55
#3 threadFunction () at main.c:157
#4 start_thread () at ./nptl/pthread_create.c:442
#5 thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
set tcp auto-retry on
set tcp connect-timeout unlimited
gdb -q -x debug.gdb -ex 'target remote localhost:5555'
gdbserver localhost:5555 /opt/demo/main
set non-stop on
set pagination off
set confirm off
gdb -x debug.gdb -ex "attach $(pidof /opt/demo/main)"
(gdb) continue -a &
(gdb) thread 4
(gdb) interrupt