Sunday, August 12, 2007

more explanation on print_stacktrace

For code segement:

while (fp < (uchar**) stack_bottom)
{
#if defined(__i386__) || defined(__x86_64__)
uchar** new_fp = (uchar**)*fp;
fprintf(stderr, "%p\n", frame_count == sigreturn_frame_count ?
*(fp + SIGRETURN_FRAME_OFFSET) : *(fp + 1));
#endif /* defined(__386__) || defined(__x86_64__) */

fp = new_fp;
++frame_count;
}

As suggested by Paul,
if it is a signal frame, its frame OFFSET is different.
It is defined by SIGRETURN_FRAME_OFFSET.

Since we are going to print stacktrace inside a normal function, we don't need consider this.
Here, fp (frame pointer) is EBP (Extend Base Pointer) of this caller. Its content *fp is the EBP of its caller.
That's why we can trace to the stack frame of callers by using
new_fp = *fp;

The return address of the callee is next to EBP. In this code segment,
we print *(fp + 1) to get it.

I did test on this. The printed address makes sense.

Next, the problem is how to resolve those symbols.
http://dev.mysql.com/doc/refman/5.0/en/using-stack-trace.html

In mysql, the resolving part is done by
resolve_stack_dump.c.

But I have problem to resolve my test stack with symbol info.

Sunday, July 22, 2007

print_stacktrace in mysql source code

Mysql AB provides print_stacktrace function inside stacktrace.c.
void print_stacktrace(gptr stack_bottom, ulong thread_stack)

mysqld.cc calls it when segfault happens:
print_stacktrace(thd ? (gptr) thd->thread_stack : (gptr) 0,
thread_stack);

Its basic idea of print_stacktrace is

1. first get current frame pointer fp

take __i386__ for example, fp is
Extended Base Pointer (ebp)

#ifdef __i386__
__asm __volatile__ ("movl %%ebp,%0"
:"=r"(fp)
:"r"(fp));
#endif


+---------+------------------+
| EBP - 4 | local variables |
| EBP | callees EBP |
| EBP + 4 | ret-addr |
| EBP + 8 | parameters |
+---------+------------------+

2. transverse the stack to print the call stack info
/* We are 1 frame above signal frame with NPTL and 2 frames above with LT */
sigreturn_frame_count = thd_lib_detected == THD_LIB_LT ? 2 : 1;

while (fp < (uchar**) stack_bottom)
{
#if defined(__i386__) || defined(__x86_64__)
uchar** new_fp = (uchar**)*fp;
fprintf(stderr, "%p\n", frame_count == sigreturn_frame_count ?
*(fp + SIGRETURN_FRAME_OFFSET) : *(fp + 1));
#endif /* defined(__386__) || defined(__x86_64__) */

fp = new_fp;
++frame_count;
}

I don't quite understand how to get sigreturn_frame_count in the 2nd step.


The 2nd parameter seems the stack size, and the first one seems the stack bottom.
I am trying to call this function inside my program. But I passed wrong parameters.
I will further investigate how to pass correct parameters.

use backtrace to get call stack information (GNU C library)

We can use GNU C lib backtrace functions to get call stack on Linux OS.
int backtrace (void **buffer, int size)
char ** backtrace_symbols (void *const *buffer, int size)
header file:
execinfo.h
in order to get the symbols, we need pass -rdynamic to gcc.
-rdynamic will
pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table.

Limitation: it only works when backtrace is provided in GNU C. It does not work on Mac.


Compile: gcc -o mytrace mytrace.c -rdynamic

Example Result:
Obtained 6 stack frames.
./mytrace(print_trace+0x14) [0x8048724]
./mytrace(dummy_function+0xb) [0x80487b7]
./mytrace(dummy_function_1+0xb) [0x80487a7]
./mytrace(main+0xb) [0x80487c7]
/lib/libc.so.6(__libc_start_main+0xa9) [0x4003b54d]
./mytrace(backtrace_symbols+0x35) [0x8048631]


Example:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Obtain a backtrace and print it to stdout. */
void
print_trace (void)
{
void *array[10];
size_t size;
char **strings;
size_t i;

size = backtrace (array, 10);
strings = backtrace_symbols (array, size);

printf ("Obtained %zd stack frames.\n", size);

for (i = 0; i < size; i++)
printf ("%s\n", strings[i]);

free (strings);
}

void
dummy_function_1 (void)
{
dummy_function();
}
/* A dummy function to make the backtrace more interesting. */
void
dummy_function (void)
{
print_trace ();
}

int
main (void)
{
dummy_function_1();
return 0;
}

Sunday, July 15, 2007

Aspect-oriented C ( ACC ) Set Up

First, we download ACC from www.aspectc.net and make it following the README. ACC script tacc can automatically compile aspect files (with suffix .acc) with regular .c files. To use tacc, we need setup several environment variables:
For example, in .bashrc

export PATH=$PATH:.:/ACC_PATH/bin
export CC=tacc
export SRCROOT=/PBXT_SOURCE_CODE_PATH/src
export AR=accar

SRCROOT is our source code (i.e. pbxt) directory. We also put .acc files in the same source code directory. Then, we can run ./configure inside PBXT source code directory as before. E.g,

./configure --with-mysql=/home/build/mysql-51 --libdir=/home/build/test/lib/mysql --with-debug=full

It will automatically generate a makefile which uses ACC command.
Note, in our case, we have mixed .cc and .c files. Since ACC does not work with .cc, we need use gcc to link file instead of using g++. So, we have to manually change TAG inside Makefile in PBXT src. That is, in line

CXXLINK = $(LIBTOOL) --tag=CXX …

We change the tag to be --tag=CC.

Tuesday, July 10, 2007

Initial Test Result for Profile.acc

I test updates, queries and their mixed cases. It works well.
I haven't measure the performance impact yet. I think the influence of Aspect C should be small since it just adds function calls. If the inline is supported when compiling, the cost should be minimum.

Test 1:
./sysbench --test=oltp --oltp-table-size=1000000 --mysql-table-engine=pbxt
--mysql-engine-trx=yes --mysql-user=root --mysql-socket=/tmp/mysql.sock
prepare --mysql-db=test

SysBench Result 1:
sysbench v0.4.8: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Creating table 'sbtest'...
Creating 1000000 records in table 'sbtest'...

Profile Result 1:
System_call Total_Execution_Time Total_Call_Times
pread 719098 24312
pwrite 11799545 1026503
malloc 732 453
calloc 0 0
realloc 0 0
pthread_mutex_lock 2461 497
pthread_rwlock_rdlock 18 37
pthread_rwlock_wrlock 17 72

Total Execution Time represents the sum of execution time (in micro
seconds), and the Total_Call_Times means how many times this function is
called.

Test 2:
./sysbench --test=oltp --oltp-table-size=1000000 --init-rng=on
--mysql-engine-trx=yes --oltp-table-name=sbtest --oltp-test-mode=complex
--oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0
--oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on
--oltp-read-only=on --max-requests=0 --num-threads=4
--oltp-dist-type=special --init-rng=on --mysql-db=sbtest --max-time=60
--mysql-user=root --mysql-socket=/tmp/mysql.sock run --mysql-db=test

SysBench Result2:
sysbench v0.4.8: multi-threaded system evaluation benchmark
No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 4
Initializing random number generator from timer.
Doing OLTP test.
Running mixed OLTP test
Doing read-only test
Using Special distribution (12 iterations, 1 pct of values are returned
in 75 pct cases)
Skipping BEGIN/COMMIT
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
(last message repeated 3 times)
Done.

OLTP test statistics:
queries performed:
read: 180493
write: 0
other: 0
total: 180493
transactions: 180493 (3008.18 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 180493 (3008.18 per sec.)
other operations: 0 (0.00 per sec.)

Test execution summary:
total time: 60.0008s
total number of events: 180493
total time taken by event execution: 239.7300
per-request statistics:
min: 0.0003s
avg: 0.0013s
max: 0.0119s
approx. 95 percentile: 0.0018s
Threads fairness:
events (avg/stddev): 45123.2500/70.89
execution time (avg/stddev): 59.9325/0.00

Profile Result 2:
System_call Total_Execution_Time Total_Call_Times
pread 404260 56359
pwrite 0 0
malloc 34736 50182
calloc 0 0
realloc 0 0
pthread_mutex_lock 71303 140555
pthread_rwlock_rdlock 6 24
pthread_rwlock_wrlock 15 42

Test 3:
./sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off
--oltp-test-mode=complex --num-threads=2 --mysql-engine-trx=yes
--init-rng=on --oltp-point-selects=0 --oltp-simple-ranges=0
--oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0
--mysql-user=root --mysql-socket=/tmp/mysql.sock --oltp-table-name=sbtest
--oltp-dist-type=special --max-requests=50000 run --mysql-db=test

SysBench Result 3:
sysbench v0.4.8: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
WARNING: Preparing of "BEGIN" is unsupported, using emulation
(last message repeated 1 times)
Running the test with following options:
Number of threads: 2
Initializing random number generator from timer.

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned
in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 50000
Threads started!
Done.

OLTP test statistics:
queries performed:
read: 0
write: 250056
other: 100027
total: 350083
transactions: 50000 (491.55 per sec.)
deadlocks: 27 (0.27 per sec.)
read/write requests: 250056 (2458.31 per sec.)
other operations: 100027 (983.37 per sec.)

Test execution summary:
total time: 101.7185s
total number of events: 50000
total time taken by event execution: 203.2377
per-request statistics:
min: 0.0017s
avg: 0.0041s
max: 5.0590s
approx. 95 percentile: 0.0042s

Threads fairness:
events (avg/stddev): 25000.0000/57.00
execution time (avg/stddev): 101.6188/0.00

Profile Result 3:
System_call Total_Execution_Time Total_Call_Times
pread 8963222 129261
pwrite 42618355 2338101
malloc 26740 44017
calloc 0 0
realloc 0 0
pthread_mutex_lock 50763 121142
pthread_rwlock_rdlock 120 19
pthread_rwlock_wrlock 145 33

Profiling of System Calls by ACC

I've written the following program to profile system calls by using Aspect C.
Put this file (profile.acc) inside the src directory, ACC will automatically weave it with the original code, and then compile it.

#include
#include "xt_defs.h"

#undef PROFILE_DEBUG

/*
* profile_get_time
* Return: current time in micro seconds
*/
inline long profile_get_time( )
{
struct timeval current_time;
gettimeofday(¤t_time,0);
return (current_time.tv_sec)*1000000+current_time.tv_usec;
}

/*
* profile i/o function: pread
* we record its execution time when pread is called
*/

long profile_pread_access_time=0;
long profile_pread_freq=0;

int around(): call($ pread(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_pread_access_time += profile_get_time()-start_time;
profile_pread_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile i/o function: pwrite
* we record its execution time when pwrite is called
*/

long profile_pwrite_access_time=0;
long profile_pwrite_freq=0;

int around(): call($ pwrite(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_pwrite_access_time += profile_get_time()-start_time;
profile_pwrite_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile memory function: malloc
* we record its execution time when malloc is called
*/

long profile_malloc_access_time=0;
long profile_malloc_freq=0;

int around(): call($ malloc(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_malloc_access_time += profile_get_time()-start_time;
profile_malloc_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile memory function: calloc
* we record its execution time when calloc is called
*/

long profile_calloc_access_time=0;
long profile_calloc_freq=0;

int around(): call($ calloc(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_calloc_access_time += profile_get_time()-start_time;
profile_calloc_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile memory function: realloc
* we record its execution time when realloc is called
*/

long profile_realloc_access_time=0;
long profile_realloc_freq=0;

int around(): call($ realloc(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_realloc_access_time += profile_get_time()-start_time;
profile_realloc_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile lock function: pthread_mutex_lock
* we record its execution time when pthread_mutex_lock is called
*/

long profile_pthread_mutex_lock_access_time=0;
long profile_pthread_mutex_lock_freq=0;

int around(): call($ pthread_mutex_lock(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_pthread_mutex_lock_access_time += profile_get_time()-start_time;
profile_pthread_mutex_lock_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}


/*
* profile lock function: pthread_rwlock_rdlock
* we record its execution time when pthread_rwlock_rdlock is called
*/

long profile_pthread_rwlock_rdlock_access_time=0;
long profile_pthread_rwlock_rdlock_freq=0;

int around(): call($ pthread_rwlock_rdlock(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_pthread_rwlock_rdlock_access_time += profile_get_time()-start_time;
profile_pthread_rwlock_rdlock_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* profile lock function: pthread_rwlock_wrlock
* we record its execution time when pthread_rwlock_wrlock is called
*/

long profile_pthread_rwlock_wrlock_access_time=0;
long profile_pthread_rwlock_wrlock_freq=0;

int around(): call($ pthread_rwlock_wrlock(...)) {

long start_time = profile_get_time();

int val = proceed();

profile_pthread_rwlock_wrlock_access_time += profile_get_time()-start_time;
profile_pthread_rwlock_wrlock_freq ++;

#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif

return val;
}

/*
* Print out all results of profiling
* when xt_create_thread("TempForEnd", &err, TRUE) is executed in pbxt_end
*/

void printProfileResult()
{
printf("System_call\tTotal_Execution_Time\tTotal_Call_Times\n");
printf("pread %d %d\n",profile_pread_access_time,profile_pread_freq);
printf("pwrite %d %d\n",profile_pwrite_access_time,profile_pwrite_freq);

printf("malloc %d %d\n",profile_malloc_access_time,profile_malloc_freq);
printf("calloc %d %d\n",profile_calloc_access_time,profile_calloc_freq);
printf("realloc %d %d\n",profile_realloc_access_time,profile_realloc_freq);

printf("pthread_mutex_lock %d %d\n",profile_pthread_mutex_lock_access_time,profile_pthread_mutex_lock_freq);
printf("pthread_rwlock_rdlock %d %d\n",profile_pthread_rwlock_rdlock_access_time,profile_pthread_rwlock_rdlock_freq);
printf("pthread_rwlock_wrlock %d %d\n",profile_pthread_rwlock_wrlock_access_time,profile_pthread_rwlock_wrlock_freq);
}

/*
* Print out all results of profiling
* when xt_create_thread("TempForEnd", &err, TRUE) is executed in pbxt_end
*/
after(c_char *name): execution($ xt_create_thread(...)) && args(name,...)
{
#ifdef PROFILE_DEBUG
printf(" %s %s in function %s of file %s \n",
this->kind,this->targetName,this->funcName,this->fileName);
#endif
printf("End Name %s\n",name);
if(strcmp(name,"TempForEnd")==0){
//it is called by pbxt_end
printProfileResult();
}
}

Sunday, June 24, 2007

Using Aspect C to Profile PBXT

Since our goal is to find a tool for profiling while we don't want to change/modify the existing code. So we chooses the Aspect-oriented C tool as the profiling language.

The basic idea is to maintain the profiling code separately, and then use Aspect C to weave the code into the original code automatically. For instance, if we are going to profile the I/O function pread, we can write the below code according Aspect specification. Then every time when pread is called, it will be executed.

#include
long pbxt_profile_pread_access_time;
long pbxt_profile_pread_freq;

int around(): execution(int pread(...)) {
struct timeval start_time;
struct timeval end_time;
gettimeofday(&start_time,NULL);

int val = proceed();

gettimeofday(&end_time,NULL);
pbxt_profile_pread_access_time +=
(end_time.tv_sec-start_time.tv_sec)*1000000+
(end_time.tv_usec-start_time.tv_usec);
pbxt_profile_pread_freq ++;

return val;
}

The limitation of Aspect C is that it only works for pure C code, and thus it cannot profile our C++ code. So I am also trying to use Aspect C++ for profiling the C++ part.

Sunday, May 27, 2007

"compile-in" method v.s. "sampling" method

There are two approaches to do the profiling, i.e. compile-in and sampling.
Compile-in is a method to record the timing of executing each component or function, and the one using the longest time is considered the bottleneck.
Sampling is a method to periodically check what the current active component is, and the one
appearing most frequently is considered the bottleneck.

Compile-in needs to instrument the source code, and may affect the performance due to the overhead of counting and logging. To implement compile-in method, one way is to instrument the source code directly, for instance, using macros to redefine the calls and recompile. We are thinking to use aspect of programming techniques to help the instrumentation so that we can keep the instrumented code and the original code separately.

Sampling involves interrupting the process, and grabbing a stack trace. Sampling may affect the performance less if the frequency is not intense. The implementation of sampling is more complicate, and we may reply on some system profiling functions.

Plan:
After the discussion with Paul, we decide to first implement a compile-in method for profiling file I/O, i.e. pread() and pwrite() in c_filesys.c . Our goal is to find out where in the code, a thread is "waiting". Note, when we count the wall time as the execution time, the context switch period is also included. We expect the impact of the context switch among processes is ignorable if the waiting duration is significant.

Saturday, May 26, 2007

Coding Environment Setup

1. How to build PBXT with MySQL
2. How to use SysBench
3. PBXT code download
4. SysBench code download
5. MySQL5.1 code download

Introduction of the "PAT" project

This project is one of the Google Summer of Code 2007 projects for MySQL.

I am a graduate student of University of Toronto, and my mentor of this project is Paul McCullagh.

The goal of this project to provide a performance analysis tool for a MySQL storage engine PrimeBase XT. We aim to help developers to locate the bottleneck of the system. In contrast to traditional profiling tools, we focus on how to capture the impact of the resource contention through measuring the time spent on waiting critical resources, such as I/O, memory and locks. We also try to provide context information to help developers to identify the critical path.