So far, with the LULESH example, tracing adds 12-45% overhead (at 2, 4, 8, 16, 32, 48, and 56 threads on carina) when built with gcc-5.4 and without energy measurement. With energy measurement, the overhead is 250-750%. Our goal is to reduce the overhead to 5% or less; an overhead of 10% or more is hardly acceptable.
Approaches:
- Reduce unnecessary trace events. The sync_region and sync_region_wait pairs always occur together, so we only need to trace one pair, say sync_region_begin and sync_region_end. We need to check the other events for similar redundancy.
- Turn tracing on and off at runtime. We need to check whether OMPT allows this from within an OMPT callback. Alternatively, assign a log level to each tracepoint so that tracepoints can be turned on or off by level.
- Eliminate the thread_id field from each event. This needs to be checked against how LTTng streams its channels, e.g. if an OpenMP thread migrates from one core to another, how are its traces streamed to LTTng?
- Choose smaller data types for certain event fields to reduce the event record size, e.g. thread_id could be a short if the field is needed at all.
- Consult the LTTng developers about performance and the mechanism for writing trace records, e.g. can we avoid creating LTTng threads for each channel and let the tracepoints just write to a memory buffer?
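
As a rough illustration of three of the approaches above (dropping the redundant sync_region_wait pair, gating tracepoints by level at runtime, and narrowing thread_id), here is a minimal sketch in plain C. It does not use the LTTng or OMPT APIs: every name in it (event_kind, trace_level, trace_event, trace_set_level, the record structs) is hypothetical, the buffer is a single in-memory array, and the code is single-threaded. LTTng-UST has its own per-tracepoint log level mechanism, which this sketch only imitates. The struct comparison illustrates the size effect only; it is not how LTTng lays out records.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds, named after the events discussed above. */
enum event_kind {
    EV_SYNC_REGION_BEGIN,
    EV_SYNC_REGION_END,
    EV_SYNC_REGION_WAIT_BEGIN,
    EV_SYNC_REGION_WAIT_END
};

/* Hypothetical levels: an event is recorded only if its level is
 * less than or equal to the currently active level. */
enum trace_level { TRACE_OFF = 0, TRACE_ESSENTIAL = 1, TRACE_VERBOSE = 2 };

/* Baseline record with 64-bit fields, kept only for size comparison. */
struct event_wide {
    uint64_t timestamp;
    uint64_t thread_id;
    uint64_t kind;
};

/* Compact record: thread_id and kind narrowed to 16 bits.
 * Assumption: fewer than 65536 threads and event kinds. */
struct event_compact {
    uint64_t timestamp;
    uint16_t thread_id;
    uint16_t kind;
};

static_assert(sizeof(struct event_compact) < sizeof(struct event_wide),
              "compact record should be smaller than the wide one");

#define TRACE_CAPACITY 1024

static _Atomic int active_level = TRACE_ESSENTIAL;
static struct event_compact trace_buf[TRACE_CAPACITY];
static size_t trace_count = 0; /* single-threaded sketch, no locking */

/* The wait events accompany the region pair, so they are demoted to
 * the verbose level and skipped in the default configuration. */
static int level_of(enum event_kind kind)
{
    return (kind == EV_SYNC_REGION_BEGIN || kind == EV_SYNC_REGION_END)
               ? TRACE_ESSENTIAL
               : TRACE_VERBOSE;
}

/* Change the active level at runtime (e.g. from a signal or a callback). */
static void trace_set_level(int level)
{
    atomic_store_explicit(&active_level, level, memory_order_relaxed);
}

/* Record one event; returns false if it was filtered out or dropped. */
static bool trace_event(enum event_kind kind, uint16_t thread_id,
                        uint64_t timestamp)
{
    if (level_of(kind) >
        atomic_load_explicit(&active_level, memory_order_relaxed))
        return false; /* filtered: only one cheap load on this path */
    if (trace_count == TRACE_CAPACITY)
        return false; /* buffer full: drop the event */
    trace_buf[trace_count++] =
        (struct event_compact){ timestamp, thread_id, (uint16_t)kind };
    return true;
}
```

With the default TRACE_ESSENTIAL level, a begin/wait_begin/wait_end/end sequence stores two records instead of four. Calling trace_set_level(TRACE_OFF) leaves only the level check on the hot path. Whether a comparable switch can be driven from an OMPT callback is still the open question listed above.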