I'm developing a profiler for SYCL offload programs. My approach involves serializing kernel launches using zeEventHostSynchronize to ensure only one kernel is offloaded to the Intel GPU device at a time. For each kernel, I use a profiling thread to read stall sampling data using zetMetricStreamerReadData.
Current Implementation
Currently, after each kernel execution, I collect and process the data. To ensure non-overlapping stall samples between kernels, I've implemented a manual buffer flushing function zeroFlushStreamerBuffer(streamer, desc). This function closes the current streamer and opens a new one.
void zeroFlushStreamerBuffer(zet_metric_streamer_handle_t& streamer, ZeDeviceDescriptor* desc)
{
  ze_result_t status = ZE_RESULT_SUCCESS;
  // Close the old streamer
  status = zetMetricStreamerClose(streamer);
  level0_check_result(status, __LINE__);
  // Open a new streamer
  uint32_t interval = 500000; // ns
  zet_metric_streamer_desc_t streamer_desc = {ZET_STRUCTURE_TYPE_METRIC_STREAMER_DESC, nullptr, max_metric_samples, interval};
  status = zetMetricStreamerOpen(desc->context_, desc->device_, desc->metric_group_, &streamer_desc, nullptr, &streamer);
  if (status != ZE_RESULT_SUCCESS) {
    std::cerr << "[ERROR] Failed to open metric streamer (" << status << "). The sampling interval might be too small." << std::endl;
    streamer = nullptr;
    return;
  }
  if (streamer_desc.notifyEveryNReports > max_metric_samples) {
    max_metric_samples = streamer_desc.notifyEveryNReports;
  }
}
Current Implementation Details
To provide more context, here's the main profiling loop where zeroFlushStreamerBuffer is used:
void
ZeMetricProfiler::RunProfilingLoop
(
  ZeDeviceDescriptor* desc,
  zet_metric_streamer_handle_t& streamer
)
{
  std::vector<uint8_t> raw_metrics(MAX_METRIC_BUFFER + 512);
  desc->profiling_state_.store(PROFILER_ENABLED, std::memory_order_release);
  ze_result_t status;
  while (desc->profiling_state_.load(std::memory_order_acquire) != PROFILER_DISABLED) {
    // Wait for the kernel to start running
    while (true) {
      status = zeEventHostSynchronize(desc->serial_kernel_start_, 50000000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      // Handle case where kernel execution is extremely short:
      // In such cases, the kernel might finish before zeEventHostSynchronize can detect the start event.
      // Without this check, a deadlock could occur:
      // - The Profiling thread would keep waiting for the start event (which has already been reset).
      // - The App thread would be waiting for the Profiling thread to complete data processing.
      // kernel_started_ allows Profiling thread to proceed, avoiding deadlock.
      if (desc->kernel_started_.load(std::memory_order_acquire)) {
        break;
      }
      if (desc->profiling_state_.load(std::memory_order_acquire) == PROFILER_DISABLED) {
        return;
      }
    }
    // Kernel is running, enter sampling loop
    while (true) {
      // Update correlation ID
      gpu_correlation_channel_receive(1, UpdateCorrelationID, desc);
      // Wait for the next interval
      status = zeEventHostSynchronize(desc->serial_kernel_end_, 5000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      CollectAndProcessMetrics(desc, streamer, raw_metrics);
    }
    // Kernel has finished, perform final sampling and cleanup
    CollectAndProcessMetrics(desc, streamer, raw_metrics);
    // FIXME(Yuning): may need a better way to flush the streamer buffer without repeatedly closing and reopening the streamer
    zeroFlushStreamerBuffer(streamer, desc);
    desc->running_kernel_ = nullptr;
    desc->kernel_started_.store(false, std::memory_order_release);
    // Notify the app thread that data processing is complete
    status = zeEventHostSignal(desc->serial_data_ready_);
    level0_check_result(status, __LINE__);
  }
}
This code demonstrates how we currently handle metric collection for each kernel execution, including the use of zeroFlushStreamerBuffer to attempt non-overlapping data collection between kernels.
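The short-kernel guard in the wait loop above can be isolated into a small helper, which makes the three exit conditions easier to reason about. The following is only a sketch: `WaitForKernelStart`, `WaitResult`, and the `poll_start_event` callable are hypothetical names that do not appear in the profiler; the callable stands in for a timed `zeEventHostSynchronize` on the start event.

```cpp
#include <atomic>
#include <functional>

// Hypothetical result type for the sketch below.
enum class WaitResult { kStarted, kDisabled };

// poll_start_event is an assumed stand-in for a timed wait on the kernel
// start event; it returns true once the event has been observed as signalled.
WaitResult WaitForKernelStart(const std::function<bool()>& poll_start_event,
                              const std::atomic<bool>& kernel_started,
                              const std::atomic<bool>& profiler_disabled) {
  for (;;) {
    // Normal path: the start event was observed before being reset.
    if (poll_start_event()) {
      return WaitResult::kStarted;
    }
    // Short-kernel path: the event may already have been reset, so fall
    // back to the flag that the app thread sets at launch.
    if (kernel_started.load(std::memory_order_acquire)) {
      return WaitResult::kStarted;
    }
    // Shutdown path: stop waiting when profiling is turned off.
    if (profiler_disabled.load(std::memory_order_acquire)) {
      return WaitResult::kDisabled;
    }
  }
}
```

Passing the wait as a callable keeps the sketch independent of the Level Zero headers, so the fallback ordering can be checked without a device.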
Questions
Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.
API Enhancement: If my understanding is correct, would it be possible to provide a levelzero API for flushing the metrics streamer, such as zetMetricStreamerFlushData? This could potentially be more efficient than the current zeroFlushStreamerBuffer implementation.
Clarification: If my understanding is incorrect, could you please confirm that each call to zetMetricStreamerReadData always returns non-overlapping data? This would allow me to remove the zeroFlushStreamerBuffer function, potentially improving performance.
Request
I would greatly appreciate clarification on the behavior of zetMetricStreamerReadData in this context and any guidance on the best practices for ensuring non-overlapping metric collection between kernel executions.
Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.
From the API specification point of view, currently the only way to ensure this is to close and open the streamer.
However, this behaviour could be platform-specific. For example, on Aurora, if the previous kernel's execution has completed (ensured with a HostSynchronize call) and all of its stall data has been read out before the next kernel starts, there should not be any overlap in the stall data.
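"Reading out all the stall data" between kernels could be sketched as a drain loop that keeps reading until nothing is returned. This is a minimal sketch under stated assumptions, not a confirmed recipe: `DrainStreamer` and `StreamerReadFn` are hypothetical names, and the reader callable is modelled on `zetMetricStreamerReadData` (buffer capacity in, bytes written out). Whether an empty streamer reports success with a size of zero is itself an assumption that would need checking on the target driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical reader: on entry *size is the buffer capacity in bytes, on
// return it is the number of bytes written. Returns false on error.
// With Level Zero it might wrap (assumption):
//   zetMetricStreamerReadData(streamer, UINT32_MAX, size, data) == ZE_RESULT_SUCCESS
using StreamerReadFn = std::function<bool(std::size_t* size, std::uint8_t* data)>;

// Reads repeatedly until a read returns zero bytes, appending everything to
// `out`. Returns false if a read fails or the reader misreports its size.
bool DrainStreamer(const StreamerReadFn& read,
                   std::vector<std::uint8_t>& scratch,
                   std::vector<std::uint8_t>& out) {
  if (scratch.empty()) {
    return false;  // a zero-sized buffer would act as a size query, not a read
  }
  for (;;) {
    std::size_t size = scratch.size();
    if (!read(&size, scratch.data())) {
      return false;
    }
    if (size > scratch.size()) {
      return false;  // reader claims more bytes than the buffer can hold
    }
    if (size == 0) {
      return true;  // nothing pending: the buffer is drained
    }
    out.insert(out.end(), scratch.data(), scratch.data() + size);
  }
}
```

Calling this after the kernel-end event has been observed, and before the next kernel is released, would correspond to the condition described above.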
API Enhancement: If my understanding is correct, would it be possible to provide a levelzero API for flushing the metrics streamer, such as zetMetricStreamerFlushData? This could potentially be more efficient than the current zeroFlushStreamerBuffer implementation.
Yes. We are discussing the usefulness of such an API internally, and having a use case like the one you describe would help finalize it.
Clarification: If my understanding is incorrect, could you please confirm that each call to zetMetricStreamerReadData always returns non-overlapping data? This would allow me to remove the zeroFlushStreamerBuffer function, potentially improving performance.
I think I have clarified this above.
Please let me know if anything needs further clarification.