GL_INTEL_performance_queries (unofficial unstable incomplete documentation)

Introduction

There are a number of OpenGL extensions related to GPU timing and performance: GL_ARB_timer_query (supported by modern NVIDIA and ATI/AMD drivers on Windows and Linux for most of their devices), GL_EXT_timer_query (supported on those plus older drivers, and some Mesa drivers on Linux), GL_AMD_performance_monitor (supported by ATI/AMD drivers).

There's also a GL_INTEL_performance_queries extension, supported by recent Intel drivers on Windows on Sandy Bridge hardware. Currently (2011-11-05) it appears to be completely undocumented, so I couldn't resist the temptation to work out what it does and how to use it.

Be very careful when using this extension. There's probably a reason why it's not officially documented – it may be buggy or unstable, or may change incompatibly in the future, so don't rely on it continuing to work. Also, this documentation may be wrong.

API

Check for the GL_INTEL_performance_queries extension string, and load the functions with the standard extension mechanism. The argument names and enum names here are just made up by me and probably not highly accurate.

void glGetFirstPerfQueryIdINTEL (GLuint *queryId);

void glGetNextPerfQueryIdINTEL (GLuint prevQueryId, GLuint *queryId);

void glGetPerfQueryInfoINTEL (GLuint queryId,
    GLuint nameMaxLength, char *name,
    GLuint *counterBufferSize, GLuint *numCounters, GLuint *maxQueries,
    GLuint *unknown);

void glGetPerfCounterInfoINTEL (GLuint queryId, GLuint counterId,
    GLuint nameMaxLength, char *name,
    GLuint descMaxLength, char *desc,
    GLuint *offset, GLuint *size, GLuint *usage, GLuint *type,
    GLuint64 *unknown);

void glCreatePerfQueryINTEL (GLuint queryId, GLuint *id);

void glBeginPerfQueryINTEL (GLuint id);

void glEndPerfQueryINTEL (GLuint id);

void glDeletePerfQueryINTEL (GLuint id);

void glGetPerfQueryDataINTEL (GLuint id, GLenum requestType,
    GLuint maxLength, char *buffer, GLuint *length);

#define INTEL_PERFQUERIES_NONBLOCK  0x83FA
#define INTEL_PERFQUERIES_BLOCK     0x83FB

#define INTEL_PERFQUERIES_TYPE_UNSIGNED_INT    0x9402
#define INTEL_PERFQUERIES_TYPE_UNSIGNED_INT64  0x9403
#define INTEL_PERFQUERIES_TYPE_FLOAT           0x9404
#define INTEL_PERFQUERIES_TYPE_BOOL            0x9406

#define INTEL_PERFQUERIES_CATEGORY_MISC1             0x9407
#define INTEL_PERFQUERIES_CATEGORY_TIMEFRACTION      0x9408
#define INTEL_PERFQUERIES_CATEGORY_CYCLESPERTHREAD   0x9409
#define INTEL_PERFQUERIES_CATEGORY_THROUGHPUT        0x940A
#define INTEL_PERFQUERIES_CATEGORY_MISC2             0x940B
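
For the extension-string check mentioned at the top of this section, here is a minimal sketch. It assumes the legacy space-separated string returned by glGetString(GL_EXTENSIONS). The helper name has_gl_extension is made up for illustration and is not part of any API. A plain strstr would also match a longer extension name that merely starts with the same text, so this compares whole tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of OpenGL): returns 1 if 'name' appears as a
 * whole space-separated token in 'extensions', otherwise 0. */
static int has_gl_extension(const char *extensions, const char *name)
{
    size_t nameLen;
    const char *p;

    assert(name != NULL);
    nameLen = strlen(name);
    if (extensions == NULL || nameLen == 0)
        return 0;

    p = extensions;
    while ((p = strstr(p, name)) != NULL)
    {
        /* A real match must start at a token boundary and end at one. */
        int startsToken = (p == extensions) || (p[-1] == ' ');
        int endsToken = (p[nameLen] == ' ') || (p[nameLen] == '\0');
        if (startsToken && endsToken)
            return 1;
        p += nameLen; /* skip this partial match and keep searching */
    }
    return 0;
}
```

With a current context you would call it as has_gl_extension((const char *) glGetString(GL_EXTENSIONS), "GL_INTEL_performance_queries"). In a core profile context the single extension string is not available, and the names have to be fetched one at a time with glGetStringi instead.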

Concepts

The implementation supports a number of query types. Each query type is associated with a number of counter types. An application can create query objects (each having some query type), then begin the query, do some rendering, end the query, and (some time in the future) read back the query data. That data contains values for every counter type associated with that query type.

Queries are asynchronous (as with GL_ARB_timer_query) – they are run somewhere deep in the rendering pipeline, not on the CPU. E.g. the TotalTime counter seemingly gives the time taken for the GPU to execute the rendering commands that were sent during the query period, rather than the CPU time taken to submit those rendering commands.

Usage

To find the supported query types, walk the query IDs like a linked list, stopping when the next ID is 0:

GLuint queryId;
glGetFirstPerfQueryIdINTEL(&queryId);
while (queryId != 0)
{
    ProcessQueryType(queryId);

    glGetNextPerfQueryIdINTEL(queryId, &queryId);
}

To find the details of a query type (repeat for each queryId):

char queryName[256]; // don't know what the upper limit should be
GLuint bufferSize;
GLuint numCounters;
GLuint maxQueries;
GLuint unknown; // don't know what this is for; always seems to be set to 1

glGetPerfQueryInfoINTEL(queryId, sizeof(queryName), queryName,
    &bufferSize, &numCounters, &maxQueries, &unknown);

To find the counter types for a query type, following on from the above:

for (GLuint counterId = 1; counterId <= numCounters; ++counterId)
{
    char counterName[256];  // don't know what the upper limit should be
    char counterDesc[1024]; // don't know what the upper limit should be
    GLuint counterOffset;
    GLuint counterSize;
    GLuint counterCategory;   // one of INTEL_PERFQUERIES_CATEGORY_*
    GLuint counterType;       // one of INTEL_PERFQUERIES_TYPE_*
    GLuint64 unknown2; // don't know what this is for; seems to be set to 0 or 1

    glGetPerfCounterInfoINTEL(queryId, counterId,
        sizeof(counterName), counterName,
        sizeof(counterDesc), counterDesc,
        &counterOffset, &counterSize, &counterCategory, &counterType, &unknown2);

    ProcessCounterType(...);
}

To perform a query:

GLuint ids[2];
glCreatePerfQueryINTEL(queryId, &ids[0]);
glCreatePerfQueryINTEL(queryId, &ids[1]);
    // you can create up to 'maxQueries' simultaneously for each query type

glBeginPerfQueryINTEL(ids[0]);
// ... do some rendering ...
glEndPerfQueryINTEL(ids[0]);

glBeginPerfQueryINTEL(ids[1]);
// ... do some more rendering ...
glEndPerfQueryINTEL(ids[1]);

// Wait a while - it might take a frame or two before the results are available.

// Request the query data, without blocking if the results aren't available yet:

GLuint length;
char* buffer = (char*) malloc(bufferSize);
glGetPerfQueryDataINTEL(ids[0], INTEL_PERFQUERIES_NONBLOCK, bufferSize, buffer, &length);
if (length == 0)
{
    // results not available yet - should try again later
}
else
{
    assert(length == bufferSize); // always seems to be true
    ParseCounterValues(buffer);
}

// Alternatively you can do a blocking request (it'll do a spinloop until the
// results are available):

glGetPerfQueryDataINTEL(ids[1], INTEL_PERFQUERIES_BLOCK, bufferSize, buffer, &length);
assert(length == bufferSize); // always seems to be true
ParseCounterValues(buffer);

// You can reuse the query objects for subsequent frames,
// or just destroy them at the end:

glDeletePerfQueryINTEL(ids[0]);
glDeletePerfQueryINTEL(ids[1]);
free(buffer); // release the result buffer allocated with malloc above

Query results can become available in an arbitrary order. (You can't assume that an earlier query is ready, just because a later query of the same type is ready.)
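
Because results can arrive out of order, each outstanding query has to be polled on its own rather than stopping at the first one that isn't ready. The following is a rough sketch of that bookkeeping, not something taken from the driver. PendingQuery, TryReadFn, poll_pending and fake_try_read are all hypothetical names. In a real program the callback would wrap glGetPerfQueryDataINTEL with INTEL_PERFQUERIES_NONBLOCK and return the reported length; here a fake stands in so the sketch runs without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns the number of bytes written to 'buffer' for
 * query 'id', or 0 if the results are not available yet. */
typedef unsigned (*TryReadFn)(unsigned id, char *buffer, unsigned bufferSize);

typedef struct {
    unsigned id;     /* query object id from glCreatePerfQueryINTEL */
    char *buffer;    /* bufferSize bytes, owned by the caller */
    unsigned length; /* stays 0 until the results have been read */
} PendingQuery;

/* Polls every unfinished query once, independently of the others.
 * Returns how many queries are still pending after this pass. */
static size_t poll_pending(PendingQuery *queries, size_t count,
                           unsigned bufferSize, TryReadFn tryRead)
{
    size_t stillPending = 0;
    size_t i;

    assert(queries != NULL && tryRead != NULL);
    for (i = 0; i < count; ++i)
    {
        if (queries[i].length != 0)
            continue; /* already read on an earlier pass */
        queries[i].length = tryRead(queries[i].id, queries[i].buffer, bufferSize);
        if (queries[i].length == 0)
            ++stillPending; /* not ready; try again on a later frame */
    }
    return stillPending;
}

/* Stand-in for the driver: pretends that only query id 2 has finished. */
static unsigned fake_try_read(unsigned id, char *buffer, unsigned bufferSize)
{
    if (id != 2 || bufferSize == 0)
        return 0;
    buffer[0] = 42;
    return bufferSize;
}
```

Calling poll_pending once per frame until it returns 0 collects every result without assuming anything about completion order.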

To parse the counter values, implement the earlier ParseCounterValues so that, for each counter whose details were recorded by ProcessCounterType, it does something like:

switch (counterType)
{
    case INTEL_PERFQUERIES_TYPE_UNSIGNED_INT:
    {
        GLuint value;
        assert(counterSize == sizeof(value));
        // copy from the counter's byte offset within the result buffer
        memcpy(&value, buffer + counterOffset, sizeof(value));
        // do something with 'value'
        break;
    }
    case INTEL_PERFQUERIES_TYPE_UNSIGNED_INT64:
    {
        GLuint64 value;
        assert(counterSize == sizeof(value));
        memcpy(&value, buffer + counterOffset, sizeof(value));
        // do something with 'value'
        break;
    }
    case INTEL_PERFQUERIES_TYPE_FLOAT:
    {
        GLfloat value;
        assert(counterSize == sizeof(value));
        memcpy(&value, buffer + counterOffset, sizeof(value));
        // do something with 'value'
        break;
    }
    case INTEL_PERFQUERIES_TYPE_BOOL:
    {
        // bools are stored as a 32-bit integer holding 0 or 1
        GLuint value;
        assert(counterSize == sizeof(value));
        memcpy(&value, buffer + counterOffset, sizeof(value));
        assert(value == 0 || value == 1);
        // do something with 'value'
        break;
    }
}
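
A more defensive variant of the same idea, as a self-contained sketch: it checks that the counter lies inside the buffer before copying, and returns every value as a double for convenience. CounterInfo and read_counter are hypothetical names. The type codes are the guessed values from the enum list earlier in this document, and fixed-width C types stand in for the GL typedefs so the snippet compiles without GL headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Type codes copied from the (unofficial) enum list earlier in this document. */
#define INTEL_PERFQUERIES_TYPE_UNSIGNED_INT    0x9402
#define INTEL_PERFQUERIES_TYPE_UNSIGNED_INT64  0x9403
#define INTEL_PERFQUERIES_TYPE_FLOAT           0x9404
#define INTEL_PERFQUERIES_TYPE_BOOL            0x9406

/* Hypothetical record of what glGetPerfCounterInfoINTEL reported. */
typedef struct {
    uint32_t offset; /* byte offset of the counter in the result buffer */
    uint32_t size;   /* size of the counter in bytes */
    uint32_t type;   /* one of INTEL_PERFQUERIES_TYPE_* */
} CounterInfo;

/* Reads one counter from 'buffer' into *out. Returns 1 on success, or 0 if
 * the counter does not fit in the buffer or its size does not match its type.
 * Note that uint64 values above 2^53 lose precision when stored in a double. */
static int read_counter(const char *buffer, uint32_t bufferSize,
                        const CounterInfo *info, double *out)
{
    assert(buffer != NULL && info != NULL && out != NULL);
    if (info->offset > bufferSize || info->size > bufferSize - info->offset)
        return 0;

    switch (info->type)
    {
        case INTEL_PERFQUERIES_TYPE_UNSIGNED_INT:
        case INTEL_PERFQUERIES_TYPE_BOOL: /* stored as a 32-bit 0 or 1 */
        {
            uint32_t value;
            if (info->size != sizeof(value))
                return 0;
            memcpy(&value, buffer + info->offset, sizeof(value));
            *out = (double) value;
            return 1;
        }
        case INTEL_PERFQUERIES_TYPE_UNSIGNED_INT64:
        {
            uint64_t value;
            if (info->size != sizeof(value))
                return 0;
            memcpy(&value, buffer + info->offset, sizeof(value));
            *out = (double) value;
            return 1;
        }
        case INTEL_PERFQUERIES_TYPE_FLOAT:
        {
            float value;
            if (info->size != sizeof(value))
                return 0;
            memcpy(&value, buffer + info->offset, sizeof(value));
            *out = (double) value;
            return 1;
        }
        default:
            return 0; /* unknown type code */
    }
}
```

Returning a status instead of asserting lets a tool skip counters it doesn't understand, which seems prudent for an extension whose layout may change between driver versions.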

Example queries/counters

The following is the list of two queries and many counters reported by an Intel(R) HD Graphics 3000 with driver version 8.15.10.2509 on 64-bit Win7.

Intel_GT_Hardware_Counters (queryId=0x0206, bufferSize=216, maxQueries=32767)

Counter | Name | Category | Type | Description
1 | TotalTime | 0x9409 | uint64 | Total query time in microseconds.
2 | GPUBusyTime | 0x9408 | float | The fraction of time in which the 3D hardware was busy, i.e. the command ring was not empty. Usage: This counter can be used after normalization to assess the percentage of time in which the 3D hardware is busy and process 3D commands as the GPU ring is busy.
3 | PipelineFrontEndWaitTime | 0x9408 | float | The fraction of time in which the 3D Pipeline Front End waits. Usage: This counter can be used after normalization to assess the percentage of time in which the 3D Pipeline Front End Stage is idle and doesn't process any 3D commands because they are not available for decoding (ring is empty or commands are loading from memory). Normalization equation: PipelineFrontEndWaitClockCycles / CoreClocksCount
4 | PipelineFrontEndStallTime | 0x9408 | float | The fraction of time in which the 3D Pipeline Front End is stalled by the next pipeline stage. Usage: This counter can be used after normalization to assess the percentage of time in which Pipeline Front End of GPU is stalled by the next pipeline stage because its command queue is full. Normalization equation: PipelineFrontEndStallClockCycles / CoreClocksCount
5 | GfxCoresBusyTime | 0x9408 | float | The fraction of time in which each GFX Core was active. Usage: This counter can be used after normalization to assess the percentage of time in which the GFX Cores were actively processing shader instructions. Normalization equation: GfxCoresBusyClockCycles / (Number_of_GfxCores * CoreClocksCount).
6 | GfxCoresStallTime | 0x9408 | float | The fraction of time in which each GFX Core was suspended. Usage: This counter can be used after normalization to assess the percentage of time in which the GFX Cores were stalled (due to any reason) during processing shader instructions. Normalization equation: GfxCoresStallClockCycles / (Number_of_GfxCores * CoreClocksCount)
7 | VertexShaderActiveTime | 0x9408 | float | The fraction of time in which the GFX Cores were actively processing VS instructions. Usage: This counter can be used after normalization to assess the percentage of time in which GFX Cores were actively processing Vertex Shaders kernels. Normalization equation: VertexShaderActiveClockCycles / (Number_of_GfxCores * CoreClocksCount).
8 | VertexShaderStallTime | 0x9408 | float | The fraction of time in which the GFX Cores were stalled due to processing VS instruction Usage: This counter can be used after normalization to assess the percentage of time in which GFX Cores were stalled on processing Vertex Shaders kernels. Normalization equation: VertexShaderStallClockCycles / (Number_of_GfxCores * CoreClocksCount).
9 | VertexShaderWaitTime | 0x9408 | float | The fraction of time in which the Fragment Shader threads wait on completion of other threads (e.g. VS threads). Usage: This counter can be used after normalization to assess the percentage of time in which Fragment Shader kernels had to wait for completion of other type kernels. Normalization equation: FragmentShaderWaitClockCycles / (Number_of_GfxCores * CoreClocksCount).
10 | FragmentShaderActiveTime | 0x9408 | float | The fraction of time in which the Fragment Stage waits. Usage: This counter can be used after normalization to assess the percentage of time in which 3D Pipeline Fragment Stage is idle and doesn't process any 3D commands because its command queue is empty.
11 | FragmentShaderStallTime | 0x9408 | float | The fraction of time in which the GFX Cores were stalled due to processing PS instruction. Usage: This counter can be used after normalization to assess the percentage of time in which GFX Cores were stalled on processing Fragment Shaders kernels. Normalization equation: FragmentShaderStallClockCycles / (Number_of_GfxCores * CoreClocksCount).
12 | FragmentShaderWaitTime | 0x9408 | float | The fraction of time in which the Fragment Shader threads wait on completion of other threads (e.g. VS threads). Usage: This counter can be used after normalization to assess the percentage of time in which Fragment Shader kernels had to wait for completion of other type kernels. Normalization equation: FragmentShaderWaitClockCycles / (Number_of_GfxCores * CoreClocksCount).
13 | SamplerBusyTime | 0x9408 | float | The fraction of time in which the sampler was busy. Usage: This counter can be used after normalization to assess the percentage of time in which the Sampler unit was busy due to processing texture requests. Normalization equation: SamplerBusyClockCycles / CoreClocksCount.
14 | SamplerStallTime | 0x9408 | float | The fraction of time in which the sampler was stalled during processing texture requests. Usage: This counter can be used after normalization to assess the percentage of time in which the Sampler unit was stalled during processing texture requests.Normalization equation: SamplerStallClockCycles / CoreClocksCount.
15 | ClipperStageActiveTime | 0x9408 | float | The fraction of time in which the Clip Stage is active. Usage: This counter can be used after normalization to assess the percentage of time in which 3D Pipeline Clip Stage is active. Normalization equation: ClipStageActiveClockCycles / CoreClocksCount.
16 | TriangleSetupStageActiveTime | 0x9408 | float | The fraction of time in which the Triangle Setup Stage is active. Usage: This counter can be used after normalization to assess the percentage of time in which the 3D Pipeline Triangle Setup Stage is active. Normalization equation: TriangleSetupStageActiveTime / CoreClocksCount.
17 | VertexShaderActivePerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which theGFX Cores were actively processing VS instructions. Usage: This counter is normalized according to the equation below to assess the average number of thread cycles in which the Gfx Cores were actively processing shader instructions. Normalization equation: VertexShaderBusyClockCycles / VertexShaderThreadsCount/
18 | VertexShaderStallPerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which the GFX Cores were stalled due to processing VS instructions. Usage: This counter is normalized according to the equation below to assess the average number of thread cycles in which the Gfx Cores were stalled due to VS shader instructions. Normalization equation: VertexShaderStallClockCycles / VertexShaderThreadsCount.
19 | VertexShaderWaitPerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which the Vertex Shader threads wait on completion of other threads (e.g. FS threads).Usage: This counter can be used after normalization to assess the average number of thread cycles in which Vertex Shader kernels had to wait for completion of other type kernels. Normalization equation: VertexShaderWaitClockCycles / VertexShaderThreadsCount.
20 | FragmentShaderActivePerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which the GFX Cores were actively processing Fragment Shader instructions. Usage: This counter is normalized according to the equation below to assess the average number of thread cycles in which the Gfx Cores were actively processing shader instructions. Normalization equation: FragmentShaderBusyClockCycles / FragmentShaderThreadsCount.
21 | FragmentShaderStallPerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which the GFX Cores were stalled due to processing Fragment Shader instructions. Usage: This counter is normalized according to the equation below to assess the average number of thread cycles in which the Gfx Cores were stalled due to VS shader instructions. Normalization equation: FragmentShaderStallClockCycles / FragmentShaderThreadsCount.
22 | FragmentShaderWaitPerThread | 0x9409 | uint | The average (per thread invocation) number of render clock cycles in which the Fragment Shader threads wait on completion of other threads (e.g. FS threads). Usage: This counter can be used after normalization to assess the average number of thread cycles in which Fragment Shader kernels had to wait for completion of other type kernels. Normalization equation: FragmentShaderWaitClockCycles / FragmentShaderThreadsCount.
23 | VertexShaderThreadsCount | 0x9407 | uint64 | Number of times the Vertex Shader kernels have been executed on GFX Cores. Usage: This counter shows how many times Vertex Shader kernels have been executed on GFX Cores.
24 | FragmentShaderThreadsCount | 0x9407 | uint64 | Number of times the Fragment Shader kernels have been executed on GFX CoresUsage: This counter shows how many times Fragment Shader kernels have been executed on GFX Cores.
25 | FragmentsBlendedCount | 0x940b | uint64 | Number of fragments blended. The counter is in units of 4 fragments. Usage: This counter can be used to assess how many fragments have been processed by Color Blending Stage.
26 | SamplerTextureMemoryThroughput | 0x940a | uint64 | Number of bytes read from memory by sampler to fill texture requests. Usage: This counter can be used to assess the memory throughput consumed by sampler to do texturing.
27 | SamplerPostFilteredTexels | 0x940b | uint64 | Number of texels returned from the sampler. Usage: This counter can be used to assess how many fragments have been processed by sampler.
28 | GPUMemoryWriteThroughput | 0x940a | uint64 | Number of bytes written to GPU memory. Usage: This counter can be used to assess the total throughput of GPU memory writes.
29 | GPUMemoryReadThroughput | 0x940a | uint64 | Number of bytes read from GPU memory. Usage: This counter can be used to assess the total throughput of GPU memory reads.
30 | DepthBufferThroughput | 0x940a | uint64 | Number of bytes read/written from/to depth buffer. Usage: This counter can be used to assess the total throughput of depth buffer.
31 | FragmentsKillCount | 0x940b | uint64 | Number of times Fragment Shader performed a fragment-discard operation for fragment or sample
32 | AlphaTestFails | 0x940b | uint64 | Number of times a fragment has been discarded due to Alpha test
33 | PerfCounter1 | 0x9407 | uint64 | ODLAT Perf Mon counter 1.
34 | PerfCounter2 | 0x9407 | uint64 | ODLAT Perf Mon counter 2.
35 | CoreClocksCount | 0x940b | uint64 | Number of core clock cycles.
36 | SplitOccured | 0x9407 | bool | BOOL flag set to true if command buffer split has occurred during the query
37 | CoreFrequencyChanged | 0x9407 | bool | BOOL flag set to true if frequency has changed during the query
38 | CoreFrequency | 0x940b | uint64 | The most recent value of the GFX Cores frequency in Hz
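
Some of these raw values need a little arithmetic before they are useful. The sketch below takes the driver's descriptions at face value, which I haven't verified: TotalTime is in microseconds, the throughput counters are byte totals for the query period, and FragmentsBlendedCount is in units of 4 fragments. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a byte total (e.g. GPUMemoryReadThroughput) and TotalTime in
 * microseconds into bytes per second. Returns 0 for a zero-length query. */
static double bytes_per_second(uint64_t bytes, uint64_t totalTimeMicroseconds)
{
    if (totalTimeMicroseconds == 0)
        return 0.0;
    return (double) bytes * 1e6 / (double) totalTimeMicroseconds;
}

/* Converts FragmentsBlendedCount (reported in units of 4 fragments) into a
 * fragment count. */
static uint64_t fragments_blended(uint64_t counterValue)
{
    assert(counterValue <= UINT64_MAX / 4); /* guard against overflow */
    return counterValue * 4;
}
```

For example, 2000000 bytes read during a query whose TotalTime is 1000000 microseconds works out to 2000000 bytes per second.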

Intel_Pipeline_Query (queryId=0x0104, bufferSize=48, maxQueries=8191)

Counter | Name | Category | Type | Description
1 | VerticesCount | 0x940b | uint64 | Number of vertices sent to 3D hardware
2 | PrimitivesCount | 0x940b | uint64 | Number of primitives sent to 3D hardware
3 | VerticesProcessed | 0x940b | uint64 | Number of vertices processed by vertex shader
4 | ClipperInvocations | 0x9407 | uint64 | Number of times clipper has been invocated
5 | ClipperPrimitives | 0x940b | uint64 | Number of primitives processed by the clipper stage
6 | FragmentsRendered | 0x940b | uint64 | Number of fragments that have been rendered by