
Data collection configurations (Release 2.7)
P.J. Drongowski
11 June 2007

***************************************************************
Approach
***************************************************************

1. Provide one configuration for time-based profiling and one
   configuration for pipeline simulation.

2. Provide one configuration (using EBP) to collect data to
   compute several commonly used measurements for an overall
   assessment of performance. This configuration would:

     * Be a starting point for performance investigations,
       that is, indicate potential issues for investigation.
     * Provide "one-stop analysis" for engineers in a hurry and for
       novice or casual users.

3. Provide a small number (4 to 5) of configurations for
   in-depth investigation of the most common and important issues:
   branch/near return mispredictions, data access, instruction
   access, L2 cache access.

Considerations

   * For the next release, CodeAnalyst will be able to compute a
     ratio of two events. Formulas will be supported in a later
     release. A large number of useful measurements can be reported
     using ratios. However, in a few cases, formulas are required in
     order to compute more accurate measurements. Simpler, ratio-based
     measurements are given below.

   * Data collection should use at most two event groups. This will
     improve the accuracy of measurements by keeping the sample size
     of individual events as large as possible.
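
   The effect of the number of event groups on sample size can be
   sketched as follows. This is a rough estimate only. It assumes that
   multiplexing gives each event group an equal share of the run, and
   the function name and workload size are hypothetical.

```python
def expected_samples(event_count, period, groups):
    """Estimate the samples taken for one event under multiplexing.

    event_count: events the workload generates over the whole run
    period:      events per sample (the "Count" column of a configuration)
    groups:      number of event groups sharing the counters
    """
    if period <= 0 or groups <= 0:
        raise ValueError("period and groups must be positive")
    # Assumption: each group is live for 1/groups of the run.
    return event_count // (period * groups)


# Hypothetical workload: 10 billion retired instructions.
for groups in (1, 2, 4):
    print(groups, expected_samples(10_000_000_000, 250_000, groups))
```

   Under these assumptions, going from one event group to two halves
   the number of samples per event (40,000 to 20,000 in this example).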


***************************************************************
Processor-specific events
***************************************************************

With GH, new PMC events and unit masks were introduced. A few
of the DC and View configuration XML files were affected. 
The affected PMC events are:

    0x045 L1 DTLB miss and L2 DTLB hit    New unit mask bits
    0x046 L1 DTLB miss and L2 DTLB miss   New unit mask bits
    0x085 L1 ITLB miss and L2 ITLB miss   New unit mask bits

A new unit mask bit was also added to 0x07d Requests to L2 cache,
but that addition does not alter the quantities measured by the
L2 DC configuration.

New processor-specific directories (K7, K8 and GH) were created to hold
the XML files with processor-specific events and unit masks. The main
DCConfig directory continues to hold DC configuration files that
are common across all processors.
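
One way a tool could resolve a configuration file against this layout
is sketched below. The lookup order (processor-specific directory first,
then the common directory), the placement of the K7, K8 and GH
directories directly under DCConfig, and the function name find_config
are all assumptions made for illustration. They are not taken from the
CodeAnalyst sources.

```python
from pathlib import Path


def find_config(root, processor, filename):
    """Return the path of a DC configuration file, or None if absent.

    root:      the main DCConfig directory
    processor: processor-specific subdirectory name, e.g. "K8" or "GH"
    filename:  configuration file name, e.g. "data_access.xml"
    """
    root = Path(root)
    # Assumed order: processor-specific file first, then the common file.
    for candidate in (root / processor / filename, root / filename):
        if candidate.is_file():
            return candidate
    return None
```

With this order, a file such as tbp.xml that exists only in the common
directory is found for every processor, while a processor-specific
file takes precedence when present.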



***************************************************************
Configuration #1: Time-based profiling
***************************************************************

    File:            tbp.xml
    Name:            Time-based profile
    Interval:        1.0 ms


***************************************************************
Configuration #2: Pipeline simulation
***************************************************************

    File:            sim.xml
    Name:            Pipeline simulation
    Max to trace:    5
    Warm up caches:  True
    Run sim after:   True
    Save trace:      False
    Trace file name: "trace"
    Proc core:       "AMD Opteron 64"
    Multiplier:      15


***************************************************************
Configuration #3: Basic assessment (EBP)
***************************************************************

    File:            assess.xml
    Name:            Assess performance
    Mux period:      1

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      76     00    T    T 250000 CPU_clocks       CPU Clocks Not Halted
      C2     00    T    T  25000 Branches         Ret Branch Instructions
      C3     00    T    T  25000 Mispred_branches Ret Mispredicted Branch Inst

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      40     00    T    T 250000 DC_accesses      Data Cache Accesses
      41     00    T    T  25000 DC_misses        Data Cache Misses
      46     00    T    T  25000 DTLB_L1M_L2M     L1 DTLB Miss and L2 DTLB Miss
      47     00    T    T  25000 Misalign_access  Misaligned Accesses

    IPC = Ret_instructions / CPU_clocks
    CPI = CPU_clocks / Ret_instructions

    Branch rate = Branches / Ret_instructions
    Branch misprediction rate = Mispred_branches / Ret_instructions
    Branch misprediction ratio = Mispred_branches / Branches
    Instructions per branch = Ret_instructions / Branches

    Data cache request rate = DC_accesses / Ret_instructions
    Data cache miss rate = DC_misses / Ret_instructions
    Data cache miss ratio = DC_misses / DC_accesses

    Misaligned access rate = Misalign_access / Ret_instructions
    Misaligned access ratio = Misalign_access / DC_accesses

    L2 DTLB miss rate = DTLB_L1M_L2M / Ret_instructions
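
    The events in this configuration use two different sampling periods
    (250000 and 25000). If the measurements are computed from raw sample
    counts, each sample count must first be scaled by its period so that
    numerator and denominator are in the same units. A minimal sketch,
    assuming raw sample counts as input; the helper names and the sample
    values are made up:

```python
# Sampling periods from the "Count" column of the assess.xml tables.
PERIODS = {
    "Ret_instructions": 250000,
    "CPU_clocks": 250000,
    "Branches": 25000,
    "Mispred_branches": 25000,
    "DC_accesses": 250000,
    "DC_misses": 25000,
    "DTLB_L1M_L2M": 25000,
    "Misalign_access": 25000,
}


def event_counts(samples):
    """Scale raw sample counts to estimated event counts."""
    return {name: n * PERIODS[name] for name, n in samples.items()}


def ratio(counts, numerator, denominator):
    """Divide two estimated event counts; None when the denominator is 0."""
    if counts[denominator] == 0:
        return None
    return counts[numerator] / counts[denominator]


# Made-up sample counts, for illustration only.
samples = {"Ret_instructions": 4000, "CPU_clocks": 5000,
           "Branches": 6000, "Mispred_branches": 300}
counts = event_counts(samples)
ipc = ratio(counts, "Ret_instructions", "CPU_clocks")
branch_rate = ratio(counts, "Branches", "Ret_instructions")
mispredict_ratio = ratio(counts, "Mispred_branches", "Branches")
```

    Without the scaling, the branch rate in this example would come out
    ten times too high, because Branches is sampled ten times more often
    than Ret_instructions.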


***************************************************************
Configuration #4: Branch mispredictions
***************************************************************

    File:            branch.xml
    Name:            Investigate branching
    Mux period:      1

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      C2     00    T    T  25000 Branches         Ret Branch Instructions
      C3     00    T    T  25000 Mispred_branches Ret Mispredicted Branch Inst
      C4     00    T    T  25000 Taken_branches   Ret Taken Branch Instructions

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      C8     00    T    T  25000 Near_returns     Retired Near Returns
      C9     00    T    T  25000 Mispred_near_ret Ret Near Returns Mispredicted
      CA     00    T    T  25000 Mispred_indirect Ret Indirect Branches Mispred

    Branch rate = Branches / Ret_instructions
    Branch misprediction rate = Mispred_branches / Ret_instructions
    Branch misprediction ratio = Mispred_branches / Branches
    Instructions per branch = Ret_instructions / Branches

    Taken branch rate = Taken_branches / Ret_instructions
    Taken branch ratio = Taken_branches / Branches

    Return Stack misprediction rate = Mispred_near_ret / Ret_instructions
    Return Stack misprediction ratio = Mispred_near_ret / Near_returns

    Indirect branch misprediction rate = Mispred_indirect / Ret_instructions
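
    Every measurement in this configuration is a ratio of two events, so
    the whole set can be described as a table of (numerator, denominator)
    pairs. The sketch below shows that idea. It is not the CodeAnalyst
    implementation; the evaluate function and the event counts are
    hypothetical.

```python
# Each measurement is one ratio: (numerator event, denominator event).
BRANCH_MEASUREMENTS = {
    "Branch rate": ("Branches", "Ret_instructions"),
    "Branch misprediction rate": ("Mispred_branches", "Ret_instructions"),
    "Branch misprediction ratio": ("Mispred_branches", "Branches"),
    "Instructions per branch": ("Ret_instructions", "Branches"),
    "Taken branch rate": ("Taken_branches", "Ret_instructions"),
    "Taken branch ratio": ("Taken_branches", "Branches"),
    "Return Stack misprediction rate": ("Mispred_near_ret", "Ret_instructions"),
    "Return Stack misprediction ratio": ("Mispred_near_ret", "Near_returns"),
    "Indirect branch misprediction rate": ("Mispred_indirect", "Ret_instructions"),
}


def evaluate(measurements, counts):
    """Return {measurement name: value}; None where the denominator is 0."""
    results = {}
    for name, (num, den) in measurements.items():
        results[name] = counts[num] / counts[den] if counts[den] else None
    return results


# Made-up event counts, for illustration only.
counts = {
    "Ret_instructions": 1_000_000, "Branches": 200_000,
    "Mispred_branches": 10_000, "Taken_branches": 120_000,
    "Near_returns": 20_000, "Mispred_near_ret": 500,
    "Mispred_indirect": 1_000,
}
report = evaluate(BRANCH_MEASUREMENTS, counts)
for name, value in report.items():
    print(f"{name}: {value}")
```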


***************************************************************
Configuration #5: Data access
***************************************************************

    File:            data_access.xml
    Name:            Investigate data access
    Mux period:      1

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      40     00    T    T 250000 DC_accesses      Data Cache Accesses
      41     00    T    T  25000 DC_misses        Data Cache Misses
      42     1F    T    T  25000 DC_refills       Data Cache Refills (L2+Sys)

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      45     00    T    T  25000 DTLB_L1M_L2H     L1 DTLB Miss and L2 DTLB Hit
      46     00    T    T  25000 DTLB_L1M_L2M     L1 DTLB Miss and L2 DTLB Miss
      47     00    T    T  25000 Misalign_access  Misaligned Accesses

    Data cache request rate = DC_accesses / Ret_instructions
    Data cache miss rate = DC_misses / Ret_instructions
    Data cache miss ratio = DC_misses / DC_accesses

    Data cache refill rate = DC_refills / Ret_instructions
    Data cache refill ratio = DC_refills / DC_accesses

    L1 DTLB request rate = DC_accesses / Ret_instructions
    L1 DTLB miss rate = (DTLB_L1M_L2H + DTLB_L1M_L2M) / Ret_instructions  **
    L1 DTLB miss ratio = (DTLB_L1M_L2H + DTLB_L1M_L2M) / DC_accesses  **

    L2 DTLB request rate = (DTLB_L1M_L2H + DTLB_L1M_L2M) / Ret_instructions  **
    L2 DTLB miss rate = DTLB_L1M_L2M / Ret_instructions
    L2 DTLB miss ratio = DTLB_L1M_L2M / (DTLB_L1M_L2H + DTLB_L1M_L2M)  **

    Misaligned access rate = Misalign_access / Ret_instructions
    Misaligned access ratio = Misalign_access / DC_accesses

    Note: A more accurate method of measuring data cache performance is
    available, but it requires formula computation. Those measurements
    cannot be computed using ratios (division) alone.

    ** This measurement requires formula computation. It is included here
       for completeness.
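
    The measurements marked ** add two event counts before dividing,
    which is why a single ratio of two events cannot express them. A
    minimal sketch of the DTLB measurements, assuming event counts are
    already available; the function and argument names are hypothetical:

```python
def dtlb_measurements(ret_instructions, dc_accesses, l1m_l2h, l1m_l2m):
    """Compute the DTLB measurements from event counts.

    l1m_l2h: DTLB_L1M_L2H (L1 DTLB miss and L2 DTLB hit)
    l1m_l2m: DTLB_L1M_L2M (L1 DTLB miss and L2 DTLB miss)
    """
    if ret_instructions <= 0 or dc_accesses <= 0:
        raise ValueError("ret_instructions and dc_accesses must be positive")
    # Every L1 DTLB miss is either an L2 hit or an L2 miss, so the sum
    # is both the L1 miss count and the L2 request count.
    l1_misses = l1m_l2h + l1m_l2m
    return {
        "L1 DTLB miss rate": l1_misses / ret_instructions,
        "L1 DTLB miss ratio": l1_misses / dc_accesses,
        "L2 DTLB request rate": l1_misses / ret_instructions,
        "L2 DTLB miss rate": l1m_l2m / ret_instructions,
        "L2 DTLB miss ratio": l1m_l2m / l1_misses if l1_misses else None,
    }


# Made-up event counts, for illustration only.
result = dtlb_measurements(ret_instructions=1_000_000, dc_accesses=400_000,
                           l1m_l2h=3_000, l1m_l2m=1_000)
```

    Only "L2 DTLB miss rate" in this set is a plain ratio of two events.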


***************************************************************
Configuration #6: Instruction access
***************************************************************

    File:            inst_access.xml
    Name:            Investigate instruction access
    Mux period:      1

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      80     00    T    T 250000 IC_fetches       Instruction Cache Fetches
      81     00    T    T  25000 IC_misses        Instruction Cache Misses

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      84     00    T    T  25000 ITLB_L1M_L2H     L1 ITLB Miss and L2 ITLB Hit
      85     00    T    T  25000 ITLB_L1M_L2M     L1 ITLB Miss and L2 ITLB Miss

    Instruction cache request rate = IC_fetches / Ret_instructions
    Instruction cache miss rate = IC_misses / Ret_instructions
    Instruction cache miss ratio = IC_misses / IC_fetches

    L1 ITLB request rate = IC_fetches / Ret_instructions
    L1 ITLB miss rate = (ITLB_L1M_L2H + ITLB_L1M_L2M) / Ret_instructions **
    L1 ITLB miss ratio = (ITLB_L1M_L2H + ITLB_L1M_L2M) / IC_fetches **

    L2 ITLB request rate = (ITLB_L1M_L2H + ITLB_L1M_L2M) / Ret_instructions **
    L2 ITLB miss rate = ITLB_L1M_L2M / Ret_instructions
    L2 ITLB miss ratio = ITLB_L1M_L2M / (ITLB_L1M_L2H + ITLB_L1M_L2M) **

    Note: A more accurate method of measuring instruction cache performance is
    available, but it requires formula computation. Those measurements cannot
    be computed using ratios (division) alone.

    ** This measurement requires formula computation. It is included here
       for completeness.


***************************************************************
Configuration #7: L2 cache access
***************************************************************

    File:            l2_access.xml
    Name:            Investigate L2 cache access
    Mux period:      0 (no multiplexing)

    Select  Unit  OS User  Count Abbreviation     Event
    ------ ------ -- ---- ------ ---------------- -----------------------------
      C0     00    T    T 250000 Ret_instructions Retired Instructions
      7D     07    T    T  25000 L2_requests      Requests to L2 Cache
      7E     07    T    T  25000 L2_misses        L2 Cache Misses
      7F     03    T    T  25000 L2_fill_write    L2 Fill / Writeback

    L2 read request rate = L2_requests / Ret_instructions
    L2 write request rate = L2_fill_write / Ret_instructions
    L2 miss rate = L2_misses / Ret_instructions

    Note: A more accurate method of measuring L2 cache performance is
    available, but it requires formula computation and more event
    groups. Those measurements cannot be computed using ratios
    (division) alone.


***************************************************************
Configuration #8: Instruction-based sampling
***************************************************************

    File:            ibs.xml
    Name:            Instruction-based sampling
    Fetch sampling:  T
    Fetch max count: 250000
    Op sampling:     T
    Op max count:    250000

