Saturday, October 6, 2007

Performance Counter for Microblaze

It is not straightforward to find out exactly how much time a program spends in calculation. EDK offers only software profiling, which is not precise enough and probably doesn't work on a multiprocessor system without shared memory. We need something else.

Fortunately, there are existing examples to follow. Most 'big' processors provide performance counters, which are small hardware counters that run independently of software and are therefore more accurate. These counters can be configured to count many internal processor events, such as cache hits. On a PC it's common to read them with VTune or PAPI and then tune performance. So it's a good idea to build a performance counter for Microblaze if we need to tune it.

I chose the FSL bus as the interface between the counter and Microblaze, because the read/start/stop operations should be as lightweight as possible. FSL is low-overhead and predictable, which makes it an almost ideal candidate. The counter is written in VHDL, following the template generated by EDK. It's available at http://www.opencores.org/projects.cgi/web/performance_counter/overview

There are three parts (processes): the bus interface, the counter, and the overflow detector. The bus interface receives commands such as start, stop, and reset from the processor and sends the counter value back. If the counter overflows, the overflow detector records it, and the flag is returned on the FSL control bit alongside the counter value. Every time the processor reads the counter value, it should check whether an overflow occurred in between.
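To make the three parts concrete, here is a minimal host-side model in C of what the core does on each clock edge. It is only a sketch: perfcnt_model and perfcnt_clock() are hypothetical names, and the command values are inferred from the VHDL listing (FSL_S_Data(24) is enable, FSL_S_Data(25) is reset, with bit 0 as the MSB), so treat them as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command bits in the low byte of the FSL word (assumed encoding). */
enum { CMD_ENABLE = 0x80, CMD_RESET = 0x40 };

typedef struct {
    uint32_t counter;   /* performance counter value */
    int enable;         /* enable bit latched from the last command */
    int overflow;       /* set when the counter wraps, cleared by reset */
} perfcnt_model;

/* One clock edge. 'exists' says whether a command word is present. */
void perfcnt_clock(perfcnt_model *m, int exists, uint32_t data)
{
    assert(m != NULL);
    if (exists && (data & CMD_RESET)) {
        /* A reset command clears the counter and the overflow flag. */
        m->counter = 0;
        m->overflow = 0;
    } else if (m->enable) {
        /* Count, and remember a wrap from all ones back to zero. */
        if (m->counter == UINT32_MAX)
            m->overflow = 1;
        m->counter++;
    }
    /* The bus interface latches the new enable bit for the next cycle. */
    if (exists)
        m->enable = (data & CMD_ENABLE) != 0;
}
```

Note that a command takes effect one cycle after it arrives, because the counter looks at the enable bit latched on the previous edge.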

The performance counter was later integrated into BlazeCluster. If a processor is configured as
mb0, microblaze, 8k on-chip ram, barrel-shifter, fpu, perfcnt

a performance counter, an FPU, and a barrel shifter are instantiated and properly connected. perfcounter.c is the software driver. It contains the following functions:

1) reset_and_start_counter()
2) reset_and_stop_counter()
3) start_counter()
4) stop_counter()
5) read_counter()

The function names are self-explanatory. read_counter() reads from FSL twice and uses the second value, because the FSL bus has a one-level FIFO. The first word read is a stale value held in the FIFO, while the second is the current one.
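The read-twice behaviour is easy to see in a small host-side simulation. The sketch below is not the real perfcounter.c: fsl_put(), fsl_get(), and sim_run() are hypothetical stand-ins for the FSL link and the counter core, and the command values are inferred from the VHDL listing, so treat them as assumptions. On hardware, the two helpers would be replaced by the FSL access macros that come with EDK (for example putfsl and getfsl in mb_interface.h).

```c
#include <assert.h>
#include <stdint.h>

/* Command bits in the low byte of the FSL word (assumed encoding). */
enum { CMD_ENABLE = 0x80, CMD_RESET = 0x40 };

/* Simulated counter core and its one-word FSL FIFO. */
static uint32_t hw_counter, fifo_word;
static int hw_enable, fifo_full;

/* Advance the simulated hardware by n clock cycles. */
void sim_run(unsigned n)
{
    while (n--) {
        /* The core pushes the counter whenever the FIFO has room. */
        if (!fifo_full) { fifo_word = hw_counter; fifo_full = 1; }
        if (hw_enable) hw_counter++;
    }
}

/* Stand-in for a blocking FSL write of one command word. */
void fsl_put(uint32_t cmd)
{
    sim_run(1);                      /* the command takes a cycle to arrive */
    if (cmd & CMD_RESET) hw_counter = 0;
    hw_enable = (cmd & CMD_ENABLE) != 0;
}

/* Stand-in for a blocking FSL read of one data word. */
uint32_t fsl_get(void)
{
    if (!fifo_full) sim_run(1);      /* wait for the core to refill */
    assert(fifo_full);
    fifo_full = 0;
    return fifo_word;
}

void reset_and_start_counter(void) { fsl_put(CMD_RESET | CMD_ENABLE); }
void reset_and_stop_counter(void)  { fsl_put(CMD_RESET); }
void start_counter(void)           { fsl_put(CMD_ENABLE); }
void stop_counter(void)            { fsl_put(0); }

uint32_t read_counter(void)
{
    (void)fsl_get();   /* stale word that was sitting in the FIFO */
    return fsl_get();  /* fresh word pushed after the FIFO emptied */
}
```

In this model, calling reset_and_start_counter(), then sim_run(100), then stop_counter() and read_counter() returns 101: the 100 cycles of work plus the cycle in which the stop command arrives. A single fsl_get() at that point would return the value that had been waiting in the FIFO since before the measurement started.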

Source code

-----------------------------------------------------------------------------
-- perfcounter - entity/architecture pair
------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-------------------------------------------------------------------------------------
--
--
-- Definition of Ports
-- FSL_Clk : Synchronous clock
-- FSL_Rst : System reset, should always come from FSL bus
-- FSL_S_Clk : Slave asynchronous clock
-- FSL_S_Read : Read signal, requiring next available input to be read
-- FSL_S_Data : Input data
-- FSL_S_Control : Control Bit, indicating the input data is a control word
-- FSL_S_Exists : Data Exist Bit, indicating data exist in the input FSL bus
-- FSL_M_Clk : Master asynchronous clock
-- FSL_M_Write : Write signal, enabling writing to output FSL bus
-- FSL_M_Data : Output data
-- FSL_M_Control : Control Bit, indicating the output data is a control word
-- FSL_M_Full : Full Bit, indicating output FSL bus is full
--
-------------------------------------------------------------------------------

entity perfcounter is
port
(
-- DO NOT EDIT BELOW THIS LINE ---------------------
-- Bus protocol ports, do not add or delete.
FSL_Clk : in std_logic;
FSL_Rst : in std_logic;
FSL_S_Clk : out std_logic;
FSL_S_Read : out std_logic;
FSL_S_Data : in std_logic_vector(0 to 31);
FSL_S_Control : in std_logic;
FSL_S_Exists : in std_logic;
FSL_M_Clk : out std_logic;
FSL_M_Write : out std_logic;
FSL_M_Data : out std_logic_vector(0 to 31);
FSL_M_Control : out std_logic;
FSL_M_Full : in std_logic
-- DO NOT EDIT ABOVE THIS LINE ---------------------
);

attribute SIGIS : string;
attribute SIGIS of FSL_Clk : signal is "Clk";
attribute SIGIS of FSL_S_Clk : signal is "Clk";
attribute SIGIS of FSL_M_Clk : signal is "Clk";

end perfcounter;

architecture EXAMPLE of perfcounter is

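-- cmd - last command word from the processor: cmd(0) = enable, cmd(1) = reset
-- counter - performance counter
-- overflow - set when the counter wraps, returned on FSL_M_Control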

signal cmd : std_logic_vector(0 to 3);
signal counter : std_logic_vector(0 to 31);
signal overflow : std_logic;

begin

FSL_S_Read <= FSL_S_Exists when FSL_Rst = '0' else '0';
FSL_M_Write <= not FSL_M_Full when FSL_Rst = '0' else '0';

FSL_M_Data <= counter;
FSL_M_Control <= overflow;

INPUT : process(FSL_CLK) is
begin
if FSL_Clk'event and FSL_Clk = '1' then
if FSL_Rst = '1' then
cmd <= X"0";
else
if (FSL_S_Exists = '1') then
cmd <= FSL_S_Data(24 to 27);
end if;
end if;
end if;
end process INPUT;

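-- Counter: cleared by a reset command, incremented while enabled.
CNT : process(FSL_CLK) is
begin
if FSL_Clk'event and FSL_Clk = '1' then
if FSL_S_Data(25) = '1' and FSL_S_Exists = '1' then
counter <= X"00000000";
else
if (cmd(0) = '1') then
counter <= std_logic_vector(unsigned(counter) + 1);
end if;
end if;
end if;
end process CNT;

-- Overflow detector. overflow is driven only by this process, so the
-- signal has a single driver. With a (0 to 31) range, counter(31) is
-- the least significant bit, so watching its falling edge would fire
-- every other count. Instead, set the flag on the clock edge where an
-- enabled, all-ones counter wraps, and clear it on a reset command.
OV : process(FSL_CLK) is
begin
if FSL_Clk'event and FSL_Clk = '1' then
if FSL_S_Data(25) = '1' and FSL_S_Exists = '1' then
overflow <= '0';
elsif cmd(0) = '1' and counter = X"FFFFFFFF" then
overflow <= '1';
end if;
end if;
end process OV;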

end architecture EXAMPLE;
