How I scripted testing for SystemVerilog

In Python and other languages, I’m used to having test suites and a handy runner to pull it all together. While digital logic designers are usually pretty good about writing test benches to go along with the RTL code, I didn’t find a lot of resources out on the interwebs that described a process for automating tests against SystemVerilog code. In my desire to have something similar to continuous integration testing for RTL I explored what options I’d have for scripting the process.

Design Under Test

The motivator for this exercise is that I wanted to extend my shift register implementation to support a configurable depth. Initially it would shift data after one clock cycle, but it can be useful to have other delays as well. I also wanted to use this as an example for the SystemVerilog package manager I’ve been slapping together, so I wanted some way to tell early if I do something dumb that’ll impact other projects that use this shift register.

So I extended my module to include this new depth parameter:

module shift_register
    #(parameter width = 1,
      parameter depth = 1) (
  input logic clock,
  input logic [0:width-1] in,
  output logic [0:width-1] out);

  logic [0:depth-1][0:width-1] data;

  always_ff @ (posedge clock) begin
    {out, data} <= {data, in};
  end
endmodule
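To reason about the expected delay before writing assertions, a small behavioral model can help. The sketch below is a Python model of the module above, not part of the original project; the class name ShiftRegisterModel and its tick method are made up for illustration. It mimics the non-blocking assignment {out, data} <= {data, in}: on every clock edge, out takes the oldest stored word and in enters at the other end.

```python
class ShiftRegisterModel:
    """Behavioral model of shift_register (hypothetical helper, not from the project)."""

    def __init__(self, depth=1, initial=0):
        # data[0] is the oldest word, data[-1] the most recently shifted in.
        self.data = [initial] * depth
        self.out = initial

    def tick(self, value):
        """Simulate one rising clock edge and return the new output."""
        # Equivalent to {out, data} <= {data, in}.
        self.out = self.data[0]
        self.data = self.data[1:] + [value]
        return self.out


if __name__ == "__main__":
    reg = ShiftRegisterModel(depth=1)
    print([reg.tick(v) for v in [1, 0, 0]])
```

With depth=1 the model returns 0, 1, 0 for the inputs 1, 0, 0, so a value shows up on out one edge after it was captured. A larger depth pushes that out by one more edge per stage.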

To test this I wrote this basic test bench:

module test_shift_register_default();

  logic clock;
  logic data_in;
  logic data_out;

  shift_register dut(clock, data_in, data_out);

  initial begin
    clock <= 0;
    data_in <= 0;
    #3 data_in <= 1;
    #4 assert (data_out == 0);
    data_in <= 0;
    #4 assert(data_out == 1);
    #5 $finish();
  end

  // 2ns clock
  always #2 clock <= ~clock;

endmodule

This test runs some basic assertions to validate the vanilla version of this shift module. I wrote tests for a few parameter permutations as well.



Vivado Batch Mode

Since the cards I have at work to tinker with use Xilinx-based FPGAs, I decided to see how I could go through this process using Vivado. I read partially through an official Vivado simulation tutorial, in particular chapter 3 on running simulations in “batch mode”. After some tinkering with the process I found this was a lot easier than I initially expected.

The first thing I needed to do was use xvlog to parse and build my SystemVerilog code.

$ xvlog --sv shift_register.sv tests/*.sv
INFO: [VRFC 10-2263] Analyzing SystemVerilog file "/home/kwilke/projects/shift-register/shift_register.sv" into library work
INFO: [VRFC 10-311] analyzing module shift_register
INFO: [VRFC 10-2263] Analyzing SystemVerilog file "/home/kwilke/projects/shift-register/tests/test_16bit_delayed.sv" into library work
INFO: [VRFC 10-311] analyzing module test_shift_register_16bit_delayed
INFO: [VRFC 10-2263] Analyzing SystemVerilog file "/home/kwilke/projects/shift-register/tests/test_8bit.sv" into library work
INFO: [VRFC 10-311] analyzing module test_shift_register_8bit
INFO: [VRFC 10-2263] Analyzing SystemVerilog file "/home/kwilke/projects/shift-register/tests/test_default.sv" into library work
INFO: [VRFC 10-311] analyzing module test_shift_register_default
INFO: [VRFC 10-2263] Analyzing SystemVerilog file "/home/kwilke/projects/shift-register/tests/test_delayed.sv" into library work
INFO: [VRFC 10-311] analyzing module test_shift_register_delayed

This creates an xsim.dir directory that has a file built for each module in my project, and a few log files.

Next, I use xelab to create a simulation snapshot. I’m not sure what a simulation snapshot is, but I know the xsim binary enjoys working with them! Generally I’m just giving it an argument for the module I want to use as my top level entity.

$ xelab test_shift_register_default
Vivado Simulator 2015.4
Copyright 1986-1999, 2001-2015 Xilinx, Inc. All Rights Reserved.
Running: /opt/Xilinx/Vivado/2015.4/bin/unwrapped/lnx64.o/xelab test_shift_register_default 
Multi-threading is on. Using 2 slave threads.
Starting static elaboration
Completed static elaboration
Starting simulation data flow analysis
Completed simulation data flow analysis
Time Resolution for simulation is 1ns
Compiling module work.shift_register
Compiling module work.test_shift_register_default
Built simulation snapshot work.test_shift_register_default

****** Webtalk v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source /home/kwilke/projects/shift-register/xsim.dir/work.test_shift_register_default/webtalk/xsim_webtalk.tcl -notrace
INFO: [Common 17-206] Exiting Webtalk at Thu Mar 17 08:45:07 2016...

This adds some more data to the xsim.dir and brings it to a state where I can run the tests. At this point I can use xsim with the -R flag to run the simulation until it ends, then quit.

$ xsim -R test_shift_register_default

****** xsim v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source xsim.dir/work.test_shift_register_default/xsim_script.tcl
# xsim {work.test_shift_register_default} -maxdeltaid 10000 -autoloadwcfg -runall
Vivado Simulator 2015.4
Time resolution is 1 ns
run -all
$finish called at time : 16 ns : File "/home/kwilke/projects/shift-register/tests/test_default.sv" Line 24
exit
INFO: [Common 17-206] Exiting xsim at Thu Mar 17 08:48:42 2016...

Automating the process

Initially, I whipped up a Makefile to allow me to run a make test to test my project. Since the xsim program doesn’t exit with an error status code for assertion failures, I was piping its output to a small shell script. This process was a little wonky.
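For a rough idea of what that output check looks like when moved into Python, here is a minimal sketch. It is not Packilog’s actual driver. The function names are hypothetical, and the failure pattern is an assumption: I’m guessing that a failed assertion shows up as a log line starting with “Error:” or “Fatal:”, so check that against what your xsim version really prints. The xelab and xsim -R invocations are the same ones shown above, and the sources are assumed to have been parsed with xvlog already.

```python
import re
import subprocess

# Assumption: failed assertions appear in the simulator log as lines that
# start with "Error:" or "Fatal:". Adjust to match your xsim version.
FAILURE_PATTERN = re.compile(r"^(Error|Fatal):", re.MULTILINE)


def log_has_failure(log_text):
    """Return True when the simulator output looks like a failed run."""
    return FAILURE_PATTERN.search(log_text) is not None


def run_test(top_module):
    """Elaborate and simulate one test bench, returning True on a pass."""
    subprocess.run(["xelab", top_module], check=True)
    result = subprocess.run(
        ["xsim", "-R", top_module],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # The exit status alone is not enough, so the log is scanned as well.
    return result.returncode == 0 and not log_has_failure(result.stdout)
```

Keeping the log check in its own function means it can be unit tested without having Vivado installed.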

At this point I decided to add it to Packilog so I could control the process a little easier with Python. I found this to be a pleasant and familiar type of flow.

$ packilog -t
Using vivado testing driver
Building tests.
Running test module "test_shift_register_default"
Simulating test module 'test_shift_register_default'.

****** xsim v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source xsim.dir/work.test_shift_register_default/xsim_script.tcl
# xsim {work.test_shift_register_default} -maxdeltaid 10000 -autoloadwcfg -runall
Vivado Simulator 2015.4
Time resolution is 1 ns
run -all
$finish called at time : 16 ns : File "/home/kwilke/projects/shift-register/tests/test_default.sv" Line 24
exit
INFO: [Common 17-206] Exiting xsim at Thu Mar 17 08:53:11 2016...

Result: PASS
Running test module "test_shift_register_delayed"
Simulating test module 'test_shift_register_delayed'.

****** xsim v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source xsim.dir/work.test_shift_register_delayed/xsim_script.tcl
# xsim {work.test_shift_register_delayed} -maxdeltaid 10000 -autoloadwcfg -runall
Vivado Simulator 2015.4
Time resolution is 1 ns
run -all
$finish called at time : 32 ns : File "/home/kwilke/projects/shift-register/tests/test_delayed.sv" Line 28
exit
INFO: [Common 17-206] Exiting xsim at Thu Mar 17 08:53:13 2016...

Result: PASS
Running test module "test_shift_register_8bit"
Simulating test module 'test_shift_register_8bit'.

****** xsim v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source xsim.dir/work.test_shift_register_8bit/xsim_script.tcl
# xsim {work.test_shift_register_8bit} -maxdeltaid 10000 -autoloadwcfg -runall
Vivado Simulator 2015.4
Time resolution is 1 ns
run -all
$finish called at time : 16 ns : File "/home/kwilke/projects/shift-register/tests/test_8bit.sv" Line 24
exit
INFO: [Common 17-206] Exiting xsim at Thu Mar 17 08:53:25 2016...

Result: PASS
Running test module "test_shift_register_16bit_delayed"
Simulating test module 'test_shift_register_16bit_delayed'.

****** xsim v2015.4 (64-bit)
  **** SW Build 1412921 on Wed Nov 18 09:44:32 MST 2015
  **** IP Build 1412160 on Tue Nov 17 13:47:24 MST 2015
    ** Copyright 1986-2015 Xilinx, Inc. All Rights Reserved.

source xsim.dir/work.test_shift_register_16bit_delayed/xsim_script.tcl
# xsim {work.test_shift_register_16bit_delayed} -maxdeltaid 10000 -autoloadwcfg -runall
Vivado Simulator 2015.4
Time resolution is 1 ns
run -all
$finish called at time : 21 ns : File "/home/kwilke/projects/shift-register/tests/test_16bit_delayed.sv" Line 24
exit
INFO: [Common 17-206] Exiting xsim at Thu Mar 17 08:53:28 2016...

Result: PASS
Package Test Result: PASS

Hopefully the test driver pattern in Packilog will allow for easy cross-tool testing, as I would like to get support for a variety of common tools. I’m pretty happy with this flow so far because it allows me to write my code in my editor of choice and use a simple command line program to pull it all together.

Hello AFU – Part 6

This is the 6th and final part of my Hello AFU tutorial. In the last post, I started building out a state machine for the AFU and read from the data structure that the WED points to. In this post, I’ll finish off the state machine: pulling down the data in our stripes, XORing them together, and writing that data back to userland.

Reading the Stripes

Since the largest memory size I can request via the PSL is 128 bytes, I’ll make requests for that amount. I need a scratch pad for this data, so I’ll add two 1024-bit internal registers for these chunks of data. I’ll also need a variable to know when I’ve received both chunks, so I’ll set up a small register for that as well.

logic [0:1023] stripe1_data;
logic [0:1023] stripe2_data;
logic stripe_received;

In my REQUEST_STRIPES state I’ll request data from stripe1 in one cycle, then stripe2 in the next. I’ll use the command’s tag to know where I am in that process. I’ll set stripe_received to 0 to indicate I’ve not yet retrieved either.

REQUEST_STRIPES: begin
  command_out.valid <= 1;
  command_out.size <= 128;
  command_out.command <= READ_CL_NA;
  if (command_out.tag == REQUEST_READ) begin
    command_out.tag <= STRIPE1_READ;
    command_out.address <= request.stripe1;
  end else begin
    command_out.tag <= STRIPE2_READ;
    command_out.address <= request.stripe2;
    current_state <= WAITING_FOR_STRIPES;
    stripe_received <= 0;
  end
end

With the requests for stripe data sent, I need to wait for the data to come back. This could happen in any order, so I need to be ready for either.

WAITING_FOR_STRIPES: begin
  command_out.valid <= 0;
  if (buffer_in.write_valid) begin
    case(buffer_in.write_tag)
      STRIPE1_READ: begin
        if (buffer_in.write_address  == 0) begin
          stripe1_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe1_data[512:1023] <= buffer_in.write_data;
        end
      end
      STRIPE2_READ: begin
        if (buffer_in.write_address == 0) begin
          stripe2_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe2_data[512:1023] <= buffer_in.write_data;
        end
      end
    endcase
  end
end

In the same state, I’ll look for the tags to come in over the response interface. On the first response I set the stripe_received register; on the second, the state progresses to WRITE_PARITY.

if (response.valid) begin
  if (response.tag == STRIPE1_READ ||
      response.tag == STRIPE2_READ) begin
    if (stripe_received) begin
      current_state <= WRITE_PARITY;
    end else begin
      stripe_received <= 1;
    end
  end
end



Where is this Parity?

I decided to compute the parity of the stripes via assign, creating one new internal variable, parity_data, that can be referenced for the XOR’d value of stripe1 and stripe2.

logic [0:1023] parity_data;

assign parity_data = stripe1_data ^ stripe2_data;

[Figure: parity]
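To make the XOR parity idea concrete in software terms, here is a small Python sketch. It is only a model of what the assign above computes, with a hypothetical function name, and it works on bytes rather than on the 1024-bit registers. It also shows why this kind of parity is useful: XORing the parity with one stripe gives back the other.

```python
def xor_parity(stripe1, stripe2):
    """Return the byte-wise XOR of two equally sized stripes."""
    if len(stripe1) != len(stripe2):
        raise ValueError("stripes must be the same length")
    return bytes(a ^ b for a, b in zip(stripe1, stripe2))


if __name__ == "__main__":
    s1 = bytes(range(128))
    s2 = bytes([0xFF] * 128)
    parity = xor_parity(s1, s2)
    # XORing the parity with one stripe recovers the other stripe.
    print(xor_parity(parity, s2) == s1)
```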

Since I set the buffer latency to 1, the data being put on the buffer for writing to memory needs to be shifted back a cycle.

logic [0:511] write_buffer;

shift_register #(512) write_shift (
  .clock(clock),
  .in(write_buffer),
  .out(buffer_out.read_data));

Now I need to write the parity data to the memory at request.parity. This is pretty similar to reading memory. I’ll send a WRITE_NA (“write, no allocate”) command and align my data with buffer_out.read_data, returning the low half for address 0 and the high half for address 1.

WRITE_PARITY: begin
  if (command_out.tag != PARITY_WRITE) begin
    command_out.command <= WRITE_NA;
    command_out.address <= request.parity;
    command_out.tag <= PARITY_WRITE;
    command_out.valid <= 1;
  end else begin
    command_out.valid <= 0;
    // Read half depending on address
    if (buffer_in.read_address == 0)  begin
      write_buffer <= parity_data[0:511];
    end else begin
      write_buffer <= parity_data[512:1023];
    end
    // Handle response
    if (response.valid &&
        response.tag == PARITY_WRITE) begin
        current_state <= DONE;
    end
  end
end

After the parity is written, the job is complete. The state progresses to DONE when the write comes back on the response interface.

Aligned Writing

Writing the done flag is a little trickier, since it is not on a 128- or 64-byte alignment. The PSL can handle writing to any address, but the data must be aligned within the 128-byte read bus. If the data you’re writing is 64 bytes or less, you can let the same data sit on the buffer interface for both addresses.

In this case, the done field is 32 bytes past the WED, and I’m doing a 1-byte write. I’ll align my data starting at the 256th bit, writing 8 bits. I’ll write a 1 in the first byte to set the little-endian unsigned 64-bit number to a non-zero value.
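The arithmetic behind that bit position can be sketched in Python. This is a model under my own assumptions rather than anything from the PSL documentation: the helper name is hypothetical, and it assumes a byte’s position on the bus is simply its offset within a 128-byte line, split across two 512-bit halves.

```python
CACHELINE_BYTES = 128   # size of one full buffer transfer
HALF_LINE_BITS = 512    # width of the buffer data bus


def buffer_position(address):
    """Return (half, first_bit) for a byte address.

    half is 0 or 1 (which 512-bit half holds the byte) and first_bit is
    the index of the byte's first bit within that half.
    """
    bit_in_line = (address % CACHELINE_BYTES) * 8
    return bit_in_line // HALF_LINE_BITS, bit_in_line % HALF_LINE_BITS


if __name__ == "__main__":
    # A WED at 0x7fa500 puts the done field at 0x7fa520.
    print(buffer_position(0x7FA500 + 32))
```

For a 128-byte-aligned WED, the done field at offset 32 lands in half 0 at bit 256, which is where the data is placed in the DONE state.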

DONE: begin
  if (command_out.tag != DONE_WRITE) begin
    command_out.tag <= DONE_WRITE;
    command_out.size <= 1;
    command_out.address <= wed + 32;
    command_out.valid <= 1;
    // 1-byte write: only the first byte of the done field is driven
    write_buffer[256:263] <= 1;
  end else begin
    command_out.valid <= 0;
  end
end

With that, the parity is written and the userspace application can see when it completes. Here’s the output from the test_afu application.

INFO:Connecting to host 'localhost' port 16384
[example structure
  example: 0x7fa500
  example->size: 128
  example->stripe1: 0x7fa600
  example->stripe2: 0x7fa780
  example->parity: 0x7fa880
  &(example->done): 0x7fa520
Attached to AFU
Waiting for completion by AFU
done: 0
done: 0
done: 1
PARITY:
That is some proper parity! This is exactly what I'm expecting to see. I'd also like to see this running on some real gear soon
Releasing AFU

That completes the basic function of this AFU. I’ll commit my changes here.

Larger buffers

Now I’ll extend the design to support buffers larger than 128 bytes. This just requires an offset register that keeps track of the current position relative to the total size of the buffer to generate parity for.

I’ll start by adding a new variable for the offset that has the same data type as size.

longint unsigned offset;

Then I’ll set it to 0 in the START state.

offset <= 0;

In the REQUEST_STRIPES state I’ll add the offset to the stripe pointers.

command_out.address <= request.stripe1 + offset;

In the WRITE_PARITY state I’ll add the offset to the parity pointer, and check to see if the operation is complete.

command_out.address <= request.parity + offset;
if (offset + 128 < request.size) begin
  offset <= offset + 128;
  current_state <= REQUEST_STRIPES;
end else begin
  current_state <= DONE;
end
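To check which chunks this loop visits for a given size, here is a quick Python model of the offset logic. The generator name is hypothetical, and it only mirrors the comparison and increment shown above, not the rest of the state machine.

```python
CHUNK_BYTES = 128


def chunk_offsets(size):
    """Yield the offsets processed for a buffer of `size` bytes."""
    offset = 0
    while True:
        yield offset
        # Same test as the WRITE_PARITY state: keep going while another
        # chunk starts before the end of the buffer.
        if offset + CHUNK_BYTES < size:
            offset += CHUNK_BYTES
        else:
            break


if __name__ == "__main__":
    print(list(chunk_offsets(256)))
```

One thing the model makes visible: a size that is not a multiple of 128 still gets a full 128-byte final chunk, so the buffers would presumably need to be padded to a 128-byte boundary.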

With that I’d say this AFU is good enough for this tutorial. I’ll commit my changes and welcome pull requests if you find improvements to this tutorial. Hope this helps you hack on CAPI!

Hello AFU – Part 5

This is part 5 of my Hello AFU tutorial. In the last post, I built the C application that would attach and utilize the AFU that’s the focus of these posts. In this post I’ll start pulling data from the application’s memory space into the AFU and read the WED structure.

Keeping it Running

Before I start requesting data, some modifications are necessary to notify the underlying systems that the AFU is running. So far, I’m not managing the ah_jrunning signal that should be set high when the AFU is performing a task. After a short time the PSL will stop driving the AFU’s clock if the AFU hasn’t raised the ah_jrunning signal, so let’s quickly fix this and improve the parity_afu module a little bit.

I’ll refactor the always_ff block of the parity_afu module to use a case statement to handle commands and add handling for the START command in addition to our existing RESET command.

always_ff @(posedge clock) begin
  if(job_in.valid) begin
    case(job_in.command)
      RESET: begin
        jdone <= 1;
        job_out.running <= 0;
      end
      START: begin
        jdone <= 0;
        job_out.running <= 1;
      end
    endcase
  end else begin
    jdone <= 0;
  end
end

Now that I’m setting job_out.running, I’ll also remove my static assignment of that signal. These changes are committed here.



Planning for the Work Element Submodule

The groundwork to actually deal with the issue at hand is almost completely laid out. The module that will do the real work will have considerably more complexity than the components so far, so I’ll start planning and creating a new module to hold that functionality: my parity_workelement.

First I’ll define the inputs and outputs of this module:

Direction | Name | Purpose
Input | clock | Clock signal to follow
Input | enabled | High while AFU is in running state
Input | reset | Signal triggering reset of internal state
Input | wed | The WED pointer from userspace
Input | buffer_in | For reading userspace buffer data
Input | response | To check responses of commands
Output | command_out | To request buffer reads and writes
Output | buffer_out | For writing userspace buffer data

We’ll also define a mostly linear finite state machine to describe the work to be done.

State | Purpose | Next State
START | Request data at WED | WAITING_FOR_REQUEST
WAITING_FOR_REQUEST | Wait for WED data to be available | REQUEST_STRIPES
REQUEST_STRIPES | Send commands to read stripe1 and stripe2 | WAITING_FOR_STRIPES
WAITING_FOR_STRIPES | Wait for stripe data to be available | WRITE_PARITY
WRITE_PARITY | Write XOR’d parity from stripes back to memory | REQUEST_STRIPES if more data to read; DONE otherwise
DONE | Write done flag and halt. | n/a
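Before writing any RTL, the transitions in this table can be sanity checked with a tiny software model. The Python sketch below is only an illustration. The next_state function and its more_data flag are hypothetical names, and DONE is modeled as staying in DONE since the table lists no next state for it.

```python
from enum import Enum, auto


class State(Enum):
    START = auto()
    WAITING_FOR_REQUEST = auto()
    REQUEST_STRIPES = auto()
    WAITING_FOR_STRIPES = auto()
    WRITE_PARITY = auto()
    DONE = auto()


def next_state(state, more_data=False):
    """Return the state that follows `state` once its work is finished."""
    if state is State.WRITE_PARITY:
        # The only branch in the otherwise linear machine.
        return State.REQUEST_STRIPES if more_data else State.DONE
    linear = {
        State.START: State.WAITING_FOR_REQUEST,
        State.WAITING_FOR_REQUEST: State.REQUEST_STRIPES,
        State.REQUEST_STRIPES: State.WAITING_FOR_STRIPES,
        State.WAITING_FOR_STRIPES: State.WRITE_PARITY,
        State.DONE: State.DONE,
    }
    return linear[state]
```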

Now I’ll write the first couple portions of this module. I’ll create an enumeration that contains the various states used by the module. In the module definition itself I’ll define the input/output ports and create an internal register for the current_state. While I’m in here I’ll set up some signals with assign, mostly some settings I don’t want to change and a few parity generators as well. Lastly I’ll start off the always_ff block that’ll contain the reset logic and the case statement that implements my state machine.

import CAPI::*;

typedef enum {
  START,
  WAITING_FOR_REQUEST,
  REQUEST_STRIPES,
  WAITING_FOR_STRIPES,
  WRITE_PARITY,
  DONE
} state;

module parity_workelement (
  input logic clock,
  input logic enabled,
  input logic reset,
  input pointer_t wed,
  input BufferInterfaceInput buffer_in,
  input ResponseInterface response,
  output CommandInterfaceOutput command_out,
  output BufferInterfaceOutput buffer_out
);

  state current_state;

  assign command_out.abt = 0,
         command_out.context_handle = 0,
         buffer_out.read_latency = 1,
         command_out.command_parity = ~^command_out.command,
         command_out.address_parity = ~^command_out.address,
         command_out.tag_parity = ~^command_out.tag,
         buffer_out.read_parity = ~^buffer_out.read_data;

  always_ff @ (posedge clock) begin
    if (reset) begin
      current_state <= START;
    end else if (enabled) begin
      case(current_state)
        START: begin
          $display("Started!");
        end
      endcase
    end
  end

endmodule

With that defined, I’ll modify my parity_afu module to include an instance of my parity_workelement:

parity_workelement workelement(
  .clock(clock),
  .enabled(job_out.running),
  .reset(jdone),
  .wed(job_in.address),
  .buffer_in(buffer_in),
  .response(response),
  .command_out(command_out),
  .buffer_out(buffer_out));

To reduce how much I’m looking at during simulation, I’ll also modify my test.do to just show what’s going on in my workelement.

vsim work.top
add wave -position insertpoint sim:/top/a0/svAFU/workelement/*
run 136

Since this is a significant amount of code I’ll commit here before implementing the state machine.

Requesting Data

Requesting the WED data will be easy enough, but I first want a handy container to put it in, so I’ll define a new type in SystemVerilog that matches my WED structure in C. I skip the done field, as I don’t need to look at what’s currently in there; I can set that later by its offset relative to the WED.

typedef struct {
  longint unsigned size;
  pointer_t stripe1;
  pointer_t stripe2;
  pointer_t parity;
} parity_request;

Next I’ll add an internal register to the parity_workelement module that can hold this structure.

parity_request request;

To use the PSL’s Command Interface to request this data, the PSL requires that each active command has a unique tag ID. I’ll define another enum that will be used to automatically ensure I have a unique tag for each purpose.

typedef enum logic [0:7] {
  REQUEST_READ,
  STRIPE1_READ,
  STRIPE2_READ,
  PARITY_WRITE,
  DONE_WRITE
} request_tag;

The simplest way to request data from userspace is using the READ_CL_NA, or “read cacheline, no allocate”, command. I’ll request a read size of 32 bytes, as I’m reading in four 64-bit fields. I’ll set the tag to REQUEST_READ and use the wed as my address. As with the other interfaces, I need to set a valid signal high for 1 clock. I’ll do this by setting it high in the START state, transitioning to the WAITING_FOR_REQUEST state, and setting it back low there.

case(current_state)
  START: begin
    command_out.command <= READ_CL_NA;
    command_out.tag <= REQUEST_READ;
    command_out.size <= 32;
    command_out.address <= wed;
    command_out.valid <= 1;
    current_state <= WAITING_FOR_REQUEST;
  end
  WAITING_FOR_REQUEST: begin
    command_out.valid <= 0;
  end
endcase

When the data I’ve requested comes back, it’ll come via two writes on the buffer_in.write_data bus. This bus is 512 bits wide, but supports 128 byte (1024 bit) requests. As such, there are two writes that occur to deliver the lower (address 0) and higher (address 1) halves. Since I’ve only requested 32 bytes, the data will be in the first 256 bits of the writes to address 0 for the REQUEST_READ tag.

One important thing to look out for is that you can get multiple cycles of data on this bus, so you need to capture that data until the response interface lets you know the last cycle was valid.

With this in mind I’ll read the buffer interface each time the valid signal is high, the tag is mine, and the address is the one I’m looking for. It’s also important to remember that the terms read and write for the buffer interface are named from the PSL’s perspective, so even though I’m making a read request to read data, it comes to the AFU on the buses named write_data and such.

if (buffer_in.write_valid &&
    buffer_in.write_tag == REQUEST_READ &&
    buffer_in.write_address == 0) begin
  request.size <= buffer_in.write_data[0:63];
  request.stripe1 <= buffer_in.write_data[64:127];
  request.stripe2 <= buffer_in.write_data[128:191];
  request.parity <= buffer_in.write_data[192:255];
end

When the data comes back, it’s not quite as I’d like it to be.

[Figure: wed_data]

My application code spits out what these values should be:

[example structure
  example: 0x1d91500
  example->size: 128
  example->stripe1: 0x1d91600
  example->stripe2: 0x1d91780
  example->parity: 0x1d91880
  &(example->done): 0x1d91520

The issue here is that I’m reading in data that is in a little-endian byte format, but is being interpreted as big-endian. To deal with this issue I wrote a SystemVerilog function that can swap the endianness of the bytes in a generic way.

function logic [0:63] swap_endianness(logic [0:63] in);
  return {in[56:63], in[48:55], in[40:47], in[32:39], in[24:31], in[16:23],
          in[8:15], in[0:7]};
endfunction

I’ll modify my assignments to make use of this function.

request.size <= swap_endianness(buffer_in.write_data[0:63]);
request.stripe1 <= swap_endianness(buffer_in.write_data[64:127]);
request.stripe2 <= swap_endianness(buffer_in.write_data[128:191]);
request.parity <= swap_endianness(buffer_in.write_data[192:255]);

Now that this is in the right byte order, my internal request register is being filled with the appropriate values.
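The same byte-order problem is easy to reproduce in software, which makes for a handy cross-check of the values seen in simulation. The Python sketch below uses a hypothetical helper name and the field values printed by my application above. It packs the four 64-bit fields as little-endian, the way the host lays them out, and decodes them again; reading the first field as big-endian instead shows the kind of scrambled value the AFU sees without the swap.

```python
import struct


def parse_wed(raw):
    """Decode the first 32 bytes of the WED as four little-endian 64-bit fields."""
    size, stripe1, stripe2, parity = struct.unpack("<QQQQ", raw[:32])
    return {"size": size, "stripe1": stripe1, "stripe2": stripe2, "parity": parity}


if __name__ == "__main__":
    raw = struct.pack("<QQQQ", 128, 0x1D91600, 0x1D91780, 0x1D91880)
    # Interpreting the same bytes as big-endian gives the wrong size.
    print(hex(int.from_bytes(raw[:8], "big")))
    print(parse_wed(raw))
```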

I’ll add a touch of logic to catch when these values are set to something valid then move to the next state.

if (response.valid && response.tag == REQUEST_READ) begin
  current_state <= REQUEST_STRIPES;
end

With our WED data all the way into our AFU I’ll commit my changes and call it a wrap for this post. In the next post I’ll write the remaining states and write some data back to userspace memory, completing this AFU!