CMSC 411 Project, Fall 2020

CMSC 411 Computer Architecture Project Fall 2020

corrected

ask questions on webex, https://umbc.webex.com/meet/squire

Tuesdays and Thursdays 1:00pm to 2:15pm

additional information and questions

Cadence setup, 2020: follow the instructions exactly. It is best to use
Makefiles to save on typing.
Log into linux.gl.umbc.edu and type these commands (skip any already done for HW4, HW6):
  mkdir cs411
  cd cs411
  mkdir vhdl2
  cd vhdl2
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/cadence_setup.tar . 
  tar -xvf cadence_setup.tar
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/Makefile_411 .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/Makefile_ghdl .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/t_table.run  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/t_table.vhdl  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/inverter.run  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/inverter.vhdl  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/inverter_test.vhdl  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/add32_test.run  .
  cp  /afs/umbc.edu/users/s/q/squire/pub/download/add32_test.vhdl  .
  # if you typed everything correctly, run cadence logic simulation
  make -f Makefile_411 t_table.out
  make -f Makefile_ghdl t_table.out
  make -f Makefile_411 inverter_out.txt
  make -f Makefile_ghdl inverter_out.txt
  make -f Makefile_ghdl clean # saves on quota 

Look over the newly created Cadence files:   dir -ltr *

You may use either or both cadence and GHDL simulator.
  make -f Makefile_ghdl t_table.gout
  make -f Makefile_ghdl add32_test.gout

Look over the newly created GHDL files:    dir -ltr *
  There are junk files also. Remove them with clean:
  make -f Makefile_ghdl clean
  dir -ltr *

The goal of the semester project is to design and simulate a pipelined RISC CPU. Major components will be the pipelined ALU data path, the instruction decoder, hazard detection and associated forwarding/stall and cache memory controller.

Do not copy a previous semester's project

It will not work, and you will lose points.

You will get a  -0  or worse, on any project part that is
a copy of a previous semester's project. DO NOT COPY !

Submitting your Project

 The project is to be submitted on GL as five transactions for five files:
   submit cs411 part1 part1.vhdl
   submit cs411 part2 part2a.vhdl
   submit cs411 part2 part2b.vhdl
   submit cs411 part3 part3a.vhdl
   submit cs411 part3 part3b.vhdl

 The files you submit are not the starter files but the starter files
 with your additions to make it work. Do not submit extra files.
 I use makefiles and  .chk .chkg files to grade projects.
 
 Note: DO NOT use "Blackboard" for turning in project or homework.

Five Part Project

part1

part2a

part2b

part3a

part3b

Getting Started

Each time you log on, using Cadence VHDL:
  cd cs411/vhdl2
  then work on your .vhdl files
  make, then fix errors and check "diff"
  if no errors
  make clean   # just before you logoff, save disk quota

If Cadence license problem, use Makefile_ghdl
You may use either cadence or GHDL or both on GL.

MAC OSX users do the following:
  mkdir ghdl   # do all your work here
  brew install Caskroom/cask/ghdl   # this installs ghdl, like vhdl

Start the project by getting files

 Starter files may be copied to your vhdl2 subdirectory on
 linux.gl.umbc.edu  using commands such as the following
 (these set up both Cadence and GHDL):

 cp  /afs/umbc.edu/users/s/q/squire/pub/download/part1_start.vhdl .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/cs411_opcodes.txt .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/bshift.vhdl .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/part1.abs .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/part1.run .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/part1.chk .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/part1.chkg .
 cp  /afs/umbc.edu/users/s/q/squire/pub/download/divcas16.vhdl .


 
 you should have      add32.vhdl   file from your HW4
                      pmul16.vhdl  file from your HW6

Part1

 PART1: Handle lw, sw, add, sub, and, or, addi, lwim, sll, srl, mul, div, cmpl
               and nop with no hazards.
        (nop's are inserted in the part1.abs file to prevent hazards.)
        See cs411_opcodes.txt for detailed instruction formats and definitions.
        See reglist.txt for register use conventions.
        You should use part1_start.vhdl as a start for coding your circuit.
        You can do your own shift circuit or use the bshift.vhdl component.
        The instruction definitions and bit patterns for this semester are in
        cs411_opcodes.txt

   Quick start steps:
     1)  copy part1_start.vhdl to part1.vhdl
	 then work on project in part1.vhdl  
     2)  replace  "_start" with "", i.e. delete _start everywhere.
     3)  fill in .vhdl for the ALU_32 architecture to implement
         sub, and, or, sll, srl, cmpl, mul, div . See diagram.
         All other instructions must do a plain add.
         Note that EX_IR coming into ALU_32 has the instruction in "inst".
         See ALU for a possible schematic and some code.
     4)  compute the signal  WB_write_enb (needs 'or' of more opcodes)
         Search for ??? to find where some more work is needed;
         use the code there as an example for setting a mux control
         based on opcode.
          In each stage **_IR is the instruction currently in that stage.
          **_IR(31 downto 26) is the six bit major op code. "100011" for lw
          **_IR(5 downto 0) is the six bit minor op code. "100000" for add.
              when major op code is "000000"

     5) Compile, analyze, run using commands in your Makefile
        make -f Makefile_411 part1.out
	make -f Makefile_ghdl part1.gout
	
     6) then do difference:
        diff -iw part1.out part1.chk | more
        diff -iw part1.gout part1.chkg | more

	(ignore "squire" just from my run, ignore extra
	cadence or GHDL stuff. Use either or both to find errors.)
	
	Look at difference  <  lines are yours
	                    >  lines are check, required
	see line number, look at  part1.out on that line and
	see what instruction is being executed in which stage and
	fix that instruction.
	Or, if difference in ALUSrc or RegDst etc, fix that signal.
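
A minimal sketch of the opcode tests described in step 4, using only the two
bit patterns given above ("100011" for lw, minor "100000" for add). The entity
name wb_enb_sketch and its port names are hypothetical and are not taken from
part1_start.vhdl; the remaining opcodes have to come from cs411_opcodes.txt.

```vhdl
-- Sketch only: names are hypothetical, not from part1_start.vhdl.
library IEEE;
use IEEE.std_logic_1164.all;

entity wb_enb_sketch is
  port (IR        : in  std_logic_vector(31 downto 0);  -- e.g. WB_IR
        write_enb : out std_logic);
end entity wb_enb_sketch;

architecture sketch of wb_enb_sketch is
  signal is_lw  : std_logic;
  signal is_add : std_logic;
begin
  -- major op code is IR(31 downto 26); "100011" is lw
  is_lw  <= '1' when IR(31 downto 26) = "100011" else '0';

  -- minor op code IR(5 downto 0) only counts when major is "000000"
  is_add <= '1' when IR(31 downto 26) = "000000" and
                     IR(5 downto 0)   = "100000" else '0';

  -- 'or' together every instruction that writes a register;
  -- only two are shown, more opcodes are needed for the project
  write_enb <= is_lw or is_add;
end architecture sketch;
```

In the project the same tests would normally be written directly on the
stage's own **_IR signal rather than in a separate entity.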
	
   Add to Makefile, if you did not download Makefile_411	
      all:  ... part1.out  # add part1.out to the list

      part1.out: part1.vhdl add32.vhdl bshift.vhdl pmul16.vhdl \
                 divcas16.vhdl part1.run part1.abs
         run_ncvhdl.bash -v93 add32.vhdl ...
         run_ncvhdl.bash -v93 bshift.vhdl ...
         run_ncvhdl.bash -v93 pmul16.vhdl ...
         run_ncvhdl.bash -v93 divcas16.vhdl ...
         run_ncvhdl.bash -v93 part1.vhdl  # copied from part1_start.vhdl
         run_ncelab.bash -v93 part1:schematic
	 run_ncsim.bash  -batch -logfile part1.out -input part1.run part1

         diff -iw part1.out part1.chk     should be no differences
                                          no stalls, timing should be exact
        in Makefile_ghdl:
	
        ghdl -a --ieee=synopsys add32.vhdl
        ghdl -a --ieee=synopsys bshift.vhdl
        ghdl -a --ieee=synopsys pmul16.vhdl
        ghdl -a --ieee=synopsys divcas16.vhdl
        ghdl -a --ieee=synopsys part1.vhdl
        ghdl -e --ieee=synopsys part1
        ghdl -r --ieee=synopsys part1 --stop-time=250ns > part1.gout
        diff -iw part1.gout part1.chkg     should be no differences
                                            no stalls, timing should be exact




        The CS411 Project Part 1 uses a schematic as shown in Lecture 18
        and part1.ps
        Check that opcodes are latest cs411_opcodes.txt

        For grading reasons, keep the signal names that
        are pipeline registers and the entity/memory names.


        The resulting output should be as shown in
         part1.chk  file based on part1.abs and  part1.run 

        Check the results in part1.out to be sure the instructions
        worked. You can follow each instruction through the pipeline
        by following the instruction register, *_IR and check the
        *_*  signals for correct values at each stage.

        It is possible that your part1.out does not agree with
        part1.chk but you should
        be able to explain why. (Probably different don't care choices.)

        You may want to copy part1.vhdl to another file and add more
        'write' statements to print out more internal signal names in order
        to help debug your circuit. See debug.txt

        Submit all components and your main circuit as one plain text
        file using submit. DO NOT include add32.vhdl or bshift.vhdl,
        pmul16.vhdl, divcas16.vhdl, etc.
        they are provided by the instructor for testing. The file
        must be named  "part1.vhdl". DO NOT EMail except for questions.

        You submit on GL using:  submit cs411 part1 part1.vhdl

        No makefiles or run files or output is to be
        submitted. Partial credit will be given based on number of
        instructions simulated correctly. The starter file part1_start.vhdl
        only simulates  lw  and a few other instructions correctly.

        This code  part1.vhdl gets copied to part2a.vhdl for next project

 Part2a: Copy your  part1.vhdl  to  part2a.vhdl
        Substitute string "part2a" for every "part1"

        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2a.abs .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2a.run .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2a.chk .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2a.chkg .
        implement data forwarding and jump and branch.
        CS411 does the branch and jump in the ID stage
        CS411 goes beyond the book by forwarding for beq.
        submit cs411 part2 part2a.vhdl # before working part2b

        You are upgrading part1.vhdl to match the schematic in part2a.jpg
        or part2a.ps

        Data forwarding paths must cover at least those cases covered
        in class (see the class handout for details).
        Additional insight may be gained from a comparison of the
        pipeline stages with and without data forwarding in forward.txt
        A possible implementation of forwarding is forward_mem.jpg
        The EX stage forwarding may use entity mux_32_3,
        a multiplexor with three 32-bit inputs.
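
For reference, a three-input 32-bit multiplexor can be sketched as below.
This is an illustration only: the entity name mux_32_3_sketch, the port names
and the control encoding are assumptions, so check the provided mux_32_3 for
its actual ports before wiring it in.

```vhdl
-- Sketch only: port names and control encoding are assumptions.
library IEEE;
use IEEE.std_logic_1164.all;

entity mux_32_3_sketch is
  port (in0    : in  std_logic_vector(31 downto 0);  -- e.g. value read from register file
        in1    : in  std_logic_vector(31 downto 0);  -- e.g. value forwarded from MEM stage
        in2    : in  std_logic_vector(31 downto 0);  -- e.g. value forwarded from WB stage
        ctl    : in  std_logic_vector(1 downto 0);
        result : out std_logic_vector(31 downto 0));
end entity mux_32_3_sketch;

architecture behavior of mux_32_3_sketch is
begin
  -- select one of the three inputs from the two bit control
  with ctl select
    result <= in0 when "00",
              in1 when "01",
              in2 when others;
end architecture behavior;
```

The forwarding logic then only has to compute the two control bits for each
ALU input from the instructions in the later stages.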

        Note: jump and beq are followed by a delayed branch slot that
        contains an instruction that is always executed. jump can not
        cause a stall. If beq does not get data forwarding, then it
        can stall, and stall, and stall. Add data forwarding for beq
        by adding two muxes in the ID stage that get inputs from the
        MEM stage as shown in part2a.jpg
        or part2a.ps

        Implement your circuit assuming that software has correctly
        filled the delayed branch slot and implement the branch in
        the ID stage as modified for this class project.

        You may use the mux32_3

        For grading reasons, keep the signal names that
        are pipeline registers and the component/memory names.

        Run the following commands to check your work.

  make -f Makefile_411 part2a.out
  make -f Makefile_ghdl part2a.gout
  diff -iw part2a.out part2a.chk
  diff -iw part2a.gout part2a.chkg

  Implement green logic in two diagrams in lecture 19

  For additional debugging, download and insert debug_forward.vhdl
  fix signal names if yours are different,
  and diff -iw part2a.out  part2a_print.chk 

Ignore difference in PC_next in clock 6.

My bug in my part2a.vhdl made an error in part2a.chk and part2a.chkg.
  MEM_data_reg   should be   EX_BB, MEM_addr):
  I had EX_B. Fixed now, part2a.chk, part2a.chkg OK.
  (The part2a_bug.chk and part2a_bug.chkg still accepted)
  OK to submit either way. I will ignore it when grading.

 Part2b: Copy your  part2a.vhdl  to  part2b.vhdl
        Substitute string "part2b" for every "part2a"

        cp  /afs/umbc.edu/users/s/q/squire/pub/download/Makefile_ghdl .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2b.abs .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2b.run .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2b.chk .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part2b.chkg .
        implement hazard detection and stall the minimum possible.


        Handle hazards. Detect hazards, prevent wrong results by
        stalling when necessary. A stall is implemented by holding
        the instruction in the ID stage and letting the EX, MEM and
        WB stages proceed. The stall signal prevents the IF and ID
        stages from getting a clock signal. A terse summary of the
        hazard detection is in hazard.txt
        A possible implementation of hazards is stall_lw.jpg
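
One hazard that usually remains even with forwarding is a lw in the EX stage
followed by an instruction in ID that reads the loaded register. A rough
sketch of that one test is below. It ASSUMES MIPS-style register fields
(rs in bits 25 downto 21, rt in bits 20 downto 16, lw writing rt); verify
the fields against cs411_opcodes.txt and the cases in hazard.txt. The entity
name lw_stall_sketch is hypothetical.

```vhdl
-- Sketch only: field positions are an assumption (MIPS-style),
-- and this covers a single hazard case, not the full hazard.txt list.
library IEEE;
use IEEE.std_logic_1164.all;

entity lw_stall_sketch is
  port (ID_IR : in  std_logic_vector(31 downto 0);  -- instruction in ID stage
        EX_IR : in  std_logic_vector(31 downto 0);  -- instruction in EX stage
        stall : out std_logic);
end entity lw_stall_sketch;

architecture sketch of lw_stall_sketch is
begin
  -- stall when EX holds a lw ("100011") whose destination register
  -- matches either source register field of the instruction in ID
  stall <= '1' when EX_IR(31 downto 26) = "100011" and
                    (EX_IR(20 downto 16) = ID_IR(25 downto 21) or
                     EX_IR(20 downto 16) = ID_IR(20 downto 16))
           else '0';
end architecture sketch;
```

This sketch stalls even when the instruction in ID does not really use both
fields as sources, so it may stall more than the minimum that is graded.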

        The CS411 Project Part 2b uses a modified schematic handed out
        See web for schematic part2b.jpg and part2b.ps


        Run the following commands to check your work.

  make -f Makefile_411 part2b.out
  make -f Makefile_ghdl part2b.gout
  diff -iw part2b.out part2b.chk
  diff -iw part2b.gout part2b.chkg

                 OK if different PCnext on clock 6

        Part2b  needs both data forwarding and hazards (stalls)
        Submit all components and your main circuit as one plain text
        file using 'submit'. No makefiles or run files or output is to be
        submitted. Partial credit will be given based on number of
        data forwards, jump, beq, and hazard stalls handled correctly.

        Your circuit will not be tested with jump or branch or data
        addresses greater than 10 bits, in other words your instruction
        and data memories do not need to be bigger than 1024 words.

        You may not get exactly the .chk results.
        (A different PC_next at clock 6, and possibly a few more
         PC_next differences, are OK; you still get 100.)
        Timing and stalls will be graded. Points will
        be deducted for memory or register differences
        or improper stalls.


  For additional debugging, download and insert debug_stall.vhdl
  fix signal names if yours are different,
  and diff -iw part2b.out  part2b_print.chk

 Part3a: Copy your  part2b.vhdl  to  part3a.vhdl
        Substitute "part3a" for every "part2b"

        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3a.abs .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3a.run .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3a.chk .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3a.chkg .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3a_print.chk .
        Implement a cache in the instruction memory (read only)
        submit cs411 part3 part3a.vhdl

        Put the cache inside the instruction memory
        component (entity and architecture).
        (you will need to pass a few extra signals in and out)
          One output is the miss signal, set with  miss <= '1', '0' after 30 ns;  etc.
          Then in architecture part3a "or" the new signal into "stall". 

        part3a.ps

        Use the existing shared memory data as the main memory. 
        Make a miss on the instruction cache cause a three cycle stall.
        A cycle is 10 ns, thus a three cycle stall is 30 ns.
        Previous stalls from part2b must still work.

   added:
         add to end of  entity instruction_memory
                    miss: out std_logic);
         add code after local_miss <=
                          miss <= '1','0' after 30 ns; -- to hold stall
         add to inst_mem: WORK.instruction_memory
                    miss => IF_miss); -- instruction fetch miss
         add to  stall <= ...
                         or IF_miss; -- declared above as:
                                     -- signal IF_miss: std_logic := '0';
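
To see what the waveform assignment  miss <= '1', '0' after 30 ns;  does on
its own, here is a small self-checking test bench. The entity name
miss_pulse_demo and the 5 ns start time are made up for the demonstration;
only the assignment itself comes from the notes above.

```vhdl
-- Demonstration only, not part of the project circuit.
library IEEE;
use IEEE.std_logic_1164.all;

entity miss_pulse_demo is
end entity miss_pulse_demo;

architecture sim of miss_pulse_demo is
  signal miss : std_logic := '0';
begin
  stim: process
  begin
    wait for 5 ns;
    miss <= '1', '0' after 30 ns;   -- '1' now, back to '0' at 35 ns
    wait for 1 ns;                  -- now at 6 ns
    assert miss = '1' report "miss should be '1' at 6 ns" severity failure;
    wait for 28 ns;                 -- now at 34 ns
    assert miss = '1' report "miss should still be '1' at 34 ns" severity failure;
    wait for 2 ns;                  -- now at 36 ns
    assert miss = '0' report "miss should be '0' at 36 ns" severity failure;
    wait;                           -- done, stop this process
  end process stim;
end architecture sim;
```

With a 10 ns clock, a 30 ns pulse covers three clock periods, which is where
the three cycle stall comes from.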

        The instruction cache holds 16 words
        organized as four blocks of four words. Remember that vhdl
        memory is addressed by word address, the MIPS/SGI memory
        is addressed by byte address, and a cache is addressed by
        block number. 
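
The three kinds of address can be checked with a small self-checking sketch.
It uses the same bit ranges as the cache skeleton in this section
(index 5 downto 4, word 3 downto 2, tag 31 downto 6, quad word 13 downto 4).
The entity name cache_addr_demo and the sample byte address are made up, and
IEEE.numeric_std is used here for the conversions, which may differ from the
to_integer used in the starter file.

```vhdl
-- Demonstration only: splits one byte address into cache fields.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity cache_addr_demo is
end entity cache_addr_demo;

architecture sim of cache_addr_demo is
begin
  check: process
    -- sample byte address 104 = x"68" = binary 0110 1000
    constant addr : std_logic_vector(31 downto 0) := x"00000068";
    variable index, word, quad, tag : natural;
  begin
    index := to_integer(unsigned(addr(5 downto 4)));   -- cache block number
    word  := to_integer(unsigned(addr(3 downto 2)));   -- word within block
    quad  := to_integer(unsigned(addr(13 downto 4)));  -- four word group in memory
    tag   := to_integer(unsigned(addr(31 downto 6)));  -- remaining upper bits

    assert index = 2 report "index wrong" severity failure;
    assert word  = 2 report "word wrong"  severity failure;
    assert quad  = 6 report "quad wrong"  severity failure;
    assert tag   = 1 report "tag wrong"   severity failure;
    -- word address = byte address / 4 = 104 / 4 = 26
    assert quad*4 + word = 26 report "word address wrong" severity failure;
    wait;
  end process check;
end architecture sim;
```

The two low bits of the byte address are dropped because every instruction
is one aligned 32-bit word.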

        The cache schematic for the instruction cache was handed out
        in class and is shown in icache.jpg
        or  cache.png

        The cache may be implemented using behavioral VHDL, basically
        writing sequential code in VHDL or by connecting hardware.

        Possible behavioral, not required, VHDL to set up the start of a cache:
        (no partial credit for just putting this in your cache.)

          add in or out signals to  entity instruction_memory  as needed
	  for example, 'clk'  'clear'  'miss'
          make corresponding changes at  inst_mem:
	  also add code below with additions


	  architecture behavior of instruction_memory is
            subtype block_type is std_logic_vector(154 downto 0);
            type cache_type is array (0 to 3) of block_type;
            signal cache : cache_type := (others=>(others=>'0'));
            signal local_miss : std_logic := '0'; -- needed between process calls
            -- now we have a cache memory initialized to zero
          begin  -- behavior
            inst_mem:
            process(..., local_miss)   -- add local_miss, computed the same as miss
              variable quad_word_address : natural;  -- for memory fetch
              variable cblock : block_type;-- the shaded block in the cache
              variable index : natural;   -- index into cache to get a block
              variable word : natural;    -- select a word
              variable my_line : line;    -- for debug printout
              alias tag   : std_logic_vector(25 downto 0) is cblock(153 downto 128);
              alias w0    : std_logic_vector(31 downto 0) is cblock(127 downto 96);
              alias valid : std_logic is cblock(154); -- other alias allowed
              ...
            begin

              if clear = '1' then
                miss <= '0';
                inst <= x"00000000";
              end if;
              if clear = '0' then
                index := to_integer(addr(5 downto 4));
                word  := to_integer(addr(3 downto 2));
                cblock := cache(index);  -- has valid (154), tag (153 downto 128)
                                         -- W0 (127 downto 96), W1(95 downto 64)
                                         -- W2(63 downto 32), W3 (31 downto 0)
                                         -- cblock is the shaded block in handout

                if (valid = '1') and (tag = addr(31 downto 6)) then -- hit
                  -- ... do hit
                else -- miss
                  -- ... do miss, get 4 words from memory, set tag and valid
                  ...
                  quad_word_address := to_integer(addr(13 downto 4));
                  w0 := memory(quad_word_address*4+0);
                  w1 := memory(quad_word_address*4+1); -- ...
                                         -- fill in cblock with new words, then
                  cache(index) <= cblock after 30 ns; -- 3 clock delay
                  miss <= '1', '0' after 30 ns;       -- miss is '1' for 30 ns
                  local_miss <= '1', '0' after 30 ns; -- to get process to run
                  -- this "miss" signal gets ored into part2b "stall" signal
                  ...
                  -- the part3a.chk file has 'inst' set to zero while 'miss' is 1
                  -- not required but cleans up the "diff"
                end if;
              end if; -- clear = '0'

          remember to  or  the cache miss signal into the  stall  signal;
          that takes care of the  sclk  signal.
	  	    
        I was a bit extreme in computing the  miss  signal in the cache.
	I did not use a  hit  signal yet had a  local_miss  signal.

	signal local_miss : std_logic := '0'; -- saved between calls
	My process had   process(addr, clear, local_miss)
		    
        More information, including debug print, is in Lecture 24 and
        debug.txt

        For debugging your cache, you might find it convenient to add
        this 'debug' print process right after  end process inst_mem;

  debug:  process -- used to print contents of I cache, diff part3a_print.chk
            variable my_line : LINE;   -- not part of working circuit
          begin
            wait for 9.5 ns;         -- just before rising clock
            for I in 0 to 3 loop
               write(my_line, string'("line="));
               write(my_line, I);
               write(my_line, string'("  V="));
               write(my_line, cache(I)(154));
               write(my_line, string'("  tag="));
               hwrite(my_line, cache(I)(151 downto 128));  -- ignore top bits
               write(my_line, string'("  w0="));
               hwrite(my_line, cache(I)(127 downto 96));
               write(my_line, string'("  w1="));
               hwrite(my_line, cache(I)(95 downto 64));
               write(my_line, string'("  w2="));
               hwrite(my_line, cache(I)(63 downto 32));
               write(my_line, string'("  w3="));
               hwrite(my_line, cache(I)(31 downto 0));
               writeline(output, my_line);
            end loop;
            wait for 0.5 ns;         -- rest of clock
          end process debug;

      And, add in front of instruction_memory architecture:
         use STD.textio.all;
         use IEEE.std_logic_textio.all;


        Then diff -iw part3a.out part3a_print.chk
        see part3a_print.chk with debug

        You may print out signals such as 'miss' using  prtmiss  from
        debug.txt

        For grading reasons, keep the signal names that
        are pipeline registers and the component/memory names.

        make -f Makefile_411 part3a.out
        make -f Makefile_ghdl part3a.gout
        diff -iw part3a.out part3a.chk
        diff -iw part3a.gout part3a.chkg
        diff -iw part3a.out part3a_print.chk

        You submit on GL using:  submit cs411 part3 part3a.vhdl

	Ignore difference in PC_next in clock 27.
	Ignore PC not zero if results in registers and memory are same.


 Part3b: Copy your  part3a.vhdl  to  part3b.vhdl
        Substitute "part3b" for every "part3a"

        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3b.abs .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3b.run .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3b.chk .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3b.chkg .
        cp  /afs/umbc.edu/users/s/q/squire/pub/download/part3b_print.chk .
        Implement a cache in the data memory  (read/write)
        submit cs411 part3 part3b.vhdl

        Put the cache inside the data memory entity and process.
        Almost all the code from the instruction cache, part3a,
        can be copied and used inside the data memory for the data cache.
        (you will need to pass a few extra signals in and out)

        part3b.ps

        Use the existing shared memory data as the main memory. 
        Make a miss on the data cache cause a three cycle stall
        of all pipeline stages.
        (you will need another signal similar to sclk in order
         to stall the EX, MEM and WB stages  e.g.  dsclk
         dsclk replaces  clk  for all registers in EX, MEM and WB stages.)

        A cycle is 10 ns, thus a three cycle stall is 30 ns.
        Previous stalls from part2b and part3a must still work.
        
        Change  MEMread : std_logic := '1'; to
                MEMread : std_logic := '0';  for part3b.

        Do a write through cache for the data memory.
        (It must work to the point that results in main memory are
         correct at the end of the run and the timing is correct,
         partial credit for partial functionality with correct timing
         for the stalls.)

        Then test part3b.vhdl with the data cache.
       
   make -f Makefile_411 part3b.out
   make -f Makefile_ghdl part3b.gout
   diff -iw part3b.out part3b.chk
   diff -iw part3b.gout part3b.chkg

   submit cs411 part3 part3b.vhdl

   Submit all components and your main circuit as one plain text
   file by using 'submit'. No makefiles or run files or output is to be
   submitted. Partial credit will be given based on correct timing
   and number of instructions simulated correctly,
   number of hazards handled correctly and proper operation of the
   data cache. Of course, the instruction cache must work before
   the data cache is graded.

      I use  dsclk  in EX, MEM, WB stages in place of  clk
      clk200 <= clk after 200 ps; -- slight delay
      dsclk  <= clk200 or dmiss;  -- dmiss out of data_memory

      stall gets added     or dmiss
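
The effect of these two assignments can be checked in isolation with the
self-checking test bench below. Everything except the clk200 and dsclk
assignments is made up for the demonstration (entity name dsclk_demo, the
clock phase, the edge counter). It ASSUMES dmiss goes to '1' just after a
rising edge of dsclk, as it would when a register clocked by dsclk presents
a new address to data_memory.

```vhdl
-- Demonstration only: counts rising edges of dsclk around a 30 ns dmiss.
library IEEE;
use IEEE.std_logic_1164.all;

entity dsclk_demo is
end entity dsclk_demo;

architecture sim of dsclk_demo is
  signal clk, clk200, dsclk : std_logic := '0';
  signal dmiss : std_logic := '0';
  signal edges : natural := 0;       -- rising edges seen on dsclk
  signal done  : boolean := false;
begin
  clock: process                     -- 10 ns period, rising at 5, 15, 25, ...
  begin
    while not done loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process clock;

  clk200 <= clk after 200 ps;        -- slight delay
  dsclk  <= clk200 or dmiss;         -- held at '1' while dmiss is '1'

  count: process(dsclk)
  begin
    if rising_edge(dsclk) then
      edges <= edges + 1;
    end if;
  end process count;

  stim: process
  begin
    wait until rising_edge(dsclk);   -- 5.2 ns
    wait until rising_edge(dsclk);   -- 15.2 ns
    dmiss <= '1', '0' after 30 ns;   -- miss lasts until 45.2 ns
    wait for 54.8 ns;                -- now at 70 ns
    -- edges at 25.2, 35.2 and 45.2 ns are swallowed, leaving
    -- 5.2, 15.2, 55.2 and 65.2 ns
    assert edges = 4 report "expected 4 rising edges on dsclk" severity failure;
    done <= true;
    wait;
  end process stim;
end architecture sim;
```

Without the miss there would be seven rising edges by 70 ns, so exactly three
clock cycles are lost, matching the required three cycle stall.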


      dmiss out of  data_memory  similar to   miss out of instruction_memory
      my code in data_memory very much like my code in instruction_memory
      same cache structure in another cache, L1 Dcache

      Add two signals to  entity data_memory    clear  miss
      miss will be called  dmiss   outside the data_memory

      data_memory does nothing unless either read_enable or write_enable is '1'
      use code from instruction_memory  cache:
        very similar when  read_enable is '1'  reading from memory
        add code when  write_enable is '1'     writing into memory one word
        Test read_enable and write_enable for both "hit" and "miss" cases.

        Typical start of data cache process ...
	  begin
	    if clear='1' then
	      miss <= '0';
	    end if;
	    if clear='0' and miss='0' and (read_enable='1' or
	       (write_enable='1' and write_clk'event and write_clk='1')) then
	      index := to_integer(address(5 downto 4));
	      word  := to_integer(address(3 downto 2));
	      cblock := cache(index);
              ...

  -- for debug, After:  end process data_mem;  insert for  part3b_print.chk

  debug:  process -- used to print contents of D cache, use part3b_print.chk
            variable my_line : LINE;   -- not part of working circuit
          begin
            wait for 9.5 ns;         -- just before rising clock
            for I in 0 to 3 loop
               write(my_line, string'("line="));
               write(my_line, I);
               write(my_line, string'("  V="));
               write(my_line, cache(I)(154));
               write(my_line, string'("  tag="));
               hwrite(my_line, cache(I)(151 downto 128));  -- ignore top bits
               write(my_line, string'("  w0="));
               hwrite(my_line, cache(I)(127 downto 96));
               write(my_line, string'("  w1="));
               hwrite(my_line, cache(I)(95 downto 64));
               write(my_line, string'("  w2="));
               hwrite(my_line, cache(I)(63 downto 32));
               write(my_line, string'("  w3="));
               hwrite(my_line, cache(I)(31 downto 0));
               writeline(output, my_line);
            end loop;
            wait for 0.5 ns;         -- rest of clock
          end process debug;
     -- end architecture behavior;  -- of data_memory

      And, add in front of data_memory architecture:
         use STD.textio.all;
         use IEEE.std_logic_textio.all;

Files to download and other links

Last updated 12/3/2020