Main

July 16, 2010

Simulating the NIOS II w/mmu using Verilator

I've fallen in love with "verilator".

It's a verilog simulator which generates C++ code. It's not fair to call it a verilog to
C translator because it does much more than that.

I like it because it forces problems out of my designs, generates good lint output and runs very very fast. For many of the designs I simulate using cver or iverilog takes many hours, sometimes a day. With verilator I cycle time performance which is withing an order of magnitude of my behavioral models (well, ok, that might be an exaggeration but it seems that way). My sims often shorten from overnight to 10 minutes.

So, I wanted to study the TLB accesses on the NIOS II running full linux so I could
improve my behavioral model of the NIOS II. To that end I grabbed a pre-baked
NIOS II and stripped off everything but the cpu. I then wrote my own "testbench" models for the memory - sdram and flash as well we the jtag uart and timers. Just a simple hack to answer when the cpu calls. Nothing as complex a full Avalon bus with arbitration, because I don't need that to sim the basic cpu.

I also had to write behavioral models for "altsyncram" and the "altmul_sum" multiplier. That was fun. The sync ram model is very very dense inside and pretty much inscrutable. But fortunately the NIOS uses the sync rams as basic dual port memories in a fairly constrained way. It does make interesting use of the byte enables and clock enables - my guess is this helps them make timing. When the pipe stalls the clock enables are de-asserted to stall the sync rams "internal pipe" as well. It's fun to watch in green waves.

So, I have gotten the NIOS II to simulate using verilator. I'm still having a problem with the JFFS2 reading of flash, but I'm very close to running something in userland. The kernel comes up to the point where it mounts the JFFS2 file system, which is good. It just has not used the TLB's yet.

I like the NIOS II architecture, especially with the MMU, if you can call it that. It's a simple six stage pipeline with tightly coupled I and D caches. The I cache is small but seems to work pretty well, as does the D cache. The caches are 16 way set associative.

The "mmu" is really just a small set of TLB's which are used for various segments of the address space. They function much like the i-cache near as I can tell, i.e. an n-way set associative memory. I'm still working out the details.


March 26, 2009

Re-creating old CPU designs

IMG_0036.jpg

Over the years I've done a number of experiments using Verilog, a hardware modeling language. In several of these experiments I have attempted to recreate old CPU designs like the MIT CADR lisp machine and the DEC PDP-8/I. My latest experiment is to recreate the PDP-11, in modern verlog, using modern simulation techniques.

Note that this has been done before. I know of at least 2-3 old microcoded versions and more recently there are 3 other groups which have done this, but in all the cases the code is either not in verilog or is proprietary and closed. Not very helpful.

I have not (yet) delved into SystemC, but I have done some fun work with co-simulation. Most recently I wired my RTL simulation of the pdp-11 in almost-verilog to a "known good" pdp-11 instruction set simulator. The idea is that both the RTL simulation and the instruction set simulator run the same code and at the end of each instruction cycle the results are compared. The "results" are the internal register values, the processor status word and the list of bus operations which occurred (address, type, data).

In a perfect world the two simulations will run in lock step and any deviation is a bug. And this is mostly true. The comparison turns out to be extremely helpful and very valuable.

Again, however, this is not new. I learned this technique from others who are smarter than I am.

While attempting to recreate the pdp-11 I ran into a number of interesting problems. The instruction set is fairly simple but it is not RISC. The effective address computations are complex and in many cases doubled. Let me supply an example:. Here is a list of the 8 addressing modes. A complex instruction can have a source operand (with one of these addressing modes) and a destination operand (with one of these addressing modes). So, in the worst case you need to compute the effective address and do one or more fetches for the source and destination.

mode symbol  ea1     ea2             ea3             data          side-effect                                                                               
0    R       x       x               x               R               x       
1    (R)     R       x               x               M[R]            x       
2    (R)+    R       X               x               M[R]            R<-R+2  
3    @(R)+   R       M[R]            x               M[M[R]]         R<-R+2  
4    -(R)    R-2     x               x               M[R-2]          R<-R-2  
5    @-(R)   R-2     M[R-2]          x               M[M[R-2]]       R<-R-2  
6    X(R)    PC      M[PC]+R         x               M[M[PC]+R]      x       
7    @X(R)   PC      M[PC]+R         M[M[PC]+R]      M[M[M[PC]+R]]   x       
Seems complex, yes? Each M[] is a memory read. The basic register indirect is simple. But modes 6 & 7 add the side effect of reading addition operand data from the next instruction location. This increments the pc as well as fetching an offset which gets added to the result of a previous EA calculation.

So, how to implement this? My first thought was a complex state machine. After a while I got frustrated and thought it might be easier just to make a machine which recodes the old pdp-11 instruction into new "risc-like" instructions on the fly. Sort of a just-in-time binary recompilation. I think this is how modern day X86 machines work. The fun idea would be to have several "machines" running ahead and converting the pdp-11 CISC instructions into simple RISC instructions, filling several FIFO's. The then RISC engine could use modern ideas like a multi-stage pipeline, speculative execution and branch prediction. While very cool, I quickly decided that was more complexity than I wanted at this stage.

I do think, however that it might make sense initially to do a simple "recoding engine" and a simple "risc pipeline". I want to do it and compare the gate count to a state machine version.

So, I set out to do a simple state machine version. I tried to compress the states as much as possible but current feel there has to be a decode state, four states for each operand, an execute state and a write-back state. The four states for each operand can be reduced to a little as one, depending on how the instruction decodes. I tried to eliminate the single EA state for each operand but instructions like:

   mov   @(R5)+,@(R5)+

causes problems. Why? because the value of R5 is incremented twice, once after each EA calculation. If I did the EA and post-increment in one state I needed to special case the increment (to be 2x) if the both registers were equal. And it got to be a big mess. I capitulated, added a state, and reduced the complexity.

I should note here that all pdp-11's, except one, are microcoded. And I can see why.

At some point I do want to try an experiment by adding a pre-fetch unit, keeping at least 3 words available and doing the EA calculations in parallel. The EA calculation will stack up (i.e. stall) queuing up for memory reads, but it has the potential for being more efficient, especially if there is a cache which does burst reads and the line size is at least 8 bytes.

I know this might all sound crazy, but I've learned a lot in the process and almost everything I have learned has been useful in my day job.

October 21, 2006

Never Say Die... (reviving a Linksys WRT54GS)

A while back I did a project using the Linksys wireless routers. Well,
more truthfully an associate did all the work, I just managed the project.

Anyway, my associate, being rusty with hardware at that point managed to "brick"
one of the units. It was toasted and would not respond to the simple
recovery proceedure. (aside: I do this all the time. generally I like
to produce smoke, sparks and flames. apparently all he got was a dead
unit. He'll need to practice shorting wires of higher amperage if he wants
to catch up)

Most recently I found myself needing two more Linksys boxes for some
distributed computing I'm doing. I found one and it runs OpenWRT which
I now love. The other was the "bricked" unit. (to brick = to render a device
no more useful than a brick)

I figured I'm a tough guy so I soldered the serial connector and JTAG
port. I connected up a simple jtag box and ran the "debrick" code
someone (hairdairymaid) wrote for windows. It didn't work. I fooled
around with it for hours. hours. I resoldered the pins. I used the
o-scope. I beeped out the lines. I compiled/hacked/fixed the linux
"wince jtag tools" to work with this particular broadcom cpu chip
(which, I might add, I did the original board port for). Still nothing.
I almost managed to get the linux jtag tools to erase the flash and
rewrite it but I had to modify the code and it was ugly. pain.
frustration. why am I spending 1/2 a day on a $50 box?

So, I put the board in the trash. But I didn't take out the trash. A
day goes buy. I break down and use a bogus WRT54GP box I had which
won't run linux. Blech. But it worked.

Today I notice (on accident) a posting where a guy says "hey, I tried
that hairydairymaid debrick code but it didn't work on my Intel J3
strataflash until I added a 2 second delay here and a 5 second delay
here..." His symptoms matched mine (erase appeared not to work). So, I
made the same change to the code and VIOLA! it's debricked! it works!
it runs OpenWRT! happy happy! joy joy!

Seems dumb all for a $50 box.

But, the interesting thing is that I found the answer on the web, but it
was not the first one. Many of the low level 'instructructions'
surrouding the WRT are marginal/flawed. But they often work well
enough. The high level ones seem to be largely good.

Wasn't that fun? :-)