Merlintec Computers

TACHYON

Development Tools and Project Schedule

This is a very preliminary schedule for the development of the Tachyon chip:

(Gantt chart omitted; the phases shown are: arch. sim. 1, arch. sim. 2, tuning, tools, library, test chip, design, logic sim., synthesis, 1st silicon, prototype.)

The total duration for the project is 42 weeks (just under ten months), which is pretty ambitious for the proposed development team (one engineer), but certainly doable. This schedule doesn't take the Merlin 6 project into account, however. That machine is a slightly reduced implementation of this design using FPGAs; once the Merlin 6 is completed and tested, many of the steps listed on this page can be eliminated or greatly reduced.

Architectural simulation - microcode cache

Since some of the ideas being proposed here have never been tried before, the very first step is the creation of a very high level simulation of those particular aspects of the design, just to see if they "work as advertised". This is normally done as a quick and dirty simulator written in a high level language such as C. Fortunately, a suitable framework is already available in Self, so this job will be much simpler than usual. Using the simulation framework has the added advantage that the simulation can evolve incrementally from an architectural one into a structural one, whereas C simulations are normally thrown away and an entirely new one must be written in VHDL or Verilog.

Since the microcode cache is both a hardware and a software solution, at least part of the Java virtual machine must be written in order to run the simulation. The Java program itself will be the javac binaries from Sun's latest JDK, and it will be tested on the example sources that come with the JDK. If more time were available, it would be very interesting to test the microcode cache with a port of Squeak Smalltalk, but this schedule doesn't allow that unless more developers are included in the project.
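As a rough illustration of what a first-cut model could look like (the organization, sizes and names below are placeholders; picking the real ones is precisely the point of this simulation), a direct-mapped cache from bytecode address to translated microcode, with hit and miss counting, would be enough to start:

    /* Minimal sketch of a microcode cache model, assuming a
       direct-mapped organization indexed by bytecode address.
       All names and sizes here are illustrative placeholders. */
    #include <stdio.h>

    #define CACHE_LINES 1024

    typedef struct {
        unsigned tag;        /* bytecode address that owns this line */
        int      valid;
        /* the translated microcode itself would be stored here */
    } uline;

    static uline cache[CACHE_LINES];
    static long hits, misses;

    /* Look up the microcode for one bytecode address, "translating"
       (and filling the line) on a miss. */
    void ucache_fetch(unsigned bytecode_addr)
    {
        uline *l = &cache[bytecode_addr % CACHE_LINES];
        if (l->valid && l->tag == bytecode_addr) {
            hits++;
        } else {
            misses++;             /* would invoke the translator here */
            l->tag = bytecode_addr;
            l->valid = 1;
        }
    }

    int main(void)
    {
        /* drive the model with a fake bytecode trace */
        unsigned trace[] = { 0, 4, 8, 0, 4, 8, 100, 0 };
        for (int i = 0; i < 8; i++)
            ucache_fetch(trace[i]);
        printf("hits %ld  misses %ld\n", hits, misses);
        return 0;
    }

Running javac traces through a model of this shape is what would show whether the hit rates justify the hardware.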

Architectural simulation - coprocessors

An entirely separate simulation will be created (again using the Self framework developed in 1991) to test the idea of multiple media coprocessors cooperating with a multithreaded master processor. Since the compiler to generate code for such an architecture doesn't exist yet, a few test "kernels" will have to be written directly in microcode, manually translated from the C implementation of the MPEG-2 decompression program. Some quick tests of the "soft I/O" feature of this architecture will also be included in this phase.
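As a very loose sketch of what this simulation has to capture (the dispatch policy, the kernel structure and the cycle costs below are all invented placeholders, not the actual architecture), the master thread might hand kernels to whichever coprocessor frees up first:

    /* Sketch of a master processor handing media "kernels" to a
       pool of coprocessors, modelled as a least-loaded dispatch
       loop. The structures stand in for whatever the architectural
       simulation settles on. */
    #include <stdio.h>

    #define NCOPROC 4

    typedef struct {
        const char *name;     /* e.g. "idct", "motion_comp" */
        int cycles;           /* assumed cost of the kernel */
    } kernel;

    static int busy_until[NCOPROC];  /* cycle at which each unit frees up */

    /* Dispatch a kernel to the least-loaded coprocessor and return
       the cycle at which it completes. */
    int dispatch(kernel k, int now)
    {
        int best = 0;
        for (int i = 1; i < NCOPROC; i++)
            if (busy_until[i] < busy_until[best])
                best = i;
        int start = busy_until[best] > now ? busy_until[best] : now;
        busy_until[best] = start + k.cycles;
        printf("%s on unit %d: cycles %d..%d\n",
               k.name, best, start, busy_until[best]);
        return busy_until[best];
    }

    int main(void)
    {
        kernel mpeg[] = { {"idct", 80}, {"idct", 80}, {"motion_comp", 120} };
        for (int i = 0; i < 3; i++)
            dispatch(mpeg[i], 0);
        return 0;
    }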

Performance evaluation and architectural tuning

The two previous simulations will be combined into a single one and all the other elements of the processor's design will be added. The architectural simulations will be made more detailed so that the result is a cycle accurate simulation of the whole chip. The simulator will be parameterized: details such as cache sizes and the number of coprocessors can be different for each run. The idea is to run the full suite of benchmarks while varying each of the parameters in turn. The resulting data will allow the design to be refined and the very best architecture to be chosen for the next phases.
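The sweep itself is straightforward. A minimal sketch in C, assuming a run_benchmarks() stand-in for the real cycle accurate simulator and invented parameter ranges:

    /* Sketch of the parameter sweep: run the cycle accurate model
       once per configuration and record the cycle count. The
       run_benchmarks() function and the parameter ranges are
       placeholders for the real simulator. */
    #include <stdio.h>

    /* placeholder: would run the full benchmark suite and return cycles */
    long run_benchmarks(int cache_kb, int ncoproc)
    {
        return 1000000L / (cache_kb * ncoproc);   /* fake cost model */
    }

    int main(void)
    {
        for (int cache_kb = 4; cache_kb <= 32; cache_kb *= 2)
            for (int ncoproc = 1; ncoproc <= 8; ncoproc *= 2)
                printf("cache %2d KB  coprocessors %d  cycles %ld\n",
                       cache_kb, ncoproc,
                       run_benchmarks(cache_kb, ncoproc));
        return 0;
    }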

Software tools

When a previous processor was being designed in 1990, there was a significant gap in the capability of commercial EDA tools for handling high level designs. VHDL was very new and most available simulators only handled a subset of it. Silicon compilers were still very rare and required design to be captured in some proprietary format. The greatest bottleneck by far in the development cycle was the creation of test vectors for simulations.

Given this situation, the decision was made to create a new mixed level simulator using Self as the hardware description language (HDL). Its dynamic nature and extensibility proved perfect for the job, allowing it to do double duty as the command line interface (CLI) for the simulation system. Each component could be described either as a black box with Self code for its behavior or as a netlist of more primitive components. The more interesting case was for components with both descriptions: either one could be chosen for any given simulation.

This made it possible to trade off detail for speed, eliminating the need for manual creation of test vectors. Instead of simulating an ALU as a separate component using an "artificial" set of inputs, for example, the ALU could be simulated in detail as part of a whole system (simulated at the highest possible level) running real benchmarks. It is very common for this method to generate input combinations that the designer had not considered and, therefore, had omitted from the test vectors. Of course, it is still a good idea to complement this with artificial tests to guarantee full coverage (it is an even better idea to create the tests that will be used in manufacturing at this stage).

Another example of the mixed level style of simulation would be when testing the registers: one of them could be expanded (all the way down to the transistor "switch" level in this simulator) while the other 31 remain at the Self behavioral level for high speed. This makes it easy to verify that the structural description (netlist) does indeed correspond to the behavioral one.
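The original simulator was written in Self; as a loose transcription of the idea into C (every name below is invented for this sketch), a component carries both descriptions and each run chooses between them:

    /* Sketch of a mixed level component: a black-box behavioral
       function plus an optional netlist of sub-components, with
       the level selectable per simulation run. */
    #include <stdio.h>

    typedef struct component component;

    struct component {
        const char *name;
        void (*behavior)(component *self);   /* black box model */
        component **netlist;                 /* structural model */
        int nparts;
        int use_netlist;                     /* chosen per run */
    };

    void simulate(component *c)
    {
        if (c->use_netlist && c->netlist) {
            for (int i = 0; i < c->nparts; i++)
                simulate(c->netlist[i]);     /* detailed, slower */
        } else {
            c->behavior(c);                  /* behavioral, fast */
        }
    }

    void alu_behavior(component *self)
    {
        printf("%s: behavioral step\n", self->name);
    }

    int main(void)
    {
        component alu = { "alu", alu_behavior, NULL, 0, 0 };
        simulate(&alu);   /* fast run; set use_netlist for detail */
        return 0;
    }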

Once the high level design has been captured and verified, the detailed design must be created. The mixed level simulator allows a top-down development style where each black box is manually replaced with more primitive components (some newly created, others from a library) until the transistor level is reached. A much better solution is to use silicon compilers to synthesize the detailed design automatically from the behavioral description. The current state of the art in silicon compilers requires the designer to use a limited subset of the HDL in some very stereotyped ways in order to enable the compiler to generate the low level design. Fortunately, a completely automatic solution is not needed in real life projects. Any help is better than none if the results of synthesis can be integrated with "hand crafted" parts of the chip, as the simple "array generators" of the 1980s (which helped create PLAs, RAMs, ROMs, register banks and so on) demonstrated so well.

The normal output from synthesis tools is a netlist of gates and simple registers which are then mapped onto some predefined library and placed on the chip using the Standard Cell method. This is a one dimensional method: the standard cells have different widths but always the same height, and they are placed side by side in a long row with their inputs and outputs connected by special metal tracks. This row is normally folded into a snaking pattern in order to better fill a square area. Eliminating one of the two dimensions available in chip design simplifies the placement software and greatly reduces computing time for this step, but the resulting layout is considerably less efficient than what can be produced manually by an engineer (2.8 times larger in one study).
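To make the "snaking" concrete, here is a minimal sketch of 1D placement with row folding (the die width and cell dimensions are invented numbers, and routing channels between rows are ignored):

    /* Sketch of standard cell placement: cells of varying width and
       fixed height are laid out in one long row, folded ("snaked")
       into the next row whenever the running width exceeds the die
       width. Alternate rows fill in opposite directions. */
    #include <stdio.h>

    #define DIE_WIDTH   100
    #define CELL_HEIGHT 10

    int main(void)
    {
        int widths[] = { 30, 45, 25, 40, 35, 20 };
        int x = 0, row = 0, dir = 1;          /* dir flips on each fold */
        for (int i = 0; i < 6; i++) {
            if (x + widths[i] > DIE_WIDTH) {  /* fold into the next row */
                row++;
                dir = -dir;
                x = 0;
            }
            int left = dir > 0 ? x : DIE_WIDTH - x - widths[i];
            printf("cell %d at (%3d, %3d)\n", i, left, row * CELL_HEIGHT);
            x += widths[i];
        }
        return 0;
    }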

An alternate method that fully exploits the 2D nature of IC design was developed at the University of Utah and is called PPL (for Physical Placement of Logic or Path Programmable Logic, depending on who you ask). In some ways the very opposite of Gate Arrays, PPL design is done on a fixed grid defined by the wiring. This is based on the observation that the design rules for metal and vias are normally not as fine pitched as those for polysilicon and diffusion, so wires tend to take up more space on a chip than the transistors. Design in PPL means positioning cells, each a multiple of the grid size, next to one another on that grid. Each grid position was represented by two characters so that students at the University of Utah could use an Emacs derived editor to create their designs on simple ASCII terminals (this was the late 80s, after all). The left character indicates any "cuts" to the normal wiring and the right character indicates the function of the cell (with a blank indicating the default of no function at all; just wiring).

Here is an example of a very simple PPL design:

         i i
        |0 0 + j
        |0 1 + j
        |1 0 + j
        |1 1 + j

This is a 4 to 1 multiplexor on a 4 by 5 section of the grid, with the input and output indications omitted to make the presentation simpler. The selection addresses come in from either below or above on the column wires of the first two columns in the design. There are four column wires (plus power) and three row wires at each grid position, but they have default functions, so that when we say "the input comes in on the column wire" we know which one of the four we are talking about. The "i" cell is a simple inverter which takes its input on the default column wire and outputs to the secondary column wire. This means we now have both the original selection addresses and their inverses available in the first two columns, which should be very familiar to PLA designers and is standard practice in PPL.

The "|" character indicates that the row wires were cut to the left of the cells in the first column. That is because the inputs to be multiplexed come in from the right using the default row wires and we want to isolate this design from any circuit that might be placed left of it. The "0" cell is part of a distributed AND gate, along with the "+" cell and the "1" cell. When the inverted column signal is true, it sends an indication along the row. The "1" does the same thing for the non inverted signal. A "+" cell combines all indications along its row and outputs a true if all of the columns detected the signal they were expecting. So in the design above, one of the four "+" cells will output a true while all other three will output a false for any combination of inputs. The "j" cell is a three state design that will output its default row signal to its column only if its secondary row has a true value. This means that one of the "j" cells will multiplex its row input onto the fourth column as the circuit's output, as selected by the two inputs on the first two columns.

Note that while this is a high level physical design (there is a direct correspondence with the final layout), this also looks very much like a truth table for the operation of this circuit. This is the main feature that makes PPL such an efficient method for hand made designs - there are no separate schematic/layout steps. But our interest here is in PPL as the target for automatic synthesis tools. It fills that role nicely, due to:

  • connection by abutment - there is little explicit routing, and combining placement and routing saves time and code
  • trivial 2D - the fixed grid eliminates most of the degrees of freedom that would otherwise make compute times impractical, while retaining most of the advantages of two dimensions
  • easy to mix with hand made designs - the tool set doesn't have to be complete to be useful
  • the output is easily inspected to verify correctness while the tools are being developed

Cell library - design and characterization

The idea behind PPL is simple, and so are the tools for creating designs with it. But to generate a detailed chip layout, a whole cell library must be hand designed for the fabrication process that will be used. An example library is available for the Mosis 2.0 um process, but it can serve as little more than inspiration for this project.

In the ideal case, it would be possible to introduce new ideas, such as the True Single Phase Clock (TSPC) design method, while creating these cells. The schedule indicated above won't allow that, however, so the library will only include those cells that have already been proven in the PPL system over the years.

After the cells have been designed, they will have to go through detailed simulations using a tool like Spice, which can show the detailed electrical operation of a circuit under many different conditions. The results of these simulations should be fed back into the high level design tools (the switch level simulator, for example).
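One plausible way to carry the Spice results into the higher level tools (a sketch under assumed interfaces; every delay figure below is a placeholder, not a measured value) is a per-cell timing table that the logic and switch level simulators consult:

    /* Sketch of characterization data feeding the simulators: a
       per-cell delay table, nominally filled in from Spice runs.
       All numbers are placeholders. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *cell;
        double rise_ns, fall_ns;   /* from Spice, at one load point */
    } cell_timing;

    static const cell_timing library[] = {
        { "inv",  0.5, 0.4 },      /* placeholder figures */
        { "and2", 0.9, 0.8 },
        { "jbuf", 1.1, 1.0 },
    };

    double cell_delay(const char *cell, int rising)
    {
        for (unsigned i = 0; i < sizeof library / sizeof *library; i++)
            if (strcmp(library[i].cell, cell) == 0)
                return rising ? library[i].rise_ns : library[i].fall_ns;
        return 0.0;   /* unknown cell */
    }

    int main(void)
    {
        printf("inv rise: %.1f ns\n", cell_delay("inv", 1));
        return 0;
    }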

Cell library - test chip

With the design tools and the library of cells ready, a special chip using all of the cells in several simple test circuits should be created. Each cell should be included in this chip's design several times so it can be evaluated under different conditions. Ideally, critical subsets of the CPU itself can be included in this test.

When the chip is returned from the foundry, it must be tested to find any flaws in the library design and to fine tune all the simulators that are part of the tool chain. The design must be as modular as possible so that if one part of the circuit is not working the rest of the design can still be tested (an example would be the I/O library - if all internal circuits are connected to pins using an untested design, a flaw in that single cell would make testing the rest of the library impossible).

Detailed design

In this step the high level blocks of the first simulations must be replaced by netlists of cells available in the library. For some blocks this means the use of synthesis tools, while for others this will be a manual process. The placement of the blocks within the chip is also done at this stage.

Logic simulation

Now that the design for the whole chip is available at the transistor level, and information about wire lengths has been obtained during placement, a complete logic simulation can be executed. All of the test vectors must be re-evaluated using this complete simulation, no matter how long it takes.

PPL layout generation

Since the PPL logical grid on which the design was done maps directly to a fixed physical grid, the final layout generation step need merely replace each cell with its physical representation from the library and then apply some simple patches (to better stitch together metal connections in neighboring cells and to take care of details such as p- and n-well biasing). The resulting design must be verified for compliance with both design rules and electrical rules before being sent for mask fabrication (if there are no bugs in the tools, there should be no rule violations at all).
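Because the mapping is one to one, layout generation is conceptually a stamping loop. A minimal sketch, with invented pitches and an emit_cell() stand-in for copying the real mask geometry, reusing the function characters of the multiplexor example above:

    /* Sketch of PPL layout generation: each logical grid position
       maps to a fixed physical pitch, so generation stamps the
       library cell's layout at (col * XPITCH, row * YPITCH).
       Pitches and emit_cell() are placeholders. */
    #include <stdio.h>

    #define XPITCH 16   /* placeholder pitch, in lambda */
    #define YPITCH 20

    /* placeholder: would copy the cell's mask geometry into the layout */
    void emit_cell(char cell, int x, int y)
    {
        if (cell != ' ')   /* blank means wiring only, nothing to stamp */
            printf("place '%c' at (%d, %d)\n", cell, x, y);
    }

    int main(void)
    {
        /* function characters of the multiplexor example */
        const char *grid[4] = { "00+j", "01+j", "10+j", "11+j" };
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                emit_cell(grid[row][col], col * XPITCH, row * YPITCH);
        return 0;
    }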

First silicon

No amount of simulation can replace actually having a foundry make a batch of chips. When the chips come back, they must be fully tested to separate problems due to manufacturing (which will tend to affect some chips of the lot but not others) from design problems (which nearly always affect all chips in the same way, unless they interact in strange ways with manufacturing problems). If any design problems are found, they must be corrected and a new iteration of a prototype batch of chips must be made. The schedule indicated here doesn't take any design iterations into account; getting working chips on the first attempt does happen sometimes, but two or three attempts are more common. Sometimes ways can be found to reduce (or eliminate) some manufacturing problems through a redesign. This will greatly enhance yield (and thus reduce cost) during production, so it is certainly worth some extra delay when it is possible.

Hardware prototype and testing

While for many chips "testing" means having all the manufacturing vectors successfully applied, the real test for the Tachyon chip is to be included in a real hardware design (an internet access device, for example) and to run real applications. Using the Tachyon CPU as a replacement for the one built from FPGAs in the Merlin 6 machine will make this step easier: the real hardware design can be a modified Merlin 6, and the test applications can be those already running on that machine.

The Next Generation

While some of the earliest microprocessors are still with us (the Z80, for example, is a CPU designed in the mid 70s which is still being incorporated into new designs in 1999!) the rule is that the production life of a modern processor is shorter than its design cycle. The design of a new CPU can no longer be an isolated event, but must now be part of a larger process that can produce enhanced versions on an almost yearly basis. Several features of the Tachyon design itself are very scalable (number of media coprocessors, size of caches, number of MOVE busses) which will make it possible to quickly release higher performance versions on a regular basis. In addition, the design process described here, with the use of PPL for symbolic layout, allows a parallel effort to increase performance through the redesign of the cell library and the move to denser manufacturing technologies.

