Merlintec Computers

TACHYON

Hardware Support for Object-Oriented Languages

Most efforts to use special hardware to create high performance implementations of object-oriented programming languages have failed to compete with software-only solutions running on stock hardware. One problem is that these projects tend to take a long time, while standard CPUs are doubling in performance every 18 months, so a three year project will face processors four times as fast as those available when it started. There is no solution to this other than the obvious one - an object-oriented CPU must be developed on a schedule compatible with that of mainstream CPUs (including the planning for follow-on products).

There are other problems too - they will be described below so that it can be seen how the proposed microcode cache can solve them.

What you don't know can slow you down.

To execute a program, the processor must have all the relevant information available. Some of this information is known to the compiler when the machine level code is generated, and we will call it "static information". Other information is unavailable at compile time and must be obtained during the execution of the program; we call this "dynamic information". A simple rule is that the less static information we have, the lower the performance we can obtain. Take a simple expression like "a[i]", for example. If the compiler doesn't know anything special about this expression, it might generate code that looks like:

    ; code to load the address of the array "a" into R3
    ; code to load the value of variable "i" into R4
    ; code to load the size of the elements of array "a" into R5
    mult R5, R4 -> R6
    load [R3, R6] -> R1       ;  R1 now contains a[i]

On the other hand, suppose the compiler knows what the size of the elements of array "a" is and that it is a power of two (like 8, for example). And also suppose that the compiler has kept track of variable "i" and has arranged it to be in a register already when this code executes:

    ; code to load the address of the array "a" into R3
    lshft R4, #3 -> R6  ; R4 is the variable "i", and left shifting it 3 times is
                        ; the same as multiplying it by 8, which is the size of the
                        ; elements of array "a"
    load [R3, R6] -> R1    ; R1 now contains a[i]

If the array "a" happens to be stored in a known place, the compiler can greatly simplify the code to load its address into R3. In some situations, the compiler might be able to find out what the value of the variable "i" will be when this expression is evaluated (the previous line might have "i := 91;" or something like that). The code which combines that case with the one where the address of "a" is static looks like:

    move #1000728 -> R6  ; the address of "a" is known to be 1000000. The value of
                         ; variable "i" is known to be 91. The size of the elements of
                         ; array "a" is known to be 8. So 1000000 + ( 8 * 91 ) is the
                         ; address of the element we are looking for
    load [R6] -> R1      ; R1 now contains a[i]

Up to now we have been assuming that the elements of array "a" can be changed by program execution. That is not always the case - constant arrays are very common. If we happen to know all the information of the previous example plus the fact that the array is constant, then the compiler itself can do the lookup and generate this code instead:

    move #26 -> R1     ; R1 now contains a[i], which we happen to know is always 26 for
                       ; i = 91

This should have made it very clear that the more static information is available, the higher the performance of the resulting code will be. Many language designers have made an effort to increase the amount of static information while avoiding as much as possible the undesirable side effect - the decrease in flexibility. Most languages that started out as very static (Fortran, Occam) have become more dynamic with time. It isn't much fun, for example, for the author of a grading program to have to inform the compiler of the number of students that a class will have. Since processors have been doubling in performance every 18 months, we can trade off some speed for greater convenience.

Object Procrastination

The most important trend in current software development, object-oriented programming, pushes the balance as far as possible to the dynamic side. Static subroutine calls, for example, are replaced with dynamic message lookups. The advantages in terms of programmer flexibility are undeniable, but this takes a huge toll in terms of performance. Faster processors have made this less of a problem, as we have mentioned - a Smalltalk interpreter which would have brought a 386 to its knees is quite snappy on a Pentium. But the difference in performance on the same CPU between a dynamic system (like Smalltalk) and a static one (like C, or even object-oriented ones like C++ or Eiffel) has kept the more flexible alternative out of reach of most programmers.

All this, of course, is based on the idea that there are two distinct moments in a program's life: "compile time" is when the source is translated to machine code, right after the programmer finishes writing or editing it, and "run time" is when the CPU executes the code for some user. Recently, however, alternative definitions have been proposed for compile time:

  • when software is installed in a machine - software is distributed in some neutral format (such as OSF's Architecture Neutral Distribution Format, ANDF, or the Juice syntax tree system for the Oberon programming language) and the installation procedure is extended to include a compilation step. This is similar to distributing the source code and having each client compile it, but is more transparent. The popularity of Java as "the" system for "compile once, run everywhere" has mostly killed all interest in this alternative.
  • when software is loaded into a machine's memory - in this case the software remains in the distribution format on local disk or on a remote server and the software that loads it into RAM is modified to include a compilation step. This is how the popular JIT (Just In Time) compilers for Java work and is only possible because processors are now so fast that the compilation time is masked by the input/output overhead (reading from disks or from a network is very slow relative to a modern CPU).
  • when software is first executed - a cache of machine code is maintained in memory, and only when a part of the program not present in the cache is called (the first time a subroutine is called, for example) is the compiler invoked. The advantage of this "dynamic compilation" over a JIT compiler is that code that is never called need never be compiled.

The important consequence of delaying the moment in which machine code is actually generated is that what was previously dynamic information becomes static without any change in the programmer's model of the system.
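
To make the third alternative concrete, here is a minimal sketch in C of a code cache that compiles a piece of the program only the first time it is executed. All of the names (get_code, compile, the cache layout) are invented for this illustration and do not correspond to any particular system:

    typedef void (*native_fn)(void);
    typedef native_fn (*compiler_fn)(const unsigned char *bytecodes);

    struct cache_entry {
        const unsigned char *bytecodes;  /* key: code in the distribution format */
        native_fn            code;       /* value: the generated machine code    */
    };

    #define CACHE_SIZE 256
    static struct cache_entry code_cache[CACHE_SIZE];

    /* "compile" is whatever translator the system provides; it is only invoked
       the first time this piece of the program is actually executed */
    native_fn get_code(const unsigned char *bytecodes, compiler_fn compile)
    {
        unsigned slot = ((unsigned long)bytecodes >> 4) % CACHE_SIZE;
        struct cache_entry *e = &code_cache[slot];
        if (e->bytecodes != bytecodes) {       /* first execution (or a conflict) */
            e->bytecodes = bytecodes;
            e->code = compile(bytecodes);      /* generate machine code now       */
        }
        return e->code;                        /* later executions skip this step */
    }

Code that is never reached never goes through compile(), which is the advantage over a load-time JIT mentioned above.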

Partial Evaluation

A more general form of the optimizing transformations we applied to the "a[i]" expression to make it run faster is what is called "partial evaluation". Imagine the following function to calculate X raised to the Yth power:

    power(x:float, y:int):float = if (y=0) then 1.0
                                           else if(y<0) then 1.0/power(x,-y)
                                                        else x*power(x,y-1)

The compiler must generate very slow code due to all those subroutine calls. If both X and Y are known, however, we could give the whole expression to an interpreter and use the (constant) result whenever we needed it. Sometimes the actual situation is something between these two extremes: we might know nothing about X, for example, but know that Y is always -2. In that case we could have a very special interpreter that would directly execute whatever it could but would generate some code (called the "residual code") to be executed later for whatever it couldn't figure out. Alternatively, we can picture this process as a series of transformations on the source code:

    power(x:float, -2):float = if (-2=0) then 1.0
                                         else if (-2<0) then 1.0/power(x,+2)
                                                        else x*power(x,(-2)-1)
                             = 1.0/power(x,2)
                             = 1.0/{if (2=0) then 1.0
                                             else if (2<0) then 1.0/power(x,-2)
                                                           else x*power(x,2-1)}
                             = 1.0/{x*power(x,1)}
                             = 1.0/{x*{if (1=0) then 1.0
                                                else if (1<0) then 1.0/power(x,-1)
                                                              else x*power(x,1-1)}}
                             = 1.0/{x*{x*power(x,0)}}
                             = 1.0/{x*{x*{if (0=0) then 1.0
                                                   else if (0<0) then 1.0/power(x,-0)
                                                                 else x*power(x,0-1)}}}
                             = 1.0/{x*{x*{1.0}}}
    power(x:float, -2):float = 1.0/(x*x)

While this final, partially evaluated subroutine is much less general than the original, it is also significantly faster (and the binary code resulting from compiling it will be much smaller). But how did we find out that the value of Y should be -2? By looking through the program for the places where the power function is actually called. If we find out that it is only called in one place with the Y argument a constant value of -2, then the above strategy works perfectly. If, on the other hand, we find that it is called in four places with Y=-2 in two of them, Y=3 in another and an undetermined value for Y in the last, we could create three separate versions of the power function and have each place in the program (call site) refer to the correct version.
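
In C, and with invented names for the specialized versions, the result of that strategy might look like this - the first function is a direct translation of the original definition and the other two are the residual code for the call sites with a constant Y:

    /* general version, used at the call site where nothing is known about y */
    double power(double x, int y)
    {
        if (y == 0) return 1.0;
        if (y < 0)  return 1.0 / power(x, -y);
        return x * power(x, y - 1);
    }

    /* residual code for the call sites where y is known to be -2 */
    double power_neg2(double x) { return 1.0 / (x * x); }

    /* residual code for the call site where y is known to be 3 */
    double power_3(double x) { return x * x * x; }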

This idea, combined with the dynamic compilation mentioned earlier, is the key to a high performance implementation for object-oriented languages. By delaying the compilation step until the last possible minute (right before the actual execution of the code) a lot of the information that is dynamic according to the language model can be treated as static for the purpose of generating code: we are essentially partially evaluating the code in its execution environment and placing the residual (as machine code) in a cache where it can be run as many times as necessary. If any of the information used for this partial evaluation should change (since it is actually dynamic - we just pretended it was static to get high performance) we can still get the correct results by simply flushing the code cache and compiling again (with the new information).
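
Reusing the structures from the hypothetical code cache sketched earlier, the "flush and recompile" step could be as simple as:

    /* forget every compiled residual; the bytecodes are still there, so each
       piece of the program is simply recompiled (with the new information) the
       next time get_code() is asked for it */
    void flush_code_cache(void)
    {
        for (unsigned i = 0; i < CACHE_SIZE; i++) {
            code_cache[i].bytecodes = NULL;
            code_cache[i].code = NULL;
        }
    }

A real system would probably track which entries depend on which assumptions and flush selectively, but correctness only requires that stale residuals never run.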

Microcode Cache

The very first computers were quite simple due to the limited hardware available. They had to have separate circuits for each different instruction that they implemented - what is now called a direct execution design. As more complex machines became economically feasible, the idea of "microcode" was developed. Inside the main processor there would be a complete first generation computer with its own simple instruction set and registers plus a limited amount of memory. The program in this memory, called the microcode, would efficiently simulate the actual instruction set (called the "macrocode") defined for the machine. Since the processors were so much faster than the main (core) memory, the fact that each macroinstruction took several clock cycles (several microinstructions) to execute wasn't a problem. In fact, since fetching the macroinstructions from memory was so slow, it was considered a very good idea to define very complex instructions that could do a lot of work before the next instruction would be needed.

Ok, so on to Merlin 6. This machine has 64K words of 96 bits each for use as a direct mapped data and "microcode" cache - 64 bits of data and 16 bits for the tags. The machine language in Merlin 6 is Self bytecodes: nothing else exists in main memory. When it needs to execute some bytecodes, the instruction pointer is extended with a special "context register" and the result is looked up in the cache. If the cache hits, then the data found there is microcode to be directly executed and *not* the bytecodes pointed to by the instruction pointer! This is a subtle, but very important difference relative to all architectures I am aware of. If the cache misses, then the CPU fetches microcode from a fixed location in a special region in the cache. This is the bytecode interpreter/compiler and it will fetch the bytecodes (storing them in the *data* region of the cache) and will either execute them directly or generate new microcode to be stored in the cache.

The context register normally has a value of zero, but this can be changed by the microcode. This allows more than one microcode word to be associated with a single bytecode. This is used not only to implement multicycle instructions (in which case it acts as a microcode instruction pointer), but also for Self 4.0 customized compilation.

About the microcode - the processor is a simple MOVE CPU and each 64 bit word includes 4 move instructions that are executed in a single clock. I will represent one microcode word as

[cond1] src1->dest1 [cond2] src2->dest2 [cond3] src3->dest3 [cond4] src4->dest4

which means that data in src1 is transferred to dest1 in this cycle if cond1 happens to be true. Since each word is in a cache, I will represent the whole thing as

<IP,CR> microcode word

This means that if the current value of the instruction pointer is IP and of the context register is CR, then this microcode word (the four moves) is executed. Here is an example of microcode for adding two integers:

<109,0> [true] R2 -> ADDER_OP [true] R3 -> ADDER_TRIG [...] ... [...] ...
<109,1> [true] ADDER_OUT -> R4 [true] INC_IP -> IP [...] ... [...] ...

As you can see, this is even more primitive than a RISC CPU. But note that the compiler has complete control and could find interesting things to do with the four moves I have omitted from the example.

Imagine a very normal microcoded CPU, except that the control unit is a little different. There are these registers:

     PC - the program counter. Points to the currently executing bytecode
     PX - a kind of context register, which I will explain
     IR - instruction register. Holds the current bytecode

The PX normally has a value of zero. It might be from 6 to 14 bits wide (simulation would have to be done to see what the best value is). There is an associative memory in the control store that matches the pair PC,PX and returns a microcode word for the data path. When this memory fails to produce a match, then the microcode word is obtained by loading the IR with the memory value pointed to by PC, looking up the value of IR in a table of microcode pointer start values and then using this to address a microcode ROM. Something like:

      mc = mcCache.lookup(PC,PX);
      if (mc != NOT_FOUND) return (mc);
      IR = memory[PC];
      PX = 0;  /* I am not sure this should be here */
      mc = mcROM[mPC=start[IR]];
      return (mc);

The microcode word can change both PC and PX so that the normal relation (one or more microcode instructions for each macrocode instruction) need not apply. If the cache has a sequence:

      PC    PX    --->   microinstruction
     1097    5            do something1, PX := 6
     1097    6            do something2, PX := 7
     1097    7            do something3, PX := 0, PC := 1099

then we have three microinstructions corresponding to two macroinstructions. Here PX is being used as a strange mPC, but it can be used to allow customization as different microcode sequences can be associated with the same macroinstruction (actually, the same value of PC).

PICs

Stefan Matthias Aust was asking about caching message lookups and we got into a discussion about Polymorphic Inline Caches (PICs). These PICs are simply short sequences of code that test the receiver's type (class in Smalltalk, map in Self) against a short list of candidates and jump to the corresponding native code directly. If none of the alternatives is the correct one, then the traditional lookup system is invoked and it can extend the PIC so the current answer is found the next time around. To make this expansion easier, we might store the PIC in a separate memory region from the main code in the method and have that main code include a call to the PIC (by using jump instructions in the PIC, we guarantee that the return instruction in the invoked native method will bring us directly back to the main code). Since the number of entries in a PIC is small, the best implementation is something like:

    if (receiver->map == SmallIntegerMap) goto SmallIntegerAdd;
    if (receiver->map == FloatMap) goto FloatAdd;
    goto Lookup(receiver->map,"+");

The Self compiler can use this as a source of type information instead of (or together with) type analysis. It can tell that sending "+" to the object known as "receiver" at this point in the source code (known as the call site) has only resulted in either executing SmallIntegerAdd or FloatAdd. We can't know if there will be additional types for receiver in the future (though type analysis might be able to tell us that) so we won't be able to eliminate these tests from the compiled code (I am supposing we have decided to recompile this for some reason), but we can move them around and maybe merge them with similar tests nearby.
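
Here is a rough sketch, in C and with invented names, of the lookup fallback extending a PIC when none of the inline tests matches; real implementations patch the test-and-jump sequence in machine code, but the logic is the same:

    typedef void (*method_fn)(void);
    typedef method_fn (*lookup_fn)(void *map, const char *selector);

    #define PIC_MAX 8            /* past this the call site is treated as megamorphic */

    struct pic_entry { void *map; method_fn target; };
    struct pic       { int count; struct pic_entry entries[PIC_MAX]; };

    /* returns the native code for this send, extending the PIC on a miss so
       that the next send with the same receiver map is found by the tests */
    method_fn pic_dispatch(struct pic *pic, void *receiver_map,
                           const char *selector, lookup_fn full_lookup)
    {
        for (int i = 0; i < pic->count; i++)             /* the inline type tests */
            if (pic->entries[i].map == receiver_map)
                return pic->entries[i].target;

        method_fn target = full_lookup(receiver_map, selector);   /* slow path */
        if (pic->count < PIC_MAX) {
            pic->entries[pic->count].map = receiver_map;
            pic->entries[pic->count].target = target;
            pic->count++;
        }
        return target;
    }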

Here is the data presented in Table A-1 of Urs Hölzle's PhD thesis:

    number of receiver types    percentage of call sites
            0                          12.4%
            1                          81.9%
            2                           4.4%
            3                           0.8%
            4                           0.2%
          5-10                          0.2%
         11-99                          0.05%
           >99                          0.02%

When I was designing the Tachyon CPU, I came up with a nice implementation for PICs. But it used associative memory for the microcode cache and I can't put that in the Merlin 6. So here is the idea adapted for a direct mapped cache:

<432,0> [true] R6 -> PIC_TRIG [true] CONST32 -> PIC_OP #????
<IntMap> [true] R0 -> ADDER_OP [true] R2 -> ADDER_TRIG [true] #7 -> CR [...] ...
<FloatMap> [true] R0 -> FADD_OP [true] R2 -> FADD_TRIG [true] #3 -> CR [...] ...
<XXXXXX> ZZZZZZZZZ
....
<432,3> [true] FADD_OUT -> R0 ... ... ...
....
<432,7> [true] ADDER_OUT -> R0 ... ... ...

I am supposing that by the time the CPU tries to execute the bytecode at memory address 432 (which would be "send '+'" or something similar) the map of the object on top of the stack has been loaded into R6. We move a 32 bit constant into the operation register of the PIC unit. This constant comes from the 32 lower bits in this cache word (shown as #???? above), so the last two moves don't exist (their condition is automatically forced to "false"). Meanwhile, R6 has been moved to the trigger register. The PIC unit compares bits 2 to 11 in R6 to three 10 bit values in the constant. Depending on which one matches, it fetches in the next clock the word in the cache located 1, 2 or 3 words after the one with these moves. Instead of comparing that word's tag with <IP,CR> as it normally does, it compares it with bits 12 to 27 in R6, and if they don't match it causes a cache miss. That constant would be calculated by the compiler like this:

    ???? = (((intMap >> 2) & 0x3FF) << 20) |   /* bits 2 to 11 of the integer map */
           (((floatMap >> 2) & 0x3FF) << 10) | /* bits 2 to 11 of the float map   */
           0;                                  /* no third entry                  */

This way we can have PICs with up to 3 entries (which covers most of them) that start executing the destination code in just one clock. If the PIC has fewer than 3 entries (as in this example), the remaining cache lines can be used for other things.
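
Purely as an illustration of the matching described above (in the actual design this is done by the PIC unit in hardware, in a single clock), the comparison works out to something like:

    /* select the cache word 1, 2 or 3 slots after the PIC word, or report 0
       so that the normal lookup (a cache miss) takes over */
    int pic_match(unsigned r6, unsigned constant)
    {
        unsigned probe = (r6 >> 2) & 0x3FF;                 /* bits 2 to 11 of R6 */
        if (probe == ((constant >> 20) & 0x3FF)) return 1;  /* first entry        */
        if (probe == ((constant >> 10) & 0x3FF)) return 2;  /* second entry       */
        if (probe == ( constant        & 0x3FF)) return 3;  /* third entry        */
        return 0;
    }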


(c) 2000 Merlintec Computers