Merlintec Computers


TACHYON: A High Performance Object Oriented CPU

Jecel Mattos de Assumpcao Jr.

jecel@lsi.usp.br

Features

Direct bytecode execution
  • Reduced memory footprint
  • High performance/cost ratio
  • Brings the advantages of object-oriented programming to embedded/multimedia applications

Multi-coprocessing with loadable microcode
  • Eliminates peripheral chips (and their special memory banks)
  • Software media processing can adapt to emerging standards, like MPEG-4
  • Easy-to-understand programming model

Option 1: software-controlled, object-oriented data cache
  • High performance from a low-cost DRAM system

Option 2: embedded main memory
  • Extremely reduced chip count
  • Reduced pin count eliminates the need for an expensive package
  • Reduced power consumption due to the elimination of off-chip signals

Applications

  • video game consoles
  • settop boxes
  • digital TV
  • internet consumer devices
  • palmtop multimedia computers
  • wearable computers
  • voice I/O applications

Introduction

These pages describe a high performance CPU intended to run multimedia-intensive applications written in the Self, Smalltalk or Java programming languages. The key observation is that a huge performance gap has opened between the processor and main memory. When the Merlin 2 was designed, for example, the 68000 took 4 cycles (500 ns at 8 MHz) to make one memory access, while its DRAM could respond in only two cycles; the spare bandwidth was used for video. Today, a 600 MHz Alpha CPU can execute up to four instructions (requiring up to two memory accesses) in one clock cycle, a worst-case demand of one word every 0.2 ns. Yet a "normal" DRAM can take 60 ns to service a request, and even fast SDRAMs cycle no faster than a peak of 10 ns per word (50 times too slow). We have reached a point at which the CPU that can execute a program with the fewest accesses to main memory will be the fastest; all other considerations are secondary.
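The arithmetic behind these figures can be checked in a few lines of Python. This is only a back-of-the-envelope sketch. The 0.2 ns worst-case demand for the Alpha is taken directly from the text, not derived. The "spare bandwidth" is assumed to be the half of each 68000 memory cycle during which the DRAM is idle. The helper name `cycle_time_ns` is hypothetical.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Length of one clock cycle, in nanoseconds, for a clock given in MHz."""
    return 1000.0 / clock_mhz

# Merlin 2: the 68000 needs 4 cycles per memory access at 8 MHz,
# while its DRAM only needs 2 of those cycles (assumption: the rest is spare).
m68k_access_ns = 4 * cycle_time_ns(8)                # 4 x 125 ns = 500 ns
dram_busy_ns = 2 * cycle_time_ns(8)                  # 250 ns
spare_fraction = 1 - dram_busy_ns / m68k_access_ns   # share left over for video

# Alpha at 600 MHz: worst-case demand of one word every 0.2 ns (figure from the text),
# compared with a fast SDRAM peak of one word every 10 ns.
alpha_cycle_ns = cycle_time_ns(600)
alpha_demand_ns_per_word = 0.2
sdram_peak_ns_per_word = 10.0
gap = sdram_peak_ns_per_word / alpha_demand_ns_per_word

print(f"68000 access: {m68k_access_ns:.0f} ns, spare DRAM bandwidth: {spare_fraction:.0%}")
print(f"Alpha cycle: {alpha_cycle_ns:.2f} ns, SDRAM is {gap:.0f}x too slow")
```

The same 50x ratio appears whichever way it is computed, because it depends only on the two per-word times quoted above.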

Elaborate cache organizations (up to three levels deep) are the current answer, but Tachyon attacks this problem through architectural innovation. It is important to understand the memory traffic and tune the hardware to deal with it. For example, the memory access patterns for the stack are very different from those for random memory positions. So dedicated hardware for the stack (like the register windows in the Berkeley RISC and Sun SPARC or the stack cache in the AT&T Hobbit and the Sun picoJava) can outperform a generic memory cache while using only a small fraction of the transistors. In the same way, separating the memory traffic related to object access, instruction decoding and multimedia processing makes it possible to build specialized hardware for each, resulting in low cost/high performance implementations.
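The stack cache idea can be illustrated with a small Python model. It is a sketch under simplifying assumptions, not the Hobbit or picoJava mechanism. The class `StackCache`, its `size` parameter and the one-entry spill/fill policy are all hypothetical. Only the spills and fills reach main memory. Ordinary pushes and pops stay on chip.

```python
from collections import deque

class StackCache:
    """Toy model of an on-chip stack cache (hypothetical, one-entry spill/fill).

    The top `size` stack entries are held on chip. Pushing onto a full
    cache spills the oldest on-chip entry to main memory; popping from an
    empty cache refills one entry from main memory. Only spills and fills
    count as main-memory accesses.
    """

    def __init__(self, size: int):
        self.size = size
        self.on_chip = deque()   # oldest entry at the left, top of stack at the right
        self.spilled = []        # entries that overflowed into main memory
        self.memory_accesses = 0

    def push(self, value):
        if len(self.on_chip) == self.size:
            self.spilled.append(self.on_chip.popleft())  # spill
            self.memory_accesses += 1
        self.on_chip.append(value)

    def pop(self):
        if not self.on_chip:
            self.on_chip.append(self.spilled.pop())      # fill
            self.memory_accesses += 1
        return self.on_chip.pop()


def run(program, stack):
    """Evaluate a tiny postfix program: integers are pushed, operators pop two values."""
    for op in program:
        if op == "add":
            b, a = stack.pop(), stack.pop()
            stack.push(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.push(a * b)
        else:
            stack.push(op)
    return stack.pop()


program = [1, 2, "add", 3, 4, "add", "mul"]   # (1 + 2) * (3 + 4)
big = StackCache(size=4)
tiny = StackCache(size=1)
print(run(program, big), big.memory_accesses)    # 21 0
print(run(program, tiny), tiny.memory_accesses)  # 21 6
```

A stack kept entirely in main memory would need 14 accesses for the same program (7 pushes and 7 pops). The 4-entry model needs none, because the expression never grows deeper than three entries.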

Project Schedule and Development Tools

The key to developing a competitive product is using the right tools. A great variety of EDA (Electronic Design Automation) software currently exists, but for leading-edge projects the development of custom tools is inevitable.

Microcode Cache

Adaptive compilation was developed for Self and is now being applied to high performance implementations of Java. The microcode cache is the silicon equivalent of adaptive compilation, and offers significantly better performance than alternative approaches to direct bytecode execution.
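The microcode cache itself is a hardware structure, but the software technique it mirrors can be sketched in a few lines of Python. This is a toy model of counter-based adaptive compilation. It does not describe Tachyon's microcode cache or the Self or HotSpot compilers. `AdaptiveRuntime`, the `threshold` parameter and the closed-form "compiled" function are all hypothetical stand-ins.

```python
class AdaptiveRuntime:
    """Toy model of counter-based adaptive compilation (hypothetical).

    Every method starts out interpreted. Each interpreted call bumps a
    counter; once a method has been called `threshold` times, a faster
    version is produced and cached, and later calls use it instead.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}     # method name -> number of interpreted calls
        self.compiled = {}   # method name -> cached fast version
        self.log = []        # which path each call took

    def call(self, name, interpreted, compile_fn, arg):
        if name in self.compiled:
            self.log.append("compiled")
            return self.compiled[name](arg)
        self.counts[name] = self.counts.get(name, 0) + 1
        if self.counts[name] >= self.threshold:
            self.compiled[name] = compile_fn()   # method is "hot": translate it once
        self.log.append("interpreted")
        return interpreted(arg)


def slow_sum_of_squares(n):
    return sum(i * i for i in range(n))                # stands in for interpreted bytecode

def compile_sum_of_squares():
    return lambda n: (n - 1) * n * (2 * n - 1) // 6    # stands in for optimized code

rt = AdaptiveRuntime(threshold=3)
results = [rt.call("sumsq", slow_sum_of_squares, compile_sum_of_squares, 10) for _ in range(5)]
print(results)  # [285, 285, 285, 285, 285]
print(rt.log)   # ['interpreted', 'interpreted', 'interpreted', 'compiled', 'compiled']
```

The design choice being illustrated is that translation cost is paid only for code that has already proved to be frequently executed. Cold code stays on the slow path.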

Multimedia Stream Processing

Most CPU cycles are now spent processing video and sound, and this share will grow even more in the future. Traditional memory systems are designed on the assumption that memory accesses, while highly unpredictable, are very localized (the same memory positions tend to be used over and over). This assumption is false, however, for media processing, where memory accesses are predictable and non-repetitive: in most decompression algorithms, each byte in a video or sound buffer is touched once, sequentially. A new memory architecture optimized for regular streams of data is therefore needed to handle these tasks. And since multiple parallel streams are the norm (a video sequence plus left and right stereo audio channels, for example), the ideal architecture includes multiple data paths.
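A small simulation shows the difference between locality-based caching and stream-oriented hardware. It rests on several simplifying assumptions: accesses are word-granular, each stream has a hypothetical prefetch buffer of `depth` words, and the hardware knows which stream each access belongs to (one data path per stream). The function names and the stream base addresses are illustrative only.

```python
from collections import OrderedDict

def lru_cache_hits(addresses, capacity):
    """Count hits in a word-granular LRU cache holding `capacity` words."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used word
    return hits

def stream_buffer_bursts(accesses, depth):
    """Count main-memory burst requests when each stream has its own prefetch buffer.

    On a miss, one burst fetches the demanded word plus the next depth-1
    consecutive words, so the following sequential accesses hit in the buffer.
    """
    buffers = {}   # stream name -> addresses already prefetched
    bursts = 0
    for stream, addr in accesses:
        buf = buffers.setdefault(stream, set())
        if addr in buf:
            buf.discard(addr)
        else:
            bursts += 1
            buffers[stream] = set(range(addr + 1, addr + depth))
    return bursts


streams = {"video": 0x10000, "left": 0x20000, "right": 0x30000}
WORDS = 64
# Interleave the three streams, touching each word exactly once, in order.
accesses = [(name, base + i) for i in range(WORDS) for name, base in streams.items()]

print(lru_cache_hits([addr for _, addr in accesses], capacity=128))  # 0
print(stream_buffer_bursts(accesses, depth=8))                       # 24
```

Because every word is touched exactly once, even a large LRU cache gets no hits, and all 192 accesses go to main memory as separate requests. Three 8-word stream buffers turn the same traffic into 24 burst transfers.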

Object Oriented Memory System

Just as the traditional data cache handles multimedia memory traffic poorly, it has also proved inefficient for object-oriented applications. Even small modifications to the addressing modes and virtual address translation can greatly improve this situation, however, as the Mushroom project showed.
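The general idea of addressing a cache by object rather than by memory address can be sketched as follows. This is not the Mushroom design, whose details are not given here. It is a read-only toy built on these assumptions: the cache is keyed by (object ID, field offset), `ObjectCache` and the object table are hypothetical, and writes and coherence are ignored.

```python
from collections import OrderedDict

class ObjectCache:
    """Toy object-addressed cache keyed by (object ID, field offset) (hypothetical).

    Hits need no address translation at all; only a miss consults the
    object table to find where the object currently lives in memory.
    Writes are not modeled.
    """

    def __init__(self, object_table, memory, capacity):
        self.object_table = object_table   # object ID -> base address
        self.memory = memory               # flat list of words
        self.capacity = capacity
        self.lines = OrderedDict()         # (object ID, offset) -> word
        self.translations = 0

    def load(self, object_id, offset):
        key = (object_id, offset)
        if key in self.lines:
            self.lines.move_to_end(key)    # hit: no translation needed
            return self.lines[key]
        self.translations += 1             # miss: translate through the object table
        value = self.memory[self.object_table[object_id] + offset]
        self.lines[key] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False) # evict least recently used entry
        return value


memory = [0] * 16
object_table = {7: 4}                # object 7 starts at address 4
memory[4:7] = [10, 20, 30]           # its three fields
cache = ObjectCache(object_table, memory, capacity=8)

before = [cache.load(7, i) for i in range(3)]

# A compacting garbage collector moves object 7 to address 0 and
# updates only the object table; the cached entries stay valid.
memory[0:3] = memory[4:7]
memory[4:7] = [0, 0, 0]
object_table[7] = 0

after = [cache.load(7, i) for i in range(3)]
print(before, after, cache.translations)   # [10, 20, 30] [10, 20, 30] 3
```

In this model, relocating an object does not require flushing the cache, because cached entries name objects rather than addresses. Only misses pay for the object-table lookup.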

Palmtop Multicomputing

Computer performance should be highly scalable. Today, a large number of off-the-shelf personal computers can be connected to challenge traditional supercomputers in many applications. Object-oriented systems are even better suited to this kind of architecture. By integrating the processor, memory and communications on a single chip, the benefits of these high performance implementations can be extended to small, low cost systems, even down to palmtop computers.

Links

Here are some links to related information available on the web. Many older designs were presented at traditional conferences before the internet became widespread and can only be found in print, but they are still very much worth studying.

Java CPUs:

Other Object Oriented Architectures:

Multimedia CPUs:

Integrated Memory/CPU designs:

  • Berkeley IRAM group

Other:

  • Merlin Project
  • Transmeta patent - hardware support for dynamic code translation
  • Sun's Hotspot technology
  • Sun's Self language, where this technology was developed



(c) 2000 Merlintec Computers. Please send comments to the Webmaster