Merlintec Computers

TACHYON

Media Coprocessors

The main idea - replace hardware with software, but run it in parallel.

The Architecture

A main processor, optimized for multithreading, is connected to several vectorizing coprocessors. These can be divided into data transformation units and data movement units. One alternative would be to connect all these elements via a crossbar, but it is probably more reasonable to use one or two busses and add direct paths between each unit and its neighbors.
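To get a rough feel for why a full crossbar may be the less reasonable choice, the sketch below compares how the two options grow with the number of units. It rests on assumptions that are not in the text: the crossbar is a full n-by-n switch, the units sit in a single row so each has at most two neighbors, and the function names are hypothetical.

```c
#include <assert.h>

/* Crosspoint switches in a full crossbar joining n units
   (n inputs by n outputs). Grows with the square of n. */
unsigned crossbarCrosspoints(unsigned n)
{
    return n * n;
}

/* Shared busses plus one direct path between each unit and its next
   neighbor, assuming the units are arranged in a row. Grows linearly. */
unsigned busPlusNeighborPaths(unsigned n, unsigned busses)
{
    assert(n > 0);
    return busses + (n - 1);
}
```

With one main processor and seven coprocessors (eight units, a made-up figure), this gives 64 crosspoints against 9 paths for a two-bus design. The two numbers count different kinds of hardware, so they only indicate the trend.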

The software shouldn't be hard to write. Let's look at the inner loop for a digital filter:

        for ( i = 16 ; i < inBufSize ; ++ i )
             {
             outBuf[i] = 0;
             for ( k = 0 ; k < 16 ; ++ k )
                  outBuf[i] += filter[k]*inBuf[i-k];
             }

A previous loop (not shown here) initialized outBuf[0 to 15]. If we suppose that the coprocessor's ALU has a datapath that can handle four outBuf[]s at a time, we could rewrite it like this:

        for ( i2 = 16 ; i2+3 < inBufSize ; i2 += 4 )
             {
             outBuf[i2] = 0;
             outBuf[i2+1] = 0;
             outBuf[i2+2] = 0;
             outBuf[i2+3] = 0;
             for ( k = 0 ; k < 16 ; ++ k )
                  {
                  outBuf[i2] += filter[k]*inBuf[i2-k];
                  outBuf[i2+1] += filter[k]*inBuf[i2+1-k];
                  outBuf[i2+2] += filter[k]*inBuf[i2+2-k];
                  outBuf[i2+3] += filter[k]*inBuf[i2+3-k];
                  }
             }
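The two loops can be checked against each other on an ordinary processor. The sketch below is only that: it assumes int samples and coefficients (the text does not give the element type), and the function names and the TAPS constant are hypothetical. It also makes one property of the unrolled loop visible: when inBufSize-16 is not a multiple of four, the last one to three samples are not computed, so the function returns the index where it stopped.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 16

/* Reference version: one output sample per outer iteration. */
void fir_scalar(const int *inBuf, int *outBuf, size_t inBufSize,
                const int filter[TAPS])
{
    for (size_t i = TAPS; i < inBufSize; ++i) {
        outBuf[i] = 0;
        for (size_t k = 0; k < TAPS; ++k)
            outBuf[i] += filter[k] * inBuf[i - k];
    }
}

/* Unrolled version: four output samples per outer iteration.
   Returns the index of the first sample it did NOT compute, so the
   caller can finish any leftover samples with the scalar loop. */
size_t fir_unrolled4(const int *inBuf, int *outBuf, size_t inBufSize,
                     const int filter[TAPS])
{
    size_t i2;

    assert(inBuf != NULL && outBuf != NULL);
    for (i2 = TAPS; i2 + 3 < inBufSize; i2 += 4) {
        outBuf[i2] = outBuf[i2 + 1] = outBuf[i2 + 2] = outBuf[i2 + 3] = 0;
        for (size_t k = 0; k < TAPS; ++k) {
            /* the same coefficient feeds four neighboring outputs */
            outBuf[i2]     += filter[k] * inBuf[i2 - k];
            outBuf[i2 + 1] += filter[k] * inBuf[i2 + 1 - k];
            outBuf[i2 + 2] += filter[k] * inBuf[i2 + 2 - k];
            outBuf[i2 + 3] += filter[k] * inBuf[i2 + 3 - k];
        }
    }
    return i2;
}
```

For a 37-sample buffer, for example, fir_unrolled4() stops at index 36 and leaves outBuf[36] to the caller.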

Now we will keep outBuf[i2 to i2+3] in R0 during the loop, filter[] in R1 to R4, and a sliding window of 16 inBuf[] samples (running from inBuf[i2-15] up to inBuf[i2+3] over one pass) in R5 to R8. The code now looks like:

         R1 = quad(filter[0],filter[1],filter[2],filter[3]);
         R2 = quad(filter[4],filter[5],filter[6],filter[7]);
         R3 = quad(filter[8],filter[9],filter[10],filter[11]);
         R4 = quad(filter[12],filter[13],filter[14],filter[15]);
         R5 = quad(inBuf[1],inBuf[2],inBuf[3],inBuf[4]);
         R6 = quad(inBuf[5],inBuf[6],inBuf[7],inBuf[8]);
         R7 = quad(inBuf[9],inBuf[10],inBuf[11],inBuf[12]);
         R8 = quad(inBuf[13],inBuf[14],inBuf[15],inBuf[16]);
         for ( i2 = 16 ; i2+3 < inBufSize ; i2 += 4 )
              {
              R0 = 0;
              R9 = spread(R1,3,4); /* four copies of filter[3] */
              quadMAC(R0,R9,R8);
              R9 = spread(R2,3,4); /* four copies of filter[7] */
              quadMAC(R0,R9,R7);
              R9 = spread(R3,3,4); /* four copies of filter[11] */
              quadMAC(R0,R9,R6);
              R9 = spread(R4,3,4); /* four copies of filter[15] */
              quadMAC(R0,R9,R5);
              shiftIn(R5,R6,R7,R8,inBuf[i2+1]);

              R9 = spread(R1,2,4); /* four copies of filter[2] */
              quadMAC(R0,R9,R8);
              R9 = spread(R2,2,4); /* four copies of filter[6] */
              quadMAC(R0,R9,R7);
              R9 = spread(R3,2,4); /* four copies of filter[10] */
              quadMAC(R0,R9,R6);
              R9 = spread(R4,2,4); /* four copies of filter[14] */
              quadMAC(R0,R9,R5);
              shiftIn(R5,R6,R7,R8,inBuf[i2+2]);

              R9 = spread(R1,1,4); /* four copies of filter[1] */
              quadMAC(R0,R9,R8);
              R9 = spread(R2,1,4); /* four copies of filter[5] */
              quadMAC(R0,R9,R7);
              R9 = spread(R3,1,4); /* four copies of filter[9] */
              quadMAC(R0,R9,R6);
              R9 = spread(R4,1,4); /* four copies of filter[13] */
              quadMAC(R0,R9,R5);
              shiftIn(R5,R6,R7,R8,inBuf[i2+3]);

              R9 = spread(R1,0,4); /* four copies of filter[0] */
              quadMAC(R0,R9,R8);
              R9 = spread(R2,0,4); /* four copies of filter[4] */
              quadMAC(R0,R9,R7);
              R9 = spread(R3,0,4); /* four copies of filter[8] */
              quadMAC(R0,R9,R6);
              R9 = spread(R4,0,4); /* four copies of filter[12] */
              quadMAC(R0,R9,R5);
              shiftIn(R5,R6,R7,R8,inBuf[i2+4]);

              writeQuad(&outBuf[i2],R0);
              }
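The register code can also be emulated in plain C to check the lane bookkeeping on an ordinary processor. This is a sketch under stated assumptions, not the coprocessor's documented behavior: the Quad type, the int element type and the exact semantics of quad(), spread(), quadMAC(), shiftIn() and writeQuad() are inferred from the listing, and fir_quad() is a hypothetical wrapper. The guard on the last shiftIn() is an addition as well, because on the final pass the listing reads inBuf[i2+4], which can lie one element past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int lane[4]; } Quad;   /* assumed: four int lanes */

static Quad quad(int a, int b, int c, int d)
{
    Quad q = { { a, b, c, d } };
    return q;
}

/* Assumed meaning of spread(r,n,4): copy lane n into all four lanes. */
static Quad spread(Quad r, int n, int count)
{
    assert(count == 4 && n >= 0 && n < 4);
    return quad(r.lane[n], r.lane[n], r.lane[n], r.lane[n]);
}

/* Assumed meaning of quadMAC: acc += a * b, lane by lane. */
static void quadMAC(Quad *acc, Quad a, Quad b)
{
    for (int j = 0; j < 4; ++j)
        acc->lane[j] += a.lane[j] * b.lane[j];
}

/* Assumed meaning of shiftIn: slide the 16-sample window held in
   r5..r8 by one sample, dropping the oldest and appending x. */
static void shiftIn(Quad *r5, Quad *r6, Quad *r7, Quad *r8, int x)
{
    Quad *w[4] = { r5, r6, r7, r8 };
    for (int r = 0; r < 4; ++r)
        for (int j = 0; j < 4; ++j) {
            if (j < 3)
                w[r]->lane[j] = w[r]->lane[j + 1];
            else if (r < 3)
                w[r]->lane[3] = w[r + 1]->lane[0];
            else
                w[r]->lane[3] = x;
        }
}

static void writeQuad(int *dst, Quad r)
{
    for (int j = 0; j < 4; ++j)
        dst[j] = r.lane[j];
}

/* Emulates the listing; returns the first index it did not compute. */
size_t fir_quad(const int *inBuf, int *outBuf, size_t inBufSize,
                const int filter[16])
{
    Quad F[4], W[4];   /* F[0..3] play R1..R4, W[0..3] play R5..R8 */
    size_t i2;

    assert(inBufSize > 16);
    for (int r = 0; r < 4; ++r) {
        F[r] = quad(filter[4*r], filter[4*r+1], filter[4*r+2], filter[4*r+3]);
        W[r] = quad(inBuf[4*r+1], inBuf[4*r+2], inBuf[4*r+3], inBuf[4*r+4]);
    }
    for (i2 = 16; i2 + 3 < inBufSize; i2 += 4) {
        Quad R0 = quad(0, 0, 0, 0);
        for (int n = 3; n >= 0; --n) {
            /* R1 pairs with R8, R2 with R7, R3 with R6, R4 with R5 */
            for (int r = 0; r < 4; ++r)
                quadMAC(&R0, spread(F[r], n, 4), W[3 - r]);
            size_t next = i2 + (size_t)(4 - n);
            /* added guard: feed 0 instead of reading past the buffer */
            shiftIn(&W[0], &W[1], &W[2], &W[3],
                    next < inBufSize ? inBuf[next] : 0);
        }
        writeQuad(&outBuf[i2], R0);
    }
    return i2;
}
```

Comparing the output of fir_quad() with the original two-level loop over the same buffer is a quick way to confirm that each filter coefficient meets the right window register.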

The spread() and quadMAC() operations each correspond directly to a single instruction in the data transformation unit. The shiftIn() and writeQuad() instructions, as well as the "for", require collaboration between the transform unit and the data movement unit (DMA). Everything else is done in the main processor.

In the first pass through the loop, the main processor fetches each instruction from memory, and the coprocessors execute the instructions and save them in their internal registers. At the end of the first pass, the main processor blocks the current thread and can start executing totally unrelated code while the two coprocessors start the second pass through the loop all by themselves. When the DMA gets to the end of its count (the expression "i2+3 < inBufSize" in the "for" head becomes false), the two coprocessors stop and a signal is sent to the main processor, reactivating the suspended thread. When this thread runs again, the code immediately following the "for" starts executing, and everything works exactly as if the whole code had been executed in the main processor.
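The record-then-replay idea can be modeled in a few lines of C. This is a purely sequential toy model under loud assumptions: the Coprocessor struct, the Op function-pointer type, issue(), runAlone() and the two sample operations are all hypothetical, a single accumulator stands in for the register file, and the main processor's unrelated work during the replay is not modeled at all.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_OPS = 64 };

typedef void (*Op)(int *acc, int sample);  /* hypothetical instruction */

typedef struct {
    Op     ops[MAX_OPS];   /* loop body captured during the first pass */
    size_t nOps;
    int    acc;            /* stand-in for the coprocessor registers */
    int    threadBlocked;  /* 1 while the issuing thread is suspended */
} Coprocessor;

/* First pass: the main processor fetches an instruction, and the
   coprocessor both executes it and saves it for later passes. */
static void issue(Coprocessor *c, Op op, int sample)
{
    assert(c->nOps < MAX_OPS);
    op(&c->acc, sample);
    c->ops[c->nOps++] = op;
}

/* Later passes: the coprocessor replays the saved loop body on its own
   until the DMA count runs out, then signals the suspended thread. */
static void runAlone(Coprocessor *c, const int *data, size_t dmaCount)
{
    c->threadBlocked = 1;
    for (size_t pass = 1; pass < dmaCount; ++pass)
        for (size_t n = 0; n < c->nOps; ++n)
            c->ops[n](&c->acc, data[pass]);
    c->threadBlocked = 0;   /* the "signal" that reactivates the thread */
}

/* Two made-up loop-body instructions. */
static void opAdd(int *acc, int s)    { *acc += s; }
static void opDouble(int *acc, int s) { (void)s; *acc *= 2; }

/* Runs "for each sample: acc += sample; acc *= 2" the Tachyon way. */
int runLoop(const int *data, size_t count)
{
    Coprocessor c = { 0 };

    assert(count > 0);
    issue(&c, opAdd, data[0]);      /* first pass, driven by the main CPU */
    issue(&c, opDouble, data[0]);
    runAlone(&c, data, count);      /* remaining passes, coprocessor only */
    assert(!c.threadBlocked);       /* the thread has been woken up */
    return c.acc;
}
```

The value returned by runLoop() matches what a plain loop on the main processor would compute, which is the property the text describes.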

Data Movement Units

smart DMAs

Data Transform Units

data flow ALUs

Soft I/O

Alto, Sinclair/Cheap Video
