• Computation kernels are independent and self-contained. Computation kernels are localized such that there are no data dependencies between them. A programmer can annotate portions of a program that exhibit this behavior for mapping onto a stream processor or accelerator.
• Computation groups are relatively static. The processing performed in each computation group is regular or repetitive, often in the form of a loop structure. This regularity gives the compiler opportunities to organize the computation as well as the regular access patterns to memory.
• Explicit definition of communication. Computation kernels produce an output stream from one or more input streams. The streams, along with scalar values that hold persistent application state, are identified explicitly as variables in a communication stream or signal between kernels.
• Data movement exposed to the programmer. A programmer can explicitly define the movement of data from memory or to other computation kernels. Hardware mechanisms such as a DMA engine or stream unit provide this capability without interrupting the processor. The stream processing model seeks either to minimize data movement by localizing the computation or to overlap computation with data movement. Furthermore, the programmer can retune the application's memory accesses as memory bottlenecks arise.
A number of streaming processor architectures have been developed in recent years. Examples of stream processors include RAW, Imagine, Merrimac, and the RSVP™ architecture [8,9]. There is also another class of streaming architectures with origins in reconfigurable platforms such as FPGAs. These architectures rely on the flexibility of the platform to synthesize streaming accelerators based on programmer definition. In comparison to the above-mentioned architectures, a set of compiler tools creates optimized hardware configurations rather than mapping computation onto an existing design. They are associated with a programming language or compiler tool that allows software developers to configure hardware for stream computation. Examples include SCORE, ASC, and Streams-C. While each approach is different, stream architectures provide hardware mechanisms that can configure their datapaths for different types of parallelism in stream computation. Furthermore, they include programmable communication infrastructure to move data based on a programmer-defined API. In this paper, we propose the use of stream descriptors [8,9] in a reconfigurable FPGA platform to generate an optimized memory subsystem. Stream descriptors are a language extension for specifying memory access patterns, which are used by dedicated stream units to prefetch and assemble data. The programmer describes the computation independently from the stream descriptors, and a compiler then synthesizes the proper hardware for stream processing. The FPGA platform allows exploration of different configurations of the memory hierarchy. Once optimized for a particular class of applications, the design can be ported into standard or structured ASIC design flows for fabrication.
STREAM MEMORY HIERARCHY
A design framework is being developed to automatically generate synthesizable streaming accelerators. Using stream programming languages [9,14,15,16], which include the programmer's explicit definition of streams and their movement, an integrated memory subsystem can be built. This approach selects designs from a well-engineered framework consisting of accelerators and networks, rather than generating hardware from a generic representation of a high-level language. The memory subsystem builds upon stream units that move data based on stream descriptors, as shown in Figure 1. Single or multiple accelerators in various configurations can be built. Furthermore, systems with multiple scalar processors, buses, peripherals, or memory controllers can be configured such that the stream units and accelerators are placed appropriately according to the flow of data. Stream descriptors have recently been applied to stream processors [8,9] and peripherals [18,19] to exploit the deterministic movement of data from memory. In this paper, stream descriptors are applied to the entire memory subsystem so as to enable stream data movement throughout the computing platform. The goal of this research is to generate an optimized memory subsystem from stream programming input. Because data stream types and movement are explicitly defined, there are opportunities to optimize the memory subsystem by prefetching and by overlapping movement with computation. By distributing stream units throughout the memory subsystem, the design framework avoids large cache mechanisms that are inefficient for streaming computation and difficult to synthesize on FPGAs. The following section describes the stream descriptors used to capture stream access patterns in memory. Furthermore, an example stream unit design is described with preliminary results from synthesis.
Stream descriptors are mechanisms that allow the programmer to describe the shape and location of data in memory. Dedicated stream units can then use the stream descriptors to prefetch data from memory for the computing platform. Each stream unit handles all issues in loading/storing data: address calculation, byte alignment, data ordering, and the memory bus interface. A compiler can also schedule the loading of a stream descriptor that depends on run-time values. A stream descriptor is represented by the tuple (Type, Start_Address, Stride, Span, Skip, Size) where:
• Type indicates how many bytes are in each element (Type is 0 for bytes, 1 for 16-bit half-words, etc.)