The first RISC processors were introduced at a time when standard memory parts were faster than their contemporary microprocessors, but this situation did not persist for long. Subsequent advances in semiconductor process technology which have been exploited to make microprocessors faster have been applied differently to improve memory chips. Standard DRAM parts have got a little faster, but mostly they have been developed to have a much higher capacity.

In 1980 a typical DRAM part could hold 4 Kbits of data, with 16 Kbit chips arriving in 1981 and 1982. These parts would cycle at 3 or 4 MHz for random accesses, and at about twice this rate for local accesses (in page mode). Microprocessors at that time could request around two million memory accesses per second.

In 2000 DRAM parts have a capacity of 256 Mbits per chip, with random accesses operating at around 30 MHz. Microprocessors can request several hundred million memory accesses per second. If the processor is so much faster than the memory, it can only deliver its full performance potential with the help of a cache memory.
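The widening gap can be put into rough numbers. The sketch below divides the DRAM random-access rate by the processor's request rate for the two years quoted above. The function name is illustrative, and the figure of 300 million requests per second is an assumption standing in for "several hundred million"; the 3 MHz figure is the lower end of the 1980 range.

```python
def supply_to_demand(memory_rate_hz, processor_requests_per_s):
    """Ratio of what the memory can supply to what the processor can request.

    A value of 1 or more means the memory keeps up with the processor;
    a value below 1 means the processor must wait for memory.
    """
    return memory_rate_hz / processor_requests_per_s


# 1980: DRAM cycling at about 3 MHz, processor requesting about 2 million accesses/s.
ratio_1980 = supply_to_demand(3e6, 2e6)

# 2000: DRAM at about 30 MHz; 300 million requests/s is an assumed
# stand-in for "several hundred million".
ratio_2000 = supply_to_demand(30e6, 300e6)

print(f"1980: memory supplies {ratio_1980:.1f}x the processor's demand")
print(f"2000: memory supplies {ratio_2000:.1f}x the processor's demand")
```

Under these assumptions the memory went from comfortably keeping up with the processor to satisfying roughly one request in ten, which is the shortfall a cache is intended to hide.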

A cache memory is a small, very fast memory that retains copies of recently used memory values. It operates transparently to the programmer, automatically deciding which values to keep and which to overwrite. These days it is usually implemented on the same chip as the processor. Caches work because programs normally display the property of locality, which means that at any particular time they tend to execute the same instructions many times (for instance in a loop) on the same areas of data (for instance a stack).
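The effect of locality can be illustrated with a toy simulation. The following is a minimal sketch, assuming a direct-mapped cache with 8 lines of 4 words each; that organization, the function name and the two address traces are all chosen purely for illustration and do not describe any particular processor.

```python
def simulate_direct_mapped(addresses, num_lines=8, line_size=4):
    """Return the hit rate of a toy direct-mapped cache over an address trace.

    Addresses are word addresses. Each cache line remembers only the tag of
    the memory block it currently holds; a miss simply overwrites that tag.
    """
    if not addresses:
        return 0.0
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_size   # which memory block the word belongs to
        index = block % num_lines   # the single cache line the block may occupy
        tag = block // num_lines    # distinguishes blocks that share a line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag       # miss: fetch the block, evicting the old one
    return hits / len(addresses)


# A loop that touches the same 16 words on each of 100 iterations.
loop_trace = [addr for _ in range(100) for addr in range(16)]

# 1600 accesses that never revisit a block (stride of 64 words).
scattered_trace = list(range(0, 1600 * 64, 64))

print(simulate_direct_mapped(loop_trace))       # 0.9975
print(simulate_direct_mapped(scattered_trace))  # 0.0
```

The loop misses only four times, once for each block on its first pass, and hits on every later access. The scattered trace never reuses a block, so the cache gives it no benefit at all.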

Caches can be built in many ways. At the highest level a processor can have one of the following two organizations:

• A unified cache. This is a single cache for both instructions and data.

• Separate instruction and data caches. This organization is sometimes called a modified Harvard architecture.

Both these organizations have their merits. The unified cache automatically adjusts the proportion of the cache memory used by instructions according to the current program requirements, giving a better performance than a fixed partitioning. On the other hand the separate caches allow load and store instructions to execute in a single clock cycle, since the data access can take place at the same time as an instruction fetch.
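The single-cycle claim can be made concrete with a deliberately crude cycle count. The sketch below assumes that every access hits in the cache, that a unified cache has a single port serving one access per cycle, and that the processor otherwise completes one instruction per cycle. The function and the four-instruction program are hypothetical.

```python
def cycles(instructions, split_caches):
    """Count cache-limited cycles for a sequence of instruction kinds.

    Every instruction needs one instruction fetch. A load or store also needs
    one data access, which costs an extra cycle on a single-ported unified
    cache but overlaps with a fetch when the caches are separate.
    """
    total = 0
    for op in instructions:
        if op in ("load", "store") and not split_caches:
            total += 2  # instruction fetch and data access share one port
        else:
            total += 1  # fetch only, or fetch and data access in parallel
    return total


program = ["add", "load", "sub", "store"]

print(cycles(program, split_caches=False))  # 6
print(cycles(program, split_caches=True))   # 4
```

The model ignores misses and the unified cache's more flexible use of its capacity, so it shows only one side of the trade-off described above.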