*/P1. Solution: (a) ld f2, 0(r2) F D X M W ld f3, 0(r3) F D X M W multd f2, f2, f3 F D s X X X X M W addd f2, f2, f1 F D s s s s X X X X M W addi r2, r2, 8 F D X M W addi r3, r3, 8 F D X M W subi r1, r1, 1 F D X M W sd -8(r2), f2 F D s s s X M W bnez r1, loop F D s s s X M W total: 16 cycles per iteration (b) unroll 4 times: ld f2, 0(r2) F D X M W ld f3, 0(r3) F D X M W ld f4, 8(r2) F D X M W ld f5, 8(r3) F D X M W ld f6, 16(r2) F D X M W ld f7, 16(r3) F D X M W ld f8, 24(r2) F D X M W ld f9, 24(r3) F D X M W multd f2, f2, f3 F D X X X X M W multd f4, f4, f5 F D X X X X M W multd f6, f6, f7 F D X X X X M W multd f8, f8, f9 F D X X X X M W addd f2, f2, f1 F D X X X X M W addd f4, f4, f1 F D X X X X M W addd f6, f6, f1 F D X X X X M W addd f8, f8, f1 F D X X X X M W addi r2, r2, 32 F D X M W addi r3, r3, 32 F D X M W subi r1, r1, 4 F D X M W sd -32(r2), f2 F D X M W sd -24(r2), f4 F D X M W sd -16(r2), f6 F D X M W sd -8(r2), f8 F D X M W bnez r1, loop F D X M W total: 28 cycles per 4 iterations, or 7 cycles per iteration (c) ld f2, 0(r2) F D X M W ld f3, 0(r3) F D X M W ld f4, 8(r2) F D X M W ld f5, 8(r3) F D X M W ld f6, 16(r2) F D X M W ld f7, 16(r3) F D X M W ld f8, 24(r2) F D X M W ld f9, 24(r3) F D X M W multd f2, f2, f3 F D X X X X M W multd f4, f4, f5 F D X X X X M W multd f6, f6, f7 F D X X X X M W multd f8, f8, f9 F D X X X X M W addd f2, f2, f1 F D s s X X X X M W addd f4, f4, f1 F D s s X X X X M W addd f6, f6, f1 F D s s X X X X M W addd f8, f8, f1 F D s s X X X X M W addi r2, r2, 32 F D X M W addi r3, r3, 32 F D X M W subi r1, r1, 4 F D X M W sd -32(r2), f2 F D s s s X M W sd -24(r2), f4 F D s s X M W sd -16(r2), f6 F D s s s X M W sd -8(r2), f8 F D s s X M W bnez r1, loop F D X M W total: 16 cycles per 4 iterations, or 4 cycles per iteration (d) ld f2, 0(r2) F D X M W ld f3, 0(r3) F D X M W multadd f2, f3, f1 F D s X X X X M W addi r2, r2, 8 F D X M W addi r3, r3, 8 F D X M W subi r1, r1, 1 F D X M W sd -8(r2), f2 F D X M W bnez r1, loop F D X M W total: 12 cycles per iteration since cycle is now 10% longer the final gain will be only: 16/(12*1.1)=1.2 P2. Solution: (a) Spatial locality: neighboring data is likely to be used soon Temporal locality: data accessed now is likely to be used again soon (b) Cold misses: first access to the block Conflict misses: block was evicted because of a conflict in the set Capacity misses: block was evicted because cache was too small (c) miss rate: large block size: more data is brought in on a cache miss and is likely to be used soon high-associativity caches: reduce conflict misses victim cache: evicted blocks stay close and are likely to be used soon hardware prefetch: reduce cold misses software prefetch: reduce cold misses compiler optimizations: improve data layout to maximise reuse of data miss penalty: priorisation of reads: service read misses first and release processor earlier critical word first: give the request word to processor more quickly non-blocking caches: don't queue up hits after a miss L2 caches: L2 hit time is shorter than memory access time hit time: small and simple caches: quicker access and tag check virtual address caches: no need to translate address before cache access (d) physical address caches: + no aliasing problem: + easier to maintain coherence with I/O DMA operations - translation in the critical cache access path virtual address caches: - aliasing problem: flush cache on context switch need to make sure OS and user application use the same virtual address for shared data structures - more difficult to maintain coherence with I/O DMA operations + no translation in the critical cache access path P3. Solution: (a) current counter value ! prediction ! actual outcome ! new counter value 00 NT NT 00 00 NT T 01 01 NT NT 00 01 NT T 10 10 T NT 01 10 T T 11 11 T NT 10 11 T T 11 (b) Iteration ! current counter value ! prediction ! actual outcome ! new counter value 1 00 NT NT 00 -> hit 2 00 NT T 01 -> miss 3 01 NT NT 00 -> hit 4 00 NT T 01 -> miss ... (c) ld f1, 0(r2) F D X M W F D X M W ld f2, 0(r3) F D X M W F D X M W bnez f1, else F D X M W F D X M W addd f2, f0, f0 F D X X X X M W F D X k bez r0, fall F D X M W F D k divd f2, f2, f1 F k F F D X X X X M W addi r2, r2, 8 F D X M W F D X M W addi r3, r3, 8 F D X M W F D X M W subi r1, r1, 1 F D X M W F D X M W sd -8(r3), f2 F D X M W F D X M W bnez r1, loop F D X M W F D X M W Two iterations take 27 cycles. (d) loop: ld f1, 0(r2) ld f2, 0(r3) seqp p1, f1, f0 adddp f2, f0, f0, p1 divd f2, f2, f1, ^p1 addi r2, r2, 8 addi r3, r3, 8 subi r1, r1, 1 sd -8(r3), f2 bnez r1, loop (e) ld f1, 0(r2) F D X M W F D X M W ld f2, 0(r3) F D X M W F D X M W seqp p1, f1, f0 F D X M W F D X M W adddp f2, f0, f0, p1 F D X X X X M W F D k divd f2, f2, f1, ^p1 F k F D X X X X M W addi r2, r2, 8 F D X M W F D X M W addi r3, r3, 8 F D X M W F D X M W subi r1, r1, 1 F D X M W F D X M W sd -8(r3), f2 F D X M W F D X M W bnez r1, loop F D X M W F D X M W Two iterations now take 24 cycles. */