Why is PTLsim so fast?
PTLsim is fast for a variety of reasons:
• Because of its unique co-simulation approach (see above), PTLsim can run native x86 instructions in hardware to reproduce the exact semantics of each PTLsim micro-op, rather than using slow emulations written in generic C.
• PTLsim uses vectorized SSE operations and x86-specific instructions to do O(1) parallel matching on most of the associative structures it models, rather than the naive linear scan used by competing simulators.
• Branch-free and cache-aware algorithms are used pervasively in PTLsim, yet through the use of C++ templates and macros, the source code remains very clear.
• Cache profiling is used to reduce the size of key structures so that the entire working set of the out-of-order core is under ~1 MB and hence fits in the L2 and/or L1 cache of most processors.
• PTLsim is self-profiling, allowing us to identify hot spots at any time.
• Users can balance performance and accuracy by turning off certain features (e.g.