Is there a way to translate measurements between the test, train, and ref data sets?
No. There is no formula for converting between test, train, and ref measurements. The smaller mtest/mtrain and ltest/ltrain data sets would show different scaling behavior from the larger mref and lref data sets, and because of their smaller memory footprints, their performance would vary less across cache sizes. We expect some correlation among them all, i.e., machines with higher results on one data set tend to have higher results on the others, but that tendency does not amount to a conversion formula that holds for all systems.