Not a full answer, I simply don’t have time to do the detailed benchmarks your question needs but hopefully useful.
Know your enemy
You are targeting a combination of the JVM (in essence the JIT) and the underlying CPU/Memory subsystem. Thus “This is faster on JVM X” is not likely to be valid in all cases as you move into more aggressive optimisations.
If your target market/application will largely run on a particular architecture you should consider investing in tools specific to it.
* If your performance on x86 is the critical factor then intel’s VTune is excellent for drilling down into the sort of jit output analysis you describe.
* The differences between 64 bit and 32 bit JITs can be considerable, especially on x86 platforms where calling conventions can change and enregistering opportunities are very different.
Get the right tools
You would likely want to get a sampling profiler. The overhead of instrumentation (and the associated knock on on things like inlining, cache pollution and code size inflation) for your specific needs would be far too great. The intel VTune analyser can actually be used for Java though the integration is not so tight as others.
If you are using the sun JVM and are happy only knowing what the latest/greatest version is doing then the options available to investigate the output of the JIT are considerable if you know a bit of assembly.
This article details some interesting analysis using this functionality
Learn from other implementations
The Change history change history indicates that previous inline assembly was in fact counter productive and that allowing the compiler to take total control of the output (with tweaks in code rather than directives in assembly) yielded better results.
Some specifics
Since LZF is, in an efficient unmanaged implementation on modern desktop CPUS, largely memory bandwidth limited (hence it being compered to the speed of an unoptimised memcpy) you will need you code to remain entirely within level 1 cache.
As such any static fields you cannot make into constants should be placed within the same class as these values will often be placed within the same area of memory devoted to the vtables and meta data associated with classes.
Object allocations which cannot be trapped by Escape Analysis (only in 1.6 onwards) will need to be avoided.
The c code makes aggressive use of loop unrolling. However the performance of this on older (1.4 era) VM’s is heavily dependant on the mode the JVM is in. Apparently latter sun jvm versions are more aggressive at inlining and unrolling, especially in server mode.
The prefetch instrctions generated by the JIT can make all the difference on code like this which is near memory bound.
“It’s coming straight for us”
Your target is moving, and will continue to. Again Marc Lehmann’s previous experience:
default HLOG size is now 15 (cpu caches have increased)
Even minor updates to the jvm can involve significant compiler changes
6544668 Don’t vecorized array operations that can’t be aligned at runtime.
6536652 Implement some superword (SIMD) optimizations
6531696 don’t use immediate 16-bits value store to memory on Intel cpus
6468290 Divide and allocate out of eden on a per cpu basis
Captain Obvious
Measure, Measure, Measure. IF you can get your library to include (in a separate dll) a simple and easy to execute benchmark that logs the relevant information (vm version, cpu, OS, command line switches etc) and makes this simple to send back to you you will increase your coverage, best of all you’ll cover those people using it that care.