I follow the excellent weekly posts by Mike Ash, and entered a brief discussion in the comments about toll-free bridging, in particular the difference between calling a method via Objective-C (objc_msgSend) and its equivalent CoreFoundation C call. Mike suggested adding it to his original suite of tests, which led to the following results.
iPhone 3G
ARM1176 ~412MHz / 2.4ns per cycle
| Name | Iterations | Total time (sec) | Time per (ns) |
| --- | --- | --- | --- |
| IMP-cached message send | 100000000 | 3.9 | 38.6 |
| C++ virtual method call | 100000000 | 5.0 | 49.9 |
| Floating-point division | 10000000 | 0.8 | 81.3 |
| Float division with int conversion | 10000000 | 0.8 | 81.4 |
| 16 byte memcpy | 10000000 | 1.4 | 136.0 |
| Objective-C message send | 100000000 | 14.9 | 148.6 |
| Integer division | 100000000 | 16.2 | 162.2 |
| CF CFArrayGetValueAtIndex | 10000000 | 2.0 | 201.7 |
| Objective-C objectAtIndex: | 10000000 | 4.2 | 418.3 |
| NSInvocation message send | 100000 | 0.2 | 1833.2 |
| 16 byte malloc/free | 10000000 | 27.3 | 2729.8 |
| NSObject alloc/init/release | 100000 | 1.4 | 14179.1 |
| NSAutoreleasePool alloc/init/release | 100000 | 1.9 | 18956.7 |
| 16MB malloc/free | 1000 | 0.0 | 47811.3 |
| Zero-second delayed perform | 1000 | 0.8 | 803419.3 |
| pthread create/join | 100 | 0.1 | 1085830.0 |
| 1MB memcpy | 100 | 1.0 | 9902796.7 |
iPhone 3GS (ARMv7 binary)
ARM Cortex A8 ~600MHz / 1.66 ns per cycle
| Name | Iterations | Total time (sec) | Time per (ns) |
| --- | --- | --- | --- |
| IMP-cached message send | 100000000 | 1.2 | 11.8 |
| C++ virtual method call | 100000000 | 4.3 | 42.9 |
| Objective-C message send | 100000000 | 5.9 | 59.2 |
| CF CFArrayGetValueAtIndex | 10000000 | 1.0 | 97.9 |
| Integer division | 100000000 | 9.8 | 98.4 |
| 16 byte memcpy | 10000000 | 1.1 | 109.3 |
| Floating-point division | 10000000 | 1.2 | 118.5 |
| Objective-C objectAtIndex: | 10000000 | 1.3 | 129.0 |
| Float division with int conversion | 10000000 | 1.4 | 142.6 |
| 16 byte malloc/free | 10000000 | 7.5 | 748.6 |
| NSInvocation message send | 100000 | 0.1 | 806.0 |
| NSObject alloc/init/release | 100000 | 0.5 | 4793.1 |
| NSAutoreleasePool alloc/init/release | 100000 | 0.5 | 4953.1 |
| 16MB malloc/free | 1000 | 0.0 | 17969.2 |
| Zero-second delayed perform | 1000 | 0.2 | 211840.4 |
| pthread create/join | 100 | 0.0 | 214742.5 |
| 1MB memcpy | 100 | 0.3 | 3162774.6 |
Note that I reduced the iteration counts from the original tests, so whilst the total times are significantly lower, the per-iteration times still reflect overall performance. Compared to Mike's results, these show that the IMP-cached message send is indeed the fastest call, as expected, but only after I changed to a release build. I also compiled these with Thumb disabled.
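The "Time per (ns)" column is simply the total elapsed time divided by the iteration count, which is why reducing the iterations leaves it comparable. Here is a minimal C++ sketch of that calculation. It is not Mike's harness, and the names `time_per_iteration_ns` and `ns_per_iteration` are made up for illustration.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original test suite): calls `body`
// `iterations` times and returns the mean wall-clock cost per call in ns.
template <typename Body>
double time_per_iteration_ns(std::uint64_t iterations, Body body) {
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// The same division applied to one row of the tables above:
// total seconds and iteration count in, nanoseconds per iteration out.
double ns_per_iteration(double total_seconds, std::uint64_t iterations) {
    return total_seconds * 1e9 / static_cast<double>(iterations);
}
```

For the 3G's Objective-C message send row, `ns_per_iteration(14.9, 100000000)` gives about 149 ns. The table shows 148.6, presumably because it was computed from the unrounded total. A real harness would also need to stop the compiler from optimising the loop body away, which this sketch does not attempt.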
Observations
- The IMP-cached message send is significantly faster on the newer Cortex CPU (11.8ns against 38.6ns). I have read of improvements in the branch prediction logic, which is particularly important due to the greater penalty of a misprediction in the longer A8 pipeline. The code for executing the call is

      blx r8

  r8 contains the target address of the function, and remains so for the duration of the test.
- For the 3GS, the Objective-C message send is very close to the C++ virtual method call. I ran this test several times, and the behaviour didn't change. The virtual method call is an indirect load of the pc register

      ldr pc, [r3]

  Without being able to access the PMC registers, I can't be sure of mispredictions; however, I know that 9 instructions are executed every iteration in the C++ test. At one instruction per cycle, that suggests around 15ns / iteration, but we're at 42.9. Adding 13 cycles every iteration (21.58ns) for a misprediction would get us to around 37ns / iteration, which is much closer. Stepping into objc_msgSend shows the cached method being found on the first pass, for a total of 28 instructions per iteration. Given there are significantly more instructions for the Objective-C call, we're probably seeing the benefits of the dual-issue architecture.
- Memory performance of the 3GS is significantly higher. I've done some other micro-benchmarks, showing the 2nd gen at around 200MB/s and the 3rd gen at around 800MB/s. With some very well placed cache preloads, I've actually pushed the ARM1176 to almost 300MB/s.
- Fetching an array element through the CoreFoundation API (CFArrayGetValueAtIndex) rather than objectAtIndex: is about 2x faster on the older device (201.7ns against 418.3ns); however, the gap is much smaller on the newer hardware (97.9ns against 129.0ns). We've seen significant improvements to objc_msgSend performance on the 3GS (59.2ns against 148.6ns), which undoubtedly makes up much of the gap.
- Floating point performance for scalar operations is slower on the newer device, despite its higher clock: a floating-point division takes 118.5ns on the 3GS against 81.3ns on the 3G.
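The back-of-the-envelope numbers in these observations can be reproduced with a few lines of C++. This is only a sketch: the function names are invented for illustration, and the estimate assumes one instruction retired per cycle plus a fixed stall, which is a simplification rather than a model of the real pipeline.

```cpp
// Cycle time in nanoseconds for a clock given in MHz.
double ns_per_cycle(double clock_mhz) {
    return 1000.0 / clock_mhz;
}

// Estimated cost of one iteration. Assumes (for illustration only) one
// instruction retired per cycle, plus a fixed stall of `penalty_cycles`.
double estimate_ns(int instructions, int penalty_cycles, double clock_mhz) {
    return (instructions + penalty_cycles) * ns_per_cycle(clock_mhz);
}

// How many times faster `fast_ns` is than `slow_ns`.
double speedup(double slow_ns, double fast_ns) {
    return slow_ns / fast_ns;
}
```

With the 3GS figures, `estimate_ns(9, 0, 600.0)` gives about 15ns and `estimate_ns(9, 13, 600.0)` about 36.7ns for the C++ virtual call. For the array lookup, `speedup(418.3, 201.7)` is about 2.07 on the 3G and `speedup(129.0, 97.9)` about 1.32 on the 3GS.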
Source code for this test is available here.


