So, if you look at the power consumption of running convolutions on complex inputs on the iPhone, it is pretty clear that the CoreML engine grabs 100% of its data from memory: it has no way to cache or directly access the outputs of the DSPs, or even a memory buffer that was not set up as a Metal buffer, array&lt;texture&gt; or fragment buffer.
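To make the cost concrete, here is a minimal Swift sketch of the round trip this forces today. The helper name, the plain Float array standing in for the DSP output and the 16 000-sample size are all my own illustration, not anything from CoreML itself:

```swift
import CoreML

// A minimal sketch of the extra DRAM round trip: DSP output that CoreML
// cannot consume in place has to be repackaged before the model can run.
func makeModelInput(from dspSamples: [Float]) throws -> MLMultiArray {
    let input = try MLMultiArray(shape: [1, NSNumber(value: dspSamples.count)],
                                 dataType: .float32)

    // Extra write: every sample is copied into DRAM a second time, just so it
    // lives in a container CoreML understands (MLMultiArray / MTLBuffer).
    let dst = input.dataPointer.bindMemory(to: Float.self,
                                           capacity: dspSamples.count)
    for (i, sample) in dspSamples.enumerated() {
        dst[i] = sample
    }

    // Extra read: the convolution then pulls the same bytes back out of DRAM
    // when the model runs, instead of reading the DSP output directly.
    return input
}
```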
To do better, the next version of CoreML needs to give you access to fragment buffers coming from the DSP or the other input sensors; it would be a lot faster … AND would save a lot of power.
A rolling, write-combining buffer between the DSP and the convolution engine would also be welcome, to allow speech recognition in real time. For the moment, I am spending 7% of my time writing to and reading from memory because this mechanism is missing.
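For what it's worth, here is a toy, CPU-side version of the kind of rolling handoff I mean; the real thing would have to live in the hardware or the driver, and every name and size here is made up for illustration:

```swift
import Foundation

// Toy single-producer / single-consumer ring buffer: the "DSP" appends frames,
// the "convolution engine" drains them, with no extra DRAM round trip in between.
// Purely illustrative: no overflow handling, not a real driver mechanism.
final class AudioRingBuffer {
    private var storage: [Float]
    private var writeIndex = 0
    private var readIndex = 0
    private let lock = NSLock()

    init(capacity: Int) {
        storage = Array(repeating: 0, count: capacity)
    }

    // Producer side (stands in for the DSP): append one frame of samples.
    func write(_ frame: [Float]) {
        lock.lock(); defer { lock.unlock() }
        for sample in frame {
            storage[writeIndex % storage.count] = sample
            writeIndex += 1
        }
    }

    // Consumer side (stands in for the convolution engine): take the next
    // `count` samples if they are available, otherwise return nil.
    func read(count: Int) -> [Float]? {
        lock.lock(); defer { lock.unlock() }
        guard writeIndex - readIndex >= count else { return nil }
        let frame = (0..<count).map { storage[(readIndex + $0) % storage.count] }
        readIndex += count
        return frame
    }
}
```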
I am sure they will figure it out 😉