Machine Learning Benchmarks for the iPhone 11 Pro Max

All content here is free to reuse,
as long as you credit the source:
umangnify.com

So, I was out coding @humangnify and working for my new employer. Tonight, I started seriously testing the iPhone 11 and its A13, and there are some pretty funny and interesting details coming, so please stay tuned.

Let's start with the performance of the CoreML engine.

On the iPhone XS Max:

http://umangnify.com iPhone XS Max

First run (cold, includes model loading):
InceptionV3 Run Time: 1689ms
Nudity Run Time: 471ms
Resnet50 Run Time: 1702ms
Car Recognition Run Time: 631ms
CNNEmotions Run Time: 10120ms
GoogleNetPlace Run Time: 1340ms
GenderNet Run Time: 761ms
TinyYolo Run Time: 879ms

Second run (model already loaded):
InceptionV3 Run Time: 132ms
Nudity Run Time: 88ms
Resnet50 Run Time: 105ms
Car Recognition Run Time: 121ms
CNNEmotions Run Time: 131ms
GoogleNetPlace Run Time: 107ms
GenderNet Run Time: 72ms
TinyYolo Run Time: 86ms

Third run:
InceptionV3 Run Time: 147ms
Nudity Run Time: 88ms
Resnet50 Run Time: 107ms
Car Recognition Run Time: 109ms
CNNEmotions Run Time: 129ms
GoogleNetPlace Run Time: 111ms
GenderNet Run Time: 87ms
TinyYolo Run Time: 211ms

Now, the iPhone 11 Pro Max

http://umangnify.com — results for the iPhone 11 Pro Max

First run (cold, includes model loading):
InceptionV3 Run Time: 1529ms
Nudity Run Time: 468ms
Resnet50 Run Time: 1139ms
Car Recognition Run Time: 525ms
CNNEmotions Run Time: 9264ms
GoogleNetPlace Run Time: 1333ms
GenderNet Run Time: 767ms
TinyYolo Run Time: 895ms

Second run (model already loaded):
InceptionV3 Run Time: 96ms
Nudity Run Time: 60ms
Resnet50 Run Time: 69ms
Car Recognition Run Time: 78ms
CNNEmotions Run Time: 94ms
GoogleNetPlace Run Time: 68ms
GenderNet Run Time: 52ms
TinyYolo Run Time: 148ms

Third run:
InceptionV3 Run Time: 83ms
Nudity Run Time: 54ms
Resnet50 Run Time: 72ms
Car Recognition Run Time: 84ms
CNNEmotions Run Time: 104ms
GoogleNetPlace Run Time: 87ms
GenderNet Run Time: 53ms
TinyYolo Run Time: 122ms

So, here it is: the CoreML speed-up is pretty nice.

You can see a nice range of performance improvements in the CoreML engine. When the neural network is large, the CoreML engine seems to hit a roughly linear limitation due to memory reads.

Benchmark          iPhone XS Max   iPhone 11 Pro Max   Speed-up
InceptionV3        147             83                  1.77
Nudity checker     88              54                  1.63
Resnet50           107             72                  1.49
Car Recognition    109             84                  1.30
CNN Emotions       129             104                 1.24
GoogleNetPlace     111             87                  1.28
GenderNet          87              53                  1.64
TinyYolo           211             112                 1.88

Units are milliseconds, obviously.

Looking at the first run, model loading is just as slow on the XS as on the iPhone 11… Apple still does not have a bridge in the architecture to speed up the loading. The CPU is still used to feed the data into the Neural Engine for inference.
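My numbers above come from a small Swift harness inside the app, but if you just want to reproduce the cold-versus-warm pattern quickly on a Mac, a minimal sketch with coremltools looks like this (the model path and the "image" input name are placeholders, not the app's actual code):

```python
# Minimal cold-vs-warm timing sketch with coremltools on macOS.
# This is NOT the on-device harness; the file names and the "image"
# input name are placeholders for whatever your .mlmodel declares.
import time

import coremltools as ct
from PIL import Image

model = ct.models.MLModel("InceptionV3.mlmodel")   # hypothetical model path
img = Image.open("test.jpg").resize((299, 299))    # InceptionV3 input size

for run in range(1, 4):
    start = time.perf_counter()
    model.predict({"image": img})                  # first call is the "cold" one
    print("Run %d: %.0f ms" % (run, (time.perf_counter() - start) * 1000))
```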

Optimizing a neural architecture is a lot easier than optimizing a processor core, so I was expecting more speed-up than this, especially since the die area dedicated to it seems to be more than 2x.

Did some coding on adjacency matrices, after the @pyninsula (Twitter) presentation on them yesterday…

So, here is what an adjacency matrix is: a pretty simple way of representing the connections between the elements of a graph…

Based on those connections, you can define relations, like groups of people or groups of information; it is a nice way to represent a graph. The cool part: you can actually use Zykov graph arithmetic…

So, for example, you can multiply an adjacency matrix element-wise with another one, and you actually get the intersection of the two groups of relationships… This is genius!
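If that sounds abstract, here is a tiny numpy sketch with two made-up 4-node graphs: the element-wise product keeps only the edges present in both graphs (the intersection of the two relationship sets), while the regular matrix product counts the two-step chains you get by following one relation and then the other.

```python
import numpy as np

# Two made-up relationship graphs over the same 4 people (1 = connected).
follows = np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]])

replies = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 1, 1, 0]])

# Element-wise product: edges present in BOTH graphs (intersection of relations).
print(follows * replies)

# Matrix product: entry (i, j) counts the 2-step chains i -> k -> j,
# following one relation and then the other.
print(follows @ replies)
```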

If you are interested in that kind of subject, you should read this: http://www.math.caltech.edu/~2014-15/2term/ma006b/12%20Graphs%20and%20Matrices.pdf

The application to coordinating the outputs of multiple machine-learning inferences is simply astonishing.

Then … The primes of Zykov may apply … 

Using Hierarchical Temporal Memory (HTM) to match Deep Learning on a chatbot kernel.

One of the big problems of deep learning is its "deep" part: you need to present thousands of data points to get good learning.
Hierarchical Temporal Memory (HTM) is a technique for modeling the neocortex in a much more realistic way than deep learning. While many people have made fun of this technique, I was totally unaware of it. Siraj (excellent YouTuber, https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A ) spoke about it recently, and I immediately got obsessed with implementing a tiny chatbot with it… I swallowed all the training videos in a matter of hours (2x speed helps), then I got to the keyboard and spent a full day (about 10 hours) coding the first system with NuPic ( https://github.com/numenta/nupic )… Natural language is fairly easy to debug; that is why I picked a chatbot as the example.

  1. Learning: I cannot find an easy way to run NuPic inference on CoreML… ouch! The CoreML API will have to be extended, and more basic operations will have to be added to the neural engine, including in the A12. (Expected: when you are first at machine-learning chips, you often forget one possibility of the system.)
  2. Looking at the first baby code I did with HTM, it learned quickly, a lot faster than deep learning… The inferences are blazing fast too. It does not match the deep-learning chatbot in terms of answer quality yet, and there are some seriously wrong answers in the middle of it, but it is now running on iOS and x86. I have some other goals to complete before I come back to it, but this is astonishingly promising: the learning is a matter of minutes instead of days… (a toy sketch of the kind of code I mean is a bit further below)
Excellent training videos from Numenta

Sooner or later, the efficiency of Hierarchical Temporal Memory (HTM) will allow it to take over from deep learning; it is only a matter of time.
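To make the "baby code" part a little more concrete, here is a toy character-level sequence learner using NuPic's SpatialPooler and TemporalMemory. It is a heavily simplified sketch of the idea, not the chatbot: the "encoder" is just a fixed random SDR per character, the dimensions are arbitrary, and I am assuming the standard algorithms API of NuPic 1.x (Python 2.7).

```python
# Toy HTM sequence learner sketch (assumes nupic 1.x on Python 2.7).
from __future__ import print_function

import numpy as np
from nupic.algorithms.spatial_pooler import SpatialPooler
from nupic.algorithms.temporal_memory import TemporalMemory

INPUT_SIZE, COLUMNS = 512, 1024

def encode(ch):
    # Fixed random SDR per character: a toy stand-in for a real encoder.
    bits = np.zeros(INPUT_SIZE, dtype=np.uint32)
    bits[np.random.RandomState(ord(ch)).choice(INPUT_SIZE, 20, replace=False)] = 1
    return bits

sp = SpatialPooler(inputDimensions=(INPUT_SIZE,), columnDimensions=(COLUMNS,),
                   globalInhibition=True, numActiveColumnsPerInhArea=40, seed=42)
tm = TemporalMemory(columnDimensions=(COLUMNS,), cellsPerColumn=8, seed=42)

phrase = "hello how are you "
for epoch in range(10):              # a handful of passes is enough to see learning
    tm.reset()
    for ch in phrase:
        active = np.zeros(COLUMNS, dtype=np.uint32)
        sp.compute(encode(ch), True, active)            # spatial pooling
        tm.compute(np.nonzero(active)[0], learn=True)   # sequence memory

# The predictive cells now encode what the TM expects to see next.
predicted_columns = {tm.columnForCell(c) for c in tm.getPredictiveCells()}
print("Number of predictive columns:", len(predicted_columns))
```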

Next Steps:

  1. Matching deep learning in terms of prediction quality.
  2. Trying to use NuPic to match DeepSpeech2's prediction stats.

RAVDESS injected into DeepSpeech; final training for the alpha of the app has started.

That is cool stuff: this network will give the app speech-to-text with high accuracy, plus the sentiment of the speaker. I hope this is going to help the chatbot be smarter, by giving it a better, sharper input.

Adding the sentiment detection unbalances my optimization for the GPUs: right now it is only using 40% of the 4 GPUs, and the 72 Xeon cores are not pounded either, so I'll have to look at this more carefully. If I don't see convergence on TensorBoard happening fast enough, I may have to stop the run and reorganize the layers, as it is my own alchemy, with fewer layers for the emotion analysis on an MoE-style part of the network (two sets of FFT analysis as input, one output, with different propagation and one common layer in the middle to merge; a rough sketch of the topology is just below). The size of the RAVDESS training set is a lot smaller than the DeepSpeech input; I have never done a mix like this, so I am curious to see what happens.

DeepSpeech and voice sentiment merged into one single neural net: two FFT inputs, one output with text and sentiment.
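Here is a rough PyTorch-style sketch of the shape I am describing: two FFT branches with their own propagation, one common layer in the middle where they merge, and a text head plus a sentiment head. All layer sizes and names are made up for illustration; the real network is a full DeepSpeech-style stack, not three linear layers.

```python
import torch
import torch.nn as nn

class SpeechAndSentiment(nn.Module):
    """Sketch: two FFT inputs, one shared middle layer, text + sentiment outputs."""

    def __init__(self, fft_bins=161, hidden=256, vocab=29, emotions=8):
        super().__init__()
        # Two branches with their own propagation over each FFT view.
        self.branch_a = nn.Sequential(nn.Linear(fft_bins, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(fft_bins, hidden), nn.ReLU())
        # One common layer in the middle where the branches merge.
        self.merge = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        # Two heads: characters for speech-to-text, classes for the sentiment.
        self.text_head = nn.Linear(hidden, vocab)
        self.sentiment_head = nn.Linear(hidden, emotions)

    def forward(self, fft_a, fft_b):
        merged = self.merge(torch.cat([self.branch_a(fft_a),
                                       self.branch_b(fft_b)], dim=-1))
        return self.text_head(merged), self.sentiment_head(merged)

# One (batch, time, fft_bins) frame stack per FFT view.
model = SpeechAndSentiment()
text_logits, emotion_logits = model(torch.randn(2, 100, 161),
                                    torch.randn(2, 100, 161))
```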

One year, one hour per day, 22 million Twitter threads sorted, 45 years of audio indexed, and all of it decomposed into FFTs.

That is the math of the effort to get the most extensive conversation database on social media…

Let's hope it works out the way I am hoping in the end; I may look like this if it does not come together nicely:

The goal is to run inference on voice intonation using FFTs, then use it to understand the intent of the user by extracting the feeling, while having learned from this user for hours in the background. That input is used as one of the inputs of the deep chatbot. In theory, that should be able to catch derision and a few other things that tend to break the realism of chatbots. I have spent my day trying to bypass the output of the first sentiment inference and connect its second-to-last layer to the chatbot's second-level layer; this is rock 'n' roll on the CPU, and I had to add a super-fast SSD for the testing… this stuff consumes SSD bandwidth… that should do:

Voices + text and stored FFTs … recipe for warming up the house 😉

Time to assemble the voice recognition, speaker recognition, and DeepChatbot into one single iOS app: Humangnify

They are now all up and running. Speech recognition accuracy is 95%; speaker recognition does not make mistakes anymore, it recognizes its owner and automatically creates new people when they are unknown; and the DeepChatbot passes most of a Turing test, but… it kind of answers like a movie character (it is using one of the movie dialogue databases for training).


So, this is when we see if I was totally insane to try to put this together, or not.

I expect the merge to take 3 to 5 days.

Memory-wise, it is all down to around 1 GB of RAM altogether: the chatbot is on TensorFlow Lite, DeepSpeech is on my hacked CoreML, and speaker recognition uses the iPhone DSP and CoreML.

One little problem remains: CoreML is pretty slow at loading its models, and they do not cooperate well together when they are large. They seem to be kicking each other out of the Neural Engine. I may be the dummy here, so I am triple-checking what is happening.

So, let’s see, was I insane to try to put all of this together?

Paper coding

I have been coding on paper for a week. It is actually a very interesting exercise, and it is sharpening my programming-abstraction skills; I recommend every programmer do it once a year.

It forces you to think through what you will code much more in your mind, and to have a more complete idea of how it should work before coding it. I coded a GPU-based physics engine for Humangnify, assembling physics rules and a force-directed graph, and using it as a decision voting system to decide how to display tweets. Demo soon!

TinyYolo on high-definition 12-Mpixel iPhone images

The speed of my own version of TinyYolo is remarkable… it easily beats a Skylake Xeon in terms of instantaneous delivery latency. I think I can still get it down 10% more, but the iPhone XS Max is doing well with its CoreML engine.

I noticed, once again, a little drop in prediction confidence compared to the version running on the GPU on the iPhone 7 and friends.

Latency of the first access is high (600 ms to 900 ms), but once the model is loaded into the CoreML engine, the latency is very consistent at around 200 ms.

The automatic quantization in Apple's tools is pretty cool; I did not find a neural net, except DeepSpeech2, that stops converging on a proper prediction after quantization. I still have to find the tf.nn operation that causes this.
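For reference, that weight quantization lives in coremltools; a minimal sketch looks like this on a Mac (the model file name is a placeholder):

```python
# Minimal weight-quantization sketch with coremltools (run on macOS;
# the model path is a placeholder).
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

model = ct.models.MLModel("TinyYolo.mlmodel")

# 8-bit linear quantization of the weights; other modes (k-means LUT, etc.) exist too.
quantized = quantization_utils.quantize_weights(model, nbits=8,
                                                quantization_mode="linear")
quantized.save("TinyYolo_8bit.mlmodel")
```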

The iPhone architecture needs a bridge between its CoreML engine and the inputs; here is an example.

So, if you look at the power consumption of doing convolutions on complex inputs on the iPhone, it is pretty clear that the CoreML engine is grabbing 100% of its data from memory; it does not have the capability to cache or directly access the outputs of the DSPs, or even a memory buffer that was not set up as a Metal buffer, array<texture>, or fragment buffer.

To do better, the next version of CoreML needs to give you access to fragment buffers coming from the DSP or the other input sensors; it would be a lot faster… AND it would save a lot of power.

A rolling write-combining buffer mechanism between the DSP and the convolution engine would be welcome, to allow speech recognition in real time. For the moment, I am spending 7% of my time writing to and reading from memory because this mechanism is missing.

I am sure they will figure it out 😉

How to use texture slices in an instanced draw primitive on Apple Metal.

So, I have been looking for samples showing how to do that, and I could not find a complete solution. First, it is important to understand the difference between array<Texture, XXX> and texture2d_array.

If you understand that, everything gets a lot easier immediately. I will let you go read the difference in the Apple documentation; it is pretty clear there.

Here is the shader piece you need to understand:

Notice that the index of the texture is passed in by the vertices.

Here is the magic for loading the textures with slices…

Objective-C: loading the texture array with slices.

Now you are almost there… you need to issue the draw primitive with the right call, including an instance index.

The Draw call … 

Don't forget to stack one uniform for each of your instances, as the Apple sample shows you; adding this lets you change the texture for each object!

Et Voila 😉

Just before the #MachineLearning inferences get plugged into the universe.

So, the force-directed force field is applied, but each node has zero affinity because the CoreML inference results are not plugged into the force fields yet.
We are going to change this today, and we will plug the scores of many inference engines into this view.

The goal is fairly simple: the deep natural-language processors will rank people and their tweets from "low" to "high", and, as you would expect, people with good, reasonable tweets will sit at your own level. People posting very "irritating" tweets, and everything society considers "low", will show up below, and what is seen as "esoteric" will appear high.

Then, science-based content will point north, etc.… This is when we will know whether a 3D-based interface can be useful; we are going to try to build constellations of ideas and trees of virtual locations for tweets, and let's see what happens.
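The coupling itself is not complicated. Here is a tiny numpy sketch of the idea (with invented scores and constants): a standard force-directed layout where each node gets an extra vertical pull proportional to its inference score, so "high" content literally floats up.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
pos = rng.standard_normal((N, 3))            # 3D positions of the tweet nodes
adj = rng.random((N, N)) < 0.05              # made-up "related tweets" graph
adj = np.triu(adj, 1)
adj = adj | adj.T
score = rng.random(N)                        # made-up inference score in [0, 1]

for step in range(200):
    diff = pos[:, None, :] - pos[None, :, :]               # pairwise vectors
    dist = np.linalg.norm(diff, axis=-1) + 1e-6
    repulse = (diff / dist[..., None] ** 3).sum(axis=1)    # every pair pushes apart
    attract = -(diff * adj[..., None]).sum(axis=1)         # connected nodes pull together
    lift = np.zeros_like(pos)
    lift[:, 2] = score - 0.5                               # ML score biases the height
    pos += 0.01 * (0.1 * repulse + attract + lift)

print(pos[:3])   # final positions of the first three nodes
```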

I am sharing the construction process with you because I hope to be copied, sooner or later.

https://www.youtube.com/watch?v=7LALeBxCFgQ

PyTorch to train DeepSpeech and optimize it.

So, for the last six months, I have been optimizing the FFTs that feed DeepSpeech, using a model of the iPhone's FFT DSP to improve the accuracy of the deep-learning recognition, because the closer you get to the real input stream, the better DeepSpeech's recognition level is.

So, I modeled the DSP to be able to reprocess DeepSpeech's large data set, and added a few hundred hours of real audio from the iPhone recording subsystem. That has given me a huge boost in recognition compared to the default sets.

Yesterday, out of curiosity, and because of some extra features PyTorch has compared to TensorFlow, I looked at PyTorch and its port of DeepSpeech; I plugged in the FFT model and incorporated my workload into PyTorch.

Interestingly, it was pretty straightforward to make it work. No surprises on the Python side: a few libraries with different versions here and there, but nothing much to worry about.

Now, what is interesting: there are some pretty good tools from Apple to bridge from PyTorch to CoreML. For the moment, I use a modified version of TensorFlow Lite on the iPhone; I have added the missing feedback features of TensorFlow to the Lite version, but this does not yet use the full power of the A12 Bionic's Neural Engine. So, if the bridge between PyTorch and CoreML enables a full direct conversion, I should see some serious performance gains.

This is the flow to follow: PyTorch → ONNX → CoreML.

I could not find the list of operations supported by those conversions, so one of the best ways to learn is to run the test. It should be about a day of work to get the whole pipeline in place, get a good idea of what is missing in terms of feature set, and then learn how much it needs to be reworked to reach the full feature list required. (Very often, mobile versions of convolution frameworks only support the top ~200 operations of the full training/scoring feature set, and most innovative neural nets use more than this limited set.)
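The skeleton of that pipeline is short. Something along these lines should be enough to find out where it breaks (a stand-in model, made-up shapes, and placeholder file names; the real DeepSpeech graph is where the unsupported operations will show up):

```python
# Sketch of the PyTorch -> ONNX -> CoreML flow (placeholder names and shapes).
import torch
import torch.nn as nn
from onnx_coreml import convert

# Stand-in for the real DeepSpeech port, just so the flow runs end to end.
model = nn.Sequential(nn.Conv1d(161, 256, kernel_size=11, stride=2), nn.ReLU(),
                      nn.Conv1d(256, 29, kernel_size=1))
model.eval()

# Step 1: PyTorch -> ONNX. The dummy input only fixes the traced shapes.
dummy = torch.randn(1, 161, 300)   # (batch, fft_bins, time) -- made-up shape
torch.onnx.export(model, dummy, "deepspeech.onnx",
                  input_names=["spectrogram"], output_names=["char_probs"])

# Step 2: ONNX -> CoreML. Unsupported ops fail here, which is exactly the test I want.
mlmodel = convert(model="deepspeech.onnx")
mlmodel.save("deepspeech.mlmodel")
```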

So, I know what I am doing this week: I really want DeepSpeech singing on this A12 Bionic Neural Engine.

Memory management of CoreML on the iPhone XS Max: a cool new improvement.

So, here is the memory profile of six inference models run on iOS 12 on the iPhone XS Max:

iPhone Xs Max

Here is the memory profile of the same six inferences run on an iPhone X:

iPhone X

The iPhone XS Max seems to be storing the inference models in something that is NOT main memory; everything is functional, but the amount of memory used is a LOT smaller.

It could be due to two things:

  1. The iPhone XS Max uses less precision when storing the neural net.
  2. The hardware Neural Engine stores it inside itself.

This is very interesting, because I have allocated 26 different inference models and the increase in memory usage is minimal, while on the iPhone X it was surging almost linearly with the size of the stored models. This is very good news: we will be able to pack a huge number of machine-learning inferences without exploding in memory size, giving the app a chance to stay in memory and process some ML inference in the background; you just need to make sure it is not too heavy. Just verified: we can run those inferences on data fetched in a background task.

The iPhone XS family uses about half of the memory required by other iOS devices; it is remarkable.

François Piednoël

Confirmation of color drifting in Portrait mode on the #iPhoneXsMax

Color Chart
Improved Brad 😉

So, here is a confirmation of the color drifting in Portrait mode on the new-generation iPhone: when a face is detected, the colors are shifted. I am not going into all the details of the exact recipe Apple is using, but the deltas are measurable… If you look carefully, the effort mostly boosts blue and green, and the visual effect makes the picture less yellow. There is additional processing to avoid washing out the yellows and losing yellowness on strong yellows.

Patch     No face (R G B)    With face (R G B)
Red       211  91  82        216  91  82
Blue       67  84 136         82  98 150
Green      69 101  62         93 133  76
Yellow    237 209 110        248 217 109

No Brad!!!

Again, this is a quick-and-dirty look at something people seem to care about. Overall, the colors are there, a little modified, but nothing to go crazy about. I would be happy if Apple allowed you to turn it off.