Neural networks are becoming massive monsters that are hard to train (without the “usual” dozen last-generation GPUs).
Is there a way to skip that? Let me introduce you to zero-cost proxies.
Our Sponsor
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
References
- https://www.technologyreview.com/2022/08/05/1056814/automation-ai-machine-learning-automl/
- https://iclr-blog-track.github.io/2022/03/25/zero-cost-proxies/
Transcript
1
00:00:04,630 –> 00:00:09,514
And here we are again with season four of the Data Science at Home podcast.
2
00:00:09,622 –> 00:00:19,626
This time we have something for you: if you want to help us shape the data science leaders of the future, we have created the Data Science at Home Ambassador program.
3
00:00:19,808 –> 00:00:28,834
Ambassadors are volunteers who are passionate about data science and want to give back to our growing community of data science professionals and enthusiasts.
4
00:00:29,002 –> 00:00:38,038
You will be instrumental in helping us achieve our goal of raising awareness about the critical role of data science in cutting edge technologies.
5
00:00:38,194 –> 00:00:46,280
If you want to learn more about this program, visit the Ambassadors page on our website at datascienceathome.com.
6
00:00:46,670 –> 00:00:49,558
Welcome back to another episode of Data Science at Home podcast.
7
00:00:49,594 –> 00:00:56,254
I’m Francesco, podcasting from the usual office of Amethix Technologies, based in Belgium.
8
00:00:56,422 –> 00:01:04,858
In this episode, I want to discuss some of the automated techniques that are used to develop artificial intelligence.
9
00:01:05,014 –> 00:01:34,410
As artificial intelligence becomes more and more, let’s say, off the shelf, many of the algorithms that essentially became the core of artificial intelligence, and in particular of the deep learning field, have now become state-of-the-art, quite standardized ways of producing artificial intelligence, or whatever you want to call it.
10
00:01:34,580 –> 00:01:59,000
But essentially there are so many moving variables: for example, the topology of the network, as well as the number of inputs, the number of layers, and then of course the learning rate and all the other hyperparameters that you might think of when you design a deep learning model.
11
00:01:59,330 –> 00:02:15,690
Well, due to all these moving pieces, it has in fact become kind of a dark art to design the best possible topology that will solve that particular problem with that particular data.
12
00:02:15,800 –> 00:02:33,570
And in fact, there is no formula that would allow researchers or practitioners to understand what type of topology one should be using, or what the percentage of dropout should be for a particular use case, if not by trial and error.
13
00:02:33,890 –> 00:02:46,522
That is, of course, running these models over and over again and then applying tiny, or more or less targeted, changes to the topology and to all the other variables.
14
00:02:46,666 –> 00:03:02,278
And essentially that means moving into a dimensional space that is even bigger than the parameter space of neural networks, which is already very high if you consider the number of parameters of a typical network today in pretty much any domain.
15
00:03:02,314 –> 00:03:08,240
Of course, I’m not generalizing, I’m just trying to give you some tangible numbers.
16
00:03:08,690 –> 00:03:15,740
A network with several millions of parameters nowadays is no longer a big deal.
17
00:03:17,510 –> 00:03:40,366
We have been speaking about the monsters or the beasts of deep learning, like 100 billion parameters, 175 billion parameters, and we remember the family of GPT models and DALL-E and all the other big models that allow us to pretty much have fun online or just speak about deep learning in action.
18
00:03:40,498 –> 00:04:03,030
Well, these are monsters, and except for these monsters, which are in the realm of hundreds of billions of parameters, all the rest are really tiny with respect to them, but still pretty large with respect to, let’s say, off-the-shelf machine learning models.
19
00:04:03,890 –> 00:04:16,506
Don’t forget that of course a very sophisticated or a very deep random forest model would be orders of magnitude smaller than one of the simplest neural networks out there.
20
00:04:16,628 –> 00:04:32,878
Now, of course, they would be applied to different use cases, to different types of data, and they might also give different results depending on the problem; for a computer vision system, for example, you would definitely not use a random forest.
21
00:04:33,034 –> 00:04:43,902
But this is just to give you an idea of the size of these models nowadays, and there is a problem when these things become large.
22
00:04:44,096 –> 00:05:03,286
Well, there is a problem for practitioners, because how do you design such a thing? Is there a tool that would, let’s say, facilitate or give you support in deciding how big or small a particular network should be? How many layers should that network have? And that’s when AutoML was born.
23
00:05:03,418 –> 00:05:11,938
AutoML algorithms, in fact, operate at a level of abstraction that is usually above the machine learning models.
24
00:05:11,974 –> 00:05:17,670
Of course they can be applied to pretty much all machine learning models, not just neural networks.
25
00:05:18,110 –> 00:05:34,410
And they usually rely only on the outputs of the model as a guide, to suggest to the practitioner that maybe some, let’s say, hyperparameters specific to that particular model should be changed.
26
00:05:34,970 –> 00:05:44,434
The problem is that when it comes to neural networks this process is very time consuming and also burns a lot of resources.
27
00:05:44,482 –> 00:05:51,920
And I think we have discussed this a long time ago: NAS, which stands for Neural Architecture Search.
28
00:05:52,250 –> 00:06:05,626
It is a technique that allows researchers or practitioners to find out, to discover, the best possible topology or neural network for a particular scenario.
29
00:06:05,758 –> 00:06:29,122
And how do they do that? Well, by iterating over and over again, observing the outputs, observing some metrics of the training session, and essentially measuring whether the accuracy is moving and what the magnitude of that movement is, observed directly from the output.
30
00:06:29,266 –> 00:06:50,230
The problem is that, as I said, for a relatively large neural network this becomes sometimes prohibitive, actually more often than not really prohibitive, as the networks grow in size and also as the data set grows in volume.
31
00:06:50,410 –> 00:07:09,550
So, as you can understand, it would be great to have, for example, a technique or a method that allows you to, let’s say, understand or predict what the accuracy would be of a particular neural network of which I’ve changed the topology or some of the other parameters.
32
00:07:09,670 –> 00:07:23,578
But I would like to know that accuracy before training, maybe without training at all, or training on an infinitesimal amount of data with respect to what I would usually do under the NAS technique.
33
00:07:23,674 –> 00:07:28,630
Right? And this is how zero cost proxies have been introduced.
34
00:07:28,810 –> 00:07:31,198
There is a very interesting literature.
35
00:07:31,294 –> 00:07:34,302
I was pretty unaware of it.
36
00:07:34,496 –> 00:07:45,642
But I’ve spent some time trying to do some research on what has been published so far, and I should say that it is indeed quite recent, from the last few years.
37
00:07:45,776 –> 00:08:02,770
In fact, it’s a pretty new methodology, or trend in a way, that I believe has arisen because we really need to optimize a lot of the training process.
38
00:08:02,880 –> 00:08:05,182
There’s a lot of room for improvement there.
39
00:08:05,376 –> 00:08:13,274
We have to save as much computation as we can, for the sake of the environment and for the sake of the costs.
40
00:08:13,382 –> 00:08:22,406
Training and retraining these neural networks can cost several millions of dollars when you think about the big monsters that I mentioned at the beginning of the episode.
41
00:08:22,538 –> 00:08:49,966
And so zero-cost proxies can indeed, at least from a theoretical perspective, solve such a problem: avoiding training the network almost entirely and still understanding or predicting whether that network is going to perform, then changing some hyperparameters and running that prediction again in pretty much zero time.
42
00:08:50,088 –> 00:08:57,346
That’s why it’s called zero cost: you can understand whether that topology would indeed be better than the previous one.
43
00:08:57,468 –> 00:09:20,918
So this means that the dimensional space of the problem would essentially be demolished, because instead of being in a dimensional space that is the space of the hyperparameters times the usual parameter space of the network, that is, the number of parameters, the weights, etc.,
44
00:09:20,954 –> 00:09:43,034
in this case, with zero-cost proxies we would just be in front of the hyperparameter dimensional space, which is notoriously much smaller, of course, than the parameter space, even though there is still a practically infinite number of combinations in neural networks.
45
00:09:43,082 –> 00:09:57,770
How to set the topology of the neural network? Of course, finding a sub-optimum in that space would be much, much easier than finding one in the parameter space, with millions and millions, and sometimes billions, of parameters.
46
00:09:57,950 –> 00:10:00,180
And now let me tell you something important.
47
00:10:00,510 –> 00:10:02,426
Cybercriminals are evolving.
48
00:10:02,558 –> 00:10:07,486
Their techniques and tactics are more advanced, intricate and dangerous than ever before.
49
00:10:07,668 –> 00:10:15,670
Industries and governments around the world are fighting back by building new regulations meant to better protect data against this rising threat.
50
00:10:15,990 –> 00:10:24,338
Today, the world of cybersecurity compliance is a complex one and understanding the requirements the organization must adhere to can be a daunting task.
51
00:10:24,494 –> 00:10:26,518
But not when the pack has your back.
52
00:10:26,664 –> 00:10:38,880
Arctic Wolf, the leader in security operations, is on a mission to end cyber risk by giving organizations the protection, information and confidence they need to protect their people, technology and data.
53
00:10:39,270 –> 00:10:47,834
Their new interactive compliance portal helps you discover the regulations in your region and industry and start the journey towards achieving and maintaining compliance.
54
00:10:48,002 –> 00:10:52,618
Visit arcticwolf.com/datascience to take your first step.
55
00:10:52,764 –> 00:10:56,530
That’s arcticwolf.com/datascience.
56
00:10:56,970 –> 00:11:14,100
So there is a taxonomy around zero-cost proxies, and of course I will provide some of the links that I have been exploring these days, which are very interesting, although they require some time to digest all that information.
57
00:11:14,490 –> 00:11:16,606
But I think it’s worth it.
58
00:11:16,788 –> 00:11:20,350
I learned a lot of stuff.
59
00:11:20,400 –> 00:11:27,730
Some other times, papers don’t add anything super novel, but they still help, or at least that happened with me.
60
00:11:27,840 –> 00:11:38,954
They helped me understand the background and some of the linear algebra metrics that researchers usually consider for zero-cost proxies.
61
00:11:39,062 –> 00:11:54,950
So, without entering the details, which can be quite intense for a podcast episode, zero-cost proxies are an extremely quick way to estimate the performance of neural network architectures.
62
00:11:55,130 –> 00:12:03,600
And so the method essentially computes statistics from, usually, a forward pass of a single mini-batch of data.
63
00:12:04,170 –> 00:12:07,522
And this means that in fact, it’s not exactly zero cost.
64
00:12:07,596 –> 00:12:12,000
It’s really negligible with respect to the entire training process.
65
00:12:12,630 –> 00:12:14,280
That’s why it’s zero.
66
00:12:14,790 –> 00:12:16,102
Metaphorically speaking.
67
00:12:16,176 –> 00:12:20,122
If you want to say zero from an engineering standpoint, of course it’s not.
68
00:12:20,256 –> 00:12:27,540
But as I said, it’s nothing with respect to what the neural network should train for real.
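To make the “one mini-batch, one pass, one scalar” idea concrete, here is a minimal sketch in PyTorch of how such a scoring loop could look. It is only an illustration, not code from the references above: the helper names (`zero_cost_score`, `rank_architectures`) and the `proxy_fn` callback are assumptions of mine.

```python
import torch
import torch.nn as nn

def zero_cost_score(model: nn.Module, x: torch.Tensor, y: torch.Tensor, proxy_fn) -> float:
    """Score an *untrained* architecture from a single mini-batch.

    There is no training loop: at most one forward (and one backward)
    pass happens inside `proxy_fn`, which reduces whatever it observes
    (activations, gradients, Jacobians, ...) to a single scalar.
    """
    model.train()  # keep BatchNorm/Dropout in their training behaviour
    return proxy_fn(model, x, y)

def rank_architectures(candidates, x, y, proxy_fn):
    """Rank candidate models by their zero-cost score (higher = more promising)."""
    scores = [zero_cost_score(m, x, y, proxy_fn) for m in candidates]
    order = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)
    return order, scores
```

Concrete choices for `proxy_fn` are exactly the metrics discussed in the rest of the episode: activation patterns, gradient norms, Jacobian statistics, parameter counts, and so on.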
69
00:12:27,930 –> 00:12:35,110
Now, of course, there are different groups or different ways to, let’s say, categorize zero-cost proxies.
70
00:12:35,550 –> 00:12:41,474
Of course the two major ones are data independent and data dependent.
71
00:12:41,582 –> 00:13:01,190
And based on these two categories, they measure different things, because in the first group we have zero-cost proxies that are data independent, which means that they don’t rely on the data set, on the input data set, to measure the quality of a particular network.
72
00:13:01,370 –> 00:13:33,394
So one would say, how does that happen? Like, how can a method ignore the data completely and understand whether a particular network is in fact better than another? Well, there are some techniques that use, for example, synthetic proxy tasks to estimate the ability of a particular architecture to capture different types of sine frequencies, scale invariances, or spatial information.
73
00:13:33,552 –> 00:13:55,994
So you might be dealing with a problem in which all these things are indeed involved: spatial information can be involved in several things, from computer vision to GIS systems; scale invariance matters for images for sure; and sine frequencies for some sort of encoding.
74
00:13:56,102 –> 00:14:06,890
But the abilities that these synthetic proxy tasks probe are usually present in the natural, let’s say, environment.
75
00:14:07,070 –> 00:14:23,098
So when you have a network that is being trained for a particular use case, most of the time the network has to have the ability to capture sine frequencies, scale invariances and spatial information.
76
00:14:23,184 –> 00:14:39,070
So researchers understood that, we all know that in fact, and said: okay, how about creating these things synthetically and evaluating without data? Another very important metric that is used is, of course, the number of parameters in the network.
77
00:14:40,110 –> 00:14:52,502
And yet another one is a way to approximate the neural network by piecewise linear functions that are usually conditioned on the activation patterns of the network.
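As a rough illustration of this last idea, here is a sketch of an activation-pattern proxy, loosely inspired by NASWOT-style scores from the literature; the published methods differ in the details, so treat this as a simplification under my own assumptions (in particular, that the model uses `nn.ReLU` modules).

```python
import torch
import torch.nn as nn

def activation_pattern_score(model: nn.Module, x: torch.Tensor) -> float:
    """Score an untrained network by how well its ReLU on/off patterns
    already separate the examples of one mini-batch (a simplified proxy
    for the number of linear regions the network defines)."""
    codes = []

    def hook(_module, _inputs, output):
        # Binary code per example: which ReLU units fired for this input.
        codes.append((output.detach().flatten(1) > 0).float())

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()

    c = torch.cat(codes, dim=1)               # (batch, total ReLU units)
    # Kernel entry (i, j) = number of units on which examples i and j agree.
    k = c @ c.t() + (1.0 - c) @ (1.0 - c).t()
    k = k + 1e-3 * torch.eye(k.shape[0], device=k.device)  # jitter for stability
    return torch.linalg.slogdet(k).logabsdet.item()
```

Larger values mean the activation patterns of the mini-batch examples are more distinct, which in this family of proxies is taken as a sign of a more trainable architecture.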
78
00:14:52,646 –> 00:14:57,370
So these are all of course, synthetic measures.
79
00:14:58,050 –> 00:15:05,342
They are evaluated, in a way, in place of training, during the assessment of the quality of the neural network.
80
00:15:05,426 –> 00:15:12,826
And that’s why we call these zero-cost proxies data independent: because they do not need the presence of the data.
81
00:15:13,008 –> 00:15:19,510
The second category, as you can imagine, is the data dependent zero cost proxies.
82
00:15:19,950 –> 00:15:26,038
And here is where of course, the methodology considers the initial data, the input data.
83
00:15:26,124 –> 00:15:46,646
Now, of course, when we say input data, again, it’s a tiny fraction of the entire data set, because we still want the method to be a zero-cost approach, which means that the time needed to understand and assess these metrics still has to be negligible.
84
00:15:46,718 –> 00:15:49,066
And yet the metrics can be completely different.
85
00:15:49,188 –> 00:15:57,146
For example, there are techniques that measure the intra- and inter-class correlations of the predictions.
86
00:15:57,278 –> 00:16:05,594
There are Jacobian matrices; there is, of course, the number of FLOPs, floating point operations, needed to pass the input through the network.
87
00:16:05,702 –> 00:16:12,430
That already gives an idea of how complex the network is from the input to the output layer.
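For the Jacobian-based idea mentioned just above, here is a sketch along the lines of the Jacobian-covariance proxy discussed in the zero-cost proxy literature; the scoring formula and the `eps` constant reflect my reading of that line of work and may differ from any specific implementation.

```python
import torch
import torch.nn as nn

def jacobian_correlation_score(model: nn.Module, x: torch.Tensor, eps: float = 1e-5) -> float:
    """Jacobian-covariance style proxy on a single mini-batch.

    Intuition: if the input-gradients of different examples are already
    decorrelated at initialization, the untrained network treats the
    examples as distinct, which tends to correlate with trainability.
    """
    x = x.clone().requires_grad_(True)
    out = model(x)
    out.backward(torch.ones_like(out))             # d(sum of outputs) / d(input)
    jac = x.grad.flatten(1)                        # (batch, input_dim)
    corr = torch.corrcoef(jac)                     # (batch, batch) correlation matrix
    eig = torch.linalg.eigvalsh(corr).clamp(min=0.0)
    # Penalize spectra that are close to singular (highly correlated rows).
    return -(torch.log(eig + eps) + 1.0 / (eig + eps)).sum().item()
```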
88
00:16:13,050 –> 00:16:16,990
There is a technique that sums the Euclidean norms of the gradients.
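That gradient-norm proxy is essentially a few lines of code; a sketch, assuming a classification setting where labels for the mini-batch are available:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_score(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """One forward and one backward pass on an untrained network, then
    the sum of the Euclidean norms of all parameter gradients."""
    model.zero_grad(set_to_none=True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return sum(p.grad.norm().item()               # default norm is the L2 norm
               for p in model.parameters() if p.grad is not None)
```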
89
00:16:17,790 –> 00:16:29,942
Another technique performs an approximation of the neural network Gaussian process using Monte Carlo methods, which is usually a much cheaper measure of performance, and so on and so forth.
90
00:16:29,966 –> 00:16:44,940
So these are techniques that do look at the data, in the sense that they let the data pass through in a forward pass, usually a very minimal number of batches, usually just one.
91
00:16:45,990 –> 00:16:52,270
And then they start calculating metrics that depend of course, on the particular methodology.
92
00:16:53,190 –> 00:17:03,854
Now, does this work? Well, there is a very interesting work, kind of a review of all these methods in action.
93
00:17:04,022 –> 00:17:16,150
And some researchers have come to the conclusion that across a wide range of tasks there is no single zero cost proxy that performs significantly better than the others.
94
00:17:16,260 –> 00:17:21,850
So pretty much they are performing the same statistically.
95
00:17:22,470 –> 00:17:42,338
Of course, another point that is quite important to mention is that zero-cost proxies still require research, because FLOPs, floating point operations, and the number of parameters alone are usually quite competitive baselines.
96
00:17:42,494 –> 00:18:01,620
So one would say: why should I complicate my life with a zero-cost proxy methodology, if the number of floating point operations and the number of parameters are both pretty good baselines that give me a decent measure of how that network is performing or will perform?
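And indeed the parameter-count baseline is trivially cheap to compute; a sketch (FLOPs would need a profiling tool, for example fvcore's `FlopCountAnalysis`, mentioned here only as one possible option):

```python
import torch.nn as nn

def param_count_score(model: nn.Module) -> int:
    """Baseline proxy: total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# FLOPs need a profiler; one option, assuming fvcore is installed:
#   from fvcore.nn import FlopCountAnalysis
#   flops = FlopCountAnalysis(model, sample_input).total()
```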
97
00:18:02,670 –> 00:18:11,618
So my conclusion, at least from the literature I’ve been reading, is that zero-cost proxies are an interesting approach.
98
00:18:11,714 –> 00:18:14,520
The idea, of course is amazing.
99
00:18:15,090 –> 00:18:24,862
It’s the execution that probably doesn’t match the power of the idea, which is great, and that happens in research.
100
00:18:25,056 –> 00:18:52,586
But I believe that using zero-cost proxies together with other methods, for example model-based predictors or one-shot training, can probably help improve the performance of the NAS, neural architecture search, techniques that are already in place, or just give some more robustness to those existing methodologies.
101
00:18:52,718 –> 00:19:02,422
Of course, I’m not an expert, I admit that I’m just reframing a bit the literature that I’ve been exploring in the last few days or weeks.
102
00:19:02,616 –> 00:19:05,038
It has been a nice journey, to be honest.
103
00:19:05,184 –> 00:19:09,370
Very interesting stuff, which I hope you will find as interesting as I did, of course.
104
00:19:09,480 –> 00:19:10,654
That’s it for today.
105
00:19:10,752 –> 00:19:13,534
Don’t forget to drop by our discord channel.
106
00:19:13,692 –> 00:19:18,454
You will find the link on the official website, datascienceathome.com.
107
00:19:18,612 –> 00:19:20,040
Speak with you next time.
108
00:19:20,730 –> 00:19:23,762
You’ve been listening to the Data Science at Home podcast.
109
00:19:23,846 –> 00:19:28,430
Be sure to subscribe on iTunes, Stitcher or Podbean to get new, fresh episodes.
110
00:19:28,490 –> 00:19:34,480
For more, please follow us on Instagram, Twitter and Facebook or visit our website at datascienceathome.com