That deep learning alone is not sufficient to solve artificial general intelligence is an increasingly accepted statement. Generalist agents have great properties that can overcome some of the limitations of single-task deep learning models. Be aware that we are still far from AGI, though. So what are generalist agents?
00:00:02,650 –> 00:00:05,494
This is the sound of turning ideas into software.
00:00:05,662 –> 00:00:08,550
This is the sound of engineering and passion.
00:00:09,350 –> 00:00:16,820
Work more, work harder, experiment, build, break, and build again.
00:00:17,150 –> 00:00:19,518
Write code, improve it.
00:00:22,910 –> 00:00:30,046
Insurance, finance, retail, defense, robotics, energy, Amethix.
00:00:30,238 –> 00:00:33,118
Welcome back to another episode of the Data Science at Home podcast.
00:00:33,154 –> 00:00:40,230
I’m Francesco, podcasting from the regular office of my company, Amethix Technologies, based in Belgium.
00:00:40,550 –> 00:01:03,538
Today I want to introduce another important model, and I would say milestone, from the DeepMind community: the researchers at DeepMind have put together a model that is considered one of the first generalist agents in the literature.
00:01:03,694 –> 00:01:12,030
And why I say this is important is because, well, first of all, there is no production-ready result yet; there’s still a lot to work on.
00:01:12,140 –> 00:01:26,530
But in my opinion it’s a very interesting milestone, because it sets the beginning of a new era for the design and development of generalist agents, or generalist models.
00:01:26,650 –> 00:01:35,266
Now, first of all, the term generalist must not be confused with artificial general intelligence.
00:01:35,398 –> 00:01:36,966
We’re very far from there.
00:01:37,088 –> 00:01:52,906
As I have said a number of times on this show, the models and the mathematics that we are using today are probably not even ready, not even appropriate, for that type of modeling, or for reaching human capabilities in pretty much any task.
00:01:53,038 –> 00:02:03,654
But what a generalist agent does, or wants to do, is have the capability of solving different tasks, or multiple tasks, at the same time.
00:02:03,692 –> 00:02:20,382
And so you would not have a model that is, as happens today, specialized in, for example, doing object recognition, or face detection, or street sign recognition, and so on and so forth, in a very specialized way.
00:02:20,516 –> 00:02:27,318
Instead, we would have a model, in fact an agent, that can do many things.
00:02:27,404 –> 00:02:40,798
And so for example, it can recognize images, it can recognize objects in an image, it can describe a scene, it can play a game, it can generate text, it can chat with humans and so on and so forth.
00:02:40,894 –> 00:02:44,660
So that’s the idea of having a generalist model.
00:02:44,990 –> 00:02:52,134
Now of course there are many challenges, and we’ll see how DeepMind researchers have dealt with them.
00:02:52,172 –> 00:02:57,342
But there are, of course, several benefits of having a generalist model, or generalist agent.
00:02:57,536 –> 00:03:14,622
For example, you no longer need to handcraft policy models with appropriate inductive biases for each domain, which is usually the case when multiple specialized models are put together to form a so-called generalist agent.
00:03:14,756 –> 00:03:28,700
And it also increases the amount and diversity of training data because, of course, data come from different domains, and most of the time you never have a shortage of data.
00:03:30,170 –> 00:03:35,034
It’s really quite difficult to have a shortage of data for many sectors and many domains out there.
00:03:35,072 –> 00:03:50,782
But in case you are dealing with domains that require, for example, knowledge transfer, or for which there is not a particularly large amount of data available, or data are extremely expensive with respect to other domains,
00:03:50,866 –> 00:03:57,298
Well, a generalist model can also be trained from other data, data coming from different domains.
00:03:57,334 –> 00:04:03,682
And so that might help dealing with the scarcity of data whenever that applies.
00:04:03,826 –> 00:04:14,010
So I want to start by giving some numbers and make sure that everybody is on the same page here, that we’re still speaking about a very large model.
00:04:14,120 –> 00:04:16,318
It’s 1.2 billion parameters.
00:04:16,474 –> 00:04:25,650
The model is called Gato, and it’s a generalist agent that you should not consider a small model at all.
00:04:25,760 –> 00:04:39,966
1.2 billion parameters is a massive model already, not approachable by any off-the-shelf machine that you might be dealing with at home or in your office.
00:04:40,088 –> 00:04:47,658
So it’s something that requires dedicated hardware, dedicated infrastructure for a number of days of continuous training.
00:04:47,744 –> 00:04:50,950
And we’ll get into these details in a minute.
00:04:51,070 –> 00:05:02,434
Also, another thing that is important to mention: Gato was trained offline, in a purely supervised manner.
00:05:02,482 –> 00:05:18,800
And this might be kind of a limitation because there are many tasks, especially when it comes to gaming, for example, for which you would like to have some sort of reinforcement learning injection or reinforcement learning based approach to training.
00:05:19,610 –> 00:05:21,822
And so in this case, that didn’t happen.
00:05:21,896 –> 00:05:40,530
And, to be fair, the authors already understood that type of limitation; they mention in the conclusions and future work that adding a reinforcement learning approach to the training would definitely help.
00:05:40,700 –> 00:05:51,130
So what does this model do? Well, first of all, this model ingests a number of inputs that of course come from different domains.
00:05:51,190 –> 00:05:56,314
For example, there is text; there are Atari images and discrete actions, when it comes to playing arcade games;
00:05:56,362 –> 00:06:10,366
and there are images that have been annotated, for example with descriptions or the alt text from web pages.
00:06:10,558 –> 00:06:14,000
There is text coming from images and questions.
00:06:14,330 –> 00:06:25,554
For example, there is an image and there is a question, eventually even with an answer of, for example, what is in that picture? There is a nice dog, there is a cute cat.
00:06:25,712 –> 00:06:40,470
And if you provide the input and the output, whatever the network should respond to that particular input, you essentially have the pair, the x-y pair, that deep learning networks are used to dealing with.
00:06:40,580 –> 00:06:44,230
And finally, there are images and continuous actions.
00:06:44,410 –> 00:06:53,206
And the continuous actions are due to the fact that the network has been trained to deal also with a robotic arm.
00:06:53,278 –> 00:07:13,186
So together with analyzing text, eventually providing an answer to a particular question, and also, let’s say, I’m quoting, “understanding” images, the Gato model is also capable of maneuvering a robotic arm.
00:07:13,258 –> 00:07:18,020
And so that’s why that is also part of the input in the training process.
00:07:18,950 –> 00:07:27,390
Now, the very first thing that developers and engineers and researchers have been doing is of course taking care of so called tokenization.
00:07:28,070 –> 00:07:34,042
There is of course an infinite number of ways to transform data into tokens.
00:07:34,066 –> 00:07:56,950
And so tokenization is something of an art, or a dark art, on the part of the researcher, also because it has to be done in a way that is first of all computationally efficient, but also maintains some sort of semantics in the tokens.
00:07:57,010 –> 00:08:00,870
So you don’t want to lose information when you tokenize something.
00:08:01,040 –> 00:08:04,138
That’s why there are different flavors of tokenization.
00:08:04,234 –> 00:08:18,330
And there is an impressive amount of detail in the official paper that presents this work, which I will of course link in the show notes of this episode on the official website.
00:08:18,500 –> 00:08:27,478
But, as I said, different flavors of tokenization have been applied to the different data types.
00:08:27,514 –> 00:08:39,860
So, for example, text is encoded with SentencePiece, with something like 32,000 subwords, into the integer range from 0 to 32,000.
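To make the idea concrete, here is a toy sketch of subword tokenization in Python. The tiny vocabulary and the greedy longest-match strategy are illustrative stand-ins of my own; the real SentencePiece model learns its roughly 32,000 subwords from data.

```python
# Toy stand-in for a learned subword vocabulary (SentencePiece learns ~32,000).
TOY_VOCAB = {"play": 0, "ing": 1, "atari": 2, "game": 3, "s": 4, "<unk>": 5}

def tokenize(text: str, vocab=TOY_VOCAB) -> list[int]:
    """Greedy longest-match subword tokenization into integer token IDs."""
    tokens, i = [], 0
    text = text.lower().replace(" ", "")
    while i < len(text):
        # Try the longest remaining piece first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            # No known subword starts here: emit <unk> for one character.
            tokens.append(vocab["<unk>"])
            i += 1
    return tokens

tokenize("playing atari games")  # -> [0, 1, 2, 3, 4]
```

The output is a sequence of integers in the vocabulary range, which is all the downstream sequence model ever sees.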
00:08:40,550 –> 00:08:51,450
Discrete and continuous values are also tokenized, flattened in row-major order, whether they are discrete or floating-point values.
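A minimal sketch of that flattening step, plus an illustrative uniform binning for continuous values; note the binning scheme here is my own simplification, not necessarily the paper's exact encoding.

```python
def flatten_row_major(values):
    """Flatten a nested list row by row, left to right (row-major / C order)."""
    return [v for row in values for v in row]

def discretize(x, lo=-1.0, hi=1.0, n_bins=1024):
    """Map a continuous value in [lo, hi] to one of n_bins integer bins.
    Uniform bins are used here purely for illustration."""
    x = min(max(x, lo), hi)            # clip to the valid range
    frac = (x - lo) / (hi - lo)        # position within the range, in [0, 1]
    return min(int(frac * n_bins), n_bins - 1)

actions = [[0.1, -0.5],
           [0.9, 0.0]]
tokens = [discretize(v) for v in flatten_row_major(actions)]
```

After this step every value, whatever its original type, is just an integer token that can be placed in the same sequence as the text tokens.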
00:08:51,620 –> 00:09:16,940
Images are also transformed into sequences of non-overlapping 16-by-16 patches, or tiles, in raster order; each pixel in the image patches is then normalized to the range between negative one and one, and divided by the square root of the patch size, which for a patch size of 16 is of course 4.
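In Python with NumPy, the patching and normalization just described could be sketched like this, assuming for simplicity a single-channel grayscale image rather than RGB:

```python
import numpy as np

def image_to_patch_tokens(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W) uint8 image into non-overlapping patch x patch tiles
    in raster order, scale pixels to [-1, 1], then divide by sqrt(patch):
    for 16x16 patches, sqrt(16) = 4."""
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    tiles = []
    for r in range(0, h, patch):           # raster order: rows first,
        for c in range(0, w, patch):       # then columns
            tile = img[r:r + patch, c:c + patch].astype(np.float32)
            tile = tile / 127.5 - 1.0      # [0, 255] -> [-1, 1]
            tile = tile / np.sqrt(patch)   # divide by sqrt(patch size) = 4
            tiles.append(tile)
    return np.stack(tiles)                 # (num_patches, patch, patch)

img = np.zeros((32, 32), dtype=np.uint8)
patches = image_to_patch_tokens(img)       # shape (4, 16, 16)
```

A 32-by-32 image yields four tiles; a real input image would of course yield many more, and three color channels.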
00:09:17,510 –> 00:09:50,660
Now, all these tokens are essentially embedded. After tokenization and sequencing, because you have to arrange the tokenized version of the input in sequential order, in a sequence, the researchers apply a parameterized embedding function to each token, clearly to maintain some sort of control on the dimensionality of the problem.
00:09:51,410 –> 00:10:04,018
We have seen this happen a number of times already; it’s a pretty much standard technique when it comes to high-dimensional models. And finally, the tokens that belong to image patches,
00:10:04,054 –> 00:10:17,970
so the visual ones, the ones for the images, are also passed through, or embedded using, a single ResNet block, in order to obtain one vector per patch.
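A minimal sketch of the embedding step. The table sizes here are deliberately small and illustrative (the actual model uses an embedding size of 2048 and a much larger vocabulary), and the real patch embedding is a learned ResNet block, not a table lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM = 1_000, 64   # toy sizes; Gato-scale would be far larger

# Parameterized embedding: a learned lookup table, one vector per token ID.
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

def embed_tokens(token_ids):
    """Map each integer token to its embedding vector.
    (Image-patch tokens would instead pass through a small ResNet block
    to obtain one vector per patch.)"""
    return embedding_table[np.asarray(token_ids)]

seq = embed_tokens([5, 17, 999])    # shape (3, 64): one vector per token
```

The point is simply that, after this step, every modality lives in the same fixed-dimensional vector space the sequence model operates on.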
00:10:18,830 –> 00:10:25,362
When it comes to training, Gato’s network architecture has essentially two components that work together.
00:10:25,436 –> 00:10:39,802
The first is the embedding function that, as I said, transforms these tokens into token embeddings; and then we have the sequence model, which simply outputs a distribution over the next discrete token.
00:10:39,886 –> 00:10:44,878
So essentially the model becomes a model that predicts the next token.
00:10:44,974 –> 00:10:54,510
Now, for this type of model you can use any general sequence model that works as a predictor of the next token.
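As a sketch of that interface, here is an autoregressive decoding loop in Python. Random logits stand in for the real transformer, since any model that maps a token prefix to a distribution over the next token would fit this interface.

```python
import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_distribution(token_ids, vocab_size=1_000, seed=0):
    """Stand-in for the sequence model: returns a distribution over the
    next token given the prefix. Random logits replace the transformer."""
    rng = np.random.default_rng(seed + len(token_ids))
    return softmax(rng.standard_normal(vocab_size))

def generate(prompt, n_steps=5):
    """Autoregressive decoding: predict the next token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(n_steps):
        p = next_token_distribution(tokens)
        tokens.append(int(np.argmax(p)))   # greedy choice of the next token
    return tokens
```

Swapping the stand-in for a trained transformer changes nothing about this loop, which is exactly why the transformer is just one possible choice of sequence model.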
00:10:55,130 –> 00:10:57,586
And we have seen this many times in many domains.
00:10:57,658 –> 00:11:07,846
Even on this show, we have seen sequence models in action to predict the next token, usually for NLP models.
00:11:07,978 –> 00:11:24,138
Researchers at DeepMind, however, chose a transformer for scalability reasons. As I said, a 1.2-billion-parameter model is not something that you can deal with on your regular laptop.
00:11:24,284 –> 00:11:32,480
It’s something that requires dedicated hardware and of course, something that requires a scalable sequence model as well.
00:11:33,170 –> 00:11:37,330
Let me give you some numbers about the architecture of the neural network.
00:11:37,450 –> 00:11:42,558
It’s a 24-layer transformer with an embedding size of 2048.
00:11:42,584 –> 00:11:55,546
And there is also a post-attention feed-forward hidden size of 8196, and of course, that totals 1.2 billion parameters.
00:11:55,678 –> 00:11:59,420
Now, about training the model.
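As a back-of-envelope sanity check of those numbers, counting only the big weight matrices (ignoring embeddings, biases and layer norms, so this is only an approximation of my own, not the paper's accounting):

```python
# Rough parameter count for the quoted architecture:
# 24 layers, embedding size 2048, feed-forward hidden size 8196.
layers, d_model, d_ff = 24, 2048, 8196

attention = 4 * d_model * d_model   # Q, K, V and output projection matrices
feed_forward = 2 * d_model * d_ff   # up- and down-projection matrices
total = layers * (attention + feed_forward)

print(f"{total / 1e9:.2f}B parameters")  # prints "1.21B parameters"
```

The dominant terms alone already land right around the quoted 1.2 billion, which is reassuring.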
00:11:59,930 –> 00:12:16,760
The training process has been performed on a 16-by-16 TPU v3 (Tensor Processing Unit) slice, for 1 million steps, with a batch size of 512 samples, for about four days.
00:12:17,150 –> 00:12:35,946
So, when it comes to the data sets that have been used to train Gato: well, it has been trained on a very large number of data sets, comprising natural language, image data sets, and Atari games as well.
00:12:36,068 –> 00:12:49,414
When it comes to the vision-language data sets, they have used ALIGN, which consists of 1.8 billion images with the alt-text annotation of each image.
00:12:49,582 –> 00:12:57,898
There is also LTIP, which stands for Long Text and Image Pairs, and consists of 312 million images with captions.
00:12:57,994 –> 00:13:11,554
And of course, many others, like Conceptual Captions and COCO Captions, which are also captioning data sets, with about 3 million and 120,000 image-text pairs, respectively.
00:13:11,602 –> 00:13:19,762
So, as you can see, only for the vision and language, there is an impressive amount of data at their disposal.
00:13:19,906 –> 00:13:25,438
I already mentioned the fact that the model can also maneuver a robotic arm.
00:13:25,594 –> 00:13:30,846
And of course, researchers have used RGB stacking data.
00:13:31,028 –> 00:13:39,620
It’s an environment made of RGB-colored blocks, and essentially the task of the robot is to stack these blocks together.
00:13:40,070 –> 00:13:43,126
And this has happened both in the real world and in simulation.
00:13:43,258 –> 00:14:12,394
So, are we curious to know how it performed after training, and what the capabilities of such a generalist agent are? Well, before showing the results, we of course have to explain what “good” and “bad” mean for such an agent. Performance is measured as a percentage, where 100% corresponds to how an expert would solve that particular task.
00:14:12,502 –> 00:14:16,066
And of course, 0% corresponds to random policies.
00:14:16,198 –> 00:14:27,630
So essentially, randomly attempting the solution of that particular task. The numbers depend on which benchmark the researchers have used, and they have used many.
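The expert-normalized score just described can be written down in a couple of lines:

```python
def normalized_score(raw, random_score, expert_score):
    """Map a raw benchmark score to a percentage where a random policy
    scores 0% and expert-level performance scores 100%."""
    return 100.0 * (raw - random_score) / (expert_score - random_score)

# e.g. a raw score of 750 on a task where random play scores 0
# and an expert scores 1000 normalizes to 75%:
normalized_score(750, 0, 1000)  # -> 75.0
```

This linear rescaling makes scores comparable across tasks whose raw reward scales differ wildly.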
00:14:27,740 –> 00:14:31,940
And of course, you will find the details in the official paper.
00:14:32,390 –> 00:14:34,160
I will just mention a few.
00:14:34,670 –> 00:14:39,342
There are some relatively okay results.
00:14:39,536 –> 00:15:04,410
For example, when it comes to the ALE Atari benchmark, Gato achieves the average human score for 23 Atari games, which is pretty good considering that the model hasn’t been trained specifically for playing Atari games, but jointly with many other tasks; it’s pretty decent, I would say.
00:15:04,580 –> 00:15:08,854
And as I mentioned, this has been done in a completely supervised manner.
00:15:08,962 –> 00:15:19,366
Usually, training a deep learning network to play games happens in a reinforcement learning context.
00:15:19,498 –> 00:15:20,950
There is also BabyAI.
00:15:21,010 –> 00:15:38,218
It’s another benchmark used for assessing the quality of the solutions provided by an artificial intelligence model, and Gato achieves over 80% of the expert score for nearly all levels.
00:15:38,314 –> 00:15:45,114
So that’s also pretty good. On the most difficult task, which is called the Boss level for a reason,
00:15:45,272 –> 00:15:47,298
Gato scores 75%.
00:15:47,384 –> 00:16:01,520
So even there, considering that behaving as an expert would be 100%, being at 75% in a first attempt at a generalist agent is, I would say, not that bad.
00:16:02,090 –> 00:16:18,358
Finally, on Meta-World, Gato achieves more than 50% on 44 out of 45 of the tasks the researchers have trained on, over 80% on 35 tasks, and over 90% on just three tasks.
00:16:18,454 –> 00:16:24,942
So there are three tasks where Gato apparently performs way better than on the others.
00:16:25,076 –> 00:16:47,420
So, as a first attempt, all this work is, I would say, absolutely interesting and worth reading, because there are several techniques and details, especially in the appendix, that, if you are a deep learning practitioner, you will definitely want to read.
00:16:47,930 –> 00:16:56,382
Of course in this show I cannot go through all the details of this model, that’s what academic papers are for.
00:16:56,576 –> 00:17:06,690
But I would like to spend a few words on the impact of such a model, which is in my opinion also very important, because, as I said,
00:17:06,860 –> 00:17:21,406
generalist agents are growing in terms of interest from the community, though they are currently not present at all in production environments, for a reason: because there are many risks.
00:17:22,172 –> 00:17:23,250
There are many benefits as well.
00:17:23,300 –> 00:17:25,114
But definitely many risks.
00:17:25,162 –> 00:17:30,320
Too many risks before deploying a generalist agent in,
00:17:31,490 –> 00:17:36,478
I would say, an uncontrolled environment, or an environment in which there are also humans.
00:17:36,634 –> 00:17:47,566
One thing I would really be cautious about is cross-domain knowledge transfer, which in my opinion is a very powerful concept.
00:17:47,638 –> 00:17:59,850
The fact that you can train a neural network in a particular domain and then extrapolate or transfer that knowledge and utilize that knowledge in another domain, in another sector, that’s absolutely powerful.
00:18:00,230 –> 00:18:05,214
But I would raise an eyebrow if you trained a
00:18:06,080 –> 00:18:11,374
neural network, or a massive model, to play arcade games.
00:18:11,422 –> 00:18:23,302
Maybe fighting games; and then you move that knowledge into the real world, onto another task that, I don’t know, requires for example surveillance, or requires interacting with humans.
00:18:23,446 –> 00:18:32,158
You never know how that knowledge has in fact been transferred from one domain, which is fighting, after all, into another domain.
00:18:32,194 –> 00:18:50,670
Also, another thing that we must not forget is that for deep learning models, and statistical models in general, optimization is a bad beast sometimes, because what we are doing here is minimizing a loss function.
00:18:50,780 –> 00:19:03,046
And about that, there is a very interesting section in the paper on how the researchers have designed the loss function and its optimization.
00:19:03,178 –> 00:19:08,514
But essentially, what deep learning does is minimize that loss function.
00:19:08,612 –> 00:19:25,546
So there is nothing that gives you a quantitative approach to, for example, ethics, or to acceptance: how acceptable a particular action is in a particular context, in a particular scenario.
00:19:25,678 –> 00:19:35,118
So these are all things that still need to be discussed and thought through considerably by the community.
00:19:35,264 –> 00:19:44,074
And not just by the researchers, or by deep learning practitioners, or coders who just want to see something tangible:
00:19:44,242 –> 00:19:47,818
A model that learns stuff automatically.
00:19:47,914 –> 00:19:51,922
Something that of course is very appealing to the mind of a developer.
00:19:52,006 –> 00:19:55,642
Probably less appealing to the mind of a regulator,
00:19:55,786 –> 00:20:07,834
who at some point has to take the decision of dealing with that particular model, and eventually leaving that model uncontrolled in an environment where there are other humans.
00:20:07,942 –> 00:20:08,850
That’s it for today.
00:20:08,900 –> 00:20:16,642
Of course, I will take the chance to invite you to the Discord Channel, which is our official channel where we speak about all things machine learning and AI.
00:20:16,786 –> 00:20:24,870
And of course, you are free to propose any topic you would like me to speak about in the next episode.
00:20:25,370 –> 00:20:36,150
Last but not least, I recently started again a series of hands on sessions on Twitch, so I also invite you to drop by.
00:20:36,260 –> 00:20:47,214
You will find the schedule of my Twitch Live coding sessions, where you can interact live with me and of course, with other followers and viewers, and we can just have fun.
00:20:47,372 –> 00:20:48,354
That’s it for today.
00:20:48,452 –> 00:20:49,798
Thank you so much for listening.
00:20:49,894 –> 00:20:59,494
Speak to you next time. You’ve been listening to the Data Science at Home podcast; be sure to subscribe on iTunes, Stitcher or Podbean to get fresh new episodes.
00:20:59,542 –> 00:21:05,580
For more, please follow us on Instagram, Twitter and Facebook or visit our website at datasciencehome.com