A YouTube Video Summarizer Using Whisper and LangChain

In this post we dive into the challenge of efficiently summarizing YouTube videos in the digital age. We introduce two cutting-edge tools, Whisper and LangChain, that can help tackle this issue, and we discuss the “stuff,” “map-reduce,” and “refine” strategies for handling large amounts of text and extracting valuable information.
natural-language-processing
deep-learning
langchain
activeloop
openai
Author

Pranath Fernando

Published

August 13, 2023

1 Introduction

We recently discussed chains, a powerful LangChain feature that allows for building an end-to-end pipeline on top of language models. Chains combine components such as models, prompts, memory, output parsers, and debugging utilities behind a user-friendly interface. We also went over the process of creating custom pipelines by inheriting from the Chain class and looked at LLMChain as an example. That article laid the groundwork for subsequent posts, in which we apply these concepts to a hands-on project: summarizing a YouTube video.

In this section, we look at the difficulty of efficiently summarizing YouTube videos in the digital age. We introduce two cutting-edge technologies, Whisper and LangChain, to help address this challenge, and go over the “stuff,” “map-reduce,” and “refine” approaches for dealing with large volumes of text and extracting valuable information. By using Whisper to transcribe YouTube audio and then applying LangChain’s summarization chains (stuff, refine, and map_reduce), it is feasible to extract the key points from videos effectively. We also highlight LangChain’s customizability, which allows for personalized prompts, multilingual summaries, and storing URLs in a Deep Lake vector store. By utilizing these advanced tools, you can save time, boost knowledge retention, and gain a better comprehension of many topics.

First, we download the YouTube video we are interested in and transcribe it using Whisper. Then, we create summaries using two different approaches:

  1. First, we use an existing summarization chain to generate the final summary, which automatically manages embeddings and prompts.
  2. Then, we take a more step-by-step approach to generate a final summary formatted in bullet points, which consists of splitting the transcription into chunks, computing their embeddings, and preparing ad-hoc prompts.

The wealth of information available in the digital age can be overwhelming, and we frequently find ourselves scrambling to consume as much knowledge as possible in our limited time. YouTube is a treasure trove of information and entertainment, but it can be difficult to wade through long videos to extract the main lessons. Don’t worry, we’ve got your back! In this post, we’ll show you how to use two cutting-edge tools to quickly summarize YouTube videos: Whisper and LangChain.

I’ll walk you through the steps of downloading a YouTube video, transcribing it with Whisper, and summarizing the transcribed text using LangChain’s stuff, refine, and map_reduce approaches.

Workflow:

  1. Download the YouTube audio file.
  2. Transcribe the audio using Whisper.
  3. Summarize the transcribed text using LangChain with three different approaches: stuff, refine, and map_reduce.
  4. Add multiple URLs to a Deep Lake database and retrieve information.

2 Import Libs & Setup

JiWER is a simple and fast Python package for evaluating automatic speech recognition systems.
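
For reference, the Python packages used in this post can be installed with pip. The exact package names below are assumptions about a typical environment (in a notebook you would prefix the command with !), so adjust them as needed.

# Install the required packages (names assumed; adjust to your environment)
# pip install -q yt_dlp openai-whisper langchain deeplake openai jiwer pytube python-dotenv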

Then, we must install the ffmpeg application, which is one of the requirements for the yt_dlp package. This application is installed on Google Colab instances by default. The following commands show the installation process on Mac and Ubuntu operating systems.

If you’re working on an operating system that wasn’t mentioned above (such as Windows), you can follow a dedicated guide such as “How to Install ffmpeg,” which provides detailed, step-by-step instructions.

The next step is to add the OpenAI and Deep Lake API keys to the environment variables. Use the load_dotenv function to read the values from a .env file, or run the following code. Remember that API keys must be kept private, because anyone who has them can access these services on your behalf.

# MacOS (requires https://brew.sh/)
#brew install ffmpeg

# Ubuntu
#sudo apt install ffmpeg
import os 
from dotenv import load_dotenv
from pytube import YouTube
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
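
Since the Deep Lake vector store is used later in the workflow, you may also want to read the Activeloop token here. The variable name ACTIVELOOP_TOKEN below is an assumption, so match it to whatever key name you used in your .env file.

# Optionally load the Activeloop (Deep Lake) token as well.
# The name "ACTIVELOOP_TOKEN" is assumed; adjust it to your .env file.
ACTIVELOOP_TOKEN = os.getenv("ACTIVELOOP_TOKEN")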

We chose a video featuring Yann LeCun, a renowned computer scientist and AI researcher, for our experiment. In this lively discussion, LeCun examines the limitations of large language models.

The download_mp4_from_youtube() function fetches the highest-quality mp4 video file from a YouTube link and saves it under a fixed filename, lecuninterview.mp4. We simply need to copy the URL of the selected video and pass it to the function.

import yt_dlp

def download_mp4_from_youtube(url):
    # Set the options for the download
    filename = 'lecuninterview.mp4'
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
        'outtmpl': filename,
        'quiet': True,
    }

    # Download the video file
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        result = ydl.extract_info(url, download=True)

url = "https://www.youtube.com/watch?v=mBjPyte2ZZo"
download_mp4_from_youtube(url)
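
Since Whisper only needs the soundtrack for transcription, you can also ask yt_dlp for an audio-only download, which is smaller and faster. The sketch below is an assumption on my part (the walkthrough itself downloads the full mp4); the format selector and filename may need adjusting for your yt_dlp version.

def download_audio_from_youtube(url, filename='lecuninterview.m4a'):
    # Hypothetical variant of the function above: fetch only the best audio stream
    ydl_opts = {
        'format': 'bestaudio[ext=m4a]/bestaudio',
        'outtmpl': filename,
        'quiet': True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.extract_info(url, download=True)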

3 Whisper

Whisper is an advanced automatic speech recognition system created by OpenAI. It has been trained on an impressive 680,000 hours of multilingual and multitask supervised data collected from the web, giving it cutting-edge capabilities. This large and diverse dataset improves the system’s robustness, allowing it to handle accents, background noise, and technical terminology with ease. OpenAI has made available the models and code needed to build valuable apps that leverage the power of speech recognition.

The whisper package, which we installed earlier, includes the .load_model() function for downloading a model and transcribing a video file. There are several models to choose from: tiny, base, small, medium, and large. Each involves a trade-off between accuracy and speed. For this example, we will use the ‘base’ model.

import whisper

model = whisper.load_model("base")
result = model.transcribe("lecuninterview.mp4")
print(result['text'])
100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 60.2MiB/s]
 Hi, I'm Craig Smith and this is I on A On. This week I talk to Jan LeCoon, one of the seminal figures in deep learning development and a long time proponent of self-supervised learning. Jan spoke about what's missing in large language models and about his new joint embedding predictive architecture which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating conversation that I hope you'll enjoy. Okay, so Jan, it's great to see you again. I wanted to talk to you about where you've gone with so supervised learning since last week spoke. In particular, I'm interested in how it relates to large language models because the large language models really came on stream since we spoke. In fact, in your talk about JEPA, which is joint embedding predictive architecture. There you go. Thank you. You mentioned that large language models lack a world model. I wanted to talk first about where you've gone with self-supervised learning and where this latest paper stands in your trajectory. But to start, if you could just introduce yourself and we'll go from there. Okay, so my name is Jan Le Ka or Jan Le Koon who want to do it in Gilleswee and I'm a professor at New York University and at the Quarantine Institute in the Center for Data Science. And I'm also the chief AI scientist at Fair, which is the fundamental AI research lab. That's what Fair stands for. Admetta, Neil, Facebook. So tell me about where you've gone with self-supervised learning, how the joint embedding predictive architecture fits into your research. And then if you could talk about how that relates to what's lacking in large language models. Okay, self-supervised learning has been, has basically brought about a revolution in natural language processing because of their use for pre-training transformer architectures. And the fact that we use transformer architectures for that is somewhat orthogonal to the fact that we use self-supervised learning. But the way those systems are trained is that you take a piece of text, you remove some of the words, you replace them by black markers, and then you train the very large neural net to predict the words that are missing. That's a pre-training phase. And then in the process of training itself to do so, the system learns good representations of text that you can then use as input to its subsequent downstream task, I don't know, translation or Hitchbitch detection or something like that. So that's been a career revolution over the last three or four years. And including in sort of very practical applications, like every sort of type of performing contact moderation systems on Facebook, Google, YouTube, et cetera, use this kind of technique. And there's all kinds of other applications. Now, large language models are partially this, but also the idea that you can train those things to just predict the next word in a text. And if you use that, you can have those system generate text spontaneously. So there's a few issues with this. First of all, those things are what's called generative models in the sense that they predict the words, the information that is missing, words in this case. And the problem with generative models is that it's very difficult to represent uncertain predictions. So in the case of words, it's easy because we just have the system produce essentially what amounts to a score or a probability for every word in the dictionary. 
And so it cannot tell you if the word missing in a sentence like the blank chases the mouse in the kitchen. It's probably a cat, could be a dog, but it's probably a cat, right? So you have some distribution of probability over all words in the dictionary. And you can handle uncertainty in the prediction this way. But then what if you want to apply this to let's say video, right? So you show a video to the system, you remove some of the frames in that video and you train you to predict the frames that I'm missing. For example, predict what comes next in a video and that doesn't work. And it doesn't work because it's very difficult to train the system to predict an image or whole image. We have techniques for that for generating images before actually predicting good images that could fit in the video. It doesn't work very well. Or if it works, it doesn't produce internal representations that are particularly good for downstream task like object recognition or something of that time. So attempting to transfer those SSL method that are successful in LP into the realm of images has not been a big success. It's been somewhat of a success in audio. But really the only thing that works in the domain of images is those generating architectures where instead of predicting the image, you predict a representation of the image, right? So you feed. Let's say one view of a scene to the system, you run it to something on that that computes a representation of it. And then you take a different view of the same scene, you run it through the same network that produces another representation and you train the system in such a way that those two representations are as close to each other as possible. And the only thing the systems can agree on is the content of the image so they end up including the content of the image independently of the viewpoint. The difficulty of making this work is to make sure that when you show two different images, it will produce different representations. So to make sure that there are informative of the inputs and your system didn't collapse and just produce always the same representation for everything. But that's the reason why the techniques that have been generative architectures have been successful in LP aren't working so well. And images is their inability to represent complicated complicated uncertainties if you want. So now that's for training a system in SSL to learn representations of data. But what I've been proposing to do in the position paper I published a few months ago is the idea that we should use SSL to get machines to learn predictive world models. So basically to predict where the world world is going to evolve. So predict the continuation of a video, for example. Possibly predict how it's going to evolve as a consequence of an action that an intelligent agent might take. Because if we have such a world model in an agent, the agent being capable of predicting what's going to happen as a consequence of its action will be able to plan complex sequence of actions to arrive at a particular goal. And that's what's missing from all the pretty much all the AI systems that everybody has been working on or has been talking about loudly. Except for a few people who are working on robotics or it's absolutely necessary. So some of the interesting work there comes out of the robotics community, the sort of machine learning and robotics committee. Because there you need to have the skip ability for planning. 
And the work that you've been doing is it possible to build that into a large language model or is it incompatible with the architecture of large language models. It is compatible with large language models. And in fact, it might solve some of the problems that we're observing with large language models. One point is large language models is that when you use them to generate text, you initialize them with a prompt, right? So you type in an initial segment of a text, which could be in the form of a question or something. And then you hope that it will generate a consistent answer to that text. And the problem with that is that those systems generate text that sounds fine grammatically, but semantically, but sometimes they make various stupid mistakes. And those mistakes are due to two things. The first thing is that to generate that text, they don't really have some sort of objective. But then just satisfying the sort of statistical consistency with the prompt that was typed. So there is no way to control the type of answer that will produce. At least no direct way, if you want. That's the first problem. And then the second problem, which is much more acute is the fact that those large language models have no idea of the underlying reality that language. Discribes. And so there is a limit to how smart it can be and how accurate it can be because they have no experience of the real world, which is really the underlying reality of language. So their understanding of reality is extremely superficial and only contained in whatever is contained in language that they've been trained on. And that's very shallow. Most of human knowledge is completely non-linguistic. It's very difficult for us to realize that's the case, but most of what we learn has nothing to do with language. Language is built on top of a massive amount of background knowledge that we all have in common, that we call common sense. And those machines don't have that, but a cat has it, a dog has it. So we're able to reproduce some of the linguistic abilities of humans without having all the basics that a cat or dog has about how the world works. And that's why the systems are. Failures is actually. So I think what we would need is an ability for machines to learn how the world works by observation in the manner of. Babies and. Infants and young animals. Accumulate all the background knowledge about the world that constitutes the basis of common sense if you want. And then use this word model as. The tool for being able to plan sequences of actions to arrive at a goal so sitting goals is also an ability that humans and many animals have. So goals for arriving at an overall goal and then planning sequences of actions to satisfy those goals. And those my goals don't have any of that. They don't have a understanding of the learning world. They don't have a capability of planning for planning. They don't have goals. They can send sent themselves goals, other than through typing a point, which is a very weird way. Where are you in your experimentation with this. JAPAR architecture. So pretty early. So we have forms of it simplified form of them that we call joint-time meeting architectures without the P without the predictive. And they work quite well for learning representations of images. So you take an image you distorted a little bit and you train an neural net to produce. Essentially, we're also identical representations for those two distorted versions of the same image. 
And then you have some mechanism for making sure that it produces different representations for different images. And so that works really well. And we have simple forms of JAPAR the predictive version where the representation of one image is predicted from the representation of the other one. One version of this was actually presented that narrates this. It's called V-rag-L for local. And it works very well for training neural net to learn representations that are good for image experimentation, for example. But we're still working on a recipe if you want for a system that would be able to learn. The properties of the world by watching videos, understanding, for example, very basic concepts like the word is three dimensional. The system could discover that the world is three dimensional by being shown video with the moving camera. And the best way to explain how the view of the world changes as the camera moves is that every pixel has a depth that explains products, motion, et cetera. Once that concept is learned, then the notion of objects and occlusion objects are in front of others naturally emerges because objects are part of the image that move together with products, motion. At least in animate objects, animate objects are objects that move by themselves. So that could be also a natural distinction. This ability to spontaneously form the categories, the babies do this at the age of a few months. They have an audio without having the names of anything they know. Right. They can tell a car from a bicycle, the chair table, the tree, et cetera. And then on top of this, you can build notions of intuitive physics, the fact that objects that are not supported with all, for example, the babies run this at the age of nine months roughly. It's pretty late and inertia six things of that type. And then after you've acquired those basic knowledge background knowledge about how the world works, then you have pretty good ability to predict. And you can also predict perhaps the consequence of your actions when you start acting in the world. And then that gives you the ability to plan perhaps it gives you some basis for common sense. So that's the progression that we need to do. We don't know how to do any of this yet. We don't have a good recipe for training a system to predict what's going to happen in the video, for example, within any degree of usefulness. Just for the training portion, how much data would you need? It seems to me, you would need a tremendous amount of data. We need a couple of hours on Instagram or YouTube. That would be enough. Really. The amount of data of raw video data that's available. It's incredibly large. If you think about let's say five year old child and let's imagine that this five year old child can usefully analyze. Visual percept maybe ten times a second. Okay, so there's ten frames per second. And if you can't how many seconds they are in five years, it's something like 80 millions. So the child is in an 800 million frames, right? Or something like that issue. Yeah, it's an approximation. Let's say it's not that much data. We can have that tomorrow by just recording like saving a YouTube video or something. So I don't think it's an issue of data. I think it's more an issue of architecture, training paradigm, principles, mathematics, and principles on which to base this. One thing I've said is if you want to solve that problem, abandon five major pillars of machine learning, one of which is those generative models. 
And to replace them with those joint embedding architectures. A lot of people envision already convinced of that. Then to abandon the idea of doing probabilistic modeling. So we're not going to be able to predict to represent usefully the probability of the continuation of a video from condition on what we already observed. We have to be less ambitious about or mathematical framework if you want. So I've been advocating for many years to use something called energy based models, which is a weaker form of modeling under a certainty if you want. Then there is another concept that has been popular for training, joint embedding architectures over the last few years, which had the first paper on in the early 90s actually on something called same is networks. So it's called contrastive running and I'm actually advocating against that to use to this idea that once in a while you have to cover up new ideas and. And it's going to be very difficult to convince people who are very attached to those ideas to abandon them, but I think it's time for that to happen. Once you've trained one of these networks and you've established a world model, how do you transfer that to the equivalent of a large language model, one of the things that's fascinating about the development of LLM's in the last couple of years is that they're now multi model. They're not purely text and language. So how do you combine these two ideas or can you or do you need to? Yeah, so there's two or three different questions in that one question. One of them is can we usually transform existing language models? Whose purpose is only to produce text in such a way that they have they can do the planning and objectives and things like that. The answer is yes, that's probably fairly simple to do. Can we can we train language model purely on language and expected to understand the underlying reality and the answer is no and in fact. I have a paper on this in a. Overlap is a philosophy magazine called noina, which I co-wrote with a carcoring philosopher who is a post document about NYU where we say that there is a limit to what we can do with this because most of human knowledge is non linguistic. And if we only train systems on language, they will have a very superficial understanding of what they're talking about. So if you want systems that are robust and work, we need them to be grounded in reality. And it's an old debate whether they are actually being grounded or not. And so the approach that some people have taken at the moment is to basically turn everything including images and audio into text or something similar to text. So you take an image, you cut it into little squares, you turn those squares into vectors that's called tokenization. And now an image is just a sequence of tokens. The text is a sequence of words, right? And you do this with everything and you get those multiple systems and they do something. Okay, now clear. That's the right approach long term, but they do something. I think the ingredients that I'm missing there is the fact that I think if we're dealing with sort of continuous type data like video, we should use the joint embedding architecture, not the generative architectures that large language models currently use. First of all, I don't think we should tokenize them because a lot of it get lost in translation when we tokenizing edges and videos. And there's a problem also which is that those systems don't scale very well with the number of tokens you feed them with. 
So it works when you have a text and you need a context to predict the next word that is maybe the 4000 last words, it's fine. But a 4000 tokens for an image or video is tiny like you need way more than that and those systems scale horribly with the number of tokens you feed them. We're going to need to do a lot of new innovations in architectures there. And my guess is that we can't do it with generative models. So we'll have to do the joint embedding. How does a computer recognize an image without tokenization? So, commercial nets for example, don't tokenize. They take an image as pixels, they extract local features, they detect local motifs on different windows, on the image that overlap. And then those motifs get combined into other slightly less local motifs. And it's just kind of hierarchy where representations of larger and larger parts of the image are constructed as we go up in the layers. But there's no point where you cut the image into squares and you turn them into individual vectors. It's more sort of progressive. So there's been a bit of a back and forth competition between the transformer architectures that tend to rely on this tokenization and commercial nets which we don't or in different ways. And my guess is that ultimately what would be the best solution is a combination of the two where the first few layers are more like commercial nets. They exploit the structure of images and video certainly. And then by the time you get to up to several layers, they are the representation is more object based and there you have an advantage in using those those transformers. But currently basically the image transformers only have one layer of conclusions at the bottom. And I think it's a bit of a waste and it doesn't scale very well when you want to apply the video. On the timeline, this is all moving very fast. It's very fast. How long do you think before you'll be able to scale this new architecture? It's not just scale is actually coming up with a good recipe that works that would allow us to just plug a large neural net or the smaller on that on on YouTube and then learn how the work works by watching in a video. We don't have that recipe. We don't have probably don't have the architecture other than some vague idea, which I call hierarchical, but there's a lot of details to figure out that we haven't figured out this probably failure mode that we haven't yet encountered that we need to find solutions for. And so I can give you a recipe and I can tell you if welcome up with the recipe in the next six months year, two years, five years, ten years. It could be quick or it could be much more difficult than we think, but I think we're on the right path in searching for a solution in that direction. So once we come up with a good recipe, then it will open the door to new breed of AI systems, essentially that can they can plan, they can reason. And will be much more capable of having some level of common sense, perhaps, and have forms of intelligence that are more similar to what we observe being in animals and humans. Your work is inspired by the cognitive processes of the brain. Yeah. And that process of perception and then informing a world model, is that confirmed in neuroscience? It's a hypothesis that is based on some evidence from both neuroscience and cognitive science. 
So what I showed is a proposal for what's called a cognitive architecture, which is some sort of modular architectures that would be capable of the things like like planning and reasoning that we observe in capabilities that we observe in animals and humans. And that the current most current AI systems except for a few robotics systems don't have. It's important in that respect. But it's more of an inspiration really than a sort of direct copy interested in understanding the principles behind intelligence, but I would be perfectly happy to come up with some procedure that is that uses back proper level, but. At a higher level kind of does something different from the super resonating or something like that, which is why I work on self-supervisor. And so I'm not necessarily convinced that the path towards the satisfying the goal that was talking about of learning world models, etc. necessarily goes through finding biological and plausible learning procedures. What did you think of the forward forward algorithm and were you involved in that research? Well, although I've thought about things that are somewhat similar for many decades, but very few of which is actually published. It's in the direct line of a series of work that Jeff has been very passionate about for 40 years of new learning procedures of different types for basically local learning worlds that can train fairly complex neural nets to learn good representations. And things like that. So he started with the Boston machine, which was a really interesting concept that turned out to be somewhat in practical, but very interesting concept that a lot of people started. Backprop, which of course, he and I both had in developing something I worked on also simultaneously with backprop in the 1980s, called target prop, where it's an attempt at making backprop more local by computing a virtual target for. Every neuron in a large neural net that can be locally optimized. Unfortunately, the way to compute this target is normal calls. And I haven't worked on this particular type of procedure for a long time, but you should have been sure as we've used a few papers on this over the last 10 years or so. Yosha Jeff and I when we started the deep learning conspiracy in the early 2000 to renew the interest of the community and deep learning. We focused largely on forms of kind of local self supervised learning methods. So things like in just case that was focused on restricted Boston machines. Yosha settled on something called denosing auto encoders, which is the basis for a lot of the large language model type training that we're using today. I was focusing more on what's called sparsato encoders. So this is different ways of doing training a layer if you want in the neural net to learn something useful without being it without it being focused on any particular task. So you don't need label data. And a lot of that work has been put aside a little bit by the incredible success of just pure supervised learning with very deep model we found ways to train very large neural nets with with many layers with just back prop and so we put those techniques on the side and Jeff basically is coming back to them. And I'm coming back to them in different form a little bit with this so the JEPA architecture. And he also had ideas in the past, something called recirculation. A lot of informax methods, which actually the JEPA use this thing ideas are similar. He's a very productive source of ideas that are that sometimes seems out of the left field. 
And where the community pays attention and then doesn't quite figure it right away and then it takes a few years for those things to disseminate and sometimes they don't just a minute. Hello. Beauregard, I'm recording right now. Who? Rasmus? I'll answer when I get back. Yeah, you'll be famous someday. Okay, okay, great. Thanks very much. Yep. Bye-bye. Sorry about that. There was a very interesting talk by David Chalmers. At some level it was not a very serious talk because everyone knows as you described earlier that large language models are not reasoning. They don't have common sense. He doesn't claim that they do. No, that's right. But what you're describing with this JEPA architecture, if you could develop a large language model that is based on a world model. You'll be a large language model. You'll be a world model. At first it would not be based on language. You'll be based on visual perception, maybe audio perception. If you have a machine they can do what a cat does, you don't need language. Language can be put on top of this. To some extent language is easy, which is why we have those large language models. We don't have systems that run how they work. But let's say that you build this world model and you put language on top of it so that you can interrogate it, communicate with it. Does that take you a step toward what Chalmers was talking about? And I don't want to get into the theory of consciousness, but at least an AI model that would exhibit a lot of the features of consciousness. David actually has two different definitions for sentience and consciousness. You can have sentience without consciousness. Simple animal or sentience. In the sense that they have experience, emotions, and drives and things like that. But they may have the type of consciousness that we think we have. At least the illusion of consciousness. So sentience I think can be achieved by the type of architecture I propose if we can make them work, which is a big if. And the reason I think that is is that. What those systems would be able to do is have objectives that you need to satisfy. Think of them as drives. And having the system. Compute those drives, which would be basically predictions of. Of the outcome of a situation or a sequence of actions that the agent might take. Basically, those would be indistinguishable from emotions. So if you have your new situation where you can take a sequence of actions to arrive at a result. And the outcomes that you're predicting. It's terrible results in your destruction. Okay, that creates fear. You try to figure out that is another sequence of action I take that would not. Result in the same outcome. If you make those predictions with these are huge uncertainty in the prediction. One of which. With probability half maybe. Is that you get destroyed. It creates even more fear. And then on the contrary, if the outcome is going to be good, then it's more like elation. So those are long term prediction of outcomes, which. Systems that use the architecture and proposing I think will have so they will have. Some level of experience and they will have emotions that will drive the behavior. Because they would be able to anticipate outcomes. And perhaps act on them. Now consciousness is different story. So my full theory of consciousness, which I've talked to David about. Thinking it was going to tell me I'm crazy. But he said no, actually that overlaps with some pretty common. The theories of consciousness among philosophers is. 
Is the idea that we have essentially a single world model in our head. Somewhere in a prefrontal cortex. And that world model is configurable to. The situation we're facing at the moment. So we're configuring our brain. Including our world model for solving the problem that you know satisfying the objective that we currently set to ourselves. And because we only have a civil world model engine. We can only solve one such task at any one time. This is a characteristic of humans and. Many animals, which is that we focus on the task. We can't do anything else. And we can do subconscious tasks simultaneously. But we can only do one conscious deliberate task at any one time. And it's because we have a single world model engine. Now, why would evolution build us in a way that we have a single world model engine? There's two reasons for this. One reason is. That single world model engine can be. Configured for the situation at hand. But only the part that changes from one situation to another. And so it can share knowledge between different situations. The physics of the world doesn't change. If you are building a table or trying to jump over a river or something. And so you are sort of. Basic knowledge about how the world works doesn't need to be reconfigured. It's only the thing that depends on the situation at hand. So that's one reason. And the second reason is that. If we had multiple models of the world, they would have to be individually less powerful because. You have to all fit them within your brain and that's an immediate size. So I think that's probably the reason why we only have one. And so if you have only one world model that needs to be configured for the situation at hand, you need some sort of meta module that configures it. Figures out like what situation am I in? What sub goals should I set myself and how should I configure the rest of the. My brain to solve that problem. And that module would have to be able to observe. The state and capabilities would have to have a model of the rest of itself. It's an of the agent. And that perhaps is something that gives us the illusion of consciousness. So I must say this is very speculative. Okay, I'm not saying this is exactly what happens, but it. Fits with a few things that we know about. About consciousness. You were saying that this. Architecture is inspired by cognitive science or neuroscience. How much do you think your work, Jeff's work, other people's work. At the kind of the leading edge of deep learning or machine learning research is informing neuroscience. Or is it more of the other way around? Certainly in the beginning, it was the other way around. But at this point, it seems that there's a lot of information that then is reflecting back to the fields. So it's been a bit of a feedback loop. So new concepts in machine learning have driven people in neuroscience and curiosity science to. Use computational models if you want for whether we're studying. And many of my colleagues and my favorite colleagues work on this. The whole field of computational neuroscience basically is around this. And what we're seeing today is a big influence. Or rather a wide use of deep learning models such as conventional nets and transformers. As models. Explanatory model of what goes on in the visual cortex, for example. So the people, you know, for a number of years now who have. 
Don FMRI experiments and then show the same image to a subject in the FMRI machine and to a conventional net and then try to explain the variance they observe in the activity of various areas of the brain. With the activity that is observed in corresponding neural net. And what comes out of the studies is that. The notion of multilayer hierarchy that we have. Commercial nets. Matches the type of hierarchy that we observe in the at least in the ventral pathway of the visual system. So V1 corresponds to the first few layers of the conventional net and in V2 to some of the following layers and V4. More and then the E4 temporal cortex to the top layers are the best explanation of each other if you try to do the matching right. One of my colleagues at Fair Paris. There's a dual affiliation also with. Norsepin that academic lab in Paris has done the same type of experiment using transformer architectures and I wish models essentially. And observing. When activity of people who are listening to stories and attempting to understand the story. So that they can answer questions about the story. Or or give it. A summary of it. And there the matching is not that great in sense that there is some sort of correspondence between the type of activity you observe in those large transformers. And the type of activity is in the brain but the hierarchy is not nearly as clear. And it's what is clear is that the brain is a capable of making much longer term prediction that those language models are capable of today. So that begs the question of what are we missing in terms of architecture and to some extent it's jibes with the idea that. The models that we should have should build hierarchical. Representations of the preset that different levels of abstraction so that the highest level of abstraction. Are able to make long term predictions that perhaps are less accurate than the lower level but longer term. We don't need to have that in current models. I had a question I wanted to ask you since our last conversation you have a lot of things going on. You teach you have your role at Facebook. Your role I think at CVPR or how do you work on this? Have like three days a week or two hours a day where you're just focused. Are you a tinkering with code or with diagrams or is it in iterations with some of your graduates who the. Or is this something where it's kind of always in your mind and you're in the shower and you think yeah that might work. I'm just curious how do you love all of it? Okay so first of all once you understand is that my position at meta at fair is not a position of management. I don't manage anything. I'm chief scientist which means I try to inspire others to work on things that I think are promising. And I advise several projects that I'm not personally involved in. I work on strategy and orientations and things like this but I don't do that to the management. I'm very thankful that you know is doing this for fair and doing very very good job. I'm not very good at it either so it's for you better if I don't if I don't do it. So that allows me to spend quite a bit of time on research itself. And I don't have a group of engineers and scientists working with me. I have a group of more junior people working with me students and postdocs. Both at fair and at NYU. Both in New York and in Paris. And working with students and postdocs is wonderful because they are sure less they're creative. 
Many of them have amazing talents in theoretical abilities or implementation abilities or an academic things work. And so what happens very often is either one of them will come up with an idea that whose results surprise me and I was thinking that is wrong. And that's the best thing that can happen. Or sometimes I come up with an idea and turns out to work which is great. Usually not in the form that I formatted it normally it's there's a lot of contributions that have to be brought to an idea for to make it work. And then what's happened also quite a bit in the last few years is I come up with an idea that I'm sure it's going to work. And she students and postdoc try to make it work and they come back to me and said, oh sorry it doesn't work and here is a fair move. Oh yeah, we should have thought about this. Okay, so here's a new idea to get around this problem. So for example several years ago I was advocating for the use of generative models with latent variables to handle the uncertainty. And I completely changed my mind about this now advocating for those joint evading architecture that do not actually predict. I was more or less invented those contrasting methods that a lot of people are talking about and using at this point and I'm advocating against them now in favor of those methods such as V Craig or about the twins that basically instead of using contrasting methods can try to maximize the information content of representations and that idea of information maximization. And I know about for decades because Jeff was working on this in the 1980s when I was opposed to her with him. And he abandoned the idea pretty much he had a couple papers with one of his students who back her in the early 90s that show that he could work but only in sort of small dimension and it pretty much abandoned it. And the reason he abandoned it is because of a major flaw with those methods. Due to the fact that we don't have any good measures of information content or the measures that we had are up about not lower bound so we can try to maximize information content very well. And so I never thought about those that those methods could ever work because of my experience with with that. And why don't we post out stiff and the actually kind of revise the idea and show that it worked that was about a twins paper. So we changed our mind. And so now that we had a new tool information about maximization applied to the joint embedding architectures and came up with an improvement of it called V Craig. And and now we're working on that. But there are other ideas we're working on to solve the same problem with other groups of people at the moment, which probably will come up in the next few months. So we don't again we don't have a perfect recipe yet. And we're looking for one and hopefully one of the things that we are working on with stick. Yeah. Are you coding models and then training them and running them or are you conceptualizing and turning it over to someone else. So it's mostly conceptualizing and mostly letting the students and postdocs doing the implementation, although I do a little bit of coding myself, but not enough to my taste. I wish I could do more. I have a lot of postdocs and students and so I have to devote sufficient amount of my time to interact with them. Sure. And then leave them some breathing room to do the work that they do best. And so it's interesting question because that question was asked to Jeff to start right. Yeah. 
And he said he was using matlab and he said you have to do this those things yourself because it's something doesn't. If you give a project to a student and a project come back saying it doesn't work, you don't know if it's because there is a conceptual problem with the idea or whether it's just some stupid detail that wasn't done right. And when I'm facing with this, that's when I start looking at the code and perhaps experimenting with it myself. Or I get multiple students to work on them to collaborate on the project so that if one makes an error, perhaps the other one will detect what it is. I love coding. I just don't do as much as I like it. Yeah. This JAPA or the forward forward things have moved so quickly. You think back to when the transformers were introduced or at least the attention mechanism and that kind of shifted the field. It's difficult for an outsider to judge when I hear the JAPA talk. Is this one of those moments that wow this idea is going to transform the field or have you been through many of these moments and they contribute to some extent but they're not the answer to ship the paradigm. It's hard to tell at first but whenever I kind of keep pursuing an idea and promote it, it's because I have a good hunch that they're going to have a relatively big impact. And it was easy for me to do before I was as famous as I am now because I wasn't listened to that much. So I could make some claim and now I have to be careful what I claim because a lot of people listen to me. Yeah. And it's the same issue with JAPA. So JAPA, for example, a few years ago, was promoting this idea of capsules. Yeah. And everybody was thinking this is going to be like a big thing and a lot of people started working on it. It turns out it's very hard to make it work and it didn't have the impact that many people started would have, including JAPA. And it turned out to be limited by implementation issues and stuff like that. The underlying idea behind it is good but like very often the practical side of it kills it. There was the case also with Wilson machines. They are conceptually super interesting. They just don't work that well. They don't scale very well. They're very slow to train because actually it's a very interesting idea that everybody should know about. So there's a lot of those ideas that allow us, there are some mental objects that allow us to think differently about what we do. But they may not actually have that much practical impact. For forward, we don't know yet. It could be like the weak sleep algorithm that Jeff talked about 20 years ago or something. Or it could be the new back prop. We don't know. Or the new target prop, which is interesting but not really mainstream. Because it has some advantages in some situations, but it's not. It brings you like an improved performance on some standard benchmark that people are interested in. So it doesn't have the right of deal perhaps. So it's hard to figure out. But what I can tell you is that if we figure out how to train one of those. JAPA start architecture from video. And the representations that it learns are good. And the predictive model that he learns are good. This is going to open the door to a new breed of AI systems. You have no, no doubt about that. It's exciting the speed at which things have been moving in particular in the last three years. About, about transformers and the history of transformers. Once you only say about this is that. 
We see the most visible progress, but we don't realize by how much of a history there was behind it. And even the people who actually came up with some of those ideas don't realize that. They are ideas actually had roots in other things. For example, back in the 90s, people were already working on things that we could now call mixer of experts. And also multiplicative interactions, which at the time were called the semi-py networks or things like that. So it's the idea that instead of having two variables that you add together with weights, you multiply them. And then you have a way for you have weights before you multiply. It doesn't matter. This idea goes back every long time since the 1980s. And then you had ideas of linearly combining multiple inputs with weights that are between 0 and 1 and sum to 1 and are dependent. So now we call this attention, but this is a circuit that was used in mixer mixer of expert models back in the early 90s also. So the idea is old. Then there were ideas of neural networks that have a separate module for computation and memory that's two separate modules. So one module that is a classical neural net. And the output of that module would be an address into an associative memory that itself would be a different type of neural net. And those different types of neural net associative memories use what we now call attention. So they compute the similarity or the product between a query vector and a bunch of key vectors. And then they normalize and so this onto one and then the output of the memory is weighted some of the value value vectors. There was a series of papers by my colleagues in the early days of fair actually in 2014, 15 one called memory network, one called end to end memory network, one called the stack of maintain memory network and other one called key value memory network and then a whole bunch of things. So those use those associative memories that basically are the basic modules that are used inside the transformers and then attention mechanism like this were popularized in around 2015 by a paper from the usual bench was good at Miller and demonstrated that they are extremely powerful for doing things like translation language translation in NLP. And that really started the craze on attention. And so you come on all those ideas and you get a transformer that uses something called self attention where the input tokens are used both as queries and keys in a associative memory very much like a memory network. And then you use this as a layer if you want you put several of those in a layer and then you stack those layers and that's what the transformer is. And then attention is not obvious but there is one those ideas have been around and people have been talking about it and the similar work also around 2015, 16 and from deep mind called the neural turning machine or differentiable neural computer those ideas that you have a separate module for competition and other one from memory. And then you have a separate or higher and group also on neural nets that have separate memory associative memory type system. They are the same type of things. I think this idea is very powerful. The big advantage of transformers is that the same way commercial nets are equivalent to shift so you shift the input of a commercial net. The output also shifts but otherwise doesn't change. The transformer if you permute the input tokens the output tokens get permuted the same way but are otherwise unchanged so. Comments are equivalent to shifts. 
Transformers are equivalent to permutation and with a combination of the two is great. She's why I think the combination of cognets at a low level and transformer at the top I think for natural input data like image and video is a very combination. The combinatorial effect as the field progresses all of these ideas create a cascade of new ideas. Is that why the field is speeding up? It's not the only reason the there's a number of reasons the. So one of the reasons is that you build on each other's ideas and etc which of course is the whole mark of science in general also art. But there is a number of characteristics I think that. Help that to a large extent the one in particular is the fact that. Most research work in this area now comes with code that other people can use and build upon right so. The habit of distributing your code in a source I think is a is an enormous. Contributor to the acceleration of progress the other one is the availability of the most sophisticated tools like pet or for example or TensorFlow or jacks or things like that where which where researchers can build on top of each other's code base basically to. Come up with really complex concepts. And all of this is committed by the fact that some of the main contributors that are from industry to those ideas don't seem to be too. Obsessive compulsive about IP protection. So meta and in particular is very open we may occasionally fight patterns but we're not going to see you for infringing them unless you sue us. Google as a similar policy. You don't see this much from companies that tend to be a little more secretive about their research like Apple and Amazon but although I just talked to Sam in Benio he's trying to implement that openness. More power to him good luck it's a culture change for a company like Apple so this is not a battle I want to fight but if you can win it like good for him. Yeah. It's difficult difficult battle also I think another contributor is that there are real practical commercial applications of all of this they're not just imagine they are real. And so that creates a market and that increases the size of the community and so that creates more appeal for new ideas right more more. Outlets if you want for new ideas do you think that this. Hockey stick curve is going to continue for a while or do you think will hit a plateau then. Is it difficult to say nothing works more like a next next financial that the beginning of a sigmoid so every natural process has to saturate at some point. The question is when and I don't see any obvious wall that is being hit by a research at the moment it's quite the opposite seems to be an acceleration in fact of progress. And there's no question that we need the new concepts and new ideas in fact that's the purpose of my research at the moment because I think there are limitations to current approaches. This is not to say that we just need to. Scale up deep learning and turn the crank and we'll get to human level intelligence I don't believe that. I don't believe that it's just a matter of making reinforcement learning more efficient I don't think that's possible with the current way reinforcement learning is formulated. And we're not going to get there with supervised learning either. I think we definitely need. New innovative concepts but I don't see any slow down yet. I don't see any people turning away from me I think it's obviously not going to work but despite there is. Screams of various critiques right sure about that but. But. 
They to some extent at the moment are fighting a real guard battle yeah because they plan to flag the city. You're never going to be able to do this and then. So you can do this or they plan to flag a little further down and now you're not going to be able to do this so it's a tiny yeah okay my last question are you still doing music. I am and are you still building instruments are. Electronic wind instruments yes I'm. The process of designing a new one well. Yeah okay maybe I think I said this last time maybe I could get some recordings and put them into the podcast or something. Right I probably told you nuts are such a great performer and. I'm probably better at conceptualizing and building those instruments and playing them but but yeah it's possible. That's it for this episode I want to thank you and for his time if you want to read a transcript of today's conversation you can find one on our website. I on AI that's EY E hyphen O N dot AI. Feel free to drop us a line with comments or suggestions at Craig at I on AI that's C R A I G. At EY E hyphen O N dot AI. And remember the singularity may not be near but AI is about to change your world so pay attention. you

We get the result in the form of raw text, which we save to a text file for later use:

os.makedirs('docs', exist_ok=True)  # make sure the output directory exists
with open('docs/text.txt', 'w') as file:
    file.write(result['text'])

4 Summarization with LangChain

This imports the necessary LangChain components for text summarization and initializes an instance of OpenAI’s large language model with a temperature of 0. The major components include classes for handling large texts, prompt construction, and the summarization chains themselves.

This code instantiates the RecursiveCharacterTextSplitter class, which is in charge of separating input text into smaller parts.

from langchain import OpenAI, LLMChain
from langchain.chains.mapreduce import MapReduceChain
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(model_name="text-davinci-003", temperature=0)
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)

It uses a chunk_size of 1000 characters, no chunk_overlap, and spaces, commas, and newline characters as separators. This ensures that the input text is broken down into digestible chunks, allowing the language model to process them efficiently.

We’ll open the previously saved text file and split the transcript using the .split_text() method.

with open('docs/text.txt') as f:
    text = f.read()
texts = text_splitter.split_text(text)

Each Document object is initialized with the content of a chunk from the texts list. The [:4] slice notation indicates that only the first four chunks will be used to create the Document objects.

from langchain.docstore.document import Document

docs = [Document(page_content=t) for t in texts[:4]]

The textwrap library in Python provides a convenient way to wrap and format plain text by adjusting line breaks in an input paragraph. It is particularly useful when displaying text within a limited width, such as in console outputs, emails, or other formatted text displays. The library includes convenience functions like wrap, fill, and shorten, as well as the TextWrapper class that handles most of the work. If you’re curious, I encourage you to explore the textwrap documentation, as there are other functions in the library that can be useful depending on your needs.
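For instance, here is a quick look at the wrap and shorten helpers mentioned above; the sample string is just a placeholder of our own.

import textwrap

sample = "LangChain makes it straightforward to summarise long transcripts with large language models."
print(textwrap.wrap(sample, width=30))     # returns a list of wrapped lines
print(textwrap.shorten(sample, width=40))  # collapses whitespace and truncates with ' [...]'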

from langchain.chains.summarize import load_summarize_chain
import textwrap

chain = load_summarize_chain(llm,
                             chain_type="map_reduce")


output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)
 Craig Smith interviews Jan LeCoon, a deep learning developer and proponent of self-supervised
learning, about his new joint embedding predictive architecture and his theory of consciousness. Jan
discusses the gap in large language models and the potential for AI systems to exhibit features of
consciousness. Self-supervised learning is a technique used to train large neural networks to
predict missing words in a piece of text, and generative models are used to predict missing words in
a text, but it is difficult to represent uncertain predictions.
The following line lets us inspect the prompt template that the map_reduce chain uses under the hood:

chain.llm_chain.prompt.template
'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

The “stuff” approach is the most basic and straightforward: it places all of the transcribed text into a single prompt. This can cause problems if the combined text is longer than the LLM’s available context size, and it is not the most efficient way to process large amounts of text.
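Before stuffing everything into one prompt, it can be worth estimating how many tokens the combined documents contain. Here is a minimal sketch using the tiktoken package, which is an optional dependency not used elsewhere in this notebook:

import tiktoken

# Rough token count for the chunks we are about to place in a single prompt
enc = tiktoken.encoding_for_model("text-davinci-003")
total_tokens = sum(len(enc.encode(doc.page_content)) for doc in docs)
print(f"Approximate tokens in the combined documents: {total_tokens}")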

We’re going to try out the prompt below. This prompt will output the summary in the form of bullet points.

prompt_template = """Write a concise bullet point summary of the following:


{text}


CONSCISE SUMMARY IN BULLET POINTS:"""

BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template,
                        input_variables=["text"])

Next, we initialise the summarization chain using stuff as the chain_type and the prompt defined above.

chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=BULLET_POINT_PROMPT)

output_summary = chain.run(docs)

wrapped_text = textwrap.fill(output_summary,
                             width=1000,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

- Jan LeCoon is a seminal figure in deep learning development and a long time proponent of self-supervised learning
- Discussed his new joint embedding predictive architecture which may be a step toward filling the gap in large language models
- Theory of consciousness and potential for AI systems to exhibit features of consciousness
- Self-supervised learning revolutionized natural language processing
- Large language models lack a world model and are generative models, making it difficult to represent uncertain predictions

By using the custom prompt with the stuff chain, we were able to obtain a short bullet-point summary of the dialogue.

We can also develop bespoke prompts in LangChain that are tailored to specific needs. For example, if you want the summary to be written in French, you can easily create a prompt that instructs the language model to produce its summary in that language, as in the sketch below.
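Here is a minimal sketch of such a prompt; the exact wording of the template is our own and can be adapted to your needs.

french_prompt = PromptTemplate(
    template="""Write a concise summary in French of the following:


{text}


RESUME CONCIS EN FRANCAIS:""",
    input_variables=["text"],
)

# Reuse the stuff chain, this time with the French prompt
french_chain = load_summarize_chain(llm, chain_type="stuff", prompt=french_prompt)
# output_summary_fr = french_chain.run(docs)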

The ‘refine’ summarising chain is a technique for producing more precise and context-aware summaries. This chain type is designed to iteratively refine the summary by incorporating additional context as needed: it first creates a summary of the first chunk, then updates the work-in-progress summary with new information from each subsequent chunk.

chain = load_summarize_chain(llm, chain_type="refine")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)
  Craig Smith interviews Jan LeCoon, a deep learning developer and proponent of self-supervised
learning, about his new joint embedding predictive architecture and his theory of consciousness. Jan
discusses the gap in large language models and the potential for AI systems to exhibit features of
consciousness. He explains how self-supervised learning has revolutionized natural language
processing through the use of transformer architectures for pre-training, such as taking a piece of
text, removing some of the words, and replacing them with black markers to train a large neural net
to predict the words that are missing. This technique has been used in practical applications such
as contact moderation systems on Facebook, Google, YouTube, and more. Jan also explains how this
technique can be used to represent uncertain predictions in generative models, such as predicting
the missing words in a text, or predicting the missing frames in a video.

The ‘refine’ summarization chain in LangChain provides a flexible and iterative approach to generating summaries, allowing you to customize prompts and provide additional context for refining the output. This method can result in more accurate and context-aware summaries compared to other chain types like ‘stuff’ and ‘map_reduce’.
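As a hedged sketch of that customisation, we can pass our own prompt to the refine chain; the template wording below is our own, and the refine_prompt keyword is our assumption about the load_summarize_chain loader’s arguments.

refine_template = (
    "Your job is to produce a final summary.\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "Refine the existing summary (only if needed) with the following additional context:\n"
    "------------\n{text}\n------------\n"
    "Given the new context, refine the original summary as bullet points."
)
refine_prompt = PromptTemplate(template=refine_template,
                               input_variables=["existing_answer", "text"])

custom_refine_chain = load_summarize_chain(llm,
                                           chain_type="refine",
                                           refine_prompt=refine_prompt)
# refined_summary = custom_refine_chain.run(docs)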

5 Adding Transcripts to Deep Lake

This method can be extremely useful when you have more data. Let’s see how we can extend our experiment by adding multiple URLs, storing them in a Deep Lake vector database, and retrieving information with a QA chain.

First, we need to modify the script for video downloading slightly, so it can work with a list of URLs.

import yt_dlp

def download_mp4_from_youtube(urls, job_id):
    # This will hold the file path, title, and author of each downloaded video
    video_info = []

    for i, url in enumerate(urls):
        # Set the options for the download
        file_temp = f'./{job_id}_{i}.mp4'
        ydl_opts = {
            'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
            'outtmpl': file_temp,
            'quiet': True,
        }

        # Download the video file
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            result = ydl.extract_info(url, download=True)
            title = result.get('title', "")
            author = result.get('uploader', "")

        # Add the title and author to our list
        video_info.append((file_temp, title, author))

    return video_info

urls=["https://www.youtube.com/watch?v=mBjPyte2ZZo&t=78s",
    "https://www.youtube.com/watch?v=cjs7QKJNVYM",]
vides_details = download_mp4_from_youtube(urls, 1)

Then we transcribe the videos using Whisper, as we saw previously, and save the results in a text file.

import whisper

# load the model
model = whisper.load_model("base")

# iterate through each video and transcribe
results = []
for video in videos_details:
    result = model.transcribe(video[0])
    results.append( result['text'] )
    print(f"Transcription for {video[0]}:\n{result['text']}\n")
Transcription for ./1_0.mp4:
 Hi, I'm Craig Smith and this is I on A On. This week I talk to Jan LeCoon, one of the seminal figures in deep learning development and a long time proponent of self-supervised learning. Jan spoke about what's missing in large language models and about his new joint embedding predictive architecture which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating conversation that I hope you'll enjoy. Okay, so Jan, it's great to see you again. I wanted to talk to you about where you've gone with so supervised learning since last week spoke. In particular, I'm interested in how it relates to large language models because the large language models really came on stream since we spoke. In fact, in your talk about JEPA, which is joint embedding predictive architecture. There you go. Thank you. You mentioned that large language models lack a world model. I wanted to talk first about where you've gone with self-supervised learning and where this latest paper stands in your trajectory. But to start, if you could just introduce yourself and we'll go from there. Okay, so my name is Jan Le Ka or Jan Le Koon who want to do it in Gilleswee and I'm a professor at New York University and at the Quarantine Institute in the Center for Data Science. And I'm also the chief AI scientist at Fair, which is the fundamental AI research lab. That's what Fair stands for. Admetta, Neil, Facebook. So tell me about where you've gone with self-supervised learning, how the joint embedding predictive architecture fits into your research. And then if you could talk about how that relates to what's lacking in large language models. Okay, self-supervised learning has been, has basically brought about a revolution in natural language processing because of their use for pre-training transformer architectures. And the fact that we use transformer architectures for that is somewhat orthogonal to the fact that we use self-supervised learning. But the way those systems are trained is that you take a piece of text, you remove some of the words, you replace them by black markers, and then you train the very large neural net to predict the words that are missing. That's a pre-training phase. And then in the process of training itself to do so, the system learns good representations of text that you can then use as input to its subsequent downstream task, I don't know, translation or Hitchbitch detection or something like that. So that's been a career revolution over the last three or four years. And including in sort of very practical applications, like every sort of type of performing contact moderation systems on Facebook, Google, YouTube, et cetera, use this kind of technique. And there's all kinds of other applications. Now, large language models are partially this, but also the idea that you can train those things to just predict the next word in a text. And if you use that, you can have those system generate text spontaneously. So there's a few issues with this. First of all, those things are what's called generative models in the sense that they predict the words, the information that is missing, words in this case. And the problem with generative models is that it's very difficult to represent uncertain predictions. So in the case of words, it's easy because we just have the system produce essentially what amounts to a score or a probability for every word in the dictionary. 
And so it cannot tell you if the word missing in a sentence like the blank chases the mouse in the kitchen. It's probably a cat, could be a dog, but it's probably a cat, right? So you have some distribution of probability over all words in the dictionary. And you can handle uncertainty in the prediction this way. But then what if you want to apply this to let's say video, right? So you show a video to the system, you remove some of the frames in that video and you train you to predict the frames that I'm missing. For example, predict what comes next in a video and that doesn't work. And it doesn't work because it's very difficult to train the system to predict an image or whole image. We have techniques for that for generating images before actually predicting good images that could fit in the video. It doesn't work very well. Or if it works, it doesn't produce internal representations that are particularly good for downstream task like object recognition or something of that time. So attempting to transfer those SSL method that are successful in LP into the realm of images has not been a big success. It's been somewhat of a success in audio. But really the only thing that works in the domain of images is those generating architectures where instead of predicting the image, you predict a representation of the image, right? So you feed. Let's say one view of a scene to the system, you run it to something on that that computes a representation of it. And then you take a different view of the same scene, you run it through the same network that produces another representation and you train the system in such a way that those two representations are as close to each other as possible. And the only thing the systems can agree on is the content of the image so they end up including the content of the image independently of the viewpoint. The difficulty of making this work is to make sure that when you show two different images, it will produce different representations. So to make sure that there are informative of the inputs and your system didn't collapse and just produce always the same representation for everything. But that's the reason why the techniques that have been generative architectures have been successful in LP aren't working so well. And images is their inability to represent complicated complicated uncertainties if you want. So now that's for training a system in SSL to learn representations of data. But what I've been proposing to do in the position paper I published a few months ago is the idea that we should use SSL to get machines to learn predictive world models. So basically to predict where the world world is going to evolve. So predict the continuation of a video, for example. Possibly predict how it's going to evolve as a consequence of an action that an intelligent agent might take. Because if we have such a world model in an agent, the agent being capable of predicting what's going to happen as a consequence of its action will be able to plan complex sequence of actions to arrive at a particular goal. And that's what's missing from all the pretty much all the AI systems that everybody has been working on or has been talking about loudly. Except for a few people who are working on robotics or it's absolutely necessary. So some of the interesting work there comes out of the robotics community, the sort of machine learning and robotics committee. Because there you need to have the skip ability for planning. 
And the work that you've been doing is it possible to build that into a large language model or is it incompatible with the architecture of large language models. It is compatible with large language models. And in fact, it might solve some of the problems that we're observing with large language models. One point is large language models is that when you use them to generate text, you initialize them with a prompt, right? So you type in an initial segment of a text, which could be in the form of a question or something. And then you hope that it will generate a consistent answer to that text. And the problem with that is that those systems generate text that sounds fine grammatically, but semantically, but sometimes they make various stupid mistakes. And those mistakes are due to two things. The first thing is that to generate that text, they don't really have some sort of objective. But then just satisfying the sort of statistical consistency with the prompt that was typed. So there is no way to control the type of answer that will produce. At least no direct way, if you want. That's the first problem. And then the second problem, which is much more acute is the fact that those large language models have no idea of the underlying reality that language. Discribes. And so there is a limit to how smart it can be and how accurate it can be because they have no experience of the real world, which is really the underlying reality of language. So their understanding of reality is extremely superficial and only contained in whatever is contained in language that they've been trained on. And that's very shallow. Most of human knowledge is completely non-linguistic. It's very difficult for us to realize that's the case, but most of what we learn has nothing to do with language. Language is built on top of a massive amount of background knowledge that we all have in common, that we call common sense. And those machines don't have that, but a cat has it, a dog has it. So we're able to reproduce some of the linguistic abilities of humans without having all the basics that a cat or dog has about how the world works. And that's why the systems are. Failures is actually. So I think what we would need is an ability for machines to learn how the world works by observation in the manner of. Babies and. Infants and young animals. Accumulate all the background knowledge about the world that constitutes the basis of common sense if you want. And then use this word model as. The tool for being able to plan sequences of actions to arrive at a goal so sitting goals is also an ability that humans and many animals have. So goals for arriving at an overall goal and then planning sequences of actions to satisfy those goals. And those my goals don't have any of that. They don't have a understanding of the learning world. They don't have a capability of planning for planning. They don't have goals. They can send sent themselves goals, other than through typing a point, which is a very weird way. Where are you in your experimentation with this. JAPAR architecture. So pretty early. So we have forms of it simplified form of them that we call joint-time meeting architectures without the P without the predictive. And they work quite well for learning representations of images. So you take an image you distorted a little bit and you train an neural net to produce. Essentially, we're also identical representations for those two distorted versions of the same image. 
And then you have some mechanism for making sure that it produces different representations for different images. And so that works really well. And we have simple forms of JAPAR the predictive version where the representation of one image is predicted from the representation of the other one. One version of this was actually presented that narrates this. It's called V-rag-L for local. And it works very well for training neural net to learn representations that are good for image experimentation, for example. But we're still working on a recipe if you want for a system that would be able to learn. The properties of the world by watching videos, understanding, for example, very basic concepts like the word is three dimensional. The system could discover that the world is three dimensional by being shown video with the moving camera. And the best way to explain how the view of the world changes as the camera moves is that every pixel has a depth that explains products, motion, et cetera. Once that concept is learned, then the notion of objects and occlusion objects are in front of others naturally emerges because objects are part of the image that move together with products, motion. At least in animate objects, animate objects are objects that move by themselves. So that could be also a natural distinction. This ability to spontaneously form the categories, the babies do this at the age of a few months. They have an audio without having the names of anything they know. Right. They can tell a car from a bicycle, the chair table, the tree, et cetera. And then on top of this, you can build notions of intuitive physics, the fact that objects that are not supported with all, for example, the babies run this at the age of nine months roughly. It's pretty late and inertia six things of that type. And then after you've acquired those basic knowledge background knowledge about how the world works, then you have pretty good ability to predict. And you can also predict perhaps the consequence of your actions when you start acting in the world. And then that gives you the ability to plan perhaps it gives you some basis for common sense. So that's the progression that we need to do. We don't know how to do any of this yet. We don't have a good recipe for training a system to predict what's going to happen in the video, for example, within any degree of usefulness. Just for the training portion, how much data would you need? It seems to me, you would need a tremendous amount of data. We need a couple of hours on Instagram or YouTube. That would be enough. Really. The amount of data of raw video data that's available. It's incredibly large. If you think about let's say five year old child and let's imagine that this five year old child can usefully analyze. Visual percept maybe ten times a second. Okay, so there's ten frames per second. And if you can't how many seconds they are in five years, it's something like 80 millions. So the child is in an 800 million frames, right? Or something like that issue. Yeah, it's an approximation. Let's say it's not that much data. We can have that tomorrow by just recording like saving a YouTube video or something. So I don't think it's an issue of data. I think it's more an issue of architecture, training paradigm, principles, mathematics, and principles on which to base this. One thing I've said is if you want to solve that problem, abandon five major pillars of machine learning, one of which is those generative models. 
And to replace them with those joint embedding architectures. A lot of people envision already convinced of that. Then to abandon the idea of doing probabilistic modeling. So we're not going to be able to predict to represent usefully the probability of the continuation of a video from condition on what we already observed. We have to be less ambitious about or mathematical framework if you want. So I've been advocating for many years to use something called energy based models, which is a weaker form of modeling under a certainty if you want. Then there is another concept that has been popular for training, joint embedding architectures over the last few years, which had the first paper on in the early 90s actually on something called same is networks. So it's called contrastive running and I'm actually advocating against that to use to this idea that once in a while you have to cover up new ideas and. And it's going to be very difficult to convince people who are very attached to those ideas to abandon them, but I think it's time for that to happen. Once you've trained one of these networks and you've established a world model, how do you transfer that to the equivalent of a large language model, one of the things that's fascinating about the development of LLM's in the last couple of years is that they're now multi model. They're not purely text and language. So how do you combine these two ideas or can you or do you need to? Yeah, so there's two or three different questions in that one question. One of them is can we usually transform existing language models? Whose purpose is only to produce text in such a way that they have they can do the planning and objectives and things like that. The answer is yes, that's probably fairly simple to do. Can we can we train language model purely on language and expected to understand the underlying reality and the answer is no and in fact. I have a paper on this in a. Overlap is a philosophy magazine called noina, which I co-wrote with a carcoring philosopher who is a post document about NYU where we say that there is a limit to what we can do with this because most of human knowledge is non linguistic. And if we only train systems on language, they will have a very superficial understanding of what they're talking about. So if you want systems that are robust and work, we need them to be grounded in reality. And it's an old debate whether they are actually being grounded or not. And so the approach that some people have taken at the moment is to basically turn everything including images and audio into text or something similar to text. So you take an image, you cut it into little squares, you turn those squares into vectors that's called tokenization. And now an image is just a sequence of tokens. The text is a sequence of words, right? And you do this with everything and you get those multiple systems and they do something. Okay, now clear. That's the right approach long term, but they do something. I think the ingredients that I'm missing there is the fact that I think if we're dealing with sort of continuous type data like video, we should use the joint embedding architecture, not the generative architectures that large language models currently use. First of all, I don't think we should tokenize them because a lot of it get lost in translation when we tokenizing edges and videos. And there's a problem also which is that those systems don't scale very well with the number of tokens you feed them with. 
So it works when you have a text and you need a context to predict the next word that is maybe the 4000 last words, it's fine. But a 4000 tokens for an image or video is tiny like you need way more than that and those systems scale horribly with the number of tokens you feed them. We're going to need to do a lot of new innovations in architectures there. And my guess is that we can't do it with generative models. So we'll have to do the joint embedding. How does a computer recognize an image without tokenization? So, commercial nets for example, don't tokenize. They take an image as pixels, they extract local features, they detect local motifs on different windows, on the image that overlap. And then those motifs get combined into other slightly less local motifs. And it's just kind of hierarchy where representations of larger and larger parts of the image are constructed as we go up in the layers. But there's no point where you cut the image into squares and you turn them into individual vectors. It's more sort of progressive. So there's been a bit of a back and forth competition between the transformer architectures that tend to rely on this tokenization and commercial nets which we don't or in different ways. And my guess is that ultimately what would be the best solution is a combination of the two where the first few layers are more like commercial nets. They exploit the structure of images and video certainly. And then by the time you get to up to several layers, they are the representation is more object based and there you have an advantage in using those those transformers. But currently basically the image transformers only have one layer of conclusions at the bottom. And I think it's a bit of a waste and it doesn't scale very well when you want to apply the video. On the timeline, this is all moving very fast. It's very fast. How long do you think before you'll be able to scale this new architecture? It's not just scale is actually coming up with a good recipe that works that would allow us to just plug a large neural net or the smaller on that on on YouTube and then learn how the work works by watching in a video. We don't have that recipe. We don't have probably don't have the architecture other than some vague idea, which I call hierarchical, but there's a lot of details to figure out that we haven't figured out this probably failure mode that we haven't yet encountered that we need to find solutions for. And so I can give you a recipe and I can tell you if welcome up with the recipe in the next six months year, two years, five years, ten years. It could be quick or it could be much more difficult than we think, but I think we're on the right path in searching for a solution in that direction. So once we come up with a good recipe, then it will open the door to new breed of AI systems, essentially that can they can plan, they can reason. And will be much more capable of having some level of common sense, perhaps, and have forms of intelligence that are more similar to what we observe being in animals and humans. Your work is inspired by the cognitive processes of the brain. Yeah. And that process of perception and then informing a world model, is that confirmed in neuroscience? It's a hypothesis that is based on some evidence from both neuroscience and cognitive science. 
So what I showed is a proposal for what's called a cognitive architecture, which is some sort of modular architectures that would be capable of the things like like planning and reasoning that we observe in capabilities that we observe in animals and humans. And that the current most current AI systems except for a few robotics systems don't have. It's important in that respect. But it's more of an inspiration really than a sort of direct copy interested in understanding the principles behind intelligence, but I would be perfectly happy to come up with some procedure that is that uses back proper level, but. At a higher level kind of does something different from the super resonating or something like that, which is why I work on self-supervisor. And so I'm not necessarily convinced that the path towards the satisfying the goal that was talking about of learning world models, etc. necessarily goes through finding biological and plausible learning procedures. What did you think of the forward forward algorithm and were you involved in that research? Well, although I've thought about things that are somewhat similar for many decades, but very few of which is actually published. It's in the direct line of a series of work that Jeff has been very passionate about for 40 years of new learning procedures of different types for basically local learning worlds that can train fairly complex neural nets to learn good representations. And things like that. So he started with the Boston machine, which was a really interesting concept that turned out to be somewhat in practical, but very interesting concept that a lot of people started. Backprop, which of course, he and I both had in developing something I worked on also simultaneously with backprop in the 1980s, called target prop, where it's an attempt at making backprop more local by computing a virtual target for. Every neuron in a large neural net that can be locally optimized. Unfortunately, the way to compute this target is normal calls. And I haven't worked on this particular type of procedure for a long time, but you should have been sure as we've used a few papers on this over the last 10 years or so. Yosha Jeff and I when we started the deep learning conspiracy in the early 2000 to renew the interest of the community and deep learning. We focused largely on forms of kind of local self supervised learning methods. So things like in just case that was focused on restricted Boston machines. Yosha settled on something called denosing auto encoders, which is the basis for a lot of the large language model type training that we're using today. I was focusing more on what's called sparsato encoders. So this is different ways of doing training a layer if you want in the neural net to learn something useful without being it without it being focused on any particular task. So you don't need label data. And a lot of that work has been put aside a little bit by the incredible success of just pure supervised learning with very deep model we found ways to train very large neural nets with with many layers with just back prop and so we put those techniques on the side and Jeff basically is coming back to them. And I'm coming back to them in different form a little bit with this so the JEPA architecture. And he also had ideas in the past, something called recirculation. A lot of informax methods, which actually the JEPA use this thing ideas are similar. He's a very productive source of ideas that are that sometimes seems out of the left field. 
And where the community pays attention and then doesn't quite figure it right away and then it takes a few years for those things to disseminate and sometimes they don't just a minute. Hello. Beauregard, I'm recording right now. Who? Rasmus? I'll answer when I get back. Yeah, you'll be famous someday. Okay, okay, great. Thanks very much. Yep. Bye-bye. Sorry about that. There was a very interesting talk by David Chalmers. At some level it was not a very serious talk because everyone knows as you described earlier that large language models are not reasoning. They don't have common sense. He doesn't claim that they do. No, that's right. But what you're describing with this JEPA architecture, if you could develop a large language model that is based on a world model. You'll be a large language model. You'll be a world model. At first it would not be based on language. You'll be based on visual perception, maybe audio perception. If you have a machine they can do what a cat does, you don't need language. Language can be put on top of this. To some extent language is easy, which is why we have those large language models. We don't have systems that run how they work. But let's say that you build this world model and you put language on top of it so that you can interrogate it, communicate with it. Does that take you a step toward what Chalmers was talking about? And I don't want to get into the theory of consciousness, but at least an AI model that would exhibit a lot of the features of consciousness. David actually has two different definitions for sentience and consciousness. You can have sentience without consciousness. Simple animal or sentience. In the sense that they have experience, emotions, and drives and things like that. But they may have the type of consciousness that we think we have. At least the illusion of consciousness. So sentience I think can be achieved by the type of architecture I propose if we can make them work, which is a big if. And the reason I think that is is that. What those systems would be able to do is have objectives that you need to satisfy. Think of them as drives. And having the system. Compute those drives, which would be basically predictions of. Of the outcome of a situation or a sequence of actions that the agent might take. Basically, those would be indistinguishable from emotions. So if you have your new situation where you can take a sequence of actions to arrive at a result. And the outcomes that you're predicting. It's terrible results in your destruction. Okay, that creates fear. You try to figure out that is another sequence of action I take that would not. Result in the same outcome. If you make those predictions with these are huge uncertainty in the prediction. One of which. With probability half maybe. Is that you get destroyed. It creates even more fear. And then on the contrary, if the outcome is going to be good, then it's more like elation. So those are long term prediction of outcomes, which. Systems that use the architecture and proposing I think will have so they will have. Some level of experience and they will have emotions that will drive the behavior. Because they would be able to anticipate outcomes. And perhaps act on them. Now consciousness is different story. So my full theory of consciousness, which I've talked to David about. Thinking it was going to tell me I'm crazy. But he said no, actually that overlaps with some pretty common. The theories of consciousness among philosophers is. 
Is the idea that we have essentially a single world model in our head. Somewhere in a prefrontal cortex. And that world model is configurable to. The situation we're facing at the moment. So we're configuring our brain. Including our world model for solving the problem that you know satisfying the objective that we currently set to ourselves. And because we only have a civil world model engine. We can only solve one such task at any one time. This is a characteristic of humans and. Many animals, which is that we focus on the task. We can't do anything else. And we can do subconscious tasks simultaneously. But we can only do one conscious deliberate task at any one time. And it's because we have a single world model engine. Now, why would evolution build us in a way that we have a single world model engine? There's two reasons for this. One reason is. That single world model engine can be. Configured for the situation at hand. But only the part that changes from one situation to another. And so it can share knowledge between different situations. The physics of the world doesn't change. If you are building a table or trying to jump over a river or something. And so you are sort of. Basic knowledge about how the world works doesn't need to be reconfigured. It's only the thing that depends on the situation at hand. So that's one reason. And the second reason is that. If we had multiple models of the world, they would have to be individually less powerful because. You have to all fit them within your brain and that's an immediate size. So I think that's probably the reason why we only have one. And so if you have only one world model that needs to be configured for the situation at hand, you need some sort of meta module that configures it. Figures out like what situation am I in? What sub goals should I set myself and how should I configure the rest of the. My brain to solve that problem. And that module would have to be able to observe. The state and capabilities would have to have a model of the rest of itself. It's an of the agent. And that perhaps is something that gives us the illusion of consciousness. So I must say this is very speculative. Okay, I'm not saying this is exactly what happens, but it. Fits with a few things that we know about. About consciousness. You were saying that this. Architecture is inspired by cognitive science or neuroscience. How much do you think your work, Jeff's work, other people's work. At the kind of the leading edge of deep learning or machine learning research is informing neuroscience. Or is it more of the other way around? Certainly in the beginning, it was the other way around. But at this point, it seems that there's a lot of information that then is reflecting back to the fields. So it's been a bit of a feedback loop. So new concepts in machine learning have driven people in neuroscience and curiosity science to. Use computational models if you want for whether we're studying. And many of my colleagues and my favorite colleagues work on this. The whole field of computational neuroscience basically is around this. And what we're seeing today is a big influence. Or rather a wide use of deep learning models such as conventional nets and transformers. As models. Explanatory model of what goes on in the visual cortex, for example. So the people, you know, for a number of years now who have. 
Don FMRI experiments and then show the same image to a subject in the FMRI machine and to a conventional net and then try to explain the variance they observe in the activity of various areas of the brain. With the activity that is observed in corresponding neural net. And what comes out of the studies is that. The notion of multilayer hierarchy that we have. Commercial nets. Matches the type of hierarchy that we observe in the at least in the ventral pathway of the visual system. So V1 corresponds to the first few layers of the conventional net and in V2 to some of the following layers and V4. More and then the E4 temporal cortex to the top layers are the best explanation of each other if you try to do the matching right. One of my colleagues at Fair Paris. There's a dual affiliation also with. Norsepin that academic lab in Paris has done the same type of experiment using transformer architectures and I wish models essentially. And observing. When activity of people who are listening to stories and attempting to understand the story. So that they can answer questions about the story. Or or give it. A summary of it. And there the matching is not that great in sense that there is some sort of correspondence between the type of activity you observe in those large transformers. And the type of activity is in the brain but the hierarchy is not nearly as clear. And it's what is clear is that the brain is a capable of making much longer term prediction that those language models are capable of today. So that begs the question of what are we missing in terms of architecture and to some extent it's jibes with the idea that. The models that we should have should build hierarchical. Representations of the preset that different levels of abstraction so that the highest level of abstraction. Are able to make long term predictions that perhaps are less accurate than the lower level but longer term. We don't need to have that in current models. I had a question I wanted to ask you since our last conversation you have a lot of things going on. You teach you have your role at Facebook. Your role I think at CVPR or how do you work on this? Have like three days a week or two hours a day where you're just focused. Are you a tinkering with code or with diagrams or is it in iterations with some of your graduates who the. Or is this something where it's kind of always in your mind and you're in the shower and you think yeah that might work. I'm just curious how do you love all of it? Okay so first of all once you understand is that my position at meta at fair is not a position of management. I don't manage anything. I'm chief scientist which means I try to inspire others to work on things that I think are promising. And I advise several projects that I'm not personally involved in. I work on strategy and orientations and things like this but I don't do that to the management. I'm very thankful that you know is doing this for fair and doing very very good job. I'm not very good at it either so it's for you better if I don't if I don't do it. So that allows me to spend quite a bit of time on research itself. And I don't have a group of engineers and scientists working with me. I have a group of more junior people working with me students and postdocs. Both at fair and at NYU. Both in New York and in Paris. And working with students and postdocs is wonderful because they are sure less they're creative. 
Many of them have amazing talents in theoretical abilities or implementation abilities or an academic things work. And so what happens very often is either one of them will come up with an idea that whose results surprise me and I was thinking that is wrong. And that's the best thing that can happen. Or sometimes I come up with an idea and turns out to work which is great. Usually not in the form that I formatted it normally it's there's a lot of contributions that have to be brought to an idea for to make it work. And then what's happened also quite a bit in the last few years is I come up with an idea that I'm sure it's going to work. And she students and postdoc try to make it work and they come back to me and said, oh sorry it doesn't work and here is a fair move. Oh yeah, we should have thought about this. Okay, so here's a new idea to get around this problem. So for example several years ago I was advocating for the use of generative models with latent variables to handle the uncertainty. And I completely changed my mind about this now advocating for those joint evading architecture that do not actually predict. I was more or less invented those contrasting methods that a lot of people are talking about and using at this point and I'm advocating against them now in favor of those methods such as V Craig or about the twins that basically instead of using contrasting methods can try to maximize the information content of representations and that idea of information maximization. And I know about for decades because Jeff was working on this in the 1980s when I was opposed to her with him. And he abandoned the idea pretty much he had a couple papers with one of his students who back her in the early 90s that show that he could work but only in sort of small dimension and it pretty much abandoned it. And the reason he abandoned it is because of a major flaw with those methods. Due to the fact that we don't have any good measures of information content or the measures that we had are up about not lower bound so we can try to maximize information content very well. And so I never thought about those that those methods could ever work because of my experience with with that. And why don't we post out stiff and the actually kind of revise the idea and show that it worked that was about a twins paper. So we changed our mind. And so now that we had a new tool information about maximization applied to the joint embedding architectures and came up with an improvement of it called V Craig. And and now we're working on that. But there are other ideas we're working on to solve the same problem with other groups of people at the moment, which probably will come up in the next few months. So we don't again we don't have a perfect recipe yet. And we're looking for one and hopefully one of the things that we are working on with stick. Yeah. Are you coding models and then training them and running them or are you conceptualizing and turning it over to someone else. So it's mostly conceptualizing and mostly letting the students and postdocs doing the implementation, although I do a little bit of coding myself, but not enough to my taste. I wish I could do more. I have a lot of postdocs and students and so I have to devote sufficient amount of my time to interact with them. Sure. And then leave them some breathing room to do the work that they do best. And so it's interesting question because that question was asked to Jeff to start right. Yeah. 
And he said he was using matlab and he said you have to do this those things yourself because it's something doesn't. If you give a project to a student and a project come back saying it doesn't work, you don't know if it's because there is a conceptual problem with the idea or whether it's just some stupid detail that wasn't done right. And when I'm facing with this, that's when I start looking at the code and perhaps experimenting with it myself. Or I get multiple students to work on them to collaborate on the project so that if one makes an error, perhaps the other one will detect what it is. I love coding. I just don't do as much as I like it. Yeah. This JAPA or the forward forward things have moved so quickly. You think back to when the transformers were introduced or at least the attention mechanism and that kind of shifted the field. It's difficult for an outsider to judge when I hear the JAPA talk. Is this one of those moments that wow this idea is going to transform the field or have you been through many of these moments and they contribute to some extent but they're not the answer to ship the paradigm. It's hard to tell at first but whenever I kind of keep pursuing an idea and promote it, it's because I have a good hunch that they're going to have a relatively big impact. And it was easy for me to do before I was as famous as I am now because I wasn't listened to that much. So I could make some claim and now I have to be careful what I claim because a lot of people listen to me. Yeah. And it's the same issue with JAPA. So JAPA, for example, a few years ago, was promoting this idea of capsules. Yeah. And everybody was thinking this is going to be like a big thing and a lot of people started working on it. It turns out it's very hard to make it work and it didn't have the impact that many people started would have, including JAPA. And it turned out to be limited by implementation issues and stuff like that. The underlying idea behind it is good but like very often the practical side of it kills it. There was the case also with Wilson machines. They are conceptually super interesting. They just don't work that well. They don't scale very well. They're very slow to train because actually it's a very interesting idea that everybody should know about. So there's a lot of those ideas that allow us, there are some mental objects that allow us to think differently about what we do. But they may not actually have that much practical impact. For forward, we don't know yet. It could be like the weak sleep algorithm that Jeff talked about 20 years ago or something. Or it could be the new back prop. We don't know. Or the new target prop, which is interesting but not really mainstream. Because it has some advantages in some situations, but it's not. It brings you like an improved performance on some standard benchmark that people are interested in. So it doesn't have the right of deal perhaps. So it's hard to figure out. But what I can tell you is that if we figure out how to train one of those. JAPA start architecture from video. And the representations that it learns are good. And the predictive model that he learns are good. This is going to open the door to a new breed of AI systems. You have no, no doubt about that. It's exciting the speed at which things have been moving in particular in the last three years. About, about transformers and the history of transformers. Once you only say about this is that. 
We see the most visible progress, but we don't realize by how much of a history there was behind it. And even the people who actually came up with some of those ideas don't realize that. They are ideas actually had roots in other things. For example, back in the 90s, people were already working on things that we could now call mixer of experts. And also multiplicative interactions, which at the time were called the semi-py networks or things like that. So it's the idea that instead of having two variables that you add together with weights, you multiply them. And then you have a way for you have weights before you multiply. It doesn't matter. This idea goes back every long time since the 1980s. And then you had ideas of linearly combining multiple inputs with weights that are between 0 and 1 and sum to 1 and are dependent. So now we call this attention, but this is a circuit that was used in mixer mixer of expert models back in the early 90s also. So the idea is old. Then there were ideas of neural networks that have a separate module for computation and memory that's two separate modules. So one module that is a classical neural net. And the output of that module would be an address into an associative memory that itself would be a different type of neural net. And those different types of neural net associative memories use what we now call attention. So they compute the similarity or the product between a query vector and a bunch of key vectors. And then they normalize and so this onto one and then the output of the memory is weighted some of the value value vectors. There was a series of papers by my colleagues in the early days of fair actually in 2014, 15 one called memory network, one called end to end memory network, one called the stack of maintain memory network and other one called key value memory network and then a whole bunch of things. So those use those associative memories that basically are the basic modules that are used inside the transformers and then attention mechanism like this were popularized in around 2015 by a paper from the usual bench was good at Miller and demonstrated that they are extremely powerful for doing things like translation language translation in NLP. And that really started the craze on attention. And so you come on all those ideas and you get a transformer that uses something called self attention where the input tokens are used both as queries and keys in a associative memory very much like a memory network. And then you use this as a layer if you want you put several of those in a layer and then you stack those layers and that's what the transformer is. And then attention is not obvious but there is one those ideas have been around and people have been talking about it and the similar work also around 2015, 16 and from deep mind called the neural turning machine or differentiable neural computer those ideas that you have a separate module for competition and other one from memory. And then you have a separate or higher and group also on neural nets that have separate memory associative memory type system. They are the same type of things. I think this idea is very powerful. The big advantage of transformers is that the same way commercial nets are equivalent to shift so you shift the input of a commercial net. The output also shifts but otherwise doesn't change. The transformer if you permute the input tokens the output tokens get permuted the same way but are otherwise unchanged so. Comments are equivalent to shifts. 
Transformers are equivalent to permutation and with a combination of the two is great. She's why I think the combination of cognets at a low level and transformer at the top I think for natural input data like image and video is a very combination. The combinatorial effect as the field progresses all of these ideas create a cascade of new ideas. Is that why the field is speeding up? It's not the only reason the there's a number of reasons the. So one of the reasons is that you build on each other's ideas and etc which of course is the whole mark of science in general also art. But there is a number of characteristics I think that. Help that to a large extent the one in particular is the fact that. Most research work in this area now comes with code that other people can use and build upon right so. The habit of distributing your code in a source I think is a is an enormous. Contributor to the acceleration of progress the other one is the availability of the most sophisticated tools like pet or for example or TensorFlow or jacks or things like that where which where researchers can build on top of each other's code base basically to. Come up with really complex concepts. And all of this is committed by the fact that some of the main contributors that are from industry to those ideas don't seem to be too. Obsessive compulsive about IP protection. So meta and in particular is very open we may occasionally fight patterns but we're not going to see you for infringing them unless you sue us. Google as a similar policy. You don't see this much from companies that tend to be a little more secretive about their research like Apple and Amazon but although I just talked to Sam in Benio he's trying to implement that openness. More power to him good luck it's a culture change for a company like Apple so this is not a battle I want to fight but if you can win it like good for him. Yeah. It's difficult difficult battle also I think another contributor is that there are real practical commercial applications of all of this they're not just imagine they are real. And so that creates a market and that increases the size of the community and so that creates more appeal for new ideas right more more. Outlets if you want for new ideas do you think that this. Hockey stick curve is going to continue for a while or do you think will hit a plateau then. Is it difficult to say nothing works more like a next next financial that the beginning of a sigmoid so every natural process has to saturate at some point. The question is when and I don't see any obvious wall that is being hit by a research at the moment it's quite the opposite seems to be an acceleration in fact of progress. And there's no question that we need the new concepts and new ideas in fact that's the purpose of my research at the moment because I think there are limitations to current approaches. This is not to say that we just need to. Scale up deep learning and turn the crank and we'll get to human level intelligence I don't believe that. I don't believe that it's just a matter of making reinforcement learning more efficient I don't think that's possible with the current way reinforcement learning is formulated. And we're not going to get there with supervised learning either. I think we definitely need. New innovative concepts but I don't see any slow down yet. I don't see any people turning away from me I think it's obviously not going to work but despite there is. Screams of various critiques right sure about that but. But. 
They to some extent at the moment are fighting a real guard battle yeah because they plan to flag the city. You're never going to be able to do this and then. So you can do this or they plan to flag a little further down and now you're not going to be able to do this so it's a tiny yeah okay my last question are you still doing music. I am and are you still building instruments are. Electronic wind instruments yes I'm. The process of designing a new one well. Yeah okay maybe I think I said this last time maybe I could get some recordings and put them into the podcast or something. Right I probably told you nuts are such a great performer and. I'm probably better at conceptualizing and building those instruments and playing them but but yeah it's possible. That's it for this episode I want to thank you and for his time if you want to read a transcript of today's conversation you can find one on our website. I on AI that's EY E hyphen O N dot AI. Feel free to drop us a line with comments or suggestions at Craig at I on AI that's C R A I G. At EY E hyphen O N dot AI. And remember the singularity may not be near but AI is about to change your world so pay attention. you

Transcription for ./1_1.mp4:
 Hello, it's Yannick from the future. AI is moving crazy fast right now, like crazy. So the news of this week is like old news, but I'm still going to show to you. Google I.O. just recently happened. The gist of it is they're going to stick generative AI into just about everything. And also, Anthropic releases upgrades the Claude API to have a hundred thousand tokens context. No one knows so far how they're doing it, but it's happening. A hundred thousand tokens context insane. All right, enjoy the news. Yay. Jeff Hinton leaves Google and tells the world about the dangers of AI. Open AI loses half a billion dollars and Google has no mouth. Welcome to ML News. Hello, everyone. Welcome to ML News. Let's dive right in lots of stuff happening. This is Snapchat AI. Do you have access to my location? No, I don't have access to your location. Where am I? I'm sorry, but I don't have access to your location information. Okay, are you lying? No, I'm not lying. I don't have access to your location information. Where's the closest McDonald's? Yeah, there's a McDonald's where he goes to. He's located right on Young Street in Tonowanda. Make of that as you will. Jeff Hinton leaves Google and warns of a danger ahead. This is a story from the New York Times. Jeff Hinton obviously pioneered in the field of deep learning, very, very early contributor of many of the currently still used techniques. Also, one of the originators of Alex Net has left his long-held job at Google and is now essentially saying that these technologies are dangerous and we should pay attention or stop or just be very careful of what we do. The article says, a part of him, he said, now regrets his life's work. I can soul myself with the normal excuse if I hadn't done it. Somebody else would have Dr. Hinton said during a lengthy interview last week in the dining room of his home in Toronto. It's hard to see how you can prevent the bad actors from using it for bad things. Dr. Hinton says, he says, look at how it was five years ago and how it is now. He said of AI technology, take the difference and propagate it forwards. That's scary. Until last year, he said, Google acted as a proper steward for the technology, careful not to release something that might cause harm. But now that Microsoft has augmented BingSirChangin with a chat about challenging Google's core business, Google is racing to deploy the same kind of technology. The tech giants are locked in a competition that might be impossible to stop, Dr. Hinton said. His immediate concern is that the internet will be flooded with false photos, videos and text, and the average person will not be able to know what is true anymore. He also worried that AI technologies will in time append the job market today, chat bots like chatGPT tend to compliment human workers, but they could replace paralegals, personal assistants, translators, and others who handle road tasks. He takes away the drug war, he said. It might take away more than that. Down the road he is worried that future versions of the technology pose a threat to humanity, because they often learn unexpected behavior from the vast amounts of data they analyze. This becomes an issue he said as individuals and companies allow AI systems not only to generate their own computer code, but actually run that code on their own. And he fears a day when truly autonomous weapons, those killer robots become reality. The idea that this stuff could actually get smarter than people, a few people believe that, he said. 
But most people thought it was way off, and I thought it was 30 to 50 years or even longer away. Obviously, I no longer think that. Okay, there's obviously a lot being said right here, and Jeff Henton is certainly a credible and notable voice to listen to when it comes to these things. But a lot of people also disagree with him, especially as he sounds more and more like a fomer, for example, saying, we're all in the same boat with respect to the existential threat, so we all ought to be able to cooperate on trying to stop it and more. John Lacon on the other hand says, AI hype is ridiculous in all directions, as in LLM have superhuman intelligence, are useless parades, hallucinations will destroy society, scaling is all you need, deploring has hit a wall, AI doesn't exist and never will, or AI is going to kill us all. I think among the various opinions, you can probably find some common ground, but I also tend to be more on the side of Lacon here than of Henton. I don't think this is that much of an existential threat by itself. Certainly my biggest fear of this technology is what happens when it is concentrated in just a small amount of people, like large companies and governments, and what then happens if people with not so good intentions come to power in these places. I think that's why they push to do open source and to really democratize this technology is so important, that exactly that doesn't happen. The fact that the internet's going to be flooded with texts that you don't know is true or not, or photos or videos, I mean that's already the situation. Who cares if you can generate like 10,000 fake news articles? The problem is distribution, the problem isn't generation. I can generate something fake text right now. Whatever, let's go. Okay, pineapple, I meant to write ananas. You know the amount of time it took me to find out that ananas, which is the German word for pineapple, isn't an English word because it sounds so English. Pineapple does not belong on pizza, but this is definitely misinformation. I'm sorry if you agree with this, there is no you may you may be an AI. Okay, I have now generated mission for motion, and I did not need a language model to do it. So, you know, and yes, some people may lose their jobs and a lot of people's jobs are going to be transformed, but it's not going to cause mass unemployment. It's just like the chariot driver that had now to do something else. Some people will have to do something else, and that's okay. But of course, who wants to hear from Jeff Hinton or Jan LeCount when we can actually listen to the true expert on the matter? Obviously, Snoop Dog has an opinion on this. Listen. Like, man, this thing can hold a real conversation. Like, for real, like, it's it's blown my mind because I watch movies on this as a kid years ago when I see this shit. And I'm like, what is going on? Then I heard the dude that the old dude that created AI somewhat, this is not safe because the AI's got their own minds. And these motherfuckers going to start doing their own shit. I'm like, it's we're in a fucking movie right now. What the fuck, man? So I do I need to invest in the AI so I can have one with me. Like, do y'all know shit? What the fuck? Yeah, actually pretty based opinion there. I have to say respect. All right, next topic a bit related to it, but there has been a memo leaked. A Google internal memo that is titled, we have no mode and neither does open AI. The memo details and the website here claims to have verified its origin. 
So I'm just going to believe that for now. The memo details essentially the rise of open source models, especially models like Lama and just how prolific the community becomes when they get access to an open source model like this. For example, low, Laura like low rank adapters being super useful, making it very cheap to fine tune these big models into something useful. And the memo argues that open source development will be able to catch up in many ways with the big companies. And therefore a mode if you don't know a mode is like is in startup world a mode is a position that is defendable against incursions against your competition. So if you have a mode, it means that a competitor can't easily sort of reach you. And the memo argues that Google has no mode and neither does open AI. And it goes into a little bit of stuff we could have seen it coming what we missed and so on saying retraining models from scratch is the hard part but once a big model is out like Lama, then it can be worked with really easily with for example, Laura updates are very cheap to produce at around a hundred dollars a piece. Also saying data quality scales better than data size, which is obviously a great to hear given we do projects like open assistance. That's absolutely fantastic. Directly competing with open source is a losing proposition and also commenting a bit about the fact that individuals are not constrained by licenses to the same degree as corporations, which is true. They say this will inevitably change as truly open models get better, not like the Lama models as you may know have this stupid non-compete license and many of the other models like models coming out of hugging phase have these even stupider actually less stupid open rail license but still stupid. We are waiting for models for people who actually make things open source and at that point I'm very convinced the community will do great things with it and a lot of businesses can be built on open source models as they are built right now in open source software. So there is a call in this memo to let open source work for us which has been a given take in the tech industry that large companies support open source development but and also obviously profit from the results of it and the memo calls a little bit into the direction of that saying owning the ecosystem might be a big part of what makes the profit maximal for a company and Google has been doing that with things like Android but also with things like a TensorFlow and stuff like that. So what do we make of a leaked Google memo that essentially admits they're gonna lose out open source and so does open AI? I think it's important to say that it's not official communication right? Anyone at a company can write a memo and then sort of circulate it that's just common practice in these companies it's the employees freedom to express their opinion and to gather insights from around the company it must not mean that this is the official Google position or this is even true right? Read it and estimate yourself how good the arguments of this are but you can rest assured them I'm very sure this is internally not everyone agrees with this this may be debated it may be just a person writing down sort of purposefully let's say extreme position to sort of see what happens to what what can we make if we sort of make this argument what counter arguments are there and so on. 
Anyone can write a memo it can be circulated people can give their opinion so well this can absolutely be a true Google memo all it means is that at least one person in the company has written this but what's more beautiful is the memes oh my god the memes stop moting can you just stop saying motemotes is this moat had years to monetize LLMs no moat moat it's over Anakin I have the 65k context you underestimate my moat anyway I hope you've all found your moats because the open AI may have no moat but they have a sharply decreasing bank account losing over $550 million over half a billion dollars as it developed chat GPT that's what the information writes saying open the eyes losses double to around $550 million US dollars last year as it developed chat GPT and hired key employees from Google according to three people with knowledge of the startups financials so pretty crazy I mean you would have guessed that like one or two of these millions would go into getting a motor to but they apparently blew it all on chat GPT and and Google employees but we didn't have to wait long for Google's reaction to chat GPT as it now changed its AI strategy Google has been long one of the most prolific publishers of academic papers if you go to any machine learning conference like nirips or icml google will always be at the top of the organizations who publish the most papers at these conferences and that was even before they merged with deep mind oh yeah google brain merged with deep mind that's a piece of news that I haven't even in here that happened but even before that google was already super prolific and so was deep mind and together they would be an absolute juggernaut of publishing papers at conferences however google has now changed its tune so as open AI became more closed focusing more and more on developing product and their API and releasing that joke of a paper slash technical report on GPT4 is becoming more and more clear that Jeff Hinton was certainly right in one regard namely the big tech giants are locked in into war mode so google here changed its strategy the article here in the washington post says the launch of open AI's groundbreaking chat GPT three months earlier had changed things the san francisco startup kept up with google by reading the team's scientific papers being said in the quarterly meeting for the company's research division indeed transformers a foundational part of the latest AI tech and the tea in chat GPT originated in a google study I'll first go to the conclusion the conclusion is google researchers now first have to get their stuff into products and then maybe they can publish if they get approval for it whereas before they could just they could publish they were encouraged to publish and then later they would see whether and how that might go into a product so google now more closed up and more product focused however saying that like open AI red transformers paper and that's why that's why I'm not sure I'm really not that that's a bit far that's a tiny bit far fetched there definitely the case that if you make everything open it's easier to reproduce what you've done also on the other hand um no I mean the interesting thing is how this is actually going to affect the world of researchers google and the other companies have been publishing so much I believe as a strategy to hire a lot of these people because a lot of researchers they want to they get out of university and they have the choice to want to go academic path to want to go industry path and if you 
promise them hey with us you can come and you can do research and you can even publish it right this is very attractive for researchers to go there on top of that they get like a giant salary and free food but they do also get the publish papers and a lot of them want that first and foremost because they believe in research and second also because it attaches their own name to something out there so rather than it being in a product somewhere where their name might be listed not at all they'll be authors on papers and that will increase their chances of a future stuff that's going to be interesting to see what these people do when that's no longer on the table when it's pretty clear once you go into the big companies you will not get to publish or at least for not for a long time how's that going to affect their hiring and firing at the moment it's firing time anyway so maybe that goes in concordance at the moment they don't want more people and therefore this is okay maybe once they want more people again they'll open up the publishing guidelines again although it's not that easy and the effects are probably longer term I don't know let me know what you think how that's going to affect the general landscape the fight between the big companies is shaping it's looking to be really interesting speaking of open AI and Google and competitiveness Lucas Byer has shared a pretty remarkable clip of Elias Satsukiver of Open AI leadership commenting on why do we keep things closed so I'm going to play the clip you know my view is that the current level of capability is still not that high where it will be the safety consideration it will drive the closed closed source in the model of this kind of this kind of research so in other words I claim that it goes in phases right now it is indeed the competitive phase so essentially saying hey yeah we keep the stuff closed but right now it's not because of safety considerations because the capabilities are not so strong right now that you would need to do that due to safety considerations by the way interesting to see that this agreement with Hinton here but instead right now it's because of the competitive landscape yes I mean that's what everyone knew that's unambiguously confirming what we all knew but just wanted to hear admitted open AI has long claimed that they keep things closed because of safety considerations and whatnot and it was always extremely shady so it's nice to somewhere here now that that was all crap and they knew it was crap and they simply said it so that they have a fine excuse to keep things for themselves until now when it's now okay to be competitive and to keep things closed in order to be competitive so think of that going forward open AI will just say whatever they need to in order to stay competitive I mean not that the other companies probably wouldn't do that but it's still quite remarkable because they were the first one to keep models closed due to safety considerations some like developers of the early yolo iterations refused to work on more models due to safety considerations but open AI were the first prominent ones to say oh now we'll just keep these for ourselves because you know you're they're too dangerous for you plans AI generated images and text cannot be copyrighted according to us copyright office this slide from a talk at UC Berkeley by Pamela Samuelson and the reason why they can't be copyrighted that's the policy statement right here is because they lack human authorship which is entrenched in us copyright 
law a human has to do something creative for copyright to apply this is the case in many countries around the world and therefore the direct application of copyright to AI generated works is not given because they lack human authorship what's also interesting when people apply to register works that incorporate AI generated text images or other content they must identify parts that are AI generated and disclaim authorship of those parts it's pretty interesting as gonna get into a lot of gray areas where it's like well what if I have refined and isn't my selection process also part of the creative process and yada yada yada so all of these questions are as of yet unclear but it is good to hear this confirmed copyright needs human authorship which also means what what I've said for a long time is that models very probably are also not subject to copyright because they've been generated by an algorithm like an optimization and therefore yeah the only way to enforce any sort of license on an AI model is through an active contract where you actively make people sign stuff before they get access to the model rather than just shipping it with like a gpl license or so and then relying on the automatic application of copyright also other news and intellectual property there is a trademark office trademark application with this number that tries to trademark the mark gpt the owner is open AI so open AI is trying to trademark gpt now I don't know enough about trademarks and the trademark registration process to tell you what any of this even means right if they they're trying to trademark the word gpt they have updated their brand guidelines and they are going after people who use gpt as part of their thing whatever the thing is so they certainly act as if they have a trademark to that but also here on the bottom says therefore your request is here by dismiss I don't know I don't know what it means I'll just tell you that it exists okay next news star coder is a model that comes out of the big code project that is led by homing phase but is an open community project to train a 15 billion parameter large language model with 8000 tokens context on source code in over 80 programming languages and model and data are available so this is pretty cool and lots of congratulations and respect for all the people having taken part in this I do have a small curl about this as you may know here it says open source and it's distinctively not open source you know the good days of open source when you need to agree to share your contact information to access this model oh yeah all the open source projects that also where you have to accept the conditions of the license to access its files and contents absolutely open source like every other open source project nothing to see here because this is not licensed as an open source it's licensed via the open rail license which is the so-called responsible AI license rant over red pajama is a project to collect llama style data set and then train on it they have just released a three billion and seven billion models they are even instruction tune chat models so very cool definitely follow the red pajama project it's an absolutely amazing project and the models are open source I think let's see yeah look at that license a potchy how hard is that how hard is it is the world going down because this exists no it's only gonna get better another project that builds on the red pajama data set is open llama which is also an open reproduction of llama and that loss just looks I 
mean there's no sharp drop so aji hasn't been reached yet but so far the metrics look really good and they are reportedly better than equally sized model like the seven b model is better than a seven b pythea model because it's been trained on more data and that's exactly the effect we're looking for in llama style training so very excited to see what comes out of these efforts and obviously every single person outside of open AI is gonna profit that probably even open AI employees are gonna profit heavily from open source models being fully open source and fully available to the public that being said mosaic releases mp t7b a new standard for open source commercially usable llm's this is a good step into that direction mosaic focuses on rapid training rapid fine tuning very efficient training of models and they have used their own knowledge and tools in order to produce these models the models are seven billion parameter models which would have been huge a few years ago but it's kind of small right now but still they're trained for a long time and most notably some of them have a 65 000 token context length now that is certainly something very cool very cool we've demonstrated generations as long as 48 000 tokens on a single node of a 100 GPU is absolutely crazy and again license a pochi and the world is still here yolo nas is a neural architecture search over yolo networks yolo you only look once is an object detector and yolo nas is a project that uses architecture search in order to determine the best and fastest models this picture doesn't do the model justice the model is extremely good so absolutely cool weights are available under a non commercial license for now yeah try it out mojo is a new programming language for all AI developers at least the company modular claims so this comes from very respectable sources notably one of the creators is also the creator of the lvm toolchain which powers most compilers for example of c++ and other languages so what is mojo mojo is a superset of python so you can run all python code in mojo but if you add your types always it allows it to compile it faster not only compile it down to binary code but also do so for various AI accelerators so it's kind of like cython meets kuda meets xla or something like this safe to say that this has the ability to not only make your python code a lot faster but also make transferring stuff from different accelerators probably a lot more easy and also you can end filenames in an emoji so that that's a mojo file the company says the language is in very early development and it's not open sourced yet but it will be open sourced in the future but it not being open sourced for now keeps many people currently from trying it out or from switching over to it we'll see what happens definitely very cool project to look out for acuprompt is a prompt hacking competition there are various stages here this is made by various organizations including learn prompting.org which is a website that kind of teaches you prompting and it's not a course you don't you don't have to pay money for it this is a competition with a sizable chunk in prize money so if you want to have fun prompting it's a weird world it's a weird world where this is an actual competition yeah there's cash prizes there's extra prizes and so on could be fun media releases neemogorg rails which is a system that keeps check on a large language model so in neemogorg rails you can define different things different conversation flows and so on and then propose what 
they call guardrails for for topics for safety considerations and for security so for example if you don't want your friendly company chatbot to all of a sudden start talking about I don't know illegal substances or insult the customer or anything like this at topical guardrails could be interesting for you the tools available open source and as far as I understand it works with any large language model in the background whichever one you want to do the way it works is that there is an engine converting the input into a canonical form in the canonical form you can define your guardrails like what you want to happen if certain things happen that's very much kind of a programmatic form then you have flow execution which is maybe deny or maybe rephrase or do anything that you want I guess and in the end you generate the output from that so there's GitHub repo check it out LMQL is a programming language for language model interaction this is QL is should give you a hint that it is similar to a query language like SQL or graph QL or I don't know any other QLs but LMQL language model query language that lets you express things that you would like to know from a language model for example here is the tell a Joe prompt or input query query it's called the query so you input your prompt but then you can define these variables this is a whole variable this is where you would like the language model to put something right then here this is followed by a variable called the punchline so these are variables that you define so this would be your prompt you say which model and you can specify some wear clauses for example I want the joke to be smaller than 120 tokens or characters like some stopping criterion and so on so LMQL will take all of this and interact with the language model for you in this case for example make the language model fill these whole variables right here and you can see the output of the model is this and an LMQL will be able to read these variables here out of the response another one is here for example sentiment classification so here is a review we had a great stay hiking in the mountains was fabulous yary yary yary question is the underlying sentiment of this review what is the underlying sentiment of this review and why and then there is a whole variable called analysis and then it says based on this the overall sentiment of the message can be considered to be and another whole variable and here in the distribution clause you can say actually this classification whole variable it can only be one of these things right here so you can strain the model at that particular point LMQL will then go and ask the model make sure that this here is in fact one of the tokens where that you have specified right here or one of the sequences all in all this saves you a lot of grunt work from sort of having to query the model at various points look at the logids do something with the logids stop after a certain point for sit to do something and so on so this is very cool and it can be combined with other tools such as lang chain or or other things that you may know I don't know I just know lang chain and this AI makes pandas data frames conversational it adds generative artificial intelligence capabilities to pandas what you can do with this is something like this you have a data frame right here countries gdp's happiness and you can ask something like which are the five happiest countries and it'll give you an output you can also make plots and stuff with that so in the background this also 
does the pandas operations for you and gives you the results this is is potentially pretty pretty cool if this is pushed a bit further maybe with some tooling assistance and so on I'm not sure how the tools of the future are gonna look like but I definitely see something like this being extremely useful and making data analysis more accessible to people who also don't know programming laminize company and also an llm engine for rapidly customizing models so lamina gives you open source tools to rapidly customize a model like do fine tuning do rlhf and so on and they also on top of that offer a service where they manage all of that for you pretty cool combination we see more and more startups operate in this give you something open source and then offer service on top way yes very cool benefits a lot of people deep void is a group stability and they have released a model called i f that is in many ways really really good text to image model especially it handles for example text very well it looks very good and that's because the model it operates in pixel space not in hidden token space so things like stable diffusion they operate in this latent token space so you have like some vqa encoder and then you have the latent tokens and that's where the diffusion process runs whereas with i f the diffusion process runs directly on pixels so the image is generated in 64 by 64 and then has two sequences of upsampling to make it actually look bearable and not only bearable but it looks really good after that those two upsampling steps it's also cool that we're still seeing different approaches to diffusion models something latent space something pixels and so on yeah you can check this out on having face you can try it and you can download it also this as far as i understand non commercial for now but they do claim it's going to be fully commercially like permissively licensed in the future for now i only believe it once i see it but we'll like to believe them the frama foundation has released shimmy which is an api compatibility tool for converting popular external rl environments to the gymnasium and petting zoos apis this is really important especially for reinforcement learning where the details of the environment can be quite overwhelming and standard environments such as gymnasium formally open a i gym they're quite nice to work with because it decouples the development of the reinforcement learning algorithm with the any intricacies of the environment so it's very cool that the frama foundation spends effort into making things even more compatible into bringing external environments into the standard environments or making them compatible by the shimmy library go here releases a blog post called the embedding archives millions of Wikipedia article embeddings in many languages releasing a subset of Wikipedia embedded using their embedding models yeah you can now just download these embeddings which is really cool Wikipedia is a big corpus of very high quality this can serve as the basis for a lot of applications researchers at meta and other places release a cookbook on self-supervised learning with learnings that they have on self-supervised learning obviously people at meta have been among the ones pushing most into getting ever better techniques for self-supervised learning and it's very cool to see that they're now compiling this and sharing what they've learned in a condensed form for you to consume at once very cool h2o gpt aims to be the best open source gpt it's led by h2o ai these are 
models you can try them they have 20 billion parameter models 12 billion parameter models and even 30 billion parameter models they also have models that are already fine tuned on for example open assistant data and also those you can just try out on hoggfeast on top of that they release llm studio which is a framework for no code fine tuning state of the art large language models very cool meta releases a giant data set of annotated drawings so these drawings they will have annotation points like where is the hand where is the head and so on and allow things like this to be done very cool this research has been out earlier and now they're releasing the data set of nearly 180 000 annotated amateur drawings to help other AR researchers and creators to innovate further excellent thank you very much camel is a project and a paper for studying language i guess by letting language models communicate with each other it's a very unique approach but if they make these things role play and talk to each other they can study things about them i say this here because code and models are both available so if you are interested in that kind of stuff then feel free to check it out aia is another model that does text to image piya has updated their model to a new version that is now even better piya is itself claiming to not be the best text image model but to be the simplest in terms of inference code and that's actually quite true so this here is the full code that's needed to sample from the model and as you can see it's very easy to keep an overview so another cool model to check out and also notably it's not a transformer it's a convent excellent spasian roshka releases a blog post called fine tuning large language models and it's quite good it's an introduction to the core ideas and approaches so if you are just in amazement how people can adapt and tune all of these models like llama models even though they're really big this blog post is certainly a good place for you in general sabustians blog is a very good resource to learn about modern things in deep learning pick a pick is an app for collecting human feedback on a i generated images the code is available so you can run this locally if you have any sort of images a i generated images for humans to rate this might be a good place for you in addition they do release a data set images data set rankings data set where people have already come and rated a i generated images excellent so they say help us in creating the largest publicly available human feedback or text to image data set if you're in the mood to rate an image or two that's where you go snorkel a i is holding a conference there is a virtual event june 7 through 8 and you get the chance to present your poster there there is a poster competition i'm telling you this because the conference is free and the poster competition you can win prizes so if you have a poster that you would like to publish but you don't want to go to all the way to an academic conference that costs like a thousand bucks in entropy and you have to fly somewhere this might be an excellent alternative and if you in the competition there's prizes i found this to be funny if you search in amazon for the string as an a i language model you'll you'll like find find stuff like reviews and comments where people just copy pasted from chat gpt and look at this the weirdest part is this here it's a book one paragraph starts with as an a i language model i can't so people are writing books using chat gpt and then trying to 
sell them on amazon i've had a bunch of people ask me this and saying like oh look i made a book using chat gpt and it was so fast and i'm like yo why would why would someone if they look for this information that's in your book why wouldn't they just go to chat gpt i huh deep mind has a new research paper out about robo soccer these guys are just so cute but also the capabilities here are quite astounding because these are end to end reinforcement learned and that's quite crazy because movement like this we're used to from like Boston dynamics and so on but i believe they hard code like every single movement and then they have a tight control algorithms where here i'm not sure entirely which part is all reinforcement learned they exhibit very very different and very adaptive behavior i've recently visited lab at eth also doing robo soccer a different discipline than this one which i'll also hopefully share soon and that's also really really interesting so the paper is called learning agile soccer skills for bipedal robot with deeper reinforcement learning and here's a video of of like someone pushing over the robots and i'm like don't do that don't do that if Jeff hinting is right that think you'll be the first person no you'll be the first person that'll get they'll remember they'll remember forever they have oh no how long does a heart disk store stuff you you better hide for longer than that anyway thank you so much for watching this was ml news thank you for being here if you do have a moat please like this video and tell your friends about it so i'll see you next time bye bye

Next, we write the transcriptions to a text file and load the text back in; we will then split it into chunks with zero overlap before storing them in Deep Lake.

with open('docs/text.txt', 'w+') as file:
    for result in results:
        file.write(result + "\n")

with open('docs/text.txt') as f:
    text = f.read()

As before, we instantiate a text splitter and split the transcription into chunks with zero overlap:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
    )
texts = text_splitter.split_text(text)

Next, we pack the chunks into Documents, import Deep Lake, and build a database with embedded documents:

from langchain.docstore.document import Document

docs = [Document(page_content=t) for t in texts[:4]]  # only the first four chunks are embedded here, to keep the example small
from langchain.vectorstores import DeepLake
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# create Deep Lake dataset
my_activeloop_org_id = "ala" # TODO: use your organization id here
my_activeloop_dataset_name = "langchain_course_youtube_summarizer"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
db.add_documents(docs)
Deep Lake Dataset in hub://ala/langchain_course_youtube_summarizer already exists, loading from the storage
Dataset(path='hub://ala/langchain_course_youtube_summarizer', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
 embedding  generic  (8, 1536)  float32   None   
    ids      text     (8, 1)      str     None   
 metadata    json     (8, 1)      str     None   
   text      text     (8, 1)      str     None   
['0c42e31e-2b12-11ee-b1b7-0242ac1c000c',
 '0c42e45e-2b12-11ee-b1b7-0242ac1c000c',
 '0c42e4cc-2b12-11ee-b1b7-0242ac1c000c',
 '0c42e526-2b12-11ee-b1b7-0242ac1c000c']

To retrieve information from the database, we need to construct a retriever object.

retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 4

The distance metric governs how the Retriever calculates “distance” or similarity between data points in the database. The Retriever will utilise cosine similarity as its distance metric if distance_metric is set to ‘cos’. Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. It is frequently used in information retrieval to assess the similarity of documents or chunks of text. When a search is performed, setting ‘k’ to 4 causes the Retriever to provide the four most similar or closest results based on the distance metric.
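
To build some intuition for what the retriever does under the hood, here is a minimal, hypothetical sketch of top-k retrieval by cosine similarity using NumPy; it is for illustration only and is not the actual Deep Lake retriever implementation:

import numpy as np

def top_k_by_cosine(query_vec, chunk_vecs, k=4):
    # Normalise the vectors so that a dot product equals the cosine of the angle
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                        # one cosine similarity score per chunk
    return np.argsort(scores)[::-1][:k]   # indices of the k most similar chunks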

With the RetrievalQA chain, we can also define and use a custom prompt template. The chain retrieves similar records from the database and uses them as context to answer questions, and the custom prompt lets us shape the output, for example asking the model to summarise the retrieved documents in bullet-point format.

from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of transcripts from a video to answer the question in bullet points and summarized. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Summarized answer in bullet points:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Lastly, we pass the custom prompt through the chain_type_kwargs argument; for the chain type, the ‘stuff’ variation is used here. You can test the other chain types as well, as seen previously.

from langchain.chains import RetrievalQA

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 chain_type_kwargs=chain_type_kwargs)
print( qa.run("Summarize the mentions of google according to their AI program") )

• Google has developed an AI program to help people with their everyday tasks.
• The AI program can be used to search for information, make recommendations, and provide personalized experiences.
• Google is using AI to improve its products and services, such as Google Maps and Google Assistant.
• Google is also using AI to help with medical research and to develop new technologies.

Of course, you can always change the prompt to get the desired outcome; experiment with different types of chains and updated prompts to discover the best combination. Finally, the strategy you choose is determined by the specific needs and limits of your project.
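
For instance, here is a minimal sketch of swapping the chain type to map_reduce while keeping the same retriever; it assumes the llm and retriever objects defined above, and omits the custom PROMPT because that prompt was written for the ‘stuff’ chain:

from langchain.chains import RetrievalQA

# Same retriever, but with the map_reduce chain type and its default prompts
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
)
print(qa_map_reduce.run("Summarize the mentions of google according to their AI program"))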

6 Conclusion

When working with massive documents and language models, it is critical to select the right technique to make the most of the available information. We’ve discussed three major strategies: “stuff,” “map-reduce,” and “refine.”

The “stuff” approach is the most basic and naïve: it sends all of the text from the documents in a single request. This fails when the combined text exceeds the LLM’s context window, and it is not an efficient way to process large amounts of text.
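
As a quick sanity check before using the “stuff” chain, you can estimate whether the combined text fits within the model’s context window. Here is a minimal sketch using the tiktoken tokenizer, assuming the text variable from earlier, a gpt-3.5-turbo tokenizer for the estimate, and a roughly 4,000-token limit (adjust for your model):

import tiktoken

# Estimate the token count of the full transcript (assumed limit of ~4,000 tokens)
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
n_tokens = len(encoding.encode(text))
print(f"Transcript length: {n_tokens} tokens")
if n_tokens > 4000:
    print("Too long for the 'stuff' chain - consider map_reduce or refine instead")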

The “map-reduce” and “refine” approaches, on the other hand, provide more advanced methods for processing and extracting meaningful information from larger documents. While the “map-reduce” method can be parallelized for faster processing times, the “refine” method has empirically been found to yield better results; however, because it is sequential in nature, it is slower than “map-reduce.”

By weighing these trade-offs between speed and quality, you can choose the approach that most efficiently leverages the power of LLMs for your work.

Using Whisper and LangChain, we demonstrated a powerful and efficient way of summarising YouTube videos. By downloading YouTube audio files, transcribing them with Whisper, and applying LangChain’s summarisation chains (stuff, refine, and map_reduce), you can easily extract the most valuable information from your chosen content.

We also demonstrated LangChain’s customizability, which allows you to design personalised prompts, generate summaries in many languages, and even save URLs in a Deep Lake vector store for easy retrieval. This robust feature set lets you access and process a wealth of information more efficiently. The summarisation chain allows you to quickly retrieve information from the vector store and condense it into easily digestible summaries. By applying these cutting-edge tools, you can save time and effort while improving your knowledge retention and understanding of a wide range of topics.

Further Reading:

https://openai.com/research/whisper

https://docs.activeloop.ai/tutorials/vector-store/deep-lake-vector-store-in-langchain

7 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
