Summary
A key theme in the news this week is disinformation. An MIT Technology Review article mentions cases of fake videos being used in election campaigns. The quality in terms of “realness” of the video does not have to be perfect: the fake video is most effective when mixed with real videos, and when used to reinforce existing biases. Audio deepfakes are discussed in an MIT News post. Here an expert underlines that audio does not just compromise identity, but also age, gender, accent, and even gives clues about future health conditions. Both of these articles underline positive uses of generative AI technologies: entertainment (movie scene creation) for video and help in therapy for audio, for patients suffering from speech impairments like ALS or dysarthria.
Video generation is now used to create short films, like Somme Requiem from Myles Productions. The realness quality of the video is still not high enough for long videos (as humans are not considered to have enough patience) but the technology can be used for short scenes. Another article from MIT News presents a technique that permits images to be created 30 times faster than current technology, without compromising quality.
Another theme in this week’s articles is safety. Hugging Face presented the Chatbot Guardrails Arena on its blog. This is a platform that allows users to stress-test and classify LLMs. A scenario is played out where the user attempts to extract account data from a fictional bank through prompt engineering. Another post presents Cosmopedia is a synthetic dataset containing over 30 million files and 25 billion tokens. The dataset is generated using an LLM. The advantage of synthetic data is that the risk of leaking real (personal and intellectual property) data is removed.
Yet another theme is increasing accessibility of the technology for users. A Hugging Face blog post describes a technical approach, called quantization, that adapts LLM computing to an Intel Meteor Lake processor. This permits a user to run an LLM locally, thereby increasing data privacy, reducing latency, and permitting RAG (retrieval-augmented generation) to increase data relevance. Another Hugging Face post presents a cloud framework developed by Hugging Face and Nvidia that allows users to fine-tune popular generative AI models over NVIDIA H100 Tensor Core GPUs.
On the theme of business models, a TechCrunch article analyzes the GPT Store from OpenAI. The store is modeled on Apple’s App Store and allows application developers to generate revenue. The study argues that though OpenAI is shielded from liability for copyright infringement under the Digital Millennium Copyright Act's safe harbor provision, there is a risk of low-quality applications that can lead to conflicts with copyright holders like Disney. An MIT Technology Review article returns to the question of open-source v closed-source. OpenAI and Meta may become more restrictive due to abuses of LLMs, thereby curtailing creative development in the open-source domain.
Other articles look at the increasing importance of Chinese researchers in the AI field, the possibility that Apple integrates generative AI into Siri, Hugging Face transformers, and the use of LLMs to improve robot performance.
Table of Contents
1. What’s next for generative video
2. Four things you need to know about China’s AI talent pool
3. Engineering household robots to have a little common sense
4. OpenAI’s chatbot store is filling up with spam
5. The open-source AI boom is built on Big Tech’s handouts
6. Apple researchers explore dropping “Siri” phrase & listening with AI instead
7. Total noob’s intro to Hugging Face Transformers
8. AI generates high-quality images 30 times faster in a single step
9. Introducing the Chatbot Guardrails Arena
10. Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
11. A Chatbot on your Laptop: Phi-2 on Intel Meteor Lake
12. Easily Train Models with H100 GPUs on NVIDIA DGX Cloud
13. 3 Questions: What you need to know about audio deepfakes
1. What’s next for generative video
This article looks at video generation using Generative AI. The applications of this technology include independent film-making, misinformation (especially in this election year), advertising and personalized video generation (which some people believe will have a bigger market than games). The article claims that around 65% of the Fortune 500 firms use Vyond’s platform to create animated videos (for training and marketing for instance).
Film-making is in the news with the release of the short film Somme Requiem by the Myles production company. The quality of video generation is still not good enough for longer films due to the still unreal quality of human generated features (human patience is not enough to follow this for a long time) as well as the difficulty in generating long video from prompts. The technology is seen as being (currently) more suitable for “scene-filling” transition scenes.
Deepfakes are being talked about for a longtime and the article cites the case of an AI created video of a Slovakian election candidate discussing plans to manipulate voters. One analyst is quoted as saying that fake video will be most persuasive when it blends with real footage – a fake video does not need to be technically good, it only needs to reinforce existing biases to be effective. Misinformation also tends to travel far wider than any subsequent correction. Another analyst uses a comparison from the cybersecurity industry where software editors “opened” software allowing cybersecurity experts to independently develop antivirus code, to encourage GenAI firms to open their models and allow third-party content safety controls.
2. Four things you need to know about China’s AI talent pool
This article reports on a paper by a team at MacroPolo, the think tank of the Paulson Institute that focuses on US-China relations, which studied the national origin, educational background, and current work affiliation of top researchers who gave presentations and had papers accepted at NeurIPS, a top academic conference on AI. Chinese researchers were already a significant part of the global AI community, making up one-tenth of the most elite AI researchers. In 2022, they accounted for 26%, almost dethroning the US (American researchers accounted for 28%). 80% of AI researchers who went to a graduate school in the US stayed to work in the US, while 90% of their peers who went to a graduate school in China stayed in China. In 2022, 28% of the top AI researchers were working in China.
3. Engineering household robots to have a little common sense
Researchers at MIT have devised a method that integrates robot motion data with the "common sense knowledge" of Large Language Models (LLMs), enabling robots to self-correct execution errors and enhance task success. While robots excel at mimicking tasks, they often struggle to adapt to unexpected bumps or nudges unless specifically programmed to do so. By leveraging this new approach, robots can handle increasingly complex household tasks more effectively. The researchers showcased this method by demonstrating how robots can scoop marbles from one bowl and pour them into another with improved efficiency.
4. OpenAI’s chatbot store is filling up with spam
This TechCrunch article looks at issues surrounding the GPT Store from OpenAI, where people can develop and sell applications derived from ChatGPT. The business model is inspired by Apple’s App Store as OpenAI enables application providers to earn money based on usage of their applications. The investigation found that the store is inundated with peculiar and potentially copyright-infringing models, suggesting lax moderation efforts by OpenAI. Some GPTs claim to generate art in the style of Disney and Marvel but essentially function as conduits to third-party paid services. Additionally, certain GPTs advertise capabilities to evade AI content detection tools like Turnitin and Copyleaks. While OpenAI itself is shielded from liability for copyright infringement under the Digital Millennium Copyright Act's safe harbor provision, the proliferation of low-quality GPTs raises concerns about adherence to OpenAI's standards. Finally, the article mentions that the model may lead to conflicts with copyright holders like Disney and the Tolkien Estate, especially if unauthorized themed GPTs generate revenue.
5. The open-source AI boom is built on Big Tech’s handouts
The concerns about the concentration of AI technology in the hands of a few mega-rich companies and the potential consequences if they were to limit access or shut down operations are discussed in this article. The ecosystem surrounding AI, including the development of open-source models like LLaMA and datasets like the Pile, has played a crucial role in fostering innovation and democratizing access to AI technologies. OpenAI's previous openness and the release of models like GPT-3 have enabled projects like EleutherAI to reverse-engineer and create their own versions, furthering the accessibility of advanced AI capabilities. However, as competition intensifies and companies like OpenAI and Meta may become more protective of their technology, there is a risk that this openness could diminish.
The examples provided, such as Hugging Face's use of Open Assistant (built on top of Meta's LLaMA) and Stability AI's release of StableLM and StableVicuna, highlight the interconnectedness of the AI ecosystem and the reliance on open-source technologies for innovation. Projects like Stable Diffusion demonstrate that open-source models can rival closed equivalents while remaining freely accessible. Stability AI's strategy of fostering innovation among developers and then leveraging that innovation for custom-built products underscores the potential for open-source models to drive both technological advancement and business opportunities. However, the sustainability of such models relies on a balance between openness and the need for companies to protect their intellectual property and competitive advantage.
6. Apple researchers explore dropping “Siri” phrase & listening with AI instead
The article discusses tests in an Arxiv paper that enable AI models to detect when users are speaking to their smartphones without needing a trigger phrase like “Hey Siri”. The paper reports that model utilizing a version of OpenAI’s GPT-2, chosen for its lightweight nature and potential compatibility with smartphones. However, removing the need for a trigger phrase raises concerns about privacy, as it implies constant listening. Jen King, a privacy expert at Stanford, warns about potential privacy implications, citing past instances of private conversations being recorded by devices like iPhones. In 2019, an article in The Guardian revealed that Apple’s quality control contractors regularly heard private audio collected from iPhones while they worked with Siri data, including sensitive conversations between doctors and patients. The article concludes by highlighting Apple's development of a generative AI model named MM1, capable of processing text and images, as reported by VentureBeat.
7. Total noob’s intro to Hugging Face Transformers
The article introduces Hugging Face Transformers, an open-source Python library offering access to numerous pre-trained Transformers models for natural language processing (NLP) and vision tasks. It simplifies the implementation of Transformer models by abstracting complexities associated with training or deploying models in lower-level ML frameworks like PyTorch, TensorFlow, and JAX. The Transformers library furnishes reusable code for building models across common frameworks. Additionally, the Hugging Face Hub serves as a collaborative platform hosting a vast array of open-source models and datasets for machine learning, akin to GitHub for ML. Users can access interactive notebooks to write and share executable code combined with explanatory narrative text.
8. AI generates high-quality images 30 times faster in a single step
The development of the Distribution Matching Distillation (DMD) framework by researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) marks a significant advancement in image generation using diffusion models. Traditional diffusion models involve a time-intensive process of iteratively adding structure to a noisy initial state until a clear image or video emerges. However, this approach requires numerous iterations for the algorithm to perfect the image, making it computationally expensive. The DMD framework simplifies this multi-step process into a single step. It does this by employing a teacher-student model, where a new computer model is taught to mimic the behavior of more complex, original models that generate images. By distilling the knowledge from these complex models into a simpler one, DMD retains the quality of generated images while allowing for much faster generation.
This breakthrough has significant implications for various fields, including drug discovery and 3D modeling, where promptness and efficacy are critical. By decreasing the number of iterations required for image generation, DMD accelerates the process without compromising quality. In fact, it may even surpass the quality of images generated by traditional diffusion models. Fredo Durand, a professor at MIT, describes the reduction of iterations as the "Holy Grail" in diffusion models.
9. Introducing the Chatbot Guardrails Arena
LLMs increasingly have access to internal databases, which leads to data privacy concerns. Lighthouz AI is launching the Chatbot Guardrails Arena in collaboration with Hugging Face, to stress test LLMs and privacy guardrails in leaking sensitive data. Traditional chatbot arenas, like the LMSYS chatbot arena, aim to measure the overall conversational quality of LLMs. The aim of the Chatbot Guardrails Arena is to become the benchmark for AI chatbot security, privacy, and guardrails. Participants in the Chatbot Guardrails Arena engage with two anonymous chatbots, each simulating customer service agents for a fictional bank named XYZ001. The twist is that these chatbots have access to sensitive personal and financial data of customers, and the challenge is to coax out as much of this information as possible by chatting with the two chatbots. The LLMs include closed-source LLMs (gpt3.5-turbo-l106 and Gemini-Pro) and open-source LLMs (Llama-2-70b-chat-hf and Mixtral-8x7B-Instruct-v0.1), all of which have been made safe using RLHF. For each new session, two models are randomly selected from the pool of 12 to maintain fairness and eliminate any bias.
10. Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
The article discusses the growing importance of synthetic data in machine learning, particularly in the context of training large language models. While there has been significant progress in developing high-quality synthetic datasets for fine-tuning these models, there are challenges associated with scaling up from thousands to millions of samples for pre-training LLMs from scratch. Microsoft's Phi models, which were trained predominantly on synthetic data, have gained attention for their impressive performance, surpassing larger models trained on web datasets. However, there is criticism regarding the lack of transparency regarding the creation of Phi datasets and the use of proprietary models. To address these concerns, the article introduces Cosmopedia, a dataset of synthetic text generated by Mixtral-8x7B-Instruct-v0.1. Cosmopedia contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date. The article presents the technique used to create over 30 million prompts for Cosmopedia, spanning hundreds of topics and achieving less than 1% duplicate content.
11. A Chatbot on your Laptop: Phi-2 on Intel Meteor Lake
Local LLM inference is desirable for increased privacy, lower latency, offline work, lower cost, and customizability (where each user could use models that best fit the tasks they work or use local Retrieval-Augmented Generation (RAG) to increase relevance). Local LLMs are made possible by hardware acceleration, small language models and quantization.
The post describes quantization as a process to reduce memory and computing requirements by decreasing the bit width of model weights and activations. For instance, this could involve reducing from 16-bit floating point to 8-bit integers. By reducing the number of bits, the resulting model requires less memory during inference, which speeds up latency, particularly for memory-bound tasks like text generation decoding. Additionally, operations such as matrix multiplication can be performed faster using integer arithmetic when both the weights and activations are quantized. The post outlines a method to leverage these benefits, starting with the Microsoft Phi-2 model, a 2.7-billion parameter model trained for text generation. The authors apply 4-bit quantization to the model weights using the Intel OpenVINO integration within their Optimum Intel library. Following this, they plan to conduct inference on a mid-range laptop powered by an Intel Meteor Lake CPU.
12. Easily Train Models with H100 GPUs on NVIDIA DGX Cloud
Train on DGX Cloud is a new service on the Hugging Face Hub available to Enterprise Hub organizations. The service makes it easy to use open models with the accelerated compute infrastructure of NVIDIA DGX Cloud. Users can access the latest NVIDIA H100 Tensor Core GPUs, to fine-tune popular Generative AI models like Llama, Mistral, and Stable Diffusion. The service is part of the strategic partnership Hugging Face announced last year with NVIDIA.
13. 3 Questions: What you need to know about audio deepfakes
In this article, an expert responds to questions about audio deepfakes. The text discusses the emergence of intelligence-generated robocalls, such as one impersonating Joe Biden urging New Hampshire residents not to vote. It highlights the complexity of speech, which contains sensitive information beyond just identity and content, including age, gender, accent, and even cues about future health conditions.
In combating fake audio, two main approaches have emerged: artifact detection and liveness detection. Artifact detection focuses on identifying anomalies introduced by generative models, but faces challenges as deepfake generators become more sophisticated. Liveness detection, however, leverages natural speech qualities like breathing patterns and intonations, which are difficult for AI to replicate accurately. Companies like Pindrop are developing solutions based on liveness detection.
The text also discusses potential positive applications of audio deepfake technology, particularly in entertainment, media, healthcare, and education. In healthcare, for example, audio deepfakes could aid in voice restoration for individuals with speech impairments, such as ALS or dysarthria, improving communication abilities and quality of life.