Summary
This week saw more articles on the thorny issue of preventing copyrighted material from being used in model training data. Researchers from Imperial College London have developed a watermarking technique for text documents that makes it possible to detect, from a model’s outputs, whether the documents were used in training. An article on 404 Media looks at the challenge of blocking AI firms’ web crawlers from visiting a website.
The issue of model collapse is discussed in an article from MIT Technology Review. Model collapse is the degradation in model performance that occurs when a model is trained on AI-generated content, a growing concern as more and more of the web content used as training data is itself generated by AI. The article describes research from the University of Oxford that measures the degradation.
Regulation continues to feature in the news. OpenAI is facing a new complaint for violation of the GDPR in Europe, specifically over the fact that the company does not offer sufficient means for citizens to rectify incorrect data about them produced by ChatGPT. In the US, AI firm executives are increasingly supporting Donald Trump in the upcoming presidential election because he wants to roll back legislation controlling AI. On the topic of self-regulation, there is an article proposing rules for lawyers and legal professionals to adopt in relation to generative AI.
On adoption of generative AI, an article detailing a talk by a Gartner VP is included. Gartner estimates that, so far, fewer than 16% of firms that have adopted generative AI are seeing cost savings or increased revenue, and predicts that 30% of GenAI projects will be abandoned after the proof-of-concept phase by the end of 2025. A survey of software developers by Stack Overflow reports that developers are more confident than ever that GenAI is not endangering their jobs. The demand for developers might even increase as companies seek developers with model training skills.
Elsewhere, a TechCrunch article on ChatGPT reports that OpenAI might spend 7 billion USD training and operating ChatGPT in 2024. Google is extending its storage services with Gemini AI support, and a Business Insider article reports how job applicants are widely using ChatGPT and similar tools for CVs and cover letters.
Table of Contents
1. Google Cloud expands its database portfolio with new AI capabilities
2. ChatGPT’s "hallucination" problem hit with another privacy complaint in EU
3. Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept By End of 2025
4. How GitHub harnesses AI to transform customer feedback into action
5. Ethical Rules for Using Generative AI in Your Practice
6. “Copyright traps” could tell writers if an AI has scraped their work
7. ChatGPT: Everything you need to know about the AI-powered chatbot
8. Developers aren’t worried that gen AI will steal their jobs, Stack Overflow survey reveals
9. AI trained on AI garbage spits out AI garbage
10. Employers Say They Can Tell When ChatGPT Is Used in Job Applications
11. How’s AI self-regulation going?
12. Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)
1. Google Cloud expands its database portfolio with new AI capabilities
Google has made several announcements relating to its database technologies. Its Spanner SQL database, used in Google Search, Gmail and YouTube, will now include graph support, based on the GQL graph query language standard, along with full-text and vector search features. These additions will help with implementing retrieval augmented generation (RAG) on top of data stored in Spanner. Google’s BigQuery now includes Gemini support to help with data analytics. In another development, an agreement between Oracle and Google will see Google host Oracle Exadata and Autonomous Database services in Google data centers. The arrangement is win-win: Google attracts Oracle workloads to its cloud, while Oracle continues to collect licensing fees from its customers.
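The article links vector search in the database layer to RAG. As a rough illustration of the retrieve-then-generate pattern (this is not Spanner’s API; embed() below stands in for whatever embedding function is used), a minimal Python sketch:

```python
# Minimal retrieval sketch for a RAG pipeline (illustrative only).
# `embed` is a stand-in for any embedding model or function.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[str], embed, top_k: int = 3) -> list[str]:
    """Return the top_k stored documents most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(documents,
                    key=lambda doc: cosine_similarity(q_vec, embed(doc)),
                    reverse=True)
    return ranked[:top_k]

# The retrieved documents are then placed in the prompt sent to the generative
# model, grounding its answer in stored data; a vector-enabled database performs
# this similarity search at scale instead of the in-memory sort shown here.
```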
2. ChatGPT’s “hallucination” problem hit with another privacy complaint in EU
The privacy advocacy group noyb has brought a new complaint against OpenAI for violation of the EU’s General Data Protection Regulation (GDPR). One of the rights of citizens under the GDPR is the right to rectification, meaning that a citizen can ask a service provider to correct inaccurate information that the service gives about them. noyb claims that OpenAI is unable to correct misinformation about individuals in ChatGPT’s output. OpenAI admits that it is technically impossible to correct all misinformation, since the errors originate in the training data, but it has set up a channel through which people can ask for their personal information to be removed from ChatGPT’s outputs. Last autumn, OpenAI opened an office in Dublin, Ireland, which was seen as a stepping stone to working more closely with a European data protection authority. This is not the first complaint against OpenAI for GDPR violations: the Italian data protection authority forced a temporary shutdown of ChatGPT in 2023 because it considered that OpenAI was unclear about how consent was obtained for the use of personal data in training, that there was no protection against minors using the service, and that there were no measures allowing citizens to ask for data rectification.
3. Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept By End of 2025
In a talk given at the Gartner Data & Analytics Summit in Sydney, a Gartner VP argued that many organizations are struggling to create value from generative AI (GenAI) projects. The reasons cited include poor data quality management within organizations, which hampers training or fine-tuning AI models with organizational data, as well as the up-front and recurring annual costs of training and operating models. In a Gartner survey of 822 companies in late 2023 that were trying to develop new business models based on GenAI, only 15.8% of respondents reported a revenue increase due to GenAI, 15.2% reported cost savings and 22.6% reported improved overall productivity. The VP warned that investment in GenAI requires a high tolerance for indirect and future returns rather than an immediate return on investment, and that benefits are highly specific to the company, use case, role and workforce. He also predicted that 30% of GenAI projects will be abandoned after the proof-of-concept stage by the end of 2025. The five main GenAI use cases observed were coding assistants, personalized sales content creation, virtual assistants, RAG-enhanced document search, and models tailored to financial or medical services.
4. How GitHub harnesses AI to transform customer feedback into action
This post from GitHub presents an interesting experience report on how GenAI models were used to move the organization from manual to automated trend identification and to improve feature identification in user tickets. Previously, analysis of user tickets was manual, and therefore error-prone, time-consuming, and not as precise as it should be. The post cites a Business Review study estimating that data scientists spend 80% of their time on manual data collection, organization and classification. GitHub used the open-source BERTopic model to analyze user data. This model is particularly suited to dynamically identifying topics in documents, and it was combined with a clustering algorithm to aggregate similar topics into clusters. In a second stage, GPT-4 was used to summarize topic clusters to get to the essentials of user feedback. A noteworthy point of the GitHub approach is that user feedback data was not used to train a model, which addresses privacy concerns. Instead, the GPT-4 stage was continually optimized through human feedback and A/B testing (where two model versions are compared on the quality of their outputs), by refining input prompts and adjusting inference parameters such as temperature (which controls the degree of randomness in model outputs) and the maximum number of tokens.
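Since BERTopic is an open-source Python library, the first stage of such a pipeline can be sketched in a few lines. The snippet below is a minimal illustration rather than GitHub’s actual pipeline, and load_feedback_tickets() is a hypothetical helper standing in for however the ticket texts are collected:

```python
# pip install bertopic
from bertopic import BERTopic

# Hypothetical loader returning a (large) list of user ticket texts.
feedback: list[str] = load_feedback_tickets()

# BERTopic embeds the documents, clusters them, and extracts representative
# keywords for each discovered topic.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(feedback)

print(topic_model.get_topic_info())  # one row per discovered topic cluster

# In the pipeline described in the post, each topic cluster is then summarized
# with GPT-4 to distill the essentials of the user feedback.
```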
5. Ethical Rules for Using Generative AI in Your Practice
This article proposes ethical rules to govern the use of GenAI in law practices. The rules insist on understanding the risks and benefits of GenAI technology, ensuring that work created with these tools is reliable and consistent with an acceptable legal, ethical and professional standard of care, and ensuring that attorney-client privilege and other legally protected information remains confidential and secure. The first risk cited is that of hallucinations, as there have been several publicized cases of GenAI-created content being discovered in material submitted to courts. Another risk cited is bias: because the prompt represents the legal position of the user, the output may ignore opposing legal arguments and precedent. GenAI services have been known to leak information, so the risk of compromising client confidentiality is also cited; the author advises legal workers to consult with clients before using GenAI services for their cases. A related issue is that information shared with a GenAI service could compromise legal arguments used to defend other clients. Yet another issue raised concerns fees: a lawyer is expected to charge a client a reasonable amount, but if a service like ChatGPT is doing a lot of the work, then fee estimation might need to be revised.
6. “Copyright traps” could tell writers if an AI has scraped their work
This article describes research from Imperial College London on the design of a copyright trap for textual content. The tool creates a watermark that is added to the copyrighted text, and the watermark can later be detected in an AI model’s output if the text was used to train that model. The watermark is a series of gibberish sentences hidden in the text, added either as metadata within the document or as text in the same color as the document background. Large GenAI models often memorize parts of their training data, which makes specific copyrighted content relatively easy to detect; smaller models (with fewer parameters), such as those designed to run on mobile devices, memorize less, making detection harder. The researchers found that their traps can nonetheless be detected even in smaller models, using CroissantLLM in their evaluation. Critics of the approach note that many GenAI platforms “clean” documents before training, which could strip out the watermarks before training begins. The code for generating watermarks and detecting traps is available on GitHub.
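As a rough illustration of the hidden-text variant (this is not the researchers’ tool, and the trap sentence is made up for the example), the following sketch injects a gibberish trap sentence into an HTML page using text colored to match the background:

```python
# Illustrative hidden-text copyright trap. The real approach generates many
# synthetic trap sequences and detects them in a model with statistical tests.
TRAP_SENTENCE = "zephyr marmalade quorum deliberates across the vermilion abacus"

def inject_trap(html_body: str, trap: str = TRAP_SENTENCE,
                background: str = "#ffffff") -> str:
    """Append a trap sentence styled in the page's background color,
    invisible to human readers but picked up by scrapers."""
    hidden = f'<p style="color:{background};font-size:1px">{trap}</p>'
    return html_body.replace("</body>", hidden + "\n</body>")

page = "<html><body><p>Original article text.</p></body></html>"
print(inject_trap(page))
```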
7. ChatGPT: Everything you need to know about the AI-powered chatbot
This TechCrunch article is a running piece, updated over several months, that tracks the latest developments around OpenAI and ChatGPT. One of the key recent developments is OpenAI’s partnership with Apple, which will see ChatGPT integrated into Apple’s Siri. OpenAI also made a first release of GPT-4o’s updated realistic audio responses, with the Sky voice (alleged to resemble that of US actress Scarlett Johansson) removed. The company is testing SearchGPT – a search engine designed to give “timely responses” to queries and to propose pertinent follow-up queries to the user. It has also announced CriticGPT, a model based on GPT-4 that aims to find errors in other models’ outputs and to help the humans whose task is to spot incorrect output during model testing. OpenAI also released its latest small AI model, GPT-4o mini, replacing GPT-3.5 Turbo as the smallest OpenAI model. Finally, the article reports on unconfirmed claims that OpenAI may need to spend 7 billion USD to train and operate ChatGPT in 2024, and that the company could lose up to 5 billion USD this year.
8. Developers aren’t worried that gen AI will steal their jobs, Stack Overflow survey reveals
This post reports on a Stack Overflow survey of the use and perception of GenAI in software development. The survey analyzed responses from 65,000 developers in 185 countries. One interesting result is that 70% of professional developers do not see AI as a threat to their jobs. On the contrary, the feeling is that GenAI is increasing the need for developers, since mastering AI is a newly required skill. 76% of developers are using AI in 2024, compared to 70% in 2023. The main advantage of the technology is seen as reducing the time spent on boilerplate code, leaving more time for the hard programming problems. On the other hand, only 43% of developers trust the accuracy of AI-generated code, and getting the AI to understand context is cited as the biggest challenge. While many developers have now experimented with AI code-generation tools, many are disappointed by the experience.
9. AI trained on AI garbage spits out AI garbage
This MIT Technology Review article looks at the phenomenon of AI increasingly being used to write the content of web pages. The issue is model collapse: the behavior of AI models degrades when they are trained on data that is itself created by AI. The problem is likely to get worse as more sites become populated with AI-generated content and as legitimate sites block crawlers from AI firms looking for training data. The article describes an experiment at the University of Oxford in which researchers fine-tuned a model on Wikipedia data and then repeatedly re-fine-tuned it on the output of the previous iteration, nine times in all. The researchers used a perplexity score to measure how nonsensical the output was, and report high perplexity scores for much of the output after the later iterations. Another issue observed is that data about minority groups becomes highly distorted, because of the relatively small number of content samples about those groups in the training data. One way to combat model collapse is to develop models that assign higher weight to human-generated content than to content of unknown origin.
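Perplexity is a standard language-modelling measure: the exponential of the average negative log-likelihood per token, so higher values mean the model finds the text more surprising. The sketch below computes it with GPT-2 from the Hugging Face transformers library; it is only meant to illustrate the metric, not the Oxford team’s exact setup:

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood per token) under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy
    return torch.exp(loss).item()

print(perplexity("Jack rabbits have long ears and strong hind legs."))   # lower
print(perplexity("ears jack tailed @- the the jack rabbit jackrabbits")) # higher
```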
10. Employers Say They Can Tell When ChatGPT Is Used in Job Applications
This article looks at the use of AI in job applications. It cites a report from Jobscan claiming that 97% of Fortune 500 companies use automated hiring systems to process CVs. This is partly due to the huge increase in job applications that stems from openings being advertised, and applications being submitted, online. The article also cites a 2023 study from iCIMS which found that 57% of college graduates have used AI to help write CVs and cover letters, and that 25% of Gen Z applicants have used AI. Some companies try to exclude AI tool usage with techniques like asking applicants to submit an introductory video about themselves. On the other hand, the article argues, the use of AI is helping to "level the playing field", since job applicants have to navigate an increasingly challenging job market.
11. How’s AI self-regulation going?
President Biden invited seven top executives from AI companies to the White House in July 2023. He has since issued an executive order on the safe use of AI, and wants Congress to pass regulation on AI safety. Also last year, tech firms committed, though not in any binding way, to the development of responsible AI. This article looks at how AI now features in the 2024 election debate in the US. Silicon Valley is known to be unhappy with antitrust regulation and data protection laws, on the grounds that regulation stifles innovation. On AI, Donald Trump signed an executive order as president calling for more research into AI; if he wins the November election, he wants a "Manhattan Project" to support military AI, to throw out the Biden executive order, and to reduce AI regulation. Many Silicon Valley executives are publicly supporting Trump in the election for these reasons.
12. Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)
AI companies scrape websites for content using automated crawlers, or robots. For many reasons, including the fear of having copyrighted content used in training data, more and more websites are blocking these robots. Each robot has a “signature” called an agent name that identifies its origin; for example, GoogleBot is the agent name of Google’s web crawler. Dark Visitors is a website that maintains a list of known agent names. A website can block robots by listing their agent names in a robots.txt file placed on the web server. The challenge for websites is that AI firms regularly change their agent names or introduce new crawlers, so this defense is relatively weak. Another challenge for websites is that robots consume considerable resources on the web server. The article cites one example in which a web crawler accessed 10 TB of files in a single day and 73 TB over the whole month, costing the company 5000 USD in bandwidth charges.
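Blocking by agent name relies on the robots.txt convention. The sketch below, using Python’s standard urllib.robotparser, shows a robots.txt that disallows two commonly listed AI crawler agent names (GPTBot and CCBot appear on public lists such as Dark Visitors) and then checks what different agents may fetch; a crawler with a new, unlisted name simply falls through to the wildcard rule, which is the weakness the article describes:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt blocking some known AI crawler agent names; the list goes
# stale as firms rename their crawlers or introduce new ones.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "CCBot", "SomeBrandNewAICrawler"):
    print(agent, parser.can_fetch(agent, "https://example.com/article"))
# GPTBot and CCBot are refused; the unlisted crawler is allowed, because
# robots.txt can only name agents the site operator already knows about.
```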