Google says public data is fair game for training its AIs

Google has updated its privacy policy to confirm that it scrapes public data from the internet to train its AI models and services, including its chatbot Bard and its search engine, which can now generate answers to questions.

The fine print under research and development now reads: “Google uses data to improve our services and develop new products, features and technologies that benefit our users and the public. For example, we use public data to help train Google’s AI models and build products and features such as Google Translate, Bard and Cloud AI capabilities.”


Interestingly, Reg staffers outside the US cannot see the wording referenced above. However, this Google policy PDF states: “We may collect data that is publicly available online or from other public sources to help train Google’s AI models and build products and features such as Google Translate, Bard and Cloud AI capabilities.”

The change widens the stated scope of Google’s AI training. Previously, the policy mentioned only “language models” and referred to Google Translate; the wording has now been broadened to cover “AI models” and to include Bard and other systems built on its cloud platform.

A Google spokesperson told The Register that the update has not fundamentally changed the way it trains its AI models.

“Our privacy policy has long been transparent that Google uses public data from the open web to train language models for services such as Google Translate. This latest update simply clarifies that new services such as Bard are included. We incorporate privacy principles and safeguards into the development of our AI technology, in line with our AI Principles,” the spokesperson said in a statement.

For years, developers have scoured the internet for AI training data: photo albums, books, social networks, source code, music, articles, and more. The practice is controversial, however, since much of that material is protected by copyright, terms of use, or licenses, and the scraping has led to lawsuits.

Some creators are unhappy that their content is used to build machine-learning systems that can reproduce their work, and thereby endanger their livelihoods, and that the output of those models can come close to violating the copyrights or licenses attached to that training data.

AI developers may argue that their efforts fall under fair use, and that what emerges is a new work rather than a copy of the original training data. It is a hotly debated issue.

For example, Stability AI has been sued by Getty Images for scraping and using millions of images from its stock photo website to train text-to-image tools. Meanwhile, OpenAI and its backer Microsoft have also been hit with several lawsuits, including one accusing them of improperly scraping “300 billion words from the Internet, books, articles, websites and texts – including personal information obtained without consent”, and another over slurping source code from public repositories to build the AI pair-programming tool GitHub Copilot.

A Google representative declined to clarify whether the ad and search giant scrapes publicly available copyrighted material or social media posts to train its systems.

Now that people are better informed about how AI models are trained, some internet businesses have started charging developers for access to their data. Stack Overflow, Reddit, and Twitter, for example, this year introduced fees or new rules for accessing their content through APIs. Other outfits, such as Shutterstock and Getty, have chosen to license their images to AI model makers, partnering with the likes of Meta and Nvidia. ®
