MIT apologises after a giant dataset it was using to teach AI how to recognise people and objects in images was found to be assigning racist and misogynistic labels
MIT has had to take offline a giant dataset that taught AI systems to assign 'racist and misogynistic labels' to people in images.
The database, known as '80 Million Tiny Images', is a massive collection of photos with descriptive labels, used to teach machine learning models to identify images.
But the system, developed at the US university, labelled women as 'whores' and 'bitches' and used other abhorrent terms against ethnic minorities.
It also contained close-up pictures of female genitalia labelled with the C-word and other images with the labels 'rape suspect' and 'molester'.
Images labelled with the slur 'whore' ranged from a woman in a bikini to a photo of 'a mother holding her baby with Santa', tech website the Register reported.
The respected research university in Massachusetts had to apologise for the dataset, which was taken down this week after the Register reported concerns raised by two academics.
MIT has also had to urge its researchers and developers to stop using the training library and to delete any copies.
Despite this, apps and websites relying on neural networks that were trained using the database may still spout these shocking terms when analysing photos and camera footage.

The Register's screenshot of the dataset before it was taken offline this week. It shows pixelated examples for the label 'whore', including 'a mother holding her baby with Santa', the Register said

'It is clear that we should have manually screened them,' said Antonio Torralba, a professor of electrical engineering and computer science at MIT's Computer Science & Artificial Intelligence Laboratory.
'For this, we sincerely apologise – indeed, we have taken the dataset offline so that the offending images and categories can be removed.'
Torralba and fellow researchers posted an open letter on the MIT website that explains the decision to remove the dataset – and why it listed images with such language in the first place.
'The dataset is too large (80 million images) and the images are so small (32 x 32 pixels) that it can be difficult for people to visually recognise its content,' they say.
'Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed.'
The dataset was created in 2006 and contains 53,464 different nouns, directly copied from WordNet – a database of English words grouped into related sets and developed by Princeton University.

This graph shows the number of images in the dataset labelled with different slurs. The dataset has been taken offline and will not be put back online, MIT said
WordNet was built in the mid-1980s, however, and contains racist slang and insults, the Register said, which now 'haunt modern machine learning'.
MIT then used all 53,000-plus nouns from WordNet as search terms to automatically download matching images from internet search engines, collecting a final total of 80 million images.
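As an illustration only – this is not MIT's actual code – a scraping pipeline of that kind might look something like the sketch below, where `search_image_urls` is a hypothetical stand-in for whatever search-engine API is queried, and every downloaded photo is shrunk to the dataset's 32 x 32 pixel size with the WordNet noun attached as its label, unreviewed:

```python
# Illustrative sketch only -- not MIT's actual pipeline.
# 'search_image_urls' is a hypothetical placeholder for a search-engine API.
from io import BytesIO

import requests
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')
from PIL import Image


def collect_tiny_images(max_per_noun=1000):
    # Every noun in WordNet becomes a search query, with no human review.
    nouns = {lemma.name() for synset in wordnet.all_synsets('n')
             for lemma in synset.lemmas()}
    dataset = []
    for noun in nouns:
        for url in search_image_urls(noun)[:max_per_noun]:  # hypothetical helper
            img = Image.open(BytesIO(requests.get(url, timeout=10).content))
            tiny = img.convert('RGB').resize((32, 32))  # shrink to 32x32 pixels
            dataset.append((tiny, noun))                # the noun becomes the label
    return dataset
```

Because the label is copied straight from the word list, any offensive noun in WordNet ends up as a category in the finished dataset.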
The training set has been used at MIT to teach machine learning models to automatically identify people and objects in still images using these terms.
For example, a trained neural network may be able to identify a pleasant scene of a park with words such as 'picnic', 'grass' and 'trees'.
But the dataset's unpleasant side means it may also identify women in the scene as 'whores' or black and Asian minorities with racial slurs.
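To illustrate how such labels surface downstream (a rough sketch under assumed names, not any particular product's code), a classifier trained on the dataset simply returns whichever of its learned class names scores highest for a photo, so any offensive term baked into the label list can come straight out at prediction time. The `model` and `class_names` below are hypothetical:

```python
# Rough sketch of inference with an image classifier -- hypothetical model and labels.
import torch
from PIL import Image
from torchvision import transforms

# Whatever nouns the model was trained on become its output vocabulary.
class_names = ['picnic', 'grass', 'trees']  # the real dataset used 53,000+ WordNet nouns

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),   # match the tiny training images
    transforms.ToTensor(),
])


def describe(model, photo_path):
    image = preprocess(Image.open(photo_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        scores = model(image)  # one score per class name
    # The highest-scoring label is returned verbatim, whatever it says.
    return class_names[scores.argmax(dim=1).item()]
```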
MIT's community have therefore been asked to refrain from using the dataset in future and to delete any copies that may already have been downloaded.
'Biases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community – precisely those that we are making efforts to include,' the MIT professors wrote.
'It also contributes to harmful biases in AI systems trained on such data.
'Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community.

Heavily pixelated image samples taken from the dataset that were labelled with a highly offensive slur described as 'probably the most offensive word in English' by Dictionary.com
'This is extremely unfortunate and runs counter to the values that we strive to uphold.'
Two researchers – Vinay Prabhu at US privacy startup UnifyID and Abeba Birhane at University College Dublin in Ireland – examined the MIT database before it was taken offline and have prepared a research paper on their findings.
The team highlight the dangers of scraping images from the web under thousands of labels that have never been checked by a human eye and using them to train machine learning systems.
'The very aim of that [WordNet] project was to map words that are close to each other,' Birhane told the Register.
'But when you begin associating images with those words, you are putting a photograph of a real actual person and associating them with harmful words that perpetuate stereotypes.'