Know Your Data
Know Your Data is a tool to help researchers, engineers, product teams and decision makers understand datasets with the goal of improving data quality, as well as helping mitigate fairness and bias issues.
Explore Know Your Data for datasets supported by the TensorFlow Datasets API!
You can find documentation for the tool here.
This repository contains the code for the website and user documentation.
VGGFace2 Contains Mostly Copyrighted Photos
KYD mentions accepting Creative Commons licensed datasets, but the VGGFace2 dataset actually consists mostly of copyrighted photos. Many even carry a watermark over the image with the name of the photo agency.
Here are a few photos that even include the watermark:
https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000002%2F0034_01.jpg
https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000002%2F0072_02.jpg
https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&filters=kyd%2Fvgg_face2%2Flabel:n000026&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000026%2F0116_01.jpg
https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000006%2F0020_01.jpg
If you keep scrolling through the dataset, you'll notice hundreds, maybe thousands, of others from agencies such as Alamy and Shutterstock.
The only component released under a Creative Commons license is the list of images, not the images themselves. A Creative Commons license cannot override the existing copyright on the photos.
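For anyone who wants to collect the affected items into a list, the item links above appear to carry the dataset name and item ID in the URL fragment (the part after `#`). Below is a minimal sketch that extracts them using only the Python standard library. The helper name `parse_kyd_link` is hypothetical, and the fragment layout is inferred from the links in this report, not from any documented format.

```python
from urllib.parse import urlsplit, parse_qs


def parse_kyd_link(url):
    """Return the dataset name and item ID found in a Know Your Data link.

    Assumes the fragment is a query-style string (key=value pairs joined
    by '&'), as seen in the links quoted in this report.
    """
    fragment = urlsplit(url).fragment
    params = parse_qs(fragment)  # also decodes %2F back to '/'
    return {
        "dataset": params.get("dataset", [None])[0],
        "item": params.get("item", [None])[0],
    }


link = (
    "https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM"
    "&select=kyd%2Fvgg_face2%2Flabel&item=n000002%2F0034_01.jpg"
)
info = parse_kyd_link(link)
print(info["dataset"], info["item"])
```

Running the same helper over every reported link would give a plain list of item IDs that is easier to review than the full URLs.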
Citing this tool
I believe it would be a great addition to the documentation if you added a proper citation format for the tool. Is there a specific whitepaper? Should we reference People + AI Research, Google, or both?
Something like the citation guidelines for TensorFlow: https://www.tensorflow.org/about/bib
i_naturalist2017 has wrong labels
Attention: the labels in the i_naturalist2017 dataset are misaligned with the items they refer to. In its current state, the dataset is completely useless.
Can I create a database?
I have some remote sensing images, and it would be helpful to use this structure to insert some samples/classes to teach deep learning for our applications. Regards
Plans for local support?
Hi folks.
Is there any plan to make it possible to set up Know Your Data locally and run it on custom datasets? I think it might be doable if users could provide their datasets in the format TensorFlow Datasets expects.
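To illustrate what "the format TensorFlow Datasets expects" could mean for a folder of images, here is a minimal sketch. It assumes the `root/<split>/<label>/<image>` layout that, as far as I know, the `tfds.ImageFolder` builder reads. The helper `summarize_image_folder` and the demo labels are hypothetical, and nothing here shows that Know Your Data itself can load such a folder locally.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_image_folder(root):
    """Count images per split and label under root/<split>/<label>/<image>."""
    counts = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        labels = {}
        for label_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            labels[label_dir.name] = sum(
                1
                for f in label_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
        counts[split_dir.name] = labels
    return counts


# Build a tiny throwaway dataset with made-up labels to show the layout.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["train/forest/a.jpg", "train/water/b.png", "test/forest/c.jpg"]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    summary = summarize_image_folder(tmp)

print(summary)
```

A check like this only confirms that the directory structure is consistent before handing it to a TFDS builder; it does not validate the image contents.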