A tool to help researchers and product teams understand datasets with the goal of improving data quality, and mitigating fairness and bias issues.

  • By PAIR code
  • Last update: Dec 9, 2022
  • Comments: 5


  • 1

    VGGFace2 Contains Mostly Copyrighted Photos

    KYD mentions accepting Creative Commons licensed datasets, but the VGGFace2 dataset actually consists of copyrighted photos. Many even carry a watermark with the name of the photo agency.

    Here are a few photos that even include the watermark:

    https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000002%2F0034_01.jpg
    https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000002%2F0072_02.jpg
    https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&filters=kyd%2Fvgg_face2%2Flabel:n000026&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000026%2F0116_01.jpg
    https://knowyourdata-tfds.withgoogle.com/#dataset=vgg_face2&tab=ITEM&select=kyd%2Fvgg_face2%2Flabel&item=n000006%2F0020_01.jpg

    If you keep scrolling through the dataset, you'll notice hundreds, maybe thousands, of others from Alamy and Shutterstock.

    The only component under a Creative Commons license is the list of images, not the images themselves. A Creative Commons license cannot override an existing copyright.

  • 2

    Citing this tool

    I believe it would be a great addition to the documentation if you added a proper citation format for the tool. Is there a specific whitepaper? Should we reference People + AI Research, Google, or both?

    Something like the TensorFlow guidelines: https://www.tensorflow.org/about/bib

  • 3

    i_naturalist2017 has wrong labels

    Warning: in the i_naturalist2017 dataset, the labels are misaligned with the items they refer to, so in its current state the dataset is completely useless.

  • 4

    Can I create a database?

    I have some remote sensing images, and it would be helpful to use this structure to insert some samples/classes to teach deep learning for our applications. Regards

  • 5

    Plans for local support?

    Hi folks.

    Is there any plan to let users set up Know Your Data locally and run it on custom datasets? I think it might be doable if users could provide their datasets in the format TensorFlow Datasets expects.
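
On the point about putting a custom dataset into a form TensorFlow Datasets can read: below is a minimal sketch, assuming the images are already on disk and each has a known class label. It arranges files into the `root/<split>/<label>/<file>` layout that `tfds.ImageFolder` reads. The helper names `build_image_folder` and `load_with_tfds` are hypothetical and written only for this illustration, and nothing here confirms that Know Your Data itself can load such a folder.

```python
import shutil
import tempfile
from pathlib import Path


def build_image_folder(samples, root_dir, split="train"):
    """Copy (image_path, label) pairs into root_dir/<split>/<label>/.

    Hypothetical helper, not part of any library. Files that share a name
    within the same label directory overwrite each other.
    """
    root = Path(root_dir)
    copied = []
    for image_path, label in samples:
        src = Path(image_path)
        dest_dir = root / split / label
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied


def load_with_tfds(root_dir, split="train"):
    """Load the folder through tfds.ImageFolder.

    Assumes tensorflow_datasets is installed and the files are real
    images; the import is lazy so the layout step runs without it.
    """
    import tensorflow_datasets as tfds

    builder = tfds.ImageFolder(str(root_dir))
    return builder.as_dataset(split=split)


if __name__ == "__main__":
    # Demo with placeholder bytes instead of real images: it only shows
    # the resulting directory layout, not a loadable dataset.
    work = Path(tempfile.mkdtemp())
    source = work / "source"
    source.mkdir()
    (source / "tile_001.jpg").write_bytes(b"placeholder")
    (source / "tile_002.jpg").write_bytes(b"placeholder")

    build_image_folder(
        [(source / "tile_001.jpg", "forest"), (source / "tile_002.jpg", "water")],
        work / "dataset",
    )
    for path in sorted((work / "dataset").rglob("*.jpg")):
        print(path.relative_to(work / "dataset"))
```

With real image files in place, `load_with_tfds(work / "dataset")` would be the step that hands the folder to TensorFlow Datasets. The same layout would also fit the remote sensing use case raised in comment 4, with one directory per class.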