Dataset Factory - A Toolchain for Generative Computer Vision Datasets

Data-Centric AI brings new challenges in the cost and scale of data curation. Our latest tool, DataChain, solves these challenges where traditional MLOps tools fall short. This research paper discusses our approach.

  • Jeny De Figueiredo
    Daniel Kharitonov
  • March 25, 2024
  • 1 min read

The rapid proliferation of analytical and Generative AI solutions is raising the requirements for data versioning and data curation: dataset management tools must now understand the data they manage and be able to use its metadata for curation. Traditional MLOps toolchains, which remain blind to the content of the files they manage, cannot meet this goal. We solve this problem by introducing the next generation of Data-Centric AI software: DataChain.

We have been building DataChain for several years now and are happy to share some of the thinking and motivation that went into this product. For example, this paper, written by our Technical Product Manager Daniel Kharitonov and Customer Success Engineer Ryan Turner, was published at ICCV 2023. It explains the challenges of building generative computer vision datasets at scale and the benefits of using a tool like DataChain.

Dataset Factory: A Toolchain for Generative Computer Vision Datasets (Source link)

The following table summarizes the problems that arise in massive Computer Vision projects and how our latest tool solves them:

Unstructured Data Management Problems and Solutions

Read the full paper for a more in-depth discussion of the problems and solutions, as well as an example of the dataset factory approach using the LAION-5B dataset. While this paper focuses on a specific Computer Vision use case, the same approach works for all Unstructured Data workflows, including text, video, audio, GIS, and multi-modal. We would love to talk to you about your use cases and explore how we can help you master your unstructured data workflows. Reach out here to set up a meeting.

Do you have any use case questions or need support? Join us in Discord!

Head to the DVC Forum to discuss your ideas and best practices.
