August '22 Heartbeat
This month you will find:
🎙 Vanishing Gradients podcast,
👀 DVC used with Kaggle,
🏢 S3 locally with MinIO and DVC,
👯 Semantic similarity
® Iterative Studio Model Registry
🧑🏽💻 Internal Hackathon
🗣 IRL events,
🚀 New hires, and more!
- Jeny De Figueiredo
- August 16, 2022 • 8 min read
Welcome to the August Heartbeat! As we all soak in the remaining summer days, swing along in your hammock and take in all the great news from the Iterative Community!
If you are not familiar with Hugo Bowne-Anderson, you should be. He was the host of my all-time favorite Data Science podcast DataFramed while he was at DataCamp. DataFramed helped me immeasurably when I started my data science journey. It provided not only great teachings on many data science concepts but, even more importantly, the ability to gain perspectives from different people across all parts of the data space, talking about challenges, danger zones, and issues that we all need to be aware of in the field. Recently Hugo started a new podcast, Vanishing Gradients. This newer endeavor is in a somewhat different format than DataFramed, but still with Hugo's characteristic deep dive into all the challenges that come up when working with data. Hugo uses a long-format conversation approach with many leaders and great thinkers in the data science/machine learning/AI space. In episodes seven and eight, Hugo has a fascinating chat with Peter Wang, CEO of Anaconda, in which they talk about a number of topics including how Python became so big in Data Science, the emergence of open source collaborative environments, and the problems that the PyData stack solves. Then it gets really interesting as they dive into the open source model in the context of finite and infinite games and open source software as a "paradigm of humanity's ability to create generative, nourishing and anti-rivalrous systems." 🤯 Super interesting discussion and food for thought. I've already listened to both episodes twice. I highly recommend them and this new podcast in general.
Mikołaj Kania suggests that you upgrade your Kaggle competition workflow from the “spaghetti code” of Jupyter Notebooks and use the more mature way of creating reproducible ML results by using DVC here on his blog.
He notes that notebooks make it really hard to compare changes between runs. Instead, he suggests a workflow in which you create a branch for every major experiment type, experiment in each, and persist the best and most notable outcomes (good and bad). The best results are then submitted to Kaggle. You can find more about his workflow in his repo for the project.
DVC with Kaggle (Source link)
Mikołaj explains how DVC's project structure ensures reproducible results and builds good habits around best practices. One drawback he noted was the lack of an experimentation UI, but we just introduced the DVC extension for VS Code to help with that, and there’s always Iterative Studio. Look out for improvements to the experiment features in both tools in the coming months! Also, experimenting with DVC in Kaggle may give you some good practice for things we are cooking up internally! 😉🤫
Shambhavi Mishra, in her post Searching for Semantic Similarity, details the steps of her NLP project on similarity algorithms. She mainly focuses on cosine similarity using a Stack Overflow questions dataset. The end-to-end project uses Sentence-BERT, fastText, DVC, DAGsHub, and Streamlit, and deploys the web app on an AWS EC2 instance.
Once you follow all the steps, you will have computed the similarity between a search query and a database of texts and ranked all the texts by their similarity score to retrieve the text most similar to the query.
Understanding Cosine Similarity (Source link)
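To make the ranking step concrete, here is a minimal sketch of cosine similarity in plain Python. It is not the code from Shambhavi's project: the two-dimensional vectors are toy stand-ins for the sentence embeddings a model such as Sentence-BERT would produce, and the helper names `cosine_similarity` and `rank_by_similarity` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy "embeddings" (assumption: real ones would come from an embedding model).
corpus = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.1]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(f"text {index}: similarity {score:.3f}")
```

The first entry of `ranking` holds the index of the text closest to the query, which is the lookup a search app would perform against the stored texts.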
If you need object storage to work with data through an API, but have to do so in a private network, Evgenii Munin shows how to set up MinIO as remote storage with DVC to do just that in this piece on Medium. In this cool use case, he starts by installing the MinIO server and building a Docker image to run it, sharing a great repo on Kafka-to-S3 where MinIO was used to mock S3 for the data. Then he shows you how to link the MinIO server as DVC remote storage.
Minio Browser with Data pushed from DVC (Source link)
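As a rough sketch of that last step, the snippet below assembles the DVC CLI calls that would register an S3-compatible MinIO server as a remote. It only builds and prints the commands rather than running them, and it is not taken from Evgenii's article: the remote name, bucket, endpoint URL, and credentials are all placeholder assumptions, and the helper `minio_remote_commands` is hypothetical.

```python
import shlex


def minio_remote_commands(name, bucket, endpoint_url, access_key, secret_key):
    """Build the DVC commands that point an S3-type remote at a MinIO server."""
    return [
        # Register the bucket as the default remote using the s3:// scheme.
        ["dvc", "remote", "add", "-d", name, f"s3://{bucket}"],
        # Send S3 API calls to the MinIO endpoint instead of AWS.
        ["dvc", "remote", "modify", name, "endpointurl", endpoint_url],
        # Keep credentials in the local, untracked config file.
        ["dvc", "remote", "modify", "--local", name, "access_key_id", access_key],
        ["dvc", "remote", "modify", "--local", name, "secret_access_key", secret_key],
    ]


# Placeholder values (assumptions); replace them with your own setup.
commands = minio_remote_commands(
    name="minio",
    bucket="dvc-store",
    endpoint_url="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
)

for argv in commands:
    print(shlex.join(argv))
```

Running the printed commands inside a DVC repository, followed by `dvc push`, should upload tracked data to the MinIO bucket, provided the bucket already exists on the server.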
It can sometimes be confusing to determine where data science stops and machine learning engineering starts. Caleb Kaiser helps clarify this in this old but good piece in KDnuggets. He provides four examples of real-world projects and defines which portions of each project are data science and which are ML engineering. Overall, what we find is that machine learning engineering covers all the tasks that need to happen to get the models that data scientists create into production applications.
He goes on to dive deeper into one of the examples and shows the promise of some tools that bridge the gap between machine learning and software engineering, highlighting DVC and Hugging Face. This is a good piece to read if you are grappling with the difference!
- GitHub Goodness alert for Visual Data Preparation (VDP), an open-source visual data ETL tool to streamline the end-to-end visual data processing pipeline. Among the highlights: a fast way to build end-to-end visual data pipelines, pre-built ETL data connectors, and integration with DVC
- Jillian Rowe gave a shout-out to DVC in a recent episode of the Adventures in DevOps podcast, where they discuss the intersection of data and DevOps
- If you are interested in contributing to researchers' learning about machine learning experimentation tools, you can take this survey. Spread the word!
On July 26th we released our new model registry in Iterative Studio.
The great work done by the MLEM team building a git-based model registry is now incorporated in Studio in a web UI. This release took the work of half the people in the company and we are proud of the steps we are taking to meet people where they are and round out your options whether you are comfortable in the CLI, API, or web UI. Be sure to try it out and give us your feedback. Learn more in the blog post and in the docs. Look out for a full tutorial coming soon!
Last week we had our very first internal Hackathon! The entire company participated in the 48-hour computer vision challenge classifying dogs, cats, croissants and muffins. Part of the objective was to familiarize ourselves with and test a new tool that we are expecting to release later this year.
Eight teams competed for prizes for the best outcome, but also for the best integrations with other tools, the best dog, cat, croissant, and muffin photos from team members, and the best notes from the experience. I think the notes of our newest DevRel, Gema Parreño Piqueras, are in the running for the prize. (Learn more about Gema in the New Hires section below!)
Gema Parreño Piqueras' Hackathon notes (Source link)
See the members of the winning teams below. Team members Daniel Kharitonov and Jon Burdo organized the whole event and put together an extremely comprehensive document to help guide the teams. We are looking forward to more of these events in the future!
Dmitry also wrote a piece for The New Stack entitled Why We Built an Open Source ML Model Registry with Git. As the title suggests, it explains the why, along with learnings from our customers' use cases and the realization of the need for Model Registry as Code (MRaC), continuing our GitOps approach to tool building for machine learning.
If you haven't gotten a chance to make it to the conferences where David de la Iglesia Castro presented his popular talk or workshop entitled Making MLOps Uncool Again, you can now catch it on our very own YouTube channel! In this presentation you will learn how to build an MLOps workflow by extending the power of Git and GitHub with open-source tools DVC and CML. In the end, you will have an automated workflow that covers the entire lifecycle of an ML model, from data labeling to monitoring predictions. Find the repo for the project here. And the solution here.
Gema Parreño Piqueras joins our team from Madrid, Spain as a Developer Advocate. You may already be familiar with Gema if you've been taking our online course this summer because of the gorgeous notes she contributed per module. Gema was born and raised as an Architect (of buildings) but switched to tech a while back. She had her own video game start-up and has also worked as a Data Scientist in the Financial Industry. She has contributed to an open source StarCraft II ML project. Gema loves indie games, puzzles, and croquettes! She makes the 4th teammate from España! 🇪🇸
Marcin Jasion joins the team as a Senior Platform Engineer from Poland. He has been friends with team member Paweł Redzyński for years. When not working, he likes travelling, eating, and motorcycling, and is an avid cross-fitter. He also has a cat that likes to be a part of meetings! 🐈
Domas Monkus joins the CML team as an engineer from Lithuania. Before joining us at Iterative, Domas spent 10 years at Canonical working on juju, livepatch, and many internal projects. He's a husband and father with a house outside the hustle and bustle of the city, so he mentioned that lawn mowing is one of his main free time activities. 🏡
This week is AI4! Dmitry Petrov will give a talk as well as participate in a panel discussion on MLOps. If you are attending, stop by the booth and say hi or check out one of the in-booth demos we will have on our tools throughout the day.
Additional conferences we will be attending this year:
- Gema Parreño Piqueras and our lead docs writer, Jorge Orpinel Perez, will be heading to Mexico City August 31-September 1st for the LATAM AI Conference. Gema will give a presentation on experimentation in our new DVC extension for VS Code.
- Southern Data Science Conference in Atlanta, GA on September 8-9th.
- ODSC West in San Francisco
- Deep Learning World - Berlin
- MLOps Summit - Re-work - London
- Dmitry Petrov will be speaking at GitHub Universe on November 9-10!
- Toronto Machine Learning Summit - Toronto
We also will be reviving our virtual meetups this fall so be sure to join our group on Meetup.
Use this link to find details of all the open positions. Please share with anyone looking to have a lot of fun building the next generation of machine learning to production tools! 🚀
Iterative is Hiring (Source link)
- As noted above there are new docs for Iterative Studio's Model Registry
- In case you missed it, CML now supports Bitbucket! You can find the docs for the Bitbucket integration here.
- 💎 Don't miss July's Community Gems, full of great questions from the Community.
- Milecia McGregor provides a new tutorial for Serving Machine Learning Models with MLEM. Don't miss it!
#microwin of the day: Spoke at #GCCDBLR '22 (annual flagship event by @gdgcblr) about setting up effective #DataScience teams. And shared with everyone, how tools like #git @github @DVCorg @ProjectJupyter Jupytext and @vibinex can make it easier. #technology #startup #day38 pic.twitter.com/GBLXa9OGAO
Avikalp Kumar Gupta (@AvikalpGupta), August 8, 2022
It's also so great to have our new DVC extension shouted out by Harold Sinnott!
10 VScode extensions every data scientist should have💻🤖— Harold Sinnott 📲 #MWC24 (@HaroldSinnott) July 7, 2022
3. Python Indent
5. Jupyter notebook renderers
6. DVC - (ML model experiment tracking)
8. Todo MD
9. Excel viewer
10. Markdown preview GitHub styling
via @avikumart_ #AI #IoT
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.