January '22 Heartbeat
This month you will find:
🥰 Tutorials and workflows from the Community,
🗣 Upcoming Events,
📰 Data Science and AI News,
🧐 MLOps tool decision strategies,
😎 GitHub Awesomeness,
💻 Online Course is live,
🚀 Info on our growing team, and more!
- Jeny De Figueiredo
- January 18, 2022 • 7 min read
Happy New Year! Hope you got some good rest and stayed healthy at the end of 2021, because 2022 has lots of great things in store!
In Part 1 of his two-part series, Diego Jardim of Poatek takes us through the basics of MLOps and the stages of implementation and maturity of an MLOps pipeline. He closes by introducing us to some tools to help a team progress through these stages, which include DVC and CML.
In Part 2 he delves into more detail and code on how to set up version control of everything with DVC as well as automation of experimentation and reporting with CML. Finally, he uses FastAPI and Heroku for model serving and deployment. You can find all the scripts for the project in this GitHub repository.
MLOps: A Complete Hands-On Tutorial
Carl W. Handlin Wallace of RappiBank wrote a great article for their company Medium profile on the importance of reproducibility, AKA replicability, in science in general, and the challenges in Data Science in particular. As he points out, from Nature's survey, over half of all researchers have failed to reproduce even their own work, let alone that of another scientist. While initiatives like Papers With Code are helping to encourage reproducibility in the industry, there's still work to be done. He notes DVC as a part of the solution to this problem along with other tools to round out the whole picture. Check out the article for good food for thought and other resources!
Carl W. Handlin Wallace's Proposed Reproducibility Framework for Data Science (Source link)
Abid Ali Awan's article in KDNuggets guides you on how to create a smooth process to deploy a deep learning web application with Heroku. In the guide, he covers integration with DVC and optimizing storage using Docker, Git & CLI-based deployment, how to deal with error code H10, and tweaking Python packages to stay within the 500 MB Heroku limitation. If you've been looking for a way to create a deep learning web app, this may help!
In the very FIRST tutorial of DVC Studio from the Community, Amit Kulkarni reviews the setup process of DVC Studio and MLflow and how they ease the operational side of machine learning work by providing a clear way to track all the factors that go into the iterative process. Amit covers the easy setup process, adding a view, model comparison, and running experiments from the DVC Studio UI.
Amit Kulkarni's DVC Studio tutorial (Source link)
In case you missed it, we now have an Awesome Iterative Projects Repository. This repository is a list of projects relying on Iterative tools to achieve awesomeness. Recent additions to the list include:
- zincware/ZnTrack: Create, visualize, run & benchmark DVC pipelines in Python & Jupyter notebooks.
- nvim-dvc: Neovim plugin for DVC.
We'd love to see more of the Community's awesome work added to this list. Feel free to submit your project!
Other repos that came across my radar this last month that may be of interest to our Community:
- An Awesome List of Awesomes: an aggregation of all the Awesome lists
- Awesome MLOps: an awesome list of references for MLOps.
- Project Atlas - São Paulo: a Data Science and Engineering initiative that aims to develop relevant and curated Geospatial features of São Paulo, Brazil (includes DVC).
- NN Template: Generic template to bootstrap your PyTorch project (includes DVC).
Last month I told you about Thoughtworks' guide to MLOps Platforms. If you prefer video content, you may like this webinar from Ryan Dawson on CD4ML covering the process of identifying the best tools for your team's needs.
Ryan Dawson's MLOps tool evaluation process (Source link)
Dean Pleban, CEO of DAGsHub, also gave a great talk at DevOpsDays Tel Aviv on a decision-making framework for choosing your tools. In this talk you will learn guidelines and mental models that will help you choose tools at whatever stage of the process you are in.
Rob Toews wrote 10 AI Predictions for 2022 for Forbes. In it he predicts more startups getting funded in NLP than any other category, reinforcement learning to become increasingly important, the rise of synthetic data, and powerful new AI tools being built for video. My favorite prediction:
"'Responsible AI' will begin to shift from a vague catch-all term to an operationalized set of enterprise practices."
That's good news!
10 AI Predictions for 2022
You may remember Chip Huyen from MLOps Tooling Landscape v2 and DVC's inclusion in her Machine Learning Systems Design Lecture series. At the turn of the new year, she published a new blog post entitled Real-time machine learning: Challenges and Solutions. The article describes what she learned from working with approximately 30 companies in different industries doing real-time machine learning. She describes the online prediction processes of batch prediction and streaming prediction.
Additionally, she discusses continual learning, the difference between stateless retraining (the model is trained from scratch each time) and stateful training (the model continues training on new data), and moving from a manual process to a more automated one. Definitely worth a read, and we believe DVC and CML can help you with your stateful training!
She and her team are running a survey to better understand the adoption and challenges of real-time ML. We encourage your participation!
Chip Huyen's Stateless vs. Stateful Training (Source link)
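To make the stateless vs. stateful distinction concrete, here is a minimal sketch in plain Python. It is not taken from Chip Huyen's post: the toy linear model, the function names (`sgd_fit`, `stateless_retrain`, `stateful_update`), and the synthetic data are all assumptions chosen for illustration.

```python
def sgd_fit(data, w=0.0, b=0.0, lr=0.05, epochs=500):
    """Fit y = w*x + b with plain per-sample SGD, starting from (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def stateless_retrain(all_data):
    # Stateless: ignore any previous model and train from scratch
    # on the full history every time.
    return sgd_fit(all_data)


def stateful_update(checkpoint, new_data, epochs=50):
    # Stateful: resume from the saved parameters and train only
    # on the data that arrived since the last checkpoint.
    w, b = checkpoint
    return sgd_fit(new_data, w=w, b=b, epochs=epochs)


# Hypothetical toy data drawn from y = 2x + 1 (noise-free for simplicity).
old_batch = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
new_batch = [(1 + i / 10, 2 * (1 + i / 10) + 1) for i in range(10)]

checkpoint = sgd_fit(old_batch)                        # model trained last time
from_scratch = stateless_retrain(old_batch + new_batch)
continued = stateful_update(checkpoint, new_batch)

print("stateless:", from_scratch)
print("stateful: ", continued)
```

In this sketch the stateful path touches only the new batch, so it does less work per update, but it depends on having a versioned checkpoint to resume from.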
If you're interested in becoming a machine learning engineer and you're not familiar with Vicki Boykis, you should be. She has an amazing blog with years of well-written, funny, technical content on machine learning. Her latest piece entitled Git, SQL, CLI tells why she thinks these three tools are fundamental tools for any technical job. We think so too.
You can register for the FREE new course here on the Iterative website. The course is currently in beta mode. We already have some things we are working on to make it even better, but we would love your feedback! 🙏🏼 So far we have had some minor glitches and a lot of positive feedback! But we want your critiques too!
Whoever can give us feedback on any three modules by February 6th will receive some fresh new swag!
We are already planning our next course!
Our Senior Developer Advocate Maria Khalusova wrote a tutorial piece on exp init and experiment versioning entitled Versioning Machine Learning Experiments vs Tracking Them. The command helps you quickly set up a pipeline and codify your experiments with all of the factors that contributed to each of them, including data, code, pipeline, model version, and all hyperparameters. This goes a step beyond other experiment tracking tools and enables you to achieve true reproducibility.
Versioning Machine Learning Experiments vs Tracking Them
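As a rough illustration of what it means to version an experiment rather than only track its metrics, the sketch below derives one identifier from the data, the code, and the hyperparameters together. This is not how DVC implements experiment versioning; the function name `experiment_id` and the inputs are hypothetical, and only the Python standard library is used.

```python
import hashlib
import json


def experiment_id(data: bytes, code: str, params: dict) -> str:
    """Return a short fingerprint covering data, code, and hyperparameters."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(data).digest())           # dataset contents
    h.update(hashlib.sha256(code.encode()).digest())  # training code
    # sort_keys makes the fingerprint independent of dict ordering
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()[:12]


# Hypothetical inputs for two runs that differ only in one hyperparameter.
data = b"x,y\n1,2\n3,4\n"
code = "model.fit(X, y)"
run_a = experiment_id(data, code, {"lr": 0.01, "epochs": 10})
run_b = experiment_id(data, code, {"lr": 0.02, "epochs": 10})

print(run_a, run_b)
```

Because every input feeds the fingerprint, changing any one of them yields a new identifier, which is the property that lets an experiment be reproduced rather than merely logged.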
We have a few new team members this month!
Daniele Trifirò is our first team member from Italy! He joins us as a Senior Software Engineer. Daniele has a background in Physics/Astrophysics and worked for 4 years as a researcher in the LIGO Scientific Collaboration before moving on to positions at Cloudian and illimity. It was at illimity where he "fell in love" with DVC! In his free time Daniele likes listening to music and sometimes playing it himself, as well as rock climbing. 🧗🏼‍♂️
Thomas Kunwar is a software engineer joining the team from Nepal. He's been working as a fullstack developer specializing in the MERN stack and has led a team on multiple projects. In his free time Thomas enjoys trekking, watching and playing sports, watching movies, and learning. Welcome Thomas! 👏🏼
Madhur Tandon joins our team as a Software Engineer from Delhi, India. He is active in open source, and some of his best-known contributions are to projects such as Pyodide (the Python Scientific Stack compiled to WebAssembly) and JupyterLite (a Jupyter distribution running in the browser). He has also been a speaker at PyData and JupyterCon. Talk to him about his solo trip to SF, his experiences at Mozilla, or about books, Indian governance, food, and crypto. When not working, he is working out! 💪🏼
Even with these amazing new additions to the team, we're still hiring! Use this link to find details of all the positions and share with anyone you think may be interested! 🚀
Iterative is Hiring (Source link)
Be sure to join us at the January Office Hours Meetup, where Gennaro Todesco, Senior Data Scientist at Billie.io, will present his workflow with DVC and CML. Tezan Sahu will follow, presenting a workflow from a series of his tutorials that we shared in the September Heartbeat, including DVC, PyCaret, MLflow, and FastAPI.
January Office Hours Meetup - 2 workflows
Don't miss Milecia McGregor at the upcoming Conf42 on January 27th! She will be presenting her talk on Using Reproducible Experiments To Create Better Machine Learning Models. If you haven't caught this talk yet, now's the time!
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.