February '22 Heartbeat
This month you will find:
🥰 Tutorials and workflows from the Community,
🗣 Upcoming Events,
📰 DVC helps in COVID research,
🧐 More MLOps tool decision strategies,
😎 GitHub Goodness and Integrations,
💻 Online Course is live,
🚀 Info on our growing team, and more!
- Jeny De Figueiredo
- February 17, 2022 • 7 min read
This month's Heartbeat image is inspired by Community member Daniel Barnes.
Daniel has been a great contributor to CML. He helps folks out with questions in Discord and frequently attends our Meetups. This image is inspired by his GitHub profile image and the fact that he used to be a competitive paraglider. His record is 9.5 hours in the air! 😳 Many thanks to Daniel for his contributions to the Community that keep us all flying high! 🪂
The year is already flying by! Check out what's new this month!
So let me guess, still overwhelmed with MLOps tool choices? This past month, Matt Squire of Fuzzy Labs.ai reviewed their Awesome Open Source MLOps repo in this blog and this video. Matt breaks down the tool space into three categories: SaaS platforms, fully open source tools, and partly open source tools. He describes how they define open source and why they think it is the best choice in the MLOps space: it is flexible, ownable, cost-effective, and agile.
"Turn key solutions quickly become inflexible." - Matt Squire
Fuzzy Labs, a small AI company in Manchester, England, needed flexibility in their work with clients, so they did a deep dive into MLOps tooling and established an MLOps platform that meets their open source and flexibility criteria. This stack includes our own DVC, as well as Sacred, ZenML, Seldon Core, and Evidently AI.
The blog and the video are definitely good material to review if you're choosing your ML tools.
Continuous Machine Learning on Huggingface Transformer with DVC including Weights & Biases Implementation and Converting Weights to ONNX.
As the title would suggest, this jam-packed article from Nabarun Barua and Arjun Kumbakkara focuses on how CML can be implemented in an NLP project. They assume knowledge of DVC, Transformers, ONNX, and Weights & Biases, so be ready to take your skills to the next level by automating parts of the process with CML.
They begin with the all-important setup: an AWS IAM user with EC2 & S3 Developer access, an S3 bucket to store the dataset, and a request for an EC2 spot instance. They then continue with a detailed description of all the stages of the project, outlining the use of all the tools, including DVC Studio. You can find the repo for the project here. We're looking forward to the next installment from Nabarun and Arjun on a Dockerized Container Application cluster with Kubernetes Orchestration. 🍿
Total architecture with the Training, Deployment, and Retraining Pipelines in the same order. (Source link)
In case you missed it in our Twitter feed, a group of scientists published an article in the journal Scientometrics entitled Discovering temporal scientometric knowledge in COVID-19 scholarly production. The authors, Breno Santana Santos, Ivanovitch Silva, Luciana Lima, Patricia Takako Endo, Gisliany Alves, and Marcel da Câmara Ribeiro-Dantas, used DVC to create a reproducible workflow that combines machine learning and Complex Network Analysis techniques to extract implicit and temporal knowledge from scientific production bases on COVID-19.
"The presented methodology has the potential to instrument and expand strategic and proactive decisions of the scientific community aiming at knowledge extraction that supports the fight against the pandemic."
We are so happy to be helpful in the fight against the pandemic! Be sure to check out the paper, and keep an eye out for a future Meetup where the authors present this work!
Discovering temporal scientometric knowledge in COVID-19 scholarly production (Source link)
People new to the data science/ML space are often overwhelmed by all that there is to learn and by figuring out the path to get there. When I get this question from Community members, I always have the same advice: try to figure out what part of DS/AI is most interesting to you, and then work on building your skills toward that. In this article on the 10 Most Important Jobs for ML Products in 2022, Ágoston Török does a great job of defining the different roles in the space, how they interrelate, and how they show up in the product development process at AI companies. See his breakdown of the roles above, with rows defining the stage and columns defining the aspects the roles focus on. If you find you are drawn to the space where DS prototypes become the software product, then you may want to check out our new course! 😉
Diving deeper into these roles, the team was abuzz recently, reviewing this slide deck on Engineering Best Practices for Machine Learning by Alex Serban. In it, Alex discusses the challenges of creating software from machine learning projects, the differences between these projects and traditional software development, and the need for developing robust and ethical practices. He and his colleagues, Koen van der Blom, Holger Hoos, and Joost Visser, created a survey to determine current adoption of best practices in the industry. Along with a great review of the survey results, the slides point to a number of resources, including the corresponding Awesome list, a Catalog of Best ML Engineering Practices, and their project website for more information on the whole project. Definitely worth your review! ✅
29 Machine Learning Engineering practices ranked by adoption (Source link)
Are you in need of ethically sourced audio or video data for your ML project? Twine has created a way to accomplish this, while simultaneously freeing ML teams of the project management lift associated with the collection of these datasets.
You can learn more about Twine's efforts in ethical data collection through these articles: The Importance of Ethically Sourced Data, Bias in Data Collection, Collecting Diversity Data: How to Ensure an Inclusive Workforce, and The Hidden Costs of Bad Data. Twine also provides 100 open audio and video datasets for anyone working on these types of projects. Check it out! 👇🏽
Twine Ethically Sourced Datasets
Are you interested in battery technology and in participating in a Hackathon using battery data? Battery technology is growing quickly as the world looks to solve some of its emissions issues with electric vehicles. Additionally, the demand for electric vehicles is outpacing the manufacturers' ability to supply the needed batteries. Datasets in the space are kept proprietary as companies work independently to develop patents. BatteryDEV 2022 aims to accelerate battery innovation through open source competitions. This year they are expecting 300 participants for the event, which runs March 20-26. Community member Raymond Gasper is one of the organizers of Battery.dev and is creating a DVC template for participants to use during the Hackathon. You can register for the event here!
BatteryDEV 2022 Hackathon
Dmitry Petrov talked to Swapnil Bhartiya recently about how experiment versioning can help to solve the big problems of the AI/ML world. In this interview you will learn how experiment versioning tracks everything you need for a particular experiment so that the result is reproducible from prototyping to production. This solution enables data science and engineering teams to work more productively together.
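To make the idea a bit more concrete, here is a minimal Python sketch of what "tracking everything" for an experiment can look like: the parameters, data, and code of a run are hashed into one short ID, so any change to any of them yields a different version. This is an illustration only and not how DVC implements experiment versioning; the function name `experiment_fingerprint` and its inputs are hypothetical.

```python
import hashlib
import json


def experiment_fingerprint(params: dict, data: bytes, code: str) -> str:
    """Return a short, stable ID derived from the inputs that define a run.

    Hypothetical helper for illustration; not part of DVC.
    """
    digest = hashlib.sha256()
    # Sort keys so the same parameters always serialize identically.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    # Hash data and code separately so their boundaries cannot blur together.
    digest.update(hashlib.sha256(data).digest())
    digest.update(hashlib.sha256(code.encode("utf-8")).digest())
    return digest.hexdigest()[:12]


# Made-up inputs standing in for a real dataset and training script.
baseline = experiment_fingerprint({"lr": 0.01, "epochs": 5}, b"train.csv bytes", "train()")
rerun = experiment_fingerprint({"epochs": 5, "lr": 0.01}, b"train.csv bytes", "train()")
tweaked = experiment_fingerprint({"lr": 0.02, "epochs": 5}, b"train.csv bytes", "train()")

print(baseline == rerun)    # identical inputs give the same ID
print(baseline == tweaked)  # a changed parameter gives a different ID
```

If two runs share an ID, they were defined by the same parameters, data, and code, which is what lets a result be reproduced later.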
Be sure to join us at the March Office Hours Meetup, where Fabian Zills, a PhD student at the University of Stuttgart, will present his ZnTrack project, which creates, runs, and benchmarks DVC pipelines in Python and Jupyter Notebooks.
March Office Hours - ZnTrack
We are extremely excited to welcome our new Director of Engineering, Oded Messer. Oded lives in Israel and plans to pour his time and attention into the people/processes/structures of the engineering org to facilitate healthy growth and culture. 💗 He brings hands-on and managerial industry experience in the backend/tooling/infra and MLOps domains (ex. Intel and Iguazio). In his spare time, Oded remembers traveling as a favorite activity and also admits to being a sci-fi geek. He's in good company here! 😉
We welcome Alex Kim who joins us as a Field Data Scientist from Montreal, Canada. Alex's previous professional experience has been at the intersection of Software Engineering and Data Science across a few different industries. He has also done consulting work to develop Data Science curriculums for EdTech companies. Alex speaks Russian and a little French in addition to English. In his free time, Alex likes to bake, his specialty being pizza! 🍕
We now have three Alexes on the team to match our three Davids!
Jesper Svendsen joins the team as a Platform Engineer from Denmark. Previously, Jesper worked as an SRE for Evaxion Biotech (another ML-driven company). Prior to that, he was a self-employed IT consultant, where he did full-stack development. Jesper's hobbies include reading books (particularly medicine and psychology books), weightlifting, running, and photography. 📸
Jesper is the eighth employee to join Iterative.AI whose name starts with the letter 'J.' I thought this was odd, as words that start with 'j' have one of the lowest frequencies in the English language. But as it turns out, 'J' is one of the more common first initials.
Gabriella Caraballo joins Iterative as a Backend Engineer. She is originally from Venezuela, but is currently living in Canada! Programming was a hobby that became a professional path for Gabriella. She loves everything related to security, privacy and open source. In her free time, Gabriella enjoys cooking and eating, playing video/board games, crocheting, photography, and music. Now that she's in Canada, she has added skiing to her hobbies! ⛷
Even with these amazing new additions to the team, we're still hiring! Use this link to find details of all the positions and share with anyone you think may be interested! 🚀
Iterative is Hiring (Source link)
With tools like @DVCorg & @TheRealDAGsHub you can easily share, review & reproduce/reuse your work. (Gift Ojeabulu, @GiftOjeabulu_, February 7, 2022)
Just like how git makes software development smooth for software developers that's how tools like DVC make reproducibility smooth for ML Engineers.
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.