November '22 Heartbeat
This month you will find:
❓ Will NLP have more impact than Computer Vision,
🐙 Dmitry Petrov speaks at GitHub Universe,
🧐 CML in research at NeurIPS,
❣️ Unstructured Data Catalog coming,
✅ SOC 2 Type 1 Compliance,
🚀 MLEM adds SageMaker and Kubernetes deployment,
👀 Lots of new docs,
🚀 Upcoming events, and more!
- Jeny De Figueiredo
- November 18, 2022 • 8 min read
Image generated with the help of Stable Diffusion
Welcome to November! In the US, this is the time of year we reflect and give thanks. It's been a productive year despite the world's rather extreme challenges. There's lots to be thankful for. Here are some of those things from the last month in the Iterative Community.
In his article The Biggest Opportunity In Generative AI Is Language, Not Images, Robert Toews argues that AI-powered text generation will create many orders of magnitude more value than text-generated images.
Language is humanity’s single most important invention. More than anything else, it is what sets us apart from every other species on the planet. Language enables us to reason abstractly, to develop complex ideas about what the world is and could be, to communicate these ideas to one another, and to build on them across generations and geographies. Almost nothing about modern civilization would be possible without language.
He points out the many examples from a variety of industries and academia that have gained and will continue to gain massive improvements due to the power of large language models (LLMs) in the coming years. Read the article for all the applications.
The State of AI Report is published each year and covers the most interesting things its authors, Nathan Benaich, Ian Hogarth, Othmane Sebbouh, and Nitarshan Rajkumar, come across in the world of AI throughout the year.
- Slide 22: Mirroring the ideas of the Toews article above, this slide discusses the LLM use case of conversational code generation. OpenAI's Codex, which powers GitHub's Copilot, was on display at the recent GitHub Universe. Other companies, including Salesforce, Google, and DeepMind, are working on code-generating projects of their own. Google's LLM PaLM stands out for reaching performance similar to Codex with 50x less code in its training data, while DeepMind's AlphaCode generates whole programs rather than individual lines of code.
- Slide 24: Continuing to echo Toews' article, LLMs in research are greatly improving their mathematical abilities, jumping to far better scores than previous model versions. The slide also discusses the techniques that helped achieve these gains.
- Slides 30 and 31: Challenging Toews' stance, these slides show the great progress in Computer Vision. Diffusion models are doing more than just text-to-image generation: they are now being used for text-to-video, text generation, audio, molecular design, and more. Slide 30 covers the techniques now being used. Slide 31 discusses the huge improvements in the next generation of competing text-to-image models, including DALL-E, Imagen, and Parti.
Be sure to digest the whole report for even more AI advances!
💓 So for our “Pulse check” this month:
Do you agree that NLP will have more impact than computer vision? Tell us about what you are working on with NLP. We’d love to get you connected with others struggling with similar issues and know how we can improve our tools to help you with your NLP projects.
Join us in the #general channel in Discord to weigh in.
We would like to thank Francesco Calcavecchia, vvssttkk, and deepyaman for their contributions to GTO, MLEM, and CML respectively. They will be receiving their own personalized shirts that note their contributions! And many thanks to Mert Bozkir for leading the Hacktoberfest charge here at Iterative!
One of our Community Champions, João Santiago of Billie.io, gives an introduction to DVC in preparation for the remainder of the session, where Carsten Behring, author of Metamorph and the scicloj.ml platform, presents how NLP pipelines can be managed with DVC, Clojure & Python.
Last month we reported on CML turning up in research. This work will be presented at the virtual workshop Challenges in Deploying and Monitoring Machine Learning Systems at NeurIPS this year on December 9th. Find out more and register here.
Research on CML to be presented at NeurIPS (Source link)
Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new solution for finding and managing your datasets of unstructured data like images, audio files, and PDFs! Extend your DVC environment with the first data catalog and query language (SQL->DQL) for unstructured data and machine learning. Learn more on our website and/or schedule a meeting with us!
In case you missed it, MLEM announced a release on Halloween! MLEM now supports SageMaker and Kubernetes in addition to Heroku and Docker. You can package your models for deployment with only a few lines of code and never have to get lost in Kubernetes docs again! Find the blog post here and be sure to visit the docs!
We are very excited to announce that Iterative is now SOC 2 Type 1 compliant. This certification signals to our customers our commitment to Security, Availability, Processing Integrity, Confidentiality, and Privacy within our organization. We successfully endured the rigorous process and learned a lot as a team along the way. Guro Bokum reviews the five key learnings in this blog piece. You can find the full report on our Security and Privacy page.
On November 8th, our CEO, Dmitry Petrov, spoke at GitHub Universe on ML with Git: experiment tracking in Codespaces. In his presentation, he shows how to use the DVC extension for VS Code and Codespaces to streamline your machine learning experimentation process. You can find his video below in the event platform if you are registered. We expect the video to be available on YouTube in the next couple of months. We'll keep you updated!
Jupyter Notebooks are great for prototyping, but eventually, you will want to move toward reproducible experiments. Converting a notebook to a DVC pipeline requires a bit of a mental shift. Rob de Wit shows you how to accomplish it with an intermediate step: use Papermill to build a one-stage DVC pipeline that executes our entire notebook, and use the resulting pipeline to run and version ML experiments. Look out for a future post with a more advanced pipeline!
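As a rough illustration of the one-stage idea, here is a minimal sketch that assembles a `dvc stage add` command whose only job is to run a notebook through the Papermill CLI. The helper name, stage name, and file paths are hypothetical and not taken from Rob's post; the actual setup there may differ.

```python
import shlex


def papermill_stage_command(notebook, executed_notebook, stage_name="run_notebook"):
    """Build a `dvc stage add` command that runs a whole notebook with Papermill.

    All names and paths are illustrative assumptions, not the post's own values.
    """
    return [
        "dvc", "stage", "add",
        "-n", stage_name,         # name of the single stage in dvc.yaml
        "-d", notebook,           # dependency: rerun when the notebook changes
        "-o", executed_notebook,  # output: the executed copy of the notebook
        # The stage command itself: papermill <input> <output>
        "papermill", notebook, executed_notebook,
    ]


cmd = papermill_stage_command("notebook.ipynb", "outputs/notebook_out.ipynb")
# Print a copy-pasteable shell command instead of running it here.
print(shlex.join(cmd))
```

Once a stage like this exists, `dvc exp run` can execute it as an experiment. Data files and parameters would normally be declared as extra dependencies, which this sketch leaves out.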
At our next meetup on December 14th, Sami Jawhar will present An Open Discussion of Parallel data pipelines with DVC and TPI, an advanced use case for distributing experiments in the cloud. Sami is a great discussion driver. If you are interested in higher-level use cases you will want to join the discussion!
Sami Jawhar on Running Parallel Pipelines with DVC and TPI
On January 11th, Francesco Calcavecchia will join us to share his recent contribution to MLEM through his work on GTO, and how this helps him create a Git-based model registry in his work at E.On Energie Deutschland.
Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO
We had a great time at ODSC West! We had lively conversations with conferencegoers and attended excellent sessions. Dmitry had a packed room for his in-person talk, Why You Need a GitOps-based Machine Learning Model Registry, and Alex Kim presented CI/CD for Machine Learning virtually. At each of the conferences we've sponsored this year, we've had a game called Deevee's Ramen Run. (If you don't know the Ramen connection, you need to spend more time reading the monthly Heartbeats 😉). Find the top three winners of the game below.
We were also part of the MLOps Summit in London only a week later! Admittedly, there were different team members in attendance and staffing the booth. Aside from attending a variety of great talks, we met many wonderful people from all over the world. This resulted in some really interesting discussions about how different companies approach MLOps.
Casper da Costa-Luis gave a well-received talk on how to painlessly run ML experiments in the cloud with CML at the summit. The recording will be made available in the near future, so look out for that! The talk answered at least one of the questions of Deevee's Ramen Run, which yielded some surprised (but excited!) winners this time around.
Gema Parreño Piqueras presented at TechWeek in Spain with her talk Reproducibility and Version Control are Important: Follow up with the DVC extension for VS Code. She will be presenting the same talk at Codemotion. You can find her talk in Spanish at 2:02 below!
- We will be participating in the Toronto Machine Learning Summit on November 29-30 in Toronto.
- Alex Kim will present CI/CD for Machine Learning in an ODSC Webinar. Register here.
- We will be at PyData Eindhoven on December 2nd. Come say hi at the booth if you are attending! We have some tickets to give away for the event in Discord. First come, first served!
- We are sponsoring NormConf on December 15th. They will have Slack-based booths there. We are looking forward to supporting this new conference!
Stay tuned to our Newsletter for what we will be up to conference-wise in 2023!
The team has been busy improving the docs for you. See all the latest and greatest updates below.
- DVCFileSystem - DVCFileSystem provides a Pythonic, fsspec-compatible file interface for a DVC repo. It is read-only. DVCFileSystem provides a unified view of all the files and directories in your repository, whether Git-tracked, DVC-tracked, or untracked (in the case of a local repository). It can reuse the files in the DVC cache and can otherwise stream from supported remote storage.
- We’ve now added Horizontal bar plots to the mix of dvc plots show!
- You can now list contents from supported URLs with dvc ls-url. Find the description, options, and example code here.
- Based on some feedback we reorganized the User Guide to help you better navigate. Let us know what you think!
- Similarly, we reorganized the DVCLive documentation for better navigation.
- In CML you can now publicly self-host images with cml comment. Find the options here.
- Also, we’ve updated the self-hosted runners docs in CML.
- We've now added a guide for bringing your data to GitLab using DVC. Find the details in this doc.
- MLEM docs have received a nearly full overhaul.
- Additionally the Get Started section has been greatly improved.
- Look out for new docs to come out soon for GTO on the MLEM website.
- Iterative Studio now supports adding a model from a remote location. Find out more here.
- Use the new Iterative Studio Wizard to set up CML in your CI. More on the process and parameters here in the docs.
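To give a feel for the DVCFileSystem entry in the list above, here is a minimal sketch that lists and reads files through its fsspec-style interface. The helper names (`list_files`, `read_head`, `open_repo`) are hypothetical. The repository URL in the commented usage is Iterative's public example-get-started repo, and the file path there is a placeholder assumption. Check the DVCFileSystem docs for the full set of options.

```python
def list_files(fs, root="/"):
    """Return sorted file paths under `root` from any fsspec-compatible filesystem."""
    # `find` is part of the fsspec interface and walks the tree recursively.
    return sorted(fs.find(root))


def read_head(fs, path, n_bytes=100):
    """Read the first `n_bytes` of a file, wherever the filesystem fetches it from."""
    with fs.open(path, "rb") as f:
        return f.read(n_bytes)


def open_repo(url, rev=None):
    """Open a read-only view of a DVC repo (requires DVC to be installed)."""
    # Imported lazily so the helpers above work with any fsspec-style object.
    from dvc.api import DVCFileSystem

    return DVCFileSystem(url, rev=rev)


# Example usage (needs DVC installed and network access):
# fs = open_repo("https://github.com/iterative/example-get-started", rev="main")
# print(list_files(fs))
# print(read_head(fs, "path/to/some/file"))  # placeholder path
```

Because the helpers only rely on `find` and `open`, the same code works against other fsspec filesystems, which is much of the appeal of an fsspec-compatible interface.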
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.