It is well known that data quality and quantity are crucial for building Machine Learning models, especially when dealing with Deep Learning and Neural Networks. But besides the data required to build the model itself, there is another often overlooked type of data required to build a production-grade Machine Learning Platform: Metadata.
Modern Machine Learning platforms contain a number of different components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature Stores, and many more. Most of these components have associated metadata, including versioned datasets, versioned Jupyter Notebooks, training parameters, training and test accuracy of a trained model, versioned features, and statistics from model serving. For the DataOps team managing such production platforms, it is critical to have a common view across all of this metadata, because we have to answer questions such as: Which Jupyter Notebook was used to build Model XYZ, which is currently running in production? If there is new data for a given dataset, which models currently serving in production have to be updated? In this talk, we look at existing implementations, in particular MLMD (ML Metadata), which is part of the TensorFlow ecosystem.
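The two lineage questions above can be made concrete with a small sketch. This is a toy in-memory store written for illustration only, not MLMD's API: the class `LineageStore`, its methods, and the artifact IDs are all hypothetical. It assumes every artifact records which other artifacts it was built from, so both questions reduce to a walk over that graph.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LineageStore:
    """Hypothetical toy metadata store: artifacts are nodes, 'built from' links are edges."""

    # Maps an artifact ID to the IDs of the artifacts it was directly built from.
    inputs: dict = field(default_factory=lambda: defaultdict(set))
    # Maps an artifact ID to its kind, e.g. "dataset", "notebook", "model".
    kinds: dict = field(default_factory=dict)
    # IDs of the models currently serving in production.
    in_production: set = field(default_factory=set)

    def add(self, artifact_id, kind, built_from=()):
        """Register an artifact and the artifacts it was directly built from."""
        self.kinds[artifact_id] = kind
        self.inputs[artifact_id].update(built_from)

    def upstream(self, artifact_id):
        """Return every artifact the given artifact transitively depends on."""
        seen, stack = set(), [artifact_id]
        while stack:
            for parent in self.inputs.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def notebooks_for(self, model_id):
        """Which notebooks were used to build this model?"""
        return sorted(
            a for a in self.upstream(model_id) if self.kinds.get(a) == "notebook"
        )

    def production_models_affected_by(self, dataset_id):
        """Which production models depend on this dataset and may need retraining?"""
        return sorted(
            m for m in self.in_production if dataset_id in self.upstream(m)
        )


# Made-up example data: two models, only one of which is in production.
store = LineageStore()
store.add("dataset:clicks@v3", "dataset")
store.add("dataset:other@v1", "dataset")
store.add("notebook:train_ranker@v7", "notebook")
store.add(
    "model:xyz",
    "model",
    built_from=["dataset:clicks@v3", "notebook:train_ranker@v7"],
)
store.add("model:abc", "model", built_from=["dataset:other@v1"])
store.in_production.add("model:xyz")

print(store.notebooks_for("model:xyz"))
print(store.production_models_affected_by("dataset:clicks@v3"))
```

A real metadata store persists this graph in a database and records executions as well as artifacts, but the queries keep the same shape: start from one artifact and follow the recorded links upstream or downstream.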