
Admirer: Open-Ended VQA Requiring Outside Knowledge

Andrew Hinh and Aleks Hiidenhovi
Dec 23, 2022


For the ZenML Month of MLOps Competition, we created Admirer, a full-stack ML-powered website that uses users’ webcam feeds to answer open-ended questions requiring outside knowledge. Andrew built the MVP of the website as a top-25 final project for the FSDL 2022 course, writing only the deployment code. We continued the project for the ZenML Month of MLOps Competition and won the Most Promising Entry prize at the closing ceremony. During the competition, we wrote the data management, model development, testing, and continuous deployment scripts.

Here is a video summary of our submission.

How does it work?


The visual question-answering pipeline is inspired by this paper from Microsoft. In short, we prompt GPT-3 with a generated image caption and object tag list, the question, and in-context examples that demonstrate the task in a few-shot fashion, as shown in the diagram above. On a randomly selected test set, this approach achieves a BERTScore-computed F1 of around 0.989.
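To make the few-shot setup concrete, here is an illustrative sketch of how such a text-only prompt can be assembled. The context example, caption, and tag list below are placeholders, not Admirer’s actual prompts:

```python
# Hypothetical sketch of few-shot prompt assembly: in-context examples are
# concatenated ahead of the current image's caption, object tags, and
# question, and the completion model fills in the final "Answer:".

CONTEXT_EXAMPLES = [
    {
        "caption": "a man riding a bicycle down a street",
        "tags": ["man", "bicycle", "street"],
        "question": "What is the man doing?",
        "answer": "Riding a bicycle.",
    },
]

def build_prompt(caption, tags, question, examples=CONTEXT_EXAMPLES):
    """Concatenate few-shot examples with the current image description,
    producing a text-only prompt suitable for a completion model."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Caption: {ex['caption']}\n"
            f"Objects: {', '.join(ex['tags'])}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}"
        )
    # The final block ends at "Answer:" so the model completes it.
    blocks.append(
        f"Caption: {caption}\n"
        f"Objects: {', '.join(tags)}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return "\n\n".join(blocks)

prompt = build_prompt(
    caption="two people sitting in a living room",
    tags=["person", "hat", "glasses", "couch"],
    question="How many people are in the room?",
)
print(prompt)
```

The resulting string would then be sent to GPT-3’s completion endpoint; the image itself never reaches the language model, which is why caption and tag quality matter so much.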

How to use it?

As a direct consequence of not feeding the image data directly to GPT-3, the best queries involve asking descriptive, counting, or similar questions about one or more objects visible in the background. For example, if there are two people in the image, one wearing a hat and the other wearing glasses, questions that would work well could include the following:

  • “How many people are in the room?”
  • “What color is the hat in the picture?”
  • “How many people are wearing glasses?”

What did we do?

Since we had finished an MVP of the project for the Full Stack Deep Learning course, with a model deployed on AWS Lambda and served with Gradio, we decided to focus our competition work on the surrounding infrastructure. However, because of how the inference pipeline is structured and the limited time for the competition, we had to build that infrastructure around just one of the models in the pipeline. We chose the caption model, since it seemed to fail most often. As a result, the features that we built support for during the competition are the following:

  • Data storage using AWS S3
  • Data labeling using Label Studio (although in the end we used Scale AI for ~90% of the annotations)
  • Data preprocessing to prepare the annotated data for loading into a PyTorch dataset
  • Data validation using ZenML, S3, and Great Expectations
  • Hyperparameter tuning, model training, and experiment tracking using PyTorch Lightning and Weights & Biases
  • Model evaluation using torchmetrics
  • Load testing using Locust
  • Model monitoring using Gradio’s built-in flagging system
  • Continuous deployment to update the AWS Lambda backend whenever the pipeline’s performance improves
  • Model and pipeline testing to ensure that everything is working as expected
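The continuous deployment item above boils down to a metric gate: only push a new model to the Lambda backend when it beats the one in production. The sketch below is an illustrative stand-in, not the project’s actual script; `should_deploy`, `maybe_deploy`, and `min_delta` are hypothetical names:

```python
# Illustrative continuous-deployment gate: deploy a candidate model only
# when its evaluation metric improves on production by at least min_delta.

def should_deploy(new_f1: float, production_f1: float, min_delta: float = 0.001) -> bool:
    """Return True if the candidate beats production by at least min_delta."""
    return new_f1 >= production_f1 + min_delta

def maybe_deploy(new_f1, production_f1, deploy_fn):
    """Run deploy_fn (e.g., update the AWS Lambda backend) only if the gate passes."""
    if should_deploy(new_f1, production_f1):
        deploy_fn()
        return True
    return False

deployed = maybe_deploy(0.991, 0.989, deploy_fn=lambda: print("deploying..."))
```

Gating on a small minimum improvement rather than a strict inequality avoids redeploying on metric noise.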

Future Work

Given the limited time and money we had during the competition, there are two major points of improvement. The first is improving the performance of the separate models used in the inference pipeline. Since we use pre-trained Hugging Face models, we could collect more data from sources such as the GI4E project and fine-tune the models on it. However, this is extremely costly, and something we are unlikely to do. Another option is to train an end-to-end model, such as a ViT + GPT-2 encoder-decoder model, from scratch, as inspired by Sachin Abeywardana. Although still costly, this is probably the best (and likely the cheaper) of the two options for improving the inference pipeline’s performance, since the image data itself would then be incorporated into GPT’s answers.

The second is creating data, training, and testing pipelines with both Makefiles and ZenML pipelines. This would let us iterate on ideas to improve the pipeline’s performance more easily and add features such as continual training, an aspect of MLOps projects that ZenML discusses in detail here. It would also make our work easier for others to reproduce and test.
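The idea behind such pipelines can be shown without any framework: each stage is a plain function, and the pipeline composes them so a whole run is reproducible end to end. This is a framework-free sketch with illustrative stage names, not our actual ZenML code:

```python
# Framework-free sketch of a training pipeline: each stage is a plain
# function, and the pipeline function composes them in order. A framework
# like ZenML adds caching, artifact tracking, and orchestration on top.

def load_data():
    # Stand-in for pulling annotated samples from S3.
    return [("image_001.png", "A man wearing a hat ")]

def preprocess(samples):
    # Normalize captions before they are loaded into a dataset.
    return [(path, caption.lower().strip()) for path, caption in samples]

def train(dataset):
    # Stand-in for the actual PyTorch Lightning training step.
    return {"model": "caption-model", "num_samples": len(dataset)}

def evaluate(model):
    # Stand-in metric; the real pipeline computes a BERTScore F1.
    return {"f1": 0.989}

def training_pipeline():
    dataset = preprocess(load_data())
    model = train(dataset)
    return evaluate(model)

metrics = training_pipeline()
print(metrics)
```

Because every stage is an ordinary function, each one can be unit-tested in isolation, which is also what makes the Makefile targets straightforward to wire up.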

💭 Conclusion

Having recently built an MVP of the project for the Full Stack Deep Learning course, we were excited to build out the data management and training pipelines for this competition. We were not disappointed: we got the chance to get our hands dirty building out the pipelines, to showcase our work to numerous experts in the field and receive their feedback, and to win the “Most Promising Entry” award! We’d like to thank the ZenML team for organizing this competition and the judges for their invaluable feedback.

📚 Learn More

Below are some of the resources to learn more about the project:
