The performance of an image classification model can be measured by various metrics, such as accuracy, precision, recall, F1-score, and mean average precision (mAP). Accuracy, precision, recall, and F1-score can be derived from the confusion matrix, which compares the predicted labels with the true labels of the images, while mAP is computed from the model's per-class confidence scores [1].
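As a minimal sketch of how these confusion-matrix metrics are computed, the snippet below uses scikit-learn; the label arrays are illustrative placeholders, not real evaluation data.

```python
# Deriving the common classification metrics from predicted and true labels.
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [0, 2, 1, 1, 0, 2, 2, 1]   # ground-truth class indices (placeholder)
y_pred = [0, 2, 1, 0, 0, 2, 1, 1]   # model predictions for the same images (placeholder)

print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
```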
One of the best ways to monitor the performance of multiple versions of an image classification model on AI Platform is to compare their mean average precision using the Continuous Evaluation feature. Mean average precision summarizes a model's precision and recall across confidence thresholds and classes, which makes it especially useful for multi-class and multi-label image classification, where the model assigns one or more labels to each image from a set of possible labels. It ranges from 0 to 1, with higher values indicating better performance [2].
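The sketch below shows one common way to compute mean average precision for a multi-label classifier with scikit-learn: average precision per class, then the unweighted mean across classes. The ground-truth and score arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# One row per image, one column per class: binary ground truth and model scores (placeholders).
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]])
y_scores = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.7], [0.2, 0.3, 0.6], [0.7, 0.9, 0.2]])

# Average precision per class, then the mean across classes (mAP).
per_class_ap = [
    average_precision_score(y_true[:, k], y_scores[:, k]) for k in range(y_true.shape[1])
]
mean_ap = float(np.mean(per_class_ap))
print("AP per class:", per_class_ap)
print("mAP:", mean_ap)
```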
Continuous Evaluation is an AI Platform feature that automatically evaluates the performance of your deployed models using their online prediction requests and responses. It helps you monitor the quality and consistency of your models over time and detect issues or anomalies that may affect performance. It also provides evaluation metrics and visualizations, such as accuracy, precision, recall, F1-score, the ROC curve, and the confusion matrix, for classification, regression, and object detection models [3].
To compare the mean average precision across the models using the Continuous Evaluation feature, take the following steps:
Enable online prediction logging for each model version that you want to evaluate. This lets AI Platform collect the prediction requests and responses from your models and store them in BigQuery [4] (see the sketch after these steps).
Create an evaluation job for each model version that you want to evaluate. This lets AI Platform compare the predicted labels with the true labels of the images and calculate evaluation metrics such as mean average precision. You need to specify the BigQuery table that contains the prediction logs, the data schema, the label column, and the evaluation interval.
View the evaluation results on the AI Platform Models page in the Google Cloud console. You can see the mean average precision and other metrics for each model version over time, compare them using charts and tables, and filter the results by class and confidence threshold.
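As a hedged sketch of the first step, the snippet below creates a model version with request-response logging enabled through the AI Platform Training and Prediction REST API (the requestLoggingConfig field of a version resource). The project, model, bucket, table, and runtime values are assumed placeholders, not real resources.

```python
from googleapiclient import discovery

# Uses Application Default Credentials for the AI Platform ("ml", v1) API.
ml = discovery.build("ml", "v1")

version_body = {
    "name": "v2",
    "deploymentUri": "gs://example-bucket/image-classifier/v2",  # assumed model directory
    "runtimeVersion": "2.11",                                    # assumed runtime version
    "framework": "TENSORFLOW",
    "requestLoggingConfig": {
        # Sampled prediction requests/responses are written to this BigQuery table,
        # which the Continuous Evaluation job later reads from.
        "bigqueryTableName": "example-project.prediction_logs.image_classifier",
        "samplingPercentage": 1.0,
    },
}

request = ml.projects().models().versions().create(
    parent="projects/example-project/models/image_classifier",
    body=version_body,
)
print(request.execute())
```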
The other options are not as effective or feasible. Comparing the loss for each model on a held-out dataset or on the validation data is not a reliable approach, because the loss may not reflect the model's actual performance on online prediction data and depends on the choice of loss function and optimization algorithm. Comparing the receiver operating characteristic (ROC) curve for each model with the What-If Tool is not possible, because the What-If Tool does not support image data or multi-class classification problems.
References: [1] Confusion matrix; [2] Mean average precision; [3] Continuous Evaluation overview; [4] Configure online prediction logging. See also: Create an evaluation job; View evaluation results; What-If Tool overview.