This paper operationalizes a committee-based performance diagnostic framework that combines inter-model agreement, consensus entropy, and borderline rate to support interpretable monitoring of AI coding of student text when labeled data are unavailable. In a pilot application to nursing simulation reflections, these complementary metrics revealed distinct ensemble patterns, including stable consensus and divergence between agreement and decisiveness. The results illustrate how committee diagnostics can enable ongoing oversight of AI coding as systems encounter new learners, contexts, and language use at scale.
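
As a concrete reading of the three diagnostics, the sketch below computes them for a committee of binary coders. It is a minimal illustration under assumed conventions: the function name `committee_diagnostics`, the 0/1 vote encoding, and the `borderline_band` threshold are hypothetical choices, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def committee_diagnostics(votes: np.ndarray, borderline_band: float = 0.1):
    """Compute committee diagnostics for an ensemble of binary coders.

    votes: array of shape (n_models, n_items) with 0/1 codes per model.
    borderline_band: assumed half-width around an even split that
        marks an item as borderline.
    Returns mean pairwise agreement, mean consensus entropy (bits),
    and the borderline rate.
    """
    n_models, _ = votes.shape

    # Inter-model agreement: fraction of items on which each pair of
    # models assigns the same code, averaged over all pairs.
    pair_agreements = [
        np.mean(votes[i] == votes[j])
        for i, j in combinations(range(n_models), 2)
    ]
    agreement = float(np.mean(pair_agreements))

    # Consensus entropy: Shannon entropy of the committee's per-item
    # vote share, averaged over items (0 = unanimous, 1 = evenly split).
    p = votes.mean(axis=0)            # share of models voting 1 per item
    q = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    entropy = float(np.mean(-q * np.log2(q) - (1 - q) * np.log2(1 - q)))

    # Borderline rate: fraction of items whose vote share falls within
    # the band around 0.5, i.e. where the committee is nearly split.
    borderline = float(np.mean(np.abs(p - 0.5) <= borderline_band))

    return agreement, entropy, borderline

# Example: five models coding twelve reflections.
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(5, 12))
print(committee_diagnostics(votes))
```

Because agreement averages over model pairs while entropy and borderline rate summarize the per-item vote distribution, the three quantities can move independently, which is what allows agreement and decisiveness to diverge in the pilot results.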