Getting teacher evaluation right has the potential to drive significant improvements. The trouble is that so few places—so far—have gotten it right.
To the surprise of few, a recent working paper from the Annenberg Institute at Brown University found almost no positive impact from the teacher evaluation reforms that followed a major push by the U.S. Department of Education under the Obama Administration. Those findings have been getting a lot of attention. Some consider the paper conclusive evidence that these teacher evaluation reforms were ill-advised from the start. Others assert that the commitment on the part of states and districts was half-hearted all along (some may say they were arm-twisted by the feds) and therefore doomed to produce few positive outcomes.
Will this paper be the death knell for rigorous and meaningful teacher evaluation systems? It's too soon to say, but it's worth pointing out that the paper never asserts that these systems can't improve outcomes for both teachers and students, just that it appears remarkably hard to do so. In fact, the authors (Joshua Bleiberg, Eric Brunner, Erica Harbatkin, Matt Kraft, and Matthew Springer) fully acknowledge that some states and school districts did achieve great results, though there weren't many of them. 307
One thing is for sure: many of the states and districts in the study weren't paying close attention to the key principles a number of research studies established as necessary to produce positive outcomes. It's likely that these places where teacher evaluation reforms failed did not adhere to all of these research-based principles, either choosing to overlook them entirely or making compromises that effectively neutralized their capacity to do good.
These key principles remain as relevant now as ever, and we spotlight seven of them here.
In 2018, NCTQ highlighted four large school districts and two states where evaluation reforms had led to improvement in teacher quality, positive results that the Annenberg working paper confirmed. All six systems annually evaluated all teachers using both objective and subjective measures. That stands in contrast to the widespread practice by states and districts of exempting large numbers of teachers from yearly evaluation, using only subjective measures (such as teacher observation scores), or not giving significant weight to student learning.
What does this look like in practice?
Combining multiple measures of teacher performance improves predictive power and reliability of evaluation scores
The above graph shows that evaluation scores that combine multiple measures (including student achievement gains, classroom observations, and student surveys) have higher predictive power and reliability than any single measure alone. Figure 1: Combining Strengths, from Kane, T. (2012). Capturing the Dimensions of Effective Teaching. Education Next. https://www.educationnext.org/capturing-the-dimens. .
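As a rough illustration of what "combining multiple measures" can mean in practice, the sketch below standardizes each measure and takes a weighted average. The measure names, the sample data, and the 50/25/25 weights are all hypothetical and are not drawn from the Kane article or from any of the six systems. Choosing weights is a policy decision this sketch does not settle.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw scores to z-scores so measures on different scales can be combined."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No variation on this measure: every teacher gets a neutral score.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(measures, weights):
    """Return one weighted-average composite per teacher from standardized measures."""
    total_weight = sum(weights.values())
    z = {name: standardize(vals) for name, vals in measures.items()}
    n_teachers = len(next(iter(measures.values())))
    return [
        sum(weights[name] * z[name][i] for name in weights) / total_weight
        for i in range(n_teachers)
    ]


# Hypothetical data for four teachers (illustration only).
measures = {
    "achievement_gains": [0.10, -0.05, 0.20, 0.00],
    "observation": [3.2, 2.8, 3.6, 3.0],
    "student_survey": [4.1, 3.9, 4.5, 3.7],
}
# Hypothetical weights (illustration only).
weights = {"achievement_gains": 0.5, "observation": 0.25, "student_survey": 0.25}

scores = composite_scores(measures, weights)
```

Standardizing first keeps a measure reported on a 1 to 5 scale from swamping one reported in fractions of a standard deviation.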
Common practices of teacher evaluation systems that saw improvements in teacher quality and student outcomes
Studies over the last decade have shown that teachers both perceive evaluations to be more meaningful and see greater improvement in their practice when those doing the evaluating have been trained on the observation rubric, have more experience in and knowledge of the setting where teachers are being observed, and are familiar with the content their evaluees are teaching. If an evaluation system is going to provide the feedback teachers trust and need to improve, school and district leaders should:
Two of the oft-noted difficulties in conducting multiple classroom observations with each teacher are 1) the time commitment it requires from observers and 2) the strain it can place on in-school professional relationships. Luckily, research that examined the observations, feedback, and attitudes towards that feedback for over 400 teachers found that video observations could be a solution. When teachers recorded videos of their lessons and later watched and reflected upon them with their observers, it helped alleviate time constraints for administrators and supported more effective feedback discussions, with more positive perceptions of both the feedback and the process. It was also associated with improved retention for the teachers who used the video observations! 319
As more research conducted during the pandemic gets published, we should be seeing additional guidance about the effective and ineffective ways to use video in evaluation.
Teacher evaluations, particularly observation scores, may be subject to racial and gender biases. An important way to mitigate bias is through including multiple measures of performance in teachers' summative evaluation scores, but these biases can still exist and can result in unjust outcomes, particularly for teachers of color and for teachers of students of color.
New research out of Chicago does not seem hopeful at first glance. A study found significant bias against Black teachers in observations, and found that the bias was explained by which students these Black teachers are more likely to teach: students from low socioeconomic levels, with lower levels of achievement in reading, and with higher frequency of misconduct. 320 But the upside is that it suggests a way to mitigate these biases. In addition to including other measures of teacher quality to offset potential bias in observation scores, the researchers recommend that education leaders in districts with this kind of problem statistically adjust observation scores to account for student characteristics, just as a teacher's contributions to student learning are adjusted in a value-added measure.
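To make "statistically adjust observation scores" concrete, here is a minimal sketch of one common approach: fit a simple regression of observation scores on a student characteristic, then remove the part of each score that the characteristic predicts. This is an assumption about how such an adjustment could work, not the Chicago researchers' actual model. The single covariate, the function name, and the data are hypothetical, and a real adjustment would likely use several student characteristics at once.

```python
from statistics import mean


def adjust_scores(scores, covariate):
    """Remove the portion of each score that is linearly predicted by the covariate.

    Fits score = a + b * covariate by ordinary least squares, then subtracts
    b * (covariate - mean covariate), so the adjusted scores keep the
    original overall mean.
    """
    x_bar, y_bar = mean(covariate), mean(scores)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    if sxx == 0:
        # Covariate does not vary, so there is nothing to adjust for.
        return list(scores)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, scores))
    slope = sxy / sxx
    return [y - slope * (x - x_bar) for x, y in zip(covariate, scores)]


# Hypothetical data: observation scores for four teachers and the share of
# low-income students in each teacher's classroom (illustration only).
observation = [3.4, 3.1, 2.9, 2.6]
low_income_share = [0.2, 0.4, 0.6, 0.8]

adjusted = adjust_scores(observation, low_income_share)
```

In this made-up example, the raw scores fall steadily as the share of low-income students rises, while the adjusted scores are nearly flat. An adjustment like this cannot tell whether the raw gap reflects rater bias or real differences in practice, which is one reason to pair it with other measures.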
Another, less complex way to mitigate bias in observation scores is to require multiple observations by different observers for each teacher, which can make them a more reliable and accurate measure. 321
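One way to see why more observations help is the Spearman-Brown formula from classical test theory, which estimates the reliability of an average of several comparable ratings. The sketch below is only an illustration: it assumes each observation is an equally reliable, independent rating of the same teacher, and the single-observation reliability of 0.4 is a hypothetical figure, not one reported in the research cited here.

```python
def reliability_of_average(single_reliability, n_observations):
    """Spearman-Brown formula: reliability of the mean of n comparable ratings."""
    if not 0 <= single_reliability <= 1:
        raise ValueError("single_reliability must be between 0 and 1")
    if n_observations < 1:
        raise ValueError("n_observations must be at least 1")
    r = single_reliability
    return n_observations * r / (1 + (n_observations - 1) * r)


# Hypothetical single-observation reliability (illustration only).
single = 0.4
by_count = {n: round(reliability_of_average(single, n), 3) for n in (1, 2, 4)}
```

Under these assumptions, reliability rises with each added observation, but by smaller amounts each time. That diminishing return is why districts must weigh extra observations against the observer time they require.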
In the six exemplar teacher evaluation systems NCTQ analyzed, each tied the professional development a teacher should pursue to her evaluation results, as opposed to giving teachers open-ended choices. This finding is further supported by a meta-analysis of the effectiveness of performance pay systems that found that performance pay programs that are paired with professional development result in significantly higher student gains than those that are not. 322
For a teacher evaluation system to have a major impact on teacher quality and student learning, it needs to significantly reward high-performing teachers, encouraging them to continue teaching. That is a clear takeaway from a recent meta-analysis of over 40 research studies, which showed positive effects on student achievement, particularly in math, when individual teachers were eligible for performance-based bonuses. Importantly, more significant gains for students were found when the annual incentives for high-performing teachers were above 7.5% of their base pay (or, nationally, on average $5,000 a year). 323 Other research also suggests that 7% of teachers' base pay would be the minimum to be effective, whereas the most effective performance pay bonus should be 14%. 324
Why does a significant monetary incentive for teachers make a difference for students? A review of 120 studies on teacher attrition found that not only were more robust teacher evaluation systems associated with better retention of high-performing teachers, but participation in a performance pay system could decrease the probability of teachers (all teachers, not just high-performing ones) leaving the classroom by 24%, or nearly 15% in high-need schools. 325
Unfortunately, NCTQ research has found few school districts have adopted performance pay incentives for teachers, with even fewer of these districts making the incentives above the threshold research suggests is needed to make an impact.
Teacher evaluation reforms support improvements in teacher quality in part because they lead to less effective teachers exiting the profession 326 or dissuade people who are likely to be less effective from entering the profession. 327 While the first priority of identifying struggling teachers must be to provide them with targeted professional development and support for improvement, prioritizing student learning means that consistently low-performing teachers should be the first to be considered when layoffs are necessary and should ultimately be exited from the profession.
A study that examined teacher evaluation data and retention for over 20,000 teachers over five years in Chicago found that the implementation of their more rigorous teacher evaluation system increased the likelihood of exiting low-performing teachers by 50%. Additionally, the new hires who replaced these teachers were more effective on average, improving the overall quality of the teacher workforce. 328 Other evidence that incoming teachers are more effective than those exited after implementation of rigorous teacher evaluation has been reported by two different studies of teachers in Washington, D.C. 329
When properly designed and, even more importantly, when implemented with fidelity, a good evaluation system should be able to strengthen the teacher workforce by helping all teachers become more effective, motivating effective teachers to stay in the classroom, and informing decisions about who to exit from the classroom. Some research has found that a meaningful evaluation system can even attract individuals to the profession who might not have otherwise considered teaching, 330 perhaps because of a more visible commitment on the part of a district to supporting and rewarding great teachers. Without a good system in place, it is difficult for administrators to access the data needed to tackle problems of educational inequities, primarily the tendency of districts to assign their more effective, qualified teachers to more advantaged students.
Ultimately, for a teacher evaluation system to have the positive outcomes for teachers and students that research and real examples prove are possible, it must adhere to these evidence-based practices, involve both teachers and administrators from the start, be evaluated frequently for efficacy, be tested for biases, and be iterated on as necessary. It's far from easy, and requires both significant funding and dedicated district and school leadership, but the potential impact it can have on students and teachers is worth the investment.