Teacher Evaluations by Students

More than a decade ago, Martin Evans and Paul McNelis of the Department of Economics at Georgetown University studied student evaluations of teachers in undergraduate classes at the university. The study concluded:
This empirical analysis of student survey data shows how difficult it is to accurately evaluate "teaching effectiveness". But it would be a mistake to draw from this study that the current survey instrument needs to be replaced by another instrument, with more differentiated and probing questions, but with the same numerical scoring device. We are confident that we would be able to replicate similar error rates for assessing teaching from another instrument, which follows a similar modus operandi as the current instrument.
Part of the reason given for why student evaluations are difficult to translate into "teaching effectiveness" is their close dependence on class characteristics (course level, course time, class size) and grade reputation. This finding, however, concerns students at the college level.

One question that may be raised is whether the above finding likewise applies to student evaluations at the basic education (K-12) level. Harriet Sanford, president and CEO of the NEA Foundation, apparently believes that the situation is different in basic education, where an evaluation is much more like a dialogue between a teacher and his or her students. In a recent article in the Huffington Post, "Want to Improve Teaching? Listen to Students", Sanford writes:
Annie Emerson doesn't have to wonder about what it takes to help her kindergarten students learn how to write or do math. They've told her. 
Several times during the year, the Pinewoods Elementary School teacher asks her students two basic questions: what are ways that I teach you that you like or that are really working for you? What could be changed to help you learn even more? And it turns out even 5-year-olds have plenty to say. 
Emerson's students told her that they wanted more open-ended time to work on writing and math activities -- which is exactly what the Florida teacher gave them. Along with adding longer blocks of time for those activities during the day, Emerson began finding ways to help students weave math problems into their lives outside of school, including measuring how long it takes for the bus to get to school each day and comparing the heights of family members. Just as importantly, the conversations "brought on a whole new level of trust with my class," Emerson says. "The students realized they had 'voice' -- for them, having that at age five was a pretty big deal."
Clearly, the specific example above differs from the procedure and design of the student evaluations at Georgetown University. No numerical scores are assigned. Responses to the questions posed by Emerson, "What are ways that I teach you that you like or that are really working for you? What could be changed to help you learn even more?", cannot be easily translated into a numerical scale. One big difference here is that students find value in the questions being asked. The questions are not meant to rank teachers. They are more like "How can we better serve you?"

The Measures of Effective Teaching (MET) project, funded by the Bill and Melinda Gates Foundation, came out with a report in September of 2012:

Cover of MET report on Student Evaluations
The study uses an evaluation that probes the following components:
  • CARE: My teacher seems to know if something is bothering me.
  • CONTROL: My classmates behave the way the teacher wants them to.
  • CLARIFY: My teacher knows when the class understands.
  • CHALLENGE: In this class, we learn to correct our mistakes.
  • CAPTIVATE: I like the way we learn in this class.
  • CONFER: My teacher wants us to share our thoughts.
  • CONSOLIDATE: The comments I get help me know how to improve.
Unlike in Emerson's case, the answers to the above survey are on a numerical scale:

For Grades 3–5/6–12:
1. No, never/Totally untrue
2. Mostly not/Mostly untrue
3. Maybe, sometimes/Somewhat
4. Mostly yes/Mostly true
5. Yes, always/Totally true

For Grades K–2: No, Maybe, Yes

It is always tempting to work with numbers. With numbers, one can use graphs and examine correlations. After all, how does one graph a response like "We wanted more open-ended time to work on writing and math activities"? And the MET project does present a graph of students' responses against how the students actually perform on standardized exams:

Image captured from http://www.metproject.org/downloads/Asking_Students_Practitioner_Brief.pdf
What is worth noting here is the paper's reasoning behind comparing students' evaluations against students' test scores. It is not to prove that the students' evaluations are by their nature able to predict learning outcomes. It is more of a check on the evaluation system. The report notes:
Systems that use student surveys should similarly test for predictive validity by comparing teachers’ survey results with their student achievement gains—as they should for any measure they use. Moreover, checking for predictive validity needs to be an ongoing process. Over time, alignment with student outcomes could deteriorate. This could happen if somehow teachers altered their actions in ways that improved their survey results, but without improving their underlying performance on practices associated with better outcomes. In such a situation, a system would see that teachers’ rankings based on survey results bore little relationship to their students’ learning gains. Such misalignment could signal the need for survey refinement.
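The check described in the passage above, comparing teachers' rankings on the survey with their students' learning gains, can be illustrated with a rank correlation. The sketch below is only an illustration: the excerpt does not say which statistic the MET project used, so the choice of Spearman's rank correlation is an assumption, and the function names and all the numbers are invented for the example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data, invented for illustration only (not from the MET report):
# one entry per teacher.
survey_scores = [3.1, 4.2, 3.8, 2.9, 4.5]           # mean survey rating (1 to 5)
achievement_gains = [0.02, 0.10, 0.12, -0.05, 0.15]  # students' test-score gains

rho = spearman(survey_scores, achievement_gains)
print(round(rho, 2))  # prints 0.9
```

A value near 1 means the survey ranks teachers in much the same order as their students' gains do. A value that drifts toward 0 in later years would be the kind of misalignment the report says could signal the need for survey refinement.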
Without doubt, the teacher-student relationship is crucial to learning. A dialogue must occur between a teacher and the students inside the classroom. Teachers who are better informed about their students can be more effective. When these surveys become tools for ranking teachers, or are used for purposes other than informing teachers how they may improve, the surveys are likely to become useless. More importantly, when these evaluations are done only at the end of the term, they serve hardly any purpose.