Once upon a time, physics major #1 took an elective course in music, namely “orchestration and composition”. Everybody else in the course was a music major.

Meanwhile, physics major #2 took a different elective course, namely “music appreciation”. This involved sitting in class, listening to recordings, and discussing them. Music majors are not allowed to take this course, not for credit anyway.
Student #1 got a C. Student #2 got an A.
Which of these students got the better grade? Which got the better education?
This is significant, because there are lots of real-world incentives that depend on GPA ... including school admissions, scholarship money, job offers, et cetera. On the other hand, it is perverse to focus on a good GPA rather than a good education.
In every grade, the teacher and students are subject to all sorts of incentives to produce a good score on that grade’s standardized test.
It is obvious that teachers will teach to the test to some extent ... and students will study to the test to some extent. Simplifying things a bit, we can identify several cases and sub-cases.
To repeat: Teachers will teach to the test to some extent. This is either a good thing, a manageable challenge, or a disaster ... depending on details. Much depends on the test, and on how the teacher goes about teaching to the test.
Let’s look more closely at the cases mentioned in section 1.2.
Any teacher with good sense can tell the difference between case 2a and 2b, i.e. the wise approach and the unwise approach to coping with the end-of-year test. However, a bureaucrat who is not in the classroom cannot easily tell the difference, because the distinction is not reflected in the test scores – at least not in the obvious way. Given the current emphasis on scores rather than good sense, this commonly leads to perverse incentives, i.e. situations where teachers feel obliged to game the test in unwise ways.
If you confine yourself to using the test score itself in the direct, obvious way, then there is no reliable way to distinguish the wise approach from the unwise approach. However, there are less-obvious ways of measuring the distinction. It shows up in later grades, and in later life.
Here are a couple of ways things could play out:
(Of course the latter school may have problems of its own, but that is a separate issue, not necessarily correlated.)
This leads to a constructive suggestion: In any grade school that is large enough to have more than one class at each grade level, shuffle the students – randomly – when assigning classes at the beginning of each year. For example, half of the students from class 3A and half of the students from class 3B wind up in class 4A. The other two halves wind up in class 4B. There are then four possibilities:

 1. If the 3A students do better on the third-grade end-of-year test and also do systematically better in 4th grade, that’s great.
 2. If the 3A students do worse on the test but better in 4th grade, it suggests that the 3A teacher is not paying enough attention to the test.
 3. If the 3A students do better on the test but do systematically worse in 4th grade, it suggests that the 3A teacher is paying too much attention to the test and not enough to the fundamentals.
 4. If the 3A students do worse on the test and worse in 4th grade, it suggests there is room for improvement in the 3A classroom. Perhaps the 3A teacher can learn a thing or two from the 3B teacher.
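For concreteness, here is a minimal sketch in Python of the bookkeeping this entails. It is an illustration under stated assumptions, not any official procedure: the class labels, record fields, and scoring scales are hypothetical, and in practice one would want proper statistics (confidence intervals, controls for class size) rather than bare averages.

    import random
    from statistics import mean

    def shuffle_into_classes(students, class_names=("4A", "4B")):
        # Randomly reassign students to next year's classes,
        # ignoring which class they came from.
        shuffled = random.sample(students, k=len(students))
        half = len(shuffled) // 2
        return {class_names[0]: shuffled[:half],
                class_names[1]: shuffled[half:]}

    def compare_prior_classes(records):
        # Each record is a dict with (hypothetical) keys:
        #   'prior_class'  -- e.g. '3A' or '3B'
        #   'test_score'   -- third-grade end-of-year test score
        #   'later_score'  -- some measure of 4th-grade performance
        # Returns, for each prior class, the pair
        # (average test score, average later performance),
        # from which the four possibilities above can be read off.
        by_class = {}
        for r in records:
            by_class.setdefault(r["prior_class"], []).append(r)
        return {c: (mean(r["test_score"] for r in rs),
                    mean(r["later_score"] for r in rs))
                for c, rs in by_class.items()}

Hypothetical usage: feed compare_prior_classes the records for this year’s 4th graders. If it reports that the 3A group averaged higher on the test but lower on later performance, that is the signature of possibility 3 above, i.e. teaching to the test at the expense of the fundamentals.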
This same procedure can be applied to students who graduate from elementary school to middle school, provided the classes they take in middle school are not too strongly correlated with which class they were in during elementary school.
A coarse-grained version of the procedure can be applied on a school-by-school basis when multiple elementary schools feed a single middle school, and when multiple middle schools feed a single high school. You can hold the school as a whole accountable not only for how well the students do on the end-of-year test, but also for how well they do in later years.
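The bookkeeping sketch above carries over to this coarse-grained version unchanged; only the grouping label is different. A hypothetical reuse:

    # Hypothetical: each record's 'prior_class' field now holds the
    # feeder school instead of the classroom, e.g. 'Elm St. Elementary'.
    # The summary then reports, school by school, the average
    # end-of-year test score versus the average later-year performance.
    summary = compare_prior_classes(feeder_records)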
There is no escape from the law of unintended consequences. The measures set forth in section 2.2 do not eliminate all possible ways of gaming the system. In particular, there is an obvious scheme whereby a teacher (or a school) can improve student scores in the short term and the long run, namely by selecting the incoming students.
There is a lot of selection going on already, sometimes for good reasons and sometimes otherwise. Extending the accountability horizon will not make it go away. If you want to reach any halfway-valid conclusions based on test scores, with or without extending the accountability horizon, you need to control for this. Assigning students randomly to one class or another – and randomly re-shuffling them at each year-to-year boundary – helps with some of this.
The question arises: To what extent is it worthwhile to extend the accountability horizon, along the lines suggested in section 2.2? Well, that depends.
At some point, people would conclude that gaming the test is not worth the trouble, which is what we want them to conclude.
In a disaster situation, tests are superfluous. A disaster is easy to detect, and there is rarely any need to quantify how disastrous it is. Making the trivia test slightly better won’t help. More importantly, a trivia test may detect a problem, but it won’t tell you much about the causes or possible solutions.
Let’s be clear: In situations where the test is a disaster-detector only, the technique outlined in section 2.2 will be primarily of academic interest, in the worst sense of the word. That’s because staving off disaster is nothing to be proud of. The goals should be much, much higher.
Beware that many of the current tests are so bad that trying to use them in cleverer ways is like re-arranging the deck chairs on the Titanic. The fact that I have mentioned something that could be done using well-behaved test scores must not be taken as an endorsement of the current crop of tests.
I am not opposed to all testing. I am opposed to dumb testing.