
Inside the Decision Room
A production by William Chang
(SCENE: A seemingly endless meeting table stretches across the vacant admissions boardroom. At one end of the table is the one and only window in the room, which allows the dazzling sunset to illuminate half of the sacred space. Seven middle-aged mortals rest in the other half of the room, in the dark. It is that time of the year again, when the 7 knights of the round table come together to make life-and-death decisions for the thousands whose futures lie in their hands. And yet, seemingly used to the weight of the venerated matter, the 7 mortals simply sink into their leather armchairs.)
Silence.
A giant figure emerges from the sole entrance of the room. He strides towards the front of the room with an immense grin on his face. He seems to be the only enthusiast in the room.
Director: How would we like our minimum TOEFL requirement? 110? 115?
Yigal: Medium-well please. (a sense of irritation in his voice). You make this sound as easy as ordering a steak. And it isn’t.
Sensing the heavy atmosphere, the Director drops his grin and takes the seat beside Yigal.
Director: What’s the matter?
Trapani: Most TOEFL scores are now under the Automated Writing Evaluation (AWE) system; in other words, computers rather than humans are grading these tests. I think we should reevaluate the use of TOEFL scores to assess student writing in the context of admissions.
Director: I see, what do the others think?
Trapani quickly nudges Paul with her shoulder when the Director looks away.
Paul: I agree, it is clear that machine scoring is unable to measure features such as the meaningfulness of content or rhetorical effectiveness. It measures text quality but not writing skills...
Squinting her eyes, Sara Cushing is not buying what Paul said. She interrupts him.
Sara Cushing: I am not sure whether what you are saying is true. From my own study of the TOEFL, the correlations between overall e-rater scores and human scores were as high as or higher than correlations between two human ratings. E-rater was also more consistent across different prompts than human ratings. Plus, similar to human scores, scores produced by e-rater moderately correlate with other measures of writing ability like course grades and instructor feedback. To me, the e-rater system is rather reliable.
Doug: Hold your horses. That is exactly the opposite of the results I have obtained from a National Assessment of Educational Progress (NAEP) report.
Paul smiles. He has been waiting for someone to prove Sara Cushing wrong. No one ever interrupts Paul.
Paul: That’s interesting.
From an old, dusty briefcase, Doug pulls out a stack of paper, dusts it off, and slides it across the table until it halts in front of Sara Cushing.
Doug: E-rater did not agree with scores awarded by human raters and produced mean scores that were significantly higher than the mean scores awarded by human readers. Human scores, the NAEP study also found, correlated more highly with one another than with the AWE scores, and the human raters assigned the same score to papers with greater frequency than that of the AWE.
Sara Cushing is reading the report intently, hoping to find flaws in Doug’s argument for a potential rebuttal.
Doug: A case study of the implementation of the AWE system on the Australian Scaling Test, which is a measure designed to reflect classroom proactive task design, also indicates that machine scoring differs significantly from human scoring. In essence (clears his throat), if the ultimate goal of machine scoring is to replace human markers and thus reduce labor costs, e-rater is certainly doing a bad job here.
Sara Cushing: Well, what we can say now is that the AWE system is certainly far from perfect.
Sensing that the room is probably not on her side, and that Doug has got a fair point, Sara Cushing is eager to make a concession.
Sara Cushing: In fact, AWE systems failed to distinguish second-language writing assessments that prioritize learning to write from those that use writing to teach content. The feature scores used to generate total scores differed across the two prompts I examined. This is why I think we should be mindful of its potential to further marginalize second-language students. All decisions about its use with second-language populations should be made with an awareness of the software's limitations.
Paul: (nodding his head) In this case, I agree with you.
Sara Cushing: For the rest of the meeting, we must base our arguments on 5 different aspects of AWE (counting on her fingers as she speaks): evaluation, generalization, explanation, extrapolation, and utilization.
William: And what do you mean by that?
Sara Cushing: We should consider the extent to which the computer-generated scores of English-Language-Learning students can be taken as accurate representations of performance, the extent to which these scores provide appropriate estimates of student scores obtained from other, similar performances, the extent to which scores can be attributed to the defined construct, the meaningfulness with which scores indicate performance in the target domain, and the usefulness of scores for decision-making.
Sara Cushing’s words send the room into thought until the Director breaks the silence.
Director: It is also important that we consider the issue in the context of admissions, because that is what we ultimately care about.
Paul: Since we also have international applicants, shouldn’t AWE focus on the ways that fluency relates to the broader range of social and cultural practices of effective writers?
Brent: Speaking of cultural practices, we dug into the datasets of TOEFL and GRE essays and found out that e-raters scored writing by Chinese and Korean speakers more highly than did human raters, but gave lower scores to writing by Arabic, Hindi, and Spanish speakers.
Director: (Gasps). This level of discrepancy is certainly not allowed.
Trapani: Although there is no difference by gender on the TOEFL, a major variance appeared on the GRE. On the GRE, human scores for “Issue” essays by African American and Native American men were slightly higher than e-rater scores, and African American men and women both received higher human scores on the “Argument” essay. Among international GRE-takers, students from mainland China received higher scores from e-rater than from human readers, although the same difference did not hold for Mandarin-speakers in Taiwan, suggesting cultural rather than linguistic causes for the disparity.
Yigal: We further observed that e-rater assigns greater value to length and less value to certain grammatical features than many human readers do, and that human readers might be more sensitive to whether an otherwise well-constructed essay is off-topic.
Director: I definitely did not expect this social dependence of AWE grading. (shaking his head). We cannot disadvantage any of our applicants based on gender, culture, or nationality if we are going to continue with the TOEFL minimum requirement policy.
Doug: No, we cannot, especially when the reliability of machine scoring is derived from the assessment of narrow and convergent tasks, and when it depends on such tasks to produce results that roughly mimic human judgments...
William: I am sorry to interrupt, but aren’t we missing the whole point of why we shouldn’t continue with TOEFL requirements?
William suddenly stands up and walks towards the front of the room. With his back to the glaring sun, no one can tell from his looks that he is about to lead everyone out of the box.
Sara Cushing: What do you mean? Because of the inaccurate machine scoring?
William: Totally not. This debate overlooks a more fundamental concern about the inability of such tests, however scored, to accurately reflect student abilities.
Director: Hmm, I think perhaps he is onto something.
William: The problem is not the e-raters but rather the test itself. Tests that can be scored by AWE generate only short writing samples produced under tight time restrictions. They do not fully reflect widely accepted writing constructs and are poor predictors of students' success in courses that require them to think, to write with an awareness of purpose and audience, and to control the writing process.
William pauses, waiting for everyone to catch up with him. He then starts to walk around the room.
William: Of course, computers are able to analyze some syntactical aspects and count certain features of writing, which can offer a tool for writing instruction focused on specific textual features...
Paul interrupts.
Paul: Yes, writing is not only a social and cultural process but also one that requires specific cognitive skills that machine analysis can help to develop.
William stares and points his finger at Paul.
William: ... but such use is limited and overshadowed by increasing reliance on automated scoring engines in high-stakes assessments. Such tests should not be used to assess student writing, whether in the context of admissions and placement or in the context of formative and summative measures of student achievement.
Brent: What do you propose then?
William: We should look to course performance and portfolio evaluation that provide a more complete measure of student understanding of the full writing construct.
Paul: Meanwhile, AWE can still help to identify students whose basic text production presents challenges.
Everyone nods.
Director: Well, I guess either way we are not going with the TOEFL requirements this year. Hopefully this will open greater education to more people.
Trapani: It certainly will.
A jazz mix of vibraphone and saxophone gradually fades in. The usual sunset still shines on the other half of the room, but the 7 admissions officers grin for a brighter future the college applicants have never experienced.
Bibliography
McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing, 15(2), 118–129. https://doi.org/10.1016/j.asw.2010.04.002
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24. https://doi.org/10.1016/j.asw.2012.10.002
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85–99. https://doi.org/10.1016/j.asw.2012.10.006
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country. Applied Measurement in Education, 25(1), 27–40. https://doi.org/10.1080/08957347.2012.635502
Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing, 18(1), 100–108. https://doi.org/10.1016/j.asw.2012.11.001

