In recent years, both state and federal policymakers have raised the stakes associated with test results by using them as very public measures of school effectiveness.
This growing emphasis on assessment in the United States was born out of the effort to improve schools by setting high and consistent standards for student achievement. Integral to this standards-based approach is the need to measure whether students and schools are successfully meeting these higher expectations. Assessments are also integral to the new accountability systems states have been required to implement under the federal No Child Left Behind Act. In comparison to the early 1990s, large-scale assessment of students at the state and national levels has become commonplace. This, in turn, has prompted increased interest in the quality of the tests that are being used and their relationship to what schools should teach and what students need to learn.
Ideally, well-crafted, standards-based assessments of student performance also help leverage change in classroom practice. Policymakers can use assessments to communicate what is important for students to learn and to motivate schools and teachers to focus on these areas of learning.
Types of assessment
Current ideas about student assessment reflect a broadening perspective on the ways that students’ knowledge and ability can be systematically evaluated and the uses for that information. Tests can actually take many forms, and test developers emphasize that different test methods are appropriate for different purposes. Every test has three important aspects:
- The testing format can range from the familiar multiple-choice and true/false approaches to written essays, and from oral exams to the performance of a specified task such as a science experiment.
- The scoring of responses can be automated and done by machine (e.g., multiple choice), by a teacher, or by an outside group of trained scorers or evaluators. Machine scoring is the most economical and, some argue, the most "objective" method. New technologies may be adding to the ways that machines (i.e., computers) can be used to score tests, which could dramatically broaden the types of test items that can be used cost-effectively for large-scale assessment.
- The interpretation and reporting of how well a student performed on a test can be critical, particularly on a standardized test. A student’s raw score on a test, or the number of correct answers, is just the first step. Next comes an evaluation of what that score means and how it should be interpreted. For a criterion-referenced (or standards-referenced) test, the results are reported based on a set of established expectations and performance standards, such as "basic, proficient, and advanced." For a norm-referenced test, results are reported relative to a comparison group.
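The distinction between criterion-referenced and norm-referenced reporting can be made concrete with a short sketch. The cut scores, performance labels, and comparison-group scores below are hypothetical, invented only to illustrate the two interpretations of the same raw score:

```python
from bisect import bisect_right

# Hypothetical cut scores for a 50-item test (criterion-referenced reporting).
CUT_SCORES = [(30, "basic"), (40, "proficient"), (46, "advanced")]

def criterion_level(raw_score):
    """Map a raw score to a performance level using fixed, pre-set cut scores."""
    level = "below basic"
    for cut, name in CUT_SCORES:
        if raw_score >= cut:
            level = name
    return level

def percentile_rank(raw_score, norm_group_scores):
    """Norm-referenced reporting: the percent of a comparison group
    scoring below this student (assumes whole-number scores)."""
    ranked = sorted(norm_group_scores)
    below = bisect_right(ranked, raw_score - 1)  # count of scores < raw_score
    return 100.0 * below / len(ranked)

# The same raw score of 42 yields two different reports:
norm_group = [22, 28, 31, 35, 38, 41, 44, 47]
print(criterion_level(42))              # a performance-standard label
print(percentile_rank(42, norm_group))  # a standing relative to peers
```

The criterion-referenced report answers "did the student meet the standard?" regardless of how others performed, while the norm-referenced report changes whenever the comparison group does.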
How tests are used
Student assessments serve several different purposes. Ideally, the purpose for which a test will be used determines the choice of a testing format and the manner in which the results are scored, interpreted, and reported. The purposes of testing generally include:
- Evaluating and improving the instructional program in general;
- Diagnosing individual student abilities/knowledge and adapting instruction accordingly;
- Determining individual student placement or eligibility for promotion/graduation, college admission, or special honors; and
- Measuring and comparing school, school district, statewide, and national performance for broad public accountability.
Often a test that was designed and is appropriate for one use does not provide meaningful information for another purpose. For example, when some large-scale tests are administered, each student takes only a portion of the test, and collective results are used to gauge the achievement of large entities such as urban districts, states, and the nation. Such tests would not be effective for evaluating a class or an individual student.
Quality criteria
The use of a given assessment tool for a particular purpose can be a contentious issue, particularly when assessments are used to make high-stakes decisions about students and schools—whether it be public comparisons, funding, or student placement. When tests are used for such purposes, testing experts stress that the testing instrument should be of high quality and validated for the intended purpose.
Traditionally, the determination of a test’s quality revolves around three questions:
- Is the test valid? This is the overarching concern, and it involves asking whether a test provides accurate information for the purposes for which it is being used. If a test is used to determine how well students master standards, does it do a good job of covering those standards? If we use a test such as the SAT I for the purpose of college admissions, does it in fact predict something about a student’s college performance? In other words, are the consequences of the testing—or inferences made from the results—reasonable given the particular test that was conducted?
- Is the test fair? Is it free of built-in biases that create advantages or disadvantages based on individual student characteristics such as racial background? Have students had an opportunity to learn what is being tested?
- Is the test reliable? If a student took the test on two different occasions, would that student's scores remain fairly consistent both times? And do similar students achieve the same results time after time?
Policymakers raise the stakes around student assessments when they use them to increase accountability and leverage instructional improvement, as is happening with standards-based reform. In this context, testing experts are stressing an additional criterion for a high-quality test: its alignment with the most important standards or goals for student learning.
Finally, because no test is perfect, testing experts stress the importance of using multiple measures of performance, particularly for high-stakes purposes.