Reliability in Educational Assessments
by
M. David Miller
  • LAST REVIEWED: 30 October 2019
  • LAST MODIFIED: 30 October 2019
  • DOI: 10.1093/obo/9780199756810-0228

Introduction

Professional standards identify three foundations for examining the psychometric quality of assessments: validity, reliability, and fairness. Reliability is therefore a primary concern for all assessments. Reliability is defined as the consistency of scores across replications. In education, the sources of measurement error, and thus the bases for replication, include items, forms, raters, and occasions. The source of measurement error determines the type of reliability and, ultimately, the generalizations that can be made about the measurement. Inconsistency in scores can therefore arise from multiple sources of random error, and the definition applies to different types of replications depending on the generalization to be made. There are also multiple indices for reporting reliability, including reliability coefficients, generalizability coefficients, standard errors of measurement, and information functions. The indices are defined differently under different test theories. For example, classical test theory emphasizes reliability coefficients and standard errors of measurement; item response theory emphasizes information functions; generalizability theory emphasizes generalizability coefficients, dependability indices, and relative and absolute standard errors; and classification consistency emphasizes proportion agreement, either unadjusted or adjusted for chance agreement. The importance of reliability varies with the uses made of the assessment: it becomes more important as the consequences of test use become higher stakes. Thus, reliability requirements are expected to be applied more rigorously when tests are used to make high-stakes decisions about individuals, such as employment or certification decisions and decisions about clinical placement.
While validity, or the interpretations and uses of test scores, is considered the most important characteristic of a test, reliability provides a strong foundation for validity and is a necessary condition for most test uses or interpretations. When scores are not consistent within a testing procedure, they are considered to be influenced by random errors of measurement. Such scores will not have strong relationships to other variables, will not have strong internal structure, and will not accurately support the score uses and interpretations that are necessary for validity. Consequently, reliability is often considered necessary for the valid use and interpretation of scores. On the other hand, a test could have high reliability and still not be valid for a particular use or interpretation, since validity depends on both measuring consistently and measuring the right construct.
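
As a rough illustration of some of the indices named above, the following Python sketch computes a classical test theory reliability coefficient (coefficient alpha), the standard error of measurement, and proportion agreement both unadjusted and adjusted for chance (Cohen's kappa). The function names and all data are hypothetical and are not drawn from any of the works cited here; the formulas are the standard textbook versions, and the sketch assumes complete data with nonzero total-score variance.

```python
import math
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha; rows are examinees, columns are item scores."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) and of the total score per examinee.
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score SD times the square root of (1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)


def agreement_and_kappa(first, second):
    """Proportion agreement and Cohen's kappa for two sets of classifications."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    categories = set(first) | set(second)
    # Chance agreement expected from the marginal proportions of each category.
    expected = sum((first.count(c) / n) * (second.count(c) / n) for c in categories)
    return observed, (observed - expected) / (1 - expected)


# Hypothetical data: four examinees answering three items.
item_scores = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [5, 4, 5]]
alpha = cronbach_alpha(item_scores)

# Hypothetical score SD of 10 and reliability of 0.91.
sem = standard_error_of_measurement(10.0, 0.91)

# Hypothetical pass/fail decisions from two replications (e.g., two raters).
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
observed, kappa = agreement_and_kappa(rater_a, rater_b)

print(alpha, sem, observed, kappa)
```

Which index is appropriate depends on the source of error and the intended generalization: alpha treats items as the replication, whereas the agreement indices treat the repeated classifications as the replication.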

General Overviews of Reliability

The literature on psychometrics, reliability, and testing includes many summaries of reliability and related issues. Broad overviews can be found in numerous writings, including refereed journals and book chapters. One of the most important broad overviews in test theory is the professional standards for educational and psychological testing, which were developed under the joint leadership of three professional organizations (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education). The fourth edition of the professional standards is American Educational Research Association, et al. 2014. The standards address multiple technical properties of assessments but identify validity, reliability, and fairness in the three foundational chapters. Historical changes in assessment and in the methods and theories for assessing reliability can be traced through the three prior editions of the professional standards (1974, 1985, 1999). Several overviews are available that include technical, detailed treatments of reliability. Brennan 2001 provides an overview of reliability that emphasizes its historical development and discusses multiple ways of defining reliability, ranging from an emphasis on errors of measurement to replication of the assessment. Detailed and comprehensive overviews of reliability can also be found in Feldt and Brennan 1989 and Haertel 2006. In contrast to the technical overviews, a simple overview of reliability and validity for classroom teachers, who are one of the primary users of assessments, is presented in Miller, et al. 2013.

  • American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. 2014. Standards for educational and psychological testing. 4th ed. Washington, DC: American Educational Research Association.

    Professional organizations in education and psychology provide the 4th edition of professional standards on testing. The reliability chapter includes definitions and standards for using reliability in assessment practices, which represent a consensus view of professionals in testing. The eight clusters of standards are specifications for replication of the testing procedure, evaluating reliability/precision, reliability/generalizability coefficients, factors affecting reliability/precision, standard errors of measurement, decision consistency, reliability/precision of group means, and documenting reliability/precision.

  • Brennan, R. L. 2001. An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement 38:295–317.

    DOI: 10.1111/j.1745-3984.2001.tb01129.x

    Brennan provides a historical review of reliability as well as future areas of development and use. He focuses on multiple definitions and emphases in reliability, such as those currently used in the standards emphasizing replication (consistency) as opposed to an emphasis on errors of measurement (inconsistency).

  • Feldt, L. S., and R. L. Brennan. 1989. Reliability. In Educational measurement. 3d ed. Edited by R. L. Linn, 105–146. New York: American Council on Education.

    Feldt and Brennan provide a detailed overview of reliability and the statistical assumptions necessary for the development of coefficients. The chapter emphasizes context-based errors that lead to different estimates of reliability and different interpretations. The chapter does not include item response theory, since it is treated in another chapter of the book.

  • Haertel, E. H. 2006. Reliability. In Educational measurement. 4th ed. Edited by R. L. Brennan, 65–110. Westport, CT: Praeger.

    Haertel provides a comprehensive and detailed review of reliability based on multiple test theories. The chapter includes advanced statistical treatment of the approaches to test theories, data collection designs, and statistical indices used to estimate reliability using classical test theory, generalizability theory, and classification consistency. Also suggests future directions for reliability. The chapter does not include item response theory, since it is treated in another chapter of the book.

  • Miller, M. D., R. L. Linn, and N. E. Gronlund. 2013. Measurement and assessment in teaching. 11th ed. Boston: Pearson.

    Chapter 5 provides a simple overview of reliability for the lay reader or beginning undergraduate who will be a user of testing, particularly pre-service teachers. Reliability and validity (chapter 4) are presented in the context of educators who use and develop tests for classroom assessment but do not need an awareness of the complexities of the statistical methods.
