– Written by: Dr. Claudio Violato –
Test Theories
There are three major interrelated test theories today:
1) Classical test theory (CTT),
2) Generalizability theory (G-theory),
3) Item response theory (IRT).
All three theories are fundamental to psychometrics, the theory and technique of psychological and educational measurement. This includes the objective measurement of attitudes, personality traits, skills and knowledge, abilities, and educational achievement. Psychometric researchers focus on the construction and validation of assessment instruments such as questionnaires, tests, raters’ judgments, and personality tests, as well as on statistical research relevant to the measurement theories.
Classical test theory is so called because it was the first test theory developed in psychometrics. Its premise is simple: any observed score (e.g., a test score) is composed of the “real” or true score plus error of measurement:
X (Observed score) = T (True score) + e (error of measurement).
The early foundational work of scholars like Karl Pearson, Charles Spearman, E.L. Thorndike, and Frederic Kuder was based on this idea.
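The decomposition X = T + e can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: true scores and errors are drawn from normal distributions, are independent of each other, and use arbitrarily chosen means and standard deviations. The function names `simulate_ctt` and `reliability` are hypothetical. The sketch also computes reliability, which CTT conventionally defines as the ratio of true-score variance to observed-score variance.

```python
import random
import statistics


def simulate_ctt(n_people=20000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate X = T + e for n_people test takers (illustrative values only)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # T: each person's true score, assumed normal around a mean of 50.
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    # e: measurement error, assumed normal with mean 0 and independent of T.
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_people)]
    # X: the observed score is the sum of the two components.
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def reliability(true_scores, observed):
    """Ratio of true-score variance to observed-score variance."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


true_scores, errors, observed = simulate_ctt()
estimate = reliability(true_scores, observed)
expected = 10.0**2 / (10.0**2 + 5.0**2)  # 0.8 under the assumed SDs
print(f"estimated reliability: {estimate:.3f} (theoretical value: {expected:.1f})")
```

In real testing the true scores are never observed directly, so reliability has to be estimated indirectly; the simulation only makes the decomposition visible.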
G-theory was developed by L.J. Cronbach and colleagues (1963, 1972) [1] as an advance over CTT. In CTT each observed score has a single true score and a single source of measurement error. G-theory, by contrast, is a statistical framework for conceptualizing and investigating multiple sources of variability in measurement. An advantage of G-theory is that researchers can estimate what proportion of the variation in test scores is due to factors that often vary in assessment, such as raters, setting, time, and items. Anyone who has watched Olympic diving has observed the effect of different sources of variance: the divers’ scores vary with differences in performance, with the raters (judges), and with the items (components of the dive). The variation in scores, therefore, comes from multiple sources. In health care assessment, the same situation arises when a student performs skills that are rated by two or more judges or raters. Both CTT and G-theory continue to play a role in testing and measurement.
The third major theory, IRT, is also known as latent trait theory. Like CTT and G-theory, it can be used for the design, analysis, and scoring of tests, questionnaires, and assessments measuring abilities, attitudes, or other variables. IRT is based on mathematical modelling of candidates’ responses to individual questions or test items, in contrast to the test-level focus of CTT and G-theory. This model is widely used with multiple choice questions (which are scored right or wrong), but it can also be used with rating scales, patient symptoms (scored present or absent), or diagnostic information in disease.
In IRT it is assumed that the probability of a response to an item is a mathematical function of person and item characteristics. The person characteristic is conceptualized as a latent trait such as aptitude, achievement, extraversion, or sociability. The item characteristics consist of difficulty, discrimination (how well the item distinguishes between people), and guessing (e.g., on multiple choice items). All three psychometric theories (CTT, G-theory, and IRT) have their relative advantages and disadvantages.
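To make the idea of partitioning score variation concrete, here is a minimal sketch for the simplest case: every person is scored once by every rater (a fully crossed persons x raters design). It estimates variance components from two-way ANOVA mean squares, which is one common estimation approach and not necessarily the one used in any particular study. The function name `variance_components` and the ratings are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components for a persons x raters table.

    `scores[p][r]` is the score rater r gave person p; every rater is assumed
    to rate every person exactly once, and the scores must not all be equal.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Mean squares from a two-way ANOVA without replication.
    ms_p = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; negative estimates are set to 0.
    components = {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,  # person-by-rater interaction mixed with error
    }
    total = sum(components.values())
    proportions = {name: value / total for name, value in components.items()}
    return components, proportions


# Hypothetical ratings: 3 students (rows) scored by 2 raters (columns).
ratings = [[4, 6], [6, 8], [8, 10]]
components, proportions = variance_components(ratings)
print(components)
print(proportions)
```

In this made-up table the second rater always scores two points higher than the first, so all of the variation is attributed to persons and raters and none to the residual. Designs with more facets (setting, time, items) need correspondingly more variance components.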
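One widely used function that combines difficulty, discrimination, and guessing is the three-parameter logistic (3PL) model. The text above does not name a specific model, so treating the 3PL as the example is an assumption, and the function name `prob_correct` and the item values are hypothetical. A minimal sketch:

```python
import math


def prob_correct(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Three-parameter logistic (3PL) probability of a correct response.

    theta is the person's latent trait level, on the same scale as difficulty.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    # The guessing parameter sets a floor that low-ability test takers still reach.
    return guessing + (1.0 - guessing) * logistic


# Hypothetical four-option multiple choice item.
item = {"difficulty": 0.0, "discrimination": 1.2, "guessing": 0.25}
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}: P(correct) = {prob_correct(theta, **item):.3f}")
```

When the person's trait level equals the item difficulty, the probability is halfway between the guessing level and 1 (0.625 for the item above). A larger discrimination value makes the probability rise more steeply around the difficulty.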

1st Evolution: Mainframe Computers
The advent of computers has played a large role in the expansion and evolution of testing over the last 60 or so years. Many of the complex statistical techniques that have been applied to testing, such as correlational and factor analysis, and test theories such as IRT, have only been possible with the use of computers. As large mainframe computers became commonplace in universities and some other institutions during the 1960s, large databases from testing programs could be stored and the data analyzed with software programs. At the same time, ready-made software that could be installed on these computers became available; previously, users had to write their own software from scratch. Data could be entered into the computers (e.g., with punch cards, optical scanning sheets, bubble sheets, or keyboards) and analyzed, and reports could be generated for individual test takers.
By the 1980s, optical scanning machines (which read bubble sheets marked in pencil) had been more or less perfected and paired with desktop computers. Large numbers of bubble sheets (e.g., thousands) can be entered quickly and efficiently into a desktop computer, where sophisticated software can analyze the test results and produce both individual reports and psychometric results for the test.
2nd Evolution: Personal Computers
The next evolutionary step, which began in the 1990s, was the elimination of pencil and paper (i.e., bubble sheets) so that students could take a test directly on a computer. This is called computer-based testing (CBT). Historically, the main advantage of CBT has been report generation and quick feedback. With the advent of the personal computer, CBT functioned primarily as a way to deliver computer-administered versions of paper-and-pencil tests. These provided some advantages over paper-and-pencil tests in test administration and item innovation. Disadvantages include the need for up-to-date hardware and software and for large test centers to accommodate large-group testing.
CBT can be innovative, allowing flexible scoring of items. Test items can use sound or video to create multimedia items. For example, an item may contain a 40-second video and audio sequence of a doctor performing a focused physical exam on the left lower quadrant of the abdomen, followed by a series of multiple-choice questions (MCQs) for the student. Content innovation also relates to the use of dynamic item types such as drag-and-drop, point-and-click, or hovering over hotspots. Future developments in CBT are likely to focus on item innovations that measure complex cognitive outcomes such as clinical judgment and professionalism.
3rd Evolution: The Internet
Online testing refers to the delivery of tests via the Internet. This approach also provides a new medium for the distribution of test materials, reports, and practice manuals, and for the automated collection of data. Even traditional paper-and-pencil materials can be delivered online as PDF files using e-book publishing technologies. Theoretically, anyone with an Internet connection could take a test at any time from anywhere in the world. Such an approach provides much more flexibility in testing than has been possible in the past. Online testing also raises a whole set of issues: confidentiality, cheating, test taker identification, hacking, breaching of the test bank, and so on.
Sources:
[1] Cronbach LJ, Rajaratnam N, Gleser GC. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137-163.
Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.








