Content and Cognition Traits

Additional KSAs: Items should present all the KSAs that are necessary to respond successfully but that are not part of the targeted cognition and cannot safely be assumed to be known by test takers. This is a question of how important it is that an item depend only on the particular knowledge, skills and abilities (i.e., KSAs) that the item is aligned to—or that make up the standard the item is aligned to. Any additional KSAs would have to be explained or supplied to test takers (e.g., using footnotes to define more advanced words or supplying a formula sheet). If this principle is less important—or unimportant—items would be free to build upon other grade-level skills, general knowledge that is less than universal and/or KSAs from prior levels that not all test takers at this level necessarily possess.

Alternative Paths: Eliminate—or at least minimize—alternative cognitive paths to a successful response that do not depend appropriately on the targeted cognition. This is a question of how important it is to ensure that test takers can respond successfully to the item only by using the intended cognitive path. If less important—or unimportant—items might be more amenable to more creative problem solving and/or the interesting application of other knowledge, skills or abilities (i.e., KSAs) to compensate for a lack of proficiency with the targeted KSAs.

Bypasses: Items not meant to test some specific prior knowledge—particularly declarative prior knowledge—should not be amenable to bypassing the targeted cognition by relying on that prior knowledge. This is a question of how important it is to ensure that test takers cannot jump straight to a successful response by using some other particular knowledge they happen to have, rather than following the intended cognitive path, which uses the knowledge, skills and/or abilities that make up the steps of that path. If less important—or unimportant—items may be acceptable if most test takers (i.e., perhaps a large majority) cannot bypass the KSAs in the intended cognitive path.

Choice Length (MC Items): Keep the length of choices about equal. This is a question of how important it is to ensure that no answer option in a multiple choice item stands out as being notably longer or shorter than any of the others. If less important—or unimportant—items may be acceptable even when answer options vary significantly in length.

Cluing (MC Items): Items should not subtly or obviously signal which answer options are correct and which are incorrect without test takers applying the targeted cognition. This is a question of how important it is to ensure that the wording of individual answer options in multiple choice items does not make any of them stand out as implausible (i.e., surface plausibility) or far more likely simply because of their wording. If less important—or unimportant—items may be acceptable if savvier test takers are able to pick out the correct answer option through the use of their test-taking experience and knowledge.

Cognitive Complexity I (low): Items should not support cognitively less complex paths to a successful response than envisioned by the standard—unless that cognitive simplicity is strictly a function of high proficiency with the targeted standard. This is a question of how important it is to ensure that items do not allow test takers to avoid the cognitive deliberation that a reading of the aligned standard appears to require, while still allowing test takers with high proficiency with this standard to use that proficiency to take a less deliberative path. If less important—or unimportant—items may be acceptable even if there are successful response paths that require less careful deliberation and care, even for test takers who lack high proficiency with this standard.

Cognitive Complexity II (high): Items meant to assess high proficiency (i.e., high automaticity) with the KSAs in a standard should not support more complex paths to a successful response than envisioned by the standard—unless that cognitive complexity is strictly a function of low proficiency with the targeted standard. This is a question of how important it is to ensure that items aligned to standards that describe the automaticity seen with high levels of proficiency are not amenable to more deliberate and thoughtful successful approaches, unless that deliberation focused on this standard is due entirely to a lack of high proficiency with this standard. If less important—or unimportant—items aligned to standards that require highly proficient application of KSAs may be acceptable, even when some test takers can carefully work out a different path through the item that does not require highly proficient use of those KSAs, or can rely on a more careful and deliberate application of those KSAs.

Conciseness: Avoid window dressing (excessive verbiage). This is a question of how important it is to ensure that items are written using brief—perhaps pithy—language. If less important—or unimportant—items may be acceptable with more verbose language, perhaps even redundant language. 

Construct Irrelevant Barriers: Items should not have construct-irrelevant barriers (i.e., barriers irrelevant to the aligned standard) to producing a successful response. This is a question of how important it is to ensure that test takers’ unsuccessful responses to an item are not due to a lack of proficiency with KSAs that do not make up the standard, assessment target or other targeted cognition of the item in question. If less important—or unimportant—items are acceptable even if some other lack of proficiency (e.g., with other grade-level KSAs) could lead some students to offer unsuccessful responses.

Content Errors: Items should be free of content errors. This is a question of how important it is to ensure that the content-domain information presented in or assumed by an item is accurate. If less important—or unimportant—items may be acceptable in many cases in which they present (or potentially depend on) ideas or claims that some test takers might recognize as false.

Core of the Standard: While a single item need not necessarily assess the entire scope of a standard, it should focus on the meaningful core of the standard—rather than some tangential, easier-to-assess portion of it. This is a question of how important it is to ensure that items assess the heart, the meat and/or a very important facet of a standard (e.g., one with broad application, one that is built upon at later levels). If less important—or unimportant—items may be acceptable so long as they are aligned with any portion of the standard in question.

Difficulty: Items should have empirical difficulties (i.e., p-values) no lower than 0.2-0.35 and no greater than 0.9-0.95. This is a question of how important it is to ensure that operational items are neither so easy that virtually all test takers respond successfully nor so difficult that only a small fraction can do so—and relies on empirical observations of difficulty (e.g., from field testing) to do so. If less important—or unimportant—items may be acceptable when outside these bounds, perhaps due to the expert judgment of subject matter experts.
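
For illustration only (not part of the survey): a classical p-value is simply the proportion of test takers who respond successfully to an item. A minimal Python sketch, assuming a plain list of 0/1 item scores and using bounds drawn from the ranges cited above:

```python
def p_value(item_scores):
    """Classical item difficulty: proportion of test takers responding successfully."""
    return sum(item_scores) / len(item_scores)

def flag_difficulty(item_scores, floor=0.2, ceiling=0.95):
    """Flag an item whose empirical difficulty falls outside the acceptable range."""
    p = p_value(item_scores)
    if p < floor:
        return p, "too difficult"
    if p > ceiling:
        return p, "too easy"
    return p, "acceptable"

# Hypothetical scores for one item across ten test takers
print(flag_difficulty([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]))  # (0.8, 'acceptable')
```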

Distractors (MC Items): Distractors should be the results of the most common mistakes, misunderstandings and/or misapplications of the targeted cognition—and not merely be similar or close to the key. (They exist to provide an inviting option for test takers with those specific misunderstandings or who make those specific mistakes.) This is a question of how important it is to ensure that test takers who make common mistakes with—or bring common misunderstandings to—the targeted cognition will not be prompted to try again by the lack of a distractor that matches their mistaken answer. If less important—or unimportant—items may be acceptable with distractors designed more to dissuade quick guessing or estimation strategies than to capture authentic mistakes.

Engagingness: Items should be sufficiently engaging to test takers that they authentically undertake the intended task prompted by the item. This is a question of how important it is to ensure that items require test takers to use KSAs in a fashion similar to how they might use them in their school work or in later application, and in such a way that their interest/curiosity is sufficiently captured that they work through the item as intended. If less important—or unimportant—items may be acceptable even when test takers are more likely to look for a workaround or some sort of test taking strategy to get to a successful response.

Explicitness: Items should clearly state every element of the test taker’s work product and/or process that is required for a fully successful response. This is a question of how important it is to ensure that test takers are informed about the criteria that their work product will be judged against, including listing all of the necessary components—most likely relevant with polytomously scored items. If less important—or unimportant—items may be acceptable when they offer simpler and more straightforward instructions, trust that test takers will understand such directions as “Show all your work,” and/or are willing to assess/judge the preferability of approaches to developing the final work product.

Facial Validity: Items should resemble the work students do in class and/or the contexts in which the targeted cognition is actually used elsewhere. This is a question of how important it is to ensure that casual observers of test items recognize that they are similar to the sorts of exercises and activities that teachers assign students. If less important—or unimportant—items may be acceptable if casual observers would not see their resemblance to authentic classroom work.

Frontloading: Items should make the nature of the cognitive task clear with a direct question or a direct instruction, rather than requiring test takers to muddle through to figure out what they are charged with doing. This is a question of how important it is to ensure that the question or task being asked of test takers is clear early in their experience with the item. If less important—or unimportant—items may be acceptable even if test takers have to read through the answer options and/or start working through their constructed response before they understand what is being asked of them.

Grade Level Specificity: Items should align to the specific grade-level version of a standard. Items aligned to earlier grade levels are not aligned to the standard. (Diagnostic tests may require items aligned to a range of grade levels, and even there each item must align to its targeted standard and grade level.) This is a question of how important it is to ensure that items target the particular (grade) level version of standards or KSAs, as opposed to precursor KSAs that standards or assessment targets may be built upon. If less important—or unimportant—items may be acceptable if they elicit evidence of proficiencies at lower (grade) levels than the test’s labeled level.

Item Type: Items should be in the item type and modality that provides the best opportunity for test takers to make use of the targeted cognition. This is a question of how important it is to ensure that item types are used in ways that are most suited to demonstrations of proficiency with the standards or KSAs they target. If less important—or unimportant—items may be acceptable if item types are chosen for variation, engagement or some form of efficiency (i.e., a form of efficiency that does not consider effectiveness/alignment).  

Key (MC Items): The key must be the definitively correct answer option, not merely arguably the correct answer option. This is a question of how important it is to ensure that multiple choice items have one—and just one—definitively and clearly correct answer option (i.e., the key). If less important—or unimportant—items may be acceptable when the key is merely the best answer in the view of those in a position to approve items, or is not necessarily the only correct answer.

Novelty I (Excessive): Items should not be so novel that test takers must engage in notable learning in order to respond to them successfully. This is a question of how important it is to ensure that test takers are not confused by items that present such a new context or application for the KSAs they target that test takers must spend notable time and cognitive effort learning about those new contexts and/or applications before they can attempt to apply the targeted cognition. If less important—or unimportant—items may be acceptable when they engage test takers with interesting new applications and contexts far beyond their school or other experiences.

Novelty II (Sufficient): Items written to standards that do not suggest rote or memorized skills should be sufficiently novel that test takers cannot be prepared with drilled solution paths. This is a question of how important it is to ensure that items targeting more deliberate and careful cognition are not so similar to standard examples and problems that test takers may simply rely on their memories of having encountered them earlier. If less important—or unimportant—items may be acceptable if built around classic, tried-and-true and/or even cliché applications of the targeted cognition.

Orientation: Format the item vertically instead of horizontally. This is a question of how important it is to ensure that the visual presentation of items presents answer options in a consistent fashion, regardless of how long or short the answer options are. If less important—or unimportant—items may be acceptable when the answer options are arranged horizontally or in a grid (e.g., 2 x 2).

Partial Credit: Polytomously scored items should allocate points based on proficiency(ies) with the targeted cognition, and not based on progress towards a solution. This is a question of how important it is to ensure that multi-point items only reward levels of proficiency with the targeted cognition, as opposed to the amount of work done or proficiency with other KSAs. If less important—or unimportant—more time-consuming items may be acceptable when they award multiple points and partial credit may be awarded for precursor work that precedes exhibition of the targeted cognition.

Point Value: The number of raw points available for a successful response to an item should be roughly proportionate to the amount of time the item is expected to take. This is a question of how important it is to ensure that items do not require disproportionate time (i.e., disproportionate to their impact on test takers’ raw score). If less important—or unimportant—items of quite varying time demands may contribute equally to test takers’ scores.

Proofreading: Use correct grammar, punctuation, capitalization, and spelling. This is a question of how important it is to ensure that all items demonstrate polished use of the conventions and expectations of standard English. If less important—or unimportant—items that do not conform to language conventions may be acceptable.

Specificity: Avoid overly specific and overly general content when writing items. This is a question of how important it is to ensure that items are neither built around trivial content nor broad generalizations. If less important—or unimportant—items may be acceptable when they target memorizable content or are built around broad principles.

Text Dependency: Items targeting the application of specific reasoning skills to a scenario or text should not support paths to a successful response that do not rely on that scenario or text. This is a question of how important it is to ensure that items that are supposed to be text dependent are, in fact, text dependent. If less important—or unimportant—items may be acceptable if students can use the stimulus to respond to the item successfully but need not use its contents.

Undue Difficulty: Items should not be made more difficult through obfuscation or lack of clarity. This is a question of how important it is to ensure that directions or other elements of the presentation of an item do not accidentally or intentionally present additional burdens or hurdles (i.e., unrelated to the targeted cognition) for test takers to overcome. If less important—or unimportant—items may be acceptable when directions are less clear than they could be, or when items are thought to be brought up to an appropriate difficulty by requiring test takers to do additional work just to figure out the task they must complete.

Unidimensionality: Items should have point-biserial correlations of at least 0.3-0.5. This is a question of how important it is to ensure that all the items in a single assessment notably contribute to measuring one central cognitive trait, and to use empirically derived statistics to do so. If less important—or unimportant—items may be acceptable when they appear to be measuring something quite different from what the central thrust of the other items is measuring, or perhaps a test may be acceptable when it does not appear to measure just one single thing.
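
For illustration only: the point-biserial is the correlation between a dichotomous (0/1) item score and test takers’ total scores. A minimal Python sketch, using hypothetical data and the corrected total (the item’s own score removed) so the statistic is not inflated by the item correlating with itself:

```python
import statistics

def point_biserial(item, totals, threshold=0.3):
    """Item-total point-biserial correlation, computed against the corrected total."""
    corrected = [t - i for t, i in zip(totals, item)]          # remove the item's own score
    right = [c for c, i in zip(corrected, item) if i == 1]     # totals of successful responders
    wrong = [c for c, i in zip(corrected, item) if i == 0]     # totals of unsuccessful responders
    p = len(right) / len(item)                                 # proportion responding successfully
    s = statistics.pstdev(corrected)                           # SD of corrected totals
    r = (statistics.mean(right) - statistics.mean(wrong)) / s * (p * (1 - p)) ** 0.5
    return r, r >= threshold

# Hypothetical data: one item's 0/1 scores and the matching total test scores
item = [1, 0, 1, 1, 0, 1, 1, 0]
totals = [9, 4, 8, 7, 5, 9, 6, 3]
print(point_biserial(item, totals))  # high correlation here, so the flag is True
```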

Word Choice: Avoid using specific determiners (e.g., always, never, completely, and absolutely), choices identical to or resembling words in the stem, and/or grammatical inconsistencies. This is a question of how important it is to ensure that items do not turn on the limitations created by those “specific determiners” and that the correctness of answer options is not communicated by echoing language through the item. If less important—or unimportant—items may be acceptable when they hinge on inversions of the most conventional expressions of an idea or focus narrowly on edge cases.

***********************

Notes:

  1. These explanations of content and cognition traits are intended to support those responding to the Content and Cognition Traits of High Quality Valid Items survey. These explanations are slightly fuller than the single-sentence explanations included in the survey, but do not comprise full and complete explanations of any of these traits. Of course, application of the principle underlying each survey item would also rest on professional judgment about its contextual suitability and the thresholds or degree of strictness in application.

  2. Only two-thirds of the traits listed above are actually subjects of the Content and Cognition Traits of High Quality Valid Items study. The other third are drawn from Haladyna et al.’s prior guidance lists and are intended to act as control or anchor items to allow comparisons between those older offerings of item development guidance principles and the 21 traits at issue in this study.

  3. Yes, careful review of the explanations above reveals that the implications of many of these principles are quite similar. However, they are listed separately because their consequences for item development may best be spotted and/or prevented from a variety of different angles.