Description
Write a 4-5 paragraph explanation of the three most important considerations related to sampling that you would need to address in your upcoming project (measuring employee satisfaction and happiness in nonprofit organizations). Explain how you might ensure that your sample represents the population of interest.
Journal of Occupational and Organizational Psychology (2009), 82, 639-659
© 2009 The British Psychological Society
www.bpsjournals.co.uk
The use of personality test norms in work settings:
Effects of sample size and relevance
Robert P. Tett*, Jenna R. Fitzke, Patrick L. Wadlington,
Scott A. Davies, Michael G. Anderson and Jeff Foster
Department of Psychology, University of Tulsa, Tulsa, Oklahoma, USA
University of Wisconsin-Madison, Madison, Wisconsin, USA
Pearson Testing, Bloomington, Minnesota, USA
CPP Inc., Mountain View, California, USA
Hogan Assessment Systems, Tulsa, Oklahoma, USA
The value of personality test norms for use in work settings depends on norm sample
size (N) and relevance, yet research on these criteria is scant and corresponding
standards are vague. Using basic statistical principles and Hogan Personality Inventory
(HPI) data from 5 sales and 4 trucking samples (N range = 394-6,200), we show that
(a) N > 100 has little practical impact on the reliability of norm-based standard scores
(max = ±10 percentile points in 99% of samples) and (b) personality profiles vary more
from using different norm samples, between as well as within job families. Averaging
across scales, T-scores based on sales versus trucking norms differed by 7.3 points,
whereas maximum differences averaged 7.4 and 7.5 points within the sets of sales and
trucking norms, respectively, corresponding in each case to approximately ±14
percentile points. Slightly weaker results obtained using nine additional samples from
clerical, managerial, and financial job families, and regression analysis applied to the 18
samples revealed demographic effects on four scale means independently of job family.
Personality test developers are urged to build norms for more diverse populations, and
test users, to develop local norms to promote more meaningful interpretations of
personality test scores.
Personality test scores are often interpreted in employment settings with reference
to scale norms (i.e. means and standard deviations; Bartram, 1992; Cook et al., 1998;
Müller & Young, 1988; Van Dam, 2003). Accordingly, the accuracy of norm-transformed
scores in capturing an individual’s relative standing on a set of personality
scales rests on the quality of the underlying norms. Two critical and generally
recognized concerns regarding norm use are (a) the size of the normative sample (N)
and (b) the relevance of the normative sample to the population to which the given
test taker belongs. Despite being recognized as important, sample size and population
relevance (i.e. representativeness) have received little research attention and standards
regarding these qualities are ambiguous. In this article, we show what happens when
personality profiles are generated under varying conditions regarding the size and
source of the normative sample, with the overall aim of refining best practices in the
use of personality test norms. We begin by considering how such norms are used in
work settings.
* Correspondence should be addressed to Dr Robert P. Tett, Department of Psychology, University of Tulsa, Tulsa, OK 74104-3189, USA (e-mail: robert-tm@utulsa.edu).
DOI:10.1348/096317908X336159
Uses of personality test norms in the workplace
A scale score, by itself, reveals little as to the location of an individual on the measured
dimension. Standard scores, such as z or T, use a norm sample mean and standard
deviation to clarify where an individual respondent falls on the measured construct
relative to other people. Personality test norms have several work-related applications.
First, they can facilitate individualized developmental feedback. For example, workers
may be better prepared to interact with others in a team or with customers if they have a
clearer understanding of their relative standing on traits relevant to such interactions
(e.g. emotional control, sociability, tolerance). Second, personality test norms can
facilitate selection decisions. Top-down hiring does not require test norms, but
exclusionary strategies based on test score cut-offs (e.g. hiring from among applicants
scoring above a given cut-off) call for normative comparisons. Norms are especially
important in hiring when applicants are few in number, as this mitigates reliance on
top-down methods. Third, norms can help an organization judge the overall standing of a
targeted work-group (e.g. a sales team) relative to a larger, more general, job-relevant
population (e.g. American sales people), as a basis, perhaps, for determining future
hiring standards. Success in all such norm applications rests on norm quality. Best
practices in this area are reviewed next.
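As a minimal sketch of how a norm sample mean and standard deviation turn a raw scale score into a standard score, the following Python converts raw scores to z, T, and percentile values. The norm values (mean 20, SD 4) and the function names are hypothetical, chosen only for illustration, and the percentile step assumes the trait is normally distributed in the norm population.

```python
from statistics import NormalDist


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm mean, in norm SD units."""
    return (raw - norm_mean) / norm_sd


def t_score(raw, norm_mean, norm_sd):
    """T-score: a z-score rescaled to mean 50 and SD 10."""
    return 50.0 + 10.0 * z_score(raw, norm_mean, norm_sd)


def percentile(t):
    """Percentile rank of a T-score, assuming a normal distribution."""
    return 100.0 * NormalDist(mu=50.0, sigma=10.0).cdf(t)


# Hypothetical norms for a single scale (not taken from any test manual).
NORM_MEAN, NORM_SD = 20.0, 4.0

for raw in (16.0, 20.0, 26.0):
    t = t_score(raw, NORM_MEAN, NORM_SD)
    print(f"raw = {raw:4.1f}  T = {t:4.1f}  percentile = {percentile(t):4.1f}")
```

The same raw score yields a different T-score under a different norm mean or SD, which is why the choice of norm sample matters for every application listed above.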
Best practices regarding test norms
Virtually every book on psychological testing offers recommendations on the use of
test norms (e.g. Anastasi & Urbina, 1997; Crocker & Algina, 1986; Kline, 1993). The
most consistent message is that the norm sample should be relevant to the individual
whose scores are being interpreted. Some (e.g. Crocker & Algina, 1986; Kline, 1993)
articulate further that norm samples are more credible if stratified in terms of variables
most highly correlated with the test. Accordingly, to permit reasoned judgments of
norm relevance, test developers are urged to report key demographic characteristics
(e.g. mean age, gender composition, job category). Also important to report are the
sampling strategies used, the time frame of norm data collection, and the response
rate, as all such information speaks to the representativeness of the normative sample
with respect to the targeted population.
The 1999 Standards for Educational and Psychological Testing specify that:
Norms, if used, should refer to clearly described populations. These populations should include
individuals or groups to whom test users will ordinarily wish to compare their own examinees.
(Standard 4.5, p. 55)
Reports of norming studies should include precise specification of the population that was
sampled, sampling procedures and participation rates, any weighting of the sample, the dates of
testing, and descriptive statistics. The information provided should be sufficient to enable users to
judge the appropriateness of the norms for interpreting the scores of local examinees. Technical
documentation should indicate the precision of the norms themselves. (Standard 4.6, p. 55)
Local norms should be developed when necessary to support test users’ intended
interpretations. (Standard 13.4, p. 146)
Focusing on test use for the purpose of hiring, the Principles for the Validation and
Use of Personnel Selection Procedures (SIOP, 2003) state that:
Normative information relevant to the applicant pool and the incumbent population should
be presented when appropriate. The normative group should be described in terms of its
relevant demographic and occupational characteristics and presented for subgroups with
adequate sample sizes. The time frame in which the normative results were established should
be stated (p. 48).
Two points warrant discussion here. First, the issue of sample size is raised in the
Principles, but what counts as ‘adequate’ N is unclear. Statistical theory (readily
confirmed in practice) tells us that the reliability of norms is closely tied to N. Lacking
specifics, practitioners are left to define ‘adequate’ on their own, which undermines
standardization of sound testing practice and norm use. Second, the Standards
encourage development of local norms when necessary to support test users’
intended interpretations. Ambiguity, again, precludes standardized practice. Our
primary aim in this article is to clarify what counts as ‘adequate’ N and ‘sufficient’
representativeness in a normative sample, as a basis for refining use of personality
norms in work settings.
Current practices regarding personality test norms
In light of the recognized standards regarding norm use, we examined technical
manuals for eight popular personality instruments: the Adjective Checklist (ACL),
California Psychological Inventory (CPI), HPI, Jackson Personality Inventory-Revised
(JPI-R), NEO Personality Inventory-R (NEO-PI-R form S), Occupational
Personality Questionnaire (OPQ), Personality Research Form (PRF), and 16PF Select
(16PF). The goal of our review was to assess the degree to which the noted standards
regarding norms are being met in practice. The manuals were reviewed primarily for
norm sample size and the reporting of demographics and sampling procedures. We
also took note of the number of norm samples reported and whether or not dates of
testing and response rates were provided. Results of our review are provided in Tables
1 and 2.
Several observations bear comment, the first two regarding results in Table 1 and
the remainder with respect to Table 2. First, a variety of norm groups is available for
five of the tests, including, most frequently, samples of high school students, college
students, and assorted occupational categories. Second, normative sample sizes are
large, on the whole, with averages per test ranging from 695 to 22,023. Third, with respect
to demographics, gender composition is most often reported, followed by education
level. Least often reported are ethnicity and age. Some manuals (e.g. OPQ, CPI, and
The meaning of ‘local norms’ varies by application. In cross-cultural research, for example, they are norms specific to a
country or language. In this article, we use the term to denote norms specific to a particular job or job type within a specific
organization.
We were unable to obtain technical manuals for two other popular tests: the Guilford-Zimmerman Temperament Survey and
Multidimensional Personality Questionnaire; hence, we offer no summary for those tests.
Table 5. Normative means and standard deviations for three clerical, three managerial, and three
financial samples

Job family/HPI scale    Mean     SD       Mean     SD       Mean     SD
Clerical                (N = 13,450)      (N = 11,299)      (N = 6,406)
  Adjustment            31.86    4.16     31.98    4.08     32.11    4.19
  Ambition              24.95    3.63     26.95    2.52     25.93    3.15
  Sociability           13.46    4.30     16.38    4.08     14.36    4.40
  Likeability           20.96    1.19     20.93    1.14     20.94    1.20
  Prudence              24.43    3.44     23.40    3.76     24.04    3.67
  Intellectance         15.67    4.54     18.35    4.08     17.62    4.31
  Learning approach     11.19    2.55     11.16    2.50     11.31    2.48
Managerial              (N = 8,089)       (N = 2,032)       (N = 777)
  Adjustment            31.42    4.44     32.73    4.00     31.34    4.49
  Ambition              27.15    2.49     27.68    2.16     27.11    2.57
  Sociability           14.63    4.51     16.55    4.98     14.92    4.49
  Likeability           20.29    1.40     20.40    1.44     20.48    1.33
  Prudence              24.17    3.63     23.45    3.70     23.09    3.79
  Intellectance         16.85    4.59     18.65    4.83     17.47    4.26
  Learning approach     11.08    2.74     11.38    2.71      9.73    3.18
Financial               (N = 4,484)       (N = 800)         (N = 609)
  Adjustment            31.93    4.21     32.15    4.32     31.19    4.65
  Ambition              26.69    2.68     27.75    1.82     26.72    2.12
  Sociability           14.97    4.25     17.02    4.04     16.27    4.17
  Likeability           20.82    1.15     20.63    1.31     20.61    1.45
  Prudence              24.46    3.57     23.22    3.80     22.64    3.90
  Intellectance         16.67    4.40     16.46    4.51     16.69    4.28
  Learning approach     10.78    2.66     10.54    2.71     10.67    2.60
offering practical guidance on norm use. We adopted a similar strategy in replication
using the nine samples from clerical, managerial, and financial job families. Rather than
draw comparisons between jobs, however, we focused on within-job comparisons,
creating profiles for individuals falling at the HPI means from one sample using norms
combined across the remaining two samples per job family.
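The within-job comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the article does not say how norms were combined, so `combine_norms` (a hypothetical helper) pools two samples by weighting on N and adding between-sample variance to within-sample variance. The inputs are the sociability means and SDs for the three clerical samples as read from Table 5.

```python
from math import sqrt


def combine_norms(samples):
    """Pool (n, mean, sd) tuples into one norm group.

    Assumption: the pooled variance is the N-weighted average of each sample's
    variance plus the squared distance of its mean from the pooled mean.
    """
    n_total = sum(n for n, _, _ in samples)
    mean = sum(n * m for n, m, _ in samples) / n_total
    var = sum(n * (sd ** 2 + (m - mean) ** 2) for n, m, sd in samples) / n_total
    return n_total, mean, sqrt(var)


def t_score(raw, norm_mean, norm_sd):
    """T-score of a raw score against a given norm mean and SD."""
    return 50.0 + 10.0 * (raw - norm_mean) / norm_sd


# Sociability, clerical samples, as (N, mean, SD) read from Table 5.
first = (13450, 13.46, 4.30)
second = (11299, 16.38, 4.08)
third = (6406, 14.36, 4.40)

# Score a person at the first sample's mean against norms pooled from the other two.
_, ref_mean, ref_sd = combine_norms([second, third])
t_first = t_score(first[1], ref_mean, ref_sd)
print(f"pooled norm mean = {ref_mean:.2f}, SD = {ref_sd:.2f}, T = {t_first:.1f}")
```

A person who is exactly average in one clerical sample lands below T = 50 when scored against the other two, which is the kind of within-family discrepancy the replication was designed to expose.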
Results
Upper and lower bounds of intervals around a T-score μ = 50 under various Ns and
levels of certainty are reported in Table 6. The table shows, for example, that when
N = 100, 80% of sample means are expected to fall between 48.7 and 51.3. Increasing N
to 300 yields 49.3 and 50.7 as the lower and upper 10% boundaries. What is perhaps
most notable in this table is the stability of the sample mean as an estimate of μ with even modest
sample sizes. With N ^ 100, t un
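The interval bounds quoted from Table 6 follow from the standard error of a mean on the T-score scale (SD = 10). The sketch below is a reconstruction under the assumption that sample means are normally distributed around μ = 50; the function name `t_mean_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def t_mean_interval(n, certainty, mu=50.0, sigma=10.0):
    """Central interval expected to contain `certainty` of sample means of size n."""
    standard_error = sigma / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + certainty / 2.0)
    return mu - z * standard_error, mu + z * standard_error


for n in (100, 300):
    lower, upper = t_mean_interval(n, 0.80)
    print(f"N = {n}: 80% of sample means between {lower:.1f} and {upper:.1f}")
```

With these assumptions the output reproduces the figures in the text: 48.7 to 51.3 for N = 100 and 49.3 to 50.7 for N = 300. Because the standard error shrinks with the square root of N, tripling the sample narrows the interval by well under one T-score point.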
The effects of population relevance on HPI profiles are depicted in Figures 1-3.
In Figure 1, HPI T-score profiles are plotted for both the combined sales and the
combined trucking samples, based on hypothetical raw scores falling at the mean on
each scale and using the other combined group as the reference sample in each case.
Differences between samples vary across HPI scales. The largest difference arises on
sociability (15.4 T-score points) and the smallest difference on prudence (0.4 points). The
average difference is 7.3 T-score points. Figure 2 depicts profiles for the five sales
samples and Figure 3, for the four trucking samples. Notable differences are evident
within each figure. T-scores on ambition and sociability in the sales group, in particular,
vary considerably across samples (range = 40.0-55.4 for ambition; 42.7-57.6 for
sociability). The largest differences within the trucking norm set arise for learning
approach (range = 44.1-56.1). The average maximum differences in the sales and
trucker groups (i.e. averaging across the seven scales in each case) are 7.4 and 7.5,
respectively.
Within-job-family differences in HPI profiles for each of the clerical, managerial, and
financial job families are depicted in Figure 4. The largest differences are evident in the
clerical samples, where, for example, T-scores on ambition, sociability, and intellectance
vary by more than 10 points between samples A and B. The largest differences in the
managerial samples are for sociability (7.2 points), intellectance (7.0), and learning
approach (6.6); and the largest differences in the financial samples are for sociability
(8.6), prudence (8.4), and ambition (7.1). Averaging differences across all seven HPI
scales within job families yields 5.5 for clerical jobs, 5.2 for managerial jobs, and 4.6
for financial jobs. These are smaller than the averages from sales and trucking (i.e. 7.4
and 7.5), but 10 of the 21 HPI scales-in-jobs exceed the ±2.5-point standard adopted
here, corresponding to a 10% decision error rate with a T = 50 cut-score.
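To see how T-score discrepancies translate into percentile points, the sketch below computes the share of a normal distribution lying between T = 50 and T = 50 + d. It assumes normally distributed T-scores, and it reads a "±" figure as half of a between-norm difference on either side of 50; that reading is an interpretation, although it reproduces the article's numbers.

```python
from statistics import NormalDist

T_SCALE = NormalDist(mu=50.0, sigma=10.0)


def percentile_shift(t_points):
    """Percentile points between T = 50 and T = 50 + t_points."""
    return 100.0 * (T_SCALE.cdf(50.0 + t_points) - 0.5)


# The +/-2.5-point standard: roughly 10% of test takers fall within 2.5
# T-score points on one side of a T = 50 cut-score.
print(f"2.5 T-score points  -> {percentile_shift(2.5):.1f} percentile points")

# Half of the 7.3-point sales versus trucking difference.
print(f"3.65 T-score points -> {percentile_shift(7.3 / 2):.1f} percentile points")
```

Under these assumptions a 2.5-point shift moves about 10% of people across a T = 50 cut-score, and half of the 7.3-point sales versus trucking gap corresponds to about 14 percentile points.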
Discussion
Our goal was to clarify best practices regarding use of personality tests in work settings
by assessing the impact of normative sample N and population relevance on the
reliability of judged personality test scores. Where personality scores are standardized
Figure 1. HPI profiles for sales and trucking (T-scores, 40.0-60.0, plotted by HPI scale for the sales and trucking samples).
can be similar to those derived between job categories, challenging
reliance solely on job type as a basis for judging the suitability of a normative sample.
Underlying the noted differences are any of a host of demographic and situational
variables with possible links to personality scale scores. Identifying all the variables that
might explain the differences depicted in Figures 1-4 is beyond the scope of the current
discussion. Some possibilities, based on available demographics, are reported in Table 7.
To assess the linear effects of these variables on the HPI means, we regressed the means,
per scale, on to proportion white, proportion black, proportion male, and mean age
(N = 18 samples). Differences among the five job families were assessed by entering
four corresponding dummy-coded variables in the first step. Results are reported in
Table 8. Step 1 results show that the sample means vary among the five job families for
all HPI scales except adjustment and prudence. Additional effects are evident in results
from Step 2. Specifically, after controlling for job family effects, ambition means are
lower in samples with higher %blacks, likeability means are higher in samples with
higher %whites, prudence means are lower in older samples, and learning approach
means are lower in samples with higher %males. Whether or not these findings replicate
in larger sample sets (i.e. with N > 18 samples) is a matter for further research.
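With only 18 samples and four dummy-coded job-family predictors in Step 1, the adjustment applied to R² is substantial. The sketch below uses the conventional adjusted R² formula; the article does not state which formula was used, so treating it as the conventional one is an assumption. The adjusted values .61 (ambition) and -.19 (prudence) are the Step 1 entries from Table 8.

```python
def adjusted_r2(r2, n, p):
    """Conventional adjusted R-squared for n cases and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def unadjusted_r2(adj_r2, n, p):
    """Invert the adjustment to recover the implied raw R-squared."""
    return 1.0 - (1.0 - adj_r2) * (n - p - 1) / (n - 1)


N_SAMPLES = 18   # norm samples entering the regression
N_DUMMIES = 4    # dummy codes for the five job families (Step 1)

for scale, adj in (("Ambition", 0.61), ("Prudence", -0.19)):
    raw = unadjusted_r2(adj, N_SAMPLES, N_DUMMIES)
    print(f"{scale}: adjusted R2 = {adj:+.2f} implies raw R2 of about {raw:.2f}")
```

A negative adjusted R², as reported for prudence, is therefore not an error: it arises whenever the raw R² is smaller than the penalty for the number of predictors, which is easy to trigger with so few cases.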
Conscientiousness is relevant to performance in most jobs (e.g. Barrick and Mount, 1991).
Missing mean ages for the three clerical samples were substituted by the mean from the remaining 15 samples. Results for
mean age based on the 15 samples reporting useable data were very similar to those obtained using mean substitution and are
available on request. Also, the remaining ethnic groups were not assessed owing to their relatively small proportions within the
normative samples.
Table 7. Norm sample demographics
Ethnicity (%)
Sample
Gender (%)
White
Black
Hisp.
Asian
46.1
85.1
73.9
67.4
66.7
65.0
40.0
10.9
4.3
5.0
15.2
8.3
8.4
2.4
0.6
3.0
3.4
0.6
8.3
2.9
0.6
0.3
0.6
0.0
0.5
44.5
57.3
48.6
47.4
36.8
48.9
55.5
42.7
51.4
52.6
63.2
51.1
33.1
33.3
30.4
28.3
34.3
32.5
Native Amer.
Male
Female
Mean age
Sales
A
B
C
D
E
Weighted mean
Trucking
A
B
C
D
Weighted mean
Clerical
A
B
C
Weighted mean
Managerial
A
B
C
Weighted mean
Financial
A
B
C
Weighted mean
6.9
17.3
16.1
16.7
23.2
9.5
81.0
79.6
50.5
25.9
71.7
li.e
33.8
35.5
15.3
4.8
6.1
15.7
35.5
9.3
4.8
0.4
0.0
2.6
3.4
0.0
2.1
0.0
0.4
0.3
59.1
98.8
99.2
96.8
72.9
40.9
1.2
0.8
3.2
27.1
37.5
36.9
39.2
36.8
37.5
78.3
60.1
56.4
67.2
4.1
7.4
10.4
6.6
10.4
22.1
23.0
17.2
6.2
7.7
6.9
6.9
1.0
2.7
3.3
2.1
11.5
44.0
28.1
26.7
88.5
56.0
71.9
73.3
NA
NA
NA
NA
50.6
50.9
69.0
52.0
27.4
43.4
157
29.5
19.1
3.8
10.4
15.6
2.3
1.9
3.4
2.3
0.7
0.0
1.5
0.6
56.3
49.1
67.1
55.7
43.7
50.9
32.9
44.3
32.7
36.6
36.0
33.7
66.0
77.1
86.0
69.5
21.0
10.8
2.4
17.7
7.7
6.8
6.4
7.4
5.2
5.2
0.1
0.2
0.0
0.1
38.3
67.6
9.5
39.3
61.7
32.4
90.5
60.7
27.7
37.0
34.3
29.6
5.2
5.2
Our point here is that comparing an individual to norms from the same job family can,
nonetheless, pose uncertainties owing to other characteristics of the norm sample that
may also be related to personality scale scores.
Independent research supports current findings suggesting that personality scores
are related to job category (e.g. RIASEC; Barrick, Mount, & Gupta, 2003) and
demographic characteristics most often described in test manuals (Roberts, Walton, &
Viechtbauer, 2006). Other work-related correlates of personality have recently been
identified. Judge and Cable (1997) report that personality is related to organizational
culture preferences such that, for example, conscientious people prefer detail-oriented
and results-oriented cultures. Thus, means for conscientiousness (and more specific
traits falling within that category) can be expected to be elevated in organizations with
those types of cultures. Similar results linking personality with organizational culture
preferences have been reported by Warr and Pierce (2004) and Ang, van Dyne, and Koh
(2006). Along similar lines, Furnham, Petrides, and Tsaousis (2005) found that the Big
Five, especially Openness to Experience, are related to work values pertaining to
cultural diversity. To the degree that organizational culture and work values each affect
personality scores independently of job type, personality test developers are urged
to report details of norm sample culture preferences and values as a basis for judging
norm relevance in work settings.

Table 8. Regression results for effects of job family and demographic variables on HPI scale means
(N = 18 samples)

                      Step 1 (a):     Step 2 (b): %white, %black, %male, mean age (c)
                      Job family
HPI scale             Adjusted R²     Adjusted R²   Change in adjusted R²   Sig. predictor   β
Adjustment             .02
Ambition               .61**           .70           .09                     %black           -.37*
Sociability            .60**
Likeability            .76**           .81           .05                     %white            .25*
Prudence              -.19             .15           .34                     mean age         -.78*
Intellectance          .45*
Learning approach      .63**           .73           .10                     %male            -.53*

*p < .05; **p < .01; two-tailed.
(a) Forced entry.
(b) Stepwise entry.
(c) Mean substitution for three clerical samples.
A potentially more important variable affecting normative means on personality
scales may be reliance on job applicants versus incumbents. The question of faking in
personality assessment has been a dominant focus of investigation for many years. There
is now general consensus that people can fake when instructed to do so (e.g.
Viswesvaran & Ones, 1999). More recently, the focus has shifted to whether or not
people actually do fake in selection settings. Some (e.g. Arthur, Woehr, & Graziano,
2000; Hogan, Barrett, & Hogan, 2007; Hough & Schneider, 1996; Ones & Viswesvaran,
1998; Viswesvaran & Ones, 1999) downplay the effects of voluntary faking, whereas
others (Griffith, Chmielowski, & Yoshita, 2007; Rosse, Stecher, Miller, & Levin, 1998;
Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001; Tett & Christiansen, 2007) argue that
it is indeed problematic. Summarizing the ‘do-fake’ literature in selection contexts, Tett et al.
(2006) report a meta-analytic mean d effect size of .35, averaging across the Big Five
(excluding Openness, whose effect is close to 0, yields a mean of 0.52). This result supports
the applicant/incumbent distinction raised in the SIOP Principles regarding norm use, and
clarifies that test developers and publishers need to differentiate norms based on this
distinction. Specifically, if a personality test is being used in hiring, the relevant norm group
is one drawn from an applicant sample, as norms based on incumbents can be expected to
yield (mostly) lower means and, hence, elevated T-scores (or their equivalent) in
applicants. If targeted for use in developing personnel, on the other hand, personality test
scores bear comparison to norms derived from incumbents, as reliance on applicant norms
will likely underestimate individuals’ true standing.
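The practical size of the applicant/incumbent gap can be illustrated by converting the reported d values into T-score points. This is a rough sketch that assumes equal standard deviations in the two groups and that applicants score higher than incumbents, so that d maps directly onto the T-score scale (10 points per SD); the function names are hypothetical.

```python
from statistics import NormalDist


def t_against_incumbent_norms(d, t_applicant_norms=50.0):
    """T-score an applicant receives when scored on incumbent norms.

    Assumption: applicant mean = incumbent mean + d * SD, with equal SDs.
    """
    return t_applicant_norms + 10.0 * d


def percentile(t):
    """Percentile rank of a T-score under a normal distribution."""
    return 100.0 * NormalDist(mu=50.0, sigma=10.0).cdf(t)


# d values reported by Tett et al. (2006), as cited in the text.
for d in (0.35, 0.52):
    t = t_against_incumbent_norms(d)
    print(f"d = {d:.2f}: average applicant scores T = {t:.1f} "
          f"(percentile {percentile(t):.0f}) on incumbent norms")
```

An applicant who is exactly average among applicants would thus appear noticeably above average when interpreted against incumbent norms, which is the distortion the paragraph above warns about.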
All norm samples presented here, drawn from the HAS database, include only job applicants; the applicant/incumbent
distinction may be relevant to other tests used in work settings, particularly those lacking applicant norms.
That personality scores may be related to a diverse array of demographic and
situational factors and, plausibly, to interactions among those variables, raises concerns
regarding the generalizability of normative samples as, with increasing numbers of
distinct correlates, comes a decreasing likelihood that a normative sample reported in a
test manual is relevant to any individual not included in that sample. This goes beyond the
issue of whether or not the norm sample is described in detail; the point is that, regardless
of such descriptive detail, norm samples are inherently specific to populations identified
mostly by convenience, which are very likely to be different from the population of
interest in specific norm applications, namely, in the case of selection, the population of
local applicants, or, in the case of personnel development, local incumbents. The concept
of representativeness in judging norm suitability is, in this light, a fleeting ideal, and
assuming representativeness given only a limited set of norm sample descriptors (e.g. job
type, age, race, and gender composition) is likely to engender false interpretations
regarding an individual’s or group’s standing on targeted personality traits relative to the
true local population.
Our findings are generally consistent with the spirit of the Standards and SIOP
Principles regarding norms, noted in the introduction. They are particularly supportive of
more restrictive recommendations offered in conjunction with the International
Personality Item Pool (http://ipip.ori.org/newNorms.htm), which explicitly offers
no norms:
One should be very wary of using canned ‘norms’ because it isn’t obvious that one could ever
find a population of which one’s present sample is a representative subset. Most ‘norms’ are
misleading, and therefore they should not be used.
Far more defensible are local norms, which one develops oneself. For example, if one wants to
give feedback to members of a class of students, one should relate the score of each individual to
the means and standard deviations derived from the class itself.
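The local-norm approach in the quoted passage can be sketched in a few lines of Python: the mean and standard deviation come from the group itself, and each member is scored against them. The raw scores below are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the passage does not specify one.

```python
from statistics import mean, stdev


def local_norms(raw_scores):
    """Mean and sample SD computed from the local group itself."""
    return mean(raw_scores), stdev(raw_scores)


def to_t_scores(raw_scores):
    """Express each raw score as a T-score relative to the local norms."""
    local_mean, local_sd = local_norms(raw_scores)
    return [50.0 + 10.0 * (x - local_mean) / local_sd for x in raw_scores]


# Hypothetical raw scale scores for one class or work group.
class_scores = [12, 15, 9, 20, 17, 14, 11, 18]

for raw, t in zip(class_scores, to_t_scores(class_scores)):
    print(f"raw = {raw:2d}  local T = {t:5.1f}")
```

By construction the local T-scores average 50 with an SD of 10, so each person's standing is expressed relative to the group actually of interest rather than to a published norm sample of uncertain relevance.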
Conclusions
Our review of current standards and practice regarding use of personality test norms
and our findings, driven by basic statistical principles and real data, suggest the following
conclusions.
(1) Sample size has little practical impact on the reliability of normative means and on
standard scores and corresponding percentiles thereby derived, once an N of
around 300 is reached. Test users need not be overly wary of norms based on N of
even 100. Test developers are urged to seek norms for more diverse groups based
on modest Ns rather than seeking larger samples per se.
(2) Beyond N = 100, norm sample composition becomes the more important
consideration. Notable discrepancies in personality profiles are likely not only
between job families (e.g. sales vs. trucking in the present case) but also within job
families (based on samples from different organizations). Such differences within
categories raise concerns about the usefulness of norms provided in test manuals,
which typically offer little more than job category and basic demographic
descriptors as bases for judging norm suitability.
(3) Personality scores vary for reasons other than those targeted in standards and
principles regarding norm use. Organizational culture, work values, and incumbent
versus applicant settings, all of which vary independently of job category and basic
demographics, are also worthy of consideration in judgments of norm relevance.
(4) The diversity and complexity of factors affecting personality scale scores encourage
use of local norms over those provided in test manuals. That N need not be
impractically large (e.g. 100) favours such efforts in furthering organizationally
meaningful personality test score interpretations, especially for use in personnel
development and selection.
(5) Use of general norms may have merit at the group level (e.g. assessing where the
sales group at Company A stands in relation to national sales people). Special efforts
are required, however, to ensure that the general population, defined explicitly in
terms of diverse personality correlates (e.g. job category, demographics, applicant
vs. incumbent, organizational culture, work values), is suitably represented by the
normative sample. Strategies for developing such norms are worthy of future
research.
Acknowledgements
An earlier version of this article was presented at the 21st Annual Conference of the Society for
Industrial and Organizational Psychology, May, 2006, Dallas, TX.
References
American Psychological Association (1999). Standards for educational and psychological
testing. Washington, DC: American Psychological Association.
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
Ang, S., van Dyne, L., & Koh, C. (2006). Personality correlates of the four-factor model of cultural
intelligence. Group and Organization Management, 31, 100-123.
Arthur, W., Woehr, D. J., & Graziano, W. G. (2000). Personality testing in employment settings:
Problems and issues in the application of typical selection practices. Personnel Review, 30,
657-676.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance:
A meta-analysis. Personnel Psychology, 44, 1-26.
Barrick, M. R., Mount, M. K., & Gupta, R. (2003). Meta-analysis of the relationship between the
five-factor model of personality and Holland’s occupational types. Personnel Psychology, 56,
45-74.
Bartram, D. (1992). The personality of UK managers: 16PF norms for short-listed applicants.
Journal of Occupational and Organizational Psychology, 65, 159-172.
Cook, M., Young, A., Taylor, D., O’Shea, A., Chitashvili, M., Lepeska, V., et al. (1998). Personality
profiles of managers in former Soviet countries: Problem and remedy. Journal of Managerial
Psychology, 13, 567-579.
Costa, P. T., & McCrae, R. R. (1992). NEO PI-R professional manual. Lutz, FL: Psychological
Assessment Resources, Inc.
Crocker, L., & Algina, J. (1986). Norms and standard scores. In Introduction to classical and
modern test theory (Chapter 19). New York: Harcourt, Brace, and Jovanovich.
Ferguson, G. A. (1959). Statistical analysis in psychology and education. New York:
McGraw-Hill.
Furnham, A., Petrides, K. V., & Tsaousis, I. (2005). A cross-cultural investigation into the
relationships between personality traits and work values. Journal of Psychology:
Interdisciplinary and Applied, 139, 5-32.
Gough, H. G., & Bradley, P. (1996). California Psychological Inventory manual. Palo Alto, CA:
Consulting Psychologists Press.
Gough, H. G., & Heilbrun, A. B., Jr. (1983). The Adjective Checklist manual. Mountain View, CA:
Consulting Psychologists Press.
Griffith, R. L., Chmielowski, T., & Yoshita, Y. (2007). Do applicants fake? An examination of the
frequency of applicant faking behavior. Personnel Review, 36, 341-355.
Hogan, J., Barrett, P., & Hogan, R. (2007). Personality measurement, faking, and employment
selection. Journal of Applied Psychology, 92, 1270-1285.
Hogan, R., & Hogan, J. (1995). Hogan Personality Inventory manual (2nd ed.). Tulsa, OK: Hogan
Assessment Systems.
Hough, L. M., & Schneider, R. J. (1996). Personality traits, taxonomies, and applications in
organizations. In K. R. Murphy (Ed.), Individual differences and behavior in organizations
(pp. 31-88). San Francisco, CA: Jossey-Bass.
Jackson, D. N. (1994). Jackson Personality Inventory - Revised manual. Port Huron, MI: Sigma
Assessment Systems.
Jackson, D. N. (1999). Personality Research Form manual. Port Huron, MI: Sigma Assessment
Systems.
Judge, T. A., & Cable, D. M. (1997). Applicant personality, organizational culture, and organizational
attraction. Personnel Psychology, 50, 359-394.
Kelly, M. L. (Ed.). (1999). 16PF Select manual. Champaign, IL: Institute for Personality and Ability
Testing.
Kline, P. (1993). The handbook of psychological testing. New York: Routledge.
Muller, J., & Young, R. (1988). An evaluation of psychological tests in the selection process for EEG
technician trainees. American Journal of EEG Technology, 28, 1Í7-158.
Ones, D. S., & Viswesvaran, C. (1998). The effects of social desirability and faking on personality
and integrity assessment for personnel selection. Human Performance, 11, 245-269.
Roberts, B. W., Walton, K. E., & Viechtbauer, W. (2006). Patterns of mean-level change in
personality traits across the life course: A meta-analysis of longitudinal studies. Psychological
Bulletin, 132, 1-25.
Rosse, J. G., Stecher, M. D., Miller, J. L., & Levin, R. A. (1998). The impact of response distortion on
preemployment personality testing and hiring decisions. Journal of Applied Psychology, 83,
634-644.
SHL Group (2006). Occupational Personality Questionnaire 32 technical manual. Thames
Ditton: SHL Group.
Society for Industrial and Organizational Psychology (2003). Principles for the validation and use
of personnel selection procedures (4th ed.). Bowling Green, OH: SIOP.
Stark, S., Chernyshenko, O. S., Chan, K., Lee, W. C., & Drasgow, F. (2001). Effects of the testing
situation on item responding: Cause for concern. Journal of Applied Psychology, 86, 943-953.
Tett, R. P., Anderson, M. G., Ho, C. L., Yang, T. S., Huang, L., & Hanvongse, A. (2006). Seven nested
questions about faking on personality tests: An overview and interactionist model of item-level
response distortion. In R. Griffith (Ed.), A closer examination of applicant faking behavior.
Greenwich, CT: Information Age Publishing.
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response
to Morgeson, Campion, Dipboye H
The British Psychological Society
Journal of Occupational and Organizational Psychology (2009), 82, 639-659
© 2009 The British Psychological Society
www.bpsjournals.co.uk

The use of personality test norms in work settings:
Effects of sample size and relevance

Robert P. Tett*, Jenna R. Fitzke, Patrick L. Wadlington,
Scott A. Davies, Michael G. Anderson and Jeff Foster

Department of Psychology, University of Tulsa, Tulsa, Oklahoma, USA
University of Wisconsin-Madison, Madison, Wisconsin, USA
Pearson Testing, Bloomington, Minnesota, USA
CPP Inc., Mountain View, California, USA
Hogan Assessment Systems, Tulsa, Oklahoma, USA
The value of personality test norms for use in work settings depends on norm sample
size (N) and relevance, yet research on these criteria is scant and corresponding
standards are vague. Using basic statistical principles and Hogan Personality Inventory
(HPI) data from 5 sales and 4 trucking samples (N range = 394-6,200), we show that
(a) N > 100 has little practical impact on the reliability of norm-based standard scores
(max = ±10 percentile points in 99% of samples) and (b) personality profiles vary more
from using different norm samples, between as well as within job families. Averaging
across scales, T-scores based on sales versus trucking norms differed by 7.3 points,
whereas maximum differences averaged 7.4 and 7.5 points within the sets of sales and
trucking norms, respectively, corresponding in each case to approximately ±14
percentile points. Slightly weaker results were obtained using nine additional samples
from clerical, managerial, and financial job families, and regression analysis applied to
the 18 samples revealed demographic effects on four scale means independently of job
family. Personality test developers are urged to build norms for more diverse
populations, and test users to develop local norms, to promote more meaningful
interpretations of personality test scores.
Personality test scores are often interpreted in employment settings with reference
to scale norms (i.e. means and standard deviations; Bartram, 1992; Cook et al., 1998;
Müller & Young, 1988; Van Dam, 2003). Accordingly, the accuracy of norm-transformed
scores in capturing an individual’s relative standing on a set of personality
scales rests on the quality of the underlying norms. Two critical and generally
recognized concerns regarding norm use are (a) the size of the normative sample (N)
and (b) the relevance of the normative sample to the population to which the given
test taker belongs. Despite being recognized as important, sample size and population
relevance (i.e. representativeness) have received little research attention and standards
regarding these qualities are ambiguous. In this article, we show what happens when
personality profiles are generated under varying conditions regarding the size and
source of the normative sample, with the overall aim of refining best practices in the
use of personality test norms. We begin by considering how such norms are used in
work settings.

* Correspondence should be addressed to Dr Robert P. Tett, Department of Psychology, University of Tulsa, Tulsa, OK 74104-3189, USA (e-mail: robert-tett@utulsa.edu).
DOI:10.1348/096317908X336159
Uses of personality test norms in the workplace
A scale score, by itself, reveals little as to the location of an individual on the measured
dimension. Standard scores, such as z or T, use a norm sample mean and standard
deviation to clarify where an individual respondent falls on the measured construct
relative to other people. Personality test norms have several work-related applications.
First, they can facilitate individualized developmental feedback. For example, workers
may be better prepared to interact with others in a team or with customers if they have a
clearer understanding of their relative standing on traits relevant to such interactions
(e.g. emotional control, sociability, tolerance). Second, personality test norms can
facilitate selection decisions. Top-down hiring does not require test norms, but
exclusionary strategies based on test score cut-offs (e.g. hiring from among applicants
scoring above a given cut-off) call for normative comparisons. Norms are especially
important in hiring when applicants are few in number, as this mitigates reliance on
top-down methods. Third, norms can help an organization judge the overall standing of a
targeted work-group (e.g. a sales team) relative to a larger, more general, job-relevant
population (e.g. American sales people), as a basis, perhaps, for determining future
hiring standards. Success in all such norm applications rests on norm quality. Best
practices in this area are reviewed next.
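
The standard-score transformation described above can be sketched in a few lines. This is
a minimal illustration, not the authors' scoring procedure: the raw score, norm mean, and
norm standard deviation are hypothetical values, and the percentile conversion assumes
that scores in the norm population are normally distributed.

```python
from statistics import NormalDist


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm mean, in norm SD units."""
    return (raw - norm_mean) / norm_sd


def t_score(raw, norm_mean, norm_sd):
    """T-score: a z-score rescaled to mean 50 and SD 10."""
    return 50.0 + 10.0 * z_score(raw, norm_mean, norm_sd)


def percentile(raw, norm_mean, norm_sd):
    """Percentile rank, assuming a normal distribution in the norm population."""
    return 100.0 * NormalDist().cdf(z_score(raw, norm_mean, norm_sd))


# Hypothetical norm statistics and raw score, for illustration only.
NORM_MEAN, NORM_SD = 25.0, 4.0
RAW = 29.0

print(t_score(RAW, NORM_MEAN, NORM_SD))                # one SD above the mean
print(round(percentile(RAW, NORM_MEAN, NORM_SD), 1))
```

The same raw score yields a different T-score and percentile whenever the norm mean or
standard deviation changes, which is why the choice of norm sample matters.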
Best practices regarding test norms
Virtually every book on psychological testing offers recommendations on the use of
test norms (e.g. Anastasi & Urbina, 1997; Crocker & Algina, 1986; Kline, 1993). The
most consistent message is that the norm sample should be relevant to the individual
whose scores are being interpreted. Some (e.g. Crocker & Algina, 1986; Kline, 1993)
articulate further that norm samples are more credible if stratified in terms of variables
most highly correlated with the test. Accordingly, to permit reasoned judgments of
norm relevance, test developers are urged to report key demographic characteristics
(e.g. mean age, gender composition, job category). Also important to report are the
sampling strategies used, the time frame of norm data collection, and the response
rate, as all such information speaks to the representativeness of the normative sample
with respect to the targeted population.
The 1999 Standards for Educational and Psychological Testing specify that:

Norms, if used, should refer to clearly described populations. These populations should include
individuals or groups to whom test users will ordinarily wish to compare their own examinees
(Standard 4.5, p. 55).

Reports of norming studies should include precise specification of the population that was
sampled, sampling procedures and participation rates, any weighting of the sample, the dates of
testing, and descriptive statistics. The information provided should be sufficient to enable users to
judge the appropriateness of the norms for interpreting the scores of local examinees. Technical
documentation should indicate the precision of the norms themselves (Standard 4.6, p. 55).

Local norms should be developed when necessary to support test users’ intended
interpretations (Standard 13.4, p. 146).

Focusing on test use for the purpose of hiring, the Principles for the Validation and
Use of Personnel Selection Procedures (SIOP, 2003) state that:

Normative information relevant to the applicant pool and the incumbent population should
be presented when appropriate. The normative group should be described in terms of its
relevant demographic and occupational characteristics and presented for subgroups with
adequate sample sizes. The time frame in which the normative results were established should
be stated (p. 48).
Two points warrant discussion here. First, the issue of sample size is raised in the
Principles, but what counts as ‘adequate’ N is unclear. Statistical theory (readily
confirmed in practice) tells us that the reliability of norms is closely tied to N. Lacking
specifics, practitioners are left to define ‘adequate’ on their own, which undermines
standardization of sound testing practice and norm use. Second, the Standards
encourage development of local norms when necessary to support test users’
intended interpretations. Ambiguity, again, precludes standardized practice. Our
primary aim in this article is to clarify what counts as ‘adequate’ N and ‘sufficient’
representativeness in a normative sample, as a basis for refining use of personality
norms in work settings.
Current practices regarding personality test norms
In light of the recognized standards regarding norm use, we examined technical
manuals for eight popular personality instruments: the Adjective Checklist (ACL),
California Psychological Inventory (CPI), HPI, Jackson Personality Inventory-Revised
(JPI-R), NEO Personality Inventory-R (NEO-PI-R form S), Occupational
Personality Questionnaire (OPQ), Personality Research Form (PRF), and 16PF Select
(16PF). The goal of our review was to assess the degree to which the noted standards
regarding norms are being met in practice. The manuals were reviewed primarily for
norm sample size and the reporting of demographics and sampling procedures. We
also took note of the number of norm samples reported and whether or not dates of
testing and response rates were provided. Results of our review are provided in Tables
1 and 2.
Several observations bear comment, the first two regarding results in Table 1 and
the remainder with respect to Table 2. First, a variety of norm groups is available for
five of the tests, including, most frequently, samples of high school students, college
students, and assorted occupational categories. Second, normative sample sizes are
large, on the whole, with averages per test ranging from 695 to 22,023. Third, with
respect to demographics, gender composition is most often reported, followed by
education level. Least often reported are ethnicity and age. Some manuals (e.g. OPQ, CPI, and

The meaning of ‘local norms’ varies by application. In cross-cultural research, for example, they are norms specific to a
country or language. In this article, we use the term to denote norms specific to a particular job or job type within a specific
organization.
We were unable to obtain technical manuals for two other popular tests: the Guilford-Zimmerman Temperament Survey and
Multidimensional Personality Questionnaire; hence, we offer no summary for those tests.
Table 5. Normative means and standard deviations for three clerical, three managerial, and three
financial samples

Job family/HPI scale     Mean    SD        Mean    SD        Mean    SD
Clerical                 (N = 13,450)      (N = 11,299)      (N = 6,406)
  Adjustment             31.86   4.16      31.98   4.08      32.11   4.19
  Ambition               24.95   3.63      26.95   2.52      25.93   3.15
  Sociability            13.46   4.30      16.38   4.08      14.36   4.40
  Likeability            20.96   1.19      20.93   1.14      20.94   1.20
  Prudence               24.43   3.44      23.40   3.76      24.04   3.67
  Intellectance          15.67   4.54      18.35   4.08      17.62   4.31
  Learning approach      11.19   2.55      11.16   2.50      11.31   2.48
Managerial               (N = 8,089)       (N = 2,032)       (N = 777)
  Adjustment             31.42   4.44      32.73   4.00      31.34   4.49
  Ambition               27.15   2.49      27.68   2.16      27.11   2.57
  Sociability            14.63   4.51      16.55   4.98      14.92   4.49
  Likeability            20.29   1.40      20.40   1.44      20.48   1.33
  Prudence               24.17   3.63      23.45   3.70      23.09   3.79
  Intellectance          16.85   4.59      18.65   4.83      17.47   4.26
  Learning approach      11.08   2.74      11.38   2.71       9.73   3.18
Financial                (N = 4,484)       (N = 800)         (N = 609)
  Adjustment             31.93   4.21      32.15   4.32      31.19   4.65
  Ambition               26.69   2.68      27.75   1.82      26.72   2.12
  Sociability            14.97   4.25      17.02   4.04      16.27   4.17
  Likeability            20.82   1.15      20.63   1.31      20.61   1.45
  Prudence               24.46   3.57      23.22   3.80      22.64   3.90
  Intellectance          16.67   4.40      16.46   4.51      16.69   4.28
  Learning approach      10.78   2.66      10.54   2.71      10.67   2.60

offering practical guidance on norm use. We adopted a similar strategy in replication
using the nine samples from clerical, managerial, and financial job families. Rather than
draw comparisons between jobs, however, we focused on within-job comparisons,
creating profiles for individuals falling at the HPI means from one sample using norms
combined across the remaining two samples per job family.
Results
Upper and lower bounds of intervals around a T-score μ = 50 under various Ns and
levels of certainty are reported in Table 6. The table shows, for example, that when
N = 100, 80% of sample means are expected to fall between 48.7 and 51.3. Increasing N
to 300 yields 49.3 and 50.7 as the lower and upper 10% boundaries. What is perhaps
most notable in this table is the stability of the sample mean as an estimate of μ with
even modest sample sizes. With N = 100, …

Table 6. Upper and lower bounds of intervals around a T-score μ = 50 under various Ns and
levels of certainty [table body not recoverable]

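
The interval bounds quoted above for N = 100 and N = 300 follow from the standard error of
a mean. The sketch below is one way to reproduce them; it assumes normally distributed
sample means and a population T-score SD of exactly 10, and the function names are
illustrative rather than taken from the article.

```python
from math import sqrt
from statistics import NormalDist


def mean_bounds(n, certainty, mu=50.0, sd=10.0):
    """Interval expected to contain `certainty` of sample means of size n."""
    standard_error = sd / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + certainty / 2.0)
    return mu - z * standard_error, mu + z * standard_error


def percentile_shift(t_upper):
    """Percentile points between T = 50 and the upper bound (normal model)."""
    return 100.0 * NormalDist(mu=50.0, sigma=10.0).cdf(t_upper) - 50.0


for n in (100, 300):
    low, high = mean_bounds(n, 0.80)
    print(n, round(low, 1), round(high, 1))

# Widest case discussed in the abstract: 99% of samples at N = 100.
_, upper_99 = mean_bounds(100, 0.99)
print(round(percentile_shift(upper_99), 1))
```

Under these assumptions the 80% bounds match the values given in the text, and the 99%
bound at N = 100 corresponds to roughly ±10 percentile points around the 50th percentile.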
The effects of population relevance on HPI profiles are depicted in Figures 1-3.
In Figure 1, HPI T-score profiles are plotted for both the combined sales and the
combined trucking samples, based on hypothetical raw scores falling at the mean on
each scale and using the other combined group as the reference sample in each case.
Differences between samples vary across HPI scales. The largest difference arises on
sociability (15.4 T-score points) and the smallest difference on prudence (0.4 points). The
average difference is 7.3 T-score points. Figure 2 depicts profiles for the five sales
samples and Figure 3, for the four trucking samples. Notable differences are evident
within each figure. T-scores on ambition and sociability in the sales group, in particular,
vary considerably across samples (range = 40.0-55.4 for ambition; 42.7-57.6 for
sociability). The largest differences within the trucking norm set arise for learning
approach (range = 44.1-56.1). The average maximum differences in the sales and
trucker groups (i.e. averaging across the seven scales in each case) are 7.4 and 7.5,
respectively.
Within-job-family differences in HPI profiles for each of the clerical, managerial, and
financial job families are depicted in Figure 4. The largest differences are evident in the
clerical samples, where, for example, T-scores on ambition, sociability, and intellectance
vary by more than 10 points between samples A and B. The largest differences in the
managerial samples are for sociability (7.2 points), intellectance (7.0), and learning
approach (6.6); and the largest differences in the financial samples are for sociability
(8.6), prudence (8.4), and ambition (7.1). Averaging differences across all seven HPI
scales within job families yields 5.5 for clerical jobs, 5.2 for managerial jobs, and 4.6
for financial jobs. These are smaller than the averages from sales and trucking (i.e. 7.4
and 7.5), but 10 of the 21 HPI scales-in-jobs exceed the ±2.5-point standard adopted
here, corresponding to a 10% decision error rate with a T = 50 cut-score.
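
The profile comparisons above can be illustrated with a short sketch. It scores a person
sitting at the mean of one sample against another sample's norms, then converts a
T-score half-width into percentile points. The raw means and SD below are hypothetical,
and reading a T-score difference as a symmetric interval around T = 50 under a normal
model is an interpretive assumption, not a procedure stated in the text.

```python
from statistics import NormalDist


def cross_norm_t(own_mean, ref_mean, ref_sd):
    """T-score of a person at their own sample's mean, using another norm."""
    return 50.0 + 10.0 * (own_mean - ref_mean) / ref_sd


def percentile_points(t_half_width):
    """Percentile points spanned from T = 50 to T = 50 + t_half_width."""
    return 100.0 * NormalDist().cdf(t_half_width / 10.0) - 50.0


# Hypothetical raw-score means and SD for two norm samples.
print(cross_norm_t(own_mean=16.0, ref_mean=14.0, ref_sd=4.0))

# A 7.3-point T-score difference read as +/-3.65 around T = 50.
print(round(percentile_points(7.3 / 2.0), 1))

# The +/-2.5-point standard read the same way.
print(round(percentile_points(2.5), 1))
```

On this reading, a 7.3-point difference spans about 14 percentile points on either side
of the 50th percentile, and a 2.5-point shift moves about 10% of test takers across a
T = 50 cut-score.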
Discussion
Our goal was to clarify best practices regarding use of personality tests in work settings
by assessing the impact of normative sample N and population relevance on the
reliability of judged personality test scores. Where personality scores are standardized

Figure 1. HPI profiles for sales and trucking (series: Sales, Trucking; x-axis: HPI scale; y-axis: T-score, 40.0-60.0).

can be similar to those derived between job categories, challenging
reliance solely on job type as a basis for judging the suitability of a normative sample.
Underlying the noted differences are any of a host of demographic and situational
variables with possible links to personality scale scores. Identifying all the variables that
might explain the differences depicted in Figures 1-4 is beyond the scope of the current
discussion. Some possibilities, based on available demographics, are reported in Table 7.
To assess the linear effects of these variables on the HPI means, we regressed the means,
per scale, on to proportion white, proportion black, proportion male, and mean age
(N = 18 samples). Differences among the five job families were assessed by entering
four corresponding dummy-coded variables in the first step. Results are reported in
Table 8. Step 1 results show that the sample means vary among the five job families for
all HPI scales except adjustment and prudence. Additional effects are evident in results
from Step 2. Specifically, after controlling for job family effects, ambition means are
lower in samples with higher %blacks, likeability means are higher in samples with
higher %whites, prudence means are lower in older samples, and learning approach
means are lower in samples with higher %males. Whether or not these findings replicate
in larger sample sets (i.e. with N > 18 samples) is a matter for further research.
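
With only 18 sample means and four dummy-coded predictors in Step 1, the adjustment
applied to R² is substantial. The sketch below uses the usual adjusted R² formula for a
model with an intercept to show how a small unadjusted R² can turn negative once
adjusted. The unadjusted values are hypothetical, since the text does not report them.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a regression model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Step 1 as described in the text: 18 sample means, 4 job-family dummies.
N_SAMPLES = 18
N_DUMMIES = 4

# Hypothetical unadjusted R-squared values, for illustration only.
weak_fit = adjusted_r2(0.09, N_SAMPLES, N_DUMMIES)
strong_fit = adjusted_r2(0.70, N_SAMPLES, N_DUMMIES)

print(round(weak_fit, 2), round(strong_fit, 2))
```

A negative adjusted R² therefore signals a fit no better than chance given the number of
predictors, rather than a computational error.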
Conscientiousness is relevant to performance in most jobs (e.g. Barrick & Mount, 1991).
Missing mean ages for the three clerical samples were substituted by the mean from the remaining 15 samples. Results for
mean age based on the 15 samples reporting useable data were very similar to those obtained using mean substitution and are
available on request. Also, the remaining ethnic groups were not assessed owing to their relatively small proportions within the
normative samples.
Table 7. Norm sample demographics
Ethnicity (%)
Sample
Gender (%)
White
Black
Hisp.
Asian
46.1
85.1
73.9
67.4
66.7
65.0
40.0
10.9
4.3
5.0
15.2
8.3
8.4
2.4
0.6
3.0
3.4
0.6
8.3
2.9
0.6
0.3
0.6
0.0
0.5
44.5
57.3
48.6
47.4
36.8
48.9
55.5
42.7
51.4
52.6
63.2
51.1
33.1
33.3
30.4
28.3
34.3
32.5
Native Amer.
Male
Female
Mean age
Sales
A
B
C
D
E
Weighted mean
Trucking
A
B
C
D
Weighted mean
Clerical
A
B
C
Weighted mean
Managerial
A
B
C
Weighted mean
Financial
A
B
C
Weighted mean
6.9
17.3
16.1
16.7
23.2
9.5
81.0
79.6
50.5
25.9
71.7
li.e
33.8
35.5
15.3
4.8
6.1
15.7
35.5
9.3
4.8
0.4
0.0
2.6
3.4
0.0
2.1
0.0
0.4
0.3
59.1
98.8
99.2
96.8
72.9
40.9
1.2
0.8
3.2
27.1
37.5
36.9
39.2
36.8
37.5
78.3
60.1
56.4
67.2
4.1
7.4
10.4
6.6
10.4
22.1
23.0
17.2
6.2
7.7
6.9
6.9
1.0
2.7
3.3
2.1
11.5
44.0
28.1
26.7
88.5
56.0
71.9
73.3
NA
NA
NA
NA
50.6
50.9
69.0
52.0
27.4
43.4
157
29.5
19.1
3.8
10.4
15.6
2.3
1.9
3.4
2.3
0.7
0.0
1.5
0.6
56.3
49.1
67.1
55.7
43.7
50.9
32.9
44.3
32.7
36.6
36.0
33.7
66.0
77.1
86.0
69.5
21.0
10.8
2.4
17.7
7.7
6.8
6.4
7.4
5.2
5.2
O.I
0.2
0.0
O.I
38.3
67.6
9.5
39.3
61.7
32.4
90.5
60.7
27.7
37.0
34.3
29.6
5.2
5.2
Our point here is that comparing an individual to norms from the same job family can,
nonetheless, pose uncertainties owing to other characteristics of the norm sample that
may also be related to personality scale scores.
Independent research supports current findings suggesting that personality scores
are related to job category (e.g. RIASEC; Barrick, Mount, & Gupta, 2003) and
demographic characteristics most often described in test manuals (Roberts, Walton, &
Viechtbauer, 2006). Other work-related correlates of personality have recently been
identified. Judge and Cable (1997) report that personality is related to organizational
culture preferences such that, for example, conscientious people prefer detail-oriented
and results-oriented cultures. Thus, means for conscientiousness (and more specific
traits falling within that category) can be expected to be elevated in organizations with
those types of cultures. Similar results linking personality with organizational culture
preferences have been reported by Warr and Pierce (2004) and Ang, van Dyne, and Koh
(2006). Along similar lines, Furnham, Petrides, and Tsaousis (2005) found that the Big
Five, especially Openness to Experience, are related to work values pertaining to
cultural diversity. To the degree that organizational culture and work values each affect
personality scores independently of job type, personality test developers are urged
to report details of norm sample culture preferences and values as a basis for judging
norm relevance in work settings.

Table 8. Regression results for effects of job family and demographic variables on HPI scale means
(N = 18 samples)

                     Step 1(a):     Step 2(b): %white, %black, %male, mean age(c)
                     Job family
HPI scale            Adjusted R²    Adjusted R²   Change in adjusted R²   Sig. predictor   β
Adjustment            .02
Ambition              .61**          .70           .09                    %black           -.37*
Sociability           .60**
Likeability           .76**          .81           .05                    %white            .25*
Prudence             -.19            .15           .34                    mean age         -.78*
Intellectance         .45*
Learning approach     .63**          .73           .10                    %male            -.53*

*p < .05; **p < .01; two-tailed.
(a) Forced entry.
(b) Stepwise entry.
(c) Mean substitution for three clerical samples.

A potentially more important variable affecting normative means on personality
scales may be reliance on job applicants versus incumbents. The question of faking in
personality assessment has been a dominant focus of investigation for many years. There
is now general consensus that people can fake when instructed to do so (e.g.
Viswesvaran & Ones, 1999). More recently, the focus has shifted to whether or not
people actually do fake in selection settings. Some (e.g. Arthur, Woehr, & Graziano,
2000; Hogan, Barrett, & Hogan, 2007; Hough & Schneider, 1996; Ones & Viswesvaran,
1998; Viswesvaran & Ones, 1999) downplay the effects of voluntary faking, whereas
others (Griffith, Chmielowski, & Yoshita, 2007; Rosse, Stecher, Miller, & Levin, 1998;
Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001; Tett & Christiansen, 2007) argue that
it is indeed problematic. Summarizing the ‘do-fake’ literature in selection contexts, Tett et al.
(2006) report a meta-analytic mean d effect size of .35, averaging across the Big Five
(excluding Openness, whose effect is close to 0, yields a mean of 0.52). This result supports
the applicant/incumbent distinction raised in the SIOP Principles regarding norm use, and
clarifies that test developers and publishers need to differentiate norms based on this
distinction. Specifically, if a personality test is being used in hiring, the relevant norm group
is one drawn from an applicant sample, as norms based on incumbents can be expected to
yield (mostly) lower means and, hence, elevated T-scores (or their equivalent) in
applicants. If targeted for use in developing personnel, on the other hand, personality test
scores bear comparison to norms derived from incumbents, as reliance on applicant norms
will likely underestimate individuals’ true standing.

All norm samples presented here, drawn from the HAS database, include only job applicants; the applicant/incumbent
distinction may be relevant to other tests used in work settings, particularly those lacking applicant norms.
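
The practical size of the applicant/incumbent effect described above can be worked
through numerically. The sketch below scores a person at the applicant mean against
incumbent norms. It assumes the d values quoted in the text are expressed in incumbent SD
units, that applicant and incumbent SDs are equal, and that scores are normally
distributed. None of these assumptions is stated in the article.

```python
from statistics import NormalDist


def t_under_incumbent_norms(d):
    """T-score of a person at the applicant mean, scored on incumbent norms."""
    return 50.0 + 10.0 * d


def percentile_from_t(t):
    """Percentile rank of a T-score under a normal model."""
    return 100.0 * NormalDist(mu=50.0, sigma=10.0).cdf(t)


# d = .35 (Big Five average) and d = .52 (excluding Openness), from the text.
for d in (0.35, 0.52):
    t = t_under_incumbent_norms(d)
    print(d, round(t, 1), round(percentile_from_t(t), 1))
```

Under these assumptions an entirely average applicant would appear several T-score points
above average when compared with incumbents, which is the elevation the paragraph above
warns about.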
That personality scores may be related to a diverse array of demographic and
situational factors and, plausibly, to interactions among those variables, raises concerns
regarding the generalizability of normative samples as, with increasing numbers of
distinct correlates, comes a decreasing likelihood that a normative sample reported in a
test manual is relevant to any individual not included in that sample. This goes beyond the
issue of whether or not the norm sample is described in detail; the point is that, regardless
of such descriptive detail, norm samples are inherently specific to populations identified
mostly by convenience, which are very likely to be different from the population of
interest in specific norm applications, namely, in the case of selection, the population of
local applicants, or, in the case of personnel development, local incumbents. The concept
of representativeness in judging norm suitability is, in this light, a fleeting ideal, and
assuming representativeness given only a limited set of norm sample descriptors (e.g. job
type, age, race, and gender composition) is likely to engender false interpretations
regarding an individual’s or group’s standing on targeted personality traits relative to the
true local population.
Our findings are generally consistent with the spirit of the Standards and SIOP
Principles regarding norms, noted in the introduction. They are particularly supportive of
more restrictive recommendations offered in conjunction with the International
Personality Item Pool (http://ipip.ori.org/newNorms.htm), which explicitly offers
no norms:

One should be very wary of using canned ‘norms’ because it isn’t obvious that one could ever
find a population of which one’s present sample is a representative subset. Most ‘norms’ are
misleading, and therefore they should not be used.
Far more defensible are local norms, which one develops oneself. For example, if one wants to
give feedback to members of a class of students, one should relate the score of each individual to
the means and standard deviations derived from the class itself.
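
The local-norm approach in the quoted passage amounts to standardizing each score
against the group's own mean and standard deviation. A minimal sketch follows. The
scores are hypothetical, and the choice of the sample (n - 1) standard deviation is an
assumption, since the passage does not specify which estimator to use.

```python
from statistics import mean, stdev


def local_norms(scores):
    """Mean and sample standard deviation of the local group."""
    return mean(scores), stdev(scores)


def local_t_scores(scores):
    """T-scores for each member, relative to the group's own norms."""
    group_mean, group_sd = local_norms(scores)
    return [50.0 + 10.0 * (x - group_mean) / group_sd for x in scores]


# Hypothetical raw scale scores for a small class or work group.
class_scores = [18, 20, 22, 24, 26]

for raw, t in zip(class_scores, local_t_scores(class_scores)):
    print(raw, round(t, 1))
```

By construction the resulting T-scores have a mean of 50 and a standard deviation of 10
within the group, so each person's standing is expressed relative to the local
population rather than to a published norm sample.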
Conclusions
Our review of current standards and practice regarding use of personality test norms
and our findings driven by basic statistical principles and real data suggest the following
conclusions.
(1) Sample size has little practical impact on the reliability of normative means and on
standard scores and corresponding percentiles thereby derived, once an N of
around 300 is reached. Test users need not be overly wary of norms based on N of
even 100. Test developers are urged to seek norms for more diverse groups based
on modest Ns rather than seeking larger samples per se.
(2) Beyond N = 100, norm sample composition becomes the more important
consideration. Notable discrepancies in personality profiles are likely not only
between job families (e.g. sales vs. trucking in the present case) but also within job
families (based on samples from different organizations). Such differences within
categories raise concerns about the usefulness of norms provided in test manuals,
which typically offer little more than job category and basic demographic
descriptors as bases for judging norm suitability.
(3) Personality scores vary for reasons other than those targeted in standards and
principles regarding norm use. Organizational culture, work values, and incumbent
versus applicant settings, all of which vary independently of job category and basic
demographics, are also worthy of consideration in judgments of norm relevance.
(4) The diversity and complexity of factors affecting personality scale scores encourage
use of local norms over those provided in test manuals. That N need not be
impractically large (e.g. 100) favours such efforts in furthering organizationally
meaningful personality test score interpretations, especially for use in personnel
development and selection.
(5) Use of general norms may have merit at the group level (e.g. assessing where the
sales group at Company A stands in relation to national sales people). Special efforts
are required, however, to ensure that the general population, defined explicitly in
terms of diverse personality correlates (e.g. job category, demographics, applicant
vs. incumbent, organizational culture, work values), is suitably represented by the
normative sample. Strategies for developing such norms are worthy of future
research.
Acknowledgements
An earlier version of this article was presented at the 21st Annual Conference of the Society for
Industrial and Organizational Psychology, May, 2006, Dallas, TX.
References
American Psychological Association (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
Ang, S., van Dyne, L., & Koh, C. (2006). Personality correlates of the four-factor model of cultural intelligence. Group and Organization Management, 31, 100-123.
Arthur, W., Woehr, D. J., & Graziano, W. G. (2000). Personality testing in employment settings: Problems and issues in the application of typical selection practices. Personnel Review, 30, 657-676.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Barrick, M. R., Mount, M. K., & Gupta, R. (2003). Meta-analysis of the relationship between the five-factor model of personality and Holland’s occupational types. Personnel Psychology, 56, 45-74.
Bartram, D. (1992). The personality of UK managers: 16PF norms for short-listed applicants. Journal of Occupational and Organizational Psychology, 65, 159-172.
Cook, M., Young, A., Taylor, D., O’Shea, A., Chitashvili, M., Lepeska, V., et al. (1998). Personality profiles of managers in former Soviet countries: Problem and remedy. Journal of Managerial Psychology, 13, 567-579.
Costa, P. T., & McCrae, R. R. (1992). NEO PI-R professional manual. Lutz, FL: Psychological Assessment Resources, Inc.
Crocker, L., & Algina, J. (1986). Norms and standard scores. In Introduction to classical and modern test theory (Chapter 19). New York: Harcourt, Brace, and Jovanovich.
Ferguson, G. A. (1959). Statistical analysis in psychology and education. New York: McGraw-Hill.
Furnham, A., Petrides, K. V., & Tsaousis, I. (2005). A cross-cultural investigation into the relationships between personality traits and work values. Journal of Psychology: Interdisciplinary and Applied, 139, 5-32.
Gough, H. G., & Bradley, P. (1996). California Psychological Inventory manual. Palo Alto, CA: Consulting Psychologists Press.
Gough, H. G., & Heilbrun, A. B., Jr. (1983). The Adjective Checklist manual. Mountain View, CA: Consulting Psychologists Press.
Griffith, R. L., Chmielowski, T., & Yoshita, Y. (2007). Do applicants fake? An examination of the frequency of applicant faking behavior. Personnel Review, 36, 341-355.
Hogan, J., Barrett, P., & Hogan, R. (2007). Personality measurement, faking, and employment selection. Journal of Applied Psychology, 92, 1270-1285.
Hogan, R., & Hogan, J. (1995). Hogan Personality Inventory manual (2nd ed.). Tulsa, OK: Hogan Assessment Systems.
Hough, L. M., & Schneider, R. J. (1996). Personality traits, taxonomies and applications in organizations. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 31-88). San Francisco, CA: Jossey-Bass.
Jackson, D. N. (1994). Jackson Personality Inventory - Revised manual. Port Huron, MI: Sigma Assessment Systems.
Jackson, D. N. (1999). Personality Research Form manual. Port Huron, MI: Sigma Assessment Systems.
Judge, T. A., & Cable, D. M. (1997). Applicant personality, organizational culture, and organizational attraction. Personnel Psychology, 50, 359-394.
Kelly, M. L. (Ed.). (1999). 16PF Select manual. Champaign, IL: Institute for Personality and Ability Testing.
Kline, P. (1993). The handbook of psychological testing. New York: Routledge.
Muller, J., & Young, R. (1988). An evaluation of psychological tests in the selection process for EEG technician trainees. American Journal of EEG Technology, 2.Í, 1Í7-I58.
Ones, D. S., & Viswesvaran, C. (1998). The effects of social desirability and faking on personality and integrity assessment for personnel selection. Human Performance, 11, 245-269.
Roberts, B. W., Walton, K. E., & Viechtbauer, W. (2006). Patterns of mean-level change in personality traits across the life course: A meta-analysis of longitudinal studies. Psychological Bulletin, 132, 1-25.
Rosse, J. G., Stecher, M. D., Miller, J. L., & Levin, R. A. (1998). The impact of response distortion on preemployment personality testing and hiring decisions. Journal of Applied Psychology, 83, 634-644.
SHL Group (2006). Occupational Personality Questionnaire 32 technical manual. Thames Ditton: SHL Group.
Society for Industrial and Organizational Psychology (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: SIOP.
Stark, S., Chernyshenko, O. S., Chan, K., Lee, W. C., & Drasgow, F. (2001). Effects of the testing situation on item responding: Cause for concern. Journal of Applied Psychology, 86, 943-953.
Tett, R. P., Anderson, M. G., Ho, C. L., Yang, T. S., Huang, L., & Hanvongse, A. (2006). Seven nested questions about faking on personality tests: An overview and interactionist model of item-level response distortion. In R. Griffith (Ed.), A closer examination of applicant faking behavior. Greenwich, CT: Information Age Publishing.
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response to Morgeson, Campion, Dipboye H