O'Connell, D. N., Orne, M. T., & Shor, R. E. A comparison of hypnotic susceptibility as assessed by diagnostic ratings and initial standardized test scores. International Journal of Clinical and Experimental Hypnosis, 1966, 14, 324-332.


Institute of the Pennsylvania Hospital and University of Pennsylvania

LaSalle College

Abstract: In a nonrandom sample of 63 Ss, a correlation of .79 was found between Stanford Hypnotic Susceptibility Scale, Form A (SHSS:A) of Weitzenhoffer and Hilgard (1959) scores and diagnostic ratings of hypnotizability. This degree of correlation corresponds to an index of forecasting efficiency of 36.8%. Limitations on the interpretation of this finding both as a validity coefficient and as an indicant of the predictive value of SHSS:A are discussed.

Individual differences in hypnotic susceptibility have generally been assessed by one of two qualitatively different types of measurement: either by the traditional use of diagnostic ratings or by the use of standardized objective scales. These two procedures are operationally different in several important ways.

Diagnostic ratings are clinical evaluations made by experienced judges using whatever induction procedures and diagnostic criteria seem relevant for assessing hypnotic depth. Induction procedures may be chosen to utilize idiosyncratic aspects of the personality of S in order to achieve maximum hypnotic depth. The specific suggestions used as criteria of depth are evaluated not only in behavioral terms but also on the basis of subjective reports during both hypnosis and posthypnotic waking recall.

In contrast, the fluidity of interpersonal relationships integral to diagnostic rating procedures is intentionally minimized in objective scales so that standardization may be attained, and reliance on subjective reports is reduced in the interest of ease of administration. A hypnotic susceptibility score obtained with such a scale is the total number of items passed in a series of items of varying difficulty, as defined by statistically determined frequencies of success or failure in a normative sample. The criterion for passing each item is fixed and objective. For example, two persons who pass a challenge item such as Arm Catalepsy may differ markedly in subjective experience, one actively but unsuccessfully attempting to bend the arm and the other simply not trying. In a diagnostic rating, the first instance would be considered a genuine hypnotic response, whereas the second would likely be rated as mere compliance. Another important way in which standardized test scores differ from diagnostic ratings is that the numerical equivalence of two test scores does not necessarily imply equal difficulty of items passed. Since all test items are equally weighted, a person who passes the five easiest test items will receive the same score -- to take an extreme and rather unlikely example -- as one who passes the five most difficult items.
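The equal-weighting point can be illustrated with a minimal sketch. The item names and normative pass rates below are invented for illustration and are not taken from the SHSS:A norms; the sketch only shows that an unweighted count of items passed cannot distinguish the two extreme response patterns described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    pass_rate: float  # hypothetical proportion of a normative sample passing


# Twelve hypothetical items, ordered from easiest to hardest (assumed values).
ASSUMED_PASS_RATES = [0.90, 0.85, 0.75, 0.70, 0.60, 0.50,
                      0.45, 0.35, 0.30, 0.20, 0.15, 0.10]
ITEMS = [Item(f"item_{i:02d}", rate)
         for i, rate in enumerate(ASSUMED_PASS_RATES, start=1)]


def total_score(passed):
    """Unweighted scale score: the number of items passed."""
    return sum(1 for item in ITEMS if item.name in passed)


def mean_difficulty(passed):
    """Mean difficulty (1 - pass rate) of the items passed."""
    values = [1.0 - item.pass_rate for item in ITEMS if item.name in passed]
    return sum(values) / len(values) if values else 0.0


by_ease = sorted(ITEMS, key=lambda item: item.pass_rate, reverse=True)
easiest_five = {item.name for item in by_ease[:5]}
hardest_five = {item.name for item in by_ease[-5:]}

if __name__ == "__main__":
    for label, passed in (("easiest five", easiest_five),
                          ("hardest five", hardest_five)):
        print(f"{label}: score = {total_score(passed)}, "
              f"mean difficulty = {mean_difficulty(passed):.2f}")
```

Both patterns receive a score of 5, although the mean difficulty of the items passed differs widely. A diagnostic rating, by contrast, would place the two Ss at quite different levels.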

The correlation between diagnostic ratings and standardized test scores is of both theoretical and practical interest. Because the two evaluative procedures differ qualitatively, the degree of association between them would provide a measure of the construct validity of the standardized tests. Such a validity coefficient would allow estimation of the effectiveness of these tests in the selection of Ss extreme in hypnotizability.

Parenthetically, a remark may be made at this point on the peculiar dearth of information available on the validity of standardized scales of hypnotizability. The casual approach to the problem of validation typically taken is exemplified by a relevant statement of Friedlander and Sarbin (1938), in discussing the validity of their own scale: ". . . no matter how we compute nor how we argue, we cannot create a validity coefficient ex vacuo. Roughly, we are justified in assuming the scale as a whole valid" (p. 465). Such an assumption would not go unchallenged in other areas of psychometric testing.

Previously Reported Correlations

There have been two recent reports of correlations between diagnostic ratings and standardized test scores. In the first of these, Shor, M. T. Orne, and O'Connell (1966)[5] found a correlation of .75 between scores on the Stanford Hypnotic Susceptibility Scale, Form A (SHSS:A) of Weitzenhoffer and Hilgard (1959) and diagnostic ratings on a four point scale corresponding essentially to the major divisions of the Davis-Husband scale (Davis & Husband, 1931) in a sample of 23. The majority of these Ss had received extensive training in hypnosis and can reasonably be assumed to have reached fairly stable levels of hypnotic performance prior to these evaluations. A second evaluation of hypnotizability, using Form B of the Stanford Hypnotic Susceptibility Scale (Weitzenhoffer & Hilgard, 1959), was made after the diagnostic ratings and training sessions had occurred. These scores yielded a correlation of .93 with the diagnostic ratings. Inspection of the data showed that most of the improvement was due to shifts in hypnotizability scores among the few Ss who had not had prior hypnotic training. This suggests that standardized test scores can reflect diagnostic ratings very accurately after suitable training has yielded stability of hypnotic performance. It should be stressed, though, that this type of correlation sheds little light on the problem of the usefulness of initial standardized tests of hypnotizability as predictors of stable diagnostic ratings.

[5] An earlier presentation of this data in a different context reported this same correlation corrected for coarse grouping, which increases its value to .83 (Shor, M. T. Orne, & O'Connell, 1962).

More recently, Evans and Thorn (1966) have reported a correlation of .77 between scores on the Stanford Hypnotic Susceptibility Scale, Form C (SHSS:C) of Weitzenhoffer and Hilgard (1962) and diagnostic ratings on a five point scale in a sample of 60. These Ss too had had prior hypnotic experience. Although they were selected on the basis of previously obtained scores on the Harvard Group Scale of Hypnotic Susceptibility, Form A (HGSHS:A) of Shor and E. Orne (1962) to yield three subgroups of 20 Ss each in the low, medium, and high range, the distribution of SHSS:C scores does not show any apparent departure from normality.

While these two reports are in close agreement, they both deal with nonrandomly selected samples of Ss who had had extensive experience with hypnosis. Both these factors make extrapolation to other situations questionable. While the present sample is itself not ideal, it represents a type of sample commonly encountered in hypnotic research and allows a direct comparison of SHSS:A scores and diagnostic ratings as well as a point of departure for a discussion of the limitations in usefulness of standardized test scores in comparison with diagnostic ratings in subject selection.

Present Sample

Sample selection. Records of Ss screened for hypnotizability in our laboratory during the years 1960-1964 were searched for instances where both a score on the SHSS:A and a diagnostic rating had been collected. Two subsamples were obtained: (a) one made up of instances where an initial SHSS:A testing had been followed by a later diagnostic rating, and (b) one in which SHSS:A had been given after a diagnostic rating. The second sample was obtained in an attempt to measure the magnitude of possible practice effects.

Instances in which both SHSS:A administration and diagnostic rating were made by the same person were excluded. The degree to which knowledge of prior test scores or rating was available to those making the second measurement is not known. No attempt was made in this sample to ensure blind measurements.

Those Ss who had taken part in the previously reported studies were excluded from the present sample, as were Ss being rated specifically for participation in double-blind studies (M. T. Orne, 1959), where the rater would know beforehand that S was potentially either very high or very low in hypnotic susceptibility.

All Ss were paid undergraduate volunteers from universities in the Boston area. Many had been selected from introductory classes in psychology after being given a lecture on hypnosis followed by a few group tests of waking suggestibility, e.g., Arm Separation.

In order to obtain an estimate of rating reliability, the same records were also searched for instances where two diagnostic ratings had been made of the same S by different raters.

Test administration. Test scores were obtained by standard individual administration of SHSS:A.

Diagnostic rating. Rating sessions were individual and lasted about one hour. During this time, any induction or deepening procedures deemed appropriate were used to obtain maximum hypnotic depth.

The diagnostic rating scale used has been briefly described elsewhere (M. T. Orne, 1959; Shor, M. T. Orne, & O'Connell, 1962). The diagnostic ratings were based both on behavioral criteria and on subjective reports obtained during and after hypnosis. Ratings were made on a five point scale,[6] as follows:

1 = unhypnotizable: neither overt nor subjective response.

2 = very light: overt response to ideomotor suggestions, e.g., eye closure or partial hand levitation without subjective involvement, failure of simple challenge suggestions.

3 = light: positive response to challenge suggestions and more difficult motor suggestions, with some subjective components present, but failure to respond to suggestions of hallucination and posthypnotic phenomena.

4 = medium: positive response to suggested hallucination, simple posthypnotic responses, but at best only partial posthypnotic amnesia.

5 = deep: complete posthypnotic amnesia along with difficult posthypnotic suggestions and other classical somnambulistic phenomena.

[6] Plus and minus distinctions were also made, thus producing a 14-point scale ranging from 1 to 5+. Since results of analyses using the finer ratings did not differ substantially from those using the five point scale, they have not been reported.


Distributions of SHSS:A scores and diagnostic ratings for the two subsamples are presented in Tables 1 and 2. Even though Ss being specifically screened for double-blind experiments were excluded, the distributions obtained appeared markedly non-normal, with a strong upper mode and a smaller lower mode. The effects of certain types of non-normality have been investigated by Norris and Hjelm (1961), who found that failure to meet the assumption of bivariate normality underlying the product-moment correlation statistic can have important distorting effects. The present type of distribution would tend to inflate the obtained correlation.

Product-moment correlations between SHSS:A scores and diagnostic ratings, together with associated statistics, are presented in Table 3.

The reliability of a sample of 46 paired ratings was found to be .79. This should be compared with an inter-rater reliability of .96 based on judgments by two raters independently observing the same hypnotic sessions (Shor, M. T. Orne, & O'Connell, 1966).[7] The presently obtained figure confounds inter-rater reliability and test-retest reliability.

[7] The low inter-rater reliability of .61 reported by Evans and Thorn (1966), also based on two observers evaluating the responses of the S, is not comparable, since the ratings were global judgments of hypnotic depth and one of the raters had not had previous experience with assessing hypnotic depth.

Although obtained under quite divergent conditions, the present correlations and those previously reported are in general agreement. They suggest that the construct validity of SHSS:A is comparable in magnitude to its test-retest reliability, which was reported in the normative sample as ranging from .78 to .83 (Weitzenhoffer & Hilgard, 1959). Since SHSS:A represents, in effect, a work sample of hypnotic performance, it is not surprising that its validity should be high.

Ideally, one would like to have validity information based on a randomly selected sample showing bivariate normality tested under conditions minimizing aura effects. It would also be of interest to obtain systematically controlled measures of the effect of graded degrees of hypnotic training on both diagnostic ratings and standardized test scores. The absence of order effects in the present samples argues against large short-term effects. This may have been in part a function of the individual administration of SHSS:A where considerable effort was taken to establish rapport before actual test administration. The shifts found in new Ss after extensive hypnotic training in the previous study would indicate that practice effects can be marked. Such a conclusion fits general clinical experience, where sudden and dramatic shifts are at times encountered.

The high correlation between these two measures may suggest to some that initial standardized test scores can safely be substituted for diagnostic ratings. This would be a considerable economy of experimental procedure. Such a conclusion, however, is not warranted by these results even when they are taken at face value.

While a correlation of the order of .80 indicates a high degree of linear association, its value for the accuracy of prediction of individual scores is considerably more limited than one might at first suppose. Several statistics have been used for evaluating the predictive value of a product-moment correlation. Two such measures are presented along with their corresponding correlations in Table 3.

The coefficient of alienation serves to indicate the size of error in prediction of individual test scores relative to the error which would result from assigning scores at random. In order to reduce the error by half, a validity coefficient of .866 is required, and to reduce it by three quarters, it must be .968.

A similar measure of accuracy of prediction is provided by the index of forecasting efficiency, which is defined as the percentage reduction in errors of prediction attained when a correlation exists between two variables. For the present samples, it will be seen that in no instance does the forecasting efficiency reach 50%.
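The two statistics just described can be sketched numerically. The formulas are not printed in the text; the sketch assumes the usual textbook definitions, k = sqrt(1 - r^2) for the coefficient of alienation and E = 100(1 - k) for the index of forecasting efficiency, which are consistent with the requirement of a validity coefficient of .866 to halve the error. The function names are illustrative only.

```python
import math


def coefficient_of_alienation(r):
    """Assumed definition: k = sqrt(1 - r^2), the relative size of prediction error."""
    return math.sqrt(1.0 - r * r)


def forecasting_efficiency(r):
    """Assumed definition: E = 100 * (1 - k), the percentage reduction in error."""
    return 100.0 * (1.0 - coefficient_of_alienation(r))


def required_validity(error_fraction):
    """Correlation needed for the relative error k to equal error_fraction."""
    return math.sqrt(1.0 - error_fraction ** 2)


if __name__ == "__main__":
    # Illustrative correlations, including one "of the order of .80".
    for r in (0.50, 0.80, 0.95):
        k = coefficient_of_alienation(r)
        e = forecasting_efficiency(r)
        print(f"r = {r:.2f}   k = {k:.3f}   E = {e:.1f}%")

    # Validity needed to cut the error of prediction to one half and one quarter.
    for fraction in (0.50, 0.25):
        print(f"error reduced to {fraction:.2f}: r = {required_validity(fraction):.3f}")
```

Under these definitions a correlation of .80 leaves 60% of the original error of prediction, that is, a forecasting efficiency of 40%. This is why even a high degree of linear association gives limited accuracy in predicting individual scores.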

For the great majority of psychometric purposes, individual score prediction is not a major requirement. For purposes of selecting Ss extreme in hypnotizability, score differences on the present five point scale are, in contrast, of great practical importance, since a difference of even one point may mean the difference between an adequately selected sample and an inadequate one. This is particularly the case at the upper end of the scale, where the difference between a "4" rating and a "5" rating indicates important qualitative differences in hypnotizability.

The limitations on accuracy of prediction found with even quite high degrees of correlation make the substitution of SHSS:A scores for diagnostic ratings a dangerous procedure at best. Validity coefficients of the order of .95 or better would have to be shown to justify such a procedure, and evidence from the present and previously reported samples does not suggest that the underlying validity is that high.

The correlations found do, however, support the conclusion that SHSS:A, and presumably its equivalent group form, have a degree of validity comparable to that of other well-constructed psychometric instruments. They can be useful as screening devices in selecting Ss potentially extreme in plateau hypnotizability.


A Comparative Study of Susceptibility to Hypnosis as Evaluated by Clinical Diagnosis and by Initial Results of Standardized Tests

Donald N. O'Connell, Martin T. Orne, and Ronald E. Shor

Summary: A correlation of .79 was found between scores on the Stanford Hypnotic Susceptibility Scale, Form A, and clinical diagnostic ratings of hypnotizability in a sample of 63 subjects not selected at random. The limitations of this finding are discussed, both with regard to the predictive value of the Stanford Scale, Form A, and as a validity coefficient. The degree of correlation found indicates a predictive value of only 36.8% between one test and the other.

A Comparison of Hypnotic Susceptibility as Determined by Diagnostic Ratings and Standardized Test Results during the First Hypnotization

Donald N. O'Connell, Martin T. Orne, and Ronald E. Shor

Abstract: In a selected sample of 63 subjects, a correlation of .79 was found between SHSS:A results and diagnostic ratings of hypnotizability. The degree of forecasting efficiency corresponds to only 36.8%. Limitations on the interpretation of these findings with regard to the validity coefficient and to their significance as an indication of the predictive value of SHSS:A are discussed.

