Effectiveness (how well all and only the true problems can be identified) and reliability (the extent to which different evaluations of the same page lead to the same results) are key factors in the quality of accessibility evaluation methods. They should be well studied and understood for methods and guidelines that are expected to have a major impact.
This paper presents an experiment aimed at measuring the effectiveness and reliability of different checkpoints taken from WCAG 1.0 and WCAG 2.0. The experiment involved 35 young web developers with some knowledge of web accessibility.
Although this is a small-scale experiment, unlikely to provide definitive and general answers, the results show unequivocally that, for the kind of evaluators chosen in the experiment, checkpoints in general score very low in terms of reliability, and that from this perspective WCAG 2.0 is not an improvement over WCAG 1.0.