DARK SIDE OF INCENTIVES:


on 15-09-2016

DARK SIDE OF INCENTIVES: EVIDENCE FROM A RANDOMIZED
CONTROL TRIAL IN UGANDA*
DAGMARA CELIK KATRENIAK§
Abstract
Throughout our lives, we are routinely offered different incentives as a way to motivate us. Many
researchers have studied the effects of incentives on people’s performance. There can also be
important psychological outcomes in terms of stress and happiness. The current paper contributes
to the literature by explicitly accounting for this performance-versus-well-being tradeoff
introduced by incentives. I implement two types of social comparative feedback regimes, within
and across-class group comparisons, and two types of incentive regimes, financial and reputation
rewards. The results show that rewards can improve performance up to 0.28 standard deviations,
but at a cost of higher stress and lower happiness, whereas comparative feedback alone (without
rewards) increases performance only mildly, by 0.09 to 0.13 standard deviations, but without
impact on students’ stress and happiness. More stressed students exert less effort, perform worse
and attrite 29 percent more often than those who are minimally stressed. The results also help
to identify gender-specific responses to incentives. While boys react strongly to rewards, girls do so
only if they are also provided with feedback. A final contribution comes from a rich dataset of more
than 5000 primary and secondary school students in Uganda, who were repeatedly tested and
interviewed over one academic year.
Keywords: field experiment, Uganda, incentives, education, group competition
JEL Classification: C90, C93, D04, I21, I29, O55
* This research was supported by GA UK grant No. 338911, and by GDN grant No.60. All errors are mine.
§ CERGE-EI (a joint workplace of the Charles University in Prague and the Economic Institute of the Czech
Academy of Sciences), Politickych veznu 7, P.O. Box 882, 111 21, Prague 1, Czech Republic.
Email: [email protected].
1. Introduction
A trophy for the best student in a class, a certificate for the most improved learner in a
course, or a bonus payment for the employee of the month: we are routinely faced with
incentives of different types (symbolic, reputational or financial rewards) throughout our lives.
Rewards are often believed to motivate subjects and subsequently increase their performance, and
are therefore implemented in many different environments (e.g., Lazear, 2000; Fryer, 2010). We are
also routinely compared to classmates/colleagues/competitors by receiving feedback about our
performance. Feedback may motivate subjects to improve their performance (Andrabi et al, 2009;
Azmat and Iriberri, 2010) even though the evidence on such positive effects is more scattered. 1
Feedback and incentives may also influence our well-being (Azmat and Iriberri, 2016) and change
in well-being may further influence people’s decision making and economic outcomes (e.g.,
Veenhoven, 1998, Juster et al., 2010, Helliwell et al, 2012; for more details see section Literature
review).
This is a unique study implemented in the field that analyzes the effects of all types of
motivation schemes on performance and on the well-being (measured by perceived stress and
happiness) of students evaluated in teams. The design offers a direct comparison of the effects of
two feedback groups and two reward groups as well as their interactions (each feedback interacted
with each reward). Feedback differed across feedback-treatment groups with respect to its content.
Each student in the within-class feedback group (class randomly divided into groups of three to four
students) received information about how he scored in Math and in English, how his group-mates
scored and the position of the group within his class. Students in the across-class comparative
feedback group (comparison of entire class) received information about how they scored in Math
1 According to psychologists, positive feedback is believed to increase intrinsic motivation and foster long-term
motivation, whereas negative feedback decreases intrinsic motivation (Burgers et al., 2015; Arnold,
1976; Deci, 1972).
and in English personally (i.e., they were not given information about their classmates) and the
position of their class compared to other classes. Students were not offered rewards until testing
round 4 was finished. Students were orthogonally randomized into financial, reputation and no-reward
groups. Students in the financial reward group could win 2000 UGX per person (which was
approximately 0.80 USD, i.e., 80 US cents, according to that day’s exchange rate). Students in the reputational
reward group were promised that if they qualified for the reward, their names would be announced
in the local newspaper Bukedde (the most popular in the region) and they would receive a
certificate. The general criterion I used was to reward the top-performing 15% and the
most-improved 15% of students/groups/classes. The novelty of the experiment comes from the wide
scope of outcome measures observed, its rich design and its unique data set. The sample consists
of more than 5000 primary and secondary school students from 146 classes located in Southern
Uganda, who were repeatedly tested and interviewed during one full academic year. In total, five
testing rounds were administered.
Looking at the aggregated average treatment effects of feedback provision on students’ overall
performance (Mathematics and English pooled), treated students scored 0.07 standard deviations higher
than control-group students, a difference that is insignificant at conventional levels
(p=0.13 and 0.16). The results of feedback are, however, subject specific. While the students
improved significantly in Mathematics (0.08 to 0.13 standard deviations), they did not improve in
English. Students exposed to rewards increased their performance on average by 0.1 standard
deviations if rewarded reputationally and 0.18 standard deviations if rewarded financially. The
effects are similar when decomposed by subject. The effects are amplified if students face any of the
treatment combinations (the effect size is between 0.19 and 0.28 standard deviations). The results
on the outcomes other than learning - such as happiness and stress - put the benefit of reward
provision into the shade. The students who were offered only rewards (without any feedback) had
their stress levels elevated and happiness levels decreased, whereas the well-being of students who
received only feedback remained unchanged. Moreover, most of the treatment combinations lead to
a decrease in students’ stress and an increase in, or no effect on, happiness. Thus, we can speak of an
important trade-off: the introduction of rewards increases performance more than pure feedback,
but at the same time it lowers students’ well-being. The effects persist when I control for multiple
comparison testing by adjusting the p-values using the Simes step-up method (Simes, 1986).
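As an illustration of the multiple-testing adjustment mentioned above, the following is a minimal sketch. It assumes the step-up adjustment takes the form commonly associated with Simes (1986), in which the i-th smallest of m p-values is scaled by m/i and the results are made monotone starting from the largest; this coincides numerically with the Benjamini-Hochberg adjustment. The function name and the example p-values are hypothetical and are not taken from the paper's estimates.

```python
def simes_adjust(p_values):
    """Return step-up adjusted p-values in the original order.

    Assumed form: the i-th smallest p-value is multiplied by m / i, then a
    running minimum is taken from the largest rank downwards.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four outcomes (illustration only).
example = simes_adjust([0.01, 0.04, 0.03, 0.20])
print(example)
```

An effect would be reported as robust to the adjustment if its adjusted p-value stays below the chosen significance level.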
In some experiments, boys and girls responded very differently to certain incentives. The
second contribution of this paper is to shed light on the underlying reasons behind these gender
differences. I attribute this difference to the existence of two types of competitions – intrinsic, or
internally driven competition, developed by personal feelings based on comparison to others, and
extrinsic competition induced by rewards. According to the results, if girls are given rewards but no
feedback, they will significantly underperform boys. If girls are, however, repeatedly informed
about their positions (no matter what type of feedback they receive), their performance will
improve and become comparable to that of boys. In other words, comparative feedback in a
tournament environment plays a crucial role in motivating girls to improve their
performance. Boys, in contrast, react only to rewards. The current design does not allow me to
distinguish whether gender differences are caused by the fact that students were evaluated in
groups (group identity effect), or that they were repeatedly informed about their standing.
Nevertheless, since both within- and across-class feedback groups deliver similar effects, it seems
more likely that the effect is driven by social comparison rather than group identity. Such a result
would be in line with the “reference group neglect” introduced by Camerer and Lovallo (1999): students
neglect information about others and focus solely on the feedback regarding their own performance.
The results of this experiment may be important especially for policy-makers in finding the
optimal strategy for improving performance and well-being of students in primary and secondary
schools. Despite numerous studies in the literature that are designed to improve students’
performance and/or their attendance, concerns about students’ well-being have been discussed
minimally. Current well-being serves as an important prerequisite for future performance.
Therefore, policy-makers should use a great amount of caution in designing educational rewards
and take into account the impacts on student well-being. Further research should aim to study the
long-term effects of changes in student well-being on performance to shed light on whether an
increase in stress today increases or decreases the benefits of tomorrow.
2. Literature Review
According to social comparison theory 2, informing a child about his/her performance without
comparing it to other children causes unstable evaluations of the child’s ability and can influence
effort negatively (Festinger, 1954 3; the founder of the social comparison theory). In contrast,
comparison enables a child to find his/her relative position within a particular group which can
lead, via enhanced competitiveness, to an increase in effort and performance improvement.
Feedback provision, as a way to inform subjects about their absolute or relative standing, has
been analyzed in different environments and has delivered opposing results. Andrabi, Das and
Ijaz-Khwaja (2009), for example, provided parents, teachers and headmasters with report cards
informing them how children are doing in a particular school. The intervention resulted in 0.1
standard deviation improvement in student test scores. Azmat and Iriberri (2010) informed high
school students about their relative standing and in this way helped to improve student grades by 5
per cent. Additionally, university students in the United Kingdom responded positively,
improving their performance by 13% in response to feedback regarding their own absolute
2 Social comparison theory is about “our quest to know ourselves, about the search for self-relevant
information and how people gain self-knowledge and discover reality about themselves” (Mettee and Smith
1977, p. 69–70).
3 Festinger in his original paper focused on the social comparison of abilities and opinions only. Since then,
however, many different dimensions of social comparison have been studied (e.g., Buunk & Gibbons, 1997,
2000; Suls & Wheeler, 2000). See e.g. Locke and Latham, 1990; Suls and Wheeler, 2000, for an overview of
papers in psychology and management science. See Buunk and Gibbons (2007) for an overview of work in
social comparison and the expansions of research on social comparison.
performance (Bandiera et al., 2015) 4. Not all studies, however, find positive responses to feedback
provision. Azmat et al. (2015) do not find any effect of relative feedback on university student
performance (on the contrary, in a short period right after the feedback was provided they even
find a slight downturn in student performance). More evidence on the negative effects of incentives
on performance can be found in experiments implemented at the workplace. Workers in two
experiments lowered their performance after they received information about their rank position
(Bandiera et al., 2011a, 2011b). Health workers also decreased their performance during a training
program in Zambia when exposed to social comparison (Ashraf et al., 2014) 5.
The effect of feedback depends on who the subjects are compared with, how they are compared
and whether they are rewarded for their performance. Students face social comparison in their
classrooms on a daily basis and it can strongly influence their self-esteem and their performance
(Dijkstra et al., 2008) as well as their well-being (Azmat and Iriberri, 2016). It is therefore
important to understand with whom students should optimally be compared. If students are compared
to peers who are slightly better, their effort and performance tend to increase. Performance and effort
decrease if the comparison target is too far from a student’s ability (Ray, 2002). Students can be
compared individually or in groups. A group’s outcome depends on each member’s contribution
and may foster mutual help (Slavin, 1984) in addition to positive peer effects (Hoxby, 2000;
Sacerdote, 2001). Groups can be formed endogenously (e.g., by students themselves based on
friendship) or exogenously (Blimpo, 2014) and they can be exposed to competition. In some
4 Other studies with positive effects of feedback provision: Tran and Zeckhauser (2012), Blanes-i-Vidal and
Nossol (2010) or Fryer (2010).
5 There are also studies in controlled lab environments of the effects of feedback provision, e.g. Falk and Ichino
(2006) and Mas and Moretti (2009) found that if one lets people observe the behavior of their peers, their
performance improves. Kuhnen and Tymula (2012) and Duffy and Kornienko (2010) find a positive
effect of the provision of private feedback. Eriksson et al. (2009), in contrast, find that rank feedback does not
improve performance (even if pay schemes were used). Hannan et al. (2008) find a negative effect of feedback
on relative performance under a tournament incentive scheme (if feedback is sufficiently precise).
studies, the effects of interventions are more pronounced if students are involved in tournaments
(Eriksson et al., 2009; Bigoni et al., 2010; Van Dijk et al., 2001) 6.
Subjects often improve their performance if they are rewarded financially. Bettinger
(2012), Angrist et al. (2002, 2006, 2009, 2010), Kremer (2004), Bandiera (2010), and others
studied the effects of the provision of cash payments, vouchers or merit scholarships to students
who successfully completed a pre-defined task. In such experiments knowing one’s relative position
is not crucial since success does not depend on the performance of others. In order to induce
stronger competitive pressure, subjects need to be put into a tournament with a limited number of
winners. Van Dijk et al. (2001) conclude, based on an experiment in which they
compared different payment schemes, that a tournament-based scheme is superior for a firm
to a piece-rate or team payment scheme. In the case of Blimpo (2014),
groups involved in the tournament improved by a similar amount to groups rewarded for reaching a
performance target. All treatments (with or without competition) improved
student performance by 0.27 to 0.34 standard deviations. The evidence is mixed, however. Fryer
(2010) and Eisenkopf (2011) studied the impact of different financial rewards on student
performance and did not find any significant improvements (although in the case of Fryer
(2010) the effect might not have been detected because of a lack of power, as the author notes).
Even if financial rewards result in performance improvements, they may not
necessarily be cost-effective (e.g., Bandiera et al., 2012 7). Alternative rewards 8 that may be
more cost-effective have drawn researchers’ attention. For example, Kosfeld and Neckermann
(2011) designed a field experiment where students in the treatment group were offered symbolic
6 See Hattie and Timperley (2007) for a review of the literature on the provision of feedback.
7 Bandiera et al. (2012) find the financial rewards cost-ineffective since only a fraction of the students from
the second quartile of initial ability distribution react positively to financial rewards.
8 See also theoretical models studying the effects of reputation and symbolic rewards on subjects’
performance in work of Weiss and Fershtman (1998), Ellingsen and Johannesson (2007), Besley and Ghatak
(2008), Moldovanu et al (2007) or Auriol et al. (2008).
rewards (a congratulatory card) for the best performance while students in a control group were
not offered anything. Their results provide strong evidence that status and social recognition
rewards have motivational power and lead to an increase in work performance of 12
percent on average. Subjects in the real-effort experiment conducted by Charness et al. (2010) increased their
effort in response to relative performance and expressed a “taste for status”. Jalava et al.
(2015) offered sixth grade students in Swedish primary schools different types of non-financial
rewards: criterion-based grades from A to F, a grade A for scoring among the top 3 students, a
certificate for scoring among the top 3 students, or a prize (in the form of a pen) for
scoring among the top 3 students. The effects were heterogeneous with respect to original
ability (students from the two middle quartiles responded the most to the incentives) and with respect to
gender (boys improved their performance in response to rank-based incentives only, girls also to
symbolic rewards). Rank-based grading and symbolic rewards, however, crowded out the intrinsic
motivation of students. Thus, even non-monetary rewards have the power to motivate subjects to
improve their performance. Naturally, questions arise: what can we learn from a direct
comparison of monetary and non-monetary rewards? Would financial rewards prevail? Levitt et al.
(2012) present the results of a set of field experiments in primary and secondary schools, in which
they provided students with financial as well as non-financial rewards, with and without delay, and
with incentives framed as gains and losses. Non-monetary rewards were as effective as monetary
rewards (and therefore more cost-effective).
Feedback and incentives may also influence our psychological well-being (Azmat and
Iriberri, 2016). Change in well-being has been found to influence people’s decision making and
economic outcomes. An increase in happiness 9 is associated with better health, sharper
awareness, higher activity in addition to better social functioning (Veenhoven, 1998). Education is
one determinant of happiness (higher education is associated with higher well-being (Helliwell et
9 See Fordyce (1988) for a review of happiness measures, MacKerron (2012) for a review of the economics of
happiness, and Dolan et al. (2008) for a review of well-being.
al., 2012; Dolan et al., 2008)). Happiness is negatively related to stress. Subjects under stress make
suboptimal decisions, which, in the case of students, could lead to incorrect answers during
examinations, or suboptimal choices in their careers (e.g., to be absent from school, to drop out of
school or to exert low levels of effort). Both stress and happiness influence subjects’ health (Juster
et al., 2010; McEwen, 2008; Schneiderman et al., 2005). Stress influences learning and memory, and
it creates learning problems (Lubin et al., 2007; Wolf, 2009). In the extreme, stress hormones may
even influence brain structure (Lupien et al., 2009). Therefore, the consequences of interventions
on the well-being of students should not be underestimated.
This experiment differs from existing studies in the complexity of the incentive schemes that
have been implemented and in the broader scope of outcomes: in addition to the commonly used
performance measures, I study students’ confidence, stress, happiness and their academic aspirations. The predictions of the
effects of my interventions based on the existing literature are ambiguous. Evaluation of students
in groups should, via enhanced cooperation within groups, push toward group average improvements. If
the group is, however, big enough, free-riding behavior may prevail and result in heterogeneity
within the group outcomes. Informing students about the position of their group could lead to
improvements in performance via enhanced competition or demotivate students with a negative
attitude toward competition. Alternatively, students could neglect information about their group
members and focus solely on their own performance (as found in Camerer and Lovallo, 1999). The effect
potentially depends on group gender and/or ability composition (as found in Apesteguia, Azmat,
and Iriberri, 2012) and group position in the group ability distribution. Students included in both
financial and reputational reward treatments are expected to improve their scores, at least the ones
in the second quartile of the ability distribution.
3. Randomization and experimental design
In this experiment, I study whether the provision of comparative feedback about own
performance and the performance of respective group members can influence students’
performance and their psychological well-being measured by stress and happiness. To evaluate the
effect of the intervention, I designed a randomized controlled trial (RCT). At the
beginning of the academic year, the sample was stratified and randomized into two feedback-
treatment groups and one control group (as shown in Figure 1). Students in the within-class feedback
group were randomly divided into groups of three to four classmates and were evaluated as groups
within the class. In other words, group averages were taken into account when comparing the
students’ performance rankings. The students in the across-classes feedback group were evaluated
as a whole class (using class average) and were compared to other classes of the same grade in
different schools.
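The within-class evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: each student has a single total score (for example Mathematics plus English), groups are ranked by their average from highest to lowest, and ties are broken by input order. The function name, group labels and scores are hypothetical.

```python
from collections import defaultdict


def rank_groups(student_scores):
    """Rank groups within one class by average score (rank 1 = best).

    student_scores: list of (group_id, total_score) pairs, one per student.
    Returns (ranks, averages), both keyed by group_id.
    """
    by_group = defaultdict(list)
    for group_id, score in student_scores:
        by_group[group_id].append(score)
    # Group average, as used for the group's position within the class.
    averages = {g: sum(v) / len(v) for g, v in by_group.items()}
    ordered = sorted(averages, key=lambda g: averages[g], reverse=True)
    ranks = {g: position for position, g in enumerate(ordered, start=1)}
    return ranks, averages


# Hypothetical class with three groups of three to four students.
scores = [("A", 20), ("A", 30), ("A", 25),
          ("B", 10), ("B", 40), ("B", 22), ("B", 20),
          ("C", 30), ("C", 35), ("C", 31)]
ranks, averages = rank_groups(scores)
print(ranks, averages)
```

The across-class comparison follows the same logic with the class average in place of the group average and classes of the same grade as the comparison set.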
Feedback differed across treatment groups with respect to its content. Each student in the
within-class feedback group received information about how he scored in Math and English, how
his group-mates scored and the position of the group within his class. Furthermore, starting from
testing round 3, the student received information about how he (and his group-mates) improved or
worsened between the two preceding testing rounds. Students in the across-class feedback group
received information about how they scored in Math and in English personally (i.e., they were not
given information about their classmates) and the position of their class compared to other classes.
The positions in both treatments were emphasized on a rank-order graph (see Appendix B4 and
B5). Students in the across-class feedback group received their first feedback with a one-round delay
due to logistical issues. Students in the control group did not receive any information; they only
answered exam questions. Students were not offered further rewards until testing round 4 was
finished.
Figure 1: Stratification and randomization scheme
The relative standing of the group was based on the average group score from Mathematics and
English. Students were tested repeatedly during an academic year and received feedback three to
four times depending on the feedback group (across-class/within-class feedback, respectively).
In order to study the effects of monetary and non-monetary rewards, I orthogonally re-
randomized the sample at the school level 10. The randomization (which happened three to four
weeks before the final testing round) divided the sample into 9 groups – one pure control group,
four pure treatment groups (i.e., one type of treatment only) and four interacted treatment groups
(two types of feedback interacted with two types of rewards). The scheme with all treatments and
the project timeline can be found in Appendix B2. Students were informed of the exact
rules of the competition during our personal visit and also via posters we left in each class.
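The nine cells of the cross-cutting design can be enumerated with a short sketch. The arm labels and the `classify` helper are hypothetical names chosen for illustration; only the counts (one pure control, four pure treatments, four interacted treatments) come from the text.

```python
from itertools import product

# Hypothetical labels for the two randomization dimensions.
FEEDBACK_ARMS = ["none", "within-class", "across-class"]
REWARD_ARMS = ["none", "financial", "reputational"]


def classify(feedback, reward):
    """Label a cell of the 3 x 3 design."""
    if feedback == "none" and reward == "none":
        return "pure control"
    if feedback == "none" or reward == "none":
        return "pure treatment"
    return "interacted treatment"


cells = {(f, r): classify(f, r) for f, r in product(FEEDBACK_ARMS, REWARD_ARMS)}

# Count how many cells fall under each label.
counts = {}
for label in cells.values():
    counts[label] = counts.get(label, 0) + 1
print(counts)
```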
10 The randomization was done at school level in order to avoid spillover effects and possible confusion.
The aim of such a cross-cutting design was to observe whether the introduction of rewards
could enhance student performance, especially if interacted with the within-class and across-class
feedback treatments. Furthermore, the design also allows me to study the effects of treatments on
students’ well-being measured by happiness and stress. Students in the financial treatment could win
2000 UGX per person (approximately 0.80 USD, i.e., 80 US cents, according to that day’s exchange rate).
Students in the reputational reward scheme were promised that if they qualified for the reward,
their names would be announced in the local newspaper Bukedde (the most popular in the region).
The qualification criteria differed based on the original randomization into treatments (see Table 1) but
the general rule was to reward the top-performing 15% of students/groups/classes as well as the
most-improved 15% of students/groups/classes.
Table 1: Qualification criteria for winning the rewards
                             Financial rewards            Reputational rewards         No rewards
                             (2000 UGX)                   (winners’ names published
                                                          in a local newspaper)

Within-class social          15% of best performing and   15% of best performing and   Pure within-class social
comparison (Treatment 1)     15% of best improving        15% of best improving        comparison group,
                             groups                       groups                       no rewards
                             (524 students)               (666 students)               (1205 students)

Across-class social          15% of best performing and   15% of best performing and   Pure across-class
comparison (Treatment 2)     15% of best improving        15% of best improving        comparison group,
                             classes                      classes                      no rewards
                             (409 students)               (543 students)               (1460 students)

Control group                15% of best performing and   15% of best performing and   Pure Control Group,
                             15% of best improving        15% of best improving        no rewards
                             students                     students                     (1260 students)
                             (498 students)               (585 students)
Note: In order to avoid confusion, students were given exact information regarding the number of winning groups
(if in T1), the number of winning classes (if T2) and the number of winning students (if originally in control group). I
used percentages in order to guarantee a comparable number of winners across all treatment groups.
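The general qualification rule can be illustrated with a minimal sketch. It assumes that the number of winners is 15% of the competing units rounded up, and that a unit may qualify under both criteria; the paper does not state the rounding rule, so both are assumptions. The function name, unit labels and scores are hypothetical.

```python
import math


def select_winners(final_scores, improvements, share=0.15):
    """Pick the top-performing and most-improved units.

    final_scores, improvements: dicts keyed by unit (student, group or class).
    Rounding up the number of winners is an assumption of this sketch.
    """
    n_winners = math.ceil(share * len(final_scores))
    top = sorted(final_scores, key=final_scores.get, reverse=True)[:n_winners]
    improved = sorted(improvements, key=improvements.get, reverse=True)[:n_winners]
    return set(top), set(improved)


# Hypothetical example with ten competing groups.
final_scores = {f"g{i}": 10 * i for i in range(1, 11)}
improvements = {f"g{i}": 0 for i in range(1, 11)}
improvements["g1"] = 9
improvements["g3"] = 8
top, improved = select_winners(final_scores, improvements)
print(top, improved)
```

With ten units, 15% rounds up to two winners per criterion, which is why students were told the exact number of winning units instead of a percentage.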
4. Timing, logistics and final sample
The experiment took two years (August, 2011 – August, 2012) 11. The intervention
implementation and the core data collection took place from January 2012 until December 2012.
Students were tested twice per term, i.e., approximately every one and a half months. The agenda of
each visit was similar. After we entered the class, students in feedback-treatment groups received
their feedback, while control students immediately started answering the questionnaires and Math and
English exam questions 12.
The final sample consists of 52 schools 13, 31 primary and 21 secondary schools out of which
19 are public, 23 are private and 10 are community schools. All schools were located in rural areas.
In total, 146 classes (P6 and P7 in primary schools, S1 up to S4 in secondary schools) summing up
to more than 5000 students were repeatedly tested. Apart from Math and English scores, I also
collected information about student academic aspirations 14, immediate effort, strategic effort in
the form of preparation for the exam and immediate happiness revealed right before/after each exam. I
also repeatedly inquired about student expectations of their own score from Mathematics and
11 In 2011, I collected students’ answers to basic demographic questions, questions regarding family
background and family composition, parental status, education and job, wealth of the family and additional
questions regarding the students’ interests, opinions, self-esteem and aspirations. Due to large attrition
between 2011 and 2012 and the admission of new students throughout the 2012 academic year, the detailed
information collected in 2011 is available for only circa 52% of students participating in the 2012 experiment.
I also collected data about school (school type, school area, school fee structure and school equipment),
headmasters and teachers (demographic information, years of experience, salary and their opinions).
12 The order was as follows: “Before Math questionnaire”, followed by Math examination that lasted 30
minutes; “After Math Before English questionnaire”, English exam in the subsequent 20 minutes and finally
“After English questionnaire”. The core questions of the questionnaires were student expectations regarding
how many points they thought they would obtain from Math and English examinations, how much effort they
planned to put/they put into answering the questions and the level of their current happiness. We asked all of
these questions before as well as after each exam. No before-Math and before-English questionnaires
were collected during the baseline survey since students saw the examinations for the first time.
13 Initially there were 53 schools in my sample; one decided not to participate after I conducted the baseline
survey. The school was initially randomized into the control group and its exclusion did not lead to significant
differences in terms of baseline observables.
14 Students answered the question “What would you do if you were given an hour of extra time every day after
school?” and had 15 binary scenarios to choose from. Out of 15 scenarios, five asked for a choice between
educational activities (such as revising material taught at school) and work for money (such as selling
vegetables on the market), five educational versus relaxing activities (such as talking to friends) and five
work versus relaxing activities. I combined the answers into three indicators of the ordered preferences.
English in the testing in order to measure their confidence. To study students’ well-being, I
collected data on their happiness based on the Subjective Happiness Scale (Lyubomirsky and
Lepper, 1997) and subjective stress based on the Perceived Stress Scale (Cohen, Kamarck and
Mermelstein, 1983). The happiness score is calculated as the average of four questions answered on a
7-point Likert scale (with 1 being maximum and 7 minimum). Similarly, the stress score is based on the answers
to four questions on a 5-point Likert scale, where 1 means no stress and 5 maximum stress. The
questionnaires can be found in Appendix B2 and B3. 15
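The scoring of the two scales can be sketched as follows. The happiness score is an average of four items as stated above; treating the stress score as an average as well, and assuming that no items need reverse coding, are simplifications of this sketch. The function name and the example responses are hypothetical.

```python
def scale_score(item_responses, scale_max):
    """Average four Likert responses, each between 1 and scale_max."""
    if len(item_responses) != 4:
        raise ValueError("expected four item responses")
    if any(not 1 <= r <= scale_max for r in item_responses):
        raise ValueError("response outside the scale range")
    return sum(item_responses) / len(item_responses)


# Hypothetical responses of one student.
happiness = scale_score([2, 1, 3, 2], scale_max=7)  # 1 = maximum happiness
stress = scale_score([1, 2, 2, 1], scale_max=5)     # 1 = no stress
print(happiness, stress)
```

Note that the two scales run in opposite directions: a lower happiness score means a happier student, while a lower stress score means a less stressed student.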
5. Baseline summary statistics and randomization balance
Data on student performance, demographics and student responses to questions suggest that
randomization divided the sample into groups that are similar in expectations (see Tables 2 and 3
and Appendix A for the treatment-control group comparisons). A few significant differences can be
observed between the across-class feedback group and the control group, indicating that students in the
across-class feedback group were slightly more stressed, slightly less happy and exerted slightly
more effort compared to the control group. If the covariates are correlated with student
performance, such an imbalance could bias the estimation of the treatment effect of the
intervention (Firpo et al., 2014). One can expect some imbalances between treatment and control
15
The headmasters of all schools agreed to participate in the experiment. The headmasters had the option to
withdraw from participation at any time during the experiment; nonetheless, no school chose to do
so. I asked the headmasters to communicate the content of the project to parents during their regular
parental meetings. Besides the headmasters’ consent, I also had the full cooperation of a non-governmental
organization, UCDT, that provided child sponsorship programs in Uganda. All schools in my sample cooperated
with the UCDT: at least one of their students received sponsorship for his/her studies from the UCDT. The
letter of accordance in Appendix H explicitly states their full support and that my project is in
line with their goals. In order to minimize possible costs arising from our presence at school, the duration of
the meetings was limited to a maximum of 120 minutes. All meetings were arranged with the headmasters
one week in advance to find the most suitable and least disruptive time in terms of delivered curriculum. Exam
questions were designed based on the leaving examination questions. Administering exams in Mathematics
and English was intended to serve students as additional training for the leaving examinations they face
during the final years of their studies in primary (grade 7) and lower secondary (grade 4) schools. There is no
Institutional Review Board (IRB) for social sciences in the Czech Republic that could issue an IRB approval
for my experiment. The experiment was designed in line with the conventions of IRB standards.
groups to occur purely by chance: as the number of balance tests rises, the probability of rejecting
the null hypothesis of no difference between treatment and control group also increases. In my case,
treatment and control groups differ significantly in less than 5% of all cases.
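The point about chance imbalances can be made concrete with a small calculation. The sketch below assumes that the balance tests are independent and each run at the 5% level, which is a simplification (the baseline covariates are likely correlated); the function names and the test counts in the loop are purely illustrative.

```python
def prob_at_least_one_rejection(n_tests, alpha=0.05):
    """Chance of at least one false rejection across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_rejections(n_tests, alpha=0.05):
    """Expected number of false rejections when every null hypothesis is true."""
    return n_tests * alpha

# Illustrative numbers of balance tests, not the count used in the paper.
for k in (1, 10, 36):
    print(k, round(prob_at_least_one_rejection(k), 3), round(expected_false_rejections(k), 2))
```

Under these assumptions, finding significant differences in roughly 5% of comparisons is what successful randomization would be expected to produce.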
The average student scored 8.06 points out of 50 in the Mathematics exam and 14.07 points out of
50 in English. The actual scores are in most cases below the students’ expectations. The
miscalibration of own performance is approximately 100 per cent. The average student put “a lot
of effort” into answering the exam questions (intensity 4 on the 5-point Likert scale) and he seems to be
“very happy” according to the immediate happiness scale (intensity 2 on the 7-point Likert scale, where 1 is
the maximum). He finds the Mathematics exam of comparable difficulty and the English exam easier
compared to the regular exams at school. The average student is overall quite happy (based on the
Perceived Happiness Scale) and he has a low level of stress (Perceived Stress Scale). If the average
student had a chance to have one hour of extra time every day, he would choose education over rest
in 4.3 cases out of 5; in 3.9 cases out of 5 he would choose education over work; and in 3.1 cases out
of 5 he would choose work over rest. The aspiration measures reveal that students prefer
education to both work and rest.
Table 2: Randomization balance: comparison of mean characteristics of students in treatment and
control groups, baseline tests and interviews

                                 Means                                  Mean differences        Joint
                                 Within-class  Across-class  Control    (T1 – C)    (T2 – C)    p-value
                                 feedback (T1) feedback (T2) (C)
PERFORMANCE (Baseline)
Mathematics                      8.063         8.838         8.655      -0.564§     0.197       0.183
English                          14.072        14.630        14.432     -0.359      0.198       0.699
Sum Mathematics + English        22.134        23.468        23.088     -0.923      0.395       0.426
OTHER THAN PERFORMANCE
Gender                           0.539         0.516         0.517      0.022       -0.001      0.239
                                                                        (0.015)     (0.014)
Age                              17.058        17.049        16.999     0.059       0.049       0.737
                                                                        (0.079)     (0.078)
Average class size               43.912        47.245        43.337     0.575       3.908       0.546
                                                                        (3.208)     (3.776)
Expected number of points        4.331         4.536         4.552      -0.221      -0.015      0.299
from Mathematics                                                        (0.150)     (0.145)
Expected number of points        5.715         5.757         5.796      -0.081      -0.039      0.879
from English                                                            (0.161)     (0.144)
Perceived difficulty of Math     3.341         3.495         3.423      -0.082§     0.072       0.030
exam                                                                    (0.053)     (0.052)
Perceived difficulty of English  3.644         3.644         3.677      -0.033      -0.033      0.752
exam                                                                    (0.052)     (0.049)
Immediate happiness after        3.287         3.226         3.132      0.155*      0.094       0.230
Math exam                                                               (0.092)     (0.092)
Immediate happiness after        2.909         2.869         2.782      0.127§      0.087       0.303
English exam                                                            (0.085)     (0.085)
Effort put into Math exam        3.447         3.535         3.504      -0.057      0.021       0.298
                                                                        (0.053)     (0.052)
Effort put into English exam     3.547         3.627         3.553      -0.006      0.074*      0.141
                                                                        (0.046)     (0.044)
Subjective stress                1.504         1.588         1.439      0.065§      0.149***    0.001
                                                                        (0.041)     (0.036)
Subjective happiness             2.869         2.913         2.806      0.064       0.107*      0.155
                                                                        (0.058)     (0.055)
Education over work              3.538         3.496         3.477      0.060       0.019       0.526
                                                                        (0.057)     (0.059)
Education over relax             3.834         3.756         3.778      0.056       -0.021      0.269
                                                                        (0.049)     (0.049)
Work over relax                  2.766         2.701         2.803      -0.037      -0.102      0.524
                                                                        (0.094)     (0.090)

T1 stands for within-class social comparison group, T2 for across-class comparison group and C represents
control group with no feedback provided. Robust clustered standard errors at class level are in parentheses,
adjusted for stratification. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
Table 3: Randomization balance: comparison of mean characteristics of students, by treatment and
control groups, baseline tests and interviews

                                 Means                                  Mean differences        Joint
                                 Financial     Reputation    No Rewards (Fin – No)  (Rep – No)  p-value
                                 Reward (Fin)  Reward (Rep)  (No)
PERFORMANCE (Baseline)
Mathematics                      9.92          10.72         10.49      -0.507      0.292       0.788
                                                                        (1.215)     (1.258)
English                          10.796        10.394        10.853     -0.096      -0.497      0.937
                                                                        (1.211)     (1.418)
Sum Mathematics + English        20.718        21.116        21.353     -0.603      -0.206      0.955
                                                                        (2.199)     (2.660)
OTHER THAN PERFORMANCE
Gender                           0.553         0.545         0.514      0.039**     0.031*      0.089*
                                                                        (0.019)     (0.017)
Age                              14.376        14.196        14.437     -0.088      -0.267      0.798
                                                                        (0.350)     (0.359)
Average class size               45.434        55.137        45.987     0.388       10.087**    0.067
                                                                        (4.173)     (4.332)
Expected number of points        4.839         4.964         4.917      -0.028      0.096       0.882
from Mathematics                                                        (0.257)     (0.236)
Expected number of points        5.152         5.132         5.162      0.018       -0.002      0.994
from English                                                            (0.255)     (0.276)
Perceived difficulty of Math     3.283         3.361         3.256      0.043       0.121       0.499
exam                                                                    (0.088)     (0.093)
Perceived difficulty of English  3.407         3.417         3.398      0.017       0.027       0.975
exam                                                                    (0.080)     (0.085)
Immediate happiness after        2.616         2.633         2.713      -0.108      -0.091      0.793
Math exam                                                               (0.159)     (0.145)
Immediate happiness after        2.504         2.529         2.548      -0.053      -0.029      0.911
English exam                                                            (0.099)     (0.112)
Effort put into Math exam        3.617         3.631         3.684      -0.053      -0.040      0.722
                                                                        (0.083)     (0.086)
Effort put into English exam     3.526         3.489         3.554      -0.027      -0.064      0.603
                                                                        (0.055)     (0.064)
Subjective stress                6.849         6.799         6.883      -0.034      -0.084      0.912
                                                                        (0.230)     (0.192)
Subjective happiness             10.376        10.933        10.671     -0.324      0.234       0.226
                                                                        (0.343)     (0.278)
Education over work              3.822         3.735         3.910      -0.088      -0.176*     0.209
                                                                        (0.111)     (0.098)
Education over relax             4.234         4.266         4.296      -0.053      -0.020      0.748
                                                                        (0.081)     (0.077)
Work over relax                  2.767         3.250         3.141      -0.369***   0.115       0.001***
                                                                        (0.125)     (0.102)

Fin stands for financially rewarded group, Rep for reputationally rewarded group and No represents the
control group with no rewards. Robust standard errors adjusted for clustering at class level are in
parentheses. § significant at 15%, * at 10%; ** at 5%; *** at 1%.
6. The effects of incentives on students’ performance and their well-being
The core question of the experiment is how different incentive schemes (social comparison,
financial and non-financial rewards) influence student performance and well-being. I first
analyze the aggregated treatment effects (i.e., the overall treatment effect of the within-/across-class
feedback in each testing round and the effects of financial and reputational rewards in the
final testing round). Later, I disentangle the pure treatment effects (pure feedback and pure
rewards) and the interacted effects (each type of feedback interacted with each type of reward). I
discuss the role of group gender/ability composition and I study whether the type of feedback
students received matters for the improvement. Finally, I turn to the distributional analysis.
6.1. Average treatment effects on performance
In Table 4 I summarize the aggregated average treatment effects of feedback and rewards on
students’ overall performance (columns (1) to (4)) and on their subjective well-being (stress
and happiness in columns (5) and (6)). The effects are expressed in standard deviations. The provision
of feedback (pooled or separately by feedback type) increases students’ overall performance by
0.07 standard deviations, which is considered a small effect. In other words, the average student
who received within-class or across-class feedback scored higher than 52.8 percent of students in
the control group. The type of feedback does not play a significant role. The aggregated results are
similar in size to the results for students who only received feedback during the academic
year and were not included in any competition for rewards (for decomposed pure and interacted
treatment effects see Figure 3 and Appendices C1 and C2). The results are very similar to those
of Jalava et al. (2015), who tested the effects of different grading schemes: 0.077 standard
deviations for criterion-based grading and 0.080 standard deviations for tournament grading. In
Pakistan, parents and teachers received report cards regarding the performance of their
children/students, which led to a 0.1 standard deviation increase in student performance (Andrabi
et al., 2009).
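The translation of an effect in standard deviations into a share of control-group students, such as the 52.8 percent reported for a 0.07 standard deviation effect, is consistent with evaluating the standard normal distribution function at the effect size. The sketch below illustrates that reading; it assumes normally distributed control-group scores, the function names are hypothetical, and the paper does not state the exact procedure used.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def percent_of_control_outscored(effect_sd):
    """Share of the control group (in percent) below the average treated
    student, assuming control scores follow a normal distribution."""
    return 100.0 * normal_cdf(effect_sd)

print(round(percent_of_control_outscored(0.07), 1))  # 52.8
```

The same conversion applies to the well-being outcomes, where the percentage refers to the share of control-group students reporting lower stress or lower happiness.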
Table 4: Aggregated average treatment effects of the provision of feedback and rewards on the
overall performance and students’ subjective well-being
AGGREGATED AVERAGE
TREATMENT EFFECT ON:
Aggregated feedback treatment
Within-class social comparison
Across-class social comparison
STUDENTS’ OVERALL PERFORMANCE
(1)
(2)
(3)
0.073*
(0.041)
0.064
(0.039)
(4)
STRESS
HAPPINESS
(5)
(6)
§
§
0.061
-0.001
-0.111*
(0.043)
(0.090)
(0.058)
§
Aggregated reward treatment
0.069
-0.104
-0.058
(0.082)
(0.057)
(0.046)
§
Financial Rewards
0.226**
-0.108
0.133**
0.176***
(0.107)
(0.052)
(0.062)
(0.070)
§
§
Reputational Rewards
0.177
0.102**
-0.112
(0.051)
(0.119)
(0.070)
Yes
Controlled for strata
Yes
Yes
Yes
Yes
Yes
R-squared
0.714
0.715
0.645
0.058
0.634
0.078
N
5108
5108
5102
4105
5102
4056
Note: columns (1) – (4) show the average treatment effects (ATE) of differently aggregated treatments on
students’ performance. Columns (5) and (6) represent the average treatment effects on students’ well-being
(stress and happiness respectively).
0.074
(0.045)
§
0.071
(0.047)
The aggregated effects of rewards are weaker (financial rewards) or comparable (reputational
rewards) compared to the existing results: 0.18 standard deviations in response to financial
rewards and 0.102 standard deviations in response to reputational rewards. The effect size depends on
whether students received feedback or not. Decomposition of the aggregated effects into pure and
interacted treatments shows that while rewarded students without feedback improved by 0.06 to
0.13 standard deviations, students with feedback improved their performance by 0.12 to 0.28
standard deviations (for further details see Figure 3 and Appendices C1 and C2). In other words, the
average student rewarded with financial (reputational) rewards without any feedback scored higher
than 55.2 percent (52.4 percent) of students in the control group. If the average student rewarded
reputationally received repeated (within-/across-class) feedback during the academic year, she
scored higher than 54.8 per cent/57.3 per cent of the control-group students. Similarly, the average
student who was rewarded financially and received within-/across-class feedback scored higher
than 57.9 per cent/61.4 per cent of students in the control group. Jalava et al. (2015) find similar
results. Students in their study who competed for a certificate given to the first three
students improved their performance by 0.083 standard deviations. Students who competed for
(non-monetary) prizes improved by 0.125 standard deviations. Blimpo (2014) studied the effects of
financial rewards provided to students in Benin on an individual or group basis. Students improved
their performance by 0.27 to 0.34 standard deviations.
Table 5: Aggregated average treatment effects of the provision of feedback and rewards on the
overall performance (Math and English pooled) – by subject
AGGREGATED AVERAGE
MATHEMATICS
ENGLISH
TREATMENT EFFECT ON
STUDENTS’
(1)
(2)
(4)
(5)
PERFORMANCE – BY
SUBJECT
Aggregated feedback
treatment
Within-class social
comparison (T1)
Across-class social
comparison (T2)
Aggregated reward
treatment
Financial Rewards
0.094*
(0.051)
0.128**
(0.063)
0.099*
(0.059)
§
0.089
(0.056)
-0.009
(0.036)
0.126**
(0.055)
-0.028
(0.039)
0.012
(0.040)
0.158**
0.142*
(0.065)
(0.078)
0.108**
Reputational Rewards
0.115*
(0.053)
(0.064)
Yes
Yes
Controlled for strata
Yes
R-squared
0.645
0.688
0.645
N
5102
5093
5102
Note: columns (1) – (4) show the average treatment effects (ATE) of differently aggregated treatments on
students’ performance. Columns (5) and (6) represent the average treatment effects on students’ well-being
(stress and happiness respectively).
The effects on students’ performance differ by subject. While the effects of feedback provision
are driven solely by improvements in Mathematics (no improvements in English), the rewards lead
to similar improvements in both subjects (see also Table 5). One explanation is that Math is more
elastic (Bettinger, 2012). It may be easier to detect the areas of Mathematics in which the student is
failing, while in English it may be hard to prepare for the test. It may also be a case of overall
motivation. Students may have low motivation to study science, because science subjects are
usually perceived as more difficult 16 and students may not see their usefulness in real life; but once
they are incentivized (students see real rewards instead of abstract future benefits), they improve.
Figure 2: The evolution of the aggregated average treatment effects of within-/across-class
feedback on students’ overall performance (Math and English pooled)
16
Judging also by a consistently lower number of applicants for Science subjects as opposed to Arts subjects in the
National examinations held by the Ugandan National Board Examination Committee.
Current data show that students in the control group, whose performance mimics how students would
have evolved in the absence of the treatments, stagnated in Mathematics during the whole academic
year (their absolute performance decreased by 0.33 per cent), but their absolute score in English
increased by 50.25 per cent. Based on such progress, it may be easier to improve in Mathematics
compared to English. Alternatively, the pattern may be the result of an order effect (the Math
examination always preceded the English examination, so students lost motivation to perform better). A
significant improvement in Mathematics but not in English can also be found in other studies, e.g.
Bettinger (2012) or Reardon, Cheadle and Robinson (2009). The evolution of the treatment effect
can be found in Figure 2.
6.2. Average treatment effects on students’ well-being
Both types of feedback leave students’ stress level unchanged, but the within-class feedback
slightly decreases their happiness (by 0.111 standard deviations). The decomposition of the
aggregated treatment effects shows, however, that pure feedback does not influence students’ well-being
if provided without any rewards (the summary of aggregated treatment effects can be found
in Table 4, disentangled pure and interacted effects in Figure 3). The stress is induced once the
feedback-group students compete for rewards. Rewards, on the contrary, significantly increase the
stress level of students and decrease students’ happiness 17. Students who were not informed about
their performance over the year but were included in a competition for money reported a stress
level 0.466 standard deviations higher than the control-group students. In other words,
these students reported a higher stress level than 67.9 per cent of the control-group students who
received no feedback and no rewards. Provision of feedback significantly lowers the stress level
induced by financial rewards. Students who received within-class feedback and competed for
17
In aggregated terms, financial rewards increased stress by 0.226 standard deviations and reputational
rewards by 0.177 standard deviations; students’ happiness decreased by 0.1 standard deviations.
Figure 3: Disaggregated average treatment effects of incentives on students’ performance and their
subjective well-being
financial rewards increased their stress level by “only” 0.222 standard deviations and students
in the across-class feedback group did report a higher stress level compared to the control-group
students. Financial rewards decrease students’ happiness level, but having repeated feedback does
not significantly change the results. The effect of reputational rewards seems to work in the opposite
way compared to financial rewards: pure reputational rewards do not influence the stress level, but if
students received repeated feedback, the introduction of the rewards increases their stress level by
0.196 (within-class feedback) to 0.237 standard deviations (across-class feedback). In terms of
percentages, within-class (across-class) feedback students reported a higher stress level in response
to reputational rewards than 57.7 per cent (59.4 per cent) of control-group students, but the
differences are only significant at the 15% level. Further details including alternative specifications can be
found in Appendix C.
A policy maker who wants to minimize the effects of interventions on students’ well-being
should therefore consider a class-level competition for financial rewards with regular
feedback regarding the student’s own performance and the performance of her class: students’
performance increases with no significant effect on stress or happiness.
6.3. Group composition
If students are allowed to choose whether they want to compete in teams or as individuals,
studies have shown that average-ability subjects have a higher tendency to choose teamwork
compared to high-/poor-ability subjects (Amann and Gall, 2006, Breton et al., 1998, 2003). In this
study, I am interested in the behavior of students exogenously grouped with students from the whole
ability spectrum. Students were placed into groups of three and received feedback about
their own as well as group performance during the whole academic year. In some cases the number
of students in the class was not divisible by three. In that case there were one or two groups of four
students (in total 18 groups out of 717 groups). In the following analysis I take only three-person
groups into account. Three types of ability groups (poor, mixed and good performers groups) and
four types of gender groups (pure boys, two boys and one girl, one boy and two girls, or pure girls)
were formed. The analysis helps us to understand how well-informed groups that differ in terms of
ability/gender composition perform in response to financial and reputational rewards. The
applicability of the results goes beyond the educational framework. Teams, rather than individuals, are
increasingly used in decision-making processes in organizations (Hamilton et al., 2003,
Woolley et al., 2010). Companies spend large amounts of money on incentivizing their
employees and, with the aim of maximizing efficiency or team performance, they often carefully select
high-ability performers to work on a project, represent the firm, etc. In such an environment the
results of this research offer a comparison of the responses of different ability groups with or without
further incentivization.
6.3.1. Ability composition
First, I compare the performance of mixed-ability and high-ability groups to poor-ability
groups and observe differences in responses to the provision of financial and reputational rewards.
Students’ ability is measured in terms of students’ initial performance in Math and in English. All
poor-ability group students scored below the median, all high-ability group students scored above the
median and the ability of mixed-ability group students varies across the whole performance
distribution.
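The grouping rule described above can be sketched as follows. The example uses a single hypothetical baseline score per student (the paper measures ability in both Math and English), and placing a group with a student exactly at the median into the mixed category is an assumption; the student identifiers and scores are invented for illustration.

```python
from statistics import median

def classify_group(scores, cutoff):
    """Label a group by where its members' baseline scores fall relative to the cutoff."""
    if all(s < cutoff for s in scores):
        return "poor"
    if all(s > cutoff for s in scores):
        return "high"
    return "mixed"

# Hypothetical baseline scores, not data from the study.
baseline = {"a": 5, "b": 7, "c": 9, "d": 20, "e": 25, "f": 30, "g": 6, "h": 22, "i": 14}
cutoff = median(baseline.values())

# Groups of three students, as formed exogenously in the experiment.
groups = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]
labels = [classify_group([baseline[s] for s in g], cutoff) for g in groups]
print(cutoff, labels)
```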
Students in the mixed-ability group who did not compete for rewards do not outperform
poor-ability groups of students in Mathematics, but they do outperform them in English by 0.133
standard deviations. This means that in English the average student from a mixed-ability group scored
higher than 55.2% of students from poor-ability groups. However, once incentivized, mixed-ability
group students improved significantly more both in Math (0.21-0.22 standard deviations) and in
English (0.38 standard deviations and 0.224 standard deviations in response to the financial and
reputational rewards respectively). The type of the reward matters for the incentivization of
mixed-ability group students in English. Students competing for money improved by 0.164 standard
deviations more compared to students competing for reputational rewards (the difference is
significant at the 5% level). Students in the high-ability group who did not compete for
rewards significantly outperformed poor-ability group students by 0.451 standard deviations in
Mathematics and 0.387 standard deviations in English. This means that students randomized into
the high-ability group outperformed 65%-67.5% of poor-ability students in English and Math
respectively. There is no statistically significant value added from rewarding high-ability group
students with either pecuniary or non-pecuniary rewards (see Figure 4 and Appendix E1).
Figure 4: Comparison of performance of mixed- and high-ability groups to poor-ability
groups and their responses to the provision of rewards
Note: groups were randomly divided into poor-ability, mixed-ability, and high-ability groups based on the baseline
performance. Groups of four students are excluded from the analysis (18 out of 717 groups). All groups received
within-class feedback during the whole academic year. The bars show the average treatment effects of different
incentive schemes of mixed- and high-ability groups in comparison to poor-ability groups. Stars indicate
significance of the difference in means. § significant at 15%, * at 10%; ** at 5%; *** at 1%
With two exceptions, students do not differ across ability groups in perceived stress
and happiness. The average student from the mixed-ability group incurred higher stress than
62.2% of poor-ability group students when offered financial rewards, and the average high-ability
group student incurred higher stress than 62.6% of poor-ability students when offered
reputational rewards. The ability composition of the groups does not seem to determine the level of
effort exerted to perform the task.
6.3.2. Gender composition
The performance of teams may also be influenced by the gender composition of the group.
Apesteguia, Azmat and Iriberri (2012) studied the effects of gender composition on the economic
performance of undergraduate and MBA students involved in a business game competition. All-male
and mixed-gender groups outperformed all-female groups. The group composition of two
men and one woman seems to be optimal, as these groups perform the best. In this study the groups also
consist of three students and, due to random assignment, four different gender compositions were
formed (pure girls, majority of girls, majority of boys and pure boys). The findings are similar.
Mixed groups outperform pure-gender groups, but I do not find strong support for the dominance of
groups of two boys and one girl.
In Figure 5 and in the table in Appendix E2 I compare the performance of mixed groups
and groups of boys to that of groups of girls. The results suggest that in the absence of rewards,
mixed-gender groups outperformed uniform-gender groups by 0.16-0.18 standard deviations. Pure
boy groups performed comparably with pure girl groups. Once the rewards are offered, pure female
groups are outperformed by all other types of groups by 0.19 up to 0.50 standard deviations,
depending on the group gender composition and the type of reward. Group composition does not
seem to play a significant role in terms of students’ perceived stress in reaction to different
treatments. Pure female groups seem to be on average happier compared to other groups.
Figure 5: Comparison of performance of mixed-gender and pure-boy groups to groups consisting of
pure girls and their responses to the provision of rewards
Note: within-class feedback groups were randomly divided into three-boy-groups, groups of two boys and one girl
(2B1G), groups of one boy and two girls (1B2G) or pure three-girl-groups. Groups of four students are excluded
from the analysis (18 out of 717 groups). The bars show the average treatment effects of different incentive
schemes of 3-boy-groups and mixed-groups in comparison to 3-girl-groups. Stars indicate significance of the
difference in means. § significant at 15%, * at 10%; ** at 5%; *** at 1%
6.4. Distributional Analysis
It is also important to learn whether the treatment effect differs at different points of the
performance distribution. To learn to what extent high performers differ from low performers in
their reactions to the treatments, I estimate quantile regressions. Figure 6 shows the average
treatment effect of the incentives on the performance of students by their rank in the performance
distribution. The figure consists of four graphs which differ in terms of the combinations of the
treatments. Graphs in the first column compare the average treatment effects of pure feedback
with feedback-reward interacted treatments. Graphs in the second column compare the average
treatment effects of pure rewards with feedback-reward interacted treatments. The main
observation is that the bottom performers respond more strongly than the top performers to the
provision of pure financial rewards, pure across-class feedback or their interaction. In all other
cases the bottom performers respond comparably to the top performers.
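The idea behind a distributional comparison can be illustrated with a much simpler stand-in for the quantile regressions estimated in the paper: comparing the deciles of a treated and a control sample directly, with no covariates or stratification controls. The data below are hypothetical, constructed so that low scores gain more than high scores, and the function name is illustrative.

```python
from statistics import quantiles

def decile_gaps(treated, control):
    """Difference between treated and control scores at each of the nine deciles."""
    t_cuts = quantiles(treated, n=10, method="inclusive")
    c_cuts = quantiles(control, n=10, method="inclusive")
    return [round(t - c, 3) for t, c in zip(t_cuts, c_cuts)]

# Hypothetical scores, not data from the study.
control = [float(x) for x in range(21)]
# Illustrative treatment that lifts low scores more than high scores.
treated = [0.9 * x + 2.0 for x in control]

gaps = decile_gaps(treated, control)
print(gaps)  # gaps shrink from the bottom decile to the top decile
```

A pattern of gaps that shrinks toward the top deciles corresponds to the case in which bottom performers respond more strongly than top performers.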
Figure 6: Distribution of the average treatment effects of different incentives on overall performance
(Math and English pooled), by deciles
6.5. Positiveness and negativeness of the feedback
The nature of the feedback students receive may influence students’ well-being. Azmat and
Iriberri (2016) show that positive feedback increases students’ happiness. I find similar results, but
mainly for students who received feedback but no rewards. Besides feedback regarding their own
performance and the absolute and relative performance of the group, students also received
additional information stressing whether they improved or worsened between two subsequent rounds
(starting from testing round 3). Students are considered to have received mostly positive (negative)
feedback if they improved (worsened) in at least two out of three cases.
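The classification rule can be written down compactly. In the sketch below, the input is a list of three score changes between subsequent testing rounds; the treatment of unchanged scores (counted as neither an improvement nor a worsening) is an assumption, and the function name and example values are hypothetical.

```python
def classify_feedback(changes):
    """Label a student's feedback history from three round-to-round score changes."""
    if len(changes) != 3:
        raise ValueError("expected three score changes")
    improved = sum(1 for c in changes if c > 0)
    worsened = sum(1 for c in changes if c < 0)
    if improved >= 2:
        return "mostly positive"
    if worsened >= 2:
        return "mostly negative"
    return "neither"

# Hypothetical score changes, not data from the study.
print(classify_feedback([3, -1, 2]))   # two improvements
print(classify_feedback([-2, -1, 4]))  # two deteriorations
print(classify_feedback([1, 0, -1]))   # one of each, one unchanged
```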
Students in the within-class (across-class) feedback group who received mostly positive
feedback reported being happier than 55.2 per cent (58.9 per cent) of the students in the within-class
(across-class) feedback group who received mostly negative feedback. While the introduction
of the reputational rewards eliminated the happiness surplus in both feedback groups, students in the
across-class feedback group who were rewarded financially sustained their happiness surplus (see also Table 6).
Table 6: The impact of the content of feedback on students’ stress and happiness
Dependent variable: Stress
and happiness
Mostly positive feedback:
Stress
(1)
Happiness
(2)
Within-class feedback
aggregated
Within-class feedback, no
rewards
Within-class feedback with
monetary rewards
Within-class feedback with
reputational rewards
Number of observations
-0.012
(0.154)
Across-class feedback
aggregated
Across-class feedback, no
rewards
Across-class feedback with
monetary rewards
Across-class feedback with
reputational rewards
Number of observations
-0.007
(0.056)
(4)
0.103*
(0.056)
1453
-0.117
(0.080)
0.060
(0.115)
0.149
(0.123)
1453
1451
-0.082
(0.160)
0.037
(0.209)
0.282
(0.199)
1451
(3)
1454
0.132**
(0.063)
0.014
(0.092)
0.120
(0.102)
1454
1416
0.259***
(0.094)
0.221*
(0.110)
0.013
(0.146)
1416
0.218**
(0.088)
7. Gender differences and disentangling the channels of the average treatment effects
Girls performed differently from boys in various studies. Angrist and Lavy (2009) studied the
effects of cash incentives on matriculation rates among Israeli students. Girls, in contrast to boys,
substantially increased their performance. A higher effect among girls was also found in the
analysis of voucher provision within the PACES program in Colombia (Angrist et al., 2002).
Stronger responsiveness to incentives among girls can also be found in studies of tuition provision
by Dynarski (2008), early childhood interventions by Anderson (2008), housing vouchers by Kling
et al. (2007) or public sector programs by Lalonde (1995) and others 18.
The results of this experiment show that girls react positively to feedback provision (0.12 to
0.14 standard deviations) even if they are not offered rewards. Once included in a competitive
environment, girls improve by 0.2 to 0.28 standard deviations (see Tables 7 and 8, or Appendix D).
Therefore, girls can perform as well as boys if they receive feedback about their
performance, the performance of their group and the group’s relative standing. In the absence of
feedback, girls do not improve at all. Boys improved by 0.18 to 0.28 standard deviations if they
were offered rewards (with or without feedback) but do not react to pure feedback provision. I
attribute the gender difference in reaction to different treatments to the existence of two types of
competition: intrinsic, or internally driven, competition developed by personal feelings based on
comparison to others, and extrinsic competition coming from offered rewards. These results are of
special help to policy makers whose aim is to influence the performance of both girls and boys.
Figures 7 and 8 compare the treatment effects across the whole performance distribution by
gender. While girls from the bottom of the performance distribution seem to be the most responsive to
incentives, the most responsive boys are from the middle of the performance distribution. The
results from quantile regressions can be found in Appendices F5 and F6. There are no gender
18
For a review of gender differences in risk preferences, other-regarding preferences and competitive preferences,
see Croson and Gneezy (2009).
differences in the effects of different incentive schemes on students’ psychological well-being; a similar
result can be found in Azmat and Iriberri (2016).
Figure 7: Distribution of the average treatment effects of different incentives on overall performance
of girls (Math and English pooled), by deciles
Figure 8: Distribution of the average treatment effects of different incentives on overall performance
of boys (Math and English pooled), by deciles
Table 7: OLS estimates of the average treatment effects of different motivation schemes on
students’ performance and their subjective well-being – by gender
(Pure within-class
Pure within-class
Within-class feedback
Within-class feedback
feedback and interactions)
feedback
rewarded financially
rewarded reputationally
Mathematics (st.dev)
English (st.dev.)
Stress
Happiness
Confidence (Math)
Confidence (English)
Aspirations
Education over work
Education over rest
Work over rest
(Pure across-class
feedback and interactions)
Mathematics
English
Stress
Happiness
Confidence (Math)
Confidence (English)
Aspirations
Education over work
Education over rest
Work over rest
Girls
Boys
Girls
Boys
Girls
Boys
0.121§
(0.081)
-0.141**
(0.059)
0.072
(0.124)
0.023
(0.087)
-7.385***
(0.929)
-5.023***
(0.994)
0.076
(0.107)
-0.116§
(0.072)
0.043
(0.119)
0.213*
(0.123)
-4.13***
(0.954)
-2.79***
(0.909)
0.229*
(0.118)
0.016
(0.092)
0.258**
(0.116)
0.304***
(0.101)
-6.104***
(1.214)
-5.528***
(1.375)
0.228*
(0.137)
0.199*
(0.116)
0.143
(0.189)
0.282**
(0.112)
-4.07***
(1.249)
-4.604***
(1.363)
0.201**
(0.102)
0.069
(0.088)
0.178
(0.159)
0.073
(0.115)
-5.324***
(1.144)
-5.722***
(1.115)
0.204§
(0.129)
0.092
(0.094)
0.313*
(0.179)
0.297***
(0.111)
-6.604***
(1.069)
-5.129***
(1.193)
-0.035
(0.079)
0.017
(0.047)
0.038
(0.069)
0.098
(0.082)
0.219***
(0.068)
-0.009
(0.113)
0.163**
(0.081)
0.109**
(0.044)
-0.043
(0.091)
0.146*
(0.086)
0.098
(0.074)
-0.267**
(0.110)
0.052
(0.094)
0.061
(0.061)
-0.027
(0.093)
0.042
(0.101)
0.046
(0.099)
-0.057
(0.117)
Girls
Boys
Girls
Boys
Girls
Boys
0.135*
(0.077)
-0.076
(0.066)
-0.099
(0.119)
0.020
(0.089)
-8.148***
(0.841)
-6.013***
(0.980)
0.009
(0.088)
-0.019
(0.072)
0.016
(0.119)
0.124
(0.098)
-4.74***
(1.083)
-4.49***
(1.058)
0.275*
(0.159)
0.108
(0.101)
-0.016
(0.146)
-0.022
(0.109)
-6.948***
(1.170)
-6.528***
(1.154)
0.284§
(0.173)
0.249**
(0.112)
-0.022
(0.155)
0.193§
(0.130)
-4.597***
(1.538)
-4.047***
(1.363)
0.189**
(0.091)
0.041
(0.083)
0.229*
(0.121)
0.153
(0.143)
-6.957***
(1.406)
-6.411***
(1.580)
0.175*
(0.103)
0.042
(0.103)
0.286§
(0.174)
0.241*
(0.122)
-6.125***
(1.675)
-5.327***
(1.579)
0.101
(0.072)
0.023
(0.044)
0.038
(0.069)
0.174*
(0.089)
0.140**
(0.067)
-0.103
(0.100)
0.099
(0.093)
-0.049
(0.069)
-0.043
(0.091)
0.219**
(0.105)
-0.091
(0.096)
-0.011
(0.120)
0.101
(0.091)
-0.006
(0.066)
-0.027
(0.093)
-0.026
(0.136)
0.109
(0.087)
-0.069
(0.112)
Pure across-class
feedback
Across-class feedback
rewarded financially
Across-class feedback
rewarded reputationally
Table 8: OLS estimates of the average treatment effects of different motivation schemes on
students’ performance and their subjective well-being – by gender
(Pure financial rewards
Pure financial
Within-class feedback
Across-class feedback
and interactions)
rewards
rewarded financially
rewarded financially
Mathematics (st.dev)
English (st.dev.)
Stress
Happiness
Confidence (Math)
Confidence (English)
Aspirations
Education over work
Education over rest
Work over rest
(Pure reputational
rewards and interactions)
Mathematics
English
Stress
Happiness
Confidence (Math)
Confidence (English)
Aspirations
Education over work
Education over rest
Work over rest
Girls
Boys
Girls
Boys
Girls
Boys
0.018
(0.102)
-0.038
(0.097)
0.431**
(0.198)
0.015
(0.014)
1.869*
(1.074)
2.239**
(1.108)
0.207*
(0.123)
0.139
(0.112)
0.482***
(0.162)
0.322**
(0.132)
-1.322
(1.429)
-0.387
(1.099)
0.229*
(0.118)
0.016
(0.092)
0.258**
(0.116)
0.304***
(0.101)
-6.104***
(1.214)
-5.528***
(1.375)
0.228*
(0.137)
0.199*
(0.116)
0.143
(0.189)
0.282**
(0.112)
-4.07***
(1.249)
-4.604***
(1.363)
0.275*
(0.159)
0.108
(0.101)
-0.016
(0.146)
-0.022
(0.109)
-6.948***
(1.170)
-6.528***
(1.154)
0.284§
(0.173)
0.249**
(0.112)
-0.022
(0.155)
0.193§
(0.130)
-4.597***
(1.538)
-4.047***
(1.363)
0.163**
(0.081)
0.109**
(0.044)
-0.043
(0.091)
0.146*
(0.086)
0.098
(0.074)
-0.267**
(0.110)
0.099
(0.093)
-0.049
(0.069)
-0.043
(0.091)
0.219**
(0.105)
-0.091
(0.096)
-0.011
(0.120)
0.046
(0.098)
0.009
(0.078)
-0.017
(0.092)
0.006
(0.111)
0.016
(0.083)
0.137
(0.112)
Pure reputation
rewards
Girls
Boys
0.059
(0.147)
-0.039
(0.087)
-0.008
(0.203)
-0.005
(0.116)
1.905*
(0.972)
0.989
(1.096)
0.218
(0.154)
0.079
(0.106)
0.158
(0.195)
0.144
(0.103)
-0.399
(1.224)
-1.301
(1.008)
0.021
(0.096)
-0.017
(0.061)
0.164*
(0.088)
Within-class feedback
rewarded reputationally
Girls
Boys
Across-class feedback
rewarded reputationally
Girls
Boys
0.201**
(0.102)
0.069
(0.088)
0.178
(0.158)
0.073
(0.115)
-5.324***
(1.144)
-5.722***
(1.115)
0.204§
(0.129)
0.092
(0.094)
0.313*
(0.179)
0.297***
(0.111)
-6.604***
(1.069)
-5.129***
(1.193)
0.189**
(0.091)
0.041
(0.083)
0.229*
(0.121)
0.153
(0.143)
-6.957***
(1.406)
-6.411***
(1.580)
0.175*
(0.103)
0.042
(0.103)
0.286§
(0.174)
0.241*
(0.122)
-6.125***
(1.675)
-5.327***
(1.579)
0.052
(0.094)
0.061
(0.061)
-0.027
(0.093)
0.042
(0.101)
0.046
(0.099)
-0.057
(0.117)
0.101
(0.091)
-0.006
(0.066)
-0.027
(0.093)
-0.026
(0.136)
0.109
(0.087)
-0.069
(0.112)
0.165§
(0.101)
0.109
(0.078)
-0.004
(0.131)
8. Robustness Checks
8.1. Multiple comparisons
The probability that coefficients appear significant purely by chance increases with the number
of hypotheses tested. Multiple-test procedures take the p-values from the multiple comparisons and
an uncorrected critical p-value, interpreted either as an FWER or as an FDR, and return adjusted
critical p-values. “If the input uncorrected critical p-value α ∈ (0,1) is an FWER, then we can be 100(1 − α)%
confident that all the null hypotheses in the discovery set are false. If the input uncorrected critical
p-value α = β ∗ γ is an FDR, then we can be 100(1 − β)% confident that over 100(1 − γ)% of the
null hypotheses in the discovery set are false” (Newson, 2010, p.569).
To address these concerns about multiple inference, I control for the familywise error
rate (FWER) using one-step methods (the Bonferroni correction, in Dunn, 1961, and the Sidak, 1967,
correction) and step-down methods (the Holm, 1979, and Holland-Copenhaver, 1987, corrections),
and for the false discovery rate (FDR) using step-up methods (the Simes, 1986, Hochberg, 1988,
and Benjamini-Yekutieli, 2001, procedures). A detailed description of the procedures can be found
in Newson (2010). The corrected p-values are summarized in Table 9.
Table 9: Adjusted p-values for aggregated treatment effects and disaggregated treatment effects
(type of correction corresponding to uncorrected alpha = 0.1)

Error rate               Method       Type of correction      Correlation     Aggregated          Disaggregated
                                                              assumed         treatment effect    treatment effects
Familywise error rate    One-step     Bonferroni correction   Arbitrary       0.0077              0.0053
Familywise error rate    One-step     Sidak correction        Nonnegative     0.0081              0.0055
Familywise error rate    Step-down    Holm correction         Arbitrary       0.0200              0.0083
Familywise error rate    Step-down    Holland correction      Nonnegative     0.0210              0.0087
False discovery rate     Step-up      Hochberg correction     Independence    0.0170              0.0077
False discovery rate     Step-up      Simes correction        Nonnegative     0.0620              0.0680
False discovery rate     Step-up      Yekutieli correction    Arbitrary       0.0190              0.0150
A disadvantage of FWER procedures is that they can result in low power for testing single hypotheses
in large experiments with a high number of multiple comparisons. In such cases FDR procedures are
preferred, since they control the expected proportion of false rejections (Type I errors) among all
rejected hypotheses and therefore have higher power. In this experiment, under the one-step and
step-down methods none of the initially presented average treatment effects of the interventions on
students’ performance remains significant. FWER procedures seem to be too conservative given the
number of comparisons I test. Among the FDR procedures, I rule out the Hochberg correction because
my data do not meet its independence restriction (I compare all treatment groups to the same control
group). The significance of the average treatment effects of the different incentive schemes on
students’ performance is confirmed under the Simes correction and, with some exceptions, under the
Yekutieli correction (the effects of pure financial rewards and of financial rewards interacted with
within-class feedback turn insignificant). A similar conclusion can be drawn regarding the average
treatment effects on subjective well-being, with one exception: the negative impact of pure financial
rewards is significant under all types of FWER and FDR corrections. A summary of corrected p-values
for all disaggregated treatment effects can be found in Appendices F7, F8 and F9.
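
To make the difference between these procedures concrete, the following minimal Python sketch implements the textbook versions of four of the corrections named above: the Bonferroni and Sidak one-step thresholds, the Holm step-down rule, and the Simes step-up rule (often called the Benjamini-Hochberg procedure). It is an illustration under stated assumptions, not the code used for Table 9. The number of hypotheses m is not stated in this section; m = 13 and m = 19 are assumed values, chosen because they reproduce the Bonferroni and Sidak entries of Table 9 at alpha = 0.1. The list of p-values is hypothetical.

```python
# Textbook multiple-test corrections; an illustrative sketch, not the author's code.

def bonferroni_threshold(alpha, m):
    """One-step FWER critical p-value: alpha / m."""
    return alpha / m


def sidak_threshold(alpha, m):
    """One-step FWER critical p-value: 1 - (1 - alpha)^(1/m)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def holm_rejections(pvalues, alpha):
    """Step-down rule: test p-values from smallest to largest, stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvalues[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # every larger p-value is retained as well
    return rejected


def simes_rejections(pvalues, alpha):
    """Step-up rule: reject the k smallest p-values, k = largest rank with p_(k) <= k * alpha / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


alpha = 0.1
for m in (13, 19):  # assumed numbers of tests, not stated in the text
    print(m, round(bonferroni_threshold(alpha, m), 4), round(sidak_threshold(alpha, m), 4))

hypothetical_p = [0.001, 0.030, 0.035, 0.045, 0.200]  # made-up p-values
print("Holm :", holm_rejections(hypothetical_p, alpha))
print("Simes:", simes_rejections(hypothetical_p, alpha))
```

On the hypothetical p-values the Holm rule stops after the first rejection, while the Simes rule rejects four hypotheses, which illustrates why FWER control is the more conservative choice when many comparisons are tested.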
8.2. Attrition
High drop-out and absence rates are common among students in developing countries, and my data
are no exception. There are several reasons. Some students did not have the money to pay the school
fees and decided to change schools to avoid repaying their debt; others changed schools for family
reasons (the family moved to a different area, they were sent to live with other family members,
etc.); some dropped out of school completely; some just registered as new students; and some
students passed away. Due to the constraints of the experiment, all participation data are based on
our visits only (that is, no random visits were organized).
The main concern in most project evaluations is whether the attrition of subjects is random or
whether the intervention itself causes a systematic difference in attrition between the treatment
group and the control group. Only uninformed students, who did not receive feedback during the
academic year and who were chosen to participate in a tournament with reputation rewards, did not
significantly change their attrition. All other treatment groups lowered their absences relative to
the control group, by 6.5 to 17 per cent. Lower attrition means higher attendance.
Who are the attrited students? Random versus non-random attrition
The treatments influenced both the probability of always being present during our visits and the
probability of attriting. In absolute numbers, fewer students dropped out from treated classes than
from control classes, and more students from the treatment groups than from the control group
attended all five testing rounds. Besides the differences in the number of attrited students,
students who dropped out of the within-class feedback group are worse in terms of their initial
performance compared to students from the across-class feedback group or the control group. That
might re-introduce a bias if the treated students who are present during the final testing round are
systematically different from the control-group students. As shown in Table 10, this is not the case
in this project. The distribution of initial performance among students who stayed in either of the
treatment groups is not statistically different from the distribution of initial abilities among
students who stayed in the control group. In such a case OLS should provide unbiased estimates of
the treatment effects. Nevertheless, I used inverse probability weights and imputation methods to
check the stability of the results (for further details see the next section).
Table 10: Kolmogorov-Smirnov test on equality of distributions of students who attrited and students who
stayed (p-values presented)

               Baseline differences     Students who attrited    Students who stayed      Always-present students
               (T1 – C)    (T2 – C)     (T1 – C)    (T2 – C)     (T1 – C)    (T2 – C)     (T1 – C)    (T2 – C)
Mathematics    0.123       0.274        0.000       0.158        0.752       0.192        0.677       0.958
English        0.952       0.168        0.003       0.546        0.230       0.282        0.211       0.840

Note: T1 stands for within-class social comparison group, T2 for across-class comparison group and C represents
control group with no feedback provided. P-values are presented.
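
The comparisons in Table 10 rest on a two-sample test of equality of distributions. The sketch below is a minimal, self-contained version of the two-sample Kolmogorov-Smirnov test: it computes the largest gap between the two empirical distribution functions and an asymptotic p-value from the Kolmogorov series. It is not the routine used to produce Table 10, the p-value is only an approximation (less reliable in small samples), and the baseline scores are hypothetical.

```python
import math


def ks_two_sample(a, b):
    """Return (D, p): the two-sample KS statistic and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk through the pooled sample and track the gap between the two ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return d, 1.0  # the Kolmogorov tail probability is numerically 1 here
    # Asymptotic tail probability: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical baseline scores of students who stayed until the final round.
stayers_treatment = [-0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4]
stayers_control = [-0.9, -0.4, -0.1, 0.3, 0.4, 1.0, 1.2]
d_stat, p_value = ks_two_sample(stayers_treatment, stayers_control)
print(round(d_stat, 3), round(p_value, 3))
```

A large p-value, as in the "students who stayed" columns of Table 10, means that equality of the two baseline distributions cannot be rejected.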
The effect of treatments on attrition
Estimates of treatment effects can be biased if attrition differs systematically between the
control and treatment groups and the difference is caused by the presence of the treatment.
Students in treatment groups attrite less often in absolute terms and are more often present in all
five testing rounds than their control-group counterparts. To see whether and to what extent the
social comparison and reward treatments influence the probability of dropping out, I run probit
models of attrition and of full attendance on all treatment dummies, controlling for strata
variables (Table 11).
The attrition rate comprises students who attended the baseline testing at the beginning of the
project but missed our last testing round. Non-rewarded students exposed to either within- or
across-class social comparison feedback have a 6.5 to 6.9 per cent lower probability of missing the
final testing round. Among rewarded students who did not receive any feedback, only students
rewarded financially lowered their attrition, by 7.9 per cent. Reputation rewards without feedback
do not affect the attrition rate. All treatment interactions lower the attrition rate (by 9.3 to
17.2 per cent).
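
The link between a probit coefficient and a change in the probability of attrition can be sketched for the simplest case. With a single treatment dummy and no controls the probit model is saturated, so its maximum likelihood estimates have a closed form: the fitted probabilities equal the sample attrition rates in each group, and the marginal effect of treatment is their difference. The specification described above includes all treatment dummies and strata controls and therefore requires numerical estimation; the example below is only an illustration, and its attrition counts are hypothetical.

```python
from statistics import NormalDist


def probit_single_dummy(attrited_control, attrited_treated):
    """Closed-form probit MLE for one binary regressor.

    Inputs are lists of 0/1 attrition indicators. Both attrition rates must lie
    strictly between 0 and 1, otherwise the inverse normal CDF is undefined.
    """
    std_normal = NormalDist()
    p_control = sum(attrited_control) / len(attrited_control)
    p_treated = sum(attrited_treated) / len(attrited_treated)
    beta0 = std_normal.inv_cdf(p_control)          # intercept: Phi(beta0) = p_control
    beta1 = std_normal.inv_cdf(p_treated) - beta0  # treatment coefficient
    marginal_effect = p_treated - p_control        # change in attrition probability
    return beta0, beta1, marginal_effect


# Hypothetical data: 30 of 100 control students and 20 of 100 treated students attrite.
control = [1] * 30 + [0] * 70
treated = [1] * 20 + [0] * 80
beta0, beta1, effect = probit_single_dummy(control, treated)
print(round(beta0, 3), round(beta1, 3), round(effect, 3))
```

The marginal effect, not the raw coefficient, is the quantity that can be read as a change in the probability of missing the final testing round.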
As previously discussed, despite the different attrition across treatment and control groups,
students who remained at school in the last testing round are on average the same in terms of
initial characteristics, and therefore the OLS estimates should not be biased. In the following
section I run alternative specifications to compare OLS estimates with estimates that correct for
possible attrition bias.
Table 11: Probabilities of students’ dropouts, by gender

Overall treatment effects on:                             Attrition                               Full attendance
                                                          Overall      Girls        Boys          Overall      Girls        Boys
Within-class feedback, no rewards (T1_solo)               -0.066*      -0.064*      -0.058        0.105**      0.091*       0.108**
                                                          (0.039)      (0.037)      (0.049)       (0.049)      (0.049)      (0.055)
Across-class feedback, no rewards (T2_solo)               -0.071**     -0.046       -0.097**      0.124***     0.110**      0.137***
                                                          (0.032)      (0.035)      (0.038)       (0.042)      (0.049)      (0.046)
Financial Rewards, no feedback (Fin_solo)                 -0.130***    -0.128***    -0.124**      0.127**      0.168**      0.093*
                                                          (0.038)      (0.033)      (0.055)       (0.056)      (0.067)      (0.057)
Reputational Rewards, no feedback (Rep_solo)              -0.056       -0.021       -0.100**      0.030        0.047        0.021
                                                          (0.046)      (0.052)      (0.050)       (0.077)      (0.087)      (0.074)
Within-class feedback with financial rewards (T1_fin)     -0.158***    -0.127***    -0.196***     0.233***     0.228***     0.247***
                                                          (0.033)      (0.033)      (0.037)       (0.062)      (0.071)      (0.064)
Across-class feedback with financial rewards (T2_fin)     -0.147***    -0.128***    -0.157***     0.263***     0.257***     0.252***
                                                          (0.032)      (0.031)      (0.041)       (0.060)      (0.068)      (0.063)
Within-class feedback with reputation rewards (T1_rep)    -0.157***    -0.146***    -0.171***     0.208***     0.209***     0.217***
                                                          (0.038)      (0.036)      (0.043)       (0.067)      (0.073)      (0.064)
Across-class feedback with reputation rewards (T2_rep)    -0.212***    -0.192***    -0.226***     0.099*       0.079        0.126**
                                                          (0.026)      (0.026)      (0.031)       (0.051)      (0.057)      (0.056)
Controlled for strata                                     Yes          Yes          Yes           Yes          Yes          Yes
N                                                         7050         3818         3139          7050         4672         3884

Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at national
examination; and grade level: P6, P7, S1 up to S4). N stands for the number of observations. § significant at
15%; * significant at 10%; ** significant at 5%; *** significant at 1%
8.3. Stability of the results
In order to adjust the results for non-random attrition, I proceeded with imputation methods
and inverse probability-weighted regressions (Imbens, 2004; Wooldridge, 2007; Kwak, 2010;
Hirano et al., 2000, etc.). Inverse probability weighting (IPW) can adjust for confounding factors
and selection bias. As the name suggests, IPW assigns every student observed in the last testing
round a weight equal to the inverse of the student’s estimated probability of being observed (that
is, of not being absent or attriting) and uses these weights in the estimation of the treatment
effects. An imputation method fills in the missing observations of students who were absent or
dropped out in the last testing round, based on a predefined rule.
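
The weighting logic can be sketched in a few lines. In the illustration below, the probability of being observed in the final round is estimated by simple cell frequencies within each (stratum, treatment) cell; this is an assumption made for brevity, and the weighted regressions reported here may rely on a richer model of that probability. All names and data in the block are hypothetical.

```python
from collections import defaultdict


def observed_probabilities(students):
    """Estimate P(observed | stratum, treated) by cell frequencies."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [number observed, number enrolled]
    for s in students:
        cell = (s["stratum"], s["treated"])
        counts[cell][1] += 1
        if s["observed"]:
            counts[cell][0] += 1
    return {cell: n_obs / n_all for cell, (n_obs, n_all) in counts.items()}


def ipw_difference(students):
    """Difference in inverse-probability-weighted mean scores, treated minus control."""
    prob = observed_probabilities(students)
    totals = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # arm -> [weighted score sum, weight sum]
    for s in students:
        if not s["observed"]:
            continue  # attrited students contribute only through the weights of others
        weight = 1.0 / prob[(s["stratum"], s["treated"])]
        totals[s["treated"]][0] += weight * s["score"]
        totals[s["treated"]][1] += weight
    return totals[1][0] / totals[1][1] - totals[0][0] / totals[0][1]


def naive_difference(students):
    """Unweighted difference in mean scores among observed students."""
    means = {}
    for arm in (0, 1):
        scores = [s["score"] for s in students if s["observed"] and s["treated"] == arm]
        means[arm] = sum(scores) / len(scores)
    return means[1] - means[0]


def student(stratum, treated, observed, score=None):
    return {"stratum": stratum, "treated": treated, "observed": observed, "score": score}


# Hypothetical data: the true effect is 0.5 in both strata, but half of the
# low-performing control students attrite, which inflates the observed control mean.
example = (
    [student("low", 0, True, 0.0)] * 2 + [student("low", 0, False)] * 2
    + [student("high", 0, True, 1.0)] * 2
    + [student("low", 1, True, 0.5)] * 4
    + [student("high", 1, True, 1.5)] * 2
)
print(round(naive_difference(example), 3), round(ipw_difference(example), 3))
```

In this constructed example the unweighted comparison understates the effect because weak control students are missing, while the weighted comparison recovers the within-stratum effect of 0.5.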
Tables 12 and 13 and Appendices F2, F3 and F4 compare the ordinary least squares estimates of
the treatment effects (column 1) with weighted least squares estimates using inverse probability
weights (column 2), separately for Math and English. After correcting for the probability of
dropping out, the treatment effects are similar or slightly higher in absolute terms but not
significantly different. The results of the imputation methods (columns 3 and 4) lead to similar
conclusions. I use two different measures to impute missing observations: the median ratio and the
class percentile ranks (inspired by Krueger, 1999). Both measures take advantage of the repeated
school visits and follow the same logic: if the observation from the last school visit is missing, I
look at the last score available and adjust for the differences in test difficulty. The same
procedure is used to impute Math and English scores separately. The median ratio measure imputes the
last available observation, and the class percentile rank measure takes the rank of the student in
the last available distribution and imputes the score of the student with the same rank in the
final-visit distribution. The imputation methods fill missing observations artificially, so the
results serve only as bounds. Both imputation measures deliver similar or stronger results compared
to ordinary least squares. The ordinary least squares results are also comparable to the weighted
regression estimates.
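
The two imputation rules can be illustrated as follows. The class percentile rule follows the description above directly. The exact difficulty adjustment behind the median ratio measure is not spelled out in this section, so the version below rests on an assumption: the last available score is rescaled by the ratio of the class median in the final round to the class median in the round of that score (both medians assumed positive). Function names and scores are hypothetical.

```python
from statistics import median


def impute_median_ratio(last_score, last_round_scores, final_round_scores):
    """Rescale the last available score by the ratio of class medians (assumed rule)."""
    return last_score * median(final_round_scores) / median(last_round_scores)


def impute_class_percentile(last_score, last_round_scores, final_round_scores):
    """Impute the final-round score found at the student's last available class rank."""
    ranked_final = sorted(final_round_scores)
    # Percentile rank: share of classmates scoring at or below the student.
    share = sum(1 for s in last_round_scores if s <= last_score) / len(last_round_scores)
    # Map that share to a position in the final-round distribution of classmates.
    position = round(share * len(ranked_final)) - 1
    position = max(0, min(len(ranked_final) - 1, position))
    return ranked_final[position]


# Hypothetical class: five students took the earlier test (including the student
# who later attrited, with a score of 30); four classmates took the final test.
last_round = [10, 20, 30, 40, 50]
final_round = [15, 30, 45, 60]
print(impute_median_ratio(30, last_round, final_round))
print(impute_class_percentile(30, last_round, final_round))
```

Because both rules carry an earlier result forward, the imputed scores are artificial and are best read as bounds rather than as point estimates.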
Table 12: Average treatment effects of different motivation schemes - alternative specifications
Dependent variable: Math score

                                                      OLS          IPW          Imputation        Imputation
                                                                                (median ratio)    (class percentiles)
PURE TREATMENTS
Within-class feedback, no rewards (T1_solo)           0.100        0.046        0.133*            0.123
                                                      (0.085)      (0.092)      (0.079)           (0.085)
Across-class feedback, no rewards (T2_solo)           0.082        0.067        0.129*            0.087
                                                      (0.073)      (0.079)      (0.068)           (0.078)
Financial Rewards, no feedback (Fin_solo)             0.106        0.151§       0.169*            0.143
                                                      (0.101)      (0.102)      (0.096)           (0.106)
Reputational Rewards, no feedback (Rep_solo)          0.138        0.188        0.206*            0.177
                                                      (0.141)      (0.149)      (0.124)           (0.128)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.231*       0.338**      0.281**           0.273**
                                                      (0.118)      (0.135)      (0.129)           (0.124)
Across-class feedback, monetary reward (T2_fin)       0.277**      0.456***     0.331**           0.305**
                                                      (0.139)      (0.132)      (0.128)           (0.139)
Within-class feedback, reputation reward (T1_rep)     0.209**      0.212*       0.266**           0.258**
                                                      (0.103)      (0.108)      (0.073)           (0.112)
Across-class feedback, reputation reward (T2_rep)     0.188**      0.208**      0.186**           0.250***
                                                      (0.080)      (0.087)      (0.073)           (0.090)
Controlled for strata                                 Yes          Yes          Yes               Yes
N of observations                                     5102         5102

Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at national
examination; and grade level: P6, P7, S1 up to S4). N stands for the number of observations.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Table 13: Average treatment effects of different motivation schemes - alternative specifications
Dependent variable: English score

                                                      OLS          IPW          Imputation        Imputation
                                                                                (median ratio)    (class percentiles)
PURE TREATMENTS
Within-class feedback, no rewards (T1_solo)           -0.128**     -0.133*      -0.133**          -0.135***
                                                      (0.056)      (0.070)      (0.060)           (0.045)
Across-class feedback, no rewards (T2_solo)           -0.049       -0.079       -0.052            -0.046
                                                      (0.059)      (0.072)      (0.063)           (0.048)
Financial Rewards, no feedback (Fin_solo)             0.045        0.032        -0.006            0.041
                                                      (0.088)      (0.085)      (0.096)           (0.069)
Reputational Rewards, no feedback (Rep_solo)          0.016        0.004        -0.089            0.036
                                                      (0.082)      (0.084)      (0.123)           (0.059)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.103        0.145*       0.096             0.072
                                                      (0.094)      (0.086)      (0.108)           (0.080)
Across-class feedback, monetary reward (T2_fin)       0.173        0.258**      0.113             0.137*
                                                      (0.094)      (0.102)      (0.099)           (0.075)
Within-class feedback, reputation reward (T1_rep)     0.087        0.041        0.069             0.069
                                                      (0.080)      (0.078)      (0.082)           (0.058)
Across-class feedback, reputation reward (T2_rep)     0.047        0.071        0.024             0.059
                                                      (0.080)      (0.077)      (0.082)           (0.064)
Controlled for strata                                 Yes          Yes          Yes               Yes
N of observations                                     5093         5093         6736              7107

Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at national
examination; and grade level: P6, P7, S1 up to S4). N stands for the number of observations.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
9. Conclusion
Various interventions have been conducted with the aim of lowering absenteeism and
increasing student performance. Authors usually focus on the main outcomes of their interventions,
such as subjects’ performance, absence or drop-out rates, leaving outcomes other than learning
aside. Evidence from psychology indicates that current well-being, measured in terms of stress and
happiness, serves as an important prerequisite of future performance. For instance, stressed
students are absent and drop out of school more often than non-stressed students; stress makes
students exert less effort and perform worse. This paper contributes to the current literature by
studying the effects of different types of incentives on student performance and well-being
(measured by happiness and stress). I bring new evidence on the performance-versus-well-being
tradeoff by implementing two types of social comparative feedback regimes (within- and
across-class group feedback), two types of incentive regimes (financial and reputation rewards),
and their interactions.
The results of this study show that providing students with pure feedback without further
incentivization delivers subject-specific results. While in Mathematics students improved by 0.08 to
0.13 standard deviations, there was no improvement in English. The results are driven purely by
improvements in girls’ performance. Pure rewards (without feedback), in contrast, lead to an
improvement of 0.1 to 0.18 standard deviations in both subjects. These results are driven mainly by
boys. The interacted incentive scheme of feedback combined with rewards leads to an improvement
of 0.20 to 0.29 standard deviations if rewarded financially and 0.12 to 0.18 standard
deviations if rewarded reputationally. There is, however, a trade-off between improvement in
performance and students’ well-being in response to different incentive schemes. Feedback and
reputational rewards improve students’ performance mildly but influence neither their
happiness nor their stress. Competing for financial rewards results in moderate to strong
improvements in performance, but students’ stress significantly increases and their happiness
decreases. Students who received feedback while competing for monetary rewards reported
significantly lower stress levels than those who competed for money without any feedback. Stressed
students exerted less effort, performed worse on average and attrited by 29 percent more compared
to relaxed students.
Furthermore, this paper sheds light on gender differences in responsiveness to different
incentive provisions. According to the results, girls who competed for rewards of any type without
receiving feedback did not improve, and they significantly underperformed boys. If the girls
were repeatedly given feedback (the type of feedback does not matter), they performed
comparably to boys. Moreover, girls also respond positively to pure feedback provision (without
rewards). Comparative feedback thus plays a crucial role in improving girls’ performance in a
tournament environment. Boys react only to the provision of rewards; the provision of
feedback does not play any role in their performance improvements. There are no gender
differences in the effects of incentives on girls’ and boys’ well-being.
The results of this experiment may be especially helpful to policy makers trying to
find the optimal incentive scheme. Policy makers must exercise great caution in
designing educational rewards and consider the impact on student well-being. Further research
should study the long-term effects of changes in student well-being on
performance.
References
Andrabi, T., Das J. and Ijaz-Khwaja, A. (2009): Report Cards: The Impact of Providing School and
Child Test-scores on Educational Markets, BREAD Working Paper No. 226
Angrist, J., Bettinger, E., and Kremer, M. (2006): Long-term educational consequences of
secondary school vouchers: Evidence from administrative records in Colombia, The American
Economic Review, 847-862.
Angrist, J., and Lavy, V. (2009): The effects of high stakes high school achievement awards:
Evidence from a randomized trial. The American Economic Review, 1384-1414.
Apesteguia, J., Azmat, G., and Iriberri, N. (2012): The Impact of Gender Composition on Team
Performance and Decision-Making: Evidence from the Field, Management Science, Vol. 58(1)
January 2012, pp. 78–93.
Arnold, H. J. (1976): Effects of performance feedback and extrinsic reward upon high intrinsic
motivation, Organizational Behavior and Human Performance, 17(2), 275-288.
Ashraf, N., Bandiera, O., and Lee, S. S. (2014): Awards unbundled: Evidence from a natural field
experiment, Journal of Economic Behavior and Organization,100, 44-63.
Auriol, E., and Renault, R. (2008): Status and incentives, The RAND Journal of Economics, 39(1),
305-326.
Azmat, G., Bagues, M., Cabrales, A., and Iriberri, N. (2015): What you know can’t hurt you (for
long): A field experiment on relative performance feedback, Working paper, Aalto University.
Azmat, G. and Iriberri, N. (2010): The importance of relative performance feedback information:
Evidence from a natural experiment using high school students. Journal of Public Economics, 94(7),
435-452
Azmat, G. and Iriberri, N. (2016): The Provision of Relative Performance Feedback: An Analysis
of Performance and Satisfaction, Journal of Economics and Management Strategy, Vol. 25(1), pp. 77-110.
Bandiera, O., Barankay, I., and Rasul, I. (2010): Social incentives in the workplace, The Review of
Economic Studies, 77(2), 417-458.
Bandiera, O., Larcinese, V., and Rasul, I. (2015): Blissful ignorance? A natural experiment on the
effect of feedback on students' performance, Labour Economics, 34, 13-25
Barankay, I. (2011). Rankings and social tournaments: Evidence from a crowd-sourcing
experiment. In Wharton School of Business, University of Pennsylvania Working Paper.
Benabou, R., and Tirole, J. (2003): Intrinsic and extrinsic motivation, The Review of Economic
Studies, 70(3), 489-520.
Benjamini, Y., and Y. Hochberg (1995): Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B
(Methodological) 57: 289–300.
Benjamini, Y., and D. Yekutieli (2001): The control of the false discovery rate in multiple testing
under dependency, Annals of Statistics 29: 1165–1188.
Besley, T., and Ghatak, M. (2008): Status incentives, The American Economic Review, 206-211.
Bettinger, E. P. (2012): Paying to learn: The effect of financial incentives on elementary school
test scores. Review of Economics and Statistics, 94(3), 686-698.
Blanes i Vidal, J., and Nossol, M. (2011): Tournaments without prizes: Evidence from personnel
records, Management Science, 57(10), 1721-1736.
Blimpo, M. P. (2014): Team incentives for education in developing countries: A randomized
field experiment in Benin, American Economic Journal: Applied Economics, 6(4), 90-109.
Dunn, O. J. (1961): Multiple Comparisons Among Means, Journal of the American Statistical
Association 56 (293): 52–64
Buunk, B. P., Gibbons, F. X., and Reis-Bergan, M. (1997): Social comparison in health and illness:
A historical overview. Health, coping and well-being: Perspectives from social comparison theory,
1-23.
Buunk, B. P., and Gibbons, F. X. (2000): Toward an enlightenment in social comparison theory,
In Handbook of Social Comparison (pp. 487-499), Springer US.
Burgers, C., Eden, A., van Engelenburg, M. D., and Buningh, S. (2015): How feedback boosts
motivation and play in a brain-training game, Computers in Human Behavior, 48, 94-103.
Charness, G., Masclet, D., and Villeval, M. C. (2010): Competitive preferences and status as an
incentive: Experimental evidence, Groupe d’Analyse et de Théorie Économique Working Paper,
(1016).
Cohen, S., Kamarck, T., and Mermelstein, R. (1983): A global measure of perceived
stress, Journal of health and social behavior, 385-396.
Croson, R., and Gneezy, U. (2009): Gender differences in preferences, Journal of Economic
literature, 448-474.
Deci, E. L. (1971): Effects of externally mediated rewards on intrinsic motivation, Journal of
personality and Social Psychology, 18(1), 105.
Deci, E. L., Koestner, R., and Ryan, R. M. (1999): A meta-analytic review of experiments
examining the effects of extrinsic rewards on intrinsic motivation, Psychological bulletin, 125(6),
627.
Dijkstra, P., Kuyper, H., van der Werf, G., Buunk, A.P. and van der Zee, Y. (2008): Social
comparison in the classroom: a review, Review of educational research, Vol. 78, No. 4, p.828-879
Dolan, P., Metcalfe, R., and Powdthavee, N. (2008): Electing happiness: Does happiness affect
voting and do elections affect happiness, Discussion Papers in Economics, (30).
Duffy, J., and Kornienko, T. (2010): Does competition affect giving?, Journal of Economic
Behavior and Organization, 74(1), 82-103.
Dynarski, S. (2008): Building the stock of college-educated labor, Journal of human
resources, 43(3), 576-610.
Eisenkopf, G. (2011): Paying for better test scores, Education Economics,19(4), 329-339.
Ellingsen, T., and Johannesson, M. (2007): Paying respect, The Journal of Economic
Perspectives, 135-150.
Eriksson, T., Poulsen, A., and Villeval, M. C. (2009): Feedback and incentives: Experimental
evidence, Labour Economics, 16(6), 679-688.
Falk, A. and Ichino, A. (2006): Clean Evidence on Peer Pressure, Journal of Labor Economics, Vol.
24, Issue 1
Festinger, L. (1954): A theory of social comparison processes, Human relations, 7(2), 117-140.
Fordyce, M. W. (1988): A review of research on the happiness measures: A sixty second index of
happiness and mental health, Social Indicators Research, 20(4), 355-381.
Fryer Jr, R. G. (2010): Financial incentives and student achievement: Evidence from randomized
trials (No. w15898), National Bureau of Economic Research.
Hannan, R. L., Krishnan, R., and Newman, A. H. (2008): The effects of disseminating relative
performance feedback in tournament and individual performance compensation plans, The
Accounting Review, 83(4), 893-913.
Helliwell, J. F., and Wang, S. (2012): The state of world happiness, World happiness report, 10-57.
Hastings, J. S., Neilson, C. A., and Zimmerman, S. D. (2012): The effect of school choice on
intrinsic motivation and academic outcomes (No. w18324), National Bureau of Economic Research.
Hattie, J., and Timperley, H. (2007): The power of feedback. Review of educational
research, 77(1), 81-112.
Hirano, K., Imbens, G. and Ridder, G. (2003): Efficient Estimation of Average Treatment Effects
Using the Estimated Propensity Score, Econometrica, Vol. 71(4), 1161–1189
Hochberg, Y. (1988): A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 75: 800–802.
Holland, B. S., and Copenhaver. M.D. (1987): An improved sequentially rejective Bonferroni test
procedure. Biometrics 43: 417–423
Holm, S. (1979): A simple sequentially rejective multiple test procedure. Scandinavian Journal
of Statistics 6: 65–70.
Hoxby, C. (2000): Peer effects in the classroom: Learning from gender and race variation (No.
w7867), National Bureau of Economic Research.
Imbens, G.W. (2004): Nonparametric estimation of average treatment effects under exogeneity:
A review, The Review of Economics and Statistics: Vol 86, No.1, pp 4-29.
Jalava, N., Joensen, J.S. and Pellas, E. (2015): Grades and rank: Impacts of non-financial
incentives on test performance, Journal of Economic Behavior and Organization 115 (2015) 161–
196
Juster, R. P., McEwen, B. S., and Lupien, S. J. (2010): Allostatic load biomarkers of chronic stress
and impact on health and cognition, Neuroscience and Biobehavioral Reviews, 35(1), 2-16.
Kling, J. R., Liebman, J. B., and Katz, L. F. (2007): Experimental analysis of neighborhood
effects, Econometrica, 75(1), 83-119.
Kremer, M., Miguel, E. and Thornton, R. (2002): Incentives to Learn, NBER Working Papers
10971, National Bureau of Economic Research, Inc.
Krueger, A. B. (1999): Experimental estimates of education production functions, The Quarterly
Journal of Economics, 114 (2): 497-532
Kosfeld, M. and Neckermann, S. (2011): Getting More Work for Nothing? Symbolic Awards and
Worker Performance, American Economic Journal: Microeconomics, Vol. 3, Issue 3
Kremer, M., Miguel, E., and Thornton, R. (2004): Incentives to learn (No. w10971), National
Bureau of Economic Research.
Kuhnen, C. M., and Tymula, A. (2012): Feedback, self-esteem, and performance in organizations,
Management Science, 58(1), 94-113.
Kwak, D. (2010): Inverse probability weighted estimation for the effect of kindergarten
enrollment age and peer quality on student academic achievement for grades K-12, working paper
LaLonde, R. J. (1995): The promise of public sector-sponsored training programs, The Journal of
Economic Perspectives, 149-168.
Lavy, V. (2009): Performance Pay and Teachers’ Effort, Productivity and Grading Ethics,
American Economic Review 99, 5
Lazear, E.P. (2000): Performance pay and productivity. American Economic Review, Vol. 90, No.
5, pp. 1346-1361.
Levitt, S. D., List, J. A., Neckermann, S., and Sadoff, S. (2012): The behavioralist goes to school:
Leveraging behavioral economics to improve educational performance (No. w18165), National
Bureau of Economic Research.
Locke, E. A., and Latham, G. P. (1990): A theory of goal setting and task performance, Prentice-Hall, Inc.
Lupien, S. J., McEwen, B. S., Gunnar, M. R., and Heim, C. (2009): Effects of stress throughout the
lifespan on the brain, behaviour and cognition, Nature Reviews Neuroscience, 10(6), 434-445.
Lyubomirsky, S., and Lepper, H. (1999): A measure of subjective happiness: Preliminary
reliability and construct validation, Social Indicators Research, 46, 137-155.
McEwen, B. S. (2008): Central effects of stress hormones in health and disease: Understanding
the protective and damaging effects of stress and stress mediators, European journal of
pharmacology, 583(2), 174-185.
MacKerron, G. (2012): Happiness economics from 35 000 feet. Journal of Economic
Surveys, 26(4), 705-735.
Markham, S. E., Scott, K., and McKee, G. (2002): Recognizing good attendance: a
longitudinal, quasi‐experimental field study. Personnel Psychology, 55(3), 639-660.
Mas, A. and Moretti, E. (2009): Peers at Work, American Economic Review, Vol. 99, Issue 1 31
Mettee, D. R., and Smith, G. (1977): Social comparison and interpersonal attraction: The case for
dissimilarity. Social comparison processes: Theoretical and empirical perspectives, 69, 101.
Moldovanu, B., Sela, A., and Shi, X. (2007): Contests for status, Journal of Political
Economy, 115(2), 338-363.
Newson, R.B. (2010): Frequentist q-value for multiple-test procedures, Stata Journal, Vol. 10,
No. 4.
Ray, D. (2002): Aspirations, poverty and economic change, Understanding Poverty, 2006, p.
409-443(35)
Reardon, S. F., Cheadle, J. E., and Robinson, J. P. (2009): The effect of Catholic schooling on math
and reading development in kindergarten through fifth grade, Journal of Research on Educational
Effectiveness, 2(1), 45-87.
Ryan, R. M., and Deci, E. L. (2000): Intrinsic and extrinsic motivations: Classic definitions and
new directions, Contemporary educational psychology, 25(1), 54-67.
Sacerdote, B. (2011): Peer effects in education: How might they work, how big are they and how
much do we know thus far?, Handbook of the Economics of Education, 3, 249-277.
Šidák, Z. (1967): Rectangular confidence regions for the means of multivariate normal
distributions, Journal of the American Statistical Association 62: 626–633.
Simes, R. J. (1986): An improved Bonferroni procedure for multiple tests of significance.
Biometrika 73: 751–754.
Slavin, R. (1984). Meta-analysis in education: How has it been used? Educational Researcher.
13(8), 6-15, 24-27
Schneiderman, N., Ironson, G., and Siegel, S. D. (2005): Stress and health: psychological,
behavioral, and biological determinants, Annual Review of Clinical Psychology, 1, 607.
Suls, J., and Wheeler, L. (2000): A selective history of classic and neo-social comparison theory,
In Handbook of social comparison (pp. 3-19). Springer US.
Bigoni, M., Fort, M., Nardotto, M., and Reggiani, T. (2011): Teams or tournaments? A field
experiment on cooperation and competition among university students.
Tran, A., and Zeckhauser, R. (2012): Rank as an inherent incentive: Evidence from a field
experiment, Journal of Public Economics, 96(9), 645-650.
Van Dijk, F., Sonnemans, J., and Van Winden, F. (2001): Incentive systems in a real effort
experiment, European Economic Review, 45(2), 187-214.
Veenhoven, R. (1988): The utility of happiness, Social Indicators Research, 20(4), 333-354.
Weiss, Y., and Fershtman, C. (1998): Social status and economic performance: A
survey, European Economic Review, 42(3), 801-820.
Wolf, T. M. (1994): Stress, coping and health: enhancing well‐being during medical school,
Medical Education, 28(1), 8-17.
Wooldridge, J. (2007): Inverse Probability Weighted M-Estimation for General Missing Data
Problems, Journal of Econometrics 141:1281-1301
Appendix
APPENDIX A: Summary statistics and randomization balance
Appendix A1: Balance between control and treatment groups
Variable                              Control        Within-class   Across-class
                                                     feedback       feedback
School Level:
  The number of primary schools       10             11             10
  The number of secondary schools     7              7              8
School Type:
  Public Schools                      8              5              6
  Private Schools                     7              9              8
  Community Schools                   2              4              4
By Population                         2345           2415           2371
                                      (48 groups)    (51 groups)    (51 groups)
By PLE/UCE results                    3.175          3.039          3.102
By testing results                    21.140         21.363         21.648
Note: min(PLE/UCE)= 1.7397, max(PLE/UCE)= 4.2857, mean(PLE/UCE)=3.1040
Note: min(TR)=8.3125, max(TR)=39.7765, mean(TR)=21.3192, where TR=Testing Results
Appendix A2: Comparison of mean characteristics of students in control and treatment groups
                              Means                                          Mean Differences
                              Within-class    Across-class    Control
                              feedback (T1)   feedback (T2)   (C)            (T1 – C)    (T2 – C)
A. STUDENTS' PERFORMANCE – ROUND 1 – BASELINE SURVEY
Mathematics                   11.015          11.198          11.092         -0.077      0.106
                                                                             (0.99)      (0.96)
English                       11.551          11.927          11.477         0.074       0.450
                                                                             (1.53)      (1.72)
Sum Mathematics + English     22.566          23.125          22.569         -0.003      0.556
                                                                             (2.30)      (2.43)
B.
B.1 After Math questionnaire
Q1: Expected number of points
[min 1, max 10]
Q2: Subjective effort level
[min 1, max 5]
Q3: Perceived difficulty
[min 1, max 5]
Q4: Subjective level of happiness
[min 1, max 7]
B.2 After English questionnaire
Q1: Expected number of points
[min 1, max 10]
Q2: Subjective effort level
[min 1, max 5]
Q3: Perceived difficulty
[min 1, max 5]
Q4: Subjective level of happiness
[min 1, max 7]
B.3 Aspiration questionnaire
Aspirations
Education over Relax
[min 1, max 5]
Education over Work
[min 1, max 5]
Work over Relax
[min 1, max 5]
Perceived happiness scale
[min 4, max 28]
Perceived stress
[min 0, max 16]
QUESTIONNAIRES
4.331
4.537
4.551
3.341
3.494
3.423
3.447
3.319
3.525
3.504
3.253
3.184
5.715
5.757
5.796
3.644
3.644
3.677
3.547
2.950
3.627
3.553
2.904
2.856
3.833
3.756
3.778
2.766
2.701
2.803
3.538
11.479
6.018
3.496
3.477
11.653
11.223
6.352
5.756
-0.221
(0.150)
-0.057
(0.053)
-0.082
(0.053)
0.135
(0.092)
-0.081
(0.161)
-0.006
(0.046)
-0.033
(0.052)
0.094
(0.084)
0.056
(0.049)
0.060
(0.057)
-0.037
(0.094)
0.256
(0.231)
0.262
(0.164)
-0.151
(0.145)
0.021
(0.052)
0.072
(0.052)
0.069
(0.094)
Joint Pvalue
0.183
0.699
0.423
0.299
0.298
0.030
0.343
-0.039
(0.144)
0.074*
(0.044)
-0.033
(0.049)
0.048
(0.086)
0.879
-0.021
(0.049)
0.019
(0.059)
-0.102
(0.090)
0.429**
(0.222)
0.595***
(0.142)
0.269
0.141
0.752
0.534
0.526
0.524
0.155
0.000
Appendix A2: Comparison of mean characteristics of students in control and treatment groups
(Continued)
Withinclass
feedback
(T1)
Means
Acrossclass
feedback
(T2)
Mean Differences
Control
(C)
(T1 – C)
(T2 – C)
Joint Pvalue
C. OTHER (continued)
C.1 Attrition rates
All schools
0.359
0.346
0.454
-0.095***
(0.034)
-0.059*
(0.030)
-0.108***
(0.033)
-0.069**
(0.029)
C.2 Alwayscomers
All schools
0.202
0.186
0.082
17.058
17.048
16.999
0.121***
(0.033)
0.097***
(0.033)
0.059
(0.079)
0.104***
(0.104)
0.077**
(0.031)
0.049
(0.078)
Restricted sample#
Restricted sample#
C.3 Age
C.6 Gender
All schools
Restricted sample#
C.4 Class size
All schools
0.358
0.207
0.534
0.548
0.348
0.417
0.188
0.110
0.512
0.508
0.524
0.533
0.025*
(0.015)
0.015
(0.015)
0.004
(0.015)
-0.009
(0.015)
0.002
0.041
0.000
0.008
0.737
0.192
0.277
-7.741*
-3.581
0.146
(4.045)
(4.672)
Restricted sample#
52.15
56.56
55.14
-2.985
1.428
0.489
(3.988)
(4.651)
Attrition rate is defined as the rate of students missing in the last testing round conditional on student
participation in the baseline testing. T1 stands for within-class comparison, T2 for across-class comparison
and C for control group. There was one school which experienced strong transformation (exogenous to the
intervention), which resulted in change of their headmaster and a large dropout of students. Restricted
sample (#) excludes that school from the analysis. Robust standard errors adjusted for clustering at school
level are in parentheses.
§ significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%
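The attrition definition in the note above can be illustrated with a short sketch. It assumes students are identified by unique IDs recorded in each testing round; the function name and the input format are hypothetical and are not taken from the paper's data files.

```python
def attrition_rate(baseline_ids, final_round_ids):
    """Share of baseline participants who are missing in the last testing round.

    Both arguments are iterables of student identifiers (assumed format).
    Students who appear only in the final round are ignored, because the
    rate is conditional on participation in the baseline testing.
    """
    baseline = set(baseline_ids)
    if not baseline:
        raise ValueError("baseline sample is empty")
    missing = baseline - set(final_round_ids)
    return len(missing) / len(baseline)


# Four students tested at baseline, two of them absent in the final round.
example_rate = attrition_rate([1, 2, 3, 4], [2, 4, 9])
```

In this toy example the rate is 0.5; student 9 does not affect it because that student was not part of the baseline sample.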
52.26
56.42
60.00
Appendix A3: Comparison of mean characteristics of students in control and treatment groups – by
treatment decomposition
SOLO TREATMENT EFFECTS
Within-class competition with
no rewards (T1_solo)
Across-class competition with
no rewards (T2_solo)
Financial rewards with no
feedback (Fin_solo)
Reputation rewards with no
feedback (Rep_solo)
INTERACTION EFFECTS
Within-class competition with
Financial rewards (T1_FIN)
Across-class competition with
Financial rewards (T2_FIN)
Within-class competition with
reputation rewards (T1_REP)
Across-class competition with
reputation rewards (T2_REP)
Pure control
Mathematics
Sum
Mean
19.551
21.575
20.528
24.288
24.111
23.326
22.734
25.454
22.697
Difference
from pure
control
-3.136*
(1.792)
-1.284
(1.814)
-1.891
(2.127)
2.071
(1.898)
Mean
7.126
8.068
7.719
9.366
1.728
(2.339)
1.215
(1.700)
0.453
(1.685)
3.304**
(1.491)
8.485
0
8.583
9.002
8.834
9.974
English
Difference
from pure
control
-1.439*
(0.771)
-0.559
(0.706)
-0.751
(1.035)
0.976
(0.815)
Mean
12.425
13.507
12.809
14.922
0.029
(0.934)
0.651
(0.719)
0.418
(0.719)
1.606**
(0.767)
15.625
0
14.115
14.324
13.899
15.479
Difference
from pure
control
-1.698§
(1.096)
-0.725
(1.178)
-1.139
(1.163)
1.095
(1.191)
1.698
(1.458)
0.563
(1.117)
0.035
(1.052)
1.698**
(0.808)
0
Joint p-value
0.028
0.069
0.039
Rows represent treatment groups (either pure treatments or treatment interactions). Columns (1), (3) and
(5) represent average scores from Math, English and their sum. Columns (2), (4), (6) represent differences
between particular treatment and pure control group (group without any feedback and any reward). Robust
standard errors adjusted for clustering at school level are in parentheses.
§ significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
APPENDIX B: Randomization and logistics in the field
To improve the balance between control and treatment groups, the sample was stratified
along three dimensions: school location (the sample was divided into four areas differing in
remoteness), average school performance in the national examinations (above or below
average) and student level (grades 6 and 7 of primary education and grades 1 to 4 of secondary
education) 19, 20. Within each stratum, I randomized the sample into treatment and control groups. The
randomization was done in two stages (as shown in Figure 1). First, after stratifying the
sample by school performance and area, I randomized the whole sample of 53 schools into
treatment and control groups in a 2:1 ratio. This randomization was done at the school level and
resulted in 36 treatment schools and 17 control schools. School-level randomization in the first
stage was chosen to minimize contamination of the control group through information spillovers.
In the second stage, I randomly divided the classes of the treatment schools into a within-class
feedback group (T1) and an across-class feedback group (T2) in a 1:1 ratio (class-level
randomization). Under this design, no student in a control school received any treatment, while
students in treatment schools received either within- or across-class feedback, depending on
the group their class was randomized into. Overall, one third of the sample is the control
group, one third is treatment group 1 and one third is treatment group 2. By design, exposure to
the treatment is the only systematic difference between the control and treatment groups.
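The two-stage procedure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the script used in the field: the input format, the function name, the rounding rule for the 2:1 split within a stratum and the alternating rule for the 1:1 class split are all hypothetical choices.

```python
import random
from collections import defaultdict


def two_stage_randomization(school_strata, classes_by_school, seed=0):
    """Stage 1: schools to treatment/control (2:1) within each stratum.
    Stage 2: classes of treatment schools to T1/T2 (1:1).

    school_strata: dict school_id -> stratum label, e.g. (area, performance)
    classes_by_school: dict school_id -> list of class ids
    (both input formats are assumptions made for this sketch)
    """
    rng = random.Random(seed)

    by_stratum = defaultdict(list)
    for school_id, stratum in school_strata.items():
        by_stratum[stratum].append(school_id)

    school_arm = {}
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        n_control = round(len(ids) / 3)  # about one third of schools are control
        for position, school_id in enumerate(ids):
            school_arm[school_id] = "control" if position < n_control else "treatment"

    class_arm = {}
    for school_id in sorted(school_arm):
        classes = sorted(classes_by_school[school_id])
        if school_arm[school_id] == "control":
            for class_id in classes:
                class_arm[class_id] = "C"  # no class in a control school is treated
            continue
        rng.shuffle(classes)
        for position, class_id in enumerate(classes):
            class_arm[class_id] = "T1" if position % 2 == 0 else "T2"

    return school_arm, class_arm
```

Randomizing whole schools first keeps every class of a control school untreated, which is the spillover argument made in the text; the class-level split then happens only inside treatment schools.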
19 Every year, students in P7 of primary schools and S4 of secondary schools take the national leaving examinations, which
are compulsory for completing their studies and proceeding to a higher level. Using the data on PLE and UCE, I was
able to divide schools into better- and worse-performing schools.
20 Uganda introduced Universal Primary Education (UPE) in 1997, allowing up to four children per family to go to school for
free. Later it was extended to all children. Primary education is a seven-year program, and for successful completion
students need to pass the national Primary Leaving Exam (PLE) at the end of grade 7. Without passing the PLE they cannot be
admitted to a secondary school. Secondary school consists of two levels: “O-level”, a four-year program from S1 up
to S4 completed by passing the Ugandan Certificate of Education (UCE); and “A-level”, a two-year extension to the
O-level completed by passing the Ugandan Advanced Certificate of Education (UACE). In 2007 Uganda became the first
African country to introduce Universal Secondary Education (USE). The school year consists of 3 trimesters and lasts from
January until December. Students are supposed to sit midterm and final examinations. Students, however, do not
necessarily have access to their evaluations and have limited information about their improvements.
All schools in the sample were connected to a local non-governmental organization called
Uganda Czech Development Trust (UCDT). UCDT is a local affiliate of the non-governmental
organization Archdiocese Caritas Prague, Czech Republic, which has been running a sponsorship
program “Adopce na dalku” in Uganda since 1993. According to UCDT representatives, students
were placed in primary and secondary schools based on their own choice; therefore, supported
students should not differ from unsupported students in terms of their school choice.
During the academic year, students in the feedback groups received feedback in the form of a
report card, which was pasted into a small progress report
book that each child in the treatment group received from us. My team members explained the
content of the report card repeatedly until students fully understood the message. The books were
stored at schools, and headmasters promised to let children check their books at any point. The
books contained all the information needed to keep a child’s attention and motivation active. After
the experiment, students could keep their books. After students learned their feedback, they were
asked to start working on the questionnaires and to solve the Math and English exams 21. Students in the
control group started answering the questionnaires immediately. To ensure transparency, I
used tests that I constructed myself. To announce the competition, I organized additional
meetings with students to explain the conditions in detail. Moreover, I left fliers in their classrooms
so that absent classmates could also learn about the competition. Students were reminded of
the conditions of the competition before they sat the final exams. It took me and my four local
enumerators on average 3 to 4 weeks to evaluate the examinations. Once the winners were known, we
visited the schools again to distribute the rewards.
21 The order was as follows: the “Before Math questionnaire”, followed by the Math examination, which lasted 30
minutes; the “After Math Before English questionnaire”; the English exam in the subsequent 20 minutes; and finally
the “After English questionnaire”. The core questions of the questionnaires concerned student expectations regarding
how many points they thought they would obtain in the Math and English examinations, how much effort
they planned to put (or had put) into answering the questions, and their current level of happiness. All of
these questions were asked both before and after each exam. No before-Math and before-English
questionnaires were collected during the baseline survey, since students saw the examinations for the first
time.
Appendix B2a: Project’s timeline
[Figure: project timeline. Recoverable content:]
2011
  Baseline Survey: students, teachers and headmasters interviewed.
2012
  Testing 1: baseline testing from Math and English and questionnaires; no treatment.
  Testing 2: within-class feedback group (T1) received first treatment; across-class feedback group (T2) no treatment.
  Testing 3: T1 received treatment including improvement status; T2 received first treatment.
  Testing 4: T1 received treatment including improvement status; T2 received treatment including improvement status.
  Testing 5: T1 received treatment including improvement status; T2 received treatment including improvement status;
  chosen students competed to win prizes.
2013
  Follow-up Session: no treatment provided, students examined from Math and English.
Other events marked on the timeline: reward scheme introduced; rewards disseminated; BREAK (marked twice).
Note: T1 (treatment 1) stands for within-class social comparison treatment; T2 (treatment 2) represents acrossclass social comparison group; Qualification criteria differed based on initial randomization (T1,T2,C).
Appendix B2a: Orthogonal randomization of the sample into reward treatments
Appendix B2: Short version of the Perceived Stress Scale
Appendix B3: Short version of the Perceived Happiness Scale
Appendix B4: Score cards for students in within-class comparison group
Appendix B4: Score cards for students in across-class comparison group
Appendix C: Average treatment effects
Appendix C1: OLS estimates of the effects of different incentive schemes on students’ performance
and well-being
(Pure within-class feedback      Pure within-class    Within-class feedback    Within-class feedback
and its interactions)            feedback             rewarded financially     rewarded reputationally
Math and English pooled          0.017                0.201**                  0.187**
(in st.dev.)                     (0.014)              (0.094)                  (0.073)
Mathematics (in st.dev)          0.100                0.231*                   0.209**
                                 (0.085)              (0.118)                  (0.103)
English (in st.dev.)             -0.128**             0.103                    0.087
                                 (0.056)              (0.094)                  (0.080)
Stress                           0.059                0.222*                   0.197
                                 (0.113)              (0.129)                  (0.162)
Happiness                        -0.131               -0.271***                -0.196*
                                 (0.095)              (0.094)                  (0.103)
Confidence (Math)                -7.081***            -6.169***                -6.468***
                                 (0.775)              (1.215)                  (0.914)
Confidence (English)             -5.559***            -5.190***                -6.681***
                                 (0.809)              (1.406)                  (1.096)
Education over work              0.023                0.154***                 0.063
                                 (0.054)              (0.059)                  (0.078)
Education over rest              0.103***             0.101**                  0.059
                                 (0.039)              (0.042)                  (0.059)
Work over rest                   0.026                -0.147*                  -0.045
                                 (0.075)              (0.086)                  (0.088)

(Pure across-class feedback      Pure across-class    Across-class feedback    Across-class feedback
and its interactions)            feedback             rewarded financially     rewarded reputationally
Math and English pooled          0.044                0.282**                  0.122*
(in st.dev.)                     (0.059)              (0.113)                  (0.073)
Mathematics (in st.dev)          0.082                0.277**                  0.188**
                                 (0.073)              (0.139)                  (0.080)
English (in st.dev.)             -0.049               0.173*                   0.047
                                 (0.059)              (0.094)                  (0.080)
Stress                           -0.059               -0.054                   0.198§
                                 (0.107)              (0.128)                  (0.124)
Happiness                        -0.083               -0.063                   -0.237*
                                 (0.082)              (0.097)                  (0.122)
Confidence (Math)                -6.920***            -5.607***                -6.098***
                                 (0.782)              (0.897)                  (1.008)
Confidence (English)             -6.267***            -4.618***                -5.782***
                                 (0.895)              (1.038)                  (1.186)
Education over work              0.129***             0.154**                  0.035
                                 (0.050)              (0.064)                  (0.089)
Education over rest              0.073**              -0.067                   0.042
                                 (0.037)              (0.059)                  (0.044)
Work over rest                   -0.033               0.045                    0.031
                                 (0.068)              (0.079)                  (0.082)
Appendix C1: OLS estimates of the effects of different incentive schemes on students’ performance
and well-being (continued)
(Pure financial rewards and      Pure financial       Within-class feedback    Across-class feedback
interactions)                    rewards              rewarded financially     rewarded financially
Math and English pooled          0.129*               0.201**                  0.282**
(in st.dev.)                     (0.068)              (0.094)                  (0.113)
Mathematics (in st.dev)          0.106                0.231*                   0.277**
                                 (0.101)              (0.118)                  (0.139)
English (in st.dev.)             0.045                0.103                    0.173*
                                 (0.088)              (0.094)                  (0.094)
Stress                           0.466***             0.222*                   -0.054
                                 (0.162)              (0.129)                  (0.128)
Happiness                        -0.166§              -0.271***                -0.063
                                 (0.105)              (0.094)                  (0.097)
Confidence (Math)                0.794                -6.169***                -5.607***
                                 (0.892)              (1.215)                  (0.897)
Confidence (English)             1.662§               -5.190***                -4.618***
                                 (1.021)              (1.406)                  (1.038)
Education over work              0.031                0.154***                 0.154**
                                 (0.077)              (0.059)                  (0.064)
Education over rest              0.014                0.101**                  -0.067
                                 (0.068)              (0.042)                  (0.059)
Work over rest                   0.045                -0.147*                  0.045
                                 (0.083)              (0.086)                  (0.079)

(Pure reputational rewards       Pure reputational    Within-class feedback    Across-class feedback
and interactions)                rewards              rewarded reputationally  rewarded reputationally
Math and English pooled          0.062                0.187**                  0.122*
(in st.dev.)                     (0.106)              (0.073)                  (0.073)
Mathematics (in st.dev)          0.138                0.209**                  0.188**
                                 (0.141)              (0.103)                  (0.080)
English (in st.dev.)             0.016                0.087                    0.047
                                 (0.082)              (0.080)                  (0.080)
Stress                           0.074                0.197                    0.198§
                                 (0.189)              (0.162)                  (0.123)
Happiness                        -0.105               -0.196*                  -0.237*
                                 (0.097)              (0.103)                  (0.122)
Confidence (Math)                1.498*               -6.468***                -6.098***
                                 (0.782)              (0.914)                  (1.008)
Confidence (English)             0.987                -6.681***                -5.782***
                                 (0.967)              (1.096)                  (1.186)
Education over work              0.083                0.063                    0.035
                                 (0.076)              (0.078)                  (0.089)
Education over rest              0.037                0.059                    0.042
                                 (0.047)              (0.059)                  (0.044)
Work over rest                   0.081                -0.045                   0.031
                                 (0.087)              (0.088)                  (0.082)
Note: Robust standard errors adjusted for clustering at class level are in parentheses. § significant at 15%; *
significant at 10%; ** significant at 5%; *** significant at 1%
Appendix C2: Average treatment effects of different types of interventions on performance and students’ well-being
                                     Overall performance       Mathematics               English                   Stress       Happiness
                                     (Math+English)
Dependent variable:                  (All areas)  (Without     (All areas)  (Without     (All areas)  (Without     (All areas)  (All areas)
                                                  area 1)                   area 1)                   area 1)
Within-class feedback, no            0.017        -0.031       0.100        -0.036       -0.128**     -0.141       0.059        -0.131
rewards (T1_solo)                    (0.060)      (0.109)      (0.085)      (0.135)      (0.056)      (0.115)      (0.113)      (0.095)
Across-class feedback, no            0.044        0.014        0.082        -0.004       -0.049       -0.051       -0.059       -0.083
rewards (T2_solo)                    (0.059)      (0.097)      (0.073)      (0.126)      (0.059)      (0.109)      (0.466)      (0.082)
Financial Rewards, no                0.129*       0.134§       0.106        0.069        0.045        0.064        0.466***     -0.166§
feedback (Fin_solo)                  (0.068)      (0.087)      (0.101)      (0.129)      (0.088)      (0.111)      (0.163)      (0.105)
Reputational Rewards, no             0.062        0.017        0.138        0.045        0.016        -0.003       0.074        -0.105
feedback (Rep_solo)                  (0.106)      (0.116)      (0.141)      (0.158)      (0.082)      (0.105)      (0.189)      (0.097)
Within-class feedback,               0.201**      0.190*       0.231*       0.182        0.103        0.109        0.222*       -0.271***
monetary reward (T1_fin)             (0.094)      (0.101)      (0.119)      (0.134)      (0.094)      (0.109)      (0.129)      (0.094)
Across-class feedback,               0.292**      0.257**      0.277**      0.215§       0.173*       0.168§       -0.054       -0.063
monetary reward (T2_fin)             (0.113)      (0.117)      (0.139)      (0.141)      (0.094)      (0.112)      (0.128)      (0.097)
Within-class feedback,               0.187**      0.180**      0.209**      0.165        0.087        0.093        0.197        -0.196*
reputat. reward (T1_rep)             (0.073)      (0.087)      (0.103)      (0.123)      (0.080)      (0.099)      (0.162)      (0.103)
Across-class feedback,               0.122*       0.104        0.188**      0.136        0.047        0.045        0.198§       -0.237*
reputat. reward (T2_rep)             (0.073)      (0.087)      (0.080)      (0.108)      (0.080)      (0.101)      (0.124)      (0.122)
Controlled for strata                Yes          Yes          Yes          Yes          Yes          Yes          Yes          Yes
N                                    5108         3516         5102         3512         5093         3503         4105         4105
Note: The treatment effects are calculated with respect to the control group. I controlled for strata (i.e., students’ performance at the national
examinations, area and the level of studies). N stands for the number of observations. Robust standard errors adjusted for clustering at class level are in
parentheses. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
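The significance legend used throughout the appendix tables (§, *, **, ***) can be expressed as a small helper. The sketch below is only illustrative: it assumes a two-sided test based on a normal approximation, whereas the paper's own p-values come from regressions with clustered standard errors, so the markers it produces need not match every entry in the tables. The function name is hypothetical.

```python
import math


def significance_marker(estimate, std_error):
    """Map an estimate and its standard error to the legend used in the notes.

    Assumption of this sketch: two-sided p-value from a standard normal
    approximation, p = erfc(|z| / sqrt(2)).
    """
    z = abs(estimate) / std_error
    p_value = math.erfc(z / math.sqrt(2))
    thresholds = ((0.01, "***"), (0.05, "**"), (0.10, "*"), (0.15, "§"))
    for threshold, marker in thresholds:
        if p_value < threshold:
            return marker
    return ""


# Entries taken from the table above: estimate and standard error.
markers = [
    significance_marker(0.466, 0.163),
    significance_marker(0.129, 0.068),
    significance_marker(0.134, 0.087),
    significance_marker(0.017, 0.060),
]
```

The design choice here is to order the thresholds from strictest to loosest so that the first match is the strongest marker that applies.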
Appendix C3: Aggregated average treatment effects, by subject
Dependent variable:
Math score
(1)
Feedback provided (T)
0.102**
(0.051)
Within-class feedback
(T1)
Across-class feedback
(T2)
Rewards provided (Rew)
Financial Rewards
(Finrew)
Reputational Rewards
(Reprew)
(2)
MATHEMATICS
(3)
(4)
(5)
(6)
0.094*
(0.051)
0.112*
(0.059)
0.093*
(0.055)
0.099*
(0.059)
0.089§
(0.056)
0.128**
(0.063)
0.138**
(0.066)
0.151*
(0.082)
0.127*
(0.066)
(7)
(8)
-0.001
(0.037)
0.125**
(0.056)
0.142*
(0.078)
0.115*
(0.064)
ENGLISH
(9)
(10)
(11)
-0.009
(0.036)
-0.015
(0.042)
0.014
(0.042)
0.126**
(0.055)
0.153**
(0.066)
0.103*
(0.054)
(12)
-0.028
(0.039)
0.012
(0.040)
0.158**
(0.065)
0.108**
(0.053)
Controlled for strata
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
N
5102
5102
5102
5102
5102
5102
5093
5093
5093
5093
5093
5093
Note: Rows represent aggregated treatment groups. The treatment effects are calculated with respect to the control group. Students in group “T” are
those who received any type of feedback (either within-class feedback (T1) or across-class feedback (T2)). Students in the group “Rew” received any
type of reward (either financial reward (Finrew) or reputational reward (Reprew)). Columns (1) – (6) represent the treatment effects in Mathematics,
columns (7) – (12) in English. In all specifications I controlled for strata (i.e., students’ performance at the national examinations, area and the level of
studies). N stands for the number of observations. Robust standard errors adjusted for clustering at school level are in parentheses. § significant at 15%,
* significant at 10%; ** significant at 5%; *** significant at 1%.
Significant differences between estimates within particular specifications: there is a statistical difference between provision of feedback (T) and
rewards (Rew) in the English performance (p=0.025, column 11); between within-class feedback (T1) and financial or reputational rewards (Finrew or
Reprew) in the English performance (p=0.009 and p=0.029 respectively, column 11), and between across-class feedback (T2) and financial or reputational rewards (Finrew or
Reprew) in the English performance (p=0.038 and p=0.113 respectively, column 12).
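The aggregation described in the note (group “T” pools T1 and T2; group “Rew” pools Finrew and Reprew) can be written as a mapping from a student's design cell to indicator variables. This is a sketch under assumptions: the cell labels ("none", "within", "across", "financial", "reputational") and the function name are hypothetical, chosen only to mirror the variable names in the table.

```python
def treatment_indicators(feedback, reward):
    """Return the aggregated and disaggregated treatment dummies for one student.

    feedback: "none", "within" or "across" (assumed labels)
    reward:   "none", "financial" or "reputational" (assumed labels)
    """
    if feedback not in ("none", "within", "across"):
        raise ValueError(f"unknown feedback arm: {feedback}")
    if reward not in ("none", "financial", "reputational"):
        raise ValueError(f"unknown reward arm: {reward}")

    t1 = int(feedback == "within")
    t2 = int(feedback == "across")
    finrew = int(reward == "financial")
    reprew = int(reward == "reputational")
    return {
        "T": t1 + t2,             # any feedback (T1 or T2)
        "T1": t1,
        "T2": t2,
        "Rew": finrew + reprew,   # any reward (Finrew or Reprew)
        "Finrew": finrew,
        "Reprew": reprew,
    }


# A student with within-class feedback who competed for a financial reward.
example = treatment_indicators("within", "financial")
```

Because a student belongs to at most one feedback arm and at most one reward arm, the sums `t1 + t2` and `finrew + reprew` are always 0 or 1, and a student in the pure control group has all six indicators equal to zero.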
Appendix C4: Different specifications used in estimation of the aggregated average treatment effects of the provision of feedback or
rewards on performance in Mathematics
MATHEMATICS
Dependent variable: Math
score
Within-class social comparison
(Treatment 1)
Across-class social comparison
(Treatment 2)
Financial Rewards
Reputational Rewards
Controlled for strata
Interactions
N
Pure FB
(round 4)
Pure FB
(round 4)
0.024
(0.062)
0.005
(0.058)
0.037
(0.048)
0.043
(0.043)
No
No
5245
Yes
No
5245
Pure FB
(round 5)
Pure FB
(round 5)
0.084
(0.081)
0.024
(0.084)
0.112*
(0.059)
0.093*
(0.055)
No
No
5102
Yes
No
5102
Pure
Rewards
(round 5)
0.231**
(0.092)
0.185**
(0.079)
No
No
5102
Pure
Rewards
(round 5)
Mix FB and
Rewards
(round 5)
0.151*
(0.082)
0.127*
(0.066)
Yes
No
5102
0.086
(0.079)
0.046
(0.081)
0.233**
(0.093)
0.184**
(0.078)
No
No
5102
Mix FB
and
Rewards
(round 5)
0.099*
(0.059)
0.089§
(0.056)
0.142*
(0.078)
0.115*
(0.064)
Yes
No
5102
Note: Robust standard errors adjusted for clustering at class level are in parentheses. The treatment effects are calculated with respect to the control
group. Columns (2), (4), (6) and (8) control for stratum fixed effects (areas (by distance from the capital city, Kampala), school performance at national
examination and grade level; P6,P7, S1 up to S4). N stands for the number of observations. First two columns analyze the effect between testing round 4
and the baseline testing in round 1. The remaining estimates are based on the differences between round 5 (the final testing round) and the baseline
round 1. § significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Appendix C5: Different specifications used in estimation of the aggregated average treatment effects of the provision of feedback or
rewards on performance in English
ENGLISH
Dependent variable: English
score
Pure FB
(round 4)
Pure FB
(round 4)
Pure FB
(round 5)
Pure FB
(round 5)
-0.102§
(0.067)
-0.039
(0.071)
-0.015
(0.042)
0.014
(0.042)
Pure
Rewards
(round 5)
Pure
Rewards
(round 5)
Mix FB and
Rewards
(round 5)
Mix FB and
Rewards
(round 5)
OVERALL TREATMENT EFFECTS
-0.099*
-0.028
(0.058)
(0.039)
-0.007
0.012
(0.064)
(0.040)
0.340***
0.336***
0.153**
0.158**
(0.052)
(0.055)
(0.066)
(0.053)
0.254***
0.250**
0.103*
0.108**
Reputational Rewards
(0.067)
(0.066)
(0.054)
(0.053)
No
No
No
Yes
No
Yes
Yes
Controlled for strata
Yes
Interactions
No
No
No
No
No
No
No
No
N
5246
5246
5093
5093
5093
5093
5093
5093
Note: Robust standard errors adjusted for clustering at class level are in parentheses. The treatment effects are calculated with respect to the control
group. Columns (2), (4), (6) and (8) control for stratum fixed effects (areas (by distance from the capital city, Kampala), school performance at national
examination and grade level; P6,P7, S1 up to S4). N stands for the number of observations. First two columns analyze the effect between testing round 4
and the baseline testing in round 1. The remaining estimates are based on the differences between round 5 (the final testing round) and the baseline
round 1. § significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Within-class social comparison
(Treatment 1)
Across-class social comparison
(Treatment 2)
Financial Rewards
-0.040
(0.074)
0.027
(0.073)
0.023
(0.043)
0.062§
(0.042)
Appendix C6: Different specifications used in estimation of the dis-aggregated average treatment effects of different incentive schemes on
performance in Mathematics and English
Mathematics
English
Dependent variable:
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
Within-class feedback, no
rewards (T1_solo)
Across-class feedback, no
rewards (T2_solo)
Financial Rewards, no
feedback (Fin_solo)
Reputational Rewards, no
feedback (Rep_solo)
Within-class feedback,
monetary reward (T1_fin)
Across-class feedback,
monetary reward (T2_fin)
Within-class feedback,
reputat.reward (T1_rep)
Across-class feedback,
reputat.reward (T2_rep)
0.100
(0.085)
0.082
(0.073)
0.106
(0.101)
0.138
(0.141)
0.231*
(0.118)
0.277**
(0.139)
0.209**
(0.103)
0.188**
(0.080)
0.104
(0.085)
0.081
(0.073)
0.097
(0.099)
0.115
(0.142)
0.215*
(0.115)
0.281**
(0.136)
0.196**
(0.099)
0.138*
(0.079)
0.002
(0.002)
Average class size
Gender
0.101
(0.085)
0.077
(0.074)
0.101
(0.101)
0.137
(0.142)
0.229*
(0.119)
0.279**
(0.141)
0.203**
(0.103)
0.183**
(0.081)
Baseline score
Controlled for strata
N
0.095
(0.087)
0.070
(0.075)
0.085
(0.106)
0.135
(0.145)
0.215
(0.120)
0.251
(0.144)
0.212
(0.106)
0.179
(0.086)
-0.009
(0.063)
0.725***
(0.017)
Yes
5102
0.723***
(0.017)
Yes
5102
-0.128**
(0.056)
-0.049
(0.059)
0.045
(0.088)
0.016
(0.082)
0.103
(0.094)
0.173*
(0.094)
0.087
(0.080)
0.047
(0.080)
-0.068***
(0.022)
Public
Food
0.100
(0.084)
0.081
(0.075)
0.104
(0.105)
0.135
(0.145)
0.227*
(0.127)
0.272*
(0.146)
0.205*
(0.111)
0.185**
(0.087)
0.719***
(0.017)
Yes
5065
0.725***
(0.017)
Yes
5102
0.089***
(0.029)
0.723***
(0.017)
Yes
4906
-0.127**
(0.056)
-0.049
(0.059)
0.043
(0.089)
0.009
(0.084)
0.099
(0.095)
0.174*
(0.093)
0.083
(0.081)
0.034
(0.085)
0.001
(0.001)
-0.134**
(0.056)
-0.052
(0.059)
0.042
(0.088)
0.016
(0.081)
0.097
(0.093)
0.166*
(0.093)
0.081
(0.079)
0.043
(0.079)
-0.129**
(0.055)
-0.045
(0.061)
0.053
(0.087)
0.027
(0.082)
0.124
(0.097)
0.193**
(0.097)
0.104
(0.086)
0.059
(0.081)
0.052***
(0.020)
0.041
(0.049)
0.737***
(0.016)
Yes
5093
0.737***
(0.016)
Yes
5093
-0.134**
(0.055)
-0.043
(0.060)
0.063
(0.089)
0.035
(0.083)
0.108
(0.091)
0.189**
(0.090)
0.097
(0.082)
0.053
(0.081)
0.737***
(0.016)
Yes
5056
0.736***
(0.016)
Yes
5093
0.058***
(0.021)
0.732***
(0.016)
Yes
4896
Appendix D: Gender differences
Appendix D1: Gender differences in the average treatment effects of aggregated interventions on the overall performance (Math and
English pooled)
Dependent variable:
Math score
(1)
Feedback provided (T)
0.104**
(0.043)
(6)
0.097**
(0.041)
0.098**
(0.058)
0.109**
(0.060)
Within-class feedback
(T1)
Across-class feedback
(T2)
Rewards provided (Rew)
Financial Rewards
(Finrew)
Reputational Rewards
(Reprew)
Controlled for strata
N
OVERALL PERFORMANCE BY GIRLS
(2)
(3)
(4)
(5)
0.067
(0.065)
Yes
2862
Yes
2862
Yes
2862
0.089**
(0.044)
0.106**
(0.050)
0.114§
(0.076)
0.083
(0.059)
Yes
2862
0.042
(0.061)
Yes
2862
(7)
OVERALL PERFORMANCE BY BOYS
(8)
(9)
(10)
(11)
0.026
(0.048)
0.134**
(0.063)
0.105§
(0.072)
0.070
(0.057)
Yes
2862
Yes
2209
Yes
2209
0.013
(0.047)
0.038
(0.056)
0.016
(0.053)
Yes
2209
0.133**
(0.063)
0.259***
(0.066)
0.129**
(0.059)
Yes
2209
Yes
2209
(12)
0.019
(0.052)
0.021
(0.051)
0.258***
(0.065)
0.128**
(0.059)
Yes
2209
Note: Rows represent aggregated treatment groups. Students in group “T” are those who received any type of feedback (either within-class feedback
(T1) or across-class feedback (T2)). Students in the group “Rew” received any type of reward (either financial reward (Finrew) or reputational reward
(Reprew)). Columns (1) – (6) represent the treatment effects for girls, columns (7) – (12) for boys. In all specifications I controlled for strata
(i.e., students’ performance at the national examinations, area and the level of studies). N stands for the number of observations. Robust standard errors
adjusted for clustering at school level are in parentheses. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
Significant differences between estimates within particular specifications: there is a statistical difference between within-class feedback (T1) and
financial or reputational rewards (Finrew or Reprew) in the English performance (p=0.111 and p=0.043 respectively, column 12).
Appendix D2: Girls’ aggregated average treatment effects on performance in Mathematics and English
Dependent variable: Math score

MATHEMATICS                     (1)         (2)         (3)         (4)         (5)         (6)
Feedback provided (T)           0.159***                                        0.154***
                                (0.053)                                         (0.053)
Within-class feedback (T1)                              0.157***                            0.149**
                                                        (0.058)                             (0.058)
Across-class feedback (T2)                              0.163***                            0.159***
                                                        (0.060)                             (0.061)
Rewards provided (Rew)                      0.093                               0.073
                                            (0.076)                             (0.071)
Financial Rewards (Finrew)                                          0.103                   0.088
                                                                    (0.095)                 (0.088)
Reputational Rewards (Reprew)                                       0.085                   0.062
                                                                    (0.075)                 (0.071)
Controlled for strata           Yes         Yes         Yes         Yes         Yes         Yes
N                               2858        2858        2858        2858        2858        2858

ENGLISH                         (7)         (8)         (9)         (10)        (11)        (12)
Feedback provided (T)           0.001                                           -0.007
                                (0.041)                                         (0.039)
Within-class feedback (T1)                              -0.016                              -0.027
                                                        (0.044)                             (0.042)
Across-class feedback (T2)                              0.019                               0.014
                                                        (0.046)                             (0.045)
Rewards provided (Rew)                      0.093§                              0.094*
                                            (0.058)                             (0.057)
Financial Rewards (Finrew)                                          0.089                   0.094
                                                                    (0.069)                 (0.068)
Reputational Rewards (Reprew)                                       0.095§                  0.099*
                                                                    (0.056)                 (0.058)
Controlled for strata           Yes         Yes         Yes         Yes         Yes         Yes
N                               2854        2854        2854        2854        2854        2854
Note: Rows represent aggregated treatment groups. Students in group “T” are those who received any type of feedback (either within-class feedback
(T1) or across-class feedback (T2)). Students in the group “Rew” received any type of reward (either financial reward (Finrew) or reputational reward
(Reprew)). Columns (1) – (6) represent the treatment effects in Mathematics, columns (7) – (12) in English. In all specifications I controlled for strata
(i.e., students’ performance at the national examinations, area and the level of studies). N stands for the number of observations. Robust standard errors
adjusted for clustering at school level are in parentheses. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
Significant differences between estimates within particular specifications: there is a statistical difference between within-class feedback (T1) and
financial or reputational rewards (Finrew or Reprew) in the English performance (p=0.111 and p=0.043 respectively, column 12).
Appendix D3: Boys’ aggregated average treatment effects on performance in Mathematics and English
Dependent variable: Math score

MATHEMATICS                     (1)         (2)         (3)         (4)         (5)         (6)
Feedback provided (T)           0.027                                           0.019
                                (0.061)                                         (0.060)
Within-class feedback (T1)                              0.055                               0.038
                                                        (0.073)                             (0.071)
Across-class feedback (T2)                              -0.001                              0.003
                                                        (0.064)                             (0.065)
Rewards provided (Rew)                      0.192***                            0.191***
                                            (0.022)                             (0.069)
Financial Rewards (Finrew)                                          0.213**                 0.207**
                                                                    (0.089)                 (0.089)
Reputational Rewards (Reprew)                                       0.175**                 0.170**
                                                                    (0.074)                 (0.073)
Controlled for strata           Yes         Yes         Yes         Yes         Yes         Yes
N                               2207        2207        2207        2207        2207        2207

ENGLISH                         (7)         (8)         (9)         (10)        (11)        (12)
Feedback provided (T)           -0.011                                          -0.017
                                (0.047)                                         (0.046)
Within-class feedback (T1)                              -0.022                              -0.038
                                                        (0.054)                             (0.051)
Across-class feedback (T2)                              0.001                               0.005
                                                        (0.053)                             (0.051)
Rewards provided (Rew)                      0.157**                             0.158**
                                            (0.068)                             (0.068)
Financial Rewards (Finrew)                                          0.226***                0.234***
                                                                    (0.078)                 (0.078)
Reputational Rewards (Reprew)                                       0.106§                  0.111*
                                                                    (0.067)                 (0.066)
Controlled for strata           Yes         Yes         Yes         Yes         Yes         Yes
N                               2202        2202        2202        2202        2202        2202
Note: Rows represent aggregated treatment groups. Students in group “T” are those who received any type of feedback (either within-class feedback
(T1) or across-class feedback (T2)). Students in the group “Rew” received any type of reward (either financial reward (Finrew) or reputational reward
(Reprew)). Columns (1) – (6) represent the treatment effects in Mathematics, columns (7) – (12) in English. In all specifications I controlled for strata
(i.e., students’ performance at the national examinations, area and the level of studies). N stands for the number of observations. Robust standard errors
adjusted for clustering at school level are in parentheses. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
Significant differences between estimates within particular specifications: in Mathematics, there is a statistical difference between provision of feedback
(T) and rewards (Rew) (p=0.045, column 5), between within-class feedback (T1) and financial or reputational rewards (Finrew or Reprew) in the
English performance (p=0.112 and p=0.027 respectively, column 12), and between across-class feedback (T2) and reputational rewards (p=0.092).
In English, there are statistical differences between financial and reputational rewards in the specification 10 (p=0.046), between provision of feedback
(T) and rewards (Rew) in the specification 11 (p=0.023), and in the specification 12 there are differences between financial and reputational reward
estimates (p=0.044), between within-class feedback (T1) and financial or reputational reward estimates (p=0.002 and p=0.068 respectively), and
between across-class feedback (T2) and financial reward estimates (p=0.009).
Appendix D4: Gender differences in aggregated average treatment effects on stress and happiness
Dependent variable:
Math score
Feedback provided (T)
Within-class feedback
(T1)
Across-class feedback
(T2)
Rewards provided (Rew)
OVERALL
(1)
-0.133
(0.204)
0.516*
(0.269)
(2)
0.003
(0.233)
-0.275
(0.213)
PERCEIVED STRESS
GIRLS
(3)
-0.089
(0.234)
0.508*
(0.284)
(4)
0.105
(0.259)
-0.295
(0.239)
BOYS
(5)
OVERALL
(6)
-0.183
(0.216)
0.533*
(0.295)
(7)
(8)
-0.326*
(0.195)
-0.119
(0.255)
-0.248
(0.239)
-0.437*
(0.228)
PERCEIVED HAPPINESS
GIRLS
(9)
(10)
-0.313§
(0.214)
-0.428*
(0.224)
-0.219
(0.219)
-0.349
(0.255)
-0.399*
(0.235)
-0.221
(0.253)
BOYS
(11)
-0.348
(0.251)
-0.544*
(0.304)
(12)
-0.483§
(0.297)
-0.213
(0.268)
Financial Rewards
0.549**
0.599**
0.477
-0.398
-0.331
-0.504
(Finrew)
(0.277)
(0.282)
(0.334)
(0.258)
(0.277)
(0.375)
Reputational Rewards
0.408
0.307
0.547*
-0.408
-0.314
-0.493§
(0.318)
(0.327)
(Reprew)
(0.304)
(0.331)
(0.267)
(0.298)
Controlled for strata
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Note: Rows represent aggregated treatment groups. Students in group “T” are those who received any type of feedback (either within-class feedback
(T1) or across-class feedback (T2)). Students in the group “Rew” received any type of reward (either financial reward (Finrew) or reputational reward
(Reprew)). Columns (1) – (6) represent the treatment effects on perceived stress, columns (7) – (12) on perceived happiness. In all specifications I controlled for strata
(i.e., students’ performance at the national examinations, area and the level of studies). N stands for the number of observations. Robust standard errors
adjusted for clustering at school level are in parentheses. § significant at 15%, * significant at 10%; ** significant at 5%; *** significant at 1%.
Significant differences between estimates within particular specifications: there is a statistical difference between within-class feedback (T1) and
financial or reputational rewards (Finrew or Reprew) in the English performance (p=0.111 and p=0.043 respectively, column 12).
Appendix D5: Gender differences in aggregated average treatment effects – by gender and by
subject
Note: The graph illustrates the improvements in Mathematics and English (in standard deviations) of girls
and boys in different treatment groups with respect to the control group. The bars correspond to robust
standard errors. T1 stands for the within-class comparative feedback, T2 stands for the across-class
comparative feedback, Finrew stands for the financial-reward group and Reprew stands for the reputational-reward group.
Appendix D6: Gender differences in dis-aggregated average treatment effects on students’ overconfidence,
by subject
Note: Students answered short questionnaires before and after every exam. Overconfidence is measured as the
difference between the number of points the student expected in the particular subject and the score the
student actually achieved, i.e., the number of points by which students overestimated their performance.
The maximum number of points in each subject was 50. G stands for the estimated effects on
girls, B for boys. The bars correspond to robust standard errors.
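The overconfidence measure described in the note can be sketched as follows. This is an illustration only, not the code used for the figure; the function names and the example numbers are hypothetical, and the only facts taken from the note are the definition (expected minus achieved points) and the 50-point maximum.

```python
def overconfidence(expected_points, actual_points, max_points=50):
    """Expected minus achieved points; a positive value means overestimation."""
    for value in (expected_points, actual_points):
        if not 0 <= value <= max_points:
            raise ValueError("points must lie between 0 and max_points")
    return expected_points - actual_points


def mean_overconfidence(pairs):
    """Average overconfidence over (expected, actual) pairs of one group."""
    gaps = [overconfidence(expected, actual) for expected, actual in pairs]
    return sum(gaps) / len(gaps)


# Hypothetical example: one student overestimates by 8 points,
# another underestimates by 5 points.
example_pairs = [(30, 22), (20, 25)]
group_average = mean_overconfidence(example_pairs)
```

A negative value under this definition corresponds to a student who underestimated his or her score.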
Appendix D6: The evolution of the average treatment effects over time – by gender and by subject
Note: The graph illustrates the development in improvements in overall performance, in Mathematics and English separately (in standard deviations) of
girls and boys in different feedback treatment groups with respect to the control group. The bars correspond to robust standard errors. T1 stands for
the within-class comparative feedback, T2 stands for the across-class comparative feedback and C for the control group.
Appendix E: Group composition
Appendix E1: Comparison of performance, exerted effort and subjective well-being of mixed- and
high-ability groups to poor-ability groups and their responses to the provision of rewards
ABILITY COMPOSITION
OF GROUPS
Mathematics
English
Perceived
Stress
Subjective
Happiness
Effort
Math
Effort
English
Mixed ability: pure
feedback
Well performers: pure
feedback
0.055
(0.059)
0.451***
(0.129)
0.133**
(0.063)
0.387***
(0.078)
0.275**
(0.127)
0.561***
(0.206)
0.521***
(0.086)
0.449***
(0.099)
0.096
(0.094)
0.078
(0.176)
-0.069
(0.094)
0.082
(0.096)
0.004
(0.096)
0.126§
(0.081)
Mixed ability: feedback
and monetary reward
Well performers: feedback
and monetary reward
-0.082
(0.078)
0.001
(0.161)
Mixed ability: feedback
and reputational reward
Well performers: feedback
and reputational reward
0.262***
(0.092)
0.322**
(0.149)
0.357***
(0.078)
0.393***
(0.107)
0.181
(0.143)
0.327**
(0.148)
Financial rewards
0.271***
(0.079)
0.253**
(0.123)
0.672***
(0.029)
0.289*
(0.169)
0.280*
(0.141)
0.703***
(0.029)
0.219
(0.166)
0.129
(0.171)
0.252**
(0.125)
0.127
(0.148)
0.284**
(0.132)
0.071
(0.150)
Yes
1323
Yes
1392
Reputational rewards
Initial value
Stratification variables
N
Yes
1426
Yes
1425
0.313**
(0.124)
0.051
(0.254)
0.208
(0.156)
-0.028
(0.162)
0.082***
(0.026)
Yes
1327
0.121
(0.155)
0.225
(0.182)
0.094
(0.172)
-0.193
(0.172)
0.229***
(0.035)
0.109
(0.112)
0.053
(0.109)
0.050
(0.130)
0.058
(0.121)
0.254***
(0.032)
0.102
(0.128)
0.042
(0.128)
0.089
(0.128)
-0.036
(0.166)
0.219***
(0.039)
Yes
1365
Appendix E2: Comparison of performance, exerted effort and subjective well-being of mixed-gender
and pure-boy groups to groups consisting of pure girls and their responses to the provision of
rewards
GENDER COMPOSITION OF
GROUPS
Three boys: pure feedback
Two boys + One girl: pure
feedback
One boy + Two girls: pure
feedback
Three boys: feedback and
monetary reward
Two boys + 1 girl: feedback
and monetary reward
One boy + 2 girls: feedback
and monetary reward
Three boys: feedback and
reputational reward
Two boys + 1 girl: feedback
and reputational reward
One boy + 2 girls: feedback
and reputational reward
Financial rewards
Reputational rewards
Initial value
Stratification variables
N
Math
0.065
(0.071)
0.184***
(0.055)
0.164***
(0.044)
Perceived
Stress
Subjective
Happiness
Effort
Math
Effort
English
-0.009
(0.065)
-0.026
(0.068)
0.002
(0.044)
-0.056
(0.165)
-0.104
(0.111)
-0.022
(0.104)
-0.288**
(0.136)
-0.160**
(0.079)
-0.204**
(0.099)
0.067
(0.148)
-0.121
(0.133)
-0.192*
(0.106)
-0.143
(0.139)
-0.171§
(0.107)
-0.258**
(0.108)
0.197§
(0.129)
0.153
(0.107)
0.203**
(0.106)
0.046
(0.272)
0.263
(0.205)
0.174
(0.197)
-0.258
(0.207)
-0.302*
(0.179)
-0.182
(0.172)
0.033
(0.167)
0.071
(0.139)
-0.046
(0.156)
-0.256
(0.173)
-0.137
(0.134)
-0.135
(0.146)
English
0.263
(0.301)
0.465***
(0.143)
0.359**
(0.154)
0.264
(0.181)
0.310**
(0.139)
0.303***
(0.106)
0.411***
(0.154)
0.403***
(0.125)
0.707***
(0.029)
Yes
1624
0.309**
(0.137)
0.382***
(0.119)
0.734***
(0.028)
Yes
1623
0.504***
(0.149)
0.388***
(0.113)
0.206§
(0.131)
0.295
(0.366)
0.142
(0.236)
0.302§
(0.197)
-0.029
(0.205)
-0.114
(0.215)
0.081***
(0.026)
Yes
1624
-0.117
(0.234)
-0.303*
(0.172)
-0.345**
(0.123)
-0.096
(0.187)
-0.019
(0.148)
0.077
(0.139)
-0.196
(0.157)
-0.182
(0.178)
0.233***
(0.034)
Yes
1313
0.195
(0.167)
-0.071
(0.138)
0.258***
(0.032)
Yes
1392
-0.259
(0.216)
-0.007
(0.128)
-0.034
(0.139)
0.029
(0.189)
-0.167
(0.146)
0.224***
(0.039)
Yes
1365
Appendix F: Other
Appendix F1: Aggregated average treatment effects of different incentive schemes on performance
in Mathematics and English – by whether students are of the official age for their class or older
Note: The graph illustrates the differences in improvements in Mathematics and English (in standard
deviations) between students of official age and unofficial age (students older than usual age at the particular
level) in different treatment groups with respect to the control group (e.g., if the official age of getting to the
primary school is 6, then the official age at the P6 level is 12 (±1)). The bars correspond to robust standard
errors. T1 stands for the within-class comparative feedback, T2 stands for the across-class comparative
feedback, Finrew stands for financial-reward group and Reprew stands for the reputational-reward group.
Appendix F2: Comparison of the estimates of the average treatment effects of different motivation schemes
on students’ performance in Mathematics and English using various specifications
                                OLS           IPW           Imputation    Imputation    Median
                                                            (median       (class        Regression
                                                            ratio)        percentiles)
MATHEMATICS
Within-class feedback (T1)      0.099*        0.076         0.124*        0.112*        0.096**
                                (0.059)       (0.065)       (0.063)       (0.055)       (0.052)
Across-class feedback (T2)      0.089§        0.107*        0.116**       0.096*        0.069
                                (0.056)       (0.066)       (0.054)       (0.053)       (0.051)
Financial Rewards (Finrew)      0.142*        0.327***      0.198**       0.169**       0.175**
                                (0.078)       (0.100)       (0.081)       (0.079)       (0.083)
Reputational Rewards (Reprew)   0.115*        0.152**       0.164**       0.157**       0.124**
                                (0.064)       (0.073)       (0.073)       (0.067)       (0.062)
Baseline Mathematics score      0.725***      0.731***      0.757**       0.668***      0.750***
                                (0.017)       (0.021)       (0.049)       (0.019)       (0.019)
Controlled for strata           Yes           Yes           Yes           Yes           Yes
ENGLISH
Within-class feedback (T1)      -0.028        -0.029        0.019         -0.011        -0.018
                                (0.039)       (0.048)       (0.052)       (0.042)       (0.042)
Across-class feedback (T2)      0.012         0.028         0.056         0.025         0.017
                                (0.040)       (0.048)       (0.051)       (0.043)       (0.042)
Financial Rewards (Finrew)      0.158**       0.290***      0.159*        0.211***      0.176**
                                (0.053)       (0.083)       (0.075)       (0.066)       (0.075)
Reputational Rewards (Reprew)   0.108**       0.155**       0.103         0.158***      0.095§
                                (0.053)       (0.068)       (0.073)       (0.056)       (0.058)
Baseline English score          0.739***      0.696***      0.737***      0.691***      0.758***
                                (0.016)       (0.024)       (0.026)       (0.016)       (0.016)
Controlled for strata           Yes           Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster. IPW stands for inverse probability weighted regression, which adjusts for students’ probability of
dropping out. The imputation based on the median ratio imputes the last available observation in Mathematics or
English, adjusted for the difference in test difficulty using the median ratio. In the imputation based on class
percentiles, I first find the percentile rank of the student’s last observed score within his class and
then assign the student the grade from testing round 5 of a student with the same percentile rank.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
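The two imputation rules in the note can be illustrated with a short sketch. It is one possible reading of the verbal description, not the procedure actually used for the estimates: the function names, the exact percentile convention (share of classmates scoring at or below the student) and the way ties are resolved are all assumptions.

```python
import math
from statistics import median


def percentile_rank(score, class_scores):
    """Share of scores in the class that are at or below the given score."""
    return sum(s <= score for s in class_scores) / len(class_scores)


def impute_by_class_percentile(last_score, last_round_scores, final_round_scores):
    """Assign the final-round score found at the student's last percentile rank.

    last_round_scores: classmates' scores (student included) in the round
    where the student was last observed.
    final_round_scores: scores of classmates observed in the final round.
    """
    rank = percentile_rank(last_score, last_round_scores)
    ordered = sorted(final_round_scores)
    # Position of the classmate sitting at the same percentile (assumed rule).
    index = max(0, math.ceil(rank * len(ordered)) - 1)
    return ordered[index]


def impute_by_median_ratio(last_score, last_round_scores, final_round_scores):
    """Rescale the last observed score by the ratio of round medians.

    The ratio of medians stands in for the difference in test difficulty.
    """
    ratio = median(final_round_scores) / median(last_round_scores)
    return last_score * ratio


# Hypothetical class: the student scored 12 when last observed.
last_round = [5, 12, 20, 8]
final_round = [10, 30, 20, 40, 50]
by_percentile = impute_by_class_percentile(12, last_round, final_round)
by_ratio = impute_by_median_ratio(12, last_round, final_round)
```

In this toy class the student sits at the 75th percentile when last observed, so the percentile rule hands over the final-round score found at that position, while the ratio rule scales the old score by the change in the class median.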
Appendix F3: Comparison of the estimates of the disaggregated average treatment effects of
different motivation schemes on students’ performance in Mathematics using various specifications
                                                      OLS           IPW           Imputation    Imputation    Median
                                                                                  (median       (class        Regression
                                                                                  correction)   percentiles)
PURE TREATMENT EFFECTS
Within-class feedback, no rewards (T1_solo)           0.100         0.046         0.133*        0.070         0.121*
                                                      (0.085)       (0.092)       (0.079)       (0.082)       (0.072)
Across-class feedback, no rewards (T2_solo)           0.082         0.067         0.129*        0.036         0.070
                                                      (0.073)       (0.079)       (0.068)       (0.627)       (0.069)
Financial Rewards, no feedback (Fin_solo)             0.106         0.151§        0.169*        0.070         0.162*
                                                      (0.101)       (0.102)       (0.096)       (0.097)       (0.093)
Reputational Rewards, no feedback (Rep_solo)          0.138         0.188         0.206*        0.092         0.159
                                                      (0.141)       (0.149)       (0.124)       (0.115)       (0.132)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.231*        0.338**       0.281**       0.202*        0.267**
                                                      (0.118)       (0.135)       (0.129)       (0.116)       (0.113)
Across-class feedback, monetary reward (T2_fin)       0.277**       0.456***      0.331**       0.209§        0.296*
                                                      (0.139)       (0.132)       (0.128)       (0.130)       (0.160)
Within-class feedback, reputational reward (T1_rep)   0.209**       0.212*        0.266**       0.171*        0.214**
                                                      (0.103)       (0.108)       (0.112)       (0.099)       (0.100)
Across-class feedback, reputational reward (T2_rep)   0.188**       0.208**       0.186**       0.164**       0.193***
                                                      (0.080)       (0.087)       (0.073)       (0.076)       (0.075)
Baseline Math score                                   0.725***      0.732***      0.755***      0.658***      0.747***
                                                      (0.017)       (0.021)       (0.048)       (0.019)       (0.019)
Controlled for strata                                 Yes           Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster. IPW stands for inverse probability weighted regression, which adjusts for students’ probability of
dropping out. The imputation based on the median ratio imputes the last available observation in Mathematics or
English, adjusted for the difference in test difficulty using the median ratio. In the imputation based on class
percentiles, I first find the percentile rank of the student’s last observed score within his class and
then assign the student the grade from testing round 5 of a student with the same percentile rank.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
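The idea behind the IPW column can be illustrated with a minimal sketch. It assumes the probability of staying in the sample has already been estimated for each remaining student (how those probabilities were obtained is not specified in the note), and it compares weighted group means rather than running the full regression with baseline scores and stratum fixed effects. All names and numbers are hypothetical.

```python
def ipw_weights(stay_probabilities):
    """Weight each remaining student by 1 / P(student stays in the sample)."""
    weights = []
    for p in stay_probabilities:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must lie in (0, 1]")
        weights.append(1.0 / p)
    return weights


def weighted_mean(values, weights):
    """Mean of values with the given non-negative weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ipw_mean_difference(scores, treated, stay_probabilities):
    """Weighted treated-minus-control difference in mean scores."""
    weights = ipw_weights(stay_probabilities)
    groups = {True: ([], []), False: ([], [])}
    for score, is_treated, weight in zip(scores, treated, weights):
        values, group_weights = groups[bool(is_treated)]
        values.append(score)
        group_weights.append(weight)
    return weighted_mean(*groups[True]) - weighted_mean(*groups[False])


# Hypothetical data: students who were likely to drop out (p = 0.5) but stayed
# count twice, standing in for similar students who left the sample.
scores = [1.0, 3.0, 0.0, 2.0]
treated = [True, True, False, False]
stay_probabilities = [0.5, 1.0, 1.0, 0.5]
difference = ipw_mean_difference(scores, treated, stay_probabilities)
```

Students with a low probability of staying receive a larger weight, so the weighted sample resembles the original sample before attrition.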
Appendix F4: Comparison of the estimates of the disaggregated average treatment effects of
different motivation schemes on students’ performance in English using various specifications
                                                      OLS           IPW           Imputation    Imputation    Median
                                                                                  (median       (class        Regression
                                                                                  correction)   percentiles)
PURE TREATMENT EFFECTS
Within-class feedback, no rewards (T1_solo)           -0.128**      -0.133*       -0.151***     -0.207***     -0.149***
                                                      (0.056)       (0.070)       (0.043)       (0.062)       (0.053)
Across-class feedback, no rewards (T2_solo)           -0.049        -0.079        -0.062        -0.139**      -0.074
                                                      (0.059)       (0.072)       (0.046)       (0.065)       (0.055)
Financial Rewards, no feedback (Fin_solo)             0.045         0.032         0.036         -0.047        0.009
                                                      (0.088)       (0.085)       (0.069)       (0.093)       (0.084)
Reputational Rewards, no feedback (Rep_solo)          0.016         0.004         0.026         -0.099        -0.025
                                                      (0.082)       (0.084)       (0.059)       (0.086)       (0.078)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.103         0.145*        0.065         0.043         0.129
                                                      (0.094)       (0.086)       (0.079)       (0.101)       (0.107)
Across-class feedback, monetary reward (T2_fin)       0.173*        0.248**       0.128*        0.062         0.175**
                                                      (0.094)       (0.102)       (0.074)       (0.104)       (0.084)
Within-class feedback, reputational reward (T1_rep)   0.087         0.041         0.062         -0.009        0.096
                                                      (0.080)       (0.078)       (0.057)       (0.081)       (0.077)
Across-class feedback, reputational reward (T2_rep)   0.047         0.071         0.052         -0.042        -0.039
                                                      (0.080)       (0.077)       (0.065)       (0.087)       (0.079)
Baseline English score                                0.737***      0.697***      0.702***      0.682***      0.759***
                                                      (0.016)       (0.023)       (0.014)       (0.017)       (0.017)
Controlled for strata                                 Yes           Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster. IPW stands for inverse probability weighted regression, which adjusts for students’ probability of
dropping out. The imputation based on the median ratio imputes the last available observation in Mathematics or
English, adjusted for the difference in test difficulty using the median ratio. In the imputation based on class
percentiles, I first find the percentile rank of the student’s last observed score within his class and
then assign the student the grade from testing round 5 of a student with the same percentile rank.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Appendix F5: Quantile regressions – aggregated treatment effects on performance in Mathematics
and English
                                OLS           Quantile      Quantile      Quantile
                                              Regression    Regression    Regression
                                              (q=0.25)      (q=0.5)       (q=0.75)
MATHEMATICS
Within-class feedback (T1)      0.099*        0.069§        0.096**       0.101§
                                (0.059)       (0.047)       (0.052)       (0.064)
Across-class feedback (T2)      0.089§        0.069§        0.069         0.061
                                (0.056)       (0.048)       (0.051)       (0.063)
Financial Rewards (Finrew)      0.142*        0.145*        0.175**       0.127
                                (0.078)       (0.080)       (0.083)       (0.092)
Reputational Rewards (Reprew)   0.115*        0.078         0.124**       0.125*
                                (0.064)       (0.063)       (0.062)       (0.075)
Baseline Mathematics score      0.725***      0.702***      0.750***      0.770***
                                (0.017)       (0.024)       (0.019)       (0.027)
Controlled for strata           Yes           Yes           Yes           Yes
ENGLISH
Within-class feedback (T1)      -0.028        -0.061§       -0.018        -0.021
                                (0.039)       (0.039)       (0.042)       (0.049)
Across-class feedback (T2)      0.012         0.014         0.017         -0.021
                                (0.040)       (0.044)       (0.042)       (0.044)
Financial Rewards (Finrew)      0.158**       0.181**       0.176**       0.160**
                                (0.053)       (0.071)       (0.075)       (0.065)
Reputational Rewards (Reprew)   0.108**       0.099*        0.095§        0.105*
                                (0.053)       (0.058)       (0.058)       (0.058)
Baseline English score          0.739***      0.746***      0.758***      0.764***
                                (0.016)       (0.018)       (0.016)       (0.020)
Controlled for strata           Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster; imputation would not work properly in that case.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
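For readers less familiar with quantile regression, the following sketch shows the loss function that such estimators minimise, in the simplest possible setting without regressors. It is only an illustration of the principle: the estimates in the table come from full quantile regressions with covariates, and the function names and numbers below are hypothetical.

```python
def check_loss(residual, q):
    """Asymmetric absolute loss: weight q above the fit, 1 - q below it."""
    return residual * (q - (1 if residual < 0 else 0))


def sample_quantile_by_check_loss(values, q):
    """Return the observed value that minimises the total check loss.

    With no regressors, the minimiser is a sample q-quantile; ties are
    resolved in favour of the smallest minimising value.
    """
    candidates = sorted(values)
    return min(
        candidates,
        key=lambda c: sum(check_loss(v - c, q) for v in values),
    )


# Hypothetical scores with one outlier: the median fit ignores the outlier,
# whereas the mean (22.0) is pulled strongly towards it.
scores = [1, 2, 3, 4, 100]
median_fit = sample_quantile_by_check_loss(scores, 0.5)
lower_quartile_fit = sample_quantile_by_check_loss(scores, 0.25)
```

Setting q to 0.25, 0.5 or 0.75 shifts the fit towards weaker, median or stronger students, which is why the columns of the table can differ from the OLS column.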
Appendix F6a: Quantile regressions – dis-aggregated treatment effects on performance in
Mathematics
                                                      OLS           Quantile      Quantile      Quantile
                                                                    Regression    Regression    Regression
                                                                    (q=0.25)      (q=0.5)       (q=0.75)
PURE TREATMENT EFFECTS
Within-class feedback, no rewards (T1_solo)           0.100         0.097§        0.121*        0.087
                                                      (0.085)       (0.059)       (0.072)       (0.092)
Across-class feedback, no rewards (T2_solo)           0.082         0.084         0.070         0.050
                                                      (0.073)       (0.059)       (0.069)       (0.093)
Financial Rewards, no feedback (Fin_solo)             0.106         0.131§        0.162*        0.012
                                                      (0.101)       (0.089)       (0.093)       (0.123)
Reputational Rewards, no feedback (Rep_solo)          0.138         0.181*        0.159         0.137
                                                      (0.141)       (0.102)       (0.132)       (0.156)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.231*        0.240**       0.267**       0.239§
                                                      (0.118)       (0.123)       (0.113)       (0.147)
Across-class feedback, monetary reward (T2_fin)       0.277**       0.253*        0.296*        0.257
                                                      (0.139)       (0.137)       (0.160)       (0.192)
Within-class feedback, reputational reward (T1_rep)   0.209**       0.148§        0.214**       0.237*
                                                      (0.103)       (0.092)       (0.100)       (0.128)
Across-class feedback, reputational reward (T2_rep)   0.188**       0.163*        0.193***      0.141
                                                      (0.080)       (0.096)       (0.075)       (0.099)
Baseline Math score                                   0.725***      0.700***      0.747***      0.764***
                                                      (0.017)       (0.024)       (0.019)       (0.028)
Controlled for strata                                 Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster; imputation would not work properly in that case.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Appendix F6b: Quantile regressions – dis-aggregated treatment effects on performance in English
                                                      OLS           Quantile      Quantile      Quantile
                                                                    Regression    Regression    Regression
                                                                    (q=0.25)      (q=0.5)       (q=0.75)
PURE TREATMENT EFFECTS
Within-class feedback, no rewards (T1_solo)           -0.128**      -0.137***     -0.149***     -0.127**
                                                      (0.056)       (0.052)       (0.053)       (0.064)
Across-class feedback, no rewards (T2_solo)           -0.049        -0.025        -0.074        -0.074
                                                      (0.059)       (0.055)       (0.055)       (0.064)
Financial Rewards, no feedback (Fin_solo)             0.045         0.078         0.009         -0.029
                                                      (0.088)       (0.100)       (0.084)       (0.099)
Reputational Rewards, no feedback (Rep_solo)          0.016         0.009         -0.025        -0.002
                                                      (0.082)       (0.084)       (0.078)       (0.098)
TREATMENT INTERACTIONS
Within-class feedback, monetary reward (T1_fin)       0.103         0.049         0.129         0.169*
                                                      (0.094)       (0.091)       (0.107)       (0.089)
Across-class feedback, monetary reward (T2_fin)       0.173*        0.209**       0.175**       0.099
                                                      (0.094)       (0.097)       (0.084)       (0.078)
Within-class feedback, reputational reward (T1_rep)   0.087         0.093         0.096         0.060
                                                      (0.080)       (0.084)       (0.077)       (0.087)
Across-class feedback, reputational reward (T2_rep)   0.047         -0.024        -0.039        0.040
                                                      (0.080)       (0.089)       (0.079)       (0.077)
Baseline English score                                0.737***      0.749***      0.759***      0.759***
                                                      (0.016)       (0.017)       (0.017)       (0.018)
Controlled for strata                                 Yes           Yes           Yes           Yes
Note: Robust standard errors adjusted for clustering at class level are in parentheses. Controlled for stratum
fixed effects (four areas by distance from the capital city, Kampala; school performance at the national
examination; and grade level (P6, P7, S1 up to S4)). N stands for the number of observations. One school is
excluded from the imputation methods due to high student turnover caused by frequent changes of
headmaster; imputation would not work properly in that case.
§ significant at 15%; * significant at 10%; ** significant at 5%; *** significant at 1%
Appendix F7: Comparison of uncorrected and corrected p-values using different multiple-comparison procedures: for aggregated and
disaggregated average treatment effects on students’ overall performance
P-VALUE
CORRECTIONS
Aggregated ATE
Within-class feedback
Across-class feedback
Disaggregated ATE
Within-class feedback, no
rewards
Across-class feedback, no
rewards
Financial Rewards, no
feedback
Reputational Rewards, no
feedback
Within-class feedback
with financial rewards
Across-class feedback
with financial rewards
Within-class feedback
with reputation rewards
Across-class feedback
with reputation rewards
Uncorrected
Bonferroni
Sidak
Holm
Holland
Hochberg
Simes
Yekutieli
0.102
0.130
1.000
1.000
0.754
0.836
0.409
0.409
0.350
0.350
0.389
0.389
0.133
0.153
0.422
0.487
0.774
1.000
1.000
0.961
0.776
1.000
1.000
0.685
0.413
0.347
0.413
0.086
0.555
1.000
0.776
0.776
0.059
1.000
1.000
0.776
0.293
0.659
0.306
0.058
§
0.026
0.091
§
0.025
0.087
§
0.047
0.464
0.035
0.014
0.012
0.097
0.667
0.257
0.221
1.000
1.000
1.000
0.493
1.000
1.000
0.316
§
0.956
0.961
0.275
§
0.228
0.135
§
0.127
§
0.135
0.856
0.581
0.457
0.581
0.199
0.128
0.121
0.128
0.587
0.131
1.000
1.000
0.205
Appendix F8: Comparison of uncorrected and corrected p-values using different multiple-comparison procedures: for aggregated and
disaggregated average treatment effects on students’ stress
P-VALUE CORRECTIONS                             Uncorrected  Bonferroni   Sidak        Holm         Holland      Hochberg     Simes        Yekutieli
Aggregated ATE
Within-class feedback                           0.827        1.000        1.000        1.000        0.970        0.838        0.838        1.000
Across-class feedback                           0.199        1.000        0.944        1.000        0.703        0.838        0.287        0.913
Disaggregated ATE
Within-class feedback, no rewards               0.600        1.000        1.000        1.000        0.993        0.938        0.712        1.000
Across-class feedback, no rewards               0.582        1.000        1.000        1.000        0.993        0.938        0.712        1.000
Financial Rewards, no feedback                  0.005        0.089        0.085        0.075        0.072        0.075        0.022        0.079
Reputational Rewards, no feedback               0.696        1.000        1.000        1.000        0.993        0.938        0.735        1.000
Within-class feedback with financial rewards    0.088        1.000        0.826        0.792        0.563        0.792        0.152        0.539
Across-class feedback with financial rewards    0.673        1.000        1.000        1.000        0.993        0.938        0.735        1.000
Within-class feedback with reputation rewards   0.226        1.000        0.992        1.000        0.833        0.938        0.330        1.000
Across-class feedback with reputation rewards   0.111        1.000        0.893        0.888        0.610        0.888        0.176        0.623
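Four of the corrections compared above can be sketched in a few lines. This is a generic textbook illustration, not the code behind the table: the number of hypotheses in each family used for the table is not stated here, so the example p-values and family size are hypothetical, and the Holland, Simes and Yekutieli procedures are omitted.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Sidak correction: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Step-down Holm adjustment (smallest p-value gets the largest factor)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def hochberg(pvalues):
    """Step-up Hochberg adjustment (largest p-value is left unchanged)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, i in enumerate(order):
        running_min = min(running_min, min(1.0, (rank + 1) * pvalues[i]))
        adjusted[i] = running_min
    return adjusted


# Hypothetical family of three tests.
raw = [0.01, 0.04, 0.03]
corrected = {
    "Bonferroni": bonferroni(raw),
    "Sidak": sidak(raw),
    "Holm": holm(raw),
    "Hochberg": hochberg(raw),
}
```

The step-wise procedures (Holm, Hochberg) are never more conservative than Bonferroni, which is consistent with the ordering of the corresponding columns in the table.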
Appendix F9: Comparison of uncorrected and corrected p-values using different multiple-comparison procedures: for aggregated and
disaggregated average treatment effects on students’ subjective happiness
P-VALUE CORRECTIONS                             Uncorrected  Bonferroni   Sidak        Holm         Holland      Hochberg     Simes        Yekutieli
Aggregated ATE
Within-class feedback                           0.036        0.468        0.379        0.304        0.266        0.288        0.078        0.248
Across-class feedback                           0.341        1.000        0.996        1.000        0.876        0.984        0.493        1.000
Disaggregated ATE
Within-class feedback, no rewards               0.169        1.000        0.970        1.000        0.773        0.979        0.268        0.949
Across-class feedback, no rewards               0.115§       1.000        0.902        1.000        0.667        0.979        0.199        0.705
Financial Rewards, no feedback                  0.316        1.000        0.999        1.000        0.874        0.979        0.399        1.000
Reputational Rewards, no feedback               0.281        1.000        0.998        1.000        0.874        0.979        0.381        1.000
Within-class feedback with financial rewards    0.005        0.089        0.086        0.071        0.069        0.071        0.018        0.064
Across-class feedback with financial rewards    0.517        1.000        0.999        1.000        0.946        0.979        0.614        1.000
Within-class feedback with reputation rewards   0.054        1.000        0.654        0.706        0.516        0.706        0.140        0.497
Across-class feedback with reputation rewards   0.059        1.000        0.684        0.708        0.518        0.708        0.140        0.497