Glossary of Terms
Unless otherwise indicated, definitions are sourced, and at times adapted, from Paul Gertler’s Impact Evaluation in Practice. Note: Italicized words indicate that the word is also an entry in the glossary.
Activity. Actions taken or work performed through which inputs, such as funds, technical assistance, and other types of resources, are mobilized to produce specific outputs.
Attrition. Attrition occurs when some units drop from the sample between one data collection round and another, for example, because migrants are not tracked. Attrition is a case of unit nonresponse. Attrition can create bias in impact evaluations if it is correlated with treatment status. Also known as “loss to follow-up” in epidemiology and biostatistics.
Baseline. The situation prior to an intervention, against which progress can be assessed or comparisons made. Baseline data are collected before a program or policy is implemented to assess the “before” state.
Before-and-after comparison. A before-and-after comparison attempts to establish the impact of a program by tracking changes in outcomes for program beneficiaries over time, using measurements before and after the program or policy is implemented. Also known as "pre-post comparison" or “reflexive comparison.”
Bias. The bias of an estimator is the difference between an estimator's expectation and the true value of the parameter being estimated. In impact evaluation, this is the difference between the impact that is calculated and the true impact of the program, and may result from measurement or non-measurement error.
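As an illustration (not drawn from the source text), the bias of an estimator can be approximated by simulation: average the estimator over many samples and subtract the true parameter value. The sketch below uses the naive sample variance, which is known to be biased downward; all numbers are hypothetical.

```python
# Illustrative sketch: bias = E[estimate] - true parameter value.
import random
import statistics

random.seed(0)
true_var = 1.0  # variance of a standard normal population


def naive_var(xs):
    # Divides by n instead of n - 1, which biases the estimate downward.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Average the estimator over many samples to approximate its expectation.
estimates = []
for _ in range(20000):
    sample = [random.gauss(0, 1) for _ in range(5)]
    estimates.append(naive_var(sample))

bias = statistics.mean(estimates) - true_var
# With n = 5, theory gives bias = -true_var / n = -0.2.
print(round(bias, 2))
```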
Cluster. A cluster is a group of units that are similar in one way or another. For example, in a sampling of school children, children who attend the same school would belong to a cluster because they share the same school facilities and teachers, and live in the same neighborhood.
Cluster sample. A sample obtained by drawing a random sample of clusters, after which either all units in the selected clusters constitute the sample, or a number of units within each selected cluster is randomly drawn. Each cluster has a well-defined probability of being selected, and units within a selected cluster also have a well-defined probability of being drawn.
Collinearity. Collinearity occurs when two variables treated as independent in a regression analysis are not in fact independent, making it difficult to estimate their separate effects precisely: coefficient estimates become unstable and their standard errors inflated. Multicollinearity refers to the non-independence of more than two independent variables.2
Comparison group. A valid comparison group will have the same characteristics as the group of beneficiaries of the program (treatment group), except that the units in the comparison group do not benefit from the program. Comparison groups are used to estimate the counterfactual. Also known as a "control group.”
Confounding factors. Other variables or determinants that are believed to be associated with the independent variable(s) of interest and are also believed to be causally associated with the outcome of interest.4
Contamination. When members of the comparison group are affected by either the intervention (see “spillover effect”) or another intervention that also affects the outcome of interest. Contamination is a common problem as there are multiple development interventions in most communities.5
Cost-benefit analysis. Ex-ante calculations of total expected costs and benefits, used to appraise or assess project proposals. Cost-benefit can be calculated ex-post in impact evaluations if the benefits can be quantified in monetary terms and the cost information is available.
Cost-effectiveness. Determining cost-effectiveness entails comparing similar interventions based on cost and effectiveness. For example, impact evaluations of various education programs allow policy makers to make more informed decisions about which intervention may achieve the desired objectives, given their particular context and constraints.
Counterfactual. The counterfactual is an estimate of what the outcome (Y) would have been for a program participant in the absence of the program (P). By definition, the counterfactual cannot be observed. Therefore, it must be estimated using comparison groups.
Dependent variable. A variable (a symbol that stands for a value that may vary) that is believed to be predicted by or caused by one or more other variables (independent variables). The term is commonly used in regression analysis. 6
Difference-in-differences. Difference-in-differences estimates the counterfactual for the change in outcome for the treatment group by taking the change in outcome for the comparison group. This method allows us to take into account any differences between the treatment and comparison groups that are constant over time. The two differences are thus before and after, and between the treatment and comparison groups. Also known as "double difference,'' “Diff-in-diff,” “DID,” or ''D-in-D.”
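The double difference described above can be sketched with hypothetical group means (the numbers below are illustrative only):

```python
# Hypothetical mean outcomes for treatment and comparison groups,
# measured before and after a program.
means = {
    ("treatment", "before"): 60.0,
    ("treatment", "after"): 74.0,
    ("comparison", "before"): 58.0,
    ("comparison", "after"): 66.0,
}

# First difference: change over time within each group.
change_treat = means[("treatment", "after")] - means[("treatment", "before")]
change_comp = means[("comparison", "after")] - means[("comparison", "before")]

# Second difference: nets out trends common to both groups.
did_estimate = change_treat - change_comp
print(did_estimate)  # 6.0
```

Here the comparison group's change (8.0) serves as the counterfactual trend, so only the excess change in the treatment group (14.0 − 8.0) is attributed to the program.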
Endline. The situation at the end of an intervention, against which progress can be assessed or comparisons made with baseline data. Endline data are collected after a program or policy is implemented to assess the “after” state.7
Estimator. In statistics, an estimator is a statistic (a function of the observable sample data) that is used to estimate an unknown population parameter; an estimate is the result from the actual application of the function to a particular sample of data.
Evaluation. Evaluations are periodic, objective assessments of a planned, ongoing, or completed project, program, or policy. Evaluations are used to answer specific questions, often related to design, implementation, and results.
Ex-ante evaluation design. An impact evaluation design prepared before the intervention takes place. Ex-ante designs are stronger than ex-post evaluation designs because of the possibility of considering random assignment, and the collection of baseline data from both the treatment and comparison groups. Also called prospective evaluation.8
Ex-post evaluation design. An impact evaluation design prepared once the intervention has started, and possibly after it has been completed. Unless the program was randomly assigned, a quasi-experimental design has to be used.9
External validity. To have external validity means that the causal impact discovered in the impact evaluation can be generalized to the universe of all eligible units. For an evaluation to be externally valid, it is necessary that the evaluation sample be a representative sample of the universe of eligible units.
Fixed effects model. Fixed effects models are used to control for heterogeneity in panel data sets in which unobservable individual effects are assumed to be non-random and correlated with the other variables.10 Contrast with random effects.
Follow-up survey. Also known as "post-intervention" or "ex-post" survey. A survey that is fielded after the program has started, once the beneficiaries have benefited from it for some time. An impact evaluation can include several follow-up surveys.
Heterogeneity of treatment effects. An outcome or treatment effect is heterogeneous when it varies depending on some characteristic(s) of the units of measurement, that is, some participant [individuals or types] benefited more or less from the treatment.11
Impact evaluation. An impact evaluation is an evaluation that tries to make a causal link between a program or intervention and a set of outcomes. An impact evaluation tries to answer the question of whether a program is responsible for changes in the outcomes of interest. Contrast with process evaluation.
Instrumental variable. An instrumental variable is a variable that helps identify the causal impact of a program when participation in the program is partly determined by the potential beneficiaries. A variable must have two characteristics to qualify as a good instrumental variable: (1) it must be correlated with program participation, and (2) it must not be correlated with outcomes Y (apart from through program participation) or with unobserved variables.
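A minimal sketch of the idea, using hypothetical data and the textbook Wald estimator (a simple instrumental-variable estimator for a binary instrument): divide the instrument's effect on the outcome by its effect on participation.

```python
# Hypothetical data: a binary instrument Z (e.g., a randomized offer),
# actual treatment take-up D, and outcome Y.
z = [1, 1, 1, 1, 0, 0, 0, 0]   # instrument: offered or not
d = [1, 1, 0, 1, 0, 0, 1, 0]   # actual participation
y = [20.0, 22.0, 15.0, 21.0, 14.0, 15.0, 19.0, 16.0]


def mean_where(values, flags, flag):
    sel = [v for v, f in zip(values, flags) if f == flag]
    return sum(sel) / len(sel)


# Reduced form: effect of the instrument on the outcome.
reduced_form = mean_where(y, z, 1) - mean_where(y, z, 0)
# First stage: effect of the instrument on participation.
first_stage = mean_where(d, z, 1) - mean_where(d, z, 0)

wald = reduced_form / first_stage
print(wald)  # 7.0
```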
Intention-to-treat, or ITT, estimator. The ITT estimator is the straight difference in the outcome indicator Y for the group to whom we offered treatment and the same indicator for the group to whom we did not offer treatment. Contrast with treatment-on-the-treated.
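Because the ITT estimator is just the straight difference in mean outcomes between the offered and not-offered groups, it can be computed directly; the outcome values below are hypothetical:

```python
# Hypothetical outcomes Y for units offered the treatment versus units
# not offered it, regardless of whether they actually took it up.
offered = [12.0, 15.0, 14.0, 13.0]
not_offered = [10.0, 11.0, 12.0, 11.0]

# ITT = mean(Y | offered) - mean(Y | not offered)
itt = sum(offered) / len(offered) - sum(not_offered) / len(not_offered)
print(itt)  # 2.5
```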
Interrupted time series analysis. Interrupted time series analysis is a non-experimental evaluation method in which data are collected at multiple instances over time before and after an intervention is introduced to detect whether the intervention has an effect significantly greater than the underlying secular trend.13
Intra-cluster correlation. Intra-cluster correlation is correlation (or similarity) in outcomes or characteristics between units that belong to the same cluster. For example, children that attend the same school would typically be similar or correlated in terms of their area of residence or socioeconomic background.
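One standard consequence of intra-cluster correlation (a textbook survey-sampling formula, not from the glossary source) is the design effect, deff = 1 + (m − 1)ρ, where m is the cluster size and ρ the intra-cluster correlation. It measures how much clustering inflates variance relative to a simple random sample:

```python
# Design-effect sketch: how much clustering inflates sampling variance.
def design_effect(cluster_size, icc):
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    # A clustered sample of n units is "worth" this many independent units.
    return n / design_effect(cluster_size, icc)


# With 20 children per school and an ICC of 0.05, a sample of 1,000
# children carries roughly the information of 513 independent draws.
print(round(design_effect(20, 0.05), 2))            # 1.95
print(round(effective_sample_size(1000, 20, 0.05)))  # 513
```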
John Henry effect. The John Henry effect happens when comparison units work harder to compensate for not being offered a treatment. When one compares treated units to those "harder-working" comparison units, the estimate of the impact of the program will be biased; that is, we will estimate a smaller impact of the program than the true impact that we would find if the comparison units did not make the additional effort.
Logical Model. Describes how a program should work, presenting the causal chain from inputs, through activities and outputs, to outcomes. While logical models present a theory about the expected program outcome, they do not demonstrate whether the program caused the observed outcome. A theory-based approach examines the assumptions underlying the links in the logical model.14
Matching. Also known as “statistical matching.” Matching is a non-experimental evaluation method that uses large data sets and statistical techniques to construct the best possible comparison group for a given treatment group based on one or more observable characteristics. A matching technique commonly used in impact evaluation is propensity score matching (PSM), in which groups that receive and don’t receive the intervention are matched based on the estimated probability of receiving or participating in the intervention (the propensity score).15
Minimum desired effect. The minimum change in outcomes that would justify the investment that has been made in an intervention, accounting not only for the cost of the program and the benefits that it provides, but also for the opportunity cost of not investing funds in an alternative intervention. This is not a statistical attribute but instead is usually decided upon by stakeholders. The minimum desired effect is also an input for power calculations; that is, evaluation samples need to be large enough to detect at least the minimum desired effect with sufficient power.16
Minimum detectable effect. The smallest true impact that the impact evaluation has a good chance of detecting.17 Carrying out power calculations makes it possible to determine the sample size such that the minimum detectable effect is smaller than or equal to the minimum desired effect, thus making it possible for the impact evaluation to detect effects of interest to stakeholders.
Monitoring. Monitoring is the continuous process of collecting and analyzing information to assess how well a project, program, or policy, is performing. It relies primarily on administrative data to track performance against expected results, make comparisons across programs, and analyze trends over time. Monitoring usually tracks inputs, activities, and outputs, though occasionally it includes outcomes as well. Monitoring is used to inform day-to-day management and decisions.
Natural experiment. Natural experiments (or quasi-natural experiments) in economics are serendipitous situations in which persons are randomly assigned to a treatment (or multiple treatments) and a comparison group and outcomes are analyzed for the purposes of testing a hypothesis, or in which assignment to treatment approximates a randomized controlled trial or a well-controlled experiment.18
Nonresponse. That data are missing or incomplete for some sampled units constitutes nonresponse. Unit nonresponse arises when no information is available for some sample units, that is, when the actual sample is different than the planned sample. Attrition is one form of unit nonresponse. Item nonresponse occurs when data are incomplete for some sampled units at a point in time. Nonresponse may cause bias in evaluation results if it is associated with treatment status.
Null hypothesis. A null hypothesis is a hypothesis that might be falsified on the basis of observed data. The null hypothesis typically proposes a general or default position. In impact evaluation, the default position is usually that there is no difference between the treatment and comparison groups, or in other words, that the intervention has no impact on outcomes.
Outcome. Can be intermediate or final. An outcome is a result of interest that comes about through a combination of supply and demand factors. For example, if an intervention leads to a greater supply of vaccination services, then actual vaccination numbers would be an outcome, as they depend not only on the supply of vaccines but also on the behavior of the intended beneficiaries. Final or long-term outcomes are more distant outcomes. The distance can be interpreted in a time dimension (it takes a long time to get to the outcome) or a causal dimension (many causal links are needed to reach the outcome).
Output. The products, capital goods, and services that are produced (supplied) directly by an intervention. Outputs may also include changes that result from the intervention that are relevant to the achievement of outcomes.
Panel data. Also known as “longitudinal data”, panel data are datasets that contain repeated observations over time of multiple variables for a set of individuals. Compared with cross-sectional data, in which observations are available only for a given time, or time-series data, in which a single variable is observed over time, panel data have the obvious advantages of more degrees of freedom and less collinearity among explanatory variables, and so provide the possibility of obtaining more accurate parameter estimates. More importantly, by blending inter-individual differences with intra-individual dynamics, panel data allow the investigation of more complicated behavioral hypotheses than those that can be addressed using cross-sectional or time-series data.19
Power. The power is the probability of detecting an impact if one has occurred. The power of a test is equal to 1 minus the probability of a type II error, ranging from 0 to 1. Popular levels of power are 0.8 and 0.9. High levels of power are more conservative and decrease the likelihood of a type II error. An impact evaluation has high power if there is a low risk of not detecting real program impacts, that is, of committing a type II error.
Power calculations. Power calculations indicate the sample size required for an evaluation to detect a given minimum detectable effect. Power calculations depend on parameters such as power (or the likelihood of type II error), significance level, variance, and intra-cluster correlation of the outcome of interest.
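The standard two-sample size formula (a textbook approximation, not taken from the glossary source) illustrates how these parameters interact: n per arm = 2(z₁₋α/₂ + z₁₋β)²σ²/δ², where δ is the minimum detectable effect and σ the outcome's standard deviation.

```python
# Sketch of a simple power calculation for comparing two group means.
import math
from statistics import NormalDist


def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Smallest n per group to detect effect `delta` with given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)


# Detecting a 0.25 standard-deviation effect at 5% significance, 80% power:
print(sample_size_per_arm(delta=0.25, sigma=1.0))  # 252
```

Note how the required sample grows with the square of 1/δ: halving the minimum detectable effect quadruples the sample size per arm.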
Process evaluation. A process evaluation is an evaluation that tries to establish the level of quality or success of the processes of a program; for example, adequacy of the administrative processes, acceptability of the program benefits, clarity of the information campaign, internal dynamics of implementing organizations, their policy instruments, their service delivery mechanisms, their management practices, and the linkages among these. Contrast with impact evaluation.
Propensity Score Matching (PSM). PSM is a matching technique commonly used in impact evaluation, in which groups that receive and don’t receive the intervention are matched [across several variables] based on the estimated probability of receiving or participating in the intervention (the propensity score).20
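A highly simplified nearest-neighbor matching sketch follows. The propensity scores below are hypothetical; in practice they would come from a logit or probit model of participation on observed characteristics.

```python
# Nearest-neighbor matching on (hypothetical) propensity scores.
treated = [("t1", 0.71), ("t2", 0.43), ("t3", 0.88)]        # (id, score)
comparison = [("c1", 0.40), ("c2", 0.69), ("c3", 0.90), ("c4", 0.15)]


def nearest_match(score, pool):
    # Pick the comparison unit whose propensity score is closest.
    return min(pool, key=lambda unit: abs(unit[1] - score))


matches = {tid: nearest_match(score, comparison)[0] for tid, score in treated}
print(matches)  # {'t1': 'c2', 't2': 'c1', 't3': 'c3'}
```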
Quasi-experimental design. Impact evaluation designs that create a comparison group using statistical procedures. The intention is to ensure that the characteristics of the treatment and comparison groups are identical in all respects other than the intervention, as would be the case in an experimental design.21
Random effects. Random effects models are used to control for heterogeneity in panel data sets in which unobservable individual effects are assumed to be uncorrelated with the other variables.22 Contrast with fixed effects.
Random sample. The best way to avoid a biased or unrepresentative sample is to select a random sample. A random sample is a probability sample in which each individual in the population being sampled has an equal chance (probability) of being selected.
Randomized controlled trial (RCT) design. Randomized controlled trials are considered the most robust method for estimating counterfactuals and are often referred to as the "gold standard" of impact evaluation. With this method, beneficiaries are randomly selected to receive an intervention, and each has an equal chance of receiving the program. With large-enough sample sizes, the process of randomization ensures equivalence, in both observed and unobserved characteristics, between the treatment and comparison groups, thereby addressing any selection bias. RCTs may still be subject to several types of biases and must follow strict protocols. Also known as “randomized study designs,” “randomized evaluation,” or “experimental designs.”
Random assignment. An intervention design in which members of the eligible population are assigned at random to either the treatment group (who receive the intervention) or the comparison or control group (who do not receive the intervention). Whether someone is in the treatment or comparison group is solely a matter of chance, and not a function of any of their characteristics (either observed or unobserved).23
Randomized offering. Randomized offering is a method for identifying the impact of an intervention. With this method, beneficiaries are randomly offered an intervention, and each has an equal chance of receiving the program. Although the program administrator can randomly select the units to whom to offer the treatment from the universe of eligible units, the administrator cannot obtain perfect compliance: she or he cannot force any unit to participate or accept the treatment and cannot refuse to let a unit participate if the unit insists on doing so. In the randomized offering method, the randomized offering of the program is used as an instrumental variable for actual program participation.
Randomized promotion. Randomized promotion is a method similar to randomized offering. Instead of random selection of the units to whom the treatment is offered, units are randomly selected for promotion of the treatment. In this way, the program is left open to every unit.
Randomized selection methods. "Randomized selection method" is a group name for several methods that use random assignment to identify the counterfactual. Among them are randomized assignment of the treatment, randomized offering of the treatment, and randomized promotion.
Regression. In statistics, regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. In impact evaluation, regression analysis helps us understand how the typical value of the outcome indicator Y (dependent variable) changes when the assignment to treatment or comparison group P (independent variable) is varied, while the characteristics of the beneficiaries (other independent variables) are held fixed. Bivariate regression analyzes the association between one independent variable and the dependent variable, while multivariate regression analyzes the association between more than one independent variables and the dependent variable.
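For the bivariate case described above, the ordinary least squares slope is cov(P, Y)/var(P); with a binary treatment indicator P this reduces to the difference in mean outcomes between the two groups. A sketch with made-up data:

```python
# Bivariate OLS sketch: how outcome Y varies with treatment assignment P.
p = [0, 0, 0, 0, 1, 1, 1, 1]  # treatment indicator
y = [10.0, 12.0, 11.0, 11.0, 14.0, 15.0, 13.0, 14.0]

mean_p = sum(p) / len(p)
mean_y = sum(y) / len(y)

# OLS slope: cov(P, Y) / var(P); with binary P this equals the
# difference in mean outcomes between treatment and comparison groups.
slope = (sum((pi - mean_p) * (yi - mean_y) for pi, yi in zip(p, y))
         / sum((pi - mean_p) ** 2 for pi in p))
print(slope)  # 3.0
```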
Regression discontinuity design (RDD). Regression discontinuity design is a non-experimental evaluation method. It is adequate for programs that use a continuous index to rank potential beneficiaries and that have a threshold along the index that determines whether potential beneficiaries receive the program or not. The cutoff threshold for program eligibility provides a dividing point between the treatment and comparison groups.
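A toy sketch of the idea with hypothetical data: units with an index value at or above the cutoff receive the program, and comparing units just below and just above the cutoff approximates the program's impact at the threshold. (Real RDD analyses typically fit local regressions on each side of the cutoff rather than comparing raw means.)

```python
# Toy RDD sketch: eligibility index >= cutoff means the unit is treated.
cutoff = 50.0
bandwidth = 5.0
data = [(46, 10.0), (47, 10.5), (48, 11.0), (49, 11.5),   # (index, outcome)
        (51, 15.0), (52, 15.5), (53, 16.0), (54, 16.5)]

# Keep only units within the bandwidth on each side of the cutoff.
below = [y for x, y in data if cutoff - bandwidth <= x < cutoff]
above = [y for x, y in data if cutoff <= x < cutoff + bandwidth]

rdd_estimate = sum(above) / len(above) - sum(below) / len(below)
print(rdd_estimate)  # 5.0
```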
Sample. In statistics, a sample is a subset of a population. Typically, the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible. Instead, researchers can select a representative subset of the population (using a sampling frame) and collect statistics on the sample; these may be used to make inferences or to extrapolate to the population. This process is referred to as sampling.
Sampling. Process by which units are drawn from the sampling frame built from the population of interest (universe). Various alternative sampling procedures can be used. Probability sampling methods are the most rigorous because they assign a well-defined probability for each unit to be drawn. Random sampling, stratified random sampling, and cluster sampling are all probability sampling methods. Non-probabilistic sampling (such as purposive or convenience sampling) can create sampling errors.
Sampling frame. The most comprehensive list of units in the population of interest (universe) that can be obtained. Differences between the sampling frame and the population of interest create a coverage (sampling) bias. In the presence of coverage bias, results from the sample do not have external validity for the entire population of interest.
Selection bias. Selection bias occurs when the reasons for which an individual participates in a program are correlated with outcomes. This bias commonly occurs when the comparison group is ineligible or self-selects out of treatment.
Significance level. The significance level is usually denoted by the Greek symbol, a (alpha). Popular levels of significance are 5 percent (0.05), 1 percent (0.01), and 0.1 percent (0.001). If a test of significance gives a p value lower than the a level, the null hypothesis is rejected. Such results are informally referred to as “statistically significant.” The lower the significance level, the stronger the evidence required. Choosing the level of significance is an arbitrary task, but for many applications, a level of 5 percent is chosen for no better reason than that it is conventional.
Spillover effect. Also known as contamination of the comparison group. A spillover effect occurs when the comparison group is affected by the treatment administered to the treatment group, even though the treatment is not administered directly to the comparison group. If the spillover effect on the comparison group is negative (that is, if they suffer because of the program), then the straight difference between outcomes in the treatment and comparison groups will yield an overestimation of the program impact. By contrast, if the spillover effect on the comparison group is positive (that is, they benefit), then it will yield an underestimation of the program impact.
Statistical power. The power of a statistical test is the probability that the test will reject the null hypothesis when the alternative hypothesis is true (that is, that it will not make a type II error). As power increases, the chances of a type II error decrease. The probability of a type II error is referred to as the false negative rate (β). Therefore, power is equal to 1 − β.
Stratified sample. Obtained by dividing the population of interest (sampling frame) into groups (for example, male and female), and then drawing a random sample within each group. A stratified sample is a probabilistic sample: every unit in each group (or stratum) has the same probability of being drawn.
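The procedure can be sketched as follows; the strata and sampling fraction below are hypothetical, and drawing the same fraction within every stratum gives each unit in a stratum the same selection probability:

```python
# Stratified sampling sketch: a simple random draw within each stratum.
import random

random.seed(42)
population = {
    "female": [f"f{i}" for i in range(100)],
    "male": [f"m{i}" for i in range(100)],
}


def stratified_sample(strata, fraction):
    sample = []
    for name, units in strata.items():
        k = round(len(units) * fraction)
        sample.extend(random.sample(units, k))  # random draw per stratum
    return sample


s = stratified_sample(population, 0.10)
print(len(s))  # 20 units: 10 from each stratum
```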
Treatment-on-the-treated (effect of). Also known as the TOT estimator. The effect of treatment on the treated is the impact of the treatment on those units that have actually benefited from the treatment. Contrast with intention-to-treat.
Type I error. Error committed when rejecting a null hypothesis even though the null hypothesis actually holds. In the context of an impact evaluation, a type I error is made when an evaluation concludes that a program has had an impact (that is, the null hypothesis of no impact is rejected), even though in reality the program had no impact (that is, the null hypothesis holds). The significance level determines the probability of committing a type I error.
Type II error. Error committed when accepting (not rejecting) the null hypothesis even though the null hypothesis does not hold. In the context of an impact evaluation, a type II error is made when concluding that a program has no impact (that is, the null hypothesis of no impact is not rejected) even though the program did have an impact (that is, the null hypothesis does not hold). The probability of committing a type II error is 1 minus the power level.
Unobservable variables. Characteristics that cannot be observed or measured. The presence of unobservable variables can cause selection bias in quasi-experimental designs. Also known as “unobservables” or “unobserved variables.”
1 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
3 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
7 Center for Effective Global Action, 2012.
8 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
12 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
14 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
16 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
21 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).
23 3ie and The World Bank, courtesy of the Abdul Latif Jameel Poverty Action Lab (J-PAL).