April 19, 2024

Mobile phone data reveal the effects of violence on internal displacement in Afghanistan

Data on violence in Afghanistan

We obtain violent events data from the UCDP, a leading source of data on conflict events72. Specifically, we use UCDP Georeferenced Event Dataset Global version 19.1, available at https://ucdp.uu.se/downloads/. This open-source collection of metadata on armed conflict and organized violence is collected from media reports, so it is likely to be biased towards salient events in more populous regions76,77. The criteria for inclusion of an event are “the incidence of the use of armed force by an organized actor against another organized actor, or against civilians, resulting in at least one direct death in either the best, low or high estimate categories at a specific location and for a specific temporal duration”75. In Afghanistan from 2013 to 2017, 5,984 events were recorded where the event location is known to a district level and the event time is known to a specific day; 4,740 of these events occur during the period April 2013–March 2017, for which we have call detail records (CDR), and 3,354 of those occur in districts on days with mobile phone activity. We discard events that are not recorded with this level of precision (47% of all events; the potential implications of this are discussed in ‘Limitations’). Afghanistan is divided into 398 districts in 34 provinces; our analysis is conducted on a district level.

Mobile phone call detail records

Our analysis of displacement is based on a large dataset of pseudonymized mobile phone data from one of Afghanistan’s largest mobile phone operators. As described in ‘Limitations’, we take precautions to ensure that the analysis of phone data respects the privacy of individual subscribers. In particular, our analysis involves only pseudonymized data that are aggregated geographically (by district) and temporally (by day).

We obtain CDR that provide metadata for every mobile phone call and data packet transfer that occurred on this network from April 2013 to March 2017—a total of roughly 20 billion events. For each such event, we observe a pseudonymized unique identifier for the subscriber (hashed from their phone number), the date and time of the event, and the identifier of the physical mobile phone tower through which the transaction was routed. We also know the exact location of each tower, which allows us to approximately identify each subscriber’s location at the time of the event, to within roughly 500 m in urban areas and roughly 10 km in rural areas.

There are 13,315 active towers during this period, many of which are very close together; we group these towers into 1,439 tower groups by combining towers less than 100 m apart. These cell tower groups are plotted in Fig. 1 and Supplementary Fig. 1. Only districts with cell towers are included in this analysis, though we note that these generally correspond to the more populated districts in Afghanistan (see Fig. 1).
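
The tower-grouping step can be illustrated with a short sketch. It is not the code used in the study: it assumes that towers are merged transitively (single linkage, so a chain of towers each less than 100 m from the next ends up in one group) and that distances are great-circle distances. The function names `haversine_m` and `group_towers` are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def group_towers(towers, threshold_m=100.0):
    """towers: dict tower_id -> (lat, lon). Returns dict tower_id -> group label.

    Assumption: towers connected by a chain of pairwise distances below
    threshold_m share one group (single linkage via union-find).
    """
    parent = {t: t for t in towers}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(towers)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if haversine_m(*towers[a], *towers[b]) < threshold_m:
                parent[find(a)] = find(b)  # merge the two groups
    return {t: find(t) for t in ids}

# Hypothetical coordinates: a, b and c form a chain; d is about 11 km away.
example = {
    "a": (34.5000, 69.2000),
    "b": (34.5005, 69.2000),
    "c": (34.5012, 69.2000),
    "d": (34.6000, 69.2000),
}
groups = group_towers(example)
```

The pairwise loop is quadratic in the number of towers, which is acceptable for a sketch; a spatial index would be the natural choice at the scale of many thousands of towers.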

Measuring migration

From the original CDR, we follow a sequence of steps to determine whether and when a migration event occurs. We adopt the IOM’s definition of migration, which is “The movement of persons away from their place of usual residence, either across an international border or within a State”30, and we focus on internal migration (“within a State”), where the place of usual residence is measured to a district-level precision. We capture trips that last approximately a week or more (at least five full days and two travel days). The migration that we measure is therefore an interdistrict movement. Complete details on this process are in the Supplementary Information (in the section on data processing); a brief summary is provided here.

Our first step is to derive a ‘daily modal location’ for each subscriber for each day, which is intended to capture the district in which the subscriber spends the majority of their time on that day. For each individual, we first compute their most commonly used cell tower in each hour. Then, for each 24-hour period from 06:00 to 06:00 the next day, we compute the mode of the hourly modal towers. The towers are then mapped to districts using point-in-polygon assignment. Similar methods have been used and validated in other work13,15,32,84,85,86,87. While several prior studies use night-time hours to infer daily locations, we instead use all hours, which allows us to include more individuals in our analysis. For example, in April 2013, data are available for approximately 31 million individual-days using night-time hours (18:00 to 06:00), while 61 million individual-days are defined when using all hours (06:00 to 06:00). The two approaches are highly correlated: of the 31 million observations available using night-time hours, 89% record the same daily modal districts when computed using all hours. Another common approach in the literature divides the physical terrain into approximate catchment areas of each cell tower, using a Voronoi tessellation (for example, refs. 13,88). Our analysis focuses on slightly larger administrative districts, since many of our violent events are identified at only the district level.
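
A minimal sketch of the daily modal location computation described above is given below, assuming events arrive as `(subscriber_id, timestamp, tower_id)` tuples and that a tower-to-district lookup has already been built (for example by point-in-polygon assignment). The function names and the tie-breaking rule are assumptions for illustration, not details taken from the study.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def mode(values):
    """Most common value; ties go to the value seen first (an arbitrary choice)."""
    return Counter(values).most_common(1)[0][0]

def daily_modal_districts(events, tower_to_district, day_start_hour=6):
    """events: iterable of (subscriber_id, datetime, tower_id).

    Returns dict (subscriber_id, day) -> district, where each 'day' runs from
    06:00 to 06:00 the next day and is labelled by the date on which it starts.
    """
    # Step 1: collect the towers used by each subscriber in each clock hour.
    hourly = defaultdict(list)
    for sub, ts, tower in events:
        hourly[(sub, ts.replace(minute=0, second=0, microsecond=0))].append(tower)

    # Step 2: take the modal tower per hour and assign the hour to a 06:00-06:00 day.
    daily = defaultdict(list)
    for (sub, hour), towers in sorted(hourly.items()):
        day = (hour - timedelta(hours=day_start_hour)).date()
        daily[(sub, day)].append(mode(towers))

    # Step 3: mode of the hourly modal towers, then map that tower to its district.
    return {key: tower_to_district[mode(towers)] for key, towers in daily.items()}

# Hypothetical events for one subscriber.
events = [
    ("s1", datetime(2013, 4, 1, 7, 10), "A"),
    ("s1", datetime(2013, 4, 1, 7, 20), "A"),
    ("s1", datetime(2013, 4, 1, 7, 40), "B"),
    ("s1", datetime(2013, 4, 1, 9, 0), "B"),
    ("s1", datetime(2013, 4, 2, 5, 30), "A"),  # before 06:00, so still part of 1 April
    ("s1", datetime(2013, 4, 2, 6, 30), "B"),
]
result = daily_modal_districts(events, {"A": "d1", "B": "d2"})
```

Restricting the input to events between 18:00 and 06:00 would give the night-time variant that the paragraph compares against.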

In Afghanistan, we find that the geographic distribution of daily modal locations of mobile phone subscribers broadly reflects the geographic distribution of the population. In particular, when comparing the number of mobile phone users in each district to the district population as estimated by Afghanistan’s Central Statistical Office, the Pearson’s correlation coefficient is 0.94 (95% CI, (0.92, 0.95); P < 0.001). (We calculate the number of subscribers in each district as the number of subscribers whose daily modal location is assigned to that district, averaged over all days in which the district has a non-zero number of subscribers, for a one-year period. Official estimates are obtained from https://data.humdata.org/dataset/estimated-population-of-afghanistan-2015-2016. On a log scale, the Pearson’s correlation is 0.53 (95% CI, (0.43, 0.61); P < 0.001).)

These daily modal locations tend to be sparse and noisy—for instance, many people do not use their phones on every single day, people may take short trips to nearby (non-residential) locations and so forth. Our second step thus employs an unsupervised scanning algorithm29 to identify contiguous segments in which a subscriber is, with high probability, resident in a single district. This algorithm helps smooth the influence of noise (for instance, long periods when a person is primarily in one location but intermittently visits other locations for one or two days) and missing data (for instance, when a person uses their phone infrequently, but almost exclusively from the same location) and has the advantage of not arbitrarily grouping days into calendar weeks or calendar months.

The third step is to identify migration events using discontinuous breaks in these contiguous segments. The second and third steps use the open-source Python package migration_detector (https://github.com/g-chi/migration_detector), which is specifically designed to infer migration events in transaction log data. The accompanying paper29 validates the use of these methods to measure migration. Tuning parameters are set to identify changes in locations resulting from stays of at least five full days in origin and destination districts. See the Supplementary Information for the full details.
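
To convey the idea behind the second and third steps, the sketch below uses a deliberately simplified stand-in. It does not reproduce the migration_detector package or its API: the functions `merge_runs` and `detect_migrations`, the noise threshold of two days and the rule for bridging gaps are all assumptions made for illustration.

```python
def merge_runs(runs):
    """Merge neighbouring runs that share a district.

    Each run is a tuple (district, first_day, last_day, n_observed_days).
    """
    merged = []
    for district, first, last, n in runs:
        if merged and merged[-1][0] == district:
            prev = merged[-1]
            merged[-1] = (district, prev[1], last, prev[3] + n)
        else:
            merged.append((district, first, last, n))
    return merged

def detect_migrations(daily, min_days=5, max_noise_days=2):
    """daily: dict day_number -> district, with missing days simply absent.

    Returns a list of (origin, destination, last_day_at_origin,
    first_day_at_destination) tuples.
    """
    # 1. Collapse observed days into runs of identical districts (gaps are bridged).
    runs = merge_runs((daily[day], day, day, 1) for day in sorted(daily))
    # 2. Treat very short runs as noise, then re-merge what they had split.
    runs = merge_runs(r for r in runs if r[3] > max_noise_days)
    # 3. Keep only segments with enough observed days to count as residence.
    segments = merge_runs(r for r in runs if r[3] >= min_days)
    # 4. A migration is a change of district between consecutive segments.
    return [(a[0], b[0], a[2], b[1]) for a, b in zip(segments, segments[1:])]

# Hypothetical subscriber: mostly in A with a one-day visit to B, then a move to C.
daily = {d: "A" for d in range(1, 10)}
daily[5] = "B"
daily.update({d: "C" for d in range(12, 19)})  # days 10 and 11 are unobserved
moves = detect_migrations(daily)
```

In this example the one-day visit to B is smoothed away and a single move from A to C is reported, which mirrors the intent of the segment-based approach even though the real algorithm is more careful about gaps and uncertainty.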

The above procedures allow us to measure migration events from the mobile phone CDR. Many such events are not indicative of ‘displacement’, which the IOM defines as “The movement of persons who have been forced or obliged to flee or to leave their homes or places of habitual residence, in particular as a result of or in order to avoid the effects of armed conflict, situations of generalized violence, violations of human rights or natural or human-made disasters.”30 Given the limited contextual information available in the CDR, we cannot directly observe whether each inferred migration event should be considered a displacement. Instead, as we discuss in ‘Panel regressions: measuring k-day displacement’, we focus our analysis on the increase in out-migrations from a district that appear to be caused by violence in that district.

Data validation

To validate the measures of migration derived from the mobile phone CDR, we compare our derived migration metrics to displacement measures published by the IOM (DTM Afghanistan Districts Round 9 Baseline Assessment, available at https://data.humdata.org/dataset/afghanistan-displacement-data-baseline-assessment-iom-dtm). To our knowledge, there are no official or other published data measuring interdistrict migration as we do; while we try as far as possible to produce analogous measures, the IOM data measure fundamentally different quantities, and we do not expect comparisons to be identical. Generally speaking, we might expect province shares of migration and displacement to be similar if the fraction of displaced people among those who move for any reason is similar across provinces. This might not always be the case—for example, we might expect the capital, Kabul, to have a much smaller share of displaced people. Nevertheless, we make the comparison, as the IOM data are the closest published dataset on internal migration or displacement in Afghanistan.

The IOM collects data at the settlement (village) level through key informant interviews, focus group discussions and direct observation25. They use these data to estimate counts of outgoing and incoming internally displaced persons (IDPs) in assessed settlements over fixed periods. IDPs are categorized into ‘returnee IDPs’, ‘arrival IDPs’ and ‘fled IDPs’. We use the data collected in the year 2016; to our knowledge, this means that these individuals were recorded as being IDPs anytime during the year. We group ‘returnee’ and ‘arrival’ IDPs together as incoming IDPs, treat ‘fled IDPs’ as outgoing IDPs and sum the total numbers of incoming and outgoing IDPs for each province. We then compute each province’s share of the total incoming and outgoing IDPs.

Next, to construct an analogous metric from the CDR, we compare the district locations of each subscriber at the beginning and end of three four-month periods in 2016 (January–April, April–August and August–December, summed to obtain a measure of movement in 2016, since the longest we track subscribers is for 120 days). Since each district could have different cell-phone penetration rates, for each period and each district, we estimate the total number of people who moved in and out of the district by scaling the number of recorded subscribers who moved by \(\frac{\mathrm{district}\,\mathrm{population}}{\mathrm{no.}\,\mathrm{of}\,\mathrm{recorded}\,\mathrm{subscribers}}\), where the district population is as estimated by Afghanistan’s Central Statistical Office (available at https://data.humdata.org/dataset/estimated-population-of-afghanistan-2015-2016). We then aggregate these to the province level for 2016 and compute province shares in a similar manner. Supplementary Fig. 2a shows the share of each province estimated to leave; Spearman’s correlation between CDR and IOM statistics at the province level is ρ = 0.49 (95% CI, (0.20, 0.72); P = 0.004). Supplementary Fig. 2b does the same for incoming individuals, with ρ = 0.56 (95% CI, (0.31, 0.77); P < 0.001).
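
The scaling and aggregation just described can be sketched as follows. The function name `province_shares`, the dictionary-based inputs and all numbers in the example are hypothetical; the sketch only shows the arithmetic of scaling movers by district population over recorded subscribers and then normalizing at the province level.

```python
def province_shares(movers, subscribers, population, district_to_province):
    """movers, subscribers, population: dicts keyed by district.

    Scales the recorded movers in each district by population / subscribers,
    sums the scaled counts by province and returns each province's share.
    """
    totals = {}
    for district, n_moved in movers.items():
        if subscribers.get(district, 0) == 0:
            continue  # no recorded subscribers, so the scaling factor is undefined
        scaled = n_moved * population[district] / subscribers[district]
        province = district_to_province[district]
        totals[province] = totals.get(province, 0.0) + scaled
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {province: value / grand_total for province, value in totals.items()}

# Hypothetical inputs: three districts in two provinces.
shares = province_shares(
    movers={"d1": 10, "d2": 5, "d3": 20},
    subscribers={"d1": 100, "d2": 50, "d3": 400},
    population={"d1": 1000, "d2": 1000, "d3": 4000},
    district_to_province={"d1": "P1", "d2": "P1", "d3": "P2"},
)
```

Here d2 has the lowest phone penetration, so its five recorded movers are scaled up the most; both provinces end up with an equal share of estimated movers.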

In Supplementary Fig. 2, we see that many provinces have similar shares of migration and displacement, with some obvious differences in Kabul Province, where migration far exceeds displacement, and Hilmand, where displacement far exceeds migration (Hilmand Province was a Taliban stronghold and frequently saw heavy fighting89).

Panel regressions: measuring k-day displacement

We combine the violent events data and migrations observed in the CDR into a district-day panel dataset, which we use to estimate the ‘average’ impact of violence on out-migration from the district in which violence occurs. We estimate this effect by adapting widely used panel regression models to our context (for example, refs. 90,91,92), which allows us to estimate the total migration caused by violence while controlling for unobserved district- and time-related factors that might influence the occurrence of both violence and migration. We first present the technical details of this model and later discuss the identifying assumptions and possible concerns with this approach.

For each value of k from 1 to 120, we estimate the following regression:

$$g(\mathbb{E}(Y_{dt,k}\mid X_{dt},T_{d,t+\tau}))=\gamma_d+\lambda_t+\sum\limits_{\tau=-30}^{180}\beta_\tau T_{d,t+\tau}$$

(1)

where d indexes the district, t indexes the time (calendar date) and covariates \(X_{dt}\) are given by district fixed effects, \(\gamma_d\), and time fixed effects, \(\lambda_t\). \(T_{d,t+\tau}\) are the ‘treatment’ variables (whether or not violence occurs) in district d at time t, at a lag of τ days. Lags of τ ∈ [−30, 180] are used, representing violence in the district 30 days in the future to 180 days in the past. This range was chosen because all effects were observed to lie within this window; the results are insensitive to a longer window, while shorter windows are unable to capture all effects of interest. The outcome variable, \(Y_{dt,k}\), is the proportion of those in district d at time t − k that are in a different district at time t. Subscribers present k days ago in district d but with a missing location on day t are included in the denominator but not in the numerator in this computation. The parameter k is introduced to capture the fact that displacement has to be measured relative to some time in the past. g(·) is the logit link function. Since the outcome variable is a proportion, we model it using a beta distribution, a family of continuous distributions in the interval from 0 to 1, taking a variety of possible shapes depending on the values of its parameters. We fit a beta regression using maximum likelihood estimation93. Standard errors are clustered at the district level.
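
The construction of the outcome variable can be made concrete with a small sketch. The data layout (a nested dictionary of daily districts per subscriber) and the function names are assumptions for illustration; the handling of missing locations follows the description above, and the `usable` helper reflects the exclusion of outcomes equal to 0, 1 or missing that is applied when the regressions are estimated.

```python
def out_migration_share(locations, district, t, k):
    """locations: dict subscriber -> dict day -> district (missing days absent).

    Returns the share of subscribers who were in `district` on day t - k and are
    observed in a different district on day t, or None if nobody was present
    in `district` on day t - k.
    """
    present = [s for s, days in locations.items() if days.get(t - k) == district]
    if not present:
        return None
    # Subscribers with a missing location on day t stay in the denominator only.
    moved = sum(
        1 for s in present
        if locations[s].get(t) is not None and locations[s][t] != district
    )
    return moved / len(present)

def usable(y):
    """True if the district-day outcome is neither missing nor exactly 0 or 1."""
    return y is not None and 0.0 < y < 1.0

# Hypothetical subscribers, with k = 30 and t = 30.
locations = {
    "s1": {0: "A", 30: "B"},   # moved
    "s2": {0: "A", 30: "A"},   # stayed
    "s3": {0: "A"},            # location missing on day t
    "s4": {0: "A", 30: "C"},   # moved
    "s5": {0: "B", 30: "A"},   # not in district A on day t - k
}
y = out_migration_share(locations, "A", t=30, k=30)
```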

These coefficients can be interpreted as with a logistic regression: for each τ, \(\mathrm{e}^{\beta_\tau}\) is the multiplicative change in the odds of being in a different district today (time t), for \(T_\tau=1\) (when violence occurs) relative to \(T_\tau=0\) (days without violence), holding the other variables constant. To interpret \(\beta_\tau\) as the causal effect of violence on displacement, the target parameter is the causal conditional odds ratio, and the necessary identification assumptions are positivity, consistency, conditional exchangeability and correct model specification94. In our context, this specification assumes that there are no spatial spillovers, meaning that violence in one district does not have an effect on displacement in other districts. Carryover effects of the violence are limited to 180 days after the violence, and effects from up to 30 days prior are allowed. These daily effects are estimated independently and do not modify one another. The effects are assumed to be identical for all districts and to not vary over the measurement period (2013–2017). The confounders are limited to district, time and treatment in the surrounding window of time, and these enter additively. This implies that there are no unobserved time-varying confounders and that past outcomes do not affect current treatment (this is plausible since in most cases the number of displaced people is not large enough to affect military strategy). We relax several of these assumptions in subsequent analyses, for instance by allowing for heterogeneous effects of different types of violence in different types of locations.

The key identifying assumption that there are no unobserved time-varying confounders requires the precise day in which a district experiences violence to be random, after conditioning on district and time fixed effects and the occurrence of violence in the surrounding window of time (equation (1)). Qualitatively, we find this assumption plausible because the precise timing (that is, the day on which violence occurs) of insurgent attacks is often meant to surprise government forces. However, the assumption cannot be tested directly; we therefore perform several checks to assess whether the occurrence of violence can be predicted beyond what our model in equation (1) captures. Specifically, we first regress the occurrence of violence on day t on the control variables in equation (1), that is, \(g(\mathbb{E}(T_{dt}\mid \gamma_d,\lambda_t,T_{d,t+\tau}))=\gamma_d+\lambda_t+\sum_{\tau\in[-30,180]\backslash\{0\}}\beta_\tau T_{d,t+\tau}\), and obtain the residuals \(T_{dt}-\hat{\mathbb{E}}(T_{dt})\). Supplementary Table 1 assesses whether these residuals can be predicted using recent lags and trends in the outcome variable (30-day displacement) and the number of subscribers observed to be in a district. We find that, using either a linear model or a machine learning approach (a random forest with tenfold cross-validation), these characteristics do not accurately predict residual violence (\(R^2\) ≤ 0.00028 (95% CI using non-parametric bootstrap, percentile interval, (0.00026, 0.0010))). Finally, as an additional robustness check, we find that adding more restrictive region × month time-varying fixed effects to equation (1) does not qualitatively change the main results (Supplementary Fig. 6).

In estimating these regressions, we exclude district-days in which the outcome variable is 0, 1 or missing. The rationale is that these zeros and ones are probably due to data sparsity. On one hand, if no subscribers were recorded as being in a different district, it could be that their locations were simply missing (for example, they did not use their phones, there was no cell service or they switched providers). On the other hand, it is unlikely that all subscribers would have left a district on any day; a recorded 1 could indicate cell tower outages in the origin district, for example. (News reports have described the Taliban restricting access to communications or destroying cell towers, and we do see a small reduction in the number of active cell towers in a district during periods of violence. However, we do not see significant decreases in call volumes at a district level, nor do we see a decrease in the probability of a district having an active tower, probably indicating that individuals are able to connect to a different cell tower within the same district. If it is the case that all individuals are only able to connect to a cell tower in a different district, our response variable would be a 1 and hence dropped from the regression. This limits overestimation of the displacement response.)

Several other points are of note. First, this estimation of displacement as an increase in migration due to violence also partially addresses the concern that the place of usual residence might be incorrectly measured using CDR. If violence does not impact the measurement error (for example, if the likelihood of a subscriber being misallocated to the district of their workplace instead of their home does not change due to violence), then the misallocation will not bias the estimated displacement. Second, although the treatments (violent events) occur relatively infrequently, the statistical model we employ is robust to sparsity; if all of the events are recorded accurately, estimates will not be biased because of sparsity. Non-random missingness of recorded events could bias estimates; this is discussed in ‘Limitations’.

Summary of identification strategy

To more plainly summarize our statistical approach to measuring the effect of violence on displacement, we regress out-migration in each district-day on indicators for occurrences of violence up to 180 days prior and up to 30 days in the future, while controlling for geographic and temporal factors. This approach is designed to capture out-migration in excess of the out-migration that normally occurs in that district (on all other days) and on that day (in all other districts). Thus, the model does not assume that people do not move when violence is not occurring; instead, it uses movement in non-violent times and places as a baseline, to better isolate the additional movement that co-occurs with violence.

We include 180 lag terms and 30 lead terms to measure excess out-migration (again relative to normal out-migration) that occurs in the 180 days after violence and in the 30 days leading up to violence, as well as excess out-migration on the day of violence. Since violent events may be spatially and temporally correlated, a single observation (district-day) in the regression could have multiple violence indicators that are turned on; the migration dependent variable for that observation would thus contribute to the estimation of violence effects on all of the affected leads and lags.
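
The lead and lag indicators for one district-day can be sketched as below. The sketch follows the verbal description (a positive lag τ refers to violence τ days before day t, a negative lag to violence after day t); the function name and the use of integer day numbers are assumptions for illustration.

```python
def treatment_indicators(violent_days, t, min_lag=-30, max_lag=180):
    """violent_days: set of day numbers with recorded violence in one district.

    Returns dict lag -> 0/1 for the district-day t. A lag tau > 0 flags violence
    tau days before t; tau < 0 flags violence |tau| days after t; tau = 0 is the
    day itself.
    """
    return {tau: int((t - tau) in violent_days) for tau in range(min_lag, max_lag + 1)}

# Hypothetical district with violence on days 100 and 130, observed on day 110.
indicators = treatment_indicators({100, 130}, t=110)
active = sorted(tau for tau, flag in indicators.items() if flag)
```

In this example two indicators are switched on for the same district-day (a lag of 10 days and a lead of 20 days), so the outcome for that observation informs both coefficients.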

Using this regression framework allows us to estimate the ‘average displacement effect’ of violence, averaged over the 3,354 violent events in our dataset that occur in districts on days with recorded mobile phone activity. For example, a coefficient of 0.03 on the indicator for violence at a lag of ten days can be interpreted as ‘On average, violence occurring ten days prior increases the odds of migration out of a district by 3% (a multiplicative change of \(\mathrm{e}^{0.03}\approx 1.03\)), holding all other variables constant.’ This approach helps limit the extent to which any one specific event, which might have unusual characteristics or correlates, can influence our final results. For instance, if one violent event happened to occur on a day in which a certain district would have seen unusual out-migration even in the absence of violence, that single event would have a limited impact on our final estimates. The main concern is if violent events were systematically correlated with other unobserved factors, above and beyond the flexible spatial and temporal fixed effects that we control for in the regression.
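
Because the coefficients act on the odds rather than on the proportion itself, it can help to translate an odds ratio back into a proportion. The sketch below does this for a hypothetical baseline out-migration share of 10%; the baseline value and the function name are assumptions chosen purely for illustration.

```python
import math

def shifted_probability(baseline_p, beta):
    """Apply a log-odds coefficient `beta` to a baseline proportion."""
    odds = baseline_p / (1.0 - baseline_p)   # proportion -> odds
    new_odds = odds * math.exp(beta)          # multiplicative change in the odds
    return new_odds / (1.0 + new_odds)        # odds -> proportion

# With a hypothetical 10% baseline, a coefficient of 0.03 gives about 10.3%.
p_after = shifted_probability(0.10, 0.03)
```

For small baseline proportions the relative change in the proportion is close to the relative change in the odds, which is why a 3% increase in the odds corresponds here to a little under a 3% relative increase in out-migration.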

Impact of a violent day

To distil the impact of a single violent day, for each k ∈ [1, 120], we consider the coefficient for \(T_\tau\) for τ = k. This coefficient captures the effect of violence occurring at a τ day lag, on movement measured at time t, compared with district locations k days ago. When τ = k, the outcome variable is measured with respect to those in the district on the day of the violence. In this way, extracting the relevant coefficients from regressions where the outcome variable is different values of k gives us the impact of a single violent day, on the subscribers in the district on that day. We demonstrate the robustness of these results to potential data issues, such as the presence of outliers, as well as modelling issues such as the inclusion of additional time-varying controls, in Supplementary Figs. 5 and 6.
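
In code terms, this amounts to reading one coefficient off each of the fitted regressions. The sketch below assumes the estimates are stored as a nested dictionary keyed first by k and then by τ; that layout, the function name and the numbers are hypothetical.

```python
def single_day_impact(coefficients_by_k):
    """coefficients_by_k: dict k -> dict tau -> estimated beta (one dict per regression).

    Returns dict k -> beta at tau = k, i.e. the effect of violence k days ago on
    the subscribers who were in the district on the day of the violence.
    """
    return {
        k: betas[k]
        for k, betas in sorted(coefficients_by_k.items())
        if k in betas
    }

# Hypothetical coefficients from three regressions (k = 1, 2, 3).
fitted = {
    1: {0: 0.01, 1: 0.05, 2: 0.02, 3: 0.00},
    2: {0: 0.01, 1: 0.03, 2: 0.04, 3: 0.01},
    3: {0: 0.00, 1: 0.02, 2: 0.03, 3: 0.06},
}
impact = single_day_impact(fitted)
```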

Heterogeneous effects

To allow for the possibility that the displacement response may differ for different types of violence or for different types of locations, we estimate heterogeneous effects models, the results of which are shown in Fig. 3. These results are estimated by creating separate treatment indicators for different types of events (for example, low-casualty versus high-casualty), which replace the treatment indicators in equation (1). For instance, letting \(H_{d,t+\tau}\) denote the occurrence of high-casualty (>10 casualties) violence and \(L_{d,t+\tau}\) denote the occurrence of low-casualty violence, we estimate:

$$\begin{array}{l}g(\mathbb{E}(Y_{dt,k}\mid X_{dt},H_{d,t+\tau},L_{d,t+\tau}))\\ =\gamma_d+\lambda_t+\sum\limits_{\tau=-30}^{180}\beta_{H,\tau}H_{d,t+\tau}+\sum\limits_{\tau=-30}^{180}\beta_{L,\tau}L_{d,t+\tau}\end{array}$$

(2)

When analysing the heterogeneity of response by location (for example, for provincial capitals), we estimate prior regressions on the relevant subsets of the data—that is, by only including observations pertaining to provincial capitals.

Controlling for multiple dimensions of heterogeneity

To account for multiple dimensions of heterogeneity varying jointly, we analyse 30-day displacement by first fitting equation (3) separately for each of the events, using ordinary least squares:

$$\mathrm{log}\left(\frac{Y_{dt,30}}{1-Y_{dt,30}}\right)=\gamma_d+\lambda_t+\sum\limits_{\tau=-30}^{180}\beta_\tau T_{d,t+\tau}+\epsilon_{dt}$$

(3)

Here \(T_{d,t+\tau}\) indicates a single event at a time (each treatment indicator indicates whether or not the specific event occurs at district d at time t, at a lag of τ days). Only events in which all \(\beta_\tau\) coefficients can be estimated are included, meaning that if the outcome variable is unavailable in any day that is 30 days preceding the event to 180 days after the event, it is not included in the analysis. This results in a total of 2,359 events being studied. For each included event, we take the mean of the estimated coefficients for \(\beta_\tau\), for τ = 1–15, 16–30, 31–45, 46–60, 61–75 and 76–90. We treat these as outcome variables and model each of these derived outcomes \(O_i\) as
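
The binning of per-event coefficients into 15-day windows can be sketched as follows, assuming the estimated coefficients for one event are held in a dictionary keyed by τ. The function name and the example values are hypothetical.

```python
# The six 15-day windows over which coefficients are averaged.
BINS = [(1, 15), (16, 30), (31, 45), (46, 60), (61, 75), (76, 90)]

def binned_means(betas):
    """betas: dict tau -> estimated coefficient for a single event.

    Returns dict (first_tau, last_tau) -> mean coefficient over that window.
    """
    return {
        (lo, hi): sum(betas[tau] for tau in range(lo, hi + 1)) / (hi - lo + 1)
        for lo, hi in BINS
    }

# Hypothetical coefficients that simply equal their lag, to make the means easy to check.
example_betas = {tau: float(tau) for tau in range(-30, 181)}
outcomes = binned_means(example_betas)
```

Each of the six values returned for an event would then serve as one derived outcome in the event-level regression.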

$$\begin{array}{ll}O_i&=\beta_0+\beta_1\mathrm{provCap}_i+\beta_2\mathrm{log(population)}_i\\ &+\beta_3\mathrm{IS}_i+\beta_4\mathrm{casualties11}_i+\beta_5\mathrm{peace60}_i+\epsilon_i\end{array}$$

(4)

where i is the event, \(\mathrm{provCap}_i\) is a binary variable denoting whether the event occurs in a provincial capital, \(\mathrm{log(population)}_i\) is the log of the population of the district in which the event occurs (added as a control), \(\mathrm{IS}_i\) is a binary variable denoting whether the event involved IS, \(\mathrm{casualties11}_i\) is a binary variable denoting whether the event was associated with 11 or more casualties and \(\mathrm{peace60}_i\) is a binary variable denoting whether the event was preceded by 60 or more days of peace. Figure 4 shows the estimated coefficients for each of the outcomes.

Destinations of displaced people

To investigate where the individuals displaced by violence go, we first examine migrant flows during non-event days (Fig. 5) and event days (Supplementary Fig. 7). We consider all recorded moves in any 30-day period and split these into days on which violent events occurred at the start of the 30-day period (‘event days’) and those on which no events were recorded (‘non-event days’). We repeat the following analysis for each. First, we categorize recorded moves as originating in either capital districts or non-capital districts. We then split destination districts into mutually exclusive categories by first recording whether they are in the same or a different province from the origin district; these destinations are then partitioned into three different types of districts—the major urban cities (Kabul, Kandahar, Hirat, Mazari Sharif and Jalalabad), other capital districts and non-capital districts.
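
The destination categories can be expressed as a small classification function. The sketch assumes that districts are identified by name, that the five major cities are matched by the names listed above, and that lookups for provinces and capital districts are available; the function name and the example districts are hypothetical.

```python
MAJOR_CITIES = {"Kabul", "Kandahar", "Hirat", "Mazari Sharif", "Jalalabad"}

def destination_category(origin, destination, province_of, capitals):
    """Classify a move by province relation and destination type.

    province_of: dict district -> province; capitals: set of capital districts.
    Returns a pair such as ("different province", "major city").
    """
    if province_of[origin] == province_of[destination]:
        relation = "same province"
    else:
        relation = "different province"

    if destination in MAJOR_CITIES:
        kind = "major city"
    elif destination in capitals:
        kind = "other capital"
    else:
        kind = "non-capital"
    return relation, kind

# Hypothetical lookups: district_a and district_b are placeholders, not real districts.
province_of = {
    "Kabul": "Kabul",
    "district_a": "Kabul",
    "district_b": "prov_2",
    "capital_2": "prov_2",
}
capitals = {"Kabul", "capital_2"}
category = destination_category("district_a", "Kabul", province_of, capitals)
```

The origin side of the analysis (capital versus non-capital districts) would use the same `capitals` lookup applied to the origin district.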

To estimate the effect of violence on the destination of displacement, we use a similar setup as equation (1). Instead of the outcome variable being the fraction of the population that moved on day k, we use the fraction of movers (those in a different district at time t compared with k days ago) on day k observed to be at specific types of destination districts, as described above. We use outcomes for k = 7, 30, 90 and fit separate regressions for provincial capitals, for non-capitals and for each outcome. As before, district-days in which the outcome variable is 0, 1 or missing are excluded from the analysis.

Implications of missing violence data

As discussed in ‘Limitations’, our analysis does not include violent events that are not associated with specific locations (that is, where we do not know the district in which the event occurred). This could introduce bias into our analysis if certain types of violence (with specific migration responses) are systematically more or less likely to have known locations. We therefore conduct additional analysis to determine whether the spatial precision with which an event is recorded is correlated with the magnitude of the displacement effect.

Specifically, using the same empirical approach described in ‘Heterogeneous effects’, we create separate treatment indicators for each of the three types of events that we use in our analysis, based on their available geographic precision: (1) events for which the exact location is known and coded, \(A_{d,t+\tau}\) (N = 1,698); (2) events that occurred within a 25 km radius around a known point, \(B_{d,t+\tau}\) (N = 789); and (3) events for which only the district is known, \(C_{d,t+\tau}\) (N = 969). These replace the treatment indicators in equation (1):

$$\begin{array}{ll}&g(\mathbb{E}(Y_{dt,k}\mid X_{dt},A_{d,t+\tau},B_{d,t+\tau},C_{d,t+\tau}))\\ &=\gamma_d+\lambda_t+\sum\limits_{\tau=-30}^{180}\beta_{A,\tau}A_{d,t+\tau}+\sum\limits_{\tau=-30}^{180}\beta_{B,\tau}B_{d,t+\tau}+\sum\limits_{\tau=-30}^{180}\beta_{C,\tau}C_{d,t+\tau}\end{array}$$

(5)

The results, shown in Supplementary Fig. 8, indicate that the displacement response is very similar for violent events with these three different levels of spatial precision. There are small differences in the point estimates, but the general pattern of the response is unchanged, and the CIs of all three violence types overlap. Of course, this analysis does not eliminate the possibility that there might be a qualitatively different displacement response to violent events that are not recorded in our dataset (or for which district information is unknown). Unfortunately, we cannot directly test that concern, since we cannot estimate the displacement effect of violence when the location of the violence is not known.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.