Positividade* and data against COVID-19: Description of a Python-based model prototype

Experiment 4 Data Powered Positive Deviance
P2 Define

June 25, 2020
Iskriyana Vasileva, Joshua Driesen, Niklas Wulkow.


Iskriyana Vasileva: "*positividade = positivity in Portuguese. A Brazilian friend taught me the word and its strength and sincerity stuck with me. As the following story is about finding positive outliers, it only seemed right to use it as a reflection of both the project’s goal and my experience."

Please note that we are discussing a prototype developed in the context of the Data Powered Positive Deviance Initiative and a COVID-19-response Hackathon. We acknowledge the need for additional refinements and look forward to new ideas to enrich what our team has created. Please feel free to contact Iskriyana Vasileva if you have questions or suggestions.  

What is Positive Deviance?

The basic idea is to identify positive outliers that, despite equal conditions, perform better than their peers. The conventional PD approach is, however, limited by scarce and costly primary data collection, resulting in small data samples, making it statistically difficult to identify positive deviants. To address this issue, the Data Powered Positive Deviance (DPPD) approach was started. Its goal is to utilize big data and recent analytic techniques to enrich and accelerate the original PD approach.

Our Approach to the COVID-19 challenge

In the context of the COVID-19 pandemic in Germany, our team decided to find districts (Landkreise / Kreisfreie Städte) that managed to flatten their infection curve unexpectedly well compared to structurally similar peers (more about it here). These would be the COVID-19-positive deviants. After identifying them and their positive drivers, their actions could potentially be replicated in other districts to “spread” virus-dampening behaviour.

You will be able to find our code here.

Our Data

The data we used covered all German districts and was extracted from publicly available sources. The data included structural data and data on the infection spread of COVID-19. Structural data included factors such as population density, hospital capacities, nursing homes, etc. per district and was primarily extracted from Landatlas. We also added data on average age and average household size (source: official 2011 census site). The data on the infection spread encompassed the number of COVID-19 infections per district provided by the Robert Koch Institute (data & data API, both in German).

Development of the Performance Measure

Defining a performance measure was key to the project’s success. It needed to reflect a desirable deviant outcome, cover all units of analysis, and allow for precise identification. For our case, we developed an inhibition coefficient to reflect the extent of successful flattening of the COVID-19 infection curve — the lower the coefficient, the better the performance of the district in containing the infection.

def _calc_infections_by_date_and_district(self):    
        output_data = []
        for district in self._rki_districts:
            total_inf = 0
            for date in self._date_range:
                new_inf = self._only_inf_cases[(self._only_inf_cases.date == date) & 
                                               (self._only_inf_cases.IdLandkreis == district)].AnzahlFall.sum()
                total_inf += new_inf
                    {'districtId' : district,
                     'date' : date.strftime('%Y-%m-%d'),
                     'new_infected' : new_inf,
                     'total_infected' : total_inf}

        #rki_raw Berlin is separated into sub-districts, structure data has only aggregate, so we calculate aggregate
        total_inf = 0 #reset infection integrator
        for date in self._date_range.strftime('%Y-%m-%d'):
                new_inf = self._only_inf_cases[(self._only_inf_cases.date == date) & 
                                               (self._only_inf_cases.IdLandkreis > 11000) & #Berlin = 11\d\d\d
                                               (self._only_inf_cases.IdLandkreis < 12000)].AnzahlFall.sum()
                total_inf += new_inf
                    {'districtId' : 11000,
                     'date' : date,
                     'new_infected' : new_inf,
                     'total_infected' : total_inf}

        output_df = pd.DataFrame(data = output_data)
        output_df = output_df.sort_values(['districtId','date']).reset_index(drop = True)        
        return output_df

smoothed_new: We used a seven-day rolling average of new infections to cancel out day-of-the-week effects in the data, which resulted from inconsistent reporting of COVID-19 test outcomes during the weekend by some laboratories.

alpha_t: First rate of change — we estimated the infection spread rate (spread rate or alpha coefficient) by dividing pairs of successive values of smoothed_new, i.e. smoothed_new on dayt / smoothed_new on dayt+1. In effect, this constitutes a smoothed growth factor.

This measure constitutes a multiplicative rate of change, which reflects the exponential spread of a disease when it is unhindered. With exponential growth, the multiplicative changes are constant, not the additive/linear.

We used a threshold for the number of newly infected people on one single day for our algorithm to start including it in the data. For example, if a district has three new cases on day 1, four on day 2 and five on day 3, and the threshold is set to 5, then day 3 will be the first day for which alpha_t is calculated.

 def _calc_alpha_estimates(self):
        n_days_4_smoothing = self._n4s
        threshold = self._thresh
        output_df = pd.DataFrame()
        dates_df = pd.DataFrame()
        for district in self._all_districts:
            output_per_district = []
            district_data = self.infections_by_date_and_district[self.infections_by_date_and_district.districtId == district]
            n_row = district_data.shape[0]
            thresh_passed = False
            smoothed_new = np.NaN
            alpha_t = np.NaN
            threshold_date = np.NaN                    
            smoothed_new_day_before = np.NaN
            for row in range(n_row):
                if not thresh_passed and district_data.new_infected.iloc[row] >= threshold:
                    thresh_passed = True #saves info that threshold was passed
                    threshold_date = district_data.date.iloc[row]
                if not thresh_passed:
                    continue #next row, no writing into output, when threshold not passed this row at the latest
                if row >= (n_days_4_smoothing - 1): #first e.g. six days stay initialized NaN when 7-day-smoothing
                    smoothing_rows = range(row - n_days_4_smoothing + 1,
                                           row + 1)
                    smoothed_new = district_data.new_infected.iloc[smoothing_rows].mean()
                    if smoothed_new_day_before != 0:
                        alpha_t = smoothed_new / smoothed_new_day_before
                        alpha_t = np.NaN
                    smoothed_new_day_before = smoothed_new
                    {'districtId' : district,
                     'date' : district_data.date.iloc[row],
                     'smoothed_new' : smoothed_new,
                     'alpha_t' : alpha_t}
            output_df = output_df.append(pd.DataFrame(data = output_per_district[1:])) #drop first row due to alpha_t = NaN
            dates_df = dates_df.append(pd.DataFrame(data = {'districtId' : [district],
                                                            'threshold_date' : [threshold_date]}))
        output_df.alpha_t.fillna(0, inplace=True)
        #if more than one week without infections, div by zero causes NaNs, which are replaced with zeros here

        return [output_df.reset_index(drop = True),
                dates_df.reset_index(drop = True)] #reset index, because it was restarting for each district

This starting threshold makes sure that only districts with a substantial COVID-19 occurrence are considered because otherwise there would not be a curve to flatten. We chose 5 because after several trials, this number provided the optimal trade-off between estimation stability and data loss.

Second rate of change — the inhibition coefficient: the linear rate of change of the alpha coefficients (spread rates) over the period between the 1st and 14th of April 2020. This period was the same for all districts. Because an unhindered infection’s curve shape corresponds to exactly one alpha value, reducing a district’s alpha is synonymous with flattening its curve. Thus, a highly negative inhibition coefficient indicates a well-performing district.

 def inhib_coef_date_range(self, first_date, last_date):

        """Function to calculate the rate of change of smoothed growth factors between first and last date.        
        Takes as input first and last date, and the dataframe containing dates, districtIds and the smoothed growth factors, i.e. the 0th output of the function defined above, smoothed_growth_and_inhibition(). The name for this dataframe that is assigned above is set as default here.
        First and last date have to be entered in YYYY-MM-DD format. Errors are raised when format is violated. This catches e.g. June 17th as 2020-17-06, but it obviously cannot save you from mixing up e.g. June 5th and May 6th, so beware!
        Dates are also checked to be in the data at all, with specific errors for first and last dates out of bounds. Function also prints number of individual districts that are out of bounds for a given pair of dates.
        Output is a dataframe containing districtIds and partial inhibition coefficients.
        Districts where dates were out of bounds are dropped from output.
        if not re.match('^202\d-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$', first_date):
          print('First date format invalid. Correct format: YYYY-MM-DD')
        if not re.match('^202\d-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$', last_date):
          print('Last date format invalid. Correct format: YYYY-MM-DD')
        if not any(first_date == self.alpha_estimates.date):
          print("First date out of bounds!")
        if not any(last_date == self.alpha_estimates.date):
          print("Last date out of bounds!")

        self._start_date_inhib_coef = first_date
        self._end_date_inhib_coef = last_date

        output_df = pd.DataFrame() #initialise output df
        count_out_of_bounds = 0 #initialise counter for districts with no data for the date range
        count_alpha_gaps = 0 #initialise counter for districts with NaNs in their alpha esitmates during date range

        for dist in self._all_districts:
          district_data = self.alpha_estimates[self.alpha_estimates.districtId == dist].copy()

          if not set((first_date, last_date)).issubset(district_data.date):
            count_out_of_bounds += 1
          index_first_date = district_data.index[district_data.date == first_date][0]
          index_last_date = district_data.index[district_data.date == last_date][0]
          index_range = range(index_first_date, index_last_date + 1)
          cut_dist_data = district_data[district_data.index.isin(index_range)]

          # if any(cut_dist_data.alpha_t.isna()):
          #   # cut_dist_data.alpha_t.fillna(0, inplace=True)
          #   count_alpha_gaps += 1
          #   continue

          y = cut_dist_data['alpha_t']          
          X = np.array(range(len(y))).reshape((-1,1))                      
          regr = linear_model.LinearRegression()                      
          regr.fit(X, y)
          output_df = output_df.append({'districtId' : dist,
                                      'partial_inhib_coef' : regr.coef_[0]},
        output_df['districtId'] = output_df.districtId.astype('int64')
        stdout.write("\n- CALCULATING INHIBITION COEFFICIENTS -\nDistricts out of bounds: " + 
                     str(count_out_of_bounds))# + "\nDistricts with gaps in alpha estimates: " + str(count_alpha_gaps)) 


You can find more information about the basics of epidemics modelling in our background paper provided by Niklas Wulkow.

Structural Data

Next, we focused on the structural data. Two indices, ruralness and socioeconomic status, were created by applying principal component analysis (PCA) on a standardized set of variables as conceptualized by Thünen and applied by the German Federal Ministry of Food and Agriculture (BMEL).

Ruralness contained settlement density, area percentage of farming and forest areas, residence percentage of 1- and 2-family houses, surrounding population densities weighted by distance, and distance to the next five urban centers.

Socioeconomic status encompassed unemployment rate, mean salaries, mean income, communal tax revenue, net population migration, residence vacancies, life expectancy for men and women, and percentage of high-school dropouts.



The indices per district were correspondingly calculated by multiplying the district values of the loading variables with the loadings identified for the first principal component, as the first component is always the one with the highest variance explanation (and in our case the only one, as we are reducing the dimensionality to 1 for each index). Thanks to scikit-learn, this logic is contained in the pca_.transform method.

To account for a potential mismatch of the indices’ direction due to its dependency on the random state the machine is in when performing the analysis, the code contains some checkpoints that ensure conformity between the loading directions expected and those found by the algorithm. Thus, we ensured the creation of a ruralness index and not an urbaneness index.

scaler = StandardScaler()

#First, create Ruralness Index:
df_unscaled_Ruralness = structuredata_dense[Ruralness_vars]
df_scaled_Ruralness = pd.DataFrame(scaler.fit_transform(df_unscaled_Ruralness), columns = Ruralness_vars)

pca_ = PCA(1, random_state = 1)
variance_explained_Ruralness = pca_.explained_variance_ratio_
print("Ruralness Index captures %4.1f %% of the variance in its %d constituents." % 
      ((variance_explained_Ruralness * 100), len(Ruralness_vars)))

#this next part checks whether loadings are all in the right direction
loading_signs_found_Ruralness = np.sign(pca_.components_)
loading_signs_ought_Ruralness = [-1,1,1,-1,1]

if (loading_signs_found_Ruralness == loading_signs_ought_Ruralness).all():
  structuredata_dense['Ruralness_Index'] = pca_.transform(df_scaled_Ruralness)
elif (- loading_signs_found_Ruralness == loading_signs_ought_Ruralness).all(): 
  #if they are simply reversed, i.e. doing a urbanness index, they are re-reversed here:
  structuredata_dense['Ruralness_Index'] = - pca_.transform(df_scaled_Ruralness)
  raise ValueError('Ruralness-loadings were not as expected! Reversing their signs did not help!')

#Next, the same for the Socioeconomic status:
df_unscaled_Socioecon = structuredata_dense[Socioecon_vars]
df_scaled_Socioecon = pd.DataFrame(scaler.fit_transform(df_unscaled_Socioecon), columns = Socioecon_vars)

variance_explained_Socioecon = pca_.explained_variance_ratio_
print("Socioeconomic Index captures %4.1f %% of the variance in its %d constituents." % 
      ((variance_explained_Socioecon * 100), len(Socioecon_vars)))

loading_signs_found_Socioecon = np.sign(pca_.components_) #same direction check as above
loading_signs_ought_Socioecon = [-1,1,1,1,1,-1,1,1,-1]

if (loading_signs_found_Socioecon == loading_signs_ought_Socioecon).all():
  structuredata_dense['Socioecon_Index'] = pca_.transform(df_scaled_Socioecon)
elif (- loading_signs_found_Socioecon == loading_signs_ought_Socioecon).all():
  structuredata_dense['Socioecon_Index'] = - pca_.transform(df_scaled_Socioecon)
  raise ValueError('Socioecon-loadings were not as expected! Reversing their signs did not help!')
Rural Index


The PCA-created variables — ruralness and socioeconomic status — were used to cluster the districts and identify homologues. The algorithm of choice was k-means, and the number of clusters was identified via the Elbow Method. The latter looks at the sum of squares of distances of every data point from its corresponding cluster centroid (the within-cluster sum of squares (WCSS)) and its change with a rising number of clusters. The point after which there is no sudden change to be seen corresponds to the optimal number of clusters. In our case, 5 clusters seemed to be optimal.

clust_df = structuredata_dense[['Socioecon_Index', 'Ruralness_Index']].copy()

for k in range(1,max_k):
  kmeans = KMeans(k)
  wcss_iter = kmeans.inertia_

number_of_clusters = range(1,max_k)
#random state makes sure that cluster labels stay the same between executions
structuredata_dense['cluster'] = kmeans.fit_predict(clust_df)
groups = structuredata_dense.groupby("cluster")
for name, group in groups:
  plt.plot(group['Ruralness_Index'], group['Socioecon_Index'], marker="o", linestyle="", label=name)
  plt.ylabel('Socioeconomic Status')

We were interested in the effect of skipping the clustering and instead using ruralness and socio-economic status as additional predictors.

Of the ten districts that we originally identified as PDs, 6 had also been PDs when we used clustering. However, 2 of the 5 clusters were entirely absent from this new set of PDs. In practice, this might lead to identifying PD explanators that do not translate to all other units. For example, finding early subway closures to be an effective strategy would not yield actionable insights for rural villages.

Additionally, the linear regression model calculated across all districts only accounted for 2% of the performance measure’s variance. Thus, the “traditional” two-step procedure of clustering and within-cluster prediction proved advantageous for our use-case as well.

Positive Deviants Identification

The identification of Positive Deviants is at the heart of this analysis. We searched for Positive Deviance within the clusters we had previously defined.

A few things to consider:

The predictor variables were chosen based on factors frequently discussed in the media and learned through a consultation with an epidemiologist. To account for the fact that performing worse at time t means more room for improvement at time t+1, we also included the average growth rate between March 25th and 31st, defined as the geometric mean of the alpha_t values of those days

Predictor Variable

The number of positive deviants per cluster (2) was chosen arbitrarily for this prototype.

A district was classified as a positive deviant when the actual inhibition coefficient was lower than the predicted one, and this difference was among the two highest in the cluster.


Linear Regression

We started with a linear regression model for its simplicity and the ease with which we could derive information about potentially relevant factors from the regression coefficients.

When interpreting the regression weights, keep in mind that lower values of the performance measure indicate a better performance (because of a stronger curve flattening). The higher a predictor with a negative regression weight, the better the performance, and vice versa.

The preliminary goal of this prototype was the identification of positive deviants and not the out-of-sample prediction of the inhibition coefficient, which is why we did not apply a train-test split. As a result, the focus was placed on developing a model that retrospectively explains a sufficient part of the performance measure variance. Our goal was to control for those parts of the performance measure’s variance attributable to the covariates, not to accurately predict unseen values. Thus, we used linear regression for splitting the inhibition coefficient’s variance into variance explained and not explained by the predictors. Splitting the data set into training and test sets would have been counterproductive to that end.

PDs_per_district = 2
Positive_Deviants = pd.DataFrame()

for cluster in np.unique(all_data.cluster):
    cluster_data = all_data[all_data.cluster == cluster].dropna(axis=0)
    X_unscaled = cluster_data[predictor_variables]
    scaler = StandardScaler()
    X = scaler.fit_transform(X_unscaled)
    y = cluster_data['partial_inhib_coef']
    regr = LinearRegression()
    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)
    # regr.fit(X_train,y_train)
    print('- CLUSTER', cluster, '- \nLinear model controlled for %4.1f %% of the variance in the performance measure.' % 
    # print('- CLUSTER', cluster, '- \nLinear model controlled for %4.1f %% of the training variance in the performance measure.' % 
    #       (100*regr.score(X_train,y_train)))
    # print('- CLUSTER', cluster, '- \nLinear model controlled for %4.1f %% of the test variance in the performance measure.' % 
    #       (100*regr.score(X_test,y_test)))
    coefs = pd.DataFrame({'predictor':predictor_variables,
                          'regression_weight': regr.coef_})
    plt.barh(coefs.predictor, coefs.regression_weight)
    plt.xscale('symlog', basex=10, linthreshx=.001)
    plt.xlabel('regression weight')
    cluster_data['predicted_IC'] = regr.predict(X)
    cluster_data['deviance'] =  cluster_data['predicted_IC'] - cluster_data['partial_inhib_coef']
    #When the district performs surprisingly well, the partial_inhib_coef will be lower than
    #the predicted one, resulting in positive values here.

    ranked_by_deviance = cluster_data.sort_values('deviance',ascending=False)
    PDs = ranked_by_deviance.iloc[range(PDs_per_district),]
    Positive_Deviants = Positive_Deviants.append(PDs).reset_index(drop=True)

The linear regression was able to explain between 2% and 48% of the variance per cluster. Due to this wide interval, we applied a second model: The Random Forest Regressor.

Random Forest Regressor

It is an ensemble method that uses bagging as the ensemble method and decision tree as the individual model. “Forest” because it combines multiple decision trees and “random” because it performs random sampling on both the observations and a subset of features for splitting nodes. In effect, random forest produces many individually overfitted decision trees that, due to their different samples, overfit in different ways, leading them to exhibit random errors. As a result, when brought together, their errors average to 0 and often produce a well-performing model.

For the same reasons we outlined for the linear regression model, we did not use a train-test split and applied the model to the whole data set. The variance explained by the model was between 80% and 86% (R2 score) within the clusters.

PDs_per_district_rf = 2
Positive_Deviants_rf = pd.DataFrame()
fi_all_rf = pd.DataFrame() 

for cluster in np.unique(all_data.cluster):
    cluster_data_rf = all_data[all_data.cluster == cluster].dropna(axis=0)
    X_unscaled_rf = cluster_data_rf[predictor_variables]
    scaler_rf = StandardScaler()
    X_rf = scaler_rf.fit_transform(X_unscaled_rf)
    #X_rf = scaler_rf.fit(X_unscaled_rf)
    X_df_rf = pd.DataFrame(X_rf, index=X_unscaled_rf.index, columns=X_unscaled_rf.columns)
    #X_df_rf = pd.DataFrame(scaler_rf.transform(X_unscaled_rf), index=X_unscaled_rf.index.values, columns=X_unscaled_rf.columns.values)
    y_rf = cluster_data_rf['partial_inhib_coef']
    rf = RandomForestRegressor(n_estimators = 100, random_state = 42)

    cluster_data_rf['predicted_IC'] = rf.predict(X_df_rf)
    cluster_data_rf['deviance'] =  cluster_data_rf['predicted_IC'] - cluster_data_rf['partial_inhib_coef']
    #When the district performs surprisingly well, the partial_inhib_coef will be lower than
    #the predicted one, resulting in positive values here.
    ranked_by_deviance_rf = cluster_data_rf.sort_values('deviance',ascending=False)
    PDs_rf = ranked_by_deviance_rf.iloc[range(PDs_per_district_rf),]
    Positive_Deviants_rf = Positive_Deviants_rf.append(PDs_rf).reset_index(drop=True)

    # calculate model R2 score per cluster
    print('- CLUSTER', cluster, '- \n Random Forest Regressor controlled for %4.1f %% of the variance in the performance measure.' % (100*rf.score(X_rf,y_rf)))

In the next step, feature importance was determined (read this article by Eryk Lewinson for background information). While the article outlines several methods, methods we adapted in our code, we’ll concentrate on the most accurate one — drop column. The method compares the performance of the model (measured by R2) with the feature included in the model to the performance when the feature is excluded. Negative importance, in this case, means that dropping a feature from the model would improve its performance. In contrast to the linear regression weights, this method illustrates only the importance of the measure and not the direction of its impact on the inhibition coefficient.

Nursing homes and/or population density were among the top factors in all clusters, perhaps because both factors enable social distancing. If your neighbor lives in a different 1-family home than you, you can better avoid infection than if you lived in the same apartment. Following the same logic, it is easier to isolate a stationary care facility than it is to ensure infection-free ambulatory care work. Keep in mind, however, that these are just speculations at this point.

#calculate & visualise drop-column feature importance
    #--> we investigate the importance of a feature by comparing a model with all features versus a model with this feature dropped for training
    def drop_col_feat_imp(model, X_df_rf, y_rf, random_state = 42):
      # clone the model to have the exact same specification as the one initially trained
      model_clone = clone(model)
      # set random_state for comparability
      model_clone.random_state = random_state
      # training and scoring the benchmark model
      model_clone.fit(X_df_rf, y_rf)
      benchmark_score = model_clone.score(X_df_rf, y_rf)
      # list for storing feature importances
      importances_rf = []

      # iterating over all columns and storing feature importance (difference between benchmark and new model)
      for col in X_df_rf.columns:
          model_clone = clone(model)
          model_clone.random_state = random_state
          model_clone.fit(X_df_rf.drop(col, axis = 1), y_rf)
          drop_col_score = model_clone.score(X_df_rf.drop(col, axis = 1), y_rf)
          importances_rf.append(benchmark_score - drop_col_score)
      importances_df_rf = imp_df(X_df_rf.columns, importances_rf,cluster)
      return importances_df_rf

    drop_imp = drop_col_feat_imp(rf, X_df_rf, y_rf)
    var_imp_plot(drop_imp[['feature','feature_importance']], 'Drop Column feature importance cluster ' + str(cluster))

Comparison between the two models & current results

Eight of the identified positive deviants in the clusters overlapped between the 2 models, sometimes with changed positions. We took this to be an indicator that their positive performance is not due to artifacts of the prediction method.

Random forest, with its high goodness of fit (retrospectively), offers suggestions for structural factors that may be of importance to a district’s general ability to flatten the curve, with PDs flattening their respective curve even above and beyond that.

Both models in their current states cannot be used to predict an unseen inhibition coefficient, as explained in the linear regression section. This is a point for future consideration.

Below is a summary of our results per cluster as of the 15th June 2020. Note that the PD graphs depict the performance measure calculated for the blue interval, controlled for mean growth during the red interval.


Suggestions for moving forward

  • Refining the predictors: the structural factors can be expanded, e.g. by adding a more granular measure for age. Furthermore, their selection can be optimized by applying methods such as Chi-squared test, LASSO, Ridge regression, etc.
  • Adding predictors: adding new, non-structural predictors, such as connectedness of districts with early COVID-19 hotspots measured by the frequency of travelling between the districts (input Basma Albanna). We could also add behavioral data from social media or mobility data. By using those data sources, however, data privacy and protection should become an even higher priority.
  • Prototype replication with structural data from other countries.
  • Application of more robust outlier-detection methods.
  • A closer investigation of the factors identified by the feature importance to understand what contributed to a positive-deviant formulation and how.