Notes on Modeling Rural Income Drivers in Guizhou

"This study takes Guizhou Province as a typical case and deeply analyzes the driving mechanisms and future trends of rural income in the context of the rural revitalization strategy."

That's a line from the undergraduate's thesis proposal. By the time I took the job, the first draft was already with the advisor, most of the analysis was "done," and the advisor had been turning up the heat.

In practice, the driving mechanism analysis didn't uncover anything profound. Multicollinearity, on the other hand, gave me quite the thorough analysis.

The Data

Annual data for Guizhou Province, 2007–2023. Seventeen observations, ten independent variables, one dependent variable:

VariableMeaningUnit
x1GDP per capitayuan
x2Value added of primary industry100 million yuan
x3Value added of tertiary industry100 million yuan
x4Urbanization rate%
x5Agricultural expenditure100 million yuan
x6Rural per capita consumption expenditureyuan
x7Total power of agricultural machinery10,000 kW
x8Total sown area1,000 hectares
x9Effective irrigated area1,000 hectares
x10Crop failure area10,000 hectares
yRural per capita disposable incomeyuan

Seventeen data points, ten independent variables. I knew there would be trouble before I even started modeling — degrees of freedom were already running thin.

(The student said the data came from the statistical yearbook. Fair enough — provincial annual data only goes back so many years, and you can always find plenty of variables, but the sample size is a hard ceiling. This isn't the student's fault; it's a structural limitation of this kind of research.)

Step 1: Check the Correlations

The correlation matrix made things pretty clear:

Correlations with y:

Variabler with y
x60.999
x30.996
x10.991
x20.990
x40.983
x70.911
x50.888
x80.689
x90.625
x10-0.361

Almost every economic indicator is highly correlated with rural income. There are twenty-seven variable pairs with r > 0.9. This isn't "rich data" — it's redundant information. Everyone is saying the same thing.

Step 2: OLS with All Variables

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{10} x_{10} + \varepsilon$$

Result: R² = 1.000, Adj R² = 0.999.

Is it perfect? Yes. Does it mean anything? No.

Out of ten independent variables, only x6 (p < 0.001) and x9 (p = 0.029) were significant. The other eight were all insignificant. The condition number was 3.2×10⁶ — textbook severe multicollinearity.

The VIF values were even more absurd:

VariableVIF
x41435.41
x1987.17
x2535.78
x3527.46
x6450.97
x822.03
x531.11
x106.54
x713.68
x94.25

VIF > 10 is already considered problematic. x4 had a VIF of 1435. This isn't just collinearity — these variables are copying each other's homework.

Step 3: Stepwise Regression

AIC-based stepwise selection kept only two variables: x6 and x9.

$$y = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$
VariableCoefficientStd. Errort-valuep-value
const-376.55296.49-1.270.225
x60.940.009102.01<0.001
x91.390.3034.57<0.001

R² = 0.999, Adj R² = 0.999. F = 8537 (p < 0.001).

The VIF for both x6 and x9 was 1.55 — finally, no collinearity.

But the result was frustrating. Rural per capita consumption expenditure (x6) and effective irrigated area (x9) explain 99.9% of the variation in rural income? x6 and y are almost the same thing measured from different angles — high consumption means high income. That's not a "driving mechanism." That's a tautology.

(It's like saying "a person earns a lot because they spend a lot." Logically not wrong, but it explains nothing.)

Step 4: GAM — Semi-Parametric Model

Since linear models weren't working, let's try nonlinear. Using spline functions to fit the nonlinear effects of x3 and x7:

$$y = s(x_3) + x_6 + s(x_7) + x_9$$

Result: R² = 1.000, but the residual degrees of freedom were down to 2. Fitting fourteen parameters to seventeen points — this isn't modeling, it's connect-the-dots. None of the variables were significant.

Adding x2 to get $y = s(x_2) + s(x_3) + x_6 + s(x_7) + x_9$ used up all remaining degrees of freedom — a perfect overfit.

(GAM is not a bad method. It's just that seventeen data points can't sustain this many degrees of freedom.)

Step 5: PCA Dimension Reduction, Then Model

x2, x3, and x7 were highly correlated, so I ran PCA to extract principal components:

Principal ComponentVariance ExplainedCumulative
PC195.02%95.02%
PC24.52%99.54%
PC30.46%100.00%

PC1 alone explained 95% of the variance. Then:

$$y = \text{PC1} + x_6 + x_9$$

The coefficient for PC1 was not significant (p = 0.126). The dimension reduction worked, but the reduced dimension had no explanatory power for y.

Step 6: Holt Double Exponential Smoothing — Pure Forecasting

Since explanatory models were going nowhere, at least I could get the forecasting done. Holt double exponential smoothing (ETS(A,A,N)):

MetricValue
ME60.64
RMSE151.09
MAE121.24
MAPE1.98%

MAPE was only 1.98% — decent fit. But both α and β were pushed to the upper bound, meaning the model was highly sensitive to recent data, with trend updates happening too fast, making long-term forecasts prone to divergence.

Five-year forecast:

YearForecast (yuan)
202415835
202516870
202617906
202718941
202819977

Annual growth of about 1035 yuan, linear extrapolation. It looks reasonable, but there's no structural explanation — you don't know why the growth rate is what it is, or what factors might change it.

(The auto-tuning principle of the Holt model: under the ETS(A,A,N) state-space framework, α and β are optimized via maximum likelihood estimation. The level equation $l_t = l_{t-1} + b_{t-1} + \alpha e_t$ and the trend equation $b_t = b_{t-1} + \beta e_t$ are both residual-driven — the larger the coefficients, the more sensitive the model. The optimizer searches within the constraint range $(10^{-6}, 1-10^{-6})$ using L-BFGS-B until convergence. Both α and β hitting 0.799 means the trend in this dataset changes rapidly, and the model needs to keep up.)

Step 7: Back to Interpretability

After messing around with GAM, PCA, and Holt, I had to return to the linear model — but with a different goal this time. Not chasing the highest R², but finding a model where all parameters are significant. At an undergraduate thesis defense, the most common question is "Is this variable significant?" You can't show up with a row of insignificant variables.

I tried various combinations:

ModelAdj R²AICSignificant Variables
y~x6+x90.99920.9991213.9x6***, x9***
y~x3+x6+x90.99920.9991214.6x6***, x9***
y~x4+x6+x90.99920.9990215.4x6***, x9***
y~x7+x6+x90.99930.9991213.1x6***, x9***
y~x3+x7+x6+x90.99940.9992212.8x6***, x9***

No matter what variable I added, only x6 and x9 remained consistently significant. Adding other variables didn't produce significant coefficients, and AIC didn't improve.

Final model:

$$\hat{y} = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$

VIF for both was 1.55. Durbin-Watson = 2.30. Residuals passed normality (Shapiro-Wilk p = 0.391). Model diagnostics were fine. The problem was just that the explanatory power was paper-thin.

Reflections

The technical roadmap for this project was basically copied from a master's thesis, but the student couldn't reproduce the success.

The biggest challenge wasn't the lack of methods — I tried OLS, stepwise regression, GAM, PCA, and Holt — it was the limitations of the data itself.

Seventeen annual observations, ten highly collinear independent variables. This is the curse of provincial-level macroeconomic data: economic indicators move in lockstep. When GDP rises, everything rises. When urbanization advances, every metric moves with it. Trying to tease out "what drives what" from this kind of data is, at its core, asking statistical methods to do something they cannot do.

The final y ~ x6 + x9 model had all significant parameters and passed all diagnostics, but the relationship between x6 and y was too close to call it a "driver." x9, on the other hand, was a meaningful finding — effective irrigated area genuinely matters for rural income in a mountainous province like Guizhou, and that result aligns with intuition.

If I could do it over, I'd probably:

  1. Do variable selection from the start rather than throwing all ten variables in
  2. Consider using panel data at the prefectural level instead of provincial time series, to increase both sample size and variation
  3. Think more carefully about the causal relationship between x6 and y — consumption expenditure is more likely a result of income than a cause

But then again, finding a model where all parameters are significant, given such a small sample and such severe collinearity — I'd say I did right by that student.

At the very least, the significance of x9 convinces me that Guizhou's irrigation infrastructure is actually making a difference. That conclusion, even if it's propped up by just seventeen data points, is one I'm willing to believe.