Notes on Modeling Rural Income Drivers in Guizhou

"This study takes Guizhou Province as a typical case and deeply analyzes the driving mechanisms and future trends of rural income in the context of the rural revitalization strategy."

That's a line from the undergraduate's thesis proposal. By the time I took the job, the first draft was already with the advisor, most of the analysis was "done," and the advisor had been turning up the heat.

In practice, the driving mechanism analysis didn't uncover anything profound. Multicollinearity, on the other hand, gave me quite the thorough analysis.

The Data

Annual data for Guizhou Province, 2007–2023. Seventeen observations, ten independent variables, one dependent variable:

Variable	Meaning	Unit
x1	GDP per capita	yuan
x2	Value added of primary industry	100 million yuan
x3	Value added of tertiary industry	100 million yuan
x4	Urbanization rate	%
x5	Agricultural expenditure	100 million yuan
x6	Rural per capita consumption expenditure	yuan
x7	Total power of agricultural machinery	10,000 kW
x8	Total sown area	1,000 hectares
x9	Effective irrigated area	1,000 hectares
x10	Crop failure area	10,000 hectares
y	Rural per capita disposable income	yuan

Seventeen data points, ten independent variables. I knew there would be trouble before I even started modeling — degrees of freedom were already running thin.

(The student said the data came from the statistical yearbook. Fair enough — provincial annual data only goes back so many years, and you can always find plenty of variables, but the sample size is a hard ceiling. This isn't the student's fault; it's a structural limitation of this kind of research.)

Step 1: Check the Correlations

The correlation matrix made things pretty clear:

Correlations with y:

Variable	r with y
x6	0.999
x3	0.996
x1	0.991
x2	0.990
x4	0.983
x7	0.911
x5	0.888
x8	0.689
x9	0.625
x10	-0.361

Almost every economic indicator is highly correlated with rural income. There are twenty-seven variable pairs with r > 0.9. This isn't "rich data" — it's redundant information. Everyone is saying the same thing.

Step 2: OLS with All Variables

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{10} x_{10} + \varepsilon$$

Result: R² = 1.000, Adj R² = 0.999.

Is it perfect? Yes. Does it mean anything? No.

Out of ten independent variables, only x6 (p < 0.001) and x9 (p = 0.029) were significant. The other eight were all insignificant. The condition number was 3.2×10⁶ — textbook severe multicollinearity.

The VIF values were even more absurd:

Variable	VIF
x4	1435.41
x1	987.17
x2	535.78
x3	527.46
x6	450.97
x8	22.03
x5	31.11
x10	6.54
x7	13.68
x9	4.25

VIF > 10 is already considered problematic. x4 had a VIF of 1435. This isn't just collinearity — these variables are copying each other's homework.

Step 3: Stepwise Regression

AIC-based stepwise selection kept only two variables: x6 and x9.

$$y = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$

Variable	Coefficient	Std. Error	t-value	p-value
const	-376.55	296.49	-1.27	0.225
x6	0.94	0.009	102.01	<0.001
x9	1.39	0.303	4.57	<0.001

R² = 0.999, Adj R² = 0.999. F = 8537 (p < 0.001).

The VIF for both x6 and x9 was 1.55 — finally, no collinearity.

But the result was frustrating. Rural per capita consumption expenditure (x6) and effective irrigated area (x9) explain 99.9% of the variation in rural income? x6 and y are almost the same thing measured from different angles — high consumption means high income. That's not a "driving mechanism." That's a tautology.

(It's like saying "a person earns a lot because they spend a lot." Logically not wrong, but it explains nothing.)

Step 4: GAM — Semi-Parametric Model

Since linear models weren't working, let's try nonlinear. Using spline functions to fit the nonlinear effects of x3 and x7:

$$y = s(x_3) + x_6 + s(x_7) + x_9$$

Result: R² = 1.000, but the residual degrees of freedom were down to 2. Fitting fourteen parameters to seventeen points — this isn't modeling, it's connect-the-dots. None of the variables were significant.

Adding x2 to get $y = s(x_2) + s(x_3) + x_6 + s(x_7) + x_9$ used up all remaining degrees of freedom — a perfect overfit.

(GAM is not a bad method. It's just that seventeen data points can't sustain this many degrees of freedom.)

Step 5: PCA Dimension Reduction, Then Model

x2, x3, and x7 were highly correlated, so I ran PCA to extract principal components:

Principal Component	Variance Explained	Cumulative
PC1	95.02%	95.02%
PC2	4.52%	99.54%
PC3	0.46%	100.00%

PC1 alone explained 95% of the variance. Then:

$$y = \text{PC1} + x_6 + x_9$$

The coefficient for PC1 was not significant (p = 0.126). The dimension reduction worked, but the reduced dimension had no explanatory power for y.

Step 6: Holt Double Exponential Smoothing — Pure Forecasting

Since explanatory models were going nowhere, at least I could get the forecasting done. Holt double exponential smoothing (ETS(A,A,N)):

α (level smoothing parameter) = 0.799
β (trend smoothing parameter) = 0.799

Metric	Value
ME	60.64
RMSE	151.09
MAE	121.24
MAPE	1.98%

MAPE was only 1.98% — decent fit. But both α and β were pushed to the upper bound, meaning the model was highly sensitive to recent data, with trend updates happening too fast, making long-term forecasts prone to divergence.

Five-year forecast:

Year	Forecast (yuan)
2024	15835
2025	16870
2026	17906
2027	18941
2028	19977

Annual growth of about 1035 yuan, linear extrapolation. It looks reasonable, but there's no structural explanation — you don't know why the growth rate is what it is, or what factors might change it.

(The auto-tuning principle of the Holt model: under the ETS(A,A,N) state-space framework, α and β are optimized via maximum likelihood estimation. The level equation $l_t = l_{t-1} + b_{t-1} + \alpha e_t$ and the trend equation $b_t = b_{t-1} + \beta e_t$ are both residual-driven — the larger the coefficients, the more sensitive the model. The optimizer searches within the constraint range $(10^{-6}, 1-10^{-6})$ using L-BFGS-B until convergence. Both α and β hitting 0.799 means the trend in this dataset changes rapidly, and the model needs to keep up.)

Step 7: Back to Interpretability

After messing around with GAM, PCA, and Holt, I had to return to the linear model — but with a different goal this time. Not chasing the highest R², but finding a model where all parameters are significant. At an undergraduate thesis defense, the most common question is "Is this variable significant?" You can't show up with a row of insignificant variables.

I tried various combinations:

Model	R²	Adj R²	AIC	Significant Variables
y~x6+x9	0.9992	0.9991	213.9	x6*, x9*
y~x3+x6+x9	0.9992	0.9991	214.6	x6*, x9*
y~x4+x6+x9	0.9992	0.9990	215.4	x6*, x9*
y~x7+x6+x9	0.9993	0.9991	213.1	x6*, x9*
y~x3+x7+x6+x9	0.9994	0.9992	212.8	x6*, x9*

No matter what variable I added, only x6 and x9 remained consistently significant. Adding other variables didn't produce significant coefficients, and AIC didn't improve.

Final model:

$$\hat{y} = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$

x6 (rural per capita consumption expenditure): Coefficient 0.94. For every 1 yuan increase in consumption expenditure, disposable income rises by about 0.94 yuan. The consumption-to-income relationship is nearly 1:1 — this looks more like an accounting identity than a causal mechanism.
x9 (effective irrigated area): Coefficient 1.39. For every additional 1,000 hectares of effective irrigated area, per capita disposable income increases by about 1.39 yuan. This is the only "real" driver — improvements in agricultural infrastructure do raise rural income.

VIF for both was 1.55. Durbin-Watson = 2.30. Residuals passed normality (Shapiro-Wilk p = 0.391). Model diagnostics were fine. The problem was just that the explanatory power was paper-thin.

Reflections

The technical roadmap for this project was basically copied from a master's thesis, but the student couldn't reproduce the success.

The biggest challenge wasn't the lack of methods — I tried OLS, stepwise regression, GAM, PCA, and Holt — it was the limitations of the data itself.

Seventeen annual observations, ten highly collinear independent variables. This is the curse of provincial-level macroeconomic data: economic indicators move in lockstep. When GDP rises, everything rises. When urbanization advances, every metric moves with it. Trying to tease out "what drives what" from this kind of data is, at its core, asking statistical methods to do something they cannot do.

The final y ~ x6 + x9 model had all significant parameters and passed all diagnostics, but the relationship between x6 and y was too close to call it a "driver." x9, on the other hand, was a meaningful finding — effective irrigated area genuinely matters for rural income in a mountainous province like Guizhou, and that result aligns with intuition.

If I could do it over, I'd probably:

Do variable selection from the start rather than throwing all ten variables in
Consider using panel data at the prefectural level instead of provincial time series, to increase both sample size and variation
Think more carefully about the causal relationship between x6 and y — consumption expenditure is more likely a result of income than a cause

But then again, finding a model where all parameters are significant, given such a small sample and such severe collinearity — I'd say I did right by that student.

At the very least, the significance of x9 convinces me that Guizhou's irrigation infrastructure is actually making a difference. That conclusion, even if it's propped up by just seventeen data points, is one I'm willing to believe.