"This study takes Guizhou Province as a typical case and deeply analyzes the driving mechanisms and future trends of rural income in the context of the rural revitalization strategy."
That's a line from the undergraduate's thesis proposal. By the time I took the job, the first draft was already with the advisor, most of the analysis was "done," and the advisor had been turning up the heat.
In practice, the driving mechanism analysis didn't uncover anything profound. Multicollinearity, on the other hand, gave me quite the thorough analysis.
The Data
Annual data for Guizhou Province, 2007–2023. Seventeen observations, ten independent variables, one dependent variable:
| Variable | Meaning | Unit |
|---|---|---|
| x1 | GDP per capita | yuan |
| x2 | Value added of primary industry | 100 million yuan |
| x3 | Value added of tertiary industry | 100 million yuan |
| x4 | Urbanization rate | % |
| x5 | Agricultural expenditure | 100 million yuan |
| x6 | Rural per capita consumption expenditure | yuan |
| x7 | Total power of agricultural machinery | 10,000 kW |
| x8 | Total sown area | 1,000 hectares |
| x9 | Effective irrigated area | 1,000 hectares |
| x10 | Crop failure area | 10,000 hectares |
| y | Rural per capita disposable income | yuan |
Seventeen data points, ten independent variables. I knew there would be trouble before I even started modeling — degrees of freedom were already running thin.
(The student said the data came from the statistical yearbook. Fair enough — provincial annual data only goes back so many years, and you can always find plenty of variables, but the sample size is a hard ceiling. This isn't the student's fault; it's a structural limitation of this kind of research.)
Step 1: Check the Correlations
The correlation matrix made things pretty clear:
Correlations with y:
| Variable | r with y |
|---|---|
| x6 | 0.999 |
| x3 | 0.996 |
| x1 | 0.991 |
| x2 | 0.990 |
| x4 | 0.983 |
| x7 | 0.911 |
| x5 | 0.888 |
| x8 | 0.689 |
| x9 | 0.625 |
| x10 | -0.361 |
Almost every economic indicator is highly correlated with rural income. There are twenty-seven variable pairs with r > 0.9. This isn't "rich data" — it's redundant information. Everyone is saying the same thing.
Step 2: OLS with All Variables
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{10} x_{10} + \varepsilon$$Result: R² = 1.000, Adj R² = 0.999.
Is it perfect? Yes. Does it mean anything? No.
Out of ten independent variables, only x6 (p < 0.001) and x9 (p = 0.029) were significant. The other eight were all insignificant. The condition number was 3.2×10⁶ — textbook severe multicollinearity.
The VIF values were even more absurd:
| Variable | VIF |
|---|---|
| x4 | 1435.41 |
| x1 | 987.17 |
| x2 | 535.78 |
| x3 | 527.46 |
| x6 | 450.97 |
| x8 | 22.03 |
| x5 | 31.11 |
| x10 | 6.54 |
| x7 | 13.68 |
| x9 | 4.25 |
VIF > 10 is already considered problematic. x4 had a VIF of 1435. This isn't just collinearity — these variables are copying each other's homework.
Step 3: Stepwise Regression
AIC-based stepwise selection kept only two variables: x6 and x9.
$$y = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$| Variable | Coefficient | Std. Error | t-value | p-value |
|---|---|---|---|---|
| const | -376.55 | 296.49 | -1.27 | 0.225 |
| x6 | 0.94 | 0.009 | 102.01 | <0.001 |
| x9 | 1.39 | 0.303 | 4.57 | <0.001 |
R² = 0.999, Adj R² = 0.999. F = 8537 (p < 0.001).
The VIF for both x6 and x9 was 1.55 — finally, no collinearity.
But the result was frustrating. Rural per capita consumption expenditure (x6) and effective irrigated area (x9) explain 99.9% of the variation in rural income? x6 and y are almost the same thing measured from different angles — high consumption means high income. That's not a "driving mechanism." That's a tautology.
(It's like saying "a person earns a lot because they spend a lot." Logically not wrong, but it explains nothing.)
Step 4: GAM — Semi-Parametric Model
Since linear models weren't working, let's try nonlinear. Using spline functions to fit the nonlinear effects of x3 and x7:
$$y = s(x_3) + x_6 + s(x_7) + x_9$$Result: R² = 1.000, but the residual degrees of freedom were down to 2. Fitting fourteen parameters to seventeen points — this isn't modeling, it's connect-the-dots. None of the variables were significant.
Adding x2 to get $y = s(x_2) + s(x_3) + x_6 + s(x_7) + x_9$ used up all remaining degrees of freedom — a perfect overfit.
(GAM is not a bad method. It's just that seventeen data points can't sustain this many degrees of freedom.)
Step 5: PCA Dimension Reduction, Then Model
x2, x3, and x7 were highly correlated, so I ran PCA to extract principal components:
| Principal Component | Variance Explained | Cumulative |
|---|---|---|
| PC1 | 95.02% | 95.02% |
| PC2 | 4.52% | 99.54% |
| PC3 | 0.46% | 100.00% |
PC1 alone explained 95% of the variance. Then:
$$y = \text{PC1} + x_6 + x_9$$The coefficient for PC1 was not significant (p = 0.126). The dimension reduction worked, but the reduced dimension had no explanatory power for y.
Step 6: Holt Double Exponential Smoothing — Pure Forecasting
Since explanatory models were going nowhere, at least I could get the forecasting done. Holt double exponential smoothing (ETS(A,A,N)):
- α (level smoothing parameter) = 0.799
- β (trend smoothing parameter) = 0.799
| Metric | Value |
|---|---|
| ME | 60.64 |
| RMSE | 151.09 |
| MAE | 121.24 |
| MAPE | 1.98% |
MAPE was only 1.98% — decent fit. But both α and β were pushed to the upper bound, meaning the model was highly sensitive to recent data, with trend updates happening too fast, making long-term forecasts prone to divergence.
Five-year forecast:
| Year | Forecast (yuan) |
|---|---|
| 2024 | 15835 |
| 2025 | 16870 |
| 2026 | 17906 |
| 2027 | 18941 |
| 2028 | 19977 |
Annual growth of about 1035 yuan, linear extrapolation. It looks reasonable, but there's no structural explanation — you don't know why the growth rate is what it is, or what factors might change it.
(The auto-tuning principle of the Holt model: under the ETS(A,A,N) state-space framework, α and β are optimized via maximum likelihood estimation. The level equation $l_t = l_{t-1} + b_{t-1} + \alpha e_t$ and the trend equation $b_t = b_{t-1} + \beta e_t$ are both residual-driven — the larger the coefficients, the more sensitive the model. The optimizer searches within the constraint range $(10^{-6}, 1-10^{-6})$ using L-BFGS-B until convergence. Both α and β hitting 0.799 means the trend in this dataset changes rapidly, and the model needs to keep up.)
Step 7: Back to Interpretability
After messing around with GAM, PCA, and Holt, I had to return to the linear model — but with a different goal this time. Not chasing the highest R², but finding a model where all parameters are significant. At an undergraduate thesis defense, the most common question is "Is this variable significant?" You can't show up with a row of insignificant variables.
I tried various combinations:
| Model | R² | Adj R² | AIC | Significant Variables |
|---|---|---|---|---|
| y~x6+x9 | 0.9992 | 0.9991 | 213.9 | x6***, x9*** |
| y~x3+x6+x9 | 0.9992 | 0.9991 | 214.6 | x6***, x9*** |
| y~x4+x6+x9 | 0.9992 | 0.9990 | 215.4 | x6***, x9*** |
| y~x7+x6+x9 | 0.9993 | 0.9991 | 213.1 | x6***, x9*** |
| y~x3+x7+x6+x9 | 0.9994 | 0.9992 | 212.8 | x6***, x9*** |
No matter what variable I added, only x6 and x9 remained consistently significant. Adding other variables didn't produce significant coefficients, and AIC didn't improve.
Final model:
$$\hat{y} = -376.55 + 0.94 \times x_6 + 1.39 \times x_9$$- x6 (rural per capita consumption expenditure): Coefficient 0.94. For every 1 yuan increase in consumption expenditure, disposable income rises by about 0.94 yuan. The consumption-to-income relationship is nearly 1:1 — this looks more like an accounting identity than a causal mechanism.
- x9 (effective irrigated area): Coefficient 1.39. For every additional 1,000 hectares of effective irrigated area, per capita disposable income increases by about 1.39 yuan. This is the only "real" driver — improvements in agricultural infrastructure do raise rural income.
VIF for both was 1.55. Durbin-Watson = 2.30. Residuals passed normality (Shapiro-Wilk p = 0.391). Model diagnostics were fine. The problem was just that the explanatory power was paper-thin.
Reflections
The technical roadmap for this project was basically copied from a master's thesis, but the student couldn't reproduce the success.
The biggest challenge wasn't the lack of methods — I tried OLS, stepwise regression, GAM, PCA, and Holt — it was the limitations of the data itself.
Seventeen annual observations, ten highly collinear independent variables. This is the curse of provincial-level macroeconomic data: economic indicators move in lockstep. When GDP rises, everything rises. When urbanization advances, every metric moves with it. Trying to tease out "what drives what" from this kind of data is, at its core, asking statistical methods to do something they cannot do.
The final y ~ x6 + x9 model had all significant parameters and passed all diagnostics, but the relationship between x6 and y was too close to call it a "driver." x9, on the other hand, was a meaningful finding — effective irrigated area genuinely matters for rural income in a mountainous province like Guizhou, and that result aligns with intuition.
If I could do it over, I'd probably:
- Do variable selection from the start rather than throwing all ten variables in
- Consider using panel data at the prefectural level instead of provincial time series, to increase both sample size and variation
- Think more carefully about the causal relationship between x6 and y — consumption expenditure is more likely a result of income than a cause
But then again, finding a model where all parameters are significant, given such a small sample and such severe collinearity — I'd say I did right by that student.
At the very least, the significance of x9 convinces me that Guizhou's irrigation infrastructure is actually making a difference. That conclusion, even if it's propped up by just seventeen data points, is one I'm willing to believe.