Capitolo 20 Alberi di Regressione (Regression Tree: RTREE)

20.1 Introduzione

I metodi basati su albero differiscono dai metodi di regressione perché partizionano lo spazio delle feature in un insieme di regioni rettangolari, e quindi adattano un modello semplice (una cosrante) in ciacuna regione. Questi sono concettualmente semplici ma potenti. Descriviamo ora un metodo per la regressione e classificazione basata su albero chiamata CART.

20.2 Esempio: package `rpart`

Illustriamo ora l’idea sottostante agli alberi di regressione che usano dati su marche di automobili presi dal package rpart. rpart calcola la media entro i nodi degli alberi.

Consideriamo il data set cu.summary del package rpart.

library(rpart)
data(cu.summary, package = "rpart")

Facciamo dapprima “crescere” l’albero di un modello ad albero di regressione in cui Mileage dipende dalle variabili Country, Reliability Type:

fit <- rpart(Mileage ~ Price + Country + Reliability + Type, method = "anova",
                         data = cu.summary, control = rpart.control(cp = 0.001))
printcp(fit)    # mostra i risultati

## 
## Regression tree:
## rpart(formula = Mileage ~ Price + Country + Reliability + Type, 
##     data = cu.summary, method = "anova", control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] Price Type 
## 
## Root node error: 1354.6/60 = 22.576
## 
## n=60 (57 observations deleted due to missingness)
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.622885      0   1.00000 1.00994 0.174000
## 2 0.132061      1   0.37711 0.52262 0.101517
## 3 0.025441      2   0.24505 0.36907 0.080615
## 4 0.011604      3   0.21961 0.38624 0.081913
## 5 0.001000      4   0.20801 0.41650 0.082874

plotcp(fit)     # mostra i risultati di cross-validation

summary(fit)    # riepilogo dettagliato degli split

## Call:
## rpart(formula = Mileage ~ Price + Country + Reliability + Type, 
##     data = cu.summary, method = "anova", control = rpart.control(cp = 0.001))
##   n=60 (57 observations deleted due to missingness)
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.62288527      0 1.0000000 1.0099426 0.17399960
## 2 0.13206061      1 0.3771147 0.5226204 0.10151732
## 3 0.02544094      2 0.2450541 0.3690684 0.08061485
## 4 0.01160389      3 0.2196132 0.3862442 0.08191343
## 5 0.00100000      4 0.2080093 0.4165042 0.08287384
## 
## Variable importance
##   Price    Type Country 
##      48      42      10 
## 
## Node number 1: 60 observations,    complexity param=0.6228853
##   mean=24.58333, MSE=22.57639 
##   left son=2 (48 obs) right son=3 (12 obs)
##   Primary splits:
##       Price       < 9446.5  to the right, improve=0.6228853, (0 missing)
##       Type        splits as  LLLRLL,      improve=0.5044405, (0 missing)
##       Reliability splits as  LLLRR,       improve=0.1263005, (11 missing)
##       Country     splits as  --LRLRRRLL,  improve=0.1243525, (0 missing)
##   Surrogate splits:
##       Type    splits as  LLLRLL,     agree=0.950, adj=0.750, (0 split)
##       Country splits as  --LLLLRRLL, agree=0.833, adj=0.167, (0 split)
## 
## Node number 2: 48 observations,    complexity param=0.1320606
##   mean=22.70833, MSE=8.498264 
##   left son=4 (23 obs) right son=5 (25 obs)
##   Primary splits:
##       Type        splits as  RLLRRL,      improve=0.43853830, (0 missing)
##       Price       < 12154.5 to the right, improve=0.25748500, (0 missing)
##       Country     splits as  --RRLRL-LL,  improve=0.13345700, (0 missing)
##       Reliability splits as  LLLRR,       improve=0.01637086, (10 missing)
##   Surrogate splits:
##       Price   < 12215.5 to the right, agree=0.812, adj=0.609, (0 split)
##       Country splits as  --RRLRL-RL,  agree=0.646, adj=0.261, (0 split)
## 
## Node number 3: 12 observations
##   mean=32.08333, MSE=8.576389 
## 
## Node number 4: 23 observations,    complexity param=0.02544094
##   mean=20.69565, MSE=2.907372 
##   left son=8 (10 obs) right son=9 (13 obs)
##   Primary splits:
##       Type    splits as  -LR--L,      improve=0.515359600, (0 missing)
##       Price   < 14962   to the left,  improve=0.131259400, (0 missing)
##       Country splits as  ----L-R--R,  improve=0.007022107, (0 missing)
##   Surrogate splits:
##       Price < 13572   to the right, agree=0.609, adj=0.1, (0 split)
## 
## Node number 5: 25 observations,    complexity param=0.01160389
##   mean=24.56, MSE=6.4864 
##   left son=10 (14 obs) right son=11 (11 obs)
##   Primary splits:
##       Price       < 11484.5 to the right, improve=0.09693168, (0 missing)
##       Reliability splits as  LLRRR,       improve=0.07767167, (4 missing)
##       Type        splits as  L--RR-,      improve=0.04209834, (0 missing)
##       Country     splits as  --LRRR--LL,  improve=0.02201687, (0 missing)
##   Surrogate splits:
##       Country splits as  --LLLL--LR, agree=0.80, adj=0.545, (0 split)
##       Type    splits as  L--RL-,     agree=0.64, adj=0.182, (0 split)
## 
## Node number 8: 10 observations
##   mean=19.3, MSE=2.21 
## 
## Node number 9: 13 observations
##   mean=21.76923, MSE=0.7928994 
## 
## Node number 10: 14 observations
##   mean=23.85714, MSE=7.693878 
## 
## Node number 11: 11 observations
##   mean=25.45455, MSE=3.520661

Nota: \(R_{cp}(T) = R(T) + cp * |T| * R(T_0)\) cioè: se un qualunque split non incrementa l’\(R^2\) complessivo del modelo di una quantità almeno pari a \(cp\) (dove \(R^2\) in questo caso è il classico coefficiente di determinazione dei modelli lineari) allora questo split verrà dichiarato essere, a priori, non migliorativo. Possiamo generare dei grafici aggiuntivi:

op <- par(mfrow = c(1, 2))
rsq.rpart(fit)

## 
## Regression tree:
## rpart(formula = Mileage ~ Price + Country + Reliability + Type, 
##     data = cu.summary, method = "anova", control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] Price Type 
## 
## Root node error: 1354.6/60 = 22.576
## 
## n=60 (57 observations deleted due to missingness)
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.622885      0   1.00000 1.00994 0.174000
## 2 0.132061      1   0.37711 0.52262 0.101517
## 3 0.025441      2   0.24505 0.36907 0.080615
## 4 0.011604      3   0.21961 0.38624 0.081913
## 5 0.001000      4   0.20801 0.41650 0.082874

par(op)

Il valore di X Relative Error è collegabile ai valori dei residui PRESS.

Infine rappresentiamo graficamente l’albero:

plot(fit, uniform = TRUE, main = "Albero di regressione per Mileage")
text(fit, use.n = TRUE, all = TRUE, cex = .8)

Ed ora potiamo l’albero in corrispondenza del valore di cp “ottimo”:

pfit <- prune(fit, cp = 0.025441) # dalla cptable

Ora possiamo tracciare nuovamente l’albero, questa volta potato:

plot(pfit, uniform = TRUE, main = "Albero di regressione potato per Mileage")
text(pfit, use.n = TRUE, all = TRUE, cex = .8)

Che ritorna un albero più semplice rispetto all’originale.

20.3 Esempio: package `party`

Come esempio finale, vorrei mostrare il package party, la cui funzione mob() implementa una variante dell’approccio degli alberi di regressione descritti finora chiamata partizionamento ricorsivo model-based (model-based recursive partitioning).

La funzione mob() permette di stimare modelli di regressione separati, spezzando i dati tramite alcune variabili. La formula per specificare tale modello è:

y ~ x1 + x2 + … + xk | z1 + … + zl,

dove i termini alla sinistra del | specificano le variabili di regressione, mentre i termini alla destra del | specificano le variabili per il partizionamento. La procedura generale per l’adattamento è la seguente:

Adatta il modello una volta su tutte le osservazioni nel nodo corrente.
Valuta se le stime dei parametri sono stabili rispetto a tutte le varibili di partizionamento z1, . . . , zl. Se c’è qualche instabilità, seleziona per il partizionamento la variabile zj associata con il più piccolo p-value; altrimenti fermati.
Calcola i punti di split che ottimizzano localmente una funzione obiettivo.
Riadatta il modello in entrambi i figlio, e ritorna al passo 2.

A scopo illustrativo, consideriamo un esempio basato sul ben noto dataset bostonhousing.

Questo dataset contiene informazioni per 506 case di Boston. In particolare, il dataset fornisce il valore mediano delle case (in migliaia di USD) insieme a 14 covariate, tra cui il numero di camere per abitazione (rm) e la percentuale di popolazione con stato economico basso (lstat).
Una relazione a segmenti tra il valore delle case e queste due variabili risulta molto intuitivo, mentre la forma dell’influenza delle rimanenti covariate è poco chiara e dovrebbe quindi essere appresa dai dati. Di conseguenza, è utilizzato un modello di regressione lineare per il valore mediano spiegato da rm^2 e log(lstat) con k = 3 coefficienti di regressione, e con un partizionamento rispetto alle l = 11 variabili rimanenti.

bostonhousing$lstat <- log(bostonhousing$lstat)
bostonhousing$rm <- bostonhousing$rm^2
bostonhousing$chas <- factor(bostonhousing$chas, levels = 0:1, labels = c("no", "yes"))
bostonhousing$rad <- factor(bostonhousing$rad, ordered = TRUE)

Entrambe le trasformazioni impattano solo sui test di stabilità dei parametri (step 2), ma non sulla procedura di splitting (step 3).

Carichiamo il package party, che permette l’analisi

library(party)

E impostiamo alcuni parametri di controllo per la procedura:

ctrl <- mob_control(alpha = 0.05, bonferroni = TRUE, 
                    minsplit = 40, objfun = deviance, 
                    verbose = TRUE)

Ora possiamo stimare e visualizzare il modello:

fmBH <- mob(medv ~ age + lstat + rm | zn + indus + chas + nox + age + dis + rad + tax + crim + b + ptratio,
            data = bostonhousing, control = ctrl, model = linearModel)

## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                     zn       indus         chas          nox          age
## statistic 3.915651e+01 6.48353e+01 20.144236012 9.483200e+01 3.203777e+01
## p.value   2.748895e-05 7.27457e-11  0.005144677 1.155427e-17 7.976832e-04
##                    dis          rad          tax         crim            b
## statistic 6.984690e+01 1.200538e+02 9.282845e+01 8.822816e+01 3.344074e+01
## p.value   5.503655e-12 4.493345e-11 3.329097e-17 3.759587e-16 4.150856e-04
##                ptratio
## statistic 7.722485e+01
## p.value   1.195801e-13
## 
## Best splitting variable: nox
## Perform split? yes
## -------------------------------------------
## 
## Node properties:
## nox <= 0.597; criterion = 1, statistic = 120.054
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                    zn        indus       chas          nox         age
## statistic 22.16164033 38.155039447 13.1622614 3.613344e+01 21.19908930
## p.value    0.06002158  0.000042868  0.1097181 1.119481e-04  0.08950553
##                    dis          rad          tax         crim          b
## statistic 4.984273e+01 7.873614e+01 5.664802e+01 3.832894e+01 16.1842317
## p.value   1.439457e-07 8.981811e-05 4.795733e-09 3.945393e-05  0.4811923
##                ptratio
## statistic 4.680235e+01
## p.value   6.464767e-07
## 
## Best splitting variable: tax
## Perform split? yes
## -------------------------------------------
## 
## Node properties:
## tax <= 265; criterion = 1, statistic = 78.736
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                  zn      indus      chas       nox       age       dis        rad
## statistic 3.3545026 10.0069447 5.7151219 3.1451982 7.9081059 7.8745370 26.9826810
## p.value   0.9999773  0.5514421 0.9363047 0.9999919 0.8314404 0.8352436  0.7990803
##                 tax        crim         b    ptratio
## statistic 5.0300885 16.99327331 5.7652226 14.0764807
## p.value   0.9953639  0.05201363 0.9823021  0.1528855
## 
## Best splitting variable: crim
## Perform split? no
## -------------------------------------------
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                   zn        indus         chas          nox        age
## statistic 21.5050525 27.142939062 20.926287619 3.319160e+01 13.7113428
## p.value    0.0674627  0.005976414  0.003602557 3.839128e-04  0.7553577
##                    dis          rad         tax         crim          b
## statistic 3.538273e+01 9.122327e+01 25.03035134 30.860920339 15.0157784
## p.value   1.382702e-04 1.475737e-06  0.01509699  0.001121331  0.5826294
##                ptratio
## statistic 3.667774e+01
## p.value   7.520021e-05
## 
## Best splitting variable: rad
## Perform split? yes
## -------------------------------------------
## 
## Splitting ordered factor variable, objective function: 
##        1        2        3        4        5        6        7        8 
##      Inf      Inf      Inf 2187.271 2276.856 2246.727 2212.269      Inf 
## 
## Node properties:
## rad <= 4; criterion = 1, statistic = 91.223
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                    zn        indus         chas          nox        age
## statistic 21.78717354 25.236137411 4.070877e+01 3.461621e+01 12.3419838
## p.value    0.02779648  0.006848526 3.396900e-07 1.312167e-04  0.6403411
##                    dis          rad          tax         crim        b
## statistic 26.876132486 33.584875615 26.618948311 28.795057224 11.40741
## p.value    0.003476391  0.008624286  0.003868135  0.001559656  0.76027
##               ptratio
## statistic 20.62785834
## p.value    0.04411343
## 
## Best splitting variable: chas
## Perform split? yes
## -------------------------------------------
## 
## Splitting factor variable, objective function: 
##  no 
## Inf 
## 
## No admissable split found in 'chas'
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                   zn      indus      chas        nox      age        dis
## statistic 13.6090429 12.3198115 2.5872912 16.2352068 2.022037 13.8616530
## p.value    0.6152638  0.7814215 0.9999817  0.3014644 1.000000  0.5814575
##                  rad        tax     crim         b      ptratio
## statistic 29.9508277 14.1489340 5.150212 9.0700832 25.938682323
## p.value    0.1834797  0.5433336 1.000000 0.9911082  0.006939915
## 
## Best splitting variable: ptratio
## Perform split? yes
## -------------------------------------------
## 
## Node properties:
## ptratio <= 17.4; criterion = 0.993, statistic = 29.951
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                   zn     indus chas       nox       age       dis       rad
## statistic 10.3524693 6.6496014    0 5.5187764 4.5664607 3.5232716 17.831172
## p.value    0.6868439 0.9878916   NA 0.9987369 0.9999295 0.9999995  0.982799
##                 tax      crim         b   ptratio
## statistic 6.6496014 5.7350736 6.5775283 5.7670351
## p.value   0.9878916 0.9978913 0.9892259 0.9977337
## 
## Best splitting variable: zn
## Perform split? no
## -------------------------------------------
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                    zn       indus      chas         nox      age          dis
## statistic 21.75804148 20.45640956 4.9569007 22.42659195 5.095982 3.354260e+01
## p.value    0.04436417  0.07545764 0.9775062  0.03365479 1.000000 2.655947e-04
##                    rad          tax         crim         b      ptratio
## statistic 3.498962e+01 3.257007e+01 26.074408168 9.5942144 3.257007e+01
## p.value   2.954067e-04 4.115671e-04  0.007194095 0.9870579 4.115671e-04
## 
## Best splitting variable: dis
## Perform split? yes
## -------------------------------------------
## 
## Node properties:
## dis <= 1.6074; criterion = 1, statistic = 34.99
## 
## -------------------------------------------
## Fluctuation tests of splitting variables:
##                  zn     indus       chas         nox     age       dis        rad
## statistic 9.5080465 7.3966404 12.9583514 21.92048165 2.18296 6.7267273 23.8624502
## p.value   0.9644422 0.9990245  0.1192863  0.02992438 1.00000 0.9998421  0.0265835
##                   tax        crim          b    ptratio
## statistic 20.79309498 20.61629801 12.7217188 17.7130063
## p.value    0.04739517  0.05091238  0.6560969  0.1498435
## 
## Best splitting variable: rad
## Perform split? yes
## -------------------------------------------
## 
## Splitting ordered factor variable, objective function: 
##        1        2        3        4        5        6        7        8 
##      Inf      Inf      Inf      Inf 875.6391 875.6391 875.6391 875.6391 
## 
## Node properties:
## rad <= 5; criterion = 0.973, statistic = 23.862

fmBH

## 1) nox <= 0.597; criterion = 1, statistic = 120.054
##   2) tax <= 265; criterion = 1, statistic = 78.736
##     3)*  weights = 82 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##    -2.23429      0.02483     -3.01289      0.81660  
## 
##   2) tax > 265
##     4) rad <= 4; criterion = 1, statistic = 91.223
##       5)*  weights = 111 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##    -3.43902     -0.07993      1.06950      0.70388  
## 
##     4) rad > 4
##       6) ptratio <= 17.4; criterion = 0.993, statistic = 29.951
##         7)*  weights = 57 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##    7.731363    -0.006482    -2.864374     0.585465  
## 
##       6) ptratio > 17.4
##         8)*  weights = 91 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##     12.2025      -0.0687      -0.6483       0.3870  
## 
## 1) nox > 0.597
##   9) dis <= 1.6074; criterion = 1, statistic = 34.99
##     10)*  weights = 40 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##     97.1330      -0.1065     -20.6139      -0.2542  
## 
##   9) dis > 1.6074
##     11) rad <= 5; criterion = 0.973, statistic = 23.862
##       12)*  weights = 50 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##     38.3059      -0.1677      -9.2983       0.5834  
## 
##     11) rad > 5
##       13)*  weights = 75 
## Terminal node model
## Linear model with coefficients:
## (Intercept)          age        lstat           rm  
##    61.34934     -0.09697    -12.19267     -0.06927

plot(fmBH)

coef(fmBH)

##    (Intercept)          age       lstat          rm
## 3    -2.234289  0.024827161  -3.0128873  0.81659999
## 5    -3.439016 -0.079931189   1.0695002  0.70387750
## 7     7.731363 -0.006482131  -2.8643737  0.58546514
## 8    12.202515 -0.068697423  -0.6483409  0.38703319
## 10   97.133016 -0.106517216 -20.6139361 -0.25421913
## 12   38.305900 -0.167657196  -9.2983159  0.58342155
## 13   61.349342 -0.096967406 -12.1926707 -0.06926919

summary(fmBH)

## $`3`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -6.479  0.000  0.000  0.000 12.831 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.23429    4.26921  -0.523   0.6022    
## age          0.02483    0.01983   1.252   0.2143    
## lstat       -3.01289    1.16980  -2.576   0.0119 *  
## rm           0.81660    0.06168  13.240   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.18 on 78 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.885 
## F-statistic: 208.8 on 3 and 78 DF,  p-value: < 2.2e-16
## 
## 
## $`5`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -7.703  0.000  0.000  0.000  7.973 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.43902    3.75219  -0.917    0.361    
## age         -0.07993    0.01137  -7.030 2.02e-10 ***
## lstat        1.06950    0.94523   1.131    0.260    
## rm           0.70388    0.05563  12.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.612 on 107 degrees of freedom
## Multiple R-squared:  0.8034, Adjusted R-squared:  0.7979 
## F-statistic: 145.7 on 3 and 107 DF,  p-value: < 2.2e-16
## 
## 
## $`7`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -7.284  0.000  0.000  0.000  8.406 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.731363   6.026114   1.283   0.2051    
## age         -0.006482   0.023136  -0.280   0.7804    
## lstat       -2.864374   1.674810  -1.710   0.0931 .  
## rm           0.585465   0.082100   7.131 2.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.3 on 53 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8131 
## F-statistic: 82.22 on 3 and 53 DF,  p-value: < 2.2e-16
## 
## 
## $`8`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -9.303  0.000  0.000  0.000  6.571 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.20252    4.01279   3.041  0.00312 ** 
## age         -0.06870    0.01638  -4.193 6.59e-05 ***
## lstat       -0.64834    1.15267  -0.562  0.57524    
## rm           0.38703    0.05936   6.520 4.44e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.54 on 87 degrees of freedom
## Multiple R-squared:  0.6675, Adjusted R-squared:  0.6561 
## F-statistic: 58.23 on 3 and 87 DF,  p-value: < 2.2e-16
## 
## 
## $`10`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -8.654  0.000  0.000  0.000 19.814 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  97.1330    22.5728   4.303 0.000123 ***
## age          -0.1065     0.2063  -0.516 0.608780    
## lstat       -20.6139     1.7396 -11.850 5.53e-14 ***
## rm           -0.2542     0.1153  -2.205 0.033922 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.14 on 36 degrees of freedom
## Multiple R-squared:  0.8059, Adjusted R-squared:  0.7897 
## F-statistic: 49.83 on 3 and 36 DF,  p-value: 6.747e-13
## 
## 
## $`12`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -5.745  0.000  0.000  0.000  5.369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.30590    5.98897   6.396 7.36e-08 ***
## age         -0.16766    0.05205  -3.221  0.00235 ** 
## lstat       -9.29832    0.87606 -10.614 5.92e-14 ***
## rm           0.58342    0.04520  12.908  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.471 on 46 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9575 
## F-statistic: 368.9 on 3 and 46 DF,  p-value: < 2.2e-16
## 
## 
## $`13`
## 
## Call:
## NULL
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -7.532  0.000  0.000  0.000  8.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.34934    4.67059  13.135  < 2e-16 ***
## age          -0.09697    0.04014  -2.416   0.0183 *  
## lstat       -12.19267    1.15276 -10.577  3.1e-16 ***
## rm           -0.06927    0.04454  -1.555   0.1243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.894 on 71 degrees of freedom
## Multiple R-squared:  0.6859, Adjusted R-squared:  0.6727 
## F-statistic: 51.69 on 3 and 71 DF,  p-value: < 2.2e-16

Per riepilogare la qualità del modello, possiamo calcolare il mean squared error, la log-verosimiglianza e l’indice AIC:

mean(residuals(fmBH)^2)

## [1] 9.664305

logLik(fmBH)

## 'log Lik.' -1255.384 (df=41)

AIC(fmBH)

## [1] 2592.768

Capitolo 20 Alberi di Regressione (Regression Tree: RTREE)

20.1 Introduzione

20.2 Esempio: package rpart

20.3 Esempio: package party

20.2 Esempio: package `rpart`

20.3 Esempio: package `party`