[ADP] Logistic Regression Analysis
> # 04. Logistic Regression Model
>
>
> # 4. Logistic regression case study
>
> library(MASS)
>
> data("biopsy")
> str(biopsy)
'data.frame': 699 obs. of 11 variables:
$ ID : chr "1000025" "1002945" "1015425" "1016277" ...
$ V1 : int 5 5 3 6 4 8 1 2 2 4 ...
$ V2 : int 1 4 1 8 1 10 1 1 1 2 ...
$ V3 : int 1 4 1 8 1 10 1 2 1 1 ...
$ V4 : int 1 5 1 1 3 8 1 1 1 1 ...
$ V5 : int 2 7 2 3 2 7 2 2 2 2 ...
$ V6 : int 1 10 2 4 1 10 10 1 1 1 ...
$ V7 : int 3 3 3 3 3 9 3 3 1 2 ...
$ V8 : int 1 2 1 7 1 7 1 1 1 1 ...
$ V9 : int 1 1 1 1 1 1 1 1 5 1 ...
$ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
>
> # Drop the unneeded ID column
> biopsy$ID <- NULL
> # Rename the variables with names()
> names(biopsy) <- c("thick","size","shape","adhsn","s.size","nucl","chrom","n.nuc","mit","class")
> # Check for missing values with summary()
> summary(biopsy)
     thick            size             shape            adhsn            s.size           nucl             chrom
 Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
 1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000
 Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.000   Median : 2.000   Median : 1.000   Median : 3.000
 Mean   : 4.418   Mean   : 3.134   Mean   : 3.207   Mean   : 2.807   Mean   : 3.216   Mean   : 3.545   Mean   : 3.438
 3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000
 Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000
                                                                                      NA's   :16
     n.nuc            mit               class
 Min.   : 1.000   Min.   : 1.000   benign   :458
 1st Qu.: 1.000   1st Qu.: 1.000   malignant:241
 Median : 1.000   Median : 1.000
 Mean   : 2.867   Mean   : 1.589
 3rd Qu.: 4.000   3rd Qu.: 1.000
 Max.   :10.000   Max.   :10.000
> # Remove rows with missing values
> biopsy.v2 <- na.omit(biopsy)
> summary(biopsy.v2)
     thick            size             shape            adhsn            s.size           nucl             chrom
 Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00    Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
 1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00    1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000
 Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00    Median : 2.000   Median : 1.000   Median : 3.000
 Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83    Mean   : 3.234   Mean   : 3.545   Mean   : 3.445
 3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00    3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000
 Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00    Max.   :10.000   Max.   :10.000   Max.   :10.000
     n.nuc            mit               class
 Min.   : 1.00    Min.   : 1.000   benign   :444
 1st Qu.: 1.00    1st Qu.: 1.000   malignant:239
 Median : 1.00    Median : 1.000
 Mean   : 2.87    Mean   : 1.603
 3rd Qu.: 4.00    3rd Qu.: 1.000
 Max.   :10.00    Max.   :10.000
>
> # Create a new 0/1 variable y with ifelse(): 1 = malignant, 0 = benign
> y <- ifelse(biopsy.v2$class == "malignant",1,0)
>
> library(reshape2)
>
> # Visualize the correlations with the corrplot package
> library(corrplot)
>
> bc <- cor(biopsy.v2[,1:9])
> corrplot.mixed(bc)

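> # The correlations flag possible multicollinearity (size and shape = 0.91), but a high correlation does not necessarily mean multicollinearity is present.
> # When multicollinearity needs to be checked, use the vif() function.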
>
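> # Split the data into training and test sets
> # sample(2, ...) assigns each row to group 1 or 2 with probabilities 0.7 and 0.3; no seed is set, so the counts below vary from run to run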
> ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7,0.3))
> train <- biopsy.v2[ind==1,]
> test <- biopsy.v2[ind==2,]
> table(train$class)
benign malignant
317 154
> table(test$class)
benign malignant
127 85
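Because sample() draws at random, the class counts above change on every run. A minimal sketch of a reproducible split, assuming an arbitrary seed (the value 123 is an illustrative choice, not taken from the source):

```r
library(MASS)  # provides the biopsy data set

data("biopsy")
biopsy$ID <- NULL
biopsy.v2 <- na.omit(biopsy)

# Fix the RNG seed so the same rows land in train/test on every run.
# The seed value 123 is an arbitrary assumption, not taken from the source.
set.seed(123)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))

train <- biopsy.v2[ind == 1, ]
test  <- biopsy.v2[ind == 2, ]

# Each row is assigned independently, so the split is only approximately 70/30.
prop.table(table(ind))
```

With a different seed the counts will differ from the 471/212 split shown in the transcript.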
> # Modeling and evaluation
> fit <- glm(class~., family = "binomial", data = train)
> summary(fit)
Call:
glm(formula = class ~ ., family = "binomial", data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3355  -0.0837  -0.0319   0.0040   2.4446

Coefficients:
             Estimate Std. Error z value     Pr(>|z|)
(Intercept) -13.03389    2.27893  -5.719 0.0000000107 ***
thick         0.90397    0.27669   3.267      0.00109 **
size         -0.13209    0.29775  -0.444      0.65732
shape         0.09748    0.32165   0.303      0.76185
adhsn         0.49144    0.20503   2.397      0.01653 *
s.size        0.22046    0.20281   1.087      0.27702
nucl          0.48865    0.16450   2.971      0.00297 **
chrom         0.78080    0.29921   2.610      0.00907 **
n.nuc         0.23754    0.15178   1.565      0.11757
mit           0.57922    0.58698   0.987      0.32375
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 595.351  on 470  degrees of freedom
Residual deviance:  52.384  on 461  degrees of freedom
AIC: 72.384

Number of Fisher Scoring iterations: 9
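The deviance and AIC figures in this summary can be cross-checked by hand. The sketch below uses only numbers copied from the output; the likelihood-ratio test at the end is an added illustration, not a step from the source.

```r
# Values copied from the summary(fit) output above
null_dev  <- 595.351; null_df  <- 470
resid_dev <- 52.384;  resid_df <- 461
n_coef    <- 10   # intercept + 9 predictors

# For a binary (0/1) response the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of estimated coefficients).
aic <- resid_dev + 2 * n_coef
aic   # 72.384, matching the AIC line in the summary

# Likelihood-ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = p_value)
```

A drop in deviance of about 543 on 9 degrees of freedom indicates that the predictors, taken together, improve substantially on the intercept-only model.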
>
> # Compute the odds ratios (exponentiated coefficients)
> options(scipen = 999)
> exp(coef(fit))
   (Intercept)          thick           size          shape          adhsn         s.size           nucl          chrom
0.000002185001 2.469391577868 0.876263562712 1.102384077611 1.634668672278 1.246647127597 1.630119122400 2.183208587800
         n.nuc            mit
1.268123029232 1.784648765536
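Each odds ratio is the multiplicative change in the odds of "malignant" for a one-unit increase in that predictor, holding the others fixed. For thick, the odds are multiplied by about 2.47. A minimal sketch that reproduces this value from the rounded estimate in the summary table and attaches a Wald 95% interval (the interval is an added illustration, not part of the source):

```r
# Estimate and standard error for `thick`, copied from the summary(fit) table
b  <- 0.90397
se <- 0.27669

# Odds ratio: multiplicative change in the odds of "malignant"
# for a one-unit increase in thick, holding the other predictors fixed
or <- exp(b)

# Wald 95% confidence interval, built on the log-odds scale and then exponentiated
z  <- qnorm(0.975)
ci <- exp(b + c(-1, 1) * z * se)

round(c(OR = or, lower = ci[1], upper = ci[2]), 3)
```

With the fitted object available, `exp(cbind(OR = coef(fit), confint.default(fit)))` should give the same kind of table for every coefficient at once.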
>
> # Check for multicollinearity
> library(car)
>
> vif(fit)
   thick     size    shape    adhsn   s.size     nucl    chrom    n.nuc      mit
1.747892 3.642638 3.815456 1.399292 1.404389 1.211444 1.300690 1.338480 1.085950
> # Multicollinearity is not a concern here: every VIF is below 4. (A VIF above 10 is the usual rule of thumb for multicollinearity.)
>
> # type = "response" returns a vector of predicted probabilities between 0 and 1
> train.probs <- predict(fit, type = "response")
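What type = "response" does can be checked directly: it applies the inverse logit to the linear predictor. The sketch below refits the model on all complete rows purely for illustration (the source fits on train), and the object names fit_all, log_odds and probs are hypothetical.

```r
library(MASS)

data("biopsy")
biopsy$ID <- NULL
biopsy.v2 <- na.omit(biopsy)

# Fit on all complete rows just to illustrate predict(); the source fits on `train`
fit_all <- glm(class ~ ., family = "binomial", data = biopsy.v2)

log_odds <- predict(fit_all, type = "link")      # linear predictor (log-odds)
probs    <- predict(fit_all, type = "response")  # probability of "malignant"

# The response scale is the inverse logit of the link scale
all.equal(probs, plogis(log_odds))

# Probabilities always lie in [0, 1]
range(probs)
```

Because glm() treats the first factor level (benign) as the failure category, these probabilities refer to the malignant class.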
>
> # Build the confusion matrix
> library(InformationValue)
>
> trainY <- y[ind==1]
> testY <- y[ind==2]
> confusionMatrix(trainY, train.probs)
    0   1
0 314   5
1   3 149
> misClassError(trainY, train.probs) # misclassification rate: (5 + 3) / 471
[1] 0.017
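The transcript defines testY but stops at the training set. A sketch of the corresponding held-out evaluation, using base R's table() rather than InformationValue so it runs without extra packages. The seed and the 0.5 cutoff are assumptions, so the numbers will not match the split shown above.

```r
library(MASS)

data("biopsy")
biopsy$ID <- NULL
biopsy.v2 <- na.omit(biopsy)

set.seed(123)  # arbitrary seed (assumption) so the split is repeatable
ind   <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ]
test  <- biopsy.v2[ind == 2, ]

fit <- glm(class ~ ., family = "binomial", data = train)

# Score the held-out rows; newdata is required, otherwise predict() returns
# fitted values for the training rows
test.probs <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 cutoff (assumption) and compare against the true labels
testY     <- ifelse(test$class == "malignant", 1, 0)
test.pred <- ifelse(test.probs > 0.5, 1, 0)

conf.tab <- table(predicted = test.pred, actual = testY)
conf.tab

# Misclassification rate on the test set
mean(test.pred != testY)
```

With InformationValue loaded, `confusionMatrix(testY, test.probs)` and `misClassError(testY, test.probs)` would be the direct counterparts of the training-set calls above.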
>
Source: 2020 데이터 분석 전문가 ADP 필기 한 권으로 끝내기 (2020 Data Analysis Professional (ADP) written exam, complete in one volume)