[현장에서 바로 써먹는...] 로지스틱 회귀
> ### Chapter 6. 분류 및 군집분석
>
> ## Chapter 6-1. 병아리의 성별을 구분할 수 있을까? (로지스틱 회귀)
>
> # 데이터 불러오기
> g <- read.csv("ch6-1.csv", header = TRUE)
> head(g)
wing_length tail_length gender
1 44 9 m
2 42 9 m
3 43 8 m
4 40 10 m
5 44 8 m
6 43 8 m
> str(g)
'data.frame': 60 obs. of 3 variables:
$ wing_length: int 44 42 43 40 44 43 42 43 41 43 ...
$ tail_length: int 9 9 8 10 8 8 8 8 10 8 ...
$ gender : chr "m" "m" "m" "m" ...
>
> g$gender <- as.factor(g$gender)
> str(g)
'data.frame': 60 obs. of 3 variables:
$ wing_length: int 44 42 43 40 44 43 42 43 41 43 ...
$ tail_length: int 9 9 8 10 8 8 8 8 10 8 ...
$ gender : Factor w/ 2 levels "f","m": 2 2 2 2 2 2 2 2 2 2 ...
>
> # 로지스틱 회귀 실시
> g_glm <- glm(gender ~ wing_length + tail_length,
+ data = g, family = binomial)
> summary(g_glm)
Call:
glm(formula = gender ~ wing_length + tail_length, family = binomial,
data = g)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.64364 -0.13454 0.00002 0.09015 2.63544
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 70.1955 23.4091 2.999 0.00271 **
wing_length -1.0531 0.5045 -2.087 0.03685 *
tail_length -2.3859 0.9692 -2.462 0.01382 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.178 on 59 degrees of freedom
Residual deviance: 16.466 on 57 degrees of freedom
AIC: 22.466
Number of Fisher Scoring iterations: 7
>
> # g_glm 변수에 포함된 객체(Object) 확인
> names(g_glm)
[1] "coefficients" "residuals" "fitted.values" "effects" "R" "rank"
[7] "qr" "family" "linear.predictors" "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights" "df.residual" "df.null" "y"
[19] "converged" "boundary" "model" "call" "formula" "terms"
[25] "data" "offset" "control" "method" "contrasts" "xlevels"
>
> # 기존 독립변수 데이터를 이용해 모델로 계산해 본 결과
> g_glm$fitted.values
1 2 3 4 5 6 7 8 9
0.9155791141 0.9889042384 0.9970494602 0.9853766235 0.9915881469 0.9970494602 0.9989687568 0.9970494602 0.9591936927
10 11 12 13 14 15 16 17 18
0.9970494602 0.9996400209 0.9591936927 0.9970494602 0.9986364874 0.9986364874 0.9989687568 0.7409599723 0.9970494602
19 20 21 22 23 24 25 26 27
0.9155791141 0.4994544574 0.9986364874 0.9155791141 0.9986364874 0.8913020259 0.9688377014 0.9853766235 0.9998743960
28 29 30 31 32 33 34 35 36
0.9889042384 0.9970494602 0.0310308031 0.2582032247 0.0013575828 0.0083755277 0.1082758663 0.4994544574 0.0010267569
37 38 39 40 41 42 43 44 45
0.1082758663 0.0110479753 0.0310308031 0.0010267569 0.0003584120 0.0406358239 0.0110479753 0.0840841597 0.0003584120
46 47 48 49 50 51 52 53 54
0.0083755277 0.0013575828 0.0010267569 0.0029377285 0.0038818856 0.0083755277 0.0010267569 0.2083400891 0.0236401606
55 56 57 58 59 60
0.0001250571 0.7409599723 0.0110479753 0.0840841597 0.0310308031 0.0010267569
>
> # g 데이터 셋에 pred열을 만들어 계산된 값을 넣음
> g$pred <- g_glm$fitted.values
> head(g)
wing_length tail_length gender pred
1 44 9 m 0.9155791
2 42 9 m 0.9889042
3 43 8 m 0.9970495
4 40 10 m 0.9853766
5 44 8 m 0.9915881
6 43 8 m 0.9970495
>
> # pred가 0.5보다 크면 m, 아니면 f로 판정하고 그 결과를 gender_pred라는 열에 넣음
> g$gender_pred <- ifelse(g$pred > 0.5, 'm', 'f')
> head(g)
wing_length tail_length gender pred gender_pred
1 44 9 m 0.9155791 m
2 42 9 m 0.9889042 m
3 43 8 m 0.9970495 m
4 40 10 m 0.9853766 m
5 44 8 m 0.9915881 m
6 43 8 m 0.9970495 m
>
> # 간단한 정오분류표 그려보기
> table(g$gender_pred, g$gender)
f m
f 29 2
m 1 28
>
> library(caret)
필요한 패키지를 로딩중입니다: lattice
필요한 패키지를 로딩중입니다: ggplot2
경고메시지(들):
1: 패키지 ‘caret’는 R 버전 4.0.4에서 작성되었습니다
2: 패키지 ‘ggplot2’는 R 버전 4.0.5에서 작성되었습니다
>
> # caret 패키지 활용을 위해 g$gender_pred 열 데이터 유형 변경
> g$gender_pred <- as.factor(g$gender_pred)
> confusionMatrix(g$gender_pred, g$gender)
Confusion Matrix and Statistics
Reference
Prediction f m
f 29 2
m 1 28
Accuracy : 0.95
95% CI : (0.8608, 0.9896)
No Information Rate : 0.5
P-Value [Acc > NIR] : 3.127e-14
Kappa : 0.9
Mcnemar's Test P-Value : 1
Sensitivity : 0.9667
Specificity : 0.9333
Pos Pred Value : 0.9355
Neg Pred Value : 0.9655
Prevalence : 0.5000
Detection Rate : 0.4833
Detection Prevalence : 0.5167
Balanced Accuracy : 0.9500
'Positive' Class : f
>
> library(Epi)
경고메시지(들):
패키지 ‘Epi’는 R 버전 4.0.5에서 작성되었습니다
>
> # ROC 커브 그리기
> ROC(g$pred, g$gender, main = "ROC Curve")
출처 : 현장에서 바로 써먹는 데이터 분석 with R