> Normalization = function(x){
+ y = (x-min(x))/(max(x)-min(x))
+ return(y)
+ }
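
The function above is min-max scaling, y = (x - min(x)) / (max(x) - min(x)). As written it returns `NaN` for a constant vector and `NA` everywhere if any value is missing. A minimal sketch of a more defensive variant follows. The name `normalize_safe` and the choice to map a constant vector to 0 are assumptions for illustration, not part of the original post.

```r
# Hypothetical helper (not from the original post): min-max scaling that
# ignores NA values and avoids dividing by zero for constant vectors.
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)   # c(min, max), ignoring NAs
  span <- rng[2] - rng[1]
  if (span == 0) {
    # Assumption: map a constant vector to 0 instead of returning NaN
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / span
}

normalize_safe(c(10, 20, NA, 30))   # 0.0 0.5 NA 1.0
normalize_safe(c(5, 5, 5))          # 0 0 0
```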
>
> ### 2.11 Data Preprocessing 1 with Used-Car Data - apply
>
> # Load the data
> # Download URL : https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes
> DIR = "F:/1_Study/1_BigData/12_R/02_Practical-R/Data/"
> Audi = read.csv(paste0(DIR, "audi.csv"),stringsAsFactors = FALSE)
> str(Audi)
'data.frame': 10668 obs. of 9 variables:
$ model : chr " A1" " A6" " A1" " A4" ...
$ year : int 2017 2016 2016 2017 2019 2016 2016 2016 2015 2016 ...
$ price : int 12500 16500 11000 16800 17300 13900 13250 11750 10200 12000 ...
$ transmission: chr "Manual" "Automatic" "Manual" "Automatic" ...
$ mileage : int 15735 36203 29946 25952 1998 32260 76788 75185 46112 22451 ...
$ fuelType : chr "Petrol" "Diesel" "Petrol" "Diesel" ...
$ tax : int 150 20 30 145 145 30 30 20 20 30 ...
$ mpg : num 55.4 64.2 55.4 67.3 49.6 58.9 61.4 70.6 60.1 55.4 ...
$ engineSize : num 1.4 2 1.4 2 1 1.4 2 2 1.4 1.4 ...
> head(Audi)
model year price transmission mileage fuelType tax mpg engineSize
1 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
2 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
3 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
4 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
5 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0
6 A1 2016 13900 Automatic 32260 Petrol 30 58.9 1.4
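
The `str(Audi)` output above shows the `model` values with a leading space (`" A1"`), which makes exact string comparisons fail without warning. The sketch below uses a made-up vector rather than the real file, and trimming the column is a suggestion, not a step the original post performs.

```r
# Toy vector mimicking the padded values seen in str(Audi)
model <- c(" A1", " A6", " A4")

model == "A1"           # all FALSE because of the leading space
trimws(model) == "A1"   # TRUE FALSE FALSE
```

On the real data the same idea would be `Audi$model = trimws(Audi$model)`.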
>
>
> #### 2.11.1 Applying One Operation to Several Columns at Once
>
> # apply
> Audi_S = Audi[,c("year","price","mileage","mpg")]
> Audi_S2 = Normalization(Audi_S)
> summary(Audi_S2)
year price mileage mpg
Min. :0.006180 Min. :0.00461 Min. :0.00000 Min. :5.542e-05
1st Qu.:0.006238 1st Qu.:0.04684 1st Qu.:0.01848 1st Qu.:1.235e-04
Median :0.006242 Median :0.06254 Median :0.05882 Median :1.505e-04
Mean :0.006242 Mean :0.07088 Mean :0.07686 Mean :1.541e-04
3rd Qu.:0.006248 3rd Qu.:0.08665 3rd Qu.:0.11289 3rd Qu.:1.793e-04
Max. :0.006251 Max. :0.44891 Max. :1.00000 Max. :5.799e-04
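> # In a correct result, every variable should have a minimum of 0 and a maximum of 1,
> # but here only one of the four variables (mileage) reaches both 0 and 1.
> # The cause: min(x) and max(x) on a data frame are computed over all values in all columns.
> # To fix this, the function has to be applied to each variable separately.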
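
The behavior described in the comments can be reproduced on a tiny data frame. This is a sketch with made-up values (`toy` is not from the original post) showing that `min()` and `max()` on a data frame return a single number across all cells.

```r
# Toy data (made up for illustration) with columns on very different scales
toy <- data.frame(a = c(1, 2, 3), b = c(100, 200, 300))

min(toy)   # 1   -> the minimum over ALL cells, not per column
max(toy)   # 300 -> the maximum over ALL cells

# Same formula as Normalization(), applied to the whole data frame at once
scaled <- (toy - min(toy)) / (max(toy) - min(toy))
range(scaled$a)   # column a never reaches 1
range(scaled$b)   # column b never reaches 0
```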
>
> R_Matrix = matrix(data = 0,
+ nrow = nrow(Audi_S),
+ ncol = ncol(Audi_S))
>
> for(k in seq_len(ncol(Audi_S))){
+ R_Matrix[,k] = Normalization(Audi_S[,k])
+ }
>
> R_DF = as.data.frame(R_Matrix)
> summary(R_DF)
V1 V2 V3 V4
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.8261 1st Qu.:0.09505 1st Qu.:0.01848 1st Qu.:0.1299
Median :0.8696 Median :0.13037 Median :0.05882 Median :0.1812
Mean :0.8739 Mean :0.14917 Mean :0.07686 Mean :0.1881
3rd Qu.:0.9565 3rd Qu.:0.18466 3rd Qu.:0.11289 3rd Qu.:0.2361
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
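
The summary above is labelled `V1` to `V4` because the matrix was created without column names. A minimal sketch of one way to carry the names over, using a made-up stand-in for `Audi_S`; the names `minmax`, `toy`, `res`, and `res_df` are hypothetical.

```r
# Same formula as Normalization() in the post, redefined so the sketch runs alone
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy stand-in for Audi_S (values made up for illustration)
toy <- data.frame(year = c(2015, 2017, 2019), price = c(10200, 12500, 17300))

# Pre-allocate the result and carry the column names over
res <- matrix(0, nrow = nrow(toy), ncol = ncol(toy),
              dimnames = list(NULL, names(toy)))

for (k in seq_len(ncol(toy))) {
  res[, k] <- minmax(toy[, k])
}

res_df <- as.data.frame(res)
names(res_df)   # "year" "price"
```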
>
> R_DF2 = as.data.frame(apply(Audi_S, MARGIN = 2, FUN = Normalization))
> # MARGIN = 2 applies the function column by column; MARGIN = 1 applies it row by row
> summary(R_DF2)
year price mileage mpg
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.8261 1st Qu.:0.09505 1st Qu.:0.01848 1st Qu.:0.1299
Median :0.8696 Median :0.13037 Median :0.05882 Median :0.1812
Mean :0.8739 Mean :0.14917 Mean :0.07686 Mean :0.1881
3rd Qu.:0.9565 3rd Qu.:0.18466 3rd Qu.:0.11289 3rd Qu.:0.2361
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
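
The effect of `MARGIN` is easiest to see on a small matrix. The sketch below uses made-up values. It also shows one caveat: `apply()` converts a data frame to a matrix first, so a single character column turns every column into character.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns: rows are (1,3,5) and (2,4,6)

apply(m, MARGIN = 1, FUN = sum)   # one result per row:    9 12
apply(m, MARGIN = 2, FUN = sum)   # one result per column: 3 7 11

# Caveat: a mixed-type data frame is coerced to a character matrix
mixed <- data.frame(n = c(1, 2), s = c("a", "b"))
apply(mixed, MARGIN = 2, FUN = class)   # both columns report "character"
```

This is why the post selects only the numeric columns into `Audi_S` before calling `apply()`.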
>
> # lapply
> R_DF3 = as.data.frame(lapply(Audi_S, Normalization))
> summary(R_DF3)
year price mileage mpg
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.8261 1st Qu.:0.09505 1st Qu.:0.01848 1st Qu.:0.1299
Median :0.8696 Median :0.13037 Median :0.05882 Median :0.1812
Mean :0.8739 Mean :0.14917 Mean :0.07686 Mean :0.1881
3rd Qu.:0.9565 3rd Qu.:0.18466 3rd Qu.:0.11289 3rd Qu.:0.2361
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
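
`lapply()` treats the data frame as a list of columns, so each column keeps its own type and no matrix conversion takes place. A minimal sketch of two related idioms on made-up data follows; `minmax`, `toy`, and `scaled` are hypothetical names.

```r
# Same formula as Normalization() in the post, redefined so the sketch runs alone
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy stand-in for Audi_S (values made up for illustration)
toy <- data.frame(year = c(2015, 2017, 2019), mpg = c(49.6, 55.4, 67.3))

scaled <- toy
scaled[] <- lapply(toy, minmax)   # assigning into [] keeps the data.frame shape

# vapply returns one type-checked numeric value per column
vapply(scaled, min, numeric(1))   # year 0, mpg 0
vapply(scaled, max, numeric(1))   # year 1, mpg 1
```

Assigning into `scaled[]` avoids the extra `as.data.frame()` call, and `vapply()` gives a quick per-column check that each range is exactly 0 to 1.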
Source: 실무 프로젝트로 배우는 데이터 분석 with R