The following walks through reading data from a csv file, parsing text strings with elements of data cleansing, aggregating the data along analytical dimensions, and plotting charts.
The example makes heavy use of the data.table, reshape2, stringdist, and ggplot2 packages.
As "real data" we use the information on permits issued for carrying passengers and luggage by taxi in Moscow. The data are published for public use by the Moscow Department of Transport and Road Infrastructure Development. Dataset page: data.mos.ru/datasets/655
The source data has the following format:
```
ROWNUM;VEHICLE_NUM;FULL_NAME;BLANK_NUM;VEHICLE_BRAND_MODEL;INN;OGRN
1;"248197";" «-»";"017263";"FORD FOCUS";"7734653292";"1117746207578"
2;"249197";" «-»";"017264";"FORD FOCUS";"7734653292";"1117746207578"
3;"245197";" «-»";"017265";"FORD FOCUS";"7734653292";"1117746207578"
```
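1. Loading the raw data
The data can be downloaded directly from the site. While loading, we rename the columns to something more convenient.
```r
# Packages used throughout the example.
# reshape2 is attached after data.table so that dcast() below resolves to reshape2.
library(data.table)
library(reshape2)
library(stringdist)
library(ggplot2)

url <- "http://data.mos.ru/datasets/download/655"
colnames <- c("RowNumber", "RegPlate", "LegalName", "DocNum", "Car", "INN", "OGRN", "Void")
rawdata <- read.table(url, header = TRUE, sep = ";",
                      colClasses = c("numeric", rep("character", 6), NA),
                      col.names = colnames,
                      strip.white = TRUE,
                      blank.lines.skip = TRUE,
                      stringsAsFactors = FALSE,
                      encoding = "UTF-8")
```
Now we can move on to analysis and visualization...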
2. Data transformation
Suppose we need to analyze how the number of cars registered as taxis is distributed by the licensee's organizational form and by car brand. These values are not stored as separate fields, but all the necessary information is contained in the fields FULL_NAME (renamed to LegalName) and VEHICLE_BRAND_MODEL (renamed to Car). While transforming the source data we:
- extract the legal form from the LegalName field into a separate OrgType field;
- extract the car make from the Car field into a separate CarBrand field;
- discard the unused fields.
```r
ptn <- "^(.+?) (.+)$" # regexp pattern to match first word
dt <- data.table(rawdata)[, list(RegPlate, LegalName, Car, OGRN,
                                 OrgType  = gsub(ptn, "\\1", toupper(LegalName)),
                                 CarBrand = gsub(ptn, "\\1", toupper(Car)))]
rm(rawdata) # Clear some memory
```
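To see what the pattern does in isolation, here is a minimal sketch on a few made-up strings (the sample values are illustrative and not taken from the dataset). A string without a space does not match the pattern at all, so `gsub` returns it unchanged.

```r
# Pattern from the text: lazily capture everything up to the first space
ptn <- "^(.+?) (.+)$"

# Hypothetical sample values, for illustration only
cars <- c("Ford Focus", "Mercedes-Benz E200 K", "Lada")

# Upper-case first, then keep only the first word
brands <- gsub(ptn, "\\1", toupper(cars))
print(brands)
```

The same mechanism explains why two-word brand names end up truncated to their first word.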
3. First results
Let's check the organizational forms extracted from the data.
```r
sort( table(dt$OrgType) )
```
```
##
##     1   392   649 17118 17680
```
The data turned out to be quite clean: individual entrepreneurs lead in the number of licenses received (tax breaks?), followed by limited liability companies, open and closed joint-stock companies, and even one non-profit partnership.
To find the number of distinct licensees (rather than cars) that received licenses, broken down by legal form, we aggregate over the field that uniquely identifies a legal entity: OGRN, the primary state registration number.
```r
dt[, list( N = length( unique(OGRN) ) ), by = OrgType][order(N, decreasing = TRUE)]
```
```
##    OrgType     N
## 1:         12352
## 2:           563
## 3:            14
## 4:             6
## 5:             1
```
Data cleansing
Which car brands are used as taxis in Moscow? The dataset contains many car brands (115), but are they all really unique? As an example, let's list every brand that starts with the letter "M".
```r
sort( unique( dt[grep("^M.*", CarBrand), CarBrand]))
```
```
##  [1] "M214"                 "MASERATI"             "MAZDA"
##  [4] "MAZDA-"               "MERCEDES"             "MERCEDES-BENZ"
##  [7] "MERCEDES-BENZ-"       "MERCEDES-BENZ-S500"   "MERCEDES-BENZC"
## [10] "MERCEDES-BENZE200K"   "MERCEDES-BENZE220CDI" "MERCEDES-BNZ"
## [13] "MERCERDES-BENZ"       "MERCRDES"             "MERCRDES-BENZ"
## [16] "MERSEDES-"            "MERSEDES-BENZ"        "METROCAB"
## [19] "MG"                   "MINI"                 "MITSUBISHI"
```
Unfortunately, the large number of car brands is mostly due to data errors. For example, one and the same brand, MERCEDES-BENZ, appears under many different names. The data must be cleaned before analysis.
Programs for cleaning text data are built on functions that compute the "distance between strings". For each pair of strings, such a function computes a metric describing how hard it is to turn one string into the other using character-level operations. The more similar the strings, the fewer operations are needed. Ideally, identical strings are at distance zero and maximally dissimilar strings at distance one. This is exactly how the Jaro-Winkler method of the stringdist function (from the package of the same name) behaves.
Let's compare a few strings, computing a similarity, 1 - stringdist, rather than a distance.
```r
1 - stringdist( c("MERCEDES","MERSEDES","MAZDA","RENAULT","SAAB"), "MERCEDES", method = "jw", p = 0.1)
```
```
## [1] 1.0000 0.9417 0.5950 0.3452 0.0000
```
At first glance the cleaning task looks simple: for each record, pick the most similar value from a reference dictionary. Unfortunately, this approach does not always work. First, there may be no dictionary at all (as in the present case). Second, even with an accurate dictionary, some cases require manual correction. For example, from the method's point of view, three brands are equally good replacements for the erroneous value "BAZ":
```r
1 - stringdist("BAZ", c("VAZ", "UAZ", "ZAZ"), method = "jw", p = 0.1)
```
```
## [1] 0.7778 0.7778 0.7778
```
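The similarity values shown so far can be reproduced by hand from the standard Jaro-Winkler definition. The sketch below is not part of stringdist: `jw_similarity` is a hypothetical helper that takes already-counted quantities (number of matching characters, string lengths, half-transpositions, common prefix length) and applies the formula with the same prefix weight p = 0.1.

```r
# Jaro-Winkler similarity from pre-counted quantities.
# Illustrative helper only; the counting of matches is done by hand below.
#   m              number of matching characters
#   transpositions half the number of matching characters that are out of order
#   prefix         length of the common prefix (capped at 4)
jw_similarity <- function(m, len_a, len_b, transpositions = 0, prefix = 0, p = 0.1) {
  if (m == 0) return(0)
  jaro <- (m / len_a + m / len_b + (m - transpositions) / m) / 3
  jaro + min(prefix, 4) * p * (1 - jaro)
}

# "MERSEDES" vs "MERCEDES": 7 of 8 characters match, common prefix "MER"
mers <- jw_similarity(m = 7, len_a = 8, len_b = 8, prefix = 3)

# "BAZ" vs "VAZ": 2 of 3 characters match, no common prefix
baz <- jw_similarity(m = 2, len_a = 3, len_b = 3)

print(round(c(mers, baz), 4))
```

Because "VAZ", "UAZ", and "ZAZ" each differ from "BAZ" only in the first letter, all three yield the same counts and therefore the same similarity, which is why the method cannot choose between them.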
Below we use a semi-automatic correction method. It greatly eases the work of the data cleansing specialist by generating correction suggestions programmatically, which the analyst can then accept or fix by hand.
In a large dataset with few errors, frequently occurring values can be assumed to be correct, while rare values are likely to be errors. The frequency of a value is used as a weight that proportionally increases the string similarity metric. To keep frequent brands from winning on sheer count rather than on similarity, only similarity values above a threshold t are taken into account (the choice of t is discussed further on). In this way, for every car brand value that occurs, a recommended "reference" value is determined from the same dataset. The "brand, suggested correction" pairs are written to a csv file. After review and correction, the corrected csv file is loaded back and serves as the dictionary.
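The weighting rule described above can be sketched with plain arithmetic. All numbers here are made up for illustration: the similarities are assumed values for a query "FORT" against two candidates, and `combined_weight` and `pick` are hypothetical helper names.

```r
# Combined weight as described in the text: similarity times frequency,
# with every candidate at or below the threshold t zeroed out
combined_weight <- function(sim, freq, t) sim * freq * (sim > t)

# Hypothetical similarities of the query "FORT" to two candidates, and their counts
sim  <- c(FORD = 0.88, FORT = 1.00)
freq <- c(FORD = 100,  FORT = 1)

# Return the name of the candidate with the largest combined weight
pick <- function(t) names(which.max(combined_weight(sim, freq, t)))

print(pick(0.7))  # the frequent "FORD" wins on weight
print(pick(0.9))  # "FORD" falls below the threshold, so "FORT" is kept
```

A low threshold lets frequency dominate, while a high threshold makes the method conservative and leaves more values untouched.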
First, we write a function that returns a best-match function for the given dataset.
```r
bestmatch.gen <- function(wc, t = 0){
  # wc = counts of all base text words
  # t  = threshold: only the words with similarity above threshold count
  bestmatch <- function(a){
    sim <- 1 - stringdist( toupper(a), toupper( names(wc) ), method = "jw", p = 0.1 )
    # Compute weights and implicitly cut off everything below threshold
    weights <- sim * wc * (sim > t)
    # Return the one with maximum combined weight
    names( sort(weights, decreasing = TRUE)[1] )
  }
  bestmatch
}
```
The threshold t is chosen empirically. Here is how the function behaves with the threshold t = 0.7.
```r
bm07 <- bestmatch.gen( table( dt$CarBrand), t = 0.7 )
s <- c("FORD","RENO","MERS","PEGO")
sapply(s, bm07)
```
```
##            FORD            RENO            MERS            PEGO
##          "FORD"       "RENAULT" "MERCEDES-BENZ"       "PEUGEOT"
```
At first glance everything worked great. But it is too early to celebrate: brands with similar names that are well represented in the dataset can "pull over" other, perfectly correct names.
```r
s <- c("HONDA", "CHRYSLER", "VOLVO")
sapply(s, bm07)
```
```
##        HONDA     CHRYSLER        VOLVO
##    "HYUNDAI"  "CHEVROLET" "VOLKSWAGEN"
```
Let's try raising the threshold t.
```r
bm09 <- bestmatch.gen( table( dt$CarBrand), t = 0.9 )
s <- c("HONDA","CHRYSLER","VOLVO")
sapply(s, bm09)
```
```
##      HONDA   CHRYSLER      VOLVO
##    "HONDA" "CHRYSLER"    "VOLVO"
```
All good? Almost. A stricter cutoff for dissimilar strings means the algorithm now treats some erroneous values as correct. Such errors have to be fixed by hand.
```r
s <- c("CEAT", "CVEVROLET")
sapply(s, bm09)
```
```
##        CEAT   CVEVROLET
##      "CEAT" "CVEVROLET"
```
We are now ready to build a dictionary file with the unique values of all car brands. Since the file will be edited by hand, it helps to include a few extra fields: a flag showing whether the suggested replacement differs from the original value (this is not always obvious), the frequency with which each brand name occurs, and an alert label that draws attention to a record based on some statistical property of the set. Here we want to catch the cases where the algorithm suggests a rare (and probably erroneous) value as the correct one.
```r
ncb <- table(dt$CarBrand)
scb <- names(ncb)        # Source Car Brands
acb <- sapply(scb, bm09) # Auto-generated replacement
cbdict_out <- data.table(ncb)[, list(
  SourceName = scb,
  AutoName   = acb,
  SourceFreq = as.numeric(ncb),
  AutoFreq   = as.numeric( ncb[acb] ),
  Action     = ordered( scb == acb, labels = c("CHANGE","KEEP")),
  DictName   = acb )]
# Add alert flag
# Alert when suggested is a low-frequency dictionary word
cbdict_out <- cbdict_out[, Alert := ordered(
  AutoFreq <= quantile(AutoFreq, probs = 0.05, na.rm = TRUE),
  labels = c("GOOD","ALERT")) ]
write.table(
  cbdict_out[ order(SourceName),
              list( Alert, Action, SourceName, AutoName, SourceFreq, AutoFreq, DictName) ],
  "cbdict_out.txt", sep = ";", quote = TRUE,
  col.names = TRUE, row.names = FALSE, fileEncoding = "UTF-8")
```
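The alert rule above flags suggestions whose frequency falls in the lowest 5% of all suggested frequencies. A minimal sketch of that rule on made-up frequencies follows. It uses `factor` with explicit levels instead of `ordered(..., labels = ...)`; this is a deliberate variation, chosen so that both labels stay defined even when no value (or every value) is flagged.

```r
# Hypothetical frequencies of the suggested names, for illustration only
auto_freq <- c(1, 1, 1, 40, 55, 61, 70, 88, 120, 150,
               300, 410, 520, 640, 800, 1039, 1298, 2073, 4660, 5800)

# 5% quantile of the frequencies; values at or below it count as rare
cutoff  <- quantile(auto_freq, probs = 0.05, na.rm = TRUE)
is_rare <- auto_freq <= cutoff

# Explicit levels keep both labels defined regardless of the data
alert <- factor(is_rare, levels = c(FALSE, TRUE),
                labels = c("GOOD", "ALERT"), ordered = TRUE)
print(table(alert))
```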
The values in the DictName field must be reviewed and edited, and the file saved under the name "cbdict_in.txt" so it can be loaded later.
The analyzed dataset has several features worth noting:
- Some rows contain no car brand (empty or "NO"), and some models are hard to identify unambiguously: L1H1, M214. We manually change these to UNKNOWN or a similar pseudo-value.
- The two spellings MERCEDES and MERCEDES-BENZ are used about equally often. We keep only MERCEDES-BENZ.
- ZAZ has two visually identical but distinct spellings (the output contains two rows, and for both the algorithm recommends keeping the value, Action = KEEP). Apparently a letter with a different UTF-8 code crept in somewhere.
- Some car names contain only the model and no brand: SAMAND (IRAN KHODRO).
- There is confusion around the brands TAGAZ, VORTEX, and JAC. For simplicity we propose (not entirely accurately) assigning the common name TAGAZ to cars whose brand is identified as TAGAZ, A21, SUV, SUVT11, VORTEX, or JAC.
- The algorithm offers several erroneous names as valid replacements: CEAT, CVEVROLET.
- Two-word brands are cut down to a single word: ALFA (ALFA ROMEO), GREAT (GREAT WALL), IRAN (IRAN KHODRO), LAND (LAND ROVER).
```r
# Prefer the hand-edited dictionary; fall back to the auto-generated one
if ( file.exists("cbdict_in.txt")) url <- "cbdict_in.txt" else url <- "cbdict_out.txt"
cbdict_in <- read.table( url, header = TRUE, sep = ";",
                         colClasses = c( rep("character",4), "numeric", "numeric", "character"),
                         encoding = "UTF-8")
# Named vector: names are source spellings, values are corrected names
cbdict <- cbdict_in$DictName
names(cbdict) <- cbdict_in$SourceName
```
Now we correct the car brand values in the data table.
```r
dt[, CarBrand := cbdict[CarBrand]]
dt[is.na(CarBrand), CarBrand := "UNKNOWN"]
```
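The correction step relies on indexing a named character vector by name. A small base-R sketch with a made-up dictionary shows the mechanism, including what happens to a value that is missing from the dictionary.

```r
# Hypothetical dictionary: names are source spellings, values are corrected names
cbdict <- c("MERSEDES" = "MERCEDES-BENZ",
            "MERCEDES" = "MERCEDES-BENZ",
            "FORD"     = "FORD")

# Hypothetical source column, including a value that is not in the dictionary
brands <- c("FORD", "MERSEDES", "L1H1")

fixed <- unname(cbdict[brands])   # lookup by name; unknown keys give NA
fixed[is.na(fixed)] <- "UNKNOWN"  # same fallback as in the text
print(fixed)
```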
After cleaning, the number of unique car brand values drops from 115 to 72.
```r
length( unique(dt$CarBrand) )
```
```
## [1] 72
```
Answers to analytical questions
1. Top 10 organizations
Let's identify the 10 largest taxi fleets. This requires a ranking along a single dimension, OGRN.
```r
st <- dt[, list( NumCars = length(RegPlate)), by = list(OGRN, LegalName) ]
head( st[order( NumCars, decreasing = TRUE)], 10)
```
```
##              OGRN LegalName NumCars
##  1: 1137746197104 «» 866
##  2: 1037727000893 «-» 751
##  3: 1067746273198 « » 547
##  4: 1037789018849 «» 541
##  5: 1127746010700 «-24 » 406
##  6: 1057748223653 «» 349
##  7: 5067746596297 «» 288
##  8: 1027739272175 «14 » 267
##  9: 1137746133250 « » 255
## 10: 5077746757688 «» 238
```
Unfortunately, the dataset stores only the legal details of the licensees, not their trade names. Using an organization's name and OGRN, one can find on the Internet which brand a taxi company operates under, but the process is manual and fairly time-consuming. The search results for the largest taxi fleets are collected in the file "top10orgs.csv".
```r
top10orgs <- data.table( read.table( "top10orgs.csv", header = TRUE, sep = ";",
                                     colClasses = "character", encoding = "UTF-8"))
```
We use the built-in features of data.table to JOIN the two tables.
```r
setkey(top10orgs, OGRN)
setkey(st, OGRN)
st[top10orgs][order(NumCars, decreasing = TRUE), list(OrgBrand, EasyPhone, NumCars)]
```
```
##     OrgBrand EasyPhone NumCars
##  1: 781 81 82 866
##  2: 956 956 8 956 751
##  3: - 641 11 11 547
##  4: 500 0 500 541
##  5: 24 777 66 24 406
##  6: 777 5 777 349
##  7: 940 88 88 288
##  8: 14 707 2 707 267
##  9: Cabby 21 21 989 255
## 10: 927 11 11 238
```
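For readers new to keyed joins in data.table, here is a minimal sketch of the same `X[Y]` pattern on two toy tables. The table names `st_demo` and `top_demo`, the OGRN values, and the brand names are all made up for illustration.

```r
library(data.table)

# Hypothetical toy tables; the OGRN values and brand names are made up
st_demo  <- data.table(OGRN = c("101", "102", "103"), NumCars = c(866, 751, 12))
top_demo <- data.table(OGRN = c("102", "101"), OrgBrand = c("Brand B", "Brand A"))

# Both tables are keyed on the join column
setkey(st_demo, OGRN)
setkey(top_demo, OGRN)

# X[Y] looks up each row of Y (top_demo) in X (st_demo) by the shared key
joined <- st_demo[top_demo][order(NumCars, decreasing = TRUE), list(OrgBrand, NumCars)]
print(joined)
```

Rows of `st_demo` without a partner in `top_demo` (here OGRN "103") do not appear in the result, which is why the join in the text returns only the ten organizations listed in the lookup file.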
2. The three most popular car brands by legal form
Which car brands are the most popular, depending on the licensee's legal form? To answer this, the data must be aggregated along two dimensions: car make and organizational form. The process has three stages:
- computing the aggregate measure (here, the number of cars);
- computing ranks;
- restricting by rank (top 3), sorting, rearranging columns, and printing the data.
```r
st <- dt[, list(AGGR = length(RegPlate)), by = list(OrgType, CarBrand) ]
# ranking by one dimension: r = 1 is the most frequent brand within an OrgType
st.r <- st[, list(CarBrand, AGGR,
                  r = ( 1 + length(AGGR) - rank(AGGR, ties.method = "first"))),
           by = list(OrgType)]
st.out <- st.r[ r <= 3 ][, list(r, OrgType, cval = paste0(CarBrand, " (", AGGR, ")"))]
dcast(st.out, r ~ OrgType, value.var = "cval")[-1] # reshape data and hide r
```
```
##
## 1    FORD (212) CHEVROLET (2465) VOLVO (1)       KIA (192)    FORD (3297)
## 2 RENAULT (175)      FORD (2238)      <NA> CHEVROLET (115) RENAULT (2922)
## 3 HYUNDAI (122)   RENAULT (1996)      <NA>       FORD (53) HYUNDAI (2812)
```
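The ranking expression used above turns R's ascending `rank()` into a "1 = largest" rank. A short base-R sketch on made-up counts for a single group shows the arithmetic.

```r
# Hypothetical car counts for four brands within one legal form
aggr <- c(FORD = 212, RENAULT = 175, HYUNDAI = 122, KIA = 90)

# rank() is ascending, so subtracting it from length + 1 makes 1 the largest
r <- 1 + length(aggr) - rank(aggr, ties.method = "first")
print(r)

# Keep only the three best-ranked brands
top3 <- names(aggr)[r <= 3]
print(top3)
```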
Visualization
1. Displaying data as a pie chart
The pie chart is very popular in business settings but is criticized by data analysis professionals. Nevertheless, one should know how to "cook" it. Suppose we want to show how the number of taxi licenses is distributed by car brand. To avoid overloading the figure, we show only brands with at least 1000 licenses.
```r
st <- dt[, list(N = length(RegPlate)), by = CarBrand ] # Summary table
st <- st[, CarBrand := reorder(CarBrand, N) ]
piedata <- rbind(
  st[ N >= 1000 ][ order(N, decreasing = TRUE) ],
  data.table( CarBrand = " ", N = sum( st[N < 1000]$N) ) )
piedata
```
```
##          CarBrand    N
##  1:          FORD 5800
##  2:       RENAULT 5093
##  3:       HYUNDAI 4727
##  4:     CHEVROLET 4660
##  5:           KIA 2220
##  6:         SKODA 2073
##  7:        NISSAN 1321
##  8:    VOLKSWAGEN 1298
##  9:        TOYOTA 1075
## 10: MERCEDES-BENZ 1039
## 11:               6534
```
For the chart we want to fix exactly this order of brands. Otherwise, automatic sorting would move "other brands" from last place to first.
```r
piedata <- piedata[, CarBrand := factor(CarBrand, levels = CarBrand, ordered = TRUE)]
```
We use ggplot2 to build the chart.
```r
pie <- ggplot(piedata, aes( x = "", y = N, fill = CarBrand)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y")
pie
```
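Fixing the order works because a factor's levels, not the row order, decide how categories are arranged. A small base-R sketch with made-up labels compares the default level order with levels set explicitly.

```r
# Hypothetical labels, already in the order they should appear in the chart
brand <- c("FORD", "RENAULT", "HYUNDAI", "OTHER")

default_levels <- levels(factor(brand))                  # sorted alphabetically
fixed_levels   <- levels(factor(brand, levels = brand))  # order of appearance

print(default_levels)
print(fixed_levels)
```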

The result is already quite readable. Still, I would like to make a few visual improvements:
- remove the gray background, the border, the axes, the labels, and the tick marks;
- choose a more distinct color scale and outline each "slice";
- print the number of licenses for the corresponding brand next to each sector;
- give the legend a text title.
```r
# Midpoint of each slice along the stacked axis, used to place the labels
piedata <- piedata[, pos := cumsum(N) - 0.5*N ]
pie <- ggplot(piedata, aes( x = "", y = N, fill = CarBrand)) +
  geom_bar( color = "black", stat = "identity", width = 0.5) +
  geom_text( aes(label = N, y = pos), x = 1.4, color = "black", size = 5) +
  scale_fill_brewer(palette = "Paired", name = " ") +
  coord_polar(theta = "y") +
  theme_bw() +
  theme ( panel.border = element_blank()
        , panel.grid.major = element_blank()
        , axis.ticks = element_blank()
        , axis.title.x = element_blank()
        , axis.title.y = element_blank()
        , axis.text.x = element_blank()
        , legend.title = element_text(face="plain", size=16)
        )
pie
```
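The label positions come from simple running-total arithmetic: each slice ends at the cumulative sum, and stepping back half a slice lands in its middle. A quick sketch on the first three counts from the table above:

```r
# Slice sizes in stacking order (the three largest brands from the summary table)
n <- c(5800, 5093, 4727)

# Each slice ends at the running total; half a slice back is its midpoint
pos <- cumsum(n) - 0.5 * n
print(pos)
```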

2. Bar chart
A clearer alternative to the pie is the bar chart. Besides the fact that bar lengths are easier to compare than arc lengths or sector areas, a bar chart can show an extra dimension, for example the distribution of license counts by organizational form.
```r
st <- dt[, list(N = length(RegPlate)), by = list(OrgType, CarBrand) ] # Summary table
cbsort <- st[, list( S = sum(N) ), keyby = CarBrand ] # Order by total number
setkey(st, CarBrand)
st <- st[cbsort] # Join
topcb <- st[ S >= 1000 ][ order(S) ]
bottomcb <- st[S < 1000, list(CarBrand = " ", OrgType, N = sum(N)), by = OrgType]
bottomcb <- bottomcb[, list(CarBrand, OrgType, N, S = sum(N))]
bardata <- rbind( bottomcb, topcb)
bardata <- bardata[, CarBrand := factor(CarBrand, levels = unique(CarBrand), ordered = TRUE)]
# Stacked horizontal bars: one bar per brand, filled by organizational form
bar <- ggplot(bardata, aes(x = CarBrand, weight = N, fill = OrgType)) +
  geom_bar() +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral", name = "") +
  labs(y = " ", x = " ") +
  theme_bw()
bar
```

3. Heat map
Suppose we want to answer the question: "Among taxi drivers, owners of which car brands are the fondest of 'beautiful' plate numbers?" Here we treat a number as beautiful if it is a triple of identical digits: 111, 222, and so on. The analysis runs along two analytical dimensions, car brand and triple. The measure is the number of cars with a particular combination of brand and triple. A good way to visualize such a dataset is a heat map, a visual analogue of a table. The more popular a triple, the more intense the color that encodes the cell's value.
```r
# Plates that start with one non-digit followed by three identical digits
ln <- dt[grep( "^[^0-9]([0-9])\\1{2}.+$" , RegPlate),
         list(CarBrand, LuckyNum = gsub("^[^0-9]([0-9]{3}).+$", "\\1", RegPlate))]
ln <- ln[, list( N = .N), by = list(CarBrand, LuckyNum) ]
ln <- ln[, Luck := sum(N), by = list(CarBrand) ] # Total number of lucky regplates per car brand
ln <- ln[, CarBrand := reorder(CarBrand, Luck) ]
# One tile per (brand, triple) pair, colored by the number of cars
heatmap <- ggplot(ln, aes(x = CarBrand, y = LuckyNum)) +
  geom_tile( aes(fill = as.character(N)), color = "black") +
  scale_fill_brewer(palette = "YlOrRd", name = " «» :") +
  labs(x = " ", y = " ") +
  theme_bw() +
  theme ( panel.grid.major = element_blank()
        , axis.text.x = element_text(angle = 45, hjust = 1)
        , axis.title.y = element_text(vjust = 0.3)
        , legend.position = "top"
        , legend.title.align = 1
        )
heatmap
```
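The two regular expressions can be tried out on a handful of made-up plates. The plates below use Latin letters and are purely illustrative. The sketch also passes `perl = TRUE`, which is an assumption of this example rather than something the original call does; it simply pins the regex engine so the backreference behaves the same everywhere.

```r
# Hypothetical plates in Latin letters, for illustration only
plates <- c("A777BC77", "B123CD99", "K111MM50", "E070KX77")

# One leading non-digit, then the same digit three times in a row
is_lucky <- grepl("^[^0-9]([0-9])\\1{2}.+$", plates, perl = TRUE)

# Extract the three-digit group from the plates that qualified
lucky_num <- gsub("^[^0-9]([0-9]{3}).+$", "\\1", plates[is_lucky], perl = TRUE)

print(data.frame(plate = plates[is_lucky], LuckyNum = lucky_num))
```

The backreference `\\1{2}` is what enforces "identical digits"; a plate such as "E070KX77" has three digits after the letter but is rejected.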

All figures use the research-based color palettes of the Color Brewer 2.0 project.