Course structure:
- 10 articles on Habr (and the same articles in English on Medium)
- 10 lectures (a YouTube channel in Russian plus more recent lectures in English); a detailed description of each topic is given in this article
- Reproducible materials (Jupyter notebooks) in the mlcourse.ai repository and as a Kaggle Dataset (only a browser is needed)
- Excellent Kaggle Inclass competitions (about building features, not "stacking xgboosts")
- Homework for each topic (the repository contains a list of demo versions of the assignments)
- A motivating rating, plenty of live discussion, and quick feedback from the authors
The current run of the course starts on October 1, 2018, and is held in English (link to the sign-up survey; please fill it in in English). Follow the announcements in the VK group and join the OpenDataScience community.
List of articles in the series
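- Exploratory data analysis with Pandas
- Visual data analysis with Python
- Classification, decision trees, and the nearest neighbors method
- Linear classification and regression models
- Ensembles: bagging, random forest
- Feature engineering and selection. Applications to text, image, and geodata tasks
- Unsupervised learning: PCA, clustering
- Training on gigabytes of data with Vowpal Wabbit
- Time series analysis with Python
- Gradient boosting
Outline of this article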
- About the course
- Homework in the course
- Demonstration of the main Pandas methods
- First attempts at predicting churn
- Homework No. 1
- Overview of useful resources
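1. About the course
We are not setting out to develop yet another comprehensive introductory course on machine learning or data analysis (that is, this is not a replacement for the Yandex and MIPT specialization, the continuing-education program at HSE, or other fundamental online and offline programs and books). The purpose of this series of articles is to help you quickly refresh your knowledge and find topics for further study. The approach is similar to that of the authors of the Deep Learning book, which begins with a review of mathematics and machine learning basics: brief, as dense as possible, and with abundant links to sources.
If you plan to take the course, be warned: when choosing topics and preparing materials, we assume that students know mathematics at the level of a second-year student at a technical university and can program at least a little in Python. These are not strict selection criteria, only recommendations. You can enroll without knowing the mathematics or Python and catch up in parallel: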
- Basic mathematics (calculus, linear algebra, optimization, probability theory, and statistics) can be reviewed using these Yandex & MIPT notes (shared with permission). They are brief and in Russian: just what you need. For more depth: Kudryavtsev for calculus, Kostrikin for linear algebra, Boyd for optimization (in English), and Kibzun for probability theory and statistics. There are also excellent online courses from MIPT and HSE on Coursera.
- For Python, a short interactive tutorial on Datacamp or this repository on Python and basic algorithms and data structures will be enough. Something more advanced is, for example, the course from the Computer Science Center in St. Petersburg.
- As for machine learning, the classic (though slightly dated) choice is Andrew Ng's Machine Learning course (Stanford, Coursera). In Russian there is an excellent specialization from MIPT and Yandex, "Machine Learning and Data Analysis". And here are the best books: "Pattern Recognition and Machine Learning" (Bishop), "Machine Learning: A Probabilistic Perspective" (Murphy), "The Elements of Statistical Learning" (Hastie, Tibshirani, Friedman), and "Deep Learning" (Goodfellow, Bengio, Courville). Goodfellow's book begins with a review of mathematics and a clear, engaging introduction to machine learning and the inner workings of its algorithms. There is also a book on deep learning in Russian: "Deep Learning: Immersion in the World of Neural Networks" (Nikolenko S. I., Kadurin A. A., Arkhangelskaya E. O.).
The course is also described in this announcement.
What software is needed
To complete the course you need a number of Python packages, most of which are included in the Anaconda distribution with Python 3.6. Other libraries will be needed later; they will be mentioned as they come up. The full list is in the Dockerfile.
You can also use a Docker container with all the necessary software already installed. Details are on the repository's Wiki page.
How to join the course
No formal registration is required. You can join the course at any time after it starts (October 1, 2018), but the homework deadlines are strict.
But so that we can get to know you better:
- Fill in the survey, giving your real name.
- Join the OpenDataScience community; the course is discussed in the #mlcourse.ai channel.
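2. Homework in the course
- Each article comes with a homework assignment in the form of a Jupyter notebook in which you need to fill in code and, based on the result, choose the correct answer in a Google form.
- Homework solutions are sent to those who submitted their answers through the form.
- At the end of the series, the results will be summed up (the participant rating).
- Examples of homework assignments are given at the end of the articles in the series.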
3. Demonstration of the main Pandas methods
All the code can be reproduced in this Jupyter notebook.
Pandas is a Python library that provides extensive facilities for data analysis. The data that data scientists work with is often stored in tabular form, for example in .csv, .tsv, or .xlsx files. With Pandas it is very convenient to load, process, and analyze such tabular data using SQL-like queries. In combination with the Matplotlib and Seaborn libraries, Pandas also offers ample opportunities for visual analysis of tabular data.
The main data structures in Pandas are the Series and DataFrame classes. The first is a one-dimensional indexed array of data of some fixed type. The second is a two-dimensional data structure: a table in which each column contains data of a single type. You can think of it as a dictionary of Series objects. The DataFrame structure is well suited to representing real data: rows correspond to the feature descriptions of individual objects, and columns correspond to features.
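To illustrate the "dictionary of Series" view, here is a minimal sketch. The column names and values are made up for the example and are not taken from the course dataset.

```python
import pandas as pd

# Two Series sharing the same index (hypothetical customer ids)
minutes = pd.Series([265.1, 161.6, 243.4], index=["a", "b", "c"])
churned = pd.Series([False, False, True], index=["a", "b", "c"])

# A DataFrame built from a dictionary of Series: keys become column names
toy = pd.DataFrame({"minutes": minutes, "churned": churned})

print(toy.dtypes)            # each column keeps its own type
print(type(toy["minutes"]))  # selecting one column gives back a Series
```

Each column has a single type, while different columns may have different types, which is what the description above says about DataFrame.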
# import Pandas and NumPy
import pandas as pd
import numpy as np
We will demonstrate the main methods in action by analyzing a dataset on customer churn at a telecom operator (there is no need to download it; it is in the repository). Let's read the data (the read_csv method) and look at the first 5 rows using the head method.
df = pd.read_csv('../../data/telecom_churn.csv')
df.head()
In Jupyter notebooks, Pandas DataFrames are displayed as neat tables like this one, whereas print(df.head()) looks worse.
By default, Pandas displays only 20 columns and 60 rows, so if your DataFrame is larger, use the set_option function:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
Each row represents one customer; this is the object of study.
The columns are the features of the object.
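Name | Description | Type |
---|---|---|
State | Letter code of the state | nominal |
Account length | How long the customer has been served by the company | quantitative |
Area code | Phone number prefix | quantitative |
International plan | International roaming (on/off) | binary |
Voice mail plan | Voice mail (on/off) | binary |
Number vmail messages | Number of voice mail messages | quantitative |
Total day minutes | Total duration of daytime calls | quantitative |
Total day calls | Total number of daytime calls | quantitative |
Total day charge | Total charge for daytime services | quantitative |
Total eve minutes | Total duration of evening calls | quantitative |
Total eve calls | Total number of evening calls | quantitative |
Total eve charge | Total charge for evening services | quantitative |
Total night minutes | Total duration of night calls | quantitative |
Total night calls | Total number of night calls | quantitative |
Total night charge | Total charge for night services | quantitative |
Total intl minutes | Total duration of international calls | quantitative |
Total intl calls | Total number of international calls | quantitative |
Total intl charge | Total charge for international calls | quantitative |
Customer service calls | Number of calls to the service center | quantitative |
The target variable is Churn, the churn indicator, a binary feature (1 means the customer was lost, i.e. churned). Later we will build models that predict this feature from the others, which is why we call it the target.
Let's look at the size of the data, the feature names, and their types.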
print(df.shape)
(3333, 20)
The table has 3333 rows and 20 columns. Let's print the column names:
print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')
To see general information about the DataFrame and all its features, use the info method:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None
bool, int64, float64, and object are the types of our features. We can see that 1 feature is logical (bool), 3 features are of type object, and 16 features are numeric. The info method is also handy for quickly checking for missing values: there are none here, since every column contains 3333 observations.
The column type can be changed with the astype method. Let's apply it to the Churn feature and convert it to int64:
df['Churn'] = df['Churn'].astype('int64')
The describe method shows the main statistical characteristics of each numeric feature (types int64 and float64): the number of non-missing values, mean, standard deviation, range, median, and the 0.25 and 0.75 quartiles.
df.describe()
To see statistics for non-numeric features, you need to explicitly specify the types of interest in the include parameter.
df.describe(include=['object', 'bool'])
| State | International plan | Voice mail plan |
---|---|---|---|
count | 3333 | 3333 | 3333 |
unique | 51 | 2 | 2 |
top | WV | No | No |
freq | 106 | 3010 | 2411 |
For categorical (type object) and boolean (type bool) features, you can use the value_counts method. Let's look at the distribution of our target variable, Churn:
df['Churn'].value_counts()
0    2850
1     483
Name: Churn, dtype: int64
2850 of the 3333 users are loyal; for them the Churn variable is 0.
Let's look at the distribution of users by the Area code variable. We pass normalize=True to see relative rather than absolute frequencies.
df['Area code'].value_counts(normalize=True)
415    0.496550
510    0.252025
408    0.251425
Name: Area code, dtype: float64
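To see what normalize=True changes, here is a minimal sketch on a toy Series (the values are invented for illustration). Relative frequencies are the counts divided by the number of observations, so they sum to 1.

```python
import pandas as pd

# Toy data: four hypothetical customers and their area codes
codes = pd.Series([415, 415, 510, 408], name="Area code")

absolute = codes.value_counts()                # counts per distinct value
relative = codes.value_counts(normalize=True)  # shares per distinct value

print(absolute)
print(relative)
```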
Sorting
A DataFrame can be sorted by the values of any feature. In our case, for example, by Total day charge (pass ascending=False to sort in descending order):
df.sort_values(by='Total day charge', ascending=False).head()
You can also sort by a group of columns:
df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head()
Thanks to makkos for the comment about the deprecated sort method.
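A minimal sketch of sorting by several columns, on a toy DataFrame with made-up values. Rows are ordered by Churn ascending first, and within each Churn group by Total day charge descending.

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": [1, 0, 0, 1],
    "Total day charge": [30.0, 45.0, 50.0, 60.0],
})

# One ascending flag per sort key, in the same order as the keys
ordered = toy.sort_values(by=["Churn", "Total day charge"],
                          ascending=[True, False])

print(ordered)
print(list(ordered.index))  # [2, 1, 3, 0]
```

Note that sort_values returns a new DataFrame and keeps the original row labels, which is why the index is shuffled in the result.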
Indexing and retrieving data
A DataFrame can be indexed in different ways. With that in mind, we will look at various ways of indexing and extracting the data we need from a DataFrame, using simple questions as examples.
To get a single column, you can use the DataFrame['Name'] construct. Let's use it to answer the question: what is the share of disloyal users in our DataFrame?
df['Churn'].mean()  # result: 0.14491449144914492
14.5% is a rather bad figure for a company; with a churn rate like that, a company can go bankrupt.
Boolean indexing of a DataFrame by a single column is very convenient. It looks like this: df[P(df['Name'])], where P is some logical condition checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy condition P on the Name column.
Let's use it to answer the question: what are the mean values of the numeric features for disloyal users?
df[df['Churn'] == 1].mean(numeric_only=True)  # numeric_only=True skips the non-numeric columns, which recent pandas versions require
Account length            102.664596
Number vmail messages       5.115942
Total day minutes         206.914079
Total day calls           101.335404
Total day charge           35.175921
Total eve minutes         212.410145
Total eve calls           100.561077
Total eve charge           18.054969
Total night minutes       205.231677
Total night calls         100.399586
Total night charge          9.235528
Total intl minutes         10.700000
Total intl calls            4.163561
Total intl charge           2.889545
Customer service calls      2.229814
Churn                       1.000000
dtype: float64
Combining the two previous kinds of indexing, let's answer the question: how much time do disloyal users spend on the phone during the day, on average?
df[df['Churn'] == 1]['Total day minutes'].mean()  # result: 206.91407867494823
What is the maximum length of international calls among loyal users (Churn == 0) who do not use international roaming ('International plan' == 'No')?
df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()  # result: 18.899999999999999
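The combined condition can be sketched on a toy DataFrame (the values are invented; only the column names follow the dataset). Each comparison produces a boolean Series, and & combines them element by element, so each comparison needs its own parentheses.

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": [0, 1, 0, 1],
    "International plan": ["No", "Yes", "No", "No"],
    "Total intl minutes": [10.0, 13.7, 12.2, 6.6],
})

# Boolean mask: True for loyal users without an international plan
mask = (toy["Churn"] == 0) & (toy["International plan"] == "No")

# Filter the rows, then pick a column and aggregate
longest = toy[mask]["Total intl minutes"].max()

# The same selection in a single step with loc (rows by mask, column by name)
same = toy.loc[mask, "Total intl minutes"].max()

print(mask.tolist())  # [True, False, True, False]
print(longest)        # 12.2
```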
DataFrames can be indexed by column or row name, or by position. The loc method is used for indexing by name, and iloc for indexing by number.
In the first case we say "give us the values for the rows with id 0 through 5 and for the columns State through Area code"; in the second, "give us the values of the first five rows in the first three columns".
Note: when you pass a slice object to iloc, the DataFrame is sliced as usual, with the end excluded. With loc, however, both the start and the end of the slice are included (link to the documentation; thanks to arkane0906 for the comment).
df.loc[0:5, 'State':'Area code']
| State | Account length | Area code |
---|---|---|---|
0 | KS | 128 | 415 |
1 | OH | 107 | 415 |
2 | NJ | 137 | 415 |
3 | OH | 84 | 408 |
4 | OK | 75 | 415 |
5 | AL | 118 | 510 |
df.iloc[0:5, 0:3]
| State | Account length | Area code |
---|---|---|---|
0 | KS | 128 | 415 |
1 | OH | 107 | 415 |
2 | NJ | 137 | 415 |
3 | OH | 84 | 408 |
4 | OK | 75 | 415 |
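The difference in slice boundaries between loc and iloc can be checked with a short sketch on a toy DataFrame (first rows as in the tables above, shortened to four rows for the example).

```python
import pandas as pd

# Toy DataFrame with the default integer index 0..3
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OH"],
    "Account length": [128, 107, 137, 84],
    "Area code": [415, 415, 415, 408],
})

# loc slices by label and includes both ends: rows 0, 1, 2
by_label = toy.loc[0:2, "State":"Account length"]

# iloc slices by position and excludes the end: rows 0, 1
by_position = toy.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```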
If you need the first or last row of the DataFrame, use the df[:1] or df[-1:] construct:
df[-1:]
Applying functions to cells, columns, and rows
Applying a function to each column: apply
df.apply(np.max)
State                        WY
Account length              243
Area code                   510
International plan          Yes
Voice mail plan             Yes
Number vmail messages        51
Total day minutes         350.8
Total day calls             165
Total day charge          59.64
Total eve minutes         363.7
Total eve calls             170
Total eve charge          30.91
Total night minutes         395
Total night calls           175
Total night charge        17.77
Total intl minutes           20
Total intl calls             20
Total intl charge           5.4
Customer service calls        9
Churn                      True
dtype: object
The apply method can also be used to apply a function to each row. To do that, specify axis=1.
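A minimal sketch of the two directions of apply on a toy DataFrame (values are invented). Lambdas are used here instead of np.max purely as a choice for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Total day calls": [110, 123],
    "Total eve calls": [99, 103],
    "Total night calls": [91, 104],
})

# axis=0 (the default): the function receives each column as a Series
column_max = toy.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_total = toy.apply(lambda row: row.sum(), axis=1)

print(column_max.tolist())  # [123, 103, 104]
print(row_total.tolist())   # [300, 330]
```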
Applying a function to each cell of a column: map
For example, the map method can be used to replace values in a column by passing it a dictionary of the form {old_value: new_value} as an argument:
d = {'No' : False, 'Yes' : True}
df['International plan'] = df['International plan'].map(d)
df.head()
A similar operation can be performed with the replace method:
df = df.replace({'Voice mail plan': d})
df.head()
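The two methods give the same result when every value is covered by the dictionary, as in the dataset above. One difference worth knowing shows up only for values missing from the dictionary; the sketch below uses a toy Series with a made-up third value ("Maybe") to show it.

```python
import pandas as pd

# dtype=object is set explicitly so the example behaves the same across pandas versions
answers = pd.Series(["No", "Yes", "Maybe"], dtype=object)
d = {"No": False, "Yes": True}

mapped = answers.map(d)        # values missing from the dictionary become NaN
replaced = answers.replace(d)  # values missing from the dictionary are left as they are

print(mapped.isna().tolist())  # [False, False, True]
print(replaced.tolist())       # [False, True, 'Maybe']
```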
Grouping data
In general, grouping data in Pandas looks like this:
df.groupby(by=grouping_columns)[columns_to_show].function()
- The groupby method is applied to the DataFrame and splits the data by grouping_columns (a feature or a set of features).
- The columns we need are selected (columns_to_show).
- One or more functions are applied to the resulting groups.
Let's group the data by the value of the Churn feature and display statistics for three columns in each group.
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])
Let's do the same thing in a slightly different way, passing a list of functions to agg:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
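The split, select, and aggregate steps can be followed on a toy DataFrame with invented values. This sketch passes the aggregation functions to agg by name (strings) rather than as NumPy functions; that is a choice made for the example, not something the article prescribes.

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": [0, 0, 1, 1],
    "Total day minutes": [100.0, 200.0, 300.0, 500.0],
})

# Split by Churn, select one column, apply several aggregations per group
stats = toy.groupby("Churn")["Total day minutes"].agg(["mean", "min", "max"])

print(stats)
```

The result has one row per group (the values of Churn) and one column per aggregation function.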
Summary tables
Suppose we want to see how the observations in our sample are distributed across two features, Churn and International plan. To do this, we can build a contingency table using the crosstab method.
pd.crosstab(df['Churn'], df['International plan'])
International plan | False | True |
---|---|---|
Churn | ||
0 | 2664 | 186 |
1 | 346 | 137 |
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)
Voice mail plan | False | True |
---|---|---|
Churn | ||
0 | 0.602460 | 0.252625 |
1 | 0.120912 | 0.024002 |
We can see that most users are loyal and, at the same time, do not use the additional services (international roaming / voice mail).
Advanced Excel users will probably recall the pivot table feature. In Pandas, pivot tables are handled by the pivot_table method, which takes the following parameters:
- values: the list of variables for which to compute the required statistics,
- index: the list of variables to group the data by,
- aggfunc: what we actually need to compute for each group: sum, mean, maximum, minimum, or something else.
Let's look at the average number of day, evening, and night calls for different area codes:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean').head(10)
| Total day calls | Total eve calls | Total night calls |
---|---|---|---|
Area code | |||
408 | 100.496420 | 99.788783 | 99.039379 |
415 | 100.576435 | 100.503927 | 100.398187 |
510 | 100.097619 | 99.671429 | 100.601190 |
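A minimal sketch of how the pivot_table parameters map onto a result, using a toy DataFrame with made-up values. For a single index variable and the mean, the same numbers can also be obtained with groupby, which is shown for comparison.

```python
import pandas as pd

toy = pd.DataFrame({
    "Area code": [408, 408, 415],
    "Total day calls": [100, 110, 90],
    "Total eve calls": [80, 90, 70],
})
cols = ["Total day calls", "Total eve calls"]

# values: what to aggregate; index: what to group by; aggfunc: how to aggregate
pivot = toy.pivot_table(values=cols, index=["Area code"], aggfunc="mean")

# The same averages computed with groupby
grouped = toy.groupby("Area code")[cols].mean()

print(pivot)
```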
DataFrame transformations
As with many other things in Pandas, there are several ways to add columns to a DataFrame.
For example, let's compute the total number of calls for every user. We create a total_calls object of type Series and insert it into the DataFrame:
total_calls = df['Total day calls'] + df['Total eve calls'] + \
              df['Total night calls'] + df['Total intl calls']
# loc is the position at which the Series is inserted;
# len(df.columns) places it at the very end
df.insert(loc=len(df.columns), column='Total calls', value=total_calls)
df.head()
A column can also be added from existing ones without creating an intermediate Series:
df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + df['Total night charge'] + df['Total intl charge']
df.head()
To delete columns or rows, use the drop method, passing the required indexes and the appropriate value of the axis parameter (1 to delete columns; 0, or nothing, to delete rows):
# get rid of the columns we have just created
df = df.drop(['Total charge', 'Total calls'], axis=1)
df.drop([1, 2]).head()  # and this is how rows can be deleted
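A short sketch on a toy DataFrame (invented values) that shows one detail of the calls above: insert modifies the DataFrame in place, while drop returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# insert adds the column in place, at the position given by loc
toy.insert(loc=len(toy.columns), column="total", value=toy["a"] + toy["b"])

# drop returns new objects; toy itself is not changed
without_total = toy.drop(["total"], axis=1)  # delete a column
without_rows = toy.drop([1, 2])              # delete rows with labels 1 and 2

print(list(toy.columns))            # ['a', 'b', 'total']
print(list(without_total.columns))  # ['a', 'b']
print(list(without_rows.index))     # [0]
```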
4. First attempts at predicting churn
Let's see how churn is related to the International plan feature. We will do this with a crosstab summary table, and also with a Seaborn illustration (exactly how to build such pictures and analyze plots with them is the subject of the next article).
pd.crosstab(df['Churn'], df['International plan'], margins=True)
International plan | False | True | All |
---|---|---|---|
Churn | |||
0 | 2664 | 186 | 2850 |
1 | 346 | 137 | 483 |
All | 3010 | 323 | 3333 |
We see that when roaming is enabled, the churn rate is much higher: an interesting observation! Perhaps large and poorly controlled roaming expenses are a major source of conflict and lead to dissatisfaction among the operator's customers and, consequently, to churn.
Next, let's look at another important feature, the number of calls to the service center (Customer service calls). We will again build a summary table and a picture.
pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)
Customer service calls | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
---|---|---|---|---|---|---|---|---|---|---|---|
Churn | |||||||||||
0 | 605 | 1059 | 672 | 385 | 90 | 26 | 8 | 4 | 1 | 0 | 2850 |
1 | 92 | 122 | 87 | 44 | 76 | 40 | 14 | 5 | 1 | 2 | 483 |
All | 697 | 1181 | 759 | 429 | 166 | 66 | 22 | 9 | 2 | 2 | 3333 |
It may not be so obvious from the summary table (or it is tedious to crawl along rows of numbers), but the picture shows eloquently that the churn rate rises sharply starting from 4 calls to the service center.
Now let's add a binary feature to our DataFrame: the result of the comparison Customer service calls > 3. And once again, let's see how it relates to churn.
df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')
pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)
Churn | 0 | 1 | All |
---|---|---|---|
Many_service_calls | |||
0 | 2721 | 345 | 3066 |
1 | 129 | 138 | 267 |
All | 2850 | 483 | 3333 |
Let's combine the conditions discussed above and build a summary table for this combination and churn.
pd.crosstab(df['Many_service_calls'] & df['International plan'], df['Churn'])
Churn | 0 | 1 |
---|---|---|
row_0 | ||
False | 2841 | 464 |
True | 9 | 19 |
So if we predict churn when the number of calls to the service center is greater than 3 and roaming is enabled (and predict loyalty otherwise), we can expect about 85.8% correct answers: the prediction is right in 2841 + 19 cases out of 3333 and wrong in only 464 + 9. This 85.8%, obtained by very simple reasoning, is a good starting point (a baseline) for the machine learning models we will build later.
In general, before the advent of machine learning, the data analysis process looked something like this. To summarize:
- The share of loyal customers in the sample is 85.5%. The most naive model, whose answer is always "the customer is loyal", will be right in about 85.5% of cases on such data. So the share of correct answers (accuracy) of subsequent models should be at least this high, and preferably considerably higher.
- With a simple prediction that can be loosely expressed by the formula "International plan = True & Customer service calls > 3 => Churn = 1, else Churn = 0", we can expect an accuracy of 85.8%, which is slightly higher than 85.5%. Later we will talk about decision trees and figure out how to find such rules automatically, based only on the input data.
- We obtained these two baselines without any machine learning, and they serve as starting points for our subsequent models. If it turns out that, with enormous effort, we increase accuracy by only, say, 0.5%, then perhaps we are doing something wrong, and it is enough to stick with the simple two-condition model.
- Before training complex models, it is a good idea to play with the data a little and check simple assumptions. Moreover, in business applications of machine learning, people most often start with simple solutions and only then experiment with more complex ones.
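The two baseline figures quoted above can be recomputed directly from the counts in the summary tables. The sketch below uses only those counts; the variable names are chosen for this example.

```python
# Counts taken from the crosstab tables in this article
n_total = 3333          # all customers
n_loyal = 2850          # customers with Churn == 0

# Baseline 1: always answer "the customer is loyal"
naive_accuracy = n_loyal / n_total

# Baseline 2: predict churn only if roaming is enabled AND service calls > 3.
# Correct answers: loyal customers outside the rule plus churned customers inside it
correct_by_rule = 2841 + 19
rule_accuracy = correct_by_rule / n_total

print(round(naive_accuracy * 100, 1))  # 85.5
print(round(rule_accuracy * 100, 1))   # 85.8
```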
5. Homework No. 1
From here on, the course is run in English (the articles are on Medium as well). The next launch is on October 1, 2018.
As a warm-up, we suggest analyzing demographic data with Pandas. You need to fill in the missing code in the Jupyter notebook template and choose the correct answers in the web form (you will find the solution there as well).
6. Overview of useful resources
- English translation of this article: Medium story
- Video of the lecture based on this article
- First and foremost, of course, the official Pandas documentation. In particular, we recommend the short introduction 10 minutes to pandas
- Russian translation of the book "Learning pandas" + repository
- PDF cheat sheet for the library
- Alexander Dyakonov's presentation "Introduction to Pandas"
- The "Modern Pandas" series of posts (in English)
- On GitHub there is a collection of Pandas exercises and another useful repository (in English), "Effective Pandas"
- scipy-lectures.org: a tutorial on working with pandas, numpy, matplotlib, and scikit-learn
- Pandas From The Ground Up: a video from PyCon 2015
This article was co-written with yorko (Yuri Kashnitsky).