In search of an interesting and simple DataSet, I came across this handsome man .
It contains data on the growth and weight of 10,000 men and women . No description. Nothing extra". Only height, weight and floor mark. I liked this mysterious simplicity.
Well, let's get started!
Let's go!
To start, load the necessary modules
# import pandas as pd # import matplotlib.pyplot as plt %matplotlib inline # « » from sklearn.neighbors import KNeighborsRegressor # from sklearn.model_selection import train_test_split
When the libraries stood up exactly - it was time to load the DataSet itself and look at the first 10 elements. This is necessary so that our gut is calm, that we have loaded everything correctly.
By the way, do not be alarmed that height and weight differ from what we are used to. This is because of a different measurement system: inches and pounds , instead of centimeters and kilograms .
data = pd.read_csv('weight-height.csv') data.head(10)
Gender | Height | Weight | |
---|---|---|---|
0 | Male | 74 | 242 |
one | Male | 69 | 162 |
2 | Male | 74 | 213 |
3 | Male | 72 | 220 |
four | Male | 70 | 206 |
5 | Male | 68 | 152 |
6 | Male | 69 | 184 |
7 | Male | 68 | 168 |
8 | Male | 67 | 176 |
9 | Male | 63 | 156 |
Good! We see that the first ten entries are “men”. We see their height (height) and weight (weight). Data loaded up well.
Now you can look at the number of rows in the set.
data.shape >> (10000, 3)
Ten thousand lines / records. And each has three parameters . Exactly what is needed!
It's time to fix the measurement system. Now here are centimeters and kilograms.
data['Height'] *= 2.54 data['Weight'] /= 2.205 # data.head(10)
Gender | Height | Weight | |
---|---|---|---|
0 | Male | 188 | 110 |
one | Male | 175 | 74 |
2 | Male | 188 | 96 |
3 | Male | 182 | one hundred |
four | Male | 177 | 94 |
5 | Male | 171 | 69 |
6 | Male | 175 | 83 |
7 | Male | 174 | 76 |
8 | Male | 170 | 80 |
9 | Male | 161 | 71 |
Now it has become more familiar. And the first record tells us about a man with a height of ~ 190cm and a weight of ~ 110kg. Big man. Let's call him Bob.
But how to understand: is it a lot or a little compared to the rest? Is it possible that we are all plus or minus Beans? This is a bit later.
Now let's find out how symmetrical the two genders are in this dataset?
data['Gender'].value_counts() >> Male 5000 Female 5000 Name: Gender, dtype: int64
Ideally equally divided. And this is good, because if there were: 9999 men and 1 woman, then there would be no sense in pretending that this DataSet reveals both sexes equally well. In our case, everything is OK!
Now intuition tells us that it will be correct to separate the two sexes and explore separately. Indeed, in life we often see that men and women have plus or minus different height and weight
# data_male = data[data['Gender'] == 'Male'].copy() # data_female = data[data['Gender'] == 'Female'].copy()
Let's take a look at the small descriptive statistics that the pandas module offers us.
Men :
data_male.describe()
Height | Weight | |
---|---|---|
count | 5000 | 5000 |
mean | 175 | 85 |
std | 7 | 9 |
min | 148 | 51 |
25% | 171 | 79 |
fifty% | 175 | 85 |
75% | 180 | 91 |
max | 201 | 122 |
Women :
data_female.describe()
Height | Weight | |
---|---|---|
count | 5000 | 5000 |
mean | 162 | 62 |
std | 7 | 9 |
min | 138 | 29th |
25% | 157 | 56 |
fifty% | 162 | 62 |
75% | 167 | 67 |
max | 186 | 92 |
In plain language:
Descriptive statistics are a set of numbers / characteristics for a description. Perhaps this is the easiest to understand type of statistics.
Imagine that you are describing the parameters of a ball. He can be:
With a strong simplification, we can say that descriptive statistics is engaged in this . But he does this not with balls, but with data.
And here are the parameters from the table above:
The average value is very sensitive to emissions! If four people receive a salary of 10,000 ₽, and the fifth - 460,000 ₽. That average will be - 100 000 ₽. And the median will remain the same - 10 000 ₽.
This does not mean that average is a bad indicator. It needs to be treated more carefully.
By the way, there’s also a snag with the median.
If the number of measurements is odd. That median is the value in the middle, if you put the data "by growth".
And if it’s even, then the median is the average between the two “most central” ones.
If the data set contains only integers and the median is fractional, do not be surprised. Most likely the number of measurements is even.
An example :
The son brought marks from school. There were five lessons he received: 1, 5, 3, 2, 4
Five ratings → odd amount
Growth: 1, 2, 3, 4, 5
Take the central - 3
Median score - 3
The next day, the son brought from school new grades: 4, 2, 3, 5
Four ratings → odd amount
We build by growth: 2, 3, 4, 5
Take the centerpieces: 3, 4
We find their average: 3.5
Median - 3.5
Conclusion: Well done son :)
We see that in men the average and median are 175cm and 85kg. And in women : 162cm and 62kg. This tells us that there are no strong emissions. Or they are symmetrical on both sides of the median. Which is very rare.
But both sexes have slight deviations of the mean from the median. But they are insignificant and they are visible only on hundredths. Move on!
This is a graph that plots the values from minimum to maximum in order of growth, and shows the number of individual instances.
fig, axes = plt.subplots(2,2, figsize=(20,10)) plt.subplots_adjust(wspace=0, hspace=0) axes[0,0].hist(data_male['Height'], label='Male Height', bins=100, color='red') axes[0,1].hist(data_male['Weight'], label='Male Weight', bins=100, color='red', alpha=0.4) axes[1,0].hist(data_female['Height'], label='Female Height', bins=100, color='blue') axes[1,1].hist(data_female['Weight'], label='Female Weight', bins=100, color='blue', alpha=0.4) axes[0,0].legend(loc=2, fontsize=20) axes[0,1].legend(loc=2, fontsize=20) axes[1,0].legend(loc=2, fontsize=20) axes[1,1].legend(loc=2, fontsize=20) plt.savefig('plt_histogram.png') plt.show()
Data is distributed bell-shaped. Very similar to normal distribution .
In addition to statistical tests for normal distribution, there is a visual test. If the distribution by type and logic seems to be normal - we can assume with some assumptions that we are dealing with it.
One could do a statistical normality test and determine p-value, but I do not know how this is beyond the scope of the article.
Pandas can count a lot for us. But you need to at least once count some statistics yourself. Now I will show how to calculate the standard deviation .
Let's do it on the example of men and the characteristic - growth.
Formula:
where
The code:
mean = data_male['Height'].mean() print('mean:\t{:.2f}'.format(mean)) >> mean: 175.33
Average height - 175cm
where
The code:
data_male['Height_d'] = (data_male['Height'] - mean) ** 2 data_male['Height_d'].head(10) >> 0 149.927893 1 0.385495 2 166.739089 3 47.193692 4 4.721246 5 20.288347 6 0.375539 7 2.964214 8 25.997623 9 200.149603 Name: Height_d, dtype: float64
Formula:
where
The code:
disp = data_male['Height_d'].mean() print('disp:\t{:.2f}'.format(disp)) >> disp: 52.89
Dispersion - 53
Formula:
where
The code:
std = disp ** 0.5 print('std:\t{:.2f}'.format(std)) >> std: 7.27
Standard Deviation - 7
Now we will find out in what ranges of growth and weight 68%, 95% and 99.7% of men and women are .
This is not so difficult - you need to add and subtract the standard deviation from the average. It looks like this:
We write an auxiliary function that will consider this:
def get_stats(series, title='noname'): # print('= {} =\n'.format(title.upper())) # pandas descr = series.describe() # mean = descr['mean'] print('= Mean:\t{:.0f}'.format(mean)) # std = descr['std'] print('= Std:\t{:.0f}'.format(std)) # print('\n= = = =\n') # ## 68% devi_1 = [mean - std, mean + std] ## 95% devi_2 = [mean - 2 * std, mean + 2 * std] ## 99.7% devi_3 = [mean - 3 * std, mean + 3 * std] # print('= 68% is from\t\t{:.0f} to {:.0f}'.format(devi_1[0], devi_1[1])) print('= 95% is from\t\t{:.0f} to {:.0f}'.format(devi_2[0], devi_2[1])) print('= 99.7% is from\t\t{:.0f} to {:.0f}'.format(devi_3[0], devi_3[1]))
Well, apply it to the data:
Men | Height
get_stats(data_male['Height'], title='Male Height') >> = MALE HEIGHT = = Mean: 175 = Std: 7 = = = = = 68% is from 168 to 183 = 95% is from 161 to 190 = 99.7% is from 154 to 197
Men | The weight
get_stats(data_male['Height'], title='Male Height') >> = MALE WEIGHT = = Mean: 85 = Std: 9 = = = = = 68% is from 76 to 94 = 95% is from 67 to 103 = 99.7% is from 58 to 112
Women | Height
get_stats(data_male['Height'], title='Male Height') >> = FEMALE HEIGHT = = Mean: 162 = Std: 7 = = = = = 68% is from 155 to 169 = 95% is from 148 to 176 = 99.7% is from 141 to 182
Women | The weight
get_stats(data_male['Height'], title='Male Height') >> = FEMALE WEIGHT = = Mean: 62 = Std: 9 = = = = = 68% is from 53 to 70 = 95% is from 44 to 79 = 99.7% is from 36 to 87
Hence the conclusions:
Now it remains only to apply machine learning to this set and try to predict weight by height.
The algorithm "To the nearest neighbors" is simple. It exists for classification tasks - to distinguish a cat from a dog - and for regression tasks - to guess the weight by height. That's what we need!
For regression, he uses the following algorithm:
First you need to divide the data set into the training and test parts and test the algorithm
X_train, X_test, y_train, y_test = train_test_split(data_male['Height'], data_male['Weight'])
Divided, it is time to try.
# knr3 = KNeighborsRegressor(n_neighbors=3) knr3.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr3.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.8298400793623182 # knr5 = KNeighborsRegressor(n_neighbors=5) knr5.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr5.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.7958051642678619 # knr7 = KNeighborsRegressor(n_neighbors=7) knr7.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr7.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.7769249318420969
We will not go far and stop at three neighbors. But the question is: can such a model guess my weight?
knr3.predict([[180]])[0, 0] >> 88.67596236265881
88kg is very close. This second, my weight is 89.8kg
The time to build my favorite part of science is graphics.
array_male = [] # 99.7% xaxis = range(154, 198) for h in xaxis: ans = knr3.predict([[h]]) array_male.append(ans[0, 0]) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_male, 'r-', linewidth=4) plt.title('Male heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_male.png') plt.show()
X_train, X_test, y_train, y_test = train_test_split(data_female['Height'], data_female['Weight']) knr3 = KNeighborsRegressor(n_neighbors=3) knr3.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr3.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.8135681584074799
array_female = [] # 99.7% xaxis = range(141, 183) for h in xaxis: ans = knr3.predict([[h]]) array_female.append(ans[0, 0]) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_female, 'b-', linewidth=4) plt.title('Female heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_female.png') plt.show()
And of course it’s interesting how these graphs look together:
# xaxis = range(154, 183) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_male[:-15], 'r-', linewidth=4) plt.plot(xaxis, array_female[13:], 'b-', linewidth=4) plt.title('Together heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_together.png') plt.show()
- What is the range of weight and height for most men and women?
99.7% of men: from 154cm to 197cm and from 58kg to 112kg.
And 99.7% of women: from 141cm to 182cm and from 36kg to 87kg.
- What kind of “average” man and “average” woman are they?
The average man is 175cm and 85kg.
And the average woman is 162cm and 62kg.
- Can the simple KNN machine learning model from these data predict weight by height?
Yes, the model predicted 88kg, and I have 89.8kg.
All that I did, I collected here
Like if you hit the 99.7% interval