작성자: admin 작성일시: 2016-07-08 23:34:29 조회수: 1387 다운로드: 86
카테고리: Python 태그목록:

Pandas 피봇과 그룹 연산

피봇 테이블

피봇 테이블(pivot table)이란 데이터 열(column) 중에서 두 개를 키(key)로 사용하여 데이터를 선택하는 방법을 말한다.

피봇 테이블을 사용하기 위해서는 키가 될 수 있는 두 개의 열(column) 혹은 필드(field)를 선택하여 이 두 열을

  • 행 인덱스 (row index)
  • 열 인덱스 (column index)

로 변경해야 한다.

  • pivot 메서드를 사용하면 행 인덱스, 열 인덱스, 자료가 될 3가지의 열(column)을 지정할 수 있다.
In [1]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 2.5, 3.0, 2.5, 3.5]
}
df = pd.DataFrame(data, columns=["state", "year", "pop"])
df
Out[1]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 2.5
2 Ohio 2002 3.0
3 Nevada 2001 2.5
4 Nevada 2002 3.5
In [2]:
df.pivot("state", "year", "pop")
Out[2]:
year 2000 2001 2002
state
Nevada NaN 2.5 3.5
Ohio 1.5 2.5 3.0
  • 행 인덱스와, 열 인덱스가 될 자료는 키(key)의 역할을 해야 한다. 즉, 이 값으로 데이터가 유일하게(unique) 결정되어야 한다.
In [3]:
df.pivot("year", "pop", "state")

ValueErrorTraceback (most recent call last)
<ipython-input-3-749f5b6847f3> in <module>()
----> 1 df.pivot("year", "pop", "state")

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in pivot(self, index, columns, values)
   3844         """
   3845         from pandas.core.reshape import pivot
-> 3846         return pivot(self, index=index, columns=columns, values=values)
   3847 
   3848     def stack(self, level=-1, dropna=True):

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/reshape.pyc in pivot(self, index, columns, values)
    330         indexed = Series(self[values].values,
    331                          index=MultiIndex.from_arrays([index, self[columns]]))
--> 332         return indexed.unstack(columns)
    333 
    334 

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/series.pyc in unstack(self, level, fill_value)
   2041         """
   2042         from pandas.core.reshape import unstack
-> 2043         return unstack(self, level, fill_value)
   2044 
   2045     # ----------------------------------------------------------------------

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/reshape.pyc in unstack(obj, level, fill_value)
    405     else:
    406         unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407                                fill_value=fill_value)
    408         return unstacker.get_result()
    409 

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/reshape.pyc in __init__(self, values, index, level, value_columns, fill_value)
     99 
    100         self._make_sorted_values_labels()
--> 101         self._make_selectors()
    102 
    103     def _make_sorted_values_labels(self):

/home/dockeruser/anaconda2/lib/python2.7/site-packages/pandas/core/reshape.pyc in _make_selectors(self)
    137 
    138         if mask.sum() < len(self.index):
--> 139             raise ValueError('Index contains duplicate entries, '
    140                              'cannot reshape')
    141 

ValueError: Index contains duplicate entries, cannot reshape
In [4]:
df.set_index(["state", "year"]).unstack()
Out[4]:
pop
year 2000 2001 2002
state
Nevada NaN 2.5 3.5
Ohio 1.5 2.5 3.0

그룹 연산

그룹 연산은 피봇 테이블과 달리 키에 의해서 결정되는 데이터가 복수개가 있어도 괜찮다. 대신 연산을 통해 복수개의 그룹 데이터에 대한 대표값을 정한다. 이를 split-apply-combine 연산이라고도 한다.

  • split 단계
    • 특정 Key 값에 따라 데이터 그룹을 만든다.
  • apply 단계

    • 각각의 그룹에 대해 원하는 연산을 하여 대표값을 생성한다.
      • count(), mean(), median(), min(), max()
      • sum(), prod(), std(), var(), quantile()
      • first(), last()
  • combine 단계

    • 그룹의 Key 값에 대해 원하는 연산의 결과를 Value로 지정한 dict를 생성한다.

In [5]:
np.random.seed(0)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[5]:
data1 data2 key1 key2
0 1.764052 -0.977278 a one
1 0.400157 0.950088 a two
2 0.978738 -0.151357 b one
3 2.240893 -0.103219 b two
4 1.867558 0.410599 a one
  • 문제: key1 값에 따른 data1의 평균은?
In [6]:
df.data1.groupby(df.key1).mean()
Out[6]:
key1
a    1.343923
b    1.609816
Name: data1, dtype: float64
In [7]:
gs = df.data1.groupby(df.key1)
gs
Out[7]:
<pandas.core.groupby.SeriesGroupBy object at 0x7f2a4f3c7910>
In [8]:
print("="*50)
for n, g in gs:
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*50)
    print(g)
    print("-"*50)
    print("[mean]:", g.mean())
    print("="*50)
==================================================
[key]: a
[group]: <class 'pandas.core.series.Series'>
--------------------------------------------------
0    1.764052
1    0.400157
4    1.867558
Name: data1, dtype: float64
--------------------------------------------------
[mean]: 1.34392251483
==================================================
[key]: b
[group]: <class 'pandas.core.series.Series'>
--------------------------------------------------
2    0.978738
3    2.240893
Name: data1, dtype: float64
--------------------------------------------------
[mean]: 1.60981559165
==================================================
In [9]:
gs.mean()
Out[9]:
key1
a    1.343923
b    1.609816
Name: data1, dtype: float64
  • 문제: 복합 key (key1, key2) 값에 따른 data1의 평균은?
In [10]:
means = df.data1.groupby([df.key1, df.key2]).mean()
means
Out[10]:
key1  key2
a     one     1.815805
      two     0.400157
b     one     0.978738
      two     2.240893
Name: data1, dtype: float64

groupby 명령의 인수

  • groupby 명령에서 Key 인수로 입력할 수 있는 값은 다음과 같다.
    • 열 또는 열의 리스트
    • 행 인덱스
    • 사전/함수: Column의 값을 사전에 매핑(mapping)하거나 함수 처리하여 나온 결괏값을 키로 인식
In [11]:
np.random.seed(0)
people = pd.DataFrame(np.random.randn(5, 5), 
                      columns=['a', 'b', 'c', 'd', 'e'], 
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.ix[2:3, ['b', 'c']] = np.nan
people
Out[11]:
a b c d e
Joe 1.764052 0.400157 0.978738 2.240893 1.867558
Steve -0.977278 0.950088 -0.151357 -0.103219 0.410599
Wes 0.144044 NaN NaN 0.121675 0.443863
Jim 0.333674 1.494079 -0.205158 0.313068 -0.854096
Travis -2.552990 0.653619 0.864436 -0.742165 2.269755
In [12]:
print("="*80)
for n, g in people.groupby(people.index):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)
================================================================================
[key]: Jim
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: Joe
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
================================================================================
[key]: Steve
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Steve -0.977278  0.950088 -0.151357 -0.103219  0.410599
================================================================================
[key]: Travis
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Travis -2.55299  0.653619  0.864436 -0.742165  2.269755
================================================================================
[key]: Wes
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a   b   c         d         e
Wes  0.144044 NaN NaN  0.121675  0.443863
================================================================================
In [13]:
mapping = {'Joe': 'J', 'Jim': 'J', 'Steve': 'S', 'Wes': 'S', 'Travis': 'S'}
print("="*80)
for n, g in people.groupby(mapping):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)
================================================================================
[key]: J
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: S
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a         b         c         d         e
Steve  -0.977278  0.950088 -0.151357 -0.103219  0.410599
Wes     0.144044       NaN       NaN  0.121675  0.443863
Travis -2.552990  0.653619  0.864436 -0.742165  2.269755
================================================================================
In [14]:
cap1 = lambda x: x[0].upper()
print("="*80)
for n, g in people.groupby(cap1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)
================================================================================
[key]: J
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: S
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Steve -0.977278  0.950088 -0.151357 -0.103219  0.410599
================================================================================
[key]: T
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Travis -2.55299  0.653619  0.864436 -0.742165  2.269755
================================================================================
[key]: W
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a   b   c         d         e
Wes  0.144044 NaN NaN  0.121675  0.443863
================================================================================
In [15]:
print("="*80)
for n, g in people.groupby(people.columns, axis=1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)
================================================================================
[key]: a
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a
Joe     1.764052
Steve  -0.977278
Wes     0.144044
Jim     0.333674
Travis -2.552990
================================================================================
[key]: b
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               b
Joe     0.400157
Steve   0.950088
Wes          NaN
Jim     1.494079
Travis  0.653619
================================================================================
[key]: c
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               c
Joe     0.978738
Steve  -0.151357
Wes          NaN
Jim    -0.205158
Travis  0.864436
================================================================================
[key]: d
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               d
Joe     2.240893
Steve  -0.103219
Wes     0.121675
Jim     0.313068
Travis -0.742165
================================================================================
[key]: e
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               e
Joe     1.867558
Steve   0.410599
Wes     0.443863
Jim    -0.854096
Travis  2.269755
================================================================================
In [16]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
for n, g in people.groupby(mapping, axis=1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)
[key]: blue
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               c         d
Joe     0.978738  2.240893
Steve  -0.151357 -0.103219
Wes          NaN  0.121675
Jim    -0.205158  0.313068
Travis  0.864436 -0.742165
================================================================================
[key]: red
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a         b         e
Joe     1.764052  0.400157  1.867558
Steve  -0.977278  0.950088  0.410599
Wes     0.144044       NaN  0.443863
Jim     0.333674  1.494079 -0.854096
Travis -2.552990  0.653619  2.269755
================================================================================

특별한 group 별 연산

  • 통계

    • describe()
  • 그룹을 대표하는 하나의 값을 계산

    • agg(), aggregate()
  • 대표값으로 필드를 교체

    • transform()
  • 그룹 전체를 변형하는 계산

    • apply()

TIP 데이터 예제

In [17]:
%cd /home/dockeruser/data/pydata-book-master
/home/dockeruser/data/pydata-book-master
In [18]:
tips = pd.read_csv('ch08/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.tail()
Out[18]:
total_bill tip sex smoker day time size tip_pct
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
In [19]:
tips.describe()
Out[19]:
total_bill tip size tip_pct
count 244.000000 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672 0.160803
std 8.902412 1.383638 0.951100 0.061072
min 3.070000 1.000000 1.000000 0.035638
25% 13.347500 2.000000 2.000000 0.129127
50% 17.795000 2.900000 2.000000 0.154770
75% 24.127500 3.562500 3.000000 0.191475
max 50.810000 10.000000 6.000000 0.710345

그룹별 통계

In [20]:
tips.groupby(["sex", "smoker"])[["tip", "tip_pct"]].describe()
Out[20]:
tip tip_pct
sex smoker
Female No count 54.000000 54.000000
mean 2.773519 0.156921
std 1.128425 0.036421
min 1.000000 0.056797
25% 2.000000 0.139708
50% 2.680000 0.149691
75% 3.437500 0.181630
max 5.200000 0.252672
Yes count 33.000000 33.000000
mean 2.931515 0.182150
std 1.219916 0.071595
min 1.000000 0.056433
25% 2.000000 0.152439
50% 2.880000 0.173913
75% 3.500000 0.198216
max 6.500000 0.416667
Male No count 97.000000 97.000000
mean 3.113402 0.160669
std 1.489559 0.041849
min 1.250000 0.071804
25% 2.000000 0.131810
50% 2.740000 0.157604
75% 3.710000 0.186220
max 9.000000 0.291990
Yes count 60.000000 60.000000
mean 3.051167 0.152771
std 1.500120 0.090588
min 1.000000 0.035638
25% 2.000000 0.101845
50% 3.000000 0.141015
75% 3.820000 0.191697
max 10.000000 0.710345

그룹별 연산

In [21]:
gs = tips.groupby(["sex", "smoker"])
gs_pct = gs["tip_pct"]
In [22]:
gs_pct.mean()
Out[22]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
In [23]:
gs_pct.agg('mean')
Out[23]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
In [24]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

gs_pct.agg(['mean', 'std', peak_to_peak])
Out[24]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
In [25]:
gs.agg({'tip_pct' : 'mean', 'total_bill' : peak_to_peak})
Out[25]:
tip_pct total_bill
sex smoker
Female No 0.156921 28.58
Yes 0.182150 41.23
Male No 0.160669 40.82
Yes 0.152771 43.56

그룹의 값을 대표값으로 대체

In [26]:
gs.agg("mean")
Out[26]:
total_bill tip size tip_pct
sex smoker
Female No 18.105185 2.773519 2.592593 0.156921
Yes 17.977879 2.931515 2.242424 0.182150
Male No 19.791237 3.113402 2.711340 0.160669
Yes 22.284500 3.051167 2.500000 0.152771
In [27]:
tips2 = tips.copy()
tips2["tip2"] = gs.transform("mean")["tip_pct"]
tips2.tail(15)
Out[27]:
total_bill tip sex smoker day time size tip_pct tip2
229 22.12 2.88 Female Yes Sat Dinner 2 0.130199 0.182150
230 24.01 2.00 Male Yes Sat Dinner 4 0.083299 0.152771
231 15.69 3.00 Male Yes Sat Dinner 3 0.191205 0.152771
232 11.61 3.39 Male No Sat Dinner 2 0.291990 0.160669
233 10.77 1.47 Male No Sat Dinner 2 0.136490 0.160669
234 15.53 3.00 Male Yes Sat Dinner 2 0.193175 0.152771
235 10.07 1.25 Male No Sat Dinner 2 0.124131 0.160669
236 12.60 1.00 Male Yes Sat Dinner 2 0.079365 0.152771
237 32.83 1.17 Male Yes Sat Dinner 2 0.035638 0.152771
238 35.83 4.67 Female No Sat Dinner 3 0.130338 0.156921
239 29.03 5.92 Male No Sat Dinner 3 0.203927 0.160669
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584 0.182150
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222 0.152771
242 17.82 1.75 Male No Sat Dinner 2 0.098204 0.160669
243 18.78 3.00 Female No Thur Dinner 2 0.159744 0.156921

그룹 자체를 대체

  • apply 메소드는 수치값이 아닌 Group을 출력
  • 단순히 대표값을 계산하는 것 뿐 아니라
  • 순서 정렬, 일부 삭제 등 그룹 내의 레코드 자체를 변형하는 것도 가능
In [28]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
In [29]:
top(tips, n=6)
Out[29]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [30]:
tips.groupby('smoker').apply(top)
Out[30]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [31]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
Out[31]:
total_bill tip sex smoker day time size tip_pct
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2 0.142857
Sat 212 48.33 9.00 Male No Sat Dinner 4 0.186220
Sun 156 48.17 5.00 Male No Sun Dinner 6 0.103799
Thur 142 41.19 5.00 Male No Thur Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
Sat 170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
Sun 182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
Thur 197 43.11 5.00 Female Yes Thur Lunch 4 0.115982
In [32]:
f = lambda x: x.describe()
tips.groupby(['smoker']).apply(f)
Out[32]:
total_bill tip size tip_pct
smoker
No count 151.000000 151.000000 151.000000 151.000000
mean 19.188278 2.991854 2.668874 0.159328
std 8.255582 1.377190 1.017984 0.039910
min 7.250000 1.000000 1.000000 0.056797
25% 13.325000 2.000000 2.000000 0.136906
50% 17.590000 2.740000 2.000000 0.155625
75% 22.755000 3.505000 3.000000 0.185014
max 48.330000 9.000000 6.000000 0.291990
Yes count 93.000000 93.000000 93.000000 93.000000
mean 20.756344 3.008710 2.408602 0.163196
std 9.832154 1.401468 0.810751 0.085119
min 3.070000 1.000000 1.000000 0.035638
25% 13.420000 2.000000 2.000000 0.106771
50% 17.920000 3.000000 2.000000 0.153846
75% 26.860000 3.680000 3.000000 0.195059
max 50.810000 10.000000 5.000000 0.710345

pivot_table

  • pivot 명령과 groupby 명령의 중간적 성격
  • pivot을 수행하지만 데이터가 유니크하게 선택되지 않으면 aggfunc 인수로 정의된 함수를 수행하여 대표값 계산
  • 디폴트 aggfunc 은 평균 계산
In [33]:
tips.pivot_table(index=['sex', 'smoker'])
Out[33]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
In [34]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')
Out[34]:
tip_pct size
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
In [35]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)
Out[35]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
In [36]:
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)
Out[36]:
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
In [39]:
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)
Out[39]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23

연습 문제 1

타이타닉 승객 데이터를 이용하여 다음 분석을 실시하라.

  1. 남/여, 선실, 나이에 의한 생존율을 데이터프레임으로 계산한다. 행에는 남/여 및 나이에 대한 계층적 인덱스를 사용하고 열에는 선실 인덱스를 사용한다.
  2. 남/여 및 선실에 의한 생존율을 피봇 데이터 형태로 만든다.

질문/덧글

그룹 연산 질문드립니다. lucc*** 2016년 9월 19일 9:20 오후

그룹 연산 중에 최빈값을 구하는 것은 없어서, 인터넷에 흘러다니는 정보들을 조합해서 이런 예시를 만들었습니다.

import pandas as pd
data = {
'test' : [1,1,1,1,2,2,2,2],
'result' : [10,10,20,20,30,30,40,40]}
df = pd.DataFrame(data)
most = df.result.groupby(df.test).apply(pd.Series.mode)
print(most)

결과
test
1 0 10
1 20
2 0 30
1 40

결과가 위와 같이 나오는데, 둘 중에 하나를 고를 수 있는 방법이 있나요?

예시)
1 1 20
2 1 40

답변: 그룹 연산 질문드립니다. 관리자 2016년 9월 20일 8:04 오전

다음과 같이 해보세요.

def mode(x):
return sp.stats.mode(x)[0][0]

most = df.result.groupby(df.test).agg(mode)