Author: admin  Posted: 2016-07-11 22:30:14
Category: Python  Tags: Python

Pandas Data Transformation

applymap Transformation

  • Element-wise transformation
In [28]:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df
Out[28]:
               b         d         e
Utah    1.764052  0.400157  0.978738
Ohio    2.240893  1.867558 -0.977278
Texas   0.950088 -0.151357 -0.103219
Oregon  0.410599  0.144044  1.454274
In [29]:
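# Format every element as a string with two decimal places.
# (On pandas 2.1+, DataFrame.map is the preferred name for applymap.)
fmt = lambda x: '%.2f' % x
df2 = df.applymap(fmt)
df2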
Out[29]:
           b      d      e
Utah    1.76   0.40   0.98
Ohio    2.24   1.87  -0.98
Texas   0.95  -0.15  -0.10
Oregon  0.41   0.14   1.45
In [30]:
df.values.dtype, df2.values.dtype
Out[30]:
(dtype('float64'), dtype('O'))
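
The dtype check above shows that string formatting turns every element into a Python string, so the result no longer supports numeric operations. As a minimal sketch, assuming the goal is only to limit the number of decimals, `DataFrame.round` keeps the values numeric. The names `formatted` and `rounded` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Element-wise string formatting: every cell becomes a str.
# Series.map is applied per column so this runs on old and new pandas alike.
formatted = df.apply(lambda col: col.map(lambda x: '%.2f' % x))

# Rounding limits the decimals but keeps the float64 dtype.
rounded = df.round(2)

print(type(formatted.loc['Utah', 'b']))
print(rounded.dtypes)
```

Formatting is usually best left to the display step, while `round` is the safer choice when the values are still needed for arithmetic.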

apply Transformation

  • Row-wise/column-wise transformation
In [31]:
df = pd.DataFrame({
        'Qu1': [1, 3, 4, 3, 4],
        'Qu2': [2, 3, 1, 2, 3],
        'Qu3': [1, 5, 2, 4, 4]
    })
df
Out[31]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [32]:
f = lambda x: 2 * x
df.apply(f)
Out[32]:
   Qu1  Qu2  Qu3
0    2    4    2
1    6    6   10
2    8    2    4
3    6    4    8
4    8    6    8
In [33]:
f = lambda x: x.max() - x.min()
df.apply(f)
Out[33]:
Qu1    3
Qu2    2
Qu3    4
dtype: int64
In [34]:
df.apply(f, axis=1)
Out[34]:
0    1
1    2
2    3
3    2
4    1
dtype: int64
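
In the two cells above the function returns a scalar, so `apply` produces a Series. The sketch below covers the other common case, where the function returns a Series and the result is therefore a DataFrame. The helper name `min_max` is hypothetical and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4],
})

def min_max(x):
    # x is one column (axis=0, the default) or one row (axis=1).
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = df.apply(min_max)        # rows: min/max, columns: Qu1..Qu3
per_row = df.apply(min_max, axis=1)   # rows: 0..4, columns: min/max

print(per_column)
print(per_row)
```
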
In [35]:
# The top-level pd.value_counts is deprecated; use the Series method instead.
df.apply(pd.Series.value_counts)
Out[35]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  NaN  2.0  1.0
3  2.0  2.0  NaN
4  2.0  NaN  2.0
5  NaN  NaN  1.0
In [36]:
# Values that never occur in a column show up as NaN; replace them with 0.
df.apply(pd.Series.value_counts).fillna(0)
Out[36]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
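
The counts above are shown as floats because the intermediate result contains NaN, which forces a floating-point dtype. A small sketch, assuming integer counts are wanted, casts back after filling the gaps. The name `counts` and the explicit `sort_index()` call are additions made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4],
})

counts = (
    df.apply(lambda s: s.value_counts())  # one count Series per column
      .fillna(0)                          # answers that never occur -> 0
      .astype(int)                        # no NaN left, so integers are safe
      .sort_index()                       # order rows by answer value
)
print(counts)
```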

Operations Between a DataFrame and a Series

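When a DataFrame is combined with a Series whose length matches one row, the operation is broadcast: the Series is aligned on the column labels and applied to every row. The arithmetic operators do not broadcast a Series down the columns in the same way, so a column-wise operation has to go through a transpose.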

In [37]:
# .ix was removed in pandas 1.0; use .iloc for positional access.
df / df.iloc[0]
Out[37]:
   Qu1  Qu2  Qu3
0  1.0  1.0  1.0
1  3.0  1.5  5.0
2  4.0  0.5  2.0
3  3.0  1.0  4.0
4  4.0  1.5  4.0
In [38]:
# Divide every column by the first column: transpose, divide, transpose back.
(df.T / df.iloc[:, 0]).T
Out[38]:
   Qu1       Qu2       Qu3
0  1.0  2.000000  1.000000
1  1.0  1.000000  1.666667
2  1.0  0.250000  0.500000
3  1.0  0.666667  1.333333
4  1.0  0.750000  1.000000
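
As an alternative to the double transpose, the arithmetic methods (`add`, `sub`, `mul`, `div`) accept an `axis` argument that selects which labels the Series is aligned on. The sketch below is not from the original post and assumes the same `df` as above. It checks that `div(..., axis=0)` gives the same result as the transposed division.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4],
})

first_col = df.iloc[:, 0]

# Approach from the text: transpose, divide row-wise, transpose back.
via_transpose = (df.T / first_col).T

# Method form: axis=0 aligns the Series on the row index instead.
via_div = df.div(first_col, axis=0)

print(via_div)
print((via_div - via_transpose).abs().max().max())
```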

cut / qcut

  • Convert real-valued (continuous) data into categorical data
  • cut: bin edges are specified by the user
  • qcut: bin edges are derived from quantiles
In [39]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
Out[39]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
In [40]:
cats.categories
Out[40]:
Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')
In [41]:
cats.codes
Out[41]:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [42]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[42]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
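
The cell above shifts the bin edges so that left-closed intervals reproduce the earlier grouping for integer ages. The sketch below keeps the edges fixed instead and only toggles `right`, to show where values that sit exactly on an edge end up. The list `boundary_ages` is made up for illustration.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
boundary_ages = [25, 35, 60]  # values lying exactly on a bin edge

# Default right=True: intervals look like (18, 25], so 25 falls in the first bin.
right_closed = pd.cut(boundary_ages, bins)

# right=False: intervals look like [18, 25), so 25 moves to the next bin.
left_closed = pd.cut(boundary_ages, bins, right=False)

print(list(right_closed.codes))
print(list(left_closed.codes))
```
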
In [43]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
Out[43]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
In [44]:
df = pd.DataFrame(ages, columns=["ages"])
df.tail()
Out[44]:
    ages
7     31
8     61
9     45
10    41
11    32
In [45]:
df["age_cat"] = pd.cut(df.ages, bins, labels=group_names)
df
Out[45]:
    ages     age_cat
0     20       Youth
1     22       Youth
2     25       Youth
3     27  YoungAdult
4     21       Youth
5     23       Youth
6     37  MiddleAged
7     31  YoungAdult
8     61      Senior
9     45  MiddleAged
10    41  MiddleAged
11    32  YoungAdult
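
All ages in the example fall inside the bins. As a hedged sketch of what happens otherwise, values outside the outermost edges (and, by default, a value equal to the lowest edge) come back as NaN, and `include_lowest=True` closes the first interval on the left. The Series `edge_ages` is a made-up input.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
edge_ages = pd.Series([17, 18, 19, 100, 101])

# Default: the first interval is (18, 25], so 18 itself is not covered.
default_cut = pd.cut(edge_ages, bins, labels=group_names)

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(edge_ages, bins, labels=group_names, include_lowest=True)

print(default_cut.isna().tolist())
print(with_lowest.isna().tolist())
```
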
In [46]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
Out[46]:
[(0.584, 2.759], (-0.058, 0.584], (-0.058, 0.584], (-0.058, 0.584], (0.584, 2.759], ..., [-3.0461, -0.705], (-0.058, 0.584], (-0.058, 0.584], [-3.0461, -0.705], (0.584, 2.759]]
Length: 1000
Categories (4, object): [[-3.0461, -0.705] < (-0.705, -0.058] < (-0.058, 0.584] < (0.584, 2.759]]
In [47]:
# The top-level pd.value_counts is deprecated; wrap the result in a Series instead.
pd.Series(cats).value_counts()
Out[47]:
(0.584, 2.759]       250
(-0.058, 0.584]      250
(-0.705, -0.058]     250
[-3.0461, -0.705]    250
dtype: int64
In [48]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[48]:
[(-0.058, 1.212], (-0.058, 1.212], (-0.058, 1.212], (-0.058, 1.212], (1.212, 2.759], ..., [-3.0461, -1.304], (-0.058, 1.212], (-0.058, 1.212], (-1.304, -0.058], (1.212, 2.759]]
Length: 1000
Categories (4, object): [[-3.0461, -1.304] < (-1.304, -0.058] < (-0.058, 1.212] < (1.212, 2.759]]
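
The interval labels produced by `qcut` can be replaced with names, and the computed edges can be returned alongside the categories. The sketch below uses the `labels` and `retbins` parameters for this. The labels `Q1` to `Q4` and the fresh seed are assumptions, so the numbers will not match the cells above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.randn(1000)

# labels names the quantile bins; retbins=True also returns the bin edges.
quartile, edges = pd.qcut(data, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'], retbins=True)

# Quantile bins hold (nearly) equal numbers of observations.
counts = pd.Series(quartile).value_counts().sort_index()

print(edges)
print(counts)
```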
