python pandas extract year from datetime: df[year] = df[date].year is not working
python pandas extract year from datetime: df[year] = df[date].year is not working
If youre running a recent-ish version of pandas then you can use the datetime attribute dt
to access the datetime components:
In [6]:
df[date] = pd.to_datetime(df[date])
df[year], df[month] = df[date].dt.year, df[date].dt.month
df
Out[6]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
EDIT
It looks like youre running an older version of pandas in which case the following would work:
In [18]:
df[date] = pd.to_datetime(df[date])
df[year], df[month] = df[date].apply(lambda x: x.year), df[date].apply(lambda x: x.month)
df
Out[18]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
Regarding why it didnt parse this into a datetime in read_csv
you need to pass the ordinal position of your column ([0]
) because when True
it tries to parse columns [1,2,3]
see the docs
In [20]:
t=date Count
6/30/2010 525
7/30/2010 136
8/31/2010 125
9/30/2010 84
10/29/2010 4469
df = pd.read_csv(io.StringIO(t), sep=s+, parse_dates=[0])
df.info()
<class pandas.core.frame.DataFrame>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
date 5 non-null datetime64[ns]
Count 5 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 120.0 bytes
So if you pass param parse_dates=[0]
to read_csv
there shouldnt be any need to call to_datetime
on the date column after loading.
This works:
df[date].dt.year
Now:
df[year] = df[date].dt.year
df[month] = df[date].dt.month
gives this data frame:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
python pandas extract year from datetime: df[year] = df[date].year is not working
When to use dt
accessor
A common source of confusion revolves around when to use .year
and when to use .dt.year
.
The former is an attribute for pd.DatetimeIndex
objects; the latter for pd.Series
objects. Consider this dataframe:
df = pd.DataFrame({Dates: pd.to_datetime([2018-01-01, 2018-10-20, 2018-12-25])},
index=pd.to_datetime([2000-01-01, 2000-01-02, 2000-01-03]))
The definition of the series and index look similar, but the pd.DataFrame
constructor converts them to different types:
type(df.index) # pandas.tseries.index.DatetimeIndex
type(df[Dates]) # pandas.core.series.Series
The DatetimeIndex
object has a direct year
attribute, while the Series
object must use the dt
accessor. Similarly for month
:
df.index.month # array([1, 1, 1])
df[Dates].dt.month.values # array([ 1, 10, 12], dtype=int64)
A subtle but important difference worth noting is that df.index.month
gives a NumPy array, while df[Dates].dt.month
gives a Pandas series. Above, we use pd.Series.values
to extract the NumPy array representation.