Read CSV & Excel file by Pandas
Read csv
df=pd.read_csv('rowdata.csv')
Example:
import pandas as pd
df=pd.read_csv('rowdata.csv')
print(df.head(5))
#如果需要指定路徑
#path = 'D:/Shopee/Data/ReportData/'
#df=pd.read_csv(path+'rowdata.csv')
Open file by specific path
path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
Example:
import pandas as pd
import os
path = os.path.dirname(os.path.realpath(__file__))
#also you can set your own path
#path = ('C:\\Shopee\\Data\\Workspace\\Hackthon')
df=pd.read_csv(path+'\\rowdata.csv')
print(df.head(5))
Practice:
Read file 'seller_type.csv' and print the table out
Table column filter
paymentDF = df[['payment_method', 'orders', 'gmv']]
Example:
import pandas as pd
import os
path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
paymentDF = df[['payment_method', 'orders', 'gmv']]
print(paymentDF.head(5))
Practice:
Print column >> main_category, is_official_store, order, gmv
Join Table
DataFrame.join(other, on=None, how='left')
other :
DataFrame, Series with name field set, or list of DataFrame
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects: left: use calling frame’s index (or column if on is specified) right: use other frame’s index outer: form union of calling frame’s index (or column if on is specified) with other frame’s index inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
df = df.join(clusterDF, how='left', on='main_category')
Example:
#read another table for cluster information
import pandas as pd
import os
path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')
df = pd.merge(df, clusterDF , how='left', on='main_category')
print(list(df))
Practice:
Read file seller_type.csv and join it with rowdata
Make Pivot
DataFrame.groupby(by=None, as_index=True, sort=True)
by : mapping, function, str, or iterable Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is determine the groups. A str or list of strs may be passed to group by the columns in self
as_index : boolean, default True For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
Example: ---Order/GMV summary of each cluster---
import pandas as pd
import os
path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')
df = pd.merge(df, clusterDF , how='left', on='main_category')
df = df[['payment_method', 'orders', 'gmv']]
df = df.groupby([df['payment_method']]).sum()
print(df)
Practice:
Print table with seller_type and summary of Order, GMV
Export & Formatting
pandas.ExcelWriter(file,engine)
ExcelWriter API: http://xlsxwriter.readthedocs.io/format.html
import pandas as pd
import os
path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')
df = pd.merge(df, clusterDF , how='left', on='main_category')
df = df[['cluster','payment_method', 'orders', 'gmv']]
df = df.groupby([df['cluster'],df['payment_method']], as_index=False).sum()
writer = pd.ExcelWriter('Cluster_GMV.xlsx', engine = 'xlsxwriter')
# 讓 index 可用 format
try:
pd.io.formats.excel.header_style = None
except:
import pandas.io.formats.excel
pandas.io.formats.excel.header_style = None
df.to_excel(writer, sheet_name='Report', index = False)
workbook = writer.book
worksheet = writer.sheets['Report']
format1 = workbook.add_format()
format1.set_bold()
format1.set_font_name('Arial')
format1.set_font_color('red')
format2 = workbook.add_format()
format2.set_num_format('#,##0')
worksheet.set_column('A:D',12,format2)
worksheet.set_row(0,15,format1)
writer.save()
Practice:
Please adjust format as: Row 0: font: italic, background color: yellow Order, GMV with 1 digit after the decimal point (123.1)