Read CSV & Excel file by Pandas

Read csv

df=pd.read_csv('rowdata.csv')

Example:

import pandas as pd
df=pd.read_csv('rowdata.csv')
print(df.head(5))


#如果需要指定路徑
#path = 'D:/Shopee/Data/ReportData/'
#df=pd.read_csv(path+'rowdata.csv')

Open file by specific path

path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')

Example:

import pandas as pd
import os

path = os.path.dirname(os.path.realpath(__file__))
#also you can set your own path
#path = ('C:\\Shopee\\Data\\Workspace\\Hackthon')

df=pd.read_csv(path+'\\rowdata.csv')
print(df.head(5))

Practice:

Read file 'seller_type.csv' and print the table out

Table column filter

paymentDF = df[['payment_method', 'orders', 'gmv']]

Example:

import pandas as pd
import os

path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
paymentDF = df[['payment_method', 'orders', 'gmv']]
print(paymentDF.head(5))

Practice:

Print column >> main_category, is_official_store, order, gmv

Join Table

DataFrame.join(other, on=None, how='left')

other :   DataFrame, Series with name field set, or list of DataFrame
  Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and   that will be used as the column name in the resulting joined DataFrame

on :   column name, tuple/list of column names, or array-like

how :   {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’

How to handle the operation of the two objects: left: use calling frame’s index (or column if on is specified) right: use other frame’s index outer: form union of calling frame’s index (or column if on is specified) with other frame’s index inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the   order of the calling’s one

df = df.join(clusterDF, how='left', on='main_category')

Example:

#read another table for cluster information
import pandas as pd
import os

path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')
df = pd.merge(df, clusterDF , how='left', on='main_category')
print(list(df))

Practice:

Read file seller_type.csv and join it with rowdata

Make Pivot

DataFrame.groupby(by=None, as_index=True, sort=True)

by : mapping, function, str, or iterable Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is determine the groups. A str or list of strs may be passed to group by the columns in self

as_index : boolean, default True For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

Example: ---Order/GMV summary of each cluster---

import pandas as pd
import os

path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')
df = pd.merge(df, clusterDF , how='left', on='main_category')
df = df[['payment_method', 'orders', 'gmv']]
df = df.groupby([df['payment_method']]).sum()
print(df)

Practice:

Print table with seller_type and summary of Order, GMV

Export & Formatting


pandas.ExcelWriter(file,engine)

ExcelWriter API: http://xlsxwriter.readthedocs.io/format.html

import pandas as pd
import os

path = os.path.dirname(os.path.realpath(__file__))
df=pd.read_csv(path+'\\rowdata.csv')
clusterDF = pd.read_csv(path+'\\cluster.csv')

df = pd.merge(df, clusterDF , how='left', on='main_category')
df = df[['cluster','payment_method', 'orders', 'gmv']]

df = df.groupby([df['cluster'],df['payment_method']], as_index=False).sum()

writer = pd.ExcelWriter('Cluster_GMV.xlsx', engine = 'xlsxwriter')
# 讓 index 可用 format
try:
    pd.io.formats.excel.header_style = None
except:
    import pandas.io.formats.excel
    pandas.io.formats.excel.header_style = None


df.to_excel(writer, sheet_name='Report', index = False)
workbook = writer.book
worksheet = writer.sheets['Report']

format1 = workbook.add_format()
format1.set_bold()
format1.set_font_name('Arial')
format1.set_font_color('red')

format2 = workbook.add_format()
format2.set_num_format('#,##0')

worksheet.set_column('A:D',12,format2)
worksheet.set_row(0,15,format1)

writer.save()

Practice:

Please adjust format as: Row 0: font: italic, background color: yellow Order, GMV with 1 digit after the decimal point (123.1)

results matching ""

    No results matching ""