Getting Raw Data from Pivot Tables from Excel in Python: A Step-by-Step Guide
Introduction
Pivot tables are a cornerstone of data analysis in Excel. They allow us to summarize and explore large datasets, revealing trends and patterns within. However, when working with these insightful tables programmatically in Python, a challenge arises: directly reading their structure and filters from Excel isn’t readily achievable.
Extracting raw data from Excel Pivot Tables in Python can be a daunting task, but with the right tools and understanding, it becomes a manageable process. This guide provides a step-by-step approach to extracting raw data from Pivot Tables, empowering you to leverage the powerful data manipulation capabilities of Python.
- Install the Necessary Libraries:
Begin by installing the required Python libraries. We primarily need ‘openpyxl’ for interacting with Excel files
2. Import the Libraries:
Begin by installing the required Python libraries. We primarily need ‘openpyxl’ for interacting with Excel files.
3. Load the Excel File Which Has Pivot Table Data:
Use ‘openpyxl’ to load the Excel file containing the Pivot Table:
workbook = openpyxl.load_workbook('your_excel_file.xlsx')
worksheet = workbook['Sheet_Name'] # Replace 'Sheet1' with your sheet name
4. Identify the Pivot Table:
Locate the Pivot Table within the Excel file. This usually involves navigating through the sheets and looking for the Pivot Table’s unique identifier.
for p in worksheet._pivots:
print(p.name)
pivot_name=p.name #This assigns the name of the desired Pivot Table to the variable
5. Extract Pivot Table Object from the worksheet:
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]
6. Extract Cache Field Information From a Pivot Table:
fields_map = {}
for field in pivot_table.cache.cacheFields:
if field.sharedItems.count > 0:
l = []
for f in field.sharedItems._fields:
try:
l += [f.v]
except AttributeError:
l += [""]
fields_map[field.name] = l
7. Extract Rows from Pivot Table Cache Into Dict:
column_names = [field.name for field in pivot_table.cache.cacheFields]
# Printing the column names
for field in pivot_table.cache.cacheFields:
if field.sharedItems.count > 0:
print(field.name)
rows = []
for record in pivot_table.cache.records.r:
# If some field in the record in missing, we replace it by NaN
record_values = [
field.v if not isinstance(field, Missing) else np.nan for field in record._fields
]
row_dict = {k: v for k, v in zip(column_names, record_values)}
# Shared fields are mapped as an Index, so we replace the field index by its value
for key in fields_map:
row_dict[key] = fields_map[key][row_dict[key]]
rows.append(row_dict)
8. Convert Dict to Pandas DataFrame:
df = pd.DataFrame.from_dict(rows)
9. Saving DataFrame Data Into CSV File:
df.to_csv("output.csv", index=False)
Please find below full code:
Summary:
This article describes a step-by-step technique for properly transforming data from a Pivot Table in an Excel file into a structured DataFrame using Pandas, which is then written to a CSV file for easy examination or additional processing.