Automating Data Analysis: The Magic of LIDA's Intelligent Visualization!
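01 Overview
In this data-driven era, we generate and process massive amounts of data every day. Extracting valuable information from that data and presenting it in an intuitive, easy-to-understand way has become an important problem. This article introduces a powerful tool, Language-Integrated Data Analysis (LIDA), which automates the creation of visualizations and puts data insights within easy reach.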
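02 Core Features of LIDA
Grammar-agnostic visualization
Whether you develop in Python, R, or C++, LIDA can help you produce visual output without being locked into a particular programming language. This flexibility makes it approachable for users from different programming backgrounds.
Multi-stage generation pipeline
LIDA takes users through a seamless workflow, from data summarization to visualization creation, which makes complex datasets easier to handle.
Hybrid user interface
LIDA offers both direct manipulation and a multilingual natural-language interface, so it serves a broad audience ranging from data scientists to business analysts. Users can interact through natural-language commands, which makes data visualization intuitive and simple.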
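03 LIDA Architecture
LIDA's architecture consists of the following key components:
- Summarizer: converts a dataset into a concise natural-language description, including all column names, distributions, and similar information.
- Goal Explorer: identifies potential visualization or analysis goals based on the dataset and generates a user-specified number of goals.
- Viz Generator: automatically generates visualization code from the dataset context and a specified goal.
- Infographer: creates, evaluates, refines, and executes visualization code to produce fully stylized specifications.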
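To make the hand-off between the first three components concrete, here is a toy sketch in plain Python. It is not LIDA's implementation: LIDA uses an LLM at each stage, while the functions below are rule-based stand-ins, and every name in it (ToyGoal, summarize, explore_goals, generate_code) is hypothetical. The point is only that each stage consumes the previous stage's output.

```python
from dataclasses import dataclass


@dataclass
class ToyGoal:
    """Hypothetical stand-in for a goal; not LIDA's Goal class."""
    question: str
    visualization: str
    column: str
    chart: str


def summarize(rows):
    """Summarizer stand-in: record each column's name and a coarse type."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        numeric = all(isinstance(v, (int, float)) for v in values)
        columns[name] = "number" if numeric else "category"
    return {"n_rows": len(rows), "columns": columns}


def explore_goals(summary, n):
    """Goal Explorer stand-in: one rule-based goal per column, at most n."""
    goals = []
    for name, kind in summary["columns"].items():
        chart = "histogram" if kind == "number" else "bar chart"
        goals.append(ToyGoal(
            question=f"What is the distribution of {name}?",
            visualization=f"{chart} of {name}",
            column=name,
            chart=chart,
        ))
    return goals[:n]


def generate_code(goal):
    """Viz Generator stand-in: fill a fixed matplotlib template as text."""
    if goal.chart == "histogram":
        draw = f"plt.hist(data[{goal.column!r}])"
    else:
        draw = f"data[{goal.column!r}].value_counts().plot.bar()"
    return f"import matplotlib.pyplot as plt\n{draw}\nplt.title({goal.question!r})\n"


# Two made-up rows that reuse column names from the cars dataset.
rows = [
    {"Type": "SUV", "Horsepower_HP_": 225},
    {"Type": "Sedan", "Horsepower_HP_": 140},
]
toy_summary = summarize(rows)
toy_goals = explore_goals(toy_summary, n=2)
toy_code = generate_code(toy_goals[1])
print(toy_code)
```

The generated code is returned as text and never executed here; in LIDA, executing and refining the code is a separate step.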
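04 Main Features of LIDA
- Data summarization: LIDA compresses large datasets into dense natural-language summaries that serve as the basis for later operations.
- Automated data exploration: LIDA provides a fully automated mode that generates meaningful visualization goals for unfamiliar datasets.
- Infographic generation: uses image generation models to turn data into stylized, engaging infographics for personalized storytelling.
- VizOps (visualization operations): performs detailed operations on generated visualizations to improve accessibility, data literacy, and debugging.
- Visualization explanation: provides in-depth descriptions of visualization code, supporting accessibility, education, and understanding.
- Self-evaluation: uses large language models (LLMs) to generate multi-dimensional evaluation scores for visualizations based on best practices.
- Visualization repair: automatically improves or repairs visualizations using self-evaluation or user-provided feedback.
- Visualization recommendation: recommends additional visualizations based on context or existing visualizations, for comparison or additional perspectives.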
05 LIDA in Practice
Installation
Install with pip:
pip install lida
# Set the corresponding API key
export OPENAI_API_KEY=<API_KEY>
You can also manage the API key with a .env file:
from dotenv import load_dotenv
import os
load_dotenv()  # read the .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
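Note that os.getenv returns None when the variable is missing, and that only shows up later as a confusing authentication error. A minimal sketch of a fail-fast check follows; the helper name require_api_key is hypothetical and is not part of LIDA or python-dotenv.

```python
import os


def require_api_key(name="OPENAI_API_KEY"):
    """Return the value of environment variable `name`, or raise if it is empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value
```

Calling require_api_key() right after load_dotenv() surfaces a missing key at startup rather than on the first LLM request.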
LIDA Features in Detail
- Initialization
from lida import Manager, TextGenerationConfig, llm
from lida.utils import plot_raster
import warnings
from dotenv import load_dotenv
import os
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
warnings.filterwarnings("ignore")
# Initialize LIDA
lida = Manager(text_gen=llm("openai", api_key=OPENAI_API_KEY))  # !! input your openai or other LLM api key
textgen_config = TextGenerationConfig(n=1, temperature=0.5, model="gpt-3.5-turbo-0301", use_cache=True)
lida.Manager is the controller of the LIDA library and is responsible for setting which LLM is used. lida.TextGenerationConfig holds the detailed generation settings: the number of generations n, the temperature (how much the output varies), the model, and use_cache.
- Loading the data
import pandas as pd
# The data is the cars dataset recommended in the official examples
data = pd.read_csv("https://raw.githubusercontent.com/uwdata/draco/master/data/cars.csv")
data.head()
- Data summarization
Generate a brief summary from the dataset. For each column it contains std, min, max, samples, unique, semantic_type, and description.
# Data summarization: generate a short summary from the dataset
summary = lida.summarize("https://raw.githubusercontent.com/uwdata/draco/master/data/cars.csv", summary_method="default", textgen_config=textgen_config)
print(summary)
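To get a feel for the statistical fields in that summary, the sketch below computes unique, samples, min, max, and std for a single column using only the standard library. It is an illustration and not LIDA's summarizer: the function name describe_column is hypothetical, "samples" here just means the first few distinct values, and semantic_type and description are left out.

```python
import statistics


def describe_column(values, n_samples=3):
    """Compute a few per-column summary fields for one list of values."""
    info = {
        "unique": len(set(values)),
        # First few distinct values, in order of first appearance.
        "samples": list(dict.fromkeys(values))[:n_samples],
    }
    # Numeric statistics only make sense for numeric columns.
    if all(isinstance(v, (int, float)) for v in values):
        info["min"] = min(values)
        info["max"] = max(values)
        info["std"] = statistics.stdev(values) if len(values) > 1 else 0.0
    return info


# Made-up values standing in for two columns of the cars dataset.
print(describe_column([225, 140, 140, 300]))
print(describe_column(["SUV", "Sedan", "SUV"]))
```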
- Goal generation
Goals are generated from the data summary. Each goal includes an index, a question, a visualization, and a rationale.
# Goal generation: generate visualization goals from the data summary; n=3 generates 3 goals
goals = lida.goals(summary, n=3, textgen_config=textgen_config)
# Inspect the generated goals
for goal in goals:
    print("=" * 20)
    print(f"Question: {goal.index}")
    # print the question, visualization and rationale with each goal
    print(goal.question)
    print(goal.visualization)
    print(goal.rationale)
# Output
====================
Question: 0
What is the distribution of Retail_Price?
histogram of Retail_Price
This tells about the spread of prices of cars in the dataset.
====================
Question: 1
What is the distribution of Engine_Size__l_ among different car types?
box plot of Engine_Size__l_ for each car type
This will help in identifying if there is any difference in engine size among different car types.
====================
Question: 2
What is the relationship between Horsepower_HP_ and City_Miles_Per_Gallon?
scatter plot of Horsepower_HP_ vs City_Miles_Per_Gallon
This will help in identifying if there is any correlation between horsepower and fuel efficiency of cars.
- Generating visualizations
Charts are generated automatically from each goal's suggested visualization.
library = "matplotlib"  # options: "altair", "seaborn", "plotly", "matplotlib"
textgen_config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)
for i in range(len(goals)):
    # print the question, visualization and rationale with each goal
    print("Question: ", goals[i].question)
    print("Visualization: ", goals[i].visualization)
    print("Rationale: ", goals[i].rationale)
    charts = lida.visualize(summary=summary, goal=goals[i], textgen_config=textgen_config, library=library)
    plot_raster(charts[0].raster)
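plot_raster only displays the image. If you want to keep it, for example to drop it into a report, a small helper can write it to disk. The sketch below assumes that the raster attribute is a base64-encoded image string (PNG in the examples here); the helper name save_raster is hypothetical.

```python
import base64
from pathlib import Path


def save_raster(raster_b64, path):
    """Decode a base64-encoded image string and write the bytes to `path`.

    Returns the number of bytes written.
    """
    image_bytes = base64.b64decode(raster_b64)
    Path(path).write_bytes(image_bytes)
    return len(image_bytes)
```

Under that assumption, save_raster(charts[0].raster, "chart_0.png") inside the loop would save each generated chart next to the notebook.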
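- Chart editing
Edit charts with natural-language instructions, for example to change the color, the font size, or even the typeface. (This feels very handy when writing papers or research reports.)
# Change the chart color and font size
instructions = ["change the color to red", "scale the word size to 50%"]
edited_charts = lida.edit(code=charts[0].code, summary=summary, instructions=instructions, library=library)
plot_raster(edited_charts[0].raster)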
- Visualization explanation
code = charts[0].code
explanations = lida.explain(code=code, library=library, textgen_config=textgen_config)
for row in explanations[0]:
    print(row["section"], " ** ", row["explanation"])
# Output
accessibility ** The code creates a scatter plot using the matplotlib.pyplot library to visualize the relationship between two variables - Horsepower_HP_ and City_Miles_Per_Gallon. The plot is colored blue with an alpha value of 0.5 to show the density of the data points. The x-axis is labeled 'Horsepower_HP_' and the y-axis is labeled 'City_Miles_Per_Gallon'. The title of the plot is 'What is the relationship between Horsepower_HP_ and City_Miles_Per_Gallon?'.
transformation ** There is no data transformation happening in this code. The plot is made using the original data as it is.
visualization ** The code first imports the required libraries - matplotlib.pyplot and pandas. The function plot() takes a pandas DataFrame as input and creates a scatter plot using the plt.scatter() method. The x-axis of the plot is the 'Horsepower_HP_' column of the input DataFrame and the y-axis is the 'City_Miles_Per_Gallon' column of the input DataFrame. The alpha parameter controls the transparency of the data points and the color parameter sets the color of the data points. The plt.xlabel() and plt.ylabel() methods add labels to the x-axis and y-axis respectively. The plt.title() method adds a title to the plot. The wrap parameter in plt.title() is set to True to wrap the title text if it exceeds the width of the plot. Finally, the function returns the plot object.
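- Visualization evaluation and repair
Evaluate whether the chart has problems. The scoring dimensions are bugs, transformation (data transformation), compliance (with the goal), type (chart type), encoding, and aesthetics. The biggest surprise is that it can even score aesthetics.
# i still holds the last index from the visualization loop, so goals[i] matches the chart in `code`
evaluations = lida.evaluate(code=code, goal=goals[i], library=library)[0]
for evaluation in evaluations:
    print(evaluation["dimension"], "Score", evaluation["score"], " / 10")
    print("\t", evaluation["rationale"][:200])
    print("\t**********************************")
# Output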
bugs Score 10 / 10
No bugs, syntax errors, or typos found.
**********************************
transformation Score 10 / 10
No data transformation needed for a scatter plot.
**********************************
compliance Score 8 / 10
The code meets the specified visualization goal, but the title could be improved by removing the question mark and rephrasing it as a statement.
**********************************
type Score 9 / 10
A scatter plot is an appropriate visualization type for exploring the relationship between two continuous variables.
**********************************
encoding Score 9 / 10
The data is encoded appropriately with Horsepower_HP_ on the x-axis and City_Miles_Per_Gallon on the y-axis.
**********************************
aesthetics Score 9 / 10
The aesthetics of the visualization are appropriate with a blue color and an alpha of 0.5 to show overlapping points.
**********************************
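The scores become more useful once they are turned into concrete feedback. The sketch below assumes the evaluation result is a list of dicts with the keys dimension, score, and rationale, as the printing loop above suggests. It collects the rationale of every dimension that scores below a threshold; the helper name collect_feedback and the default threshold are hypothetical choices, and the sample data is abbreviated from the output above.

```python
def collect_feedback(evaluations, threshold=9):
    """Return one feedback string per dimension scoring below `threshold`."""
    return [
        f"{item['dimension']}: {item['rationale']}"
        for item in evaluations
        if item["score"] < threshold
    ]


# Abbreviated sample in the same shape as the evaluation output.
sample = [
    {"dimension": "bugs", "score": 10,
     "rationale": "No bugs, syntax errors, or typos found."},
    {"dimension": "compliance", "score": 8,
     "rationale": "The title could be rephrased as a statement."},
]
feedback = collect_feedback(sample)
print(feedback)
```

Feedback strings like these are the kind of input LIDA's repair feature is described as accepting. The exact repair call is not shown in this article, so check the library documentation before relying on a particular signature.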
- Visualization recommendation
Generate the requested number of recommended charts, chosen by the LLM based on the context in the summary.
textgen_config = TextGenerationConfig(n=1, temperature=0, use_cache=True)
recommended_charts = lida.recommend(code=code, summary=summary, n=3, textgen_config=textgen_config)
print(f"Recommended {len(recommended_charts)} charts")
for chart in recommended_charts:
    plot_raster(chart.raster)
- Custom chart generation
# First import the Goal class from lida.datamodel
from lida.datamodel import Goal
# A Goal has 4 fields: index, question, visualization, and rationale
custom_goal = Goal(
    index=0,
    question="What is the distribution of the Type?",
    visualization="Bar Chart",
    rationale="The type of the car is an important feature of the dataset."
)
# Generate the chart
custom_chart = lida.visualize(summary=summary, goal=custom_goal, textgen_config=textgen_config, library=library)
plot_raster(custom_chart[0].raster)
# Edit the custom chart
custom_instructions = ["change the color to blue tone on tone color"]  # change the color of the bar chart
edited_custom_charts = lida.edit(code=custom_chart[0].code, summary=summary, instructions=custom_instructions, library=library)
plot_raster(edited_custom_charts[0].raster)
Web UI
LIDA also ships an official Web UI where you can upload your own data for analysis. Usage:
pip install lida
export OPENAI_API_KEY=<your key>
lida ui --port=8080 --docs
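!! Notes:
- Dataset size: LIDA currently suits small datasets, because current LLMs cannot handle very long inputs (token length limits).
- LLM choice: LIDA works best with GPT-3.5 and GPT-4, because summarizing high-dimensional data and reasoning over it still needs a fairly large LLM to get good results.
- Reliability: the paper reports an error rate below 3.5%, but you should still double-check that the generated charts are reasonable.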
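For the dataset-size caveat, one generic workaround is to shrink the data before handing it to LIDA. The sketch below reservoir-samples a fixed number of rows from CSV lines in a single pass, using only the standard library. It is a preprocessing idea, not something LIDA requires or provides; the function name sample_csv_rows is hypothetical. Whether rows or columns dominate the prompt size depends on how the summary is built, so treat row sampling as one option rather than a guaranteed fix.

```python
import csv
import random


def sample_csv_rows(lines, k, seed=0):
    """Reservoir-sample up to k data rows from an iterable of CSV lines.

    Returns (header, rows). The input is read once, so memory use is
    proportional to k rather than to the size of the dataset.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Made-up CSV content with 100 data rows.
demo_lines = ["a,b"] + [f"{i},{i * 2}" for i in range(100)]
header, rows = sample_csv_rows(demo_lines, k=10)
print(header, len(rows))
```

With a real file, passing an open file object (opened with newline="") as lines works the same way, and the sampled rows can be written to a smaller CSV for summarization.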
This article is reposted from the WeChat official account Halo咯咯. Author: 基咯咯
Original link: https://mp.weixin.qq.com/s/smeYr8cUi3yqXYm4jBz7Wg