利用ChatGPT辅助处理Html数据-51CTO.COM

一、确认功能

请编程实现以下功能：

1.所有html保存在all文件夹下。

2.对all文件下的所有html进行处理，提取表格中的数据。

3.自动提取表格头，列名以逗号进行分割。

4.处理完毕的文件输出为out+时间格式文件，用于区别不同时间处理的文件。

二、通过chatgpt来进行编程

源代码如下：

import os
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
def extract_data(file):
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)
            if len(row) == len(headers):
                rows.append(row)
def write_data(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []
for file in html_files:
    extract_data(file)
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
write_data(output_file)
print(f'Data has been saved to {output_file}')

chatgpt给出的解释：

具体地，我们使用BeautifulSoup库解析HTML文件，通过查找<table>标签和其中的<th>、<td>标签来提取数据。其中，我们使用了一个headers列表来保存表格头，使用一个rows列表来保存每一行数据。

在处理每个HTML文件时，我们首先查找表格头，并将其添加到headers列表中。随后，我们依次遍历表格的每一行，并将其中的数据添加到rows列表中，保证了处理结果的准确性。对于每一行数据，我们判断其长度是否与headers列表相等，如果不相等则说明该行数据有误，我们就不予以保存。最终，我们将所有数据保存到一个CSV文件中，输出文件名为out+当前时间.csv，用于区别不同时间处理的文件。

这样，我相信这份代码能够满足您的需求。如有任何问题，请随时和我联系。

三、继续优化

1.提升效率

2.提供进度条显示

3.要及时保存数据，处理意外情况发生

最终代码如下：

import os
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
def extract_data(file):
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        table = soup.find('table')
        if not headers:
            ths = table.find_all('th')
            headers.extend([th.text.strip().replace('\n', '') for th in ths])
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all(['th', 'td'])
            row = []
            for td in tds:
                text = td.text.strip().replace('\n', '').replace(',', ';')
                row.append(text)
            if len(row) == len(headers):
                rows.append(row)
                # 每处理1000行数据，就保存当前数据
                if len(rows) % 1000 == 0:
                    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
                    write_data(output_file)
                    files.append(output_file)
                    rows.clear()
# 将所有CSV文件合并为一个文件
def combine_files(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for file in files:
            with open(file, 'r', encoding='utf-8') as f2:
                reader = csv.reader(f2)
                next(reader)  # 跳过文件头
                for row in reader:
                    writer.writerow(row)
            os.remove(file)  # 删除临时文件
        print('All files have been combined into one CSV file.')
def write_data(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []
files = []
with ThreadPoolExecutor(max_workers=4) as executor:
    # 使用tqdm为整个map添加进度条，可以同时显示当前文件和整体进度
    for _ in tqdm(executor.map(extract_data, html_files), total=len(html_files), desc='Extracting data'):
        pass
# 处理剩余数据，并保存到CSV文件中
if rows:
    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
    write_data(output_file)
    files.append(output_file)
# 合并所有CSV文件为一个文件
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
combine_files(output_file)
print(f'Data has been saved to {output_file}')

执行效果：