For example, consider a large dataset of image paths that needs to be turned into triplets, where each output row has three columns: anchor, positive, and negative. Converting the class column of the raw data to the Categorical dtype can significantly reduce memory usage.
# raw data
+----------+------------------------+
| class    | filename               |
+----------+------------------------+
| Bathroom | Bathroom\bath_1.jpg    |
| Bathroom | Bathroom\bath_100.jpg  |
| Bathroom | Bathroom\bath_1003.jpg |
| Bathroom | Bathroom\bath_1004.jpg |
| Bathroom | Bathroom\bath_1005.jpg |
+----------+------------------------+

# target
+------------------------+------------------------+----------------------------+
| anchor                 | positive               | negative                   |
+------------------------+------------------------+----------------------------+
| Bathroom\bath_1.jpg    | Bathroom\bath_100.jpg  | Dinning\din_540.jpg        |
| Bathroom\bath_100.jpg  | Bathroom\bath_1003.jpg | Dinning\din_1593.jpg       |
| Bathroom\bath_1003.jpg | Bathroom\bath_1004.jpg | Bedroom\bed_329.jpg        |
| Bathroom\bath_1004.jpg | Bathroom\bath_1005.jpg | Livingroom\living_1030.jpg |
| Bathroom\bath_1005.jpg | Bathroom\bath_1007.jpg | Bedroom\bed_1240.jpg       |
+------------------------+------------------------+----------------------------+
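A minimal sketch of the memory saving, assuming the raw data above is loaded into a DataFrame df (the reconstruction and the 10,000x replication below are illustrative, not from the original dataset):

import pandas as pd

# rebuild the raw data shown above and scale it up to simulate a large dataset
df = pd.DataFrame({
    "class": ["Bathroom"] * 5,
    "filename": [
        r"Bathroom\bath_1.jpg",
        r"Bathroom\bath_100.jpg",
        r"Bathroom\bath_1003.jpg",
        r"Bathroom\bath_1004.jpg",
        r"Bathroom\bath_1005.jpg",
    ],
})
df = pd.concat([df] * 10_000, ignore_index=True)

# object dtype stores one Python string per row; category stores each
# distinct label once plus a small integer code per row
before = df["class"].memory_usage(deep=True)
df["class"] = df["class"].astype("category")
after = df["class"].memory_usage(deep=True)
print(f"object: {before:,} bytes -> category: {after:,} bytes")

The saving grows with the number of rows, since a Categorical column pays the string-storage cost only once per distinct label.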
import pandas as pd

regex = (r'(?P<title>[A-Za-z\'\s]+),'
         r'(?P<author>[A-Za-z\s\']+),'
         r'(?P<isbn>[\d-]+),'
         r'(?P<year>\d{4}),'
         r'(?P<publisher>.+)')

addr = pd.Series([
"The Lost City of Amara,Olivia Garcia,978-1-234567-89-0,2023,HarperCollins",
"The Alchemist's Daughter,Maxwell Greene,978-0-987654-32-1,2022,Penguin Random House",
"The Last Voyage of the HMS Endeavour,Jessica Kim,978-5-432109-87-6,2021,Simon & Schuster",
"The Ghosts of Summer House,Isabella Lee,978-3-456789-12-3,2000,Macmillan Publishers",
"The Secret of the Blackthorn Manor,Emma Chen,978-9-876543-21-0,2023,Random House Children's Books"
])
addr.str.extract(regex)
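Series.str.extract returns a DataFrame with one column per named capture group, so the five groups above become five columns. A quick check (the variable name books is illustrative):

books = addr.str.extract(regex)
print(books.columns.tolist())
# ['title', 'author', 'isbn', 'year', 'publisher']
print(books.loc[0, "title"])  # The Lost City of Amara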
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "category": [["foo", "bar"], ["foo"], ["qux"]]})

# let's increase the number of rows in a dataframe
df = pd.concat([df] * 10000, ignore_index=True)
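As a quick sanity check (a small sketch, not part of the original snippet): concatenating 10,000 copies of the 3-row frame yields 30,000 rows, and the list-valued category column is replicated unchanged:

assert df.shape == (30000, 3)
print(df["category"].head(2).tolist())  # [['foo', 'bar'], ['foo']]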