转换并使用 YOLOv11 目标检测模型（ONNX格式）-51CTO.COM

在本文中，我们将探讨如何使用任何预训练或自定义的YOLOv11目标检测模型，并将其转换为一种广泛使用的开放格式——ONNX（开放神经网络交换）。使用这种格式的优势在于，它可以在多种编程语言中部署，而不依赖于官方的Ultralytics模块。

在这篇文章中，我将使用官方提供的YOLOv11n模型作为示例，但该方法同样适用于任何转换为ONNX格式的自定义YOLOv11模型。

首先，我们需要将训练好的.pt格式模型转换为ONNX格式，使用以下代码：

from ultralytics import YOLO


model_path = 'path/to/yolov11n.pt'
model = YOLO(model_path)
model.export(format='onnx', opset = 12, imgsz =[640,640])

在运行上述代码之前，请确保已经安装了`ultralytics`模块。一旦生成了ONNX文件，我们可以定义模型能够检测的所有类别。在我的例子中，这是基于COCO数据集预训练的模型，能够识别80个类别。

with open('coco-classes.txt') as file:
    content = file.read()
    classes = content.split('\n')

del classes[-1]

print(classes) # Let's print classes list

执行上述代码片段后，输出如下：

['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

为了进行推理，我们可以使用OpenCV读取图像：

# 读取图像
image = cv2.imread('bicycle.jpg')

# 转成RGB格式进行输入
img = cv2.cvtColor(image,cv2.COLOR_BGR2RGB)
img_height,img_width = img.shape[:2]

由于OpenCV以BGR格式读取图像，而YOLO期望的是RGB格式，因此我们将图像转换为RGB格式，并存储图像的尺寸以备后用。但这还不是全部！在YOLO发挥作用之前，还需要进行一些额外的图像处理。让我们看一下图像的形状。

print(img.shape)

(420, 620, 3)

我们的图像是一个3通道的RGB图像，宽度和高度分别为620和420。相比之下，YOLOv8模型期望的图像尺寸为(640, 640)，并且通道信息位于图像尺寸之前。

# resize image to get the desired size (640,640) for inference
img = cv2.resize(img,(640,640))

# change the order of image dimension from (640,640,3) to (3,640,640)
img = img.transpose(2,0,1)

最后，为了将图像提供给DNN模块，我们需要在第0个索引处添加一个额外的维度，以告诉模块我们一次提供了多少张图像。此外，我们的图像像素范围是0到255。在推理之前，必须将它们缩放到0到1的范围。

# add an extra dimension at index 0
img = img.reshape(1,3,640,640)

# scale to 0-1
img = img/255.0

现在，我们的图像已经准备好进行推理了。要使用ONNX模型运行推理，我们可以使用DNN模块中的`readNetFromONNX()`或`readNet()`方法。

# read the trained onnx model
net = cv2.dnn.readNetFromONNX('yolov8n.onnx')  # readNet() also works

# feed the model with processed image
net.setInput(img)

# run the inference
out = net.forward()

运行推理后，我们获得一个包含模型预测的输出矩阵，如上代码所示。为了理解如何提取其中的有价值信息，让我们首先打印这个输出矩阵的形状。

print(out.shape)

(1, 84, 8400)

输出矩阵的形状为(1, 84, 8400)，表示8400个检测，每个检测有84个参数。这是因为我们的YOLOv8模型被设计为始终预测图像中的8400个对象。需要注意的是，并非所有检测都是准确的，我们稍后需要根据置信度分数进行过滤。这里的84对应于每个检测的参数数量，包括边界框坐标(x1, y1, x2, y2)和80个不同类别的置信度分数。

对于自定义模型，这个结构可能会有所不同。置信度分数的数量取决于模型训练的类别数量。例如，如果YOLOv8被训练为检测1个类别，那么将只有5个参数而不是84个。对于2个类别，第一个索引处将有6个参数，依此类推。我们可以简单地删除第0个索引处的1，因为它只是告诉模型正在处理单个图像。

results = out[0]

现在，我们将矩阵转置以获得形状为(8400, 84)的矩阵，以便于操作。

results = results.transpose()

如上所述，每个检测都包括每个类别的置信度分数。为了确定对象或检测最可能属于哪个类别，我们只需找到具有最高置信度分数的类别。此外，为了去除所有置信度低于给定阈值的检测，我们可以使用以下函数：

def filter_Detections(results, thresh = 0.5):
    # if model is trained on 1 class only
    if len(results[0]) == 5:
        # filter out the detections with confidence > thresh
        considerable_detections = [detection for detection in results if detection[4] > thresh]
        considerable_detections = np.array(considerable_detections)
        return considerable_detections

    # if model is trained on multiple classes
    else:
        A = []
        for detection in results:

            class_id = detection[4:].argmax()
            confidence_score = detection[4:].max()

            new_detection = np.append(detection[:4],[class_id,confidence_score])

            A.append(new_detection)

        A = np.array(A)

        # filter out the detections with confidence > thresh
        considerable_detections = [detection for detection in A if detection[-1] > thresh]
        considerable_detections = np.array(considerable_detections)

        return considerable_detections

一旦我们通过排除无用参数获得了有用的结果，我们可以打印形状以更好地理解结果。

print(results.shape)

(45, 6)

看起来现在我们有了45个检测，每个检测有6个参数。它们是边界框的左上角(x1, y1)和右下角(x2, y2)坐标、类别ID和置信度值。在我们继续之前，让我们看一下我运行推理的图片。

看着这张图片，人们很容易看出这张图片中并没有45个对象。我们的结果矩阵仍然包含这么多检测的原因是因为多个检测指向同一个对象。为了解决这个问题，我们可以应用一种众所周知的技术，称为非最大抑制（NMS）。NMS充当过滤器，选择那些可能指向同一对象的最佳检测。它通过考虑两个关键指标来实现这一点：置信度值（模型对检测的确定性）和交并比（IOU）。

此外，我们还需要将剩余的检测结果重新缩放到原始比例。这是因为我们的模型输出的检测结果是针对640x640大小的图像，而不是我们原始图像的大小。


def NMS(boxes, conf_scores, iou_thresh = 0.55):

    #  boxes [[x1,y1, x2,y2], [x1,y1, x2,y2], ...]

    x1 = boxes[:,0]
    y1 = boxes[:,1]
    x2 = boxes[:,2]
    y2 = boxes[:,3]

    areas = (x2-x1)*(y2-y1)

    order = conf_scores.argsort()

    keep = []
    keep_confidences = []

    while len(order) > 0:
        idx = order[-1]
        A = boxes[idx]
        conf = conf_scores[idx]

        order = order[:-1]

        xx1 = np.take(x1, indices= order)
        yy1 = np.take(y1, indices= order)
        xx2 = np.take(x2, indices= order)
        yy2 = np.take(y2, indices= order)

        keep.append(A)
        keep_confidences.append(conf)

        # iou = inter/union

        xx1 = np.maximum(x1[idx], xx1)
        yy1 = np.maximum(y1[idx], yy1)
        xx2 = np.minimum(x2[idx], xx2)
        yy2 = np.minimum(y2[idx], yy2)

        w = np.maximum(xx2-xx1, 0)
        h = np.maximum(yy2-yy1, 0)

        intersection = w*h

        # union = areaA + other_areas - intesection
        other_areas = np.take(areas, indices= order)
        union = areas[idx] + other_areas - intersection

        iou = intersection/union

        boleans = iou < iou_thresh

        order = order[boleans]

        # order = [2,0,1]  boleans = [True, False, True]
        # order = [2,1]

    return keep, keep_confidences



def rescale_back(results,img_w,img_h):
    cx, cy, w, h, class_id, confidence = results[:,0], results[:,1], results[:,2], results[:,3], results[:,4], results[:,-1]
    cx = cx/640.0 * img_w
    cy = cy/640.0 * img_h
    w = w/640.0 * img_w
    h = h/640.0 * img_h
    x1 = cx - w/2
    y1 = cy - h/2
    x2 = cx + w/2
    y2 = cy + h/2

    boxes = np.column_stack((x1, y1, x2, y2, class_id))
    keep, keep_confidences = NMS(boxes,confidence)
    print(np.array(keep).shape)
    return keep, keep_confidences

其中，`rescaled_results`包含边界框(x1, y1, x2, y2)和类别ID，而`confidences`存储相应的置信度分数。最后，我们准备在图像上可视化这些结果。

for res, conf in zip(rescaled_results, confidences):

    x1,y1,x2,y2, cls_id = res
    cls_id = int(cls_id)
    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
    conf = "{:.2f}".format(conf)
    # draw the bounding boxes
    cv2.rectangle(image,(int(x1),int(y1)),(int(x2),int(y2)),(255,0,255),1)
    cv2.putText(image,classes[cls_id]+' '+conf,(x1,y1-17),
                cv2.FONT_HERSHEY_SCRIPT_COMPLEX,1,(255,0,255),1)