How "scary" is a programmer on baby duty? He built a real-time baby sound monitor

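I haven't published an article in a while. That's because I became a dad, and I had to put my work aside to handle some parenting tasks that haven't been automated yet.

Looked at another way: could those tasks be automated? A diaper-changing robot is probably still several years away, but some of the simpler parenting tasks can be automated today.

One thing I learned as a new dad is that babies really do cry a lot. Even when I'm at home, I'm not always close enough to hear my son crying. Commercial baby monitors usually fill that gap: they work like intercoms, so you can hear the baby from another room. But I quickly realized that commercial baby monitors are much dumber than the device I had in mind. They don't detect cries; they just carry sound from a source to a speaker, like an intercom. Parents have to carry the speaker with them as they move around the house, otherwise they can't hear anything from other rooms. These monitors usually come with a low-power speaker and can't be connected to external speakers. So if I'm playing music in another room, I may miss the baby's cries even with the monitor next to me.
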

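My ideal baby monitor looks like this:

It should run on cheap hardware, such as a Raspberry Pi with a cheap USB microphone.
It should detect the baby's cries and notify me when he starts or stops crying (ideally on my phone), log the events to a dashboard, or do whatever else I want to do with them.
It should be able to play the audio on any device: my own speakers, my smartphone, my computer and so on. It should work whatever the distance between the sound source and the speaker, without me having to carry a speaker around the house.
It should also come with a camera, so I can check on the baby in real time, or get a photo or short video of the crib when he starts crying.

Let's see how to meet these requirements with open-source tools.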

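Recording audio samples

First, get a Raspberry Pi to run the TensorFlow model and flash a Linux operating system onto an SD card. A Raspberry Pi 3 or later is recommended. You also need a compatible microphone.

Then install the dependencies: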

[sudo] apt-get install ffmpeg lame libatlas-base-dev alsa-utils 
[sudo] pip3 install tensorflow

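The first step is to record enough audio samples of the baby crying and not crying, so that the detection model can learn to tell the two apart.

Note: in this example I show how to use sound detection to recognize a baby's cries, but the same approach can detect other kinds of sounds (an alarm, or your neighbor's drill), provided the sounds are long enough and loud enough.

First, check that the audio input devices are recognized: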

arecord -l

On my Raspberry Pi I get the following output (I have two USB microphones attached):

**** List of CAPTURE Hardware Devices **** 
card 1: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio] 
  Subdevices: 0/1 
  Subdevice #0: subdevice #0 
card 2: Device_1 [USB PnP Sound Device], device 0: USB Audio [USB Audio] 
  Subdevices: 0/1 
  Subdevice #0: subdevice #0

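I want to record from the second microphone: card 2, device 0. ALSA (Advanced Linux Sound Architecture) identifies it either as hw:2,0 (which accesses the hardware device directly) or as plughw:2,0 (which adds sample-rate and format conversion plugins where needed). Make sure the SD card has enough free space, or plug in an external USB drive, then start recording: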

arecord -D plughw:2,0 -c 1 -f cd | lame - audio.mp3
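
The mapping from the card and device numbers printed by arecord -l to ALSA device names can be sketched in a few lines of Python. This is only an illustrative helper (the function name alsa_device_names and the regular expression are assumptions, not part of ALSA or of any tool used in this article), and it only handles output shaped like the listing above:

```python
import re

# Matches lines such as:
#   card 2: Device_1 [USB PnP Sound Device], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(r'^card (\d+): .*?, device (\d+):')


def alsa_device_names(arecord_output, plugin='plughw'):
    """Return the ALSA device names (e.g. 'plughw:2,0') found in `arecord -l` output."""
    names = []
    for line in arecord_output.splitlines():
        match = CARD_LINE.match(line.strip())
        if match:
            card, device = match.groups()
            names.append(f'{plugin}:{card},{device}')
    return names


sample_output = """\
**** List of CAPTURE Hardware Devices ****
card 1: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
card 2: Device_1 [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
"""

print(alsa_device_names(sample_output))  # ['plughw:1,0', 'plughw:2,0']
```

Passing plugin='hw' gives the direct-hardware names instead (hw:1,0 and hw:2,0).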

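Record anywhere from a few minutes to a few hours of sound in the baby's room, ideally including long stretches of silence, crying, and other unrelated sounds. Press Ctrl+C to stop recording. Repeat this several times over one or more days.

Labeling the audio samples

Once you have enough audio samples, copy them to your computer to train the model, either with scp or directly from the SD card or USB drive.

Put all the copied samples under one directory, for example ~/datasets/sound-detect/audio, and create a subdirectory for each sample. Each subdirectory contains an audio file named audio.mp3 and a label file named labels.json, which marks the positive and negative segments of the audio. The structure looks roughly like this: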

~/datasets/sound-detect/audio 
  -> sample_1 
    -> audio.mp3 
    -> labels.json 
  -> sample_2 
    -> audio.mp3 
    -> labels.json 
  ...

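Now comes the labeling, which is a fairly masochistic task if the recordings contain hours of your baby crying. Open each audio file in your favorite audio player or in Audacity, and create a labels.json file in each sample directory. Identify the exact times at which the cries start and end, and record them in labels.json as time -> label key-value pairs, for example: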

{ 
  "00:00": "negative", 
  "02:13": "positive", 
  "04:57": "negative", 
  "15:41": "positive", 
  "18:24": "negative" 
}

In the example above, the audio between 00:00 and 02:12 is labeled negative, the audio between 02:13 and 04:56 is labeled positive, and so on.
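
To make the convention concrete, here is a minimal sketch that expands such a mapping into explicit (start, end, label) segments expressed in seconds. It is not micmon's own parser: the helper names are made up, timestamps are assumed to be in mm:ss form, and the total duration of the recording (20 minutes) is an invented value for the example.

```python
import json


def to_seconds(timestamp):
    """Convert an 'mm:ss' timestamp into seconds."""
    minutes, seconds = timestamp.split(':')
    return int(minutes) * 60 + int(seconds)


def label_segments(labels, total_duration):
    """Turn a {timestamp: label} mapping into (start, end, label) tuples.

    Each label applies from its timestamp up to the next timestamp;
    the last one applies until the end of the recording.
    """
    starts = sorted((to_seconds(ts), label) for ts, label in labels.items())
    ends = [start for start, _ in starts[1:]] + [total_duration]
    return [(start, end, label) for (start, label), end in zip(starts, ends)]


labels = json.loads("""
{
  "00:00": "negative",
  "02:13": "positive",
  "04:57": "negative",
  "15:41": "positive",
  "18:24": "negative"
}
""")

# total_duration is a made-up value: a 20-minute recording
for start, end, label in label_segments(labels, total_duration=20 * 60):
    print(f'{start:5d}s -> {end:5d}s  {label}')
```

The second segment comes out as 133s -> 297s positive, which matches the 02:13 to 04:56 range described above (the end of each segment is exclusive).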

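Generating the dataset

Once all the audio samples are labeled, you can generate the dataset that will be fed to the TensorFlow model. I created a general-purpose sound monitoring library and set of utilities called micmon. Install it: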

git clone https://github.com/BlackLight/micmon.git 
cd micmon 
[sudo] pip3 install -r requirements.txt 
[sudo] python3 setup.py build install

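The model is designed to work on frequency samples rather than raw audio. The reason is that a specific sound has a specific "spectral" signature: a fundamental frequency (or a narrow range in which the fundamental usually falls) and a specific set of harmonics tied to the fundamental by specific ratios. These ratios are affected neither by amplitude (they stay constant whatever the input volume) nor by phase (a continuous sound has the same spectral signature whenever you start recording it). This amplitude and time invariance makes the approach much more likely to produce a robust sound detection model than simply feeding raw audio samples to a model. The model can also be simpler, lighter and less prone to overfitting. Simpler, because frequencies can easily be grouped into bands without hurting performance, which is an effective form of dimensionality reduction. Lighter, because the model takes between 50 and 100 frequency bands as input regardless of the sample duration, whereas one second of raw audio usually contains 44100 data points and the input length grows with the sample duration.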
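
The invariance argument can be checked on a synthetic sound. The sketch below is not micmon's implementation: it assumes NumPy is installed, uses equal-width frequency bands, and builds an artificial "cry" from a 500 Hz fundamental plus two weaker harmonics. The names band_signature and tone are invented for the example.

```python
import numpy as np

SAMPLE_RATE = 44100  # samples per second, as in CD-quality audio


def band_signature(signal, low=250, high=2500, bins=100):
    """Normalized energy of `signal` in `bins` equal-width bands between low and high (Hz)."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / SAMPLE_RATE)
    edges = np.linspace(low, high, bins + 1)
    # Sum the FFT magnitudes that fall into each band
    signature, _ = np.histogram(freqs, bins=edges, weights=magnitudes)
    return signature / signature.max()


def tone(fundamental, duration=2.0, start=0.0, volume=1.0):
    """Synthetic sound: a fundamental plus two weaker harmonics."""
    t = start + np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    return volume * (np.sin(2 * np.pi * fundamental * t)
                     + 0.5 * np.sin(2 * np.pi * 2 * fundamental * t)
                     + 0.25 * np.sin(2 * np.pi * 3 * fundamental * t))


reference = band_signature(tone(500))
quieter = band_signature(tone(500, volume=0.1))   # same sound, 10x quieter
shifted = band_signature(tone(500, start=0.3337)) # same sound, recorded later

print(len(tone(500)), '->', len(reference))        # raw samples vs. model inputs
print(np.allclose(reference, quieter))             # volume does not change the signature
print(np.allclose(reference, shifted, atol=1e-3))  # neither does the start time
```

Two seconds of raw audio (88200 samples) shrink to 100 input values, and the signature stays the same when the volume or the recording start time changes.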

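micmon provides the logic to compute the FFT (Fast Fourier Transform) of segments of the audio samples, group the resulting spectrum into bands between a low and a high cutoff frequency, and save the results to a set of compressed numpy (.npz) files. You can do this from the command line with micmon-datagen: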

micmon-datagen \ 
    --low 250 --high 2500 --bins 100 \ 
    --sample-duration 2 --channels 1 \ 
    ~/datasets/sound-detect/audio  ~/datasets/sound-detect/data

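In the example above we generate a dataset from the raw audio samples stored under ~/datasets/sound-detect/audio and store the result under ~/datasets/sound-detect/data.

The --low and --high options set the lowest and highest frequency kept in the resulting spectrum. The defaults are 20 Hz (the lowest frequency audible to the human ear) and 20 kHz (the highest frequency audible to a healthy young ear). You will usually want to narrow this range to capture as much as possible of the sound you want to detect while limiting background sounds and unrelated harmonics. In my case 250-2500 Hz was good enough to detect baby cries. Baby cries are high-pitched (for comparison, the highest note of an opera soprano is around 1000 Hz), and you usually want at least twice the highest fundamental frequency so that you capture enough higher harmonics (the higher frequencies that actually give a sound its timbre), but not so high that harmonics from other background sounds pollute the spectrum. I dropped everything below 250 Hz because a baby's cry has little content at such low frequencies, and including them may skew detection. A good approach is to open some positive audio samples in Audacity or any other equalizer or spectrum analyzer, check which frequencies dominate in them, and center the dataset around those frequencies.

The --bins option sets the number of groups the frequency space is split into (default: 100). A higher number of bins means a higher frequency resolution/granularity, but if it is too high the model becomes prone to overfitting.

The script splits the original audio into smaller segments and computes the spectral "signature" of each segment. --sample-duration sets how long each segment should be (default: 2 seconds). Higher values work better with longer sounds, but they delay detection (a full segment must be captured before each prediction) and will probably fail on short sounds. Lower values work better with shorter sounds, but if the sound is longer the captured segments may not contain enough information to identify it reliably.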
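
A quick back-of-the-envelope calculation shows what these options imply for the size of the dataset. The helper below is purely illustrative: the function name is made up, it assumes equal-width bands and non-overlapping segments, and it does not call micmon.

```python
def dataset_shape(recording_seconds, sample_duration=2.0, bins=100,
                  low=250, high=2500, sample_rate=44100):
    """Rough sizes implied by the dataset generation options (illustrative only)."""
    return {
        # Number of non-overlapping segments, i.e. rows in the dataset
        'segments': int(recording_seconds // sample_duration),
        # Number of values fed to the model for each segment
        'inputs_per_segment': bins,
        # Number of raw audio data points each segment is computed from
        'raw_points_per_segment': int(sample_rate * sample_duration),
        # Width of each frequency band, assuming equal-width bands
        'band_width_hz': (high - low) / bins,
    }


# Five hours of recordings with the options used above
print(dataset_shape(5 * 3600))

# Shorter segments give more training rows but less information per row
print(dataset_shape(5 * 3600, sample_duration=1.0)['segments'])
```

With the options used in this article, five hours of audio yield 9000 segments of 100 values each, with bands 22.5 Hz wide.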

Besides micmon-datagen, you can also generate the dataset through the Python API provided by micmon:

import os 
 
from micmon.audio import AudioDirectory, AudioPlayer, AudioFile 
from micmon.dataset import DatasetWriter 
 
basedir = os.path.expanduser('~/datasets/sound-detect') 
audio_dir = os.path.join(basedir, 'audio') 
datasets_dir = os.path.join(basedir, 'data') 
cutoff_frequencies = [250, 2500] 
 
# Scan the base audio_dir for labelled audio samples 
audio_dirs = AudioDirectory.scan(audio_dir) 
 
# Save the spectrum information and labels of the samples to a 
# different compressed file for each audio file. 
for audio_dir in audio_dirs: 
    dataset_file = os.path.join(datasets_dir, os.path.basename(audio_dir.path) + '.npz') 
    print(f'Processing audio sample {audio_dir.path}') 
 
    with AudioFile(audio_dir) as reader, \ 
            DatasetWriter(dataset_file, 
                          low_freq=cutoff_frequencies[0], 
                          high_freq=cutoff_frequencies[1]) as writer: 
        for sample in reader: 
            writer += sample

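Whether you use micmon-datagen or the micmon Python API, you should end up with a set of .npz files under ~/datasets/sound-detect/data, one for each labeled audio sample. We can now use this dataset to train our neural network for sound detection.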
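
Before training, it can be useful to peek inside one of the generated files. The sketch below lists the arrays stored in a .npz file without assuming anything about how micmon names them. It assumes NumPy is installed; to stay runnable anywhere it is demonstrated on a stand-in file whose array names and shapes (samples, labels) are invented for the example.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array_name: shape} for every array stored in a .npz file."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Demo on a stand-in file: the array names and shapes below are made up
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, 'demo.npz')
    np.savez_compressed(demo_file,
                        samples=np.zeros((9000, 100)),
                        labels=np.zeros(9000))
    print(describe_npz(demo_file))
```

On a real dataset you would pass the path of one of the files under ~/datasets/sound-detect/data (expanded with os.path.expanduser) instead of the demo file.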

Training the model

micmon uses TensorFlow + Keras to define and train the model, which is easy to do with the provided Python API:

import os 
from tensorflow.keras import layers 
 
from micmon.dataset import Dataset 
from micmon.model import Model 
 
# This is a directory that contains the saved .npz dataset files 
datasets_dir = os.path.expanduser('~/datasets/sound-detect/data') 
 
# This is the output directory where the model will be saved 
model_dir = os.path.expanduser('~/models/sound-detect') 
 
# This is the number of training epochs for each dataset sample 
epochs = 2 
 
# Load the datasets from the compressed files. 
# 70% of the data points will be included in the training set, 
# 30% of the data points will be included in the evaluation set 
# and used to evaluate the performance of the model. 
datasets = Dataset.scan(datasets_dir, validation_split=0.3) 
labels = ['negative', 'positive'] 
freq_bins = len(datasets[0].samples[0]) 
 
# Create a network with 4 layers (one input layer, two intermediate layers and one output layer). 
# The first intermediate layer in this example will have twice the number of units as the number 
# of input units, while the second intermediate layer will have 75% of the number of 
# input units. We also specify the names for the labels and the low and high frequency range 
# used when sampling. 
model = Model( 
    [ 
        layers.Input(shape=(freq_bins,)), 
        layers.Dense(int(2 * freq_bins), activation='relu'), 
        layers.Dense(int(0.75 * freq_bins), activation='relu'), 
        layers.Dense(len(labels), activation='softmax'), 
    ], 
    labels=labels, 
    low_freq=datasets[0].low_freq, 
    high_freq=datasets[0].high_freq 
) 
 
# Train the model 
for epoch in range(epochs): 
    for i, dataset in enumerate(datasets): 
        print(f'[epoch {epoch+1}/{epochs}] [audio sample {i+1}/{len(datasets)}]') 
        model.fit(dataset) 
        evaluation = model.evaluate(dataset) 
        print(f'Validation set loss and accuracy: {evaluation}') 
 
# Save the model 
model.save(model_dir, overwrite=True)

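After running this code, check the reported accuracy; the new model is saved under ~/models/sound-detect. In my case, about 5 hours of sound collected from the baby's room and a well-chosen frequency range were enough to train a model with more than 96% accuracy.

If you trained the model on your computer, copy it to the Raspberry Pi.

Using the model for detection

Let's write a script that uses the trained model to process live audio from the microphone and alerts us when the baby cries: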

import os 
 
from micmon.audio import AudioDevice 
from micmon.model import Model 
 
model_dir = os.path.expanduser('~/models/sound-detect') 
model = Model.load(model_dir) 
audio_system = 'alsa'        # Supported: alsa and pulse 
audio_device = 'plughw:2,0'  # Get list of recognized input devices with arecord -l 
 
with AudioDevice(audio_system, device=audio_device) as source: 
    for sample in source: 
        source.pause()  # Pause recording while we process the frame 
        prediction = model.predict(sample) 
        print(prediction) 
        source.resume() # Resume recording

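Run the script on the Raspberry Pi: it prints negative if no crying was detected over the last 2 seconds, and positive otherwise.

Printing the baby's state is not enough, though: we want notifications. These are handled by Platypush. In this example we use the pushbullet integration to send a message to our phone when crying is detected.

Install Redis (Platypush uses it to receive messages) and Platypush with the HTTP and Pushbullet integrations: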

[sudo] apt-get install redis-server 
[sudo] systemctl start redis-server.service 
[sudo] systemctl enable redis-server.service 
[sudo] pip3 install 'platypush[http,pushbullet]'

Install the Pushbullet app on your smartphone and get an API token from pushbullet.com. Then create a ~/.config/platypush/config.yaml file that enables the HTTP and Pushbullet integrations:

backend.http: 
  enabled: True 
pushbullet: 
  token: YOUR_TOKEN

Now modify the previous script so that, instead of printing a message, it triggers a CustomEvent that Platypush can capture:

#!/usr/bin/python3 
 
import argparse 
import logging 
import os 
import sys 
 
from platypush import RedisBus 
from platypush.message.event.custom import CustomEvent 
 
from micmon.audio import AudioDevice 
from micmon.model import Model 
 
logger = logging.getLogger('micmon') 
 
 
def get_args(): 
    parser = argparse.ArgumentParser() 
    parser.add_argument('model_path', help='Path to the file/directory containing the saved Tensorflow model') 
    parser.add_argument('-i', help='Input sound device (e.g. hw:0,1 or default)', required=True, dest='sound_device') 
    parser.add_argument('-e', help='Name of the event that should be raised when a positive event occurs', required=True, dest='event_type') 
    parser.add_argument('-s', '--sound-server', help='Sound server to be used (available: alsa, pulse)', required=False, default='alsa', dest='sound_server') 
    parser.add_argument('-P', '--positive-label', help='Model output label name/index to indicate a positive sample (default: positive)', required=False, default='positive', dest='positive_label') 
    parser.add_argument('-N', '--negative-label', help='Model output label name/index to indicate a negative sample (default: negative)', required=False, default='negative', dest='negative_label') 
    parser.add_argument('-l', '--sample-duration', help='Length of the FFT audio samples (default: 2 seconds)', required=False, type=float, default=2., dest='sample_duration') 
    parser.add_argument('-r', '--sample-rate', help='Sample rate (default: 44100 Hz)', required=False, type=int, default=44100, dest='sample_rate') 
    parser.add_argument('-c', '--channels', help='Number of audio recording channels (default: 1)', required=False, type=int, default=1, dest='channels') 
    parser.add_argument('-f', '--ffmpeg-bin', help='FFmpeg executable path (default: ffmpeg)', required=False, default='ffmpeg', dest='ffmpeg_bin') 
    parser.add_argument('-v', '--verbose', help='Verbose/debug mode', required=False, action='store_true', dest='debug') 
    parser.add_argument('-w', '--window-duration', help='Duration of the look-back window (default: 10 seconds)', required=False, type=float, default=10., dest='window_length') 
    parser.add_argument('-n', '--positive-samples', help='Number of positive samples detected over the window duration to trigger the event (default: 1)', required=False, type=int, default=1, dest='positive_samples') 
 
    opts, args = parser.parse_known_args(sys.argv[1:]) 
    return opts 
 
 
def main(): 
    args = get_args() 
    if args.debug: 
        logger.setLevel(logging.DEBUG) 
 
    model_dir = os.path.abspath(os.path.expanduser(args.model_path)) 
    model = Model.load(model_dir) 
    window = [] 
    cur_prediction = args.negative_label 
    bus = RedisBus() 
 
    with AudioDevice(system=args.sound_server, 
                     device=args.sound_device, 
                     sample_duration=args.sample_duration, 
                     sample_rate=args.sample_rate, 
                     channels=args.channels, 
                     ffmpeg_bin=args.ffmpeg_bin, 
                     debug=args.debug) as source: 
        for sample in source: 
            source.pause()  # Pause recording while we process the frame 
            prediction = model.predict(sample) 
            logger.debug(f'Sample prediction: {prediction}') 
            has_change = False 
 
            # window_length is in seconds: keep only the samples that fit in it
            max_window_size = max(1, int(args.window_length / args.sample_duration))
            window = (window + [prediction])[-max_window_size:]
 
            positive_samples = len([pred for pred in window if pred == args.positive_label]) 
            if args.positive_samples <= positive_samples and \ 
                    prediction == args.positive_label and \ 
                    cur_prediction != args.positive_label: 
                cur_prediction = args.positive_label 
                has_change = True 
                logging.info(f'Positive sample threshold detected ({positive_samples}/{len(window)})') 
            elif args.positive_samples > positive_samples and \ 
                    prediction == args.negative_label and \ 
                    cur_prediction != args.negative_label: 
                cur_prediction = args.negative_label 
                has_change = True 
                logging.info(f'Negative sample threshold detected ({len(window)-positive_samples}/{len(window)})') 
 
            if has_change: 
                evt = CustomEvent(subtype=args.event_type, state=prediction) 
                bus.post(evt) 
 
            source.resume() # Resume recording 
 
 
if __name__ == '__main__': 
    main()

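Save the script above as ~/bin/micmon_detect.py and make it executable (chmod +x), since the service defined later runs it directly. The script only triggers an event if at least positive_samples positive samples are detected over a sliding window whose duration is set by window_length (this filters out prediction errors and short glitches), and only when the current state flips from negative to positive or from positive to negative. The event is dispatched to Platypush over the RedisBus. The script is generic: it works with any sound model (not just baby cries), any positive/negative labels, any frequency range and any type of output event.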
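
The debouncing logic is easier to follow in isolation. The sketch below reproduces the same rule on a plain list of predictions, with no audio, model or Platypush involved. The function name state_changes is made up, and the window is counted in samples: 5 samples stand for 10 seconds of 2-second samples.

```python
from collections import deque


def state_changes(predictions, window_size=5, positive_samples=2,
                  positive='positive', negative='negative'):
    """Yield (index, new_state) each time the debounced state flips."""
    window = deque(maxlen=window_size)  # keeps only the latest predictions
    state = negative
    for i, prediction in enumerate(predictions):
        window.append(prediction)
        positives = sum(1 for p in window if p == positive)
        if positives >= positive_samples and prediction == positive and state != positive:
            state = positive
            yield i, state
        elif positives < positive_samples and prediction == negative and state != negative:
            state = negative
            yield i, state


stream = ['negative', 'positive', 'negative', 'negative', 'negative', 'negative',
          'positive', 'positive', 'positive', 'negative', 'negative',
          'negative', 'negative', 'negative']

print(list(state_changes(stream)))  # [(7, 'positive'), (12, 'negative')]
```

The isolated positive at index 1 is ignored, the state turns positive at index 7 (the second positive in the window), and it turns negative again only at index 12, once fewer than two positives remain in the window.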

Next, create a Platypush hook that reacts to the event and sends a push notification to your devices. First, prepare the Platypush scripts directory:

mkdir -p ~/.config/platypush/scripts 
cd ~/.config/platypush/scripts 
# Define the directory as a module 
touch __init__.py 
# Create a script for the baby-cry events 
vi babymonitor.py

The content of babymonitor.py:

from platypush.context import get_plugin 
from platypush.event.hook import hook 
from platypush.message.event.custom import CustomEvent 
 
 
@hook(CustomEvent, subtype='baby-cry', state='positive') 
def on_baby_cry_start(event, **_): 
    pb = get_plugin('pushbullet') 
    pb.send_note(title='Baby cry status', body='The baby is crying!') 
 
 
@hook(CustomEvent, subtype='baby-cry', state='negative') 
def on_baby_cry_stop(event, **_): 
    pb = get_plugin('pushbullet') 
    pb.send_note(title='Baby cry status', body='The baby stopped crying - good job!')

Create a service file for Platypush, then start and enable the service so that it is restarted automatically on termination or reboot:

mkdir -p ~/.config/systemd/user 
wget -O ~/.config/systemd/user/platypush.service \ 
    https://raw.githubusercontent.com/BlackLight/platypush/master/examples/systemd/platypush.service 
systemctl --user start platypush.service 
systemctl --user enable platypush.service

Also create a service for the baby monitor:

~/.config/systemd/user/babymonitor.service

[Unit] 
Description=Monitor to detect my baby's cries 
After=network.target sound.target 
[Service] 
ExecStart=/home/pi/bin/micmon_detect.py -i plughw:2,0 -e baby-cry -w 10 -n 2 ~/models/sound-detect 
Restart=always 
RestartSec=10 
[Install] 
WantedBy=default.target

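This service starts monitoring the microphone on the ALSA device plughw:2,0. If at least 2 positive 2-second samples are detected over the past 10 seconds and the previous state was negative, it fires a baby-cry event with state=positive. If fewer than 2 positive samples are detected over the past 10 seconds and the previous state was positive, it fires a baby-cry event with state=negative.

Then start and enable the service: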

systemctl --user start babymonitor.service 
systemctl --user enable babymonitor.service

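As soon as the baby cries, you should get a notification on your phone. If you don't, review the labels applied to the audio samples, the architecture and parameters of the neural network, or the sample length/window/frequency parameters.

You can also treat this as a basic example of automation and add as many automation tasks as you like. For example, you can send a request to another Platypush device with the tts plugin to announce that the baby is crying. You can also extend micmon_detect.py so that the captured audio samples are streamed over HTTP as well, for example with a Flask wrapper and ffmpeg for the audio conversion. Another interesting use case is to send data points to your local database whenever the baby starts or stops crying: this is a useful set of data to track when the baby sleeps, wakes up or needs feeding. For more on this, see how to build flexible dashboards with Platypush + PostgreSQL + Mosquitto + Grafana.

Monitoring my baby was the main motivation for developing micmon, but the same code from this article can be used to train and use models that detect any type of sound.

One last note: make sure to use a good power supply or a lithium battery pack.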

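Baby camera

Once you have an audio stream and a way to detect when the crying starts and stops, you can add a video stream to keep an eye on the baby. I initially mounted a PiCamera on the same Raspberry Pi 3 I used for audio detection, but that setup turned out to be impractical. A Raspberry Pi 3 with a battery and a camera is bulky and hard to fit on a stand. I ended up choosing a Raspberry Pi Zero with a small battery and a PiCamera in a case.

My first prototype of the baby monitor camera module

As on the other device, flash a Raspberry Pi-compatible operating system onto an SD card. Then plug a Raspberry Pi-compatible camera into its slot, make sure the camera module is enabled in raspi-config, and install Platypush with the PiCamera integration: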

[sudo] pip3 install 'platypush[http,camera,picamera]'

Add the camera configuration to ~/.config/platypush/config.yaml:

camera.pi: 
    listen_port: 5001

Restart Platypush after changing the configuration. You can then fetch a snapshot from the camera over HTTP:

wget http://raspberry-pi:8008/camera/pi/photo.jpg

Or open the camera's video stream in a browser:

http://raspberry-pi:8008/camera/pi/video.mjpg

Or create a hook that starts streaming the camera feed over TCP/H264 when the application starts:

mkdir -p ~/.config/platypush/scripts 
cd ~/.config/platypush/scripts 
touch __init__.py 
vi camera.py

The content of camera.py:

from platypush.context import get_plugin 
from platypush.event.hook import hook 
from platypush.message.event.application import ApplicationStartedEvent 
 
 
@hook(ApplicationStartedEvent) 
def on_application_started(event, **_): 
    cam = get_plugin('camera.pi') 
    cam.start_streaming()

Once this is in place, you can watch the stream with VLC:

vlc tcp/h264://raspberry-pi:5001

You can also watch the stream on your phone with the VLC app or with an app such as RPi Camera Viewer.
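
The snapshot endpoint shown earlier can also be called from a script, for instance to save a picture of the crib when a cry is detected. The following is a minimal sketch using only the Python standard library. The helper names, the file naming scheme and the default host raspberry-pi are assumptions for illustration; only the URL path comes from the example above.

```python
import os
import urllib.request
from datetime import datetime


def snapshot_url(host='raspberry-pi', port=8008):
    """URL of the camera snapshot endpoint used in the wget example."""
    return f'http://{host}:{port}/camera/pi/photo.jpg'


def save_snapshot(directory='.', host='raspberry-pi', port=8008, timeout=10):
    """Download one snapshot and return the path it was saved to."""
    # Hypothetical naming scheme: one timestamped file per snapshot
    filename = datetime.now().strftime('baby-%Y%m%d-%H%M%S.jpg')
    path = os.path.join(directory, filename)
    with urllib.request.urlopen(snapshot_url(host, port), timeout=timeout) as response, \
            open(path, 'wb') as output:
        output.write(response.read())
    return path


print(snapshot_url())
```

Calling save_snapshot() requires the camera device to be reachable on the network, so it is not invoked here.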

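Audio monitor

The last step is to set up a microphone audio stream from the baby's Raspberry Pi to any client. The TensorFlow model can alert you when the baby cries, but machine learning detection is never 100% accurate. Sometimes you simply want to hear or see what is going on in the baby's room.

For this purpose I made a tool called micstream, which can be used in any situation where you want to stream audio from a microphone over HTTP/mp3.

Note: if one microphone is already feeding audio samples to the TensorFlow model, you need a second microphone for streaming.

Clone the tool and install it (ffmpeg is the only dependency):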

git clone https://github.com/BlackLight/micstream.git 
cd micstream 
[sudo] python3 setup.py install

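Run micstream --help to see the available command-line options.

For example, to stream from the third audio input device (use arecord -l to list all audio devices) on the /baby.mp3 endpoint, listening on port 8088 with a 96 kbps bitrate, run: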

micstream -i plughw:3,0 -e '/baby.mp3' -b 96 -p 8088

Open http://your-rpi:8088/baby.mp3 in a browser or an audio player and you can listen to your baby in real time.