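Preface
Linux provides a watchdog facility. In the Linux kernel, starting the watchdog arms a timer; if nothing writes to /dev/watchdog before the timeout expires, the system reboots. A watchdog implemented with a timer in this way is a software-level watchdog.
Android has its own software-level Watchdog that protects a number of important system services. When one of them fails, the Watchdog usually restarts the Android system. Because this mechanism exists, it is common to see the system_server process killed by the Watchdog, which makes the phone reboot.
This article walks through how it works.
I. Watchdog startup in detail
The ANR mechanism targets applications. For system processes, Android provides the Watchdog mechanism to handle the case where they stay "unresponsive" for too long. Once the unresponsive period exceeds the timeout, the Watchdog kills its own process (system_server).
Watchdog is a thread: it extends Thread, and SystemServer.java obtains the Watchdog object through getInstance().
1. Started in SystemServer.java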
- private void startOtherServices() {
- ······
- traceBeginAndSlog("InitWatchdog");
- final Watchdog watchdog = Watchdog.getInstance();
- watchdog.init(context, mActivityManagerService);
- traceEnd();
- ······
- traceBeginAndSlog("StartWatchdog");
- Watchdog.getInstance().start();
- traceEnd();
- }
Because Watchdog is a thread, calling start() is all that is needed to run it.
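The getInstance(), init(), start() sequence above can be illustrated with a minimal sketch. Everything in it is hypothetical (the class name WatchdogSingletonSketch, the no-argument init(), the awaitStarted helper); it only mirrors the singleton-plus-thread pattern and is not the AOSP implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the pattern, not the real com.android.server.Watchdog.
final class WatchdogSingletonSketch extends Thread {
    private static WatchdogSingletonSketch sInstance;

    private final CountDownLatch started = new CountDownLatch(1);
    private volatile boolean initialized;

    private WatchdogSingletonSketch() {
        super("watchdog-sketch");
        setDaemon(true); // assumption: keeps the demo from blocking JVM exit
    }

    // Lazily creates the single instance; synchronized so two callers cannot race.
    static synchronized WatchdogSingletonSketch getInstance() {
        if (sInstance == null) {
            sInstance = new WatchdogSingletonSketch();
        }
        return sInstance;
    }

    // The real init() takes a Context and the ActivityManagerService; omitted here.
    void init() {
        initialized = true;
    }

    boolean isInitialized() {
        return initialized;
    }

    @Override
    public void run() {
        // The real thread loops forever; the sketch only records that it ran.
        started.countDown();
    }

    // Waits until run() has been entered, or until the timeout elapses.
    boolean awaitStarted(long timeoutMillis) {
        try {
            return started.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling getInstance() twice returns the same object, which is why SystemServer can call Watchdog.getInstance().start() later without keeping a reference around.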
2. The Watchdog constructor
- private Watchdog() {
- super("watchdog");
- // Initialize handler checkers for each common thread we want to check. Note
- // that we are not currently checking the background thread, since it can
- // potentially hold longer running operations with no guarantees about the timeliness
- // of operations there.
- // The shared foreground thread is the main checker. It is where we
- // will also dispatch monitor checks and do other work.
- mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
- "foreground thread", DEFAULT_TIMEOUT);
- mHandlerCheckers.add(mMonitorChecker);
- // Add checker for main thread. We only do a quick check since there
- // can be UI running on the thread.
- mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
- "main thread", DEFAULT_TIMEOUT));
- // Add checker for shared UI thread.
- mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
- "ui thread", DEFAULT_TIMEOUT));
- // And also check IO thread.
- mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
- "i/o thread", DEFAULT_TIMEOUT));
- // And the display thread.
- mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
- "display thread", DEFAULT_TIMEOUT));
- // Initialize monitor for Binder threads.
- addMonitor(new BinderThreadMonitor());
- mOpenFdMonitor = OpenFdMonitor.create();
- // See the notes on DEFAULT_TIMEOUT.
- assert DB ||
- DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
- // mtk enhance
- exceptionHWT = new ExceptionLog();
- }
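Two objects deserve attention: mMonitorChecker and mHandlerCheckers.
Where the elements of the mHandlerCheckers list come from:
Added in the constructor: checkers for FgThread, the main thread, UiThread, IoThread and DisplayThread
Added from outside: Watchdog.getInstance().addThread(handler);
Where the elements of mMonitorChecker's monitor list (mMonitors) come from:
Added from outside: Watchdog.getInstance().addMonitor(monitor);
Special case: the constructor itself calls addMonitor(new BinderThreadMonitor());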
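As a rough illustration of how the two collections relate, here is a minimal sketch. The class CheckerRegistrySketch and its simplified HandlerChecker (which holds only a name instead of a Handler) are assumptions made for the example; the BinderThreadMonitor registration from the real constructor is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the two registration paths.
final class CheckerRegistrySketch {
    interface Monitor {
        void monitor();
    }

    // Stand-in for HandlerChecker: a thread name plus that checker's monitors.
    static final class HandlerChecker {
        final String name;
        final List<Monitor> monitors = new ArrayList<>();

        HandlerChecker(String name) {
            this.name = name;
        }
    }

    final List<HandlerChecker> handlerCheckers = new ArrayList<>();
    final HandlerChecker monitorChecker;

    CheckerRegistrySketch() {
        // The foreground-thread checker doubles as the monitor checker.
        monitorChecker = new HandlerChecker("foreground thread");
        handlerCheckers.add(monitorChecker);
        for (String n : new String[] {"main thread", "ui thread", "i/o thread", "display thread"}) {
            handlerCheckers.add(new HandlerChecker(n));
        }
    }

    // Monitors always land on the single monitor checker.
    synchronized void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    // Each extra thread gets its own checker with no monitors.
    synchronized void addThread(String threadName) {
        handlerCheckers.add(new HandlerChecker(threadName));
    }
}
```

The point of the design is that monitorChecker is itself one entry of handlerCheckers, so a single loop over handlerCheckers covers both plain thread checks and monitor checks.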
3. The Watchdog run method
- public void run() {
- boolean waitedHalf = false;
- boolean mSFHang = false;
- while (true) {
- ······
- synchronized (this) {
- ······
- for (int i=0; i<mHandlerCheckers.size(); i++) {
- HandlerChecker hc = mHandlerCheckers.get(i);
- hc.scheduleCheckLocked();
- }
- ······
- }
- ······
- }
- }
Each element of the mHandlerCheckers list is scheduled for a check.
4. HandlerChecker's scheduleCheckLocked
- public void scheduleCheckLocked() {
- if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
- // If the target looper has recently been polling, then
- // there is no reason to enqueue our checker on it since that
- // is as good as it not being deadlocked. This avoid having
- // to do a context switch to check the thread. Note that we
- // only do this if mCheckReboot is false and we have no
- // monitors, since those would need to be executed at this point.
- mCompleted = true;
- return;
- }
- if (!mCompleted) {
- // we already have a check in flight, so no need
- return;
- }
- mCompleted = false;
- mCurrentMonitor = null;
- mStartTime = SystemClock.uptimeMillis();
- mHandler.postAtFrontOfQueue(this);
- }
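When mMonitors.size() == 0: the checker only needs to detect whether its thread has stalled, and it does so with mHandler.getLooper().getQueue().isPolling(). If the looper is polling (idle and waiting for messages), the thread is not stuck, so the check is marked complete immediately.
mMonitorChecker always has at least one monitor (the constructor adds BinderThreadMonitor), so for it the important call is mHandler.postAtFrontOfQueue(this), which queues the checker's run method on the target thread: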
- public void run() {
- final int size = mMonitors.size();
- for (int i = 0 ; i < size ; i++) {
- synchronized (Watchdog.this) {
- mCurrentMonitor = mMonitors.get(i);
- }
- mCurrentMonitor.monitor();
- }
- synchronized (Watchdog.this) {
- mCompleted = true;
- mCurrentMonitor = null;
- }
- }
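This calls monitor() on every entry in mMonitors. Only mMonitorChecker has a non-empty mMonitors list; it is filled by the services that register themselves through addMonitor, for example: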
- ActivityManagerService.java
- Watchdog.getInstance().addMonitor(this);
- InputManagerService.java
- Watchdog.getInstance().addMonitor(this);
- PowerManagerService.java
- Watchdog.getInstance().addMonitor(this);
- WindowManagerService.java
- Watchdog.getInstance().addMonitor(this);
The monitor method that gets executed is very simple. ActivityManagerService, for example:
- public void monitor() {
- synchronized (this) { }
- }
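All this does is check whether the service's lock can be acquired, that is, whether the service is stuck holding its lock (for example in a deadlock).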
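To make the empty synchronized block less mysterious, here is a minimal sketch of the idea using plain java.util.concurrent instead of Android's Handler and Looper. The names LockMonitorSketch, FakeService and Checker are hypothetical, and a single-thread executor stands in for the foreground thread's message queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LockMonitorSketch {
    // Plays the role of a service whose monitor() only takes and releases its own lock.
    static final class FakeService {
        void monitor() {
            synchronized (this) { }
        }
    }

    // Simplified checker: "completed" becomes true only after monitor() returns.
    static final class Checker implements Runnable {
        private final FakeService service;
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile boolean completed = true;

        Checker(FakeService service) {
            this.service = service;
        }

        // Rough analogue of scheduleCheckLocked(): mark in flight, then post.
        void schedule(ExecutorService looperStandIn) {
            completed = false;
            looperStandIn.execute(this);
        }

        @Override
        public void run() {
            service.monitor(); // blocks for as long as someone else holds the lock
            completed = true;
            done.countDown();
        }

        boolean isCompleted() {
            return completed;
        }

        boolean awaitDone(long timeoutMillis) {
            try {
                return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Returns {completed while the lock was held, completed after release}.
    static boolean[] demo() {
        FakeService service = new FakeService();
        Checker checker = new Checker(service);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            boolean whileHeld;
            synchronized (service) {
                // Simulates a service stuck inside its own lock.
                checker.schedule(worker);
                checker.awaitDone(200);
                whileHeld = checker.isCompleted();
            }
            boolean afterRelease = checker.awaitDone(5000) && checker.isCompleted();
            return new boolean[] {whileHeld, afterRelease};
        } finally {
            worker.shutdownNow();
        }
    }
}
```

While the caller holds the service's lock, the checker cannot get past monitor(), so its completed flag stays false; that unfinished flag is what the Watchdog loop later interprets as a hang.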
BinderThreadMonitor is an inner class of Watchdog:
- private static final class BinderThreadMonitor implements Watchdog.Monitor {
- @Override
- public void monitor() {
- Binder.blockUntilThreadAvailable();
- }
- }
- android.os.Binder.java
- public static final native void blockUntilThreadAvailable();
- android_util_Binder.cpp
- static void android_os_Binder_blockUntilThreadAvailable(JNIEnv* env, jobject clazz)
- {
- return IPCThreadState::self()->blockUntilThreadAvailable();
- }
- IPCThreadState.cpp
- void IPCThreadState::blockUntilThreadAvailable()
- {
- pthread_mutex_lock(&mProcess->mThreadCountLock);
- while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
- ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
- static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
- static_cast<unsigned long>(mProcess->mMaxThreads));
- pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
- }
- pthread_mutex_unlock(&mProcess->mThreadCountLock);
- }
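This simply waits until the number of binder threads currently executing in the process drops below mMaxThreads; while the count is at or above the maximum (31 for system_server), the caller blocks.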
- ProcessState.cpp
- #define DEFAULT_MAX_BINDER_THREADS 15
- SystemServer.java, however, overrides this value:
- // maximum number of binder threads used for system_server
- // will be higher than the system default
- private static final int sMaxBinderThreads = 31;
- private void run() {
- ······
- BinderInternal.setMaxThreads(sMaxBinderThreads);
- ······
- }
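The native wait loop shown earlier can be mirrored in Java as a rough sketch. BinderPoolSketch and its methods are hypothetical names, and the timeout parameter is an addition for the example only: the native blockUntilThreadAvailable() waits without a timeout, and it is the Watchdog that notices when the wait lasts too long.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of the pthread mutex/condition loop.
final class BinderPoolSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition threadCountDecrement = lock.newCondition();
    private final int maxThreads;
    private int executingThreads;

    BinderPoolSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a transaction.
    void beginTransaction() {
        lock.lock();
        try {
            executingThreads++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finishes; wake anyone waiting for a free thread.
    void endTransaction() {
        lock.lock();
        try {
            executingThreads--;
            threadCountDecrement.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits while every thread is busy. Returns false if the timeout expires first.
    boolean blockUntilThreadAvailable(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (executingThreads >= maxThreads) {
                if (nanos <= 0) {
                    return false;
                }
                nanos = threadCountDecrement.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

With a limit of 31, the monitor returns at once as long as at least one binder thread is idle, and stalls only when all of them are occupied.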
5. Exit after a timeout
- public void run() {
- ······
- Process.killProcess(Process.myPid());
- System.exit(10);
- ······
- }
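The Watchdog kills the process it runs in (system_server) and exits.
II. How it works
1. Every service that needs to be monitored either calls Watchdog's addMonitor to add a monitor to the mMonitors list, or calls addThread to add a looper checker to the mHandlerCheckers list.
2. Once the Watchdog thread starts, its run method executes and loops forever:
- Step one: call HandlerChecker#scheduleCheckLocked on every entry in mHandlerCheckers.
- Step two: periodically check for a timeout. The interval between checks is set by the CHECK_INTERVAL constant, 30 seconds. Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
- COMPLETED: the check has finished;
- WAITING and WAITED_HALF: still waiting but not yet timed out; on WAITED_HALF a trace is dumped once;
- OVERDUE: the check has timed out. The default timeout is 1 minute.
3. If the timeout is reached and some HandlerChecker is still incomplete (OVERDUE), getBlockedCheckersLocked() collects the blocked HandlerCheckers, a description is generated, and logs are saved, including runtime stack traces.
4. Finally, the SystemServer process is killed.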
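The four states in step 2 can be sketched as a small pure function of elapsed time. This is an illustration under assumptions rather than the AOSP code: the class CompletionStateSketch, the method names stateOf and worstOf, and the numeric values of the constants are all chosen for the example; only the state names, the 60-second timeout and the 30-second interval come from the text.

```java
// Hypothetical sketch of how a checker's state could be derived from elapsed time.
final class CompletionStateSketch {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static final long DEFAULT_TIMEOUT = 60_000;           // 1 minute, in milliseconds
    static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2; // 30 seconds

    // State of one checker, given when its check started and the current time.
    static int stateOf(boolean completed, long startTime, long now, long waitMax) {
        if (completed) {
            return COMPLETED;
        }
        long latency = now - startTime;
        if (latency < waitMax / 2) {
            return WAITING;
        } else if (latency < waitMax) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The overall state is the worst state among all checkers.
    static int worstOf(int[] states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

For example, a checker that has been pending for 30 seconds with a 60-second limit reports WAITED_HALF, and a single OVERDUE checker is enough to make the overall result OVERDUE.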
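Summary
Watchdog is a thread that monitors whether the system services are running normally and have not deadlocked.
HandlerChecker checks both handlers and monitors.
A monitor detects a deadlock by trying to acquire the service's lock.
After 30 seconds without completion a log (trace) is written; after 60 seconds the system restarts.
The Watchdog kills its own process, so the system_server process ID changes afterwards.
This article is reposted from the WeChat official account "Android开发编程".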