MySQL锁等待与死锁问题分析-mysql死锁排查及解决

本文转载自微信公众号「MySQL技术」，作者MySQL技术。转载本文请联系MySQL技术公众号。

前言：

在 MySQL 运维过程中，锁等待和死锁问题是令各位 DBA 及开发同学非常头痛的事。出现此类问题会造成业务回滚、卡顿等故障，特别是业务繁忙的系统，出现死锁问题后影响会更严重。本篇文章我们一起来学习下什么是锁等待及死锁，出现此类问题又应该如何分析处理呢?

1.了解锁等待与死锁

出现锁等待或死锁的原因是访问数据库需要加锁，那你可能要问了，为啥要加锁呢?原因是为了确保并发更新场景下的数据正确性，保证数据库事务的隔离性。

试想一个场景，如果你要去图书馆借一本《高性能MySQL》，为了防止有人提前把这本书借走，你可以提前进行预约(加锁)，这把锁可以怎么加?

封锁图书馆(数据库级别的锁)
把数据库相关的书都锁住(表级别的锁)
只锁 MySQL 相关的书(页级别的锁)
只锁《高性能MySQL》这本书(行级别的锁)

锁的粒度越细，并发级别越高，实现也更复杂。

锁等待也可称为事务等待，后执行的事务等待前面处理的事务释放锁，但是等待时间超过了 MySQL 的锁等待时间，就会引发这个异常。等待超时后的报错为“Lock wait timeout exceeded...”。

死锁发生的原因是两个事务互相等待对方释放相同资源的锁，从而造成的死循环。产生死锁后会立即报错“Deadlock found when trying to get lock...”。

2.现象复现及处理

下面我们以 MySQL 5.7.23 版本为例(隔离级别是 RR )，来复现下上述两种异常现象。

mysql> show create table test_tb\G 
*************************** 1. row *************************** 
       Table: test_tb 
Create Table: CREATE TABLE `test_tb` ( 
  `id` int(11) NOT NULL AUTO_INCREMENT, 
  `col1` varchar(50) NOT NULL DEFAULT '', 
  `col2` int(11) NOT NULL DEFAULT '1', 
  `col3` varchar(20) NOT NULL DEFAULT '', 
  PRIMARY KEY (`id`), 
  KEY `idx_col1` (`col1`) 
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8 
1 row in set (0.00 sec) 
 
mysql> select * from test_tb; 
+----+------+------+------+ 
| id | col1 | col2 | col3 | 
+----+------+------+------+ 
|  1 | fdg  |    1 | abc  | 
|  2 | a    |    2 | fg   | 
|  3 | ghrv |    2 | rhdv | 
+----+------+------+------+ 
3 rows in set (0.00 sec) 
 
# 事务一首先执行 
mysql> begin; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> select * from test_tb where col1 = 'a' for update; 
+----+------+------+------+ 
| id | col1 | col2 | col3 | 
+----+------+------+------+ 
|  2 | a    |    2 | fg   | 
+----+------+------+------+ 
1 row in set (0.00 sec) 
 
# 事务二然后执行 
mysql> begin; 
Query OK, 0 rows affected (0.01 sec) 
 
mysql> update test_tb set col2 = 1 where col1 = 'a'; 
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

出现上种异常的原因是事务二在等待事务一的行锁，但事务一一直没提交，等待超时而报错。InnoDB 行锁等待超时时间由 innodb_lock_wait_timeout 参数控制，此参数默认值为 50 ，单位为秒，即默认情况下，事务二会等待 50s ，若仍拿不到行锁则会报等待超时异常并回滚此条语句。

对于 5.7 版本，出现锁等待时，我们可以查看 information_schema 中的几张系统表来查询事务状态。

innodb_trx 当前运行的所有事务。
innodb_locks 当前出现的锁。
innodb_lock_waits 锁等待的对应关系

# 锁等待发生时 查看innodb_trx表可以看到所有事务  
# trx_state值为LOCK WAIT 则代表该事务处于等待状态 
 
mysql> select * from information_schema.innodb_trx\G 
*************************** 1. row *************************** 
                    trx_id: 38511 
                 trx_state: LOCK WAIT 
               trx_started: 2021-03-24 17:20:43 
     trx_requested_lock_id: 38511:156:4:2 
          trx_wait_started: 2021-03-24 17:20:43 
                trx_weight: 2 
       trx_mysql_thread_id: 1668447 
                 trx_query: update test_tb set col2 = 1 where col1 = 'a' 
       trx_operation_state: starting index read 
         trx_tables_in_use: 1 
         trx_tables_locked: 1 
          trx_lock_structs: 2 
     trx_lock_memory_bytes: 1136 
           trx_rows_locked: 1 
         trx_rows_modified: 0 
   trx_concurrency_tickets: 0 
       trx_isolation_level: REPEATABLE READ 
         trx_unique_checks: 1 
    trx_foreign_key_checks: 1 
trx_last_foreign_key_error: NULL 
 trx_adaptive_hash_latched: 0 
 trx_adaptive_hash_timeout: 0 
          trx_is_read_only: 0 
trx_autocommit_non_locking: 0 
*************************** 2. row *************************** 
                    trx_id: 38510 
                 trx_state: RUNNING 
               trx_started: 2021-03-24 17:18:54 
     trx_requested_lock_id: NULL 
          trx_wait_started: NULL 
                trx_weight: 4 
       trx_mysql_thread_id: 1667530 
                 trx_query: NULL 
       trx_operation_state: NULL 
         trx_tables_in_use: 0 
         trx_tables_locked: 1 
          trx_lock_structs: 4 
     trx_lock_memory_bytes: 1136 
           trx_rows_locked: 3 
         trx_rows_modified: 0 
   trx_concurrency_tickets: 0 
       trx_isolation_level: REPEATABLE READ 
         trx_unique_checks: 1 
    trx_foreign_key_checks: 1 
trx_last_foreign_key_error: NULL 
 trx_adaptive_hash_latched: 0 
 trx_adaptive_hash_timeout: 0 
          trx_is_read_only: 0 
trx_autocommit_non_locking: 0 
2 rows in set (0.00 sec) 
 
# innodb_trx 字段值含义 
trx_id：事务ID。 
trx_state：事务状态，有以下几种状态：RUNNING、LOCK WAIT、ROLLING BACK 和 COMMITTING。 
trx_started：事务开始时间。 
trx_requested_lock_id：事务当前正在等待锁的标识，可以和 INNODB_LOCKS 表 JOIN 以得到更多详细信息。 
trx_wait_started：事务开始等待的时间。 
trx_weight：事务的权重。 
trx_mysql_thread_id：事务线程 ID，可以和 PROCESSLIST 表 JOIN。 
trx_query：事务正在执行的 SQL 语句。 
trx_operation_state：事务当前操作状态。 
trx_tables_in_use：当前事务执行的 SQL 中使用的表的个数。 
trx_tables_locked：当前执行 SQL 的行锁数量。 
trx_lock_structs：事务保留的锁数量。 
trx_isolation_level：当前事务的隔离级别。 
 
# sys.innodb_lock_waits 视图也可看到事务等待状况，且给出了杀链接的SQL 
mysql> select * from sys.innodb_lock_waits\G 
*************************** 1. row *************************** 
                wait_started: 2021-03-24 17:20:43 
                    wait_age: 00:00:22 
               wait_age_secs: 22 
                locked_table: `testdb`.`test_tb` 
                locked_index: idx_col1 
                 locked_type: RECORD 
              waiting_trx_id: 38511 
         waiting_trx_started: 2021-03-24 17:20:43 
             waiting_trx_age: 00:00:22 
     waiting_trx_rows_locked: 1 
   waiting_trx_rows_modified: 0 
                 waiting_pid: 1668447 
               waiting_query: update test_tb set col2 = 1 where col1 = 'a' 
             waiting_lock_id: 38511:156:4:2 
           waiting_lock_mode: X 
             blocking_trx_id: 38510 
                blocking_pid: 1667530 
              blocking_query: NULL 
            blocking_lock_id: 38510:156:4:2 
          blocking_lock_mode: X 
        blocking_trx_started: 2021-03-24 17:18:54 
            blocking_trx_age: 00:02:11 
    blocking_trx_rows_locked: 3 
  blocking_trx_rows_modified: 0 
     sql_kill_blocking_query: KILL QUERY 1667530 
sql_kill_blocking_connection: KILL 1667530

sys.innodb_lock_waits 视图整合了事务等待状况，同时给出杀掉堵塞源端的 kill 语句。不过是否要杀掉链接还是需要综合考虑的。

死锁与锁等待稍有不同，我们同样也来简单复现下死锁现象。

# 开启两个事务 
# 事务一执行 
mysql> update test_tb set col2 = 1 where col1 = 'a'; 
Query OK, 1 row affected (0.00 sec) 
Rows matched: 1  Changed: 1  Warnings: 0 
 
# 事务二执行 
mysql> update test_tb set col2 = 1 where id = 3; 
Query OK, 1 row affected (0.00 sec) 
Rows matched: 1  Changed: 1  Warnings: 0 
 
# 回到事务一执行 回车后 此条语句处于锁等待状态 
mysql> update test_tb set col1 = 'abcd' where id = 3; 
Query OK, 1 row affected (5.71 sec) 
Rows matched: 1  Changed: 1  Warnings: 0 
 
# 回到事务二再执行 此时二者相互等待发生死锁 
mysql> update test_tb set col3 = 'gddx' where col1 = 'a'; 
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

发生死锁后会选择一个事务进行回滚，想查明死锁原因，可以执行 show engine innodb status 来查看死锁日志，根据死锁日志，结合业务逻辑来进一步定位死锁原因。

在实际应用中，我们要尽量避免死锁现象的发生，可以从以下几个方面入手：

事务尽可能小，不要讲复杂逻辑放进一个事务里。
涉及多行记录时，约定不同事务以相同顺序访问。
业务中要及时提交或者回滚事务，可减少死锁产生的概率。
表要有合适的索引。
可尝试将隔离级别改为 RC 。

总结：

本篇文章简单介绍了锁等待及死锁发生的原因，其实真实业务中发生死锁还是很难分析的，需要一定的经验积累。本篇文章只是面向初学者，希望各位对死锁能够有个初印象。