Summary: Ceph monitor failure and OSD disk crash caused by an IP address change

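The company moved offices, so every server got a new IP address. After I configured the new IPs on the Ceph servers and started them, the monitor processes failed to start: each one kept trying to bind to its old IP address, which of course could not succeed. At first I suspected the servers' IP settings, but changing the hostname, ceph.conf, and so on made no difference. Step-by-step analysis showed that the monmap still held the old IP addresses. Ceph reads the monmap to start the monitor process, so the monmap itself has to be modified. The procedure is as follows: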

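#Build a monmap that lists the monitors at their new addresses
# monmaptool --create --add mon0 192.168.32.2:6789 --add osd1 192.168.32.3:6789 \
  --add osd2 192.168.32.4:6789 --fsid 61a520db-317b-41f1-9752-30cedc5ffb9a \
  --clobber monmap.bin

#Retrieve the monitor map from the cluster instead; this only works while the
#monitors still have quorum, and it overwrites monmap.bin with the current
#(old-address) map, so skip it when the monitors cannot start
# ceph mon getmap -o monmap.bin

#Check new contents
# monmaptool --print monmap.bin

#Inject the monmap (on each monitor's own host, with its ceph-mon daemon stopped)
# ceph-mon -i mon0 --inject-monmap monmap.bin
# ceph-mon -i osd1 --inject-monmap monmap.bin
# ceph-mon -i osd2 --inject-monmap monmap.bin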
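
The `monmaptool --create` line gets long and is easy to mistype when there are several monitors. As a minimal sketch, the helper below only prints the command for review and does not run it. The function name `monmap_create_cmd` and the `name=ip` argument format are made up for this illustration, and port 6789 is assumed because it is the port used above.

```shell
#!/bin/sh
# Sketch: print (do not run) a monmaptool command for a set of monitors.
# Usage: monmap_create_cmd FSID OUTFILE name=ip [name=ip ...]
# monmap_create_cmd is a hypothetical helper; port 6789 is assumed.
monmap_create_cmd() {
    fsid=$1
    outfile=$2
    shift 2
    cmd="monmaptool --create"
    for pair in "$@"; do
        name=${pair%%=*}   # text before the first '='
        ip=${pair#*=}      # text after the first '='
        cmd="$cmd --add $name $ip:6789"
    done
    printf '%s --fsid %s --clobber %s\n' "$cmd" "$fsid" "$outfile"
}

# The three monitors from this post; check the printed command before running it.
monmap_create_cmd 61a520db-317b-41f1-9752-30cedc5ffb9a monmap.bin \
    mon0=192.168.32.2 osd1=192.168.32.3 osd2=192.168.32.4
```

The output file is named `monmap.bin` here so that it matches the file passed to `--inject-monmap`.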

After that the monitors started normally.
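
One way to confirm that the monitors now advertise the new addresses is to pull them out of the `monmap` line printed by `ceph -s`. This is only a sketch: `mon_addrs` is a hypothetical helper name, and the pattern assumes the status line format shown in this post, which may differ in other Ceph releases.

```shell
#!/bin/sh
# Sketch: read `ceph -s` text on stdin and print one "name=address" per monitor.
# Assumes a line of the form "monmap eN: N mons at {a=ip:port/0,b=...}, ...".
mon_addrs() {
    sed -n 's/^ *monmap e[0-9]*: [0-9]* mons at {\([^}]*\)}.*$/\1/p' | tr ',' '\n'
}

# Sample line copied from the status output in this post.
# On a live cluster: ceph -s | mon_addrs
mon_addrs <<'EOF'
     monmap e3: 3 mons at {mon0=192.168.32.2:6789/0,osd1=192.168.32.3:6789/0,osd2=192.168.32.4:6789/0}, election epoch 76, quorum 0,1,2 mon0,osd1,osd2
EOF
```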

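However, one OSD disk then went down, as described in the previous post. After searching around, all I found was a note on the official Ceph site saying that this is a Ceph bug. Unable to repair it, I deleted the OSD and reinstalled it: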

# service ceph stop osd.4
#No need to run ceph osd crush remove osd.4
# ceph auth del osd.4
# ceph osd rm 4

# umount /cephmp1
# mkfs.xfs -f /dev/sdc
# mount /dev/sdc /cephmp1
#Running "create" here does not install the osd correctly
# ceph-deploy osd prepare osd2:/cephmp1:/dev/sdf1
# ceph-deploy osd activate osd2:/cephmp1:/dev/sdf1
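
Because the removal steps are destructive, it can help to generate them for a given OSD id and read them over before running anything. The sketch below prints only the three removal commands used above; `osd_remove_cmds` is a hypothetical helper name, and the reformat and `ceph-deploy` steps are left out because their device paths are specific to this host.

```shell
#!/bin/sh
# Sketch: print the removal steps for one OSD so they can be reviewed first.
# Usage: osd_remove_cmds OSD_ID
# osd_remove_cmds is a hypothetical helper; it runs nothing itself.
osd_remove_cmds() {
    id=$1
    # Refuse anything that is not a plain non-negative integer.
    case $id in
        ''|*[!0-9]*) echo "osd id must be a number: '$id'" >&2; return 1 ;;
    esac
    printf 'service ceph stop osd.%s\n' "$id"
    printf 'ceph auth del osd.%s\n' "$id"
    printf 'ceph osd rm %s\n' "$id"
}

# The OSD replaced in this post.
osd_remove_cmds 4
```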

Once that was done I restarted the OSD and it ran successfully. Ceph rebalances the data automatically; the final state was:

[root@osd2 ~]# ceph -s  
    cluster 61a520db-317b-41f1-9752-30cedc5ffb9a 
     health HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec  
     monmap e3: 3 mons at {mon0=192.168.32.2:6789/0,osd1=192.168.32.3:6789/0,osd2=192.168.32.4:6789/0}, election epoch 76, quorum 0,1,2 mon0,osd1,osd2  
     osdmap e689: 6 osds: 6 up, 6 in 
      pgmap v189608: 704 pgs, 5 pools, 34983 MB data, 8966 objects  
            69349 MB used, 11104 GB / 11172 GB avail  
                 695 active+clean  
                   9 incomplete 

Nine PGs ended up in the incomplete state.

[root@osd2 ~]# ceph health detail  
HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec; 1 osds have slow requests  
pg 5.95 is stuck inactive for 838842.634721, current state incomplete, last acting [1,4]  
pg 5.66 is stuck inactive since forever, current state incomplete, last acting [4,0]  
pg 5.de is stuck inactive for 808270.105968, current state incomplete, last acting [0,4]  
pg 5.f5 is stuck inactive for 496137.708887, current state incomplete, last acting [0,4]  
pg 5.11 is stuck inactive since forever, current state incomplete, last acting [4,1]  
pg 5.30 is stuck inactive for 507062.828403, current state incomplete, last acting [0,4]  
pg 5.bc is stuck inactive since forever, current state incomplete, last acting [4,1]  
pg 5.a7 is stuck inactive for 499713.993372, current state incomplete, last acting [1,4]  
pg 5.22 is stuck inactive for 496125.831204, current state incomplete, last acting [0,4]  
pg 5.95 is stuck unclean for 838842.634796, current state incomplete, last acting [1,4]  
pg 5.66 is stuck unclean since forever, current state incomplete, last acting [4,0]  
pg 5.de is stuck unclean for 808270.106039, current state incomplete, last acting [0,4]  
pg 5.f5 is stuck unclean for 496137.708958, current state incomplete, last acting [0,4]  
pg 5.11 is stuck unclean since forever, current state incomplete, last acting [4,1]  
pg 5.30 is stuck unclean for 507062.828475, current state incomplete, last acting [0,4]  
pg 5.bc is stuck unclean since forever, current state incomplete, last acting [4,1]  
pg 5.a7 is stuck unclean for 499713.993443, current state incomplete, last acting [1,4]  
pg 5.22 is stuck unclean for 496125.831274, current state incomplete, last acting [0,4]  
pg 5.de is incomplete, acting [0,4]  
pg 5.bc is incomplete, acting [4,1]  
pg 5.a7 is incomplete, acting [1,4]  
pg 5.95 is incomplete, acting [1,4]  
pg 5.66 is incomplete, acting [4,0]  
pg 5.30 is incomplete, acting [0,4]  
pg 5.22 is incomplete, acting [0,4]  
pg 5.11 is incomplete, acting [4,1]  
pg 5.f5 is incomplete, acting [0,4]  
2 ops are blocked > 8388.61 sec  
1 ops are blocked > 4194.3 sec  
2 ops are blocked > 8388.61 sec on osd.0 
1 ops are blocked > 4194.3 sec on osd.0 
1 osds have slow requests 
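
In the output above, osd.4 (the OSD that was wiped and rebuilt) appears in the acting set of every incomplete PG. A quick way to see that pattern is to count how often each OSD shows up in those acting sets. The sketch below assumes the `pg <id> is incomplete, acting [a,b]` line format shown above; `acting_osd_counts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `ceph health detail` text on stdin and print "osd.N count",
# counting how many incomplete PGs list each OSD in their acting set.
acting_osd_counts() {
    awk '
        /^pg [^ ]+ is incomplete, acting \[/ {
            set = $0
            sub(/^.*acting \[/, "", set)   # drop everything up to "["
            sub(/\].*$/, "", set)          # drop "]" and anything after it
            n = split(set, osds, ",")
            for (i = 1; i <= n; i++) count[osds[i]]++
        }
        END {
            for (o in count) printf "osd.%s %d\n", o, count[o]
        }
    ' | sort
}

# Three sample lines copied from the output above.
# On a live cluster: ceph health detail | acting_osd_counts
acting_osd_counts <<'EOF'
pg 5.de is incomplete, acting [0,4]
pg 5.bc is incomplete, acting [4,1]
pg 5.a7 is incomplete, acting [1,4]
EOF
```

Only the `is incomplete, acting [...]` lines are counted, so the `stuck inactive` and `stuck unclean` lines that repeat the same PGs do not inflate the totals.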

More searching turned up nothing. Here is what someone who ran into the same problem wrote:

I already tried "ceph pg repair 4.77", stop/start OSDs, "ceph osd lost", "ceph pg force_create_pg 4.77".  
Most scary thing is "force_create_pg" does not work. At least it should be a way to wipe out a incomplete PG  
without destroying a whole pool. 

I tried the methods above and none of them worked. For now the problem remains unsolved, which is rather frustrating.

PS: common PG operations

[root@osd2 ~]# ceph pg map 5.de  
osdmap e689 pg 5.de (5.de) -> up [0,4] acting [0,4]  
[root@osd2 ~]# ceph pg 5.de query  
[root@osd2 ~]# ceph pg scrub 5.de  
instructing pg 5.de on osd.0 to scrub  
[root@osd2 ~]# ceph pg 5.de mark_unfound_lost revert  
pg has no unfound objects  
#ceph pg dump_stuck stale  
#ceph pg dump_stuck inactive  
#ceph pg dump_stuck unclean  
[root@osd2 ~]# ceph osd lost 1  
Error EPERM: are you SURE?  this might mean real, permanent data loss.  pass --yes-i-really-mean-it if you really do.  
[root@osd2 ~]#   
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it  
osd.4 is not down or doesn't exist  
[root@osd2 ~]# service ceph stop osd.4  
=== osd.4 ===   
Stopping Ceph osd.4 on osd2...kill 22287...kill 22287...done  
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it  
marked osd lost in epoch 690 
[root@osd1 mnt]# ceph pg repair 5.de  
instructing pg 5.de on osd.0 to repair  
[root@osd1 mnt]# ceph pg repair 5.de  
instructing pg 5.de on osd.0 to repair 
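
The commands above act on a single PG. To inspect every incomplete PG, one option is to extract the PG ids from `ceph health detail` and generate a `ceph pg <id> query` command for each. This is a sketch under the assumption that the health output uses the line format shown in this post; `incomplete_pgs` and `pg_query_cmds` are hypothetical helper names, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: list the ids of PGs reported as incomplete in `ceph health detail`
# text read from stdin (one id per line, sorted, duplicates removed).
incomplete_pgs() {
    awk '/^pg [^ ]+ is incomplete, acting/ { print $2 }' | sort -u
}

# Print one `ceph pg <id> query` command per incomplete PG.
pg_query_cmds() {
    incomplete_pgs | while read -r pg; do
        printf 'ceph pg %s query\n' "$pg"
    done
}

# Sample lines copied from the health output in this post.
# On a live cluster: ceph health detail | pg_query_cmds
pg_query_cmds <<'EOF'
pg 5.de is incomplete, acting [0,4]
pg 5.bc is incomplete, acting [4,1]
EOF
```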

Source: http://my.oschina.net/renguijiayi/blog/360274