一、背景
1. 讲故事
上个月有位朋友wx找到我,说他的程序存在内存泄漏问题,寻求如何解决? 如下图所示:
从截图中可以看出,这位朋友对 windbg 的操作还是有些熟悉的,可能缺乏一定的实操经验,所以用了几个命令之后就不知道怎么排查下去了。
既然找到我,那就以我的个人经验在他的dump上继续分析寻找罪魁祸首,闲话不多说,上windbg说话。
二、Windbg 分析
1. 真的存在内存泄漏吗?
追这个系列的朋友应该知道,我无数次的用 !address -summary 和 !eeheap -gc 这两个命令来判断当前的内存泄漏是属于托管层还是非托管层。
- 0:000> !address -summary
- --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- Free 358 7dfc`67f60000 ( 125.986 TB) 98.43%
- <unknown> 1087 203`88b6e000 ( 2.014 TB) 99.99% 1.57%
- Image 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
- Heap 249 0`03453000 ( 52.324 MB) 0.00% 0.00%
- Stack 66 0`01fc0000 ( 31.750 MB) 0.00% 0.00%
- Other 10 0`001d1000 ( 1.816 MB) 0.00% 0.00%
- TEB 22 0`0002c000 ( 176.000 kB) 0.00% 0.00%
- PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
- --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_MAPPED 183 200`00d06000 ( 2.000 TB) 99.30% 1.56%
- MEM_PRIVATE 1252 3`8d479000 ( 14.207 GB) 0.69% 0.01%
- MEM_IMAGE 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
- --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_FREE 358 7dfc`67f60000 ( 125.986 TB) 98.43%
- MEM_RESERVE 749 200`28a9b000 ( 2.001 TB) 99.33% 1.56%
- MEM_COMMIT 2218 3`6f5f5000 ( 13.740 GB) 0.67% 0.01%
- 0:000> !eeheap -gc
- Number of GC Heaps: 1
- generation 0 starts at 0x00000026DA8DA928
- generation 1 starts at 0x00000026DA7FC348
- generation 2 starts at 0x00000024C4691000
- ephemeral segment allocation context: none
- segment begin allocated size
- 00000024C4690000 00000024C4691000 00000024D468FF28 0xfffef28(268431144)
- 00000024EECF0000 00000024EECF1000 00000024FECF0000 0xffff000(268431360)
- 000000248D6F0000 000000248D6F1000 000000249D6EFEF8 0xfffeef8(268431096)
- ...
- 00000026D66D0000 00000026D66D1000 00000026DBA3CA30 0x536ba30(87472688)
- Large object heap starts at 0x00000024D4691000
- segment begin allocated size
- 00000024D4690000 00000024D4691000 00000024DC67C318 0x7feb318(134132504)
- 00000024E60F0000 00000024E60F1000 00000024EE0637C8 0x7f727c8(133638088)
- 0000002482140000 0000002482141000 000000248A08F338 0x7f4e338(133489464)
- 00000024A6770000 00000024A6771000 00000024AE76F6C0 0x7ffe6c0(134211264)
- ...
- 000000278E6D0000 000000278E6D1000 000000279635F2D0 0x7c8e2d0(130605776)
- 00000029233E0000 00000029233E1000 000000292AF672F8 0x7b862f8(129524472)
- 000000292B3E0000 000000292B3E1000 0000002931A5ED60 0x667dd60(107470176)
- 000000299B3E0000 000000299B3E1000 00000029A20095B0 0x6c285b0(113411504)
- 000000281E6D0000 000000281E6D1000 0000002825CD3F58 0x7602f58(123744088)
- 00000028266D0000 00000028266D1000 000000282D5CAD50 0x6ef9d50(116366672)
- 000000282E6D0000 000000282E6D1000 0000002833CA0880 0x55cf880(89979008)
- 00000029A33E0000 00000029A33E1000 00000029A684D300 0x346c300(54969088)
- Total Size: Size: 0x353f96d88 (14293757320) bytes.
- ------------------------------
- GC Heap Size: Size: 0x353f96d88 (14293757320) bytes.
从输出看,当前进程占用 MEM_COMMIT=13.7G,托管堆内存占用 14293757320 = 13.3G,很明显这属于简单模式的 托管内存泄漏,根据经验,托管堆上可能有什么大对象,这里用 !dumpheap -stat 命令。
- 0:000> !dumpheap -stat
- Statistics:
- MT Count TotalSize Class Name
- 00007ff9ed6ea268 3956842 94964208 System.Collections.Generic.Dictionary`2+KeyCollection[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
- 00007ff9ed5e6d28 3842435 166405016 Serilog.Parsing.MessageTemplateToken[]
- 00007ff9ed5e5e28 3842434 184436832 Serilog.Events.MessageTemplate
- 00007ff9ecccf090 4011012 203304420 System.Int32[]
- 00007ff9ed647078 3956849 253238336 Serilog.Events.LogEvent
- 00007ff9ed6e7b48 3956849 284893128 System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
- 00007ff9ed5e74e8 9259598 296307136 Serilog.Parsing.TextToken
- 00007ff9ed6471b0 12551808 301243392 Serilog.Events.ScalarValue
- 00007ff9ed6e8308 3956849 729078048 System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]][]
- 00007ff9eccb1e18 16546412 3987811940 System.String
- 00000024c3b8faf0 82904 7382993568 Free
我去,托管堆最大的对象居然是 Free,大概占用 7.3G, 这就牛了,不按套路出牌哈,为了更好的理解,先来简要解释下 free 对象。
2. 简单解释 free
其实简而言之,free对象是被gc标记成已回收的空闲块但并未提交给操作系统释放的对象,那怎么去证明呢? 可以先到托管堆上找一个free块。
- 0:000> !dumpheap -type Free
- Address MT Size
- 00000024c4691000 00000024c3b8faf0 24 Free
- 00000024c46a2448 00000024c3b8faf0 24 Free
- 00000024c46b26d8 00000024c3b8faf0 40 Free
- 00000024c47e4418 00000024c3b8faf0 40 Free
- 00000024c4925680 00000024c3b8faf0 40 Free
- 00000024c49284a8 00000024c3b8faf0 64 Free
- 00000024c4947a90 00000024c3b8faf0 192 Free
- 00000024c4951f70 00000024c3b8faf0 48 Free
- 000000249d6ea3a8 00000024c3b8faf0 640 Free
(1)用 !do 命令查看是否标记为free块
- 0:000> !do 000000249d6ea3a8
- Free Object
- Size: 640(0x280) bytes
(2)用 dc 看一下这个 free 块的内容,是否有gc回收的痕迹。
- 0:000> dc 000000249d6ea3a8 000000249d6ea3a8+0x280
- 00000024`9d6ea3a8 c3b8faf0 00000024 00000268 00000000 ....$...h.......
- 00000024`9d6ea3b8 9d6ea6d0 00000024 00000000 00000000 ..n.$...........
- 00000024`9d6ea3c8 ed3ae2b8 00007ff9 9d6ea3a8 00000024 ..:.......n.$...
- 00000024`9d6ea3d8 00000000 00000000 edcc9118 00007ff9 ................
- 00000024`9d6ea3e8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea3f8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea408 eeb07a50 00007ff9 9d6ea3c8 00000024 Pz........n.$...
- 00000024`9d6ea418 00000000 00000000 ef292ee8 00007ff9 ..........).....
- 00000024`9d6ea428 9d6ea408 00000024 00000000 00000000 ..n.$...........
- ...
- 00000024`9d6ea4a8 eeb0a158 00007ff9 9d6ea420 00000024 X....... .n.$...
- 00000024`9d6ea4b8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea4c8 ef293818 00007ff9 9d6ea4a8 00000024 .8).......n.$...
- 00000024`9d6ea4d8 00000000 00000000 ee8357e0 00007ff9 .........W......
- ...
- 00000024`9d6ea508 eed37b40 00007ff9 00000000 00000000 @{..............
- 00000024`9d6ea518 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea528 c4699b48 00000024 00000000 00000000 H.i.$...........
- 00000024`9d6ea538 00000000 07000440 00000001 00000000 ....@...........
- 00000024`9d6ea548 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea558 00000000 00000000 ef2af6e0 00007ff9 ..........*.....
- 00000024`9d6ea568 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea578 00000000 00000000 c4699b48 00000024 ........H.i.$...
- 00000024`9d6ea588 00000000 00000000 00000000 07000400 ................
- ...
- 00000024`9d6ea628 ef2afd08 ..*.
可以看到,这个free块生前是有一些残留的内容字符,好了,对 free 块有基本了解后,接下来我们继续探究下。
3. 到底是什么阻止了free块的合并?
按照正常逻辑,大多free块会在gc回收完之后合并内存时,那些被清空后的segment会被操作系统释放的,但这个dump并没有,也就说明这里有什么东西阻止了free块的合并,那到底是什么呢? 有经验的朋友会说,可以观察下 gc 的句柄表,命令为 !gchandles -stat 。
- 0:000> !gchandles -stat
- Statistics:
- MT Count TotalSize Class Name
- ...
- 00007ff9ed15c0f0 1008 72576 System.Reflection.Emit.DynamicResolver
- 00007ff9ecbf6618 38 409344 System.Object[]
- Total 1784 objects
- Handles:
- Strong Handles: 233
- Pinned Handles: 16
- Async Pinned Handles: 18
- Ref Count Handles: 1
- Weak Long Handles: 1327
- Weak Short Handles: 144
- Dependent Handles: 45
从输出看,这里并没有什么可疑的地方,那怎么办呢? 实操经验多的话,这里还是有一些经验值得分享的,比如观察 free 在 heap 上的布局特征,往往就有重大发现。
4. 查看 free 块的布局特征
为了简化输出结果,我把范围限定到 heap 上某一个 segment 上,比如这里的: 00000029233E0000 000000292AF672F8,所以命令就是 !dumpheap 00000029233E0000 000000292AF672F8。
- 0:000> !dumpheap 00000029233E0000 000000292AF672F8
- Address MT Size
- 00000029233e1000 00000024c3b8faf0 8291896 Free
- 0000002923bc9638 00007ff9eccb1e18 108448
- 0000002923be3dd8 00000024c3b8faf0 29931248 Free
- 000000292586f4c8 00007ff9eccb1e18 301328
- 00000029258b8dd8 00000024c3b8faf0 41384784 Free
- 0000002928030928 00007ff9eccb1e18 301328
- 000000292807a238 00000024c3b8faf0 2542664 Free
- 00000029282e6e80 00007ff9eccb1e18 108448
- 0000002928301620 00000024c3b8faf0 29915032 Free
- 0000002929f88db8 00007ff9eccb1e18 301328
- 0000002929fd26c8 00000024c3b8faf0 2746688 Free
- 000000292a271008 00007ff9eccb1e18 291304
- 000000292a2b81f0 00000024c3b8faf0 1019600 Free
- 000000292a3b10c0 00007ff9eccb1e18 108448
- 000000292a3cb860 00000024c3b8faf0 10601048 Free
- 000000292ade7ab8 00007ff9eccb1e18 301328
- 000000292ae313c8 00000024c3b8faf0 280808 Free
- 000000292ae75cb0 00007ff9eccb1e18 280854
- 000000292aeba5c8 00000024c3b8faf0 416584 Free
- 000000292af20110 00007ff9eccb1e18 291304
- Statistics:
- MT Count TotalSize Class Name
- 00007ff9eccb1e18 10 2394118 System.String
- 00000024c3b8faf0 10 127130352 Free
- Total 20 objects
真是一看吓一跳,free 和 object 呈交替状,这就是为什么free块不能被合并的真正原因,说实话这种教科书式的 内存碎片化 dump,真是可遇不可求,接下来就抽几个 free 之间的 object 对象,看看到底是被什么引用着导致gc回收不掉。
5. 寻找 object 的引用链
要想查看 object 到底被谁引用着,可以用 !gcroot 命令,这里我抽二个看看。
- 0:000> !gcroot 0000002923bc9638
- Thread 1878:
- 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
- rbp+10: 00000024c39be520
- -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
- -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
- -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
- -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
- -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
- -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
- -> 00000024C48AC808 Serilog.Core.Logger
- -> 00000024C48AC760 Serilog.Core.Logger
- -> 00000024C47AD468 Serilog.Core.Logger
- -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
- -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
- -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
- -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000026E8C1A800 Serilog.Events.LogEvent[]
- -> 00000026148D3308 Serilog.Events.LogEvent
- -> 00000026148D4EF0 Serilog.Events.MessageTemplate
- -> 0000002923BC9638 System.String
- Found 1 unique roots (run '!gcroot -all' to see all roots).
- 0:000> !gcroot 000000292586f4c8
- Thread 1878:
- 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
- rbp+10: 00000024c39be520
- -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
- -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
- -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
- -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
- -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
- -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
- -> 00000024C48AC808 Serilog.Core.Logger
- -> 00000024C48AC760 Serilog.Core.Logger
- -> 00000024C47AD468 Serilog.Core.Logger
- -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
- -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
- -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
- -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000026E8C1A800 Serilog.Events.LogEvent[]
- -> 0000002614BB7AC8 Serilog.Events.LogEvent
- -> 0000002616D3CC40 Serilog.Events.MessageTemplate
- -> 000000292586F4C8 System.String
- Found 1 unique roots (run '!gcroot -all' to see all roots).
从引用链看,这些 object 都是通过 Serilog 日志组件发送给 Grafana.Loki ,通过对引用链对象的追踪,我发现 System.Collections.Generic.Queue 有重大问题,截图如下:
对,这个 queue 居然有 395w 的积压,到底积压了什么东西,可以用 !wdo 看一下string内容。
看样子这是一个失败重试,分析到这里思路大概就清晰了,就是 Grafana.Loki 或者 Serilog 组件出了什么问题,导致日志发送不到 Loki 中或者发送速度过慢,然后不断的积压所致,接下来把分析到的这些信息和朋友做了一个沟通,截图如下:
三、总结
本次内存碎片化的主要原因在于 Serilog 对接 Loki 的过程中产生的395w的queue积压所致,但我也只能分析到这里了,至于为什么有积压,这个还得朋友进一步调试分析,我相信这个问题很快就能得到解决。
本文转载自微信公众号「一线码农聊技术」,可以通过以下二维码关注。转载本文请联系一线码农聊技术公众号。