-
Notifications
You must be signed in to change notification settings - Fork 10
Open
Labels
Description
On ncei.nceas.ucsb.edu, which is a small test server with 8GB of RAM, we're seeing the kernel oom-killer repeatedly kill the python process that results from the cron job running python manage.py process_replication_queue. Since the job runs once per minute, and the vast majority of them succeed, I suspect it's jobs that are replicating large files across the wire that are causing the issue, and that perhaps there's some code that is spooling data to memory prior to writing to disk. Here's a sample from /var/log/syslog:
Dec 13 06:35:08 ncei kernel: [10412914.856904] gmain invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Dec 13 06:35:08 ncei kernel: [10412914.856912] gmain cpuset=/ mems_allowed=0
Dec 13 06:35:08 ncei kernel: [10412914.856924] CPU: 3 PID: 1175 Comm: gmain Not tainted 4.4.0-91-generic #114-Ubuntu
Dec 13 06:35:08 ncei kernel: [10412914.856927] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Dec 13 06:35:08 ncei kernel: [10412914.856932] 0000000000000286 9ecc974be8221a0f ffff880035ba79f8 ffffffff813f9c83
Dec 13 06:35:08 ncei kernel: [10412914.856937] ffff880035ba7bb0 ffff8801091fe600 ffff880035ba7a68 ffffffff8120b88e
Dec 13 06:35:08 ncei kernel: [10412914.856959] ffff880035ba7a30 ffff880035ba7a40 ffffffff814dae5a 00000000ffffffff
Dec 13 06:35:08 ncei kernel: [10412914.856964] Call Trace:
Dec 13 06:35:08 ncei kernel: [10412914.856981] [<ffffffff813f9c83>] dump_stack+0x63/0x90
Dec 13 06:35:08 ncei kernel: [10412914.856991] [<ffffffff8120b88e>] dump_header+0x5a/0x1c5
Dec 13 06:35:08 ncei kernel: [10412914.857001] [<ffffffff814dae5a>] ? virtballoon_oom_notify+0x2a/0x80
Dec 13 06:35:08 ncei kernel: [10412914.857010] [<ffffffff81192d62>] oom_kill_process+0x202/0x3c0
Dec 13 06:35:08 ncei kernel: [10412914.857014] [<ffffffff81193189>] out_of_memory+0x219/0x460
Dec 13 06:35:08 ncei kernel: [10412914.857021] [<ffffffff81199178>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
Dec 13 06:35:08 ncei kernel: [10412914.857026] [<ffffffff81199596>] __alloc_pages_nodemask+0x286/0x2a0
Dec 13 06:35:08 ncei kernel: [10412914.857040] [<ffffffff811e30dc>] alloc_pages_current+0x8c/0x110
Dec 13 06:35:08 ncei kernel: [10412914.857048] [<ffffffff8118f32b>] __page_cache_alloc+0xab/0xc0
Dec 13 06:35:08 ncei kernel: [10412914.857051] [<ffffffff8119183a>] filemap_fault+0x14a/0x3f0
Dec 13 06:35:08 ncei kernel: [10412914.857067] [<ffffffff812a3a96>] ext4_filemap_fault+0x36/0x50
Dec 13 06:35:08 ncei kernel: [10412914.857077] [<ffffffff811be850>] __do_fault+0x50/0xe0
Dec 13 06:35:08 ncei kernel: [10412914.857081] [<ffffffff811c2372>] handle_mm_fault+0xfa2/0x1820
Dec 13 06:35:08 ncei kernel: [10412914.857091] [<ffffffff8125ce77>] ? eventfd_ctx_read+0x67/0x210
Dec 13 06:35:08 ncei kernel: [10412914.857097] [<ffffffff811ed9e7>] ? kmem_cache_alloc+0x187/0x1f0
Dec 13 06:35:08 ncei kernel: [10412914.857107] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec 13 06:35:08 ncei kernel: [10412914.857112] [<ffffffff81255d34>] ? SyS_inotify_add_watch+0x1d4/0x350
Dec 13 06:35:08 ncei kernel: [10412914.857115] [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec 13 06:35:08 ncei kernel: [10412914.857127] [<ffffffff81844178>] page_fault+0x28/0x30
Dec 13 06:35:08 ncei kernel: [10412914.857130] Mem-Info:
Dec 13 06:35:08 ncei kernel: [10412914.857139] active_anon:1949307 inactive_anon:31638 isolated_anon:0
Dec 13 06:35:08 ncei kernel: [10412914.857139] active_file:78 inactive_file:51 isolated_file:1
Dec 13 06:35:08 ncei kernel: [10412914.857139] unevictable:915 dirty:0 writeback:0 unstable:0
Dec 13 06:35:08 ncei kernel: [10412914.857139] slab_reclaimable:12314 slab_unreclaimable:6499
Dec 13 06:35:08 ncei kernel: [10412914.857139] mapped:35771 shmem:56441 pagetables:6109 bounce:0
Dec 13 06:35:08 ncei kernel: [10412914.857139] free:25630 free_pcp:208 free_cma:0
Dec 13 06:35:08 ncei kernel: [10412914.857147] Node 0 DMA free:15884kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857159] lowmem_reserve[]: 0 3449 7928 7928 7928
Dec 13 06:35:08 ncei kernel: [10412914.857165] Node 0 DMA32 free:47136kB min:29348kB low:36684kB high:44020kB active_anon:3405552kB inactive_anon:58248kB active_file:304kB inactive_file:196kB unevictable:2136kB isolated(anon):0kB isolated(file):4kB present:3653620kB managed:3572932kB mlocked:2136kB dirty:0kB writeback:0kB mapped:76620kB shmem:110656kB slab_reclaimable:16728kB slab_unreclaimable:10004kB kernel_stack:2912kB pagetables:14628kB unstable:0kB bounce:0kB free_pcp:832kB local_pcp:84kB free_cma:0kB writeback_tmp:0kB pages_scanned:8936 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857174] lowmem_reserve[]: 0 0 4478 4478 4478
Dec 13 06:35:08 ncei kernel: [10412914.857179] Node 0 Normal free:39500kB min:38100kB low:47624kB high:57148kB active_anon:4391676kB inactive_anon:68304kB active_file:8kB inactive_file:8kB unevictable:1524kB isolated(anon):0kB isolated(file):0kB present:4718592kB managed:4586184kB mlocked:1524kB dirty:0kB writeback:0kB mapped:66464kB shmem:115108kB slab_reclaimable:32528kB slab_unreclaimable:15976kB kernel_stack:2032kB pagetables:9808kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12944 all_unreclaimable? yes
Dec 13 06:35:08 ncei kernel: [10412914.857187] lowmem_reserve[]: 0 0 0 0 0
Dec 13 06:35:08 ncei kernel: [10412914.857192] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Dec 13 06:35:08 ncei kernel: [10412914.857209] Node 0 DMA32: 231*4kB (UME) 204*8kB (UE) 297*16kB (UME) 221*32kB (UME) 102*64kB (UME) 48*128kB (UME) 29*256kB (ME) 13*512kB (ME) 6*1024kB (E) 0*2048kB 0*4096kB = 47276kB
Dec 13 06:35:08 ncei kernel: [10412914.857225] Node 0 Normal: 479*4kB (UME) 238*8kB (UMEH) 166*16kB (UMEH) 138*32kB (UMEH) 65*64kB (UMEH) 29*128kB (UEH) 17*256kB (UMEH) 8*512kB (UME) 4*1024kB (EH) 0*2048kB 2*4096kB (H) = 39500kB
Dec 13 06:35:08 ncei kernel: [10412914.857245] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 13 06:35:08 ncei kernel: [10412914.857248] 57215 total pagecache pages
Dec 13 06:35:08 ncei kernel: [10412914.857251] 0 pages in swap cache
Dec 13 06:35:08 ncei kernel: [10412914.857254] Swap cache stats: add 174259381, delete 174259381, find 59360304/74390678
Dec 13 06:35:08 ncei kernel: [10412914.857256] Free swap = 0kB
Dec 13 06:35:08 ncei kernel: [10412914.857258] Total swap = 0kB
Dec 13 06:35:08 ncei kernel: [10412914.857260] 2097049 pages RAM
Dec 13 06:35:08 ncei kernel: [10412914.857261] 0 pages HighMem/MovableOnly
Dec 13 06:35:08 ncei kernel: [10412914.857263] 53295 pages reserved
Dec 13 06:35:08 ncei kernel: [10412914.857264] 0 pages cma reserved
Dec 13 06:35:08 ncei kernel: [10412914.857266] 0 pages hwpoisoned
Dec 13 06:35:08 ncei kernel: [10412914.857268] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 13 06:35:08 ncei kernel: [10412914.857280] [ 351] 0 351 10972 1929 24 3 0 0 systemd-journal
Dec 13 06:35:08 ncei kernel: [10412914.857284] [ 407] 0 407 25742 412 18 3 6 0 lvmetad
Dec 13 06:35:08 ncei kernel: [10412914.857289] [ 915] 107 915 10757 450 26 3 5 -900 dbus-daemon
Dec 13 06:35:08 ncei kernel: [10412914.857292] [ 986] 0 986 1100 309 8 3 0 0 acpid
Dec 13 06:35:08 ncei kernel: [10412914.857296] [ 988] 0 988 7252 434 20 3 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857299] [ 993] 0 993 132578 697 26 4 0 0 lxcfs
Dec 13 06:35:08 ncei kernel: [10412914.857303] [ 997] 0 997 6511 399 19 3 17 0 atd
Dec 13 06:35:08 ncei kernel: [10412914.857307] [ 999] 104 999 64099 422 27 4 15 0 rsyslogd
Dec 13 06:35:08 ncei kernel: [10412914.857311] [ 1005] 0 1005 68972 361 36 3 4 0 accounts-daemon
Dec 13 06:35:08 ncei kernel: [10412914.857314] [ 1007] 0 1007 7590 655 18 3 0 0 systemd-logind
Dec 13 06:35:08 ncei kernel: [10412914.857318] [ 1052] 112 1052 1884 79 9 3 0 0 vnstatd
Dec 13 06:35:08 ncei kernel: [10412914.857322] [ 1061] 0 1061 16380 547 36 3 0 -1000 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857325] [ 1087] 0 1087 1306 456 9 3 1 0 iscsid
Dec 13 06:35:08 ncei kernel: [10412914.857329] [ 1088] 0 1088 1431 877 9 3 0 -17 iscsid
Dec 13 06:35:08 ncei kernel: [10412914.857332] [ 1163] 0 1163 3985 347 14 3 0 0 agetty
Dec 13 06:35:08 ncei kernel: [10412914.857336] [ 1183] 0 1183 4868 251 14 3 0 0 irqbalance
Dec 13 06:35:08 ncei kernel: [10412914.857339] [ 1202] 0 1202 3764 432 13 3 0 0 xinetd
Dec 13 06:35:08 ncei kernel: [10412914.857342] [ 1203] 0 1203 69278 946 39 3 7 0 polkitd
Dec 13 06:35:08 ncei kernel: [10412914.857346] [ 1355] 0 1355 73096 2462 44 3 0 0 fail2ban-server
Dec 13 06:35:08 ncei kernel: [10412914.857350] [ 1539] 1000 1539 11311 468 26 3 0 0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857353] [ 1545] 1000 1545 15276 457 32 3 0 0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857356] [ 1718] 1000 1718 12141 5185 28 3 0 0 tmux
Dec 13 06:35:08 ncei kernel: [10412914.857360] [ 1726] 1000 1726 5697 927 16 3 0 0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857364] [31232] 111 31232 27508 167 25 3 0 0 ntpd
Dec 13 06:35:08 ncei kernel: [10412914.857368] [29620] 0 29620 11050 459 22 3 0 -1000 systemd-udevd
Dec 13 06:35:08 ncei kernel: [10412914.857372] [26198] 1002 26198 11312 483 26 3 0 0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857375] [26199] 1002 26199 15291 473 32 3 0 0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857379] [ 9572] 1002 9572 1831 280 9 3 16 0 tail
Dec 13 06:35:08 ncei kernel: [10412914.857383] [10237] 0 10237 111212 3085 38 6 11 -900 snapd
Dec 13 06:35:08 ncei kernel: [10412914.857386] [28237] 113 28237 73676 644 68 3 0 -900 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857390] [28239] 113 28239 73709 33942 127 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857394] [28240] 113 28240 73676 715 95 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857397] [28241] 113 28241 73676 1691 61 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857401] [28242] 113 28242 73770 576 64 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857404] [28243] 113 28243 37456 533 57 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857407] [16097] 0 16097 3344 154 10 3 0 0 mdadm
Dec 13 06:35:08 ncei kernel: [10412914.857411] [21806] 0 21806 24724 901 52 3 0 0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857415] [14547] 1000 14547 5675 883 16 4 0 0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857419] [ 5773] 0 5773 23842 639 51 3 0 0 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857423] [ 5775] 1004 5775 11312 482 26 3 0 0 systemd
Dec 13 06:35:08 ncei kernel: [10412914.857426] [ 5778] 1004 5778 15338 521 32 3 0 0 (sd-pam)
Dec 13 06:35:08 ncei kernel: [10412914.857430] [ 5857] 1004 5857 23842 552 49 3 0 0 sshd
Dec 13 06:35:08 ncei kernel: [10412914.857433] [ 5858] 1004 5858 5669 881 16 3 0 0 bash
Dec 13 06:35:08 ncei kernel: [10412914.857437] [24535] 1005 24535 183205 32535 196 4 0 0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857440] [24536] 1005 24536 199979 33775 215 4 0 0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857444] [24537] 33 24537 408886 2232 101 5 0 0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857448] [24627] 33 24627 507215 1607 107 5 0 0 apache2
Dec 13 06:35:08 ncei kernel: [10412914.857451] [25674] 0 25674 12555 492 31 4 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857455] [25675] 1005 25675 1127 156 7 3 0 0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857459] [25681] 1005 25681 1923548 1831359 3655 11 0 0 python
Dec 13 06:35:08 ncei kernel: [10412914.857462] [25695] 113 25695 75674 5735 127 3 0 0 postgres
Dec 13 06:35:08 ncei kernel: [10412914.857465] [25881] 0 25881 12555 311 31 4 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857469] [25882] 0 25882 12555 311 31 4 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857472] [25885] 1005 25885 1127 170 8 3 0 0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857476] [25884] 1005 25884 1127 143 8 3 0 0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857479] [25890] 1005 25890 6109 564 15 3 0 0 python
Dec 13 06:35:08 ncei kernel: [10412914.857483] [25891] 1005 25891 6109 908 14 3 0 0 python
Dec 13 06:35:08 ncei kernel: [10412914.857487] [25914] 0 25914 5258 381 16 3 0 0 check_mk_agent
Dec 13 06:35:08 ncei kernel: [10412914.857491] [25934] 0 25934 12555 382 31 4 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857494] [25935] 0 25935 12555 383 31 4 0 0 cron
Dec 13 06:35:08 ncei kernel: [10412914.857497] [25941] 1005 25941 1127 62 8 3 0 0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857501] [25942] 1005 25942 1127 74 8 3 0 0 sh
Dec 13 06:35:08 ncei kernel: [10412914.857504] [25948] 1005 25948 5079 290 14 3 0 0 python
Dec 13 06:35:08 ncei kernel: [10412914.857507] [25949] 1005 25949 5143 309 14 3 0 0 python
Dec 13 06:35:08 ncei kernel: [10412914.857510] [25950] 0 25950 7563 356 19 3 0 0 ps
Dec 13 06:35:08 ncei kernel: [10412914.857514] [25951] 0 25951 3341 166 12 3 0 0 sed
Dec 13 06:35:08 ncei kernel: [10412914.857517] Out of memory: Kill process 25681 (python) score 897 or sacrifice child
Dec 13 06:35:08 ncei kernel: [10412914.861127] Killed process 25681 (python) total-vm:7694192kB, anon-rss:7325436kB, file-rss:0kB