Improving this site's uptime

the situation

So this site has been going up and down recently, and I finally got around to fixing it (it had been lower on the priority list since I hadn’t publicized the site previously).

It’s summer now, though, and I’ve been pretty bored, so I decided to dive into the system logs on a whim. Okay, I’ll be honest: I’ve been thinking about sharing the site with others, so the unreliability is now a problem that actually needs fixing.

what the journal says

I fetched the times of the two most recent freezes and looked them up in journalctl:

Exhibit A (journalctl --since "2023-06-20 11:15:00"):

Jun 20 11:15:12 XXXXXXXXX systemd[1]: Starting dnf-makecache.service - dnf makecache...
Jun 20 11:15:15 XXXXXXXXX dnf[164534]: Fedora 36 - x86_64                              203 kB/s |  25 kB     00:00
Jun 20 11:15:16 XXXXXXXXX dnf[164534]: Fedora 36 openh264 (From Cisco) - x86_64         16 kB/s | 989  B     00:00
Jun 20 11:15:16 XXXXXXXXX dnf[164534]: Fedora Modular 36 - x86_64                       70 kB/s |  24 kB     00:00
Jun 20 11:15:17 XXXXXXXXX dnf[164534]: Fedora 36 - x86_64 - Updates                    188 kB/s |  23 kB     00:00
Jun 20 11:26:53 XXXXXXXXX chronyd[689]: Forward time jump detected!
Jun 20 13:58:48 XXXXXXXXX kernel: auditd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-1000

Exhibit B (journalctl --since "2023-06-20 19:50:00", featuring random login attempts .-.):

Jun 20 19:53:11 XXXXXXXXX systemd[1]: Starting dnf-makecache.service - dnf makecache...
Jun 20 19:53:11 XXXXXXXXX sshd[167005]: Invalid user nextcloud from 211.110.1.37 port 38674
Jun 20 19:53:12 XXXXXXXXX sshd[167005]: Received disconnect from 211.110.1.37 port 38674:11: Bye Bye [preauth]
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX sshd[167005]: Disconnected from invalid user nextcloud 211.110.1.37 port 38674 [preauth]
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: USER_ERR pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg=>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: USER_LOGIN pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 ms>
Jun 20 19:53:15 XXXXXXXXX dnf[167007]: Fedora 36 - x86_64                               97 kB/s |  24 kB     00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora 36 openh264 (From Cisco) - x86_64        7.7 kB/s | 989  B     00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora Modular 36 - x86_64                      210 kB/s |  24 kB     00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora 36 - x86_64 - Updates                    185 kB/s |  23 kB     00:00
Jun 20 19:56:20 XXXXXXXXX sshd[167004]: fatal: Timeout before authentication for 111.67.194.160 port 41520
Jun 20 19:56:20 XXXXXXXXX sshd[167014]: fatal: Timeout before authentication for 111.67.194.160 port 45084
Jun 20 23:31:39 XXXXXXXXX kernel: chronyd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

A few key takeaways I have from this very representative sample of freeze events:

  • dnf-makecache.service starts shortly before each logged freeze
  • the Fedora 36 - x86_64 - Updates fetch is the last dnf line logged before the time jump occurs
  • oom-killer is invoked

Hence, the likely culprit of these freezes seems to be OOM triggered by the dnf-makecache service (note: the OS is also getting updated, since Fedora 36 has already reached EOL).
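
For anyone curious, a quick way to pull up every OOM event without scrolling through the whole journal is to grep the kernel messages. A sketch, with the caveat that the exact kernel wording can vary between versions, so the pattern may need tweaking:

# -k limits output to kernel messages; --grep filters them by regex
journalctl -k --grep 'oom-killer|Out of memory'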

hopefully a solution

it’s a simple edit: just add metadata_timer_sync=0 to the [main] section of /etc/dnf/dnf.conf
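
for reference, the relevant chunk of /etc/dnf/dnf.conf ends up looking roughly like this (only the last line is new; whatever was already under [main] stays put). As far as I can tell, a sync interval of 0 makes the timer-triggered makecache run exit immediately instead of downloading repo metadata:

[main]
# ...existing options stay as they are...
metadata_timer_sync=0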

alternatively:

  • systemctl stop dnf-makecache.timer
  • systemctl disable dnf-makecache.timer
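
the two commands above can also be collapsed into one, since systemctl disable accepts a --now flag that stops the unit in the same step:

systemctl disable --now dnf-makecache.timer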

and now we’ll see if that fixes the problem :)

Last updated at 2023-08-30 14:56:48 -0400
