Improving this site's uptime
the situation
So this site has been going up and down recently, and I finally got around to fixing it (it was a bit lower on the priority list since I hadn’t publicized it previously).
It’s the summer now, however, and I’ve been pretty bored, so I decided to dive into the system logs on a whim. Okay, I’ll be honest – I’ve been thinking about sharing the site with others, and hence the unreliability is now a problem that needs to be fixed.
what the journal says
I fetched the times of the two most recent freezes and looked them up in journalctl:
The journalctl time filters interpret timestamps in the server's local timezone, so they will only work as expected if that timezone is the one you have in mind. Use the commands below to check and update the server timezone if necessary.
timedatectl list-timezones
timedatectl set-timezone <zone>
timedatectl status
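Once the timezone is sorted, it can help to bound the query on both ends so the output stops shortly after the freeze instead of running to the present. A minimal sketch, assuming a helper name of my own choosing (journal_window is hypothetical, not something journalctl ships); --since, --until and --no-pager are standard journalctl options:

```shell
# Print journal entries between two timestamps.
# Both timestamps are interpreted in the server's local timezone.
# journal_window is a hypothetical helper name used for illustration.
journal_window() {
    journalctl --no-pager --since "$1" --until "$2"
}

# Usage (window chosen around the first freeze):
#   journal_window "2023-06-20 11:15:00" "2023-06-20 14:00:00"
```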
Exhibit A (journalctl --since "2023-06-20 11:15:00"):
Jun 20 11:15:12 XXXXXXXXX systemd[1]: Starting dnf-makecache.service - dnf makecache...
Jun 20 11:15:15 XXXXXXXXX dnf[164534]: Fedora 36 - x86_64 203 kB/s | 25 kB 00:00
Jun 20 11:15:16 XXXXXXXXX dnf[164534]: Fedora 36 openh264 (From Cisco) - x86_64 16 kB/s | 989 B 00:00
Jun 20 11:15:16 XXXXXXXXX dnf[164534]: Fedora Modular 36 - x86_64 70 kB/s | 24 kB 00:00
Jun 20 11:15:17 XXXXXXXXX dnf[164534]: Fedora 36 - x86_64 - Updates 188 kB/s | 23 kB 00:00
Jun 20 11:26:53 XXXXXXXXX chronyd[689]: Forward time jump detected!
Jun 20 13:58:48 XXXXXXXXX kernel: auditd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-1000
Exhibit B (journalctl --since "2023-06-20 19:50:00", featuring random login attempts .-.):
Jun 20 19:53:11 XXXXXXXXX systemd[1]: Starting dnf-makecache.service - dnf makecache...
Jun 20 19:53:11 XXXXXXXXX sshd[167005]: Invalid user nextcloud from 211.110.1.37 port 38674
Jun 20 19:53:12 XXXXXXXXX sshd[167005]: Received disconnect from 211.110.1.37 port 38674:11: Bye Bye [preauth]
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX sshd[167005]: Disconnected from invalid user nextcloud 211.110.1.37 port 38674 [preauth]
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: USER_ERR pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg=>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: CRYPTO_KEY_USER pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c10>
Jun 20 19:53:12 XXXXXXXXX audit[167005]: USER_LOGIN pid=167005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 ms>
Jun 20 19:53:15 XXXXXXXXX dnf[167007]: Fedora 36 - x86_64 97 kB/s | 24 kB 00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora 36 openh264 (From Cisco) - x86_64 7.7 kB/s | 989 B 00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora Modular 36 - x86_64 210 kB/s | 24 kB 00:00
Jun 20 19:53:16 XXXXXXXXX dnf[167007]: Fedora 36 - x86_64 - Updates 185 kB/s | 23 kB 00:00
Jun 20 19:56:20 XXXXXXXXX sshd[167004]: fatal: Timeout before authentication for 111.67.194.160 port 41520
Jun 20 19:56:20 XXXXXXXXX sshd[167014]: fatal: Timeout before authentication for 111.67.194.160 port 45084
Jun 20 23:31:39 XXXXXXXXX kernel: chronyd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Two key takeaways I have from this very representative sample of freeze events:
- dnf-makecache.service starts shortly before the freeze, and Fedora 36 - x86_64 - Updates is the last line it logs before the time jump occurs
- oom-killer is invoked
Hence, the likely culprit of these freezes seems to be OOM caused by the dnf-makecache service (note: the OS is also getting updated, since Fedora 36 has already reached EOL).
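To check whether the same pattern holds for other freezes, the journal text can be scanned for each oom-killer invocation together with the most recent dnf-makecache start before it. This is only a rough sketch: the function name oom_after_makecache is made up for illustration, and it assumes the default journalctl short output format, where the first three fields are month, day and time.

```shell
# Read journal text on stdin. For every oom-killer invocation, print its
# timestamp and the timestamp of the latest dnf-makecache start seen so far.
# oom_after_makecache is a hypothetical helper name.
oom_after_makecache() {
    awk '
        /Starting dnf-makecache.service/ { last = $1 " " $2 " " $3 }
        /invoked oom-killer/ {
            printf "oom-killer at %s %s %s; last dnf-makecache start: %s\n",
                   $1, $2, $3, (last == "" ? "none seen" : last)
        }
    '
}

# Usage:
#   journalctl --no-pager --since "2023-06-20" | oom_after_makecache
```

A short gap between the two timestamps would not prove causation, but a makecache start before every OOM event would at least be consistent with the theory.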
hopefully a solution
it’s a simple edit: just add metadata_timer_sync=0 to the [main] section of /etc/dnf/dnf.conf
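For anyone who would rather script the edit, here is one way it might look. It is a sketch under a few assumptions: GNU sed (as shipped on Fedora), a config file that already contains a [main] header, and a function name (set_metadata_timer_sync) that I made up. Run it as root, and back up the file first.

```shell
# Set metadata_timer_sync=0 in a dnf config file (default: /etc/dnf/dnf.conf).
# If the option already exists it is rewritten in place; otherwise it is
# appended directly below the [main] header. Safe to run more than once.
# set_metadata_timer_sync is a hypothetical helper name; requires GNU sed.
set_metadata_timer_sync() {
    conf="${1:-/etc/dnf/dnf.conf}"
    if grep -q '^metadata_timer_sync[[:space:]]*=' "$conf"; then
        sed -i 's/^metadata_timer_sync[[:space:]]*=.*/metadata_timer_sync=0/' "$conf"
    else
        sed -i '/^\[main\]/a metadata_timer_sync=0' "$conf"
    fi
}

# Usage:
#   cp /etc/dnf/dnf.conf /etc/dnf/dnf.conf.bak
#   set_metadata_timer_sync
```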
alternatively:
systemctl stop dnf-makecache.timer
systemctl disable dnf-makecache.timer
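The two commands above can also be collapsed into one with systemctl's --now flag, followed by a quick check that the change took effect. A small sketch, with a wrapper name of my own (disable_makecache_timer is hypothetical):

```shell
# Stop and disable the timer in one step, then report its state.
# disable_makecache_timer is a hypothetical helper name.
disable_makecache_timer() {
    systemctl disable --now dnf-makecache.timer
    # Both checks exit non-zero when the timer is disabled/inactive,
    # which is the desired outcome here, hence the "|| true".
    systemctl is-enabled dnf-makecache.timer || true
    systemctl is-active dnf-makecache.timer || true
}

# Usage (as root):
#   disable_makecache_timer
```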
and now we’ll see if the problem resolves :)