How to troubleshoot high load on a server or VPS

Last Updated August 13, 2024//News & Info

This guide is for anyone with their own VPS or dedicated server that is experiencing high load and would like to find the cause. It’s got tips that apply generally, as well as some that apply specifically to servers running Plesk Panel on an AlmaLinux OS or other RHEL equivalent like Rocky. If you’re not using the same environment you may need to adjust some commands and/or file paths to match where your data is located, like logs, vhosts, etc.

We’ll start with the obvious contenders, like CPU usage and Memory pressure/usage, then head into the less obvious stuff like larger log files contributing to IO performance (Input/Output, usually meaning disk read/write), zombie processes, stuck sessions/scope units, and Plesk task manager.

Check top, htop, iotop

The first step is always to look for high CPU or high memory usage in the output of the top, htop, and/or iotop commands. If you don’t have htop or iotop, install them with your OS’s package manager.

Look for processes that you might not need running right now, like pigz, installatron, gzip, backups, etc. that are eating up CPU (htop) or IO (iotop) and kill them with kill <process_id> or lower their priority with renice and ionice.
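If the process is something you still want to finish (like a backup), lowering its priority is gentler than killing it. Here’s a minimal sketch, assuming a Linux server with renice available and ionice from util-linux; deprioritize is a hypothetical helper name, not a standard command, and the IO class is only honoured by disk schedulers that support IO priorities:

```shell
# deprioritize PID: hypothetical helper that pushes a process to the back
# of the CPU and disk queues instead of killing it.
deprioritize() {
  pid="$1"
  # Nice value 19 is the lowest CPU scheduling priority.
  renice -n 19 -p "$pid" >/dev/null || return 1
  # IO class 3 is "idle": the process gets disk time only when nothing else wants it.
  if command -v ionice >/dev/null 2>&1; then
    ionice -c 3 -p "$pid" || echo "ionice failed for PID $pid" >&2
  fi
}
```

Usage would look like deprioritize 12345, where 12345 is the PID shown in htop or iotop.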

Check for log files that are too large

Use the following command to look for any log files that are at least 1GB in size:

du -sh /var/www/vhosts/system/*/logs/* | grep -E '^[0-9.,]+G'

Note: the files ending in .processed are not loaded by the web server and therefore should not have an effect on load. They are used for webstats calculations.

An alternative to this (as du can take some time) is to check and see what open files either nginx or apache is accessing and get the filesize report from lsof:

lsof -p $(ps aux | grep httpd | awk 'NR==1{print $2}') | grep log | awk '$7 > 199999 {print $7/1024/1024 " MB", $9}'

Also check system-wide logs:

du -sh /var/log/*log | grep -E '^[0-9.,]+G'
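A find-based variant of the du checks above can filter by size directly rather than grepping the output. This is only a sketch: large_files is a hypothetical helper name, and the size argument uses find’s own suffixes (k, M, G) as supported by GNU find:

```shell
# large_files DIR [SIZE]: hypothetical helper listing files under DIR larger than SIZE.
# SIZE uses find's suffixes (k, M, G) and defaults to 1G.
large_files() {
  dir="$1"
  size="${2:-1G}"
  # -size +N matches files larger than N; find rounds sizes up to whole units.
  find "$dir" -type f -size "+$size" -exec ls -lh {} +
}
```

For example, large_files /var/www/vhosts/system 1G for domain logs, or large_files /var/log 1G for system-wide logs.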

Find what’s filling the log

If you spot any large log files, tail the end of them to see what’s filling the logs:

tail {path_to_log_file}
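When it’s an access log that’s growing, it can help to see which clients are responsible for most of the recent lines. A rough sketch, assuming the usual access log layout where the first field on each line is the client IP; top_clients is a hypothetical helper name:

```shell
# top_clients LOG [LINES]: hypothetical helper that counts requests per client IP
# in the last LINES lines (default 5000) of an access log.
# Assumes the first whitespace-separated field is the client address.
top_clients() {
  log="$1"
  lines="${2:-5000}"
  tail -n "$lines" "$log" | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 10
}
```

Run it as top_clients {path_to_log_file}. A single IP dominating the list often points to a bot or a misbehaving script rather than normal traffic.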

If it’s something that can be corrected by fixing the error, do that! For example if there’s a PHP error showing in your app/website, fix that error to stop the logging.

Check if log rotation is working for domains

Disable and re-enable log rotation in Plesk. Go to the domain in Plesk, choose Logs > Log Files dropdown in upper right corner > Log Rotation and ensure it’s configured sanely. If it’s more than one domain, you can do this for all of them via shell with this:

for DOMAIN in $(plesk bin domain -l); do
  /usr/local/psa/bin/site -u "$DOMAIN" -log_rotate false
  /usr/local/psa/bin/site -u "$DOMAIN" -log_rotate true
done
# Force 30 days log rotation server-wide
plesk bin settings -s logrotate_force=true
plesk bin settings -s logrotate_period=30

Details on how Plesk manages log rotation here.

If the file taking up a ton of space does *not* end in .processed, run the following commands:

DOMAIN=mydomain.com 
plesk sbin statistics --calculate-one --domain-name=$DOMAIN 
/usr/local/psa/logrotate/sbin/logrotate /usr/local/psa/etc/logrotate.d/$DOMAIN

If the only files taking up a ton of space are those ending in .processed, run logrotate on its own:

DOMAIN=mydomain.com 
/usr/local/psa/logrotate/sbin/logrotate /usr/local/psa/etc/logrotate.d/$DOMAIN

Or have Plesk do this for all domains with this command:

plesk daily ExecuteStatistics

Erase large log files & Disable logging?

When done, empty out the log file: cat /dev/null > {path_to_log_file}
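Emptying the file in place, rather than deleting it, matters because the web server usually still has the log open; a deleted file keeps using disk space until the server lets go of it. A minimal sketch of the same truncation as a reusable function (empty_log is a hypothetical helper name):

```shell
# empty_log FILE: hypothetical helper that truncates a log in place.
# The file keeps its inode, owner and permissions, so a process that
# already has it open in append mode carries on writing to the emptied file.
empty_log() {
  : > "$1"
}
```

This does the same job as the cat /dev/null redirect, for example empty_log {path_to_log_file}.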

And now we hope that enabling log rotation actually causes the logs to be rotated so the issue doesn’t return.

If it’s the error log that gets very large, you’ll need to notify the customer about the errors, particularly if they’re critical errors of any kind. However, if the log output is purely warnings, and you don’t believe it’ll have any effect on the site’s capabilities or performance, then you can adjust the PHP error log setting (error_reporting) in Plesk’s PHP settings for the domain by adding & ~E_WARNING to the list. This will suppress warnings from being logged.

If the site gets significant traffic, such as 100k daily views, and it’s the access log that’s very large, it would be best to disable access logging to ease the IO spent writing every single request to the file system. Add this to the nginx config in Plesk (must be an admin): access_log off;

Analyze MySQL Activity

If the CPU or IO load is being caused by active MySQL processes, you can see what queries are being run with this command:

mysqladmin processlist

If the Plesk installation has not been optimized by us (set up with our server init script) you will need to run it like this:

mysqladmin -uadmin -p$(cat /etc/psa/.psa.shadow) processlist

If you spot more than, say, 5-10 processes, it’s probably best to limit the number of connections on the container, then figure out why it needs so many processes. Perhaps there’s a loop somewhere, or perhaps the code is hammering the SQL server with requests all at once when it could be throttling them (up to the developer to fix). Here’s how to set max processes.
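To see at a glance which database user is holding the most connections, the processlist table can be summarized with awk. This is a sketch that assumes the default table layout printed by mysqladmin processlist (Id, User, Host, db, ...); summarize_processlist is a hypothetical helper name:

```shell
# summarize_processlist: hypothetical helper that reads `mysqladmin processlist`
# output on stdin and prints the number of connections per user and database.
summarize_processlist() {
  awk -F'|' '
    # Data rows have a numeric Id in the second field; this skips borders and the header.
    $2 ~ /^ *[0-9]+ *$/ {
      user = $3; db = $5
      gsub(/ /, "", user); gsub(/ /, "", db)
      if (db == "") db = "(none)"
      count[user "@" db]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}
```

Pipe the earlier command into it: mysqladmin processlist | summarize_processlist. The user and database with the highest count is the first place to look.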

Look for php-fpm (or php-cgi) Activity (Plesk admin)

If the only thing seemingly spiking in htop or iotop is php-fpm, take note of how many php-fpm processes appear to be active at one time in htop. Then go to that site’s PHP Settings in Plesk and configure the PHP-FPM specific settings as follows:

pm = static
pm.max_children = ACTIVE_PROCESSES_FOUND
pm.max_requests = 500

You’ll probably find that for most sites the ACTIVE_PROCESSES_FOUND value should be around 3-5. If the site appears to be super busy, it could be 8-10 or more, but should never be more than 10 on a shared server.
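If you’d rather count than eyeball htop, the worker processes can be tallied per pool. A sketch, assuming php-fpm’s usual worker process title of "php-fpm: pool <name>" (on Plesk the pool name is usually the domain); count_fpm_workers is a hypothetical helper name. Note that it counts every worker that exists at that moment, including idle ones, so sample it a few times during peak load:

```shell
# count_fpm_workers: hypothetical helper that reads process command lines on stdin
# (for example from `ps -eo args`) and counts php-fpm workers per pool.
# Assumes the worker title format "php-fpm: pool <name>".
count_fpm_workers() {
  awk '
    /php-fpm: pool / {
      # The pool name is the last word of the process title.
      pools[$NF]++
    }
    END { for (p in pools) print pools[p], p }
  ' | sort -rn
}
```

Run it as ps -eo args | count_fpm_workers to get one line per pool with its worker count.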

Here are some rules of thumb for how to configure this should the site appear to need more than 5 processes:

If you don’t have a VPS you need to reduce the weight of your dynamic processing or it’s probably time to move to a VPS.
If the site is already on a VPS of its own and RAM is no more than 50% used (excluding cache), go ahead and use static mode and keep an eye on memory usage over the next 24-48 hours.
If the site is already on a VPS of its own but there’s *not* a lot of RAM available, use dynamic mode instead and configure the min to 5 and max to the total number of processes you saw at peak — probably no more than 15 for most servers, depending on specs. This max may need to be lowered depending on RAM usage over time.

If it was php-cgi activity and the server supports php-fpm over apache, switch it to php-fpm and enable the settings above. If you can’t switch it to apache/php-fpm then edit /etc/httpd/conf.d/fcgid.conf and set its max processes per class to 4 or 5 then restart httpd.

Static mode will also reduce load even when compared to dynamic mode configured with min/max values as the same number. We assume that the calculations that determine whether to spawn a process or not still need to occur on each request, and they provide some amount of overhead.

Back in 2019 the above php-fpm changes brought the load on one of our servers down from a steady 3.0-4.0 to 0.5 simply by allowing the php-fpm processes to not be spawned on every request. We believe the improvement is likely not as big now (2024) due to improvements in both hardware and software.

Warning: Static and Dynamic mode will result in memory usage issues if any of the software on your site (plugins, theme, etc) does not handle its memory management well. If that happens your only option is to track down the memory leak via exhaustive troubleshooting OR change back to ondemand mode where the processes are killed when they’re no longer needed, thus freeing their memory.

Are there Zombie Processes?

Look for zombie processes. While they could exist in many forms, the most common memory eating zombies come from (older method) php-cgi apache forked processes. To test for this, kill apache, then look for any running php-cgi processes (since stopping apache should kill ’em all off):

service httpd stop
killall php-cgi && killall php-cgi
ps aux | grep '[p]hp-cgi'

If you see any processes remaining in that list, kill -9 them. pkill is also handy for this, but use it carefully!
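If stopping apache isn’t an option right away, leftover php-cgi processes can often be spotted by their parent. This sketch assumes orphaned processes have been re-parented to PID 1, which is the usual behaviour outside of containers; orphaned_php_cgi is a hypothetical helper name:

```shell
# orphaned_php_cgi: hypothetical helper that reads `ps -eo pid,ppid,rss,comm`
# output on stdin and prints php-cgi processes whose parent is PID 1,
# along with their resident memory (ps reports RSS in kilobytes).
orphaned_php_cgi() {
  awk '$4 == "php-cgi" && $2 == 1 { printf "%s %.1f MB\n", $1, $3 / 1024 }'
}
```

Run it as ps -eo pid,ppid,rss,comm | orphaned_php_cgi and review the PIDs it prints before killing anything.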

dbus delays due to abandoned or failed sessions/units

Run systemctl list-units and comb through the results looking for abandoned or failed units in larger quantities (>10).
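Counting is quicker than combing through the list by eye. A sketch, assuming the plain column layout of systemctl list-units (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); count_unit_states is a hypothetical helper name:

```shell
# count_unit_states: hypothetical helper that reads
# `systemctl list-units --all --plain --no-legend` output on stdin and
# counts failed units (ACTIVE column) and abandoned units (SUB column).
count_unit_states() {
  awk '
    $3 == "failed"    { failed++ }
    $4 == "abandoned" { abandoned++ }
    END { printf "failed=%d abandoned=%d\n", failed, abandoned }
  '
}
```

Run it as systemctl list-units --all --plain --no-legend | count_unit_states and compare each number against the rough threshold of 10.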

If you see lots of failed units, try this to fix it: systemctl reset-failed

If you see lots of abandoned units, run this to fix it:

systemctl | grep abandoned | grep -e "[[:digit:]]" | sed "s/.scope.*/.scope/" | xargs systemctl stop

Try running loginctl list-sessions.

If you see more than, say, 30-40 of them, or it hangs, you may have an issue with stuck sessions, which the above command should fix up after 3-4 minutes.

Note: after doing this, you may need to restart systemd-logind and dbus as well as a number of other system processes. Monitor the system with journalctl -f and respond to errors you see there.

Tip: this kind of issue can be more conclusively resolved by restarting the server because of the number of processes that rely on dbus and systemd-logind.

Plesk Task Manager queue not processing

In Plesk go to Tools & Settings > Task Manager. If you see the number of processing tasks not fluctuating and lots of “New” tasks that aren’t going down in numbers, you’ve got a stuck task queue. To resolve:

systemctl restart plesk-task-manager

If you don’t see the tasks processing still, press the stop button on the oldest tasks in the list that are still processing or new. You may need to do this repeatedly to see it begin moving again.

If that still doesn’t do the trick:

systemctl restart dbus.socket systemd-logind plesk-task-manager 'plesk-php*'

Notes:

It can appear like there’s no improvement in the task list if a long-running task like DailyMaintenance is currently executing. Some of the daily tasks can be checked against running processes using htop, like spamtrain/sa-learn. If you see them running, then it’s processing.
We restart the PHP processes after dbus to ensure they’re loaded in their cgroups.

SOURCE: https://support.plesk.com/hc/en-us/articles/12376920055575-Tasks-in-Plesk-are-never-executed-and-stuck-as-new-Cannot-fetch-process-properties


Jordan Schelew

Jordan has been working with computers, security, and network systems since the 90s and is a managing partner at Websavers Inc. As a founder of the company, he's been in the web tech space for over 15 years.
