Opened 11 years ago
Closed 8 years ago
#1325 closed task (fixed)
ProjectsVM not responding
Reported by: | Jeff McKenna | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | SysAdmin | Keywords: | |
Cc: |
Description
- discussion: http://lists.osgeo.org/pipermail/sac/2014-April/004950.html
- affecting many projects such as GRASS, GDAL, OpenLayers, MapServer, etc.
- VM: http://wiki.osgeo.org/wiki/ProjectsVM
-jeff
Change History (21)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
"Ramereth" in #osuosl says the ProjectsVM is rebooting now, very slowly.
comment:4 by , 11 years ago
Bumping this again, openlayers.org and others are still down.
Is there anything I could possibly do to get them up again?
comment:5 by , 11 years ago
Know anything about optimizing I/O on a KVM-based VM? We took care of the underlying hardware issue, a RAID rebuild for a drive replacement. Now we're stuck wondering why the I/O is really bad, bad enough to make Projects unresponsive. Apache is temporarily off while we look for a way to prevent it from locking up. There was a MapServer Sphinx job that got out of control due to the I/O wait and spun up 4 simultaneous runs. Hopefully killing that is all that's needed right now.
FYI, OSUOSL thinks we need 2 new RAID card batteries: http://amzn.com/B0045ZNJJU
follow-up: 7 comment:6 by , 11 years ago
For the record the RAID cards have been ordered and the order info sent to Alex.
comment:7 by , 11 years ago
Replying to dmorissette:
> For the record the RAID cards have been ordered and the order info sent to Alex.
I meant RAID card batteries of course.
comment:8 by , 11 years ago
Batteries have arrived. Looking to schedule a time to put them in. Is 3-4 hours from now a good time? Any objections? That's 10-11 pm Germany, 4-5 pm US east coast, 1-2 pm US west coast.
comment:9 by , 11 years ago
Sounds plausible to me. Please consider equipping the host where the secure VM resides first.
comment:11 by , 11 years ago
I don't know the current state of this, but here are some things I do know:
- The ProjectsVM Apache was swamped this morning.
- Somehow, in the past week, the number of available connections to the projectsVM apache was dropped: http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org/apache_processes.html
- Markus raised that back up.
- As always, a huge number of botnets were hitting the OSGeo projects VM trying to use it as an open proxy. (These requests were just returning 404s.) I blackholed a number of IPs to pull the number of incoming connections down; I think I got about 40% of the incoming spam connections. iptables --list will show the ranges I blacklisted; I tried to be relatively conservative. I used the following command to find IPs to blacklist:
sudo tail -n 100 /var/log/apache2/docs.geotools.org-access_log | grep "http:" | cut -f 1-3 -d'.' | sort | uniq -c | sort -n
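The pipeline above prints one "COUNT A.B.C" line per /24 prefix. As a minimal sketch of a possible next step, the helper below turns such lines into candidate DROP rules. The function name, the hit threshold and the exact rule form (`-A INPUT ... -j DROP`) are assumptions for illustration; the ticket does not record which rules were actually added. The rules are only printed, never executed.

```shell
#!/bin/sh
# Hypothetical helper: read "COUNT A.B.C" lines (the output of the
# uniq -c pipeline above) on stdin and print, but do not run, an
# iptables DROP rule for every /24 prefix seen at least MIN_HITS times.
# MIN_HITS (first argument, default 10) is an assumed threshold.
suggest_drop_rules() {
    min_hits="${1:-10}"
    awk -v min="$min_hits" '
        # Skip anything whose second field is not exactly three octets.
        $1 + 0 >= min && $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+$/ {
            printf "iptables -A INPUT -s %s.0/24 -j DROP\n", $2
        }'
}
```

Appending `| suggest_drop_rules 10` to the pipeline would list the candidates; they should be reviewed by hand before anything is applied, since a /24 can also contain legitimate clients.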
Now that Apache is configured correctly and the incoming spam has decreased a bit, the disks are still massively underperforming: iostat -x -m 2 shows as few as 2-3 disk seeks/sec with full utilization and high await times. This usually means bad things; if the RAID batteries have not been replaced yet, this would be consistent with WriteThrough mode.
This means that the high load average on the projects vm appears to be entirely due to bad disk performance. I don't know how to debug more -- the machine seems otherwise fine -- and I think the machine will run, with abnormally high load numbers, though access that requires disk seeks will be much slower than usual.
Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm   %util
vda        0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00   0.00   0.00  100.00
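The sample above shows the pattern described: the device is reported as 100% utilised while completing no reads or writes. A rough sketch for spotting such rows automatically follows. The function name and both thresholds are illustrative assumptions, and the column names are taken from the header shown above, so other sysstat versions with different headers would need adjusting.

```shell
#!/bin/sh
# Hypothetical helper: read `iostat -x -m` style output on stdin and print
# devices that are (almost) fully utilised while completing very few I/Os.
# util_min and iops_max are assumed thresholds, not tuned values.
flag_stalled_devices() {
    awk -v util_min=90 -v iops_max=5 '
        # Header row: remember which column holds each named metric.
        $1 == "Device:" || $1 == "Device" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data row: only evaluated once a usable header has been seen.
        ("%util" in col) && ("r/s" in col) && ("w/s" in col) && NF >= col["%util"] {
            iops = $(col["r/s"]) + $(col["w/s"])
            util = $(col["%util"]) + 0
            if (util >= util_min && iops <= iops_max)
                printf "%s: %.0f%% utilised at %.2f IOPS\n", $1, util, iops
        }'
}
```

Something like `iostat -x -m 2 5 | flag_stalled_devices` would then report only the suspicious devices. Matching columns by header name rather than by position keeps the sketch from silently reading the wrong field.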
comment:13 by , 11 years ago
Yes the new battery is in on osgeo4.
Limiting connections was intentional, so we could slowly raise the limit to a level that balances the needs. BufferedLogs was also intentional, and we doubled the RAM to make sure it wouldn't be an issue.
Note, sometimes the performance is actually impacted when a different VM on the same host is going to swap. We have a memory leak on QGIS with Redmine under Phusion Passenger that needs to be debugged.
comment:14 by , 11 years ago
BufferedLogs is back on. We added more RAM so this could be used without issue, and it seems to make a big difference. Connections are down to 500 from 700. That should be enough based on the historical charts. Thanks for blacklisting bad IPs.
RAID card status checked: Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
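The first item of the "Current Cache Policy" line quoted above is what separates WriteBack from the WriteThrough mode suspected earlier in this ticket. A small sketch for pulling that item out of saved controller status text is given below. The function name is hypothetical, the line format is assumed to match the one quoted above, and how the status text is obtained from the controller is deliberately left out because the ticket does not say which tool was used.

```shell
#!/bin/sh
# Hypothetical helper: read controller status text on stdin and report the
# write mode named in its "Current Cache Policy" line.
# Prints WriteBack, WriteThrough, or unknown.
cache_write_mode() {
    awk -F': *' '
        /Current Cache Policy/ {
            # The first comma-separated item after the label is the write mode.
            split($NF, items, /, */)
            mode = items[1]
        }
        END {
            if (mode == "WriteBack" || mode == "WriteThrough") print mode
            else print "unknown"
        }'
}
```

Run periodically, a check like this could flag a controller that has dropped back to WriteThrough. The remaining items on the line (read policy, BBU behaviour) are not interpreted.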
We also need to check if the disk I/O can be improved on Mail.
comment:15 by , 11 years ago
FWIW, the disk latency on the projVM is terrible right now, in the seconds (!) range.
neteler@projects:~/cronjobs$ date
Sat Apr 26 03:18:13 PDT 2014
neteler@projects:~/cronjobs$ date -u
Sat Apr 26 10:18:23 UTC 2014
follow-up: 17 comment:16 by , 11 years ago
It appears to happen at the same time each day; I suspect it may be the backup job. Maybe we can throttle the speed of the backup or shift it to a time when there is less traffic. If I'm reading the timing right, it's during prime morning hours in the EU. Perhaps move it up a couple of hours to early morning EU, late night US?
I don't necessarily think that's the root cause, but maybe a slight shift to a less busy time will help us decipher what else is going on.
comment:17 by , 11 years ago
Replying to wildintellect:
> It appears to be at the same time each day, I suspect it may be the backup job. Maybe we can throttle the speed of the backup
Is the backup job running at 'ionice -c3'? (consume I/O only when otherwise idle)
Hamish
comment:18 by , 11 years ago
No, it's not running at 'ionice -c3', and I have mixed feelings about consuming I/O only when otherwise idle, because in real life the system is hardly ever close to idle.
Running at 'ionice -c 2 -n 7', the weekly full backup of just the Projects VM took 10 hours at an average bandwidth of 1.6 MByte/s, and the VM still wasn't very responsive. In fact, I didn't notice any relevant difference after applying ionice. So I'm curious what you're trying to achieve by further reducing its priority?
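For scale, the figures above can be turned into a total volume. The sketch below assumes the 1.6 MByte/s average held for the full 10 hours; the function name is made up for illustration.

```shell
#!/bin/sh
# Data moved = rate (MByte/s) * duration (h) * 3600 s/h, printed in MByte.
# Arguments: average rate in MByte/s, duration in hours.
backup_volume_mb() {
    rate_mb_s="$1"
    hours="$2"
    awk -v r="$rate_mb_s" -v h="$hours" 'BEGIN { printf "%.0f\n", r * h * 3600 }'
}

backup_volume_mb 1.6 10   # prints 57600
```

That is about 57,600 MByte, or roughly 58 GByte, for one weekly full backup of the Projects VM.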
comment:19 by , 10 years ago
The projectsVM continues to be very very slow.
Suggestion
- create a new VM
- migrate per-project stuff over to the new machine
Rationale: The current projectsVM is overly complex and by now impossible to maintain.
comment:20 by , 9 years ago
Since most stuff has meanwhile moved to osgeo6, I suggest closing this ticket.
I've asked for an update in #osuosl but got no response. They may not know who I am. Maybe someone else can try as well.