Opened 10 years ago

Closed 8 years ago

#1325 closed task (fixed)

ProjectsVM not responding

Reported by: Jeff McKenna
Owned by: sac@…
Priority: critical
Milestone:
Component: SysAdmin
Keywords:
Cc:

Description

Change History (21)

comment:1 by Jeff McKenna, 10 years ago

I've asked for an update in #osuosl but got no response. They may not know who I am. Maybe someone else can try as well.

comment:2 by Jeff McKenna, 10 years ago

"Ramereth" in #osuosl says the ProjectsVM is rebooting now, very slowly.

comment:3 by bartvde, 10 years ago

still down, any update?

comment:4 by marcjansen, 10 years ago

Bumping this again, openlayers.org and others are still down.

Is there anything I could possibly do to get them up again?

comment:5 by wildintellect, 10 years ago

Know anything about optimizing I/O on a KVM-based VM? We took care of the underlying hardware issue, a RAID rebuild after a drive replacement. Now we're stuck wondering why the I/O is really bad, bad enough to make Projects unresponsive. Apache is temporarily off while we look for a way to prevent it from locking up. There was a mapserver sphinx job that got out of control due to the I/O wait and spun up 4 simultaneous runs. Hopefully killing that is all that's needed right now.

FYI, OSUOSL thinks we need new RAID card batteries 2x http://amzn.com/B0045ZNJJU

comment:6 by dmorissette, 10 years ago

For the record the RAID cards have been ordered and the order info sent to Alex.

in reply to:  6 comment:7 by dmorissette, 10 years ago

Replying to dmorissette:

For the record the RAID cards have been ordered and the order info sent to Alex.

I meant RAID card batteries of course.

comment:8 by wildintellect, 10 years ago

Batteries have arrived. Looking to schedule a time to put them in. Is 3-4 hours from now a good time? Any objections? That's 10-11 pm Germany, 4-5 pm US east coast, 1-2 pm US west coast.

http://www.timeanddate.com/worldclock/meetingdetails.html?year=2014&month=4&day=7&hour=21&min=0&sec=0&p1=217&p2=37

comment:9 by martin, 10 years ago

Sounds plausible to me. Please consider equipping the host where the secure VM resides first.

comment:10 by Jeff McKenna, 10 years ago

+1

comment:11 by crschmidt, 10 years ago

I don't know the current state of this, but here's some things I do know:

  1. The ProjectsVM Apache was swamped this morning.
  2. Somehow, in the past week, the number of available connections to the projectsVM apache was dropped: http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org/apache_processes.html
  3. Markus raised that back up.
  4. As always, a huge number of botnets were hitting the OSGeo projects VM trying to use it as an open proxy. (These requests were just returning 404s.) I blackholed a number of IPs to pull the number of incoming connections down; I think I got about 40% of the incoming spam connections. iptables --list will show the ranges I blacklisted; I tried to be relatively conservative. I used the following command to find IPs to blacklist.

sudo tail -n 100 /var/log/apache2/docs.geotools.org-access_log | grep "http:" | cut -f 1-3 -d'.' | sort | uniq -c | sort -n
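For anyone who prefers a script to the pipeline above, here is a minimal Python sketch of the same idea: keep only log lines containing "http:" (the open-proxy attempts), cut each line down to its first three dot-separated fields, and count them. The function name and the sample log lines are made up for illustration (the addresses come from the reserved documentation ranges); this is not the exact tooling that was used on the VM.

```python
from collections import Counter


def count_proxy_prefixes(lines):
    """Count the first three dot-separated fields of lines containing 'http:'.

    Mirrors: grep "http:" | cut -f 1-3 -d'.' | sort | uniq -c | sort -n
    """
    counts = Counter()
    for line in lines:
        if "http:" not in line:
            continue  # not a proxy-style request
        # Same effect as cut -f 1-3 -d'.' on a line that starts with an IPv4 address
        prefix = ".".join(line.split(".")[:3])
        counts[prefix] += 1
    # Ascending by count, like the trailing `sort -n`
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical sample lines, not taken from the real access log
sample = [
    '192.0.2.10 - - [07/Apr/2014:09:00:01 -0700] "GET http://example.com/ HTTP/1.1" 404 200',
    '192.0.2.77 - - [07/Apr/2014:09:00:02 -0700] "GET http://example.org/ HTTP/1.1" 404 200',
    '198.51.100.7 - - [07/Apr/2014:09:00:03 -0700] "GET http://example.net/ HTTP/1.1" 404 200',
    '203.0.113.5 - - [07/Apr/2014:09:00:04 -0700] "GET /index.html HTTP/1.1" 200 512',
]

for prefix, hits in count_proxy_prefixes(sample):
    print(hits, prefix)
```

The prefixes with the highest counts end up at the bottom of the output, as with the shell version.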

Now that apache is configured correctly and the incoming spam is decreased a bit, the disks are still massively underperforming; iostat -x -m 2 shows that there are as few as 2-3 disk seeks/sec going on with full utilization and high await times. This usually means bad things; if the RAID batteries are not replaced yet, this would be consistent with WriteThrough mode.

This means that the high load average on the projects vm appears to be entirely due to bad disk performance. I don't know how to debug more -- the machine seems otherwise fine -- and I think the machine will run, with abnormally high load numbers, though access that requires disk seeks will be much slower than usual.

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm   %util
vda        0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00   0.00   0.00  100.00
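To make the pattern in that sample easier to spot automatically, here is a rough Python sketch that parses one iostat -x row and flags a device that reports full utilization while completing almost no requests. The function names and both thresholds are assumptions chosen for illustration, not values anyone on this ticket agreed on.

```python
HEADER = "Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util"
ROW = "vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.00"


def parse_iostat_row(header, row):
    """Return (device, {column name: value}) for one iostat -x data row."""
    names = header.split()[1:]  # drop the leading "Device:" label
    fields = row.split()
    values = [float(v) for v in fields[1:]]
    return fields[0], dict(zip(names, values))


def looks_stalled(stats, util_threshold=95.0, iops_threshold=5.0):
    """True when the device is nearly 100% busy but serves very few requests.

    Both thresholds are hypothetical defaults.
    """
    iops = stats["r/s"] + stats["w/s"]
    return stats["%util"] >= util_threshold and iops <= iops_threshold


device, stats = parse_iostat_row(HEADER, ROW)
print(device, "stalled" if looks_stalled(stats) else "ok")
```

On the sample above this reports vda as stalled: one request sits in the queue (avgqu-sz 1.00) and the device is 100% utilized, yet no reads or writes completed in the interval.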

comment:12 by neteler, 10 years ago

Are the new batteries in place?

comment:13 by wildintellect, 10 years ago

Yes the new battery is in on osgeo4.

Limited connections was intentional, so we could slowly raise it up to a level that balances the needs. Also Buffered Logs was intentional and we doubled the ram to make sure it wouldn't be an issue.

Note, sometimes the performance is actually impacted when a different VM on the same host is going to swap. We have a memory leak on QGIS with Redmine under Phusion Passenger that needs to be debugged.

comment:14 by wildintellect, 10 years ago

BufferedLogs is back on. <- we added more ram so this could be used without issue, and it seems to make a big difference. Connections down to 500 from 700. That should be enough based on the historical charts. Thanks for blacklisting bad IPs.

RAID card status checked: Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU

We also need to check if the disk I/O can be improved on Mail.

comment:15 by neteler, 10 years ago

FWIW, the disk latency on the projVM is terrible right now, in the seconds (!) range.

neteler@projects:~/cronjobs$ date 
Sat Apr 26 03:18:13 PDT 2014
neteler@projects:~/cronjobs$ date -u
Sat Apr 26 10:18:23 UTC 2014

comment:16 by wildintellect, 10 years ago

It appears to be at the same time each day, I suspect it may be the backup job. Maybe we can throttle the speed of the backup or shift it to a time when there is less traffic. If I'm reading the timing right, it's during prime morning hours in EU. Perhaps moving it up a couple of hours to early morning EU, late night US?

I don't necessarily think that's the root cause, but maybe a slight shift to a less busy time will help us decipher what else is going on.
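As a quick check on the timing, here is a small Python sketch that converts the timestamp reported in comment:15 (03:18:13 PDT on Sat Apr 26 2014) to UTC and to central European time. The fixed offsets are assumptions: PDT is taken as UTC-7 and CEST as UTC+2, which is the summer offset for Germany; a real script would more likely use a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (daylight saving time in effect in late April)
PDT = timezone(timedelta(hours=-7), "PDT")
CEST = timezone(timedelta(hours=2), "CEST")

# Server-local time from the `date` output in comment:15
observed = datetime(2014, 4, 26, 3, 18, 13, tzinfo=PDT)

as_utc = observed.astimezone(timezone.utc)
as_cest = observed.astimezone(CEST)

print("UTC: ", as_utc.strftime("%H:%M"))
print("CEST:", as_cest.strftime("%H:%M"))
```

Under those assumptions the slow period observed at 03:18 server time corresponds to 10:18 UTC and 12:18 in central Europe.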

in reply to:  16 comment:17 by hamish, 10 years ago

Replying to wildintellect:

It appears to be at the same time each day, I suspect it may be the backup job. Maybe we can throttle the speed of the backup

is the backup job running at 'ionice -c3'? (consume i/o only when otherwise idle)

Hamish

comment:18 by martin, 10 years ago

No, it's not running at 'ionice -c3' and I'm having mixed feelings wrt. consuming I/O only when otherwise idle because in real life the system is hardly even close to idle.

Running at 'ionice -c 2 -n 7' the weekly full backup of just the Projects VM took 10 hours at an average bandwidth of 1.6 MByte/s - and still wasn't very responsive. In fact I didn't notice any relevant difference after applying ionice. Thus I'm curious about what you're trying to achieve by further reducing its priority !?
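For scale, the figures above imply the amount of data the weekly full backup moved. A back-of-the-envelope sketch in Python, assuming the 10 hours and the 1.6 MByte/s average are both accurate and that MByte means 10^6 bytes (the function name is made up for illustration):

```python
def backup_size_mb(hours, rate_mb_per_s):
    """Data transferred, in MB, at a constant average rate."""
    return hours * 3600 * rate_mb_per_s


size_mb = backup_size_mb(10, 1.6)
print(f"{size_mb:.0f} MB, about {size_mb / 1000:.1f} GB")
```

That works out to roughly 57.6 GB over the 10 hours, a modest volume for a weekly full backup, which fits the observation that the disks rather than the backup priority are the bottleneck.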

comment:19 by neteler, 10 years ago

The projectsVM continues to be very very slow.

Suggestion

  • create a new VM
  • migrate per-project stuff over to the new machine

Rationale: The current projectsVM is overly complex and has become impossible to maintain.

comment:20 by neteler, 8 years ago

Since most stuff has meanwhile moved to osgeo6, I suggest closing this ticket.

comment:21 by strk, 8 years ago

Resolution: fixed
Status: new → closed

Closing
