Opened 2 years ago

Last modified 5 months ago

#2706 new task

Set up load balancing configuration for download.osgeo.org

Reported by: robe
Owned by: sac@…
Priority: normal
Milestone: Sysadmin Contract 2024-I
Component: SysAdmin
Keywords:
Cc:

Description

One of the items on my list was to set up some sort of CDN for download.osgeo.org.

We do have ftp.osuosl.org, which we can push traffic to. In theory, we should be able to set this up in nginx, as detailed here:

https://nginx.org/en/docs/http/load_balancing.html

I'm not sure how well that works for balancing download traffic, though.
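For reference, a minimal sketch of what the linked load_balancing approach might look like for download, assuming two backends that both serve the same download tree. The upstream name download_backends and the backend host names are hypothetical placeholders, not actual OSGeo hosts:

```nginx
# Hypothetical sketch: upstream name and backend hosts are placeholders.
upstream download_backends {
    least_conn;                            # pick the backend with the fewest active connections
    server backend1.example.org;           # e.g. the existing download host
    server backend2.example.org weight=2;  # e.g. a second, less loaded mirror
}

server {
    listen 80;
    server_name download.osgeo.org;

    location / {
        proxy_pass http://download_backends;
        # Assumes the backends answer for this vhost name; without this line
        # nginx sends "Host: download_backends" to them.
        proxy_set_header Host $host;
    }
}
```

Note that with this kind of reverse proxying, every response body still passes through the balancing host's own uplink, so if the balancer sits on an already saturated host, the downloads remain bound by that host's outbound bandwidth.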

Change History (5)

comment:1 by robe, 2 years ago

As feared, this doesn't work for our needs. I tested it with bottle.download.osgeo.org, using the download backup on osgeo4.

osgeo4's upload speed is around 10-13 MB/s, much faster than osgeo7's, perhaps because it isn't pounded on as much.

But after setting it up as described in the load_balancing guide, I still ended up with osgeo7's slow speed.

My nginx config for bottle.download.osgeo.org on osgeo7-nginx looked something like this:

upstream bottle-app {
    #least_conn;
    #server download.lxd;
    server bottle.staging.osgeo.org;
}
server {
    server_name  bottle.download.osgeo.org;
    listen 80 proxy_protocol; # managed by Certbot
    set_real_ip_from 140.211.15.0/24;
    real_ip_header proxy_protocol;

    access_log /var/log/nginx/bottle.download.osgeo.org.access_log pcombined;
    error_log /var/log/nginx/bottle.download.osgeo.org.error_log info;
    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        #try_files $uri $uri/ =404;
        client_max_body_size 0;
        include /etc/nginx/proxy_protocol_params;
        proxy_pass http://bottle-app/;
        proxy_redirect off;
    }

    #listen 80 proxy_protocol; # managed by Certbot

    listen 443 ssl proxy_protocol; # managed by Certbot
    # ... (remaining lines omitted)

}

comment:2 by robe, 2 years ago

I just had an even crazier thought about this.

I think the connection between the servers is very fast; it's the outbound push to the internet that is the bottleneck.

So if I set up DNS round robin for download, and simply have osgeo3, osgeo4 and osgeo9 each run a redundant nginx config for download that points back to osgeo7 (with download accepting all of them as proxies), that might work. I'm going to give that a try with bottle.download.osgeo.org.

This, of course, still requires setting up round robin at the DNS level, and folks will still need to use upload.osgeo.org for uploading, since that will be the only host with the SSH port open. Depending on which server download.osgeo.org resolves to, the SSH port may or may not be open.
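A minimal sketch of the proxy-only download vhost that one of the extra round-robin hosts (say osgeo9) could run. It assumes the host simply relays to upload.osgeo.org over HTTPS; the certificate paths are placeholders and the header choices are assumptions, not the config actually deployed:

```nginx
# Hypothetical front-proxy vhost on a round-robin member (e.g. osgeo9).
server {
    listen 80;
    listen 443 ssl;
    server_name download.osgeo.org;

    # Placeholder certificate paths.
    ssl_certificate     /etc/nginx/ssl/download.osgeo.org.crt;
    ssl_certificate_key /etc/nginx/ssl/download.osgeo.org.key;

    access_log /var/log/nginx/download.osgeo.org.access_log combined;

    location / {
        # Relay every request back to the single copy behind osgeo7.
        proxy_pass https://upload.osgeo.org;
        proxy_ssl_server_name on;  # send SNI so the upstream picks the right certificate/vhost

        # Hand the real client address to the backend so it can be logged there.
        proxy_set_header X-Real-IP       $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

On the receiving side, the forwarded address only shows up in the logs if the backend's log_format includes $http_x_forwarded_for (or the realip module is told to trust that header from the proxies); otherwise every proxied request is logged with the proxy's own address.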

Last edited 2 years ago by robe

comment:3 by robe, 2 years ago

Milestone: changed from Sysadmin Contract 2022-I to Sysadmin Contract 2022-II

I've started working on this; a lot of the notes are on #2705.

So I have set up DNS round robin for download.osgeo.org and notified people (via project and discuss) to use upload.osgeo.org for SFTP. upload.osgeo.org will remain connected only to osgeo7-download.

For testing, I have download-cache.osgeo.org, which consists of osgeo4 and osgeo9, both pulling directly from upload.osgeo.org.

download.osgeo.org consists of osgeo7 (pulling via download.lxd) and osgeo9 (pulling via upload.osgeo.org). Note that both ultimately go through the nginx on osgeo7, so nginx itself is not the cause of slow downloads on osgeo7.

All osgeo9 does is proxy straight to upload.osgeo.org (nginx) -> osgeo7-download, yet when this path is active, download speeds range anywhere from 6 MB/s to 20 MB/s.

As for how this is possible, my guess is that the connectivity between the hosts is at least 100GB/s, but the throughput out to the world is much lower. Since osgeo7's outbound network is heavily taxed, its outbound speed is crippled. osgeo9 only caches the current request, pulling it from download at 100-1000GB/s, and since it isn't taxed with as many requests, it can push out much faster.
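For what it's worth, the "only caches the current request" behaviour sounds like nginx's default response buffering: with proxy_buffering on (the default), nginx reads the upstream response as fast as it can into memory buffers and, once those fill, spools the rest to a temporary file (up to proxy_max_temp_file_size, 1024m by default), then feeds the client at whatever speed the client manages. A sketch of the relevant directives inside the front proxy's location block; the sizes are illustrative assumptions, not tuned values or what osgeo9 currently uses:

```nginx
# Illustrative buffering settings for the front proxy's download location.
location / {
    proxy_pass https://upload.osgeo.org;
    proxy_ssl_server_name on;

    proxy_buffering on;              # default: drain the upstream as fast as possible
    proxy_buffers 16 64k;            # per-connection memory buffers
    proxy_max_temp_file_size 4096m;  # spool up to 4 GiB of each response to disk (default 1024m)
}
```

When the client is slower than the upstream, a larger limit lets the osgeo7 side of a big transfer finish sooner, at the cost of disk I/O and temp space on the proxy; setting proxy_max_temp_file_size to 0 disables spooling to disk entirely.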

Putting this in place immediately ballooned osgeo9 traffic.

Here are stats from osgeo9:

osgeo9 vnstat output as of now. Note that I turned it on 2 days ago, so the 2022-03 figure of 7.58 TiB is just for those 2 days. I think that total also includes copying from upload.osgeo.org, so the traffic actually served is really about half of that.

Anyway it's huge and I can't believe how huge it is.

On osgeo9 as of now

vnstat output

                     rx      /      tx      /     total    /   estimated
 enp2s0f0:
       2022-02      5.44 GiB  /  425.10 GiB  /  430.54 GiB
       2022-03      3.60 TiB  /    3.98 TiB  /    7.58 TiB  /    8.42 TiB
     yesterday      1.27 TiB  /    1.24 TiB  /    2.51 TiB
         today      2.16 TiB  /    2.11 TiB  /    4.27 TiB  /    4.73 TiB

vnstat -d 5 #for last 5 days

# note: late 3/26 is when I added it to the round robin

 enp2s0f0  /  daily

          day        rx      |     tx      |    total    |   avg. rate
     ------------------------+-------------+-------------+---------------
     2022-03-24   191.36 MiB |   21.29 GiB |   21.48 GiB |    2.14 Mbit/s
     2022-03-25   471.22 MiB |   21.44 GiB |   21.90 GiB |    2.18 Mbit/s
     2022-03-26   160.08 GiB |  174.21 GiB |  334.29 GiB |   33.24 Mbit/s
     2022-03-27     1.27 TiB |    1.24 TiB |    2.51 TiB |  255.63 Mbit/s
     2022-03-28     2.23 TiB |    2.17 TiB |    4.40 TiB |  483.15 Mbit/s
     ------------------------+-------------+-------------+---------------
     estimated      2.40 TiB |    2.34 TiB |    4.75 TiB |



Now on osgeo7:

vnstat

                      rx      /      tx      /     total    /   estimated
 eno1:
       2022-02      1.54 TiB  /  104.72 TiB  /  106.26 TiB
       2022-03      1.76 TiB  /  115.49 TiB  /  117.25 TiB  /  130.25 TiB
     yesterday     27.14 GiB  /    2.95 TiB  /    2.97 TiB
         today     44.84 GiB  /    4.18 TiB  /    4.22 TiB  /    4.66 TiB

vnstat -d 5 #for last 5 days

 eno1  /  daily

          day        rx      |     tx      |    total    |   avg. rate
     ------------------------+-------------+-------------+---------------
     2022-03-24    75.43 GiB |    4.45 TiB |    4.52 TiB |  460.09 Mbit/s
     2022-03-25    70.75 GiB |    4.40 TiB |    4.47 TiB |  454.90 Mbit/s
     2022-03-26    46.99 GiB |    4.24 TiB |    4.29 TiB |  436.80 Mbit/s
     2022-03-27    27.14 GiB |    2.95 TiB |    2.97 TiB |  302.77 Mbit/s
     2022-03-28    45.75 GiB |    4.26 TiB |    4.31 TiB |  473.15 Mbit/s
     ------------------------+-------------+-------------+---------------
     estimated     49.35 GiB |    4.60 TiB |    4.65 TiB |

So how do we solve this issue?

  1. Finish setting up osgeo8 to also act as a proxy (short-term solution). This one can be a true cache since it has much more disk space than osgeo9, so it can do a full rsync of download. One issue I am working out is that all the traffic coming through osgeo9 to osgeo7 is being logged as osgeo9 on the download container. That is both good and bad: good in that it's easy to see how much traffic osgeo9 is picking up, but bad in that I don't have a single authoritative log (then again, we wouldn't anyway with true round robin). The osgeo9 logs do show the true identity of the traffic it is handling.
  2. Curb traffic. I'm investigating nginx settings to, say, limit each user to 1 or 2 requests per second, or to limit bandwidth. I've been trying https://www.nginx.com/blog/rate-limiting-nginx/, but my settings seem to be ignored or aren't working as expected. There is a lot of bot traffic hogging resources that we really don't need. I still need to break down the stats to find the low-hanging fruit that should just be killed off.
  3. Set up a true CDN for download around the world (future plan; this could be costly). Something like KeyCDN comes to mind, as someone suggested a while back, since they offer an open source plan: https://www.keycdn.com/open-source-cdn. Given how much traffic this is, though, I suspect we'd quickly run out, or not be able to use download.osgeo.org as the name, which would make it worse than just adding some extra round-robin VMs on commercial cloud hosters (Hetzner, Atlantic, DigitalOcean come to mind). KeyCDN commercial pricing is $0.01/GB for NA/Europe above 100 TB/month, which would cover the bulk of our traffic. Given we are doing about 105-130 TB a month, if my math is right that would be about $1,050-$1,300/month (105,000-130,000 GB at $0.01/GB), and since vnstat reports binary TiB, the decimal figure billed would be roughly 10% higher. Way too much.
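A minimal rate-limiting sketch along the lines of the nginx blog post linked in item 2. The zone names, rates and sizes are illustrative assumptions, not tested values. Two things that may be worth checking if limits seem to be ignored (assumptions, not a diagnosis): limit_req counts requests, so a single multi-gigabyte download is just one request and is barely affected, which is what limit_conn and limit_rate are for; and the key ($binary_remote_addr) must be the real client address, otherwise every client arriving through a proxy shares a single bucket.

```nginx
# Hypothetical rate limits for download.osgeo.org; names and numbers are
# placeholders to tune.

# http { } context (e.g. a file in conf.d): one bucket per client address.
limit_req_zone  $binary_remote_addr zone=dl_req:10m rate=2r/s;
limit_conn_zone $binary_remote_addr zone=dl_conn:10m;

server {
    server_name download.osgeo.org;
    # ... existing listen / ssl / real_ip settings ...

    location / {
        limit_req  zone=dl_req burst=10 nodelay;  # 2 req/s per client, short bursts allowed
        limit_req_status 429;                      # answer 429 instead of the default 503

        limit_conn dl_conn 4;                      # at most 4 simultaneous connections per client
        limit_conn_status 429;

        limit_rate_after 100m;                     # first 100 MB of each response at full speed,
        limit_rate       5m;                       # then cap that connection at about 5 MB/s
    }
}
```

Because limit_rate applies per connection, limit_conn 4 combined with limit_rate 5m caps a single client at roughly 20 MB/s once past the first 100 MB of each file.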

comment:4 by robe, 17 months ago

Milestone: changed from Sysadmin Contract 2022-II to Sysadmin Contract 2023-I

Pushing to the next milestone since my contract funds have been used up.

comment:5 by robe, 5 months ago

Milestone: changed from Sysadmin Contract 2023-I to Sysadmin Contract 2024-I

Moving my prior still-open items to the next proposed milestone.
