Opened 5 years ago

Closed 5 years ago

#2318 closed task (fixed)

dronie.osgeo.org is down (502 Bad Gateway)

Reported by: strk
Owned by: sac@…
Priority: blocker
Milestone: Sysadmin Contract 2019-I
Component: SysAdmin
Keywords:
Cc:

Description

nginx reports 502 Bad Gateway when going to dronie.osgeo.org

Change History (18)

comment:1 by strk, 5 years ago

Looking at lxc list, I see that dronie-server is assigned 6 internal IPs. Why is that? The wiki doesn't mention any detail about it. Could the multiple IPs be a reason for the failure?

comment:2 by strk, 5 years ago

From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container it looks like the startup script for the server does not exist, and everything is done manually. If confirmed I'd recommend turning it into a script instead because it is very fragile to only do it manually

comment:3 by robe, 5 years ago

strk, the Dronie server is running docker, and docker has its own internal network, so those are all drone agents.

comment:4 by robe, 5 years ago

drone agents and drone server
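To see which interface each of those addresses belongs to, one option is to list the IPv4 addresses per interface inside the container. The sketch below is illustrative only: the helper name summarise_ipv4 is made up here, and it assumes the container has the iproute2 ip tool. Addresses on docker0 or on br-* interfaces would belong to docker's own networks rather than to lxd.

```shell
# Sketch: reduce "ip -o -4 addr show" output to "interface address/prefix" pairs.
# Reads the command output on stdin so it can be fed from lxc exec.
summarise_ipv4() {
    # In the one-line (-o) format, field 2 is the interface name,
    # field 3 is the address family and field 4 is the address with prefix.
    awk '$3 == "inet" {print $2, $4}'
}
```

Possible usage from the host: lxc exec dronie-server -- ip -o -4 addr show | summarise_ipv4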

comment:5 by strk, 5 years ago

So what happened to the service ? Do you have any idea ?

comment:6 by strk, 5 years ago

How does lxc know which IP addresses to assign to which container ? Is there an external script (on the host) mentioning which IPs to assign ? Should this be discussed in a private ticket ?

comment:7 by strk, 5 years ago

nginx reports: no live upstreams while connecting to upstream

comment:8 by strk, 5 years ago

From the nginx machine: Host dronie-server.lxd not found: 3(NXDOMAIN) -- that would explain it. Did the internal DNS go down?
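A check like this can be repeated from the nginx machine with a small helper. The following is a minimal sketch, assuming getent is available there and that nginx resolves the upstream name through the normal system resolver; the function name check_resolves is hypothetical.

```shell
# Sketch: report whether a hostname resolves through the system resolver.
# Prints "name -> address" on success; prints a message on stderr and
# returns 1 when the name cannot be resolved.
check_resolves() {
    local name="$1" addr
    # getent consults the same NSS sources (hosts file, DNS) as most programs.
    addr="$(getent hosts "$name" | awk 'NR == 1 {print $1}')"
    if [ -n "$addr" ]; then
        printf '%s -> %s\n' "$name" "$addr"
    else
        printf '%s: not resolvable\n' "$name" >&2
        return 1
    fi
}
```

Possible usage: check_resolves dronie-server.lxd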

comment:9 by strk, 5 years ago

Milestone: Sysadmin Contract 2019-I

For easy checking: https://dronie.osgeo.org/ (still down at time of writing) NOTE: shouldn't Sysadmin Contract 2019-I milestone be closed ?

comment:10 by robe, 5 years ago

strk, is it still down? I can get to dronie.osgeo.org, and could when you reported this.

However when I try to log in I get this error:

Get https://git.osgeo.org/gitea/api/v1/users/robe/tokens: dial tcp: lookup git.osgeo.org on 10.88.1.1:53: read udp 172.17.0.2:51440->10.88.1.1:53: i/o timeout

comment:11 by robe, 5 years ago

To answer your question, there is no external script in use. When you set up lxd, it internally sets up a DHCP server and DNS and assigns IPs accordingly. The IPs rarely change. The server is asking for a reboot, so maybe I can do that over the weekend.

comment:12 by robe, 5 years ago

Looking at it now. I tried shutting down the dronie-server, but it refuses to go down.

comment:13 by robe, 5 years ago

The old container still won't shut off, but I was able to create a new container from the 6-09-2019 snapshot of the dronie-server. At a glance the new one seems fine, and I then upgraded docker on it.
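Creating a container from a snapshot can be done with lxc copy. The sketch below is an illustration under assumptions, not the exact command that was run: the function name copy_from_snapshot and the LXC override variable are made up here, and the snapshot name has to be looked up first (lxc info dronie-server lists the snapshots).

```shell
# Sketch: create a new container from an existing container's snapshot.
# Set LXC=echo to print the command instead of running it.
copy_from_snapshot() {
    local src="$1" snapshot="$2" target="$3"
    # "lxc copy <container>/<snapshot> <new-name>" creates the new container
    # from the snapshot and leaves the original untouched.
    "${LXC:-lxc}" copy "${src}/${snapshot}" "${target}"
}
```

Possible usage, with SNAPSHOT_NAME standing in for the real snapshot name: copy_from_snapshot dronie-server SNAPSHOT_NAME dronie-server-2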

I've shut it off since I can't rename it to the old name without renaming/destroying the old one.

I suppose I could just tell nginx to use the new name and then switch back to the old name once I have removed the old one, as the old one is not accessible anymore (it seems to have lost its IPs on the failed shutdown).

comment:14 by strk, 5 years ago

Concerning situation. Can we count on stability of this new containerization architecture ?

What happened to data ? Are old builds still accessible ?

comment:15 by robe, 5 years ago

I was able to stop the dronie-server container by killing the process attached to it.

ps -faux | grep dronie-server

There still seems to be something clinging to the name though because when I tried to do

lxc mv dronie-server dronie-server-bad #went fine

#get into the container
lxc exec dronie-server-2 bash
#in dronie-server-2 force graceful shutdown
shutdown -P -H now

#now back in osgeo7
lxc mv dronie-server-2 dronie-server # went fine

#but this failed
lxc start dronie-server

#so I had to rename it back to 
lxc mv dronie-server dronie-server-2

As far as data goes, the data as of the 6/9 snapshot is fine. I suspect that if I went with the 6-11 snapshot I would see the data there too.

I feel like the server needs to be rebooted (since it does say system restart required).

So there could be an underlying funkiness with the network causing this that rebooting will resolve.

Now is not a good time to bring everything down for this though as other things are working fine.

Once we get osgeo4 reformatted, we'll be in much better condition as we can replicate containers between the two and this server really should be moved to the new osgeo4.

in reply to:  2 comment:16 by robe, 5 years ago

Replying to strk:

From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container it looks like the startup script for the server does not exist, and everything is done manually. If confirmed I'd recommend turning it into a script instead because it is very fragile to only do it manually

strk, I'm lost as to what you mean here -- that is to start up the docker drone server, and it gets started on bootup because it's just the docker configuration. When would the start-up script ever be run?

It's not like drone.osgeo.org, which runs directly on the server; the dronie server is running in a docker container.

comment:17 by strk, 5 years ago

Even the docker startup command would be good to have in a script, because things can go bad: docker might need a reinstall, or you may want to move the service to another machine. What I'm saying is that you don't want to rely on the docker daemon to remember, on your behalf, how you started it.

What we want (and do have even!) is a git repository with the scripts to start the server. I didn't find a clone of that repository on the server actually running drone.

comment:18 by robe, 5 years ago

Resolution: fixed
Status: new → closed

I think this is done.
