Opened 5 years ago
Closed 5 years ago
#2318 closed task (fixed)
dronie.osgeo.org is down (502 Bad Gateway)
Reported by: strk
Priority: blocker
Milestone: Sysadmin Contract 2019-I
Component: SysAdmin
Description
nginx reports 502 Bad Gateway when going to dronie.osgeo.org
Change History (18)
comment:1 by , 5 years ago
follow-up: 16 comment:2 by , 5 years ago
From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container it looks like there is no startup script for the server and everything is done manually. If confirmed, I'd recommend turning it into a script, because doing it only manually is very fragile.
comment:3 by , 5 years ago
strk, the Dronie server runs Docker, and Docker has its own internal network, so those are all drone agents.
comment:6 by , 5 years ago
How does lxc know which IP addresses to assign to which container? Is there an external script (on the host) specifying which IPs to assign? Should this be discussed in a private ticket?
comment:8 by , 5 years ago
From the nginx machine:
Host dronie-server.lxd not found: 3(NXDOMAIN)
That would explain it. Did the internal DNS go down?
comment:9 by , 5 years ago
Milestone: → Sysadmin Contract 2019-I
For easy checking: https://dronie.osgeo.org/ (still down at time of writing)
NOTE: shouldn't the Sysadmin Contract 2019-I milestone be closed?
comment:10 by , 5 years ago
strk, is it still down? I can get to dronie.osgeo.org, and could when you reported this.
However, when I try to log in I get this error:
Get https://git.osgeo.org/gitea/api/v1/users/robe/tokens: dial tcp: lookup git.osgeo.org on 10.88.1.1:53: read udp 172.17.0.2:51440->10.88.1.1:53: i/o timeout
comment:11 by , 5 years ago
To answer your question: there is no external script in use. When you set up LXD, it internally sets up a DHCP server and DNS and assigns IPs accordingly. The IPs rarely change. The server is asking for a reboot, so maybe I can do that over the weekend.
comment:12 by , 5 years ago
Looking at it now. I tried shutting down dronie-server, but it refuses to go down.
comment:13 by , 5 years ago
The old container still won't shut off, but I was able to create a new container from the 6-09-2019 snapshot of dronie-server. At a glance the new one seems fine, and I then upgraded Docker on it.
I've shut it off, since I can't rename it to the old name without renaming or destroying the old one.
I suppose I could just tell nginx to use the new name and switch back to the old name once I have removed the old container, as the old one is not accessible anymore (it seems to have lost its IPs on the failed shutdown).
comment:14 by , 5 years ago
This is a concerning situation. Can we count on the stability of this new containerization architecture?
What happened to the data? Are old builds still accessible?
comment:15 by , 5 years ago
I was able to stop the dronie-server container by killing the process attached to it.
ps -faux | grep dronie-server
There still seems to be something clinging to the name, though, because when I tried the following:
lxc mv dronie-server dronie-server-bad  # went fine
# get into the container
lxc exec dronie-server-2 bash
# in dronie-server-2: force a graceful shutdown
shutdown -P -H now
# now back in osgeo7
lxc mv dronie-server-2 dronie-server  # went fine
# but this failed
lxc start dronie-server
# so I had to rename it back
lxc mv dronie-server dronie-server-2
As for the data, the data as of the 6/9 snapshot is fine. I suspect that if I went with the 6-11 snapshot I would see the data there too.
I feel like the server needs to be rebooted (it does say "System restart required"),
so there could be an underlying network problem causing this that a reboot would resolve.
Now is not a good time to bring everything down for this, though, as other things are working fine.
Once osgeo4 is reformatted we'll be in much better shape, since we can replicate containers between the two machines; this server really should be moved to the new osgeo4.
comment:16 by , 5 years ago
Replying to strk:
> From https://git.osgeo.org/gitea/sac/osgeo7/wiki/Dronie-Server-container it looks like the startup script for the server does not exist, and everything is done manually. If confirmed I'd recommend turning it into a script instead because it is very fragile to only do it manually
strk, I'm not sure what you mean here. That procedure is for starting the Docker drone server, and it gets started on boot because it's just the Docker configuration. When would the start-up script ever be run?
It's not like drone.osgeo.org, which runs directly on the server; the dronie server runs in a Docker container.
comment:17 by , 5 years ago
Even the Docker startup command would be good to have in a script, because things can go wrong: Docker might need a reinstall, or you may want to move the service to another machine. What I'm saying is that you don't want to rely on the Docker daemon remembering, on your behalf, how you started the container.
What we want (and even already have!) is a git repository with the scripts to start the server. I didn't find a clone of that repository on the server actually running drone.
Looking at lxc list, I see that dronie-server has been assigned 6 internal IPs. Why is that? The wiki doesn't mention anything about it. Could the multiple IPs be a reason for the failure?
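A minimal sketch of the kind of script meant here, assuming the Drone server is started with a single `docker run`. The script name, container name, image tag, data path, env file and published port are all hypothetical placeholders, not the actual dronie configuration (which belongs in the git repository mentioned above).

```shell
#!/usr/bin/env bash
# Hypothetical start-up script for the Drone server container (start-dronie.sh).
# All values below are placeholder assumptions, not the real dronie settings;
# each one can be overridden from the environment.
set -euo pipefail

DRONE_NAME="${DRONE_NAME:-drone-server}"                  # assumed container name
DRONE_IMAGE="${DRONE_IMAGE:-drone/drone:1}"               # assumed image and tag
DRONE_DATA="${DRONE_DATA:-/var/lib/drone}"                # assumed host directory for server data
DRONE_ENV_FILE="${DRONE_ENV_FILE:-/etc/drone/server.env}" # assumed file holding Gitea URL and secrets

# Fill DRONE_RUN_ARGS with the complete `docker run` argument list, so how the
# container is started is recorded here rather than only in the Docker daemon.
# --restart unless-stopped: come back after a daemon/host restart unless stopped by hand.
# --volume: keep server data outside the container.
# --publish 80:80: assumed port that nginx proxies to.
drone_run_args() {
  DRONE_RUN_ARGS=(
    run --detach
    --name "$DRONE_NAME"
    --restart unless-stopped
    --env-file "$DRONE_ENV_FILE"
    --volume "$DRONE_DATA:/data"
    --publish 80:80
    "$DRONE_IMAGE"
  )
}

case "${1:-}" in
  start)
    drone_run_args
    # Remove any stale container with the same name, then recreate it
    # from the arguments recorded above.
    docker rm --force "$DRONE_NAME" >/dev/null 2>&1 || true
    docker "${DRONE_RUN_ARGS[@]}"
    ;;
  print)
    drone_run_args
    echo "docker ${DRONE_RUN_ARGS[*]}"
    ;;
  *)
    echo "usage: $0 {start|print}" >&2
    ;;
esac
```

Running `start-dronie.sh print` shows the exact command without touching Docker, and `start-dronie.sh start` recreates the container from it. With the arguments kept in a versioned script, a Docker reinstall or a move to another host only requires rerunning it.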