#2894 closed task (fixed)
Update of grass.osgeo.org to Debian 11
Reported by: | neteler | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | Sysadmin Contract 2023-I |
Component: | SysAdmin | Keywords: | debian |
Cc: | | | |
Description
At this time https://grass.osgeo.org/ runs Debian GNU/Linux 10 (buster).
As it is an LXD container, I am not sure how an upgrade should be done.
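For context, a minimal sketch of how such an in-place upgrade is typically done for an LXD container (container and snapshot names are illustrative, and in practice the upgrade may be driven by an ansible script instead):

```sh
# Hedged sketch of an in-place Debian 10 -> 11 upgrade inside an LXD container;
# snapshot first so the change can be rolled back.
lxc snapshot grass pre-bullseye            # on the LXD host

# Inside the container (e.g. via: lxc exec grass -- bash):
sed -i 's/buster/bullseye/g' /etc/apt/sources.list
sed -i 's|bullseye/updates|bullseye-security|g' /etc/apt/sources.list   # the security suite was renamed in Debian 11
apt update
apt full-upgrade
```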
Change History (14)
comment:1 by , 19 months ago
Milestone: | Unplanned → Sysadmin Contract 2023-I |
---|---|
comment:2 by , 19 months ago
I did a trial-run upgrade of grass on a copy of it, using the ansible upgrade script.
It seemed to go fine, except that somewhere along the line it ended up with a dead PostgreSQL 9.6 cluster and an additional 13 main cluster. pg_lsclusters shows this in my staging container:
    Ver Cluster Port Status                Owner    Data directory                Log file
    9.6 main    5432 down,binaries_missing postgres /var/lib/postgresql/9.6/main  /var/log/postgresql/postgresql-9.6-main.log
    11  main    5433 online                postgres /var/lib/postgresql/11/main   /var/log/postgresql/postgresql-11-main.log
    13  main    5434 online                postgres /var/lib/postgresql/13/main   /var/log/postgresql/postgresql-13-main.log
and for current grass:
    Ver Cluster Port Status Owner    Data directory               Log file
    11  main    5433 online postgres /var/lib/postgresql/11/main  /var/log/postgresql/postgresql-11-main.log
Neither server has any databases on it, though, aside from the default ones. What are you using PostgreSQL for? Some CI stuff, or do you not need it at all and it was perhaps installed accidentally when trying to install just the clients?
I also see a MySQL server installed, and again there are no databases in it.
Those are fine to keep, but I'd rather remove services not being used to minimize issues with future upgrades.
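If the unused pieces are indeed dropped, something along these lines would do it (a sketch; pg_dropcluster ships with postgresql-common, which cluster to keep is the open question above, and package names depend on what is actually installed):

```sh
# Hedged sketch: remove the dead 9.6 cluster and the extra 13/main cluster
# created during the upgrade, keeping only 11/main. Check first, then drop.
pg_lsclusters
pg_dropcluster 9.6 main            # already down, binaries missing
pg_dropcluster --stop 13 main
# Optionally remove the unused server packages entirely, e.g.:
# apt purge postgresql-13 mariadb-server
```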
comment:3 by , 19 months ago
I should note there are also some failed services, but I think those might be an LXD issue and not related to the upgrade.
On the upgraded one (grass-staging on osgeo4, which is a copy of the prod one) I see 4 failed services.
    sudo systemctl list-units --state failed

shows:

    UNIT                             LOAD   ACTIVE SUB    DESCRIPTION
    ● binfmt-support.service         loaded failed failed Enable support for additional executable binary formats
    ● systemd-networkd.service       loaded failed failed Network Service
    ● systemd-resolved.service       loaded failed failed Network Name Resolution
    ● systemd-journald-audit.socket  loaded failed failed Journal Audit Socket
    ● systemd-networkd.socket        loaded failed failed Network Service Netlink Socket
and on grass (the currently running one):

    UNIT                             LOAD   ACTIVE SUB    DESCRIPTION
    ● systemd-journald-audit.socket  loaded failed failed Journal Audit Socket

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.
I'm pretty sure the current systemd-journald one is some permissions issue in LXD, which I'll investigate.
For the one I upgraded, I hadn't checked before starting the upgrade whether all of those were already failing. So it might again very well be permission issues, because osgeo4 is running a newer version of Ubuntu (22.04) vs. osgeo7, which is still on Ubuntu 20.04.
I'll review these failures before I do the upgrade on grass production.
comment:4 by , 19 months ago
Thanks for the trials!
The PostgreSQL and MySQL servers can be "brute-force" installed; we only need them to compile GRASS with those backends in order to get the manual pages for them:
https://grass.osgeo.org/grass82/manuals/sql.html
A single (empty) installation of both would be perfect. We may also leave that out and drop them for now, re-installing them at a later point to bring back the manual pages.
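For reference, the backend-related build flags look roughly like this (a sketch; the include paths are assumptions based on typical Debian package locations):

```sh
# Hedged sketch: build GRASS with the PostgreSQL and MySQL/MariaDB database
# drivers enabled so the corresponding manual pages get generated.
./configure \
    --with-postgres --with-postgres-includes=/usr/include/postgresql \
    --with-mysql    --with-mysql-includes=/usr/include/mariadb
make
```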
comment:5 by , 19 months ago
I checked all the failures in the list:

    UNIT                             LOAD   ACTIVE SUB    DESCRIPTION
    ● binfmt-support.service         loaded failed failed Enable support for additional executable binary formats
    ● systemd-networkd.service       loaded failed failed Network Service
    ● systemd-resolved.service       loaded failed failed Network Name Resolution
    ● systemd-journald-audit.socket  loaded failed failed Journal Audit Socket
    ● systemd-networkd.socket        loaded failed failed Network Service Netlink Socket
Those are all related to permission issues. It's unclear to me whether you actually need them, though. But the fix would be to make the container privileged, with:

    lxc config set grass security.nesting=true

That would fix all of them except the below, which is already failing anyway in prod, so I'll just remove that one:

    ● binfmt-support.service         loaded failed failed Enable support for additional executable binary formats
    ● systemd-journald-audit.socket  loaded failed failed Journal Audit Socket
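If the nesting route is taken later, applying and verifying it would look roughly like this (a sketch; the restart may not be strictly necessary for the setting to take effect):

```sh
# Hedged sketch: enable nesting on the container, restart it, and re-check
# for failed units afterwards.
lxc config set grass security.nesting=true
lxc restart grass
lxc exec grass -- systemctl list-units --state=failed
```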
I'm going to start the upgrade process.
comment:6 by , 19 months ago
Looks like it's still in the middle of the upgrade. I should have disabled cron before I started, as it looks like your build job started running and might be trying to use some Python packages that were in the middle of being upgraded.
I'll let it run for another hour to see if it finishes, and if not I'll kill your running job.
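For next time, a simple way to avoid that race (a sketch, assuming the build jobs are plain cron jobs on this host):

```sh
# Hedged sketch: pause cron so scheduled build jobs don't run against
# half-upgraded Python packages, then re-enable it after the upgrade.
systemctl stop cron
# ... run the dist-upgrade ...
systemctl start cron
```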
comment:7 by , 19 months ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I canceled your jobs so I could complete the upgrade. I tried running your hugo_clean_and_update_job.sh but got this error:

    nice: ‘/usr/local/bin/hugo’: No such file or directory

I didn't check whether that was missing before or is a result of the changes.
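A quick way to tell would be (a sketch; the path comes from the error above):

```sh
# Hedged check: see whether the hugo binary survived the upgrade or moved.
ls -l /usr/local/bin/hugo      # path taken from the error message above
command -v hugo                # or whether it lives elsewhere on PATH
```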
After the upgrade the following showed as failing, just like they did in staging:

    ● binfmt-support.service         loaded failed failed Enable support for additional executable binary formats
    ● modprobe@drm.service           loaded failed failed Load Kernel Module drm
    ● systemd-logind.service         loaded failed failed User Login Management
    ● systemd-networkd.service       loaded failed failed Network Service
    ● systemd-resolved.service       loaded failed failed Network Name Resolution
    ● systemd-journald-audit.socket  loaded failed failed Journal Audit Socket
    ● systemd-networkd.socket        loaded failed failed Network Service Netlink Socket
I disabled or masked them so they don't show as failures:
    systemctl disable binfmt-support.service
    systemctl disable systemd-networkd-wait-online.service
    systemctl disable systemd-journald-audit.socket
    systemctl mask modprobe@drm.service
    systemctl mask systemd-logind.service
    systemctl disable systemd-networkd.service
    systemctl disable systemd-resolved.service
    systemctl mask systemd-journald-audit.socket
    systemctl disable systemd-networkd.socket
Can you test out your jobs to make sure they all still work, and see if you notice any other issues? Feel free to reopen if you still have issues.
comment:8 by , 19 months ago
Thanks for updating the machine!
The remaining hugo issue is now addressed in
https://github.com/OSGeo/grass-addons/pull/875
Will monitor the server for potential other glitches (which I do not expect).
comment:9 by , 19 months ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I discovered a problem:
rsync only works from inside the LXD container, while from outside:

    rsync --dry-run -avz --port=50026 grass.osgeo.org::grass-website grass-website
    rsync: [Receiver] getcwd(): Transport endpoint is not connected (107)
    rsync error: errors selecting input/output files, dirs (code 3) at util1.c(1122) [Receiver=3.2.7]
But there is also this new phenomenon, probably connected?

    ssh grasslxd
    shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected

Two days ago (or so) I did not get this issue. Maybe a problem on the jump host?
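A couple of checks that might narrow it down (a sketch; the "[Receiver] getcwd()" error is raised on the receiving, i.e. local, side, so the local working directory is the first suspect):

```sh
# Hedged diagnostic sketch: retry from a known-good local directory and
# confirm the rsync daemon answers at all.
cd /tmp
rsync --port=50026 grass.osgeo.org::       # list the modules the daemon exports
rsync --dry-run -avz --port=50026 grass.osgeo.org::grass-website grass-website
```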
comment:10 by , 19 months ago
@neteler,
I'm not having an issue.
I did:
    rsync --dry-run -avz --port=50026 grass.osgeo.org::grass-website grass-website
from one of my servers (not on OSUOSL) and it works fine for me.
Could it maybe be a port-blocking issue on your end?
comment:11 by , 19 months ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Magic, it now works from here as well (and ports aren't blocked).
Closing again :-)
comment:12 by , 5 weeks ago (follow-up: comment:13)
I don't think we should be disabling systemd-logind; that will probably break D-Bus user sessions. If SSH is slow to log in, that's probably https://github.com/systemd/systemd/issues/17866.
I didn't look too closely, but it looks like a bad LXC/AppArmor interaction that was fixed in bookworm in 2021: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995350
Instead, I fixed it by setting security.nesting=true on the container. Apparently this is fine for unprivileged containers, but we should check using lxc list security.privileged=true beforehand.
Of course, my preference would be to upgrade to a distro that has proper support.
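A sketch of that check, using the container name from earlier in this ticket:

```sh
# Hedged sketch: confirm the container is unprivileged before relying on
# nesting being safe there, then apply the setting.
lxc list security.privileged=true           # should not list "grass"
lxc config get grass security.privileged    # empty or "false" means unprivileged
lxc config set grass security.nesting=true
```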
PS: we also ran into an acpid issue (it was using a lot of CPU); apt purge acpi-support fixes that. Again, that might have been fixed, but a container doesn't need ACPI anyway.
comment:13 by , 5 weeks ago
Replying to lnicola:
> I don't think we should be disabling systemd-logind; that will probably break D-Bus user sessions. If SSH is slow to log in, that's probably https://github.com/systemd/systemd/issues/17866.
>
> I didn't look too closely, but it looks like a bad LXC/AppArmor interaction that was fixed in bookworm in 2021: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995350
>
> Instead, I fixed it by setting security.nesting=true on the container. Apparently this is fine for unprivileged containers, but we should check using lxc list security.privileged=true beforehand. Of course, my preference would be to upgrade to a distro that has proper support.
>
> PS: we also ran into an acpid issue (it was using a lot of CPU); apt purge acpi-support fixes that. Again, that might have been fixed, but a container doesn't need ACPI anyway.
Thanks, that did seem to fix the issue as well as the failing logrotate:

    lxc config set grass security.nesting=true
    lxc exec grass -- systemctl unmask systemd-logind.service
comment:14 by , 5 weeks ago
I also did this for grass-wiki, though the logrotate coming back might have been coincidental; we'll see if it stays up, but it was dead again in grass-wiki.
It would follow the same process as any other server, but I can take care of it if you want.
I think the last upgrade, from Debian 9 to Debian 10, you had done yourself. But given that I did run into issues upgrading some other 10s to 11, perhaps I should take care of it so I can roll back if there are issues.
I'll do a trial run on a backup of it on staging, and if that looks good I'll make the change here.
It usually doesn't take more than 2 hrs (of which at most about 1 hr is downtime).
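For completeness, a rough sketch of that trial-run-on-staging workflow with LXD (instance and snapshot names are illustrative):

```sh
# Hedged sketch: snapshot production, upgrade a copy on staging first, and
# only touch the production container once the copy looks good.
lxc snapshot grass-wiki pre-bullseye
lxc copy grass-wiki/pre-bullseye grass-wiki-staging
lxc start grass-wiki-staging
# ... run the Debian 10 -> 11 upgrade inside grass-wiki-staging and test ...
# If the production upgrade later goes wrong, the snapshot allows a rollback:
# lxc restore grass-wiki pre-bullseye
```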