Opened 19 months ago

Closed 19 months ago

Last modified 5 weeks ago

#2894 closed task (fixed)

Update of grass.osgeo.org to Debian 11

Reported by: neteler
Owned by: sac@…
Priority: normal
Milestone: Sysadmin Contract 2023-I
Component: SysAdmin
Keywords: debian
Cc:

Description

At this time, https://grass.osgeo.org/ runs Debian GNU/Linux 10 (buster).

As it is an LXD container, I am not sure how the upgrade should be done.

Change History (14)

comment:1 by robe, 19 months ago

Milestone: Unplanned → Sysadmin Contract 2023-I

It would follow the same process as any other server, but I can take care of it if you want.

I think you did the last upgrade yourself, from Debian 9 to Debian 10. But given that I did run into issues upgrading some other Debian 10 servers to 11, perhaps I should take care of it so I can roll back if there are issues.

I'll do a trial run on a backup of it on staging, and if it looks good I'll make the change here.

It usually takes no more than 2 hours, of which at most about 1 hour is downtime.
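For readers unfamiliar with the process, one step of any Debian 10 to 11 upgrade is pointing APT at the new release. The sketch below is an illustration only and is not the Ansible script used for this ticket; the function name `buster_to_bullseye` is made up for the example. It rewrites source lines read from stdin, taking into account that the security suite was renamed from `buster/updates` to `bullseye-security` in Debian 11.

```shell
# Sketch: rewrite APT source lines from buster (Debian 10) to bullseye (Debian 11).
# Reads sources.list content on stdin and writes the result to stdout,
# so the function itself never modifies anything on disk.
# buster_to_bullseye is a hypothetical helper name, not an existing tool.
buster_to_bullseye() {
    # The security suite was renamed in Debian 11, so handle it first,
    # before the generic codename replacement.
    sed -e 's#buster/updates#bullseye-security#g' \
        -e 's/buster/bullseye/g'
}

# Example (not run here): preview the change before applying it.
# buster_to_bullseye < /etc/apt/sources.list | diff /etc/apt/sources.list -
```

Previewing with diff first keeps the change reviewable, which fits the trial-run-on-staging approach described above.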

comment:2 by robe, 19 months ago

I did a trial-run upgrade on a copy of grass using the Ansible upgrade script.

It seemed to go fine, except that somewhere along the line it ended up with a dead PostgreSQL 9.6 cluster and an additional 13 main cluster. pg_lsclusters shows this in my staging container:

Ver Cluster Port Status                Owner    Data directory               Log file
9.6 main    5432 down,binaries_missing postgres /var/lib/postgresql/9.6/main /var/log/postgresql/postgresql-9.6-main.log
11  main    5433 online                postgres /var/lib/postgresql/11/main  /var/log/postgresql/postgresql-11-main.log
13  main    5434 online                postgres /var/lib/postgresql/13/main  /var/log/postgresql/postgresql-13-main.log

and for current grass:

Ver Cluster Port Status Owner    Data directory              Log file
11  main    5433 online postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log

None of the clusters in either container holds any databases apart from the default ones. What are you using PostgreSQL for? Some CI stuff, or do you not need it at all and it was perhaps installed accidentally when trying to install just the clients?

I also see MySQL installed, and again there are no databases in it.

Those are fine to keep, but I'd rather remove services not being used to minimize issues with future upgrades.
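A small sketch of how leftover clusters like the 9.6 one could be spotted automatically. It only parses the pg_lsclusters table shown above; the function name `not_online_clusters` is made up for this example.

```shell
# Sketch: list PostgreSQL clusters whose status is not "online".
# Reads pg_lsclusters output on stdin and prints "<version> <cluster>" per match.
# not_online_clusters is a hypothetical helper name.
not_online_clusters() {
    # Column 4 is the Status column; skip the header row that starts with "Ver".
    awk '$1 != "Ver" && NF >= 4 && $4 !~ /^online/ { print $1, $2 }'
}

# Example (not run here):
# pg_lsclusters | not_online_clusters
#
# A stale cluster is normally removed with pg_dropcluster from postgresql-common,
# e.g. "pg_dropcluster 9.6 main". Whether that runs cleanly for a cluster in the
# binaries_missing state was not checked here.
```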

comment:3 by robe, 19 months ago

I should note there are also some failed services, but I think those might be an LXD issue and not related to the upgrade.

On the upgraded container (grass-staging on osgeo4, which is a copy of the prod one) I see 5 failed units:

 sudo systemctl list-units --state failed

shows:

  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● binfmt-support.service        loaded failed failed Enable support for additional executable binary formats
● systemd-networkd.service      loaded failed failed Network Service
● systemd-resolved.service      loaded failed failed Network Name Resolution
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket
● systemd-networkd.socket       loaded failed failed Network Service Netlink Socket

and on grass (currently running):

  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

I'm pretty sure the current systemd-journald-audit.socket failure is some permissions issue in LXD, which I'll investigate.

For the one I upgraded, I hadn't checked before starting whether all of those were already failing. So it might very well be a permissions issue again, because osgeo4 is running a newer version of Ubuntu (22.04) vs. osgeo7, which is still on Ubuntu 20.04.

I'll review these failures before I do the upgrade on grass production.
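Since the pre-upgrade state was not recorded on staging, a before/after comparison is a cheap safeguard. The sketch below assumes the output of `systemctl list-units --state failed` was saved to two files; the function name `new_failures` and the file names in the example are made up.

```shell
# Sketch: print units that are failed in the "after" listing but were not
# failed in the "before" listing. Both arguments are files holding saved
# output of "systemctl list-units --state failed".
# new_failures is a hypothetical helper name.
new_failures() {
    awk '
        {
            # Skip the leading status bullet if the listing has one.
            name = ($1 == "●") ? $2 : $1
        }
        # Unit names contain a dot (foo.service); header, legend and
        # blank lines do not, so they are ignored.
        name !~ /\./ { next }
        # First file: remember what was already failing.
        FILENAME == ARGV[1] { before[name] = 1; next }
        # Second file: report only failures that are new.
        !(name in before) { print name }
    ' "$1" "$2"
}

# Example (not run here; the file names are placeholders):
# systemctl list-units --state failed > failed-before.txt
# ... perform the upgrade ...
# systemctl list-units --state failed > failed-after.txt
# new_failures failed-before.txt failed-after.txt
```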

comment:4 by neteler, 19 months ago

Thanks for the trials!

The PostgreSQL and MySQL servers can be "brute-force" installed; we only need them to compile GRASS with those as backends, so that we get the manual pages for these backends:

https://grass.osgeo.org/grass82/manuals/sql.html

A single (empty) installation of each would be perfect. We may also leave them out for now and re-install them at a later moment to bring back the manual pages.

comment:5 by robe, 19 months ago

I checked all the failures in the list:

  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● binfmt-support.service        loaded failed failed Enable support for additional executable binary formats
● systemd-networkd.service      loaded failed failed Network Service
● systemd-resolved.service      loaded failed failed Network Name Resolution
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket
● systemd-networkd.socket       loaded failed failed Network Service Netlink Socket

Those are all related to permission issues. It's unclear to me if you actually need them, though. The fix would be to enable nesting on the container (security.nesting, which is a separate setting from security.privileged) with:

lxc config set grass security.nesting=true

That would fix all except the ones below, of which systemd-journald-audit.socket is already failing in prod anyway, so I'll just remove that one.

● binfmt-support.service        loaded failed failed Enable support for additional executable binary formats
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket

I'm going to start the upgrade process.

Last edited 19 months ago by robe

comment:6 by robe, 19 months ago

Looks like it's still in the middle of the upgrade. I should have disabled cron before I started, as it looks like your build job started running and might be trying to use some Python packages that were in the middle of being upgraded.

I'll let it run for another hour to see if it finishes, and if not I'll kill your running job.
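The lesson about cron can be turned into a small wrapper. This is a sketch under assumptions: `cron.service` is the unit name on Debian, while `with_cron_paused`, the `SYSTEMCTL` override variable and `./upgrade.sh` are made-up names for illustration. Stopping cron only prevents new jobs from starting; jobs that are already running may still need to be handled separately.

```shell
# Sketch: stop cron, run a command, then start cron again regardless of
# whether the command succeeded.
# SYSTEMCTL can be overridden (e.g. for a dry run); this variable is an
# assumption of this example, not a systemd convention.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# with_cron_paused is a hypothetical helper name.
with_cron_paused() {
    # Refuse to continue if cron could not be stopped.
    "$SYSTEMCTL" stop cron.service || return 1
    "$@"
    status=$?
    # Always bring cron back, then report the wrapped command's status.
    "$SYSTEMCTL" start cron.service
    return "$status"
}

# Example (not run here; upgrade.sh is a placeholder script name):
# with_cron_paused ./upgrade.sh
```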

comment:7 by robe, 19 months ago

Resolution: fixed
Status: new → closed

I canceled your jobs so I could complete the upgrade. I tried running your hugo_clean_and_update_job.sh but got this error:

nice: ‘/usr/local/bin/hugo’: No such file or directory

I didn't check whether that was missing before or is a result of the changes.

After the upgrade, the following showed as failing, just like they did in staging:

● binfmt-support.service        loaded failed failed Enable support for additional executable binary formats
● modprobe@drm.service          loaded failed failed Load Kernel Module drm
● systemd-logind.service        loaded failed failed User Login Management
● systemd-networkd.service      loaded failed failed Network Service
● systemd-resolved.service      loaded failed failed Network Name Resolution
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket
● systemd-networkd.socket       loaded failed failed Network Service Netlink Socket

I disabled or masked them so they don't show as failures:

systemctl disable binfmt-support.service
systemctl disable systemd-networkd-wait-online.service
systemctl disable systemd-journald-audit.socket
systemctl mask modprobe@drm.service 
systemctl mask systemd-logind.service  
systemctl disable systemd-networkd.service
systemctl disable systemd-resolved.service
systemctl mask systemd-journald-audit.socket
systemctl disable systemd-networkd.socket
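The two verbs used above behave differently: `disable` only removes the unit's enablement links, so it can still be started manually or pulled in by another unit, whereas `mask` links the unit to /dev/null so it cannot be started at all. A sketch for recording which state each unit ended up in, using `systemctl is-enabled`; the function name `unit_states` and the `SYSTEMCTL` override variable are made up for this example.

```shell
# Sketch: print "<unit> <state>" for each unit given, using the state word
# reported by "systemctl is-enabled" (e.g. enabled, disabled, masked).
# SYSTEMCTL can be overridden; this variable is an assumption of this example.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# unit_states is a hypothetical helper name.
unit_states() {
    for unit in "$@"; do
        # is-enabled exits non-zero for disabled or masked units, so the
        # exit status is ignored and only the printed word is kept.
        state=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null) || true
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}

# Example (not run here):
# unit_states systemd-logind.service binfmt-support.service systemd-resolved.service
```

Keeping such a listing in the ticket makes it easier to undo individual changes later, for instance unmasking a single unit.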

Can you test your jobs to make sure they all still work, and check whether you see any other issues? Feel free to reopen if you still have issues.

comment:8 by neteler, 19 months ago

Thanks for updating the machine!

The remaining hugo issue is now addressed in https://github.com/OSGeo/grass-addons/pull/875

Will monitor the server for potential other glitches (which I do not expect).

comment:9 by neteler, 19 months ago

Resolution: fixed
Status: closed → reopened

I discovered a problem:

rsync only works from inside the LXD container; from outside it fails:

rsync --dry-run -avz --port=50026 grass.osgeo.org::grass-website grass-website
rsync: [Receiver] getcwd(): Transport endpoint is not connected (107)
rsync error: errors selecting input/output files, dirs (code 3) at util1.c(1122) [Receiver=3.2.7]

But this is also a new phenomenon, probably connected?

ssh grasslxd
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected

Two days ago (or so) I did not get this issue. Maybe a problem on the jump host?
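One thing worth ruling out, as an assumption that is not confirmed anywhere in this ticket: the rsync message is tagged [Receiver], which for a pull like this is the local side, so the local working directory itself may be unusable (for example because it sits on a stale network or FUSE mount). A minimal sketch for checking a directory before running rsync in it; the function name `dir_usable` is made up.

```shell
# Sketch: succeed only if the given directory can be entered and its
# physical path resolved, which is roughly what getcwd() needs.
# dir_usable is a hypothetical helper name.
dir_usable() {
    # Run in a subshell so the caller's working directory is not changed.
    ( cd "$1" && pwd -P ) >/dev/null 2>&1
}

# Example (not run here):
# dir_usable "$PWD" || echo "current directory is not usable, cd elsewhere first" >&2
```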

comment:10 by robe, 19 months ago

@neteler,

I'm not having an issue.

I did:

rsync --dry-run -avz --port=50026 grass.osgeo.org::grass-website grass-website

from one of my servers (not on OSUOSL) and it works fine for me.

Could it maybe be a port-blocking issue on your end?

comment:11 by neteler, 19 months ago

Resolution: fixed
Status: reopened → closed

Magic, it now works from here as well (and ports aren't blocked).

Closing again :-)

comment:12 by lnicola, 5 weeks ago

I don't think we should be disabling systemd-logind, that will probably break D-Bus user sessions. If SSH is slow to log in, that's probably https://github.com/systemd/systemd/issues/17866.

I didn't look too closely, but it looks like a bad LXC/AppArmor interaction that was fixed in bookworm in 2021: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995350.

Instead, I fixed it by setting security.nesting=true on the container. Apparently this is fine for unprivileged containers, but we should check with lxc list security.privileged=true beforehand.

Of course, my preference would be to upgrade to a distro that has proper support.

PS: we also ran into an acpid issue (it was using a lot of CPU); apt purge acpi-support fixes that. Again, that might have been fixed, but a container doesn't need ACPI anyway.
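The check-then-set idea from this comment could be sketched as follows. Note the swap: instead of filtering with lxc list, this example queries the single instance with `lxc config get`, which prints an empty value when the key is unset. The function name `enable_nesting_if_unprivileged` and the `LXC` override variable are made up for illustration.

```shell
# Sketch: enable security.nesting on an instance only if it is not privileged.
# LXC can be overridden (e.g. for a dry run); this variable is an assumption
# of this example.
LXC=${LXC:-lxc}

# enable_nesting_if_unprivileged is a hypothetical helper name.
enable_nesting_if_unprivileged() {
    instance=$1
    # Empty output means the key is unset, i.e. the instance is unprivileged.
    privileged=$("$LXC" config get "$instance" security.privileged)
    if [ "$privileged" = "true" ]; then
        echo "refusing: $instance is privileged" >&2
        return 1
    fi
    "$LXC" config set "$instance" security.nesting=true
}

# Example (not run here):
# enable_nesting_if_unprivileged grass
```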

Last edited 5 weeks ago by lnicola

comment:13 by robe, 5 weeks ago, in reply to comment:12

Replying to lnicola:

Thanks, that did seem to fix the issue, as well as the failing logrotate:

lxc config set grass security.nesting=true
lxc exec grass -- systemctl unmask systemd-logind.service  


comment:14 by robe, 5 weeks ago

I also did this for grass-wiki, though logrotate coming back might have been coincidental. I'll see if it stays up, but it was dead again in grass-wiki.
