Opened 2 years ago

Closed 2 years ago

#2771 closed task (fixed)

osgeo7 snapshot failing and secure can't restart

Reported by: robe Owned by: sac@…
Priority: normal Milestone: Sysadmin Contract 2022-II
Component: SysAdmin Keywords:
Cc:

Description

Running into technical difficulties.

It looks like a lot of the containers on osgeo7 can't be snapshotted. I tried to fix secure and made the mistake of shutting it down.

Still troubleshooting. I have the 5/30/2022 snapshot of it running at the moment, and will shut down id.osgeo.org to prevent new accounts from being created while I resolve this.

Change History (4)

comment:1 by robe, 2 years ago

To clarify: when I said snapshot, I meant the 5/30/2022 snapshot of secure that I have running, which is the last successful snapshot. The other containers are still running as they were, even though some of them are in a state where osgeo4 can't create a snapshot of them.

I have disabled https://id.osgeo.org to prevent new registrations while I sort this issue out.

comment:2 by robe, 2 years ago

The first issue was that after I shut down secure, it wouldn't start. It gave an error to the effect of:

Failed to run: zfs set mountpoint

To fix I did:

sudo zfs set mountpoint=/var/snap/lxd/common/lxd/storage-pools/default/containers/secure canmount=noauto osgeo7/containers/secure
zfs umount osgeo7/containers/secure
zfs mount osgeo7/containers/secure

live was having a similar issue, so I did the same and started it up.
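Since the same three zfs commands were applied to both secure and live, here is a minimal sketch of a helper that builds them for a given container. The function name remount_commands and the constant LXD_CONTAINERS_DIR are hypothetical, and the path assumes the LXD snap layout shown in the commands above. The helper only prints the commands and does not run anything.

```python
import shlex

# Assumption: LXD snap layout, copied from the mountpoint used above.
LXD_CONTAINERS_DIR = "/var/snap/lxd/common/lxd/storage-pools/default/containers"


def remount_commands(pool, container):
    """Return the three zfs commands from this ticket as argument lists."""
    dataset = f"{pool}/containers/{container}"
    mountpoint = f"{LXD_CONTAINERS_DIR}/{container}"
    return [
        # Reset the mountpoint and keep the dataset from auto-mounting.
        ["zfs", "set", f"mountpoint={mountpoint}", "canmount=noauto", dataset],
        # Remount so the new mountpoint takes effect.
        ["zfs", "umount", dataset],
        ["zfs", "mount", dataset],
    ]


# Print the commands for review; run them by hand (as root) if they look right.
for name in ("secure", "live"):
    for cmd in remount_commands("osgeo7", name):
        print(shlex.join(cmd))
```

Printing instead of executing keeps a person in the loop, which seems prudent when changing mountpoints on a production host.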

secure had an additional issue, one I couldn't find documented anywhere.

This was a complicated one. I documented my change here: https://discuss.linuxcontainers.org/t/lxc-snapshot-and-lxc-start-error-instance-snapshot-record-count-doesnt-match-instance-snapshot-volume-record-count/14245/3

More detail here:

First I made a backup of the LXD database to inspect, with this:

 sudo cp /var/snap/lxd/common/lxd/database/global/db.bin lxd-global-220601

Then I inspected the SQLite backup as follows:

sudo apt install sqlite3
sqlite3 lxd-global-220601

# in sqlite console

.tables
.mode column
.headers on

SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'secure';

output: 32

SELECT count(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'secure';

output: 37

SELECT v.id
FROM
  (SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'secure') AS v
  LEFT JOIN
  (SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'secure') AS i ON i.name = v.name
WHERE i.name IS NULL;

This returned the following storage_volumes_snapshots ids, which have no matching instance snapshot:

4701
4714
4737
4761
4779
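The count comparison and the LEFT JOIN check above can be tried out safely on a toy database. The sketch below uses Python's sqlite3 module with an in-memory database. The table and column names are the ones used in the queries in this ticket, but the schema is cut down to just those columns and the rows are invented for illustration, so it is not the real LXD schema. The helper names are hypothetical.

```python
import sqlite3

# Same anti-join idea as above: volume snapshots with no instance snapshot of that name.
ORPHAN_SQL = """
SELECT vs.id
FROM storage_volumes AS v
INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id
LEFT JOIN (
    SELECT s.name
    FROM instances AS i
    INNER JOIN instances_snapshots AS s ON i.id = s.instance_id
    WHERE i.name = :name
) AS inst ON inst.name = vs.name
WHERE v.name = :name AND inst.name IS NULL
ORDER BY vs.id
"""


def make_toy_db():
    """Build an in-memory database with invented rows (not the real LXD schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE instances (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE instances_snapshots (id INTEGER PRIMARY KEY, instance_id INTEGER, name TEXT);
        CREATE TABLE storage_volumes (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE storage_volumes_snapshots (id INTEGER PRIMARY KEY, storage_volume_id INTEGER, name TEXT);
        INSERT INTO instances VALUES (1, 'secure');
        INSERT INTO storage_volumes VALUES (10, 'secure');
        INSERT INTO instances_snapshots VALUES (100, 1, 'snap0'), (101, 1, 'snap1');
        INSERT INTO storage_volumes_snapshots VALUES (200, 10, 'snap0'), (201, 10, 'snap1'), (202, 10, 'snap2');
    """)
    return conn


def snapshot_counts(conn, name):
    """Return (instance snapshot count, volume snapshot count) for one container."""
    inst = conn.execute(
        "SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs "
        "ON v.id = vs.instance_id WHERE v.name = ?", (name,)).fetchone()[0]
    vol = conn.execute(
        "SELECT count(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs "
        "ON v.id = vs.storage_volume_id WHERE v.name = ?", (name,)).fetchone()[0]
    return inst, vol


def orphan_volume_snapshot_ids(conn, name):
    """Return ids of volume snapshot rows that lack a matching instance snapshot."""
    return [row[0] for row in conn.execute(ORPHAN_SQL, {"name": name})]


conn = make_toy_db()
print(snapshot_counts(conn, "secure"))             # (2, 3): the counts disagree
print(orphan_volume_snapshot_ids(conn, "secure"))  # [202]: the extra volume snapshot
```

In the toy data the mismatch of one row corresponds to the single orphan id, the same way the 32 versus 37 mismatch above corresponds to the five ids listed.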

Then ran this:

lxd sql global "DELETE FROM storage_volumes_snapshots WHERE id IN(4701,4714,4737,4761,4779)"
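Because this statement edits the live LXD database directly, it may be worth building it from an explicit id list rather than typing it by hand. A minimal sketch follows; the helper name build_delete_sql is hypothetical, and it only constructs the statement and the lxd sql argument list without running anything.

```python
def build_delete_sql(ids):
    """Build the DELETE statement for a non-empty list of integer ids."""
    if not ids:
        raise ValueError("refusing to build a DELETE with an empty id list")
    for i in ids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(i, bool) or not isinstance(i, int):
            raise TypeError(f"id must be an integer, got {i!r}")
    id_list = ",".join(str(i) for i in ids)
    return f"DELETE FROM storage_volumes_snapshots WHERE id IN({id_list})"


orphan_ids = [4701, 4714, 4737, 4761, 4779]  # ids found by the query in this ticket
sql = build_delete_sql(orphan_ids)
argv = ["lxd", "sql", "global", sql]  # same command form as used above
print(argv)
```

Rejecting an empty list and non-integer values guards against the two easiest mistakes: a DELETE with a malformed IN clause, or one built from unchecked text.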

Then I was able to do

lxc snapshot secure
lxc start secure

I'll close this ticket out once I've fixed the other affected containers.

comment:3 by robe, 2 years ago

Okay, I filed a ticket upstream: https://github.com/lxc/lxd/issues/10501

For the time being I downgraded to 5.1 using:

sudo snap revert lxd

This unfortunately had the side effect of shutting down all the containers and restarting them. I think all are up now, but I'm double-checking.

comment:4 by robe, 2 years ago

Resolution: fixed
Status: new → closed

This is all fixed. I had to make some extra fixes for tracsvn, as it had an additional issue that required deleting some orphan volumes. We are now on LXD 5.3.
