Opened 3 years ago
Closed 3 years ago
#2771 closed task (fixed)
osgeo7 snapshot failing and secure can't restart
Reported by: | robe | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | Sysadmin Contract 2022-II |
Component: | SysAdmin | Keywords: | |
Cc: |
Description
Running into technical difficulties.
It looks like a lot of the containers on osgeo7 can't snapshot. I tried to fix secure and made the mistake of shutting it down.
Still troubleshooting. I have the 5/30/2022 snapshot of it running at the moment, and will shut down id.osgeo.org to prevent new accounts from being created while I resolve this.
Change History (4)
comment:1 by , 3 years ago
comment:2 by , 3 years ago
The first issue was that after I shut down secure, it wouldn't start. It gave an error to the effect of:
Failed to run: zfs set mountpoint
To fix I did:
sudo zfs set mountpoint=/var/snap/lxd/common/lxd/storage-pools/default/containers/secure canmount=noauto osgeo7/containers/secure
zfs umount osgeo7/containers/secure
zfs mount osgeo7/containers/secure
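The same three steps can be wrapped in a small helper, which is handy when more than one container needs the fix. This is only a sketch: `remount_container`, `run` and `DRY_RUN` are made-up names for illustration, and the mountpoint assumes the snap-installed LXD layout and the `default` storage pool seen in the command above.

```shell
# Sketch only: remount_container, run and DRY_RUN are illustrative names,
# not part of zfs or LXD.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remount_container ZPOOL NAME
# Assumes datasets live under ZPOOL/containers/NAME and that LXD is the
# snap package using the "default" storage pool, as on osgeo7.
remount_container() {
    ds="$1/containers/$2"
    mnt="/var/snap/lxd/common/lxd/storage-pools/default/containers/$2"
    run zfs set "mountpoint=$mnt" canmount=noauto "$ds" || return 1
    # umount fails if the dataset is not currently mounted; that is fine here.
    run zfs umount "$ds" || true
    run zfs mount "$ds"
}
```

Running `DRY_RUN=1 remount_container osgeo7 secure` prints the three commands without touching anything; without `DRY_RUN` it needs to be run as root.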
live was having a similar issue, so I did the same and started it up.
secure had an additional issue, one I couldn't find documented anywhere:
This was a complicated one. I documented my change here - https://discuss.linuxcontainers.org/t/lxc-snapshot-and-lxc-start-error-instance-snapshot-record-count-doesnt-match-instance-snapshot-volume-record-count/14245/3
More detail here:
First I made a backup of the LXD database to inspect, with this:
sudo cp /var/snap/lxd/common/lxd/database/global/db.bin lxd-global-220601
Then I inspected the SQLite backup as follows:
sudo apt install sqlite3
sqlite3 lxd-global-220601
# in sqlite console
.tables
.mode column
.headers on
SELECT count(*) FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'secure';
output: 32
SELECT count(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'secure';
output: 37
SELECT v.id
FROM (SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'secure') AS v
LEFT JOIN (SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'secure') AS i ON i.name = v.name
WHERE i.name IS NULL;
This returned the following ids from storage_volumes_snapshots, which have no matching instance snapshot:
4701 4714 4737 4761 4779
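To repeat the orphan check for other containers, the query can be parameterized by container name. A rough sketch, assuming the database copy has the four tables and columns used in the queries above; `find_orphan_volume_snapshots` is a hypothetical helper name, and the query is rewritten with NOT IN instead of the LEFT JOIN form, which should return the same ids as long as snapshot names are never NULL.

```shell
# find_orphan_volume_snapshots DB CONTAINER
# Hypothetical helper: lists storage_volumes_snapshots ids that have no
# instances_snapshots row of the same name for the given container.
# Run it against a copy of the database, not the live file.
find_orphan_volume_snapshots() {
    db=$1
    name=$2
    # Refuse names that could break out of the quoted SQL literal.
    case $name in
        ''|*[!A-Za-z0-9._-]*)
            echo "bad container name: $name" >&2
            return 1 ;;
    esac
    sqlite3 "$db" "
        SELECT vs.id
        FROM storage_volumes AS v
        INNER JOIN storage_volumes_snapshots AS vs
                ON v.id = vs.storage_volume_id
        WHERE v.name = '$name'
          AND vs.name NOT IN (
              SELECT s.name
              FROM instances AS i
              INNER JOIN instances_snapshots AS s ON i.id = s.instance_id
              WHERE i.name = '$name')
        ORDER BY vs.id;"
}
```

For example, `find_orphan_volume_snapshots lxd-global-220601 secure` would print one id per line.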
Then I ran this against the live database:
lxd sql global "DELETE FROM storage_volumes_snapshots WHERE id IN(4701,4714,4737,4761,4779)"
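Since this edits the live LXD database, it may be worth validating the id list before it is interpolated into SQL. A minimal sketch; `build_delete_sql` is a hypothetical helper that only prints the statement, so passing it to `lxd sql global` stays an explicit, separate step.

```shell
# build_delete_sql ID...
# Hypothetical helper: prints a DELETE statement for the given
# storage_volumes_snapshots ids, rejecting anything that is not a number.
build_delete_sql() {
    [ "$#" -gt 0 ] || { echo "no ids given" >&2; return 1; }
    ids=""
    for id in "$@"; do
        case $id in
            ''|*[!0-9]*)
                echo "not a numeric id: $id" >&2
                return 1 ;;
        esac
        # Join with commas, without a leading comma on the first id.
        ids="${ids:+$ids,}$id"
    done
    echo "DELETE FROM storage_volumes_snapshots WHERE id IN ($ids)"
}
```

It could then be used as `sql=$(build_delete_sql 4701 4714 4737 4761 4779) && lxd sql global "$sql"`, so that nothing is sent to LXD if validation fails.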
Then I was able to run:
lxc snapshot secure
lxc start secure
I'll close this ticket out once I've fixed the other affected containers.
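To find which containers are still affected, one option is to attempt a throwaway snapshot on each and report the failures. A sketch under assumptions: `check_snapshots` and the `snaptest-` snapshot name are invented for illustration, and it relies on `lxc snapshot <instance> <name>` and `lxc delete <instance>/<name>` behaving as they do in LXD 5.x.

```shell
# check_snapshots CONTAINER...
# Hypothetical helper: prints the name of every container for which a
# test snapshot cannot be created.
check_snapshots() {
    snap="snaptest-$$"   # throwaway snapshot name (illustrative)
    for c in "$@"; do
        if lxc snapshot "$c" "$snap" >/dev/null 2>&1; then
            # Snapshot worked; remove the throwaway snapshot again.
            lxc delete "$c/$snap" >/dev/null 2>&1
        else
            echo "$c"
        fi
    done
}
```

Something like `check_snapshots $(lxc list --format csv -c n)` would run it over every container on the host.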
comment:3 by , 3 years ago
Okay, I filed a ticket upstream: https://github.com/lxc/lxd/issues/10501
For the time being I downgraded to 5.1 using
sudo snap revert lxd
This unfortunately had the side effect of shutting down all the containers and starting them back up. I think all are up now, but I'm double-checking.
comment:4 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
This is all fixed. I had to make some extra fixes for tracsvn, as it had an additional issue, so I had to delete some orphan volumes. We are now on LXD 5.3.
For clarification, when I said snapshot, I meant the snapshot of secure (5/30/2022) that I have running, which is the last successful snapshot. The other containers are still running as they were, even though some of them are in a state where osgeo4 can't create a snapshot of them.
I have disabled https://id.osgeo.org to prevent new registrations while I sort this issue out.