Opened 8 months ago
Closed 8 months ago
#3170 closed task (fixed)
osgeo7 went down
Reported by: | robe | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | Sysadmin Contract 2024-I |
Component: | SysAdmin | Keywords: | |
Cc: |
Description
As some may have noticed, osgeo7 went down this morning.
It appears there might be a disk failure on the Samsung SSD drive. I don't think that drive is used for anything important.
I did a hardware reset this morning and that seemed to bring it back up, but then it went down again shortly after.
This time I did a full power down and power up. It's back at the moment, but I'm in the middle of moving over some critical services like ldap to other hosts.
I have alerted OSUOSL of the situation.
Change History (3)
comment:1 by , 8 months ago
comment:2 by , 8 months ago
I was a little concerned about the SMART error I received:
The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 6 to 7

Device info:
SAMSUNG MZVKW512HMJP-00000, S/N:S316NX0JB03810, FW:CXA7500Q, 512 GB

For details see host's SYSLOG.
But running
smartctl -a /dev/nvme0
shows a PASS:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Error Information (NVMe Log 0x01, max 64 entries)
Num  ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0         9     0  0x101a  0x4004      -    0     0   -
  1         8     0  0x1011  0x4004      -    0     0   -
  2         7     0  0x1019  0x4004      -    0     0   -
  3         6     0  0x501d  0x4004      -    0     0   -
  4         5     0  0x0004  0x4202  0x028    0     0   -
  5         4     0  0x0004  0x4202  0x028    0     0   -
  6         3     0  0x0004  0x4202  0x028    0     0   -
  7         2     0  0x0004  0x4202  0x028    0     0   -
  8         1     0  0x0004  0x4202  0x028    0     0   -
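To make the error log easier to read at a glance, the entries can be grouped by their Status value. The following is only a minimal sketch, assuming the log rows are pasted in as text; the function name `summarize_nvme_errors` is hypothetical and is not part of smartmontools. It does not interpret what the status codes mean.

```python
from collections import Counter

# Error log rows copied from the smartctl output quoted in this ticket.
ERROR_LOG = """\
Num  ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0         9     0  0x101a  0x4004      -    0     0   -
  1         8     0  0x1011  0x4004      -    0     0   -
  2         7     0  0x1019  0x4004      -    0     0   -
  3         6     0  0x501d  0x4004      -    0     0   -
  4         5     0  0x0004  0x4202  0x028    0     0   -
  5         4     0  0x0004  0x4202  0x028    0     0   -
  6         3     0  0x0004  0x4202  0x028    0     0   -
  7         2     0  0x0004  0x4202  0x028    0     0   -
  8         1     0  0x0004  0x4202  0x028    0     0   -
"""


def summarize_nvme_errors(log_text):
    """Count error log entries per Status value (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows have nine columns and start with a numeric entry index;
        # the header row is skipped because "Num" is not a digit string.
        if len(fields) == 9 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


summary = summarize_nvme_errors(ERROR_LOG)
for status, n in sorted(summary.items()):
    print(status, n)
```

On the rows above this shows two groups: the four most recent entries share one Status value and the five older entries share another.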
At any rate, I would still like to move some critical services off osgeo7, at least temporarily, so I can upgrade it without worrying about them.
osgeo7 is the only host that hasn't been upgraded to Ubuntu 22 yet (it is still on Ubuntu 20).
comment:3 by , 8 months ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Closing this out. OSUOSL added osgeo7 to their syslog monitoring so if it happens again, they can possibly provide more information.
At the moment, I still think it has something to do with the SSD drive, which we don't use for hosting any containers: when the host comes back up it complains about that drive, and I don't think the SSD tests that OSUOSL ran completed.
I plan to move secure and tracsvn to osgeo9, so I am taking another snapshot of those.
tracsvn, however, needs an extra IP for the ssh port, so I might hold off on it until I confirm I can use the extra IP on osgeo9.
If anyone has hardcoded the secure IP anywhere, let me know. Everything should be accessing it via the ldap.osgeo.org domain name.
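For anyone wanting to check their own configuration for hardcoded addresses, a rough sketch along these lines may help. Everything here is an assumption for illustration: the function name `find_hardcoded_ips` is hypothetical, the sample config lines are made up, and `192.0.2.10` is a documentation-range placeholder, not the real secure IP.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; validity is checked separately.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(text):
    """Return (line number, address) pairs for valid IPv4 literals in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                # Looks like an address but is not one (e.g. an octet > 255).
                continue
            hits.append((lineno, candidate))
    return hits


# Made-up config content; 192.0.2.10 is a placeholder address.
sample_config = """\
uri ldaps://ldap.osgeo.org
uri ldaps://192.0.2.10
# bogus 999.1.1.1
"""

hits = find_hardcoded_ips(sample_config)
for lineno, address in hits:
    print(f"line {lineno}: {address}")
```

Any line this reports is a candidate for replacing the literal address with the ldap.osgeo.org name.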