Opened 3 months ago

Closed 2 months ago

#3170 closed task (fixed)

osgeo7 went down

Reported by: robe
Owned by: sac-tickets@…
Priority: normal
Milestone: Sysadmin Contract 2024-I
Component: SysAdmin
Keywords:
Cc:

Description

As some may have noticed osgeo7 went down this morning.

It appears there might be a disk failure on the Samsung SSD drive. I don't think that drive is used for anything important.

I did a hardware reset this morning and that seemed to have brought it back up, but then it went down again shortly after.

This time I did a full power down and power up. It's back at the moment, but I'm in the middle of moving over some critical services like ldap to other hosts.

I have alerted OSUOSL of the situation.

Change History (3)

comment:1 by robe, 3 months ago

My plan is to move secure and tracsvn to osgeo9, so I'm taking another snapshot of those.

tracsvn, however, needs an extra IP for the SSH port, so I might hold off on it until I confirm I can use the extra IP on osgeo9.

If anyone has hardcoded the IP of secure, let me know. Everything should be accessing it via the ldap.osgeo.org domain name.
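For anyone who wants to check their own configs for a hardcoded address, here is a minimal Python sketch that lists IPv4 literals in a block of text. The sample config lines, the `uri` keyword, and the 192.0.2.10 address (from the reserved documentation range) are made up for illustration; they are not the real settings or the real IP of secure.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; real validation happens below.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ip_literals(text):
    """Return (line_number, address) pairs for valid IPv4 literals in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                # Looks like an address but is not one (e.g. octet > 255).
                continue
            hits.append((lineno, candidate))
    return hits


# Hypothetical config text, not taken from any OSGeo host.
SAMPLE = """\
uri ldaps://ldap.osgeo.org
uri ldaps://192.0.2.10
bogus 999.1.1.1
"""

for lineno, address in find_ip_literals(SAMPLE):
    print(f"line {lineno}: hardcoded address {address}")
```

Any line it reports is a candidate for switching to the ldap.osgeo.org name before the move.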

comment:2 by robe, 3 months ago

I was a little concerned about the SMART error I received:

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 6 to 7

Device info:
SAMSUNG MZVKW512HMJP-00000, S/N:S316NX0JB03810, FW:CXA7500Q, 512 GB

For details see host's SYSLOG.

But running

 smartctl -a /dev/nvme0

Shows a PASS

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x101a  0x4004      -            0     0     -
  1          8     0  0x1011  0x4004      -            0     0     -
  2          7     0  0x1019  0x4004      -            0     0     -
  3          6     0  0x501d  0x4004      -            0     0     -
  4          5     0  0x0004  0x4202  0x028            0     0     -
  5          4     0  0x0004  0x4202  0x028            0     0     -
  6          3     0  0x0004  0x4202  0x028            0     0     -
  7          2     0  0x0004  0x4202  0x028            0     0     -
  8          1     0  0x0004  0x4202  0x028            0     0     -

But at any rate, I would still like to move some critical services off osgeo7, at least temporarily, so I can upgrade it without worrying about those.

osgeo7 is the only host that hasn't been upgraded to Ubuntu 22 yet (still on Ubuntu 20)
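To make the error log from smartctl above easier to read at a glance, here is a minimal Python sketch that tallies the entries by their Status value. The rows are copied from the output in this comment; the function names are hypothetical, and the sketch does not try to interpret what the status codes mean.

```python
from collections import Counter

# Rows copied from the "Error Information" table in this ticket.
ERROR_LOG = """\
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x101a  0x4004      -            0     0     -
  1          8     0  0x1011  0x4004      -            0     0     -
  2          7     0  0x1019  0x4004      -            0     0     -
  3          6     0  0x501d  0x4004      -            0     0     -
  4          5     0  0x0004  0x4202  0x028            0     0     -
  5          4     0  0x0004  0x4202  0x028            0     0     -
  6          3     0  0x0004  0x4202  0x028            0     0     -
  7          2     0  0x0004  0x4202  0x028            0     0     -
  8          1     0  0x0004  0x4202  0x028            0     0     -
"""


def parse_error_log(text):
    """Turn the whitespace-separated table into a list of dicts keyed by column."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def count_by_status(entries):
    """Count how many log entries share each Status value."""
    return Counter(entry["Status"] for entry in entries)


entries = parse_error_log(ERROR_LOG)
summary = count_by_status(entries)
latest = max(int(entry["ErrCount"]) for entry in entries)

for status, count in summary.most_common():
    print(f"{status}: {count} entries")
print(f"highest ErrCount in log: {latest}")
```

On this data the tally is five entries with status 0x4202 and four with 0x4004, and the highest ErrCount is 9, which is two more than the count of 7 reported in the smartd message.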

comment:3 by robe, 2 months ago

Resolution: fixed
Status: new → closed

Closing this out. OSUOSL added osgeo7 to their syslog monitoring so if it happens again, they can possibly provide more information.

At the moment, I still think it has something to do with the SSD drive, which we don't use for hosting any containers: when the host comes back up it complains about that drive, and I don't think the SSD tests that OSUOSL ran completed.
