Opened 11 days ago

Last modified 11 days ago

#3170 new task

osgeo7 went down

Reported by: robe
Owned by: sac-tickets@…
Priority: normal
Milestone: Sysadmin Contract 2024-I
Component: SysAdmin
Keywords:
Cc:

Description

As some may have noticed, osgeo7 went down this morning.

It appears there might be a disk failure on the Samsung SSD drive. I don't think that drive is used for anything important.

I did a hardware reset this morning, which seemed to bring it back up, but it went down again shortly after.

This time I did a full power down and power up. It's back at the moment, but I'm in the middle of moving over some critical services like ldap to other hosts.

I have alerted osuosl of the situation.

Change History (2)

comment:1 by robe, 11 days ago

My plan is to move secure and tracsvn to osgeo9, so I'm taking another snapshot of those.

tracsvn, however, needs an extra IP for the SSH port, so I might hold off on it until I confirm I can use the extra IP on osgeo9.

If anyone has the IP of secure hardcoded anywhere, let me know. Everything should be accessing it via the ldap.osgeo.org domain name.
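
For anyone who wants to check their own configs for a hardcoded address, here is a minimal sketch in Python. It only flags IPv4 literals in a block of text; it does not know what the actual IP of secure is. The function name `find_ipv4_literals` and the sample config are made up for illustration, and 192.0.2.10 is a documentation-range address, not a real OSGeo host.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4_literals(text):
    """Return (line_number, address) pairs for each valid IPv4 literal in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            try:
                # Reject candidates such as 999.1.1.1 that are not real addresses.
                ipaddress.IPv4Address(match.group())
            except ValueError:
                continue
            hits.append((lineno, match.group()))
    return hits


# Hypothetical config contents, for illustration only.
EXAMPLE_CONFIG = """\
uri ldap://ldap.osgeo.org
# legacy entry using a made-up documentation-range address
uri ldap://192.0.2.10
"""

hits = find_ipv4_literals(EXAMPLE_CONFIG)
for lineno, address in hits:
    print(f"line {lineno}: hardcoded address {address}")
```

Any line it reports is a candidate to switch over to the ldap.osgeo.org name before the move.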

comment:2 by robe, 11 days ago

I was a little concerned about the SMART error I received:

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 6 to 7

Device info:
SAMSUNG MZVKW512HMJP-00000, S/N:S316NX0JB03810, FW:CXA7500Q, 512 GB

For details see host's SYSLOG.

But running

 smartctl -a /dev/nvme0

shows a PASS:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x101a  0x4004      -            0     0     -
  1          8     0  0x1011  0x4004      -            0     0     -
  2          7     0  0x1019  0x4004      -            0     0     -
  3          6     0  0x501d  0x4004      -            0     0     -
  4          5     0  0x0004  0x4202  0x028            0     0     -
  5          4     0  0x0004  0x4202  0x028            0     0     -
  6          3     0  0x0004  0x4202  0x028            0     0     -
  7          2     0  0x0004  0x4202  0x028            0     0     -
  8          1     0  0x0004  0x4202  0x028            0     0     -
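
To make the log above easier to compare between runs, here is a rough Python sketch that parses the table and groups the entries by their Status value. It makes no attempt to interpret the status codes. The function names `parse_error_log` and `status_counts` are made up for illustration, and the sample text is simply the table pasted from the smartctl output above.

```python
from collections import Counter

# The error log table exactly as printed by smartctl above.
SAMPLE = """\
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x101a  0x4004      -            0     0     -
  1          8     0  0x1011  0x4004      -            0     0     -
  2          7     0  0x1019  0x4004      -            0     0     -
  3          6     0  0x501d  0x4004      -            0     0     -
  4          5     0  0x0004  0x4202  0x028            0     0     -
  5          4     0  0x0004  0x4202  0x028            0     0     -
  6          3     0  0x0004  0x4202  0x028            0     0     -
  7          2     0  0x0004  0x4202  0x028            0     0     -
  8          1     0  0x0004  0x4202  0x028            0     0     -
"""


def parse_error_log(text):
    """Turn the whitespace-separated table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip anything that is not a complete table row
        rows.append(dict(zip(header, fields)))
    return rows


def status_counts(rows):
    """Count how many log entries share each Status value."""
    return Counter(row["Status"] for row in rows)


rows = parse_error_log(SAMPLE)
counts = status_counts(rows)
latest = max(int(row["ErrCount"]) for row in rows)

print(f"{len(rows)} entries, highest ErrCount {latest}")
for status, n in counts.most_common():
    print(f"{status}: {n} entries")
```

On this log it shows that the four most recent entries share one Status value and the five older ones share another.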

But at any rate, I would still like to move some critical services off osgeo7, at least temporarily, so I can upgrade it without worrying about them.

osgeo7 is the only host that hasn't been upgraded to Ubuntu 22 yet (it is still on Ubuntu 20).
