Please read announcement

HPC - Broken Omnipath switch

Batch System

We have two Omnipath switches, and one of them has failed in the HPC nodes. These switches handle the message exchange for parallel jobs in the Slurm compute queue. We are waiting for a replacement to be delivered.


Altamira supercomputer: Altamira supercomputer related systems.
Batch System: Slurm batch system for Altamira. Status: Maintenance
Login nodes: Altamira login nodes (login1, login2). Status: Operational
Cloud Infrastructure: OpenStack Cloud infrastructure.
Grid and HTC: General purpose batch system and high throughput compute system.
Web and miscellaneous services: Web services, wiki pages and other services.
AAI: Authentication, Authorization and Identity systems.
Networking: Internal and external networking.
Storage systems: Distributed storage systems.

Incident history


December 14, 2021 at 7:00 AM UTC

IFCA-RedIris network topology change

Resolved after 60m of downtime

November 27, 2021 at 3:23 PM UTC

Electrical failure due to inclement weather

Resolved after 43h 38m of downtime

November 16, 2021 at 7:30 AM UTC

[OpenStack] Service version upgrade

Resolved after 10h 52m of downtime

September 6, 2021 at 10:02 AM UTC

Login1 unavailable due to kernel update

Resolved in under a minute

August 4, 2021 at 12:04 PM UTC

OS update of Altamira compute nodes

Resolved after 648h 0m of downtime

July 27, 2021 at 11:35 AM UTC

Networking disruption caused by a router update on Monday 02/08

Resolved after 140h 24m of downtime

July 9, 2021 at 11:21 AM UTC

Slow access to the GPFS system

Resolved after 69h 22m of downtime

June 25, 2021 at 11:27 AM UTC

Network intervention in the GPU cluster

Resolved after 139h 33m of downtime

June 15, 2021 at 2:37 PM UTC

Unexpected shutdown of Altamira and Grid nodes

Resolved in under a minute

June 14, 2021 at 7:00 AM UTC

Grid maintenance - Network topology change

Resolved after 5h 0m of downtime
