Please read announcement

HPC - Broken Omnipath switch

Batch System

We have two Omnipath switches, and the one serving the HPC nodes has failed. These switches carry the message exchange for parallel jobs in the Slurm compute queue. We are waiting for a replacement to be sent to us.


Altamira supercomputer: Altamira supercomputer related systems.
Batch System: Slurm batch system for Altamira. Status: Maintenance
Login nodes: Altamira login nodes (login1, login2). Status: Operational
Cloud Infrastructure: OpenStack Cloud infrastructure.
Grid and HTC: General purpose batch system and high throughput compute system.
Web and miscellaneous services: Web services, wiki pages and other services.
AAI: Authentication, Authorization and Identity systems.
Networking: Internal and external networking.
Storage systems: Distributed storage systems.

Incident history


June 8, 2021 at 1:46 PM UTC

Grid topology network change

Resolved after 68h 54m of downtime
May 11, 2021 at 7:00 AM UTC

[Warning] Change endpoints in OpenStack Infrastructure

Resolved after 27h 58m of downtime
May 3, 2021 at 9:24 AM UTC

Slurm Batch System unavailable

Resolved after 20m of downtime
April 28, 2021 at 7:00 AM UTC

New scheduled Maintenance in the Cloud Infrastructure

Resolved after 4h 53m of downtime
April 25, 2021 at 6:00 AM UTC

Scheduled Maintenance in the Cloud Infrastructure

Resolved after 5h 32m of downtime
March 24, 2021 at 7:02 PM UTC

Cloud Outage

Resolved in under a minute
March 23, 2021 at 1:05 PM UTC

Network Cloud Outage

Resolved after 23h 54m of downtime
January 21, 2021 at 8:42 AM UTC

Partial outage of storage system

Resolved after 441h 11m of downtime
January 18, 2021 at 3:00 PM UTC

Scheduled maintenance of IFCA Computing resources

Resolved after 70h 13m of downtime
January 11, 2021 at 2:15 PM UTC

Cloud infrastructure is down

Resolved after 1h 0m of downtime
