Experiencing disruptions

[HPC users] Change in the Slurm authentication mechanism

Batch System Computing Elements

In the coming days, maintenance will be carried out on Slurm authentication, and job submission will not be possible while the change is applied. An exact date and time have not yet been scheduled, because the compute nodes must first be drained of running jobs so that the intervention is as unintrusive as possible.

Update [30 Apr 12:30 pm]

Slurm has been updated to support the new native Slurm authentication. The change was applied to some group machines, and it works except on wngpus. The rest of the…


HPC - Broken Omnipath switch

Batch System

We have two Omnipath switches, and one of them has failed on the HPC nodes. These switches carry the message exchange for parallel jobs, that is, the Slurm compute queue. We are waiting for a replacement to be sent to us.


Altamira supercomputer: Altamira supercomputer related systems.
Batch System: Slurm batch system for Altamira. Status: Disrupted
Login nodes: Altamira login nodes (login1, login2). Status: Operational
Cloud Infrastructure: OpenStack Cloud infrastructure.
Grid and HTC: General purpose batch system and high throughput compute system.
Web and miscellaneous services: Web services, wiki pages and other services.
AAI: Authentication, Authorization and Identity systems.
Networking: Internal and external networking.
Storage systems: Distributed storage systems.

Incident history


July 29, 2025 at 7:00 AM UTC

[HPC and Grid] Slurm upgrade

Resolved after 53h 0m of downtime
June 24, 2025 at 6:10 AM UTC

Cloud network redundancy

Resolved after 9h 50m of downtime
April 28, 2025 at 10:30 AM UTC

Electrical blackout

Resolved after 21h 31m of downtime
April 14, 2025 at 7:31 AM UTC

Cloud upgrade

Resolved in under a minute
April 4, 2025 at 10:05 AM UTC

System authentication failing

Resolved in under a minute
April 2, 2025 at 8:06 AM UTC

Ceph Upgrade

The Ceph storage system, which provides external storage volumes for the cloud system, is about to be upgraded, so some timeouts may occur during the upgrade.
April 1, 2025 at 9:07 AM UTC

Login2 - No login

Resolved after 54h 53m of downtime
February 23, 2025 at 11:28 AM UTC

Datacenter power upgrade

Resolved after 120h 0m of downtime
February 14, 2025 at 11:35 AM UTC

Authentication system migration

Resolved after 48h 0m of downtime
November 19, 2024 at 12:09 AM UTC

Authentication to login2 failing when using password + OTP

Resolved after 85h 0m of downtime
