Student Cluster Upgrade

Our student cluster has now been running for four semesters and has been used for quite a few courses. The NVIDIA GTX 1080 Ti cards are showing their age, though, and we have received feedback that they are no longer suitable for many projects.

Current Issues

We got the following feedback from various sources:

Too old
The GTX 1080 Ti only supports compute capability 6.1 and lacks newer features such as tensor cores, bf16, int8 and int4.
Not enough VRAM
The GTX 1080 Ti only has 11GB of VRAM; many AI models require more.
Too slow
The GTX 1080 Ti was fast in 2017, but that was eight years ago.
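
To make the "too old" point concrete, here is a minimal sketch that maps a compute capability to the features it unlocks. The GPU class names and the 8.6 and 12.0 values come from this page; the Pascal value for the 1080 Ti and the thresholds in FEATURE_MIN_CC are assumptions based on the NVIDIA generations that introduced tensor cores (Volta, 7.0) and bf16 (Ampere, 8.0). The helper name supported_features is hypothetical.

```python
# Compute capability per GPU class as (major, minor).
GPUS = {
    "1080 Ti": (6, 1),   # Pascal generation (assumption, see lead-in)
    "3090": (8, 6),
    "5060 Ti": (12, 0),
}

# Assumed minimum compute capability for each feature.
FEATURE_MIN_CC = {
    "tensor cores": (7, 0),  # introduced with Volta
    "bf16": (8, 0),          # introduced with Ampere
}


def supported_features(cc):
    """Return the sorted names of all features available at compute capability cc."""
    return sorted(name for name, min_cc in FEATURE_MIN_CC.items() if cc >= min_cc)


for gpu, cc in GPUS.items():
    print(gpu, supported_features(cc))
```

Under these assumptions the 1080 Ti reports no features at all, while both candidate replacements report tensor cores and bf16.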

Restrictions

The D-INFK is willing to invest in the student cluster, as it has proven useful for teaching, but there are two important restrictions:

Cost
GPUs for AI are expensive and require an expensive chassis and system to host, power and feed them with data. We cannot afford new nodes, so the old ones will have to be reused. For the GPUs we will have to focus on consumer-grade cards with a good cost/performance ratio.
Power
We do not have enough power (and cooling) in the server rooms. The 32 nodes that we have in production already stretch the power limit when they are all in use, even though we limit the per-GPU power consumption to 50%. In addition, the current nodes only support 250W per GPU. We can, however, reduce the number of GPUs per node.
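
The power trade-off can be sketched as simple arithmetic. This is only an illustration under loud assumptions: the number of GPUs per node (GPUS_PER_NODE = 8) is hypothetical and not stated on this page, the current cards are assumed to draw the full 250W per-GPU limit when uncapped, and the 180W figure is the RTX 5060 Ti value quoted in the upgrade section. Both function names are made up for this example.

```python
def gpu_power_per_node(gpus: int, watts_per_gpu: int, cap_percent: int = 100) -> int:
    """GPU power draw of one node in watts, with a per-GPU power cap in percent."""
    return gpus * watts_per_gpu * cap_percent // 100


def max_gpus_within(budget_watts: int, watts_per_gpu: int) -> int:
    """Largest number of uncapped GPUs that stays within the given budget."""
    return budget_watts // watts_per_gpu


GPUS_PER_NODE = 8  # hypothetical, not stated in the text

# Current setup: cards at the 250W per-GPU limit, capped to 50%.
current_budget = gpu_power_per_node(GPUS_PER_NODE, 250, cap_percent=50)

# How many 180W cards fit into the same per-node GPU power budget uncapped?
replacement_count = max_gpus_within(current_budget, 180)

print(current_budget, replacement_count)
```

With these assumed numbers, a node that currently spends 1000W on eight capped GPUs could run five 180W cards at full power, which is why reducing the GPU count per node is an option.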

Proposed Upgrades for FS 2025/26

We are exploring three options and hope to implement them all by the time the semester starts.

Upgrade with RTX 5060 Ti

We will replace the GPUs in four nodes with RTX 5060 Ti cards. These have 16GB of VRAM, support the newest compute capability 12.0 with all the latest features usable for AI and use only 180W. They are not particularly fast but still faster than the GTX 1080 Ti.

Per node the upgrade will cost around CHF 3000. The actual implementation is quite complicated: we will need to completely disassemble the cards, remove the fan shroud, get additional parts from AliExpress and 3D-print other parts so that the GPUs fit and are properly cooled.

Availability for FS2025: Certain

Card Size

We looked at more powerful cards like the RTX 5070 Ti and above, but none of them would fit into the current chassis because they are too tall or too thick. They also use the new 12VHPWR power connector, which cannot easily be connected to the power connectors in the current nodes.

Acquire DGX Spark Systems

The DGX Spark systems offer 128GB of unified RAM shared between a Blackwell GPU and the 20 ARM cores running the operating system.

At US$ 2999 per system we could buy ten with the budget we have. The version from Asus is even cheaper. Hopefully these will be available in August, just in time to get them running for the new semester.

Availability for FS2025: Maybe

Upgrade with Old High-End GPUs

We will try to obtain RTX 3090 GPUs from GPU servers that are being decommissioned by research groups. These have 24GB of VRAM, support compute capability 8.6 and are still quite powerful.

These cards are too tall for our current nodes, so we would need not only the old cards but also the nodes they came in.

Availability for FS2025: Not sure

Additional Considerations

VRAM

Increasing the total VRAM in a node may require increasing the system RAM. Currently the nodes have 256GB of RAM, which (conservatively) gives 24GB of RAM per GPU, 2.18 times the 11GB of VRAM on a 1080 Ti. With cards sporting 16GB of VRAM that ratio drops to 1.5. If this is an issue for the projects of some courses, we would appreciate feedback.
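
The ratios quoted above can be reproduced with a few lines of arithmetic. The 24GB of system RAM per GPU and the VRAM sizes are taken from this page; the helper name ram_to_vram_ratio is hypothetical. The 24GB VRAM case (the RTX 3090 option) is included to show where the ratio would end up without a RAM upgrade.

```python
RAM_PER_GPU_GB = 24  # conservative per-GPU share of the 256GB node RAM


def ram_to_vram_ratio(vram_gb: int, ram_per_gpu_gb: int = RAM_PER_GPU_GB) -> float:
    """System RAM available per GPU divided by that GPU's VRAM, rounded to 2 digits."""
    return round(ram_per_gpu_gb / vram_gb, 2)


for vram in (11, 16, 24):
    print(f"{vram}GB VRAM -> ratio {ram_to_vram_ratio(vram)}")
```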

Mixed Resource Cluster Operation

In a cluster with GPUs of mixed specs, users will always want to use the best GPU available. It would also be a shame to leave the good and expensive resources unused.

Our planned approach is that each course tag will be assigned a default GPU class, for instance 5060 Ti. This will be negotiated with the head TA when the course is set up; the better the GPU, the better the justification needs to be to request it. By default, jobs will run on this GPU class with the same constraints as now. A student can, however, choose to launch a job on a different, better GPU class, for example 3090. In this case the job will start, but it will be preempted (canceled) if all GPUs in this class are in use and another student wants to start a job for a course where 3090 is the default GPU class. In short, students can use any type of GPU until that GPU is needed by a course participant who is entitled to it.
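
The preemption rule can be illustrated with a small simulation. This is a sketch of the policy as described in the paragraph above, not the cluster's actual scheduler configuration; the classes Job and GpuClass, their fields and the status strings are all hypothetical. It also assumes that a job that cannot start simply stays pending, which the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    default_class: str    # GPU class assigned to the job's course tag
    requested_class: str  # GPU class the student asked for

    @property
    def entitled(self) -> bool:
        """True if the job runs on the default GPU class of its course."""
        return self.default_class == self.requested_class


@dataclass
class GpuClass:
    name: str
    capacity: int                      # number of GPUs in this class
    running: list = field(default_factory=list)

    def submit(self, job: Job):
        """Try to start a job; return (status, preempted job or None)."""
        # A free GPU is handed out to anyone, entitled or not.
        if len(self.running) < self.capacity:
            self.running.append(job)
            return "started", None
        # All GPUs busy: an entitled job may cancel one non-entitled job.
        if job.entitled:
            for victim in self.running:
                if not victim.entitled:
                    self.running.remove(victim)
                    self.running.append(job)
                    return "started", victim
        # Otherwise the job has to wait (assumption, see lead-in).
        return "pending", None


pool = GpuClass("3090", capacity=1)
guest = Job("guest", default_class="5060 Ti", requested_class="3090")
owner = Job("owner", default_class="3090", requested_class="3090")

first = pool.submit(guest)           # free GPU, so the guest job starts
status, victim = pool.submit(owner)  # class is full, so the guest is preempted
late_guest = pool.submit(guest)      # full and not entitled, so it stays pending
```

In this run the guest job starts on the idle 3090, is canceled as soon as the entitled job arrives, and cannot get back in while the entitled job holds the only GPU.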


Page URL: https://www.isg.inf.ethz.ch/bin/view/Main/NewsProjectsStudentCluster
2025-07-05