Student Cluster Upgrade
Our student cluster has now been running for four semesters and has been used for quite a few courses. The NVIDIA GTX 1080 Ti cards are showing their age, though, and we have received feedback that they are no longer suitable for many projects.
Current Issues
We got the following feedback from various sources:
- Too old
  - The GTX 1080 Ti only supports compute capability 6.1 and lacks newer features such as tensor cores, bf16, int8 and int4.
- Not enough VRAM
  - The GTX 1080 Ti only has 11GB of VRAM; many AI models require more.
- Too slow
  - The GTX 1080 Ti was fast in 2017, but that was eight years ago.
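To make the VRAM complaint concrete, the following is a rough sketch of a weight-only memory estimate. The 7-billion-parameter model is a hypothetical example, not a course requirement, and the estimate deliberately ignores activations, KV cache and optimizer state, so real requirements will be higher.

```python
# Weight-only memory estimate (assumption: activations, KV cache and
# optimizer state are ignored, so actual usage is higher).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weights_gb(n_params: float, dtype: str) -> float:
    """Return the size of the model weights in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(n_params: float, dtype: str, vram_gb: float) -> bool:
    """True if the weights alone fit into the given amount of VRAM."""
    return weights_gb(n_params, dtype) <= vram_gb


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for vram_gb in (11, 16, 24):  # 1080 Ti, RTX 5060 Ti, RTX 3090
        for dtype in ("fp16", "int8"):
            size = weights_gb(n_params, dtype)
            print(f"{vram_gb} GB, {dtype}: {size:.1f} GB of weights, "
                  f"fits={fits(n_params, dtype, vram_gb)}")
```

Under these assumptions, the fp16 weights of such a model already exceed 11GB before any working memory is counted.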
Restrictions
The D-INFK is willing to invest in the student cluster as it has proven useful for teaching, but there are two important restrictions:
- Cost
  - GPUs for AI are expensive and require an expensive chassis and system to host, power and feed them with data. We cannot afford new nodes, so the old ones will have to be reused. For the GPUs, we will have to focus on consumer-grade cards with a good cost/performance ratio.
- Power
  - We do not have enough power (and cooling) in the server rooms. The 32 nodes in production already stretch the power limit when they are all in use, even though we cap each GPU at 50% of its power consumption. In addition, the current nodes only support 250W per GPU. We can, however, reduce the number of GPUs per node.
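The trade-off between per-GPU power and GPU count can be sketched as simple arithmetic. The example below is only an illustration: the number of GPUs per node (8) is an assumed value, 250W is taken as the uncapped draw of a current card, and the new cards are assumed to run uncapped at 180W.

```python
import math


def gpus_within_budget(current_gpus: int, current_gpu_w: float,
                       power_limit: float, new_gpu_w: float) -> int:
    """Number of new cards at full power that fit into the GPU power
    currently drawn by the capped old cards of one node."""
    budget_w = current_gpus * current_gpu_w * power_limit
    return math.floor(budget_w / new_gpu_w)


if __name__ == "__main__":
    # Assumption: 8 GPUs per node; 250W and the 50% cap are from the text.
    print(gpus_within_budget(current_gpus=8, current_gpu_w=250,
                             power_limit=0.5, new_gpu_w=180))
```

If the new cards were also power-capped, more of them would fit into the same envelope; the function only covers the uncapped case.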
Proposed Upgrades for FS 2025/26
We are exploring three options and hope to implement all of them before the semester starts.
Upgrade with RTX 5060 Ti
We will replace the GPUs in four nodes with RTX 5060 Ti cards. These have 16GB of VRAM, support the newest compute capability 12.0 with all the latest features usable for AI, and only use 180W. They are not particularly fast but still faster than the GTX 1080 Ti. The upgrade will cost around CHF 3000 per node. The actual implementation is quite complicated: we will need to completely disassemble the cards, remove the fan shroud, get additional parts from AliExpress and 3D-print other parts to make the GPUs fit and keep them cooled.
Availability for FS2025: Certain
Card Size
We looked at more powerful cards such as the RTX 5070 Ti and above, but none of them would fit into the current chassis because they are too tall or too thick. They also use the new 12VHPWR power connector, which cannot easily be connected to the power connectors in the current nodes.
Acquire DGX Spark Systems
The DGX Spark systems offer 128GB of unified RAM shared between a Blackwell GPU and the 20 ARM cores that run the operating system. At US$ 2999 per system we could buy ten with the budget we have. The version from Asus is even cheaper. Hopefully these will be available in August, just in time to get them running for the new semester.
Availability for FS2025: Maybe
Upgrade with Old High-End GPUs
We will try to obtain RTX 3090 GPUs from GPU servers that research groups are decommissioning. These have 24GB of VRAM, support compute capability 8.6 and are still quite powerful. These cards are too tall for our current nodes, so we would need not only the old cards but also the nodes they came in.
Availability for FS2025: Not sure
Additional Considerations
VRAM
Increasing the total VRAM in a node may require increasing the system RAM. Currently the nodes have 256GB of RAM, which (conservatively) gives 24GB of RAM per GPU, 2.18 times the VRAM of the GTX 1080 Ti. With cards sporting 16GB of VRAM, that ratio drops to 1.5. If this is an issue for the projects of some courses, we would appreciate feedback.
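The ratios quoted above follow directly from dividing the per-GPU share of system RAM by the VRAM of each card. A minimal sketch that reproduces them is shown below; the RTX 3090 row is added for comparison and assumes, hypothetically, that those nodes would offer the same 24GB of system RAM per GPU.

```python
def ram_to_vram_ratio(ram_per_gpu_gb: float, vram_gb: float) -> float:
    """System RAM available per GPU divided by that GPU's VRAM."""
    return ram_per_gpu_gb / vram_gb


RAM_PER_GPU_GB = 24  # conservative per-GPU share stated in the text

# VRAM sizes in GB as given in this document.
CARDS = {"1080 Ti": 11, "RTX 5060 Ti": 16, "RTX 3090": 24}

if __name__ == "__main__":
    for name, vram_gb in CARDS.items():
        ratio = ram_to_vram_ratio(RAM_PER_GPU_GB, vram_gb)
        print(f"{name}: {ratio:.2f}x system RAM per GB of VRAM")
```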