Download CUDA for Windows 10 x64

Resources

  • CUDA Documentation/Release Notes
  • MacOS Tools
  • Training
  • Sample Code
  • Forums
  • Archive of Previous CUDA Releases
  • FAQ
  • Open Source Packages
  • Submit a Bug
  • Tarball and Zip Archive Deliverables


Develop, Optimize and Deploy GPU-Accelerated Apps

The NVIDIA® CUDA® Toolkit provides a development environment for creating high-performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.
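
As a minimal sketch of the idea (illustrative only, not NVIDIA sample code; the work kernel is a stand-in), a host program can enumerate the visible GPUs and issue work to each one in turn:

    #include <cuda_runtime.h>

    __global__ void work() { /* per-device payload omitted */ }

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);       // how many GPUs are visible
        for (int d = 0; d < count; ++d) {
            cudaSetDevice(d);             // make device d current
            work<<<1, 1>>>();             // launch on device d
        }
        for (int d = 0; d < count; ++d) { // wait for all devices to finish
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
        return 0;
    }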


CUDA 12 Features

CUDA 12 introduces support for the NVIDIA Hopper and Ada Lovelace architectures, Arm server processors, Lazy Module and Kernel Loading, revamped Dynamic Parallelism APIs, enhancements to the CUDA graphs API, performance-optimized libraries, and new developer tool capabilities.

Support for the NVIDIA Hopper architecture includes next-generation Tensor Cores and the Transformer Engine, the high-speed NVLink Switch System, mixed-precision modes, 2nd-generation Multi-Instance GPU (MIG), advanced memory management, and standard C++/Fortran/Python parallel language constructs.



GTC Digital Webinars

Dive deeper into the latest CUDA features.


Customer Stories

See how developers, scientists, and researchers are using CUDA today.


CUDA Ecosystem

Explore the top compute and graphics packages with built-in CUDA integration.




Resources

  • CUDA Documentation/Release Notes
  • An Even Easier Introduction to CUDA
  • All CUDA Dev Blogs
  • CUDA on WSL
  • CUDA-X Libraries
  • Training
  • Nsight Developer Tools
  • Accelerate Applications on NVIDIA Ampere
  • Sample Code
  • Forums
  • Submit a bug

Nvidia.CUDA, Release version: 12.1


Download Links

  • Version 12.1
  • Version 12.0.1
  • Version 11.8
  • Version 11.7
  • Version 11.6
  • Version 11.5
  • Version 11.3



What’s New

Features:

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation

Highlights:

  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost() (see the sketch after this list)
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and macOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.
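
The "no-copy pinning" highlight above refers to registering an existing host allocation with the runtime, e.g. via cudaHostRegister(); a minimal sketch (buffer size and usage are illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        size_t size = 1 << 20;
        void *buf = malloc(size);         // ordinary pageable allocation
        // Pin the existing buffer in place instead of copying it into
        // memory obtained from cudaMallocHost().
        cudaHostRegister(buf, size, cudaHostRegisterDefault);
        // ... use buf for fast asynchronous host<->device transfers ...
        cudaHostUnregister(buf);          // unpin before freeing
        free(buf);
        return 0;
    }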

What’s New:

  • Added a new API, cudaGraphNodeSetEnabled(), to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled(), allows querying the enabled state of a node (see the sketch after this list).
  • Full release of 128-bit integer (__int128) data type including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
  • Added ability to disable NULL kernel graph node launches.
  • Added new NVML public APIs for querying functionality under Wayland.
  • Added L2 cache control descriptors for atomics.
  • Large CPU page support for UVM managed memory.
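
A minimal sketch of the node enable/disable APIs mentioned in the first item above (the kernel, sizes, and use of managed memory are illustrative; error checking omitted):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void incr(int *x) { ++*x; }

    int main() {
        int *x;
        cudaMallocManaged(&x, sizeof(int));
        *x = 0;

        // Build a one-node graph containing the kernel above.
        cudaGraph_t graph;
        cudaGraphCreate(&graph, 0);
        void *args[] = { &x };
        cudaKernelNodeParams p = {};
        p.func = (void *)incr;
        p.gridDim = dim3(1);
        p.blockDim = dim3(1);
        p.kernelParams = args;
        cudaGraphNode_t node;
        cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

        // Disable the kernel node in the *instantiated* graph;
        // the next launch behaves as if the node were a no-op.
        cudaGraphNodeSetEnabled(exec, node, 0);
        cudaGraphLaunch(exec, 0);
        cudaDeviceSynchronize();
        printf("x = %d (expected 0, node was disabled)\n", *x);

        unsigned int enabled = 1;
        cudaGraphNodeGetEnabled(exec, node, &enabled);  // query the flag back
        printf("node enabled: %u\n", enabled);
        return 0;
    }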

CUDA Compilers

11.6

  • VS2022 Support: CUDA 11.6 officially supports the latest VS2022 as host compiler. A separate Nsight Visual Studio installer, 2022.1.1, must be downloaded separately. A future CUDA release will have the Nsight Visual Studio installer with VS2022 support integrated into it.
  • New instructions in public PTX: New instructions for bit mask creation — BMSK and sign extension — SZEXT are added to the public PTX ISA. You can find documentation for these instructions in the PTX ISA guide: BMSK and SZEXT.
  • Unused Kernel Optimization: In CUDA 11.5, unused kernel pruning was introduced with the potential benefits of reducing binary size and improving performance through more efficient optimizations. This was an opt-in feature, but in 11.6 it is enabled by default. As mentioned in the 11.5 blog post, there is an opt-out flag that can be used in case it becomes necessary for debug purposes or for other special situations, for example:

    $ nvcc -rdc=true user.cu testlib.a -o user -Xnvlink -ignore-host-info
  • In addition to the -arch=all and -arch=all-major options added in CUDA 11.5, NVCC introduced -arch=native in CUDA 11.5 update 1. This -arch=native option is a convenient way for users to let NVCC determine the right target architecture to compile the CUDA device code for, based on the GPU installed on the system. This can be particularly helpful for testing when applications are run on the same system they are compiled on.
  • Generate PTX from nvlink: Using the following command line, the device linker, nvlink, will produce PTX as an output in addition to CUBIN:

    $ nvcc -dlto -dlink -ptx
  • Device linking by nvlink is the final stage in the CUDA compilation process. Applications that have multiple source translation units have to be compiled in separate compilation mode. LTO (introduced in CUDA 11.4) allowed nvlink to perform optimizations at device link time instead of at compile time so that separately compiled applications with several translation units can be optimized to the same level as whole program compilations with a single translation unit. However, without the option to output PTX, applications that cared about forward compatibility of device code could not benefit from Link Time Optimization or had to constrain the device code to a single source file.
  • With the option for nvlink that performs LTO to generate the output in PTX, customer applications that require forward compatibility across GPU architectures can span across multiple files and can also take advantage of Link Time Optimization.
  • Bullseye support: NVCC-compiled source code will work with the code coverage tool Bullseye. The code coverage is only for the CPU or host functions. Code coverage for device functions is not supported through Bullseye.
  • INT128 developer tool support: In 11.5, CUDA C++ support for 128-bit integers was added. In this release, developer tools support the data type as well. With the latest version of libcu++, the __int128 data type is supported by math functions (see the sketch below).
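
A small sketch of the __int128 support described in the last item (assumes a host compiler that supports __int128, such as GCC or Clang; the values are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void mul128(__int128 *out, long long a, long long b) {
        // 128-bit arithmetic in device code, added in CUDA 11.5.
        *out = (__int128)a * (__int128)b;
    }

    int main() {
        __int128 *r;
        cudaMallocManaged(&r, sizeof(__int128));
        mul128<<<1, 1>>>(r, 0x123456789abcdefLL, 0x10LL);
        cudaDeviceSynchronize();
        // Print as two 64-bit halves; printf has no __int128 conversion.
        unsigned long long hi = (unsigned long long)((unsigned __int128)*r >> 64);
        unsigned long long lo = (unsigned long long)*r;
        printf("result = 0x%016llx%016llx\n", hi, lo);
        return 0;
    }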

cuSOLVER

New Features:

  • A new singular value decomposition (GESVDR) is added. GESVDR computes a partial spectrum with random sampling and is an order of magnitude faster than GESVD.
  • libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so (see the example below).
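
As a hedged example of the new linking requirement, an application would name both libraries on the link line so the dynamic linker resolves the matching libcublas.so (the file name app.cu is illustrative):

    $ nvcc app.cu -lcusolver -lcublas -o app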

cuSPARSE

New Features:

  • New Tensor Core-accelerated Block Sparse Matrix — Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix — Vector Multiplication (cusparseSpMV) with better performance (see the sketch after this list).
  • Extended functionalities for cusparseSpMV:
    • Support for the CSC format.
    • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
    • Support for mixed regular-complex data type computation.
    • Support for deterministic and non-deterministic computation.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix — Matrix Multiplication (cusparseSpMM) with better performance especially for small matrices.
  • New routine for Sampled Dense Matrix — Dense Matrix Multiplication (cusparseSDDMM) which deprecated cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
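
A condensed sketch of the generic cusparseSpMV flow for a CSR matrix, as referenced above (the device arrays are assumed to be allocated and filled; error checks are omitted; the algorithm enum is spelled CUSPARSE_MV_ALG_DEFAULT in some older releases):

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // y = alpha * A * x + beta * y for a CSR matrix A (m x n, nnz nonzeros).
    void spmv_csr(int m, int n, int nnz,
                  int *dRowPtr, int *dColInd, float *dVals,
                  float *dX, float *dY) {
        cusparseHandle_t handle;
        cusparseCreate(&handle);

        // Wrap the raw device arrays in opaque descriptors.
        cusparseSpMatDescr_t matA;
        cusparseCreateCsr(&matA, m, n, nnz, dRowPtr, dColInd, dVals,
                          CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                          CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
        cusparseDnVecDescr_t vecX, vecY;
        cusparseCreateDnVec(&vecX, n, dX, CUDA_R_32F);
        cusparseCreateDnVec(&vecY, m, dY, CUDA_R_32F);

        float alpha = 1.0f, beta = 0.0f;
        size_t bufSize = 0;
        void *dBuf = nullptr;
        // Query the workspace size, allocate it, then run the multiply.
        cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                &alpha, matA, vecX, &beta, vecY,
                                CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
        cudaMalloc(&dBuf, bufSize);
        cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     &alpha, matA, vecX, &beta, vecY,
                     CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

        cudaFree(dBuf);
        cusparseDestroyDnVec(vecX);
        cusparseDestroyDnVec(vecY);
        cusparseDestroySpMat(matA);
        cusparseDestroy(handle);
    }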

Deprecations:

  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • COO Array of Structure (CooAoS) format has been deprecated including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.

Known Issues:

  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.

Resolved Issues:

  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
NPP

New Features:

  • New APIs added to compute Distance Transform using the Parallel Banding Algorithm (PBA):
    • nppiDistanceTransformPBA_xxxxx_C1R_Ctx() — where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
    • nppiSignedDistanceTransformPBA_32f_C1R_Ctx()

Resolved Issues:

  • Fixed the issue in which Label Markers added the zero pixel as an object region.

nvJPEG

New Features:

  • The nvJPEG decoder added new APIs to support region-of-interest (ROI) based decoding for the batched hardware decoder:
    • nvjpegDecodeBatchedEx()
    • nvjpegDecodeBatchedSupportedEx()

cuFFT

Known Issues:

  • cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
  • Plans with strides, with primes larger than 127 in the FFT size decomposition, and with a total transform size (including strides) larger than 32 GB produce incorrect results.

Resolved Issues:

  • Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
  • Large prime factors in the size decomposition and real-to-complex or complex-to-real FFT types no longer cause cuFFT plan functions to fail (a minimal plan/execute sketch follows this list).
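
For context, the "plan functions" in these notes are the cufftPlan*/cufftMakePlan* family; a minimal single-precision complex-to-complex transform looks roughly like this (the size is illustrative, error checks omitted):

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;
        cufftComplex *data;
        cudaMalloc(&data, n * sizeof(cufftComplex));
        // ... fill `data` on the device ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);           // plan a 1D C2C FFT, batch of 1
        cufftExecC2C(plan, data, data, CUFFT_FORWARD); // in-place forward transform
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }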
CUPTI

Deprecations early notice:

  • The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:
    • NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h
    • cuptiDeviceGetTimestamp from the header cupti_events.h

The complete release notes are available in the CUDA Toolkit documentation.


Once again, after reinstalling Windows, I realized I had to set up the drivers, CUDA, cuDNN, and Tensorflow/Keras for training neural networks.

Each time this turns out to be a simple but time-consuming task for me: finding a suitable combination of Tensorflow/Keras, CUDA, cuDNN, and Python is not hard, but I only remember these dependencies at the moment when, importing Tensorflow, I see that the GPU was not detected, and I start searching for the right page in the Tensorflow documentation.

This time the situation got a bit more complicated. Besides installing Tensorflow, I also needed to install PyTorch, which comes with its own dependencies and supported versions of Python, CUDA, and cuDNN.

After several hours of experiments, I decided I should record all the useful links in one post for my future self.

A short algorithm for installing Tensorflow and PyTorch

Note: Tensorflow and PyTorch can be installed in a single virtual environment, but this article does not cover that setup.

Preparing for installation

  1. Determine which Python version is supported by Tensorflow and PyTorch (at the time of writing, I was unable to install PyTorch in a virtual environment with Python 3.9.5)
  2. For the chosen Python version, find suitable versions of Tensorflow and PyTorch
  3. Determine which CUDA versions are supported by the Tensorflow and PyTorch versions chosen earlier
  4. Determine which cuDNN version Tensorflow supports: not every cuDNN version that works with a given CUDA version is supported by Tensorflow. I did not notice this quirk with PyTorch

Installing CUDA and cuDNN

  1. Download a suitable CUDA version and install it. You can install it with all the default settings
  2. Download the cuDNN version matching the chosen Tensorflow version (step 1.2). Downloading cuDNN requires registration on the NVIDIA website. "Installing" cuDNN amounts to unpacking the archive and replacing the existing CUDA files with the files from the archive

Installing Tensorflow

  1. Create a virtual environment for Tensorflow with the chosen Python version. Let's call it, for example, py38tf
  2. Switch to the py38tf environment and install the supported Tensorflow version: pip install tensorflow==x.x.x
  3. Check GPU support with the command
    python -c "import tensorflow as tf; print('CUDA available' if tf.config.list_physical_devices('GPU') else 'CUDA not available')"

Installing PyTorch

  1. Create a virtual environment for PyTorch with the chosen Python version. Let's call it, for example, py38torch
  2. Switch to the py38torch environment and install the supported PyTorch version (an example command is sketched after this list)
  3. Check GPU support with the command
    python -c "import torch; print('CUDA available' if torch.cuda.is_available() else 'CUDA not available')"

In my case, the following combination worked:

  • Python 3.8.8
  • NVIDIA driver 441.22
  • CUDA 10.1
  • cuDNN 7.6
  • Tensorflow 2.3.0
  • PyTorch 1.7.1+cu101

Tensorflow and PyTorch are installed in separate virtual environments.

Summary

The benefit of this article won't show up for a while: I don't reinstall my system often.

If you use this algorithm and find any mistakes, please write in the comments.

If you liked this article, you can check out my Telegram channel, where I post short notes about Python, .NET, and Go.
