Download CUDA for Windows 10 x64

Resources

  • CUDA Documentation/Release Notes
  • MacOS Tools
  • Training
  • Sample Code
  • Forums
  • Archive of Previous CUDA Releases
  • FAQ
  • Open Source Packages
  • Submit a Bug
  • Tarball and Zip Archive Deliverables


Develop, Optimize and Deploy GPU-Accelerated Apps

The NVIDIA® CUDA® Toolkit provides a development environment for creating high-performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.
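
As a minimal sketch of the idea (illustrative only, not NVIDIA sample code; the work kernel is a stand-in), a host program can enumerate the visible GPUs and issue work to each one in turn:

    #include <cuda_runtime.h>

    __global__ void work() { /* per-device payload omitted */ }

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);       // how many GPUs are visible
        for (int d = 0; d < count; ++d) {
            cudaSetDevice(d);             // make device d current
            work<<<1, 1>>>();             // launch on device d
        }
        for (int d = 0; d < count; ++d) { // wait for all devices to finish
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
        return 0;
    }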


CUDA 12 Features

CUDA 12 introduces support for the NVIDIA Hopper and Ada Lovelace architectures, Arm server processors, Lazy Module and Kernel Loading, revamped Dynamic Parallelism APIs, enhancements to the CUDA graphs API, performance-optimized libraries, and new developer tool capabilities.

Support for the NVIDIA Hopper architecture includes next-generation Tensor Cores and the Transformer Engine, the high-speed NVLink Switch System, mixed-precision modes, 2nd-generation Multi-Instance GPU (MIG), advanced memory management, and standard C++/Fortran/Python parallel language constructs.



GTC Digital Webinars

Dive deeper into the latest CUDA features.


Customer Stories

See how developers, scientists, and researchers are using CUDA today.


CUDA Ecosystem

Explore the top compute and graphics packages with built-in CUDA integration.




Resources

  • CUDA Documentation/Release Notes
  • An Even Easier Introduction to CUDA
  • All CUDA Dev Blogs
  • CUDA on WSL
  • CUDA-X Libraries
  • Training
  • Nsight Developer Tools
  • Accelerate Applications on NVIDIA Ampere
  • Sample Code
  • Forums
  • Submit a bug

Nvidia.CUDA, Release version: 12.1


Download Links

  • Version 12.1
  • Version 12.0.1
  • Version 11.8
  • Version 11.7
  • Version 11.6
  • Version 11.5
  • Version 11.3



What’s New

Features:

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation

Highlights:

  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost() (see the sketch after this list)
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and macOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.
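
The "no-copy pinning" highlight above refers to registering an existing host allocation with the runtime, e.g. via cudaHostRegister(); a minimal sketch (buffer size and usage are illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        size_t size = 1 << 20;
        void *buf = malloc(size);         // ordinary pageable allocation
        // Pin the existing buffer in place instead of copying it into
        // memory obtained from cudaMallocHost().
        cudaHostRegister(buf, size, cudaHostRegisterDefault);
        // ... use buf for fast asynchronous host<->device transfers ...
        cudaHostUnregister(buf);          // unpin before freeing
        free(buf);
        return 0;
    }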

What’s New:

  • Added a new API, cudaGraphNodeSetEnabled(), to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled(), allows querying the enabled state of a node (see the sketch after this list).
  • Full release of 128-bit integer (__int128) data type including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
  • Added ability to disable NULL kernel graph node launches.
  • Added new NVML public APIs for querying functionality under Wayland.
  • Added L2 cache control descriptors for atomics.
  • Large CPU page support for UVM managed memory.
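
A minimal sketch of the node enable/disable APIs mentioned in the first item above (the kernel, sizes, and use of managed memory are illustrative; error checking omitted):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void incr(int *x) { ++*x; }

    int main() {
        int *x;
        cudaMallocManaged(&x, sizeof(int));
        *x = 0;

        // Build a one-node graph containing the kernel above.
        cudaGraph_t graph;
        cudaGraphCreate(&graph, 0);
        void *args[] = { &x };
        cudaKernelNodeParams p = {};
        p.func = (void *)incr;
        p.gridDim = dim3(1);
        p.blockDim = dim3(1);
        p.kernelParams = args;
        cudaGraphNode_t node;
        cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

        // Disable the kernel node in the *instantiated* graph;
        // the next launch behaves as if the node were a no-op.
        cudaGraphNodeSetEnabled(exec, node, 0);
        cudaGraphLaunch(exec, 0);
        cudaDeviceSynchronize();
        printf("x = %d (expected 0, node was disabled)\n", *x);

        unsigned int enabled = 1;
        cudaGraphNodeGetEnabled(exec, node, &enabled);  // query the flag back
        printf("node enabled: %u\n", enabled);
        return 0;
    }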

CUDA Compilers

11.6

  • VS2022 Support: CUDA 11.6 officially supports the latest VS2022 as host compiler. A separate Nsight Visual Studio installer, 2022.1.1, must be downloaded separately. A future CUDA release will have the Nsight Visual Studio installer with VS2022 support integrated into it.
  • New instructions in public PTX: New instructions for bit mask creation — BMSK and sign extension — SZEXT are added to the public PTX ISA. You can find documentation for these instructions in the PTX ISA guide: BMSK and SZEXT.
  • Unused Kernel Optimization: In CUDA 11.5, unused kernel pruning was introduced with the potential benefits of reducing binary size and improving performance through more efficient optimizations. This was an opt-in feature, but in 11.6 it is enabled by default. As mentioned in the 11.5 blog post, there is an opt-out flag that can be used in case it becomes necessary for debug purposes or for other special situations, for example:

    $ nvcc -rdc=true user.cu testlib.a -o user -Xnvlink -ignore-host-info
  • In addition to the -arch=all and -arch=all-major options added in CUDA 11.5, NVCC introduced -arch=native in CUDA 11.5 update 1. This -arch=native option is a convenient way for users to let NVCC determine the right target architecture to compile the CUDA device code for, based on the GPU installed on the system. This can be particularly helpful for testing when applications are run on the same system they are compiled on.
  • Generate PTX from nvlink: Using the following command line, the device linker, nvlink, will produce PTX as an output in addition to CUBIN:

    $ nvcc -dlto -dlink -ptx
  • Device linking by nvlink is the final stage in the CUDA compilation process. Applications that have multiple source translation units have to be compiled in separate compilation mode. LTO (introduced in CUDA 11.4) allowed nvlink to perform optimizations at device link time instead of at compile time so that separately compiled applications with several translation units can be optimized to the same level as whole program compilations with a single translation unit. However, without the option to output PTX, applications that cared about forward compatibility of device code could not benefit from Link Time Optimization or had to constrain the device code to a single source file.
  • With the option for nvlink that performs LTO to generate the output in PTX, customer applications that require forward compatibility across GPU architectures can span across multiple files and can also take advantage of Link Time Optimization.
  • Bullseye support: NVCC-compiled source code will work with the code coverage tool Bullseye. The code coverage is only for the CPU or host functions. Code coverage for device functions is not supported through Bullseye.
  • INT128 developer tool support: In 11.5, CUDA C++ support for 128-bit integers was added. In this release, developer tools support the data type as well. With the latest version of libcu++, the __int128 data type is supported by math functions (see the sketch below).
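
A small sketch of the __int128 support described in the last item (assumes a host compiler that supports __int128, such as GCC or Clang; the values are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void mul128(__int128 *out, long long a, long long b) {
        // 128-bit arithmetic in device code, added in CUDA 11.5.
        *out = (__int128)a * (__int128)b;
    }

    int main() {
        __int128 *r;
        cudaMallocManaged(&r, sizeof(__int128));
        mul128<<<1, 1>>>(r, 0x123456789abcdefLL, 0x10LL);
        cudaDeviceSynchronize();
        // Print as two 64-bit halves; printf has no __int128 conversion.
        unsigned long long hi = (unsigned long long)((unsigned __int128)*r >> 64);
        unsigned long long lo = (unsigned long long)*r;
        printf("result = 0x%016llx%016llx\n", hi, lo);
        return 0;
    }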

cuSOLVER

New Features:

  • A new singular value decomposition (GESVDR) is added. GESVDR computes a partial spectrum with random sampling and is an order of magnitude faster than GESVD.
  • libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so (see the example below).
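
As a hedged example of the new linking requirement, an application would name both libraries on the link line so the dynamic linker resolves the matching libcublas.so (the file name app.cu is illustrative):

    $ nvcc app.cu -lcusolver -lcublas -o app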

cuSPARSE

New Features:

  • New Tensor Core-accelerated Block Sparse Matrix — Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix — Vector Multiplication (cusparseSpMV) with better performance (see the sketch after this list).
  • Extended functionalities for cusparseSpMV:
    • Support for the CSC format.
    • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
    • Support for mixed regular-complex data type computation.
    • Support for deterministic and non-deterministic computation.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix — Matrix Multiplication (cusparseSpMM) with better performance especially for small matrices.
  • New routine for Sampled Dense Matrix — Dense Matrix Multiplication (cusparseSDDMM) which deprecated cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
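
A condensed sketch of the generic cusparseSpMV flow for a CSR matrix, as referenced above (the device arrays are assumed to be allocated and filled; error checks are omitted; the algorithm enum is spelled CUSPARSE_MV_ALG_DEFAULT in some older releases):

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // y = alpha * A * x + beta * y for a CSR matrix A (m x n, nnz nonzeros).
    void spmv_csr(int m, int n, int nnz,
                  int *dRowPtr, int *dColInd, float *dVals,
                  float *dX, float *dY) {
        cusparseHandle_t handle;
        cusparseCreate(&handle);

        // Wrap the raw device arrays in opaque descriptors.
        cusparseSpMatDescr_t matA;
        cusparseCreateCsr(&matA, m, n, nnz, dRowPtr, dColInd, dVals,
                          CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                          CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
        cusparseDnVecDescr_t vecX, vecY;
        cusparseCreateDnVec(&vecX, n, dX, CUDA_R_32F);
        cusparseCreateDnVec(&vecY, m, dY, CUDA_R_32F);

        float alpha = 1.0f, beta = 0.0f;
        size_t bufSize = 0;
        void *dBuf = nullptr;
        // Query the workspace size, allocate it, then run the multiply.
        cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                &alpha, matA, vecX, &beta, vecY,
                                CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
        cudaMalloc(&dBuf, bufSize);
        cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     &alpha, matA, vecX, &beta, vecY,
                     CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

        cudaFree(dBuf);
        cusparseDestroyDnVec(vecX);
        cusparseDestroyDnVec(vecY);
        cusparseDestroySpMat(matA);
        cusparseDestroy(handle);
    }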

Deprecations:

  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • COO Array of Structure (CooAoS) format has been deprecated including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.

Known Issues:

  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.

Resolved Issues:

  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
NPP

New Features:

  • New APIs added to compute Distance Transform using the Parallel Banding Algorithm (PBA):
    • nppiDistanceTransformPBA_xxxxx_C1R_Ctx() — where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
    • nppiSignedDistanceTransformPBA_32f_C1R_Ctx()

Resolved Issues:

  • Fixed the issue in which Label Markers added the zero pixel as an object region.

nvJPEG

New Features:

  • The nvJPEG decoder added new APIs to support region-of-interest (ROI) based decoding for the batched hardware decoder:
    • nvjpegDecodeBatchedEx()
    • nvjpegDecodeBatchedSupportedEx()

cuFFT

Known Issues:

  • cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
  • Plans with strides, with primes larger than 127 in the FFT size decomposition, and with a total transform size (including strides) larger than 32 GB produce incorrect results.

Resolved Issues:

  • Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
  • Large prime factors in the size decomposition and real-to-complex or complex-to-real FFT types no longer cause cuFFT plan functions to fail (a minimal plan/execute sketch follows this list).
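
For context, the "plan functions" in these notes are the cufftPlan*/cufftMakePlan* family; a minimal single-precision complex-to-complex transform looks roughly like this (the size is illustrative, error checks omitted):

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;
        cufftComplex *data;
        cudaMalloc(&data, n * sizeof(cufftComplex));
        // ... fill `data` on the device ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);           // plan a 1D C2C FFT, batch of 1
        cufftExecC2C(plan, data, data, CUFFT_FORWARD); // in-place forward transform
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }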
CUPTI

Deprecations early notice:

  • The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:
    • NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h
    • cuptiDeviceGetTimestamp from the header cupti_events.h

The complete release notes are available in the CUDA Toolkit documentation.


Once again, after reinstalling Windows, I realized I had to set up the drivers, CUDA, cuDNN, and Tensorflow/Keras for training neural networks.

Each time this turns out to be a simple but time-consuming task for me: finding a suitable combination of Tensorflow/Keras, CUDA, cuDNN, and Python is not hard, but I only remember these dependencies at the moment when, importing Tensorflow, I see that the GPU was not detected, and I start searching for the right page in the Tensorflow documentation.

This time the situation got a bit more complicated. Besides installing Tensorflow, I also needed to install PyTorch, which comes with its own dependencies and supported versions of Python, CUDA, and cuDNN.

After several hours of experiments, I decided I should record all the useful links in one post for my future self.

A short algorithm for installing Tensorflow and PyTorch

Note: Tensorflow and PyTorch can be installed in a single virtual environment, but this article does not cover that setup.

Preparing for installation

  1. Determine which Python version is supported by Tensorflow and PyTorch (at the time of writing, I was unable to install PyTorch in a virtual environment with Python 3.9.5)
  2. For the chosen Python version, find suitable versions of Tensorflow and PyTorch
  3. Determine which CUDA versions are supported by the Tensorflow and PyTorch versions chosen earlier
  4. Determine which cuDNN version Tensorflow supports: not every cuDNN version that works with a given CUDA version is supported by Tensorflow. I did not notice this quirk with PyTorch

Installing CUDA and cuDNN

  1. Download a suitable CUDA version and install it. You can install it with all the default settings
  2. Download the cuDNN version matching the chosen Tensorflow version (step 1.2). Downloading cuDNN requires registration on the NVIDIA website. "Installing" cuDNN amounts to unpacking the archive and replacing the existing CUDA files with the files from the archive

Installing Tensorflow

  1. Create a virtual environment for Tensorflow with the chosen Python version. Let's call it, for example, py38tf
  2. Switch to the py38tf environment and install the supported Tensorflow version: pip install tensorflow==x.x.x
  3. Check GPU support with the command
    python -c "import tensorflow as tf; print('CUDA available' if tf.config.list_physical_devices('GPU') else 'CUDA not available')"

Installing PyTorch

  1. Create a virtual environment for PyTorch with the chosen Python version. Let's call it, for example, py38torch
  2. Switch to the py38torch environment and install the supported PyTorch version (an example command is sketched after this list)
  3. Check GPU support with the command
    python -c "import torch; print('CUDA available' if torch.cuda.is_available() else 'CUDA not available')"

In my case, the following combination worked:

  • Python 3.8.8
  • NVIDIA driver 441.22
  • CUDA 10.1
  • cuDNN 7.6
  • Tensorflow 2.3.0
  • PyTorch 1.7.1+cu101

Tensorflow and PyTorch are installed in separate virtual environments.

Summary

The benefit of this article won't show up for a while: I don't reinstall my system often.

If you use this algorithm and find any mistakes, please write in the comments.

If you liked this article, you can check out my Telegram channel, where I post short notes about Python, .NET, and Go.
