NLSE for the gifted amateur¶
The goal of this presentation is to introduce the detailed functionality of NLSE
which is the main workhorse of the numerical simulations we do in the group.
This presentation will be centered around the following plan:
- Introduction
- Split-step spectra schemes
NLSE
main class- First example: turbulence experiment
- Second example: broadcasting
- Third example: callbacks
- Conclusion
Introduction¶
Due to the high dimensional parameter space involved in our fluids of light experiments (see yesterday's talk ...), it is often useful to resort to numerical campaigns to accelerate grid searches, or get to otherwise inaccessible observables.
It is paramount to perfectly understand the numerical limitations, and mathematical particularities of the chosen solver to avoid potential pitfalls of unphysical results or convergence issues.
Before looking at any physical observables, you should be absolutely sure that your modelling encapsulates the physics you want to observe !
This presentation is meant as a detailed guide to use NLSE
but also as a zen guide towards numerical experiments.
Split-step spectral schemes¶
Let's first present the crux of this projet: split-step spectral schemes.
We'll start with a very general NLSE type equation:
$$ i\frac{\partial\psi}{\partial z} = -\frac{1}{2k_0}\nabla^2\psi + V\psi + g|\psi|^2\psi $$
How might we go about solving it ?
Split-step¶
If we first go to a Euler-type expansion:
$$ \frac{\mathrm{d}\psi}{\psi} = \left(i \frac{1}{2k_0}\nabla^2 -iV -i g|\psi|^2\right)\mathrm{d}z $$
Integrating over a "small enough" $\delta z$:
$$ \psi(z+\delta z) = e^{\left(i \frac{1}{2k_0}\nabla^2 -iV -i g|\psi|^2\right)\delta z}\psi(z) $$
Now, depending on needs, you might use more involved differentiation formulas like Runge-Kutta or something else ... You might rewrite this as follows:
$$ \psi(z+\delta z) = e^{(L+N)\delta z}\psi(z) $$ where $L=i \frac{1}{2k_0}\nabla^2$ and $N=-iV -i g|\psi|^2$.
Now the most careless of you might think "wait, this is easy, we can just expand the exponential into a product !"
You have to keep in mind that we are actually working with matrices $L_{ij}$ / $N_{ij}$ i.e everything is happening on a discrete grid $r_{ij} = (x_i, y_j)$.
We thus can expand the exponential using Baker-Hausdorff, at leading order of $\delta z$: this is the split-step part.
The next question is then how can we efficiently compute the two exponentials ?
Spectral¶
The trick is to compute the Laplacian in Fourier space since $\nabla \iff \times i\mathbf{k}$
$$ e^{i\frac{1}{2k_0}\nabla^2} = \mathcal{F}^{-1}\left(e^{-i\frac{\mathbf{k}^2}{2k_0}}\right) $$
We can finally put everything back together:
$$ \psi(z+\delta z) = e^{-iV-ig|\psi|^2\delta z}\mathcal{F}^{-1}\left(e^{-i\frac{\mathbf{k}^2}{2k_0}\delta z}\mathcal{F}\left( \psi(z)\right)\right) $$
This is the so-called split-step spectral method, accurate with a $O(\delta z)$ error.
By "leap-frogging" the non-linear step (i.e applying half of it before the linear step, and half of it after), we can get a $O(\delta z^3)$ error.
Remarks about complexity¶
We see that we have 2 FFT and 2 matrix multiplication per solver cycle, or 2FFT and 3 matrix multiplication in the case of the leapfrog scheme.
The good thing about FFT's is that they can be carried out with $N\mathrm{log}(N)$ complexity.
Matrix multiplications on the other hand cost us $O(N^2)$.
This means that we will quickly be limited by matrix multiplications, and this means that switching between complex64
and complex128
precision will have a quadratic impact on performance.
This is important to keep in mind when trying to optimize performance with these solvers.
NLSE
main class¶
We'll start with the main class and then expose all of the internals such as to make the inner workings of the class very clear.
Let's start with the parameters needed to run a simulation:
N
: the number of grid pointsn2
: the non-linear index of refraction $n_2$ in $m^2/W$window
: the physical size of the computational window in $m$puiss
: the total power of the beam in $W$Isat
: the saturation intensity of the medium in $W/m^2$L
: the length of the non-linear medium in $m$alpha
: the linear absorption coefficient in $m^{-1}$backend
: whether to solve the equation using the GPU or the CPU
Since most of the computations required are elementwise matrix multiplications and FFT's, they are very well suited to GPU computation.
Thus we provide either a CPU (to run on all machines), or a GPU backend.
Thus most of the performance of NLSE
is obtained with the GPU backend.
An instance of the simulation can be initialized as follows:
from NLSE import NLSE
N = 1024
n2 = -1.6e-9
window = 12e-3
puiss = 1.05
Isat = 10e4
L = 20e-2
alpha = 20
backend = "CPU" # no GPU on the mac ...
simu = NLSE(
alpha, puiss, window, n2, None, L, NX=N, NY=N, Isat=Isat, backend=backend
)
CuPy not available, falling back to CPU BACKEND ...
Behind the curtain, the __init__
function works as follows:
def __init__(
self,
alpha: float,
puiss: float,
window: float,
n2: float,
V: np.ndarray,
L: float,
NX: int = 1024,
NY: int = 1024,
Isat: float = np.inf,
nl_length: float = 0,
wvl: float = 780e-9,
backend: str = __BACKEND__,
) -> object:
"""Instantiates the simulation.
Solves an equation : d/dz psi = -1/2k0(d2/dx2 + d2/dy2) psi + k0 dn psi +
k0 n2 psi**2 psi
Args:
alpha (float): alpha
puiss (float): Power in W
window (float): Computational window in the transverse plane in m.
n2 (float): Non linear coeff in m^2/W
V (np.ndarray): Potential.
L (float): Length in m of the nonlinear medium
NX (int, optional): Number of points in the x direction. Defaults to 1024.
NY (int, optional): Number of points in the y direction. Defaults to 1024.
Isat (float): Saturation intensity in W/m^2
nl_length (float): Non local length in m
wvl (float): Wavelength in m
backend (str, optional): "GPU" or "CPU". Defaults to __BACKEND__.
"""
# listof physical parameters
self.backend = backend
if self.backend == "GPU" and self.__CUPY_AVAILABLE__:
self._kernels = kernels_gpu
self._convolution = signal_cp.oaconvolve
else:
self.backend = "CPU"
self._kernels = kernels_cpu
self._convolution = signal.oaconvolve
self.n2 = n2
self.V = V
self.wl = wvl
self.k = 2 * np.pi / self.wl
self.L = L # length of the non linear medium
self.alpha = alpha
self.puiss = puiss
self.I_sat = Isat
# number of grid points in X (even, best is power of 2 or low prime factors)
self.NX = NX
self.NY = NY
self.window = window
Dn = self.n2 * self.puiss / self.window**2
z_nl = 1 / (self.k * abs(Dn))
if isinstance(z_nl, np.ndarray):
z_nl = z_nl.min()
self.delta_z = 1e-2 * z_nl
# transverse coordinate
self.X, self.delta_X = np.linspace(
-self.window / 2,
self.window / 2,
num=NX,
endpoint=False,
retstep=True,
dtype=np.float32,
)
self.Y, self.delta_Y = np.linspace(
-self.window / 2,
self.window / 2,
num=NY,
endpoint=False,
retstep=True,
dtype=np.float32,
)
# define last axes for broadcasting operations
self._last_axes = (-2, -1)
self.XX, self.YY = np.meshgrid(self.X, self.Y)
# definition of the Fourier frequencies for the linear step
self.Kx = 2 * np.pi * np.fft.fftfreq(self.NX, d=self.delta_X)
self.Ky = 2 * np.pi * np.fft.fftfreq(self.NY, d=self.delta_Y)
self.Kxx, self.Kyy = np.meshgrid(self.Kx, self.Ky)
self.propagator = None
self.plans = None
self.nl_length = nl_length
if self.nl_length > 0:
d = self.nl_length // self.delta_X
x = np.arange(-2 * d, 2 * d + 1)
y = np.arange(-2 * d, 2 * d + 1)
XX, YY = np.meshgrid(x, y)
R = np.hypot(XX, YY)
self.nl_profile = special.kn(0, R / d)
self.nl_profile[
self.nl_profile.shape[0] // 2, self.nl_profile.shape[1] // 2
] = np.nanmax(self.nl_profile[np.logical_not(np.isinf(self.nl_profile))])
self.nl_profile /= self.nl_profile.sum()
else:
self.nl_profile = np.ones((self.NY, self.NX), dtype=np.float32)
In order to solve the equation we need to do Fourier transforms and matrix multiplications, thus we want to optimize both of these operations.
This is carried out using two tricks:
- Careful planing of the FFT's and caching of the plans
- Optimize as kernel operations all of the elementwise operations needed for matrix multiplications and exponential calculation
The planning is done by the _build_fft_plan
functions.
The kernels are defined in kernels_cpu.py
and kernels_gpu.py
.
Numba and Cupy are leveraged in order to just-in-time compile these operations to such that subsequent evaluations are greatly accelerated, while making the code easy to read and modify.
The zen of all of these operations is to try to do as much as possible in place meaning modifying the field array itself, rather than returning a result that implies making a copy. This is more memory efficient, as well as being (in general) faster.
As with anything, we are scientists so everything should be measured ! This means that if at some point, for some reason, you are not happy with the performance of something (i.e you have measured that it is the limiting factor), you should focus your efforts on this single function, and properly profiling it. For this, there are a lot of tools available on GPU or CPU.
For a good first overview, I recommend scalene
since it is both accurate and low overhead, as well as having a nice graphical output.
For GPU, if you want to get into the nitty gritty of what the GPU is doing, you can use nsys
for general purpose profiling or ncu
for kernel profiling, allowing you to inspect the actual assembly instructions getting executed.
I cannot stress this enough: if you feel like your code is running slow for any reason, don't infer anything ! PROFILE !
The main loop is carried out in the out_field
function, that calls the split_step
routine.
split_step
evolves the field from a single step $\delta z$.
Most of the out_field
function is simply boilerplate code in order to handle the various use cases allowed by the different function arguments like normalizing the field to its SI units, sending the arrays to the GPU etc ...
First example: turbulence experiment¶
In this numerical experiment, we will consider the scenario of two conterstreaming components colliding each other leading to the emergence of a turbulent pattern.
We thus need to define two counter streaming components.
import numpy as np
import matplotlib.pyplot as plt
# Define two counterstreaming components
w0 = 1e-3
kx = 2*np.pi*1e4
# make the two components collide in the middle of the cell, so shift each components according
# to the angle
offset = simu.L/2*np.tan(kx/simu.k)
E0 = np.exp(-np.hypot(simu.XX-offset/2, simu.YY)**2 / w0**2)*np.exp(-1j*kx*simu.XX)
E1 = np.exp(-np.hypot(simu.XX+offset/2, simu.YY)**2 / w0**2)*np.exp(1j*kx*simu.XX)
E = E0 + E1
E = E.astype(np.complex64)
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].set_title("Intensity")
ax[0].imshow(np.abs(E)**2)
ax[1].set_title("Phase")
ax[1].imshow(np.angle(E))
<matplotlib.image.AxesImage at 0x3182da0b0>
# Propagate the field and plot the result
simu.out_field(E, L, plot=True, verbose=False)
No FFT wisdom found, starting over ...
array([[-1.57505088e-03+0.00792102j, -2.11011060e-02-0.00718021j, -1.90684132e-05-0.01505165j, ..., -7.62164453e-03-0.00162892j, -3.19967978e-03-0.0109691j , 6.26206677e-03-0.00162177j], [-1.43756764e-02+0.01247456j, -2.09142361e-02+0.00049959j, -1.50449784e-03-0.01132568j, ..., 6.12096046e-04-0.00066167j, -7.65215419e-03-0.01113786j, -7.09726335e-03+0.00676738j], [-6.87225629e-03+0.01417069j, -1.60270017e-02+0.00730511j, 9.34352260e-03-0.00237211j, ..., 2.48461426e-03+0.00613431j, -6.54618628e-03-0.0036783j , 6.33071293e-04+0.0065929j ], ..., [ 1.87251822e-03+0.00836913j, 2.48270738e-03-0.00627732j, 1.88777296e-04+0.00803543j, ..., 5.27432328e-03+0.00889279j, -7.09344959e-03+0.02019726j, -5.69764199e-03+0.00469369j], [ 3.08908289e-04+0.00951514j, -2.94797658e-03-0.00660053j, 9.24818031e-03-0.00477759j, ..., 1.38760842e-02+0.01127611j, 6.66631712e-03+0.00415882j, 5.73196495e-03+0.00675212j], [-9.78972297e-03+0.00961811j, -1.01577435e-02+0.00342469j, 7.46147009e-03-0.00994513j, ..., -6.68157218e-03+0.00671399j, -5.16944658e-03+0.00113266j, -8.90304241e-03+0.00838057j]], dtype=complex64)
Second example: broadcasting¶
Now let's say you would want to simulate several realizations of this turbulent regime, or explore it for different initial velocities.
You could use a for
loop of course, but this would not optimally use the GPU since a single realization does not saturate the compute bandwidth.
The solution is to propagate a tensor of size $N_{real} \times N_y \times N_x$.
# Create the noise for the inital state
N_real = 4
noise = np.random.normal(size=(N_real, N, N), scale=1e-3)
In order for the computation to "work", the dimensions must be compatible i.e broadcastable.
This means that the last dimensions must match. This is intricately linked to the fact that arrays are stored in column-major order i.e the fast axis is the last one.
# use broadcasting to add the noise to the field
E_noisy = E+noise
E_out = simu.out_field(E_noisy, L, plot=False, verbose=False)
fig, ax = plt.subplots(2, 4, figsize=(20, 10), layout='constrained')
for j in range(N_real):
ax[0, j].imshow(np.abs(E_out[j])**2)
ax[1, j].imshow(np.angle(E_out[j]), cmap='twilight_shifted')
ax[0, j].set_title(f"Intensity of realization {j+1}")
ax[1, j].set_title(f"Phase of realization {j+1}")
plt.show()
This broadcasting mechanism holds for all the physical parameters of the simulation except for the parameters involved in the spatial grid (window
,NX
, NY
).
For instance if you want to simulate for different powers, you can set the puiss
attribute to a $N_{powers}$ array as follows:
# powers from 100mW to 1W
powers = np.linspace(100e-3, 1, 4)
simu.puiss = powers
E_out = simu.out_field(E_noisy, L, plot=False, verbose=False)
fig, ax = plt.subplots(2, 4, figsize=(20, 10), layout='constrained')
for j in range(N_real):
ax[0, j].imshow(np.abs(E_out[j])**2)
ax[1, j].imshow(np.angle(E_out[j]), cmap='twilight_shifted')
ax[0, j].set_title(f"Intensity of realization {j+1}")
ax[1, j].set_title(f"Phase of realization {j+1}")
plt.show()
Example 3: callbacks¶
So far, we focused on the final state of the field after evolution through the cell. However we are often interested in following what goes on inside of the cell.
To this end, the most efficient strategy is to periodically call a function (a callback) during the main loop, such as to extract samples of the field, or compute some observables.
This is implemented through the callback
and callback_args
arguments of the out_field
function.
def callback(simu: NLSE, A: np.ndarray, z: float, i: int, *callback_args) -> None:
"""A callback function for the NLSE class.
Args:
simu (NLSE): The simulation object
A (np.ndarray): The field at the current step
z (float): The current propagation distance in meters
i (int): The main loop index
"""
pass
To avoid hindering the solver performance, this callback should never transfer memory from the GPU to the CPU, and not involve too complicated operations (or not get called too often).
For instance, let us say we want to monitor the evolution of the vortex pattern through the cell by capturing 50 samples of the field:
simu.puiss = 1.05
N_samples = 50
z_samples = np.zeros(N_samples+1) # +1 to account for the 0th step
E_samples = np.zeros((N_samples+1, N, N), dtype=np.complex64)
N_steps = int(round(L/simu.delta_z))
save_every = N_steps//N_samples
def callback_sample(simu: NLSE, A: np.ndarray, z: float, i: int, E_samples: np.ndarray, z_samples: np.ndarray) -> None:
"""A callback function for the NLSE class.
Args:
simu (NLSE): The simulation object
A (np.ndarray): The field at the current step
z (float): The current propagation distance in meters
i (int): The main loop index
E (np.ndarray): The array of samples
"""
if i % save_every == 0:
E_samples[i//save_every] = A
z_samples[i//save_every] = z
simu.out_field(E_noisy[0], L, callback=callback_sample, callback_args=(E_samples, z_samples), verbose=False)
array([[-7.21237616 +2.79155185j, -3.58023392 -1.61741079j, 4.53818319 +7.67447543j, ..., -5.73130509 +2.70783451j, -2.86101388 +2.03802937j, 0.6190914 +6.24772357j], [-0.95392757 +1.5010416j , 0.20216859 +1.61027864j, -2.53268794 -3.91604142j, ..., 1.43774605 -0.60042879j, -0.17978893 +2.90864939j, -0.30998178 -0.39407016j], [ 0.29458962 -1.2377478j , 2.01554041 +1.92857698j, -2.64517562 +1.50773119j, ..., -2.41226429 -3.74424766j, 2.25600225 +2.30820533j, 3.18904872 +6.24033632j], ..., [ 1.54710148 +3.51537624j, -1.72804895 +4.84356993j, -3.20783944 -1.3120838j , ..., 7.61440974 +3.93907792j, -6.11401361 +0.1773452j , 1.81747225 +4.99841063j], [-0.867815 +4.61964874j, -2.70005787 -2.92649871j, 2.97150492 -1.98346005j, ..., 1.80444977 -3.41660001j, 2.11955062 +6.53519037j, 2.60024699 +1.33259563j], [-0.37632324 -0.82162912j, 3.72192964 +0.15756382j, -5.76203935-11.88884746j, ..., 4.23041974 +4.16055011j, 6.18265031 -1.0867379j , -2.72527623 +4.60821959j]])
Now let's plot the results
from matplotlib import animation
from IPython.display import HTML
import matplotlib.ticker as ticker
def fmt(x, pos) -> str:
a, b = '{:.0e}'.format(x).split('e')
b = int(b)
return r'${} \times 10^{{{}}}$'.format(a, b)
rho = np.abs(E_samples)**2
phi = np.angle(E_samples)
fig, ax = plt.subplots(1, 2, figsize=(15, 7.5), layout='constrained', dpi=300)
im0 = ax[0].imshow(rho[0])
im1 = ax[1].imshow(phi[0], cmap='twilight_shifted')
fig.colorbar(im0, ax=ax[0], label="Intensity", format=ticker.FuncFormatter(fmt), shrink=0.6)
fig.colorbar(im1, ax=ax[1], label="Phase", shrink=0.6)
fig.suptitle("Field evolution through the cell")
ax[0].set_title("Intensity at z=0 mm")
ax[1].set_title("Phase at z=0 mm")
def animate(i: int):
im0.set_data(rho[i])
im1.set_data(phi[i])
ax[0].set_title(f"Intensity at z={z_samples[i]*1e3:.0f} mm")
ax[1].set_title(f"Phase at z={z_samples[i]*1e3:.0f} mm")
im0.set_clim(0, np.max(rho[i]))
im1.set_clim(-np.pi, np.pi)
return im0, im1
anim = animation.FuncAnimation(fig, animate, frames=N_samples+1, interval=150, blit=True)
HTML(anim.to_html5_video())
Animation movie is 102218258 bytes, exceeding the limit of 20971520.0. If you're sure you want a large animation embedded, set the animation.embed_limit rc parameter to a larger value (in MB).