The most common is the Mean-Squared Error cost function. For linear regression with hypothesis $h_\theta(x) = \theta^T x$ and the cost written with the conventional factor of 1/2, $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, the gradient is:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

This formula shows the gradient computation for linear regression with respect to the Mean-Squared Error cost function.

Types of Gradient Descent Algorithms

There are three types of Gradient Descent Algorithms:

- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent

Batch Gradient Descent

In batch gradient descent, calculating the gradient of the cost function requires summing over all training examples at every step. If we have 5 million training samples, the algorithm has to sum over all 5 million of them to move a single step. The model is updated only after all training samples have been evaluated, so one update corresponds to one full pass (epoch) over the data. Model updates, and in turn training speed, may become very slow for large data-sets.

Python code implementation.

```python
import numpy as np


def gradientDescent(X, y, theta, alpha, num_iters):
    """
    Performs gradient descent to learn theta
    """
    m = y.size  # number of training examples
    for i in range(num_iters):
        # Predictions for all m training examples
        y_hat = np.dot(X, theta)
        # Step against the gradient averaged over the whole training set
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)
    return theta
```

Stochastic Gradient Descent

In stochastic gradient descent (SGD), we use a single training sample at each iteration instead of summing over the whole data-set at every step. SGD is widely used for training on larger data-sets: each update is computationally cheaper, and training can be run in parallel. The training examples need to be randomly shuffled before the updates are calculated. The frequent updates give immediate insight into the performance of the model and its rate of improvement, and this variant is often called an online machine learning algorithm. The frequent updates can also result in a noisy gradient signal, which makes the descent down the error gradient noisy as well.

Python code implementation.

```python
def SGD(f, theta0, alpha, num_iters):
    """
    Arguments:
    f -- the function to optimize; it takes a single argument and
         yields two outputs, a cost and the gradient with respect
         to the argument
    theta0 -- the initial point to start SGD from
    alpha -- the learning rate (step size)
    num_iters -- total iterations to run SGD for

    Return:
    theta -- the parameter value after SGD finishes
    """
    start_iter = 0
    theta = theta0
    # range replaces the Python 2 xrange; "it" avoids shadowing the built-in iter
    for it in range(start_iter + 1, num_iters + 1):
        _, grad = f(theta)
        theta = theta - (alpha * grad)  # there is NO dot product!
    return theta
```

Mini-Batch Gradient Descent

Mini-batch gradient descent splits the training data-set into small batches that are used to calculate model error and update model coefficients. The batched updates provide a computationally more efficient process than stochastic gradient descent. As in batch gradient descent, error information must be accumulated across the training examples, here within each mini-batch.

Python code implementation.

```python
import numpy as np

# make_network, sgd and forward, as well as X_train, y_train, X_test and
# y_test, are assumed to be defined elsewhere; they are not shown here.
minibatch_size = 50
n_experiment = 100

# Create placeholder to accumulate prediction accuracy
accs = np.zeros(n_experiment)

for k in range(n_experiment):
    # Reset model
    model = make_network()

    # Train the model
    model = sgd(model, X_train, y_train, minibatch_size)

    y_pred = np.zeros_like(y_test)

    for i, x in enumerate(X_test):
        # Predict the distribution of label
        _, prob = forward(x, model)
        # Get label by picking the most probable one
        y = np.argmax(prob)
        y_pred[i] = y

    # Compare the predictions with the true labels and take the percentage
    accs[k] = (y_pred == y_test).sum() / y_test.size

print('Mean accuracy: {}, std: {}'.format(accs.mean(), accs.std()))
```
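The mini-batch experiment code relies on an `sgd` helper that is not shown, so the update rule itself (shuffle, split into small batches, update after each batch) may be easier to see in a self-contained form. The following is a minimal sketch for linear regression with the Mean-Squared Error cost, not the helper used in that experiment. The function name `minibatch_gradient_descent`, its parameters, the fixed seeds and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def minibatch_gradient_descent(X, y, theta, alpha, num_epochs,
                               batch_size=50, seed=0):
    """Hypothetical helper: mini-batch gradient descent for linear regression."""
    rng = np.random.default_rng(seed)
    m = y.size                       # number of training examples
    theta = theta.astype(float)      # work on a float copy of the start point
    for epoch in range(num_epochs):
        # Shuffle the example order once per epoch
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            y_hat = X_b @ theta
            # Gradient averaged over the current mini-batch only
            grad = X_b.T @ (y_hat - y_b) / idx.size
            theta = theta - alpha * grad
    return theta


# Illustrative, noise-free synthetic data: y = 2 - 3 * x
data_rng = np.random.default_rng(1)
X_demo = np.c_[np.ones(200), data_rng.normal(size=200)]
y_demo = X_demo @ np.array([2.0, -3.0])

theta_fit = minibatch_gradient_descent(
    X_demo, y_demo, np.zeros(2), alpha=0.1, num_epochs=200, batch_size=20
)
print(theta_fit)
```

Setting `batch_size` to the number of training examples reduces this sketch to batch gradient descent, and setting it to 1 gives the stochastic variant, which is one way to compare the three algorithms on the same data.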