asked · 122k views · 4 votes
Consider the product |w σ′(wa+b)|. Suppose |w σ′(wa+b)| ≥ 1.

(1) Argue that this can only ever occur if |w| ≥ 4.

(2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa+b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ).

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

1 Answer

3 votes

Final answer:

In order to satisfy the condition |w σ′(wa+b)| ≥ 1, we must have |w| ≥ 4. The set of input activations that satisfy this condition can range over an interval no greater in width than (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ). Numerically, this expression is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45.

Step-by-step explanation:

To start, recall that σ is the sigmoid function, so σ′(z) = σ(z)(1 − σ(z)). Writing u = σ(z), where 0 < u < 1, the product u(1 − u) is largest at u = 1/2, where it equals 1/4. Hence σ′(z) ≤ 1/4 for every z, and therefore |w σ′(wa+b)| ≤ |w|/4. For this product to be at least 1 we need |w|/4 ≥ 1, that is, |w| ≥ 4.
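
As a quick numerical sanity check (a minimal sketch in Python; the helper names sigmoid and sigmoid_prime are ours, not from the original answer), we can confirm that σ′ never exceeds 1/4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan a wide range of z; the maximum of sigma'(z) is 1/4, attained at z = 0.
z = np.linspace(-10.0, 10.0, 100001)
print(sigmoid_prime(z).max())  # ~0.25
# Hence |w * sigma'(wa + b)| <= |w|/4, so the condition forces |w| >= 4.
```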

Now suppose |w| ≥ 4 and ask which input activations a satisfy |w σ′(wa+b)| ≥ 1. Write z = wa + b and u = σ(z), so the condition becomes u(1 − u) ≥ 1/|w|. This quadratic inequality holds exactly when u lies between the two roots u± = (1 ± √(1 − 4/|w|))/2. Inverting the sigmoid gives z = ln(u/(1 − u)), so the admissible z form an interval symmetric about z = 0 whose right endpoint is z* = ln(u₊/(1 − u₊)). Setting s = √(1 − 4/|w|), we have u₊/(1 − u₊) = (1 + s)/(1 − s), and since 1 − s² = 4/|w| this simplifies to |w|(1 + s)/2 − 1. The z-interval therefore has width 2z*, and because a = (z − b)/w, the corresponding set of a has width at most (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ).
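
To make the algebra concrete, here is a small brute-force cross-check (again our own sketch, with illustrative helper names): scan values of a, measure the width of the region where |w σ′(wa+b)| ≥ 1, and compare it against the closed-form bound.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def measured_width(w, b=0.0):
    # Brute force: keep the a values where |w * sigma'(w*a + b)| >= 1.
    a = np.linspace(-5.0, 5.0, 2_000_001)
    ok = np.abs(w * sigmoid_prime(w * a + b)) >= 1.0
    return a[ok].max() - a[ok].min() if ok.any() else 0.0

def bound_width(w):
    # Closed-form bound: (2/|w|) * ln(|w| * (1 + sqrt(1 - 4/|w|)) / 2 - 1).
    s = np.sqrt(1.0 - 4.0 / abs(w))
    return (2.0 / abs(w)) * np.log(abs(w) * (1.0 + s) / 2.0 - 1.0)

for w in (4.5, 6.9, 10.0):
    print(w, measured_width(w), bound_width(w))
    # Measured and closed-form widths agree up to grid resolution.
```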

Finally, to verify the claim numerically, treat the bound W(|w|) = (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ) as a function of |w| on |w| > 4 and locate its maximum. The maximum occurs at |w| ≈ 6.9, where W ≈ 0.45. So even when everything lines up perfectly, the range of input activations that avoids the vanishing gradient problem is only about 0.45 wide.
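
A short script (again just a sketch) that locates this maximum by evaluating the bound on a fine grid:

```python
import numpy as np

def bound_width(w):
    # Width bound as a function of |w|, valid for |w| > 4.
    s = np.sqrt(1.0 - 4.0 / w)
    return (2.0 / w) * np.log(w * (1.0 + s) / 2.0 - 1.0)

w = np.linspace(4.001, 20.0, 1_000_000)
widths = bound_width(w)
i = np.argmax(widths)
print(w[i], widths[i])  # ~6.9, ~0.45
```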


answered by ZecKa (7.6k points)