asked · 122k views · 4 votes
Consider the product |w σ′(wa+b)|. Suppose |w σ′(wa+b)| ≥ 1.

(1) Argue that this can only ever occur if |w| ≥ 4.

(2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa+b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ).

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

1 Answer

3 votes

Final answer:

In order to satisfy the condition |w σ′(wa+b)| ≥ 1, we must have |w| ≥ 4. The set of input activations that satisfy this condition can range over an interval no greater in width than (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ). Numerically, this expression is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45.

Step-by-step explanation:

To start, recall that σ is the sigmoid function, so σ′(z) = σ(z)(1 − σ(z)). Writing u = σ(z), where 0 < u < 1, the product u(1 − u) is largest at u = 1/2, where it equals 1/4. Hence σ′(z) ≤ 1/4 for every z, and therefore |w σ′(wa+b)| ≤ |w|/4. For this product to be at least 1 we need |w|/4 ≥ 1, that is, |w| ≥ 4.
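
As a quick numerical sanity check (a minimal sketch in Python; the helper names sigmoid and sigmoid_prime are ours, not from the original answer), we can confirm that σ′ never exceeds 1/4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan a wide range of z; the maximum of sigma'(z) is 1/4, attained at z = 0.
z = np.linspace(-10.0, 10.0, 100001)
print(sigmoid_prime(z).max())  # ~0.25
# Hence |w * sigma'(wa + b)| <= |w|/4, so the condition forces |w| >= 4.
```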

Now suppose |w| ≥ 4 and ask which input activations a satisfy |w σ′(wa+b)| ≥ 1. Write z = wa + b and u = σ(z), so the condition becomes u(1 − u) ≥ 1/|w|. This quadratic inequality holds exactly when u lies between the two roots u± = (1 ± √(1 − 4/|w|))/2. Inverting the sigmoid gives z = ln(u/(1 − u)), so the admissible z form an interval symmetric about z = 0 whose right endpoint is z* = ln(u₊/(1 − u₊)). Setting s = √(1 − 4/|w|), we have u₊/(1 − u₊) = (1 + s)/(1 − s), and since 1 − s² = 4/|w| this simplifies to |w|(1 + s)/2 − 1. The z-interval therefore has width 2z*, and because a = (z − b)/w, the corresponding set of a has width at most (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ).
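
To make the algebra concrete, here is a small brute-force cross-check (again our own sketch, with illustrative helper names): scan values of a, measure the width of the region where |w σ′(wa+b)| ≥ 1, and compare it against the closed-form bound.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def measured_width(w, b=0.0):
    # Brute force: keep the a values where |w * sigma'(w*a + b)| >= 1.
    a = np.linspace(-5.0, 5.0, 2_000_001)
    ok = np.abs(w * sigmoid_prime(w * a + b)) >= 1.0
    return a[ok].max() - a[ok].min() if ok.any() else 0.0

def bound_width(w):
    # Closed-form bound: (2/|w|) * ln(|w| * (1 + sqrt(1 - 4/|w|)) / 2 - 1).
    s = np.sqrt(1.0 - 4.0 / abs(w))
    return (2.0 / abs(w)) * np.log(abs(w) * (1.0 + s) / 2.0 - 1.0)

for w in (4.5, 6.9, 10.0):
    print(w, measured_width(w), bound_width(w))
    # Measured and closed-form widths agree up to grid resolution.
```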

Finally, to verify the claim numerically, treat the bound W(|w|) = (2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 ) as a function of |w| on |w| > 4 and locate its maximum. The maximum occurs at |w| ≈ 6.9, where W ≈ 0.45. So even when everything lines up perfectly, the range of input activations that avoids the vanishing gradient problem is only about 0.45 wide.
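
A short script (again just a sketch) that locates this maximum by evaluating the bound on a fine grid:

```python
import numpy as np

def bound_width(w):
    # Width bound as a function of |w|, valid for |w| > 4.
    s = np.sqrt(1.0 - 4.0 / w)
    return (2.0 / w) * np.log(w * (1.0 + s) / 2.0 - 1.0)

w = np.linspace(4.001, 20.0, 1_000_000)
widths = bound_width(w)
i = np.argmax(widths)
print(w[i], widths[i])  # ~6.9, ~0.45
```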


answered by ZecKa (7.6k points)