DL-C5W1

Posted on 2019-01-03

RNN

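A recurrent neural network scans the data from left to right, and the parameters are shared across all time steps. W_ax denotes the parameters governing the connection from the input x^<1> to the hidden layer, and every time step uses the same W_ax. The activations, that is, the horizontal connections, are governed by W_aa, which is likewise reused at every time step, and the outputs are governed by W_ya.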

In other words, predicting ŷ^<3> uses not only the input x^<3> but also information from the earlier inputs x^<1> and x^<2>.
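The parameter sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course implementation: the function name `rnn_forward`, the tanh/softmax choices and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a basic RNN over a list of input vectors, reusing the same weights at every step."""
    a = np.zeros(W_aa.shape[0])                    # a^<0> is the zero vector
    outputs = []
    for x in xs:                                   # scan the sequence left to right
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)     # a^<t> depends on a^<t-1> and x^<t>
        z = W_ya @ a + b_y
        e = np.exp(z - z.max())
        outputs.append(e / e.sum())                # softmax output y_hat^<t>
    return outputs

# Toy sizes (hypothetical): 4 hidden units, 3 input features, 2 output classes.
rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 3, 2
params = dict(
    W_aa=rng.normal(size=(n_a, n_a)), W_ax=rng.normal(size=(n_a, n_x)),
    W_ya=rng.normal(size=(n_y, n_a)), b_a=np.zeros(n_a), b_y=np.zeros(n_y),
)
xs = [rng.normal(size=n_x) for _ in range(3)]
y_hat = rnn_forward(xs, **params)
```

Because the hidden state `a` is carried from step to step, `y_hat[2]` is a function of all three inputs, which is the dependence on x^<1> and x^<2> noted above.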

forward propagation


simplification notation


backward propagation

[Figure: diagram of the derivative computation]

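Types of RNN

1 to 1
1 to many: music generation, sequence generation
many to 1: sentiment classification
many to many (equal input and output length): e.g. named entity recognition
many to many (different input and output lengths): e.g. translation, where the input x is Chinese and the output y is English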

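Training a language model

1.

Build a dictionary and convert each word of the input sentence into its one-hot vector.

Sometimes an EOS token is appended to mark the end of the sentence. Words that are not in the dictionary are replaced by UNK.

2.

Build the RNN model. x^<1> is set to the zero vector, and so is the initial activation a^<0>. From a^<1>, a softmax layer predicts what the first word is likely to be. The result ŷ^<1> holds the probability of every dictionary word being the first word, so the softmax output has as many entries as the dictionary has words.

Then predict the second word: the probability of the second word given the first word.

And so on. This is the chain rule of probability.

Multiplying these three probabilities gives the probability of the whole three-word sentence.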
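As a small worked example of that product, the sketch below multiplies per-step conditional probabilities for a three-token sentence. The vocabulary and every probability value are made up for illustration; in a real model each dictionary would be the softmax output of one time step.

```python
import math

# Hypothetical softmax outputs, one per time step, for the sentence "cats sleep <EOS>".
step_probs = [
    {"cats": 0.2, "dogs": 0.3, "sleep": 0.1, "<EOS>": 0.4},   # P(y1)
    {"cats": 0.1, "dogs": 0.1, "sleep": 0.5, "<EOS>": 0.3},   # P(y2 | y1)
    {"cats": 0.1, "dogs": 0.1, "sleep": 0.1, "<EOS>": 0.7},   # P(y3 | y1, y2)
]
sentence = ["cats", "sleep", "<EOS>"]

def sentence_probability(step_probs, sentence):
    """Chain rule: multiply the probability assigned to the observed word at each step."""
    return math.prod(p[w] for p, w in zip(step_probs, sentence))

def sentence_loss(step_probs, sentence):
    """Cross-entropy loss summed over time steps: the negative log of the product."""
    return -sum(math.log(p[w]) for p, w in zip(step_probs, sentence))

prob = sentence_probability(step_probs, sentence)   # 0.2 * 0.5 * 0.7
loss = sentence_loss(step_probs, sentence)
```

Minimizing the summed cross-entropy loss is the same as maximizing the sentence probability, since the loss is its negative logarithm.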

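Sampling novel sequences

After training a sequence model, an informal way to see what it has learned is to sample a new sequence from it.

This is how to generate a randomly chosen sentence from an RNN language model:

First, sample the first word you want the model to generate. Feed in x^<1> = 0 and a^<0> = 0. The first time step then outputs, through the softmax layer, a probability for every possible word, and you sample at random according to this softmax distribution. The distribution tells you the probability that the first word is "a", that it is "aaron", that it is "zulu", and that it is the unknown-word token UNK (there may also be a probability for the end-of-sentence token). Sampling from this vector of probabilities, for example with the numpy command np.random.choice, gives the first word.

At the next time step, take whatever word was just sampled, in its one-hot encoding, and pass it in as the input, then sample the following word in the same way. Whatever you get, keep passing it on, until the last time step.

The result is a randomly generated sentence.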
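The sampling loop can be sketched as follows. Everything here is a toy stand-in: the five-word vocabulary, the random untrained weights and the helper names `toy_step` and `sample_sentence` are assumptions for illustration, and the sketch uses NumPy's `Generator.choice` where the notes mention `np.random.choice`.

```python
import numpy as np

vocab = ["a", "aaron", "zulu", "<UNK>", "<EOS>"]

def toy_step(x, a_prev, W_ax, W_aa, W_ya):
    """One RNN step returning the new activation and the softmax distribution."""
    a = np.tanh(W_ax @ x + W_aa @ a_prev)
    z = W_ya @ a
    p = np.exp(z - z.max())
    return a, p / p.sum()

def sample_sentence(rng, W_ax, W_aa, W_ya, max_len=10):
    n_a, n_v = W_aa.shape[0], len(vocab)
    x, a = np.zeros(n_v), np.zeros(n_a)           # x^<1> = 0, a^<0> = 0
    words = []
    for _ in range(max_len):
        a, p = toy_step(x, a, W_ax, W_aa, W_ya)
        idx = rng.choice(n_v, p=p)                # sample from the softmax distribution
        words.append(vocab[idx])
        if vocab[idx] == "<EOS>":                 # stop early at end of sentence
            break
        x = np.zeros(n_v)
        x[idx] = 1.0                              # one-hot of the sampled word is the next input
    return words

rng = np.random.default_rng(0)
n_a = 8
W_ax = rng.normal(size=(n_a, len(vocab)))
W_aa = rng.normal(size=(n_a, n_a))
W_ya = rng.normal(size=(len(vocab), n_a))
sentence = sample_sentence(rng, W_ax, W_aa, W_ya)
```

With untrained weights the output is gibberish; the point is only the control flow: sample, one-hot encode, feed back in.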

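Character-level language model

For English, the dictionary contains only the 26 letters in upper and lower case, punctuation and so on.

Advantage: no need to worry about unknown tokens.

Disadvantage: it does not capture long-range dependencies as well as a word-level model, because the sequences become much longer, and it is more expensive to train.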

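Problems with RNNs

1. Because of vanishing gradients, RNNs are poor at handling long-term dependencies.

An RNN first runs forward propagation from left to right and then backpropagation. Backpropagation is difficult because of the same vanishing gradient problem as in very deep networks: the error at later time steps has little influence on the computations at earlier time steps. In practice it is hard to get the network to remember whether it saw a singular or a plural noun.

2. Exploding gradients are easier to deal with.

One remedy is gradient clipping: look at the gradient vector, and if it exceeds some threshold, rescale it so that it does not get too large.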
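A minimal sketch of clipping by rescaling, as described above. The function name `clip_by_norm` and the use of the global L2 norm over all gradient arrays are assumptions for this example; other variants, such as clipping each element into a fixed range, are also in common use.

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads                      # small enough, leave untouched
    scale = max_norm / total
    return [g * scale for g in grads]     # same direction, smaller magnitude

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm = sqrt(9 + 16 + 144) = 13
clipped = clip_by_norm(grads, max_norm=6.5)        # every entry is halved
```

Rescaling keeps the direction of the update and only limits its size.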

GRU (gated recurrent unit)

The GRU unit has a new variable c, for "cell", i.e. the memory cell. The memory cell provides memory, for example whether "cat" was singular or plural, so that later in the sentence the network can still tell whether the subject is singular or plural. At time t the memory cell has value c^<t>, and the GRU outputs the activation a^<t> with c^<t> = a^<t>. The two symbols c and a are kept separate even though they are equal here, because in the LSTM they are different values; for the GRU, c^<t> equals the activation a^<t>.

At every time step we consider overwriting the memory cell with a candidate value c̃^<t>, computed with a tanh activation: c̃^<t> = tanh(W_c [c^<t−1>, x^<t>] + b_c).

The GRU update is c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>. When the update gate Γ_u = 1, this reduces to c^<t> = c̃^<t>: the memory cell is set to the candidate value. For all the time steps in between, the gate should be 0, which means "do not update, keep the old value": if Γ_u = 0, then c^<t> = c^<t−1>.
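The simplified GRU update can be written out directly. This is a sketch under stated assumptions: the function name `gru_step` and the toy shapes are invented for illustration, the gate is computed as Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u), and the relevance gate of the full GRU is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, W_c, b_c, W_u, b_u):
    """One step of the simplified GRU (update gate only, no relevance gate)."""
    concat = np.concatenate([c_prev, x])              # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)             # candidate value
    gamma_u = sigmoid(W_u @ concat + b_u)             # update gate, between 0 and 1
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # blend candidate and old memory
    return c, gamma_u

rng = np.random.default_rng(0)
n_c, n_x = 3, 2
W_c, b_c = rng.normal(size=(n_c, n_c + n_x)), np.zeros(n_c)
W_u = np.zeros((n_c, n_c + n_x))
c_prev = np.array([0.5, -0.2, 0.9])
x = rng.normal(size=n_x)

# A very negative gate bias forces gamma_u ~ 0 (keep memory);
# a very positive one forces gamma_u ~ 1 (overwrite with the candidate).
c_keep, _ = gru_step(c_prev, x, W_c, b_c, W_u, b_u=np.full(n_c, -100.0))
c_new, _ = gru_step(c_prev, x, W_c, b_c, W_u, b_u=np.full(n_c, 100.0))
```

When the gate stays near 0 across many steps, c is copied forward almost unchanged, which is why the unit helps with long-range dependencies.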

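The full GRU unit

Summary:

The GRU is a structure for mitigating the vanishing gradient problem of RNNs over long sequences. It introduces a gate Γ that decides whether the activation at a time step is carried over from the previous time step (memory kept) or replaced by the newly computed value (memory overwritten).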

LSTM

Recap of GRU

As before there is an update gate Γ_u with parameters W_u: Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u). What is new in the LSTM is that the update is no longer controlled by a single gate. The two factors Γ_u and 1 − Γ_u of the GRU update are replaced by separate gates, with Γ_u staying in the first position.

The second position uses the forget gate, Γ_f = σ(W_f [a^<t−1>, x^<t>] + b_f).

There is also a new output gate, Γ_o = σ(W_o [a^<t−1>, x^<t>] + b_o).

The memory cell update becomes c^<t> = Γ_u ∗ c̃^<t> + Γ_f ∗ c^<t−1>. This lets the memory cell keep the old value c^<t−1>, add the new candidate c̃^<t>, or both, because the update gate Γ_u and the forget gate Γ_f are independent.

Finally, a^<t> = c^<t> becomes a^<t> = Γ_o ∗ c^<t> (the more common formulation applies a tanh first: a^<t> = Γ_o ∗ tanh(c^<t>)).
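Putting the three gates together gives the sketch below. The function name `lstm_step`, the parameter dictionary layout and the toy sizes are assumptions for illustration; the candidate is taken as c̃^<t> = tanh(W_c [a^<t−1>, x^<t>] + b_c), and the output uses the common tanh(c^<t>) variant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One LSTM step with update, forget and output gates (no peephole connections)."""
    concat = np.concatenate([a_prev, x])               # [a^<t-1>, x^<t>]
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])    # candidate memory
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])    # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])    # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])    # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev
    a = gamma_o * np.tanh(c)
    return a, c

rng = np.random.default_rng(0)
n_a, n_x = 3, 2
zeros_W = np.zeros((n_a, n_a + n_x))
p = {
    "W_c": rng.normal(size=(n_a, n_a + n_x)), "b_c": np.zeros(n_a),
    "W_u": zeros_W, "b_u": np.full(n_a, -100.0),   # update gate ~ 0: ignore the candidate
    "W_f": zeros_W, "b_f": np.full(n_a, 100.0),    # forget gate ~ 1: keep the old memory
    "W_o": zeros_W, "b_o": np.zeros(n_a),          # output gate = 0.5
}
a_prev = np.zeros(n_a)
c_prev = np.array([0.3, -0.7, 1.2])
a, c = lstm_step(a_prev, c_prev, rng.normal(size=n_x), p)
```

With the gates pinned this way the memory cell passes through unchanged, the behaviour that lets an LSTM carry information across many time steps.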

peephole connection

The gate values may depend not only on a^<t−1> and x^<t> but also peek at the previous memory cell value c^<t−1>. This is called a peephole connection: c^<t−1> enters the computation of all three gates (Γ_u, Γ_f, Γ_o) alongside a^<t−1> and x^<t>.

For details, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/

forward propagate

Here i_t is the update gate and f_t is the forget gate.

BRNN


Deep RNN

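Instead of a^<0> for the activation at time 0, we now write a^[1]<0> for the first layer. In general a^[l]<t> denotes the activation of layer l at time step t. So a^[1]<1> is the activation of the first layer at the first time step, a^[1]<2> is the first layer at the second time step, followed by a^[1]<3> and a^[1]<4>. Stacking such layers on top of each other gives a new network with three hidden layers.

As a concrete example, consider how a^[2]<3> is computed. It has two inputs, one from below and one from the left: a^[2]<3> = g(W_a^[2] [a^[2]<2>, a^[1]<3>] + b_a^[2]). The parameters W_a^[2] and b_a^[2] are the same for every computation in this layer; correspondingly, the first layer has its own parameters W_a^[1] and b_a^[1].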

DL-C4W4

Posted on 2019-01-01

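Face recognition

one shot learning

The company database most likely contains only one photo of each employee, so feeding a single image into a network and classifying it with softmax is clearly not workable. Instead, learn a similarity function that measures how different two images are: if the result is above some threshold the two faces do not match, otherwise they match.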

siamese network

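Feed x1 into the network to get the output vector of a fully connected layer. This vector has dimension 128x1 and is called the encoding of x1.

Feed a second image x2 into the same network to get another vector, the encoding of x2.

The distance between the two images is then defined as the norm of the difference of their encodings.

More precisely, the parameters of the network define an encoding function f(x^(i)): given an input image x^(i), the network outputs a 128-dimensional encoding of x^(i).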

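Triplet loss

Goal:

||f(A) − f(P)||^2 − ||f(A) − f(N)||^2 ≤ 0

Problem:

If f always outputs 0, the inequality above is satisfied trivially and is meaningless.

To prevent this, the objective is tightened: the left side should not be merely less than or equal to 0 but smaller than 0 by a gap, i.e. ||f(A) − f(P)||^2 − ||f(A) − f(N)||^2 ≤ −α, where α is another hyperparameter. This stops the network from producing useless outputs. By convention α is written on the left, ||f(A) − f(P)||^2 − ||f(A) − f(N)||^2 + α ≤ 0, and it is called the margin.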

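Summary:

The triplet loss is defined on three images A, P and N: the anchor, positive and negative examples. The positive image shows the same person as the anchor; the negative image shows a different person.

Loss function:

L(A, P, N) = max(||f(A) − f(P)||^2 − ||f(A) − f(N)||^2 + α, 0)

This is the loss for a single triplet. The cost function of the whole network is the sum of these losses over all triplets in the training set.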

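Choosing the triplets

How should the triplets that make up the training set be chosen? If A, P and N are picked at random from the training set, subject only to A and P being the same person and A and N being different people, the constraint d(A, P) + α ≤ d(A, N) is very easy to satisfy, because a randomly chosen N is very likely to differ from A far more than P does. Here d(A, P) = ||f(A) − f(P)||^2 and d(A, N) = ||f(A) − f(N)||^2, so the constraint reads ||f(A) − f(P)||^2 + α ≤ ||f(A) − f(N)||^2. For random A and N, the right side will most likely exceed the left by much more than α, and the network learns nothing from such a triplet.

So choose the triplets that are hard to learn, those with d(A, P) approximately equal to d(A, N). Only then does gradient descent have work to do and learn meaningful parameters.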
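A small numeric sketch makes the difference between easy and hard triplets concrete. The two-dimensional encodings below are made up for illustration (real encodings here are 128-dimensional), and the margin of 0.2 is an assumed value.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0) for one triplet."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return max(d_ap - d_an + alpha, 0.0)

f_a = np.array([1.0, 0.0])          # anchor
f_p = np.array([0.9, 0.1])          # positive: close to the anchor, d(A, P) = 0.02
f_n_easy = np.array([-1.0, 0.0])    # easy negative: far away, d(A, N) = 4.0
f_n_hard = np.array([0.8, 0.1])     # hard negative: d(A, N) = 0.05, close to d(A, P)

loss_easy = triplet_loss(f_a, f_p, f_n_easy)   # constraint already met, loss is 0
loss_hard = triplet_loss(f_a, f_p, f_n_hard)   # 0.02 - 0.05 + 0.2 = 0.17
```

The easy triplet contributes zero loss and therefore zero gradient, while the hard one still pushes the encodings apart.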

summary

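Face verification and binary classification

Precomputation

When an employee walks up to the door, the door should open automatically. For the images already in the database, the embedding does not have to be recomputed every time; it can be computed in advance. When someone approaches, the convolutional network computes the encoding of the new image, which is then compared with the precomputed encodings to output the prediction ŷ.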

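Neural style transfer

How do we judge how good a generated image is? The cost function is defined in two parts.

J_content(C, G)

The first part is the content cost, a function of the content image and the generated image that measures how similar the content of the generated image G is to that of the content image C.

J_style(S, G)

To this we add a style cost, a function of S and G that measures how similar the style of image G is to the style of image S.

J(G) = α J_content(C, G) + β J_style(S, G)

Outline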

content cost function

To measure how similar a content image and a generated image are in content, let a^[l][C] and a^[l][G] be the layer-l activations for the two images C and G. If the two activations are similar, the two images have similar content.

We define J_content(C, G) = 1/2 ||a^[l][C] − a^[l][G]||^2 as the degree to which the two activations differ: take the layer-l hidden unit activations of the content image and the generated image, subtract them element-wise, and square. A normalization constant in front, such as 1/2 or something else, can be included or not; it matters little, because it can be absorbed by the hyperparameter α in J(G) = α J_content(C, G) + β J_style(S, G).

style cost function

For the style matrix, compute G^[l], an n_c × n_c matrix, i.e. a square matrix, because layer l has n_c channels. The entry G^[l]_kk′ measures how correlated the activations in channel k are with the activations in channel k′, where k and k′ range from 1 to n_c, the number of channels in layer l.

This can be viewed as the unnormalized covariance between two channels.

implementation

using a ConvNet to compute encodings

An Inception network is used here.

Each of the two images is converted into a 128-dimensional vector, and the distance between the two vectors is computed.

compute triplet loss

# GRADED FUNCTION: triplet_loss

def triplet_loss(y_true, y_pred, alpha = 0.2):
    """
    Implementation of the triplet loss as defined by formula (3)
    
    Arguments:
    y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
    y_pred -- python list containing three objects:
            anchor -- the encodings for the anchor images, of shape (None, 128)
            positive -- the encodings for the positive images, of shape (None, 128)
            negative -- the encodings for the negative images, of shape (None, 128)
    
    Returns:
    loss -- real number, value of the loss
    """
    
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    
    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Compute the (encoding) distance between the anchor and the positive, you will need to sum over axis=-1
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,positive)),axis=-1)
    # Step 2: Compute the (encoding) distance between the anchor and the negative, you will need to sum over axis=-1
    neg_dist =  tf.reduce_sum(tf.square(tf.subtract(anchor,negative)),axis=-1)
    # Step 3: subtract the two previous distances and add alpha.
    basic_loss = tf.add(alpha,tf.subtract(pos_dist,neg_dist))
    # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss,0))
    ### END CODE HERE ###
    
    return loss

verify

Set a threshold here and compare the image against the photo already stored in the database.

# GRADED FUNCTION: verify

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".
    
    Arguments:
    image_path -- path to an image
    identity -- string, name of the person you'd like to verify the identity. Has to be a resident of the Happy house.
    database -- python dictionary mapping names of allowed people's names (strings) to their encodings (vectors).
    model -- your Inception model instance in Keras
    
    Returns:
    dist -- distance between the image_path and the image of "identity" in the database.
    door_open -- True, if the door should open. False otherwise.
    """
    
    ### START CODE HERE ###
    
    # Step 1: Compute the encoding for the image. Use img_to_encoding() see example above. (≈ 1 line)
    encoding = img_to_encoding(image_path,model)
    
    # Step 2: Compute distance with identity's image (≈ 1 line)
    dist = np.linalg.norm(database[identity]-encoding)
    
    # Step 3: Open the door if dist < 0.7, else don't open (≈ 3 lines)
    if dist<0.7:
        print("It's " + str(identity) + ", welcome home!")
        door_open = True
    else:
        print("It's not " + str(identity) + ", please go away")
        door_open = False
        
    ### END CODE HERE ###
        
    return dist, door_open

face recognition

Find the database entry whose distance to the input image is smallest; that entry is the recognized person.

# GRADED FUNCTION: who_is_it

def who_is_it(image_path, database, model):
    """
    Implements face recognition for the happy house by finding who is the person on the image_path image.
    
    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of the person on the image
    model -- your Inception model instance in Keras
    
    Returns:
    min_dist -- the minimum distance between image_path encoding and the encodings from the database
    identity -- string, the name prediction for the person on image_path
    """
    
    ### START CODE HERE ### 
    
    ## Step 1: Compute the target "encoding" for the image. Use img_to_encoding() see example above. ## (≈ 1 line)
    encoding = img_to_encoding(image_path,model)
    
    ## Step 2: Find the closest encoding ##
    
    # Initialize "min_dist" to a large value, say 100 (≈1 line)
    min_dist = 100
    
    # Loop over the database dictionary's names and encodings.
    for (name, db_enc) in database.items():
        
        # Compute L2 distance between the target "encoding" and the current "emb" from the database. (≈ 1 line)
        dist = np.linalg.norm(db_enc-encoding)

        # If this distance is less than the min_dist, then set min_dist to dist, and identity to name. (≈ 3 lines)
        if dist<min_dist:
            min_dist = dist
            identity = name

    ### END CODE HERE ###
    
    if min_dist > 0.7:
        print("Not in the database.")
    else:
        print ("it's " + str(identity) + ", the distance is " + str(min_dist))
        
    return min_dist, identity

Neural Style Transfer

compute the content cost

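Within the network, shallow convolutional layers tend to detect low-level features such as edges and simple textures, while deep layers tend to detect high-level features such as object classes.

The goal is for the generated image G to have the same content as the input image C. Suppose the activations of one chosen layer represent the content of an image. In practice, choose a layer in the middle of the network, neither too shallow nor too deep.

So pick a hidden layer, feed image C into the pretrained VGG network and run forward propagation. Let a^(C) be the activations of the chosen hidden layer, a tensor of shape n_H x n_W x n_C. Repeat this for image G to get a^(G).

The content cost is defined as

J_content(C, G) = 1 / (4 × n_H × n_W × n_C) × Σ (a^(C) − a^(G))^2

where n_H, n_W and n_C are the height, width and number of channels of the chosen hidden layer, and the sum runs over all entries.

To make J_content(C, G) easier to compute, the 3D volumes are unrolled into 2D matrices.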

# GRADED FUNCTION: compute_content_cost

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost
    
    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C 
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G
    
    Returns: 
    J_content -- scalar that you compute using equation 1 above.
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    new_shape = [m,(n_H*n_W),n_C]
    
    # Reshape a_C and a_G (≈2 lines)
    a_C_unrolled = tf.reshape(a_C,shape=new_shape)
    a_G_unrolled = tf.reshape(a_G,shape=new_shape)
    
    # compute the cost with tensorflow (≈1 line)
    J_content = 1/(4*n_H*n_W*n_C)*tf.reduce_sum(tf.square(tf.subtract(a_C_unrolled,a_G_unrolled)))
    ### END CODE HERE ###
    
    return J_content

STYLE MATRIX

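In linear algebra, the style matrix is also called the Gram matrix.

The Gram matrix G of a set of vectors (v1, v2, …, vn) is the matrix of dot products of every pair of vectors vi, vj. Each entry can be read as a measure of how similar vi is to vj: the more similar they are, the larger the dot product.

G_ij = v_i^T v_j = np.dot(v_i, v_j). In other words, G_ij compares how similar v_i is to v_j.

This matrix has shape (n_C, n_C), where n_C is the number of filters. G_ij measures how similar the activations of filter i are to the activations of filter j.

The diagonal element G_ii measures how active filter i is. For example, if filter i detects vertical textures, G_ii measures how common vertical textures are in the image as a whole.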


# GRADED FUNCTION: gram_matrix

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)
    
    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    
    ### START CODE HERE ### (≈1 line)
    GA =  tf.matmul(A,tf.transpose(A))
    ### END CODE HERE ###
    
    return GA

Style cost

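Once the style matrices are computed, the goal is to minimize the distance between the Gram matrix of the style image and the Gram matrix of the generated image. This gives a loss function, which turns the task into an optimization problem. For a single layer:

J_style^[l](S, G) = 1 / (4 × n_C^2 × (n_H × n_W)^2) × Σ_i Σ_j (G^(S)_ij − G^(G)_ij)^2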

# GRADED FUNCTION: compute_layer_style_cost

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S 
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G
    
    Returns: 
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    new_shape = [n_H*n_W,n_C]
    # Reshape the images to have them of shape (n_C, n_H*n_W) (≈2 lines)
    a_S = tf.reshape(a_S,new_shape)
    a_S = tf.transpose(a_S)
    a_G = tf.reshape(a_G,new_shape)
    a_G = tf.transpose(a_G)
    # Computing gram_matrices for both images S and G (≈2 lines)
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss (≈1 line)
    J_style_layer = 1/(4*(n_H*n_W*n_C)**2)*tf.reduce_sum(tf.square(tf.subtract(GS,GG)))
    
    ### END CODE HERE ###
    
    return J_style_layer

Defining J_style

Defining the total cost

Then minimize this cost function.

result:

DL-C4W3(YOLO)

Posted on 2018-12-31

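Intro: defining the bounding box

If there are 80 classes to detect, there are 80 class values c (c1, c2, …, c80).

YOLO

"You only look once": a single forward pass through the network, followed by non-max suppression, outputs the detected boxes.

Encoding

Anchor boxes let a single grid cell detect more than one object. The anchor box shapes are defined in advance; each object is then assigned not only to the grid cell containing its midpoint but also to one anchor box of that cell.

For each anchor box, compute the probability that the box contains each class.

Visualizing predictions

When there are too many boxes, non-max suppression removes the overlapping ones.

The code follows.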


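def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
    """
    Filters YOLO boxes by thresholding on object and class confidence.

    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1), the p_c (confidence that some object is present) of each of the 5 anchor boxes predicted in each of the 19x19 cells
    boxes -- tensor of shape (19, 19, 5, 4), the (b_x, b_y, b_h, b_w) of every anchor box
    box_class_probs -- tensor of shape (19, 19, 5, 80), the class probabilities (c1, c2, ..., c80) of every anchor box in every cell
    threshold -- real value; a box is kept only if its highest class score is at least this value

    Returns:
    scores -- tensor of shape (None,), the class score of each kept box
    boxes -- tensor of shape (None, 4), the (b_x, b_y, b_h, b_w) of each kept box
    classes -- tensor of shape (None,), the index of the class detected in each kept box

    Note: "None" is used because the number of kept boxes is not known in advance; it depends on the threshold.
    For example, if 10 boxes are kept, scores has shape (10,).
    """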
    
    # Step 1: Compute box scores
    ### START CODE HERE ### (≈ 1 line)
    box_scores = box_confidence * box_class_probs
    ### END CODE HERE ###
    
    # Step 2: Find the box_classes thanks to the max box_scores, keep track of the corresponding score
    ### START CODE HERE ### (≈ 2 lines)
    box_classes = K.argmax(box_scores, axis = -1)
    box_class_scores = K.max(box_scores, axis = -1)
    ### END CODE HERE ###
    
    # Step 3: Create a filtering mask based on "box_class_scores" by using "threshold". The mask should have the
    # same dimension as box_class_scores, and be True for the boxes you want to keep (with probability >= threshold)
    ### START CODE HERE ### (≈ 1 line)
    filtering_mask = (box_class_scores >= threshold)
    ### END CODE HERE ###
    
    # Step 4: Apply the mask to scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = tf.boolean_mask(box_class_scores,filtering_mask)
    boxes = tf.boolean_mask(boxes,filtering_mask)
    classes= tf.boolean_mask(box_classes,filtering_mask)
    
    ### END CODE HERE ###
    
    return scores, boxes, classes

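Non-max suppression

1. Set a threshold, say 0.6, and discard every box with p_c <= 0.6. This first removes all low-probability boxes.

2. Pick the remaining box with the largest p_c as an output, then discard every other box whose IoU (intersection over union) with it is >= 0.5.

3. Repeat step 2 on the boxes that are left until none remain.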
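The procedure above can be sketched in plain Python, independent of the TensorFlow helper used in the assignment code. The function names, the corner-coordinate box format (x1, y1, x2, y2) and the example boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes that survive score filtering and NMS."""
    # Step 1: drop low-confidence boxes, then sort the rest by score, best first.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)          # step 2: highest-scoring remaining box
        keep.append(best)
        # Discard boxes that overlap the chosen box too much, then repeat.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
```

Here box 1 overlaps box 0 with IoU 81/119 and is suppressed, box 3 fails the score threshold, and box 2 survives because it overlaps nothing.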

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
    """
    Applies Non-max suppression (NMS) to set of boxes
    
    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (, None), predicted score for each box
    boxes -- tensor of shape (4, None), predicted box coordinates
    classes -- tensor of shape (, None), predicted class for each box
    
    Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
    function will transpose the shapes of scores, boxes, classes. This is made for convenience.
    """
    
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')     # tensor to be used in tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor
    
    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
    ### START CODE HERE ### (≈ 1 line)
    indicesList = tf.image.non_max_suppression(
        boxes,
        scores,
        max_boxes,
        iou_threshold,
        score_threshold=float('-inf'),
        name=None
    )

    ### END CODE HERE ###
        
    # Use K.gather() to select only nms_indices from scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    # Gather the entries of scores, boxes and classes selected by indicesList
    scores = K.gather(scores,indicesList)
    boxes = K.gather(boxes,indicesList)
    classes = K.gather(classes,indicesList)
    ### END CODE HERE ###
    
    return scores, boxes, classes

Filtering all boxes

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.
    
    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (None, ), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """
    
    ### START CODE HERE ### 
    
    # Retrieve outputs of the YOLO model (≈1 line)
    box_confidence,box_xy,box_wh,box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions 
    boxes = yolo_boxes_to_corners(box_xy, box_wh) 
    # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
    scores,boxes,classes = yolo_filter_boxes(box_confidence,boxes,box_class_probs,score_threshold)
    # Scale boxes back to original image shape.
    boxes = scale_boxes(boxes, image_shape)
    # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold (≈1 line)
    scores,boxes,classes = yolo_non_max_suppression(scores, boxes, classes,max_boxes,iou_threshold)
    ### END CODE HERE ###
    
    return scores, boxes, classes

YOLO summary

DL-C4W2

Posted on 2018-12-29

LeNet-5

AlexNet

conv->max pool->conv->max pool->conv->conv->conv->max pool->fc->fc->fc

VGG-16

The 16 refers to the 16 layers that have parameters to learn; pooling layers have no learnable parameters.

[CONV 64] below an arrow means a convolution with 64 filters, and x2 means two such convolutional layers in a row.

ResNet

residual block

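Adding a shortcut (skip connection) turns the layers into a residual block.

This makes it possible to build much deeper networks.

Residual Network

A network without residual blocks is called a plain network.

why it works?

The deeper a plain network gets, the more parameters it has and the harder they are to choose, to the point where even learning the identity function f(x) = x becomes difficult. A residual block computes a^[l+2] = g(z^[l+2] + a^[l]); if the weights W^[l+2] and b^[l+2] shrink to 0, this becomes a^[l+2] = g(a^[l]) = a^[l] (with ReLU, since a^[l] >= 0). The identity function is therefore easy for a residual block to learn, so adding the block does not hurt performance.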
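The identity argument can be checked numerically with a fully connected residual block. This is an illustrative sketch: the function name `residual_block`, the dense (rather than convolutional) layers and the toy vector are assumptions made to keep the example short.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    """Two dense layers with a shortcut from the block input to the second activation."""
    a1 = relu(W1 @ a_l + b1)        # a^[l+1]
    z2 = W2 @ a1 + b2               # z^[l+2]
    return relu(z2 + a_l)           # a^[l+2] = g(z^[l+2] + a^[l])

a_l = np.array([0.5, 0.0, 2.0])     # non-negative, as if produced by an earlier ReLU
n = a_l.size
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n, n)), np.zeros(n)

# With the second layer's weights and bias at zero, the block reduces to the identity.
out = residual_block(a_l, W1, b1, W2=np.zeros((n, n)), b2=np.zeros(n))
```

Whatever the first layer computes, zeroing the second layer makes the block pass its input straight through, so weight decay pushing W^[l+2] toward 0 costs nothing in accuracy.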

A pooling layer is inserted after each group of same convolutions.

1x1 convolution

Also called network in network.

Applying a 1x1xc filter to an n x n x c volume gives a new n x n x 1 layer; with f such filters the result is an n x n x f volume.
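A 1x1 convolution is a small fully connected layer applied independently at every pixel, which the sketch below makes explicit. The function name `conv1x1`, the ReLU non-linearity and the sizes (32 input channels reduced to 8) are assumptions for illustration.

```python
import numpy as np

def conv1x1(x, W, b):
    """x: (n_H, n_W, c), W: (c, f), b: (f,). Returns ReLU(x W + b) with shape (n_H, n_W, f)."""
    return np.maximum(np.einsum("hwc,cf->hwf", x, W) + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 32))     # n x n x c input volume
W = rng.normal(size=(32, 8))        # f = 8 filters, each of size 1 x 1 x 32
b = np.zeros(8)
out = conv1x1(x, W, b)              # spatial size unchanged, channels 32 -> 8
```

Each output pixel mixes only the channels at the same location, so the height and width are untouched while the channel count becomes the number of filters.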

Applications

Reducing or increasing the number of channels

The number of channels shrinks when there are fewer filters than input channels.

Adding non-linearity

why?

Cross-channel information interaction

Inception Network

ref:

https://zhuanlan.zhihu.com/p/40050371

https://blog.csdn.net/a1154761720/article/details/53411365

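compute cost

The depth (third dimension) of each filter == the number of channels of the input feature map

Number of output values = output H x W x number of filters

Total computational cost = (output values 28x28x32) × (multiplications per output value 5x5x192) ≈ 120 million multiplications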
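The arithmetic can be checked directly, and compared with a bottleneck design of the kind discussed in the next section. The 28x28x192 input and 28x28x32 output come from the text; the 16 intermediate channels of the bottleneck are an assumed value for illustration.

```python
# Direct 5x5 "same" convolution: 28x28x192 input -> 28x28x32 output.
h, w, c_in, c_out, f = 28, 28, 192, 32, 5
direct = (h * w * c_out) * (f * f * c_in)          # outputs x multiplications per output

# Bottleneck: a 1x1 convolution down to c_mid channels, then the 5x5 convolution.
c_mid = 16                                          # assumed bottleneck width
step1 = (h * w * c_mid) * (1 * 1 * c_in)            # 1x1 conv: 192 -> 16 channels
step2 = (h * w * c_out) * (f * f * c_mid)           # 5x5 conv: 16 -> 32 channels
bottleneck = step1 + step2

print(direct, bottleneck, round(direct / bottleneck, 1))
```

Under these assumptions the bottleneck needs roughly a tenth of the multiplications of the direct convolution.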

bottleneck layer

inception network

google net

1x1 convolutions effectively reduce the number of parameters and computations, which speeds up training.

Note that some side branches also feed into softmax outputs, because the hidden layers sometimes predict reasonably well too. This has a regularizing effect and helps prevent overfitting.

Transfer Learning

Small dataset

Large dataset

Data augmentation

Implemented with multiple CPU threads

Ubuntu headphone / Wi-Fi drivers

Posted on 2018-12-28

Wi-Fi

https://blog.csdn.net/fljhm/article/details/79281655

1. make
2. sudo make install
3. sudo modprobe -a 8821ce

Headphones

1. alsactl restore

sudo gedit /etc/modprobe.d/alsa-base.conf

Add:

options snd-pcsp index=-2
alias snd-card-0 snd-hda-intel
alias sound-slot-0 snd-hda-intel
options snd-hda-intel model=pch position_fix=1

DL-C4W1

Posted on 2018-12-28

Why convolution?

For a large image the input becomes very large. For example, flattening a 1000x1000 RGB image gives a feature vector of shape (3*1000*1000, 1) = (3 million, 1).

If the first hidden layer has only 1000 units, W1 has shape (1000, 3M), i.e. 3 billion parameters,

because z(1000,1) = W1(1000,3M)*X(3M,1)+b

Edge detection

Convolving a 6x6 image with a 3x3 filter produces a 4x4 output.

python:conv_forward

tf.nn.conv2d

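How edge detection works

A vertical edge filter clearly separates edge regions from non-edge regions.

Several kinds of edge detection

The numbers in the filter can be treated directly as parameters to learn.

The network learns, through backpropagation, the filter that suits the target task, then applies it across the whole image and outputs the useful features it extracts.

padding

As seen above, every convolution shrinks the image.

So before convolving, pad the image around its border. Then the output of the convolution keeps the input size, and the corner and edge pixels are no longer under-used.

Valid / Same convolution

Valid: no padding

n x n -> (n-f+1) x (n-f+1)

Same: padding

The output has the same size as the input.

p = (f-1)/2. In computer vision the filter size f is usually odd, so p is an integer.

n + 2p - f + 1 = n, so p = (f-1)/2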

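Strided convolution

stride = 1 means the filter moves one position at a time. With stride s, the output size is floor((n + 2p - f) / s) + 1.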
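The size formulas for valid, same and strided convolutions fit in two small helpers. The function names are hypothetical; the formula is the same one used by the conv_forward implementation later in these notes.

```python
import math

def conv_output_size(n, f, pad=0, stride=1):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * pad - f) / stride) + 1

def same_padding(f):
    """Padding that keeps the size unchanged at stride 1, for an odd filter size f."""
    return (f - 1) // 2

valid = conv_output_size(6, 3)                          # 6x6 image, 3x3 filter
same = conv_output_size(6, 3, pad=same_padding(3))      # padded so the size is preserved
strided = conv_output_size(7, 3, stride=2)              # 7x7 image, stride 2
```

The valid case reproduces the 6x6 to 4x4 example from the edge detection section.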

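Convolution over volumes

In the example, the first row is a filter that detects vertical edges in the red channel only.

The second row is a filter that detects vertical edges in all channels.

The third dimension of the filter equals the number of channels of the image.

Multiple filters

The outputs of a vertical edge detector and a horizontal edge detector are stacked into a two-channel volume.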

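One layer of a convolutional network

As in forward propagation through one layer of an ordinary neural network, a convolutional layer first applies a linear operation with weights and a bias, then passes the result through an activation function.

Here a[0] is the input image (n x n x 3),

w[1] is the filter (f x f x 3),

a[1] is the next layer (4x4x2).

Number of parameters in one convolutional layer

It does not depend on the image size.

Notation

f[l]: filter size

The third dimension of the filter equals the number of channels of the input.

The weights have the size of one filter times the number of filters, and the number of filters equals the number of channels of the output.

The activations have the size of the layer output: nH x nW x nC

A simple convolutional network

The final 7x7x40 volume holds 1960 values in total. These are activations, not parameters; they are unrolled into a vector and fed to the final output unit.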

Pooling layers

Max pooling

Pooling shrinks the feature map from the previous layer: each small region is represented by its maximum value alone.
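A minimal max pooling sketch for a single-channel feature map. The function name `max_pool`, the 2x2 window with stride 2 and the example values are assumptions for illustration.

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling over a 2D feature map with an f x f window."""
    n_h = (x.shape[0] - f) // stride + 1
    n_w = (x.shape[1] - f) // stride + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = window.max()      # keep only the largest value in the region
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
pooled = max_pool(x)                      # 4x4 -> 2x2
```

The window size and stride are hyperparameters; nothing in this function is learned.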

Average pooling

Pooling only has hyperparameters to set; it has no parameters to learn.

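Summary

The defining feature of a CNN is the weight sharing of convolution. It greatly reduces the number of parameters, which helps prevent overfitting and lowers model complexity.

A CNN uses convolution to implement local connectivity. The number of parameters depends only on the filter size, not on the image. Each filter corresponds to one image feature, and the output of each filter is the map of one kind of feature.

That is, the number of trained weights depends only on the size and number of the filters. Note that the number of hidden nodes does not shrink; it depends only on the stride. With stride 1 there are as many hidden nodes as input pixels; with stride 5, only one hidden node is needed per 5x5 pixels.

To summarize again, the key ideas of a CNN are

1. local connectivity

2. weight sharing

3. downsampling in the pooling layers

Items 1 and 2 reduce the number of parameters, which lowers training complexity and reduces overfitting.

Weight sharing also makes convolutional networks tolerant to translation.

As the network gets deeper, the feature maps get smaller while the number of channels grows.

Why use a CNN?

1. Few parameters

2. Parameter sharing and sparsity of connections

Parameter sharing: a convolutional layer can have several different filters, each producing one new filtered image, and every pixel of a given output image is computed with exactly the same filter.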

implementation

zero padding


benefits



def zero_pad(X, pad):
    """
    Pad with zeros all images of the dataset X. The padding is applied to the height and width of an image, 
    as illustrated in Figure 1.
    
    Argument:
    X -- python numpy array of shape (m, n_H, n_W, n_C) representing a batch of m images
    pad -- integer, amount of padding around each image on vertical and horizontal dimensions
    
    Returns:
    X_pad -- padded image of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    
    ### START CODE HERE ### (≈ 1 line)
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), 'constant', constant_values=0)
    ### END CODE HERE ###
    
    return X_pad

forward convolution

def conv_single_step(a_slice_prev, W, b):
    """
    Apply one filter defined by parameters W on a single slice (a_slice_prev) of the output activation 
    of the previous layer.
    
    Arguments:
    a_slice_prev -- slice of input data of shape (f, f, n_C_prev)
    W -- Weight parameters contained in a window - matrix of shape (f, f, n_C_prev)
    b -- Bias parameters contained in a window - matrix of shape (1, 1, 1)
    
    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on a slice x of the input data
    """

    ### START CODE HERE ### (≈ 2 lines of code)
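    # Element-wise product between a_slice_prev and W.
    s = np.multiply(a_slice_prev, W)
    # Sum over all entries of the volume s, then add the bias once
    # (adding b before the sum would count it once per entry of s).
    Z = float(np.sum(s) + np.squeeze(b))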
    ### END CODE HERE ###

    return Z

define a slice


def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function
    
    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"
        
    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward() function
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from A_prev's shape (≈1 line)  
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve dimensions from W's shape (≈1 line)
    (f, f, n_C_prev, n_C) = W.shape

    # Retrieve information from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    pad = hparameters['pad']
    
    # Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
    n_H = int((n_H_prev - f + 2 * pad) / stride) + 1
    n_W = int((n_W_prev - f + 2 * pad) / stride) + 1
    
    # Initialize the output volume Z with zeros. (≈1 line)
    Z = np.zeros((m, n_H, n_W, n_C))
    
    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)
    
    for i in range(m):                                 # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]                     # Select ith training example's padded activation
        for h in range(n_H):                           # loop over vertical axis of the output volume
            for w in range(n_W):                       # loop over horizontal axis of the output volume
                for c in range(n_C):                   # loop over channels (= #filters) of the output volume
                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    # Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
                    Z[i, h, w, c] = conv_single_step(a_slice_prev, W[...,c], b[...,c])
                                        
    ### END CODE HERE ###

    # Making sure your output shape is correct
    assert(Z.shape == (m, n_H, n_W, n_C))
    
    # Save information in "cache" for the backprop
    cache = (A_prev, W, b, hparameters)
    
    return Z, cache

Corresponding notebook

https://github.com/AlexanderChiuluvB/deep-learning-coursera/blob/master/Convolutional%20Neural%20Networks/Convolution%20model%20-%20Step%20by%20Step%20-%20v1.ipynb

DL-C2W3

Posted on 2018-12-27

Hyperparameter tuning

Don't use grid search,

because you do not know in advance which hyperparameters matter most.

Instead, sample hyperparameter values at random.

coarse to fine

If some neighbouring values work well, narrow the search range to that region.

scale for params

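Log scale

When beta is 0.9, changing it from 0.9 to 0.9005 makes very little difference.

But when beta is close to 1, changing it from 0.999 to 0.9995 makes a big difference: the average covers roughly 1/(1-beta) values, so this is the step from about 1000 to about 2000.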
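One way to respect this sensitivity is to sample 1 − beta uniformly on a log scale. The sketch below is illustrative: the function names and the search range 0.9 to 0.999 are assumptions, not values taken from these notes.

```python
import numpy as np

def sample_beta(rng, low=0.9, high=0.999):
    """Sample beta so that 1 - beta is uniform on a log scale between 1 - high and 1 - low."""
    r = rng.uniform(np.log10(1 - high), np.log10(1 - low))   # r in [-3, -1] for the defaults
    return 1 - 10 ** r

def effective_window(beta):
    """Approximate number of values an exponentially weighted average covers."""
    return 1 / (1 - beta)

rng = np.random.default_rng(0)
betas = [sample_beta(rng) for _ in range(1000)]

small_change = effective_window(0.9005) - effective_window(0.9)     # about 0.05
large_change = effective_window(0.9995) - effective_window(0.999)   # about 1000
```

Sampling this way spends as many trials between 0.99 and 0.999 as between 0.9 and 0.99, which a uniform sample over beta itself would not.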

panda / caviar strategy

Panda: with limited computational resources you can only babysit one model. Check its cost at regular intervals, and if a problem occurs, stop and return to a previous state.

Caviar: with enough computational resources you can train many models at the same time and choose the best one.

Batch Normalization

实质是对hidden units做normalization

upload cessful

https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c

filename alry exists, renamed

fitting batch norm to nn

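Batch normalization is applied to each hidden layer on every mini-batch of every epoch.

During backpropagation there is no need to compute db: the normalization Z_tilde = (Z - mean) / sqrt(variance) subtracts the mean, so any constant bias term cancels out.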
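A small NumPy sketch of why the bias drops out. The function name batch_norm_forward, the eps term, and the toy shapes are assumptions for illustration; examples are stored in columns, as in the rest of these notes.

```python
import numpy as np


def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each hidden unit (row) over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))        # 4 hidden units, 3 inputs
A_prev = rng.standard_normal((3, 16))  # mini-batch of 16 examples
b = rng.standard_normal((4, 1))        # bias that the normalization will cancel
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))

out_without_b = batch_norm_forward(W @ A_prev, gamma, beta)
out_with_b = batch_norm_forward(W @ A_prev + b, gamma, beta)
bias_has_no_effect = np.allclose(out_without_b, out_with_b)
```

Adding b shifts every column of Z by the same amount, so the mean absorbs it. The trainable beta takes over the role of the bias.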

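why does batch norm work?

Once you have trained a model and learned a mapping x -> y, a change in the distribution of the input data means you may have to retrain the model. This is called covariate shift.

For example, if you train only on pictures of black cats and then test the model on pictures of coloured cats, it will perform poorly.

Normalizing the hidden units (batch norm) means that even when the earlier inputs shift, the effect on later layers is much smaller, because each layer's inputs are "constrained to have the same mean and variance". This makes the network more robust to changes in its inputs.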

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a “standard deviation” parameter (gamma), and a “mean” parameter (beta) is added to it. In other words, batch normalization lets SGD do the denormalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.

softmax layer

In mathematics, the softmax function takes an un-normalized vector, and normalizes it into a probability distribution.

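As we already know, logistic regression produces a decimal between 0 and 1.0. For example, a logistic regression output of 0.8 from an email classifier means there is an 80% chance that the email is spam and a 20% chance that it is not. Clearly, the probabilities of an email being spam or not spam add up to 1.0.

Softmax extends this idea to the multi-class setting. That is, in a multi-class problem softmax assigns a decimal probability to each class, and these probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.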
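A minimal NumPy sketch of softmax for a column of logits. The function name and the example logits are assumptions for illustration; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np


def softmax(z):
    """Column-wise softmax: exponentiate, then normalize so each column sums to 1."""
    shifted = z - np.max(z, axis=0, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=0, keepdims=True)


logits = np.array([[5.0], [2.0], [-1.0], [3.0]])  # one example, four classes
probs = softmax(logits)
```

The largest logit gets the largest probability, and the four probabilities add up to 1.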

softmax function


understanding softmax


tensorflow

Java Concurrent Programming (Part 1)

Posted on 2018-12-26

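Threads

The smallest unit scheduled by a modern operating system is the thread. A process can create multiple threads; each thread has its own program counter, stack, and local variables, and can access shared memory variables.

The processor switches between these threads at high speed, which gives the user the impression that they are all executing at the same time.

Thread state transitions

Runnable: after the thread object has been created, another thread calls its start() method. A thread in this state sits in the runnable pool, waiting to be given the CPU.

Running: the thread has been given the CPU and is executing program code.

Blocked: the thread has given up the CPU for some reason and temporarily stops running until it becomes runnable again.

There are three kinds of blocking:

Waiting: the running thread calls wait(), and the JVM puts it into the wait pool. This releases the lock the thread holds on that object.
Synchronization blocking: the running thread tries to acquire an object's synchronization lock; if the lock is held by another thread, the JVM puts the thread into the lock pool.
Other blocking: the running thread calls sleep() or join(), or issues an I/O request, and the JVM marks it as blocked. The thread becomes runnable again when sleep() times out, when join() returns because the target thread has terminated or the timeout has expired, or when the I/O completes. sleep() does not release any locks the thread holds.

Why use multiple threads?

Faster response time

For a user-facing operation, work that does not need strong data consistency can be handed off to other threads, so that the thread serving the user's request finishes as quickly as possible.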

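Thread scheduling

Thread priority

Modern operating systems schedule running threads by time slicing: the operating system divides time into slices and allocates a number of them to each thread, and when a thread's time slice is used up, a reschedule occurs. The number of time slices a thread receives determines how much processor time it gets, and thread priority is the attribute that influences how many slices it is allocated.

The default thread priority is 5, and a higher-priority thread is allocated more time slices than a lower-priority one. When setting priorities, give a higher priority to threads that block frequently and a lower priority to compute-heavy threads, so that no thread monopolizes the processor.

Thread priority is inherited: a new thread starts with the priority of the thread that created it.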

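Thread sleep

Thread.sleep(long millis)
Moves the thread into the blocked state; when the sleep ends, the thread becomes runnable again.

Difference between sleep() and yield(): sleep() suspends the current thread, so a thread that calls sleep() is guaranteed not to run during the specified time. yield() only puts the current thread back into the runnable state, so a thread that calls yield() may be scheduled again immediately.

sleep() makes the running thread sleep for a period of time that is set by the program, during which it cannot run. yield() makes the current thread give up the CPU, but for a length of time that cannot be set. In effect, yield() checks whether another thread of the same priority is runnable; if so, it hands the CPU to that thread, and otherwise the original thread keeps running. This is why yield() is described as "giving way": it offers the chance to run to other threads of the same priority.

In addition, sleep() gives lower-priority threads a chance to run, whereas during yield() the current thread remains runnable, so a lower-priority thread cannot obtain the CPU at that moment. On a running system, if the higher-priority threads neither call sleep() nor block on I/O, the lower-priority threads have to wait until all higher-priority threads have finished before they get a chance to run.

Difference between wait and sleep

Thread waiting

Object.wait() makes the current thread wait until another thread calls notify() or notifyAll() on the same object.

wait() and notify() must be used together with synchronized; that is, they operate on a lock that has already been acquired.

wait() means that a thread that holds the object's lock voluntarily releases it and goes to sleep. It stays asleep until another thread calls notify() on the object to wake it, after which it must reacquire the lock before it continues. notify() is the corresponding wake-up operation on the object's lock. Both sleep() and wait() pause the current thread and give up the CPU; the main difference is that Object.wait() also releases the lock.

The classic exercise of making three threads take turns (ThreadA -> ThreadB -> ThreadC) is a synchronization and wake-up problem among three threads. Each thread must hold two object locks before it can proceed: prev, the lock belonging to the previous thread, and self, its own lock. To enforce the order, a thread first acquires prev (which requires the previous thread to have released its own lock), then acquires self, and prints once it holds both. It then calls self.notify() to wake the next waiting thread (its own lock is released when it leaves the synchronized block on self), and calls prev.wait() to release prev and suspend itself until it is woken again in the next iteration of the loop.

Thread yielding

Thread.yield() pauses the currently executing thread and offers the CPU to threads of the same or higher priority.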

public class ThreadYield extends Thread{
  //private String name;
  public ThreadYield(String name){
      super(name);
  }

  public void run(){
    for(int i=1;i<=10;i++){
      System.out.println(this.getName()+"----"+i);
      if(i==3){
        Thread.yield();
      }
    }
  }

  public static void main(String[]args){

    ThreadYield thread1 = new ThreadYield("ALex");
    ThreadYield thread2 = new ThreadYield("Brecher");

    thread1.start();
    thread2.start();

  }
}

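In this run, ALex gives up the CPU when it reaches i=3, and Brecher gets the CPU.

It is also possible that Brecher gives up the CPU when it reaches i=3 and is then scheduled again itself.

Thread join

join() waits for another thread to terminate. When the current thread calls join() on another thread, the current thread blocks until that other thread finishes, and then becomes runnable again.

t.join() waits for thread t to terminate.
join() is a method of the Thread class and is called right after the thread is started. Here "waiting" means that the main thread waits for the child thread to terminate:
the code that follows the call to join() on the child thread runs only after the child thread has finished.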

Thread t = new Athread();
t.start();
t.join();

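Why use join()?

If the child thread takes a long time, the main thread will often finish before it. When the main thread needs the child thread's result, it has to wait for the child thread to complete before it finishes.

Without join():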


public class Thread1 extends Thread{
  private String name;
  public Thread1(String name){
    super(name);
    this.name = name;
  }

  public void run(){
    System.out.println(Thread.currentThread().getName()+" thread started");
    for(int i=0;i<5;i++){
      System.out.println("child thread "+name+" running "+i);
      try{
        // Parenthesize before casting: (int)Math.random()*10 is always 0.
        sleep((int)(Math.random()*10));
      }catch(InterruptedException e){
        e.printStackTrace();
      }
    }
    System.out.println(Thread.currentThread().getName()+" thread finished");
  }


  public static void main(String[] args){

    System.out.println(Thread.currentThread().getName()+" main thread started");
    Thread1 Athread = new Thread1("A");
    Thread1 Bthread = new Thread1("B");
    Athread.start();
    Bthread.start();
    System.out.println(Thread.currentThread().getName()+" main thread finished");

  }


}

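Result: without join(), the main thread usually prints its finish message before the child threads have finished.

Adding join() calls to main: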

public static void main(String[] args){
    System.out.println(Thread.currentThread().getName()+" main thread started");
    Thread1 Athread = new Thread1("A");
    Thread1 Bthread = new Thread1("B");
    Athread.start();
    Bthread.start();
    try{
      Athread.join();
    }catch(InterruptedException e){
      e.printStackTrace();
    }

    try{
      Bthread.join();
    }catch(InterruptedException e){
      e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName()+" main thread finished");
  }

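Result: the main thread's finish message is printed only after both child threads have finished.

Waking threads

Object.notify() wakes up a single thread that is waiting on this object's monitor. If several threads are waiting on this object, one of them is chosen to be woken. A thread waits on an object's monitor by calling one of the wait methods. The awakened thread cannot proceed until the current thread relinquishes the lock on this object.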


import java.util.concurrent.TimeUnit;

public class ThreadState {

  public static void main(String[] args){

      new Thread(new TimeWaiting(),"TimeWaitingThread").start();
      new Thread(new Waiting(),"WaitingThread").start();

      // Use two Blocked threads: one acquires the lock, the other is blocked
      new Thread(new Blocked(),"BlockedThread-1").start();
      new Thread(new Blocked(),"BlockedThread-2").start();
  }

  static class SleepUtils{
      public static void second(long seconds) {
        try {
          TimeUnit.SECONDS.sleep(seconds);
        } catch (InterruptedException e) {
        }
      }
  }

  // This thread sleeps over and over
  static class TimeWaiting implements Runnable{
      public void run(){
        while(true){
          SleepUtils.second(100);
        }
      }
  }

  // This thread waits on the Waiting.class object
  static class Waiting implements Runnable{
    public void run(){
        while(true){
          synchronized (Waiting.class){
              try{
                  Waiting.class.wait();
              }catch(InterruptedException e){
                e.printStackTrace();
            }
          }
        }
    }
  }

  // This thread locks Blocked.class and never releases the lock
  static class Blocked implements Runnable{
    public void run(){
        synchronized (Blocked.class){
          while(true){
              SleepUtils.second(100);
          }
        }
    }
  }


}

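Run the code, open a shell, and run jps to get the process ID of ThreadState. Then run jstack <ID> to print the thread information.

Once a thread has been created and start() has been called, it begins running. When it calls wait() it enters the waiting state, and it needs a notification from another thread to return to the running state.

The timed-waiting state adds a timeout to the waiting state: once the timeout expires, the thread returns to the running state.

If a thread calls a synchronized method but fails to acquire the lock, it enters the blocked state.

Common thread terminology

Main thread: the thread created for main().

Current thread: the thread returned by Thread.currentThread().

Background thread: a thread that provides services to other threads, also called a daemon thread.

Foreground thread: a thread that receives services from background threads.

Daemon threads

A daemon thread is a supporting thread, used for background scheduling and supporting work in a program.

Starting and terminating threads

Constructing a thread

Construction determines the thread's thread group, its priority, whether it is a daemon thread, and so on. The thread then waits in heap memory to be run.

Starting a thread

Calling start() tells the JVM that, as soon as the thread scheduler is free, the thread on which start() was called should be started.

Terminating a thread safely

The interrupt status is a flag on the thread, and interruption is a simple way for threads to interact.

A boolean variable can also be used to control whether the task should stop and the thread terminate.

In the example, the main thread terminates countThread first through interruption and then through the cancel() method.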

import java.util.concurrent.TimeUnit;

public class Shutdown {

    public static void main(String[] args) throws Exception{

        Runner one = new Runner();
        Thread countThread = new Thread(one,"CountThread");
        countThread.start();
        // Sleep for 1 s, then interrupt CountThread from the main thread so that it notices the interrupt and stops
        TimeUnit.SECONDS.sleep(1);
        countThread.interrupt();
        Runner two = new Runner();
        countThread = new Thread(two,"CountThread");
        countThread.start();
        // Sleep for 1 s, then cancel Runner two from the main thread so that CountThread sees on == false and stops
        TimeUnit.SECONDS.sleep(1);
        two.cancel();

    }

    private static class Runner implements Runnable{
        private long i;
        private volatile boolean on = true;
        public void run(){
          while(on&&!Thread.currentThread().isInterrupted()){
            i++;
          }
          System.out.println("Count i = "+i);
        }
        public void cancel(){
            on = false;
        }
    }
}

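Communication between threads

When a thread starts running it has its own stack. Java lets multiple threads access the same object or its member variables at the same time. Objects and their member variables live in shared memory, but each thread may keep its own copy of a variable to speed up execution, so a running thread does not necessarily see the latest value of a variable.

The volatile keyword

When one thread changes a volatile variable, all other threads see the change. volatile guarantees the visibility of the variable across threads.

The synchronized keyword

synchronized ensures that, at any given moment, only one thread can be inside the method or synchronized block. It guarantees both visibility and exclusivity of access to the variables involved.

Every object has its own monitor. When a synchronized block on the object or a synchronized method of the object is invoked, the executing thread must first acquire the object's monitor before it can enter. A thread that fails to acquire the monitor is blocked at the entrance of the synchronized block or method and enters the BLOCKED state.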

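Thread synchronization

1. The synchronized keyword has two scopes:

1) An object instance. synchronized aMethod(){} prevents multiple threads from entering this object's synchronized methods at the same time. (If an object has several synchronized methods, then as long as one thread is inside one of them, other threads cannot enter any synchronized method of that object.) The synchronized methods of different instances do not interfere with each other: other threads can still enter the synchronized methods of another instance of the same class.

Suppose P1 and P2 are different objects of the same class, the class defines the synchronized blocks or methods below, and both P1 and P2 can call them.

public synchronized void methodAAA()

When different threads call this synchronized method on the same object P1, they exclude each other, which gives the synchronization effect.

public void methodAAA(){
    synchronized(this){ /* block */ }
}

this refers to the object on which the method is called.

public void method3(SomeObject so){
    synchronized(so){ /* block */ }
}

2) A class. synchronized static aStaticMethod{} prevents multiple threads from entering the synchronized static methods of this class at the same time. It applies across all instances of the class.

2. Besides marking a whole method, synchronized can also be applied to a block inside a method, so that only that block's resources are accessed exclusively. The usage is synchronized(this){ /* block */ }, and its scope is the current object.

1. The purpose of thread synchronization is to prevent a shared resource from being corrupted when multiple threads access it.

2. Synchronized methods are implemented with locks. Every object has exactly one lock, which is associated with that specific object. Once a thread has acquired an object's lock, other threads cannot enter any other synchronized method of that object; non-synchronized methods remain accessible.

3. For static synchronized methods the lock belongs to the class: the lock object is the class's Class object. Static and non-static locks do not interfere with each other. When a thread that holds a lock calls a synchronized method on another object from inside a synchronized method, it acquires both objects' locks.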

Passing data to threads

Via the constructor


public class MyThreadPrint extends Thread{

    private String name;
    public MyThreadPrint(String name){
      this.name = name;
    }

    public void run(){
        System.out.println("hello"+name);
    }

    public static void main(String [] args){

      Thread thread = new MyThreadPrint("world");
      thread.start();
    }
}

Via fields and setter methods

public class MyThreadPrint implements Runnable{

    private String name;

    public void setName(String name){
      this.name = name;
    }

    public void run(){
        System.out.println("hello"+name);
    }

    public static void main(String [] args){

      MyThreadPrint mythread = new MyThreadPrint();
      mythread.setName("world");
      Thread thread = new Thread(mythread);
      thread.start();
    }
}

waiting and notifying

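The wait/notify mechanism

notify() notifies one thread that is waiting on the object so that it returns from wait(). The waiting thread returns only once it has reacquired the object's lock.

wait() puts the calling thread into the WAITING state; it returns only when another thread notifies it or it is interrupted. Calling wait() releases the object's lock.

Thread.join()

In the example below, each thread can terminate only after its predecessor has terminated: each thread returns from join() once the predecessor thread has finished.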


import java.util.concurrent.TimeUnit;

public class Join {
  public static void main(String[]args)throws Exception{

      Thread previous = Thread.currentThread();
      for(int i=0;i<10;i++) {
        Thread thread = new Thread(new Domino(previous), String.valueOf(i));
        thread.start();
        previous = thread;
      }
      TimeUnit.SECONDS.sleep(5);
      System.out.println(Thread.currentThread().getName()+" terminate.");

  }

  static class Domino implements Runnable{
    private Thread thread;
    public Domino(Thread thread){
        this.thread = thread;
    }
    public void run(){
        try{
          thread.join();
        }catch (InterruptedException e){
          e.printStackTrace();
        }
        System.out.println(Thread.currentThread().getName()+" terminate.");
    }

  }

}

DL C2wk2

Posted on 2018-12-26 | In DL

mini batch gradient descent

mini-batch size

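If mini-batch size = m (m = number of training examples):

(X^{1}, Y^{1}) = (X, Y)

Each step is large and low in noise.

You end up with batch gradient descent, which has to process the whole training set before making any progress.

If mini-batch size = 1, it is called stochastic gradient descent.

Each step is small and noisy, the path may never settle exactly at the minimum, and you lose the benefits of vectorization, so each pass over the data is slow.

In the lecture figure (omitted here), the purple path shows convergence with batch size = 1 and the blue path shows convergence with batch size = m.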

how to choose mini-batch size?

1. Typically use a power of two: 64, 128, 256, or 512.

learning rate decay

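With a fixed learning rate, gradient descent may keep wandering around the minimum and never settle into it. Gradually decaying the learning rate lets the steps shrink as training approaches the minimum.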

implementation


alpha = alpha_0 / (1 + decay_rate * epoch_num)

Other decay schedules can be used as well.
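A small sketch of the decay formula, plus one alternative. The function names are made up for illustration, and exponential decay is an assumed example of an "other" schedule, since the original figure is not available.

```python
def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)


def exponential_decay(alpha0, base, epoch_num):
    """Assumed alternative: alpha = base**epoch_num * alpha_0, with 0 < base < 1."""
    return (base ** epoch_num) * alpha0


# Learning rate for the first four epochs with alpha_0 = 0.2 and decay_rate = 1
schedule = [inverse_time_decay(0.2, 1.0, epoch) for epoch in range(4)]
```

With alpha_0 = 0.2 and decay_rate = 1 the learning rate is halved after the first epoch and keeps shrinking more slowly afterwards.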


mini-Batch gradient descent

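shuffle

Shuffle the columns of X and Y with the same random permutation, so that each example stays paired with its label.

partition

Split the shuffled (X, Y) into consecutive mini-batches of size mini_batch_size.

code

Note that the last mini-batch may be smaller than the others, because the number of examples is not necessarily a multiple of mini_batch_size.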


import math

import numpy as np


def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,mini_batch_size*(k):mini_batch_size*(k+1)]
        mini_batch_Y = shuffled_Y[:,mini_batch_size*(k):mini_batch_size*(k+1)]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,mini_batch_size*(num_complete_minibatches):]
        mini_batch_Y = shuffled_Y[:,mini_batch_size*(num_complete_minibatches):]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

momentum

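In the figure (omitted here), blue is the direction of the gradient and red is the direction of the actual velocity: the gradient influences the velocity, and the velocity determines the descent direction.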

code


def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros(parameters['W'+str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters['b'+str(l+1)].shape)
        ### END CODE HERE ###
        
    return v

update parameters
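A hedged sketch of the momentum update. The name and argument order of update_parameters_with_momentum match the call in model() further down, but the body is an assumption based on the standard formulas v = beta * v + (1 - beta) * grad and param = param - learning_rate * v, not the original assignment code.

```python
import numpy as np


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """One momentum step for every W and b; returns the updated dictionaries."""
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        for name in ("W", "b"):
            key = "d" + name + str(l)
            # Exponentially weighted average of the gradients
            v[key] = beta * v[key] + (1 - beta) * grads[key]
            # Move along the velocity instead of the raw gradient
            parameters[name + str(l)] = parameters[name + str(l)] - learning_rate * v[key]
    return parameters, v


# Toy one-layer example
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.full((2, 2), 0.5), "db1": np.ones((2, 1))}
v = {"dW1": np.zeros((2, 2)), "db1": np.zeros((2, 1))}
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.1)
```

Because v starts at zero, the first step is only (1 - beta) times as large as a plain gradient descent step; the velocity builds up over later iterations.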


Adam optimization


def initialize_adam(parameters) :
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl
    
    Returns: 
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
    ### START CODE HERE ### (approx. 4 lines)
        v["dW" + str(l+1)] = np.zeros_like(parameters['W'+str(l+1)]) #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters['b'+str(l+1)]) #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])
        s["dW" + str(l+1)] = np.zeros_like(parameters['W'+str(l+1)]) #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters['b'+str(l+1)]) #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])


    ### END CODE HERE ###
    
    return v, s

update parameters

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates 
    beta2 -- Exponential decay hyperparameter for the second moment estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        ### START CODE HERE ### (approx. 2 lines)
        v['dW'+str(l+1)] = beta1*v['dW'+str(l+1)]+(1-beta1)*grads['dW'+str(l+1)]
        v['db'+str(l+1)] = beta1*v['db'+str(l+1)]+(1-beta1)*grads['db'+str(l+1)]
        ### END CODE HERE ###

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        v_corrected['dW'+str(l+1)] = v['dW'+str(l+1)]/(1-np.power(beta1,t))
        v_corrected['db'+str(l+1)] = v['db'+str(l+1)]/(1-np.power(beta1,t))
        ### END CODE HERE ###

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        ### START CODE HERE ### (approx. 2 lines)
        s['dW'+str(l+1)] = beta2*s['dW'+str(l+1)]+(1-beta2)*np.power(grads['dW'+str(l+1)],2)
        s['db'+str(l+1)] = beta2*s['db'+str(l+1)]+(1-beta2)*np.power(grads['db'+str(l+1)],2)
        ### END CODE HERE ###

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected['dW'+str(l+1)] = s['dW'+str(l+1)]/(1-np.power(beta2,t))
        s_corrected['db'+str(l+1)] = s['db'+str(l+1)]/(1-np.power(beta2,t))
        ### END CODE HERE ###

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        ### START CODE HERE ### (approx. 2 lines)
        parameters['W'+str(l+1)] -= learning_rate * v_corrected['dW'+str(l+1)] /(epsilon+np.sqrt(s_corrected['dW'+str(l+1)]))
        parameters['b'+str(l+1)] -= learning_rate * v_corrected['db'+str(l+1)] /(epsilon+np.sqrt(s_corrected['db'+str(l+1)]))
        
        ### END CODE HERE ###

    return parameters, v, s
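To see the same update rule in isolation, here is a scalar sketch that minimizes f(w) = (w - 3)^2. The helper name adam_minimize, the toy objective, and the step count are assumptions for illustration; the update itself follows the same five steps as the function above.

```python
import numpy as np


def adam_minimize(grad_fn, w0, steps=500, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run Adam on a single scalar parameter and return its final value."""
    w, v, s = float(w0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g        # moving average of the gradient
        s = beta2 * s + (1 - beta2) * g ** 2   # moving average of the squared gradient
        v_hat = v / (1 - beta1 ** t)           # bias correction
        s_hat = s / (1 - beta2 ** t)
        w -= learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return w


# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Thanks to the bias correction, the very first step has magnitude close to learning_rate regardless of the gradient's scale.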

usage

Note that within each epoch the data is split into mini-batches and the parameters are updated once per mini-batch.

def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates 
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(layers_dims)             # number of layers in the neural networks
    costs = []                       # to keep track of the cost
    t = 0                            # initializing the counter required for Adam update
    seed = 10                        # For grading purposes, so that your "random" minibatches are the same as ours
    
    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2,  epsilon)
        
        # Print the cost every 1000 epoch
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)
                
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

DL C2wk1

Posted on 2018-12-26

Train/dev/test sets

Dev sets, also called hold-out sets, are used to compare models and tune hyperparameters; the test set then gives you an unbiased estimate of the final model’s performance.

example:

If you have 10,000,000 examples, how would you split the train/dev/test set?

98% train, 1% dev, 1% test

The dev and test sets should come from the same distribution.

Bias and Variance


underfitting -> high bias

overfitting -> high variance

example

1. If train error = 1% and dev set error = 11%,

the model is overfitting, which indicates high variance.

2. If the train error itself is large, the model has high bias.

3. In general, if the dev set error is much larger than the train set error, the model is overfitting.

high bias and high variance

The model overfits part of the data and underfits the rest.

basic recipe for ML

What to do about high bias (underfitting)?
(check performance on the training data)

1. Use a bigger network
2. Try a better-suited neural network architecture
3. Try other optimization techniques

What to do about high variance (overfitting)?
(check performance on the dev data)

1. Get more data
2. Use regularization

Regularization

With L1 regularization, w ends up being a sparse vector. In practice, L2 regularization is used much more widely.

λ is called the regularization parameter.

Derivation
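The original derivation was an image. A reconstruction that is consistent with the L2 code further down (the 0.5 * lambd / m factor in the cost and the (lambd/m) * W term in the gradients) would look like this; "term from backprop" stands for the unregularized gradient:

```latex
J(W, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
          + \frac{\lambda}{2m}\sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2

dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m} W^{[l]}

W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}
         = \Big(1 - \frac{\alpha\lambda}{m}\Big) W^{[l]} - \alpha\,(\text{term from backprop})
```

The factor (1 - αλ/m) is slightly smaller than 1, which is why L2 regularization is also called weight decay.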

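how does it work?

From the derivation we can see that the larger λ is, the closer w is pushed towards 0, and so the smaller z becomes. Take tanh as the activation function: when z is small, tanh is close to linear, so every layer of the network behaves almost like a linear unit. This effectively counteracts overfitting, moving the model towards "underfitting" or a just-right fit.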

drop out

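Say we have a network and apply dropout to layer l = 3 with keep_prob = 0.8:

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

80% of the units are kept and 20% are dropped out.

The statement above makes roughly 80% of the entries of d3 equal to 1 (True) and 20% equal to 0 (False).

a3 = np.multiply(a3, d3)

Then scale up, so that the expected value of a3 stays the same:

a3 /= keep_prob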
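The three statements above can be wrapped into one helper to check that scaling by keep_prob keeps the expected activation unchanged. The function name inverted_dropout and the all-ones toy activations are assumptions for illustration.

```python
import numpy as np


def inverted_dropout(a, keep_prob, rng):
    """Zero out each unit with probability 1 - keep_prob, then rescale the rest."""
    mask = rng.random(a.shape) < keep_prob  # True for units that are kept
    return a * mask / keep_prob, mask


rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))  # toy activations, all equal to 1
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)

kept_fraction = d3.mean()       # close to 0.8
mean_after = a3_dropped.mean()  # close to 1.0, the mean before dropout
```

Kept units become 1 / 0.8 = 1.25 and dropped units become 0, so the average stays close to the original value of 1. This is why no extra scaling is needed at test time.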

why does it work?

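1. For layers with many units, set keep_prob lower: this strengthens the regularization and effectively reduces overfitting. For layers with few units, keep_prob = 1.0 is fine.

2. In computer vision the input dimension is usually very large, so dropout is almost always needed.

3. Dropout makes the cost function harder to use for debugging, because the plotted cost curve is no longer well defined. A workaround is to set every keep_prob to 1 first and, if there is no bug, turn dropout on.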

other technique of reducing overfitting

early stopping

To keep the dev set error from increasing, stop training early. You end up with a mid-size ||w||^2.

data augmentation

Obtain more training data by transforming the existing data set.

normalizing inputs

Normalize by subtracting the mean and scaling the variance to 1.

The effect is to speed up optimization.
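A minimal sketch of input normalization with examples stored in columns. The helper names fit_normalizer and normalize, the eps term, and the synthetic data are assumptions for illustration. The statistics are computed on the training set only and then reused for the test set.

```python
import numpy as np


def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma + eps  # eps guards against division by zero


def normalize(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (X - mu) / sigma


rng = np.random.default_rng(0)
# Two features with very different scales
X_train = rng.normal(loc=[[5.0], [-2.0]], scale=[[3.0], [0.5]], size=(2, 1000))
X_test = rng.normal(loc=[[5.0], [-2.0]], scale=[[3.0], [0.5]], size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)  # same mu and sigma as for training
```

After normalization both features have mean 0 and variance 1 on the training set, so the cost surface is more symmetric and gradient descent can take larger steps.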

Vanishing gradients

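In a very deep neural network, the effect of multiplying the weights layer after layer accumulates, with serious consequences.

Parameter initialization:

Choosing the variance of the initial weights carefully keeps w neither much larger nor much smaller than 1, so that the gradients neither vanish nor explode too quickly.

One such method is Xavier initialization.

It keeps the distributions of a layer's inputs and outputs similar (roughly the same mean and variance), which speeds up convergence.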

https://blog.csdn.net/shuzfan/article/details/51338178

gradient checking

grad check

notes

1. Do not use it during training; use it only to debug.

2. If grad check fails, look at the individual components to try to identify the bug.

3. Remember to include the regularization term.

4. It does not work with dropout.

5. Run it at random initialization.
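A compact sketch of gradient checking on a toy cost. The function name gradient_check, the cubic cost, and the deliberately wrong gradient are assumptions for illustration. The idea is to compare the analytic gradient with a two-sided finite difference and look at the relative difference.

```python
import numpy as np


def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Relative difference between the analytic and the numerical gradient."""
    grad = grad_fn(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        # Two-sided difference for component i
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator


def cost(theta):
    return np.sum(theta ** 3)


def good_grad(theta):
    return 3 * theta ** 2  # correct derivative


def buggy_grad(theta):
    return 2 * theta ** 2  # deliberately wrong


theta = np.array([1.0, -2.0, 0.5])
diff_good = gradient_check(cost, good_grad, theta)
diff_bad = gradient_check(cost, buggy_grad, theta)
```

A tiny relative difference suggests the analytic gradient is correct, while a large one (as with buggy_grad here) points to a bug in backpropagation.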

initialization

zero initialization

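If the W matrices are initialized to 0, it is as if you were training a network with only one neuron per layer, because all neurons in a layer learn exactly the same parameters.

The network is then no more powerful than a linear classifier.

The biases, however, can be initialized to 0.

large random initialization

1. Poor initialization can lead to vanishing/exploding gradients.

2. If the initial values of w are very large, gradient descent takes much longer (it needs more iterations).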

He random initialization

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * (np.sqrt(2. / layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
        
    return parameters

xavier random initialization

Simply replace sqrt(2./layers_dims[l-1]) with sqrt(1./layers_dims[l-1]).

L2 Regularization

The code is as follows:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    sumW1 = np.sum(np.square(W1))
    sumW2 = np.sum(np.square(W2))
    sumW3 = np.sum(np.square(W3))
    L2_regularization_cost = (0.5/m*lambd)*(sumW1+sumW2+sumW3)
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

With regularization, the regularization term must also be added during backward propagation: each dW gets an extra (lambd/m)*W.

code

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + (lambd/m)*W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + (lambd/m)*W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + (lambd/m)*W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

The decision boundary was plotted for lambd = 0.6, 0.7, and 0.8 (figures omitted here).

As λ increases, overfitting is reduced, but the training set error increases as well.

drop out

what is inverted dropout?

upload cessful

upload ful

drop out

upload successul

1. Create an np.array D1 with the same shape as A1: np.random.rand(A1.shape[0], A1.shape[1])

2. Set each entry of D1 to 1 where it is less than keep_prob, and to 0 otherwise

3. A1 = np.multiply(A1, D1)

4. Scale A1, i.e. A1 /= keep_prob (inverted dropout)
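The four steps above can be tried in isolation. The following is a minimal sketch, not the assignment's code: the helper name inverted_dropout, the toy all-ones activations and the seed are assumptions made for illustration. It shows why step 4 matters: after dividing by keep_prob, the mean activation stays roughly where it was before dropout.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return the result and the mask.

    Hypothetical helper written for this note, following steps 1-4 above.
    """
    D = rng.random(A.shape) < keep_prob   # steps 1-2: uniform noise -> boolean keep mask
    A_dropped = A * D                     # step 3: zero out the dropped units
    A_dropped = A_dropped / keep_prob     # step 4: rescale so the expected value is unchanged
    return A_dropped, D

rng = np.random.default_rng(0)            # seed chosen arbitrarily, for reproducibility only
A = np.ones((1000, 1000))                 # toy activations, all equal to 1
A_dropped, D = inverted_dropout(A, keep_prob=0.8, rng=rng)

print("fraction kept:", D.mean())
print("mean before:", A.mean(), "mean after:", A_dropped.mean())
```

Because the rescaling already happens during training, nothing has to be changed at test time: dropout is simply switched off.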

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                               # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                               # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

drop out in backward

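Apply the same mask that was used in the forward pass to the gradient (dA2 *= D2), then rescale it by keep_prob (dA2 /= keep_prob); do the same for dA1 with D1.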

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob              # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob              # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

gradient checking

Take J = x*theta as an example


implementation


Note

1	np.linalg.norm computes a norm; by default it is the 2-norm (for a matrix, the Frobenius norm)

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implements gradient checking for the 1-D example J = x*theta.
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaPlus = theta+epsilon
    thetaMinus = theta-epsilon
    JPlus = forward_propagation(x,thetaPlus)
    JMinus = forward_propagation(x,thetaMinus)
    gradapprox = (JPlus-JMinus)/(2*epsilon)
    ### END CODE HERE ###
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)  # the analytic gradient is evaluated at theta, not at gradapprox
    ### END CODE HERE ###
        
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    ### END CODE HERE ###
    
    
    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")
    
    return difference
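To try the check end to end, the two helpers it calls need definitions. Below is a minimal sketch, assuming the 1-D example J = x*theta from above. The helper bodies, the name relative_difference and the deliberately wrong buggy_backward_propagation are written here for illustration and are not taken from the original assignment.

```python
import numpy as np

def forward_propagation(x, theta):
    """Cost of the 1-D example: J(theta) = theta * x (assumed definition)."""
    return theta * x

def backward_propagation(x, theta):
    """Analytic derivative dJ/dtheta = x (assumed definition)."""
    return x

def buggy_backward_propagation(x, theta):
    """Deliberately wrong derivative, used only to show a failing check."""
    return 2 * x

def relative_difference(x, theta, backward, epsilon=1e-7):
    """Relative difference between the analytic and the two-sided numerical gradient."""
    J_plus = forward_propagation(x, theta + epsilon)
    J_minus = forward_propagation(x, theta - epsilon)
    gradapprox = (J_plus - J_minus) / (2 * epsilon)
    grad = backward(x, theta)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

good = relative_difference(2.0, 4.0, backward_propagation)
bad = relative_difference(2.0, 4.0, buggy_backward_propagation)
print("correct gradient:", good)
print("buggy gradient:  ", bad)
```

With the correct derivative the difference falls below the 1e-7 threshold. The buggy version returns 4 where the numerical estimate is about 2, so the difference is about 1/3 and the check fails clearly.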

The multi-dimensional case


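Flatten all params into a single vector, then loop over its entries: for each parameter, compute gradapprox by perturbing only that entry, and store it in a vector. This gives a gradapprox vector to compare against the grad vector obtained from backpropagation (flattened the same way). Finally, take the norms of these two vectors to compute difference.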


code


def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input data, of shape (input size, number of examples)
    Y -- true "labels"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n returns two values but we only need the first one (the cost)
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus =  np.copy(parameters_values)                                       # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                   # Step 2
        J_plus[i], _ =  forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))  # Step 3
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                       # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                 # Step 2        
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus)) # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                                     # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                   # Step 2'
    difference = numerator / denominator                                              # Step 3'
    ### END CODE HERE ###

    if difference > 1e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference
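gradient_check_n relies on helpers (dictionary_to_vector, vector_to_dictionary, gradients_to_vector, forward_propagation_n) that are not shown here. The same flatten-and-perturb procedure can be sketched in a self-contained form on a much smaller model. Everything below is an assumption made for illustration: the linear model with a single output, the mean-squared-error cost, and the helper names flatten, unflatten, cost, gradients and gradient_check.

```python
import numpy as np

def flatten(params, keys):
    """Concatenate the arrays in `params` into one column vector, in `keys` order."""
    return np.concatenate([params[k].reshape(-1, 1) for k in keys], axis=0)

def unflatten(vector, shapes, keys):
    """Inverse of flatten: split the column vector back into named arrays."""
    params, start = {}, 0
    for k in keys:
        size = int(np.prod(shapes[k]))
        params[k] = vector[start:start + size].reshape(shapes[k])
        start += size
    return params

def cost(params, X, Y):
    """Half mean squared error of a single-output linear model Y_hat = W X + b."""
    Y_hat = params["W"] @ X + params["b"]
    return 0.5 * np.mean((Y_hat - Y) ** 2)

def gradients(params, X, Y):
    """Analytic gradients of `cost` with respect to W and b."""
    m = X.shape[1]
    dY = (params["W"] @ X + params["b"] - Y) / m
    return {"W": dY @ X.T, "b": np.sum(dY, axis=1, keepdims=True)}

def gradient_check(params, X, Y, epsilon=1e-7):
    """Compare analytic gradients with two-sided numerical estimates, entry by entry."""
    keys = sorted(params)
    shapes = {k: params[k].shape for k in keys}
    theta = flatten(params, keys)
    grad = flatten(gradients(params, X, Y), keys)
    gradapprox = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = theta.copy()
        theta_plus[i, 0] += epsilon           # perturb only entry i upward
        theta_minus = theta.copy()
        theta_minus[i, 0] -= epsilon          # perturb only entry i downward
        J_plus = cost(unflatten(theta_plus, shapes, keys), X, Y)
        J_minus = cost(unflatten(theta_minus, shapes, keys), X, Y)
        gradapprox[i, 0] = (J_plus - J_minus) / (2 * epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

rng = np.random.default_rng(1)                # arbitrary seed
X = rng.standard_normal((3, 5))
Y = rng.standard_normal((1, 5))
params = {"W": rng.standard_normal((1, 3)), "b": np.zeros((1, 1))}
difference = gradient_check(params, X, Y)
print("difference =", difference)
```

Each loop iteration costs two forward passes, so the check is slow for large networks. It is usually run once on a small example to validate the backward pass and then switched off for training.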

Conclusion

1. Both L2 regularization and dropout help reduce overfitting

2. Regularization drives the weights toward smaller values
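The second point can be observed on a toy problem. The sketch below is an assumption-laden illustration rather than the assignment's model: it trains a plain linear regression by gradient descent with the same (lambd/m)*W term that appears in backward_propagation_with_regularization, and prints the weight norm for a few values of lambd. The function name train_linear, the synthetic data and the hyperparameters are all made up for this example.

```python
import numpy as np

def train_linear(X, Y, lambd, learning_rate=0.1, num_iterations=500):
    """Gradient descent on (1/(2m))*||W X - Y||^2 + (lambd/(2m))*||W||^2."""
    m = X.shape[1]
    W = np.zeros((1, X.shape[0]))
    for _ in range(num_iterations):
        dZ = W @ X - Y
        dW = (1.0 / m) * dZ @ X.T + (lambd / m) * W   # data term + L2 term
        W -= learning_rate * dW
    return W

rng = np.random.default_rng(0)                # arbitrary seed
X = rng.standard_normal((5, 50))              # 5 features, 50 examples
true_W = rng.standard_normal((1, 5))
Y = true_W @ X + 0.1 * rng.standard_normal((1, 50))

norms = []
for lambd in (0.0, 10.0, 100.0):
    W = train_linear(X, Y, lambd)
    norms.append(np.linalg.norm(W))
    print("lambd =", lambd, " ||W|| =", norms[-1])
```

The larger lambd is, the more each update pulls W toward zero, which is why L2 regularization is also called weight decay.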

Alex Chiu

Alex's personal blog
