Article title
Content
Full texts:
Identifiers
Title variants
Publication languages
Abstracts
Self-supervised monocular depth estimation has been widely applied in autonomous driving and automated guided vehicles. Compared with alternative methods, it offers the advantages of low cost and an extended effective range. However, devices with limited computing resources, such as automated guided vehicles, struggle to run state-of-the-art large model architectures. In recent years, researchers have recognized this issue and endeavored to reduce model size; lightweight model techniques aim to decrease the number of parameters while maintaining satisfactory performance. In this paper, to enhance model performance in lightweight scenarios, a novel approach encompassing three key aspects is proposed: (1) using LeakyReLU so that more neurons contribute to the manifold representation; (2) employing large convolution kernels for improved edge recognition in lightweight models; (3) applying channel grouping and shuffling to maximize model efficiency. Experimental results demonstrate that the proposed method achieves satisfactory outcomes on the KITTI and Make3D benchmarks with only 1.6M trainable parameters, a 27% reduction compared with Lite-Mono-tiny, the previously smallest model for monocular depth estimation.
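The abstract names three lightweight-design techniques. Below is a minimal PyTorch sketch of a hypothetical building block (here called LightweightBlock) that combines all three: a grouped large-kernel convolution, a LeakyReLU activation, and a ShuffleNet-style channel shuffle [18]. The block layout, kernel size, and group count are illustrative assumptions for this sketch, not the architecture published in the paper.

```python
# Illustrative sketch only: combines the three ideas named in the abstract
# (LeakyReLU, large-kernel convolution, channel grouping and shuffling).
# Layer sizes and structure are assumptions, not the authors' published model.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups, as in ShuffleNet [18]."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel dims
    return x.view(n, c, h, w)                 # flatten back to (n, c, h, w)


class LightweightBlock(nn.Module):
    """Grouped large-kernel conv -> BN -> LeakyReLU -> channel shuffle."""

    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 7):
        super().__init__()
        self.groups = groups
        # Grouped convolution cuts parameters by a factor of `groups`;
        # the large kernel widens the receptive field for edge recognition.
        self.conv = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=groups, bias=False,
        )
        self.bn = nn.BatchNorm2d(channels)
        # LeakyReLU keeps a small gradient for negative inputs, so fewer
        # neurons go permanently inactive than with a plain ReLU.
        self.act = nn.LeakyReLU(negative_slope=0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.bn(self.conv(x)))
        # Shuffling lets information flow between the conv groups.
        return channel_shuffle(out, self.groups)


# Usage example:
block = LightweightBlock(channels=32)
y = block(torch.randn(1, 32, 64, 64))  # -> torch.Size([1, 32, 64, 64])
```

In this sketch, the channel shuffle is what keeps grouped convolutions cheap without isolating information inside each group, and LeakyReLU's small negative slope keeps gradients flowing through neurons that a plain ReLU would zero out, which matters more when a model has few parameters to spare.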
Keywords
Publisher
Year
Volume
Pages
191–205
Physical description
Bibliography: 46 items, figures
Authors
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
Bibliography
- [1] Szmuc, T., Mrówka, R., Brańka, M., Ficoń, J., Pieta, P., A Novel Method for Fast Generation of 3D Objects from Multiple Depth Sensors., Journal of Artificial Intelligence and Soft Computing Research, 2023, 13(2): 95-105.
- [2] Martin-Gomez, A., Li, H., Song, T., Yang, S., Wang, G., Ding, H., Navab, N., Zhao, Z., Armand, M., STTAR: surgical tool tracking using off-the-shelf augmented reality head-mounted displays., IEEE Transactions on Visualization and Computer Graphics, 2023, 1-16.
- [3] Rodrigues, R.T., Miraldo, P., Dimarogonas, D.V., Aguiar, A.P., A framework for depth estimation and relative localization of ground robots using computer vision., IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, 3719-3724.
- [4] Silva, R., Cielniak, G., Gao, J., Leaving the Lines Behind: Vision-Based Crop Row Exit for Agricultural Robot Navigation., arXiv preprint arXiv:2306.05869, 2023.
- [5] Sharma, A., Nett, R., Ventura, J., Unsupervised learning of depth and ego-motion from cylindrical panoramic video with applications for virtual reality., International Journal of Semantic Computing, 2020, 14(3): 333-356.
- [6] Rasla, A., Beyeler, M., The relative importance of depth cues and semantic edges for indoor mobility using simulated prosthetic vision in immersive virtual reality., Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 2022, 1-11.
- [7] Patakin, N., Vorontsova, A., Artemyev, M., Konushin, A., Single-stage 3D geometry-preserving depth estimation model training on dataset mixtures with uncalibrated stereo data., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 1705-1714.
- [8] Peng, R., Wang, R., Wang, Z., Lai, Y., Wang, R., Rethinking depth estimation for multi-view stereo: A unified representation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 8645-8654.
- [9] Choe, J., Joo, K., Imtiaz, T., Kweon, I.S., Volumetric propagation network: Stereo-lidar fusion for long-range depth estimation., IEEE Robotics and Automation Letters, 2021, 6(3): 4672-4679.
- [10] Hirschmuller, H., Accurate and efficient stereo processing by semi-global matching and mutual information., IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, 2: 807-814.
- [11] Chang, J.-R., Chen, Y.-S., Pyramid stereo matching network., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 5410-5418.
- [12] Liu, P., King, I., Lyu, M.R., Xu, J., Flow2Stereo: Effective self-supervised learning of optical flow and stereo matching., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 6648-6657.
- [13] Ullman, S., The interpretation of structure from motion., Proceedings of the Royal Society of London. Series B. Biological Sciences, 1979, 203(1153): 405–426.
- [14] Zhou, T., Brown, M., Snavely, N., Lowe, D.G., Unsupervised learning of depth and ego-motion from video., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, 1851–1858.
- [15] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J., Digging into self-supervised monocular depth estimation., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 3828–3838.
- [16] Zhou, Z., Fan, X., Shi, P., Xin, Y., R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 12777–12786.
- [17] Zhang, N., Nex, F., Vosselman, G., Kerle, N., Lite-Mono: A lightweight CNN and Transformer architecture for self-supervised monocular depth estimation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 18537–18546.
- [18] Zhang, X., Zhou, X., Lin, M., Sun, J., ShuffleNet: An extremely efficient convolutional neural network for mobile devices., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 6848–6856.
- [19] Eigen, D., Fergus, R., Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, 2650–2658.
- [20] Hui, T.-W., RM-Depth: Unsupervised learning of recurrent monocular depth in dynamic scenes., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 1675–1684.
- [21] Yan, J., Zhao, H., Bu, P., Jin, Y., Channel-wise attention-based network for self-supervised monocular depth estimation., 2021 International Conference on 3D Vision (3DV), 2021, 464–473.
- [22] Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S., MonoViT: Self-supervised monocular depth estimation with a vision transformer., 2022 International Conference on 3D Vision (3DV), 2022, 668–678.
- [23] He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J., RA-Depth: Resolution adaptive self-supervised monocular depth estimation., European Conference on Computer Vision, 2022, 565–581.
- [24] Shim, D., Kim, H.J., SwinDepth: Unsupervised depth estimation using monocular sequences via Swin Transformer and densely cascaded network., arXiv preprint arXiv:2301.06715, 2023.
- [25] Jaderberg, M., Vedaldi, A., Zisserman, A., Speeding up convolutional neural networks with low rank expansions., Proceedings of the British Machine Vision Conference, 2014.
- [26] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., MobileNets: Efficient convolutional neural networks for mobile vision applications., arXiv preprint arXiv:1704.04861, 2017.
- [27] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C., MobileNetV2: Inverted residuals and linear bottlenecks., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 4510–4520.
- [28] Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al., Searching for MobileNetV3., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 1314–1324.
- [29] Ma, N., Zhang, X., Zheng, H.-T., Sun, J., ShuffleNet V2: Practical guidelines for efficient CNN architecture design., Proceedings of the European Conference on Computer Vision, 2018, 116–131.
- [30] Mehta, S., Rastegari, M., MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer., International Conference on Learning Representations, 2022.
- [31] Yang, R., Ma, H., Wu, J., Tang, Y., Xiao, X., Zheng, M., Li, X., ScalableViT: Rethinking the context-oriented generalization of vision transformer., Proceedings of the European Conference on Computer Vision, 2022, 480–496.
- [32] Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z., Mobile-Former: Bridging MobileNet and Transformer., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 5270–5279.
- [33] Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T., Axial attention in multidimensional transformers., arXiv preprint arXiv:1912.12180, 2019.
- [34] Mehta, S., Rastegari, M., Separable self-attention for mobile vision transformers., Transactions on Machine Learning Research, 2022.
- [35] Ronneberger, O., Fischer, P., Brox, T., U-net: Convolutional networks for biomedical image segmentation., Medical Image Computing and Computer-Assisted Intervention–MICCAI, 2015, 234–241.
- [36] Krizhevsky, A., Sutskever, I., Hinton, G.E., ImageNet classification with deep convolutional neural networks., Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
- [37] Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K., Aggregated residual transformations for deep neural networks., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, 1492–1500.
- [38] Glorot, X., Bordes, A., Bengio, Y., Deep sparse rectifier neural networks., Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, 315–323.
- [39] Maas, A.L., Hannun, A.Y., Ng, A.Y., et al., Rectifier nonlinearities improve neural network acoustic models., Proc. ICML, 2013, 30(1): 3.
- [40] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., Image quality assessment: from error visibility to structural similarity., IEEE Transactions on Image Processing, 2004, 13(4): 600–612.
- [41] Girshick, R., Fast R-CNN., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, 1440–1448.
- [42] Zhou, H., Greenwood, D., Taylor, S., Self-supervised monocular depth estimation with internal feature fusion, arXiv preprint arXiv:2110.09482, 2021.
- [43] Geiger, A., Lenz, P., Stiller, C., Urtasun, R., Vision meets robotics: The KITTI dataset., The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
- [44] Saxena, A., Sun, M., Ng, A.Y., Make3D: Learning 3D scene structure from a single still image., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 31(5): 824-840.
- [45] Eigen, D., Puhrsch, C., Fergus, R., Depth map prediction from a single image using a multi-scale deep network., Advances in Neural Information Processing Systems, 2014, 27.
- [46] Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S., Learning depth from monocular videos using direct methods., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 2022–2030.
Notes
Record compiled with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II) programme - module: Science Popularization (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-cc646a9f-fa1b-438e-bc25-aa99be9772f1