Article title
Content
Full texts:
Identifiers
Title variants
Publication languages
Abstracts
Self-supervised monocular depth estimation has been widely applied in autonomous driving and automated guided vehicles. Compared with alternative methods, it offers the advantages of low cost and an extended effective range. However, devices with limited computing resources, such as automated guided vehicles, struggle to run state-of-the-art large model architectures. In recent years, researchers have recognized this issue and endeavored to reduce model size; lightweight model techniques aim to decrease the number of parameters while maintaining satisfactory performance. In this paper, to enhance model performance in lightweight scenarios, a novel approach encompassing three key aspects is proposed: (1) using LeakyReLU so that more neurons contribute to the manifold representation; (2) employing large convolution kernels for improved edge recognition in lightweight models; (3) applying channel grouping and shuffling to maximize model efficiency. Experimental results demonstrate that the proposed method achieves satisfactory outcomes on the KITTI and Make3D benchmarks with only 1.6M trainable parameters, a 27% reduction compared with Lite-Mono-tiny, the previously smallest model for monocular depth estimation.
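The abstract names three lightweight-design techniques. Below is a minimal PyTorch sketch of a hypothetical building block (here called LightweightBlock) that combines all three: a grouped large-kernel convolution, a LeakyReLU activation, and a ShuffleNet-style channel shuffle [18]. The block layout, kernel size, and group count are illustrative assumptions for this sketch, not the architecture published in the paper.

```python
# Illustrative sketch only: combines the three ideas named in the abstract
# (LeakyReLU, large-kernel convolution, channel grouping and shuffling).
# Layer sizes and structure are assumptions, not the authors' published model.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups, as in ShuffleNet [18]."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel dims
    return x.view(n, c, h, w)                 # flatten back to (n, c, h, w)


class LightweightBlock(nn.Module):
    """Grouped large-kernel conv -> BN -> LeakyReLU -> channel shuffle."""

    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 7):
        super().__init__()
        self.groups = groups
        # Grouped convolution cuts parameters by a factor of `groups`;
        # the large kernel widens the receptive field for edge recognition.
        self.conv = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=groups, bias=False,
        )
        self.bn = nn.BatchNorm2d(channels)
        # LeakyReLU keeps a small gradient for negative inputs, so fewer
        # neurons go permanently inactive than with a plain ReLU.
        self.act = nn.LeakyReLU(negative_slope=0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.bn(self.conv(x)))
        # Shuffling lets information flow between the conv groups.
        return channel_shuffle(out, self.groups)


# Usage example:
block = LightweightBlock(channels=32)
y = block(torch.randn(1, 32, 64, 64))  # -> torch.Size([1, 32, 64, 64])
```

In this sketch, the channel shuffle is what keeps grouped convolutions cheap without isolating information inside each group, and LeakyReLU's small negative slope keeps gradients flowing through neurons that a plain ReLU would zero out, which matters more when a model has few parameters to spare.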
Keywords
Publisher
Year
Volume
Pages
191–205
Physical description
Bibliography: 46 items, figures
Authors
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
author
- College of Electronic and Information Engineering, Wuyi University, Jiangmen, Guangdong, China
Bibliography
- [1] Szmuc, T., Mrówka, R., Brańka, M., Ficoń, J., Pieta, P., A Novel Method for Fast Generation of 3D Objects from Multiple Depth Sensors., Journal of Artificial Intelligence and Soft Computing Research, 2023, 13(2): 95-105.
- [2] Martin-Gomez, A., Li, H., Song, T., Yang, S., Wang, G., Ding, H., Navab, N., Zhao, Z., Armand, M., STTAR: surgical tool tracking using off-the-shelf augmented reality head-mounted displays., IEEE Transactions on Visualization and Computer Graphics, 2023, 1-16.
- [3] Rodrigues, R.T., Miraldo, P., Dimarogonas, D.V., Aguiar, A.P., A framework for depth estimation and relative localization of ground robots using computer vision., IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, 3719-3724.
- [4] Silva, R., Cielniak, G., Gao, J., Leaving the Lines Behind: Vision-Based Crop Row Exit for Agricultural Robot Navigation., arXiv preprint arXiv:2306.05869, 2023.
- [5] Sharma, A., Nett, R., Ventura, J., Unsupervised learning of depth and ego-motion from cylindrical panoramic video with applications for virtual reality., International Journal of Semantic Computing, 2020, 14(3): 333-356.
- [6] Rasla, A., Beyeler, M., The relative importance of depth cues and semantic edges for indoor mobility using simulated prosthetic vision in immersive virtual reality., Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 2022, 1-11.
- [7] Patakin, N., Vorontsova, A., Artemyev, M., Konushin, A., Single-stage 3D geometry-preserving depth estimation model training on dataset mixtures with uncalibrated stereo data., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 1705-1714.
- [8] Peng, R., Wang, R., Wang, Z., Lai, Y., Wang, R., Rethinking depth estimation for multi-view stereo: A unified representation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 8645-8654.
- [9] Choe, J., Joo, K., Imtiaz, T., Kweon, I.S., Volumetric propagation network: Stereo-lidar fusion for long-range depth estimation., IEEE Robotics and Automation Letters, 2021, 6(3): 4672-4679.
- [10] Hirschmuller, H., Accurate and efficient stereo processing by semi-global matching and mutual information., IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, 2: 807-814.
- [11] Chang, J.-R., Chen, Y.-S., Pyramid stereo matching network., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 5410-5418.
- [12] Liu, P., King, I., Lyu, M.R., Xu, J., Flow2Stereo: Effective self-supervised learning of optical flow and stereo matching., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 6648-6657.
- [13] Ullman, S., The interpretation of structure from motion., Proceedings of the Royal Society of London. Series B. Biological Sciences, 1979, 203(1153): 405–426.
- [14] Zhou, T., Brown, M., Snavely, N., Lowe, D.G., Unsupervised learning of depth and ego-motion from video., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, 1851–1858.
- [15] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J., Digging into self-supervised monocular depth estimation., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 3828–3838.
- [16] Zhou, Z., Fan, X., Shi, P., Xin, Y., R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 12777–12786.
- [17] Zhang, N., Nex, F., Vosselman, G., Kerle, N., Lite-Mono: A lightweight CNN and Transformer architecture for self-supervised monocular depth estimation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 18537–18546.
- [18] Zhang, X., Zhou, X., Lin, M., Sun, J., ShuffleNet: An extremely efficient convolutional neural network for mobile devices., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 6848–6856.
- [19] Eigen, D., Fergus, R., Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, 2650–2658.
- [20] Hui, T.-W., RM-Depth: Unsupervised learning of recurrent monocular depth in dynamic scenes., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 1675–1684.
- [21] Yan, J., Zhao, H., Bu, P., Jin, Y., Channel-wise attention-based network for self-supervised monocular depth estimation., 2021 International Conference on 3D Vision (3DV), 2021, 464–473.
- [22] Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S., MonoViT: Self-supervised monocular depth estimation with a vision transformer., 2022 International Conference on 3D Vision (3DV), 2022, 668–678.
- [23] He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J., RA-Depth: Resolution adaptive self-supervised monocular depth estimation., European Conference on Computer Vision, 2022, 565–581.
- [24] Shim, D., Kim, H.J., SwinDepth: Unsupervised depth estimation using monocular sequences via Swin Transformer and densely cascaded network., arXiv preprint arXiv:2301.06715, 2023.
- [25] Jaderberg, M., Vedaldi, A., Zisserman, A., Speeding up convolutional neural networks with low rank expansions., Proceedings of the British Machine Vision Conference, 2014.
- [26] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., MobileNets: Efficient convolutional neural networks for mobile vision applications., arXiv preprint arXiv:1704.04861, 2017.
- [27] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C., MobileNetV2: Inverted residuals and linear bottlenecks., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 4510–4520.
- [28] Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al., Searching for MobileNetV3., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 1314–1324.
- [29] Ma, N., Zhang, X., Zheng, H.-T., Sun, J., ShuffleNet V2: Practical guidelines for efficient CNN architecture design., Proceedings of the European Conference on Computer Vision, 2018, 116–131.
- [30] Mehta, S., Rastegari, M., MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer., International Conference on Learning Representations, 2022.
- [31] Yang, R., Ma, H., Wu, J., Tang, Y., Xiao, X., Zheng, M., Li, X., ScalableViT: Rethinking the context-oriented generalization of vision transformer., Proceedings of the European Conference on Computer Vision, 2022, 480–496.
- [32] Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z., Mobile-Former: Bridging MobileNet and Transformer., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 5270–5279.
- [33] Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T., Axial attention in multidimensional transformers., arXiv preprint arXiv:1912.12180, 2019.
- [34] Mehta, S., Rastegari, M., Separable self-attention for mobile vision transformers., Transactions on Machine Learning Research, 2022.
- [35] Ronneberger, O., Fischer, P., Brox, T., U-net: Convolutional networks for biomedical image segmentation., Medical Image Computing and Computer-Assisted Intervention–MICCAI, 2015, 234–241.
- [36] Krizhevsky, A., Sutskever, I., Hinton, G.E., ImageNet classification with deep convolutional neural networks., Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
- [37] Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K., Aggregated residual transformations for deep neural networks., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, 1492–1500.
- [38] Glorot, X., Bordes, A., Bengio, Y., Deep sparse rectifier neural networks., Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, 315–323.
- [39] Maas, A.L., Hannun, A.Y., Ng, A.Y., et al., Rectifier nonlinearities improve neural network acoustic models., Proc. ICML, 2013, 30(1): 3.
- [40] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., Image quality assessment: from error visibility to structural similarity., IEEE Transactions on Image Processing, 2004, 13(4): 600–612.
- [41] Girshick, R., Fast R-CNN., Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, 1440–1448.
- [42] Zhou, H., Greenwood, D., Taylor, S., Self-supervised monocular depth estimation with internal feature fusion, arXiv preprint arXiv:2110.09482, 2021.
- [43] Geiger, A., Lenz, P., Stiller, C., Urtasun, R., Vision meets robotics: The KITTI dataset., The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
- [44] Saxena, A., Sun, M., Ng, A.Y., Make3D: Learning 3D scene structure from a single still image., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 31(5): 824-840.
- [45] Eigen, D., Puhrsch, C., Fergus, R., Depth map prediction from a single image using a multi-scale deep network., Advances in Neural Information Processing Systems, 2014, 27.
- [46] Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S., Learning depth from monocular videos using direct methods., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 2022–2030.
Notes
Record compiled with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II) programme - module: Science Popularization (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-cc646a9f-fa1b-438e-bc25-aa99be9772f1