Bug 1837645 - ceph device get-health-metrics does not work when smartctl command throws non-zero error code
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z1
Target Release: 4.1
Assignee: Neha Ojha
QA Contact: Manohar Murthy
URL:
Whiteboard:
Duplicates: 1840272
Depends On:
Blocks: 1816167
 
Reported: 2020-05-19 18:11 UTC by Veera Raghava Reddy
Modified: 2023-10-06 20:12 UTC (History)

Fixed In Version: ceph-14.2.8-68.el8cp, ceph-14.2.8-68.el7cp
Doc Type: Bug Fix
Doc Text:
.Health metrics are correctly reported when `smartctl` exits with a non-zero error code
Previously, the `ceph device get-health-metrics` command could fail to report metrics if `smartctl` exited with a non-zero error code, even though running `smartctl` directly reported the correct information. In this case, a JSON error was reported instead. In {storage-product} 4.1z1, the `ceph device get-health-metrics` command reports metrics even if `smartctl` exits with a non-zero error code, as long as `smartctl` itself reports correct information.
Clone Of:
Environment:
Last Closed: 2020-07-20 14:21:03 UTC
Embargoed:


Attachments
smartctl output for AVAGO drive - Not supporting smart format (1.91 KB, text/plain)
2020-07-01 07:50 UTC, Veera Raghava Reddy
smartctl output for Seagate drive - Supporting smart format (17.65 KB, text/plain)
2020-07-01 07:51 UTC, Veera Raghava Reddy
smartctl output for Micron drive - Supporting smart format (6.27 KB, text/plain)
2020-07-01 07:51 UTC, Veera Raghava Reddy


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 44210 0 None None None 2020-05-19 18:18:11 UTC
Github ceph ceph pull 33421 0 None closed nautilus: common/blkdev: fix some problems with smart scraping 2020-12-30 07:37:55 UTC
Red Hat Issue Tracker RHCEPH-7656 0 None None None 2023-10-06 20:12:08 UTC
Red Hat Product Errata RHSA-2020:3003 0 None None None 2020-07-20 14:21:28 UTC

Description Veera Raghava Reddy 2020-05-19 18:11:07 UTC
Description of problem:
The ceph device get-health-metrics CLI shows a "smartctl returned invalid JSON" error, even though the smartctl command returns the appropriate device metrics when run independently.

Tracker BZ for upstream issue - https://tracker.ceph.com/issues/44210


Version-Release number of selected component (if applicable):
RHCS 4.1 [ceph version 14.2.8-47.el7cp (8d24dfe40524f948afd782e14dc63a0d0cacb28b) nautilus (stable)]

How reproducible:
Reproduced multiple times with Device on mero002 and mero007 RHCS OSD nodes.

Steps to Reproduce:
1. Install RHCS 4.1
2. Enable device metrics monitoring [ceph device monitoring on]
3. List Device health metrics [ceph device get-health-metrics <Device-ID>, ceph device query-daemon-health-metrics <OSD-ID>]

Actual results:
The device entry shows an error message instead of metrics:
"smartctl returned invalid JSON"

Expected results:
The output should show health metrics for the drive in JSON format.



Additional info:
I came across this issue while testing multiple scenarios for smartmontools BZ 1814082.

[root@extensa003 ~]# ceph -v
ceph version 14.2.8-47.el7cp (8d24dfe40524f948afd782e14dc63a0d0cacb28b)
nautilus (stable)


[root@extensa003 ~]# ceph device ls | grep -i AVAGO_SMC3108_00be3fb415dfd1fc2200d36c23800403
AVAGO_SMC3108_00be3fb415dfd1fc2200d36c23800403    mero007:sdb    osd.34

*Querying device health metrics from extensa003 [root/r] returns the error "smartctl returned invalid JSON"*

[root@extensa003 ~]# ceph device get-health-metrics AVAGO_SMC3108_00be3fb415dfd1fc2200d36c23800403
{
     "20200518-125545": {
         "dev": "/dev/sdb",
         "error": "smartctl returned invalid JSON",
         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231",
         "nvme_smart_health_information_add_log_error_code": -22,
         "nvme_vendor": "avago"
     },
     "20200519-000940": {
         "dev": "/dev/sdb",
         "error": "smartctl returned invalid JSON",
         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231",
         "nvme_smart_health_information_add_log_error_code": -22,
         "nvme_vendor": "avago"
     }
}


*smartctl command shows output on mero007 [root/r]*
*smartctl 7.0

**********************
Hi Veera, searching for 'ceph smartctl invalid json' turns up an
upstream bug for this:

https://tracker.ceph.com/issues/44210

The fix isn't in 4.1 - it was added upstream in 14.2.9. Could you add a
BZ for tracking this? We can backport the fix downstream for 4.1z1.

Thanks,
Josh
**********************
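
The behaviour this bug asks for can be illustrated with a minimal Python sketch. It is not the change from the upstream pull request (the real scraping code lives in Ceph itself); the function names `parse_smartctl_output` and `scrape_smart` are hypothetical. The only point it makes is that the decision to report an error depends on whether stdout parses as JSON, not on the exit status of smartctl.

```python
import json
import subprocess

INVALID_JSON = {"error": "smartctl returned invalid JSON"}


def parse_smartctl_output(stdout: str) -> dict:
    """Return smartctl's JSON report, or an error record if stdout is not a JSON object."""
    try:
        report = json.loads(stdout)
    except json.JSONDecodeError:
        return dict(INVALID_JSON)
    if not isinstance(report, dict):
        return dict(INVALID_JSON)
    return report


def scrape_smart(dev: str) -> dict:
    """Run smartctl on dev and keep its report even when the exit status is non-zero."""
    # check=False: a non-zero exit status must not discard otherwise valid output.
    proc = subprocess.run(
        ["smartctl", "-a", "--json", dev],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_smartctl_output(proc.stdout)
```

With this shape, a report that carries `"exit_status": 4` is still returned as metrics, and only unparsable output produces the error record.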

Comment 1 Neha Ojha 2020-06-02 20:20:15 UTC
*** Bug 1840272 has been marked as a duplicate of this bug. ***

Comment 6 Veera Raghava Reddy 2020-06-30 07:11:49 UTC
Hi Josh,
Still see json error with 4.1z1 build.


[root@extensa003 ~]# ceph -v
ceph version 14.2.8-79.el7cp (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)


[root@extensa003 ~]# ceph device get-health-metrics AVAGO_SMC3108_00f11da416efd1fc2200d36c23800403
{
    "20200629-001032": {
        "dev": "/dev/sdg", 
        "error": "smartctl returned invalid JSON", 
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
        "nvme_smart_health_information_add_log_error_code": -22, 
        "nvme_vendor": "avago"
    }, 
    "20200630-000823": {
        "dev": "/dev/sdg", 
        "error": "smartctl returned invalid JSON", 
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
        "nvme_smart_health_information_add_log_error_code": -22, 
        "nvme_vendor": "avago"
    }
}


smartctl output -
[root@mero007 ~]# smartctl -a --json "/dev/sdg"
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      0
    ],
    "svn_revision": "4883",
    "platform_info": "x86_64-linux-3.10.0-1062.7.1.el7.x86_64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-a",
      "--json",
      "/dev/sdg"
    ],
    "exit_status": 4
  },
  "device": {
    "name": "/dev/sdg",
    "info_name": "/dev/sdg",
    "type": "scsi",
    "protocol": "SCSI"
  },
  "vendor": "AVAGO",
  "product": "SMC3108",
  "model_name": "AVAGO SMC3108",
  "revision": "4.68",
  "scsi_version": "SPC-3",
  "user_capacity": {
    "blocks": 15626993664,
    "bytes": 8001020755968
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "serial_number": "00f11da416efd1fc2200d36c23800403",
  "device_type": {
    "scsi_value": 0,
    "name": "disk"
  },
  "local_time": {
    "time_t": 1593500798,
    "asctime": "Tue Jun 30 07:06:38 2020 UTC"
  },
  "temperature": {
    "current": 0,
    "drive_trip": 0
  }
}

Comment 7 Josh Durgin 2020-07-01 02:52:20 UTC
(In reply to Veera Raghava Reddy from comment #6)
> Hi Josh,
> Still see json error with 4.1z1 build.

Hi Veera, thanks for checking. Fortunately, the extra output gives more detail now.

> [root@extensa003 ~]# ceph -v
> ceph version 14.2.8-79.el7cp (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)
> 
> 
> [root@extensa003 ~]# ceph device get-health-metrics AVAGO_SMC3108_00f11da416efd1fc2200d36c23800403
> {
>     "20200629-001032": {
>         "dev": "/dev/sdg", 
>         "error": "smartctl returned invalid JSON", 

This error message is misleading; I filed https://tracker.ceph.com/issues/46285 to fix it.

>         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
>         "nvme_smart_health_information_add_log_error_code": -22, 

These nvme errors occur because the nvme CLI command does not support AVAGO disks. They are non-fatal, though: the disk prediction module simply ignores the missing vendor-specific information.

>         "nvme_vendor": "avago"
>     }, 
>     "20200630-000823": {
>         "dev": "/dev/sdg", 
>         "error": "smartctl returned invalid JSON", 
>         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
>         "nvme_smart_health_information_add_log_error_code": -22, 
>         "nvme_vendor": "avago"
>     }
> }
> 
> 
> smartctl output -
> [root@mero007 ~]# smartctl -a --json "/dev/sdg"
> {
>   "json_format_version": [
>     1,
>     0
>   ],
>   "smartctl": {
>     "version": [
>       7,
>       0
>     ],
>     "svn_revision": "4883",
>     "platform_info": "x86_64-linux-3.10.0-1062.7.1.el7.x86_64",
>     "build_info": "(local build)",
>     "argv": [
>       "smartctl",
>       "-a",
>       "--json",
>       "/dev/sdg"
>     ],
>     "exit_status": 4

Exit status 4 for smartctl means "Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure"

Is there another type of disk you can try this on? It seems these disks do not support SMART reporting.
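
The smartctl exit status is a bitmask, so a value such as 4 can be decoded bit by bit. The following Python sketch shows one way to do that. The wording for bit 2 is the text quoted above; the other descriptions are paraphrased from the smartctl man page and should be checked against the installed version. The helper name `decode_exit_status` is hypothetical.

```python
from typing import List

# Meaning of each bit in the smartctl exit status.
# Bit 2 is quoted in this comment; the rest are paraphrased from the man page.
SMARTCTL_EXIT_BITS = {
    0: "Command line did not parse",
    1: "Device open failed, or the device did not return identify data",
    2: "Some SMART or other ATA command to the disk failed, "
       "or there was a checksum error in a SMART data structure",
    3: "SMART status check returned 'DISK FAILING'",
    4: "Prefail attributes found at or below threshold",
    5: "SMART status OK, but attributes were at or below threshold in the past",
    6: "The device error log contains records of errors",
    7: "The device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> List[str]:
    """Return a description for every bit that is set in a smartctl exit status."""
    return [
        "Bit %d: %s" % (bit, meaning)
        for bit, meaning in sorted(SMARTCTL_EXIT_BITS.items())
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    for line in decode_exit_status(4):
        print(line)
```

For the value in this report, `decode_exit_status(4)` returns a single entry for bit 2, which matches the reading that a SMART command to the AVAGO device failed while the rest of the report was still produced.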

Comment 8 Veera Raghava Reddy 2020-07-01 07:50:41 UTC
Created attachment 1699430 [details]
smartctl output for AVAGO drive - Not supporting smart format

Comment 9 Veera Raghava Reddy 2020-07-01 07:51:26 UTC
Created attachment 1699431 [details]
smartctl output for Seagate drive - Supporting smart format

Comment 10 Veera Raghava Reddy 2020-07-01 07:51:57 UTC
Created attachment 1699432 [details]
smartctl output for Micron drive - Supporting smart format

Comment 11 Veera Raghava Reddy 2020-07-01 07:54:39 UTC
Hi Josh,

I attached the JSON output for the following drives:
AVAGO - not supporting SMART format
Seagate - SMART status "pass"
Micron - SMART status "pass"


When I reran the command with the fix, the output was in the long format.


{
    "20200630-000823": {
        "dev": "/dev/sdg", 
        "error": "smartctl returned invalid JSON", 
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
        "nvme_smart_health_information_add_log_error_code": -22, 
        "nvme_vendor": "avago"
    }, 
    "20200701-000912": {
        "device": {
            "info_name": "/dev/sdg", 
            "name": "/dev/sdg", 
            "protocol": "SCSI", 
            "type": "scsi"
        }, 
        "device_type": {
            "name": "disk", 
            "scsi_value": 0
        }, 
        "json_format_version": [
            1, 
            0
        ], 
        "local_time": {
            "asctime": "Wed Jul  1 00:07:00 2020 UTC", 
            "time_t": 1593562020
        }, 
        "logical_block_size": 512, 
        "model_name": "AVAGO SMC3108", 
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
        "nvme_smart_health_information_add_log_error_code": -22, 
        "nvme_vendor": "avago", 
        "physical_block_size": 4096, 
        "product": "SMC3108", 
        "revision": "4.68", 
        "scsi_version": "SPC-3", 
        "serial_number": "00f11da416efd1fc2200d36c23800403", 
        "smartctl": {
            "argv": [
                "smartctl", 
                "-a", 
                "--json", 
                "/dev/sdg"
            ], 
            "build_info": "(local build)", 
            "exit_status": 4, 
            "platform_info": "x86_64-linux-3.10.0-1062.7.1.el7.x86_64", 
            "svn_revision": "4883", 
            "version": [
                7, 
                0
            ]
        }, 
        "temperature": {
            "current": 0, 
            "drive_trip": 0
        }, 
        "user_capacity": {
            "blocks": 15626993664, 
            "bytes": 8001020755968
        }, 
        "vendor": "AVAGO"
    }
}
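
Output like the above mixes scrapes taken before and after the fix, so it can help to summarize each timestamp. The following is a small Python sketch that assumes only the structure visible in this comment: a top-level object keyed by scrape timestamp, where a failed scrape carries an "error" key and a successful one carries the smartctl report. The function name `summarize_scrapes` is hypothetical, and the sample is trimmed from the output above.

```python
import json


def summarize_scrapes(metrics: dict) -> dict:
    """Map each scrape timestamp to a one-line status."""
    summary = {}
    for stamp in sorted(metrics):
        entry = metrics[stamp]
        if "error" in entry:
            # Scrape failed: keep the message stored by the scraper.
            summary[stamp] = "error: " + entry["error"]
        else:
            # Scrape succeeded: note the exit status smartctl reported, if any.
            status = entry.get("smartctl", {}).get("exit_status")
            summary[stamp] = "report present (smartctl exit_status %s)" % status
    return summary


if __name__ == "__main__":
    # Trimmed sample taken from the output in this comment.
    sample = json.loads("""
    {
        "20200630-000823": {
            "dev": "/dev/sdg",
            "error": "smartctl returned invalid JSON"
        },
        "20200701-000912": {
            "model_name": "AVAGO SMC3108",
            "smartctl": {"exit_status": 4}
        }
    }
    """)
    for stamp, status in summarize_scrapes(sample).items():
        print(stamp, status)
```

In practice the input would be the JSON printed by `ceph device get-health-metrics <Device-ID>`, saved to a file or read from a pipe.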

Comment 12 Veera Raghava Reddy 2020-07-01 07:56:13 UTC
Verifying this BZ, as the JSON error was due to a specific drive [AVAGO] not supporting the SMART format. For other devices, the SMART data is shown as proper JSON output.

Comment 14 errata-xmlrpc 2020-07-20 14:21:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3003

